[PATCH v2 0/2] RDMA/rxe: Fix per-netns UDP tunnel issues.

public inbox for linux-rdma@vger.kernel.org

* [PATCH v2 0/2] RDMA/rxe: Fix per-netns UDP tunnel issues.
@ 2026-04-25  6:04 Kuniyuki Iwashima
  2026-04-25  6:04 ` [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown() Kuniyuki Iwashima
  2026-04-25  6:04 ` [PATCH v2 2/2] RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6() Kuniyuki Iwashima
  0 siblings, 2 replies; 81+ messages in thread
From: Kuniyuki Iwashima @ 2026-04-25  6:04 UTC (permalink / raw)
  To: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky
  Cc: David Ahern, Kuniyuki Iwashima, Kuniyuki Iwashima, linux-rdma

Patch 1 fixes racy allocation/destruction of per-netns UDP
tunnel sockets.

Patch 2 fixes unsafe access to the socket in rxe_find_route6().

Changes:
  v2:
    Patch 1: Set up UDP tunnels in __net_init instead of adding mutex.

  v1: https://lore.kernel.org/all/20260424013759.728288-1-kuniyu@google.com/
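
For readers following the v2 change: moving socket setup into the per-netns
init hook means each namespace creates its sockets once and tears them down
once. The following is a userspace sketch of that control flow, not the
kernel code. struct fake_ns, fake_ns_init() and fake_ns_exit() are
hypothetical names, and the err4/err6 parameters simulate the result of
creating the IPv4 and IPv6 sockets.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-netns state; not the kernel struct. */
struct fake_ns {
	int sk4_open;	/* 1 while the IPv4 tunnel socket exists */
	int sk6_open;	/* 1 while the IPv6 tunnel socket exists */
};

/*
 * Model of a per-netns init hook. err4 and err6 are the simulated
 * results of socket creation: 0 on success, -errno on failure.
 */
int fake_ns_init(struct fake_ns *ns, int err4, int err6)
{
	assert(ns != NULL);
	ns->sk4_open = 0;
	ns->sk6_open = 0;

	if (err4)
		return err4;		/* nothing to unwind yet */
	ns->sk4_open = 1;

	if (err6 == -EAFNOSUPPORT)
		return 0;		/* no IPv6: IPv4-only is accepted */
	if (err6) {
		ns->sk4_open = 0;	/* unwind the IPv4 socket */
		return err6;
	}
	ns->sk6_open = 1;
	return 0;
}

/* Model of the exit hook: it runs once per netns, so no refcounting. */
void fake_ns_exit(struct fake_ns *ns)
{
	assert(ns != NULL);
	ns->sk4_open = 0;
	ns->sk6_open = 0;
}
```

In this model an -EAFNOSUPPORT result for IPv6 leaves the namespace usable
with IPv4 only, while any other IPv6 error undoes the IPv4 step before
failing.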


Kuniyuki Iwashima (2):
  RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown().
  RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6().

 drivers/infiniband/sw/rxe/rxe.c     |   6 --
 drivers/infiniband/sw/rxe/rxe_net.c | 137 +++-------------------------
 drivers/infiniband/sw/rxe/rxe_net.h |   5 +-
 drivers/infiniband/sw/rxe/rxe_ns.c  |  97 ++++++++------------
 drivers/infiniband/sw/rxe/rxe_ns.h  |   1 -
 5 files changed, 56 insertions(+), 190 deletions(-)

-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog



* [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown().
  2026-04-25  6:04 [PATCH v2 0/2] RDMA/rxe: Fix per-netns UDP tunnel issues Kuniyuki Iwashima
@ 2026-04-25  6:04 ` Kuniyuki Iwashima
  2026-04-25 15:47   ` David Ahern
  2026-04-25 21:25   ` Zhu Yanjun
  2026-04-25  6:04 ` [PATCH v2 2/2] RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6() Kuniyuki Iwashima
  1 sibling, 2 replies; 81+ messages in thread
From: Kuniyuki Iwashima @ 2026-04-25  6:04 UTC (permalink / raw)
  To: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky
  Cc: David Ahern, Kuniyuki Iwashima, Kuniyuki Iwashima, linux-rdma,
	syzbot+d8f76778263ab65c2b21

syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0]

The problem is that ->newlink() and ->dellink() can be called
concurrently with no synchronisation, leading to an sk leak,
a double free, etc.

We defer UDP tunnel allocation to the first device creation,
but this would requrie per-netns locking.

Let's allocate UDP tunnels in the __net_init hook.

Now extra sock_hold() and __sock_put() are no longer needed.

Note that rxe_ns_pernet_sk6() is broken and will be fixed
in the following patch.
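
To illustrate the kind of problem described above, here is a userspace
sketch that replays one hand-picked interleaving of a lookup-then-release
sequence. It is a simplified model with hypothetical names (fake_sock,
fake_ns, lookup(), put()), not the driver code, and it does not claim to
be the exact interleaving syzbot hit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the socket and the per-netns state. */
struct fake_sock {
	int refcnt;	/* models the socket reference count */
	int released;	/* how many times the socket was torn down */
};

struct fake_ns {
	struct fake_sock *sk4;	/* models the per-netns socket pointer */
};

/* Step 1 of the modelled delete path: look up the per-netns socket. */
static struct fake_sock *lookup(struct fake_ns *ns)
{
	return ns->sk4;
}

/* Step 2: drop a reference, or tear down when at the base count. */
static void put(struct fake_ns *ns, struct fake_sock *sk, int base)
{
	if (sk->refcnt > base) {
		sk->refcnt--;
	} else {
		sk->released++;
		ns->sk4 = NULL;
	}
}

/*
 * Two callers both finish step 1 before either runs step 2.
 * Returns the number of teardowns performed on the one socket.
 */
int teardowns_when_interleaved(void)
{
	struct fake_sock sk = { .refcnt = 1, .released = 0 };
	struct fake_ns ns = { .sk4 = &sk };
	struct fake_sock *a = lookup(&ns);
	struct fake_sock *b = lookup(&ns);

	assert(a && b);
	put(&ns, a, 1);
	put(&ns, b, 1);	/* stale pointer: second teardown */
	return sk.released;
}

/* Same two callers, but each runs step 1 and step 2 back to back. */
int teardowns_when_serialised(void)
{
	struct fake_sock sk = { .refcnt = 1, .released = 0 };
	struct fake_ns ns = { .sk4 = &sk };
	struct fake_sock *p;

	p = lookup(&ns);
	if (p)
		put(&ns, p, 1);
	p = lookup(&ns);	/* sees NULL after the first teardown */
	if (p)
		put(&ns, p, 1);
	return sk.released;
}
```

The interleaved variant tears the socket down twice, the serialised one
only once. Creating and destroying the sockets in the per-netns hooks
removes the lookup-then-release sequence from the link paths altogether.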

[0]:
Oops: general protection fault, probably for non-canonical address 0xdffffc000000000d: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f]
CPU: 3 UID: 0 PID: 12652 Comm: syz.7.1709 Tainted: G             L      syzkaller #0 PREEMPT(full)
Tainted: [L]=SOFTLOCKUP
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:kernel_sock_shutdown+0x47/0x70 net/socket.c:3785
Code: fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 33 48 b8 00 00 00 00 00 fc ff df 4c 8b 63 20 49 8d 7c 24 68 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 1a 49 8b 44 24 68 89 ee 48 89 df 5b 5d 41 5c e9 46
RSP: 0018:ffffc9000566f180 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff888058587240 RCX: 0000000000000000
RDX: 000000000000000d RSI: ffffffff895ced12 RDI: 0000000000000068
RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed1006d98945
R10: ffff888036cc4a2b R11: 0000003683c25c00 R12: 0000000000000000
R13: ffff88805c998000 R14: 0000000000000002 R15: 0000000000000018
FS:  00007f1306d976c0(0000) GS:ffff8880d65db000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1306d97d58 CR3: 00000000404f1000 CR4: 0000000000352ef0
DR0: ffffffffffffffff DR1: 00000000000001f8 DR2: 0000000000000002
DR3: ffffffffefffff15 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 udp_tunnel_sock_release+0x68/0x80 net/ipv4/udp_tunnel_core.c:202
 rxe_release_udp_tunnel drivers/infiniband/sw/rxe/rxe_net.c:294 [inline]
 rxe_sock_put+0xae/0x130 drivers/infiniband/sw/rxe/rxe_net.c:639
 rxe_net_del+0x83/0x120 drivers/infiniband/sw/rxe/rxe_net.c:660
 rxe_dellink+0x15/0x20 drivers/infiniband/sw/rxe/rxe.c:254
 nldev_dellink+0x289/0x3c0 drivers/infiniband/core/nldev.c:1849
 rdma_nl_rcv_msg+0x392/0x6f0 drivers/infiniband/core/netlink.c:195
 rdma_nl_rcv_skb.constprop.0.isra.0+0x2cb/0x410 drivers/infiniband/core/netlink.c:239
 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
 netlink_unicast+0x585/0x850 net/netlink/af_netlink.c:1344
 netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
 sock_sendmsg_nosec net/socket.c:787 [inline]
 __sock_sendmsg net/socket.c:802 [inline]
 ____sys_sendmsg+0x9e1/0xb70 net/socket.c:2698
 ___sys_sendmsg+0x190/0x1e0 net/socket.c:2752
 __sys_sendmsg+0x170/0x220 net/socket.c:2784
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x10b/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f1305f9c819
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f1306d97028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f1306216090 RCX: 00007f1305f9c819
RDX: 0000000000000000 RSI: 00002000000002c0 RDI: 0000000000000003
RBP: 00007f1306032c91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f1306216128 R14: 00007f1306216090 R15: 00007ffd8ecad288
 </TASK>
Modules linked in:

Fixes: f1327abd6abe ("RDMA/rxe: Support RDMA link creation and destruction per net namespace")
Reported-by: syzbot+d8f76778263ab65c2b21@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69ea344f.a00a0220.17a17.0040.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v2: Set up UDP tunnels in __net_init instead of adding mutex.
v1: https://lore.kernel.org/all/20260424013759.728288-1-kuniyu@google.com/
---
 drivers/infiniband/sw/rxe/rxe.c     |   6 --
 drivers/infiniband/sw/rxe/rxe_net.c | 126 ++--------------------------
 drivers/infiniband/sw/rxe/rxe_net.h |   5 +-
 drivers/infiniband/sw/rxe/rxe_ns.c  |  90 +++++++++-----------
 drivers/infiniband/sw/rxe/rxe_ns.h  |   1 -
 5 files changed, 47 insertions(+), 181 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index b0714f9abe3d..111ba4e57261 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -236,10 +236,6 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 		goto err;
 	}
 
-	err = rxe_net_init(ndev);
-	if (err)
-		return err;
-
 	err = rxe_net_add(ibdev_name, ndev);
 	if (err) {
 		rxe_err("failed to add %s\n", ndev->name);
@@ -251,8 +247,6 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
 
 static int rxe_dellink(struct ib_device *dev)
 {
-	rxe_net_del(dev);
-
 	return 0;
 }
 
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 50a2cb5405e2..9080d4c893a1 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -256,8 +256,8 @@ static int rxe_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 	return 0;
 }
 
-static struct socket *rxe_setup_udp_tunnel(struct net *net, __be16 port,
-					   bool ipv6)
+struct sock *rxe_setup_udp_tunnel(struct net *net, __be16 port,
+				  bool ipv6)
 {
 	int err;
 	struct socket *sock;
@@ -285,13 +285,12 @@ static struct socket *rxe_setup_udp_tunnel(struct net *net, __be16 port,
 	/* Setup UDP tunnel */
 	setup_udp_tunnel_sock(net, sock, &tnl_cfg);
 
-	return sock;
+	return sock->sk;
 }
 
-static void rxe_release_udp_tunnel(struct socket *sk)
+void rxe_release_udp_tunnel(struct sock *sk)
 {
-	if (sk)
-		udp_tunnel_sock_release(sk);
+	udp_tunnel_sock_release(sk->sk_socket);
 }
 
 static void prepare_udp_hdr(struct sk_buff *skb, __be16 src_port,
@@ -629,43 +628,6 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
 	return 0;
 }
 
-static void rxe_sock_put(struct sock *sk,
-					void (*set_sk)(struct net *, struct sock *),
-					struct net *net)
-{
-	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL) {
-		__sock_put(sk);
-	} else {
-		rxe_release_udp_tunnel(sk->sk_socket);
-		sk = NULL;
-		set_sk(net, sk);
-	}
-}
-
-void rxe_net_del(struct ib_device *dev)
-{
-	struct rxe_dev *rxe = container_of(dev, struct rxe_dev, ib_dev);
-	struct net_device *ndev;
-	struct sock *sk;
-	struct net *net;
-
-	ndev = rxe_ib_device_get_netdev(&rxe->ib_dev);
-	if (!ndev)
-		return;
-
-	net = dev_net(ndev);
-
-	sk = rxe_ns_pernet_sk4(net);
-	if (sk)
-		rxe_sock_put(sk, rxe_ns_pernet_set_sk4, net);
-
-	sk = rxe_ns_pernet_sk6(net);
-	if (sk)
-		rxe_sock_put(sk, rxe_ns_pernet_set_sk6, net);
-
-	dev_put(ndev);
-}
-
 static void rxe_port_event(struct rxe_dev *rxe,
 			   enum ib_event_type event)
 {
@@ -722,7 +684,6 @@ static int rxe_notify(struct notifier_block *not_blk,
 	switch (event) {
 	case NETDEV_UNREGISTER:
 		ib_unregister_device_queued(&rxe->ib_dev);
-		rxe_net_del(&rxe->ib_dev);
 		break;
 	case NETDEV_CHANGEMTU:
 		rxe_dbg_dev(rxe, "%s changed mtu to %d\n", ndev->name, ndev->mtu);
@@ -752,56 +713,6 @@ static struct notifier_block rxe_net_notifier = {
 	.notifier_call = rxe_notify,
 };
 
-static int rxe_net_ipv4_init(struct net *net)
-{
-	struct sock *sk;
-	struct socket *sock;
-
-	sk = rxe_ns_pernet_sk4(net);
-	if (sk) {
-		sock_hold(sk);
-		return 0;
-	}
-
-	sock = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), false);
-	if (IS_ERR(sock)) {
-		pr_err("Failed to create IPv4 UDP tunnel\n");
-		return -1;
-	}
-	rxe_ns_pernet_set_sk4(net, sock->sk);
-
-	return 0;
-}
-
-static int rxe_net_ipv6_init(struct net *net)
-{
-#if IS_ENABLED(CONFIG_IPV6)
-	struct sock *sk;
-	struct socket *sock;
-
-	sk = rxe_ns_pernet_sk6(net);
-	if (sk) {
-		sock_hold(sk);
-		return 0;
-	}
-
-	sock = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), true);
-	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
-		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
-		return 0;
-	}
-
-	if (IS_ERR(sock)) {
-		pr_err("Failed to create IPv6 UDP tunnel\n");
-		return -1;
-	}
-
-	rxe_ns_pernet_set_sk6(net, sock->sk);
-
-#endif
-	return 0;
-}
-
 int rxe_register_notifier(void)
 {
 	int err;
@@ -819,30 +730,3 @@ void rxe_net_exit(void)
 {
 	unregister_netdevice_notifier(&rxe_net_notifier);
 }
-
-int rxe_net_init(struct net_device *ndev)
-{
-	struct net *net;
-	struct sock *sk;
-	int err;
-
-	net = dev_net(ndev);
-
-	err = rxe_net_ipv4_init(net);
-	if (err)
-		return err;
-
-	err = rxe_net_ipv6_init(net);
-	if (err)
-		goto err_out;
-
-	return 0;
-
-err_out:
-	/* If ipv6 error, release ipv4 resource */
-	sk = rxe_ns_pernet_sk4(net);
-	if (sk)
-		rxe_sock_put(sk, rxe_ns_pernet_set_sk4, net);
-
-	return err;
-}
diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
index 56249677d692..592b0e577f32 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.h
+++ b/drivers/infiniband/sw/rxe/rxe_net.h
@@ -11,11 +11,12 @@
 #include <net/if_inet6.h>
 #include <linux/module.h>
 
+struct sock *rxe_setup_udp_tunnel(struct net *net, __be16 port, bool ipv6);
+void rxe_release_udp_tunnel(struct sock *sk);
+
 int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
-void rxe_net_del(struct ib_device *dev);
 
 int rxe_register_notifier(void);
-int rxe_net_init(struct net_device *ndev);
 void rxe_net_exit(void);
 
 #endif /* RXE_NET_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_ns.c b/drivers/infiniband/sw/rxe/rxe_ns.c
index 8b9d734229b2..06eb2e2387a1 100644
--- a/drivers/infiniband/sw/rxe/rxe_ns.c
+++ b/drivers/infiniband/sw/rxe/rxe_ns.c
@@ -7,8 +7,10 @@
 #include <linux/skbuff.h>
 #include <linux/pid_namespace.h>
 #include <net/udp_tunnel.h>
+#include <rdma/ib_verbs.h>
 
 #include "rxe_ns.h"
+#include "rxe_net.h"
 
 /*
  * Per network namespace data
@@ -23,40 +25,54 @@ struct rxe_ns_sock {
  */
 static unsigned int rxe_pernet_id;
 
-/*
- * Called for every existing and added network namespaces
- */
-static int rxe_ns_init(struct net *net)
+static __net_init int rxe_ns_init(struct net *net)
 {
-	/* defer socket create in the namespace to the first
-	 * device create.
-	 */
+	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
+	struct sock *sk;
+	int err = 0;
+
+	sk = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), false);
+	if (IS_ERR(sk)) {
+		err = PTR_ERR(sk);
+		goto out;
+	}
+
+	RCU_INIT_POINTER(ns_sk->rxe_sk4, sk);
+
+#if IS_ENABLED(CONFIG_IPV6)
+	sk = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), true);
+	if (IS_ERR(sk)) {
+		err = PTR_ERR(sk);
+		if (err == -EAFNOSUPPORT) {
+			err = 0;
+			goto out;
+		}
+
+		sk = rcu_dereference_protected(ns_sk->rxe_sk4, 1);
+		rxe_release_udp_tunnel(sk);
+		goto out;
+	}
 
-	return 0;
+	RCU_INIT_POINTER(ns_sk->rxe_sk6, sk);
+#endif
+out:
+	return err;
 }
 
-static void rxe_ns_exit(struct net *net)
+static __net_exit void rxe_ns_exit(struct net *net)
 {
-	/* called when the network namespace is removed
-	 */
 	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
 	struct sock *sk;
 
-	rcu_read_lock();
-	sk = rcu_dereference(ns_sk->rxe_sk4);
-	rcu_read_unlock();
-	if (sk) {
-		rcu_assign_pointer(ns_sk->rxe_sk4, NULL);
-		udp_tunnel_sock_release(sk->sk_socket);
-	}
+	sk = rcu_dereference_protected(ns_sk->rxe_sk4, 1);
+	RCU_INIT_POINTER(ns_sk->rxe_sk4, NULL);
+	rxe_release_udp_tunnel(sk);
 
 #if IS_ENABLED(CONFIG_IPV6)
-	rcu_read_lock();
-	sk = rcu_dereference(ns_sk->rxe_sk6);
-	rcu_read_unlock();
+	sk = rcu_dereference_protected(ns_sk->rxe_sk6, 1);
 	if (sk) {
-		rcu_assign_pointer(ns_sk->rxe_sk6, NULL);
-		udp_tunnel_sock_release(sk->sk_socket);
+		RCU_INIT_POINTER(ns_sk->rxe_sk6, NULL);
+		rxe_release_udp_tunnel(sk);
 	}
 #endif
 }
@@ -71,26 +87,6 @@ static struct pernet_operations rxe_net_ops = {
 	.size = sizeof(struct rxe_ns_sock),
 };
 
-struct sock *rxe_ns_pernet_sk4(struct net *net)
-{
-	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
-	struct sock *sk;
-
-	rcu_read_lock();
-	sk = rcu_dereference(ns_sk->rxe_sk4);
-	rcu_read_unlock();
-
-	return sk;
-}
-
-void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk)
-{
-	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
-
-	rcu_assign_pointer(ns_sk->rxe_sk4, sk);
-	synchronize_rcu();
-}
-
 #if IS_ENABLED(CONFIG_IPV6)
 struct sock *rxe_ns_pernet_sk6(struct net *net)
 {
@@ -103,14 +99,6 @@ struct sock *rxe_ns_pernet_sk6(struct net *net)
 
 	return sk;
 }
-
-void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk)
-{
-	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
-
-	rcu_assign_pointer(ns_sk->rxe_sk6, sk);
-	synchronize_rcu();
-}
 #endif /* IPV6 */
 
 int rxe_namespace_init(void)
diff --git a/drivers/infiniband/sw/rxe/rxe_ns.h b/drivers/infiniband/sw/rxe/rxe_ns.h
index 4da2709e6b71..7f48d624fa05 100644
--- a/drivers/infiniband/sw/rxe/rxe_ns.h
+++ b/drivers/infiniband/sw/rxe/rxe_ns.h
@@ -3,7 +3,6 @@
 #ifndef RXE_NS_H
 #define RXE_NS_H
 
-struct sock *rxe_ns_pernet_sk4(struct net *net);
 void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk);
 
 #if IS_ENABLED(CONFIG_IPV6)
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog



* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown().
  2026-04-25  6:04 ` [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown() Kuniyuki Iwashima
@ 2026-04-25 15:47   ` David Ahern
  2026-04-25 20:55     ` Kuniyuki Iwashima
  2026-04-25 21:25   ` Zhu Yanjun
  1 sibling, 1 reply; 81+ messages in thread
From: David Ahern @ 2026-04-25 15:47 UTC (permalink / raw)
  To: Kuniyuki Iwashima, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky
  Cc: Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21

On 4/25/26 12:04 AM, Kuniyuki Iwashima wrote:
> syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0]
> 
> The problem is ->newlink() and ->dellink() can be called
> concurrently with no synchronisation, leading sk leak or
> double free, etc.

My expectation is that the synchronization is managed by:

rdma_nl_rcv_msg()
    down_read(&rdma_nl_types[index].sem);

as the RTNL equivalent.

> 
> We defer UDP tunnel allocation to the first device creation,
> but this would requrie per-netns locking.

typo: s/requrie/require/




* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown().
  2026-04-25 15:47   ` David Ahern
@ 2026-04-25 20:55     ` Kuniyuki Iwashima
  2026-04-26 16:40       ` David Ahern
  0 siblings, 1 reply; 81+ messages in thread
From: Kuniyuki Iwashima @ 2026-04-25 20:55 UTC (permalink / raw)
  To: David Ahern
  Cc: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima,
	linux-rdma, syzbot+d8f76778263ab65c2b21

On Sat, Apr 25, 2026 at 8:47 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 4/25/26 12:04 AM, Kuniyuki Iwashima wrote:
> > syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0]
> >
> > The problem is ->newlink() and ->dellink() can be called
> > concurrently with no synchronisation, leading sk leak or
> > double free, etc.
>
> My expectation is that the synchronization is managed by:
>
> rdma_nl_rcv_msg()
>     down_read(&rdma_nl_types[index].sem);
>
> as the RTNL equivalent.

but down_read() is a shared lock and does not work as
per-netns exclusive locking.


>
> >
> > We defer UDP tunnel allocation to the first device creation,
> > but this would requrie per-netns locking.
>
> typo: s/requrie/require/

Will fix.

Thanks


* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown().
  2026-04-25 20:55     ` Kuniyuki Iwashima
@ 2026-04-26 16:40       ` David Ahern
  0 siblings, 0 replies; 81+ messages in thread
From: David Ahern @ 2026-04-26 16:40 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Kuniyuki Iwashima,
	linux-rdma, syzbot+d8f76778263ab65c2b21

On 4/25/26 2:55 PM, Kuniyuki Iwashima wrote:
> On Sat, Apr 25, 2026 at 8:47 AM David Ahern <dsahern@kernel.org> wrote:
>>
>> On 4/25/26 12:04 AM, Kuniyuki Iwashima wrote:
>>> syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0]
>>>
>>> The problem is ->newlink() and ->dellink() can be called
>>> concurrently with no synchronisation, leading sk leak or
>>> double free, etc.
>>
>> My expectation is that the synchronization is managed by:
>>
>> rdma_nl_rcv_msg()
>>     down_read(&rdma_nl_types[index].sem);
>>
>> as the RTNL equivalent.
> 
> but down_read() is a shared lock and does not work as
> per-netns exclusive locking.

Well, that's a face palm moment for me; skimmed that code a bit too
quickly when reviewing the rxe patches.



* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown().
  2026-04-25  6:04 ` [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown() Kuniyuki Iwashima
  2026-04-25 15:47   ` David Ahern
@ 2026-04-25 21:25   ` Zhu Yanjun
  2026-04-26 16:42     ` David Ahern
  1 sibling, 1 reply; 81+ messages in thread
From: Zhu Yanjun @ 2026-04-25 21:25 UTC (permalink / raw)
  To: Kuniyuki Iwashima, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky,
	yanjun.zhu@linux.dev
  Cc: David Ahern, Kuniyuki Iwashima, linux-rdma,
	syzbot+d8f76778263ab65c2b21

On 2026/4/24 23:04, Kuniyuki Iwashima wrote:
> syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0]
> 
> The problem is ->newlink() and ->dellink() can be called
> concurrently with no synchronisation, leading sk leak or
> double free, etc.
> 
> We defer UDP tunnel allocation to the first device creation,
> but this would requrie per-netns locking.
> 
> Let's allocate UDP tunnels in the __init_net hook.
> 
> Now extra sock_hold() and __sock_put() are no longer needed.
> 
> Note that rxe_ns_pernet_sk6() is broken and will be fixed
> in the following patch.
> 
> [0]:
> Oops: general protection fault, probably for non-canonical address 0xdffffc000000000d: 0000 [#1] SMP KASAN NOPTI
> KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f]
> CPU: 3 UID: 0 PID: 12652 Comm: syz.7.1709 Tainted: G             L      syzkaller #0 PREEMPT(full)
> Tainted: [L]=SOFTLOCKUP
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> RIP: 0010:kernel_sock_shutdown+0x47/0x70 net/socket.c:3785
> Code: fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 33 48 b8 00 00 00 00 00 fc ff df 4c 8b 63 20 49 8d 7c 24 68 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 1a 49 8b 44 24 68 89 ee 48 89 df 5b 5d 41 5c e9 46
> RSP: 0018:ffffc9000566f180 EFLAGS: 00010202
> RAX: dffffc0000000000 RBX: ffff888058587240 RCX: 0000000000000000
> RDX: 000000000000000d RSI: ffffffff895ced12 RDI: 0000000000000068
> RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed1006d98945
> R10: ffff888036cc4a2b R11: 0000003683c25c00 R12: 0000000000000000
> R13: ffff88805c998000 R14: 0000000000000002 R15: 0000000000000018
> FS:  00007f1306d976c0(0000) GS:ffff8880d65db000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f1306d97d58 CR3: 00000000404f1000 CR4: 0000000000352ef0
> DR0: ffffffffffffffff DR1: 00000000000001f8 DR2: 0000000000000002
> DR3: ffffffffefffff15 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Call Trace:
>   <TASK>
>   udp_tunnel_sock_release+0x68/0x80 net/ipv4/udp_tunnel_core.c:202
>   rxe_release_udp_tunnel drivers/infiniband/sw/rxe/rxe_net.c:294 [inline]
>   rxe_sock_put+0xae/0x130 drivers/infiniband/sw/rxe/rxe_net.c:639
>   rxe_net_del+0x83/0x120 drivers/infiniband/sw/rxe/rxe_net.c:660
>   rxe_dellink+0x15/0x20 drivers/infiniband/sw/rxe/rxe.c:254
>   nldev_dellink+0x289/0x3c0 drivers/infiniband/core/nldev.c:1849
>   rdma_nl_rcv_msg+0x392/0x6f0 drivers/infiniband/core/netlink.c:195
>   rdma_nl_rcv_skb.constprop.0.isra.0+0x2cb/0x410 drivers/infiniband/core/netlink.c:239
>   netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
>   netlink_unicast+0x585/0x850 net/netlink/af_netlink.c:1344
>   netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
>   sock_sendmsg_nosec net/socket.c:787 [inline]
>   __sock_sendmsg net/socket.c:802 [inline]
>   ____sys_sendmsg+0x9e1/0xb70 net/socket.c:2698
>   ___sys_sendmsg+0x190/0x1e0 net/socket.c:2752
>   __sys_sendmsg+0x170/0x220 net/socket.c:2784
>   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
>   do_syscall_64+0x10b/0xf80 arch/x86/entry/syscall_64.c:94
>   entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7f1305f9c819
> Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f1306d97028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> RAX: ffffffffffffffda RBX: 00007f1306216090 RCX: 00007f1305f9c819
> RDX: 0000000000000000 RSI: 00002000000002c0 RDI: 0000000000000003
> RBP: 00007f1306032c91 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> R13: 00007f1306216128 R14: 00007f1306216090 R15: 00007ffd8ecad288
>   </TASK>
> Modules linked in:
> 
> Fixes: f1327abd6abe ("RDMA/rxe: Support RDMA link creation and destruction per net namespace")
> Reported-by: syzbot+d8f76778263ab65c2b21@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/69ea344f.a00a0220.17a17.0040.GAE@google.com/
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---
> v2: Set up UDP tunnels in __net_init instead of adding mutex.
> v1: https://lore.kernel.org/all/20260424013759.728288-1-kuniyu@google.com/
> ---
>   drivers/infiniband/sw/rxe/rxe.c     |   6 --
>   drivers/infiniband/sw/rxe/rxe_net.c | 126 ++--------------------------
>   drivers/infiniband/sw/rxe/rxe_net.h |   5 +-
>   drivers/infiniband/sw/rxe/rxe_ns.c  |  90 +++++++++-----------
>   drivers/infiniband/sw/rxe/rxe_ns.h  |   1 -
>   5 files changed, 47 insertions(+), 181 deletions(-)
> 

All the commits are functionally correct, but I noticed some regressions 
when running:
make -C tools/testing/selftests/rdma/ TARGET=rdma run_tests

After applying this commit, the UDP port 4791 starts listening in both 
init_net and all other net namespaces as soon as modprobe rdma_rxe is 
executed. This breaks tests that expect the port to be unoccupied until 
a device is actually created.

I have a workaround to fix the current test failures. However, I think 
we should also add a new test case to the RDMA selftests. This test 
should explicitly verify that port 4791 is correctly listening in all 
namespaces immediately after the module is loaded, reflecting the new 
architectural change.

The workaround:
----------------------------Begin--------------------------------------
diff --git a/tools/testing/selftests/rdma/rxe_ipv6.sh b/tools/testing/selftests/rdma/rxe_ipv6.sh
index b7059bfd6d7c..e808d9829752 100755
--- a/tools/testing/selftests/rdma/rxe_ipv6.sh
+++ b/tools/testing/selftests/rdma/rxe_ipv6.sh
@@ -56,8 +56,8 @@ echo "Verified: Port $PORT is active."
  echo "Deleting RDMA link..."
  ip netns exec "$NS_NAME" rdma link del "$RXE_NAME"

-if ip netns exec "$NS_NAME" ss -Hul6n sport = :$PORT | grep -q ":$PORT"; then
-    echo "Error: UDP port $PORT still active after link deletion."
+if ! ip netns exec "$NS_NAME" ss -Hul6n sport = :$PORT | grep -q ":$PORT"; then
+    echo "Error: UDP port $PORT is not active after link deletion."
      exit 1
  fi
  echo "Verified: Port $PORT closed successfully."
diff --git a/tools/testing/selftests/rdma/rxe_socket_with_netns.sh b/tools/testing/selftests/rdma/rxe_socket_with_netns.sh
index 002e5098f751..0ad4a8d4d755 100755
--- a/tools/testing/selftests/rdma/rxe_socket_with_netns.sh
+++ b/tools/testing/selftests/rdma/rxe_socket_with_netns.sh
@@ -68,8 +68,8 @@ echo "Deleting rxe0..."
  rdma link del rxe0

  # Port should now be gone
-if ss -Huln sport = :$PORT | grep -q ":$PORT"; then
-    echo "Error: UDP port $PORT still exists after all links deleted"
+if ! ss -Huln sport = :$PORT | grep -q ":$PORT"; then
+    echo "Error: UDP port $PORT does not exist after all links deleted"
      exit 1
  fi

diff --git a/tools/testing/selftests/rdma/rxe_test_NETDEV_UNREGISTER.sh b/tools/testing/selftests/rdma/rxe_test_NETDEV_UNREGISTER.sh
index 021ca451499d..07efe9ea6a71 100755
--- a/tools/testing/selftests/rdma/rxe_test_NETDEV_UNREGISTER.sh
+++ b/tools/testing/selftests/rdma/rxe_test_NETDEV_UNREGISTER.sh
@@ -55,8 +55,8 @@ if rdma link show "$RXE_NAME" 2>/dev/null; then
      exit 1
  fi

-if ss -Huln sport == :$RDMA_PORT | grep -q ":$RDMA_PORT"; then
-    echo "Error: UDP port $RDMA_PORT still listening after netdev removal."
+if ! ss -Huln sport == :$RDMA_PORT | grep -q ":$RDMA_PORT"; then
+    echo "Error: UDP port $RDMA_PORT is not listening after netdev removal."
      exit 1
  fi
---------------------------End-----------------------------------------

The new testcase is like the following:

----------------------------Begin--------------------------------------
# Load module
modprobe rdma_rxe

# Check init_net
ss -lnup | grep -q ":4791" || echo "Test Failed: Port 4791 not listening in init_net"

# Check in a new namespace
ip netns add rxe_test
ip netns exec rxe_test ss -lnup | grep -q ":4791" || echo "Test Failed: Port 4791 not listening in netns"
---------------------------End-----------------------------------------
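
The same check could also be done without parsing ss output, by trying
to bind the port and looking for EADDRINUSE. The sketch below is only an
illustration with hypothetical helper names. It covers UDP over IPv4 on
the wildcard address in the caller's namespace, and it assumes the
in-kernel socket is bound without address reuse, which is not verified
here.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Bind a UDP/IPv4 socket to INADDR_ANY:port (host byte order, 0 means
 * any free port). Returns the fd, or -1 with errno set.
 */
int udp4_bind(unsigned short port)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		int saved = errno;	/* close() may clobber errno */

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}

/* Returns 1 if the port is in use, 0 if it is free, -1 if unknown. */
int udp4_port_in_use(unsigned short port)
{
	int fd = udp4_bind(port);

	if (fd >= 0) {
		close(fd);
		return 0;
	}
	return errno == EADDRINUSE ? 1 : -1;
}

/* Returns the local port (host byte order) of fd, or 0 on error. */
unsigned short udp4_local_port(int fd)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);

	assert(fd >= 0);
	if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
		return 0;
	return ntohs(addr.sin_port);
}
```

Under those assumptions, udp4_port_in_use(4791) returning 1 right after
the module is loaded would correspond to the ss check above for
init_net. Running the same program through ip netns exec would cover a
new namespace.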

Just my two cents. How to fix it is up to you.

For now I am fine with these two commits. David Ahern, Leon, and
Jason, please comment.

Thanks a lot.

Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Zhu Yanjun


> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index b0714f9abe3d..111ba4e57261 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -236,10 +236,6 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
>   		goto err;
>   	}
>   
> -	err = rxe_net_init(ndev);
> -	if (err)
> -		return err;
> -
>   	err = rxe_net_add(ibdev_name, ndev);
>   	if (err) {
>   		rxe_err("failed to add %s\n", ndev->name);
> @@ -251,8 +247,6 @@ static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
>   
>   static int rxe_dellink(struct ib_device *dev)
>   {
> -	rxe_net_del(dev);
> -
>   	return 0;
>   }
>   
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index 50a2cb5405e2..9080d4c893a1 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -256,8 +256,8 @@ static int rxe_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
>   	return 0;
>   }
>   
> -static struct socket *rxe_setup_udp_tunnel(struct net *net, __be16 port,
> -					   bool ipv6)
> +struct sock *rxe_setup_udp_tunnel(struct net *net, __be16 port,
> +				  bool ipv6)
>   {
>   	int err;
>   	struct socket *sock;
> @@ -285,13 +285,12 @@ static struct socket *rxe_setup_udp_tunnel(struct net *net, __be16 port,
>   	/* Setup UDP tunnel */
>   	setup_udp_tunnel_sock(net, sock, &tnl_cfg);
>   
> -	return sock;
> +	return sock->sk;
>   }
>   
> -static void rxe_release_udp_tunnel(struct socket *sk)
> +void rxe_release_udp_tunnel(struct sock *sk)
>   {
> -	if (sk)
> -		udp_tunnel_sock_release(sk);
> +	udp_tunnel_sock_release(sk->sk_socket);
>   }
>   
>   static void prepare_udp_hdr(struct sk_buff *skb, __be16 src_port,
> @@ -629,43 +628,6 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
>   	return 0;
>   }
>   
> -static void rxe_sock_put(struct sock *sk,
> -					void (*set_sk)(struct net *, struct sock *),
> -					struct net *net)
> -{
> -	if (refcount_read(&sk->sk_refcnt) > SK_REF_FOR_TUNNEL) {
> -		__sock_put(sk);
> -	} else {
> -		rxe_release_udp_tunnel(sk->sk_socket);
> -		sk = NULL;
> -		set_sk(net, sk);
> -	}
> -}
> -
> -void rxe_net_del(struct ib_device *dev)
> -{
> -	struct rxe_dev *rxe = container_of(dev, struct rxe_dev, ib_dev);
> -	struct net_device *ndev;
> -	struct sock *sk;
> -	struct net *net;
> -
> -	ndev = rxe_ib_device_get_netdev(&rxe->ib_dev);
> -	if (!ndev)
> -		return;
> -
> -	net = dev_net(ndev);
> -
> -	sk = rxe_ns_pernet_sk4(net);
> -	if (sk)
> -		rxe_sock_put(sk, rxe_ns_pernet_set_sk4, net);
> -
> -	sk = rxe_ns_pernet_sk6(net);
> -	if (sk)
> -		rxe_sock_put(sk, rxe_ns_pernet_set_sk6, net);
> -
> -	dev_put(ndev);
> -}
> -
>   static void rxe_port_event(struct rxe_dev *rxe,
>   			   enum ib_event_type event)
>   {
> @@ -722,7 +684,6 @@ static int rxe_notify(struct notifier_block *not_blk,
>   	switch (event) {
>   	case NETDEV_UNREGISTER:
>   		ib_unregister_device_queued(&rxe->ib_dev);
> -		rxe_net_del(&rxe->ib_dev);
>   		break;
>   	case NETDEV_CHANGEMTU:
>   		rxe_dbg_dev(rxe, "%s changed mtu to %d\n", ndev->name, ndev->mtu);
> @@ -752,56 +713,6 @@ static struct notifier_block rxe_net_notifier = {
>   	.notifier_call = rxe_notify,
>   };
>   
> -static int rxe_net_ipv4_init(struct net *net)
> -{
> -	struct sock *sk;
> -	struct socket *sock;
> -
> -	sk = rxe_ns_pernet_sk4(net);
> -	if (sk) {
> -		sock_hold(sk);
> -		return 0;
> -	}
> -
> -	sock = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), false);
> -	if (IS_ERR(sock)) {
> -		pr_err("Failed to create IPv4 UDP tunnel\n");
> -		return -1;
> -	}
> -	rxe_ns_pernet_set_sk4(net, sock->sk);
> -
> -	return 0;
> -}
> -
> -static int rxe_net_ipv6_init(struct net *net)
> -{
> -#if IS_ENABLED(CONFIG_IPV6)
> -	struct sock *sk;
> -	struct socket *sock;
> -
> -	sk = rxe_ns_pernet_sk6(net);
> -	if (sk) {
> -		sock_hold(sk);
> -		return 0;
> -	}
> -
> -	sock = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), true);
> -	if (PTR_ERR(sock) == -EAFNOSUPPORT) {
> -		pr_warn("IPv6 is not supported, can not create a UDPv6 socket\n");
> -		return 0;
> -	}
> -
> -	if (IS_ERR(sock)) {
> -		pr_err("Failed to create IPv6 UDP tunnel\n");
> -		return -1;
> -	}
> -
> -	rxe_ns_pernet_set_sk6(net, sock->sk);
> -
> -#endif
> -	return 0;
> -}
> -
>   int rxe_register_notifier(void)
>   {
>   	int err;
> @@ -819,30 +730,3 @@ void rxe_net_exit(void)
>   {
>   	unregister_netdevice_notifier(&rxe_net_notifier);
>   }
> -
> -int rxe_net_init(struct net_device *ndev)
> -{
> -	struct net *net;
> -	struct sock *sk;
> -	int err;
> -
> -	net = dev_net(ndev);
> -
> -	err = rxe_net_ipv4_init(net);
> -	if (err)
> -		return err;
> -
> -	err = rxe_net_ipv6_init(net);
> -	if (err)
> -		goto err_out;
> -
> -	return 0;
> -
> -err_out:
> -	/* If ipv6 error, release ipv4 resource */
> -	sk = rxe_ns_pernet_sk4(net);
> -	if (sk)
> -		rxe_sock_put(sk, rxe_ns_pernet_set_sk4, net);
> -
> -	return err;
> -}
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.h b/drivers/infiniband/sw/rxe/rxe_net.h
> index 56249677d692..592b0e577f32 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.h
> +++ b/drivers/infiniband/sw/rxe/rxe_net.h
> @@ -11,11 +11,12 @@
>   #include <net/if_inet6.h>
>   #include <linux/module.h>
>   
> +struct sock *rxe_setup_udp_tunnel(struct net *net, __be16 port, bool ipv6);
> +void rxe_release_udp_tunnel(struct sock *sk);
> +
>   int rxe_net_add(const char *ibdev_name, struct net_device *ndev);
> -void rxe_net_del(struct ib_device *dev);
>   
>   int rxe_register_notifier(void);
> -int rxe_net_init(struct net_device *ndev);
>   void rxe_net_exit(void);
>   
>   #endif /* RXE_NET_H */
> diff --git a/drivers/infiniband/sw/rxe/rxe_ns.c b/drivers/infiniband/sw/rxe/rxe_ns.c
> index 8b9d734229b2..06eb2e2387a1 100644
> --- a/drivers/infiniband/sw/rxe/rxe_ns.c
> +++ b/drivers/infiniband/sw/rxe/rxe_ns.c
> @@ -7,8 +7,10 @@
>   #include <linux/skbuff.h>
>   #include <linux/pid_namespace.h>
>   #include <net/udp_tunnel.h>
> +#include <rdma/ib_verbs.h>
>   
>   #include "rxe_ns.h"
> +#include "rxe_net.h"
>   
>   /*
>    * Per network namespace data
> @@ -23,40 +25,54 @@ struct rxe_ns_sock {
>    */
>   static unsigned int rxe_pernet_id;
>   
> -/*
> - * Called for every existing and added network namespaces
> - */
> -static int rxe_ns_init(struct net *net)
> +static __net_init int rxe_ns_init(struct net *net)
>   {
> -	/* defer socket create in the namespace to the first
> -	 * device create.
> -	 */
> +	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> +	struct sock *sk;
> +	int err = 0;
> +
> +	sk = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), false);
> +	if (IS_ERR(sk)) {
> +		err = PTR_ERR(sk);
> +		goto out;
> +	}
> +
> +	RCU_INIT_POINTER(ns_sk->rxe_sk4, sk);
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +	sk = rxe_setup_udp_tunnel(net, htons(ROCE_V2_UDP_DPORT), true);
> +	if (IS_ERR(sk)) {
> +		err = PTR_ERR(sk);
> +		if (err == -EAFNOSUPPORT) {
> +			err = 0;
> +			goto out;
> +		}
> +
> +		sk = rcu_dereference_protected(ns_sk->rxe_sk4, 1);
> +		rxe_release_udp_tunnel(sk);
> +		goto out;
> +	}
>   
> -	return 0;
> +	RCU_INIT_POINTER(ns_sk->rxe_sk6, sk);
> +#endif
> +out:
> +	return err;
>   }
>   
> -static void rxe_ns_exit(struct net *net)
> +static __net_exit void rxe_ns_exit(struct net *net)
>   {
> -	/* called when the network namespace is removed
> -	 */
>   	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
>   	struct sock *sk;
>   
> -	rcu_read_lock();
> -	sk = rcu_dereference(ns_sk->rxe_sk4);
> -	rcu_read_unlock();
> -	if (sk) {
> -		rcu_assign_pointer(ns_sk->rxe_sk4, NULL);
> -		udp_tunnel_sock_release(sk->sk_socket);
> -	}
> +	sk = rcu_dereference_protected(ns_sk->rxe_sk4, 1);
> +	RCU_INIT_POINTER(ns_sk->rxe_sk4, NULL);
> +	rxe_release_udp_tunnel(sk);
>   
>   #if IS_ENABLED(CONFIG_IPV6)
> -	rcu_read_lock();
> -	sk = rcu_dereference(ns_sk->rxe_sk6);
> -	rcu_read_unlock();
> +	sk = rcu_dereference_protected(ns_sk->rxe_sk6, 1);
>   	if (sk) {
> -		rcu_assign_pointer(ns_sk->rxe_sk6, NULL);
> -		udp_tunnel_sock_release(sk->sk_socket);
> +		RCU_INIT_POINTER(ns_sk->rxe_sk6, NULL);
> +		rxe_release_udp_tunnel(sk);
>   	}
>   #endif
>   }
> @@ -71,26 +87,6 @@ static struct pernet_operations rxe_net_ops = {
>   	.size = sizeof(struct rxe_ns_sock),
>   };
>   
> -struct sock *rxe_ns_pernet_sk4(struct net *net)
> -{
> -	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> -	struct sock *sk;
> -
> -	rcu_read_lock();
> -	sk = rcu_dereference(ns_sk->rxe_sk4);
> -	rcu_read_unlock();
> -
> -	return sk;
> -}
> -
> -void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk)
> -{
> -	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> -
> -	rcu_assign_pointer(ns_sk->rxe_sk4, sk);
> -	synchronize_rcu();
> -}
> -
>   #if IS_ENABLED(CONFIG_IPV6)
>   struct sock *rxe_ns_pernet_sk6(struct net *net)
>   {
> @@ -103,14 +99,6 @@ struct sock *rxe_ns_pernet_sk6(struct net *net)
>   
>   	return sk;
>   }
> -
> -void rxe_ns_pernet_set_sk6(struct net *net, struct sock *sk)
> -{
> -	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> -
> -	rcu_assign_pointer(ns_sk->rxe_sk6, sk);
> -	synchronize_rcu();
> -}
>   #endif /* IPV6 */
>   
>   int rxe_namespace_init(void)
> diff --git a/drivers/infiniband/sw/rxe/rxe_ns.h b/drivers/infiniband/sw/rxe/rxe_ns.h
> index 4da2709e6b71..7f48d624fa05 100644
> --- a/drivers/infiniband/sw/rxe/rxe_ns.h
> +++ b/drivers/infiniband/sw/rxe/rxe_ns.h
> @@ -3,7 +3,6 @@
>   #ifndef RXE_NS_H
>   #define RXE_NS_H
>   
> -struct sock *rxe_ns_pernet_sk4(struct net *net);
>   void rxe_ns_pernet_set_sk4(struct net *net, struct sock *sk);
>   
>   #if IS_ENABLED(CONFIG_IPV6)


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown().
  2026-04-25 21:25   ` Zhu Yanjun
@ 2026-04-26 16:42     ` David Ahern
  0 siblings, 0 replies; 81+ messages in thread
From: David Ahern @ 2026-04-26 16:42 UTC (permalink / raw)
  To: Zhu Yanjun, Kuniyuki Iwashima, Zhu Yanjun, Jason Gunthorpe,
	Leon Romanovsky
  Cc: Kuniyuki Iwashima, linux-rdma, syzbot+d8f76778263ab65c2b21

On 4/25/26 3:25 PM, Zhu Yanjun wrote:
> On 2026/4/24 23:04, Kuniyuki Iwashima wrote:
>> syzbot reported null-ptr-deref in kernel_sock_shutdown(). [0]
>>
>> The problem is ->newlink() and ->dellink() can be called
>> concurrently with no synchronisation, leading to sk leak or
>> double free, etc.
>>
>> We defer UDP tunnel allocation to the first device creation,
>> but this would require per-netns locking.
>>
>> Let's allocate UDP tunnels in the __net_init hook.
>>
>> Now extra sock_hold() and __sock_put() are no longer needed.
>>
>> Note that rxe_ns_pernet_sk6() is broken and will be fixed
>> in the following patch.
>>
...
> 
> All the commits are functionally correct, but I noticed some regressions
> when running:
> make -C tools/testing/selftests/rdma/ TARGET=rdma run_tests
> 
> After applying this commit, the UDP port 4791 starts listening in both
> init_net and all other net namespaces as soon as modprobe rdma_rxe is
> executed. This breaks tests that expect the port to be unoccupied until
> a device is actually created.

The current behavior, where the port is not opened until an rxe device is
created in that namespace, needs to be kept.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v2 2/2] RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6().
  2026-04-25  6:04 [PATCH v2 0/2] RDMA/rxe: Fix per-netns UDP tunnel issues Kuniyuki Iwashima
  2026-04-25  6:04 ` [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown() Kuniyuki Iwashima
@ 2026-04-25  6:04 ` Kuniyuki Iwashima
  2026-04-25 21:26   ` Zhu Yanjun
  1 sibling, 1 reply; 81+ messages in thread
From: Kuniyuki Iwashima @ 2026-04-25  6:04 UTC (permalink / raw)
  To: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky
  Cc: David Ahern, Kuniyuki Iwashima, Kuniyuki Iwashima, linux-rdma

rxe_ns_pernet_sk6() is fundamentally broken.

The rcu_read_lock() inside it only silences the rcu_dereference() splat.

Once the function returns, the socket is no longer protected, and it may
be freed during ip6_dst_lookup_flow().

Let's call rxe_ns_pernet_sk6() and ip6_dst_lookup_flow()
under RCU.

Fixes: f1327abd6abe ("RDMA/rxe: Support RDMA link creation and destruction per net namespace")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 drivers/infiniband/sw/rxe/rxe_net.c | 11 ++++++++---
 drivers/infiniband/sw/rxe/rxe_ns.c  |  7 +------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 9080d4c893a1..8fca5c24c8b1 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -133,16 +133,21 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
 					 struct in6_addr *saddr,
 					 struct in6_addr *daddr)
 {
-	struct dst_entry *ndst;
+	struct dst_entry *ndst = NULL;
 	struct flowi6 fl6 = {};
+	struct sock *sk;
 
 	fl6.flowi6_oif = ndev->ifindex;
 	memcpy(&fl6.saddr, saddr, sizeof(*saddr));
 	memcpy(&fl6.daddr, daddr, sizeof(*daddr));
 	fl6.flowi6_proto = IPPROTO_UDP;
 
-	ndst = ip6_dst_lookup_flow(net, rxe_ns_pernet_sk6(net), &fl6, NULL);
-	if (IS_ERR(ndst)) {
+	rcu_read_lock();
+	sk = rxe_ns_pernet_sk6(net);
+	if (sk)
+		ndst = ip6_dst_lookup_flow(net, sk, &fl6, NULL);
+	rcu_read_unlock();
+	if (IS_ERR_OR_NULL(ndst)) {
 		rxe_dbg_qp(qp, "no route to %pI6\n", daddr);
 		return NULL;
 	}
diff --git a/drivers/infiniband/sw/rxe/rxe_ns.c b/drivers/infiniband/sw/rxe/rxe_ns.c
index 06eb2e2387a1..ef408ffc0558 100644
--- a/drivers/infiniband/sw/rxe/rxe_ns.c
+++ b/drivers/infiniband/sw/rxe/rxe_ns.c
@@ -91,13 +91,8 @@ static struct pernet_operations rxe_net_ops = {
 struct sock *rxe_ns_pernet_sk6(struct net *net)
 {
 	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
-	struct sock *sk;
-
-	rcu_read_lock();
-	sk = rcu_dereference(ns_sk->rxe_sk6);
-	rcu_read_unlock();
 
-	return sk;
+	return rcu_dereference(ns_sk->rxe_sk6);
 }
 #endif /* IPV6 */
 
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 2/2] RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6().
  2026-04-25  6:04 ` [PATCH v2 2/2] RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6() Kuniyuki Iwashima
@ 2026-04-25 21:26   ` Zhu Yanjun
  0 siblings, 0 replies; 81+ messages in thread
From: Zhu Yanjun @ 2026-04-25 21:26 UTC (permalink / raw)
  To: Kuniyuki Iwashima, Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky,
	yanjun.zhu@linux.dev
  Cc: David Ahern, Kuniyuki Iwashima, linux-rdma

On 2026/4/24 23:04, Kuniyuki Iwashima wrote:
> rxe_ns_pernet_sk6() is fundamentally broken.
> 
> rcu_read_lock() only silences rcu_dereference() splat.
> 
> The returned socket is no longer protected, and it may be
> freed during ip6_dst_lookup_flow().
> 
> Let's call rxe_ns_pernet_sk6() and ip6_dst_lookup_flow()
> under RCU.
> 
> Fixes: f1327abd6abe ("RDMA/rxe: Support RDMA link creation and destruction per net namespace")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>

Thanks a lot. David Ahern, Leon and Jason, please comment.

Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Zhu Yanjun

> ---
>   drivers/infiniband/sw/rxe/rxe_net.c | 11 ++++++++---
>   drivers/infiniband/sw/rxe/rxe_ns.c  |  7 +------
>   2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index 9080d4c893a1..8fca5c24c8b1 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -133,16 +133,21 @@ static struct dst_entry *rxe_find_route6(struct rxe_qp *qp,
>   					 struct in6_addr *saddr,
>   					 struct in6_addr *daddr)
>   {
> -	struct dst_entry *ndst;
> +	struct dst_entry *ndst = NULL;
>   	struct flowi6 fl6 = {};
> +	struct sock *sk;
>   
>   	fl6.flowi6_oif = ndev->ifindex;
>   	memcpy(&fl6.saddr, saddr, sizeof(*saddr));
>   	memcpy(&fl6.daddr, daddr, sizeof(*daddr));
>   	fl6.flowi6_proto = IPPROTO_UDP;
>   
> -	ndst = ip6_dst_lookup_flow(net, rxe_ns_pernet_sk6(net), &fl6, NULL);
> -	if (IS_ERR(ndst)) {
> +	rcu_read_lock();
> +	sk = rxe_ns_pernet_sk6(net);
> +	if (sk)
> +		ndst = ip6_dst_lookup_flow(net, sk, &fl6, NULL);
> +	rcu_read_unlock();
> +	if (IS_ERR_OR_NULL(ndst)) {
>   		rxe_dbg_qp(qp, "no route to %pI6\n", daddr);
>   		return NULL;
>   	}
> diff --git a/drivers/infiniband/sw/rxe/rxe_ns.c b/drivers/infiniband/sw/rxe/rxe_ns.c
> index 06eb2e2387a1..ef408ffc0558 100644
> --- a/drivers/infiniband/sw/rxe/rxe_ns.c
> +++ b/drivers/infiniband/sw/rxe/rxe_ns.c
> @@ -91,13 +91,8 @@ static struct pernet_operations rxe_net_ops = {
>   struct sock *rxe_ns_pernet_sk6(struct net *net)
>   {
>   	struct rxe_ns_sock *ns_sk = net_generic(net, rxe_pernet_id);
> -	struct sock *sk;
> -
> -	rcu_read_lock();
> -	sk = rcu_dereference(ns_sk->rxe_sk6);
> -	rcu_read_unlock();
>   
> -	return sk;
> +	return rcu_dereference(ns_sk->rxe_sk6);
>   }
>   #endif /* IPV6 */
>   


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem
@ 2026-04-11 14:49 Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 01/15] RDMA/core: " Jiri Pirko
                   ` (14 more replies)
  0 siblings, 15 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

This patchset introduces a generic buffer descriptor infrastructure
for passing memory buffers (dma-buf or user VA) to uverbs commands,
and wires it up for CQ and QP creation in the uverbs core, efa, mlx5,
bnxt_re and mlx4 drivers.

Instead of adding per-command UAPI attributes for each new buffer
type, a single UVERBS_ATTR_BUFFERS array attribute carries all buffer
descriptors. Each descriptor specifies a buffer type and is indexed by
per-command slot enums. A consumption check ensures userspace and
driver agree on which buffers are used.

The patchset:
1. Introduces the core ib_umem_list infrastructure and UAPI.
2. Factors out CQ buffer umem processing into a helper.
3. Integrates umem_list into CQ creation, with fallback to existing
   per-attribute path.
4-7. Converts efa, mlx5, bnxt_re and mlx4 to use umem_list for CQ
   buffer.
8. Removes the legacy umem field from struct ib_cq, now that all
   drivers use umem_list for CQ buffer management.
9. Adds a consumption check verifying all umem_list buffers were
   consumed by the driver after CQ creation.
10. Integrates umem_list into QP creation.
11. Converts mlx5 QP creation to use umem_list.
12-15. Extends CQ and QP with doorbell record buffer slots and
   converts mlx5 to use them.
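
The consumption check mentioned above can be illustrated with a small
userspace sketch. This is only a model under stated assumptions: the
slot_tracker struct and the tracker_*() helpers are hypothetical names
invented for illustration, not the kernel API, and slot numbers are
assumed to be smaller than the number of bits in an unsigned long.
It only shows the bitmask idea: every slot userspace provided must also
have been loaded by the driver.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the provided/loaded bookkeeping. */
struct slot_tracker {
	unsigned long provided; /* slots supplied by userspace */
	unsigned long loaded;   /* slots the driver actually picked up */
};

/* Record that userspace supplied a buffer for @slot. */
static inline void tracker_provide(struct slot_tracker *t, unsigned int slot)
{
	t->provided |= 1UL << slot;
}

/* Record that the driver loaded @slot; empty slots are ignored. */
static inline void tracker_load(struct slot_tracker *t, unsigned int slot)
{
	if (t->provided & (1UL << slot))
		t->loaded |= 1UL << slot;
}

/* True when no provided slot was left unused by the driver. */
static inline bool tracker_all_consumed(const struct slot_tracker *t)
{
	return (t->provided & ~t->loaded) == 0;
}
```

With this model, a command that receives buffers for slots 0 and 1 but
whose driver only loads slot 0 fails the check, which is the mismatch
the series wants to report back to userspace.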

Note this reworks the original patchset that tried to handle this:
https://lore.kernel.org/all/20260203085003.71184-1-jiri@resnulli.us/
The code is so different that I'm sending this as a new patchset.

---
v1->v2:
one fix and one rebase, see individual patches for changelog

Jiri Pirko (15):
  RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  RDMA/uverbs: Push out CQ buffer umem processing into a helper
  RDMA/uverbs: Integrate umem_list into CQ creation
  RDMA/efa: Use umem_list for user CQ buffer
  RDMA/mlx5: Use umem_list for user CQ buffer
  RDMA/bnxt_re: Use umem_list for user CQ buffer
  RDMA/mlx4: Use umem_list for user CQ buffer
  RDMA/uverbs: Remove legacy umem field from struct ib_cq
  RDMA/uverbs: Verify all umem_list buffers are consumed after CQ
    creation
  RDMA/uverbs: Integrate umem_list into QP creation
  RDMA/mlx5: Use umem_list for QP buffers in create_qp
  RDMA/uverbs: Add doorbell record buffer slot to CQ umem_list
  RDMA/mlx5: Use umem_list for CQ doorbell record
  RDMA/uverbs: Add doorbell record buffer slot to QP umem_list
  RDMA/mlx5: Use umem_list for QP doorbell record

 drivers/infiniband/core/core_priv.h           |   1 +
 drivers/infiniband/core/umem.c                | 248 ++++++++++++++++++
 drivers/infiniband/core/uverbs_cmd.c          |  18 +-
 drivers/infiniband/core/uverbs_std_types_cq.c | 158 ++++++-----
 drivers/infiniband/core/uverbs_std_types_qp.c |  22 +-
 drivers/infiniband/core/verbs.c               |  27 +-
 drivers/infiniband/hw/bnxt_re/ib_verbs.c      |  23 +-
 drivers/infiniband/hw/efa/efa_verbs.c         |  17 +-
 drivers/infiniband/hw/mlx4/cq.c               |  41 +--
 drivers/infiniband/hw/mlx5/cq.c               |  40 ++-
 drivers/infiniband/hw/mlx5/doorbell.c         |  41 ++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h          |   3 +-
 drivers/infiniband/hw/mlx5/qp.c               |  76 ++++--
 drivers/infiniband/hw/mlx5/srq.c              |   2 +-
 include/rdma/ib_umem.h                        |  54 ++++
 include/rdma/ib_verbs.h                       |   5 +-
 include/rdma/uverbs_ioctl.h                   |  14 +
 include/uapi/rdma/ib_user_ioctl_cmds.h        |  17 ++
 include/uapi/rdma/ib_user_ioctl_verbs.h       |  27 ++
 19 files changed, 663 insertions(+), 171 deletions(-)

-- 
2.53.0

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-12 12:33   ` Michael Margolin
  2026-04-21 13:46   ` Jason Gunthorpe
  2026-04-11 14:49 ` [PATCH rdma-next v2 02/15] RDMA/uverbs: Push out CQ buffer umem processing into a helper Jiri Pirko
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Add a unified mechanism for userspace to pass memory buffers to any
uverbs command via a single UVERBS_ATTR_BUFFERS attribute. Each
buffer is described by struct ib_uverbs_buffer_desc with a type
discriminator supporting dma-buf and user VA backed memory, extensible
for future buffer types.

The ib_umem_list API enables any uverbs command to accept multiple
buffers indexed by per-command slot enums, without requiring new UAPI
attributes for each buffer. A consumption check ensures userspace and
driver agree on which buffers are used.
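
As a rough illustration of the per-descriptor rules described above, here
is a userspace sketch. It is an assumption-laden model, not the kernel
code: struct buf_desc mirrors the field order of the proposed descriptor
with plain fixed-width types, and validate_descs() plus the BUF_TYPE_*
constants are hypothetical names. It only checks the reserved field, the
slot index range, duplicate slots and the buffer type; it does not pin
any memory.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the descriptor layout (same field order). */
struct buf_desc {
	int32_t fd;
	uint32_t type;
	uint32_t index;
	uint32_t reserved;
	uint64_t addr;
	uint64_t length;
};

enum { BUF_TYPE_DMABUF, BUF_TYPE_VA };

/*
 * Validate @n descriptors against slots 0..@slot_max.
 * On success returns 0 and stores the bitmask of provided slots.
 */
static inline int validate_descs(const struct buf_desc *descs, size_t n,
				 unsigned int slot_max,
				 unsigned long *provided_out)
{
	unsigned long provided = 0;
	unsigned int count;
	size_t i;

	/* The bitmask must be able to hold every slot. */
	if (slot_max >= sizeof(unsigned long) * CHAR_BIT)
		return -EINVAL;
	count = slot_max + 1;
	if (n > count)
		return -EINVAL;

	for (i = 0; i < n; i++) {
		unsigned int idx = descs[i].index;

		if (descs[i].reserved)
			return -EINVAL; /* reserved must be zero */
		if (idx >= count || (provided & (1UL << idx)))
			return -EINVAL; /* out of range or duplicate slot */
		if (descs[i].type != BUF_TYPE_DMABUF &&
		    descs[i].type != BUF_TYPE_VA)
			return -EINVAL; /* unknown buffer type */
		provided |= 1UL << idx;
	}

	*provided_out = provided;
	return 0;
}
```

Rejecting duplicate slot indexes up front keeps the mapping from
descriptor to slot unambiguous, so a later load by index always refers to
exactly one buffer.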

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
 drivers/infiniband/core/umem.c          | 248 ++++++++++++++++++++++++
 include/rdma/ib_umem.h                  |  54 ++++++
 include/rdma/uverbs_ioctl.h             |  14 ++
 include/uapi/rdma/ib_user_ioctl_cmds.h  |   1 +
 include/uapi/rdma/ib_user_ioctl_verbs.h |  27 +++
 5 files changed, 344 insertions(+)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 786fa1aa8e55..f5b03e903b9d 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -37,6 +37,7 @@
 #include <linux/dma-mapping.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/mm.h>
+#include <linux/err.h>
 #include <linux/export.h>
 #include <linux/slab.h>
 #include <linux/pagemap.h>
@@ -332,3 +333,250 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
 		return 0;
 }
 EXPORT_SYMBOL(ib_umem_copy_from);
+
+struct ib_umem_list {
+	unsigned int count; /* Total slots in the list. */
+	unsigned long provided; /* Bitmask of slots provided by the user. */
+	unsigned long loaded; /* Bitmask of slots loaded by the driver. */
+	struct ib_umem *umems[] __counted_by(count);
+};
+
+/**
+ * ib_umem_list_create - Create a umem list from UVERBS_ATTR_BUFFERS
+ * @device: IB device
+ * @attrs: uverbs attribute bundle
+ * @slot_max: highest buffer slot index (count = slot_max + 1)
+ *
+ * Return: umem list, or ERR_PTR on failure.
+ */
+struct ib_umem_list *ib_umem_list_create(struct ib_device *device,
+					 const struct uverbs_attr_bundle *attrs,
+					 unsigned int slot_max)
+{
+	const struct ib_uverbs_buffer_desc *descs;
+	struct ib_umem_dmabuf *umem_dmabuf;
+	struct ib_umem_list *list;
+	struct ib_umem *umem;
+	unsigned int count;
+	int num_descs;
+	int err;
+	int i;
+
+	if (WARN_ON_ONCE(slot_max >= BITS_PER_LONG))
+		return ERR_PTR(-EINVAL);
+	count = slot_max + 1;
+
+	num_descs = uverbs_attr_ptr_get_array_size(
+		(struct uverbs_attr_bundle *)attrs, UVERBS_ATTR_BUFFERS,
+		sizeof(*descs));
+	if (num_descs == -ENOENT) {
+		num_descs = 0;
+		descs = NULL;
+	} else if (num_descs < 0) {
+		return ERR_PTR(num_descs);
+	} else if (num_descs > count) {
+		return ERR_PTR(-EINVAL);
+	} else {
+		descs = uverbs_attr_get_alloced_ptr(attrs, UVERBS_ATTR_BUFFERS);
+		if (IS_ERR(descs))
+			return ERR_CAST(descs);
+	}
+
+	list = kzalloc(struct_size(list, umems, count), GFP_KERNEL);
+	if (!list)
+		return ERR_PTR(-ENOMEM);
+	list->count = count;
+
+	for (i = 0; i < num_descs; i++) {
+		unsigned int idx = descs[i].index;
+
+		if (descs[i].reserved) {
+			err = -EINVAL;
+			goto err_release;
+		}
+		if (idx >= count || (list->provided & BIT(idx))) {
+			err = -EINVAL;
+			goto err_release;
+		}
+
+		switch (descs[i].type) {
+		case IB_UVERBS_BUFFER_TYPE_DMABUF:
+			umem_dmabuf = ib_umem_dmabuf_get_pinned(
+				device, descs[i].addr, descs[i].length,
+				descs[i].fd, IB_ACCESS_LOCAL_WRITE);
+			if (IS_ERR(umem_dmabuf)) {
+				err = PTR_ERR(umem_dmabuf);
+				goto err_release;
+			}
+			list->umems[idx] = &umem_dmabuf->umem;
+			break;
+		case IB_UVERBS_BUFFER_TYPE_VA:
+			umem = ib_umem_get(device, descs[i].addr,
+					   descs[i].length, IB_ACCESS_LOCAL_WRITE);
+			if (IS_ERR(umem)) {
+				err = PTR_ERR(umem);
+				goto err_release;
+			}
+			list->umems[idx] = umem;
+			break;
+		default:
+			err = -EINVAL;
+			goto err_release;
+		}
+		list->provided |= BIT(idx);
+	}
+
+	return list;
+
+err_release:
+	ib_umem_list_release(list);
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL(ib_umem_list_create);
+
+/**
+ * ib_umem_list_release - Release all umems in the list and free it
+ * @list: umem list
+ */
+void ib_umem_list_release(struct ib_umem_list *list)
+{
+	int i;
+
+	if (!list)
+		return;
+	for (i = 0; i < list->count; i++)
+		ib_umem_release(list->umems[i]);
+	kfree(list);
+}
+EXPORT_SYMBOL(ib_umem_list_release);
+
+/**
+ * ib_umem_list_check_consumed - Verify all provided umems were loaded
+ * @list: umem list
+ *
+ * Return: 0 if all provided slots were loaded, -EINVAL otherwise.
+ */
+int ib_umem_list_check_consumed(const struct ib_umem_list *list)
+{
+	return (list->provided & ~list->loaded) == 0 ? 0 : -EINVAL;
+}
+EXPORT_SYMBOL(ib_umem_list_check_consumed);
+
+/**
+ * ib_umem_list_insert - Insert a umem into the list at a given index
+ * @list: umem list
+ * @index: per-command buffer slot index
+ * @umem: umem pointer to store
+ *
+ * Stores @umem at @index (replacing any existing). For use from create_cq
+ * when the buffer comes from legacy ATTRs rather than the buffer list.
+ */
+void ib_umem_list_insert(struct ib_umem_list *list, unsigned int index,
+			 struct ib_umem *umem)
+{
+	ib_umem_list_replace(list, index, umem);
+	if (umem)
+		list->provided |= BIT(index);
+}
+EXPORT_SYMBOL(ib_umem_list_insert);
+
+/**
+ * ib_umem_list_load - Load a umem from the list by index
+ * @list: umem list (may be NULL)
+ * @index: per-command buffer slot index
+ * @size: minimum required umem length
+ *
+ * Return: umem pointer, or NULL if the slot is empty or
+ * the slot is out of bounds, or ERR_PTR(-EINVAL) if the umem is too small.
+ */
+struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
+				 unsigned int index, size_t size)
+{
+	struct ib_umem *umem;
+
+	if (!list || index >= list->count)
+		return NULL;
+	umem = list->umems[index];
+	if (!umem)
+		return NULL;
+	if (umem->length < size)
+		return ERR_PTR(-EINVAL);
+	list->loaded |= BIT(index);
+	return umem;
+}
+EXPORT_SYMBOL(ib_umem_list_load);
+
+/**
+ * ib_umem_list_load_or_get - Umem from list or pin user memory
+ * @list: umem list (may be NULL)
+ * @index: per-command buffer slot index
+ * @device: IB device for ib_umem_get when the list slot is empty
+ * @addr: user virtual address for ib_umem_get
+ * @size: length for ib_umem_get
+ * @access: access flags for ib_umem_get
+ *
+ * If @list has a umem at @index, returns it like ib_umem_list_load() (and
+ * marks the slot loaded). Otherwise calls ib_umem_get() with the given
+ * @access flags and on success stores the result at @index when
+ * @list is non-NULL.
+ *
+ * Return: valid umem pointer, or ERR_PTR.
+ */
+struct ib_umem *ib_umem_list_load_or_get(struct ib_umem_list *list,
+					 unsigned int index,
+					 struct ib_device *device,
+					 unsigned long addr, size_t size,
+					 int access)
+{
+	struct ib_umem *umem;
+
+	umem = ib_umem_list_load(list, index, size);
+	if (IS_ERR(umem) || umem)
+		return umem;
+	umem = ib_umem_get(device, addr, size, access);
+	if (IS_ERR(umem))
+		return umem;
+	if (list && index < list->count)
+		list->umems[index] = umem;
+	return umem;
+}
+EXPORT_SYMBOL(ib_umem_list_load_or_get);
+
+/**
+ * ib_umem_list_replace - Replace umem at index, releasing the previous one
+ * @list: umem list (may be NULL)
+ * @index: per-command buffer slot index
+ * @umem: new umem pointer (may be NULL to clear the slot)
+ *
+ * Stores @umem at @index. If a different umem was already stored there, it is
+ * released. Used for CQ resize and similar.
+ */
+void ib_umem_list_replace(struct ib_umem_list *list, unsigned int index,
+			  struct ib_umem *umem)
+{
+	struct ib_umem *old;
+
+	if (!list || index >= list->count)
+		return;
+	old = list->umems[index];
+	list->umems[index] = umem;
+	if (old && old != umem)
+		ib_umem_release(old);
+}
+EXPORT_SYMBOL(ib_umem_list_replace);
+
+/**
+ * ib_umem_release_non_listed - Release a umem that is not stored in the list
+ * @list: umem list
+ * @index: per-command buffer slot index
+ * @umem: umem pointer to release
+ *
+ * Releases @umem if it is not stored in @list.
+ */
+void ib_umem_release_non_listed(struct ib_umem_list *list, unsigned int index,
+				struct ib_umem *umem)
+{
+	if (!list || index >= list->count || list->umems[index] != umem)
+		ib_umem_release(umem);
+}
+EXPORT_SYMBOL(ib_umem_release_non_listed);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 2ad52cc1d52b..924acb8d08c3 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -11,6 +11,7 @@
 
 struct ib_device;
 struct dma_buf_attach_ops;
+struct uverbs_attr_bundle;
 
 struct ib_umem {
 	struct ib_device       *ibdev;
@@ -80,6 +81,36 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 void ib_umem_release(struct ib_umem *umem);
 int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
 		      size_t length);
+
+/**
+ * struct ib_umem_list - collection of pre-mapped umems
+ *
+ * Created from the UVERBS_ATTR_BUFFERS attribute. Each entry is indexed
+ * by a per-command buffer slot enum (e.g., IB_UMEM_CQ_BUF for CQ CREATE).
+ * Drivers use ib_umem_list_load() to retrieve a specific umem by index.
+ */
+struct ib_umem_list;
+
+struct ib_umem_list *ib_umem_list_create(struct ib_device *device,
+					 const struct uverbs_attr_bundle *attrs,
+					 unsigned int slot_max);
+void ib_umem_list_release(struct ib_umem_list *list);
+int ib_umem_list_check_consumed(const struct ib_umem_list *list);
+void ib_umem_list_insert(struct ib_umem_list *list, unsigned int index,
+			 struct ib_umem *umem);
+
+struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
+				  unsigned int index, size_t size);
+struct ib_umem *ib_umem_list_load_or_get(struct ib_umem_list *list,
+					 unsigned int index,
+					 struct ib_device *device,
+					 unsigned long addr, size_t size,
+					 int access);
+void ib_umem_list_replace(struct ib_umem_list *list, unsigned int index,
+			  struct ib_umem *umem);
+void ib_umem_release_non_listed(struct ib_umem_list *list, unsigned int index,
+				struct ib_umem *umem);
+
 unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
 				     unsigned long pgsz_bitmap,
 				     unsigned long virt);
@@ -230,5 +261,28 @@ static inline void ib_umem_dmabuf_revoke_lock(struct ib_umem_dmabuf *umem_dmabuf
 static inline void ib_umem_dmabuf_revoke_unlock(struct ib_umem_dmabuf *umem_dmabuf) {}
 static inline void ib_umem_dmabuf_revoke(struct ib_umem_dmabuf *umem_dmabuf) {}
 
+struct ib_umem_list;
+
+static inline void ib_umem_list_release(struct ib_umem_list *list) { }
+static inline struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
+						unsigned int index,
+						size_t size)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+static inline struct ib_umem *
+ib_umem_list_load_or_get(struct ib_umem_list *list, unsigned int index,
+			 struct ib_device *device, unsigned long addr,
+			 size_t size, int access)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+static inline void ib_umem_list_replace(struct ib_umem_list *list,
+					unsigned int index,
+					struct ib_umem *umem) { }
+static inline void ib_umem_release_non_listed(struct ib_umem_list *list,
+					      unsigned int index,
+					      struct ib_umem *umem) { }
+
 #endif /* CONFIG_INFINIBAND_USER_MEM */
 #endif /* IB_UMEM_H */
diff --git a/include/rdma/uverbs_ioctl.h b/include/rdma/uverbs_ioctl.h
index e2af17da3e32..05bcab27a87d 100644
--- a/include/rdma/uverbs_ioctl.h
+++ b/include/rdma/uverbs_ioctl.h
@@ -590,6 +590,20 @@ struct uapi_definition {
 			    UA_OPTIONAL,                                       \
 			    .is_udata = 1)
 
+/*
+ * Optional array of struct ib_uverbs_buffer_desc describing memory regions
+ * backed by dma-buf or user virtual address. Can be added to any method
+ * that needs external buffer support.
+ * Each entry carries an index field selecting the per-command buffer slot.
+ * Use ib_umem_list_create() to map them and ib_umem_list_load() to access.
+ */
+#define UVERBS_ATTR_BUFFERS()                                                  \
+	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_BUFFERS,                               \
+			   UVERBS_ATTR_MIN_SIZE(                               \
+				sizeof(struct ib_uverbs_buffer_desc)),         \
+			   UA_OPTIONAL,                                        \
+			   UA_ALLOC_AND_COPY)
+
 /* =================================================
  *              Parsing infrastructure
  * =================================================
diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
index 72041c1b0ea5..10aa6568abf1 100644
--- a/include/uapi/rdma/ib_user_ioctl_cmds.h
+++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
@@ -64,6 +64,7 @@ enum {
 	UVERBS_ATTR_UHW_IN = UVERBS_ID_DRIVER_NS,
 	UVERBS_ATTR_UHW_OUT,
 	UVERBS_ID_DRIVER_NS_WITH_UHW,
+	UVERBS_ATTR_BUFFERS,
 };
 
 enum uverbs_methods_device {
diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
index 90c5cd8e7753..41ed9f75b4de 100644
--- a/include/uapi/rdma/ib_user_ioctl_verbs.h
+++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
@@ -273,4 +273,31 @@ struct ib_uverbs_gid_entry {
 	__u32 netdev_ifindex; /* It is 0 if there is no netdev associated with it */
 };
 
+enum ib_uverbs_buffer_type {
+	IB_UVERBS_BUFFER_TYPE_DMABUF,
+	IB_UVERBS_BUFFER_TYPE_VA,
+};
+
+/*
+ * Describes a single buffer backed by dma-buf or user virtual address.
+ * Passed as an array via UVERBS_ATTR_BUFFERS. Each uverbs command that
+ * accepts this attribute defines its own per-command buffer slot enum.
+ * The index field selects the buffer slot this descriptor maps to.
+ *
+ * @fd: dma-buf file descriptor (valid for IB_UVERBS_BUFFER_TYPE_DMABUF)
+ * @type: buffer type from enum ib_uverbs_buffer_type
+ * @index: per-command buffer slot index
+ * @reserved: must be zero
+ * @addr: offset within dma-buf, or user virtual address for VA
+ * @length: buffer length in bytes
+ */
+struct ib_uverbs_buffer_desc {
+	__s32 fd;
+	__u32 type;
+	__u32 index;
+	__u32 reserved;
+	__aligned_u64 addr;
+	__aligned_u64 length;
+};
+
 #endif
-- 
2.53.0



* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-11 14:49 ` [PATCH rdma-next v2 01/15] RDMA/core: " Jiri Pirko
@ 2026-04-12 12:33   ` Michael Margolin
  2026-04-13  8:32     ` Jiri Pirko
  2026-04-21 13:46   ` Jason Gunthorpe
  1 sibling, 1 reply; 81+ messages in thread
From: Michael Margolin @ 2026-04-12 12:33 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: linux-rdma, jgg, leon, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

On Sat, Apr 11, 2026 at 04:49:01PM +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@nvidia.com>
> 
> Add a unified mechanism for userspace to pass memory buffers to any
> uverbs command via a single UVERBS_ATTR_BUFFERS attribute. Each
> buffer is described by struct ib_uverbs_buffer_desc with a type
> discriminator supporting dma-buf and user VA backed memory, extensible
> for future buffer types.
> 
> The ib_umem_list API enables any uverbs command to accept multiple
> buffers indexed by per-command slot enums, without requiring new UAPI
> attributes for each buffer. A consumption check ensures userspace and
> driver agree on which buffers are used.
> 
> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
> ---
>  drivers/infiniband/core/umem.c          | 248 ++++++++++++++++++++++++
>  include/rdma/ib_umem.h                  |  54 ++++++
>  include/rdma/uverbs_ioctl.h             |  14 ++
>  include/uapi/rdma/ib_user_ioctl_cmds.h  |   1 +
>  include/uapi/rdma/ib_user_ioctl_verbs.h |  27 +++
>  5 files changed, 344 insertions(+)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 786fa1aa8e55..f5b03e903b9d 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -37,6 +37,7 @@
>  #include <linux/dma-mapping.h>
>  #include <linux/sched/signal.h>
>  #include <linux/sched/mm.h>
> +#include <linux/err.h>
>  #include <linux/export.h>
>  #include <linux/slab.h>
>  #include <linux/pagemap.h>
> @@ -332,3 +333,250 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
>  		return 0;
>  }
>  EXPORT_SYMBOL(ib_umem_copy_from);
> +
> +struct ib_umem_list {
> +	unsigned int count; /* Total slots in the list. */
> +	unsigned long provided; /* Bitmask of slots provided by the user. */
> +	unsigned long loaded; /* Bitmask of slots loaded by the driver. */
> +	struct ib_umem *umems[] __counted_by(count);
> +};
> +
> +/**
> + * ib_umem_list_create - Create a umem list from UVERBS_ATTR_BUFFERS
> + * @device: IB device
> + * @attrs: uverbs attribute bundle
> + * @slot_max: highest buffer slot index (count = slot_max + 1)
> + *
> + * Return: umem list, or ERR_PTR on failure.
> + */
> +struct ib_umem_list *ib_umem_list_create(struct ib_device *device,
> +					 const struct uverbs_attr_bundle *attrs,
> +					 unsigned int slot_max)
> +{
> +	const struct ib_uverbs_buffer_desc *descs;
> +	struct ib_umem_dmabuf *umem_dmabuf;
> +	struct ib_umem_list *list;
> +	struct ib_umem *umem;
> +	unsigned int count;
> +	int num_descs;
> +	int err;
> +	int i;
> +
> +	if (WARN_ON_ONCE(slot_max >= BITS_PER_LONG))
> +		return ERR_PTR(-EINVAL);
> +	count = slot_max + 1;
> +
> +	num_descs = uverbs_attr_ptr_get_array_size(
> +		(struct uverbs_attr_bundle *)attrs, UVERBS_ATTR_BUFFERS,
> +		sizeof(*descs));
> +	if (num_descs == -ENOENT) {
> +		num_descs = 0;
> +		descs = NULL;
> +	} else if (num_descs < 0) {
> +		return ERR_PTR(num_descs);
> +	} else if (num_descs > count) {
> +		return ERR_PTR(-EINVAL);
> +	} else {
> +		descs = uverbs_attr_get_alloced_ptr(attrs, UVERBS_ATTR_BUFFERS);
> +		if (IS_ERR(descs))
> +			return ERR_CAST(descs);
> +	}
> +
> +	list = kzalloc(struct_size(list, umems, count), GFP_KERNEL);
> +	if (!list)
> +		return ERR_PTR(-ENOMEM);
> +	list->count = count;
> +
> +	for (i = 0; i < num_descs; i++) {

While I like the idea of standardizing the way we pass buffer
information to the kernel, the list looks like an over-generalization
to me, especially after Leon's refactoring of CQ creation. Maybe we can
add a buffer as a new attribute type that can be used for multiple
parameters in a command, and have a helper with the code below that
takes an attribute ID and returns a umem object, letting each handler
store it. This would also make it easier for drivers to pass their
private buffers using this infrastructure.

Michael
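
For illustration only, the per-attribute alternative suggested above can
be modelled in plain userspace C. This is a sketch under stated
assumptions, not kernel code: model_bundle, model_buf_attr, model_umem
and model_buf_attr_get_umem() are hypothetical names that do not exist
in the tree, and a real helper would pin memory with ib_umem_get() or
ib_umem_dmabuf_get_pinned() instead of copying fields.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum model_buf_type { MODEL_BUF_DMABUF, MODEL_BUF_VA };

#define MODEL_MAX_ATTRS 8

/* One buffer attribute as userspace would supply it. */
struct model_buf_attr {
	int present;     /* non-zero if the caller supplied this attribute */
	int32_t fd;      /* dma-buf fd, only meaningful for MODEL_BUF_DMABUF */
	uint32_t type;   /* enum model_buf_type */
	uint64_t addr;   /* offset within the dma-buf, or user VA */
	uint64_t length; /* buffer length in bytes */
};

/* Stand-in for the attribute bundle a handler receives. */
struct model_bundle {
	struct model_buf_attr attrs[MODEL_MAX_ATTRS];
};

/* Stand-in for a pinned umem; the kernel would pin pages here. */
struct model_umem {
	uint32_t type;
	uint64_t addr;
	uint64_t length;
};

/*
 * Look up one buffer attribute by id and turn it into a umem.
 * Returns 0 on success, -ENOENT if the attribute is absent (so the
 * handler can treat it as optional), -EINVAL if it is malformed.
 */
int model_buf_attr_get_umem(const struct model_bundle *bundle,
			    unsigned int attr_id, struct model_umem *out)
{
	const struct model_buf_attr *attr;

	assert(bundle != NULL && out != NULL);

	if (attr_id >= MODEL_MAX_ATTRS)
		return -EINVAL;
	attr = &bundle->attrs[attr_id];
	if (!attr->present)
		return -ENOENT;
	if (attr->length == 0)
		return -EINVAL;
	if (attr->type != MODEL_BUF_DMABUF && attr->type != MODEL_BUF_VA)
		return -EINVAL;
	if (attr->type == MODEL_BUF_DMABUF && attr->fd < 0)
		return -EINVAL;

	/* Each handler stores the result wherever it wants. */
	out->type = attr->type;
	out->addr = attr->addr;
	out->length = attr->length;
	return 0;
}
```

In this model each handler calls the helper once per buffer attribute it
understands and keeps the result itself, so no shared list or
consumption bitmask is needed. The cost is one attribute ID per buffer.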

> +		unsigned int idx = descs[i].index;
> +
> +		if (descs[i].reserved) {
> +			err = -EINVAL;
> +			goto err_release;
> +		}
> +		if (idx >= count || (list->provided & BIT(idx))) {
> +			err = -EINVAL;
> +			goto err_release;
> +		}
> +
> +		switch (descs[i].type) {
> +		case IB_UVERBS_BUFFER_TYPE_DMABUF:
> +			umem_dmabuf = ib_umem_dmabuf_get_pinned(
> +				device, descs[i].addr, descs[i].length,
> +				descs[i].fd, IB_ACCESS_LOCAL_WRITE);
> +			if (IS_ERR(umem_dmabuf)) {
> +				err = PTR_ERR(umem_dmabuf);
> +				goto err_release;
> +			}
> +			list->umems[idx] = &umem_dmabuf->umem;
> +			break;
> +		case IB_UVERBS_BUFFER_TYPE_VA:
> +			umem = ib_umem_get(device, descs[i].addr,
> +					   descs[i].length, IB_ACCESS_LOCAL_WRITE);
> +			if (IS_ERR(umem)) {
> +				err = PTR_ERR(umem);
> +				goto err_release;
> +			}
> +			list->umems[idx] = umem;
> +			break;
> +		default:
> +			err = -EINVAL;
> +			goto err_release;
> +		}
> +		list->provided |= BIT(idx);
> +	}
> +
> +	return list;
> +
> +err_release:
> +	ib_umem_list_release(list);
> +	return ERR_PTR(err);
> +}
> +EXPORT_SYMBOL(ib_umem_list_create);
> +
> +/**
> + * ib_umem_list_release - Release all umems in the list and free it
> + * @list: umem list
> + */
> +void ib_umem_list_release(struct ib_umem_list *list)
> +{
> +	int i;
> +
> +	if (!list)
> +		return;
> +	for (i = 0; i < list->count; i++)
> +		ib_umem_release(list->umems[i]);
> +	kfree(list);
> +}
> +EXPORT_SYMBOL(ib_umem_list_release);
> +
> +/**
> + * ib_umem_list_check_consumed - Verify all provided umems were loaded
> + * @list: umem list
> + *
> + * Return: 0 if all provided slots were loaded, -EINVAL otherwise.
> + */
> +int ib_umem_list_check_consumed(const struct ib_umem_list *list)
> +{
> +	return (list->provided & ~list->loaded) == 0 ? 0 : -EINVAL;
> +}
> +EXPORT_SYMBOL(ib_umem_list_check_consumed);
> +
> +/**
> + * ib_umem_list_insert - Insert a umem into the list at a given index
> + * @list: umem list
> + * @index: per-command buffer slot index
> + * @umem: umem pointer to store
> + *
> + * Stores @umem at @index (replacing any existing). For use from create_cq
> + * when the buffer comes from legacy ATTRs rather than the buffer list.
> + */
> +void ib_umem_list_insert(struct ib_umem_list *list, unsigned int index,
> +			 struct ib_umem *umem)
> +{
> +	ib_umem_list_replace(list, index, umem);
> +	if (umem)
> +		list->provided |= BIT(index);
> +}
> +EXPORT_SYMBOL(ib_umem_list_insert);
> +
> +/**
> + * ib_umem_list_load - Load a umem from the list by index
> + * @list: umem list (may be NULL)
> + * @index: per-command buffer slot index
> + * @size: minimum required umem length
> + *
> + * Return: umem pointer, or NULL if the slot is empty or
> + * the slot is out of bounds, or ERR_PTR(-EINVAL) if the umem is too small.
> + */
> +struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
> +				 unsigned int index, size_t size)
> +{
> +	struct ib_umem *umem;
> +
> +	if (!list || index >= list->count)
> +		return NULL;
> +	umem = list->umems[index];
> +	if (!umem)
> +		return NULL;
> +	if (umem->length < size)
> +		return ERR_PTR(-EINVAL);
> +	list->loaded |= BIT(index);
> +	return umem;
> +}
> +EXPORT_SYMBOL(ib_umem_list_load);
> +
> +/**
> + * ib_umem_list_load_or_get - Umem from list or pin user memory
> + * @list: umem list (may be NULL)
> + * @index: per-command buffer slot index
> + * @device: IB device for ib_umem_get when the list slot is empty
> + * @addr: user virtual address for ib_umem_get
> + * @size: length for ib_umem_get
> + * @access: access flags for ib_umem_get
> + *
> + * If @list has a umem at @index, returns it like ib_umem_list_load() (and
> + * marks the slot loaded). Otherwise calls ib_umem_get() with the given
> + * @access flags and on success stores the result at @index when
> + * @list is non-NULL.
> + *
> + * Return: valid umem pointer, or ERR_PTR.
> + */
> +struct ib_umem *ib_umem_list_load_or_get(struct ib_umem_list *list,
> +					 unsigned int index,
> +					 struct ib_device *device,
> +					 unsigned long addr, size_t size,
> +					 int access)
> +{
> +	struct ib_umem *umem;
> +
> +	umem = ib_umem_list_load(list, index, size);
> +	if (IS_ERR(umem) || umem)
> +		return umem;
> +	umem = ib_umem_get(device, addr, size, access);
> +	if (IS_ERR(umem))
> +		return umem;
> +	if (list && index < list->count)
> +		list->umems[index] = umem;
> +	return umem;
> +}
> +EXPORT_SYMBOL(ib_umem_list_load_or_get);
> +
> +/**
> + * ib_umem_list_replace - Replace umem at index, releasing the previous one
> + * @list: umem list (may be NULL)
> + * @index: per-command buffer slot index
> + * @umem: new umem pointer (may be NULL to clear the slot)
> + *
> + * Stores @umem at @index. If a different umem was already stored there, it is
> + * released. Used for CQ resize and similar.
> + */
> +void ib_umem_list_replace(struct ib_umem_list *list, unsigned int index,
> +			  struct ib_umem *umem)
> +{
> +	struct ib_umem *old;
> +
> +	if (!list || index >= list->count)
> +		return;
> +	old = list->umems[index];
> +	list->umems[index] = umem;
> +	if (old && old != umem)
> +		ib_umem_release(old);
> +}
> +EXPORT_SYMBOL(ib_umem_list_replace);
> +
> +/**
> + * ib_umem_release_non_listed - Release a umem that is not stored in the list
> + * @list: umem list
> + * @index: per-command buffer slot index
> + * @umem: umem pointer to release
> + *
> + * Releases @umem if it is not stored in @list.
> + */
> +void ib_umem_release_non_listed(struct ib_umem_list *list, unsigned int index,
> +				struct ib_umem *umem)
> +{
> +	if (!list || index >= list->count || list->umems[index] != umem)
> +		ib_umem_release(umem);
> +}
> +EXPORT_SYMBOL(ib_umem_release_non_listed);
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index 2ad52cc1d52b..924acb8d08c3 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -11,6 +11,7 @@
>  
>  struct ib_device;
>  struct dma_buf_attach_ops;
> +struct uverbs_attr_bundle;
>  
>  struct ib_umem {
>  	struct ib_device       *ibdev;
> @@ -80,6 +81,36 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
>  void ib_umem_release(struct ib_umem *umem);
>  int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
>  		      size_t length);
> +
> +/**
> + * struct ib_umem_list - collection of pre-mapped umems
> + *
> + * Created from the UVERBS_ATTR_BUFFERS attribute. Each entry is indexed
> + * by a per-command buffer slot enum (e.g., IB_UMEM_CQ_BUF for CQ CREATE).
> + * Drivers use ib_umem_list_load() to retrieve a specific umem by index.
> + */
> +struct ib_umem_list;
> +
> +struct ib_umem_list *ib_umem_list_create(struct ib_device *device,
> +					 const struct uverbs_attr_bundle *attrs,
> +					 unsigned int slot_max);
> +void ib_umem_list_release(struct ib_umem_list *list);
> +int ib_umem_list_check_consumed(const struct ib_umem_list *list);
> +void ib_umem_list_insert(struct ib_umem_list *list, unsigned int index,
> +			 struct ib_umem *umem);
> +
> +struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
> +				  unsigned int index, size_t size);
> +struct ib_umem *ib_umem_list_load_or_get(struct ib_umem_list *list,
> +					 unsigned int index,
> +					 struct ib_device *device,
> +					 unsigned long addr, size_t size,
> +					 int access);
> +void ib_umem_list_replace(struct ib_umem_list *list, unsigned int index,
> +			  struct ib_umem *umem);
> +void ib_umem_release_non_listed(struct ib_umem_list *list, unsigned int index,
> +				struct ib_umem *umem);
> +
>  unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>  				     unsigned long pgsz_bitmap,
>  				     unsigned long virt);
> @@ -230,5 +261,28 @@ static inline void ib_umem_dmabuf_revoke_lock(struct ib_umem_dmabuf *umem_dmabuf
>  static inline void ib_umem_dmabuf_revoke_unlock(struct ib_umem_dmabuf *umem_dmabuf) {}
>  static inline void ib_umem_dmabuf_revoke(struct ib_umem_dmabuf *umem_dmabuf) {}
>  
> +struct ib_umem_list;
> +
> +static inline void ib_umem_list_release(struct ib_umem_list *list) { }
> +static inline struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
> +						unsigned int index,
> +						size_t size)
> +{
> +	return ERR_PTR(-EOPNOTSUPP);
> +}
> +static inline struct ib_umem *
> +ib_umem_list_load_or_get(struct ib_umem_list *list, unsigned int index,
> +			 struct ib_device *device, unsigned long addr,
> +			 size_t size, int access)
> +{
> +	return ERR_PTR(-EOPNOTSUPP);
> +}
> +static inline void ib_umem_list_replace(struct ib_umem_list *list,
> +					unsigned int index,
> +					struct ib_umem *umem) { }
> +static inline void ib_umem_release_non_listed(struct ib_umem_list *list,
> +					      unsigned int index,
> +					      struct ib_umem *umem) { }
> +
>  #endif /* CONFIG_INFINIBAND_USER_MEM */
>  #endif /* IB_UMEM_H */
> diff --git a/include/rdma/uverbs_ioctl.h b/include/rdma/uverbs_ioctl.h
> index e2af17da3e32..05bcab27a87d 100644
> --- a/include/rdma/uverbs_ioctl.h
> +++ b/include/rdma/uverbs_ioctl.h
> @@ -590,6 +590,20 @@ struct uapi_definition {
>  			    UA_OPTIONAL,                                       \
>  			    .is_udata = 1)
>  
> +/*
> + * Optional array of struct ib_uverbs_buffer_desc describing memory regions
> + * backed by dma-buf or user virtual address. Can be added to any method
> + * that needs external buffer support.
> + * Each entry carries an index field selecting the per-command buffer slot.
> + * Use ib_umem_list_create() to map them and ib_umem_list_load() to access.
> + */
> +#define UVERBS_ATTR_BUFFERS()                                                  \
> +	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_BUFFERS,                               \
> +			   UVERBS_ATTR_MIN_SIZE(                               \
> +				sizeof(struct ib_uverbs_buffer_desc)),         \
> +			   UA_OPTIONAL,                                        \
> +			   UA_ALLOC_AND_COPY)
> +
>  /* =================================================
>   *              Parsing infrastructure
>   * =================================================
> diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
> index 72041c1b0ea5..10aa6568abf1 100644
> --- a/include/uapi/rdma/ib_user_ioctl_cmds.h
> +++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
> @@ -64,6 +64,7 @@ enum {
>  	UVERBS_ATTR_UHW_IN = UVERBS_ID_DRIVER_NS,
>  	UVERBS_ATTR_UHW_OUT,
>  	UVERBS_ID_DRIVER_NS_WITH_UHW,
> +	UVERBS_ATTR_BUFFERS,
>  };
>  
>  enum uverbs_methods_device {
> diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
> index 90c5cd8e7753..41ed9f75b4de 100644
> --- a/include/uapi/rdma/ib_user_ioctl_verbs.h
> +++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
> @@ -273,4 +273,31 @@ struct ib_uverbs_gid_entry {
>  	__u32 netdev_ifindex; /* It is 0 if there is no netdev associated with it */
>  };
>  
> +enum ib_uverbs_buffer_type {
> +	IB_UVERBS_BUFFER_TYPE_DMABUF,
> +	IB_UVERBS_BUFFER_TYPE_VA,
> +};
> +
> +/*
> + * Describes a single buffer backed by dma-buf or user virtual address.
> + * Passed as an array via UVERBS_ATTR_BUFFERS. Each uverb command that
> + * accepts this attribute defines its own per-command buffer slot enum.
> + * The index field selects the buffer slot this descriptor maps to.
> + *
> + * @fd: dma-buf file descriptor (valid for IB_UVERBS_BUFFER_TYPE_DMABUF)
> + * @type: buffer type from enum ib_uverbs_buffer_type
> + * @index: per-command buffer slot index
> + * @reserved: must be zero
> + * @addr: offset within dma-buf, or user virtual address for VA
> + * @length: buffer length in bytes
> + */
> +struct ib_uverbs_buffer_desc {
> +	__s32 fd;
> +	__u32 type;
> +	__u32 index;
> +	__u32 reserved;
> +	__aligned_u64 addr;
> +	__aligned_u64 length;
> +};
> +
>  #endif
> -- 
> 2.53.0
> 


* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-12 12:33   ` Michael Margolin
@ 2026-04-13  8:32     ` Jiri Pirko
  2026-04-13 16:02       ` Michael Margolin
  0 siblings, 1 reply; 81+ messages in thread
From: Jiri Pirko @ 2026-04-13  8:32 UTC (permalink / raw)
  To: Michael Margolin
  Cc: linux-rdma, jgg, leon, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

Sun, Apr 12, 2026 at 02:33:22PM +0200, mrgolin@amazon.com wrote:
>On Sat, Apr 11, 2026 at 04:49:01PM +0200, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@nvidia.com>
>> 
>> Add a unified mechanism for userspace to pass memory buffers to any
>> uverbs command via a single UVERBS_ATTR_BUFFERS attribute. Each
>> buffer is described by struct ib_uverbs_buffer_desc with a type
>> discriminator supporting dma-buf and user VA backed memory, extensible
>> for future buffer types.
>> 
>> The ib_umem_list API enables any uverbs command to accept multiple
>> buffers indexed by per-command slot enums, without requiring new UAPI
>> attributes for each buffer. A consumption check ensures userspace and
>> driver agree on which buffers are used.
>> 
>> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
>> ---
>>  drivers/infiniband/core/umem.c          | 248 ++++++++++++++++++++++++
>>  include/rdma/ib_umem.h                  |  54 ++++++
>>  include/rdma/uverbs_ioctl.h             |  14 ++
>>  include/uapi/rdma/ib_user_ioctl_cmds.h  |   1 +
>>  include/uapi/rdma/ib_user_ioctl_verbs.h |  27 +++
>>  5 files changed, 344 insertions(+)
>> 
>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>> index 786fa1aa8e55..f5b03e903b9d 100644
>> --- a/drivers/infiniband/core/umem.c
>> +++ b/drivers/infiniband/core/umem.c
>> @@ -37,6 +37,7 @@
>>  #include <linux/dma-mapping.h>
>>  #include <linux/sched/signal.h>
>>  #include <linux/sched/mm.h>
>> +#include <linux/err.h>
>>  #include <linux/export.h>
>>  #include <linux/slab.h>
>>  #include <linux/pagemap.h>
>> @@ -332,3 +333,250 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
>>  		return 0;
>>  }
>>  EXPORT_SYMBOL(ib_umem_copy_from);
>> +
>> +struct ib_umem_list {
>> +	unsigned int count; /* Total slots in the list. */
>> +	unsigned long provided; /* Bitmask of slots provided by the user. */
>> +	unsigned long loaded; /* Bitmask of slots loaded by the driver. */
>> +	struct ib_umem *umems[] __counted_by(count);
>> +};
>> +
>> +/**
>> + * ib_umem_list_create - Create a umem list from UVERBS_ATTR_BUFFERS
>> + * @device: IB device
>> + * @attrs: uverbs attribute bundle
>> + * @slot_max: highest buffer slot index (count = slot_max + 1)
>> + *
>> + * Return: umem list, or ERR_PTR on failure.
>> + */
>> +struct ib_umem_list *ib_umem_list_create(struct ib_device *device,
>> +					 const struct uverbs_attr_bundle *attrs,
>> +					 unsigned int slot_max)
>> +{
>> +	const struct ib_uverbs_buffer_desc *descs;
>> +	struct ib_umem_dmabuf *umem_dmabuf;
>> +	struct ib_umem_list *list;
>> +	struct ib_umem *umem;
>> +	unsigned int count;
>> +	int num_descs;
>> +	int err;
>> +	int i;
>> +
>> +	if (WARN_ON_ONCE(slot_max >= BITS_PER_LONG))
>> +		return ERR_PTR(-EINVAL);
>> +	count = slot_max + 1;
>> +
>> +	num_descs = uverbs_attr_ptr_get_array_size(
>> +		(struct uverbs_attr_bundle *)attrs, UVERBS_ATTR_BUFFERS,
>> +		sizeof(*descs));
>> +	if (num_descs == -ENOENT) {
>> +		num_descs = 0;
>> +		descs = NULL;
>> +	} else if (num_descs < 0) {
>> +		return ERR_PTR(num_descs);
>> +	} else if (num_descs > count) {
>> +		return ERR_PTR(-EINVAL);
>> +	} else {
>> +		descs = uverbs_attr_get_alloced_ptr(attrs, UVERBS_ATTR_BUFFERS);
>> +		if (IS_ERR(descs))
>> +			return ERR_CAST(descs);
>> +	}
>> +
>> +	list = kzalloc(struct_size(list, umems, count), GFP_KERNEL);
>> +	if (!list)
>> +		return ERR_PTR(-ENOMEM);
>> +	list->count = count;
>> +
>> +	for (i = 0; i < num_descs; i++) {
>
>While I like the idea of standardizing the way we pass buffer
>information to the kernel, the list looks like an over-generalization
>to me, especially after Leon's refactoring of CQ creation. Maybe we can
>add a buffer as a new attribute type that can be used for multiple
>parameters in a command, and have a helper with the code below that
>takes an attribute ID and returns a umem object, letting each handler
>store it. This would also make it easier for drivers to pass their
>private buffers using this infrastructure.

Currently we have a set of four attrs to pass the CQ umem. I tried to
make this smooth for all possible uverbs by passing a single attr that
holds an array of structs, each describing a buffer. The uverbs attr
API already knows how to work with arrays, so it all fits together.

Drivers can easily pass their specific buffers over this list too. I
didn't implement that as there was no need yet, but the idea is to
reserve index > X for driver-specific indexes.

What's the benefit of passing per-uverb attrs with a struct? Perhaps I'm
missing something.
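
To make the list approach concrete, here is a userspace model of the
slot bookkeeping done by ib_umem_list_create(), ib_umem_list_load() and
ib_umem_list_check_consumed() in the patch. It is a sketch, not kernel
code: all model_* names are hypothetical, lengths stand in for pinned
umems, and MODEL_CORE_SLOT_LAST is an assumed value for the "X" above
which indexes would be driver-specific (the patch does not define one).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of the slot bookkeeping; not kernel code. */
enum model_buf_type { MODEL_BUF_DMABUF, MODEL_BUF_VA };

/* Same field layout as struct ib_uverbs_buffer_desc in the patch. */
struct model_buffer_desc {
	int32_t fd;
	uint32_t type;
	uint32_t index;
	uint32_t reserved;
	uint64_t addr;
	uint64_t length;
};

#define MODEL_MAX_SLOTS 64
/* Assumption: slots above this value ("index > X") belong to the driver. */
#define MODEL_CORE_SLOT_LAST 15

struct model_umem_list {
	unsigned int count;                /* total slots */
	uint64_t provided;                 /* slots supplied by userspace */
	uint64_t loaded;                   /* slots consumed by the handler */
	uint64_t lengths[MODEL_MAX_SLOTS]; /* stand-in for pinned umems */
};

/* Validate the descriptor array and record which slots were provided. */
int model_list_create(struct model_umem_list *list, unsigned int slot_max,
		      const struct model_buffer_desc *descs,
		      unsigned int num_descs)
{
	unsigned int i;

	if (slot_max >= MODEL_MAX_SLOTS || num_descs > slot_max + 1)
		return -EINVAL;
	*list = (struct model_umem_list){ .count = slot_max + 1 };

	for (i = 0; i < num_descs; i++) {
		uint32_t idx = descs[i].index;

		if (descs[i].reserved)
			return -EINVAL;
		/* Reject out-of-range and duplicate slot indexes. */
		if (idx >= list->count ||
		    (list->provided & (UINT64_C(1) << idx)))
			return -EINVAL;
		if (descs[i].type != MODEL_BUF_DMABUF &&
		    descs[i].type != MODEL_BUF_VA)
			return -EINVAL;
		list->lengths[idx] = descs[i].length;
		list->provided |= UINT64_C(1) << idx;
	}
	return 0;
}

/* Returns 1 if the slot was loaded, 0 if empty, -EINVAL if too small. */
int model_list_load(struct model_umem_list *list, unsigned int index,
		    uint64_t size)
{
	if (index >= list->count ||
	    !(list->provided & (UINT64_C(1) << index)))
		return 0;
	if (list->lengths[index] < size)
		return -EINVAL;
	list->loaded |= UINT64_C(1) << index;
	return 1;
}

/* 0 if every provided slot was loaded, -EINVAL otherwise. */
int model_list_check_consumed(const struct model_umem_list *list)
{
	return (list->provided & ~list->loaded) == 0 ? 0 : -EINVAL;
}

/* Assumed partition of the index space between core and driver. */
int model_slot_is_driver_private(unsigned int index)
{
	return index > MODEL_CORE_SLOT_LAST;
}
```

With this layout userspace fills one array for the whole command, and a
buffer that neither the core handler nor the driver loads is caught by
the consumption check instead of being silently ignored.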





>
>Michael
>
>> +		unsigned int idx = descs[i].index;
>> +
>> +		if (descs[i].reserved) {
>> +			err = -EINVAL;
>> +			goto err_release;
>> +		}
>> +		if (idx >= count || (list->provided & BIT(idx))) {
>> +			err = -EINVAL;
>> +			goto err_release;
>> +		}
>> +
>> +		switch (descs[i].type) {
>> +		case IB_UVERBS_BUFFER_TYPE_DMABUF:
>> +			umem_dmabuf = ib_umem_dmabuf_get_pinned(
>> +				device, descs[i].addr, descs[i].length,
>> +				descs[i].fd, IB_ACCESS_LOCAL_WRITE);
>> +			if (IS_ERR(umem_dmabuf)) {
>> +				err = PTR_ERR(umem_dmabuf);
>> +				goto err_release;
>> +			}
>> +			list->umems[idx] = &umem_dmabuf->umem;
>> +			break;
>> +		case IB_UVERBS_BUFFER_TYPE_VA:
>> +			umem = ib_umem_get(device, descs[i].addr,
>> +					   descs[i].length, IB_ACCESS_LOCAL_WRITE);
>> +			if (IS_ERR(umem)) {
>> +				err = PTR_ERR(umem);
>> +				goto err_release;
>> +			}
>> +			list->umems[idx] = umem;
>> +			break;
>> +		default:
>> +			err = -EINVAL;
>> +			goto err_release;
>> +		}
>> +		list->provided |= BIT(idx);
>> +	}
>> +
>> +	return list;
>> +
>> +err_release:
>> +	ib_umem_list_release(list);
>> +	return ERR_PTR(err);
>> +}
>> +EXPORT_SYMBOL(ib_umem_list_create);
>> +
>> +/**
>> + * ib_umem_list_release - Release all umems in the list and free it
>> + * @list: umem list
>> + */
>> +void ib_umem_list_release(struct ib_umem_list *list)
>> +{
>> +	int i;
>> +
>> +	if (!list)
>> +		return;
>> +	for (i = 0; i < list->count; i++)
>> +		ib_umem_release(list->umems[i]);
>> +	kfree(list);
>> +}
>> +EXPORT_SYMBOL(ib_umem_list_release);
>> +
>> +/**
>> + * ib_umem_list_check_consumed - Verify all provided umems were loaded
>> + * @list: umem list
>> + *
>> + * Return: 0 if all provided slots were loaded, -EINVAL otherwise.
>> + */
>> +int ib_umem_list_check_consumed(const struct ib_umem_list *list)
>> +{
>> +	return (list->provided & ~list->loaded) == 0 ? 0 : -EINVAL;
>> +}
>> +EXPORT_SYMBOL(ib_umem_list_check_consumed);
>> +
>> +/**
>> + * ib_umem_list_insert - Insert a umem into the list at a given index
>> + * @list: umem list
>> + * @index: per-command buffer slot index
>> + * @umem: umem pointer to store
>> + *
>> + * Stores @umem at @index (replacing any existing). For use from create_cq
>> + * when the buffer comes from legacy ATTRs rather than the buffer list.
>> + */
>> +void ib_umem_list_insert(struct ib_umem_list *list, unsigned int index,
>> +			 struct ib_umem *umem)
>> +{
>> +	ib_umem_list_replace(list, index, umem);
>> +	if (umem)
>> +		list->provided |= BIT(index);
>> +}
>> +EXPORT_SYMBOL(ib_umem_list_insert);
>> +
>> +/**
>> + * ib_umem_list_load - Load a umem from the list by index
>> + * @list: umem list (may be NULL)
>> + * @index: per-command buffer slot index
>> + * @size: minimum required umem length
>> + *
>> + * Return: umem pointer, or NULL if the slot is empty or
>> + * the slot is out of bounds, or ERR_PTR(-EINVAL) if the umem is too small.
>> + */
>> +struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
>> +				 unsigned int index, size_t size)
>> +{
>> +	struct ib_umem *umem;
>> +
>> +	if (!list || index >= list->count)
>> +		return NULL;
>> +	umem = list->umems[index];
>> +	if (!umem)
>> +		return NULL;
>> +	if (umem->length < size)
>> +		return ERR_PTR(-EINVAL);
>> +	list->loaded |= BIT(index);
>> +	return umem;
>> +}
>> +EXPORT_SYMBOL(ib_umem_list_load);
>> +
>> +/**
>> + * ib_umem_list_load_or_get - Umem from list or pin user memory
>> + * @list: umem list (may be NULL)
>> + * @index: per-command buffer slot index
>> + * @device: IB device for ib_umem_get when the list slot is empty
>> + * @addr: user virtual address for ib_umem_get
>> + * @size: length for ib_umem_get
>> + * @access: access flags for ib_umem_get
>> + *
>> + * If @list has a umem at @index, returns it like ib_umem_list_load() (and
>> + * marks the slot loaded). Otherwise calls ib_umem_get() with the given
>> + * @access flags and on success stores the result at @index when
>> + * @list is non-NULL.
>> + *
>> + * Return: valid umem pointer, or ERR_PTR.
>> + */
>> +struct ib_umem *ib_umem_list_load_or_get(struct ib_umem_list *list,
>> +					 unsigned int index,
>> +					 struct ib_device *device,
>> +					 unsigned long addr, size_t size,
>> +					 int access)
>> +{
>> +	struct ib_umem *umem;
>> +
>> +	umem = ib_umem_list_load(list, index, size);
>> +	if (IS_ERR(umem) || umem)
>> +		return umem;
>> +	umem = ib_umem_get(device, addr, size, access);
>> +	if (IS_ERR(umem))
>> +		return umem;
>> +	if (list && index < list->count)
>> +		list->umems[index] = umem;
>> +	return umem;
>> +}
>> +EXPORT_SYMBOL(ib_umem_list_load_or_get);
>> +
>> +/**
>> + * ib_umem_list_replace - Replace umem at index, releasing the previous one
>> + * @list: umem list (may be NULL)
>> + * @index: per-command buffer slot index
>> + * @umem: new umem pointer (may be NULL to clear the slot)
>> + *
>> + * Stores @umem at @index. If a different umem was already stored there, it is
>> + * released. Used for CQ resize and similar.
>> + */
>> +void ib_umem_list_replace(struct ib_umem_list *list, unsigned int index,
>> +			  struct ib_umem *umem)
>> +{
>> +	struct ib_umem *old;
>> +
>> +	if (!list || index >= list->count)
>> +		return;
>> +	old = list->umems[index];
>> +	list->umems[index] = umem;
>> +	if (old && old != umem)
>> +		ib_umem_release(old);
>> +}
>> +EXPORT_SYMBOL(ib_umem_list_replace);
>> +
>> +/**
>> + * ib_umem_release_non_listed - Release a umem that is not stored in the list
>> + * @list: umem list
>> + * @index: per-command buffer slot index
>> + * @umem: umem pointer to release
>> + *
>> + * Releases @umem if it is not stored in @list.
>> + */
>> +void ib_umem_release_non_listed(struct ib_umem_list *list, unsigned int index,
>> +				struct ib_umem *umem)
>> +{
>> +	if (!list || index >= list->count || list->umems[index] != umem)
>> +		ib_umem_release(umem);
>> +}
>> +EXPORT_SYMBOL(ib_umem_release_non_listed);
>> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
>> index 2ad52cc1d52b..924acb8d08c3 100644
>> --- a/include/rdma/ib_umem.h
>> +++ b/include/rdma/ib_umem.h
>> @@ -11,6 +11,7 @@
>>  
>>  struct ib_device;
>>  struct dma_buf_attach_ops;
>> +struct uverbs_attr_bundle;
>>  
>>  struct ib_umem {
>>  	struct ib_device       *ibdev;
>> @@ -80,6 +81,36 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
>>  void ib_umem_release(struct ib_umem *umem);
>>  int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
>>  		      size_t length);
>> +
>> +/**
>> + * struct ib_umem_list - collection of pre-mapped umems
>> + *
>> + * Created from the UVERBS_ATTR_BUFFERS attribute. Each entry is indexed
>> + * by a per-command buffer slot enum (e.g., IB_UMEM_CQ_BUF for CQ CREATE).
>> + * Drivers use ib_umem_list_load() to retrieve a specific umem by index.
>> + */
>> +struct ib_umem_list;
>> +
>> +struct ib_umem_list *ib_umem_list_create(struct ib_device *device,
>> +					 const struct uverbs_attr_bundle *attrs,
>> +					 unsigned int slot_max);
>> +void ib_umem_list_release(struct ib_umem_list *list);
>> +int ib_umem_list_check_consumed(const struct ib_umem_list *list);
>> +void ib_umem_list_insert(struct ib_umem_list *list, unsigned int index,
>> +			 struct ib_umem *umem);
>> +
>> +struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
>> +				  unsigned int index, size_t size);
>> +struct ib_umem *ib_umem_list_load_or_get(struct ib_umem_list *list,
>> +					 unsigned int index,
>> +					 struct ib_device *device,
>> +					 unsigned long addr, size_t size,
>> +					 int access);
>> +void ib_umem_list_replace(struct ib_umem_list *list, unsigned int index,
>> +			  struct ib_umem *umem);
>> +void ib_umem_release_non_listed(struct ib_umem_list *list, unsigned int index,
>> +				struct ib_umem *umem);
>> +
>>  unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>>  				     unsigned long pgsz_bitmap,
>>  				     unsigned long virt);
>> @@ -230,5 +261,28 @@ static inline void ib_umem_dmabuf_revoke_lock(struct ib_umem_dmabuf *umem_dmabuf
>>  static inline void ib_umem_dmabuf_revoke_unlock(struct ib_umem_dmabuf *umem_dmabuf) {}
>>  static inline void ib_umem_dmabuf_revoke(struct ib_umem_dmabuf *umem_dmabuf) {}
>>  
>> +struct ib_umem_list;
>> +
>> +static inline void ib_umem_list_release(struct ib_umem_list *list) { }
>> +static inline struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
>> +						unsigned int index,
>> +						size_t size)
>> +{
>> +	return ERR_PTR(-EOPNOTSUPP);
>> +}
>> +static inline struct ib_umem *
>> +ib_umem_list_load_or_get(struct ib_umem_list *list, unsigned int index,
>> +			 struct ib_device *device, unsigned long addr,
>> +			 size_t size, int access)
>> +{
>> +	return ERR_PTR(-EOPNOTSUPP);
>> +}
>> +static inline void ib_umem_list_replace(struct ib_umem_list *list,
>> +					unsigned int index,
>> +					struct ib_umem *umem) { }
>> +static inline void ib_umem_release_non_listed(struct ib_umem_list *list,
>> +					      unsigned int index,
>> +					      struct ib_umem *umem) { }
>> +
>>  #endif /* CONFIG_INFINIBAND_USER_MEM */
>>  #endif /* IB_UMEM_H */
>> diff --git a/include/rdma/uverbs_ioctl.h b/include/rdma/uverbs_ioctl.h
>> index e2af17da3e32..05bcab27a87d 100644
>> --- a/include/rdma/uverbs_ioctl.h
>> +++ b/include/rdma/uverbs_ioctl.h
>> @@ -590,6 +590,20 @@ struct uapi_definition {
>>  			    UA_OPTIONAL,                                       \
>>  			    .is_udata = 1)
>>  
>> +/*
>> + * Optional array of struct ib_uverbs_buffer_desc describing memory regions
>> + * backed by dma-buf or user virtual address. Can be added to any method
>> + * that needs external buffer support.
>> + * Each entry carries an index field selecting the per-command buffer slot.
>> + * Use ib_umem_list_create() to map them and ib_umem_list_load() to access.
>> + */
>> +#define UVERBS_ATTR_BUFFERS()                                                  \
>> +	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_BUFFERS,                               \
>> +			   UVERBS_ATTR_MIN_SIZE(                               \
>> +				sizeof(struct ib_uverbs_buffer_desc)),         \
>> +			   UA_OPTIONAL,                                        \
>> +			   UA_ALLOC_AND_COPY)
>> +
>>  /* =================================================
>>   *              Parsing infrastructure
>>   * =================================================
>> diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
>> index 72041c1b0ea5..10aa6568abf1 100644
>> --- a/include/uapi/rdma/ib_user_ioctl_cmds.h
>> +++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
>> @@ -64,6 +64,7 @@ enum {
>>  	UVERBS_ATTR_UHW_IN = UVERBS_ID_DRIVER_NS,
>>  	UVERBS_ATTR_UHW_OUT,
>>  	UVERBS_ID_DRIVER_NS_WITH_UHW,
>> +	UVERBS_ATTR_BUFFERS,
>>  };
>>  
>>  enum uverbs_methods_device {
>> diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
>> index 90c5cd8e7753..41ed9f75b4de 100644
>> --- a/include/uapi/rdma/ib_user_ioctl_verbs.h
>> +++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
>> @@ -273,4 +273,31 @@ struct ib_uverbs_gid_entry {
>>  	__u32 netdev_ifindex; /* It is 0 if there is no netdev associated with it */
>>  };
>>  
>> +enum ib_uverbs_buffer_type {
>> +	IB_UVERBS_BUFFER_TYPE_DMABUF,
>> +	IB_UVERBS_BUFFER_TYPE_VA,
>> +};
>> +
>> +/*
>> + * Describes a single buffer backed by dma-buf or user virtual address.
>> + * Passed as an array via UVERBS_ATTR_BUFFERS. Each uverb command that
>> + * accepts this attribute defines its own per-command buffer slot enum.
>> + * The index field selects the buffer slot this descriptor maps to.
>> + *
>> + * @fd: dma-buf file descriptor (valid for IB_UVERBS_BUFFER_TYPE_DMABUF)
>> + * @type: buffer type from enum ib_uverbs_buffer_type
>> + * @index: per-command buffer slot index
>> + * @reserved: must be zero
>> + * @addr: offset within dma-buf, or user virtual address for VA
>> + * @length: buffer length in bytes
>> + */
>> +struct ib_uverbs_buffer_desc {
>> +	__s32 fd;
>> +	__u32 type;
>> +	__u32 index;
>> +	__u32 reserved;
>> +	__aligned_u64 addr;
>> +	__aligned_u64 length;
>> +};
>> +
>>  #endif
>> -- 
>> 2.53.0
>> 
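
For readers following the UAPI part of the quoted patch, here is a minimal userspace sketch of the descriptor layout and the per-descriptor checks done in the ib_umem_list_create() loop. The names buffer_desc, buffer_type and desc_validate are hypothetical stand-ins, not kernel symbols; the fixed-width types replace __s32/__u32/__aligned_u64, and the sketch only validates fields, it does not pin any memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for the quoted UAPI definitions. */
enum buffer_type { BUFFER_TYPE_DMABUF, BUFFER_TYPE_VA };

struct buffer_desc {
	int32_t fd;        /* dma-buf fd, only meaningful for DMABUF */
	uint32_t type;     /* enum buffer_type */
	uint32_t index;    /* per-command buffer slot */
	uint32_t reserved; /* must be zero */
	uint64_t addr __attribute__((aligned(8)));
	uint64_t length __attribute__((aligned(8)));
};

/* Four 4-byte fields followed by two 8-byte fields: no implicit padding. */
static_assert(sizeof(struct buffer_desc) == 32, "unexpected size");
static_assert(offsetof(struct buffer_desc, addr) == 16, "unexpected offset");

/*
 * Field checks in the same order as the quoted loop: reserved must be
 * zero, the slot must be in range and not already provided, and the
 * type must be known. On success the slot bit is set in *provided.
 * Assumes count does not exceed the number of bits in unsigned long.
 * Returns 0 on success, -1 where the kernel code would use -EINVAL.
 */
static int desc_validate(const struct buffer_desc *d, unsigned int count,
			 unsigned long *provided)
{
	if (d->reserved)
		return -1;
	if (d->index >= count || (*provided & (1UL << d->index)))
		return -1;
	if (d->type != BUFFER_TYPE_DMABUF && d->type != BUFFER_TYPE_VA)
		return -1;
	*provided |= 1UL << d->index;
	return 0;
}
```

A second descriptor naming an already provided slot is rejected, which is how the quoted code prevents two buffers from landing in the same slot.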

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-13  8:32     ` Jiri Pirko
@ 2026-04-13 16:02       ` Michael Margolin
  2026-04-13 18:22         ` Jiri Pirko
  0 siblings, 1 reply; 81+ messages in thread
From: Michael Margolin @ 2026-04-13 16:02 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: linux-rdma, jgg, leon, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

On Mon, Apr 13, 2026 at 10:32:15AM +0200, Jiri Pirko wrote:
> Sun, Apr 12, 2026 at 02:33:22PM +0200, mrgolin@amazon.com wrote:
> >On Sat, Apr 11, 2026 at 04:49:01PM +0200, Jiri Pirko wrote:
> >> From: Jiri Pirko <jiri@nvidia.com>
> >> 
> >> Add a unified mechanism for userspace to pass memory buffers to any
> >> uverbs command via a single UVERBS_ATTR_BUFFERS attribute. Each
> >> buffer is described by struct ib_uverbs_buffer_desc with a type
> >> discriminator supporting dma-buf and user VA backed memory, extensible
> >> for future buffer types.
> >> 
> >> The ib_umem_list API enables any uverbs command to accept multiple
> >> buffers indexed by per-command slot enums, without requiring new UAPI
> >> attributes for each buffer. A consumption check ensures userspace and
> >> driver agree on which buffers are used.
> >> 
> >> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
> >> ---
> >>  drivers/infiniband/core/umem.c          | 248 ++++++++++++++++++++++++
> >>  include/rdma/ib_umem.h                  |  54 ++++++
> >>  include/rdma/uverbs_ioctl.h             |  14 ++
> >>  include/uapi/rdma/ib_user_ioctl_cmds.h  |   1 +
> >>  include/uapi/rdma/ib_user_ioctl_verbs.h |  27 +++
> >>  5 files changed, 344 insertions(+)
> >> 
> >> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> >> index 786fa1aa8e55..f5b03e903b9d 100644
> >> --- a/drivers/infiniband/core/umem.c
> >> +++ b/drivers/infiniband/core/umem.c
> >> @@ -37,6 +37,7 @@
> >>  #include <linux/dma-mapping.h>
> >>  #include <linux/sched/signal.h>
> >>  #include <linux/sched/mm.h>
> >> +#include <linux/err.h>
> >>  #include <linux/export.h>
> >>  #include <linux/slab.h>
> >>  #include <linux/pagemap.h>
> >> @@ -332,3 +333,250 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
> >>  		return 0;
> >>  }
> >>  EXPORT_SYMBOL(ib_umem_copy_from);
> >> +
> >> +struct ib_umem_list {
> >> +	unsigned int count; /* Total slots in the list. */
> >> +	unsigned long provided; /* Bitmask of slots provided by the user. */
> >> +	unsigned long loaded; /* Bitmask of slots loaded by the driver. */
> >> +	struct ib_umem *umems[] __counted_by(count);
> >> +};
> >> +
> >> +/**
> >> + * ib_umem_list_create - Create a umem list from UVERBS_ATTR_BUFFERS
> >> + * @device: IB device
> >> + * @attrs: uverbs attribute bundle
> >> + * @slot_max: highest buffer slot index (count = slot_max + 1)
> >> + *
> >> + * Return: umem list, or ERR_PTR on failure.
> >> + */
> >> +struct ib_umem_list *ib_umem_list_create(struct ib_device *device,
> >> +					 const struct uverbs_attr_bundle *attrs,
> >> +					 unsigned int slot_max)
> >> +{
> >> +	const struct ib_uverbs_buffer_desc *descs;
> >> +	struct ib_umem_dmabuf *umem_dmabuf;
> >> +	struct ib_umem_list *list;
> >> +	struct ib_umem *umem;
> >> +	unsigned int count;
> >> +	int num_descs;
> >> +	int err;
> >> +	int i;
> >> +
> >> +	if (WARN_ON_ONCE(slot_max >= BITS_PER_LONG))
> >> +		return ERR_PTR(-EINVAL);
> >> +	count = slot_max + 1;
> >> +
> >> +	num_descs = uverbs_attr_ptr_get_array_size(
> >> +		(struct uverbs_attr_bundle *)attrs, UVERBS_ATTR_BUFFERS,
> >> +		sizeof(*descs));
> >> +	if (num_descs == -ENOENT) {
> >> +		num_descs = 0;
> >> +		descs = NULL;
> >> +	} else if (num_descs < 0) {
> >> +		return ERR_PTR(num_descs);
> >> +	} else if (num_descs > count) {
> >> +		return ERR_PTR(-EINVAL);
> >> +	} else {
> >> +		descs = uverbs_attr_get_alloced_ptr(attrs, UVERBS_ATTR_BUFFERS);
> >> +		if (IS_ERR(descs))
> >> +			return ERR_CAST(descs);
> >> +	}
> >> +
> >> +	list = kzalloc(struct_size(list, umems, count), GFP_KERNEL);
> >> +	if (!list)
> >> +		return ERR_PTR(-ENOMEM);
> >> +	list->count = count;
> >> +
> >> +	for (i = 0; i < num_descs; i++) {
> >
> >While I like the idea of standardizing the way we pass buffer
> >information to the kernel, the list thing looks like over generalization
> >to me, especially after Leon's refactoring of CQ creation. Maybe we can
> >add buffer as a new attribute type that can be used for multiple
> >parameters in a command, and have a helper with the code below that
> >takes an attribute id and returns a umem object, letting each handler
> >store it. This would also make it easier for drivers to pass their
> >private buffers using this infrastructure.
> 
> Currently we have set of attrs (4) to pass CQ umem. I tried to make this
> very smooth for all possible uverbs, passing single attr of array of
> structs describing a buffer. Uverb attr api knows how to work with
> arrays, all clicks.
> 
> Drivers can easily pass their specific buffers over this list too. I
> didn't implement it as there was no need, but the idea is to have index>X
> for driver specific indexes.

Why do we need to invent a new way instead of just adding another
argument in a command, that consists of all the info needed to pass a
buffer? Also how can this work for objects that have only private umem?

> What's the benefit of passing per-uverb attrs with a struct? Perhaps I'm
> missing something.

Mostly simplification by untying two unrelated things:
1) way of passing args to kernel
2) object lifetime management

And also significantly reducing the amount of code changes required to
achieve this.

Michael

> >
> >> +		unsigned int idx = descs[i].index;
> >> +
> >> +		if (descs[i].reserved) {
> >> +			err = -EINVAL;
> >> +			goto err_release;
> >> +		}
> >> +		if (idx >= count || (list->provided & BIT(idx))) {
> >> +			err = -EINVAL;
> >> +			goto err_release;
> >> +		}
> >> +
> >> +		switch (descs[i].type) {
> >> +		case IB_UVERBS_BUFFER_TYPE_DMABUF:
> >> +			umem_dmabuf = ib_umem_dmabuf_get_pinned(
> >> +				device, descs[i].addr, descs[i].length,
> >> +				descs[i].fd, IB_ACCESS_LOCAL_WRITE);
> >> +			if (IS_ERR(umem_dmabuf)) {
> >> +				err = PTR_ERR(umem_dmabuf);
> >> +				goto err_release;
> >> +			}
> >> +			list->umems[idx] = &umem_dmabuf->umem;
> >> +			break;
> >> +		case IB_UVERBS_BUFFER_TYPE_VA:
> >> +			umem = ib_umem_get(device, descs[i].addr,
> >> +					   descs[i].length, IB_ACCESS_LOCAL_WRITE);
> >> +			if (IS_ERR(umem)) {
> >> +				err = PTR_ERR(umem);
> >> +				goto err_release;
> >> +			}
> >> +			list->umems[idx] = umem;
> >> +			break;
> >> +		default:
> >> +			err = -EINVAL;
> >> +			goto err_release;
> >> +		}
> >> +		list->provided |= BIT(idx);
> >> +	}
> >> +
> >> +	return list;
> >> +
> >> +err_release:
> >> +	ib_umem_list_release(list);
> >> +	return ERR_PTR(err);
> >> +}
> >> +EXPORT_SYMBOL(ib_umem_list_create);
> >> +
> >> +/**
> >> + * ib_umem_list_release - Release all umems in the list and free it
> >> + * @list: umem list
> >> + */
> >> +void ib_umem_list_release(struct ib_umem_list *list)
> >> +{
> >> +	int i;
> >> +
> >> +	if (!list)
> >> +		return;
> >> +	for (i = 0; i < list->count; i++)
> >> +		ib_umem_release(list->umems[i]);
> >> +	kfree(list);
> >> +}
> >> +EXPORT_SYMBOL(ib_umem_list_release);
> >> +
> >> +/**
> >> + * ib_umem_list_check_consumed - Verify all provided umems were loaded
> >> + * @list: umem list
> >> + *
> >> + * Return: 0 if all provided slots were loaded, -EINVAL otherwise.
> >> + */
> >> +int ib_umem_list_check_consumed(const struct ib_umem_list *list)
> >> +{
> >> +	return (list->provided & ~list->loaded) == 0 ? 0 : -EINVAL;
> >> +}
> >> +EXPORT_SYMBOL(ib_umem_list_check_consumed);
> >> +
> >> +/**
> >> + * ib_umem_list_insert - Insert a umem into the list at a given index
> >> + * @list: umem list
> >> + * @index: per-command buffer slot index
> >> + * @umem: umem pointer to store
> >> + *
> >> + * Stores @umem at @index (replacing any existing). For use from create_cq
> >> + * when the buffer comes from legacy ATTRs rather than the buffer list.
> >> + */
> >> +void ib_umem_list_insert(struct ib_umem_list *list, unsigned int index,
> >> +			 struct ib_umem *umem)
> >> +{
> >> +	ib_umem_list_replace(list, index, umem);
> >> +	if (umem)
> >> +		list->provided |= BIT(index);
> >> +}
> >> +EXPORT_SYMBOL(ib_umem_list_insert);
> >> +
> >> +/**
> >> + * ib_umem_list_load - Load a umem from the list by index
> >> + * @list: umem list (may be NULL)
> >> + * @index: per-command buffer slot index
> >> + * @size: minimum required umem length
> >> + *
> >> + * Return: umem pointer, or NULL if the slot is empty or
> >> + * the slot is out of bounds, or ERR_PTR(-EINVAL) if the umem is too small.
> >> + */
> >> +struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
> >> +				 unsigned int index, size_t size)
> >> +{
> >> +	struct ib_umem *umem;
> >> +
> >> +	if (!list || index >= list->count)
> >> +		return NULL;
> >> +	umem = list->umems[index];
> >> +	if (!umem)
> >> +		return NULL;
> >> +	if (umem->length < size)
> >> +		return ERR_PTR(-EINVAL);
> >> +	list->loaded |= BIT(index);
> >> +	return umem;
> >> +}
> >> +EXPORT_SYMBOL(ib_umem_list_load);
> >> +
> >> +/**
> >> + * ib_umem_list_load_or_get - Umem from list or pin user memory
> >> + * @list: umem list (may be NULL)
> >> + * @index: per-command buffer slot index
> >> + * @device: IB device for ib_umem_get when the list slot is empty
> >> + * @addr: user virtual address for ib_umem_get
> >> + * @size: length for ib_umem_get
> >> + * @access: access flags for ib_umem_get
> >> + *
> >> + * If @list has a umem at @index, returns it like ib_umem_list_load() (and
> >> + * marks the slot loaded). Otherwise calls ib_umem_get() with the given
> >> + * @access flags and on success stores the result at @index when
> >> + * @list is non-NULL.
> >> + *
> >> + * Return: valid umem pointer, or ERR_PTR.
> >> + */
> >> +struct ib_umem *ib_umem_list_load_or_get(struct ib_umem_list *list,
> >> +					 unsigned int index,
> >> +					 struct ib_device *device,
> >> +					 unsigned long addr, size_t size,
> >> +					 int access)
> >> +{
> >> +	struct ib_umem *umem;
> >> +
> >> +	umem = ib_umem_list_load(list, index, size);
> >> +	if (IS_ERR(umem) || umem)
> >> +		return umem;
> >> +	umem = ib_umem_get(device, addr, size, access);
> >> +	if (IS_ERR(umem))
> >> +		return umem;
> >> +	if (list && index < list->count)
> >> +		list->umems[index] = umem;
> >> +	return umem;
> >> +}
> >> +EXPORT_SYMBOL(ib_umem_list_load_or_get);
> >> +
> >> +/**
> >> + * ib_umem_list_replace - Replace umem at index, releasing the previous one
> >> + * @list: umem list (may be NULL)
> >> + * @index: per-command buffer slot index
> >> + * @umem: new umem pointer (may be NULL to clear the slot)
> >> + *
> >> + * Stores @umem at @index. If a different umem was already stored there, it is
> >> + * released. Used for CQ resize and similar.
> >> + */
> >> +void ib_umem_list_replace(struct ib_umem_list *list, unsigned int index,
> >> +			  struct ib_umem *umem)
> >> +{
> >> +	struct ib_umem *old;
> >> +
> >> +	if (!list || index >= list->count)
> >> +		return;
> >> +	old = list->umems[index];
> >> +	list->umems[index] = umem;
> >> +	if (old && old != umem)
> >> +		ib_umem_release(old);
> >> +}
> >> +EXPORT_SYMBOL(ib_umem_list_replace);
> >> +
> >> +/**
> >> + * ib_umem_release_non_listed - Release a umem that is not stored in the list
> >> + * @list: umem list
> >> + * @index: per-command buffer slot index
> >> + * @umem: umem pointer to release
> >> + *
> >> + * Releases @umem if it is not stored in @list.
> >> + */
> >> +void ib_umem_release_non_listed(struct ib_umem_list *list, unsigned int index,
> >> +				struct ib_umem *umem)
> >> +{
> >> +	if (!list || index >= list->count || list->umems[index] != umem)
> >> +		ib_umem_release(umem);
> >> +}
> >> +EXPORT_SYMBOL(ib_umem_release_non_listed);
> >> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> >> index 2ad52cc1d52b..924acb8d08c3 100644
> >> --- a/include/rdma/ib_umem.h
> >> +++ b/include/rdma/ib_umem.h
> >> @@ -11,6 +11,7 @@
> >>  
> >>  struct ib_device;
> >>  struct dma_buf_attach_ops;
> >> +struct uverbs_attr_bundle;
> >>  
> >>  struct ib_umem {
> >>  	struct ib_device       *ibdev;
> >> @@ -80,6 +81,36 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
> >>  void ib_umem_release(struct ib_umem *umem);
> >>  int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
> >>  		      size_t length);
> >> +
> >> +/**
> >> + * struct ib_umem_list - collection of pre-mapped umems
> >> + *
> >> + * Created from the UVERBS_ATTR_BUFFERS attribute. Each entry is indexed
> >> + * by a per-command buffer slot enum (e.g., IB_UMEM_CQ_BUF for CQ CREATE).
> >> + * Drivers use ib_umem_list_load() to retrieve a specific umem by index.
> >> + */
> >> +struct ib_umem_list;
> >> +
> >> +struct ib_umem_list *ib_umem_list_create(struct ib_device *device,
> >> +					 const struct uverbs_attr_bundle *attrs,
> >> +					 unsigned int slot_max);
> >> +void ib_umem_list_release(struct ib_umem_list *list);
> >> +int ib_umem_list_check_consumed(const struct ib_umem_list *list);
> >> +void ib_umem_list_insert(struct ib_umem_list *list, unsigned int index,
> >> +			 struct ib_umem *umem);
> >> +
> >> +struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
> >> +				  unsigned int index, size_t size);
> >> +struct ib_umem *ib_umem_list_load_or_get(struct ib_umem_list *list,
> >> +					 unsigned int index,
> >> +					 struct ib_device *device,
> >> +					 unsigned long addr, size_t size,
> >> +					 int access);
> >> +void ib_umem_list_replace(struct ib_umem_list *list, unsigned int index,
> >> +			  struct ib_umem *umem);
> >> +void ib_umem_release_non_listed(struct ib_umem_list *list, unsigned int index,
> >> +				struct ib_umem *umem);
> >> +
> >>  unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> >>  				     unsigned long pgsz_bitmap,
> >>  				     unsigned long virt);
> >> @@ -230,5 +261,28 @@ static inline void ib_umem_dmabuf_revoke_lock(struct ib_umem_dmabuf *umem_dmabuf
> >>  static inline void ib_umem_dmabuf_revoke_unlock(struct ib_umem_dmabuf *umem_dmabuf) {}
> >>  static inline void ib_umem_dmabuf_revoke(struct ib_umem_dmabuf *umem_dmabuf) {}
> >>  
> >> +struct ib_umem_list;
> >> +
> >> +static inline void ib_umem_list_release(struct ib_umem_list *list) { }
> >> +static inline struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
> >> +						unsigned int index,
> >> +						size_t size)
> >> +{
> >> +	return ERR_PTR(-EOPNOTSUPP);
> >> +}
> >> +static inline struct ib_umem *
> >> +ib_umem_list_load_or_get(struct ib_umem_list *list, unsigned int index,
> >> +			 struct ib_device *device, unsigned long addr,
> >> +			 size_t size, int access)
> >> +{
> >> +	return ERR_PTR(-EOPNOTSUPP);
> >> +}
> >> +static inline void ib_umem_list_replace(struct ib_umem_list *list,
> >> +					unsigned int index,
> >> +					struct ib_umem *umem) { }
> >> +static inline void ib_umem_release_non_listed(struct ib_umem_list *list,
> >> +					      unsigned int index,
> >> +					      struct ib_umem *umem) { }
> >> +
> >>  #endif /* CONFIG_INFINIBAND_USER_MEM */
> >>  #endif /* IB_UMEM_H */
> >> diff --git a/include/rdma/uverbs_ioctl.h b/include/rdma/uverbs_ioctl.h
> >> index e2af17da3e32..05bcab27a87d 100644
> >> --- a/include/rdma/uverbs_ioctl.h
> >> +++ b/include/rdma/uverbs_ioctl.h
> >> @@ -590,6 +590,20 @@ struct uapi_definition {
> >>  			    UA_OPTIONAL,                                       \
> >>  			    .is_udata = 1)
> >>  
> >> +/*
> >> + * Optional array of struct ib_uverbs_buffer_desc describing memory regions
> >> + * backed by dma-buf or user virtual address. Can be added to any method
> >> + * that needs external buffer support.
> >> + * Each entry carries an index field selecting the per-command buffer slot.
> >> + * Use ib_umem_list_create() to map them and ib_umem_list_load() to access.
> >> + */
> >> +#define UVERBS_ATTR_BUFFERS()                                                  \
> >> +	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_BUFFERS,                               \
> >> +			   UVERBS_ATTR_MIN_SIZE(                               \
> >> +				sizeof(struct ib_uverbs_buffer_desc)),         \
> >> +			   UA_OPTIONAL,                                        \
> >> +			   UA_ALLOC_AND_COPY)
> >> +
> >>  /* =================================================
> >>   *              Parsing infrastructure
> >>   * =================================================
> >> diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
> >> index 72041c1b0ea5..10aa6568abf1 100644
> >> --- a/include/uapi/rdma/ib_user_ioctl_cmds.h
> >> +++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
> >> @@ -64,6 +64,7 @@ enum {
> >>  	UVERBS_ATTR_UHW_IN = UVERBS_ID_DRIVER_NS,
> >>  	UVERBS_ATTR_UHW_OUT,
> >>  	UVERBS_ID_DRIVER_NS_WITH_UHW,
> >> +	UVERBS_ATTR_BUFFERS,
> >>  };
> >>  
> >>  enum uverbs_methods_device {
> >> diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
> >> index 90c5cd8e7753..41ed9f75b4de 100644
> >> --- a/include/uapi/rdma/ib_user_ioctl_verbs.h
> >> +++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
> >> @@ -273,4 +273,31 @@ struct ib_uverbs_gid_entry {
> >>  	__u32 netdev_ifindex; /* It is 0 if there is no netdev associated with it */
> >>  };
> >>  
> >> +enum ib_uverbs_buffer_type {
> >> +	IB_UVERBS_BUFFER_TYPE_DMABUF,
> >> +	IB_UVERBS_BUFFER_TYPE_VA,
> >> +};
> >> +
> >> +/*
> >> + * Describes a single buffer backed by dma-buf or user virtual address.
> >> + * Passed as an array via UVERBS_ATTR_BUFFERS. Each uverb command that
> >> + * accepts this attribute defines its own per-command buffer slot enum.
> >> + * The index field selects the buffer slot this descriptor maps to.
> >> + *
> >> + * @fd: dma-buf file descriptor (valid for IB_UVERBS_BUFFER_TYPE_DMABUF)
> >> + * @type: buffer type from enum ib_uverbs_buffer_type
> >> + * @index: per-command buffer slot index
> >> + * @reserved: must be zero
> >> + * @addr: offset within dma-buf, or user virtual address for VA
> >> + * @length: buffer length in bytes
> >> + */
> >> +struct ib_uverbs_buffer_desc {
> >> +	__s32 fd;
> >> +	__u32 type;
> >> +	__u32 index;
> >> +	__u32 reserved;
> >> +	__aligned_u64 addr;
> >> +	__aligned_u64 length;
> >> +};
> >> +
> >>  #endif
> >> -- 
> >> 2.53.0
> >> 
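
Since the discussion above turns on object lifetime versus argument passing, a small userspace model of the provided/loaded bookkeeping from the quoted patch may help. This is only a sketch: mock_umem, mock_umem_list, mock_list_load and mock_list_check_consumed are hypothetical names, the slot count is fixed at 4 for illustration, and the int return plus out-parameter stands in for the kernel's NULL / ERR_PTR(-EINVAL) convention.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_COUNT 4 /* illustrative; the patch derives it from slot_max + 1 */

struct mock_umem {
	size_t length;
};

struct mock_umem_list {
	unsigned long provided; /* slots filled from user descriptors */
	unsigned long loaded;   /* slots the driver actually fetched */
	struct mock_umem *umems[SLOT_COUNT];
};

/*
 * Model of ib_umem_list_load(): an empty or out-of-range slot is not an
 * error (*out stays NULL, return 0); a too-small umem returns -1 in place
 * of ERR_PTR(-EINVAL); a successful load marks the slot as loaded.
 */
static int mock_list_load(struct mock_umem_list *list, unsigned int index,
			  size_t size, struct mock_umem **out)
{
	*out = NULL;
	if (!list || index >= SLOT_COUNT || !list->umems[index])
		return 0;
	if (list->umems[index]->length < size)
		return -1;
	list->loaded |= 1UL << index;
	*out = list->umems[index];
	return 0;
}

/* Model of ib_umem_list_check_consumed(): every provided slot must be loaded. */
static int mock_list_check_consumed(const struct mock_umem_list *list)
{
	return (list->provided & ~list->loaded) == 0 ? 0 : -1;
}
```

In this model a buffer that userspace supplied but the driver never loaded makes the consumption check fail, which is the "userspace and driver agree on which buffers are used" property from the commit message.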

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-13 16:02       ` Michael Margolin
@ 2026-04-13 18:22         ` Jiri Pirko
  2026-04-16 12:10           ` Michael Margolin
  0 siblings, 1 reply; 81+ messages in thread
From: Jiri Pirko @ 2026-04-13 18:22 UTC (permalink / raw)
  To: Michael Margolin
  Cc: linux-rdma, jgg, leon, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

Mon, Apr 13, 2026 at 06:02:32PM +0200, mrgolin@amazon.com wrote:
>On Mon, Apr 13, 2026 at 10:32:15AM +0200, Jiri Pirko wrote:
>> Sun, Apr 12, 2026 at 02:33:22PM +0200, mrgolin@amazon.com wrote:
>> >On Sat, Apr 11, 2026 at 04:49:01PM +0200, Jiri Pirko wrote:
>> >> From: Jiri Pirko <jiri@nvidia.com>
>> >> 
>> >> Add a unified mechanism for userspace to pass memory buffers to any
>> >> uverbs command via a single UVERBS_ATTR_BUFFERS attribute. Each
>> >> buffer is described by struct ib_uverbs_buffer_desc with a type
>> >> discriminator supporting dma-buf and user VA backed memory, extensible
>> >> for future buffer types.
>> >> 
>> >> The ib_umem_list API enables any uverbs command to accept multiple
>> >> buffers indexed by per-command slot enums, without requiring new UAPI
>> >> attributes for each buffer. A consumption check ensures userspace and
>> >> driver agree on which buffers are used.
>> >> 
>> >> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
>> >> ---
>> >>  drivers/infiniband/core/umem.c          | 248 ++++++++++++++++++++++++
>> >>  include/rdma/ib_umem.h                  |  54 ++++++
>> >>  include/rdma/uverbs_ioctl.h             |  14 ++
>> >>  include/uapi/rdma/ib_user_ioctl_cmds.h  |   1 +
>> >>  include/uapi/rdma/ib_user_ioctl_verbs.h |  27 +++
>> >>  5 files changed, 344 insertions(+)
>> >> 
>> >> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>> >> index 786fa1aa8e55..f5b03e903b9d 100644
>> >> --- a/drivers/infiniband/core/umem.c
>> >> +++ b/drivers/infiniband/core/umem.c
>> >> @@ -37,6 +37,7 @@
>> >>  #include <linux/dma-mapping.h>
>> >>  #include <linux/sched/signal.h>
>> >>  #include <linux/sched/mm.h>
>> >> +#include <linux/err.h>
>> >>  #include <linux/export.h>
>> >>  #include <linux/slab.h>
>> >>  #include <linux/pagemap.h>
>> >> @@ -332,3 +333,250 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
>> >>  		return 0;
>> >>  }
>> >>  EXPORT_SYMBOL(ib_umem_copy_from);
>> >> +
>> >> +struct ib_umem_list {
>> >> +	unsigned int count; /* Total slots in the list. */
>> >> +	unsigned long provided; /* Bitmask of slots provided by the user. */
>> >> +	unsigned long loaded; /* Bitmask of slots loaded by the driver. */
>> >> +	struct ib_umem *umems[] __counted_by(count);
>> >> +};
>> >> +
>> >> +/**
>> >> + * ib_umem_list_create - Create a umem list from UVERBS_ATTR_BUFFERS
>> >> + * @device: IB device
>> >> + * @attrs: uverbs attribute bundle
>> >> + * @slot_max: highest buffer slot index (count = slot_max + 1)
>> >> + *
>> >> + * Return: umem list, or ERR_PTR on failure.
>> >> + */
>> >> +struct ib_umem_list *ib_umem_list_create(struct ib_device *device,
>> >> +					 const struct uverbs_attr_bundle *attrs,
>> >> +					 unsigned int slot_max)
>> >> +{
>> >> +	const struct ib_uverbs_buffer_desc *descs;
>> >> +	struct ib_umem_dmabuf *umem_dmabuf;
>> >> +	struct ib_umem_list *list;
>> >> +	struct ib_umem *umem;
>> >> +	unsigned int count;
>> >> +	int num_descs;
>> >> +	int err;
>> >> +	int i;
>> >> +
>> >> +	if (WARN_ON_ONCE(slot_max >= BITS_PER_LONG))
>> >> +		return ERR_PTR(-EINVAL);
>> >> +	count = slot_max + 1;
>> >> +
>> >> +	num_descs = uverbs_attr_ptr_get_array_size(
>> >> +		(struct uverbs_attr_bundle *)attrs, UVERBS_ATTR_BUFFERS,
>> >> +		sizeof(*descs));
>> >> +	if (num_descs == -ENOENT) {
>> >> +		num_descs = 0;
>> >> +		descs = NULL;
>> >> +	} else if (num_descs < 0) {
>> >> +		return ERR_PTR(num_descs);
>> >> +	} else if (num_descs > count) {
>> >> +		return ERR_PTR(-EINVAL);
>> >> +	} else {
>> >> +		descs = uverbs_attr_get_alloced_ptr(attrs, UVERBS_ATTR_BUFFERS);
>> >> +		if (IS_ERR(descs))
>> >> +			return ERR_CAST(descs);
>> >> +	}
>> >> +
>> >> +	list = kzalloc(struct_size(list, umems, count), GFP_KERNEL);
>> >> +	if (!list)
>> >> +		return ERR_PTR(-ENOMEM);
>> >> +	list->count = count;
>> >> +
>> >> +	for (i = 0; i < num_descs; i++) {
>> >
>> >While I like the idea of standardizing the way we pass buffer
>> >information to the kernel, the list thing looks like over generalization
>> >to me, especially after Leon's refactoring of CQ creation. Maybe we can
>> >add buffer as a new attribute type that can be used for multiple
>> >parameters in a command, and have a helper with the code below that
>> >takes an attribute id and returns a umem object, letting each handler
>> >store it. This would also make it easier for drivers to pass their
>> >private buffers using this infrastructure.
>> 
>> Currently we have set of attrs (4) to pass CQ umem. I tried to make this
>> very smooth for all possible uverbs, passing single attr of array of
>> structs describing a buffer. Uverb attr api knows how to work with
>> arrays, all clicks.
>> 
>> Drivers can easily pass their specific buffers over this list too. I
>> didn't implement it as there was no need, but the idea is to have index>X
>> for driver specific indexes.
>
>Why do we need to invent a new way instead of just adding another
>argument in a command, that consists of all the info needed to pass a
>buffer? Also how can this work for objects that have only private umem?

You can put the buf array attr to any uverb, some may not have
"standard" indexes.


>
>> What's the benefit of passing per-uverb attrs with a struct? Perhaps I'm
>> missing something.
>
>Mostly simplification by untying two unrelated things:
>1) way of passing args to kernel
>2) object lifetime management

Could you be more specific please?


>
>And also significantly reducing the amount of code changes required to
>achieve this.

Significantly? I'm not sure I follow, but I guess that is related to my
previous question. I'm not sure I understand what you have exactly in
mind. Regarding UAPI, I think I understand, but regarding kernel
internals, I don't :(


>
>Michael
>
>> >
>> >> +		unsigned int idx = descs[i].index;
>> >> +
>> >> +		if (descs[i].reserved) {
>> >> +			err = -EINVAL;
>> >> +			goto err_release;
>> >> +		}
>> >> +		if (idx >= count || (list->provided & BIT(idx))) {
>> >> +			err = -EINVAL;
>> >> +			goto err_release;
>> >> +		}
>> >> +
>> >> +		switch (descs[i].type) {
>> >> +		case IB_UVERBS_BUFFER_TYPE_DMABUF:
>> >> +			umem_dmabuf = ib_umem_dmabuf_get_pinned(
>> >> +				device, descs[i].addr, descs[i].length,
>> >> +				descs[i].fd, IB_ACCESS_LOCAL_WRITE);
>> >> +			if (IS_ERR(umem_dmabuf)) {
>> >> +				err = PTR_ERR(umem_dmabuf);
>> >> +				goto err_release;
>> >> +			}
>> >> +			list->umems[idx] = &umem_dmabuf->umem;
>> >> +			break;
>> >> +		case IB_UVERBS_BUFFER_TYPE_VA:
>> >> +			umem = ib_umem_get(device, descs[i].addr,
>> >> +					   descs[i].length, IB_ACCESS_LOCAL_WRITE);
>> >> +			if (IS_ERR(umem)) {
>> >> +				err = PTR_ERR(umem);
>> >> +				goto err_release;
>> >> +			}
>> >> +			list->umems[idx] = umem;
>> >> +			break;
>> >> +		default:
>> >> +			err = -EINVAL;
>> >> +			goto err_release;
>> >> +		}
>> >> +		list->provided |= BIT(idx);
>> >> +	}
>> >> +
>> >> +	return list;
>> >> +
>> >> +err_release:
>> >> +	ib_umem_list_release(list);
>> >> +	return ERR_PTR(err);
>> >> +}
>> >> +EXPORT_SYMBOL(ib_umem_list_create);
>> >> +
>> >> +/**
>> >> + * ib_umem_list_release - Release all umems in the list and free it
>> >> + * @list: umem list
>> >> + */
>> >> +void ib_umem_list_release(struct ib_umem_list *list)
>> >> +{
>> >> +	int i;
>> >> +
>> >> +	if (!list)
>> >> +		return;
>> >> +	for (i = 0; i < list->count; i++)
>> >> +		ib_umem_release(list->umems[i]);
>> >> +	kfree(list);
>> >> +}
>> >> +EXPORT_SYMBOL(ib_umem_list_release);
>> >> +
>> >> +/**
>> >> + * ib_umem_list_check_consumed - Verify all provided umems were loaded
>> >> + * @list: umem list
>> >> + *
>> >> + * Return: 0 if all provided slots were loaded, -EINVAL otherwise.
>> >> + */
>> >> +int ib_umem_list_check_consumed(const struct ib_umem_list *list)
>> >> +{
>> >> +	return (list->provided & ~list->loaded) == 0 ? 0 : -EINVAL;
>> >> +}
>> >> +EXPORT_SYMBOL(ib_umem_list_check_consumed);
>> >> +
>> >> +/**
>> >> + * ib_umem_list_insert - Insert a umem into the list at a given index
>> >> + * @list: umem list
>> >> + * @index: per-command buffer slot index
>> >> + * @umem: umem pointer to store
>> >> + *
>> >> + * Stores @umem at @index (replacing any existing). For use from create_cq
>> >> + * when the buffer comes from legacy ATTRs rather than the buffer list.
>> >> + */
>> >> +void ib_umem_list_insert(struct ib_umem_list *list, unsigned int index,
>> >> +			 struct ib_umem *umem)
>> >> +{
>> >> +	ib_umem_list_replace(list, index, umem);
>> >> +	if (umem)
>> >> +		list->provided |= BIT(index);
>> >> +}
>> >> +EXPORT_SYMBOL(ib_umem_list_insert);
>> >> +
>> >> +/**
>> >> + * ib_umem_list_load - Load a umem from the list by index
>> >> + * @list: umem list (may be NULL)
>> >> + * @index: per-command buffer slot index
>> >> + * @size: minimum required umem length
>> >> + *
>> >> + * Return: umem pointer, or NULL if the slot is empty or
>> >> + * the slot is out of bounds, or ERR_PTR(-EINVAL) if the umem is too small.
>> >> + */
>> >> +struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
>> >> +				 unsigned int index, size_t size)
>> >> +{
>> >> +	struct ib_umem *umem;
>> >> +
>> >> +	if (!list || index >= list->count)
>> >> +		return NULL;
>> >> +	umem = list->umems[index];
>> >> +	if (!umem)
>> >> +		return NULL;
>> >> +	if (umem->length < size)
>> >> +		return ERR_PTR(-EINVAL);
>> >> +	list->loaded |= BIT(index);
>> >> +	return umem;
>> >> +}
>> >> +EXPORT_SYMBOL(ib_umem_list_load);
>> >> +
>> >> +/**
>> >> + * ib_umem_list_load_or_get - Umem from list or pin user memory
>> >> + * @list: umem list (may be NULL)
>> >> + * @index: per-command buffer slot index
>> >> + * @device: IB device for ib_umem_get when the list slot is empty
>> >> + * @addr: user virtual address for ib_umem_get
>> >> + * @size: length for ib_umem_get
>> >> + * @access: access flags for ib_umem_get
>> >> + *
>> >> + * If @list has a umem at @index, returns it like ib_umem_list_load() (and
>> >> + * marks the slot loaded). Otherwise calls ib_umem_get() with the given
>> >> + * @access flags and on success stores the result at @index when
>> >> + * @list is non-NULL.
>> >> + *
>> >> + * Return: valid umem pointer, or ERR_PTR.
>> >> + */
>> >> +struct ib_umem *ib_umem_list_load_or_get(struct ib_umem_list *list,
>> >> +					 unsigned int index,
>> >> +					 struct ib_device *device,
>> >> +					 unsigned long addr, size_t size,
>> >> +					 int access)
>> >> +{
>> >> +	struct ib_umem *umem;
>> >> +
>> >> +	umem = ib_umem_list_load(list, index, size);
>> >> +	if (IS_ERR(umem) || umem)
>> >> +		return umem;
>> >> +	umem = ib_umem_get(device, addr, size, access);
>> >> +	if (IS_ERR(umem))
>> >> +		return umem;
>> >> +	if (list && index < list->count)
>> >> +		list->umems[index] = umem;
>> >> +	return umem;
>> >> +}
>> >> +EXPORT_SYMBOL(ib_umem_list_load_or_get);
>> >> +
>> >> +/**
>> >> + * ib_umem_list_replace - Replace umem at index, releasing the previous one
>> >> + * @list: umem list (may be NULL)
>> >> + * @index: per-command buffer slot index
>> >> + * @umem: new umem pointer (may be NULL to clear the slot)
>> >> + *
>> >> + * Stores @umem at @index. If a different umem was already stored there, it is
>> >> + * released. Used for CQ resize and similar.
>> >> + */
>> >> +void ib_umem_list_replace(struct ib_umem_list *list, unsigned int index,
>> >> +			  struct ib_umem *umem)
>> >> +{
>> >> +	struct ib_umem *old;
>> >> +
>> >> +	if (!list || index >= list->count)
>> >> +		return;
>> >> +	old = list->umems[index];
>> >> +	list->umems[index] = umem;
>> >> +	if (old && old != umem)
>> >> +		ib_umem_release(old);
>> >> +}
>> >> +EXPORT_SYMBOL(ib_umem_list_replace);
>> >> +
>> >> +/**
>> >> + * ib_umem_release_non_listed - Release a umem that is not stored in the list
>> >> + * @list: umem list
>> >> + * @index: per-command buffer slot index
>> >> + * @umem: umem pointer to release
>> >> + *
>> >> + * Releases @umem if it is not stored in @list.
>> >> + */
>> >> +void ib_umem_release_non_listed(struct ib_umem_list *list, unsigned int index,
>> >> +				struct ib_umem *umem)
>> >> +{
>> >> +	if (!list || index >= list->count || list->umems[index] != umem)
>> >> +		ib_umem_release(umem);
>> >> +}
>> >> +EXPORT_SYMBOL(ib_umem_release_non_listed);
>> >> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
>> >> index 2ad52cc1d52b..924acb8d08c3 100644
>> >> --- a/include/rdma/ib_umem.h
>> >> +++ b/include/rdma/ib_umem.h
>> >> @@ -11,6 +11,7 @@
>> >>  
>> >>  struct ib_device;
>> >>  struct dma_buf_attach_ops;
>> >> +struct uverbs_attr_bundle;
>> >>  
>> >>  struct ib_umem {
>> >>  	struct ib_device       *ibdev;
>> >> @@ -80,6 +81,36 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
>> >>  void ib_umem_release(struct ib_umem *umem);
>> >>  int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
>> >>  		      size_t length);
>> >> +
>> >> +/**
>> >> + * struct ib_umem_list - collection of pre-mapped umems
>> >> + *
>> >> + * Created from the UVERBS_ATTR_BUFFERS attribute. Each entry is indexed
>> >> + * by a per-command buffer slot enum (e.g., IB_UMEM_CQ_BUF for CQ CREATE).
>> >> + * Drivers use ib_umem_list_load() to retrieve a specific umem by index.
>> >> + */
>> >> +struct ib_umem_list;
>> >> +
>> >> +struct ib_umem_list *ib_umem_list_create(struct ib_device *device,
>> >> +					 const struct uverbs_attr_bundle *attrs,
>> >> +					 unsigned int slot_max);
>> >> +void ib_umem_list_release(struct ib_umem_list *list);
>> >> +int ib_umem_list_check_consumed(const struct ib_umem_list *list);
>> >> +void ib_umem_list_insert(struct ib_umem_list *list, unsigned int index,
>> >> +			 struct ib_umem *umem);
>> >> +
>> >> +struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
>> >> +				  unsigned int index, size_t size);
>> >> +struct ib_umem *ib_umem_list_load_or_get(struct ib_umem_list *list,
>> >> +					 unsigned int index,
>> >> +					 struct ib_device *device,
>> >> +					 unsigned long addr, size_t size,
>> >> +					 int access);
>> >> +void ib_umem_list_replace(struct ib_umem_list *list, unsigned int index,
>> >> +			  struct ib_umem *umem);
>> >> +void ib_umem_release_non_listed(struct ib_umem_list *list, unsigned int index,
>> >> +				struct ib_umem *umem);
>> >> +
>> >>  unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
>> >>  				     unsigned long pgsz_bitmap,
>> >>  				     unsigned long virt);
>> >> @@ -230,5 +261,28 @@ static inline void ib_umem_dmabuf_revoke_lock(struct ib_umem_dmabuf *umem_dmabuf
>> >>  static inline void ib_umem_dmabuf_revoke_unlock(struct ib_umem_dmabuf *umem_dmabuf) {}
>> >>  static inline void ib_umem_dmabuf_revoke(struct ib_umem_dmabuf *umem_dmabuf) {}
>> >>  
>> >> +struct ib_umem_list;
>> >> +
>> >> +static inline void ib_umem_list_release(struct ib_umem_list *list) { }
>> >> +static inline struct ib_umem *ib_umem_list_load(struct ib_umem_list *list,
>> >> +						unsigned int index,
>> >> +						size_t size)
>> >> +{
>> >> +	return ERR_PTR(-EOPNOTSUPP);
>> >> +}
>> >> +static inline struct ib_umem *
>> >> +ib_umem_list_load_or_get(struct ib_umem_list *list, unsigned int index,
>> >> +			 struct ib_device *device, unsigned long addr,
>> >> +			 size_t size, int access)
>> >> +{
>> >> +	return ERR_PTR(-EOPNOTSUPP);
>> >> +}
>> >> +static inline void ib_umem_list_replace(struct ib_umem_list *list,
>> >> +					unsigned int index,
>> >> +					struct ib_umem *umem) { }
>> >> +static inline void ib_umem_release_non_listed(struct ib_umem_list *list,
>> >> +					      unsigned int index,
>> >> +					      struct ib_umem *umem) { }
>> >> +
>> >>  #endif /* CONFIG_INFINIBAND_USER_MEM */
>> >>  #endif /* IB_UMEM_H */
>> >> diff --git a/include/rdma/uverbs_ioctl.h b/include/rdma/uverbs_ioctl.h
>> >> index e2af17da3e32..05bcab27a87d 100644
>> >> --- a/include/rdma/uverbs_ioctl.h
>> >> +++ b/include/rdma/uverbs_ioctl.h
>> >> @@ -590,6 +590,20 @@ struct uapi_definition {
>> >>  			    UA_OPTIONAL,                                       \
>> >>  			    .is_udata = 1)
>> >>  
>> >> +/*
>> >> + * Optional array of struct ib_uverbs_buffer_desc describing memory regions
>> >> + * backed by dma-buf or user virtual address. Can be added to any method
>> >> + * that needs external buffer support.
>> >> + * Each entry carries an index field selecting the per-command buffer slot.
>> >> + * Use ib_umem_list_create() to map them and ib_umem_list_load() to access.
>> >> + */
>> >> +#define UVERBS_ATTR_BUFFERS()                                                  \
>> >> +	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_BUFFERS,                               \
>> >> +			   UVERBS_ATTR_MIN_SIZE(                               \
>> >> +				sizeof(struct ib_uverbs_buffer_desc)),         \
>> >> +			   UA_OPTIONAL,                                        \
>> >> +			   UA_ALLOC_AND_COPY)
>> >> +
>> >>  /* =================================================
>> >>   *              Parsing infrastructure
>> >>   * =================================================
>> >> diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
>> >> index 72041c1b0ea5..10aa6568abf1 100644
>> >> --- a/include/uapi/rdma/ib_user_ioctl_cmds.h
>> >> +++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
>> >> @@ -64,6 +64,7 @@ enum {
>> >>  	UVERBS_ATTR_UHW_IN = UVERBS_ID_DRIVER_NS,
>> >>  	UVERBS_ATTR_UHW_OUT,
>> >>  	UVERBS_ID_DRIVER_NS_WITH_UHW,
>> >> +	UVERBS_ATTR_BUFFERS,
>> >>  };
>> >>  
>> >>  enum uverbs_methods_device {
>> >> diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
>> >> index 90c5cd8e7753..41ed9f75b4de 100644
>> >> --- a/include/uapi/rdma/ib_user_ioctl_verbs.h
>> >> +++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
>> >> @@ -273,4 +273,31 @@ struct ib_uverbs_gid_entry {
>> >>  	__u32 netdev_ifindex; /* It is 0 if there is no netdev associated with it */
>> >>  };
>> >>  
>> >> +enum ib_uverbs_buffer_type {
>> >> +	IB_UVERBS_BUFFER_TYPE_DMABUF,
>> >> +	IB_UVERBS_BUFFER_TYPE_VA,
>> >> +};
>> >> +
>> >> +/*
>> >> + * Describes a single buffer backed by dma-buf or user virtual address.
>> >> + * Passed as an array via UVERBS_ATTR_BUFFERS. Each uverb command that
>> >> + * accepts this attribute defines its own per-command buffer slot enum.
>> >> + * The index field selects the buffer slot this descriptor maps to.
>> >> + *
>> >> + * @fd: dma-buf file descriptor (valid for IB_UVERBS_BUFFER_TYPE_DMABUF)
>> >> + * @type: buffer type from enum ib_uverbs_buffer_type
>> >> + * @index: per-command buffer slot index
>> >> + * @reserved: must be zero
>> >> + * @addr: offset within dma-buf, or user virtual address for VA
>> >> + * @length: buffer length in bytes
>> >> + */
>> >> +struct ib_uverbs_buffer_desc {
>> >> +	__s32 fd;
>> >> +	__u32 type;
>> >> +	__u32 index;
>> >> +	__u32 reserved;
>> >> +	__aligned_u64 addr;
>> >> +	__aligned_u64 length;
>> >> +};
>> >> +
>> >>  #endif
>> >> -- 
>> >> 2.53.0
>> >> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-13 18:22         ` Jiri Pirko
@ 2026-04-16 12:10           ` Michael Margolin
  2026-04-16 13:34             ` Jiri Pirko
  2026-04-21 12:52             ` Jason Gunthorpe
  0 siblings, 2 replies; 81+ messages in thread
From: Michael Margolin @ 2026-04-16 12:10 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: linux-rdma, jgg, leon, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

On Mon, Apr 13, 2026 at 08:22:05PM +0200, Jiri Pirko wrote:
> >> >While I like the idea of standardizing the way we pass buffer
> >> >information to the kernel, the list thing looks like over generalization
> >> >to me, especially after Leon's refactoring of CQ creation. Maybe we can
> >> >add buffer as a new attribute type that can be used for multiple
> >> >parameters in a command, and have a helper with the code below that
> >> >takes an attribute id and returns a umem object, letting each handler
> >> >store it. This would also make it easier for drivers to pass their
> >> >private buffers using this infrastructure.
> >> 
> >> Currently we have set of attrs (4) to pass CQ umem. I tried to make this
> >> very smooth for all possible uverbs, passing single attr of array of
> >> structs describing a buffer. Uverb attr api knows how to work with
> >> arrays, all clicks.
> >> 
> >> Drivers can easily pass their specific buffers over this list too. I
> >> didn't implement it as there was no need, but the idea is to have index>X
> >> for driver specific indexes.
> >
> >Why do we need to invent a new way instead of just adding another
> >argument in a command, that consists of all the info needed to pass a
> >buffer? Also how can this work for objects that have only private umem?
> 
> You can put the buf array attr to any uverb, some may not have
> "standard" indexes.
> 
Not sure I fully follow your idea here, can you elaborate on how you
plan to reserve index>X range in an enum used as index into dynamic
array?
 
> >
> >> What's the benefit of passing per-uverb attrs with a struct? Perhaps I'm
> >> missing something.
> >
> >Mostly simplification by untying two unrelated things:
> >1) way of passing args to kernel
> >2) object lifetime management
> 
> Could you be more specific please?
> 
> 
> >
> >And also significantly reducing the amount of code changes required to
> >achieve this.
> 
> Significantly? I'm not sure I follow, but I guess that is related to my
> previous question. I'm not sure I understand what you have exactly in
> mind. Regarding UAPI, I think I understand, but regarding kernel
> internals, I don't :(
> 

I imagine the changes for getting umem using the new mechanism in
create CQ command are about:

Define a new optional buffer attribute:

-       UVERBS_ATTR_UHW());
+       UVERBS_ATTR_UHW(),
+       UVERBS_ATTR_BUFFER(UVERBS_ATTR_CREATE_CQ_BUFFER,
+                          UA_OPTIONAL));

Get umem from the new attribute if available or fallback to existing
attributes:

-       umem = uverbs_create_cq_get_umem(ib_dev, attrs);
-       if (IS_ERR(umem)) {
-               ret = PTR_ERR(umem);
-               goto err_event_file;
-       }
+       if (uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER)) {
+               ret = uverbs_get_umem(&umem, attrs, UVERBS_ATTR_CREATE_CQ_BUFFER);
+               if (ret)
+                       goto err_event_file;
+       } else {
+               umem = uverbs_create_cq_get_umem(ib_dev, attrs);
+               if (IS_ERR(umem)) {
+                       ret = PTR_ERR(umem);
+                       goto err_event_file;
+               }
+       }

Drivers don't need to change.


> >> >> + * Optional array of struct ib_uverbs_buffer_desc describing memory regions
> >> >> + * backed by dma-buf or user virtual address. Can be added to any method
> >> >> + * that needs external buffer support.
> >> >> + * Each entry carries an index field selecting the per-command buffer slot.
> >> >> + * Use ib_umem_list_create() to map them and ib_umem_list_load() to access.
> >> >> + */
> >> >> +#define UVERBS_ATTR_BUFFERS()                                                  \
> >> >> +	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_BUFFERS,                               \
> >> >> +			   UVERBS_ATTR_MIN_SIZE(                               \
> >> >> +				sizeof(struct ib_uverbs_buffer_desc)),         \
> >> >> +			   UA_OPTIONAL,                                        \
> >> >> +			   UA_ALLOC_AND_COPY)
> >> >> +
> >> >>  /* =================================================
> >> >>   *              Parsing infrastructure
> >> >>   * =================================================
> >> >> diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
> >> >> index 72041c1b0ea5..10aa6568abf1 100644
> >> >> --- a/include/uapi/rdma/ib_user_ioctl_cmds.h
> >> >> +++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
> >> >> @@ -64,6 +64,7 @@ enum {
> >> >>  	UVERBS_ATTR_UHW_IN = UVERBS_ID_DRIVER_NS,
> >> >>  	UVERBS_ATTR_UHW_OUT,
> >> >>  	UVERBS_ID_DRIVER_NS_WITH_UHW,
> >> >> +	UVERBS_ATTR_BUFFERS,

I don't think you can add anything here, as it overlaps with driver
specific attributes. I suggest defining a per-command attr id and having
the caller pass it into ib_umem_list_create().

> >> >>  };
> >> >>  
> >> >>  enum uverbs_methods_device {
> >> >> diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
> >> >> index 90c5cd8e7753..41ed9f75b4de 100644
> >> >> --- a/include/uapi/rdma/ib_user_ioctl_verbs.h
> >> >> +++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
> >> >> @@ -273,4 +273,31 @@ struct ib_uverbs_gid_entry {
> >> >>  	__u32 netdev_ifindex; /* It is 0 if there is no netdev associated with it */
> >> >>  };
> >> >>  
> >> >> +enum ib_uverbs_buffer_type {
> >> >> +	IB_UVERBS_BUFFER_TYPE_DMABUF,
> >> >> +	IB_UVERBS_BUFFER_TYPE_VA,
> >> >> +};
> >> >> +
> >> >> +/*
> >> >> + * Describes a single buffer backed by dma-buf or user virtual address.
> >> >> + * Passed as an array via UVERBS_ATTR_BUFFERS. Each uverb command that
> >> >> + * accepts this attribute defines its own per-command buffer slot enum.
> >> >> + * The index field selects the buffer slot this descriptor maps to.
> >> >> + *
> >> >> + * @fd: dma-buf file descriptor (valid for IB_UVERBS_BUFFER_TYPE_DMABUF)
> >> >> + * @type: buffer type from enum ib_uverbs_buffer_type
> >> >> + * @index: per-command buffer slot index
> >> >> + * @reserved: must be zero
> >> >> + * @addr: offset within dma-buf, or user virtual address for VA
> >> >> + * @length: buffer length in bytes
> >> >> + */
> >> >> +struct ib_uverbs_buffer_desc {
> >> >> +	__s32 fd;
> >> >> +	__u32 type;
> >> >> +	__u32 index;
> >> >> +	__u32 reserved;
> >> >> +	__aligned_u64 addr;
> >> >> +	__aligned_u64 length;
> >> >> +};
> >> >> +
> >> >>  #endif
> >> >> -- 
> >> >> 2.53.0
> >> >> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-16 12:10           ` Michael Margolin
@ 2026-04-16 13:34             ` Jiri Pirko
  2026-04-21 12:50               ` Jason Gunthorpe
  2026-04-21 12:52             ` Jason Gunthorpe
  1 sibling, 1 reply; 81+ messages in thread
From: Jiri Pirko @ 2026-04-16 13:34 UTC (permalink / raw)
  To: Michael Margolin
  Cc: linux-rdma, jgg, leon, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

Thu, Apr 16, 2026 at 02:10:00PM +0200, mrgolin@amazon.com wrote:
>On Mon, Apr 13, 2026 at 08:22:05PM +0200, Jiri Pirko wrote:
>> >> >While I like the idea of standardizing the way we pass buffer
>> >> >information to the kernel, the list thing looks like over generalization
>> >> >to me, especially after Leon's refactoring of CQ creation. Maybe we can
>> >> >add buffer as a new attribute type that can be used for multiple
>> >> >parameters in a command, and have a helper with the code below that
>> >> >takes an attribute id and returns a umem object, letting each handler
>> >> >store it. This would also make it easier for drivers to pass their
>> >> >private buffers using this infrastructure.
>> >> 
>> >> Currently we have set of attrs (4) to pass CQ umem. I tried to make this
>> >> very smooth for all possible uverbs, passing single attr of array of
>> >> structs describing a buffer. Uverb attr api knows how to work with
>> >> arrays, all clicks.
>> >> 
>> >> Drivers can easily pass their specific buffers over this list too. I
>> >> didn't implement it as there was no need, but the idea is to have index>X
>> >> for driver specific indexes.
>> >
>> >Why do we need to invent a new way instead of just adding another
>> >argument in a command, that consists of all the info needed to pass a
>> >buffer? Also how can this work for objects that have only private umem?
>> 
>> You can put the buf array attr to any uverb, some may not have
>> "standard" indexes.
>> 
>Not sure I fully follow your idea here, can you elaborate on how you
>plan to reserve index>X range in an enum used as index into dynamic
>array?

#define UVERBS_BUF_DRIVER_BASE	1024

and then:

struct ib_uverbs_buffer_desc bufs[] = {
	{
		.fd     = cq_dmabuf_fd,
		.type   = IB_UVERBS_BUFFER_TYPE_DMABUF,
		.index  = UVERBS_BUF_CQ_BUF,           /* 0 */
		.addr   = 0,
		.length = cq_buf_size,
	},
	{
		.fd     = dbr_dmabuf_fd,
		.type   = IB_UVERBS_BUFFER_TYPE_DMABUF,
		.index  = UVERBS_BUF_CQ_DBR,           /* 1 */
		.addr   = dbr_offset,
		.length = 8,
	},
	{
		.fd     = uar_dmabuf_fd,
		.type   = IB_UVERBS_BUFFER_TYPE_DMABUF,
->>>>>>>>>	.index  = MLX5_BUF_CQ_UAR,             /* 1024 */
		.addr   = 0,
		.length = 4096,
	},
};
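
To keep the in-kernel umem array dense with such sparse indexes, one
option would be to translate the UAPI index into a slot number first.
This is only a rough sketch: the slot counts and the helper name
example_index_to_slot() are made up for illustration and are not part
of the posted patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Base of the driver-specific index range, as proposed above. */
#define UVERBS_BUF_DRIVER_BASE	1024u
/* Hypothetical slot counts for one command (not from the patch). */
#define EXAMPLE_CORE_SLOTS	2u	/* e.g. CQ_BUF, CQ_DBR */
#define EXAMPLE_DRIVER_SLOTS	1u	/* e.g. a driver UAR buffer */

/*
 * Translate a sparse UAPI index into a dense slot number.
 * Core indexes map 1:1, driver indexes are packed right after them.
 * Returns false if the index falls into neither range.
 */
static bool example_index_to_slot(uint32_t index, uint32_t *slot)
{
	if (index < EXAMPLE_CORE_SLOTS) {
		*slot = index;
		return true;
	}
	if (index >= UVERBS_BUF_DRIVER_BASE &&
	    index - UVERBS_BUF_DRIVER_BASE < EXAMPLE_DRIVER_SLOTS) {
		*slot = EXAMPLE_CORE_SLOTS + (index - UVERBS_BUF_DRIVER_BASE);
		return true;
	}
	return false;
}
```

With these made-up counts, index 1024 lands in slot 2, so the array
still needs only three entries.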



> 
>> >
>> >> What's the benefit of passing per-uverb attrs with a struct? Perhaps I'm
>> >> missing something.
>> >
>> >Mostly simplification by untying two unrelated things:
>> >1) way of passing args to kernel
>> >2) object lifetime management
>> 
>> Could you be more specific please?
>> 
>> 
>> >
>> >And also significantly reducing the amount of code changes required to
>> >achieve this.
>> 
>> Significantly? I'm not sure I follow, but I guess that is related to my
>> previous question. I'm not sure I understand what you have exactly in
>> mind. Regarding UAPI, I think I understand, but regarding kernel
>> internals, I don't :(
>> 
>
>I imagine the changes for getting umem using the new mechanism in
>create CQ command are about:
>
>Define a new optional buffer attribute:
>
>-       UVERBS_ATTR_UHW());
>+       UVERBS_ATTR_UHW(),
>+       UVERBS_ATTR_BUFFER(UVERBS_ATTR_CREATE_CQ_BUFFER,
>+                          UA_OPTIONAL));

Okay, that may be doable. I'm just curious about the in-kernel
management of umems. With the list, it aligns. With your suggested
approach we would have to iterate over these attrs and assemble an
in-kernel list to manage the umems.


>
>Get umem from the new attribute if available or fallback to existing
>attributes:
>
>-       umem = uverbs_create_cq_get_umem(ib_dev, attrs);
>-       if (IS_ERR(umem)) {
>-               ret = PTR_ERR(umem);
>-               goto err_event_file;
>-       }
>+       if (uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER)) {
>+               ret = uverbs_get_umem(&umem, attrs, UVERBS_ATTR_CREATE_CQ_BUFFER);
>+               if (ret)
>+                       goto err_event_file;
>+       } else {
>+               umem = uverbs_create_cq_get_umem(ib_dev, attrs);
>+               if (IS_ERR(umem)) {
>+                       ret = PTR_ERR(umem);
>+                       goto err_event_file;
>+               }
>+       }
>
>Drivers don't need to change.

uverbs_create_cq_get_umem() only works with the legacy attrs, so it is
not that interesting. How do you propose to handle other umems when a
uverb supports multiple umems (like the additional DBR umem for create
CQ)? I'm particularly interested in consumption validation and
life-cycle management (which is a bit trickier for create QP).



>
>
>> >> >> + * Optional array of struct ib_uverbs_buffer_desc describing memory regions
>> >> >> + * backed by dma-buf or user virtual address. Can be added to any method
>> >> >> + * that needs external buffer support.
>> >> >> + * Each entry carries an index field selecting the per-command buffer slot.
>> >> >> + * Use ib_umem_list_create() to map them and ib_umem_list_load() to access.
>> >> >> + */
>> >> >> +#define UVERBS_ATTR_BUFFERS()                                                  \
>> >> >> +	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_BUFFERS,                               \
>> >> >> +			   UVERBS_ATTR_MIN_SIZE(                               \
>> >> >> +				sizeof(struct ib_uverbs_buffer_desc)),         \
>> >> >> +			   UA_OPTIONAL,                                        \
>> >> >> +			   UA_ALLOC_AND_COPY)
>> >> >> +
>> >> >>  /* =================================================
>> >> >>   *              Parsing infrastructure
>> >> >>   * =================================================
>> >> >> diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
>> >> >> index 72041c1b0ea5..10aa6568abf1 100644
>> >> >> --- a/include/uapi/rdma/ib_user_ioctl_cmds.h
>> >> >> +++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
>> >> >> @@ -64,6 +64,7 @@ enum {
>> >> >>  	UVERBS_ATTR_UHW_IN = UVERBS_ID_DRIVER_NS,
>> >> >>  	UVERBS_ATTR_UHW_OUT,
>> >> >>  	UVERBS_ID_DRIVER_NS_WITH_UHW,
>> >> >> +	UVERBS_ATTR_BUFFERS,
>
>I don't think you can add anything here as it overlaps with driver
>specific attributes. I suggest defining per command attr id and passing
>it by caller into ib_umem_list_create.
>
>> >> >>  };
>> >> >>  
>> >> >>  enum uverbs_methods_device {
>> >> >> diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
>> >> >> index 90c5cd8e7753..41ed9f75b4de 100644
>> >> >> --- a/include/uapi/rdma/ib_user_ioctl_verbs.h
>> >> >> +++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
>> >> >> @@ -273,4 +273,31 @@ struct ib_uverbs_gid_entry {
>> >> >>  	__u32 netdev_ifindex; /* It is 0 if there is no netdev associated with it */
>> >> >>  };
>> >> >>  
>> >> >> +enum ib_uverbs_buffer_type {
>> >> >> +	IB_UVERBS_BUFFER_TYPE_DMABUF,
>> >> >> +	IB_UVERBS_BUFFER_TYPE_VA,
>> >> >> +};
>> >> >> +
>> >> >> +/*
>> >> >> + * Describes a single buffer backed by dma-buf or user virtual address.
>> >> >> + * Passed as an array via UVERBS_ATTR_BUFFERS. Each uverb command that
>> >> >> + * accepts this attribute defines its own per-command buffer slot enum.
>> >> >> + * The index field selects the buffer slot this descriptor maps to.
>> >> >> + *
>> >> >> + * @fd: dma-buf file descriptor (valid for IB_UVERBS_BUFFER_TYPE_DMABUF)
>> >> >> + * @type: buffer type from enum ib_uverbs_buffer_type
>> >> >> + * @index: per-command buffer slot index
>> >> >> + * @reserved: must be zero
>> >> >> + * @addr: offset within dma-buf, or user virtual address for VA
>> >> >> + * @length: buffer length in bytes
>> >> >> + */
>> >> >> +struct ib_uverbs_buffer_desc {
>> >> >> +	__s32 fd;
>> >> >> +	__u32 type;
>> >> >> +	__u32 index;
>> >> >> +	__u32 reserved;
>> >> >> +	__aligned_u64 addr;
>> >> >> +	__aligned_u64 length;
>> >> >> +};
>> >> >> +
>> >> >>  #endif
>> >> >> -- 
>> >> >> 2.53.0
>> >> >> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-16 13:34             ` Jiri Pirko
@ 2026-04-21 12:50               ` Jason Gunthorpe
  0 siblings, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-21 12:50 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Michael Margolin, linux-rdma, leon, gal.pressman, sleybo, parav,
	mbloch, yanjun.zhu, marco.crivellari, roman.gushchin, phaddad,
	lirongqing, ynachum, huangjunxian6, kalesh-anakkur.purayil,
	ohartoov, michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

On Thu, Apr 16, 2026 at 03:34:05PM +0200, Jiri Pirko wrote:

> >Get umem from the new attribute if available or fallback to existing
> >attributes:
> >
> >-       umem = uverbs_create_cq_get_umem(ib_dev, attrs);
> >-       if (IS_ERR(umem)) {
> >-               ret = PTR_ERR(umem);
> >-               goto err_event_file;
> >-       }
> >+       if (uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER)) {
> >+               ret = uverbs_get_umem(&umem, attrs, UVERBS_ATTR_CREATE_CQ_BUFFER);
> >+               if (ret)
> >+                       goto err_event_file;
> >+       } else {
> >+               umem = uverbs_create_cq_get_umem(ib_dev, attrs);
> >+               if (IS_ERR(umem)) {
> >+                       ret = PTR_ERR(umem);
> >+                       goto err_event_file;
> >+               }
> >+       }
> >
> >Drivers don't need to change.
> 
> uverbs_create_cq_get_umem only works with legacy attrs. Not that
> interesting. How do you propose to handle other umems, when uverb
> supports multiple umems (like + DBR umem for create CQ)? I'm
> particularly interested in consumption validation and life-cycle
> management (that is a bit trickier for create QP).

Having a standardized attrs getter for a umem is sort of interesting,
but you are right that it doesn't address the lifecycle; the driver
would still have to keep track of the returned umem.

The interest in working on the umems was two parts
 - Make more drivers accept more kinds of umems (ie dmabuf) 
   by getting them out of the driver data
 - Have the core code participate more in managing the lifecycle of
   the umem to avoid driver duplication

The latter was easier on the cq/mr cases but QP is quite a bit more
complicated.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-16 12:10           ` Michael Margolin
  2026-04-16 13:34             ` Jiri Pirko
@ 2026-04-21 12:52             ` Jason Gunthorpe
  2026-04-22 10:32               ` Jiri Pirko
  1 sibling, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-21 12:52 UTC (permalink / raw)
  To: Michael Margolin
  Cc: Jiri Pirko, linux-rdma, leon, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

On Thu, Apr 16, 2026 at 12:10:00PM +0000, Michael Margolin wrote:
> > >> >> @@ -64,6 +64,7 @@ enum {
> > >> >>  	UVERBS_ATTR_UHW_IN = UVERBS_ID_DRIVER_NS,
> > >> >>  	UVERBS_ATTR_UHW_OUT,
> > >> >>  	UVERBS_ID_DRIVER_NS_WITH_UHW,
> > >> >> +	UVERBS_ATTR_BUFFERS,
> 
> I don't think you can add anything here as it overlaps with driver
> specific attributes. I suggest defining per command attr id and passing
> it by caller into ib_umem_list_create.

Right, the expectation would be to have an ATTRS_QP_BUFFERS

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-21 12:52             ` Jason Gunthorpe
@ 2026-04-22 10:32               ` Jiri Pirko
  2026-04-22 16:30                 ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Jiri Pirko @ 2026-04-22 10:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Michael Margolin, linux-rdma, leon, gal.pressman, sleybo, parav,
	mbloch, yanjun.zhu, marco.crivellari, roman.gushchin, phaddad,
	lirongqing, ynachum, huangjunxian6, kalesh-anakkur.purayil,
	ohartoov, michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

Tue, Apr 21, 2026 at 02:52:12PM +0200, jgg@ziepe.ca wrote:
>On Thu, Apr 16, 2026 at 12:10:00PM +0000, Michael Margolin wrote:
>> > >> >> @@ -64,6 +64,7 @@ enum {
>> > >> >>  	UVERBS_ATTR_UHW_IN = UVERBS_ID_DRIVER_NS,
>> > >> >>  	UVERBS_ATTR_UHW_OUT,
>> > >> >>  	UVERBS_ID_DRIVER_NS_WITH_UHW,
>> > >> >> +	UVERBS_ATTR_BUFFERS,
>> 
>> I don't think you can add anything here as it overlaps with driver
>> specific attributes. I suggest defining per command attr id and passing
>> it by caller into ib_umem_list_create.
>
>Right, the expectation would be to have a ATTRS_QP_BUFFERS

Okay. I was under the impression that I could add a generic attr; I was wrong :/

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-22 10:32               ` Jiri Pirko
@ 2026-04-22 16:30                 ` Jason Gunthorpe
  0 siblings, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-22 16:30 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Michael Margolin, linux-rdma, leon, gal.pressman, sleybo, parav,
	mbloch, yanjun.zhu, marco.crivellari, roman.gushchin, phaddad,
	lirongqing, ynachum, huangjunxian6, kalesh-anakkur.purayil,
	ohartoov, michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

On Wed, Apr 22, 2026 at 12:32:26PM +0200, Jiri Pirko wrote:
> Tue, Apr 21, 2026 at 02:52:12PM +0200, jgg@ziepe.ca wrote:
> >On Thu, Apr 16, 2026 at 12:10:00PM +0000, Michael Margolin wrote:
> >> > >> >> @@ -64,6 +64,7 @@ enum {
> >> > >> >>  	UVERBS_ATTR_UHW_IN = UVERBS_ID_DRIVER_NS,
> >> > >> >>  	UVERBS_ATTR_UHW_OUT,
> >> > >> >>  	UVERBS_ID_DRIVER_NS_WITH_UHW,
> >> > >> >> +	UVERBS_ATTR_BUFFERS,
> >> 
> >> I don't think you can add anything here as it overlaps with driver
> >> specific attributes. I suggest defining per command attr id and passing
> >> it by caller into ib_umem_list_create.
> >
> >Right, the expectation would be to have a ATTRS_QP_BUFFERS
> 
> Okay. I was under impression I can add a generic attr, I was wrong :/

There may be a way to do that, but the [UVERBS_ID_DRIVER_NS:MAX]
number space is fully delegated to drivers and all the low values are
already taken.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-11 14:49 ` [PATCH rdma-next v2 01/15] RDMA/core: " Jiri Pirko
  2026-04-12 12:33   ` Michael Margolin
@ 2026-04-21 13:46   ` Jason Gunthorpe
  2026-04-22 11:33     ` Jiri Pirko
  1 sibling, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-21 13:46 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: linux-rdma, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

On Sat, Apr 11, 2026 at 04:49:01PM +0200, Jiri Pirko wrote:
> @@ -332,3 +333,250 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,

> +struct ib_umem_list *ib_umem_list_create(struct ib_device *device,
> +					 const struct uverbs_attr_bundle *attrs,
> +					 unsigned int slot_max)
> +{
> +	const struct ib_uverbs_buffer_desc *descs;
> +	struct ib_umem_dmabuf *umem_dmabuf;
> +	struct ib_umem_list *list;
> +	struct ib_umem *umem;
> +	unsigned int count;
> +	int num_descs;
> +	int err;
> +	int i;
> +
> +	if (WARN_ON_ONCE(slot_max >= BITS_PER_LONG))
> +		return ERR_PTR(-EINVAL);
> +	count = slot_max + 1;
> +
> +	num_descs = uverbs_attr_ptr_get_array_size(
> +		(struct uverbs_attr_bundle *)attrs, UVERBS_ATTR_BUFFERS,
> +		sizeof(*descs));

uverbs_attr_ptr_get_array_size() should get a const on the parameter,
seems to have been missed originally

> +/*
> + * Describes a single buffer backed by dma-buf or user virtual address.
> + * Passed as an array via UVERBS_ATTR_BUFFERS. Each uverb command that
> + * accepts this attribute defines its own per-command buffer slot enum.
> + * The index field selects the buffer slot this descriptor maps to.
> + *
> + * @fd: dma-buf file descriptor (valid for IB_UVERBS_BUFFER_TYPE_DMABUF)
> + * @type: buffer type from enum ib_uverbs_buffer_type
> + * @index: per-command buffer slot index
> + * @reserved: must be zero
> + * @addr: offset within dma-buf, or user virtual address for VA
> + * @length: buffer length in bytes
> + */
> +struct ib_uverbs_buffer_desc {
> +	__s32 fd;
> +	__u32 type;
> +	__u32 index;
> +	__u32 reserved;
> +	__aligned_u64 addr;
> +	__aligned_u64 length;
> +};

This seems like a good idea, we should have done it earlier :\

Arguably if you do this then the first issue of being more flexible
with umems is addressed, so a uverbs_attr_ptr_get_umem() looks much
more feasible.

Just brainstorming, but if we let the driver pass in its uhw
information into a getter:

  struct ib_umem *uverbs_attr_get_umem(struct
      uverbs_attr_bundle *attrs, u16 idx,
      u64 uhw_umem_base, u64 umem_len);

  dbr_umem = uverbs_attr_get_umem(attrs,
                     MLX5_IB_ATTR_QP_DBR, uhw->base, uhw->len);

Then if the new attribute is provided the uhw is ignored, otherwise an
ib_uverbs_buffer_desc is created from the udata parameters instead.

Drivers use the normal attr indexes to define their many umems for
something complicated like QP.

For the lifecycle: this series adds a
  +       cq->umem_list     = umem_list;

So it is not a big leap to imagine a linked list in the object that is
appended by the umem create function. Pass the list head into the umem
allocator, free the whole linked list in the core code.
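
A rough userspace sketch of that lifecycle idea, for illustration only.
It is not kernel code: struct mock_umem, struct mock_umem_owner and the
mock_* helpers are hypothetical stand-ins, assuming the owning object
(a CQ, a QP, ...) carries the list and the core frees it in one place.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ib_umem; only the list linkage matters. */
struct mock_umem {
	struct mock_umem *next;
	unsigned long long addr;
	unsigned long long length;
};

/* Hypothetical per-object state, e.g. embedded in a CQ or QP. */
struct mock_umem_owner {
	struct mock_umem *head;
	struct mock_umem **tail;	/* where the next umem gets linked */
	unsigned int count;
};

static void mock_umem_owner_init(struct mock_umem_owner *owner)
{
	owner->head = NULL;
	owner->tail = &owner->head;
	owner->count = 0;
}

/* Driver-facing "get": creates a umem and appends it to the owner's list. */
static struct mock_umem *mock_umem_get(struct mock_umem_owner *owner,
				       unsigned long long addr,
				       unsigned long long length)
{
	struct mock_umem *umem;

	assert(owner && owner->tail);
	umem = calloc(1, sizeof(*umem));
	if (!umem)
		return NULL;
	umem->addr = addr;
	umem->length = length;
	*owner->tail = umem;
	owner->tail = &umem->next;
	owner->count++;
	return umem;
}

/* Core-side teardown: frees the whole list, returns how many were freed. */
static unsigned int mock_umem_owner_release(struct mock_umem_owner *owner)
{
	unsigned int freed = 0;

	while (owner->head) {
		struct mock_umem *umem = owner->head;

		owner->head = umem->next;
		free(umem);
		freed++;
	}
	mock_umem_owner_init(owner);
	return freed;
}
```

The driver only ever calls the get helper; it never frees a umem itself,
so an error path in the driver cannot leak one.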

This has some appeal because it is an easier conversion of all the
drivers, instead of re-threading their flows to accept a pre-created
umem they just have to be updated to call the new function in all the
places they are currently getting umems.

You'd probably have a further helper for cq that could extract the
existing common cq attrs to a ib_uverbs_buffer_desc:

  cq_umem = uverbs_attr_get_cq_umem(attrs, cq, uhw->base, uhw->len);

Probably similar for mr and a common mr attribute.

This will be easier to put revocable and dynamic ops into the scheme,
they can be passed as arguments to the get function instead of some
complicated thing in the central ops structure.

Jason

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-21 13:46   ` Jason Gunthorpe
@ 2026-04-22 11:33     ` Jiri Pirko
  2026-04-22 14:06       ` Jiri Pirko
  0 siblings, 1 reply; 81+ messages in thread
From: Jiri Pirko @ 2026-04-22 11:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

Tue, Apr 21, 2026 at 03:46:35PM +0200, jgg@ziepe.ca wrote:
>On Sat, Apr 11, 2026 at 04:49:01PM +0200, Jiri Pirko wrote:
>> @@ -332,3 +333,250 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
>
>> +struct ib_umem_list *ib_umem_list_create(struct ib_device *device,
>> +					 const struct uverbs_attr_bundle *attrs,
>> +					 unsigned int slot_max)
>> +{
>> +	const struct ib_uverbs_buffer_desc *descs;
>> +	struct ib_umem_dmabuf *umem_dmabuf;
>> +	struct ib_umem_list *list;
>> +	struct ib_umem *umem;
>> +	unsigned int count;
>> +	int num_descs;
>> +	int err;
>> +	int i;
>> +
>> +	if (WARN_ON_ONCE(slot_max >= BITS_PER_LONG))
>> +		return ERR_PTR(-EINVAL);
>> +	count = slot_max + 1;
>> +
>> +	num_descs = uverbs_attr_ptr_get_array_size(
>> +		(struct uverbs_attr_bundle *)attrs, UVERBS_ATTR_BUFFERS,
>> +		sizeof(*descs));
>
>uverbs_attr_ptr_get_array_size() should get a const on the parameter,
>seems to have been missed originally

Okay.


>
>> +/*
>> + * Describes a single buffer backed by dma-buf or user virtual address.
>> + * Passed as an array via UVERBS_ATTR_BUFFERS. Each uverb command that
>> + * accepts this attribute defines its own per-command buffer slot enum.
>> + * The index field selects the buffer slot this descriptor maps to.
>> + *
>> + * @fd: dma-buf file descriptor (valid for IB_UVERBS_BUFFER_TYPE_DMABUF)
>> + * @type: buffer type from enum ib_uverbs_buffer_type
>> + * @index: per-command buffer slot index
>> + * @reserved: must be zero
>> + * @addr: offset within dma-buf, or user virtual address for VA
>> + * @length: buffer length in bytes
>> + */
>> +struct ib_uverbs_buffer_desc {
>> +	__s32 fd;
>> +	__u32 type;
>> +	__u32 index;
>> +	__u32 reserved;
>> +	__aligned_u64 addr;
>> +	__aligned_u64 length;
>> +};
>
>This seems like a good idea, we should have done it earlier :\

Yeah :/


>
>Arguably if you do this then the first issue of being more flexible
>with umems is addressed, so a uverbs_attr_ptr_get_umem() looks much
>more feasible.
>
>Just brainstorming, but if we let the driver pass in its uhw
>information into a getter:
>
>  struct ib_umem *uverbs_attr_get_umem(struct
>      uverbs_attr_bundle *attrs, u16 idx,
>      u64 uhw_umem_base, u64 umem_len);
>
>  dbr_umem = uverbs_attr_get_umem(attrs,
>                     MLX5_IB_ATTR_QP_DBR, uhw->base, uhw->len);
>
>Then if the new attribute is provided the uhw is ignored, otherwise an
>ib_uverbs_buffer_desc is created from the udata parameters instead.
>
>Drivers use the normal attr indexes to define their many umems for
>something complicated like QP.

Won't this go backwards? I mean, I was under the impression that we want to
move the umem creation to core. What you suggest is that the driver initiates
the umem creation. I personally think that it is nicer the way you
suggest, since the core is the owner and responsible for cleanup and
umems are created upon need.

One thing: how about the consumption checking? I remember that for my
previous attempt on uverb umems you asked to check if each attr was
processed or not and, in case it was not, yell out at the user.


>
>For the lifecycle: this series adds a
>  +       cq->umem_list     = umem_list;
> 
>So it is not a big leap to imagine a linked list in the object that is
>appended by the umem create function. Pass the list head into the umem
>allocator, free the whole linked list in the core code.

Yeah, that would be okay.

>
>This has some appeal because it is an easier conversion of all the
>drivers, instead of re-threading their flows to accept a pre-created
>umem they just have to be updated to call the new function in all the
>places they are currently getting umems.
>
>You'd probably have a further helper for cq that could extract the
>existing common cq attrs to a ib_uverbs_buffer_desc:
>
>  cq_umem = uverbs_attr_get_cq_umem(attrs, cq, uhw->base, uhw->len);
>
>Probably similar for mr and a common mr attribute.

Got it.


>
>This will be easier to put revocable and dynamic ops into the scheme,
>they can be passed as arguments to the get function instead of some
>complicated thing in the central ops structure.
>
>Jason

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-22 11:33     ` Jiri Pirko
@ 2026-04-22 14:06       ` Jiri Pirko
  2026-04-22 16:51         ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Jiri Pirko @ 2026-04-22 14:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

Wed, Apr 22, 2026 at 01:33:06PM +0200, jiri@resnulli.us wrote:
>Tue, Apr 21, 2026 at 03:46:35PM +0200, jgg@ziepe.ca wrote:
>>On Sat, Apr 11, 2026 at 04:49:01PM +0200, Jiri Pirko wrote:
>>> @@ -332,3 +333,250 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
>>
>>> +struct ib_umem_list *ib_umem_list_create(struct ib_device *device,
>>> +					 const struct uverbs_attr_bundle *attrs,
>>> +					 unsigned int slot_max)
>>> +{
>>> +	const struct ib_uverbs_buffer_desc *descs;
>>> +	struct ib_umem_dmabuf *umem_dmabuf;
>>> +	struct ib_umem_list *list;
>>> +	struct ib_umem *umem;
>>> +	unsigned int count;
>>> +	int num_descs;
>>> +	int err;
>>> +	int i;
>>> +
>>> +	if (WARN_ON_ONCE(slot_max >= BITS_PER_LONG))
>>> +		return ERR_PTR(-EINVAL);
>>> +	count = slot_max + 1;
>>> +
>>> +	num_descs = uverbs_attr_ptr_get_array_size(
>>> +		(struct uverbs_attr_bundle *)attrs, UVERBS_ATTR_BUFFERS,
>>> +		sizeof(*descs));
>>
>>uverbs_attr_ptr_get_array_size() should get a const on the parameter,
>>seems to have been missed originally
>
>Okay.
>
>
>>
>>> +/*
>>> + * Describes a single buffer backed by dma-buf or user virtual address.
>>> + * Passed as an array via UVERBS_ATTR_BUFFERS. Each uverb command that
>>> + * accepts this attribute defines its own per-command buffer slot enum.
>>> + * The index field selects the buffer slot this descriptor maps to.
>>> + *
>>> + * @fd: dma-buf file descriptor (valid for IB_UVERBS_BUFFER_TYPE_DMABUF)
>>> + * @type: buffer type from enum ib_uverbs_buffer_type
>>> + * @index: per-command buffer slot index
>>> + * @reserved: must be zero
>>> + * @addr: offset within dma-buf, or user virtual address for VA
>>> + * @length: buffer length in bytes
>>> + */
>>> +struct ib_uverbs_buffer_desc {
>>> +	__s32 fd;
>>> +	__u32 type;
>>> +	__u32 index;
>>> +	__u32 reserved;
>>> +	__aligned_u64 addr;
>>> +	__aligned_u64 length;
>>> +};
>>
>>This seems like a good idea, we should have done it earlier :\
>
>Yeah :/
>
>
>>
>>Arguably if you do this then the first issue of being more flexible
>>with umems is addressed, so a uverbs_attr_ptr_get_umem() looks much
>>more feasible.
>>
>>Just brainstorming, but if we let the driver pass in its uhw
>>information into a getter:
>>
>>  struct ib_umem *uverbs_attr_get_umem(struct
>>      uverbs_attr_bundle *attrs, u16 idx,
>>      u64 uhw_umem_base, u64 umem_len);
>>
>>  dbr_umem = uverbs_attr_get_umem(attrs,
>>                     MLX5_IB_ATTR_QP_DBR, uhw->base, uhw->len);
>>
>>Then if the new attribute is provided the uhw is ignored, otherwise an
>>ib_uverbs_buffer_desc is created from the udata parameters instead.
>>
>>Drivers use the normal attr indexes to define their many umems for
>>something complicated like QP.
>
>Won't this go backwards? I mean, I was under the impression that we want to
>move the umem creation to core. What you suggest is that the driver initiates
>the umem creation. I personally think that it is nicer the way you
>suggest, since the core is the owner and responsible for cleanup and
>umems are created upon need.
>
>One thing: how about the consumption checking? I remember that for my
>previous attempt on uverb umems you asked to check if each attr was
>processed or not and, in case it was not, yell out at the user.

Well, I think I can still track consumption per loaded attr. I'm on it.


>
>
>>
>>For the lifecycle: this series adds a
>>  +       cq->umem_list     = umem_list;
>> 
>>So it is not a big leap to imagine a linked list in the object that is
>>appended by the umem create function. Pass the list head into the umem
>>allocator, free the whole linked list in the core code.
>
>Yeah, that would be okay.
>
>>
>>This has some appeal because it is an easier conversion of all the
>>drivers, instead of re-threading their flows to accept a pre-created
>>umem they just have to be updated to call the new function in all the
>>places they are currently getting umems.
>>
>>You'd probably have a further helper for cq that could extract the
>>existing common cq attrs to a ib_uverbs_buffer_desc:
>>
>>  cq_umem = uverbs_attr_get_cq_umem(attrs, cq, uhw->base, uhw->len);
>>
>>Probably similar for mr and a common mr attribute.
>
>Got it.
>
>
>>
>>This will be easier to put revocable and dynamic ops into the scheme,
>>they can be passed as arguments to the get function instead of some
>>complicated thing in the central ops structure.
>>
>>Jason

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-22 14:06       ` Jiri Pirko
@ 2026-04-22 16:51         ` Jason Gunthorpe
  2026-04-23 13:08           ` Jiri Pirko
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-22 16:51 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: linux-rdma, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

On Wed, Apr 22, 2026 at 04:06:03PM +0200, Jiri Pirko wrote:
> >>Just brainstorming, but if we let the driver pass in its uhw
> >>information into a getter:
> >>
> >>  struct ib_umem *uverbs_attr_get_umem(struct
> >>      uverbs_attr_bundle *attrs, u16 idx,
> >>      u64 uhw_umem_base, u64 umem_len);
> >>
> >>  dbr_umem = uverbs_attr_get_umem(attrs,
> >>                     MLX5_IB_ATTR_QP_DBR, uhw->base, uhw->len);
> >>
> >>Then if the new attribute is provided the uhw is ignored, otherwise an
> >>ib_uverbs_buffer_desc is created from the udata parameters instead.
> >>
> >>Drivers use the normal attr indexes to define their many umems for
> >>something complicated like QP.
> >
> >Won't this go backwards? I mean, I was under the impression that we want to
> >move the umem creation to core. What you suggest is that the driver initiates
> >the umem creation. I personally think that it is nicer the way you
> >suggest, since the core is the owner and responsible for cleanup and
> >umems are created upon need.

Well, brainstorming idea. I'd like to hear from Leon too

But if we set the general goals as:

1) All umem creations should have a struct ib_uverbs_buffer_desc at
   the UAPI boundary
2) ib_uverbs_buffer_desc should pass directly to umem code without any
   driver touching it. ib_uverbs_buffer_desc should be the only way to
   create a umem from a driver.
3) Existing uhw umem descriptions must continue to work if the desc is
   not provided, by reforming them into a desc
4) Cleanup and lifecycle should be centralized

I know the initial thinking was coloured by the CQ design which had
the core do everything, but this is echoing back to the old LWN
article "the midlayer mistake":

https://lwn.net/Articles/336262/

And here we are making the basic choice if the midlayer should alloc
the umem and pass it to the driver or the driver should call a library
function to obtain it.

The primary error to correct is principally #1, that the drivers did
not have a standardized uAPI surface so it could not be extended to
new forms of umem types.

So, for instance if we restructure the CQ to follow the library
pattern it would have drivers call some

umem = uverbs_attr_get_cq_umem(attrs, cq, uhw->base, uhw->len);

Which will internally obtain the ib_uverbs_buffer_desc:
 1) Directly from the new ib_uverbs_buffer_desc native ATTR_CQ_BUFFER attr
 2) By decoding and converting the existing attrs to
    ib_uverbs_buffer_desc
 3) By converting base/len into a VA type ib_uverbs_buffer_desc

Then just ask the umem layer to build a umem from the ib_uverbs_buffer_desc.
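
To make the three sources above concrete, here is a minimal userspace
sketch of the selection order. Everything in it is an assumption made
for illustration: the mock_* types stand in for the uverbs attr bundle
and ib_uverbs_buffer_desc, and rejecting a mix of the native attr with
the legacy attrs is just one possible policy, not something settled in
this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum mock_buffer_type { MOCK_BUFFER_TYPE_VA, MOCK_BUFFER_TYPE_DMABUF };

/* Mirrors the field layout of the proposed ib_uverbs_buffer_desc. */
struct mock_buffer_desc {
	int32_t fd;
	uint32_t type;
	uint32_t index;
	uint32_t reserved;
	uint64_t addr;
	uint64_t length;
};

/* Hypothetical view of what userspace passed for a CQ create call. */
struct mock_cq_attrs {
	const struct mock_buffer_desc *native_desc;	/* new-style attr */
	bool has_va, has_fd, has_offset, has_length;	/* legacy attrs */
	uint64_t va, offset, length;
	int32_t fd;
};

static int mock_cq_resolve_desc(const struct mock_cq_attrs *attrs,
				uint64_t uhw_base, uint64_t uhw_len,
				struct mock_buffer_desc *out)
{
	bool legacy;

	assert(attrs && out);
	legacy = attrs->has_va || attrs->has_fd || attrs->has_offset ||
		 attrs->has_length;

	/* 1) Native descriptor attr: use it as-is, uhw is ignored. */
	if (attrs->native_desc) {
		if (legacy)
			return -EINVAL;	/* assumed policy for mixed input */
		*out = *attrs->native_desc;
		return 0;
	}

	*out = (struct mock_buffer_desc){ .fd = -1 };

	/* 2) Legacy attrs: VA + LENGTH, or FD + OFFSET + LENGTH. */
	if (attrs->has_va) {
		if (!attrs->has_length || attrs->has_fd || attrs->has_offset)
			return -EINVAL;
		out->type = MOCK_BUFFER_TYPE_VA;
		out->addr = attrs->va;
		out->length = attrs->length;
		return 0;
	}
	if (attrs->has_fd) {
		if (!attrs->has_offset || !attrs->has_length)
			return -EINVAL;
		out->type = MOCK_BUFFER_TYPE_DMABUF;
		out->fd = attrs->fd;
		out->addr = attrs->offset;
		out->length = attrs->length;
		return 0;
	}
	if (legacy)
		return -EINVAL;	/* OFFSET or LENGTH without VA or FD */

	/* 3) Nothing provided: fall back to the driver's uhw base/len. */
	out->type = MOCK_BUFFER_TYPE_VA;
	out->addr = uhw_base;
	out->length = uhw_len;
	return 0;
}
```

Whatever path is taken, the caller ends up with a single descriptor to
hand to the umem layer, which is the point of goal #2.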

We can follow the same pattern for the other cases. If the uAPI has a
logical all-driver umem then have a uverbs_attr_get_XX_umem() that
uses a core attr

Otherwise use a lower level function and the driver provides a
driver-specific attr to handle its non-general umem.

> >One thing: how about the consumption checking? I remember that for my
> >previous attempt on uverb umems you asked to check if each attr was
> >processed or not and, in case it was not, yell out at the user.
> 
> Well, I think I can still track consumption per loaded attr. I'm on it.

Yeah, we need to come up with a good story for how the uAPI should
work. As above there are three CQ options, what to do if the user
provides something nonsensical? For CQ I imagine that the helper will
do it internally with if statements.

In general the uattr system doesn't validate that mandatory attributes
were read by the driver. That might be an interesting debug feature
for sure.

I think my original remark was related to the lists, it is much easier
to pass extra items in the list and that would create a uABI problem
down the road if they are silently ignored by today's kernel.

Whereas if the driver has to define mandatory attributes to pass its
unique ib_uverbs_buffer_desc I'm not worried about future ABI because
everything is now clearly labeled and the uattrs system already has a
built-in way to reject using a future kernel's driver attribute on an
older kernel.

Jason

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-22 16:51         ` Jason Gunthorpe
@ 2026-04-23 13:08           ` Jiri Pirko
  2026-04-23 15:08             ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Jiri Pirko @ 2026-04-23 13:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

Wed, Apr 22, 2026 at 06:51:01PM +0200, jgg@ziepe.ca wrote:
>On Wed, Apr 22, 2026 at 04:06:03PM +0200, Jiri Pirko wrote:
>> >>Just brainstorming, but if we let the driver pass in its uhw
>> >>information into a getter:
>> >>
>> >>  struct ib_umem *uverbs_attr_get_umem(struct
>> >>      uverbs_attr_bundle *attrs, u16 idx,
>> >>      u64 uhw_umem_base, u64 umem_len);
>> >>
>> >>  dbr_umem = uverbs_attr_get_umem(attrs,
>> >>                     MLX5_IB_ATTR_QP_DBR, uhw->base, uhw->len);
>> >>
>> >>Then if the new attribute is provided the uhw is ignored, otherwise an
>> >>ib_uverbs_buffer_desc is created from the udata parameters instead.
>> >>
>> >>Drivers use the normal attr indexes to define their many umems for
>> >>something complicated like QP.
>> >
>> >Won't this go backwards? I mean, I was under the impression that we want to
>> >move the umem creation to core. What you suggest is that the driver initiates
>> >the umem creation. I personally think that it is nicer the way you
>> >suggest, since the core is the owner and responsible for cleanup and
>> >umems are created upon need.
>
>Well, brainstorming idea. I'd like to hear from Leon too
>
>But if we set the general goals as:
>
>1) All umem creations should have a struct ib_uverbs_buffer_desc at
>   the UAPI boundary
>2) ib_uverbs_buffer_desc should pass directly to umem code without any
>   driver touching it. ib_uverbs_buffer_desc should be the only way to
>   create a umem from a driver.
>3) Existing uhw umem descriptions must continue to work if the desc is
>   not provided, by reforming them into a desc
>4) Cleanup and lifecycle should be centralized

Agreed.


>
>I know the initial thinking was coloured by the CQ design which had
>the core do everything, but this is echoing back to the old LWN
>article "the midlayer mistake":
>
>https://lwn.net/Articles/336262/
>
>And here we are making the basic choice if the midlayer should alloc
>the umem and pass it to the driver or the driver should call a library
>function to obtain it.
>
>The primary error to correct is principally #1, that the drivers did
>not have a standardized uAPI surface so it could not be extended to
>new forms of umem types.
>
>So, for instance if we restructure the CQ to follow the library
>pattern it would have drivers call some
>
>umem = uverbs_attr_get_cq_umem(attrs, cq, uhw->base, uhw->len);
>
>Which will internally obtain the ib_uverbs_buffer_desc:
> 1) Directly from the new ib_uverbs_buffer_desc native ATTR_CQ_BUFFER attr
> 2) By decoding and converting the existing attrs to
>    ib_uverbs_buffer_desc
> 3) By converting base/len into a VA type ib_uverbs_buffer_desc
>
>Then just ask the umem layer to build a umem from the ib_uverbs_buffer_desc.

Yep. I have that planned out.


>
>We can follow the same pattern for the other cases. If the uAPI has a
>logical all-driver umem then have a uverbs_attr_get_XX_umem() that
>uses a core attr
>
>Otherwise use a lower level function and the driver provides a
>driver-specific attr to handle its non-general umem.
>
>> >One thing: how about the consumption checking? I remember that for my
>> >previous attempt on uverb umems you asked to check if each attr was
>> >processed or not and, in case it was not, yell out at the user.
>> 
>> Well, I think I can still track consumption per loaded attr. I'm on it.
>
>Yeah, we need to come up with a good story for how the uAPI should
>work. As above there are three CQ options, what to do if the user
>provides something nonsensical? For CQ I imagine that the helper will
>do it internally with if statements.
>
>In general the uattr system doesn't validate that mandatory attributes
>were read by the driver. That might be an interesting debug feature
>for sure.
>
>I think my original remark was related to the lists, it is much easier
>to pass extra items in the list and that would create a uABI problem
>down the road if they are silently ignored by today's kernel.
>
>Whereas if the driver has to define mandatory attributes to pass its
>unique ib_uverbs_buffer_desc I'm not worried about future ABI because
>everything is now clearly labeled and the uattrs system already has a
>built-in way to reject using a future kernel's driver attribute on an
>older kernel.

Hmm, but the attr may be optional, yet silently ignored (for example by
a driver that does not support it). I think we still need to sanitize such
silent ignores.


>
>Jason

* Re: [PATCH rdma-next v2 01/15] RDMA/core: Introduce generic buffer descriptor infrastructure for umem
  2026-04-23 13:08           ` Jiri Pirko
@ 2026-04-23 15:08             ` Jason Gunthorpe
  0 siblings, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-23 15:08 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: linux-rdma, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

On Thu, Apr 23, 2026 at 03:08:19PM +0200, Jiri Pirko wrote:
> >Whereas if the driver has to define mandatory attributes to pass its
> >unique ib_uverbs_buffer_desc I'm not worried about future ABI because
> >everything is now clearly labeled and the uattrs system already has a
> >built-in way to reject using a future kernel's driver attribute on an
> >older kernel.
> 
> Hmm, but the attr may be optional, yet silently ignored (for example by
> a driver that does not support it). I think we still need to sanitize such
> silent ignores.

The kernel schema describes attributes as mandatory or optional. If a
mandatory attribute is missing then the ioctl will fail before
reaching the handler.

Userspace can also describe the attribute it is passing in as
mandatory or optional via UVERBS_ATTR_F_MANDATORY. If this flag is
set and the kernel does not have the attribute in its schema then the
ioctl will fail before invoking the handler.

The only remaining case is where the driver has an attribute in its
schema and doesn't actually use it for some reason. I'm not sure this
is so important to worry about, at least from an ABI perspective
everything is properly labeled and there won't be forward/backwards
compat issues.
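
For illustration, the two pre-handler checks described above can be
modelled in a few lines of userspace C. This is only a sketch under
stated assumptions: the mock_* types and MOCK_ATTR_F_MANDATORY are
hypothetical stand-ins for the uverbs schema and for
UVERBS_ATTR_F_MANDATORY, and the errno values are illustrative rather
than a statement of what the kernel returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for UVERBS_ATTR_F_MANDATORY on a userspace-supplied attr. */
#define MOCK_ATTR_F_MANDATORY 0x1u

/* One entry of the kernel-side schema for a method. */
struct mock_schema_attr {
	unsigned int id;
	bool mandatory;
};

/* One attribute as passed in by userspace. */
struct mock_user_attr {
	unsigned int id;
	unsigned int flags;
};

static bool mock_schema_has(const struct mock_schema_attr *schema,
			    size_t nschema, unsigned int id)
{
	for (size_t i = 0; i < nschema; i++)
		if (schema[i].id == id)
			return true;
	return false;
}

static bool mock_user_has(const struct mock_user_attr *uattrs,
			  size_t nuattrs, unsigned int id)
{
	for (size_t i = 0; i < nuattrs; i++)
		if (uattrs[i].id == id)
			return true;
	return false;
}

/* Runs before any handler would be invoked; 0 means "call the handler". */
static int mock_validate_attrs(const struct mock_schema_attr *schema,
			       size_t nschema,
			       const struct mock_user_attr *uattrs,
			       size_t nuattrs)
{
	assert(schema || !nschema);
	assert(uattrs || !nuattrs);

	/* Userspace marked an attr mandatory that this kernel does not know. */
	for (size_t i = 0; i < nuattrs; i++)
		if (!mock_schema_has(schema, nschema, uattrs[i].id) &&
		    (uattrs[i].flags & MOCK_ATTR_F_MANDATORY))
			return -EPROTONOSUPPORT;

	/* The kernel schema marks an attr mandatory and it was not passed. */
	for (size_t i = 0; i < nschema; i++)
		if (schema[i].mandatory &&
		    !mock_user_has(uattrs, nuattrs, schema[i].id))
			return -EINVAL;

	return 0;
}
```

Note that neither check covers the remaining case: an attribute that is
in the schema and was passed in, but that the handler never reads.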

Jason

* [PATCH rdma-next v2 02/15] RDMA/uverbs: Push out CQ buffer umem processing into a helper
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 01/15] RDMA/core: " Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-21 13:25   ` Jason Gunthorpe
  2026-04-11 14:49 ` [PATCH rdma-next v2 03/15] RDMA/uverbs: Integrate umem_list into CQ creation Jiri Pirko
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Extract the UVERBS_ATTR_CREATE_CQ_BUFFER_* attribute processing from
the CQ create handler into uverbs_create_cq_get_umem() and separate
buffer acquisition logic from the rest of CQ creation.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
 drivers/infiniband/core/uverbs_std_types_cq.c | 127 ++++++++++--------
 1 file changed, 69 insertions(+), 58 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_std_types_cq.c b/drivers/infiniband/core/uverbs_std_types_cq.c
index d2c8f71f934c..4afe27fef6c9 100644
--- a/drivers/infiniband/core/uverbs_std_types_cq.c
+++ b/drivers/infiniband/core/uverbs_std_types_cq.c
@@ -58,6 +58,72 @@ static int uverbs_free_cq(struct ib_uobject *uobject,
 	return 0;
 }
 
+static struct ib_umem *uverbs_create_cq_get_umem(struct ib_device *ib_dev,
+						  struct uverbs_attr_bundle *attrs)
+{
+	struct ib_umem_dmabuf *umem_dmabuf;
+	u64 buffer_length;
+	u64 buffer_offset;
+	u64 buffer_va;
+	int buffer_fd;
+	int ret;
+
+	if (uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_VA)) {
+		ret = uverbs_copy_from(&buffer_va, attrs,
+				       UVERBS_ATTR_CREATE_CQ_BUFFER_VA);
+		if (ret)
+			return ERR_PTR(ret);
+
+		ret = uverbs_copy_from(&buffer_length, attrs,
+				       UVERBS_ATTR_CREATE_CQ_BUFFER_LENGTH);
+		if (ret)
+			return ERR_PTR(ret);
+
+		if (uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_FD) ||
+		    uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_OFFSET) ||
+		    !ib_dev->ops.create_user_cq)
+			return ERR_PTR(-EINVAL);
+
+		return ib_umem_get(ib_dev, buffer_va, buffer_length,
+				   IB_ACCESS_LOCAL_WRITE);
+	}
+
+	if (uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_FD)) {
+		ret = uverbs_get_raw_fd(&buffer_fd, attrs,
+					UVERBS_ATTR_CREATE_CQ_BUFFER_FD);
+		if (ret)
+			return ERR_PTR(ret);
+
+		ret = uverbs_copy_from(&buffer_offset, attrs,
+				       UVERBS_ATTR_CREATE_CQ_BUFFER_OFFSET);
+		if (ret)
+			return ERR_PTR(ret);
+
+		ret = uverbs_copy_from(&buffer_length, attrs,
+				       UVERBS_ATTR_CREATE_CQ_BUFFER_LENGTH);
+		if (ret)
+			return ERR_PTR(ret);
+
+		if (uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_VA) ||
+		    !ib_dev->ops.create_user_cq)
+			return ERR_PTR(-EINVAL);
+
+		umem_dmabuf = ib_umem_dmabuf_get_pinned(ib_dev, buffer_offset,
+							buffer_length, buffer_fd,
+							IB_ACCESS_LOCAL_WRITE);
+		if (IS_ERR(umem_dmabuf))
+			return ERR_CAST(umem_dmabuf);
+		return &umem_dmabuf->umem;
+	}
+
+	if (uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_OFFSET) ||
+	    uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_LENGTH) ||
+	    !ib_dev->ops.create_cq)
+		return ERR_PTR(-EINVAL);
+
+	return NULL;
+}
+
 static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
 	struct uverbs_attr_bundle *attrs)
 {
@@ -66,16 +132,11 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
 		typeof(*obj), uevent.uobject);
 	struct ib_uverbs_completion_event_file *ev_file = NULL;
 	struct ib_device *ib_dev = attrs->context->device;
-	struct ib_umem_dmabuf *umem_dmabuf;
 	struct ib_cq_init_attr attr = {};
 	struct ib_uobject *ev_file_uobj;
 	struct ib_umem *umem = NULL;
-	u64 buffer_length;
-	u64 buffer_offset;
 	struct ib_cq *cq;
 	u64 user_handle;
-	u64 buffer_va;
-	int buffer_fd;
 	int ret;
 
 	if ((!ib_dev->ops.create_cq && !ib_dev->ops.create_user_cq) ||
@@ -122,59 +183,9 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
 	INIT_LIST_HEAD(&obj->comp_list);
 	INIT_LIST_HEAD(&obj->uevent.event_list);
 
-	if (uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_VA)) {
-
-		ret = uverbs_copy_from(&buffer_va, attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_VA);
-		if (ret)
-			goto err_event_file;
-
-		ret = uverbs_copy_from(&buffer_length, attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_LENGTH);
-		if (ret)
-			goto err_event_file;
-
-		if (uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_FD) ||
-		    uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_OFFSET) ||
-		    !ib_dev->ops.create_user_cq) {
-			ret = -EINVAL;
-			goto err_event_file;
-		}
-
-		umem = ib_umem_get(ib_dev, buffer_va, buffer_length, IB_ACCESS_LOCAL_WRITE);
-		if (IS_ERR(umem)) {
-			ret = PTR_ERR(umem);
-			goto err_event_file;
-		}
-	} else if (uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_FD)) {
-
-		ret = uverbs_get_raw_fd(&buffer_fd, attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_FD);
-		if (ret)
-			goto err_event_file;
-
-		ret = uverbs_copy_from(&buffer_offset, attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_OFFSET);
-		if (ret)
-			goto err_event_file;
-
-		ret = uverbs_copy_from(&buffer_length, attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_LENGTH);
-		if (ret)
-			goto err_event_file;
-
-		if (uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_VA) ||
-		    !ib_dev->ops.create_user_cq) {
-			ret = -EINVAL;
-			goto err_event_file;
-		}
-
-		umem_dmabuf = ib_umem_dmabuf_get_pinned(ib_dev, buffer_offset, buffer_length,
-							buffer_fd, IB_ACCESS_LOCAL_WRITE);
-		if (IS_ERR(umem_dmabuf)) {
-			ret = PTR_ERR(umem_dmabuf);
-			goto err_event_file;
-		}
-		umem = &umem_dmabuf->umem;
-	} else if (uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_OFFSET) ||
-		   uverbs_attr_is_valid(attrs, UVERBS_ATTR_CREATE_CQ_BUFFER_LENGTH) ||
-		   !ib_dev->ops.create_cq) {
-		ret = -EINVAL;
+	umem = uverbs_create_cq_get_umem(ib_dev, attrs);
+	if (IS_ERR(umem)) {
+		ret = PTR_ERR(umem);
 		goto err_event_file;
 	}
 
-- 
2.53.0


* Re: [PATCH rdma-next v2 02/15] RDMA/uverbs: Push out CQ buffer umem processing into a helper
  2026-04-11 14:49 ` [PATCH rdma-next v2 02/15] RDMA/uverbs: Push out CQ buffer umem processing into a helper Jiri Pirko
@ 2026-04-21 13:25   ` Jason Gunthorpe
  2026-04-22 10:56     ` Jiri Pirko
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-21 13:25 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: linux-rdma, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

On Sat, Apr 11, 2026 at 04:49:02PM +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@nvidia.com>
> 
> Extract the UVERBS_ATTR_CREATE_CQ_BUFFER_* attribute processing from
> the CQ create handler into uverbs_create_cq_get_umem() and separate
> buffer acquisition logic from the rest of CQ creation.
> 
> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
> ---
>  drivers/infiniband/core/uverbs_std_types_cq.c | 127 ++++++++++--------
>  1 file changed, 69 insertions(+), 58 deletions(-)
> 
> diff --git a/drivers/infiniband/core/uverbs_std_types_cq.c b/drivers/infiniband/core/uverbs_std_types_cq.c
> index d2c8f71f934c..4afe27fef6c9 100644
> --- a/drivers/infiniband/core/uverbs_std_types_cq.c
> +++ b/drivers/infiniband/core/uverbs_std_types_cq.c
> @@ -58,6 +58,72 @@ static int uverbs_free_cq(struct ib_uobject *uobject,
>  	return 0;
>  }
>  
> +static struct ib_umem *uverbs_create_cq_get_umem(struct ib_device *ib_dev,
> +						  struct uverbs_attr_bundle *attrs)
> +{

I suggest making a function like this:

int uverbs_create_cq_to_umem_desc(struct uverbs_attr_bundle *attrs,
                                  struct ib_uverbs_buffer_desc *desc);

And let's focus the umem code on working consistently with struct
ib_uverbs_buffer_desc.

I.e., as a general plan, let's try to convert all the different
descriptions we have in the uapi for umems into an
ib_uverbs_buffer_desc and convert that to a umem?

Broadly I'd imagine introducing a new uattr for CQ to pass the
ib_uverbs_buffer_desc as well so the end result of all this churn has
the option for every umem to be described by ib_uverbs_buffer_desc at
the uapi boundary.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread
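
A rough userspace sketch of the conversion suggested above: validate the legacy per-attribute CQ buffer inputs and fold them into one descriptor, which a later step would turn into a umem. Everything here is an assumption made for illustration. `struct buf_desc` is a hypothetical stand-in for struct ib_uverbs_buffer_desc (its real layout comes from patch 01 and is not shown in this part of the thread), `struct legacy_cq_attrs` is a hypothetical flattened view of the VA/FD/OFFSET/LENGTH attributes, and the validation rules only approximate the checks visible in the removed handler code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ib_uverbs_buffer_desc. */
enum buf_desc_type { BUF_DESC_NONE, BUF_DESC_VA, BUF_DESC_DMABUF };

struct buf_desc {
	enum buf_desc_type type;
	uint64_t addr;		/* user VA, or offset into the dma-buf */
	uint64_t length;
	int fd;			/* dma-buf fd, -1 when unused */
};

/* Hypothetical flattened view of the legacy single attributes. */
struct legacy_cq_attrs {
	bool has_va, has_fd, has_offset, has_length;
	uint64_t va, offset, length;
	int fd;
};

/* Returns 0 and fills *desc, or -EINVAL for an inconsistent attribute mix. */
int legacy_cq_attrs_to_desc(const struct legacy_cq_attrs *a,
			    struct buf_desc *desc)
{
	assert(a && desc);
	*desc = (struct buf_desc){ .type = BUF_DESC_NONE, .fd = -1 };

	if (a->has_va) {
		/* A VA buffer needs a length and excludes the dma-buf attrs. */
		if (a->has_fd || a->has_offset || !a->has_length)
			return -EINVAL;
		desc->type = BUF_DESC_VA;
		desc->addr = a->va;
		desc->length = a->length;
		return 0;
	}
	if (a->has_fd) {
		/* A dma-buf buffer needs both an offset and a length. */
		if (!a->has_offset || !a->has_length)
			return -EINVAL;
		desc->type = BUF_DESC_DMABUF;
		desc->fd = a->fd;
		desc->addr = a->offset;
		desc->length = a->length;
		return 0;
	}
	/* No buffer given: stray offset/length attributes are an error. */
	if (a->has_offset || a->has_length)
		return -EINVAL;
	return 0;
}
```

With a single descriptor type on the inside, the umem code would only need one "descriptor to umem" path, whichever uapi attribute the description arrived through.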

* Re: [PATCH rdma-next v2 02/15] RDMA/uverbs: Push out CQ buffer umem processing into a helper
  2026-04-21 13:25   ` Jason Gunthorpe
@ 2026-04-22 10:56     ` Jiri Pirko
  2026-04-22 16:32       ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Jiri Pirko @ 2026-04-22 10:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

Tue, Apr 21, 2026 at 03:25:32PM +0200, jgg@ziepe.ca wrote:
>On Sat, Apr 11, 2026 at 04:49:02PM +0200, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@nvidia.com>
>> 
>> Extract the UVERBS_ATTR_CREATE_CQ_BUFFER_* attribute processing from
>> the CQ create handler into uverbs_create_cq_get_umem() and separate
>> buffer acquisition logic from the rest of CQ creation.
>> 
>> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
>> ---
>>  drivers/infiniband/core/uverbs_std_types_cq.c | 127 ++++++++++--------
>>  1 file changed, 69 insertions(+), 58 deletions(-)
>> 
>> diff --git a/drivers/infiniband/core/uverbs_std_types_cq.c b/drivers/infiniband/core/uverbs_std_types_cq.c
>> index d2c8f71f934c..4afe27fef6c9 100644
>> --- a/drivers/infiniband/core/uverbs_std_types_cq.c
>> +++ b/drivers/infiniband/core/uverbs_std_types_cq.c
>> @@ -58,6 +58,72 @@ static int uverbs_free_cq(struct ib_uobject *uobject,
>>  	return 0;
>>  }
>>  
>> +static struct ib_umem *uverbs_create_cq_get_umem(struct ib_device *ib_dev,
>> +						  struct uverbs_attr_bundle *attrs)
>> +{
>
>I suggest making a function like this:
>
>int uverbs_create_cq_to_umem_desc(struct uverbs_attr_bundle *attrs,
>                                  struct ib_uverbs_buffer_desc *dec);
>
>And let's focus the umem code on working consistently with struct
>ib_uverbs_buffer_desc.

Okay, makes sense.


>
>I.e., as a general plan, let's try to convert all the different
>descriptions we have in the uapi for umems into an
>ib_uverbs_buffer_desc and convert that to a umem?
>
>Broadly I'd imagine introducing a new uattr for CQ to pass the
>ib_uverbs_buffer_desc as well so the end result of all this churn has
>the option for every umem to be described by ib_uverbs_buffer_desc at
>the uapi boundary.

Wait, I'm missing something. I'm already introducing the BUFFERS attr
that passes a list of ib_uverbs_buffer_desc. What exactly do you mean
here?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH rdma-next v2 02/15] RDMA/uverbs: Push out CQ buffer umem processing into a helper
  2026-04-22 10:56     ` Jiri Pirko
@ 2026-04-22 16:32       ` Jason Gunthorpe
  0 siblings, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-22 16:32 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: linux-rdma, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

On Wed, Apr 22, 2026 at 12:56:52PM +0200, Jiri Pirko wrote:
> >Broadly I'd imagine introducing a new uattr for CQ to pass the
> >ib_uverbs_buffer_desc as well so the end result of all this churn has
> >the option for every umem to be described by ib_uverbs_buffer_desc at
> >the uapi boundary.
> 
> Wait, I'm missing something. I'm already introducing the BUFFERS attr
> that passes a list of ib_uverbs_buffer_desc. What exactly do you mean
> here?

Yeah, that's what I mean: however it is done, every API should get an
ib_uverbs_buffer_desc. This series does exactly that with the BUFFERS
attr.

Meaning it would supersede the existing mass of single attrs in CQ;
that design doesn't seem to have turned out so well, unfortunately.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH rdma-next v2 03/15] RDMA/uverbs: Integrate umem_list into CQ creation
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 01/15] RDMA/core: " Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 02/15] RDMA/uverbs: Push out CQ buffer umem processing into a helper Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 04/15] RDMA/efa: Use umem_list for user CQ buffer Jiri Pirko
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Wire up the generic buffer descriptor infrastructure to the CQ create
command, with a fallback to the existing per-attribute path. Add a
umem_list field to struct ib_cq and define the CQ buffer slot enum.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
 drivers/infiniband/core/uverbs_cmd.c          | 15 +++++++++++--
 drivers/infiniband/core/uverbs_std_types_cq.c | 22 ++++++++++++++-----
 drivers/infiniband/core/verbs.c               |  9 +++++---
 include/rdma/ib_verbs.h                       |  2 ++
 include/uapi/rdma/ib_user_ioctl_cmds.h        |  6 +++++
 5 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index a768436ba468..77874834108b 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -42,6 +42,7 @@
 
 #include <rdma/uverbs_types.h>
 #include <rdma/uverbs_std_types.h>
+#include <rdma/ib_umem.h>
 #include <rdma/ib_ucaps.h>
 #include "rdma_core.h"
 
@@ -1011,6 +1012,7 @@ static int create_cq(struct uverbs_attr_bundle *attrs,
 {
 	struct ib_ucq_object           *obj;
 	struct ib_uverbs_completion_event_file    *ev_file = NULL;
+	struct ib_umem_list	       *umem_list;
 	struct ib_cq                   *cq;
 	int                             ret;
 	struct ib_uverbs_ex_create_cq_resp resp = {};
@@ -1044,16 +1046,23 @@ static int create_cq(struct uverbs_attr_bundle *attrs,
 	attr.comp_vector = cmd->comp_vector;
 	attr.flags = cmd->flags;
 
+	umem_list = ib_umem_list_create(ib_dev, attrs, UVERBS_BUF_CQ_MAX);
+	if (IS_ERR(umem_list)) {
+		ret = PTR_ERR(umem_list);
+		goto err_file;
+	}
+
 	cq = rdma_zalloc_drv_obj(ib_dev, ib_cq);
 	if (!cq) {
 		ret = -ENOMEM;
-		goto err_file;
+		goto err_list_release;
 	}
 	cq->device        = ib_dev;
 	cq->uobject       = obj;
 	cq->comp_handler  = ib_uverbs_comp_handler;
 	cq->event_handler = ib_uverbs_cq_event_handler;
 	cq->cq_context    = ev_file ? &ev_file->ev_queue : NULL;
+	cq->umem_list     = umem_list;
 	atomic_set(&cq->usecnt, 0);
 
 	rdma_restrack_new(&cq->res, RDMA_RESTRACK_CQ);
@@ -1079,9 +1088,11 @@ static int create_cq(struct uverbs_attr_bundle *attrs,
 	return uverbs_response(attrs, &resp, sizeof(resp));
 
 err_free:
-	ib_umem_release(cq->umem);
+	ib_umem_release_non_listed(umem_list, UVERBS_BUF_CQ_BUF, cq->umem);
 	rdma_restrack_put(&cq->res);
 	kfree(cq);
+err_list_release:
+	ib_umem_list_release(umem_list);
 err_file:
 	if (ev_file)
 		ib_uverbs_release_ucq(ev_file, obj);
diff --git a/drivers/infiniband/core/uverbs_std_types_cq.c b/drivers/infiniband/core/uverbs_std_types_cq.c
index 4afe27fef6c9..f87cd11470fc 100644
--- a/drivers/infiniband/core/uverbs_std_types_cq.c
+++ b/drivers/infiniband/core/uverbs_std_types_cq.c
@@ -134,6 +134,7 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
 	struct ib_device *ib_dev = attrs->context->device;
 	struct ib_cq_init_attr attr = {};
 	struct ib_uobject *ev_file_uobj;
+	struct ib_umem_list *umem_list;
 	struct ib_umem *umem = NULL;
 	struct ib_cq *cq;
 	u64 user_handle;
@@ -183,17 +184,24 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
 	INIT_LIST_HEAD(&obj->comp_list);
 	INIT_LIST_HEAD(&obj->uevent.event_list);
 
+	umem_list = ib_umem_list_create(ib_dev, attrs, UVERBS_BUF_CQ_MAX);
+	if (IS_ERR(umem_list)) {
+		ret = PTR_ERR(umem_list);
+		goto err_event_file;
+	}
+
 	umem = uverbs_create_cq_get_umem(ib_dev, attrs);
 	if (IS_ERR(umem)) {
 		ret = PTR_ERR(umem);
-		goto err_event_file;
+		goto err_umem_list;
 	}
+	if (umem)
+		ib_umem_list_insert(umem_list, UVERBS_BUF_CQ_BUF, umem);
 
 	cq = rdma_zalloc_drv_obj(ib_dev, ib_cq);
 	if (!cq) {
 		ret = -ENOMEM;
-		ib_umem_release(umem);
-		goto err_event_file;
+		goto err_umem_list;
 	}
 
 	cq->device        = ib_dev;
@@ -206,6 +214,7 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
 	 * CQ creation based on their internal udata.
 	 */
 	cq->umem = umem;
+	cq->umem_list     = umem_list;
 	atomic_set(&cq->usecnt, 0);
 
 	rdma_restrack_new(&cq->res, RDMA_RESTRACK_CQ);
@@ -231,9 +240,11 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
 	return ret;
 
 err_free:
-	ib_umem_release(cq->umem);
+	ib_umem_release_non_listed(umem_list, UVERBS_BUF_CQ_BUF, cq->umem);
 	rdma_restrack_put(&cq->res);
 	kfree(cq);
+err_umem_list:
+	ib_umem_list_release(umem_list);
 err_event_file:
 	if (obj->uevent.event_file)
 		uverbs_uobject_put(&obj->uevent.event_file->uobj);
@@ -281,7 +292,8 @@ DECLARE_UVERBS_NAMED_METHOD(
 	UVERBS_ATTR_PTR_IN(UVERBS_ATTR_CREATE_CQ_BUFFER_OFFSET,
 			   UVERBS_ATTR_TYPE(u64),
 			   UA_OPTIONAL),
-	UVERBS_ATTR_UHW());
+	UVERBS_ATTR_UHW(),
+	UVERBS_ATTR_BUFFERS());
 
 static int UVERBS_HANDLER(UVERBS_METHOD_CQ_DESTROY)(
 	struct uverbs_attr_bundle *attrs)
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index bac87de9cc67..ed163fc56ef8 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -50,6 +50,7 @@
 #include <rdma/ib_cache.h>
 #include <rdma/ib_addr.h>
 #include <rdma/ib_umem.h>
+#include <rdma/ib_user_ioctl_cmds.h>
 #include <rdma/rw.h>
 #include <rdma/lag.h>
 
@@ -2223,9 +2224,9 @@ struct ib_cq *__ib_create_cq(struct ib_device *device,
 	}
 	/*
 	 * We are in kernel verbs flow and drivers are not allowed
-	 * to set umem pointer, it needs to stay NULL.
+	 * to set umem or umem_list pointers, they need to stay NULL.
 	 */
-	WARN_ON_ONCE(cq->umem);
+	WARN_ON_ONCE(cq->umem || cq->umem_list);
 
 	rdma_restrack_add(&cq->res);
 	return cq;
@@ -2245,6 +2246,7 @@ EXPORT_SYMBOL(rdma_set_cq_moderation);
 
 int ib_destroy_cq_user(struct ib_cq *cq, struct ib_udata *udata)
 {
+	struct ib_umem_list *umem_list = cq->umem_list;
 	int ret;
 
 	if (WARN_ON_ONCE(cq->shared))
@@ -2257,9 +2259,10 @@ int ib_destroy_cq_user(struct ib_cq *cq, struct ib_udata *udata)
 	if (ret)
 		return ret;
 
-	ib_umem_release(cq->umem);
+	ib_umem_release_non_listed(umem_list, UVERBS_BUF_CQ_BUF, cq->umem);
 	rdma_restrack_del(&cq->res);
 	kfree(cq);
+	ib_umem_list_release(umem_list);
 	return ret;
 }
 EXPORT_SYMBOL(ib_destroy_cq_user);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9dd76f489a0b..dd6c0d68497d 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1740,6 +1740,8 @@ struct ib_cq {
 	unsigned int comp_vector;
 	struct ib_umem *umem;
 
+	struct ib_umem_list    *umem_list;
+
 	/*
 	 * Implementation details of the RDMA core, don't use in drivers:
 	 */
diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
index 10aa6568abf1..375e4e224f6a 100644
--- a/include/uapi/rdma/ib_user_ioctl_cmds.h
+++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
@@ -120,6 +120,12 @@ enum uverbs_attrs_create_cq_cmd_attr_ids {
 	UVERBS_ATTR_CREATE_CQ_BUFFER_OFFSET,
 };
 
+enum uverbs_buf_cq_slots {
+	UVERBS_BUF_CQ_BUF,
+	__UVERBS_BUF_CQ_MAX,
+	UVERBS_BUF_CQ_MAX = __UVERBS_BUF_CQ_MAX - 1,
+};
+
 enum uverbs_attrs_destroy_cq_cmd_attr_ids {
 	UVERBS_ATTR_DESTROY_CQ_HANDLE,
 	UVERBS_ATTR_DESTROY_CQ_RESP,
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 81+ messages in thread
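
A small standalone sketch of the slot enum idiom used in the patch above. The enum is copied from the uapi hunk; the two helpers are hypothetical and only illustrate the arithmetic. It assumes the third argument of ib_umem_list_create() is the highest valid slot index rather than a count, which the `_MAX` naming suggests but this part of the thread does not confirm.

```c
#include <assert.h>
#include <stdbool.h>

/* Copied from the include/uapi/rdma/ib_user_ioctl_cmds.h hunk. */
enum uverbs_buf_cq_slots {
	UVERBS_BUF_CQ_BUF,
	__UVERBS_BUF_CQ_MAX,
	UVERBS_BUF_CQ_MAX = __UVERBS_BUF_CQ_MAX - 1,
};

/* Hypothetical helper: number of slots implied by a highest slot index. */
unsigned int cq_slot_count(unsigned int max_slot)
{
	return max_slot + 1;
}

/* Hypothetical helper: is this slot index within the CQ slot range? */
bool cq_slot_valid(unsigned int slot)
{
	return slot <= UVERBS_BUF_CQ_MAX;
}
```

Because `__UVERBS_BUF_CQ_MAX` is a trailing sentinel, adding a second CQ buffer slot before it would bump `UVERBS_BUF_CQ_MAX` automatically, with no change needed at the call sites that pass it.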

* [PATCH rdma-next v2 04/15] RDMA/efa: Use umem_list for user CQ buffer
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
                   ` (2 preceding siblings ...)
  2026-04-11 14:49 ` [PATCH rdma-next v2 03/15] RDMA/uverbs: Integrate umem_list into CQ creation Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 05/15] RDMA/mlx5: " Jiri Pirko
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Load the CQ buffer using ib_umem_list_load() instead of ibcq->umem.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
 drivers/infiniband/hw/efa/efa_verbs.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
index 7bd0838ebc99..b3236a40b87f 100644
--- a/drivers/infiniband/hw/efa/efa_verbs.c
+++ b/drivers/infiniband/hw/efa/efa_verbs.c
@@ -1124,6 +1124,7 @@ int efa_create_user_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 	struct efa_ibv_create_cq cmd;
 	struct efa_cq *cq = to_ecq(ibcq);
 	int entries = attr->cqe;
+	struct ib_umem *umem;
 	bool set_src_addr;
 	int err;
 
@@ -1172,20 +1173,18 @@ int efa_create_user_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 	cq->ucontext = ucontext;
 	cq->size = PAGE_ALIGN(cmd.cq_entry_size * entries * cmd.num_sub_cqs);
 
-	if (ibcq->umem) {
-		if (ibcq->umem->length < cq->size) {
-			ibdev_dbg(&dev->ibdev, "External memory too small\n");
-			err = -EINVAL;
-			goto err_out;
-		}
-
-		if (!ib_umem_is_contiguous(ibcq->umem)) {
+	umem = ib_umem_list_load(ibcq->umem_list, UVERBS_BUF_CQ_BUF, cq->size);
+	if (IS_ERR(umem)) {
+		err = PTR_ERR(umem);
+		goto err_out;
+	} else if (umem) {
+		if (!ib_umem_is_contiguous(umem)) {
 			ibdev_dbg(&dev->ibdev, "Non contiguous CQ unsupported\n");
 			err = -EINVAL;
 			goto err_out;
 		}
 
-		cq->dma_addr = ib_umem_start_dma_addr(ibcq->umem);
+		cq->dma_addr = ib_umem_start_dma_addr(umem);
 	} else {
 		cq->cpu_addr = efa_zalloc_mapped(dev, &cq->dma_addr, cq->size,
 						 DMA_FROM_DEVICE);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 81+ messages in thread
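
A userspace model of the three-way result the efa hunk above branches on: an error pointer, NULL, or a usable umem. The names `umem_list_load`, `struct umem` and `struct umem_list` are hypothetical, and the size check is inferred from the fact that the driver's own "External memory too small" test was dropped while `cq->size` is now passed to ib_umem_list_load(). The real helper is defined in patch 01 and may behave differently.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal userspace copies of the kernel ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

#define NSLOTS 1

struct umem { size_t length; };
struct umem_list { struct umem *slots[NSLOTS]; };

/*
 * Model of the assumed contract:
 *   NULL      - no list, or nothing supplied for this slot (driver
 *               falls back to its own allocation)
 *   ERR_PTR() - a buffer was supplied but is smaller than min_len
 *   otherwise - the supplied umem, still owned by the list
 */
struct umem *umem_list_load(struct umem_list *list, unsigned int slot,
			    size_t min_len)
{
	struct umem *u;

	assert(slot < NSLOTS);
	if (!list)
		return NULL;
	u = list->slots[slot];
	if (!u)
		return NULL;
	if (u->length < min_len)
		return ERR_PTR(-EINVAL);
	return u;
}
```

A caller shaped like the efa one would test IS_ERR() first, then the non-NULL case, and treat NULL as "allocate a kernel-owned buffer instead".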

* [PATCH rdma-next v2 05/15] RDMA/mlx5: Use umem_list for user CQ buffer
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
                   ` (3 preceding siblings ...)
  2026-04-11 14:49 ` [PATCH rdma-next v2 04/15] RDMA/efa: Use umem_list for user CQ buffer Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 06/15] RDMA/bnxt_re: " Jiri Pirko
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Use ib_umem_list_load_or_get() and ib_umem_list_replace() to access
the CQ buffer umem through the umem_list instead of ibcq->umem.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
 drivers/infiniband/hw/mlx5/cq.c | 35 +++++++++++++++------------------
 1 file changed, 16 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index a76b7a36087d..bb9ed7caec67 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -727,6 +727,7 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
 	int ncont;
 	void *cqc;
 	int err;
+	struct ib_umem *umem;
 	struct mlx5_ib_ucontext *context = rdma_udata_to_drv_context(
 		udata, struct mlx5_ib_ucontext, ibucontext);
 
@@ -745,31 +746,29 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
 
 	*cqe_size = ucmd.cqe_size;
 
-	if (!cq->ibcq.umem)
-		cq->ibcq.umem = ib_umem_get(&dev->ib_dev, ucmd.buf_addr,
-					    entries * ucmd.cqe_size,
-					    IB_ACCESS_LOCAL_WRITE);
-	if (IS_ERR(cq->ibcq.umem))
-		return PTR_ERR(cq->ibcq.umem);
+	umem = ib_umem_list_load_or_get(cq->ibcq.umem_list, UVERBS_BUF_CQ_BUF,
+					&dev->ib_dev, ucmd.buf_addr,
+					entries * ucmd.cqe_size,
+					IB_ACCESS_LOCAL_WRITE);
+	if (IS_ERR(umem))
+		return PTR_ERR(umem);
 
 	page_size = mlx5_umem_find_best_cq_quantized_pgoff(
-		cq->ibcq.umem, cqc, log_page_size, MLX5_ADAPTER_PAGE_SHIFT,
+		umem, cqc, log_page_size, MLX5_ADAPTER_PAGE_SHIFT,
 		page_offset, 64, &page_offset_quantized);
-	if (!page_size) {
-		err = -EINVAL;
-		goto err_umem;
-	}
+	if (!page_size)
+		return -EINVAL;
 
 	err = mlx5_ib_db_map_user(context, ucmd.db_addr, &cq->db);
 	if (err)
-		goto err_umem;
+		return err;
 
-	ncont = ib_umem_num_dma_blocks(cq->ibcq.umem, page_size);
+	ncont = ib_umem_num_dma_blocks(umem, page_size);
 	mlx5_ib_dbg(
 		dev,
 		"addr 0x%llx, size %u, npages %zu, page_size %lu, ncont %d\n",
 		ucmd.buf_addr, entries * ucmd.cqe_size,
-		ib_umem_num_pages(cq->ibcq.umem), page_size, ncont);
+		ib_umem_num_pages(umem), page_size, ncont);
 
 	*inlen = MLX5_ST_SZ_BYTES(create_cq_in) +
 		 MLX5_FLD_SZ_BYTES(create_cq_in, pas[0]) * ncont;
@@ -780,7 +779,7 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
 	}
 
 	pas = (__be64 *)MLX5_ADDR_OF(create_cq_in, *cqb, pas);
-	mlx5_ib_populate_pas(cq->ibcq.umem, page_size, pas, 0);
+	mlx5_ib_populate_pas(umem, page_size, pas, 0);
 
 	cqc = MLX5_ADDR_OF(create_cq_in, *cqb, cq_context);
 	MLX5_SET(cqc, cqc, log_page_size,
@@ -851,9 +850,6 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
 
 err_db:
 	mlx5_ib_db_unmap_user(context, &cq->db);
-
-err_umem:
-	/* UMEM is released by ib_core */
 	return err;
 }
 
@@ -1434,7 +1430,8 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
 
 	if (udata) {
 		cq->ibcq.cqe = entries - 1;
-		ib_umem_release(cq->ibcq.umem);
+		ib_umem_list_replace(cq->ibcq.umem_list, UVERBS_BUF_CQ_BUF,
+				     cq->resize_umem);
 		cq->ibcq.umem = cq->resize_umem;
 		cq->resize_umem = NULL;
 	} else {
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 81+ messages in thread
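
A userspace model of the ownership pattern the mlx5 hunk above appears to rely on: "load or get" hands back the listed umem when one exists and otherwise creates one and parks it in the slot, so that the driver's error paths can return directly and the list releases whatever it holds. All names below are hypothetical stand-ins, malloc()/free() replace pinning, failure is reported as NULL rather than an ERR_PTR, and size validation of a pre-listed umem is left out. The real helpers are defined in patch 01.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NSLOTS 1

struct umem { size_t length; };
struct umem_list { struct umem *slots[NSLOTS]; };

static int released;	/* counts releases, for illustration only */

static struct umem *umem_get(size_t length)
{
	struct umem *u = malloc(sizeof(*u));

	if (u)
		u->length = length;
	return u;
}

static void umem_release(struct umem *u)
{
	if (!u)
		return;
	free(u);
	released++;
}

/* Drop whatever the slot holds and let the list own new_umem instead. */
void umem_list_replace(struct umem_list *list, unsigned int slot,
		       struct umem *new_umem)
{
	assert(slot < NSLOTS);
	umem_release(list->slots[slot]);
	list->slots[slot] = new_umem;
}

/* Return the listed umem, or create one and store it in the slot. */
struct umem *umem_list_load_or_get(struct umem_list *list, unsigned int slot,
				   size_t length)
{
	struct umem *u;

	assert(slot < NSLOTS);
	u = list->slots[slot];
	if (u)
		return u;
	u = umem_get(length);
	if (u)
		list->slots[slot] = u;
	return u;
}

/* Release every umem the list still owns. */
void umem_list_release(struct umem_list *list)
{
	for (unsigned int i = 0; i < NSLOTS; i++)
		umem_list_replace(list, i, NULL);
}
```

In this model the resize path is a single `umem_list_replace()` call: the old buffer is released and the new one becomes list-owned in one step, which matches how the mlx5, bnxt_re and mlx4 resize hunks swap in `cq->resize_umem`.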

* [PATCH rdma-next v2 06/15] RDMA/bnxt_re: Use umem_list for user CQ buffer
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
                   ` (4 preceding siblings ...)
  2026-04-11 14:49 ` [PATCH rdma-next v2 05/15] RDMA/mlx5: " Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 07/15] RDMA/mlx4: " Jiri Pirko
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Use ib_umem_list_load_or_get() and ib_umem_list_replace() to access
the CQ buffer umem through the umem_list instead of ibcq->umem.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
 drivers/infiniband/hw/bnxt_re/ib_verbs.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 7ed294516b7e..5c6fc81fad6a 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -3379,6 +3379,7 @@ int bnxt_re_create_user_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *att
 	struct bnxt_re_cq_req req;
 	int rc;
 	u32 active_cqs, entries;
+	struct ib_umem *umem;
 
 	if (attr->flags)
 		return -EOPNOTSUPP;
@@ -3402,15 +3403,14 @@ int bnxt_re_create_user_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *att
 		entries = bnxt_re_init_depth(attr->cqe + 1,
 					     dev_attr->max_cq_wqes + 1, uctx);
 
-	if (!ibcq->umem) {
-		ibcq->umem = ib_umem_get(&rdev->ibdev, req.cq_va,
-					 entries * sizeof(struct cq_base),
-					 IB_ACCESS_LOCAL_WRITE);
-		if (IS_ERR(ibcq->umem))
-			return PTR_ERR(ibcq->umem);
-	}
+	umem = ib_umem_list_load_or_get(ibcq->umem_list, UVERBS_BUF_CQ_BUF,
+					&rdev->ibdev, req.cq_va,
+					entries * sizeof(struct cq_base),
+					IB_ACCESS_LOCAL_WRITE);
+	if (IS_ERR(umem))
+		return PTR_ERR(umem);
 
-	rc = bnxt_re_setup_sginfo(rdev, ibcq->umem, &cq->qplib_cq.sg_info);
+	rc = bnxt_re_setup_sginfo(rdev, umem, &cq->qplib_cq.sg_info);
 	if (rc)
 		return rc;
 
@@ -3516,8 +3516,10 @@ static void bnxt_re_resize_cq_complete(struct bnxt_re_cq *cq)
 
 	cq->qplib_cq.max_wqe = cq->resize_cqe;
 	if (cq->resize_umem) {
-		ib_umem_release(cq->ib_cq.umem);
+		ib_umem_list_replace(cq->ib_cq.umem_list, UVERBS_BUF_CQ_BUF,
+				     cq->resize_umem);
 		cq->ib_cq.umem = cq->resize_umem;
+		cq->qplib_cq.sg_info.umem = cq->resize_umem;
 		cq->resize_umem = NULL;
 		cq->resize_cqe = 0;
 	}
@@ -4113,7 +4115,7 @@ int bnxt_re_poll_cq(struct ib_cq *ib_cq, int num_entries, struct ib_wc *wc)
 	/* User CQ; the only processing we do is to
 	 * complete any pending CQ resize operation.
 	 */
-	if (cq->ib_cq.umem) {
+	if (ib_cq->uobject) {
 		if (cq->resize_umem)
 			bnxt_re_resize_cq_complete(cq);
 		return 0;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH rdma-next v2 07/15] RDMA/mlx4: Use umem_list for user CQ buffer
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
                   ` (5 preceding siblings ...)
  2026-04-11 14:49 ` [PATCH rdma-next v2 06/15] RDMA/bnxt_re: " Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 08/15] RDMA/uverbs: Remove legacy umem field from struct ib_cq Jiri Pirko
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Use ib_umem_list_load() and ib_umem_list_replace() to access the CQ
buffer umem through the umem_list instead of ibcq->umem.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
v1->v2:
- rebase on top of Leon's fix
---
 drivers/infiniband/hw/mlx4/cq.c | 40 ++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 7a6eb602d4a6..f6ef85cc37a1 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -152,6 +152,7 @@ int mlx4_ib_create_user_cq(struct ib_cq *ibcq,
 	int shift;
 	int n;
 	int err;
+	struct ib_umem *umem;
 	struct mlx4_ib_ucontext *context = rdma_udata_to_drv_context(
 		udata, struct mlx4_ib_ucontext, ibucontext);
 
@@ -172,22 +173,30 @@ int mlx4_ib_create_user_cq(struct ib_cq *ibcq,
 	if (err)
 		goto err_cq;
 
-	if (ibcq->umem &&
-	    (dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_SW_CQ_INIT))
-		return -EOPNOTSUPP;
-
-	buf_addr = (void *)(unsigned long)ucmd.buf_addr;
-
-	if (!ibcq->umem)
-		ibcq->umem = ib_umem_get(&dev->ib_dev, ucmd.buf_addr,
-					 entries * cqe_size,
-					 IB_ACCESS_LOCAL_WRITE);
-	if (IS_ERR(ibcq->umem)) {
-		err = PTR_ERR(ibcq->umem);
+	umem = ib_umem_list_load(ibcq->umem_list, UVERBS_BUF_CQ_BUF,
+				 entries * cqe_size);
+	if (IS_ERR(umem)) {
+		err = PTR_ERR(umem);
 		goto err_cq;
 	}
+	if (umem) {
+		if (dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_SW_CQ_INIT)
+			return -EOPNOTSUPP;
+	} else {
+		umem = ib_umem_get(&dev->ib_dev, ucmd.buf_addr,
+				   entries * cqe_size,
+				   IB_ACCESS_LOCAL_WRITE);
+		if (IS_ERR(umem)) {
+			err = PTR_ERR(umem);
+			goto err_cq;
+		}
+		ib_umem_list_replace(ibcq->umem_list, UVERBS_BUF_CQ_BUF,
+				     umem);
+	}
+
+	buf_addr = (void *)(unsigned long)ucmd.buf_addr;
 
-	shift = mlx4_ib_umem_calc_optimal_mtt_size(cq->ibcq.umem, 0, &n);
+	shift = mlx4_ib_umem_calc_optimal_mtt_size(umem, 0, &n);
 	if (shift < 0) {
 		err = shift;
 		goto err_cq;
@@ -197,7 +206,7 @@ int mlx4_ib_create_user_cq(struct ib_cq *ibcq,
 	if (err)
 		goto err_cq;
 
-	err = mlx4_ib_umem_write_mtt(dev, &cq->buf.mtt, cq->ibcq.umem);
+	err = mlx4_ib_umem_write_mtt(dev, &cq->buf.mtt, umem);
 	if (err)
 		goto err_mtt;
 
@@ -471,7 +480,8 @@ int mlx4_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
 	if (ibcq->uobject) {
 		cq->buf      = cq->resize_buf->buf;
 		cq->ibcq.cqe = cq->resize_buf->cqe;
-		ib_umem_release(cq->ibcq.umem);
+		ib_umem_list_replace(ibcq->umem_list, UVERBS_BUF_CQ_BUF,
+				     cq->resize_umem);
 		cq->ibcq.umem     = cq->resize_umem;
 
 		kfree(cq->resize_buf);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH rdma-next v2 08/15] RDMA/uverbs: Remove legacy umem field from struct ib_cq
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
                   ` (6 preceding siblings ...)
  2026-04-11 14:49 ` [PATCH rdma-next v2 07/15] RDMA/mlx4: " Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 09/15] RDMA/uverbs: Verify all umem_list buffers are consumed after CQ creation Jiri Pirko
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Now that all drivers use umem_list for CQ buffer management, the
legacy umem field in struct ib_cq is no longer needed. Remove it
along with the associated ib_umem_release_non_listed() calls in
error and destroy paths, as buffer lifetime is fully managed through
ib_umem_list_release().

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
 drivers/infiniband/core/uverbs_cmd.c          | 1 -
 drivers/infiniband/core/uverbs_std_types_cq.c | 9 ---------
 drivers/infiniband/core/verbs.c               | 5 ++---
 drivers/infiniband/hw/bnxt_re/ib_verbs.c      | 1 -
 drivers/infiniband/hw/mlx4/cq.c               | 1 -
 drivers/infiniband/hw/mlx5/cq.c               | 1 -
 include/rdma/ib_verbs.h                       | 2 --
 7 files changed, 2 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 77874834108b..60fafa1fb7b4 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1088,7 +1088,6 @@ static int create_cq(struct uverbs_attr_bundle *attrs,
 	return uverbs_response(attrs, &resp, sizeof(resp));
 
 err_free:
-	ib_umem_release_non_listed(umem_list, UVERBS_BUF_CQ_BUF, cq->umem);
 	rdma_restrack_put(&cq->res);
 	kfree(cq);
 err_list_release:
diff --git a/drivers/infiniband/core/uverbs_std_types_cq.c b/drivers/infiniband/core/uverbs_std_types_cq.c
index f87cd11470fc..c165ff5446f6 100644
--- a/drivers/infiniband/core/uverbs_std_types_cq.c
+++ b/drivers/infiniband/core/uverbs_std_types_cq.c
@@ -209,11 +209,6 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
 	cq->comp_handler  = ib_uverbs_comp_handler;
 	cq->event_handler = ib_uverbs_cq_event_handler;
 	cq->cq_context    = ev_file ? &ev_file->ev_queue : NULL;
-	/*
-	 * If UMEM is not provided here, legacy drivers will set it during
-	 * CQ creation based on their internal udata.
-	 */
-	cq->umem = umem;
 	cq->umem_list     = umem_list;
 	atomic_set(&cq->usecnt, 0);
 
@@ -227,9 +222,6 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
 	if (ret)
 		goto err_free;
 
-	/* Check that driver didn't overrun existing umem */
-	WARN_ON(umem && cq->umem != umem);
-
 	obj->uevent.uobject.object = cq;
 	obj->uevent.uobject.user_handle = user_handle;
 	rdma_restrack_add(&cq->res);
@@ -240,7 +232,6 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
 	return ret;
 
 err_free:
-	ib_umem_release_non_listed(umem_list, UVERBS_BUF_CQ_BUF, cq->umem);
 	rdma_restrack_put(&cq->res);
 	kfree(cq);
 err_umem_list:
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index ed163fc56ef8..35700bad8310 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -2224,9 +2224,9 @@ struct ib_cq *__ib_create_cq(struct ib_device *device,
 	}
 	/*
 	 * We are in kernel verbs flow and drivers are not allowed
-	 * to set umem or umem_list pointers, they need to stay NULL.
+	 * to set umem_list pointer, it needs to stay NULL.
 	 */
-	WARN_ON_ONCE(cq->umem || cq->umem_list);
+	WARN_ON_ONCE(cq->umem_list);
 
 	rdma_restrack_add(&cq->res);
 	return cq;
@@ -2259,7 +2259,6 @@ int ib_destroy_cq_user(struct ib_cq *cq, struct ib_udata *udata)
 	if (ret)
 		return ret;
 
-	ib_umem_release_non_listed(umem_list, UVERBS_BUF_CQ_BUF, cq->umem);
 	rdma_restrack_del(&cq->res);
 	kfree(cq);
 	ib_umem_list_release(umem_list);
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 5c6fc81fad6a..e63780c78781 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -3518,7 +3518,6 @@ static void bnxt_re_resize_cq_complete(struct bnxt_re_cq *cq)
 	if (cq->resize_umem) {
 		ib_umem_list_replace(cq->ib_cq.umem_list, UVERBS_BUF_CQ_BUF,
 				     cq->resize_umem);
-		cq->ib_cq.umem = cq->resize_umem;
 		cq->qplib_cq.sg_info.umem = cq->resize_umem;
 		cq->resize_umem = NULL;
 		cq->resize_cqe = 0;
diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index f6ef85cc37a1..3217c5faf0d5 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -482,7 +482,6 @@ int mlx4_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
 		cq->ibcq.cqe = cq->resize_buf->cqe;
 		ib_umem_list_replace(ibcq->umem_list, UVERBS_BUF_CQ_BUF,
 				     cq->resize_umem);
-		cq->ibcq.umem     = cq->resize_umem;
 
 		kfree(cq->resize_buf);
 		cq->resize_buf = NULL;
diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index bb9ed7caec67..6118deb5e6dc 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -1432,7 +1432,6 @@ int mlx5_ib_resize_cq(struct ib_cq *ibcq, unsigned int entries,
 		cq->ibcq.cqe = entries - 1;
 		ib_umem_list_replace(cq->ibcq.umem_list, UVERBS_BUF_CQ_BUF,
 				     cq->resize_umem);
-		cq->ibcq.umem = cq->resize_umem;
 		cq->resize_umem = NULL;
 	} else {
 		struct mlx5_ib_cq_buf tbuf;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index dd6c0d68497d..cf7fa69415a1 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1738,8 +1738,6 @@ struct ib_cq {
 	u8 interrupt:1;
 	u8 shared:1;
 	unsigned int comp_vector;
-	struct ib_umem *umem;
-
 	struct ib_umem_list    *umem_list;
 
 	/*
-- 
2.53.0



* [PATCH rdma-next v2 09/15] RDMA/uverbs: Verify all umem_list buffers are consumed after CQ creation
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
                   ` (7 preceding siblings ...)
  2026-04-11 14:49 ` [PATCH rdma-next v2 08/15] RDMA/uverbs: Remove legacy umem field from struct ib_cq Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 10/15] RDMA/uverbs: Integrate umem_list into QP creation Jiri Pirko
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

After the driver creates the CQ, verify that all user-provided
umem buffers were actually consumed by the driver. This rejects
requests where userspace provides buffers that the driver does
not support.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
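For readers following the series, the check described above can be
pictured with a small userspace model. This is only a sketch under
stated assumptions: struct demo_buf_slot, struct demo_buf_list, the
demo_* helpers and the -EINVAL return value are hypothetical and are
not the kernel's ib_umem_list implementation, which is introduced
earlier in the series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_SLOTS 4

/* Hypothetical userspace model; not the kernel's struct ib_umem_list. */
struct demo_buf_slot {
	bool provided;	/* userspace supplied a buffer for this slot */
	bool consumed;	/* the driver loaded the buffer during creation */
};

struct demo_buf_list {
	size_t nslots;
	struct demo_buf_slot slots[DEMO_MAX_SLOTS];
};

/* Driver side: mark a slot as used; false if nothing was provided there. */
bool demo_buf_list_load(struct demo_buf_list *list, size_t idx)
{
	if (!list || idx >= list->nslots || !list->slots[idx].provided)
		return false;
	list->slots[idx].consumed = true;
	return true;
}

/* Core side: reject the request if any provided buffer was left unused. */
int demo_buf_list_check_consumed(const struct demo_buf_list *list)
{
	size_t i;

	if (!list)
		return 0;	/* legacy callers pass no list at all */
	assert(list->nslots <= DEMO_MAX_SLOTS);
	for (i = 0; i < list->nslots; i++)
		if (list->slots[i].provided && !list->slots[i].consumed)
			return -EINVAL;	/* assumed error code */
	return 0;
}
```

In this model a driver that ignores one of two provided slots makes
the check fail, which corresponds to the err_destroy_cq unwind added
by the patch.
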
 drivers/infiniband/core/uverbs_std_types_cq.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/infiniband/core/uverbs_std_types_cq.c b/drivers/infiniband/core/uverbs_std_types_cq.c
index c165ff5446f6..d3176032d0ac 100644
--- a/drivers/infiniband/core/uverbs_std_types_cq.c
+++ b/drivers/infiniband/core/uverbs_std_types_cq.c
@@ -222,6 +222,10 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
 	if (ret)
 		goto err_free;
 
+	ret = ib_umem_list_check_consumed(umem_list);
+	if (ret)
+		goto err_destroy_cq;
+
 	obj->uevent.uobject.object = cq;
 	obj->uevent.uobject.user_handle = user_handle;
 	rdma_restrack_add(&cq->res);
@@ -231,6 +235,8 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
 			     sizeof(cq->cqe));
 	return ret;
 
+err_destroy_cq:
+	ib_dev->ops.destroy_cq(cq, &attrs->driver_udata);
 err_free:
 	rdma_restrack_put(&cq->res);
 	kfree(cq);
-- 
2.53.0



* [PATCH rdma-next v2 10/15] RDMA/uverbs: Integrate umem_list into QP creation
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
                   ` (8 preceding siblings ...)
  2026-04-11 14:49 ` [PATCH rdma-next v2 09/15] RDMA/uverbs: Verify all umem_list buffers are consumed after CQ creation Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 11/15] RDMA/mlx5: Use umem_list for QP buffers in create_qp Jiri Pirko
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Wire up the generic buffer descriptor infrastructure to the QP create
command. Add umem_list field to struct ib_qp and define the QP buffer
slot enums.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
v1->v2: Fix umem_list double free
---
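The QP slot enum in this patch uses the "__X_MAX, X_MAX = __X_MAX - 1"
idiom. The sketch below illustrates what the idiom guarantees in plain
C. All demo_* and DEMO_* names are hypothetical and exist only for
this example; nothing here claims how the kernel sizes its own tables.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative copy of the idiom; the names are hypothetical. */
enum demo_buf_slots {
	DEMO_BUF_MAIN,
	DEMO_BUF_RQ,
	DEMO_BUF_SQ,
	__DEMO_BUF_MAX,				/* one past the last slot */
	DEMO_BUF_MAX = __DEMO_BUF_MAX - 1,	/* highest valid slot index */
};

/* A table indexed by slot therefore needs DEMO_BUF_MAX + 1 entries. */
#define DEMO_BUF_COUNT (DEMO_BUF_MAX + 1)

/* A slot index is valid when it does not exceed the highest slot. */
int demo_slot_is_valid(unsigned int slot)
{
	return slot <= (unsigned int)DEMO_BUF_MAX;
}

/* Number of entries in a per-slot table sized with DEMO_BUF_COUNT. */
size_t demo_slot_count(void)
{
	static const char *const names[DEMO_BUF_COUNT] = { "main", "rq", "sq" };

	assert(sizeof(names) / sizeof(names[0]) == DEMO_BUF_COUNT);
	return sizeof(names) / sizeof(names[0]);
}
```

Adding a new enumerator in front of __DEMO_BUF_MAX bumps DEMO_BUF_MAX
automatically, so range checks written against DEMO_BUF_MAX need no
change when a slot is appended.
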
 drivers/infiniband/core/core_priv.h           |  1 +
 drivers/infiniband/core/uverbs_cmd.c          |  4 ++--
 drivers/infiniband/core/uverbs_std_types_qp.c | 22 ++++++++++++++++---
 drivers/infiniband/core/verbs.c               | 19 +++++++++++++---
 include/rdma/ib_verbs.h                       |  3 +++
 include/uapi/rdma/ib_user_ioctl_cmds.h        |  8 +++++++
 6 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index a2c36666e6fc..3f7b0803f186 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -321,6 +321,7 @@ void nldev_exit(void);
 
 struct ib_qp *ib_create_qp_user(struct ib_device *dev, struct ib_pd *pd,
 				struct ib_qp_init_attr *attr,
+				struct ib_umem_list *umem_list,
 				struct ib_udata *udata,
 				struct ib_uqp_object *uobj, const char *caller);
 
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 60fafa1fb7b4..ce482ed047b0 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1467,8 +1467,8 @@ static int create_qp(struct uverbs_attr_bundle *attrs,
 		attr.source_qpn = cmd->source_qpn;
 	}
 
-	qp = ib_create_qp_user(device, pd, &attr, &attrs->driver_udata, obj,
-			       KBUILD_MODNAME);
+	qp = ib_create_qp_user(device, pd, &attr, NULL,
+			       &attrs->driver_udata, obj, KBUILD_MODNAME);
 	if (IS_ERR(qp)) {
 		ret = PTR_ERR(qp);
 		goto err_put;
diff --git a/drivers/infiniband/core/uverbs_std_types_qp.c b/drivers/infiniband/core/uverbs_std_types_qp.c
index be0730e8509e..5d76bfac6544 100644
--- a/drivers/infiniband/core/uverbs_std_types_qp.c
+++ b/drivers/infiniband/core/uverbs_std_types_qp.c
@@ -4,6 +4,7 @@
  */
 
 #include <rdma/uverbs_std_types.h>
+#include <rdma/ib_umem.h>
 #include "rdma_core.h"
 #include "uverbs.h"
 #include "core_priv.h"
@@ -96,6 +97,7 @@ static int UVERBS_HANDLER(UVERBS_METHOD_QP_CREATE)(
 	struct ib_xrcd *xrcd = NULL;
 	struct ib_uobject *xrcd_uobj = NULL;
 	struct ib_device *device;
+	struct ib_umem_list *umem_list;
 	u64 user_handle;
 	int ret;
 
@@ -248,14 +250,24 @@ static int UVERBS_HANDLER(UVERBS_METHOD_QP_CREATE)(
 	set_caps(&attr, &cap, true);
 	mutex_init(&obj->mcast_lock);
 
-	qp = ib_create_qp_user(device, pd, &attr, &attrs->driver_udata, obj,
-			       KBUILD_MODNAME);
+	umem_list = ib_umem_list_create(device, attrs, UVERBS_BUF_QP_MAX);
+	if (IS_ERR(umem_list)) {
+		ret = PTR_ERR(umem_list);
+		goto err_put;
+	}
+
+	qp = ib_create_qp_user(device, pd, &attr, umem_list,
+			       &attrs->driver_udata, obj, KBUILD_MODNAME);
 	if (IS_ERR(qp)) {
 		ret = PTR_ERR(qp);
 		goto err_put;
 	}
 	ib_qp_usecnt_inc(qp);
 
+	ret = ib_umem_list_check_consumed(umem_list);
+	if (ret)
+		goto err_destroy_qp;
+
 	if (attr.qp_type == IB_QPT_XRC_TGT) {
 		obj->uxrcd = container_of(xrcd_uobj, struct ib_uxrcd_object,
 					  uobject);
@@ -277,6 +289,9 @@ static int UVERBS_HANDLER(UVERBS_METHOD_QP_CREATE)(
 			     sizeof(qp->qp_num));
 
 	return ret;
+
+err_destroy_qp:
+	ib_destroy_qp_user(qp, &attrs->driver_udata);
 err_put:
 	if (obj->uevent.event_file)
 		uverbs_uobject_put(&obj->uevent.event_file->uobj);
@@ -340,7 +355,8 @@ DECLARE_UVERBS_NAMED_METHOD(
 	UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_CREATE_QP_RESP_QP_NUM,
 			   UVERBS_ATTR_TYPE(u32),
 			   UA_MANDATORY),
-	UVERBS_ATTR_UHW());
+	UVERBS_ATTR_UHW(),
+	UVERBS_ATTR_BUFFERS());
 
 static int UVERBS_HANDLER(UVERBS_METHOD_QP_DESTROY)(
 	struct uverbs_attr_bundle *attrs)
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 35700bad8310..0fe6cb1a9f07 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1266,6 +1266,7 @@ static struct ib_qp *create_xrc_qp_user(struct ib_qp *qp,
 
 static struct ib_qp *create_qp(struct ib_device *dev, struct ib_pd *pd,
 			       struct ib_qp_init_attr *attr,
+			       struct ib_umem_list *umem_list,
 			       struct ib_udata *udata,
 			       struct ib_uqp_object *uobj, const char *caller)
 {
@@ -1292,6 +1293,7 @@ static struct ib_qp *create_qp(struct ib_device *dev, struct ib_pd *pd,
 	qp->registered_event_handler = attr->event_handler;
 	qp->port = attr->port_num;
 	qp->qp_context = attr->qp_context;
+	qp->umem_list = umem_list;
 
 	spin_lock_init(&qp->mr_lock);
 	INIT_LIST_HEAD(&qp->rdma_mrs);
@@ -1326,6 +1328,7 @@ static struct ib_qp *create_qp(struct ib_device *dev, struct ib_pd *pd,
 	qp->device->ops.destroy_qp(qp, udata ? &dummy : NULL);
 err_create:
 	rdma_restrack_put(&qp->res);
+	ib_umem_list_release(qp->umem_list);
 	kfree(qp);
 	return ERR_PTR(ret);
 
@@ -1339,21 +1342,23 @@ static struct ib_qp *create_qp(struct ib_device *dev, struct ib_pd *pd,
  * @attr: A list of initial attributes required to create the
  *   QP.  If QP creation succeeds, then the attributes are updated to
  *   the actual capabilities of the created QP.
+ * @umem_list: pre-mapped dma-buf umem list, or NULL
  * @udata: User data
  * @uobj: uverbs obect
  * @caller: caller's build-time module name
  */
 struct ib_qp *ib_create_qp_user(struct ib_device *dev, struct ib_pd *pd,
 				struct ib_qp_init_attr *attr,
+				struct ib_umem_list *umem_list,
 				struct ib_udata *udata,
 				struct ib_uqp_object *uobj, const char *caller)
 {
 	struct ib_qp *qp, *xrc_qp;
 
 	if (attr->qp_type == IB_QPT_XRC_TGT)
-		qp = create_qp(dev, pd, attr, NULL, NULL, caller);
+		qp = create_qp(dev, pd, attr, umem_list, NULL, NULL, caller);
 	else
-		qp = create_qp(dev, pd, attr, udata, uobj, NULL);
+		qp = create_qp(dev, pd, attr, umem_list, udata, uobj, NULL);
 	if (attr->qp_type != IB_QPT_XRC_TGT || IS_ERR(qp))
 		return qp;
 
@@ -1415,10 +1420,16 @@ struct ib_qp *ib_create_qp_kernel(struct ib_pd *pd,
 	if (qp_init_attr->cap.max_rdma_ctxs)
 		rdma_rw_init_qp(device, qp_init_attr);
 
-	qp = create_qp(device, pd, qp_init_attr, NULL, NULL, caller);
+	qp = create_qp(device, pd, qp_init_attr, NULL, NULL, NULL, caller);
 	if (IS_ERR(qp))
 		return qp;
 
+	/*
+	 * We are in kernel verbs flow and drivers are not allowed
+	 * to set umem_list pointer, it needs to stay NULL.
+	 */
+	WARN_ON_ONCE(qp->umem_list);
+
 	ib_qp_usecnt_inc(qp);
 
 	if (qp_init_attr->cap.max_rdma_ctxs) {
@@ -2147,6 +2158,7 @@ int ib_destroy_qp_user(struct ib_qp *qp, struct ib_udata *udata)
 {
 	const struct ib_gid_attr *alt_path_sgid_attr = qp->alt_path_sgid_attr;
 	const struct ib_gid_attr *av_sgid_attr = qp->av_sgid_attr;
+	struct ib_umem_list *umem_list = qp->umem_list;
 	struct ib_qp_security *sec;
 	int ret;
 
@@ -2184,6 +2196,7 @@ int ib_destroy_qp_user(struct ib_qp *qp, struct ib_udata *udata)
 
 	rdma_restrack_del(&qp->res);
 	kfree(qp);
+	ib_umem_list_release(umem_list);
 	return ret;
 }
 EXPORT_SYMBOL(ib_destroy_qp_user);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index cf7fa69415a1..d78f62611a7e 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1524,6 +1524,7 @@ enum ib_mr_rereg_flags {
 };
 
 struct ib_umem;
+struct ib_umem_list;
 
 enum rdma_remove_reason {
 	/*
@@ -1944,6 +1945,8 @@ struct ib_qp {
 
 	/* The counter the qp is bind to */
 	struct rdma_counter    *counter;
+
+	struct ib_umem_list    *umem_list;
 };
 
 struct ib_dm {
diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
index 375e4e224f6a..9c5d3f989977 100644
--- a/include/uapi/rdma/ib_user_ioctl_cmds.h
+++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
@@ -167,6 +167,14 @@ enum uverbs_attrs_create_qp_cmd_attr_ids {
 	UVERBS_ATTR_CREATE_QP_RESP_QP_NUM,
 };
 
+enum uverbs_buf_qp_slots {
+	UVERBS_BUF_QP_BUF,
+	UVERBS_BUF_QP_RQ_BUF,
+	UVERBS_BUF_QP_SQ_BUF,
+	__UVERBS_BUF_QP_MAX,
+	UVERBS_BUF_QP_MAX = __UVERBS_BUF_QP_MAX - 1,
+};
+
 enum uverbs_attrs_destroy_qp_cmd_attr_ids {
 	UVERBS_ATTR_DESTROY_QP_HANDLE,
 	UVERBS_ATTR_DESTROY_QP_RESP,
-- 
2.53.0



* [PATCH rdma-next v2 11/15] RDMA/mlx5: Use umem_list for QP buffers in create_qp
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
                   ` (9 preceding siblings ...)
  2026-04-11 14:49 ` [PATCH rdma-next v2 10/15] RDMA/uverbs: Integrate umem_list into QP creation Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 12/15] RDMA/uverbs: Add doorbell record buffer slot to CQ umem_list Jiri Pirko
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Load the QP and SQ buffer umems from the umem_list, falling back to
ib_umem_get() for the legacy path. Use ib_umem_release_non_listed()
on the error and destroy paths so that the umem is released properly.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
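The load-with-fallback and release pattern from the commit message can
be modelled in userspace as follows. This is a sketch under
assumptions: the demo_* types and helpers are hypothetical stand-ins,
and the rule that a listed buffer is left for the list to free is
inferred from the helper's name, not taken from its kernel
implementation.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_SLOTS 4

/* Hypothetical stand-ins for struct ib_umem and struct ib_umem_list. */
struct demo_umem { int id; };
struct demo_umem_list { struct demo_umem *slot[DEMO_SLOTS]; };

int demo_live_umems;	/* allocation counter, for demonstration only */

/* Legacy path: pin a fresh buffer (modelled here as a plain allocation). */
struct demo_umem *demo_umem_get(int id)
{
	struct demo_umem *u = malloc(sizeof(*u));

	if (u) {
		u->id = id;
		demo_live_umems++;
	}
	return u;
}

void demo_umem_release(struct demo_umem *u)
{
	if (u) {
		demo_live_umems--;
		free(u);
	}
}

/* Prefer the buffer the list already holds; otherwise use the legacy path. */
struct demo_umem *demo_load_or_get(struct demo_umem_list *list,
				   unsigned int slot, int id)
{
	assert(slot < DEMO_SLOTS);
	if (list && list->slot[slot])
		return list->slot[slot];
	return demo_umem_get(id);
}

/* Release only what the list does not own (assumed semantics). */
void demo_release_non_listed(struct demo_umem_list *list, unsigned int slot,
			     struct demo_umem *u)
{
	assert(slot < DEMO_SLOTS);
	if (list && list->slot[slot] == u)
		return;		/* the list frees it when it is released */
	demo_umem_release(u);
}
```

With this split, the same error and destroy code can serve both the
list-backed and the legacy path without freeing a buffer twice.
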
 drivers/infiniband/hw/mlx5/qp.c | 70 +++++++++++++++++++++++----------
 1 file changed, 49 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 8f50e7342a76..ba5b41fa5ef9 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -938,6 +938,14 @@ static int adjust_bfregn(struct mlx5_ib_dev *dev,
 				bfregn % MLX5_NON_FP_BFREGS_PER_UAR;
 }
 
+static unsigned int mlx5_qp_buf_slot(struct mlx5_ib_qp *qp)
+{
+	if (qp->type == IB_QPT_RAW_PACKET ||
+	    qp->flags & IB_QP_CREATE_SOURCE_QPN)
+		return UVERBS_BUF_QP_RQ_BUF;
+	return UVERBS_BUF_QP_BUF;
+}
+
 static int _create_user_qp(struct mlx5_ib_dev *dev, struct ib_pd *pd,
 			   struct mlx5_ib_qp *qp, struct ib_udata *udata,
 			   struct ib_qp_init_attr *attr, u32 **in,
@@ -998,14 +1006,26 @@ static int _create_user_qp(struct mlx5_ib_dev *dev, struct ib_pd *pd,
 	if (err)
 		goto err_bfreg;
 
-	if (ucmd->buf_addr && ubuffer->buf_size) {
-		ubuffer->buf_addr = ucmd->buf_addr;
-		ubuffer->umem = ib_umem_get(&dev->ib_dev, ubuffer->buf_addr,
-					    ubuffer->buf_size, 0);
+	ubuffer->umem = NULL;
+	if (ubuffer->buf_size) {
+		ubuffer->umem = ib_umem_list_load(qp->ibqp.umem_list, mlx5_qp_buf_slot(qp),
+						  ubuffer->buf_size);
 		if (IS_ERR(ubuffer->umem)) {
 			err = PTR_ERR(ubuffer->umem);
 			goto err_bfreg;
+		} else if (!ubuffer->umem && ucmd->buf_addr) {
+			ubuffer->buf_addr = ucmd->buf_addr;
+			ubuffer->umem = ib_umem_get(&dev->ib_dev, ubuffer->buf_addr,
+						    ubuffer->buf_size, 0);
+			if (IS_ERR(ubuffer->umem)) {
+				err = PTR_ERR(ubuffer->umem);
+				goto err_bfreg;
+			}
+			ib_umem_list_replace(qp->ibqp.umem_list, mlx5_qp_buf_slot(qp),
+					     ubuffer->umem);
 		}
+	}
+	if (ubuffer->umem) {
 		page_size = mlx5_umem_find_best_quantized_pgoff(
 			ubuffer->umem, qpc, log_page_size,
 			MLX5_ADAPTER_PAGE_SHIFT, page_offset, 64,
@@ -1015,8 +1035,6 @@ static int _create_user_qp(struct mlx5_ib_dev *dev, struct ib_pd *pd,
 			goto err_umem;
 		}
 		ncont = ib_umem_num_dma_blocks(ubuffer->umem, page_size);
-	} else {
-		ubuffer->umem = NULL;
 	}
 
 	*inlen = MLX5_ST_SZ_BYTES(create_qp_in) +
@@ -1056,7 +1074,8 @@ static int _create_user_qp(struct mlx5_ib_dev *dev, struct ib_pd *pd,
 	kvfree(*in);
 
 err_umem:
-	ib_umem_release(ubuffer->umem);
+	ib_umem_release_non_listed(qp->ibqp.umem_list, mlx5_qp_buf_slot(qp),
+				   ubuffer->umem);
 
 err_bfreg:
 	if (bfregn != MLX5_IB_INVALID_BFREG)
@@ -1073,7 +1092,8 @@ static void destroy_qp(struct mlx5_ib_dev *dev, struct mlx5_ib_qp *qp,
 	if (udata) {
 		/* User QP */
 		mlx5_ib_db_unmap_user(context, &qp->db);
-		ib_umem_release(base->ubuffer.umem);
+		ib_umem_release_non_listed(qp->ibqp.umem_list, mlx5_qp_buf_slot(qp),
+					   base->ubuffer.umem);
 
 		/*
 		 * Free only the BFREGs which are handled by the kernel.
@@ -1334,7 +1354,8 @@ static int get_qp_ts_format(struct mlx5_ib_dev *dev, struct mlx5_ib_cq *send_cq,
 static int create_raw_packet_qp_sq(struct mlx5_ib_dev *dev,
 				   struct ib_udata *udata,
 				   struct mlx5_ib_sq *sq, void *qpin,
-				   struct ib_pd *pd, struct mlx5_ib_cq *cq)
+				   struct ib_pd *pd, struct mlx5_ib_cq *cq,
+				   struct ib_umem_list *umem_list)
 {
 	struct mlx5_ib_ubuffer *ubuffer = &sq->ubuffer;
 	__be64 *pas;
@@ -1352,10 +1373,11 @@ static int create_raw_packet_qp_sq(struct mlx5_ib_dev *dev,
 	if (ts_format < 0)
 		return ts_format;
 
-	sq->ubuffer.umem = ib_umem_get(&dev->ib_dev, ubuffer->buf_addr,
-				       ubuffer->buf_size, 0);
-	if (IS_ERR(sq->ubuffer.umem))
-		return PTR_ERR(sq->ubuffer.umem);
+	ubuffer->umem = ib_umem_list_load_or_get(umem_list, UVERBS_BUF_QP_SQ_BUF,
+						 &dev->ib_dev, ubuffer->buf_addr,
+						 ubuffer->buf_size, 0);
+	if (IS_ERR(ubuffer->umem))
+		return PTR_ERR(ubuffer->umem);
 	page_size = mlx5_umem_find_best_quantized_pgoff(
 		ubuffer->umem, wq, log_wq_pg_sz, MLX5_ADAPTER_PAGE_SHIFT,
 		page_offset, 64, &page_offset_quantized);
@@ -1412,18 +1434,21 @@ static int create_raw_packet_qp_sq(struct mlx5_ib_dev *dev,
 	return 0;
 
 err_umem:
-	ib_umem_release(sq->ubuffer.umem);
+	ib_umem_release_non_listed(umem_list, UVERBS_BUF_QP_SQ_BUF,
+				   sq->ubuffer.umem);
 	sq->ubuffer.umem = NULL;
 
 	return err;
 }
 
 static void destroy_raw_packet_qp_sq(struct mlx5_ib_dev *dev,
-				     struct mlx5_ib_sq *sq)
+				     struct mlx5_ib_sq *sq,
+				     struct ib_umem_list *umem_list)
 {
 	destroy_flow_rule_vport_sq(sq);
 	mlx5_core_destroy_sq_tracked(dev, &sq->base.mqp);
-	ib_umem_release(sq->ubuffer.umem);
+	ib_umem_release_non_listed(umem_list, UVERBS_BUF_QP_SQ_BUF,
+				   sq->ubuffer.umem);
 }
 
 static int create_raw_packet_qp_rq(struct mlx5_ib_dev *dev,
@@ -1567,7 +1592,8 @@ static int create_raw_packet_qp(struct mlx5_ib_dev *dev, struct mlx5_ib_qp *qp,
 				u32 *in, size_t inlen, struct ib_pd *pd,
 				struct ib_udata *udata,
 				struct mlx5_ib_create_qp_resp *resp,
-				struct ib_qp_init_attr *init_attr)
+				struct ib_qp_init_attr *init_attr,
+				struct ib_umem_list *umem_list)
 {
 	struct mlx5_ib_raw_packet_qp *raw_packet_qp = &qp->raw_packet_qp;
 	struct mlx5_ib_sq *sq = &raw_packet_qp->sq;
@@ -1587,7 +1613,8 @@ static int create_raw_packet_qp(struct mlx5_ib_dev *dev, struct mlx5_ib_qp *qp,
 			return err;
 
 		err = create_raw_packet_qp_sq(dev, udata, sq, in, pd,
-					      to_mcq(init_attr->send_cq));
+					      to_mcq(init_attr->send_cq),
+					      umem_list);
 		if (err)
 			goto err_destroy_tis;
 
@@ -1651,7 +1678,7 @@ static int create_raw_packet_qp(struct mlx5_ib_dev *dev, struct mlx5_ib_qp *qp,
 err_destroy_sq:
 	if (!qp->sq.wqe_cnt)
 		return err;
-	destroy_raw_packet_qp_sq(dev, sq);
+	destroy_raw_packet_qp_sq(dev, sq, umem_list);
 err_destroy_tis:
 	destroy_raw_packet_qp_tis(dev, sq, pd);
 
@@ -1671,7 +1698,7 @@ static void destroy_raw_packet_qp(struct mlx5_ib_dev *dev,
 	}
 
 	if (qp->sq.wqe_cnt) {
-		destroy_raw_packet_qp_sq(dev, sq);
+		destroy_raw_packet_qp_sq(dev, sq, qp->ibqp.umem_list);
 		destroy_raw_packet_qp_tis(dev, sq, qp->ibqp.pd);
 	}
 }
@@ -2393,7 +2420,8 @@ static int create_user_qp(struct mlx5_ib_dev *dev, struct ib_pd *pd,
 		qp->raw_packet_qp.sq.ubuffer.buf_addr = ucmd->sq_buf_addr;
 		raw_packet_qp_copy_info(qp, &qp->raw_packet_qp);
 		err = create_raw_packet_qp(dev, qp, in, inlen, pd, udata,
-					   &params->resp, init_attr);
+					   &params->resp, init_attr,
+					   qp->ibqp.umem_list);
 	} else
 		err = mlx5_qpc_create_qp(dev, &base->mqp, in, inlen, out);
 
-- 
2.53.0



* [PATCH rdma-next v2 12/15] RDMA/uverbs: Add doorbell record buffer slot to CQ umem_list
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
                   ` (10 preceding siblings ...)
  2026-04-11 14:49 ` [PATCH rdma-next v2 11/15] RDMA/mlx5: Use umem_list for QP buffers in create_qp Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 13/15] RDMA/mlx5: Use umem_list for CQ doorbell record Jiri Pirko
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Extend the CQ buffer slot enum with UVERBS_BUF_CQ_DBR, allowing
userspace to provide doorbell record memory via the generic buffer
descriptor infrastructure.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
 include/uapi/rdma/ib_user_ioctl_cmds.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
index 9c5d3f989977..26c2e3b2125a 100644
--- a/include/uapi/rdma/ib_user_ioctl_cmds.h
+++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
@@ -122,6 +122,7 @@ enum uverbs_attrs_create_cq_cmd_attr_ids {
 
 enum uverbs_buf_cq_slots {
 	UVERBS_BUF_CQ_BUF,
+	UVERBS_BUF_CQ_DBR,
 	__UVERBS_BUF_CQ_MAX,
 	UVERBS_BUF_CQ_MAX = __UVERBS_BUF_CQ_MAX - 1,
 };
-- 
2.53.0



* [PATCH rdma-next v2 13/15] RDMA/mlx5: Use umem_list for CQ doorbell record
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
                   ` (11 preceding siblings ...)
  2026-04-11 14:49 ` [PATCH rdma-next v2 12/15] RDMA/uverbs: Add doorbell record buffer slot to CQ umem_list Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 14/15] RDMA/uverbs: Add doorbell record buffer slot to QP umem_list Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 15/15] RDMA/mlx5: Use umem_list for QP doorbell record Jiri Pirko
  14 siblings, 0 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Load the doorbell record umem from the umem_list, falling back to
ib_umem_get() for the legacy path. Pass the umem_list and a
per-command slot index through the doorbell mapping infrastructure.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
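On the legacy path the doorbell page is looked up by the page-aligned
user address (virt & PAGE_MASK) and the record's DMA address is the
page's DMA address plus the in-page offset (virt & ~PAGE_MASK). The
userspace sketch below shows only that arithmetic. The demo_* names
and the 4 KiB page size are assumptions made for illustration; the
external umem path in the patch takes its offset from ib_umem_offset()
instead.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4 KiB pages; the kernel uses the architecture's PAGE_SIZE. */
#define DEMO_PAGE_SIZE 4096ULL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

/* Page-aligned base: the key used to find an already mapped doorbell page. */
uint64_t demo_db_page_base(uint64_t virt)
{
	return virt & DEMO_PAGE_MASK;
}

/* Offset of the doorbell record inside that page. */
uint64_t demo_db_page_offset(uint64_t virt)
{
	return virt & ~DEMO_PAGE_MASK;
}

/* DMA address of the record, given the DMA address of the page start. */
uint64_t demo_db_dma(uint64_t page_dma, uint64_t virt)
{
	/* This model assumes the page's DMA address is page aligned. */
	assert((page_dma & ~DEMO_PAGE_MASK) == 0);
	return page_dma + demo_db_page_offset(virt);
}
```

Two doorbell records whose user addresses fall into the same page
yield the same base and can therefore share one mapped page, which is
what the refcount in struct mlx5_ib_user_db_page tracks.
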
 drivers/infiniband/hw/mlx5/cq.c       |  4 ++-
 drivers/infiniband/hw/mlx5/doorbell.c | 41 +++++++++++++++++++++++----
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |  3 +-
 drivers/infiniband/hw/mlx5/qp.c       |  4 +--
 drivers/infiniband/hw/mlx5/srq.c      |  2 +-
 5 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 6118deb5e6dc..ef36417a3c65 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -759,7 +759,9 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
 	if (!page_size)
 		return -EINVAL;
 
-	err = mlx5_ib_db_map_user(context, ucmd.db_addr, &cq->db);
+	err = mlx5_ib_db_map_user(context, ucmd.db_addr,
+				  cq->ibcq.umem_list, UVERBS_BUF_CQ_DBR,
+				  sizeof(__be32) * 2, &cq->db);
 	if (err)
 		return err;
 
diff --git a/drivers/infiniband/hw/mlx5/doorbell.c b/drivers/infiniband/hw/mlx5/doorbell.c
index bd68fcf011f4..a1c5851aba10 100644
--- a/drivers/infiniband/hw/mlx5/doorbell.c
+++ b/drivers/infiniband/hw/mlx5/doorbell.c
@@ -40,25 +40,51 @@
 struct mlx5_ib_user_db_page {
 	struct list_head	list;
 	struct ib_umem	       *umem;
+	struct ib_umem_list     *umem_list;
+	unsigned int		dbr_index;
 	unsigned long		user_virt;
 	int			refcnt;
 	struct mm_struct	*mm;
 };
 
 int mlx5_ib_db_map_user(struct mlx5_ib_ucontext *context, unsigned long virt,
-			struct mlx5_db *db)
+			struct ib_umem_list *umem_list, unsigned int dbr_index,
+			size_t dbr_size, struct mlx5_db *db)
 {
+	unsigned long dma_offset;
 	struct mlx5_ib_user_db_page *page;
+	struct ib_umem *umem;
 	int err = 0;
 
 	mutex_lock(&context->db_page_mutex);
 
+	umem = ib_umem_list_load(umem_list, dbr_index, dbr_size);
+	if (IS_ERR(umem)) {
+		err = PTR_ERR(umem);
+		goto out;
+	} else if (umem) {
+		/* External umem path - no page sharing */
+		page = kzalloc_obj(*page);
+		if (!page) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		page->umem = umem;
+		page->umem_list = umem_list;
+		page->dbr_index = dbr_index;
+		dma_offset = ib_umem_offset(umem);
+		goto add_page;
+	}
+
+	dma_offset = virt & ~PAGE_MASK;
+
 	list_for_each_entry(page, &context->db_page_list, list)
 		if ((current->mm == page->mm) &&
 		    (page->user_virt == (virt & PAGE_MASK)))
 			goto found;
 
-	page = kmalloc_obj(*page);
+	page = kzalloc_obj(*page);
 	if (!page) {
 		err = -ENOMEM;
 		goto out;
@@ -76,11 +102,11 @@ int mlx5_ib_db_map_user(struct mlx5_ib_ucontext *context, unsigned long virt,
 	mmgrab(current->mm);
 	page->mm = current->mm;
 
+add_page:
 	list_add(&page->list, &context->db_page_list);
 
 found:
-	db->dma = sg_dma_address(page->umem->sgt_append.sgt.sgl) +
-		  (virt & ~PAGE_MASK);
+	db->dma = sg_dma_address(page->umem->sgt_append.sgt.sgl) + dma_offset;
 	db->u.user_page = page;
 	++page->refcnt;
 
@@ -96,8 +122,11 @@ void mlx5_ib_db_unmap_user(struct mlx5_ib_ucontext *context, struct mlx5_db *db)
 
 	if (!--db->u.user_page->refcnt) {
 		list_del(&db->u.user_page->list);
-		mmdrop(db->u.user_page->mm);
-		ib_umem_release(db->u.user_page->umem);
+		if (db->u.user_page->mm)
+			mmdrop(db->u.user_page->mm);
+		ib_umem_release_non_listed(db->u.user_page->umem_list,
+					  db->u.user_page->dbr_index,
+					  db->u.user_page->umem);
 		kfree(db->u.user_page);
 	}
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 94d1e4f83679..f68f8466e60a 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1261,7 +1261,8 @@ to_mmmap(struct rdma_user_mmap_entry *rdma_entry)
 int mlx5_ib_dev_res_cq_init(struct mlx5_ib_dev *dev);
 int mlx5_ib_dev_res_srq_init(struct mlx5_ib_dev *dev);
 int mlx5_ib_db_map_user(struct mlx5_ib_ucontext *context, unsigned long virt,
-			struct mlx5_db *db);
+			struct ib_umem_list *umem_list, unsigned int dbr_index,
+			size_t dbr_size, struct mlx5_db *db);
 void mlx5_ib_db_unmap_user(struct mlx5_ib_ucontext *context, struct mlx5_db *db);
 void __mlx5_ib_cq_clean(struct mlx5_ib_cq *cq, u32 qpn, struct mlx5_ib_srq *srq);
 void mlx5_ib_cq_clean(struct mlx5_ib_cq *cq, u32 qpn, struct mlx5_ib_srq *srq);
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index ba5b41fa5ef9..3edfe44f911a 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -918,7 +918,7 @@ static int create_user_rq(struct mlx5_ib_dev *dev, struct ib_pd *pd,
 		ib_umem_num_pages(rwq->umem), page_size, rwq->rq_num_pas,
 		offset);
 
-	err = mlx5_ib_db_map_user(ucontext, ucmd->db_addr, &rwq->db);
+	err = mlx5_ib_db_map_user(ucontext, ucmd->db_addr, NULL, 0, 0, &rwq->db);
 	if (err) {
 		mlx5_ib_dbg(dev, "map failed\n");
 		goto err_umem;
@@ -1062,7 +1062,7 @@ static int _create_user_qp(struct mlx5_ib_dev *dev, struct ib_pd *pd,
 		resp->bfreg_index = MLX5_IB_INVALID_BFREG;
 	qp->bfregn = bfregn;
 
-	err = mlx5_ib_db_map_user(context, ucmd->db_addr, &qp->db);
+	err = mlx5_ib_db_map_user(context, ucmd->db_addr, NULL, 0, 0, &qp->db);
 	if (err) {
 		mlx5_ib_dbg(dev, "map failed\n");
 		goto err_free;
diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
index 852f6f502d14..d4dbbd5a500f 100644
--- a/drivers/infiniband/hw/mlx5/srq.c
+++ b/drivers/infiniband/hw/mlx5/srq.c
@@ -74,7 +74,7 @@ static int create_srq_user(struct ib_pd *pd, struct mlx5_ib_srq *srq,
 	}
 	in->umem = srq->umem;
 
-	err = mlx5_ib_db_map_user(ucontext, ucmd.db_addr, &srq->db);
+	err = mlx5_ib_db_map_user(ucontext, ucmd.db_addr, NULL, 0, 0, &srq->db);
 	if (err) {
 		mlx5_ib_dbg(dev, "map doorbell failed\n");
 		goto err_umem;
-- 
2.53.0



* [PATCH rdma-next v2 14/15] RDMA/uverbs: Add doorbell record buffer slot to QP umem_list
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
                   ` (12 preceding siblings ...)
  2026-04-11 14:49 ` [PATCH rdma-next v2 13/15] RDMA/mlx5: Use umem_list for CQ doorbell record Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  2026-04-11 14:49 ` [PATCH rdma-next v2 15/15] RDMA/mlx5: Use umem_list for QP doorbell record Jiri Pirko
  14 siblings, 0 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Extend the QP buffer slot enum with UVERBS_BUF_QP_DBR, allowing
userspace to provide doorbell record memory via the generic buffer
descriptor infrastructure.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
 include/uapi/rdma/ib_user_ioctl_cmds.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
index 26c2e3b2125a..1a47942ca1a6 100644
--- a/include/uapi/rdma/ib_user_ioctl_cmds.h
+++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
@@ -172,6 +172,7 @@ enum uverbs_buf_qp_slots {
 	UVERBS_BUF_QP_BUF,
 	UVERBS_BUF_QP_RQ_BUF,
 	UVERBS_BUF_QP_SQ_BUF,
+	UVERBS_BUF_QP_DBR_BUF,
 	__UVERBS_BUF_QP_MAX,
 	UVERBS_BUF_QP_MAX = __UVERBS_BUF_QP_MAX - 1,
 };
-- 
2.53.0



* [PATCH rdma-next v2 15/15] RDMA/mlx5: Use umem_list for QP doorbell record
  2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
                   ` (13 preceding siblings ...)
  2026-04-11 14:49 ` [PATCH rdma-next v2 14/15] RDMA/uverbs: Add doorbell record buffer slot to QP umem_list Jiri Pirko
@ 2026-04-11 14:49 ` Jiri Pirko
  14 siblings, 0 replies; 81+ messages in thread
From: Jiri Pirko @ 2026-04-11 14:49 UTC (permalink / raw)
  To: linux-rdma
  Cc: jgg, leon, mrgolin, gal.pressman, sleybo, parav, mbloch,
	yanjun.zhu, marco.crivellari, roman.gushchin, phaddad, lirongqing,
	ynachum, huangjunxian6, kalesh-anakkur.purayil, ohartoov,
	michaelgur, shayd, edwards, sriharsha.basavapatna,
	andrew.gospodarek, selvin.xavier

From: Jiri Pirko <jiri@nvidia.com>

Pass the QP umem_list to the doorbell mapping infrastructure for
QP creation.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
---
 drivers/infiniband/hw/mlx5/qp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 3edfe44f911a..6010fbb43d7a 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -1062,7 +1062,9 @@ static int _create_user_qp(struct mlx5_ib_dev *dev, struct ib_pd *pd,
 		resp->bfreg_index = MLX5_IB_INVALID_BFREG;
 	qp->bfregn = bfregn;
 
-	err = mlx5_ib_db_map_user(context, ucmd->db_addr, NULL, 0, 0, &qp->db);
+	err = mlx5_ib_db_map_user(context, ucmd->db_addr,
+				  qp->ibqp.umem_list, UVERBS_BUF_QP_DBR_BUF,
+				  sizeof(__be32) * 2, &qp->db);
 	if (err) {
 		mlx5_ib_dbg(dev, "map failed\n");
 		goto err_free;
-- 
2.53.0



* [PATCH v2 0/4] Firmware LSM hook
@ 2026-03-31  5:56 Leon Romanovsky
  2026-03-31  5:56 ` [PATCH v2 1/4] bpf: add firmware command validation hook Leon Romanovsky
                   ` (4 more replies)
  0 siblings, 5 replies; 81+ messages in thread
From: Leon Romanovsky @ 2026-03-31  5:56 UTC (permalink / raw)
  To: KP Singh, Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Leon Romanovsky, Jason Gunthorpe,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron
  Cc: bpf, linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, Jonathan Cameron

From Chiara:

This patch set introduces a new BPF LSM hook to validate firmware commands
triggered by userspace before they are submitted to the device. The hook
runs after the command buffer is constructed, right before it is sent
to firmware.

The goal is to allow a security module to allow or deny a given command
before it is submitted to firmware. BPF LSM can attach to this hook
to implement such policies. This allows fine-grained policies for different
firmware commands. 

In this series, the new hook is called from RDMA uverbs and from the fwctl
subsystem. Both the uverbs and fwctl interfaces use ioctl, so an obvious
candidate would seem to be the file_ioctl hook. However, the userspace
attributes used to build the firmware command buffer are copied from
userspace (copy_from_user()) deep in the driver, depending on various
conditions. As a result, file_ioctl does not have the information required
to make a policy decision.

This newly introduced hook provides the command buffer together with relevant
metadata (device, command class, and a class-specific device identifier), so
security modules can distinguish between different command classes and devices.

The hook can be used by other drivers that submit firmware commands via a command
buffer.

Thanks

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
Changes in v2:
- Fixed style formatting issues pointed out by Jonathan
- Added Jonathan's and Dave's Reviewed-by tags
- Implemented as BPF LSM hook instead of general LSM hook
- Added a selftest that exercises the new hook
- Removed extra FW_CMD_CLASS_MAX enum, it is not needed
- Link to v1: https://patch.msgid.link/20260309-fw-lsm-hook-v1-0-4a6422e63725@nvidia.com

---
Chiara Meiohas (4):
      bpf: add firmware command validation hook
      selftests/bpf: add test cases for fw_validate_cmd hook
      RDMA/mlx5: Externally validate FW commands supplied in DEVX interface
      fwctl/mlx5: Externally validate FW commands supplied in fwctl

 drivers/fwctl/mlx5/main.c                        | 12 +++++-
 drivers/infiniband/hw/mlx5/devx.c                | 49 ++++++++++++++++++------
 include/linux/bpf_lsm.h                          | 41 ++++++++++++++++++++
 kernel/bpf/bpf_lsm.c                             | 11 ++++++
 tools/testing/selftests/bpf/progs/verifier_lsm.c | 23 +++++++++++
 5 files changed, 122 insertions(+), 14 deletions(-)
---
base-commit: 11439c4635edd669ae435eec308f4ab8a0804808
change-id: 20260309-fw-lsm-hook-7c094f909ffc

Best regards,
--  
Leon Romanovsky <leonro@nvidia.com>


* [PATCH v2 1/4] bpf: add firmware command validation hook
  2026-03-31  5:56 [PATCH v2 0/4] Firmware LSM hook Leon Romanovsky
@ 2026-03-31  5:56 ` Leon Romanovsky
  2026-04-16  8:43   ` Matt Bobrowski
  2026-03-31  5:56 ` [PATCH v2 2/4] selftests/bpf: add test cases for fw_validate_cmd hook Leon Romanovsky
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 81+ messages in thread
From: Leon Romanovsky @ 2026-03-31  5:56 UTC (permalink / raw)
  To: KP Singh, Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Leon Romanovsky, Jason Gunthorpe,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron
  Cc: bpf, linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla

From: Chiara Meiohas <cmeiohas@nvidia.com>

Drivers communicate with device firmware either via register-based
commands (writing parameters into device registers) or by passing
a command buffer using shared-memory mechanisms.

The proposed fw_validate_cmd hook is intended for the command buffer
mechanism, which is commonly used on modern, complex devices.

This hook allows inspecting firmware command buffers before they are
sent to the device.
The hook receives the command buffer, device, command class, and a
class-specific id:
  - class_id (enum fw_cmd_class) allows BPF programs to
    differentiate between classes of firmware commands.
    In this series, class_id distinguishes between commands from the
    RDMA uverbs interface and from fwctl.
  - id is a class-specific device identifier. For uverbs, id is the
    RDMA driver identifier (enum rdma_driver_id). For fwctl, id is the
    device type (enum fwctl_device_type).

The mailbox format varies across vendors and may even differ between
firmware versions, so policy authors must be familiar with the
specific device's mailbox format. BPF programs can be tailored to
inspect the mailbox accordingly, making BPF the natural fit.
Therefore, the hook is defined using the LSM_HOOK macro in bpf_lsm.c
rather than in lsm_hook_defs.h, as it is a BPF-only hook.
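
As a rough illustration of how a policy might use class_id and id, the
following is a plain userspace C sketch, not a BPF program and not part
of this patch. The function example_fw_policy() and the value
EXAMPLE_ALLOWED_UVERBS_ID are hypothetical; only enum fw_cmd_class and
the parameter list mirror the hook described above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors enum fw_cmd_class from the patch. */
enum fw_cmd_class {
	FW_CMD_CLASS_UVERBS,
	FW_CMD_CLASS_FWCTL,
};

/* Opaque stand-in for the kernel's struct device. */
struct device;

/*
 * Hypothetical id value, for illustration only. Real callers pass an
 * enum rdma_driver_id or enum fwctl_device_type value.
 */
#define EXAMPLE_ALLOWED_UVERBS_ID 1u

/*
 * Hypothetical policy taking the same parameters as the hook.
 * Returns 0 to allow the command, a negative errno to deny it.
 */
int example_fw_policy(const void *in, size_t in_len,
		      const struct device *dev,
		      enum fw_cmd_class class_id, uint32_t id)
{
	(void)dev;

	/* An empty or missing buffer cannot be inspected; deny it. */
	if (!in || !in_len)
		return -EINVAL;

	switch (class_id) {
	case FW_CMD_CLASS_UVERBS:
		/* Here id is treated as an RDMA driver identifier. */
		return id == EXAMPLE_ALLOWED_UVERBS_ID ? 0 : -EPERM;
	case FW_CMD_CLASS_FWCTL:
		/* Here id would be a fwctl device type; deny them all. */
		return -EPERM;
	}

	/* Unknown class: deny by default. */
	return -EPERM;
}

/* Usage: exercises each branch of the example policy. */
void example_fw_policy_selfcheck(void)
{
	unsigned char buf[4] = { 0 };

	assert(example_fw_policy(buf, sizeof(buf), NULL,
				 FW_CMD_CLASS_UVERBS, 1u) == 0);
	assert(example_fw_policy(buf, sizeof(buf), NULL,
				 FW_CMD_CLASS_FWCTL, 1u) == -EPERM);
	assert(example_fw_policy(NULL, 0, NULL,
				 FW_CMD_CLASS_UVERBS, 1u) == -EINVAL);
}
```

Because id is only meaningful relative to class_id, the sketch switches
on class_id first and interprets id inside each case.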

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/bpf_lsm.h | 41 +++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/bpf_lsm.c    | 11 +++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/linux/bpf_lsm.h b/include/linux/bpf_lsm.h
index 643809cc78c33..7ad7e153f486c 100644
--- a/include/linux/bpf_lsm.h
+++ b/include/linux/bpf_lsm.h
@@ -12,6 +12,21 @@
 #include <linux/bpf_verifier.h>
 #include <linux/lsm_hooks.h>
 
+struct device;
+
+/**
+ * enum fw_cmd_class - Class of the firmware command passed to
+ * bpf_lsm_fw_validate_cmd.
+ * This allows BPF programs to distinguish between different command classes.
+ *
+ * @FW_CMD_CLASS_UVERBS: Command originated from the RDMA uverbs interface
+ * @FW_CMD_CLASS_FWCTL: Command originated from the fwctl interface
+ */
+enum fw_cmd_class {
+	FW_CMD_CLASS_UVERBS,
+	FW_CMD_CLASS_FWCTL,
+};
+
 #ifdef CONFIG_BPF_LSM
 
 #define LSM_HOOK(RET, DEFAULT, NAME, ...) \
@@ -53,6 +68,24 @@ int bpf_set_dentry_xattr_locked(struct dentry *dentry, const char *name__str,
 int bpf_remove_dentry_xattr_locked(struct dentry *dentry, const char *name__str);
 bool bpf_lsm_has_d_inode_locked(const struct bpf_prog *prog);
 
+/**
+ * bpf_lsm_fw_validate_cmd() - Validate a firmware command
+ * @in: pointer to the firmware command input buffer
+ * @in_len: length of the firmware command input buffer
+ * @dev: device associated with the command
+ * @class_id: class of the firmware command
+ * @id: device identifier, specific to the command @class_id
+ *
+ * Check permissions before sending a firmware command generated by
+ * userspace to the device.
+ *
+ * Return: Returns 0 if permission is granted, or a negative errno
+ * value to deny the operation.
+ */
+int bpf_lsm_fw_validate_cmd(const void *in, size_t in_len,
+			    const struct device *dev,
+			    enum fw_cmd_class class_id, u32 id);
+
 #else /* !CONFIG_BPF_LSM */
 
 static inline bool bpf_lsm_is_sleepable_hook(u32 btf_id)
@@ -104,6 +137,14 @@ static inline bool bpf_lsm_has_d_inode_locked(const struct bpf_prog *prog)
 {
 	return false;
 }
+
+static inline int bpf_lsm_fw_validate_cmd(const void *in, size_t in_len,
+					  const struct device *dev,
+					  enum fw_cmd_class class_id, u32 id)
+{
+	return 0;
+}
+
 #endif /* CONFIG_BPF_LSM */
 
 #endif /* _LINUX_BPF_LSM_H */
diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
index 0c4a0c8e6f703..fbdc056995fee 100644
--- a/kernel/bpf/bpf_lsm.c
+++ b/kernel/bpf/bpf_lsm.c
@@ -28,12 +28,23 @@ __weak noinline RET bpf_lsm_##NAME(__VA_ARGS__)	\
 }
 
 #include <linux/lsm_hook_defs.h>
+
+/*
+ * fw_validate_cmd is not in lsm_hook_defs.h because it is a BPF-only
+ * hook — mailbox formats are device-specific, making BPF the natural
+ * fit for inspection.
+ */
+LSM_HOOK(int, 0, fw_validate_cmd, const void *in, size_t in_len,
+	 const struct device *dev, enum fw_cmd_class class_id, u32 id)
+EXPORT_SYMBOL_GPL(bpf_lsm_fw_validate_cmd);
+
 #undef LSM_HOOK
 
 #define LSM_HOOK(RET, DEFAULT, NAME, ...) BTF_ID(func, bpf_lsm_##NAME)
 BTF_SET_START(bpf_lsm_hooks)
 #include <linux/lsm_hook_defs.h>
 #undef LSM_HOOK
+BTF_ID(func, bpf_lsm_fw_validate_cmd)
 BTF_SET_END(bpf_lsm_hooks)
 
 BTF_SET_START(bpf_lsm_disabled_hooks)

-- 
2.53.0



* Re: [PATCH v2 1/4] bpf: add firmware command validation hook
  2026-03-31  5:56 ` [PATCH v2 1/4] bpf: add firmware command validation hook Leon Romanovsky
@ 2026-04-16  8:43   ` Matt Bobrowski
  0 siblings, 0 replies; 81+ messages in thread
From: Matt Bobrowski @ 2026-04-16  8:43 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: KP Singh, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Jason Gunthorpe, Saeed Mahameed, Itay Avraham, Dave Jiang,
	Jonathan Cameron, bpf, linux-kernel, linux-kselftest, linux-rdma,
	Chiara Meiohas, Maher Sanalla

On Tue, Mar 31, 2026 at 08:56:33AM +0300, Leon Romanovsky wrote:
> From: Chiara Meiohas <cmeiohas@nvidia.com>
> 
> Drivers communicate with device firmware either via register-based
> commands (writing parameters into device registers) or by passing
> a command buffer using shared-memory mechanisms.
> 
> The proposed fw_validate_cmd hook is intended for the command buffer
> mechanism, which is commonly used on modern, complex devices.
> 
> This hook allows inspecting firmware command buffers before they are
> sent to the device.
> The hook receives the command buffer, device, command class, and a
> class-specific id:
>   - class_id (enum fw_cmd_class) allows BPF programs to
>     differentiate between classes of firmware commands.
>     In this series, class_id distinguishes between commands from the
>     RDMA uverbs interface and from fwctl.
>   - id is a class-specific device identifier. For uverbs, id is the
>     RDMA driver identifier (enum rdma_driver_id). For fwctl, id is the
>     device type (enum fwctl_device_type).
> 
> The mailbox format varies across vendors and may even differ between
> firmware versions, so policy authors must be familiar with the
> specific device's mailbox format. BPF programs can be tailored to
> inspect the mailbox accordingly, making BPF the natural fit.
> Therefore, the hook is defined using the LSM_HOOK macro in bpf_lsm.c
> rather than in lsm_hook_defs.h, as it is a BPF-only hook.
> 
> Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
> Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  include/linux/bpf_lsm.h | 41 +++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/bpf_lsm.c    | 11 +++++++++++
>  2 files changed, 52 insertions(+)
> 
> diff --git a/include/linux/bpf_lsm.h b/include/linux/bpf_lsm.h
> index 643809cc78c33..7ad7e153f486c 100644
> --- a/include/linux/bpf_lsm.h
> +++ b/include/linux/bpf_lsm.h
> @@ -12,6 +12,21 @@
>  #include <linux/bpf_verifier.h>
>  #include <linux/lsm_hooks.h>
>  
> +struct device;
> +
> +/**
> + * enum fw_cmd_class - Class of the firmware command passed to
> + * bpf_lsm_fw_validate_cmd.
> + * This allows BPF programs to distinguish between different command classes.
> + *
> + * @FW_CMD_CLASS_UVERBS: Command originated from the RDMA uverbs interface
> + * @FW_CMD_CLASS_FWCTL: Command originated from the fwctl interface
> + */
> +enum fw_cmd_class {
> +	FW_CMD_CLASS_UVERBS,
> +	FW_CMD_CLASS_FWCTL,
> +};
> +
>  #ifdef CONFIG_BPF_LSM
>  
>  #define LSM_HOOK(RET, DEFAULT, NAME, ...) \
> @@ -53,6 +68,24 @@ int bpf_set_dentry_xattr_locked(struct dentry *dentry, const char *name__str,
>  int bpf_remove_dentry_xattr_locked(struct dentry *dentry, const char *name__str);
>  bool bpf_lsm_has_d_inode_locked(const struct bpf_prog *prog);
>  
> +/**
> + * bpf_lsm_fw_validate_cmd() - Validate a firmware command
> + * @in: pointer to the firmware command input buffer
> + * @in_len: length of the firmware command input buffer
> + * @dev: device associated with the command
> + * @class_id: class of the firmware command
> + * @id: device identifier, specific to the command @class_id
> + *
> + * Check permissions before sending a firmware command generated by
> + * userspace to the device.
> + *
> + * Return: Returns 0 if permission is granted, or a negative errno
> + * value to deny the operation.
> + */
> +int bpf_lsm_fw_validate_cmd(const void *in, size_t in_len,
> +			    const struct device *dev,
> +			    enum fw_cmd_class class_id, u32 id);
> +
>  #else /* !CONFIG_BPF_LSM */
>  
>  static inline bool bpf_lsm_is_sleepable_hook(u32 btf_id)
> @@ -104,6 +137,14 @@ static inline bool bpf_lsm_has_d_inode_locked(const struct bpf_prog *prog)
>  {
>  	return false;
>  }
> +
> +static inline int bpf_lsm_fw_validate_cmd(const void *in, size_t in_len,
> +					  const struct device *dev,
> +					  enum fw_cmd_class class_id, u32 id)
> +{
> +	return 0;
> +}
> +
>  #endif /* CONFIG_BPF_LSM */
>  
>  #endif /* _LINUX_BPF_LSM_H */
> diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
> index 0c4a0c8e6f703..fbdc056995fee 100644
> --- a/kernel/bpf/bpf_lsm.c
> +++ b/kernel/bpf/bpf_lsm.c
> @@ -28,12 +28,23 @@ __weak noinline RET bpf_lsm_##NAME(__VA_ARGS__)	\
>  }
>  
>  #include <linux/lsm_hook_defs.h>
> +
> +/*
> + * fw_validate_cmd is not in lsm_hook_defs.h because it is a BPF-only
> + * hook — mailbox formats are device-specific, making BPF the natural
> + * fit for inspection.
> + */
> +LSM_HOOK(int, 0, fw_validate_cmd, const void *in, size_t in_len,
> +	 const struct device *dev, enum fw_cmd_class class_id, u32 id)
> +EXPORT_SYMBOL_GPL(bpf_lsm_fw_validate_cmd);
> +

If you decide to stick w/ this BPF LSM based workaround, you can drop
the reliance on LSM_HOOK() entirely here.
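
One possible reading of this suggestion, sketched in userspace C: going
by the macro context shown in the diff, LSM_HOOK() expands to a weak,
non-inlined function returning the default value, so the function could
presumably be written out by hand. The GCC/Clang attributes below stand
in for the kernel's __weak and noinline, uint32_t stands in for u32,
and EXPORT_SYMBOL_GPL() is omitted; all of that is an assumption of
this sketch, not something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-in for the kernel's struct device. */
struct device;

/* Mirrors enum fw_cmd_class from the patch. */
enum fw_cmd_class {
	FW_CMD_CLASS_UVERBS,
	FW_CMD_CLASS_FWCTL,
};

/*
 * Hand-written equivalent of the macro-generated default hook body.
 * With no program attached, the default is to allow (return 0).
 */
__attribute__((weak, noinline))
int bpf_lsm_fw_validate_cmd(const void *in, size_t in_len,
			    const struct device *dev,
			    enum fw_cmd_class class_id, uint32_t id)
{
	(void)in;
	(void)in_len;
	(void)dev;
	(void)class_id;
	(void)id;
	return 0;
}

/* Usage: the default implementation allows every command. */
void example_default_hook_selfcheck(void)
{
	assert(bpf_lsm_fw_validate_cmd(NULL, 0, NULL,
				       FW_CMD_CLASS_FWCTL, 0) == 0);
}
```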

>  #undef LSM_HOOK
>  
>  #define LSM_HOOK(RET, DEFAULT, NAME, ...) BTF_ID(func, bpf_lsm_##NAME)
>  BTF_SET_START(bpf_lsm_hooks)
>  #include <linux/lsm_hook_defs.h>
>  #undef LSM_HOOK
> +BTF_ID(func, bpf_lsm_fw_validate_cmd)
>  BTF_SET_END(bpf_lsm_hooks)
>  
>  BTF_SET_START(bpf_lsm_disabled_hooks)
> 
> -- 
> 2.53.0
> 


* [PATCH v2 2/4] selftests/bpf: add test cases for fw_validate_cmd hook
  2026-03-31  5:56 [PATCH v2 0/4] Firmware LSM hook Leon Romanovsky
  2026-03-31  5:56 ` [PATCH v2 1/4] bpf: add firmware command validation hook Leon Romanovsky
@ 2026-03-31  5:56 ` Leon Romanovsky
  2026-03-31  5:56 ` [PATCH v2 3/4] RDMA/mlx5: Externally validate FW commands supplied in DEVX interface Leon Romanovsky
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 81+ messages in thread
From: Leon Romanovsky @ 2026-03-31  5:56 UTC (permalink / raw)
  To: KP Singh, Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Leon Romanovsky, Jason Gunthorpe,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron
  Cc: bpf, linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla

From: Chiara Meiohas <cmeiohas@nvidia.com>

The first test validates that the BPF verifier accepts a program
that accesses the hook parameters (in_len) and returns
values in the valid errno range.

The second test validates that the BPF verifier rejects a program
that returns a positive value, which is outside the valid [-4095, 0]
return range for BPF-LSM hooks.
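
To make the range concrete, here is a small userspace C sketch of the
bounds check implied by the expected verifier message. It is not the
verifier's implementation; example_lsm_ret_in_range() and the two
EXAMPLE_ macros are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Bounds quoted in the expected verifier message: [-4095, 0]. */
#define EXAMPLE_LSM_RET_MIN (-4095)
#define EXAMPLE_LSM_RET_MAX 0

/* Hypothetical helper: true if ret lies inside the accepted range. */
bool example_lsm_ret_in_range(long ret)
{
	return ret >= EXAMPLE_LSM_RET_MIN && ret <= EXAMPLE_LSM_RET_MAX;
}

/* Usage: the values returned by the two selftest programs. */
void example_lsm_ret_selfcheck(void)
{
	assert(example_lsm_ret_in_range(0));	/* allow */
	assert(example_lsm_ret_in_range(-22));	/* first test's error path */
	assert(!example_lsm_ret_in_range(1));	/* second test's return value */
}
```

The first test returns either 0 or -22 (-EINVAL on Linux), both inside
the range; the second returns 1, which falls outside it.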

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 tools/testing/selftests/bpf/progs/verifier_lsm.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/verifier_lsm.c b/tools/testing/selftests/bpf/progs/verifier_lsm.c
index 38e8e91768620..9b2487948f8cb 100644
--- a/tools/testing/selftests/bpf/progs/verifier_lsm.c
+++ b/tools/testing/selftests/bpf/progs/verifier_lsm.c
@@ -188,4 +188,27 @@ int BPF_PROG(null_check, struct file *file)
 	return 0;
 }
 
+SEC("lsm/fw_validate_cmd")
+__description("lsm fw_validate_cmd: validate hook parameters")
+__success
+int BPF_PROG(fw_validate_cmd_test, const void *in, size_t in_len,
+	     const struct device *dev, enum fw_cmd_class class_id, u32 id)
+{
+	if (!in_len)
+		return -22;
+
+	return 0;
+}
+
+SEC("lsm/fw_validate_cmd")
+__description("lsm fw_validate_cmd: invalid positive return")
+__failure __msg("R0 has smin=1 smax=1 should have been in [-4095, 0]")
+__naked int fw_validate_cmd_fail(void *ctx)
+{
+	asm volatile (
+	"r0 = 1;"
+	"exit;"
+	::: __clobber_all);
+}
+
 char _license[] SEC("license") = "GPL";

-- 
2.53.0



* [PATCH v2 3/4] RDMA/mlx5: Externally validate FW commands supplied in DEVX interface
  2026-03-31  5:56 [PATCH v2 0/4] Firmware LSM hook Leon Romanovsky
  2026-03-31  5:56 ` [PATCH v2 1/4] bpf: add firmware command validation hook Leon Romanovsky
  2026-03-31  5:56 ` [PATCH v2 2/4] selftests/bpf: add test cases for fw_validate_cmd hook Leon Romanovsky
@ 2026-03-31  5:56 ` Leon Romanovsky
  2026-03-31  5:56 ` [PATCH v2 4/4] fwctl/mlx5: Externally validate FW commands supplied in fwctl Leon Romanovsky
  2026-04-09 12:12 ` [PATCH v2 0/4] Firmware LSM hook Leon Romanovsky
  4 siblings, 0 replies; 81+ messages in thread
From: Leon Romanovsky @ 2026-03-31  5:56 UTC (permalink / raw)
  To: KP Singh, Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Leon Romanovsky, Jason Gunthorpe,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron
  Cc: bpf, linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, Jonathan Cameron

From: Chiara Meiohas <cmeiohas@nvidia.com>

DEVX is an RDMA uverbs extension that allows userspace to submit
firmware command buffers. The driver inspects the command and then
passes the buffer through for firmware execution.

Call bpf_lsm_fw_validate_cmd() before dispatching firmware commands
through DEVX.

This allows BPF programs to enforce custom, per-command security
policies on user-triggered firmware commands.
For example, a BPF program could restrict specific firmware
operations to privileged users.
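
The call ordering can be modelled in userspace C as below. This is a
sketch only: example_validate(), example_dispatch() and
example_submit() are hypothetical stand-ins for the hook, for the
firmware dispatch call and for a DEVX handler, and the length check is
an arbitrary placeholder policy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Counts how many commands reached the stubbed firmware dispatch. */
int example_dispatched;

/* Hypothetical validator: denies buffers shorter than two bytes. */
int example_validate(const void *in, size_t in_len)
{
	return (in && in_len >= 2) ? 0 : -EPERM;
}

/* Stub standing in for the real firmware dispatch call. */
int example_dispatch(const void *in, size_t in_len)
{
	(void)in;
	(void)in_len;
	example_dispatched++;
	return 0;
}

/* Validate first; a denied command never reaches dispatch. */
int example_submit(const void *in, size_t in_len)
{
	int err = example_validate(in, in_len);

	if (err)
		return err;

	return example_dispatch(in, in_len);
}

/* Usage: one allowed and one denied submission. */
void example_submit_selfcheck(void)
{
	unsigned char cmd[8] = { 0 };

	example_dispatched = 0;
	assert(example_submit(cmd, sizeof(cmd)) == 0);
	assert(example_dispatched == 1);
	assert(example_submit(cmd, 1) == -EPERM);
	assert(example_dispatched == 1);	/* denied, not dispatched */
}
```

Returning the validator's error directly means the caller sees the
policy's errno rather than a firmware status.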

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/hw/mlx5/devx.c | 49 +++++++++++++++++++++++++++++----------
 1 file changed, 37 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/devx.c b/drivers/infiniband/hw/mlx5/devx.c
index 0066b2738ac89..b7a2e19987018 100644
--- a/drivers/infiniband/hw/mlx5/devx.c
+++ b/drivers/infiniband/hw/mlx5/devx.c
@@ -18,6 +18,7 @@
 #include "devx.h"
 #include "qp.h"
 #include <linux/xarray.h>
+#include <linux/bpf_lsm.h>
 
 #define UVERBS_MODULE_NAME mlx5_ib
 #include <rdma/uverbs_named_ioctl.h>
@@ -1111,6 +1112,8 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_DEVX_OTHER)(
 	struct mlx5_ib_dev *dev;
 	void *cmd_in = uverbs_attr_get_alloced_ptr(
 		attrs, MLX5_IB_ATTR_DEVX_OTHER_CMD_IN);
+	int cmd_in_len = uverbs_attr_get_len(attrs,
+					MLX5_IB_ATTR_DEVX_OTHER_CMD_IN);
 	int cmd_out_len = uverbs_attr_get_len(attrs,
 					MLX5_IB_ATTR_DEVX_OTHER_CMD_OUT);
 	void *cmd_out;
@@ -1135,9 +1138,12 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_DEVX_OTHER)(
 		return PTR_ERR(cmd_out);
 
 	MLX5_SET(general_obj_in_cmd_hdr, cmd_in, uid, uid);
-	err = mlx5_cmd_do(dev->mdev, cmd_in,
-			  uverbs_attr_get_len(attrs, MLX5_IB_ATTR_DEVX_OTHER_CMD_IN),
-			  cmd_out, cmd_out_len);
+	err = bpf_lsm_fw_validate_cmd(cmd_in, cmd_in_len, &dev->ib_dev.dev,
+				      FW_CMD_CLASS_UVERBS, RDMA_DRIVER_MLX5);
+	if (err)
+		return err;
+
+	err = mlx5_cmd_do(dev->mdev, cmd_in, cmd_in_len, cmd_out, cmd_out_len);
 	if (err && err != -EREMOTEIO)
 		return err;
 
@@ -1570,6 +1576,11 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_DEVX_OBJ_CREATE)(
 		devx_set_umem_valid(cmd_in);
 	}
 
+	err = bpf_lsm_fw_validate_cmd(cmd_in, cmd_in_len, &dev->ib_dev.dev,
+				      FW_CMD_CLASS_UVERBS, RDMA_DRIVER_MLX5);
+	if (err)
+		goto obj_free;
+
 	if (opcode == MLX5_CMD_OP_CREATE_DCT) {
 		obj->flags |= DEVX_OBJ_FLAGS_DCT;
 		err = mlx5_core_create_dct(dev, &obj->core_dct, cmd_in,
@@ -1646,6 +1657,8 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_DEVX_OBJ_MODIFY)(
 	struct uverbs_attr_bundle *attrs)
 {
 	void *cmd_in = uverbs_attr_get_alloced_ptr(attrs, MLX5_IB_ATTR_DEVX_OBJ_MODIFY_CMD_IN);
+	int cmd_in_len = uverbs_attr_get_len(attrs,
+					MLX5_IB_ATTR_DEVX_OBJ_MODIFY_CMD_IN);
 	int cmd_out_len = uverbs_attr_get_len(attrs,
 					MLX5_IB_ATTR_DEVX_OBJ_MODIFY_CMD_OUT);
 	struct ib_uobject *uobj = uverbs_attr_get_uobject(attrs,
@@ -1676,10 +1689,12 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_DEVX_OBJ_MODIFY)(
 
 	MLX5_SET(general_obj_in_cmd_hdr, cmd_in, uid, uid);
 	devx_set_umem_valid(cmd_in);
+	err = bpf_lsm_fw_validate_cmd(cmd_in, cmd_in_len, &mdev->ib_dev.dev,
+				      FW_CMD_CLASS_UVERBS, RDMA_DRIVER_MLX5);
+	if (err)
+		return err;
 
-	err = mlx5_cmd_do(mdev->mdev, cmd_in,
-			  uverbs_attr_get_len(attrs, MLX5_IB_ATTR_DEVX_OBJ_MODIFY_CMD_IN),
-			  cmd_out, cmd_out_len);
+	err = mlx5_cmd_do(mdev->mdev, cmd_in, cmd_in_len, cmd_out, cmd_out_len);
 	if (err && err != -EREMOTEIO)
 		return err;
 
@@ -1693,6 +1708,8 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_DEVX_OBJ_QUERY)(
 	struct uverbs_attr_bundle *attrs)
 {
 	void *cmd_in = uverbs_attr_get_alloced_ptr(attrs, MLX5_IB_ATTR_DEVX_OBJ_QUERY_CMD_IN);
+	int cmd_in_len = uverbs_attr_get_len(attrs,
+					     MLX5_IB_ATTR_DEVX_OBJ_QUERY_CMD_IN);
 	int cmd_out_len = uverbs_attr_get_len(attrs,
 					      MLX5_IB_ATTR_DEVX_OBJ_QUERY_CMD_OUT);
 	struct ib_uobject *uobj = uverbs_attr_get_uobject(attrs,
@@ -1722,9 +1739,12 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_DEVX_OBJ_QUERY)(
 		return PTR_ERR(cmd_out);
 
 	MLX5_SET(general_obj_in_cmd_hdr, cmd_in, uid, uid);
-	err = mlx5_cmd_do(mdev->mdev, cmd_in,
-			  uverbs_attr_get_len(attrs, MLX5_IB_ATTR_DEVX_OBJ_QUERY_CMD_IN),
-			  cmd_out, cmd_out_len);
+	err = bpf_lsm_fw_validate_cmd(cmd_in, cmd_in_len, &mdev->ib_dev.dev,
+				      FW_CMD_CLASS_UVERBS, RDMA_DRIVER_MLX5);
+	if (err)
+		return err;
+
+	err = mlx5_cmd_do(mdev->mdev, cmd_in, cmd_in_len, cmd_out, cmd_out_len);
 	if (err && err != -EREMOTEIO)
 		return err;
 
@@ -1832,6 +1852,8 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_DEVX_OBJ_ASYNC_QUERY)(
 {
 	void *cmd_in = uverbs_attr_get_alloced_ptr(attrs,
 				MLX5_IB_ATTR_DEVX_OBJ_QUERY_ASYNC_CMD_IN);
+	int cmd_in_len = uverbs_attr_get_len(attrs,
+				MLX5_IB_ATTR_DEVX_OBJ_QUERY_ASYNC_CMD_IN);
 	struct ib_uobject *uobj = uverbs_attr_get_uobject(
 				attrs,
 				MLX5_IB_ATTR_DEVX_OBJ_QUERY_ASYNC_HANDLE);
@@ -1894,9 +1916,12 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_DEVX_OBJ_ASYNC_QUERY)(
 	async_data->ev_file = ev_file;
 
 	MLX5_SET(general_obj_in_cmd_hdr, cmd_in, uid, uid);
-	err = mlx5_cmd_exec_cb(&ev_file->async_ctx, cmd_in,
-		    uverbs_attr_get_len(attrs,
-				MLX5_IB_ATTR_DEVX_OBJ_QUERY_ASYNC_CMD_IN),
+	err = bpf_lsm_fw_validate_cmd(cmd_in, cmd_in_len, &mdev->ib_dev.dev,
+				      FW_CMD_CLASS_UVERBS, RDMA_DRIVER_MLX5);
+	if (err)
+		goto free_async;
+
+	err = mlx5_cmd_exec_cb(&ev_file->async_ctx, cmd_in, cmd_in_len,
 		    async_data->hdr.out_data,
 		    async_data->cmd_out_len,
 		    devx_query_callback, &async_data->cb_work);

-- 
2.53.0



* [PATCH v2 4/4] fwctl/mlx5: Externally validate FW commands supplied in fwctl
  2026-03-31  5:56 [PATCH v2 0/4] Firmware LSM hook Leon Romanovsky
                   ` (2 preceding siblings ...)
  2026-03-31  5:56 ` [PATCH v2 3/4] RDMA/mlx5: Externally validate FW commands supplied in DEVX interface Leon Romanovsky
@ 2026-03-31  5:56 ` Leon Romanovsky
  2026-04-09 12:12 ` [PATCH v2 0/4] Firmware LSM hook Leon Romanovsky
  4 siblings, 0 replies; 81+ messages in thread
From: Leon Romanovsky @ 2026-03-31  5:56 UTC (permalink / raw)
  To: KP Singh, Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Leon Romanovsky, Jason Gunthorpe,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron
  Cc: bpf, linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, Jonathan Cameron

From: Chiara Meiohas <cmeiohas@nvidia.com>

fwctl is a subsystem that exposes a firmware interface directly to
userspace: it allows userspace to send device-specific command
buffers to firmware. fwctl is focused on debugging, configuration
and provisioning of the device.

Call bpf_lsm_fw_validate_cmd() before dispatching the user-provided
firmware command.

This allows BPF programs to enforce custom, per-command security
policies on user-triggered firmware commands.
For example, a BPF program could filter firmware commands based on
their opcode.
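
A minimal userspace C sketch of opcode-based filtering follows. It
assumes the opcode is a 16-bit big-endian field at the start of the
command buffer, which must be checked against the device's actual
mailbox format. EXAMPLE_DENIED_OPCODE and both function names are
hypothetical and do not correspond to any real firmware opcode or
kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode to block; not a real firmware opcode. */
#define EXAMPLE_DENIED_OPCODE 0x1234u

/*
 * Assumption: the opcode occupies the first two bytes of the buffer,
 * most significant byte first.
 */
int example_read_opcode(const uint8_t *in, size_t in_len, uint16_t *opcode)
{
	if (!in || !opcode || in_len < 2)
		return -EINVAL;

	*opcode = (uint16_t)((in[0] << 8) | in[1]);
	return 0;
}

/* Returns 0 to allow, -EPERM for the denied opcode, -EINVAL if too short. */
int example_filter_by_opcode(const uint8_t *in, size_t in_len)
{
	uint16_t opcode;
	int err = example_read_opcode(in, in_len, &opcode);

	if (err)
		return err;

	return opcode == EXAMPLE_DENIED_OPCODE ? -EPERM : 0;
}

/* Usage: one denied and one allowed buffer. */
void example_filter_selfcheck(void)
{
	const uint8_t denied[4] = { 0x12, 0x34, 0x00, 0x00 };
	const uint8_t allowed[4] = { 0x01, 0x00, 0x00, 0x00 };

	assert(example_filter_by_opcode(denied, sizeof(denied)) == -EPERM);
	assert(example_filter_by_opcode(allowed, sizeof(allowed)) == 0);
}
```

The length check comes before any byte is read, so a truncated buffer
is rejected rather than parsed.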

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/fwctl/mlx5/main.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/fwctl/mlx5/main.c b/drivers/fwctl/mlx5/main.c
index e86ab703c767a..c49dfa1d172d9 100644
--- a/drivers/fwctl/mlx5/main.c
+++ b/drivers/fwctl/mlx5/main.c
@@ -7,6 +7,7 @@
 #include <linux/mlx5/device.h>
 #include <linux/mlx5/driver.h>
 #include <uapi/fwctl/mlx5.h>
+#include <linux/bpf_lsm.h>
 
 #define mlx5ctl_err(mcdev, format, ...) \
 	dev_err(&mcdev->fwctl.dev, format, ##__VA_ARGS__)
@@ -324,6 +325,15 @@ static void *mlx5ctl_fw_rpc(struct fwctl_uctx *uctx, enum fwctl_rpc_scope scope,
 	if (!mlx5ctl_validate_rpc(rpc_in, scope))
 		return ERR_PTR(-EBADMSG);
 
+	/* Enforce the user context for the command */
+	MLX5_SET(mbox_in_hdr, rpc_in, uid, mfd->uctx_uid);
+
+	ret = bpf_lsm_fw_validate_cmd(rpc_in, in_len, &mcdev->fwctl.dev,
+				      FW_CMD_CLASS_FWCTL,
+				      FWCTL_DEVICE_TYPE_MLX5);
+	if (ret)
+		return ERR_PTR(ret);
+
 	/*
 	 * mlx5_cmd_do() copies the input message to its own buffer before
 	 * executing it, so we can reuse the allocation for the output.
@@ -336,8 +346,6 @@ static void *mlx5ctl_fw_rpc(struct fwctl_uctx *uctx, enum fwctl_rpc_scope scope,
 			return ERR_PTR(-ENOMEM);
 	}
 
-	/* Enforce the user context for the command */
-	MLX5_SET(mbox_in_hdr, rpc_in, uid, mfd->uctx_uid);
 	ret = mlx5_cmd_do(mcdev->mdev, rpc_in, in_len, rpc_out, *out_len);
 
 	mlx5ctl_dbg(mcdev,

-- 
2.53.0



* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-03-31  5:56 [PATCH v2 0/4] Firmware LSM hook Leon Romanovsky
                   ` (3 preceding siblings ...)
  2026-03-31  5:56 ` [PATCH v2 4/4] fwctl/mlx5: Externally validate FW commands supplied in fwctl Leon Romanovsky
@ 2026-04-09 12:12 ` Leon Romanovsky
  2026-04-09 12:27   ` Roberto Sassu
  4 siblings, 1 reply; 81+ messages in thread
From: Leon Romanovsky @ 2026-04-09 12:12 UTC (permalink / raw)
  To: KP Singh, Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Jason Gunthorpe, Saeed Mahameed,
	Itay Avraham, Dave Jiang, Jonathan Cameron
  Cc: bpf, linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla

On Tue, Mar 31, 2026 at 08:56:32AM +0300, Leon Romanovsky wrote:
> From Chiara:
> 
> This patch set introduces a new BPF LSM hook to validate firmware commands
> triggered by userspace before they are submitted to the device. The hook
> runs after the command buffer is constructed, right before it is sent
> to firmware.

<...>

> ---
> Chiara Meiohas (4):
>       bpf: add firmware command validation hook
>       selftests/bpf: add test cases for fw_validate_cmd hook
>       RDMA/mlx5: Externally validate FW commands supplied in DEVX interface
>       fwctl/mlx5: Externally validate FW commands supplied in fwctl

Hi,

Can we get Ack from BPF/LSM side?

Thanks

> 
>  drivers/fwctl/mlx5/main.c                        | 12 +++++-
>  drivers/infiniband/hw/mlx5/devx.c                | 49 ++++++++++++++++++------
>  include/linux/bpf_lsm.h                          | 41 ++++++++++++++++++++
>  kernel/bpf/bpf_lsm.c                             | 11 ++++++
>  tools/testing/selftests/bpf/progs/verifier_lsm.c | 23 +++++++++++
>  5 files changed, 122 insertions(+), 14 deletions(-)
> ---
> base-commit: 11439c4635edd669ae435eec308f4ab8a0804808
> change-id: 20260309-fw-lsm-hook-7c094f909ffc
> 
> Best regards,
> --  
> Leon Romanovsky <leonro@nvidia.com>
> 


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-09 12:12 ` [PATCH v2 0/4] Firmware LSM hook Leon Romanovsky
@ 2026-04-09 12:27   ` Roberto Sassu
  2026-04-09 12:45     ` Leon Romanovsky
  0 siblings, 1 reply; 81+ messages in thread
From: Roberto Sassu @ 2026-04-09 12:27 UTC (permalink / raw)
  To: Leon Romanovsky, KP Singh, Matt Bobrowski, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Jason Gunthorpe, Saeed Mahameed, Itay Avraham, Dave Jiang,
	Jonathan Cameron
  Cc: bpf, linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, paul, linux-security-module

On Thu, 2026-04-09 at 15:12 +0300, Leon Romanovsky wrote:
> On Tue, Mar 31, 2026 at 08:56:32AM +0300, Leon Romanovsky wrote:
> > From Chiara:
> > 
> > This patch set introduces a new BPF LSM hook to validate firmware commands
> > triggered by userspace before they are submitted to the device. The hook
> > runs after the command buffer is constructed, right before it is sent
> > to firmware.
> 
> <...>
> 
> > ---
> > Chiara Meiohas (4):
> >       bpf: add firmware command validation hook
> >       selftests/bpf: add test cases for fw_validate_cmd hook
> >       RDMA/mlx5: Externally validate FW commands supplied in DEVX interface
> >       fwctl/mlx5: Externally validate FW commands supplied in fwctl
> 
> Hi,
> 
> Can we get Ack from BPF/LSM side?

+ Paul, linux-security-module ML

Hi,

You probably also want to get an Ack from the LSM maintainer (added in
CC, along with the list). Most likely, he will also ask you to create
security_*() counterparts of the BPF hooks.

Roberto

> Thanks
> 
> > 
> >  drivers/fwctl/mlx5/main.c                        | 12 +++++-
> >  drivers/infiniband/hw/mlx5/devx.c                | 49 ++++++++++++++++++------
> >  include/linux/bpf_lsm.h                          | 41 ++++++++++++++++++++
> >  kernel/bpf/bpf_lsm.c                             | 11 ++++++
> >  tools/testing/selftests/bpf/progs/verifier_lsm.c | 23 +++++++++++
> >  5 files changed, 122 insertions(+), 14 deletions(-)
> > ---
> > base-commit: 11439c4635edd669ae435eec308f4ab8a0804808
> > change-id: 20260309-fw-lsm-hook-7c094f909ffc
> > 
> > Best regards,
> > --  
> > Leon Romanovsky <leonro@nvidia.com>
> > 



* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-09 12:27   ` Roberto Sassu
@ 2026-04-09 12:45     ` Leon Romanovsky
  2026-04-09 21:04       ` Paul Moore
  0 siblings, 1 reply; 81+ messages in thread
From: Leon Romanovsky @ 2026-04-09 12:45 UTC (permalink / raw)
  To: Roberto Sassu
  Cc: KP Singh, Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Jason Gunthorpe, Saeed Mahameed,
	Itay Avraham, Dave Jiang, Jonathan Cameron, bpf, linux-kernel,
	linux-kselftest, linux-rdma, Chiara Meiohas, Maher Sanalla, paul,
	linux-security-module

On Thu, Apr 09, 2026 at 02:27:43PM +0200, Roberto Sassu wrote:
> On Thu, 2026-04-09 at 15:12 +0300, Leon Romanovsky wrote:
> > On Tue, Mar 31, 2026 at 08:56:32AM +0300, Leon Romanovsky wrote:
> > > From Chiara:
> > > 
> > > This patch set introduces a new BPF LSM hook to validate firmware commands
> > > triggered by userspace before they are submitted to the device. The hook
> > > runs after the command buffer is constructed, right before it is sent
> > > to firmware.
> > 
> > <...>
> > 
> > > ---
> > > Chiara Meiohas (4):
> > >       bpf: add firmware command validation hook
> > >       selftests/bpf: add test cases for fw_validate_cmd hook
> > >       RDMA/mlx5: Externally validate FW commands supplied in DEVX interface
> > >       fwctl/mlx5: Externally validate FW commands supplied in fwctl
> > 
> > Hi,
> > 
> > Can we get Ack from BPF/LSM side?
> 
> + Paul, linux-security-module ML
> 
> Hi
> 
> probably you also want to get an Ack from the LSM maintainer (added in
> CC with the list). Most likely, he will also ask you to create the
> security_*() functions counterparts of the BPF hooks.

We implemented this approach in v1:
https://patch.msgid.link/20260309-fw-lsm-hook-v1-0-4a6422e63725@nvidia.com
and were advised to pursue a different direction.

Thanks

> 
> Roberto
> 
> > Thanks
> > 
> > > 
> > >  drivers/fwctl/mlx5/main.c                        | 12 +++++-
> > >  drivers/infiniband/hw/mlx5/devx.c                | 49 ++++++++++++++++++------
> > >  include/linux/bpf_lsm.h                          | 41 ++++++++++++++++++++
> > >  kernel/bpf/bpf_lsm.c                             | 11 ++++++
> > >  tools/testing/selftests/bpf/progs/verifier_lsm.c | 23 +++++++++++
> > >  5 files changed, 122 insertions(+), 14 deletions(-)
> > > ---
> > > base-commit: 11439c4635edd669ae435eec308f4ab8a0804808
> > > change-id: 20260309-fw-lsm-hook-7c094f909ffc
> > > 
> > > Best regards,
> > > --  
> > > Leon Romanovsky <leonro@nvidia.com>
> > > 
> 
> 


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-09 12:45     ` Leon Romanovsky
@ 2026-04-09 21:04       ` Paul Moore
  2026-04-12  9:00         ` Leon Romanovsky
  0 siblings, 1 reply; 81+ messages in thread
From: Paul Moore @ 2026-04-09 21:04 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Roberto Sassu, KP Singh, Matt Bobrowski, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Jason Gunthorpe, Saeed Mahameed, Itay Avraham, Dave Jiang,
	Jonathan Cameron, bpf, linux-kernel, linux-kselftest, linux-rdma,
	Chiara Meiohas, Maher Sanalla, linux-security-module

On Thu, Apr 9, 2026 at 8:45 AM Leon Romanovsky <leon@kernel.org> wrote:
> On Thu, Apr 09, 2026 at 02:27:43PM +0200, Roberto Sassu wrote:
> > On Thu, 2026-04-09 at 15:12 +0300, Leon Romanovsky wrote:
> > > On Tue, Mar 31, 2026 at 08:56:32AM +0300, Leon Romanovsky wrote:
> > > > From Chiara:
> > > >
> > > > This patch set introduces a new BPF LSM hook to validate firmware commands
> > > > triggered by userspace before they are submitted to the device. The hook
> > > > runs after the command buffer is constructed, right before it is sent
> > > > to firmware.
> > >
> > > <...>
> > >
> > > > ---
> > > > Chiara Meiohas (4):
> > > >       bpf: add firmware command validation hook
> > > >       selftests/bpf: add test cases for fw_validate_cmd hook
> > > >       RDMA/mlx5: Externally validate FW commands supplied in DEVX interface
> > > >       fwctl/mlx5: Externally validate FW commands supplied in fwctl
> > >
> > > Hi,
> > >
> > > Can we get Ack from BPF/LSM side?
> >
> > + Paul, linux-security-module ML
> >
> > Hi
> >
> > probably you also want to get an Ack from the LSM maintainer (added in
> > CC with the list). Most likely, he will also ask you to create the
> > security_*() functions counterparts of the BPF hooks.
>
> We implemented this approach in v1:
> https://patch.msgid.link/20260309-fw-lsm-hook-v1-0-4a6422e63725@nvidia.com
> and were advised to pursue a different direction.

I'm assuming you are referring to my comments?  If so, that isn't
exactly what I said, I mentioned at least one other option besides
going directly to BPF.  Ultimately, it is your choice to decide how
you want to proceed, but to claim I advised you to avoid a LSM based
solution isn't strictly correct.

Regardless, looking at your v2 patchset, it looks like you've taken an
unusual approach of using some of the LSM mechanisms, e.g. LSM_HOOK(),
but not actually exposing a LSM hook with proper callbacks.
Unfortunately, that's not something we want to support.  If you want
to pursue an LSM based solution, complete with a security_XXX() hook,
use of LSM_HOOK() macros, etc. then that's fine, I'm happy to work
with you on that.  However, if you've decided that your preferred
option is to create a BPF hook you should avoid using things like
LSM_HOOK() and locating your hook/code in bpf_lsm.c.

The good news is that there are plenty of other examples of BPF
pluggable code that you could use as an example, one such thing is the
update_socket_protocol() BPF hook that was originally proposed as a
LSM hook, but moved to a dedicated BPF hook as we generally want to
avoid changing non-LSM kernel objects within the scope of the LSMs.
While your proposed case is slightly different, I think the basic idea
and mechanism should still be useful.

https://lore.kernel.org/all/cover.1692147782.git.geliang.tang@suse.com

-- 
paul-moore.com


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-09 21:04       ` Paul Moore
@ 2026-04-12  9:00         ` Leon Romanovsky
  2026-04-13  1:38           ` Paul Moore
  0 siblings, 1 reply; 81+ messages in thread
From: Leon Romanovsky @ 2026-04-12  9:00 UTC (permalink / raw)
  To: Paul Moore
  Cc: Roberto Sassu, KP Singh, Matt Bobrowski, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Jason Gunthorpe, Saeed Mahameed, Itay Avraham, Dave Jiang,
	Jonathan Cameron, bpf, linux-kernel, linux-kselftest, linux-rdma,
	Chiara Meiohas, Maher Sanalla, linux-security-module

On Thu, Apr 09, 2026 at 05:04:24PM -0400, Paul Moore wrote:
> On Thu, Apr 9, 2026 at 8:45 AM Leon Romanovsky <leon@kernel.org> wrote:
> > On Thu, Apr 09, 2026 at 02:27:43PM +0200, Roberto Sassu wrote:
> > > On Thu, 2026-04-09 at 15:12 +0300, Leon Romanovsky wrote:
> > > > On Tue, Mar 31, 2026 at 08:56:32AM +0300, Leon Romanovsky wrote:
> > > > > From Chiara:
> > > > >
> > > > > This patch set introduces a new BPF LSM hook to validate firmware commands
> > > > > triggered by userspace before they are submitted to the device. The hook
> > > > > runs after the command buffer is constructed, right before it is sent
> > > > > to firmware.
> > > >
> > > > <...>
> > > >
> > > > > ---
> > > > > Chiara Meiohas (4):
> > > > >       bpf: add firmware command validation hook
> > > > >       selftests/bpf: add test cases for fw_validate_cmd hook
> > > > >       RDMA/mlx5: Externally validate FW commands supplied in DEVX interface
> > > > >       fwctl/mlx5: Externally validate FW commands supplied in fwctl
> > > >
> > > > Hi,
> > > >
> > > > Can we get Ack from BPF/LSM side?
> > >
> > > + Paul, linux-security-module ML
> > >
> > > Hi
> > >
> > > probably you also want to get an Ack from the LSM maintainer (added in
> > > CC with the list). Most likely, he will also ask you to create the
> > > security_*() functions counterparts of the BPF hooks.
> >
> > We implemented this approach in v1:
> > https://patch.msgid.link/20260309-fw-lsm-hook-v1-0-4a6422e63725@nvidia.com
> > and were advised to pursue a different direction.
> 
> I'm assuming you are referring to my comments? If so, that isn't exactly what I said,
> I mentioned at least one other option besides
> going directly to BPF.  Ultimately, it is your choice to decide how
> you want to proceed, but to claim I advised you to avoid a LSM based
> solution isn't strictly correct.

Yes, this matches how we understood your comments:  
https://lore.kernel.org/all/20260311081955.GS12611@unreal/

In the end, the goal is to build something practical and avoid adding
unnecessary complexity that brings no real benefit to users.

> 
> Regardless, looking at your v2 patchset, it looks like you've taken an
> unusual approach of using some of the LSM mechanisms, e.g. LSM_HOOK(),
> but not actually exposing a LSM hook with proper callbacks.
> Unfortunately, that's not something we want to support.  If you want
> to pursue an LSM based solution, complete with a security_XXX() hook,
> use of LSM_HOOK() macros, etc. then that's fine, I'm happy to work
> with you on that.

The issue is that the sentence below was the reason we did not merge v1 with v2:
https://github.com/LinuxSecurityModule/kernel/blob/main/README.md#new-lsm-hooks
"pass through implementations, such as the BPF LSM, are not eligible for
LSM hook reference implementations."


> However, if you've decided that your preferred
> option is to create a BPF hook you should avoid using things like
> LSM_HOOK() and locating your hook/code in bpf_lsm.c.

We are not limited to an LSM solution; the goal is to intercept commands
which are submitted to the FW, and the "security" bucket sounded right to us.

> 
> The good news is that there are plenty of other examples of BPF
> pluggable code that you could use as an example, one such thing is the
> update_socket_protocol() BPF hook that was originally proposed as a
> LSM hook, but moved to a dedicated BPF hook as we generally want to
> avoid changing non-LSM kernel objects within the scope of the LSMs.
> While your proposed case is slightly different, I think the basic idea
> and mechanism should still be useful.
> 
> https://lore.kernel.org/all/cover.1692147782.git.geliang.tang@suse.com

Thanks

> 
> -- 
> paul-moore.com
> 


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-12  9:00         ` Leon Romanovsky
@ 2026-04-13  1:38           ` Paul Moore
  2026-04-13 15:53             ` Leon Romanovsky
  2026-04-13 16:42             ` Jason Gunthorpe
  0 siblings, 2 replies; 81+ messages in thread
From: Paul Moore @ 2026-04-13  1:38 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Roberto Sassu, KP Singh, Matt Bobrowski, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Jason Gunthorpe, Saeed Mahameed, Itay Avraham, Dave Jiang,
	Jonathan Cameron, bpf, linux-kernel, linux-kselftest, linux-rdma,
	Chiara Meiohas, Maher Sanalla, linux-security-module

On Sun, Apr 12, 2026 at 5:00 AM Leon Romanovsky <leon@kernel.org> wrote:
> On Thu, Apr 09, 2026 at 05:04:24PM -0400, Paul Moore wrote:
> > On Thu, Apr 9, 2026 at 8:45 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > On Thu, Apr 09, 2026 at 02:27:43PM +0200, Roberto Sassu wrote:
> > > > On Thu, 2026-04-09 at 15:12 +0300, Leon Romanovsky wrote:
> > > > > On Tue, Mar 31, 2026 at 08:56:32AM +0300, Leon Romanovsky wrote:

...

> > > We implemented this approach in v1:
> > > https://patch.msgid.link/20260309-fw-lsm-hook-v1-0-4a6422e63725@nvidia.com
> > > and were advised to pursue a different direction.
> >
> > I'm assuming you are referring to my comments? If so, that isn't exactly what I said,
> > I mentioned at least one other option besides
> > going directly to BPF.  Ultimately, it is your choice to decide how
> > you want to proceed, but to claim I advised you to avoid a LSM based
> > solution isn't strictly correct.
>
> Yes, this matches how we understood your comments:
> https://lore.kernel.org/all/20260311081955.GS12611@unreal/
>
> In the end, the goal is to build something practical and avoid adding
> unnecessary complexity that brings no real benefit to users.
>
> > Regardless, looking at your v2 patchset, it looks like you've taken an
> > unusual approach of using some of the LSM mechanisms, e.g. LSM_HOOK(),
> > but not actually exposing a LSM hook with proper callbacks.
> > Unfortunately, that's not something we want to support.  If you want
> > to pursue an LSM based solution, complete with a security_XXX() hook,
> > use of LSM_HOOK() macros, etc. then that's fine, I'm happy to work
> > with you on that.
>
> The issue is that the sentence below was the reason we did not merge v1 with v2:
> https://github.com/LinuxSecurityModule/kernel/blob/main/README.md#new-lsm-hooks
> "pass through implementations, such as the BPF LSM, are not eligible for
> LSM hook reference implementations."

I can expand on that in a minute, but I'd like to return to your use
of the LSM_HOOK() macro and locating the hook within the BPF LSM as
that is the most concerning issue from my perspective.  One should
only use the LSM_HOOK() macro and locate code within bpf_lsm.c if that
code is part of the BPF LSM, utilizing an LSM hook.  Since this
patchset doesn't use an LSM hook or any part of the LSM framework, the
implementation choices seem odd and are not something we want to
support.  As mentioned in my prior reply, you could do something very
similar through the use of a normal BPF hook similar to what was done
with the update_socket_protocol() BPF hook.

There are multiple reasons why out-of-tree and pass through LSMs are
not considered eligible for reference implementations of LSM hooks.  I
think what is most relevant to this patchset is that an out-of-tree hook
implementation doesn't necessarily require a stable interface, and
without a stable interface it is impossible to make a generic API at
the LSM framework layer.  As you mentioned previously, each vendor and
each firmware version brings the possibility of a new
format/interface, and while that may not be a problem for out-of-tree
code which is left to the user/admin to manage, it makes upstream
development difficult.  I did mention at least one approach that might
be a possibility to enable upstream, in-tree support of this, but you
seem to prefer a BPF approach that doesn't require a well defined
format.

> > However, if you've decided that your preferred
> > option is to create a BPF hook you should avoid using things like
> > LSM_HOOK() and locating your hook/code in bpf_lsm.c.
>
> We are not limited to an LSM solution; the goal is to intercept commands
> which are submitted to the FW, and the "security" bucket sounded right to us.

Yes, it does sound "security relevant", but without a well defined
interface/format it is going to be difficult to write a generic LSM to
have any level of granularity beyond a basic "load firmware"
permission.

> > The good news is that there are plenty of other examples of BPF
> > pluggable code that you could use as an example, one such thing is the
> > update_socket_protocol() BPF hook that was originally proposed as a
> > LSM hook, but moved to a dedicated BPF hook as we generally want to
> > avoid changing non-LSM kernel objects within the scope of the LSMs.
> > While your proposed case is slightly different, I think the basic idea
> > and mechanism should still be useful.
> >
> > https://lore.kernel.org/all/cover.1692147782.git.geliang.tang@suse.com
>
> Thanks

Good luck on whatever you choose, and while I'm guessing it is
unlikely, if you do decide to pursue a LSM based solution please let
us know and we can work with you to try and find ways to make it work.

-- 
paul-moore.com


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-13  1:38           ` Paul Moore
@ 2026-04-13 15:53             ` Leon Romanovsky
  2026-04-13 16:42             ` Jason Gunthorpe
  1 sibling, 0 replies; 81+ messages in thread
From: Leon Romanovsky @ 2026-04-13 15:53 UTC (permalink / raw)
  To: Paul Moore
  Cc: Roberto Sassu, KP Singh, Matt Bobrowski, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Jason Gunthorpe, Saeed Mahameed, Itay Avraham, Dave Jiang,
	Jonathan Cameron, bpf, linux-kernel, linux-kselftest, linux-rdma,
	Chiara Meiohas, Maher Sanalla, linux-security-module

On Sun, Apr 12, 2026 at 09:38:35PM -0400, Paul Moore wrote:
> On Sun, Apr 12, 2026 at 5:00 AM Leon Romanovsky <leon@kernel.org> wrote:
> > On Thu, Apr 09, 2026 at 05:04:24PM -0400, Paul Moore wrote:
> > > On Thu, Apr 9, 2026 at 8:45 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > On Thu, Apr 09, 2026 at 02:27:43PM +0200, Roberto Sassu wrote:
> > > > > On Thu, 2026-04-09 at 15:12 +0300, Leon Romanovsky wrote:
> > > > > > On Tue, Mar 31, 2026 at 08:56:32AM +0300, Leon Romanovsky wrote:
> 
> ...
> 
> > > > We implemented this approach in v1:
> > > > https://patch.msgid.link/20260309-fw-lsm-hook-v1-0-4a6422e63725@nvidia.com
> > > > and were advised to pursue a different direction.
> > >
> > > I'm assuming you are referring to my comments? If so, that isn't exactly what I said,
> > > I mentioned at least one other option besides
> > > going directly to BPF.  Ultimately, it is your choice to decide how
> > > you want to proceed, but to claim I advised you to avoid a LSM based
> > > solution isn't strictly correct.
> >
> > Yes, this matches how we understood your comments:
> > https://lore.kernel.org/all/20260311081955.GS12611@unreal/
> >
> > In the end, the goal is to build something practical and avoid adding
> > unnecessary complexity that brings no real benefit to users.
> >
> > > Regardless, looking at your v2 patchset, it looks like you've taken an
> > > unusual approach of using some of the LSM mechanisms, e.g. LSM_HOOK(),
> > > but not actually exposing a LSM hook with proper callbacks.
> > > Unfortunately, that's not something we want to support.  If you want
> > > to pursue an LSM based solution, complete with a security_XXX() hook,
> > > use of LSM_HOOK() macros, etc. then that's fine, I'm happy to work
> > > with you on that.
> >
> > The issue is that the sentence below was the reason we did not merge v1 with v2:
> > https://github.com/LinuxSecurityModule/kernel/blob/main/README.md#new-lsm-hooks
> > "pass through implementations, such as the BPF LSM, are not eligible for
> > LSM hook reference implementations."
> 
> I can expand on that in a minute, but I'd like to return to your use
> of the LSM_HOOK() macro and locating the hook within the BPF LSM as
> that is the most concerning issue from my perspective.  One should
> only use the LSM_HOOK() macro and locate code within bpf_lsm.c if that
> code is part of the BPF LSM, utilizing an LSM hook.  Since this
> patchset doesn't use an LSM hook or any part of the LSM framework, the
> implementation choices seem odd and are not something we want to
> support.  As mentioned in my prior reply, you could do something very
> similar through the use of a normal BPF hook similar to what was done
> with the update_socket_protocol() BPF hook.
> 
> There are multiple reasons why out-of-tree and pass through LSMs are
> not considered eligible for reference implementations of LSM hooks.  I
> think what is most relevant to this patchset is that an out-of-tree hook
> implementation doesn't necessarily require a stable interface, and
> without a stable interface it is impossible to make a generic API at
> the LSM framework layer.  As you mentioned previously, each vendor and
> each firmware version brings the possibility of a new
> format/interface, and while that may not be a problem for out-of-tree
> code which is left to the user/admin to manage, it makes upstream
> development difficult.  I did mention at least one approach that might
> be a possibility to enable upstream, in-tree support of this, but you
> seem to prefer a BPF approach that doesn't require a well defined
> format.
> 
> > > However, if you've decided that your preferred
> > > option is to create a BPF hook you should avoid using things like
> > > LSM_HOOK() and locating your hook/code in bpf_lsm.c.
> >
> > We are not limited to an LSM solution; the goal is to intercept commands
> > which are submitted to the FW, and the "security" bucket sounded right to us.
> 
> Yes, it does sound "security relevant", but without a well defined
> interface/format it is going to be difficult to write a generic LSM to
> have any level of granularity beyond a basic "load firmware"
> permission.
> 
> > > The good news is that there are plenty of other examples of BPF
> > > pluggable code that you could use as an example, one such thing is the
> > > update_socket_protocol() BPF hook that was originally proposed as a
> > > LSM hook, but moved to a dedicated BPF hook as we generally want to
> > > avoid changing non-LSM kernel objects within the scope of the LSMs.
> > > While your proposed case is slightly different, I think the basic idea
> > > and mechanism should still be useful.
> > >
> > > https://lore.kernel.org/all/cover.1692147782.git.geliang.tang@suse.com
> >
> > Thanks
> 
> Good luck on whatever you choose, and while I'm guessing it is
> unlikely, if you do decide to pursue a LSM based solution please let
> us know and we can work with you to try and find ways to make it work.

Thanks a lot. We should know which direction we'll take in a week or two,
once Chiara wraps up her internal commitments and returns to this series.

I appreciate your help.

Thanks

> 
> -- 
> paul-moore.com
> 


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-13  1:38           ` Paul Moore
  2026-04-13 15:53             ` Leon Romanovsky
@ 2026-04-13 16:42             ` Jason Gunthorpe
  2026-04-13 17:36               ` Casey Schaufler
  2026-04-13 22:36               ` Paul Moore
  1 sibling, 2 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-13 16:42 UTC (permalink / raw)
  To: Paul Moore
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Sun, Apr 12, 2026 at 09:38:35PM -0400, Paul Moore wrote:
> > We are not limited to an LSM solution; the goal is to intercept commands
> > which are submitted to the FW, and the "security" bucket sounded right to us.
> 
> Yes, it does sound "security relevant", but without a well defined
> interface/format it is going to be difficult to write a generic LSM to
> have any level of granularity beyond a basic "load firmware"
> permission.

I think to step back a bit, what this is trying to achieve is very
similar to the iptables fwmark/secmark scheme.

secmark allows the user to specify programmable rules via iptables
which results in each packet being tagged with a SELinux context and
then the userspace policy can consume that and make a security decision
based on that.

Google is showing me examples of this to permit only certain processes
to use certain network addresses.

So this is exactly the same high level idea. The transport of the
packet is different (firmware cmd vs network) but otherwise it is all
the same basic problem. We need a user programmable classifier like
iptables. Once classified we want this to work with more than SELinux
only, we have a particular interest in the eBPF LSM. In any case the
userspace should be able to specify the security policy that applies
to the kernel classified data.
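
A minimal userspace sketch of that two-stage split, under stated
assumptions: every name below (fw_cmd, fw_rule, fw_classify,
fw_policy_check, the FW_LABEL_* values) is hypothetical and none of it
is taken from the patchset or from any kernel API. The point is only
that the programmable classifier is the one piece that looks at the
command, and the policy stage decides on the resulting label alone.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels a classifier could attach to a firmware command. */
enum fw_label {
	FW_LABEL_UNKNOWN = 0,
	FW_LABEL_QUERY,
	FW_LABEL_CONFIG,
};

/* Hypothetical command header; only the opcode is inspected here. */
struct fw_cmd {
	uint16_t opcode;
};

/* One user-supplied rule: an inclusive opcode range mapped to a label. */
struct fw_rule {
	uint16_t first;
	uint16_t last;
	enum fw_label label;
};

/* Stage 1: classify. The first matching rule wins, as in a rule chain. */
static enum fw_label fw_classify(const struct fw_rule *rules, size_t nr_rules,
				 const struct fw_cmd *cmd)
{
	assert(rules != NULL && cmd != NULL);

	for (size_t i = 0; i < nr_rules; i++) {
		if (cmd->opcode >= rules[i].first && cmd->opcode <= rules[i].last)
			return rules[i].label;
	}
	return FW_LABEL_UNKNOWN;
}

/* Stage 2: policy. It sees the label, never the vendor-specific bytes. */
static int fw_policy_check(uint32_t allowed_labels, enum fw_label label)
{
	return (allowed_labels & (UINT32_C(1) << label)) ? 0 : -EPERM;
}
```

In this model a caller would run fw_classify() once per command and pass
the result to fw_policy_check() with a per-process mask of permitted
labels; the rule table plays the role the iptables rules play for secmark.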

Following the fwmark example, if there was some programmable in-kernel
function to convert the cmd into a SELinux label would we be able to
enable SELinux following the SECMARK design?

Would there be an objection if that in-kernel function was using a
system-wide eBPF uploaded with some fwctl uAPI?

Finally, would there be an objection to enabling the same function in
eBPF by feeding it the entire command and have it classify and make a
security decision in a single eBPF program? Is there some other way to
enable eBPF? I see eBPF doesn't interwork with SECMARK today so there
isn't a ready example?

Jason


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-13 16:42             ` Jason Gunthorpe
@ 2026-04-13 17:36               ` Casey Schaufler
  2026-04-13 19:09                 ` Casey Schaufler
  2026-04-13 22:36               ` Paul Moore
  1 sibling, 1 reply; 81+ messages in thread
From: Casey Schaufler @ 2026-04-13 17:36 UTC (permalink / raw)
  To: Jason Gunthorpe, Paul Moore
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module, Casey Schaufler

On 4/13/2026 9:42 AM, Jason Gunthorpe wrote:
> On Sun, Apr 12, 2026 at 09:38:35PM -0400, Paul Moore wrote:
>>> We are not limited to an LSM solution; the goal is to intercept commands
>>> which are submitted to the FW, and the "security" bucket sounded right to us.
>> Yes, it does sound "security relevant", but without a well defined
>> interface/format it is going to be difficult to write a generic LSM to
>> have any level of granularity beyond a basic "load firmware"
>> permission.
> I think to step back a bit, what this is trying to achieve is very
> similar to the iptables fwmark/secmark scheme.
>
> secmark allows the user to specify programmable rules via iptables
> which results in each packet being tagged with a SELinux context and
> then the userspace policy can consume that and make a security decision
> based on that.

If you want to pursue something like this DO NOT USE A u32 TO REPRESENT
THE SECURITY CONTEXT! Use a struct lsm_context pointer. The limitations
imposed by a "secid" don't show up in SELinux, which introduced them, but
they sure do in Smack, and they really gum up the works for general LSM
stacking.



* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-13 17:36               ` Casey Schaufler
@ 2026-04-13 19:09                 ` Casey Schaufler
  0 siblings, 0 replies; 81+ messages in thread
From: Casey Schaufler @ 2026-04-13 19:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Paul Moore
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module, Casey Schaufler

On 4/13/2026 10:36 AM, Casey Schaufler wrote:
> On 4/13/2026 9:42 AM, Jason Gunthorpe wrote:
>> On Sun, Apr 12, 2026 at 09:38:35PM -0400, Paul Moore wrote:
>>>> We are not limited to an LSM solution; the goal is to intercept commands
>>>> which are submitted to the FW, and the "security" bucket sounded right to us.
>>> Yes, it does sound "security relevant", but without a well defined
>>> interface/format it is going to be difficult to write a generic LSM to
>>> have any level of granularity beyond a basic "load firmware"
>>> permission.
>> I think to step back a bit, what this is trying to achieve is very
>> similar to the iptables fwmark/secmark scheme.
>>
>> secmark allows the user to specify programmable rules via iptables
>> which results in each packet being tagged with a SELinux context and
>> then the userspace policy can consume that and make a security decision
>> based on that.
> If you want to pursue something like this DO NOT USE A u32 TO REPRESENT
> THE SECURITY CONTEXT! Use a struct lsm_context pointer. The limitations
> imposed by a "secid" don't show up in SELinux, which introduced them, but
> they sure do in Smack, and they really gum up the works for general LSM
> stacking.


Whoops. I meant a struct lsm_prop pointer. It must be Monday morning.
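
To illustrate the point about a single u32, here is a simplified
userspace model. The model_* types and helpers are stand-ins invented
for this sketch and do not reproduce the kernel's real struct lsm_prop
layout; the assumption is only that a per-LSM structure lets each
stacked LSM read its own data, where one shared u32 slot cannot carry
both a numeric ID and a Smack-style label at once.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for an LSM that identifies subjects by a numeric ID. */
struct model_selinux_prop {
	uint32_t secid;
};

/* Stand-in for an LSM that identifies subjects by a text label. */
struct model_smack_prop {
	const char *label;
};

/* Model of a per-LSM property bundle: one member per stacked LSM. */
struct model_lsm_prop {
	struct model_selinux_prop selinux;
	struct model_smack_prop smack;
};

/* Each model LSM consults only its own member of the bundle. */
static int model_selinux_allows(const struct model_lsm_prop *prop,
				uint32_t wanted_secid)
{
	assert(prop != NULL);
	return prop->selinux.secid == wanted_secid;
}

static int model_smack_allows(const struct model_lsm_prop *prop,
			      const char *wanted_label)
{
	assert(prop != NULL && wanted_label != NULL);
	return prop->smack.label != NULL &&
	       strcmp(prop->smack.label, wanted_label) == 0;
}
```

With a bare u32 there would be only one slot, so whichever LSM wrote it
last would win; passing a pointer to the bundle keeps both views intact.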



* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-13 16:42             ` Jason Gunthorpe
  2026-04-13 17:36               ` Casey Schaufler
@ 2026-04-13 22:36               ` Paul Moore
  2026-04-13 23:19                 ` Jason Gunthorpe
  1 sibling, 1 reply; 81+ messages in thread
From: Paul Moore @ 2026-04-13 22:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Mon, Apr 13, 2026 at 12:42 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Sun, Apr 12, 2026 at 09:38:35PM -0400, Paul Moore wrote:
> > > We are not limited to an LSM solution; the goal is to intercept commands
> > > which are submitted to the FW, and the "security" bucket sounded right to us.
> >
> > Yes, it does sound "security relevant", but without a well defined
> > interface/format it is going to be difficult to write a generic LSM to
> > have any level of granularity beyond a basic "load firmware"
> > permission.
>
> I think to step back a bit, what this is trying to achieve is very
> similar to the iptables fwmark/secmark scheme.

Points for thinking outside the box a bit, but from what I've seen
thus far, it differs from secmark in a few important areas.  The
secmark concept relies on the admin to configure the network stack to
apply secmark labels to network traffic as it flows through the
system, the LSM then applies security policy to these packets based on
their label.  The firmware LSM hooks, at least as I currently
understand them, rely on the LSM hook callback to parse the firmware
op/request and apply a security policy to the request.

We've already talked about the first issue, parsing the request, and
my suggestion was to make the LSM hook call from within the firmware
(the firmware must have some way to call into the kernel/driver code,
no?) so that only the firmware would need to parse the request.  If we
wanted to adopt a secmark-esque approach, one could develop a second
parsing mechanism that would be responsible for assigning a LSM label
to the request, and then pass the firmware request to the LSM, but I
do worry a bit about the added complexity associated with keeping the
parser sync'd with the driver/fw.

However, even if we solve the parsing problem, I worry we have
another, closely related issue, of having to categorize all of the
past, present, and future firmware requests into a set of LSM specific
actions.  The past and present requests are just a matter of code,
that isn't too worrying, but what do we do about unknown future
requests?  Beyond simply encoding the request into an operation
token/enum/int, what additional information beyond the action type
would a LSM need to know to apply a security policy?  Would it be
reasonable to blindly allow or reject unknown requests?  If so, what
would break?

> ... Once classified we want this to work with more than SELinux
> only, we have a particular interest in the eBPF LSM.

One of the design requirements for the LSM framework is that it
provides an abstraction layer between the kernel and the underlying
security mechanisms implemented by the various LSMs.  Some operations
will always fall outside the scope of individual LSMs due to their
nature, but as a general rule we try to ensure that LSM hooks are
useful across multiple LSMs.

> Following the fwmark example, if there was some programmable in-kernel
> function to convert the cmd into a SELinux label would we be able to
> enable SELinux following the SECMARK design?

As Casey already mentioned, any sort of classifier would need to be
able to support multiple LSMs.  The only example that comes to mind at
the moment is the NetLabel mechanism which translates between
on-the-wire CIPSO and CALIPSO labels and multiple LSMs (Smack and
SELinux currently).

> Would there be an objection if that in-kernel function was using a
> system-wide eBPF uploaded with some fwctl uAPI?

We'd obviously need to see patches, but there is precedent in
separating labeling from enforcement.  We've discussed SecMark and
NetLabel in the networking space, but technically, the VFS/filesystem
xattr implementations could also be considered as a labeling mechanism
outside of the LSM.

> Finally, would there be an objection to enabling the same function in
> eBPF by feeding it the entire command and have it classify and make a
> security decision in a single eBPF program?

Keeping in mind that from an LSM perspective we need to support
multiple implementations, both in terms of language mechanics (eBPF,
Rust, C) and security philosophies (Smack, SELinux, AppArmor, etc.),
so it would be very unlikely that we would want a specific shortcut or
mechanism that would only work for one language or philosophy.

> Is there some other way to enable eBPF?

If one develops a workable LSM hook then I see no reason why one
couldn't write a BPF LSM to use that hook; that's what we do today.

> I see eBPF doesn't interwork with SECMARK today so there isn't a ready example?

I'm not aware of anyone ever doing the work to try/enable secmark with
BPF, but I see no reason why someone couldn't work on that.  Just make
sure to take into account Casey's comments about multiple LSM support;
any new LSM interfaces will need to support multiple simultaneous LSMs
(the original secmark work predated that).

However, it seems like direct reuse of secmark isn't what is needed,
or even wanted, you were just using that as an example of separating
labeling from enforcement, yes?

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-13 22:36               ` Paul Moore
@ 2026-04-13 23:19                 ` Jason Gunthorpe
  2026-04-14 17:05                   ` Casey Schaufler
  2026-04-14 20:27                   ` Paul Moore
  0 siblings, 2 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-13 23:19 UTC (permalink / raw)
  To: Paul Moore
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Mon, Apr 13, 2026 at 06:36:06PM -0400, Paul Moore wrote:
> On Mon, Apr 13, 2026 at 12:42 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Sun, Apr 12, 2026 at 09:38:35PM -0400, Paul Moore wrote:
> > > > We are not limited to LSM solution, the goal is to intercept commands
> > > > which are submitted to the FW and "security" bucket sounded right to us.
> > >
> > > Yes, it does sound "security relevant", but without a well defined
> > > interface/format it is going to be difficult to write a generic LSM to
> > > have any level of granularity beyond a basic "load firmware"
> > > permission.
> >
> > I think to step back a bit, what this is trying to achieve is very
> > similar to the iptables fwmark/secmark scheme.
> 
> Points for thinking outside the box a bit, but from what I've seen
> thus far, it differs from secmark in a few important areas.  The
> secmark concept relies on the admin to configure the network stack to
> apply secmark labels to network traffic as it flows through the
> system, the LSM then applies security policy to these packets based on
> their label.  The firmware LSM hooks, at least as I currently
> understand them, rely on the LSM hook callback to parse the firmware
> op/request and apply a security policy to the request.

That was what was proposed because the idea was to combine the
parse/classification/decision steps into one eBPF program, but I think
it can be split up too.

> We've already talked about the first issue, parsing the request, and
> my suggestion was to make the LSM hook call from within the firmware
> (the firmware must have some way to call into the kernel/driver code,
> no?)

No, that's not workable on so many levels. It is sort of analogous to
asking the NIC to call the LSM to apply the secmark while sending the
packet.

The proper flow has the kernel evaluate the packet/command *before* it
delivers it to HW, not after.

> so that only the firmware would need to parse the request.  If we
> wanted to adopt a secmark-esque approach, one could develop a second
> parsing mechanism that would be responsible for assigning a LSM label
> to the request, and then pass the firmware request to the LSM, but I
> do worry a bit about the added complexity associated with keeping the
> parser sync'd with the driver/fw.

In practice it would be like iptables, the parser would be entirely
programmed by userspace and there is nothing to keep in sync.

> However, even if we solve the parsing problem, I worry we have
> another, closely related issue, of having to categorize all of the
> past, present, and future firmware requests into a set of LSM specific
> actions.  

Why? secmark doesn't have this issue? The classifier would return the
same kind of information as secmark, some user provided label that is
delivered to the LSM policy side.

When I talk about a classifier I mean a user programmable classifier
like iptables. secmark obviously doesn't raise future looking
questions (like what if there is httpv3?) nor should this.

> The past and present requests are just a matter of code,
> that isn't too worrying, but what do we do about unknown future
> requests?  Beyond simply encoding the request into an operation
> token/enum/int, what additional information beyond the action type
> would a LSM need to know to apply a security policy?  Would it be
> reasonable to blindly allow or reject unknown requests?  If so, what
> would break?

I am proposing something like SECMARK.

Eg from Google:

iptables -t mangle -A INPUT -p tcp --dport 80 -j SECMARK --selctx system_u:object_r:httpd_packet_t:s0

Which is 'match a packet to see that byte offset XX is 0080 and if so
tag it with the thing this string describes'

So I'm imagining the same kind of flexibility. User provides the
matching and whatever those strings are. The classifier step is fully
flexible. No worry about future stuff.
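
A minimal sketch of that kind of user-programmed classifier, assuming a
hypothetical rule format (match some bytes at an offset, return an opaque
label string). The names fw_rule and fw_classify are illustrative only and
are not part of fwctl, iptables, or any existing kernel API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule: match `len` bytes at `offset`, tag with `label`. */
struct fw_rule {
	size_t offset;
	size_t len;
	const unsigned char *match;
	const char *label;	/* opaque string handed to the policy side */
};

/*
 * Return the label of the first rule that matches the command buffer,
 * or `fallback` if no rule matches. Rules come entirely from the user,
 * so the classifier itself knows nothing about specific commands.
 */
const char *fw_classify(const struct fw_rule *rules, size_t nrules,
			const unsigned char *cmd, size_t cmd_len,
			const char *fallback)
{
	assert(rules != NULL || nrules == 0);

	for (size_t i = 0; i < nrules; i++) {
		const struct fw_rule *r = &rules[i];

		/* Skip rules that would read past the end of the command. */
		if (r->len > cmd_len || r->offset > cmd_len - r->len)
			continue;
		if (memcmp(cmd + r->offset, r->match, r->len) == 0)
			return r->label;
	}
	return fallback;
}
```

What to do with the fallback label (allow or reject) would stay a policy
decision, the same as for unlabeled packets.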

I'm guessing when Casey talks about struct lsm_prop it is related to
the system_u string?

> > ... Once classified we want this to work with more than SELinux
> > only, we have a particular interest in the eBPF LSM.
> 
> One of the design requirements for the LSM framework is that it
> provides an abstraction layer between the kernel and the underlying
> security mechanisms implemented by the various LSMs.  Some operations
> will always fall outside the scope of individual LSMs due to their
> nature, but as a general rule we try to ensure that LSM hooks are
> useful across multiple LSMs.

I don't know much about SECMARK but Google is telling me it doesn't
work with anything but SELinux LSM? We'd just like to avoid this
pitfall and I guess that is why Casey brings up lsm_prop?

> > Following the fwmark example, if there was some programmable in-kernel
> > function to convert the cmd into a SELinux label would we be able to
> > enable SELinux following the SECMARK design?
> 
> As Casey already mentioned, any sort of classifier would need to be
> able to support multiple LSMs.  The only example that comes to mind at
> the moment is the NetLabel mechanism which translates between
> on-the-wire CIPSO and CALIPSO labels and multiple LSMs (Smack and
> SELinux currently).

Ok, I think they can look into that, it is a good lead

> > Would there be an objection if that in-kernel function was using a
> > system-wide eBPF uploaded with some fwctl uAPI?
> 
> We'd obviously need to see patches, but there is precedent in
> separating labeling from enforcement.  We've discussed SecMark and
> NetLabel in the networking space, but technically, the VFS/filesystem
> xattr implementations could also be considered as a labeling mechanism
> outside of the LSM.

Makes sense

> > Finally, would there be an objection to enabling the same function in
> > eBPF by feeding it the entire command and have it classify and make a
> > security decision in a single eBPF program?
> 
> Keeping in mind that from an LSM perspective we need to support
> multiple implementations, both in terms of language mechanics (eBPF,
> Rust, C) and security philosophies (Smack, SELinux, AppArmor, etc.),
> so it would be very unlikely that we would want a specific shortcut or
> mechanism that would only work for one language or philosophy.

Okay, it is good to understand the sensitivities

> > Is there some other way to enable eBPF?
> 
> If one develops a workable LSM hook then I see no reason why one
> couldn't write a BPF LSM to use that hook; that's what we do today.

I was thinking that too

> However, it seems like direct reuse of secmark isn't what is needed,
> or even wanted, you were just using that as an example of separating
> labeling from enforcement, yes?

Yes, and looking for a coding example to guide implementing it, and to
recast this discussion to something more concrete. It is very helpful
to think of this a lot like deep packet inspection and
classification.

That the packets are delivered to FW and execute commands is not
actually that important. IP packets are also delivered to remote CPUs
and execute commands there too <shrug>

At the end of the day the task is the same - deep packet inspection,
classification, labeling, policy decision, enforcement.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-13 23:19                 ` Jason Gunthorpe
@ 2026-04-14 17:05                   ` Casey Schaufler
  2026-04-14 19:09                     ` Paul Moore
  2026-04-14 20:27                   ` Paul Moore
  1 sibling, 1 reply; 81+ messages in thread
From: Casey Schaufler @ 2026-04-14 17:05 UTC (permalink / raw)
  To: Jason Gunthorpe, Paul Moore
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module, Casey Schaufler

On 4/13/2026 4:19 PM, Jason Gunthorpe wrote:
> On Mon, Apr 13, 2026 at 06:36:06PM -0400, Paul Moore wrote:
>> On Mon, Apr 13, 2026 at 12:42 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>> On Sun, Apr 12, 2026 at 09:38:35PM -0400, Paul Moore wrote:
>>>>> We are not limited to LSM solution, the goal is to intercept commands
>>>>> which are submitted to the FW and "security" bucket sounded right to us.
>>>> Yes, it does sound "security relevant", but without a well defined
>>>> interface/format it is going to be difficult to write a generic LSM to
>>>> have any level of granularity beyond a basic "load firmware"
>>>> permission.
>>> I think to step back a bit, what this is trying to achieve is very
>>> similar to the iptables fwmark/secmark scheme.
>> Points for thinking outside the box a bit, but from what I've seen
>> thus far, it differs from secmark in a few important areas.  The
>> secmark concept relies on the admin to configure the network stack to
>> apply secmark labels to network traffic as it flows through the
>> system, the LSM then applies security policy to these packets based on
>> their label.  The firmware LSM hooks, at least as I currently
>> understand them, rely on the LSM hook callback to parse the firmware
>> op/request and apply a security policy to the request.
> That was what was proposed because the idea was to combine the
> parse/classification/decision steps into one eBPF program, but I think
> it can be split up too.
>
>> We've already talked about the first issue, parsing the request, and
>> my suggestion was to make the LSM hook call from within the firmware
>> (the firmware must have some way to call into the kernel/driver code,
>> no?)
> No, that's not workable on so many levels. It is sort of analogous to
> asking the NIC to call the LSM to apply the secmark while sending the
> packet.
>
> The proper flow has the kernel evaluate the packet/command *before* it
> delivers it to HW, not after.
>
>> so that only the firmware would need to parse the request.  If we
>> wanted to adopt a secmark-esque approach, one could develop a second
>> parsing mechanism that would be responsible for assigning a LSM label
>> to the request, and then pass the firmware request to the LSM, but I
>> do worry a bit about the added complexity associated with keeping the
>> parser sync'd with the driver/fw.
> In practice it would be like iptables, the parser would be entirely
> programmed by userspace and there is nothing to keep in sync.
>
>> However, even if we solve the parsing problem, I worry we have
>> another, closely related issue, of having to categorize all of the
>> past, present, and future firmware requests into a set of LSM specific
>> actions.  
> Why? secmark doesn't have this issue? The classifier would return the
> same kind of information as secmark, some user provided label that is
> delivered to the LSM policy side.
>
> When I talk about a classifier I mean a user programmable classifier
> like iptables. secmark obviously doesn't raise future looking
> questions (like what if there is httpv3?) nor should this.

Secmark has already failed. As I mentioned before, you can't fit the
label information from more than one LSM in a u32. There's going to have
to be some performance degrading hash-magic invoked to make that happen,
and when I've looked into what it would take I was very unhappy.

>> The past and present requests are just a matter of code,
>> that isn't too worrying, but what do we do about unknown future
>> requests?  Beyond simply encoding the request into an operation
>> token/enum/int, what additional information beyond the action type
>> would a LSM need to know to apply a security policy?  Would it be
>> reasonable to blindly allow or reject unknown requests?  If so, what
>> would break?
> I am proposing something like SECMARK.
>
> Eg from Google:
>
> iptables -t mangle -A INPUT -p tcp --dport 80 -j SECMARK --selctx system_u:object_r:httpd_packet_t:s0
>
> Which is 'match a packet to see that byte offset XX is 0080 and if so
> tag it with the thing this string describes'
>
> So I'm imagining the same kind of flexibility. User provides the
> matching and whatever those strings are. The classifier step is fully
> flexible. No worry about future stuff.
>
> I'm guessing when Casey talks about struct lsm_prop it is related to
> the system_u string?

Yeah, that would be it. Let's say your system supports SELinux and AppArmor.
You'll need to be able to specify an SELinux context, an AppArmor context,
or both. Iptables can't do that because of the limitations of a secmark.
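
A rough illustration of the difference, assuming made-up names (demo_prop,
mark_set, prop_set, prop_get). It only mimics the idea of one slot per LSM
and is not the kernel's definition of struct lsm_prop:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical LSM identifiers, for illustration only. */
enum demo_lsm { DEMO_SELINUX, DEMO_APPARMOR, DEMO_LSM_COUNT };

/* One slot per LSM, in the spirit of lsm_prop. */
struct demo_prop {
	uint32_t secid[DEMO_LSM_COUNT];
};

/* A single u32 mark: whichever LSM stores last wins. */
void mark_set(uint32_t *mark, uint32_t secid)
{
	*mark = secid;
}

/* Per-LSM storage: each LSM writes only its own slot. */
void prop_set(struct demo_prop *p, enum demo_lsm lsm, uint32_t secid)
{
	assert(lsm < DEMO_LSM_COUNT);
	p->secid[lsm] = secid;
}

uint32_t prop_get(const struct demo_prop *p, enum demo_lsm lsm)
{
	assert(lsm < DEMO_LSM_COUNT);
	return p->secid[lsm];
}
```

With the single mark, the second store overwrites the first; with the
per-LSM slots, both values remain readable.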

>>> ... Once classified we want this to work with more than SELinux
>>> only, we have a particular interest in the eBPF LSM.
>> One of the design requirements for the LSM framework is that it
>> provides an abstraction layer between the kernel and the underlying
>> security mechanisms implemented by the various LSMs.  Some operations
>> will always fall outside the scope of individual LSMs due to their
>> nature, but as a general rule we try to ensure that LSM hooks are
>> useful across multiple LSMs.
> I don't know much about SECMARK but Google is telling me it doesn't
> work with anything but SELinux LSM? We'd just like to avoid this
> pitfall and I guess that is why Casey brings up lsm_prop?

Google is wrong. (Imagine that!) Smack uses secmarks. It's one of the
reasons you can't use SELinux and Smack at the same time. There is code
in iptables that implies it only works for SELinux, but it isn't true.
That's why you want an lsm_prop instead of a secid. The limitation of a
secmark is imposed by the IP stack implementation. It would be very
frustrating if you introduced yet another scheme with that limitation.

>>> Following the fwmark example, if there was some programmable in-kernel
>>> function to convert the cmd into a SELinux label would we be able to
>>> enable SELinux following the SECMARK design?
>> As Casey already mentioned, any sort of classifier would need to be
>> able to support multiple LSMs.  The only example that comes to mind at
>> the moment is the NetLabel mechanism which translates between
>> on-the-wire CIPSO and CALIPSO labels and multiple LSMs (Smack and
>> SELinux currently).
> Ok, I think they can look into that, it is a good lead

Netlabel has a similar issue to secmarks with its use of secids, and
currently supports only a single CIPSO tag in the IP header, making
multiple concurrent LSM support impossible. If you're defining a new
mechanism you can avoid this limitation.

>>> Would there be an objection if that in-kernel function was using a
>>> system-wide eBPF uploaded with some fwctl uAPI?
>> We'd obviously need to see patches, but there is precedent in
>> separating labeling from enforcement.  We've discussed SecMark and
>> NetLabel in the networking space, but technically, the VFS/filesystem
>> xattr implementations could also be considered as a labeling mechanism
>> outside of the LSM.
> Makes sense
>
>>> Finally, would there be an objection to enabling the same function in
>>> eBPF by feeding it the entire command and have it classify and make a
>>> security decision in a single eBPF program?
>> Keeping in mind that from an LSM perspective we need to support
>> multiple implementations, both in terms of language mechanics (eBPF,
>> Rust, C) and security philosophies (Smack, SELinux, AppArmor, etc.),
>> so it would be very unlikely that we would want a specific shortcut or
>> mechanism that would only work for one language or philosophy.
> Okay, it is good to understand the sensitivities
>
>>> Is there some other way to enable eBPF?
>> If one develops a workable LSM hook then I see no reason why one
>> couldn't write a BPF LSM to use that hook; that's what we do today.
> I was thinking that too
>
>> However, it seems like direct reuse of secmark isn't what is needed,
>> or even wanted, you were just using that as an example of separating
>> labeling from enforcement, yes?
> Yes, and looking for a coding example to guide implementing it, and to
> recast this discussion to something more concrete. It is very helpful
> to think of this a lot like deep packet inspection and
> classification.
>
> That the packets are delivered to FW and execute commands is not
> actually that important. IP packets are also delivered to remote CPUs
> and execute commands there too <shrug>
>
> At the end of the day the task is the same - deep packet inspection,
> classification, labeling, policy decision, enforcement.
>
> Thanks,
> Jason
>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-14 17:05                   ` Casey Schaufler
@ 2026-04-14 19:09                     ` Paul Moore
  2026-04-14 20:09                       ` Casey Schaufler
  0 siblings, 1 reply; 81+ messages in thread
From: Paul Moore @ 2026-04-14 19:09 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Jason Gunthorpe, Leon Romanovsky, Roberto Sassu, KP Singh,
	Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Saeed Mahameed, Itay Avraham,
	Dave Jiang, Jonathan Cameron, bpf, linux-kernel, linux-kselftest,
	linux-rdma, Chiara Meiohas, Maher Sanalla, linux-security-module

On Tue, Apr 14, 2026 at 1:05 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> Netlabel has a similar issue to secmarks with its use of secids, and
> currently supports only a single CIPSO tag in the IP header, making
> multiple concurrent LSM support impossible.

That's not correct.

We've talked about this multiple times Casey.  The short version is
that while NetLabel doesn't support multiple simultaneous LSMs at the
moment (mostly due to an issue with outbound traffic), this is not due
to some inherent limitation, it is due to the fact that it wasn't
needed when NetLabel was created, and no one has done the (relatively
minor) work to add support since then.

For those of you who are interested in a more detailed explanation,
here ya go ...

NetLabel passes security attributes between itself and various LSMs
through the netlbl_lsm_secattr struct.  The netlbl_lsm_secattr struct
is an abstraction not only for the underlying labeling protocols, e.g.
CIPSO and CALIPSO, but also for the LSMs.  Multiple LSMs call into
NetLabel for the same inbound packet using netlbl_skbuff_getattr() and
then translate the attributes into their own label representation.

Outbound traffic is a bit more complicated as it involves changing the
state of either a sock, via netlbl_sock_setattr(), or a packet, via
netlbl_skbuff_setattr(), but in both cases we are once again dealing
with netlbl_lsm_secattr struct, not a LSM specific label.  Since the
underlying labeling protocol is configured within the NetLabel
subsystem and outside the individual LSMs, there is no worry about
different LSMs requesting different protocol configurations (that is a
separate system/network management issue). The only concern is that
the on-the-wire representation is the same for each LSM that is using
NetLabel based labeling.  While some additional work would be
required, it shouldn't be that hard to add NetLabel/protocol code to
ensure the protocol specific labels are the same, and reject/drop the
packet if not.
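
As a sketch of that agreement check, assuming a hypothetical flattened
attribute type (demo_wire_attr stands in for the protocol-level contents
of a secattr and is not the real netlbl_lsm_secattr layout):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-the-wire attributes: an MLS level plus a category bitmap. */
struct demo_wire_attr {
	uint32_t level;
	uint64_t cats;
};

/*
 * Each LSM contributes the wire attributes it wants on the outbound packet.
 * Return true and copy the shared label to *out only if all requests are
 * identical; return false (the caller would reject/drop) otherwise.
 */
bool demo_wire_agree(const struct demo_wire_attr *req, size_t n,
		     struct demo_wire_attr *out)
{
	assert(out != NULL);

	if (req == NULL || n == 0)
		return false;
	for (size_t i = 1; i < n; i++) {
		if (req[i].level != req[0].level || req[i].cats != req[0].cats)
			return false;
	}
	*out = req[0];
	return true;
}
```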

Use of the NetLabel translation cache, e.g. netlbl_cache_add(), would
require some additional work to convert over to a lsm_prop instead of
a u32/secid, but if you look at the caching code that should be
trivial.  It might be as simple as adding a lsm_prop to the
netlbl_lsm_secattr::attr struct since the cache stores a full secattr
and not just a u32/secid.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-14 19:09                     ` Paul Moore
@ 2026-04-14 20:09                       ` Casey Schaufler
  2026-04-14 20:44                         ` Paul Moore
  0 siblings, 1 reply; 81+ messages in thread
From: Casey Schaufler @ 2026-04-14 20:09 UTC (permalink / raw)
  To: Paul Moore
  Cc: Jason Gunthorpe, Leon Romanovsky, Roberto Sassu, KP Singh,
	Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Saeed Mahameed, Itay Avraham,
	Dave Jiang, Jonathan Cameron, bpf, linux-kernel, linux-kselftest,
	linux-rdma, Chiara Meiohas, Maher Sanalla, linux-security-module,
	Casey Schaufler

On 4/14/2026 12:09 PM, Paul Moore wrote:
> On Tue, Apr 14, 2026 at 1:05 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> Netlabel has a similar issue to secmarks with its use of secids, and
>> currently supports only a single CIPSO tag in the IP header, making
>> multiple concurrent LSM support impossible.
> That's not correct.

OK, you're right. However ...

>
> We've talked about this multiple times Casey.  The short version is
> that while NetLabel doesn't support multiple simultaneous LSMs at the
> moment (mostly due to an issue with outbound traffic), this is not due
> to some inherent limitation, it is due to the fact that it wasn't
> needed when NetLabel was created, and no one has done the (relatively
> minor) work to add support since then.
>
> For those of you who are interested in a more detailed explanation,
> here ya go ...
>
> NetLabel passes security attributes between itself and various LSMs
> through the netlbl_lsm_secattr struct.  The netlbl_lsm_secattr struct
> is an abstraction not only for the underlying labeling protocols, e.g.
> CIPSO and CALIPSO, but also for the LSMs.  Multiple LSMs call into
> NetLabel for the same inbound packet using netlbl_skbuff_getattr() and
> then translate the attributes into their own label representation.
>
> Outbound traffic is a bit more complicated as it involves changing the
> state of either a sock, via netlbl_sock_setattr(), or a packet, via
> netlbl_skbuff_setattr(), but in both cases we are once again dealing
> with netlbl_lsm_secattr struct, not a LSM specific label.  Since the
> underlying labeling protocol is configured within the NetLabel
> subsystem and outside the individual LSMs, there is no worry about
> different LSMs requesting different protocol configurations (that is a
> separate system/network management issue). The only concern is that
> the on-the-wire representation is the same for each LSM that is using
> NetLabel based labeling.  While some additional work would be
> required, it shouldn't be that hard to add NetLabel/protocol code to
> ensure the protocol specific labels are the same, and reject/drop the
> packet if not.

Indeed, we've discussed this, and I had at one point implemented it.
The problem is that for any meaningful access control policies you will
never get the two LSMs to agree on a unified network representation.
SELinux transmits the MLS component of the security context. Smack passes
the text of its context. Unless the Smack label is completely in step with
the MLS component of the SELinux context there is no hope of a common
network representation. If a *very talented* sysadmin could create such a
policy, you would have to wonder why, because Smack would be duplicating
the SELinux MLS policy.

So there's really no value in pursuing that approach.

> Use of the NetLabel translation cache, e.g. netlbl_cache_add(), would
> require some additional work to convert over to a lsm_prop instead of
> a u32/secid, but if you look at the caching code that should be
> trivial.  It might be as simple as adding a lsm_prop to the
> netlbl_lsm_secattr::attr struct since the cache stores a full secattr
> and not just a u32/secid.

Indeed. But with no viable users it seems like a lower priority task.

And to be clear, I have no problem with netlabel as written. Multiple tag
support isn't simple (we did it for Trusted IRIX) and the limited space
available for IP options makes it tricky.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-14 20:09                       ` Casey Schaufler
@ 2026-04-14 20:44                         ` Paul Moore
  2026-04-14 22:42                           ` Casey Schaufler
  0 siblings, 1 reply; 81+ messages in thread
From: Paul Moore @ 2026-04-14 20:44 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Jason Gunthorpe, Leon Romanovsky, Roberto Sassu, KP Singh,
	Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Saeed Mahameed, Itay Avraham,
	Dave Jiang, Jonathan Cameron, bpf, linux-kernel, linux-kselftest,
	linux-rdma, Chiara Meiohas, Maher Sanalla, linux-security-module

On Tue, Apr 14, 2026 at 4:10 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 4/14/2026 12:09 PM, Paul Moore wrote:
> > On Tue, Apr 14, 2026 at 1:05 PM Casey Schaufler <casey@schaufler-ca.com> wrote:

...

> > For those of you who are interested in a more detailed explanation,
> > here ya go ...
> >
> > NetLabel passes security attributes between itself and various LSMs
> > through the netlbl_lsm_secattr struct.  The netlbl_lsm_secattr struct
> > is an abstraction not only for the underlying labeling protocols, e.g.
> > CIPSO and CALIPSO, but also for the LSMs.  Multiple LSMs call into
> > NetLabel for the same inbound packet using netlbl_skbuff_getattr() and
> > then translate the attributes into their own label representation.
> >
> > Outbound traffic is a bit more complicated as it involves changing the
> > state of either a sock, via netlbl_sock_setattr(), or a packet, via
> > netlbl_skbuff_setattr(), but in both cases we are once again dealing
> > with netlbl_lsm_secattr struct, not a LSM specific label.  Since the
> > underlying labeling protocol is configured within the NetLabel
> > subsystem and outside the individual LSMs, there is no worry about
> > different LSMs requesting different protocol configurations (that is a
> > separate system/network management issue). The only concern is that
> > the on-the-wire representation is the same for each LSM that is using
> > NetLabel based labeling.  While some additional work would be
> > required, it shouldn't be that hard to add NetLabel/protocol code to
> > ensure the protocol specific labels are the same, and reject/drop the
> > packet if not.
>
> Indeed, we've discussed this, and I had at one point implemented it.
> The problem is that for any meaningful access control policies you will
> never get the two LSMs to agree on a unified network representation.

That is also not correct.  In the early days when SELinux was first
being used to displace the old CMW/MLS UNIXes NetLabel/CIPSO was used
to interoperate between the systems and it did so quite well despite
the SELinux TE/MLS policy being quite different than the CMW MLS
policies.  Yes, there were aspects of the SELinux policy that made
this easier - it had a MLS component after all - but they were still
*very* different policies.

> SELinux transmits the MLS component of the security context. Smack passes
> the text of its context.

Arguably the NetLabel/CIPSO interoperability challenge between SELinux
and Smack is due more to differences in how Smack encodes its security
labels into MLS attributes than from any inherent interop limitation.
In fact, I thought the "cipso2" Smack interface was intended to
resolve this by allowing admins to control the Smack/CIPSO
translation so that Smack could interop with MLS systems, is that not
the case?

> Unless the Smack label is completely in step with
> the MLS component of the SELinux context there is no hope of a common
> network representation. If a *very talented* sysadmin could create such a
> policy, you would have to wonder why, because Smack would be duplicating
> the SELinux MLS policy.

Interoperability wouldn't be a problem if everyone used the "right" systems :D

> > Use of the NetLabel translation cache, e.g. netlbl_cache_add(), would
> > require some additional work to convert over to a lsm_prop instead of
> > a u32/secid, but if you look at the caching code that should be
> > trivial.  It might be as simple as adding a lsm_prop to the
> > netlbl_lsm_secattr::attr struct since the cache stores a full secattr
> > and not just a u32/secid.
>
> Indeed. But with no viable users it seems like a lower priority task.

You need to be very careful about those "viable users" claims ...

> And to be clear, I have no problem with netlabel as written. Multiple tag
> support isn't simple (we did it for Trusted IRIX) and the limited space
> available for IP options make it tricky.

That's an entirely different problem from the LSM and protocol
abstractions, but yeah.

-- 
paul-moore.com

* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-14 20:44                         ` Paul Moore
@ 2026-04-14 22:42                           ` Casey Schaufler
  2026-04-15 21:03                             ` Paul Moore
  0 siblings, 1 reply; 81+ messages in thread
From: Casey Schaufler @ 2026-04-14 22:42 UTC (permalink / raw)
  To: Paul Moore
  Cc: Jason Gunthorpe, Leon Romanovsky, Roberto Sassu, KP Singh,
	Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Saeed Mahameed, Itay Avraham,
	Dave Jiang, Jonathan Cameron, bpf, linux-kernel, linux-kselftest,
	linux-rdma, Chiara Meiohas, Maher Sanalla, linux-security-module,
	Casey Schaufler

On 4/14/2026 1:44 PM, Paul Moore wrote:
> On Tue, Apr 14, 2026 at 4:10 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 4/14/2026 12:09 PM, Paul Moore wrote:
>>> On Tue, Apr 14, 2026 at 1:05 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> ..
>
>>> For those of you who are interested in a more detailed explanation,
>>> here ya go ...
>>>
>>> NetLabel passes security attributes between itself and various LSMs
>>> through the netlbl_lsm_secattr struct.  The netlbl_lsm_secattr struct
>>> is an abstraction not only for the underlying labeling protocols, e.g.
>>> CIPSO and CALIPSO, but also for the LSMs.  Multiple LSMs call into
>>> NetLabel for the same inbound packet using netlbl_skbuff_getattr() and
>>> then translate the attributes into their own label representation.
>>>
>>> Outbound traffic is a bit more complicated as it involves changing the
>>> state of either a sock, via netlbl_sock_setattr(), or a packet, via
>>> netlbl_skbuff_setattr(), but in both cases we are once again dealing
>>> with netlbl_lsm_secattr struct, not a LSM specific label.  Since the
>>> underlying labeling protocol is configured within the NetLabel
>>> subsystem and outside the individual LSMs, there is no worry about
>>> different LSMs requesting different protocol configurations (that is a
>>> separate system/network management issue). The only concern is that
>>> the on-the-wire representation is the same for each LSM that is using
>>> NetLabel based labeling.  While some additional work would be
>>> required, it shouldn't be that hard to add NetLabel/protocol code to
>>> ensure the protocol specific labels are the same, and reject/drop the
>>> packet if not.
>> Indeed, we've discussed this, and I had at one point implemented it.
>> The problem is that for any meaningful access control policies you will
>> never get the two LSMs to agree on a unified network representation.
> That is also not correct.  In the early days when SELinux was first
> being used to displace the old CMW/MLS UNIXes NetLabel/CIPSO was used
> to interoperate between the systems and it did so quite well despite
> the SELinux TE/MLS policy being quite different than the CMW MLS
> policies.  Yes, there were aspects of the SELinux policy that made
> this easier - it had a MLS component after all - but they were still
> *very* different policies.

CMW MLS and SELinux MLS can be mapped. They have the same components.
Comparing a full SELinux context and a Smack label is another beast.

>> SELinux transmits the MLS component of the security context. Smack passes
>> the text of its context.
> Arguably the NetLabel/CIPSO interoperability challenge between SELinux
> and Smack is due more to differences in how Smack encodes its security
> labels into MLS attributes than from any inherent interop limitation.

Yes. That is correct. The big issue I see is that SELinux does not represent
the entire context in the CIPSO header. Thus, you're up against many SELinux
contexts having the same wire representation, where Smack will have a unique
on-wire representation for each context. You'll have many-to-one mapping issues.

> In fact, I thought the "cipso2" Smack interface was intended to
> resolve this by allowing admins to control the Smack/CIPSO
> translation so that Smack could interop with MLS systems, is that not
> the case?

Indeed. A CMW MLS policy is way simpler than an SELinux policy.

>
>> Unless the Smack label is completely in step with
>> the MLS component of the SELinux context there is no hope of a common
>> network representation. If a *very talented* sysadmin could create such a
>> policy, you would have to wonder why, because Smack would be duplicating
>> the SELinux MLS policy.
> Interoperability wouldn't be a problem if everyone used the "right" systems :D

Where would the fun in that be? ;)

>
>>> Use of the NetLabel translation cache, e.g. netlbl_cache_add(), would
>>> require some additional work to convert over to a lsm_prop instead of
>>> a u32/secid, but if you look at the caching code that should be
>>> trivial.  It might be as simple as adding a lsm_prop to the
>>> netlbl_lsm_secattr::attr struct since the cache stores a full secattr
>>> and not just a u32/secid.
>> Indeed. But with no viable users it seems like a lower priority task.
> You need to be very careful about those "viable users" claims ...

Today there are no users. There are other problems (e.g. mount options)
that have yet to be addressed.

>> And to be clear, I have no problem with netlabel as written. Multiple tag
>> support isn't simple (we did it for Trusted IRIX) and the limited space
>> available for IP options make it tricky.
> That's an entirely different problem from the LSM and protocol
> abstractions, but yeah.
>

* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-14 22:42                           ` Casey Schaufler
@ 2026-04-15 21:03                             ` Paul Moore
  2026-04-15 21:21                               ` Casey Schaufler
  0 siblings, 1 reply; 81+ messages in thread
From: Paul Moore @ 2026-04-15 21:03 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Jason Gunthorpe, Leon Romanovsky, Roberto Sassu, KP Singh,
	Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Saeed Mahameed, Itay Avraham,
	Dave Jiang, Jonathan Cameron, bpf, linux-kernel, linux-kselftest,
	linux-rdma, Chiara Meiohas, Maher Sanalla, linux-security-module

On Tue, Apr 14, 2026 at 6:42 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 4/14/2026 1:44 PM, Paul Moore wrote:
> > On Tue, Apr 14, 2026 at 4:10 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >> On 4/14/2026 12:09 PM, Paul Moore wrote:
> >>> On Tue, Apr 14, 2026 at 1:05 PM Casey Schaufler <casey@schaufler-ca.com> wrote:

...

> CMW MLS and SELinux MLS can be mapped. They have the same components.

Yes, one of the fields in a full SELinux label can be an MLS field,
but that doesn't mean there isn't translation needed.  The important
point is that security label translation, mapping, etc. is necessary,
possible, and has been proven to work across a variety of systems.

> >> SELinux transmits the MLS component of the security context. Smack passes
> >> the text of its context.
> > Arguably the NetLabel/CIPSO interoperability challenge between SELinux
> > and Smack is due more to differences in how Smack encodes its security
> > labels into MLS attributes than from any inherent interop limitation.
>
> Yes. That is correct. The big issue I see is that SELinux does not represent
> the entire context in the CIPSO header. Thus, you're up against many SELinux
> contexts having the same wire representation, where Smack will have a unique
> on-wire representation for each context ...

That isn't always true is it?  From my understanding of the "cipso2"
interface an admin could easily map multiple Smack labels to a single
CIPSO label.

It's important to remember that if you wanted to utilize CIPSO to
communicate between SELinux and Smack, the label translation is not
between SELinux and Smack but rather between SELinux and CIPSO as well
as between Smack and CIPSO.
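
To make that two-sided translation concrete, here is a minimal user-space
sketch in C. It is not NetLabel code: the wire_label struct, the two mapping
tables, and outbound_labels_agree() are hypothetical names invented for
illustration, and the label strings are made up. Each LSM translates its own
label into a CIPSO-style wire label (assumed here to be a sensitivity level
plus a category bitmap), and the packet is acceptable only if both
translations land on the same wire label.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified wire label; not the kernel's netlbl_lsm_secattr. */
struct wire_label {
	uint8_t level;		/* sensitivity level */
	uint64_t cats;		/* category bitmap */
};

/* One admin-defined mapping from an LSM's own label to the wire label. */
struct label_map {
	const char *lsm_label;
	struct wire_label wire;
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* LSM "A" carries more than MLS data, so several labels share a wire label. */
static const struct label_map lsm_a_map[] = {
	{ "user_u:user_r:user_t:s0",  { 0, 0x0 } },
	{ "user_u:user_r:httpd_t:s0", { 0, 0x0 } },
	{ "user_u:user_r:user_t:s1",  { 1, 0x0 } },
};

/* LSM "B" uses flat text labels mapped explicitly by the admin. */
static const struct label_map lsm_b_map[] = {
	{ "Public", { 0, 0x0 } },
	{ "Secret", { 1, 0x0 } },
};

/* Translate one LSM label to its wire form; false if there is no mapping. */
static bool to_wire(const struct label_map *map, size_t n,
		    const char *label, struct wire_label *out)
{
	assert(map != NULL && label != NULL && out != NULL);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(map[i].lsm_label, label) == 0) {
			*out = map[i].wire;
			return true;
		}
	}
	return false;
}

/*
 * Returns true and fills *out only when both LSMs map their labels to the
 * same wire label; a false return stands in for "reject/drop the packet".
 */
bool outbound_labels_agree(const char *label_a, const char *label_b,
			   struct wire_label *out)
{
	struct wire_label a, b;

	if (!to_wire(lsm_a_map, ARRAY_LEN(lsm_a_map), label_a, &a) ||
	    !to_wire(lsm_b_map, ARRAY_LEN(lsm_b_map), label_b, &b))
		return false;
	if (a.level != b.level || a.cats != b.cats)
		return false;

	*out = a;
	return true;
}
```

Note that neither table refers to the other LSM; the only shared vocabulary is
the wire label, and the many-to-one case shows up as two "A" labels that share
one wire label.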

> >>> Use of the NetLabel translation cache, e.g. netlbl_cache_add(), would
> >>> require some additional work to convert over to a lsm_prop instead of
> >>> a u32/secid, but if you look at the caching code that should be
> >>> trivial.  It might be as simple as adding a lsm_prop to the
> >>> netlbl_lsm_secattr::attr struct since the cache stores a full secattr
> >>> and not just a u32/secid.
> >> Indeed. But with no viable users it seems like a lower priority task.
> > You need to be very careful about those "viable users" claims ...
>
> Today there are no users.

That you are aware of at the moment.  You are also well aware of my
feelings on this issue and ultimately I'm the one who has to sign off
on that stuff.

> There are other problems (e.g. mount options) that have yet to be addressed.

The existence of one problem does not mean another does not exist.

-- 
paul-moore.com

* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-15 21:03                             ` Paul Moore
@ 2026-04-15 21:21                               ` Casey Schaufler
  0 siblings, 0 replies; 81+ messages in thread
From: Casey Schaufler @ 2026-04-15 21:21 UTC (permalink / raw)
  To: Paul Moore
  Cc: Jason Gunthorpe, Leon Romanovsky, Roberto Sassu, KP Singh,
	Matt Bobrowski, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Saeed Mahameed, Itay Avraham,
	Dave Jiang, Jonathan Cameron, bpf, linux-kernel, linux-kselftest,
	linux-rdma, Chiara Meiohas, Maher Sanalla, linux-security-module,
	Casey Schaufler

On 4/15/2026 2:03 PM, Paul Moore wrote:
> On Tue, Apr 14, 2026 at 6:42 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 4/14/2026 1:44 PM, Paul Moore wrote:
>>> On Tue, Apr 14, 2026 at 4:10 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> On 4/14/2026 12:09 PM, Paul Moore wrote:
>>>>> On Tue, Apr 14, 2026 at 1:05 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> ..
>
>> CMW MLS and SELinux MLS can be mapped. They have the same components.
> Yes, one of the fields in a full SELinux label can be an MLS field,
> but that doesn't mean there isn't translation needed.  The important
> point is that security label translation, mapping, etc. is necessary,
> possible, and has been proven to work across a variety of systems.

I'm not especially concerned about translation between systems.
The problem at hand is negotiating between LSMs on the same system.

>
>>>> SELinux transmits the MLS component of the security context. Smack passes
>>>> the text of its context.
>>> Arguably the NetLabel/CIPSO interoperability challenge between SELinux
>>> and Smack is due more to differences in how Smack encodes its security
>>> labels into MLS attributes than from any inherent interop limitation.
>> Yes. That is correct. The big issue I see is that SELinux does not represent
>> the entire context in the CIPSO header. Thus, you're up against many SELinux
>> contexts having the same wire representation, where Smack will have a unique
>> on-wire representation for each context ...
> That isn't always true is it?  From my understanding of the "cipso2"
> interface an admin could easily map multiple Smack labels to a single
> CIPSO label.

True, but you can't map multiple Smack labels to the same CIPSO label
without introducing ambiguity.

> It's important to remember that if you wanted to utilize CIPSO to
> communicate between SELinux and Smack, the label translation is not
> between SELinux and Smack but rather between SELinux and CIPSO as well
> as between Smack and CIPSO.
>
>>>>> Use of the NetLabel translation cache, e.g. netlbl_cache_add(), would
>>>>> require some additional work to convert over to a lsm_prop instead of
>>>>> a u32/secid, but if you look at the caching code that should be
>>>>> trivial.  It might be as simple as adding a lsm_prop to the
>>>>> netlbl_lsm_secattr::attr struct since the cache stores a full secattr
>>>>> and not just a u32/secid.
>>>> Indeed. But with no viable users it seems like a lower priority task.
>>> You need to be very careful about those "viable users" claims ...
>> Today there are no users.
> That you are aware of at the moment.  You are also well aware of my
> feelings on this issue and ultimately I'm the one who has to sign off
> on that stuff.

Understood. There are a serious number of considerations that need to
be worked through.

>
>> There are other problems (e.g. mount options) that have yet to be addressed.
> The existence of one problem does not mean another does not exist.

True enough.


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-13 23:19                 ` Jason Gunthorpe
  2026-04-14 17:05                   ` Casey Schaufler
@ 2026-04-14 20:27                   ` Paul Moore
  2026-04-15 13:47                     ` Jason Gunthorpe
  1 sibling, 1 reply; 81+ messages in thread
From: Paul Moore @ 2026-04-14 20:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Mon, Apr 13, 2026 at 7:19 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Mon, Apr 13, 2026 at 06:36:06PM -0400, Paul Moore wrote:
> > On Mon, Apr 13, 2026 at 12:42 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Sun, Apr 12, 2026 at 09:38:35PM -0400, Paul Moore wrote:
> > > > > We are not limited to LSM solution, the goal is to intercept commands
> > > > > which are submitted to the FW and "security" bucket sounded right to us.
> > > >
> > > > Yes, it does sound "security relevant", but without a well defined
> > > > interface/format it is going to be difficult to write a generic LSM to
> > > > have any level of granularity beyond a basic "load firmware"
> > > > permission.
> > >
> > > I think to step back a bit, what this is trying to achieve is very
> > > similar to the iptables fwmark/secmark scheme.
> >
> > Points for thinking outside the box a bit, but from what I've seen
> > thus far, it differs from secmark in a few important areas.  The
> > secmark concept relies on the admin to configure the network stack to
> > apply secmark labels to network traffic as it flows through the
> > system, the LSM then applies security policy to these packets based on
> > their label.  The firmware LSM hooks, at least as I currently
> > understand them, rely on the LSM hook callback to parse the firmware
> > op/request and apply a security policy to the request.
>
> That was what was proposed because the idea was to combine the
> parse/classification/decision steps into one eBPF program, but I think
> it can be split up too.
>
> > We've already talked about the first issue, parsing the request, and
> > my suggestion was to make the LSM hook call from within the firmware
> > (the firmware must have some way to call into the kernel/driver code,
> > no?)
>
> No, that's not workable on so many levels. It is sort of analogous to
> asking the NIC to call the LSM to apply the secmark while sending the
> packet.

From the LSM's perspective it really doesn't matter who calls the LSM
hook as long as the caller can be trusted to handle the access control
verdict properly.

> The proper flow has the kernel evaluate the packet/command *before* it
> delivers it to HW, not after.

From what I can see that's an artificial constraint since we've
already chosen to trust the device.  After all, if we don't trust the
device, why are we talking to it?

> > so that only the firmware would need to parse the request.  If we
> > wanted to adopt a secmark-esque approach, one could develop a second
> > parsing mechanism that would be responsible for assigning a LSM label
> > to the request, and then pass the firmware request to the LSM, but I
> > do worry a bit about the added complexity associated with keeping the
> > parser sync'd with the driver/fw.
>
> In practice it would be like iptables, the parser would be entirely
> programmed by userspace and there is nothing to keep in sync.

You've mentioned a few times now that the firmware/request will vary
across not only devices, but firmware revisions too, this implies
there will need to be some effort to keep whatever parser you develop
(BPF, userspace config, etc.) in sync with the parser built into the
firmware.  Or am I misunderstanding something?

> > However, even if we solve the parsing problem, I worry we have
> > another, closely related issue, of having to categorize all of the
> > past, present, and future firmware requests into a set of LSM specific
> > actions.
>
> Why? secmark doesn't have this issue? The classifier would return the
> same kind of information as secmark, some user provided label that is
> delivered to the LSM policy side.

I think there is a misunderstanding in either how secmark works or how
the LSMs use secmark labels when enforcing their security policy.

The secmark label is set on a packet to represent the network
properties of a packet.  While the rules governing how a packet's
secmark is set and the semantic meaning of that secmark label is going
to be LSM and solution specific, secmark labels represent the
properties of a packet and not the operation, e.g.
send/receive/forward/etc., being requested at a given access control
point.  The access control point itself represents the requested
operation.  This is possible because the number of networking
operations on a given packet is well defined and fairly limited; at a
high level the packet is either being sent from the node, received by
the node, or is passing through the node.

As I understand the firmware controls being proposed here, encoded
within the firmware request blob is the operation being requested.
While we've discussed possible solutions on how to parse the request
blob to determine the operation, we haven't really discussed how to
represent the requested operation to the LSMs.  I'm assuming there
isn't a well defined set of operations, and like the request format
itself, the set of valid operations will vary by device and firmware
revision.  I hope you can understand both how this differs from
secmark and that it is a challenge that really hasn't been addressed
in the proposals we've discussed.

At a very high level the access control decision for firmware/device
requests depends on whether the LSM wants to allow process A to do B
to device C.  The identity/credentials associated with process A are
easy to understand, we have plenty of examples both inside and outside
of the LSM on how to do that.  The device identity/attributes
associated with device C can be a bit trickier, but once again we have
plenty of examples to draw from, and we can even fall back to a
generic "kernel" id/attribute if the LSM chooses not to distinguish
entities below the userspace/kernel boundary.  I think the hardest
issue with the firmware request hooks is going to be determining what
operation is being requested, the "B" portion of the access request
tuple.  If we can create a well defined set of operations and leave it
to the parser to characterize the operation we've potentially got a
solution, but if the operation is truly going to be arbitrary then we
have a real problem.  How do you craft a meaningful access control
policy when you don't understand what is being requested?
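
As a rough illustration of the "process A does B to device C" tuple, here is a
small user-space sketch in C. It assumes that a well defined operation set
exists, which is exactly the open question above. The fw_op enum, the
access_request struct, the rule table, and the subject/device names are all
hypothetical and do not correspond to any existing LSM or fwctl interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, fixed set of operations a request parser could report. */
enum fw_op {
	FW_OP_UNKNOWN = 0,	/* parser could not characterize the request */
	FW_OP_QUERY,
	FW_OP_CREATE_OBJ,
	FW_OP_DESTROY_OBJ,
};

/* The access request tuple: subject A, operation B, device C. */
struct access_request {
	const char *subject;
	enum fw_op op;
	const char *device;
};

/* Illustrative allow list; anything not listed is denied. */
static const struct access_request allow_rules[] = {
	{ "diag_tool",   FW_OP_QUERY,      "nic0" },
	{ "provisioner", FW_OP_CREATE_OBJ, "nic0" },
};

bool access_allowed(const struct access_request *req)
{
	assert(req != NULL && req->subject != NULL && req->device != NULL);

	/* An operation the policy cannot name cannot be reasoned about. */
	if (req->op == FW_OP_UNKNOWN)
		return false;

	for (size_t i = 0; i < sizeof(allow_rules) / sizeof(allow_rules[0]); i++) {
		const struct access_request *rule = &allow_rules[i];

		if (rule->op == req->op &&
		    strcmp(rule->subject, req->subject) == 0 &&
		    strcmp(rule->device, req->device) == 0)
			return true;
	}
	return false;
}
```

The FW_OP_UNKNOWN branch is where an arbitrary, uncharacterized operation ends
up: the policy has nothing to match on, so the sketch can only deny it.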

-- 
paul-moore.com

* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-14 20:27                   ` Paul Moore
@ 2026-04-15 13:47                     ` Jason Gunthorpe
  2026-04-15 21:40                       ` Paul Moore
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-15 13:47 UTC (permalink / raw)
  To: Paul Moore
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Tue, Apr 14, 2026 at 04:27:58PM -0400, Paul Moore wrote:
> On Mon, Apr 13, 2026 at 7:19 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Mon, Apr 13, 2026 at 06:36:06PM -0400, Paul Moore wrote:
> > > On Mon, Apr 13, 2026 at 12:42 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > On Sun, Apr 12, 2026 at 09:38:35PM -0400, Paul Moore wrote:
> > > > > > We are not limited to LSM solution, the goal is to intercept commands
> > > > > > which are submitted to the FW and "security" bucket sounded right to us.
> > > > >
> > > > > Yes, it does sound "security relevant", but without a well defined
> > > > > interface/format it is going to be difficult to write a generic LSM to
> > > > > have any level of granularity beyond a basic "load firmware"
> > > > > permission.
> > > >
> > > > I think to step back a bit, what this is trying to achieve is very
> > > > similar to the iptables fwmark/secmark scheme.
> > >
> > > Points for thinking outside the box a bit, but from what I've seen
> > > thus far, it differs from secmark in a few important areas.  The
> > > secmark concept relies on the admin to configure the network stack to
> > > apply secmark labels to network traffic as it flows through the
> > > system, the LSM then applies security policy to these packets based on
> > > their label.  The firmware LSM hooks, at least as I currently
> > > understand them, rely on the LSM hook callback to parse the firmware
> > > op/request and apply a security policy to the request.
> >
> > That was what was proposed because the idea was to combine the
> > parse/classification/decision steps into one eBPF program, but I think
> > it can be split up too.
> >
> > > We've already talked about the first issue, parsing the request, and
> > > my suggestion was to make the LSM hook call from within the firmware
> > > (the firmware must have some way to call into the kernel/driver code,
> > > no?)
> >
> > No, that's not workable on so many levels. It is sort of analogous to
> > asking the NIC to call the LSM to apply the secmark while sending the
> > packet.
> 
> From the LSM's perspective it really doesn't matter who calls the LSM
> hook as long as the caller can be trusted to handle the access control
> verdict properly.

The NIC doesn't know anything more than the kernel to call the LSM
hook. It can't magically generate the label the admin wants to use any
better than the kernel can.

Just like you could never get everyone to agree on a fixed set of
labels for network packets we could never get agreement on a fixed set
of labels for command packets either.

> > > so that only the firmware would need to parse the request.  If we
> > > wanted to adopt a secmark-esque approach, one could develop a second
> > > parsing mechanism that would be responsible for assigning a LSM label
> > > to the request, and then pass the firmware request to the LSM, but I
> > > do worry a bit about the added complexity associated with keeping the
> > > parser sync'd with the driver/fw.
> >
> > In practice it would be like iptables, the parser would be entirely
> > programmed by userspace and there is nothing to keep in sync.
> 
> You've mentioned a few times now that the firmware/request will vary
> across not only devices, but firmware revisions too, 

I never said firmware revisions, part of the requirement is strong ABI
compatibility in these packets.

> this implies there will need to be some effort to keep whatever
> parser you develop (BPF, userspace config, etc.) in sync with the
> parser built into the firmware.  Or am I misunderstanding something?

I would not use the word "sync". It is very similar to deep packet
inspection, if you want to look inside, say, RPC messages carried
within HTTP then you have to keep up to date. How onerous that is
depends on what the admin's labeling goals are.

> > > However, even if we solve the parsing problem, I worry we have
> > > another, closely related issue, of having to categorize all of the
> > > past, present, and future firmware requests into a set of LSM specific
> > > actions.
> >
> > Why? secmark doesn't have this issue? The classifier would return the
> > same kind of information as secmark, some user provided label that is
> > delivered to the LSM policy side.
> 
> I think there is a misunderstanding in either how secmark works or how
> the LSMs use secmark labels when enforcing their security policy.
> 
> The secmark label is set on a packet to represent the network
> properties of a packet.  While the rules governing how a packet's
> secmark is set and the semantic meaning of that secmark label is going
> to be LSM and solution specific,

"network properties" are a bit vague. I can use iptables to inspect
the packet quite completely. It has protocol modules that can do very
detailed inspection. I can use general things like -m string to apply
a secmark to packets containing specific data for example.

From my perspective iptables runs a complicated scheme to evaluate the
full content of the packet and on match applies a secmark.

You can already create a hacky labeling scheme that would tell the
difference between HTTP PUT and HTTP GET sessions for example.

At this point it is not just "network properties" but you are
inspecting a RPC and evaluating what operation a remote CPU will
perform.

Even just simple port inspection in most cases is often classifying
RPCs on the network "Any HTTP RPC" "Any DNS RPC", etc.

> secmark labels represent the properties of a packet and not the
> operation, e.g.  send/receive/forward/etc., being requested at a
> given access control point.

Yes, still aligned.

> The access control point itself represents the requested
> operation.  This is possible because the number of networking
> operations on a given packet is well defined and fairly limited; at a
> high level the packet is either being sent from the node, received by
> the node, or is passing through the node.

I think we have the same split; the fwctl send/receive analog is also very
limited.

> As I understand the firmware controls being proposed here, encoded
> within the firmware request blob is the operation being requested.

I am not proposing that kind of interpretation, I want to stay in the
secmark model.

When the packet blob is sent into the kernel at the uAPI boundary
(send_msg, send, write, FWCTL_CMD_RPC, etc) that is your access
control point.

Deep inspection on the packet blob determines the secmark.

LSM takes the secmark and determines if the access control point
accepts/rejects.

In both cases deep inspection will allow the admin to create labels
detailed to the RPC that is described in the packet. E.g.
labels like "HTTP GET", "HTTP PUT", "FWCTL CREATE OBJ X", etc.

In both cases these are packets containing RPC messages some remote
CPU will execute.

> While we've discussed possible solutions on how to parse the request
> blob to determine the operation, we haven't really discussed how to
> represent the requested operation to the LSMs.  

I don't understand this? The secmark example I pulled up is this:

iptables -t mangle -A INPUT -p tcp --dport 80 -j SECMARK --selctx system_u:object_r:httpd_packet_t:s0

The "represent the requested operation" is the string 
"system_u:object_r:httpd_packet_t:s0", which is entirely admin
defined, right?

The analog here is some

'fwctl iptables' -match 'byte[10]=0x20' -selctx system_u:object_r:fwctl_mlx5_create_pd_t:s0

Again, all admin defined?
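
A minimal user-space sketch of that split, in C, might look like the
following. It only illustrates the idea: no 'fwctl iptables' tool exists
today, and match_rule, classify_blob(), label_allowed() and the "unlabeled"
fallback are hypothetical names. The single rule mirrors the byte[10]=0x20
match above; the classifier looks inside the blob and returns an admin-chosen
label, and the policy side sees nothing but that label.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UNLABELED "unlabeled"

/* One admin-supplied rule: if blob[offset] == value, apply label. */
struct match_rule {
	size_t offset;
	uint8_t value;
	const char *label;
};

/* Rule table programmed by userspace; mirrors the byte[10]=0x20 example. */
static const struct match_rule rules[] = {
	{ 10, 0x20, "system_u:object_r:fwctl_mlx5_create_pd_t:s0" },
};

/* Classifier: inspects the command blob and returns the first matching label. */
const char *classify_blob(const uint8_t *blob, size_t len)
{
	assert(blob != NULL);

	for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
		/* Never read past the end of a short blob. */
		if (rules[i].offset < len &&
		    blob[rules[i].offset] == rules[i].value)
			return rules[i].label;
	}
	return UNLABELED;
}

/*
 * Policy stand-in: it never parses the blob, it only sees the label.
 * Here anything the classifier could not label is denied.
 */
bool label_allowed(const char *label)
{
	assert(label != NULL);
	return strcmp(label, UNLABELED) != 0;
}
```

Whether an unlabeled command should be denied or allowed by default is a
policy choice; the sketch picks deny only to keep the example short.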

> I'm assuming there isn't a well defined set of operations, and like
> the request format itself, the set of valid operations will vary
> by device and firmware revision.  I hope you can understand both
> how this differs from secmark and that it is a challenge that really
> hasn't been addressed in the proposals we've discussed.

I still don't see the difference from iptables. IPSEC, SIP, DNS, HTTP,
etc are all protocols with the same lack of any commonality.

> At a very high level the access control decision for firmware/device
> requests depends on whether the LSM wants to allow process A to do B
> to device C.  The identity/credentials associated with process A are
> easy to understand, we have plenty of examples both inside and outside
> of the LSM on how to do that.  The device identity/attributes
> associated with device C can be a bit trickier, but once again we have
> plenty of examples to draw from, and we can even fall back to a
> generic "kernel" id/attribute if the LSM chooses not to distinguish
> entities below the userspace/kernel boundary. 

I think I would feed that into the classifier as well. Like iptables
can have a netdev argument to only match against specific devices, we
can have the same logical thing.

> I think the hardest issue with the firmware request hooks is going
> to be determining what operation is being requested, the "B"
> portion of the access request tuple.  If we can create a well defined
> set of operations and leave it to the parser to characterize the
> operation we've potentially got a solution, but if the operation is
> truly going to be arbitrary then we have a real problem.  How do you
> craft a meaningful access control policy when you don't understand
> what is being requested?

Same as for networking. Admin understands, admin defines, kernel is
just a programmable classifier.

Jason

* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-15 13:47                     ` Jason Gunthorpe
@ 2026-04-15 21:40                       ` Paul Moore
  2026-04-17 19:17                         ` Jason Gunthorpe
  2026-04-23 13:05                         ` Leon Romanovsky
  0 siblings, 2 replies; 81+ messages in thread
From: Paul Moore @ 2026-04-15 21:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Wed, Apr 15, 2026 at 9:47 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Tue, Apr 14, 2026 at 04:27:58PM -0400, Paul Moore wrote:
> > On Mon, Apr 13, 2026 at 7:19 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Mon, Apr 13, 2026 at 06:36:06PM -0400, Paul Moore wrote:
> > > > On Mon, Apr 13, 2026 at 12:42 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > > On Sun, Apr 12, 2026 at 09:38:35PM -0400, Paul Moore wrote:

...

> > > > We've already talked about the first issue, parsing the request, and
> > > > my suggestion was to make the LSM hook call from within the firmware
> > > > (the firmware must have some way to call into the kernel/driver code,
> > > > no?)
> > >
> > > No, that's not workable on so many levels. It is sort of analogous to
> > > asking the NIC to call the LSM to apply the secmark while sending the
> > > packet.
> >
> > From the LSM's perspective it really doesn't matter who calls the LSM
> > hook as long as the caller can be trusted to handle the access control
> > verdict properly.
>
> The NIC doesn't know anything more than the kernel to call the LSM
> hook. It can't magically generate the label the admin wants to use any
> better than the kernel can.

The NIC presumably knows how to parse the firmware request and extract
whatever security relevant info is needed to pass to the kernel so the
driver can make an access control request.

> Just like you could never get everyone to agree on a fixed set of
> labels for network packets we could never get agreement on a fixed set
> of labels for command packets either.

I don't follow you here ... I'm guessing you are talking about secmark
labels?  The secmark concept was created not due to any disagreements
on packet labels, but rather the challenges and impacts associated
with packet matching directly in the SELinux code.  Secmark was seen
as a more elegant approach to packet matching than the older
"compat_net" SELinux code it replaced.  Even with secmark on SELinux,
the packet labels need to be defined in the SELinux policy, the
netfilter code simply assigns these labels to packets using the
netfilter config.

> > > > so that only the firmware would need to parse the request.  If we
> > > > wanted to adopt a secmark-esque approach, one could develop a second
> > > > parsing mechanism that would be responsible for assigning a LSM label
> > > > to the request, and then pass the firmware request to the LSM, but I
> > > > do worry a bit about the added complexity associated with keeping the
> > > > parser sync'd with the driver/fw.
> > >
> > > In practice it would be like iptables, the parser would be entirely
> > > programmed by userspace and there is nothing to keep in sync.
> >
> > You've mentioned a few times now that the firmware/request will vary
> > across not only devices, but firmware revisions too,
>
> I never said firmware revisions, part of the requirement is strong ABI
> compatibility in these packets.

That was my mistake; it was Leon.

Leon mentioned that different firmware revisions would have different
parameters for a given opcode, and that one would need to inspect
those parameters to properly filter the command.  Is that not true, or
am I misreading or misunderstanding Leon's comments?

https://lore.kernel.org/all/20260310175759.GD12611@unreal

> > this implies there will need to be some effort to keep whatever
> > parser you develop (BPF, userspace config, etc.) in sync with the
> > parser built into the firmware.  Or am I misunderstanding something?
>
> I would not use the word "sync". It is very similar to deep packet
> inspection, if you want to look inside, say, RPC messages carried
> within HTTP then you have to keep up to date. How onerous that is
> depends on what the admin's labeling goals are.

I'm not sure what to say here, that sounds like a synchronization task
to me, but if you have another term you prefer I'm happy to use that
instead.

> > > > However, even if we solve the parsing problem, I worry we have
> > > > another, closely related issue, of having to categorize all of the
> > > > past, present, and future firmware requests into a set of LSM specific
> > > > actions.
> > >
> > > Why? secmark doesn't have this issue? The classifier would return the
> > > same kind of information as secmark, some user provided label that is
> > > delivered to the LSM policy side.
> >
> > I think there is a misunderstanding in either how secmark works or how
> > the LSMs use secmark labels when enforcing their security policy.
> >
> > The secmark label is set on a packet to represent the network
> > properties of a packet.  While the rules governing how a packet's
> > secmark is set and the semantic meaning of that secmark label is going
> > to be LSM and solution specific,
>
> "network properties" are a bit vauge ...

That is one of the main reasons we moved from the old "compat_net"
solution to secmark so that we could leverage all of netfilter's
packet matching capabilities.  Once again, if the issue is simply a
matter of phrasing, please let me know what terminology you would
prefer.

> > secmark labels represent the properties of a packet and not the
> > operation, e.g.  send/receive/forward/etc., being requested at a
> > given access control point.
>
> Yes, still aligned.
>
> > The access control point itself represents the requested
> > operation.  This is possible because the number of networking
> > operations on a given packet is well defined and fairly limited; at a
> > high level the packet is either being sent from the node, received by
> > the node, or is passing through the node.
>
> I think we have the same split, fwctl send/receive analog is also very
> limited.

Sure, but I thought the goal was to enforce access controls on the
firmware requests based on the opcodes/parameters contained within the
firmware request blob/mailbox?  Or are you happy with a single
send/receive level of granularity?

> > As I understand the firmware controls being proposed here, encoded
> > within the firmware request blob is the operation being requested.
>
> I am not proposing that kind of interpretation, I want to stay in the
> secmark model.
>
> When the packet blob is sent into the kernel at the uAPI boundary
> (send_msg, send, write, FWCTL_CMD_RPC, etc) that is your access
> control point.
>
> Deep inspection on the packet blob determines the secmark.

... and this would be done by your BPF classifier, yes?

> LSM takes the secmark and determines if the access control point
> accept/rejects.

At this point I think it would be helpful to write out the
subject-access-object triple for an example operation and explain how
an LSM could obtain each component of the access request.

> > While we've discussed possible solutions on how to parse the request
> > blob to determine the operation, we haven't really discussed how to
> > represent the requested operation to the LSMs.
>
> I don't understand this? The secmark example I pulled up is this:
>
> iptables -t mangle -A INPUT -p tcp --dport 80 -j SECMARK --selctx system_u:object_r:httpd_packet_t:s0
>
> The "represent the requested operation" is the string
> "system_u:object_r:httpd_packet_t:s0", which is entirely admin
> defined, right?

No it isn't.  The string you've identified is the packet's secmark
label, one of two packet object labels in SELinux (we'll ignore the
other for our discussion).  Ignoring the management controls, the
"requested operation" in SELinux is going to be either send, receive,
forward_in, or forward_out.  If we look at some example
subject-op-object triples for secmark packets entering or leaving
the system, you might see the following:

 httpd_t RECV httpd_packet_t
 browser_t SEND httpd_packet_t

> > I'm assuming there isn't a well defined set of operations, and like
> > the request format itself, the set of valid operations will vary
> > from device and firmware revision.  I hope you can understand both
> > how this differs from secmark and that it is a challenge that really
> > hasn't been addressed in the proposals we've discussed.
>
> I still don't see the difference from iptables. IPSEC, SIP, DNS, HTTP,
> etc are all protocols with the same lack of any commonality.
>
> > At a very high level the access control decision for firmware/device
> > requests depends on whether the LSM wants to allow process A to do B
> > to device C.  The identity/credentials associated with process A are
> > easy to understand, we have plenty of examples both inside and outside
> > of the LSM on how to do that.  The device identity/attributes
> > associated with device C can be a bit trickier, but once again we have
> > plenty of examples to draw from, and we can even fall back to a
> > generic "kernel" id/attribute if the LSM chooses not to distinguish
> > entities below the userspace/kernel boundary.
>
> I think I would feed that into the classifier as well. Like iptables
> can have a netdev argument to only match against specific devices, we
> can have the same logical thing.
>
> > I think the hardest issue with the firmware request hooks is going
> > to be determining what operation is being requested, the "B",
> > portion of access request tuple.  If we can create a well defined
> > set of operations and leave it to the parser to characterize the
> > operation we've potentially got a solution, but if the operation is
> > truly going to be arbitrary then we have a real problem.  How do you
> > craft a meaningful access control policy when you don't understand
> > what is being requested?
>
> Same as for networking. Admin understands, admin defines, kernel is
> just a programmable classifier.

Are you able to define all of the firmware request operations at this
point in time?  That is my largest concern at this point, and perhaps
the answer is a simple "yes", but I haven't seen it yet.

-- 
paul-moore.com


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-15 21:40                       ` Paul Moore
@ 2026-04-17 19:17                         ` Jason Gunthorpe
  2026-04-21  0:58                           ` Paul Moore
  2026-04-23 14:09                           ` Leon Romanovsky
  2026-04-23 13:05                         ` Leon Romanovsky
  1 sibling, 2 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-17 19:17 UTC (permalink / raw)
  To: Paul Moore
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Wed, Apr 15, 2026 at 05:40:04PM -0400, Paul Moore wrote:
> > The NIC doesn't know anything more than the kernel to call the LSM
> > hook. It can't magically generate the label the admin wants to use any
> > better than the kernel can.
> 
> The NIC presumably knows how to parse the firmware request and extract
> whatever security relevant info is needed to pass to the kernel so the
> driver can make an access control request.

Not in practice, we'd have to agree on how to describe the "security relevant
info" and that won't happen..

> Leon mentioned that different firmware revisions would have different
> parameters for a given opcode, and that one would need to inspect
> those parameters to properly filter the command.  Is that not true, or
> am I misreading or misunderstanding Leon's comments?

They are ABI stable, so there will be rules about future changes that
old software can follow to ignore or reject future things it doesn't
understand.

> > > The access control point itself represents the requested
> > > operation.  This is possible because the number of networking
> > > operations on a given packet is well defined and fairly limited; at a
> > > high level the packet is either being sent from the node, received by
> > > the node, or is passing through the node.
> >
> > I think we have the same split, fwctl send/receive analog is also very
> > limited.
> 
> Sure, but I thought the goal was to enforce access controls on the
> firmware requests based on the opcodes/parameters contained within the
> firmware request blob/mailbox?  

Yes, that's the goal. It is the same as iptables being able to
identify that a send system call has a packet that is http or dns. I'd
like to have a fwctl RPC ioctl system call identify if the RPC packet
is A or B.

> > Deep inspection on the packet blob determines the secmark.
> 
> ... and this would be done by your BPF classifier, yes?

BPF would be one option. We could probably also meaningfully do a
fixed set of matching functions (i.e. pkt_data[X] == A then MATCH) more
like iptables does if that is somehow relevant to LSM.
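As a rough illustration of the fixed-matcher option, here is a minimal
sketch assuming a hypothetical rule format in which the admin supplies
(offset, value, label) entries and the first matching rule wins. The
names `byte_match_rule` and `classify_rpc` are invented for this sketch
and are not part of fwctl or any LSM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical admin-supplied rule: match when pkt[offset] == value. */
struct byte_match_rule {
	size_t offset;      /* byte position inside the RPC packet */
	uint8_t value;      /* expected byte at that position */
	const char *label;  /* admin-chosen label returned on match */
};

/*
 * Return the label of the first rule that matches the packet, or NULL
 * when nothing matches. What to do with an unlabeled packet is left to
 * the caller; this sketch does not pick a default.
 */
static const char *classify_rpc(const struct byte_match_rule *rules,
				size_t nrules,
				const uint8_t *pkt, size_t len)
{
	assert(rules != NULL || nrules == 0);

	for (size_t i = 0; i < nrules; i++) {
		/* Rules pointing past the end of the packet never match. */
		if (rules[i].offset < len &&
		    pkt[rules[i].offset] == rules[i].value)
			return rules[i].label;
	}
	return NULL;
}
```

The kernel side only compares bytes; which bytes matter and what the
labels mean stay entirely in the admin's rule set.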
 
> > LSM takes the secmark and determines if the access control point
> > accept/rejects.
> 
> At this point I think it would be helpful to write out the
> subject-access-object triple for an example operation and explain how
> an LSM could obtain each component of the access request.

I think I am talking about this:

app_1 FWCTL_RPC op_unpriv_t
app_2 FWCTL_RPC op_priv_t

- app_x broadly comes from the process executing the ioctl

- FWCTL_RPC identifies the IOCTL userspace called to send the RPC
  packet

- op_X_t is the result of the classifier inspecting the RPC
  packet. Admin tells the classifier to return op_X_t similar to
  how --selctx does for iptables.

For sketch purposes I've used the words priv/unpriv as something an
admin might want to setup. As I said above the actual buckets and
mapping would have to be decided by the local admin.
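
The triples above can be sketched as a simple policy lookup. This is a
minimal sketch only, assuming a hypothetical allow-list keyed on the
subject and the classifier's label, with the FWCTL_RPC access point
implied by the function itself. The default-deny behaviour and the
names `fwctl_allow_rule` and `fwctl_rpc_allowed` are assumptions made
for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical allow rule: subject may send RPC packets with this label. */
struct fwctl_allow_rule {
	const char *subject;  /* e.g. "app_1", from the calling process */
	const char *object;   /* e.g. "op_unpriv_t", from the classifier */
};

/*
 * Decide whether a subject may issue FWCTL_RPC on a packet carrying the
 * given label. Anything not explicitly listed is denied (an assumption
 * of this sketch, not something the thread settles).
 */
static bool fwctl_rpc_allowed(const struct fwctl_allow_rule *rules,
			      size_t nrules,
			      const char *subject, const char *object)
{
	assert(subject != NULL && object != NULL);

	for (size_t i = 0; i < nrules; i++) {
		if (strcmp(rules[i].subject, subject) == 0 &&
		    strcmp(rules[i].object, object) == 0)
			return true;
	}
	return false;
}
```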

> > Same as for networking. Admin understands, admin defines, kernel is
> > just a programmable classifier.
> 
> Are you able to define all of the firmware request operations at this
> point in time?  That is my largest concern at this point, and perhaps
> the answer is a simple "yes", but I haven't seen it yet.

We can identify all the IOCTL points where the RPC packet will be
delivered to the kernel (send/recv/etc).

We cannot pre-identify all the mlx_XXX_op_t's an admin might want to
use.

The same way secmark cannot pre-identify all the XXX_packet_t's.

Jason


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-17 19:17                         ` Jason Gunthorpe
@ 2026-04-21  0:58                           ` Paul Moore
  2026-04-24 14:36                             ` Jason Gunthorpe
  2026-04-23 14:09                           ` Leon Romanovsky
  1 sibling, 1 reply; 81+ messages in thread
From: Paul Moore @ 2026-04-21  0:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Fri, Apr 17, 2026 at 3:17 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Wed, Apr 15, 2026 at 05:40:04PM -0400, Paul Moore wrote:
> > > The NIC doesn't know anything more than the kernel to call the LSM
> > > hook. It can't magically generate the label the admin wants to use any
> > > better than the kernel can.
> >
> > The NIC presumably knows how to parse the firmware request and extract
> > whatever security relevant info is needed to pass to the kernel so the
> > driver can make an access control request.
>
> Not in practice, we'd have to agree on how to describe the "security relevant
> info" and that won't happen..

I think you're going to find that you need to describe the security
relevant info regardless of how you implement things, but we can leave
that discussion for below.

> > Leon mentioned that different firmware revisions would have different
> > parameters for a given opcode, and that one would need to inspect
> > those parameters to properly filter the command.  Is that not true, or
> > am I misreading or misunderstanding Leon's comments?
>
> They are ABI stable, so there will be rules about future changes that
> old software can follow to ignore or reject future things it doesn't
> understand.

Good, "ABI stable" means there is some hope :)  Based on the various
discussions I'm guessing both the ABI and any assigned numbers
are/will-be vendor specific?

> > > > The access control point itself represents the requested
> > > > operation.  This is possible because the number of networking
> > > > operations on a given packet is well defined and fairly limited; at a
> > > > high level the packet is either being sent from the node, received by
> > > > the node, or is passing through the node.
> > >
> > > I think we have the same split, fwctl send/receive analog is also very
> > > limited.
> >
> > Sure, but I thought the goal was to enforce access controls on the
> > firmware requests based on the opcodes/parameters contained within the
> > firmware request blob/mailbox?
>
> Yes, that's the goal. It is the same as iptables being able to
> identify that a send system call has a packet that is http or dns.

I think we still have a disconnect here.  A packet being a DNS or HTTP
packet is different from an opcode.  The opcode in the iptables isn't
"DNS" or "HTTP" it is "INPUT", "OUTPUT", or "FORWARD".

Most LSMs will want to know who is initiating the firmware request
(the subject), the requested operation/opcode (the action/verb), and
the target of the request (the object, which in this case is likely
the kernel or the device).

For most LSMs, I expect the subject to be the process making the fwctl call.

Similarly, the object will likely be either the kernel or the device itself.

As I understand things, the action/verb is going to be the opcode
within the firmware request.  If you believe I'm wrong about this
please help me understand why.

> > > LSM takes the secmark and determines if the access control point
> > > accept/rejects.
> >
> > At this point I think it would be helpful to write out the
> > subject-access-object triple for an example operation and explain how
> > an LSM could obtain each component of the access request.
>
> I think I am talking about this:
>
> app_1 FWCTL_RPC op_unpriv_t
> app_2 FWCTL_RPC op_priv_t
>
> - app_x broadly comes from the process executing the ioctl

Yep.  We're on the same page here.

> - FWCTL_RPC identifies the IOCTL userspace called to send the RPC
>   packet

Maybe.  That is an option.

> - op_X_t is the result of the classifier inspecting the RPC
>   packet. Admin tells the classifier to return op_X_t similar to
>   how --selctx does for iptables.

I've tried to explain how this doesn't match with secmark, but I'm
evidently doing a poor job.  If you want to continue with the secmark
comparisons it might be helpful to spend some time configuring secmark
on a SELinux system, and writing policy for it, to see how it works.

Beyond that, I think you will find that most LSMs - although not all -
define their security policy via an abstract subject-action-object.
The policy either allows or rejects a subject's ability to perform a
certain action on an object.  My concern with your example is that the
object isn't what is actually being acted upon, it's the requested
action.  The fwctl ioctl/API allows a user to act on a device, with
the actual action being specified by the fwctl payload.  From what I
can see, the classifier's output is the action, not the object.
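
To make the subject-action-object view concrete, here is a minimal
sketch assuming a hypothetical, fixed set of actions and made-up opcode
values. No real device defines these numbers, and the names
`fw_action`, `fw_access_request`, `fw_action_from_opcode`, and
`fw_build_request` are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical well-defined set of actions a policy could be written against. */
enum fw_action {
	FW_ACT_UNKNOWN,
	FW_ACT_QUERY,
	FW_ACT_CONFIGURE,
};

/* The access request as a subject-action-object triple. */
struct fw_access_request {
	const char *subject;    /* process making the fwctl call */
	enum fw_action action;  /* derived from the request payload */
	const char *object;     /* the device (or the kernel) acted upon */
};

/* Map a made-up opcode to an action; unrecognized opcodes stay UNKNOWN. */
static enum fw_action fw_action_from_opcode(uint16_t opcode)
{
	switch (opcode) {
	case 0x0100:
		return FW_ACT_QUERY;
	case 0x0200:
		return FW_ACT_CONFIGURE;
	default:
		return FW_ACT_UNKNOWN;
	}
}

static struct fw_access_request fw_build_request(const char *subject,
						 const char *device,
						 uint16_t opcode)
{
	assert(subject != NULL && device != NULL);

	struct fw_access_request req = {
		.subject = subject,
		.action = fw_action_from_opcode(opcode),
		.object = device,
	};
	return req;
}
```

In this framing the parser's output lands in the action slot and the
device remains the object, which is the distinction being drawn here.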

> > > Same as for networking. Admin understands, admin defines, kernel is
> > > just a programmable classifier.
> >
> > Are you able to define all of the firmware request operations at this
> > point in time?  That is my largest concern at this point, and perhaps
> > the answer is a simple "yes", but I haven't seen it yet.
>
> We can identify all the IOCTL points where the RPC packet will be
> delivered to the kernel (send/recv/etc)
>
> We cannot pre-identify all the mlx_XXX_op_t's an admin might want to
> use.
>
> The same way secmark cannot pre-identify all the XXX_packet_t's.

Once again, I think there is a disconnect or misunderstanding: on a
SELinux system using secmark, all of the packet types, e.g.
"XXX_packet_t's", *are* pre-defined in the SELinux policy.

-- 
paul-moore.com


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-21  0:58                           ` Paul Moore
@ 2026-04-24 14:36                             ` Jason Gunthorpe
  2026-04-24 20:59                               ` Paul Moore
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-24 14:36 UTC (permalink / raw)
  To: Paul Moore
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Mon, Apr 20, 2026 at 08:58:09PM -0400, Paul Moore wrote:
> > > > > The access control point itself represents the requested
> > > > > operation.  This is possible because the number of networking
> > > > > operations on a given packet is well defined and fairly limited; at a
> > > > > high level the packet is either being sent from the node, received by
> > > > > the node, or is passing through the node.
> > > >
> > > > I think we have the same split, fwctl send/receive analog is also very
> > > > limited.
> > >
> > > Sure, but I thought the goal was to enforce access controls on the
> > > firmware requests based on the opcodes/parameters contained within the
> > > firmware request blob/mailbox?
> >
> > Yes, that's the goal. It is the same as iptables being able to
> > identify that a send system call has a packet that is http or dns.
> 
> I think we still have a disconnect here.  A packet being a DNS or HTTP
> packet is different from an opcode.  The opcode in the iptables isn't
> "DNS" or "HTTP" it is "INPUT", "OUTPUT", or "FORWARD".

I understand that

> Most LSMs will want to know who is initiating the firmware request
> (the subject), the requested operation/opcode (the action/verb), and
> the target of the request (the object, which in this case is likely
> the kernel or the device).

How is
  system_u:object_r:httpd_packet_t:s0

A kernel or device? It is a label for packet contents. I also want a
label for packet contents.

> As I understand things, the action/verb is going to be the opcode
> within the firmware request.  If you believe I'm wrong about this
> please help me understand why.

You could make that choice, I'm arguing we should not, and it should
be in the object side.

> > - op_X_t is the result of the classifier inspecting the RPC
> >   packet. Admin tells the classifier to return op_X_t similar to
> >   how --selctx does for iptables.
> 
> I've tried to explain how this doesn't match with secmark, but I'm
> evidently doing a poor job.  

Yeah, I don't get it at all, sorry. I feel you are making some very
nuanced distinction where HTTP is an object but the HTTP equivalent
in fwctl is not an object; I can't follow it at all.

By that logic:

   iptables -p 80 --string "GET"

Is an action, and it should get a unique action in the tuple.

> If you want to continue with the secmark comparisons it might be
> helpful to spend some time configuring secmark on a SELinux system,
> and writing policy for it, to see how it works.

I think I have a pretty good idea, you haven't said anything that
contradicts what I expect..

> certain action on an object.  My concern with your example is that the
> object isn't what is actually being acted upon, it's the requested
> action.

Object is a label for the packet contents.

> The fwctl ioctl/API allows a user to act on a device, with the
> actual action being specified by the fwctl payload.  From what I can
> see, the classifier's output is the action, not the object.

You can take that view, it is certainly one valid way to look at it.

But it is completely impractical.

I am arguing for the secmark like view where the content of the packet
decides the object label.

> > The same way secmark cannot pre-identify all the XXX_packet_t's.
> 
> Once again, I think there is a disconnect or misunderstanding, on a
> SELinux system using secmark all of the packet types, e.g.
> "XXX_packet_t's", *are* pre-defined in the SELinux policy.

"Pre-defined" in text files in user space controlled by the admin.

Admin can "pre-define" their fwctl ones too, what is the issue?

AFAICT the only debate here is you want to make fwctl's packet content
an action while allowing iptables' packet content to be an object.

Jason


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-24 14:36                             ` Jason Gunthorpe
@ 2026-04-24 20:59                               ` Paul Moore
  2026-04-24 22:13                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Paul Moore @ 2026-04-24 20:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Fri, Apr 24, 2026 at 10:36 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Mon, Apr 20, 2026 at 08:58:09PM -0400, Paul Moore wrote:
> > > > > > The access control point itself represents the requested
> > > > > > operation.  This is possible because the number of networking
> > > > > > operations on a given packet is well defined and fairly limited; at a
> > > > > > high level the packet is either being sent from the node, received by
> > > > > > the node, or is passing through the node.
> > > > >
> > > > > I think we have the same split, fwctl send/receive analog is also very
> > > > > limited.
> > > >
> > > > Sure, but I thought the goal was to enforce access controls on the
> > > > firmware requests based on the opcodes/parameters contained within the
> > > > firmware request blob/mailbox?
> > >
> > > Yes, that's the goal. It is the same as iptables being able to
> > > identify that a send system call has a packet that is http or dns.
> >
> > I think we still have a disconnect here.  A packet being a DNS or HTTP
> > packet is different from an opcode.  The opcode in the iptables isn't
> > "DNS" or "HTTP" it is "INPUT", "OUTPUT", or "FORWARD".
>
> I understand that
>
> > Most LSMs will want to know who is initiating the firmware request
> > (the subject), the requested operation/opcode (the action/verb), and
> > the target of the request (the object, which in this case is likely
> > the kernel or the device).
>
> How is
>   system_u:object_r:httpd_packet_t:s0
>
> A kernel or device?

It's not.  It's one of two labels on a packet.  I've cautioned you
about leaning too heavily on the secmark comparison as it falls apart
in a number of places, this is one of those places.

> It is a label for packet contents. I also want a label for packet contents.

According to your explanations, my understanding is that you want a
fwctl RPC operation.  That is not the same as the secmark label
assigned by an iptables/nftables rule.

> > As I understand things, the action/verb is going to be the opcode
> > within the firmware request.  If you believe I'm wrong about this
> > please help me understand why.
>
> You could make that choice, I'm arguing we should not, and it should
> be in the object side.

Okay, you believe I'm wrong, that's fine, but you need to provide a
(better) explanation for why I'm wrong and your approach is The Right
Way.  Present your case, but please do it without referencing secmark
as that comparison is horribly broken at this point in the discussion.

> > > - op_X_t is the result of the classifier inspecting the RPC
> > >   packet. Admin tells the classifier to return op_X_t similar to
> > >   how --selctx does for iptables.
> >
> > I've tried to explain how this doesn't match with secmark, but I'm
> > evidently doing a poor job.
>
> Yeah, I don't get it at all, sorry. I feel you are making some very
> nuanced distinction with HTTP being an object but the HTTP-equivalent
> in fwctl is not an object, I can't follow it at all.
>
> By that logic:
>
>    iptables -p 80 --string "GET"
>
> Is an action, and it should get a unique action in the tuple.

Let's both do ourselves a favor and drop the secmark comparisons; I
think it is only hurting things at this point.  If we stick with the
secmark analogy I worry we are going to keep repeating the same things
to each other without making any forward progress.

> > If you want to continue with the secmark comparisons it might be
> > helpful to spend some time configuring secmark on a SELinux system,
> > and writing policy for it, to see how it works.
>
> I think I have a pretty good idea, you haven't said anything that
> contradicts what I expect..

Frankly, several comments, including in your last reply, indicate you
don't really grasp secmark, subject/verb/object, SELinux, or some
combination thereof ... and that's okay, you don't really need to
understand those details.  Let's move past the failed secmark analogy
and return to the fwctl hooks, that's the ultimate goal.

> > certain action on an object.  My concern with your example is that the
> > object isn't what is actually being acted upon, it's the requested
> > action.
>
> Object is a label for the packet contents.
>
> > The fwctl ioctl/API allows a user to act on a device, with the
> > actual action being specified by the fwctl payload.  From what I can
> > see, the classifier's output is the action, not the object.
>
> You can take that view, it is certainly one valid way to look at it.
>
> But it is completely impractical.

Elaborate on that, because from what I can tell that is the valid way
to look at it from a subject/verb/object perspective.

> > > The same way secmark cannot pre-identify all the XXX_packet_t's.
> >
> > Once again, I think there is a disconnect or misunderstanding, on a
> > SELinux system using secmark all of the packet types, e.g.
> > "XXX_packet_t's", *are* pre-defined in the SELinux policy.
>
> "Pre-defined" in text files in user space controlled by the admin.

That's not correct.  It's kinda like saying the NIC driver sources are
simply "text files in user space controlled by the admin".  The
SELinux secmark labels are defined in the SELinux policy sources which
must be compiled and loaded into the kernel before they are valid on a
running system.  Policy must be written not only to define the secmark
labels, but also to define the access control rules which govern how
those packets are handled by the system.  The iptables/nftables
command lines simply assign a secmark label to a packet; that's
important, but only a small part of the total equation.

-- 
paul-moore.com


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-24 20:59                               ` Paul Moore
@ 2026-04-24 22:13                                 ` Jason Gunthorpe
  0 siblings, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-24 22:13 UTC (permalink / raw)
  To: Paul Moore
  Cc: Leon Romanovsky, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Fri, Apr 24, 2026 at 04:59:30PM -0400, Paul Moore wrote:
> >
> > > Most LSMs will want to know who is initiating the firmware request
> > > (the subject), the requested operation/opcode (the action/verb), and
> > > the target of the request (the object, which in this case is likely
> > > the kernel or the device).
> >
> > How is
> >   system_u:object_r:httpd_packet_t:s0
> >
> > A kernel or device?
> 
> It's not.  It's one of two labels on a packet.  I've cautioned you
> about leaning too heavily on the secmark comparison as it falls apart
> in a number of places, this is one of those places.

But I want to label a packet too. You keep going back to it not being
the same thing, and I keep repeating that all I want to do is put
labels on FWCTL packets :(

> > It is a label for packet contents. I also want a label for packet contents.
> 
> According to your explanations, my understanding is that you want a
> fwctl RPC operation.  That is not the same as the secmark label
> assigned by an iptables/nftables rule.

I view fwctl as an opaque packet-based messaging subsystem. It
communicates a packet to a remote CPU and returns a response packet
back to userspace.

Trying to have the kernel assign fixed meaning to the content of the
packets inside the kernel is contrary to the entire design of fwctl.

It is like demanding the netstack parse HTTP packets as a precondition
to using LSM. It makes no sense.

Any LSM integration requires a labeling system that is not hardwired
into the built kernel. I don't much care what it is, so long as the
classification and label space are defined by userspace.

You say it is not like secmark, fine, but I see a perfect mirror in
secmark...

> > You can take that view, it is certainly one valid way to look at it.
> >
> > But it is completely impractical.
> 
> Elaborate on that, because from what I can tell that is the valid way
> to look at it from a subject/verb/object perspective.

We cannot have the kernel predefine verb labels.

I'm completely fine with using verbs if they can be dynamic and userspace
can tell the kernel what the verb labels are.

This is the only reason I pointed at secmark, it shows a system that
has both a user-controlled classifier and dynamic labels that are not
fixed into the built kernel, i.e. it is flexible.

> > > > The same way secmark cannot pre-identify all the XXX_packet_t's.
> > >
> > > Once again, I think there is a disconnect or misunderstanding, on a
> > > SELinux system using secmark all of the packet types, e.g.
> > > "XXX_packet_t's", *are* pre-defined in the SELinux policy.
> >
> > "Pre-defined" in text files in user space controlled by the admin.
>
> That's not correct.  It's kinda like saying the NIC driver sources are
> simply "text files in user space controlled by the admin".  

That's very pedantic. I mean to the point I wonder if we are even
speaking the same language.

I said the labels are defined by userspace, you said no, and then
explained that they are defined by userspace going through a bunch of
steps:

> The SELinux secmark labels are defined in the SELinux policy sources
> which must be compiled and loaded into the kernel before they are
> valid on a running system. Policy must be written not only to define
> the secmark labels, but also to define the access control rules
> which govern how those packets are handled by the system.  The
> iptables/nftables command lines simply assign a secmark label to a
> packet; that's important, but only a small part of the total
> equation.

I understand all of this, I am totally fine with it. A package will
install, a distribution will provide, or an admin will write these
things, and do all the steps to load them into the kernel. I don't see
any issue with that.

Hardwiring things into the built kernel is a problem that must be
avoided because end users only run the kernel provided by the
distribution. "Recompiling the driver" is not an option that is
available.

Jason


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-17 19:17                         ` Jason Gunthorpe
  2026-04-21  0:58                           ` Paul Moore
@ 2026-04-23 14:09                           ` Leon Romanovsky
  2026-04-24 14:19                             ` Jason Gunthorpe
  1 sibling, 1 reply; 81+ messages in thread
From: Leon Romanovsky @ 2026-04-23 14:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Paul Moore, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Fri, Apr 17, 2026 at 04:17:49PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 15, 2026 at 05:40:04PM -0400, Paul Moore wrote:

<...>

> > Leon mentioned that different firmware revisions would have different
> > parameters for a given opcode, and that one would need to inspect
> > those parameters to properly filter the command.  Is that not true, or
> > am I misreading or misunderstanding Leon's comments?
> 
> They are ABI stable, so there will be rules about future changes that
> old software can follow to ignore or reject future things it doesn't
> understand.

It is wishful thinking and applicable only to mlx5 devices. No one
promises that other devices follow the same ABI rules.

Thanks


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-23 14:09                           ` Leon Romanovsky
@ 2026-04-24 14:19                             ` Jason Gunthorpe
  2026-04-26 10:39                               ` Leon Romanovsky
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2026-04-24 14:19 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Paul Moore, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Thu, Apr 23, 2026 at 05:09:50PM +0300, Leon Romanovsky wrote:

> > > Leon mentioned that different firmware revisions would have different
> > > parameters for a given opcode, and that one would need to inspect
> > > those parameters to properly filter the command.  Is that not true, or
> > > am I misreading or misunderstanding Leon's comments?
> > 
> > They are ABI stable, so there will be rules about future changes that
> > old software can follow to ignore or reject future things it doesn't
> > understand.
> 
> It is wishful thinking and applicable only to mlx5 devices. No one
> promises that other devices follow the same ABI rules.

Well, I will definitely kick them out of fwctl if they don't.

Jason


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-24 14:19                             ` Jason Gunthorpe
@ 2026-04-26 10:39                               ` Leon Romanovsky
  0 siblings, 0 replies; 81+ messages in thread
From: Leon Romanovsky @ 2026-04-26 10:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Paul Moore, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Fri, Apr 24, 2026 at 11:19:21AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 23, 2026 at 05:09:50PM +0300, Leon Romanovsky wrote:
> 
> > > > Leon mentioned that different firmware revisions would have different
> > > > parameters for a given opcode, and that one would need to inspect
> > > > those parameters to properly filter the command.  Is that not true, or
> > > > am I misreading or misunderstanding Leon's comments?
> > > 
> > > They are ABI stable, so there will be rules about future changes that
> > > old software can follow to ignore or reject future things it doesn't
> > > understand.
> > 
> > It is wishful thinking and applicable only to mlx5 devices. No one
> > promises that other devices follow the same ABI rules.
> 
> Well, I will definitely kick them out of fwctl if they don't.

That is easy to say but harder to follow through on. The kernel includes
many devices that exist only in specific hyperscale environments, where the
update cycle is tightly controlled. They can easily break FW backward
compatibility.

Thanks

> 
> Jason


* Re: [PATCH v2 0/4] Firmware LSM hook
  2026-04-15 21:40                       ` Paul Moore
  2026-04-17 19:17                         ` Jason Gunthorpe
@ 2026-04-23 13:05                         ` Leon Romanovsky
  1 sibling, 0 replies; 81+ messages in thread
From: Leon Romanovsky @ 2026-04-23 13:05 UTC (permalink / raw)
  To: Paul Moore
  Cc: Jason Gunthorpe, Roberto Sassu, KP Singh, Matt Bobrowski,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Saeed Mahameed, Itay Avraham, Dave Jiang, Jonathan Cameron, bpf,
	linux-kernel, linux-kselftest, linux-rdma, Chiara Meiohas,
	Maher Sanalla, linux-security-module

On Wed, Apr 15, 2026 at 05:40:04PM -0400, Paul Moore wrote:
> On Wed, Apr 15, 2026 at 9:47 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Tue, Apr 14, 2026 at 04:27:58PM -0400, Paul Moore wrote:
> > > On Mon, Apr 13, 2026 at 7:19 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > On Mon, Apr 13, 2026 at 06:36:06PM -0400, Paul Moore wrote:
> > > > > On Mon, Apr 13, 2026 at 12:42 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > > > On Sun, Apr 12, 2026 at 09:38:35PM -0400, Paul Moore wrote:
> 
> ...

<...>

> > > > > so that only the firmware would need to parse the request.  If we
> > > > > wanted to adopt a secmark-esque approach, one could develop a second
> > > > > parsing mechanism that would be responsible for assigning a LSM label
> > > > > to the request, and then pass the firmware request to the LSM, but I
> > > > > do worry a bit about the added complexity associated with keeping the
> > > > > parser sync'd with the driver/fw.
> > > >
> > > > In practice it would be like iptables, the parser would be entirely
> > > > programmed by userspace and there is nothing to keep in sync.
> > >
> > > You've mentioned a few times now that the firmware/request will vary
> > > across not only devices, but firmware revisions too,
> >
> > I never said firmware revisions, part of the requirement is strong ABI
> > > compatibility in these packets.
> 
> That was my mistake; it was Leon.
> 
> Leon mentioned that different firmware revisions would have different
> parameters for a given opcode, and that one would need to inspect
> those parameters to properly filter the command.  Is that not true, or
> am I misreading or misunderstanding Leon's comments?
> 
> https://lore.kernel.org/all/20260310175759.GD12611@unreal

Right, I said that. The mlx5–FW interface is stable, but that does not
mean it can never change. The contract is that any upstream driver
release must continue to operate correctly with released firmware.

To support this, there are cases where the driver and firmware
negotiate during device initialization to determine whether a given
feature is supported and whether specific mailbox fields are valid.

Thanks


end of thread, other threads:[~2026-04-26 16:42 UTC | newest]

Thread overview: 81+ messages
2026-04-25  6:04 [PATCH v2 0/2] RDMA/rxe: Fix per-netns UDP tunnel issues Kuniyuki Iwashima
2026-04-25  6:04 ` [PATCH v2 1/2] RDMA/rxe: Fix null-ptr-deref in kernel_sock_shutdown() Kuniyuki Iwashima
2026-04-25 15:47   ` David Ahern
2026-04-25 20:55     ` Kuniyuki Iwashima
2026-04-26 16:40       ` David Ahern
2026-04-25 21:25   ` Zhu Yanjun
2026-04-26 16:42     ` David Ahern
2026-04-25  6:04 ` [PATCH v2 2/2] RDMA/rxe: Fix up RCU usage for rxe_ns_pernet_sk6() Kuniyuki Iwashima
2026-04-25 21:26   ` Zhu Yanjun
  -- strict thread matches above, loose matches on Subject: below --
2026-04-11 14:49 [PATCH rdma-next v2 00/15] RDMA: Introduce generic buffer descriptor infrastructure for umem Jiri Pirko
2026-04-11 14:49 ` [PATCH rdma-next v2 01/15] RDMA/core: " Jiri Pirko
2026-04-12 12:33   ` Michael Margolin
2026-04-13  8:32     ` Jiri Pirko
2026-04-13 16:02       ` Michael Margolin
2026-04-13 18:22         ` Jiri Pirko
2026-04-16 12:10           ` Michael Margolin
2026-04-16 13:34             ` Jiri Pirko
2026-04-21 12:50               ` Jason Gunthorpe
2026-04-21 12:52             ` Jason Gunthorpe
2026-04-22 10:32               ` Jiri Pirko
2026-04-22 16:30                 ` Jason Gunthorpe
2026-04-21 13:46   ` Jason Gunthorpe
2026-04-22 11:33     ` Jiri Pirko
2026-04-22 14:06       ` Jiri Pirko
2026-04-22 16:51         ` Jason Gunthorpe
2026-04-23 13:08           ` Jiri Pirko
2026-04-23 15:08             ` Jason Gunthorpe
2026-04-11 14:49 ` [PATCH rdma-next v2 02/15] RDMA/uverbs: Push out CQ buffer umem processing into a helper Jiri Pirko
2026-04-21 13:25   ` Jason Gunthorpe
2026-04-22 10:56     ` Jiri Pirko
2026-04-22 16:32       ` Jason Gunthorpe
2026-04-11 14:49 ` [PATCH rdma-next v2 03/15] RDMA/uverbs: Integrate umem_list into CQ creation Jiri Pirko
2026-04-11 14:49 ` [PATCH rdma-next v2 04/15] RDMA/efa: Use umem_list for user CQ buffer Jiri Pirko
2026-04-11 14:49 ` [PATCH rdma-next v2 05/15] RDMA/mlx5: " Jiri Pirko
2026-04-11 14:49 ` [PATCH rdma-next v2 06/15] RDMA/bnxt_re: " Jiri Pirko
2026-04-11 14:49 ` [PATCH rdma-next v2 07/15] RDMA/mlx4: " Jiri Pirko
2026-04-11 14:49 ` [PATCH rdma-next v2 08/15] RDMA/uverbs: Remove legacy umem field from struct ib_cq Jiri Pirko
2026-04-11 14:49 ` [PATCH rdma-next v2 09/15] RDMA/uverbs: Verify all umem_list buffers are consumed after CQ creation Jiri Pirko
2026-04-11 14:49 ` [PATCH rdma-next v2 10/15] RDMA/uverbs: Integrate umem_list into QP creation Jiri Pirko
2026-04-11 14:49 ` [PATCH rdma-next v2 11/15] RDMA/mlx5: Use umem_list for QP buffers in create_qp Jiri Pirko
2026-04-11 14:49 ` [PATCH rdma-next v2 12/15] RDMA/uverbs: Add doorbell record buffer slot to CQ umem_list Jiri Pirko
2026-04-11 14:49 ` [PATCH rdma-next v2 13/15] RDMA/mlx5: Use umem_list for CQ doorbell record Jiri Pirko
2026-04-11 14:49 ` [PATCH rdma-next v2 14/15] RDMA/uverbs: Add doorbell record buffer slot to QP umem_list Jiri Pirko
2026-04-11 14:49 ` [PATCH rdma-next v2 15/15] RDMA/mlx5: Use umem_list for QP doorbell record Jiri Pirko
2026-03-31  5:56 [PATCH v2 0/4] Firmware LSM hook Leon Romanovsky
2026-03-31  5:56 ` [PATCH v2 1/4] bpf: add firmware command validation hook Leon Romanovsky
2026-04-16  8:43   ` Matt Bobrowski
2026-03-31  5:56 ` [PATCH v2 2/4] selftests/bpf: add test cases for fw_validate_cmd hook Leon Romanovsky
2026-03-31  5:56 ` [PATCH v2 3/4] RDMA/mlx5: Externally validate FW commands supplied in DEVX interface Leon Romanovsky
2026-03-31  5:56 ` [PATCH v2 4/4] fwctl/mlx5: Externally validate FW commands supplied in fwctl Leon Romanovsky
2026-04-09 12:12 ` [PATCH v2 0/4] Firmware LSM hook Leon Romanovsky
2026-04-09 12:27   ` Roberto Sassu
2026-04-09 12:45     ` Leon Romanovsky
2026-04-09 21:04       ` Paul Moore
2026-04-12  9:00         ` Leon Romanovsky
2026-04-13  1:38           ` Paul Moore
2026-04-13 15:53             ` Leon Romanovsky
2026-04-13 16:42             ` Jason Gunthorpe
2026-04-13 17:36               ` Casey Schaufler
2026-04-13 19:09                 ` Casey Schaufler
2026-04-13 22:36               ` Paul Moore
2026-04-13 23:19                 ` Jason Gunthorpe
2026-04-14 17:05                   ` Casey Schaufler
2026-04-14 19:09                     ` Paul Moore
2026-04-14 20:09                       ` Casey Schaufler
2026-04-14 20:44                         ` Paul Moore
2026-04-14 22:42                           ` Casey Schaufler
2026-04-15 21:03                             ` Paul Moore
2026-04-15 21:21                               ` Casey Schaufler
2026-04-14 20:27                   ` Paul Moore
2026-04-15 13:47                     ` Jason Gunthorpe
2026-04-15 21:40                       ` Paul Moore
2026-04-17 19:17                         ` Jason Gunthorpe
2026-04-21  0:58                           ` Paul Moore
2026-04-24 14:36                             ` Jason Gunthorpe
2026-04-24 20:59                               ` Paul Moore
2026-04-24 22:13                                 ` Jason Gunthorpe
2026-04-23 14:09                           ` Leon Romanovsky
2026-04-24 14:19                             ` Jason Gunthorpe
2026-04-26 10:39                               ` Leon Romanovsky
2026-04-23 13:05                         ` Leon Romanovsky
