public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed
* [rdma] "rdma link del" operation hangs at wait_for_completion() when a file descriptor is in use.
@ 2025-12-04  8:26 Tetsuo Handa
  2026-02-28  6:07 ` Tetsuo Handa
  0 siblings, 1 reply; 6+ messages in thread
From: Tetsuo Handa @ 2025-12-04  8:26 UTC (permalink / raw)
  To: OFED mailing list

[-- Attachment #1: Type: text/plain, Size: 6352 bytes --]

I found that running the attached example program triggers a khungtaskd message. What is wrong?



INFO: task rdma:1387 blocked for more than 122 seconds.
      Not tainted 6.18.0 #231
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:rdma            state:D stack:0     pid:1387  tgid:1387  ppid:1347   task_flags:0x400100 flags:0x00080001
Call Trace:
 <TASK>
 __schedule+0x369/0x8a0
 schedule+0x3a/0xe0
 schedule_timeout+0xca/0x110
 wait_for_completion+0x8a/0x140
 ib_uverbs_remove_one+0x1b0/0x210 [ib_uverbs]
 remove_client_context+0x8d/0xd0 [ib_core]
 disable_device+0x8b/0x170 [ib_core]
 __ib_unregister_device+0x110/0x180 [ib_core]
 ib_unregister_device_and_put+0x37/0x50 [ib_core]
 nldev_dellink+0xa4/0x100 [ib_core]
 rdma_nl_rcv_msg+0x12f/0x2f0 [ib_core]
 ? __lock_acquire+0x55d/0xbf0
 rdma_nl_rcv_skb.constprop.0.isra.0+0xb2/0x100 [ib_core]
 netlink_unicast+0x203/0x2e0
 netlink_sendmsg+0x1f8/0x420
 __sys_sendto+0x1e1/0x1f0
 __x64_sys_sendto+0x24/0x30
 do_syscall_64+0x94/0x320
 ? _copy_to_user+0x22/0x70
 ? move_addr_to_user+0xd6/0x120
 ? __sys_getsockname+0x9a/0xf0
 ? do_syscall_64+0x137/0x320
 ? do_sock_setsockopt+0x85/0x160
 ? do_sock_setsockopt+0x85/0x160
 ? __sys_setsockopt+0x7b/0xc0
 ? do_syscall_64+0x137/0x320
 ? do_syscall_64+0x137/0x320
 ? do_syscall_64+0x137/0x320
 ? lockdep_hardirqs_on_prepare.part.0+0x9b/0x150
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f877077b77e
RSP: 002b:00007ffda335da70 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 000056038051c3c0 RCX: 00007f877077b77e
RDX: 0000000000000018 RSI: 000056038051b2a0 RDI: 0000000000000004
RBP: 00007ffda335da80 R08: 00007f877090f9a0 R09: 000000000000000c
R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffda335dce0
R13: 00007ffda335dd38 R14: 00007ffda335dce0 R15: 0000000069314344
 </TASK>

Showing all locks held in the system:
4 locks held by kworker/u512:0/11:
 #0: ffff8c82003d3148 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work+0x509/0x590
 #1: ffffcea98008fe38 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work+0x1e2/0x590
 #2: ffffffff9807a310 (pernet_ops_rwsem){++++}-{4:4}, at: cleanup_net+0x51/0x390
 #3: ffff8c8224164700 (&device->unregistration_lock){+.+.}-{4:4}, at: rdma_dev_change_netns+0x28/0x120 [ib_core]
1 lock held by khungtaskd/99:
 #0: ffffffff97d9f4e0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire.constprop.0+0x7/0x30
2 locks held by kworker/10:1/127:
1 lock held by systemd-journal/662:
2 locks held by rdma/1387:
 #0: ffffffffc0b88c18 (&rdma_nl_types[idx].sem){.+.+}-{4:4}, at: rdma_nl_rcv_msg+0x9e/0x2f0 [ib_core]
 #1: ffff8c8224164700 (&device->unregistration_lock){+.+.}-{4:4}, at: __ib_unregister_device+0xe4/0x180 [ib_core]

=============================================

INFO: task kworker/u512:0:11 blocked for more than 122 seconds.
      Not tainted 6.18.0 #231
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u512:0  state:D stack:0     pid:11    tgid:11    ppid:2      task_flags:0x4208060 flags:0x00080000
Workqueue: netns cleanup_net
Call Trace:
 <TASK>
 __schedule+0x369/0x8a0
 schedule+0x3a/0xe0
 schedule_preempt_disabled+0x15/0x30
 __mutex_lock+0x568/0x1170
 ? rdma_dev_change_netns+0x28/0x120 [ib_core]
 ? rdma_dev_change_netns+0x28/0x120 [ib_core]
 rdma_dev_change_netns+0x28/0x120 [ib_core]
 rdma_dev_exit_net+0x1a4/0x320 [ib_core]
 ops_undo_list+0xea/0x3b0
 cleanup_net+0x20b/0x390
 process_one_work+0x223/0x590
 worker_thread+0x1cb/0x3a0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xff/0x240
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x182/0x1e0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
INFO: task kworker/u512:0:11 is blocked on a mutex likely owned by task rdma:1387.
INFO: task rdma:1387 blocked for more than 245 seconds.
      Not tainted 6.18.0 #231
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:rdma            state:D stack:0     pid:1387  tgid:1387  ppid:1347   task_flags:0x400100 flags:0x00080001
Call Trace:
 <TASK>
 __schedule+0x369/0x8a0
 schedule+0x3a/0xe0
 schedule_timeout+0xca/0x110
 wait_for_completion+0x8a/0x140
 ib_uverbs_remove_one+0x1b0/0x210 [ib_uverbs]
 remove_client_context+0x8d/0xd0 [ib_core]
 disable_device+0x8b/0x170 [ib_core]
 __ib_unregister_device+0x110/0x180 [ib_core]
 ib_unregister_device_and_put+0x37/0x50 [ib_core]
 nldev_dellink+0xa4/0x100 [ib_core]
 rdma_nl_rcv_msg+0x12f/0x2f0 [ib_core]
 ? __lock_acquire+0x55d/0xbf0
 rdma_nl_rcv_skb.constprop.0.isra.0+0xb2/0x100 [ib_core]
 netlink_unicast+0x203/0x2e0
 netlink_sendmsg+0x1f8/0x420
 __sys_sendto+0x1e1/0x1f0
 __x64_sys_sendto+0x24/0x30
 do_syscall_64+0x94/0x320
 ? _copy_to_user+0x22/0x70
 ? move_addr_to_user+0xd6/0x120
 ? __sys_getsockname+0x9a/0xf0
 ? do_syscall_64+0x137/0x320
 ? do_sock_setsockopt+0x85/0x160
 ? do_sock_setsockopt+0x85/0x160
 ? __sys_setsockopt+0x7b/0xc0
 ? do_syscall_64+0x137/0x320
 ? do_syscall_64+0x137/0x320
 ? do_syscall_64+0x137/0x320
 ? lockdep_hardirqs_on_prepare.part.0+0x9b/0x150
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f877077b77e
RSP: 002b:00007ffda335da70 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 000056038051c3c0 RCX: 00007f877077b77e
RDX: 0000000000000018 RSI: 000056038051b2a0 RDI: 0000000000000004
RBP: 00007ffda335da80 R08: 00007f877090f9a0 R09: 000000000000000c
R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffda335dce0
R13: 00007ffda335dd38 R14: 00007ffda335dce0 R15: 0000000069314344
 </TASK>

Showing all locks held in the system:
4 locks held by kworker/u512:0/11:
 #0: ffff8c82003d3148 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work+0x509/0x590
 #1: ffffcea98008fe38 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work+0x1e2/0x590
 #2: ffffffff9807a310 (pernet_ops_rwsem){++++}-{4:4}, at: cleanup_net+0x51/0x390
 #3: ffff8c8224164700 (&device->unregistration_lock){+.+.}-{4:4}, at: rdma_dev_change_netns+0x28/0x120 [ib_core]
1 lock held by khungtaskd/99:
 #0: ffffffff97d9f4e0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire.constprop.0+0x7/0x30
2 locks held by rdma/1387:
 #0: ffffffffc0b88c18 (&rdma_nl_types[idx].sem){.+.+}-{4:4}, at: rdma_nl_rcv_msg+0x9e/0x2f0 [ib_core]
 #1: ffff8c8224164700 (&device->unregistration_lock){+.+.}-{4:4}, at: __ib_unregister_device+0xe4/0x180 [ib_core]

=============================================

[-- Attachment #2: rdma_example.c --]
[-- Type: text/plain, Size: 5958 bytes --]

// gcc -Wall -O2 rdma_example.c -lrdmacm -libverbs
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <getopt.h>
#include <stdlib.h>
#include <unistd.h>
#include <rdma/rdma_cma.h>

int main(int argc, char *argv[])
{
	const char *remote_addr = "10.0.0.1";
	const int port = 21234;
	struct rdma_event_channel *evch1, *evch2;
	struct rdma_cm_id *server_id;
	struct rdma_cm_id *client_id;
	struct rdma_cm_event *event = NULL;
	struct rdma_conn_param conn_param;
	struct ibv_pd *pd1, *pd2;
	struct ibv_cq *cq1, *cq2;
	struct ibv_mr *mr1, *mr2;
	struct ibv_send_wr snd_wr;
	struct ibv_recv_wr rcv_wr;
	struct ibv_send_wr *bad_wr = NULL;
	struct ibv_sge sge;
	struct ibv_wc wc;
	struct ibv_qp_init_attr attr = {
		.cap = {
			.max_send_wr = 32,
			.max_recv_wr = 32,
			.max_send_sge = 1,
			.max_recv_sge = 1,
			.max_inline_data = 64
		},
		.qp_type = IBV_QPT_RC
	};
	char msg[256] = "Hello World";
	const int msg_len = strlen(msg) + 1;
	struct sockaddr_in sin;

	if (unshare(CLONE_NEWNET)) {
		perror("unshare");
		exit(-1);
	}
	system("ip link set lo up");
	system("rdma link add siw0 type siw netdev lo");
	system("rdma link list");
	system("ip link add veth1 type veth peer name veth2");
	system("ip addr add 10.0.0.1/24 dev veth1");
	system("ip link set veth1 up");
	system("ip addr add 10.0.0.2/24 dev veth2");
	system("ip link set veth2 up");
	system("ping -c 1 10.0.0.1");
	system("ping -c 1 10.0.0.2");
	system("rdma link show");
	system("rdma link add siw_dev1 type siw netdev veth1");
	system("rdma link add siw_dev2 type siw netdev veth2");
		
	if (!(evch1 = rdma_create_event_channel())) {
		perror("server: rdma_create_event_channel");
		exit(-1);
	}
	if (rdma_create_id(evch1, &server_id, NULL, RDMA_PS_TCP)) {
		perror("server: rdma_create_id");
		exit(-1);
	}
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	if (rdma_bind_addr(server_id, (struct sockaddr *)&sin)) {
		perror("server: rdma_bind_addr");
		exit(-1);
	}
	if (rdma_listen(server_id, 6)) {
		perror("server: rdma_listen");
		exit(-1);
	}
	if (!(evch2 = rdma_create_event_channel())) {
		perror("client: rdma_create_event_channel");
		exit(-1);
	}
	if (rdma_create_id(evch2, &client_id, NULL, RDMA_PS_TCP)) {
		perror("client: rdma_create_id");
		exit(-1);
	}
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	sin.sin_addr.s_addr = inet_addr(remote_addr);
	if (rdma_resolve_addr
	    (client_id, NULL, (struct sockaddr *)&sin, 2000)) {
		perror("client: rdma_resolve_addr");
		exit(-1);
	}
	if (rdma_get_cm_event(evch2, &event)
	    || event->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
		perror("client: rdma_get_cm_event");
		exit(-1);
	}
	rdma_ack_cm_event(event);
	if (rdma_resolve_route(client_id, 2000)) {
		perror("client: rdma_resolve_route");
		exit(-1);
	}
	if (rdma_get_cm_event(evch2, &event)
	    || event->event != RDMA_CM_EVENT_ROUTE_RESOLVED) {
		perror("client: rdma_get_cm_event");
		exit(-1);
	}
	rdma_ack_cm_event(event);
	if (!(pd2 = ibv_alloc_pd(client_id->verbs))) {
		perror("client: ibv_alloc_pd");
		exit(-1);
	}
	if (!(mr2 = ibv_reg_mr(pd2, msg, 256,
			       IBV_ACCESS_REMOTE_WRITE |
			       IBV_ACCESS_LOCAL_WRITE |
			       IBV_ACCESS_REMOTE_READ))) {
		perror("client: ibv_reg_mr");
		exit(-1);
	}		
	if (!(cq2 = ibv_create_cq(client_id->verbs, 32, 0, 0, 0))) {
		perror("client: ibv_create_cq");
		exit(-1);
	}
	attr.send_cq = attr.recv_cq = cq2;
	if (rdma_create_qp(client_id, pd2, &attr)) {
		perror("client: rdma_create_qp");
		exit(-1);
	}
	sge.addr = (uint64_t)msg;
	sge.length = msg_len;
	sge.lkey = mr2->lkey;
	rcv_wr.sg_list = &sge;
	rcv_wr.num_sge = 1;
	rcv_wr.next = NULL;
	if (ibv_post_recv(client_id->qp, &rcv_wr, NULL)) {
		perror("client: ibv_post_recv");
		exit(-1);
	}
	memset(&conn_param, 0, sizeof conn_param);
	if (rdma_connect(client_id, &conn_param)) {
		perror("client: rdma_connect");
		exit(-1);
	}
	if (rdma_get_cm_event(evch1, &event)
	    || event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
		perror("server: rdma_get_cm_event");
		exit(-1);
	}
	client_id = (struct rdma_cm_id *)event->id;
	if (!(pd1 = ibv_alloc_pd(client_id->verbs))) {
		perror("server: ibv_alloc_pd");
		exit(-1);
	}
	if (!(mr1 = ibv_reg_mr(pd1, msg, 256,
			       IBV_ACCESS_REMOTE_WRITE |
			       IBV_ACCESS_LOCAL_WRITE |
			       IBV_ACCESS_REMOTE_READ))) {
		perror("server: ibv_reg_mr");
		exit(-1);
	}
	if (!(cq1 = ibv_create_cq(client_id->verbs, 32, 0, 0, 0))) {
		perror("server: ibv_create_cq");
		exit(-1);
	}
	attr.send_cq = attr.recv_cq = cq1;
	if (rdma_create_qp(client_id, pd1, &attr)) {
		perror("server: rdma_create_qp");
		exit(-1);
	}
	memset(&conn_param, 0, sizeof conn_param);
	if (rdma_accept(client_id, &conn_param)) {
		perror("server: rdma_accept");
		exit(-1);
	}
	rdma_ack_cm_event(event);
	if (rdma_get_cm_event(evch1, &event)
	    || event->event != RDMA_CM_EVENT_ESTABLISHED) {
		perror("server: rdma_get_cm_event");
		exit(-1);
	}
	rdma_ack_cm_event(event);
	sge.addr = (uint64_t)msg;
	sge.length = msg_len;
	sge.lkey = mr1->lkey;
	snd_wr.sg_list = &sge;
	snd_wr.num_sge = 1;
	snd_wr.opcode = IBV_WR_SEND;
	snd_wr.send_flags = IBV_SEND_SIGNALED;
	snd_wr.next = NULL;
	if (ibv_post_send(client_id->qp, &snd_wr, &bad_wr)) {
		perror("server: ibv_post_send");
		exit(-1);
	}
	while (!ibv_poll_cq(cq1, 1, &wc))
		;
	if (wc.status != IBV_WC_SUCCESS) {
		perror("server: ibv_poll_cq");
		exit(-1);
	}
	if (rdma_get_cm_event(evch2, &event)
	    || event->event != RDMA_CM_EVENT_ESTABLISHED) {
		perror("client: rdma_get_cm_event");
		exit(-1);
	}
	rdma_ack_cm_event(event);
	while (!ibv_poll_cq(cq2, 1, &wc))
		;
	if (wc.status != IBV_WC_SUCCESS) {
		perror("client: ibv_poll_cq");
		exit(-1);
	}
	printf("Received: %s\n", msg);
	fflush(stdout);
	system("rdma link del siw_dev1");
	system("rdma link del siw_dev2");
	system("ip link del veth1 type veth peer name veth2");
	return system("rdma link del siw0");
}

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [rdma] "rdma link del" operation hangs at wait_for_completion() when a file descriptor is in use.
  2025-12-04  8:26 [rdma] "rdma link del" operation hangs at wait_for_completion() when a file descriptor is in use Tetsuo Handa
@ 2026-02-28  6:07 ` Tetsuo Handa
  2026-02-28 16:43   ` Jason Gunthorpe
  0 siblings, 1 reply; 6+ messages in thread
From: Tetsuo Handa @ 2026-02-28  6:07 UTC (permalink / raw)
  To: OFED mailing list, Jason Gunthorpe, Leon Romanovsky

On 2025/12/04 17:26, Tetsuo Handa wrote:
> I found that running the attached example program causes khungtaskd message. What is wrong?

I found that this is a deadlock caused by "struct ib_device_ops"->disassociate_ucontext == NULL.
The thread that called ib_uverbs_remove_one() blocks at wait_for_completion(), which can only be
completed by ib_uverbs_release_file() being called from ib_uverbs_close(); if the file descriptor
is never closed, the wait never finishes, and it forms a deadlock.
I think we need a change like the one below, which stops waiting for clients (the behavior that
a non-NULL disassociate_ucontext callback would already give us).

By the way, it seems that all users which do provide a non-NULL disassociate_ucontext callback
pass a no-op function. What is needed to safely disassociate a ucontext for the drivers that
currently do not provide a disassociate_ucontext callback? If nothing is needed, can we remove
the disassociate_ucontext callback entirely?


diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 7b68967a6301..f1e20928391b 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -1248,43 +1248,39 @@ static void ib_uverbs_free_hw_resources(struct ib_uverbs_device *uverbs_dev,
 	}
 	mutex_unlock(&uverbs_dev->lists_mutex);
 
 	uverbs_disassociate_api(uverbs_dev->uapi);
 }
 
 static void ib_uverbs_remove_one(struct ib_device *device, void *client_data)
 {
 	struct ib_uverbs_device *uverbs_dev = client_data;
-	int wait_clients = 1;
 
 	cdev_device_del(&uverbs_dev->cdev, &uverbs_dev->dev);
 	ida_free(&uverbs_ida, uverbs_dev->devnum);
 
 	if (device->ops.disassociate_ucontext) {
 		/* We disassociate HW resources and immediately return.
 		 * Userspace will see a EIO errno for all future access.
 		 * Upon returning, ib_device may be freed internally and is not
 		 * valid any more.
 		 * uverbs_device is still available until all clients close
 		 * their files, then the uverbs device ref count will be zero
 		 * and its resources will be freed.
 		 * Note: At this point no more files can be opened since the
 		 * cdev was deleted, however active clients can still issue
 		 * commands and close their open files.
 		 */
 		ib_uverbs_free_hw_resources(uverbs_dev, device);
-		wait_clients = 0;
 	}
 
 	if (refcount_dec_and_test(&uverbs_dev->refcount))
 		ib_uverbs_comp_dev(uverbs_dev);
-	if (wait_clients)
-		wait_for_completion(&uverbs_dev->comp);
 
 	put_device(&uverbs_dev->dev);
 }
 
 static int __init ib_uverbs_init(void)
 {
 	int ret;
 
 	ret = register_chrdev_region(IB_UVERBS_BASE_DEV,



* Re: [rdma] "rdma link del" operation hangs at wait_for_completion() when a file descriptor is in use.
  2026-02-28  6:07 ` Tetsuo Handa
@ 2026-02-28 16:43   ` Jason Gunthorpe
  2026-02-28 22:35     ` Tetsuo Handa
  0 siblings, 1 reply; 6+ messages in thread
From: Jason Gunthorpe @ 2026-02-28 16:43 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: OFED mailing list, Leon Romanovsky

On Sat, Feb 28, 2026 at 03:07:29PM +0900, Tetsuo Handa wrote:
> On 2025/12/04 17:26, Tetsuo Handa wrote:
> > I found that running the attached example program causes khungtaskd message. What is wrong?
> 
> I found that this is a deadlock caused by "struct ib_device_ops"->disassociate_ucontext == NULL.
> If the thread which called ib_uverbs_remove_one() is unable to call ib_uverbs_release_file()
>  from ib_uverbs_close() because it is blocked at
> wait_for_completion(), it forms a deadlock.

That doesn't sound right at all; the wait_for_completion() is waiting
for other threads to let go of the context before closing it. rxe and
the other drivers that syzkaller is testing don't support disassociate,
so they need to wait.

If the wait gets stuck, that is a different issue.

Jason


* Re: [rdma] "rdma link del" operation hangs at wait_for_completion() when a file descriptor is in use.
  2026-02-28 16:43   ` Jason Gunthorpe
@ 2026-02-28 22:35     ` Tetsuo Handa
  2026-03-01  7:43       ` Tetsuo Handa
  0 siblings, 1 reply; 6+ messages in thread
From: Tetsuo Handa @ 2026-02-28 22:35 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: OFED mailing list, Leon Romanovsky

On 2026/03/01 1:43, Jason Gunthorpe wrote:
> On Sat, Feb 28, 2026 at 03:07:29PM +0900, Tetsuo Handa wrote:
>> On 2025/12/04 17:26, Tetsuo Handa wrote:
>>> I found that running the attached example program causes khungtaskd message. What is wrong?
>>
>> I found that this is a deadlock caused by "struct ib_device_ops"->disassociate_ucontext == NULL.
>> If the thread which called ib_uverbs_remove_one() is unable to call ib_uverbs_release_file()
>>  from ib_uverbs_close() because it is blocked at
>> wait_for_completion(), it forms a deadlock.
> 
> That doesn't sound right at all, the wait_for_completion is waiting
> for other threads to let go of the context before closing it. rxe/etc
> that syzkaller is testing don't support disassociate so they need to
> wait.

This issue was not found by syzkaller. Please see the reproducer.

> 
> If the wait gets stuck that is a different issue.

My question is how we can support disassociate...



* Re: [rdma] "rdma link del" operation hangs at wait_for_completion() when a file descriptor is in use.
  2026-02-28 22:35     ` Tetsuo Handa
@ 2026-03-01  7:43       ` Tetsuo Handa
  2026-03-01 17:47         ` Jason Gunthorpe
  0 siblings, 1 reply; 6+ messages in thread
From: Tetsuo Handa @ 2026-03-01  7:43 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: OFED mailing list, Leon Romanovsky

On 2026/03/01 7:35, Tetsuo Handa wrote:
> On 2026/03/01 1:43, Jason Gunthorpe wrote:
>> On Sat, Feb 28, 2026 at 03:07:29PM +0900, Tetsuo Handa wrote:
>>> On 2025/12/04 17:26, Tetsuo Handa wrote:
>>>> I found that running the attached example program causes khungtaskd message. What is wrong?
>>>
>>> I found that this is a deadlock caused by "struct ib_device_ops"->disassociate_ucontext == NULL.
>>> If the thread which called ib_uverbs_remove_one() is unable to call ib_uverbs_release_file()
>>>  from ib_uverbs_close() because it is blocked at
>>> wait_for_completion(), it forms a deadlock.
>>
>> That doesn't sound right at all, the wait_for_completion is waiting
>> for other threads to let go of the context before closing it. rxe/etc
>> that syzkaller is testing don't support disassociate so they need to
>> wait.
> 
> This issue was not found by syzkaller. Please see the reproducer.
> 
>>
>> If the wait gets stuck that is a different issue.
> 
> My question is how we can support disassociate...
> 

As of commit eb71ab2bf722 in linux.git, this problem is still reproducible.

If you add debug printk() calls like the ones below, you can confirm that
ib_uverbs_remove_one() cannot return until the file descriptor is closed.


diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 7b68967a6301..aa247b662836 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -210,8 +210,11 @@ void ib_uverbs_release_file(struct kref *ref)
                module_put(ib_dev->ops.owner);
        srcu_read_unlock(&file->device->disassociate_srcu, srcu_key);

-       if (refcount_dec_and_test(&file->device->refcount))
+       if (refcount_dec_and_test(&file->device->refcount)) {
+               pr_info("Start ib_uverbs_comp_dev()\n");
                ib_uverbs_comp_dev(file->device);
+               pr_info("End ib_uverbs_comp_dev()\n");
+       }

        if (file->default_async_file)
                uverbs_uobject_put(&file->default_async_file->uobj);
@@ -1277,8 +1280,11 @@ static void ib_uverbs_remove_one(struct ib_device *device, void *client_data)

        if (refcount_dec_and_test(&uverbs_dev->refcount))
                ib_uverbs_comp_dev(uverbs_dev);
-       if (wait_clients)
+       if (wait_clients) {
+               pr_info("Start wait_for_completion()\n");
                wait_for_completion(&uverbs_dev->comp);
+               pr_info("End wait_for_completion()\n");
+       }

        put_device(&uverbs_dev->dev);
 }

[  322.613977] [   T1281] iwpm_register_pid: Unable to send a nlmsg (client = 2)
[  322.691002] [   T1321] Start wait_for_completion()
[  496.727254] [    T100] INFO: task rdma:1321 blocked for more than 122 seconds.
[  496.732315] [    T100]       Not tainted 7.0.0-rc1+ #288
[  496.736240] [    T100] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  496.742253] [    T100] task:rdma            state:D stack:0     pid:1321  tgid:1321  ppid:1281   task_flags:0x400100 flags:0x00080000
[  496.750349] [    T100] Call Trace:
[  496.752911] [    T100]  <TASK>
[  496.755040] [    T100]  __schedule+0x33e/0x6d0
[  496.758612] [    T100]  ? __lock_release.isra.0+0x59/0x170
[  496.761729] [    T100]  schedule+0x3a/0xe0
[  496.764818] [    T100]  schedule_timeout+0xca/0x110
[  496.768491] [    T100]  wait_for_completion+0x8a/0x140
[  496.771999] [    T100]  ib_uverbs_remove_one+0x1bc/0x220 [ib_uverbs]
[  496.776066] [    T100]  remove_client_context+0x8d/0xd0 [ib_core]
[  496.780560] [    T100]  disable_device+0x8b/0x170 [ib_core]
[  496.784371] [    T100]  __ib_unregister_device+0x110/0x180 [ib_core]
[  496.789351] [    T100]  ib_unregister_device_and_put+0x37/0x50 [ib_core]
(...snipped...) // Kill rdma_example process here which is waiting for rdma process to terminate.
[  540.070450] [   T1281] Start ib_uverbs_comp_dev()
[  540.074127] [   T1281] End ib_uverbs_comp_dev()
[  540.074498] [   T1321] End wait_for_completion()



* Re: [rdma] "rdma link del" operation hangs at wait_for_completion() when a file descriptor is in use.
  2026-03-01  7:43       ` Tetsuo Handa
@ 2026-03-01 17:47         ` Jason Gunthorpe
  0 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2026-03-01 17:47 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: OFED mailing list, Leon Romanovsky

On Sun, Mar 01, 2026 at 04:43:55PM +0900, Tetsuo Handa wrote:
> On 2026/03/01 7:35, Tetsuo Handa wrote:
> > On 2026/03/01 1:43, Jason Gunthorpe wrote:
> >> On Sat, Feb 28, 2026 at 03:07:29PM +0900, Tetsuo Handa wrote:
> >>> On 2025/12/04 17:26, Tetsuo Handa wrote:
> >>>> I found that running the attached example program causes khungtaskd message. What is wrong?
> >>>
> >>> I found that this is a deadlock caused by "struct ib_device_ops"->disassociate_ucontext == NULL.
> >>> If the thread which called ib_uverbs_remove_one() is unable to call ib_uverbs_release_file()
> >>>  from ib_uverbs_close() because it is blocked at
> >>> wait_for_completion(), it forms a deadlock.
> >>
> >> That doesn't sound right at all, the wait_for_completion is waiting
> >> for other threads to let go of the context before closing it. rxe/etc
> >> that syzkaller is testing don't support disassociate so they need to
> >> wait.
> > 
> > This issue was not found by syzkaller. Please see the reproducer.
> > 
> >>
> >> If the wait gets stuck that is a different issue.
> > 
> > My question is how we can support disassociate...
> > 
> 
> As of eb71ab2bf722 in linux.git, this problem is still reproducible.
> 
> If you add debug printk() like below, you will be able to confirm that
> ib_uverbs_remove_one() cannot return unless file descriptor is closed.

That's right - that is exactly how it is designed to work.

Without disassociate support in the driver you *cannot* safely unload
the driver while FDs are open; to protect the kernel from UAF crashes
we sleep here.

I have no idea what it would take to add disassociate to siw or rxe;
they clearly should have it, given how easy it is to trigger a "driver
unbind" through ip route.

Alternatively, maybe we should make 'ip link del' fail early in these
cases when an FD is open, instead of waiting indefinitely.

Jason


end of thread, other threads:[~2026-03-01 17:47 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-04  8:26 [rdma] "rdma link del" operation hangs at wait_for_completion() when a file descriptor is in use Tetsuo Handa
2026-02-28  6:07 ` Tetsuo Handa
2026-02-28 16:43   ` Jason Gunthorpe
2026-02-28 22:35     ` Tetsuo Handa
2026-03-01  7:43       ` Tetsuo Handa
2026-03-01 17:47         ` Jason Gunthorpe
