ceph-devel.vger.kernel.org archive mirror
* [bug report] rbd unmap hangs after pausing and unpausing I/O
@ 2025-09-23 10:38 Raphael Zimmer
  2025-09-23 17:42 ` Viacheslav Dubeyko
  2025-09-23 18:33 ` Ilya Dryomov
  0 siblings, 2 replies; 6+ messages in thread
From: Raphael Zimmer @ 2025-09-23 10:38 UTC
  To: Ilya Dryomov, Xiubo Li; +Cc: ceph-devel

Hello,

I encountered an error with the kernel Ceph client (specifically when
using an RBD device) after pausing I/O on the cluster by setting and
later unsetting the pauserd and pausewr flags. I saw the error in two
different setups, and I believe both share the same root cause.

1) When pausing and later unpausing I/O on the cluster, everything seems 
to work as expected until trying to unmap an RBD device from the kernel. 
In this case, the rbd unmap command hangs and also can't be killed. To 
get back to a normally working state, a system reboot is needed. This 
behavior was observed on different systems (Debian 12 and 13) and could 
also be reproduced with an installation of the mainline kernel (v6.17-rc6).

Steps to reproduce:
- Connect the kernel client to an RBD device (rbd map)
- Pause I/O on the cluster (ceph osd pause)
- Wait some time (3 minutes should be enough)
- Unpause I/O on the cluster (ceph osd unpause)
- Try to unmap the RBD device on the client (rbd unmap)


2) When using an application that internally uses the kernel Ceph client 
code, I observed the following behavior:

Pausing I/O leads to a watch error after some time (same as with failing 
OSDs or e.g. when pool quota is reached). In rbd_watch_errcb 
(drivers/block/rbd.c), the watch_dwork gets scheduled, which leads to a 
call of rbd_reregister_watch -> __rbd_register_watch -> ceph_osdc_watch 
(net/ceph/osd_client.c) -> linger_reg_commit_wait -> 
wait_for_completion_killable. At this point, it waits for the
completion without any timeout. Normally, the wait returns once the
condition that caused the watch error is resolved. With pausing and
unpausing I/O, wait_for_completion_killable does not return even after
unpausing, because neither complete() nor complete_all() is ever
called. My guess is that some call is missing on unpause, so the commit
of the linger request never completes.
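
To illustrate the failure mode (not the actual kernel code, and not a
proposed fix), here is a minimal userspace sketch of a completion built
on pthreads. All demo_* names are hypothetical and exist only for this
example. It shows that a waiter can only make progress if someone
signals the completion; a bounded wait at least lets the caller notice
that the signal never arrived, whereas an unbounded wait would block
forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Userspace stand-in for a kernel completion (hypothetical, demo only). */
struct demo_completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void demo_completion_init(struct demo_completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Rough analogue of complete_all(): mark done and wake every waiter. */
static void demo_complete_all(struct demo_completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/*
 * Bounded wait. Returns true if the completion was signalled, false if
 * timeout_ms elapsed first. Without the deadline, a missing
 * demo_complete_all() call would leave the caller blocked indefinitely.
 */
static bool demo_wait_timeout_ms(struct demo_completion *c, long timeout_ms)
{
	struct timespec deadline;
	bool done;
	int rc = 0;

	assert(timeout_ms >= 0);

	/* pthread_cond_timedwait() takes an absolute CLOCK_REALTIME deadline. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&c->lock);
	/* Loop to tolerate spurious wakeups; stop on signal or timeout. */
	while (!c->done && rc != ETIMEDOUT)
		rc = pthread_cond_timedwait(&c->cond, &c->lock, &deadline);
	done = c->done;
	pthread_mutex_unlock(&c->lock);

	return done;
}
```

In this analogy, the hang I am seeing corresponds to waiting on the
completion without a deadline while nothing ever calls
demo_complete_all().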

From what I am seeing, it seems like this missing completion in the
second case is also the cause of the hanging rbd unmap with the
unmodified kernel.


Best regards,

Raphael
