[PATCH] ceph: fix potential stray locked folios during umount

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] ceph: fix potential stray locked folios during umount
@ 2026-04-18 13:39 Li Lei
  2026-04-23 19:30 ` Viacheslav Dubeyko
  0 siblings, 1 reply; 8+ messages in thread
From: Li Lei @ 2026-04-18 13:39 UTC (permalink / raw)
  To: idryomov, amarkuze, slava, xiubli
  Cc: ceph-devel, linux-kernel, noctis.akm, lilei24

During umount, we only wait for stopping_blockers to drop to zero for
a certain time specified by mount_timeout, and continue the rest of
the procedure even if there are inflight requests. This behavior may
leave some folios locked even after the cephfs umounted, which causes
other kernel threads to hung.

Buffered read process calls filemap_update_page() and waits on
folio_put_wait_locked() with TASK_KILLABLE flag set, which means this
process could be killed and the filesystem could be umount successfully
(no file opened in it). Umount calls truncate_inode_pages() and waits
on locked pages for those inodes whose i_count == 0. In these way,
there would be no locked folios for this filesystem left in system
after umount exits.

However, things are different for cephfs. Cephfs calls ihold() and
submits osd request for buffered read and gets folio locked. Once the
buffered read process is killed, the inode will be skipped in
evict_inodes(), because its i_count > 0. Forthemore, the folios are
still locked. It can only be unlocked in netfs_unlock_read_folio().

stopping_blocks should block umount from proceeding, but it only waits
for mount_timeout (default 60s) even if there are still flying request
out there, leaving stray locked folios. Other kthread, like kcompactd
, could be stuck on those locked folioes forever.

Steps to Reproduce:
1. echo 3 > /proc/sys/vm/drop_caches.
2. dd if=cephfs/xxx.img of=/dev/null
    Make sure cephfs/xxx.img is big enough to make time for us to do the
    following command
3. execute 'systemctl stop ceph-osd@*' on the osd nodes
    It would be great if you have a tiny cluster. Stopping all the osds
    would be much easier.
4. kill -9 `pidof dd`.
    Buffered read process must be killed at that moment. But inflight
    read requests can be observed in the /sys/kernel/debug/ceph/xxxx/osdc
5. umount cephfs
    Wait for 60s if you mount cephfs by using the default mount option.

We got the warning:
ceph: [b2c9a006-9ad8-48e9-8257-6fb1e1b91014 66562]: umount timed out, 0
VFS: Busy inodes after unmount of ceph (ceph)

if check_data_corruption option disable, kcompactd may stuck in the
future. If it is eanbled, we catch the bug immediately.

[94543.042953] ------------[ cut here ]------------
[94543.049391] kernel BUG at fs/super.c:654!
[94543.054171] Oops: invalid opcode: 0000 [#1] SMP PTI
[94543.059881] CPU: 25 UID: 0 PID: 3451674 Comm: umount Kdump: loaded Tainted: G S         OE       7.0.0-dirty #2 PREEMPTLAZY
[94543.072678] Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[94543.080918] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[94543.089755] RIP: 0010:generic_shutdown_super+0x111/0x120
[94543.095982] Code: cc cc e8 c2 1f ef ff 48 8b bb d0 00 00 00 eb db 48 8b 43 28 48 8d b3 98 03 00 00 48 c7 c7 90 09 55 8c 48 8b 10 e8 0f c3 cc ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90
[94543.117607] RSP: 0018:ffffce35f8c53d40 EFLAGS: 00010246
[94543.123793] RAX: 000000000000002d RBX: ffff8ba94d0d9000 RCX: 0000000000000000
[94543.132125] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8bc0df91c600
[94543.140460] RBP: ffffffffc13b52c0 R08: 0000000000000000 R09: ffffce35f8c53be0
[94543.148801] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8ba94af0e000
[94543.157150] R13: ffff8ba94d0d9000 R14: 0000000000000004 R15: ffff8ba946d9a000
[94543.165505] FS:  00007fb1c607c840(0000) GS:ffff8bc1520e4000(0000) knlGS:0000000000000000
[94543.174943] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[94543.181769] CR2: 00000000d64e5000 CR3: 000000189c9f2002 CR4: 00000000003726f0
[94543.190160] Call Trace:
[94543.193317]  <TASK>
[94543.196088]  kill_anon_super+0x12/0x40
[94543.200719]  ceph_kill_sb+0xda/0x2c0 [ceph]
[94543.205877]  ? radix_tree_delete_item+0x68/0xd0
[94543.211395]  deactivate_locked_super+0x31/0xb0
[94543.216815]  cleanup_mnt+0xcb/0x110
[94543.221169]  task_work_run+0x58/0x80
[94543.225629]  exit_to_user_mode_loop+0x13f/0x4d0
[94543.231163]  do_syscall_64+0x1ef/0x840
[94543.235827]  ? do_syscall_64+0x101/0x840
[94543.240687]  ? do_user_addr_fault+0x20e/0x6b0
[94543.246036]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[94543.252166] RIP: 0033:0x7fb1c5f0ccab
[94543.256650] Code: 73 31 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 39 31 0e 00 f7 d8
[94543.278690] RSP: 002b:00007ffe96f80ea8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[94543.287710] RAX: 0000000000000000 RBX: 00007fb1c61fb264 RCX: 00007fb1c5f0ccab
[94543.296251] RDX: fffffffffffffe88 RSI: 0000000000000000 RDI: 000055ae9dcc6ec0
[94543.304801] RBP: 000055ae9dcc6c90 R08: 0000000000000000 R09: 00007ffe96f7fc50
[94543.313358] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[94543.321922] R13: 000055ae9dcc6ec0 R14: 000055ae9dcc6da0 R15: 0000000000000000

So make it wait until all the flying requests returns for clean and safe
umount.

Fixes: 1464de9f813e ("ceph: wait for OSD requests' callbacks to finish when unmounting")
Signed-off-by: Li Lei <lilei24@kuaishou.com>
---
 fs/ceph/super.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 2aed6b3..48e63c1 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -1569,13 +1569,10 @@ static void ceph_kill_sb(struct super_block *s)
 	spin_unlock(&mdsc->stopping_lock);
 
 	if (wait && atomic_read(&mdsc->stopping_blockers)) {
-		long timeleft = wait_for_completion_killable_timeout(
-					&mdsc->stopping_waiter,
-					fsc->client->options->mount_timeout);
-		if (!timeleft) /* timed out */
-			pr_warn_client(cl, "umount timed out, %ld\n", timeleft);
-		else if (timeleft < 0) /* killed */
-			pr_warn_client(cl, "umount was killed, %ld\n", timeleft);
+		int rc = wait_for_completion_killable(
+					&mdsc->stopping_waiter);
+		if (rc < 0) /* killed */
+			pr_warn_client(cl, "umount was killed\n");
 	}
 
 	mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re:  [PATCH] ceph: fix potential stray locked folios during umount
  2026-04-18 13:39 [PATCH] ceph: fix potential stray locked folios during umount Li Lei
@ 2026-04-23 19:30 ` Viacheslav Dubeyko
  2026-04-24 19:44   ` 李磊
  0 siblings, 1 reply; 8+ messages in thread
From: Viacheslav Dubeyko @ 2026-04-23 19:30 UTC (permalink / raw)
  To: lilei24@kuaishou.com, idryomov@gmail.com, Alex Markuze,
	slava@dubeyko.com, Xiubo Li
  Cc: ceph-devel@vger.kernel.org, noctis.akm@gmail.com,
	linux-kernel@vger.kernel.org

On Sat, 2026-04-18 at 21:39 +0800, Li Lei wrote:
> During umount, we only wait for stopping_blockers to drop to zero for
> a certain time specified by mount_timeout, and continue the rest of
> the procedure even if there are inflight requests. This behavior may
> leave some folios locked even after the cephfs umounted, which causes
> other kernel threads to hung.
> 
> Buffered read process calls filemap_update_page() and waits on
> folio_put_wait_locked() with TASK_KILLABLE flag set, which means this
> process could be killed and the filesystem could be umount successfully
> (no file opened in it). Umount calls truncate_inode_pages() and waits
> on locked pages for those inodes whose i_count == 0. In these way,
> there would be no locked folios for this filesystem left in system
> after umount exits.
> 
> However, things are different for cephfs. Cephfs calls ihold() and
> submits osd request for buffered read and gets folio locked. Once the
> buffered read process is killed, the inode will be skipped in
> evict_inodes(), because its i_count > 0. Forthemore, the folios are
> still locked. It can only be unlocked in netfs_unlock_read_folio().
> 
> stopping_blocks should block umount from proceeding, but it only waits
> for mount_timeout (default 60s) even if there are still flying request
> out there, leaving stray locked folios. Other kthread, like kcompactd
> , could be stuck on those locked folioes forever.
> 
> Steps to Reproduce:
> 1. echo 3 > /proc/sys/vm/drop_caches.
> 2. dd if=cephfs/xxx.img of=/dev/null
>     Make sure cephfs/xxx.img is big enough to make time for us to do the
>     following command
> 3. execute 'systemctl stop ceph-osd@*' on the osd nodes
>     It would be great if you have a tiny cluster. Stopping all the osds
>     would be much easier.
> 4. kill -9 `pidof dd`.
>     Buffered read process must be killed at that moment. But inflight
>     read requests can be observed in the /sys/kernel/debug/ceph/xxxx/osdc
> 5. umount cephfs
>     Wait for 60s if you mount cephfs by using the default mount option.
> 
> We got the warning:
> ceph: [b2c9a006-9ad8-48e9-8257-6fb1e1b91014 66562]: umount timed out, 0
> VFS: Busy inodes after unmount of ceph (ceph)
> 
> if check_data_corruption option disable, kcompactd may stuck in the
> future. If it is eanbled, we catch the bug immediately.
> 
> [94543.042953] ------------[ cut here ]------------
> [94543.049391] kernel BUG at fs/super.c:654!
> [94543.054171] Oops: invalid opcode: 0000 [#1] SMP PTI
> [94543.059881] CPU: 25 UID: 0 PID: 3451674 Comm: umount Kdump: loaded Tainted: G S         OE       7.0.0-dirty #2 PREEMPTLAZY
> [94543.072678] Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> [94543.080918] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
> [94543.089755] RIP: 0010:generic_shutdown_super+0x111/0x120
> [94543.095982] Code: cc cc e8 c2 1f ef ff 48 8b bb d0 00 00 00 eb db 48 8b 43 28 48 8d b3 98 03 00 00 48 c7 c7 90 09 55 8c 48 8b 10 e8 0f c3 cc ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90
> [94543.117607] RSP: 0018:ffffce35f8c53d40 EFLAGS: 00010246
> [94543.123793] RAX: 000000000000002d RBX: ffff8ba94d0d9000 RCX: 0000000000000000
> [94543.132125] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8bc0df91c600
> [94543.140460] RBP: ffffffffc13b52c0 R08: 0000000000000000 R09: ffffce35f8c53be0
> [94543.148801] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8ba94af0e000
> [94543.157150] R13: ffff8ba94d0d9000 R14: 0000000000000004 R15: ffff8ba946d9a000
> [94543.165505] FS:  00007fb1c607c840(0000) GS:ffff8bc1520e4000(0000) knlGS:0000000000000000
> [94543.174943] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [94543.181769] CR2: 00000000d64e5000 CR3: 000000189c9f2002 CR4: 00000000003726f0
> [94543.190160] Call Trace:
> [94543.193317]  <TASK>
> [94543.196088]  kill_anon_super+0x12/0x40
> [94543.200719]  ceph_kill_sb+0xda/0x2c0 [ceph]
> [94543.205877]  ? radix_tree_delete_item+0x68/0xd0
> [94543.211395]  deactivate_locked_super+0x31/0xb0
> [94543.216815]  cleanup_mnt+0xcb/0x110
> [94543.221169]  task_work_run+0x58/0x80
> [94543.225629]  exit_to_user_mode_loop+0x13f/0x4d0
> [94543.231163]  do_syscall_64+0x1ef/0x840
> [94543.235827]  ? do_syscall_64+0x101/0x840
> [94543.240687]  ? do_user_addr_fault+0x20e/0x6b0
> [94543.246036]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [94543.252166] RIP: 0033:0x7fb1c5f0ccab
> [94543.256650] Code: 73 31 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 39 31 0e 00 f7 d8
> [94543.278690] RSP: 002b:00007ffe96f80ea8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
> [94543.287710] RAX: 0000000000000000 RBX: 00007fb1c61fb264 RCX: 00007fb1c5f0ccab
> [94543.296251] RDX: fffffffffffffe88 RSI: 0000000000000000 RDI: 000055ae9dcc6ec0
> [94543.304801] RBP: 000055ae9dcc6c90 R08: 0000000000000000 R09: 00007ffe96f7fc50
> [94543.313358] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [94543.321922] R13: 000055ae9dcc6ec0 R14: 000055ae9dcc6da0 R15: 0000000000000000
> 
> So make it wait until all the flying requests returns for clean and safe
> umount.
> 
> Fixes: 1464de9f813e ("ceph: wait for OSD requests' callbacks to finish when unmounting")
> Signed-off-by: Li Lei <lilei24@kuaishou.com>
> ---
>  fs/ceph/super.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index 2aed6b3..48e63c1 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -1569,13 +1569,10 @@ static void ceph_kill_sb(struct super_block *s)
>  	spin_unlock(&mdsc->stopping_lock);
>  
>  	if (wait && atomic_read(&mdsc->stopping_blockers)) {
> -		long timeleft = wait_for_completion_killable_timeout(
> -					&mdsc->stopping_waiter,
> -					fsc->client->options->mount_timeout);
> -		if (!timeleft) /* timed out */
> -			pr_warn_client(cl, "umount timed out, %ld\n", timeleft);
> -		else if (timeleft < 0) /* killed */
> -			pr_warn_client(cl, "umount was killed, %ld\n", timeleft);
> +		int rc = wait_for_completion_killable(
> +					&mdsc->stopping_waiter);
> +		if (rc < 0) /* killed */
> +			pr_warn_client(cl, "umount was killed\n");
>  	}
>  
>  	mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;

I am not completely sure that it's right approach to fix the problem. If I
understood correctly, we have situation that somehow OSD is down and dd process
has been killed. It's not normal and usual situation. So, your suggestion is to
hung in unmount process forever because we cannot service the flying requests.
Is it really good way to fix the problem? Probably, we need to have more smart
logic here of waiting flying requests ending. But timeout makes sense to prevent
from the situation that something is going wrong and we cannot finish unmount at
all. What do you think? Maybe we need to have some loop that checks the state by
waking up after some timeout. If the number of flying requests is decreasing,
then we should wait. But if nothing is changing with time, then it means that
something is wrong and it makes sense to unmount anyway. Because, finally, we
will have some call trace in system log, anyway. Does it make sense?

Also, we have likewise pattern in the ceph_kill_sb():

	if (atomic64_read(&mdsc->dirty_folios) > 0) {
		wait_queue_head_t *wq = &mdsc->flush_end_wq;
		long timeleft = wait_event_killable_timeout(*wq,
					atomic64_read(&mdsc->dirty_folios) <=
0,
					fsc->client->options->mount_timeout);
		if (!timeleft) /* timed out */
			pr_warn_client(cl, "umount timed out, %ld\n",
timeleft);
		else if (timeleft < 0) /* killed */
			pr_warn_client(cl, "umount was killed, %ld\n",
timeleft);
	}

Do we need to do something here too?

Thanks,
Slava.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Re:  [PATCH] ceph: fix potential stray locked folios during umount
  2026-04-23 19:30 ` Viacheslav Dubeyko
@ 2026-04-24 19:44   ` 李磊
  2026-04-24 22:02     ` Viacheslav Dubeyko
  0 siblings, 1 reply; 8+ messages in thread
From: 李磊 @ 2026-04-24 19:44 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: 李磊, idryomov@gmail.com, Alex Markuze,
	slava@dubeyko.com, Xiubo Li, ceph-devel@vger.kernel.org,
	noctis.akm@gmail.com, linux-kernel@vger.kernel.org



> 2026年4月24日 03:30，Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> 写道：
> 
> 安全提示：此邮件来自公司外部。除非您确认发件人身份可信且邮件内容不含可疑信息，否则请勿回复或转发邮件、点击邮件链接或打开附件。
> 
> 
> On Sat, 2026-04-18 at 21:39 +0800, Li Lei wrote:
>> During umount, we only wait for stopping_blockers to drop to zero for
>> a certain time specified by mount_timeout, and continue the rest of
>> the procedure even if there are inflight requests. This behavior may
>> leave some folios locked even after the cephfs umounted, which causes
>> other kernel threads to hung.
>> 
>> Buffered read process calls filemap_update_page() and waits on
>> folio_put_wait_locked() with TASK_KILLABLE flag set, which means this
>> process could be killed and the filesystem could be umount successfully
>> (no file opened in it). Umount calls truncate_inode_pages() and waits
>> on locked pages for those inodes whose i_count == 0. In these way,
>> there would be no locked folios for this filesystem left in system
>> after umount exits.
>> 
>> However, things are different for cephfs. Cephfs calls ihold() and
>> submits osd request for buffered read and gets folio locked. Once the
>> buffered read process is killed, the inode will be skipped in
>> evict_inodes(), because its i_count > 0. Forthemore, the folios are
>> still locked. It can only be unlocked in netfs_unlock_read_folio().
>> 
>> stopping_blocks should block umount from proceeding, but it only waits
>> for mount_timeout (default 60s) even if there are still flying request
>> out there, leaving stray locked folios. Other kthread, like kcompactd
>> , could be stuck on those locked folioes forever.
>> 
>> Steps to Reproduce:
>> 1. echo 3 > /proc/sys/vm/drop_caches.
>> 2. dd if=cephfs/xxx.img of=/dev/null
>>    Make sure cephfs/xxx.img is big enough to make time for us to do the
>>    following command
>> 3. execute 'systemctl stop ceph-osd@*' on the osd nodes
>>    It would be great if you have a tiny cluster. Stopping all the osds
>>    would be much easier.
>> 4. kill -9 `pidof dd`.
>>    Buffered read process must be killed at that moment. But inflight
>>    read requests can be observed in the /sys/kernel/debug/ceph/xxxx/osdc
>> 5. umount cephfs
>>    Wait for 60s if you mount cephfs by using the default mount option.
>> 
>> We got the warning:
>> ceph: [b2c9a006-9ad8-48e9-8257-6fb1e1b91014 66562]: umount timed out, 0
>> VFS: Busy inodes after unmount of ceph (ceph)
>> 
>> if check_data_corruption option disable, kcompactd may stuck in the
>> future. If it is eanbled, we catch the bug immediately.
>> 
>> [94543.042953] ------------[ cut here ]------------
>> [94543.049391] kernel BUG at fs/super.c:654!
>> [94543.054171] Oops: invalid opcode: 0000 [#1] SMP PTI
>> [94543.059881] CPU: 25 UID: 0 PID: 3451674 Comm: umount Kdump: loaded Tainted: G S         OE       7.0.0-dirty #2 PREEMPTLAZY
>> [94543.072678] Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>> [94543.080918] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
>> [94543.089755] RIP: 0010:generic_shutdown_super+0x111/0x120
>> [94543.095982] Code: cc cc e8 c2 1f ef ff 48 8b bb d0 00 00 00 eb db 48 8b 43 28 48 8d b3 98 03 00 00 48 c7 c7 90 09 55 8c 48 8b 10 e8 0f c3 cc ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90
>> [94543.117607] RSP: 0018:ffffce35f8c53d40 EFLAGS: 00010246
>> [94543.123793] RAX: 000000000000002d RBX: ffff8ba94d0d9000 RCX: 0000000000000000
>> [94543.132125] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8bc0df91c600
>> [94543.140460] RBP: ffffffffc13b52c0 R08: 0000000000000000 R09: ffffce35f8c53be0
>> [94543.148801] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8ba94af0e000
>> [94543.157150] R13: ffff8ba94d0d9000 R14: 0000000000000004 R15: ffff8ba946d9a000
>> [94543.165505] FS:  00007fb1c607c840(0000) GS:ffff8bc1520e4000(0000) knlGS:0000000000000000
>> [94543.174943] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [94543.181769] CR2: 00000000d64e5000 CR3: 000000189c9f2002 CR4: 00000000003726f0
>> [94543.190160] Call Trace:
>> [94543.193317]  <TASK>
>> [94543.196088]  kill_anon_super+0x12/0x40
>> [94543.200719]  ceph_kill_sb+0xda/0x2c0 [ceph]
>> [94543.205877]  ? radix_tree_delete_item+0x68/0xd0
>> [94543.211395]  deactivate_locked_super+0x31/0xb0
>> [94543.216815]  cleanup_mnt+0xcb/0x110
>> [94543.221169]  task_work_run+0x58/0x80
>> [94543.225629]  exit_to_user_mode_loop+0x13f/0x4d0
>> [94543.231163]  do_syscall_64+0x1ef/0x840
>> [94543.235827]  ? do_syscall_64+0x101/0x840
>> [94543.240687]  ? do_user_addr_fault+0x20e/0x6b0
>> [94543.246036]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> [94543.252166] RIP: 0033:0x7fb1c5f0ccab
>> [94543.256650] Code: 73 31 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 39 31 0e 00 f7 d8
>> [94543.278690] RSP: 002b:00007ffe96f80ea8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
>> [94543.287710] RAX: 0000000000000000 RBX: 00007fb1c61fb264 RCX: 00007fb1c5f0ccab
>> [94543.296251] RDX: fffffffffffffe88 RSI: 0000000000000000 RDI: 000055ae9dcc6ec0
>> [94543.304801] RBP: 000055ae9dcc6c90 R08: 0000000000000000 R09: 00007ffe96f7fc50
>> [94543.313358] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>> [94543.321922] R13: 000055ae9dcc6ec0 R14: 000055ae9dcc6da0 R15: 0000000000000000
>> 
>> So make it wait until all the flying requests returns for clean and safe
>> umount.
>> 
>> Fixes: 1464de9f813e ("ceph: wait for OSD requests' callbacks to finish when unmounting")
>> Signed-off-by: Li Lei <lilei24@kuaishou.com>
>> ---
>> fs/ceph/super.c | 11 ++++-------
>> 1 file changed, 4 insertions(+), 7 deletions(-)
>> 
>> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
>> index 2aed6b3..48e63c1 100644
>> --- a/fs/ceph/super.c
>> +++ b/fs/ceph/super.c
>> @@ -1569,13 +1569,10 @@ static void ceph_kill_sb(struct super_block *s)
>>      spin_unlock(&mdsc->stopping_lock);
>> 
>>      if (wait && atomic_read(&mdsc->stopping_blockers)) {
>> -             long timeleft = wait_for_completion_killable_timeout(
>> -                                     &mdsc->stopping_waiter,
>> -                                     fsc->client->options->mount_timeout);
>> -             if (!timeleft) /* timed out */
>> -                     pr_warn_client(cl, "umount timed out, %ld\n", timeleft);
>> -             else if (timeleft < 0) /* killed */
>> -                     pr_warn_client(cl, "umount was killed, %ld\n", timeleft);
>> +             int rc = wait_for_completion_killable(
>> +                                     &mdsc->stopping_waiter);
>> +             if (rc < 0) /* killed */
>> +                     pr_warn_client(cl, "umount was killed\n");
>>      }
>> 
>>      mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
> 
> I am not completely sure that it's right approach to fix the problem. If I
> understood correctly, we have situation that somehow OSD is down and dd process
> has been killed.
Yes. This this the situation we met in our production environment. We found 1/4
of our client-nodes had kcompactd task hung because of requesting for a folio_lock.
After using crash tool to debug, we figured that page belonged to a cephfs file,
howevere that cephfs had been umounted. 

The 'Steps to Reproduce’ is used to emulate the real workload.


> It's not normal and usual situation. So, your suggestion is to
> hung in unmount process forever because we cannot service the flying requests.
> Is it really good way to fix the problem? Probably, we need to have more smart
> logic here of waiting flying requests ending. But timeout makes sense to prevent
> from the situation that something is going wrong and we cannot finish unmount at
> all. What do you think? Maybe we need to have some loop that checks the state by
> waking up after some timeout. If the number of flying requests is decreasing,
> then we should wait. But if nothing is changing with time, then it means that
> something is wrong and it makes sense to unmount anyway. Because, finally, we
> will have some call trace in system log, anyway. Does it make sense?
> 
> Also, we have likewise pattern in the ceph_kill_sb():
> 
>        if (atomic64_read(&mdsc->dirty_folios) > 0) {
>                wait_queue_head_t *wq = &mdsc->flush_end_wq;
>                long timeleft = wait_event_killable_timeout(*wq,
>                                        atomic64_read(&mdsc->dirty_folios) <=
> 0,
>                                        fsc->client->options->mount_timeout);
>                if (!timeleft) /* timed out */
>                        pr_warn_client(cl, "umount timed out, %ld\n",
> timeleft);
>                else if (timeleft < 0) /* killed */
>                        pr_warn_client(cl, "umount was killed, %ld\n",
> timeleft);
>        }
> 

I understand your concern. This patch is a truly straightforward workaround.
So, how about we just abort OSD requests if they take too long to return
during unmounting ?

Compared to leaving some locked folios in the system, return -EIO to those
OSD requests which may never return is more reasonable. This is because locked
folios left behind Cephfs unmount may block kcompactd and render the entire
system unstable. 

Besides, successful unmounting doesn't guarantee dirty buffers are successfully
written to the backend. For example, when a buffered write returns, the local
filesystem may encounter bad blocks on the local disk and -EIO is returned to
the writeback kworkers.  Therefore, in our scenario, does it make sense if we
treat the OSD requests that have been flight for a certain period as failed,
And return -EIO to the caller?

Lastly, I think we can just use stopping blockers to replace dirty_folios to
simplify the unmounting and wait process. Accordingly, in ceph_kill_sb(), we
only need to wait for stopping_blockers count to drop to zero. If a timeout
occurs, we can cancel all the inflight requests and print some warning messages.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: Re:  [PATCH] ceph: fix potential stray locked folios during umount
  2026-04-24 19:44   ` 李磊
@ 2026-04-24 22:02     ` Viacheslav Dubeyko
  2026-04-26 15:38       ` 李磊
  0 siblings, 1 reply; 8+ messages in thread
From: Viacheslav Dubeyko @ 2026-04-24 22:02 UTC (permalink / raw)
  To: lilei24@kuaishou.com
  Cc: ceph-devel@vger.kernel.org, idryomov@gmail.com, Alex Markuze,
	slava@dubeyko.com, Xiubo Li, noctis.akm@gmail.com,
	linux-kernel@vger.kernel.org

On Fri, 2026-04-24 at 19:44 +0000, 李磊 wrote:
> 
> > 2026年4月24日 03:30，Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> 写道：
> > 
> > 安全提示：此邮件来自公司外部。除非您确认发件人身份可信且邮件内容不含可疑信息，否则请勿回复或转发邮件、点击邮件链接或打开附件。
> > 
> > 
> > On Sat, 2026-04-18 at 21:39 +0800, Li Lei wrote:
> > > During umount, we only wait for stopping_blockers to drop to zero for
> > > a certain time specified by mount_timeout, and continue the rest of
> > > the procedure even if there are inflight requests. This behavior may
> > > leave some folios locked even after the cephfs umounted, which causes
> > > other kernel threads to hung.
> > > 
> > > Buffered read process calls filemap_update_page() and waits on
> > > folio_put_wait_locked() with TASK_KILLABLE flag set, which means this
> > > process could be killed and the filesystem could be umount successfully
> > > (no file opened in it). Umount calls truncate_inode_pages() and waits
> > > on locked pages for those inodes whose i_count == 0. In these way,
> > > there would be no locked folios for this filesystem left in system
> > > after umount exits.
> > > 
> > > However, things are different for cephfs. Cephfs calls ihold() and
> > > submits osd request for buffered read and gets folio locked. Once the
> > > buffered read process is killed, the inode will be skipped in
> > > evict_inodes(), because its i_count > 0. Forthemore, the folios are
> > > still locked. It can only be unlocked in netfs_unlock_read_folio().
> > > 
> > > stopping_blocks should block umount from proceeding, but it only waits
> > > for mount_timeout (default 60s) even if there are still flying request
> > > out there, leaving stray locked folios. Other kthread, like kcompactd
> > > , could be stuck on those locked folioes forever.
> > > 
> > > Steps to Reproduce:
> > > 1. echo 3 > /proc/sys/vm/drop_caches.
> > > 2. dd if=cephfs/xxx.img of=/dev/null
> > >    Make sure cephfs/xxx.img is big enough to make time for us to do the
> > >    following command
> > > 3. execute 'systemctl stop ceph-osd@*' on the osd nodes
> > >    It would be great if you have a tiny cluster. Stopping all the osds
> > >    would be much easier.
> > > 4. kill -9 `pidof dd`.
> > >    Buffered read process must be killed at that moment. But inflight
> > >    read requests can be observed in the /sys/kernel/debug/ceph/xxxx/osdc
> > > 5. umount cephfs
> > >    Wait for 60s if you mount cephfs by using the default mount option.
> > > 
> > > We got the warning:
> > > ceph: [b2c9a006-9ad8-48e9-8257-6fb1e1b91014 66562]: umount timed out, 0
> > > VFS: Busy inodes after unmount of ceph (ceph)
> > > 
> > > if check_data_corruption option disable, kcompactd may stuck in the
> > > future. If it is eanbled, we catch the bug immediately.
> > > 
> > > [94543.042953] ------------[ cut here ]------------
> > > [94543.049391] kernel BUG at fs/super.c:654!
> > > [94543.054171] Oops: invalid opcode: 0000 [#1] SMP PTI
> > > [94543.059881] CPU: 25 UID: 0 PID: 3451674 Comm: umount Kdump: loaded Tainted: G S         OE       7.0.0-dirty #2 PREEMPTLAZY
> > > [94543.072678] Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> > > [94543.080918] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
> > > [94543.089755] RIP: 0010:generic_shutdown_super+0x111/0x120
> > > [94543.095982] Code: cc cc e8 c2 1f ef ff 48 8b bb d0 00 00 00 eb db 48 8b 43 28 48 8d b3 98 03 00 00 48 c7 c7 90 09 55 8c 48 8b 10 e8 0f c3 cc ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90
> > > [94543.117607] RSP: 0018:ffffce35f8c53d40 EFLAGS: 00010246
> > > [94543.123793] RAX: 000000000000002d RBX: ffff8ba94d0d9000 RCX: 0000000000000000
> > > [94543.132125] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8bc0df91c600
> > > [94543.140460] RBP: ffffffffc13b52c0 R08: 0000000000000000 R09: ffffce35f8c53be0
> > > [94543.148801] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8ba94af0e000
> > > [94543.157150] R13: ffff8ba94d0d9000 R14: 0000000000000004 R15: ffff8ba946d9a000
> > > [94543.165505] FS:  00007fb1c607c840(0000) GS:ffff8bc1520e4000(0000) knlGS:0000000000000000
> > > [94543.174943] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [94543.181769] CR2: 00000000d64e5000 CR3: 000000189c9f2002 CR4: 00000000003726f0
> > > [94543.190160] Call Trace:
> > > [94543.193317]  <TASK>
> > > [94543.196088]  kill_anon_super+0x12/0x40
> > > [94543.200719]  ceph_kill_sb+0xda/0x2c0 [ceph]
> > > [94543.205877]  ? radix_tree_delete_item+0x68/0xd0
> > > [94543.211395]  deactivate_locked_super+0x31/0xb0
> > > [94543.216815]  cleanup_mnt+0xcb/0x110
> > > [94543.221169]  task_work_run+0x58/0x80
> > > [94543.225629]  exit_to_user_mode_loop+0x13f/0x4d0
> > > [94543.231163]  do_syscall_64+0x1ef/0x840
> > > [94543.235827]  ? do_syscall_64+0x101/0x840
> > > [94543.240687]  ? do_user_addr_fault+0x20e/0x6b0
> > > [94543.246036]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > [94543.252166] RIP: 0033:0x7fb1c5f0ccab
> > > [94543.256650] Code: 73 31 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 39 31 0e 00 f7 d8
> > > [94543.278690] RSP: 002b:00007ffe96f80ea8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
> > > [94543.287710] RAX: 0000000000000000 RBX: 00007fb1c61fb264 RCX: 00007fb1c5f0ccab
> > > [94543.296251] RDX: fffffffffffffe88 RSI: 0000000000000000 RDI: 000055ae9dcc6ec0
> > > [94543.304801] RBP: 000055ae9dcc6c90 R08: 0000000000000000 R09: 00007ffe96f7fc50
> > > [94543.313358] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> > > [94543.321922] R13: 000055ae9dcc6ec0 R14: 000055ae9dcc6da0 R15: 0000000000000000
> > > 
> > > So make it wait until all the flying requests returns for clean and safe
> > > umount.
> > > 
> > > Fixes: 1464de9f813e ("ceph: wait for OSD requests' callbacks to finish when unmounting")
> > > Signed-off-by: Li Lei <lilei24@kuaishou.com>
> > > ---
> > > fs/ceph/super.c | 11 ++++-------
> > > 1 file changed, 4 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> > > index 2aed6b3..48e63c1 100644
> > > --- a/fs/ceph/super.c
> > > +++ b/fs/ceph/super.c
> > > @@ -1569,13 +1569,10 @@ static void ceph_kill_sb(struct super_block *s)
> > >      spin_unlock(&mdsc->stopping_lock);
> > > 
> > >      if (wait && atomic_read(&mdsc->stopping_blockers)) {
> > > -             long timeleft = wait_for_completion_killable_timeout(
> > > -                                     &mdsc->stopping_waiter,
> > > -                                     fsc->client->options->mount_timeout);
> > > -             if (!timeleft) /* timed out */
> > > -                     pr_warn_client(cl, "umount timed out, %ld\n", timeleft);
> > > -             else if (timeleft < 0) /* killed */
> > > -                     pr_warn_client(cl, "umount was killed, %ld\n", timeleft);
> > > +             int rc = wait_for_completion_killable(
> > > +                                     &mdsc->stopping_waiter);
> > > +             if (rc < 0) /* killed */
> > > +                     pr_warn_client(cl, "umount was killed\n");
> > >      }
> > > 
> > >      mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
> > 
> > I am not completely sure that it's right approach to fix the problem. If I
> > understood correctly, we have situation that somehow OSD is down and dd process
> > has been killed.
> Yes. This this the situation we met in our production environment. We found 1/4
> of our client-nodes had kcompactd task hung because of requesting for a folio_lock.
> After using crash tool to debug, we figured that page belonged to a cephfs file,
> howevere that cephfs had been umounted. 
> 
> The 'Steps to Reproduce’ is used to emulate the real workload.
> 
> 
> > It's not normal and usual situation. So, your suggestion is to
> > hung in unmount process forever because we cannot service the flying requests.
> > Is it really good way to fix the problem? Probably, we need to have more smart
> > logic here of waiting flying requests ending. But timeout makes sense to prevent
> > from the situation that something is going wrong and we cannot finish unmount at
> > all. What do you think? Maybe we need to have some loop that checks the state by
> > waking up after some timeout. If the number of flying requests is decreasing,
> > then we should wait. But if nothing is changing with time, then it means that
> > something is wrong and it makes sense to unmount anyway. Because, finally, we
> > will have some call trace in system log, anyway. Does it make sense?
> > 
> > Also, we have likewise pattern in the ceph_kill_sb():
> > 
> >        if (atomic64_read(&mdsc->dirty_folios) > 0) {
> >                wait_queue_head_t *wq = &mdsc->flush_end_wq;
> >                long timeleft = wait_event_killable_timeout(*wq,
> >                                        atomic64_read(&mdsc->dirty_folios) <=
> > 0,
> >                                        fsc->client->options->mount_timeout);
> >                if (!timeleft) /* timed out */
> >                        pr_warn_client(cl, "umount timed out, %ld\n",
> > timeleft);
> >                else if (timeleft < 0) /* killed */
> >                        pr_warn_client(cl, "umount was killed, %ld\n",
> > timeleft);
> >        }
> > 
> 
> I understand your concern. This patch is a truly straightforward workaround.
> So, how about we just abort OSD requests if they take too long to return
> during unmounting ?

The question here is how to define that OSD requests taking too long time?
Potentially, processing could be really slow for some reason. From one point of
view, if we know that destination OSD is down or we have network partitioning,
then it doesn't make sense to wait to long. I am thinking about potential
checking of number of OSD requests. If this number is going down, then it needs
to wait, otherwise, if this number doesn't change, then it needs to finish the
unmount without waiting. Does it make sense?

> 
> Compared to leaving some locked folios in the system, return -EIO to those
> OSD requests which may never return is more reasonable. This is because locked
> folios left behind Cephfs unmount may block kcompactd and render the entire
> system unstable. 
> 

I agree. It makes sense. If we know that some OSD requests will never return,
then we need to manage this situation in better way. But how could we detect
that OSD request will never return?

> Besides, successful unmounting doesn't guarantee dirty buffers are successfully
> written to the backend. For example, when a buffered write returns, the local
> filesystem may encounter bad blocks on the local disk and -EIO is returned to
> the writeback kworkers.  Therefore, in our scenario, does it make sense if we
> treat the OSD requests that have been flight for a certain period as failed,
> And return -EIO to the caller?

This is the main question: how to detect that OSD requests are failed?

As far as I can see, if an OSD is down and osd_request_timeout is not set (the
default), a stalled write can block unmount indefinitely. I assume that you have
the osd_request_timeout is not set. So, maybe, we need to re-consider the policy
of management the stuck OSD requests during unmount.

Laggy OSD path: if any request's r_stamp is older than osd_keepalive_timeout,
the OSD goes on a slow_osds list and ceph_con_keepalive() is called, sending a
keepalive byte over TCP. If the TCP connection is silently broken, the keepalive
write will fail, triggering con_fault().

Timed-out request path: if osd_request_timeout is set (default 0 = disabled),
requests older than that deadline are aborted with -ETIMEDOUT via
abort_request().

Homeless requests: requests that can't be mapped to any OSD are also checked
against osd_request_timeout.

The ceph_con_keepalive_expired() uses the timestamp of the last keepalive
acknowledgement (con->last_keepalive_ack) to determine whether the peer has gone
silent beyond interval. When this fires, the connection is considered dead and
con_fault() is triggered.

So, we need to find a proper approach of finding a good solution from available
functionality.

> 
> Lastly, I think we can just use stopping blockers to replace dirty_folios to
> simplify the unmounting and wait process. Accordingly, in ceph_kill_sb(), we
> only need to wait for stopping_blockers count to drop to zero. If a timeout
> occurs, we can cancel all the inflight requests and print some warning messages.

The dirty_folios is "how much dirty data is still to be flushed," while
stopping_blockers is "how many threads are currently inside code that holds an
implicit reference to the MDS client." Unmount must drain both in order, and the
two counters solve entirely different races.

The mdsc->dirty_folios — count of dirty page-cache folios not yet written back.
It incremented in ceph_dirty_folio() at the moment of folio transitions from
clean to dirty in the page cache. It decremented in the OSD write-completion
callback after the OSD acknowledges the writeback and end_page_writeback() is
called. It represents the number of file-data folios that have been dirtied
(modified in the page cache) but whose data has not yet reached an OSD (i.e.,
writeback is pending or in flight).

The mdsc->stopping_blockers counts of in-progress MDS/OSD message handlers. It
incremented by ceph_inc_mds_stopping_blocker() / ceph_inc_osd_stopping_blocker()
at the entry of any async operation that must not be interrupted mid-flight by
shutdown. It decremented at the exit of the async operations' handlers.

I don't think that we can use the stopping blockers only.

Thanks,
Slava.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RE: Re:  [PATCH] ceph: fix potential stray locked folios during umount
  2026-04-24 22:02     ` Viacheslav Dubeyko
@ 2026-04-26 15:38       ` 李磊
  2026-04-27 21:52         ` Viacheslav Dubeyko
  0 siblings, 1 reply; 8+ messages in thread
From: 李磊 @ 2026-04-26 15:38 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: 李磊, ceph-devel@vger.kernel.org, idryomov@gmail.com,
	Alex Markuze, slava@dubeyko.com, Xiubo Li, noctis.akm@gmail.com,
	linux-kernel@vger.kernel.org



> 2026年4月25日 06:02，Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> 写道：
> 
> 安全提示：此邮件来自公司外部。除非您确认发件人身份可信且邮件内容不含可疑信息，否则请勿回复或转发邮件、点击邮件链接或打开附件。
> 
> 
> On Fri, 2026-04-24 at 19:44 +0000, 李磊 wrote:
>> 
>>> 2026年4月24日 03:30，Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> 写道：
>>> 
>>> 安全提示：此邮件来自公司外部。除非您确认发件人身份可信且邮件内容不含可疑信息，否则请勿回复或转发邮件、点击邮件链接或打开附件。
>>> 
>>> 
>>> On Sat, 2026-04-18 at 21:39 +0800, Li Lei wrote:
>>>> During umount, we only wait for stopping_blockers to drop to zero for
>>>> a certain time specified by mount_timeout, and continue the rest of
>>>> the procedure even if there are inflight requests. This behavior may
>>>> leave some folios locked even after the cephfs umounted, which causes
>>>> other kernel threads to hung.
>>>> 
>>>> Buffered read process calls filemap_update_page() and waits on
>>>> folio_put_wait_locked() with TASK_KILLABLE flag set, which means this
>>>> process could be killed and the filesystem could be umount successfully
>>>> (no file opened in it). Umount calls truncate_inode_pages() and waits
>>>> on locked pages for those inodes whose i_count == 0. In these way,
>>>> there would be no locked folios for this filesystem left in system
>>>> after umount exits.
>>>> 
>>>> However, things are different for cephfs. Cephfs calls ihold() and
>>>> submits osd request for buffered read and gets folio locked. Once the
>>>> buffered read process is killed, the inode will be skipped in
>>>> evict_inodes(), because its i_count > 0. Forthemore, the folios are
>>>> still locked. It can only be unlocked in netfs_unlock_read_folio().
>>>> 
>>>> stopping_blocks should block umount from proceeding, but it only waits
>>>> for mount_timeout (default 60s) even if there are still flying request
>>>> out there, leaving stray locked folios. Other kthread, like kcompactd
>>>> , could be stuck on those locked folioes forever.
>>>> 
>>>> Steps to Reproduce:
>>>> 1. echo 3 > /proc/sys/vm/drop_caches.
>>>> 2. dd if=cephfs/xxx.img of=/dev/null
>>>>   Make sure cephfs/xxx.img is big enough to make time for us to do the
>>>>   following command
>>>> 3. execute 'systemctl stop ceph-osd@*' on the osd nodes
>>>>   It would be great if you have a tiny cluster. Stopping all the osds
>>>>   would be much easier.
>>>> 4. kill -9 `pidof dd`.
>>>>   Buffered read process must be killed at that moment. But inflight
>>>>   read requests can be observed in the /sys/kernel/debug/ceph/xxxx/osdc
>>>> 5. umount cephfs
>>>>   Wait for 60s if you mount cephfs by using the default mount option.
>>>> 
>>>> We got the warning:
>>>> ceph: [b2c9a006-9ad8-48e9-8257-6fb1e1b91014 66562]: umount timed out, 0
>>>> VFS: Busy inodes after unmount of ceph (ceph)
>>>> 
>>>> if check_data_corruption option disable, kcompactd may stuck in the
>>>> future. If it is eanbled, we catch the bug immediately.
>>>> 
>>>> [94543.042953] ------------[ cut here ]------------
>>>> [94543.049391] kernel BUG at fs/super.c:654!
>>>> [94543.054171] Oops: invalid opcode: 0000 [#1] SMP PTI
>>>> [94543.059881] CPU: 25 UID: 0 PID: 3451674 Comm: umount Kdump: loaded Tainted: G S         OE       7.0.0-dirty #2 PREEMPTLAZY
>>>> [94543.072678] Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>>>> [94543.080918] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
>>>> [94543.089755] RIP: 0010:generic_shutdown_super+0x111/0x120
>>>> [94543.095982] Code: cc cc e8 c2 1f ef ff 48 8b bb d0 00 00 00 eb db 48 8b 43 28 48 8d b3 98 03 00 00 48 c7 c7 90 09 55 8c 48 8b 10 e8 0f c3 cc ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90
>>>> [94543.117607] RSP: 0018:ffffce35f8c53d40 EFLAGS: 00010246
>>>> [94543.123793] RAX: 000000000000002d RBX: ffff8ba94d0d9000 RCX: 0000000000000000
>>>> [94543.132125] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8bc0df91c600
>>>> [94543.140460] RBP: ffffffffc13b52c0 R08: 0000000000000000 R09: ffffce35f8c53be0
>>>> [94543.148801] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8ba94af0e000
>>>> [94543.157150] R13: ffff8ba94d0d9000 R14: 0000000000000004 R15: ffff8ba946d9a000
>>>> [94543.165505] FS:  00007fb1c607c840(0000) GS:ffff8bc1520e4000(0000) knlGS:0000000000000000
>>>> [94543.174943] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [94543.181769] CR2: 00000000d64e5000 CR3: 000000189c9f2002 CR4: 00000000003726f0
>>>> [94543.190160] Call Trace:
>>>> [94543.193317]  <TASK>
>>>> [94543.196088]  kill_anon_super+0x12/0x40
>>>> [94543.200719]  ceph_kill_sb+0xda/0x2c0 [ceph]
>>>> [94543.205877]  ? radix_tree_delete_item+0x68/0xd0
>>>> [94543.211395]  deactivate_locked_super+0x31/0xb0
>>>> [94543.216815]  cleanup_mnt+0xcb/0x110
>>>> [94543.221169]  task_work_run+0x58/0x80
>>>> [94543.225629]  exit_to_user_mode_loop+0x13f/0x4d0
>>>> [94543.231163]  do_syscall_64+0x1ef/0x840
>>>> [94543.235827]  ? do_syscall_64+0x101/0x840
>>>> [94543.240687]  ? do_user_addr_fault+0x20e/0x6b0
>>>> [94543.246036]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>> [94543.252166] RIP: 0033:0x7fb1c5f0ccab
>>>> [94543.256650] Code: 73 31 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 39 31 0e 00 f7 d8
>>>> [94543.278690] RSP: 002b:00007ffe96f80ea8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
>>>> [94543.287710] RAX: 0000000000000000 RBX: 00007fb1c61fb264 RCX: 00007fb1c5f0ccab
>>>> [94543.296251] RDX: fffffffffffffe88 RSI: 0000000000000000 RDI: 000055ae9dcc6ec0
>>>> [94543.304801] RBP: 000055ae9dcc6c90 R08: 0000000000000000 R09: 00007ffe96f7fc50
>>>> [94543.313358] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>>>> [94543.321922] R13: 000055ae9dcc6ec0 R14: 000055ae9dcc6da0 R15: 0000000000000000
>>>> 
>>>> So make it wait until all the flying requests returns for clean and safe
>>>> umount.
>>>> 
>>>> Fixes: 1464de9f813e ("ceph: wait for OSD requests' callbacks to finish when unmounting")
>>>> Signed-off-by: Li Lei <lilei24@kuaishou.com>
>>>> ---
>>>> fs/ceph/super.c | 11 ++++-------
>>>> 1 file changed, 4 insertions(+), 7 deletions(-)
>>>> 
>>>> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
>>>> index 2aed6b3..48e63c1 100644
>>>> --- a/fs/ceph/super.c
>>>> +++ b/fs/ceph/super.c
>>>> @@ -1569,13 +1569,10 @@ static void ceph_kill_sb(struct super_block *s)
>>>>     spin_unlock(&mdsc->stopping_lock);
>>>> 
>>>>     if (wait && atomic_read(&mdsc->stopping_blockers)) {
>>>> -             long timeleft = wait_for_completion_killable_timeout(
>>>> -                                     &mdsc->stopping_waiter,
>>>> -                                     fsc->client->options->mount_timeout);
>>>> -             if (!timeleft) /* timed out */
>>>> -                     pr_warn_client(cl, "umount timed out, %ld\n", timeleft);
>>>> -             else if (timeleft < 0) /* killed */
>>>> -                     pr_warn_client(cl, "umount was killed, %ld\n", timeleft);
>>>> +             int rc = wait_for_completion_killable(
>>>> +                                     &mdsc->stopping_waiter);
>>>> +             if (rc < 0) /* killed */
>>>> +                     pr_warn_client(cl, "umount was killed\n");
>>>>     }
>>>> 
>>>>     mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
>>> 
>>> I am not completely sure that it's right approach to fix the problem. If I
>>> understood correctly, we have situation that somehow OSD is down and dd process
>>> has been killed.
>> Yes. This this the situation we met in our production environment. We found 1/4
>> of our client-nodes had kcompactd task hung because of requesting for a folio_lock.
>> After using crash tool to debug, we figured that page belonged to a cephfs file,
>> howevere that cephfs had been umounted.
>> 
>> The 'Steps to Reproduce’ is used to emulate the real workload.
>> 
>> 
>>> It's not normal and usual situation. So, your suggestion is to
>>> hung in unmount process forever because we cannot service the flying requests.
>>> Is it really good way to fix the problem? Probably, we need to have more smart
>>> logic here of waiting flying requests ending. But timeout makes sense to prevent
>>> from the situation that something is going wrong and we cannot finish unmount at
>>> all. What do you think? Maybe we need to have some loop that checks the state by
>>> waking up after some timeout. If the number of flying requests is decreasing,
>>> then we should wait. But if nothing is changing with time, then it means that
>>> something is wrong and it makes sense to unmount anyway. Because, finally, we
>>> will have some call trace in system log, anyway. Does it make sense?
>>> 
>>> Also, we have likewise pattern in the ceph_kill_sb():
>>> 
>>>       if (atomic64_read(&mdsc->dirty_folios) > 0) {
>>>               wait_queue_head_t *wq = &mdsc->flush_end_wq;
>>>               long timeleft = wait_event_killable_timeout(*wq,
>>>                                       atomic64_read(&mdsc->dirty_folios) <=
>>> 0,
>>>                                       fsc->client->options->mount_timeout);
>>>               if (!timeleft) /* timed out */
>>>                       pr_warn_client(cl, "umount timed out, %ld\n",
>>> timeleft);
>>>               else if (timeleft < 0) /* killed */
>>>                       pr_warn_client(cl, "umount was killed, %ld\n",
>>> timeleft);
>>>       }
>>> 
>> 
>> I understand your concern. This patch is a truly straightforward workaround.
>> So, how about we just abort OSD requests if they take too long to return
>> during unmounting ?
> 
> The question here is how to define that OSD requests taking too long time?
> Potentially, processing could be really slow for some reason. From one point of
> view, if we know that destination OSD is down or we have network partitioning,
> then it doesn't make sense to wait to long. I am thinking about potential
> checking of number of OSD requests. If this number is going down, then it needs
> to wait, otherwise, if this number doesn't change, then it needs to finish the
> unmount without waiting. Does it make sense?
> 
>> 
>> Compared to leaving some locked folios in the system, return -EIO to those
>> OSD requests which may never return is more reasonable. This is because locked
>> folios left behind Cephfs unmount may block kcompactd and render the entire
>> system unstable.
>> 
> 
> I agree. It makes sense. If we know that some OSD requests will never return,
> then we need to manage this situation in better way. But how could we detect
> that OSD request will never return?
> 
>> Besides, successful unmounting doesn't guarantee dirty buffers are successfully
>> written to the backend. For example, when a buffered write returns, the local
>> filesystem may encounter bad blocks on the local disk and -EIO is returned to
>> the writeback kworkers.  Therefore, in our scenario, does it make sense if we
>> treat the OSD requests that have been flight for a certain period as failed,
>> And return -EIO to the caller?
> 
> This is the main question: how to detect that OSD requests are failed?
> 
> As far as I can see, if an OSD is down and osd_request_timeout is not set (the
> default), a stalled write can block unmount indefinitely. I assume that you have
> the osd_request_timeout is not set. So, maybe, we need to re-consider the policy
> of management the stuck OSD requests during unmount.
> 
> Laggy OSD path: if any request's r_stamp is older than osd_keepalive_timeout,
> the OSD goes on a slow_osds list and ceph_con_keepalive() is called, sending a
> keepalive byte over TCP. If the TCP connection is silently broken, the keepalive
> write will fail, triggering con_fault().
> 
> Timed-out request path: if osd_request_timeout is set (default 0 = disabled),
> requests older than that deadline are aborted with -ETIMEDOUT via
> abort_request().
> 
> Homeless requests: requests that can't be mapped to any OSD are also checked
> against osd_request_timeout.
> 
> The ceph_con_keepalive_expired() uses the timestamp of the last keepalive
> acknowledgement (con->last_keepalive_ack) to determine whether the peer has gone
> silent beyond interval. When this fires, the connection is considered dead and
> con_fault() is triggered.
> 
> So, we need to find a proper approach of finding a good solution from available
> functionality.

I agree. Instead of waiting for inflight requests infinitely or aborting OSD
requests brutally, you prefer a much more elegant way to deal with this dilemma.
It’s cool, but it seems complex and more time is needed to fix locked folios leakage
on the client nodes. Is there any acceptable short-term scheme?

I find it is not easy to work around this issue by merely increasing opt->mount_timeout.
Both dirty_folios and stopping_blockers wait with TASK_KILLABLE set, which means the
unmount process’s wait can be interrupted by a kill signal and leave some locked folios
after unmount regardless of the mount_timeout setting.

> 
>> 
>> Lastly, I think we can just use stopping blockers to replace dirty_folios to
>> simplify the unmounting and wait process. Accordingly, in ceph_kill_sb(), we
>> only need to wait for stopping_blockers count to drop to zero. If a timeout
>> occurs, we can cancel all the inflight requests and print some warning messages.
> 
> The dirty_folios is "how much dirty data is still to be flushed," while
> stopping_blockers is "how many threads are currently inside code that holds an
> implicit reference to the MDS client." Unmount must drain both in order, and the
> two counters solve entirely different races.
> 
> The mdsc->dirty_folios — count of dirty page-cache folios not yet written back.
> It incremented in ceph_dirty_folio() at the moment of folio transitions from
> clean to dirty in the page cache. It decremented in the OSD write-completion
> callback after the OSD acknowledges the writeback and end_page_writeback() is
> called. It represents the number of file-data folios that have been dirtied
> (modified in the page cache) but whose data has not yet reached an OSD (i.e.,
> writeback is pending or in flight).
> 
> The mdsc->stopping_blockers counts of in-progress MDS/OSD message handlers. It
> incremented by ceph_inc_mds_stopping_blocker() / ceph_inc_osd_stopping_blocker()
> at the entry of any async operation that must not be interrupted mid-flight by
> shutdown. It decremented at the exit of the async operations' handlers.

Thanks for the detailed explanation of dirty_folios and stopping_blockers. However,
it is still a bit confusing that we need to wait dirty foios to decrease to zero
after sync_filesystem() in ceph_kill_sb(). It seems that the semantics of 
sync_filesystem() is broken in Cephfs? Because in the common situation, sync_filesystem()
guarantees dirty folios are flushed to the backend, and PG_dirty of these folios
are cleared after it returns.

Thanks,
Li


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RE: Re:  [PATCH] ceph: fix potential stray locked folios during umount
  2026-04-26 15:38       ` 李磊
@ 2026-04-27 21:52         ` Viacheslav Dubeyko
  2026-04-29 14:42           ` 李磊
  0 siblings, 1 reply; 8+ messages in thread
From: Viacheslav Dubeyko @ 2026-04-27 21:52 UTC (permalink / raw)
  To: lilei24@kuaishou.com
  Cc: Xiubo Li, ceph-devel@vger.kernel.org, idryomov@gmail.com,
	slava@dubeyko.com, Alex Markuze, noctis.akm@gmail.com,
	linux-kernel@vger.kernel.org

On Sun, 2026-04-26 at 15:38 +0000, 李磊 wrote:
> 
> > > 

<skipped>

> > > I understand your concern. This patch is a truly straightforward workaround.
> > > So, how about we just abort OSD requests if they take too long to return
> > > during unmounting ?
> > 
> > The question here is how to define that OSD requests taking too long time?
> > Potentially, processing could be really slow for some reason. From one point of
> > view, if we know that destination OSD is down or we have network partitioning,
> > then it doesn't make sense to wait to long. I am thinking about potential
> > checking of number of OSD requests. If this number is going down, then it needs
> > to wait, otherwise, if this number doesn't change, then it needs to finish the
> > unmount without waiting. Does it make sense?
> > 
> > > 
> > > Compared to leaving some locked folios in the system, return -EIO to those
> > > OSD requests which may never return is more reasonable. This is because locked
> > > folios left behind Cephfs unmount may block kcompactd and render the entire
> > > system unstable.
> > > 
> > 
> > I agree. It makes sense. If we know that some OSD requests will never return,
> > then we need to manage this situation in better way. But how could we detect
> > that OSD request will never return?
> > 
> > > Besides, successful unmounting doesn't guarantee dirty buffers are successfully
> > > written to the backend. For example, when a buffered write returns, the local
> > > filesystem may encounter bad blocks on the local disk and -EIO is returned to
> > > the writeback kworkers.  Therefore, in our scenario, does it make sense if we
> > > treat the OSD requests that have been flight for a certain period as failed,
> > > And return -EIO to the caller?
> > 
> > This is the main question: how to detect that OSD requests are failed?
> > 
> > As far as I can see, if an OSD is down and osd_request_timeout is not set (the
> > default), a stalled write can block unmount indefinitely. I assume that you have
> > the osd_request_timeout is not set. So, maybe, we need to re-consider the policy
> > of management the stuck OSD requests during unmount.
> > 
> > Laggy OSD path: if any request's r_stamp is older than osd_keepalive_timeout,
> > the OSD goes on a slow_osds list and ceph_con_keepalive() is called, sending a
> > keepalive byte over TCP. If the TCP connection is silently broken, the keepalive
> > write will fail, triggering con_fault().
> > 
> > Timed-out request path: if osd_request_timeout is set (default 0 = disabled),
> > requests older than that deadline are aborted with -ETIMEDOUT via
> > abort_request().
> > 
> > Homeless requests: requests that can't be mapped to any OSD are also checked
> > against osd_request_timeout.
> > 
> > The ceph_con_keepalive_expired() uses the timestamp of the last keepalive
> > acknowledgement (con->last_keepalive_ack) to determine whether the peer has gone
> > silent beyond interval. When this fires, the connection is considered dead and
> > con_fault() is triggered.
> > 
> > So, we need to find a proper approach of finding a good solution from available
> > functionality.
> 
> I agree. Instead of waiting for inflight requests infinitely or aborting OSD
> requests brutally, you prefer a much more elegant way to deal with this dilemma.
> It’s cool, but it seems complex and more time is needed to fix locked folios leakage
> on the client nodes. Is there any acceptable short-term scheme?

Have you tried to set up the osd_request_timeout and to see how CephFS kernel
client will behave afterwards? Will it change anything?

> 
> I find it is not easy to work around this issue by merely increasing opt->mount_timeout.
> Both dirty_folios and stopping_blockers wait with TASK_KILLABLE set, which means the
> unmount process’s wait can be interrupted by a kill signal and leave some locked folios
> after unmount regardless of the mount_timeout setting.

If somebody (or something) would like to kill the process, then there is nothing
that we can do. The potential kill signal can be received at any time point and
some locked folios continue to exists.

> 
> > 
> > > 
> > > Lastly, I think we can just use stopping blockers to replace dirty_folios to
> > > simplify the unmounting and wait process. Accordingly, in ceph_kill_sb(), we
> > > only need to wait for stopping_blockers count to drop to zero. If a timeout
> > > occurs, we can cancel all the inflight requests and print some warning messages.
> > 
> > The dirty_folios is "how much dirty data is still to be flushed," while
> > stopping_blockers is "how many threads are currently inside code that holds an
> > implicit reference to the MDS client." Unmount must drain both in order, and the
> > two counters solve entirely different races.
> > 
> > The mdsc->dirty_folios — count of dirty page-cache folios not yet written back.
> > It incremented in ceph_dirty_folio() at the moment of folio transitions from
> > clean to dirty in the page cache. It decremented in the OSD write-completion
> > callback after the OSD acknowledges the writeback and end_page_writeback() is
> > called. It represents the number of file-data folios that have been dirtied
> > (modified in the page cache) but whose data has not yet reached an OSD (i.e.,
> > writeback is pending or in flight).
> > 
> > The mdsc->stopping_blockers counts of in-progress MDS/OSD message handlers. It
> > incremented by ceph_inc_mds_stopping_blocker() / ceph_inc_osd_stopping_blocker()
> > at the entry of any async operation that must not be interrupted mid-flight by
> > shutdown. It decremented at the exit of the async operations' handlers.
> 
> Thanks for the detailed explanation of dirty_folios and stopping_blockers. However,
> it is still a bit confusing that we need to wait dirty foios to decrease to zero
> after sync_filesystem() in ceph_kill_sb(). It seems that the semantics of 
> sync_filesystem() is broken in Cephfs? Because in the common situation, sync_filesystem()
> guarantees dirty folios are flushed to the backend, and PG_dirty of these folios
> are cleared after it returns.
> 
> 

As far as I can see, during sync_filesystem() call, it is used
filemap_fdatawait_keep_errors() that waits for pages tagged
PAGECACHE_TAG_WRITEBACK. A dirty page that was found by ceph_writepages_start()
but never submitted stays tagged PAGECACHE_TAG_DIRTY, and sync_filesystem()
returns treating it as invisible. So, this is why we need to wait dirty foios to
decrease to zero in ceph_kill_sb().

Thanks,
Slava.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Re: RE: Re:  [PATCH] ceph: fix potential stray locked folios during umount
  2026-04-27 21:52         ` Viacheslav Dubeyko
@ 2026-04-29 14:42           ` 李磊
  2026-04-29 18:20             ` Viacheslav Dubeyko
  0 siblings, 1 reply; 8+ messages in thread
From: 李磊 @ 2026-04-29 14:42 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: 李磊, Xiubo Li, ceph-devel@vger.kernel.org,
	idryomov@gmail.com, slava@dubeyko.com, Alex Markuze,
	noctis.akm@gmail.com, linux-kernel@vger.kernel.org



> 2026年4月28日 05:52，Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> 写道：
> 
> 安全提示：此邮件来自公司外部。除非您确认发件人身份可信且邮件内容不含可疑信息，否则请勿回复或转发邮件、点击邮件链接或打开附件。
> 
> 
> On Sun, 2026-04-26 at 15:38 +0000, 李磊 wrote:
>> 
>>>> 
> 
> <skipped>
> 
>>>> I understand your concern. This patch is a truly straightforward workaround.
>>>> So, how about we just abort OSD requests if they take too long to return
>>>> during unmounting ?
>>> 
>>> The question here is how to define that OSD requests taking too long time?
>>> Potentially, processing could be really slow for some reason. From one point of
>>> view, if we know that destination OSD is down or we have network partitioning,
>>> then it doesn't make sense to wait to long. I am thinking about potential
>>> checking of number of OSD requests. If this number is going down, then it needs
>>> to wait, otherwise, if this number doesn't change, then it needs to finish the
>>> unmount without waiting. Does it make sense?
>>> 
>>>> 
>>>> Compared to leaving some locked folios in the system, return -EIO to those
>>>> OSD requests which may never return is more reasonable. This is because locked
>>>> folios left behind Cephfs unmount may block kcompactd and render the entire
>>>> system unstable.
>>>> 
>>> 
>>> I agree. It makes sense. If we know that some OSD requests will never return,
>>> then we need to manage this situation in better way. But how could we detect
>>> that OSD request will never return?
>>> 
>>>> Besides, successful unmounting doesn't guarantee dirty buffers are successfully
>>>> written to the backend. For example, when a buffered write returns, the local
>>>> filesystem may encounter bad blocks on the local disk and -EIO is returned to
>>>> the writeback kworkers.  Therefore, in our scenario, does it make sense if we
>>>> treat the OSD requests that have been flight for a certain period as failed,
>>>> And return -EIO to the caller?
>>> 
>>> This is the main question: how to detect that OSD requests are failed?
>>> 
>>> As far as I can see, if an OSD is down and osd_request_timeout is not set (the
>>> default), a stalled write can block unmount indefinitely. I assume that you have
>>> the osd_request_timeout is not set. So, maybe, we need to re-consider the policy
>>> of management the stuck OSD requests during unmount.
>>> 
>>> Laggy OSD path: if any request's r_stamp is older than osd_keepalive_timeout,
>>> the OSD goes on a slow_osds list and ceph_con_keepalive() is called, sending a
>>> keepalive byte over TCP. If the TCP connection is silently broken, the keepalive
>>> write will fail, triggering con_fault().
>>> 
>>> Timed-out request path: if osd_request_timeout is set (default 0 = disabled),
>>> requests older than that deadline are aborted with -ETIMEDOUT via
>>> abort_request().
>>> 
>>> Homeless requests: requests that can't be mapped to any OSD are also checked
>>> against osd_request_timeout.
>>> 
>>> The ceph_con_keepalive_expired() uses the timestamp of the last keepalive
>>> acknowledgement (con->last_keepalive_ack) to determine whether the peer has gone
>>> silent beyond interval. When this fires, the connection is considered dead and
>>> con_fault() is triggered.
>>> 
>>> So, we need to find a proper approach of finding a good solution from available
>>> functionality.
>> 
>> I agree. Instead of waiting for inflight requests infinitely or aborting OSD
>> requests brutally, you prefer a much more elegant way to deal with this dilemma.
>> It’s cool, but it seems complex and more time is needed to fix locked folios leakage
>> on the client nodes. Is there any acceptable short-term scheme?
> 
> Have you tried to set up the osd_request_timeout and to see how CephFS kernel
> client will behave afterwards? Will it change anything?

If I apply this patch to wait for stopping blockers to drop to zero, setting osd_request_timeout
can help abort OSD requests in time and allow the unmount process to proceed. However
I think we still have 2 aspects to discuss.

1. Instead of using mount_timeout, can we use other option to accommodate waiting during
   the unmount process?

   It is somewhat confusing that the mount_timeout option decides how long we should wait
   for both dirty_folios and stopping_blockers if they don’t drop to zero. As for as I know
   mount_timeout determines the maximum wait time in open_root_dentry() for loading root
   inode during the mount operation.

   Just for the scenario I described — stop all the OSDs and kill buffered read, is it
   better to use osd_request_timeout instead?

   Or can we wait_for_completion() infinitely if an OSD request never returns, but create a
   debugfs file (for example ‘abort’) to tigger all OSD’s requests to ensure a clean and
   successful and unmount.

2. Is killable waiting really suitable here ?

   Any user-space process may send a kill signal to the unmount process, which may leave
   behind some stray locked folios and degrade the system stability. Maybe we should use
   non-killable functions here ?

Thanks,
Li

> 
>> 
>> I find it is not easy to work around this issue by merely increasing opt->mount_timeout.
>> Both dirty_folios and stopping_blockers wait with TASK_KILLABLE set, which means the
>> unmount process’s wait can be interrupted by a kill signal and leave some locked folios
>> after unmount regardless of the mount_timeout setting.
> 
> If somebody (or something) would like to kill the process, then there is nothing
> that we can do. The potential kill signal can be received at any time point and
> some locked folios continue to exists.
> 
>> 
>>> 
>>>> 
>>>> Lastly, I think we can just use stopping blockers to replace dirty_folios to
>>>> simplify the unmounting and wait process. Accordingly, in ceph_kill_sb(), we
>>>> only need to wait for stopping_blockers count to drop to zero. If a timeout
>>>> occurs, we can cancel all the inflight requests and print some warning messages.
>>> 
>>> The dirty_folios is "how much dirty data is still to be flushed," while
>>> stopping_blockers is "how many threads are currently inside code that holds an
>>> implicit reference to the MDS client." Unmount must drain both in order, and the
>>> two counters solve entirely different races.
>>> 
>>> The mdsc->dirty_folios — count of dirty page-cache folios not yet written back.
>>> It incremented in ceph_dirty_folio() at the moment of folio transitions from
>>> clean to dirty in the page cache. It decremented in the OSD write-completion
>>> callback after the OSD acknowledges the writeback and end_page_writeback() is
>>> called. It represents the number of file-data folios that have been dirtied
>>> (modified in the page cache) but whose data has not yet reached an OSD (i.e.,
>>> writeback is pending or in flight).
>>> 
>>> The mdsc->stopping_blockers counts of in-progress MDS/OSD message handlers. It
>>> incremented by ceph_inc_mds_stopping_blocker() / ceph_inc_osd_stopping_blocker()
>>> at the entry of any async operation that must not be interrupted mid-flight by
>>> shutdown. It decremented at the exit of the async operations' handlers.
>> 
>> Thanks for the detailed explanation of dirty_folios and stopping_blockers. However,
>> it is still a bit confusing that we need to wait dirty foios to decrease to zero
>> after sync_filesystem() in ceph_kill_sb(). It seems that the semantics of
>> sync_filesystem() is broken in Cephfs? Because in the common situation, sync_filesystem()
>> guarantees dirty folios are flushed to the backend, and PG_dirty of these folios
>> are cleared after it returns.
>> 
>> 
> 
> As far as I can see, during sync_filesystem() call, it is used
> filemap_fdatawait_keep_errors() that waits for pages tagged
> PAGECACHE_TAG_WRITEBACK. A dirty page that was found by ceph_writepages_start()
> but never submitted stays tagged PAGECACHE_TAG_DIRTY, and sync_filesystem()
> returns treating it as invisible. So, this is why we need to wait dirty foios to
> decrease to zero in ceph_kill_sb().
> 
> Thanks,
> Slava.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: Re: RE: Re:  [PATCH] ceph: fix potential stray locked folios during umount
  2026-04-29 14:42           ` 李磊
@ 2026-04-29 18:20             ` Viacheslav Dubeyko
  0 siblings, 0 replies; 8+ messages in thread
From: Viacheslav Dubeyko @ 2026-04-29 18:20 UTC (permalink / raw)
  To: lilei24@kuaishou.com
  Cc: Alex Markuze, Xiubo Li, ceph-devel@vger.kernel.org,
	slava@dubeyko.com, idryomov@gmail.com, noctis.akm@gmail.com,
	linux-kernel@vger.kernel.org

On Wed, 2026-04-29 at 14:42 +0000, 李磊 wrote:
> 
> > 2026年4月28日 05:52，Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> 写道：
> > 
> > 安全提示：此邮件来自公司外部。除非您确认发件人身份可信且邮件内容不含可疑信息，否则请勿回复或转发邮件、点击邮件链接或打开附件。
> > 
> > 
> > On Sun, 2026-04-26 at 15:38 +0000, 李磊 wrote:
> > > 
> > > > > 
> > 
> > <skipped>
> > 
> > > > > I understand your concern. This patch is a truly straightforward workaround.
> > > > > So, how about we just abort OSD requests if they take too long to return
> > > > > during unmounting ?
> > > > 
> > > > The question here is how to define that OSD requests taking too long time?
> > > > Potentially, processing could be really slow for some reason. From one point of
> > > > view, if we know that destination OSD is down or we have network partitioning,
> > > > then it doesn't make sense to wait to long. I am thinking about potential
> > > > checking of number of OSD requests. If this number is going down, then it needs
> > > > to wait, otherwise, if this number doesn't change, then it needs to finish the
> > > > unmount without waiting. Does it make sense?
> > > > 
> > > > > 
> > > > > Compared to leaving some locked folios in the system, return -EIO to those
> > > > > OSD requests which may never return is more reasonable. This is because locked
> > > > > folios left behind Cephfs unmount may block kcompactd and render the entire
> > > > > system unstable.
> > > > > 
> > > > 
> > > > I agree. It makes sense. If we know that some OSD requests will never return,
> > > > then we need to manage this situation in better way. But how could we detect
> > > > that OSD request will never return?
> > > > 
> > > > > Besides, successful unmounting doesn't guarantee dirty buffers are successfully
> > > > > written to the backend. For example, when a buffered write returns, the local
> > > > > filesystem may encounter bad blocks on the local disk and -EIO is returned to
> > > > > the writeback kworkers.  Therefore, in our scenario, does it make sense if we
> > > > > treat the OSD requests that have been flight for a certain period as failed,
> > > > > And return -EIO to the caller?
> > > > 
> > > > This is the main question: how to detect that OSD requests are failed?
> > > > 
> > > > As far as I can see, if an OSD is down and osd_request_timeout is not set (the
> > > > default), a stalled write can block unmount indefinitely. I assume that you have
> > > > the osd_request_timeout is not set. So, maybe, we need to re-consider the policy
> > > > of management the stuck OSD requests during unmount.
> > > > 
> > > > Laggy OSD path: if any request's r_stamp is older than osd_keepalive_timeout,
> > > > the OSD goes on a slow_osds list and ceph_con_keepalive() is called, sending a
> > > > keepalive byte over TCP. If the TCP connection is silently broken, the keepalive
> > > > write will fail, triggering con_fault().
> > > > 
> > > > Timed-out request path: if osd_request_timeout is set (default 0 = disabled),
> > > > requests older than that deadline are aborted with -ETIMEDOUT via
> > > > abort_request().
> > > > 
> > > > Homeless requests: requests that can't be mapped to any OSD are also checked
> > > > against osd_request_timeout.
> > > > 
> > > > The ceph_con_keepalive_expired() uses the timestamp of the last keepalive
> > > > acknowledgement (con->last_keepalive_ack) to determine whether the peer has gone
> > > > silent beyond interval. When this fires, the connection is considered dead and
> > > > con_fault() is triggered.
> > > > 
> > > > So, we need to find a proper approach of finding a good solution from available
> > > > functionality.
> > > 
> > > I agree. Instead of waiting for inflight requests infinitely or aborting OSD
> > > requests brutally, you prefer a much more elegant way to deal with this dilemma.
> > > It’s cool, but it seems complex and more time is needed to fix locked folios leakage
> > > on the client nodes. Is there any acceptable short-term scheme?
> > 
> > Have you tried to set up the osd_request_timeout and to see how CephFS kernel
> > client will behave afterwards? Will it change anything?
> 
> If I apply this patch to wait for stopping blockers to drop to zero, setting osd_request_timeout
> can help abort OSD requests in time and allow the unmount process to proceed. However
> I think we still have 2 aspects to discuss.

I think that if osd_request_timeout has CEPH_OSD_REQUEST_TIMEOUT_DEFAULT value
(infinite timeout), then, probably, we need to process this in special way.
Maybe, we need to change the default timeout to another default value that can
manage aborting OSD requests in reasonable time. What do you think?

> 
> 1. Instead of using mount_timeout, can we use other option to accommodate waiting during
>    the unmount process?
> 
>    It is somewhat confusing that the mount_timeout option decides how long we should wait
>    for both dirty_folios and stopping_blockers if they don’t drop to zero. As for as I know
>    mount_timeout determines the maximum wait time in open_root_dentry() for loading root
>    inode during the mount operation.
> 
>    Just for the scenario I described — stop all the OSDs and kill buffered read, is it
>    better to use osd_request_timeout instead?
> 
>    Or can we wait_for_completion() infinitely if an OSD request never returns, but create a
>    debugfs file (for example ‘abort’) to tigger all OSD’s requests to ensure a clean and
>    successful and unmount.

Probably, you are right, the mount_timeout option could look confusing here.
But, from another point of view, we have unmout process here and mount_timeout
option could be considered like a good fit. But we need wait ending of OSD
requests. So, I can agree that osd_request_timeout sounds like more proper
option here.

Also, I started to think that we need to improve the logic. Currently, we have:

wait_queue_head_t *wq = &mdsc->flush_end_wq;
long timeleft = wait_event_killable_timeout(*wq,
					atomic64_read(&mdsc->dirty_folios) <=
0,
					fsc->client->options->mount_timeout);
if (!timeleft) /* timed out */
	pr_warn_client(cl, "umount timed out, %ld\n", timeleft);
else if (timeleft < 0) /* killed */
	pr_warn_client(cl, "umount was killed, %ld\n", timeleft);

Technically speaking, even if timeout has been elapsed (especially short
enough), then it doesn't mean that all dirty folios have been processed. I think
we need to have a loop in both cases for waiting processing all dirty folios or
processing/aborting all OSD requests.

What do you think?

> 
> 2. Is killable waiting really suitable here ?
> 
>    Any user-space process may send a kill signal to the unmount process, which may leave
>    behind some stray locked folios and degrade the system stability. Maybe we should use
>    non-killable functions here ?
> 
> 
> 

I think if anyone kills the process, then this person expects that this process
dies right now. Usually, we kill the process if something is going wrong
already. I am not sure that non-killable functions will be better here.

Thanks,
Slava.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-04-29 18:21 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-18 13:39 [PATCH] ceph: fix potential stray locked folios during umount Li Lei
2026-04-23 19:30 ` Viacheslav Dubeyko
2026-04-24 19:44   ` 李磊
2026-04-24 22:02     ` Viacheslav Dubeyko
2026-04-26 15:38       ` 李磊
2026-04-27 21:52         ` Viacheslav Dubeyko
2026-04-29 14:42           ` 李磊
2026-04-29 18:20             ` Viacheslav Dubeyko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox