From: Chris Dunlop <chris@onthe.net.au>
To: linux-xfs@vger.kernel.org
Subject: rm hanging, v6.1.35
Date: Tue, 11 Jul 2023 07:53:54 +1000
Message-ID: <20230710215354.GA679018@onthe.net.au>
Hi,
This box was booted into Linux v6.1.35 two days ago. It was previously
running v5.15.118 without any problems, other than the one fixed by
"5e672cd69f0a xfs: non-blocking inodegc pushes", which was the reason for
the upgrade.
I have rm operations on two files that have been stuck for in excess of 22
hours and 18 hours respectively:
$ ps -opid,lstart,state,wchan=WCHAN-xxxxxxxxxxxxxxx,cmd -C rm
    PID                  STARTED S WCHAN-xxxxxxxxxxxxxxx CMD
2379355 Mon Jul 10 09:07:57 2023 D vfs_unlink            /bin/rm -rf /aaa/5539_tmp
2392421 Mon Jul 10 09:18:27 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
2485728 Mon Jul 10 09:28:57 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
2488254 Mon Jul 10 09:39:27 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
2491180 Mon Jul 10 09:49:58 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
3014914 Mon Jul 10 13:00:33 2023 D vfs_unlink            /bin/rm -rf /bbb/5541_tmp
3095893 Mon Jul 10 13:11:03 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
3098809 Mon Jul 10 13:21:35 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
3101387 Mon Jul 10 13:32:06 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
3195017 Mon Jul 10 13:42:37 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
The "rm"s are run from a process that's obviously tried a few times to get
rid of these files.
There's nothing extraordinary about the files in terms of size:
$ ls -ltrn --full-time /aaa/5539_tmp /bbb/5541_tmp
-rw-rw-rw- 1 1482 1482 7870643 2023-07-10 06:07:58.684036505 +1000 /aaa/5539_tmp
-rw-rw-rw- 1 1482 1482  701240 2023-07-10 10:00:34.181064549 +1000 /bbb/5541_tmp
As hinted by the WCHAN in the ps output above, the stack trace of each
"primary" rm (i.e. the first one run on each file) looks like:
[<0>] vfs_unlink+0x48/0x270
[<0>] do_unlinkat+0x1f5/0x290
[<0>] __x64_sys_unlinkat+0x3b/0x60
[<0>] do_syscall_64+0x34/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
And the stack trace of each "secondary" rm (i.e. the subsequent ones on
each file) looks like:
[<0>] down_write_nested+0xdc/0x100
[<0>] do_unlinkat+0x10d/0x290
[<0>] __x64_sys_unlinkat+0x3b/0x60
[<0>] do_syscall_64+0x34/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
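For anyone wanting to gather the same information on another box, here is
a minimal sketch of one way to collect per-process kernel stacks like the
ones above. It assumes /proc/<pid>/stack is available (it normally needs
root and a kernel built with stack trace support); the function names
dstate_pids and dump_stacks are made up for illustration and are not part
of any existing tool.

```shell
#!/bin/sh
# Print the PIDs whose state is D (uninterruptible sleep).
# Expects "PID STATE" pairs on stdin, one per line.
dstate_pids() {
    awk '$2 ~ /^D/ { print $1 }'
}

# Dump the kernel stack of every D-state process with command name "$1".
# Reading /proc/<pid>/stack normally requires root.
dump_stacks() {
    ps -o pid=,state= -C "$1" | dstate_pids | while read -r pid; do
        echo "--- pid $pid"
        cat "/proc/$pid/stack" 2>/dev/null || echo "(stack unavailable)"
    done
}
```

Running "dump_stacks rm" as root would print one stack per stuck rm, which
makes it easy to see which processes are sitting in vfs_unlink and which
are queued behind them in down_write_nested.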
Multiple kernel stack traces don't show vfs_unlink or anything related
that I can see, or anything else consistent or otherwise interesting. Most
cores are idle.
/aaa and /bbb are separate XFS filesystems:
$ xfs_info /aaa
meta-data=/dev/mapper/aaa        isize=512    agcount=2, agsize=268434432 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1
data     =                       bsize=4096   blocks=536868864, imaxpct=5
         =                       sunit=256    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=262143, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ xfs_info /bbb
meta-data=/dev/mapper/bbb        isize=512    agcount=8, agsize=268434432 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1
data     =                       bsize=4096   blocks=1879047168, imaxpct=5
         =                       sunit=256    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
There's plenty of free space at the fs level:
$ df -h /aaa /bbb
Filesystem       Size  Used Avail Use% Mounted on
/dev/mapper/aaa  2.0T  551G  1.5T  27% /aaa
/dev/mapper/bbb  7.0T  3.6T  3.5T  52% /bbb
The fses are on sparse ceph/rbd volumes. The underlying storage tells me
they're about 50% and 69% utilised respectively:
aaa: provisioned="2048G" used="1015.9G"
bbb: provisioned="7168G" used="4925.3G"
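The utilisation of each volume can be worked out directly from the
provisioned and used figures above. A minimal sketch, assuming both
figures are in the same unit; pct_used is a hypothetical helper name used
only for this illustration:

```shell
#!/bin/sh
# Percentage of provisioned space in use, to one decimal place.
# $1 = used, $2 = provisioned (same unit for both).
pct_used() {
    awk -v used="$1" -v prov="$2" 'BEGIN { printf "%.1f\n", 100 * used / prov }'
}

pct_used 1015.9 2048   # aaa, prints 49.6
pct_used 4925.3 7168   # bbb, prints 68.7
```

Neither volume is close to full at the storage level, so running out of
backing space looks like an unlikely cause on these numbers alone.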
Where to from here?
I'm guessing only a reboot is going to unstick this. Anything I should be
looking at before reverting to v5.15.118?
...since I started writing all this down, another two sets of rms have
become stuck, again on unremarkable files, and on two more separate
filesystems.
...oh. And an 'ls' on those files is hanging. The reboot has become more
urgent.
Cheers,
Chris