* rm hanging, v6.1.35
@ 2023-07-10 21:53 Chris Dunlop
2023-07-11 0:13 ` Chris Dunlop
2023-07-11 0:53 ` rm hanging, v6.1.35 Bagas Sanjaya
0 siblings, 2 replies; 15+ messages in thread
From: Chris Dunlop @ 2023-07-10 21:53 UTC (permalink / raw)
To: linux-xfs
Hi,
This box is newly booted into linux v6.1.35 (2 days ago), it was
previously running v5.15.118 without any problems (other than that fixed
by "5e672cd69f0a xfs: non-blocking inodegc pushes", the reason for the
upgrade).
I have rm operations on two files that have been stuck for in excess of 22
hours and 18 hours respectively:
$ ps -opid,lstart,state,wchan=WCHAN-xxxxxxxxxxxxxxx,cmd -C rm
PID STARTED S WCHAN-xxxxxxxxxxxxxxx CMD
2379355 Mon Jul 10 09:07:57 2023 D vfs_unlink /bin/rm -rf /aaa/5539_tmp
2392421 Mon Jul 10 09:18:27 2023 D down_write_nested /bin/rm -rf /aaa/5539_tmp
2485728 Mon Jul 10 09:28:57 2023 D down_write_nested /bin/rm -rf /aaa/5539_tmp
2488254 Mon Jul 10 09:39:27 2023 D down_write_nested /bin/rm -rf /aaa/5539_tmp
2491180 Mon Jul 10 09:49:58 2023 D down_write_nested /bin/rm -rf /aaa/5539_tmp
3014914 Mon Jul 10 13:00:33 2023 D vfs_unlink /bin/rm -rf /bbb/5541_tmp
3095893 Mon Jul 10 13:11:03 2023 D down_write_nested /bin/rm -rf /bbb/5541_tmp
3098809 Mon Jul 10 13:21:35 2023 D down_write_nested /bin/rm -rf /bbb/5541_tmp
3101387 Mon Jul 10 13:32:06 2023 D down_write_nested /bin/rm -rf /bbb/5541_tmp
3195017 Mon Jul 10 13:42:37 2023 D down_write_nested /bin/rm -rf /bbb/5541_tmp
The "rm"s are run from a process that's obviously tried a few times to get
rid of these files.
There's nothing extraordinary about the files in terms of size:
$ ls -ltrn --full-time /aaa/5539_tmp /bbb/5541_tmp
-rw-rw-rw- 1 1482 1482 7870643 2023-07-10 06:07:58.684036505 +1000 /aaa/5539_tmp
-rw-rw-rw- 1 1482 1482 701240 2023-07-10 10:00:34.181064549 +1000 /bbb/5541_tmp
As hinted by the WCHAN in the ps output above, each "primary" rm (i.e. the
first one run on each file) stack trace looks like:
[<0>] vfs_unlink+0x48/0x270
[<0>] do_unlinkat+0x1f5/0x290
[<0>] __x64_sys_unlinkat+0x3b/0x60
[<0>] do_syscall_64+0x34/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
And each "secondary" rm (i.e. the subsequent ones on each file) stack
trace looks like:
[<0>] down_write_nested+0xdc/0x100
[<0>] do_unlinkat+0x10d/0x290
[<0>] __x64_sys_unlinkat+0x3b/0x60
[<0>] do_syscall_64+0x34/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
Multiple kernel stack traces don't show vfs_unlink or anything related
that I can see, or anything else consistent or otherwise interesting. Most
cores are idle.
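The ps/WCHAN survey above generalises to a quick check for any stuck workload: list every task in uninterruptible sleep (state D) and dump its kernel stack. A sketch, not from the thread itself; reading /proc/&lt;pid&gt;/stack requires root:

```shell
# List all tasks in uninterruptible sleep (state D) with their wait
# channel, then dump each one's kernel stack trace.
ps -eo pid,state,wchan,cmd | awk '$2 == "D"' |
while read -r pid state rest; do
    echo "== pid $pid ($rest)"
    cat "/proc/$pid/stack" 2>/dev/null || echo "   (stack unreadable: run as root)"
done
```

On an idle machine this prints nothing; on a box in the state described above it would show each hung rm parked in vfs_unlink or down_write_nested.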
Each of /aaa and /bbb is a separate XFS filesystem:
$ xfs_info /aaa
meta-data=/dev/mapper/aaa isize=512 agcount=2, agsize=268434432 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1
data = bsize=4096 blocks=536868864, imaxpct=5
= sunit=256 swidth=256 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=262143, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
$ xfs_info /bbb
meta-data=/dev/mapper/bbb isize=512 agcount=8, agsize=268434432 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1
data = bsize=4096 blocks=1879047168, imaxpct=5
= sunit=256 swidth=256 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
There's plenty of free space at the fs level:
$ df -h /aaa /bbb
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/aaa 2.0T 551G 1.5T 27% /aaa
/dev/mapper/bbb 7.0T 3.6T 3.5T 52% /bbb
The fses are on sparse ceph/rbd volumes; the underlying storage tells me
they're 50-60% utilised:
aaa: provisioned="2048G" used="1015.9G"
bbb: provisioned="7168G" used="4925.3G"
Where to from here?
I'm guessing only a reboot is going to unstick this. Anything I should be
looking at before reverting to v5.15.118?
...subsequent to starting writing all this down I have another two sets of
rms stuck, again on unremarkable files, and on two more separate
filesystems.
...oh. And an 'ls' on those files is hanging. The reboot has become more
urgent.
Cheers,
Chris
^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rm hanging, v6.1.35
  2023-07-10 21:53 rm hanging, v6.1.35 Chris Dunlop
@ 2023-07-11  0:13 ` Chris Dunlop
  2023-07-11  1:57   ` Chris Dunlop
  2023-07-11  0:53 ` rm hanging, v6.1.35 Bagas Sanjaya
  1 sibling, 1 reply; 15+ messages in thread
From: Chris Dunlop @ 2023-07-11  0:13 UTC (permalink / raw)
To: linux-xfs

On Tue, Jul 11, 2023 at 07:53:54AM +1000, Chris Dunlop wrote:
> Hi,
>
> This box is newly booted into linux v6.1.35 (2 days ago), it was
> previously running v5.15.118 without any problems (other than that
> fixed by "5e672cd69f0a xfs: non-blocking inodegc pushes", the reason
> for the upgrade).
>
> I have rm operations on two files that have been stuck for in excess
> of 22 hours and 18 hours respectively:
...
> ...subsequent to starting writing all this down I have another two
> sets of rms stuck, again on unremarkable files, and on two more
> separate filesystems.
>
> ...oh. And an 'ls' on those files is hanging. The reboot has become
> more urgent.

FYI, it's not 'ls' that's hanging, it's bash, because I used a wildcard
on the command line. The bash stack:

$ cat /proc/24779/stack
[<0>] iterate_dir+0x3e/0x180
[<0>] __x64_sys_getdents64+0x71/0x100
[<0>] do_syscall_64+0x34/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

'lsof' shows me it's trying to read one of the directories holding the
file that one of the newer hanging "rm"s is trying to remove.

Cheers,

Chris

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: rm hanging, v6.1.35
  2023-07-11  0:13 ` Chris Dunlop
@ 2023-07-11  1:57   ` Chris Dunlop
  2023-07-11  3:10     ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Dunlop @ 2023-07-11  1:57 UTC (permalink / raw)
To: linux-xfs

On Tue, Jul 11, 2023 at 10:13:31AM +1000, Chris Dunlop wrote:
> On Tue, Jul 11, 2023 at 07:53:54AM +1000, Chris Dunlop wrote:
>> Hi,
>>
>> This box is newly booted into linux v6.1.35 (2 days ago), it was
>> previously running v5.15.118 without any problems (other than that
>> fixed by "5e672cd69f0a xfs: non-blocking inodegc pushes", the reason
>> for the upgrade).
>>
>> I have rm operations on two files that have been stuck for in excess
>> of 22 hours and 18 hours respectively:
> ...
>> ...subsequent to starting writing all this down I have another two
>> sets of rms stuck, again on unremarkable files, and on two more
>> separate filesystems.
>>
>> ...oh. And an 'ls' on those files is hanging. The reboot has become
>> more urgent.
>
> FYI, it's not 'ls' that's hanging, it's bash, because I used a
> wildcard on the command line. The bash stack:
>
> $ cat /proc/24779/stack
> [<0>] iterate_dir+0x3e/0x180
> [<0>] __x64_sys_getdents64+0x71/0x100
> [<0>] do_syscall_64+0x34/0x80
> [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
>
> 'lsof' shows me it's trying to read one of the directories holding the
> file that one of the newer hanging "rm"s is trying to remove.

Ugh. It wasn't just the "rm"s and bash hanging (and as it turns out,
xfs_logprint), they were just obvious because that's what I was looking
at. It turns out there was a whole lot more hanging.

Full sysrq-w output at:

https://file.io/tg7F5OqIWo1B

Sorry if there's more I could be looking at, but I've gotta reboot this
thing NOW...

Cheers,

Chris

^ permalink raw reply	[flat|nested] 15+ messages in thread
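The "sysrq-w" dump linked above comes from the kernel's magic SysRq facility: the 'w' command writes the stacks of all blocked (uninterruptible) tasks into the kernel log. A minimal sketch of capturing such a dump, not from the thread itself; it requires root, and sysrq must be enabled via the kernel.sysrq sysctl:

```shell
# 'w' = dump tasks in uninterruptible (blocked) state to the kernel log.
# Fails harmlessly without root or with sysrq disabled.
echo w > /proc/sysrq-trigger 2>/dev/null || echo "sysrq unavailable (need root?)"

# The dump lands in the kernel ring buffer.
dmesg 2>/dev/null | tail -n 200
```

Redirecting the dump to a file (as Chris did) keeps the full task list for later analysis, since the ring buffer can wrap on a busy system.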
* Re: rm hanging, v6.1.35
  2023-07-11  1:57   ` Chris Dunlop
@ 2023-07-11  3:10     ` Dave Chinner
  2023-07-11  7:05       ` Chris Dunlop
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2023-07-11  3:10 UTC (permalink / raw)
To: Chris Dunlop; +Cc: linux-xfs

FYI, Chris: google is classifying your email as spam because it
doesn't have any dkim/dmarc authentication on it. You're lucky I
even got this, because I know that gmail mostly just drops such
email in a black hole now....

On Tue, Jul 11, 2023 at 11:57:16AM +1000, Chris Dunlop wrote:
> On Tue, Jul 11, 2023 at 10:13:31AM +1000, Chris Dunlop wrote:
> > On Tue, Jul 11, 2023 at 07:53:54AM +1000, Chris Dunlop wrote:
> > > Hi,
> > >
> > > This box is newly booted into linux v6.1.35 (2 days ago), it was
> > > previously running v5.15.118 without any problems (other than that
> > > fixed by "5e672cd69f0a xfs: non-blocking inodegc pushes", the reason
> > > for the upgrade).
> > >
> > > I have rm operations on two files that have been stuck for in excess
> > > of 22 hours and 18 hours respectively:
> > ...
> > > ...subsequent to starting writing all this down I have another two
> > > sets of rms stuck, again on unremarkable files, and on two more
> > > separate filesystems.
> > >
> > > ...oh. And an 'ls' on those files is hanging. The reboot has become
> > > more urgent.
> >
> > FYI, it's not 'ls' that's hanging, it's bash, because I used a wildcard
> > on the command line. The bash stack:
> >
> > $ cat /proc/24779/stack
> > [<0>] iterate_dir+0x3e/0x180
> > [<0>] __x64_sys_getdents64+0x71/0x100
> > [<0>] do_syscall_64+0x34/0x80
> > [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >
> > 'lsof' shows me it's trying to read one of the directories holding the
> > file that one of the newer hanging "rm"s is trying to remove.
>
> Ugh. It wasn't just the "rm"s and bash hanging (and as it turns out,
> xfs_logprint), they were just obvious because that's what I was looking
> at. It turns out there was a whole lot more hanging.
>
> Full sysrq-w output at:
>
> https://file.io/tg7F5OqIWo1B

Ok, you have XFS on dm-writecache on md raid 5 and everything is stuck
in either dm-writecache or md. For example, dm-writecache writeback is
stuck waiting on md:

Jul 11 11:23:15 b2 kernel: task:kworker/40:1 state:D stack:0 pid:1801085 ppid:2 flags:0x00004000
Jul 11 11:23:15 b2 kernel: Workqueue: writecache-writeback writecache_flush_work [dm_writecache]
Jul 11 11:23:15 b2 kernel: Call Trace:
Jul 11 11:23:15 b2 kernel:  <TASK>
Jul 11 11:23:15 b2 kernel:  __schedule+0x245/0x7a0
Jul 11 11:23:15 b2 kernel:  schedule+0x50/0xa0
Jul 11 11:23:15 b2 kernel:  raid5_get_active_stripe+0x1de/0x480 [raid456]
Jul 11 11:23:15 b2 kernel:  raid5_make_request+0x290/0xe10 [raid456]
Jul 11 11:23:15 b2 kernel:  md_handle_request+0x196/0x240 [md_mod]
Jul 11 11:23:15 b2 kernel:  __submit_bio+0x125/0x250
Jul 11 11:23:15 b2 kernel:  submit_bio_noacct_nocheck+0x141/0x370
Jul 11 11:23:15 b2 kernel:  dispatch_io+0x16f/0x2e0 [dm_mod]
Jul 11 11:23:15 b2 kernel:  dm_io+0xf3/0x200 [dm_mod]
Jul 11 11:23:15 b2 kernel:  ssd_commit_flushed+0x12c/0x1c0 [dm_writecache]
Jul 11 11:23:15 b2 kernel:  writecache_flush+0xde/0x2a0 [dm_writecache]
Jul 11 11:23:15 b2 kernel:  writecache_flush_work+0x1f/0x30 [dm_writecache]
Jul 11 11:23:15 b2 kernel:  process_one_work+0x296/0x500
Jul 11 11:23:15 b2 kernel:  worker_thread+0x2a/0x3b0
Jul 11 11:23:15 b2 kernel:  kthread+0xe6/0x110
Jul 11 11:23:15 b2 kernel:  ret_from_fork+0x1f/0x30
Jul 11 11:23:15 b2 kernel:  </TASK>

This task holds the wc_lock() while it is doing this writeback
flush.

All the XFS filesystems are stuck in similar ways, such as trying to
push the journal and getting stuck in IO submission in dm-writecache:

Jul 11 11:23:15 b2 kernel: Workqueue: xfs-cil/dm-128 xlog_cil_push_work [xfs]
Jul 11 11:23:15 b2 kernel: Call Trace:
Jul 11 11:23:15 b2 kernel:  <TASK>
Jul 11 11:23:15 b2 kernel:  __schedule+0x245/0x7a0
Jul 11 11:23:15 b2 kernel:  schedule+0x50/0xa0
Jul 11 11:23:15 b2 kernel:  schedule_preempt_disabled+0x14/0x20
Jul 11 11:23:15 b2 kernel:  __mutex_lock+0x72a/0xc80
Jul 11 11:23:15 b2 kernel:  writecache_map+0x2f/0x790 [dm_writecache]
Jul 11 11:23:15 b2 kernel:  __map_bio+0x46/0x1c0 [dm_mod]
Jul 11 11:23:15 b2 kernel:  __send_duplicate_bios+0x1d4/0x240 [dm_mod]
Jul 11 11:23:15 b2 kernel:  __send_empty_flush+0x95/0xd0 [dm_mod]
Jul 11 11:23:15 b2 kernel:  dm_submit_bio+0x3b5/0x5f0 [dm_mod]
Jul 11 11:23:15 b2 kernel:  __submit_bio+0x125/0x250
Jul 11 11:23:15 b2 kernel:  submit_bio_noacct_nocheck+0x141/0x370
Jul 11 11:23:15 b2 kernel:  xlog_state_release_iclog+0xfb/0x220 [xfs]
Jul 11 11:23:15 b2 kernel:  xlog_write_get_more_iclog_space+0x87/0x130 [xfs]
Jul 11 11:23:15 b2 kernel:  xlog_write+0x295/0x3d0 [xfs]
Jul 11 11:23:15 b2 kernel:  xlog_cil_push_work+0x5b5/0x760 [xfs]
Jul 11 11:23:15 b2 kernel:  process_one_work+0x296/0x500
Jul 11 11:23:15 b2 kernel:  worker_thread+0x2a/0x3b0
Jul 11 11:23:15 b2 kernel:  kthread+0xe6/0x110
Jul 11 11:23:15 b2 kernel:  ret_from_fork+0x1f/0x30

These are getting into dm-writecache, and the lock they are getting
stuck on is the wc_lock().

Directory reads are getting stuck in dm-write-cache:

Jul 11 11:23:16 b2 kernel: task:find state:D stack:0 pid:1832634 ppid:1832627 flags:0x00000000
Jul 11 11:23:16 b2 kernel: Call Trace:
Jul 11 11:23:16 b2 kernel:  <TASK>
Jul 11 11:23:16 b2 kernel:  __schedule+0x245/0x7a0
Jul 11 11:23:16 b2 kernel:  schedule+0x50/0xa0
Jul 11 11:23:16 b2 kernel:  schedule_preempt_disabled+0x14/0x20
Jul 11 11:23:16 b2 kernel:  __mutex_lock+0x72a/0xc80
Jul 11 11:23:16 b2 kernel:  writecache_map+0x2f/0x790 [dm_writecache]
Jul 11 11:23:16 b2 kernel:  __map_bio+0x46/0x1c0 [dm_mod]
Jul 11 11:23:16 b2 kernel:  dm_submit_bio+0x273/0x5f0 [dm_mod]
Jul 11 11:23:16 b2 kernel:  __submit_bio+0x125/0x250
Jul 11 11:23:16 b2 kernel:  submit_bio_noacct_nocheck+0x141/0x370
Jul 11 11:23:16 b2 kernel:  _xfs_buf_ioapply+0x224/0x3c0 [xfs]
Jul 11 11:23:16 b2 kernel:  __xfs_buf_submit+0x88/0x1c0 [xfs]
Jul 11 11:23:16 b2 kernel:  xfs_buf_read_map+0x1a2/0x340 [xfs]
Jul 11 11:23:16 b2 kernel:  xfs_buf_readahead_map+0x23/0x30 [xfs]
Jul 11 11:23:16 b2 kernel:  xfs_da_reada_buf+0x76/0x80 [xfs]
Jul 11 11:23:16 b2 kernel:  xfs_dir2_leaf_readbuf+0x269/0x350 [xfs]
Jul 11 11:23:16 b2 kernel:  xfs_dir2_leaf_getdents+0xd5/0x3b0 [xfs]
Jul 11 11:23:16 b2 kernel:  xfs_readdir+0xff/0x1d0 [xfs]
Jul 11 11:23:16 b2 kernel:  iterate_dir+0x13f/0x180
Jul 11 11:23:16 b2 kernel:  __x64_sys_getdents64+0x71/0x100
Jul 11 11:23:16 b2 kernel:  ? __x64_sys_getdents64+0x100/0x100
Jul 11 11:23:16 b2 kernel:  do_syscall_64+0x34/0x80
Jul 11 11:23:16 b2 kernel:  entry_SYSCALL_64_after_hwframe+0x46/0xb0

Same lock.

File data reads are getting stuck the same way in dm-writecache.

File data writes (i.e. writeback) are getting stuck the same way in
dm-writecache.

Yup, everything is stuck on dm-writecache flushing to the underlying
RAID-5 device.

I don't see anything indicating a filesystem problem here. This looks
like a massively overloaded RAID 5 volume. i.e. the fast storage that
makes up the write cache has filled, and now everything is stuck
waiting for the fast cache to drain to the slow backing store before
new writes can be made to the fast cache. IOWs, everything is running
at RAID5 write speed and there's a huge backlog of data being written
to the RAID 5 volume(s) keeping them 100% busy.

If there is no IO going to the RAID 5 volumes, then you've got a RAID 5
or physical IO hang of some kind. But I can't find any indication of
that - everything looks like it's waiting on RAID 5 stripe population
to make write progress...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: rm hanging, v6.1.35
  2023-07-11  3:10     ` Dave Chinner
@ 2023-07-11  7:05       ` Chris Dunlop
  2023-07-11 22:21         ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Dunlop @ 2023-07-11  7:05 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs

On Tue, Jul 11, 2023 at 01:10:11PM +1000, Dave Chinner wrote:
> FYI, Chris: google is classifying your email as spam because it
> doesn't have any dkim/dmarc authentication on it. You're lucky I
> even got this, because I know that gmail mostly just drops such
> email in a black hole now....

Dkim hopefully fixed now.

More on the problem topic below, and apologies for the aside, but this
is my immediate concern: post the reboot into v5.15.118 I have one of
the filesystems failing to mount with:

Jul 11 16:13:10 b2 kernel: XFS (dm-146): Metadata CRC error detected at xfs_allocbt_read_verify+0x12/0xb0 [xfs], xfs_bnobt block 0x10a00f7f8
Jul 11 16:13:10 b2 kernel: XFS (dm-146): Unmount and run xfs_repair
Jul 11 16:13:10 b2 kernel: XFS (dm-146): First 128 bytes of corrupted metadata buffer:
Jul 11 16:13:10 b2 kernel: 00000000: 41 42 33 42 00 00 01 ab 05 0a 78 fe 02 c0 2b ff  AB3B......x...+.
Jul 11 16:13:10 b2 kernel: 00000010: 00 00 00 01 0a 00 f7 f8 00 00 00 26 00 3b 2d 40  ...........&.;-@
Jul 11 16:13:10 b2 kernel: 00000020: cf 42 ed 07 78 a1 4e ff 9e 20 e3 d9 6f fc 3e 30  .B..x.N.. ..o.>0
Jul 11 16:13:10 b2 kernel: 00000030: 00 00 00 02 80 b2 25 41 08 c7 6c 00 00 00 01 28  ......%A..l....(
Jul 11 16:13:10 b2 kernel: 00000040: 08 c7 6d 3e 00 00 00 c2 08 c7 6f 65 00 00 00 a7  ..m>......oe....
Jul 11 16:13:10 b2 kernel: 00000050: 08 c7 70 a5 00 00 00 3a 08 c7 71 00 00 00 00 c4  ..p....:..q.....
Jul 11 16:13:10 b2 kernel: 00000060: 08 c7 71 e8 00 00 00 18 08 c7 72 50 00 00 02 c3  ..q.......rP....
Jul 11 16:13:10 b2 kernel: 00000070: 08 c7 75 b1 00 00 02 4b 08 c7 78 db 00 00 02 3d  ..u....K..x....=
Jul 11 16:13:10 b2 kernel: XFS (dm-146): log mount/recovery failed: error -74
Jul 11 16:13:10 b2 kernel: XFS (dm-146): log mount failed

Then xfs_repair comes back with:

# xfs_repair /dev/vg-name/lv-name
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

At this point is "xfs_repair -L" (and attendant potential data loss) my
only option?

> On Tue, Jul 11, 2023 at 11:57:16AM +1000, Chris Dunlop wrote:
>> On Tue, Jul 11, 2023 at 10:13:31AM +1000, Chris Dunlop wrote:
>>> On Tue, Jul 11, 2023 at 07:53:54AM +1000, Chris Dunlop wrote:
>>>> Hi,
>>>>
>>>> This box is newly booted into linux v6.1.35 (2 days ago), it was
>>>> previously running v5.15.118 without any problems (other than that
>>>> fixed by "5e672cd69f0a xfs: non-blocking inodegc pushes", the reason
>>>> for the upgrade).
>>>>
>>>> I have rm operations on two files that have been stuck for in excess
>>>> of 22 hours and 18 hours respectively:
...
>>
>> Full sysrq-w output at:
>>
>> https://file.io/tg7F5OqIWo1B
>
> Ok, you have XFS on dm-writecache on md raid 5 and
> everything is stuck in either dm-writecache or md.

Yes, dm-writecache, with the cache on SSD md raid6, in front of the XFS
devices on lvm / ceph-rbd.

> This task holds the wc_lock() while it is doing this writeback
> flush.
>
> All the XFS filesystems are stuck in similar ways, such as trying to
> push the journal and getting stuck in IO submission in
> dm-writecache:
...
> These are getting into dm-writecache, and the lock they are getting
> stuck on is the wc_lock().
...
>
> Directory reads are getting stuck in dm-write-cache:
...
> Same lock.
>
> File data reads are getting stuck the same way in dm-writecache.
>
> File data writes (i.e. writeback) are getting stuck the same way
> in dm-writecache.
>
> Yup, everything is stuck on dm-writecache flushing to the underlying
> RAID-5 device.
>
> I don't see anything indicating a filesystem problem here. This
> looks like a massively overloaded RAID 5 volume. i.e. the fast
> storage that makes up the write cache has filled, and now everything
> is stuck waiting for the fast cache to drain to the slow backing
> store before new writes can be made to the fast cache. IOWs,
> everything is running at RAID5 write speed and there's a huge
> backlog of data being written to the RAID 5 volume(s) keeping them
> 100% busy.

This had never been an issue with v5.15. Is it possible the upgrade to
v6.1 had a hand in this or that's probably just coincidence?

In particular, could "5e672cd69f0a xfs: non-blocking inodegc pushes"
cause a significantly greater write load on the cache?

I note that there's one fs (separate to the corrupt one above) that's
still in mount recovery since the boot some hours ago. On past
experience that indicates the inodegc stuff is holding things up, i.e.
it would have been running prior to the reboot - or at least trying to.

> If there is no IO going to the RAID 5 volumes, then you've got a
> RAID 5 or physical IO hang of some kind. But I can't find any
> indication of that - everything looks like it's waiting on RAID 5
> stripe population to make write progress...

There's been no change to the cache device over the reboot, and it
currently looks busy, but it doesn't look completely swamped:

                    load %user %nice %sys %iow %stl %idle  dev  rrqm/s wrqm/s    r/s    w/s     rkB/s     wkB/s arq-sz aqu-sz await rwait wwait %util
2023-07-11-16:10:00 16.7   0.6   0.0  1.0 15.0  0.0  82.5 md10     0.0    0.0  172.5 1060.1  17131.66  27450.08  72.34   6.33  5.13  0.56   5.9  48.8
2023-07-11-16:11:00 17.2   0.8   0.0  1.1 15.6  0.0  81.7 md10     0.0    0.0  185.7 1151.4  18202.36  20681.51  58.16   8.18  6.12  0.53   7.0  44.0
2023-07-11-16:12:00 18.9   0.5   0.0  1.0 17.0  0.0  80.6 md10     0.0    0.0  179.9 1139.6  15617.51  26579.51  63.96   7.62  5.78  0.46   6.6  47.5
2023-07-11-16:13:00 18.7   0.6   0.0  1.0 15.4  0.0  82.1 md10     0.0    0.0  161.8 1399.7  15751.70  26273.20  53.83  11.47  7.35  0.51   8.1  47.4
2023-07-11-16:14:00 17.2   0.5   0.0  0.9 13.9  0.0  83.8 md10     0.0    0.0  190.2 1251.1  15207.77  20907.73  50.12  14.19  9.85  2.16  11.0  42.2
2023-07-11-16:15:00 17.0   0.6   0.0  1.0 13.8  0.0  83.7 md10     0.0    0.0  163.5 1571.2  14561.41  24508.45  45.05  14.38  8.29  0.53   9.1  46.7
2023-07-11-16:16:00 18.7   0.8   0.0  1.0 15.5  0.0  81.8 md10     0.0    0.0  164.6 1294.7  13256.49  20737.23  46.59  12.54  8.59  0.50   9.6  42.3
2023-07-11-16:17:00 18.9   0.6   0.0  1.0 14.0  0.0  83.4 md10     0.0    0.0  179.4 1026.7  14079.12  19095.93  55.01   9.12  7.56  0.91   8.7  44.8
2023-07-11-16:18:00 19.4   0.6   0.0  1.0 13.8  0.0  83.7 md10     0.0    0.0  151.6 1315.5  12935.14  23547.30  49.73  11.80  8.04  0.43   8.9  48.5
2023-07-11-16:19:00 19.6   0.6   0.0  1.0 11.9  0.0  85.7 md10     0.0    0.0  168.6 1895.9  14153.63  23810.50  36.78  16.75  8.11  0.77   8.8  48.5

It should be handling about the same load as prior to the reboot.

Cheers,

Chris

^ permalink raw reply	[flat|nested] 15+ messages in thread
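Device numbers like the table above can be sampled with sysstat's iostat. A sketch, not from the thread itself, assuming the sysstat package is installed; the device name md10 is this system's cache array, substitute your own:

```shell
# Extended per-device stats (r/s, w/s, await, aqu-sz, %util) for the
# cache array: 60-second intervals, 10 samples.
iostat -x -m md10 60 10
```

A sustained %util near 100% with long write waits on the cache array is the signature Dave describes below of the cache draining at backing-store speed.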
* Re: rm hanging, v6.1.35
  2023-07-11  7:05       ` Chris Dunlop
@ 2023-07-11 22:21         ` Dave Chinner
  2023-07-12  1:13           ` Chris Dunlop
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2023-07-11 22:21 UTC (permalink / raw)
To: Chris Dunlop; +Cc: linux-xfs

On Tue, Jul 11, 2023 at 05:05:30PM +1000, Chris Dunlop wrote:
> On Tue, Jul 11, 2023 at 01:10:11PM +1000, Dave Chinner wrote:
> > FYI, Chris: google is classifying your email as spam because it
> > doesn't have any dkim/dmarc authentication on it. You're lucky I
> > even got this, because I know that gmail mostly just drops such
> > email in a black hole now....
>
> Dkim hopefully fixed now.

Looks good.

> More on the problem topic below, and apologies for the aside, but this is my
> immediate concern: post the reboot into v5.15.118 I have one of the
> filesystems failing to mount with:
>
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): Metadata CRC error detected at xfs_allocbt_read_verify+0x12/0xb0 [xfs], xfs_bnobt block 0x10a00f7f8
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): Unmount and run xfs_repair
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): First 128 bytes of corrupted metadata buffer:
> Jul 11 16:13:10 b2 kernel: 00000000: 41 42 33 42 00 00 01 ab 05 0a 78 fe 02 c0 2b ff  AB3B......x...+.
> Jul 11 16:13:10 b2 kernel: 00000010: 00 00 00 01 0a 00 f7 f8 00 00 00 26 00 3b 2d 40  ...........&.;-@
> Jul 11 16:13:10 b2 kernel: 00000020: cf 42 ed 07 78 a1 4e ff 9e 20 e3 d9 6f fc 3e 30  .B..x.N.. ..o.>0
> Jul 11 16:13:10 b2 kernel: 00000030: 00 00 00 02 80 b2 25 41 08 c7 6c 00 00 00 01 28  ......%A..l....(
> Jul 11 16:13:10 b2 kernel: 00000040: 08 c7 6d 3e 00 00 00 c2 08 c7 6f 65 00 00 00 a7  ..m>......oe....
> Jul 11 16:13:10 b2 kernel: 00000050: 08 c7 70 a5 00 00 00 3a 08 c7 71 00 00 00 00 c4  ..p....:..q.....
> Jul 11 16:13:10 b2 kernel: 00000060: 08 c7 71 e8 00 00 00 18 08 c7 72 50 00 00 02 c3  ..q.......rP....
> Jul 11 16:13:10 b2 kernel: 00000070: 08 c7 75 b1 00 00 02 4b 08 c7 78 db 00 00 02 3d  ..u....K..x....=
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): log mount/recovery failed: error -74
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): log mount failed

I would suggest that this indicates a torn write of some kind. Given
the state of the system when you rebooted, and all the RAID 5/6
writes in progress, this is entirely possible...

> Then xfs_repair comes back with:
>
> # xfs_repair /dev/vg-name/lv-name
> Phase 1 - find and verify superblock...
>         - reporting progress in intervals of 15 minutes
> Phase 2 - using internal log
>         - zero log...
> ERROR: The filesystem has valuable metadata changes in a log which needs to
> be replayed.  Mount the filesystem to replay the log, and unmount it before
> re-running xfs_repair.  If you are unable to mount the filesystem, then use
> the -L option to destroy the log and attempt a repair.
> Note that destroying the log may cause corruption -- please attempt a mount
> of the filesystem before doing this.
>
> At this point is "xfs_repair -L" (and attendant potential data loss) my only
> option?

One option is to correct the CRC with xfs_db, then try to mount the
filesystem again, but that will simply allow recovery to try to modify
what is possibly a bad btree block and then have other things go really
wrong (e.g. cross-linked data, duplicate freespace) at a later point in
time. That's potentially far more damaging, and I wouldn't try it
without having a viable rollback strategy (e.g. full device snapshot
and/or copy)....

It's probably safest to just zero the log and run repair at this point.

> > This task holds the wc_lock() while it is doing this writeback
> > flush.
> >
> > All the XFS filesystems are stuck in similar ways, such as trying to
> > push the journal and getting stuck in IO submission in
> > dm-writecache:
> ...
> > These are getting into dm-writecache, and the lock they are getting
> > stuck on is the wc_lock().
> ...
> >
> > Directory reads are getting stuck in dm-write-cache:
> ...
> > Same lock.
> >
> > File data reads are getting stuck the same way in dm-writecache.
> >
> > File data writes (i.e. writeback) are getting stuck the same way
> > in dm-writecache.
> >
> > Yup, everything is stuck on dm-writecache flushing to the underlying
> > RAID-5 device.
> >
> > I don't see anything indicating a filesystem problem here. This
> > looks like a massively overloaded RAID 5 volume. i.e. the fast
> > storage that makes up the write cache has filled, and now everything
> > is stuck waiting for the fast cache to drain to the slow backing
> > store before new writes can be made to the fast cache. IOWs,
> > everything is running at RAID5 write speed and there's a huge
> > backlog of data being written to the RAID 5 volume(s) keeping them
> > 100% busy.
>
> This had never been an issue with v5.15. Is it possible the upgrade to v6.1
> had a hand in this or that's probably just coincidence?

It could be a dm-writecache or md raid regression, but it could be
just "luck".

> In particular, could "5e672cd69f0a xfs: non-blocking inodegc pushes" cause a
> significantly greater write load on the cache?

No.

> I note that there's one fs (separate to the corrupt one above) that's still
> in mount recovery since the boot some hours ago. On past experience that
> indicates the inodegc stuff is holding things up, i.e. it would have been
> running prior to the reboot - or at least trying to.

I only found one inodegc worker running in the trace above - it was
blocked on writecache doing a rmapbt block read removing extents. It
may have been a long running cleanup, but it may not have been.

As it is, mount taking a long time because there's an inode with a
huge number of extents that need freeing on an unlinked list is
*not* related to background inodegc in any way - this has -always-
happened with XFS; it's simply what unlinked inode recovery does.

i.e. background inodegc just moved the extent freeing from syscall
exit context to an async worker thread; the filesystem still has
exactly the same work to do. If the system goes down while that work
is in progress, log recovery has always finished off that cleanup
work...

> > If there is no IO going to the RAID 5 volumes, then you've got a
> > RAID 5 or physical IO hang of some kind. But I can't find any
> > indication of that - everything looks like it's waiting on RAID 5
> > stripe population to make write progress...
>
> There's been no change to the cache device over the reboot, and it
> currently looks busy, but it doesn't look completely swamped:
....

About 150 1MB sized reads, and about 1000-1500 much smaller
writes each second, with an average write wait of near 10ms.
Certainly not a fast device, and it's running at about 50%
utilisation with an average queue depth of 10-15 IOs.

> It should be handling about the same load as prior to the reboot.

If that's the typical sustained load, I wouldn't expect it to have
that much extra overhead given small writes on RAID 6 volumes have
the worst performance characteristics possible. If the write cache
starts flushing lots of small discontiguous writes, I'd expect to
see that device go to 100% utilisation and long write wait times for
extended periods...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread
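The recovery path suggested above -- rollback copy first, then zero the log -- might look like the following sketch. Not from the thread itself: the device and snapshot names are hypothetical, `xfs_repair -L` discards the unreplayed log (so recent metadata changes can be lost), and each step is wrapped so the sketch degrades gracefully when run anywhere other than the affected system:

```shell
DEV=/dev/vg-name/lv-name      # hypothetical LV; substitute the affected one

# Wrap each step: echo what would run, tolerate failure outside a real system.
run() { echo "+ $*"; "$@" 2>/dev/null || echo "  (failed or unavailable here)"; }

# 1. Rollback insurance first: an LVM snapshot (or a full block-level copy).
run lvcreate --snapshot --size 100G --name prerepair-snap "$DEV"

# 2. Zero the unreplayable log, then repair.
run xfs_repair -L "$DEV"

# 3. Mount and sanity-check before removing the snapshot.
run mount "$DEV" /mnt
```

The snapshot is what makes the `-L` gamble recoverable: if repair produces garbage, the pre-repair state can be restored and other options (e.g. the xfs_db CRC fix) tried instead.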
* Re: rm hanging, v6.1.35 2023-07-11 22:21 ` Dave Chinner @ 2023-07-12 1:13 ` Chris Dunlop 2023-07-12 1:42 ` Dave Chinner 0 siblings, 1 reply; 15+ messages in thread From: Chris Dunlop @ 2023-07-12 1:13 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Wed, Jul 12, 2023 at 08:21:11AM +1000, Dave Chinner wrote: > On Tue, Jul 11, 2023 at 05:05:30PM +1000, Chris Dunlop wrote: >> On Tue, Jul 11, 2023 at 01:10:11PM +1000, Dave Chinner wrote: > >> More on the problem topic below, and apologies for the aside, but >> this is my immediate concern: post the reboot into v5.15.118 I have >> one of the filesystems failing to mount with: ... > I would suggest that this indicates a torn write of some kind. Given > the state of the system when you rebooted, and all the RAID 5/6 > writes in progress, this is entirely possible... ... > It's probably safest to just zero the log and run repair at this > point. Thanks. I did that last night (after taking a snapshot) - 'xfs_repair -v' is still running but there's a LOT of nasty output so it's not looking good. ...oh. It finished whilst I've been writing this. If you're interested in the log: https://file.io/XOGokgxgttEX The directory structure looks sane, I'll start running checks on the data. >>> I don't see anything indicating a filesystem problem here. This >>> looks like a massively overloaded RAID 5 volume. i.e. the fast >>> storage that makes up the write cache has filled, and now everything >>> is stuck waiting for the fast cache to drain to the slow backing >>> store before new writes can be made to the fast cache. IOWs, >>> everything is running as RAID5 write speed and there's a huge >>> backlog of data being written to the RAID 5 volume(s) keeping them >>> 100% busy. Oddly, of the 56 similarly configured filesystems (writecache/lvm/rbd) on this box, with maybe 10 actively sinking writes at any time, that one above is the only one that had any trouble on reboot. 
If it had been a general problem w/ the cache device I would have thought more of the active writers would have similar issues. Maybe I just got lucky - and/or that demonstrates how hard xfs tries to keep your data sane. >> This had never been an issue with v5.15. Is it possible the upgrade >> to v6.1 had a hand in this or that's probably just coincidence? > > It could be a dm-writecache or md raid regression, but it could be > just "luck". Ugh. Sigh. I guess at some point I'm going to have to bite the bullet again, and next time watch the cache device like a hawk. I'll keep an eye out for dm-writecache and md raid problems and patches etc. so see what might come up. Is there anything else that occurs to you that I might monitor prior to and during any future recurrance of the problem? >> In particular, could "5e672cd69f0a xfs: non-blocking inodegc >> pushes" cause a significantly greater write load on the cache? > > No. > >> I note that there's one fs (separate to the corrupt one above) >> that's still in mount recovery since the boot some hours ago. On >> past experience that indicates the inodegc stuff is holding things >> up, i.e. it would have been running prior to the reboot - or at >> least trying to. > > I only found one inodegc working running in the trace above - it was > blocked on writecache doing a rmapbt block read removing extents. It > may have been a long running cleanup, but it may not have been. Probably was a cleanup: that's the reason for the update to v6.1, every now and again the box was running into the problem of getting blocked on the cleanup. The extent of the problem is significantly reduced by moving from one large fs where any extended cleanup would block everything, to multiple small-ish fs'es (500G-15T) where the blast radius of an extended cleanup is far more constrained. But the problem was still pretty annoying when it hits. Hmmm, maybe I can just carry a local backport of "non-blocking inodegc pushes" in my local v5.15. 
That would push back my need to move to v6.1. Or could / should it be considered for an official backport? Looks like it applies cleanly to current v5.15.120. > As it is, mount taking a long time because there's an inode with a > huge number of extents that need freeing on an unlinked list is > *not* related to background inodegc in any way - this has -always- > happened with XFS; it's simply what unlinked inode recovery does. > > i.e. background inodegc just moved the extent freeing from syscall > exit context to an async worker thread; the filesystem still has > exactly the same work to do. If the system goes down while that work > is in progress, log recovery has always finished off that cleanup > work... Gotcha. I'd mistakenly thought "non-blocking inodegc pushes" queued up the garbage collection for background processing. My further mistaken hand-wavy thought was that, if the processing that was previously foreground was now punted to background (perhaps with different priorities) maybe the background processing was simply way more efficient, enough to swamp the cache with metadata updates. But your further explanation, and actually reading the patch (should have done that first), show the gc was already queued; the update was to NOT wait for the queue to be flushed. Hmmm, then again... might it be possible that, without the patch, at some point after a large delete, further work was blocked whilst waiting for the queue to be flushed, limiting the total amount of work, but with the patch, the further work (e.g. more deletes) is able to be queued - possibly to the point of swamping the cache device? >> There's been no change to the cache device over the reboot, and it >> currently looks busy, but it doesn't look completely swamped: > .... > > About 150 1MB sized reads, and about 1000-1500 much smaller > writes each second, with an average write wait of near 10ms. 
> Certainly not a fast device, and it's running at about 50% > utilisation with an average queue depth of 10-15 IOs. That's hopefully the cache working as intended: sinking small continuous writes (network data uploads) and aggregating them into larger blocks to flush out to the bulk storage (i.e. reads from the cache to write to the bulk). >> It should be handling about the same load as prior to the reboot. > > If that's the typical sustained load, I wouldn't expect it to have > that much extra overhead given small writes on RAID 6 volumes have > the worst performance characteristics possible. If the write cache > starts flushing lots of small discontiguous writes, I'd expect to > see that device go to 100% utilisation and long write wait times for > extended periods... There shouldn't be many small discontiguous writes in the data: it's basically network uploads to sequential files in the 100s MB to multi GB range. But there's also a bunch of reflinking, not to mention occasionally removing highly reflinked multi-TB files, so these metadata updates might count as "lots of small discontiguous writes"? Thanks for your help, Chris ^ permalink raw reply [flat|nested] 15+ messages in thread
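The "watch the cache device like a hawk" plan above could be sketched as a small shell monitor. This is an illustrative sketch only: "cachevol" is a hypothetical dm device name, and the dm-writecache status field layout assumed here should be verified against the device-mapper writecache documentation for the running kernel.

```shell
#!/bin/sh
# Sketch: log dm-writecache occupancy plus device utilisation once a minute.
# Field positions assume "dmsetup status" prints
# "<start> <len> writecache <errors> <total blocks> <free blocks>
# <writeback blocks> ..." -- check against the writecache docs in use.

# Percentage of cache blocks in use, given one dmsetup status line.
wc_pct_used() {
    set -- $1                 # word-split the status line into $1..$n
    total=$5 free=$6
    echo $(( (total - free) * 100 / total ))
}

monitor() {
    while :; do
        line=$(dmsetup status cachevol)
        echo "$(date -Is) writecache $(wc_pct_used "$line")% used"
        iostat -dxy 1 1       # one 1-second extended sample, devices only
        sleep 60
    done
}
```

Logged over time, a cache that pins near 100% used while the backing RAID sits at 100% utilisation would corroborate the overloaded-RAID theory.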
* Re: rm hanging, v6.1.35 2023-07-12 1:13 ` Chris Dunlop @ 2023-07-12 1:42 ` Dave Chinner 2023-07-12 2:17 ` Subject: v5.15 backport - 5e672cd69f0a xfs: non-blocking inodegc pushes Chris Dunlop 0 siblings, 1 reply; 15+ messages in thread From: Dave Chinner @ 2023-07-12 1:42 UTC (permalink / raw) To: Chris Dunlop; +Cc: linux-xfs On Wed, Jul 12, 2023 at 11:13:56AM +1000, Chris Dunlop wrote: > On Wed, Jul 12, 2023 at 08:21:11AM +1000, Dave Chinner wrote: > > On Tue, Jul 11, 2023 at 05:05:30PM +1000, Chris Dunlop wrote: > > > On Tue, Jul 11, 2023 at 01:10:11PM +1000, Dave Chinner wrote: > > > > > More on the problem topic below, and apologies for the aside, but > > > this is my immediate concern: post the reboot into v5.15.118 I have > > > one of the filesystems failing to mount with: > ... > > I would suggest that this indicates a torn write of some kind. Given > > the state of the system when you rebooted, and all the RAID 5/6 > > writes in progress, this is entirely possible... > ... > > It's probably safest to just zero the log and run repair at this > > point. > > Thanks. I did that last night (after taking a snapshot) - 'xfs_repair -v' is > still running but there's a LOT of nasty output so it's not looking good. > > ...oh. It finished whilst I've been writing this. If you're interested in > the log: > > https://file.io/XOGokgxgttEX Oh, there's *lots* of CRC errors. All through the free space and rmap btrees, but all in a similar range of LBAs. 
$ grep CRC ~/Downloads/xfs_repair.log |sort -k 9
Metadata CRC error detected at 0x5557765dfc6d, xfs_rmapbt block 0x10518c600/0x1000
Metadata CRC error detected at 0x5557765dfc6d, xfs_rmapbt block 0x10518c648/0x1000
Metadata CRC error detected at 0x55577660b07d, xfs_bnobt block 0x10518c650/0x1000
Metadata CRC error detected at 0x55577660b07d, xfs_cntbt block 0x10518c668/0x1000
Metadata CRC error detected at 0x55577660b07d, xfs_bnobt block 0x10518ce88/0x1000
Metadata CRC error detected at 0x5557765dfc6d, xfs_rmapbt block 0x10518d6c0/0x1000
Metadata CRC error detected at 0x55577660b07d, xfs_bnobt block 0x10518d6d0/0x1000
Metadata CRC error detected at 0x5557765dfc6d, xfs_rmapbt block 0x10518d6e0/0x1000
Metadata CRC error detected at 0x5557765dfc6d, xfs_rmapbt block 0x10518d828/0x1000
Metadata CRC error detected at 0x55577660b07d, xfs_bnobt block 0x10518d858/0x1000
.....
There's a whole cluster of CRC errors very close together (10 across 0x1200 sectors = 4608 sectors, ~2.25MB of disk space). This pattern of "bad CRC clusters" occurs more often than not as I look through the list of CRC errors. I suspect this indicates something more fundamental wrong with either the RAID volume or the dm-writecache on top of it... > The directory structure looks sane, I'll start running checks on the data. Given the amount of bad metadata, I wouldn't trust anything in that filesystem to be properly intact. > > > > I don't see anything indicating a filesystem problem here. This > > > > looks like a massively overloaded RAID 5 volume. i.e. the fast > > > > storage that makes up the write cache has filled, and now everything > > > > is stuck waiting for the fast cache to drain to the slow backing > > > > store before new writes can be made to the fast cache. IOWs, > > > > everything is running as RAID5 write speed and there's a huge > > > > backlog of data being written to the RAID 5 volume(s) keeping them > > > > 100% busy. 
> > Oddly, of the 56 similarly configured filesystems (writecache/lvm/rbd) on > this box, with maybe 10 actively sinking writes at any time, that one above > is the only one that had any trouble on reboot. If it had been a general > problem w/ the cache device I would have thought more of the active writers > would have similar issues. Maybe I just got lucky - and/or that demonstrates > how hard xfs tries to keep your data sane. I don't think it's XFS related at all.... > > > This had never been an issue with v5.15. Is it possible the upgrade > > > to v6.1 had a hand in this or that's probably just coincidence? > > > > It could be a dm-writecache or md raid regression, but it could be > > just "luck". > > Ugh. Sigh. I guess at some point I'm going to have to bite the bullet again, > and next time watch the cache device like a hawk. I'll keep an eye out for > dm-writecache and md raid problems and patches etc. to see what might come > up. > > Is there anything else that occurs to you that I might monitor prior to and > during any future recurrence of the problem? Not really. Whatever problem occurred that started things going bad was not evident from the information that was gathered. > > > In particular, could "5e672cd69f0a xfs: non-blocking inodegc pushes" > > > cause a significantly greater write load on the cache? > > > > No. > > > > > I note that there's one fs (separate to the corrupt one above) > > > that's still in mount recovery since the boot some hours ago. On > > > past experience that indicates the inodegc stuff is holding things > > > up, i.e. it would have been running prior to the reboot - or at > > > least trying to. > > > > I only found one inodegc worker running in the trace above - it was > > blocked on writecache doing a rmapbt block read removing extents. It > > may have been a long running cleanup, but it may not have been. 
> > Probably was a cleanup: that's the reason for the update to v6.1, every now > and again the box was running into the problem of getting blocked on the > cleanup. The extent of the problem is significantly reduced by moving from > one large fs where any extended cleanup would block everything, to multiple > small-ish fs'es (500G-15T) where the blast radius of an extended cleanup is > far more constrained. But the problem was still pretty annoying when it > hit. > > Hmmm, maybe I can just carry a local backport of "non-blocking inodegc > pushes" in my local v5.15. That would push back my need to move to v6.1. > > Or could / should it be considered for an official backport? Looks like it > applies cleanly to current v5.15.120. I thought that had already been done - there's supposed to be someone taking care of 5.15 LTS backports for XFS.... > > As it is, mount taking a long time because there's an inode with a > > huge number of extents that need freeing on an unlinked list is > > *not* related to background inodegc in any way - this has -always- > > happened with XFS; it's simply what unlinked inode recovery does. > > > > i.e. background inodegc just moved the extent freeing from syscall > > exit context to an async worker thread; the filesystem still has > > exactly the same work to do. If the system goes down while that work > > is in progress, log recovery has always finished off that cleanup > > work... > > Gotcha. I'd mistakenly thought "non-blocking inodegc pushes" queued up the > garbage collection for background processing. My further mistaken hand-wavy > thought was that, if the processing that was previously foreground was now > punted to background (perhaps with different priorities) maybe the > background processing was simply way more efficient, enough to swamp the > cache with metadata updates. Doubt it, the processing is exactly the same code, just done from a different task context. 
> But your further explanation, and actually reading the patch (should > have done that first), show the gc was already queued; the update was to NOT > wait for the queue to be flushed. > > Hmmm, then again... might it be possible that, without the patch, at some > point after a large delete, further work was blocked whilst waiting for the > queue to be flushed, limiting the total amount of work, but with the patch, > the further work (e.g. more deletes) is able to be queued - possibly to the > point of swamping the cache device? Unlinks don't generate a lot of IO. They do a lot of repeated operations to cached metadata, and the journal relogs all those blocks without triggering IO to the journal, too. > > > There's been no change to the cache device over the reboot, and it > > > currently looks busy, but it doesn't look completely swamped: > > .... > > > > About 150 1MB sized reads, and about 1000-1500 much smaller > > writes each second, with an average write wait of near 10ms. > > Certainly not a fast device, and it's running at about 50% > > utilisation with an average queue depth of 10-15 IOs. > > That's hopefully the cache working as intended: sinking small continuous > writes (network data uploads) and aggregating them into larger blocks to > flush out to the bulk storage (i.e. reads from the cache to write to the > bulk). > > > > It should be handling about the same load as prior to the reboot. > > > > If that's the typical sustained load, I wouldn't expect it to have > > that much extra overhead given small writes on RAID 6 volumes have > > the worst performance characteristics possible. If the write cache > > starts flushing lots of small discontiguous writes, I'd expect to > > see that device go to 100% utilisation and long write wait times for > > extended periods... > > There shouldn't be many small discontiguous writes in the data: it's > basically network uploads to sequential files in the 100s MB to multi GB > range. 
But there's also a bunch of reflinking, not to mention occasionally > removing highly reflinked multi-TB files, so these metadata updates might > count as "lots of small discontiguous writes"? Yeah, the metadata updates end up being small discontiguous writes.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 15+ messages in thread
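The "bad CRC clusters" pattern Dave describes above can be pulled out of a repair log mechanically. A sketch, assuming log lines of the form shown earlier ("Metadata CRC error detected at ..., xfs_rmapbt block 0x.../0x1000"); the 8192-sector gap threshold is an arbitrary illustrative choice, and printing addresses above 2^32 back as hex assumes gawk:

```shell
#!/bin/sh
# Sketch: group the block addresses from xfs_repair "Metadata CRC error"
# lines (read on stdin) into clusters of nearby sectors. Addresses within
# "gap" sectors of each other are treated as one damaged region.
crc_clusters() {
    sed -n 's/.*block \(0x[0-9a-fA-F]*\)\/.*/\1/p' \
      | xargs -r printf '%d\n' \
      | sort -n -u \
      | awk -v gap=8192 '
          NR == 1 { start = prev = $1; count = 1; next }
          {
              if ($1 - prev > gap) {
                  printf "cluster 0x%x-0x%x (%d blocks)\n", start, prev, count
                  start = $1; count = 0
              }
              prev = $1; count++
          }
          END { if (NR) printf "cluster 0x%x-0x%x (%d blocks)\n", start, prev, count }'
}
```

For example, `crc_clusters < xfs_repair.log` would print one line per damaged region with its block range and error count, making the "more often than not" clustering visible at a glance.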
* Subject: v5.15 backport - 5e672cd69f0a xfs: non-blocking inodegc pushes 2023-07-12 1:42 ` Dave Chinner @ 2023-07-12 2:17 ` Chris Dunlop 2023-07-12 9:26 ` Amir Goldstein 0 siblings, 1 reply; 15+ messages in thread From: Chris Dunlop @ 2023-07-12 2:17 UTC (permalink / raw) To: Leah Rumancik, Theodore Ts'o; +Cc: Dave Chinner, linux-xfs Request for backport to v5.15: 5e672cd69f0a xfs: non-blocking inodegc pushes Reference: https://lore.kernel.org/all/ZK4E%2FgGuaBu+qvKL@dread.disaster.area/ --------------------------------------------------------------------- From: Dave Chinner <david@fromorbit.com> To: Chris Dunlop <chris@onthe.net.au> Cc: linux-xfs@vger.kernel.org Subject: Re: rm hanging, v6.1.35 On Wed, Jul 12, 2023 at 11:13:56AM +1000, Chris Dunlop wrote: >> On Tue, Jul 11, 2023 at 05:05:30PM +1000, Chris Dunlop wrote: >>> In particular, could "5e672cd69f0a xfs: non-blocking inodegc pushes" >>> cause a significantly greater write load on the cache? ... > Or could / should it be considered for an official backport? Looks like it > applies cleanly to current v5.15.120. I thought that had already been done - there's supposed to be someone taking care of 5.15 LTS backports for XFS.... --------------------------------------------------------------------- Thanks, Chris ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Subject: v5.15 backport - 5e672cd69f0a xfs: non-blocking inodegc pushes 2023-07-12 2:17 ` Subject: v5.15 backport - 5e672cd69f0a xfs: non-blocking inodegc pushes Chris Dunlop @ 2023-07-12 9:26 ` Amir Goldstein 2023-07-13 0:31 ` Chris Dunlop 0 siblings, 1 reply; 15+ messages in thread From: Amir Goldstein @ 2023-07-12 9:26 UTC (permalink / raw) To: Chris Dunlop Cc: Leah Rumancik, Theodore Ts'o, Dave Chinner, linux-xfs, Leah Rumancik, Darrick J. Wong, Chandan Babu R On Wed, Jul 12, 2023 at 5:37 AM Chris Dunlop <chris@onthe.net.au> wrote: > > Request for backport to v5.15: > > 5e672cd69f0a xfs: non-blocking inodegc pushes This is not the subject of above commit, it was the subject of the cover letter: https://www.spinics.net/lists/linux-xfs/msg61813.html containing the following upstream commits: 7cf2b0f9611b xfs: bound maximum wait time for inodegc work 5e672cd69f0a xfs: introduce xfs_inodegc_push() > > Reference: > > https://lore.kernel.org/all/ZK4E%2FgGuaBu+qvKL@dread.disaster.area/ > --------------------------------------------------------------------- > From: Dave Chinner <david@fromorbit.com> > To: Chris Dunlop <chris@onthe.net.au> > Cc: linux-xfs@vger.kernel.org > Subject: Re: rm hanging, v6.1.35 > > On Wed, Jul 12, 2023 at 11:13:56AM +1000, Chris Dunlop wrote: > >> On Tue, Jul 11, 2023 at 05:05:30PM +1000, Chris Dunlop wrote: > >>> In particular, could "5e672cd69f0a xfs: non-blocking inodegc pushes" > >>> cause a significantly greater write load on the cache? > ... > > Or could / should it be considered for an official backport? Looks like it > > applies cleanly to current v5.15.120. > > I thought that had already been done - there's supposed to be > someone taking care of 5.15 LTS backports for XFS.... Leah has already queued these two patches for 5.15 backport, but she is now on sick leave, so that was not done yet. 
We did however, identify a few more inodegc fixes from 6.4, which also fix a bug in one of the two commits above: 03e0add80f4c xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately Fixes: 7cf2b0f9611b ("xfs: bound maximum wait time for inodegc work") b37c4c8339cd xfs: check that per-cpu inodegc workers actually run on that cpu 2254a7396a0c xfs: fix xfs_inodegc_stop racing with mod_delayed_work Fixes: 6191cf3ad59f ("xfs: flush inodegc workqueue tasks before cancel") 6191cf3ad59f ("xfs: flush inodegc workqueue tasks before cancel") has already been applied to 5.15.y. stable tree rules require that the above fixes from 6.4 be applied to 6.1.y before 5.15.y, so I have already tested them and they are ready to be posted. I wanted to wait a bit after the 6.4.0 release to make sure that we did not pull the blanket too much to the other side, because as the reports from Chris demonstrate, the inodegc fixes had some unexpected outcomes. Anyway, it is 6.4.3 already and I haven't seen any shouts on the list, plus the 6.4 fixes look pretty safe, so I guess this is a good time for me to post the 6.1.y backports. w.r.t testing and posting the 5.15.y backports, we currently have a problem. Chandan said that he will not have time to fill in for Leah and I don't have a baseline established for 5.15 and am going on vacation next week anyway. So either Chandan can make an exception for this inodegc series, or it will have to wait for Leah to be back. Thanks, Amir. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Subject: v5.15 backport - 5e672cd69f0a xfs: non-blocking inodegc pushes 2023-07-12 9:26 ` Amir Goldstein @ 2023-07-13 0:31 ` Chris Dunlop 2023-07-13 0:57 ` Chris Dunlop 0 siblings, 1 reply; 15+ messages in thread From: Chris Dunlop @ 2023-07-13 0:31 UTC (permalink / raw) To: Amir Goldstein Cc: Leah Rumancik, Theodore Ts'o, Dave Chinner, linux-xfs, Leah Rumancik, Darrick J. Wong, Chandan Babu R On Wed, Jul 12, 2023 at 12:26:54PM +0300, Amir Goldstein wrote: > On Wed, Jul 12, 2023 at 5:37 AM Chris Dunlop <chris@onthe.net.au> wrote: >> >> Request for backport to v5.15: >> >> 5e672cd69f0a xfs: non-blocking inodegc pushes > > This is not the subject of above commit, it was the > subject of the cover letter: > https://www.spinics.net/lists/linux-xfs/msg61813.html > > containing the following upstream commits: > 7cf2b0f9611b xfs: bound maximum wait time for inodegc work > 5e672cd69f0a xfs: introduce xfs_inodegc_push() ... > Leah has already queued these two patches for 5.15 backport, > but she is now on sick leave, so that was not done yet. > > We did however, identify a few more inodegc fixes from 6.4, > which also fix a bug in one of the two commits above: > > 03e0add80f4c xfs: explicitly specify cpu when forcing inodegc delayed > work to run immediately > Fixes: 7cf2b0f9611b ("xfs: bound maximum wait time for inodegc work") > b37c4c8339cd xfs: check that per-cpu inodegc workers actually run on that cpu > 2254a7396a0c xfs: fix xfs_inodegc_stop racing with mod_delayed_work > Fixes: 6191cf3ad59f ("xfs: flush inodegc workqueue tasks before cancel") > > 6191cf3ad59f ("xfs: flush inodegc workqueue tasks before cancel") has already > been applied to 5.15.y. > > stable tree rules require that the above fixes from 6.4 be applied to 6.1.y > before 5.15.y, so I have already tested them and they are ready to be posted. 
> I wanted to wait a bit after the 6.4.0 release to make sure that we did not pull > the blanket too much to the other side, because as the reports from Chris > demonstrate, the inodegc fixes had some unexpected outcomes. Just to clarify: whilst I updated to 6.1 specifically to gain access to the non-blocking inodegc commits, there's no evidence those inodegc commits had anything at all to do with my issue on 6.1, and Dave's sense is that they probably weren't involved. That said, those "Fixes:" commits that haven't yet made it to 6.1 look at least a little suspicious to my naive eye. > Anyway, it is 6.4.3 already and I haven't seen any shouts on the list, > plus the 6.4 fixes look pretty safe, so I guess this is a good time for me > to post the 6.1.y backports. > > w.r.t testing and posting the 5.15.y backports, we currently have a problem. > Chandan said that he will not have time to fill in for Leah and I don't have > a baseline established for 5.15 and am going on vacation next week anyway. > > So either Chandan can make an exception for this inodegc series, or it will > have to wait for Leah to be back. I might gird my loins and try 6.1 again once those further fixes are in. Cheers, Chris ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Subject: v5.15 backport - 5e672cd69f0a xfs: non-blocking inodegc pushes 2023-07-13 0:31 ` Chris Dunlop @ 2023-07-13 0:57 ` Chris Dunlop 0 siblings, 0 replies; 15+ messages in thread From: Chris Dunlop @ 2023-07-13 0:57 UTC (permalink / raw) To: Amir Goldstein Cc: Leah Rumancik, Theodore Ts'o, Dave Chinner, linux-xfs, Leah Rumancik, Darrick J. Wong, Chandan Babu R On Thu, Jul 13, 2023 at 10:31:04AM +1000, Chris Dunlop wrote: > On Wed, Jul 12, 2023 at 12:26:54PM +0300, Amir Goldstein wrote: >> On Wed, Jul 12, 2023 at 5:37 AM Chris Dunlop <chris@onthe.net.au> wrote: >>> >>> Request for backport to v5.15: >>> >>> 5e672cd69f0a xfs: non-blocking inodegc pushes >> >> This is not the subject of above commit, it was the >> subject of the cover letter: >> https://www.spinics.net/lists/linux-xfs/msg61813.html >> >> containing the following upstream commits: >> 7cf2b0f9611b xfs: bound maximum wait time for inodegc work >> 5e672cd69f0a xfs: introduce xfs_inodegc_push() > ... >> Leah has already queued these two patches for 5.15 backport, >> but she is now on sick leave, so that was not done yet. >> >> We did however, identify a few more inodegc fixes from 6.4, >> which also fix a bug in one of the two commits above: >> >> 03e0add80f4c xfs: explicitly specify cpu when forcing inodegc delayed >> work to run immediately >> Fixes: 7cf2b0f9611b ("xfs: bound maximum wait time for inodegc work") >> b37c4c8339cd xfs: check that per-cpu inodegc workers actually run on that cpu >> 2254a7396a0c xfs: fix xfs_inodegc_stop racing with mod_delayed_work >> Fixes: 6191cf3ad59f ("xfs: flush inodegc workqueue tasks before cancel") >> >> 6191cf3ad59f ("xfs: flush inodegc workqueue tasks before cancel") has already >> been applied to 5.15.y. >> >> stable tree rules require that the above fixes from 6.4 be applied to 6.1.y >> before 5.15.y, so I have already tested them and they are ready to be posted. 
>> I wanted to wait a bit after the 6.4.0 release to make sure that we did not pull >> the blanket too much to the other side, because as the reports from Chris >> demonstrate, the inodegc fixes had some unexpected outcomes. > > Just to clarify: whilst I updated to 6.1 specifically to gain access > to the non-blocking inodegc commits, there's no evidence those inodegc > commits had anything at all to do with my issue on 6.1, and Dave's > sense is that they probably weren't involved. > > That said, those "Fixes:" commits that haven't yet made it to 6.1 look > at least a little suspicious to my naive eye. Hmmm.... in particular, this fix: 2254a7396a0c xfs: fix xfs_inodegc_stop racing with mod_delayed_work Fixes: 6191cf3ad59f ("xfs: flush inodegc workqueue tasks before cancel") mentions: ---- For this to trip, we must have a thread draining the inodedgc workqueue and a second thread trying to queue inodegc work to that workqueue. This can happen if freezing or a ro remount race with reclaim poking our faux inodegc shrinker and another thread dropping an unlinked O_RDONLY file: ---- I'm indeed periodically freezing these filesystems to take snapshots. So that could possibly be the source of my problem, although my kernel log does not have the WARN_ON_ONCE() mentioned in the patch. Cheers, Chris ^ permalink raw reply [flat|nested] 15+ messages in thread
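The race fixed by 2254a7396a0c needs a freeze (or ro remount) in flight, so in a periodic snapshot cycle like the one described above the exposure window is the span between freeze and thaw. A sketch of such a cycle, with hypothetical mount point and volume names; whether the real snapshots here are LVM or rbd snapshots is not stated, lvcreate is purely illustrative, and DRYRUN=1 echoes the privileged commands instead of running them:

```shell
#!/bin/sh
# Sketch: a freeze/snapshot/thaw cycle of the kind described above.
# All device and mount names are hypothetical.
run() {
    if [ -n "$DRYRUN" ]; then echo "+ $*"; else "$@"; fi
}

snapshot_fs() {
    fs=$1 vg=$2 lv=$3
    run fsfreeze -f "$fs"               # race window opens here...
    trap 'run fsfreeze -u "$fs"' EXIT   # always thaw, even if lvcreate fails
    run lvcreate -s -n "${lv}_snap" "$vg/$lv"
    run fsfreeze -u "$fs"               # ...and closes here
    trap - EXIT
}
```

The trap ensures the filesystem is thawed even if the snapshot command fails, keeping the window (and any writers stalled behind the freeze) as short as possible.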
* Re: rm hanging, v6.1.35 2023-07-10 21:53 rm hanging, v6.1.35 Chris Dunlop 2023-07-11 0:13 ` Chris Dunlop @ 2023-07-11 0:53 ` Bagas Sanjaya 2023-07-11 1:13 ` Chris Dunlop 2023-07-11 2:29 ` Dave Chinner 1 sibling, 2 replies; 15+ messages in thread From: Bagas Sanjaya @ 2023-07-11 0:53 UTC (permalink / raw) To: Chris Dunlop, Linux XFS, Dave Chinner, Darrick J. Wong Cc: Linux Stable, Linux Kernel Mailing List, Linux Regressions [-- Attachment #1: Type: text/plain, Size: 5291 bytes --] On Tue, Jul 11, 2023 at 07:53:54AM +1000, Chris Dunlop wrote: > Hi, > > This box is newly booted into linux v6.1.35 (2 days ago), it was previously > running v5.15.118 without any problems (other than that fixed by > "5e672cd69f0a xfs: non-blocking inodegc pushes", the reason for the > upgrade). > > I have rm operations on two files that have been stuck for in excess of 22 > hours and 18 hours respectively: > > $ ps -opid,lstart,state,wchan=WCHAN-xxxxxxxxxxxxxxx,cmd -C rm > PID STARTED S WCHAN-xxxxxxxxxxxxxxx CMD > 2379355 Mon Jul 10 09:07:57 2023 D vfs_unlink /bin/rm -rf /aaa/5539_tmp > 2392421 Mon Jul 10 09:18:27 2023 D down_write_nested /bin/rm -rf /aaa/5539_tmp > 2485728 Mon Jul 10 09:28:57 2023 D down_write_nested /bin/rm -rf /aaa/5539_tmp > 2488254 Mon Jul 10 09:39:27 2023 D down_write_nested /bin/rm -rf /aaa/5539_tmp > 2491180 Mon Jul 10 09:49:58 2023 D down_write_nested /bin/rm -rf /aaa/5539_tmp > 3014914 Mon Jul 10 13:00:33 2023 D vfs_unlink /bin/rm -rf /bbb/5541_tmp > 3095893 Mon Jul 10 13:11:03 2023 D down_write_nested /bin/rm -rf /bbb/5541_tmp > 3098809 Mon Jul 10 13:21:35 2023 D down_write_nested /bin/rm -rf /bbb/5541_tmp > 3101387 Mon Jul 10 13:32:06 2023 D down_write_nested /bin/rm -rf /bbb/5541_tmp > 3195017 Mon Jul 10 13:42:37 2023 D down_write_nested /bin/rm -rf /bbb/5541_tmp > > The "rm"s are run from a process that's obviously tried a few times to get > rid of these files. 
> > There's nothing extraordinary about the files in terms of size: > > $ ls -ltrn --full-time /aaa/5539_tmp /bbb/5541_tmp > -rw-rw-rw- 1 1482 1482 7870643 2023-07-10 06:07:58.684036505 +1000 /aaa/5539_tmp > -rw-rw-rw- 1 1482 1482 701240 2023-07-10 10:00:34.181064549 +1000 /bbb/5541_tmp > > As hinted by the WCHAN in the ps output above, each "primary" rm (i.e. the > first one run on each file) stack trace looks like: > > [<0>] vfs_unlink+0x48/0x270 > [<0>] do_unlinkat+0x1f5/0x290 > [<0>] __x64_sys_unlinkat+0x3b/0x60 > [<0>] do_syscall_64+0x34/0x80 > [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > And each "secondary" rm (i.e. the subsequent ones on each file) stack trace > looks like: > > == blog-230710-xfs-rm-stuckd > [<0>] down_write_nested+0xdc/0x100 > [<0>] do_unlinkat+0x10d/0x290 > [<0>] __x64_sys_unlinkat+0x3b/0x60 > [<0>] do_syscall_64+0x34/0x80 > [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 > > Multiple kernel strack traces don't show vfs_unlink or anything related that > I can see, or anything else consistent or otherwise interesting. Most cores > are idle. 
> > Each of /aaa and /bbb are separate XFS filesystems: > > $ xfs_info /aaa > meta-data=/dev/mapper/aaa isize=512 agcount=2, agsize=268434432 blks > = sectsz=4096 attr=2, projid32bit=1 > = crc=1 finobt=1, sparse=1, rmapbt=1 > = reflink=1 bigtime=1 inobtcount=1 > data = bsize=4096 blocks=536868864, imaxpct=5 > = sunit=256 swidth=256 blks > naming =version 2 bsize=4096 ascii-ci=0, ftype=1 > log =internal log bsize=4096 blocks=262143, version=2 > = sectsz=4096 sunit=1 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > > $ xfs_info /bbb > meta-data=/dev/mapper/bbb isize=512 agcount=8, agsize=268434432 blks > = sectsz=4096 attr=2, projid32bit=1 > = crc=1 finobt=1, sparse=1, rmapbt=1 > = reflink=1 bigtime=1 inobtcount=1 > data = bsize=4096 blocks=1879047168, imaxpct=5 > = sunit=256 swidth=256 blks > naming =version 2 bsize=4096 ascii-ci=0, ftype=1 > log =internal log bsize=4096 blocks=521728, version=2 > = sectsz=4096 sunit=1 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > > There's plenty of free space at the fs level: > > $ df -h /aaa /bbb > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/aaa 2.0T 551G 1.5T 27% /aaa > /dev/mapper/bbb 7.0T 3.6T 3.5T 52% /bbb > > The fses are on sparse ceph/rbd volumes, the underlying storage tells me > they're 50-60% utilised: > > aaa: provisioned="2048G" used="1015.9G" > bbb: provisioned="7168G" used="4925.3G" > > Where to from here? > > I'm guessing only a reboot is going to unstick this. Anything I should be > looking at before reverting to v5.15.118? > > ...subsequent to starting writing all this down I have another two sets of > rms stuck, again on unremarkable files, and on two more separate > filesystems. > > ...oh. And an 'ls' on those files is hanging. The reboot has become more > urgent. > Smells like regression resurfaced, right? I mean, does 5e672cd69f0a53 not completely fix your reported blocking regression earlier? I'm kinda confused... -- An old man doll... 
just what I always wanted! - Clara ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: rm hanging, v6.1.35 2023-07-11 0:53 ` rm hanging, v6.1.35 Bagas Sanjaya @ 2023-07-11 1:13 ` Chris Dunlop 2023-07-11 2:29 ` Dave Chinner 1 sibling, 0 replies; 15+ messages in thread From: Chris Dunlop @ 2023-07-11 1:13 UTC (permalink / raw) To: Bagas Sanjaya Cc: Linux XFS, Dave Chinner, Darrick J. Wong, Linux Stable, Linux Kernel Mailing List, Linux Regressions On Tue, Jul 11, 2023 at 07:53:35AM +0700, Bagas Sanjaya wrote: > On Tue, Jul 11, 2023 at 07:53:54AM +1000, Chris Dunlop wrote: >> Hi, >> >> This box is newly booted into linux v6.1.35 (2 days ago), it was previously >> running v5.15.118 without any problems (other than that fixed by >> "5e672cd69f0a xfs: non-blocking inodegc pushes", the reason for the >> upgrade). >> >> I have rm operations on two files that have been stuck for in excess of 22 >> hours and 18 hours respectively: > > Smells like regression resurfaced, right? I mean, does 5e672cd69f0a53 not > completely fix your reported blocking regression earlier? > > I'm kinda confused... It looks like a completely different problem. I was wanting 5e672cd69f0a53 as a resolution to this: https://lore.kernel.org/all/20220510215411.GT1098723@dread.disaster.area/T/ In that case there were kernel stacks etc. showing inodegc stuff, and the problem eventually resolved itself once the massive amount of work had been processed (5e672cd69f0a53 puts that work into the background). In this case it looks like it's just a pure hang or deadlock - there's apparently nothing happening. Cheers, Chris ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: rm hanging, v6.1.35 2023-07-11 0:53 ` rm hanging, v6.1.35 Bagas Sanjaya 2023-07-11 1:13 ` Chris Dunlop @ 2023-07-11 2:29 ` Dave Chinner 1 sibling, 0 replies; 15+ messages in thread From: Dave Chinner @ 2023-07-11 2:29 UTC (permalink / raw) To: Bagas Sanjaya Cc: Chris Dunlop, Linux XFS, Dave Chinner, Darrick J. Wong, Linux Stable, Linux Kernel Mailing List, Linux Regressions On Tue, Jul 11, 2023 at 07:53:35AM +0700, Bagas Sanjaya wrote: > On Tue, Jul 11, 2023 at 07:53:54AM +1000, Chris Dunlop wrote: > > Hi, > > > > This box is newly booted into linux v6.1.35 (2 days ago), it was previously > > running v5.15.118 without any problems (other than that fixed by > > "5e672cd69f0a xfs: non-blocking inodegc pushes", the reason for the > > upgrade). > > > > I have rm operations on two files that have been stuck for in excess of 22 > > hours and 18 hours respectively: > > > > $ ps -opid,lstart,state,wchan=WCHAN-xxxxxxxxxxxxxxx,cmd -C rm > > PID STARTED S WCHAN-xxxxxxxxxxxxxxx CMD > > 2379355 Mon Jul 10 09:07:57 2023 D vfs_unlink /bin/rm -rf /aaa/5539_tmp > > 2392421 Mon Jul 10 09:18:27 2023 D down_write_nested /bin/rm -rf /aaa/5539_tmp > > 2485728 Mon Jul 10 09:28:57 2023 D down_write_nested /bin/rm -rf /aaa/5539_tmp > > 2488254 Mon Jul 10 09:39:27 2023 D down_write_nested /bin/rm -rf /aaa/5539_tmp > > 2491180 Mon Jul 10 09:49:58 2023 D down_write_nested /bin/rm -rf /aaa/5539_tmp > > 3014914 Mon Jul 10 13:00:33 2023 D vfs_unlink /bin/rm -rf /bbb/5541_tmp > > 3095893 Mon Jul 10 13:11:03 2023 D down_write_nested /bin/rm -rf /bbb/5541_tmp > > 3098809 Mon Jul 10 13:21:35 2023 D down_write_nested /bin/rm -rf /bbb/5541_tmp > > 3101387 Mon Jul 10 13:32:06 2023 D down_write_nested /bin/rm -rf /bbb/5541_tmp > > 3195017 Mon Jul 10 13:42:37 2023 D down_write_nested /bin/rm -rf /bbb/5541_tmp > > > > The "rm"s are run from a process that's obviously tried a few times to get > > rid of these files. 
> >
> > There's nothing extraordinary about the files in terms of size:
> >
> > $ ls -ltrn --full-time /aaa/5539_tmp /bbb/5541_tmp
> > -rw-rw-rw- 1 1482 1482 7870643 2023-07-10 06:07:58.684036505 +1000 /aaa/5539_tmp
> > -rw-rw-rw- 1 1482 1482  701240 2023-07-10 10:00:34.181064549 +1000 /bbb/5541_tmp
> >
> > As hinted by the WCHAN in the ps output above, each "primary" rm (i.e. the
> > first one run on each file) stack trace looks like:
> >
> > [<0>] vfs_unlink+0x48/0x270
> > [<0>] do_unlinkat+0x1f5/0x290
> > [<0>] __x64_sys_unlinkat+0x3b/0x60
> > [<0>] do_syscall_64+0x34/0x80
> > [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

This looks to be stuck on the target inode lock (i.e. the locks for the
inodes at /aaa/5539_tmp and /bbb/5541_tmp). What's holding these inode
locks? This hasn't even got to XFS yet here, so there's something else
going on in the background.

Can you attach the full output of 'echo w > /proc/sysrq-trigger' and
'echo t > /proc/sysrq-trigger', please?

> > And each "secondary" rm (i.e. the subsequent ones on each file) stack trace
> > looks like:
> >
> > [<0>] down_write_nested+0xdc/0x100
> > [<0>] do_unlinkat+0x10d/0x290
> > [<0>] __x64_sys_unlinkat+0x3b/0x60
> > [<0>] do_syscall_64+0x34/0x80
> > [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

These are likely all stuck on the parent directory inode lock (i.e. /aaa
and /bbb).

> > Where to from here?
> >
> > I'm guessing only a reboot is going to unstick this. Anything I should be
> > looking at before reverting to v5.15.118?
> >
> > ...subsequent to starting writing all this down I have another two sets of
> > rms stuck, again on unremarkable files, and on two more separate
> > filesystems.

What does an "unremarkable file" look like? Is it a reflink copy of
something else, a hard link, a small/large regular data file or something
else?

> > ...oh. And an 'ls' on those files is hanging. The reboot has become more
> > urgent.
Yup, that's most likely getting stuck on the directory locks that the
unlinks are holding....

-Dave.

--
Dave Chinner
david@fromorbit.com
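As an aside, the blocked-task view in Chris's quoted ps output can also be
pulled straight from procfs. The following is a rough sketch assuming the
standard Linux /proc layout; the blocked_tasks() helper name and its output
format are illustrative only, not something from this thread:

```python
# Sketch: scan /proc for tasks in uninterruptible sleep (state "D") and
# report their wait channel and command line, roughly what the quoted
# "ps -opid,lstart,state,wchan,cmd" invocation shows.
import os

def blocked_tasks(proc="/proc"):
    """Return sorted (pid, wchan, cmdline) tuples for D-state tasks."""
    tasks = []
    for entry in os.listdir(proc):
        if not entry.isdigit():        # skip non-PID entries like 'meminfo'
            continue
        try:
            with open(f"{proc}/{entry}/stat") as f:
                # comm may contain spaces or ')', so split on the last ')'
                fields = f.read().rsplit(")", 1)[1].split()
            if fields[0] != "D":       # process state follows the comm field
                continue
            with open(f"{proc}/{entry}/wchan") as f:
                wchan = f.read().strip() or "-"
            with open(f"{proc}/{entry}/cmdline", "rb") as f:
                cmd = f.read().replace(b"\0", b" ").decode().strip()
            tasks.append((int(entry), wchan, cmd))
        except (OSError, IndexError):
            continue                   # task exited while we were scanning
    return sorted(tasks)
```

A wait channel of down_write_nested means the task is blocked taking a
rw_semaphore for write, which fits the reading above that the secondary rms
are queued on the parent directory's inode lock.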
end of thread, other threads:[~2023-07-13  0:58 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-10 21:53 rm hanging, v6.1.35 Chris Dunlop
2023-07-11  0:13 ` Chris Dunlop
2023-07-11  1:57   ` Chris Dunlop
2023-07-11  3:10     ` Dave Chinner
2023-07-11  7:05       ` Chris Dunlop
2023-07-11 22:21         ` Dave Chinner
2023-07-12  1:13           ` Chris Dunlop
2023-07-12  1:42             ` Dave Chinner
2023-07-12  2:17               ` Subject: v5.15 backport - 5e672cd69f0a xfs: non-blocking inodegc pushes Chris Dunlop
2023-07-12  9:26                 ` Amir Goldstein
2023-07-13  0:31                   ` Chris Dunlop
2023-07-13  0:57                     ` Chris Dunlop
2023-07-11  0:53 ` rm hanging, v6.1.35 Bagas Sanjaya
2023-07-11  1:13   ` Chris Dunlop
2023-07-11  2:29   ` Dave Chinner