Highly reflinked and fragmented considered harmful?

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* Highly reflinked and fragmented considered harmful?
@ 2022-05-09  2:46 Chris Dunlop
  2022-05-09 23:09 ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Dunlop @ 2022-05-09  2:46 UTC (permalink / raw)
  To: linux-xfs

Hi,

Is it to be expected that removing 29TB of highly reflinked and 
fragmented data could take days, the entire time blocking other tasks 
like "rm" and "df" on the same filesystem?

- is any way to see progress and make an eta estimate?
- would 4700+ processes blocked on the same fs slow things down or they're 
   not relevant?
- with a reboot/remount, does the log replay continue from where it left 
   off, or start again?
- is there anything that might improve the situation in newer kernels?

Some details:

# uname -r
5.15.34-otn-00007-g6bff5dd37abb

# xfs_info /chroot
meta-data=/dev/mapper/vg00-chroot isize=512    agcount=282, agsize=244184192 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=1
          =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=68719475712, imaxpct=1
          =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The fs is sitting on lvm and the underlying block device is a ceph 8+3 
erasure-coded rbd.

This fs was originally 30T and has been expanded to 256T:

   Limits to growth
   https://www.spinics.net/lists/linux-xfs/msg60451.html

And it's been the topic of a few other issues:

   Extreme fragmentation ho!
   https://www.spinics.net/lists/linux-xfs/msg47707.html

   Mysterious ENOSPC
   https://www.spinics.net/lists/linux-xfs/msg55446.html

The story...

I did an "rm -rf" of a directory containing a "du"-indicated 29TB spread 
over maybe 50 files. The data would have been highly reflinked and 
fragmented. A large part of the reflinking would be to files outside the 
dir in question, and I imagine maybe only 2-3TB of data would actually 
be freed by the "rm".

The "rm" completed in less than a second, but an immediately following 
"df" on that fs didn't return, even 60+ hours later(!!).

In the meantime, other background processes where also attempting to do 
various "rm"s (small amounts of non-reflinked data) and "df"s on the 
same fs which were also hanging - building up to an load average of 
4700+(!!) over time. A flame graph showed the kernel with 75% idle and 
3.75% xfsaild as the largest users, and the rest a wide variety of other 
uninteresting stuff. A "sysrq w" showed the tasks blocking in 
xfs_inodegc_flush, e.g.:

May 06 09:49:29 d5 kernel: task:df              state:D stack:    0 pid:32738 ppid:     1 flags:0x00000004
May 06 09:49:29 d5 kernel: Call Trace:
May 06 09:49:29 d5 kernel:  <TASK>
May 06 09:49:29 d5 kernel:  __schedule+0x241/0x740
May 06 09:49:29 d5 kernel:  schedule+0x3a/0xa0
May 06 09:49:29 d5 kernel:  schedule_timeout+0x271/0x310
May 06 09:49:29 d5 kernel:  ? find_held_lock+0x2d/0x90
May 06 09:49:29 d5 kernel:  ? sched_clock_cpu+0x9/0xa0
May 06 09:49:29 d5 kernel:  ? lock_release+0x214/0x350
May 06 09:49:29 d5 kernel:  wait_for_completion+0x7b/0xc0
May 06 09:49:29 d5 kernel:  __flush_work+0x217/0x350
May 06 09:49:29 d5 kernel:  ? flush_workqueue_prep_pwqs+0x120/0x120
May 06 09:49:29 d5 kernel:  ? wait_for_completion+0x1c/0xc0
May 06 09:49:29 d5 kernel:  xfs_inodegc_flush.part.24+0x62/0xc0 [xfs]
May 06 09:49:29 d5 kernel:  xfs_fs_statfs+0x37/0x1a0 [xfs]
May 06 09:49:29 d5 kernel:  statfs_by_dentry+0x3c/0x60
May 06 09:49:29 d5 kernel:  vfs_statfs+0x16/0xd0
May 06 09:49:29 d5 kernel:  user_statfs+0x44/0x80
May 06 09:49:29 d5 kernel:  __do_sys_statfs+0x10/0x30
May 06 09:49:29 d5 kernel:  do_syscall_64+0x34/0x80
May 06 09:49:29 d5 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
May 06 09:49:29 d5 kernel: RIP: 0033:0x7f2adb25ec07
May 06 09:49:29 d5 kernel: RSP: 002b:00007ffe074c81d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
May 06 09:49:29 d5 kernel: RAX: ffffffffffffffda RBX: 00005604e08f93c0 RCX: 00007f2adb25ec07
May 06 09:49:29 d5 kernel: RDX: 00007ffe074c8460 RSI: 00007ffe074c81e0 RDI: 00005604e08f93c0
May 06 09:49:29 d5 kernel: RBP: 00007ffe074c81e0 R08: 00005604e08f9400 R09: 0000000000000000
May 06 09:49:29 d5 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe074c8280
May 06 09:49:29 d5 kernel: R13: 0000000000000000 R14: 00005604e08f9400 R15: 00005604e08f93c0
May 06 09:49:29 d5 kernel:  </TASK>

Iostat showed consistent high %util and high latencies. E.g. a combined 
load average and iostat output at 66 hours after the initial "df" 
started:

            load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s  arq-sz  aqu-sz    await   rwait   wwait %util
07:55:32 4772.3   0.0   0.0   0.2   0.8   0.0  98.8 rbd0    0.0    0.0    0.5   27.9     1.80   530.20   37.46   25.87   911.07   34.67   925.2  97.0
07:55:52 4772.1   0.0   0.0   0.2   0.0   0.0  99.7 rbd0    0.0    0.0    0.9   17.2     3.60   548.20   60.97    7.64   422.22   11.83   443.7  99.2
07:56:12 4772.1   0.0   0.0   0.3   1.9   0.0  97.6 rbd0    0.0    0.5    0.2   46.9     1.00   513.00   21.80   46.06   976.91  304.40   980.5  96.3
07:56:32 4773.1   0.0   0.0   0.3   1.7   0.0  97.8 rbd0    0.0    0.0    0.5   12.9     1.80   306.20   45.97    6.81   508.09    4.67   525.6  57.8
07:56:52 4773.8   0.0   0.0   0.3   1.4   0.0  98.0 rbd0    0.0    0.1    0.7   40.4     2.60   591.80   28.96   32.70   796.60  163.23   806.8  92.2
07:57:12 4774.4   0.1   0.0   0.3   1.8   0.0  97.7 rbd0    0.0    0.3    0.2   43.1     0.80   541.80   25.06   42.13   973.05   15.25   977.5  84.0
07:57:32 4774.8   0.2   0.0   0.7   2.3   0.0  96.5 rbd0    0.0    0.1    0.5   35.2     2.00   496.40   27.92   30.25   847.35    9.00   859.3  85.7
07:57:52 4775.4   0.0   0.0   0.4   1.6   0.0  97.7 rbd0    0.0    0.2    0.7   45.1     2.80   510.80   22.43   42.88   936.33   76.21   949.7  73.6

Apart from the iostat there was no other indication any progress was 
being made.

Eventually I thought that perhaps the excessive numbers of blocked 
processes might be adding to the woes by causing thundering herd 
problems and the like, and decided to reboot the box as the only way to 
remove those tasks, with the understanding the journal to be replayed 
before the fs mount would complete.

There were few more reboots and mount(-attempts) over the next day as I 
was messing around with things - turning off the background processes 
causing the ever-increasing loads, taking lvm snapshots, upgrading to 
the latest stable version of the kernel etc.

I'm now wondering if each mount continues from whence the previous 
mount(-attempt) left off, or does it start processing the replay log 
again from the very beginning?

It's now been 40 hours since the most recent reboot/mount and the mount 
process is still running. The load average is a more reasonable <10, and 
the iostat is similar to above.

Is there any way to see where the log replay is up to, and from there 
make a judgement on how much longer it might take, e.g. something in 
xfs_db, or some eBPF wizardry?

I tried using xfs_logprint but I'm not sure what it's telling me, and 
multiple versions taken 10 minutes apart don't show any differences that 
are obviously "progress" - the only differences are in q:0x[[:xdigit:]]+ 
in the TRANS entries, and in a:0x[[:xdigit:]]+ in the INO entries, e.g.:

t1:
TRANS: tid:0xcf6e0775  #items:50029  trans:0xcf6e0775  q:0x563417886560
...
INO: cnt:2 total:2 a:0x563419435920 len:56 a:0x5634194359c0 len:176
         INODE: #regs:2   ino:0x48000698b  flags:0x1   dsize:0
         CORE inode:
                 magic:IN  mode:0x81a4  ver:3  format:3
                 uid:1952  gid:1952  nlink:0 projid:0x00000000
                 atime:1616462862  mtime:1651077084  ctime:1651634628
                 flushiter:0
                 size:0x0  nblks:0xdbcfc47  exsize:0  nextents:1676364  anextents:0
                 forkoff:0  dmevmask:0x0  dmstate:0  flags:0x0  gen:4176048690
                 flags2 0x2 cowextsize 0x0

t1+10s:
TRANS: tid:0xcf6e0775  #items:50029  trans:0xcf6e0775  q:0x55b5d8815560
...
INO: cnt:2 total:2 a:0x55b5da3c4920 len:56 a:0x55b5da3c49c0 len:176
         INODE: #regs:2   ino:0x48000698b  flags:0x1   dsize:0
         CORE inode:
                 magic:IN  mode:0x81a4  ver:3  format:3
                 uid:1952  gid:1952  nlink:0 projid:0x00000000
                 atime:1616462862  mtime:1651077084  ctime:1651634628
                 flushiter:0
                 size:0x0  nblks:0xdbcfc47  exsize:0  nextents:1676364  anextents:0
                 forkoff:0  dmevmask:0x0  dmstate:0  flags:0x0  gen:4176048690
                 flags2 0x2 cowextsize 0x0


Cheers,

Chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-09  2:46 Highly reflinked and fragmented considered harmful? Chris Dunlop
@ 2022-05-09 23:09 ` Dave Chinner
  2022-05-10  2:55   ` Chris Dunlop
  2022-05-10  4:07   ` Amir Goldstein
  0 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2022-05-09 23:09 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: linux-xfs

On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> Hi,
> 
> Is it to be expected that removing 29TB of highly reflinked and fragmented
> data could take days, the entire time blocking other tasks like "rm" and
> "df" on the same filesystem?
> 
> - is any way to see progress and make an eta estimate?
> - would 4700+ processes blocked on the same fs slow things down or they're
> not relevant?
> - with a reboot/remount, does the log replay continue from where it left
> off, or start again?
> - is there anything that might improve the situation in newer kernels?
> 
> Some details:
> 
> # uname -r
> 5.15.34-otn-00007-g6bff5dd37abb
> 
> # xfs_info /chroot
> meta-data=/dev/mapper/vg00-chroot isize=512    agcount=282, agsize=244184192 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1    bigtime=0 inobtcount=0
> data     =                       bsize=4096   blocks=68719475712, imaxpct=1
>          =                       sunit=128    swidth=512 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> The fs is sitting on lvm and the underlying block device is a ceph 8+3
> erasure-coded rbd.

Ok, so -slow- storage.

> This fs was originally 30T and has been expanded to 256T:
> 
>   Limits to growth
>   https://www.spinics.net/lists/linux-xfs/msg60451.html
> 
> And it's been the topic of a few other issues:
> 
>   Extreme fragmentation ho!
>   https://www.spinics.net/lists/linux-xfs/msg47707.html
> 
>   Mysterious ENOSPC
>   https://www.spinics.net/lists/linux-xfs/msg55446.html
> 
> The story...
> 
> I did an "rm -rf" of a directory containing a "du"-indicated 29TB spread
> over maybe 50 files. The data would have been highly reflinked and
> fragmented. A large part of the reflinking would be to files outside the dir
> in question, and I imagine maybe only 2-3TB of data would actually be freed
> by the "rm".

But it's still got to clean up 29TB of shared extent references.
Assuming worst case reflink extent fragmentation of 4kB filesystem
blocks, 29TB is roughly 7 *billion* references that have to be
cleaned up.

TANSTAAFL.

A single CPU running flat out that isn't limited by log space or
metadata writeback can unshare and free about 80-100k extents a
second. If you are either RAM or IO bound, then you might only be
able to remove a few thousand a second as throughput is limited by
the random metadata writeback rate to free journal space for the
next transaction.

OOOOOOH.

>          =                       crc=1        finobt=1, sparse=1, rmapbt=1

You have rmapbt enabled. That significantly increases the amount of
metadata every reflink operation has to update. Every shared extent
that is removed, the rmap index has to be updated to remove that
back-reference from rmapbt. When there is heavy extent sharing
occurring, the rmaps get *large*. e.g. share a single block 100
times, there is one refcount record for the 100 references. BUt
there are 100 rmap records - one for each of the 100 references to
to the block. IOWs, reflink is efficiently recording reference
counts, but rmap cannot as every owner/offset tuple is different for
each shared extent owner.

Likely that means that the maximum sustainable unshare rate is going
to be halved - so maybe a max of 40-50k reference removals/s

> The "rm" completed in less than a second, but an immediately following "df"
> on that fs didn't return, even 60+ hours later(!!).

At 10k refs/s removed , that's 3hrs/100M or 30hrs/billion references.

At a 1k refs/s removed, that's 3.6 million refs/hr an hour, or
~30hrs/100M refs.

At a 100 refs/s removed, that's 360k million refs/hr an hour, or
~300hrs/100M refs.

And it's entire possible you asked the filesystem to remove a
billion shared references.....

> In the meantime, other background processes where also attempting to do
> various "rm"s (small amounts of non-reflinked data) and "df"s on the same fs
> which were also hanging - building up to an load average of 4700+(!!) over

Nothing wrong with a load average of 4700 - it's common to see that
sort of thing when there is lots of IO operations in flight.  All it
is telling you is that the system has a massive IO backlog.

> time. A flame graph showed the kernel with 75% idle and 3.75% xfsaild as the
> largest users, and the rest a wide variety of other uninteresting stuff. A
> "sysrq w" showed the tasks blocking in xfs_inodegc_flush, e.g.:

Yeah, I've been considering removing that blocking flush for reasons
like this.  That's not your problem, though.

i.e. If the unlink hadn't been put in the background, you'd be
complaining that rm had hung, the system is overloaded, IO bound
and you can't kill the rm. ANd then after reboot, log recovery
caused the same problem....

So what you see really has nothing to do with background inodegc
at all...

> Iostat showed consistent high %util and high latencies. E.g. a combined load
> average and iostat output at 66 hours after the initial "df" started:
> 
>            load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s    r/s    w/s    rkB/s    wkB/s  arq-sz  aqu-sz    await   rwait   wwait %util
> 07:55:32 4772.3   0.0   0.0   0.2   0.8   0.0  98.8 rbd0    0.0    0.0    0.5   27.9     1.80   530.20   37.46   25.87   911.07   34.67   925.2  97.0
> 07:55:52 4772.1   0.0   0.0   0.2   0.0   0.0  99.7 rbd0    0.0    0.0    0.9   17.2     3.60   548.20   60.97    7.64   422.22   11.83   443.7  99.2
> 07:56:12 4772.1   0.0   0.0   0.3   1.9   0.0  97.6 rbd0    0.0    0.5    0.2   46.9     1.00   513.00   21.80   46.06   976.91  304.40   980.5  96.3
> 07:56:32 4773.1   0.0   0.0   0.3   1.7   0.0  97.8 rbd0    0.0    0.0    0.5   12.9     1.80   306.20   45.97    6.81   508.09    4.67   525.6  57.8
> 07:56:52 4773.8   0.0   0.0   0.3   1.4   0.0  98.0 rbd0    0.0    0.1    0.7   40.4     2.60   591.80   28.96   32.70   796.60  163.23   806.8  92.2
> 07:57:12 4774.4   0.1   0.0   0.3   1.8   0.0  97.7 rbd0    0.0    0.3    0.2   43.1     0.80   541.80   25.06   42.13   973.05   15.25   977.5  84.0
> 07:57:32 4774.8   0.2   0.0   0.7   2.3   0.0  96.5 rbd0    0.0    0.1    0.5   35.2     2.00   496.40   27.92   30.25   847.35    9.00   859.3  85.7
> 07:57:52 4775.4   0.0   0.0   0.4   1.6   0.0  97.7 rbd0    0.0    0.2    0.7   45.1     2.80   510.80   22.43   42.88   936.33   76.21   949.7  73.6

Yup, the system is IO bound on 4kB random write IO, managing a
healthy total of about _200 IOPS_ and less than 4MB/s to disk.
In short, I don't think you system is even managing 100 reference
removals a second - maybe 50 refs/s are being removed because it is
completely IO bound on dirty metadata writeback...

IOWs, the problem here is that  you asked the filesystem to perform
*billions* of update operations by running that rm -rf command and
your storage simply isn't up to performing such operations.

What reflink giveth you, reflink taketh away.

> Apart from the iostat there was no other indication any progress was being
> made.
> 
> Eventually I thought that perhaps the excessive numbers of blocked processes
> might be adding to the woes by causing thundering herd problems and the
> like, and decided to reboot the box as the only way to remove those tasks,
> with the understanding the journal to be replayed before the fs mount would
> complete.

Nope, I'd put my house on it not completing journal replay....

> There were few more reboots and mount(-attempts) over the next day as I was
> messing around with things - turning off the background processes causing
> the ever-increasing loads, taking lvm snapshots, upgrading to the latest
> stable version of the kernel etc.
> 
> I'm now wondering if each mount continues from whence the previous
> mount(-attempt) left off, or does it start processing the replay log again
> from the very beginning?

.... because log recovery runs garbage collection of unlinked
inodes, which was what was running when you rebooted the system.
Only now, you can't access the filesystem at all because it won't
complete the mount process....

> It's now been 40 hours since the most recent reboot/mount and the mount
> process is still running. The load average is a more reasonable <10, and the
> iostat is similar to above.

Yup, because there's only hte background inodegc workers operating
on the filesystem and you don't have other processes trying to do
data IO on it as it hasn't finished mounting yet. The filesystem is
doing the same amount of work, making the same progress, yet the
load average is only 10.

Load average is a very poor measure of the load that is actually on
the system......

> Is there any way to see where the log replay is up to, and from there make a
> judgement on how much longer it might take, e.g. something in xfs_db, or
> some eBPF wizardry?

Not really. tracepoints can tell you the operations going on, and if
you know enough about the internals of XFS they might tell you how
many extents still remain on the inodes to remove....

> I tried using xfs_logprint but I'm not sure what it's telling me, and
> multiple versions taken 10 minutes apart don't show any differences that are
> obviously "progress" - the only differences are in q:0x[[:xdigit:]]+ in the
> TRANS entries, and in a:0x[[:xdigit:]]+ in the INO entries, e.g.:

logprint just dumps the active transactions in the log - the ones
that are waiting on metadata writeback to move the tail of the log
forwards and retire the transaction.

What you see here:

> t1:
> TRANS: tid:0xcf6e0775  #items:50029  trans:0xcf6e0775  q:0x563417886560
> ...
> t1+10s:
> TRANS: tid:0xcf6e0775  #items:50029  trans:0xcf6e0775  q:0x55b5d8815560

Is the same transaction in the journal. all that changed was the
"a:" fields which are the memory address of the region buffer in
logprint and have nothing to do with what is in the journal itself.
It's largely logprint debug information.

IOWs, all this tells us is that the tail of the log is not moving
very fast, which tallies with the really slow storage doing random
writes and being unable to move the tail of the log forwards very
quickly.

Indeed, let me put that in context - this transaction has 50,000
items in it. THat's going to take at _least_ 10,000 IOs to write
back all that metadata. at maybe 100 IOs/s, just clearing this
transaction to free up the log space (maybe 32MB of journal space
out of 2GB) will take at least 2 minutes.....

And the rest of the journal will have a similar density of items to
be flushed, so 2GB/32MB = 128 * 2 minutes = 4 hours. SO even if the
unlinks stopped right now, there's a backlog of at least 4 hours
worth of dirty metadata to flush to disk. Either unmount of log
recovery would take this long to run at that point.

Also, that inode had 1.6 million extents and mapped just under 1TB
of data. At 100 refs/s being removed, that's 16,000 seconds to clean
up that inode. If that is representative of the other files, then 50
x 16,0000 is roughly 10 days....

At some point, you have to pay the price of creating billions of
random fine-grained cross references in tens of TBs of data spread
across weeks and months of production. You don't notice the scale of
the cross-reference because it's taken weeks and months of normal
operations to get there. It's only when you finally have to perform
an operation that needs to iterate all those references that the
scale suddenly becomes apparent. XFS scales to really large numbers
without significant degradation, so people don't notice things like
object counts or cross references until something like this
happens.

I don't think there's much we can do at the filesystem level to help
you at this point - the inode output in the transaction dump above
indicates that you haven't been using extent size hints to limit
fragmentation or extent share/COW sizes, so the damage is already
present and we can't really do anything to fix that up.

I'm also not sure that xfs_repair will help at this point because it
has to pull the entire filesystem metadata into memory.  I suspect
that you don't have a couple of TB of RAM or really fast SSD swap
space to throw at that problem (might need both!). I've seen repair
take 3 weeks and 250GB of RAM to resolve a corrupted 14TB hardlink
farm containing a couple of billion hardlinks on similar slow RAID 6
storage. I don't expect repair will be any faster resolving a couple
of billion reflink and rmap references, but I do expect that to
require a whole lot more memory than hardlink resolution. repair
will garbage collect the unlinked inodes without any additional
overhead, but the baseline overhead is likely prohibitive....

Another option might be to get a fast 1-2TB NVMe SSD (or two or 3)
and build a cache teir over the top of the ceph rbds using dm-cache.
That way the cache teir can suck up all the small random writes and
the garbage collection will run much faster. When it is done, run a
fstrim on the filesystem to clear all the previously written but now
free space from the cache device, then flush the cache device back
to storage. That last flush will likely take a long, long time
because it will all be small random writes, but should be many less
than what the filesystem needed to run to clean up the billions of
cross references the reflinks created....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-09 23:09 ` Dave Chinner
@ 2022-05-10  2:55   ` Chris Dunlop
  2022-05-10  5:14     ` Darrick J. Wong
  2022-05-10  4:07   ` Amir Goldstein
  1 sibling, 1 reply; 16+ messages in thread
From: Chris Dunlop @ 2022-05-10  2:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi Dave,

On Tue, May 10, 2022 at 09:09:18AM +1000, Dave Chinner wrote:
> On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
>> Is it to be expected that removing 29TB of highly reflinked and fragmented
>> data could take days, the entire time blocking other tasks like "rm" and
>> "df" on the same filesystem?
...
> At some point, you have to pay the price of creating billions of
> random fine-grained cross references in tens of TBs of data spread
> across weeks and months of production. You don't notice the scale of
> the cross-reference because it's taken weeks and months of normal
> operations to get there. It's only when you finally have to perform
> an operation that needs to iterate all those references that the
> scale suddenly becomes apparent. XFS scales to really large numbers
> without significant degradation, so people don't notice things like
> object counts or cross references until something like this
> happens.
>
> I don't think there's much we can do at the filesystem level to help
> you at this point - the inode output in the transaction dump above
> indicates that you haven't been using extent size hints to limit
> fragmentation or extent share/COW sizes, so the damage is already
> present and we can't really do anything to fix that up.

Thanks for taking the time to provide a detailed and informative
exposition, it certainly helps me understand what I'm asking of the fs, 
the areas that deserve more attention, and how to approach analyzing the 
situation.

At this point I'm about 3 days from completing copying the data (from a 
snapshot of the troubled fs mounted with 'norecovery') over to a brand new 
fs. Unfortunately the new fs is also rmapbt=1 so I'll go through all the 
copying again (under more controlled circumstances) to get onto a rmapbt=0 
fs (losing the ability to do online repairs whenever that arrives - 
hopefully that won't come back to haunt me).

Out of interest:

>> - with a reboot/remount, does the log replay continue from where it left
>> off, or start again?

Sorry, if you provided an answer to this, I didn't understand it.

Basically the question is, if a recovery on mount were going to take 10 
hours, but the box rebooted and fs mounted again at 8 hours, would the 
recovery this time take 2 hours or once again 10 hours?

Cheers,

Chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-10  2:55   ` Chris Dunlop
@ 2022-05-10  5:14     ` Darrick J. Wong
  0 siblings, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2022-05-10  5:14 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Dave Chinner, linux-xfs

On Tue, May 10, 2022 at 12:55:41PM +1000, Chris Dunlop wrote:
> Hi Dave,
> 
> On Tue, May 10, 2022 at 09:09:18AM +1000, Dave Chinner wrote:
> > On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> > > Is it to be expected that removing 29TB of highly reflinked and fragmented
> > > data could take days, the entire time blocking other tasks like "rm" and
> > > "df" on the same filesystem?
> ...
> > At some point, you have to pay the price of creating billions of
> > random fine-grained cross references in tens of TBs of data spread
> > across weeks and months of production. You don't notice the scale of
> > the cross-reference because it's taken weeks and months of normal
> > operations to get there. It's only when you finally have to perform
> > an operation that needs to iterate all those references that the
> > scale suddenly becomes apparent. XFS scales to really large numbers
> > without significant degradation, so people don't notice things like
> > object counts or cross references until something like this
> > happens.
> > 
> > I don't think there's much we can do at the filesystem level to help
> > you at this point - the inode output in the transaction dump above
> > indicates that you haven't been using extent size hints to limit
> > fragmentation or extent share/COW sizes, so the damage is already
> > present and we can't really do anything to fix that up.
> 
> Thanks for taking the time to provide a detailed and informative
> exposition, it certainly helps me understand what I'm asking of the fs, the
> areas that deserve more attention, and how to approach analyzing the
> situation.
> 
> At this point I'm about 3 days from completing copying the data (from a
> snapshot of the troubled fs mounted with 'norecovery') over to a brand new
> fs. Unfortunately the new fs is also rmapbt=1 so I'll go through all the
> copying again (under more controlled circumstances) to get onto a rmapbt=0
> fs (losing the ability to do online repairs whenever that arrives -
> hopefully that won't come back to haunt me).

Hmm.  Were most of the stuck processes running xfs_inodegc_flush?  Maybe
we should try to switch that to something that will stop waiting after
30s, since most of the (non-fsfreeze) callers don't actually *require*
that the work actually finish, they're just trying to return accurate
space accounting to userspace.

> Out of interest:
> 
> > > - with a reboot/remount, does the log replay continue from where it left
> > > off, or start again?
> 
> Sorry, if you provided an answer to this, I didn't understand it.
> 
> Basically the question is, if a recovery on mount were going to take 10
> hours, but the box rebooted and fs mounted again at 8 hours, would the
> recovery this time take 2 hours or once again 10 hours?

In theory yes, it'll restart where it left off, but if 10 seconds go by
and the extent count *hasn't changed* then yikes did we spend that
entire time doing refcount btree updates??

--D

> Cheers,
> 
> Chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-09 23:09 ` Dave Chinner
  2022-05-10  2:55   ` Chris Dunlop
@ 2022-05-10  4:07   ` Amir Goldstein
  2022-05-10  5:10     ` Darrick J. Wong
  1 sibling, 1 reply; 16+ messages in thread
From: Amir Goldstein @ 2022-05-10  4:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chris Dunlop, linux-xfs

On Tue, May 10, 2022 at 2:25 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> > Hi,
> >
> > Is it to be expected that removing 29TB of highly reflinked and fragmented
> > data could take days, the entire time blocking other tasks like "rm" and
> > "df" on the same filesystem?
> >
[...]
> > The story...
> >
> > I did an "rm -rf" of a directory containing a "du"-indicated 29TB spread
> > over maybe 50 files. The data would have been highly reflinked and
> > fragmented. A large part of the reflinking would be to files outside the dir
> > in question, and I imagine maybe only 2-3TB of data would actually be freed
> > by the "rm".
>
> But it's still got to clean up 29TB of shared extent references.
> Assuming worst case reflink extent fragmentation of 4kB filesystem
> blocks, 29TB is roughly 7 *billion* references that have to be
> cleaned up.
>
> TANSTAAFL.
>
[...]
>
> IOWs, the problem here is that  you asked the filesystem to perform
> *billions* of update operations by running that rm -rf command and
> your storage simply isn't up to performing such operations.
>
> What reflink giveth you, reflink taketh away.
>

When I read this story, it reads like the filesystem is to blame and
not the user.

First of all, the user did not "ask the filesystem to perform
*billions* of updates",
the user asked the filesystem to remove 50 huge files.

End users do not have to understand how filesystem unlink operation works.
But even if we agree that the user "asked the filesystem to perform *billions*
of updates" (as is the same with rm -rf of billions of files), If the
filesystem says
"ok I'll do it" and hogs the system for 10 days,
there might be something wrong with the system, not with the user.

Linux grew dirty page throttling for the same reason - so we can stop blaming
the users who copied the movie to their USB pen drive for their system getting
stuck.

This incident sounds like a very serious problem - the sort of problem that
makes users leave a filesystem with a door slam, never come back and
start tweeting about how awful fs X is.

And most users won't even try to analyse the situation as Chris did and
write about it to xfs list before starting to tweet.

From a product POV, I think what should have happened here is that
freeing up the space would have taken 10 days in the background, but
otherwise, filesystem should not have been blocking other processes
for long periods of time.

Of course, it would have been nice if there was a friendly user interface
to notify users of background cg work progress.

All this is much easier said than done, but that does not make it less true.

Can we do anything to throttle background cg work to the point that it
has less catastrophic effect on end users? Perhaps limit the amount of
journal credits allowed to be consumed by gc work? so "foreground"
operations will be less likely to hang?

I am willing to take a swing at it, if you point me at the right direction.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-10  4:07   ` Amir Goldstein
@ 2022-05-10  5:10     ` Darrick J. Wong
  2022-05-10  6:30       ` Chris Dunlop
  0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2022-05-10  5:10 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Dave Chinner, Chris Dunlop, linux-xfs

On Tue, May 10, 2022 at 07:07:35AM +0300, Amir Goldstein wrote:
> On Tue, May 10, 2022 at 2:25 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> > > Hi,
> > >
> > > Is it to be expected that removing 29TB of highly reflinked and fragmented
> > > data could take days, the entire time blocking other tasks like "rm" and
> > > "df" on the same filesystem?
> > >
> [...]
> > > The story...
> > >
> > > I did an "rm -rf" of a directory containing a "du"-indicated 29TB spread
> > > over maybe 50 files. The data would have been highly reflinked and
> > > fragmented. A large part of the reflinking would be to files outside the dir
> > > in question, and I imagine maybe only 2-3TB of data would actually be freed
> > > by the "rm".
> >
> > But it's still got to clean up 29TB of shared extent references.
> > Assuming worst case reflink extent fragmentation of 4kB filesystem
> > blocks, 29TB is roughly 7 *billion* references that have to be
> > cleaned up.
> >
> > TANSTAAFL.
> >
> [...]
> >
> > IOWs, the problem here is that  you asked the filesystem to perform
> > *billions* of update operations by running that rm -rf command and
> > your storage simply isn't up to performing such operations.
> >
> > What reflink giveth you, reflink taketh away.

And here I was expecting "...and rmapbt taketh away." ;)

> When I read this story, it reads like the filesystem is to blame and
> not the user.
> 
> First of all, the user did not "ask the filesystem to perform
> *billions* of updates",
> the user asked the filesystem to remove 50 huge files.
> 
> End users do not have to understand how filesystem unlink operation works.
> But even if we agree that the user "asked the filesystem to perform *billions*
> of updates" (as is the same with rm -rf of billions of files), If the
> filesystem says
> "ok I'll do it" and hogs the system for 10 days,
> there might be something wrong with the system, not with the user.
> 
> Linux grew dirty page throttling for the same reason - so we can stop blaming
> the users who copied the movie to their USB pen drive for their system getting
> stuck.

(Is the default dirty ratio still 20% of DRAM?)

> This incident sounds like a very serious problem - the sort of problem that
> makes users leave a filesystem with a door slam, never come back and
> start tweeting about how awful fs X is.
> 
> And most users won't even try to analyse the situation as Chris did and
> write about it to xfs list before starting to tweet.
> 
> From a product POV, I think what should have happened here is that
> freeing up the space would have taken 10 days in the background, but
> otherwise, filesystem should not have been blocking other processes
> for long periods of time.

Indeed.  Chris, do you happen to have the sysrq-w output handy?  I'm
curious if the stall warning backtraces all had xfs_inodegc_flush() in
them, or were there other parts of the system stalling elsewhere too?
50 billion updates is a lot, but there shouldn't be stall warnings.

The one you pasted into your message is an ugly wart of the background
inode gc code -- statfs (and getquota with root dqid) are slow-path
summary counter reporting system calls, so they call flush_workqueue to
make sure the background workers have collected *all* the garbage.

I bet, however, that you and everyone else would rather have somewhat
inaccurate results than a load average of 4700 and a dead machine.

What I really want is flush_workqueue_timeout(), where we kick the
workers and wait some amount of time, where the amount is a large number
(say hangcheck_timeout-5) if we're near ENOSPC and a small one if not.
IOWS, we'll try to have statfs return reasonably accurate results, but
if it takes too long we'll get impatient and just return what we have.

> Of course, it would have been nice if there was a friendly user interface
> to notify users of background cg work progress.
> 
> All this is much easier said than done, but that does not make it less true.
> 
> Can we do anything to throttle background cg work to the point that it
> has less catastrophic effect on end users? Perhaps limit the amount of
> journal credits allowed to be consumed by gc work? so "foreground"
> operations will be less likely to hang?

...that said, if foreground writers are also stalling unacceptably,
then we ought to throttle the background too.  Though that makes the
"stuck in statfs" problem worse.  But I'd want to know that foreground
threads are getting crushed before I started fiddling with that.

--D

> I am willing to take a swing at it, if you point me at the right direction.
> 
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-10  5:10     ` Darrick J. Wong
@ 2022-05-10  6:30       ` Chris Dunlop
  2022-05-10  8:16         ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Dunlop @ 2022-05-10  6:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Amir Goldstein, Dave Chinner, linux-xfs

On Mon, May 09, 2022 at 10:10:57PM -0700, Darrick J. Wong wrote:
> On Tue, May 10, 2022 at 07:07:35AM +0300, Amir Goldstein wrote:
>> On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
>>> Is it to be expected that removing 29TB of highly reflinked and fragmented
>>> data could take days, the entire time blocking other tasks like "rm" and
>>> "df" on the same filesystem?
...
>> From a product POV, I think what should have happened here is that
>> freeing up the space would have taken 10 days in the background, but
>> otherwise, filesystem should not have been blocking other processes
>> for long periods of time.
>
> Indeed.  Chris, do you happen to have the sysrq-w output handy?  I'm
> curious if the stall warning backtraces all had xfs_inodegc_flush() in
> them, or were there other parts of the system stalling elsewhere too?
> 50 billion updates is a lot, but there shouldn't be stall warnings.

Sure: https://file.io/25za5BNBlnU8  (6.8M)

Of the 3677 tasks in there, only 38 do NOT show xfs_inodegc_flush().

> I bet, however, that you and everyone else would rather have somewhat
> inaccurate results than a load average of 4700 and a dead machine.

Yes, that would have been a nicer result.

Chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-10  6:30       ` Chris Dunlop
@ 2022-05-10  8:16         ` Dave Chinner
  2022-05-10 19:19           ` Darrick J. Wong
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2022-05-10  8:16 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs

On Tue, May 10, 2022 at 04:30:51PM +1000, Chris Dunlop wrote:
> On Mon, May 09, 2022 at 10:10:57PM -0700, Darrick J. Wong wrote:
> > On Tue, May 10, 2022 at 07:07:35AM +0300, Amir Goldstein wrote:
> > > On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> > > > Is it to be expected that removing 29TB of highly reflinked and fragmented
> > > > data could take days, the entire time blocking other tasks like "rm" and
> > > > "df" on the same filesystem?
> ...
> > > From a product POV, I think what should have happened here is that
> > > freeing up the space would have taken 10 days in the background, but
> > > otherwise, filesystem should not have been blocking other processes
> > > for long periods of time.

Sure, that's obvious, and without looking at the code I know what
that is: statfs()

> > Indeed.  Chris, do you happen to have the sysrq-w output handy?  I'm
> > curious if the stall warning backtraces all had xfs_inodegc_flush() in
> > them, or were there other parts of the system stalling elsewhere too?
> > 50 billion updates is a lot, but there shouldn't be stall warnings.
> 
> Sure: https://file.io/25za5BNBlnU8  (6.8M)
> 
> Of the 3677 tasks in there, only 38 do NOT show xfs_inodegc_flush().

yup, 3677 tasks in statfs(). The majority are rm, df, and check_disk
processes, there's a couple of veeamagent processes stuck and
an (ostnamed) process, whatever that is...

No real surprises there, and it's not why the filesystem is taking
so long to remove all the reflink references.

There is just one background inodegc worker thread running, so
there's no excessive load being generated by inodegc, either. It's
no different to a single rm running xfs_inactive() directly on a
slightly older kernel and filling the journal:

May 06 09:49:01 d5 kernel: task:kworker/6:1     state:D stack:    0 pid:23258 ppid:     2 flags:0x00004000
May 06 09:49:01 d5 kernel: Workqueue: xfs-inodegc/dm-1 xfs_inodegc_worker [xfs]
May 06 09:49:01 d5 kernel: Call Trace:
May 06 09:49:01 d5 kernel:  <TASK>
May 06 09:49:01 d5 kernel:  __schedule+0x241/0x740
May 06 09:49:01 d5 kernel:  schedule+0x3a/0xa0
May 06 09:49:01 d5 kernel:  schedule_timeout+0x271/0x310
May 06 09:49:01 d5 kernel:  __down+0x6c/0xa0
May 06 09:49:01 d5 kernel:  down+0x3b/0x50
May 06 09:49:01 d5 kernel:  xfs_buf_lock+0x40/0xe0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_buf_find.isra.32+0x3ee/0x730 [xfs]
May 06 09:49:01 d5 kernel:  xfs_buf_get_map+0x3c/0x430 [xfs]
May 06 09:49:01 d5 kernel:  xfs_buf_read_map+0x37/0x2c0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_trans_read_buf_map+0x1cb/0x300 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_read_buf_block.constprop.40+0x75/0xb0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_lookup_get_block+0x85/0x150 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_overlapped_query_range+0x33c/0x3c0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_query_range+0xd5/0x100 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_query_range+0x71/0x80 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_lookup_le_range+0x88/0x180 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_unmap_shared+0x89/0x560 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_finish_one+0x201/0x260 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_update_finish_item+0x33/0x60 [xfs]
May 06 09:49:01 d5 kernel:  xfs_defer_finish_noroll+0x215/0x5a0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_defer_finish+0x13/0x70 [xfs]
May 06 09:49:01 d5 kernel:  xfs_itruncate_extents_flags+0xc4/0x240 [xfs]
May 06 09:49:01 d5 kernel:  xfs_inactive_truncate+0x7f/0xc0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_inactive+0x10c/0x130 [xfs]
May 06 09:49:01 d5 kernel:  xfs_inodegc_worker+0xb5/0x140 [xfs]
May 06 09:49:01 d5 kernel:  process_one_work+0x2a8/0x4c0
May 06 09:49:01 d5 kernel:  worker_thread+0x21b/0x3c0
May 06 09:49:01 d5 kernel:  kthread+0x121/0x140

There are 18 tasks in destroy_inode() blocked on a workqueue flush
- these are new unlinks that are getting throttled because that
per-cpu inodegc queue is full and work is ongoing. Not a huge
deal, maybe we should look to hand full queues to another CPU if
the neighbour CPU has an empty queue. That would reduce
unnecessary throttling, though it may mean more long running
inodegc processes in the background....

Really, though, I don't see anything deadlocked, just a system
backed up doing a really large amount of metadata modification.
Everything is sitting on throttles or waiting on IO and making
slow forwards progress as metadata writeback allows log space to be
freed.

I think we should just accept that statfs() can never really report
exactly accurate space usagei to userspace and get rid of the flush.
Work around the specific fstests dependencies on rm always
immediately making unlinked inodes free space in fstests rather than
in the kernel code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-10  8:16         ` Dave Chinner
@ 2022-05-10 19:19           ` Darrick J. Wong
  2022-05-10 21:54             ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2022-05-10 19:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chris Dunlop, Amir Goldstein, linux-xfs

On Tue, May 10, 2022 at 06:16:32PM +1000, Dave Chinner wrote:
> On Tue, May 10, 2022 at 04:30:51PM +1000, Chris Dunlop wrote:
> > On Mon, May 09, 2022 at 10:10:57PM -0700, Darrick J. Wong wrote:
> > > On Tue, May 10, 2022 at 07:07:35AM +0300, Amir Goldstein wrote:
> > > > On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> > > > > Is it to be expected that removing 29TB of highly reflinked and fragmented
> > > > > data could take days, the entire time blocking other tasks like "rm" and
> > > > > "df" on the same filesystem?
> > ...
> > > > From a product POV, I think what should have happened here is that
> > > > freeing up the space would have taken 10 days in the background, but
> > > > otherwise, filesystem should not have been blocking other processes
> > > > for long periods of time.
> 
> Sure, that's obvious, and without looking at the code I know what
> that is: statfs()
> 
> > > Indeed.  Chris, do you happen to have the sysrq-w output handy?  I'm
> > > curious if the stall warning backtraces all had xfs_inodegc_flush() in
> > > them, or were there other parts of the system stalling elsewhere too?
> > > 50 billion updates is a lot, but there shouldn't be stall warnings.
> > 
> > Sure: https://file.io/25za5BNBlnU8  (6.8M)
> > 
> > Of the 3677 tasks in there, only 38 do NOT show xfs_inodegc_flush().
> 
> yup, 3677 tasks in statfs(). The majority are rm, df, and check_disk
> processes, there's a couple of veeamagent processes stuck and
> an (ostnamed) process, whatever that is...
> 
> No real surprises there, and it's not why the filesystem is taking
> so long to remove all the reflink references.
> 
> There is just one background inodegc worker thread running, so
> there's no excessive load being generated by inodegc, either. It's
> no different to a single rm running xfs_inactive() directly on a
> slightly older kernel and filling the journal:
> 
> May 06 09:49:01 d5 kernel: task:kworker/6:1     state:D stack:    0 pid:23258 ppid:     2 flags:0x00004000
> May 06 09:49:01 d5 kernel: Workqueue: xfs-inodegc/dm-1 xfs_inodegc_worker [xfs]
> May 06 09:49:01 d5 kernel: Call Trace:
> May 06 09:49:01 d5 kernel:  <TASK>
> May 06 09:49:01 d5 kernel:  __schedule+0x241/0x740
> May 06 09:49:01 d5 kernel:  schedule+0x3a/0xa0
> May 06 09:49:01 d5 kernel:  schedule_timeout+0x271/0x310
> May 06 09:49:01 d5 kernel:  __down+0x6c/0xa0
> May 06 09:49:01 d5 kernel:  down+0x3b/0x50
> May 06 09:49:01 d5 kernel:  xfs_buf_lock+0x40/0xe0 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_buf_find.isra.32+0x3ee/0x730 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_buf_get_map+0x3c/0x430 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_buf_read_map+0x37/0x2c0 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_trans_read_buf_map+0x1cb/0x300 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_btree_read_buf_block.constprop.40+0x75/0xb0 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_btree_lookup_get_block+0x85/0x150 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_btree_overlapped_query_range+0x33c/0x3c0 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_btree_query_range+0xd5/0x100 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_rmap_query_range+0x71/0x80 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_rmap_lookup_le_range+0x88/0x180 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_rmap_unmap_shared+0x89/0x560 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_rmap_finish_one+0x201/0x260 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_rmap_update_finish_item+0x33/0x60 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_defer_finish_noroll+0x215/0x5a0 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_defer_finish+0x13/0x70 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_itruncate_extents_flags+0xc4/0x240 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_inactive_truncate+0x7f/0xc0 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_inactive+0x10c/0x130 [xfs]
> May 06 09:49:01 d5 kernel:  xfs_inodegc_worker+0xb5/0x140 [xfs]
> May 06 09:49:01 d5 kernel:  process_one_work+0x2a8/0x4c0
> May 06 09:49:01 d5 kernel:  worker_thread+0x21b/0x3c0
> May 06 09:49:01 d5 kernel:  kthread+0x121/0x140
> 
> There are 18 tasks in destroy_inode() blocked on a workqueue flush
> - these are new unlinks that are getting throttled because that
> per-cpu inodegc queue is full and work is ongoing. Not a huge
> deal, maybe we should look to hand full queues to another CPU if
> the neighbour CPU has an empty queue. That would reduce
> unnecessary throttling, though it may mean more long running
> inodegc processes in the background....
> 
> Really, though, I don't see anything deadlocked, just a system
> backed up doing a really large amount of metadata modification.
> Everything is sitting on throttles or waiting on IO and making
> slow forwards progress as metadata writeback allows log space to be
> freed.
> 
> I think we should just accept that statfs() can never really report
> exactly accurate space usagei to userspace and get rid of the flush.

What about all the code that flushes gc work when we hit ENOSPC/EDQUOT?
Do we let those threads stall too because the fs is out of resources and
they can just wait?  Or would that also be better off with a flush
timeout and userspace can just eat the ENOSPC/EDQUOT after 30 seconds?

> Work around the specific fstests dependencies on rm always
> immediately making unlinked inodes free space in fstests rather than
> in the kernel code.

<grumble> I *really* don't want to launch *yet another* full scale audit
of XFS + supporting software behavior on tiny and/or full filesystems.
If someone else wants to take that on then please do with my blessings,
but I haven't even succeeded at landing all the patches resulting from
... whichever month it was where you suggested disallowing those tiny
filesystems.

(I didn't even bother sending the "disallow < 300MB filesystems" series
because there were 30+ patches of bugfixes that have been backing up in
my tree forever now, and I figured that was more than plenty for 5.18.)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-10 19:19           ` Darrick J. Wong
@ 2022-05-10 21:54             ` Dave Chinner
  2022-05-11  0:37               ` Darrick J. Wong
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2022-05-10 21:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chris Dunlop, Amir Goldstein, linux-xfs

On Tue, May 10, 2022 at 12:19:18PM -0700, Darrick J. Wong wrote:
> On Tue, May 10, 2022 at 06:16:32PM +1000, Dave Chinner wrote:
> > On Tue, May 10, 2022 at 04:30:51PM +1000, Chris Dunlop wrote:
> > > On Mon, May 09, 2022 at 10:10:57PM -0700, Darrick J. Wong wrote:
> > > > On Tue, May 10, 2022 at 07:07:35AM +0300, Amir Goldstein wrote:
> > > > > On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> > > > > > Is it to be expected that removing 29TB of highly reflinked and fragmented
> > > > > > data could take days, the entire time blocking other tasks like "rm" and
> > > > > > "df" on the same filesystem?
> > > ...
> > > > > From a product POV, I think what should have happened here is that
> > > > > freeing up the space would have taken 10 days in the background, but
> > > > > otherwise, filesystem should not have been blocking other processes
> > > > > for long periods of time.
> > 
> > Sure, that's obvious, and without looking at the code I know what
> > that is: statfs()
> > 
> > > > Indeed.  Chris, do you happen to have the sysrq-w output handy?  I'm
> > > > curious if the stall warning backtraces all had xfs_inodegc_flush() in
> > > > them, or were there other parts of the system stalling elsewhere too?
> > > > 50 billion updates is a lot, but there shouldn't be stall warnings.
> > > 
> > > Sure: https://file.io/25za5BNBlnU8  (6.8M)
> > > 
> > > Of the 3677 tasks in there, only 38 do NOT show xfs_inodegc_flush().
> > 
> > yup, 3677 tasks in statfs(). The majority are rm, df, and check_disk
> > processes, there's a couple of veeamagent processes stuck and
> > an (ostnamed) process, whatever that is...
> > 
> > No real surprises there, and it's not why the filesystem is taking
> > so long to remove all the reflink references.
> > 
> > There is just one background inodegc worker thread running, so
> > there's no excessive load being generated by inodegc, either. It's
> > no different to a single rm running xfs_inactive() directly on a
> > slightly older kernel and filling the journal:
> > 
> > May 06 09:49:01 d5 kernel: task:kworker/6:1     state:D stack:    0 pid:23258 ppid:     2 flags:0x00004000
> > May 06 09:49:01 d5 kernel: Workqueue: xfs-inodegc/dm-1 xfs_inodegc_worker [xfs]
> > May 06 09:49:01 d5 kernel: Call Trace:
> > May 06 09:49:01 d5 kernel:  <TASK>
> > May 06 09:49:01 d5 kernel:  __schedule+0x241/0x740
> > May 06 09:49:01 d5 kernel:  schedule+0x3a/0xa0
> > May 06 09:49:01 d5 kernel:  schedule_timeout+0x271/0x310
> > May 06 09:49:01 d5 kernel:  __down+0x6c/0xa0
> > May 06 09:49:01 d5 kernel:  down+0x3b/0x50
> > May 06 09:49:01 d5 kernel:  xfs_buf_lock+0x40/0xe0 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_buf_find.isra.32+0x3ee/0x730 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_buf_get_map+0x3c/0x430 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_buf_read_map+0x37/0x2c0 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_trans_read_buf_map+0x1cb/0x300 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_btree_read_buf_block.constprop.40+0x75/0xb0 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_btree_lookup_get_block+0x85/0x150 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_btree_overlapped_query_range+0x33c/0x3c0 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_btree_query_range+0xd5/0x100 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_rmap_query_range+0x71/0x80 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_rmap_lookup_le_range+0x88/0x180 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_rmap_unmap_shared+0x89/0x560 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_rmap_finish_one+0x201/0x260 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_rmap_update_finish_item+0x33/0x60 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_defer_finish_noroll+0x215/0x5a0 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_defer_finish+0x13/0x70 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_itruncate_extents_flags+0xc4/0x240 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_inactive_truncate+0x7f/0xc0 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_inactive+0x10c/0x130 [xfs]
> > May 06 09:49:01 d5 kernel:  xfs_inodegc_worker+0xb5/0x140 [xfs]
> > May 06 09:49:01 d5 kernel:  process_one_work+0x2a8/0x4c0
> > May 06 09:49:01 d5 kernel:  worker_thread+0x21b/0x3c0
> > May 06 09:49:01 d5 kernel:  kthread+0x121/0x140
> > 
> > There are 18 tasks in destroy_inode() blocked on a workqueue flush
> > - these are new unlinks that are getting throttled because that
> > per-cpu inodegc queue is full and work is ongoing. Not a huge
> > deal, maybe we should look to hand full queues to another CPU if
> > the neighbour CPU has an empty queue. That would reduce
> > unnecessary throttling, though it may mean more long running
> > inodegc processes in the background....
> > 
> > Really, though, I don't see anything deadlocked, just a system
> > backed up doing a really large amount of metadata modification.
> > Everything is sitting on throttles or waiting on IO and making
> > slow forwards progress as metadata writeback allows log space to be
> > freed.
> > 
> > I think we should just accept that statfs() can never really report
> > exactly accurate space usagei to userspace and get rid of the flush.
> 
> What about all the code that flushes gc work when we hit ENOSPC/EDQUOT?
> Do we let those threads stall too because the fs is out of resources and
> they can just wait?  Or would that also be better off with a flush
> timeout and userspace can just eat the ENOSPC/EDQUOT after 30 seconds?

1. Not an immediate problem we need to solve.
2. flush at enospc/edquot is best effort, so doesn't need to block
   waiting on inodegc. the enospc/edquot flush will get repeated
   soon enough, so that will take into account progress made by
   long running inodegc ops.
3. we leave pending work on the per-cpu queues under the
   flush/throttle thresholds indefinitely.
4. to be accurate, statfs() needs to flush #3.
5. While working on the rework of inode reclaimation, I converted the
   inodegc queues to use delayed works to ensure work starts on
   per-cpu queues within 10ms of queueing to avoid #3 causing
   problems.
6. I added a non-blocking async flush mechanism that works by
   modifying the queue timer to 0 to trigger immedate work
   scheduling for anything that hasn't been run.

#4 is the problem that Chris hit - we're trying to be perfectly
accurate when perfect accuracy is impossible and we're paying the
price for that.

Fix #3, and the need to block statfs() largely goes away. And as per
#2, we don't really need to block edquot/enospc flushes, either. We
jsut need to push the queues to make sure they are running and doing
work...

> > Work around the specific fstests dependencies on rm always
> > immediately making unlinked inodes free space in fstests rather than
> > in the kernel code.
> 
> <grumble> I *really* don't want to launch *yet another* full scale audit
> of XFS + supporting software behavior on tiny and/or full filesystems.

This doesn't need an audit. Just fix the "single unlinks don't get
processed until a flush occurs" problem, and most of the issues
fstests has will go away.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-10 21:54             ` Dave Chinner
@ 2022-05-11  0:37               ` Darrick J. Wong
  2022-05-11  1:36                 ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2022-05-11  0:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chris Dunlop, Amir Goldstein, linux-xfs

On Wed, May 11, 2022 at 07:54:11AM +1000, Dave Chinner wrote:
> On Tue, May 10, 2022 at 12:19:18PM -0700, Darrick J. Wong wrote:
> > On Tue, May 10, 2022 at 06:16:32PM +1000, Dave Chinner wrote:
> > > On Tue, May 10, 2022 at 04:30:51PM +1000, Chris Dunlop wrote:
> > > > On Mon, May 09, 2022 at 10:10:57PM -0700, Darrick J. Wong wrote:
> > > > > On Tue, May 10, 2022 at 07:07:35AM +0300, Amir Goldstein wrote:
> > > > > > On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> > > > > > > Is it to be expected that removing 29TB of highly reflinked and fragmented
> > > > > > > data could take days, the entire time blocking other tasks like "rm" and
> > > > > > > "df" on the same filesystem?
> > > > ...
> > > > > > From a product POV, I think what should have happened here is that
> > > > > > freeing up the space would have taken 10 days in the background, but
> > > > > > otherwise, filesystem should not have been blocking other processes
> > > > > > for long periods of time.
> > > 
> > > Sure, that's obvious, and without looking at the code I know what
> > > that is: statfs()
> > > 
> > > > > Indeed.  Chris, do you happen to have the sysrq-w output handy?  I'm
> > > > > curious if the stall warning backtraces all had xfs_inodegc_flush() in
> > > > > them, or were there other parts of the system stalling elsewhere too?
> > > > > 50 billion updates is a lot, but there shouldn't be stall warnings.
> > > > 
> > > > Sure: https://file.io/25za5BNBlnU8  (6.8M)
> > > > 
> > > > Of the 3677 tasks in there, only 38 do NOT show xfs_inodegc_flush().
> > > 
> > > yup, 3677 tasks in statfs(). The majority are rm, df, and check_disk
> > > processes, there's a couple of veeamagent processes stuck and
> > > an (ostnamed) process, whatever that is...
> > > 
> > > No real surprises there, and it's not why the filesystem is taking
> > > so long to remove all the reflink references.
> > > 
> > > There is just one background inodegc worker thread running, so
> > > there's no excessive load being generated by inodegc, either. It's
> > > no different to a single rm running xfs_inactive() directly on a
> > > slightly older kernel and filling the journal:
> > > 
> > > May 06 09:49:01 d5 kernel: task:kworker/6:1     state:D stack:    0 pid:23258 ppid:     2 flags:0x00004000
> > > May 06 09:49:01 d5 kernel: Workqueue: xfs-inodegc/dm-1 xfs_inodegc_worker [xfs]
> > > May 06 09:49:01 d5 kernel: Call Trace:
> > > May 06 09:49:01 d5 kernel:  <TASK>
> > > May 06 09:49:01 d5 kernel:  __schedule+0x241/0x740
> > > May 06 09:49:01 d5 kernel:  schedule+0x3a/0xa0
> > > May 06 09:49:01 d5 kernel:  schedule_timeout+0x271/0x310
> > > May 06 09:49:01 d5 kernel:  __down+0x6c/0xa0
> > > May 06 09:49:01 d5 kernel:  down+0x3b/0x50
> > > May 06 09:49:01 d5 kernel:  xfs_buf_lock+0x40/0xe0 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_buf_find.isra.32+0x3ee/0x730 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_buf_get_map+0x3c/0x430 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_buf_read_map+0x37/0x2c0 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_trans_read_buf_map+0x1cb/0x300 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_btree_read_buf_block.constprop.40+0x75/0xb0 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_btree_lookup_get_block+0x85/0x150 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_btree_overlapped_query_range+0x33c/0x3c0 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_btree_query_range+0xd5/0x100 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_rmap_query_range+0x71/0x80 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_rmap_lookup_le_range+0x88/0x180 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_rmap_unmap_shared+0x89/0x560 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_rmap_finish_one+0x201/0x260 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_rmap_update_finish_item+0x33/0x60 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_defer_finish_noroll+0x215/0x5a0 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_defer_finish+0x13/0x70 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_itruncate_extents_flags+0xc4/0x240 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_inactive_truncate+0x7f/0xc0 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_inactive+0x10c/0x130 [xfs]
> > > May 06 09:49:01 d5 kernel:  xfs_inodegc_worker+0xb5/0x140 [xfs]
> > > May 06 09:49:01 d5 kernel:  process_one_work+0x2a8/0x4c0
> > > May 06 09:49:01 d5 kernel:  worker_thread+0x21b/0x3c0
> > > May 06 09:49:01 d5 kernel:  kthread+0x121/0x140
> > > 
> > > There are 18 tasks in destroy_inode() blocked on a workqueue flush
> > > - these are new unlinks that are getting throttled because that
> > > per-cpu inodegc queue is full and work is ongoing. Not a huge
> > > deal, maybe we should look to hand full queues to another CPU if
> > > the neighbour CPU has an empty queue. That would reduce
> > > unnecessary throttling, though it may mean more long running
> > > inodegc processes in the background....
> > > 
> > > Really, though, I don't see anything deadlocked, just a system
> > > backed up doing a really large amount of metadata modification.
> > > Everything is sitting on throttles or waiting on IO and making
> > > slow forwards progress as metadata writeback allows log space to be
> > > freed.
> > > 
> > > I think we should just accept that statfs() can never really report
> > > exactly accurate space usagei to userspace and get rid of the flush.
> > 
> > What about all the code that flushes gc work when we hit ENOSPC/EDQUOT?
> > Do we let those threads stall too because the fs is out of resources and
> > they can just wait?  Or would that also be better off with a flush
> > timeout and userspace can just eat the ENOSPC/EDQUOT after 30 seconds?
> 
> 1. Not an immediate problem we need to solve.
> 2. flush at enospc/edquot is best effort, so doesn't need to block
>    waiting on inodegc. the enospc/edquot flush will get repeated
>    soon enough, so that will take into account progress made by
>    long running inodegc ops.
> 3. we leave pending work on the per-cpu queues under the
>    flush/throttle thresholds indefinitely.
> 4. to be accurate, statfs() needs to flush #3.

Yes, that's the state of things currently...

> 5. While working on the rework of inode reclaimation, I converted the
>    inodegc queues to use delayed works to ensure work starts on
>    per-cpu queues within 10ms of queueing to avoid #3 causing
>    problems.
> 6. I added a non-blocking async flush mechanism that works by
>    modifying the queue timer to 0 to trigger immedate work
>    scheduling for anything that hasn't been run.

*OH*, you've already built those two things?  Well that makes this
/much/ easier.  I think all we'd need to fix #4 is an xfs_inodegc_push()
function that statfs/enospc/edquot can call to do xfs_inodegc_queue_all
and return immediately.

/me calms down, goes back to proofreading the online fsck design doc.

--D

> #4 is the problem that Chris hit - we're trying to be perfectly
> accurate when perfect accuracy is impossible and we're paying the
> price for that.
> 
> Fix #3, and the need to block statfs() largely goes away. And as per
> #2, we don't really need to block edquot/enospc flushes, either. We
> jsut need to push the queues to make sure they are running and doing
> work...
> 
> > > Work around the specific fstests dependencies on rm always
> > > immediately making unlinked inodes free space in fstests rather than
> > > in the kernel code.
> > 
> > <grumble> I *really* don't want to launch *yet another* full scale audit
> > of XFS + supporting software behavior on tiny and/or full filesystems.
> 
> This doesn't need an audit. Just fix the "single unlinks don't get
> processed until a flush occurs" problem, and most of the issues
> fstests has will go away.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-11  0:37               ` Darrick J. Wong
@ 2022-05-11  1:36                 ` Dave Chinner
  2022-05-11  2:16                   ` Chris Dunlop
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2022-05-11  1:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chris Dunlop, Amir Goldstein, linux-xfs

On Tue, May 10, 2022 at 05:37:08PM -0700, Darrick J. Wong wrote:
> On Wed, May 11, 2022 at 07:54:11AM +1000, Dave Chinner wrote:
> > On Tue, May 10, 2022 at 12:19:18PM -0700, Darrick J. Wong wrote:
> > > On Tue, May 10, 2022 at 06:16:32PM +1000, Dave Chinner wrote:
> > > > I think we should just accept that statfs() can never really report
> > > > exactly accurate space usagei to userspace and get rid of the flush.
> > > 
> > > What about all the code that flushes gc work when we hit ENOSPC/EDQUOT?
> > > Do we let those threads stall too because the fs is out of resources and
> > > they can just wait?  Or would that also be better off with a flush
> > > timeout and userspace can just eat the ENOSPC/EDQUOT after 30 seconds?
> > 
> > 1. Not an immediate problem we need to solve.
> > 2. flush at enospc/edquot is best effort, so doesn't need to block
> >    waiting on inodegc. the enospc/edquot flush will get repeated
> >    soon enough, so that will take into account progress made by
> >    long running inodegc ops.
> > 3. we leave pending work on the per-cpu queues under the
> >    flush/throttle thresholds indefinitely.
> > 4. to be accurate, statfs() needs to flush #3.
> 
> Yes, that's the state of things currently...
> 
> > 5. While working on the rework of inode reclaimation, I converted the
> >    inodegc queues to use delayed works to ensure work starts on
> >    per-cpu queues within 10ms of queueing to avoid #3 causing
> >    problems.
> > 6. I added a non-blocking async flush mechanism that works by
> >    modifying the queue timer to 0 to trigger immedate work
> >    scheduling for anything that hasn't been run.
> 
> *OH*, you've already built those two things?  Well that makes this
> /much/ easier.  I think all we'd need to fix #4 is an xfs_inodegc_push()
> function that statfs/enospc/edquot can call to do xfs_inodegc_queue_all
> and return immediately.

*nod*.

I think that's the thing that some people have missed in in this
thread - I've know for a while now the scope of problems blocking
flushes from statfs() can cause - any issue with background
inodegc not making progress can deadlock the filesystem. I've lost
count of the number of times I had inodegc go wrong or crash and the
only possible recovery was to power cycle because nothing could be
executed from the command line.

That's because statfs() appears to be used in some commonly used
library function and so many utility programs on the system will get
stuck and be unable to run when inodegc dies, deadlocks, or takes a
real long time to make progress.

Hence I didn't need to do any analysis of Chris' system to know
exactly what was going on - I've seen it many, many times myself and
have work in progress that avoids those issues with inodegc work
that never completes.

IOWs, everything is understood, fixes are already written, and
there's no actual threat of data loss or corruption from this issue.
It's just lots of stuff getting stuck behind a long running
operation...

All this report means is that I have to bump the priority of that
work, separate it out into it's own series and make sure it's ready
for the next cycle.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-11  1:36                 ` Dave Chinner
@ 2022-05-11  2:16                   ` Chris Dunlop
  2022-05-11  2:52                     ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Dunlop @ 2022-05-11  2:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs

On Wed, May 11, 2022 at 11:36:54AM +1000, Dave Chinner wrote:
> I think that's the thing that some people have missed in in this
> thread - I've know for a while now the scope of problems blocking
> flushes from statfs() can cause - any issue with background
> inodegc not making progress can deadlock the filesystem. I've lost
> count of the number of times I had inodegc go wrong or crash and the
> only possible recovery was to power cycle because nothing could be
> executed from the command line.
>
> That's because statfs() appears to be used in some commonly used
> library function and so many utility programs on the system will get
> stuck and be unable to run when inodegc dies, deadlocks, or takes a
> real long time to make progress.
>
> Hence I didn't need to do any analysis of Chris' system to know
> exactly what was going on - I've seen it many, many times myself and
> have work in progress that avoids those issues with inodegc work
> that never completes.
>
> IOWs, everything is understood, fixes are already written, and
> there's no actual threat of data loss or corruption from this issue.
> It's just lots of stuff getting stuck behind a long running
> operation...

Your patches are stuck behind other long running priorities, or the 
patches address an issue of stuff getting stuck? Or, of course, both? ;-)

Out of interest, would this work also reduce the time spent mounting in my 
case? I.e. would a lot of the work from my recovery mount be punted off 
to a background thread?

Cheers,

Chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-11  2:16                   ` Chris Dunlop
@ 2022-05-11  2:52                     ` Dave Chinner
  2022-05-11  3:58                       ` Chris Dunlop
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2022-05-11  2:52 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs

On Wed, May 11, 2022 at 12:16:57PM +1000, Chris Dunlop wrote:
> On Wed, May 11, 2022 at 11:36:54AM +1000, Dave Chinner wrote:
> > I think that's the thing that some people have missed in in this
> > thread - I've know for a while now the scope of problems blocking
> > flushes from statfs() can cause - any issue with background
> > inodegc not making progress can deadlock the filesystem. I've lost
> > count of the number of times I had inodegc go wrong or crash and the
> > only possible recovery was to power cycle because nothing could be
> > executed from the command line.
> > 
> > That's because statfs() appears to be used in some commonly used
> > library function and so many utility programs on the system will get
> > stuck and be unable to run when inodegc dies, deadlocks, or takes a
> > real long time to make progress.
> > 
> > Hence I didn't need to do any analysis of Chris' system to know
> > exactly what was going on - I've seen it many, many times myself and
> > have work in progress that avoids those issues with inodegc work
> > that never completes.
> > 
> > IOWs, everything is understood, fixes are already written, and
> > there's no actual threat of data loss or corruption from this issue.
> > It's just lots of stuff getting stuck behind a long running
> > operation...
> 
> Your patches are stuck behind other long running priorities, or the patches
> address an issue of stuff getting stuck? Or, of course, both? ;-)

I was refering to the way your system got blocked up - lots of
stuff stuck behind a long running operation. :)

As for priorities, the queue/flushing fixes are part of a body of
work in progress that I benched for 3 months when agreeing to take
over maintenance of the kernel code for the current dev cycle. My
priorities over recent weeks have been to get everyone else's work
reviewed, tested, improved, and merged, not focusing on progressing
my own projects.

As such, I wasn't considering changing inoedegc queuing in the next
merge window at all - it would just be a part of the larger body of
work when it was ready.

Your problems have made it apparent that we really need those
problems fixed ASAP (i.e. the next merge cycle) hence the priority
bump....

> Out of interest, would this work also reduce the time spent mounting in my
> case? I.e. would a lot of the work from my recovery mount be punted off to a
> background thread?

No. log recovery will punt the remaining inodegc work to background
threads so it might get slightly parallelised, but we have a hard
barrier between completing log recovery work and completing the
mount process at the moment. Hence we wait for inodegc to complete
before log recovery is marked as complete.

In theory we could allow background inodegc to bleed into active
duty once log recovery has processed all the unlinked lists, but
that's a change of behaviour that would require a careful audit of
the last part of the mount path to ensure that it is safe to be
running concurrent background operations whilst complete mount state
updates.

This hasn't been on my radar at all up until now, but I'll have a
think about it next time I look at those bits of recovery. I suspect
that probably won't be far away - I have a set of unlinked inode
list optimisations that rework the log recovery infrastructure near
the top of my current work queue, so I will be in that general
vicinity over the next few weeks...

Regardless, the inodegc work is going to be slow on your system no
matter what we do because of the underlying storage layout. What we
need to do is try to remove all the places where stuff can get
blocked on inodegc completion, but that is somewhat complex because
we still need to be able to throttle queue depths in various
situations.

As always, there is plenty of scope for improvement - the question
is working out which improvements are the ones we actually need
first...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-11  2:52                     ` Dave Chinner
@ 2022-05-11  3:58                       ` Chris Dunlop
  2022-05-11  5:18                         ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Dunlop @ 2022-05-11  3:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs

On Wed, May 11, 2022 at 12:52:35PM +1000, Dave Chinner wrote:
> On Wed, May 11, 2022 at 12:16:57PM +1000, Chris Dunlop wrote:
>> Out of interest, would this work also reduce the time spent mounting in 
>> my case? I.e. would a lot of the work from my recovery mount be punted 
>> off to a background thread?
>
> No. log recovery will punt the remaining inodegc work to background
> threads so it might get slightly parallelised, but we have a hard
> barrier between completing log recovery work and completing the
> mount process at the moment. Hence we wait for inodegc to complete
> before log recovery is marked as complete.
>
> In theory we could allow background inodegc to bleed into active
> duty once log recovery has processed all the unlinked lists, but
> that's a change of behaviour that would require a careful audit of
> the last part of the mount path to ensure that it is safe to be
> running concurrent background operations whilst complete mount state
> updates.
>
> This hasn't been on my radar at all up until now, but I'll have a
> think about it next time I look at those bits of recovery. I suspect
> that probably won't be far away - I have a set of unlinked inode
> list optimisations that rework the log recovery infrastructure near
> the top of my current work queue, so I will be in that general
> vicinity over the next few weeks...

I'll keep an eye out.

> Regardless, the inodegc work is going to be slow on your system no
> matter what we do because of the underlying storage layout. What we
> need to do is try to remove all the places where stuff can get
> blocked on inodegc completion, but that is somewhat complex because
> we still need to be able to throttle queue depths in various
> situations.

That reminds of a something I've been wondering about for obvious reasons: 
for workloads where metadata operations are dominant, do you have any 
ponderings on allowing AGs to be put on fast storage whilst the bulk data 
is on molasses storage?

Cheers,

Chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Highly reflinked and fragmented considered harmful?
  2022-05-11  3:58                       ` Chris Dunlop
@ 2022-05-11  5:18                         ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2022-05-11  5:18 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs

On Wed, May 11, 2022 at 01:58:13PM +1000, Chris Dunlop wrote:
> On Wed, May 11, 2022 at 12:52:35PM +1000, Dave Chinner wrote:
> > On Wed, May 11, 2022 at 12:16:57PM +1000, Chris Dunlop wrote:
> > > Out of interest, would this work also reduce the time spent mounting
> > > in my case? I.e. would a lot of the work from my recovery mount be
> > > punted off to a background thread?
> > 
> > No. log recovery will punt the remaining inodegc work to background
> > threads so it might get slightly parallelised, but we have a hard
> > barrier between completing log recovery work and completing the
> > mount process at the moment. Hence we wait for inodegc to complete
> > before log recovery is marked as complete.
> > 
> > In theory we could allow background inodegc to bleed into active
> > duty once log recovery has processed all the unlinked lists, but
> > that's a change of behaviour that would require a careful audit of
> > the last part of the mount path to ensure that it is safe to be
> > running concurrent background operations whilst complete mount state
> > updates.
> > 
> > This hasn't been on my radar at all up until now, but I'll have a
> > think about it next time I look at those bits of recovery. I suspect
> > that probably won't be far away - I have a set of unlinked inode
> > list optimisations that rework the log recovery infrastructure near
> > the top of my current work queue, so I will be in that general
> > vicinity over the next few weeks...
> 
> I'll keep an eye out.
> 
> > Regardless, the inodegc work is going to be slow on your system no
> > matter what we do because of the underlying storage layout. What we
> > need to do is try to remove all the places where stuff can get
> > blocked on inodegc completion, but that is somewhat complex because
> > we still need to be able to throttle queue depths in various
> > situations.
> 
> That reminds of a something I've been wondering about for obvious reasons:
> for workloads where metadata operations are dominant, do you have any
> ponderings on allowing AGs to be put on fast storage whilst the bulk data is
> on molasses storage?

If you're willing to give up reflink and pretty much all the
allocation optimisations for storage locality that make spinning
disks perform OK, then you can do this right now with a realtime
device as the user data store. You still have AGs, but they will
contain metadata only - your bulk data storage device is the
realtime device.

This has downsides. You give up reflink. You give up rmap. You give
up allocation concurrency. You give up btree indexed free space,
which means giving up the ability to find optimal free spaces.
Allocation algorithms are optimised for deterministic, bound
overhead behaviour (the "realtime" aspect of the RT device) so you
give up smart, context aware allocation algorithms. the list goes
on.

reflink and rmap support for the realtime device are in the pipeline
(not likely to be added in the near term), but solutions for any of
the other issues are not. They are intrinsic behaviours that result
from the realtime device architecture.

However, there's no real way to separate data in AGs from metadata
in AGs - they share the same address space and there's no simple way
to keep them separate and map different parts of the AG to different
storage devices. that would require a fair chunk of slicing and
dicing at the DM level, and then we have a whole net set of problems
to deal with when AGs run out of metadata space because reflink
and/or rmap metadata explosions...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-05-11  5:18 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-05-09  2:46 Highly reflinked and fragmented considered harmful? Chris Dunlop
2022-05-09 23:09 ` Dave Chinner
2022-05-10  2:55   ` Chris Dunlop
2022-05-10  5:14     ` Darrick J. Wong
2022-05-10  4:07   ` Amir Goldstein
2022-05-10  5:10     ` Darrick J. Wong
2022-05-10  6:30       ` Chris Dunlop
2022-05-10  8:16         ` Dave Chinner
2022-05-10 19:19           ` Darrick J. Wong
2022-05-10 21:54             ` Dave Chinner
2022-05-11  0:37               ` Darrick J. Wong
2022-05-11  1:36                 ` Dave Chinner
2022-05-11  2:16                   ` Chris Dunlop
2022-05-11  2:52                     ` Dave Chinner
2022-05-11  3:58                       ` Chris Dunlop
2022-05-11  5:18                         ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox