* Endless "reclaiming chunk"/"relocating block group"
@ 2022-10-19 18:29 Christoph Biedl
2022-10-20 1:59 ` Zygo Blaxell
2022-10-20 9:16 ` Filipe Manana
0 siblings, 2 replies; 7+ messages in thread
From: Christoph Biedl @ 2022-10-19 18:29 UTC (permalink / raw)
To: linux-btrfs
Hello,
On some systems I observe a strange behaviour: After remounting a BTRFS
readwrite, a background process starts doing things on the disk,
messages look like
| BTRFS info (device nvme0n1p1): reclaiming chunk 21486669660160 with 100% used 0% unusable
| BTRFS info (device nvme0n1p1): relocating block group 21486669660160 flags data
| BTRFS info (device nvme0n1p1): found 4317 extents, stage: move data extents
| BTRFS info (device nvme0n1p1): found 4317 extents, stage: update data pointers
and (with differing numbers) this goes on for hours and days, at a
read/write rate of about 165/244 kbyte/sec. The filesystem, some 2.5
Gbyte total size, is filled to about 55%, so even if that process
touches each and every block, it should already have handled everything,
several times.
Now, I have no clue what is happening here, what triggers it, or whether it
will ever finish. The point is, this takes a measurable amount of I/O and CPU,
and it delays other processes.
Some details, and things I've tested:
This behaviour is reproducible 100%, even with a btrfs created mere
moments ago.
The filesystem was created using the 5.10 and 6.0 version of the
btrfs-progs (both as provided by Debian stable and unstable resp.).
Using the grml rescue system (stable and daily, the latter kernel 5.19),
the system does not show this behaviour.
The group block number is constantly increasing (14 digits after two
days), in other words, I have not observed a wrap-around.
It was suggested in IRC to format using the --mixed parameter, to no avail.
It was also suggested to set the various bg_reclaim_threshold knobs to zero to
stop this process, also to no avail.
This is amd64 hardware without any unusual elements. I could easily
reproduce this on fairly different platforms to make sure it's not
hardware-specific.
Scrubbing did not show any errors, and the problem remained.
The host runs a hand-crafted kernel, currently 5.19, and I reckon this
is the source of the problem. Of course I've compared all the BTRFS
kernel options, they are identical. In the block device layer
configuration I couldn't see any difference that I can think would
relate to this issue. Likewise I compared all kernel configuration
options mentioned in src/fs/btrfs/, still nothing noteworthy.
So I'm a bit out of ideas. Unless there's something obvious from the
description above, perhaps you could give a hint to the following: The
process that emits the messages above, is there a way to stop it, or to
report completion percentage? Looking into btrfs_reclaim_bgs_work
(block-group.c), it doesn't look like it. Are block group numbers really
*that* big, magnitudes over the size of the entire filesystem?
Regards,
Christoph
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Endless "reclaiming chunk"/"relocating block group"
2022-10-19 18:29 Endless "reclaiming chunk"/"relocating block group" Christoph Biedl
@ 2022-10-20 1:59 ` Zygo Blaxell
2022-10-20 11:01 ` Christoph Biedl
2022-10-20 9:16 ` Filipe Manana
1 sibling, 1 reply; 7+ messages in thread
From: Zygo Blaxell @ 2022-10-20 1:59 UTC (permalink / raw)
To: Christoph Biedl; +Cc: linux-btrfs
On Wed, Oct 19, 2022 at 08:29:59PM +0200, Christoph Biedl wrote:
> Hello,
>
> On some systems I observe a strange behaviour: After remounting a BTRFS
> readwrite, a background process starts doing things on the disk,
> messages look like
>
> | BTRFS info (device nvme0n1p1): reclaiming chunk 21486669660160 with 100% used 0% unusable
> | BTRFS info (device nvme0n1p1): relocating block group 21486669660160 flags data
> | BTRFS info (device nvme0n1p1): found 4317 extents, stage: move data extents
> | BTRFS info (device nvme0n1p1): found 4317 extents, stage: update data pointers
>
> and (with differing numbers) this goes on for hours and days, at a
> read/write rate of about 165/244 kbyte/sec. The filesystem, some 2.5
> Gbyte total size, is filled to about 55%, so even if that process
> touches each and every block, it should already have handled everything,
> several times.
>
> Now, I have no clue what is happening here, what triggers it, or whether it
> will ever finish. The point is, this takes a measurable amount of I/O and CPU,
> and it delays other processes.
>
>
> Some details, and things I've tested:
>
> This behaviour is reproducible 100%, even with a btrfs created mere
> moments ago.
>
> The filesystem was created using the 5.10 and 6.0 version of the
> btrfs-progs (both as provided by Debian stable and unstable resp.).
Reclaim is a purely in-kernel-memory feature, so this should not have
an effect.
> Using the grml rescue system (stable and daily, the latter kernel 5.19),
> the system does not show this behaviour.
> The group block number is constantly increasing (14 digits after two
> days), in other words, I have not observed a wrap-around.
It's a 64-bit number so it's not going to wrap around any time soon.
> It was suggested in IRC to format using the --mixed parameter, to no avail.
>
> It was also suggested to set the various bg_reclaim_threshold knobs to zero to
> stop this process, also to no avail.
Not sure what's happening there. The reclaim threshold is the minimum
amount of free space, so it shouldn't be triggering with a 100% filled
block group. Reclaiming an _exactly_ filled block group makes no sense
at all (no improvement of space usage is possible when a block group
is completely filled), so we shouldn't be doing it.
Note that since 5.19 there are multiple bg_reclaim_threshold knobs:
/sys/fs/btrfs/(uuid)/bg_reclaim_threshold
/sys/fs/btrfs/(uuid)/allocation/metadata/bg_reclaim_threshold
/sys/fs/btrfs/(uuid)/allocation/system/bg_reclaim_threshold
/sys/fs/btrfs/(uuid)/allocation/data/bg_reclaim_threshold
Make sure all of these are zero.
> This is amd64 hardware without any unusual elements. I could easily
> reproduce this on fairly different platforms to make sure it's not
> hardware-specific.
It makes me think of possible rounding errors (e.g. the threshold
calculation divides by 100, or there's a sum of quantities that leads
to a percentage > 100), but the code treats zero as a special case and
bails out long before, so I don't see how we'd reach those corner cases.
> Scrubbing did not show any errors, and the problem remained.
Scrub shouldn't interact with reclaim, except to slightly delay it.
> The host runs a hand-crafted kernel, currently 5.19, and I reckon this
> is the source of the problem. Of course I've compared all the BTRFS
> kernel options, they are identical. In the block device layer
> configuration I couldn't see any difference that I can think would
> relate to this issue. Likewise I compared all kernel configuration
> options mentioned in src/fs/btrfs/, still nothing noteworthy.
>
>
> So I'm a bit out of ideas. Unless there's something obvious from the
> description above, perhaps you could give a hint to the following: The
> process that emits the messages above, is there a way to stop it,
Set bg_reclaim_threshold to 0, as you've already tried (but maybe not
in all of the relevant places).
> or to report completion percentage?
It's done more or less one block group at a time, so it stops when it
stops. Even if it reaches the end of the block group reclaim list for
one iteration of the worker, there's nothing stopping new block groups
from being added while processing the list, making the percentage at
any given time meaningless.
> Are block group numbers really
> *that* big, magnitudes over the size of the entire filesystem?
Relocation (including reclaim, resize, and balance) copies the data from
the old block group into a new block group, removing free space fragments
between the data extents in the process. Block group numbers are (almost)
never reused, so each new block group created has higher bytenrs than
any before it. If you've relocated every block group 100 times, the
block group numbers will be over 100x the size of the filesystem.
> Regards,
>
> Christoph
>
* Re: Endless "reclaiming chunk"/"relocating block group"
2022-10-19 18:29 Endless "reclaiming chunk"/"relocating block group" Christoph Biedl
2022-10-20 1:59 ` Zygo Blaxell
@ 2022-10-20 9:16 ` Filipe Manana
1 sibling, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2022-10-20 9:16 UTC (permalink / raw)
To: Christoph Biedl; +Cc: linux-btrfs
On Wed, Oct 19, 2022 at 8:12 PM Christoph Biedl
<linux-kernel.bfrz@manchmal.in-ulm.de> wrote:
>
> Hello,
>
> On some systems I observe a strange behaviour: After remounting a BTRFS
> readwrite, a background process starts doing things on the disk,
> messages look like
>
> | BTRFS info (device nvme0n1p1): reclaiming chunk 21486669660160 with 100% used 0% unusable
> | BTRFS info (device nvme0n1p1): relocating block group 21486669660160 flags data
> | BTRFS info (device nvme0n1p1): found 4317 extents, stage: move data extents
> | BTRFS info (device nvme0n1p1): found 4317 extents, stage: update data pointers
So that means you have automatic block group reclaim enabled (in this
case, a non-zero value in the sysfs file
/sys/fs/btrfs/<uuid>/allocation/data/bg_reclaim_threshold).
This sounds like it can be fixed by a very recent patch which is not
yet in any released kernel:
https://lore.kernel.org/linux-btrfs/5f8c37f6ebc9024ef4351ae895f3e5fdb9c67baf.1665701210.git.boris@bur.io/
>
> and (with differing numbers) this goes on for hours and days, at a
> read/write rate of about 165/244 kbyte/sec. The filesystem, some 2.5
> Gbyte total size, is filled to about 55%, so even if that process
> touches each and every block, it should already have handled everything,
> several times.
>
> Now, I have no clue what is happening here, what triggers it, or whether it
> will ever finish. The point is, this takes a measurable amount of I/O and CPU,
> and it delays other processes.
>
>
> Some details, and things I've tested:
>
> This behaviour is reproducible 100%, even with a btrfs created mere
> moments ago.
>
> The filesystem was created using the 5.10 and 6.0 version of the
> btrfs-progs (both as provided by Debian stable and unstable resp.).
>
> Using the grml rescue system (stable and daily, the latter kernel 5.19),
> the system does not show this behaviour.
>
> The group block number is constantly increasing (14 digits after two
> days), in other words, I have not observed a wrap-around.
>
> It was suggested in IRC to format using the --mixed parameter, to no avail.
>
> It was also suggested to set the various bg_reclaim_threshold knobs to zero to
> stop this process, also to no avail.
>
> This is amd64 hardware without any unusual elements. I could easily
> reproduce this on fairly different platforms to make sure it's not
> hardware-specific.
>
> Scrubbing did not show any errors, and the problem remained.
>
>
> The host runs a hand-crafted kernel, currently 5.19, and I reckon this
> is the source of the problem. Of course I've compared all the BTRFS
> kernel options, they are identical. In the block device layer
> configuration I couldn't see any difference that I can think would
> relate to this issue. Likewise I compared all kernel configuration
> options mentioned in src/fs/btrfs/, still nothing noteworthy.
>
>
> So I'm a bit out of ideas. Unless there's something obvious from the
> description above, perhaps you could give a hint to the following: The
> process that emits the messages above, is there a way to stop it, or to
> report completion percentage? Looking into btrfs_reclaim_bgs_work
> (block-group.c), it doesn't look like it. Are block group numbers really
> *that* big, magnitudes over the size of the entire filesystem?
>
> Regards,
>
> Christoph
>
* Re: Endless "reclaiming chunk"/"relocating block group"
2022-10-20 1:59 ` Zygo Blaxell
@ 2022-10-20 11:01 ` Christoph Biedl
2022-10-20 13:53 ` Zygo Blaxell
0 siblings, 1 reply; 7+ messages in thread
From: Christoph Biedl @ 2022-10-20 11:01 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs
Thanks for all the thoughts shared ...
Zygo Blaxell wrote...
(...)
> It makes me think of possible rounding errors (e.g. the threshold
> calculation divides by 100, or there's a sum of quantities that leads
> to a percentage > 100), but the code treats zero as a special case and
> bails out long before, so I don't see how we'd reach those corner cases.
Out of desperation, I followed an approach I could read between the lines
here. The thing I did not mention (just because I forgot about it) is
the compiler version used to build the kernel. For reasons, this is
still a gcc-5 (5.4.0, to be precise). After a rebuild using gcc-11,
things are nice and smooth.
Now I cannot deny I'm quite confused. Assuming this is really the cause,
how should I go on from here? I can patch the sources at will, but it's a
huge amount of code, and I don't even know where to start. Also, from
experience, debug print statements tend to bar the optimizer from creating
the problematic instructions.
Christoph
FWIW, according to Documentation/process/changes.rst, that old gcc-5 is
considered sufficient. I didn't expect that :)
* Re: Endless "reclaiming chunk"/"relocating block group"
2022-10-20 11:01 ` Christoph Biedl
@ 2022-10-20 13:53 ` Zygo Blaxell
2022-11-24 18:33 ` Christoph Biedl
0 siblings, 1 reply; 7+ messages in thread
From: Zygo Blaxell @ 2022-10-20 13:53 UTC (permalink / raw)
To: Christoph Biedl; +Cc: linux-btrfs
On Thu, Oct 20, 2022 at 01:01:48PM +0200, Christoph Biedl wrote:
> Thanks for all the thoughts shared ...
>
> Zygo Blaxell wrote...
>
> (...)
> > It makes me think of possible rounding errors (e.g. the threshold
> > calculation divides by 100, or there's a sum of quantities that leads
> > to a percentage > 100), but the code treats zero as a special case and
> > bails out long before, so I don't see how we'd reach those corner cases.
>
> Out of desperation, I followed an approach I could read between the lines
> here. The thing I did not mention (just because I forgot about it) is
> the compiler version used to build the kernel. For reasons, this is
> still a gcc-5 (5.4.0, to be precise). After a rebuild using gcc-11,
> things are nice and smooth.
That's...not the weirdest thing I've ever seen, but maybe the weirdest
thing I've seen this month.
> Now I cannot deny I'm quite confused. Assuming this is really the cause,
> how should I go on from here? I can patch the sources at will, but it's a
> huge amount of code, and I don't even know where to start. Also, from
> experience, debug print statements tend to bar the optimizer from creating
> the problematic instructions.
That in itself would (weakly) confirm a compiler issue. But it might
be simpler than that, like a bug in the implementation of a math builtin
that wasn't popular in the kernel 8 years ago, but is popular now after
the implementation was debugged in GCC. I don't know if that's the case
in this instance, but similar things have happened in the past.
> Christoph
>
> FWIW, according to Documentation/process/changes.rst, that old gcc-5 is
> considered sufficient. I didn't expect that :)
On paper maybe, but maybe not in real-world use. Somewhat related thread
on linux-kernel:
https://lore.kernel.org/all/Y0hz3u8ZNO2yFU2f@sirena.org.uk/T/
TL;DR not enough people are testing new kernel code against old compilers.
If it's problematic to keep the host system's gcc up to date, a build
chroot for kernel building with an up to date toolchain is the way to go.
* Re: Endless "reclaiming chunk"/"relocating block group"
2022-10-20 13:53 ` Zygo Blaxell
@ 2022-11-24 18:33 ` Christoph Biedl
2022-11-24 19:14 ` Christoph Biedl
0 siblings, 1 reply; 7+ messages in thread
From: Christoph Biedl @ 2022-11-24 18:33 UTC (permalink / raw)
To: linux-btrfs
Zygo Blaxell wrote...
> TL;DR not enough people are testing new kernel code against old compilers.
> If it's problematic to keep the host system's gcc up to date, a build
> chroot for kernel building with an up to date toolchain is the way to go.
Indeed, but unfortunately this is not an option in the given environment.
Eventually I started bisecting and found the commit below introduced
the trouble. And just to make it clear, the change is correct as far as
I can see. The code built by gcc 5.4 is not.
commit ac2f1e63c65c695b6134f40a078cf82df627e188
Author: Josef Bacik <josef@toxicpanda.com>
Date: Tue Mar 29 01:56:07 2022 -0700
btrfs: allow block group background reclaim for non-zoned filesystems
These final three hunks caught my attention:
@@ -3220,6 +3245,8 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
 	spin_unlock(&info->delalloc_root_lock);
 
 	while (total) {
+		bool reclaim;
+
 		cache = btrfs_lookup_block_group(info, bytenr);
 		if (!cache) {
 			ret = -ENOENT;
@@ -3265,6 +3292,8 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
 						      cache->space_info, num_bytes);
 			cache->space_info->bytes_used -= num_bytes;
 			cache->space_info->disk_used -= num_bytes * factor;
+
+			reclaim = should_reclaim_block_group(cache, num_bytes);
 			spin_unlock(&cache->lock);
 			spin_unlock(&cache->space_info->lock);
@@ -3291,6 +3320,8 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
 		if (!alloc && old_val == 0) {
 			if (!btrfs_test_opt(info, DISCARD_ASYNC))
 				btrfs_mark_bg_unused(cache);
+		} else if (!alloc && reclaim) {
+			btrfs_mark_bg_to_reclaim(cache);
 		}
 
 		btrfs_put_block_group(cache);
It seems strange that "reclaim" is not initialized, but after a closer look
this turns out to be okay. That variable is assigned if "alloc" is
false (the guarding condition is not visible in the diff above), but it is
only used under that condition as well (last hunk). Which is why gcc does
not emit warnings about this.
*However*, some additional debug-print statements revealed the generated
code enters the block that calls btrfs_mark_bg_to_reclaim /even/ /if/
alloc is false, and reclaim undefined (usually true).
That is fairly scary, and I have no idea how many other places in the
kernel are affected by this. I'd assume quite a few.
The workaround is trivial: initialize reclaim to false. But still.
Christoph
* Re: Endless "reclaiming chunk"/"relocating block group"
2022-11-24 18:33 ` Christoph Biedl
@ 2022-11-24 19:14 ` Christoph Biedl
0 siblings, 0 replies; 7+ messages in thread
From: Christoph Biedl @ 2022-11-24 19:14 UTC (permalink / raw)
To: linux-btrfs
Stupid typo ...
Christoph Biedl wrote...
> *However*, some additional debug-print statements revealed the generated
> code enters the block that calls btrfs_mark_bg_to_reclaim /even/ /if/
> alloc is
true,
> and reclaim undefined (usually true).