public inbox for linux-btrfs@vger.kernel.org
* permanently wedged in filesystem, fs/btrfs/relocation.c:1937 prepare_to_merge
@ 2023-07-20 13:42 Chris Murphy
  2023-08-03 21:12 ` Boris Burkov
  0 siblings, 1 reply; 4+ messages in thread
From: Chris Murphy @ 2023-07-20 13:42 UTC
  To: Btrfs BTRFS

kernel 6.3.12
btrfs-progs 6.3.2

User reports converting from metadata single to dup, which fails midway through and goes read-only. Following a reboot the file system progressively reports no free space even though `btrfs fi us` reports plenty of unused space in all block groups (but no unallocated space).

We were able to capture some information about file system state, but ordinary filtered balance measures failed so the user reformatted.

Downstream bug report
https://bugzilla.redhat.com/show_bug.cgi?id=2224346


 BTRFS info (device sda1): balance: start -mconvert=dup -sconvert=dup
 BTRFS info (device sda1): relocating block group 83480281088 flags metadata
 ------------[ cut here ]------------
 BTRFS: Transaction aborted (error -28)
 WARNING: CPU: 1 PID: 180121 at fs/btrfs/relocation.c:1937 prepare_to_merge+0x41f/0x430
 Modules linked in: [snipped]
 CPU: 1 PID: 180121 Comm: btrfs Not tainted 6.3.12-100.fc37.x86_64 #1
 Hardware name: Dell Inc. Latitude E6500                  /0PP476, BIOS A29 06/04/2013
 RIP: 0010:prepare_to_merge+0x41f/0x430
 Code: ad e8 75 e1 04 00 eb e0 44 89 f6 48 c7 c7 b8 8e 90 ad e8 a4 07 ab ff 0f 0b eb b4 44 89 f6 48 c7 c7 b8 8e 90 ad e8 91 07 ab ff <0f> 0b eb ba e8 38 be 93 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90
 RSP: 0018:ffffb6a64b387af8 EFLAGS: 00010282
 RAX: 0000000000000000 RBX: ffff9ac5cc749000 RCX: 0000000000000027
 RDX: ffff9ac617d21548 RSI: 0000000000000001 RDI: ffff9ac617d21540
 RBP: ffff9ac5004e0680 R08: 0000000000000000 R09: ffffb6a64b387988
 R10: 0000000000000003 R11: ffffffffae146108 R12: 00000000ffffffe4
 R13: ffff9ac50a5f7000 R14: 00000000ffffffe4 R15: ffff9ac4a3464358
 FS:  00007f0d6fc1b900(0000) GS:ffff9ac617d00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000000c0010a9008 CR3: 000000001a196000 CR4: 00000000000406e0
 Call Trace:
  <TASK>
  ? prepare_to_merge+0x41f/0x430
  ? __warn+0x81/0x130
  ? prepare_to_merge+0x41f/0x430
  ? report_bug+0x171/0x1a0
  ? prb_read_valid+0x1b/0x30
  ? handle_bug+0x41/0x70
  ? exc_invalid_op+0x17/0x70
  ? asm_exc_invalid_op+0x1a/0x20
  ? prepare_to_merge+0x41f/0x430
  ? prepare_to_merge+0x41f/0x430
  relocate_block_group+0x130/0x500
  btrfs_relocate_block_group+0x296/0x430
  btrfs_relocate_chunk+0x3f/0x160
  btrfs_balance+0x905/0x1390
  ? __kmem_cache_alloc_node+0x187/0x320
  ? btrfs_ioctl+0x2435/0x2640
  btrfs_ioctl+0x224e/0x2640
  ? ioctl_has_perm.constprop.0.isra.0+0xdd/0x140
  __x64_sys_ioctl+0x94/0xd0
  do_syscall_64+0x5f/0x90
  ? exit_to_user_mode_prepare+0x188/0x1f0
  ? syscall_exit_to_user_mode+0x1b/0x40
  ? do_syscall_64+0x6b/0x90
  ? syscall_exit_to_user_mode+0x1b/0x40
  ? do_syscall_64+0x6b/0x90
  ? syscall_exit_to_user_mode+0x1b/0x40
  ? do_syscall_64+0x6b/0x90
  ? exc_page_fault+0x74/0x170
  entry_SYSCALL_64_after_hwframe+0x72/0xdc
 RIP: 0033:0x7f0d6fd66d6f
 Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
 RSP: 002b:00007fff3c9bc290 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f0d6fd66d6f
 RDX: 00007fff3c9bc390 RSI: 00000000c4009420 RDI: 0000000000000003
 RBP: 0000000000000000 R08: 0000000000000004 R09: 0000000000000073
 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff3c9be3f4
 R13: 00007fff3c9bc390 R14: 0000000000000001 R15: 0000000000000000
  </TASK>
 ---[ end trace 0000000000000000 ]---
 BTRFS info (device sda1: state A): dumping space info:
 BTRFS info (device sda1: state A): space_info DATA has 22002716672 free, is not full
 BTRFS info (device sda1: state A): space_info total=51514441728, used=29511708672, pinned=0, reserved=0, may_use=16384, readonly=0 zone_unusable=0
 BTRFS info (device sda1: state A): space_info METADATA has 242122752 free, is full
 BTRFS info (device sda1: state A): space_info total=2632974336, used=1418067968, pinned=53788672, reserved=5505024, may_use=365117440, readonly=548372480 zone_unusable=0
 BTRFS info (device sda1: state A): space_info SYSTEM has 39829504 free, is not full
 BTRFS info (device sda1: state A): space_info total=39845888, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
 BTRFS info (device sda1: state A): global_block_rsv: size 107692032 reserved 107692032
 BTRFS info (device sda1: state A): trans_block_rsv: size 0 reserved 0
 BTRFS info (device sda1: state A): chunk_block_rsv: size 0 reserved 0
 BTRFS info (device sda1: state A): delayed_block_rsv: size 655360 reserved 655360
 BTRFS info (device sda1: state A): delayed_refs_rsv: size 2482503680 reserved 254541824
 BTRFS: error (device sda1: state A) in prepare_to_merge:1937: errno=-28 No space left
 BTRFS info (device sda1: state EA): forced readonly
 BTRFS info (device sda1: state EA): balance: ended with status: -30
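
As a cross-check on the dump above, the "free" figure printed for each space_info appears to be the total minus every other counter on the line that follows it. A minimal Python sketch, assuming that relationship (it is inferred from the numbers in this log, and the function name is made up for illustration):

```python
import re

def space_info_free(line: str) -> int:
    """Return total minus all other counters from a 'space_info total=...' log line."""
    counters = {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}
    total = counters.pop("total")
    return total - sum(counters.values())

# Counter lines copied from the kernel log above.
metadata = ("space_info total=2632974336, used=1418067968, pinned=53788672, "
            "reserved=5505024, may_use=365117440, readonly=548372480 zone_unusable=0")
data = ("space_info total=51514441728, used=29511708672, pinned=0, "
        "reserved=0, may_use=16384, readonly=0 zone_unusable=0")

metadata_free = space_info_free(metadata)
data_free = space_info_free(data)
print(metadata_free)  # 242122752, matching "METADATA has 242122752 free"
print(data_free)      # 22002716672, matching "DATA has 22002716672 free"
```

For scale, the delayed_refs_rsv size in the same dump (2482503680) is roughly ten times the METADATA free figure.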

# btrfs fi usage /btrfs_root/sda1
Overall:
    Device size:                  50.92GiB
    Device allocated:             50.92GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                         29.20GiB
    Free (estimated):             20.49GiB      (min: 20.49GiB)
    Free (statfs, df):            20.49GiB
    Data ratio:                       1.00
    Metadata ratio:                   1.18
    Global reserve:              102.70MiB      (used: 0.00B)
    Multiple profiles:                 yes      (metadata, system)

Data,single: Size:47.98GiB, Used:27.48GiB (57.29%)
   /dev/sda1      47.98GiB

Metadata,single: Size:2.00GiB, Used:943.66MiB (46.08%)
   /dev/sda1       2.00GiB

Metadata,DUP: Size:463.00MiB, Used:408.33MiB (88.19%)
   /dev/sda1     926.00MiB

System,single: Size:32.00MiB, Used:0.00B (0.00%)
   /dev/sda1      32.00MiB

System,DUP: Size:6.00MiB, Used:16.00KiB (0.26%)
   /dev/sda1      12.00MiB

Unallocated:
   /dev/sda1       1.00MiB
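
The "Metadata ratio: 1.18" line reflects the half-finished conversion: part of the metadata is still single and part is already DUP. A small Python sketch of the arithmetic, assuming the ratio is raw bytes on disk divided by logical bytes across both metadata profiles (an assumption that happens to reproduce the printed value):

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# (logical size, raw size on /dev/sda1) per metadata profile, from the output above.
metadata_profiles = {
    "single": (2 * GIB, 2 * GIB),
    "DUP": (463 * MIB, 926 * MIB),
}

logical = sum(logical_size for logical_size, _ in metadata_profiles.values())
raw = sum(raw_size for _, raw_size in metadata_profiles.values())
ratio = raw / logical
print(f"{ratio:.2f}")  # 1.18
```

A fully converted file system would show 2.00 here, and an unconverted one 1.00.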


sysfs

allocation/metadata/disk_used:1917665280
allocation/metadata/bytes_pinned:0
allocation/metadata/chunk_size:1073741824
allocation/metadata/bytes_used:1051197440
allocation/metadata/bg_reclaim_threshold:0
allocation/metadata/size_classes:none 0
allocation/metadata/size_classes:small 0
allocation/metadata/size_classes:medium 0
allocation/metadata/size_classes:large 0
allocation/metadata/single/used_bytes:184729600
allocation/metadata/single/total_bytes:2147483648
allocation/metadata/dup/used_bytes:866467840
allocation/metadata/dup/total_bytes:972029952
allocation/metadata/disk_total:4091543552
allocation/metadata/total_bytes:3119513600
allocation/metadata/bytes_reserved:0
allocation/metadata/bytes_readonly:1962754048
allocation/metadata/bytes_zone_unusable:0
allocation/metadata/bytes_may_use:105512960
allocation/metadata/flags:4
allocation/system/disk_used:32768
allocation/system/bytes_pinned:0
allocation/system/chunk_size:33554432
allocation/system/bytes_used:16384
allocation/system/bg_reclaim_threshold:0
allocation/system/size_classes:none 0
allocation/system/size_classes:small 0
allocation/system/size_classes:medium 0
allocation/system/size_classes:large 0
allocation/system/dup/used_bytes:16384
allocation/system/dup/total_bytes:67108864
allocation/system/disk_total:134217728
allocation/system/total_bytes:67108864
allocation/system/bytes_reserved:0
allocation/system/bytes_readonly:0
allocation/system/bytes_zone_unusable:0
allocation/system/bytes_may_use:0
allocation/system/flags:2
allocation/global_rsv_reserved:105512960
allocation/data/disk_used:29451214848
allocation/data/bytes_pinned:0
allocation/data/chunk_size:10737418240
allocation/data/bytes_used:29451214848
allocation/data/bg_reclaim_threshold:0
allocation/data/size_classes:none 5
allocation/data/size_classes:small 32
allocation/data/size_classes:medium 7
allocation/data/size_classes:large 5
allocation/data/single/used_bytes:29451214848
allocation/data/single/total_bytes:50453282816
allocation/data/disk_total:50453282816
allocation/data/total_bytes:50453282816
allocation/data/bytes_reserved:0
allocation/data/bytes_readonly:0
allocation/data/bytes_zone_unusable:0
allocation/data/bytes_may_use:0
allocation/data/flags:1
allocation/global_rsv_size:105512960
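
The sysfs snapshot above can be read the same way as the kernel's space_info dump. A rough Python sketch, assuming the sysfs counters combine like the counters in the kernel log (total minus everything else); the helper name is hypothetical:

```python
# Metadata counters copied from the sysfs listing above.
sysfs_metadata = {
    "total_bytes": 3119513600,
    "bytes_used": 1051197440,
    "bytes_pinned": 0,
    "bytes_reserved": 0,
    "bytes_readonly": 1962754048,
    "bytes_zone_unusable": 0,
    "bytes_may_use": 105512960,
}

def remaining(info):
    """Bytes left once every non-total counter is subtracted from total_bytes."""
    committed = sum(value for key, value in info.items() if key != "total_bytes")
    return info["total_bytes"] - committed

metadata_remaining = remaining(sysfs_metadata)
readonly_share = sysfs_metadata["bytes_readonly"] / sysfs_metadata["total_bytes"]
print(metadata_remaining)       # 49152
print(f"{readonly_share:.0%}")  # 63%
```

Under that assumption only 48KiB of metadata space is left, with about 63% of the metadata space marked read-only and bytes_may_use equal to the global reserve size.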

Looks similar to this: 
https://lore.kernel.org/lkml/000000000000a3d67705ff730522@google.com/T/





* Re: permanently wedged in filesystem, fs/btrfs/relocation.c:1937 prepare_to_merge
  2023-07-20 13:42 permanently wedged in filesystem, fs/btrfs/relocation.c:1937 prepare_to_merge Chris Murphy
@ 2023-08-03 21:12 ` Boris Burkov
  2023-08-04  1:23   ` Nicholas D Steeves
  0 siblings, 1 reply; 4+ messages in thread
From: Boris Burkov @ 2023-08-03 21:12 UTC
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Thu, Jul 20, 2023 at 09:42:37AM -0400, Chris Murphy wrote:
> kernel 6.3.12
> btrfs-progs 6.3.2
> 
> User reports converting from metadata single to dup, which fails midway through and goes read-only. Following a reboot the file system progressively reports no free space even though `btrfs fi us` reports plenty of unused space in all block groups (but no unallocated space).
> 
> We were able to capture some information about file system state, but ordinary filtered balance measures failed so the user reformatted.

Thank you for the report. This looks like the "usual" data
fragmentation/overallocation issue.

There are a few aspects to this:
1. Not letting data allocations use up all the unallocated space.
2. Not hitting ENOSPC until we absolutely must, should we get into that
state.
3. Beating back fragmentation and maintaining healthy unallocated space.

1. is close to intractable in general. I have some ideas for how to make
btrfs leave enough buffer, but it gets complicated on edge cases where
metadata runs away (rather than data) and I haven't had time to pursue
them.

2. is very tricky and while interesting, not the most helpful in the
short term, so I will focus on 3.

The btrfs allocator is far from perfect and despite a few measures that
attempt to prevent fragmentation, it can still happen. If you have a
system that reproduces this, you can consider using the scripts I wrote
here: https://github.com/josefbacik/fsperf/tree/master/src/frag to dump
the fragmentation level of the FS (and even visualize it) to confirm my
hypothesis. I'm happy to help you get that up and running.

Now let's suppose you do have a workload that challenges our allocator,
fragments the data block groups, and chews through all the unallocated
space. We have a lot of those at Meta, so luckily, there is some relief
available.

Fundamentally the remediation is to defragment the disk, which we do
with data block group balancing. You can invoke this manually with:
`btrfs balance start -dusage=<thresh> <fs>`
where <thresh> is a percentage fullness of data block_groups to target
with balancing: only block groups below that fullness get relocated.
Lower is more conservative so you can start low and increase it to 80
or so till you reclaim enough space. If you use that, it's better to do
it proactively and periodically rather than after you get stuck, 'cause
as you saw, balances start failing with ENOSPC too.
(see point 2. above :))
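
The "start low and increase" approach can be sketched in Python as a ladder of balance invocations. This only builds the command lines and does not run them. The threshold steps and the mount point are illustrative assumptions; the `<thresh>` value is passed through the data usage filter, which btrfs-progs spells `-dusage=<percent>`.

```python
def balance_commands(mount_point, thresholds=(10, 20, 40, 60, 80)):
    """Build one 'btrfs balance start' argv per usage threshold, lowest first."""
    return [
        ["btrfs", "balance", "start", f"-dusage={pct}", mount_point]
        for pct in sorted(thresholds)
    ]

# "/mnt/example" is a placeholder mount point.
commands = balance_commands("/mnt/example")
for argv in commands:
    print(" ".join(argv))
```

In practice one would run each step (for example with `subprocess.run`), check unallocated space afterwards, and stop as soon as enough has been reclaimed.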

Balance also has a "limit" parameter which you can use to avoid
rewriting the whole disk every time you balance if you have too high of
a threshold. Each block group is 1G, so you can use that info to pick a
good limit (and generally judge how much extra re-writing you're doing).
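
As a back-of-the-envelope sketch of that judgment in Python: assuming the limit is combined with the usage filter as `-dusage=<pct>,limit=<n>`, and that the usage filter only selects block groups less than `<pct>` full, the data rewritten by one run is bounded by n block groups times pct of a block group. The function name and the example values are made up for illustration.

```python
GIB = 1024 ** 3

def max_rewrite_bytes(limit, usage_pct, block_group_bytes=GIB):
    """Upper bound on data relocated: 'limit' block groups, each under usage_pct full."""
    return limit * block_group_bytes * usage_pct // 100

# e.g. `btrfs balance start -dusage=50,limit=10 <fs>`
bound = max_rewrite_bytes(limit=10, usage_pct=50)
print(bound / GIB)  # 5.0
```

This is an upper bound only; block groups well below the threshold move less data.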

Alternatively, and this is what we've been doing with some success at
Meta, you can use autorelocation, which does the balancing inline in the
kernel as a block group gets space freed from it. The algorithm is a bit
naive and doesn't try too hard to account for how fragmented the block
groups are before balancing them, so it may do too much I/O. YMMV.
To try that, you can use the sysfs knob:
/sys/fs/btrfs/<uuid>/bg_reclaim_threshold

That is the percentage fullness a bg has to drop below, on a free, to be
put on a balance list (similar to the <thresh> value for balance above).
If it's 0, no autorelocation will occur. If it's too high, block groups
might never get that full in the first place, so they never sink below
it (e.g., if it's 75, a free that takes you from 50->49 will not trigger
a balance). At Meta, we use 30, which is somewhat aggressive. You will
know it's working when 1) your unallocated space goes up and 2) dmesg
logs reading:
'relocating block group <offset> flags data'
start showing up.
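
The threshold-crossing rule described above can be modeled with a short Python sketch. This illustrates the behavior as explained in this mail and is not the kernel's code; the function and parameter names are hypothetical.

```python
def should_reclaim(length, used_before, bytes_freed, threshold_pct):
    """True only when a free moves a block group from at/above the threshold to below it."""
    if threshold_pct == 0:
        return False  # 0 disables autorelocation
    threshold = length * threshold_pct // 100
    used_after = used_before - bytes_freed
    if used_before < threshold:
        return False  # was already below the threshold, so nothing is crossed
    if used_after >= threshold:
        return False  # still at or above the threshold after the free
    return True

# Block group of 100 units, so "used" reads directly as a percentage.
no_trigger = should_reclaim(100, used_before=50, bytes_freed=1, threshold_pct=75)
trigger = should_reclaim(100, used_before=31, bytes_freed=2, threshold_pct=30)
print(no_trigger, trigger)  # False True
```

The first call is the 50->49 case with a threshold of 75: the block group never reached 75, so the free does not cross the threshold.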

Hope that's helpful, and let me know if there is any other assistance I
can provide,
Boris



* Re: permanently wedged in filesystem, fs/btrfs/relocation.c:1937 prepare_to_merge
  2023-08-03 21:12 ` Boris Burkov
@ 2023-08-04  1:23   ` Nicholas D Steeves
  2023-08-04 18:00     ` Boris Burkov
  0 siblings, 1 reply; 4+ messages in thread
From: Nicholas D Steeves @ 2023-08-04  1:23 UTC
  To: Boris Burkov, Chris Murphy; +Cc: Btrfs BTRFS

Boris Burkov <boris@bur.io> writes:

> On Thu, Jul 20, 2023 at 09:42:37AM -0400, Chris Murphy wrote:
>
> The btrfs allocator is far from perfect and despite a few measures that
> attempt to prevent fragmentation, it can still happen. If you have a
> system that reproduces this, you can consider using the scripts I wrote
> here: https://github.com/josefbacik/fsperf/tree/master/src/frag to dump
> the fragmentation level of the FS (and even visualize it) to confirm my
> hypothesis. I'm happy to help you get that up and running.
>
> Now let's suppose you do have a workload that challenges our allocator,
> fragments the data block groups, and chews through all the unallocated
> space. We have a lot of those at Meta, so luckily, there is some relief
> available.
>
> Fundamentally the remediation is to defragment the disk, which we
> do with data block group balancing. You can invoke this manually with:
> `btrfs balance start -dusage=<thresh> <fs>`
> where <thresh> is a percentage fullness of data block_groups to target
> with balancing. Lower is more conservative so you can start low and
> increase it to 80 or so till you reclaim enough space. If you use that,
> it's better to do it proactively periodically rather than after you get
> stuck, 'cause as you saw, balances start failing with ENOSPC too.
> (see point 2. above :))

Would it be useful to use fsperf's frag (module?) in combination with
the required btrd to periodically assess the state of fragmentation?
What are the downsides of doing this?

I'm specifically interested in minimising the risk of "everything was
fine until the fs blew up", and it seems like running this test
periodically would provide useful data that would inform the sysadmin
about whether the risk of rewriting data at rest with a rebalance is
less than the risk of encountering issues triggered by the less than
perfect allocator.

Because it sounds like there still exist workloads that necessitate
periodic rebalancing, sysadmins need a way to determine the degree of
need for rebalancing in order to define a mitigation policy in a
fact-based way.

Is fsperf the correct tool for this general case, or should we be using
something else?


Thanks!
Nicholas

P.S. Please CC me in replies.


* Re: permanently wedged in filesystem, fs/btrfs/relocation.c:1937 prepare_to_merge
  2023-08-04  1:23   ` Nicholas D Steeves
@ 2023-08-04 18:00     ` Boris Burkov
  0 siblings, 0 replies; 4+ messages in thread
From: Boris Burkov @ 2023-08-04 18:00 UTC
  To: Nicholas D Steeves; +Cc: Chris Murphy, Btrfs BTRFS

On Thu, Aug 03, 2023 at 09:23:34PM -0400, Nicholas D Steeves wrote:
> Boris Burkov <boris@bur.io> writes:
> 
> > On Thu, Jul 20, 2023 at 09:42:37AM -0400, Chris Murphy wrote:
> >
> > The btrfs allocator is far from perfect and despite a few measures that
> > attempt to prevent fragmentation, it can still happen. If you have a
> > system that reproduces this, you can consider using the scripts I wrote
> > here: https://github.com/josefbacik/fsperf/tree/master/src/frag to dump
> > the fragmentation level of the FS (and even visualize it) to confirm my
> > hypothesis. I'm happy to help you get that up and running.
> >
> > Now let's suppose you do have a workload that challenges our allocator,
> > fragments the data block groups, and chews through all the unallocated
> > space. We have a lot of those at Meta, so luckily, there is some relief
> > available.
> >
> > Fundamentally the remediation is to defragment the disk, which we
> > do with data block group balancing. You can invoke this manually with:
> > `btrfs balance start -dusage=<thresh> <fs>`
> > where <thresh> is a percentage fullness of data block_groups to target
> > with balancing. Lower is more conservative so you can start low and
> > increase it to 80 or so till you reclaim enough space. If you use that,
> > it's better to do it proactively periodically rather than after you get
> > stuck, 'cause as you saw, balances start failing with ENOSPC too.
> > (see point 2. above :))
> 
> Would it be useful to use fsperf's frag (module?) in combination with
> the required btrd to periodically assess the state of fragmentation?
> What are the downsides of doing this?

I think this is probably overkill, compared to experimenting with
auto-relocation and monitoring relocation/IO. Btrd is designed to run on
a mounted filesystem and uses the SEARCH_V2 ioctl so it should be "fine"
to use, but the script walks the entire extent tree so on a large file
system it will be slow and use lots of memory (it OOMs on my test VMs
when I'm not careful).

I wrote this as a helper for testing out allocator changes targeting
fragmentation. fsperf is our perf testbed, so it runs some workload and
then when it's done on a basically inactive test fs, it runs the script.

I would say that it is unsupported for serious production use, and I
wouldn't use it in that way, but it doesn't use any insane features and
shouldn't crash your system besides normal resource hogging type issues.

I don't have concrete plans for btrfs to track block_group fragmentation
directly (haven't figured out if I can do it efficiently) but it would
be an interesting project for the future.

> 
> I'm specifically interested in minimising the risk of "everything was
> fine until the fs blew up", and it seems like running this test
> periodically would provide useful data that would inform the sysadmin
> about whether the risk of rewriting data at rest with a rebalance is
> less than the risk of encountering issues triggered by the less than
> perfect allocator.
> 
> Because it sounds like there still exist workloads that necessitate
> periodic rebalancing, sysadmins need a way to determine the degree of
> need for rebalancing in order to define a mitigation policy in a
> fact-based way.
> 
> Is fsperf the correct tool for this general case, or should we be using
> something else?

We monitor "unallocated" via btrfs filesystem usage. Unallocated
trending down while data usage % is relatively low is a good sign of
fragmentation and data over-allocation where balance would help.
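
A minimal Python sketch of that kind of check, parsing the human-readable output of `btrfs filesystem usage`. The alert thresholds (less than 1GiB unallocated while data is under 75% full) are illustrative assumptions, not values from this thread. The parser only looks at the first Data line, so a file system with several data profiles would need more care.

```python
import re

UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

def to_bytes(text):
    """Convert a human-readable size such as '47.98GiB' to bytes."""
    number, unit = re.fullmatch(r"([\d.]+)(B|KiB|MiB|GiB|TiB)", text).groups()
    return float(number) * UNITS[unit]

def summarize(usage_output):
    """Return (unallocated bytes, data fullness as a fraction) from usage text."""
    unallocated = re.search(r"Device unallocated:\s+(\S+)", usage_output).group(1)
    size, used = re.search(r"Data,\w+: Size:(\S+), Used:(\S+)", usage_output).groups()
    return to_bytes(unallocated), to_bytes(used) / to_bytes(size)

def looks_overallocated(unallocated, fullness,
                        min_unallocated=1024**3, max_fullness=0.75):
    """Hypothetical rule: little unallocated space while data block groups are not full."""
    return unallocated < min_unallocated and fullness < max_fullness

# Two lines taken from the usage output in the original report.
sample = """\
    Device unallocated:            1.00MiB
Data,single: Size:47.98GiB, Used:27.48GiB (57.29%)
"""
unallocated, fullness = summarize(sample)
flagged = looks_overallocated(unallocated, fullness)
print(unallocated, round(fullness, 2), flagged)
```

Run against the reporter's numbers, this flags the file system: 1MiB unallocated with data block groups only about 57% full.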

> 
> 
> Thanks!
> Nicholas
> 
> P.S. Please CC me in replies.
