* [btrfs] ENOSPC during convert to RAID6/RAID1C4 -> forced RO
       [not found] <e03530c5-6af9-4f7a-9205-21d41dc092e5@archworks.co>
@ 2025-10-26 21:37 ` Sandwich
  2025-10-26 22:11 ` Sandwich
  2025-10-27  2:08 ` Zygo Blaxell
  0 siblings, 2 replies; 6+ messages in thread

From: Sandwich @ 2025-10-26 21:37 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I hit an ENOSPC corner case while converting a 6-disk btrfs from data=single to data=raid6 and metadata/system=raid1c4. After the failure, canceling the balance forces the fs read-only. There is plenty of unallocated space overall, but metadata reports "full" and delayed refs fail. Attempts to add another (empty) device also immediately flip the fs to RO, and the add does not proceed.

I am aware RAID56 is not recommended.

How the filesystem grew: I started with two disks, created the btrfs (data=single), and filled it. I added two more disks and filled it again. After adding the final two disks I attempted the conversion to data=raid6 with metadata/system=raid1c4; that conversion is what triggered the ENOSPC and the current RO behavior. When the convert began, usage was about 51 TiB of the ~118 TiB total device size.

Environment during the incident:

```
uname -r: 6.14.11-4-pve
btrfs --version: btrfs-progs v6.14
quotas: off
unclean shutdowns: none
disks: 2×20 TB (~18.19 TiB) + 4×18 TB (~16.37 TiB)
```

Operation that started it:

```
btrfs balance start -v -dconvert=raid6 -mconvert=raid1c4 -sconvert=raid1c4 /mnt/Data --force
```

Current state: I can mount read-write only with `-o skip_balance`. Running `btrfs balance cancel` immediately forces RO. Mixed profiles remain (data=single+raid6, metadata=raid1+raid1c4, system=raid1+raid1c4). I tried clearing the free-space cache; afterward the free-space tree could not be rebuilt, and subsequent operations hit backref errors (details below). Adding a new device also forces RO and fails.
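As a back-of-the-envelope check (my arithmetic, not output from any tool): converting the ~49 TiB of data reported by `btrfs filesystem show` below to 6-device raid6, plus ~66 GiB of metadata at raid1c4, should fit comfortably in the 118.24 TiB of raw device space:

```shell
# Rough raw-space estimate for the target layout. A 6-device raid6
# stripe stores 4 data strips per 6 strips, so raw = logical * 6/4.
# The 49.07 figure is the "FS bytes used" TiB value from fi show.
data_raw=$(awk 'BEGIN { printf "%.0f", 49.07 * 6 / 4 }')
echo "approx. raw TiB needed for data after convert: ${data_raw} of 118.24"
```

Metadata at raid1c4 adds roughly 4 × 66 GiB of raw space on top, still far below the device total, so the conversion should fit.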
FS Info:

```
# btrfs fi df /mnt/Data
Data, single: total=46.78TiB, used=44.72TiB
Data, RAID6: total=4.35TiB, used=4.29TiB
System, RAID1: total=8.00MiB, used=5.22MiB
System, RAID1C4: total=32.00MiB, used=352.00KiB
Metadata, RAID1: total=56.00GiB, used=50.54GiB
Metadata, RAID1C4: total=10.00GiB, used=9.97GiB
```

```
# btrfs fi usage -T /mnt/Data
Device size:        118.24TiB
Device allocated:    53.46TiB
Device unallocated:  64.78TiB
Used:                51.29TiB
Free (estimated):    64.20TiB  (min: 18.26TiB)
Free (statfs, df):   33.20TiB
Data ratio:              1.04
Metadata ratio:          2.33
Multiple profiles:        yes  (data, metadata, system)
```

```
# btrfs filesystem show /mnt/Data
Label: 'Data'  uuid: 7aa7fdb3-b3de-421c-bc86-daba55fc46f6
        Total devices 6 FS bytes used 49.07TiB
        devid    1 size 18.19TiB used 16.23TiB path /dev/sdf
        devid    2 size 18.19TiB used 16.23TiB path /dev/sdg
        devid    3 size 16.37TiB used 14.54TiB path /dev/sdc
        devid    4 size 16.37TiB used 4.25TiB path /dev/sdb
        devid    5 size 16.37TiB used 1.10TiB path /dev/sdd
        devid    6 size 16.37TiB used 1.10TiB path /dev/sde
```

Initial dmesg at first failure (before cache clear):

```
Oct 26 15:25:39 anthem kernel: BTRFS info (device sdf state A): space_info DATA has 68817772544 free, is not full
Oct 26 15:25:39 anthem kernel: BTRFS info (device sdf state A): space_info total=56215761584128, used=53883609358336, pinned=1023410176, reserved=0, may_use=0, readonly=2262311043072 zone_unusable=0
Oct 26 15:25:39 anthem kernel: BTRFS info (device sdf state A): space_info METADATA has 4869275648 free, is full
Oct 26 15:25:39 anthem kernel: BTRFS info (device sdf state A): space_info total=70866960384, used=64968261632, pinned=33210368, reserved=5832704, may_use=990248960, readonly=131072 zone_unusable=0
Oct 26 15:25:39 anthem kernel: BTRFS info (device sdf state A): space_info SYSTEM has 35307520 free, is not full
Oct 26 15:25:39 anthem kernel: BTRFS info (device sdf state A): space_info total=41943040, used=5832704, pinned=802816, reserved=0, may_use=0, readonly=0 zone_unusable=0
Oct 26 15:25:39 anthem kernel: BTRFS info (device sdf state A): global_block_rsv: size 536870912 reserved 536870912
Oct 26 15:25:39 anthem kernel: BTRFS info (device sdf state A): trans_block_rsv: size 0 reserved 0
Oct 26 15:25:39 anthem kernel: BTRFS info (device sdf state A): chunk_block_rsv: size 0 reserved 0
Oct 26 15:25:39 anthem kernel: BTRFS info (device sdf state A): delayed_block_rsv: size 0 reserved 0
Oct 26 15:25:39 anthem kernel: BTRFS info (device sdf state A): delayed_refs_rsv: size 453378048 reserved 453378048
Oct 26 15:25:39 anthem kernel: BTRFS: error (device sdf state A) in __btrfs_free_extent:3211: errno=-28 No space left
Oct 26 15:25:39 anthem kernel: BTRFS info (device sdf state EA): forced readonly
Oct 26 15:25:39 anthem kernel: BTRFS error (device sdf state EA): failed to run delayed ref for logical 72899131047936 num_bytes 16384 type 176 action 2 ref_mod 1: -28
Oct 26 15:25:39 anthem kernel: BTRFS: error (device sdf state EA) in btrfs_run_delayed_refs:2160: errno=-28 No space left
Oct 26 15:33:16 anthem kernel: BTRFS info (device sdf state EA): last unmount of filesystem 7aa7fdb3-b3de-421c-bc86-daba55fc46f6
```

Later dmesg around ENOSPC/RO:

```
BTRFS info (device sdf state A): space_info DATA has 9918180220928 free, is not full
BTRFS info (device sdf state A): space_info METADATA has 5322637312 free, is full
BTRFS: error (device sdf state A) in btrfs_run_delayed_refs:2160: errno=-28 No space left
BTRFS info (device sdf state EA): forced readonly
BTRFS: error (device sdf state EA) in reset_balance_state:3793: errno=-28 No space left
BTRFS info (device sdf state EA): balance: canceled
```

`btrfs.static check --readonly /dev/sdf` with 6.17 shows backref issues:

```
...
backpointer mismatch on [81544970633216 16384]
owner ref check failed [81544970633216 16384]
ref mismatch on [81544977776640 16384] extent item 1, found 0
tree extent[81544977776640, 16384] root 10 has no tree block found
incorrect global backref count on 81544977776640 found 1 wanted 0
...
could not load free space tree: No such file or directory
could not load free space tree: No such file or directory
could not load free space tree: No such file or directory
[5/8] checking fs roots
[6/8] checking only csums items (without verifying data)
[7/8] checking root refs
[8/8] checking quota groups skipped (not enabled on this FS)
found 53948588761088 bytes used, error(s) found
total csum bytes: 52620712264
total tree bytes: 64969162752
total fs tree bytes: 6759579648
total extent tree bytes: 3808886784
btree space waste bytes: 3059147492
file data blocks allocated: 53958044786688
 referenced 56765362577408
```

Full check log: https://pastebin.com/8tJWeBnM

After clearing the cache, the free-space tree cannot be rebuilt; the kernel shows:

```
BTRFS: error (device sdf state A) in __btrfs_free_extent:3205: errno=-117 Filesystem corrupted
BTRFS info (device sdf state EA): forced readonly
BTRFS critical (device sdf state EA): unable to find ref byte nr 69983021023232 parent 0 root 10 owner 0 offset 0 slot 9
BTRFS error (device sdf state EA): failed to run delayed ref for logical 69983021023232 num_bytes 16384 type 176 action 2 ref_mod 1: -2
BTRFS: error (device sdf state EA) in btrfs_run_delayed_refs:2160: errno=-2 No such entry
BTRFS: error (device sdf state EA) in reset_balance_state:3793: errno=-2 No such entry
BTRFS info (device sdf state EA): balance: canceled
```

Timeline: the first two disks were single and filled, then two more were added and filled, then the last two were added and the convert was attempted. Usage at convert start was ~51 TiB used of ~118 TiB total.
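For what it's worth, the METADATA space_info numbers in the first dmesg excerpt are internally consistent. Assuming the reported "free" is roughly total minus the accounted categories (a simplification of the kernel's space accounting, not an authoritative formula), the value can be reproduced:

```shell
# Values copied from the METADATA space_info dump above.
total=70866960384; used=64968261632; pinned=33210368
reserved=5832704; may_use=990248960; readonly=131072
free=$(( total - used - pinned - reserved - may_use - readonly ))
echo "metadata free: ${free} bytes (~4.5 GiB)"
```

This matches the 4869275648 the kernel reported. As I understand the accounting, "is full" is a flag set when a previous chunk allocation attempt failed, not a statement that free is zero, which is how ~4.5 GiB can show as free while delayed refs still hit ENOSPC.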
Exact command history around the incident:

```
Sat 25 Oct 2025 18:16:40 CEST  btrfs device add /dev/sdd /mnt/Data
Sat 25 Oct 2025 18:18:18 CEST  btrfs device add /dev/sde /mnt/Data
Sat 25 Oct 2025 18:18:21 CEST  mount -a
Sat 25 Oct 2025 18:19:01 CEST  tmux new-session -s raid6
Sat 25 Oct 2025 18:41:17 CEST  btrfs balance start -v -dconvert=raid6 -mconvert=raid1c4 -sconvert=raid1c4 /mnt/Data --force
Sat 25 Oct 2025 18:41:22 CEST  btrfs balance status /mnt/Data
Sat 25 Oct 2025 18:41:47 CEST  watch btrfs balance status /mnt/Data
Sat 25 Oct 2025 18:56:37 CEST  mount -o remount /dev/sdb /mnt/Data
Sun 26 Oct 2025 13:50:59 CET   mount -o remount /dev/sde /mnt/Data
Sun 26 Oct 2025 13:51:38 CET   mount /dev/sdb /mnt/Data
Sun 26 Oct 2025 13:52:33 CET   cd /mnt/Data/
Sun 26 Oct 2025 13:59:29 CET   touch /mnt/Data/test
Sun 26 Oct 2025 14:40:31 CET   vim /etc/fstab
Sun 26 Oct 2025 14:40:31 CET   vim /etc/exports
Sun 26 Oct 2025 14:41:05 CET   exportfs -arv
Sun 26 Oct 2025 14:44:15 CET   reboot
Sun 26 Oct 2025 14:47:27 CET   lsof +D /mnt/Data
Sun 26 Oct 2025 14:52:26 CET   mount -o rw,skip_balance,noatime,space_cache=v2 /dev/sdb /mnt/Data
Sun 26 Oct 2025 15:01:10 CET   btrfs fi df /mnt/Data
Sun 26 Oct 2025 15:04:58 CET   mount -o rw,skip_balance /dev/sdb /mnt/Data
Sun 26 Oct 2025 15:08:56 CET   btrfs device add -f /dev/sdh /mnt/Data
Sun 26 Oct 2025 15:10:34 CET   btrfs filesystem show /mnt/Data
Sun 26 Oct 2025 15:10:55 CET   btrfs rescue zero-log /dev/sdf
Sun 26 Oct 2025 15:14:32 CET   umount /mnt/Data
Sun 26 Oct 2025 15:23:56 CET   btrfs check --readonly /dev/sdf
Sun 26 Oct 2025 15:33:22 CET   btrfs rescue clear-space-cache v2 /dev/sdf
Sun 26 Oct 2025 15:53:35 CET   btrfs rescue clear-uuid-tree /dev/sdf
Sun 26 Oct 2025 16:12:17 CET   wipefs -af /dev/sdh
Sun 26 Oct 2025 16:51:20 CET   mount -o rw,skip_balance LABEL=Data /mnt/Data
Sun 26 Oct 2025 16:51:30 CET   btrfs device add -K -f /dev/sdh /mnt/Data
Sun 26 Oct 2025 17:04:36 CET   btrfs balance status -v /mnt/Data
Sun 26 Oct 2025 17:07:12 CET   umount /mnt/Data
Sun 26 Oct 2025 17:13:26 CET   btrfs check --readonly -s 1 /dev/sdb
Sun 26 Oct 2025 17:21:36 CET   mount -o rw,skip_balance,clear_cache,noatime LABEL=Data /mnt/Data
Sun 26 Oct 2025 18:51:27 CET   btrfs fi usage /mnt/Data
Sun 26 Oct 2025 18:59:52 CET   btrfs balance cancel /mnt/Data
Sun 26 Oct 2025 19:02:26 CET   touch /mnt/Data/test
Sun 26 Oct 2025 19:38:23 CET   wget https://github.com/kdave/btrfs-progs/releases/download/v6.17/btrfs.static
Sun 26 Oct 2025 19:40:53 CET   chmod +x btrfs.static
Sun 26 Oct 2025 19:41:09 CET   ./btrfs.static
Sun 26 Oct 2025 19:41:14 CET   umount -R /mnt/Data
Sun 26 Oct 2025 19:42:02 CET   ./btrfs.static check --readonly /dev/sdf
Sun 26 Oct 2025 21:16:09 CET   mount -o rw,skip_balance,noatime LABEL=Data /mnt/Data
Sun 26 Oct 2025 21:31:22 CET   ./btrfs.static balance cancel /mnt/Data
```

Full incident kernel log: https://pastebin.com/KxP7Xa3g

I'm looking for a safe recovery path. Is there a supported way to unwind or complete the in-flight convert first (for example, freeing metadata space or running a limited balance), or should I avoid that and take a different route? If proceeding is risky, given that there are no `Data,single` chunks on `/dev/sdd` and `/dev/sde`, is it safe to remove those two devices to free room and try again? If that is reasonable, what exact sequence (device remove/replace vs. zeroing; mount options) would you recommend to minimize further damage?

Thanks,
Sandwich

--
note: this email's formatting was prepared with the help of an LLM.

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [btrfs] ENOSPC during convert to RAID6/RAID1C4 -> forced RO
  2025-10-26 21:37 ` [btrfs] ENOSPC during convert to RAID6/RAID1C4 -> forced RO Sandwich
@ 2025-10-26 22:11 ` Sandwich
  2025-10-27  2:08 ` Zygo Blaxell
  1 sibling, 0 replies; 6+ messages in thread

From: Sandwich @ 2025-10-26 22:11 UTC (permalink / raw)
  To: linux-btrfs

I have just retested this on the latest Arch Linux kernel 6.17.4 with 6.17 progs, with the same errors and the same issue.

On 10/26/25 10:37 PM, Sandwich wrote:
> hi,
>
> i hit an ENOSPC corner case converting a 6-disk btrfs from data=single to data=raid6 and metadata/system=raid1c4. after the failure, canceling the balance forces the fs read-only. there's plenty of unallocated space overall, but metadata reports "full" and delayed refs fail. attempts to add another (empty) device also immediately flip the fs to RO and the add does not proceed.
> [...]

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [btrfs] ENOSPC during convert to RAID6/RAID1C4 -> forced RO
  2025-10-26 21:37 ` [btrfs] ENOSPC during convert to RAID6/RAID1C4 -> forced RO Sandwich
  2025-10-26 22:11 ` Sandwich
@ 2025-10-27  2:08 ` Zygo Blaxell
  2025-10-27 13:20 ` Sandwich
  1 sibling, 1 reply; 6+ messages in thread

From: Zygo Blaxell @ 2025-10-27 2:08 UTC (permalink / raw)
  To: Sandwich; +Cc: linux-btrfs

On Sun, Oct 26, 2025 at 10:37:02PM +0100, Sandwich wrote:
> hi,
>
> i hit an ENOSPC corner case converting a 6-disk btrfs from data=single
> to data=raid6 and metadata/system=raid1c4. after the failure, canceling
> the balance forces the fs read-only. there's plenty of unallocated space
> overall, but metadata reports "full" and delayed refs fail. attempts
> to add another (empty) device also immediately flip the fs to RO and
> the add does not proceed.
>
> how the filesystem grew:
> i started with two disks, created btrfs (data=single), and filled
> it. i added two more disks and filled it again. after adding the
> final two disks i attempted the conversion to data=raid6 with
> metadata/system=raid1c4—that conversion is what triggered ENOSPC
> and the current RO behavior. when the convert began, usage was about
> 51 TiB used out of ~118 TiB total device size.
>
> environment during the incident:
>
> ```
> uname -r: 6.14.11-4-pve
[...]
> ```
>
> operation that started it:
>
> ```
> btrfs balance start -v -dconvert=raid6 -mconvert=raid1c4 -sconvert=raid1c4 /mnt/Data --force
> ```
>
> current state:
> i can mount read-write only with `-o skip_balance`. running
> `btrfs balance cancel` immediately forces RO. mixed profiles remain
> (data=single+raid6, metadata=raid1+raid1c4, system=raid1+raid1c4). i
> tried clearing the free-space cache, afterward the free-space tree
> could not be rebuilt and subsequent operations hit backref errors
> (details below). adding a new device also forces RO and fails.
>
> FS Info:
>
> ```
> # btrfs fi usage -T /mnt/Data
> Device size: 118.24TiB
> Device allocated: 53.46TiB
> Device unallocated: 64.78TiB
> Used: 51.29TiB
> Free (estimated): 64.20TiB (min: 18.26TiB)
> Free (statfs, df): 33.20TiB
> Data ratio: 1.04
> Metadata ratio: 2.33
> Multiple profiles: yes (data, metadata, system)
> ```

You left out the most important part of the `fi usage -T` information:
the table...

> ```
> # btrfs filesystem show /mnt/Data
> Label: 'Data' uuid: 7aa7fdb3-b3de-421c-bc86-daba55fc46f6
> Total devices 6 FS bytes used 49.07TiB
> devid 1 size 18.19TiB used 16.23TiB path /dev/sdf
> devid 2 size 18.19TiB used 16.23TiB path /dev/sdg
> devid 3 size 16.37TiB used 14.54TiB path /dev/sdc
> devid 4 size 16.37TiB used 4.25TiB path /dev/sdb
> devid 5 size 16.37TiB used 1.10TiB path /dev/sdd
> devid 6 size 16.37TiB used 1.10TiB path /dev/sde
> ```

...but from here we can guess there's between 2 and 14 TiB on each device,
which should more than satisfy the requirements for raid1c4.

So this is _not_ the expected problem in this scenario, where the
filesystem fills up too many of the drives too soon, and legitimately
can't continue balancing.

It looks like an allocator bug.

> full incident kernel log:
> https://pastebin.com/KxP7Xa3g
>
> i'm looking for a safe recovery path. is there a supported way to
> unwind or complete the in-flight convert first (for example, freeing
> metadata space or running a limited balance), or should i avoid that
> and take a different route? if proceeding is risky, given that there
> are no `Data,single` chunks on `/dev/sdd` and `/dev/sde`, is it safe
> to remove those two devices to free room and try again? if that's
> reasonable, what exact sequence (device remove/replace vs zeroing;
> mount options) would you recommend to minimize further damage?

The safe recovery path is to get a fix for the allocator bug so that you
can finish the converting balance, either to raid1c4 or any other profile.

This operation (balance) is something you should be able to do with
current usage. There's no other way to get out of this situation, but
a kernel bug is interfering with the balance.

Removing devices definitely won't help, and may trigger other issues
with raid6. Don't try that.

You could try an up-to-date 6.6 or 6.12 LTS kernel, in case there's a
regression in newer kernels. Don't use a kernel older than 6.6 with
raid6.

Mount options 'nossd,skip_balance,nodiscard,noatime' should minimize
the short-term metadata requirements, which might just be enough to
cancel the balance and start a convert in the other direction.

> thanks,
> sandwich
[...]

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [btrfs] ENOSPC during convert to RAID6/RAID1C4 -> forced RO
  2025-10-27  2:08 ` Zygo Blaxell
@ 2025-10-27 13:20 ` Sandwich
  2025-10-29 22:06 ` Sandwich
  0 siblings, 1 reply; 6+ messages in thread

From: Sandwich @ 2025-10-27 13:20 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

Thank you for your reply.

Unfortunately, older kernels including 6.6, 6.8, and 6.12 did not help here. I used the suggested mount options `nossd,skip_balance,nodiscard,noatime` and tried to cancel and resume the balance with them; the result was the same as before.

`btrfs fi usage -T /mnt/Data`:

```
root@anthem ~ # btrfs fi usage -T /mnt/Data
Overall:
    Device size:                 118.24TiB
    Device allocated:             53.46TiB
    Device unallocated:           64.78TiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                         51.29TiB
    Free (estimated):             64.20TiB  (min: 18.26TiB)
    Free (statfs, df):            33.20TiB
    Data ratio:                       1.04
    Metadata ratio:                   2.33
    Global reserve:              512.00MiB  (used: 0.00B)
    Multiple profiles:                 yes  (data, metadata, system)

            Data      Data     Metadata  Metadata  System   System
Id Path     single    RAID6    RAID1     RAID1C4   RAID1    RAID1C4   Unallocated Total     Slack
-- -------- --------  -------  --------  --------  -------  --------- ----------- --------- -----
 1 /dev/sdf 15.10TiB  1.09TiB  35.00GiB   8.00GiB  8.00MiB   32.00MiB     1.96TiB  18.19TiB     -
 2 /dev/sdg 15.10TiB  1.09TiB  44.00GiB   2.00GiB  8.00MiB          -     1.96TiB  18.19TiB     -
 3 /dev/sdc 13.43TiB  1.09TiB  29.00GiB         -        -          -     1.83TiB  16.37TiB     -
 4 /dev/sdb  3.14TiB  1.09TiB   4.00GiB  11.00GiB        -   32.00MiB    12.12TiB  16.37TiB     -
 5 /dev/sdd        -  1.09TiB         -  11.00GiB        -   32.00MiB    15.27TiB  16.37TiB     -
 6 /dev/sde        -  1.09TiB         -  11.00GiB        -   32.00MiB    15.27TiB  16.37TiB     -
 7 /dev/sdh        -        -         -   1.00GiB        -          -    16.37TiB  16.37TiB     -
-- -------- --------  -------  --------  --------  -------  --------- ----------- --------- -----
   Total    46.78TiB  4.35TiB  56.00GiB  11.00GiB  8.00MiB   32.00MiB    64.78TiB 118.24TiB 0.00B
   Used     44.72TiB  4.29TiB  50.54GiB   9.96GiB  5.22MiB  352.00KiB
```

What information is needed to trace this bug?
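As a side note, the per-device table above makes the allocator-bug theory concrete: a raid1c4 chunk only needs unallocated space on 4 distinct devices, and by my count every device qualifies (illustrative arithmetic, with the Unallocated TiB values copied from the table):

```shell
# Count devices whose Unallocated column is at least ~1 GiB (0.001 TiB);
# raid1c4 chunk allocation needs 4 such devices.
count=$(printf '%s\n' 1.96 1.96 1.83 12.12 15.27 15.27 16.37 \
  | awk '$1 >= 0.001 { n++ } END { print n }')
echo "devices with unallocated space: ${count} (raid1c4 needs 4)"
```

So the new-chunk requirement looks trivially satisfiable here, which is consistent with the allocation failure being a bug rather than genuine exhaustion.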
If you're willing to help on the code side, I would gladly provide any information or test patches. In the meantime, I have started backing up the most important data off the array.

BR
Sandwich

On 10/27/25 3:08 AM, Zygo Blaxell wrote:
> On Sun, Oct 26, 2025 at 10:37:02PM +0100, Sandwich wrote:
> [...]
> ...but from here we can guess there's between 2 and 14 TiB on each device,
> which should more than satisfy the requirements for raid1c4.
>
> So this is _not_ the expected problem in this scenario, where the
> filesystem fills up too many of the drives too soon, and legitimately
> can't continue balancing.
>
> It looks like an allocator bug.
> [...]
> Mount options 'nossd,skip_balance,nodiscard,noatime' should minimize
> the short-term metadata requirements, which might just be enough to
> cancel the balance and start a convert in the other direction.
> [...]

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [btrfs] ENOSPC during convert to RAID6/RAID1C4 -> forced RO 2025-10-27 13:20 ` Sandwich @ 2025-10-29 22:06 ` Sandwich 2025-10-30 18:16 ` Goffredo Baroncelli 0 siblings, 1 reply; 6+ messages in thread From: Sandwich @ 2025-10-29 22:06 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs Hi, In 1-2 days I will migrate my storage and reformat this array, so this story will end for me. In case someone needs to replicate this: Create a two-disk filesystem with data=single: `mkfs.btrfs -L NewData -d single -m raid1 /dev/sda /dev/sdb` Fill it up until it is full. Add 2 extra drives: `btrfs device add /dev/sdx /some/path` Fill the whole array until about 50% total usage. Start the conversion with `btrfs balance start -v -dconvert=raid6 -mconvert=raid1c4 -sconvert=raid1c4 /mnt/Data --force` In hindsight, using the limit filters and converting the chunks bit by bit might have helped here, like this: `btrfs balance start -mconvert=raid1c4,soft,limit=1 -sconvert=raid1c4,soft,limit=1 /mnt/Data` Best Sandwich On 10/27/25 2:20 PM, Sandwich wrote: > Thank you for your reply, > Unfortunately, older kernels including 6.6, 6.8, 6.12 did not help here. > I've used the suggested mount options `nossd,skip_balance,nodiscard,noatime`, and tried to cancel and resume the balance with it. > The result stayed the same as previously. 
> > `btrfs fi usage -T /mnt/Data`: > ``` > root@anthem ~ # btrfs fi usage -T /mnt/Data > Overall: > Device size: 118.24TiB > Device allocated: 53.46TiB > Device unallocated: 64.78TiB > Device missing: 0.00B > Device slack: 0.00B > Used: 51.29TiB > Free (estimated): 64.20TiB (min: 18.26TiB) > Free (statfs, df): 33.20TiB > Data ratio: 1.04 > Metadata ratio: 2.33 > Global reserve: 512.00MiB (used: 0.00B) > Multiple profiles: yes (data, metadata, system) > > Data Data Metadata Metadata System System > Id Path single RAID6 RAID1 RAID1C4 RAID1 RAID1C4 Unallocated Total Slack > -- -------- -------- ------- -------- -------- ------- --------- ----------- --------- ----- > 1 /dev/sdf 15.10TiB 1.09TiB 35.00GiB 8.00GiB 8.00MiB 32.00MiB 1.96TiB 18.19TiB - > 2 /dev/sdg 15.10TiB 1.09TiB 44.00GiB 2.00GiB 8.00MiB - 1.96TiB 18.19TiB - > 3 /dev/sdc 13.43TiB 1.09TiB 29.00GiB - - - 1.83TiB 16.37TiB - > 4 /dev/sdb 3.14TiB 1.09TiB 4.00GiB 11.00GiB - 32.00MiB 12.12TiB 16.37TiB - > 5 /dev/sdd - 1.09TiB - 11.00GiB - 32.00MiB 15.27TiB 16.37TiB - > 6 /dev/sde - 1.09TiB - 11.00GiB - 32.00MiB 15.27TiB 16.37TiB - > 7 /dev/sdh - - - 1.00GiB - - 16.37TiB 16.37TiB - > -- -------- -------- ------- -------- -------- ------- --------- ----------- --------- ----- > Total 46.78TiB 4.35TiB 56.00GiB 11.00GiB 8.00MiB 32.00MiB 64.78TiB 118.24TiB 0.00B > Used 44.72TiB 4.29TiB 50.54GiB 9.96GiB 5.22MiB 352.00KiB > ``` > > What information is needed to trace this bug? > If you're willing to help me on the code side, I would gladly provide you with any information or test patches. > > In the meantime, I start to back up the most important data out of the array. > > BR > Sandwich > > On 10/27/25 3:08 AM, Zygo Blaxell wrote: >> On Sun, Oct 26, 2025 at 10:37:02PM +0100, Sandwich wrote: >>> hi, >>> >>> i hit an ENOSPC corner case converting a 6-disk btrfs from data=single >>> to data=raid6 and metadata/system=raid1c4. after the failure, canceling >>> the balance forces the fs read-only. 
there's plenty of unallocated space >>> overall, but metadata reports "full" and delayed refs fail. attempts >>> to add another (empty) device also immediately flip the fs to RO and >>> the add does not proceed. >>> >>> how the filesystem grew: >>> i started with two disks, created btrfs (data=single), and filled >>> it. i added two more disks and filled it again. after adding the >>> final two disks i attempted the conversion to data=raid6 with >>> metadata/system=raid1c4—that conversion is what triggered ENOSPC >>> and the current RO behavior. when the convert began, usage was about >>> 51 TiB used out of ~118 TiB total device size. >>> >>> environment during the incident: >>> >>> ``` >>> uname -r: 6.14.11-4-pve >> [...] >>> ``` >>> >>> operation that started it: >>> >>> ``` >>> btrfs balance start -v -dconvert=raid6 -mconvert=raid1c4 -sconvert=raid1c4 /mnt/Data --force >>> ``` >>> >>> current state: >>> i can mount read-write only with `-o skip_balance`. running >>> `btrfs balance cancel` immediately forces RO. mixed profiles remain >>> (data=single+raid6, metadata=raid1+raid1c4, system=raid1+raid1c4). i >>> tried clearing the free-space cache, afterward the free-space tree >>> could not be rebuilt and subsequent operations hit backref errors >>> (details below). adding a new device also forces RO and fails. >>> >>> FS Info: >>> >>> ``` >>> # btrfs fi usage -T /mnt/Data >>> Device size: 118.24TiB >>> Device allocated: 53.46TiB >>> Device unallocated: 64.78TiB >>> Used: 51.29TiB >>> Free (estimated): 64.20TiB (min: 18.26TiB) >>> Free (statfs, df): 33.20TiB >>> Data ratio: 1.04 >>> Metadata ratio: 2.33 >>> Multiple profiles: yes (data, metadata, system) >>> ``` >> You left out the most important part of the `fi usage -T` information: >> the table... 
>> >>> ``` >>> # btrfs filesystem show /mnt/Data >>> Label: 'Data' uuid: 7aa7fdb3-b3de-421c-bc86-daba55fc46f6 >>> Total devices 6 FS bytes used 49.07TiB >>> devid 1 size 18.19TiB used 16.23TiB path /dev/sdf >>> devid 2 size 18.19TiB used 16.23TiB path /dev/sdg >>> devid 3 size 16.37TiB used 14.54TiB path /dev/sdc >>> devid 4 size 16.37TiB used 4.25TiB path /dev/sdb >>> devid 5 size 16.37TiB used 1.10TiB path /dev/sdd >>> devid 6 size 16.37TiB used 1.10TiB path /dev/sde >>> ``` >> ...but from here we can guess there's between 2 and 14 TiB on each device, >> which should more than satisfy the requirements for raid1c4. >> >> So this is _not_ the expected problem in this scenario, where the >> filesystem fills up too many of the drives too soon, and legitimately >> can't continue balancing. >> >> It looks like an allocator bug. >> >>> full incident kernel log: >>> https://pastebin.com/KxP7Xa3g >>> >>> i’m looking for a safe recovery path. is there a supported way to >>> unwind or complete the in-flight convert first (for example, freeing >>> metadata space or running a limited balance), or should i avoid that >>> and take a different route? if proceeding is risky, given that there >>> are no `Data,single` chunks on `/dev/sdd` and `/dev/sde`, is it safe >>> to remove those two devices to free room and try again? if that’s >>> reasonable, what exact sequence (device remove/replace vs zeroing; >>> mount options) would you recommend to minimize further damage? >> The safe recovery path is to get a fix for the allocator bug so that you >> can finish the converting balance, either to raid1c4 or any other profile. >> >> This operation (balance) is something you should be able to do with >> current usage. There's no other way to get out of this situation, >> but a kernel bug is interfering with the balance. >> >> Removing devices definitely won't help, and may trigger other issues >> with raid6. Don't try that. 
>> >> You could try an up-to-date 6.6 or 6.12 LTS kernel, in case there's a >> regression in newer kernels. Don't use a kernel older than 6.6 with >> raid6. >> >> Mount options 'nossd,skip_balance,nodiscard,noatime' should minimize >> the short-term metadata requirements, which might just be enough to >> cancel the balance and start a convert in the other direction. >> >>> thanks, >>> sandwich >> [...] >> > ^ permalink raw reply [flat|nested] 6+ messages in thread
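[Editorial note: the reproduction steps Sandwich lists above can be sketched as a script. This is an untested outline, not Sandwich's actual procedure — the device paths, mount point, and fill steps are placeholders, and by default (`DRY_RUN=1`) it only prints each command so the sequence can be reviewed without touching any disks.]

```shell
#!/bin/sh
# Reproducer sketch for the ENOSPC-on-convert scenario described above.
# Device paths and the mount point are placeholders; the fill steps are
# left as comments. DRY_RUN=1 (the default) only prints each command.
set -eu
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

MNT=/mnt/Data   # placeholder mount point

# 1. two-disk filesystem, data=single / metadata=raid1
run mkfs.btrfs -L NewData -d single -m raid1 /dev/sda /dev/sdb
run mount /dev/sda "$MNT"
#    ...fill the filesystem until it is full...

# 2. grow the array by two more devices
run btrfs device add /dev/sdc /dev/sdd "$MNT"
#    ...fill again, to roughly 50% of the new total...

# 3. the convert that triggered ENOSPC and the forced-RO behavior
run btrfs balance start -v -dconvert=raid6 -mconvert=raid1c4 \
    -sconvert=raid1c4 "$MNT" --force
```

Running it for real (`DRY_RUN=0`) requires root and scratch hardware; the dry run is only meant to document the order of operations.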
* Re: [btrfs] ENOSPC during convert to RAID6/RAID1C4 -> forced RO 2025-10-29 22:06 ` Sandwich @ 2025-10-30 18:16 ` Goffredo Baroncelli 0 siblings, 0 replies; 6+ messages in thread From: Goffredo Baroncelli @ 2025-10-30 18:16 UTC (permalink / raw) To: Sandwich, Zygo Blaxell; +Cc: linux-btrfs On 29/10/2025 23.06, Sandwich wrote: > Hi, I will in 1-2 days migrate my storage and format this array, so this story will end for me. > In case, someone needs to replicate this. > > Make 2 disks in a single array: > `mkfs.btrfs -L NewData -d single -m raid1 /dev/sda /dev/sdb` > > Fill them up till their full: > Add 2 extra drives `btrfs device add /dev/sdx /some/path` During a conversion to raid6/raid1c4 with 4 disks, the filesystem needs to allocate a portion of the BG in all the disks. But if two are already filled, it fails (but this is expected). > Fill the whole array till about 50% total usage. > Start the conversion with `btrfs balance start -v -dconvert=raid6 -mconvert=raid1c4 -sconvert=raid1c4 /mnt/Data --force` In fact, when I tried to reproduce (single disk size=10GB), I got the error ERROR: error during balancing 't/.': No space left on device $ uname -a Linux venice.bhome 6.16.12+deb14+1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.16.12-2 (2025-10-16) x86_64 GNU/Linux > > In the after sign, using the limiters and converting the chunks bit by bit might have helped here. > Like this `btrfs balance start -mconvert=raid1c4,soft,limit=1 -sconvert=raid1c4,soft,limit=1 /mnt/Data` > > Best > Sandwich > > On 10/27/25 2:20 PM, Sandwich wrote: >> Thank you for your reply, >> Unfortunately, older kernels including 6.6, 6.8, 6.12 did not help here. >> I've used the suggested mount options `nossd,skip_balance,nodiscard,noatime`, and tried to cancel and resume the balance with it. >> The result stayed the same as previously. 
>> >> `btrfs fi usage -T /mnt/Data`: >> ``` >> root@anthem ~ # btrfs fi usage -T /mnt/Data >> Overall: >> Device size: 118.24TiB >> Device allocated: 53.46TiB >> Device unallocated: 64.78TiB >> Device missing: 0.00B >> Device slack: 0.00B >> Used: 51.29TiB >> Free (estimated): 64.20TiB (min: 18.26TiB) >> Free (statfs, df): 33.20TiB >> Data ratio: 1.04 >> Metadata ratio: 2.33 >> Global reserve: 512.00MiB (used: 0.00B) >> Multiple profiles: yes (data, metadata, system) >> >> Data Data Metadata Metadata System System >> Id Path single RAID6 RAID1 RAID1C4 RAID1 RAID1C4 Unallocated Total Slack >> -- -------- -------- ------- -------- -------- ------- --------- ----------- --------- ----- >> 1 /dev/sdf 15.10TiB 1.09TiB 35.00GiB 8.00GiB 8.00MiB 32.00MiB 1.96TiB 18.19TiB - >> 2 /dev/sdg 15.10TiB 1.09TiB 44.00GiB 2.00GiB 8.00MiB - 1.96TiB 18.19TiB - >> 3 /dev/sdc 13.43TiB 1.09TiB 29.00GiB - - - 1.83TiB 16.37TiB - >> 4 /dev/sdb 3.14TiB 1.09TiB 4.00GiB 11.00GiB - 32.00MiB 12.12TiB 16.37TiB - >> 5 /dev/sdd - 1.09TiB - 11.00GiB - 32.00MiB 15.27TiB 16.37TiB - >> 6 /dev/sde - 1.09TiB - 11.00GiB - 32.00MiB 15.27TiB 16.37TiB - >> 7 /dev/sdh - - - 1.00GiB - - 16.37TiB 16.37TiB - >> -- -------- -------- ------- -------- -------- ------- --------- ----------- --------- ----- >> Total 46.78TiB 4.35TiB 56.00GiB 11.00GiB 8.00MiB 32.00MiB 64.78TiB 118.24TiB 0.00B >> Used 44.72TiB 4.29TiB 50.54GiB 9.96GiB 5.22MiB 352.00KiB >> ``` >> >> What information is needed to trace this bug? >> If you're willing to help me on the code side, I would gladly provide you with any information or test patches. >> >> In the meantime, I start to back up the most important data out of the array. >> >> BR >> Sandwich >> >> On 10/27/25 3:08 AM, Zygo Blaxell wrote: >>> On Sun, Oct 26, 2025 at 10:37:02PM +0100, Sandwich wrote: >>>> hi, >>>> >>>> i hit an ENOSPC corner case converting a 6-disk btrfs from data=single >>>> to data=raid6 and metadata/system=raid1c4. 
after the failure, canceling >>>> the balance forces the fs read-only. there's plenty of unallocated space >>>> overall, but metadata reports "full" and delayed refs fail. attempts >>>> to add another (empty) device also immediately flip the fs to RO and >>>> the add does not proceed. >>>> >>>> how the filesystem grew: >>>> i started with two disks, created btrfs (data=single), and filled >>>> it. i added two more disks and filled it again. after adding the >>>> final two disks i attempted the conversion to data=raid6 with >>>> metadata/system=raid1c4—that conversion is what triggered ENOSPC >>>> and the current RO behavior. when the convert began, usage was about >>>> 51 TiB used out of ~118 TiB total device size. >>>> >>>> environment during the incident: >>>> >>>> ``` >>>> uname -r: 6.14.11-4-pve >>> [...] >>>> ``` >>>> >>>> operation that started it: >>>> >>>> ``` >>>> btrfs balance start -v -dconvert=raid6 -mconvert=raid1c4 -sconvert=raid1c4 /mnt/Data --force >>>> ``` >>>> >>>> current state: >>>> i can mount read-write only with `-o skip_balance`. running >>>> `btrfs balance cancel` immediately forces RO. mixed profiles remain >>>> (data=single+raid6, metadata=raid1+raid1c4, system=raid1+raid1c4). i >>>> tried clearing the free-space cache, afterward the free-space tree >>>> could not be rebuilt and subsequent operations hit backref errors >>>> (details below). adding a new device also forces RO and fails. >>>> >>>> FS Info: >>>> >>>> ``` >>>> # btrfs fi usage -T /mnt/Data >>>> Device size: 118.24TiB >>>> Device allocated: 53.46TiB >>>> Device unallocated: 64.78TiB >>>> Used: 51.29TiB >>>> Free (estimated): 64.20TiB (min: 18.26TiB) >>>> Free (statfs, df): 33.20TiB >>>> Data ratio: 1.04 >>>> Metadata ratio: 2.33 >>>> Multiple profiles: yes (data, metadata, system) >>>> ``` >>> You left out the most important part of the `fi usage -T` information: >>> the table... 
>>> >>>> ``` >>>> # btrfs filesystem show /mnt/Data >>>> Label: 'Data' uuid: 7aa7fdb3-b3de-421c-bc86-daba55fc46f6 >>>> Total devices 6 FS bytes used 49.07TiB >>>> devid 1 size 18.19TiB used 16.23TiB path /dev/sdf >>>> devid 2 size 18.19TiB used 16.23TiB path /dev/sdg >>>> devid 3 size 16.37TiB used 14.54TiB path /dev/sdc >>>> devid 4 size 16.37TiB used 4.25TiB path /dev/sdb >>>> devid 5 size 16.37TiB used 1.10TiB path /dev/sdd >>>> devid 6 size 16.37TiB used 1.10TiB path /dev/sde >>>> ``` >>> ...but from here we can guess there's between 2 and 14 TiB on each device, >>> which should more than satisfy the requirements for raid1c4. >>> >>> So this is _not_ the expected problem in this scenario, where the >>> filesystem fills up too many of the drives too soon, and legitimately >>> can't continue balancing. >>> >>> It looks like an allocator bug. >>> >>>> full incident kernel log: >>>> https://pastebin.com/KxP7Xa3g >>>> >>>> i’m looking for a safe recovery path. is there a supported way to >>>> unwind or complete the in-flight convert first (for example, freeing >>>> metadata space or running a limited balance), or should i avoid that >>>> and take a different route? if proceeding is risky, given that there >>>> are no `Data,single` chunks on `/dev/sdd` and `/dev/sde`, is it safe >>>> to remove those two devices to free room and try again? if that’s >>>> reasonable, what exact sequence (device remove/replace vs zeroing; >>>> mount options) would you recommend to minimize further damage? >>> The safe recovery path is to get a fix for the allocator bug so that you >>> can finish the converting balance, either to raid1c4 or any other profile. >>> >>> This operation (balance) is something you should be able to do with >>> current usage. There's no other way to get out of this situation, >>> but a kernel bug is interfering with the balance. >>> >>> Removing devices definitely won't help, and may trigger other issues >>> with raid6. Don't try that. 
>>> >>> You could try an up-to-date 6.6 or 6.12 LTS kernel, in case there's a >>> regression in newer kernels. Don't use a kernel older than 6.6 with >>> raid6. >>> >>> Mount options 'nossd,skip_balance,nodiscard,noatime' should minimize >>> the short-term metadata requirements, which might just be enough to >>> cancel the balance and start a convert in the other direction. >>> >>>> thanks, >>>> sandwich >>> [...] >>> >> > -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 6+ messages in thread
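[Editorial note: Goffredo's small-scale reproduction ("single disk size=10GB") suggests the bug can be exercised without real hardware. The sketch below uses loop devices for that; file paths, sizes, and loop device names are illustrative guesses, not his actual setup, and by default (`DRY_RUN=1`) it only prints the commands.]

```shell
#!/bin/sh
# Loop-device testbed sketching a ~10 GB scale reproduction. Paths,
# sizes, and loop names are illustrative; executing for real requires
# root. DRY_RUN=1 (the default) only prints each command.
set -eu
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

MNT=/tmp/btrfs-repro
for i in 1 2 3 4; do
    run truncate -s 10G "/tmp/disk$i.img"
    run losetup "/dev/loop$i" "/tmp/disk$i.img"   # or: losetup -f --show
done
run mkfs.btrfs -L Repro -d single -m raid1 /dev/loop1 /dev/loop2
run mkdir -p "$MNT"
run mount /dev/loop1 "$MNT"
#    ...fill until full, add devices, refill to ~50% of total...
run btrfs device add /dev/loop3 /dev/loop4 "$MNT"
run btrfs balance start -v -dconvert=raid6 -mconvert=raid1c4 \
    -sconvert=raid1c4 "$MNT" --force
```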
end of thread, other threads:[~2025-10-30 18:19 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <e03530c5-6af9-4f7a-9205-21d41dc092e5@archworks.co>
2025-10-26 21:37 ` [btrfs] ENOSPC during convert to RAID6/RAID1C4 -> forced RO Sandwich
2025-10-26 22:11 ` Sandwich
2025-10-27 2:08 ` Zygo Blaxell
2025-10-27 13:20 ` Sandwich
2025-10-29 22:06 ` Sandwich
2025-10-30 18:16 ` Goffredo Baroncelli