* [PATCH 0/3] btrfs: unbalanced disks aware per-profile available space estimation
@ 2026-02-03 3:01 Qu Wenruo
2026-02-03 3:01 ` [PATCH 1/3] btrfs: introduce the device layout aware per-profile available space Qu Wenruo
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Qu Wenruo @ 2026-02-03 3:01 UTC (permalink / raw)
To: linux-btrfs
[CHANGELOG]
v1:
- Revive from the v5.9 era fix
- Make btrfs_update_per_profile_avail() not return errors
Instead it just marks all profiles as unavailable, and
btrfs_get_per_profile_avail() will return false.
The caller then needs to fall back to the existing factor based
estimation.
This greatly simplifies the error handling, which was a pain point in
the original series.
- Remove a lot of refactors/cleanups
As those are already done upstream.
- Only make calc_available_free_space() use the new infrastructure
That's the main goal: fixing can_overcommit().
Further enhancements can be done later.
There is a long known bug that if metadata is using RAID1 on two
unbalanced disks, btrfs has a very high chance of hitting -ENOSPC in
critical paths and flipping the fs read-only.
The bug dates back to v5.9 (where my last series ended), and the most
recent bug report is from Christoph.
The idea to fix it has always been the same: provide a
chunk-allocator-like available space estimation.
It doesn't need to be as heavy as the chunk allocator, but at the very
least it must not over-estimate.
The devil is in the details: the previous v5.9 era series required a
lot of changes in error handling, because
btrfs_update_per_profile_avail() could fail at critical paths in chunk
allocation/removal and device grow/shrink/add/removal.
This time the function no longer fails; instead it marks the
per-profile available estimation as unreliable and lets the caller
fall back to the old factor based solution.
In the real world this should not be a big deal, as the only possible
error is -ENOMEM, and it greatly simplifies the error handling.
Qu Wenruo (3):
btrfs: introduce the device layout aware per-profile available space
btrfs: update per-profile available estimation
btrfs: use per-profile available space in calc_available_free_space()
fs/btrfs/space-info.c | 27 ++++---
fs/btrfs/volumes.c | 173 +++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.h | 30 ++++++++
3 files changed, 217 insertions(+), 13 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 1/3] btrfs: introduce the device layout aware per-profile available space
2026-02-03 3:01 [PATCH 0/3] btrfs: unbalanced disks aware per-profile available space estimation Qu Wenruo
@ 2026-02-03 3:01 ` Qu Wenruo
2026-02-03 12:56 ` Filipe Manana
2026-02-03 23:49 ` kernel test robot
2026-02-03 3:01 ` [PATCH 2/3] btrfs: update per-profile available estimation Qu Wenruo
2026-02-03 3:01 ` [PATCH 3/3] btrfs: use per-profile available space in calc_available_free_space() Qu Wenruo
2 siblings, 2 replies; 8+ messages in thread
From: Qu Wenruo @ 2026-02-03 3:01 UTC (permalink / raw)
To: linux-btrfs
[BUG]
There is a long known bug that if metadata is using RAID1 on two disks
with unbalanced sizes, there is a very high chance of hitting an ENOSPC
related transaction abort.
[CAUSE]
The root cause is in the available space estimation code:
- Factor based calculation
Just use all unallocated space, divided by the profile factor.
One obvious user is can_overcommit().
This cannot handle the following example:
devid 1 unallocated: 1GiB
devid 2 unallocated: 50GiB
metadata type: RAID1
If using factor based estimation, we get (1GiB + 50GiB) / 2 = 25.5GiB
of free space for metadata.
Thus we can continue allocating metadata (over-committing) way beyond
the 1GiB limit.
But this estimation is completely wrong; in reality we can only
allocate a single 1GiB RAID1 block group, thus if we keep
over-committing, at some point we will hit ENOSPC at a critical path
and flip the fs read-only.
[SOLUTION]
This patch introduces per-profile available space estimation, which
provides chunk-allocator-like behavior to give a (mostly) accurate
result, with some under-estimating corner cases.
There are some differences between the estimation and real chunk
allocator:
- No consideration of hole size
It's fine for most cases, as all data/metadata stripes are 1GiB in
size, thus no hole should waste much space.
And the chunk allocator is able to use smaller stripes when there is
really no other choice.
Although in theory this means it can lead to some over-estimation, it
should not cause much hassle in the real world.
The other benefit of this behavior is that we avoid dev-extent tree
searches completely, thus the overhead is very small.
- No true balancing for certain cases
If we have a 3-disk RAID1 and each device has 2GiB of unallocated
space, the chunk allocator can balance the allocations so that 3GiB
of RAID1 chunks can be allocated, and that's what it will do.
But the current estimation code uses the largest available space to do
a single allocation, meaning the estimation will be 2GiB, thus an
under-estimate.
Such under-estimation is fine: after the first chunk allocation the
estimation will be updated and still give a correct 2GiB estimation.
So this only means the estimation will be a little conservative, which
is safer for call sites like the metadata over-commit check.
With this facility, for the above 1GiB + 50GiB case, it will give a
RAID1 estimation of 1GiB instead of the incorrect 25.5GiB.
Or for a more complex example:
devid 1 unallocated: 1T
devid 2 unallocated: 1T
devid 3 unallocated: 10T
We will get an array of:
RAID10: 2T
RAID1: 2T
RAID1C3: 1T
RAID1C4: 0 (not enough devices)
DUP: 6T
RAID0: 3T
SINGLE: 12T
RAID5: 1T
RAID6: 0 (not enough devices)
[IMPLEMENTATION]
For each profile, we do a chunk-allocator-level calculation.
The pseudo code looks like:
clear_virtual_used_space_of_all_rw_devices();
do {
/*
 * The same as the chunk allocator, except that besides the used
 * space, we also take the virtual used space into consideration.
 */
sort_device_with_virtual_free_space();
/*
 * Unlike the chunk allocator, we don't need to bother with hole/stripe
 * size, so we use the smallest device to make sure we can
 * allocate as many stripes as the regular chunk allocator.
 */
stripe_size = device_with_smallest_free->avail_space;
stripe_size = min(stripe_size, to_alloc / ndevs);
/*
 * Allocate a virtual chunk; an allocated virtual chunk will
 * increase the virtual used space, allowing the next iteration to
 * properly emulate the chunk allocator behavior.
 */
ret = alloc_virtual_chunk(stripe_size, &allocated_size);
if (ret == 0)
avail += allocated_size;
} while (ret == 0)
As we always select the device with the least free space, the device
with the most space will be the first to be utilized, just like the
chunk allocator.
For the above 1T + 10T devices, we will allocate a 1T virtual chunk in
the first iteration, then run out of devices in the next iteration.
Thus we only get 1T of free space for the RAID1 type, just like what
the chunk allocator would do.
This minimal-available-space based calculation is not perfect, but the
important part is that the estimation never exceeds the real available
space.
This patch only introduces the infrastructure; nothing calls it yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/volumes.c | 153 +++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/volumes.h | 30 +++++++++
2 files changed, 183 insertions(+)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f281d113519b..2348d4d5e0b5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5372,6 +5372,159 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
return 0;
}
+/*
+ * Return 0 if we allocated a balloon(*) chunk, and store its size in
+ * @allocated (the last parameter).
+ * Return -ENOSPC if we have no more space to allocate a virtual chunk.
+ *
+ * *: Balloon chunks are space holders for the per-profile available space
+ * allocator. They don't take any on-disk space, but only emulate chunk
+ * allocator behavior to get an accurate estimation of available space.
+ */
+static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
+ struct btrfs_device_info *devices_info,
+ enum btrfs_raid_types type,
+ u64 *allocated)
+{
+ const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
+ struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+ struct btrfs_device *device;
+ u64 stripe_size;
+ int i;
+ int ndevs = 0;
+
+ lockdep_assert_held(&fs_info->chunk_mutex);
+
+ /* Go through devices to collect their unallocated space */
+ list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
+ u64 avail;
+
+ if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+ &device->dev_state) ||
+ test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+ continue;
+
+ if (device->total_bytes > device->bytes_used +
+ device->per_profile_allocated)
+ avail = device->total_bytes - device->bytes_used -
+ device->per_profile_allocated;
+ else
+ avail = 0;
+
+ /* And exclude the [0, 1M) reserved space */
+ if (avail > SZ_1M)
+ avail -= SZ_1M;
+ else
+ avail = 0;
+
+ if (avail < fs_info->sectorsize)
+ continue;
+ /*
+ * Unlike chunk allocator, we don't care about stripe or hole
+ * size, so here we use @avail directly
+ */
+ devices_info[ndevs].dev_offset = 0;
+ devices_info[ndevs].total_avail = avail;
+ devices_info[ndevs].max_avail = avail;
+ devices_info[ndevs].dev = device;
+ ++ndevs;
+ }
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info, NULL);
+ ndevs = rounddown(ndevs, raid_attr->devs_increment);
+ if (ndevs < raid_attr->devs_min)
+ return -ENOSPC;
+ if (raid_attr->devs_max)
+ ndevs = min(ndevs, (int)raid_attr->devs_max);
+ else
+ ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
+
+ /*
+ * Now allocate a virtual chunk using the unallocated space of the
+ * device with the least unallocated space.
+ */
+ stripe_size = round_down(devices_info[ndevs - 1].total_avail,
+ fs_info->sectorsize);
+ for (i = 0; i < ndevs; i++)
+ devices_info[i].dev->per_profile_allocated += stripe_size;
+ *allocated = stripe_size * (ndevs - raid_attr->nparity) /
+ raid_attr->ncopies;
+ return 0;
+}
+
+static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
+ enum btrfs_raid_types type,
+ u64 *result_ret)
+{
+ struct btrfs_device_info *devices_info = NULL;
+ struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+ struct btrfs_device *device;
+ u64 allocated;
+ u64 result = 0;
+ int ret = 0;
+
+ lockdep_assert_held(&fs_info->chunk_mutex);
+ ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
+
+ /* Not enough devices, quick exit, just update the result */
+ if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
+ goto out;
+
+ devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
+ GFP_NOFS);
+ if (!devices_info) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ /* Clear virtual chunk used space for each device */
+ list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+ device->per_profile_allocated = 0;
+
+ while (!alloc_virtual_chunk(fs_info, devices_info, type, &allocated))
+ result += allocated;
+
+out:
+ kfree(devices_info);
+ if (ret < 0 && ret != -ENOSPC)
+ return ret;
+ *result_ret = result;
+ return 0;
+}
+
+/* Update the per-profile available space array. */
+void btrfs_update_per_profile_avail(struct btrfs_fs_info *fs_info)
+{
+ u64 results[BTRFS_NR_RAID_TYPES];
+ int i = 0;
+ int ret;
+
+ /*
+ * Zoned is more complex as we can not simply get the amount of
+ * available space for each device.
+ */
+ if (btrfs_is_zoned(fs_info))
+ goto error;
+
+ for (; i < BTRFS_NR_RAID_TYPES; i++) {
+ ret = calc_one_profile_avail(fs_info, i, &results[i]);
+ if (ret < 0)
+ goto error;
+ }
+
+ spin_lock(&fs_info->fs_devices->per_profile_lock);
+ for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+ fs_info->fs_devices->per_profile_avail[i] = results[i];
+ set_bit(i, &fs_info->fs_devices->per_profile_uptodate);
+ }
+ spin_unlock(&fs_info->fs_devices->per_profile_lock);
+ return;
+error:
+ spin_lock(&fs_info->fs_devices->per_profile_lock);
+ bitmap_clear(&fs_info->fs_devices->per_profile_uptodate, 0,
+ BTRFS_NR_RAID_TYPES);
+ spin_unlock(&fs_info->fs_devices->per_profile_lock);
+}
+
static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
{
if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ebc85bf53ee7..ecb5ad9cf249 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -22,6 +22,7 @@
#include <uapi/linux/btrfs_tree.h>
#include "messages.h"
#include "extent-io-tree.h"
+#include "fs.h"
struct block_device;
struct bdev_handle;
@@ -213,6 +214,12 @@ struct btrfs_device {
/* Bandwidth limit for scrub, in bytes */
u64 scrub_speed_max;
+
+ /*
+ * A temporary number of allocated space during per-profile
+ * available space calculation.
+ */
+ u64 per_profile_allocated;
};
/*
@@ -458,6 +465,11 @@ struct btrfs_fs_devices {
/* Device to be used for reading in case of RAID1. */
u64 read_devid;
#endif
+
+ /* Records the per-type available space estimation. */
+ u64 per_profile_avail[BTRFS_NR_RAID_TYPES];
+ spinlock_t per_profile_lock;
+ unsigned long per_profile_uptodate;
};
#define BTRFS_MAX_DEVS(info) ((BTRFS_MAX_ITEM_SIZE(info) \
@@ -886,6 +898,24 @@ int btrfs_bg_type_to_factor(u64 flags);
const char *btrfs_bg_type_to_raid_name(u64 flags);
int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
bool btrfs_verify_dev_items(const struct btrfs_fs_info *fs_info);
+void btrfs_update_per_profile_avail(struct btrfs_fs_info *fs_info);
+
+static inline bool btrfs_get_per_profile_avail(struct btrfs_fs_info *fs_info,
+ u64 profile, u64 *avail_ret)
+{
+ enum btrfs_raid_types index = btrfs_bg_flags_to_raid_index(profile);
+ struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+ bool uptodate = false;
+
+ spin_lock(&fs_devices->per_profile_lock);
+ if (test_bit(index, &fs_devices->per_profile_uptodate)) {
+ uptodate = true;
+ *avail_ret = fs_devices->per_profile_avail[index];
+ }
+ spin_unlock(&fs_devices->per_profile_lock);
+ return uptodate;
+}
+
bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
--
2.52.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 2/3] btrfs: update per-profile available estimation
2026-02-03 3:01 [PATCH 0/3] btrfs: unbalanced disks aware per-profile available space estimation Qu Wenruo
2026-02-03 3:01 ` [PATCH 1/3] btrfs: introduce the device layout aware per-profile available space Qu Wenruo
@ 2026-02-03 3:01 ` Qu Wenruo
2026-02-03 3:01 ` [PATCH 3/3] btrfs: use per-profile available space in calc_available_free_space() Qu Wenruo
2 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2026-02-03 3:01 UTC (permalink / raw)
To: linux-btrfs
The per-profile available space estimation needs to be updated at the
following timings:
- Chunk allocation
- Chunk removal
- After mount
- New device
- Device removal
- Device shrink
- Device grow
And since btrfs_update_per_profile_avail() does not return an error,
this doesn't introduce any new error handling paths.
When btrfs_update_per_profile_avail() fails internally (only -ENOMEM
is possible), it marks the per-profile available estimation as
unreliable, so a later btrfs_get_per_profile_avail() will return false
and the caller has to use its fallback solution.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/volumes.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2348d4d5e0b5..43865c1ed445 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2339,6 +2339,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
mutex_lock(&fs_info->chunk_mutex);
list_del_init(&device->dev_alloc_list);
device->fs_devices->rw_devices--;
+ btrfs_update_per_profile_avail(fs_info);
mutex_unlock(&fs_info->chunk_mutex);
}
@@ -2450,6 +2451,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
list_add(&device->dev_alloc_list,
&fs_devices->alloc_list);
device->fs_devices->rw_devices++;
+ btrfs_update_per_profile_avail(fs_info);
mutex_unlock(&fs_info->chunk_mutex);
}
return ret;
@@ -2937,6 +2939,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
*/
btrfs_clear_space_info_full(fs_info);
+ btrfs_update_per_profile_avail(fs_info);
mutex_unlock(&fs_info->chunk_mutex);
/* Add sysfs device entry */
@@ -2947,6 +2950,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
if (seeding_dev) {
mutex_lock(&fs_info->chunk_mutex);
ret = init_first_rw_device(trans);
+ btrfs_update_per_profile_avail(fs_info);
mutex_unlock(&fs_info->chunk_mutex);
if (unlikely(ret)) {
btrfs_abort_transaction(trans, ret);
@@ -3029,6 +3033,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
orig_super_total_bytes);
btrfs_set_super_num_devices(fs_info->super_copy,
orig_super_num_devices);
+ btrfs_update_per_profile_avail(fs_info);
mutex_unlock(&fs_info->chunk_mutex);
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
error_trans:
@@ -3121,6 +3126,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
if (list_empty(&device->post_commit_list))
list_add_tail(&device->post_commit_list,
&trans->transaction->dev_update_list);
+ btrfs_update_per_profile_avail(fs_info);
mutex_unlock(&fs_info->chunk_mutex);
btrfs_reserve_chunk_metadata(trans, false);
@@ -3497,6 +3503,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
}
}
+ btrfs_update_per_profile_avail(fs_info);
mutex_unlock(&fs_info->chunk_mutex);
trans->removing_chunk = false;
@@ -5185,6 +5192,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
atomic64_sub(free_diff, &fs_info->free_chunk_space);
}
+ btrfs_update_per_profile_avail(fs_info);
/*
* Once the device's size has been set to the new size, ensure all
* in-memory chunks are synced to disk so that the loop below sees them
@@ -5300,6 +5308,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
WARN_ON(diff > old_total);
btrfs_set_super_total_bytes(super_copy,
round_down(old_total - diff, fs_info->sectorsize));
+ btrfs_update_per_profile_avail(fs_info);
mutex_unlock(&fs_info->chunk_mutex);
btrfs_reserve_chunk_metadata(trans, false);
@@ -6002,6 +6011,8 @@ static struct btrfs_block_group *create_chunk(struct btrfs_trans_handle *trans,
check_raid56_incompat_flag(info, type);
check_raid1c34_incompat_flag(info, type);
+ btrfs_update_per_profile_avail(info);
+
return block_group;
}
@@ -8574,7 +8585,14 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
}
/* Ensure all chunks have corresponding dev extents */
- return verify_chunk_dev_extent_mapping(fs_info);
+ ret = verify_chunk_dev_extent_mapping(fs_info);
+ if (ret < 0)
+ return ret;
+
+ mutex_lock(&fs_info->chunk_mutex);
+ btrfs_update_per_profile_avail(fs_info);
+ mutex_unlock(&fs_info->chunk_mutex);
+ return 0;
}
/*
--
2.52.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 3/3] btrfs: use per-profile available space in calc_available_free_space()
2026-02-03 3:01 [PATCH 0/3] btrfs: unbalanced disks aware per-profile available space estimation Qu Wenruo
2026-02-03 3:01 ` [PATCH 1/3] btrfs: introduce the device layout aware per-profile available space Qu Wenruo
2026-02-03 3:01 ` [PATCH 2/3] btrfs: update per-profile available estimation Qu Wenruo
@ 2026-02-03 3:01 ` Qu Wenruo
2 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2026-02-03 3:01 UTC (permalink / raw)
To: linux-btrfs
For the following disk layout, can_overcommit() can report false
confidence in the available space:
devid 1 unallocated: 1GiB
devid 2 unallocated: 50GiB
metadata type: RAID1
As can_overcommit() simply uses the unallocated space divided by the
profile factor to calculate the allocatable metadata chunk size, it
reports 25.5GiB of available space.
But in reality we can only allocate one 1GiB RAID1 chunk; the
remaining 49GiB on devid 2 can never be utilized for a RAID1 chunk.
This leads to various ENOSPC related transaction aborts that flip the
fs read-only.
Now use the per-profile available space in calc_available_free_space(),
and only when that fails do we fall back to the old factor based
estimation.
For zoned devices, or in the very unlikely case of a temporary memory
allocation failure, we still fall back to the factor based estimation,
but in reality that should be very rare.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/space-info.c | 27 +++++++++++++++------------
1 file changed, 15 insertions(+), 12 deletions(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index bb5aac7ee9d2..78b771d656b9 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -444,6 +444,7 @@ static u64 calc_available_free_space(const struct btrfs_space_info *space_info,
enum btrfs_reserve_flush_enum flush)
{
struct btrfs_fs_info *fs_info = space_info->fs_info;
+ bool has_per_profile;
u64 profile;
u64 avail;
u64 data_chunk_size;
@@ -454,19 +455,21 @@ static u64 calc_available_free_space(const struct btrfs_space_info *space_info,
else
profile = btrfs_metadata_alloc_profile(fs_info);
- avail = atomic64_read(&fs_info->free_chunk_space);
-
- /*
- * If we have dup, raid1 or raid10 then only half of the free
- * space is actually usable. For raid56, the space info used
- * doesn't include the parity drive, so we don't have to
- * change the math
- */
- factor = btrfs_bg_type_to_factor(profile);
- avail = div_u64(avail, factor);
- if (avail == 0)
- return 0;
+ has_per_profile = btrfs_get_per_profile_avail(fs_info, profile, &avail);
+ if (!has_per_profile) {
+ avail = atomic64_read(&fs_info->free_chunk_space);
+ /*
+ * If we have dup, raid1 or raid10 then only half of the free
+ * space is actually usable. For raid56, the space info used
+ * doesn't include the parity drive, so we don't have to
+ * change the math
+ */
+ factor = btrfs_bg_type_to_factor(profile);
+ avail = div_u64(avail, factor);
+ if (avail == 0)
+ return 0;
+ }
data_chunk_size = calc_effective_data_chunk_size(fs_info);
/*
--
2.52.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH 1/3] btrfs: introduce the device layout aware per-profile available space
2026-02-03 3:01 ` [PATCH 1/3] btrfs: introduce the device layout aware per-profile available space Qu Wenruo
@ 2026-02-03 12:56 ` Filipe Manana
2026-02-03 20:52 ` Qu Wenruo
2026-02-03 23:49 ` kernel test robot
1 sibling, 1 reply; 8+ messages in thread
From: Filipe Manana @ 2026-02-03 12:56 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Tue, Feb 3, 2026 at 3:01 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [...]
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/volumes.c | 153 +++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/volumes.h | 30 +++++++++
> 2 files changed, 183 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index f281d113519b..2348d4d5e0b5 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5372,6 +5372,159 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
> return 0;
> }
>
> +/*
> + * Return 0 if we allocated any ballon(*) chunk, and restore the size to
> + * @allocated (the last parameter).
> + * Return -ENOSPC if we have no more space to allocate virtual chunk
> + *
> + * *: Ballon chunks are space holder for per-profile available space allocator.
> + * Ballon chunks won't really take on-disk space, but only to emulate
> + * chunk allocator behavior to get accurate estimation on available space.
> + */
> +static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
> + struct btrfs_device_info *devices_info,
> + enum btrfs_raid_types type,
> + u64 *allocated)
> +{
> + const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
> + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> + struct btrfs_device *device;
> + u64 stripe_size;
> + int i;
Can and should be declared in the loop.
> + int ndevs = 0;
> +
> + lockdep_assert_held(&fs_info->chunk_mutex);
> +
> + /* Go through devices to collect their unallocated space */
Sentences should end with punctuation.
> + list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
> + u64 avail;
> +
> + if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
> + &device->dev_state) ||
> + test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
> + continue;
> +
> + if (device->total_bytes > device->bytes_used +
> + device->per_profile_allocated)
> + avail = device->total_bytes - device->bytes_used -
> + device->per_profile_allocated;
> + else
> + avail = 0;
> +
> + /* And exclude the [0, 1M) reserved space */
End with punctuation.
> + if (avail > SZ_1M)
> + avail -= SZ_1M;
Use BTRFS_DEVICE_RANGE_RESERVED instead.
> + else
> + avail = 0;
> +
> + if (avail < fs_info->sectorsize)
> + continue;
> + /*
> + * Unlike chunk allocator, we don't care about stripe or hole
> + * size, so here we use @avail directly
Same here, missing ending punctuation.
> + */
> + devices_info[ndevs].dev_offset = 0;
> + devices_info[ndevs].total_avail = avail;
> + devices_info[ndevs].max_avail = avail;
> + devices_info[ndevs].dev = device;
> + ++ndevs;
> + }
> + sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> + btrfs_cmp_device_info, NULL);
> + ndevs = rounddown(ndevs, raid_attr->devs_increment);
> + if (ndevs < raid_attr->devs_min)
> + return -ENOSPC;
> + if (raid_attr->devs_max)
> + ndevs = min(ndevs, (int)raid_attr->devs_max);
> + else
> + ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
> +
> + /*
> + * Now allocate a virtual chunk using the unallocated space of the
> + * device with the least unallocated space.
> + */
> + stripe_size = round_down(devices_info[ndevs - 1].total_avail,
> + fs_info->sectorsize);
> + for (i = 0; i < ndevs; i++)
for (int i = 0; ....
> + devices_info[i].dev->per_profile_allocated += stripe_size;
> + *allocated = stripe_size * (ndevs - raid_attr->nparity) /
> + raid_attr->ncopies;
> + return 0;
> +}
> +
> +static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
> + enum btrfs_raid_types type,
> + u64 *result_ret)
> +{
> + struct btrfs_device_info *devices_info = NULL;
> + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> + struct btrfs_device *device;
> + u64 allocated;
> + u64 result = 0;
> + int ret = 0;
> +
> + lockdep_assert_held(&fs_info->chunk_mutex);
> + ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
> +
> + /* Not enough devices, quick exit, just update the result */
End with punctuation.
> + if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
> + goto out;
Misses setting ret to -ENOSPC.
> +
> + devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
> + GFP_NOFS);
> + if (!devices_info) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + /* Clear virtual chunk used space for each device */
Missing punctuation again.
> + list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
> + device->per_profile_allocated = 0;
> +
> + while (!alloc_virtual_chunk(fs_info, devices_info, type, &allocated))
> + result += allocated;
This can take some time, so:
1) Have a cond_resched() call here somewhere.
2) Compute only for the profiles we are using instead of all possible
profiles - we can determine which ones are in use by ORing
fs_info->avail_data_alloc_bits, fs_info->avail_metadata_alloc_bits and
fs_info->avail_system_alloc_bits.
> +
> +out:
> + kfree(devices_info);
> + if (ret < 0 && ret != -ENOSPC)
> + return ret;
> + *result_ret = result;
> + return 0;
> +}
> +
> +/* Update the per-profile available space array. */
> +void btrfs_update_per_profile_avail(struct btrfs_fs_info *fs_info)
> +{
> + u64 results[BTRFS_NR_RAID_TYPES];
> + int i = 0;
Can and should be declared in the loop.
> + int ret;
> +
> + /*
> + * Zoned is more complex as we can not simply get the amount of
> + * available space for each device.
> + */
> + if (btrfs_is_zoned(fs_info))
> + goto error;
> +
> + for (; i < BTRFS_NR_RAID_TYPES; i++) {
for (int i = 0; ....
> + ret = calc_one_profile_avail(fs_info, i, &results[i]);
> + if (ret < 0)
> + goto error;
> + }
> +
> + spin_lock(&fs_info->fs_devices->per_profile_lock);
> + for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
for (int i = 0; ...
> + fs_info->fs_devices->per_profile_avail[i] = results[i];
> + set_bit(i, &fs_info->fs_devices->per_profile_uptodate);
There's no need for the bitmap.
To indicate the values are not computed/valid we could set each
element of fs_info->fs_devices->per_profile_avail[] to U64_MAX for
example, which would avoid further increasing the size of struct
btrfs_fs_devices.
Thanks.
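A sketch of that U64_MAX-as-sentinel idea (a cut-down userspace stand-in; in the real code the array lives in struct btrfs_fs_devices and is protected by per_profile_lock):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_RAID_TYPES	9
#define AVAIL_INVALID	UINT64_MAX	/* kernel code would spell this U64_MAX */

struct per_profile {
	uint64_t avail[NR_RAID_TYPES];
};

/* Replaces bitmap_clear() on the per_profile_uptodate bitmap. */
static void per_profile_invalidate(struct per_profile *p)
{
	for (int i = 0; i < NR_RAID_TYPES; i++)
		p->avail[i] = AVAIL_INVALID;
}

/* Replaces the test_bit() check in btrfs_get_per_profile_avail(). */
static bool per_profile_get(const struct per_profile *p, int type,
			    uint64_t *avail)
{
	if (p->avail[type] == AVAIL_INVALID)
		return false;
	*avail = p->avail[type];
	return true;
}
```

The trade-off is that U64_MAX can no longer be a legitimate available-space value, which is safe here since no device can hold 2^64 - 1 bytes.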
> + }
> + spin_unlock(&fs_info->fs_devices->per_profile_lock);
> + return;
> +error:
> + spin_lock(&fs_info->fs_devices->per_profile_lock);
> + bitmap_clear(&fs_info->fs_devices->per_profile_uptodate, 0,
> + BTRFS_NR_RAID_TYPES);
> + spin_unlock(&fs_info->fs_devices->per_profile_lock);
> +}
> +
> static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
> {
> if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index ebc85bf53ee7..ecb5ad9cf249 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -22,6 +22,7 @@
> #include <uapi/linux/btrfs_tree.h>
> #include "messages.h"
> #include "extent-io-tree.h"
> +#include "fs.h"
>
> struct block_device;
> struct bdev_handle;
> @@ -213,6 +214,12 @@ struct btrfs_device {
>
> /* Bandwidth limit for scrub, in bytes */
> u64 scrub_speed_max;
> +
> + /*
> + * A temporary number of allocated space during per-profile
> + * available space calculation.
> + */
> + u64 per_profile_allocated;
As this is used temporarily only for the calculation, I wonder if this
could be placed in struct btrfs_device_info instead.
Because btrfs_device is long lived while btrfs_device_info is always
short lived (and we use an array of such and allocate it in
calc_one_profile_avail()).
Thanks.
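A sketch of that move (the struct below is a simplified stand-in; the real btrfs_device_info in volumes.h also carries dev/dev_offset/max_avail/total_avail):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct btrfs_device_info {
	uint64_t max_avail;
	uint64_t per_profile_allocated;	/* moved out of struct btrfs_device */
};

/*
 * A side benefit: since kcalloc() (calloc() here) zeroes the array, the
 * explicit reset loop over fs_devices->alloc_list in the quoted patch
 * becomes unnecessary.
 */
static struct btrfs_device_info *alloc_devices_info(size_t ndevs)
{
	return calloc(ndevs, sizeof(struct btrfs_device_info));
}
```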
> };
>
> /*
> @@ -458,6 +465,11 @@ struct btrfs_fs_devices {
> /* Device to be used for reading in case of RAID1. */
> u64 read_devid;
> #endif
> +
> + u64 per_profile_avail[BTRFS_NR_RAID_TYPES];
> + /* Records per-type available space estimation. */
> + spinlock_t per_profile_lock;
> + unsigned long per_profile_uptodate;
> };
>
> #define BTRFS_MAX_DEVS(info) ((BTRFS_MAX_ITEM_SIZE(info) \
> @@ -886,6 +898,24 @@ int btrfs_bg_type_to_factor(u64 flags);
> const char *btrfs_bg_type_to_raid_name(u64 flags);
> int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
> bool btrfs_verify_dev_items(const struct btrfs_fs_info *fs_info);
> +void btrfs_update_per_profile_avail(struct btrfs_fs_info *fs_info);
> +
> +static inline bool btrfs_get_per_profile_avail(struct btrfs_fs_info *fs_info,
> + u64 profile, u64 *avail_ret)
> +{
> + enum btrfs_raid_types index = btrfs_bg_flags_to_raid_index(profile);
> + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> + bool uptodate = false;
> +
> + spin_lock(&fs_devices->per_profile_lock);
> + if (test_bit(index, &fs_devices->per_profile_uptodate)) {
> + uptodate = true;
> + *avail_ret = fs_devices->per_profile_avail[index];
> + }
> + spin_unlock(&fs_info->fs_devices->per_profile_lock);
> + return uptodate;
> +}
> +
> bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
>
> bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
> --
> 2.52.0
>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/3] btrfs: introduce the device layout aware per-profile available space
2026-02-03 12:56 ` Filipe Manana
@ 2026-02-03 20:52 ` Qu Wenruo
2026-02-03 21:47 ` Filipe Manana
0 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2026-02-03 20:52 UTC (permalink / raw)
To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs
On 2026/2/3 23:26, Filipe Manana wrote:
> On Tue, Feb 3, 2026 at 3:01 AM Qu Wenruo <wqu@suse.com> wrote:
[...]
>
>> + list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
>> + device->per_profile_allocated = 0;
>> +
>> + while (!alloc_virtual_chunk(fs_info, devices_info, type, &allocated))
>> + result += allocated;
>
> This can take some time, so:
>
> 1) Have a cond_resched() call here somewhere.
>
> 2) Compute only for the profiles we are using instead of all possible
> profiles - we can determine which ones are in use by oring
> fs_info->avail_data_alloc_bits, fs_info->avail_metadata_alloc_bits and
> fs_info->avail_system_alloc_bits.
In fact this will take almost no time.
Firstly, the core functionality is just integer calculations, which are
very fast on modern hardware.
There is no tree search to grab the largest hole, unlike the chunk allocator.
Secondly, the allocation itself has no maximum size limit, thus even if
there is a lot of unallocated space on each device, it will only be
handled as one huge chunk.
In my quick tests, where the fs is small and the dev-extent/chunk/bg trees
are all in cache, the runtime of btrfs_update_per_profile_avail() is
pretty short.
For 2 disks the runtime for all 9 profiles is around 2~5us, while for
4 disks it is around 5~8us.
Do we still need to consider cond_resched() in this case?
Thanks,
Qu
* Re: [PATCH 1/3] btrfs: introduce the device layout aware per-profile available space
2026-02-03 20:52 ` Qu Wenruo
@ 2026-02-03 21:47 ` Filipe Manana
0 siblings, 0 replies; 8+ messages in thread
From: Filipe Manana @ 2026-02-03 21:47 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs
On Tue, Feb 3, 2026 at 8:52 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2026/2/3 23:26, Filipe Manana wrote:
> > On Tue, Feb 3, 2026 at 3:01 AM Qu Wenruo <wqu@suse.com> wrote:
> [...]
> >
> >> + list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
> >> + device->per_profile_allocated = 0;
> >> +
> >> + while (!alloc_virtual_chunk(fs_info, devices_info, type, &allocated))
> >> + result += allocated;
> >
> > This can take some time, so:
> >
> > 1) Have a cond_resched() call here somewhere.
> >
> > 2) Compute only for the profiles we are using instead of all possible
> > profiles - we can determine which ones are in use by oring
> > fs_info->avail_data_alloc_bits, fs_info->avail_metadata_alloc_bits and
> > fs_info->avail_system_alloc_bits.
>
> In fact this will take almost no time.
>
> Firstly, the core functionality is just integer calculations, which are
> very fast on modern hardware.
> There is no tree search to grab the largest hole, unlike the chunk allocator.
>
> Secondly, the allocation itself has no maximum size limit, thus even if
> there is a lot of unallocated space on each device, it will only be
> handled as one huge chunk.
>
>
> In my quick tests, where the fs is small and the dev-extent/chunk/bg trees
> are all in cache, the runtime of btrfs_update_per_profile_avail() is
> pretty short.
>
> For 2 disks the runtime for all 9 profiles is around 2~5us, while for
> 4 disks it is around 5~8us.
>
> Do we still need to consider cond_resched() in this case?
That's fine then, thanks.
>
> Thanks,
> Qu
* Re: [PATCH 1/3] btrfs: introduce the device layout aware per-profile available space
2026-02-03 3:01 ` [PATCH 1/3] btrfs: introduce the device layout aware per-profile available space Qu Wenruo
2026-02-03 12:56 ` Filipe Manana
@ 2026-02-03 23:49 ` kernel test robot
1 sibling, 0 replies; 8+ messages in thread
From: kernel test robot @ 2026-02-03 23:49 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs; +Cc: oe-kbuild-all
Hi Qu,
kernel test robot noticed the following build errors:
[auto build test ERROR on kdave/for-next]
[also build test ERROR on linus/master v6.19-rc8 next-20260203]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-introduce-the-device-layout-aware-per-profile-available-space/20260203-110526
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link: https://lore.kernel.org/r/eb573bac21a16092d8e9f64533c6b0d6ed6b16a4.1770087101.git.wqu%40suse.com
patch subject: [PATCH 1/3] btrfs: introduce the device layout aware per-profile available space
config: powerpc-randconfig-002-20260204 (https://download.01.org/0day-ci/archive/20260204/202602040700.ald285sK-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 14.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260204/202602040700.ald285sK-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602040700.ald285sK-lkp@intel.com/
All errors (new ones prefixed by >>):
powerpc-linux-ld: fs/btrfs/volumes.o: in function `alloc_virtual_chunk':
>> volumes.c:(.text+0x2648): undefined reference to `__udivdi3'
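The undefined `__udivdi3` reference means gcc lowered a plain 64-bit division to a libgcc helper that 32-bit kernels do not provide; the usual fix is div_u64() (or div64_u64()) from <linux/math64.h>. A userspace sketch of what the fixed `*allocated` computation in alloc_virtual_chunk() might look like (the helper name matches the kernel's, but its body here is the plain division, which is fine in userspace; the parameter names are taken from the quoted patch):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's div_u64() from <linux/math64.h>. */
static inline uint64_t div_u64(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}

/*
 * The quoted patch computes:
 *	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
 *		     raid_attr->ncopies;
 * Rewritten as a div_u64() call so 32-bit builds do not emit __udivdi3.
 */
static uint64_t calc_allocated(uint64_t stripe_size, int ndevs,
			       int nparity, uint32_t ncopies)
{
	return div_u64(stripe_size * (uint64_t)(ndevs - nparity), ncopies);
}
```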
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
end of thread, other threads:[~2026-02-03 23:49 UTC | newest]
Thread overview: 8+ messages
2026-02-03 3:01 [PATCH 0/3] btrfs: unbalanced disks aware per-profile available space estimation Qu Wenruo
2026-02-03 3:01 ` [PATCH 1/3] btrfs: introduce the device layout aware per-profile available space Qu Wenruo
2026-02-03 12:56 ` Filipe Manana
2026-02-03 20:52 ` Qu Wenruo
2026-02-03 21:47 ` Filipe Manana
2026-02-03 23:49 ` kernel test robot
2026-02-03 3:01 ` [PATCH 2/3] btrfs: update per-profile available estimation Qu Wenruo
2026-02-03 3:01 ` [PATCH 3/3] btrfs: use per-profile available space in calc_available_free_space() Qu Wenruo