* [PATCH 0/2] btrfs-progs: mkfs: optimize the discard behavior so it won't drive me crazy
@ 2026-01-09 5:31 Qu Wenruo
2026-01-09 5:31 ` [PATCH 1/2] btrfs-progs: mkfs: discard the logical range in one search Qu Wenruo
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Qu Wenruo @ 2026-01-09 5:31 UTC (permalink / raw)
To: linux-btrfs
After commit 4b861c186592 ("btrfs-progs: mkfs: discard free space"),
mkfs.btrfs inside my VM is much slower.
Previously it took only around 0.015s; now it takes over 0.750s, which
is around a 50x regression, and that is even though the virtio-blk
device is already ignoring discard commands.
It turns out that the main problem is inside how we submit discard
requests.
Currently we submit the discard immediately after finding a free space,
but for DUP profiles (the default one for metadata/system chunks), we
send a discard request for each mirror.
Since it's DUP, the two device extents are on the same device, and for
the next free space we send two discard requests again, meaning we keep
switching between two different dev extents. This makes the discard
requests look more like random writes, greatly reducing the performance.
The root fix is in the second patch, where we record and re-order the
discard requests for each device, so that the eventual requests are all
in ascending order and are merged when possible.
The first patch is just a minor cleanup to reduce the number of
btrfs_map_block() calls by using WRITE for discard.
With this series, the runtime of mkfs.btrfs is still increased (by the
free space discarding), but it is still fast enough that even I can not
sense it (0.015s -> 0.017s), finally bringing back my inner peace.
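The effect of the re-ordering can be illustrated with a small standalone C sketch (the offsets and the `count_jumps()` helper are made up for illustration; this is not btrfs-progs code). Submitting both DUP copies per free-space extent interleaves the two device extents, while sorting all ranges per device first makes the stream nearly sequential:

```c
#include <assert.h>
#include <stdlib.h>

struct range { unsigned long long start, len; };

/* Count how often a request does not start where the previous one ended. */
static int count_jumps(const struct range *r, int n)
{
	int jumps = 0;

	for (int i = 1; i < n; i++)
		if (r[i].start != r[i - 1].start + r[i - 1].len)
			jumps++;
	return jumps;
}

static int cmp_range(const void *a, const void *b)
{
	const struct range *ra = a, *rb = b;

	if (ra->start < rb->start)
		return -1;
	return ra->start > rb->start;
}

/*
 * Two free-space extents in a DUP block group, each with a physical copy
 * in two device extents (here at offset 0 and at 1MiB on the same device):
 *
 *   interleaved (per-extent) submission -> count_jumps() == 3
 *   qsort()ed (per-device) submission   -> count_jumps() == 1
 */
```

Each "jump" is a point where the device has to seek; sorting (and, in the actual series, merging adjacent ranges in the per-device cache_tree) turns the mostly random stream into a couple of sequential runs per device extent.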
Qu Wenruo (2):
btrfs-progs: mkfs: discard the logical range in one search
btrfs-progs: mkfs: optimise the block group discarding behavior
kernel-shared/volumes.c | 4 ++
kernel-shared/volumes.h | 3 ++
mkfs/main.c | 100 ++++++++++++++++++++++++++++++----------
3 files changed, 83 insertions(+), 24 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH 1/2] btrfs-progs: mkfs: discard the logical range in one search
2026-01-09 5:31 [PATCH 0/2] btrfs-progs: mkfs: optimize the discard behavior so it won't drive me crazy Qu Wenruo
@ 2026-01-09 5:31 ` Qu Wenruo
2026-01-09 5:31 ` [PATCH 2/2] btrfs-progs: mkfs: optimise the block group discarding behavior Qu Wenruo
2026-01-23 23:02 ` [PATCH 0/2] btrfs-progs: mkfs: optimize the discard behavior so it won't drive me crazy David Sterba
2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2026-01-09 5:31 UTC (permalink / raw)
To: linux-btrfs
Currently in discard_logical_range(), if the profile has multiple
mirrors, e.g. DUP, we call btrfs_map_block() multiple times, each call
returns just one mirror, and we submit a discard for that returned
mirror.
This means we need to call btrfs_map_block() twice for DUP. But we can
use the WRITE flag for btrfs_map_block(), which returns all the mirrors
in one go, reducing the number of btrfs_map_block() calls.
With that we do not even need the extra mirror number loop, and
discard_logical_range_mirror() can simply replace discard_logical_range().
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
mkfs/main.c | 30 +++++++++---------------------
1 file changed, 9 insertions(+), 21 deletions(-)
diff --git a/mkfs/main.c b/mkfs/main.c
index f99e5486521d..ce85d34f077a 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -635,23 +635,27 @@ out:
return ret;
}
-static int discard_logical_range_mirror(struct btrfs_fs_info *fs_info, int mirror,
- u64 start, u64 len)
+static int discard_logical_range(struct btrfs_fs_info *fs_info, u64 start, u64 len)
{
- struct btrfs_multi_bio *multi = NULL;
int ret;
u64 cur_offset = 0;
u64 cur_len;
while (cur_offset < len) {
+ struct btrfs_multi_bio *multi = NULL;
struct btrfs_device *device;
cur_len = len - cur_offset;
- ret = btrfs_map_block(fs_info, READ, start + cur_offset, &cur_len,
- &multi, mirror, NULL);
+ ret = btrfs_map_block(fs_info, WRITE, start + cur_offset, &cur_len,
+ &multi, 0, NULL);
if (ret)
return ret;
+ if (multi->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
+ free(multi);
+ return 0;
+ }
+
cur_len = min(cur_len, len - cur_offset);
for (int i = 0; i < multi->num_stripes; i++) {
@@ -674,22 +678,6 @@ static int discard_logical_range_mirror(struct btrfs_fs_info *fs_info, int mirro
multi = NULL;
cur_offset += cur_len;
}
-
- return 0;
-}
-
-static int discard_logical_range(struct btrfs_fs_info *fs_info, u64 start, u64 len)
-{
- int ret, num_copies;
-
- num_copies = btrfs_num_copies(fs_info, start, len);
-
- for (int i = 0; i < num_copies; i++) {
- ret = discard_logical_range_mirror(fs_info, i + 1, start, len);
- if (ret < 0)
- return ret;
- }
-
return 0;
}
--
2.52.0
* [PATCH 2/2] btrfs-progs: mkfs: optimise the block group discarding behavior
2026-01-09 5:31 [PATCH 0/2] btrfs-progs: mkfs: optimize the discard behavior so it won't drive me crazy Qu Wenruo
2026-01-09 5:31 ` [PATCH 1/2] btrfs-progs: mkfs: discard the logical range in one search Qu Wenruo
@ 2026-01-09 5:31 ` Qu Wenruo
2026-01-23 23:02 ` [PATCH 0/2] btrfs-progs: mkfs: optimize the discard behavior so it won't drive me crazy David Sterba
2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2026-01-09 5:31 UTC (permalink / raw)
To: linux-btrfs
[PERFORMANCE REGRESSION]
After commit 4b861c186592 ("btrfs-progs: mkfs: discard free space"),
mkfs.btrfs is much slower in my VM, where /dev/test/scratch1 is an LV
from a virtio-blk device.
The virtio-blk device is backed by a file on ext4, and discard behavior
is already set to "ignore".
Good: (at 4b861c186592~1)
$ time ./mkfs.btrfs -f /dev/test/scratch1
real 0m0.016s
user 0m0.004s
sys 0m0.007s
Bad: (at 4b861c186592)
$ time mkfs.btrfs -f /dev/test/scratch1
real 0m0.782s
user 0m0.012s
sys 0m0.313s
The delay is enough to drive an impatient person (like me) crazy.
[CAUSE]
The idea of that commit is completely fine, and I see no problem in the
code itself.
However the problem is in the way we submit discard commands.
For the default mkfs profiles, we are using DUP for single device fses or
RAID1 for multi-device fses.
Especially for DUP, we will submit two discard commands to the same
device, but to different physical locations.
This makes the discard of metadata block groups more like some random
writes, as we have to switch between different device extents.
From my tests, that's the root cause, and if we re-order all those
discards into a more sequential submission, mkfs will become blazing
fast again.
[FIX]
Instead of submitting a discard request for each hole we find, queue all
those requests into a per-device cache_tree, which merges any adjacent
ranges and provides a sequential way to iterate over every range.
With this new optimization, the runtime is back into the <0.02s range:
$ time ./mkfs.btrfs -f /dev/test/scratch1
real 0m0.019s
user 0m0.006s
sys 0m0.007s
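The merging behaviour the per-device tree relies on can be sketched with a simplified, array-based stand-in (the `add_merge_range()` helper below is hypothetical illustration code; the real series uses add_merge_cache_extent() on a struct cache_tree, which also handles allocation):

```c
#include <assert.h>
#include <stddef.h>

struct range { unsigned long long start, len; };

/*
 * Insert [start, start + len) into a sorted array of non-overlapping
 * ranges, merging with any neighbour it overlaps or touches, similar to
 * what add_merge_cache_extent() does for the per-device discard tree.
 * Returns the new element count.
 */
static size_t add_merge_range(struct range *r, size_t n, size_t cap,
			      unsigned long long start, unsigned long long len)
{
	size_t i = 0;

	/* Find the first range that ends at or after the new start. */
	while (i < n && r[i].start + r[i].len < start)
		i++;

	if (i < n && r[i].start <= start + len) {
		/* Overlaps or touches r[i]: extend it to cover both. */
		unsigned long long end = start + len;

		if (r[i].start + r[i].len > end)
			end = r[i].start + r[i].len;
		if (start < r[i].start)
			r[i].start = start;
		r[i].len = end - r[i].start;

		/* Swallow any later ranges the extended one now touches. */
		while (i + 1 < n && r[i + 1].start <= r[i].start + r[i].len) {
			unsigned long long next_end =
				r[i + 1].start + r[i + 1].len;

			if (next_end > r[i].start + r[i].len)
				r[i].len = next_end - r[i].start;
			for (size_t j = i + 1; j + 1 < n; j++)
				r[j] = r[j + 1];
			n--;
		}
		return n;
	}

	/* No overlap: shift the tail and insert a new element. */
	if (n >= cap)
		return n; /* array full; the real tree allocates instead */
	for (size_t j = n; j > i; j--)
		r[j] = r[j - 1];
	r[i].start = start;
	r[i].len = len;
	return n + 1;
}
```

Because every queued hole goes through this merge step, walking the tree afterwards yields one ascending, maximally-merged sequence of discards per device.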
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
kernel-shared/volumes.c | 4 +++
kernel-shared/volumes.h | 3 ++
mkfs/main.c | 70 +++++++++++++++++++++++++++++++++++++++--
3 files changed, 74 insertions(+), 3 deletions(-)
diff --git a/kernel-shared/volumes.c b/kernel-shared/volumes.c
index 0a7301281470..b4093e0249e4 100644
--- a/kernel-shared/volumes.c
+++ b/kernel-shared/volumes.c
@@ -554,6 +554,7 @@ static int device_list_add(const char *path,
/* we can safely leave the fs_devices entry around */
return -ENOMEM;
}
+ cache_tree_init(&device->discard);
device->fd = -1;
device->devid = devid;
device->generation = found_transid;
@@ -643,6 +644,7 @@ again:
}
device->writeable = 0;
list_del(&device->dev_list);
+ free_extent_cache_tree(&device->discard);
/* free the memory */
kfree(device->name);
kfree(device->label);
@@ -2388,6 +2390,7 @@ static struct btrfs_device *fill_missing_device(u64 devid, const u8 *uuid)
device = kzalloc(sizeof(*device), GFP_NOFS);
device->devid = devid;
+ cache_tree_init(&device->discard);
memcpy(device->uuid, uuid, BTRFS_UUID_SIZE);
device->fd = -1;
return device;
@@ -2561,6 +2564,7 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
device = kzalloc(sizeof(*device), GFP_NOFS);
if (!device)
return -ENOMEM;
+ cache_tree_init(&device->discard);
device->fd = -1;
list_add(&device->dev_list,
&fs_info->fs_devices->devices);
diff --git a/kernel-shared/volumes.h b/kernel-shared/volumes.h
index 74fccd147d82..60ca4593a108 100644
--- a/kernel-shared/volumes.h
+++ b/kernel-shared/volumes.h
@@ -39,6 +39,9 @@ struct btrfs_device {
struct btrfs_fs_devices *fs_devices;
struct btrfs_fs_info *fs_info;
+ /* Record the ranges that needs to be discarded during mkfs. */
+ struct cache_tree discard;
+
u64 total_ios;
int fd;
diff --git a/mkfs/main.c b/mkfs/main.c
index ce85d34f077a..58e9b57b4c9f 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1280,6 +1280,68 @@ cleanup:
return ret;
}
+static int queue_discard_logical(struct btrfs_fs_info *fs_info, u64 start, u64 len)
+{
+ struct btrfs_multi_bio *multi = NULL;
+ int ret;
+ u64 cur_offset = 0;
+ u64 cur_len = 0;
+
+ while (cur_offset < len) {
+ struct btrfs_device *device;
+
+ cur_len = len - cur_offset;
+ ret = btrfs_map_block(fs_info, WRITE, start + cur_offset, &cur_len,
+ &multi, 0, NULL);
+ if (ret)
+ return ret;
+
+ if (multi->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
+ free(multi);
+ break;
+ }
+
+ cur_len = min(cur_len, len - cur_offset);
+
+ for (int i = 0; i < multi->num_stripes; i++) {
+ device = multi->stripes[i].dev;
+
+ ret = add_merge_cache_extent(&device->discard,
+ multi->stripes[i].physical, cur_len);
+ if (ret < 0) {
+ free(multi);
+ return ret;
+ }
+ }
+ free(multi);
+ multi = NULL;
+ cur_offset += cur_len;
+ }
+ return 0;
+}
+
+static int discard_all_devices(struct btrfs_fs_info *fs_info)
+{
+ struct btrfs_device *dev;
+
+ list_for_each_entry(dev, &fs_info->fs_devices->devices, dev_list) {
+
+ if (!dev->writeable)
+ continue;
+ for (struct cache_extent *cache = first_cache_extent(&dev->discard);
+ cache; cache = next_cache_extent(cache)) {
+ int ret;
+
+ ret = device_discard_blocks(dev->fd, cache->start, cache->size);
+ if (ret == EOPNOTSUPP)
+ return 0;
+ if (ret < 0)
+ return ret;
+ }
+ }
+ return 0;
+}
+
static int discard_free_space(struct btrfs_fs_info *fs_info, u64 metadata_profile)
{
struct btrfs_root *free_space_root;
@@ -1329,7 +1391,7 @@ static int discard_free_space(struct btrfs_fs_info *fs_info, u64 metadata_profil
btrfs_item_key_to_cpu(leaf, &key, path.slots[0]);
if (key.type == BTRFS_FREE_SPACE_EXTENT_KEY) {
- ret = discard_logical_range(fs_info, key.objectid, key.offset);
+ ret = queue_discard_logical(fs_info, key.objectid, key.offset);
if (ret < 0)
goto out;
} else if (key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
@@ -1358,7 +1420,7 @@ static int discard_free_space(struct btrfs_fs_info *fs_info, u64 metadata_profil
addr = key.objectid + (start_bit * fs_info->sectorsize);
length = (end_bit - start_bit) * fs_info->sectorsize;
- ret = discard_logical_range(fs_info, addr, length);
+ ret = queue_discard_logical(fs_info, addr, length);
if (ret < 0) {
free(bitmap);
goto out;
@@ -1372,8 +1434,10 @@ static int discard_free_space(struct btrfs_fs_info *fs_info, u64 metadata_profil
path.slots[0]++;
}
+ btrfs_release_path(&path);
- ret = 0;
+ /* Every discard range is properly queued. Now submit the real discard request. */
+ return discard_all_devices(fs_info);
out:
btrfs_release_path(&path);
--
2.52.0
* Re: [PATCH 0/2] btrfs-progs: mkfs: optimize the discard behavior so it won't drive me crazy
2026-01-09 5:31 [PATCH 0/2] btrfs-progs: mkfs: optimize the discard behavior so it won't drive me crazy Qu Wenruo
2026-01-09 5:31 ` [PATCH 1/2] btrfs-progs: mkfs: discard the logical range in one search Qu Wenruo
2026-01-09 5:31 ` [PATCH 2/2] btrfs-progs: mkfs: optimise the block group discarding behavior Qu Wenruo
@ 2026-01-23 23:02 ` David Sterba
2 siblings, 0 replies; 4+ messages in thread
From: David Sterba @ 2026-01-23 23:02 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Fri, Jan 09, 2026 at 04:01:13PM +1030, Qu Wenruo wrote:
> After commit 4b861c186592 ("btrfs-progs: mkfs: discard free space"),
> mkfs.btrfs inside my VM is much slower.
>
> Previously it took only around 0.015s; now it takes over 0.750s, which
> is around a 50x regression, and that is even though the virtio-blk
> device is already ignoring discard commands.
>
> It turns out that the main problem is inside how we submit discard
> requests.
>
> Currently we submit the discard immediately after finding a free space,
> but for DUP profiles (the default one for metadata/system chunks), we
> send a discard request for each mirror.
>
> Since it's DUP, the two device extents are on the same device, and for
> the next free space we send two discard requests again, meaning we keep
> switching between two different dev extents. This makes the discard
> requests look more like random writes, greatly reducing the performance.
>
> The root fix is in the second patch, where we record and re-order the
> discard requests for each device, so that the eventual requests are all
> in ascending order and are merged when possible.
>
> The first patch is just a minor cleanup to reduce the number of
> btrfs_map_block() calls by using WRITE for discard.
>
> With this series, the runtime of mkfs.btrfs is still increased (by the
> free space discarding), but it is still fast enough that even I can not
> sense it (0.015s -> 0.017s), finally bringing back my inner peace.
>
> Qu Wenruo (2):
> btrfs-progs: mkfs: discard the logical range in one search
> btrfs-progs: mkfs: optimise the block group discarding behavior
Thanks, added to devel. I was looking for some funny quote regarding
impatience and going crazy but nothing came out as adequate. Even though
it's reducing the time from 0.75s to 0.017s and the old delay might seem
tolerable now, in the long run the small delays and performance drops
accumulate and it's much harder to identify the cause.