* [PATCH v2] btrfs: Limit size of bios submitted from writeback
@ 2026-04-23 9:30 Jan Kara
2026-04-23 9:54 ` Qu Wenruo
0 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2026-04-23 9:30 UTC (permalink / raw)
To: David Sterba; +Cc: Qu Wenruo, linux-btrfs, Jan Kara
Currently btrfs_writepages() just accumulates as large a bio as possible
(within the writeback_control constraints) and then submits it. This can
however lead to significant latency in writeback IO submission (I have
observed tens of miliseconds) because the submitted bio can easily be over
a hundred megabytes. Consequently this leads to IO pipeline stalls and
reduced throughput.
At the same time, beyond a certain size, submitting such a large bio
provides diminishing returns because the bio is split by the block layer
immediately anyway. So compute an estimate of the bio size beyond which we
are unlikely to improve performance and submit the bio for writeback once
we accumulate that much, to keep the IO pipeline busy. This improves
writeback throughput for sequential writes by about 15% on the test
machine I was using.
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/btrfs/disk-io.c | 7 +++++++
fs/btrfs/extent_io.c | 10 ++++++++++
fs/btrfs/fs.h | 1 +
fs/btrfs/volumes.c | 29 +++++++++++++++++++++++++++++
fs/btrfs/volumes.h | 1 +
5 files changed, 48 insertions(+)
Changes since v1:
- moved limit checks to submit_extent_folio
- simplified computation of maximum bio size
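For anyone curious what threshold a given disk would end up with, here is a
minimal userspace sketch (illustrative only, not part of the patch) that
mirrors the io_opt-or-max_sectors fallback on top of the standard block
queue sysfs attributes; the 4096-byte floor merely stands in for
fs_info->sectorsize:

#include <stdio.h>

/* Read one numeric attribute from /sys/block/<disk>/queue/. */
static unsigned long read_queue_attr(const char *disk, const char *attr)
{
	char path[256];
	unsigned long val = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", disk, attr);
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (fscanf(f, "%lu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

int main(int argc, char **argv)
{
	const char *disk = argc > 1 ? argv[1] : "sda";
	/* optimal_io_size is reported in bytes, max_sectors_kb in KiB. */
	unsigned long io_opt = read_queue_attr(disk, "optimal_io_size");
	unsigned long max_bytes = read_queue_attr(disk, "max_sectors_kb") * 1024;
	unsigned long threshold = io_opt ? io_opt : max_bytes;

	if (threshold < 4096)
		threshold = 4096;
	printf("%s: writeback bio threshold would be %lu bytes\n", disk, threshold);
	return 0;
}

Running it as e.g. "./wb_threshold nvme0n1" should print a value in the
same ballpark as what btrfs_init_writeback_bio_size() below computes for
that device.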
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8a11be02eeb9..f063595d0cee 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3591,6 +3591,13 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
}
}
+ ret = btrfs_init_writeback_bio_size(fs_info);
+ if (ret) {
+ btrfs_err(fs_info, "failed to get optimum writeback size: %d",
+ ret);
+ goto fail_sysfs;
+ }
+
btrfs_free_zone_cache(fs_info);
btrfs_check_active_zone_reservation(fs_info);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ca3e4b99aec2..d13d7eb95d44 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -857,6 +857,16 @@ static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
/* Ordered extent boundary: move on to a new bio. */
if (bio_ctrl->len_to_oe_boundary == 0)
submit_one_bio(bio_ctrl);
+ /*
+ * If we have accumulated a decent amount of IO, send it to the
+ * block layer so that IO can run while we are accumulating
+ * more folios to write.
+ */
+ else if (bio_ctrl->wbc &&
+ bio_ctrl->bbio->bio.bi_iter.bi_size >=
+ inode->root->fs_info->writeback_bio_size)
+ submit_one_bio(bio_ctrl);
+
} while (size);
}
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index a4758d94b32e..19e02452ab96 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -880,6 +880,7 @@ struct btrfs_fs_info {
u32 block_min_order;
u32 block_max_order;
u32 stripesize;
+ u32 writeback_bio_size;
u32 csum_size;
u32 csums_per_leaf;
u32 csum_type;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a88e68f90564..c27614a23ffb 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -8179,6 +8179,35 @@ int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info)
return ret;
}
+int btrfs_init_writeback_bio_size(struct btrfs_fs_info *fs_info)
+{
+ struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+ struct btrfs_device *device;
+ u32 writeback_bio_size = fs_info->sectorsize;
+
+ mutex_lock(&fs_devices->device_list_mutex);
+ /*
+ * Let's take the maximum over the optimal request sizes of all devices. For
+ * RAID profiles writeback will submit stripe (64k) sized bios anyway
+ * so our value doesn't matter and for simple profiles this is a good
+ * approximation of sensible IO chunking.
+ */
+ list_for_each_entry(device, &fs_devices->devices, dev_list) {
+ struct request_queue *queue;
+ unsigned int io_opt;
+
+ queue = bdev_get_queue(device->bdev);
+ io_opt = queue_io_opt(queue) ? :
+ queue_max_sectors(queue) << SECTOR_SHIFT;
+ writeback_bio_size = max(writeback_bio_size, io_opt);
+ }
+ mutex_unlock(&fs_devices->device_list_mutex);
+
+ fs_info->writeback_bio_size = writeback_bio_size;
+
+ return 0;
+}
+
static int update_dev_stat_item(struct btrfs_trans_handle *trans,
struct btrfs_device *device)
{
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0082c166af91..96904d18f686 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -784,6 +784,7 @@ int btrfs_get_dev_stats(struct btrfs_fs_info *fs_info,
struct btrfs_ioctl_get_dev_stats *stats);
int btrfs_init_devices_late(struct btrfs_fs_info *fs_info);
int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info);
+int btrfs_init_writeback_bio_size(struct btrfs_fs_info *fs_info);
int btrfs_run_dev_stats(struct btrfs_trans_handle *trans);
void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_device *srcdev);
void btrfs_rm_dev_replace_free_srcdev(struct btrfs_device *srcdev);
--
2.51.0
* Re: [PATCH v2] btrfs: Limit size of bios submitted from writeback
2026-04-23 9:30 [PATCH v2] btrfs: Limit size of bios submitted from writeback Jan Kara
@ 2026-04-23 9:54 ` Qu Wenruo
2026-04-27 9:03 ` Jan Kara
0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2026-04-23 9:54 UTC (permalink / raw)
To: Jan Kara, David Sterba; +Cc: linux-btrfs
On 2026/4/23 19:00, Jan Kara wrote:
> Currently btrfs_writepages() just accumulates as large bio as possible
> (within writeback_control constraints) and then submits it. This can
> however lead to significant latency in writeback IO submission (I have
> observed tens of miliseconds) because the submitted bio easily has over
> hundred of megabytes. Consequently this leads to IO pipeline stalls and
> reduced throughput.
>
> At the same time beyond certain size submitting so large bio provides
> diminishing returns because the bio is split by the block layer
> immediately anyway. So compute (estimate of) bio size beyond which we
> are unlikely to improve performance and just submit the bio for
> writeback once we accumulate that much to keep the IO pipeline busy.
> This improves writeback throughput for sequential writes by about 15% on
> the test machine I was using.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
Looks great!
Reviewed-by: Qu Wenruo <wqu@suse.com>
Just one minor question inlined below.
[...]
> +int btrfs_init_writeback_bio_size(struct btrfs_fs_info *fs_info)
> +{
> + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> + struct btrfs_device *device;
> + u32 writeback_bio_size = fs_info->sectorsize;
> +
> + mutex_lock(&fs_devices->device_list_mutex);
> + /*
> + * Let's take maximum over optimal request sizes for all devices. For
> + * RAID profiles writeback will submit stripe (64k) sized bios anyway
> + * so our value doesn't matter and for simple profiles this is a good
> + * approximation of sensible IO chunking.
> + */
> + list_for_each_entry(device, &fs_devices->devices, dev_list) {
> + struct request_queue *queue;
> + unsigned int io_opt;
> +
> + queue = bdev_get_queue(device->bdev);
> + io_opt = queue_io_opt(queue) ? :
> + queue_max_sectors(queue) << SECTOR_SHIFT;
> + writeback_bio_size = max(writeback_bio_size, io_opt);
> + }
> + mutex_unlock(&fs_devices->device_list_mutex);
> +
> + fs_info->writeback_bio_size = writeback_bio_size;
With this simplified version of optimal io size detection, do we want to
hook dev add/removal/replace to update the calculation?
I guess in the real world the added/removed/replaced disks should all have
the same performance parameters for server usage, so no difference there.
And for personal/pro users, I doubt the original performance problem is
even noticeable for most end users.
So overall I'm fine either way.
Thanks,
Qu
* Re: [PATCH v2] btrfs: Limit size of bios submitted from writeback
2026-04-23 9:54 ` Qu Wenruo
@ 2026-04-27 9:03 ` Jan Kara
2026-04-27 9:50 ` Qu Wenruo
0 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2026-04-27 9:03 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Jan Kara, David Sterba, linux-btrfs
On Thu 23-04-26 19:24:02, Qu Wenruo wrote:
>
>
> On 2026/4/23 19:00, Jan Kara wrote:
> > Currently btrfs_writepages() just accumulates as large bio as possible
> > (within writeback_control constraints) and then submits it. This can
> > however lead to significant latency in writeback IO submission (I have
> > observed tens of miliseconds) because the submitted bio easily has over
> > hundred of megabytes. Consequently this leads to IO pipeline stalls and
> > reduced throughput.
> >
> > At the same time beyond certain size submitting so large bio provides
> > diminishing returns because the bio is split by the block layer
> > immediately anyway. So compute (estimate of) bio size beyond which we
> > are unlikely to improve performance and just submit the bio for
> > writeback once we accumulate that much to keep the IO pipeline busy.
> > This improves writeback throughput for sequential writes by about 15% on
> > the test machine I was using.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
>
> Looks great!
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Just one minor question inlined below.
Thanks!
> > +int btrfs_init_writeback_bio_size(struct btrfs_fs_info *fs_info)
> > +{
> > + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> > + struct btrfs_device *device;
> > + u32 writeback_bio_size = fs_info->sectorsize;
> > +
> > + mutex_lock(&fs_devices->device_list_mutex);
> > + /*
> > + * Let's take maximum over optimal request sizes for all devices. For
> > + * RAID profiles writeback will submit stripe (64k) sized bios anyway
> > + * so our value doesn't matter and for simple profiles this is a good
> > + * approximation of sensible IO chunking.
> > + */
> > + list_for_each_entry(device, &fs_devices->devices, dev_list) {
> > + struct request_queue *queue;
> > + unsigned int io_opt;
> > +
> > + queue = bdev_get_queue(device->bdev);
> > + io_opt = queue_io_opt(queue) ? :
> > + queue_max_sectors(queue) << SECTOR_SHIFT;
> > + writeback_bio_size = max(writeback_bio_size, io_opt);
> > + }
> > + mutex_unlock(&fs_devices->device_list_mutex);
> > +
> > + fs_info->writeback_bio_size = writeback_bio_size;
>
> With this simplified version of optimal io size detection, do we want to
> hook dev add/removal/replace to update the calculation?
>
> I guess in the real world, the added/removed/replaced disks should have all
> the same performance parameter for server usages, so no difference there.
>
> And for personal/pro users, I doubt if the original performance problem is
> even noticeable for most end users.
>
> So overall I'm fine either way.
Yeah, at this point I'm not sure the complexity is worth it. Normally,
added disks have very similar parameters to existing ones, and even if they
are somewhat different, it will cost you a few percent of writeback speed
at worst, which doesn't seem too bad, and it will "fix" itself on the next
mount.
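(Purely for illustration, in case we change our mind later: the hook could
be as thin as the sketch below. Only btrfs_init_writeback_bio_size() is
from the patch; the wrapper name and the call sites mentioned in the
comments are made up.)

/*
 * Hypothetical helper, not something I'm proposing now. The existing
 * helper takes device_list_mutex itself, so the only rule for callers
 * (tail of device add, removal, replace-finish) would be to run it after
 * the device list change is visible and without that mutex held.
 */
static void btrfs_refresh_writeback_bio_size(struct btrfs_fs_info *fs_info)
{
	/*
	 * writeback_bio_size is a plain u32 read without locking in
	 * submit_extent_folio(), so a briefly stale value is harmless.
	 */
	btrfs_init_writeback_bio_size(fs_info);
}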
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v2] btrfs: Limit size of bios submitted from writeback
2026-04-27 9:03 ` Jan Kara
@ 2026-04-27 9:50 ` Qu Wenruo
2026-04-27 23:48 ` Qu Wenruo
0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2026-04-27 9:50 UTC (permalink / raw)
To: Jan Kara, Qu Wenruo; +Cc: David Sterba, linux-btrfs
On 2026/4/27 18:33, Jan Kara wrote:
> On Thu 23-04-26 19:24:02, Qu Wenruo wrote:
>>
>>
>> On 2026/4/23 19:00, Jan Kara wrote:
>>> Currently btrfs_writepages() just accumulates as large bio as possible
>>> (within writeback_control constraints) and then submits it. This can
>>> however lead to significant latency in writeback IO submission (I have
>>> observed tens of miliseconds) because the submitted bio easily has over
>>> hundred of megabytes. Consequently this leads to IO pipeline stalls and
>>> reduced throughput.
>>>
>>> At the same time beyond certain size submitting so large bio provides
>>> diminishing returns because the bio is split by the block layer
>>> immediately anyway. So compute (estimate of) bio size beyond which we
>>> are unlikely to improve performance and just submit the bio for
>>> writeback once we accumulate that much to keep the IO pipeline busy.
>>> This improves writeback throughput for sequential writes by about 15% on
>>> the test machine I was using.
>>>
>>> Signed-off-by: Jan Kara <jack@suse.cz>
>>
>> Looks great!
>>
>> Reviewed-by: Qu Wenruo <wqu@suse.com>
>>
>> Just one minor question inlined below.
>
> Thanks!
>
>>> +int btrfs_init_writeback_bio_size(struct btrfs_fs_info *fs_info)
>>> +{
>>> + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>> + struct btrfs_device *device;
>>> + u32 writeback_bio_size = fs_info->sectorsize;
>>> +
>>> + mutex_lock(&fs_devices->device_list_mutex);
>>> + /*
>>> + * Let's take maximum over optimal request sizes for all devices. For
>>> + * RAID profiles writeback will submit stripe (64k) sized bios anyway
>>> + * so our value doesn't matter and for simple profiles this is a good
>>> + * approximation of sensible IO chunking.
>>> + */
>>> + list_for_each_entry(device, &fs_devices->devices, dev_list) {
>>> + struct request_queue *queue;
>>> + unsigned int io_opt;
>>> +
>>> + queue = bdev_get_queue(device->bdev);
>>> + io_opt = queue_io_opt(queue) ? :
>>> + queue_max_sectors(queue) << SECTOR_SHIFT;
>>> + writeback_bio_size = max(writeback_bio_size, io_opt);
>>> + }
>>> + mutex_unlock(&fs_devices->device_list_mutex);
>>> +
>>> + fs_info->writeback_bio_size = writeback_bio_size;
>>
>> With this simplified version of optimal io size detection, do we want to
>> hook dev add/removal/replace to update the calculation?
>>
>> I guess in the real world, the added/removed/replaced disks should have all
>> the same performance parameter for server usages, so no difference there.
>>
>> And for personal/pro users, I doubt if the original performance problem is
>> even noticeable for most end users.
>>
>> So overall I'm fine either way.
>
> Yeah, at this point I'm not sure the complexity is worth it. Normally added
> disks have very similar parameters as existing ones, also even if they are
> somewhat different, it will cost you a few percent of writeback speed at
> worst which doesn't seem too bad and it will "fix" itself on next mount.
Thanks, we're on the same page.
Now the patch is pushed to for-next branch, with one typo "miliseconds"
fixed.
Thanks,
Qu
>
> Honza
* Re: [PATCH v2] btrfs: Limit size of bios submitted from writeback
2026-04-27 9:50 ` Qu Wenruo
@ 2026-04-27 23:48 ` Qu Wenruo
2026-04-28 9:01 ` Jan Kara
0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2026-04-27 23:48 UTC (permalink / raw)
To: Jan Kara, Qu Wenruo; +Cc: David Sterba, linux-btrfs
On 2026/4/27 19:20, Qu Wenruo wrote:
[...]
>> Yeah, at this point I'm not sure the complexity is worth it. Normally
>> added
>> disks have very similar parameters as existing ones, also even if they
>> are
>> somewhat different, it will cost you a few percent of writeback speed at
>> worst which doesn't seem too bad and it will "fix" itself on next mount.
>
> Thanks, we're on the same page.
>
> Now the patch is pushed to for-next branch, with one typo "miliseconds"
> fixed.
>
> Thanks,
> Qu
Just a minor update.
The existing code assumes device->bdev always exists, but we can have
missing devices which don't have device->bdev, and this causes a NULL
pointer dereference in test cases like btrfs/027.
Fixed in for-next branch with the following small diff:
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8d36e8a9f0d9..93a923e4ecaf 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -8224,6 +8224,8 @@ int btrfs_init_writeback_bio_size(struct btrfs_fs_info *fs_info)
struct request_queue *queue;
unsigned int io_opt;

+ if (!device->bdev || test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
+ continue;
queue = bdev_get_queue(device->bdev);
io_opt = queue_io_opt(queue) ? :
queue_max_sectors(queue) << SECTOR_SHIFT;
Thanks,
Qu
>>
>> Honza
>
>
* Re: [PATCH v2] btrfs: Limit size of bios submitted from writeback
2026-04-27 23:48 ` Qu Wenruo
@ 2026-04-28 9:01 ` Jan Kara
0 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2026-04-28 9:01 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Jan Kara, Qu Wenruo, David Sterba, linux-btrfs
On Tue 28-04-26 09:18:38, Qu Wenruo wrote:
>
>
> On 2026/4/27 19:20, Qu Wenruo wrote:
> [...]
> > > Yeah, at this point I'm not sure the complexity is worth it.
> > > Normally added
> > > disks have very similar parameters as existing ones, also even if
> > > they are
> > > somewhat different, it will cost you a few percent of writeback speed at
> > > worst which doesn't seem too bad and it will "fix" itself on next mount.
> >
> > Thanks, we're on the same page.
> >
> > Now the patch is pushed to for-next branch, with one typo "miliseconds"
> > fixed.
> >
> > Thanks,
> > Qu
>
> Just a minor update.
>
> The existing check is considering device->bdev always exists, but we can
> have missing devices which doesn't have device->bdev, and this will cause
> NULL pointer dereference during test cases like btrfs/027.
>
> Fixed in for-next branch with the following small diff:
Thanks for the fixup!
Honza
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8d36e8a9f0d9..93a923e4ecaf 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -8224,6 +8224,8 @@ int btrfs_init_writeback_bio_size(struct btrfs_fs_info *fs_info)
> struct request_queue *queue;
> unsigned int io_opt;
>
> + if (!device->bdev || test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
> + continue;
> queue = bdev_get_queue(device->bdev);
> io_opt = queue_io_opt(queue) ? :
> queue_max_sectors(queue) << SECTOR_SHIFT;
>
> Thanks,
> Qu
> > >
> > > Honza
> >
> >
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR