* [PATCH] btrfs: Limit size of bios submitted from writeback
@ 2026-04-22 9:42 Jan Kara
2026-04-22 10:29 ` Qu Wenruo
0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2026-04-22 9:42 UTC (permalink / raw)
To: David Sterba; +Cc: linux-btrfs, Jan Kara
Currently btrfs_writepages() just accumulates as large a bio as possible
(within writeback_control constraints) and then submits it. This can
however lead to significant latency in writeback IO submission (I have
observed tens of milliseconds) because the submitted bio can easily exceed
a hundred megabytes. Consequently this leads to IO pipeline stalls and
reduced throughput.
At the same time, beyond a certain size submitting such a large bio
provides diminishing returns because the bio is split by the block layer
immediately anyway. So compute an estimate of the bio size beyond which we
are unlikely to improve performance, and submit the bio for writeback once
we accumulate that much, to keep the IO pipeline busy.
This improves writeback throughput for sequential writes by about 15% on
the test machine I was using.
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/btrfs/disk-io.c | 7 ++++++
fs/btrfs/extent_io.c | 10 ++++++++
fs/btrfs/fs.h | 1 +
fs/btrfs/volumes.c | 54 ++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/volumes.h | 1 +
5 files changed, 73 insertions(+)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8a11be02eeb9..f063595d0cee 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3591,6 +3591,13 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
}
}
+ ret = btrfs_init_writeback_bio_size(fs_info);
+ if (ret) {
+ btrfs_err(fs_info, "failed to get optimum writeback size: %d",
+ ret);
+ goto fail_sysfs;
+ }
+
btrfs_free_zone_cache(fs_info);
btrfs_check_active_zone_reservation(fs_info);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ca3e4b99aec2..9c603d59a09b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2555,6 +2555,16 @@ static int extent_write_cache_pages(struct address_space *mapping,
break;
}
+ /*
+ * If we have accumulated decent amount of IO, send it
+ * to the block layer so that IO can run while we are
+ * accumulating more folios to write.
+ */
+ if (bio_ctrl->bbio &&
+ bio_ctrl->bbio->bio.bi_iter.bi_size >=
+ inode_to_fs_info(inode)->writeback_bio_size)
+ submit_write_bio(bio_ctrl, 0);
+
/*
* The filesystem may choose to bump up nr_to_write.
* We have to make sure to honor the new nr_to_write
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index a4758d94b32e..19e02452ab96 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -880,6 +880,7 @@ struct btrfs_fs_info {
u32 block_min_order;
u32 block_max_order;
u32 stripesize;
+ u32 writeback_bio_size;
u32 csum_size;
u32 csums_per_leaf;
u32 csum_type;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a88e68f90564..cb654e990333 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -8179,6 +8179,60 @@ int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info)
return ret;
}
+/*
+ * At maximum we submit writeback bios 64MB in size to avoid too large
+ * submission latencies
+ */
+#define BTRFS_MAX_WB_BIO_SIZE (64 << 20)
+
+int btrfs_init_writeback_bio_size(struct btrfs_fs_info *fs_info)
+{
+ struct rb_node *node;
+ u32 writeback_bio_sectors = 1;
+
+ read_lock(&fs_info->mapping_tree_lock);
+ /*
+ * For each data chunk compute the size of bio large enough to submit
+ * optimum size request for each of chunk's disk and take maximum
+ * over all data chunks.
+ */
+ for (node = rb_first_cached(&fs_info->mapping_tree); node;
+ node = rb_next(node)) {
+ struct btrfs_chunk_map *map;
+ unsigned int data_stripes, opt_rq_size = fs_info->sectorsize;
+ int i;
+
+ map = rb_entry(node, struct btrfs_chunk_map, rb_node);
+ if (!(map->type & BTRFS_BLOCK_GROUP_DATA))
+ continue;
+ data_stripes = calc_data_stripes(map->type, map->num_stripes);
+ for (i = 0; i < map->num_stripes; i++) {
+ struct request_queue *queue;
+ unsigned int io_opt;
+
+ if (!map->stripes[i].dev)
+ continue;
+ queue = bdev_get_queue(map->stripes[i].dev->bdev);
+ io_opt = queue_io_opt(queue) ? :
+ queue_max_sectors(queue) << SECTOR_SHIFT;
+ opt_rq_size = max(opt_rq_size, io_opt);
+ }
+ opt_rq_size >>= fs_info->sectorsize_bits;
+ writeback_bio_sectors = max(writeback_bio_sectors,
+ data_stripes * opt_rq_size);
+ }
+ read_unlock(&fs_info->mapping_tree_lock);
+
+ if (BTRFS_MAX_WB_BIO_SIZE >> fs_info->sectorsize_bits <=
+ writeback_bio_sectors)
+ fs_info->writeback_bio_size = BTRFS_MAX_WB_BIO_SIZE;
+ else
+ fs_info->writeback_bio_size =
+ writeback_bio_sectors << fs_info->sectorsize_bits;
+
+ return 0;
+}
+
static int update_dev_stat_item(struct btrfs_trans_handle *trans,
struct btrfs_device *device)
{
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0082c166af91..96904d18f686 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -784,6 +784,7 @@ int btrfs_get_dev_stats(struct btrfs_fs_info *fs_info,
struct btrfs_ioctl_get_dev_stats *stats);
int btrfs_init_devices_late(struct btrfs_fs_info *fs_info);
int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info);
+int btrfs_init_writeback_bio_size(struct btrfs_fs_info *fs_info);
int btrfs_run_dev_stats(struct btrfs_trans_handle *trans);
void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_device *srcdev);
void btrfs_rm_dev_replace_free_srcdev(struct btrfs_device *srcdev);
--
2.51.0
* Re: [PATCH] btrfs: Limit size of bios submitted from writeback
2026-04-22 9:42 [PATCH] btrfs: Limit size of bios submitted from writeback Jan Kara
@ 2026-04-22 10:29 ` Qu Wenruo
2026-04-22 12:49 ` Jan Kara
0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2026-04-22 10:29 UTC (permalink / raw)
To: Jan Kara, David Sterba; +Cc: linux-btrfs
On 2026/4/22 19:12, Jan Kara wrote:
[...]
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index ca3e4b99aec2..9c603d59a09b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2555,6 +2555,16 @@ static int extent_write_cache_pages(struct address_space *mapping,
> break;
> }
>
> + /*
> + * If we have accumulated decent amount of IO, send it
> + * to the block layer so that IO can run while we are
> + * accumulating more folios to write.
> + */
> + if (bio_ctrl->bbio &&
> + bio_ctrl->bbio->bio.bi_iter.bi_size >=
> + inode_to_fs_info(inode)->writeback_bio_size)
> + submit_write_bio(bio_ctrl, 0);
I'd prefer to move the check a little earlier, ideally inside
submit_extent_folio() where we already have a similar check for ordered
extent boundaries.
One reason is that we're considering huge folio support recently, and
with huge folios on arm64 we can have a folio as large as 32MiB, so in
the worst case we can have a bio as large as 96MiB before submitting it.
[...]
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a88e68f90564..cb654e990333 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -8179,6 +8179,60 @@ int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info)
> return ret;
> }
>
> +/*
> + * At maximum we submit writeback bios 64MB in size to avoid too large
> + * submission latencies
> + */
> +#define BTRFS_MAX_WB_BIO_SIZE (64 << 20)
> +
> +int btrfs_init_writeback_bio_size(struct btrfs_fs_info *fs_info)
> +{
> + struct rb_node *node;
> + u32 writeback_bio_sectors = 1;
> +
> + read_lock(&fs_info->mapping_tree_lock);
> + /*
> + * For each data chunk compute the size of bio large enough to submit
> + * optimum size request for each of chunk's disk and take maximum
> + * over all data chunks.
> + */
> + for (node = rb_first_cached(&fs_info->mapping_tree); node;
> + node = rb_next(node)) {
Iterating through all chunk maps may take some time for huge filesystems.
Meanwhile the device list is way smaller than the chunk maps, so what about
iterating through all devices instead?
Not to mention we are going to hit the same devices again and again
through the chunk maps.
This may not handle all corner cases, e.g. a fs with new disks added,
but should handle the most common cases pretty well.
> + struct btrfs_chunk_map *map;
> + unsigned int data_stripes, opt_rq_size = fs_info->sectorsize;
> + int i;
> +
> + map = rb_entry(node, struct btrfs_chunk_map, rb_node);
> + if (!(map->type & BTRFS_BLOCK_GROUP_DATA))
> + continue;
> + data_stripes = calc_data_stripes(map->type, map->num_stripes);
> + for (i = 0; i < map->num_stripes; i++) {
> + struct request_queue *queue;
> + unsigned int io_opt;
> +
> + if (!map->stripes[i].dev)
> + continue;
> + queue = bdev_get_queue(map->stripes[i].dev->bdev);
> + io_opt = queue_io_opt(queue) ? :
> + queue_max_sectors(queue) << SECTOR_SHIFT;
> + opt_rq_size = max(opt_rq_size, io_opt);
I'm wondering if we should use the minimum or the maximum size.
If the optimal IO sizes are very different, e.g. 512K vs 128M, the final
result will be truncated to 64M; would a 64M IO submission stall the
pipeline for that 512K device?
Thanks a lot for finding the root cause!
Qu
* Re: [PATCH] btrfs: Limit size of bios submitted from writeback
2026-04-22 10:29 ` Qu Wenruo
@ 2026-04-22 12:49 ` Jan Kara
2026-04-22 21:43 ` Qu Wenruo
0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2026-04-22 12:49 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Jan Kara, David Sterba, linux-btrfs
On Wed 22-04-26 19:59:38, Qu Wenruo wrote:
>
>
> On 2026/4/22 19:12, Jan Kara wrote:
> [...]
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index ca3e4b99aec2..9c603d59a09b 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2555,6 +2555,16 @@ static int extent_write_cache_pages(struct address_space *mapping,
> > break;
> > }
> > + /*
> > + * If we have accumulated decent amount of IO, send it
> > + * to the block layer so that IO can run while we are
> > + * accumulating more folios to write.
> > + */
> > + if (bio_ctrl->bbio &&
> > + bio_ctrl->bbio->bio.bi_iter.bi_size >=
> > + inode_to_fs_info(inode)->writeback_bio_size)
> > + submit_write_bio(bio_ctrl, 0);
>
> I'd prefer to move the check a little earlier, better inside
> submit_extent_folio() where we already have a similar check for ordered
> extent boundaries.
>
> One reason here is, we're considering huge folio support recently, and with
> huge folios on arm64, we can have a folio as large as 32MiB, thus for the
> worst case we can have a bio as large as 96MiB before submitting it.
Ok, done.
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index a88e68f90564..cb654e990333 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -8179,6 +8179,60 @@ int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info)
> > return ret;
> > }
> > +/*
> > + * At maximum we submit writeback bios 64MB in size to avoid too large
> > + * submission latencies
> > + */
> > +#define BTRFS_MAX_WB_BIO_SIZE (64 << 20)
> > +
> > +int btrfs_init_writeback_bio_size(struct btrfs_fs_info *fs_info)
> > +{
> > + struct rb_node *node;
> > + u32 writeback_bio_sectors = 1;
> > +
> > + read_lock(&fs_info->mapping_tree_lock);
> > + /*
> > + * For each data chunk compute the size of bio large enough to submit
> > + * optimum size request for each of chunk's disk and take maximum
> > + * over all data chunks.
> > + */
> > + for (node = rb_first_cached(&fs_info->mapping_tree); node;
> > + node = rb_next(node)) {
>
> Iterating through all chunk maps may take some time for huge filesystems.
Yeah, I was wondering a bit. But since this is done only on mount, I
figured it doesn't matter much.
> Meanwhile the device list is way smaller than the chunk maps, what about
> iterating through all devices instead?
I was thinking about that as well. But as you can see below, we also need
to know how the devices are put together in a raid, and AFAIU that
information isn't there at the device level... If there's some better data
structure to iterate over to get information about the raid configurations
of all pieces of the btrfs filesystem, I can certainly use that.
> Not to mention we are going to hit the same devices again and again through
> the chunk maps.
>
> This may not handle all corner cases, e.g. a fs with new disks added, but
> should handle the most common cases pretty well.
Yeah, I decided not to care about device hotplug. But I still want to
pick a bio size large enough to fill each data disk in a raid with at
least an io_opt-sized request.
> > + struct btrfs_chunk_map *map;
> > + unsigned int data_stripes, opt_rq_size = fs_info->sectorsize;
> > + int i;
> > +
> > + map = rb_entry(node, struct btrfs_chunk_map, rb_node);
> > + if (!(map->type & BTRFS_BLOCK_GROUP_DATA))
> > + continue;
> > + data_stripes = calc_data_stripes(map->type, map->num_stripes);
> > + for (i = 0; i < map->num_stripes; i++) {
> > + struct request_queue *queue;
> > + unsigned int io_opt;
> > +
> > + if (!map->stripes[i].dev)
> > + continue;
> > + queue = bdev_get_queue(map->stripes[i].dev->bdev);
> > + io_opt = queue_io_opt(queue) ? :
> > + queue_max_sectors(queue) << SECTOR_SHIFT;
> > + opt_rq_size = max(opt_rq_size, io_opt);
>
> I'm wondering if we should use the minimal or maximum size.
>
> If the optimal io sizes are very different, e.g. 512K vs 128M, the final
> result will be truncated to 64M, would a 64M io submission stall the
> pipeline for that 512K device?
So the optimum IO sizes reported by devices are usually in the 128k to 1m
range; the largest I've seen was 4m, so you usually won't get that large a
spread. Also I don't think a setup which mixes wildly different disks in
btrfs data block groups is sensible and common enough to care about. In the
worst case such users won't get the performance boost offered by this
optimization... The trimming to 64m mostly exists to deal with situations
where someone has an insane number of drives striped together or some buggy
device reports an absurd value.
How large a bio leads to IO stalls is impossible to say in general. It
depends on the speed of IO submission (including crc computation and all
the other stuff btrfs needs to do) compared to the disk throughput. In my
experiments, 64m is already enough to cost, say, 5% of your writeback
throughput on the system I was testing with.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] btrfs: Limit size of bios submitted from writeback
2026-04-22 12:49 ` Jan Kara
@ 2026-04-22 21:43 ` Qu Wenruo
2026-04-23 7:57 ` Jan Kara
0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2026-04-22 21:43 UTC (permalink / raw)
To: Jan Kara; +Cc: David Sterba, linux-btrfs
On 2026/4/22 22:19, Jan Kara wrote:
[...]
>> Iterating through all chunk maps may take some time for huge filesystems.
>
> Yeah, I was wondering a bit. But since this is done only on mount, I
> figured it doesn't matter much.
>
>> Meanwhile the device list is way smaller than the chunk maps, what about
>> iterating through all devices instead?
>
> I was thinking about that as well. But as you can see below, we also need
> to know how the devices are put together in a raid and AFAIU that
> information isn't there at the device level... If there's some better data
> structure to iterate to get information about raid configurations of all
> pieces of btrfs filesystem, I can certainly use that.
My idea would be to just ignore the detailed RAID configuration completely.
One point is that btrfs can have more than one data profile (e.g. during
balance, after a canceled balance, or with degraded writes), and even when
there is only a single data profile we can still have different optimal
sizes, e.g. RAID10 on very unbalanced disks.
Furthermore, for a real multi-device btrfs setup it's very likely the data
profile is at least striped (RAID0/RAID10), so the huge bio will be split
into 64K stripes by btrfs before submission anyway.
It really affects single-device profiles like SINGLE/DUP and mirror-only
profiles (RAID1*), where we directly submit such a huge bio to each device.
With that said, it would be much easier to ignore the complex RAID profile
iteration, and just grab the min/max optimal IO size of each device,
multiply it by the number of disks, and call it a day.
Thanks,
Qu
* Re: [PATCH] btrfs: Limit size of bios submitted from writeback
2026-04-22 21:43 ` Qu Wenruo
@ 2026-04-23 7:57 ` Jan Kara
0 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2026-04-23 7:57 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Jan Kara, David Sterba, linux-btrfs
On Thu 23-04-26 07:13:59, Qu Wenruo wrote:
> On 2026/4/22 22:19, Jan Kara wrote:
> [...]
> > > Iterating through all chunk maps may take some time for huge filesystems.
> >
> > Yeah, I was wondering a bit. But since this is done only on mount, I
> > figured it doesn't matter much.
> >
> > > Meanwhile the device list is way smaller than the chunk maps, what about
> > > iterating through all devices instead?
> >
> > I was thinking about that as well. But as you can see below, we also need
> > to know how the devices are put together in a raid and AFAIU that
> > information isn't there at the device level... If there's some better data
> > structure to iterate to get information about raid configurations of all
> > pieces of btrfs filesystem, I can certainly use that.
>
> My idea would be just ignore the detailed RAID configuration completely.
>
> One point is, btrfs can have more than one data profiles (e.g. during
> balance, or canceled balance, or degraded writes), and even there is only a
> single data profile, we can still have different optimized sizes, e.g.
> RAID10 on very unbalanced disks.
So these would be corner cases I'd be happy to ignore...
> Furthermore for real multi-device btrfs setup, it's very likely the data
> profile is at least striped (RAID0/RAID10), thus the huge bio will be split
> into 64K stripes by btrfs before submission already.
But this is a good point. With RAID setups btrfs submits one bio per
stripe (64k) anyway, so there's no point in computing any larger limit. It
would never get used anyway.
> It's really affecting single-device profiles like SINGLE/DUP and mirror-only
> profiles (RAID1*), where we directly submit such huge bio to each device.
>
> With that said, it will be much easier to ignore the complex RAID profile
> iteration, and just grab the min/max optimal io size of each device,
> multiply it by the number of disks and call it a day.
OK, you've convinced me :) I'll change to iterating over all devices. I
just don't think multiplying by the number of disks makes sense - RAID
setups don't matter, as we agreed above, and with simple setups you're
always submitting to a single disk at a time anyway. So I'd go for the max
of opt_io over all disks.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR