* [PATCH 1/6] btrfs: add skeleton for delayed btrfs bio
2026-03-19 21:04 [PATCH 0/6] btrfs: delay compression to bbio submission time Qu Wenruo
@ 2026-03-19 21:04 ` Qu Wenruo
2026-03-19 21:04 ` [PATCH 2/6] btrfs: add delayed ordered extent support Qu Wenruo
` (5 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2026-03-19 21:04 UTC (permalink / raw)
To: linux-btrfs
The objective of the new delayed btrfs bio infrastructure is to allow
compressed writes to go through the regular extent_writepage_io() path,
instead of the async submission path.
This will make it easier to align our write path to iomap.
The core ideas of delayed btrfs bio are:
- A placeholder ordered extent created at delalloc time
No space is reserved at that time. (This part is not implemented in
this patch.)
- A delayed extent map created at delalloc time
It will have a special disk_bytenr (-4) to indicate the range is
delayed (see the sentinel layout sketch after this list), along with
a new EXTENT_FLAG_DELAYED flag.
- Delayed btrfs bios will be limited to BTRFS_MAX_COMPRESSED size
As only compression will go through delayed btrfs bio.
- Delayed btrfs bios will have the @is_delayed flag set
Such a bio will have 0 as its bi_sector, and will never be submitted
directly through btrfs_submit_bbio().
The submission of a delayed btrfs bio is not implemented yet, and
will be added by later patches.
- Btrfs bio assembly mostly follows the regular path
There are several small exceptions:
* btrfs_bio_is_contig() needs to handle delayed disk_bytenr/bbio
* New bbio needs to have its is_delayed flag set if disk_bytenr
is EXTENT_MAP_DELAYED
- Real ordered extents will be created at bbio submission time
This part is not implemented in this patch.
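For reference, a sketch of the resulting disk_bytenr sentinel layout
(matching the extent_map.h hunk below):

    /*
     * Special disk_bytenr values, counting down from the top of the
     * u64 space:
     *
     *   (u64)-2  EXTENT_MAP_INLINE
     *   (u64)-3  EXTENT_MAP_HOLE
     *   (u64)-4  EXTENT_MAP_DELAYED    (new)
     *   (u64)-5  EXTENT_MAP_LAST_BYTE  (moved down from (u64)-4)
     *
     * Anything above EXTENT_MAP_LAST_BYTE is a marker, not a real
     * logical bytenr.
     */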
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/bio.c | 1 +
fs/btrfs/bio.h | 3 +++
fs/btrfs/btrfs_inode.h | 3 +++
fs/btrfs/extent_io.c | 29 +++++++++++++++++++++++++----
fs/btrfs/extent_map.h | 9 ++++++++-
fs/btrfs/inode.c | 41 +++++++++++++++++++++++++++++++++++++++++
6 files changed, 81 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 2a2a21aec817..513cf2eeff4d 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -900,6 +900,7 @@ void btrfs_submit_bbio(struct btrfs_bio *bbio, int mirror_num)
{
/* If bbio->inode is not populated, its file_offset must be 0. */
ASSERT(bbio->inode || bbio->file_offset == 0);
+ ASSERT(!bbio->is_delayed);
assert_bbio_alignment(bbio);
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index 303ed6c7103d..49ebdc7ce6e6 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -99,6 +99,9 @@ struct btrfs_bio {
/* Whether the bio is written using zone append. */
bool can_use_append:1;
+ /* If the bio is delayed (i.e., it has no backing OE yet). */
+ bool is_delayed:1;
+
/*
* This member must come last, bio_alloc_bioset will allocate enough
* bytes for entire btrfs_bio but relies on bio being last.
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 55c272fe5d92..080ede55b1d6 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -669,5 +669,8 @@ u64 btrfs_get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
struct extent_map *btrfs_create_io_em(struct btrfs_inode *inode, u64 start,
const struct btrfs_file_extent *file_extent,
int type);
+struct extent_map *btrfs_create_delayed_em(struct btrfs_inode *inode,
+ u64 start, u32 length);
+void btrfs_submit_delayed_write(struct btrfs_bio *bbio);
#endif
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 33b1afbee0a6..5fdc78915046 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -180,12 +180,16 @@ static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
/* Caller should ensure the bio has at least some range added */
ASSERT(bbio->bio.bi_iter.bi_size);
-
+ /* Delayed bbio is only for write. */
+ if (bbio->is_delayed)
+ ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
bio_set_csum_search_commit_root(bio_ctrl);
if (btrfs_op(&bbio->bio) == BTRFS_MAP_READ &&
bio_ctrl->compress_type != BTRFS_COMPRESS_NONE)
btrfs_submit_compressed_read(bbio);
+ else if (bbio->is_delayed)
+ btrfs_submit_delayed_write(bbio);
else
btrfs_submit_bbio(bbio, 0);
@@ -723,6 +727,14 @@ static bool btrfs_bio_is_contig(struct btrfs_bio_ctrl *bio_ctrl,
struct bio *bio = &bio_ctrl->bbio->bio;
const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
+ /* One is delayed bbio and one is not, definitely not contig. */
+ if (bio_ctrl->bbio->is_delayed != (disk_bytenr == EXTENT_MAP_DELAYED))
+ return false;
+
+ /* For delayed bbio, only need to check if the file range is contig. */
+ if (bio_ctrl->bbio->is_delayed)
+ return bio_ctrl->next_file_offset == file_offset;
+
if (bio_ctrl->compress_type != BTRFS_COMPRESS_NONE) {
/*
* For compression, all IO should have its logical bytenr set
@@ -748,7 +760,13 @@ static void alloc_new_bio(struct btrfs_inode *inode,
bbio = btrfs_bio_alloc(BIO_MAX_VECS, bio_ctrl->opf, inode,
file_offset, bio_ctrl->end_io_func, NULL);
- bbio->bio.bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
+ if (disk_bytenr == EXTENT_MAP_DELAYED) {
+ bbio->is_delayed = true;
+ bbio->bio.bi_iter.bi_sector = 0;
+ } else {
+ bbio->is_delayed = false;
+ bbio->bio.bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
+ }
bbio->bio.bi_write_hint = inode->vfs_inode.i_write_hint;
bio_ctrl->bbio = bbio;
bio_ctrl->len_to_oe_boundary = U32_MAX;
@@ -762,7 +780,7 @@ static void alloc_new_bio(struct btrfs_inode *inode,
if (ordered) {
bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
ordered->file_offset +
- ordered->disk_num_bytes - file_offset);
+ ordered->num_bytes - file_offset);
bbio->ordered = ordered;
}
@@ -1688,7 +1706,10 @@ static int submit_one_sector(struct btrfs_inode *inode,
ASSERT(IS_ALIGNED(em->len, sectorsize));
block_start = btrfs_extent_map_block_start(em);
- disk_bytenr = btrfs_extent_map_block_start(em) + extent_offset;
+ if (block_start == EXTENT_MAP_DELAYED)
+ disk_bytenr = block_start;
+ else
+ disk_bytenr = block_start + extent_offset;
ASSERT(!btrfs_extent_map_is_compressed(em));
ASSERT(block_start != EXTENT_MAP_HOLE);
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 6f685f3c9327..e45e9f96443a 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -13,7 +13,8 @@
struct btrfs_inode;
struct btrfs_fs_info;
-#define EXTENT_MAP_LAST_BYTE ((u64)-4)
+#define EXTENT_MAP_LAST_BYTE ((u64)-5)
+#define EXTENT_MAP_DELAYED ((u64)-4)
#define EXTENT_MAP_HOLE ((u64)-3)
#define EXTENT_MAP_INLINE ((u64)-2)
@@ -30,6 +31,12 @@ enum {
ENUM_BIT(EXTENT_FLAG_LOGGING),
/* This em is merged from two or more physically adjacent ems */
ENUM_BIT(EXTENT_FLAG_MERGED),
+ /*
+ * The real on-disk extent allocation is delayed until bio submission.
+ * For now it's only a placeholder with EXTENT_MAP_DELAYED as
+ * its disk_bytenr.
+ */
+ ENUM_BIT(EXTENT_FLAG_DELAYED),
};
/*
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index acfef903ac8b..0551b8e755ed 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7552,6 +7552,47 @@ struct extent_map *btrfs_create_io_em(struct btrfs_inode *inode, u64 start,
return em;
}
+struct extent_map *btrfs_create_delayed_em(struct btrfs_inode *inode,
+ u64 start, u32 length)
+{
+ struct extent_map *em;
+ int ret;
+
+ em = btrfs_alloc_extent_map();
+ if (!em)
+ return ERR_PTR(-ENOMEM);
+
+ em->start = start;
+ em->len = length;
+ em->disk_bytenr = EXTENT_MAP_DELAYED;
+ em->disk_num_bytes = 0;
+ em->ram_bytes = 0;
+ em->generation = -1;
+ em->offset = 0;
+ em->flags = EXTENT_FLAG_DELAYED | EXTENT_FLAG_PINNED;
+
+ ret = btrfs_replace_extent_map_range(inode, em, true);
+ if (ret) {
+ btrfs_free_extent_map(em);
+ return ERR_PTR(ret);
+ }
+
+ /* The em has 2 refs now; the caller needs to call btrfs_free_extent_map() once. */
+ return em;
+}
+
+void btrfs_submit_delayed_write(struct btrfs_bio *bbio)
+{
+ ASSERT(bbio->is_delayed);
+
+ /*
+ * Not yet implemented, and should not hit this path as we have no
+ * caller to create delayed extent map.
+ */
+ ASSERT(0);
+ bio_put(&bbio->bio);
+}
+
/*
* For release_folio() and invalidate_folio() we have a race window where
* folio_end_writeback() is called but the subpage spinlock is not yet released.
--
2.53.0
* [PATCH 2/6] btrfs: add delayed ordered extent support
2026-03-19 21:04 [PATCH 0/6] btrfs: delay compression to bbio submission time Qu Wenruo
2026-03-19 21:04 ` [PATCH 1/6] btrfs: add skeleton for delayed btrfs bio Qu Wenruo
@ 2026-03-19 21:04 ` Qu Wenruo
2026-03-19 21:04 ` [PATCH 3/6] btrfs: introduce the skeleton of delayed bbio endio function Qu Wenruo
` (4 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2026-03-19 21:04 UTC (permalink / raw)
To: linux-btrfs
A delayed ordered extent has the following features:
- A new BTRFS_ORDERED_DELAYED flag
This new flag must be set along with the BTRFS_ORDERED_REGULAR flag.
- No allocation of any on-disk space
As a delayed ordered extent doesn't take any on-disk space yet, it
won't release any reserved data/meta space either.
- Zero or more real OEs can be added to the parent
If a real OE is allocated, it must be inside the parent OE.
Such a real OE will go through the regular data/meta space
reservation path.
- Child OEs will not be added to the per-inode OE rb-tree nor the
per-root list
Only the parent OE is added to the per-inode rb-tree and per-root
list, so anything waiting for ordered extents should only work on
the parent one.
There is a special corner case in btrfs_wait_ordered_extents(): as
delayed parent OEs have 0 disk_bytenr and 0 disk_num_bytes, they
will be considered outside any queried range, even [0, U64_MAX).
Thus we always have to wait for any delayed OEs of a root, no
matter whether a block group range is given or not.
- When the parent OE finishes, all children OEs will also be finished
And reserved space handling is all handled by the children OEs.
- Any range not covered by a child OE will be manually cleaned up
The above features allow us to use the existing ordered extent
interfaces to allocate new real OEs, and to wait for them properly.
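To illustrate, a condensed sketch of the intended lifecycle using the
helpers touched by this patch (error handling and locking omitted):

    /* Delalloc time: one placeholder OE covering the range, no
     * on-disk space reserved. */
    parent = btrfs_alloc_delayed_ordered_extent(inode, file_offset, len);

    /* Submission time: each real OE reserves space as usual; the
     * insertion into the per-inode ordered tree finds the delayed
     * placeholder and links the new OE as its child (see
     * insert_ordered_extent() below). */
    child = btrfs_alloc_ordered_extent(inode, child_offset, &file_extent,
                                       1U << BTRFS_ORDERED_COMPRESSED);

    /* IO completion: finishing the parent finishes every child and
     * manually releases any sub-range that got no child OE. */
    btrfs_finish_one_ordered(parent);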
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 76 +++++++++++++++++
fs/btrfs/ordered-data.c | 181 ++++++++++++++++++++++++++++++----------
fs/btrfs/ordered-data.h | 14 ++++
3 files changed, 225 insertions(+), 46 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0551b8e755ed..799e3d5c80f3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3162,6 +3162,79 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
update_inode_bytes, oe->qgroup_rsv);
}
+static int finish_delayed_ordered(struct btrfs_ordered_extent *oe)
+{
+ struct btrfs_inode *inode = oe->inode;
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ struct btrfs_ordered_extent *child;
+ struct btrfs_ordered_extent *tmp;
+ struct extent_state *cached = NULL;
+ const u32 nr_bits = oe->num_bytes >> fs_info->sectorsize_bits;
+ bool io_error = test_bit(BTRFS_ORDERED_IOERR, &oe->flags);
+ u64 cur = oe->file_offset;
+ int ret = 0;
+ int saved_ret = 0;
+
+ /* Finish each child OE. */
+ list_for_each_entry_safe(child, tmp, &oe->child_list, child_list) {
+ list_del_init(&child->child_list);
+ refcount_inc(&child->refs);
+
+ /* The range should have been marked in the bitmap. */
+ ASSERT(bitmap_test_range_all_set(oe->child_bitmap,
+ (child->file_offset - oe->file_offset) >> fs_info->sectorsize_bits,
+ child->num_bytes >> fs_info->sectorsize_bits));
+
+ if (io_error)
+ set_bit(BTRFS_ORDERED_IOERR, &child->flags);
+
+ ret = btrfs_finish_one_ordered(child);
+ if (ret && !saved_ret)
+ saved_ret = ret;
+ }
+
+ /* For ranges that don't have a child OE, manually clean them up. */
+ while (cur < oe->file_offset + oe->num_bytes) {
+ const u32 cur_bit = (cur - oe->file_offset) >> fs_info->sectorsize_bits;
+ u32 first_zero;
+ u32 next_set;
+ u64 range_start;
+ u64 range_end;
+ u32 range_len;
+
+ first_zero = find_next_zero_bit(oe->child_bitmap, nr_bits, cur_bit);
+ if (first_zero >= nr_bits)
+ break;
+ next_set = find_next_bit(oe->child_bitmap, nr_bits, first_zero);
+ ASSERT(next_set > first_zero);
+
+ range_start = oe->file_offset + (first_zero << fs_info->sectorsize_bits);
+ range_len = (next_set - first_zero) << fs_info->sectorsize_bits;
+ range_end = range_start + range_len - 1;
+
+ btrfs_lock_extent(&inode->io_tree, range_start, range_end, &cached);
+ /*
+ * The range has reserved data/metadata but no real OE, thus we have
+ * to manually release them.
+ */
+ btrfs_delalloc_release_space(inode, NULL, range_start, range_len, true);
+ /*
+ * Also need to remove/drop the pinned extent map range.
+ * Here we do not want the extent map to stay, as they do not represent
+ * any real extent map.
+ */
+ btrfs_drop_extent_map_range(inode, range_start, range_end, false);
+ btrfs_clear_extent_bit(&inode->io_tree, range_start, range_end,
+ EXTENT_LOCKED | EXTENT_DELALLOC_NEW | EXTENT_DEFRAG |
+ EXTENT_DO_ACCOUNTING, &cached);
+ cur = range_end + 1;
+ }
+ btrfs_remove_ordered_extent(oe);
+ btrfs_put_ordered_extent(oe);
+ btrfs_put_ordered_extent(oe);
+ return saved_ret;
+}
+
/*
* As ordered data IO finishes, this gets called so we can finish
* an ordered extent if the range of bytes in the file it covers are
@@ -3184,6 +3257,9 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
bool clear_reserved_extent = true;
unsigned int clear_bits = EXTENT_DEFRAG;
+ if (test_bit(BTRFS_ORDERED_DELAYED, &ordered_extent->flags))
+ return finish_delayed_ordered(ordered_extent);
+
start = ordered_extent->file_offset;
end = start + ordered_extent->num_bytes - 1;
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index bc88b904d024..c013609b6192 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -155,6 +155,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
u64 qgroup_rsv = 0;
const bool is_nocow = (flags &
((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
+ const bool is_delayed = test_bit(BTRFS_ORDERED_DELAYED, &flags);
/* Only one type flag can be set. */
ASSERT(hweight_long(flags & BTRFS_ORDERED_EXCLUSIVE_FLAGS) == 1);
@@ -169,6 +170,17 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
if (test_bit(BTRFS_ORDERED_ENCODED, &flags))
ASSERT(test_bit(BTRFS_ORDERED_COMPRESSED, &flags));
+ /*
+ * DELAYED can only be set with REGULAR, no DIRECT/ENCODED, and should
+ * not exceed BTRFS_MAX_COMPRESSED size.
+ */
+ if (test_bit(BTRFS_ORDERED_DELAYED, &flags)) {
+ ASSERT(test_bit(BTRFS_ORDERED_REGULAR, &flags));
+ ASSERT(!test_bit(BTRFS_ORDERED_DIRECT, &flags));
+ ASSERT(!test_bit(BTRFS_ORDERED_ENCODED, &flags));
+ ASSERT(num_bytes <= BTRFS_MAX_COMPRESSED);
+ }
+
/*
* For a NOCOW write we can free the qgroup reserve right now. For a COW
* one we transfer the reserved space from the inode's iotree into the
@@ -177,13 +189,14 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
* completing the ordered extent, when running the data delayed ref it
* creates, we free the reserved data with btrfs_qgroup_free_refroot().
*/
- if (is_nocow)
- ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
- else
- ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes, &qgroup_rsv);
-
- if (ret < 0)
- return ERR_PTR(ret);
+ if (!is_delayed) {
+ if (is_nocow)
+ ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
+ else
+ ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes, &qgroup_rsv);
+ if (ret < 0)
+ return ERR_PTR(ret);
+ }
entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
if (!entry) {
@@ -215,19 +228,23 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
INIT_LIST_HEAD(&entry->root_extent_list);
INIT_LIST_HEAD(&entry->work_list);
INIT_LIST_HEAD(&entry->bioc_list);
+ INIT_LIST_HEAD(&entry->child_list);
init_completion(&entry->completion);
+ RB_CLEAR_NODE(&entry->rb_node);
/*
* We don't need the count_max_extents here, we can assume that all of
* that work has been done at higher layers, so this is truly the
* smallest the extent is going to get.
*/
- spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, 1);
- spin_unlock(&inode->lock);
+ if (!is_delayed) {
+ spin_lock(&inode->lock);
+ btrfs_mod_outstanding_extents(inode, 1);
+ spin_unlock(&inode->lock);
+ }
out:
- if (IS_ERR(entry) && !is_nocow)
+ if (IS_ERR(entry) && !is_nocow && !is_delayed)
btrfs_qgroup_free_refroot(inode->root->fs_info,
btrfs_root_id(inode->root),
qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
@@ -235,12 +252,43 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
return entry;
}
+static void add_child_oe(struct btrfs_ordered_extent *parent,
+ struct btrfs_ordered_extent *child)
+{
+ struct btrfs_inode *inode = parent->inode;
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ const u32 start_bit = (child->file_offset - parent->file_offset) >>
+ fs_info->sectorsize_bits;
+ const u32 nr_bits = child->num_bytes >> fs_info->sectorsize_bits;
+
+ lockdep_assert_held(&inode->ordered_tree_lock);
+ /* Basic flags check for parent and child. */
+ ASSERT(test_bit(BTRFS_ORDERED_DELAYED, &parent->flags));
+ ASSERT(!test_bit(BTRFS_ORDERED_DELAYED, &child->flags));
+
+ /* Child should not belong to any parent yet. */
+ ASSERT(list_empty(&child->child_list));
+
+ /* Child should be fully inside parent's range. */
+ ASSERT(child->file_offset >= parent->file_offset);
+ ASSERT(child->file_offset + child->num_bytes <=
+ parent->file_offset + parent->num_bytes);
+
+ /* There should be no existing child in the range. */
+ ASSERT(bitmap_test_range_all_zero(parent->child_bitmap, start_bit, nr_bits));
+
+ list_add_tail(&child->child_list, &parent->child_list);
+
+ bitmap_set(parent->child_bitmap, start_bit, nr_bits);
+}
+
static void insert_ordered_extent(struct btrfs_ordered_extent *entry)
{
struct btrfs_inode *inode = entry->inode;
struct btrfs_root *root = inode->root;
struct btrfs_fs_info *fs_info = root->fs_info;
struct rb_node *node;
+ bool is_child = false;
trace_btrfs_ordered_extent_add(inode, entry);
@@ -253,17 +301,25 @@ static void insert_ordered_extent(struct btrfs_ordered_extent *entry)
spin_lock(&inode->ordered_tree_lock);
node = tree_insert(&inode->ordered_tree, entry->file_offset,
&entry->rb_node);
- if (unlikely(node)) {
+ if (node) {
struct btrfs_ordered_extent *exist =
rb_entry(node, struct btrfs_ordered_extent, rb_node);
- btrfs_panic(fs_info, -EEXIST,
+ if (test_bit(BTRFS_ORDERED_DELAYED, &exist->flags)) {
+ add_child_oe(exist, entry);
+ is_child = true;
+ } else {
+ btrfs_panic(fs_info, -EEXIST,
"existing oe file_offset=%llu num_bytes=%llu flags=0x%lx new oe file_offset=%llu num_bytes=%llu flags=0x%lx",
- exist->file_offset, exist->num_bytes, exist->flags,
- entry->file_offset, entry->num_bytes, entry->flags);
+ exist->file_offset, exist->num_bytes, exist->flags,
+ entry->file_offset, entry->num_bytes, entry->flags);
+ }
}
spin_unlock(&inode->ordered_tree_lock);
+ /* Child OE shouldn't be added to per-root oe list. */
+ if (is_child)
+ return;
spin_lock(&root->ordered_extent_lock);
list_add_tail(&entry->root_extent_list,
&root->ordered_extents);
@@ -336,6 +392,20 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
return entry;
}
+struct btrfs_ordered_extent *btrfs_alloc_delayed_ordered_extent(
+ struct btrfs_inode *inode, u64 file_offset, u32 length)
+{
+ struct btrfs_ordered_extent *entry;
+
+ entry = alloc_ordered_extent(inode, file_offset, length, length, 0, 0, 0,
+ (1UL << BTRFS_ORDERED_REGULAR) |
+ (1UL << BTRFS_ORDERED_DELAYED),
+ BTRFS_COMPRESS_NONE);
+ if (!IS_ERR(entry))
+ insert_ordered_extent(entry);
+ return entry;
+}
+
/*
* Add a struct btrfs_ordered_sum into the list of checksums to be inserted
* when an ordered extent is finished. If the list covers more than one
@@ -643,8 +713,9 @@ void btrfs_remove_ordered_extent(struct btrfs_ordered_extent *entry)
struct btrfs_root *root = btrfs_inode->root;
struct btrfs_fs_info *fs_info = root->fs_info;
struct rb_node *node;
- bool pending;
+ bool pending = false;
bool freespace_inode;
+ const bool is_delayed = test_bit(BTRFS_ORDERED_DELAYED, &entry->flags);
/*
* If this is a free space inode the thread has not acquired the ordered
@@ -653,33 +724,37 @@ void btrfs_remove_ordered_extent(struct btrfs_ordered_extent *entry)
freespace_inode = btrfs_is_free_space_inode(btrfs_inode);
btrfs_lockdep_acquire(fs_info, btrfs_trans_pending_ordered);
- /* This is paired with alloc_ordered_extent(). */
- spin_lock(&btrfs_inode->lock);
- btrfs_mod_outstanding_extents(btrfs_inode, -1);
- spin_unlock(&btrfs_inode->lock);
- if (root != fs_info->tree_root) {
- u64 release;
+ if (!is_delayed) {
+ /* This is paired with alloc_ordered_extent(). */
+ spin_lock(&btrfs_inode->lock);
+ btrfs_mod_outstanding_extents(btrfs_inode, -1);
+ spin_unlock(&btrfs_inode->lock);
- if (test_bit(BTRFS_ORDERED_ENCODED, &entry->flags))
- release = entry->disk_num_bytes;
- else
- release = entry->num_bytes;
- btrfs_delalloc_release_metadata(btrfs_inode, release,
+ if (root != fs_info->tree_root) {
+ u64 release;
+
+ if (test_bit(BTRFS_ORDERED_ENCODED, &entry->flags))
+ release = entry->disk_num_bytes;
+ else
+ release = entry->num_bytes;
+ btrfs_delalloc_release_metadata(btrfs_inode, release,
test_bit(BTRFS_ORDERED_IOERR,
&entry->flags));
+ }
}
-
percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
fs_info->delalloc_batch);
spin_lock(&btrfs_inode->ordered_tree_lock);
- node = &entry->rb_node;
- rb_erase(node, &btrfs_inode->ordered_tree);
- RB_CLEAR_NODE(node);
- if (btrfs_inode->ordered_tree_last == node)
- btrfs_inode->ordered_tree_last = NULL;
- set_bit(BTRFS_ORDERED_COMPLETE, &entry->flags);
- pending = test_and_clear_bit(BTRFS_ORDERED_PENDING, &entry->flags);
+ if (!RB_EMPTY_NODE(&entry->rb_node)) {
+ node = &entry->rb_node;
+ rb_erase(node, &btrfs_inode->ordered_tree);
+ RB_CLEAR_NODE(node);
+ if (btrfs_inode->ordered_tree_last == node)
+ btrfs_inode->ordered_tree_last = NULL;
+ set_bit(BTRFS_ORDERED_COMPLETE, &entry->flags);
+ pending = test_and_clear_bit(BTRFS_ORDERED_PENDING, &entry->flags);
+ }
spin_unlock(&btrfs_inode->ordered_tree_lock);
/*
@@ -711,17 +786,21 @@ void btrfs_remove_ordered_extent(struct btrfs_ordered_extent *entry)
btrfs_lockdep_release(fs_info, btrfs_trans_pending_ordered);
- spin_lock(&root->ordered_extent_lock);
- list_del_init(&entry->root_extent_list);
- root->nr_ordered_extents--;
-
trace_btrfs_ordered_extent_remove(btrfs_inode, entry);
- if (!root->nr_ordered_extents) {
- spin_lock(&fs_info->ordered_root_lock);
- BUG_ON(list_empty(&root->ordered_root));
- list_del_init(&root->ordered_root);
- spin_unlock(&fs_info->ordered_root_lock);
+ spin_lock(&root->ordered_extent_lock);
+ /* For child OEs, they are not added to per-root OEs. */
+ if (!list_empty(&entry->root_extent_list)) {
+ list_del_init(&entry->root_extent_list);
+ root->nr_ordered_extents--;
+
+ if (!root->nr_ordered_extents) {
+ spin_lock(&fs_info->ordered_root_lock);
+ BUG_ON(list_empty(&root->ordered_root));
+ list_del_init(&root->ordered_root);
+ spin_unlock(&fs_info->ordered_root_lock);
+ }
}
spin_unlock(&root->ordered_extent_lock);
wake_up(&entry->wait);
@@ -770,8 +849,18 @@ u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
ordered = list_first_entry(&splice, struct btrfs_ordered_extent,
root_extent_list);
- if (range_end <= ordered->disk_bytenr ||
- ordered->disk_bytenr + ordered->disk_num_bytes <= range_start) {
+ /*
+ * Delayed OEs have 0 disk_bytenr and 0 disk_num_bytes, thus
+ * they will be considered outside any queried range, even
+ * [0, U64_MAX). And we do not know where they will really
+ * land until the writeback has finished.
+ *
+ * So here we must exclude delayed OEs from the bg range check,
+ * and always wait for them.
+ */
+ if (!test_bit(BTRFS_ORDERED_DELAYED, &ordered->flags) &&
+ (range_end <= ordered->disk_bytenr ||
+ ordered->disk_bytenr + ordered->disk_num_bytes <= range_start)) {
list_move_tail(&ordered->root_extent_list, &skipped);
cond_resched_lock(&root->ordered_extent_lock);
continue;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 2664ea455229..8a1800f109e8 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -13,6 +13,7 @@
#include <linux/rbtree.h>
#include <linux/wait.h>
#include "async-thread.h"
+#include "compression.h"
struct inode;
struct page;
@@ -87,6 +88,12 @@ enum {
*/
BTRFS_ORDERED_DIRECT,
+ /*
+ * Extra bit for delayed OE, can only be set for REGULAR.
+ * Can not be set with COMPRESSED/ENCODED/DIRECT.
+ */
+ BTRFS_ORDERED_DELAYED,
+
BTRFS_ORDERED_NR_FLAGS,
};
static_assert(BTRFS_ORDERED_NR_FLAGS <= BITS_PER_LONG);
@@ -155,6 +162,11 @@ struct btrfs_ordered_extent {
/* a per root list of all the pending ordered extents */
struct list_head root_extent_list;
+ /* Child ordered extent list for delayed OE. */
+ struct list_head child_list;
+
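+ /* One bit per block of the parent's range, set when a child OE covers it. */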
+ unsigned long child_bitmap[BITS_TO_LONGS(BTRFS_MAX_COMPRESSED / BTRFS_MIN_BLOCKSIZE)];
+
struct btrfs_work work;
struct completion completion;
@@ -192,6 +204,8 @@ struct btrfs_file_extent {
struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
struct btrfs_inode *inode, u64 file_offset,
const struct btrfs_file_extent *file_extent, unsigned long flags);
+struct btrfs_ordered_extent *btrfs_alloc_delayed_ordered_extent(
+ struct btrfs_inode *inode, u64 file_offset, u32 length);
void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
struct btrfs_ordered_sum *sum);
struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
--
2.53.0
* [PATCH 3/6] btrfs: introduce the skeleton of delayed bbio endio function
2026-03-19 21:04 [PATCH 0/6] btrfs: delay compression to bbio submission time Qu Wenruo
2026-03-19 21:04 ` [PATCH 1/6] btrfs: add skeleton for delayed btrfs bio Qu Wenruo
2026-03-19 21:04 ` [PATCH 2/6] btrfs: add delayed ordered extent support Qu Wenruo
@ 2026-03-19 21:04 ` Qu Wenruo
2026-03-19 21:04 ` [PATCH 4/6] btrfs: introduce compression for delayed bbio Qu Wenruo
` (3 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2026-03-19 21:04 UTC (permalink / raw)
To: linux-btrfs
A delayed bbio will not be submitted directly, but queued into a
workqueue, where the heavy lifting of compression is done.
The compression and the uncompressed fallback are not implemented in
this patch.
Only the main endio function and the helper to queue the work into a
workqueue are implemented.
The endio function is mostly the same as end_bbio_data_write(), except
for the extra memory allocation/freeing of bbio->private.
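The completion model is a plain pending-IO count; a minimal sketch of
the pattern used by the helpers below:

    /* Hold one count while queuing child bios... */
    atomic_inc(&dbp->pending_ios);

    /* ...each child end_io, and the queuing context itself, drops
     * one. Whoever drops the count to zero ends the parent delayed
     * bbio. */
    if (atomic_dec_and_test(&dbp->pending_ios))
        btrfs_bio_end_io(parent, parent->status);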
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 68 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 64 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 799e3d5c80f3..03018cc3bd25 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -97,6 +97,12 @@ struct data_reloc_warn {
int mirror_num;
};
+struct delayed_bio_private {
+ struct work_struct work;
+ struct btrfs_bio *delayed_bbio;
+ atomic_t pending_ios;
+};
+
/*
* For the file_extent_tree, we want to hold the inode lock when we lookup and
* update the disk_i_size, but lockdep will complain because our io_tree we hold
@@ -7657,18 +7663,72 @@ struct extent_map *btrfs_create_delayed_em(struct btrfs_inode *inode,
return em;
}
-void btrfs_submit_delayed_write(struct btrfs_bio *bbio)
+static void run_delayed_bbio(struct work_struct *work)
{
- ASSERT(bbio->is_delayed);
+ struct delayed_bio_private *dbp = container_of(work, struct delayed_bio_private, work);
+ struct btrfs_bio *parent = dbp->delayed_bbio;
/*
- * Not yet implemented, and should not hit this path as we have no
- * caller to create delayed extent map.
+ * Increase the pending_ios so that parent bbio won't end
+ * until all child ones are submitted.
*/
+ atomic_inc(&dbp->pending_ios);
+ /* Compressed and uncompressed fallback is not yet implemented. */
ASSERT(0);
+ if (atomic_dec_and_test(&dbp->pending_ios))
+ btrfs_bio_end_io(parent, parent->status);
+}
+
+static void end_bbio_delayed(struct btrfs_bio *bbio)
+{
+ struct delayed_bio_private *dbp = bbio->private;
+ struct btrfs_inode *inode = bbio->inode;
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ struct folio_iter fi;
+ const u32 bio_size = bio_get_size(&bbio->bio);
+ const bool uptodate = bbio->status == BLK_STS_OK;
+
+ ASSERT(bbio->is_delayed);
+
+ bio_for_each_folio_all(fi, &bbio->bio) {
+ u64 start = folio_pos(fi.folio) + fi.offset;
+ u32 len = fi.length;
+
+ btrfs_folio_clear_ordered(fs_info, fi.folio, start, len);
+ btrfs_folio_clear_writeback(fs_info, fi.folio, start, len);
+ }
+ btrfs_mark_ordered_io_finished(inode, bbio->file_offset, bio_size, uptodate);
+ kfree(dbp);
bio_put(&bbio->bio);
}
+void btrfs_submit_delayed_write(struct btrfs_bio *bbio)
+{
+ struct delayed_bio_private *dbp;
+
+ ASSERT(bbio->is_delayed);
+
+ bbio->end_io = end_bbio_delayed;
+ dbp = kzalloc(sizeof(struct delayed_bio_private), GFP_NOFS);
+ if (!dbp) {
+ btrfs_bio_end_io(bbio, errno_to_blk_status(-ENOMEM));
+ return;
+ }
+ atomic_set(&dbp->pending_ios, 0);
+ dbp->delayed_bbio = bbio;
+ bbio->private = dbp;
+ /*
+ * TODO: find a way to properly allow sequential extent allocation.
+ *
+ * The existing btrfs async workqueue will execute the sequential workload
+ * twice, the second one to free the structure.
+ * But our current submission path can only be called once, after that
+ * the bbio will be gone thus can not afford to use btrfs async workqueue.
+ */
+ INIT_WORK(&dbp->work, run_delayed_bbio);
+ schedule_work(&dbp->work);
+}
+
/*
* For release_folio() and invalidate_folio() we have a race window where
* folio_end_writeback() is called but the subpage spinlock is not yet released.
--
2.53.0
* [PATCH 4/6] btrfs: introduce compression for delayed bbio
2026-03-19 21:04 [PATCH 0/6] btrfs: delay compression to bbio submission time Qu Wenruo
` (2 preceding siblings ...)
2026-03-19 21:04 ` [PATCH 3/6] btrfs: introduce the skeleton of delayed bbio endio function Qu Wenruo
@ 2026-03-19 21:04 ` Qu Wenruo
2026-03-19 21:04 ` [PATCH 5/6] btrfs: implement uncompressed fallback " Qu Wenruo
` (2 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2026-03-19 21:04 UTC (permalink / raw)
To: linux-btrfs
The compressed write path inside a delayed bbio is mostly the same as
regular compression, but with some differences:
- The error handling should not touch folio flags
That will be handled by the parent delayed bbio.
And those folios already have the WRITEBACK flag set, not the LOCKED
flag of the async submission path.
- A successful compression will lead to a child compressed bio
That compressed bio will be properly submitted, and if there are no
more pending IOs of the delayed bbio, the delayed bbio is ended.
One minor note: since we're going through the regular
extent_writepage_io() path, we can have multiple bbios for the same
delayed ordered extent.
This means we may get a slightly lower compression ratio if for
whatever reason the writeback path chooses to submit a bio early.
- No sequential execution of data extent reservation
The existing async thread has one quirk related to the ordered
function execution, which is not suitable for this call site.
After the compressed bio is submitted, we can no longer touch the
child compressed bio (it can finish immediately and also finish the
parent delayed bbio).
Meanwhile the async ordered function needs different entries to handle
the workload and to free the involved structures.
These are the major changes compared to the existing compressed
write path.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 111 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 03018cc3bd25..ef1f5efb68ca 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7663,6 +7663,111 @@ struct extent_map *btrfs_create_delayed_em(struct btrfs_inode *inode,
return em;
}
+static void end_bbio_delayed_compressed(struct btrfs_bio *bbio)
+{
+ struct delayed_bio_private *dbp = bbio->private;
+ struct btrfs_bio *parent = dbp->delayed_bbio;
+ struct folio_iter fi;
+
+ bio_for_each_folio_all(fi, &bbio->bio)
+ btrfs_free_compr_folio(fi.folio);
+ bio_put(&bbio->bio);
+
+ cmpxchg(&parent->status, BLK_STS_OK, bbio->status);
+ if (atomic_dec_and_test(&dbp->pending_ios))
+ btrfs_bio_end_io(parent, parent->status);
+}
+
+static bool try_submit_compressed(struct btrfs_bio *parent)
+{
+ struct delayed_bio_private *dbp = parent->private;
+ struct btrfs_bio *bbio = dbp->delayed_bbio;
+ struct btrfs_inode *inode = bbio->inode;
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ struct btrfs_key ins;
+ struct compressed_bio *cb;
+ struct extent_state *cached = NULL;
+ struct extent_map *em;
+ struct btrfs_ordered_extent *ordered;
+ struct btrfs_file_extent file_extent;
+ u64 alloc_hint;
+ const u32 len = bio_get_size(&bbio->bio);
+ const u64 fileoff = bbio->file_offset;
+ const u64 end = fileoff + len - 1;
+ u32 compressed_size;
+ int compress_type = fs_info->compress_type;
+ int compress_level = fs_info->compress_level;
+ int ret;
+
+ if (!btrfs_inode_can_compress(inode) ||
+ !inode_need_compress(inode, fileoff, end, false))
+ return false;
+
+ if (inode->defrag_compress > 0 &&
+ inode->defrag_compress < BTRFS_NR_COMPRESS_TYPES) {
+ compress_type = inode->defrag_compress;
+ compress_level = inode->defrag_compress_level;
+ } else if (inode->prop_compress) {
+ compress_type = inode->prop_compress;
+ }
+ cb = btrfs_compress_bio(inode, fileoff, len, compress_type,
+ compress_level, 0);
+ if (IS_ERR(cb))
+ return false;
+
+ round_up_last_block(cb, fs_info->sectorsize);
+ compressed_size = cb->bbio.bio.bi_iter.bi_size;
+
+ alloc_hint = btrfs_get_extent_allocation_hint(inode, fileoff, len);
+ ret = btrfs_reserve_extent(inode->root, len,
+ compressed_size, compressed_size,
+ 0, alloc_hint, &ins, true, true);
+ if (ret < 0) {
+ cleanup_compressed_bio(cb);
+ return false;
+ }
+ btrfs_lock_extent(&inode->io_tree, fileoff, end, &cached);
+ file_extent.disk_bytenr = ins.objectid;
+ file_extent.disk_num_bytes = ins.offset;
+ file_extent.ram_bytes = len;
+ file_extent.num_bytes = len;
+ file_extent.offset = 0;
+ file_extent.compression = cb->compress_type;
+
+ cb->bbio.bio.bi_iter.bi_sector = ins.objectid >> SECTOR_SHIFT;
+ em = btrfs_create_io_em(inode, fileoff, &file_extent, BTRFS_ORDERED_COMPRESSED);
+ if (IS_ERR(em)) {
+ ret = PTR_ERR(em);
+ goto out_free_reserve;
+ }
+ btrfs_free_extent_map(em);
+
+ ordered = btrfs_alloc_ordered_extent(inode, fileoff, &file_extent,
+ 1U << BTRFS_ORDERED_COMPRESSED);
+ if (IS_ERR(ordered)) {
+ btrfs_drop_extent_map_range(inode, fileoff, end, false);
+ ret = PTR_ERR(ordered);
+ goto out_free_reserve;
+ }
+ cb->bbio.ordered = ordered;
+ btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+ btrfs_unlock_extent(&inode->io_tree, fileoff, end, &cached);
+
+ cb->bbio.end_io = end_bbio_delayed_compressed;
+ cb->bbio.private = dbp;
+ atomic_inc(&dbp->pending_ios);
+ btrfs_submit_bbio(&cb->bbio, 0);
+ return true;
+
+out_free_reserve:
+ btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+ btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, true);
+ mapping_set_error(inode->vfs_inode.i_mapping, -EIO);
+ btrfs_unlock_extent(&inode->io_tree, fileoff, end, &cached);
+ cleanup_compressed_bio(cb);
+ return false;
+}
+
static void run_delayed_bbio(struct work_struct *work)
{
struct delayed_bio_private *dbp = container_of(work, struct delayed_bio_private, work);
@@ -7673,8 +7778,13 @@ static void run_delayed_bbio(struct work_struct *work)
* until all child ones are submitted.
*/
atomic_inc(&dbp->pending_ios);
- /* Compressed and uncompressed fallback is not yet implemented. */
+ if (try_submit_compressed(parent))
+ goto finish;
+
+ /* Uncompressed fallback is not yet implemented. */
ASSERT(0);
+
+finish:
if (atomic_dec_and_test(&dbp->pending_ios))
btrfs_bio_end_io(parent, parent->status);
}
--
2.53.0
* [PATCH 5/6] btrfs: implement uncompressed fallback for delayed bbio
2026-03-19 21:04 [PATCH 0/6] btrfs: delay compression to bbio submission time Qu Wenruo
` (3 preceding siblings ...)
2026-03-19 21:04 ` [PATCH 4/6] btrfs: introduce compression for delayed bbio Qu Wenruo
@ 2026-03-19 21:04 ` Qu Wenruo
2026-03-19 21:04 ` [PATCH 6/6] btrfs: enable experimental delayed compression support Qu Wenruo
2026-04-01 22:52 ` [PATCH 0/6] btrfs: delay compression to bbio submission time Boris Burkov
6 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2026-03-19 21:04 UTC (permalink / raw)
To: linux-btrfs
When compression fails (a bad compression ratio, fragmented free space,
or the writeback path choosing to submit the bio early), we have to
fall back to uncompressed writes.
The uncompressed fallback is mostly the same as cow_file_range() but
with some changes:
- The endio function is slightly different from the compressed path
Only in the folio freeing handling.
- Uncompressed fallback error handling
Since at this stage the folios already have the WRITEBACK flag set,
we do not need the usual page unlock/end-writeback handling; we just
free the reserved space and call it a day.
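Since the parent delayed bbio keeps the range under writeback, the
fallback can safely re-grab the folios from the page cache; condensed
from child_bbio_from_page_cache() below:

    /* The folios are still under writeback, so they must be present
     * in the page cache. The reference taken here is dropped in
     * end_bbio_delayed_uncompressed(). */
    folio = filemap_get_folio(mapping, cur >> PAGE_SHIFT);
    bio_add_folio(&bbio->bio, folio, cur_len,
                  offset_in_folio(folio, cur));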
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 149 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 147 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ef1f5efb68ca..183fb3c83c10 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7768,10 +7768,144 @@ static bool try_submit_compressed(struct btrfs_bio *parent)
return false;
}
+static void end_bbio_delayed_uncompressed(struct btrfs_bio *bbio)
+{
+ struct delayed_bio_private *dbp = bbio->private;
+ struct btrfs_bio *parent = dbp->delayed_bbio;
+ struct folio_iter fi;
+
+ bio_for_each_folio_all(fi, &bbio->bio)
+ folio_put(fi.folio);
+ bio_put(&bbio->bio);
+
+ cmpxchg(&parent->status, BLK_STS_OK, bbio->status);
+ if (atomic_dec_and_test(&dbp->pending_ios))
+ btrfs_bio_end_io(parent, parent->status);
+}
+
+static struct btrfs_bio *child_bbio_from_page_cache(struct btrfs_bio *parent,
+ u64 fileoff, u32 len)
+{
+ struct btrfs_inode *inode = parent->inode;
+ struct address_space *mapping = inode->vfs_inode.i_mapping;
+ struct btrfs_bio *bbio;
+ struct folio_iter fi;
+ u64 cur = fileoff;
+ int ret;
+
+ bbio = btrfs_bio_alloc(round_up(len, PAGE_SIZE) >> PAGE_SHIFT, REQ_OP_WRITE,
+ inode, fileoff, end_bbio_delayed_uncompressed,
+ parent->private);
+
+ while (cur < fileoff + len) {
+ struct folio *folio;
+ u32 cur_len;
+
+ folio = filemap_get_folio(mapping, cur >> PAGE_SHIFT);
+ if (IS_ERR(folio)) {
+ ret = PTR_ERR(folio);
+ goto error;
+ }
+ cur_len = min_t(u64, folio_next_pos(folio), fileoff + len) - cur;
+ ret = bio_add_folio(&bbio->bio, folio, cur_len,
+ offset_in_folio(folio, cur));
+ ASSERT(ret);
+ cur += cur_len;
+ }
+
+ return bbio;
+error:
+ bio_for_each_folio_all(fi, &bbio->bio)
+ folio_put(fi.folio);
+ bio_put(&bbio->bio);
+ return ERR_PTR(ret);
+}
+
+static int submit_one_uncompressed_range(struct btrfs_bio *parent, struct btrfs_key *ins,
+ struct extent_state **cached, u64 file_offset,
+ u32 num_bytes, u64 alloc_hint, u32 *ret_alloc_size)
+{
+ struct btrfs_inode *inode = parent->inode;
+ struct delayed_bio_private *dbp = parent->private;
+ struct btrfs_root *root = inode->root;
+ struct btrfs_fs_info *fs_info = root->fs_info;
+ struct btrfs_ordered_extent *ordered;
+ struct btrfs_file_extent file_extent;
+ struct btrfs_bio *child;
+ struct extent_map *em;
+ u64 cur_end;
+ u32 cur_len = 0;
+ int ret;
+
+ ret = btrfs_reserve_extent(root, num_bytes, num_bytes, fs_info->sectorsize,
+ 0, alloc_hint, ins, true, true);
+ if (ret < 0)
+ return ret;
+
+ cur_len = ins->offset;
+ cur_end = file_offset + cur_len - 1;
+
+ file_extent.disk_bytenr = ins->objectid;
+ file_extent.disk_num_bytes = ins->offset;
+ file_extent.num_bytes = ins->offset;
+ file_extent.ram_bytes = ins->offset;
+ file_extent.offset = 0;
+ file_extent.compression = BTRFS_COMPRESS_NONE;
+
+ btrfs_lock_extent(&inode->io_tree, file_offset, cur_end, cached);
+ em = btrfs_create_io_em(inode, file_offset, &file_extent, BTRFS_ORDERED_REGULAR);
+ if (IS_ERR(em)) {
+ ret = PTR_ERR(em);
+ btrfs_unlock_extent(&inode->io_tree, file_offset, cur_end, cached);
+ goto free_reserved;
+ }
+ btrfs_free_extent_map(em);
+ ordered = btrfs_alloc_ordered_extent(inode, file_offset, &file_extent,
+ 1U << BTRFS_ORDERED_REGULAR);
+ if (IS_ERR(ordered)) {
+ btrfs_drop_extent_map_range(inode, file_offset, cur_end, false);
+ btrfs_unlock_extent(&inode->io_tree, file_offset, cur_end, cached);
+ ret = PTR_ERR(ordered);
+ goto free_reserved;
+ }
+ btrfs_dec_block_group_reservations(fs_info, ins->objectid);
+ btrfs_unlock_extent(&inode->io_tree, file_offset, cur_end, cached);
+ child = child_bbio_from_page_cache(parent, file_offset, cur_len);
+ if (IS_ERR(child)) {
+ btrfs_put_ordered_extent(ordered);
+ btrfs_drop_extent_map_range(inode, file_offset, cur_end, false);
+ ret = PTR_ERR(child);
+ goto free_reserved;
+ }
+ child->ordered = ordered;
+ child->private = parent->private;
+ child->end_io = end_bbio_delayed_uncompressed;
+ child->bio.bi_iter.bi_sector = ins->objectid >> SECTOR_SHIFT;
+ atomic_inc(&dbp->pending_ios);
+ btrfs_submit_bbio(child, 0);
+ *ret_alloc_size = cur_len;
+ return 0;
+
+free_reserved:
+ btrfs_qgroup_free_data(inode, NULL, file_offset, cur_len, NULL);
+ btrfs_dec_block_group_reservations(fs_info, ins->objectid);
+ btrfs_free_reserved_extent(fs_info, ins->objectid, ins->offset, true);
+ ASSERT(ret != -EAGAIN);
+ return ret;
+}
+
static void run_delayed_bbio(struct work_struct *work)
{
struct delayed_bio_private *dbp = container_of(work, struct delayed_bio_private, work);
struct btrfs_bio *parent = dbp->delayed_bbio;
+ struct btrfs_key ins;
+ struct extent_state *cached = NULL;
+ const u32 uncompressed_size = bio_get_size(&parent->bio);
+ const u64 start = parent->file_offset;
+ const u64 end = start + uncompressed_size - 1;
+ u64 cur = start;
+ u64 alloc_hint;
+ int ret = 0;
/*
* Increase the pending_ios so that parent bbio won't end
@@ -7781,8 +7915,19 @@ static void run_delayed_bbio(struct work_struct *work)
if (try_submit_compressed(parent))
goto finish;
- /* Uncompressed fallback is not yet implemented. */
- ASSERT(0);
+ alloc_hint = btrfs_get_extent_allocation_hint(parent->inode, start,
+ uncompressed_size);
+ while (cur < end) {
+ u32 cur_len;
+
+ ret = submit_one_uncompressed_range(parent, &ins, &cached,
+ cur, end + 1 - cur,
+ alloc_hint, &cur_len);
+ if (ret < 0)
+ goto finish;
+ cur += cur_len;
+ alloc_hint += cur_len;
+ }
finish:
if (atomic_dec_and_test(&dbp->pending_ios))
--
2.53.0
* [PATCH 6/6] btrfs: enable experimental delayed compression support
2026-03-19 21:04 [PATCH 0/6] btrfs: delay compression to bbio submission time Qu Wenruo
` (4 preceding siblings ...)
2026-03-19 21:04 ` [PATCH 5/6] btrfs: implement uncompressed fallback " Qu Wenruo
@ 2026-03-19 21:04 ` Qu Wenruo
2026-04-01 22:52 ` [PATCH 0/6] btrfs: delay compression to bbio submission time Boris Burkov
6 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2026-03-19 21:04 UTC (permalink / raw)
To: linux-btrfs
Instead of the existing async submission path, the new delayed bbio
path will handle compressed writes by:
- Allocating a delayed em/oe at run_delalloc_*() time
Thus no data extent is reserved at that time.
- Delayed bbios will be assembled at extent_writepage_io() time
- Delayed bbios will be intercepted just before submission
They will run compression (or fall back to uncompressed writes) in a
workqueue.
Data extents will only be reserved at that time, and the delayed em
will be replaced by real ones.
Meanwhile the real OEs will be added as children of the parent
delayed OE, and when the parent OE finishes, the child OEs will be
finished with their file extents inserted.
This has some benefits:
- Higher concurrency
Previously the async submission held the folio and io tree range
locked, which means we could not even read an uptodate folio in that
range.
Furthermore, although the compressed write is queued into a workqueue
for submission and extent_writepage_io() will skip the compressed
range, when we need to write the next folio of the compressed range
we still have to wait for that folio to be unlocked.
This makes the async submission less async.
- Future DONTCACHE writes support
We do not support DONTCACHE because that feature requires the
writeback path to clear the folio dirty flags and submit the folios
sequentially.
Meanwhile the async submission makes the writeback asynchronous,
breaking the sequential submission requirement.
This is also why we need complex per-block tracking for writeback
flags, whereas iomap only requires a counter.
With the new delayed compression, the lifespan of a folio aligns with
DONTCACHE and iomap.
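With this, the delalloc stage reduces to carving the range into
BTRFS_MAX_COMPRESSED sized placeholders; condensed from
run_delalloc_delayed() below (error handling omitted):

    while (cur < end) {
        u32 cur_len = min_t(u64, end + 1 - cur, BTRFS_MAX_COMPRESSED);

        /* Placeholder em + OE only; no data extent reserved yet. */
        em = btrfs_create_delayed_em(inode, cur, cur_len);
        oe = btrfs_alloc_delayed_ordered_extent(inode, cur, cur_len);
        cur += cur_len;
    }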
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 60 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 57 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 183fb3c83c10..c95207009210 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1673,6 +1673,57 @@ static bool run_delalloc_compressed(struct btrfs_inode *inode,
return true;
}
+static int run_delalloc_delayed(struct btrfs_inode *inode, struct folio *locked_folio,
+ u64 start, u64 end)
+{
+ struct btrfs_root *root = inode->root;
+ struct btrfs_fs_info *fs_info = root->fs_info;
+ struct extent_state *cached = NULL;
+ u64 cur = start;
+ int ret;
+
+ if (btrfs_is_shutdown(fs_info)) {
+ ret = -EIO;
+ goto error;
+ }
+ while (cur < end) {
+ struct extent_map *em;
+ struct btrfs_ordered_extent *oe;
+ u32 cur_len = min_t(u64, end + 1 - cur, BTRFS_MAX_COMPRESSED);
+
+ btrfs_lock_extent(&inode->io_tree, cur, cur + cur_len - 1, &cached);
+ em = btrfs_create_delayed_em(inode, cur, cur_len);
+ if (IS_ERR(em)) {
+ ret = PTR_ERR(em);
+ goto error;
+ }
+ btrfs_free_extent_map(em);
+ oe = btrfs_alloc_delayed_ordered_extent(inode, cur, cur_len);
+ if (IS_ERR(oe)) {
+ btrfs_drop_extent_map_range(inode, cur, cur + cur_len - 1, false);
+ ret = PTR_ERR(oe);
+ goto error;
+ }
+ btrfs_put_ordered_extent(oe);
+
+ cur += cur_len;
+ }
+ extent_clear_unlock_delalloc(inode, start, end, locked_folio, &cached,
+ EXTENT_LOCKED | EXTENT_DELALLOC,
+ PAGE_UNLOCK | PAGE_SET_ORDERED);
+ return 0;
+error:
+ /* We have to drop any created delayed extent maps. */
+ if (start < cur)
+ btrfs_drop_extent_map_range(inode, start, cur - 1, false);
+ /* No range has any extent reserved, just clear them all. */
+ extent_clear_unlock_delalloc(inode, start, end, locked_folio, &cached,
+ EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
+ EXTENT_DEFRAG | EXTENT_DO_ACCOUNTING,
+ PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK);
+ return ret;
+}
+
/*
* Run the delalloc range from start to end, and write back any dirty pages
* covered by the range.
@@ -2436,9 +2487,12 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
return run_delalloc_nocow(inode, locked_folio, start, end);
if (btrfs_inode_can_compress(inode) &&
- inode_need_compress(inode, start, end, false) &&
- run_delalloc_compressed(inode, locked_folio, start, end, wbc))
- return 1;
+ inode_need_compress(inode, start, end, false)) {
+ if (IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
+ return run_delalloc_delayed(inode, locked_folio, start, end);
+ else if (run_delalloc_compressed(inode, locked_folio, start, end, wbc))
+ return 1;
+ }
if (zoned)
return run_delalloc_cow(inode, locked_folio, start, end, wbc, true);
--
2.53.0
* Re: [PATCH 0/6] btrfs: delay compression to bbio submission time
2026-03-19 21:04 [PATCH 0/6] btrfs: delay compression to bbio submission time Qu Wenruo
` (5 preceding siblings ...)
2026-03-19 21:04 ` [PATCH 6/6] btrfs: enable experimental delayed compression support Qu Wenruo
@ 2026-04-01 22:52 ` Boris Burkov
2026-04-01 23:29 ` Qu Wenruo
2026-04-22 6:23 ` Qu Wenruo
6 siblings, 2 replies; 13+ messages in thread
From: Boris Burkov @ 2026-04-01 22:52 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Fri, Mar 20, 2026 at 07:34:44AM +1030, Qu Wenruo wrote:
> [CHANGELOG]
> PoC->v1:
> - Fix the ordered extent leak caused by incorrect ref count of child OEs
> - Fix the reserved space leakage in ranges without a real OE
> - Fix the hang caused by incorrect extent lock/unlock pair
> All exposed by fsstress runs
>
> - Fix the OE range check in btrfs_wait_ordered_extents() that affects
> snapshot creation
> All exposed by fstests runs
>
> [BACKGROUND]
> Btrfs currently goes with async submission for compressed write, I'll go
> the following example to explain the async submission:
>
> The page and fs block sizes are all 4K, no large folio involved.
> The dirty range is [0, 4K), [8K, 128K).
>
> 0 4K 8K 128K
> |//| |/////////////////////////////////////////|
>
> - Write back folio 0
> * Delalloc
> writepage_delalloc() will find the delalloc range [0, 4K), and since
> it can not be inlined and is too small for compression, it will go
> through the COW path, thus a new data extent is allocated, with the
> corresponding EM/OE created.
>
> * Submission
> That folio 0 will be added into a bbio, and since we reached the OE
> end, the bbio will be submitted immediately.
>
> - Write back folio 8K
> * Delalloc
> writepage_delalloc() find the delalloc range [8K, 128K) and go
> compression.
> Instead of allocating an extent immediately, it queues the work into
> delalloc_workers.
>
> Please note that the range [8K, 128K) is completely locked during
> compression.
>
> * Skip submission
> As the whole folio 8K went through async submission, we skip bbio
> submission.
>
> - Write back folio 12K
> We wait for the folio to be unlocked (after compression is done and
> compressed bio is submitted).
> When the folio is unlocked, the folio will have writeback flag set and
> its dirty flag cleared. Thus we either wait for the writeback or skip
> the folio completely.
>
> This step repeats for the range [8K, 128K).
>
> AFAIK the async submission is required as we can not submit two
> different bbios for a single compressed range.
> This is different from the uncompressed write path, where we can have
> several different bbios for a single ordered extent.
>
> [PROBLEMS]
> The async submission has the following problems:
>
> - Non-sequential writeback
> Especially when large folios are involved, we can have some blocks
> submitted immediately (uncompressed), and some submitted later
> (compressed).
>
> That breaks the assumption of iomap and DONTCACHE writes, which
> requires all blocks inside a folio to be submitted in one go.
>
> - Not really async
> As in the example given above, we keep the whole range locked during
> compression.
> This means if we want to read a cached folio in that range, we still
> need to wait for the compression.
>
> [DELAYED COMPRESSION]
> The new idea is to delay the compression at bbio submission time.
> Now the workflow will be:
>
> - Write back folio 0
> The same, submitting it immediately
>
> - Write back folio 8K
> * Delalloc
> writepage_delalloc() find the delalloc range [8K, 128K) and go
> compression, but this time we allocated delayed EM and OE for the
> range [8K, 128K).
Just for the sake of the description:
I think it's helpful to be more clear that we unlock the delalloc
range here.
>
> * Submission
> That folio 8K will be added into a bbio, with its dirty flag removed
> and writeback flag set.
>
> - Writeback folio 12K ~ 124K
> * Delalloc
> No new delalloc range.
>
> * Submission
> Those folios will be added to the same bbio above.
> And after the last folio 124K is queued, we reached the OE end, and
> will submit the delayed bbio.
Can you explain the importance of the delayed bbio more? Why not just a
work queue item directly?
>
> - Delayed bbio submission
> As the bbio has a special @is_delayed flag set, it will not be
> submitted directly, but queued into a workqueue for compression.
>
> * Compression in the workqueue
> * Real delalloc
> Now an on-disk extent is reserved. The real EM will replace the
> delayed one.
> And the real OE will be added as a child of the original delayed
> one.
> * Compressed data submission
> * Delayed bbio finish
> When all child compressed/uncompressed writes finished, the delayed
> bbio will finish.
>
> The full delayed OE is also finished, which will insert all of its
> child OEs into the subvolume tree.
>
> This solves both the problems mentioned above, but is definitely way
> more complex than the current async submission:
I mentioned a few key questions inline, but I think the how and why of
"this solves both the problems mentioned above" is lacking at a high level.
As *I* understand it, the basic explanation is:
=====OLD=====
WRITEBACK
lock folio F
lock_extent_delalloc_range([F,F+N])
submit async chunk work
folio F+1
block on folio_lock()
ASYNC SUBMIT
do compression
do allocation
create OE
unlock folios (except locked_folio)
submit bio for OE
WRITEBACK
unblock on F+1
no-op
etc... till F+N
=====NEW=====
WRITEBACK
lock folio F
lock_extent_delalloc_range([F,F+N])
create delayed OE
unlock folios except locked_folio
lock folio F+1
add F+1 delayed bbio
etc... till F+N
"submit" delayed bio
ASYNC SUBMIT
do compression
do allocation
create child OE
submit real bio
Is this correct?
If so, the "magic" is in the delayed OE providing enough synchronization
to allow us to unlock the folios right away in the writeback context. So
we are breaking the contract "OE <=> allocated space" to allow this. I
think making the absolute core of the improvement more apparent in the
descriptions would be helpful.
I think one thing I still don't understand is the desire for the layered
bios/OEs, instead of creating the same delayed OE but then, as we do the
real allocation/compression and discover the actual ranges, doing
btrfs_split_ordered_extent() as short DIO writes do, which seems quite
similar. Splitting/joining feels like a much more natural model for
ranges like OEs than layering into a tree. As we discover the sub ranges
we actually use, we split off the real OE.
I actually think this sort of makes sense for non-compressed too, where the
reserved size can be less than the maximum you tried for, just the same.
Probably too early to try to switch over everything, but it could be a
uniformity benefit in the future.
Fundamentally, I think it just feels kind of strange (with the old code
or the new) to have the one folio at a time iteration in writeback that
does lock/unlock on each folio and the delalloc locking/iteration that
we really need to do the right thing with extents. I would love it if we
could reconcile them more completely rather than just hacking OEs into
submission (no pun intended :D).
I actually think your model is quite close to what I want; it
might just be too hacky/incremental. I am ok with moving forward with it
if we can also envision future next steps for further simplification.
I read the patches but apologies if I missed the answer to some of my
questions in the code.
Thanks,
Boris
>
> - Layered OEs
> And we need to manage the child/parent OEs properly
> But still it brings a minimal amount of change to the existing OE
> users, and keeps the scheme that every block going through
> extent_writepage_io() has a corresponding OE.
>
> - Possible extra split
> Since the delayed OE is allocated first, we can still submit two
> different delayed bbios for the same OE.
>
> This means we can have two smaller compressed extents compared to one,
> which may reduce the compression ratio.
>
> - More complex error handling
> We need to handle cases where some part of the delayed OE has no
> child OE. In that case we need to manually release the reserved data/meta
> space.
>
> Qu Wenruo (6):
> btrfs: add skeleton for delayed btrfs bio
> btrfs: add delayed ordered extent support
> btrfs: introduce the skeleton of delayed bbio endio function
> btrfs: introduce compression for delayed bbio
> btrfs: implement uncompressed fallback for delayed bbio
> btrfs: enable experimental delayed compression support
>
> fs/btrfs/bio.c | 1 +
> fs/btrfs/bio.h | 3 +
> fs/btrfs/btrfs_inode.h | 3 +
> fs/btrfs/extent_io.c | 29 ++-
> fs/btrfs/extent_map.h | 9 +-
> fs/btrfs/inode.c | 492 +++++++++++++++++++++++++++++++++++++++-
> fs/btrfs/ordered-data.c | 181 +++++++++++----
> fs/btrfs/ordered-data.h | 14 ++
> 8 files changed, 678 insertions(+), 54 deletions(-)
>
> --
> 2.53.0
>
* Re: [PATCH 0/6] btrfs: delay compression to bbio submission time
2026-04-01 22:52 ` [PATCH 0/6] btrfs: delay compression to bbio submission time Boris Burkov
@ 2026-04-01 23:29 ` Qu Wenruo
2026-04-02 0:15 ` Qu Wenruo
2026-04-22 6:23 ` Qu Wenruo
1 sibling, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2026-04-01 23:29 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs
On 2026/4/2 09:22, Boris Burkov wrote:
> On Fri, Mar 20, 2026 at 07:34:44AM +1030, Qu Wenruo wrote:
[...]
>> - Write back folio 8K
>> * Delalloc
>> writepage_delalloc() find the delalloc range [8K, 128K) and go
>> compression, but this time we allocated delayed EM and OE for the
>> range [8K, 128K).
>
> Just for the sake of the description:
>
> I think it's helpful to be more clear that we unlock the delalloc
> range here.
Sure.
>
>>
>> * Submission
>> That folio 8K will be added into a bbio, with its dirty flag removed
>> and writeback flag set.
>>
>> - Writeback folio 12K ~ 124K
>> * Delalloc
>> No new delalloc range.
>>
>> * Submission
>> Those folios will be added to the same bbio above.
>> And after the last folio 124K is queued, we reached the OE end, and
>> will submit the delayed bbio.
>
> Can you explain the importance of the delayed bbio more? Why not just a
> work queue item directly?
Which aspect of its importance?
If it's the folio dirty/writeback flag changes, a workqueue alone won't
solve that, as it would still be the old async submission, causing the
writeback flag to be set at seemingly random times.
If it's something else, the new behavior is almost the same as the
regular COW path, except for the complex layered OE part after submission.
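To make that concrete, here is a minimal sketch of the submission-time
dispatch; everything except the @is_delayed flag itself is a
hypothetical stand-in, not code from the patchset:

#include <linux/slab.h>
#include <linux/workqueue.h>

/*
 * Sketch only: divert a delayed bbio to a compression workqueue at
 * submission time. All names except bbio->is_delayed are hypothetical.
 */
struct delayed_bbio_work {
	struct work_struct work;
	struct btrfs_bio *bbio;
};

static struct workqueue_struct *delayed_compress_wq;

static void delayed_compress_worker(struct work_struct *work)
{
	struct delayed_bbio_work *dwork =
		container_of(work, struct delayed_bbio_work, work);

	/*
	 * Worker context: compress the data, reserve the real on-disk
	 * extent, create the child OE, then submit the real bbio.
	 */
	run_delayed_compression(dwork->bbio);	/* hypothetical */
	kfree(dwork);
}

static void submit_or_defer_bbio(struct btrfs_bio *bbio)
{
	struct delayed_bbio_work *dwork;

	if (!bbio->is_delayed) {
		submit_regular_bbio(bbio);	/* hypothetical stand-in */
		return;
	}

	/* A delayed bbio must never hit the block layer directly. */
	dwork = kmalloc(sizeof(*dwork), GFP_NOFS);
	if (!dwork)
		return;	/* error handling elided in this sketch */
	dwork->bbio = bbio;
	INIT_WORK(&dwork->work, delayed_compress_worker);
	queue_work(delayed_compress_wq, &dwork->work);
}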
>
>>
>> - Delayed bbio submission
>> As the bbio has a special @is_delayed flag set, it will not be
>> submitted directly, but queued into a workqueue for compression.
>>
>> * Compression in the workqueue
>> * Real delalloc
>> Now an on-disk extent is reserved. The real EM will replace the
>> delayed one.
>> And the real OE will be added as a child of the original delayed
>> one.
>> * Compressed data submission
>> * Delayed bbio finish
>> When all child compressed/uncompressed writes finished, the delayed
>> bbio will finish.
>>
>> The full delayed OE is also finished, which will insert all of its
>> child OEs into the subvolume tree.
>>
>> This solves both the problems mentioned above, but is definitely way
>> more complex than the current async submission:
>
> I mentioned a few key questions inline, but I think the how and why of
> "this solves both the problems mentioned above" is lacking at a high level.
>
> As *I* understand it, the basic explanation is:
> =====OLD=====
> WRITEBACK
> lock folio F
> lock_extent_delalloc_range([F,F+N])
> submit async chunk work
> folio F+1
> block on folio_lock()
>
> ASYNC SUBMIT
> do compression
> do allocation
> create OE
> unlock folios (except locked_folio)
> submit bio for OE
>
> WRITEBACK
> unblock on F+1
> no-op
> etc... till F+N
>
> =====NEW=====
> WRITEBACK
> lock folio F
> lock_extent_delalloc_range([F,F+N])
> create delayed OE
> unlock folios except locked_folio
> lock folio F+1
> add F+1 delayed bbio
> etc... till F+N
> "submit" delayed bio
>
> ASYNC SUBMIT
> do compression
> do allocation
> create child OE
> submit real bio
>
> Is this correct?
Yes.
>
> If so, the "magic" is in the delayed OE providing enough synchronization
> to allow us to unlock the folios right away in the writeback context.
Please remember that, for non-compressed writes, we already do exactly
the same thing, unlocking folios right away.
So the synchronization part is at least no worse than the regular COW path.
But I agree the current delalloc-time folio lock/unlock is not
straightforward to grasp.
> So
> we are breaking the contract "OE <=> allocated space" to allow this. I
> think making the absolute core of the improvement more apparent in the
> descriptions would be helpful.
>
> I think one thing I still don't understand is the desire for the layered
> bios/OEs instead of creating the same delayed OE, but then as we do the real
> allocation/compression and discover the actual ranges doing
> btrfs_split_ordered_extent() like short DIO writes, which seems quite
> similar. Splitting/joining feels like a much more natural model for
> ranges like OEs than layering into a tree. As we discover the sub ranges
> we actually use, we split off the real OE.
I can definitely work towards that direction. Although my concern is the
OE waiting/start part and error handling.
But so far those are only concerns, I need to implement the code to see
what can go wrong.
And if no major problem is hit, you can see a v2 with the split solution.
>
> I actually think this sort of makes sense for non-compressed too, where the
> reserved size can be less than the maximum you tried for, just the same.
> Probably too early to try to switch over everything, but it could be a
> uniformity benefit in the future.
The problem is the size of the contiguous range we can get from the
page cache, compared to the max extent size.
Our max extent size is 128M, but one bio can only hold 256 bvecs.
In the worst case, where none of the page cache is contiguous, a bio
can only hold 1M of data with a 4K page size, resulting in far more
fragments.
Although large folios can help (e.g. a 64K IO block size with enough
memory can easily fill a 128MiB bio), it's indeed a little too early.
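Spelling out the arithmetic (a standalone userspace sketch, not kernel
code; BIO_MAX_VECS really is 256 in current kernels):

#include <stdio.h>

int main(void)
{
	const unsigned long max_vecs = 256;		/* BIO_MAX_VECS */
	const unsigned long page_size = 4096;		/* 4K pages */
	const unsigned long max_extent = 128UL << 20;	/* 128M max extent */

	/* Worst case: fully non-contiguous page cache, one page per bvec. */
	const unsigned long per_bio = max_vecs * page_size;	/* 1M */

	printf("worst-case payload per bio: %lu KiB\n", per_bio >> 10);
	printf("bios needed per max extent: %lu\n", max_extent / per_bio);
	return 0;
}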
>
> Fundamentally, I think it just feels kind of strange (with the old code
> or the new) to have the one folio at a time iteration in writeback that
> does lock/unlock on each folio and the delalloc locking/iteration that
> we really need to do the right thing with extents. I would love if we
> could reconcile them more completely rather than just hacking OEs into
> submission (no pun intended :D).
IMHO, all the complex delalloc page lock/unlock exists just to get a
large enough contiguous range to reduce fragmentation.
Personally I would prefer to submit one or more bios right after
locking all the folios at the delalloc stage, which would achieve the
same reduction in fragments.
But I guess a lot of extra writeback-control-related accounting would
be needed, and the current folio/tag based writeback makes such early
bio assembly/submission harder.
And with the ultimate objective of iomap integration, submission at
delalloc time can also be very tricky for iomap.
>
> I actually think your model is quite close to what I want, I think it
> might just be too hacky/incremental. I am ok with moving forward with it
> if we can also envision future next steps for further simplification.
>
> I read the patches but apologies if I missed the answer to some of my
> questions in the code.
No problem at all, your review really helps to improve the patchset.
Thanks,
Qu
>
> Thanks,
> Boris
>
>>
>> - Layered OEs
>> And we need to manage the child/parent OEs properly
>> But still it brings the minimal amount of changes to the existing OE
>> users, and keeps the scheme that every block going through
>> extent_writepage_io() has a corresponding OE.
>>
>> - Possible extra split
>> Since the delayed OE is allocated first, we can still submit two
>> different delayed bbio for the same OE.
>>
>> This means we can have two smaller compressed extents compared to one,
>> which may reduce the compression ratio.
>>
>> - More complex error handling
>> We need to handle cases where some part of the delayed OE has no child
>> OE. In that case we need to manually release the reserved data/meta
>> space.
>>
>> Qu Wenruo (6):
>> btrfs: add skeleton for delayed btrfs bio
>> btrfs: add delayed ordered extent support
>> btrfs: introduce the skeleton of delayed bbio endio function
>> btrfs: introduce compression for delayed bbio
>> btrfs: implement uncompressed fallback for delayed bbio
>> btrfs: enable experimental delayed compression support
>>
>> fs/btrfs/bio.c | 1 +
>> fs/btrfs/bio.h | 3 +
>> fs/btrfs/btrfs_inode.h | 3 +
>> fs/btrfs/extent_io.c | 29 ++-
>> fs/btrfs/extent_map.h | 9 +-
>> fs/btrfs/inode.c | 492 +++++++++++++++++++++++++++++++++++++++-
>> fs/btrfs/ordered-data.c | 181 +++++++++++----
>> fs/btrfs/ordered-data.h | 14 ++
>> 8 files changed, 678 insertions(+), 54 deletions(-)
>>
>> --
>> 2.53.0
>>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/6] btrfs: delay compression to bbio submission time
2026-04-01 23:29 ` Qu Wenruo
@ 2026-04-02 0:15 ` Qu Wenruo
2026-04-02 0:51 ` Boris Burkov
0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2026-04-02 0:15 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs
On 2026/4/2 09:59, Qu Wenruo wrote:
[...]
>> So
>> we are breaking the contract "OE <=> allocated space" to allow this. I
>> think making the absolute core of the improvement more apparent in the
>> descriptions would be helpful.
>>
>> I think one thing I still don't understand is the desire for the layered
>> bios/OEs instead of creating the same delayed OE, but then as we do
>> the real
>> allocation/compression and discover the actual ranges doing
>> btrfs_split_ordered_extent() like short DIO writes, which seems quite
>> similar. Splitting/joining feels like a much more natural model for
>> ranges like OEs than layering into a tree. As we discover the sub ranges
>> we actually use, we split off the real OE.
>
> I can definitely work towards that direction. Although my concern is the
> OE waiting/start part and error handling.
>
> But so far those are only concerns, I need to implement the code to see
> what can go wrong.
>
> And if no major problem is hit, you can see a v2 with the split solution.
Finally I recall the challenge with btrfs_split_ordered_extent(): we
cannot split an OE in the middle.
E.g. we have a delayed OE for range [0, 32K), then due to whatever
reasons (e.g. memory pressure), we are forced to submit range [0, 16K)
first, then range [16K, 32K).
Both go through delayed compression, but the range [16K, 32K) wins the
race by failing compression (bad ratio) and falls back to uncompressed
submission first, before range [0, 16K) even finishes its compression.
Furthermore, for the range [16K, 32K) we do not have a large enough
free space extent to fill it in one go, and can only allocate several
8K sized extents.
So we need to split the [16K, 32K) into two ranges, [16K, 24K) and [24K,
32K).
This means we have to split the original [0, 32K) extent into [0, 16K),
[16K, 24K) and [24K, 32K) ranges.
This is not supported by the current btrfs_split_ordered_extent(), which
can only split a range off the beginning of an OE.
I'll try to implement a version of btrfs_split_ordered_extent() that can
split the range at any offset to see how things will work then.
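Roughly, the semantics I have in mind for such an any-offset split, as
a minimal sketch; clone_ordered_extent() and insert_ordered_extent()
are hypothetical stand-ins, and locking/space accounting are omitted:

/*
 * Hypothetical any-offset split: carve [offset, offset + len) out of
 * @ordered, producing up to three OEs. clone_ordered_extent() and
 * insert_ordered_extent() are illustrative only; locking, space
 * accounting and error handling are omitted.
 */
static struct btrfs_ordered_extent *
split_ordered_extent_at(struct btrfs_ordered_extent *ordered,
			u64 offset, u64 len)
{
	const u64 end = ordered->file_offset + ordered->num_bytes;
	struct btrfs_ordered_extent *mid;

	ASSERT(offset > ordered->file_offset && offset + len <= end);

	/* The tail part [offset + len, end), if any, needs a new OE. */
	if (offset + len < end)
		insert_ordered_extent(clone_ordered_extent(ordered,
					offset + len, end - offset - len));

	/* The carved-out middle part gets its own OE. */
	mid = clone_ordered_extent(ordered, offset, len);
	insert_ordered_extent(mid);

	/* Shrink the original down to the head part [file_offset, offset). */
	ordered->num_bytes = offset - ordered->file_offset;

	return mid;
}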
Thanks,
Qu
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/6] btrfs: delay compression to bbio submission time
2026-04-02 0:15 ` Qu Wenruo
@ 2026-04-02 0:51 ` Boris Burkov
2026-04-02 4:51 ` Qu Wenruo
0 siblings, 1 reply; 13+ messages in thread
From: Boris Burkov @ 2026-04-02 0:51 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Apr 02, 2026 at 10:45:14AM +1030, Qu Wenruo wrote:
>
>
> On 2026/4/2 09:59, Qu Wenruo wrote:
> [...]
> > > So
> > > we are breaking the contract "OE <=> allocated space" to allow this. I
> > > think making the absolute core of the improvement more apparent in the
> > > descriptions would be helpful.
> > >
> > > I think one thing I still don't understand is the desire for the layered
> > > bios/OEs instead of creating the same delayed OE, but then as we do
> > > the real
> > > allocation/compression and discover the actual ranges doing
> > > btrfs_split_ordered_extent() like short DIO writes, which seems quite
> > > similar. Splitting/joining feels like a much more natural model for
> > > ranges like OEs than layering into a tree. As we discover the sub ranges
> > > we actually use, we split off the real OE.
> >
> > I can definitely work towards that direction. Although my concern is the
> > OE waiting/start part and error handling.
> >
> > But so far those are only concerns, I need to implement the code to see
> > what can go wrong.
> >
> > And if no major problem is hit, you can see a v2 with the split solution.
>
> Finally I recall the challenge with btrfs_split_ordered_extent(): we
> cannot split an OE in the middle.
>
> E.g. we have a delayed OE for range [0, 32K), then due to whatever reasons
> (e.g. memory pressure), we are forced to submit range [0, 16K) first, then
> range [16K, 32K).
>
> Both go through delayed compression, but the range [16K, 32K) wins the
> race by failing compression (bad ratio) and falls back to uncompressed
> submission first, before range [0, 16K) even finishes its compression.
>
> Furthermore, for the range [16K, 32K) we do not have a large enough
> free space extent to fill it in one go, and can only allocate several
> 8K sized extents.
>
> So we need to split the [16K, 32K) into two ranges, [16K, 24K) and [24K,
> 32K).
>
> This means we have to split the original [0, 32K) extent into [0, 16K),
> [16K, 24K) and [24K, 32K) ranges.
> This is not supported by the current btrfs_split_ordered_extent(), which
> can only split a range off the beginning of an OE.
>
>
> I'll try to implement a version of btrfs_split_ordered_extent() that can
> split the range at any offset to see how things will work then.
This was kinda buggy before with the extent maps, and I think Naohiro,
Christoph, and I ended up doing a good amount of refactoring to get rid
of it. I am confident you can implement it correctly, just a fair warning.
Also, you could consider doing an iterative split? Like if you need 16k,
then split [0,128k) into [0,16k)[16k,128k) then split [16k,128k) into
[16k,24k)[24k,128k).
Not sure if that is easier than the "true" 3 (or K) way split.
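For concreteness, a rough sketch of the iterative version; split_front()
is just a stand-in for btrfs_split_ordered_extent() (I'm not relying on
its exact signature), detaching the first @len bytes into a new OE and
shrinking the original:

/*
 * Iterative front-splits: carve [0,16k) and then [16k,24k) out of a
 * [0,128k) OE as the real allocations arrive. split_front() and
 * hand_off_real_oe() are hypothetical stand-ins.
 */
static void carve_as_allocated(struct btrfs_ordered_extent *oe)
{
	struct btrfs_ordered_extent *real;

	/* First allocation is 16k: [0,128k) -> [0,16k) + [16k,128k). */
	real = split_front(oe, SZ_16K);
	hand_off_real_oe(real);

	/* Next allocation is 8k: [16k,128k) -> [16k,24k) + [24k,128k). */
	real = split_front(oe, SZ_8K);
	hand_off_real_oe(real);

	/* @oe now covers the remaining [24k,128k). */
}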
If all this becomes a huge headache please don't feel obliged to
carry on forever, I am happy to discuss again.
Thanks,
Boris
>
> Thanks,
> Qu
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/6] btrfs: delay compression to bbio submission time
2026-04-02 0:51 ` Boris Burkov
@ 2026-04-02 4:51 ` Qu Wenruo
0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2026-04-02 4:51 UTC (permalink / raw)
To: Boris Burkov, Qu Wenruo; +Cc: linux-btrfs
On 2026/4/2 11:21, Boris Burkov wrote:
> On Thu, Apr 02, 2026 at 10:45:14AM +1030, Qu Wenruo wrote:
>>
>>
>> On 2026/4/2 09:59, Qu Wenruo wrote:
>> [...]
>>>> So
>>>> we are breaking the contract "OE <=> allocated space" to allow this. I
>>>> think making the absolute core of the improvement more apparent in the
>>>> descriptions would be helpful.
>>>>
>>>> I think one thing I still don't understand is the desire for the layered
>>>> bios/OEs instead of creating the same delayed OE, but then as we do
>>>> the real
>>>> allocation/compression and discover the actual ranges doing
>>>> btrfs_split_ordered_extent() like short DIO writes, which seems quite
>>>> similar. Splitting/joining feels like a much more natural model for
>>>> ranges like OEs than layering into a tree. As we discover the sub ranges
>>>> we actually use, we split off the real OE.
>>>
>>> I can definitely work towards that direction. Although my concern is the
>>> OE waiting/start part and error handling.
>>>
>>> But so far those are only concerns, I need to implement the code to see
>>> what can go wrong.
>>>
>>> And if no major problem is hit, you can see a v2 with the split solution.
>>
>> Finally I recall the challenge with btrfs_split_ordered_extent(): we
>> cannot split an OE in the middle.
>>
>> E.g. we have a delayed OE for range [0, 32K), then due to whatever reasons
>> (e.g. memory pressure), we are forced to submit range [0, 16K) first, then
>> range [16K, 32K).
>>
>> Both go through delayed compression, but the range [16K, 32K) wins the
>> race by failing compression (bad ratio) and falls back to uncompressed
>> submission first, before range [0, 16K) even finishes its compression.
>>
>> Furthermore, for the range [16K, 32K) we do not have a large enough
>> free space extent to fill it in one go, and can only allocate several
>> 8K sized extents.
>>
>> So we need to split the [16K, 32K) into two ranges, [16K, 24K) and [24K,
>> 32K).
>>
>> This means we have to split the original [0, 32K) extent into [0, 16K),
>> [16K, 24K) and [24K, 32K) ranges.
>> This is not supported by the current btrfs_split_ordered_extent(), which
>> can only split a range off the beginning of an OE.
>>
>>
>> I'll try to implement a version of btrfs_split_ordered_extent() that can
>> split the range at any offset to see how things will work then.
>
> This was kinda buggy before with the extent maps, and I think Naohiro,
> Christoph, and I ended up doing a good amount of refactoring to get rid
> of it. I am confident you can implement it correctly, just a fair warning.
>
> Also, you could consider doing an iterative split? Like if you need 16k,
> then split [0,128k) into [0,16k)[16k,128k) then split [16k,128k) into
> [16k,24k)[24k,128k).
Yes, that's possible if we go with an ordered workqueue, and as long as
writeback always happens sequentially.
But sequential writeback may not always hold.
E.g. we are initially writing back folio 0, which created the OE for
range [0, 128K), but due to memory pressure or whatever else in the MM
layer, the MM suddenly chooses to also trigger writeback for folio 64K,
while our initial writeback is still submitting folios for range [0, 32K).
In that case, we have to do the split at 64K, and the sequential nature
is broken no matter what.
And that's also why I believe we hold all folios locked in the existing
async submission path: to avoid such unexpected splits caused by the MM.
I will definitely try some aspects of the ideas mentioned here, like
using an ordered workqueue (which also addresses the current
out-of-order OE disk bytenr problems).
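For reference, a minimal sketch of creating such an ordered workqueue
(the name is a placeholder); an ordered workqueue runs at most one work
item at a time, in queueing order:

#include <linux/workqueue.h>

/*
 * Placeholder name; WQ_MEM_RECLAIM because writeback must be able to
 * make forward progress under memory pressure.
 */
static struct workqueue_struct *delayed_compress_wq;

static int init_delayed_compress_wq(void)
{
	delayed_compress_wq = alloc_ordered_workqueue("btrfs-delayed-compress",
						      WQ_MEM_RECLAIM);
	return delayed_compress_wq ? 0 : -ENOMEM;
}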
But if the OE splitting is too complex, I'll definitely give you some
feedback about the problems.
Thanks,
Qu
>
> Not sure if that is easier than the "true" 3 (or K) way split.
>
> If all this becomes a huge headache please don't feel obliged to
> carry on forever, I am happy to discuss again.
>
> Thanks,
> Boris
>
>>
>> Thanks,
>> Qu
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/6] btrfs: delay compression to bbio submission time
2026-04-01 22:52 ` [PATCH 0/6] btrfs: delay compression to bbio submission time Boris Burkov
2026-04-01 23:29 ` Qu Wenruo
@ 2026-04-22 6:23 ` Qu Wenruo
1 sibling, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2026-04-22 6:23 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs
On 2026/4/2 09:22, Boris Burkov wrote:
> On Fri, Mar 20, 2026 at 07:34:44AM +1030, Qu Wenruo wrote:
[...]
>
> I think one thing I still don't understand is the desire for the layered
> bios/OEs instead of creating the same delayed OE, but then as we do the real
> allocation/compression and discover the actual ranges doing
> btrfs_split_ordered_extent() like short DIO writes, which seems quite
> similar. Splitting/joining feels like a much more natural model for
> ranges like OEs than layering into a tree. As we discover the sub ranges
> we actually use, we split off the real OE.
After some coding, it really looks like the split idea is going to break
the existing OE code more than the layered approach would.
1) The new real OE lands in the middle of the existing one
   |<--------------- The existing OE ------------------>|
   |<---- Head ---->|<- The new real OE ->|<--- Tail -->|
   This is in fact the corner case, and TBH pretty easy to handle.
   It splits the OE into 3 parts.
   The head part is simple: we can shrink the existing OE.
   The middle part can just use the new OE.
   But we need to allocate a new OE for the tail part.
   The allocation has two solutions:
   a) GFP_ATOMIC
      Since it's a very rare corner case, we can afford to go ATOMIC
      with the ordered_tree_lock held.
   b) Pre-allocation
      At the cost of searching the ordered_tree twice: once to find out
      there is a delayed OE, then unlock, allocate a new OE, and search
      again.
      Then we need to insert the new and tail OEs into the per-root
      ordered_extent_list, which also requires unlocking the
      ordered_tree_lock.
   This can be solved, but I'm afraid it may not be elegant at all.
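A minimal sketch of the pre-allocation shape from 1b); every helper
named here is hypothetical, only the unlock/allocate/relock/recheck
pattern is the point:

/*
 * Sketch of 1b): pre-allocate the tail OE with the lock dropped, then
 * retake the lock and recheck. All helpers are hypothetical.
 */
static int split_delayed_oe(struct btrfs_inode *inode, u64 offset, u64 len)
{
	struct btrfs_ordered_extent *tail;

	/* First search: is there a delayed OE covering this range at all? */
	spin_lock_irq(&inode->ordered_tree_lock);
	if (!delayed_oe_covers(inode, offset, len)) {
		spin_unlock_irq(&inode->ordered_tree_lock);
		return 0;
	}
	spin_unlock_irq(&inode->ordered_tree_lock);

	/* Allocate the tail OE while the lock is dropped. */
	tail = alloc_empty_ordered_extent(GFP_NOFS);
	if (!tail)
		return -ENOMEM;

	/* Second search: the tree may have changed in the meantime. */
	spin_lock_irq(&inode->ordered_tree_lock);
	if (!delayed_oe_covers(inode, offset, len)) {
		spin_unlock_irq(&inode->ordered_tree_lock);
		free_ordered_extent(tail);
		return 0;
	}
	three_way_split(inode, offset, len, tail);
	spin_unlock_irq(&inode->ordered_tree_lock);

	/* Per-root ordered_extent_list insertion happens after unlock. */
	return 0;
}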
2) The new real OE covers the full delayed OE range
   |<--------------- The existing OE ------------------>|
   |<--------------- The new real OE ------------------>|
   This is in fact the most common case, and it's pretty hard to find a
   good way to work around it.
   There are two solutions, and neither is gentle on the existing OE code:
   a) Remove the existing OE
      However, since we're holding the ordered_tree_lock, the existing
      btrfs_remove_ordered_extent() cannot be used.
      We need to remove the existing OE from the tree, and defer the
      extra handling (outstanding_extents/ordered_bytes/root_extent_list)
      until after we have unlocked the ordered_tree_lock.
      And how do we end the existing delayed OE? Just remove it and
      call it a day? Or go through btrfs_finish_one_ordered()?
      Furthermore, I'm not sure what we should do if there is a
      transaction waiting on the existing delayed OE.
      Transfer the wait to the new OE?
      This seems to be the most complex way.
   b) Reuse the existing OE
      We replace the existing OE's parameters with those from the new one.
      At least the transaction waiting is no longer a problem.
      But now we have a space reservation problem.
      Originally we reserved no space for the initial delayed OE,
      but now we have space properly reserved for the new real OE.
      This will need a full implementation and full tests to be sure.
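A minimal sketch of what 2b) would look like; the helper and the
reservation transfer are hypothetical, only the field names come from
struct btrfs_ordered_extent:

/*
 * Sketch of 2b): reuse the delayed OE in place by adopting the real
 * allocation's parameters. adopt_real_extent() and the reservation
 * transfer are hypothetical.
 */
static void adopt_real_extent(struct btrfs_ordered_extent *oe,
			      u64 disk_bytenr, u64 disk_num_bytes,
			      int compress_type)
{
	/* Caller holds the ordered_tree_lock. */
	oe->disk_bytenr = disk_bytenr;
	oe->disk_num_bytes = disk_num_bytes;
	oe->compress_type = compress_type;

	/*
	 * The delayed OE was created with no space reserved; the
	 * reservation made for the real extent must be transferred
	 * here, or we would leak or double-account it.
	 */
	transfer_space_reservation(oe);	/* hypothetical */
}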
3) The new OE covers the head/tail of the existing one
   |<--------------- The existing OE ------------------>|
   |<--- New --->|
   This is again a very common case, e.g. compression failed and we
   fall back to uncompressed writes.
   The split itself is pretty easy: shrink the existing OE towards the
   right and insert the new one.
   Now the problem is the transaction waiting. The original transaction
   is waiting on the whole range, but now it will only wait for the
   remaining part.
   This may lead to fsync data corruption, and we will need extra
   handling to put the new OE on the transaction's waiting list.
   This forces the OE insert path to consider extra cases.
In the end, I think it's better not to mix delayed OEs with regular ones.
Unlike the existing OE splitting, delayed OEs simply have too many
differences from regular ones.
Thanks,
Qu
^ permalink raw reply [flat|nested] 13+ messages in thread