* [PATCH v2 0/6] btrfs: delay compression to bbio submission time
@ 2026-05-16 3:45 Qu Wenruo
From: Qu Wenruo @ 2026-05-16 3:45 UTC
To: linux-btrfs
[CHANGELOG]
v2:
- Rebased to the latest for-next branch
Several minor conflicts:
* The removal of folio ordered flag
* The refactor of btrfs_mod_outstanding_extents()
- Fix a random failure in btrfs/260
It turns out that the original filemap_flush() only triggers writeback
of dirty pages, but since our new compression happens after the bios
are submitted, there can be a race between clearing
inode->defrag_compress and the compression path reading it.
This can cause btrfs to use the compression algorithm from the mount
option instead of the one specified for defrag.
Fix it by using filemap_write_and_wait_range(), which also avoids the
quirky double flush behavior.
- Fix a use-after-free bug where bio->bi_status is accessed after
bio_put()
- Remove a mapping_set_error() call when try_submit_compressed() fails
As we still have the uncompressed fallback, we should not mark the
mapping with an error.
- Drop all allocated OEs along with the extent maps when
run_delalloc_delayed() failed
- Slightly reword the cover letter
PoC->v1:
- Fix the ordered extent leak caused by incorrect ref count of child OEs
- Fix the reserved space leakage in ranges without a real OE
- Fix the hang caused by incorrect extent lock/unlock pair
All exposed by fsstress runs
- Fix the OE range check in btrfs_wait_ordered_extents() that affects
snapshot creation
All exposed by fstests runs
[BACKGROUND]
Btrfs currently uses async submission for compressed writes. I'll use
the following example to explain the async submission (a much-simplified
sketch of that path is given after the example):
The page and fs block sizes are all 4K, no large folio involved.
The dirty range is [0, 4K), [8K, 128K).
0  4K     8K                                     128K
|//|      |/////////////////////////////////////////|
- Write back folio 0
No compression.
* New OE for [0, 4K)
* Submit bbio for [0, 4K)
- Write back folio 8K
* Compression/OE creation is delayed to a workqueue
All folios in range [8K, 128K) are still locked.
* Skip submission
- Write back folios in range [12K, 128K)
* Wait for the folio to be unlocked.
As the folio is only unlocked after the compression is done.
* Skip submission
As the folio is no longer dirty.
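Here is the much-simplified sketch of that async path mentioned above.
Only run_delalloc_compressed() and btrfs_run_delalloc_range() are real
names from this series; the other helpers are made-up placeholders for
illustration, not the in-tree code:

    /* Called from btrfs_run_delalloc_range() for a compressible range. */
    static bool run_delalloc_compressed(struct btrfs_inode *inode,
                                        struct folio *locked_folio,
                                        u64 start, u64 end,
                                        struct writeback_control *wbc)
    {
            /*
             * Queue an async work item for [start, end].  The folios in
             * the range stay locked; only the worker unlocks them, after
             * compression, extent allocation and bio submission are done.
             */
            queue_async_compression(inode, start, end);     /* placeholder */
            return true;    /* tell writeback to skip this range */
    }

    /* Placeholder for the async worker. */
    static void async_compression_worker(struct work_struct *work)
    {
            /* compress the range */
            /* reserve a data extent and create the OE */
            /* submit the compressed bio */
            /* only now unlock the folios in [start, end] */
    }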
[PROBLEMS]
The async submission has the following problems:
- Non-sequential writeback
Especially when large folios are involved, we can have some blocks
submitted immediately (uncompressed), and some submitted later
(compressed).
That breaks the assumption of iomap and DONTCACHE writes, which
requires all blocks inside a folio to be submitted in one go.
- Not really async
As shown in the example above, we keep the whole range locked during
compression.
This means if we want to read a cached folio in that range, we still
need to wait for the compression.
[DELAYED COMPRESSION]
The new idea is to delay the compression to bbio submission time.
Now the workflow will be (a rough code sketch follows the walkthrough):
- Write back folio 0
No compression, the same as the old code.
* New OE for [0, 4K)
* Submit bbio for [0, 4K)
- Write back folio 8K
* New OE for [8K, 128K)
This new OE has a delayed flag, without a real data extent backing
it.
Then the folio range [12K, 128K) is unlocked, just like for
uncompressed writes.
* Queue the folio into a bbio
- Write back folios 12K ~ 124K
* No new OE
The existing delayed OE [8K, 128K) is already there.
* Queue the folio into a bbio.
* Submit the bbio
As we have reached the OE end.
- Delayed bbio submission
As the bbio has a special @is_delayed flag set, it will not be
submitted directly, but queued into a workqueue for compression.
* Compression in the workqueue
As we do not want to delay the writeback of the remaining folios,
the compression is done inside a workqueue.
* Real extent allocation
Now an on-disk extent is reserved. The real EM will replace the
delayed one.
And the real OE will be added as a child of the original delayed
one.
* Compressed data submission
* Delayed bbio finish
When all child compressed/uncompressed writes have finished, the
delayed bbio will finish.
The full delayed OE is also finished, which will insert the file
extent items of all its child OEs into the subvolume tree.
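Here is the rough code sketch mentioned above, stitched together from
the helpers added in this series (locking and error handling omitted;
this is not a literal copy of any single function):

    /* Delalloc time (run_delalloc_delayed() in the last patch):
     * only placeholders, no data extent is reserved yet. */
    em = btrfs_create_delayed_em(inode, cur, cur_len);
    oe = btrfs_alloc_delayed_ordered_extent(inode, cur, cur_len);
    /* All folios except the locked one get unlocked here. */

    /* Writeback time: extent_writepage_io() assembles a bbio with
     * is_delayed set, and submit_one_bio() redirects it. */
    if (bbio->is_delayed)
            btrfs_submit_delayed_write(bbio);       /* queues a work item */

    /* In the workqueue (run_delayed_bbio()): compress, reserve the real
     * extent, create the child OE/EM and submit the child bio. */
    if (!try_submit_compressed(parent)) {
            /* Fall back to uncompressed writes: a loop of
             * submit_one_uncompressed_range() calls (patch 5). */
    }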
This solves both the problems mentioned above, but is definitely more
complex than the current async submission:
- An OE no longer represents an allocated extent
As we will have delayed OEs, which have no allocated space backing
them.
Thankfully this is not a huge deal. At ordered extent finish time, we
just need to skip any reserved space handling for a delayed OE range
that doesn't have a real OE covering it.
- Layered OEs
We need to manage the child/parent OEs properly.
But still, this brings the minimal amount of changes to the existing OE
users, and keeps the scheme that every block going through
extent_writepage_io() has a corresponding OE.
The other solution to layered OEs is to split the OE at real extent
allocation time.
But that has more corner cases than I thought:
* The new real OE is exactly the same size as the delayed OE
We need to either completely replace the delayed OE with the new
real one, or copy the members from the real OE into the existing
one.
Either way, there will be an OE that needs to be put, while skipping
all the per-root OE tracking.
We also need to properly handle the OE waiting behavior for the
remaining one in the ordered tree.
* The new real OE is in the middle of a delayed OE
This is a corner case but can happen.
In that case we need to allocate a new OE to fill the tail part,
and that new OE will also need to be added to the per-root OE list,
with proper flags inherited from the old OE.
All my local attempts to go that path not only lead to more code but
also to more error handling and very tricky OE splits.
So I'm afraid the layered OE solution is complex, but it is less
complex than the alternative OE splitting method.
At least for now, when only compressed writes are delayed, the layered
solution still seems to be simpler.
It may change in the future if we also want to do delayed writes for
non-compressed writes.
But I hope we can simplify the code before that future.
- Possible extra split
Since the delayed OE is allocated first, we can still submit two
different delayed bbios for the same OE.
This means we can have two smaller compressed extents instead of one,
which may reduce the compression ratio.
- More complex error handling
We need to handle cases where some part of the delayed OE has no
child OE. In that case we need to manually release the reserved
data/meta space (see the sketch after this list).
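Here is the sketch mentioned above: for every range of the parent
delayed OE whose child_bitmap bits are still zero, the finishing code
(finish_delayed_ordered() in patch 2) manually releases the
reservation, roughly like this (bitmap iteration and error handling
trimmed):

    /* [range_start, range_end] has reserved space but no child OE. */
    btrfs_lock_extent(&inode->io_tree, range_start, range_end, &cached);
    btrfs_delalloc_release_space(inode, NULL, range_start, range_len, true);
    btrfs_drop_extent_map_range(inode, range_start, range_end, false);
    btrfs_clear_extent_bit(&inode->io_tree, range_start, range_end,
                           EXTENT_LOCKED | EXTENT_DELALLOC_NEW |
                           EXTENT_DEFRAG | EXTENT_DO_ACCOUNTING, &cached);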
Qu Wenruo (6):
btrfs: add skeleton for delayed btrfs bio
btrfs: add delayed ordered extent support
btrfs: introduce the skeleton of delayed bbio endio function
btrfs: introduce compression for delayed bbio
btrfs: implement uncompressed fallback for delayed bbio
btrfs: enable experimental delayed compression support
fs/btrfs/bio.c | 1 +
fs/btrfs/bio.h | 3 +
fs/btrfs/btrfs_inode.h | 3 +
fs/btrfs/defrag.c | 26 ++-
fs/btrfs/extent_io.c | 34 ++-
fs/btrfs/extent_map.h | 9 +-
fs/btrfs/inode.c | 493 +++++++++++++++++++++++++++++++++++++++-
fs/btrfs/ordered-data.c | 178 +++++++++++----
fs/btrfs/ordered-data.h | 14 ++
9 files changed, 703 insertions(+), 58 deletions(-)
--
2.54.0
* [PATCH v2 1/6] btrfs: add skeleton for delayed btrfs bio
From: Qu Wenruo @ 2026-05-16 3:45 UTC
To: linux-btrfs
The objective of the new delayed btrfs bio infrastructure is to allow
compressed writes to go through the regular extent_writepage_io() path,
without going through the async submission path.
This will make it easier to align our write path to iomap.
The core ideas of delayed btrfs bio are:
- A placeholder ordered extent created at delalloc time
No space is reserved at that time. This part is not implemented in
this patch.
- A delayed extent map created at delalloc time
It will have a special disk_bytenr (-4) to indicate the range is
delayed, and a new EXTENT_FLAG_DELAYED flag.
(A usage sketch is given at the end of this list.)
- Delayed btrfs bios will be limited to BTRFS_MAX_COMPRESSED size
As only compression will go through delayed btrfs bios.
- Delayed btrfs bios will have the @is_delayed flag set
Such a bio will have 0 as its bi_sector, and will never be submitted
directly through btrfs_submit_bbio().
Currently the submission of a delayed btrfs bio is not implemented
yet, and will be added by later patches.
- Btrfs bio assembly mostly follows the regular path
There are several small exceptions:
* btrfs_bio_is_contig() needs to handle delayed disk_bytenr/bbio
* New bbio needs to have its is_delayed flag set if disk_bytenr
is EXTENT_MAP_DELAYED
- Real ordered extents will be created at bbio submission time
This part is not implemented in this patch.
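The caller only lands in the last patch, but the expected usage of the
new helpers looks roughly like this (simplified from the later
run_delalloc_delayed(), error handling dropped):

    btrfs_lock_extent(&inode->io_tree, cur, cur + cur_len - 1, &cached);
    em = btrfs_create_delayed_em(inode, cur, cur_len);
    btrfs_free_extent_map(em);      /* drop the extra reference */
    /* The delayed ordered extent for the range comes with the next patch. */

    /* Later, extent_writepage_io() sees EXTENT_MAP_DELAYED, builds a bbio
     * with is_delayed set, and submit_one_bio() hands it to
     * btrfs_submit_delayed_write() instead of btrfs_submit_bbio(). */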
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/bio.c | 1 +
fs/btrfs/bio.h | 3 +++
fs/btrfs/btrfs_inode.h | 3 +++
fs/btrfs/extent_io.c | 29 ++++++++++++++++++++++++----
fs/btrfs/extent_map.h | 9 ++++++++-
fs/btrfs/inode.c | 43 +++++++++++++++++++++++++++++++++++++++++-
6 files changed, 82 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index cc0bd03048ba..1d418226ded9 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -908,6 +908,7 @@ void btrfs_submit_bbio(struct btrfs_bio *bbio, int mirror_num)
{
/* If bbio->inode is not populated, its file_offset must be 0. */
ASSERT(bbio->inode || bbio->file_offset == 0);
+ ASSERT(!bbio->is_delayed);
assert_bbio_alignment(bbio);
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index 303ed6c7103d..49ebdc7ce6e6 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -99,6 +99,9 @@ struct btrfs_bio {
/* Whether the bio is written using zone append. */
bool can_use_append:1;
+ /* If the bio is delayed (aka, no backing OE). */
+ bool is_delayed:1;
+
/*
* This member must come last, bio_alloc_bioset will allocate enough
* bytes for entire btrfs_bio but relies on bio being last.
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index d5d81f9546c3..49ac6164f122 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -636,5 +636,8 @@ u64 btrfs_get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
struct extent_map *btrfs_create_io_em(struct btrfs_inode *inode, u64 start,
const struct btrfs_file_extent *file_extent,
int type);
+struct extent_map *btrfs_create_delayed_em(struct btrfs_inode *inode,
+ u64 start, u32 length);
+void btrfs_submit_delayed_write(struct btrfs_bio *bbio);
#endif
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8cf1e4c5105f..7adf8e80ba36 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -181,12 +181,16 @@ static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
/* Caller should ensure the bio has at least some range added */
ASSERT(bbio->bio.bi_iter.bi_size);
-
+ /* Delayed bbio is only for write. */
+ if (bbio->is_delayed)
+ ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
bio_set_csum_search_commit_root(bio_ctrl);
if (btrfs_op(&bbio->bio) == BTRFS_MAP_READ &&
bio_ctrl->compress_type != BTRFS_COMPRESS_NONE)
btrfs_submit_compressed_read(bbio);
+ else if (bbio->is_delayed)
+ btrfs_submit_delayed_write(bbio);
else
btrfs_submit_bbio(bbio, 0);
@@ -709,6 +713,14 @@ static bool btrfs_bio_is_contig(struct btrfs_bio_ctrl *bio_ctrl,
struct bio *bio = &bio_ctrl->bbio->bio;
const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
+ /* One is delayed bbio and one is not, definitely not contig. */
+ if (bio_ctrl->bbio->is_delayed != (disk_bytenr == EXTENT_MAP_DELAYED))
+ return false;
+
+ /* For delayed bbio, only need to check if the file range is contig. */
+ if (bio_ctrl->bbio->is_delayed)
+ return bio_ctrl->next_file_offset == file_offset;
+
if (bio_ctrl->compress_type != BTRFS_COMPRESS_NONE) {
/*
* For compression, all IO should have its logical bytenr set
@@ -734,7 +746,13 @@ static int alloc_new_bio(struct btrfs_inode *inode,
bbio = btrfs_bio_alloc(BIO_MAX_VECS, bio_ctrl->opf, inode,
file_offset, bio_ctrl->end_io_func, NULL);
- bbio->bio.bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
+ if (disk_bytenr == EXTENT_MAP_DELAYED) {
+ bbio->is_delayed = true;
+ bbio->bio.bi_iter.bi_sector = 0;
+ } else {
+ bbio->is_delayed = false;
+ bbio->bio.bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
+ }
bbio->bio.bi_write_hint = inode->vfs_inode.i_write_hint;
bio_ctrl->bbio = bbio;
bio_ctrl->len_to_oe_boundary = U32_MAX;
@@ -761,7 +779,7 @@ static int alloc_new_bio(struct btrfs_inode *inode,
}
bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
ordered->file_offset +
- ordered->disk_num_bytes - file_offset);
+ ordered->num_bytes - file_offset);
bbio->ordered = ordered;
/*
@@ -1722,7 +1740,10 @@ static int submit_one_sector(struct btrfs_inode *inode,
ASSERT(IS_ALIGNED(em->len, sectorsize));
block_start = btrfs_extent_map_block_start(em);
- disk_bytenr = btrfs_extent_map_block_start(em) + extent_offset;
+ if (block_start == EXTENT_MAP_DELAYED)
+ disk_bytenr = block_start;
+ else
+ disk_bytenr = block_start + extent_offset;
ASSERT(!btrfs_extent_map_is_compressed(em));
ASSERT(block_start != EXTENT_MAP_HOLE);
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 6f685f3c9327..2342f2a9a333 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -13,7 +13,8 @@
struct btrfs_inode;
struct btrfs_fs_info;
-#define EXTENT_MAP_LAST_BYTE ((u64)-4)
+#define EXTENT_MAP_LAST_BYTE ((u64)-5)
+#define EXTENT_MAP_DELAYED ((u64)-4)
#define EXTENT_MAP_HOLE ((u64)-3)
#define EXTENT_MAP_INLINE ((u64)-2)
@@ -30,6 +31,12 @@ enum {
ENUM_BIT(EXTENT_FLAG_LOGGING),
/* This em is merged from two or more physically adjacent ems */
ENUM_BIT(EXTENT_FLAG_MERGED),
+ /*
+ * This real on-disk extent allocation is delayed until bio submission.
+ * For now it's only a placeholder with EXTENT_MAP_DELAYED as
+ * its disk_bytenr.
+ */
+ ENUM_BIT(EXTENT_FLAG_DELAYED),
};
/*
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 973a89301baa..4b85ba6ddf48 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7406,10 +7406,51 @@ struct extent_map *btrfs_create_io_em(struct btrfs_inode *inode, u64 start,
return ERR_PTR(ret);
}
- /* em got 2 refs now, callers needs to do btrfs_free_extent_map once. */
+ /* em got 2 refs now, callers need to do btrfs_free_extent_map once. */
return em;
}
+struct extent_map *btrfs_create_delayed_em(struct btrfs_inode *inode,
+ u64 start, u32 length)
+{
+ struct extent_map *em;
+ int ret;
+
+ em = btrfs_alloc_extent_map();
+ if (!em)
+ return ERR_PTR(-ENOMEM);
+
+ em->start = start;
+ em->len = length;
+ em->disk_bytenr = EXTENT_MAP_DELAYED;
+ em->disk_num_bytes = 0;
+ em->ram_bytes = 0;
+ em->generation = -1;
+ em->offset = 0;
+ em->flags = EXTENT_FLAG_DELAYED | EXTENT_FLAG_PINNED;
+
+ ret = btrfs_replace_extent_map_range(inode, em, true);
+ if (ret) {
+ btrfs_free_extent_map(em);
+ return ERR_PTR(ret);
+ }
+
+ /* em got 2 refs now, callers need to do btrfs_free_extent_map once. */
+ return em;
+}
+
+void btrfs_submit_delayed_write(struct btrfs_bio *bbio)
+{
+ ASSERT(bbio->is_delayed);
+
+ /*
+ * Not yet implemented, and should not hit this path as we have no
+ * caller to create delayed extent map.
+ */
+ ASSERT(0);
+ bio_put(&bbio->bio);
+}
+
/*
* For release_folio() and invalidate_folio() we have a race window where
* folio_end_writeback() is called but the subpage spinlock is not yet released.
--
2.54.0
* [PATCH v2 2/6] btrfs: add delayed ordered extent support
From: Qu Wenruo @ 2026-05-16 3:45 UTC
To: linux-btrfs
A delayed ordered extent has the following features:
- A new BTRFS_ORDERED_DELAYED flag
And this new flag must be set along with the BTRFS_ORDERED_REGULAR flag.
- No allocation of any on-disk space
As a delayed ordered extent doesn't take any on-disk space yet, it
won't release any reserved data/meta space either.
- Zero or more real OEs can be added to the parent
If a real OE is allocated, it must be inside the parent OE.
And such a real OE will go through the regular data/meta space
reservation path.
- Child OEs will not be added to the per-inode OE rb-tree or the
per-root list
Only the parent OE is added to the per-inode rb-tree and per-root
list.
So anything waiting for ordered extents should only work on the parent
one.
There is a special corner case for btrfs_wait_ordered_extents(): as
delayed parent OEs have 0 disk_bytenr and disk_num_bytes, they will
be considered out of the [0, U64_MAX] range.
Thus we always have to wait for any delayed OEs of a root, no matter
whether a block group range is given or not.
- When the parent OE finishes, all child OEs will also be finished
And reserved space handling is all done by the child OEs.
- Any range not covered by a child OE will be manually cleaned up
The above features allow us to use the existing ordered extent
interfaces to allocate new real OEs, and wait for them properly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 76 +++++++++++++++++
fs/btrfs/ordered-data.c | 178 ++++++++++++++++++++++++++++++----------
fs/btrfs/ordered-data.h | 14 ++++
3 files changed, 224 insertions(+), 44 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4b85ba6ddf48..6c4fbd0d4845 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3002,6 +3002,79 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
update_inode_bytes, oe->qgroup_rsv);
}
+static int finish_delayed_ordered(struct btrfs_ordered_extent *oe)
+{
+ struct btrfs_inode *inode = oe->inode;
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ struct btrfs_ordered_extent *child;
+ struct btrfs_ordered_extent *tmp;
+ struct extent_state *cached = NULL;
+ const u32 nr_bits = oe->num_bytes >> fs_info->sectorsize_bits;
+ bool io_error = test_bit(BTRFS_ORDERED_IOERR, &oe->flags);
+ u64 cur = oe->file_offset;
+ int ret = 0;
+ int saved_ret = 0;
+
+ /* Finish each child OE. */
+ list_for_each_entry_safe(child, tmp, &oe->child_list, child_list) {
+ list_del_init(&child->child_list);
+ refcount_inc(&child->refs);
+
+ /* The range should have been marked in the bitmap. */
+ ASSERT(bitmap_test_range_all_set(oe->child_bitmap,
+ (child->file_offset - oe->file_offset) >> fs_info->sectorsize_bits,
+ child->num_bytes >> fs_info->sectorsize_bits));
+
+ if (io_error)
+ set_bit(BTRFS_ORDERED_IOERR, &child->flags);
+
+ ret = btrfs_finish_one_ordered(child);
+ if (ret && !saved_ret)
+ saved_ret = ret;
+ }
+
+ /* For ranges that don't have a child OE, manually clean them up. */
+ while (cur < oe->file_offset + oe->num_bytes) {
+ const u32 cur_bit = (cur - oe->file_offset) >> fs_info->sectorsize_bits;
+ u32 first_zero;
+ u32 next_set;
+ u64 range_start;
+ u64 range_end;
+ u32 range_len;
+
+ first_zero = find_next_zero_bit(oe->child_bitmap, nr_bits, cur_bit);
+ if (first_zero >= nr_bits)
+ break;
+ next_set = find_next_bit(oe->child_bitmap, nr_bits, first_zero);
+ ASSERT(next_set > first_zero);
+
+ range_start = oe->file_offset + (first_zero << fs_info->sectorsize_bits);
+ range_len = (next_set - first_zero) << fs_info->sectorsize_bits;
+ range_end = range_start + range_len - 1;
+
+ btrfs_lock_extent(&inode->io_tree, range_start, range_end, &cached);
+ /*
+ * The range has reserved data/metadata but no real OE, thus we have
+ * to manually release them.
+ */
+ btrfs_delalloc_release_space(inode, NULL, range_start, range_len, true);
+ /*
+ * Also need to remove/drop the pinned extent map range.
+ * Here we do not want the extent map to stay, as they do not represent
+ * any real extent map.
+ */
+ btrfs_drop_extent_map_range(inode, range_start, range_end, false);
+ btrfs_clear_extent_bit(&inode->io_tree, range_start, range_end,
+ EXTENT_LOCKED | EXTENT_DELALLOC_NEW | EXTENT_DEFRAG |
+ EXTENT_DO_ACCOUNTING, &cached);
+ cur = range_end + 1;
+ }
+ btrfs_remove_ordered_extent(oe);
+ btrfs_put_ordered_extent(oe);
+ btrfs_put_ordered_extent(oe);
+ return saved_ret;
+}
+
/*
* As ordered data IO finishes, this gets called so we can finish
* an ordered extent if the range of bytes in the file it covers are
@@ -3024,6 +3097,9 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
bool clear_reserved_extent = true;
unsigned int clear_bits = 0;
+ if (test_bit(BTRFS_ORDERED_DELAYED, &ordered_extent->flags))
+ return finish_delayed_ordered(ordered_extent);
+
start = ordered_extent->file_offset;
end = start + ordered_extent->num_bytes - 1;
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index f5f77c33cf59..691dee5334ea 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -155,6 +155,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
u64 qgroup_rsv = 0;
const bool is_nocow = (flags &
((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
+ const bool is_delayed = test_bit(BTRFS_ORDERED_DELAYED, &flags);
/* Only one type flag can be set. */
ASSERT(has_single_bit_set(flags & BTRFS_ORDERED_EXCLUSIVE_FLAGS),
@@ -170,6 +171,17 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
if (test_bit(BTRFS_ORDERED_ENCODED, &flags))
ASSERT(test_bit(BTRFS_ORDERED_COMPRESSED, &flags));
+ /*
+ * DELAYED can only be set with REGULAR, no DIRECT/ENCODED, and should
+ * not exceed BTRFS_MAX_COMPRESSED size.
+ */
+ if (test_bit(BTRFS_ORDERED_DELAYED, &flags)) {
+ ASSERT(test_bit(BTRFS_ORDERED_REGULAR, &flags));
+ ASSERT(!test_bit(BTRFS_ORDERED_DIRECT, &flags));
+ ASSERT(!test_bit(BTRFS_ORDERED_ENCODED, &flags));
+ ASSERT(num_bytes <= BTRFS_MAX_COMPRESSED);
+ }
+
/*
* For a NOCOW write we can free the qgroup reserve right now. For a COW
* one we transfer the reserved space from the inode's iotree into the
@@ -178,13 +190,16 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
* completing the ordered extent, when running the data delayed ref it
* creates, we free the reserved data with btrfs_qgroup_free_refroot().
*/
- if (is_nocow)
- ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
- else
- ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes, &qgroup_rsv);
-
- if (ret < 0)
- return ERR_PTR(ret);
+ if (!is_delayed) {
+ if (is_nocow)
+ ret = btrfs_qgroup_free_data(inode, NULL, file_offset,
+ num_bytes, &qgroup_rsv);
+ else
+ ret = btrfs_qgroup_release_data(inode, file_offset,
+ num_bytes, &qgroup_rsv);
+ if (ret < 0)
+ return ERR_PTR(ret);
+ }
entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
if (!entry) {
@@ -216,19 +231,23 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
INIT_LIST_HEAD(&entry->root_extent_list);
INIT_LIST_HEAD(&entry->work_list);
INIT_LIST_HEAD(&entry->bioc_list);
+ INIT_LIST_HEAD(&entry->child_list);
init_completion(&entry->completion);
+ RB_CLEAR_NODE(&entry->rb_node);
/*
* We don't need the count_max_extents here, we can assume that all of
* that work has been done at higher layers, so this is truly the
* smallest the extent is going to get.
*/
- spin_lock(&inode->lock);
- btrfs_mod_outstanding_extents(inode, 1);
- spin_unlock(&inode->lock);
+ if (!is_delayed) {
+ spin_lock(&inode->lock);
+ btrfs_mod_outstanding_extents(inode, 1);
+ spin_unlock(&inode->lock);
+ }
out:
- if (IS_ERR(entry) && !is_nocow)
+ if (IS_ERR(entry) && !is_nocow && !is_delayed)
btrfs_qgroup_free_refroot(inode->root->fs_info,
btrfs_root_id(inode->root),
qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
@@ -236,12 +255,43 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
return entry;
}
+static void add_child_oe(struct btrfs_ordered_extent *parent,
+ struct btrfs_ordered_extent *child)
+{
+ struct btrfs_inode *inode = parent->inode;
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ const u32 start_bit = (child->file_offset - parent->file_offset) >>
+ fs_info->sectorsize_bits;
+ const u32 nr_bits = child->num_bytes >> fs_info->sectorsize_bits;
+
+ lockdep_assert_held(&inode->ordered_tree_lock);
+ /* Basic flags check for parent and child. */
+ ASSERT(test_bit(BTRFS_ORDERED_DELAYED, &parent->flags));
+ ASSERT(!test_bit(BTRFS_ORDERED_DELAYED, &child->flags));
+
+ /* Child should not belong to any parent yet. */
+ ASSERT(list_empty(&child->child_list));
+
+ /* Child should be fully inside parent's range. */
+ ASSERT(child->file_offset >= parent->file_offset);
+ ASSERT(child->file_offset + child->num_bytes <=
+ parent->file_offset + parent->num_bytes);
+
+ /* There should be no existing child in the range. */
+ ASSERT(bitmap_test_range_all_zero(parent->child_bitmap, start_bit, nr_bits));
+
+ list_add_tail(&child->child_list, &parent->child_list);
+
+ bitmap_set(parent->child_bitmap, start_bit, nr_bits);
+}
+
static void insert_ordered_extent(struct btrfs_ordered_extent *entry)
{
struct btrfs_inode *inode = entry->inode;
struct btrfs_root *root = inode->root;
struct btrfs_fs_info *fs_info = root->fs_info;
struct rb_node *node;
+ bool is_child = false;
trace_btrfs_ordered_extent_add(inode, entry);
@@ -254,17 +304,25 @@ static void insert_ordered_extent(struct btrfs_ordered_extent *entry)
spin_lock(&inode->ordered_tree_lock);
node = tree_insert(&inode->ordered_tree, entry->file_offset,
&entry->rb_node);
- if (unlikely(node)) {
+ if (node) {
struct btrfs_ordered_extent *exist =
rb_entry(node, struct btrfs_ordered_extent, rb_node);
- btrfs_panic(fs_info, -EEXIST,
+ if (test_bit(BTRFS_ORDERED_DELAYED, &exist->flags)) {
+ add_child_oe(exist, entry);
+ is_child = true;
+ } else {
+ btrfs_panic(fs_info, -EEXIST,
"overlapping ordered extents, existing oe file_offset %llu num_bytes %llu flags 0x%lx, new oe file_offset %llu num_bytes %llu flags 0x%lx",
exist->file_offset, exist->num_bytes, exist->flags,
entry->file_offset, entry->num_bytes, entry->flags);
+ }
}
spin_unlock(&inode->ordered_tree_lock);
+ /* Child OE shouldn't be added to per-root oe list. */
+ if (is_child)
+ return;
spin_lock(&root->ordered_extent_lock);
list_add_tail(&entry->root_extent_list,
&root->ordered_extents);
@@ -337,6 +395,20 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
return entry;
}
+struct btrfs_ordered_extent *btrfs_alloc_delayed_ordered_extent(
+ struct btrfs_inode *inode, u64 file_offset, u32 length)
+{
+ struct btrfs_ordered_extent *entry;
+
+ entry = alloc_ordered_extent(inode, file_offset, length, length, 0, 0, 0,
+ (1UL << BTRFS_ORDERED_REGULAR) |
+ (1UL << BTRFS_ORDERED_DELAYED),
+ BTRFS_COMPRESS_NONE);
+ if (!IS_ERR(entry))
+ insert_ordered_extent(entry);
+ return entry;
+}
+
/*
* Add a struct btrfs_ordered_sum into the list of checksums to be inserted
* when an ordered extent is finished. If the list covers more than one
@@ -644,8 +716,9 @@ void btrfs_remove_ordered_extent(struct btrfs_ordered_extent *entry)
struct btrfs_root *root = btrfs_inode->root;
struct btrfs_fs_info *fs_info = root->fs_info;
struct rb_node *node;
- bool pending;
+ bool pending = false;
bool freespace_inode;
+ const bool is_delayed = test_bit(BTRFS_ORDERED_DELAYED, &entry->flags);
/*
* If this is a free space inode the thread has not acquired the ordered
@@ -654,33 +727,37 @@ void btrfs_remove_ordered_extent(struct btrfs_ordered_extent *entry)
freespace_inode = btrfs_is_free_space_inode(btrfs_inode);
btrfs_lockdep_acquire(fs_info, btrfs_trans_pending_ordered);
- /* This is paired with alloc_ordered_extent(). */
- spin_lock(&btrfs_inode->lock);
- btrfs_mod_outstanding_extents(btrfs_inode, -1);
- spin_unlock(&btrfs_inode->lock);
- if (root != fs_info->tree_root) {
- u64 release;
+ if (!is_delayed) {
+ /* This is paired with alloc_ordered_extent(). */
+ spin_lock(&btrfs_inode->lock);
+ btrfs_mod_outstanding_extents(btrfs_inode, -1);
+ spin_unlock(&btrfs_inode->lock);
- if (test_bit(BTRFS_ORDERED_ENCODED, &entry->flags))
- release = entry->disk_num_bytes;
- else
- release = entry->num_bytes;
- btrfs_delalloc_release_metadata(btrfs_inode, release,
+ if (root != fs_info->tree_root) {
+ u64 release;
+
+ if (test_bit(BTRFS_ORDERED_ENCODED, &entry->flags))
+ release = entry->disk_num_bytes;
+ else
+ release = entry->num_bytes;
+ btrfs_delalloc_release_metadata(btrfs_inode, release,
test_bit(BTRFS_ORDERED_IOERR,
&entry->flags));
+ }
}
-
percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
fs_info->delalloc_batch);
spin_lock(&btrfs_inode->ordered_tree_lock);
- node = &entry->rb_node;
- rb_erase(node, &btrfs_inode->ordered_tree);
- RB_CLEAR_NODE(node);
- if (btrfs_inode->ordered_tree_last == node)
- btrfs_inode->ordered_tree_last = NULL;
- set_bit(BTRFS_ORDERED_COMPLETE, &entry->flags);
- pending = test_and_clear_bit(BTRFS_ORDERED_PENDING, &entry->flags);
+ if (!RB_EMPTY_NODE(&entry->rb_node)) {
+ node = &entry->rb_node;
+ rb_erase(node, &btrfs_inode->ordered_tree);
+ RB_CLEAR_NODE(node);
+ if (btrfs_inode->ordered_tree_last == node)
+ btrfs_inode->ordered_tree_last = NULL;
+ set_bit(BTRFS_ORDERED_COMPLETE, &entry->flags);
+ pending = test_and_clear_bit(BTRFS_ORDERED_PENDING, &entry->flags);
+ }
spin_unlock(&btrfs_inode->ordered_tree_lock);
/*
@@ -712,17 +789,20 @@ void btrfs_remove_ordered_extent(struct btrfs_ordered_extent *entry)
btrfs_lockdep_release(fs_info, btrfs_trans_pending_ordered);
- spin_lock(&root->ordered_extent_lock);
- list_del_init(&entry->root_extent_list);
- root->nr_ordered_extents--;
-
trace_btrfs_ordered_extent_remove(btrfs_inode, entry);
- if (!root->nr_ordered_extents) {
- spin_lock(&fs_info->ordered_root_lock);
- BUG_ON(list_empty(&root->ordered_root));
- list_del_init(&root->ordered_root);
- spin_unlock(&fs_info->ordered_root_lock);
+ spin_lock(&root->ordered_extent_lock);
+ /* For child OEs, they are not added to per-root OEs. */
+ if (!list_empty(&entry->root_extent_list)) {
+ list_del_init(&entry->root_extent_list);
+ root->nr_ordered_extents--;
+
+ if (!root->nr_ordered_extents) {
+ spin_lock(&fs_info->ordered_root_lock);
+ BUG_ON(list_empty(&root->ordered_root));
+ list_del_init(&root->ordered_root);
+ spin_unlock(&fs_info->ordered_root_lock);
+ }
}
spin_unlock(&root->ordered_extent_lock);
wake_up(&entry->wait);
@@ -771,8 +851,18 @@ u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
ordered = list_first_entry(&splice, struct btrfs_ordered_extent,
root_extent_list);
- if (range_end <= ordered->disk_bytenr ||
- ordered->disk_bytenr + ordered->disk_num_bytes <= range_start) {
+ /*
+ * Delayed OEs have 0 disk_bytenr and 0 disk_num_bytes, thus
+ * they will be considered out of the [0, U64_MAX) range.
+ * And we do not know where they will really land until the
+ * writeback finished.
+ *
+ * So here we must exclude delayed OEs from the bg range check,
+ * and always wait for them.
+ */
+ if (!test_bit(BTRFS_ORDERED_DELAYED, &ordered->flags) &&
+ (range_end <= ordered->disk_bytenr ||
+ ordered->disk_bytenr + ordered->disk_num_bytes <= range_start)) {
list_move_tail(&ordered->root_extent_list, &skipped);
cond_resched_lock(&root->ordered_extent_lock);
continue;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 03e12380a2fd..7d959c439e99 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -13,6 +13,7 @@
#include <linux/rbtree.h>
#include <linux/wait.h>
#include "async-thread.h"
+#include "compression.h"
struct inode;
struct page;
@@ -87,6 +88,12 @@ enum {
*/
BTRFS_ORDERED_DIRECT,
+ /*
+ * Extra bit for delayed OE, can only be set for REGULAR.
+ * Can not be set with COMPRESSED/ENCODED/DIRECT.
+ */
+ BTRFS_ORDERED_DELAYED,
+
BTRFS_ORDERED_NR_FLAGS,
};
static_assert(BTRFS_ORDERED_NR_FLAGS <= BITS_PER_LONG);
@@ -155,6 +162,11 @@ struct btrfs_ordered_extent {
/* a per root list of all the pending ordered extents */
struct list_head root_extent_list;
+ /* Child ordered extent list for delayed OE. */
+ struct list_head child_list;
+
+ unsigned long child_bitmap[BITS_TO_LONGS(BTRFS_MAX_COMPRESSED / BTRFS_MIN_BLOCKSIZE)];
+
struct btrfs_work work;
struct completion completion;
@@ -192,6 +204,8 @@ struct btrfs_file_extent {
struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
struct btrfs_inode *inode, u64 file_offset,
const struct btrfs_file_extent *file_extent, unsigned long flags);
+struct btrfs_ordered_extent *btrfs_alloc_delayed_ordered_extent(
+ struct btrfs_inode *inode, u64 file_offset, u32 length);
void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
struct btrfs_ordered_sum *sum);
struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
--
2.54.0
* [PATCH v2 3/6] btrfs: introduce the skeleton of delayed bbio endio function
From: Qu Wenruo @ 2026-05-16 3:45 UTC
To: linux-btrfs
A delayed bbio will not be directly submitted, but queued into a
workqueue to perform the compression there.
The compression and the uncompressed fallback are not implemented in
this patch.
Only the main endio function and the helper to queue the workload into
a workqueue are implemented.
The endio function is mostly the same as end_bbio_data_write(), except
for the extra memory allocation/freeing for the bbio->private.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 67 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 63 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6c4fbd0d4845..43e4779a0f27 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -97,6 +97,12 @@ struct data_reloc_warn {
int mirror_num;
};
+struct delayed_bio_private {
+ struct work_struct work;
+ struct btrfs_bio *delayed_bbio;
+ atomic_t pending_ios;
+};
+
/*
* For the file_extent_tree, we want to hold the inode lock when we lookup and
* update the disk_i_size, but lockdep will complain because our io_tree we hold
@@ -7515,18 +7521,71 @@ struct extent_map *btrfs_create_delayed_em(struct btrfs_inode *inode,
return em;
}
-void btrfs_submit_delayed_write(struct btrfs_bio *bbio)
+static void run_delayed_bbio(struct work_struct *work)
{
- ASSERT(bbio->is_delayed);
+ struct delayed_bio_private *dbp = container_of(work, struct delayed_bio_private, work);
+ struct btrfs_bio *parent = dbp->delayed_bbio;
/*
- * Not yet implemented, and should not hit this path as we have no
- * caller to create delayed extent map.
+ * Increase the pending_ios so that parent bbio won't end
+ * until all child ones are submitted.
*/
+ atomic_inc(&dbp->pending_ios);
+ /* Compressed and uncompressed fallback is not yet implemented. */
ASSERT(0);
+ if (atomic_dec_and_test(&dbp->pending_ios))
+ btrfs_bio_end_io(parent, parent->status);
+}
+
+static void end_bbio_delayed(struct btrfs_bio *bbio)
+{
+ struct delayed_bio_private *dbp = bbio->private;
+ struct btrfs_inode *inode = bbio->inode;
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ struct folio_iter fi;
+ const u32 bio_size = bio_get_size(&bbio->bio);
+ const bool uptodate = bbio->status == BLK_STS_OK;
+
+ ASSERT(bbio->is_delayed);
+
+ bio_for_each_folio_all(fi, &bbio->bio) {
+ u64 start = folio_pos(fi.folio) + fi.offset;
+ u32 len = fi.length;
+
+ btrfs_folio_clear_writeback(fs_info, fi.folio, start, len);
+ }
+ btrfs_mark_ordered_io_finished(inode, bbio->file_offset, bio_size, uptodate);
+ kfree(dbp);
bio_put(&bbio->bio);
}
+void btrfs_submit_delayed_write(struct btrfs_bio *bbio)
+{
+ struct delayed_bio_private *dbp;
+
+ ASSERT(bbio->is_delayed);
+
+ bbio->end_io = end_bbio_delayed;
+ dbp = kzalloc(sizeof(struct delayed_bio_private), GFP_NOFS);
+ if (!dbp) {
+ btrfs_bio_end_io(bbio, errno_to_blk_status(-ENOMEM));
+ return;
+ }
+ atomic_set(&dbp->pending_ios, 0);
+ dbp->delayed_bbio = bbio;
+ bbio->private = dbp;
+ /*
+ * TODO: find a way to properly allow sequential extent allocation.
+ *
+ * The existing btrfs async workqueue will execute the sequential workload
+ * twice, the second one to free the structure.
+ * But our current submission path can only be called once, after that
+ * the bbio will be gone thus can not afford to use btrfs async workqueue.
+ */
+ INIT_WORK(&dbp->work, run_delayed_bbio);
+ schedule_work(&dbp->work);
+}
+
/*
* For release_folio() and invalidate_folio() we have a race window where
* folio_end_writeback() is called but the subpage spinlock is not yet released.
--
2.54.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v2 4/6] btrfs: introduce compression for delayed bbio
From: Qu Wenruo @ 2026-05-16 3:45 UTC
To: linux-btrfs
The compressed write path inside a delayed bbio is mostly the same as
regular compression, but with some differences:
- The error handling should not touch folio flags
It will be handled by the parent delayed bbio.
And those folios already have the WRITEBACK flag set, not the LOCKED
flag of the async submission path.
- A successful compression will lead to a child compressed bio
That compressed bio will be properly submitted, and if there are no
more pending IOs of the delayed bbio, the delayed bbio is ended.
There is a minor note: since we're going through the regular
extent_writepage_io() path, we can have multiple bbios for the same
delayed ordered extent.
This means we may get a slightly lower compression ratio if, for
whatever reason, the writeback path chooses to submit a smaller bio.
- No sequential execution of data extent reservation
The existing async thread has one quirk related to the ordered
function execution, which is not suitable for this call site.
After the compressed bio is submitted, we can no longer touch the
child compressed bio (it could finish immediately and also finish the
parent delayed bbio).
Meanwhile the async ordered function needs different entries to handle
the workload and free the involved structures; here the lifetime is
instead tracked by a plain pending IO counter (see the sketch below).
These will be the major changes compared to the existing compressed
write.
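A simplified sketch of the pending_ios pattern (introduced in the
previous patch and used by this one and the next; not a literal copy
of the diff below):

    atomic_inc(&dbp->pending_ios);          /* held by the work item itself */

    /* For every child bio (compressed, or uncompressed fallback): */
    atomic_inc(&dbp->pending_ios);
    btrfs_submit_bbio(child, 0);            /* may finish at any time now */

    /* In every child endio, and once at the end of the work item: */
    if (atomic_dec_and_test(&dbp->pending_ios))
            btrfs_bio_end_io(parent, parent->status);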
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 109 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 43e4779a0f27..10c060738067 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7521,6 +7521,109 @@ struct extent_map *btrfs_create_delayed_em(struct btrfs_inode *inode,
return em;
}
+static void end_bbio_delayed_compressed(struct btrfs_bio *bbio)
+{
+ struct delayed_bio_private *dbp = bbio->private;
+ struct btrfs_bio *parent = dbp->delayed_bbio;
+ struct folio_iter fi;
+
+ bio_for_each_folio_all(fi, &bbio->bio)
+ btrfs_free_compr_folio(fi.folio);
+ cmpxchg(&parent->status, BLK_STS_OK, bbio->status);
+ if (atomic_dec_and_test(&dbp->pending_ios))
+ btrfs_bio_end_io(parent, parent->status);
+ bio_put(&bbio->bio);
+}
+
+static bool try_submit_compressed(struct btrfs_bio *parent)
+{
+ struct delayed_bio_private *dbp = parent->private;
+ struct btrfs_bio *bbio = dbp->delayed_bbio;
+ struct btrfs_inode *inode = bbio->inode;
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ struct btrfs_key ins;
+ struct compressed_bio *cb;
+ struct extent_state *cached = NULL;
+ struct extent_map *em;
+ struct btrfs_ordered_extent *ordered;
+ struct btrfs_file_extent file_extent;
+ u64 alloc_hint;
+ const u32 len = bio_get_size(&bbio->bio);
+ const u64 fileoff = bbio->file_offset;
+ const u64 end = fileoff + len - 1;
+ u32 compressed_size;
+ int compress_type = fs_info->compress_type;
+ int compress_level = fs_info->compress_level;
+ int ret;
+
+ if (!btrfs_inode_can_compress(inode) ||
+ !inode_need_compress(inode, fileoff, end, false))
+ return false;
+
+ if (inode->defrag_compress > 0 &&
+ inode->defrag_compress < BTRFS_NR_COMPRESS_TYPES) {
+ compress_type = inode->defrag_compress;
+ compress_level = inode->defrag_compress_level;
+ } else if (inode->prop_compress) {
+ compress_type = inode->prop_compress;
+ }
+ cb = btrfs_compress_bio(inode, fileoff, len, compress_type,
+ compress_level, 0);
+ if (IS_ERR(cb))
+ return false;
+
+ round_up_last_block(cb, fs_info->sectorsize);
+ compressed_size = cb->bbio.bio.bi_iter.bi_size;
+
+ alloc_hint = btrfs_get_extent_allocation_hint(inode, fileoff, len);
+ ret = btrfs_reserve_extent(inode->root, len,
+ compressed_size, compressed_size,
+ 0, alloc_hint, &ins, true, true);
+ if (ret < 0) {
+ cleanup_compressed_bio(cb);
+ return false;
+ }
+ btrfs_lock_extent(&inode->io_tree, fileoff, end, &cached);
+ file_extent.disk_bytenr = ins.objectid;
+ file_extent.disk_num_bytes = ins.offset;
+ file_extent.ram_bytes = len;
+ file_extent.num_bytes = len;
+ file_extent.offset = 0;
+ file_extent.compression = cb->compress_type;
+
+ cb->bbio.bio.bi_iter.bi_sector = ins.objectid >> SECTOR_SHIFT;
+ em = btrfs_create_io_em(inode, fileoff, &file_extent, BTRFS_ORDERED_COMPRESSED);
+ if (IS_ERR(em)) {
+ ret = PTR_ERR(em);
+ goto out_free_reserve;
+ }
+ btrfs_free_extent_map(em);
+
+ ordered = btrfs_alloc_ordered_extent(inode, fileoff, &file_extent,
+ 1U << BTRFS_ORDERED_COMPRESSED);
+ if (IS_ERR(ordered)) {
+ btrfs_drop_extent_map_range(inode, fileoff, end, false);
+ ret = PTR_ERR(ordered);
+ goto out_free_reserve;
+ }
+ cb->bbio.ordered = ordered;
+ btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+ btrfs_unlock_extent(&inode->io_tree, fileoff, end, &cached);
+
+ cb->bbio.end_io = end_bbio_delayed_compressed;
+ cb->bbio.private = dbp;
+ atomic_inc(&dbp->pending_ios);
+ btrfs_submit_bbio(&cb->bbio, 0);
+ return true;
+
+out_free_reserve:
+ btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+ btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, true);
+ btrfs_unlock_extent(&inode->io_tree, fileoff, end, &cached);
+ cleanup_compressed_bio(cb);
+ return false;
+}
+
static void run_delayed_bbio(struct work_struct *work)
{
struct delayed_bio_private *dbp = container_of(work, struct delayed_bio_private, work);
@@ -7531,8 +7634,13 @@ static void run_delayed_bbio(struct work_struct *work)
* until all child ones are submitted.
*/
atomic_inc(&dbp->pending_ios);
- /* Compressed and uncompressed fallback is not yet implemented. */
+ if (try_submit_compressed(parent))
+ goto finish;
+
+ /* Uncompressed fallback is not yet implemented. */
ASSERT(0);
+
+finish:
if (atomic_dec_and_test(&dbp->pending_ios))
btrfs_bio_end_io(parent, parent->status);
}
--
2.54.0
* [PATCH v2 5/6] btrfs: implement uncompressed fallback for delayed bbio
From: Qu Wenruo @ 2026-05-16 3:45 UTC
To: linux-btrfs
When the compression fails (either due to a bad compression ratio,
fragmented free space, or the writeback path choosing to submit the bio
early), we have to fall back to uncompressed writes.
The uncompressed fallback is mostly the same as cow_file_range() but
with some changes:
- The endio function is slightly different from the compressed path
Only in the folio freeing handling.
- Uncompressed fallback error handling
Since at this stage the folios already have the WRITEBACK flag set, we
do not need to do the usual page unlock/end writeback, but just free
the reserved space and call it a day.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 150 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 148 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 10c060738067..75376c2ef665 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7624,10 +7624,143 @@ static bool try_submit_compressed(struct btrfs_bio *parent)
return false;
}
+static void end_bbio_delayed_uncompressed(struct btrfs_bio *bbio)
+{
+ struct delayed_bio_private *dbp = bbio->private;
+ struct btrfs_bio *parent = dbp->delayed_bbio;
+ struct folio_iter fi;
+
+ bio_for_each_folio_all(fi, &bbio->bio)
+ folio_put(fi.folio);
+ cmpxchg(&parent->status, BLK_STS_OK, bbio->status);
+ if (atomic_dec_and_test(&dbp->pending_ios))
+ btrfs_bio_end_io(parent, parent->status);
+ bio_put(&bbio->bio);
+}
+
+static struct btrfs_bio *child_bbio_from_page_cache(struct btrfs_bio *parent,
+ u64 fileoff, u32 len)
+{
+ struct btrfs_inode *inode = parent->inode;
+ struct address_space *mapping = inode->vfs_inode.i_mapping;
+ struct btrfs_bio *bbio;
+ struct folio_iter fi;
+ u64 cur = fileoff;
+ int ret;
+
+ bbio = btrfs_bio_alloc(round_up(len, PAGE_SIZE) >> PAGE_SHIFT, REQ_OP_WRITE,
+ inode, fileoff, end_bbio_delayed_uncompressed,
+ parent->private);
+
+ while (cur < fileoff + len) {
+ struct folio *folio;
+ u32 cur_len;
+
+ folio = filemap_get_folio(mapping, cur >> PAGE_SHIFT);
+ if (IS_ERR(folio)) {
+ ret = PTR_ERR(folio);
+ goto error;
+ }
+ cur_len = min_t(u64, folio_next_pos(folio), fileoff + len) - cur;
+ ret = bio_add_folio(&bbio->bio, folio, cur_len,
+ offset_in_folio(folio, cur));
+ ASSERT(ret);
+ cur += cur_len;
+ }
+
+ return bbio;
+error:
+ bio_for_each_folio_all(fi, &bbio->bio)
+ folio_put(fi.folio);
+ bio_put(&bbio->bio);
+ return ERR_PTR(ret);
+}
+
+static int submit_one_uncompressed_range(struct btrfs_bio *parent, struct btrfs_key *ins,
+ struct extent_state **cached, u64 file_offset,
+ u32 num_bytes, u64 alloc_hint, u32 *ret_alloc_size)
+{
+ struct btrfs_inode *inode = parent->inode;
+ struct delayed_bio_private *dbp = parent->private;
+ struct btrfs_root *root = inode->root;
+ struct btrfs_fs_info *fs_info = root->fs_info;
+ struct btrfs_ordered_extent *ordered;
+ struct btrfs_file_extent file_extent;
+ struct btrfs_bio *child;
+ struct extent_map *em;
+ u64 cur_end;
+ u32 cur_len = 0;
+ int ret;
+
+ ret = btrfs_reserve_extent(root, num_bytes, num_bytes, fs_info->sectorsize,
+ 0, alloc_hint, ins, true, true);
+ if (ret < 0)
+ return ret;
+
+ cur_len = ins->offset;
+ cur_end = file_offset + cur_len - 1;
+
+ file_extent.disk_bytenr = ins->objectid;
+ file_extent.disk_num_bytes = ins->offset;
+ file_extent.num_bytes = ins->offset;
+ file_extent.ram_bytes = ins->offset;
+ file_extent.offset = 0;
+ file_extent.compression = BTRFS_COMPRESS_NONE;
+
+ btrfs_lock_extent(&inode->io_tree, file_offset, cur_end, cached);
+ em = btrfs_create_io_em(inode, file_offset, &file_extent, BTRFS_ORDERED_REGULAR);
+ if (IS_ERR(em)) {
+ ret = PTR_ERR(em);
+ btrfs_unlock_extent(&inode->io_tree, file_offset, cur_end, cached);
+ goto free_reserved;
+ }
+ btrfs_free_extent_map(em);
+ ordered = btrfs_alloc_ordered_extent(inode, file_offset, &file_extent,
+ 1U << BTRFS_ORDERED_REGULAR);
+ if (IS_ERR(ordered)) {
+ btrfs_drop_extent_map_range(inode, file_offset, cur_end, false);
+ btrfs_unlock_extent(&inode->io_tree, file_offset, cur_end, cached);
+ ret = PTR_ERR(ordered);
+ goto free_reserved;
+ }
+ btrfs_dec_block_group_reservations(fs_info, ins->objectid);
+ btrfs_unlock_extent(&inode->io_tree, file_offset, cur_end, cached);
+ child = child_bbio_from_page_cache(parent, file_offset, cur_len);
+ if (IS_ERR(child)) {
+ btrfs_put_ordered_extent(ordered);
+ btrfs_drop_extent_map_range(inode, file_offset, cur_end, false);
+ ret = PTR_ERR(child);
+ goto free_reserved;
+ }
+ child->ordered = ordered;
+ child->private = parent->private;
+ child->end_io = end_bbio_delayed_uncompressed;
+ child->bio.bi_iter.bi_sector = ins->objectid >> SECTOR_SHIFT;
+ atomic_inc(&dbp->pending_ios);
+ btrfs_submit_bbio(child, 0);
+ *ret_alloc_size = cur_len;
+ return 0;
+
+free_reserved:
+ btrfs_qgroup_free_data(inode, NULL, file_offset, cur_len, NULL);
+ btrfs_dec_block_group_reservations(fs_info, ins->objectid);
+ btrfs_free_reserved_extent(fs_info, ins->objectid, ins->offset, true);
+ ASSERT(ret != -EAGAIN);
+ return ret;
+}
+
static void run_delayed_bbio(struct work_struct *work)
{
struct delayed_bio_private *dbp = container_of(work, struct delayed_bio_private, work);
struct btrfs_bio *parent = dbp->delayed_bbio;
+ struct btrfs_key ins;
+ struct extent_state *cached = NULL;
+ const u32 uncompressed_size = bio_get_size(&parent->bio);
+ const u64 start = parent->file_offset;
+ const u64 end = start + uncompressed_size - 1;
+ u64 cur = start;
+ u64 alloc_hint;
+ int ret = 0;
/*
* Increase the pending_ios so that parent bbio won't end
@@ -7637,8 +7770,21 @@ static void run_delayed_bbio(struct work_struct *work)
if (try_submit_compressed(parent))
goto finish;
- /* Uncompressed fallback is not yet implemented. */
- ASSERT(0);
+ alloc_hint = btrfs_get_extent_allocation_hint(parent->inode, start,
+ uncompressed_size);
+ while (cur < end) {
+ u32 cur_len;
+
+ ret = submit_one_uncompressed_range(parent, &ins, &cached,
+ cur, end + 1 - cur,
+ alloc_hint, &cur_len);
+ if (ret < 0) {
+ cmpxchg(&parent->status, BLK_STS_OK, errno_to_blk_status(ret));
+ goto finish;
+ }
+ cur += cur_len;
+ alloc_hint += cur_len;
+ }
finish:
if (atomic_dec_and_test(&dbp->pending_ios))
--
2.54.0
* [PATCH v2 6/6] btrfs: enable experimental delayed compression support
From: Qu Wenruo @ 2026-05-16 3:45 UTC
To: linux-btrfs
Instead of the existing async submission path, the new delayed bbio will
handle compressed writes by:
- Allocating delayed em/oe at run_delalloc_*() time
Thus no data extent is reserved at that time.
- Delayed bbio will be assembled at extent_writepage_io() time
- Delayed bbio will be intercepted just before submission
Which will run compression (or fall back to uncompressed writes) in a
workqueue.
Data extents will only be reserved at that time, and the delayed em
will be replaced by a real one.
Meanwhile the real OE will be added as a child of the parent delayed
OE, and when the parent OE finishes, the child OEs will be finished
with their file extents inserted.
This has some benefits:
- Higher concurrency
Previously the async submission held the folio and io tree range
locked, which means we could not even read an uptodate folio.
Furthermore, although the compressed write is queued into a workqueue
for submission and extent_writepage_io() will skip the compressed
range, when we need to write the next folio of the compressed range,
we will need to wait for the folio to be unlocked.
This makes async submission less async.
- Future DONTCACHE writes support
We do not support DONTCACHE because that feature requires the
writeback path to clear the folio dirty flag and submit the folios
sequentially.
Meanwhile async submission makes the writeback async, breaking the
sequential submission requirement.
This is also why we need complex per-block tracking for the writeback
flags, while iomap only requires counter-based tracking.
With the new delayed compression, the lifespan of a folio aligns with
DONTCACHE and iomap.
There is extra handling for defrag, where we have to write and wait
for the defrag range.
This is to avoid clearing inode->defrag_compress before the delayed
compression has started.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/defrag.c | 26 ++++++++++++++++---
fs/btrfs/extent_io.c | 5 +++-
fs/btrfs/inode.c | 61 +++++++++++++++++++++++++++++++++++++++++---
3 files changed, 84 insertions(+), 8 deletions(-)
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index f0c6758b7055..092693cbb79e 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -1342,6 +1342,7 @@ int btrfs_defrag_file(struct btrfs_inode *inode, struct file_ra_state *ra,
struct btrfs_fs_info *fs_info = inode->root->fs_info;
unsigned long sectors_defragged = 0;
u64 isize = i_size_read(&inode->vfs_inode);
+ const u64 start = round_down(range->start, fs_info->sectorsize);
u64 cur;
u64 last_byte;
bool do_compress = (range->flags & BTRFS_DEFRAG_RANGE_COMPRESS);
@@ -1393,7 +1394,7 @@ int btrfs_defrag_file(struct btrfs_inode *inode, struct file_ra_state *ra,
}
/* Align the range */
- cur = round_down(range->start, fs_info->sectorsize);
+ cur = start;
last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
/*
@@ -1464,10 +1465,27 @@ int btrfs_defrag_file(struct btrfs_inode *inode, struct file_ra_state *ra,
* need to be written back immediately.
*/
if (range->flags & BTRFS_DEFRAG_RANGE_START_IO) {
- filemap_flush(inode->vfs_inode.i_mapping);
- if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
- &inode->runtime_flags))
+ /*
+ * For experimental delayed writeback, we must wait
+ * for the range to be fully written back before
+ * clearing inode->defrag_compress.
+ *
+ * Regular filemap_flush() will only start writeback,
+ * which will only create delayed OEs. But the real
+ * compression is happening later.
+ * This means if we just flush but not wait for writeback,
+ * the inode->defrag_compress clearing can race with
+ * compression, causing the defrag algorithm not reflected.
+ */
+ if (IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL)) {
+ filemap_write_and_wait_range(inode->vfs_inode.i_mapping,
+ start, last_byte);
+ } else {
filemap_flush(inode->vfs_inode.i_mapping);
+ if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
+ &inode->runtime_flags))
+ filemap_flush(inode->vfs_inode.i_mapping);
+ }
}
if (range->compress_type == BTRFS_COMPRESS_LZO)
btrfs_set_fs_incompat(fs_info, COMPRESS_LZO);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7adf8e80ba36..6ec7682df565 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -899,8 +899,11 @@ static unsigned int submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
* If we have accumulated decent amount of IO, send it to the
* block layer so that IO can run while we are accumulating
* more folios to write.
+ *
+ * This doesn't apply to delayed bbio which is going to be
+ * compressed.
*/
- else if (bio_ctrl->wbc &&
+ else if (bio_ctrl->wbc && !bio_ctrl->bbio->is_delayed &&
bio_ctrl->bbio->bio.bi_iter.bi_size >=
inode->root->fs_info->writeback_bio_size)
submit_one_bio(bio_ctrl);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 75376c2ef665..4acf710f7e0a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1654,6 +1654,58 @@ static bool run_delalloc_compressed(struct btrfs_inode *inode,
return true;
}
+static int run_delalloc_delayed(struct btrfs_inode *inode, struct folio *locked_folio,
+ u64 start, u64 end)
+{
+ struct btrfs_root *root = inode->root;
+ struct btrfs_fs_info *fs_info = root->fs_info;
+ struct extent_state *cached = NULL;
+ u64 cur = start;
+ int ret;
+
+ if (btrfs_is_shutdown(fs_info)) {
+ ret = -EIO;
+ goto error;
+ }
+ while (cur < end) {
+ struct extent_map *em;
+ struct btrfs_ordered_extent *oe;
+ u32 cur_len = min_t(u64, end + 1 - cur, BTRFS_MAX_COMPRESSED);
+
+ btrfs_lock_extent(&inode->io_tree, cur, cur + cur_len - 1, &cached);
+ em = btrfs_create_delayed_em(inode, cur, cur_len);
+ if (IS_ERR(em)) {
+ ret = PTR_ERR(em);
+ goto error;
+ }
+ btrfs_free_extent_map(em);
+ oe = btrfs_alloc_delayed_ordered_extent(inode, cur, cur_len);
+ if (IS_ERR(oe)) {
+ btrfs_drop_extent_map_range(inode, cur, cur + cur_len - 1, false);
+ ret = PTR_ERR(oe);
+ goto error;
+ }
+ btrfs_put_ordered_extent(oe);
+
+ cur += cur_len;
+ }
+ extent_clear_unlock_delalloc(inode, start, end, locked_folio, &cached,
+ EXTENT_LOCKED | EXTENT_DELALLOC,
+ PAGE_UNLOCK);
+ return 0;
+error:
+ if (start < cur) {
+ btrfs_drop_extent_map_range(inode, start, cur - 1, false);
+ btrfs_cleanup_ordered_extents(inode, start, cur - start);
+ }
+ /* No range has any extent reserved, just clear them all. */
+ extent_clear_unlock_delalloc(inode, start, end, locked_folio, &cached,
+ EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
+ EXTENT_DEFRAG | EXTENT_DO_ACCOUNTING,
+ PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK);
+ return ret;
+}
+
/*
* Run the delalloc range from start to end, and write back any dirty pages
* covered by the range.
@@ -2427,9 +2479,12 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
return run_delalloc_nocow(inode, locked_folio, start, end);
if (btrfs_inode_can_compress(inode) &&
- inode_need_compress(inode, start, end, false) &&
- run_delalloc_compressed(inode, locked_folio, start, end, wbc))
- return 1;
+ inode_need_compress(inode, start, end, false)) {
+ if (IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
+ return run_delalloc_delayed(inode, locked_folio, start, end);
+ else if (run_delalloc_compressed(inode, locked_folio, start, end, wbc))
+ return 1;
+ }
if (zoned)
return run_delalloc_cow(inode, locked_folio, start, end, wbc, true);
--
2.54.0