From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 0/6] btrfs: delay compression to bbio submission time
Date: Fri, 20 Mar 2026 07:34:44 +1030
Message-ID: <cover.1773953307.git.wqu@suse.com>
[CHANGELOG]
PoC->v1:
- Fix the ordered extent leak caused by incorrect ref count of child OEs
- Fix the reserved space leakage in ranges without a real OE
- Fix the hang caused by incorrect extent lock/unlock pair
  All exposed by fsstress runs

- Fix the OE range check in btrfs_wait_ordered_extents() that affects
  snapshot creation
  All exposed by fstests runs
[BACKGROUND]
Btrfs currently uses async submission for compressed writes. I'll use
the following example to explain it:

The page size and fs block size are both 4K, with no large folios
involved. The dirty ranges are [0, 4K) and [8K, 128K).

    0  4K  8K                                        128K
    |//|   |/////////////////////////////////////////|
- Write back folio 0
  * Delalloc
    writepage_delalloc() finds the delalloc range [0, 4K). Since it can
    not be inlined and is too small for compression, it goes through
    the COW path, thus a new data extent is allocated, with the
    corresponding EM/OE created.
  * Submission
    Folio 0 is added into a bbio, and since we have reached the OE end,
    the bbio is submitted immediately.
- Write back folio 8K
  * Delalloc
    writepage_delalloc() finds the delalloc range [8K, 128K) and goes
    for compression.
    Instead of allocating an extent immediately, it queues the work
    into delalloc_workers.
    Please note that the range [8K, 128K) is completely locked during
    compression.
  * Skip submission
    As the whole folio 8K went through async submission, we skip bbio
    submission.
- Write back folio 12K
  We wait for the folio to be unlocked (after compression is done and
  the compressed bio is submitted).
  When the folio is unlocked, it has its writeback flag set and its
  dirty flag cleared. Thus we either wait for the writeback or skip
  the folio completely.
  This step repeats for the rest of the range [8K, 128K).
AFAIK the async submission is required because we can not submit two
different bbios for a single compressed range.
This differs from the uncompressed write path, where a single ordered
extent can be covered by several different bbios.
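
To make the example above concrete, here is a toy Python model (not
kernel code) of how the dirty blocks split into delalloc ranges and
which path each range takes. The helper names delalloc_ranges() and
classify() are made up for illustration, and the rule "a single-block
range is too small for compression" is an assumption of this sketch,
not a statement of the exact kernel heuristics.

```python
BLOCK = 4096  # page size == fs block size, as in the example


def delalloc_ranges(dirty_blocks):
    """Merge dirty block indexes into contiguous [start, end) byte ranges."""
    ranges = []
    for idx in sorted(dirty_blocks):
        start = idx * BLOCK
        if ranges and ranges[-1][1] == start:
            # Contiguous with the previous range, extend it.
            ranges[-1][1] = start + BLOCK
        else:
            ranges.append([start, start + BLOCK])
    return [tuple(r) for r in ranges]


def classify(start, end):
    # Assumption of this sketch: one block is too small to compress,
    # so it takes the regular COW path.
    return "cow" if end - start <= BLOCK else "async-compress"


# Dirty ranges [0, 4K) and [8K, 128K) expressed as block indexes.
dirty = [0] + list(range(2, 32))
plan = [(s, e, classify(s, e)) for s, e in delalloc_ranges(dirty)]
for start, end, path in plan:
    print(f"[{start // 1024}K, {end // 1024}K) -> {path}")
```

The first range gets its EM/OE and bbio right away, while the second
one is handed to the async workers with the whole range locked.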
[PROBLEMS]
The async submission has the following problems:
- Non-sequential writeback
  Especially when large folios are involved, some blocks can be
  submitted immediately (uncompressed) and some submitted later
  (compressed).
  That breaks the assumption of iomap and DONTCACHE writes that all
  blocks inside a folio are submitted in one go.
- Not really async
  As shown in the example above, we keep the whole range locked during
  compression.
  This means that if we want to read a cached folio in that range, we
  still need to wait for the compression.
[DELAYED COMPRESSION]
The new idea is to delay the compression until bbio submission time.
Now the workflow will be:
- Write back folio 0
  Same as before, it is submitted immediately.
- Write back folio 8K
  * Delalloc
    writepage_delalloc() finds the delalloc range [8K, 128K) and goes
    for compression, but this time we allocate a delayed EM and OE for
    the range [8K, 128K).
  * Submission
    Folio 8K is added into a bbio, with its dirty flag cleared and its
    writeback flag set.
- Write back folios 12K ~ 124K
  * Delalloc
    No new delalloc range.
  * Submission
    Those folios are added to the same bbio as above.
    After the last folio 124K is queued, we have reached the OE end,
    and submit the delayed bbio.
- Delayed bbio submission
  As the bbio has a special @is_delayed flag set, it is not submitted
  directly, but queued into a workqueue for compression.
  * Compression in the workqueue
  * Real delalloc
    Now an on-disk extent is reserved. The real EM replaces the delayed
    one, and the real OE is added as a child of the original delayed
    one.
  * Compressed data submission
  * Delayed bbio finish
    When all child compressed/uncompressed writes have finished, the
    delayed bbio finishes.
    The full delayed OE is also finished, which inserts all of its
    child OEs into the subvolume tree.
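
The delayed workflow above can be sketched as a toy Python model (not
kernel code). The class and function names here (DelayedOE,
DelayedBbio, write_back(), compress_worker()) are hypothetical and
only mirror the steps described in this cover letter; the real data
structures in the patches differ.

```python
from dataclasses import dataclass, field

BLOCK = 4096  # page size == fs block size, as in the example


@dataclass
class DelayedOE:
    start: int
    end: int
    children: list = field(default_factory=list)  # real (child) OE ranges
    pending: int = 0
    finished: bool = False

    def add_child(self, start, end):
        # "Real delalloc": an on-disk extent is reserved for this range.
        self.children.append((start, end))
        self.pending += 1

    def child_done(self):
        self.pending -= 1
        if self.pending == 0:
            # All child writes finished, so the delayed OE finishes too.
            self.finished = True


@dataclass
class DelayedBbio:
    oe: DelayedOE
    is_delayed: bool = True
    folios: list = field(default_factory=list)


def write_back(oe, folio_offsets, workqueue):
    """Add folios to one bbio; at the OE end, queue it instead of submitting."""
    bbio = DelayedBbio(oe)
    for off in folio_offsets:
        bbio.folios.append(off)
        if off + BLOCK == oe.end:
            workqueue.append(bbio)
            bbio = DelayedBbio(oe)


def compress_worker(bbio):
    """Workqueue side: create the child OE covering the bbio's range."""
    start, end = bbio.folios[0], bbio.folios[-1] + BLOCK
    bbio.oe.add_child(start, end)


oe = DelayedOE(8 * 1024, 128 * 1024)
workqueue = []
write_back(oe, range(oe.start, oe.end, BLOCK), workqueue)
for bbio in workqueue:
    compress_worker(bbio)
for _ in list(oe.children):
    oe.child_done()  # each child write completes
```

In this model all 30 folios of [8K, 128K) end up in a single delayed
bbio, and the delayed OE only finishes after its child OE has.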
This solves both problems mentioned above, but is definitely much more
complex than the current async submission:
- Layered OEs
  We need to manage the child/parent OEs properly.
  Still, this brings a minimal amount of change to the existing OE
  users, and keeps the scheme that every block going through
  extent_writepage_io() has a corresponding OE.
- Possible extra split
  Since the delayed OE is allocated first, we can still submit two
  different delayed bbios for the same OE.
  This means we can end up with two smaller compressed extents instead
  of one, which may reduce the compression ratio.
- More complex error handling
  We need to handle cases where some part of the delayed OE has no
  child OE. In that case we need to manually release the reserved
  data/meta space.
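
As an illustration of the error handling point above, finding the
parts of a delayed OE that have no child OE is a simple gap
computation. This is a toy Python sketch (not kernel code);
uncovered_ranges() is a hypothetical helper, and the child ranges in
the example are made up.

```python
def uncovered_ranges(oe_start, oe_end, children):
    """Return the [start, end) ranges of the delayed OE without a child OE."""
    gaps = []
    cursor = oe_start
    for start, end in sorted(children):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < oe_end:
        gaps.append((cursor, oe_end))
    return gaps


K = 1024
# Delayed OE [8K, 128K) where only [8K, 64K) got a real child OE.
gaps = uncovered_ranges(8 * K, 128 * K, [(8 * K, 64 * K)])
for start, end in gaps:
    print(f"release reserved space for [{start // K}K, {end // K}K)")
```

Each returned range is where the reserved data/meta space would have
to be released manually.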
Qu Wenruo (6):
  btrfs: add skeleton for delayed btrfs bio
  btrfs: add delayed ordered extent support
  btrfs: introduce the skeleton of delayed bbio endio function
  btrfs: introduce compression for delayed bbio
  btrfs: implement uncompressed fallback for delayed bbio
  btrfs: enable experimental delayed compression support

 fs/btrfs/bio.c          |   1 +
 fs/btrfs/bio.h          |   3 +
 fs/btrfs/btrfs_inode.h  |   3 +
 fs/btrfs/extent_io.c    |  29 ++-
 fs/btrfs/extent_map.h   |   9 +-
 fs/btrfs/inode.c        | 492 +++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/ordered-data.c | 181 +++++++++++----
 fs/btrfs/ordered-data.h |  14 ++
 8 files changed, 678 insertions(+), 54 deletions(-)

--
2.53.0