From: Boris Burkov <boris@bur.io>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 0/6] btrfs: delay compression to bbio submission time
Date: Wed, 1 Apr 2026 15:52:57 -0700 [thread overview]
Message-ID: <20260401225257.GA826348@zen.localdomain> (raw)
In-Reply-To: <cover.1773953307.git.wqu@suse.com>
On Fri, Mar 20, 2026 at 07:34:44AM +1030, Qu Wenruo wrote:
> [CHANGELOG]
> PoC->v1:
> - Fix the ordered extent leak caused by incorrect ref count of child OEs
> - Fix the reserved space leakage in ranges without a real OE
> - Fix the hang caused by incorrect extent lock/unlock pair
> All exposed by fsstress runs
>
> - Fix the OE range check in btrfs_wait_ordered_extents() that affects
> snapshot creation
> All exposed by fstests runs
>
> [BACKGROUND]
> Btrfs currently goes with async submission for compressed write, I'll go
> the following example to explain the async submission:
>
> The page and fs block sizes are all 4K, no large folio involved.
> The dirty range is [0, 4K), [8K, 128K).
>
> 0 4K 8K 128K
> |//| |/////////////////////////////////////////|
>
> - Write back folio 0
> * Delalloc
> writepage_delalloc() will find the delalloc range [0, 4K), and since
> it can not be inlined and too small for compression, it will be go
> through COW path, thus a new data extent is allocated, with
> corresponding EM/OE created.
>
> * Submission
> That folio 0 will be added into a bbio, and since we reached the OE
> end, the bbio will be submitted immediately.
>
> - Write back folio 8K
> * Delalloc
> writepage_delalloc() find the delalloc range [8K, 128K) and go
> compression.
> Instead of allocating an extent immediately, it queues the work into
> delalloc_workers.
>
> Please note that the range [8K, 128K) is completely locked during
> compression.
>
> * Skip submission
> As the whole folio 8K went through async submission, we skip bbio
> submission.
>
> - Write back folio 12K
> We wait for the folio to be unlocked (after compression is done and
> compressed bio is submitted).
> When the folio is unlocked, the folio will have writeback flag set and
> its dirty flag cleared. Thus we either wait for the writeback or skip
> the folio completely.
>
> This step repeats for the range [8K, 128K).
>
> AFAIK the async submission is required as we can not submit two
> different bbios for a single compressed range.
> Which is different from the uncompressed write path, where we can have
> several different bbios for a single ordered extent.
>
> [PROBLEMS]
> The async submission has the following problems:
>
> - Non-sequential writeback
> Especially when large folios are involved, we can have some blocks
> submitted immediately (uncompressed), and some submitted later
> (compressed).
>
> That breaks the assumption of iomap and DONTCACHE writes, which
> requires all blocks inside a folio to be submitted in one go.
>
> - Not really async
> As the example given above, we keep the whole range locked during
> compression.
> This means if we want to read a cached folio in that range, we still
> need to wait for the compression.
>
> [DELAYED COMPRESSION]
> The new idea is to delay the compression at bbio submission time.
> Now the workflow will be:
>
> - Write back folio 0
> The same, submitting it immediately
>
> - Write back folio 8K
> * Delalloc
> writepage_delalloc() find the delalloc range [8K, 128K) and go
> compression, but this time we allocated delayed EM and OE for the
> range [8K, 128K).
Just for the sake of the description:
I think it's helpful to be more clear that we unlock the delalloc
range here.
>
> * Submission
> That folio 8K will be added into a bbio, with its dirty flag removed
> and writeback flag set.
>
> - Writeback folio 12K ~ 124K
> * Delalloc
> No new delalloc range.
>
> * Submission
> Those folios will be added to the same bbio above.
> And after the last folio 124K is queued, we reached the OE end, and
> will submit the delayed bbio.
Can you explain the importance of the delayed bbio more? Why not just a
work queue item directly?
>
> - Delayed bbio submission
> As the bbio has a special @is_delayed flag set, it will not be
> submitted directly, but queued into a workqueue for compression.
>
> * Compression in the workqueue
> * Real delalloc
> Now an on-disk extent is reserved. The real EM will replace the
> delayed one.
> And the real OE will be added as a child of the original delayed
> one.
> * Compressed data submission
> * Delayed bbio finish
> When all child compressed/uncompressed writes finished, the delayed
> bbio will finish.
>
> The full delayed OE is also finished, which will insert all of its
> child OEs into the subvolume tree.
>
> This solves both the problems mentioned above, but is definitely way
> more complex than the current async submission:
I mentioned a few key questions in line, but I think they how and why of
"this solves both the problems mentioned above" is lacking at a high level.
As *I* understand it, the basic explanation is:
=====OLD=====
WRITEBACK
lock folio F
lock_extent_delalloc_range([F,F+N])
submit async chunk work
folio F+1
block on folio_lock()
ASYNC SUBMIT
do compression
do allocation
create OE
unlock folios (except locked_folio)
submit bio for OE
WRITEBACK
unblock on F+1
no-op
etc... till F+N
=====NEW=====
WRITEBACK
lock folio F
lock_extent_delalloc_range([F,F+N])
create delayed OE
unlock folios except locked_folio
lock folio F+1
add F+1 delayed bbio
etc... till F+N
"submit" delayed bio
ASYNC SUBMIT
do compression
do allocation
create child OE
submit real bio
Is this correct?
If so, the "magic" is in the delayed OE providing enough synchronization
to allow us to unlock the folios right away in the writeback context. So
we are breaking the contract "OE <=> allocated space" to allow this. I
think making the absolute core of the improvement more apparent in the
descriptions would be helpful.
I think one thing I still don't understand is the desire for the layered
bios/OEs instead of creating the same delayed OE, but then as we do the real
allocation/compression and discover the actual ranges doing
btrfs_split_ordered_extent() like short DIO writes, which seems quite
similar. Splitting/joining feels like a much more natural model for
ranges like OEs than layering into a tree. As we discover the sub ranges
we actually use, we split off the real OE.
I actually think this sort of makes sense for non-compressed too, where the
reserved size can be less than the maximum you tried for, just the same.
Probably too early to try to switch over everything, but it could be a
uniformity benefit in the future.
Fundamentally, I think it just feels kind of strange (with the old code
or the new) to have the one folio at a time iteration in writeback that
does lock/unlock on each folio and the delalloc locking/iteration that
we really need to do the right thing with extents. I would love if we
could reconcile them more completely rather than just hacking OEs into
submission (no pun intended :D).
I actually think your model is quite close to what I want, I think it
might just be too hacky/incremental. I am ok with moving forward with it
if we can also envision future next steps for further simplification.
I read the patches but apologies if I missed the answer to some of my
questions in the code.
Thanks,
Boris
>
> - Layered OEs
> And we need to manage the child/parent OEs properly
> But still it brings the minimal amount of changes to the existing OE
> users, and keep the scheme that every block going through
> extent_writepage_io() has a corresponding OE.
>
> - Possible extra split
> Since the delayed OE is allocated first, we can still submit two
> different delayed bbio for the same OE.
>
> This means we can have two smaller compressed extents compared to one,
> which may reduce the compression ratio.
>
> - More complex error handling
> We need to handle cases where some part of the delayed OE has no child
> one. In that case we need to manually release the reserved data/meta
> space.
>
> Qu Wenruo (6):
> btrfs: add skeleton for delayed btrfs bio
> btrfs: add delayed ordered extent support
> btrfs: introduce the skeleton of delayed bbio endio function
> btrfs: introduce compression for delayed bbio
> btrfs: implement uncompressed fallback for delayed bbio
> btrfs: enable experimental delayed compression support
>
> fs/btrfs/bio.c | 1 +
> fs/btrfs/bio.h | 3 +
> fs/btrfs/btrfs_inode.h | 3 +
> fs/btrfs/extent_io.c | 29 ++-
> fs/btrfs/extent_map.h | 9 +-
> fs/btrfs/inode.c | 492 +++++++++++++++++++++++++++++++++++++++-
> fs/btrfs/ordered-data.c | 181 +++++++++++----
> fs/btrfs/ordered-data.h | 14 ++
> 8 files changed, 678 insertions(+), 54 deletions(-)
>
> --
> 2.53.0
>
next prev parent reply other threads:[~2026-04-01 22:52 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-19 21:04 [PATCH 0/6] btrfs: delay compression to bbio submission time Qu Wenruo
2026-03-19 21:04 ` [PATCH 1/6] btrfs: add skeleton for delayed btrfs bio Qu Wenruo
2026-03-19 21:04 ` [PATCH 2/6] btrfs: add delayed ordered extent support Qu Wenruo
2026-03-19 21:04 ` [PATCH 3/6] btrfs: introduce the skeleton of delayed bbio endio function Qu Wenruo
2026-03-19 21:04 ` [PATCH 4/6] btrfs: introduce compression for delayed bbio Qu Wenruo
2026-03-19 21:04 ` [PATCH 5/6] btrfs: implement uncompressed fallback " Qu Wenruo
2026-03-19 21:04 ` [PATCH 6/6] btrfs: enable experimental delayed compression support Qu Wenruo
2026-04-01 22:52 ` Boris Burkov [this message]
2026-04-01 23:29 ` [PATCH 0/6] btrfs: delay compression to bbio submission time Qu Wenruo
2026-04-02 0:15 ` Qu Wenruo
2026-04-02 0:51 ` Boris Burkov
2026-04-02 4:51 ` Qu Wenruo
2026-04-22 6:23 ` Qu Wenruo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260401225257.GA826348@zen.localdomain \
--to=boris@bur.io \
--cc=linux-btrfs@vger.kernel.org \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox