From: Qu Wenruo <wqu@suse.com>
To: Boris Burkov <boris@bur.io>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 0/6] btrfs: delay compression to bbio submission time
Date: Thu, 2 Apr 2026 09:59:46 +1030 [thread overview]
Message-ID: <987d1e01-50d7-4877-b55c-62191b12754d@suse.com> (raw)
In-Reply-To: <20260401225257.GA826348@zen.localdomain>
On 2026/4/2 09:22, Boris Burkov wrote:
> On Fri, Mar 20, 2026 at 07:34:44AM +1030, Qu Wenruo wrote:
[...]
>> - Write back folio 8K
>> * Delalloc
>>     writepage_delalloc() finds the delalloc range [8K, 128K) and goes
>>     down the compression path, but this time we allocate a delayed EM
>>     and OE for the range [8K, 128K).
>
> Just for the sake of the description:
>
> I think it's helpful to be more clear that we unlock the delalloc
> range here.
Sure.
>
>>
>> * Submission
>> That folio 8K will be added into a bbio, with its dirty flag removed
>> and writeback flag set.
>>
>> - Write back folios 12K ~ 124K
>> * Delalloc
>> No new delalloc range.
>>
>> * Submission
>>     Those folios will be added to the same bbio above.
>>     And after the last folio 124K is queued, we reach the OE end and
>>     will submit the delayed bbio.
>
> Can you explain the importance of the delayed bbio more? Why not just a
> work queue item directly?
Which aspect of the importance?
If it's the folio dirty/writeback flags change, workqueue itself won't
solve it, as it will still be the old async submission, causing the
writeback flag to be set seemingly randomly.
If it's something else, the new behavior is almost the same as the COW
path, except for the complex layered OE part after submission.
>
>>
>> - Delayed bbio submission
>> As the bbio has a special @is_delayed flag set, it will not be
>> submitted directly, but queued into a workqueue for compression.
>>
>> * Compression in the workqueue
>> * Real delalloc
>> Now an on-disk extent is reserved. The real EM will replace the
>> delayed one.
>> And the real OE will be added as a child of the original delayed
>> one.
>> * Compressed data submission
>> * Delayed bbio finish
>>     When all child compressed/uncompressed writes have finished, the
>>     delayed bbio will finish.
>>
>> The full delayed OE is also finished, which will insert all of its
>> child OEs into the subvolume tree.
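As an aside, the parent/child OE completion described above can be modeled
roughly with this userspace Python sketch. Everything here is illustrative
only (the class names, fields and the plain list standing in for the
subvolume tree are not the kernel structures):

```python
class ChildOE:
    """A real OE created once an on-disk extent has been reserved."""
    def __init__(self, start, length):
        self.start = start
        self.length = length
        self.done = False

class DelayedOE:
    """Parent OE covering the whole delalloc range; it only finishes
    after every child OE has completed, and only then are the child
    OEs inserted into the (here: simulated) subvolume tree."""
    def __init__(self, start, length):
        self.start = start
        self.length = length
        self.children = []
        self.subvolume_tree = []   # stands in for the on-disk tree

    def add_child(self, start, length):
        child = ChildOE(start, length)
        self.children.append(child)
        return child

    def complete_child(self, child):
        child.done = True
        if all(c.done for c in self.children):
            self.finish()

    def finish(self):
        # Parent finish inserts all child OEs at once.
        self.subvolume_tree = [(c.start, c.length) for c in self.children]

oe = DelayedOE(0x2000, 0x1E000)        # delayed OE for [8K, 128K)
a = oe.add_child(0x2000, 0x10000)      # e.g. a compressed child
b = oe.add_child(0x12000, 0xE000)      # e.g. an uncompressed fallback
oe.complete_child(a)
assert oe.subvolume_tree == []         # parent still pending
oe.complete_child(b)
assert oe.subvolume_tree == [(0x2000, 0x10000), (0x12000, 0xE000)]
```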
>>
>> This solves both the problems mentioned above, but is definitely way
>> more complex than the current async submission:
>
> I mentioned a few key questions inline, but I think the how and why of
> "this solves both the problems mentioned above" is lacking at a high level.
>
> As *I* understand it, the basic explanation is:
> =====OLD=====
> WRITEBACK
> lock folio F
> lock_extent_delalloc_range([F,F+N])
> submit async chunk work
> folio F+1
> block on folio_lock()
>
> ASYNC SUBMIT
> do compression
> do allocation
> create OE
> unlock folios (except locked_folio)
> submit bio for OE
>
> WRITEBACK
> unblock on F+1
> no-op
> etc... till F+N
>
> =====NEW=====
> WRITEBACK
> lock folio F
> lock_extent_delalloc_range([F,F+N])
> create delayed OE
> unlock folios except locked_folio
> lock folio F+1
> add F+1 delayed bbio
> etc... till F+N
> "submit" delayed bio
>
> ASYNC SUBMIT
> do compression
> do allocation
> create child OE
> submit real bio
>
> Is this correct?
Yes.
>
> If so, the "magic" is in the delayed OE providing enough synchronization
> to allow us to unlock the folios right away in the writeback context.
Please remember that for non-compressed writes we already do exactly the
same thing, unlocking folios right away.
So the synchronization part is at least no worse than the regular COW path.
But I agree the current delalloc-time folio lock/unlock is not
straightforward to grasp.
> So
> we are breaking the contract "OE <=> allocated space" to allow this. I
> think making the absolute core of the improvement more apparent in the
> descriptions would be helpful.
>
> I think one thing I still don't understand is the desire for the layered
> bios/OEs instead of creating the same delayed OE and then, as we do the
> real allocation/compression and discover the actual ranges, doing
> btrfs_split_ordered_extent() like short DIO writes do, which seems quite
> similar. Splitting/joining feels like a much more natural model for
> ranges like OEs than layering into a tree. As we discover the sub-ranges
> we actually use, we split off the real OE.
I can definitely work in that direction, although my concerns are the OE
waiting/start part and the error handling.
But so far those are only concerns, I need to implement the code to see
what can go wrong.
And if no major problem is hit, you can see a v2 with the split solution.
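To make the comparison concrete, here is a tiny userspace model of that
split approach. Nothing here is the kernel API; `split_front` is a
hypothetical helper standing in for what btrfs_split_ordered_extent()
would do as each real allocation is discovered:

```python
def split_front(oe, real_len):
    """Split a real OE of real_len bytes off the front of `oe`
    (a (start, length) tuple).  Returns (real_oe, remaining_oe),
    where remaining_oe is None once the delayed OE is used up."""
    start, length = oe
    assert 0 < real_len <= length
    real = (start, real_len)
    rest = (start + real_len, length - real_len)
    return real, (None if rest[1] == 0 else rest)

oe = (8 * 1024, 120 * 1024)              # delayed OE for [8K, 128K)
real1, oe = split_front(oe, 64 * 1024)   # first compressed extent: 64K
real2, oe = split_front(oe, 56 * 1024)   # second extent covers the rest
assert real1 == (8 * 1024, 64 * 1024)
assert real2 == (72 * 1024, 56 * 1024)
assert oe is None                        # nothing left of the delayed OE
```

The appeal of this model is that at any point there is only a flat set of
OEs covering the range, with no parent/child bookkeeping to finish.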
>
> I actually think this sort of makes sense for non-compressed too, where the
> reserved size can be less than the maximum you tried for, just the same.
> Probably too early to try to switch over everything, but it could be a
> uniformity benefit in the future.
The problem is the size of the contiguous range we can get from the page
cache, compared to the max extent size.
Our max data extent size is 128MiB, but one bio can only hold 256 bvecs.
In the worst case none of the page cache is physically contiguous, so a
bio can only hold 1MiB of data with a 4KiB page size, resulting in far
more fragments.
Large folios can help, e.g. a 64KiB IO block size with enough memory can
easily fill a 128MiB bio, but it's indeed a little too early.
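For reference, the arithmetic behind those numbers (256 is the bvec limit
per bio in current kernels, BIO_MAX_VECS; the rest is back-of-the-envelope
math):

```python
BVECS_PER_BIO = 256              # BIO_MAX_VECS: bio_vec slots per bio
MAX_EXTENT = 128 * 1024 * 1024   # btrfs max data extent size, 128MiB

# Worst case: no physically contiguous pages, one 4KiB page per bvec.
worst_case = BVECS_PER_BIO * 4 * 1024
assert worst_case == 1 * 1024 * 1024   # only 1MiB per bio

# To cover a full 128MiB extent with a single bio, each bvec would
# have to describe a contiguous chunk of:
needed = MAX_EXTENT // BVECS_PER_BIO
assert needed == 512 * 1024            # 512KiB per bvec (large folios)
```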
>
> Fundamentally, I think it just feels kind of strange (with the old code
> or the new) to have the one folio at a time iteration in writeback that
> does lock/unlock on each folio and the delalloc locking/iteration that
> we really need to do the right thing with extents. I would love if we
> could reconcile them more completely rather than just hacking OEs into
> submission (no pun intended :D).
IMHO, all the complex delalloc page lock/unlock is just to get a large
enough contiguous range to reduce fragments.
Personally I would prefer to submit one or more bios after locking all
folios at the delalloc stage, which would achieve the same reduction in
fragments.
But I guess a lot of extra writeback-control related accounting would be
needed, and the current folio/tag based writeback makes such early bio
assembly/submission harder.
And with the ultimate objective of iomap integration, submission at
delalloc time can also be very tricky for iomap.
>
> I actually think your model is quite close to what I want, I think it
> might just be too hacky/incremental. I am ok with moving forward with it
> if we can also envision future next steps for further simplification.
>
> I read the patches but apologies if I missed the answer to some of my
> questions in the code.
No problem at all, your review really helps to improve the patchset.
Thanks,
Qu
>
> Thanks,
> Boris
>
>>
>> - Layered OEs
>>   We need to manage the child/parent OEs properly.
>>   But it still brings a minimal amount of change to the existing OE
>>   users, and keeps the scheme that every block going through
>>   extent_writepage_io() has a corresponding OE.
>>
>> - Possible extra split
>>   Since the delayed OE is allocated first, we can still submit two
>>   different delayed bbios for the same OE.
>>
>>   This means we can end up with two smaller compressed extents instead
>>   of one, which may reduce the compression ratio.
>>
>> - More complex error handling
>>   We need to handle cases where some part of the delayed OE has no
>>   child OE. In that case we must manually release the reserved
>>   data/metadata space.
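The gap-finding part of that cleanup is easy to model in userspace. In
this sketch, `uncovered_ranges` is a hypothetical name, and the interval
arithmetic is only an illustration of which ranges would need their
reservations released:

```python
def uncovered_ranges(parent, children):
    """parent: (start, length); children: list of (start, length),
    assumed non-overlapping.  Returns the gaps in the parent range
    not covered by any child, sorted by start."""
    pos = parent[0]
    end = parent[0] + parent[1]
    gaps = []
    for c_start, c_len in sorted(children):
        if c_start > pos:
            gaps.append((pos, c_start - pos))
        pos = max(pos, c_start + c_len)
    if pos < end:
        gaps.append((pos, end - pos))
    return gaps

# Delayed OE [8K, 128K); only [8K, 72K) got a real child OE, e.g.
# because a later allocation failed.  The tail reservation must be
# released manually.
gaps = uncovered_ranges((8 * 1024, 120 * 1024), [(8 * 1024, 64 * 1024)])
assert gaps == [(72 * 1024, 56 * 1024)]
```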
>>
>> Qu Wenruo (6):
>> btrfs: add skeleton for delayed btrfs bio
>> btrfs: add delayed ordered extent support
>> btrfs: introduce the skeleton of delayed bbio endio function
>> btrfs: introduce compression for delayed bbio
>> btrfs: implement uncompressed fallback for delayed bbio
>> btrfs: enable experimental delayed compression support
>>
>> fs/btrfs/bio.c | 1 +
>> fs/btrfs/bio.h | 3 +
>> fs/btrfs/btrfs_inode.h | 3 +
>> fs/btrfs/extent_io.c | 29 ++-
>> fs/btrfs/extent_map.h | 9 +-
>> fs/btrfs/inode.c | 492 +++++++++++++++++++++++++++++++++++++++-
>> fs/btrfs/ordered-data.c | 181 +++++++++++----
>> fs/btrfs/ordered-data.h | 14 ++
>> 8 files changed, 678 insertions(+), 54 deletions(-)
>>
>> --
>> 2.53.0
>>
Thread overview: 13+ messages
2026-03-19 21:04 [PATCH 0/6] btrfs: delay compression to bbio submission time Qu Wenruo
2026-03-19 21:04 ` [PATCH 1/6] btrfs: add skeleton for delayed btrfs bio Qu Wenruo
2026-03-19 21:04 ` [PATCH 2/6] btrfs: add delayed ordered extent support Qu Wenruo
2026-03-19 21:04 ` [PATCH 3/6] btrfs: introduce the skeleton of delayed bbio endio function Qu Wenruo
2026-03-19 21:04 ` [PATCH 4/6] btrfs: introduce compression for delayed bbio Qu Wenruo
2026-03-19 21:04 ` [PATCH 5/6] btrfs: implement uncompressed fallback " Qu Wenruo
2026-03-19 21:04 ` [PATCH 6/6] btrfs: enable experimental delayed compression support Qu Wenruo
2026-04-01 22:52 ` [PATCH 0/6] btrfs: delay compression to bbio submission time Boris Burkov
2026-04-01 23:29 ` Qu Wenruo [this message]
2026-04-02 0:15 ` Qu Wenruo
2026-04-02 0:51 ` Boris Burkov
2026-04-02 4:51 ` Qu Wenruo
2026-04-22 6:23 ` Qu Wenruo