From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Boris Burkov <boris@bur.io>, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 0/6] btrfs: delay compression to bbio submission time
Date: Thu, 2 Apr 2026 15:21:04 +1030
Message-ID: <991136de-fa41-4847-b81e-af45b6f7d1ea@gmx.com>
In-Reply-To: <20260402005108.GA916963@zen.localdomain>

On 2026/4/2 11:21, Boris Burkov wrote:
> On Thu, Apr 02, 2026 at 10:45:14AM +1030, Qu Wenruo wrote:
>>
>>
>> On 2026/4/2 09:59, Qu Wenruo wrote:
>> [...]
>>>> So
>>>> we are breaking the contract "OE <=> allocated space" to allow this. I
>>>> think making the absolute core of the improvement more apparent in the
>>>> descriptions would be helpful.
>>>>
>>>> I think one thing I still don't understand is the desire for the layered
>>>> bios/OEs instead of creating the same delayed OE, but then as we do
>>>> the real
>>>> allocation/compression and discover the actual ranges doing
>>>> btrfs_split_ordered_extent() like short DIO writes, which seems quite
>>>> similar. Splitting/joining feels like a much more natural model for
>>>> ranges like OEs than layering into a tree. As we discover the sub ranges
>>>> we actually use, we split off the real OE.
>>>
>>> I can definitely work towards that direction. Although my concern is the
>>> OE waiting/start part and error handling.
>>>
>>> But so far those are only concerns, I need to implement the code to see
>>> what can go wrong.
>>>
>>> And if no major problem is hit, you can see a v2 with the split solution.
>>
>> Finally I recall the challenge using btrfs_split_ordered_extent(), that we
>> can not split the OE in the middle.
>>
>> E.g. we have a delayed OE for range [0, 32K), then due to whatever reasons
>> (e.g. memory pressure), we are forced to submit range [0, 16K) first, then
>> range [16K, 32K).
>>
>> Both go through delayed compression, but the range [16K, 32K) wins the race
>> by failing the compression (bad ratio), and falls back to uncompressed
>> submission first, before range [0, 16K) even finishes its compression.
>>
>> Furthermore, for the range [16K, 32K) we do not have a large enough free
>> space to fill it in one go, but can only allocate several 8K sized extents.
>>
>> So we need to split the [16K, 32K) into two ranges, [16K, 24K) and [24K,
>> 32K).
>>
>> This means we have to split the original [0, 32K) extent into [0, 16K),
>> [16K, 24K) and [24K, 32K) ranges.
>> This is not supported by the current btrfs_split_ordered_extent(), which can
>> only split range from the beginning of an OE.
>>
>>
>> I'll try to implement a version of btrfs_split_ordered_extent() that can
>> split the range at any offset to see how things will work then.
>
> This was kinda buggy before with the extent maps, and I think Naohiro,
> Christoph and I ended up doing a good amount of refactoring to get rid
> of it. I am confident you can implement it correctly, just a fair warning.
>
> Also, you could consider doing an iterative split? Like if you need 16k,
> then split [0,128k) into [0,16k)[16k,128k) then split [16k,128k) into
> [16k,24k)[24k,128k).

Yes, that's possible if we go with the ordered workqueue, and as long as
writeback always happens sequentially.

But sequential writeback is not guaranteed.

E.g. we are initially writing back folio 0, which created the OE for
range [0, 128K), but due to memory pressure or whatever else in the MM
layer, the MM suddenly chooses to also trigger writeback for the folio at
64K, while our initial writeback is still submitting folios for range
[0, 32K).

In that case, we have to do the split at 64K, and the sequential nature
is broken no matter what.

And I believe that's also why we hold all folios locked in the existing
async submission path, to avoid such unexpected splits caused by MM.

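To make the out-of-order case concrete, here is a minimal userspace sketch,
not kernel code. It models an OE as a plain [start, end) byte range and
assumes a front-only split primitive, in the spirit of the restriction
described above. The names toy_oe, toy_split_front() and toy_carve() are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_K 1024ULL

/* Toy stand-in for an ordered extent: just a byte range [start, end). */
struct toy_oe {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical front-only split: carve the first @len bytes off @oe and
 * return them as a new range. @oe is trimmed to the remaining tail.
 */
static struct toy_oe toy_split_front(struct toy_oe *oe, uint64_t len)
{
	struct toy_oe head = { oe->start, oe->start + len };

	assert(len > 0 && len < oe->end - oe->start);
	oe->start += len;
	return head;
}

/*
 * Carve [start, start + len) out of @oe using only front splits.
 *
 * On return @head holds the still-pending range before @start (empty if
 * @start is the beginning of @oe), @mid holds the carved range, and @oe
 * is trimmed to the still-pending tail after it.
 *
 * Returns the number of front splits that were needed.
 */
static int toy_carve(struct toy_oe *oe, uint64_t start, uint64_t len,
		     struct toy_oe *head, struct toy_oe *mid)
{
	int splits = 0;

	assert(start >= oe->start && len > 0 && start + len < oe->end);

	/* Empty head by default. */
	head->start = oe->start;
	head->end = oe->start;

	if (start > oe->start) {
		/* Out-of-order request: peel off the pending head first. */
		*head = toy_split_front(oe, start - oe->start);
		splits++;
	}
	*mid = toy_split_front(oe, len);
	splits++;
	return splits;
}
```

In this model, carving [64K, 72K) out of [0, 128K) takes two front splits
and leaves two separate pending ranges, [0, 64K) and [72K, 128K), while a
request at offset 0 takes one split and leaves a single pending tail. The
sketch says nothing about OE waiting, error handling or extent maps, which
are the parts expected to be hard.
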
I will definitely try some of the ideas mentioned here, like using an
ordered workqueue (which also addresses the current out-of-order OE disk
bytenr problem).

But if OE splitting turns out to be too complex, I'll definitely give you
some feedback about the problems.

Thanks,
Qu

>
> Not sure if that is easier than the "true" 3 (or K) way split.
>
> If all this becomes a huge headache please don't feel obliged to
> carry on forever, I am happy to discuss again.
>
> Thanks,
> Boris
>
>>
>> Thanks,
>> Qu
>

Thread overview (13+ messages):
2026-03-19 21:04 [PATCH 0/6] btrfs: delay compression to bbio submission time Qu Wenruo
2026-03-19 21:04 ` [PATCH 1/6] btrfs: add skeleton for delayed btrfs bio Qu Wenruo
2026-03-19 21:04 ` [PATCH 2/6] btrfs: add delayed ordered extent support Qu Wenruo
2026-03-19 21:04 ` [PATCH 3/6] btrfs: introduce the skeleton of delayed bbio endio function Qu Wenruo
2026-03-19 21:04 ` [PATCH 4/6] btrfs: introduce compression for delayed bbio Qu Wenruo
2026-03-19 21:04 ` [PATCH 5/6] btrfs: implement uncompressed fallback " Qu Wenruo
2026-03-19 21:04 ` [PATCH 6/6] btrfs: enable experimental delayed compression support Qu Wenruo
2026-04-01 22:52 ` [PATCH 0/6] btrfs: delay compression to bbio submission time Boris Burkov
2026-04-01 23:29 ` Qu Wenruo
2026-04-02 0:15 ` Qu Wenruo
2026-04-02 0:51 ` Boris Burkov
2026-04-02 4:51 ` Qu Wenruo [this message]
2026-04-22 6:23 ` Qu Wenruo