From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH PoC 0/6] btrfs: delay compression to bbio submission time
Date: Mon,  9 Mar 2026 09:32:49 +1030
Message-ID: <cover.1773009120.git.wqu@suse.com>

[PROOF OF CONCEPT ONLY]
!!! There are a lot of known bugs !!!

It's only for discussion of design choices!

[DESIGN CHOICES]
The following design choices need to be discussed:

- Is the delayed ordered extent needed?
  In theory we can skip the delayed ordered extent.

  But later I found that this makes cases like writes beyond i_size
  harder to handle, as such writes need to grab the OE and manually
  mark the range finished.

  Furthermore it will need extra handling for cow fixup.

  Although the current delayed OE (and its child OEs) implementation is
  complex, it requires the least amount of changes for existing code.

- Async thread vs regular workqueue
  The existing compression is done in an async thread, which has a
  useful side effect: the on-disk extents are mostly sequential.

  This is ensured by the ordered function of the async thread, where the
  extents are allocated sequentially.

  The new PoC uses a regular workqueue, which can result in out-of-order
  on-disk extents for compressed writes.

  The main quirk is from the fact that ordered functions require
  re-execution of the same workload with @do_free set to true.

  However in the current code we can not do the re-execution as the bbio
  can already be gone after the compressed bio is submitted.
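
To make the ordering argument concrete, here is a minimal userspace
sketch of an "ordered function" queue. It is only a model: all sim_*
names are hypothetical and it is not the kernel's async-thread code.
It assumes the main work (e.g. compression) may finish in any order,
while the ordered function only runs for the finished prefix of the
queue, once with do_free = false and once more with do_free = true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_MAX_WORKS 8

/* Hypothetical model of an ordered work queue, not the kernel structures. */
struct sim_queue {
	bool done[SIM_MAX_WORKS];       /* main work (e.g. compression) finished */
	size_t nr;                      /* number of queued works */
	size_t next_ordered;            /* first work whose ordered func has not run */
	int alloc_order[SIM_MAX_WORKS]; /* order in which "extents" were allocated */
	size_t nr_alloc;
	size_t nr_freed;
};

/* Queue a new work and return its id (ids follow queue order). */
static int sim_queue_work(struct sim_queue *q)
{
	assert(q->nr < SIM_MAX_WORKS);
	q->done[q->nr] = false;
	return (int)q->nr++;
}

/* Ordered function: called twice per work, the second time with do_free set. */
static void sim_ordered_func(struct sim_queue *q, int id, bool do_free)
{
	if (do_free) {
		q->nr_freed++;
		return;
	}
	assert(q->nr_alloc < SIM_MAX_WORKS);
	q->alloc_order[q->nr_alloc++] = id; /* stands in for extent allocation */
}

/* Mark one work finished, then run ordered functions for the finished prefix. */
static void sim_work_done(struct sim_queue *q, int id)
{
	q->done[id] = true;
	while (q->next_ordered < q->nr && q->done[q->next_ordered]) {
		int cur = (int)q->next_ordered++;

		sim_ordered_func(q, cur, false);
		sim_ordered_func(q, cur, true);
	}
}
```

With three works queued as 0, 1, 2 and finishing as 2, 0, 1, the
allocation order in this model is still 0, 1, 2. The second call with
do_free set is the part that does not fit the PoC, since the bbio may
already be gone by then.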

[BACKGROUND]
Btrfs currently uses async submission for compressed writes. I'll use
the following example to explain the async submission:

The page and fs block sizes are all 4K, no large folio involved.
The dirty range is [0, 4K), [8K, 128K).

    0  4K  8K                                        128K
    |//|   |/////////////////////////////////////////|

- Write back folio 0
  * Delalloc
    writepage_delalloc() will find the delalloc range [0, 4K), and since
    it can not be inlined and is too small for compression, it will go
    through the COW path, thus a new data extent is allocated, with the
    corresponding EM/OE created.

  * Submission
    That folio 0 will be added into a bbio, and since we reached the OE
    end, the bbio will be submitted.

- Write back folio 8K
  * Delalloc
    writepage_delalloc() finds the delalloc range [8K, 128K) and goes
    for compression.
    Instead of allocating an extent immediately, it queues the work into
    delalloc_workers.

    Please note that the range [8K, 128K) is completely locked during
    compression.

  * Skip submission
    As the whole folio 8K went through async submission, we skip bbio
    submission.

- Write back folio 12K
  We wait for the folio to be unlocked (after compression is done and
  compressed bio is submitted).
  When the folio is unlocked, the folio will have writeback flag set and
  its dirty flag cleared. Thus we either wait for the writeback or skip
  the folio completely.

  This step repeats for the range [8K, 128K).

AFAIK the async submission is required as we can not submit two
different bbios for a single compressed range, which is different from
the uncompressed write path, where we can have several different bbios
for a single ordered extent.
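
The path selection in the walkthrough above can be modelled with a
small userspace sketch. All sim_* names are hypothetical, and the rule
that a range of at most one block is "too small for compression" is an
assumption made for this example, not a quote of the kernel check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_BLOCK_SIZE 4096u

enum sim_path {
	SIM_PATH_COW,            /* allocate the extent now, submit the bbio directly */
	SIM_PATH_ASYNC_COMPRESS, /* queue to a worker, skip bbio submission */
};

/* Hypothetical delalloc range, not a kernel structure. */
struct sim_range {
	uint64_t start;
	uint64_t end; /* exclusive */
};

/* Pick the write path for one delalloc range. */
static enum sim_path sim_pick_path(const struct sim_range *r, bool compress)
{
	assert(r->end > r->start);
	if (!compress)
		return SIM_PATH_COW;
	/* Assumption: a range of one block or less is not worth compressing. */
	if (r->end - r->start <= SIM_BLOCK_SIZE)
		return SIM_PATH_COW;
	return SIM_PATH_ASYNC_COMPRESS;
}

/* Number of blocks that stay locked while waiting for the async worker. */
static uint64_t sim_blocks_locked_by_worker(const struct sim_range *r,
					    enum sim_path path)
{
	if (path != SIM_PATH_ASYNC_COMPRESS)
		return 0;
	return (r->end - r->start) / SIM_BLOCK_SIZE;
}
```

For the example ranges, [0, 4K) takes the COW path, while [8K, 128K)
goes to the async worker and keeps 30 blocks locked until the
compressed bio is submitted.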

[PROBLEMS]
The async submission has the following problems:

- Non-sequential writeback
  Especially when large folios are involved, we can have some blocks
  submitted immediately (uncompressed), and some submitted later
  (compressed).

  That breaks the assumption of iomap and DONTCACHE writes.

- Not really async
  As in the example given above, we keep the whole range locked during
  compression, making later reads and even writeback itself wait.

[DELAYED COMPRESSION]
The new idea is to delay the compression until bbio submission time.
Now the workflow will be:

- Write back folio 0
  Same as before, it is submitted immediately.

- Write back folio 8K
  * Delalloc
    writepage_delalloc() finds the delalloc range [8K, 128K) and goes
    for compression, but this time we allocate a delayed EM and OE for
    the range [8K, 128K).

  * Submission
    That folio 8K will be added into a bbio, with its dirty flag removed
    and writeback flag set.

- Write back folios 12K ~ 124K
  * Delalloc
    No new delalloc range.
 
  * Submission
    Those folios will be added to the same bbio above.
    And after the last folio 124K is queued, we have reached the OE end
    and will submit the delayed bbio.

- Delayed bbio submission
  As the bbio has a special is_delayed flag set, it will not be
  submitted directly, but queued into a workqueue for compression.

  * Compression in the workqueue
  * Real delalloc
    Now an on-disk extent is reserved. The real EM will replace the
    delayed one.
    And the real OE will be added as a child of the original delayed
    one.
  * Compressed data submission
  * Delayed bbio finish
    When all child compressed/uncompressed writes have finished, the
    delayed bbio will finish.

    The full delayed OE is also finished, which will insert all of its
    child OEs into the subvolume tree.
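
The delayed workflow above can be sketched in userspace as follows.
This is only a model under stated assumptions: all sim_* names and
fields are hypothetical (only the is_delayed flag name comes from the
description above), folios are one block each, and the counters stand
in for whatever the PoC really uses to track child OEs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_BLOCK_SIZE 4096u

/* Hypothetical model of a delayed ordered extent with child OEs. */
struct sim_delayed_oe {
	uint64_t start;
	uint64_t end;             /* exclusive */
	unsigned int nr_children; /* real OEs created at compression time */
	unsigned int nr_pending;  /* children whose writes are still in flight */
	bool finished;
};

/* Hypothetical model of a delayed bbio collecting folios for one OE. */
struct sim_delayed_bbio {
	struct sim_delayed_oe *oe;
	uint64_t next_offset; /* next expected file offset */
	bool is_delayed;
	bool queued;          /* handed to the compression workqueue */
};

/* Add one block-sized folio; returns true once the OE end is reached. */
static bool sim_bbio_add_folio(struct sim_delayed_bbio *bbio, uint64_t offset)
{
	assert(!bbio->queued);
	assert(offset == bbio->next_offset);
	bbio->next_offset = offset + SIM_BLOCK_SIZE;
	if (bbio->next_offset == bbio->oe->end) {
		/* A delayed bbio is queued for compression, not submitted. */
		bbio->queued = bbio->is_delayed;
		return true;
	}
	return false;
}

/* Workqueue side: a real extent was reserved, so attach a child OE. */
static void sim_add_child_oe(struct sim_delayed_oe *oe)
{
	oe->nr_children++;
	oe->nr_pending++;
}

/* A child write completed; the delayed OE finishes with the last child. */
static void sim_child_write_done(struct sim_delayed_oe *oe)
{
	assert(oe->nr_pending > 0);
	if (--oe->nr_pending == 0)
		oe->finished = true;
}
```

Adding folios 8K ~ 124K one by one only queues the bbio on the last
folio, and the delayed OE only finishes after its last child write
completes. A real implementation would also need to hold an extra
reference until all children are created, otherwise an early child
completion could finish the parent too soon.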

This solves both the problems mentioned above, but is definitely way
more complex than the current async submission:

- Layered OEs
  We need to manage the child/parent OEs properly.

- Possible extra split
  Since the delayed OE is allocated first, we can still submit two
  different delayed bbios for the same OE.

  This means we can end up with two smaller compressed extents instead
  of one, which may reduce the compression ratio.

- More complex error handling

- Unsolved bugs
  * EM leak
  * bytes_may_use leak (METADATA)
  * Busy inode after umount
  * More hidden ones

Qu Wenruo (6):
  btrfs: add skeleton for delayed btrfs bio
  btrfs: add delayed ordered extent support
  btrfs: introduce the skeleton of delayed bbio endio function
  btrfs: introduce compression for delayed bbio
  btrfs: implement uncompressed fallback for delayed bbio
  btrfs: enable experimental delayed compression support

 fs/btrfs/bio.c          |   1 +
 fs/btrfs/bio.h          |   3 +
 fs/btrfs/btrfs_inode.h  |   3 +
 fs/btrfs/extent_io.c    |  29 ++-
 fs/btrfs/extent_map.h   |   9 +-
 fs/btrfs/inode.c        | 475 +++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/ordered-data.c |  73 +++++-
 fs/btrfs/ordered-data.h |  14 ++
 8 files changed, 595 insertions(+), 12 deletions(-)

-- 
2.53.0



