Re: [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Paul Richards <paul.richards@gmail.com>
Cc: linux-btrfs@vger.kernel.org, dsterba@suse.com
Subject: Re: [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE
Date: Mon, 20 Apr 2026 08:00:17 +0930	[thread overview]
Message-ID: <ff32d6db-77a3-4a9d-aa78-8dff8400a13b@gmx.com> (raw)
In-Reply-To: <CAMoswehjDpHm7_jmTPV_vKkqJSNKu5pCEDxmwGzAghhEwTGHSA@mail.gmail.com>



在 2026/4/20 04:10, Paul Richards 写道:
> Thanks Qu for your feedback! I wasn't expecting such a thorough review of
> this early RFC. I will reply to a few of your comments below.
> 
> For the comments I don't reply to, please know that I have read them and
> agree with them. If I follow up with another revision then I will address them
> there.
> 
> 
> On Sun, 19 Apr 2026 at 06:09, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>> 在 2026/4/19 09:55, Qu Wenruo 写道:
>>>
>>>
>>> 在 2026/4/19 00:08, Paul Richards 写道:
>>>> This series adds support for FALLOC_FL_COLLAPSE_RANGE and
>>>> FALLOC_FL_INSERT_RANGE to btrfs_fallocate(). Both operations are
>>>> already supported by ext4 and xfs. The userspace contract is
>>>> documented in fallocate(2).
>>>>
>>>> Patch 1 refactors btrfs_fallocate() to dispatch via a switch statement,
>>>> moving punch_hole into its own function and decoupling locking from the
>>>> per-operation helpers. This is similar to the implementaitons for ext4
>>>> and xfs. The allocate-range and zero-range paths remain coupled since
>>>> they share some setup logic.
>>>>
>>>> Patches 2 and 3 add COLLAPSE_RANGE and INSERT_RANGE respectively.
>>>>
>>>> == Implementation approach ==
>>>>
>>>> For COLLAPSE_RANGE:
>>>>    - The removed region [offset, offset+len) is punched out via
>>>>      btrfs_replace_file_extents(), which handles boundary splitting.
>>>>    - All EXTENT_DATA keys with key.offset >= offset+len are shifted
>>>>      leftward by len in forward order.
>>>>
>>>> For INSERT_RANGE:
>>>>    - All EXTENT_DATA keys with key.offset >= offset are shifted rightward
>>>>      by len in reverse order (required to avoid key collisions).
>>>>    - No pre-splitting of straddling extents is needed: the left portion
>>>>      of a straddling extent stays in place, the right portion is shifted;
>>>>      both reference the same physical extent via their existing
>>>>      extent_offset fields.
>>>>
>>>> For each shifted key, the corresponding back-reference in the extent
>>>> tree is updated via a shared helper btrfs_shift_extent_backref().
>>
>> After looking into each patch, I do not think the low level direct file
>> item change is a good idea, especially with your current implement:
>>
>> - Can lock the inode for a very long time
>>     E.g. inserting a hole into the beginning of a very large file.
>>     We will lock the inode until all file items are iterated, which kills
>>     concurrency.

After more reference to reflink code, I think this comment is a little 
over-reaction, as reflink also locks the involved range in one go, thus 
can lock the inode for a very long time too.

So long lock may not be a huge problem.

>>
>> - Possible problems with metadata reservation
>>
>> - Problems with ^no-holes collapse
>>     Will cause duplicated file offsets with hole file items.

This is still true, even if we only support the insert/collapse range 
for no-holes feature.

The bigger problem is that, even if a fs has no-holes feature, it can 
still have explicit hole extents. As the fs can be converted from 
^no-holes, and the existing holes are not removed.

I guess the initial design is mostly to make converting existing fses 
much easier, at the cost that kernel always has to handle explicit holes.


One idea is to introduce something like COMPAT_RO_STRICT_NO_HOLE to 
prevent explicit holes, then it will make the low-level key modification 
more feasible.
But that will need quite some time for such new feature to get adapted.

>>
>> On the other hand, with reflink the insert/collapse can even be
>> implemented in user space, with a proper step setting, we can still
>> allow concurrent read/write out of the reflink ranges.
>>
>> This makes me wonder, is these features really that necessary?
>>
>> If you know some programs actively utilize these features for real world
>> benefits, but can not be done through reflink, please provide them.
>>
> 
> There is a proprietary tool at my $dayjob which uses insert and collapse range.
> This works great on ext4 and xfs, and I am personally interested in having it
> work on btrfs.
> 
> Emulating in userspace is entirely possible using FICLONERANGE ioclt.. but
> since ext4 doesn't support that operation this leaves userspace needing to
> use one solution for ext4 (fallocate insert/collapse) and another for btrfs
> (FICLONERANGE). By filling the gap of btrfs fallocate modes I was hoping
> to simplify userspace and reduce friction in supporting btrfs.
> 
> However, on balance I agree with your opinion that emulating in userspace
> is the best approach here. So for now I will pause my work on this patch series.

I think you may be interested in utilizing btrfs_clone() to implement 
the insert/collapse range ioctl.

The benefit is you do not need to bother the low-level file item 
changes, nor the metadata reservation part (already done in btrfs_clone()).

However it won't work if the src/dst range overlaps, it can still be 
worked around by shrinking the reflink length to the minimal so that the 
ranges no longer overlaps, and at the cost of more fragments.

And you may still want to dig deeper into the extra locks and other 
corner cases like the final truncation after collapse and the extra hole 
insertion after insert.

I guess it may be a little easier to implement this time.

Thanks,
Qu

> 
>>>>
>>>> == Testing ==
>>>>
>>>> Tested with a Rust-based functional test suite covering:
>>>>    - Collapse and insert at the start, middle of a file
>>>>    - Multiple sequential operations on the same file
>>>>    - Files with multiple extents (fsync between writes to force separate
>>>>      extent items)
>>>>    - Files with holes (explicit punch_hole and implicit sparse writes)
>>>>    - Compressed extents (mount -o compress=zstd)
>>>>    - Transaction cycling (interval reduced to 4 during testing, verified
>>>>      in dmesg logs)
>>>>    - Inline files, verified that -EOPNOTSUPP is returned.
>>>
>>> I guess that tool has never verify the contents, nor multi-thread stress
>>> tests, e.g. fsstress?
>>>
> 
> My tests did read back and verify file contents. It was not multi threaded or
> particularly stressy.
> 
>>>>
>>>> The same tests pass on both btrfs and xfs (modulo the inline files).
>>>>
>>>> I have not run fstests which I know contains tests for INSERT_RANGE
>>>> and COLLAPSE_RANGE. I will do so.
>>>
>>> Thus I'd prefer a fstest run before whatever your local tool.
>>>
> 
> 100% agree that fstests would be better. I started with my own tests
> only to keep
> things very simple while the code got off the ground. If I carry this work on I
> will use fstests.
> 
>>>> == Notes ==
>>>>
>>>> This is my first kernel contribution. Development was significantly
>>>> assisted by an LLM (Amazon Q Developer). The implementation, testing,
>>>> and final review decisions are my own.
>>>
>>> I'm very interested in how the LLM is involved.
>>>
>>> You mentioned "implementation, testing, review" are on your own, this
>>> looks like everything is on your own.
>>>
>>> I don't think you're only using LLM to help understanding the code, thus
>>> it looks like implementation is contributed by the LLM.
>>>
>>> Please remember, you're the one explaining/defending the code.
>>> But as long as you can explain/defend the code, I'm fine with that.
>>>
> 
> My workflow with the LLM was this:
> 
> I provided a high level design, reference to the fallocate man page,
> pointers to the implementation of clone range in btrfs, and the existing
> fallocate implementations (for btrfs, ext4, and xfs). This was a couple
> of paragraphs and a handful of links.
> 
> I prompted the LLM to produce a detailed design document for how
> I could go about implementing insert and collapse for btrfs. The
> document it produced is English text of around 2000 words. It
> contains references to internal btrfs functions I needed and a
> description of the logic I needed to implement. It has no code.
> 
> I reviewed that doc, asked clarifications, and the doc was revised.
> 
> I then crafted the code. I relied heavily on the design doc that the
> LLM had written but did not follow it to the letter. During this
> implementation work I asked the LLM to clarify a few points, and
> the design doc was revised by the LLM.
> 
> During debugging most of the bugs I diagnosed myself. There was one
> bug which stumped me however, so I asked the LLM to provide
> hypotheses which I then worked through to solve the bug. I crafted
> the code for the fix. ( The specific bug was stale/incorrect data being
> read after the operation returned. The code was already invalidating
> the page cache, but I was not aware of the extent map cache that
> needed to be invalidated too. )
> 
> I wrote the test code myself.
> 
> Once the tests were passing the LLM was used to update the design
> doc to align with the code I actually wrote. I also asked the LLM to
> review my code prior to submitting it to the mailing list. I addressed
> some of that feedback. As an example piece of feedback it
> recommended I make use of unlikely() in "if" statements testing for
> errors.
> 
> 
>>> Still a newcomer with not a small code change will always attract more
>>> scrutiny, so please don't expect rapid review/merge/etc.
>>>
> 
> I am not in a rush. This has been a low-priority side quest going back
> to 2023: https://lore.kernel.org/linux-btrfs/CAMosweitbAN5EPOgJCtrbkRAj1QSbsYt4uDGVMZ378YY7wjnRw@mail.gmail.com/
> 
>

     prev parent reply	other threads:[~2026-04-19 22:30 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-18 14:38 [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE Paul Richards
2026-04-18 14:38 ` [PATCH 1/3] btrfs: refactor btrfs_fallocate() ahead of supporting more modes Paul Richards
2026-04-19  0:57   ` Qu Wenruo
2026-04-18 14:38 ` [PATCH 2/3] btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate() Paul Richards
2026-04-19  1:29   ` Qu Wenruo
2026-04-18 14:38 ` [PATCH 3/3] btrfs: support for FALLOC_FL_INSERT_RANGE " Paul Richards
2026-04-19  4:44   ` Qu Wenruo
2026-04-19  0:25 ` [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE Qu Wenruo
2026-04-19  5:08   ` Qu Wenruo
2026-04-19 18:40     ` Paul Richards
2026-04-19 22:30       ` Qu Wenruo [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ff32d6db-77a3-4a9d-aa78-8dff8400a13b@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=dsterba@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=paul.richards@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox