[RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

* [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE
@ 2026-04-18 14:38 Paul Richards
  2026-04-18 14:38 ` [PATCH 1/3] btrfs: refactor btrfs_fallocate() ahead of supporting more modes Paul Richards
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Paul Richards @ 2026-04-18 14:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Paul Richards

This series adds support for FALLOC_FL_COLLAPSE_RANGE and
FALLOC_FL_INSERT_RANGE to btrfs_fallocate(). Both operations are
already supported by ext4 and xfs. The userspace contract is
documented in fallocate(2).

Patch 1 refactors btrfs_fallocate() to dispatch via a switch statement,
moving punch_hole into its own function and decoupling locking from the
per-operation helpers. This is similar to the implementaitons for ext4
and xfs. The allocate-range and zero-range paths remain coupled since
they share some setup logic.

Patches 2 and 3 add COLLAPSE_RANGE and INSERT_RANGE respectively.

== Implementation approach ==

For COLLAPSE_RANGE:
 - The removed region [offset, offset+len) is punched out via
   btrfs_replace_file_extents(), which handles boundary splitting.
 - All EXTENT_DATA keys with key.offset >= offset+len are shifted
   leftward by len in forward order.

For INSERT_RANGE:
 - All EXTENT_DATA keys with key.offset >= offset are shifted rightward
   by len in reverse order (required to avoid key collisions).
 - No pre-splitting of straddling extents is needed: the left portion
   of a straddling extent stays in place, the right portion is shifted;
   both reference the same physical extent via their existing
   extent_offset fields.

For each shifted key, the corresponding back-reference in the extent
tree is updated via a shared helper btrfs_shift_extent_backref().

After the key-shift loop, btrfs_drop_extent_map_range() is called to
invalidate the in-memory extent map cache. This is important for reads
after the fallocate operation to ensure they obtain data for the new
offsets.

The page cache is flushed and invalidated upfront (before any extent
manipulation) following the ext4/xfs pattern. The inode lock
(BTRFS_ILOCK_MMAP) is held throughout, preventing new dirty pages from
appearing during the operation.

Transaction cycling: the key-shift loop cycles transactions every
BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL (32) items to avoid
holding a single transaction open across a large number of extents.

Both operations are gated on CONFIG_BTRFS_EXPERIMENTAL.

== Known limitations ==

INSERT_RANGE returns -EOPNOTSUPP for inlined files. Supporting inline
files will require promoting the existing inline extent to a regular
one, since inline extends are supported only at the very start of a
file.

In the opposite direction, COLLAPSE_RANGE will not create inline files
like it should if the remaining data is small enough.

I intend to address both of these limitations.

== Testing ==

Tested with a Rust-based functional test suite covering:
 - Collapse and insert at the start, middle of a file
 - Multiple sequential operations on the same file
 - Files with multiple extents (fsync between writes to force separate
   extent items)
 - Files with holes (explicit punch_hole and implicit sparse writes)
 - Compressed extents (mount -o compress=zstd)
 - Transaction cycling (interval reduced to 4 during testing, verified
   in dmesg logs)
 - Inline files, verified that -EOPNOTSUPP is returned.

The same tests pass on both btrfs and xfs (modulo the inline files).

I have not run fstests which I know contains tests for INSERT_RANGE
and COLLAPSE_RANGE. I will do so.

== Questions for reviewers ==

1. Transaction cycling interval: we use 32 items per cycle. Is this
   the right threshold, or is there an established convention in btrfs
   for this kind of loop?

2. Extent lock scope for collapse: we hold the extent lock only on
   [offset, offset+len) during the hole punch, not on the full
   [offset, i_size) range that the key-shift loop operates on. Is
   this safe, or should we lock the full affected range?

3. CONFIG_BTRFS_EXPERIMENTAL gate: is this the right gate for these
   operations, or should they be unconditionally available?

== Notes ==

This is my first kernel contribution. Development was significantly
assisted by an LLM (Amazon Q Developer). The implementation, testing,
and final review decisions are my own.

Various btrfs_info() print statements, assertions, and comments that
were useful during development and testing have been left in place,
but will be removed or streamlined in the next revision.

Paul Richards (3):
  btrfs: refactor btrfs_fallocate() ahead of supporting more modes
  btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate()
  btrfs: support for FALLOC_FL_INSERT_RANGE in btrfs_fallocate()

 fs/btrfs/file.c | 601 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 578 insertions(+), 23 deletions(-)

-- 
2.53.0

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] btrfs: refactor btrfs_fallocate() ahead of supporting more modes
  2026-04-18 14:38 [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE Paul Richards
@ 2026-04-18 14:38 ` Paul Richards
  2026-04-19  0:57   ` Qu Wenruo
  2026-04-18 14:38 ` [PATCH 2/3] btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate() Paul Richards
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Paul Richards @ 2026-04-18 14:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Paul Richards

Refactor btrfs_fallocate() to switch and dispatch based on
the mode argument, splitting most of the modes into separate
functions. Only the "allocate range" and "zero range" functions
remain coupled.

Signed-off-by: Paul Richards <paul.richards@gmail.com>
---
 fs/btrfs/file.c | 63 +++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a4cb9d3cfc4e..0b5cc3cec675 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2668,8 +2668,6 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 	bool truncated_block = false;
 	bool updated_inode = false;
 
-	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
-
 	ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset, len);
 	if (ret)
 		goto out_only_mutex;
@@ -2712,7 +2710,6 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 		truncated_block = true;
 		ret = btrfs_truncate_block(BTRFS_I(inode), offset, orig_start, orig_end);
 		if (ret) {
-			btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
 			return ret;
 		}
 	}
@@ -2809,7 +2806,6 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 				ret = ret2;
 		}
 	}
-	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
 	return ret;
 }
 
@@ -2933,6 +2929,8 @@ static int btrfs_zero_range(struct inode *inode,
 	u64 bytes_to_reserve = 0;
 	bool space_reserved = false;
 
+	ASSERT((mode & FALLOC_FL_MODE_MASK) == FALLOC_FL_ZERO_RANGE);
+
 	em = btrfs_get_extent(BTRFS_I(inode), NULL, alloc_start,
 			      alloc_end - alloc_start);
 	if (IS_ERR(em)) {
@@ -3092,7 +3090,7 @@ static int btrfs_zero_range(struct inode *inode,
 	return ret;
 }
 
-static long btrfs_fallocate(struct file *file, int mode,
+static long btrfs_allocate_or_zero_range(struct file *file, int mode,
 			    loff_t offset, loff_t len)
 {
 	struct inode *inode = file_inode(file);
@@ -3115,27 +3113,12 @@ static long btrfs_fallocate(struct file *file, int mode,
 	int blocksize = BTRFS_I(inode)->root->fs_info->sectorsize;
 	int ret;
 
-	if (btrfs_is_shutdown(inode_to_fs_info(inode)))
-		return -EIO;
-
-	/* Do not allow fallocate in ZONED mode */
-	if (btrfs_is_zoned(inode_to_fs_info(inode)))
-		return -EOPNOTSUPP;
+	ASSERT((mode & FALLOC_FL_MODE_MASK) == FALLOC_FL_ALLOCATE_RANGE || (mode & FALLOC_FL_MODE_MASK) == FALLOC_FL_ZERO_RANGE);
 
 	alloc_start = round_down(offset, blocksize);
 	alloc_end = round_up(offset + len, blocksize);
 	cur_offset = alloc_start;
 
-	/* Make sure we aren't being give some crap mode */
-	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
-		     FALLOC_FL_ZERO_RANGE))
-		return -EOPNOTSUPP;
-
-	if (mode & FALLOC_FL_PUNCH_HOLE)
-		return btrfs_punch_hole(file, offset, len);
-
-	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
-
 	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
 		ret = inode_newsize_ok(inode, offset + len);
 		if (ret)
@@ -3185,7 +3168,6 @@ static long btrfs_fallocate(struct file *file, int mode,
 
 	if (mode & FALLOC_FL_ZERO_RANGE) {
 		ret = btrfs_zero_range(inode, offset, len, mode);
-		btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
 		return ret;
 	}
 
@@ -3282,11 +3264,46 @@ static long btrfs_fallocate(struct file *file, int mode,
 	btrfs_unlock_extent(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
 			    &cached_state);
 out:
-	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
 	extent_changeset_free(data_reserved);
 	return ret;
 }
 
+static long btrfs_fallocate(struct file *file, int mode,
+			    loff_t offset, loff_t len)
+{
+	struct inode *inode = file_inode(file);
+	int ret;
+
+	if (btrfs_is_shutdown(inode_to_fs_info(inode)))
+		return -EIO;
+
+	/* Do not allow fallocate in ZONED mode */
+	if (btrfs_is_zoned(inode_to_fs_info(inode)))
+		return -EOPNOTSUPP;
+
+	/* Check for options we do not support. */
+	if (mode & ~(FALLOC_FL_MODE_MASK | FALLOC_FL_KEEP_SIZE))
+		return -EOPNOTSUPP;
+
+	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
+
+	/* Only one mode bit may be set. */
+	switch (mode & FALLOC_FL_MODE_MASK) {
+	case FALLOC_FL_ALLOCATE_RANGE:
+	case FALLOC_FL_ZERO_RANGE:
+		ret = btrfs_allocate_or_zero_range(file, mode, offset, len);
+		break;
+	case FALLOC_FL_PUNCH_HOLE:
+		ret = btrfs_punch_hole(file, offset, len);
+		break;
+	default:
+		ret = -EOPNOTSUPP;
+	}
+
+	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
+	return ret;
+}
+
 /*
  * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
  * that has unflushed and/or flushing delalloc. There might be other adjacent
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] btrfs: refactor btrfs_fallocate() ahead of supporting more modes
  2026-04-18 14:38 ` [PATCH 1/3] btrfs: refactor btrfs_fallocate() ahead of supporting more modes Paul Richards
@ 2026-04-19  0:57   ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2026-04-19  0:57 UTC (permalink / raw)
  To: Paul Richards, linux-btrfs; +Cc: dsterba



在 2026/4/19 00:08, Paul Richards 写道:
> Refactor btrfs_fallocate() to switch and dispatch based on
> the mode argument, splitting most of the modes into separate
> functions. Only the "allocate range" and "zero range" functions
> remain coupled.
> 
> Signed-off-by: Paul Richards <paul.richards@gmail.com>
> ---
[...]
> @@ -3115,27 +3113,12 @@ static long btrfs_fallocate(struct file *file, int mode,
>   	int blocksize = BTRFS_I(inode)->root->fs_info->sectorsize;
>   	int ret;
>   
> -	if (btrfs_is_shutdown(inode_to_fs_info(inode)))
> -		return -EIO;
> -
> -	/* Do not allow fallocate in ZONED mode */
> -	if (btrfs_is_zoned(inode_to_fs_info(inode)))
> -		return -EOPNOTSUPP;
> +	ASSERT((mode & FALLOC_FL_MODE_MASK) == FALLOC_FL_ALLOCATE_RANGE || (mode & FALLOC_FL_MODE_MASK) == FALLOC_FL_ZERO_RANGE);

Too long line, this can be caught by checkpath.

>   
>   	alloc_start = round_down(offset, blocksize);
>   	alloc_end = round_up(offset + len, blocksize);
>   	cur_offset = alloc_start;
>   
> -	/* Make sure we aren't being give some crap mode */
> -	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
> -		     FALLOC_FL_ZERO_RANGE))
> -		return -EOPNOTSUPP;
> -
> -	if (mode & FALLOC_FL_PUNCH_HOLE)
> -		return btrfs_punch_hole(file, offset, len);
> -
> -	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
> -
>   	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
>   		ret = inode_newsize_ok(inode, offset + len);
>   		if (ret)
> @@ -3185,7 +3168,6 @@ static long btrfs_fallocate(struct file *file, int mode,
>   
>   	if (mode & FALLOC_FL_ZERO_RANGE) {
>   		ret = btrfs_zero_range(inode, offset, len, mode);
> -		btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
>   		return ret;
>   	}
>   
> @@ -3282,11 +3264,46 @@ static long btrfs_fallocate(struct file *file, int mode,
>   	btrfs_unlock_extent(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
>   			    &cached_state);
>   out:
> -	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
>   	extent_changeset_free(data_reserved);
>   	return ret;
>   }
>   
> +static long btrfs_fallocate(struct file *file, int mode,
> +			    loff_t offset, loff_t len)
> +{
> +	struct inode *inode = file_inode(file);
> +	int ret;
> +
> +	if (btrfs_is_shutdown(inode_to_fs_info(inode)))
> +		return -EIO;
> +
> +	/* Do not allow fallocate in ZONED mode */
> +	if (btrfs_is_zoned(inode_to_fs_info(inode)))
> +		return -EOPNOTSUPP;
> +
> +	/* Check for options we do not support. */
> +	if (mode & ~(FALLOC_FL_MODE_MASK | FALLOC_FL_KEEP_SIZE))
> +		return -EOPNOTSUPP;
> +
> +	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
> +
> +	/* Only one mode bit may be set. */
> +	switch (mode & FALLOC_FL_MODE_MASK) {
> +	case FALLOC_FL_ALLOCATE_RANGE:
> +	case FALLOC_FL_ZERO_RANGE:
> +		ret = btrfs_allocate_or_zero_range(file, mode, offset, len);

I'd prefer to have separate functions for zero and allocate range.

There are some shared code between them, but not too much, just:

- isize check
- file modified check
- btrfs_cont_expand()/btrfs_truncate_block()
- btrfs_wait_ordered_range().

This can be extracted into a helper function, shared by both 
btrfs_allocate_range() and btrfs_zero_range().

This should make it a little easier to read.

Thanks,
Qu


> +		break;
> +	case FALLOC_FL_PUNCH_HOLE:
> +		ret = btrfs_punch_hole(file, offset, len);
> +		break;
> +	default:
> +		ret = -EOPNOTSUPP;
> +	}
> +
> +	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
> +	return ret;
> +}
> +
>   /*
>    * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
>    * that has unflushed and/or flushing delalloc. There might be other adjacent


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/3] btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate()
  2026-04-18 14:38 [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE Paul Richards
  2026-04-18 14:38 ` [PATCH 1/3] btrfs: refactor btrfs_fallocate() ahead of supporting more modes Paul Richards
@ 2026-04-18 14:38 ` Paul Richards
  2026-04-19  1:29   ` Qu Wenruo
  2026-04-18 14:38 ` [PATCH 3/3] btrfs: support for FALLOC_FL_INSERT_RANGE " Paul Richards
  2026-04-19  0:25 ` [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE Qu Wenruo
  3 siblings, 1 reply; 11+ messages in thread
From: Paul Richards @ 2026-04-18 14:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Paul Richards

Assisted-by: Amazon Q Developer:auto/unknown
Signed-off-by: Paul Richards <paul.richards@gmail.com>
---
 fs/btrfs/file.c | 300 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 300 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0b5cc3cec675..99d24bef5f88 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -39,6 +39,13 @@
 #include "super.h"
 #include "print-tree.h"
 
+/*
+ * When we shift extents as part of fallocate insert or collapse we commit
+ * and cycle the transaction every BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL
+ * extents to avoid accumulating too many changes in one transaction.
+ */
+#define BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL (32)
+
 /*
  * Unlock folio after btrfs_file_write() is done with it.
  */
@@ -2648,6 +2655,296 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
 	return ret;
 }
 
+/*
+ * Update the extent back-reference in the extent tree when a
+ * BTRFS_EXTENT_DATA_KEY item is shifted to a new logical file offset.
+ * Drops the back-reference at old_file_offset and adds one at new_file_offset.
+ * Holes (disk_bytenr == 0) and inline extents have no back-references and
+ * must not be passed to this function.
+ */
+static int btrfs_shift_extent_backref(struct btrfs_trans_handle *trans,
+				      struct btrfs_root *root, u64 ino,
+				      u64 disk_bytenr, u64 num_bytes,
+				      u64 old_file_offset, u64 new_file_offset)
+{
+	struct btrfs_ref ref = {
+		.bytenr = disk_bytenr,
+		.num_bytes = num_bytes,
+		.parent = 0,
+		.owning_root = btrfs_root_id(root),
+		.ref_root = btrfs_root_id(root),
+	};
+	int ret;
+
+	ref.action = BTRFS_DROP_DELAYED_REF;
+	btrfs_init_data_ref(&ref, ino, old_file_offset, 0, false);
+	ret = btrfs_free_extent(trans, &ref);
+	if (unlikely(ret)) {
+		btrfs_abort_transaction(trans, ret);
+		return ret;
+	}
+
+	ref.action = BTRFS_ADD_DELAYED_REF;
+	btrfs_init_data_ref(&ref, ino, new_file_offset, 0, false);
+	ret = btrfs_inc_extent_ref(trans, &ref);
+	if (unlikely(ret))
+		btrfs_abort_transaction(trans, ret);
+
+	return ret;
+}
+
+static int btrfs_collapse_range(struct inode *inode, loff_t offset, loff_t len)
+{
+	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	u64 end = offset + len;
+	struct btrfs_path *path;
+	struct btrfs_trans_handle *trans = NULL;
+	struct extent_state *cached_state = NULL;
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	struct btrfs_key new_key;
+	u64 ino = btrfs_ino(BTRFS_I(inode));
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
+		return -EOPNOTSUPP;
+
+	/* offset and len must be sector-aligned */
+	if (!IS_ALIGNED(offset | len, fs_info->sectorsize))
+		return -EINVAL;
+
+	/* collapse range must not reach or pass EOF - use ftruncate instead */
+	if (end >= inode->i_size)
+		return -EINVAL;
+
+	btrfs_info(fs_info,
+		   "btrfs_collapse_range: ino=%llu offset=%lld len=%lld i_size=%lld",
+		   btrfs_ino(BTRFS_I(inode)), offset, len, inode->i_size);
+
+	/* wait for any ordered extents in [offset, i_size) to complete */
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset,
+				       inode->i_size - offset);
+	if (ret)
+		return ret;
+
+	/*
+	 * Flush dirty pages and invalidate the page cache for [offset, i_size)
+	 * before any extent manipulation, following the ext4/xfs pattern.
+	 * We hold BTRFS_ILOCK_MMAP so no new dirty pages can appear during
+	 * the operation. The page cache must be empty before we shift extent
+	 * keys so that stale pages at the old offsets cannot be read back
+	 * after the collapse.
+	 */
+	ret = filemap_write_and_wait_range(inode->i_mapping, offset, LLONG_MAX);
+	if (ret)
+		return ret;
+	btrfs_info(fs_info,
+		   "btrfs_collapse_range: nrpages before upfront invalidate=%lu (pages before offset=%llu not invalidated)",
+		   inode->i_mapping->nrpages, offset >> PAGE_SHIFT);
+	truncate_pagecache_range(inode, offset, LLONG_MAX);
+	btrfs_info(fs_info,
+		   "btrfs_collapse_range: nrpages after upfront invalidate=%lu (expected %llu)",
+		   inode->i_mapping->nrpages, offset >> PAGE_SHIFT);
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	/*
+	 * Lock the range [offset, end) and invalidate the page cache within
+	 * it. btrfs_punch_hole_lock_range() calls truncate_pagecache_range()
+	 * internally in a retry loop.
+	 */
+	btrfs_punch_hole_lock_range(inode, offset, end - 1, &cached_state);
+
+	/*
+	 * Remove all extents in [offset, end). Passing NULL for extent_info
+	 * means we are punching a hole. btrfs_replace_file_extents() splits
+	 * any extent straddling the boundaries, drops extent refs, and
+	 * returns a transaction handle for us to reuse.
+	 */
+	ret = btrfs_replace_file_extents(BTRFS_I(inode), path,
+					 offset, end - 1, NULL, &trans);
+	btrfs_info(fs_info,
+		   "btrfs_collapse_range: btrfs_replace_file_extents ret=%d nrpages=%lu",
+		   ret, inode->i_mapping->nrpages);
+	if (ret)
+		goto out_unlock;
+
+	/*
+	 * Shift all BTRFS_EXTENT_DATA_KEY items with key.offset >= end
+	 * leftward by len bytes.
+	 *
+	 * We iterate forward (lowest offset first) which is safe for a
+	 * left-shift because the new key is always less than the old one,
+	 * so we never collide with a key we haven't visited yet.
+	 */
+	key.objectid = ino;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = end;
+
+	int nr_shifted = 0;
+	while (1) {
+		struct btrfs_file_extent_item *fi;
+		u64 disk_bytenr;
+		u64 num_bytes;
+		u64 extent_offset;
+		int extent_type;
+
+		ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
+		if (ret < 0)
+			goto out_trans;
+
+		/* If no exact match, slot points at the next item - that's fine */
+
+		leaf = path->nodes[0];
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(root, path);
+			if (ret < 0)
+				goto out_trans;
+			if (ret > 0) {
+				/* No more items - we are done shifting */
+				ret = 0;
+				break;
+			}
+			leaf = path->nodes[0];
+		}
+
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+		/* Stop if we've moved past this inode's extent data items */
+		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
+			ret = 0;
+			break;
+		}
+
+		btrfs_info(fs_info,
+			   "btrfs_collapse_range: shifting key offset %llu -> %llu",
+			   key.offset, key.offset - len);
+
+		fi = btrfs_item_ptr(leaf, path->slots[0],
+				    struct btrfs_file_extent_item);
+		extent_type = btrfs_file_extent_type(leaf, fi);
+		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+		num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
+		extent_offset = btrfs_file_extent_offset(leaf, fi);
+
+		memcpy(&new_key, &key, sizeof(new_key));
+		new_key.offset -= len;
+		btrfs_set_item_key_safe(trans, path, &new_key);
+
+		/*
+		 * Update the back-reference in the extent tree to reflect the
+		 * new logical file offset. Holes (disk_bytenr == 0) and inline
+		 * extents have no back-references to update.
+		 */
+		if (extent_type != BTRFS_FILE_EXTENT_INLINE && disk_bytenr > 0) {
+			ret = btrfs_shift_extent_backref(trans, root, ino,
+					disk_bytenr, num_bytes,
+					key.offset - extent_offset,
+					new_key.offset - extent_offset);
+			if (unlikely(ret))
+				goto out_trans;
+		}
+
+		/* Advance to the next item on the next iteration */
+		key.offset = new_key.offset + 1;
+		nr_shifted++;
+
+		/*
+		 * Cycle the transaction every N items to avoid holding a
+		 * single transaction open across a large number of extent
+		 * items. btrfs_set_item_key_safe() modifies leaves in-place
+		 * so we won't hit -ENOSPC; we use a simple counter instead.
+		 * Update the inode at each cycle point so it is consistent
+		 * on disk if a crash occurs mid-loop.
+		 */
+		if (nr_shifted % BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL == 0) {
+			btrfs_info(fs_info,
+			   "btrfs_collapse_range: cycling transaction, nr_shifted=%d", nr_shifted);
+			inode_inc_iversion(inode);
+			inode_set_mtime_to_ts(inode,
+					      inode_set_ctime_current(inode));
+			ret = btrfs_update_inode(trans, BTRFS_I(inode));
+			if (ret) {
+				btrfs_release_path(path);
+				goto out_trans;
+			}
+			btrfs_end_transaction(trans);
+			btrfs_btree_balance_dirty(fs_info);
+			trans = btrfs_start_transaction(root, 1);
+			if (IS_ERR(trans)) {
+				ret = PTR_ERR(trans);
+				trans = NULL;
+				btrfs_release_path(path);
+				goto out_unlock;
+			}
+		}
+
+		btrfs_release_path(path);
+	}
+
+	if (ret)
+		goto out_trans;
+
+	/* Update i_size and the on-disk inode */
+	btrfs_info(fs_info,
+		   "btrfs_collapse_range: updating i_size %lld -> %lld nrpages=%lu",
+		   inode->i_size, inode->i_size - len, inode->i_mapping->nrpages);
+	/*
+	 * Drop the extent map cache for [offset, i_size) so that subsequent
+	 * reads re-load the correct mappings from the btree. The key-shift
+	 * loop updated the btree but the in-memory extent map cache still
+	 * has entries at the old logical offsets.
+	 */
+	btrfs_drop_extent_map_range(BTRFS_I(inode), offset, (u64)-1, false);
+	inode_inc_iversion(inode);
+	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+	i_size_write(inode, inode->i_size - len);
+	btrfs_inode_safe_disk_i_size_write(BTRFS_I(inode), 0);
+	ret = btrfs_update_inode(trans, BTRFS_I(inode));
+
+out_trans:
+	if (trans) {
+		if (ret)
+			btrfs_end_transaction(trans);
+		else
+			ret = btrfs_end_transaction(trans);
+	}
+out_unlock:
+	btrfs_unlock_extent(&BTRFS_I(inode)->io_tree, offset, end - 1,
+			    &cached_state);
+	btrfs_info(fs_info,
+		   "btrfs_collapse_range: post-unlock ret=%d i_size=%lld, invalidating page cache from %lld",
+		   ret, inode->i_size, offset);
+	if (IS_ENABLED(CONFIG_BTRFS_DEBUG) && !ret) {
+		/*
+		 * These are expected to be no-ops: ordered extents were drained
+		 * at the start of this function and BTRFS_ILOCK_MMAP has been
+		 * held throughout, so no new writes could have been submitted.
+		 * The page cache was emptied from offset onwards upfront and
+		 * nrpages in that region stayed 0 throughout the operation.
+		 * Pages before offset are unaffected and may still be cached.
+		 */
+		int wait_ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset,
+						    inode->i_size);
+		btrfs_info(fs_info,
+			   "btrfs_collapse_range: post-shift wait_ordered ret=%d",
+			   wait_ret);
+		ASSERT(wait_ret == 0);
+		ASSERT(filemap_range_has_page(inode->i_mapping, offset,
+					      LLONG_MAX) == false);
+		truncate_pagecache_range(inode, offset, LLONG_MAX);
+		btrfs_info(fs_info,
+			   "btrfs_collapse_range: truncate_pagecache_range done");
+		ASSERT(filemap_range_has_page(inode->i_mapping, offset,
+					      LLONG_MAX) == false);
+	}
+	btrfs_free_path(path);
+	return ret;
+}
+
 static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 {
 	struct inode *inode = file_inode(file);
@@ -3296,6 +3593,9 @@ static long btrfs_fallocate(struct file *file, int mode,
 	case FALLOC_FL_PUNCH_HOLE:
 		ret = btrfs_punch_hole(file, offset, len);
 		break;
+	case FALLOC_FL_COLLAPSE_RANGE:
+		ret = btrfs_collapse_range(inode, offset, len);
+		break;
 	default:
 		ret = -EOPNOTSUPP;
 	}
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate()
  2026-04-18 14:38 ` [PATCH 2/3] btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate() Paul Richards
@ 2026-04-19  1:29   ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2026-04-19  1:29 UTC (permalink / raw)
  To: Paul Richards, linux-btrfs; +Cc: dsterba



在 2026/4/19 00:08, Paul Richards 写道:

Commit message please, especially when you're implementing a new feature.

> Assisted-by: Amazon Q Developer:auto/unknown
> Signed-off-by: Paul Richards <paul.richards@gmail.com>
> ---
>   fs/btrfs/file.c | 300 ++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 300 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0b5cc3cec675..99d24bef5f88 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -39,6 +39,13 @@
>   #include "super.h"
>   #include "print-tree.h"
>   
> +/*
> + * When we shift extents as part of fallocate insert or collapse we commit
> + * and cycle the transaction every BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL
> + * extents to avoid accumulating too many changes in one transaction.
> + */
> +#define BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL (32)
> +

As mentioned in the cover letter, you can go with 
btrfs_should_end_transaction() to do the check, which I believe is 
better than a fixed threshold.

>   /*
>    * Unlock folio after btrfs_file_write() is done with it.
>    */
> @@ -2648,6 +2655,296 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
>   	return ret;
>   }
>   
> +/*
> + * Update the extent back-reference in the extent tree when a
> + * BTRFS_EXTENT_DATA_KEY item is shifted to a new logical file offset.
> + * Drops the back-reference at old_file_offset and adds one at new_file_offset.
> + * Holes (disk_bytenr == 0) and inline extents have no back-references and
> + * must not be passed to this function.
> + */
> +static int btrfs_shift_extent_backref(struct btrfs_trans_handle *trans,
> +				      struct btrfs_root *root, u64 ino,
> +				      u64 disk_bytenr, u64 num_bytes,
> +				      u64 old_file_offset, u64 new_file_offset)
> +{
> +	struct btrfs_ref ref = {
> +		.bytenr = disk_bytenr,
> +		.num_bytes = num_bytes,
> +		.parent = 0,
> +		.owning_root = btrfs_root_id(root),
> +		.ref_root = btrfs_root_id(root),
> +	};
> +	int ret;
> +
> +	ref.action = BTRFS_DROP_DELAYED_REF;
> +	btrfs_init_data_ref(&ref, ino, old_file_offset, 0, false);
> +	ret = btrfs_free_extent(trans, &ref);
> +	if (unlikely(ret)) {
> +		btrfs_abort_transaction(trans, ret);
> +		return ret;
> +	}
> +
> +	ref.action = BTRFS_ADD_DELAYED_REF;
> +	btrfs_init_data_ref(&ref, ino, new_file_offset, 0, false);
> +	ret = btrfs_inc_extent_ref(trans, &ref);
> +	if (unlikely(ret))
> +		btrfs_abort_transaction(trans, ret);
> +
> +	return ret;
> +}
> +
> +static int btrfs_collapse_range(struct inode *inode, loff_t offset, loff_t len)
> +{
> +	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> +	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	u64 end = offset + len;
> +	struct btrfs_path *path;

I'd recommend to go auto free for @path.

> +	struct btrfs_trans_handle *trans = NULL;
> +	struct extent_state *cached_state = NULL;
> +	struct extent_buffer *leaf;
> +	struct btrfs_key key;
> +	struct btrfs_key new_key;
> +	u64 ino = btrfs_ino(BTRFS_I(inode));
> +	int ret;
> +
> +	if (!IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
> +		return -EOPNOTSUPP;
> +
> +	/* offset and len must be sector-aligned */
> +	if (!IS_ALIGNED(offset | len, fs_info->sectorsize))

I'd prefer a more human readable separate checks on @offset and @len 
instead.

> +		return -EINVAL;
> +
> +	/* collapse range must not reach or pass EOF - use ftruncate instead */
> +	if (end >= inode->i_size)

And please also do a overflow check before this one.

> +		return -EINVAL;
> +
> +	btrfs_info(fs_info,
> +		   "btrfs_collapse_range: ino=%llu offset=%lld len=%lld i_size=%lld",
> +		   btrfs_ino(BTRFS_I(inode)), offset, len, inode->i_size);

Either change it to a trace event, or just drop it.

For internal debugging, introduce trace_printk()/printk() in a dedicated 
debug patch instead and never send that debug patch.

> +
> +	/* wait for any ordered extents in [offset, i_size) to complete */
> +	ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset,
> +				       inode->i_size - offset);

I'd prefer to use (u64)-1 as the length to be extra safe.

It's possible that we have some OE beyond the rounded up isize.

> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * Flush dirty pages and invalidate the page cache for [offset, i_size)
> +	 * before any extent manipulation, following the ext4/xfs pattern.
> +	 * We hold BTRFS_ILOCK_MMAP so no new dirty pages can appear during
> +	 * the operation. The page cache must be empty before we shift extent
> +	 * keys so that stale pages at the old offsets cannot be read back
> +	 * after the collapse.
> +	 */
> +	ret = filemap_write_and_wait_range(inode->i_mapping, offset, LLONG_MAX);
> +	if (ret)
> +		return ret;
> +	btrfs_info(fs_info,
> +		   "btrfs_collapse_range: nrpages before upfront invalidate=%lu (pages before offset=%llu not invalidated)",
> +		   inode->i_mapping->nrpages, offset >> PAGE_SHIFT);
> +	truncate_pagecache_range(inode, offset, LLONG_MAX);
> +	btrfs_info(fs_info,
> +		   "btrfs_collapse_range: nrpages after upfront invalidate=%lu (expected %llu)",
> +		   inode->i_mapping->nrpages, offset >> PAGE_SHIFT);
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Lock the range [offset, end) and invalidate the page cache within
> +	 * it. btrfs_punch_hole_lock_range() calls truncate_pagecache_range()
> +	 * internally in a retry loop.
> +	 */
> +	btrfs_punch_hole_lock_range(inode, offset, end - 1, &cached_state);

Please lock the range to (u64)-1.

You're modifying all the underlay extent maps beyond @offset.

> +
> +	/*
> +	 * Remove all extents in [offset, end). Passing NULL for extent_info
> +	 * means we are punching a hole. btrfs_replace_file_extents() splits
> +	 * any extent straddling the boundaries, drops extent refs, and
> +	 * returns a transaction handle for us to reuse.
> +	 */
> +	ret = btrfs_replace_file_extents(BTRFS_I(inode), path,
> +					 offset, end - 1, NULL, &trans);

Not sure if this will work on fses without no-holes (aka, -O ^no-holes 
option for mkfs).

This will insert hole file extents to fill range [@offset, @offset + 
@len), with ^no-holes, there will be file extent items with disk_bytenr 
== 0.

> +	btrfs_info(fs_info,
> +		   "btrfs_collapse_range: btrfs_replace_file_extents ret=%d nrpages=%lu",
> +		   ret, inode->i_mapping->nrpages);
> +	if (ret)
> +		goto out_unlock;
> +
> +	/*
> +	 * Shift all BTRFS_EXTENT_DATA_KEY items with key.offset >= end
> +	 * leftward by len bytes.
> +	 *
> +	 * We iterate forward (lowest offset first) which is safe for a
> +	 * left-shift because the new key is always less than the old one,
> +	 * so we never collide with a key we haven't visited yet.
> +	 */
> +	key.objectid = ino;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = end;
> +
> +	int nr_shifted = 0;
> +	while (1) {
> +		struct btrfs_file_extent_item *fi;
> +		u64 disk_bytenr;
> +		u64 num_bytes;
> +		u64 extent_offset;
> +		int extent_type;
> +
> +		ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
> +		if (ret < 0)
> +			goto out_trans;
> +
> +		/* If no exact match, slot points at the next item - that's fine */
> +
> +		leaf = path->nodes[0];
> +		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
> +			ret = btrfs_next_leaf(root, path);

This is not safe.
The next leaf may not be COWed.

If you check all btrfs_next_leaf() usages after a btrfs_search_slot() 
call, the btrfs_search_slot() is nevered called with @cow == 1.

> +			if (ret < 0)
> +				goto out_trans;
> +			if (ret > 0) {
> +				/* No more items - we are done shifting */
> +				ret = 0;
> +				break;
> +			}
> +			leaf = path->nodes[0];
> +		}
> +
> +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +
> +		/* Stop if we've moved past this inode's extent data items */
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> +			ret = 0;
> +			break;
> +		}
> +
> +		btrfs_info(fs_info,
> +			   "btrfs_collapse_range: shifting key offset %llu -> %llu",
> +			   key.offset, key.offset - len);
> +
> +		fi = btrfs_item_ptr(leaf, path->slots[0],
> +				    struct btrfs_file_extent_item);
> +		extent_type = btrfs_file_extent_type(leaf, fi);
> +		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
> +		num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
> +		extent_offset = btrfs_file_extent_offset(leaf, fi);
> +
> +		memcpy(&new_key, &key, sizeof(new_key));
> +		new_key.offset -= len;
> +		btrfs_set_item_key_safe(trans, path, &new_key);

For ^no-holes, this will cause problems, as there can be a hole extent 
with exactly the same offset.

> +
> +		/*
> +		 * Update the back-reference in the extent tree to reflect the
> +		 * new logical file offset. Holes (disk_bytenr == 0) and inline
> +		 * extents have no back-references to update.
> +		 */
> +		if (extent_type != BTRFS_FILE_EXTENT_INLINE && disk_bytenr > 0) {
> +			ret = btrfs_shift_extent_backref(trans, root, ino,
> +					disk_bytenr, num_bytes,
> +					key.offset - extent_offset,
> +					new_key.offset - extent_offset);
> +			if (unlikely(ret))
> +				goto out_trans;
> +		}
> +
> +		/* Advance to the next item on the next iteration */
> +		key.offset = new_key.offset + 1;

This is very ineffcient.

You're calling btrfs_search_slot() for every file extent item, just 
please don't.

Just handle all involved file extent items inside the leaf in one go.

> +		nr_shifted++;
> +
> +		/*
> +		 * Cycle the transaction every N items to avoid holding a
> +		 * single transaction open across a large number of extent
> +		 * items. btrfs_set_item_key_safe() modifies leaves in-place
> +		 * so we won't hit -ENOSPC; we use a simple counter instead.
> +		 * Update the inode at each cycle point so it is consistent
> +		 * on disk if a crash occurs mid-loop.
> +		 */
> +		if (nr_shifted % BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL == 0) {
> +			btrfs_info(fs_info,
> +			   "btrfs_collapse_range: cycling transaction, nr_shifted=%d", nr_shifted);
> +			inode_inc_iversion(inode);
> +			inode_set_mtime_to_ts(inode,
> +					      inode_set_ctime_current(inode));
> +			ret = btrfs_update_inode(trans, BTRFS_I(inode));
> +			if (ret) {
> +				btrfs_release_path(path);
> +				goto out_trans;
> +			}
> +			btrfs_end_transaction(trans);
> +			btrfs_btree_balance_dirty(fs_info);
> +			trans = btrfs_start_transaction(root, 1);
> +			if (IS_ERR(trans)) {
> +				ret = PTR_ERR(trans);
> +				trans = NULL;
> +				btrfs_release_path(path);
> +				goto out_unlock;
> +			}
> +		}
> +
> +		btrfs_release_path(path);
> +	}

The whole loop can be extracted into a helper.

The difference between insert and collapse is just the difference in the 
file offset shift direction (pluse for insert, minus to collapse).

Having two similar copies between insert and collapse is never going to 
improve the maintenanceability.

> +
> +	if (ret)
> +		goto out_trans;
> +
> +	/* Update i_size and the on-disk inode */
> +	btrfs_info(fs_info,
> +		   "btrfs_collapse_range: updating i_size %lld -> %lld nrpages=%lu",
> +		   inode->i_size, inode->i_size - len, inode->i_mapping->nrpages);
> +	/*
> +	 * Drop the extent map cache for [offset, i_size) so that subsequent
> +	 * reads re-load the correct mappings from the btree. The key-shift
> +	 * loop updated the btree but the in-memory extent map cache still
> +	 * has entries at the old logical offsets.
> +	 */
> +	btrfs_drop_extent_map_range(BTRFS_I(inode), offset, (u64)-1, false);
> +	inode_inc_iversion(inode);
> +	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
> +	i_size_write(inode, inode->i_size - len);
> +	btrfs_inode_safe_disk_i_size_write(BTRFS_I(inode), 0);
> +	ret = btrfs_update_inode(trans, BTRFS_I(inode));
> +
> +out_trans:
> +	if (trans) {
> +		if (ret)
> +			btrfs_end_transaction(trans);

Not sure if we're safe to end the transaction here. If we a critical 
error, there may be file extents moved but some are not.

But at the same time, we may have commit a transaction halfway, thus the 
inconsitency may already be there.

Is there any docs about what is the expected behavior if INSERT/COLLAPSE 
failed halfway?

Should the fs fully revert to the original state (not sure if even 
possible) or something else?

> +		else
> +			ret = btrfs_end_transaction(trans);
> +	}
> +out_unlock:
> +	btrfs_unlock_extent(&BTRFS_I(inode)->io_tree, offset, end - 1,
> +			    &cached_state);
> +	btrfs_info(fs_info,
> +		   "btrfs_collapse_range: post-unlock ret=%d i_size=%lld, invalidating page cache from %lld",
> +		   ret, inode->i_size, offset);
> +	if (IS_ENABLED(CONFIG_BTRFS_DEBUG) && !ret) {
> +		/*
> +		 * These are expected to be no-ops: ordered extents were drained
> +		 * at the start of this function and BTRFS_ILOCK_MMAP has been
> +		 * held throughout, so no new writes could have been submitted.
> +		 * The page cache was emptied from offset onwards upfront and
> +		 * nrpages in that region stayed 0 throughout the operation.
> +		 * Pages before offset are unaffected and may still be cached.
> +		 */
> +		int wait_ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset,
> +						    inode->i_size);

Hide it behind DEBUG doesn't mean you can add whatever operations.

If you want to add this, please explain why.
If you can not explain it, just don't add in the formal patch.

Thanks,
Qu

> +		btrfs_info(fs_info,
> +			   "btrfs_collapse_range: post-shift wait_ordered ret=%d",
> +			   wait_ret);
> +		ASSERT(wait_ret == 0);
> +		ASSERT(filemap_range_has_page(inode->i_mapping, offset,
> +					      LLONG_MAX) == false);
> +		truncate_pagecache_range(inode, offset, LLONG_MAX);
> +		btrfs_info(fs_info,
> +			   "btrfs_collapse_range: truncate_pagecache_range done");
> +		ASSERT(filemap_range_has_page(inode->i_mapping, offset,
> +					      LLONG_MAX) == false);
> +	}
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
>   static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
>   {
>   	struct inode *inode = file_inode(file);
> @@ -3296,6 +3593,9 @@ static long btrfs_fallocate(struct file *file, int mode,
>   	case FALLOC_FL_PUNCH_HOLE:
>   		ret = btrfs_punch_hole(file, offset, len);
>   		break;
> +	case FALLOC_FL_COLLAPSE_RANGE:
> +		ret = btrfs_collapse_range(inode, offset, len);
> +		break;
>   	default:
>   		ret = -EOPNOTSUPP;
>   	}


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 3/3] btrfs: support for FALLOC_FL_INSERT_RANGE in btrfs_fallocate()
  2026-04-18 14:38 [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE Paul Richards
  2026-04-18 14:38 ` [PATCH 1/3] btrfs: refactor btrfs_fallocate() ahead of supporting more modes Paul Richards
  2026-04-18 14:38 ` [PATCH 2/3] btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate() Paul Richards
@ 2026-04-18 14:38 ` Paul Richards
  2026-04-19  4:44   ` Qu Wenruo
  2026-04-19  0:25 ` [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE Qu Wenruo
  3 siblings, 1 reply; 11+ messages in thread
From: Paul Richards @ 2026-04-18 14:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Paul Richards

Assisted-by: Amazon Q Developer:auto/unknown
Signed-off-by: Paul Richards <paul.richards@gmail.com>
---
 fs/btrfs/file.c | 238 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 238 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 99d24bef5f88..b708bb6a1082 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2945,6 +2945,241 @@ static int btrfs_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 	return ret;
 }
 
+static int btrfs_insert_range(struct inode *inode, loff_t offset, loff_t len)
+{
+	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_path *path;
+	struct btrfs_trans_handle *trans = NULL;
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	struct btrfs_key new_key;
+	u64 ino = btrfs_ino(BTRFS_I(inode));
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
+		return -EOPNOTSUPP;
+
+	/* offset and len must be sector-aligned */
+	if (!IS_ALIGNED(offset | len, fs_info->sectorsize))
+		return -EINVAL;
+
+	/* offset must be within the file - use ftruncate to extend */
+	if (offset >= inode->i_size)
+		return -EINVAL;
+
+	/* result must not exceed the maximum file size */
+	if (len > inode->i_sb->s_maxbytes - inode->i_size)
+		return -EFBIG;
+
+	btrfs_info(fs_info,
+		   "btrfs_insert_range: ino=%llu offset=%lld len=%lld i_size=%lld",
+		   btrfs_ino(BTRFS_I(inode)), offset, len, inode->i_size);
+
+	/* wait for any ordered extents in [offset, i_size) to complete */
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset,
+				       inode->i_size - offset);
+	if (ret)
+		return ret;
+
+	/*
+	 * Flush and invalidate the page cache for [offset, i_size) upfront,
+	 * following the same pattern as btrfs_collapse_range().
+	 */
+	ret = filemap_write_and_wait_range(inode->i_mapping, offset, LLONG_MAX);
+	if (ret)
+		return ret;
+	truncate_pagecache_range(inode, offset, LLONG_MAX);
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		trans = NULL;
+		goto out_path;
+	}
+
+	/*
+	 * Shift all BTRFS_EXTENT_DATA_KEY items with key.offset >= offset
+	 * rightward by len bytes.
+	 *
+	 * We must iterate in reverse order (highest offset first) to avoid
+	 * colliding with a key we haven't shifted yet - shifting forward
+	 * would overwrite the next item's key before we process it.
+	 *
+	 * No pre-splitting of straddling extents is needed. If an extent
+	 * straddles offset, the left portion (key.offset < offset) stays
+	 * in place and the right portion is shifted. Both reference the
+	 * same physical extent via their existing extent_offset fields,
+	 * which remain correct after the key shift.
+	 */
+
+	int nr_shifted = 0;
+
+	/* Find the last extent item for this inode */
+	key.objectid = ino;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = (u64)-1;
+
+	while (1) {
+		struct btrfs_file_extent_item *fi;
+		u64 disk_bytenr;
+		u64 num_bytes;
+		u64 extent_offset;
+		int extent_type;
+
+		ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
+		if (ret < 0)
+			goto out_trans;
+
+		/*
+		 * Search for (ino, EXTENT_DATA, -1) will never find an exact
+		 * match, so ret == 1 and slot points one past the last item.
+		 * Step back one slot to land on the last extent item.
+		 */
+		if (path->slots[0] == 0) {
+			/* No items at all - nothing to shift */
+			ret = 0;
+			break;
+		}
+		path->slots[0]--;
+
+		leaf = path->nodes[0];
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+		/* If we've gone past this inode's items, we are done */
+		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
+			ret = 0;
+			break;
+		}
+
+		/* If this item is before the insertion point, we are done */
+		if (key.offset < offset) {
+			ret = 0;
+			break;
+		}
+
+		btrfs_info(fs_info,
+			   "btrfs_insert_range: shifting key offset %llu -> %llu",
+			   key.offset, key.offset + len);
+
+		fi = btrfs_item_ptr(leaf, path->slots[0],
+				    struct btrfs_file_extent_item);
+		extent_type = btrfs_file_extent_type(leaf, fi);
+		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+		num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
+		extent_offset = btrfs_file_extent_offset(leaf, fi);
+
+		/*
+		 * Inline extents must have key.offset == 0 and cannot be
+		 * shifted to a non-zero offset - the tree checker enforces
+		 * this invariant. Reject with -EOPNOTSUPP.
+		 *
+		 * An inline extent can only exist if the file's entire content
+		 * fits within a single sector, meaning it is the only extent
+		 * item for this inode. It will therefore always be the first
+		 * item we encounter in the reverse iteration, before any keys
+		 * have been shifted, so bailing here leaves the file in a
+		 * consistent state.
+		 *
+		 * TODO: support this case by converting the inline extent to
+		 * a regular extent first, then shifting it. This would allow
+		 * INSERT_RANGE on small files, which xfs supports.
+		 */
+		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
+			ret = -EOPNOTSUPP;
+			btrfs_release_path(path);
+			goto out_trans;
+		}
+
+		memcpy(&new_key, &key, sizeof(new_key));
+		new_key.offset += len;
+		btrfs_set_item_key_safe(trans, path, &new_key);
+
+		/* Update back-reference: drop old offset, add new offset */
+		if (extent_type != BTRFS_FILE_EXTENT_INLINE && disk_bytenr > 0) {
+			ret = btrfs_shift_extent_backref(trans, root, ino,
+					disk_bytenr, num_bytes,
+					key.offset - extent_offset,
+					new_key.offset - extent_offset);
+			if (unlikely(ret))
+				goto out_trans;
+		}
+
+		/*
+		 * Step back to the previous item for the next iteration.
+		 * If we've reached slot 0 we need to move to the previous leaf.
+		 */
+		nr_shifted++;
+		if (nr_shifted % BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL == 0) {
+			btrfs_info(fs_info,
+			   "btrfs_insert_range: cycling transaction, nr_shifted=%d", nr_shifted);
+
+			inode_inc_iversion(inode);
+			inode_set_mtime_to_ts(inode,
+					      inode_set_ctime_current(inode));
+			ret = btrfs_update_inode(trans, BTRFS_I(inode));
+			if (ret) {
+				btrfs_release_path(path);
+				goto out_trans;
+			}
+			btrfs_end_transaction(trans);
+			btrfs_btree_balance_dirty(fs_info);
+			trans = btrfs_start_transaction(root, 1);
+			if (IS_ERR(trans)) {
+				ret = PTR_ERR(trans);
+				trans = NULL;
+				btrfs_release_path(path);
+				goto out_path;
+			}
+		}
+
+		/*
+		 * Set key.offset to one below the current item so the next
+		 * btrfs_search_slot lands on the item before it.
+		 */
+		if (key.offset == 0) {
+			ret = 0;
+			break;
+		}
+		key.offset--;
+		btrfs_release_path(path);
+	}
+
+	if (ret)
+		goto out_trans;
+
+	/*
+	 * Drop stale extent map entries so subsequent reads re-load correct
+	 * mappings from the btree.
+	 */
+	btrfs_drop_extent_map_range(BTRFS_I(inode), offset, (u64)-1, false);
+
+	btrfs_info(fs_info,
+		   "btrfs_insert_range: updating i_size %lld -> %lld",
+		   inode->i_size, inode->i_size + len);
+	inode_inc_iversion(inode);
+	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+	i_size_write(inode, inode->i_size + len);
+	btrfs_inode_safe_disk_i_size_write(BTRFS_I(inode), 0);
+	ret = btrfs_update_inode(trans, BTRFS_I(inode));
+
+out_trans:
+	if (trans) {
+		if (ret)
+			btrfs_end_transaction(trans);
+		else
+			ret = btrfs_end_transaction(trans);
+	}
+out_path:
+	btrfs_free_path(path);
+	btrfs_info(fs_info, "btrfs_insert_range: returning %d", ret);
+	return ret;
+}
+
 static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 {
 	struct inode *inode = file_inode(file);
@@ -3596,6 +3831,9 @@ static long btrfs_fallocate(struct file *file, int mode,
 	case FALLOC_FL_COLLAPSE_RANGE:
 		ret = btrfs_collapse_range(inode, offset, len);
 		break;
+	case FALLOC_FL_INSERT_RANGE:
+		ret = btrfs_insert_range(inode, offset, len);
+		break;
 	default:
 		ret = -EOPNOTSUPP;
 	}
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] btrfs: support for FALLOC_FL_INSERT_RANGE in btrfs_fallocate()
  2026-04-18 14:38 ` [PATCH 3/3] btrfs: support for FALLOC_FL_INSERT_RANGE " Paul Richards
@ 2026-04-19  4:44   ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2026-04-19  4:44 UTC (permalink / raw)
  To: Paul Richards, linux-btrfs; +Cc: dsterba



在 2026/4/19 00:08, Paul Richards 写道:
> Assisted-by: Amazon Q Developer:auto/unknown
> Signed-off-by: Paul Richards <paul.richards@gmail.com>
> ---
>   fs/btrfs/file.c | 238 ++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 238 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 99d24bef5f88..b708bb6a1082 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2945,6 +2945,241 @@ static int btrfs_collapse_range(struct inode *inode, loff_t offset, loff_t len)
>   	return ret;
>   }
>   
> +static int btrfs_insert_range(struct inode *inode, loff_t offset, loff_t len)
> +{
> +	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> +	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	struct btrfs_path *path;
> +	struct btrfs_trans_handle *trans = NULL;
> +	struct extent_buffer *leaf;
> +	struct btrfs_key key;
> +	struct btrfs_key new_key;
> +	u64 ino = btrfs_ino(BTRFS_I(inode));
> +	int ret;
> +
> +	if (!IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
> +		return -EOPNOTSUPP;
> +
> +	/* offset and len must be sector-aligned */
> +	if (!IS_ALIGNED(offset | len, fs_info->sectorsize))
> +		return -EINVAL;
> +
> +	/* offset must be within the file - use ftruncate to extend */
> +	if (offset >= inode->i_size)
> +		return -EINVAL;
> +
> +	/* result must not exceed the maximum file size */
> +	if (len > inode->i_sb->s_maxbytes - inode->i_size)
> +		return -EFBIG;
> +
> +	btrfs_info(fs_info,
> +		   "btrfs_insert_range: ino=%llu offset=%lld len=%lld i_size=%lld",
> +		   btrfs_ino(BTRFS_I(inode)), offset, len, inode->i_size);
> +
> +	/* wait for any ordered extents in [offset, i_size) to complete */
> +	ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset,
> +				       inode->i_size - offset);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * Flush and invalidate the page cache for [offset, i_size) upfront,
> +	 * following the same pattern as btrfs_collapse_range().
> +	 */
> +	ret = filemap_write_and_wait_range(inode->i_mapping, offset, LLONG_MAX);
> +	if (ret)
> +		return ret;
> +	truncate_pagecache_range(inode, offset, LLONG_MAX);
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	trans = btrfs_start_transaction(root, 1);

As we may iterate through a lot of file items, the nr_items == 1 is 
definitely not correct.

On the other hand I'm not sure what's the correct nr_item to be passed 
here either.
Thus this may cause problems for metadata reservation. This also applies 
to the previous patch.

> +	if (IS_ERR(trans)) {
> +		ret = PTR_ERR(trans);
> +		trans = NULL;
> +		goto out_path;
> +	}
> +
> +	/*
> +	 * Shift all BTRFS_EXTENT_DATA_KEY items with key.offset >= offset
> +	 * rightward by len bytes.
> +	 *
> +	 * We must iterate in reverse order (highest offset first) to avoid
> +	 * colliding with a key we haven't shifted yet - shifting forward
> +	 * would overwrite the next item's key before we process it.

OK, this explains why we can not have a shared helper to do the file 
offset shift.

> +	 *
> +	 * No pre-splitting of straddling extents is needed. If an extent
> +	 * straddles offset, the left portion (key.offset < offset) stays
> +	 * in place and the right portion is shifted. Both reference the
> +	 * same physical extent via their existing extent_offset fields,
> +	 * which remain correct after the key shift.
> +	 */
> +
> +	int nr_shifted = 0;
> +
> +	/* Find the last extent item for this inode */
> +	key.objectid = ino;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = (u64)-1;
> +
> +	while (1) {
> +		struct btrfs_file_extent_item *fi;
> +		u64 disk_bytenr;
> +		u64 num_bytes;
> +		u64 extent_offset;
> +		int extent_type;
> +
> +		ret = btrfs_search_slot(trans, root, &key, path, 0, 1);

Again you're doing btrfs_search_slot() for every file item, which is 
very inefficient.

> +		if (ret < 0)
> +			goto out_trans;
> +
> +		/*
> +		 * Search for (ino, EXTENT_DATA, -1) will never find an exact
> +		 * match, so ret == 1 and slot points one past the last item.
> +		 * Step back one slot to land on the last extent item.
> +		 */
> +		if (path->slots[0] == 0) {
> +			/* No items at all - nothing to shift */
> +			ret = 0;
> +			break;
> +		}
> +		path->slots[0]--;
> +
> +		leaf = path->nodes[0];
> +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +
> +		/* If we've gone past this inode's items, we are done */
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> +			ret = 0;
> +			break;
> +		}
> +
> +		/* If this item is before the insertion point, we are done */
> +		if (key.offset < offset) {
> +			ret = 0;
> +			break;
> +		}
> +
> +		btrfs_info(fs_info,
> +			   "btrfs_insert_range: shifting key offset %llu -> %llu",
> +			   key.offset, key.offset + len);
> +
> +		fi = btrfs_item_ptr(leaf, path->slots[0],
> +				    struct btrfs_file_extent_item);
> +		extent_type = btrfs_file_extent_type(leaf, fi);
> +		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
> +		num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
> +		extent_offset = btrfs_file_extent_offset(leaf, fi);
> +
> +		/*
> +		 * Inline extents must have key.offset == 0 and cannot be
> +		 * shifted to a non-zero offset - the tree checker enforces
> +		 * this invariant. Reject with -EOPNOTSUPP.
> +		 *
> +		 * An inline extent can only exist if the file's entire content
> +		 * fits within a single sector, meaning it is the only extent
> +		 * item for this inode. It will therefore always be the first
> +		 * item we encounter in the reverse iteration, before any keys
> +		 * have been shifted, so bailing here leaves the file in a
> +		 * consistent state.
> +		 *
> +		 * TODO: support this case by converting the inline extent to
> +		 * a regular extent first, then shifting it. This would allow
> +		 * INSERT_RANGE on small files, which xfs supports.
> +		 */
> +		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> +			ret = -EOPNOTSUPP;
> +			btrfs_release_path(path);
> +			goto out_trans;
> +		}
> +
> +		memcpy(&new_key, &key, sizeof(new_key));
> +		new_key.offset += len;
> +		btrfs_set_item_key_safe(trans, path, &new_key);
> +
> +		/* Update back-reference: drop old offset, add new offset */
> +		if (extent_type != BTRFS_FILE_EXTENT_INLINE && disk_bytenr > 0) {
> +			ret = btrfs_shift_extent_backref(trans, root, ino,
> +					disk_bytenr, num_bytes,
> +					key.offset - extent_offset,
> +					new_key.offset - extent_offset);
> +			if (unlikely(ret))
> +				goto out_trans;
> +		}
> +
> +		/*
> +		 * Step back to the previous item for the next iteration.
> +		 * If we've reached slot 0 we need to move to the previous leaf.
> +		 */
> +		nr_shifted++;
> +		if (nr_shifted % BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL == 0) {
> +			btrfs_info(fs_info,
> +			   "btrfs_insert_range: cycling transaction, nr_shifted=%d", nr_shifted);
> +
> +			inode_inc_iversion(inode);
> +			inode_set_mtime_to_ts(inode,
> +					      inode_set_ctime_current(inode));
> +			ret = btrfs_update_inode(trans, BTRFS_I(inode));
> +			if (ret) {
> +				btrfs_release_path(path);
> +				goto out_trans;
> +			}
> +			btrfs_end_transaction(trans);
> +			btrfs_btree_balance_dirty(fs_info);
> +			trans = btrfs_start_transaction(root, 1);
> +			if (IS_ERR(trans)) {
> +				ret = PTR_ERR(trans);
> +				trans = NULL;
> +				btrfs_release_path(path);
> +				goto out_path;
> +			}
> +		}
> +
> +		/*
> +		 * Set key.offset to one below the current item so the next
> +		 * btrfs_search_slot lands on the item before it.
> +		 */
> +		if (key.offset == 0) {
> +			ret = 0;
> +			break;
> +		}
> +		key.offset--;
> +		btrfs_release_path(path);
> +	}
> +
> +	if (ret)
> +		goto out_trans;
> +
> +	/*
> +	 * Drop stale extent map entries so subsequent reads re-load correct
> +	 * mappings from the btree.
> +	 */
> +	btrfs_drop_extent_map_range(BTRFS_I(inode), offset, (u64)-1, false);
> +
> +	btrfs_info(fs_info,
> +		   "btrfs_insert_range: updating i_size %lld -> %lld",
> +		   inode->i_size, inode->i_size + len);
> +	inode_inc_iversion(inode);
> +	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
> +	i_size_write(inode, inode->i_size + len);
> +	btrfs_inode_safe_disk_i_size_write(BTRFS_I(inode), 0);
> +	ret = btrfs_update_inode(trans, BTRFS_I(inode));

After shifting all file extents to new file offsets, there is a hole in 
the range [@offset, @offset + @len), but there is no explicit hole 
extent for the range if ^no-holes.

This will cause fsck errors.

Thanks,
Qu

> +
> +out_trans:
> +	if (trans) {
> +		if (ret)
> +			btrfs_end_transaction(trans);
> +		else
> +			ret = btrfs_end_transaction(trans);
> +	}
> +out_path:
> +	btrfs_free_path(path);
> +	btrfs_info(fs_info, "btrfs_insert_range: returning %d", ret);
> +	return ret;
> +}
> +
>   static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
>   {
>   	struct inode *inode = file_inode(file);
> @@ -3596,6 +3831,9 @@ static long btrfs_fallocate(struct file *file, int mode,
>   	case FALLOC_FL_COLLAPSE_RANGE:
>   		ret = btrfs_collapse_range(inode, offset, len);
>   		break;
> +	case FALLOC_FL_INSERT_RANGE:
> +		ret = btrfs_insert_range(inode, offset, len);
> +		break;
>   	default:
>   		ret = -EOPNOTSUPP;
>   	}


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE
  2026-04-18 14:38 [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE Paul Richards
                   ` (2 preceding siblings ...)
  2026-04-18 14:38 ` [PATCH 3/3] btrfs: support for FALLOC_FL_INSERT_RANGE " Paul Richards
@ 2026-04-19  0:25 ` Qu Wenruo
  2026-04-19  5:08   ` Qu Wenruo
  3 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2026-04-19  0:25 UTC (permalink / raw)
  To: Paul Richards, linux-btrfs; +Cc: dsterba



在 2026/4/19 00:08, Paul Richards 写道:
> This series adds support for FALLOC_FL_COLLAPSE_RANGE and
> FALLOC_FL_INSERT_RANGE to btrfs_fallocate(). Both operations are
> already supported by ext4 and xfs. The userspace contract is
> documented in fallocate(2).
> 
> Patch 1 refactors btrfs_fallocate() to dispatch via a switch statement,
> moving punch_hole into its own function and decoupling locking from the
> per-operation helpers. This is similar to the implementaitons for ext4
> and xfs. The allocate-range and zero-range paths remain coupled since
> they share some setup logic.
> 
> Patches 2 and 3 add COLLAPSE_RANGE and INSERT_RANGE respectively.
> 
> == Implementation approach ==
> 
> For COLLAPSE_RANGE:
>   - The removed region [offset, offset+len) is punched out via
>     btrfs_replace_file_extents(), which handles boundary splitting.
>   - All EXTENT_DATA keys with key.offset >= offset+len are shifted
>     leftward by len in forward order.
> 
> For INSERT_RANGE:
>   - All EXTENT_DATA keys with key.offset >= offset are shifted rightward
>     by len in reverse order (required to avoid key collisions).
>   - No pre-splitting of straddling extents is needed: the left portion
>     of a straddling extent stays in place, the right portion is shifted;
>     both reference the same physical extent via their existing
>     extent_offset fields.
> 
> For each shifted key, the corresponding back-reference in the extent
> tree is updated via a shared helper btrfs_shift_extent_backref().
> 
> After the key-shift loop, btrfs_drop_extent_map_range() is called to
> invalidate the in-memory extent map cache. This is important for reads
> after the fallocate operation to ensure they obtain data for the new
> offsets.
> 
> The page cache is flushed and invalidated upfront (before any extent
> manipulation) following the ext4/xfs pattern. The inode lock
> (BTRFS_ILOCK_MMAP) is held throughout, preventing new dirty pages from
> appearing during the operation.

So far the explanation looks fine, but I'll need to dig deeper for each 
patch to be sure.

> 
> Transaction cycling: the key-shift loop cycles transactions every
> BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL (32) items to avoid
> holding a single transaction open across a large number of extents.
> 
> Both operations are gated on CONFIG_BTRFS_EXPERIMENTAL.
> 
> == Known limitations ==
> 
> INSERT_RANGE returns -EOPNOTSUPP for inlined files. Supporting inline
> files will require promoting the existing inline extent to a regular
> one, since inline extends are supported only at the very start of a
> file.

This can be addressed, e.g. by reading out the inlined extent into the 
page cache and redirty the new page cache.

But I'd say it's not a critical part thus I'm totally fine without this 
support for a while.

> 
> In the opposite direction, COLLAPSE_RANGE will not create inline files
> like it should if the remaining data is small enough.

This is even less important, there are a lot of cases that we don't 
inline a data extent, so keeping the extent uninlined is completely fine.

> 
> I intend to address both of these limitations.
> 
> == Testing ==
> 
> Tested with a Rust-based functional test suite covering:
>   - Collapse and insert at the start, middle of a file
>   - Multiple sequential operations on the same file
>   - Files with multiple extents (fsync between writes to force separate
>     extent items)
>   - Files with holes (explicit punch_hole and implicit sparse writes)
>   - Compressed extents (mount -o compress=zstd)
>   - Transaction cycling (interval reduced to 4 during testing, verified
>     in dmesg logs)
>   - Inline files, verified that -EOPNOTSUPP is returned.

I guess that tool has never verify the contents, nor multi-thread stress 
tests, e.g. fsstress?

> 
> The same tests pass on both btrfs and xfs (modulo the inline files).
> 
> I have not run fstests which I know contains tests for INSERT_RANGE
> and COLLAPSE_RANGE. I will do so.

Thus I'd prefer a fstest run before whatever your local tool.

> 
> == Questions for reviewers ==
> 
> 1. Transaction cycling interval: we use 32 items per cycle. Is this
>     the right threshold, or is there an established convention in btrfs
>     for this kind of loop?


We have btrfs_should_end_transaction() to do it, thus you do not need to 
use a locally defined threshold.

> 
> 2. Extent lock scope for collapse: we hold the extent lock only on
>     [offset, offset+len) during the hole punch, not on the full
>     [offset, i_size) range that the key-shift loop operates on. Is
>     this safe, or should we lock the full affected range?

I think you should hold the lock range [offset, U64_MAX).

Especially you have already dropped all cache for range [offset, 
U64_MAX) before the lock.

The same will apply for insert, but it looks like you have not locked 
any extent range for insert?

> 
> 3. CONFIG_BTRFS_EXPERIMENTAL gate: is this the right gate for these
>     operations, or should they be unconditionally available?

It's strongly recommended in this case.

> 
> == Notes ==
> 
> This is my first kernel contribution. Development was significantly
> assisted by an LLM (Amazon Q Developer). The implementation, testing,
> and final review decisions are my own.

I'm very interested in how the LLM is involved.

You mentioned "implementation, testing, review" are on your own, this 
looks like everything is on your own.

I don't think you're only using LLM to help understanding the code, thus 
it looks like implementation is contributed by the LLM.

Please remember, you're the one explaining/defending the code.
But as long as you can explain/defend the code, I'm fine with that.

Still a newcomer with not a small code change will always attract more 
scrutiny, so please don't expect rapid review/merge/etc.

> 
> Various btrfs_info() print statements, assertions, and comments that
> were useful during development and testing have been left in place,
> but will be removed or streamlined in the next revision.

Please don't. It's pretty easy to leave such scaffolding in the final 
code, and it will be time consuming to making they are properly removed 
in the final version.

Furthermore, dmesg based printk() is slow thus it can mask a lot of race 
related bugs just by slowing down the opeartions.

For critical operations, introduce trace events for them, there are 
examples across several inode operations.

If you really want to debug during development, please introduce 
temporary trace_printk()s in a dedicated patch, and just don't send that 
debug patch.

Thanks,
Qu

> 
> Paul Richards (3):
>    btrfs: refactor btrfs_fallocate() ahead of supporting more modes
>    btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate()
>    btrfs: support for FALLOC_FL_INSERT_RANGE in btrfs_fallocate()
> 
>   fs/btrfs/file.c | 601 ++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 578 insertions(+), 23 deletions(-)
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE
  2026-04-19  0:25 ` [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE Qu Wenruo
@ 2026-04-19  5:08   ` Qu Wenruo
  2026-04-19 18:40     ` Paul Richards
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2026-04-19  5:08 UTC (permalink / raw)
  To: Paul Richards, linux-btrfs; +Cc: dsterba



在 2026/4/19 09:55, Qu Wenruo 写道:
> 
> 
> 在 2026/4/19 00:08, Paul Richards 写道:
>> This series adds support for FALLOC_FL_COLLAPSE_RANGE and
>> FALLOC_FL_INSERT_RANGE to btrfs_fallocate(). Both operations are
>> already supported by ext4 and xfs. The userspace contract is
>> documented in fallocate(2).
>>
>> Patch 1 refactors btrfs_fallocate() to dispatch via a switch statement,
>> moving punch_hole into its own function and decoupling locking from the
>> per-operation helpers. This is similar to the implementaitons for ext4
>> and xfs. The allocate-range and zero-range paths remain coupled since
>> they share some setup logic.
>>
>> Patches 2 and 3 add COLLAPSE_RANGE and INSERT_RANGE respectively.
>>
>> == Implementation approach ==
>>
>> For COLLAPSE_RANGE:
>>   - The removed region [offset, offset+len) is punched out via
>>     btrfs_replace_file_extents(), which handles boundary splitting.
>>   - All EXTENT_DATA keys with key.offset >= offset+len are shifted
>>     leftward by len in forward order.
>>
>> For INSERT_RANGE:
>>   - All EXTENT_DATA keys with key.offset >= offset are shifted rightward
>>     by len in reverse order (required to avoid key collisions).
>>   - No pre-splitting of straddling extents is needed: the left portion
>>     of a straddling extent stays in place, the right portion is shifted;
>>     both reference the same physical extent via their existing
>>     extent_offset fields.
>>
>> For each shifted key, the corresponding back-reference in the extent
>> tree is updated via a shared helper btrfs_shift_extent_backref().

After looking into each patch, I do not think the low level direct file 
item change is a good idea, especially with your current implement:

- Can lock the inode for a very long time
   E.g. inserting a hole into the beginning of a very large file.
   We will lock the inode until all file items are iterated, which kills
   concurrency.

- Possible problems with metadata reservation

- Problems with ^no-holes collapse
   Will cause duplicated file offsets with hole file items.

On the other hand, with reflink the insert/collapse can even be 
implemented in user space, with a proper step setting, we can still 
allow concurrent read/write out of the reflink ranges.

This makes me wonder, is these features really that necessary?

If you know some programs actively utilize these features for real world 
benefits, but can not be done through reflink, please provide them.

Thanks,
Qu


>>
>> After the key-shift loop, btrfs_drop_extent_map_range() is called to
>> invalidate the in-memory extent map cache. This is important for reads
>> after the fallocate operation to ensure they obtain data for the new
>> offsets.
>>
>> The page cache is flushed and invalidated upfront (before any extent
>> manipulation) following the ext4/xfs pattern. The inode lock
>> (BTRFS_ILOCK_MMAP) is held throughout, preventing new dirty pages from
>> appearing during the operation.
> 
> So far the explanation looks fine, but I'll need to dig deeper for each 
> patch to be sure.
> 
>>
>> Transaction cycling: the key-shift loop cycles transactions every
>> BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL (32) items to avoid
>> holding a single transaction open across a large number of extents.
>>
>> Both operations are gated on CONFIG_BTRFS_EXPERIMENTAL.
>>
>> == Known limitations ==
>>
>> INSERT_RANGE returns -EOPNOTSUPP for inlined files. Supporting inline
>> files will require promoting the existing inline extent to a regular
>> one, since inline extends are supported only at the very start of a
>> file.
> 
> This can be addressed, e.g. by reading out the inlined extent into the 
> page cache and redirty the new page cache.
> 
> But I'd say it's not a critical part thus I'm totally fine without this 
> support for a while.
> 
>>
>> In the opposite direction, COLLAPSE_RANGE will not create inline files
>> like it should if the remaining data is small enough.
> 
> This is even less important, there are a lot of cases that we don't 
> inline a data extent, so keeping the extent uninlined is completely fine.
> 
>>
>> I intend to address both of these limitations.
>>
>> == Testing ==
>>
>> Tested with a Rust-based functional test suite covering:
>>   - Collapse and insert at the start, middle of a file
>>   - Multiple sequential operations on the same file
>>   - Files with multiple extents (fsync between writes to force separate
>>     extent items)
>>   - Files with holes (explicit punch_hole and implicit sparse writes)
>>   - Compressed extents (mount -o compress=zstd)
>>   - Transaction cycling (interval reduced to 4 during testing, verified
>>     in dmesg logs)
>>   - Inline files, verified that -EOPNOTSUPP is returned.
> 
> I guess that tool has never verify the contents, nor multi-thread stress 
> tests, e.g. fsstress?
> 
>>
>> The same tests pass on both btrfs and xfs (modulo the inline files).
>>
>> I have not run fstests which I know contains tests for INSERT_RANGE
>> and COLLAPSE_RANGE. I will do so.
> 
> Thus I'd prefer a fstest run before whatever your local tool.
> 
>>
>> == Questions for reviewers ==
>>
>> 1. Transaction cycling interval: we use 32 items per cycle. Is this
>>     the right threshold, or is there an established convention in btrfs
>>     for this kind of loop?
> 
> 
> We have btrfs_should_end_transaction() to do it, thus you do not need to 
> use a locally defined threshold.
> 
>>
>> 2. Extent lock scope for collapse: we hold the extent lock only on
>>     [offset, offset+len) during the hole punch, not on the full
>>     [offset, i_size) range that the key-shift loop operates on. Is
>>     this safe, or should we lock the full affected range?
> 
> I think you should hold the lock range [offset, U64_MAX).
> 
> Especially you have already dropped all cache for range [offset, 
> U64_MAX) before the lock.
> 
> The same will apply for insert, but it looks like you have not locked 
> any extent range for insert?
> 
>>
>> 3. CONFIG_BTRFS_EXPERIMENTAL gate: is this the right gate for these
>>     operations, or should they be unconditionally available?
> 
> It's strongly recommended in this case.
> 
>>
>> == Notes ==
>>
>> This is my first kernel contribution. Development was significantly
>> assisted by an LLM (Amazon Q Developer). The implementation, testing,
>> and final review decisions are my own.
> 
> I'm very interested in how the LLM is involved.
> 
> You mentioned "implementation, testing, review" are on your own, this 
> looks like everything is on your own.
> 
> I don't think you're only using LLM to help understanding the code, thus 
> it looks like implementation is contributed by the LLM.
> 
> Please remember, you're the one explaining/defending the code.
> But as long as you can explain/defend the code, I'm fine with that.
> 
> Still a newcomer with not a small code change will always attract more 
> scrutiny, so please don't expect rapid review/merge/etc.
> 
>>
>> Various btrfs_info() print statements, assertions, and comments that
>> were useful during development and testing have been left in place,
>> but will be removed or streamlined in the next revision.
> 
> Please don't. It's pretty easy to leave such scaffolding in the final 
> code, and it will be time consuming to making they are properly removed 
> in the final version.
> 
> Furthermore, dmesg based printk() is slow thus it can mask a lot of race 
> related bugs just by slowing down the opeartions.
> 
> For critical operations, introduce trace events for them, there are 
> examples across several inode operations.
> 
> If you really want to debug during development, please introduce 
> temporary trace_printk()s in a dedicated patch, and just don't send that 
> debug patch.
> 
> Thanks,
> Qu
> 
>>
>> Paul Richards (3):
>>    btrfs: refactor btrfs_fallocate() ahead of supporting more modes
>>    btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate()
>>    btrfs: support for FALLOC_FL_INSERT_RANGE in btrfs_fallocate()
>>
>>   fs/btrfs/file.c | 601 ++++++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 578 insertions(+), 23 deletions(-)
>>
> 
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE
  2026-04-19  5:08   ` Qu Wenruo
@ 2026-04-19 18:40     ` Paul Richards
  2026-04-19 22:30       ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Richards @ 2026-04-19 18:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, dsterba

Thanks Qu for your feedback! I wasn't expecting such a thorough review of
this early RFC. I will reply to a few of your comments below.

For the comments I don't reply to, please know that I have read them and
agree with them. If I follow up with another revision then I will address them
there.

On Sun, 19 Apr 2026 at 06:09, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> 在 2026/4/19 09:55, Qu Wenruo 写道:
> >
> >
> > 在 2026/4/19 00:08, Paul Richards 写道:
> >> This series adds support for FALLOC_FL_COLLAPSE_RANGE and
> >> FALLOC_FL_INSERT_RANGE to btrfs_fallocate(). Both operations are
> >> already supported by ext4 and xfs. The userspace contract is
> >> documented in fallocate(2).
> >>
> >> Patch 1 refactors btrfs_fallocate() to dispatch via a switch statement,
> >> moving punch_hole into its own function and decoupling locking from the
> >> per-operation helpers. This is similar to the implementaitons for ext4
> >> and xfs. The allocate-range and zero-range paths remain coupled since
> >> they share some setup logic.
> >>
> >> Patches 2 and 3 add COLLAPSE_RANGE and INSERT_RANGE respectively.
> >>
> >> == Implementation approach ==
> >>
> >> For COLLAPSE_RANGE:
> >>   - The removed region [offset, offset+len) is punched out via
> >>     btrfs_replace_file_extents(), which handles boundary splitting.
> >>   - All EXTENT_DATA keys with key.offset >= offset+len are shifted
> >>     leftward by len in forward order.
> >>
> >> For INSERT_RANGE:
> >>   - All EXTENT_DATA keys with key.offset >= offset are shifted rightward
> >>     by len in reverse order (required to avoid key collisions).
> >>   - No pre-splitting of straddling extents is needed: the left portion
> >>     of a straddling extent stays in place, the right portion is shifted;
> >>     both reference the same physical extent via their existing
> >>     extent_offset fields.
> >>
> >> For each shifted key, the corresponding back-reference in the extent
> >> tree is updated via a shared helper btrfs_shift_extent_backref().
>
> After looking into each patch, I do not think the low level direct file
> item change is a good idea, especially with your current implement:
>
> - Can lock the inode for a very long time
>    E.g. inserting a hole into the beginning of a very large file.
>    We will lock the inode until all file items are iterated, which kills
>    concurrency.
>
> - Possible problems with metadata reservation
>
> - Problems with ^no-holes collapse
>    Will cause duplicated file offsets with hole file items.
>
> On the other hand, with reflink the insert/collapse can even be
> implemented in user space, with a proper step setting, we can still
> allow concurrent read/write out of the reflink ranges.
>
> This makes me wonder, is these features really that necessary?
>
> If you know some programs actively utilize these features for real world
> benefits, but can not be done through reflink, please provide them.
>

There is a proprietary tool at my $dayjob which uses insert and collapse range.
This works great on ext4 and xfs, and I am personally interested in having it
work on btrfs.

Emulating in userspace is entirely possible using FICLONERANGE ioclt.. but
since ext4 doesn't support that operation this leaves userspace needing to
use one solution for ext4 (fallocate insert/collapse) and another for btrfs
(FICLONERANGE). By filling the gap of btrfs fallocate modes I was hoping
to simplify userspace and reduce friction in supporting btrfs.

However, on balance I agree with your opinion that emulating in userspace
is the best approach here. So for now I will pause my work on this patch series.

> >>
> >> == Testing ==
> >>
> >> Tested with a Rust-based functional test suite covering:
> >>   - Collapse and insert at the start, middle of a file
> >>   - Multiple sequential operations on the same file
> >>   - Files with multiple extents (fsync between writes to force separate
> >>     extent items)
> >>   - Files with holes (explicit punch_hole and implicit sparse writes)
> >>   - Compressed extents (mount -o compress=zstd)
> >>   - Transaction cycling (interval reduced to 4 during testing, verified
> >>     in dmesg logs)
> >>   - Inline files, verified that -EOPNOTSUPP is returned.
> >
> > I guess that tool has never verify the contents, nor multi-thread stress
> > tests, e.g. fsstress?
> >

My tests did read back and verify file contents. It was not multi threaded or
particularly stressy.

> >>
> >> The same tests pass on both btrfs and xfs (modulo the inline files).
> >>
> >> I have not run fstests which I know contains tests for INSERT_RANGE
> >> and COLLAPSE_RANGE. I will do so.
> >
> > Thus I'd prefer a fstest run before whatever your local tool.
> >

100% agree that fstests would be better. I started with my own tests
only to keep
things very simple while the code got off the ground. If I carry this work on I
will use fstests.

> >> == Notes ==
> >>
> >> This is my first kernel contribution. Development was significantly
> >> assisted by an LLM (Amazon Q Developer). The implementation, testing,
> >> and final review decisions are my own.
> >
> > I'm very interested in how the LLM is involved.
> >
> > You mentioned "implementation, testing, review" are on your own, this
> > looks like everything is on your own.
> >
> > I don't think you're only using LLM to help understanding the code, thus
> > it looks like implementation is contributed by the LLM.
> >
> > Please remember, you're the one explaining/defending the code.
> > But as long as you can explain/defend the code, I'm fine with that.
> >

My workflow with the LLM was this:

I provided a high level design, reference to the fallocate man page,
pointers to the implementation of clone range in btrfs, and the existing
fallocate implementations (for btrfs, ext4, and xfs). This was a couple
of paragraphs and a handful of links.

I prompted the LLM to produce a detailed design document for how
I could go about implementing insert and collapse for btrfs. The
document it produced is English text of around 2000 words. It
contains references to internal btrfs functions I needed and a
description of the logic I needed to implement. It has no code.

I reviewed that doc, asked clarifications, and the doc was revised.

I then crafted the code. I relied heavily on the design doc that the
LLM had written but did not follow it to the letter. During this
implementation work I asked the LLM to clarify a few points, and
the design doc was revised by the LLM.

During debugging most of the bugs I diagnosed myself. There was one
bug which stumped me however, so I asked the LLM to provide
hypotheses which I then worked through to solve the bug. I crafted
the code for the fix. ( The specific bug was stale/incorrect data being
read after the operation returned. The code was already invalidating
the page cache, but I was not aware of the extent map cache that
needed to be invalidated too. )

I wrote the test code myself.

Once the tests were passing the LLM was used to update the design
doc to align with the code I actually wrote. I also asked the LLM to
review my code prior to submitting it to the mailing list. I addressed
some of that feedback. As an example piece of feedback it
recommended I make use of unlikely() in "if" statements testing for
errors.

> > Still a newcomer with not a small code change will always attract more
> > scrutiny, so please don't expect rapid review/merge/etc.
> >

I am not in a rush. This has been a low-priority side quest going back
to 2023: https://lore.kernel.org/linux-btrfs/CAMosweitbAN5EPOgJCtrbkRAj1QSbsYt4uDGVMZ378YY7wjnRw@mail.gmail.com/

-- 
Paul Richards

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE
  2026-04-19 18:40     ` Paul Richards
@ 2026-04-19 22:30       ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2026-04-19 22:30 UTC (permalink / raw)
  To: Paul Richards; +Cc: linux-btrfs, dsterba



在 2026/4/20 04:10, Paul Richards 写道:
> Thanks Qu for your feedback! I wasn't expecting such a thorough review of
> this early RFC. I will reply to a few of your comments below.
> 
> For the comments I don't reply to, please know that I have read them and
> agree with them. If I follow up with another revision then I will address them
> there.
> 
> 
> On Sun, 19 Apr 2026 at 06:09, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>> 在 2026/4/19 09:55, Qu Wenruo 写道:
>>>
>>>
>>> 在 2026/4/19 00:08, Paul Richards 写道:
>>>> This series adds support for FALLOC_FL_COLLAPSE_RANGE and
>>>> FALLOC_FL_INSERT_RANGE to btrfs_fallocate(). Both operations are
>>>> already supported by ext4 and xfs. The userspace contract is
>>>> documented in fallocate(2).
>>>>
>>>> Patch 1 refactors btrfs_fallocate() to dispatch via a switch statement,
>>>> moving punch_hole into its own function and decoupling locking from the
>>>> per-operation helpers. This is similar to the implementaitons for ext4
>>>> and xfs. The allocate-range and zero-range paths remain coupled since
>>>> they share some setup logic.
>>>>
>>>> Patches 2 and 3 add COLLAPSE_RANGE and INSERT_RANGE respectively.
>>>>
>>>> == Implementation approach ==
>>>>
>>>> For COLLAPSE_RANGE:
>>>>    - The removed region [offset, offset+len) is punched out via
>>>>      btrfs_replace_file_extents(), which handles boundary splitting.
>>>>    - All EXTENT_DATA keys with key.offset >= offset+len are shifted
>>>>      leftward by len in forward order.
>>>>
>>>> For INSERT_RANGE:
>>>>    - All EXTENT_DATA keys with key.offset >= offset are shifted rightward
>>>>      by len in reverse order (required to avoid key collisions).
>>>>    - No pre-splitting of straddling extents is needed: the left portion
>>>>      of a straddling extent stays in place, the right portion is shifted;
>>>>      both reference the same physical extent via their existing
>>>>      extent_offset fields.
>>>>
>>>> For each shifted key, the corresponding back-reference in the extent
>>>> tree is updated via a shared helper btrfs_shift_extent_backref().
>>
>> After looking into each patch, I do not think the low level direct file
>> item change is a good idea, especially with your current implement:
>>
>> - Can lock the inode for a very long time
>>     E.g. inserting a hole into the beginning of a very large file.
>>     We will lock the inode until all file items are iterated, which kills
>>     concurrency.

After more reference to reflink code, I think this comment is a little 
over-reaction, as reflink also locks the involved range in one go, thus 
can lock the inode for a very long time too.

So long lock may not be a huge problem.

>>
>> - Possible problems with metadata reservation
>>
>> - Problems with ^no-holes collapse
>>     Will cause duplicated file offsets with hole file items.

This is still true, even if we only support the insert/collapse range 
for no-holes feature.

The bigger problem is that, even if a fs has no-holes feature, it can 
still have explicit hole extents. As the fs can be converted from 
^no-holes, and the existing holes are not removed.

I guess the initial design is mostly to make converting existing fses 
much easier, at the cost that kernel always has to handle explicit holes.


One idea is to introduce something like COMPAT_RO_STRICT_NO_HOLE to 
prevent explicit holes, then it will make the low-level key modification 
more feasible.
But that will need quite some time for such new feature to get adapted.

>>
>> On the other hand, with reflink the insert/collapse can even be
>> implemented in user space, with a proper step setting, we can still
>> allow concurrent read/write out of the reflink ranges.
>>
>> This makes me wonder, is these features really that necessary?
>>
>> If you know some programs actively utilize these features for real world
>> benefits, but can not be done through reflink, please provide them.
>>
> 
> There is a proprietary tool at my $dayjob which uses insert and collapse range.
> This works great on ext4 and xfs, and I am personally interested in having it
> work on btrfs.
> 
> Emulating in userspace is entirely possible using FICLONERANGE ioclt.. but
> since ext4 doesn't support that operation this leaves userspace needing to
> use one solution for ext4 (fallocate insert/collapse) and another for btrfs
> (FICLONERANGE). By filling the gap of btrfs fallocate modes I was hoping
> to simplify userspace and reduce friction in supporting btrfs.
> 
> However, on balance I agree with your opinion that emulating in userspace
> is the best approach here. So for now I will pause my work on this patch series.

I think you may be interested in utilizing btrfs_clone() to implement 
the insert/collapse range ioctl.

The benefit is you do not need to bother the low-level file item 
changes, nor the metadata reservation part (already done in btrfs_clone()).

However it won't work if the src/dst range overlaps, it can still be 
worked around by shrinking the reflink length to the minimal so that the 
ranges no longer overlaps, and at the cost of more fragments.

And you may still want to dig deeper into the extra locks and other 
corner cases like the final truncation after collapse and the extra hole 
insertion after insert.

I guess it may be a little easier to implement this time.

Thanks,
Qu

> 
>>>>
>>>> == Testing ==
>>>>
>>>> Tested with a Rust-based functional test suite covering:
>>>>    - Collapse and insert at the start, middle of a file
>>>>    - Multiple sequential operations on the same file
>>>>    - Files with multiple extents (fsync between writes to force separate
>>>>      extent items)
>>>>    - Files with holes (explicit punch_hole and implicit sparse writes)
>>>>    - Compressed extents (mount -o compress=zstd)
>>>>    - Transaction cycling (interval reduced to 4 during testing, verified
>>>>      in dmesg logs)
>>>>    - Inline files, verified that -EOPNOTSUPP is returned.
>>>
>>> I guess that tool has never verify the contents, nor multi-thread stress
>>> tests, e.g. fsstress?
>>>
> 
> My tests did read back and verify file contents. It was not multi threaded or
> particularly stressy.
> 
>>>>
>>>> The same tests pass on both btrfs and xfs (modulo the inline files).
>>>>
>>>> I have not run fstests which I know contains tests for INSERT_RANGE
>>>> and COLLAPSE_RANGE. I will do so.
>>>
>>> Thus I'd prefer a fstest run before whatever your local tool.
>>>
> 
> 100% agree that fstests would be better. I started with my own tests
> only to keep
> things very simple while the code got off the ground. If I carry this work on I
> will use fstests.
> 
>>>> == Notes ==
>>>>
>>>> This is my first kernel contribution. Development was significantly
>>>> assisted by an LLM (Amazon Q Developer). The implementation, testing,
>>>> and final review decisions are my own.
>>>
>>> I'm very interested in how the LLM is involved.
>>>
>>> You mentioned "implementation, testing, review" are on your own, this
>>> looks like everything is on your own.
>>>
>>> I don't think you're only using LLM to help understanding the code, thus
>>> it looks like implementation is contributed by the LLM.
>>>
>>> Please remember, you're the one explaining/defending the code.
>>> But as long as you can explain/defend the code, I'm fine with that.
>>>
> 
> My workflow with the LLM was this:
> 
> I provided a high level design, reference to the fallocate man page,
> pointers to the implementation of clone range in btrfs, and the existing
> fallocate implementations (for btrfs, ext4, and xfs). This was a couple
> of paragraphs and a handful of links.
> 
> I prompted the LLM to produce a detailed design document for how
> I could go about implementing insert and collapse for btrfs. The
> document it produced is English text of around 2000 words. It
> contains references to internal btrfs functions I needed and a
> description of the logic I needed to implement. It has no code.
> 
> I reviewed that doc, asked clarifications, and the doc was revised.
> 
> I then crafted the code. I relied heavily on the design doc that the
> LLM had written but did not follow it to the letter. During this
> implementation work I asked the LLM to clarify a few points, and
> the design doc was revised by the LLM.
> 
> During debugging most of the bugs I diagnosed myself. There was one
> bug which stumped me however, so I asked the LLM to provide
> hypotheses which I then worked through to solve the bug. I crafted
> the code for the fix. ( The specific bug was stale/incorrect data being
> read after the operation returned. The code was already invalidating
> the page cache, but I was not aware of the extent map cache that
> needed to be invalidated too. )
> 
> I wrote the test code myself.
> 
> Once the tests were passing the LLM was used to update the design
> doc to align with the code I actually wrote. I also asked the LLM to
> review my code prior to submitting it to the mailing list. I addressed
> some of that feedback. As an example piece of feedback it
> recommended I make use of unlikely() in "if" statements testing for
> errors.
> 
> 
>>> Still a newcomer with not a small code change will always attract more
>>> scrutiny, so please don't expect rapid review/merge/etc.
>>>
> 
> I am not in a rush. This has been a low-priority side quest going back
> to 2023: https://lore.kernel.org/linux-btrfs/CAMosweitbAN5EPOgJCtrbkRAj1QSbsYt4uDGVMZ378YY7wjnRw@mail.gmail.com/
> 
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-04-19 22:30 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-18 14:38 [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE Paul Richards
2026-04-18 14:38 ` [PATCH 1/3] btrfs: refactor btrfs_fallocate() ahead of supporting more modes Paul Richards
2026-04-19  0:57   ` Qu Wenruo
2026-04-18 14:38 ` [PATCH 2/3] btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate() Paul Richards
2026-04-19  1:29   ` Qu Wenruo
2026-04-18 14:38 ` [PATCH 3/3] btrfs: support for FALLOC_FL_INSERT_RANGE " Paul Richards
2026-04-19  4:44   ` Qu Wenruo
2026-04-19  0:25 ` [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE Qu Wenruo
2026-04-19  5:08   ` Qu Wenruo
2026-04-19 18:40     ` Paul Richards
2026-04-19 22:30       ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox