public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] btrfs: reflink updates
@ 2026-01-09 12:29 fdmanana
  2026-01-09 12:29 ` [PATCH v2 1/2] btrfs: invalidate pages instead of truncate after reflinking fdmanana
  2026-01-09 12:29 ` [PATCH v2 2/2] btrfs: update comment for delalloc flush and oe wait in btrfs_clone_files() fdmanana
  0 siblings, 2 replies; 6+ messages in thread
From: fdmanana @ 2026-01-09 12:29 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Fix a bug where concurrent buffered reads with reflinks may return
unexpected zeroes when using large folios, and update a comment about
why we need to flush delalloc and wait for ordered extents before
attempting to invalidate the page cache.

V2: Updated patch 1/2 change logs and a stale comment.
    Added patch 2/2.

Filipe Manana (2):
  btrfs: invalidate pages instead of truncate after reflinking
  btrfs: update comment for delalloc flush and oe wait in btrfs_clone_files()

 fs/btrfs/reflink.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

-- 
2.47.2


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v2 1/2] btrfs: invalidate pages instead of truncate after reflinking
  2026-01-09 12:29 [PATCH v2 0/2] btrfs: reflink updates fdmanana
@ 2026-01-09 12:29 ` fdmanana
  2026-01-11  4:11   ` Qu Wenruo
  2026-01-09 12:29 ` [PATCH v2 2/2] btrfs: update comment for delalloc flush and oe wait in btrfs_clone_files() fdmanana
  1 sibling, 1 reply; 6+ messages in thread
From: fdmanana @ 2026-01-09 12:29 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Qu reported that generic/164 often fails because the read operations get
zeroes when it expects to either get all bytes with a value of 0x61 or
0x62. The issue stems from truncating the pages from the page cache
instead of invalidating, as truncating can zero page contents. This
zeroing is not just in case the range is not page sized (as it's commented
in truncate_inode_pages_range()) but also in case we are using large
folios, they need to be split and the splitting fails. Stealing Qu's
comment in the thread linked below:

  "We can have the following case:

	0          4K         8K         12K          16K
        |          |          |          |            |
        |<---- Extent A ----->|<----- Extent B ------>|

   The page size is still 4K, but the folio we got is 16K.

   Then if we remap the range for [8K, 16K), then
   truncate_inode_pages_range() will get the large folio 0 sized 16K,
   then call truncate_inode_partial_folio().

   Which later calls folio_zero_range() for the [8K, 16K) range first,
   then tries to split the folio into smaller ones to properly drop them
   from the cache.

   But if splitting failed (e.g. racing with other operations holding the
   filemap lock), the partially zeroed large folio will be kept, resulting
   the range [8K, 16K) being zeroed meanwhile the folio is still a 16K
   sized large one."

So instead of truncating, invalidate the page cache range with a call to
filemap_invalidate_inode(), which besides not doing any zeroing also
ensures that while it's invalidating folios, no new folios are added.

This helps ensure that buffered reads that happen while a reflink
operation is in progress always get either the whole old data (the one
before the reflink) or the whole new data, which is what generic/164
expects.

Link: https://lore.kernel.org/linux-btrfs/7fb9b44f-9680-4c22-a47f-6648cb109ddf@suse.com/
Reported-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/reflink.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index e746980567da..ab4ce56d69ee 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -705,7 +705,6 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 	struct inode *src = file_inode(file_src);
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
 	int ret;
-	int wb_ret;
 	u64 len = olen;
 	u64 bs = fs_info->sectorsize;
 	u64 end;
@@ -750,25 +749,29 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 	btrfs_lock_extent(&BTRFS_I(inode)->io_tree, destoff, end, &cached_state);
 	ret = btrfs_clone(src, inode, off, olen, len, destoff, 0);
 	btrfs_unlock_extent(&BTRFS_I(inode)->io_tree, destoff, end, &cached_state);
+	if (ret < 0)
+		return ret;
 
 	/*
 	 * We may have copied an inline extent into a page of the destination
-	 * range, so wait for writeback to complete before truncating pages
+	 * range, so wait for writeback to complete before invalidating pages
 	 * from the page cache. This is a rare case.
 	 */
-	wb_ret = btrfs_wait_ordered_range(BTRFS_I(inode), destoff, len);
-	ret = ret ? ret : wb_ret;
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode), destoff, len);
+	if (ret < 0)
+		return ret;
+
 	/*
-	 * Truncate page cache pages so that future reads will see the cloned
-	 * data immediately and not the previous data.
+	 * Invalidate page cache so that future reads will see the cloned data
+	 * immediately and not the previous data.
 	 */
-	truncate_inode_pages_range(&inode->i_data,
-				round_down(destoff, PAGE_SIZE),
-				round_up(destoff + len, PAGE_SIZE) - 1);
+	ret = filemap_invalidate_inode(inode, false, destoff, end);
+	if (ret < 0)
+		return ret;
 
 	btrfs_btree_balance_dirty(fs_info);
 
-	return ret;
+	return 0;
 }
 
 static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v2 2/2] btrfs: update comment for delalloc flush and oe wait in btrfs_clone_files()
  2026-01-09 12:29 [PATCH v2 0/2] btrfs: reflink updates fdmanana
  2026-01-09 12:29 ` [PATCH v2 1/2] btrfs: invalidate pages instead of truncate after reflinking fdmanana
@ 2026-01-09 12:29 ` fdmanana
  2026-01-09 18:07   ` Boris Burkov
  2026-01-09 22:37   ` Qu Wenruo
  1 sibling, 2 replies; 6+ messages in thread
From: fdmanana @ 2026-01-09 12:29 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Make the comment more detailed about why we need to flush delalloc and
wait for ordered extent completion before attempting to invalidate the
page cache.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/reflink.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index ab4ce56d69ee..314cb95ba846 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -754,8 +754,13 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 
 	/*
 	 * We may have copied an inline extent into a page of the destination
-	 * range, so wait for writeback to complete before invalidating pages
-	 * from the page cache. This is a rare case.
+	 * range. So flush delalloc and wait for ordered extent completion.
+	 * This is to ensure the invalidation below does not fail, as if for
+	 * example it finds a dirty folio, our folio release callback
+	 * (btrfs_release_folio()) returns false, which makes the invalidation
+	 * return an -EBUSY error. We can't ignore such failures since they
+	 * could come from some range other than the copied inline extent's
+	 * destination range and we have no way to know that.
 	 */
 	ret = btrfs_wait_ordered_range(BTRFS_I(inode), destoff, len);
 	if (ret < 0)
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 2/2] btrfs: update comment for delalloc flush and oe wait in btrfs_clone_files()
  2026-01-09 12:29 ` [PATCH v2 2/2] btrfs: update comment for delalloc flush and oe wait in btrfs_clone_files() fdmanana
@ 2026-01-09 18:07   ` Boris Burkov
  2026-01-09 22:37   ` Qu Wenruo
  1 sibling, 0 replies; 6+ messages in thread
From: Boris Burkov @ 2026-01-09 18:07 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Fri, Jan 09, 2026 at 12:29:51PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Make the comment more detailed about why we need to flush delalloc and
> wait for ordered extent completion before attempting to invalidate the
> page cache.
> 

Thank you very much for the extra explanation.

Reviewed-by: Boris Burkov <boris@bur.io>

> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/reflink.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> index ab4ce56d69ee..314cb95ba846 100644
> --- a/fs/btrfs/reflink.c
> +++ b/fs/btrfs/reflink.c
> @@ -754,8 +754,13 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>  
>  	/*
>  	 * We may have copied an inline extent into a page of the destination
> -	 * range, so wait for writeback to complete before invalidating pages
> -	 * from the page cache. This is a rare case.
> +	 * range. So flush delalloc and wait for ordered extent completion.
> +	 * This is to ensure the invalidation below does not fail, as if for
> +	 * example it finds a dirty folio, our folio release callback
> +	 * (btrfs_release_folio()) returns false, which makes the invalidation
> +	 * return an -EBUSY error. We can't ignore such failures since they
> +	 * could come from some range other than the copied inline extent's
> +	 * destination range and we have no way to know that.
>  	 */
>  	ret = btrfs_wait_ordered_range(BTRFS_I(inode), destoff, len);
>  	if (ret < 0)
> -- 
> 2.47.2
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 2/2] btrfs: update comment for delalloc flush and oe wait in btrfs_clone_files()
  2026-01-09 12:29 ` [PATCH v2 2/2] btrfs: update comment for delalloc flush and oe wait in btrfs_clone_files() fdmanana
  2026-01-09 18:07   ` Boris Burkov
@ 2026-01-09 22:37   ` Qu Wenruo
  1 sibling, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2026-01-09 22:37 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



在 2026/1/9 22:59, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Make the comment more detailed about why we need to flush delalloc and
> wait for ordered extent completion before attempting to invalidate the
> page cache.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
>   fs/btrfs/reflink.c | 9 +++++++--
>   1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> index ab4ce56d69ee..314cb95ba846 100644
> --- a/fs/btrfs/reflink.c
> +++ b/fs/btrfs/reflink.c
> @@ -754,8 +754,13 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>   
>   	/*
>   	 * We may have copied an inline extent into a page of the destination
> -	 * range, so wait for writeback to complete before invalidating pages
> -	 * from the page cache. This is a rare case.
> +	 * range. So flush delalloc and wait for ordered extent completion.
> +	 * This is to ensure the invalidation below does not fail, as if for
> +	 * example it finds a dirty folio, our folio release callback
> +	 * (btrfs_release_folio()) returns false, which makes the invalidation
> +	 * return an -EBUSY error. We can't ignore such failures since they
> +	 * could come from some range other than the copied inline extent's
> +	 * destination range and we have no way to know that.
>   	 */
>   	ret = btrfs_wait_ordered_range(BTRFS_I(inode), destoff, len);
>   	if (ret < 0)


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 1/2] btrfs: invalidate pages instead of truncate after reflinking
  2026-01-09 12:29 ` [PATCH v2 1/2] btrfs: invalidate pages instead of truncate after reflinking fdmanana
@ 2026-01-11  4:11   ` Qu Wenruo
  0 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2026-01-11  4:11 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



在 2026/1/9 22:59, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Qu reported that generic/164 often fails because the read operations get
> zeroes when it expects to either get all bytes with a value of 0x61 or
> 0x62. The issue stems from truncating the pages from the page cache
> instead of invalidating, as truncating can zero page contents. This
> zeroing is not just in case the range is not page sized (as it's commented
> in truncate_inode_pages_range()) but also in case we are using large
> folios, they need to be split and the splitting fails. Stealing Qu's
> comment in the thread linked below:
> 
>    "We can have the following case:
> 
> 	0          4K         8K         12K          16K
>          |          |          |          |            |
>          |<---- Extent A ----->|<----- Extent B ------>|
> 
>     The page size is still 4K, but the folio we got is 16K.
> 
>     Then if we remap the range for [8K, 16K), then
>     truncate_inode_pages_range() will get the large folio 0 sized 16K,
>     then call truncate_inode_partial_folio().
> 
>     Which later calls folio_zero_range() for the [8K, 16K) range first,
>     then tries to split the folio into smaller ones to properly drop them
>     from the cache.
> 
>     But if splitting failed (e.g. racing with other operations holding the
>     filemap lock), the partially zeroed large folio will be kept, resulting
>     the range [8K, 16K) being zeroed meanwhile the folio is still a 16K
>     sized large one."
> 
> So instead of truncating, invalidate the page cache range with a call to
> filemap_invalidate_inode(), which besides not doing any zeroing also
> ensures that while it's invalidating folios, no new folios are added.

Although I still believe the fix is correct and is the best we can do 
now, the root cause is that our btrfs_invalidate_folio() callback is not 
supporting partial invalidating.

In the above example case, truncate_inode_partial_folio() will call 
folio_invalidate() for us to drop the [8K, 16K) range. But unfortunately 
our btrfs callback only supports full folio drop, thus nothing will be 
done, and left the uptodate bitmap untouched for range [8K, 16K).

So the long-term solution will be to enhance btrfs_invalidate_folio() 
call back to properly clear the uptodate block bitmap for sub-folio 
invalidation, so that we can still keep the range [0, 8K) uptodate and 
skip the read from disk, then go back to use the 
truncate_inode_pages_range() function.

Thanks,
Qu

> 
> This helps ensure that buffered reads that happen while a reflink
> operation is in progress always get either the whole old data (the one
> before the reflink) or the whole new data, which is what generic/164
> expects.
> 
> Link: https://lore.kernel.org/linux-btrfs/7fb9b44f-9680-4c22-a47f-6648cb109ddf@suse.com/
> Reported-by: Qu Wenruo <wqu@suse.com>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> Reviewed-by: Boris Burkov <boris@bur.io>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/reflink.c | 23 +++++++++++++----------
>   1 file changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> index e746980567da..ab4ce56d69ee 100644
> --- a/fs/btrfs/reflink.c
> +++ b/fs/btrfs/reflink.c
> @@ -705,7 +705,6 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>   	struct inode *src = file_inode(file_src);
>   	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
>   	int ret;
> -	int wb_ret;
>   	u64 len = olen;
>   	u64 bs = fs_info->sectorsize;
>   	u64 end;
> @@ -750,25 +749,29 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
>   	btrfs_lock_extent(&BTRFS_I(inode)->io_tree, destoff, end, &cached_state);
>   	ret = btrfs_clone(src, inode, off, olen, len, destoff, 0);
>   	btrfs_unlock_extent(&BTRFS_I(inode)->io_tree, destoff, end, &cached_state);
> +	if (ret < 0)
> +		return ret;
>   
>   	/*
>   	 * We may have copied an inline extent into a page of the destination
> -	 * range, so wait for writeback to complete before truncating pages
> +	 * range, so wait for writeback to complete before invalidating pages
>   	 * from the page cache. This is a rare case.
>   	 */
> -	wb_ret = btrfs_wait_ordered_range(BTRFS_I(inode), destoff, len);
> -	ret = ret ? ret : wb_ret;
> +	ret = btrfs_wait_ordered_range(BTRFS_I(inode), destoff, len);
> +	if (ret < 0)
> +		return ret;
> +
>   	/*
> -	 * Truncate page cache pages so that future reads will see the cloned
> -	 * data immediately and not the previous data.
> +	 * Invalidate page cache so that future reads will see the cloned data
> +	 * immediately and not the previous data.
>   	 */
> -	truncate_inode_pages_range(&inode->i_data,
> -				round_down(destoff, PAGE_SIZE),
> -				round_up(destoff + len, PAGE_SIZE) - 1);
> +	ret = filemap_invalidate_inode(inode, false, destoff, end);
> +	if (ret < 0)
> +		return ret;
>   
>   	btrfs_btree_balance_dirty(fs_info);
>   
> -	return ret;
> +	return 0;
>   }
>   
>   static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-01-11  4:11 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-09 12:29 [PATCH v2 0/2] btrfs: reflink updates fdmanana
2026-01-09 12:29 ` [PATCH v2 1/2] btrfs: invalidate pages instead of truncate after reflinking fdmanana
2026-01-11  4:11   ` Qu Wenruo
2026-01-09 12:29 ` [PATCH v2 2/2] btrfs: update comment for delalloc flush and oe wait in btrfs_clone_files() fdmanana
2026-01-09 18:07   ` Boris Burkov
2026-01-09 22:37   ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox