[PATCH 0/2] btrfs: fix generic/362 failures with nodatasum

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

* [PATCH 0/2] btrfs: fix generic/362 failures with nodatasum
@ 2026-05-27  5:06 Qu Wenruo
  2026-05-27  5:06 ` [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO Qu Wenruo
  2026-05-27  5:06 ` [PATCH 2/2] btrfs: fix incorrect buffered IO fallback for append direct writes Qu Wenruo
  0 siblings, 2 replies; 8+ messages in thread
From: Qu Wenruo @ 2026-05-27  5:06 UTC (permalink / raw)
  To: linux-btrfs

The test case generic/362 is pretty well hidden by several factors:

- "nodatasum" will not take effect due to the bad design of the test
  case
  If the target file already exists, the test case will reuse it,
  meanwhile "nodatasum" mount option only affects new files, meaning
  if the test case is executed before with data checksum, it will never
  go through the nodatasum path.

  There is already an update on the test case to make the failure
  reliably reproducible:
  https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/

- Btrfs always falls back to buffered IO if the inode has csum
  Thus we do not exercise the zero-copy path, hide the failure for the
  default mount option.

With that said, the test case failure needs to be fixed, and there are
two bugs in the direct io path:

- Treat any short write as an error
  This is especially common as we have disabled page faulting for
  reading from the @from iov_iter.

  This is fixed by the first patch.

- No isize rollback after a short write
  The isize is increased at btrfs_get_blocks_direct_write() but when a
  short write happened, the isize is not propoerly rolledback, causing
  later buffered fallback to write at the new isize.

Qu Wenruo (2):
  btrfs: fix false IO failure after falling back to buffered IO
  btrfs: fix incorrect buffered IO fallback for append direct writes

 fs/btrfs/direct-io.c    | 33 +++++++++++++++++++++++++++------
 fs/btrfs/inode.c        |  6 +-----
 fs/btrfs/ordered-data.c | 12 ++++++++++++
 fs/btrfs/ordered-data.h |  2 ++
 4 files changed, 42 insertions(+), 11 deletions(-)

-- 
2.54.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO
  2026-05-27  5:06 [PATCH 0/2] btrfs: fix generic/362 failures with nodatasum Qu Wenruo
@ 2026-05-27  5:06 ` Qu Wenruo
  2026-05-27 16:01   ` Boris Burkov
  2026-05-27  5:06 ` [PATCH 2/2] btrfs: fix incorrect buffered IO fallback for append direct writes Qu Wenruo
  1 sibling, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2026-05-27  5:06 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

[BUG]
The test case generic/362 will fail with "nodatasum" mount option (*):

 MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch

 generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
    --- tests/generic/362.out	2024-08-24 15:31:37.200000000 +0930
    +++ /home/adam/xfstests/results//generic/362.out.bad	2026-05-27 10:21:17.574771567 +0930
    @@ -1,2 +1,3 @@
     QA output created by 362
    +First write failed: Input/output error
     Silence is golden
    ...

*: If the test case has been executed before with default data checksum,
the failure will not reproduce. Need the following fix to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/

[CAUSE]
Btrfs direct write disable page fault of the input buffer, this is to
avoid a deadlock specific to btrfs.

So for the test case generic/362, it uses an anonymous page as input
buffer. And since the page is not yet faulted in, the direct IO will
fail with -EFAULT, causing us to go through the following call chain:

 btrfs_direct_write()
 |- btrfs_dio_write()
 |  |- btrfs_dio_iomap_end()
 |     |- btrfs_finish_ordered_extent(uptodate = false);
 |        |- can_finish_ordered_extent()
 |           |- btrfs_mark_ordered_extent_error()
 |              |- mapping_set_error()
 |                 Now the address space is marked error.
 |
 |- iomap_dio_complete()
 |  The dio bio is empty, nothing submitted.
 |
 |- Fallback to buffered
 |  And the buffered write finished without error
 |
 |- filemap_fdatawait_range()
    |- filemap_check_errors()
       The previous error is recorded, thus an error is returned

However the buffered write is properly submitted and finished, the error
is from the previous short dio write.

[FIX]
When a short dio write happened, we shouldn't mark it as an error, but
treat it like a truncated write.

Extract a helper, btrfs_mark_ordered_extent_truncated(), and utilize
that helper to mark the direct IO ordered extent as truncated, so it
won't cause failure for the later buffered fallback.

Cc: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c    | 18 +++++++++++++-----
 fs/btrfs/inode.c        |  6 +-----
 fs/btrfs/ordered-data.c | 12 ++++++++++++
 fs/btrfs/ordered-data.h |  2 ++
 4 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 57167d56dc72..598480b77002 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -610,6 +610,7 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 {
 	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
 	struct btrfs_dio_data *dio_data = iter->private;
+	struct btrfs_ordered_extent *ordered = dio_data->ordered;
 	size_t submitted = dio_data->submitted;
 	const bool write = !!(flags & IOMAP_WRITE);
 	int ret = 0;
@@ -624,16 +625,23 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 	if (submitted < length) {
 		pos += submitted;
 		length -= submitted;
-		if (write)
-			btrfs_finish_ordered_extent(dio_data->ordered,
-						    pos, length, false);
-		else
+		if (write) {
+			/*
+			 * We got a short write, will fallback to buffered IO
+			 * for the whole range.
+			 * Set the truncate length to 0, so that no real file
+			 * extent item will be created.
+			 */
+			btrfs_mark_ordered_extent_truncated(ordered, 0);
+			btrfs_finish_ordered_extent(ordered, pos, length, true);
+		} else {
 			btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
 						pos + length - 1, NULL);
+		}
 		ret = -ENOTBLK;
 	}
 	if (write) {
-		btrfs_put_ordered_extent(dio_data->ordered);
+		btrfs_put_ordered_extent(ordered);
 		dio_data->ordered = NULL;
 	}
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 973a89301baa..2c0131452754 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7590,11 +7590,7 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset,
 					       EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
 					       EXTENT_DEFRAG, &cached_state);
 
-		spin_lock(&inode->ordered_tree_lock);
-		set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-		ordered->truncated_len = min(ordered->truncated_len,
-					     cur - ordered->file_offset);
-		spin_unlock(&inode->ordered_tree_lock);
+		btrfs_mark_ordered_extent_truncated(ordered, cur - ordered->file_offset);
 
 		/*
 		 * If the ordered extent has finished, we're safe to delete all
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index f5f77c33cf59..b32d4eabe0ab 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -358,6 +358,18 @@ void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered)
 		mapping_set_error(ordered->inode->vfs_inode.i_mapping, -EIO);
 }
 
+void btrfs_mark_ordered_extent_truncated(struct btrfs_ordered_extent *ordered,
+					 u64 truncate_len)
+{
+	struct btrfs_inode *inode = ordered->inode;
+
+	ASSERT(truncate_len <= ordered->num_bytes);
+	spin_lock(&inode->ordered_tree_lock);
+	set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
+	ordered->truncated_len = min(ordered->truncated_len, truncate_len);
+	spin_unlock(&inode->ordered_tree_lock);
+}
+
 static void finish_ordered_fn(struct btrfs_work *work)
 {
 	struct btrfs_ordered_extent *ordered_extent;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 03e12380a2fd..8d5d5ba1e02f 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -226,6 +226,8 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 struct btrfs_ordered_extent *btrfs_split_ordered_extent(
 			struct btrfs_ordered_extent *ordered, u64 len);
 void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered);
+void btrfs_mark_ordered_extent_truncated(struct btrfs_ordered_extent *ordered,
+					 u64 truncate_len);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/2] btrfs: fix incorrect buffered IO fallback for append direct writes
  2026-05-27  5:06 [PATCH 0/2] btrfs: fix generic/362 failures with nodatasum Qu Wenruo
  2026-05-27  5:06 ` [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO Qu Wenruo
@ 2026-05-27  5:06 ` Qu Wenruo
  2026-05-29  7:55   ` Qu Wenruo
  1 sibling, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2026-05-27  5:06 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

[BUG]
With the previous bug of short direct writes fixed, test case
generic/362 (*) will still fail with the following error with nodatasum
mount option:

 generic/362  0s ... _check_dmesg: something found in dmesg (see /home/adam/xfstests/results//generic/362.dmesg)
 - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
    --- tests/generic/362.out	2024-08-24 15:31:37.200000000 +0930
    +++ /home/adam/xfstests/results//generic/362.out.bad	2026-05-27 10:13:09.072485767 +0930
    @@ -1,2 +1,3 @@
     QA output created by 362
    +Wrong file size after first write, got 8192 expected 4096
     Silence is golden
    ...

*: If the test case has been executed before with default data checksum,
the failure will not reproduce. Need the following fix to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/

[CAUSE]
Btrfs disables page fault in during direct IO write, to avoid a specific
deadlock that is only specific to btrfs.

So for the test case generic/362, it will make the direct IO to fail
with -EFAULT, then we fallback to buffered IO.

However at btrfs_dio_iomap_begin() -> btrfs_get_blocks_direct_write(),
we have already updated the isize during extent allocation.
And if we failed the direct IO, the isize is still the updated one.

So it means the buffered write will respect the IOCB_APPEND flag and
write the new data at the update isize, resulting the above failure.

[FIX]
Introduce btrfs_dio_data::updated_isize and btrfs_dio_data::old_isize,
so that if btrfs_get_blocks_direct_write() enlarged the inode size, we
can know the old inode size.

Then if we got a short write, and btrfs_dio_data::updated_isize is set,
then revert to the old isize, so the buffered fallback can write into
the correct location.

Cc: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 598480b77002..24163a4bcfb0 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -15,10 +15,16 @@
 
 struct btrfs_dio_data {
 	ssize_t submitted;
+	/*
+	 * If we got a short dio write and @updated_isize is set,
+	 * revert to the old isize.
+	 */
+	loff_t old_isize;
 	struct extent_changeset *data_reserved;
 	struct btrfs_ordered_extent *ordered;
 	bool data_space_reserved;
 	bool nocow_done;
+	bool updated_isize;
 };
 
 struct btrfs_dio_private {
@@ -341,8 +347,11 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	 * Need to update the i_size under the extent lock so buffered
 	 * readers will get the updated i_size when we unlock.
 	 */
-	if (start + len > i_size_read(inode))
+	if (start + len > i_size_read(inode)) {
+		dio_data->old_isize = i_size_read(inode);
+		dio_data->updated_isize = true;
 		i_size_write(inode, start + len);
+	}
 out:
 	if (ret && space_reserved) {
 		btrfs_delalloc_release_extents(BTRFS_I(inode), len);
@@ -634,6 +643,10 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 			 */
 			btrfs_mark_ordered_extent_truncated(ordered, 0);
 			btrfs_finish_ordered_extent(ordered, pos, length, true);
+			if (dio_data->updated_isize) {
+				i_size_write(inode, dio_data->old_isize);
+				dio_data->updated_isize = false;
+			}
 		} else {
 			btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
 						pos + length - 1, NULL);
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO
  2026-05-27  5:06 ` [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO Qu Wenruo
@ 2026-05-27 16:01   ` Boris Burkov
  2026-05-27 21:40     ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Boris Burkov @ 2026-05-27 16:01 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, stable

On Wed, May 27, 2026 at 02:36:44PM +0930, Qu Wenruo wrote:
> [BUG]
> The test case generic/362 will fail with "nodatasum" mount option (*):
> 
>  MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch
> 
>  generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
>     --- tests/generic/362.out	2024-08-24 15:31:37.200000000 +0930
>     +++ /home/adam/xfstests/results//generic/362.out.bad	2026-05-27 10:21:17.574771567 +0930
>     @@ -1,2 +1,3 @@
>      QA output created by 362
>     +First write failed: Input/output error
>      Silence is golden
>     ...
> 
> *: If the test case has been executed before with default data checksum,
> the failure will not reproduce. Need the following fix to make it
> reliably reproducible:
> https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/
> 
> [CAUSE]
> Btrfs direct write disable page fault of the input buffer, this is to
> avoid a deadlock specific to btrfs.
> 
> So for the test case generic/362, it uses an anonymous page as input
> buffer. And since the page is not yet faulted in, the direct IO will
> fail with -EFAULT, causing us to go through the following call chain:
> 
>  btrfs_direct_write()

I believe that when direct_write() sees EFAULT from btrfs_dio_write() it
should do the fault and retry, not fallback straight to buffered.

	if (iov_iter_count(from) > 0 && (ret == -EFAULT || ret > 0)) {
		const size_t left = iov_iter_count(from);
		/*
		 * We have more data left to write. Try to fault in as many as
		 * possible of the remainder pages and retry. We do this without
		 * releasing and locking again the inode, to prevent races with
		 * truncate.
		 *
		 * Also, in case the iov refers to pages in the file range of the
		 * file we want to write to (due to a mmap), we could enter an
		 * infinite loop if we retry after faulting the pages in, since
		 * iomap will invalidate any pages in the range early on, before
		 * it tries to fault in the pages of the iov. So we keep track of
		 * how much was left of iov in the previous EFAULT and fallback
		 * to buffered IO in case we haven't made any progress.
		 */
		if (left == prev_left) {
			ret = -ENOTBLK;
		} else {
			fault_in_iov_iter_readable(from, left);
			prev_left = left;
			goto again;
		}
	}

>  |- btrfs_dio_write()
>  |  |- btrfs_dio_iomap_end()
>  |     |- btrfs_finish_ordered_extent(uptodate = false);
>  |        |- can_finish_ordered_extent()
>  |           |- btrfs_mark_ordered_extent_error()
>  |              |- mapping_set_error()
>  |                 Now the address space is marked error.
>  |
>  |- iomap_dio_complete()
>  |  The dio bio is empty, nothing submitted.
>  |
>  |- Fallback to buffered
>  |  And the buffered write finished without error
>  |
>  |- filemap_fdatawait_range()
>     |- filemap_check_errors()
>        The previous error is recorded, thus an error is returned
> 
> However the buffered write is properly submitted and finished, the error
> is from the previous short dio write.
> 
> [FIX]
> When a short dio write happened, we shouldn't mark it as an error, but
> treat it like a truncated write.

I am quite skeptical of this as the proper fix. I looked into this
really thoroughly back in
https://lore.kernel.org/linux-btrfs/20230328051957.1161316-12-hch@lst.de/
and remember concluding it was much better to do the OE split and submit
separate direct writes, and I believe it was more or less working. I am
willing to believe that the mapping_set_error() thing slipped through
the cracks, though, so I apologize if I missed that detail. Has
something changed since then that makes us fall back to buffered on a
write buffer fault? Or am I misunderstanding something about what is
happening in this case?

g/708 is the test case for that particular corruption, FYI.

Thanks for looking into it,
Boris

> 
> Extract a helper, btrfs_mark_ordered_extent_truncated(), and utilize
> that helper to mark the direct IO ordered extent as truncated, so it
> won't cause failure for the later buffered fallback.
> 
> Cc: stable@vger.kernel.org # 6.1+
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/direct-io.c    | 18 +++++++++++++-----
>  fs/btrfs/inode.c        |  6 +-----
>  fs/btrfs/ordered-data.c | 12 ++++++++++++
>  fs/btrfs/ordered-data.h |  2 ++
>  4 files changed, 28 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index 57167d56dc72..598480b77002 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -610,6 +610,7 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
>  {
>  	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
>  	struct btrfs_dio_data *dio_data = iter->private;
> +	struct btrfs_ordered_extent *ordered = dio_data->ordered;
>  	size_t submitted = dio_data->submitted;
>  	const bool write = !!(flags & IOMAP_WRITE);
>  	int ret = 0;
> @@ -624,16 +625,23 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
>  	if (submitted < length) {
>  		pos += submitted;
>  		length -= submitted;
> -		if (write)
> -			btrfs_finish_ordered_extent(dio_data->ordered,
> -						    pos, length, false);
> -		else
> +		if (write) {
> +			/*
> +			 * We got a short write, will fallback to buffered IO
> +			 * for the whole range.
> +			 * Set the truncate length to 0, so that no real file
> +			 * extent item will be created.
> +			 */
> +			btrfs_mark_ordered_extent_truncated(ordered, 0);
> +			btrfs_finish_ordered_extent(ordered, pos, length, true);
> +		} else {
>  			btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
>  						pos + length - 1, NULL);
> +		}
>  		ret = -ENOTBLK;
>  	}
>  	if (write) {
> -		btrfs_put_ordered_extent(dio_data->ordered);
> +		btrfs_put_ordered_extent(ordered);
>  		dio_data->ordered = NULL;
>  	}
>  
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 973a89301baa..2c0131452754 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7590,11 +7590,7 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset,
>  					       EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
>  					       EXTENT_DEFRAG, &cached_state);
>  
> -		spin_lock(&inode->ordered_tree_lock);
> -		set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> -		ordered->truncated_len = min(ordered->truncated_len,
> -					     cur - ordered->file_offset);
> -		spin_unlock(&inode->ordered_tree_lock);
> +		btrfs_mark_ordered_extent_truncated(ordered, cur - ordered->file_offset);
>  
>  		/*
>  		 * If the ordered extent has finished, we're safe to delete all
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index f5f77c33cf59..b32d4eabe0ab 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -358,6 +358,18 @@ void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered)
>  		mapping_set_error(ordered->inode->vfs_inode.i_mapping, -EIO);
>  }
>  
> +void btrfs_mark_ordered_extent_truncated(struct btrfs_ordered_extent *ordered,
> +					 u64 truncate_len)
> +{
> +	struct btrfs_inode *inode = ordered->inode;
> +
> +	ASSERT(truncate_len <= ordered->num_bytes);
> +	spin_lock(&inode->ordered_tree_lock);
> +	set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> +	ordered->truncated_len = min(ordered->truncated_len, truncate_len);
> +	spin_unlock(&inode->ordered_tree_lock);
> +}
> +
>  static void finish_ordered_fn(struct btrfs_work *work)
>  {
>  	struct btrfs_ordered_extent *ordered_extent;
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index 03e12380a2fd..8d5d5ba1e02f 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -226,6 +226,8 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
>  struct btrfs_ordered_extent *btrfs_split_ordered_extent(
>  			struct btrfs_ordered_extent *ordered, u64 len);
>  void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered);
> +void btrfs_mark_ordered_extent_truncated(struct btrfs_ordered_extent *ordered,
> +					 u64 truncate_len);
>  int __init ordered_data_init(void);
>  void __cold ordered_data_exit(void);
>  
> -- 
> 2.54.0
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO
  2026-05-27 16:01   ` Boris Burkov
@ 2026-05-27 21:40     ` Qu Wenruo
  2026-05-29  4:32       ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2026-05-27 21:40 UTC (permalink / raw)
  To: Boris Burkov, Qu Wenruo; +Cc: linux-btrfs, stable



在 2026/5/28 01:31, Boris Burkov 写道:
> On Wed, May 27, 2026 at 02:36:44PM +0930, Qu Wenruo wrote:
>> [BUG]
>> The test case generic/362 will fail with "nodatasum" mount option (*):
>>
>>   MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch
>>
>>   generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
>>      --- tests/generic/362.out	2024-08-24 15:31:37.200000000 +0930
>>      +++ /home/adam/xfstests/results//generic/362.out.bad	2026-05-27 10:21:17.574771567 +0930
>>      @@ -1,2 +1,3 @@
>>       QA output created by 362
>>      +First write failed: Input/output error
>>       Silence is golden
>>      ...
>>
>> *: If the test case has been executed before with default data checksum,
>> the failure will not reproduce. Need the following fix to make it
>> reliably reproducible:
>> https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/
>>
>> [CAUSE]
>> Btrfs direct write disable page fault of the input buffer, this is to
>> avoid a deadlock specific to btrfs.
>>
>> So for the test case generic/362, it uses an anonymous page as input
>> buffer. And since the page is not yet faulted in, the direct IO will
>> fail with -EFAULT, causing us to go through the following call chain:
>>
>>   btrfs_direct_write()
> 
> I believe that when direct_write() sees EFAULT from btrfs_dio_write() it
> should do the fault and retry, not fallback straight to buffered.

It doesn't return -EFAULT.

btrfs_direct_write() returned an dio pointer, although it has not 
submitted any bytes for that dio structure.

So later iomap_dio_complete() returned 0.

The full trace printk looks like this:

  dio-append-buf--2501    [008] .....   627.444860: btrfs_direct_write: 
enter, root=5 ino=257 pos=0 iov_count=4096
  dio-append-buf--2501    [008] .....   627.444905: btrfs_direct_write: 
dio complete, ret=0
  dio-append-buf--2501    [008] .....   627.444906: btrfs_direct_write: 
buffered, pos=0 iov_counter=4096
  dio-append-buf--2501    [008] .....   627.444935: btrfs_direct_write: 
exit, root=5 ino=257 pos=0 iov_count=4096 ret=-5 written=0


> 
> 	if (iov_iter_count(from) > 0 && (ret == -EFAULT || ret > 0)) {
> 		const size_t left = iov_iter_count(from);
> 		/*
> 		 * We have more data left to write. Try to fault in as many as
> 		 * possible of the remainder pages and retry. We do this without
> 		 * releasing and locking again the inode, to prevent races with
> 		 * truncate.
> 		 *
> 		 * Also, in case the iov refers to pages in the file range of the
> 		 * file we want to write to (due to a mmap), we could enter an
> 		 * infinite loop if we retry after faulting the pages in, since
> 		 * iomap will invalidate any pages in the range early on, before
> 		 * it tries to fault in the pages of the iov. So we keep track of
> 		 * how much was left of iov in the previous EFAULT and fallback
> 		 * to buffered IO in case we haven't made any progress.

Furthermore, the page fault in won't make any difference for this 
particular case, exactly explained by the comment itself, that the page 
cache will be invalidated.

> 		 */
> 		if (left == prev_left) {
> 			ret = -ENOTBLK;
> 		} else {
> 			fault_in_iov_iter_readable(from, left);
> 			prev_left = left;
> 			goto again;
> 		}
> 	}
> 
>>   |- btrfs_dio_write()
>>   |  |- btrfs_dio_iomap_end()
>>   |     |- btrfs_finish_ordered_extent(uptodate = false);
>>   |        |- can_finish_ordered_extent()
>>   |           |- btrfs_mark_ordered_extent_error()
>>   |              |- mapping_set_error()
>>   |                 Now the address space is marked error.
>>   |
>>   |- iomap_dio_complete()
>>   |  The dio bio is empty, nothing submitted.
>>   |
>>   |- Fallback to buffered
>>   |  And the buffered write finished without error
>>   |
>>   |- filemap_fdatawait_range()
>>      |- filemap_check_errors()
>>         The previous error is recorded, thus an error is returned
>>
>> However the buffered write is properly submitted and finished, the error
>> is from the previous short dio write.
>>
>> [FIX]
>> When a short dio write happened, we shouldn't mark it as an error, but
>> treat it like a truncated write.
> 
> I am quite skeptical of this as the proper fix. I looked into this
> really thoroughly back in
> https://lore.kernel.org/linux-btrfs/20230328051957.1161316-12-hch@lst.de/
> and remember concluding it was much better to do the OE split and submit
> separate direct writes, and I believe it was more or less working. I am
> willing to believe that the mapping_set_error() thing slipped through
> the cracks, though, so I apologize if I missed that detail. Has
> something changed since then that makes us fall back to buffered on a
> write buffer fault?

The buffered fallback for datacsum is masking the failure for most 
cases, and the test case itself is also hiding the failure even for 
nodatasum mount option, so we are not detecting it for a long time.

Secondly I also believed the btrfs_dio_write() should fail with -EFAULT 
initially, but that's no longer the case.

My guess is iomap dio error handling change, but will need extra digging 
to pin down the iomap commit.

> Or am I misunderstanding something about what is
> happening in this case?
> 
> g/708 is the test case for that particular corruption, FYI.

That's also the test case I'm failing with IOMAP_DIO_BOUNCE, but I think 
we should fix the g/362 failure first, as it is already failing now.

Thanks,
Qu

> 
> Thanks for looking into it,
> Boris

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO
  2026-05-27 21:40     ` Qu Wenruo
@ 2026-05-29  4:32       ` Qu Wenruo
  2026-05-29  4:37         ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2026-05-29  4:32 UTC (permalink / raw)
  To: Qu Wenruo, Boris Burkov; +Cc: linux-btrfs, stable



在 2026/5/28 07:10, Qu Wenruo 写道:
> 
> 
> 在 2026/5/28 01:31, Boris Burkov 写道:
>> On Wed, May 27, 2026 at 02:36:44PM +0930, Qu Wenruo wrote:
>>> [BUG]
>>> The test case generic/362 will fail with "nodatasum" mount option (*):
>>>
>>>   MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch
>>>
>>>   generic/362  0s ... - output mismatch (see /home/adam/xfstests/ 
>>> results//generic/362.out.bad)
>>>      --- tests/generic/362.out    2024-08-24 15:31:37.200000000 +0930
>>>      +++ /home/adam/xfstests/results//generic/362.out.bad    
>>> 2026-05-27 10:21:17.574771567 +0930
>>>      @@ -1,2 +1,3 @@
>>>       QA output created by 362
>>>      +First write failed: Input/output error
>>>       Silence is golden
>>>      ...
>>>
>>> *: If the test case has been executed before with default data checksum,
>>> the failure will not reproduce. Need the following fix to make it
>>> reliably reproducible:
>>> https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/
>>>
>>> [CAUSE]
>>> Btrfs direct write disable page fault of the input buffer, this is to
>>> avoid a deadlock specific to btrfs.
>>>
>>> So for the test case generic/362, it uses an anonymous page as input
>>> buffer. And since the page is not yet faulted in, the direct IO will
>>> fail with -EFAULT, causing us to go through the following call chain:
>>>
>>>   btrfs_direct_write()
>>
>> I believe that when direct_write() sees EFAULT from btrfs_dio_write() it
>> should do the fault and retry, not fallback straight to buffered.
> 
> It doesn't return -EFAULT.
> 
> btrfs_direct_write() returned an dio pointer, although it has not 
> submitted any bytes for that dio structure.
> 
> So later iomap_dio_complete() returned 0.
> 

I added more trace_printk(), and it shows it's the EFAULT handling reset 
the error number:
("r/i=" shows the root id and ino from btrfs, "i=" shows the generic 
ino, "pos=" shows the file pos and length)

  22.223076: __iomap_dio_rw: enter, i=257 pos=0/4096
  22.223078: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=0/4096
  22.223097: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=0/4096 phy=13631488
  22.223106: __iomap_dio_rw: while loop, iomap_dio_iter ret=-14
  22.223107: btrfs_dio_iomap_end: enter, r/i=5/257 pos=0/4096 written=0 
ordered=0/4096
  22.223126: btrfs_dio_iomap_end: exit, r/i=5/257 pos=0/4096 written=0 
ret=-15
  22.223179: __iomap_dio_rw: after while loop, ret=-15 dio->size=0
  22.223180: __iomap_dio_rw: exit with dio, i=257

As you can see, iomap_dio_iter() itself returned -EFAULT, which is 
expected as we disabled the page fault for the iov_iter.

But then it's at the following code block that our return value is reset 
to 0:

         if (ret == -EFAULT && dio->size && (dio_flags & 
IOMAP_DIO_PARTIAL)) {
                 if (!(iocb->ki_flags & IOCB_NOWAIT))
                         wait_for_completion = true;
                 ret = 0;
         }

So that explains why we will never get a -EFAULT directly from 
__iomap_dio_rw(), at least for the write case.


> Furthermore, the page fault in won't make any difference for this 
> particular case, exactly explained by the comment itself, that the page 
> cache will be invalidated.

Sorry, I got confused with another case where the buffer page is from 
page cache.

For this particular case, it's unrelated, and we are able to fault-in 
the anonymous page.


>>> [FIX]
>>> When a short dio write happened, we shouldn't mark it as an error, but
>>> treat it like a truncated write.
>>
>> I am quite skeptical of this as the proper fix. I looked into this
>> really thoroughly back in
>> https://lore.kernel.org/linux-btrfs/20230328051957.1161316-12-hch@lst.de/
>> and remember concluding it was much better to do the OE split and submit
>> separate direct writes, and I believe it was more or less working.

Firstly, the OE split and the proper fix doesn't conflict at all.

In fact both co-operate with each other pretty well, especially shown by 
the second write of the test case, shown by the trace:

  22.223529: btrfs_direct_write: enter, r/i=5/257 pos=0/8192
  22.223530: __iomap_dio_rw: enter, i=257 pos=0/8192
  22.223531: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=0/8192
  22.223545: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=0/8192 phy=13635584
  22.223561: __iomap_dio_rw: while loop, iomap_dio_iter ret=0
  22.223561: btrfs_dio_iomap_end: enter, r/i=5/257 pos=0/8192 
written=4096 ordered=4096/4096

Here we only copied the first page, and at this stage, the original 8K 
OE is already being split into two.
And the fix will properly truncate and remove the later half.

  22.223568: btrfs_dio_iomap_end: exit, r/i=5/257 pos=4096/4096 
written=4096 ret=-15
  22.223582: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=4096/4096
  22.223588: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=4096/4096 
phy=13639680

This time we retry with the 2nd page.

  22.223588: __iomap_dio_rw: while loop, iomap_dio_iter ret=-14

And still failed to fault in.

  22.223588: btrfs_dio_iomap_end: enter, r/i=5/257 pos=4096/4096 
written=0 ordered=4096/4096
  22.223590: btrfs_dio_iomap_end: exit, r/i=5/257 pos=4096/4096 
written=0 ret=-15

So again remove the failed OE.

  22.223610: __iomap_dio_rw: after while loop, ret=-15 dio->size=4096
  22.223712: __iomap_dio_rw: exit with dio, i=257
  22.223713: btrfs_direct_write: iomap_dio_complete() ret=4096

So __iomap_dio_rw() only succeeded writed 4K.

Now btrfs will fault-in the 2nd page and retry.

  22.223717: __iomap_dio_rw: enter, i=257 pos=4096/4096
  22.223718: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=4096/4096
  22.223730: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=4096/4096 
phy=13639680
  22.223740: __iomap_dio_rw: while loop, iomap_dio_iter ret=0
  22.223740: btrfs_dio_iomap_end: enter, r/i=5/257 pos=4096/4096 
written=4096 ordered=4096/4096
  22.223741: btrfs_dio_iomap_end: exit, r/i=5/257 pos=4096/4096 
written=4096 ret=0

Now with the 2nd page faulted in, this time we succeeded in handling the 
2nd page, and every thing goes one.

  22.223755: __iomap_dio_rw: after while loop, ret=0 dio->size=4096
  22.223847: __iomap_dio_rw: exit with dio, i=257
  22.223848: btrfs_direct_write: iomap_dio_complete() ret=8192
  22.223849: btrfs_direct_write: exit, r/i=5/257 pos=0/8192 ret=8192 
written=8192

So this looks exactly the correct fix.

If this extra trace helps, I can definitely reword the commit message to 
reduce any possible confusion.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO
  2026-05-29  4:32       ` Qu Wenruo
@ 2026-05-29  4:37         ` Qu Wenruo
  0 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2026-05-29  4:37 UTC (permalink / raw)
  To: Qu Wenruo, Boris Burkov; +Cc: linux-btrfs, stable



在 2026/5/29 14:02, Qu Wenruo 写道:
> 
> 
> 在 2026/5/28 07:10, Qu Wenruo 写道:
>>
>>
>> 在 2026/5/28 01:31, Boris Burkov 写道:
>>> On Wed, May 27, 2026 at 02:36:44PM +0930, Qu Wenruo wrote:
>>>> [BUG]
>>>> The test case generic/362 will fail with "nodatasum" mount option (*):
>>>>
>>>>   MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch
>>>>
>>>>   generic/362  0s ... - output mismatch (see /home/adam/xfstests/ 
>>>> results//generic/362.out.bad)
>>>>      --- tests/generic/362.out    2024-08-24 15:31:37.200000000 +0930
>>>>      +++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 
>>>> 10:21:17.574771567 +0930
>>>>      @@ -1,2 +1,3 @@
>>>>       QA output created by 362
>>>>      +First write failed: Input/output error
>>>>       Silence is golden
>>>>      ...
>>>>
>>>> *: If the test case has been executed before with default data 
>>>> checksum,
>>>> the failure will not reproduce. Need the following fix to make it
>>>> reliably reproducible:
>>>> https://lore.kernel.org/linux-btrfs/20260526070055.60193-1- 
>>>> wqu@suse.com/
>>>>
>>>> [CAUSE]
>>>> Btrfs direct write disable page fault of the input buffer, this is to
>>>> avoid a deadlock specific to btrfs.
>>>>
>>>> So for the test case generic/362, it uses an anonymous page as input
>>>> buffer. And since the page is not yet faulted in, the direct IO will
>>>> fail with -EFAULT, causing us to go through the following call chain:
>>>>
>>>>   btrfs_direct_write()
>>>
>>> I believe that when direct_write() sees EFAULT from btrfs_dio_write() it
>>> should do the fault and retry, not fallback straight to buffered.
>>
>> It doesn't return -EFAULT.
>>
>> btrfs_direct_write() returned an dio pointer, although it has not 
>> submitted any bytes for that dio structure.
>>
>> So later iomap_dio_complete() returned 0.
>>
> 
> I added more trace_printk(), and it shows it's the EFAULT handling reset 
> the error number:
> ("r/i=" shows the root id and ino from btrfs, "i=" shows the generic 
> ino, "pos=" shows the file pos and length)
> 
>   22.223076: __iomap_dio_rw: enter, i=257 pos=0/4096
>   22.223078: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=0/4096
>   22.223097: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=0/4096 phy=13631488
>   22.223106: __iomap_dio_rw: while loop, iomap_dio_iter ret=-14
>   22.223107: btrfs_dio_iomap_end: enter, r/i=5/257 pos=0/4096 written=0 
> ordered=0/4096
>   22.223126: btrfs_dio_iomap_end: exit, r/i=5/257 pos=0/4096 written=0 
> ret=-15
>   22.223179: __iomap_dio_rw: after while loop, ret=-15 dio->size=0
>   22.223180: __iomap_dio_rw: exit with dio, i=257
> 
> As you can see, iomap_dio_iter() itself returned -EFAULT, which is 
> expected as we disabled the page fault for the iov_iter.
> 
> But then it's at the following code block that our return value is reset 
> to 0:
> 
>          if (ret == -EFAULT && dio->size && (dio_flags & 
> IOMAP_DIO_PARTIAL)) {
>                  if (!(iocb->ki_flags & IOCB_NOWAIT))
>                          wait_for_completion = true;
>                  ret = 0;
>          }

My bad, it is the ENOTBLK handling resetting the error number.
At that stage, @ret is already the error number from iomap_end(), not 
from the iomap_dio_iter():

         /* magic error code to fall back to buffered I/O */
         if (ret == -ENOTBLK) {
                 wait_for_completion = true;
                 ret = 0;
         }

> 
> So that explains why we will never get a -EFAULT directly from 
> __iomap_dio_rw(), at least for the write case.
> 
> 
>> Furthermore, the page fault in won't make any difference for this 
>> particular case, exactly explained by the comment itself, that the 
>> page cache will be invalidated.
> 
> Sorry, I got confused with another case where the buffer page is from 
> page cache.
> 
> For this particular case, it's unrelated, and we are able to fault-in 
> the anonymous page.
> 
> 
>>>> [FIX]
>>>> When a short dio write happened, we shouldn't mark it as an error, but
>>>> treat it like a truncated write.
>>>
>>> I am quite skeptical of this as the proper fix. I looked into this
>>> really thoroughly back in
>>> https://lore.kernel.org/linux-btrfs/20230328051957.1161316-12- 
>>> hch@lst.de/
>>> and remember concluding it was much better to do the OE split and submit
>>> separate direct writes, and I believe it was more or less working.
> 
> Firstly, the OE split and the proper fix doesn't conflict at all.
> 
> In fact both co-operate with each other pretty well, especially shown by 
> the second write of the test case, shown by the trace:
> 
>   22.223529: btrfs_direct_write: enter, r/i=5/257 pos=0/8192
>   22.223530: __iomap_dio_rw: enter, i=257 pos=0/8192
>   22.223531: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=0/8192
>   22.223545: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=0/8192 phy=13635584
>   22.223561: __iomap_dio_rw: while loop, iomap_dio_iter ret=0
>   22.223561: btrfs_dio_iomap_end: enter, r/i=5/257 pos=0/8192 
> written=4096 ordered=4096/4096
> 
> Here we only copied the first page, and at this stage, the original 8K 
> OE is already being split into two.
> And the fix will properly truncate and remove the later half.
> 
>   22.223568: btrfs_dio_iomap_end: exit, r/i=5/257 pos=4096/4096 
> written=4096 ret=-15
>   22.223582: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=4096/4096
>   22.223588: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=4096/4096 
> phy=13639680
> 
> This time we retry with the 2nd page.
> 
>   22.223588: __iomap_dio_rw: while loop, iomap_dio_iter ret=-14
> 
> And still failed to fault in.
> 
>   22.223588: btrfs_dio_iomap_end: enter, r/i=5/257 pos=4096/4096 
> written=0 ordered=4096/4096
>   22.223590: btrfs_dio_iomap_end: exit, r/i=5/257 pos=4096/4096 
> written=0 ret=-15
> 
> So again remove the failed OE.
> 
>   22.223610: __iomap_dio_rw: after while loop, ret=-15 dio->size=4096
>   22.223712: __iomap_dio_rw: exit with dio, i=257
>   22.223713: btrfs_direct_write: iomap_dio_complete() ret=4096
> 
> So __iomap_dio_rw() only succeeded writed 4K.
> 
> Now btrfs will fault-in the 2nd page and retry.
> 
>   22.223717: __iomap_dio_rw: enter, i=257 pos=4096/4096
>   22.223718: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=4096/4096
>   22.223730: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=4096/4096 
> phy=13639680
>   22.223740: __iomap_dio_rw: while loop, iomap_dio_iter ret=0
>   22.223740: btrfs_dio_iomap_end: enter, r/i=5/257 pos=4096/4096 
> written=4096 ordered=4096/4096
>   22.223741: btrfs_dio_iomap_end: exit, r/i=5/257 pos=4096/4096 
> written=4096 ret=0
> 
> Now with the 2nd page faulted in, this time we succeeded in handling the 
> 2nd page, and every thing goes one.
> 
>   22.223755: __iomap_dio_rw: after while loop, ret=0 dio->size=4096
>   22.223847: __iomap_dio_rw: exit with dio, i=257
>   22.223848: btrfs_direct_write: iomap_dio_complete() ret=8192
>   22.223849: btrfs_direct_write: exit, r/i=5/257 pos=0/8192 ret=8192 
> written=8192
> 
> So this looks exactly the correct fix.
> 
> If this extra trace helps, I can definitely reword the commit message to 
> reduce any possible confusion.
> 
> Thanks,
> Qu


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] btrfs: fix incorrect buffered IO fallback for append direct writes
  2026-05-27  5:06 ` [PATCH 2/2] btrfs: fix incorrect buffered IO fallback for append direct writes Qu Wenruo
@ 2026-05-29  7:55   ` Qu Wenruo
  0 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2026-05-29  7:55 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable



在 2026/5/27 14:36, Qu Wenruo 写道:
> [BUG]
> With the previous bug of short direct writes fixed, test case
> generic/362 (*) will still fail with the following error with nodatasum
> mount option:
> 
>   generic/362  0s ... _check_dmesg: something found in dmesg (see /home/adam/xfstests/results//generic/362.dmesg)
>   - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
>      --- tests/generic/362.out	2024-08-24 15:31:37.200000000 +0930
>      +++ /home/adam/xfstests/results//generic/362.out.bad	2026-05-27 10:13:09.072485767 +0930
>      @@ -1,2 +1,3 @@
>       QA output created by 362
>      +Wrong file size after first write, got 8192 expected 4096
>       Silence is golden
>      ...
> 
> *: If the test case has been executed before with default data checksum,
> the failure will not reproduce. Need the following fix to make it
> reliably reproducible:
> https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/
> 
> [CAUSE]
> Btrfs disables page fault in during direct IO write, to avoid a specific
> deadlock that is only specific to btrfs.
> 
> So for the test case generic/362, it will make the direct IO to fail
> with -EFAULT, then we fallback to buffered IO.
> 
> However at btrfs_dio_iomap_begin() -> btrfs_get_blocks_direct_write(),
> we have already updated the isize during extent allocation.
> And if we failed the direct IO, the isize is still the updated one.
> 
> So it means the buffered write will respect the IOCB_APPEND flag and
> write the new data at the update isize, resulting the above failure.
> 
> [FIX]
> Introduce btrfs_dio_data::updated_isize and btrfs_dio_data::old_isize,
> so that if btrfs_get_blocks_direct_write() enlarged the inode size, we
> can know the old inode size.
> 
> Then if we got a short write, and btrfs_dio_data::updated_isize is set,
> then revert to the old isize, so the buffered fallback can write into
> the correct location.
> 
> Cc: stable@vger.kernel.org # 6.1+
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/direct-io.c | 15 ++++++++++++++-
>   1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index 598480b77002..24163a4bcfb0 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -15,10 +15,16 @@
>   
>   struct btrfs_dio_data {
>   	ssize_t submitted;
> +	/*
> +	 * If we got a short dio write and @updated_isize is set,
> +	 * revert to the old isize.
> +	 */
> +	loff_t old_isize;
>   	struct extent_changeset *data_reserved;
>   	struct btrfs_ordered_extent *ordered;
>   	bool data_space_reserved;
>   	bool nocow_done;
> +	bool updated_isize;
>   };
>   
>   struct btrfs_dio_private {
> @@ -341,8 +347,11 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>   	 * Need to update the i_size under the extent lock so buffered
>   	 * readers will get the updated i_size when we unlock.
>   	 */
> -	if (start + len > i_size_read(inode))
> +	if (start + len > i_size_read(inode)) {
> +		dio_data->old_isize = i_size_read(inode);
> +		dio_data->updated_isize = true;
>   		i_size_write(inode, start + len);
> +	}
>   out:
>   	if (ret && space_reserved) {
>   		btrfs_delalloc_release_extents(BTRFS_I(inode), len);
> @@ -634,6 +643,10 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
>   			 */
>   			btrfs_mark_ordered_extent_truncated(ordered, 0);
>   			btrfs_finish_ordered_extent(ordered, pos, length, true);
> +			if (dio_data->updated_isize) {
> +				i_size_write(inode, dio_data->old_isize);

This part is incorrect, as if we have a short direct write, we can still 
have part of the range written.

In that case we should not revert to the old isize, but to the new 
written size.

Considering we may go several different tries, I think it's better to 
save and set the isize inside btrfs_direct_write(), other than relying 
on the iomap_start/iomap_end() callbacks.

Thanks,
Qu> +				dio_data->updated_isize = false;
> +			}
>   		} else {
>   			btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
>   						pos + length - 1, NULL);


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-05-29  7:55 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-27  5:06 [PATCH 0/2] btrfs: fix generic/362 failures with nodatasum Qu Wenruo
2026-05-27  5:06 ` [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO Qu Wenruo
2026-05-27 16:01   ` Boris Burkov
2026-05-27 21:40     ` Qu Wenruo
2026-05-29  4:32       ` Qu Wenruo
2026-05-29  4:37         ` Qu Wenruo
2026-05-27  5:06 ` [PATCH 2/2] btrfs: fix incorrect buffered IO fallback for append direct writes Qu Wenruo
2026-05-29  7:55   ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox