[PATCH v2 0/3] btrfs: fix generic/362 failures with nodatasum

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/3] btrfs: fix generic/362 failures with nodatasum
@ 2026-05-30  3:34 Qu Wenruo
  2026-05-30  3:34 ` [PATCH v2 1/3] btrfs: fix false IO failure after falling back to buffered write Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Qu Wenruo @ 2026-05-30  3:34 UTC (permalink / raw)
  To: linux-btrfs

Changelog
v2:
- Add a new patch to allow retry after no byte is submitted for direct
  write

- Add a more clean callchain to explain why __iomap_dio_rw() doesn't
  return -EFAULT/-ENOTBLK directly.

- Add a code comment explaining the OE split sitation and why we can
  truncate all the remaining OE after a short write

- Fix the isize revert when part of the direct write succeeded

- Dig deeper into the original cause
  Which is very old, dates back to v5.15 LTS at least, and add a reason
  why no specific fixes tag is provided.

The test case generic/362 is pretty well hidden by several factors:

- "nodatasum" will not take effect due to the bad design of the test
  case
  If the target file already exists, the test case will reuse it,
  meanwhile "nodatasum" mount option only affects new files, meaning
  if the test case is executed before with data checksum, it will never
  go through the nodatasum path.

  There is already an update on the test case to make the failure
  reliably reproducible:
  https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/

- Btrfs always falls back to buffered IO if the inode has csum
  Thus we do not exercise the zero-copy path, hide the failure for the
  default mount option.

With that said, the test case failure needs to be fixed, and there are
several bugs in the direct write path:

- Treat any short write as an error
  This is especially common as we have disabled page faulting for
  reading from the @from iov_iter.

  This is fixed by the first patch.

- No isize rollback after a short write
  The isize is increased at btrfs_get_blocks_direct_write() but when a
  short write happened, the isize is not propoerly rolledback, causing
  later buffered fallback to write at the new isize.

  This is fixed by the second patch.

- No page fault-in retry after a zero-submitted short write
  Unlike the previous two, this is only a minor problem which reduces
  the chance to do zero-copy direct IO.

  This is fixed by the third patch.

Qu Wenruo (3):
  btrfs: fix false IO failure after falling back to buffered write
  btrfs: fix incorrect buffered IO fallback for append direct writes
  btrfs: retry faulting in the pages after a zero sized short direct
    write

 fs/btrfs/direct-io.c    | 44 ++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/inode.c        |  6 +-----
 fs/btrfs/ordered-data.c | 12 +++++++++++
 fs/btrfs/ordered-data.h |  2 ++
 4 files changed, 54 insertions(+), 10 deletions(-)

-- 
2.54.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/3] btrfs: fix false IO failure after falling back to buffered write
  2026-05-30  3:34 [PATCH v2 0/3] btrfs: fix generic/362 failures with nodatasum Qu Wenruo
@ 2026-05-30  3:34 ` Qu Wenruo
  2026-06-03  4:34   ` Boris Burkov
  2026-05-30  3:34 ` [PATCH v2 2/3] btrfs: fix incorrect buffered IO fallback for append direct writes Qu Wenruo
  2026-05-30  3:34 ` [PATCH v2 3/3] btrfs: retry faulting in the pages after a zero sized short direct write Qu Wenruo
  2 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2026-05-30  3:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

[BUG]
The test case generic/362 will fail with "nodatasum" mount option (*):

 MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch

 generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
    --- tests/generic/362.out	2024-08-24 15:31:37.200000000 +0930
    +++ /home/adam/xfstests/results//generic/362.out.bad	2026-05-27 10:21:17.574771567 +0930
    @@ -1,2 +1,3 @@
     QA output created by 362
    +First write failed: Input/output error
     Silence is golden
    ...

*: If the test case has been executed before with default data checksum,
the failure will not reproduce. Need the following fix to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/

[CAUSE]
Inside __iomap_dio_rw(), the -EFAULT/-ENOTBLK error is not directly returned.
Thus we never got an error pointer from __iomap_dio_rw().

The call chain looks like this:

 btrfs_direct_write()
 |- btrfs_dio_write()
 |-  __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     Now an ordered extent is allocated for the 4K write.
 |  |
 |  |- iomi.status = iomap_dio_iter()
 |  |  Where iomap_dio_iter() returned -EFAULT.
 |  |
 |  |- ret = iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |  |  |- btrfs_finish_ordered_extent(uptodate = false)
 |  |  |  |  |- can_finish_ordered_extent()
 |  |  |  |     |- btrfs_mark_ordered_extent_error()
 |  |  |  |        |- mapping_set_error()
 |  |  |  |           Now the address space is marked error.
 |  |  |  | return -ENOTBLK
 |  |  |- return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0; }
 |     Now the return value is reset to 0.
 |     Thus no error pointer will be returned.
 |
 |- ret = iomap_dio_complete()
 |  Since no byte is submitted, @ret is 0.
 |
 |- Fallback to buffered IO
 |  And the buffered write finished without error
 |
 |- filemap_fdatawait_range()
    |- filemap_check_errors()
       The previous error is recorded, thus an error is returned

However the buffered write is properly submitted and finished, the error
is from the btrfs_finish_ordered_extent() call with @uptodate = false.

[FIX]
When a short dio write happened, any range that is submitted will have
btrfs_extract_ordered_extent() to be called, thus the submitted range
will always have an OE just covering the submitted range.

The remaining OE range is never submitted, thus they should be treated
as truncated, not an error. So that we can properly reclaim and not
insert an unnecessary file extent item, without marking the mapping as
error.

Extract a helper, btrfs_mark_ordered_extent_truncated(), and utilize
that helper to mark the direct IO ordered extent as truncated, so it
won't cause failure for the later buffered fallback.

[REASON FOR NO FIXES TAG]
The bug itself is pretty old, at commit f85781fb505e ("btrfs: switch to
iomap for direct IO") we're already passing @uptodate=false finishing
the OE.
But at that time OE with IOERR won't call mapping_set_error(), so it's
not exposed.
Later commit d61bec08b904 ("btrfs: mark ordered extent and inode with
error if we fail to finish") finally exposed the bug, but that commit
is doing a correct job, not the root cause.

Anyway the bug is very old, dating back to 5.1x days, thus only CC to
stable.

Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c    | 17 ++++++++++++++---
 fs/btrfs/inode.c        |  6 +-----
 fs/btrfs/ordered-data.c | 12 ++++++++++++
 fs/btrfs/ordered-data.h |  2 ++
 4 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 57167d56dc72..88cb2e82a507 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -624,12 +624,23 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 	if (submitted < length) {
 		pos += submitted;
 		length -= submitted;
-		if (write)
+		if (write) {
+			/*
+			 * We have a short write, if there is any range
+			 * that is submitted properly, that part will have
+			 * its own OE split from the original one.
+			 *
+			 * So for the OE at dio_data->ordered, it's the part
+			 * that is not submitted, and should be marked
+			 * as fully truncated.
+			 */
+			btrfs_mark_ordered_extent_truncated(dio_data->ordered, 0);
 			btrfs_finish_ordered_extent(dio_data->ordered,
-						    pos, length, false);
-		else
+						    pos, length, true);
+		} else {
 			btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
 						pos + length - 1, NULL);
+		}
 		ret = -ENOTBLK;
 	}
 	if (write) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 973a89301baa..2c0131452754 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7590,11 +7590,7 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset,
 					       EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
 					       EXTENT_DEFRAG, &cached_state);
 
-		spin_lock(&inode->ordered_tree_lock);
-		set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-		ordered->truncated_len = min(ordered->truncated_len,
-					     cur - ordered->file_offset);
-		spin_unlock(&inode->ordered_tree_lock);
+		btrfs_mark_ordered_extent_truncated(ordered, cur - ordered->file_offset);
 
 		/*
 		 * If the ordered extent has finished, we're safe to delete all
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index f5f77c33cf59..b32d4eabe0ab 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -358,6 +358,18 @@ void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered)
 		mapping_set_error(ordered->inode->vfs_inode.i_mapping, -EIO);
 }
 
+void btrfs_mark_ordered_extent_truncated(struct btrfs_ordered_extent *ordered,
+					 u64 truncate_len)
+{
+	struct btrfs_inode *inode = ordered->inode;
+
+	ASSERT(truncate_len <= ordered->num_bytes);
+	spin_lock(&inode->ordered_tree_lock);
+	set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
+	ordered->truncated_len = min(ordered->truncated_len, truncate_len);
+	spin_unlock(&inode->ordered_tree_lock);
+}
+
 static void finish_ordered_fn(struct btrfs_work *work)
 {
 	struct btrfs_ordered_extent *ordered_extent;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 03e12380a2fd..8d5d5ba1e02f 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -226,6 +226,8 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 struct btrfs_ordered_extent *btrfs_split_ordered_extent(
 			struct btrfs_ordered_extent *ordered, u64 len);
 void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered);
+void btrfs_mark_ordered_extent_truncated(struct btrfs_ordered_extent *ordered,
+					 u64 truncate_len);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 2/3] btrfs: fix incorrect buffered IO fallback for append direct writes
  2026-05-30  3:34 [PATCH v2 0/3] btrfs: fix generic/362 failures with nodatasum Qu Wenruo
  2026-05-30  3:34 ` [PATCH v2 1/3] btrfs: fix false IO failure after falling back to buffered write Qu Wenruo
@ 2026-05-30  3:34 ` Qu Wenruo
  2026-06-03  4:34   ` Boris Burkov
  2026-05-30  3:34 ` [PATCH v2 3/3] btrfs: retry faulting in the pages after a zero sized short direct write Qu Wenruo
  2 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2026-05-30  3:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

[BUG]
With the previous bug of short direct writes fixed, test case
generic/362 (*) still fails with the following error with nodatasum
mount option:

 generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
 - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
    --- tests/generic/362.out	2024-08-24 15:31:37.200000000 +0930
    +++ /home/adam/xfstests/results//generic/362.out.bad	2026-05-27 10:13:09.072485767 +0930
    @@ -1,2 +1,3 @@
     QA output created by 362
    +Wrong file size after first write, got 8192 expected 4096
     Silence is golden
    ...

*: If the test case has been executed before with default data checksum,
the failure will not reproduce. Need the following fix to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/

[CAUSE]
Inside btrfs_dio_iomap_begin() for a direct write, we increase the isize
if it's beyond the current isize.

But if the direct io finished short , we do not revert the isize to the
previous value nor to the short write end.

Then if we need to fall back to buffered writes, and the write has
IOCB_APPEND flag, then the buffered write will be positioned at the
incorrect isize.

The call chain looks like this:

 btrfs_direct_write(pos=0, length=4K)
 |- __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     |- btrfs_get_blocks_direct_write()
 |  |        |- i_size_write()
 |  |           Which updates the isize to the write end (4K).
 |  |
 |  |- iomap_dio_iter()
 |  |  Failed with -EFAULT on the first page.
 |  |
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |     Detects a short write, return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0;}
 |     Which resets the return value.
 |
 |- ret = iomap_dio_complet()
 |  Which returns 0.
 |
 |- btrfs_buffered_write(iocb, from);
    |- generic_write_checks()
       |- iocb->ki_pos = i_size_read()
          Which is still the new size (4K), other than the original
	  isize 0.

[FIX]
Introduce btrfs_dio_data::updated_isize and btrfs_dio_data::old_isize,
so that if btrfs_get_blocks_direct_write() enlarged the inode size for
the first time, we still know the old isize.

Then if we got a short write, and btrfs_dio_data::updated_isize is set,
revert to the correct isize based on the old isize and the short write
end.

[REASON FOR NO FIXES TAG]
The bug is again very old, before commit f85781fb505e ("btrfs: switch to
iomap for direct IO") we are already increasing isize without a
proper rollback for short writes.

Thus only a CC to stable.

Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 88cb2e82a507..fd53fac7186e 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -15,10 +15,16 @@
 
 struct btrfs_dio_data {
 	ssize_t submitted;
+	/*
+	 * If we got a short dio write and @updated_isize is set,
+	 * revert to the old isize.
+	 */
+	loff_t old_isize;
 	struct extent_changeset *data_reserved;
 	struct btrfs_ordered_extent *ordered;
 	bool data_space_reserved;
 	bool nocow_done;
+	bool updated_isize;
 };
 
 struct btrfs_dio_private {
@@ -228,6 +234,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	bool space_reserved = false;
 	u64 len = *lenp;
 	u64 prev_len;
+	loff_t old_isize;
 	int ret = 0;
 
 	/*
@@ -341,8 +348,14 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	 * Need to update the i_size under the extent lock so buffered
 	 * readers will get the updated i_size when we unlock.
 	 */
-	if (start + len > i_size_read(inode))
+	old_isize = i_size_read(inode);
+	if (start + len > old_isize) {
+		if (!dio_data->updated_isize) {
+			dio_data->old_isize = old_isize;
+			dio_data->updated_isize = true;
+		}
 		i_size_write(inode, start + len);
+	}
 out:
 	if (ret && space_reserved) {
 		btrfs_delalloc_release_extents(BTRFS_I(inode), len);
@@ -637,6 +650,16 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 			btrfs_mark_ordered_extent_truncated(dio_data->ordered, 0);
 			btrfs_finish_ordered_extent(dio_data->ordered,
 						    pos, length, true);
+			if (dio_data->updated_isize) {
+				u64 new_isize;
+
+				if (submitted == 0)
+					new_isize = dio_data->old_isize;
+				else
+					new_isize = max(pos, dio_data->old_isize);
+				i_size_write(inode, new_isize);
+				dio_data->updated_isize = false;
+			}
 		} else {
 			btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
 						pos + length - 1, NULL);
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 3/3] btrfs: retry faulting in the pages after a zero sized short direct write
  2026-05-30  3:34 [PATCH v2 0/3] btrfs: fix generic/362 failures with nodatasum Qu Wenruo
  2026-05-30  3:34 ` [PATCH v2 1/3] btrfs: fix false IO failure after falling back to buffered write Qu Wenruo
  2026-05-30  3:34 ` [PATCH v2 2/3] btrfs: fix incorrect buffered IO fallback for append direct writes Qu Wenruo
@ 2026-05-30  3:34 ` Qu Wenruo
  2026-06-03  4:35   ` Boris Burkov
  2 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2026-05-30  3:34 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs_direct_write() will not try to fault in the pages, but
directly fall back to buffered writes, if the first page of the buffer
can not be faulted in.

For example, during generic/362 with nodatasum mount option, there is a
write at file offset 0, length PAGE_SIZE, and the page is not faulted in.
Then we go the following callchain and directly fall back to buffered
IO:

 btrfs_direct_write()
 |- btrfs_dio_write()
 |-  __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     Now an ordered extent is allocated for the 4K write.
 |  |
 |  |- iomi.status = iomap_dio_iter()
 |  |  Where iomap_dio_iter() returned -EFAULT.
 |  |
 |  |- ret = iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |  |  | return -ENOTBLK
 |  |  |- return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0; }
 |     Now the return value is reset to 0.
 |
 |- ret = iomap_dio_complete()
 |  Since no byte is submitted, @ret is now zero.
 |
 |- if (iov_iter_count() > 0 && (ret == -EFAULT || ret > 0))
 |  @ret is zero, thus not meeting the above retry condition
 |
 |- Fallback to buffered

Just slightly loosen the condition to allow retry faulting in pages after
a zero sized short write.

Unlike the previous two bug fixes, this one is not really cause any real
bug, but only reducing the chance to do zero-copy direct IO.
Thus it doesn't really require stable-CC nor fixes-tag.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index fd53fac7186e..99c959be8de7 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -960,7 +960,7 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (ret > 0)
 		written = ret;
 
-	if (iov_iter_count(from) > 0 && (ret == -EFAULT || ret > 0)) {
+	if (iov_iter_count(from) > 0 && (ret == -EFAULT || ret >= 0)) {
 		const size_t left = iov_iter_count(from);
 		/*
 		 * We have more data left to write. Try to fault in as many as
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/3] btrfs: fix incorrect buffered IO fallback for append direct writes
  2026-05-30  3:34 ` [PATCH v2 2/3] btrfs: fix incorrect buffered IO fallback for append direct writes Qu Wenruo
@ 2026-06-03  4:34   ` Boris Burkov
  2026-06-03  5:15     ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Boris Burkov @ 2026-06-03  4:34 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, stable

On Sat, May 30, 2026 at 01:04:18PM +0930, Qu Wenruo wrote:
> [BUG]
> With the previous bug of short direct writes fixed, test case
> generic/362 (*) still fails with the following error with nodatasum
> mount option:
> 
>  generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
>  - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
>     --- tests/generic/362.out	2024-08-24 15:31:37.200000000 +0930
>     +++ /home/adam/xfstests/results//generic/362.out.bad	2026-05-27 10:13:09.072485767 +0930
>     @@ -1,2 +1,3 @@
>      QA output created by 362
>     +Wrong file size after first write, got 8192 expected 4096
>      Silence is golden
>     ...
> 
> *: If the test case has been executed before with default data checksum,
> the failure will not reproduce. Need the following fix to make it
> reliably reproducible:
> https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/
> 
> [CAUSE]
> Inside btrfs_dio_iomap_begin() for a direct write, we increase the isize
> if it's beyond the current isize.
> 
> But if the direct io finished short , we do not revert the isize to the
> previous value nor to the short write end.
> 
> Then if we need to fall back to buffered writes, and the write has
> IOCB_APPEND flag, then the buffered write will be positioned at the
> incorrect isize.
> 
> The call chain looks like this:
> 
>  btrfs_direct_write(pos=0, length=4K)
>  |- __iomap_dio_rw()
>  |  |- iomap_iter()
>  |  |  |- btrfs_dio_iomap_begin()
>  |  |     |- btrfs_get_blocks_direct_write()
>  |  |        |- i_size_write()
>  |  |           Which updates the isize to the write end (4K).
>  |  |
>  |  |- iomap_dio_iter()
>  |  |  Failed with -EFAULT on the first page.
>  |  |
>  |  |- iomap_iter()
>  |  |  |- btrfs_dio_iomap_end()
>  |  |     Detects a short write, return -ENOTBLK
>  |  |- if (ret == -ENOTBLK) { ret = 0;}
>  |     Which resets the return value.
>  |
>  |- ret = iomap_dio_complet()
>  |  Which returns 0.
>  |
>  |- btrfs_buffered_write(iocb, from);
>     |- generic_write_checks()
>        |- iocb->ki_pos = i_size_read()
>           Which is still the new size (4K), other than the original
> 	  isize 0.
> 

The explanation is very clear, so thank you for that. I agree with the
bug and the direction of the fix.

However, I fear that there could still be a smaller bug left.

You have reasoned out a race against buffered writes, and the invalid
i_size you are fixing up only exists inside the dio thread holding the
inode lock, so the buffered write does not see it before you get to
fixup i_size. However, after we set the invalid i_size we release the
extent lock, so I believe a buffered reader could now observe the
intermediate too-big i_size before you manage to fix it.

I believe that the consequence of this is that reader will block on the
OE, then the split half will be finished/truncated, but the reader could
still see the too-big i_size and get back zeroes. I am not completely sure
if this is for sure a bug, but it does feel like it could be wrong.

For what it's worth, there is a comment at the i_size_write() in
btrfs_get_blocks_direct_write() which also confirms that it is important
that the update is done under the extent lock, not just the inode lock.

If it is, in fact, safe, then clarifying that in the existing comment
and/or a new comment would be helpful, I think. The comment is from 2012:
c3473e830074 ("Btrfs: fix dio write vs buffered read race")
so I suspect some of the original reasoning may now be out of date..

Thanks,
Boris

> [FIX]
> Introduce btrfs_dio_data::updated_isize and btrfs_dio_data::old_isize,
> so that if btrfs_get_blocks_direct_write() enlarged the inode size for
> the first time, we still know the old isize.
> 
> Then if we got a short write, and btrfs_dio_data::updated_isize is set,
> revert to the correct isize based on the old isize and the short write
> end.
> 
> [REASON FOR NO FIXES TAG]
> The bug is again very old, before commit f85781fb505e ("btrfs: switch to
> iomap for direct IO") we are already increasing isize without a
> proper rollback for short writes.
> 
> Thus only a CC to stable.
> 
> Cc: stable@vger.kernel.org # 5.15+
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/direct-io.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index 88cb2e82a507..fd53fac7186e 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -15,10 +15,16 @@
>  
>  struct btrfs_dio_data {
>  	ssize_t submitted;
> +	/*
> +	 * If we got a short dio write and @updated_isize is set,
> +	 * revert to the old isize.
> +	 */
> +	loff_t old_isize;
>  	struct extent_changeset *data_reserved;
>  	struct btrfs_ordered_extent *ordered;
>  	bool data_space_reserved;
>  	bool nocow_done;
> +	bool updated_isize;
>  };
>  
>  struct btrfs_dio_private {
> @@ -228,6 +234,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>  	bool space_reserved = false;
>  	u64 len = *lenp;
>  	u64 prev_len;
> +	loff_t old_isize;
>  	int ret = 0;
>  
>  	/*
> @@ -341,8 +348,14 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>  	 * Need to update the i_size under the extent lock so buffered
>  	 * readers will get the updated i_size when we unlock.
>  	 */
> -	if (start + len > i_size_read(inode))
> +	old_isize = i_size_read(inode);
> +	if (start + len > old_isize) {
> +		if (!dio_data->updated_isize) {
> +			dio_data->old_isize = old_isize;
> +			dio_data->updated_isize = true;
> +		}
>  		i_size_write(inode, start + len);
> +	}
>  out:
>  	if (ret && space_reserved) {
>  		btrfs_delalloc_release_extents(BTRFS_I(inode), len);
> @@ -637,6 +650,16 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
>  			btrfs_mark_ordered_extent_truncated(dio_data->ordered, 0);
>  			btrfs_finish_ordered_extent(dio_data->ordered,
>  						    pos, length, true);
> +			if (dio_data->updated_isize) {
> +				u64 new_isize;
> +
> +				if (submitted == 0)
> +					new_isize = dio_data->old_isize;
> +				else
> +					new_isize = max(pos, dio_data->old_isize);
> +				i_size_write(inode, new_isize);
> +				dio_data->updated_isize = false;
> +			}
>  		} else {
>  			btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
>  						pos + length - 1, NULL);
> -- 
> 2.54.0
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/3] btrfs: fix false IO failure after falling back to buffered write
  2026-05-30  3:34 ` [PATCH v2 1/3] btrfs: fix false IO failure after falling back to buffered write Qu Wenruo
@ 2026-06-03  4:34   ` Boris Burkov
  0 siblings, 0 replies; 8+ messages in thread
From: Boris Burkov @ 2026-06-03  4:34 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, stable

On Sat, May 30, 2026 at 01:04:17PM +0930, Qu Wenruo wrote:
> [BUG]
> The test case generic/362 will fail with "nodatasum" mount option (*):
> 
>  MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch
> 
>  generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
>     --- tests/generic/362.out	2024-08-24 15:31:37.200000000 +0930
>     +++ /home/adam/xfstests/results//generic/362.out.bad	2026-05-27 10:21:17.574771567 +0930
>     @@ -1,2 +1,3 @@
>      QA output created by 362
>     +First write failed: Input/output error
>      Silence is golden
>     ...
> 
> *: If the test case has been executed before with default data checksum,
> the failure will not reproduce. Need the following fix to make it
> reliably reproducible:
> https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/
> 
> [CAUSE]
> Inside __iomap_dio_rw(), the -EFAULT/-ENOTBLK error is not directly returned.
> Thus we never got an error pointer from __iomap_dio_rw().
> 
> The call chain looks like this:
> 
>  btrfs_direct_write()
>  |- btrfs_dio_write()
>  |-  __iomap_dio_rw()
>  |  |- iomap_iter()
>  |  |  |- btrfs_dio_iomap_begin()
>  |  |     Now an ordered extent is allocated for the 4K write.
>  |  |
>  |  |- iomi.status = iomap_dio_iter()
>  |  |  Where iomap_dio_iter() returned -EFAULT.
>  |  |
>  |  |- ret = iomap_iter()
>  |  |  |- btrfs_dio_iomap_end()
>  |  |  |  |- btrfs_finish_ordered_extent(uptodate = false)
>  |  |  |  |  |- can_finish_ordered_extent()
>  |  |  |  |     |- btrfs_mark_ordered_extent_error()
>  |  |  |  |        |- mapping_set_error()
>  |  |  |  |           Now the address space is marked error.
>  |  |  |  | return -ENOTBLK
>  |  |  |- return -ENOTBLK
>  |  |- if (ret == -ENOTBLK) { ret = 0; }
>  |     Now the return value is reset to 0.
>  |     Thus no error pointer will be returned.
>  |
>  |- ret = iomap_dio_complete()
>  |  Since no byte is submitted, @ret is 0.
>  |
>  |- Fallback to buffered IO
>  |  And the buffered write finished without error
>  |
>  |- filemap_fdatawait_range()
>     |- filemap_check_errors()
>        The previous error is recorded, thus an error is returned
> 
> However the buffered write is properly submitted and finished, the error
> is from the btrfs_finish_ordered_extent() call with @uptodate = false.
> 
> [FIX]
> When a short dio write happened, any range that is submitted will have
> btrfs_extract_ordered_extent() to be called, thus the submitted range
> will always have an OE just covering the submitted range.
> 
> The remaining OE range is never submitted, thus they should be treated
> as truncated, not an error. So that we can properly reclaim and not
> insert an unnecessary file extent item, without marking the mapping as
> error.
> 
> Extract a helper, btrfs_mark_ordered_extent_truncated(), and utilize
> that helper to mark the direct IO ordered extent as truncated, so it
> won't cause failure for the later buffered fallback.
> 
> [REASON FOR NO FIXES TAG]
> The bug itself is pretty old, at commit f85781fb505e ("btrfs: switch to
> iomap for direct IO") we're already passing @uptodate=false finishing
> the OE.
> But at that time OE with IOERR won't call mapping_set_error(), so it's
> not exposed.
> Later commit d61bec08b904 ("btrfs: mark ordered extent and inode with
> error if we fail to finish") finally exposed the bug, but that commit
> is doing a correct job, not the root cause.
> 
> Anyway the bug is very old, dating back to 5.1x days, thus only CC to
> stable.
> 
> Cc: stable@vger.kernel.org # 5.15+
> Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/direct-io.c    | 17 ++++++++++++++---
>  fs/btrfs/inode.c        |  6 +-----
>  fs/btrfs/ordered-data.c | 12 ++++++++++++
>  fs/btrfs/ordered-data.h |  2 ++
>  4 files changed, 29 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index 57167d56dc72..88cb2e82a507 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -624,12 +624,23 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
>  	if (submitted < length) {
>  		pos += submitted;
>  		length -= submitted;
> -		if (write)
> +		if (write) {
> +			/*
> +			 * We have a short write, if there is any range
> +			 * that is submitted properly, that part will have
> +			 * its own OE split from the original one.
> +			 *
> +			 * So for the OE at dio_data->ordered, it's the part
> +			 * that is not submitted, and should be marked
> +			 * as fully truncated.
> +			 */
> +			btrfs_mark_ordered_extent_truncated(dio_data->ordered, 0);
>  			btrfs_finish_ordered_extent(dio_data->ordered,
> -						    pos, length, false);
> -		else
> +						    pos, length, true);
> +		} else {
>  			btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
>  						pos + length - 1, NULL);
> +		}
>  		ret = -ENOTBLK;
>  	}
>  	if (write) {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 973a89301baa..2c0131452754 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7590,11 +7590,7 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset,
>  					       EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
>  					       EXTENT_DEFRAG, &cached_state);
>  
> -		spin_lock(&inode->ordered_tree_lock);
> -		set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> -		ordered->truncated_len = min(ordered->truncated_len,
> -					     cur - ordered->file_offset);
> -		spin_unlock(&inode->ordered_tree_lock);
> +		btrfs_mark_ordered_extent_truncated(ordered, cur - ordered->file_offset);
>  
>  		/*
>  		 * If the ordered extent has finished, we're safe to delete all
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index f5f77c33cf59..b32d4eabe0ab 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -358,6 +358,18 @@ void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered)
>  		mapping_set_error(ordered->inode->vfs_inode.i_mapping, -EIO);
>  }
>  
> +void btrfs_mark_ordered_extent_truncated(struct btrfs_ordered_extent *ordered,
> +					 u64 truncate_len)
> +{
> +	struct btrfs_inode *inode = ordered->inode;
> +
> +	ASSERT(truncate_len <= ordered->num_bytes);
> +	spin_lock(&inode->ordered_tree_lock);
> +	set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> +	ordered->truncated_len = min(ordered->truncated_len, truncate_len);
> +	spin_unlock(&inode->ordered_tree_lock);
> +}
> +
>  static void finish_ordered_fn(struct btrfs_work *work)
>  {
>  	struct btrfs_ordered_extent *ordered_extent;
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index 03e12380a2fd..8d5d5ba1e02f 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -226,6 +226,8 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
>  struct btrfs_ordered_extent *btrfs_split_ordered_extent(
>  			struct btrfs_ordered_extent *ordered, u64 len);
>  void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered);
> +void btrfs_mark_ordered_extent_truncated(struct btrfs_ordered_extent *ordered,
> +					 u64 truncate_len);
>  int __init ordered_data_init(void);
>  void __cold ordered_data_exit(void);
>  
> -- 
> 2.54.0
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 3/3] btrfs: retry faulting in the pages after a zero sized short direct write
  2026-05-30  3:34 ` [PATCH v2 3/3] btrfs: retry faulting in the pages after a zero sized short direct write Qu Wenruo
@ 2026-06-03  4:35   ` Boris Burkov
  0 siblings, 0 replies; 8+ messages in thread
From: Boris Burkov @ 2026-06-03  4:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, May 30, 2026 at 01:04:19PM +0930, Qu Wenruo wrote:
> Currently btrfs_direct_write() will not try to fault in the pages, but
> directly fall back to buffered writes, if the first page of the buffer
> can not be faulted in.
> 
> For example, during generic/362 with nodatasum mount option, there is a
> write at file offset 0, length PAGE_SIZE, and the page is not faulted in.
> Then we go the following callchain and directly fall back to buffered
> IO:
> 
>  btrfs_direct_write()
>  |- btrfs_dio_write()
>  |-  __iomap_dio_rw()
>  |  |- iomap_iter()
>  |  |  |- btrfs_dio_iomap_begin()
>  |  |     Now an ordered extent is allocated for the 4K write.
>  |  |
>  |  |- iomi.status = iomap_dio_iter()
>  |  |  Where iomap_dio_iter() returned -EFAULT.
>  |  |
>  |  |- ret = iomap_iter()
>  |  |  |- btrfs_dio_iomap_end()
>  |  |  |  | return -ENOTBLK
>  |  |  |- return -ENOTBLK
>  |  |- if (ret == -ENOTBLK) { ret = 0; }
>  |     Now the return value is reset to 0.
>  |
>  |- ret = iomap_dio_complete()
>  |  Since no byte is submitted, @ret is now zero.
>  |
>  |- if (iov_iter_count() > 0 && (ret == -EFAULT || ret > 0))
>  |  @ret is zero, thus not meeting the above retry condition
>  |
>  |- Fallback to buffered
> 
> Just slightly loosen the condition to allow retry faulting in pages after
> a zero sized short write.
> 
> Unlike the previous two bug fixes, this one is not really cause any real
> bug, but only reducing the chance to do zero-copy direct IO.
> Thus it doesn't really require stable-CC nor fixes-tag.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Boris Burkov <boris@bur.io>

> ---
>  fs/btrfs/direct-io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index fd53fac7186e..99c959be8de7 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -960,7 +960,7 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	if (ret > 0)
>  		written = ret;
>  
> -	if (iov_iter_count(from) > 0 && (ret == -EFAULT || ret > 0)) {
> +	if (iov_iter_count(from) > 0 && (ret == -EFAULT || ret >= 0)) {
>  		const size_t left = iov_iter_count(from);
>  		/*
>  		 * We have more data left to write. Try to fault in as many as
> -- 
> 2.54.0
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/3] btrfs: fix incorrect buffered IO fallback for append direct writes
  2026-06-03  4:34   ` Boris Burkov
@ 2026-06-03  5:15     ` Qu Wenruo
  0 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2026-06-03  5:15 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, stable



在 2026/6/3 14:04, Boris Burkov 写道:
> On Sat, May 30, 2026 at 01:04:18PM +0930, Qu Wenruo wrote:
[...]
>> The call chain looks like this:
>>
>>   btrfs_direct_write(pos=0, length=4K)
>>   |- __iomap_dio_rw()
>>   |  |- iomap_iter()
>>   |  |  |- btrfs_dio_iomap_begin()
>>   |  |     |- btrfs_get_blocks_direct_write()
>>   |  |        |- i_size_write()
>>   |  |           Which updates the isize to the write end (4K).
>>   |  |
>>   |  |- iomap_dio_iter()
>>   |  |  Failed with -EFAULT on the first page.
>>   |  |
>>   |  |- iomap_iter()
>>   |  |  |- btrfs_dio_iomap_end()
>>   |  |     Detects a short write, return -ENOTBLK
>>   |  |- if (ret == -ENOTBLK) { ret = 0;}
>>   |     Which resets the return value.
>>   |
>>   |- ret = iomap_dio_complet()
>>   |  Which returns 0.
>>   |
>>   |- btrfs_buffered_write(iocb, from);
>>      |- generic_write_checks()
>>         |- iocb->ki_pos = i_size_read()
>>            Which is still the new size (4K), other than the original
>> 	  isize 0.
>>
> 
> The explanation is very clear, so thank you for that. I agree with the
> bug and the direction of the fix.
> 
> However, I fear that there could still be a smaller bug left.
> 
> You have reasoned out a race against buffered writes, and the invalid
> i_size you are fixing up only exists inside the dio thread holding the
> inode lock, so the buffered write does not see it before you get to
> fixup i_size. However, after we set the invalid i_size we release the
> extent lock, so I believe a buffered reader could now observe the
> intermediate too-big i_size before you manage to fix it.
> 
> I believe that the consequence of this is that reader will block on the
> OE, then the split half will be finished/truncated, but the reader could
> still see the too-big i_size and get back zeroes. I am not completely sure
> if this is for sure a bug, but it does feel like it could be wrong.

Buffered read doesn't take inode lock at all, thus inode lock doesn't 
seem enough.

Although I'm not sure if it's a good idea to mix buffered read with 
direct IO in the real world.

> 
> For what it's worth, there is a comment at the i_size_write() in
> btrfs_get_blocks_direct_write() which also confirms that it is important
> that the update is done under the extent lock, not just the inode lock.

I just checked all i_size_write() calls inside btrfs, it looks like they 
are all under extent lock.

So even just for the sake of consistency we should already follow the 
existing requirement.


I'll try to explore some methods to properly update the isize.

My current idea is to keep the EXTENT_DIO_LOCKED hold until iomap_end(), 
so that we can properly update the isize inside iomap_end().
But I'm not sure if it's safe or we may have some other hidden pitfalls.

Thanks,
Qu

> 
> If it is, in fact, safe, then clarifying that in the existing comment
> and/or a new comment would be helpful, I think. The comment is from 2012:
> c3473e830074 ("Btrfs: fix dio write vs buffered read race")
> so I suspect some of the original reasoning may now be out of date..
> 
> Thanks,
> Boris

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-06-03  5:15 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-30  3:34 [PATCH v2 0/3] btrfs: fix generic/362 failures with nodatasum Qu Wenruo
2026-05-30  3:34 ` [PATCH v2 1/3] btrfs: fix false IO failure after falling back to buffered write Qu Wenruo
2026-06-03  4:34   ` Boris Burkov
2026-05-30  3:34 ` [PATCH v2 2/3] btrfs: fix incorrect buffered IO fallback for append direct writes Qu Wenruo
2026-06-03  4:34   ` Boris Burkov
2026-06-03  5:15     ` Qu Wenruo
2026-05-30  3:34 ` [PATCH v2 3/3] btrfs: retry faulting in the pages after a zero sized short direct write Qu Wenruo
2026-06-03  4:35   ` Boris Burkov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.