* [PATCH 0/2] btrfs: fix generic/362 failures with nodatasum @ 2026-05-27 5:06 Qu Wenruo 2026-05-27 5:06 ` [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO Qu Wenruo 2026-05-27 5:06 ` [PATCH 2/2] btrfs: fix incorrect buffered IO fallback for append direct writes Qu Wenruo 0 siblings, 2 replies; 8+ messages in thread From: Qu Wenruo @ 2026-05-27 5:06 UTC (permalink / raw) To: linux-btrfs The test case generic/362 is pretty well hidden by several factors: - "nodatasum" will not take effect due to the bad design of the test case If the target file already exists, the test case will reuse it, meanwhile "nodatasum" mount option only affects new files, meaning if the test case is executed before with data checksum, it will never go through the nodatasum path. There is already an update on the test case to make the failure reliably reproducible: https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/ - Btrfs always falls back to buffered IO if the inode has csum Thus we do not exercise the zero-copy path, hide the failure for the default mount option. With that said, the test case failure needs to be fixed, and there are two bugs in the direct io path: - Treat any short write as an error This is especially common as we have disabled page faulting for reading from the @from iov_iter. This is fixed by the first patch. - No isize rollback after a short write The isize is increased at btrfs_get_blocks_direct_write() but when a short write happened, the isize is not propoerly rolledback, causing later buffered fallback to write at the new isize. Qu Wenruo (2): btrfs: fix false IO failure after falling back to buffered IO btrfs: fix incorrect buffered IO fallback for append direct writes fs/btrfs/direct-io.c | 33 +++++++++++++++++++++++++++------ fs/btrfs/inode.c | 6 +----- fs/btrfs/ordered-data.c | 12 ++++++++++++ fs/btrfs/ordered-data.h | 2 ++ 4 files changed, 42 insertions(+), 11 deletions(-) -- 2.54.0 ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO 2026-05-27 5:06 [PATCH 0/2] btrfs: fix generic/362 failures with nodatasum Qu Wenruo @ 2026-05-27 5:06 ` Qu Wenruo 2026-05-27 16:01 ` Boris Burkov 2026-05-27 5:06 ` [PATCH 2/2] btrfs: fix incorrect buffered IO fallback for append direct writes Qu Wenruo 1 sibling, 1 reply; 8+ messages in thread From: Qu Wenruo @ 2026-05-27 5:06 UTC (permalink / raw) To: linux-btrfs; +Cc: stable [BUG] The test case generic/362 will fail with "nodatasum" mount option (*): MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch generic/362 0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad) --- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930 +++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:21:17.574771567 +0930 @@ -1,2 +1,3 @@ QA output created by 362 +First write failed: Input/output error Silence is golden ... *: If the test case has been executed before with default data checksum, the failure will not reproduce. Need the following fix to make it reliably reproducible: https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/ [CAUSE] Btrfs direct write disable page fault of the input buffer, this is to avoid a deadlock specific to btrfs. So for the test case generic/362, it uses an anonymous page as input buffer. And since the page is not yet faulted in, the direct IO will fail with -EFAULT, causing us to go through the following call chain: btrfs_direct_write() |- btrfs_dio_write() | |- btrfs_dio_iomap_end() | |- btrfs_finish_ordered_extent(uptodate = false); | |- can_finish_ordered_extent() | |- btrfs_mark_ordered_extent_error() | |- mapping_set_error() | Now the address space is marked error. | |- iomap_dio_complete() | The dio bio is empty, nothing submitted. | |- Fallback to buffered | And the buffered write finished without error | |- filemap_fdatawait_range() |- filemap_check_errors() The previous error is recorded, thus an error is returned However the buffered write is properly submitted and finished, the error is from the previous short dio write. [FIX] When a short dio write happened, we shouldn't mark it as an error, but treat it like a truncated write. Extract a helper, btrfs_mark_ordered_extent_truncated(), and utilize that helper to mark the direct IO ordered extent as truncated, so it won't cause failure for the later buffered fallback. Cc: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/direct-io.c | 18 +++++++++++++----- fs/btrfs/inode.c | 6 +----- fs/btrfs/ordered-data.c | 12 ++++++++++++ fs/btrfs/ordered-data.h | 2 ++ 4 files changed, 28 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c index 57167d56dc72..598480b77002 100644 --- a/fs/btrfs/direct-io.c +++ b/fs/btrfs/direct-io.c @@ -610,6 +610,7 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length, { struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); struct btrfs_dio_data *dio_data = iter->private; + struct btrfs_ordered_extent *ordered = dio_data->ordered; size_t submitted = dio_data->submitted; const bool write = !!(flags & IOMAP_WRITE); int ret = 0; @@ -624,16 +625,23 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length, if (submitted < length) { pos += submitted; length -= submitted; - if (write) - btrfs_finish_ordered_extent(dio_data->ordered, - pos, length, false); - else + if (write) { + /* + * We got a short write, will fallback to buffered IO + * for the whole range. + * Set the truncate length to 0, so that no real file + * extent item will be created. + */ + btrfs_mark_ordered_extent_truncated(ordered, 0); + btrfs_finish_ordered_extent(ordered, pos, length, true); + } else { btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos, pos + length - 1, NULL); + } ret = -ENOTBLK; } if (write) { - btrfs_put_ordered_extent(dio_data->ordered); + btrfs_put_ordered_extent(ordered); dio_data->ordered = NULL; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 973a89301baa..2c0131452754 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7590,11 +7590,7 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset, EXTENT_LOCKED | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, &cached_state); - spin_lock(&inode->ordered_tree_lock); - set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags); - ordered->truncated_len = min(ordered->truncated_len, - cur - ordered->file_offset); - spin_unlock(&inode->ordered_tree_lock); + btrfs_mark_ordered_extent_truncated(ordered, cur - ordered->file_offset); /* * If the ordered extent has finished, we're safe to delete all diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index f5f77c33cf59..b32d4eabe0ab 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -358,6 +358,18 @@ void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered) mapping_set_error(ordered->inode->vfs_inode.i_mapping, -EIO); } +void btrfs_mark_ordered_extent_truncated(struct btrfs_ordered_extent *ordered, + u64 truncate_len) +{ + struct btrfs_inode *inode = ordered->inode; + + ASSERT(truncate_len <= ordered->num_bytes); + spin_lock(&inode->ordered_tree_lock); + set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags); + ordered->truncated_len = min(ordered->truncated_len, truncate_len); + spin_unlock(&inode->ordered_tree_lock); +} + static void finish_ordered_fn(struct btrfs_work *work) { struct btrfs_ordered_extent *ordered_extent; diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 03e12380a2fd..8d5d5ba1e02f 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -226,6 +226,8 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end, struct btrfs_ordered_extent *btrfs_split_ordered_extent( struct btrfs_ordered_extent *ordered, u64 len); void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered); +void btrfs_mark_ordered_extent_truncated(struct btrfs_ordered_extent *ordered, + u64 truncate_len); int __init ordered_data_init(void); void __cold ordered_data_exit(void); -- 2.54.0 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO 2026-05-27 5:06 ` [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO Qu Wenruo @ 2026-05-27 16:01 ` Boris Burkov 2026-05-27 21:40 ` Qu Wenruo 0 siblings, 1 reply; 8+ messages in thread From: Boris Burkov @ 2026-05-27 16:01 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs, stable On Wed, May 27, 2026 at 02:36:44PM +0930, Qu Wenruo wrote: > [BUG] > The test case generic/362 will fail with "nodatasum" mount option (*): > > MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch > > generic/362 0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad) > --- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930 > +++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:21:17.574771567 +0930 > @@ -1,2 +1,3 @@ > QA output created by 362 > +First write failed: Input/output error > Silence is golden > ... > > *: If the test case has been executed before with default data checksum, > the failure will not reproduce. Need the following fix to make it > reliably reproducible: > https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/ > > [CAUSE] > Btrfs direct write disable page fault of the input buffer, this is to > avoid a deadlock specific to btrfs. > > So for the test case generic/362, it uses an anonymous page as input > buffer. And since the page is not yet faulted in, the direct IO will > fail with -EFAULT, causing us to go through the following call chain: > > btrfs_direct_write() I believe that when direct_write() sees EFAULT from btrfs_dio_write() it should do the fault and retry, not fallback straight to buffered. if (iov_iter_count(from) > 0 && (ret == -EFAULT || ret > 0)) { const size_t left = iov_iter_count(from); /* * We have more data left to write. Try to fault in as many as * possible of the remainder pages and retry. We do this without * releasing and locking again the inode, to prevent races with * truncate. * * Also, in case the iov refers to pages in the file range of the * file we want to write to (due to a mmap), we could enter an * infinite loop if we retry after faulting the pages in, since * iomap will invalidate any pages in the range early on, before * it tries to fault in the pages of the iov. So we keep track of * how much was left of iov in the previous EFAULT and fallback * to buffered IO in case we haven't made any progress. */ if (left == prev_left) { ret = -ENOTBLK; } else { fault_in_iov_iter_readable(from, left); prev_left = left; goto again; } } > |- btrfs_dio_write() > | |- btrfs_dio_iomap_end() > | |- btrfs_finish_ordered_extent(uptodate = false); > | |- can_finish_ordered_extent() > | |- btrfs_mark_ordered_extent_error() > | |- mapping_set_error() > | Now the address space is marked error. > | > |- iomap_dio_complete() > | The dio bio is empty, nothing submitted. > | > |- Fallback to buffered > | And the buffered write finished without error > | > |- filemap_fdatawait_range() > |- filemap_check_errors() > The previous error is recorded, thus an error is returned > > However the buffered write is properly submitted and finished, the error > is from the previous short dio write. > > [FIX] > When a short dio write happened, we shouldn't mark it as an error, but > treat it like a truncated write. I am quite skeptical of this as the proper fix. I looked into this really thoroughly back in https://lore.kernel.org/linux-btrfs/20230328051957.1161316-12-hch@lst.de/ and remember concluding it was much better to do the OE split and submit separate direct writes, and I believe it was more or less working. I am willing to believe that the mapping_set_error() thing slipped through the cracks, though, so I apologize if I missed that detail. Has something changed since then that makes us fall back to buffered on a write buffer fault? Or am I misunderstanding something about what is happening in this case? g/708 is the test case for that particular corruption, FYI. Thanks for looking into it, Boris > > Extract a helper, btrfs_mark_ordered_extent_truncated(), and utilize > that helper to mark the direct IO ordered extent as truncated, so it > won't cause failure for the later buffered fallback. > > Cc: stable@vger.kernel.org # 6.1+ > Signed-off-by: Qu Wenruo <wqu@suse.com> > --- > fs/btrfs/direct-io.c | 18 +++++++++++++----- > fs/btrfs/inode.c | 6 +----- > fs/btrfs/ordered-data.c | 12 ++++++++++++ > fs/btrfs/ordered-data.h | 2 ++ > 4 files changed, 28 insertions(+), 10 deletions(-) > > diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c > index 57167d56dc72..598480b77002 100644 > --- a/fs/btrfs/direct-io.c > +++ b/fs/btrfs/direct-io.c > @@ -610,6 +610,7 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length, > { > struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); > struct btrfs_dio_data *dio_data = iter->private; > + struct btrfs_ordered_extent *ordered = dio_data->ordered; > size_t submitted = dio_data->submitted; > const bool write = !!(flags & IOMAP_WRITE); > int ret = 0; > @@ -624,16 +625,23 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length, > if (submitted < length) { > pos += submitted; > length -= submitted; > - if (write) > - btrfs_finish_ordered_extent(dio_data->ordered, > - pos, length, false); > - else > + if (write) { > + /* > + * We got a short write, will fallback to buffered IO > + * for the whole range. > + * Set the truncate length to 0, so that no real file > + * extent item will be created. > + */ > + btrfs_mark_ordered_extent_truncated(ordered, 0); > + btrfs_finish_ordered_extent(ordered, pos, length, true); > + } else { > btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos, > pos + length - 1, NULL); > + } > ret = -ENOTBLK; > } > if (write) { > - btrfs_put_ordered_extent(dio_data->ordered); > + btrfs_put_ordered_extent(ordered); > dio_data->ordered = NULL; > } > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 973a89301baa..2c0131452754 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -7590,11 +7590,7 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset, > EXTENT_LOCKED | EXTENT_DO_ACCOUNTING | > EXTENT_DEFRAG, &cached_state); > > - spin_lock(&inode->ordered_tree_lock); > - set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags); > - ordered->truncated_len = min(ordered->truncated_len, > - cur - ordered->file_offset); > - spin_unlock(&inode->ordered_tree_lock); > + btrfs_mark_ordered_extent_truncated(ordered, cur - ordered->file_offset); > > /* > * If the ordered extent has finished, we're safe to delete all > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c > index f5f77c33cf59..b32d4eabe0ab 100644 > --- a/fs/btrfs/ordered-data.c > +++ b/fs/btrfs/ordered-data.c > @@ -358,6 +358,18 @@ void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered) > mapping_set_error(ordered->inode->vfs_inode.i_mapping, -EIO); > } > > +void btrfs_mark_ordered_extent_truncated(struct btrfs_ordered_extent *ordered, > + u64 truncate_len) > +{ > + struct btrfs_inode *inode = ordered->inode; > + > + ASSERT(truncate_len <= ordered->num_bytes); > + spin_lock(&inode->ordered_tree_lock); > + set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags); > + ordered->truncated_len = min(ordered->truncated_len, truncate_len); > + spin_unlock(&inode->ordered_tree_lock); > +} > + > static void finish_ordered_fn(struct btrfs_work *work) > { > struct btrfs_ordered_extent *ordered_extent; > diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h > index 03e12380a2fd..8d5d5ba1e02f 100644 > --- a/fs/btrfs/ordered-data.h > +++ b/fs/btrfs/ordered-data.h > @@ -226,6 +226,8 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end, > struct btrfs_ordered_extent *btrfs_split_ordered_extent( > struct btrfs_ordered_extent *ordered, u64 len); > void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered); > +void btrfs_mark_ordered_extent_truncated(struct btrfs_ordered_extent *ordered, > + u64 truncate_len); > int __init ordered_data_init(void); > void __cold ordered_data_exit(void); > > -- > 2.54.0 > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO 2026-05-27 16:01 ` Boris Burkov @ 2026-05-27 21:40 ` Qu Wenruo 2026-05-29 4:32 ` Qu Wenruo 0 siblings, 1 reply; 8+ messages in thread From: Qu Wenruo @ 2026-05-27 21:40 UTC (permalink / raw) To: Boris Burkov, Qu Wenruo; +Cc: linux-btrfs, stable 在 2026/5/28 01:31, Boris Burkov 写道: > On Wed, May 27, 2026 at 02:36:44PM +0930, Qu Wenruo wrote: >> [BUG] >> The test case generic/362 will fail with "nodatasum" mount option (*): >> >> MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch >> >> generic/362 0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad) >> --- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930 >> +++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:21:17.574771567 +0930 >> @@ -1,2 +1,3 @@ >> QA output created by 362 >> +First write failed: Input/output error >> Silence is golden >> ... >> >> *: If the test case has been executed before with default data checksum, >> the failure will not reproduce. Need the following fix to make it >> reliably reproducible: >> https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/ >> >> [CAUSE] >> Btrfs direct write disable page fault of the input buffer, this is to >> avoid a deadlock specific to btrfs. >> >> So for the test case generic/362, it uses an anonymous page as input >> buffer. And since the page is not yet faulted in, the direct IO will >> fail with -EFAULT, causing us to go through the following call chain: >> >> btrfs_direct_write() > > I believe that when direct_write() sees EFAULT from btrfs_dio_write() it > should do the fault and retry, not fallback straight to buffered. It doesn't return -EFAULT. btrfs_direct_write() returned an dio pointer, although it has not submitted any bytes for that dio structure. So later iomap_dio_complete() returned 0. The full trace printk looks like this: dio-append-buf--2501 [008] ..... 627.444860: btrfs_direct_write: enter, root=5 ino=257 pos=0 iov_count=4096 dio-append-buf--2501 [008] ..... 627.444905: btrfs_direct_write: dio complete, ret=0 dio-append-buf--2501 [008] ..... 627.444906: btrfs_direct_write: buffered, pos=0 iov_counter=4096 dio-append-buf--2501 [008] ..... 627.444935: btrfs_direct_write: exit, root=5 ino=257 pos=0 iov_count=4096 ret=-5 written=0 > > if (iov_iter_count(from) > 0 && (ret == -EFAULT || ret > 0)) { > const size_t left = iov_iter_count(from); > /* > * We have more data left to write. Try to fault in as many as > * possible of the remainder pages and retry. We do this without > * releasing and locking again the inode, to prevent races with > * truncate. > * > * Also, in case the iov refers to pages in the file range of the > * file we want to write to (due to a mmap), we could enter an > * infinite loop if we retry after faulting the pages in, since > * iomap will invalidate any pages in the range early on, before > * it tries to fault in the pages of the iov. So we keep track of > * how much was left of iov in the previous EFAULT and fallback > * to buffered IO in case we haven't made any progress. Furthermore, the page fault in won't make any difference for this particular case, exactly explained by the comment itself, that the page cache will be invalidated. > */ > if (left == prev_left) { > ret = -ENOTBLK; > } else { > fault_in_iov_iter_readable(from, left); > prev_left = left; > goto again; > } > } > >> |- btrfs_dio_write() >> | |- btrfs_dio_iomap_end() >> | |- btrfs_finish_ordered_extent(uptodate = false); >> | |- can_finish_ordered_extent() >> | |- btrfs_mark_ordered_extent_error() >> | |- mapping_set_error() >> | Now the address space is marked error. >> | >> |- iomap_dio_complete() >> | The dio bio is empty, nothing submitted. >> | >> |- Fallback to buffered >> | And the buffered write finished without error >> | >> |- filemap_fdatawait_range() >> |- filemap_check_errors() >> The previous error is recorded, thus an error is returned >> >> However the buffered write is properly submitted and finished, the error >> is from the previous short dio write. >> >> [FIX] >> When a short dio write happened, we shouldn't mark it as an error, but >> treat it like a truncated write. > > I am quite skeptical of this as the proper fix. I looked into this > really thoroughly back in > https://lore.kernel.org/linux-btrfs/20230328051957.1161316-12-hch@lst.de/ > and remember concluding it was much better to do the OE split and submit > separate direct writes, and I believe it was more or less working. I am > willing to believe that the mapping_set_error() thing slipped through > the cracks, though, so I apologize if I missed that detail. Has > something changed since then that makes us fall back to buffered on a > write buffer fault? The buffered fallback for datacsum is masking the failure for most cases, and the test case itself is also hiding the failure even for nodatasum mount option, so we are not detecting it for a long time. Secondly I also believed the btrfs_dio_write() should fail with -EFAULT initially, but that's no longer the case. My guess is iomap dio error handling change, but will need extra digging to pin down the iomap commit. > Or am I misunderstanding something about what is > happening in this case? > > g/708 is the test case for that particular corruption, FYI. That's also the test case I'm failing with IOMAP_DIO_BOUNCE, but I think we should fix the g/362 failure first, as it is already failing now. Thanks, Qu > > Thanks for looking into it, > Boris ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO 2026-05-27 21:40 ` Qu Wenruo @ 2026-05-29 4:32 ` Qu Wenruo 2026-05-29 4:37 ` Qu Wenruo 0 siblings, 1 reply; 8+ messages in thread From: Qu Wenruo @ 2026-05-29 4:32 UTC (permalink / raw) To: Qu Wenruo, Boris Burkov; +Cc: linux-btrfs, stable 在 2026/5/28 07:10, Qu Wenruo 写道: > > > 在 2026/5/28 01:31, Boris Burkov 写道: >> On Wed, May 27, 2026 at 02:36:44PM +0930, Qu Wenruo wrote: >>> [BUG] >>> The test case generic/362 will fail with "nodatasum" mount option (*): >>> >>> MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch >>> >>> generic/362 0s ... - output mismatch (see /home/adam/xfstests/ >>> results//generic/362.out.bad) >>> --- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930 >>> +++ /home/adam/xfstests/results//generic/362.out.bad >>> 2026-05-27 10:21:17.574771567 +0930 >>> @@ -1,2 +1,3 @@ >>> QA output created by 362 >>> +First write failed: Input/output error >>> Silence is golden >>> ... >>> >>> *: If the test case has been executed before with default data checksum, >>> the failure will not reproduce. Need the following fix to make it >>> reliably reproducible: >>> https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/ >>> >>> [CAUSE] >>> Btrfs direct write disable page fault of the input buffer, this is to >>> avoid a deadlock specific to btrfs. >>> >>> So for the test case generic/362, it uses an anonymous page as input >>> buffer. And since the page is not yet faulted in, the direct IO will >>> fail with -EFAULT, causing us to go through the following call chain: >>> >>> btrfs_direct_write() >> >> I believe that when direct_write() sees EFAULT from btrfs_dio_write() it >> should do the fault and retry, not fallback straight to buffered. > > It doesn't return -EFAULT. > > btrfs_direct_write() returned an dio pointer, although it has not > submitted any bytes for that dio structure. > > So later iomap_dio_complete() returned 0. > I added more trace_printk(), and it shows it's the EFAULT handling reset the error number: ("r/i=" shows the root id and ino from btrfs, "i=" shows the generic ino, "pos=" shows the file pos and length) 22.223076: __iomap_dio_rw: enter, i=257 pos=0/4096 22.223078: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=0/4096 22.223097: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=0/4096 phy=13631488 22.223106: __iomap_dio_rw: while loop, iomap_dio_iter ret=-14 22.223107: btrfs_dio_iomap_end: enter, r/i=5/257 pos=0/4096 written=0 ordered=0/4096 22.223126: btrfs_dio_iomap_end: exit, r/i=5/257 pos=0/4096 written=0 ret=-15 22.223179: __iomap_dio_rw: after while loop, ret=-15 dio->size=0 22.223180: __iomap_dio_rw: exit with dio, i=257 As you can see, iomap_dio_iter() itself returned -EFAULT, which is expected as we disabled the page fault for the iov_iter. But then it's at the following code block that our return value is reset to 0: if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) { if (!(iocb->ki_flags & IOCB_NOWAIT)) wait_for_completion = true; ret = 0; } So that explains why we will never get a -EFAULT directly from __iomap_dio_rw(), at least for the write case. > Furthermore, the page fault in won't make any difference for this > particular case, exactly explained by the comment itself, that the page > cache will be invalidated. Sorry, I got confused with another case where the buffer page is from page cache. For this particular case, it's unrelated, and we are able to fault-in the anonymous page. >>> [FIX] >>> When a short dio write happened, we shouldn't mark it as an error, but >>> treat it like a truncated write. >> >> I am quite skeptical of this as the proper fix. I looked into this >> really thoroughly back in >> https://lore.kernel.org/linux-btrfs/20230328051957.1161316-12-hch@lst.de/ >> and remember concluding it was much better to do the OE split and submit >> separate direct writes, and I believe it was more or less working. Firstly, the OE split and the proper fix doesn't conflict at all. In fact both co-operate with each other pretty well, especially shown by the second write of the test case, shown by the trace: 22.223529: btrfs_direct_write: enter, r/i=5/257 pos=0/8192 22.223530: __iomap_dio_rw: enter, i=257 pos=0/8192 22.223531: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=0/8192 22.223545: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=0/8192 phy=13635584 22.223561: __iomap_dio_rw: while loop, iomap_dio_iter ret=0 22.223561: btrfs_dio_iomap_end: enter, r/i=5/257 pos=0/8192 written=4096 ordered=4096/4096 Here we only copied the first page, and at this stage, the original 8K OE is already being split into two. And the fix will properly truncate and remove the later half. 22.223568: btrfs_dio_iomap_end: exit, r/i=5/257 pos=4096/4096 written=4096 ret=-15 22.223582: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=4096/4096 22.223588: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=4096/4096 phy=13639680 This time we retry with the 2nd page. 22.223588: __iomap_dio_rw: while loop, iomap_dio_iter ret=-14 And still failed to fault in. 22.223588: btrfs_dio_iomap_end: enter, r/i=5/257 pos=4096/4096 written=0 ordered=4096/4096 22.223590: btrfs_dio_iomap_end: exit, r/i=5/257 pos=4096/4096 written=0 ret=-15 So again remove the failed OE. 22.223610: __iomap_dio_rw: after while loop, ret=-15 dio->size=4096 22.223712: __iomap_dio_rw: exit with dio, i=257 22.223713: btrfs_direct_write: iomap_dio_complete() ret=4096 So __iomap_dio_rw() only succeeded writed 4K. Now btrfs will fault-in the 2nd page and retry. 22.223717: __iomap_dio_rw: enter, i=257 pos=4096/4096 22.223718: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=4096/4096 22.223730: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=4096/4096 phy=13639680 22.223740: __iomap_dio_rw: while loop, iomap_dio_iter ret=0 22.223740: btrfs_dio_iomap_end: enter, r/i=5/257 pos=4096/4096 written=4096 ordered=4096/4096 22.223741: btrfs_dio_iomap_end: exit, r/i=5/257 pos=4096/4096 written=4096 ret=0 Now with the 2nd page faulted in, this time we succeeded in handling the 2nd page, and every thing goes one. 22.223755: __iomap_dio_rw: after while loop, ret=0 dio->size=4096 22.223847: __iomap_dio_rw: exit with dio, i=257 22.223848: btrfs_direct_write: iomap_dio_complete() ret=8192 22.223849: btrfs_direct_write: exit, r/i=5/257 pos=0/8192 ret=8192 written=8192 So this looks exactly the correct fix. If this extra trace helps, I can definitely reword the commit message to reduce any possible confusion. Thanks, Qu ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO 2026-05-29 4:32 ` Qu Wenruo @ 2026-05-29 4:37 ` Qu Wenruo 0 siblings, 0 replies; 8+ messages in thread From: Qu Wenruo @ 2026-05-29 4:37 UTC (permalink / raw) To: Qu Wenruo, Boris Burkov; +Cc: linux-btrfs, stable 在 2026/5/29 14:02, Qu Wenruo 写道: > > > 在 2026/5/28 07:10, Qu Wenruo 写道: >> >> >> 在 2026/5/28 01:31, Boris Burkov 写道: >>> On Wed, May 27, 2026 at 02:36:44PM +0930, Qu Wenruo wrote: >>>> [BUG] >>>> The test case generic/362 will fail with "nodatasum" mount option (*): >>>> >>>> MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch >>>> >>>> generic/362 0s ... - output mismatch (see /home/adam/xfstests/ >>>> results//generic/362.out.bad) >>>> --- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930 >>>> +++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 >>>> 10:21:17.574771567 +0930 >>>> @@ -1,2 +1,3 @@ >>>> QA output created by 362 >>>> +First write failed: Input/output error >>>> Silence is golden >>>> ... >>>> >>>> *: If the test case has been executed before with default data >>>> checksum, >>>> the failure will not reproduce. Need the following fix to make it >>>> reliably reproducible: >>>> https://lore.kernel.org/linux-btrfs/20260526070055.60193-1- >>>> wqu@suse.com/ >>>> >>>> [CAUSE] >>>> Btrfs direct write disable page fault of the input buffer, this is to >>>> avoid a deadlock specific to btrfs. >>>> >>>> So for the test case generic/362, it uses an anonymous page as input >>>> buffer. And since the page is not yet faulted in, the direct IO will >>>> fail with -EFAULT, causing us to go through the following call chain: >>>> >>>> btrfs_direct_write() >>> >>> I believe that when direct_write() sees EFAULT from btrfs_dio_write() it >>> should do the fault and retry, not fallback straight to buffered. >> >> It doesn't return -EFAULT. >> >> btrfs_direct_write() returned an dio pointer, although it has not >> submitted any bytes for that dio structure. >> >> So later iomap_dio_complete() returned 0. >> > > I added more trace_printk(), and it shows it's the EFAULT handling reset > the error number: > ("r/i=" shows the root id and ino from btrfs, "i=" shows the generic > ino, "pos=" shows the file pos and length) > > 22.223076: __iomap_dio_rw: enter, i=257 pos=0/4096 > 22.223078: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=0/4096 > 22.223097: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=0/4096 phy=13631488 > 22.223106: __iomap_dio_rw: while loop, iomap_dio_iter ret=-14 > 22.223107: btrfs_dio_iomap_end: enter, r/i=5/257 pos=0/4096 written=0 > ordered=0/4096 > 22.223126: btrfs_dio_iomap_end: exit, r/i=5/257 pos=0/4096 written=0 > ret=-15 > 22.223179: __iomap_dio_rw: after while loop, ret=-15 dio->size=0 > 22.223180: __iomap_dio_rw: exit with dio, i=257 > > As you can see, iomap_dio_iter() itself returned -EFAULT, which is > expected as we disabled the page fault for the iov_iter. > > But then it's at the following code block that our return value is reset > to 0: > > if (ret == -EFAULT && dio->size && (dio_flags & > IOMAP_DIO_PARTIAL)) { > if (!(iocb->ki_flags & IOCB_NOWAIT)) > wait_for_completion = true; > ret = 0; > } My bad, it is the ENOTBLK handling resetting the error number. At that stage, @ret is already the error number from iomap_end(), not from the iomap_dio_iter(): /* magic error code to fall back to buffered I/O */ if (ret == -ENOTBLK) { wait_for_completion = true; ret = 0; } > > So that explains why we will never get a -EFAULT directly from > __iomap_dio_rw(), at least for the write case. > > >> Furthermore, the page fault in won't make any difference for this >> particular case, exactly explained by the comment itself, that the >> page cache will be invalidated. > > Sorry, I got confused with another case where the buffer page is from > page cache. > > For this particular case, it's unrelated, and we are able to fault-in > the anonymous page. > > >>>> [FIX] >>>> When a short dio write happened, we shouldn't mark it as an error, but >>>> treat it like a truncated write. >>> >>> I am quite skeptical of this as the proper fix. I looked into this >>> really thoroughly back in >>> https://lore.kernel.org/linux-btrfs/20230328051957.1161316-12- >>> hch@lst.de/ >>> and remember concluding it was much better to do the OE split and submit >>> separate direct writes, and I believe it was more or less working. > > Firstly, the OE split and the proper fix doesn't conflict at all. > > In fact both co-operate with each other pretty well, especially shown by > the second write of the test case, shown by the trace: > > 22.223529: btrfs_direct_write: enter, r/i=5/257 pos=0/8192 > 22.223530: __iomap_dio_rw: enter, i=257 pos=0/8192 > 22.223531: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=0/8192 > 22.223545: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=0/8192 phy=13635584 > 22.223561: __iomap_dio_rw: while loop, iomap_dio_iter ret=0 > 22.223561: btrfs_dio_iomap_end: enter, r/i=5/257 pos=0/8192 > written=4096 ordered=4096/4096 > > Here we only copied the first page, and at this stage, the original 8K > OE is already being split into two. > And the fix will properly truncate and remove the later half. > > 22.223568: btrfs_dio_iomap_end: exit, r/i=5/257 pos=4096/4096 > written=4096 ret=-15 > 22.223582: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=4096/4096 > 22.223588: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=4096/4096 > phy=13639680 > > This time we retry with the 2nd page. > > 22.223588: __iomap_dio_rw: while loop, iomap_dio_iter ret=-14 > > And still failed to fault in. > > 22.223588: btrfs_dio_iomap_end: enter, r/i=5/257 pos=4096/4096 > written=0 ordered=4096/4096 > 22.223590: btrfs_dio_iomap_end: exit, r/i=5/257 pos=4096/4096 > written=0 ret=-15 > > So again remove the failed OE. > > 22.223610: __iomap_dio_rw: after while loop, ret=-15 dio->size=4096 > 22.223712: __iomap_dio_rw: exit with dio, i=257 > 22.223713: btrfs_direct_write: iomap_dio_complete() ret=4096 > > So __iomap_dio_rw() only succeeded writed 4K. > > Now btrfs will fault-in the 2nd page and retry. > > 22.223717: __iomap_dio_rw: enter, i=257 pos=4096/4096 > 22.223718: btrfs_dio_iomap_begin: enter, r/i=5/257 pos=4096/4096 > 22.223730: btrfs_dio_iomap_begin: exit, r/i=5/257 pos=4096/4096 > phy=13639680 > 22.223740: __iomap_dio_rw: while loop, iomap_dio_iter ret=0 > 22.223740: btrfs_dio_iomap_end: enter, r/i=5/257 pos=4096/4096 > written=4096 ordered=4096/4096 > 22.223741: btrfs_dio_iomap_end: exit, r/i=5/257 pos=4096/4096 > written=4096 ret=0 > > Now with the 2nd page faulted in, this time we succeeded in handling the > 2nd page, and every thing goes one. > > 22.223755: __iomap_dio_rw: after while loop, ret=0 dio->size=4096 > 22.223847: __iomap_dio_rw: exit with dio, i=257 > 22.223848: btrfs_direct_write: iomap_dio_complete() ret=8192 > 22.223849: btrfs_direct_write: exit, r/i=5/257 pos=0/8192 ret=8192 > written=8192 > > So this looks exactly the correct fix. > > If this extra trace helps, I can definitely reword the commit message to > reduce any possible confusion. > > Thanks, > Qu ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 2/2] btrfs: fix incorrect buffered IO fallback for append direct writes 2026-05-27 5:06 [PATCH 0/2] btrfs: fix generic/362 failures with nodatasum Qu Wenruo 2026-05-27 5:06 ` [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO Qu Wenruo @ 2026-05-27 5:06 ` Qu Wenruo 2026-05-29 7:55 ` Qu Wenruo 1 sibling, 1 reply; 8+ messages in thread From: Qu Wenruo @ 2026-05-27 5:06 UTC (permalink / raw) To: linux-btrfs; +Cc: stable [BUG] With the previous bug of short direct writes fixed, test case generic/362 (*) will still fail with the following error with nodatasum mount option: generic/362 0s ... _check_dmesg: something found in dmesg (see /home/adam/xfstests/results//generic/362.dmesg) - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad) --- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930 +++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:13:09.072485767 +0930 @@ -1,2 +1,3 @@ QA output created by 362 +Wrong file size after first write, got 8192 expected 4096 Silence is golden ... *: If the test case has been executed before with default data checksum, the failure will not reproduce. Need the following fix to make it reliably reproducible: https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/ [CAUSE] Btrfs disables page fault in during direct IO write, to avoid a specific deadlock that is only specific to btrfs. So for the test case generic/362, it will make the direct IO to fail with -EFAULT, then we fallback to buffered IO. However at btrfs_dio_iomap_begin() -> btrfs_get_blocks_direct_write(), we have already updated the isize during extent allocation. And if we failed the direct IO, the isize is still the updated one. So it means the buffered write will respect the IOCB_APPEND flag and write the new data at the update isize, resulting the above failure. [FIX] Introduce btrfs_dio_data::updated_isize and btrfs_dio_data::old_isize, so that if btrfs_get_blocks_direct_write() enlarged the inode size, we can know the old inode size. Then if we got a short write, and btrfs_dio_data::updated_isize is set, then revert to the old isize, so the buffered fallback can write into the correct location. Cc: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/direct-io.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c index 598480b77002..24163a4bcfb0 100644 --- a/fs/btrfs/direct-io.c +++ b/fs/btrfs/direct-io.c @@ -15,10 +15,16 @@ struct btrfs_dio_data { ssize_t submitted; + /* + * If we got a short dio write and @updated_isize is set, + * revert to the old isize. + */ + loff_t old_isize; struct extent_changeset *data_reserved; struct btrfs_ordered_extent *ordered; bool data_space_reserved; bool nocow_done; + bool updated_isize; }; struct btrfs_dio_private { @@ -341,8 +347,11 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, * Need to update the i_size under the extent lock so buffered * readers will get the updated i_size when we unlock. */ - if (start + len > i_size_read(inode)) + if (start + len > i_size_read(inode)) { + dio_data->old_isize = i_size_read(inode); + dio_data->updated_isize = true; i_size_write(inode, start + len); + } out: if (ret && space_reserved) { btrfs_delalloc_release_extents(BTRFS_I(inode), len); @@ -634,6 +643,10 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length, */ btrfs_mark_ordered_extent_truncated(ordered, 0); btrfs_finish_ordered_extent(ordered, pos, length, true); + if (dio_data->updated_isize) { + i_size_write(inode, dio_data->old_isize); + dio_data->updated_isize = false; + } } else { btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos, pos + length - 1, NULL); -- 2.54.0 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] btrfs: fix incorrect buffered IO fallback for append direct writes 2026-05-27 5:06 ` [PATCH 2/2] btrfs: fix incorrect buffered IO fallback for append direct writes Qu Wenruo @ 2026-05-29 7:55 ` Qu Wenruo 0 siblings, 0 replies; 8+ messages in thread From: Qu Wenruo @ 2026-05-29 7:55 UTC (permalink / raw) To: linux-btrfs; +Cc: stable 在 2026/5/27 14:36, Qu Wenruo 写道: > [BUG] > With the previous bug of short direct writes fixed, test case > generic/362 (*) will still fail with the following error with nodatasum > mount option: > > generic/362 0s ... _check_dmesg: something found in dmesg (see /home/adam/xfstests/results//generic/362.dmesg) > - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad) > --- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930 > +++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:13:09.072485767 +0930 > @@ -1,2 +1,3 @@ > QA output created by 362 > +Wrong file size after first write, got 8192 expected 4096 > Silence is golden > ... > > *: If the test case has been executed before with default data checksum, > the failure will not reproduce. Need the following fix to make it > reliably reproducible: > https://lore.kernel.org/linux-btrfs/20260526070055.60193-1-wqu@suse.com/ > > [CAUSE] > Btrfs disables page fault in during direct IO write, to avoid a specific > deadlock that is only specific to btrfs. > > So for the test case generic/362, it will make the direct IO to fail > with -EFAULT, then we fallback to buffered IO. > > However at btrfs_dio_iomap_begin() -> btrfs_get_blocks_direct_write(), > we have already updated the isize during extent allocation. > And if we failed the direct IO, the isize is still the updated one. > > So it means the buffered write will respect the IOCB_APPEND flag and > write the new data at the update isize, resulting the above failure. > > [FIX] > Introduce btrfs_dio_data::updated_isize and btrfs_dio_data::old_isize, > so that if btrfs_get_blocks_direct_write() enlarged the inode size, we > can know the old inode size. > > Then if we got a short write, and btrfs_dio_data::updated_isize is set, > then revert to the old isize, so the buffered fallback can write into > the correct location. > > Cc: stable@vger.kernel.org # 6.1+ > Signed-off-by: Qu Wenruo <wqu@suse.com> > --- > fs/btrfs/direct-io.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c > index 598480b77002..24163a4bcfb0 100644 > --- a/fs/btrfs/direct-io.c > +++ b/fs/btrfs/direct-io.c > @@ -15,10 +15,16 @@ > > struct btrfs_dio_data { > ssize_t submitted; > + /* > + * If we got a short dio write and @updated_isize is set, > + * revert to the old isize. > + */ > + loff_t old_isize; > struct extent_changeset *data_reserved; > struct btrfs_ordered_extent *ordered; > bool data_space_reserved; > bool nocow_done; > + bool updated_isize; > }; > > struct btrfs_dio_private { > @@ -341,8 +347,11 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, > * Need to update the i_size under the extent lock so buffered > * readers will get the updated i_size when we unlock. > */ > - if (start + len > i_size_read(inode)) > + if (start + len > i_size_read(inode)) { > + dio_data->old_isize = i_size_read(inode); > + dio_data->updated_isize = true; > i_size_write(inode, start + len); > + } > out: > if (ret && space_reserved) { > btrfs_delalloc_release_extents(BTRFS_I(inode), len); > @@ -634,6 +643,10 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length, > */ > btrfs_mark_ordered_extent_truncated(ordered, 0); > btrfs_finish_ordered_extent(ordered, pos, length, true); > + if (dio_data->updated_isize) { > + i_size_write(inode, dio_data->old_isize); This part is incorrect, as if we have a short direct write, we can still have part of the range written. In that case we should not revert to the old isize, but to the new written size. Considering we may go several different tries, I think it's better to save and set the isize inside btrfs_direct_write(), other than relying on the iomap_start/iomap_end() callbacks. Thanks, Qu> + dio_data->updated_isize = false; > + } > } else { > btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos, > pos + length - 1, NULL); ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2026-05-29 7:55 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-05-27 5:06 [PATCH 0/2] btrfs: fix generic/362 failures with nodatasum Qu Wenruo 2026-05-27 5:06 ` [PATCH 1/2] btrfs: fix false IO failure after falling back to buffered IO Qu Wenruo 2026-05-27 16:01 ` Boris Burkov 2026-05-27 21:40 ` Qu Wenruo 2026-05-29 4:32 ` Qu Wenruo 2026-05-29 4:37 ` Qu Wenruo 2026-05-27 5:06 ` [PATCH 2/2] btrfs: fix incorrect buffered IO fallback for append direct writes Qu Wenruo 2026-05-29 7:55 ` Qu Wenruo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox