Linux block layer
 help / color / mirror / Atom feed
* [PATCH v3 0/4] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
@ 2026-06-12  9:51 Qu Wenruo
  2026-06-12  9:51 ` [PATCH v3 1/4] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write() Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

[CHANGELOG]
v3:
- Fix a bug in error handling of bio_iov_iter_bounce_write()
  Which can lead to generic/708 failure on btrfs.

- Respect nofault flag in bio_iov_iter_bounce_write()
  To avoid btrfs specific deadlocks.

- Reject NOWAIT and BOUNCE direct IOs
  Since BOUNCE always allocate pages using GFP_KERNEL, which can sleep
  and break NOWAIT requirement, has to reject such combination.

v2:
- Rework the comment in btrfs_dio_write()

Commit 968f19c5b1b7 ("btrfs: always fallback to buffered write if the
inode requires checksum") solved the csum mismatch caused by unstable
direct IO buffers, it has a pretty hefty performance penalty.

Meanwhile upstream iomap has introduce IOMAP_DIO_BOUNCE flag to get
stable buffers meanwhile without falling back to buffered IOs.

Using that flag btrfs can reach 95% of the original zero-copy direct IO
performance, almost 2x the current buffered fallback performance.

However during my tests, there are several bugs related to iomap that
can lead to direct IO test case failures:

- generic/708
  Results garbage in the end of the writes, is a bug in the error
  handling of a short copy.

  Fixed in the first patch.

- Deadlock if using the page cache as direct IO buffer
  This is because bio_iov_iter_bounce_write() doesn't respect
  iov_iter::nofault flag.

  Fixed in the second patch.

- Possible NOWAIT and BOUNCE conflicts
  BOUNCE flag for both reads and writes will allocate new folios using
  GFP_KERNEL, which can sleep and break NOWAIT requirement.

  Reject such combination in iomap_dio_bio_iter() directly in the 3rd
  patch.

And the final one will enable btrfs to use IOMAP_DIO_BOUNCE flag, so
that even with data checksum we do not need to fallback to buffered IO
and reclaim most of the dropped direct IO performance.

Qu Wenruo (4):
  block: revert the iov_iter after a short copy in
    bio_iov_iter_bounce_write()
  block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
  iomap: reject NOWAIT and BOUNCE direct IOs
  btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered
    IO

 block/bio.c          | 10 ++++++---
 fs/btrfs/direct-io.c | 53 ++++++++++++++++++++------------------------
 fs/iomap/direct-io.c |  4 ++++
 3 files changed, 35 insertions(+), 32 deletions(-)

-- 
2.54.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH v3 1/4] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write()
  2026-06-12  9:51 [PATCH v3 0/4] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
@ 2026-06-12  9:51 ` Qu Wenruo
  2026-06-12  9:51 ` [PATCH v3 2/4] block: respect iov_iter::nofault flag " Qu Wenruo
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

For the incoming IOMAP_DIO_BOUNCE flag usage inside btrfs, it's pretty
easy to hit short copy inside bio_iov_iter_bounce_write().

This is because btrfs has disabled page fault to avoid certain deadlock
during direct writes, and instead btrfs manually fault in the pages then
retry.

And inside bio_iov_iter_bounce_write(), if we hit a short write, we
didn't revert the iov_iter, which can cause problems like unexpected
garbage for the next retry.

Revert the iov_iter after a short copy.

One thing to note is that, the folio is allocated then immediately
queued into the bio, so the proper revert size should be
(bi_size - this_len + copied).

Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 block/bio.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f10900b3f42..b33ff69bb722 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1321,6 +1321,7 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
 
 	do {
 		size_t this_len = min(total_len, SZ_1M);
+		size_t copied;
 		struct folio *folio;
 
 		if (this_len > minsize * 2)
@@ -1334,12 +1335,12 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
 			break;
 		bio_add_folio_nofail(bio, folio, this_len, 0);
 
-		if (copy_from_iter(folio_address(folio), this_len, iter) !=
-				this_len) {
+		copied = copy_from_iter(folio_address(folio), this_len, iter);
+		if (copied < this_len) {
+			iov_iter_revert(iter, bio->bi_iter.bi_size - this_len + copied);
 			bio_free_folios(bio);
 			return -EFAULT;
 		}
-
 		total_len -= this_len;
 	} while (total_len && bio->bi_vcnt < bio->bi_max_vecs);
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH v3 2/4] block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
  2026-06-12  9:51 [PATCH v3 0/4] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
  2026-06-12  9:51 ` [PATCH v3 1/4] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write() Qu Wenruo
@ 2026-06-12  9:51 ` Qu Wenruo
  2026-06-12  9:51 ` [PATCH v3 3/4] iomap: reject NOWAIT and BOUNCE direct IOs Qu Wenruo
  2026-06-12  9:51 ` [PATCH v3 4/4] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
  3 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

For the incoming usage of IOMAP_DIO_BOUNCE in btrfs, btrfs has set
iov_iter::nofault to prevent deadlock when a page fault is needed to
read out the buffer.

However bio_iov_iter_bounce_write() doesn't respect iov_iter::nofault
flag, and just call a plain copy_from_iter() so it can still trigger
page fault and cause deadlock in btrfs.

Fix it by utilizing copy_folio_from_iter_atomic() if nofault flag is
set, otherwise use copy_folio_from_iter().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 block/bio.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index b33ff69bb722..01bb76d9717c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1335,7 +1335,10 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
 			break;
 		bio_add_folio_nofail(bio, folio, this_len, 0);
 
-		copied = copy_from_iter(folio_address(folio), this_len, iter);
+		if (iter->nofault)
+			copied = copy_folio_from_iter_atomic(folio, 0, this_len, iter);
+		else
+			copied = copy_folio_from_iter(folio, 0, this_len, iter);
 		if (copied < this_len) {
 			iov_iter_revert(iter, bio->bi_iter.bi_size - this_len + copied);
 			bio_free_folios(bio);
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH v3 3/4] iomap: reject NOWAIT and BOUNCE direct IOs
  2026-06-12  9:51 [PATCH v3 0/4] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
  2026-06-12  9:51 ` [PATCH v3 1/4] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write() Qu Wenruo
  2026-06-12  9:51 ` [PATCH v3 2/4] block: respect iov_iter::nofault flag " Qu Wenruo
@ 2026-06-12  9:51 ` Qu Wenruo
  2026-06-12  9:51 ` [PATCH v3 4/4] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
  3 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

If a direct IO requires bounced pages for stable buffer, it will always
allocate memory, and both bio_iov_iter_bounce_write() and
bio_iov_iter_bounce_read() are allocating pages using GFP_KERNEL, which
can sleep and break NOWAIT requirement.

So we need to reject such NOWAIT and BOUNCE direct IO in
iomap_dio_bio_iter().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/iomap/direct-io.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b36ee619cdcd..d1601122f0b5 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -412,6 +412,10 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 	unsigned int alignment;
 	ssize_t ret = 0;
 
+	/* Bounced direct IO will need to allocate memory, breaking NOWAIT flag. */
+	if (unlikely(iter->flags & IOMAP_NOWAIT && dio->flags & IOMAP_DIO_BOUNCE))
+		return -EAGAIN;
+
 	/*
 	 * File systems that write out of place and always allocate new blocks
 	 * need each bio to be block aligned as that's the unit of allocation.
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH v3 4/4] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
  2026-06-12  9:51 [PATCH v3 0/4] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
                   ` (2 preceding siblings ...)
  2026-06-12  9:51 ` [PATCH v3 3/4] iomap: reject NOWAIT and BOUNCE direct IOs Qu Wenruo
@ 2026-06-12  9:51 ` Qu Wenruo
  3 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2026-06-12  9:51 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

Previously btrfs forces direct writes to fall back to buffered ones if the
inode has data checksum or the profile has duplication.

That fallback is to avoid the content being modified that the final
content may mismatch with the checksum or the other mirrors.

That brings a pretty huge performance cost, which already caused some
concern at that time.

But later upstream commit c9d114846b38 ("iomap: add a flag to bounce
buffer direct I/O") introduced a new method by copying the content into
new pages, and do all the operations based on the newly allocated pages.

So let btrfs to utilize the new flag for direct writes if we require
stable folios.

There is a quick benchmark, using the following fio setup:

 fio --name=randwrite --filename $mnt/foobar --ioengine=libaio --size=4G \
     --rw=randwrite --iodepth=64 --runtime=60 --time_based --direct=1 \
     --bs=$blocksize

Unit is MiB/s.

 Blocksize | Zero-copy (*) | Buffered |   Bounce
-----------+---------------+----------+-----------
        4K |          35.1 |     17.1 |      33.8
       64K |           522 |      251 |       492

*: This is done by reverting the commit 968f19c5b1b7 ("btrfs: always
   fallback to buffered write if the inode requires checksum")

Although with page bouncing the performance is only around 95% of
true-zero copy, it's still almost double the performance of buffered
fallback.

There will be a small change in behavior, since we're using
IOMAP_DIO_BOUNCE flag to allocate new folios, NOWAIT flag will
immediately fail.

So for NOWAIT direct IOs, NODATASUM and RAID0/SINGLE profiles are still
required.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c | 53 ++++++++++++++++++++------------------------
 1 file changed, 24 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index e566a60b0ce5..bbf94056a874 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -818,13 +818,36 @@ static ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter,
 			    IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED, &data, done_before);
 }
 
+static bool need_stable_write(struct btrfs_inode *inode)
+{
+	const u64 data_profile = btrfs_data_alloc_profile(inode->root->fs_info) &
+				 BTRFS_BLOCK_GROUP_PROFILE_MASK;
+
+	/* Data checksum requires stable buffer. */
+	if (!(inode->flags & BTRFS_INODE_NODATASUM))
+		return true;
+	/*
+	 * Any profile with mirror/parity will require stable buffer.
+	 * Otherwise the mirror may differ from each other.
+	 *
+	 * Thus only SINGLE and RAID0 doesn't require stable buffer.
+	 */
+	if (data_profile != 0 && data_profile != BTRFS_BLOCK_GROUP_RAID0)
+		return true;
+	return false;
+}
+
 static struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 					 size_t done_before)
 {
 	struct btrfs_dio_data data = { 0 };
+	unsigned int dio_flags = IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED;
+
+	if (need_stable_write(BTRFS_I(file_inode(iocb->ki_filp))))
+		dio_flags |= IOMAP_DIO_BOUNCE;
 
 	return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			    IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED, &data, done_before);
+			      dio_flags, &data, done_before);
 }
 
 static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
@@ -853,8 +876,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t ret;
 	unsigned int ilock_flags = 0;
 	struct iomap_dio *dio;
-	const u64 data_profile = btrfs_data_alloc_profile(fs_info) &
-				 BTRFS_BLOCK_GROUP_PROFILE_MASK;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		ilock_flags |= BTRFS_ILOCK_TRY;
@@ -868,16 +889,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
 		ilock_flags |= BTRFS_ILOCK_SHARED;
 
-	/*
-	 * If our data profile has duplication (either extra mirrors or RAID56),
-	 * we can not trust the direct IO buffer, the content may change during
-	 * writeback and cause different contents written to different mirrors.
-	 *
-	 * Thus only RAID0 and SINGLE can go true zero-copy direct IO.
-	 */
-	if (data_profile != BTRFS_BLOCK_GROUP_RAID0 && data_profile != 0)
-		goto buffered;
-
 relock:
 	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
 	if (ret < 0)
@@ -918,22 +929,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
 		goto buffered;
 	}
-	/*
-	 * We can't control the folios being passed in, applications can write
-	 * to them while a direct IO write is in progress.  This means the
-	 * content might change after we calculated the data checksum.
-	 * Therefore we can end up storing a checksum that doesn't match the
-	 * persisted data.
-	 *
-	 * To be extra safe and avoid false data checksum mismatch, if the
-	 * inode requires data checksum, just fallback to buffered IO.
-	 * For buffered IO we have full control of page cache and can ensure
-	 * no one is modifying the content during writeback.
-	 */
-	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
-		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
-		goto buffered;
-	}
 
 	/*
 	 * The iov_iter can be mapped to the same file range we are writing to.
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-06-12  9:52 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-12  9:51 [PATCH v3 0/4] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
2026-06-12  9:51 ` [PATCH v3 1/4] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write() Qu Wenruo
2026-06-12  9:51 ` [PATCH v3 2/4] block: respect iov_iter::nofault flag " Qu Wenruo
2026-06-12  9:51 ` [PATCH v3 3/4] iomap: reject NOWAIT and BOUNCE direct IOs Qu Wenruo
2026-06-12  9:51 ` [PATCH v3 4/4] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox