[PATCH v4 0/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v4 0/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
@ 2026-06-16  8:12 Qu Wenruo
  2026-06-16  8:12 ` [PATCH v4 1/3] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write() Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Qu Wenruo @ 2026-06-16  8:12 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

[CHANGELOG]
v4:
- Follow iomap/block layer code style to avoid lines over 80 chars

- Reject NOWAIT BOUNCE direct writes inside btrfs
  The iomap code still allocates memory with GFP_KERNEL in other
  locations.
  For now just disable NOWAIT BOUNCE direct writes and let the caller
  fall back to blocking mode.

v3:
- Fix a bug in error handling of bio_iov_iter_bounce_write()
  Which can lead to generic/708 failure on btrfs.

- Respect nofault flag in bio_iov_iter_bounce_write()
  To avoid btrfs specific deadlocks.

- Reject NOWAIT and BOUNCE direct IOs
  Since BOUNCE always allocate pages using GFP_KERNEL, which can sleep
  and break NOWAIT requirement, has to reject such combination.

v2:
- Rework the comment in btrfs_dio_write()

Commit 968f19c5b1b7 ("btrfs: always fallback to buffered write if the
inode requires checksum") solved the csum mismatch caused by unstable
direct IO buffers, it has a pretty hefty performance penalty.

Meanwhile upstream iomap has introduce IOMAP_DIO_BOUNCE flag to get
stable buffers meanwhile without falling back to buffered IOs.

Using that flag btrfs can reach 95% of the original zero-copy direct IO
performance, almost 2x the current buffered fallback performance.

However during my tests, there are several bugs related to iomap that
can lead to direct IO test case failures:

- generic/708
  Results garbage in the end of the writes, is a bug in the error
  handling of a short copy.

  Fixed in the first patch.

- Deadlock if using the page cache as direct IO buffer
  This is because bio_iov_iter_bounce_write() doesn't respect
  iov_iter::nofault flag.

  Fixed in the second patch.

- Possible NOWAIT and BOUNCE conflicts
  BOUNCE flag for both reads and writes will allocate new folios using
  GFP_KERNEL, which can sleep and break NOWAIT requirement.

  Reject such combination in btrfs when enabling IOMAP_DIO_BOUNCE
  support.

And the final one will enable btrfs to use IOMAP_DIO_BOUNCE flag, so
that even with data checksum we do not need to fallback to buffered IO
and reclaim most of the dropped direct IO performance.

Qu Wenruo (3):
  block: revert the iov_iter after a short copy in
    bio_iov_iter_bounce_write()
  block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
  btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered
    IO

 block/bio.c          | 21 +++++++++++++---
 fs/btrfs/direct-io.c | 58 ++++++++++++++++++++++----------------------
 2 files changed, 47 insertions(+), 32 deletions(-)

-- 
2.54.0


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v4 1/3] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write()
  2026-06-16  8:12 [PATCH v4 0/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
@ 2026-06-16  8:12 ` Qu Wenruo
  2026-06-16 12:44   ` Christoph Hellwig
  2026-06-16  8:12 ` [PATCH v4 2/3] block: respect iov_iter::nofault flag " Qu Wenruo
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2026-06-16  8:12 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

For the incoming IOMAP_DIO_BOUNCE flag usage inside btrfs, it's pretty
easy to hit short copy inside bio_iov_iter_bounce_write().

This is because btrfs has disabled page fault to avoid certain deadlock
during direct writes, and instead btrfs manually fault in the pages then
retry.

And inside bio_iov_iter_bounce_write(), if we hit a short write, we
didn't revert the iov_iter, which can cause problems like unexpected
garbage for the next retry.

Revert the iov_iter after a short copy.

One thing to note is that, the folio is allocated then immediately
queued into the bio, so the proper revert size should be
(bi_size - this_len + copied).

Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 block/bio.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f10900b3f42..d2fdfcb016c1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1321,6 +1321,7 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
 
 	do {
 		size_t this_len = min(total_len, SZ_1M);
+		size_t copied;
 		struct folio *folio;
 
 		if (this_len > minsize * 2)
@@ -1334,12 +1335,22 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
 			break;
 		bio_add_folio_nofail(bio, folio, this_len, 0);
 
-		if (copy_from_iter(folio_address(folio), this_len, iter) !=
-				this_len) {
+		copied = copy_from_iter(folio_address(folio), this_len, iter);
+		if (copied < this_len) {
+			/*
+			 * Need to revert the iov iter for all bytes we have
+			 * copied.
+			 *
+			 * However the bio size differs from the real copied
+			 * bytes as @this_len is queued but only advanced
+			 * less than that.
+			 * Need to compensate that for the revert.
+			 */
+			iov_iter_revert(iter, bio->bi_iter.bi_size - this_len +
+					copied);
 			bio_free_folios(bio);
 			return -EFAULT;
 		}
-
 		total_len -= this_len;
 	} while (total_len && bio->bi_vcnt < bio->bi_max_vecs);
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v4 2/3] block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
  2026-06-16  8:12 [PATCH v4 0/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
  2026-06-16  8:12 ` [PATCH v4 1/3] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write() Qu Wenruo
@ 2026-06-16  8:12 ` Qu Wenruo
  2026-06-16 12:44   ` Christoph Hellwig
  2026-06-16  8:12 ` [PATCH v4 3/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
  2026-06-16 12:45 ` [PATCH v4 0/3] " Christoph Hellwig
  3 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2026-06-16  8:12 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

For the incoming usage of IOMAP_DIO_BOUNCE in btrfs, btrfs has set
iov_iter::nofault to prevent deadlock when a page fault is needed to
read out the buffer.

However bio_iov_iter_bounce_write() doesn't respect iov_iter::nofault
flag, and just call a plain copy_from_iter() so it can still trigger
page fault and cause deadlock in btrfs.

Fix it by utilizing copy_folio_from_iter_atomic() if nofault flag is
set, otherwise use copy_folio_from_iter().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 block/bio.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index d2fdfcb016c1..7fd7c35f9a28 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1335,7 +1335,11 @@ static int bio_iov_iter_bounce_write(struct bio *bio, struct iov_iter *iter,
 			break;
 		bio_add_folio_nofail(bio, folio, this_len, 0);
 
-		copied = copy_from_iter(folio_address(folio), this_len, iter);
+		if (iter->nofault)
+			copied = copy_folio_from_iter_atomic(folio, 0, this_len,
+							     iter);
+		else
+			copied = copy_folio_from_iter(folio, 0, this_len, iter);
 		if (copied < this_len) {
 			/*
 			 * Need to revert the iov iter for all bytes we have
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v4 3/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
  2026-06-16  8:12 [PATCH v4 0/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
  2026-06-16  8:12 ` [PATCH v4 1/3] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write() Qu Wenruo
  2026-06-16  8:12 ` [PATCH v4 2/3] block: respect iov_iter::nofault flag " Qu Wenruo
@ 2026-06-16  8:12 ` Qu Wenruo
  2026-06-16 12:45 ` [PATCH v4 0/3] " Christoph Hellwig
  3 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2026-06-16  8:12 UTC (permalink / raw)
  To: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

Previously btrfs forces direct writes to fall back to buffered ones if the
inode has data checksum or the profile has duplication.

That fallback is to avoid the content being modified that the final
content may mismatch with the checksum or the other mirrors.

That brings a pretty huge performance cost, which already caused some
concern at that time.

But later upstream commit c9d114846b38 ("iomap: add a flag to bounce
buffer direct I/O") introduced a new method by copying the content into
new pages, and do all the operations based on the newly allocated pages.

So let btrfs to utilize the new flag for direct writes if we require
stable folios.

There is a quick benchmark, using the following fio setup:

 fio --name=randwrite --filename $mnt/foobar --ioengine=libaio --size=4G \
     --rw=randwrite --iodepth=64 --runtime=60 --time_based --direct=1 \
     --bs=$blocksize

Unit is MiB/s.

 Blocksize | Zero-copy (*) | Buffered |   Bounce
-----------+---------------+----------+-----------
        4K |          35.1 |     17.1 |      33.8
       64K |           522 |      251 |       492

*: This is done by reverting the commit 968f19c5b1b7 ("btrfs: always
   fallback to buffered write if the inode requires checksum")

Although with page bouncing the performance is only around 95% of
true-zero copy, it's still almost double the performance of buffered
fallback.

There will be a small change in behavior, since we're using
IOMAP_DIO_BOUNCE flag to allocate new folios, NOWAIT flag will
immediately fail.

So for true NOWAIT direct IOs, NODATASUM and RAID0/SINGLE profiles are
still required.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c | 58 ++++++++++++++++++++++----------------------
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index e566a60b0ce5..bb457649902e 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -818,13 +818,41 @@ static ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter,
 			    IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED, &data, done_before);
 }
 
+static bool need_stable_write(struct btrfs_inode *inode)
+{
+	const u64 data_profile = btrfs_data_alloc_profile(inode->root->fs_info) &
+				 BTRFS_BLOCK_GROUP_PROFILE_MASK;
+
+	/* Data checksum requires stable buffer. */
+	if (!(inode->flags & BTRFS_INODE_NODATASUM))
+		return true;
+	/*
+	 * Any profile with mirror/parity will require stable buffer.
+	 * Otherwise the mirror may differ from each other.
+	 *
+	 * Thus only SINGLE and RAID0 doesn't require stable buffer.
+	 */
+	if (data_profile != 0 && data_profile != BTRFS_BLOCK_GROUP_RAID0)
+		return true;
+	return false;
+}
+
 static struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 					 size_t done_before)
 {
 	struct btrfs_dio_data data = { 0 };
+	unsigned int dio_flags = IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED;
+
+	if (need_stable_write(BTRFS_I(file_inode(iocb->ki_filp)))) {
+		/* For now no support for BOUNCE and NOWAIT direct write. */
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return ERR_PTR(-EAGAIN);
+
+		dio_flags |= IOMAP_DIO_BOUNCE;
+	}
 
 	return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			    IOMAP_DIO_PARTIAL | IOMAP_DIO_FSBLOCK_ALIGNED, &data, done_before);
+			      dio_flags, &data, done_before);
 }
 
 static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
@@ -853,8 +881,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t ret;
 	unsigned int ilock_flags = 0;
 	struct iomap_dio *dio;
-	const u64 data_profile = btrfs_data_alloc_profile(fs_info) &
-				 BTRFS_BLOCK_GROUP_PROFILE_MASK;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		ilock_flags |= BTRFS_ILOCK_TRY;
@@ -868,16 +894,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
 		ilock_flags |= BTRFS_ILOCK_SHARED;
 
-	/*
-	 * If our data profile has duplication (either extra mirrors or RAID56),
-	 * we can not trust the direct IO buffer, the content may change during
-	 * writeback and cause different contents written to different mirrors.
-	 *
-	 * Thus only RAID0 and SINGLE can go true zero-copy direct IO.
-	 */
-	if (data_profile != BTRFS_BLOCK_GROUP_RAID0 && data_profile != 0)
-		goto buffered;
-
 relock:
 	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
 	if (ret < 0)
@@ -918,22 +934,6 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
 		goto buffered;
 	}
-	/*
-	 * We can't control the folios being passed in, applications can write
-	 * to them while a direct IO write is in progress.  This means the
-	 * content might change after we calculated the data checksum.
-	 * Therefore we can end up storing a checksum that doesn't match the
-	 * persisted data.
-	 *
-	 * To be extra safe and avoid false data checksum mismatch, if the
-	 * inode requires data checksum, just fallback to buffered IO.
-	 * For buffered IO we have full control of page cache and can ensure
-	 * no one is modifying the content during writeback.
-	 */
-	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
-		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
-		goto buffered;
-	}
 
 	/*
 	 * The iov_iter can be mapped to the same file range we are writing to.
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v4 1/3] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write()
  2026-06-16  8:12 ` [PATCH v4 1/3] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write() Qu Wenruo
@ 2026-06-16 12:44   ` Christoph Hellwig
  0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2026-06-16 12:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4 2/3] block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
  2026-06-16  8:12 ` [PATCH v4 2/3] block: respect iov_iter::nofault flag " Qu Wenruo
@ 2026-06-16 12:44   ` Christoph Hellwig
  0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2026-06-16 12:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4 0/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO
  2026-06-16  8:12 [PATCH v4 0/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
                   ` (2 preceding siblings ...)
  2026-06-16  8:12 ` [PATCH v4 3/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
@ 2026-06-16 12:45 ` Christoph Hellwig
  3 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2026-06-16 12:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, linux-block, linux-fsdevel, linux-xfs

Note: You'll need to include Jens for the block bits to get either an
ACK or a merge through the block tree.


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-06-16 12:45 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-16  8:12 [PATCH v4 0/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
2026-06-16  8:12 ` [PATCH v4 1/3] block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write() Qu Wenruo
2026-06-16 12:44   ` Christoph Hellwig
2026-06-16  8:12 ` [PATCH v4 2/3] block: respect iov_iter::nofault flag " Qu Wenruo
2026-06-16 12:44   ` Christoph Hellwig
2026-06-16  8:12 ` [PATCH v4 3/3] btrfs: use IOMAP_DIO_BOUNCE flag instead of falling back to buffered IO Qu Wenruo
2026-06-16 12:45 ` [PATCH v4 0/3] " Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.