[PATCH] btrfs: never trust the bio from direct IO
From: Qu Wenruo @ 2025-10-20  9:19 UTC
To: linux-btrfs; +Cc: djwong, linux-xfs, linux-fsdevel

There is a bug report that direct IO (and even concurrent buffered IO)
can lead to mismatched contents between md-raid mirrors.

It's exactly the situation we fixed for direct IO in commit 968f19c5b1b7
("btrfs: always fallback to buffered write if the inode requires
checksum"); however, we still left a hole for the nodatasum case.

For nodatasum cases we still reuse the bio from direct IO, which causes
the same problem for RAID1*/5/6 profiles: each mirror (or stripe) write
consumes the user pages independently, so a concurrent modification makes
the on-disk copies diverge, and the data read back becomes unreliable,
depending on which copy the read balance picks.
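
As an illustration, here is a minimal hypothetical userspace pattern
that can trigger the mismatch (the path, sizes and loop count are made
up for illustration; build with gcc -pthread):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <pthread.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  static char *buf;

  /* Keep modifying the buffer while the other thread writes it out. */
  static void *scribbler(void *arg)
  {
          (void)arg;
          for (;;)
                  memset(buf, rand() & 0xff, 4096);
          return NULL;
  }

  int main(void)
  {
          pthread_t t;
          int fd = open("/mnt/btrfs/foobar",
                        O_WRONLY | O_CREAT | O_DIRECT, 0600);

          if (fd < 0 || posix_memalign((void **)&buf, 4096, 4096))
                  return 1;
          pthread_create(&t, NULL, scribbler, NULL);
          /*
           * Each mirror (or stripe) write may DMA the page at a
           * different time, so the on-disk copies can diverge.
           */
          for (int i = 0; i < 1024; i++)
                  if (write(fd, buf, 4096) < 0)
                          break;
          return 0;
  }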

Do not trust any bio from direct IO, and never reuse those bios, even
for nodatasum cases. Instead, allocate our own bio with newly allocated
pages.

For direct reads, submit the new bio, and at end io time copy the
contents into the dio bio.
For direct writes, copy the contents from the dio bio first, then submit
the new one.
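
Conceptually the bounce scheme boils down to the sketch below, where
alloc_bounce_bio() is a hypothetical helper (the real implementation is
in the diff further down):

  #include <linux/bio.h>

  static void bounce_submit(struct bio *dio_bio, bool is_write)
  {
          /* A new bio backed by pages only the filesystem can touch. */
          struct bio *bounce = alloc_bounce_bio(dio_bio);

          if (is_write)
                  /* Snapshot user data before checksum/RAID submission. */
                  bio_copy_data(bounce, dio_bio);

          submit_bio(bounce);
          /*
           * For reads, the completion handler does the reverse copy,
           * bio_copy_data(dio_bio, bounce), then frees the bounce pages.
           */
  }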

This of course causes an extra performance drop, but it should still be
much better than falling back to buffered IO.

Here is a quick test done in my VM, with cache mode 'none' (i.e. qemu
uses direct IO to submit the IO, avoiding double caching on the host).

The VM has 10G of RAM, the target storage is backed by one PCIe gen3
NVMe SSD, and the kernel has some minor/lightweight debug options
enabled.

The test command is pretty simple:
  dd if=/dev/zero bs=1M of=/mnt/btrfs/foobar count=4096 oflag=direct

- Raw disk IO
  dd if=/dev/zero bs=1M of=/dev/test/scratch1 count=4096 oflag=direct

  1.80748 s, 2.4 GB/s

- Fallback to buffered IO (unpatched)
  Mount option: default (with data checksum)

  20.7763 s, 207 MB/s

  Miserable; most SATA SSDs are more than double this speed, and it is
  less than 10% of the raw disk performance.
  Thankfully, with this bouncing behavior we can easily reclaim the
  performance soon.
  (There will be one small follow-up patch for it, after dropping the
  RFC tag.)

- True zero-copy (unpatched)
  Mount option: nodatasum

  1.95422 s, 2.2 GB/s

  Very close to the raw disk speed.

- Bounce then zero-copy (patched)
  Mount option: nodatasum

  2.5453 s, 1.7 GB/s

  Around 23% slower than true zero-copy, but still acceptable if you ask
  me.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
REASON FOR RFC:

Considering that zero-copy direct IO (and the fact that XFS/ext4 even
allow modifying the page cache while it is still under writeback) can
lead to RAID mirror content mismatches, the 23% performance drop should
still be acceptable, and bcachefs is already doing this bouncing
behavior.

But still, such a performance drop can be very noticeable, and
performance-oriented users (who are very happy running various benchmark
tools) are going to notice it or even complain.

Another question is: should we push this behavior into the iomap layer
so that other filesystems can also benefit from it?
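
If it were pushed down, one purely hypothetical shape (the flag name and
the fs_* identifiers are invented here; iomap has no such flag today)
would be a dio flag asking iomap to bounce the bio before ->submit_io():

  #include <linux/iomap.h>

  /* Hypothetical flag; a free bit would have to be reserved for it. */
  #define IOMAP_DIO_BOUNCE	(1U << 30)

  /* Stand-ins for a filesystem's own ops tables. */
  static const struct iomap_ops fs_iomap_ops;
  static const struct iomap_dio_ops fs_dio_ops;

  static ssize_t fs_dio_write(struct kiocb *iocb, struct iov_iter *from)
  {
          /*
           * A filesystem whose checksums or RAID replication cannot
           * tolerate pages changing under IO would opt in; iomap would
           * then allocate and copy into a bounce bio before calling
           * ->submit_io().
           */
          return iomap_dio_rw(iocb, from, &fs_iomap_ops, &fs_dio_ops,
                              IOMAP_DIO_BOUNCE, NULL, 0);
  }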
---
 fs/btrfs/direct-io.c | 150 ++++++++++++++++++++++++++++++-------------
 1 file changed, 105 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 802d4dbe5b38..1a8bed65c417 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -640,33 +640,6 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 	return ret;
 }
 
-static void btrfs_dio_end_io(struct btrfs_bio *bbio)
-{
-	struct btrfs_dio_private *dip =
-		container_of(bbio, struct btrfs_dio_private, bbio);
-	struct btrfs_inode *inode = bbio->inode;
-	struct bio *bio = &bbio->bio;
-
-	if (bio->bi_status) {
-		btrfs_warn(inode->root->fs_info,
-		"direct IO failed ino %llu op 0x%0x offset %#llx len %u err no %d",
-			   btrfs_ino(inode), bio->bi_opf,
-			   dip->file_offset, dip->bytes, bio->bi_status);
-	}
-
-	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
-		btrfs_finish_ordered_extent(bbio->ordered, NULL,
-					    dip->file_offset, dip->bytes,
-					    !bio->bi_status);
-	} else {
-		btrfs_unlock_dio_extent(&inode->io_tree, dip->file_offset,
-					dip->file_offset + dip->bytes - 1, NULL);
-	}
-
-	bbio->bio.bi_private = bbio->private;
-	iomap_dio_bio_end_io(bio);
-}
-
 static int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
 					struct btrfs_ordered_extent *ordered)
 {
@@ -705,23 +678,109 @@ static int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
 	return 0;
 }
 
-static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
-				loff_t file_offset)
+static void dio_end_write_copied_bio(struct btrfs_bio *bbio)
 {
-	struct btrfs_bio *bbio = btrfs_bio(bio);
+	struct bio *orig = bbio->private;
 	struct btrfs_dio_private *dip =
 		container_of(bbio, struct btrfs_dio_private, bbio);
-	struct btrfs_dio_data *dio_data = iter->private;
+	struct btrfs_inode *inode = bbio->inode;
+	struct bio *bio = &bbio->bio;
 
-	btrfs_bio_init(bbio, BTRFS_I(iter->inode)->root->fs_info,
-		       btrfs_dio_end_io, bio->bi_private);
-	bbio->inode = BTRFS_I(iter->inode);
-	bbio->file_offset = file_offset;
+	if (bio->bi_status) {
+		btrfs_warn(inode->root->fs_info,
+		"direct IO failed ino %llu op 0x%0x offset %#llx len %u err no %d",
+			   btrfs_ino(inode), bio->bi_opf,
+			   dip->file_offset, dip->bytes, bio->bi_status);
+	}
+
+	orig->bi_status = bbio->bio.bi_status;
+	btrfs_finish_ordered_extent(bbio->ordered, NULL,
+				    dip->file_offset, dip->bytes,
+				    !bio->bi_status);
+	bio_free_pages(bio);
+	bio_put(bio);
+	iomap_dio_bio_end_io(orig);
+}
+
+static void dio_end_read_copied_bio(struct btrfs_bio *bbio)
+{
+	struct bio *orig = bbio->private;
+	struct btrfs_dio_private *dip =
+		container_of(bbio, struct btrfs_dio_private, bbio);
+	struct btrfs_inode *inode = bbio->inode;
+	struct bio *bio = &bbio->bio;
+
+	if (bio->bi_status) {
+		btrfs_warn(inode->root->fs_info,
+		"direct IO failed ino %llu op 0x%0x offset %#llx len %u err no %d",
+			   btrfs_ino(inode), bio->bi_opf,
+			   dip->file_offset, dip->bytes, bio->bi_status);
+	}
+
+	orig->bi_status = bbio->bio.bi_status;
+	bio_copy_data(orig, &bbio->bio);
+	btrfs_unlock_dio_extent(&inode->io_tree, dip->file_offset,
+				dip->file_offset + dip->bytes - 1, NULL);
+	bio_free_pages(bio);
+	bio_put(bio);
+	iomap_dio_bio_end_io(orig);
+}
+
+static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *src,
+				loff_t file_offset)
+{
+	struct btrfs_dio_private *dip;
+	struct btrfs_dio_data *dio_data = iter->private;
+	struct btrfs_bio *new_bbio;
+	struct bio *new_bio;
+	const bool is_write = (btrfs_op(src) == BTRFS_MAP_WRITE);
+	btrfs_bio_end_io_t end_io;
+	const unsigned int src_size = src->bi_iter.bi_size;
+	const int nr_pages = round_up(src_size, PAGE_SIZE) >> PAGE_SHIFT;
+	unsigned int cur = 0;
+	int ret;
+
+	if (is_write)
+		end_io = dio_end_write_copied_bio;
+	else
+		end_io = dio_end_read_copied_bio;
+
+	/*
+	 * We cannot trust the direct IO bio, as its content can be modified
+	 * at any time during submission/writeback.
+	 * Thus we have to allocate a new bio with pages allocated by
+	 * ourselves, so that no one can change the content.
+	 */
+	new_bio = bio_alloc_bioset(NULL, nr_pages, src->bi_opf, GFP_NOFS, &btrfs_dio_bioset);
+	new_bbio = btrfs_bio(new_bio);
+	btrfs_bio_init(new_bbio, inode_to_fs_info(iter->inode), end_io, src);
+	dip = container_of(new_bbio, struct btrfs_dio_private, bbio);
+	new_bbio->inode = BTRFS_I(iter->inode);
+	new_bbio->file_offset = file_offset;
+	dip->file_offset = file_offset;
+	dip->bytes = src_size;
+	while (cur < src_size) {
+		struct page *page = alloc_page(GFP_NOFS);
+		unsigned int size = min(src_size - cur, PAGE_SIZE);
+
+		if (!page) {
+			ret = -ENOMEM;
+			goto error;
+		}
+		ret = bio_add_page(&new_bbio->bio, page, size, 0);
+		ASSERT(ret == size);
+		cur += size;
+	}
+	ASSERT(new_bbio->bio.bi_iter.bi_size == src_size);
+	new_bbio->bio.bi_iter.bi_sector = src->bi_iter.bi_sector;
 
 	dip->file_offset = file_offset;
-	dip->bytes = bio->bi_iter.bi_size;
+	dip->bytes = src_size;
 
-	dio_data->submitted += bio->bi_iter.bi_size;
+	dio_data->submitted += src_size;
+
+	if (is_write)
+		bio_copy_data(&new_bbio->bio, src);
 
 	/*
 	 * Check if we are doing a partial write.  If we are, we need to split
@@ -731,20 +790,22 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 	 * remaining pages is blocked on the outstanding ordered extent.
 	 */
 	if (iter->flags & IOMAP_WRITE) {
-		int ret;
-
-		ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered);
+		ret = btrfs_extract_ordered_extent(new_bbio, dio_data->ordered);
 		if (ret) {
 			btrfs_finish_ordered_extent(dio_data->ordered, NULL,
 						    file_offset, dip->bytes,
 						    !ret);
-			bio->bi_status = errno_to_blk_status(ret);
-			iomap_dio_bio_end_io(bio);
-			return;
+			goto error;
 		}
 	}
 
-	btrfs_submit_bbio(bbio, 0);
+	btrfs_submit_bbio(new_bbio, 0);
+	return;
+error:
+	src->bi_status = errno_to_blk_status(ret);
+	bio_free_pages(&new_bbio->bio);
+	bio_put(&new_bbio->bio);
+	iomap_dio_bio_end_io(src);
 }
 
 static const struct iomap_ops btrfs_dio_iomap_ops = {
@@ -754,7 +815,6 @@ static const struct iomap_ops btrfs_dio_iomap_ops = {
 
 static const struct iomap_dio_ops btrfs_dio_ops = {
 	.submit_io		= btrfs_dio_submit_io,
-	.bio_set		= &btrfs_dio_bioset,
 };
 
 static ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter,
-- 
2.51.0



Thread overview: 28+ messages
2025-10-20  9:19 [PATCH] btrfs: never trust the bio from direct IO Qu Wenruo
2025-10-20 10:00 ` O_DIRECT vs BLK_FEAT_STABLE_WRITES, was " Christoph Hellwig
2025-10-20 10:24   ` Qu Wenruo
2025-10-20 11:45     ` Christoph Hellwig
2025-10-20 11:16   ` Jan Kara
2025-10-20 11:44     ` Christoph Hellwig
2025-10-20 13:59       ` Jan Kara
2025-10-20 14:59         ` Matthew Wilcox
2025-10-20 15:58           ` Jan Kara
2025-10-20 17:55             ` John Hubbard
2025-10-21  8:27               ` Jan Kara
2025-10-21 16:56                 ` John Hubbard
2025-10-20 19:00             ` David Hildenbrand
2025-10-21  7:49               ` Christoph Hellwig
2025-10-21  7:57                 ` David Hildenbrand
2025-10-21  9:33                   ` Jan Kara
2025-10-21  9:43                     ` David Hildenbrand
2025-10-21  9:22                 ` Jan Kara
2025-10-21  9:37                   ` David Hildenbrand
2025-10-21  9:52                     ` Jan Kara
2025-10-21  3:17   ` Qu Wenruo
2025-10-21  7:48     ` Christoph Hellwig
2025-10-21  8:15       ` Qu Wenruo
2025-10-21 11:30         ` Johannes Thumshirn
2025-10-22  2:27           ` Qu Wenruo
2025-10-22  5:04             ` hch
2025-10-22  6:17               ` Qu Wenruo
2025-10-22  6:24                 ` hch
