[PATCH 0/4] btrfs: minor fixes and cleanups on direct IO path

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/4] btrfs: minor fixes and cleanups on direct IO path
@ 2026-06-05  9:01 Qu Wenruo
  2026-06-05  9:01 ` [PATCH 1/4] btrfs: return the written number if there is any progress for dio Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Qu Wenruo @ 2026-06-05  9:01 UTC (permalink / raw)
  To: linux-btrfs

There are several small problems inside the direct IO code:

- btrfs_direct_write() ignores any submitted bytes
  Any error code will have a higher priority than the @written value.

  Fixed in the first patch.

- Possible direct and buffered OE conflicts if falling back to buffered
  If btrfs_dio_iomap_begin() is called and created an OE, but later
  nothing is submitted and needs to fallback to buffered, then the OE
  is only finished but not yet removed.

  The removal of an OE is always delayed, thus there is a window that
  the buffered write can create an OE, causing conflicts.

  Fixed in the second patch.

- Duplicated btrfs_dio_data::submitted
  The iomap_end() callback has @written parameter which provides the
  same value.

  Fixed in the third patch.

- Badly structured btrfs_dio_iomap_end()
  Refactored in the last patch.

Qu Wenruo (4):
  btrfs: return the written number if there is any progress for dio
  btrfs: wait for OE to be fully removed before falling back to buffered
  btrfs: remove btrfs_dio_data::submitted
  btrfs: refactor btrfs_dio_iomap_end()

 fs/btrfs/direct-io.c | 150 ++++++++++++++++++++++++-------------------
 1 file changed, 84 insertions(+), 66 deletions(-)

-- 
2.54.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/4] btrfs: return the written number if there is any progress for dio
  2026-06-05  9:01 [PATCH 0/4] btrfs: minor fixes and cleanups on direct IO path Qu Wenruo
@ 2026-06-05  9:01 ` Qu Wenruo
  2026-06-05 10:05   ` Filipe Manana
  2026-06-05  9:01 ` [PATCH 2/4] btrfs: wait for OE to be fully removed before falling back to buffered Qu Wenruo
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2026-06-05  9:01 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs_direct_write() will return error even if there is some
progress made.

E.g. a NOWAIT direct write has submitted 16KiB data, but later
failed due to -EFAULT.

In that case we can not fallback to buffered IO either, as it's an
NOWAIT operation, so we set @ret to -EAGAIN.

In that case, we should not return -EAGAIN but 16K to indicate that we
have already submitted 16KiB data.

Fix it by doing a check on @written first, only if @written is 0 then
return @ret.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 460326d34143..c97396ca476d 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -1046,7 +1046,14 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
 				 endbyte >> PAGE_SHIFT);
 out:
-	return ret < 0 ? ret : written;
+	/*
+	 * Always check if we have some bytes written, even if we hit
+	 * an error later we must report the correct number of written
+	 * bytes.
+	 */
+	if (written)
+		return written;
+	return ret;
 }
 
 static int check_direct_read(struct btrfs_fs_info *fs_info,
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/4] btrfs: wait for OE to be fully removed before falling back to buffered
  2026-06-05  9:01 [PATCH 0/4] btrfs: minor fixes and cleanups on direct IO path Qu Wenruo
  2026-06-05  9:01 ` [PATCH 1/4] btrfs: return the written number if there is any progress for dio Qu Wenruo
@ 2026-06-05  9:01 ` Qu Wenruo
  2026-06-05 10:10   ` Filipe Manana
  2026-06-05  9:01 ` [PATCH 3/4] btrfs: remove btrfs_dio_data::submitted Qu Wenruo
  2026-06-05  9:01 ` [PATCH 4/4] btrfs: refactor btrfs_dio_iomap_end() Qu Wenruo
  3 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2026-06-05  9:01 UTC (permalink / raw)
  To: linux-btrfs

Currently if a direct write failed for whatever reasons, e.g. page fault
failure or IO error, then we fallback to buffered IO.

But it's possible that the direct write attempt has created some ordered
extent, and they may not be fully removed from the inode io tree yet:

 btrfs_direct_write:
 |- __iomap_dio_rw()
 |  |- btrfs_dio_iomap_begin()
 |  |  |- btrfs_get_blocks_direct_write()
 |  |     |- btrfs_create_dio_extent()
 |  |        Now an OE is created for the whole range
 |  |
 |  |- iomap_dio_iter()
 |  |  Failed with -EFAULT, which is pretty common inside
 |  |  btrfs as we have set iov_iter->nofault flag.
 |  |
 |  |- btrfs_dio_iomap_end()
 |     |- btrfs_finish_ordered_extent()
 |        |- btrfs_queue_ordered_fn()
 |           The real removal of the OE is queued into a workqueue.
 |
 |- btrfs_buffered_write()
 |- btrfs_fdatawrite_range()
    |- cow_file_range()
       |- btrfs_alloc_ordered_extent()
          If the previous OE is not yet removed, a conflicting OE
	  will be inserted and trigger a panic.

Fix it by calling btrfs_wait_ordered_range() before buffered writes.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index c97396ca476d..8efe3a722206 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -1025,6 +1025,16 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 	pos = iocb->ki_pos;
+	/*
+	 * Direct writes may have created an OE, then submitted nothing.
+	 * Although iomap_end() should have marked the OE as finished, it's
+	 * not yet removed from io-tree.
+	 * Thus we have to wait for them to be removed.
+	 */
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode), pos, iov_iter_count(from));
+	if (ret < 0)
+		goto out;
+
 	written_buffered = btrfs_buffered_write(iocb, from);
 	if (written_buffered < 0) {
 		ret = written_buffered;
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/4] btrfs: remove btrfs_dio_data::submitted
  2026-06-05  9:01 [PATCH 0/4] btrfs: minor fixes and cleanups on direct IO path Qu Wenruo
  2026-06-05  9:01 ` [PATCH 1/4] btrfs: return the written number if there is any progress for dio Qu Wenruo
  2026-06-05  9:01 ` [PATCH 2/4] btrfs: wait for OE to be fully removed before falling back to buffered Qu Wenruo
@ 2026-06-05  9:01 ` Qu Wenruo
  2026-06-05  9:01 ` [PATCH 4/4] btrfs: refactor btrfs_dio_iomap_end() Qu Wenruo
  3 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2026-06-05  9:01 UTC (permalink / raw)
  To: linux-btrfs

That member is to record how many bytes are submitted for a direct
read/write, utilized by iomap_end() callback to handle short IO cases.

However iomap_end() callback is already providing an internally tracked
@written member, which is doing the same accounting and providing the
same value as btrfs_dio_data::submitted.

There is no need to duplicate the work, just remove
btrfs_dio_data::submitted.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 8efe3a722206..e25f36eddb51 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -14,7 +14,6 @@
 #include "ordered-data.h"
 
 struct btrfs_dio_data {
-	ssize_t submitted;
 	loff_t old_isize;
 	struct extent_changeset *data_reserved;
 	struct btrfs_ordered_extent *ordered;
@@ -619,7 +618,6 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 {
 	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
 	struct btrfs_dio_data *dio_data = iter->private;
-	size_t submitted = dio_data->submitted;
 	const bool write = !!(flags & IOMAP_WRITE);
 	int ret = 0;
 
@@ -630,9 +628,9 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 		return 0;
 	}
 
-	if (submitted < length) {
-		pos += submitted;
-		length -= submitted;
+	if (written < length) {
+		pos += written;
+		length -= written;
 		if (write) {
 			/*
 			 * Got a short write and have updated the isize, need to
@@ -659,7 +657,7 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 			if (dio_data->updated_isize) {
 				u64 new_isize;
 
-				if (submitted == 0)
+				if (written == 0)
 					new_isize = dio_data->old_isize;
 				else
 					new_isize = max(dio_data->old_isize, pos);
@@ -772,8 +770,6 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 	dip->file_offset = file_offset;
 	dip->bytes = bio->bi_iter.bi_size;
 
-	dio_data->submitted += bio->bi_iter.bi_size;
-
 	/*
 	 * Check if we are doing a partial write.  If we are, we need to split
 	 * the ordered extent to match the submitted bio.  Hang on to the
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 4/4] btrfs: refactor btrfs_dio_iomap_end()
  2026-06-05  9:01 [PATCH 0/4] btrfs: minor fixes and cleanups on direct IO path Qu Wenruo
                   ` (2 preceding siblings ...)
  2026-06-05  9:01 ` [PATCH 3/4] btrfs: remove btrfs_dio_data::submitted Qu Wenruo
@ 2026-06-05  9:01 ` Qu Wenruo
  3 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2026-06-05  9:01 UTC (permalink / raw)
  To: linux-btrfs

That function has the following problems:

- Read/write handling scattered across different locations
  E.g. At the beginning there is a dedicated hole read handling, but
  later short read handling is at an if() branch.

- Modifying of @pos and @length parameter for short read
  Although it's completely fine to modify those parameters as they are
  passed by value, but it can still be confusing to read.
  As normally we would assume @pos and @length to be the original range.

  But for short IO handling we modify @pos/@length, and completely
  ignore @written.

- Unnecessary split for ordered extent and changeset handling
  Both OE and changeset are only for writes, but they are handled in two
  different if (write) {} blocks.

Refactor the function so that:

- Handling of reads and writes are concentrated in their code block
  Now the handling of reads are in its own small if () branch.

  Leaving the more complex writes handling to take the remaining
  function, and reduce the indent level.

  This also removes all unnecessary "if (write)" checks.

- Do not modify @pos and @length
  Let short IO handling to manually calculate the remaining range.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c | 125 ++++++++++++++++++++++---------------------
 1 file changed, 65 insertions(+), 60 deletions(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index e25f36eddb51..f2a68c3e31e5 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -621,74 +621,79 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 	const bool write = !!(flags & IOMAP_WRITE);
 	int ret = 0;
 
-	if (!write && (iomap->type == IOMAP_HOLE)) {
-		/* If reading from a hole, unlock and return */
-		btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
-					pos + length - 1, NULL);
+	if (!write) {
+		/*
+		 * Hole read, nothing is submitted, thus we have to unlock
+		 * the whole range.
+		 */
+		if (iomap->type == IOMAP_HOLE) {
+			btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
+						pos + length - 1, NULL);
+			return 0;
+		}
+		/*
+		 * Short read, needs to unlock the remaining range, and
+		 * return -ENOTBLK so we can later fault in the pages and retry.
+		 */
+		if (written < length) {
+			btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos + written,
+						pos + length - 1, NULL);
+			return -ENOTBLK;
+		}
+		/* The full range is submitted, endio will do the unlock. */
 		return 0;
 	}
 
 	if (written < length) {
-		pos += written;
-		length -= written;
-		if (write) {
-			/*
-			 * Got a short write and have updated the isize, need to
-			 * revert the isize change.
-			 *
-			 * Normally we need to update isize with extent lock hold,
-			 * but we're safe due to the following factors:
-			 *
-			 * - Only a single writer can be enlarging isize
-			 *   Enlarging isize will take the exclusive inode lock.
-			 *
-			 * - Buffered readers need to wait for the OE we're holding
-			 *   Buffered readers will lock extent and wait for OE
-			 *   of the folio range, and since page cache is invalidated
-			 *   the OE wait can not be skipped.
-			 *
-			 * So here we are safe to revert the isize before
-			 * finishing the OE, and no reader of the remaining range
-			 * can see the enlarged size.
-			 *
-			 * TODO: Extend the DIO_LOCKED lifespan for direct writes,
-			 * and only enlarge isize after a successful write.
-			 */
-			if (dio_data->updated_isize) {
-				u64 new_isize;
+		/*
+		 * Got a short write and have updated the isize, need to
+		 * revert the isize change.
+		 *
+		 * Normally we need to update isize with extent lock hold,
+		 * but we're safe due to the following factors:
+		 *
+		 * - Only a single writer can be enlarging isize
+		 *   Enlarging isize will take the exclusive inode lock.
+		 *
+		 * - Buffered readers need to wait for the OE we're holding
+		 *   Buffered readers will lock extent and wait for OE
+		 *   of the folio range, and since page cache is invalidated
+		 *   the OE wait can not be skipped.
+		 *
+		 * So here we are safe to revert the isize before
+		 * finishing the OE, and no reader of the remaining range
+		 * can see the enlarged size.
+		 *
+		 * TODO: Extend the DIO_LOCKED lifespan for direct writes,
+		 * and only enlarge isize after a successful write.
+		 */
+		if (dio_data->updated_isize) {
+			u64 new_isize;
 
-				if (written == 0)
-					new_isize = dio_data->old_isize;
-				else
-					new_isize = max(dio_data->old_isize, pos);
-				i_size_write(inode, new_isize);
-				dio_data->updated_isize = false;
-			}
-			/*
-			 * We have a short write, if there is any range
-			 * that is submitted properly, that part will have
-			 * its own OE split from the original one.
-			 *
-			 * So for the OE at dio_data->ordered, it's the part
-			 * that is not submitted, and should be marked
-			 * as fully truncated.
-			 */
-			btrfs_mark_ordered_extent_truncated(dio_data->ordered, 0);
-			btrfs_finish_ordered_extent(dio_data->ordered,
-						    pos, length, true);
-		} else {
-			btrfs_unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
-						pos + length - 1, NULL);
+			if (written == 0)
+				new_isize = dio_data->old_isize;
+			else
+				new_isize = max(dio_data->old_isize, pos + written);
+			i_size_write(inode, new_isize);
+			dio_data->updated_isize = false;
 		}
+		/*
+		 * We have a short write, if there is any range
+		 * that is submitted properly, that part will have
+		 * its own OE split from the original one.
+		 *
+		 * So for the OE at dio_data->ordered, it's the part
+		 * that is not submitted, and should be marked
+		 * as fully truncated.
+		 */
+		btrfs_mark_ordered_extent_truncated(dio_data->ordered, 0);
+		btrfs_finish_ordered_extent(dio_data->ordered,
+					    pos + written, length - written, true);
 		ret = -ENOTBLK;
 	}
-	if (write) {
-		btrfs_put_ordered_extent(dio_data->ordered);
-		dio_data->ordered = NULL;
-	}
-
-	if (write)
-		extent_changeset_free(dio_data->data_reserved);
+	btrfs_put_ordered_extent(dio_data->ordered);
+	dio_data->ordered = NULL;
+	extent_changeset_free(dio_data->data_reserved);
 	return ret;
 }
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/4] btrfs: return the written number if there is any progress for dio
  2026-06-05  9:01 ` [PATCH 1/4] btrfs: return the written number if there is any progress for dio Qu Wenruo
@ 2026-06-05 10:05   ` Filipe Manana
  2026-06-05 22:01     ` Qu Wenruo
  0 siblings, 1 reply; 9+ messages in thread
From: Filipe Manana @ 2026-06-05 10:05 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Jun 5, 2026 at 10:10 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Currently btrfs_direct_write() will return error even if there is some
> progress made.
>
> E.g. a NOWAIT direct write has submitted 16KiB data, but later
> failed due to -EFAULT.
>
> In that case we can not fallback to buffered IO either, as it's an
> NOWAIT operation, so we set @ret to -EAGAIN.
>
> In that case, we should not return -EAGAIN but 16K to indicate that we
> have already submitted 16KiB data.

No, this will cause partial writes, resulting in unexpected behaviour
for applications.

For NOWAIT writes, if we perform a partial write we must return
EAGAIN, so that the remainder is attempted again in a blocking
context.
If we return what we were able to write, there's no retry in a
blocking context and the application gets a partial write.

We've had such bugs and reports of unexpected reads and writes in the
past, at least
with MariaDB.



>
> Fix it by doing a check on @written first, only if @written is 0 then
> return @ret.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/direct-io.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index 460326d34143..c97396ca476d 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -1046,7 +1046,14 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>         invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
>                                  endbyte >> PAGE_SHIFT);
>  out:
> -       return ret < 0 ? ret : written;
> +       /*
> +        * Always check if we have some bytes written, even if we hit
> +        * an error later we must report the correct number of written
> +        * bytes.
> +        */
> +       if (written)
> +               return written;
> +       return ret;
>  }
>
>  static int check_direct_read(struct btrfs_fs_info *fs_info,
> --
> 2.54.0
>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/4] btrfs: wait for OE to be fully removed before falling back to buffered
  2026-06-05  9:01 ` [PATCH 2/4] btrfs: wait for OE to be fully removed before falling back to buffered Qu Wenruo
@ 2026-06-05 10:10   ` Filipe Manana
  2026-06-05 21:43     ` Qu Wenruo
  0 siblings, 1 reply; 9+ messages in thread
From: Filipe Manana @ 2026-06-05 10:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Jun 5, 2026 at 10:08 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Currently if a direct write failed for whatever reasons, e.g. page fault
> failure or IO error, then we fallback to buffered IO.
>
> But it's possible that the direct write attempt has created some ordered
> extent, and they may not be fully removed from the inode io tree yet:
>
>  btrfs_direct_write:
>  |- __iomap_dio_rw()
>  |  |- btrfs_dio_iomap_begin()
>  |  |  |- btrfs_get_blocks_direct_write()
>  |  |     |- btrfs_create_dio_extent()
>  |  |        Now an OE is created for the whole range
>  |  |
>  |  |- iomap_dio_iter()
>  |  |  Failed with -EFAULT, which is pretty common inside
>  |  |  btrfs as we have set iov_iter->nofault flag.
>  |  |
>  |  |- btrfs_dio_iomap_end()
>  |     |- btrfs_finish_ordered_extent()
>  |        |- btrfs_queue_ordered_fn()
>  |           The real removal of the OE is queued into a workqueue.
>  |
>  |- btrfs_buffered_write()
>  |- btrfs_fdatawrite_range()
>     |- cow_file_range()
>        |- btrfs_alloc_ordered_extent()
>           If the previous OE is not yet removed, a conflicting OE
>           will be inserted and trigger a panic.
>
> Fix it by calling btrfs_wait_ordered_range() before buffered writes.

No, this should not be needed, we already wait for ordered extents elsewhere:

btrfs_buffered_write() -> copy_one_range() ->
lock_and_cleanup_extent_if_need() -> lock_and_cleanup_extent_if_need()

Was this a theoretical fix or did you actually find this happening in practice?

>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/direct-io.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index c97396ca476d..8efe3a722206 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -1025,6 +1025,16 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>         }
>
>         pos = iocb->ki_pos;
> +       /*
> +        * Direct writes may have created an OE, then submitted nothing.
> +        * Although iomap_end() should have marked the OE as finished, it's
> +        * not yet removed from io-tree.
> +        * Thus we have to wait for them to be removed.
> +        */
> +       ret = btrfs_wait_ordered_range(BTRFS_I(inode), pos, iov_iter_count(from));
> +       if (ret < 0)
> +               goto out;
> +
>         written_buffered = btrfs_buffered_write(iocb, from);
>         if (written_buffered < 0) {
>                 ret = written_buffered;
> --
> 2.54.0
>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/4] btrfs: wait for OE to be fully removed before falling back to buffered
  2026-06-05 10:10   ` Filipe Manana
@ 2026-06-05 21:43     ` Qu Wenruo
  0 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2026-06-05 21:43 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



在 2026/6/5 19:40, Filipe Manana 写道:
> On Fri, Jun 5, 2026 at 10:08 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> Currently if a direct write failed for whatever reasons, e.g. page fault
>> failure or IO error, then we fallback to buffered IO.
>>
>> But it's possible that the direct write attempt has created some ordered
>> extent, and they may not be fully removed from the inode io tree yet:
>>
>>   btrfs_direct_write:
>>   |- __iomap_dio_rw()
>>   |  |- btrfs_dio_iomap_begin()
>>   |  |  |- btrfs_get_blocks_direct_write()
>>   |  |     |- btrfs_create_dio_extent()
>>   |  |        Now an OE is created for the whole range
>>   |  |
>>   |  |- iomap_dio_iter()
>>   |  |  Failed with -EFAULT, which is pretty common inside
>>   |  |  btrfs as we have set iov_iter->nofault flag.
>>   |  |
>>   |  |- btrfs_dio_iomap_end()
>>   |     |- btrfs_finish_ordered_extent()
>>   |        |- btrfs_queue_ordered_fn()
>>   |           The real removal of the OE is queued into a workqueue.
>>   |
>>   |- btrfs_buffered_write()
>>   |- btrfs_fdatawrite_range()
>>      |- cow_file_range()
>>         |- btrfs_alloc_ordered_extent()
>>            If the previous OE is not yet removed, a conflicting OE
>>            will be inserted and trigger a panic.
>>
>> Fix it by calling btrfs_wait_ordered_range() before buffered writes.
> 
> No, this should not be needed, we already wait for ordered extents elsewhere:
> 
> btrfs_buffered_write() -> copy_one_range() ->
> lock_and_cleanup_extent_if_need() -> lock_and_cleanup_extent_if_need()

Right, I'll drop this one.

> 
> Was this a theoretical fix or did you actually find this happening in practice?

I was hitting OE conflicts during development on another direct IO 
related patchset, but that's after some extent lock related changes, and 
I have not yet hit it again on for-next.

So it may be some other bugs.

Thanks,
Qu

> 
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/direct-io.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
>> index c97396ca476d..8efe3a722206 100644
>> --- a/fs/btrfs/direct-io.c
>> +++ b/fs/btrfs/direct-io.c
>> @@ -1025,6 +1025,16 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>>          }
>>
>>          pos = iocb->ki_pos;
>> +       /*
>> +        * Direct writes may have created an OE, then submitted nothing.
>> +        * Although iomap_end() should have marked the OE as finished, it's
>> +        * not yet removed from io-tree.
>> +        * Thus we have to wait for them to be removed.
>> +        */
>> +       ret = btrfs_wait_ordered_range(BTRFS_I(inode), pos, iov_iter_count(from));
>> +       if (ret < 0)
>> +               goto out;
>> +
>>          written_buffered = btrfs_buffered_write(iocb, from);
>>          if (written_buffered < 0) {
>>                  ret = written_buffered;
>> --
>> 2.54.0
>>
>>
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/4] btrfs: return the written number if there is any progress for dio
  2026-06-05 10:05   ` Filipe Manana
@ 2026-06-05 22:01     ` Qu Wenruo
  0 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2026-06-05 22:01 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



在 2026/6/5 19:35, Filipe Manana 写道:
> On Fri, Jun 5, 2026 at 10:10 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> Currently btrfs_direct_write() will return error even if there is some
>> progress made.
>>
>> E.g. a NOWAIT direct write has submitted 16KiB data, but later
>> failed due to -EFAULT.
>>
>> In that case we can not fallback to buffered IO either, as it's an
>> NOWAIT operation, so we set @ret to -EAGAIN.
>>
>> In that case, we should not return -EAGAIN but 16K to indicate that we
>> have already submitted 16KiB data.
> 
> No, this will cause partial writes, resulting in unexpected behaviour
> for applications.
> 
> For NOWAIT writes, if we perform a partial write we must return
> EAGAIN, so that the remainder is attempted again in a blocking
> context.
> If we return what we were able to write, there's no retry in a
> blocking context and the application gets a partial write.
> 
> We've had such bugs and reports of unexpected reads and writes in the
> past, at least
> with MariaDB.

But what about non-NOWAIT cases? And APPEND writes?

Especially for APPEND writes, if we got a partial write, the isize is 
already increased, the next APPEND writes would land in the new position.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-06-05 22:01 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-05  9:01 [PATCH 0/4] btrfs: minor fixes and cleanups on direct IO path Qu Wenruo
2026-06-05  9:01 ` [PATCH 1/4] btrfs: return the written number if there is any progress for dio Qu Wenruo
2026-06-05 10:05   ` Filipe Manana
2026-06-05 22:01     ` Qu Wenruo
2026-06-05  9:01 ` [PATCH 2/4] btrfs: wait for OE to be fully removed before falling back to buffered Qu Wenruo
2026-06-05 10:10   ` Filipe Manana
2026-06-05 21:43     ` Qu Wenruo
2026-06-05  9:01 ` [PATCH 3/4] btrfs: remove btrfs_dio_data::submitted Qu Wenruo
2026-06-05  9:01 ` [PATCH 4/4] btrfs: refactor btrfs_dio_iomap_end() Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.