Linux Btrfs filesystem development
 help / color / mirror / Atom feed
* Re: [PATCH 0/6] btrfs: defrag/autodefrag fixes and cleanups
From: Qu Wenruo @ 2026-06-25 23:00 UTC (permalink / raw)
  To: fdmanana, linux-btrfs
In-Reply-To: <cover.1782321584.git.fdmanana@suse.com>



在 2026/6/26 04:50, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
> 
> There are a couple bugs related to defrag and autodefrag, one of them
> reported by syzbot and the other can often be triggered by fsstress with
> the mount option "-o autodefrag" (or fstests on random tests that use
> fsstress with multiple processes). Details in the change logs.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> 
> Filipe Manana (6):
>    btrfs: defrag: fix deadlock between defrag and delalloc space reservation
>    btrfs: fix pending delayed iputs when using autodefrag
>    btrfs: defrag: use a single list for each loop in defrag_one_range()
>    btrfs: defrag: use auto kfree in defrag_one_range() for folios array
>    btrfs: defrag: use simple list_del() in defrag_collect_targets()
>    btrfs: defrag: remove pointless list_del_init() in defrag_one_cluster()
> 
>   fs/btrfs/defrag.c  | 61 ++++++++++++++++++++++++++--------------------
>   fs/btrfs/disk-io.c | 15 ++++++++++++
>   2 files changed, 49 insertions(+), 27 deletions(-)
> 


^ permalink raw reply

* Re: [PATCH] btrfs: always wait for ordered extents to avoid OE races
From: Qu Wenruo @ 2026-06-25 22:38 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, syzbot+ba2afde329fc27e3f22e
In-Reply-To: <CAL3q7H7+WAQMV76+-AuU=dR_qoe6Gu1_CNmwUFD=zW81d2Hz0g@mail.gmail.com>



在 2026/6/25 20:59, Filipe Manana 写道:
[...]
> 
>> @@ -896,42 +896,38 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
>>          start_pos = round_down(pos, fs_info->sectorsize);
>>          last_pos = round_up(pos + write_bytes, fs_info->sectorsize) - 1;
>>
>> -       if (start_pos < inode->vfs_inode.i_size) {
>> -               struct btrfs_ordered_extent *ordered;
>> -
>> -               if (nowait) {
>> -                       if (!btrfs_try_lock_extent(&inode->io_tree, start_pos,
>> -                                                  last_pos, cached_state)) {
>> -                               folio_unlock(folio);
>> -                               folio_put(folio);
>> -                               return -EAGAIN;
>> -                       }
>> -               } else {
>> -                       btrfs_lock_extent(&inode->io_tree, start_pos, last_pos,
>> -                                         cached_state);
>> -               }
>> -
>> -               ordered = btrfs_lookup_ordered_range(inode, start_pos,
>> -                                                    last_pos - start_pos + 1);
>> -               if (ordered &&
>> -                   ordered->file_offset + ordered->num_bytes > start_pos &&
>> -                   ordered->file_offset <= last_pos) {
>> -                       btrfs_unlock_extent(&inode->io_tree, start_pos, last_pos,
>> -                                           cached_state);
>> +       if (nowait) {
>> +               if (!btrfs_try_lock_extent(&inode->io_tree, start_pos,
>> +                                          last_pos, cached_state)) {
>>                          folio_unlock(folio);
>>                          folio_put(folio);
>> -                       btrfs_start_ordered_extent(ordered);
>> -                       btrfs_put_ordered_extent(ordered);
>>                          return -EAGAIN;
>>                  }
>> -               if (ordered)
>> -                       btrfs_put_ordered_extent(ordered);
>> -
>> -               *lockstart = start_pos;
>> -               *lockend = last_pos;
>> -               ret = 1;
>> +       } else {
>> +               btrfs_lock_extent(&inode->io_tree, start_pos, last_pos,
>> +                                 cached_state);
>>          }
>>
>> +       ordered = btrfs_lookup_ordered_range(inode, start_pos,
>> +                                            last_pos - start_pos + 1);
>> +       if (ordered &&
>> +           ordered->file_offset + ordered->num_bytes > start_pos &&
>> +           ordered->file_offset <= last_pos) {
>> +               btrfs_unlock_extent(&inode->io_tree, start_pos, last_pos,
>> +                                   cached_state);
>> +               folio_unlock(folio);
>> +               folio_put(folio);
>> +               btrfs_start_ordered_extent(ordered);
>> +               btrfs_put_ordered_extent(ordered);
>> +               return -EAGAIN;
>> +       }
>> +       if (ordered)
>> +               btrfs_put_ordered_extent(ordered);
>> +
>> +       *lockstart = start_pos;
>> +       *lockend = last_pos;
>> +       ret = 1;
> 
> And this is pointless, right below we can use "return 1;" and get rid
> of the 'ret' variable.
> 
> Since returning 1 no longer makes sense, just make it return 0 and remove 'ret'.
> 
> Also in the caller, since lock_and_cleanup_extent_if_need() now always
> returns with the extent range locked, we can get rid of it's local
> variable named 'extents_locked' and use 'ret' as it does not make
> sense anymore.
> 
> Finally, with this change we now always look for ordered extents and
> lock the range. Previously, we only did this if writes were within
> i_size - which makes sense because, except for his new direct IO case
> falling back to buffered IO, we could never find ordered extents
> beyond i_size.
> 
> Have you checked whether performance is affected?

Here is the microbenchmark for the runtime of btrfs_buffered_write() and 
lock_and_cleanup_extent_if_need().

The workload is a very basic xfs_io -f -c "pwrite 0 1m", which is always 
appending the isize for every 4K block.

Before patch:
- lock_and_cleanup_extent_if_need():

   runtime avg = 58.16 ns
   runtime stdev = 4.2

- btrfs_buffered_write():

   runtime avg = 2115.6 ns
   runtime stdev = 1822.1


Patched:
- lock_and_cleanup_extent_if_need():

   runtime avg = 183.0 ns
   runtime stdev = 161.6

- btrfs_buffered_write():

   runtime avg = 2973.3 ns
   runtime stdev = 4225.1

So yes, the runtime of lock_and_cleanup_extent_if_need() has greatly 
increased, by more than 3 times, but still less than 1 micro second.

The btrfs_buffered_write() itself is also much slower due to the extra 
extent unlock, the runtime is increased by 40%, but still less than 3 us.

I'm not sure if the cost is acceptable or not. The increased runtime is 
pretty huge for the affected worst case scenario, and that's undeniable.

But the overall runtime of btrfs_buffered_write() itself is still at 
single digit microsecond level.

If acceptable, I'll add the microbenchmark into the commit message in 
the next update.

Thanks,
Qu

> 
> Thanks.
> 
>> +
>>          /*
>>           * We should be called after prepare_one_folio() which should have locked
>>           * all pages in the range.
>> @@ -1288,11 +1284,8 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
>>
>>                  /* No copied bytes, unlock, release reserved space and exit. */
>>                  if (copied == 0) {
>> -                       if (extents_locked)
>> -                               btrfs_unlock_extent(&inode->io_tree, lockstart, lockend,
>> -                                                   &cached_state);
>> -                       else
>> -                               btrfs_free_extent_state(cached_state);
>> +                       btrfs_unlock_extent(&inode->io_tree, lockstart, lockend,
>> +                                           &cached_state);
>>                          btrfs_delalloc_release_extents(inode, reserved_len);
>>                          release_space(inode, *data_reserved, reserved_start, reserved_len,
>>                                        only_release_metadata);
>> @@ -1311,17 +1304,7 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
>>
>>          ret = btrfs_dirty_folio(inode, folio, start, copied, &cached_state,
>>                                  only_release_metadata);
>> -       /*
>> -        * If we have not locked the extent range, because the range's start
>> -        * offset is >= i_size, we might still have a non-NULL cached extent
>> -        * state, acquired while marking the extent range as delalloc through
>> -        * btrfs_dirty_page(). Therefore free any possible cached extent state
>> -        * to avoid a memory leak.
>> -        */
>> -       if (extents_locked)
>> -               btrfs_unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
>> -       else
>> -               btrfs_free_extent_state(cached_state);
>> +       btrfs_unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
>>
>>          btrfs_delalloc_release_extents(inode, reserved_len);
>>          if (ret) {
>> --
>> 2.54.0
>>
>>


^ permalink raw reply

* Re: [PATCH] btrfs: skip tests that fail when autodefrag is enabled
From: Anand Suveer Jain @ 2026-06-25 22:36 UTC (permalink / raw)
  To: fdmanana, fstests; +Cc: linux-btrfs, Filipe Manana
In-Reply-To: <f0dedcfbe5d78d950325b293cfbb9d4fbab49a55.1782406920.git.fdmanana@suse.com>


LGTM

Reviewed-by: Anand Jain <asj@kernel.org>




^ permalink raw reply

* Re: [PATCH 1/5] btrfs: factor init_extent_buffer from __alloc_extent_buffer
From: Boris Burkov @ 2026-06-25 21:14 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs, kernel-team
In-Reply-To: <44af21bb-507a-42af-a908-66fa67c2a273@wdc.com>

On Thu, Jun 25, 2026 at 04:26:56PM +0200, Johannes Thumshirn wrote:
> On 6/24/26 12:35 AM, Boris Burkov wrote:
> > In preparation for preallocating extent_buffer data, factor eb
> > initialization away from specifically allocating it. This allows us to
> > allocate the eb, bfs, folios, etc. together in the main search_slot code
> > paths, but still share initialization code with the dummy/test/clone
> > allocation paths.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >   fs/btrfs/extent_io.c | 20 ++++++++++++--------
> >   1 file changed, 12 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 0edd532174fa..fa4cc8bcd1af 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -3054,12 +3054,9 @@ void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans)
> >   	xa_destroy(&trans->writeback_inhibited_ebs);
> >   }
> > -static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> > -						   u64 start)
> > +static void init_extent_buffer(struct btrfs_fs_info *fs_info,
> > +			       struct extent_buffer *eb, u64 start)
> >   {
> > -	struct extent_buffer *eb = NULL;
> > -
> > -	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
> >   	eb->start = start;
> >   	eb->len = fs_info->nodesize;
> >   	eb->fs_info = fs_info;
> > @@ -3072,7 +3069,15 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
> >   	refcount_set(&eb->refs, 1);
> >   	ASSERT(eb->len <= BTRFS_MAX_METADATA_BLOCKSIZE);
> > +}
> > +
> > +static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> > +						   u64 start)
> > +{
> > +	struct extent_buffer *eb;
> > +	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS | __GFP_NOFAIL);
> > +	init_extent_buffer(fs_info, eb, start);
> >   	return eb;
> >   }
> > @@ -3476,9 +3481,8 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> >   	if (eb)
> >   		return eb;
> > -	eb = __alloc_extent_buffer(fs_info, start);
> > -	if (!eb)
> > -		return ERR_PTR(-ENOMEM);
> > +	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS | __GFP_NOFAIL);
> > +	init_extent_buffer(fs_info, eb, start);
> >   	/*
> >   	 * The reloc trees are just snapshots, so we need them to appear to be
> 
> This looks wrong to me. Not the split out code, but how it is used in
> alloc_extent_buffer(). Either you keep calling __alloc_extent_buffer() in
> alloc_extent_buffer() OR don't call init_extent_buffer() in
> __alloc_extent_buffer(). I see this is an intermediate step (and I haven't
> looked at the rest of the series yet) but it looks wrong.
> 

The idea is that you eventually (optionally) have the allocation happen
elsewhere, so alloc_extent_buffer should not call __alloc_extent_buffer
any more. That should be clearer in the second patch, I think.

I left __alloc_extent_buffer for btrfs_clone_extent_buffer() and
alloc_dummy_extent_buffer(). Would you prefer me to delete
__alloc_extent_buffer and open code it in those two callers?

Thanks,
Boris

^ permalink raw reply

* [PATCH] btrfs: skip tests that fail when autodefrag is enabled
From: fdmanana @ 2026-06-25 19:20 UTC (permalink / raw)
  To: fstests; +Cc: linux-btrfs, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

A couple tests, btrfs/004 and btrfs/308, fail if we pass "-o autodefrag"
in MOUNT_OPTIONS, due to changes in extent layouts. So make sure the
tests are skipped if autodefrag is enabled.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 common/btrfs    | 7 +++++++
 tests/btrfs/004 | 3 +++
 tests/btrfs/308 | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/common/btrfs b/common/btrfs
index 30288f07..a75fbf1c 100644
--- a/common/btrfs
+++ b/common/btrfs
@@ -141,6 +141,13 @@ _require_btrfs_no_nodatasum()
 	fi
 }
 
+_require_btrfs_no_autodefrag()
+{
+	if _normalize_mount_options "$MOUNT_OPTIONS" | grep -q "autodefrag"; then
+		_notrun "This test requires autodefrag to be disabled"
+	fi
+}
+
 _require_btrfs_free_space_tree()
 {
 	_scratch_mkfs > /dev/null 2>&1
diff --git a/tests/btrfs/004 b/tests/btrfs/004
index 06eedde2..1c8b06f5 100755
--- a/tests/btrfs/004
+++ b/tests/btrfs/004
@@ -30,6 +30,9 @@ _require_no_large_scratch_dev
 _require_btrfs_command inspect-internal logical-resolve
 _require_btrfs_command inspect-internal inode-resolve
 _require_command "$FILEFRAG_PROG" filefrag
+# Autodefrag will randomly change the extent layout so filefrag can give us
+# different results from what we expect.
+_require_btrfs_no_autodefrag
 
 FILEFRAG_FILTER='
 	if (/blocks? of (\d+) bytes/) {
diff --git a/tests/btrfs/308 b/tests/btrfs/308
index 3c8d410a..c4ccaa0f 100755
--- a/tests/btrfs/308
+++ b/tests/btrfs/308
@@ -22,6 +22,8 @@ _require_btrfs_fs_feature "free_space_tree"
 _require_btrfs_free_space_tree
 _require_btrfs_no_compress
 _require_btrfs_no_nodatacow
+# We test for a specific stripe layout and autodefrag will change it.
+_require_btrfs_no_autodefrag
 
 test $(_get_page_size) -eq 4096 || _notrun "this tests requires 4k pagesize"
 
-- 
2.47.2


^ permalink raw reply related

* [PATCH 6/6] btrfs: defrag: remove pointless list_del_init() in defrag_one_cluster()
From: fdmanana @ 2026-06-25 19:20 UTC (permalink / raw)
  To: linux-btrfs
In-Reply-To: <cover.1782321584.git.fdmanana@suse.com>

From: Filipe Manana <fdmanana@suse.com>

There's no need to call list_del_init() against each entry when freeing
the list, as the list is local and we are freeing the entry.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/defrag.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 7b3f779775a0..6ec5dd760d42 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -1319,10 +1319,8 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
 				      inode->root->fs_info->sectorsize_bits;
 	}
 out:
-	list_for_each_entry_safe(entry, tmp, &target_list, list) {
-		list_del_init(&entry->list);
+	list_for_each_entry_safe(entry, tmp, &target_list, list)
 		kfree(entry);
-	}
 	if (ret >= 0)
 		*last_scanned_ret = max(*last_scanned_ret, start + len);
 	return ret;
-- 
2.47.2


^ permalink raw reply related

* [PATCH 5/6] btrfs: defrag: use simple list_del() in defrag_collect_targets()
From: fdmanana @ 2026-06-25 19:20 UTC (permalink / raw)
  To: linux-btrfs
In-Reply-To: <cover.1782321584.git.fdmanana@suse.com>

From: Filipe Manana <fdmanana@suse.com>

When freeing the entries from the list there is no need to initialize
the list member in an entry, since we are immediately freeing it. So use
simple list_del() instead of list_del_init().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/defrag.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index e454b59d6477..7b3f779775a0 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -1093,7 +1093,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 		struct defrag_target_range *tmp;
 
 		list_for_each_entry_safe(entry, tmp, target_list, list) {
-			list_del_init(&entry->list);
+			list_del(&entry->list);
 			kfree(entry);
 		}
 	}
-- 
2.47.2


^ permalink raw reply related

* [PATCH 4/6] btrfs: defrag: use auto kfree in defrag_one_range() for folios array
From: fdmanana @ 2026-06-25 19:20 UTC (permalink / raw)
  To: linux-btrfs
In-Reply-To: <cover.1782321584.git.fdmanana@suse.com>

From: Filipe Manana <fdmanana@suse.com>

Use AUTO_KFREE() for the folios array, avoiding two kfree() calls, one of
them in a very specific error path.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/defrag.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index ad1d04d8f165..e454b59d6477 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -1169,7 +1169,7 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 	struct defrag_target_range *entry;
 	struct defrag_target_range *tmp;
 	LIST_HEAD(target_list);
-	struct folio **folios;
+	struct folio AUTO_KFREE(*folios);
 	const u32 sectorsize = inode->root->fs_info->sectorsize;
 	u64 cur = start;
 	const unsigned int nr_pages = ((start + len - 1) >> PAGE_SHIFT) -
@@ -1196,10 +1196,8 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 	 * range or the extent lock.
 	 */
 	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, len);
-	if (ret < 0) {
-		kfree(folios);
+	if (ret < 0)
 		return ret;
-	}
 
 	/* Prepare all pages */
 	for (int i = 0; cur < start + len && i < nr_pages; i++) {
@@ -1251,7 +1249,6 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 		folio_unlock(folios[i]);
 		folio_put(folios[i]);
 	}
-	kfree(folios);
 	btrfs_delalloc_release_extents(inode, len);
 	if (last_defrag_end < start + len)
 		btrfs_delalloc_release_space(inode, data_reserved, last_defrag_end,
-- 
2.47.2


^ permalink raw reply related

* [PATCH 3/6] btrfs: defrag: use a single list for each loop in defrag_one_range()
From: fdmanana @ 2026-06-25 19:20 UTC (permalink / raw)
  To: linux-btrfs
In-Reply-To: <cover.1782321584.git.fdmanana@suse.com>

From: Filipe Manana <fdmanana@suse.com>

There's no need to have one list for each loop to defrag each subrange and
then another one to free each subrange (struct defrag_target_range).
We can do it in a single loop, freeing each subrange after defragging,
plus no need to delete each subrange from the list since we immediately
free it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/defrag.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 0697b285e05f..ad1d04d8f165 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -1234,16 +1234,12 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 	if (ret < 0)
 		goto unlock_extent;
 
-	list_for_each_entry(entry, &target_list, list) {
+	list_for_each_entry_safe(entry, tmp, &target_list, list) {
 		defrag_one_locked_target(inode, entry, folios, nr_pages, &cached_state);
 		if (entry->start > last_defrag_end)
 			btrfs_delalloc_release_space(inode, data_reserved, last_defrag_end,
 						     entry->start - last_defrag_end, true);
 		last_defrag_end = entry->start + entry->len;
-	}
-
-	list_for_each_entry_safe(entry, tmp, &target_list, list) {
-		list_del_init(&entry->list);
 		kfree(entry);
 	}
 unlock_extent:
-- 
2.47.2


^ permalink raw reply related

* [PATCH 2/6] btrfs: fix pending delayed iputs when using autodefrag
From: fdmanana @ 2026-06-25 19:20 UTC (permalink / raw)
  To: linux-btrfs
In-Reply-To: <cover.1782321584.git.fdmanana@suse.com>

From: Filipe Manana <fdmanana@suse.com>

Syzbot reported the following warning recently:

   [  157.672472][ T6611] BTRFS info (device loop0): turning on flush-on-commit
   [  157.672488][ T6611] BTRFS info (device loop0): enabling free space tree
   [  157.672504][ T6611] BTRFS info (device loop0): enabling auto defrag
   [  157.672555][ T6611] BTRFS info (device loop0): use lzo compression, level 1
   [  157.672574][ T6611] BTRFS info (device loop0): max_inline set to 4096
   [  158.094512][ T5608] BTRFS info (device loop2): last unmount of filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
   [  160.073968][ T6656] BTRFS info (device loop0 state M): max_inline set to 4096
   [  160.418911][ T5611] BTRFS info (device loop0): last unmount of filesystem ab8108e1-bea5-4a9f-94c9-a3ff208d732a
   [  160.432287][ T6662] loop2: detected capacity change from 0 to 32768
   [  160.438859][ T6662] BTRFS: device fsid c9fe44da-de57-406a-8241-57ec7d4412cf devid 1 transid 8 /dev/loop2 (7:2) scanned by syz.2.74 (6662)
   [  160.459589][ T6662] BTRFS info (device loop2): first mount of filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
   [  160.459616][ T6662] BTRFS info (device loop2): using crc32c checksum algorithm
   [  160.634366][ T1187] ------------[ cut here ]------------
   [  160.634376][ T1187] test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state)
   [  160.634387][ T1187] WARNING: fs/btrfs/inode.c:3596 at btrfs_add_delayed_iput+0x2e3/0x340, CPU#0: kworker/u8:10/1187
   [  160.634412][ T1187] Modules linked in:
   [  160.634423][ T1187] CPU: 0 UID: 0 PID: 1187 Comm: kworker/u8:10 Not tainted syzkaller #0 PREEMPT_{RT,(full)}
   [  160.634435][ T1187] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
   [  160.634442][ T1187] Workqueue: btrfs-endio-write btrfs_work_helper
   [  160.634456][ T1187] RIP: 0010:btrfs_add_delayed_iput+0x2e3/0x340
   [  160.634468][ T1187] Code: 53 a3 45 (...)
   [  160.634482][ T1187] RSP: 0018:ffffc900065d77c8 EFLAGS: 00010293
   [  160.634490][ T1187] RAX: ffffffff83e5f502 RBX: ffff88805aba0000 RCX: ffff888029768000
   [  160.634497][ T1187] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
   [  160.634503][ T1187] RBP: dffffc0000000000 R08: 0000000000000000 R09: 0000000000000000
   [  160.634509][ T1187] R10: dffffc0000000000 R11: ffffed100b574497 R12: 0000000000000001
   [  160.634516][ T1187] R13: dffffc0000000000 R14: ffff888061194788 R15: 0000000000000200
   [  160.634523][ T1187] FS:  0000000000000000(0000) GS:ffff888126186000(0000) knlGS:0000000000000000
   [  160.634531][ T1187] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [  160.634537][ T1187] CR2: 00007fe553a3f000 CR3: 00000000596c2000 CR4: 00000000003526f0
   [  160.634547][ T1187] Call Trace:
   [  160.634551][ T1187]  <TASK>
   [  160.634560][ T1187]  btrfs_put_ordered_extent+0x18f/0x430
   [  160.634577][ T1187]  btrfs_finish_one_ordered+0xf63/0x2680
   [  160.634598][ T1187]  ? __pfx_btrfs_finish_one_ordered+0x10/0x10
   [  160.634611][ T1187]  ? do_raw_spin_lock+0x12b/0x2f0
   [  160.634622][ T1187]  ? lock_acquire+0x106/0x350
   [  160.634636][ T1187]  ? __pfx_do_raw_spin_lock+0x10/0x10
   [  160.634650][ T1187]  btrfs_work_helper+0x38b/0xc20
   [  160.634666][ T1187]  ? process_scheduled_works+0xa70/0x1860
   [  160.634679][ T1187]  process_scheduled_works+0xb5d/0x1860
   [  160.634703][ T1187]  ? __pfx_process_scheduled_works+0x10/0x10
   [  160.634716][ T1187]  ? assign_work+0x3d5/0x5e0
   [  160.634729][ T1187]  worker_thread+0xa53/0xfc0
   [  160.634752][ T1187]  kthread+0x388/0x470
   [  160.634765][ T1187]  ? __pfx_worker_thread+0x10/0x10
   [  160.635870][ T1187]  ? __pfx_kthread+0x10/0x10
   [  160.635891][ T1187]  ret_from_fork+0x514/0xb70
   [  160.635907][ T1187]  ? __pfx_ret_from_fork+0x10/0x10
   [  160.635917][ T1187]  ? __switch_to+0xc79/0x1410
   [  160.635934][ T1187]  ? __pfx_kthread+0x10/0x10
   [  160.635948][ T1187]  ret_from_fork_asm+0x1a/0x30
   [  160.635969][ T1187]  </TASK>
   [  160.635975][ T1187] Kernel panic - not syncing: kernel: panic_on_warn set ...

It means we add a delayed iput created after we last ran delayed iputs in
close_ctree() and set the flag BTRFS_FS_STATE_NO_DELAYED_IPUT in fs_info.

This happens when using autodefrag and more likely to happen if we use
flushoncommit too. The steps are the following:

1) Unmount starts, all delalloc is flushed and we enter close_ctree();

2) In close_ctree() we park the cleaner kthread, but while we wait for it
   to park, it's in:

     btrfs_run_defrag_inodes()
        btrfs_run_defrag_inode()
           btrfs_defrag_file()
              defrag_one_cluster()
                 defrag_one_range()
                    defrag_one_locked_target()

   And dirties some folios from an inode;

3) The cleaner kthread parks and we proceed in close_ctree(), waiting
   for all ordered extents, running delayed iputs and setting the flag
   BTRFS_FS_STATE_NO_DELAYED_IPUT in fs_info;

4) Later in close_ctree() we call btrfs_commit_super(), which commits the
   current transaction. Because we are mounted with flushoncommit, the
   transaction commit flushes delalloc and waits for the resulting ordered
   extent to complete;

5) The ordered extents from the flushed dealloc created by autodefrag
   complete and create delayed iputs, triggering the warning:

     WARN_ON_ONCE(test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state));

   in btrfs_add_delayed_iput()

6) Further below in close_ctree() we will hit the following assertion:

     ASSERT(list_empty(&fs_info->delayed_iputs));

   Since we don't expect any more delayed iputs.

Fix this by flushing delalloc and waiting for the ordered extents right
after we parked the cleaner kthread and waiting for autodefrag in
close_ctree().

Reported-by: syzbot+6a843bf8604711c8fab0@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a1ee507.b4221f80.1326c5.0004.GAE@google.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/disk-io.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index fa5922a21e51..93ca33ca24e7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4365,6 +4365,21 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 	/* clear out the rbtree of defraggable inodes */
 	btrfs_cleanup_defrag_inodes(fs_info);
 
+	/*
+	 * After we entered close_ctree() autodefrag could be running and before
+	 * we parked the cleaner kthread, it dirtied folios of some inode.
+	 * We don't want to leave any delalloc here, it may be flushed any time
+	 * after this point and result in ordered extents that create delayed
+	 * iputs after flushed the ordered extent queues further below, run
+	 * delayed iputs and set BTRFS_FS_STATE_NO_DELAYED_IPUT. If we are
+	 * mounted with flushoncommit, then btrfs_commit_super() called below
+	 * will flush delalloc and wait for ordered extents but we end up
+	 * getting delayed iputs than are never run. So flush delalloc and wait
+	 * for ordered extents.
+	 */
+	writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, NULL);
+
 	/*
 	 * Handle the error fs first, as it will flush and wait for all ordered
 	 * extents.  This will generate delayed iputs, thus we want to handle
-- 
2.47.2


^ permalink raw reply related

* [PATCH 1/6] btrfs: defrag: fix deadlock between defrag and delalloc space reservation
From: fdmanana @ 2026-06-25 19:20 UTC (permalink / raw)
  To: linux-btrfs
In-Reply-To: <cover.1782321584.git.fdmanana@suse.com>

From: Filipe Manana <fdmanana@suse.com>

While running fsstress with autodefrag and flushoncommit, hit a deadlock
due to the fact that defrag reserves delalloc space while it's holding
dirty and locked folios, besides the extent range lock. The stack traces
are the following:

   [430958.624136] task:kworker/u50:3   state:D stack:0     pid:20365 tgid:20365 ppid:2      task_flags:0x4208060 flags:0x00080000
   [430958.626267] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
   [430958.627821] Call Trace:
   [430958.628351]  <TASK>
   [430958.628990]  __schedule+0x4be/0x10f0
   [430958.629791]  ? preempt_count_add+0x69/0xa0
   [430958.630605]  schedule+0x26/0xd0
   [430958.631327]  wait_current_trans+0x102/0x160 [btrfs]
   [430958.632414]  ? __pfx_autoremove_wake_function+0x10/0x10
   [430958.633515]  start_transaction+0x374/0x900 [btrfs]
   [430958.634601]  btrfs_commit_current_transaction+0x1d/0x70 [btrfs]
   [430958.635982]  flush_space+0xca/0x5e0 [btrfs]
   [430958.636996]  ? _raw_spin_unlock+0x15/0x30
   [430958.637894]  ? btrfs_reduce_alloc_profile+0x8c/0x190 [btrfs]
   [430958.639217]  ? _raw_spin_unlock+0x15/0x30
   [430958.640030]  ? calc_available_free_space.isra.0+0x6f/0x110 [btrfs]
   [430958.641462]  do_async_reclaim_metadata_space+0x84/0x190 [btrfs]
   [430958.642711]  btrfs_async_reclaim_metadata_space+0x64/0x80 [btrfs]
   [430958.644015]  process_one_work+0x19d/0x3a0
   [430958.644873]  worker_thread+0x1c4/0x330
   [430958.645668]  ? __pfx_worker_thread+0x10/0x10
   [430958.646535]  kthread+0xfc/0x130
   [430958.647285]  ? __pfx_kthread+0x10/0x10
   [430958.648068]  ret_from_fork+0x1f7/0x2c0
   [430958.648894]  ? __pfx_kthread+0x10/0x10
   [430958.649713]  ret_from_fork_asm+0x1a/0x30
   [430958.650536]  </TASK>
   [430958.651036] task:kworker/u49:7   state:D stack:0     pid:52990 tgid:52990 ppid:2      task_flags:0x4208060 flags:0x00080000
   [430958.653709] Workqueue: writeback wb_workfn (flush-btrfs-334)
   [430958.655110] Call Trace:
   [430958.655737]  <TASK>
   [430958.656284]  __schedule+0x4be/0x10f0
   [430958.657178]  ? __blk_flush_plug+0xe9/0x140
   [430958.658188]  schedule+0x26/0xd0
   [430958.658982]  io_schedule+0x42/0x70
   [430958.659850]  folio_wait_bit_common+0x12b/0x330
   [430958.660954]  ? folio_wait_bit_common+0x100/0x330
   [430958.662157]  ? __pfx_wake_page_function+0x10/0x10
   [430958.663328]  extent_write_cache_pages+0x599/0x830 [btrfs]
   [430958.664496]  ? acpi_fwnode_get_reference_args+0x1fa/0x270
   [430958.665579]  btrfs_writepages+0x77/0x130 [btrfs]
   [430958.666614]  ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
   [430958.667846]  do_writepages+0xc6/0x160
   [430958.668596]  __writeback_single_inode+0x42/0x310
   [430958.669535]  writeback_sb_inodes+0x231/0x570
   [430958.670583]  wb_writeback+0x8a/0x340
   [430958.671383]  wb_workfn+0xbf/0x450
   [430958.672058]  ? finish_task_switch.isra.0+0xc1/0x350
   [430958.673026]  process_one_work+0x19d/0x3a0
   [430958.673814]  worker_thread+0x1c4/0x330
   [430958.674565]  ? __pfx_worker_thread+0x10/0x10
   [430958.675440]  kthread+0xfc/0x130
   [430958.676084]  ? __pfx_kthread+0x10/0x10
   [430958.676832]  ret_from_fork+0x1f7/0x2c0
   [430958.677582]  ? __pfx_kthread+0x10/0x10
   [430958.678369]  ret_from_fork_asm+0x1a/0x30
   [430958.679171]  </TASK>
   [430958.679644] task:btrfs-cleaner   state:D stack:0     pid:296750 tgid:296750 ppid:2      task_flags:0x208040 flags:0x00080000
   [430958.681812] Call Trace:
   [430958.682318]  <TASK>
   [430958.682762]  __schedule+0x4be/0x10f0
   [430958.683542]  schedule+0x26/0xd0
   [430958.684264]  handle_reserve_ticket+0x1b9/0x2c0 [btrfs]
   [430958.685366]  ? __pfx_autoremove_wake_function+0x10/0x10
   [430958.686520]  reserve_bytes+0x283/0x4c0 [btrfs]
   [430958.687610]  btrfs_reserve_metadata_bytes+0x18/0xb0 [btrfs]
   [430958.688860]  btrfs_delalloc_reserve_metadata+0x121/0x320 [btrfs]
   [430958.690263]  btrfs_delalloc_reserve_space+0x46/0xb0 [btrfs]
   [430958.691675]  btrfs_defrag_file+0x903/0x1110 [btrfs]
   [430958.692879]  btrfs_run_defrag_inodes+0x334/0x430 [btrfs]
   [430958.694005]  cleaner_kthread+0x97/0x1c0 [btrfs]
   [430958.694969]  ? __pfx_cleaner_kthread+0x10/0x10 [btrfs]
   [430958.696232]  kthread+0xfc/0x130
   [430958.696954]  ? __pfx_kthread+0x10/0x10
   [430958.697763]  ret_from_fork+0x1f7/0x2c0
   [430958.698521]  ? __pfx_kthread+0x10/0x10
   [430958.699348]  ret_from_fork_asm+0x1a/0x30
   [430958.700217]  </TASK>
   [430958.716533] task:fsstress        state:D stack:0     pid:296769 tgid:296769 ppid:296768 task_flags:0x400140 flags:0x00080000
   [430958.718780] Call Trace:
   [430958.719366]  <TASK>
   [430958.719817]  __schedule+0x4be/0x10f0
   [430958.720611]  ? preempt_count_add+0x69/0xa0
   [430958.721465]  schedule+0x26/0xd0
   [430958.722150]  wb_wait_for_completion+0x79/0xc0
   [430958.723109]  ? __pfx_autoremove_wake_function+0x10/0x10
   [430958.724173]  __writeback_inodes_sb_nr+0xc5/0xf0
   [430958.725081]  try_to_writeback_inodes_sb+0x55/0x70
   [430958.726075]  btrfs_commit_transaction+0x19d/0xeb0 [btrfs]
   [430958.727337]  ? start_transaction+0x343/0x900 [btrfs]
   [430958.728422]  btrfs_mksubvol+0x28b/0x4e0 [btrfs]
   [430958.729445]  btrfs_mksnapshot+0x74/0xa0 [btrfs]
   [430958.730511]  __btrfs_ioctl_snap_create+0x194/0x210 [btrfs]
   [430958.732245]  btrfs_ioctl_snap_create_v2+0xef/0x150 [btrfs]
   [430958.733636]  btrfs_ioctl+0x7ec/0x2a70 [btrfs]
   [430958.734665]  ? __virt_addr_valid+0xe4/0x180
   [430958.735534]  ? __check_object_size+0x1cd/0x1f0
   [430958.736613]  ? kmem_cache_free+0x146/0x380
   [430958.737645]  ? _raw_spin_unlock+0x15/0x30
   [430958.738660]  ? do_sys_openat2+0x83/0xd0
   [430958.739637]  __x64_sys_ioctl+0x92/0xe0
   [430958.740576]  do_syscall_64+0x60/0x590
   [430958.741512]  ? clear_bhb_loop+0x60/0xb0
   [430958.742485]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [430958.743772] RIP: 0033:0x7f4431e108db
   [430958.744668] RSP: 002b:00007ffcd147db20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   [430958.746327] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f4431e108db
   [430958.747816] RDX: 00007ffcd147eb90 RSI: 0000000050009417 RDI: 0000000000000005
   [430958.749479] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
   [430958.751216] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffcd147fbf0
   [430958.752929] R13: 00007ffcd147eb90 R14: 0000000000000005 R15: 0000000000000003
   [430958.754684]  </TASK>

What happens is the following:

1) The cleaner kthread is running autodefrag, and in defrag_one_range()
   it acquired all the folios for the range and locked them.

   Then it locked the extent range in the inode's iotree.

   It got two subranges from defrag_collect_targets(), the first one
   with folio A and the second one with folio B.

   After it defraged the first subrange, folio A remains locked and
   dirty - it's only unlocked when defrag_one_range() returns.

   When it attempts to defrag the second subrange (containing folio B),
   btrfs_delalloc_reserve_space() creates a space reservation ticket,
   due to lack of free metadata space and blocks waiting for the async
   metadata reclaim task to free space and wake it up;

2) The async reclaim metadata task attempts to commit the current
   transaction, but it blocks because there is another task that
   started the commit first;

3) A task creating a snapshot is committing the transaction and
   because the fs was mounted with flushoncommit, it calls
   try_to_writeback_inodes_sb(), which spawns a task to flush
   delalloc and waits for it to complete;

4) The task flushing delalloc (kworker/u49:7), finds that folio A for
   the inode being defragged is dirty, so it tries to lock it...

   But it blocks because folio A is locked by the defrag task (the
   cleaner kthread) which is blocked waiting for the reservation
   ticket to be served, but the async reclaim metadata task is
   blocked waiting for the transaction commit, which in turn is
   blocked waiting for the delalloc flush task, which is trying to
   lock folio A, resulting in a deadlock.

The same type of problem can happen if the async reclaim task starts to
flush delalloc, as that requires both locking the folio and the extent
rannge in the inode's io tree, and in this case we don't need the fs to
be mounted with flushoncommit. This type of problem has ocurred several
times in the past with reflinks for example, where we had a dirty folio
while holding the extent range locked and then starting a transaction
blocked waiting for the async reclaim task due to lack of free metadata
space.

So fix this by reserving delalloc space before locking folios and locking
the extent range in the inode's iotree. We can not simply unlock the
folios for each subrange given by defrag_collect_targets() after we defrag
it because the same folio may be present too in the next subrange (due to
large folios).

Fixes: 22b398eeeed4 ("btrfs: defrag: introduce helper to defrag a contiguous prepared range")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/defrag.c | 50 +++++++++++++++++++++++++++++++----------------
 1 file changed, 33 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index f0c6758b7055..0697b285e05f 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -1130,20 +1130,15 @@ static_assert(PAGE_ALIGNED(CLUSTER_SIZE));
  *
  * - Extent bits are locked
  */
-static int defrag_one_locked_target(struct btrfs_inode *inode,
-				    struct defrag_target_range *target,
-				    struct folio **folios, int nr_pages,
-				    struct extent_state **cached_state)
+static void defrag_one_locked_target(struct btrfs_inode *inode,
+				     struct defrag_target_range *target,
+				     struct folio **folios, int nr_pages,
+				     struct extent_state **cached_state)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct extent_changeset *data_reserved = NULL;
 	const u64 start = target->start;
 	const u64 len = target->len;
-	int ret = 0;
 
-	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, len);
-	if (ret < 0)
-		return ret;
 	btrfs_clear_extent_bit(&inode->io_tree, start, start + len - 1,
 			       EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
 			       EXTENT_DEFRAG, cached_state);
@@ -1164,10 +1159,6 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
 			continue;
 		btrfs_folio_clamp_set_dirty(fs_info, folio, start, len);
 	}
-	btrfs_delalloc_release_extents(inode, len);
-	extent_changeset_free(data_reserved);
-
-	return ret;
 }
 
 static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
@@ -1183,6 +1174,8 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 	u64 cur = start;
 	const unsigned int nr_pages = ((start + len - 1) >> PAGE_SHIFT) -
 				      (start >> PAGE_SHIFT) + 1;
+	struct extent_changeset *data_reserved = NULL;
+	u64 last_defrag_end = start;
 	int ret = 0;
 
 	ASSERT(nr_pages <= CLUSTER_SIZE / PAGE_SIZE);
@@ -1192,6 +1185,22 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 	if (!folios)
 		return -ENOMEM;
 
+	/*
+	 * Reserve delalloc space before locking the range and before locking
+	 * and dirtying any folios - otherwise we could deadlock, for example
+	 * after defrag of one range we dirty folios and keep them locked when
+	 * we move to the next range, so reserving delalloc space right before
+	 * each range could trigger flushing of delalloc and deadlock on the
+	 * extent lock or trigger a transaction commit with flushoncommit, which
+	 * can either deadlock on the lock of a folio made dirty in the previous
+	 * range or the extent lock.
+	 */
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, len);
+	if (ret < 0) {
+		kfree(folios);
+		return ret;
+	}
+
 	/* Prepare all pages */
 	for (int i = 0; cur < start + len && i < nr_pages; i++) {
 		folios[i] = defrag_prepare_one_folio(inode, cur >> PAGE_SHIFT);
@@ -1226,10 +1235,11 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 		goto unlock_extent;
 
 	list_for_each_entry(entry, &target_list, list) {
-		ret = defrag_one_locked_target(inode, entry, folios, nr_pages,
-					       &cached_state);
-		if (ret < 0)
-			break;
+		defrag_one_locked_target(inode, entry, folios, nr_pages, &cached_state);
+		if (entry->start > last_defrag_end)
+			btrfs_delalloc_release_space(inode, data_reserved, last_defrag_end,
+						     entry->start - last_defrag_end, true);
+		last_defrag_end = entry->start + entry->len;
 	}
 
 	list_for_each_entry_safe(entry, tmp, &target_list, list) {
@@ -1246,6 +1256,12 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 		folio_put(folios[i]);
 	}
 	kfree(folios);
+	btrfs_delalloc_release_extents(inode, len);
+	if (last_defrag_end < start + len)
+		btrfs_delalloc_release_space(inode, data_reserved, last_defrag_end,
+					     start + len - last_defrag_end, true);
+	extent_changeset_free(data_reserved);
+
 	return ret;
 }
 
-- 
2.47.2


^ permalink raw reply related

* [PATCH 0/6] btrfs: defrag/autodefrag fixes and cleanups
From: fdmanana @ 2026-06-25 19:20 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There are a couple bugs related to defrag and autodefrag, one of them
reported by syzbot and the other can often be triggered by fsstress with
the mount option "-o autodefrag" (or fstests on random tests that use
fsstress with multiple processes). Details in the change logs.

Filipe Manana (6):
  btrfs: defrag: fix deadlock between defrag and delalloc space reservation
  btrfs: fix pending delayed iputs when using autodefrag
  btrfs: defrag: use a single list for each loop in defrag_one_range()
  btrfs: defrag: use auto kfree in defrag_one_range() for folios array
  btrfs: defrag: use simple list_del() in defrag_collect_targets()
  btrfs: defrag: remove pointless list_del_init() in defrag_one_cluster()

 fs/btrfs/defrag.c  | 61 ++++++++++++++++++++++++++--------------------
 fs/btrfs/disk-io.c | 15 ++++++++++++
 2 files changed, 49 insertions(+), 27 deletions(-)

-- 
2.47.2


^ permalink raw reply

* [PATCH 2/2] fs: btrfs: use fs_ls_generic() and drop custom implementation
From: Alexey Charkov @ 2026-06-25 17:24 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, u-boot
  Cc: Marek Behún, Tom Rini, Simon Glass, Timo tp Preißl,
	Peng Fan, Patrice Chotard, Yao Zi, Alexey Charkov
In-Reply-To: <20260625-btrfs-readdir-v1-0-dd781a2b3965@flipper.net>

Now that generic callbacks for opendir/readdir/closedir are implemented,
the custom btrfs_ls() implementation is no longer needed, along with the
btrfs_iter_dir() callback iterator.

Use fs_ls_generic() instead.

Signed-off-by: Alexey Charkov <alchark@flipper.net>
---
 fs/btrfs/dir-item.c | 59 +++--------------------------------------------------
 fs/fs.c             |  2 +-
 include/btrfs.h     |  1 -
 3 files changed, 4 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index 8cf5be3790e7..e8aebc0c7366 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -114,65 +114,12 @@ struct btrfs_dir_item *btrfs_lookup_dir_item(struct btrfs_trans_handle *trans,
 	return btrfs_match_dir_item_name(root, path, name, name_len);
 }
 
-int btrfs_iter_dir(struct btrfs_root *root, u64 ino,
-		   btrfs_iter_dir_callback_t callback)
-{
-	struct btrfs_path path;
-	struct btrfs_key key;
-	int ret;
-
-	btrfs_init_path(&path);
-	key.objectid = ino;
-	key.type = BTRFS_DIR_INDEX_KEY;
-	key.offset = 0;
-
-	ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0);
-	if (ret < 0)
-		return ret;
-	/* Should not happen */
-	if (ret == 0) {
-		ret = -EUCLEAN;
-		goto out;
-	}
-	if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
-		ret = btrfs_next_leaf(root, &path);
-		if (ret < 0)
-			goto out;
-		if (ret > 0) {
-			ret = 0;
-			goto out;
-		}
-	}
-	do {
-		struct btrfs_dir_item *di;
-
-		btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
-		if (key.objectid != ino || key.type != BTRFS_DIR_INDEX_KEY)
-			break;
-		di = btrfs_item_ptr(path.nodes[0], path.slots[0],
-				    struct btrfs_dir_item);
-		if (verify_dir_item(root, path.nodes[0], di)) {
-			ret = -EUCLEAN;
-			goto out;
-		}
-		ret = callback(root, path.nodes[0], di);
-		if (ret < 0)
-			goto out;
-	} while (!(ret = btrfs_next_item(root, &path)));
-
-	if (ret > 0)
-		ret = 0;
-out:
-	btrfs_release_path(&path);
-	return ret;
-}
-
 /*
  * btrfs_next_dir_entry() - read one directory entry at or after a cursor
  *
- * Streaming counterpart to btrfs_iter_dir() for the fs-layer readdir, which
- * is re-entered once per entry: returns the first real entry whose
- * BTRFS_DIR_INDEX_KEY offset is >= *offset and advances *offset past it.
+ * Iterator for the fs-layer readdir, which is re-entered once per entry:
+ * returns the first real entry whose BTRFS_DIR_INDEX_KEY offset is
+ * >= *offset and advances *offset past it.
  *
  * @root, @ino:		directory to read
  * @offset:		in/out cursor; DIR_INDEX offset to resume from
diff --git a/fs/fs.c b/fs/fs.c
index 4694a88f776b..1ebda6f4ee2a 100644
--- a/fs/fs.c
+++ b/fs/fs.c
@@ -320,7 +320,7 @@ static struct fstype_info fstypes[] = {
 		.null_dev_desc_ok = false,
 		.probe = btrfs_probe,
 		.close = btrfs_close,
-		.ls = btrfs_ls,
+		.ls = fs_ls_generic,
 		.exists = btrfs_exists,
 		.size = btrfs_size,
 		.read = btrfs_read,
diff --git a/include/btrfs.h b/include/btrfs.h
index 6fff45a497ee..3878b7817ea6 100644
--- a/include/btrfs.h
+++ b/include/btrfs.h
@@ -15,7 +15,6 @@ struct fs_dirent;
 
 int btrfs_probe(struct blk_desc *fs_dev_desc,
 		struct disk_partition *fs_partition);
-int btrfs_ls(const char *);
 int btrfs_exists(const char *);
 int btrfs_size(const char *, loff_t *);
 int btrfs_read(const char *, void *, loff_t, loff_t, loff_t *);

-- 
2.53.0


^ permalink raw reply related

* [PATCH 1/2] fs: btrfs: implement opendir(), readdir() and closedir()
From: Alexey Charkov @ 2026-06-25 17:24 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, u-boot
  Cc: Marek Behún, Tom Rini, Simon Glass, Timo tp Preißl,
	Peng Fan, Patrice Chotard, Yao Zi, Alexey Charkov
In-Reply-To: <20260625-btrfs-readdir-v1-0-dd781a2b3965@flipper.net>

Add support for generic directory iteration with opendir(), readdir() and
closedir() in the btrfs filesystem driver, following the ext4fs
implementation for opendir()/closedir() and the btrfs_iter_dir() function
for readdir().

Signed-off-by: Alexey Charkov <alchark@flipper.net>
---
 fs/btrfs/btrfs.c    | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ctree.h    |  2 ++
 fs/btrfs/dir-item.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/fs.c             |  4 ++-
 include/btrfs.h     |  5 +++
 5 files changed, 193 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/btrfs.c b/fs/btrfs/btrfs.c
index f3087f690fa4..c647c8dedf4e 100644
--- a/fs/btrfs/btrfs.c
+++ b/fs/btrfs/btrfs.c
@@ -9,6 +9,7 @@
 #include <malloc.h>
 #include <u-boot/uuid.h>
 #include <linux/time.h>
+#include <fs.h>
 #include "btrfs.h"
 #include "crypto/hash.h"
 #include "disk-io.h"
@@ -159,6 +160,97 @@ int btrfs_ls(const char *path)
 	return 0;
 }
 
+struct btrfs_dir_stream {
+	struct fs_dir_stream parent;
+	struct fs_dirent dirent;
+	char *dirname;
+	u64 offset;
+};
+
+int btrfs_opendir(const char *dirname, struct fs_dir_stream **dirsp)
+{
+	struct btrfs_fs_info *fs_info = current_fs_info;
+	struct btrfs_dir_stream *dirs;
+	struct btrfs_root *root;
+	u64 ino;
+	u8 type;
+	int ret;
+
+	*dirsp = NULL;
+	ASSERT(fs_info);
+
+	ret = btrfs_lookup_path(fs_info->fs_root, BTRFS_FIRST_FREE_OBJECTID,
+				dirname, &root, &ino, &type, 40);
+	if (ret < 0)
+		return ret;
+	if (type != BTRFS_FT_DIR)
+		return -ENOTDIR;
+
+	dirs = calloc(1, sizeof(*dirs));
+	if (!dirs)
+		return -ENOMEM;
+	dirs->dirname = strdup(dirname);
+	if (!dirs->dirname) {
+		free(dirs);
+		return -ENOMEM;
+	}
+
+	*dirsp = (struct fs_dir_stream *)dirs;
+	return 0;
+}
+
+int btrfs_readdir(struct fs_dir_stream *fs_dirs, struct fs_dirent **dentp)
+{
+	struct btrfs_dir_stream *dirs = (struct btrfs_dir_stream *)fs_dirs;
+	struct btrfs_fs_info *fs_info = current_fs_info;
+	struct fs_dirent *dent = &dirs->dirent;
+	struct btrfs_root *root;
+	u64 ino;
+	u8 type;
+	int ret;
+
+	*dentp = NULL;
+	ASSERT(fs_info);
+
+	ret = btrfs_lookup_path(fs_info->fs_root, BTRFS_FIRST_FREE_OBJECTID,
+				dirs->dirname, &root, &ino, &type, 40);
+	if (ret < 0)
+		return ret;
+	if (type != BTRFS_FT_DIR)
+		return -ENOTDIR;
+
+	memset(dent, 0, sizeof(*dent));
+	ret = btrfs_next_dir_entry(root, ino, &dirs->offset, dent->name,
+				   sizeof(dent->name), &type);
+	if (ret < 0)
+		return ret;
+	if (ret > 0)
+		return -ENOENT;
+
+	switch (type) {
+	case BTRFS_FT_DIR:
+		dent->type = FS_DT_DIR;
+		break;
+	case BTRFS_FT_SYMLINK:
+		dent->type = FS_DT_LNK;
+		break;
+	default:
+		dent->type = FS_DT_REG;
+		break;
+	}
+
+	*dentp = dent;
+	return 0;
+}
+
+void btrfs_closedir(struct fs_dir_stream *fs_dirs)
+{
+	struct btrfs_dir_stream *dirs = (struct btrfs_dir_stream *)fs_dirs;
+
+	free(dirs->dirname);
+	free(dirs);
+}
+
 int btrfs_exists(const char *file)
 {
 	struct btrfs_fs_info *fs_info = current_fs_info;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b7be09d5df83..7de35aa7efd5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1225,6 +1225,8 @@ typedef int (*btrfs_iter_dir_callback_t)(struct btrfs_root *root,
 					 struct btrfs_dir_item *di);
 int btrfs_iter_dir(struct btrfs_root *root, u64 ino,
 		   btrfs_iter_dir_callback_t callback);
+int btrfs_next_dir_entry(struct btrfs_root *root, u64 ino, u64 *offset,
+			 char *namebuf, int namebuf_len, u8 *ftype);
 /* inode.c */
 int btrfs_lookup_path(struct btrfs_root *root, u64 ino, const char *filename,
 			struct btrfs_root **root_ret, u64 *ino_ret,
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index 5f81d6414f06..8cf5be3790e7 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -166,3 +166,94 @@ out:
 	btrfs_release_path(&path);
 	return ret;
 }
+
+/*
+ * btrfs_next_dir_entry() - read one directory entry at or after a cursor
+ *
+ * Streaming counterpart to btrfs_iter_dir() for the fs-layer readdir, which
+ * is re-entered once per entry: returns the first real entry whose
+ * BTRFS_DIR_INDEX_KEY offset is >= *offset and advances *offset past it.
+ *
+ * @root, @ino:		directory to read
+ * @offset:		in/out cursor; DIR_INDEX offset to resume from
+ * @namebuf:		caller buffer that receives the NUL-terminated name
+ * @namebuf_len:	size of @namebuf in bytes
+ * @ftype:		receives the BTRFS_FT_* type of the entry
+ *
+ * Return: 0 if an entry was returned, 1 when the directory is exhausted,
+ *	   -ve on error.
+ */
+int btrfs_next_dir_entry(struct btrfs_root *root, u64 ino, u64 *offset,
+			 char *namebuf, int namebuf_len, u8 *ftype)
+{
+	struct btrfs_path path;
+	struct btrfs_key key;
+	int ret;
+
+	btrfs_init_path(&path);
+	key.objectid = ino;
+	key.type = BTRFS_DIR_INDEX_KEY;
+	key.offset = *offset;
+
+	ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0);
+	if (ret < 0)
+		goto out;
+
+	while (1) {
+		struct btrfs_dir_item *di;
+		int name_len;
+		u8 type;
+
+		if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
+			ret = btrfs_next_leaf(root, &path);
+			if (ret < 0)
+				goto out;
+			if (ret > 0) {		/* end of tree */
+				ret = 1;
+				goto out;
+			}
+		}
+
+		btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
+		if (key.objectid != ino || key.type != BTRFS_DIR_INDEX_KEY) {
+			ret = 1;		/* no more entries for this dir */
+			goto out;
+		}
+
+		di = btrfs_item_ptr(path.nodes[0], path.slots[0],
+				    struct btrfs_dir_item);
+		if (verify_dir_item(root, path.nodes[0], di)) {
+			ret = -EUCLEAN;
+			goto out;
+		}
+
+		*offset = key.offset + 1;
+		type = btrfs_dir_type(path.nodes[0], di);
+
+		/* XATTRs share the key space but are not directory entries. */
+		if (type == BTRFS_FT_XATTR) {
+			ret = btrfs_next_item(root, &path);
+			if (ret < 0)
+				goto out;
+			if (ret > 0) {
+				ret = 1;
+				goto out;
+			}
+			continue;
+		}
+
+		name_len = btrfs_dir_name_len(path.nodes[0], di);
+		if (name_len > namebuf_len - 1)
+			name_len = namebuf_len - 1;
+		read_extent_buffer(path.nodes[0], namebuf,
+				   (unsigned long)(di + 1), name_len);
+		namebuf[name_len] = '\0';
+		*ftype = type;
+		ret = 0;
+		goto out;
+	}
+
+out:
+	btrfs_release_path(&path);
+	return ret;
+}
diff --git a/fs/fs.c b/fs/fs.c
index 8ea50a6c13c4..4694a88f776b 100644
--- a/fs/fs.c
+++ b/fs/fs.c
@@ -326,7 +326,9 @@ static struct fstype_info fstypes[] = {
 		.read = btrfs_read,
 		.write = fs_write_unsupported,
 		.uuid = btrfs_uuid,
-		.opendir = fs_opendir_unsupported,
+		.opendir = btrfs_opendir,
+		.readdir = btrfs_readdir,
+		.closedir = btrfs_closedir,
 		.unlink = fs_unlink_unsupported,
 		.mkdir = fs_mkdir_unsupported,
 		.ln = fs_ln_unsupported,
diff --git a/include/btrfs.h b/include/btrfs.h
index 2d73add18e09..6fff45a497ee 100644
--- a/include/btrfs.h
+++ b/include/btrfs.h
@@ -10,6 +10,8 @@
 
 struct blk_desc;
 struct disk_partition;
+struct fs_dir_stream;
+struct fs_dirent;
 
 int btrfs_probe(struct blk_desc *fs_dev_desc,
 		struct disk_partition *fs_partition);
@@ -20,5 +22,8 @@ int btrfs_read(const char *, void *, loff_t, loff_t, loff_t *);
 void btrfs_close(void);
 int btrfs_uuid(char *);
 void btrfs_list_subvols(void);
+int btrfs_opendir(const char *filename, struct fs_dir_stream **dirsp);
+int btrfs_readdir(struct fs_dir_stream *dirs, struct fs_dirent **dentp);
+void btrfs_closedir(struct fs_dir_stream *dirs);
 
 #endif /* __U_BOOT_BTRFS_H__ */

-- 
2.53.0


^ permalink raw reply related

* [PATCH 0/2] fs: btrfs: add support for readdir
From: Alexey Charkov @ 2026-06-25 17:24 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, u-boot
  Cc: Marek Behún, Tom Rini, Simon Glass, Timo tp Preißl,
	Peng Fan, Patrice Chotard, Yao Zi, Alexey Charkov

Btrfs in U-boot currently uses a custom callback for ls and doesn't expose
the standard opendir/readdir/closedir interface, making it harder to use
in generic code. One area where this would be useful is in discovering BLS
type 1 entries [1] on a Btrfs filesystem.

Add support for the standard interface, and implement ls in terms of it.

[1] https://lore.kernel.org/u-boot/20260604-bls-v1-0-4ce6d1ee4711@flipper.net/

Signed-off-by: Alexey Charkov <alchark@flipper.net>
---
Alexey Charkov (2):
      fs: btrfs: implement opendir(), readdir() and closedir()
      fs: btrfs: use fs_ls_generic() and drop custom implementation

 fs/btrfs/btrfs.c    | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ctree.h    |  2 ++
 fs/btrfs/dir-item.c | 88 +++++++++++++++++++++++++++++++++++---------------
 fs/fs.c             |  6 ++--
 include/btrfs.h     |  6 +++-
 5 files changed, 166 insertions(+), 28 deletions(-)
---
base-commit: 19eafbadf20f56c1a24abe6b5e8774e776894261
change-id: 20260625-btrfs-readdir-8beac3ac5220

Best regards,
--  
Alexey Charkov <alchark@flipper.net>


^ permalink raw reply

* Re: [PATCH 1/5] btrfs: factor init_extent_buffer from __alloc_extent_buffer
From: Johannes Thumshirn @ 2026-06-25 14:26 UTC (permalink / raw)
  To: Boris Burkov, linux-btrfs, kernel-team
In-Reply-To: <420436ec8cb5db154adfd65cfd5fb350b380a4ec.1782249000.git.boris@bur.io>

On 6/24/26 12:35 AM, Boris Burkov wrote:
> In preparation for preallocating extent_buffer data, factor eb
> initialization away from specifically allocating it. This allows us to
> allocate the eb, bfs, folios, etc. together in the main search_slot code
> paths, but still share initialization code with the dummy/test/clone
> allocation paths.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>   fs/btrfs/extent_io.c | 20 ++++++++++++--------
>   1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0edd532174fa..fa4cc8bcd1af 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3054,12 +3054,9 @@ void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans)
>   	xa_destroy(&trans->writeback_inhibited_ebs);
>   }
>   
> -static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> -						   u64 start)
> +static void init_extent_buffer(struct btrfs_fs_info *fs_info,
> +			       struct extent_buffer *eb, u64 start)
>   {
> -	struct extent_buffer *eb = NULL;
> -
> -	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
>   	eb->start = start;
>   	eb->len = fs_info->nodesize;
>   	eb->fs_info = fs_info;
> @@ -3072,7 +3069,15 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
>   	refcount_set(&eb->refs, 1);
>   
>   	ASSERT(eb->len <= BTRFS_MAX_METADATA_BLOCKSIZE);
> +}
> +
> +static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> +						   u64 start)
> +{
> +	struct extent_buffer *eb;
>   
> +	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS | __GFP_NOFAIL);
> +	init_extent_buffer(fs_info, eb, start);
>   	return eb;
>   }
>   
> @@ -3476,9 +3481,8 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>   	if (eb)
>   		return eb;
>   
> -	eb = __alloc_extent_buffer(fs_info, start);
> -	if (!eb)
> -		return ERR_PTR(-ENOMEM);
> +	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS | __GFP_NOFAIL);
> +	init_extent_buffer(fs_info, eb, start);
>   
>   	/*
>   	 * The reloc trees are just snapshots, so we need them to appear to be

This looks wrong to me. Not the split out code, but how it is used in 
alloc_extent_buffer(). Either you keep calling __alloc_extent_buffer() 
in alloc_extent_buffer() OR don't call init_extent_buffer() in 
__alloc_extent_buffer(). I see this is an intermediate step (and I 
haven't looked at the rest of the series yet) but it looks wrong.


^ permalink raw reply

* Re: [PATCH] btrfs: always wait for ordered extents to avoid OE races
From: Filipe Manana @ 2026-06-25 11:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, syzbot+ba2afde329fc27e3f22e
In-Reply-To: <aaabe0f2bee07800fb0490d24fab5cd162e5241d.1782366056.git.wqu@suse.com>

On Thu, Jun 25, 2026 at 6:41 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> Syzbot reported a bug that there can be conflicting OEs for the same
> range:
>
>  BTRFS critical (device loop4): panic in insert_ordered_extent:264: overlapping ordered extents, existing oe file_offset 16384 num_bytes 430080 flags 0x1089, new oe file_offset 16384 num_bytes 430080 flags 0x80 (errno=-17 Object alrea[  179.162726][ T6897] BTRFS critical (device loop4): panic in insert_ordered_extent:264: overlapping ordered extents, existing oe file_offset 16384 num_bytes 430080 flags 0x1089, new oe file_offset 16384 num_bytes 430080 flags 0x80 (errno=-17 Object already exists)
>  ------------[ cut here ]------------
>  kernel BUG at fs/btrfs/ordered-data.c:264!
>  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
>  RIP: 0010:btrfs_alloc_ordered_extent+0x943/0xad0
>  Call Trace:
>   <TASK>
>   cow_file_range+0x744/0x12a0
>   fallback_to_cow+0x5ea/0xa00
>   run_delalloc_nocow+0x110c/0x17a0
>   btrfs_run_delalloc_range+0xbe4/0x1c20
>   writepage_delalloc+0x104d/0x1ba0
>   btrfs_writepages+0x1667/0x28b0
>   do_writepages+0x338/0x560
>   filemap_fdatawrite_range+0x1f2/0x300
>   btrfs_fdatawrite_range+0x54/0xf0
>   btrfs_direct_write+0x6a0/0xc30
>   btrfs_do_write_iter+0x329/0x790
>   do_iter_readv_writev+0x624/0x8d0
>   vfs_writev+0x34c/0x990
>   __se_sys_pwritev2+0x17a/0x2a0
>   do_syscall_64+0x174/0x580
>   entry_SYSCALL_64_after_hwframe+0x77/0x7f
>   </TASK>
>  ---[ end trace 0000000000000000 ]---
>
> [CAUSE]
> Since commit ff66fe666233 ("btrfs: fix incorrect buffered IO fallback
> for append direct writes"), if the direct IO finished short, we will
> revert the isize back to the original one, so that append writes can be
> respected during the buffered fallback.
>
> Normally we rely on lock_and_cleanup_extent_if_needed() function during
> buffered writeback to wait for any existing ordered extents.
>
> But that ordered extent waiting only happens if the start_pos is inside
> the isize.
> Since we have reverted the isize during failed direct IO, we will not
> wait for any ordered extents.
>
> This means we can have a race where the direct IO OE is still in the
> tree, finished but not yet removed, then we're inserting the OE for the
> buffered write, causing the above crash.
>
> [FIX]
> Make the OE wait to be unconditional, to handle the reverted isize
> situation.
>
> And since lock_and_cleanup_extent_if_needed() can no longer turn 0, we
> can also remove the branches that handles no-extent-locked cases.
>
> Reported-by: syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com
> Link: https://syzkaller.appspot.com/bug?extid=ba2afde329fc27e3f22e
> Fixes: ff66fe666233 ("btrfs: fix incorrect buffered IO fallback for append direct writes")
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/file.c | 77 +++++++++++++++++++------------------------------
>  1 file changed, 30 insertions(+), 47 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index df8f6f565070..ef31140e0942 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -879,7 +879,6 @@ static noinline int prepare_one_folio(struct inode *inode, struct folio **folio_
>   *
>   * Return:
>   * 1 - the extent is locked
> - * 0 - the extent is not locked, and everything is OK

So it stops making sense to return 1...
Just change the function to always return 0, meaning the extent range
is locked, or -EAGAIN.

>   * -EAGAIN - need to prepare the folios again
>   */
>  static noinline int
> @@ -889,6 +888,7 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
>                                 struct extent_state **cached_state)
>  {
>         struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +       struct btrfs_ordered_extent *ordered;
>         u64 start_pos;
>         u64 last_pos;
>         int ret = 0;

Here initialized to 0, but we can never return such value anymore.

> @@ -896,42 +896,38 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
>         start_pos = round_down(pos, fs_info->sectorsize);
>         last_pos = round_up(pos + write_bytes, fs_info->sectorsize) - 1;
>
> -       if (start_pos < inode->vfs_inode.i_size) {
> -               struct btrfs_ordered_extent *ordered;
> -
> -               if (nowait) {
> -                       if (!btrfs_try_lock_extent(&inode->io_tree, start_pos,
> -                                                  last_pos, cached_state)) {
> -                               folio_unlock(folio);
> -                               folio_put(folio);
> -                               return -EAGAIN;
> -                       }
> -               } else {
> -                       btrfs_lock_extent(&inode->io_tree, start_pos, last_pos,
> -                                         cached_state);
> -               }
> -
> -               ordered = btrfs_lookup_ordered_range(inode, start_pos,
> -                                                    last_pos - start_pos + 1);
> -               if (ordered &&
> -                   ordered->file_offset + ordered->num_bytes > start_pos &&
> -                   ordered->file_offset <= last_pos) {
> -                       btrfs_unlock_extent(&inode->io_tree, start_pos, last_pos,
> -                                           cached_state);
> +       if (nowait) {
> +               if (!btrfs_try_lock_extent(&inode->io_tree, start_pos,
> +                                          last_pos, cached_state)) {
>                         folio_unlock(folio);
>                         folio_put(folio);
> -                       btrfs_start_ordered_extent(ordered);
> -                       btrfs_put_ordered_extent(ordered);
>                         return -EAGAIN;
>                 }
> -               if (ordered)
> -                       btrfs_put_ordered_extent(ordered);
> -
> -               *lockstart = start_pos;
> -               *lockend = last_pos;
> -               ret = 1;
> +       } else {
> +               btrfs_lock_extent(&inode->io_tree, start_pos, last_pos,
> +                                 cached_state);
>         }
>
> +       ordered = btrfs_lookup_ordered_range(inode, start_pos,
> +                                            last_pos - start_pos + 1);
> +       if (ordered &&
> +           ordered->file_offset + ordered->num_bytes > start_pos &&
> +           ordered->file_offset <= last_pos) {
> +               btrfs_unlock_extent(&inode->io_tree, start_pos, last_pos,
> +                                   cached_state);
> +               folio_unlock(folio);
> +               folio_put(folio);
> +               btrfs_start_ordered_extent(ordered);
> +               btrfs_put_ordered_extent(ordered);
> +               return -EAGAIN;
> +       }
> +       if (ordered)
> +               btrfs_put_ordered_extent(ordered);
> +
> +       *lockstart = start_pos;
> +       *lockend = last_pos;
> +       ret = 1;

And this is pointless, right below we can use "return 1;" and get rid
of the 'ret' variable.

Since returning 1 no longer makes sense, just make it return 0 and remove 'ret'.

Also in the caller, since lock_and_cleanup_extent_if_need() now always
returns with the extent range locked, we can get rid of it's local
variable named 'extents_locked' and use 'ret' as it does not make
sense anymore.

Finally, with this change we now always look for ordered extents and
lock the range. Previously, we only did this if writes were within
i_size - which makes sense because, except for his new direct IO case
falling back to buffered IO, we could never find ordered extents
beyond i_size.

Have you checked whether performance is affected?

Thanks.

> +
>         /*
>          * We should be called after prepare_one_folio() which should have locked
>          * all pages in the range.
> @@ -1288,11 +1284,8 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
>
>                 /* No copied bytes, unlock, release reserved space and exit. */
>                 if (copied == 0) {
> -                       if (extents_locked)
> -                               btrfs_unlock_extent(&inode->io_tree, lockstart, lockend,
> -                                                   &cached_state);
> -                       else
> -                               btrfs_free_extent_state(cached_state);
> +                       btrfs_unlock_extent(&inode->io_tree, lockstart, lockend,
> +                                           &cached_state);
>                         btrfs_delalloc_release_extents(inode, reserved_len);
>                         release_space(inode, *data_reserved, reserved_start, reserved_len,
>                                       only_release_metadata);
> @@ -1311,17 +1304,7 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
>
>         ret = btrfs_dirty_folio(inode, folio, start, copied, &cached_state,
>                                 only_release_metadata);
> -       /*
> -        * If we have not locked the extent range, because the range's start
> -        * offset is >= i_size, we might still have a non-NULL cached extent
> -        * state, acquired while marking the extent range as delalloc through
> -        * btrfs_dirty_page(). Therefore free any possible cached extent state
> -        * to avoid a memory leak.
> -        */
> -       if (extents_locked)
> -               btrfs_unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
> -       else
> -               btrfs_free_extent_state(cached_state);
> +       btrfs_unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
>
>         btrfs_delalloc_release_extents(inode, reserved_len);
>         if (ret) {
> --
> 2.54.0
>
>

^ permalink raw reply

* Re: [RFC 00/15] btrfs: RAID5 with RAID stripe-tree (zoned + non-zoned)
From: Johannes Thumshirn @ 2026-06-25 11:24 UTC (permalink / raw)
  To: sun k; +Cc: linux-btrfs
In-Reply-To: <CAAuSuOpZkGX-mCYPKiD_Fni=0XdB2etgGrEkJJqBj4dSRqCZYQ@mail.gmail.com>

On 6/23/26 12:39 PM, sun k wrote:
> I've just noticed that I'm not sure about your mindset of this raid5/6
> rst feature.
> So I want to know do you think the raid5/6 rst is extent layer raid or
> block layer raid?
> Because they are quite different and I need to align it with you
> before further discussion.

I did have a what you call block-layer parity raid without and RMW in mind.

But RST is only a mapping layer, so I think we can go "extent layer" as 
well.

Hence the early RFC, I'm still open to changes, as long as it works on 
zoned filesystems.

Qu's idea with a RAID4 style (I say style because I want to be able to 
use multiple parities in the future) with dedicated parity disks is 
something worth exploring as well IMHO.

[leaving the rest for context]

> For extent layer raid5/6, it's more like what zfs is doing now: we do
> the "raid" thing on each
> file extent, split big extent into stripes, and write them on
> different disks with parity. The issue
> of this is that the performance and efficiency on small extents is
> bad, but we can always
> do full stripe write, avoiding write hole issues.
>
> For block layer raid5/6, which is like what we currently have without
> rst, we need to deal with the
> write hole issue, possibly by
> 1. totally avoid read-modify-write for a partial stripe update. We
> always write into a new stripe
>    append only instead of modifying an existing stripe, like writing a
> zoned device even on non-zoned device.
> 2. or we allow read-modify-write for a partial stripe update, but
> write the D and P elsewhere
>    instead of in-place.
>
> Option 2 will cause another free space management issue I've mentioned before:
> let's take a 4 disk raid 5 as example.
> before the partial stripe update, logical address LA is mapped to
> physical address PA0(D0), PA1(D1), PA2(D2), PA3(P);
> after updating D2, LA is mapped to address PA0(D0), PA1(D1), PA2'(D2'), PA3'(P')
> then we need to allocate PA2' and PA3' , and free PA2 and PA3 on disk 2 and 3.
> Since currently btrfs manage freespace based on logical address, how
> to manage the allocation and manage physical free space?
>


^ permalink raw reply

* Re: [PATCH] btrfs: wait for ordered extents before buffered write fallback in direct IO
From: Qu Wenruo @ 2026-06-25  6:51 UTC (permalink / raw)
  To: Zhou, Yun, clm, dsterba; +Cc: linux-btrfs, linux-kernel
In-Reply-To: <f192c12e-135f-4f98-a1a0-df2470eb8a86@windriver.com>



在 2026/6/25 16:16, Zhou, Yun 写道:
> 
> 
> On 6/25/26 13:17, Qu Wenruo wrote:
>>
>> 在 2026/6/25 14:43, Qu Wenruo 写道:
>>>
>>>
>>> 在 2026/6/25 11:44, Yun Zhou 写道:
>>>> When btrfs_direct_write() falls back to buffered IO after a failed DIO
>>>> attempt, it may race with the asynchronous completion of DIO ordered
>>>> extents.  This leads to a BUG_ON in insert_ordered_extent() due to
>>>> overlapping ordered extents in the per-inode rb-tree.
>>>>
>>>> The race sequence is:
>>>>   1. DIO creates an ordered extent via btrfs_dio_iomap_begin()
>>>>   2. Page fault occurs (nofault=true), no bio is submitted 
>>>> (submitted=0)
>>>>   3. btrfs_dio_iomap_end() truncates and finishes the OE asynchronously
>>>>      via btrfs_finish_ordered_extent() which queues work
>>>>   4. iomap returns 0, retry logic faults in pages and retries DIO
>>>>   5. Second DIO attempt also fails, code reaches buffered: label
>>>>   6. btrfs_buffered_write() dirties pages for the same range
>>>
>>> btrfs_buffered_write()
>>> |- copy_one_range()
>>>     |- lock_and_cleanup_extent_if_needed()
>>>        |- btrfs_start_ordered_extent()
>>>
>>> So your explanation doesn't makes sense. As if there is the direct IO oe
>>> remaining, we will wait for that OE to complete.
>>>
>>> There is still something missing.
>>>
>>>>   7. btrfs_fdatawrite_range() triggers writeback
>>>>   8. run_delalloc_nocow() -> fallback_to_cow() -> cow_file_range()
>>>>      tries to insert a new ordered extent for the same file offset
>>>>   9. The DIO ordered extent hasn't been removed from the rb-tree yet
>>>>      (btrfs_finish_ordered_io running async in workqueue) -> BUG_ON
>>>>
>>>> Fix this by waiting for any pending ordered extents in the target range
>>>> before starting the buffered write.
>>>>
>>>> Reported-by: syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com
>>>> Closes: https://syzkaller.appspot.com/bug?extid=ba2afde329fc27e3f22e
>>>> Fixes: acf9ed3a6c00 ("btrfs: retry faulting in the pages after a zero
>>>> sized short direct write")
>>
>> And the fixes tag is also incorrect.
>>
>> Without that commit, we will directly fallback to buffered write without
>> retry faulting in the pages.
>>
>> So by your explanation it will trigger the same problem, with or without
>> that commit.
> 
> Yes, my previous analysis does seem inaccurate. Commit acf9ed3a6c00 (which
> added retries) merely amplified the window for the issue to occur, but 
> the problem has actually existed since ff66fe666233 ('btrfs: fix 
> incorrect buffered IO fallback for append direct writes'), which 
> introduced i_size revert on DIO short writes, causing 
> lock_and_cleanup_extent_if_need() to skip the OE check (since start_pos 
>  >= reverted i_size). I will correct the commit message and the Fixes 
> tag in v2.

That's right, and I have already sent out a fix based on that isize revert:

https://lore.kernel.org/linux-btrfs/aaabe0f2bee07800fb0490d24fab5cd162e5241d.1782366056.git.wqu@suse.com/

> 
> Thanks,
> Yun


^ permalink raw reply

* Re: [PATCH] btrfs: wait for ordered extents before buffered write fallback in direct IO
From: Zhou, Yun @ 2026-06-25  6:46 UTC (permalink / raw)
  To: Qu Wenruo, clm, dsterba; +Cc: linux-btrfs, linux-kernel
In-Reply-To: <c2afdc0e-6e63-42b8-9a8a-fc77d2ab7383@suse.com>



On 6/25/26 13:17, Qu Wenruo wrote:
> 
> 在 2026/6/25 14:43, Qu Wenruo 写道:
>>
>>
>> 在 2026/6/25 11:44, Yun Zhou 写道:
>>> When btrfs_direct_write() falls back to buffered IO after a failed DIO
>>> attempt, it may race with the asynchronous completion of DIO ordered
>>> extents.  This leads to a BUG_ON in insert_ordered_extent() due to
>>> overlapping ordered extents in the per-inode rb-tree.
>>>
>>> The race sequence is:
>>>   1. DIO creates an ordered extent via btrfs_dio_iomap_begin()
>>>   2. Page fault occurs (nofault=true), no bio is submitted (submitted=0)
>>>   3. btrfs_dio_iomap_end() truncates and finishes the OE asynchronously
>>>      via btrfs_finish_ordered_extent() which queues work
>>>   4. iomap returns 0, retry logic faults in pages and retries DIO
>>>   5. Second DIO attempt also fails, code reaches buffered: label
>>>   6. btrfs_buffered_write() dirties pages for the same range
>>
>> btrfs_buffered_write()
>> |- copy_one_range()
>>     |- lock_and_cleanup_extent_if_needed()
>>        |- btrfs_start_ordered_extent()
>>
>> So your explanation doesn't makes sense. As if there is the direct IO oe
>> remaining, we will wait for that OE to complete.
>>
>> There is still something missing.
>>
>>>   7. btrfs_fdatawrite_range() triggers writeback
>>>   8. run_delalloc_nocow() -> fallback_to_cow() -> cow_file_range()
>>>      tries to insert a new ordered extent for the same file offset
>>>   9. The DIO ordered extent hasn't been removed from the rb-tree yet
>>>      (btrfs_finish_ordered_io running async in workqueue) -> BUG_ON
>>>
>>> Fix this by waiting for any pending ordered extents in the target range
>>> before starting the buffered write.
>>>
>>> Reported-by: syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com
>>> Closes: https://syzkaller.appspot.com/bug?extid=ba2afde329fc27e3f22e
>>> Fixes: acf9ed3a6c00 ("btrfs: retry faulting in the pages after a zero
>>> sized short direct write")
> 
> And the fixes tag is also incorrect.
> 
> Without that commit, we will directly fallback to buffered write without
> retry faulting in the pages.
> 
> So by your explanation it will trigger the same problem, with or without
> that commit.

Yes, my previous analysis does seem inaccurate. Commit acf9ed3a6c00 (which
added retries) merely amplified the window for the issue to occur, but 
the problem has actually existed since ff66fe666233 ('btrfs: fix 
incorrect buffered IO fallback for append direct writes'), which 
introduced i_size revert on DIO short writes, causing 
lock_and_cleanup_extent_if_need() to skip the OE check (since start_pos 
 >= reverted i_size). I will correct the commit message and the Fixes 
tag in v2.

Thanks,
Yun

^ permalink raw reply

* [PATCH] btrfs: always wait for ordered extents to avoid OE races
From: Qu Wenruo @ 2026-06-25  5:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: syzbot+ba2afde329fc27e3f22e

[BUG]
Syzbot reported a bug that there can be conflicting OEs for the same
range:

 BTRFS critical (device loop4): panic in insert_ordered_extent:264: overlapping ordered extents, existing oe file_offset 16384 num_bytes 430080 flags 0x1089, new oe file_offset 16384 num_bytes 430080 flags 0x80 (errno=-17 Object alrea[  179.162726][ T6897] BTRFS critical (device loop4): panic in insert_ordered_extent:264: overlapping ordered extents, existing oe file_offset 16384 num_bytes 430080 flags 0x1089, new oe file_offset 16384 num_bytes 430080 flags 0x80 (errno=-17 Object already exists)
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ordered-data.c:264!
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
 RIP: 0010:btrfs_alloc_ordered_extent+0x943/0xad0
 Call Trace:
  <TASK>
  cow_file_range+0x744/0x12a0
  fallback_to_cow+0x5ea/0xa00
  run_delalloc_nocow+0x110c/0x17a0
  btrfs_run_delalloc_range+0xbe4/0x1c20
  writepage_delalloc+0x104d/0x1ba0
  btrfs_writepages+0x1667/0x28b0
  do_writepages+0x338/0x560
  filemap_fdatawrite_range+0x1f2/0x300
  btrfs_fdatawrite_range+0x54/0xf0
  btrfs_direct_write+0x6a0/0xc30
  btrfs_do_write_iter+0x329/0x790
  do_iter_readv_writev+0x624/0x8d0
  vfs_writev+0x34c/0x990
  __se_sys_pwritev2+0x17a/0x2a0
  do_syscall_64+0x174/0x580
  entry_SYSCALL_64_after_hwframe+0x77/0x7f
  </TASK>
 ---[ end trace 0000000000000000 ]---

[CAUSE]
Since commit ff66fe666233 ("btrfs: fix incorrect buffered IO fallback
for append direct writes"), if the direct IO finished short, we will
revert the isize back to the original one, so that append writes can be
respected during the buffered fallback.

Normally we rely on lock_and_cleanup_extent_if_needed() function during
buffered writeback to wait for any existing ordered extents.

But that ordered extent waiting only happens if the start_pos is inside
the isize.
Since we have reverted the isize during failed direct IO, we will not
wait for any ordered extents.

This means we can have a race where the direct IO OE is still in the
tree, finished but not yet removed, then we're inserting the OE for the
buffered write, causing the above crash.

[FIX]
Make the OE wait to be unconditional, to handle the reverted isize
situation.

And since lock_and_cleanup_extent_if_needed() can no longer turn 0, we
can also remove the branches that handles no-extent-locked cases.

Reported-by: syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=ba2afde329fc27e3f22e
Fixes: ff66fe666233 ("btrfs: fix incorrect buffered IO fallback for append direct writes")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/file.c | 77 +++++++++++++++++++------------------------------
 1 file changed, 30 insertions(+), 47 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index df8f6f565070..ef31140e0942 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -879,7 +879,6 @@ static noinline int prepare_one_folio(struct inode *inode, struct folio **folio_
  *
  * Return:
  * 1 - the extent is locked
- * 0 - the extent is not locked, and everything is OK
  * -EAGAIN - need to prepare the folios again
  */
 static noinline int
@@ -889,6 +888,7 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
 				struct extent_state **cached_state)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct btrfs_ordered_extent *ordered;
 	u64 start_pos;
 	u64 last_pos;
 	int ret = 0;
@@ -896,42 +896,38 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
 	start_pos = round_down(pos, fs_info->sectorsize);
 	last_pos = round_up(pos + write_bytes, fs_info->sectorsize) - 1;
 
-	if (start_pos < inode->vfs_inode.i_size) {
-		struct btrfs_ordered_extent *ordered;
-
-		if (nowait) {
-			if (!btrfs_try_lock_extent(&inode->io_tree, start_pos,
-						   last_pos, cached_state)) {
-				folio_unlock(folio);
-				folio_put(folio);
-				return -EAGAIN;
-			}
-		} else {
-			btrfs_lock_extent(&inode->io_tree, start_pos, last_pos,
-					  cached_state);
-		}
-
-		ordered = btrfs_lookup_ordered_range(inode, start_pos,
-						     last_pos - start_pos + 1);
-		if (ordered &&
-		    ordered->file_offset + ordered->num_bytes > start_pos &&
-		    ordered->file_offset <= last_pos) {
-			btrfs_unlock_extent(&inode->io_tree, start_pos, last_pos,
-					    cached_state);
+	if (nowait) {
+		if (!btrfs_try_lock_extent(&inode->io_tree, start_pos,
+					   last_pos, cached_state)) {
 			folio_unlock(folio);
 			folio_put(folio);
-			btrfs_start_ordered_extent(ordered);
-			btrfs_put_ordered_extent(ordered);
 			return -EAGAIN;
 		}
-		if (ordered)
-			btrfs_put_ordered_extent(ordered);
-
-		*lockstart = start_pos;
-		*lockend = last_pos;
-		ret = 1;
+	} else {
+		btrfs_lock_extent(&inode->io_tree, start_pos, last_pos,
+				  cached_state);
 	}
 
+	ordered = btrfs_lookup_ordered_range(inode, start_pos,
+					     last_pos - start_pos + 1);
+	if (ordered &&
+	    ordered->file_offset + ordered->num_bytes > start_pos &&
+	    ordered->file_offset <= last_pos) {
+		btrfs_unlock_extent(&inode->io_tree, start_pos, last_pos,
+				    cached_state);
+		folio_unlock(folio);
+		folio_put(folio);
+		btrfs_start_ordered_extent(ordered);
+		btrfs_put_ordered_extent(ordered);
+		return -EAGAIN;
+	}
+	if (ordered)
+		btrfs_put_ordered_extent(ordered);
+
+	*lockstart = start_pos;
+	*lockend = last_pos;
+	ret = 1;
+
 	/*
 	 * We should be called after prepare_one_folio() which should have locked
 	 * all pages in the range.
@@ -1288,11 +1284,8 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
 
 		/* No copied bytes, unlock, release reserved space and exit. */
 		if (copied == 0) {
-			if (extents_locked)
-				btrfs_unlock_extent(&inode->io_tree, lockstart, lockend,
-						    &cached_state);
-			else
-				btrfs_free_extent_state(cached_state);
+			btrfs_unlock_extent(&inode->io_tree, lockstart, lockend,
+					    &cached_state);
 			btrfs_delalloc_release_extents(inode, reserved_len);
 			release_space(inode, *data_reserved, reserved_start, reserved_len,
 				      only_release_metadata);
@@ -1311,17 +1304,7 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
 
 	ret = btrfs_dirty_folio(inode, folio, start, copied, &cached_state,
 				only_release_metadata);
-	/*
-	 * If we have not locked the extent range, because the range's start
-	 * offset is >= i_size, we might still have a non-NULL cached extent
-	 * state, acquired while marking the extent range as delalloc through
-	 * btrfs_dirty_page(). Therefore free any possible cached extent state
-	 * to avoid a memory leak.
-	 */
-	if (extents_locked)
-		btrfs_unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
-	else
-		btrfs_free_extent_state(cached_state);
+	btrfs_unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 
 	btrfs_delalloc_release_extents(inode, reserved_len);
 	if (ret) {
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH] btrfs: wait for ordered extents before buffered write fallback in direct IO
From: Qu Wenruo @ 2026-06-25  5:17 UTC (permalink / raw)
  To: Yun Zhou, clm, dsterba; +Cc: linux-btrfs, linux-kernel
In-Reply-To: <aa009e0d-a629-44ed-bb68-202ff2dc5063@suse.com>



在 2026/6/25 14:43, Qu Wenruo 写道:
> 
> 
> 在 2026/6/25 11:44, Yun Zhou 写道:
>> When btrfs_direct_write() falls back to buffered IO after a failed DIO
>> attempt, it may race with the asynchronous completion of DIO ordered
>> extents.  This leads to a BUG_ON in insert_ordered_extent() due to
>> overlapping ordered extents in the per-inode rb-tree.
>>
>> The race sequence is:
>>   1. DIO creates an ordered extent via btrfs_dio_iomap_begin()
>>   2. Page fault occurs (nofault=true), no bio is submitted (submitted=0)
>>   3. btrfs_dio_iomap_end() truncates and finishes the OE asynchronously
>>      via btrfs_finish_ordered_extent() which queues work
>>   4. iomap returns 0, retry logic faults in pages and retries DIO
>>   5. Second DIO attempt also fails, code reaches buffered: label
>>   6. btrfs_buffered_write() dirties pages for the same range
> 
> btrfs_buffered_write()
> |- copy_one_range()
>     |- lock_and_cleanup_extent_if_needed()
>        |- btrfs_start_ordered_extent()
> 
> So your explanation doesn't makes sense. As if there is the direct IO oe 
> remaining, we will wait for that OE to complete.
> 
> There is still something missing.
> 
>>   7. btrfs_fdatawrite_range() triggers writeback
>>   8. run_delalloc_nocow() -> fallback_to_cow() -> cow_file_range()
>>      tries to insert a new ordered extent for the same file offset
>>   9. The DIO ordered extent hasn't been removed from the rb-tree yet
>>      (btrfs_finish_ordered_io running async in workqueue) -> BUG_ON
>>
>> Fix this by waiting for any pending ordered extents in the target range
>> before starting the buffered write.
>>
>> Reported-by: syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com
>> Closes: https://syzkaller.appspot.com/bug?extid=ba2afde329fc27e3f22e
>> Fixes: acf9ed3a6c00 ("btrfs: retry faulting in the pages after a zero 
>> sized short direct write")

And the fixes tag is also incorrect.

Without that commit, we will directly fallback to buffered write without 
retry faulting in the pages.

So by your explanation it will trigger the same problem, with or without 
that commit.

>> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
>> ---
>>   fs/btrfs/direct-io.c | 24 ++++++++++++++++++++++++
>>   1 file changed, 24 insertions(+)
>>
>> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
>> index 460326d34143..e8ac9492844c 100644
>> --- a/fs/btrfs/direct-io.c
>> +++ b/fs/btrfs/direct-io.c
>> @@ -844,6 +844,7 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, 
>> struct iov_iter *from)
>>       struct file *file = iocb->ki_filp;
>>       struct inode *inode = file_inode(file);
>>       struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
>> +    struct btrfs_ordered_extent *ordered;
>>       loff_t pos;
>>       ssize_t written = 0;
>>       ssize_t written_buffered;
>> @@ -1025,6 +1026,29 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, 
>> struct iov_iter *from)
>>       }
>>       pos = iocb->ki_pos;
>> +
>> +    /*
>> +     * The DIO path may have created ordered extent(s) that are still 
>> being
>> +     * processed asynchronously in a work queue.  We must wait for 
>> them to
>> +     * be fully completed and removed from the rb-tree before doing a
>> +     * buffered write to the same or overlapping range; otherwise the
>> +     * buffered writeback path (run_delalloc_nocow -> fallback_to_cow ->
>> +     * cow_file_range) may try to insert a new ordered extent that 
>> conflicts
>> +     * with the still-pending DIO one, triggering a BUG_ON in
>> +     * insert_ordered_extent().
>> +     *
>> +     * This happens when DIO creates an ordered extent but has a 
>> short write
>> +     * (submitted < length in btrfs_dio_iomap_end()), which truncates 
>> and
>> +     * finishes the ordered extent asynchronously while we fall back to
>> +     * buffered IO for the same range.
>> +     */
>> +    while ((ordered = btrfs_lookup_ordered_range(BTRFS_I(inode),
>> +                (u64)(pos - written),
>> +                (u64)written + iov_iter_count(from))) != NULL) {
>> +        btrfs_start_ordered_extent(ordered);
>> +        btrfs_put_ordered_extent(ordered);
>> +    }
>> +
>>       written_buffered = btrfs_buffered_write(iocb, from);
>>       if (written_buffered < 0) {
>>           ret = written_buffered;
> 
> 


^ permalink raw reply

* Re: [PATCH] btrfs: wait for ordered extents before buffered write fallback in direct IO
From: Qu Wenruo @ 2026-06-25  5:13 UTC (permalink / raw)
  To: Yun Zhou, clm, dsterba; +Cc: linux-btrfs, linux-kernel
In-Reply-To: <20260625021456.724803-1-yun.zhou@windriver.com>



在 2026/6/25 11:44, Yun Zhou 写道:
> When btrfs_direct_write() falls back to buffered IO after a failed DIO
> attempt, it may race with the asynchronous completion of DIO ordered
> extents.  This leads to a BUG_ON in insert_ordered_extent() due to
> overlapping ordered extents in the per-inode rb-tree.
> 
> The race sequence is:
>   1. DIO creates an ordered extent via btrfs_dio_iomap_begin()
>   2. Page fault occurs (nofault=true), no bio is submitted (submitted=0)
>   3. btrfs_dio_iomap_end() truncates and finishes the OE asynchronously
>      via btrfs_finish_ordered_extent() which queues work
>   4. iomap returns 0, retry logic faults in pages and retries DIO
>   5. Second DIO attempt also fails, code reaches buffered: label
>   6. btrfs_buffered_write() dirties pages for the same range

btrfs_buffered_write()
|- copy_one_range()
    |- lock_and_cleanup_extent_if_needed()
       |- btrfs_start_ordered_extent()

So your explanation doesn't makes sense. As if there is the direct IO oe 
remaining, we will wait for that OE to complete.

There is still something missing.

>   7. btrfs_fdatawrite_range() triggers writeback
>   8. run_delalloc_nocow() -> fallback_to_cow() -> cow_file_range()
>      tries to insert a new ordered extent for the same file offset
>   9. The DIO ordered extent hasn't been removed from the rb-tree yet
>      (btrfs_finish_ordered_io running async in workqueue) -> BUG_ON
> 
> Fix this by waiting for any pending ordered extents in the target range
> before starting the buffered write.
> 
> Reported-by: syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=ba2afde329fc27e3f22e
> Fixes: acf9ed3a6c00 ("btrfs: retry faulting in the pages after a zero sized short direct write")
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
> ---
>   fs/btrfs/direct-io.c | 24 ++++++++++++++++++++++++
>   1 file changed, 24 insertions(+)
> 
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index 460326d34143..e8ac9492844c 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -844,6 +844,7 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>   	struct file *file = iocb->ki_filp;
>   	struct inode *inode = file_inode(file);
>   	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> +	struct btrfs_ordered_extent *ordered;
>   	loff_t pos;
>   	ssize_t written = 0;
>   	ssize_t written_buffered;
> @@ -1025,6 +1026,29 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>   	}
>   
>   	pos = iocb->ki_pos;
> +
> +	/*
> +	 * The DIO path may have created ordered extent(s) that are still being
> +	 * processed asynchronously in a work queue.  We must wait for them to
> +	 * be fully completed and removed from the rb-tree before doing a
> +	 * buffered write to the same or overlapping range; otherwise the
> +	 * buffered writeback path (run_delalloc_nocow -> fallback_to_cow ->
> +	 * cow_file_range) may try to insert a new ordered extent that conflicts
> +	 * with the still-pending DIO one, triggering a BUG_ON in
> +	 * insert_ordered_extent().
> +	 *
> +	 * This happens when DIO creates an ordered extent but has a short write
> +	 * (submitted < length in btrfs_dio_iomap_end()), which truncates and
> +	 * finishes the ordered extent asynchronously while we fall back to
> +	 * buffered IO for the same range.
> +	 */
> +	while ((ordered = btrfs_lookup_ordered_range(BTRFS_I(inode),
> +				(u64)(pos - written),
> +				(u64)written + iov_iter_count(from))) != NULL) {
> +		btrfs_start_ordered_extent(ordered);
> +		btrfs_put_ordered_extent(ordered);
> +	}
> +
>   	written_buffered = btrfs_buffered_write(iocb, from);
>   	if (written_buffered < 0) {
>   		ret = written_buffered;


^ permalink raw reply

* Re: [PATCH] btrfs: replace writeback inhibition xarray with a fixed inline buffer
From: Sun YangKai @ 2026-06-25  2:25 UTC (permalink / raw)
  To: Leo Martins, linux-btrfs, kernel-team, David Sterba,
	Filipe Manana, Boris Burkov
  Cc: kernel test robot
In-Reply-To: <e0d3d3ac7c585b023c2e14cec7e2b995b247b68e.1781728340.git.loemra.dev@gmail.com>



On 2026/6/25 01:57, Leo Martins wrote:
> Commit f9a48549a15a ("btrfs: inhibit extent buffer writeback to prevent
> COW amplification") tracks the extent buffers a transaction handle has
> inhibited in a per-handle xarray. Keying the tracking to the transaction
> handle is correct, but using an xarray for it causes two problems in
> production.
> 
> First, a write_iops regression. Every COW calls
> btrfs_inhibit_eb_writeback() from btrfs_force_cow_block() and
> should_cow_block(), which does an xa_store() keyed by eb->start. The
> kernel test robot reported a 22.6% fio.write_iops regression on a
> single-task 4k randwrite workload (ftruncate ioengine, buffered IO) on
> btrfs. The cost is the per-COW xarray store done on every COW'd block.
> Replacing it with a non-allocating fixed buffer recovers the lost
> throughput, and that buffer does more per-COW bookkeeping yet still
> recovers, so the cost is the xarray operation itself rather than the
> extra tracking work.
> 
> Second, an unbounded cleanup walk. btrfs_uninhibit_all_eb_writeback()
> iterates every eb the handle inhibited with xa_for_each(). A single
> handle that COWs a very large number of blocks (inode eviction, or
> truncate of a file with many extents, where btrfs_truncate_inode_items()
> loops over many search_again descents under one handle) makes that walk
> arbitrarily long. It runs in __btrfs_end_transaction() before
> num_writers is dropped, so it blocks the committing thread; this shows up
> as multi-second stalls and RCU stall reports.
> 
> Replace the xarray with a fixed inline array on btrfs_trans_handle,
> managed with a CLOCK (second-chance) eviction policy. Inhibiting a buffer
> becomes an array append with no allocation and no tree walk, and the
> end-of-handle cleanup is bounded by the array size.
> 
> The set that actually needs protection is the working set the handle
> revisits across search_again descents, the search path frontier, which is
> on the order of the tree height. It is not every block the handle ever
> COWs. should_cow_block() re-inhibiting an already tracked buffer marks it
> referenced, so revisited buffers survive eviction while write-once buffers
> are reclaimed first. A small fixed buffer is therefore enough where a
> non-evicting array would either overflow or have to grow without bound.
> BTRFS_INHIBITED_EBS_SLOTS is 8 and the reference bits pack into a u32.
> 
> The CLOCK eviction is what justifies the extra complexity over a plain
> non-evicting array. On a workload built to stress amplification, removing
> 16 heavily fragmented 64 MiB files in one transaction while background
> writeback keeps writing out in-use metadata, counting re-COW events (a
> buffer already COWed in the running transaction, written back, then COWed
> again) against first-COW events and summing across the eviction (n=5),
> re-COWs outnumber first COWs about 6.1 to 1 with no inhibition, 3.8 to 1
> with a non-evicting array of 32 slots, 1.6 to 1 with this 8 slot CLOCK
> array, and 1.4 to 1 with the reverted unbounded xarray. The non-evicting
> array fills with write-once buffers and stops covering the buffers the
> handle keeps revisiting, so even at four times the slots it leaves most of
> the amplification. CLOCK evicts the cold buffers and keeps the revisited
> ones, which recovers almost all of the unbounded benefit. The eviction
> policy, not the buffer size, is what closes the gap.
> 
> eb->writeback_inhibitors and the WB_SYNC_ALL bypass in
> lock_extent_buffer_for_io() are unchanged, so fsync and commit behavior
> are unaffected. A reference is taken on each tracked buffer so it cannot
> be freed while the array points at it; eviction drops that reference and
> the inhibitor count.
> 
> Fixes: f9a48549a15a ("btrfs: inhibit extent buffer writeback to prevent COW amplification")
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202603112240.f7605968-lkp@intel.com
> Signed-off-by: Leo Martins <loemra.dev@gmail.com>

Reviewed-by: Sun YangKai <sunk67188@gmail.com>

Thanks for your great work. The overall direction looks nice and the 
design is clean. I've got some non‑blocking suggestions, though.


> ---
>   fs/btrfs/extent_io.c   | 78 +++++++++++++++++++++++++-----------------
>   fs/btrfs/transaction.c |  2 --
>   fs/btrfs/transaction.h | 15 ++++++--
>   3 files changed, 59 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 9d7ca80477fd..1845a617fb10 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2988,41 +2988,58 @@ static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
>    * @trans:  transaction handle that will own the inhibitor
>    * @eb:      extent buffer to inhibit writeback on
>    *
> - * Attempt to track this extent buffer in the transaction's inhibited set.  If
> - * memory allocation fails, the buffer is simply not tracked. It may be written
> - * back and need re-COW, which is the original behavior.  This is acceptable
> - * since inhibiting writeback is an optimization.
> + * Attempt to track this extent buffer in the transaction's inhibited set.  When
> + * the set is full the coldest tracked buffer is evicted instead.  An untracked
> + * buffer may be written back and need re-COW, which is the original behavior.
> + * This is acceptable since inhibiting writeback is an optimization.
>    */
> -void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans, struct extent_buffer *eb)
> +void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans,
> +				struct extent_buffer *eb)
>   {
> -	unsigned long index = eb->start >> trans->fs_info->nodesize_bits;
> -	void *old;
> +	u32 slot;
>   
>   	lockdep_assert_held(&eb->lock);
> -	/* Check if already inhibited by this handle. */
> -	old = xa_load(&trans->writeback_inhibited_ebs, index);
> -	if (old == eb)
> -		return;
> -
> -	/* Take reference for the xarray entry. */
> -	refcount_inc(&eb->refs);
>   
> -	old = xa_store(&trans->writeback_inhibited_ebs, index, eb, GFP_NOFS);
> -	if (xa_is_err(old)) {
> -		/* Allocation failed, just skip inhibiting this buffer. */
> -		free_extent_buffer(eb);
> -		return;
> +	/* Already tracked: set its reference bit (second chance) and return. */
> +	for (u32 i = 0; i < trans->nr_inhibited_ebs; i++) {
> +		if (trans->inhibited_ebs[i] == eb) {
> +			trans->inhibited_ebs_referenced |= 1U << i;
> +			return;
> +		}
>   	}
>   
> -	/* Handle replacement of different eb at same index. */
> -	if (old && old != eb) {
> -		struct extent_buffer *old_eb = old;
> +	if (trans->nr_inhibited_ebs < BTRFS_INHIBITED_EBS_SLOTS) {
> +		slot = trans->nr_inhibited_ebs++;
> +	} else {
> +		/*
> +		 * Full: advance the CLOCK hand, clearing reference bits until a
> +		 * slot without one is found and evicted. Clearing a bit per step
> +		 * bounds this to BTRFS_INHIBITED_EBS_SLOTS iterations.
> +		 */
> +		while (trans->inhibited_ebs_referenced &
> +		       (1U << trans->inhibited_ebs_hand)) {
> +			trans->inhibited_ebs_referenced &=
> +				~(1U << trans->inhibited_ebs_hand);
> +			trans->inhibited_ebs_hand =
> +				(trans->inhibited_ebs_hand + 1) %
> +				BTRFS_INHIBITED_EBS_SLOTS;

I think we can using overflowing add with ` & (SLOTS - 1)` as the index 
as long as SLOTS is power of 2, instead of `% SLOTS`. And the higher 
bits can parsed as number of "rounds" if we keep them, and I'm not sure 
if the extra info is useful for any purpose.

> +		}
> +		slot = trans->inhibited_ebs_hand;
> +		trans->inhibited_ebs_hand =
> +			(trans->inhibited_ebs_hand + 1) % BTRFS_INHIBITED_EBS_SLOTS;

the same math here.

>   
> -		atomic_dec(&old_eb->writeback_inhibitors);
> -		free_extent_buffer(old_eb);
> +		atomic_dec(&trans->inhibited_ebs[slot]->writeback_inhibitors);
> +		free_extent_buffer(trans->inhibited_ebs[slot]);
>   	}

And personally I'd love to wrap this part of logic into a helper 
function like

static inline u8 btrfs_inhibit_prepare_slot(struct btrfs_trans_handle 
*trans) {...}

Or something with a better name.

>   
> +	/*
> +	 * Pin the eb while the array holds a raw pointer to it; the counter is
> +	 * what lock_extent_buffer_for_io() checks.
> +	 */
> +	refcount_inc(&eb->refs);
>   	atomic_inc(&eb->writeback_inhibitors);
> +	trans->inhibited_ebs[slot] = eb;
> +	trans->inhibited_ebs_referenced |= 1U << slot;
>   }
>   
>   /*
> @@ -3030,14 +3047,13 @@ void btrfs_inhibit_eb_writeback(struct btrfs_trans_handle *trans, struct extent_
>    */
>   void btrfs_uninhibit_all_eb_writeback(struct btrfs_trans_handle *trans)
>   {
> -	struct extent_buffer *eb;
> -	unsigned long index;
> -
> -	xa_for_each(&trans->writeback_inhibited_ebs, index, eb) {
> -		atomic_dec(&eb->writeback_inhibitors);
> -		free_extent_buffer(eb);
> +	for (u32 i = 0; i < trans->nr_inhibited_ebs; i++) {
> +		atomic_dec(&trans->inhibited_ebs[i]->writeback_inhibitors);
> +		free_extent_buffer(trans->inhibited_ebs[i]);
>   	}
> -	xa_destroy(&trans->writeback_inhibited_ebs);
> +	trans->nr_inhibited_ebs = 0;
> +	trans->inhibited_ebs_referenced = 0;
> +	trans->inhibited_ebs_hand = 0;
>   }
>   
>   static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 4358f4b63057..b4b8d587effb 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -697,8 +697,6 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
>   		goto alloc_fail;
>   	}
>   
> -	xa_init(&h->writeback_inhibited_ebs);
> -
>   	/*
>   	 * If we are JOIN_NOLOCK we're already committing a transaction and
>   	 * waiting on this guy, so we don't need to do the sb_start_intwrite
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 7d70fe486758..8f628d79d4f0 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -12,7 +12,6 @@
>   #include <linux/time64.h>
>   #include <linux/mutex.h>
>   #include <linux/wait.h>
> -#include <linux/xarray.h>
>   #include "btrfs_inode.h"
>   #include "delayed-ref.h"
>   
> @@ -23,6 +22,7 @@ struct btrfs_fs_info;
>   struct btrfs_root_item;
>   struct btrfs_root;
>   struct btrfs_path;
> +struct extent_buffer;
>   
>   /*
>    * Signal that a direct IO write is in progress, to avoid deadlock for sync
> @@ -136,6 +136,12 @@ enum {
>   
>   #define TRANS_EXTWRITERS	(__TRANS_START | __TRANS_ATTACH)
>   
> +/*
> + * Must stay <= 32: the CLOCK reference bits pack into the u32
> + * inhibited_ebs_referenced.
> + */
> +#define BTRFS_INHIBITED_EBS_SLOTS	8
> +
>   struct btrfs_trans_handle {
>   	u64 transid;
>   	u64 bytes_reserved;
> @@ -163,8 +169,11 @@ struct btrfs_trans_handle {
>   	struct btrfs_fs_info *fs_info;
>   	struct list_head new_bgs;
>   	struct btrfs_block_rsv delayed_rsv;
> -	/* Extent buffers with writeback inhibited by this handle. */
> -	struct xarray writeback_inhibited_ebs;
> +	/* Extent buffers this handle has inhibited writeback on. */
> +	struct extent_buffer *inhibited_ebs[BTRFS_INHIBITED_EBS_SLOTS];

This takes 64 bytes, which is necessary, fine.

> +	u32 inhibited_ebs_referenced;	/* CLOCK reference bit per slot */
> +	u32 nr_inhibited_ebs;

Given the 8-slot limit for inhibited_ebs (4 bits (8) for capacity, 8 
bits for bitmap), using u8 for these two fields is a clear and 
straightforward choice IMO. Even if we later extend the limit to 16, 
only the bitmap field (inhibited_ebs_referenced) would need to be 
widened to u16.

Thanks,
Sun YangKai

> +	u8 inhibited_ebs_hand;		/* CLOCK hand */
>   };
>   
>   /*


^ permalink raw reply

* [PATCH] btrfs: wait for ordered extents before buffered write fallback in direct IO
From: Yun Zhou @ 2026-06-25  2:14 UTC (permalink / raw)
  To: clm, dsterba; +Cc: linux-btrfs, linux-kernel, yun.zhou

When btrfs_direct_write() falls back to buffered IO after a failed DIO
attempt, it may race with the asynchronous completion of DIO ordered
extents.  This leads to a BUG_ON in insert_ordered_extent() due to
overlapping ordered extents in the per-inode rb-tree.

The race sequence is:
 1. DIO creates an ordered extent via btrfs_dio_iomap_begin()
 2. Page fault occurs (nofault=true), no bio is submitted (submitted=0)
 3. btrfs_dio_iomap_end() truncates and finishes the OE asynchronously
    via btrfs_finish_ordered_extent() which queues work
 4. iomap returns 0, retry logic faults in pages and retries DIO
 5. Second DIO attempt also fails, code reaches buffered: label
 6. btrfs_buffered_write() dirties pages for the same range
 7. btrfs_fdatawrite_range() triggers writeback
 8. run_delalloc_nocow() -> fallback_to_cow() -> cow_file_range()
    tries to insert a new ordered extent for the same file offset
 9. The DIO ordered extent hasn't been removed from the rb-tree yet
    (btrfs_finish_ordered_io running async in workqueue) -> BUG_ON

Fix this by waiting for any pending ordered extents in the target range
before starting the buffered write.

Reported-by: syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ba2afde329fc27e3f22e
Fixes: acf9ed3a6c00 ("btrfs: retry faulting in the pages after a zero sized short direct write")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/btrfs/direct-io.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 460326d34143..e8ac9492844c 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -844,6 +844,7 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
+	struct btrfs_ordered_extent *ordered;
 	loff_t pos;
 	ssize_t written = 0;
 	ssize_t written_buffered;
@@ -1025,6 +1026,29 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 	pos = iocb->ki_pos;
+
+	/*
+	 * The DIO path may have created ordered extent(s) that are still being
+	 * processed asynchronously in a work queue.  We must wait for them to
+	 * be fully completed and removed from the rb-tree before doing a
+	 * buffered write to the same or overlapping range; otherwise the
+	 * buffered writeback path (run_delalloc_nocow -> fallback_to_cow ->
+	 * cow_file_range) may try to insert a new ordered extent that conflicts
+	 * with the still-pending DIO one, triggering a BUG_ON in
+	 * insert_ordered_extent().
+	 *
+	 * This happens when DIO creates an ordered extent but has a short write
+	 * (submitted < length in btrfs_dio_iomap_end()), which truncates and
+	 * finishes the ordered extent asynchronously while we fall back to
+	 * buffered IO for the same range.
+	 */
+	while ((ordered = btrfs_lookup_ordered_range(BTRFS_I(inode),
+				(u64)(pos - written),
+				(u64)written + iov_iter_count(from))) != NULL) {
+		btrfs_start_ordered_extent(ordered);
+		btrfs_put_ordered_extent(ordered);
+	}
+
 	written_buffered = btrfs_buffered_write(iocb, from);
 	if (written_buffered < 0) {
 		ret = written_buffered;
-- 
2.43.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox