Re: [PATCH] btrfs: always wait for ordered extents to avoid OE races

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

From: Qu Wenruo <wqu@suse.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org,
	syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com
Subject: Re: [PATCH] btrfs: always wait for ordered extents to avoid OE races
Date: Fri, 26 Jun 2026 08:08:58 +0930	[thread overview]
Message-ID: <9efd034f-fa6b-485c-aca7-a38f526d4083@suse.com> (raw)
In-Reply-To: <CAL3q7H7+WAQMV76+-AuU=dR_qoe6Gu1_CNmwUFD=zW81d2Hz0g@mail.gmail.com>



在 2026/6/25 20:59, Filipe Manana 写道:
[...]
> 
>> @@ -896,42 +896,38 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
>>          start_pos = round_down(pos, fs_info->sectorsize);
>>          last_pos = round_up(pos + write_bytes, fs_info->sectorsize) - 1;
>>
>> -       if (start_pos < inode->vfs_inode.i_size) {
>> -               struct btrfs_ordered_extent *ordered;
>> -
>> -               if (nowait) {
>> -                       if (!btrfs_try_lock_extent(&inode->io_tree, start_pos,
>> -                                                  last_pos, cached_state)) {
>> -                               folio_unlock(folio);
>> -                               folio_put(folio);
>> -                               return -EAGAIN;
>> -                       }
>> -               } else {
>> -                       btrfs_lock_extent(&inode->io_tree, start_pos, last_pos,
>> -                                         cached_state);
>> -               }
>> -
>> -               ordered = btrfs_lookup_ordered_range(inode, start_pos,
>> -                                                    last_pos - start_pos + 1);
>> -               if (ordered &&
>> -                   ordered->file_offset + ordered->num_bytes > start_pos &&
>> -                   ordered->file_offset <= last_pos) {
>> -                       btrfs_unlock_extent(&inode->io_tree, start_pos, last_pos,
>> -                                           cached_state);
>> +       if (nowait) {
>> +               if (!btrfs_try_lock_extent(&inode->io_tree, start_pos,
>> +                                          last_pos, cached_state)) {
>>                          folio_unlock(folio);
>>                          folio_put(folio);
>> -                       btrfs_start_ordered_extent(ordered);
>> -                       btrfs_put_ordered_extent(ordered);
>>                          return -EAGAIN;
>>                  }
>> -               if (ordered)
>> -                       btrfs_put_ordered_extent(ordered);
>> -
>> -               *lockstart = start_pos;
>> -               *lockend = last_pos;
>> -               ret = 1;
>> +       } else {
>> +               btrfs_lock_extent(&inode->io_tree, start_pos, last_pos,
>> +                                 cached_state);
>>          }
>>
>> +       ordered = btrfs_lookup_ordered_range(inode, start_pos,
>> +                                            last_pos - start_pos + 1);
>> +       if (ordered &&
>> +           ordered->file_offset + ordered->num_bytes > start_pos &&
>> +           ordered->file_offset <= last_pos) {
>> +               btrfs_unlock_extent(&inode->io_tree, start_pos, last_pos,
>> +                                   cached_state);
>> +               folio_unlock(folio);
>> +               folio_put(folio);
>> +               btrfs_start_ordered_extent(ordered);
>> +               btrfs_put_ordered_extent(ordered);
>> +               return -EAGAIN;
>> +       }
>> +       if (ordered)
>> +               btrfs_put_ordered_extent(ordered);
>> +
>> +       *lockstart = start_pos;
>> +       *lockend = last_pos;
>> +       ret = 1;
> 
> And this is pointless, right below we can use "return 1;" and get rid
> of the 'ret' variable.
> 
> Since returning 1 no longer makes sense, just make it return 0 and remove 'ret'.
> 
> Also in the caller, since lock_and_cleanup_extent_if_need() now always
> returns with the extent range locked, we can get rid of it's local
> variable named 'extents_locked' and use 'ret' as it does not make
> sense anymore.
> 
> Finally, with this change we now always look for ordered extents and
> lock the range. Previously, we only did this if writes were within
> i_size - which makes sense because, except for his new direct IO case
> falling back to buffered IO, we could never find ordered extents
> beyond i_size.
> 
> Have you checked whether performance is affected?

Here is the microbenchmark for the runtime of btrfs_buffered_write() and 
lock_and_cleanup_extent_if_need().

The workload is a very basic xfs_io -f -c "pwrite 0 1m", which is always 
appending the isize for every 4K block.

Before patch:
- lock_and_cleanup_extent_if_need():

   runtime avg = 58.16 ns
   runtime stdev = 4.2

- btrfs_buffered_write():

   runtime avg = 2115.6 ns
   runtime stdev = 1822.1


Patched:
- lock_and_cleanup_extent_if_need():

   runtime avg = 183.0 ns
   runtime stdev = 161.6

- btrfs_buffered_write():

   runtime avg = 2973.3 ns
   runtime stdev = 4225.1

So yes, the runtime of lock_and_cleanup_extent_if_need() has greatly 
increased, by more than 3 times, but still less than 1 micro second.

The btrfs_buffered_write() itself is also much slower due to the extra 
extent unlock, the runtime is increased by 40%, but still less than 3 us.

I'm not sure if the cost is acceptable or not. The increased runtime is 
pretty huge for the affected worst case scenario, and that's undeniable.

But the overall runtime of btrfs_buffered_write() itself is still at 
single digit microsecond level.

If acceptable, I'll add the microbenchmark into the commit message in 
the next update.

Thanks,
Qu

> 
> Thanks.
> 
>> +
>>          /*
>>           * We should be called after prepare_one_folio() which should have locked
>>           * all pages in the range.
>> @@ -1288,11 +1284,8 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
>>
>>                  /* No copied bytes, unlock, release reserved space and exit. */
>>                  if (copied == 0) {
>> -                       if (extents_locked)
>> -                               btrfs_unlock_extent(&inode->io_tree, lockstart, lockend,
>> -                                                   &cached_state);
>> -                       else
>> -                               btrfs_free_extent_state(cached_state);
>> +                       btrfs_unlock_extent(&inode->io_tree, lockstart, lockend,
>> +                                           &cached_state);
>>                          btrfs_delalloc_release_extents(inode, reserved_len);
>>                          release_space(inode, *data_reserved, reserved_start, reserved_len,
>>                                        only_release_metadata);
>> @@ -1311,17 +1304,7 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
>>
>>          ret = btrfs_dirty_folio(inode, folio, start, copied, &cached_state,
>>                                  only_release_metadata);
>> -       /*
>> -        * If we have not locked the extent range, because the range's start
>> -        * offset is >= i_size, we might still have a non-NULL cached extent
>> -        * state, acquired while marking the extent range as delalloc through
>> -        * btrfs_dirty_page(). Therefore free any possible cached extent state
>> -        * to avoid a memory leak.
>> -        */
>> -       if (extents_locked)
>> -               btrfs_unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
>> -       else
>> -               btrfs_free_extent_state(cached_state);
>> +       btrfs_unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
>>
>>          btrfs_delalloc_release_extents(inode, reserved_len);
>>          if (ret) {
>> --
>> 2.54.0
>>
>>

next prev parent reply	other threads:[~2026-06-25 22:39 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-25  5:41 [PATCH] btrfs: always wait for ordered extents to avoid OE races Qu Wenruo
2026-06-25 11:29 ` Filipe Manana
2026-06-25 22:38   ` Qu Wenruo [this message]
2026-06-26  9:15     ` Filipe Manana
2026-06-26 10:13       ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9efd034f-fa6b-485c-aca7-a38f526d4083@suse.com \
    --to=wqu@suse.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox