From: Qu Wenruo <wqu@suse.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org,
syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com
Subject: Re: [PATCH] btrfs: always wait for ordered extents to avoid OE races
Date: Fri, 26 Jun 2026 08:08:58 +0930 [thread overview]
Message-ID: <9efd034f-fa6b-485c-aca7-a38f526d4083@suse.com> (raw)
In-Reply-To: <CAL3q7H7+WAQMV76+-AuU=dR_qoe6Gu1_CNmwUFD=zW81d2Hz0g@mail.gmail.com>
在 2026/6/25 20:59, Filipe Manana 写道:
[...]
>
>> @@ -896,42 +896,38 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
>> start_pos = round_down(pos, fs_info->sectorsize);
>> last_pos = round_up(pos + write_bytes, fs_info->sectorsize) - 1;
>>
>> - if (start_pos < inode->vfs_inode.i_size) {
>> - struct btrfs_ordered_extent *ordered;
>> -
>> - if (nowait) {
>> - if (!btrfs_try_lock_extent(&inode->io_tree, start_pos,
>> - last_pos, cached_state)) {
>> - folio_unlock(folio);
>> - folio_put(folio);
>> - return -EAGAIN;
>> - }
>> - } else {
>> - btrfs_lock_extent(&inode->io_tree, start_pos, last_pos,
>> - cached_state);
>> - }
>> -
>> - ordered = btrfs_lookup_ordered_range(inode, start_pos,
>> - last_pos - start_pos + 1);
>> - if (ordered &&
>> - ordered->file_offset + ordered->num_bytes > start_pos &&
>> - ordered->file_offset <= last_pos) {
>> - btrfs_unlock_extent(&inode->io_tree, start_pos, last_pos,
>> - cached_state);
>> + if (nowait) {
>> + if (!btrfs_try_lock_extent(&inode->io_tree, start_pos,
>> + last_pos, cached_state)) {
>> folio_unlock(folio);
>> folio_put(folio);
>> - btrfs_start_ordered_extent(ordered);
>> - btrfs_put_ordered_extent(ordered);
>> return -EAGAIN;
>> }
>> - if (ordered)
>> - btrfs_put_ordered_extent(ordered);
>> -
>> - *lockstart = start_pos;
>> - *lockend = last_pos;
>> - ret = 1;
>> + } else {
>> + btrfs_lock_extent(&inode->io_tree, start_pos, last_pos,
>> + cached_state);
>> }
>>
>> + ordered = btrfs_lookup_ordered_range(inode, start_pos,
>> + last_pos - start_pos + 1);
>> + if (ordered &&
>> + ordered->file_offset + ordered->num_bytes > start_pos &&
>> + ordered->file_offset <= last_pos) {
>> + btrfs_unlock_extent(&inode->io_tree, start_pos, last_pos,
>> + cached_state);
>> + folio_unlock(folio);
>> + folio_put(folio);
>> + btrfs_start_ordered_extent(ordered);
>> + btrfs_put_ordered_extent(ordered);
>> + return -EAGAIN;
>> + }
>> + if (ordered)
>> + btrfs_put_ordered_extent(ordered);
>> +
>> + *lockstart = start_pos;
>> + *lockend = last_pos;
>> + ret = 1;
>
> And this is pointless, right below we can use "return 1;" and get rid
> of the 'ret' variable.
>
> Since returning 1 no longer makes sense, just make it return 0 and remove 'ret'.
>
> Also in the caller, since lock_and_cleanup_extent_if_need() now always
> returns with the extent range locked, we can get rid of it's local
> variable named 'extents_locked' and use 'ret' as it does not make
> sense anymore.
>
> Finally, with this change we now always look for ordered extents and
> lock the range. Previously, we only did this if writes were within
> i_size - which makes sense because, except for his new direct IO case
> falling back to buffered IO, we could never find ordered extents
> beyond i_size.
>
> Have you checked whether performance is affected?
Here is the microbenchmark for the runtime of btrfs_buffered_write() and
lock_and_cleanup_extent_if_need().
The workload is a very basic xfs_io -f -c "pwrite 0 1m", which is always
appending the isize for every 4K block.
Before patch:
- lock_and_cleanup_extent_if_need():
runtime avg = 58.16 ns
runtime stdev = 4.2
- btrfs_buffered_write():
runtime avg = 2115.6 ns
runtime stdev = 1822.1
Patched:
- lock_and_cleanup_extent_if_need():
runtime avg = 183.0 ns
runtime stdev = 161.6
- btrfs_buffered_write():
runtime avg = 2973.3 ns
runtime stdev = 4225.1
So yes, the runtime of lock_and_cleanup_extent_if_need() has greatly
increased, by more than 3 times, but still less than 1 micro second.
The btrfs_buffered_write() itself is also much slower due to the extra
extent unlock, the runtime is increased by 40%, but still less than 3 us.
I'm not sure if the cost is acceptable or not. The increased runtime is
pretty huge for the affected worst case scenario, and that's undeniable.
But the overall runtime of btrfs_buffered_write() itself is still at
single digit microsecond level.
If acceptable, I'll add the microbenchmark into the commit message in
the next update.
Thanks,
Qu
>
> Thanks.
>
>> +
>> /*
>> * We should be called after prepare_one_folio() which should have locked
>> * all pages in the range.
>> @@ -1288,11 +1284,8 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
>>
>> /* No copied bytes, unlock, release reserved space and exit. */
>> if (copied == 0) {
>> - if (extents_locked)
>> - btrfs_unlock_extent(&inode->io_tree, lockstart, lockend,
>> - &cached_state);
>> - else
>> - btrfs_free_extent_state(cached_state);
>> + btrfs_unlock_extent(&inode->io_tree, lockstart, lockend,
>> + &cached_state);
>> btrfs_delalloc_release_extents(inode, reserved_len);
>> release_space(inode, *data_reserved, reserved_start, reserved_len,
>> only_release_metadata);
>> @@ -1311,17 +1304,7 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
>>
>> ret = btrfs_dirty_folio(inode, folio, start, copied, &cached_state,
>> only_release_metadata);
>> - /*
>> - * If we have not locked the extent range, because the range's start
>> - * offset is >= i_size, we might still have a non-NULL cached extent
>> - * state, acquired while marking the extent range as delalloc through
>> - * btrfs_dirty_page(). Therefore free any possible cached extent state
>> - * to avoid a memory leak.
>> - */
>> - if (extents_locked)
>> - btrfs_unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
>> - else
>> - btrfs_free_extent_state(cached_state);
>> + btrfs_unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
>>
>> btrfs_delalloc_release_extents(inode, reserved_len);
>> if (ret) {
>> --
>> 2.54.0
>>
>>
next prev parent reply other threads:[~2026-06-25 22:39 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-25 5:41 [PATCH] btrfs: always wait for ordered extents to avoid OE races Qu Wenruo
2026-06-25 11:29 ` Filipe Manana
2026-06-25 22:38 ` Qu Wenruo [this message]
2026-06-26 9:15 ` Filipe Manana
2026-06-26 10:13 ` Qu Wenruo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9efd034f-fa6b-485c-aca7-a38f526d4083@suse.com \
--to=wqu@suse.com \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=syzbot+ba2afde329fc27e3f22e@syzkaller.appspotmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox