From: Qu Wenruo <wqu@suse.com>
To: Boris Burkov <boris@bur.io>, fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent
Date: Thu, 8 May 2025 08:41:56 +0930 [thread overview]
Message-ID: <ccacb32f-e345-49c7-9c6b-8bdc72c69a7f@suse.com> (raw)
In-Reply-To: <20250507223347.GB332956@zen.localdomain>
在 2025/5/8 08:03, Boris Burkov 写道:
> On Wed, May 07, 2025 at 06:23:13PM +0100, fdmanana@kernel.org wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> If we fail to allocate an ordered extent for a COW write we end up leaking
>> a qgroup data reservation since we called btrfs_qgroup_release_data() but
>> we didn't call btrfs_qgroup_free_refroot() (which would happen when
>> running the respective data delayed ref created by ordered extent
>> completion or when finishing the ordered extent in case an error happened).
>>
>> So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
>> ordered extent for a COW write.
>
> I haven't tried it myself yet, but I believe that this patch will double
> free reservation from the qgroup when this case occurs.
>
> Can you share the context where you saw this bug? Have you run fstests
> with qgroups or squotas enabled? I think this should show pretty quickly
> in generic/475 with qgroups on.
>
> Consider, for example, the following execution of the dio case:
>
> btrfs_dio_iomap_begin
> btrfs_check_data_free_space // reserves the data into `reserved`, sets dio_data->data_space_reserved
> btrfs_get_blocks_direct_write
> btrfs_create_dio_extent
> btrfs_alloc_ordered_extent
> alloc_ordered_extent // fails and frees refroot, reserved is "wrong" now.
> // error propagates up
> // error propagates up via PTR_ERR
>
> which brings us to the code:
> if (ret < 0)
> goto unlock_err;
> ...
> unlock_err:
> ...
> if (dio_data->data_space_reserved) {
> btrfs_free_reserved_data_space()
> }
>
> so the execution continues...
>
> btrfs_free_reserved_data_space
> btrfs_qgroup_free_data
> __btrfs_qgroup_release_data
> qgroup_free_reserved_data
> btrfs_qgroup_free_refroot
>
> This will result in a underflow of the reservation once everything
> outstanding gets released.
That should still be safe.
Firstly at alloc_ordered_extent() we released the qgroup space already,
thus there will be no EXTENT_QGROUP_RESERVED range in extent-io tree
anymore.
Then at the final cleanup, qgroup_free_reserved_data_space() will
release/free nothing, because the extent-io tree no longer has the
corresponding range with EXTENT_QGROUP_RESERVED.
This is the core design of qgroup data reserved space, which allows us
to call btrfs_release/free_data() twice without double accounting.
>
> Furthermore, raw calls to free_refroot in cases where we have a reserved
> changeset make me worried, because they will run afoul of races with
> multiple threads touching the various bits. I don't see the bugs here,
> but the reservation lifetime is really tricky so I wouldn't be surprised
> if something like that was wrong too.
This free_refroot() is the common practice here. The extent-io tree
based accounting can only cover the reserved space before ordered extent
is allocated.
Then the reserved space is "transferred" to ordered extent, and
eventually go to the new data extent, and finally freed at
btrfs_qgroup_account_extents(), which also goes the free_refroot() way.
>
> As of the last time I looked at this, I think cow_file_range handles
> this correctly as well. Looking really quickly now, it looks like maybe
> submit_one_async_extent() might not do a qgroup free, but I'm not sure
> where the corresponding reservation is coming from.
>
> I think if you have indeed found a different codepath that makes a data
> reservation but doesn't release the qgroup part when allocating the
> ordered extent fails, then the fastest path to a fix is to do that at
> the same level as where it calls btrfs_check_data_free_space or (however
> it gets the reservation), as is currently done by the main
> ordered_extent allocation paths. With async_extent, we might need to
> plumb through the reserved extent changeset through the async chunk to
> do it perfectly...
I agree with you that, the extent io tree based double freeing
prevention should only be the last safe net, not something we should
abuse when possible.
But I can't find a better solution, mostly due to the fact that after
the btrfs_qgroup_release_data() call, the qgroup reserved space is
already released, and we have no way to undo that...
Maybe we can delayed the qgroup release/free calls until the ordered
extent @entry is allocated?
Thanks,
Qu
>
> Thanks,
> Boris
>
>>
>> Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>> fs/btrfs/ordered-data.c | 12 +++++++++---
>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
>> index ae49f87b27e8..e44d3dd17caf 100644
>> --- a/fs/btrfs/ordered-data.c
>> +++ b/fs/btrfs/ordered-data.c
>> @@ -153,9 +153,10 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
>> struct btrfs_ordered_extent *entry;
>> int ret;
>> u64 qgroup_rsv = 0;
>> + const bool is_nocow = (flags &
>> + ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
>>
>> - if (flags &
>> - ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC))) {
>> + if (is_nocow) {
>> /* For nocow write, we can release the qgroup rsv right now */
>> ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
>> if (ret < 0)
>> @@ -170,8 +171,13 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
>> return ERR_PTR(ret);
>> }
>> entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
>> - if (!entry)
>> + if (!entry) {
>> + if (!is_nocow)
>> + btrfs_qgroup_free_refroot(inode->root->fs_info,
>> + btrfs_root_id(inode->root),
>> + qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
>> return ERR_PTR(-ENOMEM);
>> + }
>>
>> entry->file_offset = file_offset;
>> entry->num_bytes = num_bytes;
>> --
>> 2.47.2
>>
>
next prev parent reply other threads:[~2025-05-07 23:12 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-07 17:23 [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation fdmanana
2025-05-07 17:23 ` [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent fdmanana
2025-05-07 22:33 ` Boris Burkov
2025-05-07 23:11 ` Qu Wenruo [this message]
2025-05-07 23:39 ` Boris Burkov
2025-05-08 0:08 ` Boris Burkov
2025-05-08 13:43 ` Filipe Manana
2025-05-08 16:07 ` Boris Burkov
2025-05-08 21:57 ` Qu Wenruo
2025-05-08 13:40 ` Filipe Manana
2025-05-08 16:19 ` Boris Burkov
2025-05-07 17:23 ` [PATCH 2/5] btrfs: check we grabbed inode reference when allocating an " fdmanana
2025-05-07 17:23 ` [PATCH 3/5] btrfs: fold error checks when allocating ordered extent and update comments fdmanana
2025-05-07 17:23 ` [PATCH 4/5] btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes() fdmanana
2025-05-07 17:23 ` [PATCH 5/5] btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent() fdmanana
2025-05-08 16:20 ` [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation Boris Burkov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ccacb32f-e345-49c7-9c6b-8bdc72c69a7f@suse.com \
--to=wqu@suse.com \
--cc=boris@bur.io \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox