From: Boris Burkov <boris@bur.io>
To: Qu Wenruo <wqu@suse.com>
Cc: fdmanana@kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent
Date: Wed, 7 May 2025 16:39:59 -0700 [thread overview]
Message-ID: <20250507233959.GA3580243@zen.localdomain> (raw)
In-Reply-To: <ccacb32f-e345-49c7-9c6b-8bdc72c69a7f@suse.com>
On Thu, May 08, 2025 at 08:41:56AM +0930, Qu Wenruo wrote:
>
>
> 在 2025/5/8 08:03, Boris Burkov 写道:
> > On Wed, May 07, 2025 at 06:23:13PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > If we fail to allocate an ordered extent for a COW write we end up leaking
> > > a qgroup data reservation since we called btrfs_qgroup_release_data() but
> > > we didn't call btrfs_qgroup_free_refroot() (which would happen when
> > > running the respective data delayed ref created by ordered extent
> > > completion or when finishing the ordered extent in case an error happened).
> > >
> > > So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
> > > ordered extent for a COW write.
> >
> > I haven't tried it myself yet, but I believe that this patch will double
> > free reservation from the qgroup when this case occurs.
> >
> > Can you share the context where you saw this bug? Have you run fstests
> > with qgroups or squotas enabled? I think this should show pretty quickly
> > in generic/475 with qgroups on.
> >
> > Consider, for example, the following execution of the dio case:
> >
> > btrfs_dio_iomap_begin
> > btrfs_check_data_free_space // reserves the data into `reserved`, sets dio_data->data_space_reserved
> > btrfs_get_blocks_direct_write
> > btrfs_create_dio_extent
> > btrfs_alloc_ordered_extent
> > alloc_ordered_extent // fails and frees refroot, reserved is "wrong" now.
> > // error propagates up
> > // error propagates up via PTR_ERR
> >
> > which brings us to the code:
> > if (ret < 0)
> > goto unlock_err;
> > ...
> > unlock_err:
> > ...
> > if (dio_data->data_space_reserved) {
> > btrfs_free_reserved_data_space()
> > }
> >
> > so the execution continues...
> >
> > btrfs_free_reserved_data_space
> > btrfs_qgroup_free_data
> > __btrfs_qgroup_release_data
> > qgroup_free_reserved_data
> > btrfs_qgroup_free_refroot
> >
> > This will result in a underflow of the reservation once everything
> > outstanding gets released.
>
> That should still be safe.
>
> Firstly at alloc_ordered_extent() we released the qgroup space already, thus
> there will be no EXTENT_QGROUP_RESERVED range in extent-io tree anymore.
Ah yes, good point. And that is before any chance of an error, so now I
I agree that this patch is correct, and that my analysis was wrong.
>
> Then at the final cleanup, qgroup_free_reserved_data_space() will
> release/free nothing, because the extent-io tree no longer has the
> corresponding range with EXTENT_QGROUP_RESERVED.
>
> This is the core design of qgroup data reserved space, which allows us to
> call btrfs_release/free_data() twice without double accounting.
>
> >
> > Furthermore, raw calls to free_refroot in cases where we have a reserved
> > changeset make me worried, because they will run afoul of races with
> > multiple threads touching the various bits. I don't see the bugs here,
> > but the reservation lifetime is really tricky so I wouldn't be surprised
> > if something like that was wrong too.
>
> This free_refroot() is the common practice here. The extent-io tree based
> accounting can only cover the reserved space before ordered extent is
> allocated.
>
> Then the reserved space is "transferred" to ordered extent, and eventually
> go to the new data extent, and finally freed at
> btrfs_qgroup_account_extents(), which also goes the free_refroot() way.
>
That makes sense. I was also thinking of it in terms of "transferring"
when working on this. And yes, there are other post-OE-alloc error paths
(some that I wrote, lol) but I think they are a smell. However, a key
difference is that in those cases, the reservation either already
belongs to the delayed ref logic, or the OE still exists. So this is
still a weird case where we do not have the protection of an OE but
call a raw free_refroot.
I would guess that extent locking during OE creation probably saves us,
though, from glancing at the code. generic/475 with qgroups on will be
the ultimate arbiter here.
> >
> > As of the last time I looked at this, I think cow_file_range handles
> > this correctly as well. Looking really quickly now, it looks like maybe
> > submit_one_async_extent() might not do a qgroup free, but I'm not sure
> > where the corresponding reservation is coming from.
> >
> > I think if you have indeed found a different codepath that makes a data
> > reservation but doesn't release the qgroup part when allocating the
> > ordered extent fails, then the fastest path to a fix is to do that at
> > the same level as where it calls btrfs_check_data_free_space or (however
> > it gets the reservation), as is currently done by the main
> > ordered_extent allocation paths. With async_extent, we might need to
> > plumb through the reserved extent changeset through the async chunk to
> > do it perfectly...
>
> I agree with you that, the extent io tree based double freeing prevention
> should only be the last safe net, not something we should abuse when
> possible.
Yes, this does still feel icky to me.
>
> But I can't find a better solution, mostly due to the fact that after the
> btrfs_qgroup_release_data() call, the qgroup reserved space is already
> released, and we have no way to undo that...
I so wish we could have a nice RAII qgroup reservation object flow
through the whole system :)
>
> Maybe we can delayed the qgroup release/free calls until the ordered extent
> @entry is allocated?
>
I do think that is likely the cleaner fix. Though it would have to also be
after the igrab() check later in the series.
> Thanks,
> Qu
>
>
> >
> > Thanks,
> > Boris
> >
> > >
> > > Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > > fs/btrfs/ordered-data.c | 12 +++++++++---
> > > 1 file changed, 9 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > > index ae49f87b27e8..e44d3dd17caf 100644
> > > --- a/fs/btrfs/ordered-data.c
> > > +++ b/fs/btrfs/ordered-data.c
> > > @@ -153,9 +153,10 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> > > struct btrfs_ordered_extent *entry;
> > > int ret;
> > > u64 qgroup_rsv = 0;
> > > + const bool is_nocow = (flags &
> > > + ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
> > > - if (flags &
> > > - ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC))) {
> > > + if (is_nocow) {
> > > /* For nocow write, we can release the qgroup rsv right now */
> > > ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
> > > if (ret < 0)
> > > @@ -170,8 +171,13 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> > > return ERR_PTR(ret);
> > > }
> > > entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
> > > - if (!entry)
> > > + if (!entry) {
> > > + if (!is_nocow)
> > > + btrfs_qgroup_free_refroot(inode->root->fs_info,
> > > + btrfs_root_id(inode->root),
> > > + qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
> > > return ERR_PTR(-ENOMEM);
> > > + }
> > > entry->file_offset = file_offset;
> > > entry->num_bytes = num_bytes;
> > > --
> > > 2.47.2
> > >
> >
>
next prev parent reply other threads:[~2025-05-07 23:39 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-07 17:23 [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation fdmanana
2025-05-07 17:23 ` [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent fdmanana
2025-05-07 22:33 ` Boris Burkov
2025-05-07 23:11 ` Qu Wenruo
2025-05-07 23:39 ` Boris Burkov [this message]
2025-05-08 0:08 ` Boris Burkov
2025-05-08 13:43 ` Filipe Manana
2025-05-08 16:07 ` Boris Burkov
2025-05-08 21:57 ` Qu Wenruo
2025-05-08 13:40 ` Filipe Manana
2025-05-08 16:19 ` Boris Burkov
2025-05-07 17:23 ` [PATCH 2/5] btrfs: check we grabbed inode reference when allocating an " fdmanana
2025-05-07 17:23 ` [PATCH 3/5] btrfs: fold error checks when allocating ordered extent and update comments fdmanana
2025-05-07 17:23 ` [PATCH 4/5] btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes() fdmanana
2025-05-07 17:23 ` [PATCH 5/5] btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent() fdmanana
2025-05-08 16:20 ` [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation Boris Burkov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250507233959.GA3580243@zen.localdomain \
--to=boris@bur.io \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox