public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@kernel.org>
To: Boris Burkov <boris@bur.io>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v2 1/1] btrfs: set BTRFS_ROOT_ORPHAN_CLEANUP during subvol create
Date: Wed, 25 Feb 2026 17:23:02 +0000	[thread overview]
Message-ID: <CAL3q7H6bVZquyvod=_YjNw1vRBSCQscWSrb5mVEZ1YhLBS8e9Q@mail.gmail.com> (raw)
In-Reply-To: <20260225170921.GA682210@zen.localdomain>

On Wed, Feb 25, 2026 at 5:08 PM Boris Burkov <boris@bur.io> wrote:
>
> On Wed, Feb 25, 2026 at 12:21:43PM +0000, Filipe Manana wrote:
> > On Tue, Feb 24, 2026 at 10:25 PM Boris Burkov <boris@bur.io> wrote:
> > >
> > > We have recently observed a number of subvolumes with broken dentries.
> > > ls-ing the parent dir looks like:
> > >
> > > drwxrwxrwt 1 root root 16 Jan 23 16:49 .
> > > drwxr-xr-x 1 root root 24 Jan 23 16:48 ..
> > > d????????? ? ?    ?     ?            ? broken_subvol
> > >
> > > and similarly stat-ing the file fails.
> > >
> > > In this state, deleting the subvol fails with ENOENT, but attempting to
> > > create a new file or subvol over it errors out with EEXIST and even
> > > aborts the fs. Which leaves us a bit stuck.
> > >
> > > dmesg contains a single notable error message reading:
> > > "could not do orphan cleanup -2"
> > >
> > > 2 is ENOENT and the error comes from the failure handling path of
> > > btrfs_orphan_cleanup(), with the stack leading back up to
> > > btrfs_lookup().
> > >
> > > btrfs_lookup
> > > btrfs_lookup_dentry
> > > btrfs_orphan_cleanup // prints that message and returns -ENOENT
> > >
> > > After some detailed inspection of the internal state, it became clear
> > > that:
> > > - there are no orphan items for the subvol
> > > - the subvol is otherwise healthy looking, it is not half-deleted or
> > >   anything, there is no drop progress, etc.
> > > - the subvol was created a while ago and does the meaningful first
> > >   btrfs_orphan_cleanup() call that sets BTRFS_ROOT_ORPHAN_CLEANUP much
> > >   later.
> > > - after btrfs_orphan_cleanup() fails, btrfs_lookup_dentry() returns -ENOENT,
> > >   which results in a negative dentry for the subvolume via
> > >   d_splice_alias(NULL, dentry), leading to the observed behavior. The
> > >   bug can be mitigated by dropping the dentry cache, at which point we
> > >   can successfully delete the subvolume if we want.
> > >
> > > i.e.,
> > > btrfs_lookup()
> > >   btrfs_lookup_dentry()
> > >     if (!sb_rdonly(inode->vfs_inode)->vfs_inode)
> > >     btrfs_orphan_cleanup(sub_root)
> > >       test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP)
> > >       btrfs_search_slot() // finds orphan item for inode N
> > >       ...
> > >       prints "could not do orphan cleanup -2"
> > >   if (inode == ERR_PTR(-ENOENT))
> > >     inode = NULL;
> > >   return d_splice_alias(NULL, dentry) // NEGATIVE DENTRY for valid subvolume
> > >
> > > btrfs_orphan_cleanup() does test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP)
> > > on the root when it runs, so it cannot run more than once on a given
> > > root, so something else must run concurrently. However, the obvious
> > > routes to deleting an orphan when nlinks goes to 0 should not be able to
> > > run without first doing a lookup into the subvolume, which should run
> > > btrfs_orphan_cleanup() and set the bit.
> > >
> > > The final important observation is that create_subvol() calls
> > > d_instantiate_new() but does not set BTRFS_ROOT_ORPHAN_CLEANUP, so if
> > > the dentry cache gets dropped, the next lookup into the subvolume will
> > > make a real call into btrfs_orphan_cleanup() for the first time. This
> > > opens up the possibility of concurrently deleting the inode/orphan items
> > > but most typical evict() paths will be holding a reference on the parent
> > > dentry (child dentry holds parent->d_lockref.count via dget in
> > > d_alloc(), released in __dentry_kill()) and prevent the parent from
> > > being removed from the dentry cache.
> > >
> > > The one exception is delayed iputs. Ordered extent creation calls
> > > igrab() on the inode. If the file is unlinked and closed while those
> > > refs are held, iput() in __dentry_kill() decrements i_count but does
> > > not trigger eviction (i_count > 0). The child dentry is freed and the
> > > subvol dentry's d_lockref.count drops to 0, making it evictable while
> > > the inode is still alive.
> > >
> > > Since there are two races (the race between writeback and unlink and
> > > the race between lookup and delayed iputs), and there are too many moving
> > > parts, the following three diagrams show the complete picture.
> > > (Only the second and third are races)
> > >
> > > Phase 1:
> > > Create Subvol in dentry cache without BTRFS_ROOT_ORPHAN_CLEANUP set
> > >
> > > btrfs_mksubvol()
> > >   lookup_one_len()
> > >     __lookup_slow()
> > >       d_alloc_parallel()
> > >         __d_alloc() // d_lockref.count = 1
> > >   create_subvol(dentry)
> > >     // doesn't touch the bit..
> > >     d_instantiate_new(dentry, inode) // dentry in cache with d_lockref.count == 1
> > >
> > > Phase 2:
> > > Create a delayed iput for a file in the subvol but leave the subvol in
> > > state where its dentry can be evicted (d_lockref.count == 0)
> > >
> > > T1 (task)                    T2 (writeback)                   T3 (OE workqueue)
> > >
> > > write() // dirty pages
> > >                               btrfs_writepages()
> > >                                 btrfs_run_delalloc_range()
> > >                                   cow_file_range()
> > >                                     btrfs_alloc_ordered_extent()
> > >                                       igrab() // i_count: 1 -> 2
> > > btrfs_unlink_inode()
> > >   btrfs_orphan_add()
> > > close()
> > >   __fput()
> > >     dput()
> > >       finish_dput()
> > >         __dentry_kill()
> > >           dentry_unlink_inode()
> > >             iput() // 2 -> 1
> > >           --parent->d_lockref.count // 1 -> 0; evictable
> >
> > So my previous comment from v1 still stands:
> >
> > "Where does this decrement of parent->d_lockref.count happens exactly?
> >
> > I don't see it immediately in iput(), or iput_final(). Please put the
> > full call chain."
>
> Sorry, I should have replied in greater detail. I added some callstack
> context above dput but didn't clarify anything about __dentry_kill where
> the real details are.
>
> On current for-next I see teh decrement at fs/dcache.c:690 in
> __dentry_kill() inside a conditional:
>
>   if (parent && --parent->d_lockref.count) {
>   ...
>   }
>
> I have never figured out a perfect way to mix function calls and
> statements in these race diagrams with nesting and such, but I probably
> should have written out the conditional? I tried to have it nested at
> the "stuff inside __dentry_kill level" but after iput (which I also
> wanted to put to show the inode ref count)
>
> Let me know if you have any suggestions for how I can change it to make
> it more clear!

Ok, you can add:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

>
> Thanks,
> Boris
>
> >
> > Thanks.
> >
> >
> >
> > >                                                                 finish_ordered_fn()
> > >                                                                   btrfs_finish_ordered_io()
> > >                                                                     btrfs_put_ordered_extent()
> > >                                                                       btrfs_add_delayed_iput()
> > >
> > > Phase 3:
> > > Once the delayed iput is pending and the subvol dentry is evictable,
> > > the shrinker can free it, causing the next lookup to go through
> > > btrfs_lookup() and call btrfs_orphan_cleanup() for the first time.
> > > If the cleaner kthread processes the delayed iput concurrently, the
> > > two race:
> > >
> > >   T1 (shrinker)              T2 (cleaner kthread)                          T3 (lookup)
> > >
> > >   super_cache_scan()
> > >     prune_dcache_sb()
> > >       __dentry_kill()
> > >       // subvol dentry freed
> > >                               btrfs_run_delayed_iputs()
> > >                                 iput()  // i_count -> 0
> > >                                   evict()  // sets I_FREEING
> > >                                     btrfs_evict_inode()
> > >                                       // truncation loop
> > >                                                                             btrfs_lookup()
> > >                                                                               btrfs_lookup_dentry()
> > >                                                                                 btrfs_orphan_cleanup()
> > >                                                                                   // first call (bit never set)
> > >                                                                                   btrfs_iget()
> > >                                                                                     // blocks on I_FREEING
> > >
> > >                                       btrfs_orphan_del()
> > >                                       // inode freed
> > >                                                                                     // returns -ENOENT
> > >                                                                                   btrfs_del_orphan_item()
> > >                                                                                     // -ENOENT
> > >                                                                                 // "could not do orphan cleanup -2"
> > >                                                                             d_splice_alias(NULL, dentry)
> > >                                                                             // negative dentry for valid subvol
> > >
> > > The most straightforward fix is to ensure the invariant that a dentry
> > > for a subvolume can exist if and only if that subvolume has
> > > BTRFS_ROOT_ORPHAN_CLEANUP set on its root (and is known to have no
> > > orphans or ran btrfs_orphan_cleanup()).
> > >
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > > Changelog:
> > > v2:
> > > - fixed some typographical errors in the commit message.
> > > - improved the commit message with more callstacks / details.
> > >
> > > ---
> > >  fs/btrfs/ioctl.c | 7 +++++++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > > index b8db877be61cc..77f7db18c6ca5 100644
> > > --- a/fs/btrfs/ioctl.c
> > > +++ b/fs/btrfs/ioctl.c
> > > @@ -672,6 +672,13 @@ static noinline int create_subvol(struct mnt_idmap *idmap,
> > >                 goto out;
> > >         }
> > >
> > > +       /*
> > > +        * Subvolumes have orphans cleaned on first dentry lookup. A new
> > > +        * subvolume cannot have any orphans, so we should set the bit before we
> > > +        * add the subvolume dentry to the dentry cache, so that it is in the
> > > +        * same state as a subvolume after first lookup.
> > > +        */
> > > +       set_bit(BTRFS_ROOT_ORPHAN_CLEANUP, &new_root->state);
> > >         d_instantiate_new(dentry, new_inode_args.inode);
> > >         new_inode_args.inode = NULL;
> > >
> > > --
> > > 2.47.3
> > >
> > >

      reply	other threads:[~2026-02-25 17:23 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-24 22:25 [PATCH v2 1/1] btrfs: set BTRFS_ROOT_ORPHAN_CLEANUP during subvol create Boris Burkov
2026-02-25 12:21 ` Filipe Manana
2026-02-25 17:09   ` Boris Burkov
2026-02-25 17:23     ` Filipe Manana [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAL3q7H6bVZquyvod=_YjNw1vRBSCQscWSrb5mVEZ1YhLBS8e9Q@mail.gmail.com' \
    --to=fdmanana@kernel.org \
    --cc=boris@bur.io \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox