From: Filipe Manana <fdmanana@kernel.org>
To: Boris Burkov <boris@bur.io>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v2 1/1] btrfs: set BTRFS_ROOT_ORPHAN_CLEANUP during subvol create
Date: Wed, 25 Feb 2026 17:23:02 +0000 [thread overview]
Message-ID: <CAL3q7H6bVZquyvod=_YjNw1vRBSCQscWSrb5mVEZ1YhLBS8e9Q@mail.gmail.com> (raw)
In-Reply-To: <20260225170921.GA682210@zen.localdomain>
On Wed, Feb 25, 2026 at 5:08 PM Boris Burkov <boris@bur.io> wrote:
>
> On Wed, Feb 25, 2026 at 12:21:43PM +0000, Filipe Manana wrote:
> > On Tue, Feb 24, 2026 at 10:25 PM Boris Burkov <boris@bur.io> wrote:
> > >
> > > We have recently observed a number of subvolumes with broken dentries.
> > > ls-ing the parent dir looks like:
> > >
> > > drwxrwxrwt 1 root root 16 Jan 23 16:49 .
> > > drwxr-xr-x 1 root root 24 Jan 23 16:48 ..
> > > d????????? ? ? ? ? ? broken_subvol
> > >
> > > and similarly stat-ing the file fails.
> > >
> > > In this state, deleting the subvol fails with ENOENT, but attempting to
> > > create a new file or subvol over it errors out with EEXIST and even
> > > aborts the fs. Which leaves us a bit stuck.
> > >
> > > dmesg contains a single notable error message reading:
> > > "could not do orphan cleanup -2"
> > >
> > > 2 is ENOENT and the error comes from the failure handling path of
> > > btrfs_orphan_cleanup(), with the stack leading back up to
> > > btrfs_lookup().
> > >
> > > btrfs_lookup
> > > btrfs_lookup_dentry
> > > btrfs_orphan_cleanup // prints that message and returns -ENOENT
> > >
> > > After some detailed inspection of the internal state, it became clear
> > > that:
> > > - there are no orphan items for the subvol
> > > - the subvol is otherwise healthy looking, it is not half-deleted or
> > > anything, there is no drop progress, etc.
> > > - the subvol was created a while ago and does the meaningful first
> > > btrfs_orphan_cleanup() call that sets BTRFS_ROOT_ORPHAN_CLEANUP much
> > > later.
> > > - after btrfs_orphan_cleanup() fails, btrfs_lookup_dentry() returns -ENOENT,
> > > which results in a negative dentry for the subvolume via
> > > d_splice_alias(NULL, dentry), leading to the observed behavior. The
> > > bug can be mitigated by dropping the dentry cache, at which point we
> > > can successfully delete the subvolume if we want.
> > >
> > > i.e.,
> > > btrfs_lookup()
> > > btrfs_lookup_dentry()
> > > if (!sb_rdonly(inode->vfs_inode)->vfs_inode)
> > > btrfs_orphan_cleanup(sub_root)
> > > test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP)
> > > btrfs_search_slot() // finds orphan item for inode N
> > > ...
> > > prints "could not do orphan cleanup -2"
> > > if (inode == ERR_PTR(-ENOENT))
> > > inode = NULL;
> > > return d_splice_alias(NULL, dentry) // NEGATIVE DENTRY for valid subvolume
> > >
> > > btrfs_orphan_cleanup() does test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP)
> > > on the root when it runs, so it cannot run more than once on a given
> > > root, so something else must run concurrently. However, the obvious
> > > routes to deleting an orphan when nlinks goes to 0 should not be able to
> > > run without first doing a lookup into the subvolume, which should run
> > > btrfs_orphan_cleanup() and set the bit.
> > >
> > > The final important observation is that create_subvol() calls
> > > d_instantiate_new() but does not set BTRFS_ROOT_ORPHAN_CLEANUP, so if
> > > the dentry cache gets dropped, the next lookup into the subvolume will
> > > make a real call into btrfs_orphan_cleanup() for the first time. This
> > > opens up the possibility of concurrently deleting the inode/orphan items
> > > but most typical evict() paths will be holding a reference on the parent
> > > dentry (child dentry holds parent->d_lockref.count via dget in
> > > d_alloc(), released in __dentry_kill()) and prevent the parent from
> > > being removed from the dentry cache.
> > >
> > > The one exception is delayed iputs. Ordered extent creation calls
> > > igrab() on the inode. If the file is unlinked and closed while those
> > > refs are held, iput() in __dentry_kill() decrements i_count but does
> > > not trigger eviction (i_count > 0). The child dentry is freed and the
> > > subvol dentry's d_lockref.count drops to 0, making it evictable while
> > > the inode is still alive.
> > >
> > > Since there are two races (the race between writeback and unlink and
> > > the race between lookup and delayed iputs), and there are too many moving
> > > parts, the following three diagrams show the complete picture.
> > > (Only the second and third are races)
> > >
> > > Phase 1:
> > > Create Subvol in dentry cache without BTRFS_ROOT_ORPHAN_CLEANUP set
> > >
> > > btrfs_mksubvol()
> > > lookup_one_len()
> > > __lookup_slow()
> > > d_alloc_parallel()
> > > __d_alloc() // d_lockref.count = 1
> > > create_subvol(dentry)
> > > // doesn't touch the bit..
> > > d_instantiate_new(dentry, inode) // dentry in cache with d_lockref.count == 1
> > >
> > > Phase 2:
> > > Create a delayed iput for a file in the subvol but leave the subvol in
> > > state where its dentry can be evicted (d_lockref.count == 0)
> > >
> > > T1 (task) T2 (writeback) T3 (OE workqueue)
> > >
> > > write() // dirty pages
> > > btrfs_writepages()
> > > btrfs_run_delalloc_range()
> > > cow_file_range()
> > > btrfs_alloc_ordered_extent()
> > > igrab() // i_count: 1 -> 2
> > > btrfs_unlink_inode()
> > > btrfs_orphan_add()
> > > close()
> > > __fput()
> > > dput()
> > > finish_dput()
> > > __dentry_kill()
> > > dentry_unlink_inode()
> > > iput() // 2 -> 1
> > > --parent->d_lockref.count // 1 -> 0; evictable
> >
> > So my previous comment from v1 still stands:
> >
> > "Where does this decrement of parent->d_lockref.count happens exactly?
> >
> > I don't see it immediately in iput(), or iput_final(). Please put the
> > full call chain."
>
> Sorry, I should have replied in greater detail. I added some callstack
> context above dput but didn't clarify anything about __dentry_kill where
> the real details are.
>
> On current for-next I see teh decrement at fs/dcache.c:690 in
> __dentry_kill() inside a conditional:
>
> if (parent && --parent->d_lockref.count) {
> ...
> }
>
> I have never figured out a perfect way to mix function calls and
> statements in these race diagrams with nesting and such, but I probably
> should have written out the conditional? I tried to have it nested at
> the "stuff inside __dentry_kill level" but after iput (which I also
> wanted to put to show the inode ref count)
>
> Let me know if you have any suggestions for how I can change it to make
> it more clear!
Ok, you can add:
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
>
> Thanks,
> Boris
>
> >
> > Thanks.
> >
> >
> >
> > > finish_ordered_fn()
> > > btrfs_finish_ordered_io()
> > > btrfs_put_ordered_extent()
> > > btrfs_add_delayed_iput()
> > >
> > > Phase 3:
> > > Once the delayed iput is pending and the subvol dentry is evictable,
> > > the shrinker can free it, causing the next lookup to go through
> > > btrfs_lookup() and call btrfs_orphan_cleanup() for the first time.
> > > If the cleaner kthread processes the delayed iput concurrently, the
> > > two race:
> > >
> > > T1 (shrinker) T2 (cleaner kthread) T3 (lookup)
> > >
> > > super_cache_scan()
> > > prune_dcache_sb()
> > > __dentry_kill()
> > > // subvol dentry freed
> > > btrfs_run_delayed_iputs()
> > > iput() // i_count -> 0
> > > evict() // sets I_FREEING
> > > btrfs_evict_inode()
> > > // truncation loop
> > > btrfs_lookup()
> > > btrfs_lookup_dentry()
> > > btrfs_orphan_cleanup()
> > > // first call (bit never set)
> > > btrfs_iget()
> > > // blocks on I_FREEING
> > >
> > > btrfs_orphan_del()
> > > // inode freed
> > > // returns -ENOENT
> > > btrfs_del_orphan_item()
> > > // -ENOENT
> > > // "could not do orphan cleanup -2"
> > > d_splice_alias(NULL, dentry)
> > > // negative dentry for valid subvol
> > >
> > > The most straightforward fix is to ensure the invariant that a dentry
> > > for a subvolume can exist if and only if that subvolume has
> > > BTRFS_ROOT_ORPHAN_CLEANUP set on its root (and is known to have no
> > > orphans or ran btrfs_orphan_cleanup()).
> > >
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > > Changelog:
> > > v2:
> > > - fixed some typographical errors in the commit message.
> > > - improved the commit message with more callstacks / details.
> > >
> > > ---
> > > fs/btrfs/ioctl.c | 7 +++++++
> > > 1 file changed, 7 insertions(+)
> > >
> > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > > index b8db877be61cc..77f7db18c6ca5 100644
> > > --- a/fs/btrfs/ioctl.c
> > > +++ b/fs/btrfs/ioctl.c
> > > @@ -672,6 +672,13 @@ static noinline int create_subvol(struct mnt_idmap *idmap,
> > > goto out;
> > > }
> > >
> > > + /*
> > > + * Subvolumes have orphans cleaned on first dentry lookup. A new
> > > + * subvolume cannot have any orphans, so we should set the bit before we
> > > + * add the subvolume dentry to the dentry cache, so that it is in the
> > > + * same state as a subvolume after first lookup.
> > > + */
> > > + set_bit(BTRFS_ROOT_ORPHAN_CLEANUP, &new_root->state);
> > > d_instantiate_new(dentry, new_inode_args.inode);
> > > new_inode_args.inode = NULL;
> > >
> > > --
> > > 2.47.3
> > >
> > >
prev parent reply other threads:[~2026-02-25 17:23 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-24 22:25 [PATCH v2 1/1] btrfs: set BTRFS_ROOT_ORPHAN_CLEANUP during subvol create Boris Burkov
2026-02-25 12:21 ` Filipe Manana
2026-02-25 17:09 ` Boris Burkov
2026-02-25 17:23 ` Filipe Manana [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAL3q7H6bVZquyvod=_YjNw1vRBSCQscWSrb5mVEZ1YhLBS8e9Q@mail.gmail.com' \
--to=fdmanana@kernel.org \
--cc=boris@bur.io \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox