Linux Btrfs filesystem development
 help / color / mirror / Atom feed
From: Jeff Layton <jlayton@kernel.org>
To: Filipe Manana <fdmanana@kernel.org>
Cc: Chris Mason <clm@fb.com>, David Sterba <dsterba@suse.com>,
	Filipe Manana <fdmanana@suse.com>,
	Josef Bacik <josef@toxicpanda.com>,
	 linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] btrfs: don't let shrinker touch extent_maps that are being logged
Date: Tue, 30 Jun 2026 08:28:34 -0400	[thread overview]
Message-ID: <2a2f3722ae7b822a48ec4ac3c641ecacc2ab052d.camel@kernel.org> (raw)
In-Reply-To: <CAL3q7H4nCvmMN8iO4+eVbP3aEm8oxmyfarb4bCyBPLEwU48O7w@mail.gmail.com>

On Mon, 2026-06-29 at 16:47 +0100, Filipe Manana wrote:
> On Mon, Jun 29, 2026 at 4:06 PM Filipe Manana <fdmanana@kernel.org> wrote:
> > 
> > On Mon, Jun 29, 2026 at 3:56 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > 
> > > On Mon, 2026-06-29 at 15:19 +0100, Filipe Manana wrote:
> > > > On Mon, Jun 29, 2026 at 2:41 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > 
> > > > > The extent map shrinker can free an extent map that is still owned by an
> > > > > in-flight fsync and still linked on the inode's modified_extents list,
> > > > > corrupting that list and eventually causing an RCU stall.
> > > > > 
> > > > > btrfs_scan_inode() currently skips EXTENT_FLAG_PINNED maps, then calls
> > > > > btrfs_remove_extent_mapping() followed by btrfs_free_extent_map():
> > > > > 
> > > > >         if (em->flags & EXTENT_FLAG_PINNED)
> > > > >                 goto next;
> > > > >         ...
> > > > >         btrfs_remove_extent_mapping(inode, em);
> > > > >         btrfs_free_extent_map(em);
> > > > > 
> > > > > But btrfs_remove_extent_mapping() deliberately does NOT unlink a map that
> > > > > has EXTENT_FLAG_LOGGING set:
> > > > > 
> > > > >         if (!(em->flags & EXTENT_FLAG_LOGGING))
> > > > >                 list_del_init(&em->list);
> > > > >         remove_em(inode, em);
> > > > > 
> > > > > This sets up a UAF situation where a later fsync() can trip over the
> > > > > now-freed extent_map still on the modified_extents() list.
> > > > 
> > > > I don't see how that can happen, because fsync always takes the
> > > > inode's i_mmap_lock (in write/exclusive mode), and btrfs_scan_inode()
> > > > takes the same lock too (in shared/read mode).
> > > > EXTENT_FLAG_LOGGING is only set and cleared during fsync, so how can
> > > > the shrinker race with fsync?
> > > > 
> > > 
> > > ...and yet, the vmcore shows that the shrinker hit an extent_map that
> > > had LOGGING set while holding the i_mmap_lock.
> > 
> > Yes, but we need to understand why...
> > 
> > > 
> > > I had hoped to add an assertion here that LOGGING couldn't be set on
> > > the em's after taking the i_mmap_lock, but that's not viable:
> > > 
> > > The problem is that the i_mmap_lock coverage isn't complete. I had
> > > Claude assist me with some analysis so take it with a hefty grain of
> > > salt but the problem is that the em tree can contain em's from other
> > > inodes where the i_mmap_lock is not held:
> > 
> > And that is the problem: an em tree, which belongs to an inode, should
> > only have extent maps for that inode...
> > An extent map can only belong to an inode.
> > 
> > So that's what needs investigation. The bug is elsewhere.
> > 
> > Thanks.
> > 
> > > 
> > > -----------------------8<---------------------
> > > 
> > > ...that inode is in progress, therefore no em in its tree should have LOGGING set. Your invariant is the right one.
> > > 
> > >  But it's violated, because the i_mmap_lock interlock is incomplete. Two facts:
> > > 
> > >   1. tree-log.c never takes i_mmap_lock — grep returns 0 occurrences. Only btrfs_sync_file takes it (file.c:1608/1610), and only
> > >   on the one inode being fsync'd.
> > >   2. A single fsync logs many other inodes, and each can set LOGGING on its own ems via btrfs_log_changed_extents().
> 
> A single fsync can log other inodes yes, and without taking their
> i_mmap_lock, but the inodes are logged in LOG_INODE_EXISTS mode, so
> extents are not logged.
> The only exception is for symlink inodes, but those are logged in full
> mode, only have one extent (inlined), and always have
> BTRFS_INODE_NEEDS_FULL_SYNC set, so we never end up in
> btrfs_log_changed_extents() for symlinks.
> 

Thanks for the explanation. It sounds like we should wire some warnings
into the shrinker around these invariants. Like, if LOGGING is set and
the inode's not a symlink, do a WARN_ON_ONCE()? I'll plan to look into
that as well.

FWIW, internally at Meta, we're turning the WARN_ON() calls in
btrfs_free_extent_map() into BUG() calls across a swath of machines in
the hopes we can get a vmcore earlier. We'll keep you posted.

Thanks for the help so far!

> > >   btrfs_log_inode() is called recursively on conflicting inodes, parent directories, and referenced inodes:
> > >   tree-log.c:5982  btrfs_log_inode(di_inode, log_mode)        # dir dentries
> > >   tree-log.c:6339  btrfs_log_inode(inode, LOG_INODE_ALL)      # conflicting inodes
> > >   tree-log.c:6900  btrfs_log_inode(di_inode, log_mode)        # new dir dentries
> > >   tree-log.c:7389  btrfs_log_inode(dir_inode, LOG_INODE_ALL)  # log_all_parents
> > >   tree-log.c:7488  btrfs_log_inode(inode, ...)                # referenced inodes
> > > 
> > >   2. For all of these, btrfs_sync_file holds i_mmap_lock on the original fsync'd inode — not on these side inodes. So a side
> > >   inode X gets LOGGING set on its ems while X's own i_mmap_lock is free.
> 
> Nop, see above.
> 
> > > 
> > >   So the shrinker can down_read_trylock(X->i_mmap_lock) successfully (nobody holds it) and walk X's tree while X's ems are
> > >   LOGGING-flagged → it removes+frees one → corruption. The interlock only protects the directly-fsync'd inode, not the inodes
> > >   logged as its dependencies.
> 
> Nop. Unless you can show an actual code path where while fsyncing one
> inode we log another inode in LOG_INODE_ALL mode
> Again, the exception is symlinks, but for those, the
> BTRFS_INODE_NEEDS_FULL_SYNC flag is always set, so we never end up in
> btrfs_log_changed_extents().
> 
> > > 
> > > 
> > > > Also, there should be no () after modified_extents (it's not a function name).
> > > > 
> > > > 
> > > 
> > > Sorry, early morning typo. Fixed in my local repo.
> > > 
> > > > 
> > > > > 
> > > > > Fix it by having the shrinker skip maps that are being logged, the same
> > > > > way it skips pinned maps. Such a map is owned by the in-flight fsync and
> > > > > will become reclaimable again once logging clears the flag.
> > > > > 
> > > > > Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps")
> > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > ---
> > > > > We've started hitting a number of these problems in our fleet. It
> > > > > seems to mostly happen on ARM64 architecture, but there have been some
> > > > > WARN_ONs that popped on x86_64 too.
> > > > > ---
> > > > >  fs/btrfs/extent_map.c | 8 +++++++-
> > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> > > > > index fce9c5cc0122..128f7800e101 100644
> > > > > --- a/fs/btrfs/extent_map.c
> > > > > +++ b/fs/btrfs/extent_map.c
> > > > > @@ -1166,7 +1166,13 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c
> > > > >                 em = rb_entry(node, struct extent_map, rb_node);
> > > > >                 ctx->scanned++;
> > > > > 
> > > > > -               if (em->flags & EXTENT_FLAG_PINNED)
> > > > > +               /*
> > > > > +                * Skip extent maps that are pinned or are being logged. The
> > > > > +                * i_mmap_lock should prevent this from seeing LOGGING on extent_maps
> > > > > +                * directly associated with inode, but em may be associated with
> > > > > +                * other, dependent inodes and their locks are not held.
> > > > > +                */
> > > > > +               if (em->flags & (EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING))
> > > > >                         goto next;
> > > > > 
> > > > >                 /*
> > > > > 
> > > > > ---
> > > > > base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
> > > > > change-id: 20260629-btrfs-skip-logging-3e31701d9647
> > > > > 
> > > > > Best regards,
> > > > > --
> > > > > Jeff Layton <jlayton@kernel.org>
> > > > > 
> > > > > 
> > > 
> > > --
> > > Jeff Layton <jlayton@kernel.org>

-- 
Jeff Layton <jlayton@kernel.org>

  reply	other threads:[~2026-06-30 12:28 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-29 13:10 [PATCH] btrfs: don't let shrinker touch extent_maps that are being logged Jeff Layton
2026-06-29 14:19 ` Filipe Manana
2026-06-29 14:56   ` Jeff Layton
2026-06-29 15:06     ` Filipe Manana
2026-06-29 15:47       ` Filipe Manana
2026-06-30 12:28         ` Jeff Layton [this message]
2026-06-30 14:06           ` Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2a2f3722ae7b822a48ec4ac3c641ecacc2ab052d.camel@kernel.org \
    --to=jlayton@kernel.org \
    --cc=clm@fb.com \
    --cc=dsterba@suse.com \
    --cc=fdmanana@kernel.org \
    --cc=fdmanana@suse.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox