All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Amir Goldstein <amir73il@gmail.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [RFC PATCH v2] xfs: run blockgc on freeze to avoid iget stalls after reclaim
Date: Thu, 25 Jan 2024 07:25:14 -0500	[thread overview]
Message-ID: <ZbJSmYR8KxMcSlTy@bfoster> (raw)
In-Reply-To: <CAOQ4uxjWm82=KSQYMPo06kxfU90OBpMDmmQfyZAMS_2ZfJHnrw@mail.gmail.com>

On Sat, Jan 20, 2024 at 10:50:02AM +0200, Amir Goldstein wrote:
> On Fri, Jan 19, 2024 at 9:35 PM Brian Foster <bfoster@redhat.com> wrote:
> >
> > We've had reports on distro (pre-deferred inactivation) kernels that
> > inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> > lock when invoked on a frozen XFS fs. This occurs because
> > drop_caches acquires the lock and then blocks in xfs_inactive() on
> > transaction alloc for an inode that requires an eofb trim. unfreeze
> > then blocks on the same lock and the fs is deadlocked.
> >
> > With deferred inactivation, the deadlock problem is no longer
> > present because ->destroy_inode() no longer blocks whether the fs is
> > frozen or not. There is still unfortunate behavior in that lookups
> > of a pending inactive inode spin loop waiting for the pending
> > inactive state to clear, which won't happen until the fs is
> > unfrozen. This was always possible to some degree, but is
> > potentially amplified by the fact that reclaim no longer blocks on
> > the first inode that requires inactivation work. Instead, we
> > populate the inactivation queues indefinitely. The side effect can
> > be observed easily by invoking drop_caches on a frozen fs previously
> > populated with eofb and/or cowblocks inodes and then running
> > anything that relies on inode lookup (i.e., ls).
> >
> > To mitigate this behavior, invoke a non-sync blockgc scan during the
> > freeze sequence to minimize the chance that inode evictions require
> > inactivation while the fs is frozen. A synchronous scan would
> > provide more of a guarantee, but is potentially unsafe from
> > partially frozen context. This is because a file read task may be
> > blocked on a write fault while holding iolock (such as when reading
> > into a mapped buffer) and a sync scan retries indefinitely on iolock
> > failure. Therefore, this adds risk of potential livelock during the
> > freeze sequence.
> >
> > Finally, since the deadlock issue was present for such a long time,
> > also document the subtle ->destroy_inode() constraint to avoid
> > unintentional reintroduction of the deadlock problem in the future.
> >
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> Is there an appropriate Fixes: commit that could be mentioned here?
> or at least a range of stable kernels to apply this suggested fix?
> 

Hmm.. well I didn't really consider this a bug upstream. The above is
more historical reference to an issue that has since gone away, but
trying to use the bug report on a stable kernel to be forward looking
about improving on potentially awkward behavior of the latest upstream
kernel under the same sort of circumstances (i.e. reclaim while frozen).

I suppose something like this would be potentially useful for stable
kernels that don't include background inactivation. I haven't audited
which stable kernels might fall in that category (if any), but alas it
probably doesn't matter because this patch likely wasn't going anywhere
anyways.

Brian

> Thanks,
> Amir.
> 
> > ---
> >
> > Hi all,
> >
> > There was a good amount of discussion on the first version of this patch
> > [1] a couple or so years ago now. The main issue was that using a sync
> > scan is unsafe in certain cases (best described here [2]), so this
> > best-effort approach was considered as a fallback option to improve
> > behavior.
> >
> > The reason I'm reposting this is that it is one of several options for
> > dealing with the aforementioned deadlock on stable/distro kernels, so it
> > seems to have mutual benefit. Looking back through the original
> > discussion, I think there are several ways this could be improved to
> > provide the benefit of a sync scan. For example, if the scan could be
> > made to run before faults are locked out (re [3]), that may be
> > sufficient to allow a sync scan. Or now that freeze_super() actually
> > checks for ->sync_fs() errors, an async scan could be followed by a
> > check for tagged blockgc entries that triggers an -EBUSY or some error
> > return to fail the freeze, which would most likely be a rare and
> > transient situation. Etc.
> >
> > These thoughts are mainly incremental improvements upon some form of
> > freeze time scan and may not be of significant additional value given
> > current upstream behavior, so this patch takes the simple, best effort
> > approach. Thoughts?
> >
> > Brian
> >
> > [1] https://lore.kernel.org/linux-xfs/20220113133701.629593-1-bfoster@redhat.com/
> > [2] https://lore.kernel.org/linux-xfs/20220115224030.GA59729@dread.disaster.area/
> > [3] https://lore.kernel.org/linux-xfs/Yehvc4g+WakcG1mP@bfoster/
> >
> >  fs/xfs/xfs_super.c | 24 ++++++++++++++++--------
> >  1 file changed, 16 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index d0009430a627..43e72e266666 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -657,8 +657,13 @@ xfs_fs_alloc_inode(
> >  }
> >
> >  /*
> > - * Now that the generic code is guaranteed not to be accessing
> > - * the linux inode, we can inactivate and reclaim the inode.
> > + * Now that the generic code is guaranteed not to be accessing the inode, we can
> > + * inactivate and reclaim it.
> > + *
> > + * NOTE: ->destroy_inode() can be called (with ->s_umount held) while the
> > + * filesystem is frozen. Therefore it is generally unsafe to attempt transaction
> > + * allocation in this context. A transaction alloc that blocks on frozen state
> > + * from a context with ->s_umount held will deadlock with unfreeze.
> >   */
> >  STATIC void
> >  xfs_fs_destroy_inode(
> > @@ -811,15 +816,18 @@ xfs_fs_sync_fs(
> >          * down inodegc because once SB_FREEZE_FS is set it's too late to
> >          * prevent inactivation races with freeze. The fs doesn't get called
> >          * again by the freezing process until after SB_FREEZE_FS has been set,
> > -        * so it's now or never.  Same logic applies to speculative allocation
> > -        * garbage collection.
> > +        * so it's now or never.
> >          *
> > -        * We don't care if this is a normal syncfs call that does this or
> > -        * freeze that does this - we can run this multiple times without issue
> > -        * and we won't race with a restart because a restart can only occur
> > -        * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> > +        * The same logic applies to block garbage collection. Run a best-effort
> > +        * blockgc scan to reduce the working set of inodes that the shrinker
> > +        * would send to inactivation queue purgatory while frozen. We can't run
> > +        * a sync scan with page faults blocked because that could potentially
> > +        * livelock against a read task blocked on a page fault (i.e. if reading
> > +        * into a mapped buffer) while holding iolock.
> >          */
> >         if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > +               xfs_blockgc_free_space(mp, NULL);
> > +
> >                 xfs_inodegc_stop(mp);
> >                 xfs_blockgc_stop(mp);
> >         }
> > --
> > 2.42.0
> >
> >
> 


  reply	other threads:[~2024-01-25 12:24 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-19 19:36 [RFC PATCH v2] xfs: run blockgc on freeze to avoid iget stalls after reclaim Brian Foster
2024-01-20  8:50 ` Amir Goldstein
2024-01-25 12:25   ` Brian Foster [this message]
2024-01-22  3:23 ` Dave Chinner
2024-01-25 12:46   ` Brian Foster
2024-01-25 23:46     ` Dave Chinner
2024-01-29 15:02       ` Brian Foster
2024-02-01  1:16         ` Dave Chinner
2024-02-02 19:41           ` Brian Foster
2024-02-02 23:33             ` Darrick J. Wong
2024-02-04 16:03               ` Brian Foster
2024-02-05 22:07                 ` Darrick J. Wong
2024-02-06 13:28                   ` Brian Foster
2024-02-07 17:36                     ` Darrick J. Wong
2024-02-08 12:54                       ` Brian Foster
2024-02-09  0:23                         ` Dave Chinner
2024-02-12 19:43                           ` Brian Foster
2024-02-13 17:56                             ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZbJSmYR8KxMcSlTy@bfoster \
    --to=bfoster@redhat.com \
    --cc=amir73il@gmail.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.