From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 28/30] xfs: rework xfs_iflush_cluster() dirty inode iteration
Date: Wed, 10 Jun 2020 09:06:28 -0400
Message-ID: <20200610130628.GA50747@bfoster>
In-Reply-To: <20200609220139.GJ2040@dread.disaster.area>
On Wed, Jun 10, 2020 at 08:01:39AM +1000, Dave Chinner wrote:
> On Tue, Jun 09, 2020 at 09:11:55AM -0400, Brian Foster wrote:
> > On Thu, Jun 04, 2020 at 05:46:04PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > Now that we have all the dirty inodes attached to the cluster
> > > buffer, we don't actually have to do radix tree lookups to find
> > > them. Sure, the radix tree is efficient, but walking a linked list
> > > of just the dirty inodes attached to the buffer is much better.
> > >
> > > We are also no longer dependent on having a locked inode passed into
> > > the function to determine where to start the lookup. This means we
> > > can drop it from the function call and treat all inodes the same.
> > >
> > > We also make xfs_iflush_cluster skip inodes marked with
> > > XFS_IRECLAIM. This way we avoid races with inodes that reclaim is
> > > actively referencing or are being re-initialised by inode lookup. If
> > > they are actually dirty, they'll get written by a future cluster
> > > flush....
> > >
> > > We also add a shutdown check after obtaining the flush lock so that
> > > we catch inodes that are dirty in memory and may have inconsistent
> > > state due to the shutdown in progress. We abort these inodes
> > > directly, so they remove themselves from the buffer list and the
> > > AIL rather than having to wait for the buffer to be failed and its
> > > completion callbacks run to process them correctly.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > > fs/xfs/xfs_inode.c | 148 ++++++++++++++++------------------------
> > > fs/xfs/xfs_inode.h | 2 +-
> > > fs/xfs/xfs_inode_item.c | 2 +-
> > > 3 files changed, 62 insertions(+), 90 deletions(-)
> > >
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 8566bd0f4334d..931a483d5b316 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -3611,117 +3611,94 @@ xfs_iflush(
> > > */
> > > int
> > > xfs_iflush_cluster(
> > > - struct xfs_inode *ip,
> > > struct xfs_buf *bp)
> > > {
> > ...
> > > + /*
> > > + * We must use the safe variant here as on shutdown xfs_iflush_abort()
> > > + * can remove itself from the list.
> > > + */
> > > + list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
> > > + iip = (struct xfs_inode_log_item *)lip;
> > > + ip = iip->ili_inode;
> > >
> > > /*
> > > - * because this is an RCU protected lookup, we could find a
> > > - * recently freed or even reallocated inode during the lookup.
> > > - * We need to check under the i_flags_lock for a valid inode
> > > - * here. Skip it if it is not valid or the wrong inode.
> > > + * Quick and dirty check to avoid locks if possible.
> > > */
> > > - spin_lock(&cip->i_flags_lock);
> > > - if (!cip->i_ino ||
> > > - __xfs_iflags_test(cip, XFS_ISTALE)) {
> > > - spin_unlock(&cip->i_flags_lock);
> > > + if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
> > > + continue;
> > > + if (xfs_ipincount(ip))
> > > continue;
> > > - }
> > >
> > > /*
> > > - * Once we fall off the end of the cluster, no point checking
> > > - * any more inodes in the list because they will also all be
> > > - * outside the cluster.
> > > + * The inode is still attached to the buffer, which means it is
> > > + * dirty but reclaim might try to grab it. Check carefully for
> > > + * that, and grab the ilock while still holding the i_flags_lock
> > > + * to guarantee reclaim will not be able to reclaim this inode
> > > + * once we drop the i_flags_lock.
> > > */
> > > - if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
> > > - spin_unlock(&cip->i_flags_lock);
> > > - break;
> > > + spin_lock(&ip->i_flags_lock);
> > > + ASSERT(!__xfs_iflags_test(ip, XFS_ISTALE));
> >
> > What prevents a race with staling an inode here?
>
> xfs_buf_item_release() does not drop the buffer lock when stale
> buffers are committed. Hence the buffer is held locked until it is
> committed to the journal, and unpinning the buffer on journal IO
> completion runs xfs_iflush_done() -> xfs_iflush_abort() on all the
> stale attached inodes, at which point they are removed from the
> buffer and the buffer is unlocked.
>
> Hence we cannot run here with stale inodes attached to the buffer
> because the buffer is locked the entire time stale inodes are
> attached.
>
Ok, so the earlier ISTALE check on the inode is really more of an
optimization: if we saw the flag set, we'd never have been able to lock
the buffer in the first place.
> > The push open codes an
> > unlocked (i.e. no memory barrier) check before it acquires the buffer
> > lock, so afaict it's technically possible to race and grab the buffer
> > immediately after the cluster was freed. If that could happen, it looks
> > like we'd also queue the buffer for write.
>
> Not that I can tell, because we'll fail to get the buffer lock under
> the AIL lock until the stale buffer is unpinned, but the unpin
> occurs under the buffer lock and that removes the buffer from the
> AIL. Hence there's no way the AIL pushing can actually push an inode
> cluster buffer that has stale inodes attached to it...
>
Makes sense.
> > That also raises the question: between this patch and the earlier push
> > rework, why do we queue the buffer for write when nothing might have
> > been flushed by this call?
>
> Largely because I never observed a failure to flush at least one
> inode so I didn't really consider it worth the additional complexity
> in error handling in the push code. I've really been concerned about
> the other end of the scale where we try to push the same buffer 30+
> times for each IO.
>
> You are right that it could happen - perhaps just returning EAGAIN if
> clcount == 0 at the end of the function, and translating that to
> XFS_ITEM_LOCKED in the push function, is all that is necessary.
>
That sounds reasonable enough to cover that case.
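
Something like the following is what I'd expect, FWIW (completely
untested, and I'm just guessing at how it slots into the end of
xfs_iflush_cluster() and into xfs_inode_item_push(), so take it as a
sketch rather than a concrete proposal):

	/* end of xfs_iflush_cluster(): tell the caller nothing was flushed */
	if (!error && !clcount)
		return -EAGAIN;
	return error;

... with the push side translating that back:

	error = xfs_iflush_cluster(bp);
	if (error == -EAGAIN) {
		/* nothing flushed this time; drop the buffer, retry later */
		xfs_buf_relse(bp);
		rval = XFS_ITEM_LOCKED;
	} else if (!error) {
		if (!xfs_buf_delwri_queue(bp, buffer_list))
			rval = XFS_ITEM_FLUSHING;
		xfs_buf_relse(bp);
	}

The stats/exit-path details probably need massaging, but that's the
gist.
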
> > > - * check is not sufficient.
> > > + * If we are shut down, unpin and abort the inode now as there
> > > + * is no point in flushing it to the buffer just to get an IO
> > > + * completion to abort the buffer and remove it from the AIL.
> > > */
> > > - if (!cip->i_ino) {
> > > - xfs_ifunlock(cip);
> > > - xfs_iunlock(cip, XFS_ILOCK_SHARED);
> > > + if (XFS_FORCED_SHUTDOWN(mp)) {
> > > + xfs_iunpin_wait(ip);
> >
> > Note that we have an unlocked check above that skips pinned inodes.
>
> Right, but we could be racing with a transaction commit that pinned
> the inode and a shutdown. As the comment says: it's a quick and
> dirty check to avoid trying to get locks when we know that it is
> unlikely we can flush the inode without blocking. We still have to
> recheck that state once we have the ILOCK....
>
Right, but that means we can just as easily skip the shutdown processing
(which waits for unpin) if a particular inode is pinned. So which is
supposed to happen in the shutdown case?
ISTM that either could happen, so this kind of looks like random logic
to me. IIUC, we rely on potentially coming back through the cluster
flush path multiple times if inodes are pinned because of the racy
pin/shutdown checks. If that's the case, then why not push the shutdown
check to after the locked pin count check, skip the xfs_iunpin_wait()
call entirely, and then proceed to abort the flush at the same point
we'd call xfs_iflush() if the fs weren't shut down?
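
Roughly something like this, i.e. (totally untested, and assuming
xfs_iflush_abort() still drops the flush lock and removes the item from
the buffer list as it does today):

	/* don't block waiting on a log force to unpin dirty inodes */
	if (xfs_ipincount(ip)) {
		xfs_ifunlock(ip);
		xfs_iunlock(ip, XFS_ILOCK_SHARED);
		continue;
	}

	/*
	 * The inode is unpinned and flush locked here. If the fs has
	 * shut down, abort the flush instead of writing the inode into
	 * the buffer; otherwise flush it as normal.
	 */
	if (XFS_FORCED_SHUTDOWN(mp)) {
		xfs_iflush_abort(ip);
		error = -EIO;
	} else if (!xfs_inode_clean(ip)) {
		error = xfs_iflush(ip, bp);
	} else {
		xfs_ifunlock(ip);
	}
	xfs_iunlock(ip, XFS_ILOCK_SHARED);

That would drop the xfs_iunpin_wait() and keep the error/abort handling
in one place, but I may well be missing a subtlety around the AIL or
buffer list state on shutdown.
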
FWIW, I'm also kind of wondering if we need this shutdown check at all
now that this path doesn't have to accommodate reclaim. We follow up
with failing the buffer in the exit path, but there's no guarantee there
aren't other inodes that are passed over due to the pin checks,
trylocks, etc. Can we just flush the inodes as normal and let xfsaild
fail the buffers on submission without additional shutdown checks?
Brian
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>