All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Dave Chinner <david@fromorbit.com>
Cc: eric.dumazet@gmail.com, xfs@oss.sgi.com
Subject: Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking
Date: Tue, 14 Dec 2010 17:05:36 -0800	[thread overview]
Message-ID: <20101215010536.GT2161@linux.vnet.ibm.com> (raw)
In-Reply-To: <20101214230047.GC16267@dastard>

On Wed, Dec 15, 2010 at 10:00:47AM +1100, Dave Chinner wrote:
> On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote:
> > On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > With delayed logging greatly increasing the sustained parallelism of inode
> > > operations, the inode cache locking is showing significant read vs write
> > > contention when inode reclaim runs at the same time as lookups. There is
> > > also a lot more write lock acquistions than there are read locks (4:1 ratio)
> > > so the read locking is not really buying us much in the way of parallelism.
> > > 
> > > To avoid the read vs write contention, change the cache to use RCU locking on
> > > the read side. To avoid needing to RCU free every single inode, use the built
> > > in slab RCU freeing mechanism. This requires us to be able to detect lookups of
> > > freed inodes, so enѕure that ever freed inode has an inode number of zero and
> > > the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
> > > lookup path, but also add a check for a zero inode number as well.
> > > 
> > > We canthen convert all the read locking lockups to use RCU read side locking
> > > and hence remove all read side locking.
> > 
> > OK, so the xfs_inode uses straight RCU, and there fore cannot be freed and
> > immediately reallocated, while the inode itself uses SLAB_DESTROY_BY_RCU,
> > which does allow the inode to be freed and immediately reallocated,
> > correct?
> 
> No, the struct inode is embedded in the struct xfs_inode, so they
> have the same lifecycle. i.e. we don't separately allocate and free
> the struct inode. So it is all using straight RCU.
> 
> > Some questions and comments below.
> > 
> > 							Thanx, Paul
> > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > Reviewed-by: Alex Elder <aelder@sgi.com>
> > > ---
> > >  fs/xfs/linux-2.6/xfs_sync.c |   27 ++++++++++++++++-----
> > >  fs/xfs/xfs_iget.c           |   50 +++++++++++++++++++++++++++++++----------
> > >  fs/xfs/xfs_inode.c          |   52 +++++++++++++++++++++++++++++++++----------
> > >  3 files changed, 98 insertions(+), 31 deletions(-)
> > > 
> > > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> > > index afb0d7c..5ee02d7 100644
> > > --- a/fs/xfs/linux-2.6/xfs_sync.c
> > > +++ b/fs/xfs/linux-2.6/xfs_sync.c
> > > @@ -53,14 +53,20 @@ xfs_inode_ag_walk_grab(
> > >  {
> > >  	struct inode		*inode = VFS_I(ip);
> > > 
> > > +	/* check for stale RCU freed inode */
> > > +	spin_lock(&ip->i_flags_lock);
> > > +	if (!ip->i_ino)
> > > +		goto out_unlock_noent;
> > > +
> > > +	/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
> > > +	if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
> > > +		goto out_unlock_noent;
> > > +	spin_unlock(&ip->i_flags_lock);
> > > +
> > 
> > OK, this works because the xfs_inode cannot be freed and reallocated (which
> > the RCU change should enforce).  It is not clear to me that the above
> > would work if using the SLAB_DESTROY_BY_RCU approach, at least not unless
> > some higher-level checks can reliably catch an inode changing identity
> > due to quick free and reallocation.
> 
> In this case, I don't believe it matters if the inode has changed
> identity - we are walking for writeback, so if we get a reallocated
> inode we'll write it back if it is dirty. If it has not been
> reallocated or still being initialised, the !ip->i_ino and
> XFS_INEW|XFS_IRECLAIM checks are sufficient to avoid using the inode.
> 
> I probably should add a comment to this effect, yes?

One vote in favor from me.  ;-)

> > Also, all calls to xfs_inode_ag_walk_grab() must be in RCU read-side
> > critical sections...  I suggest a debug check for rcu_read_lock_held() to
> > catch any call paths that might have slipped through. 
> 
> Yes, good idea.
> 
> > At first glance,
> > it appears that RCU is replacing some of ->pag_ici_lock, but I could
> > easily be mistaken.
> 
> Correct, it is replacing the read side of the ->pag_ici_lock.

OK, good!

> > > @@ -639,6 +649,9 @@ xfs_reclaim_inode_grab(
> > >  	struct xfs_inode	*ip,
> > >  	int			flags)
> > >  {
> > > +	/* check for stale RCU freed inode */
> > > +	if (!ip->i_ino)
> > > +		return 1;
> > 
> > Does this mean that we need to be under rcu_read_lock() here?  If not,
> > how do we keep the inode from really being freed out from under us?
> 
> Hmmm, I think I've mismerged a patch somewhere along the line. In
> this patch the reclaim tree walk is under the ->pag_ici_lock(), when
> in fact it should be under the rcu_read_lock(). Good catch, Paul.
> 
> As it is, being under the ->pag_ici_lock means that the tree is
> consistent and we won't be seeing RCU freed inodes in the walk.
> Hence the code is functioning correctly, just not as wasss intended.
> 
> > (Again, if we do need to be under rcu_read_lock(), I highly recommend
> > a debug check for rcu_read_lock_held().)
> 
> Yup, which would have caught the merge screwup...

;-)

> ....
> > > 
> > > +	/*
> > > +	 * check for re-use of an inode within an RCU grace period due to the
> > > +	 * radix tree nodes not being updated yet. We monitor for this by
> > > +	 * setting the inode number to zero before freeing the inode structure.
> > > +	 * If the inode has been reallocated and set up, then the inode number
> > > +	 * will not match, so check for that, too.
> > > +	 */
> > >  	spin_lock(&ip->i_flags_lock);
> > > +	if (ip->i_ino != ino) {
> > > +		trace_xfs_iget_skip(ip);
> > > +		XFS_STATS_INC(xs_ig_frecycle);
> > > +		spin_unlock(&ip->i_flags_lock);
> > > +		rcu_read_unlock();
> > > +		/* Expire the grace period so we don't trip over it again. */
> > > +		synchronize_rcu();
> > 
> > Hmmm...  Interesting.  Wouldn't the fact that we acquired the same lock
> > that was held after removing the inode guarantee that an immediate retry
> > would manage not to find this same inode again?
> 
> That is what I'm not sure of. I was more worried about resolving the
> contents of the radix tree nodes, not so much the inode itself. If a
> new traversal will resolve the tree correctly (which is what you are
> implying), then synchronize_rcu() is not needed....

Here is the sequence of events that I believe must be in place:

1.	Some CPU removes the vfs_inode from the radix tree, presumably
	while holding some lock whose identity does not matter here.

2.	Before invoking call_rcu() on a given vfs_inode, this same
	CPU clears the inode number while holding ->i_flags_lock.

	If the CPU in #1 and #2 might be different, then either
	CPU #1 must hold ->i_flags_lock while removing the vfs_inode
	from the radix tree, or I don't understand how this can work
	even with the synchronize_rcu().

3.	Some CPU, possibly different than that in #1 and #2 above,
	executes the above code.  If locking works correctly, it
	must see the earlier changes, so a subsequent access should
	see the results of #1 above, so that it won't see the element
	that was removed there.

That said, I don't claim to understand either vfs or xfs very well, so
I would be arbitrarily deeply confused.

> > If this is not the case, then readers finding it again will not be
> > protected by the RCU grace period, right?
> > 
> > In short, I don't understand why the synchronize_rcu() is needed.
> > If it is somehow helping, that sounds to me like it is covering up
> > a real bug that should be fixed separately.
> 
> It isn't covering up a bug, it was more tryingt o be consistent with
> the rest of the xfs_inode lookup failures - we back off and try
> again later. If that is unnecessary resolve the RCU lookup race,
> then it can be dropped.

OK, please let me know whether my sequence of steps above makes sense.

> > > @@ -397,7 +423,7 @@ xfs_iget(
> > >  	xfs_agino_t	agino;
> > > 
> > >  	/* reject inode numbers outside existing AGs */
> > > -	if (XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)
> > > +	if (!ino || XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)
> > 
> > For the above check to be safe, don't we need to already be in an
> > RCU read-side critical section?  Or is something else protecting us?
> 
> "ino" is the inode number used as the lookup key to find the struct
> xfs_inode. This is ensuring we don't try to look up an inode number
> of zero given it's new special meaning as a freed inode. Hence it
> can be safely validated outside the RCU read-side critical sectioni
> as it is a constant.

Ah, OK, got it!

							Thanx, Paul

> Thanks for the review, Paul. I'll fix up the issues you've pointed
> out and retest.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2010-12-15  1:03 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-12-13  1:32 [PATCH 0/2] xfs: RCU inode freeing and lookups V3 Dave Chinner
2010-12-13  1:32 ` [PATCH 1/3] xfs: rcu free inodes Dave Chinner
2010-12-14 20:23   ` Paul E. McKenney
2010-12-13  1:32 ` [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking Dave Chinner
2010-12-14 21:18   ` Paul E. McKenney
2010-12-14 23:00     ` Dave Chinner
2010-12-15  1:05       ` Paul E. McKenney [this message]
2010-12-15  2:50         ` Dave Chinner
2010-12-15  6:34           ` Paul E. McKenney
2010-12-15  3:30         ` Nick Piggin
2010-12-15  6:35           ` Dave Chinner
2010-12-15  8:05             ` Nick Piggin
2010-12-13  1:32 ` [PATCH 3/3] xfs: convert pag_ici_lock to a spin lock Dave Chinner
2010-12-14 21:19   ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101215010536.GT2161@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=david@fromorbit.com \
    --cc=eric.dumazet@gmail.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.