All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: eric.dumazet@gmail.com, xfs@oss.sgi.com
Subject: Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking
Date: Wed, 15 Dec 2010 10:00:47 +1100	[thread overview]
Message-ID: <20101214230047.GC16267@dastard> (raw)
In-Reply-To: <20101214211801.GH2161@linux.vnet.ibm.com>

On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote:
> On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > With delayed logging greatly increasing the sustained parallelism of inode
> > operations, the inode cache locking is showing significant read vs write
> > contention when inode reclaim runs at the same time as lookups. There is
> > also a lot more write lock acquistions than there are read locks (4:1 ratio)
> > so the read locking is not really buying us much in the way of parallelism.
> > 
> > To avoid the read vs write contention, change the cache to use RCU locking on
> > the read side. To avoid needing to RCU free every single inode, use the built
> > in slab RCU freeing mechanism. This requires us to be able to detect lookups of
> > freed inodes, so enѕure that ever freed inode has an inode number of zero and
> > the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
> > lookup path, but also add a check for a zero inode number as well.
> > 
> > We canthen convert all the read locking lockups to use RCU read side locking
> > and hence remove all read side locking.
> 
> OK, so the xfs_inode uses straight RCU, and there fore cannot be freed and
> immediately reallocated, while the inode itself uses SLAB_DESTROY_BY_RCU,
> which does allow the inode to be freed and immediately reallocated,
> correct?

No, the struct inode is embedded in the struct xfs_inode, so they
have the same lifecycle. i.e. we don't separately allocate and free
the struct inode. So it is all using straight RCU.

> Some questions and comments below.
> 
> 							Thanx, Paul
> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Alex Elder <aelder@sgi.com>
> > ---
> >  fs/xfs/linux-2.6/xfs_sync.c |   27 ++++++++++++++++-----
> >  fs/xfs/xfs_iget.c           |   50 +++++++++++++++++++++++++++++++----------
> >  fs/xfs/xfs_inode.c          |   52 +++++++++++++++++++++++++++++++++----------
> >  3 files changed, 98 insertions(+), 31 deletions(-)
> > 
> > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> > index afb0d7c..5ee02d7 100644
> > --- a/fs/xfs/linux-2.6/xfs_sync.c
> > +++ b/fs/xfs/linux-2.6/xfs_sync.c
> > @@ -53,14 +53,20 @@ xfs_inode_ag_walk_grab(
> >  {
> >  	struct inode		*inode = VFS_I(ip);
> > 
> > +	/* check for stale RCU freed inode */
> > +	spin_lock(&ip->i_flags_lock);
> > +	if (!ip->i_ino)
> > +		goto out_unlock_noent;
> > +
> > +	/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
> > +	if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
> > +		goto out_unlock_noent;
> > +	spin_unlock(&ip->i_flags_lock);
> > +
> 
> OK, this works because the xfs_inode cannot be freed and reallocated (which
> the RCU change should enforce).  It is not clear to me that the above
> would work if using the SLAB_DESTROY_BY_RCU approach, at least not unless
> some higher-level checks can reliably catch an inode changing identity
> due to quick free and reallocation.

In this case, I don't believe it matters if the inode has changed
identity - we are walking for writeback, so if we get a reallocated
inode we'll write it back if it is dirty. If it has not been
reallocated or still being initialised, the !ip->i_ino and
XFS_INEW|XFS_IRECLAIM checks are sufficient to avoid using the inode.

I probably should add a comment to this effect, yes?

> Also, all calls to xfs_inode_ag_walk_grab() must be in RCU read-side
> critical sections...  I suggest a debug check for rcu_read_lock_held() to
> catch any call paths that might have slipped through. 

Yes, good idea.

> At first glance,
> it appears that RCU is replacing some of ->pag_ici_lock, but I could
> easily be mistaken.

Correct, it is replacing the read side of the ->pag_ici_lock.

> > @@ -639,6 +649,9 @@ xfs_reclaim_inode_grab(
> >  	struct xfs_inode	*ip,
> >  	int			flags)
> >  {
> > +	/* check for stale RCU freed inode */
> > +	if (!ip->i_ino)
> > +		return 1;
> 
> Does this mean that we need to be under rcu_read_lock() here?  If not,
> how do we keep the inode from really being freed out from under us?

Hmmm, I think I've mismerged a patch somewhere along the line. In
this patch the reclaim tree walk is under the ->pag_ici_lock(), when
in fact it should be under the rcu_read_lock(). Good catch, Paul.

As it is, being under the ->pag_ici_lock means that the tree is
consistent and we won't be seeing RCU freed inodes in the walk.
Hence the code is functioning correctly, just not as wasss intended.

> (Again, if we do need to be under rcu_read_lock(), I highly recommend
> a debug check for rcu_read_lock_held().)

Yup, which would have caught the merge screwup...

....
> > 
> > +	/*
> > +	 * check for re-use of an inode within an RCU grace period due to the
> > +	 * radix tree nodes not being updated yet. We monitor for this by
> > +	 * setting the inode number to zero before freeing the inode structure.
> > +	 * If the inode has been reallocated and set up, then the inode number
> > +	 * will not match, so check for that, too.
> > +	 */
> >  	spin_lock(&ip->i_flags_lock);
> > +	if (ip->i_ino != ino) {
> > +		trace_xfs_iget_skip(ip);
> > +		XFS_STATS_INC(xs_ig_frecycle);
> > +		spin_unlock(&ip->i_flags_lock);
> > +		rcu_read_unlock();
> > +		/* Expire the grace period so we don't trip over it again. */
> > +		synchronize_rcu();
> 
> Hmmm...  Interesting.  Wouldn't the fact that we acquired the same lock
> that was held after removing the inode guarantee that an immediate retry
> would manage not to find this same inode again?

That is what I'm not sure of. I was more worried about resolving the
contents of the radix tree nodes, not so much the inode itself. If a
new traversal will resolve the tree correctly (which is what you are
implying), then synchronize_rcu() is not needed....

> If this is not the case, then readers finding it again will not be
> protected by the RCU grace period, right?
> 
> In short, I don't understand why the synchronize_rcu() is needed.
> If it is somehow helping, that sounds to me like it is covering up
> a real bug that should be fixed separately.

It isn't covering up a bug, it was more tryingt o be consistent with
the rest of the xfs_inode lookup failures - we back off and try
again later. If that is unnecessary resolve the RCU lookup race,
then it can be dropped.

> > @@ -397,7 +423,7 @@ xfs_iget(
> >  	xfs_agino_t	agino;
> > 
> >  	/* reject inode numbers outside existing AGs */
> > -	if (XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)
> > +	if (!ino || XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)
> 
> For the above check to be safe, don't we need to already be in an
> RCU read-side critical section?  Or is something else protecting us?

"ino" is the inode number used as the lookup key to find the struct
xfs_inode. This is ensuring we don't try to look up an inode number
of zero given it's new special meaning as a freed inode. Hence it
can be safely validated outside the RCU read-side critical sectioni
as it is a constant.

Thanks for the review, Paul. I'll fix up the issues you've pointed
out and retest.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2010-12-14 22:59 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-12-13  1:32 [PATCH 0/2] xfs: RCU inode freeing and lookups V3 Dave Chinner
2010-12-13  1:32 ` [PATCH 1/3] xfs: rcu free inodes Dave Chinner
2010-12-14 20:23   ` Paul E. McKenney
2010-12-13  1:32 ` [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking Dave Chinner
2010-12-14 21:18   ` Paul E. McKenney
2010-12-14 23:00     ` Dave Chinner [this message]
2010-12-15  1:05       ` Paul E. McKenney
2010-12-15  2:50         ` Dave Chinner
2010-12-15  6:34           ` Paul E. McKenney
2010-12-15  3:30         ` Nick Piggin
2010-12-15  6:35           ` Dave Chinner
2010-12-15  8:05             ` Nick Piggin
2010-12-13  1:32 ` [PATCH 3/3] xfs: convert pag_ici_lock to a spin lock Dave Chinner
2010-12-14 21:19   ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101214230047.GC16267@dastard \
    --to=david@fromorbit.com \
    --cc=eric.dumazet@gmail.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.