From: Dave Chinner <david@fromorbit.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: eric.dumazet@gmail.com, xfs@oss.sgi.com
Subject: Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking
Date: Wed, 15 Dec 2010 10:00:47 +1100 [thread overview]
Message-ID: <20101214230047.GC16267@dastard> (raw)
In-Reply-To: <20101214211801.GH2161@linux.vnet.ibm.com>
On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote:
> On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > With delayed logging greatly increasing the sustained parallelism of inode
> > operations, the inode cache locking is showing significant read vs write
> > contention when inode reclaim runs at the same time as lookups. There is
> > also a lot more write lock acquistions than there are read locks (4:1 ratio)
> > so the read locking is not really buying us much in the way of parallelism.
> >
> > To avoid the read vs write contention, change the cache to use RCU locking on
> > the read side. To avoid needing to RCU free every single inode, use the built
> > in slab RCU freeing mechanism. This requires us to be able to detect lookups of
> > freed inodes, so enѕure that ever freed inode has an inode number of zero and
> > the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
> > lookup path, but also add a check for a zero inode number as well.
> >
> > We canthen convert all the read locking lockups to use RCU read side locking
> > and hence remove all read side locking.
>
> OK, so the xfs_inode uses straight RCU, and there fore cannot be freed and
> immediately reallocated, while the inode itself uses SLAB_DESTROY_BY_RCU,
> which does allow the inode to be freed and immediately reallocated,
> correct?
No, the struct inode is embedded in the struct xfs_inode, so they
have the same lifecycle. i.e. we don't separately allocate and free
the struct inode. So it is all using straight RCU.
> Some questions and comments below.
>
> Thanx, Paul
>
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Alex Elder <aelder@sgi.com>
> > ---
> > fs/xfs/linux-2.6/xfs_sync.c | 27 ++++++++++++++++-----
> > fs/xfs/xfs_iget.c | 50 +++++++++++++++++++++++++++++++----------
> > fs/xfs/xfs_inode.c | 52 +++++++++++++++++++++++++++++++++----------
> > 3 files changed, 98 insertions(+), 31 deletions(-)
> >
> > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> > index afb0d7c..5ee02d7 100644
> > --- a/fs/xfs/linux-2.6/xfs_sync.c
> > +++ b/fs/xfs/linux-2.6/xfs_sync.c
> > @@ -53,14 +53,20 @@ xfs_inode_ag_walk_grab(
> > {
> > struct inode *inode = VFS_I(ip);
> >
> > + /* check for stale RCU freed inode */
> > + spin_lock(&ip->i_flags_lock);
> > + if (!ip->i_ino)
> > + goto out_unlock_noent;
> > +
> > + /* avoid new or reclaimable inodes. Leave for reclaim code to flush */
> > + if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
> > + goto out_unlock_noent;
> > + spin_unlock(&ip->i_flags_lock);
> > +
>
> OK, this works because the xfs_inode cannot be freed and reallocated (which
> the RCU change should enforce). It is not clear to me that the above
> would work if using the SLAB_DESTROY_BY_RCU approach, at least not unless
> some higher-level checks can reliably catch an inode changing identity
> due to quick free and reallocation.
In this case, I don't believe it matters if the inode has changed
identity - we are walking for writeback, so if we get a reallocated
inode we'll write it back if it is dirty. If it has not been
reallocated or still being initialised, the !ip->i_ino and
XFS_INEW|XFS_IRECLAIM checks are sufficient to avoid using the inode.
I probably should add a comment to this effect, yes?
> Also, all calls to xfs_inode_ag_walk_grab() must be in RCU read-side
> critical sections... I suggest a debug check for rcu_read_lock_held() to
> catch any call paths that might have slipped through.
Yes, good idea.
> At first glance,
> it appears that RCU is replacing some of ->pag_ici_lock, but I could
> easily be mistaken.
Correct, it is replacing the read side of the ->pag_ici_lock.
> > @@ -639,6 +649,9 @@ xfs_reclaim_inode_grab(
> > struct xfs_inode *ip,
> > int flags)
> > {
> > + /* check for stale RCU freed inode */
> > + if (!ip->i_ino)
> > + return 1;
>
> Does this mean that we need to be under rcu_read_lock() here? If not,
> how do we keep the inode from really being freed out from under us?
Hmmm, I think I've mismerged a patch somewhere along the line. In
this patch the reclaim tree walk is under the ->pag_ici_lock(), when
in fact it should be under the rcu_read_lock(). Good catch, Paul.
As it is, being under the ->pag_ici_lock means that the tree is
consistent and we won't be seeing RCU freed inodes in the walk.
Hence the code is functioning correctly, just not as wasss intended.
> (Again, if we do need to be under rcu_read_lock(), I highly recommend
> a debug check for rcu_read_lock_held().)
Yup, which would have caught the merge screwup...
....
> >
> > + /*
> > + * check for re-use of an inode within an RCU grace period due to the
> > + * radix tree nodes not being updated yet. We monitor for this by
> > + * setting the inode number to zero before freeing the inode structure.
> > + * If the inode has been reallocated and set up, then the inode number
> > + * will not match, so check for that, too.
> > + */
> > spin_lock(&ip->i_flags_lock);
> > + if (ip->i_ino != ino) {
> > + trace_xfs_iget_skip(ip);
> > + XFS_STATS_INC(xs_ig_frecycle);
> > + spin_unlock(&ip->i_flags_lock);
> > + rcu_read_unlock();
> > + /* Expire the grace period so we don't trip over it again. */
> > + synchronize_rcu();
>
> Hmmm... Interesting. Wouldn't the fact that we acquired the same lock
> that was held after removing the inode guarantee that an immediate retry
> would manage not to find this same inode again?
That is what I'm not sure of. I was more worried about resolving the
contents of the radix tree nodes, not so much the inode itself. If a
new traversal will resolve the tree correctly (which is what you are
implying), then synchronize_rcu() is not needed....
> If this is not the case, then readers finding it again will not be
> protected by the RCU grace period, right?
>
> In short, I don't understand why the synchronize_rcu() is needed.
> If it is somehow helping, that sounds to me like it is covering up
> a real bug that should be fixed separately.
It isn't covering up a bug, it was more tryingt o be consistent with
the rest of the xfs_inode lookup failures - we back off and try
again later. If that is unnecessary resolve the RCU lookup race,
then it can be dropped.
> > @@ -397,7 +423,7 @@ xfs_iget(
> > xfs_agino_t agino;
> >
> > /* reject inode numbers outside existing AGs */
> > - if (XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)
> > + if (!ino || XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)
>
> For the above check to be safe, don't we need to already be in an
> RCU read-side critical section? Or is something else protecting us?
"ino" is the inode number used as the lookup key to find the struct
xfs_inode. This is ensuring we don't try to look up an inode number
of zero given it's new special meaning as a freed inode. Hence it
can be safely validated outside the RCU read-side critical sectioni
as it is a constant.
Thanks for the review, Paul. I'll fix up the issues you've pointed
out and retest.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2010-12-14 22:59 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-12-13 1:32 [PATCH 0/2] xfs: RCU inode freeing and lookups V3 Dave Chinner
2010-12-13 1:32 ` [PATCH 1/3] xfs: rcu free inodes Dave Chinner
2010-12-14 20:23 ` Paul E. McKenney
2010-12-13 1:32 ` [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking Dave Chinner
2010-12-14 21:18 ` Paul E. McKenney
2010-12-14 23:00 ` Dave Chinner [this message]
2010-12-15 1:05 ` Paul E. McKenney
2010-12-15 2:50 ` Dave Chinner
2010-12-15 6:34 ` Paul E. McKenney
2010-12-15 3:30 ` Nick Piggin
2010-12-15 6:35 ` Dave Chinner
2010-12-15 8:05 ` Nick Piggin
2010-12-13 1:32 ` [PATCH 3/3] xfs: convert pag_ici_lock to a spin lock Dave Chinner
2010-12-14 21:19 ` Paul E. McKenney
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20101214230047.GC16267@dastard \
--to=david@fromorbit.com \
--cc=eric.dumazet@gmail.com \
--cc=paulmck@linux.vnet.ibm.com \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox