All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Dave Chinner <david@fromorbit.com>
Cc: eric.dumazet@gmail.com, xfs@oss.sgi.com
Subject: Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking
Date: Tue, 14 Dec 2010 22:34:35 -0800	[thread overview]
Message-ID: <20101215063435.GB2217@linux.vnet.ibm.com> (raw)
In-Reply-To: <20101215025049.GE9925@dastard>

On Wed, Dec 15, 2010 at 01:50:49PM +1100, Dave Chinner wrote:
> On Tue, Dec 14, 2010 at 05:05:36PM -0800, Paul E. McKenney wrote:
> > On Wed, Dec 15, 2010 at 10:00:47AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote:
> > > > > +	/*
> > > > > +	 * check for re-use of an inode within an RCU grace period due to the
> > > > > +	 * radix tree nodes not being updated yet. We monitor for this by
> > > > > +	 * setting the inode number to zero before freeing the inode structure.
> > > > > +	 * If the inode has been reallocated and set up, then the inode number
> > > > > +	 * will not match, so check for that, too.
> > > > > +	 */
> > > > >  	spin_lock(&ip->i_flags_lock);
> > > > > +	if (ip->i_ino != ino) {
> > > > > +		trace_xfs_iget_skip(ip);
> > > > > +		XFS_STATS_INC(xs_ig_frecycle);
> > > > > +		spin_unlock(&ip->i_flags_lock);
> > > > > +		rcu_read_unlock();
> > > > > +		/* Expire the grace period so we don't trip over it again. */
> > > > > +		synchronize_rcu();
> > > > 
> > > > Hmmm...  Interesting.  Wouldn't the fact that we acquired the same lock
> > > > that was held after removing the inode guarantee that an immediate retry
> > > > would manage not to find this same inode again?
> > > 
> > > That is what I'm not sure of. I was more worried about resolving the
> > > contents of the radix tree nodes, not so much the inode itself. If a
> > > new traversal will resolve the tree correctly (which is what you are
> > > implying), then synchronize_rcu() is not needed....
> > 
> > Here is the sequence of events that I believe must be in place:
> 
> s/vfs_inode/xfs_inode/g

Good point!

> > 1.	Some CPU removes the vfs_inode from the radix tree, presumably
> > 	while holding some lock whose identity does not matter here.
> 
> Yes, the ->pag_ici_lock.

OK.

> > 2.	Before invoking call_rcu() on a given vfs_inode, this same
> > 	CPU clears the inode number while holding ->i_flags_lock.
> 
> Not necessarily the same CPU - there are points where we take
> sleeping locks for synchronisation with any remaining users, and
> we don't use preempt_disable() to prevent a change of CPU on a
> preemptible kernel.
> 
> > 	If the CPU in #1 and #2 might be different, then either
> > 	CPU #1 must hold ->i_flags_lock while removing the vfs_inode
> > 	from the radix tree, or I don't understand how this can work
> > 	even with the synchronize_rcu().
> 
> I'm probably missing something, but why does the CPU we run
> call_rcu() to free the inode on matter w.r.t. which CPU it was
> deleted from the radix tree on?

I was oversimplifying.  What matters is that the item was deleted
from the radix tree unambiguously before it was passed to call_rcu().
There are a number of ways to make this happen:

1.	Do the removal and the call_rcu() on the same CPU, in that
	order.

2.	Do the removal while holding a given lock, and do the call_rcu()
	under a later critical section for that same lock.

3.	Do the removal while holding lock A one CPU 1, then later
	acquire lock B on CPU 1, and then do the call_rcu() after
	a later acquisition of lock B on some other CPU.

There are a bunch of other variations on this theme, but the key
requirement is again that the call_rcu() happen unambiguously after
the removal.  Otherwise, how is the grace period supposed to
guarantee that all RCU readers that might be accessing the removed
xfs_inode really have completed?

> There is this comment in include/linux/radix-tree.h:
> 
>  * It is still required that the caller manage the synchronization and lifetimes
>  * of the items. So if RCU lock-free lookups are used, typically this would mean
>  * that the items have their own locks, or are amenable to lock-free access; and
>  * that the items are freed by RCU (or only freed after having been deleted from
>  * the radix tree *and* a synchronize_rcu() grace period).
> 
> There is nothing there that mentions the items need to be deleted on
> the same CPU as they were removed from the radix tree or that the
> item lock needs to be held when the item is removed from the tree.
> AFAICT, the XFS code is following these guidelines.

Well, the grace period (from either synchronize_rcu() or call_rcu())
does need to start unambiguously after the deletion from the radix tree.
Should we upgrade the comment?

> FWIW, this is where is got the idea of using synchronize_rcu() to
> ensure a llokup retry wouldn't see the same freed inode. I'm
> thinking that simply re-running the lookup will give the same
> guarantee because of the memory barriers in the radix tree lookup
> code...

Maybe...  But we do need to be sure, right?

> > 3.	Some CPU, possibly different than that in #1 and #2 above,
> > 	executes the above code.  If locking works correctly, it
> > 	must see the earlier changes, so a subsequent access should
> > 	see the results of #1 above, so that it won't see the element
> > 	that was removed there.
> 
> Isn't the item validity considered separately to the tree traversal
> consistency? i.e. the item lock (i.e. ->i_flags_lock) provides a
> safe item validity check via the unlock->lock memory barrier, whilst
> the radix tree uses rcu_dereference() to provide memory barriers
> against the modifications?

But the safe validity check assumes that the RCU grace period starts
unambiguously after the item has been removed.  Violate that assumption,
and all bets are off.

> > That said, I don't claim to understand either vfs or xfs very well, so
> > I would be arbitrarily deeply confused.
> 
> We might both be confused ;)

Sounds like the most likely possibility, now that you mention it.  ;-)

							Thanx, Paul

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2010-12-15  6:32 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-12-13  1:32 [PATCH 0/2] xfs: RCU inode freeing and lookups V3 Dave Chinner
2010-12-13  1:32 ` [PATCH 1/3] xfs: rcu free inodes Dave Chinner
2010-12-14 20:23   ` Paul E. McKenney
2010-12-13  1:32 ` [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking Dave Chinner
2010-12-14 21:18   ` Paul E. McKenney
2010-12-14 23:00     ` Dave Chinner
2010-12-15  1:05       ` Paul E. McKenney
2010-12-15  2:50         ` Dave Chinner
2010-12-15  6:34           ` Paul E. McKenney [this message]
2010-12-15  3:30         ` Nick Piggin
2010-12-15  6:35           ` Dave Chinner
2010-12-15  8:05             ` Nick Piggin
2010-12-13  1:32 ` [PATCH 3/3] xfs: convert pag_ici_lock to a spin lock Dave Chinner
2010-12-14 21:19   ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101215063435.GB2217@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=david@fromorbit.com \
    --cc=eric.dumazet@gmail.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.