From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Dave Chinner <david@fromorbit.com>
Cc: eric.dumazet@gmail.com, xfs@oss.sgi.com
Subject: Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking
Date: Tue, 14 Dec 2010 22:34:35 -0800
Message-ID: <20101215063435.GB2217@linux.vnet.ibm.com>
In-Reply-To: <20101215025049.GE9925@dastard>
On Wed, Dec 15, 2010 at 01:50:49PM +1100, Dave Chinner wrote:
> On Tue, Dec 14, 2010 at 05:05:36PM -0800, Paul E. McKenney wrote:
> > On Wed, Dec 15, 2010 at 10:00:47AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote:
> > > > > + /*
> > > > > + * check for re-use of an inode within an RCU grace period due to the
> > > > > + * radix tree nodes not being updated yet. We monitor for this by
> > > > > + * setting the inode number to zero before freeing the inode structure.
> > > > > + * If the inode has been reallocated and set up, then the inode number
> > > > > + * will not match, so check for that, too.
> > > > > + */
> > > > > spin_lock(&ip->i_flags_lock);
> > > > > + if (ip->i_ino != ino) {
> > > > > + trace_xfs_iget_skip(ip);
> > > > > + XFS_STATS_INC(xs_ig_frecycle);
> > > > > + spin_unlock(&ip->i_flags_lock);
> > > > > + rcu_read_unlock();
> > > > > + /* Expire the grace period so we don't trip over it again. */
> > > > > + synchronize_rcu();
> > > >
> > > > Hmmm... Interesting. Wouldn't the fact that we acquired the same lock
> > > > that was held after removing the inode guarantee that an immediate retry
> > > > would manage not to find this same inode again?
> > >
> > > That is what I'm not sure of. I was more worried about resolving the
> > > contents of the radix tree nodes, not so much the inode itself. If a
> > > new traversal will resolve the tree correctly (which is what you are
> > > implying), then synchronize_rcu() is not needed....
> >
> > Here is the sequence of events that I believe must be in place:
>
> s/vfs_inode/xfs_inode/g
Good point!
> > 1. Some CPU removes the vfs_inode from the radix tree, presumably
> > while holding some lock whose identity does not matter here.
>
> Yes, the ->pag_ici_lock.
OK.
> > 2. Before invoking call_rcu() on a given vfs_inode, this same
> > CPU clears the inode number while holding ->i_flags_lock.
>
> Not necessarily the same CPU - there are points where we take
> sleeping locks for synchronisation with any remaining users, and
> we don't use preempt_disable() to prevent a change of CPU on a
> preemptible kernel.
>
> > If the CPU in #1 and #2 might be different, then either
> > CPU #1 must hold ->i_flags_lock while removing the vfs_inode
> > from the radix tree, or I don't understand how this can work
> > even with the synchronize_rcu().
>
> I'm probably missing something, but why does the CPU we run
> call_rcu() on to free the inode matter w.r.t. which CPU it was
> deleted from the radix tree on?
I was oversimplifying. What matters is that the item was deleted
from the radix tree unambiguously before it was passed to call_rcu().
There are a number of ways to make this happen:
1. Do the removal and the call_rcu() on the same CPU, in that
order.
2. Do the removal while holding a given lock, and do the call_rcu()
under a later critical section for that same lock.
3. Do the removal while holding lock A on CPU 1, then later
acquire lock B on CPU 1, and then do the call_rcu() after
a later acquisition of lock B on some other CPU.
There are a bunch of other variations on this theme, but the key
requirement is again that the call_rcu() happen unambiguously after
the removal. Otherwise, how is the grace period supposed to
guarantee that all RCU readers that might be accessing the removed
xfs_inode really have completed?
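To make variant 2 above concrete, here is a minimal sketch; the lock,
tree, and callback names are invented for illustration and are not
taken from the XFS code:

	/* CPU A: remove the item while holding the lock. */
	spin_lock(&tree_lock);
	radix_tree_delete(&tree_root, index);
	spin_unlock(&tree_lock);

	/* CPU B, any time later: call_rcu() under the same lock. */
	spin_lock(&tree_lock);
	call_rcu(&item->rcu_head, item_free_callback);
	spin_unlock(&tree_lock);

The acquisition of tree_lock on CPU B orders after its release on
CPU A, so the grace period started by the call_rcu() is guaranteed
to begin after the item has been unlinked from the tree.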
> There is this comment in include/linux/radix-tree.h:
>
> * It is still required that the caller manage the synchronization and lifetimes
> * of the items. So if RCU lock-free lookups are used, typically this would mean
> * that the items have their own locks, or are amenable to lock-free access; and
> * that the items are freed by RCU (or only freed after having been deleted from
> * the radix tree *and* a synchronize_rcu() grace period).
>
> There is nothing there that mentions the items need to be freed on
> the same CPU they were removed from the radix tree on, or that the
> item lock needs to be held when the item is removed from the tree.
> AFAICT, the XFS code is following these guidelines.
Well, the grace period (from either synchronize_rcu() or call_rcu())
does need to start unambiguously after the deletion from the radix tree.
Should we upgrade the comment?
> FWIW, this is where I got the idea of using synchronize_rcu() to
> ensure a lookup retry wouldn't see the same freed inode. I'm
> thinking that simply re-running the lookup will give the same
> guarantee because of the memory barriers in the radix tree lookup
> code...
Maybe... But we do need to be sure, right?
> > 3. Some CPU, possibly different than that in #1 and #2 above,
> > executes the above code. If locking works correctly, it
> > must see the earlier changes, so a subsequent access should
> > see the results of #1 above, so that it won't see the element
> > that was removed there.
>
> Isn't the item validity considered separately to the tree traversal
> consistency? i.e. the item lock (i.e. ->i_flags_lock) provides a
> safe item validity check via the unlock->lock memory barrier, whilst
> the radix tree uses rcu_dereference() to provide memory barriers
> against the modifications?
But the safe validity check assumes that the RCU grace period starts
unambiguously after the item has been removed. Violate that assumption,
and all bets are off.
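For reference, the handshake under discussion looks roughly like the
following. This is a paraphrase of the patch, not the exact code, and
the free-side rcu_head and callback names are placeholders:

	/* Freeing side: invalidate the inode before handing it to RCU. */
	spin_lock(&ip->i_flags_lock);
	ip->i_ino = 0;		/* mark the inode dead for RCU walkers */
	spin_unlock(&ip->i_flags_lock);
	call_rcu(&head, free_callback);

	/* Lookup side (the hunk quoted above): */
	rcu_read_lock();
	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
	if (ip) {
		spin_lock(&ip->i_flags_lock);
		if (ip->i_ino != ino) {
			/* dead or reused inode: unlock and retry */
		}
	}

The unlock->lock transition on ->i_flags_lock makes the i_ino check
reliable only if the grace period begins after the removal, which is
the ordering requirement described above.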
> > That said, I don't claim to understand either vfs or xfs very well, so
> > I would be arbitrarily deeply confused.
>
> We might both be confused ;)
Sounds like the most likely possibility, now that you mention it. ;-)
Thanx, Paul