From: Guo Chao <yan@linux.vnet.ibm.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage
Date: Thu, 27 Sep 2012 16:41:48 +0800
Message-ID: <20120927084148.GA29769@yanx>
In-Reply-To: <20120926005409.GG29154@dastard>

On Wed, Sep 26, 2012 at 10:54:09AM +1000, Dave Chinner wrote:
> On Tue, Sep 25, 2012 at 04:59:55PM +0800, Guo Chao wrote:
> > On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:
> > > @@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
> > >  static struct inode *find_inode(struct super_block *sb,
> > >  				struct hlist_head *head,
> > >  				int (*test)(struct inode *, void *),
> > > -				void *data)
> > > +				void *data, bool locked)
> > >  {
> > >  	struct hlist_node *node;
> > >  	struct inode *inode = NULL;
> > > 
> > >  repeat:
> > > -	hlist_for_each_entry(inode, node, head, i_hash) {
> > > +	rcu_read_lock();
> > > +	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
> > >  		spin_lock(&inode->i_lock);
> > > +		if (inode_unhashed(inode)) {
> > > +			spin_unlock(&inode->i_lock);
> > > +			continue;
> > > +		}
> > 
> > Is this check too early? If the unhashed inode happens to be the target
> > inode, we are wasting time continuing the traversal, and we never wait
> > on it.
> 
> If the inode is unhashed, then it is already passing through evict()
> or has already passed through. If it has already passed through
> evict() then it is too late to call __wait_on_freeing_inode() as the
> wakeup occurs in evict() immediately after the inode is removed
> from the hash. i.e:
> 
>         remove_inode_hash(inode);
> 
>         spin_lock(&inode->i_lock);
>         wake_up_bit(&inode->i_state, __I_NEW);
>         BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
>         spin_unlock(&inode->i_lock);
> 
> i.e. if we get the case:
> 
> Thread 1, RCU hash traversal		Thread 2, evicting foo
> 
> rcu_read_lock()
> found inode foo
> 					remove_inode_hash(inode);
> 					spin_lock(&foo->i_lock);
> 					wake_up(I_NEW)
> 					spin_unlock(&foo->i_lock);
> 					destroy_inode()
> 					......
> 	spin_lock(foo->i_lock)
> 	match sb, ino
> 	I_FREEING
> 	  rcu_read_unlock()
> 
> <rcu grace period can expire at any time now,
>  so use after free is guaranteed at some point>
> 
> 	  wait_on_freeing_inode
> 	    wait_on_bit(I_NEW)
> 
> <will never get woken>
> 
> Hence if the inode is unhashed, it doesn't matter what inode it is,
> it is never valid to use it any further because it may have already
> been freed, and the only reason we can safely access it here is that
> the RCU grace period will not expire until we call
> rcu_read_unlock().
>

Yeah, looks right.
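
For the record, my reading of __wait_on_freeing_inode() (roughly what is in
fs/inode.c around this time, quoted from memory, so take the details with a
grain of salt):

	static void __wait_on_freeing_inode(struct inode *inode)
	{
		wait_queue_head_t *wq;
		DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);

		wq = bit_waitqueue(&inode->i_state, __I_NEW);
		prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
		/* drop the locks so the evictor can make progress */
		spin_unlock(&inode->i_lock);
		spin_unlock(&inode_hash_lock);
		schedule();
		finish_wait(wq, &wait.wait);
		spin_lock(&inode_hash_lock);
	}

The only wakeup a dying inode ever gets is the wake_up_bit() in evict(),
issued right after remove_inode_hash(). An unhashed inode may therefore
already be past that point, so parking ourselves on its i_state word could
mean sleeping forever on memory that is about to be (or already has been)
freed once the grace period ends. Skipping unhashed inodes up front is the
only safe option.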
 
> > > @@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
> > >  		struct inode *old;
> > > 
> > >  		spin_lock(&inode_hash_lock);
> > > -		/* We released the lock, so.. */
> > > -		old = find_inode_fast(sb, head, ino);
> > > +		old = find_inode_fast(sb, head, ino, true);
> > >  		if (!old) {
> > >  			inode->i_ino = ino;
> > >  			spin_lock(&inode->i_lock);
> > 
> > Emmmm ... couldn't we use the memory barrier API instead of an irrelevant
> > spin lock on a newly allocated inode to publish I_NEW?
> 
> Yes, we could.
> 
> However, having multiple synchronisation methods for a single
> variable that should only be used in certain circumstances is
> something that is easy to misunderstand and get wrong. Memory
> barriers are much more subtle and harder to understand than spin
> locks, and every memory barrier needs to be commented to explain
> what the barrier is actually protecting against.
> 
> In the case where a spin lock is guaranteed to be uncontended and
> the cache line hot in the CPU cache, it makes no sense to replace
> the spin lock with a memory barrier, especially when every other
> place we modify the i_state/i_hash fields we have to wrap them
> with i_lock....
> 
> Simple code is good code - save the complexity for something that
> needs it.
> 

Emmm, I am not so sure about the "it's simpler and needs no documentation" part.

I bet someday somebody will stand up and ask "why take a spin lock on an
inode which apparently is not subject to any race condition?".
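
Just to make the suggestion concrete, the shape I had in mind was something
like the fragment below. It is untested and purely illustrative, not a
proposed patch: ino, head and sb are the locals from the quoted hunks, and
the barrier placement is only a sketch.

	/* publishing side, after the new inode won the race in iget_locked() */
	inode->i_ino = ino;
	inode->i_state = I_NEW;
	smp_wmb();		/* order the stores above before the hash insert */
	hlist_add_head(&inode->i_hash, head);

	/* lookup side, inside find_inode_fast()'s hash-chain loop */
	smp_rmb();		/* pairs with the smp_wmb() above */
	if (inode->i_ino != ino || inode->i_sb != sb)
		continue;	/* not the inode we are looking for */

Whether that ends up clearer than an uncontended spin_lock()/spin_unlock()
pair is, of course, exactly the question.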

> I know that the per-sb inode LRU lock is currently the hottest of the
> inode cache locks (performance limiting somewhere in the range of
> 8-16 way workloads on XFS), and I've got work in (slow) progress to
> address that.  That work will also address the per-sb dentry LRU
> locks, which are the hottest dentry cache locks as well.
> 

Glad to hear that.

Thank you for all your explanations, especially the historical ones.

Regards,
Guo Chao



Thread overview: 17+ messages
2012-09-21  9:31 [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage Guo Chao
2012-09-21  9:31 ` [PATCH 1/4] fs/inode.c: do not take i_lock on newly allocated inode Guo Chao
2012-09-21  9:31 ` [PATCH 2/4] fs/inode.c: do not take i_lock in __(insert|remove)_inode_hash Guo Chao
2012-09-21  9:31 ` [PATCH 3/4] fs/inode.c: do not take i_lock when identify an inode Guo Chao
2012-09-21  9:31 ` [PATCH 4/4] fs/inode.c: always take i_lock before calling filesystem's test() method Guo Chao
2012-09-21 12:17 ` [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage Matthew Wilcox
2012-09-21 22:49 ` Dave Chinner
2012-09-24  2:42   ` Guo Chao
2012-09-24  4:23     ` Dave Chinner
2012-09-24  6:12       ` Guo Chao
2012-09-24  6:28         ` Dave Chinner
2012-09-24  7:08           ` Guo Chao
2012-09-24  8:26             ` Dave Chinner
2012-09-25  8:59               ` Guo Chao
2012-09-26  0:54                 ` Dave Chinner
2012-09-27  8:41                   ` Guo Chao [this message]
2012-09-27 11:51                     ` Dave Chinner
