Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Guo Chao <yan@linux.vnet.ibm.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage
Date: Thu, 27 Sep 2012 16:41:48 +0800	[thread overview]
Message-ID: <20120927084148.GA29769@yanx> (raw)
In-Reply-To: <20120926005409.GG29154@dastard>

On Wed, Sep 26, 2012 at 10:54:09AM +1000, Dave Chinner wrote:
> On Tue, Sep 25, 2012 at 04:59:55PM +0800, Guo Chao wrote:
> > On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:
> > > @@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
> > >  static struct inode *find_inode(struct super_block *sb,
> > >  				struct hlist_head *head,
> > >  				int (*test)(struct inode *, void *),
> > > -				void *data)
> > > +				void *data, bool locked)
> > >  {
> > >  	struct hlist_node *node;
> > >  	struct inode *inode = NULL;
> > > 
> > >  repeat:
> > > -	hlist_for_each_entry(inode, node, head, i_hash) {
> > > +	rcu_read_lock();
> > > +	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
> > >  		spin_lock(&inode->i_lock);
> > > +		if (inode_unhashed(inode)) {
> > > +			spin_unlock(&inode->i_lock);
> > > +			continue;
> > > +		}
> > 
> > Is this check too early? If the unhashed inode happened to be the target
> > inode, we are wasting our time to continue the traversal and we do not wait 
> > on it.
> 
> If the inode is unhashed, then it is already passing through evict()
> or has already passed through. If it has already passed through
> evict() then it is too late to call __wait_on_freeing_inode() as the
> wakeup occurs in evict() immediately after the inode is removed
> from the hash. i.e:
> 
>         remove_inode_hash(inode);
> 
>         spin_lock(&inode->i_lock);
>         wake_up_bit(&inode->i_state, __I_NEW);
>         BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
>         spin_unlock(&inode->i_lock);
> 
> i.e. if we get the case:
> 
> Thread 1, RCU hash traversal		Thread 2, evicting foo
> 
> rcu_read_lock()
> found inode foo
> 					remove_inode_hash(inode);
> 					spin_lock(&foo->i_lock);
> 					wake_up(I_NEW)
> 					spin_unlock(&foo->i_lock);
> 					destroy_inode()
> 					......
> 	spin_lock(foo->i_lock)
> 	match sb, ino
> 	I_FREEING
> 	  rcu_read_unlock()
> 
> <rcu grace period can expire at any time now,
>  so use after free is guaranteed at some point>
> 
> 	  wait_on_freeing_inode
> 	    wait_on_bit(I_NEW)
> 
> <will never get woken>
> 
> Hence if the inode is unhashed, it doesn't matter what inode it is,
> it is never valid to use it any further because it may have already
> been freed and the only reason we can safely access here it is that
> the RCU grace period will not expire until we call
> rcu_read_unlock().
>

Yeah, looks right.
 
> > > @@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
> > >  		struct inode *old;
> > > 
> > >  		spin_lock(&inode_hash_lock);
> > > -		/* We released the lock, so.. */
> > > -		old = find_inode_fast(sb, head, ino);
> > > +		old = find_inode_fast(sb, head, ino, true);
> > >  		if (!old) {
> > >  			inode->i_ino = ino;
> > >  			spin_lock(&inode->i_lock);
> > 
> > Emmmm ... couldn't we use memory barrier API instead of irrelevant spin
> > lock on newly allocated inode to publish I_NEW?
> 
> Yes, we could.
> 
> However, having multiple synchronisation methods for a single
> variable that should only be used in certain circumstances is
> something that is easy to misunderstand and get wrong. Memory
> barriers are much more subtle and harder to understand than spin
> locks, and every memory barrier needs to be commented to explain
> what the barrier is actually protecting against.
> 
> In the case where a spin lock is guaranteed to be uncontended and
> the cache line hot in the CPU cache, it makes no sense to replace
> the spin lock with a memory barrier, especially when every other
> place we modify the i_state/i_hash fields we have to wrap them
> with i_lock....
> 
> Simple code is good code - save the complexity for something that
> needs it.
> 

Emmm, I doubt "it's simpler and need no document". 

I bet someday there will be other guys stand out and ask "why take spin 
lock on a inode which apparently does not subject to any race condition?". 

> I know that the per-sb inode lru lock is currently the hotest of the
> inode cache locks (performance limiting at somewhere in the range of
> 8-16way workloads on XFS), and I've got work in (slow) progress to
> address that.  That work will also the address the per-sb dentry LRU
> locks, which are the hotest dentry cache locks as well.
> 

Glad to hear that.

Thank your for all your explanation, especially historical ones.

Regards,
Guo Chao

next prev parent reply	other threads:[~2012-09-27  8:42 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-09-21  9:31 [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage Guo Chao
2012-09-21  9:31 ` [PATCH 1/4] fs/inode.c: do not take i_lock on newly allocated inode Guo Chao
2012-09-21  9:31 ` [PATCH 2/4] fs/inode.c: do not take i_lock in __(insert|remove)_inode_hash Guo Chao
2012-09-21  9:31 ` [PATCH 3/4] fs/inode.c: do not take i_lock when identify an inode Guo Chao
2012-09-21  9:31 ` [PATCH 4/4] fs/inode.c: always take i_lock before calling filesystem's test() method Guo Chao
2012-09-21 12:17 ` [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage Matthew Wilcox
2012-09-21 22:49 ` Dave Chinner
2012-09-24  2:42   ` Guo Chao
2012-09-24  4:23     ` Dave Chinner
2012-09-24  6:12       ` Guo Chao
2012-09-24  6:28         ` Dave Chinner
2012-09-24  7:08           ` Guo Chao
2012-09-24  8:26             ` Dave Chinner
2012-09-25  8:59               ` Guo Chao
2012-09-26  0:54                 ` Dave Chinner
2012-09-27  8:41                   ` Guo Chao [this message]
2012-09-27 11:51                     ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120927084148.GA29769@yanx \
    --to=yan@linux.vnet.ibm.com \
    --cc=david@fromorbit.com \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.