From mboxrd@z Thu Jan  1 00:00:00 1970
From: Guo Chao
Subject: Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage
Date: Thu, 27 Sep 2012 16:41:48 +0800
Message-ID: <20120927084148.GA29769@yanx>
References: <1348219866-1799-1-git-send-email-yan@linux.vnet.ibm.com>
 <20120921224912.GA20960@dastard>
 <20120924024221.GB7984@yanx>
 <20120924042343.GC20960@dastard>
 <20120924061205.GA27244@yanx>
 <20120924062812.GD20960@dastard>
 <20120924070852.GA10280@yanx>
 <20120924082654.GG20960@dastard>
 <20120925085955.GA15073@yanx>
 <20120926005409.GG29154@dastard>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120926005409.GG29154@dastard>
To: Dave Chinner
Cc: linux-fsdevel@vger.kernel.org
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Wed, Sep 26, 2012 at 10:54:09AM +1000, Dave Chinner wrote:
> On Tue, Sep 25, 2012 at 04:59:55PM +0800, Guo Chao wrote:
> > On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:
> > > @@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
> > >  static struct inode *find_inode(struct super_block *sb,
> > >  				struct hlist_head *head,
> > >  				int (*test)(struct inode *, void *),
> > > -				void *data)
> > > +				void *data, bool locked)
> > >  {
> > >  	struct hlist_node *node;
> > >  	struct inode *inode = NULL;
> > >  
> > >  repeat:
> > > -	hlist_for_each_entry(inode, node, head, i_hash) {
> > > +	rcu_read_lock();
> > > +	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
> > >  		spin_lock(&inode->i_lock);
> > > +		if (inode_unhashed(inode)) {
> > > +			spin_unlock(&inode->i_lock);
> > > +			continue;
> > > +		}
> > 
> > Is this check too early? If the unhashed inode happened to be the target
> > inode, we waste our time continuing the traversal, and we do not wait
> > on it.
> 
> If the inode is unhashed, then it is already passing through evict()
> or has already passed through. If it has already passed through
> evict() then it is too late to call __wait_on_freeing_inode() as the
> wakeup occurs in evict() immediately after the inode is removed
> from the hash. i.e.:
> 
> 	remove_inode_hash(inode);
> 
> 	spin_lock(&inode->i_lock);
> 	wake_up_bit(&inode->i_state, __I_NEW);
> 	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> 	spin_unlock(&inode->i_lock);
> 
> i.e. if we get the case:
> 
> Thread 1, RCU hash traversal		Thread 2, evicting foo
> 
> rcu_read_lock()
> found inode foo
> 					remove_inode_hash(inode);
> 					spin_lock(&foo->i_lock);
> 					wake_up(I_NEW)
> 					spin_unlock(&foo->i_lock);
> 					destroy_inode()
> 					......
> spin_lock(foo->i_lock)
> match sb, ino
>  I_FREEING
>    wait_on_freeing_inode
>      wait_on_bit(I_NEW)
> rcu_read_unlock()
> 
> so use-after-free is guaranteed at some point.
> 
> Hence if the inode is unhashed, it doesn't matter what inode it is:
> it is never valid to use it any further, because it may have already
> been freed, and the only reason we can safely access it here is that
> the RCU grace period will not expire until we call
> rcu_read_unlock().
> 

Yeah, that looks right.
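To check my understanding, the whole lookup loop with your change
applied would read roughly as below. I am paraphrasing your diff and
filling in the unquoted parts from memory of fs/inode.c, so the details
may not match your actual tree; in particular I have left out how the
"locked" flag is plumbed into __wait_on_freeing_inode():

	static struct inode *find_inode(struct super_block *sb,
					struct hlist_head *head,
					int (*test)(struct inode *, void *),
					void *data, bool locked)
	{
		struct hlist_node *node;
		struct inode *inode = NULL;

	repeat:
		rcu_read_lock();
		hlist_for_each_entry_rcu(inode, node, head, i_hash) {
			spin_lock(&inode->i_lock);
			/*
			 * Unhashed: evict() may already have issued the
			 * __I_NEW wakeup, and RCU only keeps the memory
			 * around until rcu_read_unlock(), so the inode
			 * must not be used any further.
			 */
			if (inode_unhashed(inode)) {
				spin_unlock(&inode->i_lock);
				continue;
			}
			if (inode->i_sb != sb || !test(inode, data)) {
				spin_unlock(&inode->i_lock);
				continue;
			}
			if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
				/*
				 * Still hashed, so the wakeup cannot have
				 * happened yet; leave the RCU read-side
				 * section before sleeping for it.
				 */
				rcu_read_unlock();
				__wait_on_freeing_inode(inode);
				goto repeat;
			}
			__iget(inode);
			spin_unlock(&inode->i_lock);
			rcu_read_unlock();
			return inode;
		}
		rcu_read_unlock();
		return NULL;
	}

If that is the shape of it, then yes, the early unhashed check costs
nothing: any unhashed inode is already dead to us regardless of whether
it was the one we were looking for.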
> > > @@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
> > >  		struct inode *old;
> > >  
> > >  		spin_lock(&inode_hash_lock);
> > > -		/* We released the lock, so.. */
> > > -		old = find_inode_fast(sb, head, ino);
> > > +		old = find_inode_fast(sb, head, ino, true);
> > >  		if (!old) {
> > >  			inode->i_ino = ino;
> > >  			spin_lock(&inode->i_lock);
> > 
> > Emmmm ... couldn't we use the memory barrier API, instead of a spin
> > lock that the newly allocated inode does not otherwise need, to
> > publish I_NEW?
> 
> Yes, we could.
> 
> However, having multiple synchronisation methods for a single
> variable that should only be used in certain circumstances is
> something that is easy to misunderstand and get wrong. Memory
> barriers are much more subtle and harder to understand than spin
> locks, and every memory barrier needs to be commented to explain
> what the barrier is actually protecting against.
> 
> In the case where a spin lock is guaranteed to be uncontended and
> the cache line hot in the CPU cache, it makes no sense to replace
> the spin lock with a memory barrier, especially when every other
> place we modify the i_state/i_hash fields we have to wrap them
> with i_lock....
> 
> Simple code is good code - save the complexity for something that
> needs it.
> 

Hmm, I doubt that it is really simpler and needs no documentation. I
bet someday somebody will stand up and ask "why take a spin lock on an
inode that is apparently not subject to any race condition?".

> I know that the per-sb inode lru lock is currently the hottest of the
> inode cache locks (performance limiting at somewhere in the range of
> 8-16 way workloads on XFS), and I've got work in (slow) progress to
> address that. That work will also address the per-sb dentry LRU
> locks, which are the hottest dentry cache locks as well.
> 

Glad to hear that. Thank you for all your explanations, especially the
historical ones.

Regards,
Guo Chao
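P.S. For the record, the barrier-based publication I had in mind would
look roughly like the sketch below. This is hypothetical code, not from
any tree, and it assumes the insertion side is converted to
hlist_add_head_rcu(), as the RCU walk would require anyway. Since
hlist_add_head_rcu() publishes the list pointer via rcu_assign_pointer(),
which has release semantics, the earlier stores to i_ino and i_state
would become visible to the RCU lookup before the inode is reachable on
the hash chain, without taking i_lock:

	spin_lock(&inode_hash_lock);
	old = find_inode_fast(sb, head, ino, true);
	if (!old) {
		inode->i_ino = ino;
		inode->i_state = I_NEW;
		/*
		 * No i_lock and no explicit smp_wmb(): the release
		 * barrier inside rcu_assign_pointer(), as called by
		 * hlist_add_head_rcu(), orders the stores above before
		 * the inode becomes visible on the hash chain.
		 */
		hlist_add_head_rcu(&inode->i_hash, head);
		inode_sb_list_add(inode);
		spin_unlock(&inode_hash_lock);
		...
	}

But I take your point: every other i_state update is wrapped in i_lock,
so the uncontended lock keeps the locking rules uniform and
self-documenting.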