From mboxrd@z Thu Jan  1 00:00:00 1970
From: Guo Chao
Subject: Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage
Date: Thu, 27 Sep 2012 16:41:48 +0800
Message-ID: <20120927084148.GA29769@yanx>
References: <1348219866-1799-1-git-send-email-yan@linux.vnet.ibm.com>
 <20120921224912.GA20960@dastard>
 <20120924024221.GB7984@yanx>
 <20120924042343.GC20960@dastard>
 <20120924061205.GA27244@yanx>
 <20120924062812.GD20960@dastard>
 <20120924070852.GA10280@yanx>
 <20120924082654.GG20960@dastard>
 <20120925085955.GA15073@yanx>
 <20120926005409.GG29154@dastard>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120926005409.GG29154@dastard>
To: Dave Chinner
Cc: linux-fsdevel@vger.kernel.org
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Wed, Sep 26, 2012 at 10:54:09AM +1000, Dave Chinner wrote:
> On Tue, Sep 25, 2012 at 04:59:55PM +0800, Guo Chao wrote:
> > On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:
> > > @@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
> > >  static struct inode *find_inode(struct super_block *sb,
> > >  				struct hlist_head *head,
> > >  				int (*test)(struct inode *, void *),
> > > -				void *data)
> > > +				void *data, bool locked)
> > >  {
> > >  	struct hlist_node *node;
> > >  	struct inode *inode = NULL;
> > >  
> > >  repeat:
> > > -	hlist_for_each_entry(inode, node, head, i_hash) {
> > > +	rcu_read_lock();
> > > +	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
> > >  		spin_lock(&inode->i_lock);
> > > +		if (inode_unhashed(inode)) {
> > > +			spin_unlock(&inode->i_lock);
> > > +			continue;
> > > +		}
> > 
> > Is this check too early? If the unhashed inode happened to be the target
> > inode, we waste our time continuing the traversal, and we do not wait
> > on it.
> 
> If the inode is unhashed, then it is already passing through evict()
> or has already passed through. If it has already passed through
> evict() then it is too late to call __wait_on_freeing_inode() as the
> wakeup occurs in evict() immediately after the inode is removed
> from the hash. i.e.:
> 
> 	remove_inode_hash(inode);
> 
> 	spin_lock(&inode->i_lock);
> 	wake_up_bit(&inode->i_state, __I_NEW);
> 	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> 	spin_unlock(&inode->i_lock);
> 
> i.e. if we get the case:
> 
> Thread 1, RCU hash traversal		Thread 2, evicting foo
> 
> rcu_read_lock()
> found inode foo
> 					remove_inode_hash(inode);
> 					spin_lock(&foo->i_lock);
> 					wake_up(I_NEW)
> 					spin_unlock(&foo->i_lock);
> 					destroy_inode()
> 					......
> spin_lock(foo->i_lock)
> match sb, ino
>  I_FREEING
>    wait_on_freeing_inode
>      wait_on_bit(I_NEW)
> rcu_read_unlock()
> 
> so use-after-free is guaranteed at some point.
> 
> Hence if the inode is unhashed, it doesn't matter what inode it is:
> it is never valid to use it any further, because it may have already
> been freed, and the only reason we can safely access it here is that
> the RCU grace period will not expire until we call
> rcu_read_unlock().
> 

Yeah, that looks right.
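To check my understanding, the whole lookup loop with your change
applied would read roughly as below. I am paraphrasing your diff and
filling in the unquoted parts from memory of fs/inode.c, so the details
may not match your actual tree; in particular I have left out how the
"locked" flag is plumbed into __wait_on_freeing_inode():

	static struct inode *find_inode(struct super_block *sb,
					struct hlist_head *head,
					int (*test)(struct inode *, void *),
					void *data, bool locked)
	{
		struct hlist_node *node;
		struct inode *inode = NULL;

	repeat:
		rcu_read_lock();
		hlist_for_each_entry_rcu(inode, node, head, i_hash) {
			spin_lock(&inode->i_lock);
			/*
			 * Unhashed: evict() may already have issued the
			 * __I_NEW wakeup, and RCU only keeps the memory
			 * around until rcu_read_unlock(), so the inode
			 * must not be used any further.
			 */
			if (inode_unhashed(inode)) {
				spin_unlock(&inode->i_lock);
				continue;
			}
			if (inode->i_sb != sb || !test(inode, data)) {
				spin_unlock(&inode->i_lock);
				continue;
			}
			if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
				/*
				 * Still hashed, so the wakeup cannot have
				 * happened yet; leave the RCU read-side
				 * section before sleeping for it.
				 */
				rcu_read_unlock();
				__wait_on_freeing_inode(inode);
				goto repeat;
			}
			__iget(inode);
			spin_unlock(&inode->i_lock);
			rcu_read_unlock();
			return inode;
		}
		rcu_read_unlock();
		return NULL;
	}

If that is the shape of it, then yes, the early unhashed check costs
nothing: any unhashed inode is already dead to us regardless of whether
it was the one we were looking for.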
> > > @@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
> > >  		struct inode *old;
> > >  
> > >  		spin_lock(&inode_hash_lock);
> > > -		/* We released the lock, so.. */
> > > -		old = find_inode_fast(sb, head, ino);
> > > +		old = find_inode_fast(sb, head, ino, true);
> > >  		if (!old) {
> > >  			inode->i_ino = ino;
> > >  			spin_lock(&inode->i_lock);
> > 
> > Emmmm ... couldn't we use the memory barrier API, instead of a spin
> > lock that the newly allocated inode does not otherwise need, to
> > publish I_NEW?
> 
> Yes, we could.
> 
> However, having multiple synchronisation methods for a single
> variable that should only be used in certain circumstances is
> something that is easy to misunderstand and get wrong. Memory
> barriers are much more subtle and harder to understand than spin
> locks, and every memory barrier needs to be commented to explain
> what the barrier is actually protecting against.
> 
> In the case where a spin lock is guaranteed to be uncontended and
> the cache line hot in the CPU cache, it makes no sense to replace
> the spin lock with a memory barrier, especially when every other
> place we modify the i_state/i_hash fields we have to wrap them
> with i_lock....
> 
> Simple code is good code - save the complexity for something that
> needs it.
> 

Hmm, I doubt that it is really simpler and needs no documentation. I
bet someday somebody will stand up and ask "why take a spin lock on an
inode that is apparently not subject to any race condition?".

> I know that the per-sb inode lru lock is currently the hottest of the
> inode cache locks (performance limiting at somewhere in the range of
> 8-16 way workloads on XFS), and I've got work in (slow) progress to
> address that. That work will also address the per-sb dentry LRU
> locks, which are the hottest dentry cache locks as well.
> 

Glad to hear that. Thank you for all your explanations, especially the
historical ones.

Regards,
Guo Chao
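P.S. For the record, the barrier-based publication I had in mind would
look roughly like the sketch below. This is hypothetical code, not from
any tree, and it assumes the insertion side is converted to
hlist_add_head_rcu(), as the RCU walk would require anyway. Since
hlist_add_head_rcu() publishes the list pointer via rcu_assign_pointer(),
which has release semantics, the earlier stores to i_ino and i_state
would become visible to the RCU lookup before the inode is reachable on
the hash chain, without taking i_lock:

	spin_lock(&inode_hash_lock);
	old = find_inode_fast(sb, head, ino, true);
	if (!old) {
		inode->i_ino = ino;
		inode->i_state = I_NEW;
		/*
		 * No i_lock and no explicit smp_wmb(): the release
		 * barrier inside rcu_assign_pointer(), as called by
		 * hlist_add_head_rcu(), orders the stores above before
		 * the inode becomes visible on the hash chain.
		 */
		hlist_add_head_rcu(&inode->i_hash, head);
		inode_sb_list_add(inode);
		spin_unlock(&inode_hash_lock);
		...
	}

But I take your point: every other i_state update is wrapped in i_lock,
so the uncontended lock keeps the locking rules uniform and
self-documenting.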