From: Guo Chao <yan@linux.vnet.ibm.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage
Date: Thu, 27 Sep 2012 16:41:48 +0800
Message-ID: <20120927084148.GA29769@yanx>
In-Reply-To: <20120926005409.GG29154@dastard>
On Wed, Sep 26, 2012 at 10:54:09AM +1000, Dave Chinner wrote:
> On Tue, Sep 25, 2012 at 04:59:55PM +0800, Guo Chao wrote:
> > On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:
> > > @@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
> > >  static struct inode *find_inode(struct super_block *sb,
> > >  				struct hlist_head *head,
> > >  				int (*test)(struct inode *, void *),
> > > -				void *data)
> > > +				void *data, bool locked)
> > >  {
> > >  	struct hlist_node *node;
> > >  	struct inode *inode = NULL;
> > >
> > >  repeat:
> > > -	hlist_for_each_entry(inode, node, head, i_hash) {
> > > +	rcu_read_lock();
> > > +	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
> > >  		spin_lock(&inode->i_lock);
> > > +		if (inode_unhashed(inode)) {
> > > +			spin_unlock(&inode->i_lock);
> > > +			continue;
> > > +		}
> >
> > Is this check too early? If the unhashed inode happens to be the target
> > inode, we waste time continuing the traversal and we never wait on it.
>
> If the inode is unhashed, then it is already passing through evict()
> or has already passed through. If it has already passed through
> evict() then it is too late to call __wait_on_freeing_inode() as the
> wakeup occurs in evict() immediately after the inode is removed
> from the hash. i.e:
>
> remove_inode_hash(inode);
>
> spin_lock(&inode->i_lock);
> wake_up_bit(&inode->i_state, __I_NEW);
> BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> spin_unlock(&inode->i_lock);
>
> i.e. if we get the case:
>
> Thread 1, RCU hash traversal            Thread 2, evicting foo
>
> rcu_read_lock()
> found inode foo
>                                         remove_inode_hash(inode);
>                                         spin_lock(&foo->i_lock);
>                                         wake_up(I_NEW)
>                                         spin_unlock(&foo->i_lock);
>                                         destroy_inode()
>                                         ......
> spin_lock(foo->i_lock)
> match sb, ino
> I_FREEING
>   rcu_read_unlock()
>
> <rcu grace period can expire at any time now,
>  so use after free is guaranteed at some point>
>
>   wait_on_freeing_inode
>     wait_on_bit(I_NEW)
>
> <will never get woken>
>
> Hence if the inode is unhashed, it doesn't matter what inode it is,
> it is never valid to use it any further because it may have already
> been freed and the only reason we can safely access it here is that
> the RCU grace period will not expire until we call
> rcu_read_unlock().
>
Yeah, looks right.
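To check my understanding, the rule can be modelled in plain userspace C.
This is only a sketch under my own assumptions: the model_* names, the flag
value and the result enum are hypothetical and are not kernel identifiers,
and i_lock and RCU are reduced to comments. An entry that a reader can still
reach but that is already unhashed is skipped no matter which inode it is.
Only a hashed entry in I_FREEING may be waited on because, per the
explanation above, the wakeup is issued only after the inode leaves the hash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define MODEL_I_FREEING 0x1u

struct model_inode {
	unsigned long ino;
	unsigned int state;
	bool hashed;              /* models !inode_unhashed(inode) */
	struct model_inode *next; /* chain link a reader may still follow */
};

enum model_result { MODEL_MISS, MODEL_HIT, MODEL_MUST_WAIT };

/*
 * Walk one hash chain. In the real code this would run under
 * rcu_read_lock() with i_lock taken per entry; here it is single-threaded.
 */
static enum model_result model_find(struct model_inode *head,
				    unsigned long ino,
				    struct model_inode **out)
{
	assert(out != NULL);
	*out = NULL;

	for (struct model_inode *i = head; i != NULL; i = i->next) {
		/* Unhashed: the wakeup may already be gone, never use it. */
		if (!i->hashed)
			continue;
		if (i->ino != ino)
			continue;
		*out = i;
		/* Still hashed and being freed: the wakeup is yet to come. */
		if (i->state & MODEL_I_FREEING)
			return MODEL_MUST_WAIT;
		return MODEL_HIT;
	}
	return MODEL_MISS;
}
```

With a chain holding an unhashed inode 1, a hashed but freeing inode 2 and a
live inode 3, a lookup of 1 is a miss, a lookup of 2 asks the caller to wait
and a lookup of 3 is a hit.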
> > > @@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
> > >  		struct inode *old;
> > >
> > >  		spin_lock(&inode_hash_lock);
> > > -		/* We released the lock, so.. */
> > > -		old = find_inode_fast(sb, head, ino);
> > > +		old = find_inode_fast(sb, head, ino, true);
> > >  		if (!old) {
> > >  			inode->i_ino = ino;
> > >  			spin_lock(&inode->i_lock);
> >
> > Emmmm ... couldn't we use the memory barrier API instead of an irrelevant
> > spin lock on a newly allocated inode to publish I_NEW?
>
> Yes, we could.
>
> However, having multiple synchronisation methods for a single
> variable that should only be used in certain circumstances is
> something that is easy to misunderstand and get wrong. Memory
> barriers are much more subtle and harder to understand than spin
> locks, and every memory barrier needs to be commented to explain
> what the barrier is actually protecting against.
>
> In the case where a spin lock is guaranteed to be uncontended and
> the cache line hot in the CPU cache, it makes no sense to replace
> the spin lock with a memory barrier, especially when every other
> place we modify the i_state/i_hash fields we have to wrap them
> with i_lock....
>
> Simple code is good code - save the complexity for something that
> needs it.
>
Emmm, I doubt that "it's simpler and needs no documentation".
I bet that someday someone else will stand up and ask, "Why take a spin
lock on an inode that is apparently not subject to any race condition?"
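For the record, what I had in mind looks roughly like the following. It is a
userspace sketch using C11 atomics, not kernel code, and the pub_* names, the
flag value and the single global slot standing in for the hash chain are all
my own assumptions. The writer fills in the inode with plain stores and then
publishes the pointer with a release store; a reader that picks the pointer
up with an acquire load is guaranteed to see the earlier stores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define PUB_I_NEW 0x8u /* hypothetical flag value, not the kernel's */

struct pub_inode {
	unsigned long ino;  /* plain fields, written before publication */
	unsigned int state;
};

/* One global slot stands in for a hash chain head. */
static _Atomic(struct pub_inode *) pub_slot;

static void pub_publish(struct pub_inode *inode, unsigned long ino)
{
	assert(inode != NULL);
	inode->ino = ino;
	inode->state = PUB_I_NEW;
	/* Release: the two stores above are ordered before the pointer store. */
	atomic_store_explicit(&pub_slot, inode, memory_order_release);
}

static struct pub_inode *pub_lookup(unsigned long ino)
{
	/* Acquire: pairs with the release store in pub_publish(). */
	struct pub_inode *inode =
		atomic_load_explicit(&pub_slot, memory_order_acquire);

	if (inode == NULL || inode->ino != ino)
		return NULL;
	return inode;
}
```

This also shows the cost you describe: the ordering lives in two comments,
and every later writer of state would still need its own rule.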
> I know that the per-sb inode LRU lock is currently the hottest of the
> inode cache locks (performance limiting at somewhere in the range of
> 8-16 way workloads on XFS), and I've got work in (slow) progress to
> address that. That work will also address the per-sb dentry LRU
> locks, which are the hottest dentry cache locks as well.
>
Glad to hear that.
Thank you for all your explanations, especially the historical ones.

Regards,
Guo Chao