From: Dave Chinner <david@fromorbit.com>
To: Guo Chao <yan@linux.vnet.ibm.com>
Cc: viro@zeniv.linux.org.uk, dchinner@redhat.com, hch@infradead.org,
jack@suse.cz, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage
Date: Mon, 24 Sep 2012 16:28:12 +1000 [thread overview]
Message-ID: <20120924062812.GD20960@dastard> (raw)
In-Reply-To: <20120924061205.GA27244@yanx>
On Mon, Sep 24, 2012 at 02:12:05PM +0800, Guo Chao wrote:
> On Mon, Sep 24, 2012 at 02:23:43PM +1000, Dave Chinner wrote:
> > On Mon, Sep 24, 2012 at 10:42:21AM +0800, Guo Chao wrote:
> > > On Sat, Sep 22, 2012 at 08:49:12AM +1000, Dave Chinner wrote:
> > >
> > > > On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> > > > > This patchset optimizes several places which take the per inode spin lock.
> > > > > They have not been fully tested yet, thus they are marked as RFC.
> > > >
> > > > Inodes are RCU freed. The i_lock spinlock on the i_state field forms
> > > > part of the memory barrier that allows the RCU read side to
> > > > correctly detect a freed inode during a RCU protected cache lookup
> > > > (hash list traversal for the VFS, or a radix tree traversal for XFS).
> > > > The i_lock usage around the hahs list operations ensures the hash
> > > > list operations are atomic with state changes so that such changes
> > > > are correctly detected during RCU-protected traversals...
> > > >
> > > > IOWs, removing the i_lock from around the i_state transitions and
> > > > inode hash insert/remove/traversal operations will cause races in
> > > > the RCU lookups and result in incorrectly using freed inodes instead
> > > > of failing the lookup and creating a new one.
> > > >
> > > > So I don't think this is a good idea at all...
.....
> > The inode hash lookup needs to check i_state atomically during the
> > traversal so inodes being freed are skipped (e.g. I_FREEING,
> > I_WILL_FREE). those i_state flags are set only with the i_lock held,
> > and so inode hash lookups need to take the i_lock to guarantee the
> > i_state field is correct. This inode data field synchronisation is
> > separate to the cache hash list traversal protection.
> >
> > The only way to do this is to have an inner lock (inode->i_lock)
> > that protects both the inode->i_hash_list and inode->i_state fields,
> > and a lock order that provides outer list traversal protections
> > (inode_hash_lock). Whether the outer lock is the inode_hash_lock or
> > rcu_read_lock(), the lock order and the data fields the locks are
> > protecting are the same....
> >
> > > Of course, maybe they are there for something. Could you speak
> > > more about the race this change (patch 1,2?) brings up? Thank you.
> >
> > When you drop the lock from the i_state initialisation, you end up
> > dropping the implicit unlock->lock memory barrier that the
> > inode->i_lock provides. i.e. you get this in iget_locked():
> >
> >
> > thread 1 thread 2
> >
> > lock(inode_hash_lock)
> > for_each_hash_item()
> >
> > inode->i_state = I_NEW
> > hash_list_insert
> >
> > <finds newly inserted inode>
> > lock(inode->i_lock)
> > unlock(inode->i_lock)
> > unlock(inode_hash_lock)
> >
> > wait_on_inode()
> > <see inode->i_state = 0 >
> > <uses inode before initialisation
> > is complete>
> >
> > IOWs, there is no unlock->lock transition occurring on any lock, so
> > there are no implicit memory barriers in this code, and so other
> > CPUs are not guaranteed to see the "inode->i_state = I_NEW" write
> > that thread 2 did. The lock/unlock pair around this I_NEW assignment
> > guarantees that thread 1 will see the change to i_state correctly.
> >
> > So even without RCU, dropping the i_lock from these
> > i_state/hash insert/remove operations will result in races
> > occurring...
>
> This interleave can never happen because of inode_hash_lock.
Ah, sorry, I'm context switching too much right now.
s/lock(inode_hash_lock)/rcu_read_lock.
And that's the race condition the the locking order is *intended* to
avoid. It's just that we haven't done the last piece of the work,
which is replacing the read side inode_hash_lock usage with
rcu_read_lock.
> > Seriously, if you want to improve the locking of this code, go back
> > an resurrect the basic RCU hash traversal patches (i.e. Nick's
> > original patch rather than my SLAB_DESTROY_BY_RCU based ones). That
> > has much more benefit to many more workloads than just removing a
> > non-global, uncontended locks like this patch set does.
> >
>
> Ah, this is intended to be a code clean patchset actually. I thought these
> locks are redundant in an obvious and trivial manner. If, on the contrary,
> they are such tricky, then never mind :) Thanks for your patient.
The RCU conversion is actually trivial - everything is already set
up for it to be done, and is simpler than this patch set. It pretty
much is simply replacing all the read side inode_hash_lock pairs
with rcu_read_lock()/rcu_read_unlock() pairs. Like I said, if you
want to clean up this code, then RCU traversals are the conversion
to make.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2012-09-24 6:28 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-09-21 9:31 [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage Guo Chao
2012-09-21 9:31 ` [PATCH 1/4] fs/inode.c: do not take i_lock on newly allocated inode Guo Chao
2012-09-21 9:31 ` [PATCH 2/4] fs/inode.c: do not take i_lock in __(insert|remove)_inode_hash Guo Chao
2012-09-21 9:31 ` [PATCH 3/4] fs/inode.c: do not take i_lock when identify an inode Guo Chao
2012-09-21 9:31 ` [PATCH 4/4] fs/inode.c: always take i_lock before calling filesystem's test() method Guo Chao
2012-09-21 12:17 ` [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage Matthew Wilcox
2012-09-21 22:49 ` Dave Chinner
2012-09-24 2:42 ` Guo Chao
2012-09-24 4:23 ` Dave Chinner
2012-09-24 6:12 ` Guo Chao
2012-09-24 6:28 ` Dave Chinner [this message]
2012-09-24 7:08 ` Guo Chao
2012-09-24 8:26 ` Dave Chinner
2012-09-25 8:59 ` Guo Chao
2012-09-26 0:54 ` Dave Chinner
2012-09-27 8:41 ` Guo Chao
2012-09-27 11:51 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120924062812.GD20960@dastard \
--to=david@fromorbit.com \
--cc=dchinner@redhat.com \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=viro@zeniv.linux.org.uk \
--cc=yan@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.