From: Dave Chinner <david@fromorbit.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: viro@ZenIV.linux.org.uk, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: [PATCH 3/3] fs: rcu protect inode hash lookups
Date: Tue, 2 Nov 2010 00:44:51 +1100 [thread overview]
Message-ID: <20101101134451.GN2715@dastard> (raw)
In-Reply-To: <1288604287.2660.94.camel@edumazet-laptop>
On Mon, Nov 01, 2010 at 10:38:07AM +0100, Eric Dumazet wrote:
> Le lundi 01 novembre 2010 à 16:33 +1100, Dave Chinner a écrit :
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Now that inodes are using RCU freeing, we can walk the hash lists
> > using RCU protection during lookups. Convert all the hash list
> > operations to use RCU-based operators and drop the inode_hash_lock
> > around pure lookup operations.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
>
> You probably should copy Paul on this stuff, I added him in Cc, because
> SLAB_DESTROY_BY_RCU is really tricky, and Paul review is a must.
>
> > repeat:
> > + rcu_read_lock();
> > hlist_for_each_entry(inode, node, head, i_hash) {
> > if (inode->i_sb != sb)
> > continue;
> > if (!test(inode, data))
> > continue;
> > spin_lock(&inode->i_lock);
>
> Problem with SLAB_DESTROY_BY_RCU is the inode can be freed, and reused
> immediately (no grace period) by another cpu.
>
> So you need to recheck test(inode, data) _after_ getting a stable
> reference on the inode (spin_lock() in this case), to make sure you
> indeed found the inode you are looking for, not another one.
Possibly. The test callback is a private callback to determine if,
indeed, it is the inode the caller is looking for. I need to do a
deeper look into what ordering is required for this callback.
> The test on inode->i_sb != sb can be omitted, _if_ each sb has its own
> kmem_cache (but I am not sure, please check if this is the case)
There's a slab cache per filesystem type, not per filesystem, so the
check is necessary.
> Also, you should make sure the allocation of inode is careful of not
> overwriting some fields (the i_lock in particular), since you could
> break a concurrent lookup. This is really tricky, you cannot use
> spin_lock_init(&inode->i_lock) anymore in inode_init_always().
Yes, I missed that one. Good catch. I'm used to the XFS code where
most locks are initialised only once in the slab constructor....
The other fields of note:
i_sb: overwritten in inode_init_always(). Should be safe
simply by rechecking after validating the inode is not in
the freed state as you suggest.
i_ino: overwritten just before the inode is re-inserted into
the hash. redo check like i_sb.
i_state: initialised atomically with hash insert via i_lock.
i_hash: inserted into hash list under i_lock
My intent is that the i_state/i_hash atomicity acts as the real
guard against reusing a freed inode, but you are right that the
other fields needs to be rechecked for validity after establishing
that it is not a freed inode.
> You can read Documentation/RCU/rculist_nulls.txt for some doc I wrote
> when adding SLAB_DESTROY_BY_RCU to UDP/TCP sockets. Sockets stable
Perhaps you should rename that file "slab_destroy_by_rcu-tips.txt",
because the current name seems unrelated to the contents. :/
> reference is not a spinlock, but a refcount, so it was easier to init
> this refcount. With a spinlock, I believe you might need to use SLAB
> constructor, to initialize the spinlock only on fresh objects, not on
> reused ones.
Yeah, that is what I intended.
Thanks for the comments, Eric.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
WARNING: multiple messages have this Message-ID (diff)
From: Dave Chinner <david@fromorbit.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: viro@ZenIV.linux.org.uk, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: [PATCH 3/3] fs: rcu protect inode hash lookups
Date: Tue, 2 Nov 2010 00:44:51 +1100 [thread overview]
Message-ID: <20101101134451.GN2715@dastard> (raw)
In-Reply-To: <1288604287.2660.94.camel@edumazet-laptop>
On Mon, Nov 01, 2010 at 10:38:07AM +0100, Eric Dumazet wrote:
> Le lundi 01 novembre 2010 à 16:33 +1100, Dave Chinner a écrit :
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Now that inodes are using RCU freeing, we can walk the hash lists
> > using RCU protection during lookups. Convert all the hash list
> > operations to use RCU-based operators and drop the inode_hash_lock
> > around pure lookup operations.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
>
> You probably should copy Paul on this stuff, I added him in Cc, because
> SLAB_DESTROY_BY_RCU is really tricky, and Paul review is a must.
>
> > repeat:
> > + rcu_read_lock();
> > hlist_for_each_entry(inode, node, head, i_hash) {
> > if (inode->i_sb != sb)
> > continue;
> > if (!test(inode, data))
> > continue;
> > spin_lock(&inode->i_lock);
>
> Problem with SLAB_DESTROY_BY_RCU is the inode can be freed, and reused
> immediately (no grace period) by another cpu.
>
> So you need to recheck test(inode, data) _after_ getting a stable
> reference on the inode (spin_lock() in this case), to make sure you
> indeed found the inode you are looking for, not another one.
Possibly. The test callback is a private callback to determine if,
indeed, it is the inode the caller is looking for. I need to do a
deeper look into what ordering is required for this callback.
> The test on inode->i_sb != sb can be omitted, _if_ each sb has its own
> kmem_cache (but I am not sure, please check if this is the case)
There's a slab cache per filesystem type, not per filesystem, so the
check is necessary.
> Also, you should make sure the allocation of inode is careful of not
> overwriting some fields (the i_lock in particular), since you could
> break a concurrent lookup. This is really tricky, you cannot use
> spin_lock_init(&inode->i_lock) anymore in inode_init_always().
Yes, I missed that one. Good catch. I'm used to the XFS code where
most locks are initialised only once in the slab constructor....
The other fields of note:
i_sb: overwritten in inode_init_always(). Should be safe
simply by rechecking after validating the inode is not in
the freed state as you suggest.
i_ino: overwritten just before the inode is re-inserted into
the hash. redo check like i_sb.
i_state: initialised atomically with hash insert via i_lock.
i_hash: inserted into hash list under i_lock
My intent is that the i_state/i_hash atomicity acts as the real
guard against reusing a freed inode, but you are right that the
other fields needs to be rechecked for validity after establishing
that it is not a freed inode.
> You can read Documentation/RCU/rculist_nulls.txt for some doc I wrote
> when adding SLAB_DESTROY_BY_RCU to UDP/TCP sockets. Sockets stable
Perhaps you should rename that file "slab_destroy_by_rcu-tips.txt",
because the current name seems unrelated to the contents. :/
> reference is not a spinlock, but a refcount, so it was easier to init
> this refcount. With a spinlock, I believe you might need to use SLAB
> constructor, to initialize the spinlock only on fresh objects, not on
> reused ones.
Yeah, that is what I intended.
Thanks for the comments, Eric.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2010-11-01 13:45 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-11-01 5:33 fs: inode freeing and hash lookup via RCU Dave Chinner
2010-11-01 5:33 ` [PATCH 1/3] fs: pull inode->i_lock up out of writeback_single_inode Dave Chinner
2010-11-01 5:33 ` [PATCH 2/3] fs: Use RCU freeing of inodes via SLAB_DESTROY_BY_RCU Dave Chinner
2010-11-01 15:31 ` Christoph Hellwig
2010-11-01 5:33 ` [PATCH 3/3] fs: rcu protect inode hash lookups Dave Chinner
2010-11-01 9:38 ` Eric Dumazet
2010-11-01 9:38 ` Eric Dumazet
2010-11-01 13:44 ` Dave Chinner [this message]
2010-11-01 13:44 ` Dave Chinner
2010-11-01 15:29 ` Eric Dumazet
2010-11-01 15:29 ` Eric Dumazet
2010-11-02 0:01 ` Dave Chinner
2010-11-02 0:01 ` Dave Chinner
2010-11-02 4:46 ` Eric Dumazet
2010-11-02 4:46 ` Eric Dumazet
2010-11-02 12:11 ` Paul E. McKenney
2010-11-02 12:11 ` Paul E. McKenney
2010-11-16 23:56 ` Paul E. McKenney
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20101101134451.GN2715@dastard \
--to=david@fromorbit.com \
--cc=eric.dumazet@gmail.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=paulmck@linux.vnet.ibm.com \
--cc=viro@ZenIV.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.