* Warning from unlock_new_inode
From: Jan Kara @ 2012-02-22 22:01 UTC (permalink / raw)
To: xfs
Hello,
while running fsstress on an XFS partition with a 3.3-rc4 kernel + my freeze
fixes (they do not touch anything relevant AFAICT) I've got the following
warning:
[ 62.699499] ------------[ cut here ]------------
[ 62.699531] WARNING: at fs/inode.c:967 unlock_new_inode+0x70/0x80()
[ 62.699538] Hardware name: Bochs
[ 62.699543] Modules linked in: xfs exportfs ipv6 mperf loop processor thermal_sys 8250_pnp button 8250 serial_core
[ 62.699572] Pid: 2397, comm: fsstress Not tainted 3.3.0-rc4-xen+ #9
[ 62.699578] Call Trace:
[ 62.699588] [<ffffffff81131130>] ? unlock_new_inode+0x70/0x80
[ 62.699598] [<ffffffff8103667a>] warn_slowpath_common+0x7a/0xb0
[ 62.699606] [<ffffffff810366c5>] warn_slowpath_null+0x15/0x20
[ 62.699613] [<ffffffff81131130>] unlock_new_inode+0x70/0x80
[ 62.699668] [<ffffffffa00b787f>] xfs_setup_inode+0x20f/0x2f0 [xfs]
[ 62.699696] [<ffffffffa00b3614>] xfs_iget+0x884/0x8c0 [xfs]
[ 62.699721] [<ffffffffa00b2e78>] ? xfs_iget+0xe8/0x8c0 [xfs]
[ 62.699748] [<ffffffffa00b8898>] xfs_bulkstat_one_int+0xa8/0x350 [xfs]
[ 62.699775] [<ffffffffa00b8770>] ? xfs_inumbers_fmt+0x70/0x70 [xfs]
[ 62.699803] [<ffffffffa00b8b5b>] xfs_bulkstat_one+0x1b/0x20 [xfs]
[ 62.699832] [<ffffffffa00b8f0a>] xfs_bulkstat+0x3aa/0x810 [xfs]
[ 62.699858] [<ffffffffa00b8b40>] ? xfs_bulkstat_one_int+0x350/0x350 [xfs]
[ 62.699899] [<ffffffffa00b4266>] xfs_ioc_bulkstat+0xe6/0x180 [xfs]
[ 62.699925] [<ffffffffa00b5756>] xfs_file_ioctl+0x426/0xa30 [xfs]
[ 62.699945] [<ffffffff810f7edb>] ? might_fault+0x3b/0x90
[ 62.699958] [<ffffffff81110191>] ? kfree_debugcheck+0x11/0x30
[ 62.699966] [<ffffffff81111d6e>] ? kmem_cache_free+0x13e/0x150
[ 62.699975] [<ffffffff8112954c>] do_vfs_ioctl+0x9c/0x590
[ 62.699987] [<ffffffff81114df0>] ? fd_install+0x30/0x60
[ 62.699997] [<ffffffff8138b5a5>] ? sysret_check+0x22/0x5d
[ 62.700005] [<ffffffff81129a8a>] sys_ioctl+0x4a/0x80
[ 62.700012] [<ffffffff8138b579>] system_call_fastpath+0x16/0x1b
[ 62.700020] ---[ end trace 0a20034268e46d92 ]---
This is the first time I've seen this warning, and I've already been testing
for quite some time, so it doesn't look easily reproducible...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Warning from unlock_new_inode
From: Christoph Hellwig @ 2012-02-28 8:34 UTC (permalink / raw)
To: Jan Kara; +Cc: xfs
On Wed, Feb 22, 2012 at 11:01:37PM +0100, Jan Kara wrote:
> Hello,
>
> while running fsstress on an XFS partition with a 3.3-rc4 kernel + my freeze
> fixes (they do not touch anything relevant AFAICT) I've got the following
> warning:
Is that stress testing with freezes included, or without? Do you have a
better description of the workload?
Either way it's an odd one; I can't see any obvious way this could happen.
* Re: Warning from unlock_new_inode
From: Jan Kara @ 2012-02-28 10:11 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jan Kara, xfs
On Tue 28-02-12 03:34:44, Christoph Hellwig wrote:
> On Wed, Feb 22, 2012 at 11:01:37PM +0100, Jan Kara wrote:
> > Hello,
> >
> > while running fsstress on an XFS partition with a 3.3-rc4 kernel + my freeze
> > fixes (they do not touch anything relevant AFAICT) I've got the following
> > warning:
>
> Is that stress testing with freezes included, or without? Do you have a
> better description of the workload?
It was:
~/tests/xfstests-dev/ltp/fsstress -d /mnt -n 10000 -p 4
while doing in parallel
while true; do freeze; sync; unfreeze; sleep 4; done
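For readers without the wrapper scripts: assuming `freeze` and `unfreeze` simply wrap `xfs_freeze -f` / `xfs_freeze -u` on the test mount (an assumption; the wrappers themselves aren't shown in the thread), the loop expands to roughly:

```shell
#!/bin/sh
# Hypothetical expansion of the freeze/unfreeze wrappers above; requires
# root and an XFS filesystem mounted at /mnt (the same mount fsstress
# is exercising in parallel).
MNT=/mnt
while true; do
    xfs_freeze -f "$MNT"    # freeze: block new modifications, flush the log
    sync
    xfs_freeze -u "$MNT"    # thaw the filesystem
    sleep 4
done
```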
> Either way it's an odd one; I can't see any obvious way this could happen.
Yeah, and I've never seen it again although I've been stressing the
filesystem for quite a long time. So I think we can put this aside for now.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: Warning from unlock_new_inode
From: Dave Chinner @ 2012-02-29 0:53 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jan Kara, xfs
On Tue, Feb 28, 2012 at 03:34:44AM -0500, Christoph Hellwig wrote:
> On Wed, Feb 22, 2012 at 11:01:37PM +0100, Jan Kara wrote:
> > Hello,
> >
> > while running fsstress on an XFS partition with a 3.3-rc4 kernel + my freeze
> > fixes (they do not touch anything relevant AFAICT) I've got the following
> > warning:
>
> Is that stress testing with freezes included, or without? Do you have a
> better description of the workload?
>
> Either way it's an odd one; I can't see any obvious way this could happen.
FWIW, I'm trying to track down exactly the same warning on a RHEL6.2
kernel, triggered by NFS filehandle lookup. The problem is being
reproduced reliably by a well-known NFS benchmark, which gives a bit
more information on where a race condition in the inode lookup may
exist.
That is, the only common element here in these two lookup paths is
that they are the only two calls to xfs_iget() with
XFS_IGET_UNTRUSTED set in the flags. I doubt this is a coincidence.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Warning from unlock_new_inode
From: Dave Chinner @ 2012-02-29 1:49 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jan Kara, xfs
On Wed, Feb 29, 2012 at 11:53:51AM +1100, Dave Chinner wrote:
> On Tue, Feb 28, 2012 at 03:34:44AM -0500, Christoph Hellwig wrote:
> > On Wed, Feb 22, 2012 at 11:01:37PM +0100, Jan Kara wrote:
> > > Hello,
> > >
> > > while running fsstress on an XFS partition with a 3.3-rc4 kernel + my freeze
> > > fixes (they do not touch anything relevant AFAICT) I've got the following
> > > warning:
> >
> > Is that stress testing with freezes included, or without? Do you have a
> > better description of the workload?
> >
> > Either way it's an odd one; I can't see any obvious way this could happen.
>
> FWIW, I'm trying to track down exactly the same warning on a RHEL6.2
> kernel, triggered by NFS filehandle lookup. The problem is being
> reproduced reliably by a well-known NFS benchmark, which gives a bit
> more information on where a race condition in the inode lookup may
> exist.
>
> That is, the only common element here in these two lookup paths is
> that they are the only two calls to xfs_iget() with
> XFS_IGET_UNTRUSTED set in the flags. I doubt this is a coincidence.
And it isn't.
Jan, can you try the (untested) patch below?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
xfs: fix inode lookup race
From: Dave Chinner <dchinner@redhat.com>
When we get concurrent lookups of the same inode that is not in the
per-AG inode cache, there is a race condition that triggers warnings
in unlock_new_inode() indicating that we are initialising an inode
that isn't in the correct state for a new inode.
When we do an inode lookup via a file handle or a bulkstat, we don't
serialise lookups at a higher level through the dentry cache (i.e.
pathless lookup), and so we can get concurrent lookups of the same
inode.
The race condition is between the insertion of the inode into the
cache in the case of a cache miss and a concurrent lookup:
Thread 1                        Thread 2
xfs_iget()
  xfs_iget_cache_miss()
    xfs_iread()
    lock radix tree
    radix_tree_insert()
                                rcu_read_lock
                                radix_tree_lookup
                                lock inode flags
                                XFS_INEW not set
                                igrab()
                                unlock inode flags
                                rcu_read_unlock
                                use uninitialised inode
                                .....
    lock inode flags
    set XFS_INEW
    unlock inode flags
    unlock radix tree
  xfs_setup_inode()
    inode flags = I_NEW
    unlock_new_inode()
      WARNING as inode flags != I_NEW
This can lead to inode corruption, inode list corruption, etc, and
is generally a bad thing to occur.
Fix this by setting XFS_INEW before inserting the inode into the
radix tree. This will ensure any concurrent lookup will find the new
inode with XFS_INEW set and that forces the lookup to wait until the
XFS_INEW flag is removed before allowing the lookup to succeed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_iget.c | 17 +++++++++++------
1 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index 05bed2b..2467ab7 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -350,9 +350,19 @@ xfs_iget_cache_miss(
 		BUG();
 	}
 
-	spin_lock(&pag->pag_ici_lock);
+	/* These values _must_ be set before inserting the inode into the radix
+	 * tree as the moment it is inserted a concurrent lookup (allowed by the
+	 * RCU locking mechanism) can find it and that lookup must see that this
+	 * is an inode currently under construction (i.e. that XFS_INEW is set).
+	 * The ip->i_flags_lock that protects the XFS_INEW flag forms the
+	 * memory barrier that ensures this detection works correctly at lookup
+	 * time.
+	 */
+	xfs_iflags_set(ip, XFS_INEW);
+	ip->i_udquot = ip->i_gdquot = NULL;
 
 	/* insert the new inode */
+	spin_lock(&pag->pag_ici_lock);
 	error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
 	if (unlikely(error)) {
 		WARN_ON(error != -EEXIST);
@@ -360,11 +370,6 @@ xfs_iget_cache_miss(
 		error = EAGAIN;
 		goto out_preload_end;
 	}
-
-	/* These values _must_ be set before releasing the radix tree lock! */
-	ip->i_udquot = ip->i_gdquot = NULL;
-	xfs_iflags_set(ip, XFS_INEW);
-
 	spin_unlock(&pag->pag_ici_lock);
 	radix_tree_preload_end();
* Re: Warning from unlock_new_inode
From: Christoph Hellwig @ 2012-02-29 9:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: Jan Kara, xfs
Looks good, except that I'd drop the _must_ shouting in the comment.
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: Warning from unlock_new_inode
From: Jan Kara @ 2012-02-29 10:24 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Jan Kara, xfs
On Wed 29-02-12 12:49:06, Dave Chinner wrote:
> On Wed, Feb 29, 2012 at 11:53:51AM +1100, Dave Chinner wrote:
> > On Tue, Feb 28, 2012 at 03:34:44AM -0500, Christoph Hellwig wrote:
> > > On Wed, Feb 22, 2012 at 11:01:37PM +0100, Jan Kara wrote:
> > > > Hello,
> > > >
> > > > while running fsstress on an XFS partition with a 3.3-rc4 kernel + my freeze
> > > > fixes (they do not touch anything relevant AFAICT) I've got the following
> > > > warning:
> > >
> > > Is that stress testing with freezes included, or without? Do you have a
> > > better description of the workload?
> > >
> > > Either way it's an odd one; I can't see any obvious way this could happen.
> >
> > FWIW, I'm trying to track down exactly the same warning on a RHEL6.2
> > kernel, triggered by NFS filehandle lookup. The problem is being
> > reproduced reliably by a well-known NFS benchmark, which gives a bit
> > more information on where a race condition in the inode lookup may
> > exist.
> >
> > That is, the only common element here in these two lookup paths is
> > that they are the only two calls to xfs_iget() with
> > XFS_IGET_UNTRUSTED set in the flags. I doubt this is a coincidence.
>
> And it isn't.
>
> Jan, can you try the (untested) patch below?
Sure, I can include it in my testing. It's just that I've seen the warning
only once in a week of testing, so the reliability of my confirmation is
rather low.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: Warning from unlock_new_inode
From: Ben Myers @ 2012-03-01 3:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Jan Kara, xfs
Hi Dave,
Eric mentioned a really excellent bugfix earlier. This must be it.
On Wed, Feb 29, 2012 at 12:49:06PM +1100, Dave Chinner wrote:
> xfs: fix inode lookup race
>
> From: Dave Chinner <dchinner@redhat.com>
>
> When we get concurrent lookups of the same inode that is not in the
> per-AG inode cache, there is a race condition that triggers warnings
> in unlock_new_inode() indicating that we are initialising an inode
> that isn't in the correct state for a new inode.
>
> When we do an inode lookup via a file handle or a bulkstat, we don't
> serialise lookups at a higher level through the dentry cache (i.e.
> pathless lookup), and so we can get concurrent lookups of the same
> inode.
>
> The race condition is between the insertion of the inode into the
> cache in the case of a cache miss and a concurrent lookup:
>
> Thread 1                        Thread 2
> xfs_iget()
>   xfs_iget_cache_miss()
>     xfs_iread()
>     lock radix tree
>     radix_tree_insert()
>                                 rcu_read_lock
>                                 radix_tree_lookup
>                                 lock inode flags
>                                 XFS_INEW not set
>                                 igrab()
>                                 unlock inode flags
>                                 rcu_read_unlock
>                                 use uninitialised inode
>                                 .....
>     lock inode flags
>     set XFS_INEW
>     unlock inode flags
>     unlock radix tree
>   xfs_setup_inode()
>     inode flags = I_NEW
>     unlock_new_inode()
>       WARNING as inode flags != I_NEW
>
> This can lead to inode corruption, inode list corruption, etc, and
> is generally a bad thing to occur.
>
> Fix this by setting XFS_INEW before inserting the inode into the
> radix tree. This will ensure any concurrent lookup will find the new
> inode with XFS_INEW set and that forces the lookup to wait until the
> XFS_INEW flag is removed before allowing the lookup to succeed.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_iget.c | 17 +++++++++++------
> 1 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> index 05bed2b..2467ab7 100644
> --- a/fs/xfs/xfs_iget.c
> +++ b/fs/xfs/xfs_iget.c
> @@ -350,9 +350,19 @@ xfs_iget_cache_miss(
>  		BUG();
>  	}
>  
> -	spin_lock(&pag->pag_ici_lock);
> +	/* These values _must_ be set before inserting the inode into the radix
> +	 * tree as the moment it is inserted a concurrent lookup (allowed by the
> +	 * RCU locking mechanism) can find it and that lookup must see that this
> +	 * is an inode currently under construction (i.e. that XFS_INEW is set).
> +	 * The ip->i_flags_lock that protects the XFS_INEW flag forms the
> +	 * memory barrier that ensures this detection works correctly at lookup
> +	 * time.
> +	 */
> +	xfs_iflags_set(ip, XFS_INEW);
> +	ip->i_udquot = ip->i_gdquot = NULL;
>  
>  	/* insert the new inode */
> +	spin_lock(&pag->pag_ici_lock);
>  	error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
>  	if (unlikely(error)) {
>  		WARN_ON(error != -EEXIST);
> @@ -360,11 +370,6 @@ xfs_iget_cache_miss(
>  		error = EAGAIN;
>  		goto out_preload_end;
>  	}
> -
> -	/* These values _must_ be set before releasing the radix tree lock! */
^^^
So, in this comment 'radix tree lock' refers to pag->pag_ici_lock?
And the pag_ici_lock provides no exclusion against radix_tree_lookup.
I believe I understand. That isn't to say that I couldn't use a
brush-up on RCU. Awesome. ;)
Reviewed-by: Ben Myers <bpm@sgi.com>
* Re: Warning from unlock_new_inode
From: Dave Chinner @ 2012-03-01 3:58 UTC (permalink / raw)
To: Ben Myers; +Cc: Christoph Hellwig, Jan Kara, xfs
On Wed, Feb 29, 2012 at 09:03:24PM -0600, Ben Myers wrote:
> > The race condition is between the insertion of the inode into the
> > cache in the case of a cache miss and a concurrent lookup:
> >
> > Thread 1                        Thread 2
> > xfs_iget()
> >   xfs_iget_cache_miss()
> >     xfs_iread()
> >     lock radix tree
> >     radix_tree_insert()
> >                                 rcu_read_lock
> >                                 radix_tree_lookup
> >                                 lock inode flags
> >                                 XFS_INEW not set
> >                                 igrab()
> >                                 unlock inode flags
> >                                 rcu_read_unlock
> >                                 use uninitialised inode
> >                                 .....
> >     lock inode flags
> >     set XFS_INEW
> >     unlock inode flags
> >     unlock radix tree
> >   xfs_setup_inode()
> >     inode flags = I_NEW
> >     unlock_new_inode()
> >       WARNING as inode flags != I_NEW
.....
> > -	spin_lock(&pag->pag_ici_lock);
> > +	/* These values _must_ be set before inserting the inode into the radix
> > +	 * tree as the moment it is inserted a concurrent lookup (allowed by the
> > +	 * RCU locking mechanism) can find it and that lookup must see that this
> > +	 * is an inode currently under construction (i.e. that XFS_INEW is set).
> > +	 * The ip->i_flags_lock that protects the XFS_INEW flag forms the
> > +	 * memory barrier that ensures this detection works correctly at lookup
> > +	 * time.
> > +	 */
> > +	xfs_iflags_set(ip, XFS_INEW);
> > +	ip->i_udquot = ip->i_gdquot = NULL;
> >  
> >  	/* insert the new inode */
> > +	spin_lock(&pag->pag_ici_lock);
> >  	error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
> >  	if (unlikely(error)) {
> >  		WARN_ON(error != -EEXIST);
> > @@ -360,11 +370,6 @@ xfs_iget_cache_miss(
> >  		error = EAGAIN;
> >  		goto out_preload_end;
> >  	}
> > -
> > -	/* These values _must_ be set before releasing the radix tree lock! */
> ^^^
> So, in this comment 'radix tree lock' refers to pag->pag_ici_lock?
Right. "ici" is short for "in-core inode". So pag_ici_lock reads as
"per-ag in-core inode lock". The lock is used to protect
modifications to the in-core inode cache which is radix tree indexed
and the root is at pag_ici_root...
> And the pag_ici_lock provides no exclusion against radix_tree_lookup.
Right, but RCU-based radix tree traversal is safe as long as the
objects being indexed are RCU-freed and have some external method of
validating their freed/in-use state independent of the radix tree
lookup. IOWs, tree lookups are not synchronised with inserts,
deletes or tag modifications - they are synchronised with RCU freeing
of the objects the tree points to. Hence we only need to validate
the objects against races with RCU freeing after the tree lookup -
we don't need to care about the state of the tree at all....
> I believe I understand. That isn't to say that I couldn't use a
> brush-up on RCU. Awesome. ;)
RCU is tricky. It took me a long time to convince myself that we
could use RCU lookups here because it relies on memory barriers
rather than locking to ensure coherency across all CPUs. I firmly
believe that when a concept is difficult to understand, it shouldn't
be made harder to understand by optimising it with a smart and/or
tricky implementation. Spin locks provide memory barriers and they
are something simple that everyone understands, therefore all we
need to concentrate on is the order in which operations occur....
FWIW, once I had decided on the memory barrier implementation
(ip->i_flags_lock), there were three basic principles I followed to
prove to myself that the inode cache lookup is RCU safe:
1. that the state of inodes in the process of removal from
the tree is detectable during lookup (i.e. via the
XFS_IRECLAIM flag).
2. that inodes in the process of being freed (regardless of
removal state) in the RCU grace period can be detected (i.e.
by clearing ip->i_ino before calling rcu_free)
3. that the state of new inodes is detectable during lookup
so we can avoid them until they are fully initialised. i.e.
the XFS_INEW flag.
That's why xfs_iget_cache_hit() takes the i_flags_lock, then checks
ip->i_ino and the XFS_INEW | XFS_IRECLAIM flags, and
xfs_inode_free() ensures that ip->i_ino = 0 and XFS_IRECLAIM is set
before calling rcu_free() on the inode. The problem was that I
hadn't got the ordering of operations for #3 correct, and that's
what the bug fix is for....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Warning from unlock_new_inode
From: Eric Sandeen @ 2012-03-01 5:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Jan Kara, xfs
On 2/28/12 7:49 PM, Dave Chinner wrote:
...
> Jan, can you try the (untested) patch below?
>
> Cheers,
>
> Dave.
Just for the list's benefit - we tested this with the in-house NFS benchmark reproducer, and the tester also reported success.
Thanks Dave!
-Eric