* Re: Warning from unlock_new_inode
2012-02-29 1:49 ` Dave Chinner
@ 2012-02-29 9:03 ` Christoph Hellwig
2012-02-29 10:24 ` Jan Kara
` (2 subsequent siblings)
3 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2012-02-29 9:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: Jan Kara, xfs
Looks good, except that I'd drop the _must_ shouting in the comment.
Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Warning from unlock_new_inode
2012-02-29 1:49 ` Dave Chinner
2012-02-29 9:03 ` Christoph Hellwig
@ 2012-02-29 10:24 ` Jan Kara
2012-03-01 3:03 ` Ben Myers
2012-03-01 5:09 ` Eric Sandeen
3 siblings, 0 replies; 10+ messages in thread
From: Jan Kara @ 2012-02-29 10:24 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Jan Kara, xfs
On Wed 29-02-12 12:49:06, Dave Chinner wrote:
> On Wed, Feb 29, 2012 at 11:53:51AM +1100, Dave Chinner wrote:
> > On Tue, Feb 28, 2012 at 03:34:44AM -0500, Christoph Hellwig wrote:
> > > On Wed, Feb 22, 2012 at 11:01:37PM +0100, Jan Kara wrote:
> > > > Hello,
> > > >
> > > > while running fsstress on XFS partition with 3.3-rc4 kernel + my freeze
> > > > fixes (they do not touch anything relevant AFAICT) I've got the following
> > > > warning:
> > >
> > > Is that stress testing with freezes included, or without? Do you have a
> > > better description of the workload?
> > >
> > > Either way it's an odd one, I can't see any obvious way how this would
> > > happen.
> >
> > FWIW, I'm trying to track down exactly the same warning on a RHEL6.2
> > kernel being triggered by NFS filehandle lookup. The problem is
> > being reproduced reliably by a well-known NFS benchmark, but
> > this gives a bit more information on where a race condition in
> > the inode lookup may exist.
> >
> > That is, the only common element here in these two lookup paths is
> > that they are the only two calls to xfs_iget() with
> > XFS_IGET_UNTRUSTED set in the flags. I doubt this is a coincidence.
>
> And it isn't.
>
> Jan, can you try the (untested) patch below?
Sure, I can include it in my testing. But I've seen the warning only
once in a week of testing, so the reliability of my confirmation is rather low.
Honza
> --
> Dave Chinner
> david@fromorbit.com
>
> xfs: fix inode lookup race
>
> From: Dave Chinner <dchinner@redhat.com>
>
> When we get concurrent lookups of the same inode that is not in the
> per-AG inode cache, there is a race condition that triggers warnings
> in unlock_new_inode() indicating that we are initialising an inode
> that isn't in the correct state for a new inode.
>
> When we do an inode lookup via a file handle or a bulkstat, we don't
> serialise lookups at a higher level through the dentry cache (i.e.
> pathless lookup), and so we can get concurrent lookups of the same
> inode.
>
> The race condition is between the insertion of the inode into the
> cache in the case of a cache miss and a concurrent lookup:
>
> Thread 1 Thread 2
> xfs_iget()
> xfs_iget_cache_miss()
> xfs_iread()
> lock radix tree
> radix_tree_insert()
> rcu_read_lock
> radix_tree_lookup
> lock inode flags
> XFS_INEW not set
> igrab()
> unlock inode flags
> rcu_read_unlock
> use uninitialised inode
> .....
> lock inode flags
> set XFS_INEW
> unlock inode flags
> unlock radix tree
> xfs_setup_inode()
> inode flags = I_NEW
> unlock_new_inode()
> WARNING as inode flags != I_NEW
>
> This can lead to inode corruption, inode list corruption, etc, and
> is generally a bad thing to occur.
>
> Fix this by setting XFS_INEW before inserting the inode into the
> radix tree. This will ensure any concurrent lookup will find the new
> inode with XFS_INEW set and that forces the lookup to wait until the
> XFS_INEW flag is removed before allowing the lookup to succeed.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_iget.c | 17 +++++++++++------
> 1 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> index 05bed2b..2467ab7 100644
> --- a/fs/xfs/xfs_iget.c
> +++ b/fs/xfs/xfs_iget.c
> @@ -350,9 +350,19 @@ xfs_iget_cache_miss(
> BUG();
> }
>
> - spin_lock(&pag->pag_ici_lock);
> + /* These values _must_ be set before inserting the inode into the radix
> + * tree as the moment it is inserted a concurrent lookup (allowed by the
> + * RCU locking mechanism) can find it and that lookup must see that this
> + * is an inode currently under construction (i.e. that XFS_INEW is set).
> + * The ip->i_flags_lock that protects the XFS_INEW flag forms the
> + * memory barrier that ensures this detection works correctly at lookup
> + * time.
> + */
> + xfs_iflags_set(ip, XFS_INEW);
> + ip->i_udquot = ip->i_gdquot = NULL;
>
> /* insert the new inode */
> + spin_lock(&pag->pag_ici_lock);
> error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
> if (unlikely(error)) {
> WARN_ON(error != -EEXIST);
> @@ -360,11 +370,6 @@ xfs_iget_cache_miss(
> error = EAGAIN;
> goto out_preload_end;
> }
> -
> - /* These values _must_ be set before releasing the radix tree lock! */
> - ip->i_udquot = ip->i_gdquot = NULL;
> - xfs_iflags_set(ip, XFS_INEW);
> -
> spin_unlock(&pag->pag_ici_lock);
> radix_tree_preload_end();
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: Warning from unlock_new_inode
2012-02-29 1:49 ` Dave Chinner
2012-02-29 9:03 ` Christoph Hellwig
2012-02-29 10:24 ` Jan Kara
@ 2012-03-01 3:03 ` Ben Myers
2012-03-01 3:58 ` Dave Chinner
2012-03-01 5:09 ` Eric Sandeen
3 siblings, 1 reply; 10+ messages in thread
From: Ben Myers @ 2012-03-01 3:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Jan Kara, xfs
Hi Dave,
Eric mentioned a really excellent bugfix earlier. This must be it.
On Wed, Feb 29, 2012 at 12:49:06PM +1100, Dave Chinner wrote:
> xfs: fix inode lookup race
>
> From: Dave Chinner <dchinner@redhat.com>
>
> When we get concurrent lookups of the same inode that is not in the
> per-AG inode cache, there is a race condition that triggers warnings
> in unlock_new_inode() indicating that we are initialising an inode
> that isn't in the correct state for a new inode.
>
> When we do an inode lookup via a file handle or a bulkstat, we don't
> serialise lookups at a higher level through the dentry cache (i.e.
> pathless lookup), and so we can get concurrent lookups of the same
> inode.
>
> The race condition is between the insertion of the inode into the
> cache in the case of a cache miss and a concurrent lookup:
>
> Thread 1 Thread 2
> xfs_iget()
> xfs_iget_cache_miss()
> xfs_iread()
> lock radix tree
> radix_tree_insert()
> rcu_read_lock
> radix_tree_lookup
> lock inode flags
> XFS_INEW not set
> igrab()
> unlock inode flags
> rcu_read_unlock
> use uninitialised inode
> .....
> lock inode flags
> set XFS_INEW
> unlock inode flags
> unlock radix tree
> xfs_setup_inode()
> inode flags = I_NEW
> unlock_new_inode()
> WARNING as inode flags != I_NEW
>
> This can lead to inode corruption, inode list corruption, etc, and
> is generally a bad thing to occur.
>
> Fix this by setting XFS_INEW before inserting the inode into the
> radix tree. This will ensure any concurrent lookup will find the new
> inode with XFS_INEW set and that forces the lookup to wait until the
> XFS_INEW flag is removed before allowing the lookup to succeed.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_iget.c | 17 +++++++++++------
> 1 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> index 05bed2b..2467ab7 100644
> --- a/fs/xfs/xfs_iget.c
> +++ b/fs/xfs/xfs_iget.c
> @@ -350,9 +350,19 @@ xfs_iget_cache_miss(
> BUG();
> }
>
> - spin_lock(&pag->pag_ici_lock);
> + /* These values _must_ be set before inserting the inode into the radix
> + * tree as the moment it is inserted a concurrent lookup (allowed by the
> + * RCU locking mechanism) can find it and that lookup must see that this
> + * is an inode currently under construction (i.e. that XFS_INEW is set).
> + * The ip->i_flags_lock that protects the XFS_INEW flag forms the
> + * memory barrier that ensures this detection works correctly at lookup
> + * time.
> + */
> + xfs_iflags_set(ip, XFS_INEW);
> + ip->i_udquot = ip->i_gdquot = NULL;
>
> /* insert the new inode */
> + spin_lock(&pag->pag_ici_lock);
> error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
> if (unlikely(error)) {
> WARN_ON(error != -EEXIST);
> @@ -360,11 +370,6 @@ xfs_iget_cache_miss(
> error = EAGAIN;
> goto out_preload_end;
> }
> -
> - /* These values _must_ be set before releasing the radix tree lock! */
^^^
So, in this comment 'radix tree lock' refers to pag->pag_ici_lock?
And the pag_ici_lock provides no exclusion with radix_tree_lookup.
I believe I understand. That isn't to say that I couldn't use a
brush-up on RCU. Awesome. ;)
Reviewed-by: Ben Myers <bpm@sgi.com>
* Re: Warning from unlock_new_inode
2012-03-01 3:03 ` Ben Myers
@ 2012-03-01 3:58 ` Dave Chinner
0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2012-03-01 3:58 UTC (permalink / raw)
To: Ben Myers; +Cc: Christoph Hellwig, Jan Kara, xfs
On Wed, Feb 29, 2012 at 09:03:24PM -0600, Ben Myers wrote:
> > The race condition is between the insertion of the inode into the
> > cache in the case of a cache miss and a concurrent lookup:
> >
> > Thread 1 Thread 2
> > xfs_iget()
> > xfs_iget_cache_miss()
> > xfs_iread()
> > lock radix tree
> > radix_tree_insert()
> > rcu_read_lock
> > radix_tree_lookup
> > lock inode flags
> > XFS_INEW not set
> > igrab()
> > unlock inode flags
> > rcu_read_unlock
> > use uninitialised inode
> > .....
> > lock inode flags
> > set XFS_INEW
> > unlock inode flags
> > unlock radix tree
> > xfs_setup_inode()
> > inode flags = I_NEW
> > unlock_new_inode()
> > WARNING as inode flags != I_NEW
.....
> > - spin_lock(&pag->pag_ici_lock);
> > + /* These values _must_ be set before inserting the inode into the radix
> > + * tree as the moment it is inserted a concurrent lookup (allowed by the
> > + * RCU locking mechanism) can find it and that lookup must see that this
> > + * is an inode currently under construction (i.e. that XFS_INEW is set).
> > + * The ip->i_flags_lock that protects the XFS_INEW flag forms the
> > + * memory barrier that ensures this detection works correctly at lookup
> > + * time.
> > + */
> > + xfs_iflags_set(ip, XFS_INEW);
> > + ip->i_udquot = ip->i_gdquot = NULL;
> >
> > /* insert the new inode */
> > + spin_lock(&pag->pag_ici_lock);
> > error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
> > if (unlikely(error)) {
> > WARN_ON(error != -EEXIST);
> > @@ -360,11 +370,6 @@ xfs_iget_cache_miss(
> > error = EAGAIN;
> > goto out_preload_end;
> > }
> > -
> > - /* These values _must_ be set before releasing the radix tree lock! */
> ^^^
> So, in this comment 'radix tree lock' refers to pag->pag_ici_lock?
Right. "ici" is short for "in-core inode". So pag_ici_lock reads as
"per-ag in-core inode lock". The lock is used to protect
modifications to the in-core inode cache which is radix tree indexed
and the root is at pag_ici_root...
> And the pag_ici_lock provides no exclusion with radix_tree_lookup.
Right, but RCU-based radix tree traversal is safe as long as the
objects being indexed are RCU freed and have some external method of
validating their freed/in use state independent of the radix tree
lookup. IOWs, tree lookups are not synchronised with inserts,
deletes, or tag modifications - they are synchronised with RCU freeing
of the objects the tree points to. Hence we only need to validate
the objects against races with RCU freeing after the tree lookup -
we don't need to care about the state of the tree at all....
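[Editor's note: to illustrate the lookup-then-validate pattern Dave describes, here is a minimal userspace sketch. The names (`obj`, `obj_usable`, the `INEW`/`IRECLAIM` constants) are illustrative stand-ins, not the XFS source; a pthread mutex models ip->i_flags_lock.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Illustrative stand-ins for the XFS inode state flags. */
#define INEW     0x1u   /* object still being initialised */
#define IRECLAIM 0x2u   /* object being torn down */

struct obj {
    pthread_mutex_t flags_lock;  /* models ip->i_flags_lock */
    unsigned int    flags;
    unsigned long   ino;         /* 0 means "freed, awaiting RCU grace period" */
};

/*
 * Validate an object found by a lock-free (RCU-style) lookup.  The
 * lookup itself gives no guarantees about the object's state; taking
 * the per-object flags lock provides the memory barrier, and the
 * flags/ino values tell us whether the object is actually usable.
 */
static bool obj_usable(struct obj *o)
{
    bool ok;

    pthread_mutex_lock(&o->flags_lock);
    ok = o->ino != 0 && !(o->flags & (INEW | IRECLAIM));
    pthread_mutex_unlock(&o->flags_lock);
    return ok;
}
```

The point of the sketch is that the tree lookup only hands back a pointer; all correctness comes from revalidating the object's own state under its flags lock afterwards.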
> I believe I understand. That isn't to say that I couldn't use a
> brush-up on RCU. Awesome. ;)
RCU is tricky. It took me a long time to convince myself that we
could use RCU lookups here because it relies on memory barriers
rather than locking to ensure coherency across all CPUs. I firmly
believe that when a concept is difficult to understand, it shouldn't
be made harder to understand by optimising it with a smart and/or
tricky implementation. Spin locks provide memory barriers and they
are something simple that everyone understands, therefore all we
need to concentrate on is the order in which operations occur....
FWIW, once I had decided on the memory barrier implementation
(ip->i_flags_lock), there were three basic principles I followed to
prove to myself that the inode cache lookup is RCU safe:
1. that the state of inodes in the process of removal from
the tree is detectable during lookup (i.e. via the
XFS_IRECLAIM flag).
2. that inodes in the process of being freed (regardless of
removal state) in the RCU grace period can be detected (i.e.
by clearing ip->i_ino before calling rcu_free)
3. that the state of new inodes is detectable during lookup
so we can avoid them until they are fully initialised. i.e.
the XFS_INEW flag.
That's why xfs_iget_cache_hit() takes the i_flags_lock, then checks
ip->i_ino and the XFS_INEW | XFS_IRECLAIM flags, and
xfs_inode_free() ensures that ip->i_ino = 0 and XFS_IRECLAIM is set
before calling rcu_free() on the inode. The problem was that I
hadn't got the ordering of operations for #3 correct, and that's
what the bug fix is for....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Warning from unlock_new_inode
2012-02-29 1:49 ` Dave Chinner
` (2 preceding siblings ...)
2012-03-01 3:03 ` Ben Myers
@ 2012-03-01 5:09 ` Eric Sandeen
3 siblings, 0 replies; 10+ messages in thread
From: Eric Sandeen @ 2012-03-01 5:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Jan Kara, xfs
On 2/28/12 7:49 PM, Dave Chinner wrote:
...
> Jan, can you try the (untested) patch below?
>
> Cheers,
>
> Dave.
Just for the list's benefit - we tested this with the in-house NFS benchmark reproducer, and the tester also reported success.
Thanks Dave!
-Eric