* RE: FW: oops in 2.4.25 prune_icache() called from kswapd
@ 2005-08-08 21:03 Srivastava, Rahul
2005-08-09 0:33 ` Marcelo Tosatti
0 siblings, 1 reply; 8+ messages in thread
From: Srivastava, Rahul @ 2005-08-08 21:03 UTC (permalink / raw)
To: Ernie Petrides; +Cc: marcelo.tosatti, lwoodman, linux-fsdevel
Hi,
Thanks a lot. The second race mentioned below explains it all.
Now I understand the significance of adding I_CLEAR. I had never
noticed that the I_CLEAR flag is assigned directly to i_state. Since that
assignment also wipes the I_FREEING flag, the addition of the I_CLEAR
check in __refile_inode() makes sense.
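For my own reference, the distinction I had missed, shown as a minimal
sketch (paraphrasing the fs/inode.c lines quoted further below, not the
exact RHEL3 source):
*****
/* clear_inode() assigns the whole state word; it does not OR the flag in: */
inode->i_state = I_CLEAR;       /* I_FREEING (and every other flag) is gone */

/* ...so testing I_FREEING alone is not enough; the patched check is: */
if (inode->i_state & (I_FREEING|I_CLEAR))
        return;
*****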
Thanks,
Rahul
-----Original Message-----
From: Ernie Petrides [mailto:petrides@redhat.com]
Sent: Monday, August 08, 2005 3:20 PM
To: Srivastava, Rahul
Cc: marcelo.tosatti@cyclades.com; lwoodman@redhat.com;
linux-fsdevel@vger.kernel.org
Subject: Re: FW: oops in 2.4.25 prune_icache() called from kswapd
On Monday, 8-Aug-2005 at 11:43 CDT, "Rahul Srivastava" wrote:
> I was just wondering if any of you guys had a chance to validate the
> hypothesis and the proposed fix.
Larry Woodman is the one who worked on this problem, and he agreed (last
week) to follow up on this discussion. Unfortunately, he's away this
week.
I've forwarded his mail (posted to an internal Red Hat patch review
mailing list 3 months ago) in the interim, and this contains the patch
that has been committed to Red Hat Enterprise Linux version 3 (of which
Update 6 is currently in beta).
Hopefully, Larry will follow up when he gets back.
Cheers. -ernie
------- Forwarded Message
From: Larry Woodman <lwoodman@redhat.com>
Date: Mon, 09 May 2005 11:39:11 -0400
Subject: [Taroon Patch] fix inode cache deadlock/race...
------------------------------------------------------------------------
Over the past couple of weeks we have seen two races in the inode cache
code. The first is between [dispose_list()] and __refile_inode() and the
second is between prune_icache() and truncate_inodes(). I posted both of
these patches but wanted to make sure they got properly reviewed and
included in RHEL3-U6.
Fixes [RHEL3 bugzillas 149636 and] 155289.
The first scenario is:
1.) cpu0 is in __sync_one() just about to call __refile_inode() after
taking the inode_lock and clearing I_LOCK.
---------------------------------------------------------
        spin_lock(&inode_lock);
        inode->i_state &= ~I_LOCK;
        if (!(inode->i_state & I_FREEING))
                __refile_inode(inode);
        wake_up(&inode->i_wait);
---------------------------------------------------------
2.) cpu1 is in [dispose_list()] where it has dropped the inode_lock and
calls clear_inode(). It doesn't block, because I_LOCK is clear, so it sets
the inode state.
---------------------------------------------------------
void clear_inode(struct inode *inode)
{
...
wait_on_inode(inode);
...
inode->i_state = I_CLEAR;
...
}
---------------------------------------------------------
3.) cpu0 calls __refile_inode which places it on one of the four
possible inode lists
---------------------------------------------------------
static inline void __refile_inode(struct inode *inode)
{
        struct list_head *to;
        ...
        if (inode->i_state & I_DIRTY)
                to = &inode->i_sb->s_dirty;
        else if (atomic_read(&inode->i_count))
                to = &inode_in_use;
        else if (inode->i_data.nrpages)
                to = &inode_unused_pagecache;
        else
                to = &inode_unused;

        list_del(&inode->i_list);
        list_add(&inode->i_list, to);
}
---------------------------------------------------------
4.) cpu1 returns from clear_inode() then calls destroy_inode() which
kmem_cache_free()s it.
---------------------------------------------------------
static void destroy_inode(struct inode *inode)
{
        if (inode->i_sb->s_op->destroy_inode)
                inode->i_sb->s_op->destroy_inode(inode);
        else
                kmem_cache_free(inode_cachep, inode);
}
---------------------------------------------------------
5.) at this point we have an inode that has been kmem_cache_free()'d
that is also sitting on one of the lists determined by __refile_inode(),
that can't be good!!! Also, the code looks the same in RHEL4.
The second scenario is:
CPU0 is in prune_icache() called by kswapd, and CPU1 is in
invalidate_inodes() called by the auto-mount daemon.
1.) CPU0: prune_icache() sets the I_LOCK bit in an inode on the
inode_unused_pagecache list, releases the inode_lock and calls
invalidate_inode_pages.
2.) CPU1: invalidate_inodes() calls invalidate_list() for the
inode_unused_pagecache list with the inode_lock held and sets the
I_FREEING bit in inode->i_state.
3.) CPU0: prune_icache() acquires the inode_lock and clears the I_LOCK
bit in the inode->i_state.
4.) CPU1: dispose_list() calls clear_inode() without the inode_lock
held. Since the I_LOCK bit is clear, clear_inode() sets inode->i_state =
I_CLEAR, clearing the I_FREEING bit.
5.) CPU0: prune_icache() calls __refile_inode() because clear_inode()
cleared I_FREEING without holding the inode_lock. The inode is no longer
on the inode_unused_pagecache list, so it ends up being placed on the
inode_unused list.
6.) CPU1: dispose_list() calls destroy_inode() which kmem_cache_free()s
an inode that is also on the inode_unused list.
At this point there is an inode that has been kmem_cache_free()'d and is
also on the inode_unused list.
This patch makes clear_inode() acquire the inode_lock before writing the
inode->i_state field. This was the only place in the kernel that
manipulated i_state without holding the inode_lock.
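Taken together, the intent of the two hunks below is roughly this (an
illustrative sketch of the patched paths, not the complete functions):
---------------------------------------------------------
        /* cpu1: dispose_list() -> clear_inode(), now serialized on inode_lock */
        spin_lock(&inode_lock);
        inode->i_state = I_CLEAR;       /* still overwrites I_FREEING... */
        spin_unlock(&inode_lock);

        /* cpu0: __sync_one()/prune_icache() -> __refile_inode(), also under
           inode_lock, so it sees either the old state or I_CLEAR: */
        if (inode->i_state & (I_FREEING|I_CLEAR))   /* ...and bails out either way */
                return;
---------------------------------------------------------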
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -296,7 +296,7 @@ static inline void __refile_inode(struct
{
struct list_head *to;
- if (inode->i_state & I_FREEING)
+ if (inode->i_state & (I_FREEING|I_CLEAR))
return;
if (list_empty(&inode->i_hash))
return;
@@ -636,7 +636,9 @@ void clear_inode(struct inode *inode)
cdput(inode->i_cdev);
inode->i_cdev = NULL;
}
+ spin_lock(&inode_lock);
inode->i_state = I_CLEAR;
+ spin_unlock(&inode_lock);
}
/*
------- End of Forwarded Message
* Re: FW: oops in 2.4.25 prune_icache() called from kswapd
2005-08-08 21:03 FW: oops in 2.4.25 prune_icache() called from kswapd Srivastava, Rahul
@ 2005-08-09 0:33 ` Marcelo Tosatti
0 siblings, 0 replies; 8+ messages in thread
From: Marcelo Tosatti @ 2005-08-09 0:33 UTC (permalink / raw)
To: Srivastava, Rahul; +Cc: Ernie Petrides, lwoodman, linux-fsdevel
On Mon, Aug 08, 2005 at 04:03:38PM -0500, Srivastava, Rahul wrote:
> Hi,
>
> Thanks a lot. The second race mentioned below explains it all.
>
> Now I understand the significance of adding I_CLEAR. I had never
> noticed that the I_CLEAR flag is assigned directly to i_state. Since that
> assignment also wipes the I_FREEING flag, the addition of the I_CLEAR
> check in __refile_inode() makes sense.
Right - I also missed the clearing of the I_FREEING flag.
Thanks!
* RE: FW: oops in 2.4.25 prune_icache() called from kswapd
@ 2005-08-10 22:59 Srivastava, Rahul
0 siblings, 0 replies; 8+ messages in thread
From: Srivastava, Rahul @ 2005-08-10 22:59 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Ernie Petrides, lwoodman, linux-fsdevel
Hi Marcelo,
I also agree that, in the code, there is no scenario in which
__sync_one() and iput() can hold a reference to the same inode
simultaneously and lead to the corruption described below.
Thanks a lot for your patience and time,
Rahul
-----Original Message-----
From: Marcelo Tosatti [mailto:marcelo.tosatti@cyclades.com]
Sent: Wednesday, August 10, 2005 4:56 PM
To: Srivastava, Rahul
Cc: Ernie Petrides; lwoodman@redhat.com; linux-fsdevel@vger.kernel.org
Subject: Re: FW: oops in 2.4.25 prune_icache() called from kswapd
Hi Rahul,
My previous description was incomplete; AFAICS there is a more important
thing which guarantees consistency.
On Tue, Aug 09, 2005 at 12:00:54PM -0500, Srivastava, Rahul wrote:
> Hi,
>
> Please consider following scenario (with the patch/fix from Larry):
>
> -> engine 0: calls iput() and lock inode_lock. iput removes the inode
> from the i_list and unlocks inode_lock
>
> ---> engine 1: grab inode_lock and calls __sync_one()
Inodes which reach __sync_one() have been found through any of the type
lists (dirty, in use, etc), which are walked with the inode_lock held.
iput() deletes the inode from the i_list before proceeding, so they
are unreachable via the type lists after the inode lock is released.
Go ahead and try to prove me wrong -- what you're doing is very welcome
(sincerely, I don't fully understand the inode cache).
> -> engine 0: calls clear_inode(), get past the call to "wait_on_inode()"
> which looks if I_LOCK is set.
> /* From this point onwards clear_inode() and the remainder of iput()
> does not care about I_LOCK or inode_lock. */
> Now with new changes, it will wait for inode_lock before setting the
> state to I_CLEAR. So now we are waiting for inode_lock on engine 0.
>
> ---> engine 1: Sets I_LOCK and release the inode_lock
>
>
> -> engine 0: We now get the lock and set the state to I_CLEAR, release
> the lock and free the inode (though on engine 1 we have set I_LOCK and
> are thinking that no one will destroy this inode).
>
> Though numerous kinds of corruption are possible now, I am citing one
> example here: under low-memory conditions it is possible that the inode's
> slab from inode_cachep (the inode cache) will be returned to system
> memory (provided all the objects in that particular slab are freed).
> That memory chunk (which we were just using for the inode) is then
> allocated to some other process. Suppose this new process, which just
> got the newly allocated chunk, clears the field that used to be
> i_state to NULL, or sets it to some other value (one in which neither
> I_FREEING nor I_CLEAR appears to be set).
>
> ---> engine 1: We get the spinlock, clear I_LOCK (even though we don't
> own this memory chunk any more), see (as per the example scenario above)
> that neither I_FREEING nor I_CLEAR is set, and insert this freed
> inode into the list!! This way we will still end up with a corrupted list
> (as per the example scenario above).
>
> Please correct me if I am wrong somewhere.
* RE: FW: oops in 2.4.25 prune_icache() called from kswapd
@ 2005-08-09 17:00 Srivastava, Rahul
2005-08-10 21:56 ` Marcelo Tosatti
0 siblings, 1 reply; 8+ messages in thread
From: Srivastava, Rahul @ 2005-08-09 17:00 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Ernie Petrides, lwoodman, linux-fsdevel
Hi,
Please consider the following scenario (with the patch/fix from Larry):
-> engine 0: calls iput() and takes the inode_lock. iput() removes the inode
from the i_list and unlocks the inode_lock.
---> engine 1: grabs the inode_lock and calls __sync_one()
-> engine 0: calls clear_inode(), gets past the call to "wait_on_inode()",
which checks whether I_LOCK is set.
/* From this point onwards clear_inode() and the remainder of iput()
do not care about I_LOCK or the inode_lock. */
Now, with the new changes, it will wait for the inode_lock before setting the
state to I_CLEAR. So now we are waiting for the inode_lock on engine 0.
---> engine 1: sets I_LOCK and releases the inode_lock
-> engine 0: we now get the lock and set the state to I_CLEAR, release
the lock and free the inode (though on engine 1 we have set I_LOCK and
are thinking that no one will destroy this inode).
Though numerous kinds of corruption are possible now, I am citing one
example here: under low-memory conditions it is possible that the inode's
slab from inode_cachep (the inode cache) will be returned to system memory
(provided all the objects in that particular slab are freed). That memory
chunk (which we were just using for the inode) is then allocated to some
other process. Suppose this new process, which just got the newly
allocated chunk, clears the field that used to be i_state to NULL, or sets
it to some other value (one in which neither I_FREEING nor I_CLEAR appears
to be set).
---> engine 1: We get the spinlock, clear I_LOCK (even though we don't
own this memory chunk any more), see (as per the example scenario above)
that neither I_FREEING nor I_CLEAR is set, and insert this freed
inode into the list!! This way we will still end up with a corrupted list
(as per the example scenario above).
Please correct me if I am wrong somewhere.
Thanks,
Rahul
-----Original Message-----
From: Marcelo Tosatti [mailto:marcelo.tosatti@cyclades.com]
Sent: Monday, August 08, 2005 9:55 PM
To: Srivastava, Rahul
Cc: petrides@redhat.com; lwoodman@redhat.com;
linux-fsdevel@vger.kernel.org
Subject: Re: FW: oops in 2.4.25 prune_icache() called from kswapd
On Mon, Aug 08, 2005 at 11:45:28AM -0500, Srivastava, Rahul wrote:
> Hi All,
>
> I was just wondering if any of you guys had a chance to validate the
> hypothesis and the proposed fix.
>
> Thanks,
> Rahul
>
>
> -----Original Message-----
> From: Srivastava, Rahul
> Sent: Tuesday, August 02, 2005 8:32 AM
> To: 'Marcelo Tosatti'; 'Ernie Petrides'; 'Larry Woodman'
> Subject: RE: oops in 2.4.25 prune_icache() called from kswapd
>
>
> Hi,
>
> Thanks for reviewing the mail. I was thinking whether below changes in
> clear_inode() will close the race window:
>
> in clear_inode(), change line:
>
> inode->i_state = I_CLEAR;
>
> with below piece of code:
>
> *****
> spin_lock(&inode_lock);
> while (inode->i_state & I_LOCK) {
> spin_unlock(&inode_lock);
> __wait_on_inode(inode);
> spin_lock(&inode_lock);
> }
> inode->i_state = I_CLEAR;
> spin_unlock(&inode_lock);
> *****
>
> I feel the race is between "__sync_one()" and "iput()/clear_inode()"
> (also suggested by Albert) which is as follows:
>
> **************** race condition ****************
>
> engine 0:
>     calls iput() and takes the inode_lock. iput() removes the inode
>     from the i_list and unlocks the inode_lock.
>
>         engine 1:
>             grabs the inode_lock and calls __sync_one()
>
> engine 0:
>     calls clear_inode(), gets past the call to "wait_on_inode()" which
>     checks whether I_LOCK is set.
>     /* From this point onwards clear_inode() and the remainder of iput()
>        do not care about I_LOCK or inode_lock. */
>
>         engine 1:
>             sets I_LOCK.
>
> engine 0:
>     sets i_state = I_CLEAR
>     iput() calls destroy_inode()
>     kmem_cache_free() returns the inode to the free list of the inode cache.
>
>         engine 1:
>             goes ahead and inserts the freed inode into one of the three
>             possible lists.
As stated in private, Larry's fix should catch that in __refile_inode()
and ignore the I_CLEAR inode.
> And we endup in having a corrupted inode on the inode list.
>
> Your thoughts please.
* Re: FW: oops in 2.4.25 prune_icache() called from kswapd
2005-08-09 17:00 Srivastava, Rahul
@ 2005-08-10 21:56 ` Marcelo Tosatti
0 siblings, 0 replies; 8+ messages in thread
From: Marcelo Tosatti @ 2005-08-10 21:56 UTC (permalink / raw)
To: Srivastava, Rahul; +Cc: Ernie Petrides, lwoodman, linux-fsdevel
Hi Rahul,
My previous description was incomplete; AFAICS there is a more important
thing which guarantees consistency.
On Tue, Aug 09, 2005 at 12:00:54PM -0500, Srivastava, Rahul wrote:
> Hi,
>
> Please consider following scenario (with the patch/fix from Larry):
>
> -> engine 0: calls iput() and lock inode_lock. iput removes the inode
> from the i_list and unlocks inode_lock
>
> ---> engine 1: grab inode_lock and calls __sync_one()
Inodes which reach __sync_one() have been found through any of the type
lists (dirty, in use, etc), which are walked with the inode_lock held.
iput() deletes the inode from the i_list before proceeding, so they
are unreachable via the type lists after the inode lock is released.
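For reference, the ordering relied on here is visible in the tail of the
2.4 iput() path; roughly (a from-memory sketch, not the exact 2.4.21/RHEL3
source, so treat the details as approximate):
        spin_lock(&inode_lock);
        ...
        list_del(&inode->i_hash);
        list_del(&inode->i_list);
        inode->i_state |= I_FREEING;    /* marked while still under inode_lock */
        spin_unlock(&inode_lock);

        /* by the time anyone else can retake inode_lock, the inode is already
           off the hash and the type lists, so __sync_one() (which only finds
           inodes by walking those lists under inode_lock) cannot reach it */
        clear_inode(inode);
        destroy_inode(inode);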
Go ahead and try to prove me wrong -- what you're doing is very welcome
(sincerely, I don't fully understand the inode cache).
> -> engine 0: calls clear_inode(), get past the call to "wait_on_inode()"
> which looks if I_LOCK is set.
> /* From this point onwards clear_inode() and the remainder of iput()
> does not care about I_LOCK or inode_lock. */
> Now with new changes, it will wait for inode_lock before setting the
> state to I_CLEAR. So now we are waiting for inode_lock on engine 0.
>
> ---> engine 1: Sets I_LOCK and release the inode_lock
>
>
> -> engine 0: We now get the lock and set the state to I_CLEAR, release
> the lock and free the inode (though on engine 1 we have set I_LOCK and
> are thinking that no one will destroy this inode).
>
> Though numerous kinds of corruption are possible now, I am citing one
> example here: under low-memory conditions it is possible that the inode's
> slab from inode_cachep (the inode cache) will be returned to system memory
> (provided all the objects in that particular slab are freed). That memory
> chunk (which we were just using for the inode) is then allocated to some
> other process. Suppose this new process, which just got the newly
> allocated chunk, clears the field that used to be i_state to NULL, or sets
> it to some other value (one in which neither I_FREEING nor I_CLEAR appears
> to be set).
>
> ---> engine 1: We get the spinlock, clear I_LOCK (even though we don't
> own this memory chunk any more), see (as per the example scenario above)
> that neither I_FREEING nor I_CLEAR is set, and insert this freed
> inode into the list!! This way we will still end up with a corrupted list
> (as per the example scenario above).
>
> Please correct me if I am wrong somewhere.
[parent not found: <717252EC3E37AE4392E2614EA24E9F2B0D7D0733@txnexc01.americas.cpqcorp.net>]
* Re: FW: oops in 2.4.25 prune_icache() called from kswapd
[not found] <717252EC3E37AE4392E2614EA24E9F2B0D7D0733@txnexc01.americas.cpqcorp.net>
@ 2005-08-08 20:19 ` Ernie Petrides
0 siblings, 0 replies; 8+ messages in thread
From: Ernie Petrides @ 2005-08-08 20:19 UTC (permalink / raw)
To: Rahul Srivastava; +Cc: marcelo.tosatti, lwoodman, linux-fsdevel
On Monday, 8-Aug-2005 at 11:43 CDT, "Rahul Srivastava" wrote:
> I was just wondering if any of you guys had a chance to validate the
> hypothesis and the proposed fix.
Larry Woodman is the one who worked on this problem, and he agreed (last
week) to follow up on this discussion. Unfortunately, he's away this
week.
I've forwarded his mail (posted to an internal Red Hat patch review
mailing list 3 months ago) in the interim, and this contains the patch
that has been committed to Red Hat Enterprise Linux version 3 (of which
Update 6 is currently in beta).
Hopefully, Larry will follow up when he gets back.
Cheers. -ernie
------- Forwarded Message
From: Larry Woodman <lwoodman@redhat.com>
Date: Mon, 09 May 2005 11:39:11 -0400
Subject: [Taroon Patch] fix inode cache deadlock/race...
------------------------------------------------------------------------
Over the past couple of weeks we have seen two races in the inode cache
code. The first is between [dispose_list()] and __refile_inode() and the
second is between prune_icache() and truncate_inodes(). I posted both of
these patches but wanted to make sure they got properly reviewed and
included in RHEL3-U6.
Fixes [RHEL3 bugzillas 149636 and] 155289.
The first scenario is:
1.) cpu0 is in __sync_one() just about to call __refile_inode() after
taking the inode_lock and clearing I_LOCK.
---------------------------------------------------------
        spin_lock(&inode_lock);
        inode->i_state &= ~I_LOCK;
        if (!(inode->i_state & I_FREEING))
                __refile_inode(inode);
        wake_up(&inode->i_wait);
---------------------------------------------------------
2.) cpu1 is in [dispose_list()] where it has dropped the inode_lock and
calls clear_inode(). It doesn't block, because I_LOCK is clear, so it sets
the inode state.
---------------------------------------------------------
void clear_inode(struct inode *inode)
{
...
wait_on_inode(inode);
...
inode->i_state = I_CLEAR;
...
}
---------------------------------------------------------
3.) cpu0 calls __refile_inode which places it on one of the four
possible inode lists
---------------------------------------------------------
static inline void __refile_inode(struct inode *inode)
{
        struct list_head *to;
        ...
        if (inode->i_state & I_DIRTY)
                to = &inode->i_sb->s_dirty;
        else if (atomic_read(&inode->i_count))
                to = &inode_in_use;
        else if (inode->i_data.nrpages)
                to = &inode_unused_pagecache;
        else
                to = &inode_unused;

        list_del(&inode->i_list);
        list_add(&inode->i_list, to);
}
---------------------------------------------------------
4.) cpu1 returns from clear_inode() then calls destroy_inode() which
kmem_cache_free()s it.
---------------------------------------------------------
static void destroy_inode(struct inode *inode)
{
        if (inode->i_sb->s_op->destroy_inode)
                inode->i_sb->s_op->destroy_inode(inode);
        else
                kmem_cache_free(inode_cachep, inode);
}
---------------------------------------------------------
5.) at this point we have an inode that has been kmem_cache_free()'d
that is also sitting on one of the lists determined by __refile_inode(),
that can't be good!!! Also, the code looks the same in RHEL4.
The second scenario is:
CPU0 is in prune_icache() called by kswapd, and CPU1 is in
invalidate_inodes() called by the auto-mount daemon.
1.) CPU0: prune_icache() sets the I_LOCK bit in an inode on the
inode_unused_pagecache list, releases the inode_lock and calls
invalidate_inode_pages.
2.) CPU1: invalidate_inodes() calls invalidate_list() for the
inode_unused_pagecache list with the inode_lock held and sets the
I_FREEING bit in inode->i_state.
3.) CPU0: prune_icache() acquires the inode_lock and clears the I_LOCK
bit in the inode->i_state.
4.) CPU1: dispose_list() calls clear_inode() without the inode_lock held.
Since the I_LOCK bit is clear, clear_inode() sets inode->i_state = I_CLEAR,
clearing the I_FREEING bit.
5.) CPU0: prune_icache() calls __refile_inode() because clear_inode()
cleared I_FREEING without holding the inode_lock. The inode is no longer
on the inode_unused_pagecache list, so it ends up being placed on the
inode_unused list.
6.) CPU1: dispose_list() calls destroy_inode(), which kmem_cache_free()s
an inode that is also on the inode_unused list.
At this point there is an inode that has been kmem_cache_free()'d and is
also on the inode_unused list.
This patch makes clear_inode() acquire the inode_lock before writing the
inode->i_state field. This was the only place in the kernel that
manipulated i_state without holding the inode_lock.
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -296,7 +296,7 @@ static inline void __refile_inode(struct
{
struct list_head *to;
- if (inode->i_state & I_FREEING)
+ if (inode->i_state & (I_FREEING|I_CLEAR))
return;
if (list_empty(&inode->i_hash))
return;
@@ -636,7 +636,9 @@ void clear_inode(struct inode *inode)
cdput(inode->i_cdev);
inode->i_cdev = NULL;
}
+ spin_lock(&inode_lock);
inode->i_state = I_CLEAR;
+ spin_unlock(&inode_lock);
}
/*
------- End of Forwarded Message
* FW: oops in 2.4.25 prune_icache() called from kswapd
@ 2005-08-08 16:45 Srivastava, Rahul
2005-08-09 2:55 ` Marcelo Tosatti
0 siblings, 1 reply; 8+ messages in thread
From: Srivastava, Rahul @ 2005-08-08 16:45 UTC (permalink / raw)
To: marcelo.tosatti, petrides, lwoodman, linux-fsdevel
Hi All,
I was just wondering if any of you guys had a chance to validate the
hypothesis and the proposed fix.
Thanks,
Rahul
-----Original Message-----
From: Srivastava, Rahul
Sent: Tuesday, August 02, 2005 8:32 AM
To: 'Marcelo Tosatti'; 'Ernie Petrides'; 'Larry Woodman'
Subject: RE: oops in 2.4.25 prune_icache() called from kswapd
Hi,
Thanks for reviewing the mail. I was wondering whether the below change in
clear_inode() would close the race window:
In clear_inode(), replace the line:
inode->i_state = I_CLEAR;
with the below piece of code:
*****
spin_lock(&inode_lock);
while (inode->i_state & I_LOCK) {
        spin_unlock(&inode_lock);
        __wait_on_inode(inode);
        spin_lock(&inode_lock);
}
inode->i_state = I_CLEAR;
spin_unlock(&inode_lock);
*****
I feel the race is between "__sync_one()" and "iput()/clear_inode()"
(also suggested by Albert) which is as follows:
**************** race condition ****************

engine 0:
    calls iput() and takes the inode_lock. iput() removes the inode
    from the i_list and unlocks the inode_lock.

        engine 1:
            grabs the inode_lock and calls __sync_one()

engine 0:
    calls clear_inode(), gets past the call to "wait_on_inode()" which
    checks whether I_LOCK is set.
    /* From this point onwards clear_inode() and the remainder of iput()
       do not care about I_LOCK or inode_lock. */

        engine 1:
            sets I_LOCK.

engine 0:
    sets i_state = I_CLEAR
    iput() calls destroy_inode()
    kmem_cache_free() returns the inode to the free list of the inode cache.

        engine 1:
            goes ahead and inserts the freed inode into one of the three
            possible lists.
And we end up with a corrupted inode on the inode list.
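For reference, the window engine 1 sits in above is the middle of
__sync_one(); roughly (a from-memory sketch of the 2.4 code around the
fragment Larry quoted, not the exact source):
        /* called with inode_lock held */
        ...
        inode->i_state |= I_LOCK;               /* engine 1: "Sets I_LOCK" */
        inode->i_state &= ~I_DIRTY;
        spin_unlock(&inode_lock);               /* lock dropped for the writeout */

        filemap_fdatasync(inode->i_mapping);
        /* write_inode(), filemap_fdatawait(), ... */

        spin_lock(&inode_lock);                 /* lock retaken */
        inode->i_state &= ~I_LOCK;
        if (!(inode->i_state & I_FREEING))
                __refile_inode(inode);          /* engine 1: refiles the freed inode */
        wake_up(&inode->i_wait);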
Your thoughts please.
Thanks,
Rahul
-----Original Message-----
From: Marcelo Tosatti [mailto:marcelo.tosatti@cyclades.com]
Sent: Monday, August 01, 2005 6:08 AM
To: Srivastava, Rahul; Ernie Petrides; Larry Woodman
Subject: Re: oops in 2.4.25 prune_icache() called from kswapd
On Thu, Jul 28, 2005 at 05:57:43PM -0500, Srivastava, Rahul wrote:
> Hi Marcelo,
>
> Thanks a lot for your detailed response.
>
> However, I still don't see how adding I_CLEAR in __refile_inode is
> avoiding the race mentioned below.
>
> I understand that the I_CLEAR flag addition is done to ensure that
> __refile_inode() doesn't insert the inode in one of the four lists and
> should just return (in this particular scenario). Please correct me if
> I am wrong in my assumption.
>
> Now if we see the code, without I_CLEAR flag (unpatched code), we are
> checking for the I_FREEING flag in __refile_inode(). And if an inode is
> marked for freeing (i.e., I_FREEING is set) current code also returns
> without adding it into one of the four lists.
>
> And there is no scenario, in the code, wherein we can have an inode with
> I_CLEAR flag set but I_FREEING unset.
I fail to disagree: the I_FREEING flag will always be set when an attempt
is made to set I_CLEAR (there are asserts to guarantee that).
So, I also can't understand the addition of I_CLEAR check on
__refile_inode() and its purpose.
Larry, can you clarify for us?
> In a nutshell: if I_CLEAR is set, that means I_FREEING will also be set.
> And since we are already checking for I_FREEING in __refile_inode,
> checking for I_CLEAR will be a kind of duplication?
>
> Hope I have not misinterpreted the complete story.
>
> Thanks,
> Rahul
>
>
> -----Original Message-----
> From: Marcelo Tosatti [mailto:marcelo.tosatti@cyclades.com]
> Sent: Thursday, July 28, 2005 5:38 AM
> To: Srivastava, Rahul
> Subject: Re: oops in 2.4.25 prune_icache() called from kswapd
>
>
> On Thu, Jul 28, 2005 at 10:39:12AM -0500, Srivastava, Rahul wrote:
> > Hi Marcelo,
> >
> > I was seeing your fix in inode.c and need a clarification.
> >
> > In the patch you have added the I_CLEAR flag. What is confusing me is
> > that the "I_CLEAR" flag is set only in "clear_inode()". And in this same
> > function we have:
> >
> > **********
> > if (!(inode->i_state & I_FREEING))
> > BUG();
> > **********
> >
> > So we are setting I_CLEAR only if I_FREEING is set. If that is the
> > case, shouldn't just a check for I_FREEING be enough in
> > __refile_inode()?
>
> The problem is that it is a race between two processors.
>
> > Basically, I am not able to make out the significance of adding
> > I_CLEAR in __refile_inode().
> >
> > + if (inode->i_state & (I_FREEING|I_CLEAR))
>
> That will prevent __refile_inode() from putting a freed inode into
> the lists...
>
> > Thanks for your time and sorry to bother you,
> > Rahul
>
> No problem.
>
> The description goes like:
>
> The first scenario is:
>
> 1.) cpu0 is in __sync_one() just about to call __refile_inode() after
> taking the inode_lock and clearing I_LOCK.
>
> spin_lock(&inode_lock);
> inode->i_state &= ~I_LOCK;
> if (!(inode->i_state & I_FREEING))
> __refile_inode(inode);
> wake_up(&inode->i_wait);
>
> 2.) cpu1 is in [dispose_list()] where it has dropped the inode_lock
> and calls clear_inode(). It doesn't block because I_LOCK is clear so it
> sets the inode state.
>
> void clear_inode(struct inode *inode)
> {
> ...
> wait_on_inode(inode);
> ...
> inode->i_state = I_CLEAR;
> ...
> }
>
> 3.) cpu0 calls __refile_inode which places it on one of the four
> possible inode lists
>
> static inline void __refile_inode(struct inode *inode)
> {
> if (inode->i_state & I_DIRTY)
> to = &inode->i_sb->s_dirty;
> else if (atomic_read(&inode->i_count))
> to = &inode_in_use;
> else if (inode->i_data.nrpages)
> to = &inode_unused_pagecache;
> else
> to = &inode_unused;
>
> list_del(&inode->i_list);
> list_add(&inode->i_list, to);
> }
>
> 4.) cpu1 returns from clear_inode() then calls destroy_inode() which
> kmem_cache_free()s it.
>
> static void destroy_inode(struct inode *inode)
> {
>
> if (inode->i_sb->s_op->destroy_inode)
> inode->i_sb->s_op->destroy_inode(inode);
> else
> kmem_cache_free(inode_cachep, inode);
> }
>
> 5.) at this point we have an inode that has been kmem_cache_free()'d
> that is also sitting on one of the lists determined by __refile_inode(),
> that can't be good!!! Also, the code looks the same in RHEL4.
>
>
> >
> >
> >
> >
> > ============================
> >
> > FYI
> >
> > commit cc54d1333e409f714aa9c7db63f7f9ed07cc57a9
> > tree f301f581dd4389028f8b2588940d456904e552f1
> > parent 2e8f68c45925123d33d476ce369b570bd989dd9a
> > author Larry Woodman <lwoodman@redhat.com> Fri, 15 Jul 2005 11:32:08
> > -0400 committer Marcelo Tosatti <marcelo@dmt.cnet> Tue, 26 Jul 2005
> > 07:52:46 -0300
> >
> > [PATCH] workaround inode cache (prune_icache/__refile_inode) SMP
> > races
> >
> > Over the past couple of weeks we have seen two races in the inode cache
> > code. The first is between [dispose_list()] and __refile_inode() and the
> > second is between prune_icache() and truncate_inodes(). I posted both of
> > these patches but wanted to make sure they got properly reviewed and
> > included in RHEL3-U6.
> >
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -297,7 +297,7 @@ static inline void __refile_inode(struct {
> > struct list_head *to;
> >
> > - if (inode->i_state & I_FREEING)
> > + if (inode->i_state & (I_FREEING|I_CLEAR))
> > return;
> > if (list_empty(&inode->i_hash))
> > return;
> > @@ -634,7 +634,9 @@ void clear_inode(struct inode *inode)
> > cdput(inode->i_cdev);
> > inode->i_cdev = NULL;
> > }
> > + spin_lock(&inode_lock);
> > inode->i_state = I_CLEAR;
> > + spin_unlock(&inode_lock);
> > }
> >
> > /*
> >
> >
> > On Sun, Jun 19, 2005 at 11:07:44PM +0000, Chris Caputo wrote:
> > > My basic repro method was:
> > >
> > > --
> > > 0) start irqbalance
> > > 1) run loop_dbench, which is the following dbench script which uses
> > > client_plain.txt:
> > >
> > > #!/bin/sh
> > >
> > > while [ 1 ]
> > > do
> > >     date
> > >     dbench 2
> > > done
> > >
> > > 2) wait for oops
> > > --
> > >
> > > I think I was using dbench-2.1:
> > >
> > > http://samba.org/ftp/tridge/dbench/dbench-2.1.tar.gz
> > >
> > > In my case irqbalance was key. If I didn't run it I never got the
> > > problem. I think irqbalance just did a good job of exacerbating a
> > > race condition in some way.
* Re: FW: oops in 2.4.25 prune_icache() called from kswapd
2005-08-08 16:45 Srivastava, Rahul
@ 2005-08-09 2:55 ` Marcelo Tosatti
0 siblings, 0 replies; 8+ messages in thread
From: Marcelo Tosatti @ 2005-08-09 2:55 UTC (permalink / raw)
To: Srivastava, Rahul; +Cc: petrides, lwoodman, linux-fsdevel
On Mon, Aug 08, 2005 at 11:45:28AM -0500, Srivastava, Rahul wrote:
> Hi All,
>
> I was just wondering if any of you guys had a chance to validate the
> hypothesis and the proposed fix.
>
> Thanks,
> Rahul
>
>
> -----Original Message-----
> From: Srivastava, Rahul
> Sent: Tuesday, August 02, 2005 8:32 AM
> To: 'Marcelo Tosatti'; 'Ernie Petrides'; 'Larry Woodman'
> Subject: RE: oops in 2.4.25 prune_icache() called from kswapd
>
>
> Hi,
>
> Thanks for reviewing the mail. I was thinking whether below changes in
> clear_inode() will close the race window:
>
> in clear_inode(), change line:
>
> inode->i_state = I_CLEAR;
>
> with below piece of code:
>
> *****
> spin_lock(&inode_lock);
> while (inode->i_state & I_LOCK) {
> spin_unlock(&inode_lock);
> __wait_on_inode(inode);
> spin_lock(&inode_lock);
> }
> inode->i_state = I_CLEAR;
> spin_unlock(&inode_lock);
> *****
>
> I feel the race is between "__sync_one()" and "iput()/clear_inode()"
> (also suggested by Albert) which is as follows:
>
> **************** race condition ****************
>
> engine 0:
>     calls iput() and takes the inode_lock. iput() removes the inode
>     from the i_list and unlocks the inode_lock.
>
>         engine 1:
>             grabs the inode_lock and calls __sync_one()
>
> engine 0:
>     calls clear_inode(), gets past the call to "wait_on_inode()" which
>     checks whether I_LOCK is set.
>     /* From this point onwards clear_inode() and the remainder of iput()
>        do not care about I_LOCK or inode_lock. */
>
>         engine 1:
>             sets I_LOCK.
>
> engine 0:
>     sets i_state = I_CLEAR
>     iput() calls destroy_inode()
>     kmem_cache_free() returns the inode to the free list of the inode cache.
>
>         engine 1:
>             goes ahead and inserts the freed inode into one of the three
>             possible lists.
As stated in private, Larry's fix should catch that in __refile_inode() and
ignore the I_CLEAR inode.
> And we endup in having a corrupted inode on the inode list.
>
> Your thoughts please.
Thread overview: 8+ messages
2005-08-08 21:03 FW: oops in 2.4.25 prune_icache() called from kswapd Srivastava, Rahul
2005-08-09 0:33 ` Marcelo Tosatti
-- strict thread matches above, loose matches on Subject: below --
2005-08-10 22:59 Srivastava, Rahul
2005-08-09 17:00 Srivastava, Rahul
2005-08-10 21:56 ` Marcelo Tosatti
[not found] <717252EC3E37AE4392E2614EA24E9F2B0D7D0733@txnexc01.americas.cpqcorp.net>
2005-08-08 20:19 ` Ernie Petrides
2005-08-08 16:45 Srivastava, Rahul
2005-08-09 2:55 ` Marcelo Tosatti