public inbox for linux-kernel@vger.kernel.org
* inode_unused list corruption in 2.4.26 - spin_lock problem?
@ 2004-06-19  0:47 Chris Caputo
  2004-06-20  0:15 ` Marcelo Tosatti
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Caputo @ 2004-06-19  0:47 UTC (permalink / raw)
  To: linux-kernel

In 2.4.26 on two different dual-proc x86 machines (one dual-P4 Xeon based,
the other dual-PIII) I am seeing crashes which are the result of the
inode_unused doubly linked list in fs/inode.c becoming corrupted.

A particular instance of the corruption I have isolated is in a call from
iput() to __refile_inode().  To try to diagnose this further I placed list
verification code before and after the list_del() and list_add() calls in
__refile_inode() and observed a healthy list become corrupted after the
del/add was completed.
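
For the curious, the check is essentially this (a user-space sketch, not the
actual kernel code; the list_head layout mirrors <linux/list.h>, and the
function names here are just illustrative):

```c
#include <assert.h>

/* Minimal doubly linked list in the style of 2.4's <linux/list.h>. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Walk up to max entries from head and verify that every node's
 * neighbours point back at it; returns 0 on the first bad link. */
static int list_is_sane(struct list_head *head, int max)
{
	struct list_head *p = head;
	int i;

	for (i = 0; i < max; i++) {
		if (p->next->prev != p || p->prev->next != p)
			return 0;
		p = p->next;
		if (p == head)
			break;
	}
	return 1;
}
```

Calling a checker like this immediately before and after the del/add is how
the corruption was caught in the act.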

It would seem to me that list corruption on otherwise healthy machines
would only be the result of the inode_lock spinlock not being properly
locked prior to the call to __refile_inode(), but as far as I can tell,
the call to atomic_dec_and_lock() in iput() is doing that properly.

So I am at a loss.  Has anyone else seen this or does anyone have any idea
what routes I should be exploring to fix this problem?

Thank you,
Chris


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-06-19  0:47 inode_unused list corruption in 2.4.26 - spin_lock problem? Chris Caputo
@ 2004-06-20  0:15 ` Marcelo Tosatti
  2004-06-20  3:33   ` Trond Myklebust
  2004-06-24  1:50   ` Chris Caputo
  0 siblings, 2 replies; 29+ messages in thread
From: Marcelo Tosatti @ 2004-06-20  0:15 UTC (permalink / raw)
  To: Chris Caputo; +Cc: linux-kernel, David Woodhouse, riel


Hi Chris,

I've seen your previous post -- should have answered you earlier.

On Fri, Jun 18, 2004 at 05:47:05PM -0700, Chris Caputo wrote:
> In 2.4.26 on two different dual-proc x86 machines (one dual-P4 Xeon based,
> the other dual-PIII) I am seeing crashes which are the result of the
> inode_unused doubly linked list in fs/inode.c becoming corrupted.

What steps are required to reproduce the problem?

> A particular instance of the corruption I have isolated is in a call from
> iput() to __refile_inode().  To try to diagnose this further I placed list
> verification code before and after the list_del() and list_add() calls in
> __refile_inode() and observed a healthy list become corrupted after the
> del/add was completed.

Can you show us this data in more detail?

> It would seem to me that list corruption on otherwise healthy machines
> would only be the result of the inode_lock spinlock not being properly
> locked prior to the call to __refile_inode(), but as far as I can tell,
> the call to atomic_dec_and_lock() in iput() is doing that properly.
> 
> So I am at a loss.  Has anyone else seen this or does anyone have any idea
> what routes I should be exploring to fix this problem?

The changes between 2.4.25->2.4.26 (which introduce __refile_inode() and 
the unused_pagecache list) must have something to do with this. 

David, Rik, can you give some help here?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-06-20  0:15 ` Marcelo Tosatti
@ 2004-06-20  3:33   ` Trond Myklebust
  2004-06-21  0:45     ` Marcelo Tosatti
  2004-06-24  1:50   ` Chris Caputo
  1 sibling, 1 reply; 29+ messages in thread
From: Trond Myklebust @ 2004-06-20  3:33 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Chris Caputo, linux-kernel, David Woodhouse, riel

On Sat, 19/06/2004 at 20:15, Marcelo Tosatti wrote:

> The changes between 2.4.25->2.4.26 (which introduce __refile_inode() and 
> the unused_pagecache list) must have something to do with this. 

Here's one question:

Given the fact that in iput(), the inode remains hashed and the
inode->i_state does not change until after we've dropped the inode_lock,
called write_inode_now(), and then retaken the inode_lock, exactly what
is preventing a third party task from grabbing that inode?

(Better still: write_inode_now() itself actually calls __iget(), which
could cause that inode to be plonked right back onto the "inode_in_use"
list if ever refile_inode() gets called.)

So does the following patch help?

Cheers,
  Trond

--- linux-2.4.27-pre3/fs/inode.c.orig	2004-05-20 20:41:41.000000000 -0400
+++ linux-2.4.27-pre3/fs/inode.c	2004-06-19 23:22:29.000000000 -0400
@@ -1200,6 +1200,7 @@ void iput(struct inode *inode)
 		struct super_block *sb = inode->i_sb;
 		struct super_operations *op = NULL;
 
+again:
 		if (inode->i_state == I_CLEAR)
 			BUG();
 
@@ -1241,11 +1242,16 @@ void iput(struct inode *inode)
 				if (!(inode->i_state & (I_DIRTY|I_LOCK))) 
 					__refile_inode(inode);
 				inodes_stat.nr_unused++;
-				spin_unlock(&inode_lock);
-				if (!sb || (sb->s_flags & MS_ACTIVE))
+				if (!sb || (sb->s_flags & MS_ACTIVE)) {
+					spin_unlock(&inode_lock);
 					return;
-				write_inode_now(inode, 1);
-				spin_lock(&inode_lock);
+				}
+				if (inode->i_state & I_DIRTY) {
+					__iget(inode);
+					spin_unlock(&inode_lock);
+					write_inode_now(inode, 1);
+					goto again;
+				}
 				inodes_stat.nr_unused--;
 				list_del_init(&inode->i_hash);
 			}


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-06-20  3:33   ` Trond Myklebust
@ 2004-06-21  0:45     ` Marcelo Tosatti
  2004-06-21 17:10       ` Trond Myklebust
  0 siblings, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2004-06-21  0:45 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chris Caputo, linux-kernel, David Woodhouse, riel

On Sat, Jun 19, 2004 at 11:33:55PM -0400, Trond Myklebust wrote:
> On Sat, 19/06/2004 at 20:15, Marcelo Tosatti wrote:
> 
> > The changes between 2.4.25->2.4.26 (which introduce __refile_inode() and 
> > the unused_pagecache list) must have something to do with this. 
> 
> Here's one question:
> 
> Given the fact that in iput(), the inode remains hashed and the
> inode->i_state does not change until after we've dropped the inode_lock,
> called write_inode_now(), and then retaken the inode_lock, exactly what
> is preventing a third party task from grabbing that inode?
> 
> (Better still: write_inode_now() itself actually calls __iget(), which
> could cause that inode to be plonked right back onto the "inode_in_use"
> list if ever refile_inode() gets called.)

Let's see if I get this right: while we drop the lock in iput() to call
write_inode_now(), an iget happens, possibly from write_inode_now() itself
(sync_one->__iget), causing inode->i_list to be added to inode_in_use.

But then the call returns, takes inode_lock, decrements inodes_stat.nr_unused
and deletes the inode from inode_in_use and adds it to inode_unused.

AFAICS that's an inode with i_count==1 on the unused list, which does not
mean "list corruption", right? Am I missing something here?

If you are indeed right all 2.4.x versions contain this bug.

Thanks for helping!

> 
> So does the following patch help?
>
> +++ linux-2.4.27-pre3/fs/inode.c	2004-06-19 23:22:29.000000000 -0400
> @@ -1200,6 +1200,7 @@ void iput(struct inode *inode)
>  		struct super_block *sb = inode->i_sb;
>  		struct super_operations *op = NULL;
>  
> +again:
>  		if (inode->i_state == I_CLEAR)
>  			BUG();
>  
> @@ -1241,11 +1242,16 @@ void iput(struct inode *inode)
>  				if (!(inode->i_state & (I_DIRTY|I_LOCK))) 
>  					__refile_inode(inode);
>  				inodes_stat.nr_unused++;
> -				spin_unlock(&inode_lock);
> -				if (!sb || (sb->s_flags & MS_ACTIVE))
> +				if (!sb || (sb->s_flags & MS_ACTIVE)) {
> +					spin_unlock(&inode_lock);
>  					return;
> -				write_inode_now(inode, 1);
> -				spin_lock(&inode_lock);
> +				}
> +				if (inode->i_state & I_DIRTY) {
> +					__iget(inode);
> +					spin_unlock(&inode_lock);
> +					write_inode_now(inode, 1);
> +					goto again;
> +				}
>  				inodes_stat.nr_unused--;
>  				list_del_init(&inode->i_hash);
>  			}
> 
^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-06-21  0:45     ` Marcelo Tosatti
@ 2004-06-21 17:10       ` Trond Myklebust
  2004-06-21 18:23         ` Marcelo Tosatti
  2004-06-24  1:51         ` Chris Caputo
  0 siblings, 2 replies; 29+ messages in thread
From: Trond Myklebust @ 2004-06-21 17:10 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Chris Caputo, linux-kernel, David Woodhouse, riel

On Sun, 20/06/2004 at 20:45, Marcelo Tosatti wrote:
> Let's see if I get this right: while we drop the lock in iput() to call
> write_inode_now(), an iget happens, possibly from write_inode_now() itself
> (sync_one->__iget), causing inode->i_list to be added to inode_in_use.
> But then the call returns, takes inode_lock, decrements inodes_stat.nr_unused
> and deletes the inode from inode_in_use and adds it to inode_unused.
> 
> AFAICS that's an inode with i_count==1 on the unused list, which does not
> mean "list corruption", right? Am I missing something here?

Yes. Please don't forget that the inode is still hashed and is not yet
marked as FREEING: find_inode() can grab it on behalf of some other
process as soon as we drop that spinlock inside iput(). Then we have the
calls to clear_inode() + destroy_inode() just a few lines further down.
;-)

If the above scenario ever does occur, it will cause random Oopses for
third party processes. Since we do not see this too often, my guess is
that the write_inode_now() path must be very rarely (or never?) called.

> If you are indeed right all 2.4.x versions contain this bug.

...and all 2.6.x versions...

I'm not saying this is the same problem that Chris is seeing, but I am
failing to see how iput() is safe as it stands right now. Please
enlighten me if I'm missing something.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-06-21 17:10       ` Trond Myklebust
@ 2004-06-21 18:23         ` Marcelo Tosatti
  2004-06-24  1:51         ` Chris Caputo
  1 sibling, 0 replies; 29+ messages in thread
From: Marcelo Tosatti @ 2004-06-21 18:23 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chris Caputo, linux-kernel, David Woodhouse, riel, viro

On Mon, Jun 21, 2004 at 01:10:21PM -0400, Trond Myklebust wrote:
> On Sun, 20/06/2004 at 20:45, Marcelo Tosatti wrote:
> > Let's see if I get this right: while we drop the lock in iput() to call
> > write_inode_now(), an iget happens, possibly from write_inode_now() itself
> > (sync_one->__iget), causing inode->i_list to be added to inode_in_use.
> > But then the call returns, takes inode_lock, decrements inodes_stat.nr_unused
> > and deletes the inode from inode_in_use and adds it to inode_unused.
> > 
> > AFAICS that's an inode with i_count==1 on the unused list, which does not
> > mean "list corruption", right? Am I missing something here?
> 
> Yes. Please don't forget that the inode is still hashed and is not yet
> marked as FREEING: find_inode() can grab it on behalf of some other
> process as soon as we drop that spinlock inside iput(). Then we have the
> calls to clear_inode() + destroy_inode() just a few lines further down.
> ;-)
> 
> If the above scenario ever does occur, it will cause random Oopses for
> third party processes. Since we do not see this too often, my guess is
> that the write_inode_now() path must be very rarely (or never?) called.

That's what I thought: if the scenario you described really happens, we
would see random oopses (processes using a deleted inode) instead of
Chris's list corruption.

Chris, _please_ post your full oopses.

> > If you are indeed right all 2.4.x versions contain this bug.
> 
> ...and all 2.6.x versions...
> 
> I'm not saying this is the same problem that Chris is seeing, but I am
> failing to see how iput() is safe as it stands right now. Please
> enlighten me if I'm missing something.

Your analysis looks right to me, and we have a problem here.

I think Al Viro knows iput() very well. Maybe he should take a look
at your patch. CC'ed.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-06-20  0:15 ` Marcelo Tosatti
  2004-06-20  3:33   ` Trond Myklebust
@ 2004-06-24  1:50   ` Chris Caputo
  2004-06-25  8:04     ` Chris Caputo
  2004-06-25 12:17     ` Marcelo Tosatti
  1 sibling, 2 replies; 29+ messages in thread
From: Chris Caputo @ 2004-06-24  1:50 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, David Woodhouse, riel, Trond Myklebust

On Sat, 19 Jun 2004, Marcelo Tosatti wrote:
> On Fri, Jun 18, 2004 at 05:47:05PM -0700, Chris Caputo wrote:
> > In 2.4.26 on two different dual-proc x86 machines (one dual-P4 Xeon based,
> > the other dual-PIII) I am seeing crashes which are the result of the
> > inode_unused doubly linked list in fs/inode.c becoming corrupted.
> 
> What steps are required to reproduce the problem?

Unfortunately I don't yet know how to (or if I can) repro this with a
stock kernel.  I am using 2.4.26 + Ingo's tux3-2.4.23-A3 patch in
conjunction with a filesystem I wrote.  (the tux module itself is not
being used, just the patches to the existing kernel)

> > A particular instance of the corruption I have isolated is in a call from
> > iput() to __refile_inode().  To try to diagnose this further I placed list
> > verification code before and after the list_del() and list_add() calls in
> > __refile_inode() and observed a healthy list become corrupted after the
> > del/add was completed.
> 
> Can you show us this data in more detail?

In __refile_inode(), before and after the list_add()/list_del() calls, I call
a function which checks up to the first 10 items on the inode_unused list to
see if the next and prev pointers are valid
(inode->next->prev == inode && inode->prev->next == inode).

So what I observed was a case where __refile_inode(), inlined into iput():

 1) checked inode_unused and saw that it was good
 2) put an item on the inode_unused list
 3) checked inode_unused and saw that it was now bad and that the item 
    added was the culprit.

This all happened within __refile_inode() with the inode_lock spinlock
grabbed by iput() and so I tend to think some other code is accessing the
inode_unused list _without_ grabbing the spinlock.  I've checked the
inode.c code over and over, plus my filesystem code, and haven't yet found
a culprit.  I also checked the tux diffs to see if it was messing with
inode objects in an inappropriate way.

Is it safe to assume that the x86 version of atomic_dec_and_lock(), which
iput() uses, is well trusted?  I figure it's got to be, but doesn't hurt
to ask.
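
For reference, here is the contract I am relying on, modeled in user space
(the toy lock type and names are just for illustration, not the kernel code;
as I understand it, the real x86 version also avoids touching the lock at
all on the count-still-nonzero fast path):

```c
#include <assert.h>

/* Toy spinlock model: just a flag, enough to show who holds the lock. */
struct toy_lock { int held; };

static void toy_lock_acquire(struct toy_lock *l) { l->held = 1; }
static void toy_lock_release(struct toy_lock *l) { l->held = 0; }

/* Model of atomic_dec_and_lock(): atomically decrement the counter and,
 * only if it drops to zero, return 1 with the lock held; otherwise
 * return 0 with the lock not held.  This is what lets iput() enter the
 * locked teardown path exactly when the last reference goes away. */
static int atomic_dec_and_lock(int *count, struct toy_lock *lock)
{
	toy_lock_acquire(lock);
	if (--(*count) == 0)
		return 1;	/* caller now owns the lock */
	toy_lock_release(lock);
	return 0;
}
```

So if that primitive behaves, __refile_inode() should always run under
inode_lock when called from iput().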

> > It would seem to me that list corruption on otherwise healthy machines
> > would only be the result of the inode_lock spinlock not being properly
> > locked prior to the call to __refile_inode(), but as far as I can tell,
> > the call to atomic_dec_and_lock() in iput() is doing that properly.
> > 
> > So I am at a loss.  Has anyone else seen this or does anyone have any idea
> > what routes I should be exploring to fix this problem?
> 
> The changes between 2.4.25->2.4.26 (which introduce __refile_inode() and 
> the unused_pagecache list) must have something to do with this. 

__refile_inode() was introduced in 2.4.25.  I'll try 2.4.24 to see if I
can reproduce there.

Marcelo, you asked for some oops' relating to the problem.  Here are some.

Thanks,
Chris

---

Oops: 0000
CPU:    0
EIP:    0010:[<c015b465>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010a93
eax: 8d12e0e2   ebx: 7b402366   ecx: c40fdf18   edx: d9c92888
esi: 7b40235e   edi: 7b402366   ebp: 000007cf   esp: c40fdf10
ds: 0018   es: 0018   ss: 0018
Process kswapd (pid: 7, stackpage=c40fd000)
Stack: e3b85380 000004e5 e3b85388 c9812688 00000002 c25caf40 c0317fd8 00023dd2
       c015b664 00000cb4 c0138f35 00000006 000001d0 ffffffff 000001d0 00000002
       00000006 000001d0 c0317fd8 c0317fd8 c013936a c40fdf84 000001d0 0000003c
Call Trace:    [<c015b664>] [<c0138f35>] [<c013936a>] [<c01393e2>] [<c0139596>]
               [<c0139608>] [<c0139748>] [<c01396b0>] [<c0105000>] [<c010587e>]
[<c01396b0>]
Code: 8b 5b 04 8b 86 1c 01 00 00 a8 38 0f 84 5d 01 00 00 81 fb 08


>>EIP; c015b465 <prune_icache+45/220>   <=====

>>ecx; c40fdf18 <_end+3d3ffec/38676134>
>>edx; d9c92888 <_end+198d495c/38676134>
>>esp; c40fdf10 <_end+3d3ffe4/38676134>

Trace; c015b664 <shrink_icache_memory+24/40>
Trace; c0138f35 <shrink_cache+185/410>
Trace; c013936a <shrink_caches+4a/60>
Trace; c01393e2 <try_to_free_pages_zone+62/f0>
Trace; c0139596 <kswapd_balance_pgdat+66/b0>
Trace; c0139608 <kswapd_balance+28/40>
Trace; c0139748 <kswapd+98/c0>
Trace; c01396b0 <kswapd+0/c0>
Trace; c0105000 <_stext+0/0>
Trace; c010587e <arch_kernel_thread+2e/40>
Trace; c01396b0 <kswapd+0/c0>

Code;  c015b465 <prune_icache+45/220>
00000000 <_EIP>:
Code;  c015b465 <prune_icache+45/220>   <=====
   0:   8b 5b 04                  mov    0x4(%ebx),%ebx   <=====
Code;  c015b468 <prune_icache+48/220>
   3:   8b 86 1c 01 00 00         mov    0x11c(%esi),%eax
Code;  c015b46e <prune_icache+4e/220>
   9:   a8 38                     test   $0x38,%al
Code;  c015b470 <prune_icache+50/220>
   b:   0f 84 5d 01 00 00         je     16e <_EIP+0x16e>
Code;  c015b476 <prune_icache+56/220>
  11:   81 fb 08 00 00 00         cmp    $0x8,%ebx

---

Oops: 0000
CPU:    0
EIP:    0010:[<c015b465>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010a97
eax: 830418da   ebx: 5954e741   ecx: c40fdf18   edx: c0318f08
esi: 5954e739   edi: 5954e741   ebp: 000001b4   esp: c40fdf10
ds: 0018   es: 0018   ss: 0018
Process kswapd (pid: 7, stackpage=c40fd000)
Stack: f718b780 000013a1 f718b788 d04d3988 00000001 c3cfe850 c0317fd8 00023d9e
       c015b664 00001555 c0138f35 00000006 000001d0 ffffffff 000001d0 00000001
       00000004 000001d0 c0317fd8 c0317fd8 c013936a c40fdf84 000001d0 0000003c
Call Trace:    [<c015b664>] [<c0138f35>] [<c013936a>] [<c01393e2>] [<c0139596>]
               [<c0139608>] [<c0139748>] [<c01396b0>] [<c0105000>] [<c010587e>]
 [<c01396b0>]
Code: 8b 5b 04 8b 86 1c 01 00 00 a8 38 0f 84 5d 01 00 00 81 fb 08


>>EIP; c015b465 <prune_icache+45/220>   <=====

>>edx; c0318f08 <inode_unused+0/8>

Trace; c015b664 <shrink_icache_memory+24/40>
Trace; c0138f35 <shrink_cache+185/410>
Trace; c013936a <shrink_caches+4a/60>
Trace; c01393e2 <try_to_free_pages_zone+62/f0>
Trace; c0139596 <kswapd_balance_pgdat+66/b0>
Trace; c0139608 <kswapd_balance+28/40>
Trace; c0139748 <kswapd+98/c0>
Trace; c01396b0 <kswapd+0/c0>
Trace; c0105000 <_stext+0/0>
Trace; c010587e <arch_kernel_thread+2e/40>
Trace; c01396b0 <kswapd+0/c0>

Code;  c015b465 <prune_icache+45/220>
00000000 <_EIP>:
Code;  c015b465 <prune_icache+45/220>   <=====
   0:   8b 5b 04                  mov    0x4(%ebx),%ebx   <=====
Code;  c015b468 <prune_icache+48/220>
   3:   8b 86 1c 01 00 00         mov    0x11c(%esi),%eax
Code;  c015b46e <prune_icache+4e/220>
   9:   a8 38                     test   $0x38,%al
Code;  c015b470 <prune_icache+50/220>
   b:   0f 84 5d 01 00 00         je     16e <_EIP+0x16e>
Code;  c015b476 <prune_icache+56/220>
  11:   81 fb 08 00 00 00         cmp    $0x8,%ebx

---

I think this one was an infinite loop, which I used alt-sysrq-p to
deduce:

Pid: 7, comm:               kswapd
EIP: 0010:[<c014dd1c>] CPU: 2 EFLAGS: 00000246    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EAX: c4e4f824 EBX: c4e4f804 ECX: 00000006 EDX: 00000000
ESI: c4e4f80c EDI: c4e4f804 EBP: 000009ee DS: 0018 ES: 0018
Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register 
line ignored
CR0: 8005003b CR2: bfffdfb4 CR3: 00101000 CR4: 000006d0
Call Trace:    [<c0167f8e>] [<c01681d4>] [<c013d9d0>] [<c0168254>] [<c013fe69>]
  [<c01404ae>] [<c0140533>] [<c01405b2>] [<c011a6cb>] [<c0140766>] [<c01407d8>]
  [<c0140918>] [<c0140880>] [<c010595e>] [<c0140880>]
Warning (Oops_read): Code line not seen, dumping what data is available


>>EIP; c014dd1c <inode_has_buffers+6c/90>   <=====

>>EAX; c4e4f824 <_end+4a6b2f8/3864fb34>
>>EBX; c4e4f804 <_end+4a6b2d8/3864fb34>
>>ESI; c4e4f80c <_end+4a6b2e0/3864fb34>
>>EDI; c4e4f804 <_end+4a6b2d8/3864fb34>

Trace; c0167f8e <prune_icache+7e/320>
Trace; c01681d4 <prune_icache+2c4/320>
Trace; c013d9d0 <kmem_cache_shrink+70/c0>
Trace; c0168254 <shrink_icache_memory+24/40>
Trace; c013fe69 <shrink_cache+1d9/6d0>
Trace; c01404ae <refill_inactive+14e/160>
Trace; c0140533 <shrink_caches+73/90>
Trace; c01405b2 <try_to_free_pages_zone+62/f0>
Trace; c011a6cb <schedule+34b/5f0>
Trace; c0140766 <kswapd_balance_pgdat+66/b0>
Trace; c01407d8 <kswapd_balance+28/40>
Trace; c0140918 <kswapd+98/c0>
Trace; c0140880 <kswapd+0/c0>
Trace; c010595e <arch_kernel_thread+2e/40>
Trace; c0140880 <kswapd+0/c0>

---

Another infinite loop (alt-sysrq-p):

Pid: 7, comm:               kswapd
EIP: 0010:[<c0167bc8>] CPU: 3 EFLAGS: 00000206    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EAX: 00000001 EBX: ebe7f40c ECX: 00000002 EDX: 00000000
ESI: ebe7f60c EDI: ebe7f604 EBP: c2851ee4 DS: 0018 ES: 0018
Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register 
line ignored
CR0: 8005003b CR2: 40014000 CR3: 00101000 CR4: 000006d0
Call Trace:    [<c013d7e2>] [<c0167eb7>] [<c013fd05>] [<c0140360>] [<c01403e2>]
  [<c0140462>] [<c0140638>] [<c01406a8>] [<c01407fa>] [<c0140760>] [<c010596e>]
  [<c0140760>]

>>EIP; c0167bc8 <prune_icache+78/340>   <=====

Trace; c013d7e2 <kmem_cache_shrink+72/c0>
Trace; c0167eb7 <shrink_icache_memory+27/40>
Trace; c013fd05 <shrink_cache+1d5/6d0>
Trace; c0140360 <refill_inactive+160/170>
Trace; c01403e2 <shrink_caches+72/90>
Trace; c0140462 <try_to_free_pages_zone+62/f0>
Trace; c0140638 <kswapd_balance_pgdat+78/c0>
Trace; c01406a8 <kswapd_balance+28/40>
Trace; c01407fa <kswapd+9a/c0>
Trace; c0140760 <kswapd+0/c0>
Trace; c010596e <arch_kernel_thread+2e/40>
Trace; c0140760 <kswapd+0/c0>

---

Infinite loop (alt-sysrq-p):

Pid: 9, comm:             kupdated
EIP: 0010:[<c0168df3>] CPU: 2 EFLAGS: 00000282    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EAX: f7bea000 EBX: f7bea000 ECX: f7bebfd0 EDX: 00000000
ESI: f7bebfc8 EDI: f7bebfd8 EBP: f7bebf10 DS: 0018 ES: 0018
Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register 
line ignored
CR0: 8005003b CR2: 40013090 CR3: 00101000 CR4: 000006d0
Call Trace:    [<c011a1b2>] [<c0151333>] [<c015185e>] [<c0107a42>]
[<c010596e>] [<c0151690>]

>>EIP; c0168df3 <.text.lock.inode+69/256>   <=====

>>EAX; f7bea000 <_end+377ffb74/38649bd4>
>>EBX; f7bea000 <_end+377ffb74/38649bd4>
>>ECX; f7bebfd0 <_end+37801b44/38649bd4>
>>ESI; f7bebfc8 <_end+37801b3c/38649bd4>
>>EDI; f7bebfd8 <_end+37801b4c/38649bd4>
>>EBP; f7bebf10 <_end+37801a84/38649bd4>

Trace; c011a1b2 <schedule_timeout+62/b0>
Trace; c0151333 <sync_old_buffers+53/160>
Trace; c015185e <kupdate+1ce/230>
Trace; c0107a42 <ret_from_fork+6/20>
Trace; c010596e <arch_kernel_thread+2e/40>
Trace; c0151690 <kupdate+0/230>

---

Infinite loop (alt-sysrq-p):

Pid: 7, comm:               kswapd
EIP: 0010:[<c0167be0>] CPU: 3 EFLAGS: 00000202    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EAX: 00000020 EBX: dcbc120c ECX: 00000086 EDX: daae3044
ESI: dcbc120c EDI: dcbc1204 EBP: c2851ee4 DS: 0018 ES: 0018
Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register 
line ignored
CR0: 8005003b CR2: 0807f260 CR3: 00101000 CR4: 000006d0
Call Trace:    [<c013d7e2>] [<c0167ed7>] [<c013fd05>] [<c0140360>] [<c01403e2>]
  [<c011a55b>] [<c0140462>] [<c0140638>] [<c01406a8>] [<c01407fa>] [<c0140760>]
  [<c010596e>] [<c0140760>]

>>EIP; c0167be0 <prune_icache+90/360>   <=====

>>EBX; dcbc120c <_end+1c7d6d80/38649bd4>
>>EDX; daae3044 <_end+1a6f8bb8/38649bd4>
>>ESI; dcbc120c <_end+1c7d6d80/38649bd4>
>>EDI; dcbc1204 <_end+1c7d6d78/38649bd4>
>>EBP; c2851ee4 <_end+2467a58/38649bd4>

Trace; c013d7e2 <kmem_cache_shrink+72/c0>
Trace; c0167ed7 <shrink_icache_memory+27/40>
Trace; c013fd05 <shrink_cache+1d5/6d0>
Trace; c0140360 <refill_inactive+160/170>
Trace; c01403e2 <shrink_caches+72/90>
Trace; c011a55b <schedule+34b/5f0>
Trace; c0140462 <try_to_free_pages_zone+62/f0>
Trace; c0140638 <kswapd_balance_pgdat+78/c0>
Trace; c01406a8 <kswapd_balance+28/40>
Trace; c01407fa <kswapd+9a/c0>
Trace; c0140760 <kswapd+0/c0>
Trace; c010596e <arch_kernel_thread+2e/40>
Trace; c0140760 <kswapd+0/c0>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-06-21 17:10       ` Trond Myklebust
  2004-06-21 18:23         ` Marcelo Tosatti
@ 2004-06-24  1:51         ` Chris Caputo
  2004-06-25  7:47           ` Chris Caputo
  1 sibling, 1 reply; 29+ messages in thread
From: Chris Caputo @ 2004-06-24  1:51 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Marcelo Tosatti, linux-kernel, David Woodhouse, riel

On Mon, 21 Jun 2004, Trond Myklebust wrote:
> On Sun, 20/06/2004 at 20:45, Marcelo Tosatti wrote:
> > Let's see if I get this right: while we drop the lock in iput() to call
> > write_inode_now(), an iget happens, possibly from write_inode_now() itself
> > (sync_one->__iget), causing inode->i_list to be added to inode_in_use.
> > But then the call returns, takes inode_lock, decrements inodes_stat.nr_unused
> > and deletes the inode from inode_in_use and adds it to inode_unused.
> > 
> > AFAICS that's an inode with i_count==1 on the unused list, which does not
> > mean "list corruption", right? Am I missing something here?
> 
> Yes. Please don't forget that the inode is still hashed and is not yet
> marked as FREEING: find_inode() can grab it on behalf of some other
> process as soon as we drop that spinlock inside iput(). Then we have the
> calls to clear_inode() + destroy_inode() just a few lines further down.
> ;-)
> 
> If the above scenario ever does occur, it will cause random Oopses for
> third party processes. Since we do not see this too often, my guess is
> that the write_inode_now() path must be very rarely (or never?) called.
> 
> > If you are indeed right all 2.4.x versions contain this bug.
> 
> ...and all 2.6.x versions...
> 
> I'm not saying this is the same problem that Chris is seeing, but I am
> failing to see how iput() is safe as it stands right now. Please
> enlighten me if I'm missing something.

I think this is a different (albeit apparently valid) problem.  In my case
MS_ACTIVE (in iput() below) will be set since I am not unmounting a volume
and so I believe iput() will return immediately after adding the inode to
the unused list.

That said, I have added your patch to my test setup in case it helps.

Thanks,
Chris

----

                        if (!list_empty(&inode->i_hash)) {
                                if (!(inode->i_state & (I_DIRTY|I_LOCK)))
                                        __refile_inode(inode);
                                inodes_stat.nr_unused++;
                                spin_unlock(&inode_lock);
                                if (!sb || (sb->s_flags & MS_ACTIVE))
                                        return;
                                write_inode_now(inode, 1);
                                spin_lock(&inode_lock);
                                inodes_stat.nr_unused--;
                                list_del_init(&inode->i_hash);
                        }


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-06-24  1:51         ` Chris Caputo
@ 2004-06-25  7:47           ` Chris Caputo
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Caputo @ 2004-06-25  7:47 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Marcelo Tosatti, linux-kernel

On Wed, 23 Jun 2004, Chris Caputo wrote:
> On Mon, 21 Jun 2004, Trond Myklebust wrote:
> > On Sun, 20/06/2004 at 20:45, Marcelo Tosatti wrote:
> > > Let's see if I get this right: while we drop the lock in iput() to call
> > > write_inode_now(), an iget happens, possibly from write_inode_now() itself
> > > (sync_one->__iget), causing inode->i_list to be added to inode_in_use.
> > > But then the call returns, takes inode_lock, decrements inodes_stat.nr_unused
> > > and deletes the inode from inode_in_use and adds it to inode_unused.
> > > 
> > > AFAICS that's an inode with i_count==1 on the unused list, which does not
> > > mean "list corruption", right? Am I missing something here?
> > 
> > Yes. Please don't forget that the inode is still hashed and is not yet
> > marked as FREEING: find_inode() can grab it on behalf of some other
> > process as soon as we drop that spinlock inside iput(). Then we have the
> > calls to clear_inode() + destroy_inode() just a few lines further down.
> > ;-)
> > 
> > If the above scenario ever does occur, it will cause random Oopses for
> > third party processes. Since we do not see this too often, my guess is
> > that the write_inode_now() path must be very rarely (or never?) called.
> > 
> > > If you are indeed right all 2.4.x versions contain this bug.
> > 
> > ...and all 2.6.x versions...
> > 
> > I'm not saying this is the same problem that Chris is seeing, but I am
> > failing to see how iput() is safe as it stands right now. Please
> > enlighten me if I'm missing something.
> 
> I think this is a different (albeit apparently valid) problem.  In my case
> MS_ACTIVE (in iput() below) will be set since I am not unmounting a volume
> and so I believe iput() will return immediately after adding the inode to
> the unused list.
> 
> That said, I have added your patch to my test setup in case it helps.

I was able to duplicate the problem I am seeing even with Trond's patch
applied.  So the patch potentially solves a different problem but not the
one I am seeing.

Chris


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-06-24  1:50   ` Chris Caputo
@ 2004-06-25  8:04     ` Chris Caputo
  2004-06-25 10:18       ` Chris Caputo
  2004-06-25 12:17     ` Marcelo Tosatti
  1 sibling, 1 reply; 29+ messages in thread
From: Chris Caputo @ 2004-06-25  8:04 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

On Wed, 23 Jun 2004, Chris Caputo wrote:
> On Sat, 19 Jun 2004, Marcelo Tosatti wrote:
> > Can you show us this data in more detail?
> 
> In __refile_inode(), before and after the list_add()/list_del() calls, I call
> a function which checks up to the first 10 items on the inode_unused list to
> see if the next and prev pointers are valid
> (inode->next->prev == inode && inode->prev->next == inode).
> 
> So what I observed was a case where __refile_inode(), inlined into iput():
> 
>  1) checked inode_unused and saw that it was good
>  2) put an item on the inode_unused list
>  3) checked inode_unused and saw that it was now bad and that the item 
>     added was the culprit.
> 
> This all happened within __refile_inode() with the inode_lock spinlock
> grabbed by iput() and so I tend to think some other code is accessing the
> inode_unused list _without_ grabbing the spinlock.  I've checked the
> inode.c code over and over, plus my filesystem code, and haven't yet found
> a culprit.  I also checked the tux diffs to see if it was messing with
> inode objects in an inappropriate way.
> 
> Is it safe to assume that the x86 version of atomic_dec_and_lock(), which
> iput() uses, is well trusted?  I figure it's got to be, but doesn't hurt
> to ask.

An update on this.

Item #3 above is not entirely correct, in that I have not seen the item
being added become immediately corrupt, but rather items beyond it.

Specifically I have now seen:

  inode_unused->next->next->prev != inode_unused->next

and:

  inode_unused->next->next->next->prev != inode_unused->next->next

My verification function doesn't check the whole inode_unused list (it
would be too slow to do so), but it may be that items are only corrupted
shortly after being added to the list.  I.e., someone is still using the
inode shortly afterwards, when they shouldn't be.

> __refile_inode() was introduced in 2.4.25.  I'll try 2.4.24 to see if I
> can reproduce there.

No word yet on my 2.4.24 testing.  (test still running without failure)

I'll keep digging,
Chris


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-06-25  8:04     ` Chris Caputo
@ 2004-06-25 10:18       ` Chris Caputo
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Caputo @ 2004-06-25 10:18 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

On Fri, 25 Jun 2004, Chris Caputo wrote:
> On Wed, 23 Jun 2004, Chris Caputo wrote:
> > __refile_inode() was introduced in 2.4.25.  I'll try 2.4.24 to see if I
> > can reproduce there.
> 
> No word yet on my 2.4.24 testing.  (test still running without failure)

I have now reproduced (below) with 2.4.24 (with tux patches + my driver).  
In the code I believe the line doing the NULL deref is:

  entry = entry->prev;  [line 771 of 2.4.24 fs/inode.c]
  
Next I'll try to reproduce with a stock 2.4.26.

Chris

---

Unable to handle kernel NULL pointer dereference at virtual address 00000004
c01665f5
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c01665f5>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010217
eax: 00000020   ebx: 00000000   ecx: 00000006   edx: 00000001
esi: 00000000   edi: fffffff8   ebp: c7f95f58   esp: c7f95f30
ds: 0018   es: 0018   ss: 0018
Process kswapd (pid: 5, stackpage=c7f95000)
Stack: c7f95f58 c0163794 00000000 c7f95f3c c7f95f3c c7f95f64 c013d152 000001d0
       0000003a 0000000a c7f95f64 c01666f7 00018057 c7f95f8c c013fdcf 00000006
       000001d0 00000000 c0329a60 00000001 c0329a60 00000001 c7f94000 c7f95fa8
Call Trace:    [<c0163794>] [<c013d152>] [<c01666f7>] [<c013fdcf>] [<c013ff78>]
  [<c013ffe8>] [<c014013a>] [<c01400a0>] [<c01059ce>] [<c01400a0>]
Code: 8b 76 04 8b 87 30 01 00 00 a8 38 74 6e 81 fe 28 a9 32 c0 75


>>EIP; c01665f5 <prune_icache+65/140>   <=====

>>ebp; c7f95f58 <_end+7bcdd8c/3862ae94>
>>esp; c7f95f30 <_end+7bcdd64/3862ae94>

Trace; c0163794 <prune_dcache+194/280>
Trace; c013d152 <kmem_cache_shrink+72/c0>
Trace; c01666f7 <shrink_icache_memory+27/40>
Trace; c013fdcf <try_to_free_pages_zone+8f/f0>
Trace; c013ff78 <kswapd_balance_pgdat+78/c0>
Trace; c013ffe8 <kswapd_balance+28/40>
Trace; c014013a <kswapd+9a/c0>
Trace; c01400a0 <kswapd+0/c0>
Trace; c01059ce <arch_kernel_thread+2e/40>
Trace; c01400a0 <kswapd+0/c0>

Code;  c01665f5 <prune_icache+65/140>
00000000 <_EIP>:
Code;  c01665f5 <prune_icache+65/140>   <=====
   0:   8b 76 04                  mov    0x4(%esi),%esi   <=====
Code;  c01665f8 <prune_icache+68/140>
   3:   8b 87 30 01 00 00         mov    0x130(%edi),%eax
Code;  c01665fe <prune_icache+6e/140>
   9:   a8 38                     test   $0x38,%al
Code;  c0166600 <prune_icache+70/140>
   b:   74 6e                     je     7b <_EIP+0x7b>
Code;  c0166602 <prune_icache+72/140>
   d:   81 fe 28 a9 32 c0         cmp    $0xc032a928,%esi
Code;  c0166608 <prune_icache+78/140>
  13:   75 00                     jne    15 <_EIP+0x15>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-06-24  1:50   ` Chris Caputo
  2004-06-25  8:04     ` Chris Caputo
@ 2004-06-25 12:17     ` Marcelo Tosatti
  2004-07-02 20:00       ` Chris Caputo
  1 sibling, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2004-06-25 12:17 UTC (permalink / raw)
  To: Chris Caputo; +Cc: linux-kernel, David Woodhouse, riel, Trond Myklebust

On Wed, Jun 23, 2004 at 06:50:48PM -0700, Chris Caputo wrote:
> On Sat, 19 Jun 2004, Marcelo Tosatti wrote:
> > On Fri, Jun 18, 2004 at 05:47:05PM -0700, Chris Caputo wrote:
> > > In 2.4.26 on two different dual-proc x86 machines (one dual-P4 Xeon based,
> > > the other dual-PIII) I am seeing crashes which are the result of the
> > > inode_unused doubly linked list in fs/inode.c becoming corrupted.
> > 
> > What steps are required to reproduce the problem?
> 
> Unfortunately I don't yet know how to (or if I can) repro this with a
> stock kernel.  I am using 2.4.26 + Ingo's tux3-2.4.23-A3 patch in
> conjunction with a filesystem I wrote.  (the tux module itself is not
> being used, just the patches to the existing kernel)
> 
> > > A particular instance of the corruption I have isolated is in a call from
> > > iput() to __refile_inode().  To try to diagnose this further I placed list
> > > verification code before and after the list_del() and list_add() calls in
> > > __refile_inode() and observed a healthy list become corrupted after the
> > > del/add was completed.
> > 
> > Can you show us this data in more detail?
> 
> In __refile_inode() before and after the list_add()/del() calls I call a
> function which checks up to the first 10 items on the inode_unused list to
> see if next and prev pointers are valid.
> (inode->next->prev == inode && inode->prev->next == inode)
> 
> So what I observed was a case here where iput() inline __refile_inode():
> 
>  1) checked inode_unused and saw that it was good
>  2) put an item on the inode_unused list
>  3) checked inode_unused and saw that it was now bad and that the item 
>     added was the culprit.
> 
> This all happened within __refile_inode() with the inode_lock spinlock
> grabbed by iput() and so I tend to think some other code is accessing the
> inode_unused list _without_ grabbing the spinlock.  I've checked the
> inode.c code over and over, plus my filesystem code, and haven't yet found
> a culprit.  I also checked the tux diffs to see if it was messing with
> inode objects in an inappropriate way.
> 
> Is it safe to assume that the x86 version of atomic_dec_and_lock(), which
> iput() uses, is well trusted?  I figure it's got to be, but doesn't hurt
> to ask.

Pretty sure it is; it is used all over. You can try the non-optimized
version in lib/dec_and_lock.c as a test.

> > > It would seem to me that list corruption on otherwise healthy machines
> > > would only be the result of the inode_lock spinlock not being properly
> > > locked prior to the call to __refile_inode(), but as far as I can tell,
> > > the call to atomic_dec_and_lock() in iput() is doing that properly.
> > > 
> > > So I am at a loss.  Has anyone else seen this or does anyone have any idea
> > > what routes I should be exploring to fix this problem?
> > 
> > The changes between 2.4.25->2.4.26 (which introduce __refile_inode() and 
> > the unused_pagecache list) must have something to do with this. 
> 
> __refile_inode() was introduced in 2.4.25.  I'll try 2.4.24 to see if I
> can reproduce there.
> 
> Marcelo, you asked for some oops' relating to the problem.  Here are some.

Chris, thanks. I would love to see you reproduce the oops with a stock kernel using ext2/3.

> ---
> 
> Oops: 0000
> CPU:    0
> EIP:    0010:[<c015b465>]    Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010a93
> eax: 8d12e0e2   ebx: 7b402366   ecx: c40fdf18   edx: d9c92888
> esi: 7b40235e   edi: 7b402366   ebp: 000007cf   esp: c40fdf10
> ds: 0018   es: 0018   ss: 0018
> Process kswapd (pid: 7, stackpage=c40fd000)
> Stack: e3b85380 000004e5 e3b85388 c9812688 00000002 c25caf40 c0317fd8 00023dd2
>        c015b664 00000cb4 c0138f35 00000006 000001d0 ffffffff 000001d0 00000002
>        00000006 000001d0 c0317fd8 c0317fd8 c013936a c40fdf84 000001d0 0000003c
> Call Trace:    [<c015b664>] [<c0138f35>] [<c013936a>] [<c01393e2>] [<c0139596>]
>                [<c0139608>] [<c0139748>] [<c01396b0>] [<c0105000>] [<c010587e>]
> [<c01396b0>]
> Code: 8b 5b 04 8b 86 1c 01 00 00 a8 38 0f 84 5d 01 00 00 81 fb 08
> 
> 
> >>EIP; c015b465 <prune_icache+45/220>   <=====
> 
> >>ecx; c40fdf18 <_end+3d3ffec/38676134>
> >>edx; d9c92888 <_end+198d495c/38676134>
> >>esp; c40fdf10 <_end+3d3ffe4/38676134>
> 
> Trace; c015b664 <shrink_icache_memory+24/40>
> Trace; c0138f35 <shrink_cache+185/410>
> Trace; c013936a <shrink_caches+4a/60>
> Trace; c01393e2 <try_to_free_pages_zone+62/f0>
> Trace; c0139596 <kswapd_balance_pgdat+66/b0>
> Trace; c0139608 <kswapd_balance+28/40>
> Trace; c0139748 <kswapd+98/c0>
> Trace; c01396b0 <kswapd+0/c0>
> Trace; c0105000 <_stext+0/0>
> Trace; c010587e <arch_kernel_thread+2e/40>
> Trace; c01396b0 <kswapd+0/c0>
> 
> Code;  c015b465 <prune_icache+45/220>
> 00000000 <_EIP>:
> Code;  c015b465 <prune_icache+45/220>   <=====
>    0:   8b 5b 04                  mov    0x4(%ebx),%ebx   <=====
> Code;  c015b468 <prune_icache+48/220>
>    3:   8b 86 1c 01 00 00         mov    0x11c(%esi),%eax
> Code;  c015b46e <prune_icache+4e/220>
>    9:   a8 38                     test   $0x38,%al
> Code;  c015b470 <prune_icache+50/220>
>    b:   0f 84 5d 01 00 00         je     16e <_EIP+0x16e>
> Code;  c015b476 <prune_icache+56/220>
>   11:   81 fb 08 00 00 00         cmp    $0x8,%ebx
> 
> ---
> 
> Oops: 0000
> CPU:    0
> EIP:    0010:[<c015b465>]    Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010a97
> eax: 830418da   ebx: 5954e741   ecx: c40fdf18   edx: c0318f08
> esi: 5954e739   edi: 5954e741   ebp: 000001b4   esp: c40fdf10
> ds: 0018   es: 0018   ss: 0018
> Process kswapd (pid: 7, stackpage=c40fd000)
> Stack: f718b780 000013a1 f718b788 d04d3988 00000001 c3cfe850 c0317fd8 00023d9e
>        c015b664 00001555 c0138f35 00000006 000001d0 ffffffff 000001d0 00000001
>        00000004 000001d0 c0317fd8 c0317fd8 c013936a c40fdf84 000001d0 0000003c
> Call Trace:    [<c015b664>] [<c0138f35>] [<c013936a>] [<c01393e2>] [<c0139596>]
>                [<c0139608>] [<c0139748>] [<c01396b0>] [<c0105000>] [<c010587e>]
>  [<c01396b0>]
> Code: 8b 5b 04 8b 86 1c 01 00 00 a8 38 0f 84 5d 01 00 00 81 fb 08
> 
> 
> >>EIP; c015b465 <prune_icache+45/220>   <=====
> 
> >>edx; c0318f08 <inode_unused+0/8>
> 
> Trace; c015b664 <shrink_icache_memory+24/40>
> Trace; c0138f35 <shrink_cache+185/410>
> Trace; c013936a <shrink_caches+4a/60>
> Trace; c01393e2 <try_to_free_pages_zone+62/f0>
> Trace; c0139596 <kswapd_balance_pgdat+66/b0>
> Trace; c0139608 <kswapd_balance+28/40>
> Trace; c0139748 <kswapd+98/c0>
> Trace; c01396b0 <kswapd+0/c0>
> Trace; c0105000 <_stext+0/0>
> Trace; c010587e <arch_kernel_thread+2e/40>
> Trace; c01396b0 <kswapd+0/c0>
> 
> Code;  c015b465 <prune_icache+45/220>
> 00000000 <_EIP>:
> Code;  c015b465 <prune_icache+45/220>   <=====
>    0:   8b 5b 04                  mov    0x4(%ebx),%ebx   <=====
> Code;  c015b468 <prune_icache+48/220>
>    3:   8b 86 1c 01 00 00         mov    0x11c(%esi),%eax
> Code;  c015b46e <prune_icache+4e/220>
>    9:   a8 38                     test   $0x38,%al
> Code;  c015b470 <prune_icache+50/220>
>    b:   0f 84 5d 01 00 00         je     16e <_EIP+0x16e>
> Code;  c015b476 <prune_icache+56/220>
>   11:   81 fb 08 00 00 00         cmp    $0x8,%ebx
> 
> ---
> 
> I think this one was an infinite loop which I used alt-sysrq-p to get 
> deduce:
> 
> Pid: 7, comm:               kswapd
> EIP: 0010:[<c014dd1c>] CPU: 2 EFLAGS: 00000246    Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EAX: c4e4f824 EBX: c4e4f804 ECX: 00000006 EDX: 00000000
> ESI: c4e4f80c EDI: c4e4f804 EBP: 000009ee DS: 0018 ES: 0018
> Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register line ignored
> CR0: 8005003b CR2: bfffdfb4 CR3: 00101000 CR4: 000006d0
> Call Trace:    [<c0167f8e>] [<c01681d4>] [<c013d9d0>] [<c0168254>] [<c013fe69>]
>   [<c01404ae>] [<c0140533>] [<c01405b2>] [<c011a6cb>] [<c0140766>] [<c01407d8>]
>   [<c0140918>] [<c0140880>] [<c010595e>] [<c0140880>]
> Warning (Oops_read): Code line not seen, dumping what data is available
> 
> 
> >>EIP; c014dd1c <inode_has_buffers+6c/90>   <=====
> 
> >>EAX; c4e4f824 <_end+4a6b2f8/3864fb34>
> >>EBX; c4e4f804 <_end+4a6b2d8/3864fb34>
> >>ESI; c4e4f80c <_end+4a6b2e0/3864fb34>
> >>EDI; c4e4f804 <_end+4a6b2d8/3864fb34>
> 
> Trace; c0167f8e <prune_icache+7e/320>
> Trace; c01681d4 <prune_icache+2c4/320>
> Trace; c013d9d0 <kmem_cache_shrink+70/c0>
> Trace; c0168254 <shrink_icache_memory+24/40>
> Trace; c013fe69 <shrink_cache+1d9/6d0>
> Trace; c01404ae <refill_inactive+14e/160>
> Trace; c0140533 <shrink_caches+73/90>
> Trace; c01405b2 <try_to_free_pages_zone+62/f0>
> Trace; c011a6cb <schedule+34b/5f0>
> Trace; c0140766 <kswapd_balance_pgdat+66/b0>
> Trace; c01407d8 <kswapd_balance+28/40>
> Trace; c0140918 <kswapd+98/c0>
> Trace; c0140880 <kswapd+0/c0>
> Trace; c010595e <arch_kernel_thread+2e/40>
> Trace; c0140880 <kswapd+0/c0>
> 
> ---
> 
> Another infinite loop (alt-sysrq-p):
> 
> Pid: 7, comm:               kswapd
> EIP: 0010:[<c0167bc8>] CPU: 3 EFLAGS: 00000206    Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EAX: 00000001 EBX: ebe7f40c ECX: 00000002 EDX: 00000000
> ESI: ebe7f60c EDI: ebe7f604 EBP: c2851ee4 DS: 0018 ES: 0018
> Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register line ignored
> CR0: 8005003b CR2: 40014000 CR3: 00101000 CR4: 000006d0
> Call Trace:    [<c013d7e2>] [<c0167eb7>] [<c013fd05>] [<c0140360>] [<c01403e2>]
>   [<c0140462>] [<c0140638>] [<c01406a8>] [<c01407fa>] [<c0140760>] [<c010596e>]
>   [<c0140760>]
> 
> >>EIP; c0167bc8 <prune_icache+78/340>   <=====
> 
> Trace; c013d7e2 <kmem_cache_shrink+72/c0>
> Trace; c0167eb7 <shrink_icache_memory+27/40>
> Trace; c013fd05 <shrink_cache+1d5/6d0>
> Trace; c0140360 <refill_inactive+160/170>
> Trace; c01403e2 <shrink_caches+72/90>
> Trace; c0140462 <try_to_free_pages_zone+62/f0>
> Trace; c0140638 <kswapd_balance_pgdat+78/c0>
> Trace; c01406a8 <kswapd_balance+28/40>
> Trace; c01407fa <kswapd+9a/c0>
> Trace; c0140760 <kswapd+0/c0>
> Trace; c010596e <arch_kernel_thread+2e/40>
> Trace; c0140760 <kswapd+0/c0>
> 
> ---
> 
> Infinite loop (alt-sysrq-p):
> 
> Pid: 9, comm:             kupdated
> EIP: 0010:[<c0168df3>] CPU: 2 EFLAGS: 00000282    Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EAX: f7bea000 EBX: f7bea000 ECX: f7bebfd0 EDX: 00000000
> ESI: f7bebfc8 EDI: f7bebfd8 EBP: f7bebf10 DS: 0018 ES: 0018
> Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register line ignored
> CR0: 8005003b CR2: 40013090 CR3: 00101000 CR4: 000006d0
> Call Trace:    [<c011a1b2>] [<c0151333>] [<c015185e>] [<c0107a42>]
> [<c010596e>] [<c0151690>]
> 
> >>EIP; c0168df3 <.text.lock.inode+69/256>   <=====
> 
> >>EAX; f7bea000 <_end+377ffb74/38649bd4>
> >>EBX; f7bea000 <_end+377ffb74/38649bd4>
> >>ECX; f7bebfd0 <_end+37801b44/38649bd4>
> >>ESI; f7bebfc8 <_end+37801b3c/38649bd4>
> >>EDI; f7bebfd8 <_end+37801b4c/38649bd4>
> >>EBP; f7bebf10 <_end+37801a84/38649bd4>
> 
> Trace; c011a1b2 <schedule_timeout+62/b0>
> Trace; c0151333 <sync_old_buffers+53/160>
> Trace; c015185e <kupdate+1ce/230>
> Trace; c0107a42 <ret_from_fork+6/20>
> Trace; c010596e <arch_kernel_thread+2e/40>
> Trace; c0151690 <kupdate+0/230>
> 
> ---
> 
> Infinite loop (alt-sysrq-p):
> 
> Pid: 7, comm:               kswapd
> EIP: 0010:[<c0167be0>] CPU: 3 EFLAGS: 00000202    Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EAX: 00000020 EBX: dcbc120c ECX: 00000086 EDX: daae3044
> ESI: dcbc120c EDI: dcbc1204 EBP: c2851ee4 DS: 0018 ES: 0018
> Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register line ignored
> CR0: 8005003b CR2: 0807f260 CR3: 00101000 CR4: 000006d0
> Call Trace:    [<c013d7e2>] [<c0167ed7>] [<c013fd05>] [<c0140360>] [<c01403e2>]
>   [<c011a55b>] [<c0140462>] [<c0140638>] [<c01406a8>] [<c01407fa>] [<c0140760>]
>   [<c010596e>] [<c0140760>]
> 
> >>EIP; c0167be0 <prune_icache+90/360>   <=====
> 
> >>EBX; dcbc120c <_end+1c7d6d80/38649bd4>
> >>EDX; daae3044 <_end+1a6f8bb8/38649bd4>
> >>ESI; dcbc120c <_end+1c7d6d80/38649bd4>
> >>EDI; dcbc1204 <_end+1c7d6d78/38649bd4>
> >>EBP; c2851ee4 <_end+2467a58/38649bd4>
> 
> Trace; c013d7e2 <kmem_cache_shrink+72/c0>
> Trace; c0167ed7 <shrink_icache_memory+27/40>
> Trace; c013fd05 <shrink_cache+1d5/6d0>
> Trace; c0140360 <refill_inactive+160/170>
> Trace; c01403e2 <shrink_caches+72/90>
> Trace; c011a55b <schedule+34b/5f0>
> Trace; c0140462 <try_to_free_pages_zone+62/f0>
> Trace; c0140638 <kswapd_balance_pgdat+78/c0>
> Trace; c01406a8 <kswapd_balance+28/40>
> Trace; c01407fa <kswapd+9a/c0>
> Trace; c0140760 <kswapd+0/c0>
> Trace; c010596e <arch_kernel_thread+2e/40>
> Trace; c0140760 <kswapd+0/c0>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-06-25 12:17     ` Marcelo Tosatti
@ 2004-07-02 20:00       ` Chris Caputo
  2004-07-03  5:15         ` Arjan van de Ven
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Caputo @ 2004-07-02 20:00 UTC (permalink / raw)
  To: Marcelo Tosatti, Arjan van de Ven; +Cc: linux-kernel

On Fri, 25 Jun 2004, Marcelo Tosatti wrote:
> On Wed, Jun 23, 2004 at 06:50:48PM -0700, Chris Caputo wrote:
> > Is it safe to assume that the x86 version of atomic_dec_and_lock(), which
> > iput() uses, is well trusted?  I figure it's got to be, but doesn't hurt
> > to ask.
> 
> Pretty sure it is, used all over. You can try to use non-optimize version 
> at lib/dec_and_lock.c for a test.

Hi Marcelo.  Just an update...

I normally run irqbalance on these systems which are experiencing the
inode_unused list corruption problem.

I disabled irqbalance on two servers which have experienced the
inode_unused list corruption problem and they have now been running
without crashing for considerably longer than they had been before.  The
length of run time without problem is not yet conclusive, but I wanted to
give you a heads-up on my progress.

My current theory is that occasionally, when irqbalance changes CPU
affinities, the resulting set_ioapic_affinity() calls somehow cause
inter-CPU locking, cache coherency, or something else to fail.

I CC'ed Arjan because this involves irqbalance, although I'd expect the
fix will be needed in set_ioapic_affinity() or related kernel code.

Any insights welcome.  I am not sure what to do next except wait and see
if these servers with irqbalance disabled continue to stay up.

Chris


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-02 20:00       ` Chris Caputo
@ 2004-07-03  5:15         ` Arjan van de Ven
  2004-07-26 17:41           ` Chris Caputo
  0 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2004-07-03  5:15 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Marcelo Tosatti, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 851 bytes --]

On Fri, Jul 02, 2004 at 01:00:19PM -0700, Chris Caputo wrote:
> On Fri, 25 Jun 2004, Marcelo Tosatti wrote:
> > On Wed, Jun 23, 2004 at 06:50:48PM -0700, Chris Caputo wrote:
> > > Is it safe to assume that the x86 version of atomic_dec_and_lock(), which
> > > iput() uses, is well trusted?  I figure it's got to be, but doesn't hurt
> > > to ask.
> > 
> > Pretty sure it is, used all over. You can try to use non-optimize version 
> > at lib/dec_and_lock.c for a test.
> 
> My current theory is that occasionally when irqbalance changes CPU
> affinities that the resulting set_ioapic_affinity() calls somehow cause
> either inter-CPU locking or cache coherency or ??? to fail.

or.... some spinlock is just incorrect and having the irqbalance irqlayout
unhides that.. irqbalance only balances very very rarely so I doubt it's the
cause of anything...

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-03  5:15         ` Arjan van de Ven
@ 2004-07-26 17:41           ` Chris Caputo
  2004-07-27 14:19             ` Marcelo Tosatti
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Caputo @ 2004-07-26 17:41 UTC (permalink / raw)
  To: Marcelo Tosatti, Arjan van de Ven; +Cc: linux-kernel

On Sat, 3 Jul 2004, Arjan van de Ven wrote:
> On Fri, Jul 02, 2004 at 01:00:19PM -0700, Chris Caputo wrote:
> > On Fri, 25 Jun 2004, Marcelo Tosatti wrote:
> > > On Wed, Jun 23, 2004 at 06:50:48PM -0700, Chris Caputo wrote:
> > > > Is it safe to assume that the x86 version of atomic_dec_and_lock(), which
> > > > iput() uses, is well trusted?  I figure it's got to be, but doesn't hurt
> > > > to ask.
> > > 
> > > Pretty sure it is, used all over. You can try to use non-optimize version 
> > > at lib/dec_and_lock.c for a test.
> > 
> > My current theory is that occasionally when irqbalance changes CPU
> > affinities that the resulting set_ioapic_affinity() calls somehow cause
> > either inter-CPU locking or cache coherency or ??? to fail.
> 
> or.... some spinlock is just incorrect and having the irqbalance irqlayout
> unhides that.. irqbalance only balances very very rarely so I doubt it's the
> cause of anything...

It has been a while since I have been able to follow up on this but I want
to let you know that I _have been able_ to reproduce the problem (believed
to be IRQ twiddling resulting in failed spinlock protection) with a stock
kernel.

I would like to come up with a more reliable way to reproduce the problem
with a stock kernel (2.4.26), since it currently occurs very rarely (less
than once per week) with the way I presently trigger it, but as yet I have
not done so.

My plan of attack is to remove irqbalance from the equation and, at
random intervals, repeatedly change /proc/irq entries directly from one
user-mode program while another user-mode program does things which
generate a lot of fs/inode.c spinlock activity (since that is where I
continue to see list corruption).

A few questions which could help me with this:

  - Which IRQ (if any) is used by CPUs to coordinate inter-CPU locking?

  - What does it mean if a stack trace is incomplete?  For example, one I 
    have gotten is simply the tail end of the code snippet:

         0b 9a 00 5d c8

    And so I have wondered whether the failure to produce a full stack 
    trace indicates something in and of itself.

Thanks for any assistance.  I hope to find more time to work on this in
the coming weeks.

Chris


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-26 17:41           ` Chris Caputo
@ 2004-07-27 14:19             ` Marcelo Tosatti
  2004-07-27 16:08               ` Chris Caputo
  0 siblings, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2004-07-27 14:19 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Arjan van de Ven, linux-kernel


On Mon, Jul 26, 2004 at 10:41:24AM -0700, Chris Caputo wrote:
> On Sat, 3 Jul 2004, Arjan van de Ven wrote:
> > On Fri, Jul 02, 2004 at 01:00:19PM -0700, Chris Caputo wrote:
> > > On Fri, 25 Jun 2004, Marcelo Tosatti wrote:
> > > > On Wed, Jun 23, 2004 at 06:50:48PM -0700, Chris Caputo wrote:
> > > > > Is it safe to assume that the x86 version of atomic_dec_and_lock(), which
> > > > > iput() uses, is well trusted?  I figure it's got to be, but doesn't hurt
> > > > > to ask.
> > > > 
> > > > Pretty sure it is, used all over. You can try to use non-optimize version 
> > > > at lib/dec_and_lock.c for a test.
> > > 
> > > My current theory is that occasionally when irqbalance changes CPU
> > > affinities that the resulting set_ioapic_affinity() calls somehow cause
> > > either inter-CPU locking or cache coherency or ??? to fail.
> > 
> > or.... some spinlock is just incorrect and having the irqbalance irqlayout
> > unhides that.. irqbalance only balances very very rarely so I doubt it's the
> > cause of anything...

Hi Chris,

> It has been a while since I have been able to follow up on this but I want
> to let you know that I _have been able_ to reproduce the problem (believed
> to be IRQ twiddling resulting in failed spinlock protection) with a stock
> kernel.

Well, no manipulation of the inode lists is done in IRQ context. 

Do you think that what might be happening is that an IRQ arrives in
__refile_inode() (and other paths) and causes a problem?

That's perfectly fine. Again, no manipulation of the lists is done in 
IRQ context.

> I would like to come up with a more reliable way to reproduce the problem
> with a stock kernel (2.4.26), since it is presently very rare (less than
> once per week) in the way I presently get it to happen, but as yet have
> not done so. 

What is your workload? I'm more than willing to use the SMP boxes I have 
access to in order to try to reproduce this. 

You said you also reproduced the same inode_unused corruption with 2.4.24, yes?

> My plan of attack is to remove irqbalance from the equation and repeatedly
> change with random intervals /proc/irq entries directly from one user mode
> program while another user mode program does things which inspire a lot of
> fs/inode.c spinlock activity (since that is where I continue to see list
> corruption).
> 
> A few questions which could help me with this:
> 
>   - Which IRQ (if any) is used by CPU's to coordinate inter-CPU locking?

None, as far as spinlocks are concerned. On x86 a spinlock just does a
"lock ; dec $x" operation, and the lock prefix guarantees the atomicity
of the "dec".

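The kernel's i386 spin_lock of that era is hand-written inline assembly;
the following is only a rough user-space analogue of the same idea
(atomically decrement the lock word, spin if the result went negative)
using GCC's __sync builtins, which compile to lock-prefixed instructions
on x86.  All names here are mine, not the kernel's:

```c
#include <stdio.h>

/* The lock word starts at 1.  A "lock ; dec"-style atomic decrement
 * takes it to 0 on success; a negative result means someone else holds
 * the lock. */
typedef struct { volatile int slock; } demo_spinlock_t;

static void demo_spin_lock(demo_spinlock_t *l)
{
    /* __sync_sub_and_fetch is a lock-prefixed read-modify-write on x86. */
    while (__sync_sub_and_fetch(&l->slock, 1) < 0) {
        /* Contended: undo our decrement and wait until it looks free
         * (unlike the kernel asm, which retries the dec without the
         * add-back, but the locking effect is the same). */
        __sync_add_and_fetch(&l->slock, 1);
        while (l->slock <= 0)
            ;  /* busy-wait; the kernel inserts "rep; nop" here */
    }
}

static void demo_spin_unlock(demo_spinlock_t *l)
{
    l->slock = 1;  /* a plain store releases the lock on x86 */
}

int main(void)
{
    demo_spinlock_t lock = { 1 };

    demo_spin_lock(&lock);
    printf("locked: slock=%d\n", lock.slock);
    demo_spin_unlock(&lock);
    printf("unlocked: slock=%d\n", lock.slock);
    return 0;
}
```

The point is that correctness rests entirely on the atomicity of that
one locked instruction; no IRQ is involved in taking or releasing the
lock.
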
Still, I feel I'm not answering your question. What do you mean?

>   - What does it mean if a stack trace is incomplete?  For example, one I 
>     have gotten is simply the tail end of the code snippet:
> 
>          0b 9a 00 5d c8
> 
>     And so I have wondered if the failure to make a full stack trace 
>     indicates something in of itself.

Don't know the answer. I also usually see incomplete stack traces
that I can make no sense of. 

> Thanks for any assistance.  I hope to find more time to work on this in
> the coming weeks.

I'm willing to help. Just please do not mix in report data which includes
your own filesystem/modified kernel. Let's work with stock kernels and 
only those, to make it as simple as possible.

Thanks!


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-27 14:19             ` Marcelo Tosatti
@ 2004-07-27 16:08               ` Chris Caputo
  2004-07-29  0:25                 ` Marcelo Tosatti
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Caputo @ 2004-07-27 16:08 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Arjan van de Ven, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5886 bytes --]

On Tue, 27 Jul 2004, Marcelo Tosatti wrote:
> On Mon, Jul 26, 2004 at 10:41:24AM -0700, Chris Caputo wrote:
> > On Sat, 3 Jul 2004, Arjan van de Ven wrote:
> > > On Fri, Jul 02, 2004 at 01:00:19PM -0700, Chris Caputo wrote:
> > > > On Fri, 25 Jun 2004, Marcelo Tosatti wrote:
> > > > > On Wed, Jun 23, 2004 at 06:50:48PM -0700, Chris Caputo wrote:
> > > > > > Is it safe to assume that the x86 version of atomic_dec_and_lock(), which
> > > > > > iput() uses, is well trusted?  I figure it's got to be, but doesn't hurt
> > > > > > to ask.
> > > > > 
> > > > > Pretty sure it is, used all over. You can try to use non-optimize version 
> > > > > at lib/dec_and_lock.c for a test.
> > > > 
> > > > My current theory is that occasionally when irqbalance changes CPU
> > > > affinities that the resulting set_ioapic_affinity() calls somehow cause
> > > > either inter-CPU locking or cache coherency or ??? to fail.
> > > 
> > > or.... some spinlock is just incorrect and having the irqbalance irqlayout
> > > unhides that.. irqbalance only balances very very rarely so I doubt it's the
> > > cause of anything...
> 
> Hi Chris,
> 
> > It has been a while since I have been able to follow up on this but I want
> > to let you know that I _have been able_ to reproduce the problem (believed
> > to be IRQ twiddling resulting in failed spinlock protection) with a stock
> > kernel.
> 
> Well, no manipulation of the inode lists are done under IRQ context. 
> 
> What you think might be happening is that an IRQ comes in __refile_inode() 
> (and other paths) and that causes a problem? 
> 
> Thats perfectly fine. Again, no manipulation of the lists are done in 
> IRQ context.

I may have phrased the question poorly.  I am trying to get at whether the
x86 "lock ; dec $x" operation can somehow be affected by changes to IRQ
affinity, since that is what appears to happen.  I assume SMP inter-CPU
locking doesn't generate IRQs at the Linux kernel level (right?), but
there seems to be something at a low level which is affected by changes
to which CPU handles which IRQs.

> > I would like to come up with a more reliable way to reproduce the problem
> > with a stock kernel (2.4.26), since it is presently very rare (less than
> > once per week) in the way I presently get it to happen, but as yet have
> > not done so. 
> 
> What is your workload? I'm more than willing to use the SMP boxes I have 
> access to try to reproduce this. 

Here is the basic repro so far:

-1) SMP x86 machine (PIII or PIV), 4 gigs of RAM, kernel 2.4.26.  Here are 
    some possibly relevant configs:

CONFIG_X86=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_HAS_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_PGE=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_F00F_WORKS_OK=y
CONFIG_X86_MCE=y
CONFIG_HIGHMEM4G=y
CONFIG_HIGHMEM=y
CONFIG_HIGHIO=y
CONFIG_MTRR=y
CONFIG_SMP=y
CONFIG_X86_TSC=y
CONFIG_HAVE_DEC_LOCK=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y

0) Start irqbalance.
1) Have 10+ gigs of space available for testing preferably on two separate
   filesystems, mounted on say /test_0 and /test_1.
2) As root do a "find /" to fill up the dcache/inode caches with a bunch 
   of data.
3) Compile the attached writer2.c: "gcc -O2 -o writer2 writer2.c"
4) Run "writer2 /test_0 0" and "writer2 /test_1 0" in the background.
5) Run "emerge sync" a few times per day.
6) Wait for crash.

Step 5 is unique to Gentoo, so it would be nice to have another test
program which does the equivalent in terms of opening/creating/deleting a
bunch of files.  The intent is to ensure that items are regularly added
to and removed from the inode_unused list.

> You said you also reproduced the same inode_unused corruption with
> 2.4.24, yes?

I have reproduced the same symptoms (inode_unused list corruption) on
2.4.24 with a modified kernel (tux patches plus my filesystem), but I have
not tried to do so with a stock 2.4.24.  So far the only stock kernel I
have tried is 2.4.26.

> > My plan of attack is to remove irqbalance from the equation and repeatedly
> > change with random intervals /proc/irq entries directly from one user mode
> > program while another user mode program does things which inspire a lot of
> > fs/inode.c spinlock activity (since that is where I continue to see list
> > corruption).
> > 
> > A few questions which could help me with this:
> > 
> >   - Which IRQ (if any) is used by CPU's to coordinate inter-CPU locking?
> 
> None as far as spinlocks are concerned. On x86 spinlock just does "lock
> ; dec $x"  operation which guarantees the atomicity of the "dec".
> 
> I feel I'm not answering your question, still. What do you mean?

What I seem to be seeing is that changes in IRQ affinity are at some level
breaking the guaranteed atomicity of the "lock ; dec $x".  I realize that
may sound crazy, but maybe it is a known quirk of the x86 spec.  The two
CPUs I have seen this with, by the way, are a P4 Xeon (CPUID F29) and a
PIII (CPUID 06B1), so I don't suspect a microcode-type bug unique to a
certain CPU release.

> >   - What does it mean if a stack trace is incomplete?  For example, one I 
> >     have gotten is simply the tail end of the code snippet:
> > 
> >          0b 9a 00 5d c8
> > 
> >     And so I have wondered if the failure to make a full stack trace 
> >     indicates something in of itself.
> 
> Dont know the answer. I also usually see incomplete stack traces
> that I can make no sense of. 
> 
> > Thanks for any assistance.  I hope to find more time to work on this in
> > the coming weeks.
> 
> I'm willing to help. Just please do not mix report data which includes
> your own filesystem/modified kernel. Let's work with stock kernels and 
> only that, to make it as simple as possible.

Agreed.

Chris

[-- Attachment #2: Type: TEXT/PLAIN, Size: 5560 bytes --]

#define _LARGEFILE64_SOURCE
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/time.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/statvfs.h>
#include <sys/resource.h>
#include <signal.h>

#define NUM_DIFF_START_LOCS 10000
#define MAX_WRITE_LEN (10 * 1048576)  // 10 megs

#define NUM_FILES 1
#define FILE_SIZE (10 * 1024 * 1048576LL)  // 10 gigs

typedef signed long long s64;

volatile sig_atomic_t g_fSigTermReceived = 0;


static void sig_term(int signo)
{
  g_fSigTermReceived = 1;
}


int write_wrapper(int fd,
		  void *buf,
		  int len)
{
  int total = 0;
  int subtotal = 0;
  char *targetBuf = buf;

  while (total < len)
    {
      subtotal = write(fd, targetBuf, len - total);

      if (0 > subtotal)
	{
	  return -1;
	}
      else
	{
	  total += subtotal;
	  targetBuf += subtotal;
	}
    }

  return total;
}
		 

int main(int argc,
	 char *argv[])
{
  unsigned int uIterationCount, uCurIteration;
  unsigned int uLength;
  char *srcBuffer;
  struct timeval tvStart, tvEnd;
  s64 s64_usec;
  char szDirectory[256];
  char szFilename[256];
  int fds[NUM_FILES], fd;
  unsigned int i;
  off_t lseekRes;

  //
  // Usage.
  //

  if (2 != argc &&
      3 != argc)
    {
      fprintf(stderr,
	      "Usage: %s <test dir> [# of iterations]\n",
	      argv[0]);
      fprintf(stderr,
	      " -if iteration count is not specified, defaults to 1.  If 0\n");
      fprintf(stderr,
	      "     it means run for a long long time (2^32 - 1).\n");
      exit(1);
    }

  //
  // Consume parameters.
  //
  argc--;  // eat program name
  strncpy(szDirectory, argv[1], sizeof(szDirectory) - 1);
  szDirectory[sizeof(szDirectory) - 1] = '\0';  // strncpy does not guarantee termination
  argc--;  // eat test directory
  if (0 < argc)
    {
      uIterationCount = strtoul(argv[2], NULL, 10);
    }
  else
    {
      uIterationCount = 1;
    }

  // make it so a supplied iteration count of 0 runs for a long long time.
  if (0 == uIterationCount)
    {
      uIterationCount--;
    }

  // seed random()
  srandom(getpid());

  // allocate space for the srcBuffer
  // add long int extra so we can do a rapid random fill below
  srcBuffer = malloc(MAX_WRITE_LEN +
		     NUM_DIFF_START_LOCS +
		     sizeof(long int));
  if (NULL == srcBuffer)
    {
      fprintf(stderr,
	      "%s: was unable to malloc %lu bytes.  Exiting.\n",
	      argv[0],
	      (unsigned long)(MAX_WRITE_LEN +
			      NUM_DIFF_START_LOCS +
			      sizeof(long int)));
      exit(1);
    }

  // fill srcBuffer with random data
  {
    long int *pli;
    long int *end;
    int len = (MAX_WRITE_LEN +
	       NUM_DIFF_START_LOCS +
	       sizeof(long int)) / sizeof(long int);

    end = (long int *)srcBuffer + len;
    
    for (pli = (long int *)srcBuffer; pli < end; pli++)
      {
	*pli = random();
      }
  }

  // create the files
  for (i = 0; i < NUM_FILES; i++)
    {
      if (0 > snprintf(szFilename,
		       sizeof(szFilename),
		       "%s/%u",
		       szDirectory,
		       i))
	{
	  fprintf(stderr, "snprintf(szFilename) failed\n");
	  exit(1);
	}
      fds[i] = open(szFilename, O_CREAT | O_WRONLY | O_LARGEFILE, 0666);
      if (-1 == fds[i])
	{
	  perror("open");
	  goto exit;
	}
      if (-1 == lseek(fds[i], FILE_SIZE - 1, SEEK_SET))
	{
	  perror("lseek");
	  goto exit;
	}
      if (-1 == write(fds[i], "x", 1))
	{
	  perror("write");
	  goto exit;
	}
      if (-1 == lseek(fds[i], 0, SEEK_SET))  // set back to start
	{
	  perror("lseek");
	  goto exit;
	}
    }
  
  signal(SIGINT, sig_term);

  //
  // Main loop.
  //
  for (uCurIteration = 1;
       uCurIteration <= uIterationCount && 0 == g_fSigTermReceived;
       uCurIteration++)
    {
      printf("%u:W:",
	     uCurIteration);

      uLength = random() % MAX_WRITE_LEN;
      printf("len = %u",
	     uLength);
      
      fd = fds[random() % NUM_FILES];

      lseekRes = lseek(fd,
		       0,
		       SEEK_CUR);

      if (-1 == lseekRes)
	{
	  perror("lseek");
	  goto exit;
	}
      
      if (lseekRes + uLength >= FILE_SIZE)
	{
	  if (-1 == lseek(fd, 0, SEEK_SET))  // set back to start
	    {
	      perror("lseek");
	      goto exit;
	    }
	}

      gettimeofday(&tvStart, NULL);

      // Do the write.
      if (-1 == write_wrapper(fd,
			      srcBuffer + random() % NUM_DIFF_START_LOCS,
			      uLength))
	{
	  perror("write_wrapper");
	  goto exit;
	}

      gettimeofday(&tvEnd, NULL);

      s64_usec  = (s64)tvEnd.tv_sec   * 1000000 + tvEnd.tv_usec;
      s64_usec -= (s64)tvStart.tv_sec * 1000000 + tvStart.tv_usec;
      printf(" %Ld usecs %.2f mbits/sec",
	     s64_usec,
	     // 8e6 / 2^20 = 7.62939453125: converts bytes/usec to binary mbits/sec
	     7.62939453125 * (float)uLength /
	     (float)s64_usec);

      printf("\n");
      fflush(stdout);
      usleep(random() % 100000);
    }

 exit:
  free(srcBuffer);
  
  if (1 == g_fSigTermReceived)
    {
      printf("SIGINT received!\n");
    }

#if 0
  printf("cleaning up %u files...\n", NUM_FILES);
  fflush(stdout);

  // close/unlink the files
  for (i = 0; i < NUM_FILES; i++)
    {
      printf("%u ", i);
      fflush(stdout);
      if (0 > snprintf(szFilename,
		       sizeof(szFilename),
		       "%s/%u",
		       szDirectory,
		       i))
	{
	  fprintf(stderr, "snprintf(szFilename) failed\n");
	  exit(1);
	}
      close(fds[i]);
      unlink(szFilename);
    }

  printf("\n");
#endif

  exit(0);
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-27 16:08               ` Chris Caputo
@ 2004-07-29  0:25                 ` Marcelo Tosatti
  2004-07-29  6:27                   ` Chris Caputo
  0 siblings, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2004-07-29  0:25 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Arjan van de Ven, linux-kernel


On Tue, Jul 27, 2004 at 09:08:57AM -0700, Chris Caputo wrote:

> I may have phrased the question poorly.  I am trying to get at whether the
> x86 "lock ; dec $x" operation can somehow be affected by changes to IRQ
> affinity, since that is what appears to happen.  I assume SMP inter-cpu
> locking doesn't generate IRQs at the linux kernel level (right?), but
> there seems to be something at a low level which gets affected by changes
> to which CPU handles which IRQs.
> 
> > > I would like to come up with a more reliable way to reproduce the problem
> > > with a stock kernel (2.4.26), since it is presently very rare (less than
> > > once per week) in the way I presently get it to happen, but as yet have
> > > not done so. 
> > 
> > What is your workload? I'm more than willing to use the SMP boxes I have 
> > access to try to reproduce this. 
> 
> Here is the basic repro so far:
> 
> -1) SMP x86 machine (PIII or PIV), 4 gigs of RAM, kernel 2.4.26.  Here are 
>     some possibly relevant configs:
> 
> CONFIG_X86=y
> CONFIG_X86_WP_WORKS_OK=y
> CONFIG_X86_INVLPG=y
> CONFIG_X86_CMPXCHG=y
> CONFIG_X86_XADD=y
> CONFIG_X86_BSWAP=y
> CONFIG_X86_POPAD_OK=y
> CONFIG_RWSEM_XCHGADD_ALGORITHM=y
> CONFIG_X86_HAS_TSC=y
> CONFIG_X86_GOOD_APIC=y
> CONFIG_X86_PGE=y
> CONFIG_X86_USE_PPRO_CHECKSUM=y
> CONFIG_X86_F00F_WORKS_OK=y
> CONFIG_X86_MCE=y
> CONFIG_HIGHMEM4G=y
> CONFIG_HIGHMEM=y
> CONFIG_HIGHIO=y
> CONFIG_MTRR=y
> CONFIG_SMP=y
> CONFIG_X86_TSC=y
> CONFIG_HAVE_DEC_LOCK=y
> CONFIG_X86_IO_APIC=y
> CONFIG_X86_LOCAL_APIC=y
> 
> 0) Start irqbalance.
> 1) Have 10+ gigs of space available for testing preferably on two separate
>    filesystems, mounted on say /test_0 and /test_1.
> 2) As root do a "find /" to fill up the dcache/inode caches with a bunch 
>    of data.
> 3) Compile the attached writer2.c: "gcc -O2 -o writer2 writer2.c"
> 4) Run "writer2 /test_0 0" and "writer2 /test_1 0" in the background.
> 5) Run "emerge sync" a few times per day.
> 6) Wait for crash.

OK, I have started a testcase on OSDL 8-way PentiumIII, 16GB RAM.
Running stock 2.4.26, ext2 and ext3 filesystems.

Three "writer2" instances running on three different partitions, plus
a "dbench 2" (which guarantees high inode cache activity).

Let's wait a week or so and see what has happened ;)

> Step 5 is unique to Gentoo, so it would be nice to have another test
> program which does the equivalent in terms of opening/creating/deleting a
> bunch of files.  The intent is to ensure that a bunch of items are added to
> and removed from the inode_unused list regularly.
> 
> > You said you also reproduced the same inode_unused corruption with
> > 2.4.24, yes?
> 
> I have reproduced the same symptoms (inode_unused list corruption) on
> 2.4.24 with a modified kernel (tux patches plus my filesystem) but I have
> not tried to do so with a stock 2.4.24.  So far the only stock kernel I
> have tried has been 2.4.26.
> 
> > > My plan of attack is to remove irqbalance from the equation and repeatedly
> > > change with random intervals /proc/irq entries directly from one user mode
> > > program while another user mode program does things which inspire a lot of
> > > fs/inode.c spinlock activity (since that is where I continue to see list
> > > corruption).
> > > 
> > > A few questions which could help me with this:
> > > 
> > >   - Which IRQ (if any) is used by CPU's to coordinate inter-CPU locking?
> > 
> > None as far as spinlocks are concerned. On x86 spinlock just does "lock
> > ; dec $x"  operation which guarantees the atomicity of the "dec".
> > 
> > I feel I'm not answering your question, still. What do you mean?
> 
> What I seem to be seeing is that changes in IRQ affinity are at some level
> screwing up the guaranteed atomicity of the "lock ; dec $x".  I realize
> that might be crazy to think, but maybe it is something known about the
> x86 spec.  The two CPU's I have seen this with, by the way, are a PIV-Xeon
> (CPUID F29) and a PIII (CPUID 06B1), so I don't imagine a microcode-type
> bug unique to a certain CPU release.

Changing the affinity writes new values to the IOAPIC registers, I can't see
how that could interfere with the atomicity of a spinlock operation. I don't
understand why you think irqbalance could affect anything.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29  0:25                 ` Marcelo Tosatti
@ 2004-07-29  6:27                   ` Chris Caputo
  2004-07-29  7:54                     ` Arjan van de Ven
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Caputo @ 2004-07-29  6:27 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Arjan van de Ven, linux-kernel

On Wed, 28 Jul 2004, Marcelo Tosatti wrote:
> Changing the affinity writes new values to the IOAPIC registers, I can't see
> how that could interfere with the atomicity of a spinlock operation. I don't
> understand why you think irqbalance could affect anything.

Because when I stop running irqbalance the crashes no longer happen.

Chris


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29  6:27                   ` Chris Caputo
@ 2004-07-29  7:54                     ` Arjan van de Ven
  2004-07-29 10:57                       ` Marcelo Tosatti
  0 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2004-07-29  7:54 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Marcelo Tosatti, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 584 bytes --]

On Wed, Jul 28, 2004 at 11:27:41PM -0700, Chris Caputo wrote:
> On Wed, 28 Jul 2004, Marcelo Tosatti wrote:
> > Changing the affinity writes new values to the IOAPIC registers, I can't see
> > how that could interfere with the atomicity of a spinlock operation. I don't
> > understand why you think irqbalance could affect anything.
> 
> Because when I stop running irqbalance the crashes no longer happen.

what is the irq distribution when you do that?
Can you run irqbalance for a bit to make sure there's a static distribution
of irq's and then disable it and see if it survives ?

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29  7:54                     ` Arjan van de Ven
@ 2004-07-29 10:57                       ` Marcelo Tosatti
  2004-07-29 12:21                         ` Arjan van de Ven
  0 siblings, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2004-07-29 10:57 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Chris Caputo, linux-kernel

On Thu, Jul 29, 2004 at 09:54:29AM +0200, Arjan van de Ven wrote:
> On Wed, Jul 28, 2004 at 11:27:41PM -0700, Chris Caputo wrote:
> > On Wed, 28 Jul 2004, Marcelo Tosatti wrote:
> > > Changing the affinity writes new values to the IOAPIC registers, I can't see
> > > how that could interfere with the atomicity of a spinlock operation. I don't
> > > understand why you think irqbalance could affect anything.
> > 
> > Because when I stop running irqbalance the crashes no longer happen.
> 
> what is the irq distribution when you do that?
> Can you run irqbalance for a bit to make sure there's a static distribution
> of irq's and then disable it and see if it survives ?

Chris, Yes I'm also running irqbalance. 

Arjan, what is an easy way for me to make irqbalance change the affinity
like crazy on the 8-way SMP box, just for a test?

TIA



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29 10:57                       ` Marcelo Tosatti
@ 2004-07-29 12:21                         ` Arjan van de Ven
  2004-07-29 16:22                           ` Chris Caputo
  0 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2004-07-29 12:21 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Chris Caputo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1079 bytes --]

On Thu, Jul 29, 2004 at 07:57:55AM -0300, Marcelo Tosatti wrote:
> On Thu, Jul 29, 2004 at 09:54:29AM +0200, Arjan van de Ven wrote:
> > On Wed, Jul 28, 2004 at 11:27:41PM -0700, Chris Caputo wrote:
> > > On Wed, 28 Jul 2004, Marcelo Tosatti wrote:
> > > > Changing the affinity writes new values to the IOAPIC registers, I can't see
> > > > how that could interfere with the atomicity of a spinlock operation. I don't
> > > > understand why you think irqbalance could affect anything.
> > > 
> > > Because when I stop running irqbalance the crashes no longer happen.
> > 
> > what is the irq distribution when you do that?
> > Can you run irqbalance for a bit to make sure there's a static distribution
> > of irq's and then disable it and see if it survives ?
> 
> Chris, Yes I'm also running irqbalance. 
> 
> Arjan, what is an easy way for me to make irqbalance change the affinity
> like crazy on the 8-way SMP box, just for a test?

there is a sleep(10 seconds) in the code; if you change that to something
really short and then cause irq bursts on different devices...



[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29 12:21                         ` Arjan van de Ven
@ 2004-07-29 16:22                           ` Chris Caputo
  2004-07-29 16:23                             ` Arjan van de Ven
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Caputo @ 2004-07-29 16:22 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Marcelo Tosatti, linux-kernel

On Thu, 29 Jul 2004, Arjan van de Ven wrote:
> On Thu, Jul 29, 2004 at 07:57:55AM -0300, Marcelo Tosatti wrote:
> > On Thu, Jul 29, 2004 at 09:54:29AM +0200, Arjan van de Ven wrote:
> > > On Wed, Jul 28, 2004 at 11:27:41PM -0700, Chris Caputo wrote:
> > > > On Wed, 28 Jul 2004, Marcelo Tosatti wrote:
> > > > > Changing the affinity writes new values to the IOAPIC registers, I can't see
> > > > > how that could interfere with the atomicity of a spinlock operation. I don't
> > > > > understand why you think irqbalance could affect anything.
> > > > 
> > > > Because when I stop running irqbalance the crashes no longer happen.
> > > 
> > > what is the irq distribution when you do that?
> > > Can you run irqbalance for a bit to make sure there's a static distribution
> > > of irq's and then disable it and see if it survives ?
> > 
> > Chris, Yes I'm also running irqbalance. 
> > 
> > Arjan, what is an easy way for me to make irqbalance change the affinity
> > like crazy on the 8-way SMP box, just for a test?
> 
> there is a sleep(10 seconds) in the code; if you change that to something
> really short and then cause irq bursts on different devices...

I messed with that at one point and didn't see a noticeable improvement in
getting the crash to happen.  I'd recommend not going lower than sleep(1)
since that still gives irqbalance some time to amass stats prior to the
next iteration.

On Thu, 29 Jul 2004, Arjan van de Ven wrote:
> what is the irq distribution when you do that?

You mean like what does "cat /proc/interrupts" show?  One thing I could do
is have a "watch -n 1 cat /proc/interrupts" running in one screen while
the tests are running, and then when the crash happens copy/paste the
results.  Would that be useful?

> Can you run irqbalance for a bit to make sure there's a static
> distribution of irq's and then disable it and see if it survives ?

I've had mixed results with this.  In theory the list corruption can
occur while irqbalance is running, and then, days after stopping it, the
crash can happen when a corrupted list item is finally accessed.

Should I try the run-once/oneshot feature of irqbalance on boot, and then
leave it off and see if anything happens?

Thanks,
Chris



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29 16:22                           ` Chris Caputo
@ 2004-07-29 16:23                             ` Arjan van de Ven
  2004-08-04 15:38                               ` Chris Caputo
  0 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2004-07-29 16:23 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Marcelo Tosatti, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 325 bytes --]


On Thu, Jul 29, 2004 at 09:22:40AM -0700, Chris Caputo wrote:
> 
> Should I try the run-once/oneshot feature of irqbalance on boot, and then
> leave it off and see if anything happens?

that sounds like a good test

it really smells that a certain distribution of irq's is causing you grief
more than the actual changing...

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29 16:23                             ` Arjan van de Ven
@ 2004-08-04 15:38                               ` Chris Caputo
  2004-08-04 18:47                                 ` Chris Caputo
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Caputo @ 2004-08-04 15:38 UTC (permalink / raw)
  To: Marcelo Tosatti, Arjan van de Ven; +Cc: linux-kernel

On Thu, 29 Jul 2004, Arjan van de Ven wrote:
> On Thu, Jul 29, 2004 at 09:22:40AM -0700, Chris Caputo wrote:
> > Should I try the run-once/oneshot feature of irqbalance on boot, and then
> > leave it off and see if anything happens?
> 
> that sounds like a good test
> 
> it really smells that a certain distribution of irq's is causing you grief
> more than the actual changing...

Here is my latest repro.  It turns out writer2 is not needed, Marcelo, so
feel free to skip it.

I'll now try to get a crash with just a one-shot irqbalance.  The below
was with the unmodified 0.08 irqbalance (affinity changes possible once
every 10 seconds).

Arjan, the irq affinities and /proc/interrupts output near the time of the
crash are at the bottom of this email.  It is unknown whether these were
the exact affinities when the list corruption happened or not, since the
watch command was only running once per second.

One thing to note below is that even though dbench was doing a ton of
filesystem activity at the time of the crash, the process which invoked
the system call which resulted in the iput() was irqbalance.  I don't
always see irqbalance in the call trace when these crashes happen, but it
is pretty common, like about 75% of the time.

Repro method:

-1) 2.4.26 kernel
0) start irqbalance
1) watch -n 1 'cat /proc/sys/fs/inode-nr ; head -n 11 /proc/interrupts ;
   head /proc/irq/*/*'
2) run loop_dbench, which is the following dbench script which uses 
   client_plain.txt:

   #!/bin/sh

   while [ 1 ]
   do
        date
        dbench 2
   done

3) Wait.

Results:

started 1:04am on Sunday August 1st
crashed 4:28am on Tuesday August 3rd

(If anyone else has spare SMP machines they can try the above with on
2.4.26, that would be great.)

---

Here is the crash dump.  The ALTNET lines are from extra code in inode.c
which scans the inode_unused list up to 10 items in each direction to
determine if the list is valid.  It is called before and after
list_del/list_add pairs in inode.c.  (If you would like the patch for
this, let me know.)

The i=0 below indicates this is the first item tested from the head of
the inode_unused list.  Basically inode_unused->next points to an item
whose next and prev pointers are NULL.
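
For reference, the forward half of such a scan amounts to something like
the sketch below.  This is only my reconstruction from the ALTNET output
above, using a 2.4-style struct list_head; the verify_list_forward name
and return convention are my own, not the actual patch:

```c
#include <stddef.h>

struct list_head
{
  struct list_head *next, *prev;
};

/* Walk up to `depth` nodes forward from `head`, checking that no next
 * pointer is NULL and that cur->next->prev points back at cur.
 * Returns the index of the first bad node, or -1 if the scan is clean
 * (stopping early if the circular list wraps back to `head`). */
int verify_list_forward(struct list_head *head, int depth)
{
  struct list_head *cur = head;
  int i;

  for (i = 0; i < depth; i++)
    {
      struct list_head *next = cur->next;

      if (NULL == next || next->prev != cur)
	return i;
      if (next == head)  /* wrapped around: the list is intact */
	break;
      cur = next;
    }

  return -1;
}
```

A mirror-image walk over the prev pointers gives the "each direction"
coverage described above.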

AAAAAAAAAAAAAAAAAAAAAA
ALTNET(forward): cur->p == NULL, i=0, inode=0xecc83804, i_size=0
ALTNET: cur=0xecc8380c, cur->next=0x00000000, cur->prev=0x00000000
ALTNET: i_hash=0xecc83804,
i_hash->next=0xf7de8188, i_hash->next->prev=0xecc83804,
i_hash->prev=0xf7de8188, i_hash->prev->next=0xecc83804
ALTNET: inode_lock.lock = 0
kernel BUG at inode.c:154!
invalid operand: 0000
CPU:    0
EIP:    0010:[<c016657f>]    Not tainted
EFLAGS: 00010282
eax: 0000001c   ebx: ecc8380c   ecx: c03393f8   edx: 0000001f
esi: ecc83804   edi: 00000000   ebp: e72abf14   esp: e72abef0
ds: 0018   es: 0018   ss: 0018
Process irqbalance (pid: 9177, stackpage=e72ab000)
Stack: c02f00a1 00000000 f7de8188 ecc83804 f7de8188 ecc83804 ecc8380c ecc83804
       c3f7b800 e72abf44 c01696de 00000002 c033b980 e72abf4c ecc83830 c033bd60
       40015000 00000216 c625d3ec ecc83804 ecc83804 e72abf5c c0163e91 ecc83804
Call Trace:    [<c01696de>] [<c0163e91>] [<c014c957>] [<c0177db7>] [<c014a66f>]
  [<c014a748>] [<c0107a9f>]

Code: 0f 0b 9a 00 99 00 2f c0 8b 43 04 e9 ac fa ff ff c7 04 24 67

---

>>EIP; c016657f <ALTNET_verify_unused_list+5ef/620>   <=====

>>ebx; ecc8380c <_end+2c8a9540/38618d94>
>>ecx; c03393f8 <console_sem+0/18>
>>esi; ecc83804 <_end+2c8a9538/38618d94>
>>ebp; e72abf14 <_end+26ed1c48/38618d94>
>>esp; e72abef0 <_end+26ed1c24/38618d94>

Trace; c01696de <iput+36e/4d0>
Trace; c0163e91 <dput+f1/230>
Trace; c014c957 <fput+137/190>
Trace; c0177db7 <proc_file_write+37/50>
Trace; c014a66f <filp_close+bf/120>
Trace; c014a748 <sys_close+78/90>
Trace; c0107a9f <system_call+33/38>

Code;  c016657f <ALTNET_verify_unused_list+5ef/620>
00000000 <_EIP>:
Code;  c016657f <ALTNET_verify_unused_list+5ef/620>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  c0166581 <ALTNET_verify_unused_list+5f1/620>
   2:   9a 00 99 00 2f c0 8b      lcall  $0x8bc0,$0x2f009900
Code;  c0166588 <ALTNET_verify_unused_list+5f8/620>
   9:   43                        inc    %ebx
Code;  c0166589 <ALTNET_verify_unused_list+5f9/620>
   a:   04 e9                     add    $0xe9,%al
Code;  c016658b <ALTNET_verify_unused_list+5fb/620>
   c:   ac                        lods   %ds:(%esi),%al
Code;  c016658c <ALTNET_verify_unused_list+5fc/620>
   d:   fa                        cli
Code;  c016658d <ALTNET_verify_unused_list+5fd/620>
   e:   ff                        (bad)
Code;  c016658e <ALTNET_verify_unused_list+5fe/620>
   f:   ff c7                     inc    %edi
Code;  c0166590 <ALTNET_verify_unused_list+600/620>
  11:   04 24                     add    $0x24,%al
Code;  c0166592 <ALTNET_verify_unused_list+602/620>
  13:   67 00 00                  addr16 add %al,(%bx,%si)

---

Current output of the loop_dbench script at the time of the crash:

Tue Aug  3 04:27:03 GMT 2004
2 clients started
   0     62477  41.81 MB/secc
Throughput 41.8089 MB/sec 2 procs
Tue Aug  3 04:27:21 GMT 2004
2 clients started
   0     62477  9.35 MB/secc
Throughput 9.35182 MB/sec 2 procs
Tue Aug  3 04:28:36 GMT 2004
2 clients started
   0     62477  61.05 MB/secc
Throughput 61.0461 MB/sec 2 procs
Tue Aug  3 04:28:48 GMT 2004

---

Current output of the "watch" command at the time of the crash:

Every 1s: cat /proc/sys/fs/inode-nr ; head -n 11 /p...  Tue Aug  3 
04:28:49 2004

/proc/sys/fs/inode-nr: 122465  105275

           CPU0       CPU1
  0:   47299923   50226384    IO-APIC-edge  timer
  1:         11          7    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  4:          1          6    IO-APIC-edge  serial
  8:          0          2    IO-APIC-edge  rtc
 14:          1          4    IO-APIC-edge  ide0
 18:    1801274     370200   IO-APIC-level  eth0
 19:     938177    1120685   IO-APIC-level  eth1
 21:   10833570   10778863   IO-APIC-level  3ware Storage Controller
 25:    2677989    3054823   IO-APIC-level  3ware Storage Controller

==> /proc/irq/0/smp_affinity <==
00000002

==> /proc/irq/1/smp_affinity <==
00000001

==> /proc/irq/10/smp_affinity <==
ffffffff

==> /proc/irq/11/smp_affinity <==
ffffffff

==> /proc/irq/12/smp_affinity <==
ffffffff

==> /proc/irq/13/smp_affinity <==
ffffffff

==> /proc/irq/14/smp_affinity <==
00000001

==> /proc/irq/15/smp_affinity <==
ffffffff

==> /proc/irq/18/smp_affinity <==
00000001

==> /proc/irq/19/smp_affinity <==
00000002

==> /proc/irq/2/smp_affinity <==
ffffffff

==> /proc/irq/21/smp_affinity <==
00000001

==> /proc/irq/24/smp_affinity <==
ffffffff

==> /proc/irq/25/smp_affinity <==
00000001

==> /proc/irq/3/smp_affinity <==
ffffffff

==> /proc/irq/4/smp_affinity <==
00000002

==> /proc/irq/5/smp_affinity <==
ffffffff

==> /proc/irq/6/smp_affinity <==
ffffffff

==> /proc/irq/7/smp_affinity <==
ffffffff

==> /proc/irq/8/smp_affinity <==
00000001

==> /proc/irq/9/smp_affinity <==
ffffffff

Chris


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-08-04 15:38                               ` Chris Caputo
@ 2004-08-04 18:47                                 ` Chris Caputo
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Caputo @ 2004-08-04 18:47 UTC (permalink / raw)
  To: Marcelo Tosatti, Arjan van de Ven; +Cc: linux-kernel

Correction: I have been using client_oplocks.txt, not client_plain.txt, in
this repro.  (Both were in the directory, but it is the oplocks.txt one
that dbench defaults to.)

Chris

On Wed, 4 Aug 2004, Chris Caputo wrote:
> 2) run loop_dbench, which is the following dbench script which uses 
>    client_plain.txt:
> 
>    #!/bin/sh
> 
>    while [ 1 ]
>    do
>         date
>         dbench 2
>    done


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
       [not found] <20040805225549.GA18420@logos.cnet>
@ 2004-08-07 16:00 ` Chris Caputo
  2004-08-07 18:31   ` Marcelo Tosatti
  2004-08-07 19:36   ` Marcelo Tosatti
  0 siblings, 2 replies; 29+ messages in thread
From: Chris Caputo @ 2004-08-07 16:00 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linux Kernel

On Thu, 5 Aug 2004, Marcelo Tosatti wrote:
> On Wed, Aug 04, 2004 at 02:55:38PM -0700, Chris Caputo wrote:
> > Right now the 3 tests I have running are:
> > 
> >  Intel SDS2 mb / dual-PIII / 3ware / 2.4.26 / irqbalance --oneshot
> >  Intel SDS2 mb / dual-PIII / 3ware / 2.4.27-rc5 / irqbalance ongoing
> >  Intel STL2 mb / dual-PIII / DAC960 / 2.4.26 / irqbalance ongoing

Well, I ended up getting the same crash (report below) with the non-3ware
(STL2 based) server, so I think I can rule the 3ware driver out as being
an instigator.  The STL2 based server took 2 days 14.5 hours to get a
corrupted inode_unused list.

This makes the list of motherboards I have seen the problem on be:

   Intel SE7501HG2 with dual-PIV's, 4 gig of ram
   Intel SDS2 with dual-PIII's, 4 gig of ram
   Intel STL2 with dual-PIII's, 2 gig of ram

At present the 2.4.26 with oneshot irqbalance and the 2.4.27-rc5 with
normal irqbalance are continuing to run without problems.  Coming up on 3
days without issues...  I'll keep them running.

Also, I'll start running 2.4.27-rc5 on a second server (the STL2) with a
normal irqbalance.

> Hum perhaps CONFIG_DEBUG_STACKOVERFLOW? And CONFIG_DEBUG_SLAB? 
> 
> I recall you said you had CONFIG_DEBUG_SLAB set already?

I have been running kernels with both DEBUG_SLAB and DEBUG_STACKOVERFLOW
set.

Marcelo, I take it the 8-proc server is still running fine?

Anyone else out there got a spare P3 or P4 dual-proc machine they can have
run the following repro scenario with 2.4.26 for a week?

Chris

---

Steps to repro:

0) irqbalance running on boot
1) watch -n 1 'cat /proc/sys/fs/inode-nr ; head -n 11 /proc/interrupts ; 
   head /proc/irq/*/*'
3) loop_dbench

started Wed Aug  4 18:56 GMT.
crashed Sat Aug  7 09:25:49 GMT.

---

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ALTNET(forward): cur->p == NULL, i=0, inode=0xc6f59004, i_size=0
ALTNET: cur=0xc6f5900c, cur->next=0x00000000, cur->prev=0x00000000
ALTNET: i_hash=0xc6f59004, i_hash->next=0xc2918cd0,
i_hash->next->prev=0xc6f59004, i_hash->prev=0xc2918cd0,
i_hash->prev->next=0xc6f59004
ALTNET: inode_lock.lock = 0
kernel BUG at inode.c:154!
invalid operand: 0000
CPU:    0
EIP:    0010:[<c0165d3f>]    Not tainted
EFLAGS: 00010282
eax: 0000001c   ebx: c6f5900c   ecx: c031e4f8   edx: 0000001f
esi: c6f59004   edi: 00000000   ebp: f76fff14   esp: f76ffef0
ds: 0018   es: 0018   ss: 0018
Process irqbalance (pid: 1124, stackpage=f76ff000)
Stack: c02e569d 00000000 c2918cd0 c6f59004 c2918cd0 c6f59004 c6f5900c c6f59004
       c283b800 f76fff44 c0168e9e 00000002 c0320a80 f76fff4c c6f59030 c0320e60
       40015000 00000216 f767556c c6f59004 c6f59004 f76fff5c c01637ed c6f59004
Call Trace:    [<c0168e9e>] [<c01637ed>] [<c014c3e7>] [<c0177417>] [<c014a12f>]
  [<c014a208>] [<c0107a5b>]

Code: 0f 0b 9a 00 95 56 2e c0 8b 43 04 e9 ac fa ff ff c7 04 24 37

---

>>EIP; c0165d3f <ALTNET_verify_unused_list+5ef/620>   <=====

>>ecx; c031e4f8 <console_sem+0/18>

Trace; c0168e9e <iput+36e/4d0>
Trace; c01637ed <dput+ed/210>
Trace; c014c3e7 <fput+137/190>
Trace; c0177417 <proc_file_write+37/50>
Trace; c014a12f <filp_close+bf/120>
Trace; c014a208 <sys_close+78/90>
Trace; c0107a5b <system_call+33/38>

Code;  c0165d3f <ALTNET_verify_unused_list+5ef/620>
00000000 <_EIP>:
Code;  c0165d3f <ALTNET_verify_unused_list+5ef/620>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  c0165d41 <ALTNET_verify_unused_list+5f1/620>
   2:   9a 00 95 56 2e c0 8b      lcall  $0x8bc0,$0x2e569500
Code;  c0165d48 <ALTNET_verify_unused_list+5f8/620>
   9:   43                        inc    %ebx
Code;  c0165d49 <ALTNET_verify_unused_list+5f9/620>
   a:   04 e9                     add    $0xe9,%al
Code;  c0165d4b <ALTNET_verify_unused_list+5fb/620>
   c:   ac                        lods   %ds:(%esi),%al
Code;  c0165d4c <ALTNET_verify_unused_list+5fc/620>
   d:   fa                        cli
Code;  c0165d4d <ALTNET_verify_unused_list+5fd/620>
   e:   ff                        (bad)
Code;  c0165d4e <ALTNET_verify_unused_list+5fe/620>
   f:   ff c7                     inc    %edi
Code;  c0165d50 <ALTNET_verify_unused_list+600/620>
  11:   04 24                     add    $0x24,%al
Code;  c0165d52 <ALTNET_verify_unused_list+602/620>
  13:   37                        aaa

---

Sat Aug  7 09:25:38 GMT 2004
2 clients started
   0     62477  185.37 MB/sec
Throughput 185.368 MB/sec 2 procs
Sat Aug  7 09:25:43 GMT 2004
2 clients started
   0     62477  191.27 MB/sec
Throughput 191.266 MB/sec 2 procs
Sat Aug  7 09:25:48 GMT 2004
2 clients started

---

Every 1s: cat /proc/sys/fs/inode-nr ; head -n 11 /p...  Sat Aug  7 
09:25:49 2004

/proc/sys/fs/inode-nr: 113480  113303

           CPU0       CPU1
  0:   11261584   11262640    IO-APIC-edge  timer
  1:          3         11    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  4:          6          1    IO-APIC-edge  serial
  8:          2          0    IO-APIC-edge  rtc
 18:    3019678         26   IO-APIC-level  eth0
 20:     607652     716283   IO-APIC-level  Mylex eXtremeRAID 2000
NMI:          0          0
LOC:   22526983   22526974
ERR:          0

==> /proc/irq/0/smp_affinity <==
00000001

==> /proc/irq/1/smp_affinity <==
00000002

==> /proc/irq/10/smp_affinity <==
ffffffff

==> /proc/irq/11/smp_affinity <==
ffffffff

==> /proc/irq/12/smp_affinity <==
ffffffff

==> /proc/irq/13/smp_affinity <==
ffffffff

==> /proc/irq/14/smp_affinity <==
ffffffff

==> /proc/irq/15/smp_affinity <==
ffffffff

==> /proc/irq/18/smp_affinity <==
00000001

==> /proc/irq/2/smp_affinity <==
ffffffff

==> /proc/irq/20/smp_affinity <==
00000002

==> /proc/irq/21/smp_affinity <==
ffffffff

==> /proc/irq/3/smp_affinity <==
ffffffff

==> /proc/irq/4/smp_affinity <==
00000001

==> /proc/irq/5/smp_affinity <==
ffffffff

==> /proc/irq/6/smp_affinity <==
ffffffff

==> /proc/irq/7/smp_affinity <==
ffffffff

==> /proc/irq/8/smp_affinity <==
00000002

==> /proc/irq/9/smp_affinity <==
ffffffff




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-08-07 16:00 ` Chris Caputo
@ 2004-08-07 18:31   ` Marcelo Tosatti
  2004-08-07 19:36   ` Marcelo Tosatti
  1 sibling, 0 replies; 29+ messages in thread
From: Marcelo Tosatti @ 2004-08-07 18:31 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Linux Kernel

On Sat, Aug 07, 2004 at 09:00:30AM -0700, Chris Caputo wrote:
> On Thu, 5 Aug 2004, Marcelo Tosatti wrote:
> > On Wed, Aug 04, 2004 at 02:55:38PM -0700, Chris Caputo wrote:
> > > Right now the 3 tests I have running are:
> > > 
> > >  Intel SDS2 mb / dual-PIII / 3ware / 2.4.26 / irqbalance --oneshot
> > >  Intel SDS2 mb / dual-PIII / 3ware / 2.4.27-rc5 / irqbalance ongoing
> > >  Intel STL2 mb / dual-PIII / DAC960 / 2.4.26 / irqbalance ongoing
> 
> Well, I ended up getting the same crash (report below) with the non-3ware
> (STL2 based) server, so I think I can rule the 3ware driver out as being
> an instigator.  The STL2 based server took 2 days 14.5 hours to get a
> corrupted inode_unused list.
> 
> This makes the list of motherboards I have seen the problem on be:
> 
>    Intel SE7501HG2 with dual-PIV's, 4 gig of ram
>    Intel SDS2 with dual-PIII's, 4 gig of ram
>    Intel STL2 with dual-PIII's, 2 gig of ram
> 
> At present the 2.4.26 with oneshot irqbalance and the 2.4.27-rc5 with
> normal irqbalance are continuing to run without problems.  Coming up on 3
> days without issues...  I'll keep them running.
> 
> Also, I'll start running 2.4.27-rc5 on a second server (the STL2) with a
> normal irqbalance.
> 
> > Hum perhaps CONFIG_DEBUG_STACKOVERFLOW? And CONFIG_DEBUG_SLAB? 
> > 
> > I recall you said you had CONFIG_DEBUG_SLAB set already?
> 
> I have been running kernels with both DEBUG_SLAB and DEBUG_STACKOVERFLOW
> set.
> 
> Marcelo, I take it the 8-proc server is still running fine?

Yes, it's fine.

> Anyone else out there got a spare P3 or P4 dual-proc machine they can have
> run the following repro scenario with 2.4.26 for a week?

I have a dual-P4 SE7501 here. I'll start the tests now with v2.4.26 plus your
corruption patches.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-08-07 16:00 ` Chris Caputo
  2004-08-07 18:31   ` Marcelo Tosatti
@ 2004-08-07 19:36   ` Marcelo Tosatti
  1 sibling, 0 replies; 29+ messages in thread
From: Marcelo Tosatti @ 2004-08-07 19:36 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Linux Kernel

On Sat, Aug 07, 2004 at 09:00:30AM -0700, Chris Caputo wrote:
> On Thu, 5 Aug 2004, Marcelo Tosatti wrote:
> > On Wed, Aug 04, 2004 at 02:55:38PM -0700, Chris Caputo wrote:
> > > Right now the 3 tests I have running are:
> > > 
> > >  Intel SDS2 mb / dual-PIII / 3ware / 2.4.26 / irqbalance --oneshot
> > >  Intel SDS2 mb / dual-PIII / 3ware / 2.4.27-rc5 / irqbalance ongoing
> > >  Intel STL2 mb / dual-PIII / DAC960 / 2.4.26 / irqbalance ongoing
> 
> Well, I ended up getting the same crash (report below) with the non-3ware
> (STL2 based) server, so I think I can rule the 3ware driver out as being
> an instigator.  The STL2 based server took 2 days 14.5 hours to get a
> corrupted inode_unused list.
> 
> This makes the list of motherboards I have seen the problem on be:
> 
>    Intel SE7501HG2 with dual-PIV's, 4 gig of ram
>    Intel SDS2 with dual-PIII's, 4 gig of ram
>    Intel STL2 with dual-PIII's, 2 gig of ram
> 
> At present the 2.4.26 with oneshot irqbalance and the 2.4.27-rc5 with
> normal irqbalance are continuing to run without problems.  Coming up on 3
> days without issues...  I'll keep them running.
> 
> Also, I'll start running 2.4.27-rc5 on a second server (the STL2) with a
> normal irqbalance.

Chris,

Started running "dbench 10" on 2.4.27-rc6 + arjan's irqbalance 0.9 on my 
dual P4 (e7500 plumas) with Symbios 53c1030 controller and 2GB of RAM.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2004-08-07 20:31 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-06-19  0:47 inode_unused list corruption in 2.4.26 - spin_lock problem? Chris Caputo
2004-06-20  0:15 ` Marcelo Tosatti
2004-06-20  3:33   ` Trond Myklebust
2004-06-21  0:45     ` Marcelo Tosatti
2004-06-21 17:10       ` Trond Myklebust
2004-06-21 18:23         ` Marcelo Tosatti
2004-06-24  1:51         ` Chris Caputo
2004-06-25  7:47           ` Chris Caputo
2004-06-24  1:50   ` Chris Caputo
2004-06-25  8:04     ` Chris Caputo
2004-06-25 10:18       ` Chris Caputo
2004-06-25 12:17     ` Marcelo Tosatti
2004-07-02 20:00       ` Chris Caputo
2004-07-03  5:15         ` Arjan van de Ven
2004-07-26 17:41           ` Chris Caputo
2004-07-27 14:19             ` Marcelo Tosatti
2004-07-27 16:08               ` Chris Caputo
2004-07-29  0:25                 ` Marcelo Tosatti
2004-07-29  6:27                   ` Chris Caputo
2004-07-29  7:54                     ` Arjan van de Ven
2004-07-29 10:57                       ` Marcelo Tosatti
2004-07-29 12:21                         ` Arjan van de Ven
2004-07-29 16:22                           ` Chris Caputo
2004-07-29 16:23                             ` Arjan van de Ven
2004-08-04 15:38                               ` Chris Caputo
2004-08-04 18:47                                 ` Chris Caputo
     [not found] <20040805225549.GA18420@logos.cnet>
2004-08-07 16:00 ` Chris Caputo
2004-08-07 18:31   ` Marcelo Tosatti
2004-08-07 19:36   ` Marcelo Tosatti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox