* inode_unused list corruption in 2.4.26 - spin_lock problem? @ 2004-06-19 0:47 Chris Caputo 2004-06-20 0:15 ` Marcelo Tosatti 0 siblings, 1 reply; 29+ messages in thread From: Chris Caputo @ 2004-06-19 0:47 UTC (permalink / raw) To: linux-kernel In 2.4.26 on two different dual-proc x86 machines (one dual-P4 Xeon based, the other dual-PIII) I am seeing crashes which are the result of the inode_unused doubly linked list in fs/inode.c becoming corrupted. A particular instance of the corruption I have isolated is in a call from iput() to __refile_inode(). To try to diagnose this further I placed list verification code before and after the list_del() and list_add() calls in __refile_inode() and observed a healthy list become corrupted after the del/add was completed. It would seem to me that list corruption on otherwise healthy machines would only be the result of the inode_lock spinlock not being properly locked prior to the call to __refile_inode(), but as far as I can tell, the call to atomic_dec_and_lock() in iput() is doing that properly. So I am at a loss. Has anyone else seen this or does anyone have any idea what routes I should be exploring to fix this problem? Thank you, Chris ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-06-19 0:47 inode_unused list corruption in 2.4.26 - spin_lock problem? Chris Caputo @ 2004-06-20 0:15 ` Marcelo Tosatti 2004-06-20 3:33 ` Trond Myklebust 2004-06-24 1:50 ` Chris Caputo 0 siblings, 2 replies; 29+ messages in thread From: Marcelo Tosatti @ 2004-06-20 0:15 UTC (permalink / raw) To: Chris Caputo; +Cc: linux-kernel, David Woodhouse, riel Hi Chris, I've seen your previous post -- should have answered you earlier. On Fri, Jun 18, 2004 at 05:47:05PM -0700, Chris Caputo wrote: > In 2.4.26 on two different dual-proc x86 machines (one dual-P4 Xeon based, > the other dual-PIII) I am seeing crashes which are the result of the > inode_unused doubly linked list in fs/inode.c becoming corrupted. What steps are required to reproduce the problem? > A particular instance of the corruption I have isolated is in a call from > iput() to __refile_inode(). To try to diagnose this further I placed list > verification code before and after the list_del() and list_add() calls in > __refile_inode() and observed a healthy list become corrupted after the > del/add was completed. Can you show us this data in more detail? > It would seem to me that list corruption on otherwise healthy machines > would only be the result of the inode_lock spinlock not being properly > locked prior to the call to __refile_inode(), but as far as I can tell, > the call to atomic_dec_and_lock() in iput() is doing that properly. > > So I am at a loss. Has anyone else seen this or does anyone have any idea > what routes I should be exploring to fix this problem? The changes between 2.4.25->2.4.26 (which introduce __refile_inode() and the unused_pagecache list) must have something to do with this. David, Rik, can you give some help here? ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-06-20 0:15 ` Marcelo Tosatti @ 2004-06-20 3:33 ` Trond Myklebust 2004-06-21 0:45 ` Marcelo Tosatti 2004-06-24 1:50 ` Chris Caputo 1 sibling, 1 reply; 29+ messages in thread From: Trond Myklebust @ 2004-06-20 3:33 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Chris Caputo, linux-kernel, David Woodhouse, riel On Sat, 19/06/2004 at 20:15, Marcelo Tosatti wrote: > The changes between 2.4.25->2.4.26 (which introduce __refile_inode() and > the unused_pagecache list) must have something to do with this. Here's one question: Given the fact that in iput(), the inode remains hashed and the inode->i_state does not change until after we've dropped the inode_lock, called write_inode_now(), and then retaken the inode_lock, exactly what is preventing a third party task from grabbing that inode? (Better still: write_inode_now() itself actually calls __iget(), which could cause that inode to be plonked right back onto the "inode_in_use" list if ever refile_inode() gets called.) So does the following patch help? 
Cheers, Trond --- linux-2.4.27-pre3/fs/inode.c.orig 2004-05-20 20:41:41.000000000 -0400 +++ linux-2.4.27-pre3/fs/inode.c 2004-06-19 23:22:29.000000000 -0400 @@ -1200,6 +1200,7 @@ void iput(struct inode *inode) struct super_block *sb = inode->i_sb; struct super_operations *op = NULL; +again: if (inode->i_state == I_CLEAR) BUG(); @@ -1241,11 +1242,16 @@ void iput(struct inode *inode) if (!(inode->i_state & (I_DIRTY|I_LOCK))) __refile_inode(inode); inodes_stat.nr_unused++; - spin_unlock(&inode_lock); - if (!sb || (sb->s_flags & MS_ACTIVE)) + if (!sb || (sb->s_flags & MS_ACTIVE)) { + spin_unlock(&inode_lock); return; - write_inode_now(inode, 1); - spin_lock(&inode_lock); + } + if (inode->i_state & I_DIRTY) { + __iget(inode); + spin_unlock(&inode_lock); + write_inode_now(inode, 1); + goto again; + } inodes_stat.nr_unused--; list_del_init(&inode->i_hash); } ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-06-20 3:33 ` Trond Myklebust @ 2004-06-21 0:45 ` Marcelo Tosatti 2004-06-21 17:10 ` Trond Myklebust 0 siblings, 1 reply; 29+ messages in thread From: Marcelo Tosatti @ 2004-06-21 0:45 UTC (permalink / raw) To: Trond Myklebust; +Cc: Chris Caputo, linux-kernel, David Woodhouse, riel On Sat, Jun 19, 2004 at 11:33:55PM -0400, Trond Myklebust wrote: > På lau , 19/06/2004 klokka 20:15, skreiv Marcelo Tosatti: > > > The changes between 2.4.25->2.4.26 (which introduce __refile_inode() and > > the unused_pagecache list) must have something to do with this. > > Here's one question: > > Given the fact that in iput(), the inode remains hashed and the > inode->i_state does not change until after we've dropped the inode_lock, > called write_inode_now(), and then retaken the inode_lock, exactly what > is preventing a third party task from grabbing that inode? > > (Better still: write_inode_now() itself actually calls __iget(), which > could cause that inode to be plonked right back onto the "inode_in_use" > list if ever refile_inode() gets called.) Let's see if I get this right: while we drop the lock in iput() to call write_inode_now(), an iget happens, possibly from write_inode_now() itself (sync_one->__iget), causing the inode->i_list to be added to inode_in_use. But then the call returns, locks inode_lock, decrements inodes_stat.nr_unused, and deletes the inode from inode_in_use and adds it to inode_unused. AFAICS it's an inode with i_count==1 in the unused list, which does not mean "list corruption", right? Am I missing something here? If you are indeed right, all 2.4.x versions contain this bug. Thanks for helping! > > So does the following patch help? 
> > +++ linux-2.4.27-pre3/fs/inode.c 2004-06-19 23:22:29.000000000 -0400 > @@ -1200,6 +1200,7 @@ void iput(struct inode *inode) > struct super_block *sb = inode->i_sb; > struct super_operations *op = NULL; > > +again: > if (inode->i_state == I_CLEAR) > BUG(); > > @@ -1241,11 +1242,16 @@ void iput(struct inode *inode) > if (!(inode->i_state & (I_DIRTY|I_LOCK))) > __refile_inode(inode); > inodes_stat.nr_unused++; > - spin_unlock(&inode_lock); > - if (!sb || (sb->s_flags & MS_ACTIVE)) > + if (!sb || (sb->s_flags & MS_ACTIVE)) { > + spin_unlock(&inode_lock); > return; > - write_inode_now(inode, 1); > - spin_lock(&inode_lock); > + } > + if (inode->i_state & I_DIRTY) { > + __iget(inode); > + spin_unlock(&inode_lock); > + write_inode_now(inode, 1); > + goto again; > + } > inodes_stat.nr_unused--; > list_del_init(&inode->i_hash); > } > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-06-21 0:45 ` Marcelo Tosatti @ 2004-06-21 17:10 ` Trond Myklebust 2004-06-21 18:23 ` Marcelo Tosatti 2004-06-24 1:51 ` Chris Caputo 0 siblings, 2 replies; 29+ messages in thread From: Trond Myklebust @ 2004-06-21 17:10 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Chris Caputo, linux-kernel, David Woodhouse, riel On Sun, 20/06/2004 at 20:45, Marcelo Tosatti wrote: > Lets see if I get this right, while we drop the lock in iput to call > write_inode_now() an iget happens, possibly from write_inode_now itself > (sync_one->__iget) causing the inode->i_list to be added to to inode_in_use. > But then the call returns, locks inode_lock, decreases inodes_stat.nr_unused-- > and deletes the inode from the inode_in_use and adds to inode_unused. > > AFAICS its an inode with i_count==1 in the unused list, which does not > mean "list corruption", right? Am I missing something here? Yes. Please don't forget that the inode is still hashed and is not yet marked as FREEING: find_inode() can grab it on behalf of some other process as soon as we drop that spinlock inside iput(). Then we have the calls to clear_inode() + destroy_inode() just a few lines further down. ;-) If the above scenario ever does occur, it will cause random Oopses for third party processes. Since we do not see this too often, my guess is that the write_inode_now() path must be very rarely (or never?) called. > If you are indeed right all 2.4.x versions contain this bug. ...and all 2.6.x versions... I'm not saying this is the same problem that Chris is seeing, but I am failing to see how iput() is safe as it stands right now. 
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-06-21 17:10 ` Trond Myklebust @ 2004-06-21 18:23 ` Marcelo Tosatti 0 siblings, 0 replies; 29+ messages in thread From: Marcelo Tosatti @ 2004-06-21 18:23 UTC (permalink / raw) To: Trond Myklebust; +Cc: Chris Caputo, linux-kernel, David Woodhouse, riel, viro On Mon, Jun 21, 2004 at 01:10:21PM -0400, Trond Myklebust wrote: > På su , 20/06/2004 klokka 20:45, skreiv Marcelo Tosatti: > > Lets see if I get this right, while we drop the lock in iput to call > > write_inode_now() an iget happens, possibly from write_inode_now itself > > (sync_one->__iget) causing the inode->i_list to be added to to inode_in_use. > > But then the call returns, locks inode_lock, decreases inodes_stat.nr_unused-- > > and deletes the inode from the inode_in_use and adds to inode_unused. > > > > AFAICS its an inode with i_count==1 in the unused list, which does not > > mean "list corruption", right? Am I missing something here? > > Yes. Please don't forget that the inode is still hashed and is not yet > marked as FREEING: find_inode() can grab it on behalf of some other > process as soon as we drop that spinlock inside iput(). Then we have the > calls to clear_inode() + destroy_inode() just a few lines further down. > ;-) > > If the above scenario ever does occur, it will cause random Oopses for > third party processes. Since we do not see this too often, my guess is > that the write_inode_now() path must be very rarely (or never?) called. That's what I thought: that if the scenario you described really happens, we would see random oopses (processes using a deleted inode) instead of Chris's list corruption. Chris, _please_ post your full oopses. > > If you are indeed right all 2.4.x versions contain this bug. > > ...and all 2.6.x versions... > > I'm not saying this is the same problem that Chris is seeing, but I am > failing to see how iput() is safe as it stands right now. 
Please > enlighten me if I'm missing something. For me your analysis looks right and we have a problem here. I think Al Viro knows iput() very well. Maybe he should take a look at your patched. CC'ed. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-06-21 17:10 ` Trond Myklebust 2004-06-21 18:23 ` Marcelo Tosatti @ 2004-06-24 1:51 ` Chris Caputo 2004-06-25 7:47 ` Chris Caputo 1 sibling, 1 reply; 29+ messages in thread From: Chris Caputo @ 2004-06-24 1:51 UTC (permalink / raw) To: Trond Myklebust; +Cc: Marcelo Tosatti, linux-kernel, David Woodhouse, riel On Mon, 21 Jun 2004, Trond Myklebust wrote: > På su , 20/06/2004 klokka 20:45, skreiv Marcelo Tosatti: > > Lets see if I get this right, while we drop the lock in iput to call > > write_inode_now() an iget happens, possibly from write_inode_now itself > > (sync_one->__iget) causing the inode->i_list to be added to to inode_in_use. > > But then the call returns, locks inode_lock, decreases inodes_stat.nr_unused-- > > and deletes the inode from the inode_in_use and adds to inode_unused. > > > > AFAICS its an inode with i_count==1 in the unused list, which does not > > mean "list corruption", right? Am I missing something here? > > Yes. Please don't forget that the inode is still hashed and is not yet > marked as FREEING: find_inode() can grab it on behalf of some other > process as soon as we drop that spinlock inside iput(). Then we have the > calls to clear_inode() + destroy_inode() just a few lines further down. > ;-) > > If the above scenario ever does occur, it will cause random Oopses for > third party processes. Since we do not see this too often, my guess is > that the write_inode_now() path must be very rarely (or never?) called. > > > If you are indeed right all 2.4.x versions contain this bug. > > ...and all 2.6.x versions... > > I'm not saying this is the same problem that Chris is seeing, but I am > failing to see how iput() is safe as it stands right now. Please > enlighten me if I'm missing something. I think this is a different (albeit apparently valid) problem. 
In my case MS_ACTIVE (in iput() below) will be set since I am not unmounting a volume and so I believe iput() will return immediately after adding the inode to the unused list. That said, I have added your patch to my test setup in case it helps. Thanks, Chris ---- if (!list_empty(&inode->i_hash)) { if (!(inode->i_state & (I_DIRTY|I_LOCK))) __refile_inode(inode); inodes_stat.nr_unused++; spin_unlock(&inode_lock); if (!sb || (sb->s_flags & MS_ACTIVE)) return; write_inode_now(inode, 1); spin_lock(&inode_lock); inodes_stat.nr_unused--; list_del_init(&inode->i_hash); } ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-06-24 1:51 ` Chris Caputo @ 2004-06-25 7:47 ` Chris Caputo 0 siblings, 0 replies; 29+ messages in thread From: Chris Caputo @ 2004-06-25 7:47 UTC (permalink / raw) To: Trond Myklebust; +Cc: Marcelo Tosatti, linux-kernel On Wed, 23 Jun 2004, Chris Caputo wrote: > On Mon, 21 Jun 2004, Trond Myklebust wrote: > > På su , 20/06/2004 klokka 20:45, skreiv Marcelo Tosatti: > > > Lets see if I get this right, while we drop the lock in iput to call > > > write_inode_now() an iget happens, possibly from write_inode_now itself > > > (sync_one->__iget) causing the inode->i_list to be added to to inode_in_use. > > > But then the call returns, locks inode_lock, decreases inodes_stat.nr_unused-- > > > and deletes the inode from the inode_in_use and adds to inode_unused. > > > > > > AFAICS its an inode with i_count==1 in the unused list, which does not > > > mean "list corruption", right? Am I missing something here? > > > > Yes. Please don't forget that the inode is still hashed and is not yet > > marked as FREEING: find_inode() can grab it on behalf of some other > > process as soon as we drop that spinlock inside iput(). Then we have the > > calls to clear_inode() + destroy_inode() just a few lines further down. > > ;-) > > > > If the above scenario ever does occur, it will cause random Oopses for > > third party processes. Since we do not see this too often, my guess is > > that the write_inode_now() path must be very rarely (or never?) called. > > > > > If you are indeed right all 2.4.x versions contain this bug. > > > > ...and all 2.6.x versions... > > > > I'm not saying this is the same problem that Chris is seeing, but I am > > failing to see how iput() is safe as it stands right now. Please > > enlighten me if I'm missing something. > > I think this is a different (albeit apparently valid) problem. 
> In my case > MS_ACTIVE (in iput() below) will be set since I am not unmounting a volume > and so I believe iput() will return immediately after adding the inode to > the unused list. > > That said, I have added your patch to my test setup in case it helps. I was able to duplicate the problem I am seeing even with Trond's patch applied. So the patch potentially solves a different problem but not the one I am seeing. Chris ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-06-20 0:15 ` Marcelo Tosatti 2004-06-20 3:33 ` Trond Myklebust @ 2004-06-24 1:50 ` Chris Caputo 2004-06-25 8:04 ` Chris Caputo 2004-06-25 12:17 ` Marcelo Tosatti 1 sibling, 2 replies; 29+ messages in thread From: Chris Caputo @ 2004-06-24 1:50 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: linux-kernel, David Woodhouse, riel, Trond Myklebust On Sat, 19 Jun 2004, Marcelo Tosatti wrote: > On Fri, Jun 18, 2004 at 05:47:05PM -0700, Chris Caputo wrote: > > In 2.4.26 on two different dual-proc x86 machines (one dual-P4 Xeon based, > > the other dual-PIII) I am seeing crashes which are the result of the > > inode_unused doubly linked list in fs/inode.c becoming corrupted. > > What steps are required to reproduce the problem? Unfortunately I don't yet know how to (or if I can) repro this with a stock kernel. I am using 2.4.26 + Ingo's tux3-2.4.23-A3 patch in conjunction with a filesystem I wrote. (the tux module itself is not being used, just the patches to the existing kernel) > > A particular instance of the corruption I have isolated is in a call from > > iput() to __refile_inode(). To try to diagnose this further I placed list > > verification code before and after the list_del() and list_add() calls in > > __refile_inode() and observed a healthy list become corrupted after the > > del/add was completed. > > Can you show us this data in more detail? In __refile_inode() before and after the list_add()/del() calls I call a function which checks up to the first 10 items on the inode_unused list to see if next and prev pointers are valid. (inode->next->prev == inode && inode->prev->next == inode) So what I observed was a case here where iput() inline __refile_inode(): 1) checked inode_unused and saw that it was good 2) put an item on the inode_unused list 3) checked inode_unused and saw that it was now bad and that the item added was the culprit. 
This all happened within __refile_inode() with the inode_lock spinlock grabbed by iput(), and so I tend to think some other code is accessing the inode_unused list _without_ grabbing the spinlock. I've checked the inode.c code over and over, plus my filesystem code, and haven't yet found a culprit. I also checked the tux diffs to see if it was messing with inode objects in an inappropriate way. Is it safe to assume that the x86 version of atomic_dec_and_lock(), which iput() uses, is well trusted? I figure it's got to be, but doesn't hurt to ask. > > It would seem to me that list corruption on otherwise healthy machines > > would only be the result of the inode_lock spinlock not being properly > > locked prior to the call to __refile_inode(), but as far as I can tell, > > the call to atomic_dec_and_lock() in iput() is doing that properly. > > > > So I am at a loss. Has anyone else seen this or does anyone have any idea > > what routes I should be exploring to fix this problem? > > The changes between 2.4.25->2.4.26 (which introduce __refile_inode() and > the unused_pagecache list) must have something to do with this. __refile_inode() was introduced in 2.4.25. I'll try 2.4.24 to see if I can reproduce there. Marcelo, you asked for some oopses relating to the problem. Here are some. 
Thanks, Chris --- Oops: 0000 CPU: 0 EIP: 0010:[<c015b465>] Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010a93 eax: 8d12e0e2 ebx: 7b402366 ecx: c40fdf18 edx: d9c92888 esi: 7b40235e edi: 7b402366 ebp: 000007cf esp: c40fdf10 ds: 0018 es: 0018 ss: 0018 Process kswapd (pid: 7, stackpage=c40fd000) Stack: e3b85380 000004e5 e3b85388 c9812688 00000002 c25caf40 c0317fd8 00023dd2 c015b664 00000cb4 c0138f35 00000006 000001d0 ffffffff 000001d0 00000002 00000006 000001d0 c0317fd8 c0317fd8 c013936a c40fdf84 000001d0 0000003c Call Trace: [<c015b664>] [<c0138f35>] [<c013936a>] [<c01393e2>] [<c0139596>] [<c0139608>] [<c0139748>] [<c01396b0>] [<c0105000>] [<c010587e>] [<c01396b0>] Code: 8b 5b 04 8b 86 1c 01 00 00 a8 38 0f 84 5d 01 00 00 81 fb 08 >>EIP; c015b465 <prune_icache+45/220> <===== >>ecx; c40fdf18 <_end+3d3ffec/38676134> >>edx; d9c92888 <_end+198d495c/38676134> >>esp; c40fdf10 <_end+3d3ffe4/38676134> Trace; c015b664 <shrink_icache_memory+24/40> Trace; c0138f35 <shrink_cache+185/410> Trace; c013936a <shrink_caches+4a/60> Trace; c01393e2 <try_to_free_pages_zone+62/f0> Trace; c0139596 <kswapd_balance_pgdat+66/b0> Trace; c0139608 <kswapd_balance+28/40> Trace; c0139748 <kswapd+98/c0> Trace; c01396b0 <kswapd+0/c0> Trace; c0105000 <_stext+0/0> Trace; c010587e <arch_kernel_thread+2e/40> Trace; c01396b0 <kswapd+0/c0> Code; c015b465 <prune_icache+45/220> 00000000 <_EIP>: Code; c015b465 <prune_icache+45/220> <===== 0: 8b 5b 04 mov 0x4(%ebx),%ebx <===== Code; c015b468 <prune_icache+48/220> 3: 8b 86 1c 01 00 00 mov 0x11c(%esi),%eax Code; c015b46e <prune_icache+4e/220> 9: a8 38 test $0x38,%al Code; c015b470 <prune_icache+50/220> b: 0f 84 5d 01 00 00 je 16e <_EIP+0x16e> Code; c015b476 <prune_icache+56/220> 11: 81 fb 08 00 00 00 cmp $0x8,%ebx --- Oops: 0000 CPU: 0 EIP: 0010:[<c015b465>] Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010a97 eax: 830418da ebx: 5954e741 ecx: c40fdf18 edx: c0318f08 esi: 5954e739 edi: 5954e741 ebp: 000001b4 
esp: c40fdf10 ds: 0018 es: 0018 ss: 0018 Process kswapd (pid: 7, stackpage=c40fd000) Stack: f718b780 000013a1 f718b788 d04d3988 00000001 c3cfe850 c0317fd8 00023d9e c015b664 00001555 c0138f35 00000006 000001d0 ffffffff 000001d0 00000001 00000004 000001d0 c0317fd8 c0317fd8 c013936a c40fdf84 000001d0 0000003c Call Trace: [<c015b664>] [<c0138f35>] [<c013936a>] [<c01393e2>] [<c0139596>] [<c0139608>] [<c0139748>] [<c01396b0>] [<c0105000>] [<c010587e>] [<c01396b0>] Code: 8b 5b 04 8b 86 1c 01 00 00 a8 38 0f 84 5d 01 00 00 81 fb 08 >>EIP; c015b465 <prune_icache+45/220> <===== >>edx; c0318f08 <inode_unused+0/8> Trace; c015b664 <shrink_icache_memory+24/40> Trace; c0138f35 <shrink_cache+185/410> Trace; c013936a <shrink_caches+4a/60> Trace; c01393e2 <try_to_free_pages_zone+62/f0> Trace; c0139596 <kswapd_balance_pgdat+66/b0> Trace; c0139608 <kswapd_balance+28/40> Trace; c0139748 <kswapd+98/c0> Trace; c01396b0 <kswapd+0/c0> Trace; c0105000 <_stext+0/0> Trace; c010587e <arch_kernel_thread+2e/40> Trace; c01396b0 <kswapd+0/c0> Code; c015b465 <prune_icache+45/220> 00000000 <_EIP>: Code; c015b465 <prune_icache+45/220> <===== 0: 8b 5b 04 mov 0x4(%ebx),%ebx <===== Code; c015b468 <prune_icache+48/220> 3: 8b 86 1c 01 00 00 mov 0x11c(%esi),%eax Code; c015b46e <prune_icache+4e/220> 9: a8 38 test $0x38,%al Code; c015b470 <prune_icache+50/220> b: 0f 84 5d 01 00 00 je 16e <_EIP+0x16e> Code; c015b476 <prune_icache+56/220> 11: 81 fb 08 00 00 00 cmp $0x8,%ebx --- I think this one was an infinite loop which I used alt-sysrq-p to get deduce: Pid: 7, comm: kswapd EIP: 0010:[<c014dd1c>] CPU: 2 EFLAGS: 00000246 Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EAX: c4e4f824 EBX: c4e4f804 ECX: 00000006 EDX: 00000000 ESI: c4e4f80c EDI: c4e4f804 EBP: 000009ee DS: 0018 ES: 0018 Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register line ignored CR0: 8005003b CR2: bfffdfb4 CR3: 00101000 CR4: 000006d0 Call Trace: [<c0167f8e>] [<c01681d4>] [<c013d9d0>] [<c0168254>] [<c013fe69>] 
[<c01404ae>] [<c0140533>] [<c01405b2>] [<c011a6cb>] [<c0140766>] [<c01407d8>] [<c0140918>] [<c0140880>] [<c010595e>] [<c0140880>] Warning (Oops_read): Code line not seen, dumping what data is available >>EIP; c014dd1c <inode_has_buffers+6c/90> <===== >>EAX; c4e4f824 <_end+4a6b2f8/3864fb34> >>EBX; c4e4f804 <_end+4a6b2d8/3864fb34> >>ESI; c4e4f80c <_end+4a6b2e0/3864fb34> >>EDI; c4e4f804 <_end+4a6b2d8/3864fb34> Trace; c0167f8e <prune_icache+7e/320> Trace; c01681d4 <prune_icache+2c4/320> Trace; c013d9d0 <kmem_cache_shrink+70/c0> Trace; c0168254 <shrink_icache_memory+24/40> Trace; c013fe69 <shrink_cache+1d9/6d0> Trace; c01404ae <refill_inactive+14e/160> Trace; c0140533 <shrink_caches+73/90> Trace; c01405b2 <try_to_free_pages_zone+62/f0> Trace; c011a6cb <schedule+34b/5f0> Trace; c0140766 <kswapd_balance_pgdat+66/b0> Trace; c01407d8 <kswapd_balance+28/40> Trace; c0140918 <kswapd+98/c0> Trace; c0140880 <kswapd+0/c0> Trace; c010595e <arch_kernel_thread+2e/40> Trace; c0140880 <kswapd+0/c0> --- Another infinite loop (alt-sysrq-p): Pid: 7, comm: kswapd EIP: 0010:[<c0167bc8>] CPU: 3 EFLAGS: 00000206 Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EAX: 00000001 EBX: ebe7f40c ECX: 00000002 EDX: 00000000 ESI: ebe7f60c EDI: ebe7f604 EBP: c2851ee4 DS: 0018 ES: 0018 Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register line ignored CR0: 8005003b CR2: 40014000 CR3: 00101000 CR4: 000006d0 Call Trace: [<c013d7e2>] [<c0167eb7>] [<c013fd05>] [<c0140360>] [<c01403e2>] [<c0140462>] [<c0140638>] [<c01406a8>] [<c01407fa>] [<c0140760>] [<c010596e>] [<c0140760>] >>EIP; c0167bc8 <prune_icache+78/340> <===== Trace; c013d7e2 <kmem_cache_shrink+72/c0> Trace; c0167eb7 <shrink_icache_memory+27/40> Trace; c013fd05 <shrink_cache+1d5/6d0> Trace; c0140360 <refill_inactive+160/170> Trace; c01403e2 <shrink_caches+72/90> Trace; c0140462 <try_to_free_pages_zone+62/f0> Trace; c0140638 <kswapd_balance_pgdat+78/c0> Trace; c01406a8 <kswapd_balance+28/40> Trace; c01407fa 
<kswapd+9a/c0> Trace; c0140760 <kswapd+0/c0> Trace; c010596e <arch_kernel_thread+2e/40> Trace; c0140760 <kswapd+0/c0> --- Infinite loop (alt-sysrq-p): Pid: 9, comm: kupdated EIP: 0010:[<c0168df3>] CPU: 2 EFLAGS: 00000282 Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EAX: f7bea000 EBX: f7bea000 ECX: f7bebfd0 EDX: 00000000 ESI: f7bebfc8 EDI: f7bebfd8 EBP: f7bebf10 DS: 0018 ES: 0018 Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register line ignored CR0: 8005003b CR2: 40013090 CR3: 00101000 CR4: 000006d0 Call Trace: [<c011a1b2>] [<c0151333>] [<c015185e>] [<c0107a42>] [<c010596e>] [<c0151690>] >>EIP; c0168df3 <.text.lock.inode+69/256> <===== >>EAX; f7bea000 <_end+377ffb74/38649bd4> >>EBX; f7bea000 <_end+377ffb74/38649bd4> >>ECX; f7bebfd0 <_end+37801b44/38649bd4> >>ESI; f7bebfc8 <_end+37801b3c/38649bd4> >>EDI; f7bebfd8 <_end+37801b4c/38649bd4> >>EBP; f7bebf10 <_end+37801a84/38649bd4> Trace; c011a1b2 <schedule_timeout+62/b0> Trace; c0151333 <sync_old_buffers+53/160> Trace; c015185e <kupdate+1ce/230> Trace; c0107a42 <ret_from_fork+6/20> Trace; c010596e <arch_kernel_thread+2e/40> Trace; c0151690 <kupdate+0/230> --- Infinite loop (alt-sysrq-p): Pid: 7, comm: kswapd EIP: 0010:[<c0167be0>] CPU: 3 EFLAGS: 00000202 Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EAX: 00000020 EBX: dcbc120c ECX: 00000086 EDX: daae3044 ESI: dcbc120c EDI: dcbc1204 EBP: c2851ee4 DS: 0018 ES: 0018 Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register line ign ored CR0: 8005003b CR2: 0807f260 CR3: 00101000 CR4: 000006d0 Call Trace: [<c013d7e2>] [<c0167ed7>] [<c013fd05>] [<c0140360>] [<c01403e2>] [<c011a55b>] [<c0140462>] [<c0140638>] [<c01406a8>] [<c01407fa>] [<c0140760>] [<c010596e>] [<c0140760>] >>EIP; c0167be0 <prune_icache+90/360> <===== >>EBX; dcbc120c <_end+1c7d6d80/38649bd4> >>EDX; daae3044 <_end+1a6f8bb8/38649bd4> >>ESI; dcbc120c <_end+1c7d6d80/38649bd4> >>EDI; dcbc1204 <_end+1c7d6d78/38649bd4> >>EBP; c2851ee4 
<_end+2467a58/38649bd4> Trace; c013d7e2 <kmem_cache_shrink+72/c0> Trace; c0167ed7 <shrink_icache_memory+27/40> Trace; c013fd05 <shrink_cache+1d5/6d0> Trace; c0140360 <refill_inactive+160/170> Trace; c01403e2 <shrink_caches+72/90> Trace; c011a55b <schedule+34b/5f0> Trace; c0140462 <try_to_free_pages_zone+62/f0> Trace; c0140638 <kswapd_balance_pgdat+78/c0> Trace; c01406a8 <kswapd_balance+28/40> Trace; c01407fa <kswapd+9a/c0> Trace; c0140760 <kswapd+0/c0> Trace; c010596e <arch_kernel_thread+2e/40> Trace; c0140760 <kswapd+0/c0> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-06-24 1:50 ` Chris Caputo @ 2004-06-25 8:04 ` Chris Caputo 2004-06-25 10:18 ` Chris Caputo 2004-06-25 12:17 ` Marcelo Tosatti 1 sibling, 1 reply; 29+ messages in thread From: Chris Caputo @ 2004-06-25 8:04 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: linux-kernel On Wed, 23 Jun 2004, Chris Caputo wrote: > On Sat, 19 Jun 2004, Marcelo Tosatti wrote: > > Can you show us this data in more detail? > > In __refile_inode() before and after the list_add()/del() calls I call a > function which checks up to the first 10 items on the inode_unused list to > see if next and prev pointers are valid. > (inode->next->prev == inode && inode->prev->next == inode) > > So what I observed was a case here where iput() inline __refile_inode(): > > 1) checked inode_unused and saw that it was good > 2) put an item on the inode_unused list > 3) checked inode_unused and saw that it was now bad and that the item > added was the culprit. > > This all happened within __refile_inode() with the inode_lock spinlock > grabbed by iput() and so I tend to think some other code is accessing the > inode_unused list _without_ grabbing the spinlock. I've checked the > inode.c code over and over, plus my filesystem code, and haven't yet found > a culprit. I also checked the tux diffs to see if it was messing with > inode objects in an inappropriate way. > > Is it safe to assume that the x86 version of atomic_dec_and_lock(), which > iput() uses, is well trusted? I figure it's got to be, but doesn't hurt > to ask. An update on this. Line #3 above is not entirely correct in that I have not seen the item being added becoming immediately corrupt, but rather items beyond it. 
Specifically I have now seen: inode_unused->next->next->prev != inode_unused->next and: inode_unused->next->next->next->prev != inode_unused->next->next My verification function doesn't check the whole inode_unused list (it would be too slow to do so), but it may be that items are only corrupted shortly after being added to the list. I.e., someone is still using the inode shortly after it is added, when they shouldn't be. > __refile_inode() was introduced in 2.4.25. I'll try 2.4.24 to see if I > can reproduce there. No word yet on my 2.4.24 testing. (test still running without failure) I'll keep digging, Chris ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-06-25 8:04 ` Chris Caputo @ 2004-06-25 10:18 ` Chris Caputo 0 siblings, 0 replies; 29+ messages in thread From: Chris Caputo @ 2004-06-25 10:18 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: linux-kernel On Fri, 25 Jun 2004, Chris Caputo wrote: > On Wed, 23 Jun 2004, Chris Caputo wrote: > > __refile_inode() was introduced in 2.4.25. I'll try 2.4.24 to see if I > > can reproduce there. > > No word yet on my 2.4.24 testing. (test still running without failure) I have now reproduced (below) with 2.4.24 (with tux patches + my driver). In the code I believe the line doing the null deref is: entry = entry->prev; [771 of 2.4.24 fs/inode.c] Next I'll try to repro with a stock 2.4.26. Chris --- Unable to handle kernel NULL pointer dereference at virtual address 00000004 c01665f5 *pde = 00000000 Oops: 0000 CPU: 0 EIP: 0010:[<c01665f5>] Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010217 eax: 00000020 ebx: 00000000 ecx: 00000006 edx: 00000001 esi: 00000000 edi: fffffff8 ebp: c7f95f58 esp: c7f95f30 ds: 0018 es: 0018 ss: 0018 Process kswapd (pid: 5, stackpage=c7f95000) Stack: c7f95f58 c0163794 00000000 c7f95f3c c7f95f3c c7f95f64 c013d152 000001d0 0000003a 0000000a c7f95f64 c01666f7 00018057 c7f95f8c c013fdcf 00000006 000001d0 00000000 c0329a60 00000001 c0329a60 00000001 c7f94000 c7f95fa8 Call Trace: [<c0163794>] [<c013d152>] [<c01666f7>] [<c013fdcf>] [<c013ff78>] [<c013ffe8>] [<c014013a>] [<c01400a0>] [<c01059ce>] [<c01400a0>] Code: 8b 76 04 8b 87 30 01 00 00 a8 38 74 6e 81 fe 28 a9 32 c0 75 >>EIP; c01665f5 <prune_icache+65/140> <===== >>ebp; c7f95f58 <_end+7bcdd8c/3862ae94> >>esp; c7f95f30 <_end+7bcdd64/3862ae94> Trace; c0163794 <prune_dcache+194/280> Trace; c013d152 <kmem_cache_shrink+72/c0> Trace; c01666f7 <shrink_icache_memory+27/40> Trace; c013fdcf <try_to_free_pages_zone+8f/f0> Trace; c013ff78 <kswapd_balance_pgdat+78/c0> Trace; c013ffe8 <kswapd_balance+28/40> Trace; 
c014013a <kswapd+9a/c0> Trace; c01400a0 <kswapd+0/c0> Trace; c01059ce <arch_kernel_thread+2e/40> Trace; c01400a0 <kswapd+0/c0> Code; c01665f5 <prune_icache+65/140> 00000000 <_EIP>: Code; c01665f5 <prune_icache+65/140> <===== 0: 8b 76 04 mov 0x4(%esi),%esi <===== Code; c01665f8 <prune_icache+68/140> 3: 8b 87 30 01 00 00 mov 0x130(%edi),%eax Code; c01665fe <prune_icache+6e/140> 9: a8 38 test $0x38,%al Code; c0166600 <prune_icache+70/140> b: 74 6e je 7b <_EIP+0x7b> Code; c0166602 <prune_icache+72/140> d: 81 fe 28 a9 32 c0 cmp $0xc032a928,%esi Code; c0166608 <prune_icache+78/140> 13: 75 00 jne 15 <_EIP+0x15> ^ permalink raw reply [flat|nested] 29+ messages in thread
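The fault address lines up with the suspected line: a 2.4-era struct list_head is just two pointers, with prev at offset sizeof(void *) (4 on 32-bit x86), so `entry = entry->prev` through a NULL `entry` reads virtual address 0x00000004. That matches both the oops fault address and the faulting instruction `mov 0x4(%esi),%esi`. A minimal layout sketch (a hypothetical mirror of include/linux/list.h, not the kernel source itself):

```c
#include <stddef.h>

/* 2.4-era circular doubly linked list head (as in include/linux/list.h). */
struct list_head {
    struct list_head *next;   /* offset 0 */
    struct list_head *prev;   /* offset sizeof(void *): 4 on 32-bit x86 */
};

/* "entry = entry->prev" with entry == NULL reads memory at the offset of
 * prev, which on a 32-bit kernel is virtual address 0x00000004, matching
 * the oops above. */
size_t list_prev_offset(void)
{
    return offsetof(struct list_head, prev);
}
```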
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-06-24 1:50 ` Chris Caputo 2004-06-25 8:04 ` Chris Caputo @ 2004-06-25 12:17 ` Marcelo Tosatti 2004-07-02 20:00 ` Chris Caputo 1 sibling, 1 reply; 29+ messages in thread From: Marcelo Tosatti @ 2004-06-25 12:17 UTC (permalink / raw) To: Chris Caputo; +Cc: linux-kernel, David Woodhouse, riel, Trond Myklebust On Wed, Jun 23, 2004 at 06:50:48PM -0700, Chris Caputo wrote: > On Sat, 19 Jun 2004, Marcelo Tosatti wrote: > > On Fri, Jun 18, 2004 at 05:47:05PM -0700, Chris Caputo wrote: > > > In 2.4.26 on two different dual-proc x86 machines (one dual-P4 Xeon based, > > > the other dual-PIII) I am seeing crashes which are the result of the > > > inode_unused doubly linked list in fs/inode.c becoming corrupted. > > > > What steps are required to reproduce the problem? > > Unfortunately I don't yet know how to (or if I can) repro this with a > stock kernel. I am using 2.4.26 + Ingo's tux3-2.4.23-A3 patch in > conjunction with a filesystem I wrote. (the tux module itself is not > being used, just the patches to the existing kernel) > > > > A particular instance of the corruption I have isolated is in a call from > > > iput() to __refile_inode(). To try to diagnose this further I placed list > > > verification code before and after the list_del() and list_add() calls in > > > __refile_inode() and observed a healthy list become corrupted after the > > > del/add was completed. > > > > Can you show us this data in more detail? > > In __refile_inode() before and after the list_add()/del() calls I call a > function which checks up to the first 10 items on the inode_unused list to > see if next and prev pointers are valid. 
> (inode->next->prev == inode && inode->prev->next == inode) > > So what I observed was a case here where iput() inline __refile_inode(): > > 1) checked inode_unused and saw that it was good > 2) put an item on the inode_unused list > 3) checked inode_unused and saw that it was now bad and that the item > added was the culprit. > > This all happened within __refile_inode() with the inode_lock spinlock > grabbed by iput() and so I tend to think some other code is accessing the > inode_unused list _without_ grabbing the spinlock. I've checked the > inode.c code over and over, plus my filesystem code, and haven't yet found > a culprit. I also checked the tux diffs to see if it was messing with > inode objects in an inappropriate way. > > Is it safe to assume that the x86 version of atomic_dec_and_lock(), which > iput() uses, is well trusted? I figure it's got to be, but doesn't hurt > to ask. Pretty sure it is, used all over. You can try to use the non-optimized version at lib/dec_and_lock.c for a test. > > > It would seem to me that list corruption on otherwise healthy machines > > > would only be the result of the inode_lock spinlock not being properly > > > locked prior to the call to __refile_inode(), but as far as I can tell, > > > the call to atomic_dec_and_lock() in iput() is doing that properly. > > > > > > So I am at a loss. Has anyone else seen this or does anyone have any idea > > > what routes I should be exploring to fix this problem? > > > > The changes between 2.4.25->2.4.26 (which introduce __refile_inode() and > > the unused_pagecache list) must have something to do with this. > > __refile_inode() was introduced in 2.4.25. I'll try 2.4.24 to see if I > can reproduce there. > > Marcelo, you asked for some oops' relating to the problem. Here are some. Chris, thanks. Would love to see you reproducing the oops with a stock kernel with ext2/3. 
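The sanity check described above (walking the first few entries and verifying `inode->next->prev == inode && inode->prev->next == inode`) can be sketched as a standalone helper. This is a hypothetical reconstruction of that debug code, not the actual patch used:

```c
#include <stddef.h>

struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Walk up to max_entries nodes from head and verify the doubly linked
 * invariant on each; returns 1 if healthy, 0 at the first violation. */
int list_sane(const struct list_head *head, int max_entries)
{
    const struct list_head *p = head->next;
    int i;

    for (i = 0; i < max_entries && p != head; i++, p = p->next) {
        if (p->next == NULL || p->prev == NULL ||
            p->next->prev != p || p->prev->next != p)
            return 0;
    }
    return 1;
}

/* Minimal list primitives for exercising the checker. */
void list_init(struct list_head *h)
{
    h->next = h->prev = h;
}

void list_add_head(struct list_head *n, struct list_head *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}
```

Dropping such a check before and after list_del()/list_add() is exactly how a corruption window inside an otherwise locked region can be bracketed.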
> --- > > Oops: 0000 > CPU: 0 > EIP: 0010:[<c015b465>] Not tainted > Using defaults from ksymoops -t elf32-i386 -a i386 > EFLAGS: 00010a93 > eax: 8d12e0e2 ebx: 7b402366 ecx: c40fdf18 edx: d9c92888 > esi: 7b40235e edi: 7b402366 ebp: 000007cf esp: c40fdf10 > ds: 0018 es: 0018 ss: 0018 > Process kswapd (pid: 7, stackpage=c40fd000) > Stack: e3b85380 000004e5 e3b85388 c9812688 00000002 c25caf40 c0317fd8 00023dd2 > c015b664 00000cb4 c0138f35 00000006 000001d0 ffffffff 000001d0 00000002 > 00000006 000001d0 c0317fd8 c0317fd8 c013936a c40fdf84 000001d0 0000003c > Call Trace: [<c015b664>] [<c0138f35>] [<c013936a>] [<c01393e2>] [<c0139596>] > [<c0139608>] [<c0139748>] [<c01396b0>] [<c0105000>] [<c010587e>] > [<c01396b0>] > Code: 8b 5b 04 8b 86 1c 01 00 00 a8 38 0f 84 5d 01 00 00 81 fb 08 > > > >>EIP; c015b465 <prune_icache+45/220> <===== > > >>ecx; c40fdf18 <_end+3d3ffec/38676134> > >>edx; d9c92888 <_end+198d495c/38676134> > >>esp; c40fdf10 <_end+3d3ffe4/38676134> > > Trace; c015b664 <shrink_icache_memory+24/40> > Trace; c0138f35 <shrink_cache+185/410> > Trace; c013936a <shrink_caches+4a/60> > Trace; c01393e2 <try_to_free_pages_zone+62/f0> > Trace; c0139596 <kswapd_balance_pgdat+66/b0> > Trace; c0139608 <kswapd_balance+28/40> > Trace; c0139748 <kswapd+98/c0> > Trace; c01396b0 <kswapd+0/c0> > Trace; c0105000 <_stext+0/0> > Trace; c010587e <arch_kernel_thread+2e/40> > Trace; c01396b0 <kswapd+0/c0> > > Code; c015b465 <prune_icache+45/220> > 00000000 <_EIP>: > Code; c015b465 <prune_icache+45/220> <===== > 0: 8b 5b 04 mov 0x4(%ebx),%ebx <===== > Code; c015b468 <prune_icache+48/220> > 3: 8b 86 1c 01 00 00 mov 0x11c(%esi),%eax > Code; c015b46e <prune_icache+4e/220> > 9: a8 38 test $0x38,%al > Code; c015b470 <prune_icache+50/220> > b: 0f 84 5d 01 00 00 je 16e <_EIP+0x16e> > Code; c015b476 <prune_icache+56/220> > 11: 81 fb 08 00 00 00 cmp $0x8,%ebx > > --- > > Oops: 0000 > CPU: 0 > EIP: 0010:[<c015b465>] Not tainted > Using defaults from ksymoops -t elf32-i386 -a i386 > EFLAGS: 
00010a97 > eax: 830418da ebx: 5954e741 ecx: c40fdf18 edx: c0318f08 > esi: 5954e739 edi: 5954e741 ebp: 000001b4 esp: c40fdf10 > ds: 0018 es: 0018 ss: 0018 > Process kswapd (pid: 7, stackpage=c40fd000) > Stack: f718b780 000013a1 f718b788 d04d3988 00000001 c3cfe850 c0317fd8 00023d9e > c015b664 00001555 c0138f35 00000006 000001d0 ffffffff 000001d0 00000001 > 00000004 000001d0 c0317fd8 c0317fd8 c013936a c40fdf84 000001d0 0000003c > Call Trace: [<c015b664>] [<c0138f35>] [<c013936a>] [<c01393e2>] [<c0139596>] > [<c0139608>] [<c0139748>] [<c01396b0>] [<c0105000>] [<c010587e>] > [<c01396b0>] > Code: 8b 5b 04 8b 86 1c 01 00 00 a8 38 0f 84 5d 01 00 00 81 fb 08 > > > >>EIP; c015b465 <prune_icache+45/220> <===== > > >>edx; c0318f08 <inode_unused+0/8> > > Trace; c015b664 <shrink_icache_memory+24/40> > Trace; c0138f35 <shrink_cache+185/410> > Trace; c013936a <shrink_caches+4a/60> > Trace; c01393e2 <try_to_free_pages_zone+62/f0> > Trace; c0139596 <kswapd_balance_pgdat+66/b0> > Trace; c0139608 <kswapd_balance+28/40> > Trace; c0139748 <kswapd+98/c0> > Trace; c01396b0 <kswapd+0/c0> > Trace; c0105000 <_stext+0/0> > Trace; c010587e <arch_kernel_thread+2e/40> > Trace; c01396b0 <kswapd+0/c0> > > Code; c015b465 <prune_icache+45/220> > 00000000 <_EIP>: > Code; c015b465 <prune_icache+45/220> <===== > 0: 8b 5b 04 mov 0x4(%ebx),%ebx <===== > Code; c015b468 <prune_icache+48/220> > 3: 8b 86 1c 01 00 00 mov 0x11c(%esi),%eax > Code; c015b46e <prune_icache+4e/220> > 9: a8 38 test $0x38,%al > Code; c015b470 <prune_icache+50/220> > b: 0f 84 5d 01 00 00 je 16e <_EIP+0x16e> > Code; c015b476 <prune_icache+56/220> > 11: 81 fb 08 00 00 00 cmp $0x8,%ebx > > --- > > I think this one was an infinite loop which I used alt-sysrq-p to get > deduce: > > Pid: 7, comm: kswapd > EIP: 0010:[<c014dd1c>] CPU: 2 EFLAGS: 00000246 Not tainted > Using defaults from ksymoops -t elf32-i386 -a i386 > EAX: c4e4f824 EBX: c4e4f804 ECX: 00000006 EDX: 00000000 > ESI: c4e4f80c EDI: c4e4f804 EBP: 000009ee DS: 0018 ES: 0018 > 
Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register > line ignored > CR0: 8005003b CR2: bfffdfb4 CR3: 00101000 CR4: 000006d0 > Call Trace: [<c0167f8e>] [<c01681d4>] [<c013d9d0>] [<c0168254>] [<c013fe69>] > [<c01404ae>] [<c0140533>] [<c01405b2>] [<c011a6cb>] [<c0140766>] [<c01407d8>] > [<c0140918>] [<c0140880>] [<c010595e>] [<c0140880>] > Warning (Oops_read): Code line not seen, dumping what data is available > > > >>EIP; c014dd1c <inode_has_buffers+6c/90> <===== > > >>EAX; c4e4f824 <_end+4a6b2f8/3864fb34> > >>EBX; c4e4f804 <_end+4a6b2d8/3864fb34> > >>ESI; c4e4f80c <_end+4a6b2e0/3864fb34> > >>EDI; c4e4f804 <_end+4a6b2d8/3864fb34> > > Trace; c0167f8e <prune_icache+7e/320> > Trace; c01681d4 <prune_icache+2c4/320> > Trace; c013d9d0 <kmem_cache_shrink+70/c0> > Trace; c0168254 <shrink_icache_memory+24/40> > Trace; c013fe69 <shrink_cache+1d9/6d0> > Trace; c01404ae <refill_inactive+14e/160> > Trace; c0140533 <shrink_caches+73/90> > Trace; c01405b2 <try_to_free_pages_zone+62/f0> > Trace; c011a6cb <schedule+34b/5f0> > Trace; c0140766 <kswapd_balance_pgdat+66/b0> > Trace; c01407d8 <kswapd_balance+28/40> > Trace; c0140918 <kswapd+98/c0> > Trace; c0140880 <kswapd+0/c0> > Trace; c010595e <arch_kernel_thread+2e/40> > Trace; c0140880 <kswapd+0/c0> > > --- > > Another infinite loop (alt-sysrq-p): > > Pid: 7, comm: kswapd > EIP: 0010:[<c0167bc8>] CPU: 3 EFLAGS: 00000206 Not tainted > Using defaults from ksymoops -t elf32-i386 -a i386 > EAX: 00000001 EBX: ebe7f40c ECX: 00000002 EDX: 00000000 > ESI: ebe7f60c EDI: ebe7f604 EBP: c2851ee4 DS: 0018 ES: 0018 > Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register > line ignored > CR0: 8005003b CR2: 40014000 CR3: 00101000 CR4: 000006d0 > Call Trace: [<c013d7e2>] [<c0167eb7>] [<c013fd05>] [<c0140360>] [<c01403e2>] > [<c0140462>] [<c0140638>] [<c01406a8>] [<c01407fa>] [<c0140760>] [<c010596e>] > [<c0140760>] > > >>EIP; c0167bc8 <prune_icache+78/340> <===== > > Trace; c013d7e2 <kmem_cache_shrink+72/c0> > 
Trace; c0167eb7 <shrink_icache_memory+27/40> > Trace; c013fd05 <shrink_cache+1d5/6d0> > Trace; c0140360 <refill_inactive+160/170> > Trace; c01403e2 <shrink_caches+72/90> > Trace; c0140462 <try_to_free_pages_zone+62/f0> > Trace; c0140638 <kswapd_balance_pgdat+78/c0> > Trace; c01406a8 <kswapd_balance+28/40> > Trace; c01407fa <kswapd+9a/c0> > Trace; c0140760 <kswapd+0/c0> > Trace; c010596e <arch_kernel_thread+2e/40> > Trace; c0140760 <kswapd+0/c0> > > --- > > Infinite loop (alt-sysrq-p): > > Pid: 9, comm: kupdated > EIP: 0010:[<c0168df3>] CPU: 2 EFLAGS: 00000282 Not tainted > Using defaults from ksymoops -t elf32-i386 -a i386 > EAX: f7bea000 EBX: f7bea000 ECX: f7bebfd0 EDX: 00000000 > ESI: f7bebfc8 EDI: f7bebfd8 EBP: f7bebf10 DS: 0018 ES: 0018 > Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register > line ignored > CR0: 8005003b CR2: 40013090 CR3: 00101000 CR4: 000006d0 > Call Trace: [<c011a1b2>] [<c0151333>] [<c015185e>] [<c0107a42>] > [<c010596e>] [<c0151690>] > > >>EIP; c0168df3 <.text.lock.inode+69/256> <===== > > >>EAX; f7bea000 <_end+377ffb74/38649bd4> > >>EBX; f7bea000 <_end+377ffb74/38649bd4> > >>ECX; f7bebfd0 <_end+37801b44/38649bd4> > >>ESI; f7bebfc8 <_end+37801b3c/38649bd4> > >>EDI; f7bebfd8 <_end+37801b4c/38649bd4> > >>EBP; f7bebf10 <_end+37801a84/38649bd4> > > Trace; c011a1b2 <schedule_timeout+62/b0> > Trace; c0151333 <sync_old_buffers+53/160> > Trace; c015185e <kupdate+1ce/230> > Trace; c0107a42 <ret_from_fork+6/20> > Trace; c010596e <arch_kernel_thread+2e/40> > Trace; c0151690 <kupdate+0/230> > > --- > > Infinite loop (alt-sysrq-p): > > Pid: 7, comm: kswapd > EIP: 0010:[<c0167be0>] CPU: 3 EFLAGS: 00000202 Not tainted > Using defaults from ksymoops -t elf32-i386 -a i386 > EAX: 00000020 EBX: dcbc120c ECX: 00000086 EDX: daae3044 > ESI: dcbc120c EDI: dcbc1204 EBP: c2851ee4 DS: 0018 ES: 0018 > Warning (Oops_set_regs): garbage 'DS: 0018 ES: 0018' at end of register > line ign > ored > CR0: 8005003b CR2: 0807f260 CR3: 00101000 CR4: 000006d0 > 
Call Trace: [<c013d7e2>] [<c0167ed7>] [<c013fd05>] [<c0140360>] [<c01403e2>] > [<c011a55b>] [<c0140462>] [<c0140638>] [<c01406a8>] [<c01407fa>] [<c0140760>] > [<c010596e>] [<c0140760>] > > >>EIP; c0167be0 <prune_icache+90/360> <===== > > >>EBX; dcbc120c <_end+1c7d6d80/38649bd4> > >>EDX; daae3044 <_end+1a6f8bb8/38649bd4> > >>ESI; dcbc120c <_end+1c7d6d80/38649bd4> > >>EDI; dcbc1204 <_end+1c7d6d78/38649bd4> > >>EBP; c2851ee4 <_end+2467a58/38649bd4> > > Trace; c013d7e2 <kmem_cache_shrink+72/c0> > Trace; c0167ed7 <shrink_icache_memory+27/40> > Trace; c013fd05 <shrink_cache+1d5/6d0> > Trace; c0140360 <refill_inactive+160/170> > Trace; c01403e2 <shrink_caches+72/90> > Trace; c011a55b <schedule+34b/5f0> > Trace; c0140462 <try_to_free_pages_zone+62/f0> > Trace; c0140638 <kswapd_balance_pgdat+78/c0> > Trace; c01406a8 <kswapd_balance+28/40> > Trace; c01407fa <kswapd+9a/c0> > Trace; c0140760 <kswapd+0/c0> > Trace; c010596e <arch_kernel_thread+2e/40> > Trace; c0140760 <kswapd+0/c0> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-06-25 12:17 ` Marcelo Tosatti @ 2004-07-02 20:00 ` Chris Caputo 2004-07-03 5:15 ` Arjan van de Ven 0 siblings, 1 reply; 29+ messages in thread From: Chris Caputo @ 2004-07-02 20:00 UTC (permalink / raw) To: Marcelo Tosatti, Arjan van de Ven; +Cc: linux-kernel On Fri, 25 Jun 2004, Marcelo Tosatti wrote: > On Wed, Jun 23, 2004 at 06:50:48PM -0700, Chris Caputo wrote: > > Is it safe to assume that the x86 version of atomic_dec_and_lock(), which > > iput() uses, is well trusted? I figure it's got to be, but doesn't hurt > > to ask. > > Pretty sure it is, used all over. You can try to use non-optimize version > at lib/dec_and_lock.c for a test. Hi Marcelo. Just an update... I normally run irqbalance on these systems which are experiencing the inode_unused list corruption problem. I disabled irqbalance on two servers which have experienced the inode_unused list corruption problem and they have now been running without crashing for considerably longer than they had been before. The length of run time without problem is not yet conclusive, but I wanted to give you a heads-up on my progress. My current theory is that occasionally when irqbalance changes CPU affinities that the resulting set_ioapic_affinity() calls somehow cause either inter-CPU locking or cache coherency or ??? to fail. I CC'ed Arjan because this involves irqbalance, although I'd expect the fix will be needed in set_ioapic_affinity() or related kernel code. Any insights welcome. I am not sure what to do next except wait and see if these servers with irqbalance disabled continue to stay up. Chris ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-07-02 20:00 ` Chris Caputo @ 2004-07-03 5:15 ` Arjan van de Ven 2004-07-26 17:41 ` Chris Caputo 0 siblings, 1 reply; 29+ messages in thread From: Arjan van de Ven @ 2004-07-03 5:15 UTC (permalink / raw) To: Chris Caputo; +Cc: Marcelo Tosatti, linux-kernel [-- Attachment #1: Type: text/plain, Size: 851 bytes --] On Fri, Jul 02, 2004 at 01:00:19PM -0700, Chris Caputo wrote: > On Fri, 25 Jun 2004, Marcelo Tosatti wrote: > > On Wed, Jun 23, 2004 at 06:50:48PM -0700, Chris Caputo wrote: > > > Is it safe to assume that the x86 version of atomic_dec_and_lock(), which > > > iput() uses, is well trusted? I figure it's got to be, but doesn't hurt > > > to ask. > > > > Pretty sure it is, used all over. You can try to use non-optimize version > > at lib/dec_and_lock.c for a test. > > My current theory is that occasionally when irqbalance changes CPU > affinities that the resulting set_ioapic_affinity() calls somehow cause > either inter-CPU locking or cache coherency or ??? to fail. or.... some spinlock is just incorrect and having the irqbalance irqlayout unhides that.. irqbalance only balances very very rarely so I doubt it's the cause of anything... [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-07-03 5:15 ` Arjan van de Ven @ 2004-07-26 17:41 ` Chris Caputo 2004-07-27 14:19 ` Marcelo Tosatti 0 siblings, 1 reply; 29+ messages in thread From: Chris Caputo @ 2004-07-26 17:41 UTC (permalink / raw) To: Marcelo Tosatti, Arjan van de Ven; +Cc: linux-kernel On Sat, 3 Jul 2004, Arjan van de Ven wrote: > On Fri, Jul 02, 2004 at 01:00:19PM -0700, Chris Caputo wrote: > > On Fri, 25 Jun 2004, Marcelo Tosatti wrote: > > > On Wed, Jun 23, 2004 at 06:50:48PM -0700, Chris Caputo wrote: > > > > Is it safe to assume that the x86 version of atomic_dec_and_lock(), which > > > > iput() uses, is well trusted? I figure it's got to be, but doesn't hurt > > > > to ask. > > > > > > Pretty sure it is, used all over. You can try to use non-optimize version > > > at lib/dec_and_lock.c for a test. > > > > My current theory is that occasionally when irqbalance changes CPU > > affinities that the resulting set_ioapic_affinity() calls somehow cause > > either inter-CPU locking or cache coherency or ??? to fail. > > or.... some spinlock is just incorrect and having the irqbalance irqlayout > unhides that.. irqbalance only balances very very rarely so I doubt it's the > cause of anything... It has been a while since I have been able to follow up on this but I want to let you know that I _have been able_ to reproduce the problem (believed to be IRQ twiddling resulting in failed spinlock protection) with a stock kernel. I would like to come up with a more reliable way to reproduce the problem with a stock kernel (2.4.26), since it is presently very rare (less than once per week) in the way I presently get it to happen, but as yet have not done so. 
My plan of attack is to remove irqbalance from the equation and repeatedly change /proc/irq entries at random intervals, directly from one user mode program, while another user mode program does things which inspire a lot of fs/inode.c spinlock activity (since that is where I continue to see list corruption). A few questions which could help me with this: - Which IRQ (if any) is used by CPUs to coordinate inter-CPU locking? - What does it mean if a stack trace is incomplete? For example, one I have gotten is simply the tail end of the code snippet: 0b 9a 00 5d c8 And so I have wondered if the failure to produce a full stack trace indicates something in and of itself. Thanks for any assistance. I hope to find more time to work on this in the coming weeks. Chris ^ permalink raw reply [flat|nested] 29+ messages in thread
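The affinity-twiddling half of that plan could look something like the sketch below. This is a hypothetical helper: it assumes the 2.4-era /proc/irq/<n>/smp_affinity interface and its hex bitmask format, and actually writing the file requires root.

```c
#include <stdio.h>
#include <stddef.h>

/* Format the hex CPU mask that /proc/irq/<n>/smp_affinity expects for a
 * single CPU: cpu 0 -> "1", cpu 3 -> "8", and so on. */
void cpu_mask(int cpu, char *buf, size_t len)
{
    snprintf(buf, len, "%x", 1u << cpu);
}

/* Pin one IRQ to one CPU by writing its smp_affinity file.  Needs root,
 * and assumes the 2.4-era /proc/irq layout.  Returns 0 on success. */
int set_irq_affinity(int irq, int cpu)
{
    char path[64], mask[16];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    cpu_mask(cpu, mask, sizeof(mask));
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%s\n", mask);
    fclose(f);
    return 0;
}
```

A stress loop would then call set_irq_affinity() with random IRQ/CPU pairs at random intervals while the inode-churning workload runs.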
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-07-26 17:41 ` Chris Caputo @ 2004-07-27 14:19 ` Marcelo Tosatti 2004-07-27 16:08 ` Chris Caputo 0 siblings, 1 reply; 29+ messages in thread From: Marcelo Tosatti @ 2004-07-27 14:19 UTC (permalink / raw) To: Chris Caputo; +Cc: Arjan van de Ven, linux-kernel On Mon, Jul 26, 2004 at 10:41:24AM -0700, Chris Caputo wrote: > On Sat, 3 Jul 2004, Arjan van de Ven wrote: > > On Fri, Jul 02, 2004 at 01:00:19PM -0700, Chris Caputo wrote: > > > On Fri, 25 Jun 2004, Marcelo Tosatti wrote: > > > > On Wed, Jun 23, 2004 at 06:50:48PM -0700, Chris Caputo wrote: > > > > > Is it safe to assume that the x86 version of atomic_dec_and_lock(), which > > > > > iput() uses, is well trusted? I figure it's got to be, but doesn't hurt > > > > > to ask. > > > > > > > > Pretty sure it is, used all over. You can try to use non-optimize version > > > > at lib/dec_and_lock.c for a test. > > > > > > My current theory is that occasionally when irqbalance changes CPU > > > affinities that the resulting set_ioapic_affinity() calls somehow cause > > > either inter-CPU locking or cache coherency or ??? to fail. > > > > or.... some spinlock is just incorrect and having the irqbalance irqlayout > > unhides that.. irqbalance only balances very very rarely so I doubt it's the > > cause of anything... Hi Chris, > It has been a while since I have been able to follow up on this but I want > to let you know that I _have been able_ to reproduce the problem (believed > to be IRQ twiddling resulting in failed spinlock protection) with a stock > kernel. Well, no manipulation of the inode lists is done under IRQ context. What you think might be happening is that an IRQ comes in __refile_inode() (and other paths) and that causes a problem? That's perfectly fine. 
> I would like to come up with a more reliable way to reproduce the problem > with a stock kernel (2.4.26), since it is presently very rare (less than > once per week) in the way I presently get it to happen, but as yet have > not done so. What is your workload? I'm more than willing to use the SMP boxes I have access to try to reproduce this. You said you also reproduced the same inode_unused corruption with 2.4.24, yes? > My plan of attack is to remove irqbalance from the equation and repeatedly > change with random intervals /proc/irq entries directly from one user mode > program while another user mode program does things which inspire a lot of > fs/inode.c spinlock activity (since that is where I continue to see list > corruption). > > A few questions which could help me with this: > > - Which IRQ (if any) is used by CPU's to coordinate inter-CPU locking? None as far as spinlocks are concerned. On x86 spinlock just does "lock ; dec $x" operation which guarantees the atomicity of the "dec". I feel I'm not answering your question, still. What do you mean? > - What does it mean if a stack trace is incomplete? For example, one I > have gotten is simply the tail end of the code snippet: > > 0b 9a 00 5d c8 > > And so I have wondered if the failure to make a full stack trace > indicates something in of itself. Dont know the answer. I also usually see incomplete stack traces that I can make no sense of. > Thanks for any assistance. I hope to find more time to work on this in > the coming weeks. I'm willing to help. Just please do not mix report data which includes your own filesystem/modified kernel. Lets work with stock kernels and only that, to make it as simple as possible. Thanks! ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem? 2004-07-27 14:19 ` Marcelo Tosatti @ 2004-07-27 16:08 ` Chris Caputo 2004-07-29 0:25 ` Marcelo Tosatti 0 siblings, 1 reply; 29+ messages in thread From: Chris Caputo @ 2004-07-27 16:08 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Arjan van de Ven, linux-kernel [-- Attachment #1: Type: TEXT/PLAIN, Size: 5886 bytes --] On Tue, 27 Jul 2004, Marcelo Tosatti wrote: > On Mon, Jul 26, 2004 at 10:41:24AM -0700, Chris Caputo wrote: > > On Sat, 3 Jul 2004, Arjan van de Ven wrote: > > > On Fri, Jul 02, 2004 at 01:00:19PM -0700, Chris Caputo wrote: > > > > On Fri, 25 Jun 2004, Marcelo Tosatti wrote: > > > > > On Wed, Jun 23, 2004 at 06:50:48PM -0700, Chris Caputo wrote: > > > > > > Is it safe to assume that the x86 version of atomic_dec_and_lock(), which > > > > > > iput() uses, is well trusted? I figure it's got to be, but doesn't hurt > > > > > > to ask. > > > > > > > > > > Pretty sure it is, used all over. You can try to use non-optimize version > > > > > at lib/dec_and_lock.c for a test. > > > > > > > > My current theory is that occasionally when irqbalance changes CPU > > > > affinities that the resulting set_ioapic_affinity() calls somehow cause > > > > either inter-CPU locking or cache coherency or ??? to fail. > > > > > > or.... some spinlock is just incorrect and having the irqbalance irqlayout > > > unhides that.. irqbalance only balances very very rarely so I doubt it's the > > > cause of anything... > > Hi Chris, > > > It has been a while since I have been able to follow up on this but I want > > to let you know that I _have been able_ to reproduce the problem (believed > > to be IRQ twiddling resulting in failed spinlock protection) with a stock > > kernel. > > Well, no manipulation of the inode lists are done under IRQ context. > > What you think might be happening is that an IRQ comes in __refile_inode() > (and other paths) and that causes a problem? > > Thats perfectly fine. 
Again, no manipulation of the lists are done in > IRQ context. I may have phrased the question poorly. I am trying to get at whether the x86 "lock ; dec $x" operation can somehow be affected by changes to IRQ affinity, since that is what appears to happen. I assume SMP inter-cpu locking doesn't generate IRQs at the linux kernel level (right?), but there seems to be something at a low level which gets affected by changes to which CPU handles which IRQs. > > I would like to come up with a more reliable way to reproduce the problem > > with a stock kernel (2.4.26), since it is presently very rare (less than > > once per week) in the way I presently get it to happen, but as yet have > > not done so. > > What is your workload? I'm more than willing to use the SMP boxes I have > access to try to reproduce this. Here is the basic repro so far: -1) SMP x86 machine (PIII or PIV), 4 gigs of RAM, kernel 2.4.26. Here are some possibly relevant configs: CONFIG_X86=y CONFIG_X86_WP_WORKS_OK=y CONFIG_X86_INVLPG=y CONFIG_X86_CMPXCHG=y CONFIG_X86_XADD=y CONFIG_X86_BSWAP=y CONFIG_X86_POPAD_OK=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_X86_HAS_TSC=y CONFIG_X86_GOOD_APIC=y CONFIG_X86_PGE=y CONFIG_X86_USE_PPRO_CHECKSUM=y CONFIG_X86_F00F_WORKS_OK=y CONFIG_X86_MCE=y CONFIG_HIGHMEM4G=y CONFIG_HIGHMEM=y CONFIG_HIGHIO=y CONFIG_MTRR=y CONFIG_SMP=y CONFIG_X86_TSC=y CONFIG_HAVE_DEC_LOCK=y CONFIG_X86_IO_APIC=y CONFIG_X86_LOCAL_APIC=y 0) Start irqbalance. 1) Have 10+ gigs of space available for testing preferably on two separate filesystems, mounted on say /test_0 and /test_1. 2) As root do a "find /" to fill up the dcache/inode caches with a bunch of data. 3) Compile the attached writer2.c: "gcc -O2 -o writer2 writer2.c" 4) Run "writer2 /test_0 0" and "writer2 /test_1 0" in the background. 5) Run "emerge sync" a few times per day. 6) Wait for crash. 
Step 5 is unique to Gentoo, so it would be nice to have another test program which does the equivalent in terms of opening/created/deleting a bunch of times. Intent to be inspiring that a bunch of items are added to and removed from the inode_unused list regularly. > You said you also reproduced the same inode_unused corruption with > 2.4.24, yes? I have reproduced the same symptoms (inode_unused list corruption) on 2.4.24 with a modified kernel (tux patches plus my filesystem) but I have not tried to do so with a stock 2.4.24. So far only stock kernel I have tried has been 2.4.26. > > My plan of attack is to remove irqbalance from the equation and repeatedly > > change with random intervals /proc/irq entries directly from one user mode > > program while another user mode program does things which inspire a lot of > > fs/inode.c spinlock activity (since that is where I continue to see list > > corruption). > > > > A few questions which could help me with this: > > > > - Which IRQ (if any) is used by CPU's to coordinate inter-CPU locking? > > None as far as spinlocks are concerned. On x86 spinlock just does "lock > ; dec $x" operation which guarantees the atomicity of the "dec". > > I feel I'm not answering your question, still. What do you mean? What I seem to be seeing is that changes in IRQ affinity are at some level screwing up the guaranteed atomicity of the "lock ; dec $x". I realize that might be crazy to think, but maybe it is something known about the x86 spec. The two CPU's I have seen this with, by the way, are a PIV-Xeon (CPUID F29) and a PIII (CPUID 06B1), so I don't imagine a microcode-type bug unique to a certain CPU release. > > - What does it mean if a stack trace is incomplete? For example, one I > > have gotten is simply the tail end of the code snippet: > > > > 0b 9a 00 5d c8 > > > > And so I have wondered if the failure to make a full stack trace > > indicates something in of itself. > > Dont know the answer. 
I also usually see incomplete stack traces > that I can make no sense of. > > > Thanks for any assistance. I hope to find more time to work on this in > > the coming weeks. > > I'm willing to help. Just please do not mix report data which includes > your own filesystem/modified kernel. Lets work with stock kernels and > only that, to make it as simple as possible. Agreed. Chris [-- Attachment #2: Type: TEXT/PLAIN, Size: 5560 bytes --] #define _LARGEFILE64_SOURCE #define _FILE_OFFSET_BITS 64 #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <fcntl.h> #include <errno.h> #include <sys/time.h> #include <sys/stat.h> #include <sys/types.h> #include <sys/statvfs.h> #include <sys/resource.h> #include <signal.h> #define NUM_DIFF_START_LOCS 10000 #define MAX_WRITE_LEN (10 * 1048576) // 10 megs #define NUM_FILES 1 #define FILE_SIZE (10 * 1024 * 1048576LL) // 10 gigs typedef signed long long s64; int g_fSigTermReceived = 0; static void sig_term(int signo) { g_fSigTermReceived = 1; } int write_wrapper(int fd, void *buf, int len) { int total = 0; int subtotal = 0; char *targetBuf = buf; while (total < len) { subtotal = write(fd, targetBuf, len - total); if (0 > subtotal) { return -1; } else { total += subtotal; targetBuf += subtotal; } } return total; } int main(int argc, char *argv[]) { unsigned int uIterationCount, uCurIteration; unsigned int uLength; char *srcBuffer; struct timeval tvStart, tvEnd; s64 s64_usec; char szDirectory[256]; char szFilename[256]; int fds[NUM_FILES], fd; unsigned int i; off_t lseekRes; // // Usage. // if (2 != argc && 3 != argc) { fprintf(stderr, "Usage: %s <test dir> [# of iterations]\n", argv[0]); fprintf(stderr, " -if iteration count is not specified, defaults to 1. If 0\n"); fprintf(stderr, " it means run for a long long time (2^32 - 1).\n"); exit(1); } // // Consume parameters. 
// argc--; // eat program name strncpy(szDirectory, argv[1], sizeof(szDirectory)); argc--; // eat test directory if (0 < argc) { uIterationCount = strtoul(argv[2], NULL, 10); } else { uIterationCount = 1; } // make it so a supplied iteration count of 0 runs for a long long time. if (0 == uIterationCount) { uIterationCount--; } // seed random() srandom(getpid()); // allocate space for the srcBuffer // add long int extra so we can do a rapid random fill below srcBuffer = malloc(MAX_WRITE_LEN + NUM_DIFF_START_LOCS + sizeof(long int)); if (NULL == srcBuffer) { fprintf(stderr, "%s: was unable to malloc %u bytes. Exiting.\n", argv[0], MAX_WRITE_LEN + NUM_DIFF_START_LOCS + sizeof(long int)); exit(1); } // fill srcBuffer with random data { long int *pli; long int *end; int len = (MAX_WRITE_LEN + NUM_DIFF_START_LOCS + sizeof(long int)) / sizeof(long int); end = (long int *)srcBuffer + len; for (pli = (long int *)srcBuffer; pli < end; pli++) { *pli = random(); } } // create the files for (i = 0; i < NUM_FILES; i++) { if (0 > snprintf(szFilename, sizeof(szFilename), "%s/%u", szDirectory, i)) { fprintf(stderr, "snprintf(szFilename) failed\n"); exit(1); } fds[i] = open(szFilename, O_CREAT | O_WRONLY | O_LARGEFILE, 0666); if (-1 == fds[i]) { perror("open"); goto exit; } if (-1 == lseek(fds[i], FILE_SIZE - 1, SEEK_SET)) { perror("lseek"); goto exit; } if (-1 == write(fds[i], "x", 1)) { perror("write"); goto exit; } if (-1 == lseek(fds[i], 0, SEEK_SET)) // set back to start { perror("lseek"); goto exit; } } signal(SIGINT, sig_term); // // Main loop. 
// for (uCurIteration = 1; uCurIteration <= uIterationCount && 0 == g_fSigTermReceived; uCurIteration++) { printf("%u:W:", uCurIteration); uLength = random() % MAX_WRITE_LEN; printf("len = %u", uLength); fd = fds[random() % NUM_FILES]; lseekRes = lseek(fd, 0, SEEK_CUR); if (-1 == lseekRes) { perror("lseek"); goto exit; } if (lseekRes + uLength >= FILE_SIZE) { if (-1 == lseek(fd, 0, SEEK_SET)) // set back to start { perror("lseek"); goto exit; } } gettimeofday(&tvStart, NULL); // Do networking. if (-1 == write_wrapper(fd, srcBuffer + random() % NUM_DIFF_START_LOCS, uLength)) { perror("write_wrapper"); goto exit; } gettimeofday(&tvEnd, NULL); s64_usec = (s64)tvEnd.tv_sec * 1000000 + tvEnd.tv_usec; s64_usec -= (s64)tvStart.tv_sec * 1000000 + tvStart.tv_usec; printf(" %Ld usecs %.2f mbits/sec", s64_usec, 7.62939453125 * (float)uLength / (float)s64_usec); printf("\n"); fflush(stdout); usleep(random() % 100000); } exit: free(srcBuffer); if (1 == g_fSigTermReceived) { printf("SIGINT received!\n"); } #if 0 printf("cleaning up %u files...\n", NUM_FILES); fflush(stdout); // close/unlink the files for (i = 0; i < NUM_FILES; i++) { printf("%u ", i); fflush(stdout); if (0 > snprintf(szFilename, sizeof(szFilename), "%s/%u", szDirectory, i)) { fprintf(stderr, "snprintf(szFilename) failed\n"); exit(1); } close(fds[i]); unlink(szFilename); } printf("\n"); #endif exit(0); } ^ permalink raw reply [flat|nested] 29+ messages in thread
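As a distribution-neutral stand-in for the "emerge sync" step (step 5 in the repro above), a loop that creates, writes, and unlinks many small files would drive the same add/remove churn on the inode_unused list. A hypothetical helper, not part of the original repro:

```c
#include <stdio.h>
#include <unistd.h>

/* Create, write, and unlink `count` small files under `dir`, so inodes
 * keep cycling on and off the unused list.  Names embed the pid so
 * parallel instances do not collide.  Returns 0 on success, -1 on the
 * first failure. */
int churn_inodes(const char *dir, int count)
{
    char path[512];
    int i;

    for (i = 0; i < count; i++) {
        FILE *f;

        snprintf(path, sizeof(path), "%s/churn.%d.%d",
                 dir, (int)getpid(), i);
        f = fopen(path, "w");
        if (f == NULL)
            return -1;
        fputs("x", f);
        fclose(f);
        if (unlink(path) != 0)
            return -1;
    }
    return 0;
}
```

Running a loop like this alongside the writer2 instances should keep entries moving through the unused list on any distribution, not just Gentoo.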
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-27 16:08 ` Chris Caputo
@ 2004-07-29  0:25 ` Marcelo Tosatti
  2004-07-29  6:27   ` Chris Caputo
  0 siblings, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2004-07-29 0:25 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Arjan van de Ven, linux-kernel

On Tue, Jul 27, 2004 at 09:08:57AM -0700, Chris Caputo wrote:
> I may have phrased the question poorly. I am trying to get at whether the
> x86 "lock ; dec $x" operation can somehow be affected by changes to IRQ
> affinity, since that is what appears to happen. I assume SMP inter-cpu
> locking doesn't generate IRQs at the linux kernel level (right?), but
> there seems to be something at a low level which gets affected by changes
> to which CPU handles which IRQs.
>
> > > I would like to come up with a more reliable way to reproduce the problem
> > > with a stock kernel (2.4.26), since it is presently very rare (less than
> > > once per week) in the way I presently get it to happen, but as yet have
> > > not done so.
> >
> > What is your workload? I'm more than willing to use the SMP boxes I have
> > access to try to reproduce this.
>
> Here is the basic repro so far:
>
> -1) SMP x86 machine (PIII or PIV), 4 gigs of RAM, kernel 2.4.26. Here are
>     some possibly relevant configs:
>
>     CONFIG_X86=y
>     CONFIG_X86_WP_WORKS_OK=y
>     CONFIG_X86_INVLPG=y
>     CONFIG_X86_CMPXCHG=y
>     CONFIG_X86_XADD=y
>     CONFIG_X86_BSWAP=y
>     CONFIG_X86_POPAD_OK=y
>     CONFIG_RWSEM_XCHGADD_ALGORITHM=y
>     CONFIG_X86_HAS_TSC=y
>     CONFIG_X86_GOOD_APIC=y
>     CONFIG_X86_PGE=y
>     CONFIG_X86_USE_PPRO_CHECKSUM=y
>     CONFIG_X86_F00F_WORKS_OK=y
>     CONFIG_X86_MCE=y
>     CONFIG_HIGHMEM4G=y
>     CONFIG_HIGHMEM=y
>     CONFIG_HIGHIO=y
>     CONFIG_MTRR=y
>     CONFIG_SMP=y
>     CONFIG_X86_TSC=y
>     CONFIG_HAVE_DEC_LOCK=y
>     CONFIG_X86_IO_APIC=y
>     CONFIG_X86_LOCAL_APIC=y
>
> 0) Start irqbalance.
> 1) Have 10+ gigs of space available for testing, preferably on two separate
>    filesystems, mounted on say /test_0 and /test_1.
> 2) As root do a "find /" to fill up the dcache/inode caches with a bunch
>    of data.
> 3) Compile the attached writer2.c: "gcc -O2 -o writer2 writer2.c"
> 4) Run "writer2 /test_0 0" and "writer2 /test_1 0" in the background.
> 5) Run "emerge sync" a few times per day.
> 6) Wait for crash.

OK, I have started a testcase on an OSDL 8-way Pentium III with 16GB RAM,
running stock 2.4.26 with ext2 and ext3 filesystems: three "writer2"
instances on three different partitions, plus a "dbench 2" (which
guarantees high inode cache activity).

Let's wait a week or so and see what has happened ;)

> Step 5 is unique to Gentoo, so it would be nice to have another test
> program which does the equivalent in terms of opening/creating/deleting a
> bunch of files. The intent is to ensure that a bunch of items are added to
> and removed from the inode_unused list regularly.
>
> > You said you also reproduced the same inode_unused corruption with
> > 2.4.24, yes?
>
> I have reproduced the same symptoms (inode_unused list corruption) on
> 2.4.24 with a modified kernel (tux patches plus my filesystem), but I have
> not tried to do so with a stock 2.4.24. So far the only stock kernel I
> have tried has been 2.4.26.
>
> > > My plan of attack is to remove irqbalance from the equation and repeatedly
> > > change, at random intervals, /proc/irq entries directly from one user mode
> > > program while another user mode program does things which inspire a lot of
> > > fs/inode.c spinlock activity (since that is where I continue to see list
> > > corruption).
> > >
> > > A few questions which could help me with this:
> > >
> > > - Which IRQ (if any) is used by CPUs to coordinate inter-CPU locking?
> >
> > None as far as spinlocks are concerned. On x86 a spinlock just does a
> > "lock ; dec $x" operation, which guarantees the atomicity of the "dec".
> >
> > I feel I'm not answering your question, still. What do you mean?

> What I seem to be seeing is that changes in IRQ affinity are at some level
> screwing up the guaranteed atomicity of the "lock ; dec $x". I realize
> that might be crazy to think, but maybe it is something known about the
> x86 spec. The two CPUs I have seen this with, by the way, are a PIV-Xeon
> (CPUID F29) and a PIII (CPUID 06B1), so I don't imagine a microcode-type
> bug unique to a certain CPU release.

Changing the affinity writes new values to the IOAPIC registers; I can't
see how that could interfere with the atomicity of a spinlock operation. I
don't understand why you think irqbalance could affect anything.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29  0:25 ` Marcelo Tosatti
@ 2004-07-29  6:27 ` Chris Caputo
  2004-07-29  7:54   ` Arjan van de Ven
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Caputo @ 2004-07-29 6:27 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Arjan van de Ven, linux-kernel

On Wed, 28 Jul 2004, Marcelo Tosatti wrote:
> Changing the affinity writes new values to the IOAPIC registers, I can't see
> how that could interfere with the atomicity of a spinlock operation. I dont
> understand why you think irqbalance could affect anything.

Because when I stop running irqbalance the crashes no longer happen.

Chris

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29  6:27 ` Chris Caputo
@ 2004-07-29  7:54 ` Arjan van de Ven
  2004-07-29 10:57   ` Marcelo Tosatti
  0 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2004-07-29 7:54 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Marcelo Tosatti, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 584 bytes --]

On Wed, Jul 28, 2004 at 11:27:41PM -0700, Chris Caputo wrote:
> On Wed, 28 Jul 2004, Marcelo Tosatti wrote:
> > Changing the affinity writes new values to the IOAPIC registers, I can't see
> > how that could interfere with the atomicity of a spinlock operation. I dont
> > understand why you think irqbalance could affect anything.
>
> Because when I stop running irqbalance the crashes no longer happen.

what is the irq distribution when you do that?
Can you run irqbalance for a bit to make sure there's a static distribution
of irq's and then disable it and see if it survives?

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29  7:54 ` Arjan van de Ven
@ 2004-07-29 10:57 ` Marcelo Tosatti
  2004-07-29 12:21   ` Arjan van de Ven
  0 siblings, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2004-07-29 10:57 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Chris Caputo, linux-kernel

On Thu, Jul 29, 2004 at 09:54:29AM +0200, Arjan van de Ven wrote:
> On Wed, Jul 28, 2004 at 11:27:41PM -0700, Chris Caputo wrote:
> > On Wed, 28 Jul 2004, Marcelo Tosatti wrote:
> > > Changing the affinity writes new values to the IOAPIC registers, I can't see
> > > how that could interfere with the atomicity of a spinlock operation. I dont
> > > understand why you think irqbalance could affect anything.
> >
> > Because when I stop running irqbalance the crashes no longer happen.
>
> what is the irq distribution when you do that?
> Can you run irqbalance for a bit to make sure there's a static distribution
> of irq's and then disable it and see if it survives ?

Chris: yes, I'm also running irqbalance.

Arjan, what is an easy way for me to make irqbalance change the affinity
like crazy on the SMP 8-way box, just for a test?

TIA

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29 10:57 ` Marcelo Tosatti
@ 2004-07-29 12:21 ` Arjan van de Ven
  2004-07-29 16:22   ` Chris Caputo
  0 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2004-07-29 12:21 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Chris Caputo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1079 bytes --]

On Thu, Jul 29, 2004 at 07:57:55AM -0300, Marcelo Tosatti wrote:
> On Thu, Jul 29, 2004 at 09:54:29AM +0200, Arjan van de Ven wrote:
> > On Wed, Jul 28, 2004 at 11:27:41PM -0700, Chris Caputo wrote:
> > > On Wed, 28 Jul 2004, Marcelo Tosatti wrote:
> > > > Changing the affinity writes new values to the IOAPIC registers, I can't see
> > > > how that could interfere with the atomicity of a spinlock operation. I dont
> > > > understand why you think irqbalance could affect anything.
> > >
> > > Because when I stop running irqbalance the crashes no longer happen.
> >
> > what is the irq distribution when you do that?
> > Can you run irqbalance for a bit to make sure there's a static distribution
> > of irq's and then disable it and see if it survives ?
>
> Chris, Yes I'm also running irqbalance.
>
> Arjan, what is an easy way for me to make irqbalance change the affinity
> as crazy on the SMP 8way box, just for a test?

there is a sleep(10 seconds) in the code; if you change that to something
really short and then cause irq burst rates on different devices...

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29 12:21 ` Arjan van de Ven
@ 2004-07-29 16:22 ` Chris Caputo
  2004-07-29 16:23   ` Arjan van de Ven
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Caputo @ 2004-07-29 16:22 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Marcelo Tosatti, linux-kernel

On Thu, 29 Jul 2004, Arjan van de Ven wrote:
> On Thu, Jul 29, 2004 at 07:57:55AM -0300, Marcelo Tosatti wrote:
> > Arjan, what is an easy way for me to make irqbalance change the affinity
> > as crazy on the SMP 8way box, just for a test?
>
> there is a sleep(10 seconds) in the code, if you change that to something
> really short and then cause irq burst rates on different devices...

I messed with that at one point and didn't see a noticeable improvement in
getting the crash to happen. I'd recommend not going lower than sleep(1),
since that still gives irqbalance some time to amass stats prior to the
next iteration.

On Thu, 29 Jul 2004, Arjan van de Ven wrote:
> what is the irq distribution when you do that?

You mean what does "cat /proc/interrupts" show? One thing I could do is
have a "watch -n 1 cat /proc/interrupts" running in one screen while the
tests are running, and then when the crash happens copy/paste the results.
Would that be useful?

> Can you run irqbalance for a bit to make sure there's a static
> distribution of irq's and then disable it and see if it survives ?

I've had mixed results with this. In theory the list corruption can happen
while irqbalance is running, and then, after stopping it, the crash can
happen days later when a corrupted list item is finally accessed.

Should I try the run-once/oneshot feature of irqbalance on boot, and then
leave it off and see if anything happens?

Thanks,
Chris

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29 16:22 ` Chris Caputo
@ 2004-07-29 16:23 ` Arjan van de Ven
  2004-08-04 15:38   ` Chris Caputo
  0 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2004-07-29 16:23 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Marcelo Tosatti, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 325 bytes --]

On Thu, Jul 29, 2004 at 09:22:40AM -0700, Chris Caputo wrote:
>
> Should I try the run-once/oneshot feature of irqbalance on boot, and then
> leave it off and see if anything happens?

that sounds like a good test

it really smells that a certain distribution of irq's is causing you grief
more than the actual changing...

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-07-29 16:23 ` Arjan van de Ven
@ 2004-08-04 15:38 ` Chris Caputo
  2004-08-04 18:47   ` Chris Caputo
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Caputo @ 2004-08-04 15:38 UTC (permalink / raw)
  To: Marcelo Tosatti, Arjan van de Ven; +Cc: linux-kernel

On Thu, 29 Jul 2004, Arjan van de Ven wrote:
> On Thu, Jul 29, 2004 at 09:22:40AM -0700, Chris Caputo wrote:
> > Should I try the run-once/oneshot feature of irqbalance on boot, and then
> > leave it off and see if anything happens?
>
> that sounds like a good test
>
> it really smells that a certain distribution of irq's is causing you grief
> more than the actual changing...

Here is my latest repro. It turns out writer2 is not needed, Marcelo, so
feel free to skip those runs. I'll now try to get a crash with just a
one-shot irqbalance. The below was with the unmodified 0.08 irqbalance
(affinity changes possible once every 10 seconds).

Arjan, the IRQ affinities and /proc/interrupts output near the time of the
crash are at the bottom of this email. It is unknown whether these were
the exact affinities when the list corruption happened, since the watch
command was only running once per second.

One thing to note below is that even though dbench was doing a ton of
filesystem activity at the time of the crash, the process which invoked
the system call which resulted in the iput() was irqbalance. I don't
always see irqbalance in the call trace when these crashes happen, but it
is pretty common, about 75% of the time.

Repro method:

-1) 2.4.26 kernel
 0) start irqbalance
 1) watch -n 1 'cat /proc/sys/fs/inode-nr ; head -n 11 /proc/interrupts ; head /proc/irq/*/*'
 2) run loop_dbench, which is the following dbench script which uses
    client_plain.txt:

    #!/bin/sh

    while [ 1 ]
    do
        date
        dbench 2
    done

 3) Wait.

Results:

  started 1:04am on Sunday August 1st
  crashed 4:28am on Tuesday August 3rd

(If anyone else has spare SMP machines they can try the above with on
2.4.26, that would be great.)

---

Here is the crash dump. The ALTNET lines are from extra code in inode.c
which scans the inode_unused list up to 10 items in each direction to
determine if the list is valid. It is called before and after
list_del()/list_add() pairs in inode.c. (If you would like the patch for
this, let me know.)

The i=0 below indicates this is the first item tested off from the head of
the inode_unused list. Basically inode_unused->next points to an item
whose next and prev pointers are NULL.

AAAAAAAAAAAAAAAAAAAAAA
ALTNET(forward): cur->p == NULL, i=0, inode=0xecc83804, i_size=0
ALTNET: cur=0xecc8380c, cur->next=0x00000000, cur->prev=0x00000000
ALTNET: i_hash=0xecc83804, i_hash->next=0xf7de8188,
        i_hash->next->prev=0xecc83804, i_hash->prev=0xf7de8188,
        i_hash->prev->next=0xecc83804
ALTNET: inode_lock.lock = 0
kernel BUG at inode.c:154!
invalid operand: 0000
CPU:    0
EIP:    0010:[<c016657f>]    Not tainted
EFLAGS: 00010282
eax: 0000001c   ebx: ecc8380c   ecx: c03393f8   edx: 0000001f
esi: ecc83804   edi: 00000000   ebp: e72abf14   esp: e72abef0
ds: 0018   es: 0018   ss: 0018
Process irqbalance (pid: 9177, stackpage=e72ab000)
Stack: c02f00a1 00000000 f7de8188 ecc83804 f7de8188 ecc83804 ecc8380c ecc83804
       c3f7b800 e72abf44 c01696de 00000002 c033b980 e72abf4c ecc83830 c033bd60
       40015000 00000216 c625d3ec ecc83804 ecc83804 e72abf5c c0163e91 ecc83804
Call Trace: [<c01696de>] [<c0163e91>] [<c014c957>] [<c0177db7>] [<c014a66f>]
   [<c014a748>] [<c0107a9f>]
Code: 0f 0b 9a 00 99 00 2f c0 8b 43 04 e9 ac fa ff ff c7 04 24 67

---

>>EIP; c016657f <ALTNET_verify_unused_list+5ef/620>   <=====
>>ebx; ecc8380c <_end+2c8a9540/38618d94>
>>ecx; c03393f8 <console_sem+0/18>
>>esi; ecc83804 <_end+2c8a9538/38618d94>
>>ebp; e72abf14 <_end+26ed1c48/38618d94>
>>esp; e72abef0 <_end+26ed1c24/38618d94>
Trace; c01696de <iput+36e/4d0>
Trace; c0163e91 <dput+f1/230>
Trace; c014c957 <fput+137/190>
Trace; c0177db7 <proc_file_write+37/50>
Trace; c014a66f <filp_close+bf/120>
Trace; c014a748 <sys_close+78/90>
Trace; c0107a9f <system_call+33/38>
Code;  c016657f <ALTNET_verify_unused_list+5ef/620>
00000000 <_EIP>:
Code;  c016657f <ALTNET_verify_unused_list+5ef/620>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  c0166581 <ALTNET_verify_unused_list+5f1/620>
   2:   9a 00 99 00 2f c0 8b      lcall  $0x8bc0,$0x2f009900
Code;  c0166588 <ALTNET_verify_unused_list+5f8/620>
   9:   43                        inc    %ebx
Code;  c0166589 <ALTNET_verify_unused_list+5f9/620>
   a:   04 e9                     add    $0xe9,%al
Code;  c016658b <ALTNET_verify_unused_list+5fb/620>
   c:   ac                        lods   %ds:(%esi),%al
Code;  c016658c <ALTNET_verify_unused_list+5fc/620>
   d:   fa                        cli
Code;  c016658d <ALTNET_verify_unused_list+5fd/620>
   e:   ff                        (bad)
Code;  c016658e <ALTNET_verify_unused_list+5fe/620>
   f:   ff c7                     inc    %edi
Code;  c0166590 <ALTNET_verify_unused_list+600/620>
  11:   04 24                     add    $0x24,%al
Code;  c0166592 <ALTNET_verify_unused_list+602/620>
  13:   67 00 00                  addr16 add %al,(%bx,%si)

---

Current output of the loop_dbench script at the time of the crash:

Tue Aug  3 04:27:03 GMT 2004
2 clients started
   0  62477  41.81 MB/sec
Throughput 41.8089 MB/sec 2 procs
Tue Aug  3 04:27:21 GMT 2004
2 clients started
   0  62477  9.35 MB/sec
Throughput 9.35182 MB/sec 2 procs
Tue Aug  3 04:28:36 GMT 2004
2 clients started
   0  62477  61.05 MB/sec
Throughput 61.0461 MB/sec 2 procs
Tue Aug  3 04:28:48 GMT 2004

---

Current output of the "watch" command at the time of the crash:

Every 1s: cat /proc/sys/fs/inode-nr ; head -n 11 /p...   Tue Aug  3 04:28:49 2004

/proc/sys/fs/inode-nr:
122465  105275

           CPU0       CPU1
  0:   47299923   50226384    IO-APIC-edge  timer
  1:         11          7    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  4:          1          6    IO-APIC-edge  serial
  8:          0          2    IO-APIC-edge  rtc
 14:          1          4    IO-APIC-edge  ide0
 18:    1801274     370200   IO-APIC-level  eth0
 19:     938177    1120685   IO-APIC-level  eth1
 21:   10833570   10778863   IO-APIC-level  3ware Storage Controller
 25:    2677989    3054823   IO-APIC-level  3ware Storage Controller

==> /proc/irq/0/smp_affinity <==
00000002

==> /proc/irq/1/smp_affinity <==
00000001

==> /proc/irq/10/smp_affinity <==
ffffffff

==> /proc/irq/11/smp_affinity <==
ffffffff

==> /proc/irq/12/smp_affinity <==
ffffffff

==> /proc/irq/13/smp_affinity <==
ffffffff

==> /proc/irq/14/smp_affinity <==
00000001

==> /proc/irq/15/smp_affinity <==
ffffffff

==> /proc/irq/18/smp_affinity <==
00000001

==> /proc/irq/19/smp_affinity <==
00000002

==> /proc/irq/2/smp_affinity <==
ffffffff

==> /proc/irq/21/smp_affinity <==
00000001

==> /proc/irq/24/smp_affinity <==
ffffffff

==> /proc/irq/25/smp_affinity <==
00000001

==> /proc/irq/3/smp_affinity <==
ffffffff

==> /proc/irq/4/smp_affinity <==
00000002

==> /proc/irq/5/smp_affinity <==
ffffffff

==> /proc/irq/6/smp_affinity <==
ffffffff

==> /proc/irq/7/smp_affinity <==
ffffffff

==> /proc/irq/8/smp_affinity <==
00000001

==> /proc/irq/9/smp_affinity <==
ffffffff

Chris

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-08-04 15:38 ` Chris Caputo
@ 2004-08-04 18:47 ` Chris Caputo
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Caputo @ 2004-08-04 18:47 UTC (permalink / raw)
  To: Marcelo Tosatti, Arjan van de Ven; +Cc: linux-kernel

Correction: I have been using client_oplocks.txt, not client_plain.txt, in
this repro. (Both were in the directory, but it is the oplocks one that
dbench defaults to.)

Chris

On Wed, 4 Aug 2004, Chris Caputo wrote:
> 2) run loop_dbench, which is the following dbench script which uses
>    client_plain.txt:
>
>    #!/bin/sh
>
>    while [ 1 ]
>    do
>        date
>        dbench 2
>    done

^ permalink raw reply	[flat|nested] 29+ messages in thread
[parent not found: <20040805225549.GA18420@logos.cnet>]
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  [not found] <20040805225549.GA18420@logos.cnet>
@ 2004-08-07 16:00 ` Chris Caputo
  2004-08-07 18:31   ` Marcelo Tosatti
  2004-08-07 19:36   ` Marcelo Tosatti
  0 siblings, 2 replies; 29+ messages in thread
From: Chris Caputo @ 2004-08-07 16:00 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linux Kernel

On Thu, 5 Aug 2004, Marcelo Tosatti wrote:
> On Wed, Aug 04, 2004 at 02:55:38PM -0700, Chris Caputo wrote:
> > Right now the 3 tests I have running are:
> >
> > Intel SDS2 mb / dual-PIII / 3ware / 2.4.26 / irqbalance --oneshot
> > Intel SDS2 mb / dual-PIII / 3ware / 2.4.27-rc5 / irqbalance ongoing
> > Intel STL2 mb / dual-PIII / DAC960 / 2.4.26 / irqbalance ongoing

Well, I ended up getting the same crash (report below) with the non-3ware
(STL2 based) server, so I think I can rule the 3ware driver out as being
an instigator. The STL2 based server took 2 days 14.5 hours to get a
corrupted inode_unused list.

This makes the list of motherboards I have seen the problem on:

  Intel SE7501HG2 with dual-PIV's, 4 gigs of RAM
  Intel SDS2 with dual-PIII's, 4 gigs of RAM
  Intel STL2 with dual-PIII's, 2 gigs of RAM

At present the 2.4.26 with oneshot irqbalance and the 2.4.27-rc5 with
normal irqbalance are continuing to run without problems. Coming up on 3
days without issues... I'll keep them running.

Also, I'll start running 2.4.27-rc5 on a second server (the STL2) with a
normal irqbalance.

> Hum perhaps CONFIG_DEBUG_STACKOVERFLOW? And CONFIG_DEBUG_SLAB?
>
> I recall you said you had CONFIG_DEBUG_SLAB set already?

I have been running kernels with both DEBUG_SLAB and DEBUG_STACKOVERFLOW
set.

Marcelo, I take it the 8-proc server is still running fine?

Anyone else out there got a spare P3 or P4 dual-proc machine they can have
run the following repro scenario with 2.4.26 for a week?

Chris

---

Steps to repro:

 0) irqbalance running on boot
 1) watch -n 1 'cat /proc/sys/fs/inode-nr ; head -n 11 /proc/interrupts ; head /proc/irq/*/*'
 3) loop_dbench

started Wed Aug 4 18:56 GMT.
crashed Sat Aug 7 09:25:49 GMT.

---

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ALTNET(forward): cur->p == NULL, i=0, inode=0xc6f59004, i_size=0
ALTNET: cur=0xc6f5900c, cur->next=0x00000000, cur->prev=0x00000000
ALTNET: i_hash=0xc6f59004, i_hash->next=0xc2918cd0,
        i_hash->next->prev=0xc6f59004, i_hash->prev=0xc2918cd0,
        i_hash->prev->next=0xc6f59004
ALTNET: inode_lock.lock = 0
kernel BUG at inode.c:154!
invalid operand: 0000
CPU:    0
EIP:    0010:[<c0165d3f>]    Not tainted
EFLAGS: 00010282
eax: 0000001c   ebx: c6f5900c   ecx: c031e4f8   edx: 0000001f
esi: c6f59004   edi: 00000000   ebp: f76fff14   esp: f76ffef0
ds: 0018   es: 0018   ss: 0018
Process irqbalance (pid: 1124, stackpage=f76ff000)
Stack: c02e569d 00000000 c2918cd0 c6f59004 c2918cd0 c6f59004 c6f5900c c6f59004
       c283b800 f76fff44 c0168e9e 00000002 c0320a80 f76fff4c c6f59030 c0320e60
       40015000 00000216 f767556c c6f59004 c6f59004 f76fff5c c01637ed c6f59004
Call Trace: [<c0168e9e>] [<c01637ed>] [<c014c3e7>] [<c0177417>] [<c014a12f>]
   [<c014a208>] [<c0107a5b>]
Code: 0f 0b 9a 00 95 56 2e c0 8b 43 04 e9 ac fa ff ff c7 04 24 37

---

>>EIP; c0165d3f <ALTNET_verify_unused_list+5ef/620>   <=====
>>ecx; c031e4f8 <console_sem+0/18>
Trace; c0168e9e <iput+36e/4d0>
Trace; c01637ed <dput+ed/210>
Trace; c014c3e7 <fput+137/190>
Trace; c0177417 <proc_file_write+37/50>
Trace; c014a12f <filp_close+bf/120>
Trace; c014a208 <sys_close+78/90>
Trace; c0107a5b <system_call+33/38>
Code;  c0165d3f <ALTNET_verify_unused_list+5ef/620>
00000000 <_EIP>:
Code;  c0165d3f <ALTNET_verify_unused_list+5ef/620>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  c0165d41 <ALTNET_verify_unused_list+5f1/620>
   2:   9a 00 95 56 2e c0 8b      lcall  $0x8bc0,$0x2e569500
Code;  c0165d48 <ALTNET_verify_unused_list+5f8/620>
   9:   43                        inc    %ebx
Code;  c0165d49 <ALTNET_verify_unused_list+5f9/620>
   a:   04 e9                     add    $0xe9,%al
Code;  c0165d4b <ALTNET_verify_unused_list+5fb/620>
   c:   ac                        lods   %ds:(%esi),%al
Code;  c0165d4c <ALTNET_verify_unused_list+5fc/620>
   d:   fa                        cli
Code;  c0165d4d <ALTNET_verify_unused_list+5fd/620>
   e:   ff                        (bad)
Code;  c0165d4e <ALTNET_verify_unused_list+5fe/620>
   f:   ff c7                     inc    %edi
Code;  c0165d50 <ALTNET_verify_unused_list+600/620>
  11:   04 24                     add    $0x24,%al
Code;  c0165d52 <ALTNET_verify_unused_list+602/620>
  13:   37                        aaa

---

Sat Aug  7 09:25:38 GMT 2004
2 clients started
   0  62477  185.37 MB/sec
Throughput 185.368 MB/sec 2 procs
Sat Aug  7 09:25:43 GMT 2004
2 clients started
   0  62477  191.27 MB/sec
Throughput 191.266 MB/sec 2 procs
Sat Aug  7 09:25:48 GMT 2004
2 clients started

---

Every 1s: cat /proc/sys/fs/inode-nr ; head -n 11 /p...   Sat Aug  7 09:25:49 2004

/proc/sys/fs/inode-nr:
113480  113303

           CPU0       CPU1
  0:   11261584   11262640    IO-APIC-edge  timer
  1:          3         11    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  4:          6          1    IO-APIC-edge  serial
  8:          2          0    IO-APIC-edge  rtc
 18:    3019678         26   IO-APIC-level  eth0
 20:     607652     716283   IO-APIC-level  Mylex eXtremeRAID 2000
NMI:          0          0
LOC:   22526983   22526974
ERR:          0

==> /proc/irq/0/smp_affinity <==
00000001

==> /proc/irq/1/smp_affinity <==
00000002

==> /proc/irq/10/smp_affinity <==
ffffffff

==> /proc/irq/11/smp_affinity <==
ffffffff

==> /proc/irq/12/smp_affinity <==
ffffffff

==> /proc/irq/13/smp_affinity <==
ffffffff

==> /proc/irq/14/smp_affinity <==
ffffffff

==> /proc/irq/15/smp_affinity <==
ffffffff

==> /proc/irq/18/smp_affinity <==
00000001

==> /proc/irq/2/smp_affinity <==
ffffffff

==> /proc/irq/20/smp_affinity <==
00000002

==> /proc/irq/21/smp_affinity <==
ffffffff

==> /proc/irq/3/smp_affinity <==
ffffffff

==> /proc/irq/4/smp_affinity <==
00000001

==> /proc/irq/5/smp_affinity <==
ffffffff

==> /proc/irq/6/smp_affinity <==
ffffffff

==> /proc/irq/7/smp_affinity <==
ffffffff

==> /proc/irq/8/smp_affinity <==
00000002

==> /proc/irq/9/smp_affinity <==
ffffffff

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-08-07 16:00 ` Chris Caputo
@ 2004-08-07 18:31 ` Marcelo Tosatti
  0 siblings, 0 replies; 29+ messages in thread
From: Marcelo Tosatti @ 2004-08-07 18:31 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Linux Kernel

On Sat, Aug 07, 2004 at 09:00:30AM -0700, Chris Caputo wrote:
> On Thu, 5 Aug 2004, Marcelo Tosatti wrote:
> > On Wed, Aug 04, 2004 at 02:55:38PM -0700, Chris Caputo wrote:
> > > Right now the 3 tests I have running are:
> > >
> > > Intel SDS2 mb / dual-PIII / 3ware / 2.4.26 / irqbalance --oneshot
> > > Intel SDS2 mb / dual-PIII / 3ware / 2.4.27-rc5 / irqbalance ongoing
> > > Intel STL2 mb / dual-PIII / DAC960 / 2.4.26 / irqbalance ongoing
>
> Well, I ended up getting the same crash (report below) with the non-3ware
> (STL2 based) server, so I think I can rule the 3ware driver out as being
> an instigator. The STL2 based server took 2 days 14.5 hours to get a
> corrupted inode_unused list.
>
> This makes the list of motherboards I have seen the problem on be:
>
> Intel SE7501HG2 with dual-PIV's, 4 gig of ram
> Intel SDS2 with dual-PIII's, 4 gig of ram
> Intel STL2 with dual-PIII's, 2 gig of ram
>
> At present the 2.4.26 with oneshot irqbalance and the 2.4.27-rc5 with
> normal irqbalance are continuing to run without problems. Coming up on 3
> days without issues... I'll keep them running.
>
> Also, I'll start running 2.4.27-rc5 on a second server (the STL2) with a
> normal irqbalance.
>
> > Hum perhaps CONFIG_DEBUG_STACKOVERFLOW? And CONFIG_DEBUG_SLAB?
> >
> > I recall you said you had CONFIG_DEBUG_SLAB set already?
>
> I have been running kernels with both DEBUG_SLAB and DEBUG_STACKOVERFLOW
> set.
>
> Marcelo, I take it the 8-proc server is still running fine?

Yes, it's fine.

> Anyone else out there got a spare P3 or P4 dual-proc machine they can have
> run the following repro scenario with 2.4.26 for a week?

I have a dual-P4 SE7501 here. I'll start the tests now with v2.4.26 plus
your corruption patches.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: inode_unused list corruption in 2.4.26 - spin_lock problem?
  2004-08-07 16:00 ` Chris Caputo
  2004-08-07 18:31 ` Marcelo Tosatti
@ 2004-08-07 19:36 ` Marcelo Tosatti
  1 sibling, 0 replies; 29+ messages in thread
From: Marcelo Tosatti @ 2004-08-07 19:36 UTC (permalink / raw)
  To: Chris Caputo; +Cc: Linux Kernel

On Sat, Aug 07, 2004 at 09:00:30AM -0700, Chris Caputo wrote:
> On Thu, 5 Aug 2004, Marcelo Tosatti wrote:
> > On Wed, Aug 04, 2004 at 02:55:38PM -0700, Chris Caputo wrote:
> > > Right now the 3 tests I have running are:
> > >
> > > Intel SDS2 mb / dual-PIII / 3ware / 2.4.26 / irqbalance --oneshot
> > > Intel SDS2 mb / dual-PIII / 3ware / 2.4.27-rc5 / irqbalance ongoing
> > > Intel STL2 mb / dual-PIII / DAC960 / 2.4.26 / irqbalance ongoing
>
> Well, I ended up getting the same crash (report below) with the non-3ware
> (STL2 based) server, so I think I can rule the 3ware driver out as being
> an instigator. The STL2 based server took 2 days 14.5 hours to get a
> corrupted inode_unused list.
>
> This makes the list of motherboards I have seen the problem on be:
>
> Intel SE7501HG2 with dual-PIV's, 4 gig of ram
> Intel SDS2 with dual-PIII's, 4 gig of ram
> Intel STL2 with dual-PIII's, 2 gig of ram
>
> At present the 2.4.26 with oneshot irqbalance and the 2.4.27-rc5 with
> normal irqbalance are continuing to run without problems. Coming up on 3
> days without issues... I'll keep them running.
>
> Also, I'll start running 2.4.27-rc5 on a second server (the STL2) with a
> normal irqbalance.

Chris,

Started running "dbench 10" on 2.4.27-rc6 + Arjan's irqbalance 0.9 on my
dual-P4 (E7500 Plumas) with a Symbios 53c1030 controller and 2GB of RAM.

^ permalink raw reply	[flat|nested] 29+ messages in thread
end of thread, other threads:[~2004-08-07 20:31 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-06-19 0:47 inode_unused list corruption in 2.4.26 - spin_lock problem? Chris Caputo
2004-06-20 0:15 ` Marcelo Tosatti
2004-06-20 3:33 ` Trond Myklebust
2004-06-21 0:45 ` Marcelo Tosatti
2004-06-21 17:10 ` Trond Myklebust
2004-06-21 18:23 ` Marcelo Tosatti
2004-06-24 1:51 ` Chris Caputo
2004-06-25 7:47 ` Chris Caputo
2004-06-24 1:50 ` Chris Caputo
2004-06-25 8:04 ` Chris Caputo
2004-06-25 10:18 ` Chris Caputo
2004-06-25 12:17 ` Marcelo Tosatti
2004-07-02 20:00 ` Chris Caputo
2004-07-03 5:15 ` Arjan van de Ven
2004-07-26 17:41 ` Chris Caputo
2004-07-27 14:19 ` Marcelo Tosatti
2004-07-27 16:08 ` Chris Caputo
2004-07-29 0:25 ` Marcelo Tosatti
2004-07-29 6:27 ` Chris Caputo
2004-07-29 7:54 ` Arjan van de Ven
2004-07-29 10:57 ` Marcelo Tosatti
2004-07-29 12:21 ` Arjan van de Ven
2004-07-29 16:22 ` Chris Caputo
2004-07-29 16:23 ` Arjan van de Ven
2004-08-04 15:38 ` Chris Caputo
2004-08-04 18:47 ` Chris Caputo
[not found] <20040805225549.GA18420@logos.cnet>
2004-08-07 16:00 ` Chris Caputo
2004-08-07 18:31 ` Marcelo Tosatti
2004-08-07 19:36 ` Marcelo Tosatti