public inbox for linux-kernel@vger.kernel.org
* Possible dcache BUG: debugging patch
@ 2004-08-19 22:25 Marcelo Tosatti
  0 siblings, 0 replies; 4+ messages in thread
From: Marcelo Tosatti @ 2004-08-19 22:25 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel


Andrew, 

IMO it is worth including these BUG_ONs in -mm or even mainline.

Gene is hitting this mapping->private_list corruption; with the BUG_ONs
in place it will be clearer what is happening if others hit the same bug.

----- Forwarded message from Marcelo Tosatti <marcelo.tosatti@cyclades.com> -----

From: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
Date: Fri, 13 Aug 2004 23:18:09 -0300
To: Gene Heskett <gene.heskett@verizon.net>
Cc: linux-kernel@vger.kernel.org, Linus Torvalds <torvalds@osdl.org>,
	Andrew Morton <akpm@osdl.org>,
	Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
In-Reply-To: <200408130027.24470.gene.heskett@verizon.net>
Subject: Re: Possible dcache BUG

On Fri, Aug 13, 2004 at 12:27:24AM -0400, Gene Heskett wrote:
> On Wednesday 11 August 2004 00:59, Linus Torvalds wrote:
> >I wrote:
> >> Notably, the output of "/proc/meminfo" and "/proc/slabinfo". "ps
> >> axm" helps too.
> >
> >That should be "ps axv" of course. Just shows what a retard I am.
> >
> >		Linus

> Acck!  I just logged an Oops:
> Aug 13 00:02:00 coyote kernel: kjournald starting.  Commit interval 5 seconds
> Aug 13 00:02:00 coyote kernel: EXT3 FS on hdb3, internal journal
> Aug 13 00:02:00 coyote kernel: EXT3-fs: mounted filesystem with ordered data mode.
> Aug 13 00:05:09 coyote kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000004
> Aug 13 00:05:09 coyote kernel:  printing eip:
> Aug 13 00:05:09 coyote kernel: c014e0dc
> Aug 13 00:05:09 coyote kernel: *pde = 00000000
> Aug 13 00:05:09 coyote kernel: Oops: 0002 [#1]
> Aug 13 00:05:09 coyote kernel: PREEMPT
> Aug 13 00:05:09 coyote kernel: Modules linked in: eeprom snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss snd_bt87x snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd forcedeth sg
> Aug 13 00:05:09 coyote kernel: CPU:    0
> Aug 13 00:05:09 coyote kernel: EIP:    0060:[<c014e0dc>]    Not tainted
> Aug 13 00:05:09 coyote kernel: EFLAGS: 00010246   (2.6.8-rc4)
> Aug 13 00:05:09 coyote kernel: EIP is at remove_inode_buffers+0x4c/0x90
> Aug 13 00:05:09 coyote kernel: eax: 00000000   ebx: d7ff68b4   ecx: d7ffffb4   edx: 00000000
> Aug 13 00:05:09 coyote kernel: esi: d7ff67e0   edi: 00000001   ebp: c198bed8   esp: c198bec8
> Aug 13 00:05:09 coyote kernel: ds: 007b   es: 007b   ss: 0068
> Aug 13 00:05:09 coyote kernel: Process kswapd0 (pid: 66, threadinfo=c198b000 task=c1978050)
> Aug 13 00:05:09 coyote kernel: Stack: d7ff67e0 d7ff67e8 d7ff67e0 0000001e c198bf04 c0165242 d7ff67e0 c198b000
> Aug 13 00:05:09 coyote kernel:        00000000 0000001e d7ff6988 ed3be928 00000080 00000000 c198b000 c198bf10
> Aug 13 00:05:09 coyote kernel:        c016532f 00000080 c198bf44 c013a32c 00000080 000000d0 0002cc1d 013b0a00
> Aug 13 00:05:09 coyote kernel: Call Trace:
> Aug 13 00:05:09 coyote kernel:  [<c010476f>] show_stack+0x7f/0xa0
> Aug 13 00:05:09 coyote kernel:  [<c0104908>] show_registers+0x158/0x1b0
> Aug 13 00:05:09 coyote kernel:  [<c0104a89>] die+0x89/0x100
> Aug 13 00:05:09 coyote kernel:  [<c0111725>] do_page_fault+0x1f5/0x553
> Aug 13 00:05:09 coyote kernel:  [<c01043d9>] error_code+0x2d/0x38
> Aug 13 00:05:09 coyote kernel:  [<c0165242>] prune_icache+0x142/0x1f0
> Aug 13 00:05:09 coyote kernel:  [<c016532f>] shrink_icache_memory+0x3f/0x50
> Aug 13 00:05:09 coyote kernel:  [<c013a32c>] shrink_slab+0x14c/0x190
> Aug 13 00:05:09 coyote kernel:  [<c013b639>] balance_pgdat+0x1a9/0x1f0
> Aug 13 00:05:09 coyote kernel:  [<c013b73f>] kswapd+0xbf/0xd0
> Aug 13 00:05:09 coyote kernel:  [<c0102471>] kernel_thread_helper+0x5/0x14
> Aug 13 00:05:09 coyote kernel: Code: 89 50 04 89 02 89 49 04 89 09 8b 03 39 d8 89 c1 75 e2 b8 00
> Aug 13 00:05:09 coyote kernel:  <6>note: kswapd0[66] exited with preempt_count 1
> 
> The first 3 entries are from a nightly run of rsync, which mounts a
> normally unmounted partition for the duration of its run.

Hi fellows,

I've taken some time to look at these oopses, and I truly believe we
are facing real corruption.

The symptom is that a (blockdev) inode's i_mapping->private_list gets corrupted:
one of its buffer_heads contains a b_assoc_buffers list_head with NULL pointers.

And this is not an SMP race, because Gene is not running SMP.

Gene's oops happens when remove_inode_buffers calls __remove_assoc_queue(bh).

Ingo's oops happens while remove_inode_buffers does

 struct buffer_head *bh = BH_ENTRY(list->next);

which is

	mov ffffffd8(%ecx), (%somewhere)

%ecx is zero, so the load faults at address ffffffd8, as in the oops below.

There is a bug somewhere.

--- a/fs/buffer.c.original	2004-08-14 00:19:55.000000000 -0300
+++ b/fs/buffer.c	2004-08-14 00:34:57.000000000 -0300
@@ -802,6 +802,8 @@
  */
 static inline void __remove_assoc_queue(struct buffer_head *bh)
 {
+	BUG_ON(bh->b_assoc_buffers.next == NULL);
+	BUG_ON(bh->b_assoc_buffers.prev == NULL);
 	list_del_init(&bh->b_assoc_buffers);
 }
 
@@ -1073,6 +1075,7 @@
 
 		spin_lock(&buffer_mapping->private_lock);
 		while (!list_empty(list)) {
+			BUG_ON(list->next == NULL);
 			struct buffer_head *bh = BH_ENTRY(list->next);
 			if (buffer_dirty(bh)) {
 				ret = 0;


Ingo's oops, for reference:
Unable to handle kernel paging request at virtual address ffffffd8
 printing eip:
c016a3d0
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP 
Modules linked in:
CPU:    0
EIP:    0060:[<c016a3d0>]    Not tainted VLI
EFLAGS: 00010217   (2.6.8-rc2-mm2) 
EIP is at remove_inode_buffers+0x60/0xe0
eax: 00000000   ebx: c03ba9dc   ecx: 00000000   edx: c03ba8d0
esi: c03ba8d0   edi: c0379b2a   ebp: c4115ec4   esp: c4115eac
ds: 007b   es: 007b   ss: 0068
Process kswapd0 (pid: 39, threadinfo=c4114000 task=c40aa070)
Stack: c03ba8d0 c0379b76 00000001 c03ba8d8 c03ba8d0 00000000 c4115ef8 c0186c4c 
       c03ba8d0 00000077 c4114000 00000000 0000004d 00000000 c4115ee4 c4115ee4 
       c4114000 c07fd6a0 00004e09 c4115f04 c0186df5 00000080 c4115f38 c014f4b3 
Call Trace:
 [<c01059ff>] show_stack+0x8f/0xb0
 [<c0105bb3>] show_registers+0x163/0x1d0
 [<c0105dc6>] die+0xe6/0x1c0
 [<c0117773>] do_page_fault+0x213/0x6c0
 [<c0105674>] exception_start+0x6/0xe
 [<c0186c4c>] prune_icache+0x20c/0x390
 [<c0186df5>] shrink_icache_memory+0x25/0x50
 [<c014f4b3>] shrink_slab+0x123/0x1d0
 [<c01511ee>] balance_pgdat+0x24e/0x2a0
 [<c015130c>] kswapd+0xcc/0xe0
 [<c0102899>] kernel_thread_helper+0x5/0xc
Code: 00 e0 ff ff 21 e0 ff 40 14 8d 47 4c 89 45 ec 31 c0 86 47 4c 84 c0 0f 8e 79 00 \
00 00 8b 86 0c 01 00 00 39 d8 74 23 89 c1 8d 76 00 <8b> 41 d8 a8 02 75 5a 8b 01 8b 51 \
04 89 02 89 09 89 50 04 8b 03   <6>note: kswapd0[39] exited with preempt_count 1


----- End forwarded message -----


* Re: Possible dcache BUG: debugging patch
  2004-09-14 21:13 Neil Schemenauer
@ 2004-09-14 20:06 ` Marcelo Tosatti
  2004-09-14 22:21   ` Neil Schemenauer
  0 siblings, 1 reply; 4+ messages in thread
From: Marcelo Tosatti @ 2004-09-14 20:06 UTC (permalink / raw)
  To: Neil Schemenauer; +Cc: linux-kernel, mingo, Gene Heskett


Hi Neil,

IIRC Gene's problem was a hardware misconfiguration.

And your trace doesn't look similar to his (different location inside
prune_dcache, from what I remember).

Anyway, how hard is it for you to reproduce this?

Yes, 2.6.9-rc1-mm5 contains the remove_inode_buffers BUG_ONs that catch NULL
list pointers. You can try that, but it might be unrelated.

Ingo said he was also seeing these oopses, but he didn't reply. Ingo?

On Tue, Sep 14, 2004 at 05:13:01PM -0400, Neil Schemenauer wrote:
> Hi Marcelo,
> 
> One of our AMD K7 machines seems to be running into the dcache bug.
> I'm attaching the oops trace.  Is it possible to apply your BUG
> patch to the 2.6.7 or 2.6.8.1 kernel or should I use 2.6.9-rc1-mm5?
> 
> Best regards,
> 
>   Neil
> 
> 
> ======================================================================
> 
> Unable to handle kernel paging request at virtual address 705f6573
>  printing eip:
> c01539c6
> *pde = 00000000
> Oops: 0002 [#1]
> Modules linked in: via_rhine crc32 binfmt_capwrap
> CPU:    0
> EIP:    0060:[prune_dcache+38/288]    Not tainted
> EFLAGS: 00010212   (2.6.7) 
> EIP is at prune_dcache+0x26/0x120
> eax: c038edd4   ebx: c0601ec4   ecx: c8601e50   edx: 705f6573
> esi: c05e8be0   edi: 00000061   ebp: f7ffea48   esp: f7d92f0c
> ds: 007b   es: 007b   ss: 0068
> Process kswapd0 (pid: 32, threadinfo=f7d92000 task=f7dc6050)
> Stack: 00000080 00000000 f7d92000 c0153d92 c0131417 02c47800 00000000 00030260 
>        000000eb 00000000 000000d0 000000c0 c038dca4 00000001 0000000a c038db80 
>        c01324b3 00000000 f7d92f9c 000000a0 00000000 000000c0 000000c0 000000c0 
> Call Trace:
>  [shrink_dcache_memory+18/32] shrink_dcache_memory+0x12/0x20
>  [shrink_slab+311/368] shrink_slab+0x137/0x170
>  [balance_pgdat+419/496] balance_pgdat+0x1a3/0x1f0
>  [kswapd+182/192] kswapd+0xb6/0xc0
>  [autoremove_wake_function+0/80] autoremove_wake_function+0x0/0x50
>  [autoremove_wake_function+0/80] autoremove_wake_function+0x0/0x50
>  [kswapd+0/192] kswapd+0x0/0xc0
>  [kernel_thread_helper+5/24] kernel_thread_helper+0x5/0x18
> 
> Code: 89 02 89 49 04 89 09 a1 d8 ed 38 c0 0f 18 00 90 ff 0d e0 ed 


* Re: Possible dcache BUG: debugging patch
@ 2004-09-14 21:13 Neil Schemenauer
  2004-09-14 20:06 ` Marcelo Tosatti
  0 siblings, 1 reply; 4+ messages in thread
From: Neil Schemenauer @ 2004-09-14 21:13 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

Hi Marcelo,

One of our AMD K7 machines seems to be running into the dcache bug.
I'm attaching the oops trace.  Is it possible to apply your BUG
patch to the 2.6.7 or 2.6.8.1 kernel or should I use 2.6.9-rc1-mm5?

Best regards,

  Neil


======================================================================

Unable to handle kernel paging request at virtual address 705f6573
 printing eip:
c01539c6
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: via_rhine crc32 binfmt_capwrap
CPU:    0
EIP:    0060:[prune_dcache+38/288]    Not tainted
EFLAGS: 00010212   (2.6.7) 
EIP is at prune_dcache+0x26/0x120
eax: c038edd4   ebx: c0601ec4   ecx: c8601e50   edx: 705f6573
esi: c05e8be0   edi: 00000061   ebp: f7ffea48   esp: f7d92f0c
ds: 007b   es: 007b   ss: 0068
Process kswapd0 (pid: 32, threadinfo=f7d92000 task=f7dc6050)
Stack: 00000080 00000000 f7d92000 c0153d92 c0131417 02c47800 00000000 00030260 
       000000eb 00000000 000000d0 000000c0 c038dca4 00000001 0000000a c038db80 
       c01324b3 00000000 f7d92f9c 000000a0 00000000 000000c0 000000c0 000000c0 
Call Trace:
 [shrink_dcache_memory+18/32] shrink_dcache_memory+0x12/0x20
 [shrink_slab+311/368] shrink_slab+0x137/0x170
 [balance_pgdat+419/496] balance_pgdat+0x1a3/0x1f0
 [kswapd+182/192] kswapd+0xb6/0xc0
 [autoremove_wake_function+0/80] autoremove_wake_function+0x0/0x50
 [autoremove_wake_function+0/80] autoremove_wake_function+0x0/0x50
 [kswapd+0/192] kswapd+0x0/0xc0
 [kernel_thread_helper+5/24] kernel_thread_helper+0x5/0x18

Code: 89 02 89 49 04 89 09 a1 d8 ed 38 c0 0f 18 00 90 ff 0d e0 ed 


$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : AMD Athlon(TM) XP 2400+
stepping        : 1
cpu MHz         : 2001.026
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 3956.73



* Re: Possible dcache BUG: debugging patch
  2004-09-14 20:06 ` Marcelo Tosatti
@ 2004-09-14 22:21   ` Neil Schemenauer
  0 siblings, 0 replies; 4+ messages in thread
From: Neil Schemenauer @ 2004-09-14 22:21 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

On Tue, Sep 14, 2004 at 05:06:03PM -0300, Marcelo Tosatti wrote:
> And your trace doesn't look similar to his (different location inside
> prune_dcache, from what I remember).
> 
> Anyway, how hard is it for you to reproduce this?

It's only happened once so far.  I'll upgrade to 2.6.8.1 and apply
your BUG patch on top.  If it happens again then maybe we will get
more info.

  Neil

