Re: 2.6.17-rc2-mm1

linuxppc-dev.lists.ozlabs.org archive mirror

* Re: 2.6.17-rc2-mm1
@ 2006-04-27 16:47 Martin Bligh
  2006-04-28  8:20 ` 2.6.17-rc2-mm1 Andrew Morton
  0 siblings, 1 reply; 15+ messages in thread
From: Martin Bligh @ 2006-04-27 16:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linuxppc64-dev, linux-kernel

Still crashes in LTP on x86_64:
(introduced in previous release)

http://test.kernel.org/abat/29674/debug/console.log

Different panic on 2-way ppc64 blade, again during LTP.

http://test.kernel.org/abat/29675/debug/console.log

  Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=128 NUMA
Modules linked in: evdev joydev st sr_mod ipv6 usbcore sg dm_mod
NIP: C000000000048F0C LR: C0000000000AF854 CTR: 800000000000A984
REGS: c0000000074af560 TRAP: 0300   Not tainted  (2.6.17-rc2-mm1-autokern1)
MSR: 8000000000001032 <ME,IR,DR>  CR: 24002024  XER: 00000010
DAR: C00001800056B0B0, DSISR: 0000000040010000
TASK = c000000007460800[84] 'kswapd0' THREAD: c0000000074ac000 CPU: 1
GPR00: 8000000000001032 C0000000074AF7E0 C000000000691420 C0000000007586A8
GPR04: 000000000000000F 0000000000000000 0000000000000000 0000000000000000
GPR08: C0000000FE80AAD8 C00001800056B080 0000000000000001 C0000000007586A8
GPR12: 0000000024002024 C00000000056B280 0000000000000020 0000000000000020
GPR16: 0000000000000020 0000000000000000 0000000000000000 000000000000000F
GPR20: C0000000074AF860 0000000000000000 C0000000FFFF3098 0000000000000001
GPR24: C0000000074AFE00 C00000000059FCC0 0000000000000001 C0000000007586A8
GPR28: C000000000545680 0000000000000022 C0000000005A4DA8 C00000000056B080
NIP [C000000000048F0C] .try_to_wake_up+0x98/0x598
LR [C0000000000AF854] .add_to_swapped_list+0x23c/0x264
Call Trace:
[C0000000074AF7E0] [C0000000005A4DA8] 0xc0000000005a4da8 (unreliable)
[C0000000074AF8F0] [C0000000000AF854] .add_to_swapped_list+0x23c/0x264
[C0000000074AF990] [C000000000098290] .remove_mapping+0x88/0x174
[C0000000074AFA20] [C000000000099340] .shrink_zone+0xc74/0xf9c
[C0000000074AFD30] [C00000000009A008] .kswapd+0x3e4/0x54c
[C0000000074AFED0] [C0000000000705C8] .kthread+0x174/0x1c4
[C0000000074AFF90] [C000000000024AB0] .kernel_thread+0x4c/0x68
Instruction dump:
3a810080 7d2000a6 79208042 f9340000 78008000 7c010164 e97b0008 ebfe8008
eb9e8000 812b0010 79294da4 7d29fa14 <e8090030> 7fbc0214 7fa3eb78 4841f615
-- 0:conmux-control -- time-stamp -- Apr/27/06  5:10:48 --

* Re: 2.6.17-rc2-mm1
@ 2006-04-27 16:54 Martin Bligh
  0 siblings, 0 replies; 15+ messages in thread
From: Martin Bligh @ 2006-04-27 16:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linuxppc64-dev, LKML

OK, one more time, cause I'm an idiot, and can't type.

* Re: 2.6.17-rc2-mm1
  2006-04-27 16:47 2.6.17-rc2-mm1 Martin Bligh
@ 2006-04-28  8:20 ` Andrew Morton
  2006-05-01 14:24   ` 2.6.17-rc2-mm1 Martin J. Bligh
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2006-04-28  8:20 UTC (permalink / raw)
  To: Martin Bligh; +Cc: linuxppc64-dev, linux-kernel


(I did s/linux-kernel@google.com/linux-kernel@vger.kernel.org/)

Martin Bligh <mbligh@google.com> wrote:
>
> Still crashes in LTP on x86_64:
> (introduced in previous release)
> 
> http://test.kernel.org/abat/29674/debug/console.log

What a mess.  A doublefault inside an NMI watchdog timeout.  I think.  It's
hard to see.  Some CPUs are stuck on a CPU scheduler lock, others seem to
be stuck in flush_tlb_others.  One of these could be a consequence of the
other, or both could be a consequence of something else.

> Different panic on 2-way ppc64 blade, again during LTP.
> 
> http://test.kernel.org/abat/29675/debug/console.log
> 
>   Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=128 NUMA
> Modules linked in: evdev joydev st sr_mod ipv6 usbcore sg dm_mod
> NIP: C000000000048F0C LR: C0000000000AF854 CTR: 800000000000A984
> REGS: c0000000074af560 TRAP: 0300   Not tainted  (2.6.17-rc2-mm1-autokern1)
> MSR: 8000000000001032 <ME,IR,DR>  CR: 24002024  XER: 00000010
> DAR: C00001800056B0B0, DSISR: 0000000040010000
> TASK = c000000007460800[84] 'kswapd0' THREAD: c0000000074ac000 CPU: 1
> GPR00: 8000000000001032 C0000000074AF7E0 C000000000691420 C0000000007586A8
> GPR04: 000000000000000F 0000000000000000 0000000000000000 0000000000000000
> GPR08: C0000000FE80AAD8 C00001800056B080 0000000000000001 C0000000007586A8
> GPR12: 0000000024002024 C00000000056B280 0000000000000020 0000000000000020
> GPR16: 0000000000000020 0000000000000000 0000000000000000 000000000000000F
> GPR20: C0000000074AF860 0000000000000000 C0000000FFFF3098 0000000000000001
> GPR24: C0000000074AFE00 C00000000059FCC0 0000000000000001 C0000000007586A8
> GPR28: C000000000545680 0000000000000022 C0000000005A4DA8 C00000000056B080
> NIP [C000000000048F0C] .try_to_wake_up+0x98/0x598
> LR [C0000000000AF854] .add_to_swapped_list+0x23c/0x264
> Call Trace:
> [C0000000074AF7E0] [C0000000005A4DA8] 0xc0000000005a4da8 (unreliable)
> [C0000000074AF8F0] [C0000000000AF854] .add_to_swapped_list+0x23c/0x264
> [C0000000074AF990] [C000000000098290] .remove_mapping+0x88/0x174
> [C0000000074AFA20] [C000000000099340] .shrink_zone+0xc74/0xf9c
> [C0000000074AFD30] [C00000000009A008] .kswapd+0x3e4/0x54c
> [C0000000074AFED0] [C0000000000705C8] .kthread+0x174/0x1c4
> [C0000000074AFF90] [C000000000024AB0] .kernel_thread+0x4c/0x68
> Instruction dump:
> 3a810080 7d2000a6 79208042 f9340000 78008000 7c010164 e97b0008 ebfe8008
> eb9e8000 812b0010 79294da4 7d29fa14 <e8090030> 7fbc0214 7fa3eb78 4841f615
> -- 0:conmux-control -- time-stamp -- Apr/27/06  5:10:48 --

Well that's silly.  kswapd died trying to wake up kprefetchd.  That code's
bog-simple, so I'd assume something's gone wrong with a CPU scheduler data
structure.  So if there's a common strand here, it's breakage of sched data
structures by mtest01.   Let me see if I can provoke it here.
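
The faulting access in the ppc64 oops can be cross-checked from the dump itself. The following is a minimal sketch, not part of the original thread: it decodes the bracketed instruction word as a PowerPC DS-form load and recomputes the effective address from GPR09. The helper names `decode_ds_form` and `effective_address` are invented for this illustration.

```python
# Values copied from the ppc64 oops quoted above.
DAR = 0xC00001800056B0B0       # faulting data address
GPR9 = 0xC00001800056B080      # GPR09 at the time of the fault
GPR31 = 0xC00000000056B080     # GPR31 at the time of the fault
FAULTING_WORD = 0xE8090030     # the <...> word in the instruction dump
MASK64 = (1 << 64) - 1


def decode_ds_form(word):
    """Split a 32-bit PowerPC DS-form instruction into its fields."""
    opcode = (word >> 26) & 0x3F   # primary opcode, top 6 bits
    rt = (word >> 21) & 0x1F       # target register
    ra = (word >> 16) & 0x1F       # base register
    ds = word & 0xFFFC             # displacement with the 2-bit XO masked off
    if ds & 0x8000:                # sign-extend the 16-bit displacement
        ds -= 0x10000
    xo = word & 0x3                # extended opcode: 0 selects ld
    return opcode, rt, ra, ds, xo


def effective_address(base, displacement):
    """Base register plus displacement, wrapped to 64 bits."""
    return (base + displacement) & MASK64


if __name__ == "__main__":
    opcode, rt, ra, ds, xo = decode_ds_form(FAULTING_WORD)
    print(f"opcode={opcode} xo={xo}: r{rt} <- {ds}(r{ra})")
    print(f"EA  = {effective_address(GPR9, ds):016X}")
    print(f"DAR = {DAR:016X}")
    print(f"GPR9 - GPR31 = {GPR9 - GPR31:#x}")
```

The word decodes as opcode 58 with XO 0, i.e. `ld r0,48(r9)`, and GPR09 + 0x30 equals the reported DAR. GPR09 also differs from GPR31 by exactly 0x18000000000, which would fit a bogus offset being added to an otherwise sane-looking kernel pointer. That is consistent with the suspicion of a damaged scheduler data structure, but it does not prove it.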

* Re: 2.6.17-rc2-mm1
  2006-04-28  8:20 ` 2.6.17-rc2-mm1 Andrew Morton
@ 2006-05-01 14:24   ` Martin J. Bligh
  2006-05-01 17:07     ` 2.6.17-rc2-mm1 Andrew Morton
  2006-05-01 18:34     ` 2.6.17-rc2-mm1 Andi Kleen
  0 siblings, 2 replies; 15+ messages in thread
From: Martin J. Bligh @ 2006-05-01 14:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linuxppc64-dev, Andi Kleen, linux-kernel

Andrew Morton wrote:
> (I did s/linux-kernel@google.com/linux-kernel@vger.kernel.org/)
> 
> Martin Bligh <mbligh@google.com> wrote:
> 
>>Still crashes in LTP on x86_64:
>>(introduced in previous release)
>>
>>http://test.kernel.org/abat/29674/debug/console.log
> 
> 
> What a mess.  A doublefault inside an NMI watchdog timeout.  I think.  It's
> hard to see.  Some CPUs are stuck on a CPU scheduler lock, others seem to
> be stuck in flush_tlb_others.  One of these could be a consequence of the
> other, or both could be a consequence of something else.

OK, well the latest one seems cleaner, on -rc3-mm1.
http://test.kernel.org/abat/30007/debug/console.log

Just has the double fault, with no NMI watchdog timeouts. Not that
it means any more to me, but still ;-) mtest01 seems to be able to
reproduce this every time, but I don't have an appropriate box here
to diagnose it with (this was a 4x Opteron inside IBM), and it's
definitely something in -mm that's not in mainline.

M.

double fault: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:06.0/resource
CPU 0
Modules linked in:
Pid: 20519, comm: mtest01 Not tainted 2.6.17-rc3-mm1-autokern1 #1
RIP: 0010:[<ffffffff8047c8b8>] <ffffffff8047c8b8>{__sched_text_start+1856}
RSP: 0000:0000000000000000  EFLAGS: 00010082
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff805d9438
RDX: ffff8100db12c0d0 RSI: ffffffff805d9438 RDI: ffff8100db12c0d0
RBP: ffffffff805d9438 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8100e39bd440 R14: ffff810008003620 R15: 000002b02751726c
FS:  0000000000000000(0000) GS:ffffffff805fa000(0063) knlGS:00000000f7dd0460
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: fffffffffffffff8 CR3: 00000000da399000 CR4: 00000000000006e0
Process mtest01 (pid: 20519, threadinfo ffff8100b1bb4000, task ffff8100db12c0d0)
Stack: ffffffff80579e20 ffff8100db12c0d0 0000000000000001 ffffffff80579f58
        0000000000000000 ffffffff80579e78 ffffffff8020b0b2 ffffffff80579f58
        0000000000000000 ffffffff80485520
Call Trace: <#DF> <ffffffff8020b0b2>{show_registers+140}
        <ffffffff8020b357>{__die+159} <ffffffff8020b3cc>{die+50}
        <ffffffff8020bba6>{do_double_fault+115} <ffffffff8020aa91>{double_fault+125}
        <ffffffff8047c8b8>{__sched_text_start+1856} <EOE>

Code: e8 4c ba d8 ff 65 48 8b 34 25 00 00 00 00 4c 8b 46 08 f0 41
RIP <ffffffff8047c8b8>{__sched_text_start+1856} RSP <0000000000000000>
  -- 0:conmux-control -- time-stamp -- May/01/06  3:54:37 --
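
Two values in this oops line up arithmetically: RSP is 0 and CR2 is fffffffffffffff8. The sketch below, which is not part of the original thread, shows that a 64-bit push from RSP = 0 lands on exactly that address, and computes where the first instruction on the Code: line would branch. It assumes the Code: bytes start at RIP, which the 20-byte dump suggests but the thread does not confirm. The function names are invented for this illustration.

```python
MASK64 = (1 << 64) - 1

# Values copied from the x86_64 oops above.
RIP = 0xFFFFFFFF8047C8B8
RSP = 0x0
CR2 = 0xFFFFFFFFFFFFFFF8
CODE = bytes.fromhex("e84cbad8ff")  # first five bytes of the Code: line


def push_address(rsp):
    """Address written by a 64-bit push or call: RSP - 8, wrapped to 64 bits."""
    return (rsp - 8) & MASK64


def call_rel32_target(rip, insn):
    """Target of a 5-byte `call rel32` (opcode 0xE8) located at rip."""
    if insn[0] != 0xE8:
        raise ValueError("not a call rel32")
    rel = int.from_bytes(insn[1:5], "little", signed=True)
    return (rip + 5 + rel) & MASK64


if __name__ == "__main__":
    print(f"push from RSP=0 writes to {push_address(RSP):016x}")
    print(f"CR2                      {CR2:016x}")
    print(f"call target              {call_rel32_target(RIP, CODE):016x}")
```

Under that assumption the instruction at RIP is a `call`, whose return-address push from a zeroed stack pointer would fault at the address reported in CR2. This is one plausible reading of the dump; it says nothing about how RSP came to be zero.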

* Re: 2.6.17-rc2-mm1
  2006-05-01 14:24   ` 2.6.17-rc2-mm1 Martin J. Bligh
@ 2006-05-01 17:07     ` Andrew Morton
  2006-05-01 17:14       ` 2.6.17-rc2-mm1 Martin Bligh
  2006-05-01 17:19       ` 2.6.17-rc2-mm1 Badari Pulavarty
  2006-05-01 18:34     ` 2.6.17-rc2-mm1 Andi Kleen
  1 sibling, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2006-05-01 17:07 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linuxppc64-dev, ak, linux-kernel

"Martin J. Bligh" <mbligh@google.com> wrote:
>
> Andrew Morton wrote:
> > (I did s/linux-kernel@google.com/linux-kernel@vger.kernel.org/)
> > 
> > Martin Bligh <mbligh@google.com> wrote:
> > 
> >>Still crashes in LTP on x86_64:
> >>(introduced in previous release)
> >>
> >>http://test.kernel.org/abat/29674/debug/console.log
> > 
> > 
> > What a mess.  A doublefault inside an NMI watchdog timeout.  I think.  It's
> > hard to see.  Some CPUs are stuck on a CPU scheduler lock, others seem to
> > be stuck in flush_tlb_others.  One of these could be a consequence of the
> > other, or both could be a consequence of something else.
> 
> OK, well the latest one seems cleaner, on -rc3-mm1.
> http://test.kernel.org/abat/30007/debug/console.log
> 
> Just has the double fault, with no NMI watchdog timeouts. Not that
> it means any more to me, but still ;-) mtest01 seems to be able to
> reproduce this every time, but I don't have an appropriate box here
> to diagnose it with (this was a 4x Opteron inside IBM), and it's
> definitely something in -mm that's not in mainline.
> 
> M.
> 
> double fault: 0000 [1] SMP
> last sysfs file: /devices/pci0000:00/0000:00:06.0/resource
> CPU 0
> Modules linked in:
> Pid: 20519, comm: mtest01 Not tainted 2.6.17-rc3-mm1-autokern1 #1
> RIP: 0010:[<ffffffff8047c8b8>] <ffffffff8047c8b8>{__sched_text_start+1856}
> RSP: 0000:0000000000000000  EFLAGS: 00010082
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff805d9438
> RDX: ffff8100db12c0d0 RSI: ffffffff805d9438 RDI: ffff8100db12c0d0
> RBP: ffffffff805d9438 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: ffff8100e39bd440 R14: ffff810008003620 R15: 000002b02751726c
> FS:  0000000000000000(0000) GS:ffffffff805fa000(0063) knlGS:00000000f7dd0460
> CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> CR2: fffffffffffffff8 CR3: 00000000da399000 CR4: 00000000000006e0
> Process mtest01 (pid: 20519, threadinfo ffff8100b1bb4000, task 
> ffff8100db12c0d0)
> Stack: ffffffff80579e20 ffff8100db12c0d0 0000000000000001 ffffffff80579f58
>         0000000000000000 ffffffff80579e78 ffffffff8020b0b2 ffffffff80579f58
>         0000000000000000 ffffffff80485520
> Call Trace: <#DF> <ffffffff8020b0b2>{show_registers+140}
>         <ffffffff8020b357>{__die+159} <ffffffff8020b3cc>{die+50}
>         <ffffffff8020bba6>{do_double_fault+115} 
> <ffffffff8020aa91>{double_fault+125}
>         <ffffffff8047c8b8>{__sched_text_start+1856} <EOE>
> 
> Code: e8 4c ba d8 ff 65 48 8b 34 25 00 00 00 00 4c 8b 46 08 f0 41
> RIP <ffffffff8047c8b8>{__sched_text_start+1856} RSP <0000000000000000>
>   -- 0:conmux-control -- time-stamp -- May/01/06  3:54:37 --

I was not able to reproduce this on the 4-way EM64T machine.  Am a bit stuck.

* Re: 2.6.17-rc2-mm1
  2006-05-01 17:07     ` 2.6.17-rc2-mm1 Andrew Morton
@ 2006-05-01 17:14       ` Martin Bligh
  2006-05-01 17:19       ` 2.6.17-rc2-mm1 Badari Pulavarty
  1 sibling, 0 replies; 15+ messages in thread
From: Martin Bligh @ 2006-05-01 17:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linuxppc64-dev, ak, linux-kernel


>>double fault: 0000 [1] SMP
>>last sysfs file: /devices/pci0000:00/0000:00:06.0/resource
>>CPU 0
>>Modules linked in:
>>Pid: 20519, comm: mtest01 Not tainted 2.6.17-rc3-mm1-autokern1 #1
>>RIP: 0010:[<ffffffff8047c8b8>] <ffffffff8047c8b8>{__sched_text_start+1856}
>>RSP: 0000:0000000000000000  EFLAGS: 00010082
>>RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff805d9438
>>RDX: ffff8100db12c0d0 RSI: ffffffff805d9438 RDI: ffff8100db12c0d0
>>RBP: ffffffff805d9438 R08: 0000000000000000 R09: 0000000000000000
>>R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>>R13: ffff8100e39bd440 R14: ffff810008003620 R15: 000002b02751726c
>>FS:  0000000000000000(0000) GS:ffffffff805fa000(0063) knlGS:00000000f7dd0460
>>CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
>>CR2: fffffffffffffff8 CR3: 00000000da399000 CR4: 00000000000006e0
>>Process mtest01 (pid: 20519, threadinfo ffff8100b1bb4000, task 
>>ffff8100db12c0d0)
>>Stack: ffffffff80579e20 ffff8100db12c0d0 0000000000000001 ffffffff80579f58
>>        0000000000000000 ffffffff80579e78 ffffffff8020b0b2 ffffffff80579f58
>>        0000000000000000 ffffffff80485520
>>Call Trace: <#DF> <ffffffff8020b0b2>{show_registers+140}
>>        <ffffffff8020b357>{__die+159} <ffffffff8020b3cc>{die+50}
>>        <ffffffff8020bba6>{do_double_fault+115} 
>><ffffffff8020aa91>{double_fault+125}
>>        <ffffffff8047c8b8>{__sched_text_start+1856} <EOE>
>>
>>Code: e8 4c ba d8 ff 65 48 8b 34 25 00 00 00 00 4c 8b 46 08 f0 41
>>RIP <ffffffff8047c8b8>{__sched_text_start+1856} RSP <0000000000000000>
>>  -- 0:conmux-control -- time-stamp -- May/01/06  3:54:37 --
> 
> 
> I was not able to reproduce this on the 4-way EM64T machine.  Am a bit stuck.

OK, is there anything we could run this with that'd dump more info?
(eg debug patches or something). There's bugger all of use that I
can see in that stack (and why does __sched_text_start come up anyway,
is that an x86_64-ism ?). I suppose if we're really desperate, we can
play chop search, but that's very boring to try to do remotely ...

It's a couple-of-year-old 4x newisys box.

M.
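
On the question of why `__sched_text_start` appears in the trace: an address is usually symbolized by picking the closest symbol at or below it, so a linker section marker can be reported in place of a function name. The sketch below illustrates that lookup. It is not taken from the kernel's kallsyms code, and apart from the marker address implied by "+1856" the symbol table is hypothetical.

```python
import bisect

RIP = 0xFFFFFFFF8047C8B8
SECTION_START = RIP - 1856   # address implied by "__sched_text_start+1856"

# Hypothetical symbol table, sorted by address. Only the first address is
# derived from the oops; the second entry is invented for illustration.
SYMBOLS = [
    (SECTION_START, "__sched_text_start"),
    (SECTION_START + 0x900, "hypothetical_next_function"),
]


def symbolize(addr, symbols):
    """Return 'name+offset' for the closest symbol at or below addr."""
    addresses = [a for a, _ in symbols]
    i = bisect.bisect_right(addresses, addr) - 1
    if i < 0:
        return None              # addr is below every known symbol
    base, name = symbols[i]
    return f"{name}+{addr - base}"


if __name__ == "__main__":
    print(symbolize(RIP, SYMBOLS))
```

If a real function starts at the same address as the section marker, a lookup that keeps one name per address may report the marker instead of the function. Whether that is what happened here is an assumption; the thread does not settle it.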

* Re: 2.6.17-rc2-mm1
  2006-05-01 17:07     ` 2.6.17-rc2-mm1 Andrew Morton
  2006-05-01 17:14       ` 2.6.17-rc2-mm1 Martin Bligh
@ 2006-05-01 17:19       ` Badari Pulavarty
  2006-05-01 17:26         ` 2.6.17-rc2-mm1 Martin Bligh
  1 sibling, 1 reply; 15+ messages in thread
From: Badari Pulavarty @ 2006-05-01 17:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linuxppc64-dev, ak, lkml, Martin J. Bligh

On Mon, 2006-05-01 at 10:07 -0700, Andrew Morton wrote:
> "Martin J. Bligh" <mbligh@google.com> wrote:
> >
> > Andrew Morton wrote:
> > > (I did s/linux-kernel@google.com/linux-kernel@vger.kernel.org/)
> > > 
> > > Martin Bligh <mbligh@google.com> wrote:
> > > 
> > >>Still crashes in LTP on x86_64:
> > >>(introduced in previous release)
> > >>
> > >>http://test.kernel.org/abat/29674/debug/console.log
> > > 
> > > 
> > > What a mess.  A doublefault inside an NMI watchdog timeout.  I think.  It's
> > > hard to see.  Some CPUs are stuck on a CPU scheduler lock, others seem to
> > > be stuck in flush_tlb_others.  One of these could be a consequence of the
> > > other, or both could be a consequence of something else.
> > 
> > OK, well the latest one seems cleaner, on -rc3-mm1.
> > http://test.kernel.org/abat/30007/debug/console.log
> > 
> > Just has the double fault, with no NMI watchdog timeouts. Not that
> > it means any more to me, but still ;-) mtest01 seems to be able to
> > reproduce this every time, but I don't have an appropriate box here
> > to diagnose it with (this was a 4x Opteron inside IBM), and it's
> > definitely something in -mm that's not in mainline.
> > 
> > M.
> > 
> > double fault: 0000 [1] SMP
> > last sysfs file: /devices/pci0000:00/0000:00:06.0/resource
> > CPU 0
> > Modules linked in:
> > Pid: 20519, comm: mtest01 Not tainted 2.6.17-rc3-mm1-autokern1 #1
> > RIP: 0010:[<ffffffff8047c8b8>] <ffffffff8047c8b8>{__sched_text_start+1856}
> > RSP: 0000:0000000000000000  EFLAGS: 00010082
> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff805d9438
> > RDX: ffff8100db12c0d0 RSI: ffffffff805d9438 RDI: ffff8100db12c0d0
> > RBP: ffffffff805d9438 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > R13: ffff8100e39bd440 R14: ffff810008003620 R15: 000002b02751726c
> > FS:  0000000000000000(0000) GS:ffffffff805fa000(0063) knlGS:00000000f7dd0460
> > CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> > CR2: fffffffffffffff8 CR3: 00000000da399000 CR4: 00000000000006e0
> > Process mtest01 (pid: 20519, threadinfo ffff8100b1bb4000, task 
> > ffff8100db12c0d0)
> > Stack: ffffffff80579e20 ffff8100db12c0d0 0000000000000001 ffffffff80579f58
> >         0000000000000000 ffffffff80579e78 ffffffff8020b0b2 ffffffff80579f58
> >         0000000000000000 ffffffff80485520
> > Call Trace: <#DF> <ffffffff8020b0b2>{show_registers+140}
> >         <ffffffff8020b357>{__die+159} <ffffffff8020b3cc>{die+50}
> >         <ffffffff8020bba6>{do_double_fault+115} 
> > <ffffffff8020aa91>{double_fault+125}
> >         <ffffffff8047c8b8>{__sched_text_start+1856} <EOE>
> > 
> > Code: e8 4c ba d8 ff 65 48 8b 34 25 00 00 00 00 4c 8b 46 08 f0 41
> > RIP <ffffffff8047c8b8>{__sched_text_start+1856} RSP <0000000000000000>
> >   -- 0:conmux-control -- time-stamp -- May/01/06  3:54:37 --
> 
> I was not able to reproduce this on the 4-way EM64T machine.  Am a bit stuck.

I ran mtest01 multiple times with various options on my 4-way AMD64 box.
So far couldn't reproduce the problem (2.6.17-rc3-mm1).

Are there any special config or test options you are testing with ?

Thanks,
Badari

* Re: 2.6.17-rc2-mm1
  2006-05-01 17:19       ` 2.6.17-rc2-mm1 Badari Pulavarty
@ 2006-05-01 17:26         ` Martin Bligh
  2006-05-01 17:55           ` 2.6.17-rc2-mm1 Badari Pulavarty
  0 siblings, 1 reply; 15+ messages in thread
From: Martin Bligh @ 2006-05-01 17:26 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Andrew Morton, linuxppc64-dev, ak, lkml

> I ran mtest01 multiple times with various options on my 4-way AMD64 box.
> So far couldn't reproduce the problem (2.6.17-rc3-mm1).
> 
> Are there any special config or test options you are testing with ?

Config is here:

http://ftp.kernel.org/pub/linux/kernel/people/mbligh/config/abat/amd64

It's just doing "runalltests", I think.

M.

* Re: 2.6.17-rc2-mm1
  2006-05-01 17:26         ` 2.6.17-rc2-mm1 Martin Bligh
@ 2006-05-01 17:55           ` Badari Pulavarty
  2006-05-01 17:57             ` 2.6.17-rc2-mm1 Martin Bligh
  0 siblings, 1 reply; 15+ messages in thread
From: Badari Pulavarty @ 2006-05-01 17:55 UTC (permalink / raw)
  To: Martin Bligh; +Cc: Andrew Morton, linuxppc64-dev, ak, lkml

On Mon, 2006-05-01 at 10:26 -0700, Martin Bligh wrote:
> > I ran mtest01 multiple times with various options on my 4-way AMD64 box.
> > So far couldn't reproduce the problem (2.6.17-rc3-mm1).
> > 
> > Are there any special config or test options you are testing with ?
> 
> Config is here:
> 
> http://ftp.kernel.org/pub/linux/kernel/people/mbligh/config/abat/amd64
> 
> It's just doing "runalltests", I think.

FWIW, I tried your config file on my 4-way AMD64 (melody) box 
and ran latest "mtest01" fine.

I am now trying runalltests. I guess it's time to bisect :(

Thanks,
Badari
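
Bisecting a patch series such as -mm amounts to a binary search for the first patch whose presence makes the test fail. The sketch below shows the idea under stated assumptions: the base kernel is good, the full series is bad, and the failure reproduces reliably once the culprit is applied. `first_bad` and the patch names are hypothetical, not tooling used in this thread.

```python
def first_bad(patches, is_bad):
    """Binary-search an ordered patch series for the culprit.

    is_bad(applied) must return True when a kernel built with the given
    prefix of the series fails the test. Returns (culprit_index, builds).
    """
    good, bad = 0, len(patches)  # 'good' patches applied pass, 'bad' fail
    builds = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        builds += 1
        if is_bad(patches[:mid]):
            bad = mid            # culprit is among the first mid patches
        else:
            good = mid           # culprit comes after the first mid patches
    return bad - 1, builds


if __name__ == "__main__":
    series = [f"patch-{i:04d}" for i in range(1000)]   # invented names
    culprit = series[617]
    index, builds = first_bad(series, lambda applied: culprit in applied)
    print(series[index], "found after", builds, "test builds")
```

A series of 1000 patches needs at most 10 build-and-test cycles this way. The search is only trustworthy when the failure is deterministic; with an intermittent crash each step would need several runs before a prefix can be called good.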

* Re: 2.6.17-rc2-mm1
  2006-05-01 17:55           ` 2.6.17-rc2-mm1 Badari Pulavarty
@ 2006-05-01 17:57             ` Martin Bligh
  2006-05-01 18:32               ` 2.6.17-rc2-mm1 Andy Whitcroft
  0 siblings, 1 reply; 15+ messages in thread
From: Martin Bligh @ 2006-05-01 17:57 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Andrew Morton, linuxppc64-dev, ak, lkml

Badari Pulavarty wrote:
> On Mon, 2006-05-01 at 10:26 -0700, Martin Bligh wrote:
> 
>>>I ran mtest01 multiple times with various options on my 4-way AMD64 box.
>>>So far couldn't reproduce the problem (2.6.17-rc3-mm1).
>>>
>>>Are there any special config or test options you are testing with ?
>>
>>Config is here:
>>
>>http://ftp.kernel.org/pub/linux/kernel/people/mbligh/config/abat/amd64
>>
>>It's just doing "runalltests", I think.
> 
> 
> FWIW, I tried your config file on my 4-way AMD64 (melody) box 
> and ran latest "mtest01" fine.
> 
> I am now trying runalltests. I guess it's time to bisect :(

There was a panic on PPC64 during LTP too, but it seems to have gone
away with rc3-mm1. Not sure if it was really fixed, or just intermittent.

http://test.kernel.org/abat/29675/debug/console.log

M.

* Re: 2.6.17-rc2-mm1
  2006-05-01 17:57             ` 2.6.17-rc2-mm1 Martin Bligh
@ 2006-05-01 18:32               ` Andy Whitcroft
  2006-05-01 23:29                 ` 2.6.17-rc2-mm1 Badari Pulavarty
  0 siblings, 1 reply; 15+ messages in thread
From: Andy Whitcroft @ 2006-05-01 18:32 UTC (permalink / raw)
  To: Martin Bligh; +Cc: Andrew Morton, linuxppc64-dev, Badari Pulavarty, lkml, ak

Martin Bligh wrote:
> Badari Pulavarty wrote:
> 
>> On Mon, 2006-05-01 at 10:26 -0700, Martin Bligh wrote:
>>
>>>> I ran mtest01 multiple times with various options on my 4-way AMD64
>>>> box.
>>>> So far couldn't reproduce the problem (2.6.17-rc3-mm1).
>>>>
>>>> Are there any special config or test options you are testing with ?
>>>
>>>
>>> Config is here:
>>>
>>> http://ftp.kernel.org/pub/linux/kernel/people/mbligh/config/abat/amd64
>>>
>>> It's just doing "runalltests", I think.
>>
>>
>>
>> FWIW, I tried your config file on my 4-way AMD64 (melody) box and ran
>> latest "mtest01" fine.
>>
>> I am now trying runalltests. I guess it's time to bisect :(
> 
> 
> There was a panic on PPC64 during LTP too, but it seems to have gone
> away with rc3-mm1. Not sure if it was really fixed, or just intermittent.
> 
> http://test.kernel.org/abat/29675/debug/console.log

I think it's more intermittent than gone.  I've got another machine which
runs the same tests, and she threw a very similar failure on 2.6.17-rc3-mm1.

-apw

* Re: 2.6.17-rc2-mm1
  2006-05-01 14:24   ` 2.6.17-rc2-mm1 Martin J. Bligh
  2006-05-01 17:07     ` 2.6.17-rc2-mm1 Andrew Morton
@ 2006-05-01 18:34     ` Andi Kleen
  2006-05-02 13:20       ` 2.6.17-rc2-mm1 Andy Whitcroft
  1 sibling, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2006-05-01 18:34 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrew Morton, linuxppc64-dev, linux-kernel

On Monday 01 May 2006 16:24, Martin J. Bligh wrote:

> double fault: 0000 [1] SMP
> last sysfs file: /devices/pci0000:00/0000:00:06.0/resource
> CPU 0
> Modules linked in:
> Pid: 20519, comm: mtest01 Not tainted 2.6.17-rc3-mm1-autokern1 #1
> RIP: 0010:[<ffffffff8047c8b8>] <ffffffff8047c8b8>{__sched_text_start+1856}
> RSP: 0000:0000000000000000  EFLAGS: 00010082
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff805d9438
> RDX: ffff8100db12c0d0 RSI: ffffffff805d9438 RDI: ffff8100db12c0d0
> RBP: ffffffff805d9438 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: ffff8100e39bd440 R14: ffff810008003620 R15: 000002b02751726c
> FS:  0000000000000000(0000) GS:ffffffff805fa000(0063) knlGS:00000000f7dd0460
> CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> CR2: fffffffffffffff8 CR3: 00000000da399000 CR4: 00000000000006e0
> Process mtest01 (pid: 20519, threadinfo ffff8100b1bb4000, task 
> ffff8100db12c0d0)
> Stack: ffffffff80579e20 ffff8100db12c0d0 0000000000000001 ffffffff80579f58
>         0000000000000000 ffffffff80579e78 ffffffff8020b0b2 ffffffff80579f58
>         0000000000000000 ffffffff80485520
> Call Trace: <#DF> <ffffffff8020b0b2>{show_registers+140}
>         <ffffffff8020b357>{__die+159} <ffffffff8020b3cc>{die+50}
>         <ffffffff8020bba6>{do_double_fault+115} 
> <ffffffff8020aa91>{double_fault+125}
>         <ffffffff8047c8b8>{__sched_text_start+1856} <EOE>

That's really strange - I wonder why the backtracer can't find the original
stack. Should probably add some printk diagnostics here.

Can you send the output with this patch?

Index: linux/arch/x86_64/kernel/traps.c
===================================================================
--- linux.orig/arch/x86_64/kernel/traps.c
+++ linux/arch/x86_64/kernel/traps.c
@@ -238,6 +238,7 @@ void show_trace(unsigned long *stack)
 			HANDLE_STACK (stack < estack_end);
 			i += printk(" <EOE>");
 			stack = (unsigned long *) estack_end[-2];
+			printk("new stack %p (%lx %lx %lx %lx %lx)\n", stack, estack_end[0], estack_end[-1], estack_end[-2], estack_end[-3], estack_end[-4]);
 			continue;
 		}
 		if (irqstack_end) {
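
The indices in the patch follow the frame the CPU pushes when it switches to an exception stack in 64-bit mode: SS, RSP, RFLAGS, CS, RIP, then an error code for exceptions that have one. The sketch below models that layout as a Python list so that negative indices mirror `estack_end[-k]`. It assumes the frame sits directly at the top of the exception stack, and the SS value is a placeholder because the oops does not print it.

```python
def build_frame(rip, cs, rflags, rsp, ss, error_code=0):
    """Exception frame, lowest address first; the last element is the top word."""
    return [error_code, rip, cs, rflags, rsp, ss]


def resume_stack(frame):
    """Mimic `stack = (unsigned long *) estack_end[-2]`: the saved RSP."""
    return frame[-2]


# RIP, CS, EFLAGS and RSP are copied from the double-fault oops.
# SS = 0x18 is an assumed placeholder, not a value from the thread.
FRAME = build_frame(
    rip=0xFFFFFFFF8047C8B8,
    cs=0x10,
    rflags=0x00010082,
    rsp=0x0,
    ss=0x18,
)

if __name__ == "__main__":
    for k in range(1, 6):
        print(f"estack_end[-{k}] = {FRAME[-k]:#x}")
    print(f"backtrace would continue at {resume_stack(FRAME):#x}")
```

With the saved RSP reported as 0, the unwinder would be handed a null stack pointer after `<EOE>`, which may be why nothing from the original stack shows up in the trace. The extra words printed by the patch would show whether the saved frame really contains that zero.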

* Re: 2.6.17-rc2-mm1
  2006-05-01 18:32               ` 2.6.17-rc2-mm1 Andy Whitcroft
@ 2006-05-01 23:29                 ` Badari Pulavarty
  0 siblings, 0 replies; 15+ messages in thread
From: Badari Pulavarty @ 2006-05-01 23:29 UTC (permalink / raw)
  To: Andy Whitcroft; +Cc: Andrew Morton, linuxppc64-dev, lkml, Martin Bligh, ak



Andy Whitcroft wrote:

>Martin Bligh wrote:
>
>>Badari Pulavarty wrote:
>>
>>>On Mon, 2006-05-01 at 10:26 -0700, Martin Bligh wrote:
>>>
>>>>>I ran mtest01 multiple times with various options on my 4-way AMD64
>>>>>box.
>>>>>So far couldn't reproduce the problem (2.6.17-rc3-mm1).
>>>>>
>>>>>Are there any special config or test options you are testing with ?
>>>>>
>>>>
>>>>Config is here:
>>>>
>>>>http://ftp.kernel.org/pub/linux/kernel/people/mbligh/config/abat/amd64
>>>>
>>>>It's just doing "runalltests", I think.
>>>>
>>>
>>>
>>>FWIW, I tried your config file on my 4-way AMD64 (melody) box and ran
>>>latest "mtest01" fine.
>>>
>>>I am now trying runalltests. I guess it's time to bisect :(
>>>
>>
>>There was a panic on PPC64 during LTP too, but it seems to have gone
>>away with rc3-mm1. Not sure if it was really fixed, or just intermittent.
>>
>>http://test.kernel.org/abat/29675/debug/console.log
>>
>
>I think it's more intermittent than gone.  I've got another machine which
>runs the same tests, and she threw a very similar failure on 2.6.17-rc3-mm1.
>
I ran the whole LTP with 2.6.17-rc3-mm1 on my (2-way P710) Power box and
didn't see any crashes. I also ran LTP on the AMD64 box without any crashes.

Thanks,
Badari

* Re: 2.6.17-rc2-mm1
  2006-05-01 18:34     ` 2.6.17-rc2-mm1 Andi Kleen
@ 2006-05-02 13:20       ` Andy Whitcroft
  0 siblings, 0 replies; 15+ messages in thread
From: Andy Whitcroft @ 2006-05-02 13:20 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linuxppc64-dev, linux-kernel, Martin J. Bligh

Andi Kleen wrote:
> On Monday 01 May 2006 16:24, Martin J. Bligh wrote:
> 
> 
>>double fault: 0000 [1] SMP
>>last sysfs file: /devices/pci0000:00/0000:00:06.0/resource
>>CPU 0
>>Modules linked in:
>>Pid: 20519, comm: mtest01 Not tainted 2.6.17-rc3-mm1-autokern1 #1
>>RIP: 0010:[<ffffffff8047c8b8>] <ffffffff8047c8b8>{__sched_text_start+1856}
>>RSP: 0000:0000000000000000  EFLAGS: 00010082
>>RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff805d9438
>>RDX: ffff8100db12c0d0 RSI: ffffffff805d9438 RDI: ffff8100db12c0d0
>>RBP: ffffffff805d9438 R08: 0000000000000000 R09: 0000000000000000
>>R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>>R13: ffff8100e39bd440 R14: ffff810008003620 R15: 000002b02751726c
>>FS:  0000000000000000(0000) GS:ffffffff805fa000(0063) knlGS:00000000f7dd0460
>>CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
>>CR2: fffffffffffffff8 CR3: 00000000da399000 CR4: 00000000000006e0
>>Process mtest01 (pid: 20519, threadinfo ffff8100b1bb4000, task 
>>ffff8100db12c0d0)
>>Stack: ffffffff80579e20 ffff8100db12c0d0 0000000000000001 ffffffff80579f58
>>        0000000000000000 ffffffff80579e78 ffffffff8020b0b2 ffffffff80579f58
>>        0000000000000000 ffffffff80485520
>>Call Trace: <#DF> <ffffffff8020b0b2>{show_registers+140}
>>        <ffffffff8020b357>{__die+159} <ffffffff8020b3cc>{die+50}
>>        <ffffffff8020bba6>{do_double_fault+115} 
>><ffffffff8020aa91>{double_fault+125}
>>        <ffffffff8047c8b8>{__sched_text_start+1856} <EOE>
> 
> 
> That's really strange - I wonder why the backtracer can't find the original
> stack. Should probably add some printk diagnostics here.
> 
> Can you send the output with this patch?

Submitted, they should show up in the matrix forthwith.  Will drop you
the output when done.

-apw
