linuxppc-dev.lists.ozlabs.org archive mirror
* Kernel oops while duming user core.
@ 2008-01-31 13:45 Rune Torgersen
  2008-01-31 16:15 ` Nathan Lynch
  0 siblings, 1 reply; 21+ messages in thread
From: Rune Torgersen @ 2008-01-31 13:45 UTC (permalink / raw)
  To: linuxppc-dev

Hi

I get the following kernel oops while a user program of mine is dumping
core.
Any ideas on what to look for? (This is running 2.6.24, arch/powerpc, on
an 8280.)
When running the program on 2.6.18 arch/ppc, the program gets a sig 11
and dumps core.
On 2.6.24, I get the kernel oops, and then the program hangs around
forever and is unkillable.

Unable to handle kernel paging request for data at address 0x48024000
Faulting instruction address: 0xc000ef88
Oops: Kernel access of bad area, sig: 11 [#1]
PREEMPT Innovative Systems ApMax
Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7
drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P)
NIP: c000ef88 LR: c0012180 CTR: 00000080
REGS: eebc9b70 TRAP: 0300   Tainted: P         (2.6.24)
MSR: 00009032 <EE,ME,IR,DR>  CR: 24004442  XER: 00000000
DAR: 48024000, DSISR: 20000000
TASK = eebac3c0[3131] 'armd' THREAD: eebc8000
GPR00: ee9b7d00 eebc9c20 eebac3c0 48024000 00000080 399a4181 48024000
00000000
GPR08: 399a4181 ee9b7d00 00000000 c2000000 44004422 10100f38 ee82fc00
bfffffff
GPR16: ef377060 00000030 ee9b7d00 00000000 eebc9cdc 00000011 eebc9cd8
eeb96480
GPR24: ee9b7d00 399a4181 48024000 eeb9a370 eeb9a370 399a4181 48024000
c2733480
NIP [c000ef88] __flush_dcache_icache+0x14/0x40
LR [c0012180] update_mmu_cache+0x74/0x114
Call Trace:
[eebc9c20] [eebc8000] 0xeebc8000 (unreliable)
[eebc9c40] [c005d060] handle_mm_fault+0x630/0xbc0
[eebc9c80] [c005d9e4] get_user_pages+0x3f4/0x4fc
[eebc9cd0] [c00aa7c4] elf_core_dump+0x9a4/0xc5c
[eebc9d60] [c00779e4] do_coredump+0x6e0/0x748
[eebc9e50] [c002a5b0] get_signal_to_deliver+0x40c/0x45c
[eebc9e80] [c0008ce8] do_signal+0x50/0x294
[eebc9f40] [c000fb98] do_user_signal+0x74/0xc4
--- Exception: 300 at 0x10044efc
    LR = 0x10044ec0
Instruction dump:
4d820020 7c8903a6 7c001bac 38630020 4200fff8 7c0004ac 4e800020 60000000
54630026 38800080 7c8903a6 7c661b78 <7c00186c> 38630020 4200fff8
7c0004ac
---[ end trace 97db37eaf213da3c ]---
note: armd[3131] exited with preempt_count 2

* Re: Kernel oops while duming user core.
  2008-01-31 13:45 Kernel oops while duming user core Rune Torgersen
@ 2008-01-31 16:15 ` Nathan Lynch
  2008-01-31 16:26   ` Rune Torgersen
  2008-02-01 17:38   ` Scott Wood
  0 siblings, 2 replies; 21+ messages in thread
From: Nathan Lynch @ 2008-01-31 16:15 UTC (permalink / raw)
  To: Rune Torgersen; +Cc: linuxppc-dev

Rune Torgersen wrote:
> Hi
> 
> I get the following kernel oops while a user program of mine is dumping
> core.
> Any ideas on what to look for? (This is running 2.6.24, arch/powerpc, on
> an 8280.)
> When running the program on 2.6.18 arch/ppc, the program gets a sig 11
> and dumps core.
> On 2.6.24, I get the kernel oops, and then the program hangs around
> forever and is unkillable.

Hmm, this is the second report of 2.6.24 crashing in
__flush_dcache_icache during a core dump; see:
http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html

Is this easily recreatable?

> 
> Unable to handle kernel paging request for data at address 0x48024000
> Faulting instruction address: 0xc000ef88
> Oops: Kernel access of bad area, sig: 11 [#1]
> PREEMPT Innovative Systems ApMax
> Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7
> drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P)
> NIP: c000ef88 LR: c0012180 CTR: 00000080
> REGS: eebc9b70 TRAP: 0300   Tainted: P         (2.6.24)
> MSR: 00009032 <EE,ME,IR,DR>  CR: 24004442  XER: 00000000
> DAR: 48024000, DSISR: 20000000
> TASK = eebac3c0[3131] 'armd' THREAD: eebc8000
> GPR00: ee9b7d00 eebc9c20 eebac3c0 48024000 00000080 399a4181 48024000
> 00000000
> GPR08: 399a4181 ee9b7d00 00000000 c2000000 44004422 10100f38 ee82fc00
> bfffffff
> GPR16: ef377060 00000030 ee9b7d00 00000000 eebc9cdc 00000011 eebc9cd8
> eeb96480
> GPR24: ee9b7d00 399a4181 48024000 eeb9a370 eeb9a370 399a4181 48024000
> c2733480
> NIP [c000ef88] __flush_dcache_icache+0x14/0x40
> LR [c0012180] update_mmu_cache+0x74/0x114
> Call Trace:
> [eebc9c20] [eebc8000] 0xeebc8000 (unreliable)
> [eebc9c40] [c005d060] handle_mm_fault+0x630/0xbc0
> [eebc9c80] [c005d9e4] get_user_pages+0x3f4/0x4fc
> [eebc9cd0] [c00aa7c4] elf_core_dump+0x9a4/0xc5c
> [eebc9d60] [c00779e4] do_coredump+0x6e0/0x748
> [eebc9e50] [c002a5b0] get_signal_to_deliver+0x40c/0x45c
> [eebc9e80] [c0008ce8] do_signal+0x50/0x294
> [eebc9f40] [c000fb98] do_user_signal+0x74/0xc4
> --- Exception: 300 at 0x10044efc
>     LR = 0x10044ec0
> Instruction dump:
> 4d820020 7c8903a6 7c001bac 38630020 4200fff8 7c0004ac 4e800020 60000000
> 54630026 38800080 7c8903a6 7c661b78 <7c00186c> 38630020 4200fff8
> 7c0004ac
> ---[ end trace 97db37eaf213da3c ]---
> note: armd[3131] exited with preempt_count 2


* RE: Kernel oops while duming user core.
  2008-01-31 16:15 ` Nathan Lynch
@ 2008-01-31 16:26   ` Rune Torgersen
  2008-01-31 17:40     ` Rune Torgersen
  2008-01-31 19:15     ` Kernel oops while duming " Kumar Gala
  2008-02-01 17:38   ` Scott Wood
  1 sibling, 2 replies; 21+ messages in thread
From: Rune Torgersen @ 2008-01-31 16:26 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: linuxppc-dev

Nathan Lynch wrote:
> Hmm, this is the second report of 2.6.24 crashing in
> __flush_dcache_icache during a core dump; see:
> http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html
>
> Is this easily recreatable?

Yes. I have a binary that does this every time it is started (on this
particular system); it only takes about 10 seconds before it dumps.

I was going to test HEAD of powerpc.git to see if it is still there.
I cannot test any earlier versions as our board port was done on 2.6.24.

Our older kernel port is 2.6.18 on arch/ppc, and it works just fine.


One potential clue:
> Unable to handle kernel paging request for data at address 0x48024000

This address is beyond our physical memory. We have 1GB of memory
(CONFIG_HIGH_MEM enabled) so 0x3fff_ffff is the last valid address.
0x4000_0000 to 0x7fff_ffff are unused, 0x8000_0000 to 0x9fff_ffff is
used by PCI.


* RE: Kernel oops while duming user core.
  2008-01-31 16:26   ` Rune Torgersen
@ 2008-01-31 17:40     ` Rune Torgersen
  2008-01-31 19:15       ` Kumar Gala
  2008-01-31 20:16       ` Scott Wood
  2008-01-31 19:15     ` Kernel oops while duming " Kumar Gala
  1 sibling, 2 replies; 21+ messages in thread
From: Rune Torgersen @ 2008-01-31 17:40 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: linuxppc-dev

Rune Torgersen wrote:
> I was going to test HEAD of powerpc.git to see if it is still there.

Still there. I also used GDB on the vmlinux image to get the source and
disassembly for the oops:
Unable to handle kernel paging request for data at address 0x48024000
Faulting instruction address: 0xc000f0a0
Oops: Kernel access of bad area, sig: 11 [#1]
PREEMPT Innovative Systems ApMax
Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7
drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P)
NIP: c000f0a0 LR: c0011fec CTR: 00000080
REGS: eebe9b70 TRAP: 0300   Tainted: P         (2.6.24-test)
MSR: 00009032 <EE,ME,IR,DR>  CR: 24004442  XER: 00000000
DAR: 48024000, DSISR: 20000000
TASK = eeba9780[2554] 'armd_crash' THREAD: eebe8000
GPR00: eea44d00 eebe9c20 eeba9780 48024000 00000080 37a56181 48024000
00000000
GPR08: 37a56181 eea44d00 00000000 c2000000 44004422 10100f38 ef336600
bfffffff
GPR16: eeff0300 00000030 eea44d00 00000000 eebe9cdc 00000011 eebe9cd8
eebca480
GPR24: eea44d00 37a56181 48024000 eebad580 eebad580 37a56181 48024000
c26f4ac0
NIP [c000f0a0] __flush_dcache_icache+0x14/0x40
LR [c0011fec] update_mmu_cache+0x74/0x114
Call Trace:
[eebe9c20] [eebe8000] 0xeebe8000 (unreliable)
[eebe9c40] [c005cfd0] handle_mm_fault+0x630/0xbc0
[eebe9c80] [c005d954] get_user_pages+0x3f4/0x4fc
[eebe9cd0] [c00aa730] elf_core_dump+0x9a4/0xc5c
[eebe9d60] [c0077954] do_coredump+0x6e0/0x748
[eebe9e50] [c002a520] get_signal_to_deliver+0x40c/0x45c
[eebe9e80] [c0008cec] do_signal+0x50/0x294
[eebe9f40] [c000fc9c] do_user_signal+0x74/0xc4
--- Exception: 300 at 0x10044efc
    LR = 0x10044ec0
Instruction dump:
4d820020 7c8903a6 7c001bac 38630020 4200fff8 7c0004ac 4e800020 60000000
54630026 38800080 7c8903a6 7c661b78 <7c00186c> 38630020 4200fff8
7c0004ac
---[ end trace 37755b0fb9e79677 ]---
note: armd_crash[2554] exited with preempt_count 2

backtrace using gdb on vmlinux image:

0xc00aa730 is in elf_core_dump (fs/binfmt_elf.c:1762).
1757
1758                    for (addr = vma->vm_start; addr < end; addr += PAGE_SIZE) {
1759                            struct page *page;
1760                            struct vm_area_struct *vma;
1761
1762                            if (get_user_pages(current, current->mm, addr, 1, 0, 1,
1763                                                    &page, &vma) <= 0) {
1764                                    DUMP_SEEK(PAGE_SIZE);
1765                            } else {
1766                                    if (page == ZERO_PAGE(0)) {
(gdb) list *0xc005d954
0xc005d954 is in get_user_pages (mm/memory.c:1072).
1067                            cond_resched();
1068                            while (!(page = follow_page(vma, start, foll_flags))) {
1069                                    int ret;
1070                                    ret = handle_mm_fault(mm, vma, start,
1071                                                    foll_flags & FOLL_WRITE);
1072                                    if (ret & VM_FAULT_ERROR) {
1073                                            if (ret & VM_FAULT_OOM)
1074                                                    return i ? i : -ENOMEM;
1075                                            else if (ret & VM_FAULT_SIGBUS)
1076                                                    return i ? i : -EFAULT;
(gdb) list *0xc005cfd0
0xc005cfd0 is in handle_mm_fault (include/asm/thread_info.h:99).
94      {
95              register unsigned long sp asm("r1");
96
97              /* gcc4, at least, is smart enough to turn this into a single
98               * rlwinm for ppc32 and clrrdi for ppc64 */
99              return (struct thread_info *)(sp & ~(THREAD_SIZE-1));
100     }
101
102     #endif /* __ASSEMBLY__ */
103
(gdb)
(gdb) list *0xc0011fec
0xc0011fec is in update_mmu_cache (arch/powerpc/mm/mem.c:489).
484                     _tlbie(address, 0 /* 8xx doesn't care about PID */);
485     #endif
486                     if (!PageReserved(page)
487                         && !test_bit(PG_arch_1, &page->flags)) {
488                             if (vma->vm_mm == current->active_mm) {
489                                     __flush_dcache_icache((void *) address);
490                             } else
491                                     flush_dcache_icache_page(page);
492                             set_bit(PG_arch_1, &page->flags);
493                     }
(gdb) list *0xc000f0a0
No source file for address 0xc000f0a0.
(gdb) disassemble 0xc000f0a0
Dump of assembler code for function __flush_dcache_icache:
0xc000f08c <__flush_dcache_icache+0>:   dec    %esi
0xc000f08d <__flush_dcache_icache+1>:   addb   $0x20,(%eax)
0xc000f090 <__flush_dcache_icache+4>:   push   %esp
0xc000f091 <__flush_dcache_icache+5>:   arpl   %ax,(%eax)
0xc000f093 <__flush_dcache_icache+7>:   cmp    %al,%es:0x897c8000(%eax)
0xc000f09a <__flush_dcache_icache+14>:  add    0x781b667c(%esi),%esp
0xc000f0a0 <__flush_dcache_icache+20>:  jl     0xc000f0a2
<__flush_dcache_icache+22>
0xc000f0a2 <__flush_dcache_icache+22>:  sbb    %ch,0x63(%eax,%edi,1)
0xc000f0a6 <__flush_dcache_icache+26>:  add    %ah,(%eax)
0xc000f0a8 <__flush_dcache_icache+28>:  inc    %edx
0xc000f0a9 <__flush_dcache_icache+29>:  add    %bh,%bh
0xc000f0ab <__flush_dcache_icache+31>:  clc
0xc000f0ac <__flush_dcache_icache+32>:  jl     0xc000f0ae
<__flush_dcache_icache+34>
0xc000f0ae <__flush_dcache_icache+34>:  add    $0xac,%al
0xc000f0b0 <__flush_dcache_icache+36>:  jl     0xc000f03b
<flush_dcache_range+15>
0xc000f0b2 <__flush_dcache_icache+38>:  add    0xac37007c(%esi),%esp
0xc000f0b8 <__flush_dcache_icache+44>:  cmp    %al,%dh
0xc000f0ba <__flush_dcache_icache+46>:  add    %ah,(%eax)
0xc000f0bc <__flush_dcache_icache+48>:  inc    %edx
0xc000f0bd <__flush_dcache_icache+49>:  add    %bh,%bh
0xc000f0bf <__flush_dcache_icache+51>:  clc
0xc000f0c0 <__flush_dcache_icache+52>:  jl     0xc000f0c2
<__flush_dcache_icache+54>
0xc000f0c2 <__flush_dcache_icache+54>:  add    $0xac,%al
0xc000f0c4 <__flush_dcache_icache+56>:  dec    %esp
0xc000f0c5 <__flush_dcache_icache+57>:  add    %al,(%ecx)
0xc000f0c7 <__flush_dcache_icache+59>:  sub    $0x4e,%al
0xc000f0c9 <__flush_dcache_icache+61>:  addb   $0x20,(%eax)
End of assembler dump.
(gdb)


* Re: Kernel oops while duming user core.
  2008-01-31 17:40     ` Rune Torgersen
@ 2008-01-31 19:15       ` Kumar Gala
  2008-01-31 19:18         ` Rune Torgersen
  2008-01-31 20:16       ` Scott Wood
  1 sibling, 1 reply; 21+ messages in thread
From: Kumar Gala @ 2008-01-31 19:15 UTC (permalink / raw)
  To: Rune Torgersen; +Cc: linuxppc-dev, Nathan Lynch

>                }
> (gdb) list *0xc000f0a0
> No source file for address 0xc000f0a0.
> (gdb) disassemble 0xc000f0a0
> Dump of assembler code for function __flush_dcache_icache:
> 0xc000f08c <__flush_dcache_icache+0>:   dec    %esi
> 0xc000f08d <__flush_dcache_icache+1>:   addb   $0x20,(%eax)
> 0xc000f090 <__flush_dcache_icache+4>:   push   %esp
> 0xc000f091 <__flush_dcache_icache+5>:   arpl   %ax,(%eax)
> 0xc000f093 <__flush_dcache_icache+7>:   cmp    %al,%es: 
> 0x897c8000(%eax)
> 0xc000f09a <__flush_dcache_icache+14>:  add    0x781b667c(%esi),%esp
> 0xc000f0a0 <__flush_dcache_icache+20>:  jl     0xc000f0a2
> <__flush_dcache_icache+22>
> 0xc000f0a2 <__flush_dcache_icache+22>:  sbb    %ch,0x63(%eax,%edi,1)
> 0xc000f0a6 <__flush_dcache_icache+26>:  add    %ah,(%eax)
> 0xc000f0a8 <__flush_dcache_icache+28>:  inc    %edx
> 0xc000f0a9 <__flush_dcache_icache+29>:  add    %bh,%bh
> 0xc000f0ab <__flush_dcache_icache+31>:  clc
> 0xc000f0ac <__flush_dcache_icache+32>:  jl     0xc000f0ae
> <__flush_dcache_icache+34>
> 0xc000f0ae <__flush_dcache_icache+34>:  add    $0xac,%al
> 0xc000f0b0 <__flush_dcache_icache+36>:  jl     0xc000f03b
> <flush_dcache_range+15>
> 0xc000f0b2 <__flush_dcache_icache+38>:  add    0xac37007c(%esi),%esp
> 0xc000f0b8 <__flush_dcache_icache+44>:  cmp    %al,%dh
> 0xc000f0ba <__flush_dcache_icache+46>:  add    %ah,(%eax)
> 0xc000f0bc <__flush_dcache_icache+48>:  inc    %edx
> 0xc000f0bd <__flush_dcache_icache+49>:  add    %bh,%bh
> 0xc000f0bf <__flush_dcache_icache+51>:  clc
> 0xc000f0c0 <__flush_dcache_icache+52>:  jl     0xc000f0c2
> <__flush_dcache_icache+54>
> 0xc000f0c2 <__flush_dcache_icache+54>:  add    $0xac,%al
> 0xc000f0c4 <__flush_dcache_icache+56>:  dec    %esp
> 0xc000f0c5 <__flush_dcache_icache+57>:  add    %al,(%ecx)
> 0xc000f0c7 <__flush_dcache_icache+59>:  sub    $0x4e,%al
> 0xc000f0c9 <__flush_dcache_icache+61>:  addb   $0x20,(%eax)
> End of assembler dump.

This doesn't look like ppc disasm to me :)

- k


* Re: Kernel oops while duming user core.
  2008-01-31 16:26   ` Rune Torgersen
  2008-01-31 17:40     ` Rune Torgersen
@ 2008-01-31 19:15     ` Kumar Gala
  2008-01-31 19:23       ` Rune Torgersen
  1 sibling, 1 reply; 21+ messages in thread
From: Kumar Gala @ 2008-01-31 19:15 UTC (permalink / raw)
  To: Rune Torgersen; +Cc: linuxppc-dev, Nathan Lynch


On Jan 31, 2008, at 10:26 AM, Rune Torgersen wrote:

> Nathan Lynch wrote:
>> Hmm, this is the second report of 2.6.24 crashing in
>> __flush_dcache_icache during a core dump; see:
>> http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html
>>
>> Is this easily recreatable?
>
> Yes. I have a binary that will do this every time it is started (on  
> this
> particular system),
> only takes about 10 seconds before it dumps.
>
> I was going to test HEAD of powerpc.git to see if it is still there.
> I cannot test any earlier versions as our board port was done on  
> 2.6.24.
>
> Our older kernel port is 2.6.18 on arch/ppc, and it works just fine.
>
>
> One potential clue:
>> Unable to handle kernel paging request for data at address 0x48024000
>
> This address is beyond our physical memory. We have 1GB of memory
> (CONFIG_HIGH_MEM enabled) so 0x3fff_ffff is the last valid address.
> 0x4000_0000 to 0x7fff_ffff are unused, 0x8000_0000 to 0x9fff_ffff is
> used by PCI.


Can you git-bisect to narrow this down further.

- k


* RE: Kernel oops while duming user core.
  2008-01-31 19:15       ` Kumar Gala
@ 2008-01-31 19:18         ` Rune Torgersen
  0 siblings, 0 replies; 21+ messages in thread
From: Rune Torgersen @ 2008-01-31 19:18 UTC (permalink / raw)
  To: Kumar Gala; +Cc: linuxppc-dev, Nathan Lynch

Kumar Gala wrote:
> This doesn't look like ppc disasm to me :)
>

It helps if I use the cross-compiler gdb instead of the native x86 one.
Here is the disassembly dump for NIP:

(gdb) disassemble 0xc000f0a0
Dump of assembler code for function __flush_dcache_icache:
0xc000f08c <__flush_dcache_icache+0>:   blr
0xc000f090 <__flush_dcache_icache+4>:   rlwinm  r3,r3,0,0,19
0xc000f094 <__flush_dcache_icache+8>:   li      r4,128
0xc000f098 <__flush_dcache_icache+12>:  mtctr   r4
0xc000f09c <__flush_dcache_icache+16>:  mr      r6,r3
0xc000f0a0 <__flush_dcache_icache+20>:  dcbst   r0,r3
0xc000f0a4 <__flush_dcache_icache+24>:  addi    r3,r3,32
0xc000f0a8 <__flush_dcache_icache+28>:  bdnz+   0xc000f0a0 <__flush_dcache_icache+20>
0xc000f0ac <__flush_dcache_icache+32>:  sync
0xc000f0b0 <__flush_dcache_icache+36>:  mtctr   r4
0xc000f0b4 <__flush_dcache_icache+40>:  icbi    r0,r6
0xc000f0b8 <__flush_dcache_icache+44>:  addi    r6,r6,32
0xc000f0bc <__flush_dcache_icache+48>:  bdnz+   0xc000f0b4 <__flush_dcache_icache+40>
0xc000f0c0 <__flush_dcache_icache+52>:  sync
0xc000f0c4 <__flush_dcache_icache+56>:  isync
0xc000f0c8 <__flush_dcache_icache+60>:  blr
End of assembler dump.
(gdb)

registers were:
NIP: c000f0a0 LR: c0011fec CTR: 00000080
REGS: eebe9b70 TRAP: 0300   Tainted: P         (2.6.24-test)
MSR: 00009032 <EE,ME,IR,DR>  CR: 24004442  XER: 00000000
DAR: 48024000, DSISR: 20000000
TASK = eeba9780[2554] 'armd_crash' THREAD: eebe8000
GPR00: eea44d00 eebe9c20 eeba9780 48024000 00000080 37a56181 48024000
00000000
GPR08: 37a56181 eea44d00 00000000 c2000000 44004422 10100f38 ef336600
bfffffff
GPR16: eeff0300 00000030 eea44d00 00000000 eebe9cdc 00000011 eebe9cd8
eebca480
GPR24: eea44d00 37a56181 48024000 eebad580 eebad580 37a56181 48024000
c26f4ac0
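
As a rough cross-check of the disassembly against the register dump, the
loop arithmetic can be sketched in C. This is only an illustration: the
page and cache-line sizes are assumptions read off the listing above, and
the names page_base and lines_per_page are made up here, not kernel
functions. rlwinm r3,r3,0,0,19 keeps the top 20 bits of the address, which
rounds it down to a 4096-byte boundary, and each loop then steps 32 bytes
at a time for 128 iterations.

```c
#include <stdint.h>

/* Assumed values read off the disassembly, not taken from kernel headers. */
#define PAGE_SIZE_BYTES  4096u  /* rlwinm r3,r3,0,0,19 clears the low 12 bits */
#define CACHE_LINE_BYTES 32u    /* addi r3,r3,32 advances one cache line */

/* Round an address down to the start of its page (the rlwinm step). */
static uint32_t page_base(uint32_t addr)
{
    return addr & ~(PAGE_SIZE_BYTES - 1u);
}

/* Number of dcbst (or icbi) iterations needed to cover one page;
 * this is the value loaded into CTR by li r4,128 / mtctr r4. */
static uint32_t lines_per_page(void)
{
    return PAGE_SIZE_BYTES / CACHE_LINE_BYTES;
}
```

Under these assumptions, page_base() of any address inside the faulting
page gives 0x48024000 and lines_per_page() gives 128 (0x80). That agrees
with r3 = DAR = 0x48024000 and CTR = 0x80 in the dump, which suggests the
first dcbst of the page is the one that faulted.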


* RE: Kernel oops while duming user core.
  2008-01-31 19:15     ` Kernel oops while duming " Kumar Gala
@ 2008-01-31 19:23       ` Rune Torgersen
  2008-01-31 19:54         ` Nathan Lynch
  0 siblings, 1 reply; 21+ messages in thread
From: Rune Torgersen @ 2008-01-31 19:23 UTC (permalink / raw)
  To: Kumar Gala; +Cc: linuxppc-dev, Nathan Lynch

Kumar Gala wrote:
> Can you git-bisect to narrow this down further.

Not easily, as the board port to arch/powerpc was done on 2.6.24-rc7
and up.
Is there a somewhat easy way in git to apply the differences between the
master branch and our board branch to a branch created by bisect?

And I don't even know where this started to happen.
Would trying arch/ppc help any? I have our arch/ppc port in a
semiworking state for kernels up to 2.6.23


* Re: Kernel oops while duming user core.
  2008-01-31 19:23       ` Rune Torgersen
@ 2008-01-31 19:54         ` Nathan Lynch
  0 siblings, 0 replies; 21+ messages in thread
From: Nathan Lynch @ 2008-01-31 19:54 UTC (permalink / raw)
  To: Rune Torgersen; +Cc: linuxppc-dev

Rune Torgersen wrote:
> Kumar Gala wrote:
> > Can you git-bisect to narrow this down further.
> 
> Not easily, as the board port to arch/powerpc was done on 2.6.24-rc7
> and up.
> Is there a somewhat easy way in git to apply the differences between the
> master branch and our board branch to a branch created by bisect?
> 
> And I don't even know where this started to happen.
> Would trying arch/ppc help any? I have our arch/ppc port in a
> semiworking state for kernels up to 2.6.23

Well, we know this happens on other 32-bit powerpc machines (pmac at
least)... perhaps someone could arrange to bisect on a machine that
works with older powerpc kernels (assuming they have a good repro
case).


* Re: Kernel oops while duming user core.
  2008-01-31 17:40     ` Rune Torgersen
  2008-01-31 19:15       ` Kumar Gala
@ 2008-01-31 20:16       ` Scott Wood
  2008-01-31 20:19         ` Rune Torgersen
                           ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Scott Wood @ 2008-01-31 20:16 UTC (permalink / raw)
  To: Rune Torgersen; +Cc: linuxppc-dev, Nathan Lynch

On Thu, Jan 31, 2008 at 11:40:04AM -0600, Rune Torgersen wrote:
> Unable to handle kernel paging request for data at address 0x48024000
> Faulting instruction address: 0xc000f0a0
> Oops: Kernel access of bad area, sig: 11 [#1]
> PREEMPT Innovative Systems ApMax

Does it happen without preempt?

> Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7
> drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P)
> NIP: c000f0a0 LR: c0011fec CTR: 00000080
> REGS: eebe9b70 TRAP: 0300   Tainted: P         (2.6.24-test)

Does it happen without the modules?

> MSR: 00009032 <EE,ME,IR,DR>  CR: 24004442  XER: 00000000
> DAR: 48024000, DSISR: 20000000

Hmm, this doesn't look like a valid DSISR, so I'm guessing this was a TLB
miss that got redirected to DataAccess (or is there something that causes
DSISR[2] to be set on 8280?  I didn't see anything in the manual...).
However, SRR1 in that case seems to indicate a store, which dcbst shouldn't
generate (except on 8xx, according to the comment in update_mmu_cache).
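
For anyone less used to the PowerPC convention: register bits are numbered
from the most significant end, so DSISR[2] is the third bit from the top of
the 32-bit value. A small sketch of that conversion follows; ibm_bit and
ibm_bit_is_set are hypothetical helper names for illustration, not kernel
macros.

```c
#include <stdint.h>

/* PowerPC manuals number bits from the most significant end:
 * bit 0 is 0x80000000 and bit 31 is 0x00000001 in a 32-bit register.
 * n must be in the range 0..31. */
static uint32_t ibm_bit(unsigned int n)
{
    return UINT32_C(1) << (31u - n);
}

/* Returns nonzero if bit n (IBM numbering) is set in a register value. */
static int ibm_bit_is_set(uint32_t reg, unsigned int n)
{
    return (reg & ibm_bit(n)) != 0;
}
```

With the DSISR value from the oops, ibm_bit(2) is 0x20000000, so
ibm_bit_is_set(0x20000000, 2) is true and no other bit is set.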

Do you have a simple test case that we could try to reproduce?  I tried a
simple core dump on an 8280, and it worked.

Failing that, I'd add code to the page fault handler to dump what is (or
isn't) supposed to be mapped at the faulting address, and something to track
which (if any) TLB miss exception it came through.

-Scott


* RE: Kernel oops while duming user core.
  2008-01-31 20:16       ` Scott Wood
@ 2008-01-31 20:19         ` Rune Torgersen
  2008-01-31 20:38         ` Rune Torgersen
  2008-01-31 20:41         ` Nathan Lynch
  2 siblings, 0 replies; 21+ messages in thread
From: Rune Torgersen @ 2008-01-31 20:19 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev, Nathan Lynch

Scott Wood wrote:
> Does it happen without preempt?

Will try shortly, just updated my git to HEAD of Linus's tree
>
>> Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7
>> drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P)
>> NIP: c000f0a0 LR: c0011fec CTR: 00000080
>> REGS: eebe9b70 TRAP: 0300   Tainted: P         (2.6.24-test)
>
> Does it happen without the modules?
Cannot test without most of them.

> Do you have a simple test case that we could try to
> reproduce?  I tried a
> simple core dump on an 8280, and it worked.

I do not have a test case, except an app for our board that does this
reliably after about 10 seconds.

> Failing that, I'd add code to the page fault handler to dump what is
> (or isn't) supposed to be mapped at the faulting address, and
> something to track which (if any) TLB miss exception it came through.

I can test code.


* RE: Kernel oops while duming user core.
  2008-01-31 20:16       ` Scott Wood
  2008-01-31 20:19         ` Rune Torgersen
@ 2008-01-31 20:38         ` Rune Torgersen
  2008-01-31 20:41         ` Nathan Lynch
  2 siblings, 0 replies; 21+ messages in thread
From: Rune Torgersen @ 2008-01-31 20:38 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev, Nathan Lynch

Scott Wood wrote:
> Does it happen without preempt?

Yes


* Re: Kernel oops while duming user core.
  2008-01-31 20:16       ` Scott Wood
  2008-01-31 20:19         ` Rune Torgersen
  2008-01-31 20:38         ` Rune Torgersen
@ 2008-01-31 20:41         ` Nathan Lynch
  2008-01-31 20:45           ` Rune Torgersen
  2008-01-31 20:55           ` Scott Wood
  2 siblings, 2 replies; 21+ messages in thread
From: Nathan Lynch @ 2008-01-31 20:41 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev

Scott Wood wrote:
> On Thu, Jan 31, 2008 at 11:40:04AM -0600, Rune Torgersen wrote:
> > Unable to handle kernel paging request for data at address 0x48024000
> > Faulting instruction address: 0xc000f0a0
> > Oops: Kernel access of bad area, sig: 11 [#1]
> > PREEMPT Innovative Systems ApMax
> 
> Does it happen without preempt?
> 
> > Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7
> > drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P)
> > NIP: c000f0a0 LR: c0011fec CTR: 00000080
> > REGS: eebe9b70 TRAP: 0300   Tainted: P         (2.6.24-test)
> 
> Does it happen without the modules?

I doubt the modules are the problem; there was a practically identical
report from someone with an untainted 2.6.24-rc kernel a few weeks ago
(see my first reply to Rune).

> 
> > MSR: 00009032 <EE,ME,IR,DR>  CR: 24004442  XER: 00000000
> > DAR: 48024000, DSISR: 20000000
> 
> Hmm, this doesn't look like a valid DSISR, so I'm guessing this was a TLB
> miss that got redirected to DataAccess (or is there something that causes
> DSISR[2] to be set on 8280?  I didn't see anything in the manual...).
> However, SRR1 in that case seems to indicate a store, which dcbst shouldn't
> generate (except on 8xx, according to the comment in update_mmu_cache).
> 
> Do you have a simple test case that we could try to reproduce?  I tried a
> simple core dump on an 8280, and it worked.

Is the crashing program multithreaded?  The first report had firefox
triggering the oops.


* RE: Kernel oops while duming user core.
  2008-01-31 20:41         ` Nathan Lynch
@ 2008-01-31 20:45           ` Rune Torgersen
  2008-01-31 20:55           ` Scott Wood
  1 sibling, 0 replies; 21+ messages in thread
From: Rune Torgersen @ 2008-01-31 20:45 UTC (permalink / raw)
  To: Nathan Lynch, Scott Wood; +Cc: linuxppc-dev

Nathan Lynch wrote:
> Scott Wood wrote:
>> Do you have a simple test case that we could try to reproduce?  I
>> tried a simple core dump on an 8280, and it worked.
>
> Is the crashing program multithreaded?  The first report had firefox
> triggering the oops.

The crashing program has 10 threads. (NPTL pthreads, glibc-2.5, gcc
4.1.2)


* Re: Kernel oops while duming user core.
  2008-01-31 20:41         ` Nathan Lynch
  2008-01-31 20:45           ` Rune Torgersen
@ 2008-01-31 20:55           ` Scott Wood
  2008-01-31 21:58             ` Scott Wood
  1 sibling, 1 reply; 21+ messages in thread
From: Scott Wood @ 2008-01-31 20:55 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: linuxppc-dev

Nathan Lynch wrote:
> I doubt the modules are the problem; there was a practically identical
> report from someone with an untainted 2.6.24-rc kernel a few weeks ago
> (see my first reply to Rune).

I didn't think they were; I was just trying to eliminate the low hanging 
fruit and get a simpler testcase. :-)

>> Do you have a simple test case that we could try to reproduce?  I tried a
>> simple core dump on an 8280, and it worked.
> 
> Is the crashing program multithreaded?  The first report had firefox
> triggering the oops.

OK, I've got a test program that triggers it now.  I'll see if I can 
figure out what's going on.

-Scott


* Re: Kernel oops while duming user core.
  2008-01-31 20:55           ` Scott Wood
@ 2008-01-31 21:58             ` Scott Wood
  2008-01-31 22:10               ` Rune Torgersen
  0 siblings, 1 reply; 21+ messages in thread
From: Scott Wood @ 2008-01-31 21:58 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: linuxppc-dev

Scott Wood wrote:
> Nathan Lynch wrote:
>> Is the crashing program multithreaded?  The first report had firefox
>> triggering the oops.
> 
> OK, I've got a test program that triggers it now.  I'll see if I can 
> figure out what's going on.

The problem seems to be that update_mmu_cache() is called on a guard 
page with no access rights.
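
To make "a guard page with no access rights" concrete from user space,
here is a minimal sketch, assuming Linux and POSIX mmap(). It maps one page
with PROT_NONE, which is the protection NPTL normally gives the guard page
below each thread stack, and checks whether touching it faults. The helper
names map_page and read_faults are made up for this illustration. This is
not the kernel code path, only a demonstration of the kind of mapping
involved.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;

/* SIGSEGV handler: jump back to the sigsetjmp() point in read_faults(). */
static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(fault_jmp, 1);
}

/* Returns 1 if reading one byte at p raises SIGSEGV, 0 if the read works. */
static int read_faults(const volatile char *p)
{
    struct sigaction sa, old;
    volatile int faulted = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    if (sigsetjmp(fault_jmp, 1) == 0)
        (void)*p;            /* volatile read; faults on a PROT_NONE page */
    else
        faulted = 1;         /* we got here through the signal handler */

    sigaction(SIGSEGV, &old, NULL);  /* restore the previous handler */
    return faulted;
}

/* Map one anonymous page with the given protection; NULL on failure. */
static void *map_page(int prot)
{
    void *p = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), prot,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

A page from map_page(PROT_NONE) makes read_faults() report a fault, while
the same check on a page from map_page(PROT_READ) does not.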

Changing update_mmu_cache() to always call flush_dcache_icache_page() 
fixes it, though a better performing fix would probably be to add an 
exception table entry for the dcbst.

-Scott


* RE: Kernel oops while duming user core.
  2008-01-31 21:58             ` Scott Wood
@ 2008-01-31 22:10               ` Rune Torgersen
  2008-02-03  7:34                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 21+ messages in thread
From: Rune Torgersen @ 2008-01-31 22:10 UTC (permalink / raw)
  To: Scott Wood, Nathan Lynch; +Cc: linuxppc-dev

Scott Wood wrote:
> Scott Wood wrote:
>> Nathan Lynch wrote:
>>> Is the crashing program multithreaded?  The first report had firefox
>>> triggering the oops.
>>
>> OK, I've got a test program that triggers it now.  I'll see if I can
>> figure out what's going on.
>
> The problem seems to be that update_mmu_cache() is called on a guard
> page with no access rights.
>
> Changing update_mmu_cache() to always call flush_dcache_icache_page()
> fixes it, though a better performing fix would probably be to add an
> exception table entry for the dcbst.

I can confirm that this seems to fix it.


* Re: Kernel oops while duming user core.
  2008-01-31 16:15 ` Nathan Lynch
  2008-01-31 16:26   ` Rune Torgersen
@ 2008-02-01 17:38   ` Scott Wood
  2008-02-02 12:05     ` Clemens Koller
  1 sibling, 1 reply; 21+ messages in thread
From: Scott Wood @ 2008-02-01 17:38 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: linuxppc-dev

On Thu, Jan 31, 2008 at 10:15:27AM -0600, Nathan Lynch wrote:
> Rune Torgersen wrote:
> > Hi
> > 
> > I get the following kernel oops while a user program of mine is dumping
> > core.
> > Any ideas on what to look for? (This is running 2.6.24, arch/powerpc, on
> > an 8280.)
> > When running the program on 2.6.18 arch/ppc, the program gets a sig 11
> > and dumps core.
> > On 2.6.24, I get the kernel oops, and then the program hangs around
> > forever and is unkillable.
> 
> Hmm, this is the second report of 2.6.24 crashing in
> __flush_dcache_icache during a core dump; see:
> http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html
> 
> Is this easily recreatable?

Yes, this program does it reliably:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>

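/* Each thread prints a marker, waits a second, then stores to address 0. */
void *threadfn(void *arg)
{
	fprintf(stderr, "threadfn\n");
	fflush(stderr);
	sleep(1);
	*(char *)0=0;	/* deliberate SIGSEGV to force a core dump */
	return NULL;
}

int main(void)
{
	pthread_t thread[4];
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&thread[i], NULL, threadfn, NULL);

	for (;;);	/* keep the main thread alive until the signal arrives */
}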


* Re: Kernel oops while duming user core.
  2008-02-01 17:38   ` Scott Wood
@ 2008-02-02 12:05     ` Clemens Koller
  0 siblings, 0 replies; 21+ messages in thread
From: Clemens Koller @ 2008-02-02 12:05 UTC
  To: Scott Wood; +Cc: linuxppc-dev, Nathan Lynch

Scott Wood wrote:
> On Thu, Jan 31, 2008 at 10:15:27AM -0600, Nathan Lynch wrote:
>> Rune Torgersen wrote:
>>> I get the following kernel core while a user program I have is dumping
>>> core.
>>> Any ideas on what to look for? (this is running 2.6.24, arch/powerpc on
>>> an 8280)
>>> When running the program on 2.6.18 arch/ppc, the program gets a sig 11
>>> and dumps core.
>>> On 2.6.24, I get the kernel oops, and then the program hangs around
>>> forever and is unkillable.
>> Hmm, this is the second report of 2.6.24 crashing in
>> __flush_dcache_icache during a core dump; see:
>> http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html
>>
>> Is this easily recreatable?
> 
> Yes, this program does it reliably:
> 
> #include <pthread.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <signal.h>
> 
> void *threadfn(void *arg)
> {
> 	fprintf(stderr, "threadfn\n");
> 	fflush(stderr);
> 	sleep(1);
> 	*(char *)0=0;
> 	return NULL;
> }
> 
> int main(void)
> {
> 	pthread_t thread[4];
> 	int i;
> 
> 	for (i = 0; i < 4; i++)
> 		pthread_create(&thread[i], NULL, threadfn, NULL);
> 
> 	for (;;);
> }

Ack!

This is an MPC8540ADS, arch/powerpc-compatible environment:

Feb  2 12:59:17 fox_1 kernel: Unable to handle kernel paging request for data at address 0x4802f000
Feb  2 12:59:17 fox_1 kernel: Faulting instruction address: 0xc000d5b8
Feb  2 12:59:17 fox_1 kernel: Oops: Kernel access of bad area, sig: 11 [#1]
Feb  2 12:59:17 fox_1 kernel: MPC85xx ADS
Feb  2 12:59:17 fox_1 kernel: Modules linked in:
Feb  2 12:59:17 fox_1 kernel: NIP: c000d5b8 LR: c0010fb8 CTR: 00000080
Feb  2 12:59:17 fox_1 kernel: REGS: c24abb20 TRAP: 0300   Not tainted  (2.6.24)
Feb  2 12:59:17 fox_1 kernel: MSR: 00029000 <EE,ME>  CR: 22882222  XER: 00000000
Feb  2 12:59:17 fox_1 kernel: DEAR: 4802f000, ESR: 00000000
Feb  2 12:59:17 fox_1 kernel: TASK = cf894d20[942] 'oops' THREAD: c24aa000
Feb  2 12:59:17 fox_1 kernel: GPR00: c22c7680 c24abbd0 cf894d20 4802f000 00000080 000f8b60 4802f000 ffffffff
Feb  2 12:59:17 fox_1 kernel: GPR08: 00000000 c22c7680 000008d1 00000000 22882222 10018a64 00000006 c035a300
Feb  2 12:59:17 fox_1 kernel: GPR16: 00024000 c0380000 c24aa000 c24abc9c c24abc98 c2570480 c22c7680 c0380000
Feb  2 12:59:17 fox_1 kernel: GPR24: c0390420 cf09d000 c0497b60 c5b63948 4802f000 c24aa000 000000bc c0497b60
Feb  2 12:59:17 fox_1 kernel: NIP [c000d5b8] __flush_dcache_icache+0x14/0x40
Feb  2 12:59:17 fox_1 kernel: LR [c0010fb8] update_mmu_cache+0x94/0x98
Feb  2 12:59:17 fox_1 kernel: Call Trace:
Feb  2 12:59:17 fox_1 kernel: [c24abbd0] [c24aa000] 0xc24aa000 (unreliable)
Feb  2 12:59:17 fox_1 kernel: [c24abbe0] [c005d978] handle_mm_fault+0x374/0x6a4
Feb  2 12:59:17 fox_1 kernel: [c24abc30] [c005ddd0] get_user_pages+0x128/0x384
Feb  2 12:59:17 fox_1 kernel: [c24abc90] [c00a80d8] elf_core_dump+0xab8/0xb74
Feb  2 12:59:17 fox_1 kernel: [c24abd30] [c007718c] do_coredump+0x730/0x758
Feb  2 12:59:17 fox_1 kernel: [c24abe30] [c002eeb0] get_signal_to_deliver+0x244/0x3c4
Feb  2 12:59:17 fox_1 kernel: [c24abe80] [c000782c] do_signal+0x48/0x264
Feb  2 12:59:17 fox_1 kernel: [c24abf40] [c000e4ac] do_user_signal+0x74/0xc4
Feb  2 12:59:17 fox_1 kernel: Instruction dump:
Feb  2 12:59:17 fox_1 kernel: 4d820020 7c8903a6 7c001bac 38630020 4200fff8 7c0004ac 4e800020 60000000
Feb  2 12:59:17 fox_1 kernel: 54630026 38800080 7c8903a6 7c661b78 <7c00186c> 38630020 4200fff8 7c0004ac
Feb  2 12:59:17 fox_1 kernel: ---[ end trace a1d91e665173315a ]---
Feb  2 12:59:17 fox_1 kernel: note: oops[942] exited with preempt_count 1

It does not oops when the core dump is disabled.
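
The message does not say how the core dump was disabled. One common way, shown here only as a sketch, is to set the soft RLIMIT_CORE limit to zero before the crash (the shell equivalent is "ulimit -c 0"). The function name disable_core_dumps() is a hypothetical helper, not something from this thread.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: set the soft core-file size limit to zero so the
 * kernel skips writing a core file for this process and its children.
 * The hard limit is left unchanged.  Returns 0 on success, -1 on error
 * (errno is set by getrlimit/setrlimit).
 */
int disable_core_dumps(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_CORE, &rl) != 0)
		return -1;

	rl.rlim_cur = 0;
	return setrlimit(RLIMIT_CORE, &rl);
}
```

Calling this at the top of main() in the test program would be one way to check the observation above, since the faulting path in the trace runs through do_coredump().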

Regards,

Clemens


* RE: Kernel oops while duming user core.
  2008-01-31 22:10               ` Rune Torgersen
@ 2008-02-03  7:34                 ` Benjamin Herrenschmidt
  2008-02-04 18:23                   ` Kernel oops while dumping " Scott Wood
  0 siblings, 1 reply; 21+ messages in thread
From: Benjamin Herrenschmidt @ 2008-02-03  7:34 UTC
  To: Rune Torgersen; +Cc: Scott Wood, linuxppc-dev, Nathan Lynch


On Thu, 2008-01-31 at 16:10 -0600, Rune Torgersen wrote:
> Scott Wood wrote:
> > Scott Wood wrote:
> >> Nathan Lynch wrote:
> >>> Is the crashing program multithreaded?  The first report had firefox
> >>> triggering the oops.
> >> 
> >> OK, I've got a test program that triggers it now.  I'll see if I can
> >> figure out what's going on.
> > 
> > The problem seems to be that update_mmu_cache() is called on a guard
> > page with no access rights. 
> > 
> > Changing update_mmu_cache() to always call flush_dcache_icache_page()
> > fixes it, though a better performing fix would probably be to add an
> > exception table entry for the dcbst.
> 
> I can confirm that this seems to fix it.

Might be better to avoid the flush when the page isn't readable?

Ben.


* Re: Kernel oops while dumping user core.
  2008-02-03  7:34                 ` Benjamin Herrenschmidt
@ 2008-02-04 18:23                   ` Scott Wood
  0 siblings, 0 replies; 21+ messages in thread
From: Scott Wood @ 2008-02-04 18:23 UTC
  To: benh; +Cc: linuxppc-dev, Nathan Lynch

Benjamin Herrenschmidt wrote:
> On Thu, 2008-01-31 at 16:10 -0600, Rune Torgersen wrote:
>> Scott Wood wrote:
>>> Changing update_mmu_cache() to always call flush_dcache_icache_page()
>>> fixes it, though a better performing fix would probably be to add an
>>> exception table entry for the dcbst.
>> I can confirm that this seems to fix it.
> 
> Might be better to avoid the flush when the page isn't readable?

Sure, that'd work.  I was trying to avoid a tablewalk to determine that, 
not noticing the pte argument staring me in the face. :-P
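
The idea agreed on here, deciding from the pte whether a flush is safe, might look roughly like the userspace model below. Everything in it is an assumption for illustration: model_pte_t, the MODEL_PTE_* flag values and model_should_flush() are hypothetical stand-ins, not the kernel's pte_t, its flag bits, or the eventual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a page table entry; NOT the kernel's pte_t. */
typedef struct {
	uint32_t bits;
} model_pte_t;

/* Hypothetical flag values chosen for this model only. */
#define MODEL_PTE_PRESENT  0x1u
#define MODEL_PTE_READABLE 0x2u

/*
 * Decide whether a cache flush through the user address should be
 * attempted.  A page that is not present or not readable (for example a
 * guard page with no access rights) is skipped, because touching it
 * would fault.
 */
bool model_should_flush(model_pte_t pte)
{
	return (pte.bits & MODEL_PTE_PRESENT) &&
	       (pte.bits & MODEL_PTE_READABLE);
}
```

The point of the model is only that the decision needs nothing beyond the entry already passed in, so no separate table walk is required.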

-Scott


Thread overview: 21+ messages
2008-01-31 13:45 Kernel oops while duming user core Rune Torgersen
2008-01-31 16:15 ` Nathan Lynch
2008-01-31 16:26   ` Rune Torgersen
2008-01-31 17:40     ` Rune Torgersen
2008-01-31 19:15       ` Kumar Gala
2008-01-31 19:18         ` Rune Torgersen
2008-01-31 20:16       ` Scott Wood
2008-01-31 20:19         ` Rune Torgersen
2008-01-31 20:38         ` Rune Torgersen
2008-01-31 20:41         ` Nathan Lynch
2008-01-31 20:45           ` Rune Torgersen
2008-01-31 20:55           ` Scott Wood
2008-01-31 21:58             ` Scott Wood
2008-01-31 22:10               ` Rune Torgersen
2008-02-03  7:34                 ` Benjamin Herrenschmidt
2008-02-04 18:23                   ` Kernel oops while dumping " Scott Wood
2008-01-31 19:15     ` Kernel oops while duming " Kumar Gala
2008-01-31 19:23       ` Rune Torgersen
2008-01-31 19:54         ` Nathan Lynch
2008-02-01 17:38   ` Scott Wood
2008-02-02 12:05     ` Clemens Koller
