* Kernel oops while duming user core. @ 2008-01-31 13:45 Rune Torgersen 2008-01-31 16:15 ` Nathan Lynch 0 siblings, 1 reply; 21+ messages in thread From: Rune Torgersen @ 2008-01-31 13:45 UTC (permalink / raw) To: linuxppc-dev Hi I get the following kernel oops while a user program I have is dumping core. Any ideas on what to look for? (This is running 2.6.24, arch/powerpc, on an 8280.) When running the program on 2.6.18 arch/ppc, the program gets a sig 11 and dumps core. On 2.6.24, I get the kernel oops, and then the program hangs around forever and is unkillable. Unable to handle kernel paging request for data at address 0x48024000 Faulting instruction address: 0xc000ef88 Oops: Kernel access of bad area, sig: 11 [#1] PREEMPT Innovative Systems ApMax Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7 drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P) NIP: c000ef88 LR: c0012180 CTR: 00000080 REGS: eebc9b70 TRAP: 0300 Tainted: P (2.6.24) MSR: 00009032 <EE,ME,IR,DR> CR: 24004442 XER: 00000000 DAR: 48024000, DSISR: 20000000 TASK = eebac3c0[3131] 'armd' THREAD: eebc8000 GPR00: ee9b7d00 eebc9c20 eebac3c0 48024000 00000080 399a4181 48024000 00000000 GPR08: 399a4181 ee9b7d00 00000000 c2000000 44004422 10100f38 ee82fc00 bfffffff GPR16: ef377060 00000030 ee9b7d00 00000000 eebc9cdc 00000011 eebc9cd8 eeb96480 GPR24: ee9b7d00 399a4181 48024000 eeb9a370 eeb9a370 399a4181 48024000 c2733480 NIP [c000ef88] __flush_dcache_icache+0x14/0x40 LR [c0012180] update_mmu_cache+0x74/0x114 Call Trace: [eebc9c20] [eebc8000] 0xeebc8000 (unreliable) [eebc9c40] [c005d060] handle_mm_fault+0x630/0xbc0 [eebc9c80] [c005d9e4] get_user_pages+0x3f4/0x4fc [eebc9cd0] [c00aa7c4] elf_core_dump+0x9a4/0xc5c [eebc9d60] [c00779e4] do_coredump+0x6e0/0x748 [eebc9e50] [c002a5b0] get_signal_to_deliver+0x40c/0x45c [eebc9e80] [c0008ce8] do_signal+0x50/0x294 [eebc9f40] [c000fb98] do_user_signal+0x74/0xc4 --- Exception: 300 at 0x10044efc LR = 0x10044ec0 Instruction dump: 4d820020 7c8903a6 
7c001bac 38630020 4200fff8 7c0004ac 4e800020 60000000 54630026 38800080 7c8903a6 7c661b78 <7c00186c> 38630020 4200fff8 7c0004ac ---[ end trace 97db37eaf213da3c ]--- note: armd[3131] exited with preempt_count 2 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Kernel oops while duming user core. 2008-01-31 13:45 Kernel oops while duming user core Rune Torgersen @ 2008-01-31 16:15 ` Nathan Lynch 2008-01-31 16:26 ` Rune Torgersen 2008-02-01 17:38 ` Scott Wood 0 siblings, 2 replies; 21+ messages in thread From: Nathan Lynch @ 2008-01-31 16:15 UTC (permalink / raw) To: Rune Torgersen; +Cc: linuxppc-dev Rune Torgersen wrote: > Hi > > I get the following kernel core while a user program I have is dumping > core. > Any DIeas at what to look for? (this is runnign 2.6.24, arch/powerpc on > a 8280) > When runnign the program on 2.6.18 arch/ppc, the program gets a sig 11 > and dumps core. > On 2.6.24, I ghet the kernel oops, and then the program hangs sround > forever and is unkillable. Hmm, this is the second report of 2.6.24 crashing in __flush_dcache_icache during a core dump; see: http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html Is this easily recreatable? > > Unable to handle kernel paging request for data at address 0x48024000 > Faulting instruction address: 0xc000ef88 > Oops: Kernel access of bad area, sig: 11 [#1] > PREEMPT Innovative Systems ApMax > Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7 > drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P) > NIP: c000ef88 LR: c0012180 CTR: 00000080 > REGS: eebc9b70 TRAP: 0300 Tainted: P (2.6.24) > MSR: 00009032 <EE,ME,IR,DR> CR: 24004442 XER: 00000000 > DAR: 48024000, DSISR: 20000000 > TASK = eebac3c0[3131] 'armd' THREAD: eebc8000 > GPR00: ee9b7d00 eebc9c20 eebac3c0 48024000 00000080 399a4181 48024000 > 00000000 > GPR08: 399a4181 ee9b7d00 00000000 c2000000 44004422 10100f38 ee82fc00 > bfffffff > GPR16: ef377060 00000030 ee9b7d00 00000000 eebc9cdc 00000011 eebc9cd8 > eeb96480 > GPR24: ee9b7d00 399a4181 48024000 eeb9a370 eeb9a370 399a4181 48024000 > c2733480 > NIP [c000ef88] __flush_dcache_icache+0x14/0x40 > LR [c0012180] update_mmu_cache+0x74/0x114 > Call Trace: > [eebc9c20] [eebc8000] 0xeebc8000 (unreliable) > [eebc9c40] 
[c005d060] handle_mm_fault+0x630/0xbc0 > [eebc9c80] [c005d9e4] get_user_pages+0x3f4/0x4fc > [eebc9cd0] [c00aa7c4] elf_core_dump+0x9a4/0xc5c > [eebc9d60] [c00779e4] do_coredump+0x6e0/0x748 > [eebc9e50] [c002a5b0] get_signal_to_deliver+0x40c/0x45c > [eebc9e80] [c0008ce8] do_signal+0x50/0x294 > [eebc9f40] [c000fb98] do_user_signal+0x74/0xc4 > --- Exception: 300 at 0x10044efc > LR = 0x10044ec0 > Instruction dump: > 4d820020 7c8903a6 7c001bac 38630020 4200fff8 7c0004ac 4e800020 60000000 > 54630026 38800080 7c8903a6 7c661b78 <7c00186c> 38630020 4200fff8 > 7c0004ac > ---[ end trace 97db37eaf213da3c ]--- > note: armd[3131] exited with preempt_count 2 ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Kernel oops while duming user core. 2008-01-31 16:15 ` Nathan Lynch @ 2008-01-31 16:26 ` Rune Torgersen 2008-01-31 17:40 ` Rune Torgersen 2008-01-31 19:15 ` Kernel oops while duming " Kumar Gala 2008-02-01 17:38 ` Scott Wood 1 sibling, 2 replies; 21+ messages in thread From: Rune Torgersen @ 2008-01-31 16:26 UTC (permalink / raw) To: Nathan Lynch; +Cc: linuxppc-dev Nathan Lynch wrote: > Hmm, this is the second report of 2.6.24 crashing in > __flush_dcache_icache during a core dump; see: > http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html > > Is this easily recreatable? Yes. I have a binary that will do this every time it is started (on this particular system); it only takes about 10 seconds before it dumps. I was going to test HEAD of powerpc.git to see if it is still there. I cannot test any earlier versions as our board port was done on 2.6.24. Our older kernel port is 2.6.18 on arch/ppc, and it works just fine. One potential clue: > Unable to handle kernel paging request for data at address 0x48024000 This address is beyond our physical memory. We have 1GB of memory (CONFIG_HIGHMEM enabled), so 0x3fff_ffff is the last valid address. 0x4000_0000 to 0x7fff_ffff are unused, and 0x8000_0000 to 0x9fff_ffff is used by PCI. ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Kernel oops while duming user core. 2008-01-31 16:26 ` Rune Torgersen @ 2008-01-31 17:40 ` Rune Torgersen 2008-01-31 19:15 ` Kumar Gala 2008-01-31 20:16 ` Scott Wood 2008-01-31 19:15 ` Kernel oops while duming " Kumar Gala 1 sibling, 2 replies; 21+ messages in thread From: Rune Torgersen @ 2008-01-31 17:40 UTC (permalink / raw) To: Nathan Lynch; +Cc: linuxppc-dev Rune Torgersen wrote: > I was going to test HEAD of powerpc.git to see if it is still there. Still there. Also used GDB on the vmlinux image to get source and disassembly of the oops: Unable to handle kernel paging request for data at address 0x48024000 Faulting instruction address: 0xc000f0a0 Oops: Kernel access of bad area, sig: 11 [#1] PREEMPT Innovative Systems ApMax Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7 drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P) NIP: c000f0a0 LR: c0011fec CTR: 00000080 REGS: eebe9b70 TRAP: 0300 Tainted: P (2.6.24-test) MSR: 00009032 <EE,ME,IR,DR> CR: 24004442 XER: 00000000 DAR: 48024000, DSISR: 20000000 TASK = eeba9780[2554] 'armd_crash' THREAD: eebe8000 GPR00: eea44d00 eebe9c20 eeba9780 48024000 00000080 37a56181 48024000 00000000 GPR08: 37a56181 eea44d00 00000000 c2000000 44004422 10100f38 ef336600 bfffffff GPR16: eeff0300 00000030 eea44d00 00000000 eebe9cdc 00000011 eebe9cd8 eebca480 GPR24: eea44d00 37a56181 48024000 eebad580 eebad580 37a56181 48024000 c26f4ac0 NIP [c000f0a0] __flush_dcache_icache+0x14/0x40 LR [c0011fec] update_mmu_cache+0x74/0x114 Call Trace: [eebe9c20] [eebe8000] 0xeebe8000 (unreliable) [eebe9c40] [c005cfd0] handle_mm_fault+0x630/0xbc0 [eebe9c80] [c005d954] get_user_pages+0x3f4/0x4fc [eebe9cd0] [c00aa730] elf_core_dump+0x9a4/0xc5c [eebe9d60] [c0077954] do_coredump+0x6e0/0x748 [eebe9e50] [c002a520] get_signal_to_deliver+0x40c/0x45c [eebe9e80] [c0008cec] do_signal+0x50/0x294 [eebe9f40] [c000fc9c] do_user_signal+0x74/0xc4 --- Exception: 300 at 0x10044efc LR = 0x10044ec0 Instruction dump: 4d820020 7c8903a6 
7c001bac 38630020 4200fff8 7c0004ac 4e800020 60000000 54630026 38800080 7c8903a6 7c661b78 <7c00186c> 38630020 4200fff8 7c0004ac ---[ end trace 37755b0fb9e79677 ]--- note: armd_crash[2554] exited with preempt_count 2

backtrace using gdb on vmlinux image:

0xc00aa730 is in elf_core_dump (fs/binfmt_elf.c:1762).
1757
1758            for (addr = vma->vm_start; addr < end; addr += PAGE_SIZE) {
1759                    struct page *page;
1760                    struct vm_area_struct *vma;
1761
1762                    if (get_user_pages(current, current->mm, addr, 1, 0, 1,
1763                                            &page, &vma) <= 0) {
1764                            DUMP_SEEK(PAGE_SIZE);
1765                    } else {
1766                            if (page == ZERO_PAGE(0)) {
(gdb) list *0xc005d954
0xc005d954 is in get_user_pages (mm/memory.c:1072).
1067                            cond_resched();
1068                            while (!(page = follow_page(vma, start, foll_flags))) {
1069                                    int ret;
1070                                    ret = handle_mm_fault(mm, vma, start,
1071                                                    foll_flags & FOLL_WRITE);
1072                                    if (ret & VM_FAULT_ERROR) {
1073                                            if (ret & VM_FAULT_OOM)
1074                                                    return i ? i : -ENOMEM;
1075                                            else if (ret & VM_FAULT_SIGBUS)
1076                                                    return i ? i : -EFAULT;
(gdb) list *0xc005cfd0
0xc005cfd0 is in handle_mm_fault (include/asm/thread_info.h:99).
94      {
95              register unsigned long sp asm("r1");
96
97              /* gcc4, at least, is smart enough to turn this into a single
98               * rlwinm for ppc32 and clrrdi for ppc64 */
99              return (struct thread_info *)(sp & ~(THREAD_SIZE-1));
100     }
101
102     #endif /* __ASSEMBLY__ */
103
(gdb)
(gdb) list *0xc0011fec
0xc0011fec is in update_mmu_cache (arch/powerpc/mm/mem.c:489).
484                     _tlbie(address, 0 /* 8xx doesn't care about PID */);
485     #endif
486                     if (!PageReserved(page)
487                         && !test_bit(PG_arch_1, &page->flags)) {
488                             if (vma->vm_mm == current->active_mm) {
489                                     __flush_dcache_icache((void *) address);
490                             } else
491                                     flush_dcache_icache_page(page);
492                             set_bit(PG_arch_1, &page->flags);
493                     }
(gdb) list *0xc000f0a0
No source file for address 0xc000f0a0.
(gdb) disassemble 0xc000f0a0 Dump of assembler code for function __flush_dcache_icache: 0xc000f08c <__flush_dcache_icache+0>: dec %esi 0xc000f08d <__flush_dcache_icache+1>: addb $0x20,(%eax) 0xc000f090 <__flush_dcache_icache+4>: push %esp 0xc000f091 <__flush_dcache_icache+5>: arpl %ax,(%eax) 0xc000f093 <__flush_dcache_icache+7>: cmp %al,%es:0x897c8000(%eax) 0xc000f09a <__flush_dcache_icache+14>: add 0x781b667c(%esi),%esp 0xc000f0a0 <__flush_dcache_icache+20>: jl 0xc000f0a2 <__flush_dcache_icache+22> 0xc000f0a2 <__flush_dcache_icache+22>: sbb %ch,0x63(%eax,%edi,1) 0xc000f0a6 <__flush_dcache_icache+26>: add %ah,(%eax) 0xc000f0a8 <__flush_dcache_icache+28>: inc %edx 0xc000f0a9 <__flush_dcache_icache+29>: add %bh,%bh 0xc000f0ab <__flush_dcache_icache+31>: clc 0xc000f0ac <__flush_dcache_icache+32>: jl 0xc000f0ae <__flush_dcache_icache+34> 0xc000f0ae <__flush_dcache_icache+34>: add $0xac,%al 0xc000f0b0 <__flush_dcache_icache+36>: jl 0xc000f03b <flush_dcache_range+15> 0xc000f0b2 <__flush_dcache_icache+38>: add 0xac37007c(%esi),%esp 0xc000f0b8 <__flush_dcache_icache+44>: cmp %al,%dh 0xc000f0ba <__flush_dcache_icache+46>: add %ah,(%eax) 0xc000f0bc <__flush_dcache_icache+48>: inc %edx 0xc000f0bd <__flush_dcache_icache+49>: add %bh,%bh 0xc000f0bf <__flush_dcache_icache+51>: clc 0xc000f0c0 <__flush_dcache_icache+52>: jl 0xc000f0c2 <__flush_dcache_icache+54> 0xc000f0c2 <__flush_dcache_icache+54>: add $0xac,%al 0xc000f0c4 <__flush_dcache_icache+56>: dec %esp 0xc000f0c5 <__flush_dcache_icache+57>: add %al,(%ecx) 0xc000f0c7 <__flush_dcache_icache+59>: sub $0x4e,%al 0xc000f0c9 <__flush_dcache_icache+61>: addb $0x20,(%eax) End of assembler dump. (gdb) =20 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Kernel oops while duming user core. 2008-01-31 17:40 ` Rune Torgersen @ 2008-01-31 19:15 ` Kumar Gala 2008-01-31 19:18 ` Rune Torgersen 2008-01-31 20:16 ` Scott Wood 1 sibling, 1 reply; 21+ messages in thread From: Kumar Gala @ 2008-01-31 19:15 UTC (permalink / raw) To: Rune Torgersen; +Cc: linuxppc-dev, Nathan Lynch > } > (gdb) list *0xc000f0a0 > No source file for address 0xc000f0a0. > (gdb) disassemble 0xc000f0a0 > Dump of assembler code for function __flush_dcache_icache: > 0xc000f08c <__flush_dcache_icache+0>: dec %esi > 0xc000f08d <__flush_dcache_icache+1>: addb $0x20,(%eax) > 0xc000f090 <__flush_dcache_icache+4>: push %esp > 0xc000f091 <__flush_dcache_icache+5>: arpl %ax,(%eax) > 0xc000f093 <__flush_dcache_icache+7>: cmp %al,%es: > 0x897c8000(%eax) > 0xc000f09a <__flush_dcache_icache+14>: add 0x781b667c(%esi),%esp > 0xc000f0a0 <__flush_dcache_icache+20>: jl 0xc000f0a2 > <__flush_dcache_icache+22> > 0xc000f0a2 <__flush_dcache_icache+22>: sbb %ch,0x63(%eax,%edi,1) > 0xc000f0a6 <__flush_dcache_icache+26>: add %ah,(%eax) > 0xc000f0a8 <__flush_dcache_icache+28>: inc %edx > 0xc000f0a9 <__flush_dcache_icache+29>: add %bh,%bh > 0xc000f0ab <__flush_dcache_icache+31>: clc > 0xc000f0ac <__flush_dcache_icache+32>: jl 0xc000f0ae > <__flush_dcache_icache+34> > 0xc000f0ae <__flush_dcache_icache+34>: add $0xac,%al > 0xc000f0b0 <__flush_dcache_icache+36>: jl 0xc000f03b > <flush_dcache_range+15> > 0xc000f0b2 <__flush_dcache_icache+38>: add 0xac37007c(%esi),%esp > 0xc000f0b8 <__flush_dcache_icache+44>: cmp %al,%dh > 0xc000f0ba <__flush_dcache_icache+46>: add %ah,(%eax) > 0xc000f0bc <__flush_dcache_icache+48>: inc %edx > 0xc000f0bd <__flush_dcache_icache+49>: add %bh,%bh > 0xc000f0bf <__flush_dcache_icache+51>: clc > 0xc000f0c0 <__flush_dcache_icache+52>: jl 0xc000f0c2 > <__flush_dcache_icache+54> > 0xc000f0c2 <__flush_dcache_icache+54>: add $0xac,%al > 0xc000f0c4 <__flush_dcache_icache+56>: dec %esp > 0xc000f0c5 <__flush_dcache_icache+57>: add %al,(%ecx) > 0xc000f0c7 
<__flush_dcache_icache+59>: sub $0x4e,%al > 0xc000f0c9 <__flush_dcache_icache+61>: addb $0x20,(%eax) > End of assembler dump. This doesn't look like ppc disasm to me :) - k ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Kernel oops while duming user core. 2008-01-31 19:15 ` Kumar Gala @ 2008-01-31 19:18 ` Rune Torgersen 0 siblings, 0 replies; 21+ messages in thread From: Rune Torgersen @ 2008-01-31 19:18 UTC (permalink / raw) To: Kumar Gala; +Cc: linuxppc-dev, Nathan Lynch Kumar Gala wrote: > This doesn't look like ppc disasm to me :) > It helps if I use the cross-compiler gdb instead of the native x86 one... Here is the disassembly dump for NIP:

(gdb) disassemble 0xc000f0a0
Dump of assembler code for function __flush_dcache_icache:
0xc000f08c <__flush_dcache_icache+0>:   blr
0xc000f090 <__flush_dcache_icache+4>:   rlwinm  r3,r3,0,0,19
0xc000f094 <__flush_dcache_icache+8>:   li      r4,128
0xc000f098 <__flush_dcache_icache+12>:  mtctr   r4
0xc000f09c <__flush_dcache_icache+16>:  mr      r6,r3
0xc000f0a0 <__flush_dcache_icache+20>:  dcbst   r0,r3
0xc000f0a4 <__flush_dcache_icache+24>:  addi    r3,r3,32
0xc000f0a8 <__flush_dcache_icache+28>:  bdnz+   0xc000f0a0 <__flush_dcache_icache+20>
0xc000f0ac <__flush_dcache_icache+32>:  sync
0xc000f0b0 <__flush_dcache_icache+36>:  mtctr   r4
0xc000f0b4 <__flush_dcache_icache+40>:  icbi    r0,r6
0xc000f0b8 <__flush_dcache_icache+44>:  addi    r6,r6,32
0xc000f0bc <__flush_dcache_icache+48>:  bdnz+   0xc000f0b4 <__flush_dcache_icache+40>
0xc000f0c0 <__flush_dcache_icache+52>:  sync
0xc000f0c4 <__flush_dcache_icache+56>:  isync
0xc000f0c8 <__flush_dcache_icache+60>:  blr
End of assembler dump.
(gdb)

Registers were:
NIP: c000f0a0 LR: c0011fec CTR: 00000080
REGS: eebe9b70 TRAP: 0300 Tainted: P (2.6.24-test)
MSR: 00009032 <EE,ME,IR,DR> CR: 24004442 XER: 00000000
DAR: 48024000, DSISR: 20000000
TASK = eeba9780[2554] 'armd_crash' THREAD: eebe8000
GPR00: eea44d00 eebe9c20 eeba9780 48024000 00000080 37a56181 48024000 00000000
GPR08: 37a56181 eea44d00 00000000 c2000000 44004422 10100f38 ef336600 bfffffff
GPR16: eeff0300 00000030 eea44d00 00000000 eebe9cdc 00000011 eebe9cd8 eebca480
GPR24: eea44d00 37a56181 48024000 eebad580 eebad580 37a56181 48024000 c26f4ac0
^ permalink raw reply [flat|nested] 21+ messages in thread
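The PowerPC disassembly in the message above can be read as a loop over one page: `rlwinm r3,r3,0,0,19` clears the low 12 bits of the address, `li r4,128` with `mtctr r4` sets the iteration count, and each `dcbst` (and later `icbi`) step advances by 32 bytes. The following is a minimal C sketch of that arithmetic only. The names `LINE_BYTES`, `LINES_PER_PAGE`, `page_base`, `bytes_flushed` and `line_addr` are illustrative assumptions, not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants read off the disassembly; not kernel macros. */
#define LINE_BYTES      32u   /* addi r3,r3,32 */
#define LINES_PER_PAGE  128u  /* li r4,128 ; mtctr r4 */

/* rlwinm r3,r3,0,0,19 keeps bits 0..19 (PowerPC numbering, bit 0 = MSB),
 * i.e. it rounds the address down to a 4 KiB boundary. */
static uint32_t page_base(uint32_t addr)
{
    return addr & 0xFFFFF000u;
}

/* Bytes covered by one full pass of the dcbst (or icbi) loop. */
static uint32_t bytes_flushed(void)
{
    return LINE_BYTES * LINES_PER_PAGE;
}

/* Address touched by loop iteration n (0-based). */
static uint32_t line_addr(uint32_t addr, uint32_t n)
{
    assert(n < LINES_PER_PAGE);
    return page_base(addr) + n * LINE_BYTES;
}
```

With the reported DAR, `page_base(0x48024000)` is `0x48024000` itself and `bytes_flushed()` is 4096. That matches r3 (GPR03) equal to the DAR and CTR still at 0x80 in the register dump, which suggests the fault was taken on the first `dcbst` of the page.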
* Re: Kernel oops while duming user core. 2008-01-31 17:40 ` Rune Torgersen 2008-01-31 19:15 ` Kumar Gala @ 2008-01-31 20:16 ` Scott Wood 2008-01-31 20:19 ` Rune Torgersen ` (2 more replies) 1 sibling, 3 replies; 21+ messages in thread From: Scott Wood @ 2008-01-31 20:16 UTC (permalink / raw) To: Rune Torgersen; +Cc: linuxppc-dev, Nathan Lynch On Thu, Jan 31, 2008 at 11:40:04AM -0600, Rune Torgersen wrote: > Unable to handle kernel paging request for data at address 0x48024000 > Faulting instruction address: 0xc000f0a0 > Oops: Kernel access of bad area, sig: 11 [#1] > PREEMPT Innovative Systems ApMax Does it happen without preempt? > Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7 > drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P) > NIP: c000f0a0 LR: c0011fec CTR: 00000080 > REGS: eebe9b70 TRAP: 0300 Tainted: P (2.6.24-test) Does it happen without the modules? > MSR: 00009032 <EE,ME,IR,DR> CR: 24004442 XER: 00000000 > DAR: 48024000, DSISR: 20000000 Hmm, this doesn't look like a valid DSISR, so I'm guessing this was a TLB miss that got redirected to DataAccess (or is there something that causes DSISR[2] to be set on 8280? I didn't see anything in the manual...). However, SRR1 in that case seems to indicate a store, which dcbst shouldn't generate (except on 8xx, according to the comment in update_mmu_cache). Do you have a simple test case that we could try to reproduce? I tried a simple core dump on an 8280, and it worked. Failing that, I'd add code to the page fault handler to dump what is (or isn't) supposed to be mapped at the faulting address, and something to track which (if any) TLB miss exception it came through. -Scott ^ permalink raw reply [flat|nested] 21+ messages in thread
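The bracketed bit reference in the message above (bit 2 of DSISR) uses PowerPC bit numbering, where bit 0 is the most significant bit of the 32-bit register. A minimal sketch of the conversion follows; the helper names `ppc_bit32` and `only_bit_set` are hypothetical and exist only for this illustration. It says nothing about what the bit means on a given core, only that the reported value 0x20000000 has exactly that one bit set.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for bit n of a 32-bit register in PowerPC (MSB = bit 0) numbering.
 * Hypothetical helper, not a kernel macro. */
static uint32_t ppc_bit32(unsigned n)
{
    assert(n < 32);
    return UINT32_C(1) << (31u - n);
}

/* Nonzero if bit n is the only bit set in value. */
static int only_bit_set(uint32_t value, unsigned n)
{
    return value == ppc_bit32(n);
}
```

For the oops in this thread, `only_bit_set(0x20000000, 2)` holds, which is the observation the "is there something that causes bit 2 to be set" question starts from.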
* RE: Kernel oops while duming user core. 2008-01-31 20:16 ` Scott Wood @ 2008-01-31 20:19 ` Rune Torgersen 2008-01-31 20:38 ` Rune Torgersen 2008-01-31 20:41 ` Nathan Lynch 2 siblings, 0 replies; 21+ messages in thread From: Rune Torgersen @ 2008-01-31 20:19 UTC (permalink / raw) To: Scott Wood; +Cc: linuxppc-dev, Nathan Lynch Scott Wood wrote: > Does it happen without preempt? Will try shortly, just updated my git to HEAD of Linus's tree > >> Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7 >> drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P) >> NIP: c000f0a0 LR: c0011fec CTR: 00000080 >> REGS: eebe9b70 TRAP: 0300 Tainted: P (2.6.24-test) > > Does it happen without the modules? Cannot test without most of them. > Do you have a simple test case that we could try to > reproduce? I tried a > simple core dump on an 8280, and it worked. I do not have a test case, except an app for our board that does this reliably after about 10 seconds. > Failing that, I'd add code to the page fault handler to dump what is > (or isn't) supposed to be mapped at the faulting address, and > something to track which (if any) TLB miss exception it came through. I can test code. ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Kernel oops while duming user core. 2008-01-31 20:16 ` Scott Wood 2008-01-31 20:19 ` Rune Torgersen @ 2008-01-31 20:38 ` Rune Torgersen 2008-01-31 20:41 ` Nathan Lynch 2 siblings, 0 replies; 21+ messages in thread From: Rune Torgersen @ 2008-01-31 20:38 UTC (permalink / raw) To: Scott Wood; +Cc: linuxppc-dev, Nathan Lynch Scott Wood wrote: > Does it happen without preempt? Yes ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Kernel oops while duming user core. 2008-01-31 20:16 ` Scott Wood 2008-01-31 20:19 ` Rune Torgersen 2008-01-31 20:38 ` Rune Torgersen @ 2008-01-31 20:41 ` Nathan Lynch 2008-01-31 20:45 ` Rune Torgersen 2008-01-31 20:55 ` Scott Wood 2 siblings, 2 replies; 21+ messages in thread From: Nathan Lynch @ 2008-01-31 20:41 UTC (permalink / raw) To: Scott Wood; +Cc: linuxppc-dev Scott Wood wrote: > On Thu, Jan 31, 2008 at 11:40:04AM -0600, Rune Torgersen wrote: > > Unable to handle kernel paging request for data at address 0x48024000 > > Faulting instruction address: 0xc000f0a0 > > Oops: Kernel access of bad area, sig: 11 [#1] > > PREEMPT Innovative Systems ApMax > > Does it happen without preempt? > > > Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7 > > drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P) > > NIP: c000f0a0 LR: c0011fec CTR: 00000080 > > REGS: eebe9b70 TRAP: 0300 Tainted: P (2.6.24-test) > > Does it happen without the modules? I doubt the modules are the problem; there was a practically identical report from someone with an untainted 2.6.24-rc kernel a few weeks ago (see my first reply to Rune). > > > MSR: 00009032 <EE,ME,IR,DR> CR: 24004442 XER: 00000000 > > DAR: 48024000, DSISR: 20000000 > > Hmm, this doesn't look like a valid DSISR, so I'm guessing this was a TLB > miss that got redirected to DataAccess (or is there something that causes > DSRISR[2] to be set on 8280? I didn't see anything in the manual...). > However, SRR1 in that case seems to indicate a store, which dcbst shouldn't > generate (except on 8xx, according to the comment in update_mmu_cache). > > Do you have a simple test case that we could try to reproduce? I tried a > simple core dump on an 8280, and it worked. Is the crashing program multithreaded? The first report had firefox triggering the oops. ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Kernel oops while duming user core. 2008-01-31 20:41 ` Nathan Lynch @ 2008-01-31 20:45 ` Rune Torgersen 2008-01-31 20:55 ` Scott Wood 1 sibling, 0 replies; 21+ messages in thread From: Rune Torgersen @ 2008-01-31 20:45 UTC (permalink / raw) To: Nathan Lynch, Scott Wood; +Cc: linuxppc-dev Nathan Lynch wrote: > Scott Wood wrote: >> Do you have a simple test case that we could try to reproduce? I >> tried a simple core dump on an 8280, and it worked. > > Is the crashing program multithreaded? The first report had firefox > triggering the oops. The crashing program has 10 threads. (NPTL pthreads, glibc-2.5, gcc 4.1.2) ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Kernel oops while duming user core. 2008-01-31 20:41 ` Nathan Lynch 2008-01-31 20:45 ` Rune Torgersen @ 2008-01-31 20:55 ` Scott Wood 2008-01-31 21:58 ` Scott Wood 1 sibling, 1 reply; 21+ messages in thread From: Scott Wood @ 2008-01-31 20:55 UTC (permalink / raw) To: Nathan Lynch; +Cc: linuxppc-dev Nathan Lynch wrote: > I doubt the modules are the problem; there was a practically identical > report from someone with an untainted 2.6.24-rc kernel a few weeks ago > (see my first reply to Rune). I didn't think they were; I was just trying to eliminate the low hanging fruit and get a simpler testcase. :-) >> Do you have a simple test case that we could try to reproduce? I tried a >> simple core dump on an 8280, and it worked. > > Is the crashing program multithreaded? The first report had firefox > triggering the oops. OK, I've got a test program that triggers it now. I'll see if I can figure out what's going on. -Scott ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Kernel oops while duming user core. 2008-01-31 20:55 ` Scott Wood @ 2008-01-31 21:58 ` Scott Wood 2008-01-31 22:10 ` Rune Torgersen 0 siblings, 1 reply; 21+ messages in thread From: Scott Wood @ 2008-01-31 21:58 UTC (permalink / raw) To: Nathan Lynch; +Cc: linuxppc-dev Scott Wood wrote: > Nathan Lynch wrote: >> Is the crashing program multithreaded? The first report had firefox >> triggering the oops. > > OK, I've got a test program that triggers it now. I'll see if I can > figure out what's going on. The problem seems to be that update_mmu_cache() is called on a guard page with no access rights. Changing update_mmu_cache() to always call flush_dcache_icache_page() fixes it, though a better performing fix would probably be to add an exception table entry for the dcbst. -Scott ^ permalink raw reply [flat|nested] 21+ messages in thread
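The diagnosis above hinges on a page with no access rights, such as the guard page below a thread stack. As a user-space illustration only (this is not the kernel fix), the sketch below maps one `PROT_NONE` page as a stand-in for a guard page and asks the kernel to read from it through `write()` on a pipe. On Linux that path copies from user memory with fault handling, so the call is expected to fail with `EFAULT` rather than crash, which is the kind of graceful failure an exception table entry for the `dcbst` would aim for. The function name `errno_from_guard_page_read` is hypothetical.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict -std= modes */
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one PROT_NONE page (a stand-in for a thread stack guard page) and
 * have the kernel try to read one byte from it via write() on a pipe.
 * Returns the errno set by write(), 0 if write() unexpectedly succeeded,
 * or -1 if the setup itself failed.
 */
static int errno_from_guard_page_read(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int fds[2];
    int result;
    void *guard;

    if (page <= 0 || pipe(fds) != 0)
        return -1;

    guard = mmap(NULL, (size_t)page, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (guard == MAP_FAILED) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* The kernel must copy from 'guard'; the page has no access rights. */
    result = (write(fds[1], guard, 1) == -1) ? errno : 0;

    munmap(guard, (size_t)page);
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

Calling `errno_from_guard_page_read()` from a small `main()` and comparing the result with `EFAULT` shows the contrast with the oops in this thread, where the cache flush touched such a page without any fault recovery in place.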
* RE: Kernel oops while duming user core. 2008-01-31 21:58 ` Scott Wood @ 2008-01-31 22:10 ` Rune Torgersen 2008-02-03 7:34 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 21+ messages in thread From: Rune Torgersen @ 2008-01-31 22:10 UTC (permalink / raw) To: Scott Wood, Nathan Lynch; +Cc: linuxppc-dev Scott Wood wrote: > Scott Wood wrote: >> Nathan Lynch wrote: >>> Is the crashing program multithreaded? The first report had firefox >>> triggering the oops. >> >> OK, I've got a test program that triggers it now. I'll see if I can >> figure out what's going on. > > The problem seems to be that update_mmu_cache() is called on a guard > page with no access rights. > > Changing update_mmu_cache() to always call flush_dcache_icache_page() > fixes it, though a better performing fix would probably be to add an > exception table entry for the dcbst. I can confirm that this seems to fix it. ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Kernel oops while duming user core. 2008-01-31 22:10 ` Rune Torgersen @ 2008-02-03 7:34 ` Benjamin Herrenschmidt 2008-02-04 18:23 ` Kernel oops while dumping " Scott Wood 0 siblings, 1 reply; 21+ messages in thread From: Benjamin Herrenschmidt @ 2008-02-03 7:34 UTC (permalink / raw) To: Rune Torgersen; +Cc: Scott Wood, linuxppc-dev, Nathan Lynch On Thu, 2008-01-31 at 16:10 -0600, Rune Torgersen wrote: > Scott Wood wrote: > > Scott Wood wrote: > >> Nathan Lynch wrote: > >>> Is the crashing program multithreaded? The first report had firefox > >>> triggering the oops. > >> > >> OK, I've got a test program that triggers it now. I'll see if I can > >> figure out what's going on. > > > > The problem seems to be that update_mmu_cache() is called on a guard > > page with no access rights. > > > > Changing update_mmu_cache() to always call flush_dcache_icache_page() > > fixes it, though a better performing fix would probably be to add an > > exception table entry for the dcbst. > > I can confirm that this seems to fix it. Might be better to avoid the flush when the page isn't readable ? Ben. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Kernel oops while dumping user core. 2008-02-03 7:34 ` Benjamin Herrenschmidt @ 2008-02-04 18:23 ` Scott Wood 0 siblings, 0 replies; 21+ messages in thread From: Scott Wood @ 2008-02-04 18:23 UTC (permalink / raw) To: benh; +Cc: linuxppc-dev, Nathan Lynch Benjamin Herrenschmidt wrote: > On Thu, 2008-01-31 at 16:10 -0600, Rune Torgersen wrote: >> Scott Wood wrote: >>> Changing update_mmu_cache() to always call flush_dcache_icache_page() >>> fixes it, though a better performing fix would probably be to add an >>> exception table entry for the dcbst. >> I can confirm that this seems to fix it. > > Might be better to avoid the flush when the page isn't readable ? Sure, that'd work. I was trying to avoid a tablewalk to determine that, not noticing the pte argument staring me in the face. :-P -Scott ^ permalink raw reply [flat|nested] 21+ messages in thread
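The suggestion discussed above (skip the flush when the page is not readable, using the `pte` argument) can be modelled as a small decision function. This is a simplified user-space model under stated assumptions, not the kernel code and not necessarily the patch that was eventually merged: `struct page_model`, `enum flush_action` and `choose_flush` are hypothetical names, and the `user_readable` flag stands in for whatever check on the pte the real code would perform.

```c
#include <stdbool.h>

/* Hypothetical, simplified model; these are not kernel types or names. */
enum flush_action {
    FLUSH_NONE,            /* nothing to do, or flush deliberately skipped */
    FLUSH_BY_USER_ADDRESS, /* stands in for __flush_dcache_icache(address) */
    FLUSH_BY_PAGE          /* stands in for flush_dcache_icache_page(page) */
};

struct page_model {
    bool reserved;      /* models PageReserved(page) */
    bool cache_clean;   /* models PG_arch_1 already set */
    bool user_readable; /* assumption: derived from the pte in real code */
    bool current_mm;    /* models vma->vm_mm == current->active_mm */
};

/* One reading of the suggestion: never touch an unreadable page through
 * its user address, and skip the flush for it entirely. */
static enum flush_action choose_flush(const struct page_model *p)
{
    if (p->reserved || p->cache_clean)
        return FLUSH_NONE;
    if (!p->user_readable)
        return FLUSH_NONE;
    return p->current_mm ? FLUSH_BY_USER_ADDRESS : FLUSH_BY_PAGE;
}
```

In this model a guard page (`user_readable == false`) in the current mm yields `FLUSH_NONE`, so the faulting `dcbst` is never issued. The earlier workaround from the thread, always flushing by page, would correspond to returning `FLUSH_BY_PAGE` wherever this sketch returns `FLUSH_BY_USER_ADDRESS`.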
* Re: Kernel oops while duming user core. 2008-01-31 16:26 ` Rune Torgersen 2008-01-31 17:40 ` Rune Torgersen @ 2008-01-31 19:15 ` Kumar Gala 2008-01-31 19:23 ` Rune Torgersen 1 sibling, 1 reply; 21+ messages in thread From: Kumar Gala @ 2008-01-31 19:15 UTC (permalink / raw) To: Rune Torgersen; +Cc: linuxppc-dev, Nathan Lynch On Jan 31, 2008, at 10:26 AM, Rune Torgersen wrote: > Nathan Lynch wrote: >> Hmm, this is the second report of 2.6.24 crashing in >> __flush_dcache_icache during a core dump; see: >> http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html >> >> Is this easily recreatable? > > Yes. I have a binary that will do this every time it is started (on > this > particular system), > only takes about 10 seconds before it dumps. > > I was going to test HEAD of powerpc.git to see if it is still there. > I cannot test any earlier versions as our board port was done on > 2.6.24. > > Our older kernel port is 2.6.18 on arch/ppc, and it works just fine. > > > One potential clue: >> Unable to handle kernel paging request for data at address 0x48024000 > > this adddress is beyond our physical memory. We have 1GB of mem > (CONFIG_HIGH_MEM enabled) so 0x3fff_ffff is the last valid address. > 0x4000_0000 to 0x7fff_ffff are unused, 0x8000_0000 to 0x9fff_ffff is > used by PCI. Can you git-bisect to narrow this down further. - k ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Kernel oops while duming user core. 2008-01-31 19:15 ` Kernel oops while duming " Kumar Gala @ 2008-01-31 19:23 ` Rune Torgersen 2008-01-31 19:54 ` Nathan Lynch 0 siblings, 1 reply; 21+ messages in thread From: Rune Torgersen @ 2008-01-31 19:23 UTC (permalink / raw) To: Kumar Gala; +Cc: linuxppc-dev, Nathan Lynch Kumar Gala wrote: > Can you git-bisect to narrow this down further. Not easily, as the board port to arch/powerpc was done on 2.6.24-rc7 and up. Is there a somewhat easy way in git to apply the differences between the master branch and our board branch to a branch created by bisect? And I don't even know where this started to happen. Would trying arch/ppc help any? I have our arch/ppc port in a semi-working state for kernels up to 2.6.23. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Kernel oops while duming user core. 2008-01-31 19:23 ` Rune Torgersen @ 2008-01-31 19:54 ` Nathan Lynch 0 siblings, 0 replies; 21+ messages in thread From: Nathan Lynch @ 2008-01-31 19:54 UTC (permalink / raw) To: Rune Torgersen; +Cc: linuxppc-dev Rune Torgersen wrote: > Kumar Gala wrote: > > Can you git-bisect to narrow this down further. > > Not easilly, as the board port to arch/powerpc was done on 2.6.24-rc7 > and up. > Is there an somewhat esy way in git to apply the differences from master > branch to our board branch to a branch created by bisect? > > And I don't even know where this started to happen. > Would trying arch/ppc help any? I have our arch/ppc port in a > semiworking state for kernels up to 2.6.23 Well, we know this happens on other 32-bit powerpc machines (pmac at least)... perhaps someone could arrange to bisect on a machine that works with older powerpc kernels (assuming they have a good repro case). ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Kernel oops while duming user core. 2008-01-31 16:15 ` Nathan Lynch 2008-01-31 16:26 ` Rune Torgersen @ 2008-02-01 17:38 ` Scott Wood 2008-02-02 12:05 ` Clemens Koller 1 sibling, 1 reply; 21+ messages in thread From: Scott Wood @ 2008-02-01 17:38 UTC (permalink / raw) To: Nathan Lynch; +Cc: linuxppc-dev On Thu, Jan 31, 2008 at 10:15:27AM -0600, Nathan Lynch wrote: > Rune Torgersen wrote: > > Hi > > > > I get the following kernel core while a user program I have is dumping > > core. > > Any ideas at what to look for? (this is running 2.6.24, arch/powerpc on > > a 8280) > > When running the program on 2.6.18 arch/ppc, the program gets a sig 11 > > and dumps core. > > On 2.6.24, I get the kernel oops, and then the program hangs around > > forever and is unkillable. > > Hmm, this is the second report of 2.6.24 crashing in > __flush_dcache_icache during a core dump; see: > http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html > > Is this easily recreatable? Yes, this program does it reliably:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>

/* Each thread prints a marker, waits a second, then stores through NULL. */
void *threadfn(void *arg)
{
	fprintf(stderr, "threadfn\n");
	fflush(stderr);
	sleep(1);
	*(char *)0=0;
	return NULL;
}

int main(void)
{
	pthread_t thread[4];
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&thread[i], NULL, threadfn, NULL);

	/* Spin until the SIGSEGV in a thread triggers the core dump. */
	for (;;);
}

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Kernel oops while duming user core.
  2008-02-01 17:38 ` Scott Wood
@ 2008-02-02 12:05   ` Clemens Koller
  0 siblings, 0 replies; 21+ messages in thread
From: Clemens Koller @ 2008-02-02 12:05 UTC
  To: Scott Wood; +Cc: linuxppc-dev, Nathan Lynch

Scott Wood wrote:
> On Thu, Jan 31, 2008 at 10:15:27AM -0600, Nathan Lynch wrote:
>> Rune Torgersen wrote:
>>> I get the following kernel core while a user program I have is dumping
>>> core.
>>> Any ideas at what to look for? (this is running 2.6.24, arch/powerpc on
>>> an 8280)
>>> When running the program on 2.6.18 arch/ppc, the program gets a sig 11
>>> and dumps core.
>>> On 2.6.24, I get the kernel oops, and then the program hangs around
>>> forever and is unkillable.
>> Hmm, this is the second report of 2.6.24 crashing in
>> __flush_dcache_icache during a core dump; see:
>> http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html
>>
>> Is this easily recreatable?
>
> Yes, this program does it reliably:
>
> #include <pthread.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <signal.h>
>
> void *threadfn(void *arg)
> {
> 	fprintf(stderr, "threadfn\n");
> 	fflush(stderr);
> 	sleep(1);
> 	*(char *)0=0;
> 	return NULL;
> }
>
> int main(void)
> {
> 	pthread_t thread[4];
> 	int i;
>
> 	for (i = 0; i < 4; i++)
> 		pthread_create(&thread[i], NULL, threadfn, NULL);
>
> 	for (;;);
> }

Ack!

This is an MPC8540ADS arch/powerpc compatible environment here:

Feb 2 12:59:17 fox_1 kernel: Unable to handle kernel paging request for data at address 0x4802f000
Feb 2 12:59:17 fox_1 kernel: Faulting instruction address: 0xc000d5b8
Feb 2 12:59:17 fox_1 kernel: Oops: Kernel access of bad area, sig: 11 [#1]
Feb 2 12:59:17 fox_1 kernel: MPC85xx ADS
Feb 2 12:59:17 fox_1 kernel: Modules linked in:
Feb 2 12:59:17 fox_1 kernel: NIP: c000d5b8 LR: c0010fb8 CTR: 00000080
Feb 2 12:59:17 fox_1 kernel: REGS: c24abb20 TRAP: 0300 Not tainted (2.6.24)
Feb 2 12:59:17 fox_1 kernel: MSR: 00029000 <EE,ME> CR: 22882222 XER: 00000000
Feb 2 12:59:17 fox_1 kernel: DEAR: 4802f000, ESR: 00000000
Feb 2 12:59:17 fox_1 kernel: TASK = cf894d20[942] 'oops' THREAD: c24aa000
Feb 2 12:59:17 fox_1 kernel: GPR00: c22c7680 c24abbd0 cf894d20 4802f000 00000080 000f8b60 4802f000 ffffffff
Feb 2 12:59:17 fox_1 kernel: GPR08: 00000000 c22c7680 000008d1 00000000 22882222 10018a64 00000006 c035a300
Feb 2 12:59:17 fox_1 kernel: GPR16: 00024000 c0380000 c24aa000 c24abc9c c24abc98 c2570480 c22c7680 c0380000
Feb 2 12:59:17 fox_1 kernel: GPR24: c0390420 cf09d000 c0497b60 c5b63948 4802f000 c24aa000 000000bc c0497b60
Feb 2 12:59:17 fox_1 kernel: NIP [c000d5b8] __flush_dcache_icache+0x14/0x40
Feb 2 12:59:17 fox_1 kernel: LR [c0010fb8] update_mmu_cache+0x94/0x98
Feb 2 12:59:17 fox_1 kernel: Call Trace:
Feb 2 12:59:17 fox_1 kernel: [c24abbd0] [c24aa000] 0xc24aa000 (unreliable)
Feb 2 12:59:17 fox_1 kernel: [c24abbe0] [c005d978] handle_mm_fault+0x374/0x6a4
Feb 2 12:59:17 fox_1 kernel: [c24abc30] [c005ddd0] get_user_pages+0x128/0x384
Feb 2 12:59:17 fox_1 kernel: [c24abc90] [c00a80d8] elf_core_dump+0xab8/0xb74
Feb 2 12:59:17 fox_1 kernel: [c24abd30] [c007718c] do_coredump+0x730/0x758
Feb 2 12:59:17 fox_1 kernel: [c24abe30] [c002eeb0] get_signal_to_deliver+0x244/0x3c4
Feb 2 12:59:17 fox_1 kernel: [c24abe80] [c000782c] do_signal+0x48/0x264
Feb 2 12:59:17 fox_1 kernel: [c24abf40] [c000e4ac] do_user_signal+0x74/0xc4
Feb 2 12:59:17 fox_1 kernel: Instruction dump:
Feb 2 12:59:17 fox_1 kernel: 4d820020 7c8903a6 7c001bac 38630020 4200fff8 7c0004ac 4e800020 60000000
Feb 2 12:59:17 fox_1 kernel: 54630026 38800080 7c8903a6 7c661b78 <7c00186c> 38630020 4200fff8 7c0004ac
Feb 2 12:59:17 fox_1 kernel: ---[ end trace a1d91e665173315a ]---
Feb 2 12:59:17 fox_1 kernel: note: oops[942] exited with preempt_count 1

It does not oops when the core dump is disabled.

Regards,

Clemens
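Clemens's observation that the oops disappears with core dumps disabled suggests a quick way to confirm the same thing from inside a test program. The message does not say how the dump was disabled (a shell `ulimit -c 0` is one common way), so the following is only a minimal sketch under that assumption: it sets the soft RLIMIT_CORE limit to zero with the standard getrlimit()/setrlimit() calls. The helper name disable_core_dumps is hypothetical and is not part of the reproducer posted in this thread.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper (not from the thread): set the soft core-file size
 * limit to zero so the process is asked not to write a core file.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int disable_core_dumps(void)
{
	struct rlimit rl;

	/* Read the current limits so the hard limit is preserved. */
	if (getrlimit(RLIMIT_CORE, &rl) != 0)
		return -1;

	rl.rlim_cur = 0;	/* soft limit only; hard limit left unchanged */

	return setrlimit(RLIMIT_CORE, &rl);
}
```

Calling disable_core_dumps() at the start of main() in the reproducer, before the threads are created, should be roughly equivalent to running it under `ulimit -c 0`. This only sidesteps the crash path for testing; it says nothing about the underlying fault in __flush_dcache_icache.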
end of thread, other threads:[~2008-02-04 18:25 UTC]

Thread overview: 21+ messages
2008-01-31 13:45 Kernel oops while duming user core Rune Torgersen
2008-01-31 16:15 ` Nathan Lynch
2008-01-31 16:26 ` Rune Torgersen
2008-01-31 17:40 ` Rune Torgersen
2008-01-31 19:15 ` Kumar Gala
2008-01-31 19:18 ` Rune Torgersen
2008-01-31 20:16 ` Scott Wood
2008-01-31 20:19 ` Rune Torgersen
2008-01-31 20:38 ` Rune Torgersen
2008-01-31 20:41 ` Nathan Lynch
2008-01-31 20:45 ` Rune Torgersen
2008-01-31 20:55 ` Scott Wood
2008-01-31 21:58 ` Scott Wood
2008-01-31 22:10 ` Rune Torgersen
2008-02-03 7:34 ` Benjamin Herrenschmidt
2008-02-04 18:23 ` Kernel oops while dumping " Scott Wood
2008-01-31 19:15 ` Kernel oops while duming " Kumar Gala
2008-01-31 19:23 ` Rune Torgersen
2008-01-31 19:54 ` Nathan Lynch
2008-02-01 17:38 ` Scott Wood
2008-02-02 12:05 ` Clemens Koller