* Re: kernel BUG at arch/sparc64/mm/fault.c:413!
2007-01-18 18:33 kernel BUG at arch/sparc64/mm/fault.c:413! Vince Weaver
@ 2007-01-24 5:25 ` David Miller
2007-01-25 3:00 ` Vince Weaver
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2007-01-24 5:25 UTC
To: sparclinux
From: Vince Weaver <vince@deater.net>
Date: Thu, 18 Jan 2007 13:33:15 -0500 (EST)
>
> I am running Linux 2.6.20-rc5 on an UltraSparc T1 (Niagara) with 24
> threads.
>
> When trying to compile gcc-4.2-20070117 gcc snapshot from scratch, the
> following BUG() happens:
What distribution and version are you running? I tried to dump
the code at address 0x1a368 of the /bin/sh binary running on
Ubuntu Dapper and it didn't show a code location which could
trigger this code path.
> The relevant code is:
>
> 409 /* If we took a ITLB miss on a non-executable page, catch
> 410 * that here.
> 411 */
> 412 if ((fault_code & FAULT_CODE_ITLB) && !(vma->vm_flags & VM_EXEC)) {
> 413 BUG_ON(address != regs->tpc);
> 414 BUG_ON(regs->tstate & TSTATE_PRIV);
> 415 goto bad_area;
> 416 }
>
> What's the next step in tracking down what's going on?
Try to print out the "fault_code", "address", and regs->tpc value
when this triggers.
I think the thread struct is being corrupted by some parallel access
and this corrupts the fault state, in particular "fault_code" is
garbage.
But I can only confirm that theory with the information I've
requested above.
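A minimal user-space sketch of the kind of debug line being asked for, assuming nothing about the real kernel types: `struct fake_regs` and `format_fault_debug()` are hypothetical stand-ins, and in the kernel the same format string would go through printk() just before the BUG_ON(). The "VMW:" prefix simply mirrors the tag that shows up in the logs later in the thread.

```c
#include <stdio.h>

/* Hypothetical stand-in for the kernel's struct pt_regs; only the
 * trap program counter is needed for this debug line. */
struct fake_regs {
    unsigned long long tpc;
};

/* Format the three requested values into buf. Returns what snprintf()
 * returns: the number of characters the full line needs. */
int format_fault_debug(char *buf, size_t len, unsigned int fault_code,
                       unsigned long long address,
                       const struct fake_regs *regs)
{
    return snprintf(buf, len,
                    "VMW: fault_code=%x address=%llx regs->tpc=%llx",
                    fault_code, address, regs->tpc);
}
```

Printing all three values in hex on one line keeps them easy to compare against the register dump that follows a BUG().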
* Re: kernel BUG at arch/sparc64/mm/fault.c:413!
2007-01-18 18:33 kernel BUG at arch/sparc64/mm/fault.c:413! Vince Weaver
2007-01-24 5:25 ` David Miller
@ 2007-01-25 3:00 ` Vince Weaver
2007-01-25 22:02 ` David Miller
` (7 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Vince Weaver @ 2007-01-25 3:00 UTC
To: sparclinux
> What distribution and version are you running? I tried to dump
> the code at address 0x1a368 of the /bin/sh binary running on
> Ubuntu Dapper and it didn't show a code location which could
> trigger this code path.
I am running Ubuntu Feisty. (probably a bit too new, I was having
problems getting ldap/autofs working and thought maybe newer would be
better).
The code surrounding that address is:
1a360: 01 00 00 00 nop
1a364: 9d e3 bf 90 save %sp, -112, %sp
1a368: 40 00 70 33 call 36434 <malloc@plt>
1a36c: 90 10 00 18 mov %i0, %o0
1a370: 80 a2 20 00 cmp %o0, 0
1a374: 02 80 00 04 be 1a384 <_IO_stdin_used-0x8b84>
1a378: b0 10 00 08 mov %o0, %i0
1a37c: 81 c7 e0 08 ret
> Try to print out the "fault_code", "address", and regs->tpc value
> when this triggers.
I've reproduced it, here are the results (the bug is still at line 413,
my debug code pushed it down a bit):
[ 263.773194] VMW: fault_code=4 address=ff0cc000 regs->tpc=70131660
[ 263.773220] kernel BUG at arch/sparc64/mm/fault.c:417!
[ 263.773238] \|/ ____ \|/
[ 263.773244] "@'/ .. \`@"
[ 263.773250] /_| \__/ |_\
[ 263.773256] \__U_/
[ 263.773269] sh(6396): Kernel bad sw trap 5 [#1]
[ 263.773286] TSTATE: 0000000011001607 TPC: 00000000006903ec TNPC: 00000000006903f0 Y: 00000000 Not tainted
[ 263.773314] TPC: <do_sparc64_fault+0x394/0x700>
[ 263.773328] g0: fffff801f9e4c000 g1: 0000000000000000 g2: 0000000000000001 g3: 0000000000000000
[ 263.773350] g4: fffff801fd760540 g5: fffff80003ce3fc0 g6: fffff801f9e4c000 g7: 0000000000000000
[ 263.773368] o0: 000000000000003d o1: 00000000007182a0 o2: 00000000000001a1 o3: 0000000070131660
[ 263.773391] o4: 4849001106491d49 o5: fffff801fb8f1060 sp: fffff801f9e4f5c1 ret_pc: 00000000006903e4
[ 263.773411] RPC: <do_sparc64_fault+0x38c/0x700>
[ 263.773427] l0: fffff801ffa8f9e0 l1: 0000000000000004 l2: fffff801fb8f1060 l3: 00000000ff0cc000
[ 263.773447] l4: 0000000000000000 l5: fffff801fb8f10c0 l6: fffff801f9e4c000 l7: 0000000011009006
[ 263.773466] i0: fffff801f9e4ff60 i1: 0000000000000033 i2: 0000000000024fd2 i3: 0000000000000003
[ 263.773485] i4: 0000000000000000 i5: 0000000000000003 i6: fffff801f9e4f6a1 i7: 0000000000404d6c
[ 263.773509] I7: <sparc64_realfault_common+0x18/0x20>
[ 263.773521] Caller[0000000000404d6c]: sparc64_realfault_common+0x18/0x20
[ 263.773542] Caller[000000000001a368]: 0x1a370
[ 263.773573] Instruction DUMP: 921021a1 7ff62ae7 901222a0 <91d02005> 12480032 8208e002 8208e005 02c84066 030000c0
Let me know if you need more info,
Thanks
Vince
* Re: kernel BUG at arch/sparc64/mm/fault.c:413!
2007-01-18 18:33 kernel BUG at arch/sparc64/mm/fault.c:413! Vince Weaver
2007-01-24 5:25 ` David Miller
2007-01-25 3:00 ` Vince Weaver
@ 2007-01-25 22:02 ` David Miller
2007-01-25 23:26 ` David Miller
` (6 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2007-01-25 22:02 UTC
To: sparclinux
From: Vince Weaver <vince@deater.net>
Date: Wed, 24 Jan 2007 22:00:12 -0500 (EST)
> I've reproduced it, here are the results (the bug is still at line 413,
> my debug code pushed it down a bit):
>
>
> [ 263.773194] VMW: fault_code=4 address=ff0cc000 regs->tpc=70131660
Thanks for the info Vince, I'll look more closely at this.
* Re: kernel BUG at arch/sparc64/mm/fault.c:413!
2007-01-18 18:33 kernel BUG at arch/sparc64/mm/fault.c:413! Vince Weaver
` (2 preceding siblings ...)
2007-01-25 22:02 ` David Miller
@ 2007-01-25 23:26 ` David Miller
2007-01-25 23:48 ` David Miller
` (5 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2007-01-25 23:26 UTC
To: sparclinux
From: Vince Weaver <vince@deater.net>
Date: Wed, 24 Jan 2007 22:00:12 -0500 (EST)
> I've reproduced it, here are the results (the bug is still at line 413,
> my debug code pushed it down a bit):
>
> [ 263.773194] VMW: fault_code=4 address=ff0cc000 regs->tpc=70131660
>
I have a theory about what is happening here, but not why.
But I'd like some more confirmation. What I think is going
on is that (in do_sparc64_fault):
fault_code = get_thread_fault_code();
if (notify_page_fault(DIE_PAGE_FAULT, "page_fault", regs,
fault_code, 0, SIGSEGV) == NOTIFY_STOP)
return;
si_code = SEGV_MAPERR;
address = current_thread_info()->fault_address;
The addr/code state is changed between the read of
get_thread_fault_code() and ->fault_address.
Besides the main fault path, there is only one other path which
could trigger here which writes to these thread state variables.
That's the window spill/fill code, BUT that code guards the writes
with a check if we came from privileged mode which should always
be true, for example:
rdpr %tstate, %g1
andcc %g1, TSTATE_PRIV, %g0
saved
be,pn %xcc, 1f
and %g1, TSTATE_CWP, %g1
retry
1: mov FAULT_CODE_WRITE | FAULT_CODE_DTLB | FAULT_CODE_WINFIXUP, %g4
stb %g4, [%g6 + TI_FAULT_CODE]
stx %g5, [%g6 + TI_FAULT_ADDR]
If TSTATE_PRIV is set, it skips the write to TI_FAULT_{CODE,ADDR}.
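The guard in that assembly can be restated in C. This is only a sketch of the branch logic, not kernel code: `struct fault_state` and `winfix_record()` are hypothetical names, and the constant values are recalled from the sparc64 headers of that era, so check them against your tree.

```c
/* Assumed values, recalled from the sparc64 headers; verify locally. */
#define TSTATE_PRIV          0x0000000000000400ULL
#define FAULT_CODE_WRITE     0x01
#define FAULT_CODE_DTLB      0x02
#define FAULT_CODE_WINFIXUP  0x08

/* Hypothetical stand-in for the two thread_info fields involved. */
struct fault_state {
    unsigned char      code;  /* TI_FAULT_CODE */
    unsigned long long addr;  /* TI_FAULT_ADDR */
};

/* Model of the quoted spill/fill fixup. Returns 1 if the fault state
 * was rewritten (the "1:" path), 0 if the handler simply retried. */
int winfix_record(struct fault_state *st, unsigned long long tstate,
                  unsigned long long fault_addr)
{
    if (tstate & TSTATE_PRIV)
        return 0;  /* andcc result non-zero: be,pn not taken, retry */

    /* andcc result zero: be,pn taken, fall into the stores at "1:" */
    st->code = FAULT_CODE_WRITE | FAULT_CODE_DTLB | FAULT_CODE_WINFIXUP;
    st->addr = fault_addr;
    return 1;
}
```

With a privileged TSTATE such as the 0x11001607 seen in the oops, the call returns 0 and leaves `st` untouched; with TSTATE_PRIV clear it stores the WINFIXUP code and the address.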
I highly suspected this code path because the value of "address" is
in the user process stack address range, however the logic is clearly
correct here.
The other problematic case is other cpus accessing the
current_thread_info()->flags member in a non-atomic manner.
This is critical because get_thread_fault_code() reads a byte
embedded in that word. So if another cpu does a read-modify-write
of current_thread_info()->flags we can get:
    Cpu 1                        Cpu 2
    ...                          read ->flags, see old fault_code
    store new fault_code
    store new fault_address
                                 write ->flags, puts old fault_code there
and that would look exactly like this.
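To make the suspected interleaving concrete, here is a single-threaded replay of it, written as a sketch rather than a real SMP test. The shift of 56 is TI_FLAG_FAULT_CODE_SHIFT from the header touched by the patch in this message; `replay_lost_update()` and the `other_bit` argument are hypothetical, standing in for whatever unrelated flag the second cpu is updating.

```c
#include <stdint.h>

#define FAULT_CODE_SHIFT 56  /* TI_FLAG_FAULT_CODE_SHIFT in thread_info.h */

/* Replace the fault-code byte inside a 64-bit flags word. */
static uint64_t set_fault_code(uint64_t flags, uint8_t code)
{
    flags &= ~((uint64_t)0xff << FAULT_CODE_SHIFT);
    return flags | ((uint64_t)code << FAULT_CODE_SHIFT);
}

/* Extract the fault-code byte from a 64-bit flags word. */
static uint8_t get_fault_code(uint64_t flags)
{
    return (uint8_t)(flags >> FAULT_CODE_SHIFT);
}

/* Replay the Cpu 1 / Cpu 2 sequence by hand and return the fault code
 * left in the word afterwards. */
uint8_t replay_lost_update(uint8_t old_code, uint8_t new_code,
                           uint64_t other_bit)
{
    uint64_t flags = set_fault_code(0, old_code);
    uint64_t cpu2_copy;

    cpu2_copy = flags;                        /* Cpu 2: read ->flags     */
    flags = set_fault_code(flags, new_code);  /* Cpu 1: store fault_code */
    cpu2_copy |= other_bit;                   /* Cpu 2: modify its copy  */
    flags = cpu2_copy;                        /* Cpu 2: write ->flags    */

    return get_fault_code(flags);
}
```

Calling `replay_lost_update(0x04, 0x03, 1)` hands back the old code 0x04 even though Cpu 1 stored 0x03, which is the symptom this theory predicts. Note that later messages in the thread rule this theory out as the cause of the reported BUG.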
I think the fault being taken is a stack fault via a save instruction
in userspace or similar. The most recent fault before that was an
instruction fault into GLIBC which set the fault code to FAULT_CODE_ITLB.
That's why we see FAULT_CODE_ITLB combined with a stack address.
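As a sanity check on that reading of the log, the test from fault.c can be replayed on the logged values. This is a sketch under assumptions: `classify_itlb_fault()` and the enum are hypothetical, VM_EXEC is taken to be 0x4 as in linux/mm.h, and the second BUG_ON() on TSTATE_PRIV is left out for brevity.

```c
#define FAULT_CODE_ITLB 0x04          /* matches fault_code=4 in the log */
#define VM_EXEC         0x00000004UL  /* assumed, as in linux/mm.h */

enum fault_outcome {
    FAULT_CONTINUE,  /* the ITLB/non-exec check does not apply */
    FAULT_BAD_AREA,  /* goto bad_area */
    FAULT_BUG        /* BUG_ON(address != regs->tpc) fires */
};

/* Mirror of the ITLB-miss-on-non-executable-page check quoted at the
 * start of the thread, minus the TSTATE_PRIV BUG_ON(). */
enum fault_outcome classify_itlb_fault(unsigned int fault_code,
                                       unsigned long vm_flags,
                                       unsigned long long address,
                                       unsigned long long tpc)
{
    if ((fault_code & FAULT_CODE_ITLB) && !(vm_flags & VM_EXEC)) {
        if (address != tpc)
            return FAULT_BUG;
        return FAULT_BAD_AREA;
    }
    return FAULT_CONTINUE;
}
```

Feeding in fault_code=4, a non-executable vma, address 0xff0cc000 and tpc 0x70131660 gives FAULT_BUG: an instruction-fault code paired with an address that is not the faulting PC is exactly the inconsistent state the BUG_ON() was written to catch.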
I did some auditing and I really can't see how it can happen :-)
But we can test this theory with a relatively simple patch,
can you give this a try? What it does is it moves fault_code
out of the thread flags word.
If you can't trigger it, then my theory is right. If you can,
then the bug is somewhere else :-)
Note, this patch is pretty straight forward, but I've only compile
tested it. I'm about to test boot it myself right now.
diff --git a/include/asm-sparc64/thread_info.h b/include/asm-sparc64/thread_info.h
index 2ebf7f2..75921fd 100644
--- a/include/asm-sparc64/thread_info.h
+++ b/include/asm-sparc64/thread_info.h
@@ -11,8 +11,6 @@
#define NSWINS 7
-#define TI_FLAG_BYTE_FAULT_CODE 0
-#define TI_FLAG_FAULT_CODE_SHIFT 56
#define TI_FLAG_BYTE_WSTATE 1
#define TI_FLAG_WSTATE_SHIFT 48
#define TI_FLAG_BYTE_CWP 2
@@ -49,7 +47,8 @@ struct thread_info {
int preempt_count; /* 0 => preemptable, <0 => BUG */
__u8 new_child;
__u8 syscall_noerror;
- __u16 __pad;
+ __u8 fault_code;
+ __u8 __pad;
unsigned long *utraps;
@@ -77,7 +76,6 @@ struct thread_info {
/* offsets into the thread_info struct for assembly code access */
#define TI_TASK 0x00000000
#define TI_FLAGS 0x00000008
-#define TI_FAULT_CODE (TI_FLAGS + TI_FLAG_BYTE_FAULT_CODE)
#define TI_WSTATE (TI_FLAGS + TI_FLAG_BYTE_WSTATE)
#define TI_CWP (TI_FLAGS + TI_FLAG_BYTE_CWP)
#define TI_CURRENT_DS (TI_FLAGS + TI_FLAG_BYTE_CURRENT_DS)
@@ -92,6 +90,7 @@ struct thread_info {
#define TI_PRE_COUNT 0x00000038
#define TI_NEW_CHILD 0x0000003c
#define TI_SYS_NOERROR 0x0000003d
+#define TI_FAULT_CODE 0x0000003e
#define TI_UTRAPS 0x00000040
#define TI_REG_WINDOW 0x00000048
#define TI_RWIN_SPTRS 0x000003c8
@@ -179,8 +178,8 @@ register struct thread_info *current_thread_info_reg asm("g6");
((unsigned char *)(&((ti)->flags)))
#define __cur_thread_flag_byte_ptr __thread_flag_byte_ptr(current_thread_info())
-#define get_thread_fault_code() (__cur_thread_flag_byte_ptr[TI_FLAG_BYTE_FAULT_CODE])
-#define set_thread_fault_code(val) (__cur_thread_flag_byte_ptr[TI_FLAG_BYTE_FAULT_CODE] = (val))
+#define get_thread_fault_code() (current_thread_info()->fault_code)
+#define set_thread_fault_code(val) (current_thread_info()->fault_code = (val))
#define get_thread_wstate() (__cur_thread_flag_byte_ptr[TI_FLAG_BYTE_WSTATE])
#define set_thread_wstate(val) (__cur_thread_flag_byte_ptr[TI_FLAG_BYTE_WSTATE] = (val))
#define get_thread_cwp() (__cur_thread_flag_byte_ptr[TI_FLAG_BYTE_CWP])
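The new TI_FAULT_CODE offset of 0x3e can be cross-checked against the struct change with offsetof(). This is a sketch only: `struct ti_tail_sketch` is hypothetical, the `earlier_fields` array is a placeholder for everything the real thread_info has before preempt_count, and a 4-byte int is assumed.

```c
#include <stddef.h>

/* Hypothetical cut-down layout; only the fields around the new byte
 * are spelled out. */
struct ti_tail_sketch {
    unsigned char  earlier_fields[0x38]; /* placeholder, 0x00..0x37 */
    int            preempt_count;        /* TI_PRE_COUNT   0x38 */
    unsigned char  new_child;            /* TI_NEW_CHILD   0x3c */
    unsigned char  syscall_noerror;      /* TI_SYS_NOERROR 0x3d */
    unsigned char  fault_code;           /* TI_FAULT_CODE  0x3e */
    unsigned char  pad;                  /* remaining pad  0x3f */
    unsigned long *utraps;               /* TI_UTRAPS      0x40 */
};

/* Offset the assembly would use for the relocated fault code byte. */
size_t ti_fault_code_offset(void)
{
    return offsetof(struct ti_tail_sketch, fault_code);
}

/* Offset of the field that follows; it must not move. */
size_t ti_utraps_offset(void)
{
    return offsetof(struct ti_tail_sketch, utraps);
}
```

Because the new byte is carved out of the old 16-bit pad, `ti_utraps_offset()` stays at 0x40 and none of the later TI_* offsets need to change.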
* Re: kernel BUG at arch/sparc64/mm/fault.c:413!
2007-01-18 18:33 kernel BUG at arch/sparc64/mm/fault.c:413! Vince Weaver
` (3 preceding siblings ...)
2007-01-25 23:26 ` David Miller
@ 2007-01-25 23:48 ` David Miller
2007-01-26 3:21 ` Vince Weaver
` (4 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2007-01-25 23:48 UTC
To: sparclinux
From: David Miller <davem@davemloft.net>
Date: Thu, 25 Jan 2007 15:26:21 -0800 (PST)
> Note, this patch is pretty straight forward, but I've only compile
> tested it. I'm about to test boot it myself right now.
FWIW, it works fine for me :)
* Re: kernel BUG at arch/sparc64/mm/fault.c:413!
2007-01-18 18:33 kernel BUG at arch/sparc64/mm/fault.c:413! Vince Weaver
` (4 preceding siblings ...)
2007-01-25 23:48 ` David Miller
@ 2007-01-26 3:21 ` Vince Weaver
2007-01-26 8:39 ` David Miller
` (3 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Vince Weaver @ 2007-01-26 3:21 UTC
To: sparclinux
>> Note, this patch is pretty straight forward, but I've only compile
>> tested it. I'm about to test boot it myself right now.
>
> FWIW, it works fine for me :)
Unfortunately it didn't seem to fix things. It ran a lot longer this
time, but eventually stopped with
[ 597.730241] VMW: fault_code=4 address=ff564000 regs->tpc=70179660
[ 597.730266] kernel BUG at arch/sparc64/mm/fault.c:417!
[ 597.730284] \|/ ____ \|/
[ 597.730290] "@'/ .. \`@"
[ 597.730296] /_| \__/ |_\
[ 597.730302] \__U_/
[ 597.730315] sh(8775): Kernel bad sw trap 5 [#1]
[ 597.730331] TSTATE: 0000000011001607 TPC: 00000000006903ec TNPC: 00000000006903f0 Y: 00000000 Not tainted
[ 597.730359] TPC: <do_sparc64_fault+0x394/0x700>
[ 597.730374] g0: fffff801f91ac000 g1: 0000000000000000 g2: 0000000000000001 g3: 0000000000000000
[ 597.730395] g4: fffff801fc8c9ae0 g5: fffff80003ca3fc0 g6: fffff801f91ac000 g7: 0000000000000000
[ 597.730414] o0: 000000000000003d o1: 00000000007182a0 o2: 00000000000001a1 o3: 0000000070179660
[ 597.730436] o4: 4849001106491d49 o5: fffff801fa57c140 sp: fffff801f91af5c1 ret_pc: 00000000006903e4
[ 597.730456] RPC: <do_sparc64_fault+0x38c/0x700>
[ 597.730472] l0: fffff801fd945a88 l1: 0000000000000004 l2: fffff801fa57c140 l3: 00000000ff564000
[ 597.730493] l4: 0000000000000000 l5: fffff801fa57c1a0 l6: fffff801f91ac000 l7: 0000000011009006
[ 597.730512] i0: fffff801f91aff60 i1: 0000000000000033 i2: 0000000000024fd2 i3: 0000000000000003
[ 597.730531] i4: 0000000000000000 i5: 0000000000000003 i6: fffff801f91af6a1 i7: 0000000000404d6c
[ 597.730555] I7: <sparc64_realfault_common+0x18/0x20>
[ 597.730566] Caller[0000000000404d6c]: sparc64_realfault_common+0x18/0x20
[ 597.730587] Caller[000000000001a368]: 0x1a370
[ 597.730617] Instruction DUMP: 921021a1 7ff62ae7 901222a0 <91d02005> 12480032 8208e002 8208e005 02c84066 030000c0
Vince
* Re: kernel BUG at arch/sparc64/mm/fault.c:413!
2007-01-18 18:33 kernel BUG at arch/sparc64/mm/fault.c:413! Vince Weaver
` (5 preceding siblings ...)
2007-01-26 3:21 ` Vince Weaver
@ 2007-01-26 8:39 ` David Miller
2007-01-26 11:01 ` David Miller
` (2 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2007-01-26 8:39 UTC
To: sparclinux
From: Vince Weaver <vince@deater.net>
Date: Thu, 25 Jan 2007 22:21:00 -0500 (EST)
>
> >> Note, this patch is pretty straight forward, but I've only compile
> >> tested it. I'm about to test boot it myself right now.
> >
> > FWIW, it works fine for me :)
>
> Unfortunately it didn't seem to fix things. It ran a lot longer this
> time, but eventually stopped with
>
> [ 597.730241] VMW: fault_code=4 address=ff564000 regs->tpc=70179660
> [ 597.730266] kernel BUG at arch/sparc64/mm/fault.c:417!
That eliminates one theory.
I'll think about this some more, but could you do me a favor
and test something? Revert my patch, and change your debugging
printk to print out not just fault_code and address, but also
get_thread_fault_code() and current_thread_info()->fault_address.
I want to see if they are changing after being read for some
reason.
Thanks!
* Re: kernel BUG at arch/sparc64/mm/fault.c:413!
2007-01-18 18:33 kernel BUG at arch/sparc64/mm/fault.c:413! Vince Weaver
` (6 preceding siblings ...)
2007-01-26 8:39 ` David Miller
@ 2007-01-26 11:01 ` David Miller
2007-01-26 19:44 ` Vince Weaver
2007-01-26 21:05 ` David Miller
9 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2007-01-26 11:01 UTC
To: sparclinux
Vince, I think I figured out what the bug is, can you test the
following patch? If we set %g5 with the fault address here, we have
to set %g4 too. This allows us to correctly handle a DTLB-PROT trap for
a window spill during trap entry for another top-level fault. The
non-Niagara DTLB-PROT code does this properly, it's just the sun4v
side that had the bug.
This is why in your logs the address was in the stack, but the fault
code was I-TLB :-)
Thanks a lot!
diff --git a/arch/sparc64/kernel/sun4v_tlb_miss.S b/arch/sparc64/kernel/sun4v_tlb_miss.S
index b731881..9871dbb 100644
--- a/arch/sparc64/kernel/sun4v_tlb_miss.S
+++ b/arch/sparc64/kernel/sun4v_tlb_miss.S
@@ -142,9 +142,9 @@ sun4v_dtlb_prot:
rdpr %tl, %g1
cmp %g1, 1
bgu,pn %xcc, winfix_trampoline
- nop
- ba,pt %xcc, sparc64_realfault_common
mov FAULT_CODE_DTLB | FAULT_CODE_WRITE, %g4
+ ba,pt %xcc, sparc64_realfault_common
+ nop
/* Called from trap table:
* %g4: vaddr
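The effect of moving the mov into the delay slot of the first branch can be modeled in C. This is a sketch of the control flow only: the function and type names are hypothetical, and `g4_on_entry` stands for whatever value %g4 happened to hold when the handler was entered. It relies on the SPARC rule that the delay slot of a non-annulled conditional branch executes whether or not the branch is taken.

```c
#define FAULT_CODE_WRITE 0x01
#define FAULT_CODE_DTLB  0x02

enum dtlb_prot_target { TO_WINFIX_TRAMPOLINE, TO_REALFAULT_COMMON };

/* Where the handler branches, and what %g4 holds when it gets there. */
struct dtlb_prot_result {
    enum dtlb_prot_target target;
    unsigned int g4;
};

/* Before the patch: the bgu delay slot is a nop, so the winfix path
 * is entered with %g4 untouched. */
struct dtlb_prot_result dtlb_prot_before(unsigned int tl,
                                         unsigned int g4_on_entry)
{
    struct dtlb_prot_result r = { TO_REALFAULT_COMMON, g4_on_entry };

    if (tl > 1) {                  /* bgu,pn %xcc, winfix_trampoline */
        r.target = TO_WINFIX_TRAMPOLINE;
        return r;                  /* delay slot: nop */
    }
    r.g4 = FAULT_CODE_DTLB | FAULT_CODE_WRITE;  /* delay slot of ba,pt */
    return r;
}

/* After the patch: the mov sits in the bgu delay slot, so it runs on
 * both paths before either target is reached. */
struct dtlb_prot_result dtlb_prot_after(unsigned int tl,
                                        unsigned int g4_on_entry)
{
    struct dtlb_prot_result r;

    (void)g4_on_entry;             /* always overwritten below */
    r.g4 = FAULT_CODE_DTLB | FAULT_CODE_WRITE;
    r.target = (tl > 1) ? TO_WINFIX_TRAMPOLINE : TO_REALFAULT_COMMON;
    return r;
}
```

With tl = 2 and a leftover value of 0x04 in %g4, the "before" model reaches the trampoline still carrying 0x04, while the "after" model carries FAULT_CODE_DTLB | FAULT_CODE_WRITE. At tl = 1 the two behave the same.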
* Re: kernel BUG at arch/sparc64/mm/fault.c:413!
2007-01-18 18:33 kernel BUG at arch/sparc64/mm/fault.c:413! Vince Weaver
` (7 preceding siblings ...)
2007-01-26 11:01 ` David Miller
@ 2007-01-26 19:44 ` Vince Weaver
2007-01-26 21:05 ` David Miller
9 siblings, 0 replies; 11+ messages in thread
From: Vince Weaver @ 2007-01-26 19:44 UTC
To: sparclinux
> Vince, I think I figured out what the bug is, can you test the
> following patch? If we set %g5 with the fault address here, we have
> to set %g4 too. This allows us to correctly handle a DTLB-PROT trap for
> a window spill during trap entry for another top-level fault. The
> non-Niagara DTLB-PROT code does this properly, it's just the sun4v
> side that had the bug.
Great news! This fixes things. The gcc snapshot has been compiling now
for over an hour without a problem, whereas before the bug would appear
within seconds.
Thanks for fixing this,
Vince
* Re: kernel BUG at arch/sparc64/mm/fault.c:413!
2007-01-18 18:33 kernel BUG at arch/sparc64/mm/fault.c:413! Vince Weaver
` (8 preceding siblings ...)
2007-01-26 19:44 ` Vince Weaver
@ 2007-01-26 21:05 ` David Miller
9 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2007-01-26 21:05 UTC
To: sparclinux
From: Vince Weaver <vince@deater.net>
Date: Fri, 26 Jan 2007 14:44:44 -0500 (EST)
>
> > Vince, I think I figured out what the bug is, can you test the
> > following patch? If we set %g5 with the fault address here, we have
> > to set %g4 too. This allows us to correctly handle a DTLB-PROT trap for
> > a window spill during trap entry for another top-level fault. The
> > non-Niagara DTLB-PROT code does this properly, it's just the sun4v
> > side that had the bug.
>
> Great news! This fixes things. The gcc snapshot has been compiling now
> for over an hour without a problem, whereas before the bug would appear
> within seconds.
>
> Thanks for fixing this,
Awesome, thanks for testing.