linux-arm-kernel.lists.infradead.org archive mirror
* strange, spurious seeming vector exception on pxa300
@ 2009-12-01 22:13 Yeasah Pell
  2009-12-02  6:00 ` Eric Miao
  0 siblings, 1 reply; 7+ messages in thread
From: Yeasah Pell @ 2009-12-01 22:13 UTC (permalink / raw)
  To: linux-arm-kernel

Has anybody ever seen vector exceptions happen on an ARM (xscale, 
pxa300) without 26-bit mode being used? I have some application and 
kernel code which appears to work on most hardware, but we have at least 
one board which causes periodic messages:

Unhandled fault: vector exception (0x010) at 0x412c8a90

(I also fudged the fault handler a bit to dump the SPSR: 0x80000010)

These messages correspond with a SEGV being sent to the application. The 
code address is always the same, but the instruction in question is just 
an ordinary load just like many others surrounding it. It's in a hot 
path so it's being run successfully for thousands of iterations before 
the problem manifests. Running in gdb and ignoring the SEGV causes the 
application to continue normally, so apparently the load is successful 
(the particular load operation in question is critical to proper 
operation of the app).

The definition of a vector exception as I understand it seems to be at 
odds with the context in which the exception is being generated (for one 
thing the CPU's not in 26-bit mode, and for another thing the data 
address is nowhere near the exception vectors), so it seems like it 
might be spurious somehow.

If anybody has any theory how this might happen other than some kind of 
hardware fault, please let me know! It's driving me absolutely nuts.

Thanks,
Yeasah Pell

^ permalink raw reply	[flat|nested] 7+ messages in thread

* strange, spurious seeming vector exception on pxa300
  2009-12-01 22:13 strange, spurious seeming vector exception on pxa300 Yeasah Pell
@ 2009-12-02  6:00 ` Eric Miao
  2009-12-02  6:07   ` Eric Miao
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Miao @ 2009-12-02  6:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Dec 2, 2009 at 6:13 AM, Yeasah Pell <yeasah@comrex.com> wrote:
> Has anybody ever seen vector exceptions happen on an ARM (xscale, pxa300)
> without 26-bit mode being used? I have some application and kernel code
> which appears to work on most hardware, but we have at least one board which
> causes periodic messages:
>
> Unhandled fault: vector exception (0x010) at 0x412c8a90
>
> (I also fudged the fault handler a bit to dump the SPSR: 0x80000010)

Never had such exceptions. This is weird: SPSR[4] == 1 indicates a 32-bit mode.


* strange, spurious seeming vector exception on pxa300
  2009-12-02  6:00 ` Eric Miao
@ 2009-12-02  6:07   ` Eric Miao
  2009-12-02 14:40     ` Yeasah Pell
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Miao @ 2009-12-02  6:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Dec 2, 2009 at 2:00 PM, Eric Miao <eric.y.miao@gmail.com> wrote:
> On Wed, Dec 2, 2009 at 6:13 AM, Yeasah Pell <yeasah@comrex.com> wrote:
>> Has anybody ever seen vector exceptions happen on an ARM (xscale, pxa300)
>> without 26-bit mode being used? I have some application and kernel code
>> which appears to work on most hardware, but we have at least one board which
>> causes periodic messages:
>>
>> Unhandled fault: vector exception (0x010) at 0x412c8a90
>>
>> (I also fudged the fault handler a bit to dump the SPSR: 0x80000010)
>
> Never had such exceptions. This is weird: SPSR[4] == 1 indicates a 32-bit mode.

When the processor is in a 32-bit configuration (PROG32 is active) and
in a 26-bit mode (CPSR[4] == 0),
data access (but not instruction fetches) to the exception vectors
(address 0x0 to 0x1f) causes a data abort.
This is known as a vector exception.

This is what the manual explains; it seems to be related to 26-bit mode.
What compilation environment and flags are you using for your application?


* strange, spurious seeming vector exception on pxa300
  2009-12-02  6:07   ` Eric Miao
@ 2009-12-02 14:40     ` Yeasah Pell
  2009-12-02 15:50       ` Russell King - ARM Linux
  0 siblings, 1 reply; 7+ messages in thread
From: Yeasah Pell @ 2009-12-02 14:40 UTC (permalink / raw)
  To: linux-arm-kernel

Eric Miao wrote:
> On Wed, Dec 2, 2009 at 2:00 PM, Eric Miao <eric.y.miao@gmail.com> wrote:
>   
>> On Wed, Dec 2, 2009 at 6:13 AM, Yeasah Pell <yeasah@comrex.com> wrote:
>>     
>>> Has anybody ever seen vector exceptions happen on an ARM (xscale, pxa300)
>>> without 26-bit mode being used? I have some application and kernel code
>>> which appears to work on most hardware, but we have at least one board which
>>> causes periodic messages:
>>>
>>> Unhandled fault: vector exception (0x010) at 0x412c8a90
>>>
>>> (I also fudged the fault handler a bit to dump the SPSR: 0x80000010)
>>>       
>> Never had such exceptions. This is weird: SPSR[4] == 1 indicates a 32-bit mode.
>>     
>
> When the processor is in a 32-bit configuration (PROG32 is active) and
> in a 26-bit mode (CPSR[4] == 0),
> data access (but not instruction fetches) to the exception vectors
> (address 0x0 to 0x1f) causes a data abort.
> This is known as a vector exception.
>
> This is what the manual explains; it seems to be related to 26-bit mode.
> What compilation environment and flags are you using for your application?
>   

Hi, Eric -- thanks for the reply.

It's a crosstool-ng generated toolchain with gcc 4.3.2. The optimization
flags are '-mcpu=xscale -funroll-loops -O3', but it has been observed on 
debug builds which lack these flags as well.

There's no 26-bit code in the system that I'm aware of, certainly not in 
the application where the exception occurs. As you can see from the 
saved CPSR, the processor isn't in 26-bit mode at the time of the 
exception anyway. And even if it was, the load is from 0x412c8a90 
(etc.), not 0x0-0x1f. From what I've seen in the ARM architecture manual 
(mostly the part that you've copied above), this operation should not be 
able to cause such an exception, so I'm wondering if there is some 
alternate condition that can lead to this kind of exception.

In gdb, things look like this (after the SEGV from the fault is received 
by the target):

(gdb) info registers
r0             0x0    0
r1             0x412c8a04    1093437956
r2             0x0    0
r3             0x401c57f8    1075599352
r4             0x4029457c    1076446588
r5             0x9    9
r6             0x40390000    1077477376
r7             0x412c94e0    1093440736
r8             0x40390150    1077477712
r9             0x3d0f00    4001536
r10            0x4037a6bc    1077388988
r11            0x412c8b84    1093438340
r12            0x401d6c20    1075670048
sp             0x412c8a2c    0x412c8a2c
lr             0x4029603c    1076453436
pc             0x400ec47c    0x400ec47c <f1+172>
fps            0x0    0
cpsr           0x60000010    1610612752
(gdb) disassemble 0x400ec47c
Dump of assembler code for function f1:
0x400ec3d0 <f1+0>:    mov    r12, sp
0x400ec3d4 <f1+4>:    push    {r4, r5, r6, r7, r8, r9, r10, r11, r12, lr, pc}
0x400ec3d8 <f1+8>:    ldr    r4, [pc, #3508]    ; 0x400ed194 <f1+3524>
0x400ec3dc <f1+12>:    sub    r11, r12, #4    ; 0x4
0x400ec3e0 <f1+16>:    ldr    lr, [pc, #3504]    ; 0x400ed198 <f1+3528>
0x400ec3e4 <f1+20>:    ldr    r12, [pc, #3504]    ; 0x400ed19c <f1+3532>
0x400ec3e8 <f1+24>:    add    r3, pc, r4
0x400ec3ec <f1+28>:    sub    sp, sp, #304    ; 0x130
0x400ec3f0 <f1+32>:    str    r3, [r11, #-296]
0x400ec3f4 <f1+36>:    ldr    r4, [r3, r12]
0x400ec3f8 <f1+40>:    add    lr, r3, lr
0x400ec3fc <f1+44>:    ldr    r12, [r11, #-296]
0x400ec400 <f1+48>:    ldr    r3, [pc, #3480]    ; 0x400ed1a0 <f1+3536>
0x400ec404 <f1+52>:    str    r0, [r11, #-244]
0x400ec408 <f1+56>:    sub    r0, r11, #40    ; 0x28
0x400ec40c <f1+60>:    add    r3, r12, r3
0x400ec410 <f1+64>:    sub    r12, r11, #140    ; 0x8c
0x400ec414 <f1+68>:    str    r4, [r11, #-148]
0x400ec418 <f1+72>:    str    lr, [r11, #-144]
0x400ec41c <f1+76>:    stmib    r12, {r3, sp}
0x400ec420 <f1+80>:    str    r0, [r11, #-140]
0x400ec424 <f1+84>:    sub    r0, r11, #172    ; 0xac
0x400ec428 <f1+88>:    str    r1, [r11, #-248]
0x400ec42c <f1+92>:    str    r2, [r11, #-252]
0x400ec430 <f1+96>:    bl    0x400e1c60 <_init+1048>

0x400ec434 <f1+100>:    ldr    r1, [r11, #-248] ; beginning of "actual" function code
0x400ec438 <f1+104>:    cmp    r1, #0    ; 0x0 ; this is expected to be always unequal
0x400ec43c <f1+108>:    streq    r1, [r11, #-228]
0x400ec440 <f1+112>:    beq    0x400ec47c <f1+172>
0x400ec444 <f1+116>:    ldr    r3, [pc, #3416]    ; 0x400ed1a4 <f1+3540>
0x400ec448 <f1+120>:    ldr    r2, [r11, #-296]
0x400ec44c <f1+124>:    ldr    lr, [pc, #3412]    ; 0x400ed1a8 <f1+3544>
0x400ec450 <f1+128>:    mov    r0, r1
0x400ec454 <f1+132>:    ldr    r1, [r2, r3]
0x400ec458 <f1+136>:    mov    r3, #0    ; 0x0
0x400ec45c <f1+140>:    ldr    r2, [r2, lr]
0x400ec460 <f1+144>:    bl    0x400e3370 <_init+6952>
0x400ec464 <f1+148>:    cmp    r0, #0    ; 0x0 ; this is expected to be always equal
0x400ec468 <f1+152>:    ldrne    r12, [r11, #-244]
0x400ec46c <f1+156>:    movne    r3, #1    ; 0x1
0x400ec470 <f1+160>:    str    r0, [r11, #-228]
0x400ec474 <f1+164>:    strne    r3, [r12, #16]
0x400ec478 <f1+168>:    strne    r3, [r12, #8]

0x400ec47c <f1+172>:    ldr    r1, [r11, #-244] ; this throws an exception once in many thousand iterations
0x400ec480 <f1+176>:    ldr    r0, [r1, #16]
...

The compare at 0x400ec438 is expected to be unequal (and the register
state shown above confirms this at the time of the exception), and the
compare at 0x400ec464 is expected to be equal (again the register state
confirms this). So we know the path of execution must have included, for
example, 0x400ec448, which is a substantially similar operation to the
one that causes the exception: a plain register load from the same page
in memory.

I noticed that the instruction that throws the exception is a branch
target (from the beq at 0x400ec440). Inserting a nop at the location where
the exception is thrown appears to avoid the problem at any timescale that
I can detect (many hours at least, versus the few minutes it can take to
fail without it) -- but inserting a nop at any other location in the
function doesn't seem effective. Perhaps I will try running this test
with branch prediction disabled, assuming that doesn't hurt
performance so much that the test cannot be run.


* strange, spurious seeming vector exception on pxa300
  2009-12-02 14:40     ` Yeasah Pell
@ 2009-12-02 15:50       ` Russell King - ARM Linux
  2009-12-02 16:04         ` Daniel Mack
  0 siblings, 1 reply; 7+ messages in thread
From: Russell King - ARM Linux @ 2009-12-02 15:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Dec 02, 2009 at 09:40:41AM -0500, Yeasah Pell wrote:
> Eric Miao wrote:
>> On Wed, Dec 2, 2009 at 2:00 PM, Eric Miao <eric.y.miao@gmail.com> wrote:
>>> On Wed, Dec 2, 2009 at 6:13 AM, Yeasah Pell <yeasah@comrex.com> wrote:
>>>> Has anybody ever seen vector exceptions happen on an ARM (xscale, pxa300)
>>>> without 26-bit mode being used? I have some application and kernel code
>>>> which appears to work on most hardware, but we have at least one board which
>>>> causes periodic messages:
>>>>
>>>> Unhandled fault: vector exception (0x010) at 0x412c8a90

FSR=0x010, which decodes as: Read. Domain 1. Status 0.

>>>> (I also fudged the fault handler a bit to dump the SPSR: 0x80000010)

NZCV=1000, 32-bit, user mode.

> There's no 26-bit code in the system that I'm aware of, certainly not in  
> the application where the exception occurs. As you can see from the  
> saved CPSR, the processor isn't in 26-bit mode at the time of the  
> exception anyway. And even if it was, the load is from 0x412c8a90  
> (etc.), not 0x0-0x1f. From what I've seen in the ARM architecture manual  
> (mostly the part that you've copied above), this operation should not be  
> able to cause such an exception, so I'm wondering if there is some  
> alternate condition that can lead to this kind of exception.
>
> In gdb, things look like this (after the SEGV from the fault is received  
> by the target):
>
> (gdb) info registers
> r0             0x0    0
> r1             0x412c8a04    1093437956
> r2             0x0    0
> r3             0x401c57f8    1075599352
> r4             0x4029457c    1076446588
> r5             0x9    9
> r6             0x40390000    1077477376
> r7             0x412c94e0    1093440736
> r8             0x40390150    1077477712
> r9             0x3d0f00    4001536
> r10            0x4037a6bc    1077388988
> r11            0x412c8b84    1093438340

0x412c8a90 (the fault address) + 244 = 0x412c8b84, which is the r11 value.
So that's consistent.

> r12            0x401d6c20    1075670048
> sp             0x412c8a2c    0x412c8a2c
> lr             0x4029603c    1076453436
> pc             0x400ec47c    0x400ec47c <f1+172>
> fps            0x0    0
> cpsr           0x60000010    1610612752

CPSR says NZCV=0110 (zero, carry).  32-bit user mode.

Given that the conditions are clearly wrong for a vector exception, I would
say that you're hitting some kind of hardware bug - maybe caused by a dirty
power supply to the PXA, causing it to misbehave?


* strange, spurious seeming vector exception on pxa300
  2009-12-02 15:50       ` Russell King - ARM Linux
@ 2009-12-02 16:04         ` Daniel Mack
  2009-12-02 16:39           ` Yeasah Pell
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Mack @ 2009-12-02 16:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Dec 02, 2009 at 03:50:57PM +0000, Russell King - ARM Linux wrote:
> > r12            0x401d6c20    1075670048
> > sp             0x412c8a2c    0x412c8a2c
> > lr             0x4029603c    1076453436
> > pc             0x400ec47c    0x400ec47c <f1+172>
> > fps            0x0    0
> > cpsr           0x60000010    1610612752
> 
> CPSR says NZCV=0110 (zero, carry).  32-bit user mode.
> 
> Given that the conditions are clearly wrong for a vector exception, I would
> say that you're hitting some kind of hardware bug - maybe caused by a dirty
> power supply to the PXA, causing it to misbehave?

We had trouble of that kind as well some months ago with an early
prototype. It wasn't an exception we got, but the bug was clearly
hitting the same code path every single time, so this issue might be
related. Eventually it went away with a new board revision which made wire
patching around the DDR SDRAM unnecessary (i.e., cleaner signal paths).

Strangely enough, I would have expected such flaws to cause processor
misbehaviour of all sorts, totally random and unpredictable. The fact that
it was the same function we always ended up in is still some kind of
miracle I can't explain.

Daniel


* strange, spurious seeming vector exception on pxa300
  2009-12-02 16:04         ` Daniel Mack
@ 2009-12-02 16:39           ` Yeasah Pell
  0 siblings, 0 replies; 7+ messages in thread
From: Yeasah Pell @ 2009-12-02 16:39 UTC (permalink / raw)
  To: linux-arm-kernel

Daniel Mack wrote:
> On Wed, Dec 02, 2009 at 03:50:57PM +0000, Russell King - ARM Linux wrote:
>>
>>
>> Given that the conditions are clearly wrong for a vector exception, I would
>> say that you're hitting some kind of hardware bug - maybe caused by a dirty
>> power supply to the PXA, causing it to misbehave?
>>     

Of course I desperately want to believe this, as then I can ignore all 
this insanity and move on, but it's a bit difficult to swallow that some 
kind of hardware issue would cause a failure that is so consistent (same 
instruction every time, regardless of physical/virtual memory locations, 
compilation flags, etc.) yet caused absolutely no other perceptible 
problems...

> We had trouble of that kind as well some months ago with an early
> prototype. It wasn't an exception we got, but the bug was clearly
> hitting the same code path every single time, so this issue might be
> related. Eventually it went away with a new board revision which made wire
> patching around the DDR SDRAM unnecessary (i.e., cleaner signal paths).
>
> Strangely enough, I would have expected such flaws to cause processor
> misbehaviour of all sorts, totally random and unpredictable. The fact that
> it was the same function we always ended up in is still some kind of
> miracle I can't explain.
>
> Daniel
>   

...and having some anecdotal evidence of that kind of situation 
happening is helpful.

Thanks, guys.


end of thread

Thread overview: 7+ messages
2009-12-01 22:13 strange, spurious seeming vector exception on pxa300 Yeasah Pell
2009-12-02  6:00 ` Eric Miao
2009-12-02  6:07   ` Eric Miao
2009-12-02 14:40     ` Yeasah Pell
2009-12-02 15:50       ` Russell King - ARM Linux
2009-12-02 16:04         ` Daniel Mack
2009-12-02 16:39           ` Yeasah Pell
