From mboxrd@z Thu Jan 1 00:00:00 1970
From: yeasah@comrex.com (Yeasah Pell)
Date: Wed, 02 Dec 2009 09:40:41 -0500
Subject: strange, spurious seeming vector exception on pxa300
In-Reply-To:
References: <4B159524.2020408@comrex.com>
Message-ID: <4B167C69.6060903@comrex.com>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

Eric Miao wrote:
> On Wed, Dec 2, 2009 at 2:00 PM, Eric Miao wrote:
>
>> On Wed, Dec 2, 2009 at 6:13 AM, Yeasah Pell wrote:
>>
>>> Has anybody ever seen vector exceptions happen on an ARM (xscale, pxa300)
>>> without 26-bit mode being used? I have some application and kernel code
>>> which appears to work on most hardware, but we have at least one board which
>>> causes periodic messages:
>>>
>>> Unhandled fault: vector exception (0x010) at 0x412c8a90
>>>
>>> (I also fudged the fault handler a bit to dump the SPSR: 0x80000010)
>>>
>> Never had such exceptions. This is weird, SPSR[4] == 1 indicates a 32-bit mode.
>>
>
> When the processor is in a 32-bit configuration (PROG32 is active) and
> in a 26-bit mode (CPSR[4] == 0),
> data access (but not instruction fetches) to the exception vectors
> (address 0x0 to 0x1f) causes a data abort.
> This is known as a vector exception.
>
> This is what explained in the manual, seems something related to 26-bit mode.
> What's your compiling environment and flags for your application?
>

Hi, Eric -- thanks for the reply.

It's a crosstool-ng generated toolchain w/gcc 4.3.2. The optimization
flags are '-mcpu=xscale -funroll-loops -O3', but it has been observed on
debug builds which lack these flags as well.

There's no 26-bit code in the system that I'm aware of, certainly not in
the application where the exception occurs. As you can see from the saved
CPSR, the processor isn't in 26-bit mode at the time of the exception
anyway. And even if it was, the load is from 0x412c8a90 (etc.), not
0x0-0x1f.
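
For reference, the condition in the quoted manual text can be checked against the reported values with a few lines of Python. This is only a sketch: the helper names are made up for illustration and are not kernel code, and PROG32 is assumed to be active as in the quoted description.

```python
# Illustrative sketch only: these helpers are hypothetical, not kernel code.
MODE_MASK = 0x1F        # PSR bits [4:0] hold the processor mode
MODE_32BIT_FLAG = 0x10  # PSR bit 4 set means a 32-bit mode

# Standard 32-bit ARM mode encodings.
MODE_NAMES = {
    0x10: "usr", 0x11: "fiq", 0x12: "irq", 0x13: "svc",
    0x17: "abt", 0x1B: "und", 0x1F: "sys",
}


def is_26bit_mode(psr):
    """True when PSR bit 4 is clear, i.e. a 26-bit mode."""
    return (psr & MODE_32BIT_FLAG) == 0


def mode_name(psr):
    """Name of the 32-bit mode encoded in the PSR, or 'unknown'."""
    return MODE_NAMES.get(psr & MODE_MASK, "unknown")


def vector_exception_expected(psr, fault_addr):
    """Condition from the quoted manual text (PROG32 assumed active):
    a data access to 0x0-0x1f while in a 26-bit mode."""
    return is_26bit_mode(psr) and 0x0 <= fault_addr <= 0x1F


spsr = 0x80000010   # SPSR dumped by the modified fault handler
fault = 0x412c8a90  # address from the "Unhandled fault" message
print(mode_name(spsr), is_26bit_mode(spsr),
      vector_exception_expected(spsr, fault))
# prints: usr False False
```

Neither half of the documented condition holds for the reported SPSR and fault address, which is the point made above.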

Going by what I've seen in the ARM architecture manual (mostly the part
that you've copied above), this operation should not be able to cause such
an exception, so I'm wondering if there is some alternate condition that
can lead to this kind of exception.

In gdb, things look like this (after the SEGV from the fault is received
by the target):

(gdb) info registers
r0             0x0          0
r1             0x412c8a04   1093437956
r2             0x0          0
r3             0x401c57f8   1075599352
r4             0x4029457c   1076446588
r5             0x9          9
r6             0x40390000   1077477376
r7             0x412c94e0   1093440736
r8             0x40390150   1077477712
r9             0x3d0f00     4001536
r10            0x4037a6bc   1077388988
r11            0x412c8b84   1093438340
r12            0x401d6c20   1075670048
sp             0x412c8a2c   0x412c8a2c
lr             0x4029603c   1076453436
pc             0x400ec47c   0x400ec47c
fps            0x0          0
cpsr           0x60000010   1610612752

(gdb) disassemble 0x400ec47c
Dump of assembler code for function f1:
0x400ec3d0:  mov    r12, sp
0x400ec3d4:  push   {r4, r5, r6, r7, r8, r9, r10, r11, r12, lr, pc}
0x400ec3d8:  ldr    r4, [pc, #3508]   ; 0x400ed194
0x400ec3dc:  sub    r11, r12, #4      ; 0x4
0x400ec3e0:  ldr    lr, [pc, #3504]   ; 0x400ed198
0x400ec3e4:  ldr    r12, [pc, #3504]  ; 0x400ed19c
0x400ec3e8:  add    r3, pc, r4
0x400ec3ec:  sub    sp, sp, #304      ; 0x130
0x400ec3f0:  str    r3, [r11, #-296]
0x400ec3f4:  ldr    r4, [r3, r12]
0x400ec3f8:  add    lr, r3, lr
0x400ec3fc:  ldr    r12, [r11, #-296]
0x400ec400:  ldr    r3, [pc, #3480]   ; 0x400ed1a0
0x400ec404:  str    r0, [r11, #-244]
0x400ec408:  sub    r0, r11, #40      ; 0x28
0x400ec40c:  add    r3, r12, r3
0x400ec410:  sub    r12, r11, #140    ; 0x8c
0x400ec414:  str    r4, [r11, #-148]
0x400ec418:  str    lr, [r11, #-144]
0x400ec41c:  stmib  r12, {r3, sp}
0x400ec420:  str    r0, [r11, #-140]
0x400ec424:  sub    r0, r11, #172     ; 0xac
0x400ec428:  str    r1, [r11, #-248]
0x400ec42c:  str    r2, [r11, #-252]
0x400ec430:  bl     0x400e1c60 <_init+1048>
0x400ec434:  ldr    r1, [r11, #-248]  ; beginning of "actual" function code
0x400ec438:  cmp    r1, #0            ; 0x0 ; this is expected to be always unequal
0x400ec43c:  streq  r1, [r11, #-228]
0x400ec440:  beq    0x400ec47c
0x400ec444:  ldr    r3, [pc, #3416]   ; 0x400ed1a4
0x400ec448:  ldr    r2, [r11, #-296]
0x400ec44c:  ldr    lr, [pc, #3412]   ; 0x400ed1a8
0x400ec450:  mov    r0, r1
0x400ec454:  ldr    r1, [r2, r3]
0x400ec458:  mov    r3, #0            ; 0x0
0x400ec45c:  ldr    r2, [r2, lr]
0x400ec460:  bl     0x400e3370 <_init+6952>
0x400ec464:  cmp    r0, #0            ; 0x0 ; this is expected to be always equal
0x400ec468:  ldrne  r12, [r11, #-244]
0x400ec46c:  movne  r3, #1            ; 0x1
0x400ec470:  str    r0, [r11, #-228]
0x400ec474:  strne  r3, [r12, #16]
0x400ec478:  strne  r3, [r12, #8]
0x400ec47c:  ldr    r1, [r11, #-244]  ; this throws an exception once in many thousand iterations
0x400ec480:  ldr    r0, [r1, #16]
...

The compare at 0x400ec438 is expected to be unequal (and the register
state shown above confirms this at the time of the exception), and the
compare at 0x400ec464 is expected to be equal (again the register state
confirms this). So we know the path of execution must have included, for
example, 0x400ec448, which is a substantially similar operation to the one
which causes the exception: a plain register load from the same page in
memory.

I noticed that the instruction that throws the exception is a branch
target (from 0x400ec440). Inserting a nop at the location the exception is
thrown appears to avoid the problem at any timescale that I can detect
(many hours at least, versus up to a few minutes that it takes to fail
without it) -- but inserting a nop at any other location in the function
doesn't seem effective.

Perhaps I will try running this test with branch prediction disabled --
assuming that doesn't hurt performance so much that the test cannot be run.
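
The address arithmetic behind the "same page" observation can be reproduced with a short Python sketch. It assumes 4 KiB pages and that r11 held the same value at 0x400ec448 as in the register dump (it is the frame pointer and is not written between the two loads in the listing); the helper names are hypothetical.

```python
# Illustrative sketch only: assumes 4 KiB pages and an unchanged r11.
PAGE_SIZE = 0x1000


def effective_address(base, offset):
    """Address used by 'ldr rX, [base, #offset]', wrapped to 32 bits."""
    return (base + offset) & 0xFFFFFFFF


def page_base(addr):
    """Start address of the page containing addr."""
    return addr & ~(PAGE_SIZE - 1)


r11 = 0x412c8b84                        # from "info registers"
faulting = effective_address(r11, -244)  # ldr r1, [r11, #-244] at 0x400ec47c
earlier = effective_address(r11, -296)   # ldr r2, [r11, #-296] at 0x400ec448

print(hex(faulting), hex(earlier), page_base(faulting) == page_base(earlier))
# prints: 0x412c8a90 0x412c8a5c True
```

The computed faulting address matches the one in the "Unhandled fault" message, and the earlier load on the same execution path touches the same page.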