Undefined instruction (ldrshtgt?) on mirabox with 3.11-rc7

public inbox for linux-arm-kernel@lists.infradead.org

From: Jochen De Smet @ 2013-08-31 16:31 UTC
  To: linux-arm-kernel

[Not subscribed, so please keep me on CC]

Running on a Mirabox (Armada 370) with a stock 3.11-rc7 kernel, on Fedora 19
with gcc:

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/armv7hl-redhat-linux-gnueabi/4.7.2/lto-wrapper
Target: armv7hl-redhat-linux-gnueabi
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man 
--infodir=/usr/share/info 
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap 
--enable-shared --enable-threads=posix --enable-checking=release 
--disable-build-with-cxx --disable-build-poststage1-with-cxx 
--with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions 
--enable-gnu-unique-object --enable-linker-build-id 
--with-linker-hash-style=gnu 
--enable-languages=c,c++,objc,obj-c++,java,fortran,go,lto 
--enable-plugin --enable-initfini-array --enable-java-awt=gtk 
--disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre 
--enable-libgcj-multifile --enable-java-maintainer-mode 
--with-ecj-jar=/usr/share/java/eclipse-ecj.jar 
--disable-libjava-multilib --with-ppl --with-cloog 
--disable-sjlj-exceptions --with-cpu=cortex-a8 --with-tune=cortex-a8 
--with-arch=armv7-a --with-float=hard --with-fpu=vfpv3-d16 
--with-abi=aapcs-linux --build=armv7hl-redhat-linux-gnueabi
Thread model: posix
gcc version 4.7.2 20121109 (Red Hat 4.7.2-8) (GCC)


Running into this oops:

[54580.094832] Internal error: Oops - undefined instruction: 0 [#1] ARM
[54580.101207] Modules linked in: sha1_generic drbd lru_cache dlm sctp configfs raid1 md_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat iptable_mangle ipt_REJECT xt_conntrack ebtable_filter ebtables iptable_filter ip_tables ext3 jbd ftdi_sio usbserial autofs4 ext4 jbd2 mbcache sd_mod usb_storage mmc_block xhci_hcd mvsdio mmc_core ehci_orion
[54580.136437] CPU: 0 PID: 0 Comm: swapper Not tainted 3.11.0-rc7-stock2 #30
[54580.143239] task: c03f9540 ti: c03ee000 task.ti: c03ee000
[54580.148658] PC is at quirk_usb_early_handoff+0x7d0/0x7f4
[54580.153983] LR is at start_unlink_async+0x20/0x2c
[54580.158697] pc : [<c020837c>]    lr : [<c020c014>] psr: 00000193
[54580.158697] sp : c03efd98  ip : ef2735d0  fp : c03efda4
[54580.170194] r10: 60000193  r9 : 00000006  r8 : c03013ec
[54580.175427] r7 : 000031ac  r6 : d77d6a38  r5 : 00000001  r4 : 00000ef4
[54580.181965] r3 : ee817c00  r2 : ef2de8c0  r1 : ee804600  r0 : ef273500
[54580.188504] Flags: nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[54580.195912] Control: 10c5387d  Table: 2cb6c019  DAC: 00000015
[54580.201666] Process swapper (pid: 0, stack limit = 0xc03ee230)
[54580.207509] Stack: (0xc03efd98 to 0xc03f0000)
[54580.211874] fd80:                                                       c03efdbc c03efda8
[54580.220068] fda0: c020c014 c020874c ef2735d0 ef273500 c03efdd4 c03efdc0 c020c0e0 c020c000
[54580.228262] fdc0: d7860e21 00000000 c03efe34 c03efdd8 c020949c c020c02c c03efe04 ef273558
[54580.236456] fde0: c0151cf4 c0150578 ef273500 00000000 d7860e21 000031ac d7860e21 000031ac
[54580.244650] fe00: c003ad54 00000220 00000000 ef273558 c03ffab8 c03ffab8 00000000 00000003
[54580.252844] fe20: c03ffa88 c03ffa88 c03efe54 c03efe38 c003af08 c0209428 c0e33044 00000010
[54580.261038] fe40: d785ff4c 000031ac c03efeb4 c03efe58 c003b658 c003aec0 00000000 00000000
[54580.269232] fe60: d785ff4c 000031ac d785ff4c 000031ac d785ff4c 000031ac ffffffff 7fffffff
[54580.277426] fe80: d785ff4c 000031ac 00000000 c0412240 c0406400 ef007cc0 c0e33044 00000010
[54580.285621] fea0: c03ee000 c03f60c8 c03efecc c03efeb8 c0222558 c003b574 c0222514 ef0048c0
[54580.293815] fec0: c03efef4 c03efed0 c006a5b4 c0222520 c006a550 00000010 00000010 00000000
[54580.302011] fee0: c03eff50 00000001 c03eff0c c03efef8 c0067450 c006a55c 0000006e c0406088
[54580.310206] ff00: c03eff2c c03eff10 c000f520 c0067434 00000074 c0433100 000003ff c0433100
[54580.318401] ff20: c03eff4c c03eff30 c0008554 c000f4f4 c0049954 60000013 ffffffff c03eff84
[54580.326596] ff40: c03effac c03eff50 c02df7a0 c0008514 ffffffed 00000000 c0411c48 c001b4e4
[54580.334791] ff60: c03ee000 c0417a87 c0417a87 c03ee000 00000001 c03ee000 c03f60c8 c03effac
[54580.342986] ff80: c03eff88 c03eff98 c000f700 c0049954 60000013 ffffffff 00000000 c0e31cc0
[54580.351180] ffa0: c03effbc c03effb0 c02d8700 c0049914 c03efff4 c03effc0 c03c6a54 c02d86a8
[54580.359374] ffc0: ffffffff ffffffff c03c6544 00000000 00000000 c03e54b8 10c53c7d c03f6070
[54580.367568] ffe0: c03e54b4 c03fa640 00000000 c03efff8 00008070 c03c67c0 00000000 00000000
[54580.375757] Backtrace:
[54580.378225] [<c0208740>] (single_unlink_async+0x0/0x74) from [<c020c014>] (start_unlink_async+0x20/0x2c)
[54580.387726] [<c020bff4>] (start_unlink_async+0x0/0x2c) from [<c020c0e0>] (unlink_empty_async+0xc0/0xcc)
[54580.397134]  r4:ef273500 r3:ef2735d0
[54580.400741] [<c020c020>] (unlink_empty_async+0x0/0xcc) from [<c020949c>] (ehci_hrtimer_func+0x80/0xe8)
[54580.410061]  r5:00000000 r4:d7860e21
[54580.413673] [<c020941c>] (ehci_hrtimer_func+0x0/0xe8) from [<c003af08>] (__run_hrtimer.isra.20+0x54/0x104)
[54580.423348] [<c003aeb4>] (__run_hrtimer.isra.20+0x0/0x104) from [<c003b658>] (hrtimer_interrupt+0xf0/0x288)
[54580.433102]  r5:000031ac r4:d785ff4c
[54580.436715] [<c003b568>] (hrtimer_interrupt+0x0/0x288) from [<c0222558>] (armada_370_xp_timer_interrupt+0x44/0x54)
[54580.447086] [<c0222514>] (armada_370_xp_timer_interrupt+0x0/0x54) from [<c006a5b4>] (handle_percpu_devid_irq+0x64/0x80)
[54580.457884]  r4:ef0048c0 r3:c0222514
[54580.461497] [<c006a550>] (handle_percpu_devid_irq+0x0/0x80) from [<c0067450>] (generic_handle_irq+0x28/0x38)
[54580.471338]  r8:00000001 r7:c03eff50 r6:00000000 r5:00000010 r4:00000010 r3:c006a550
[54580.479258] [<c0067428>] (generic_handle_irq+0x0/0x38) from [<c000f520>] (handle_IRQ+0x38/0x8c)
[54580.487970]  r4:c0406088 r3:0000006e
[54580.491578] [<c000f4e8>] (handle_IRQ+0x0/0x8c) from [<c0008554>] (armada_370_xp_handle_irq+0x4c/0x54)
[54580.500810]  r6:c0433100 r5:000003ff r4:c0433100 r3:00000074
[54580.506527] [<c0008508>] (armada_370_xp_handle_irq+0x0/0x54) from [<c02df7a0>] (__irq_svc+0x40/0x50)
[54580.515674] Exception stack(0xc03eff50 to 0xc03eff98)
[54580.520735] ff40:                                     ffffffed 00000000 c0411c48 c001b4e4
[54580.528929] ff60: c03ee000 c0417a87 c0417a87 c03ee000 00000001 c03ee000 c03f60c8 c03effac
[54580.537123] ff80: c03eff88 c03eff98 c000f700 c0049954 60000013 ffffffff
[54580.543747]  r7:c03eff84 r6:ffffffff r5:60000013 r4:c0049954
[54580.549469] [<c0049908>] (cpu_startup_entry+0x0/0xe8) from [<c02d8700>] (rest_init+0x64/0x7c)
[54580.558006]  r7:c0e31cc0 r3:00000000
[54580.561618] [<c02d869c>] (rest_init+0x0/0x7c) from [<c03c6a54>] (start_kernel+0x2a0/0x2f4)
[54580.569904] [<c03c67b4>] (start_kernel+0x0/0x2f4) from [<00008070>] (0x8070)
[54580.576967] Code: eaffffcc c03f6040 c0406068 c0394a20 (c03949f0)
[54580.583077] ---[ end trace 7ff80fa55787f992 ]---
[54580.587702] Kernel panic - not syncing: Fatal exception in interrupt


Didn't have debug symbols enabled (compiling with them now), but both
decodecode and gdb seem to track the problem here:

All code
========
    0:   eaffffcc        b       0xffffff38
    4:   c03f6040        eorsgt  r6, pc, r0, asr #32
    8:   c0406068        subgt   r6, r0, r8, rrx
    c:   c0394a20        eorsgt  r4, r9, r0, lsr #20
   10:*  c03949f0        ldrshtgt        r4, [r9], -r0 <-- trapping instruction

from gdb with a bit more context:

    0xc020836c <+1984>:  b       0xc02082a4 <quirk_usb_early_handoff+1784>
    0xc0208370 <+1988>:  eorsgt  r6, pc, r0, asr #32
    0xc0208374 <+1992>:  subgt   r6, r0, r8, rrx
    0xc0208378 <+1996>:  eorsgt  r4, r9, r0, lsr #20
    0xc020837c <+2000>:  ldrshtgt        r4, [r9], -r0
    0xc0208380 <+2004>:  eorsgt  r4, r9, r4, asr #20
    0xc0208384 <+2008>:  eorsgt  r4, r9, r0, asr #21
    0xc0208388 <+2012>:  eorsgt  sp, r7, r12, lsl #4
    0xc020838c <+2016>:  mlasgt  r9, r4, r10, r4
    0xc0208390 <+2020>:  eorsgt  r4, r9, r8, ror #20
    0xc0208394 <+2024>:  eorsgt  r4, r9, r4, lsl #22
    0xc0208398 <+2028>:  eorsgt  r4, r9, r12, lsr r11
    0xc020839c <+2032>:  ldrsbtgt        r4, [r9], -r8
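
As a cross-check on the decodecode and gdb listings above, the address of
each word can be recomputed from the Code: line alone. This is a minimal
Python sketch, not the kernel's scripts/decodecode; the helper names
parse_code_line and word_addresses are made up for illustration. It relies
only on the convention that the parenthesised word is the one at the
faulting PC.

```python
import re

def parse_code_line(line):
    """Split an oops 'Code:' line into 32-bit words and find the trapping one."""
    tokens = re.findall(r"\(?[0-9a-f]{8}\)?", line.split("Code:", 1)[1])
    values = [int(token.strip("()"), 16) for token in tokens]
    trap_index = next(i for i, token in enumerate(tokens) if token.startswith("("))
    return values, trap_index

def word_addresses(pc, values, trap_index):
    """Pair each word with its address, given that the trapping word sits at pc."""
    return [(pc + 4 * (i - trap_index), value) for i, value in enumerate(values)]

CODE = "Code: eaffffcc c03f6040 c0406068 c0394a20 (c03949f0)"
PC = 0xC020837C  # from the "pc :" line of the oops

values, trap = parse_code_line(CODE)
for address, value in word_addresses(PC, values, trap):
    marker = "  <-- trapping word" if address == PC else ""
    print(f"{address:08x}: {value:08x}{marker}")
```

The addresses it prints line up with the gdb listing: the first word lands
at 0xc020836c and the trapping word at 0xc020837c.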


The oops is relatively sporadic, perhaps 1-3 times a day.

Would appreciate any help in getting this fixed.

J.

From: Russell King - ARM Linux @ 2013-08-31 20:06 UTC
  To: linux-arm-kernel

On Sat, Aug 31, 2013 at 12:31:44PM -0400, Jochen De Smet wrote:
> [54580.136437] CPU: 0 PID: 0 Comm: swapper Not tainted 3.11.0-rc7-stock2 #30
> [54580.143239] task: c03f9540 ti: c03ee000 task.ti: c03ee000
> [54580.148658] PC is at quirk_usb_early_handoff+0x7d0/0x7f4
> [54580.153983] LR is at start_unlink_async+0x20/0x2c
> [54580.158697] pc : [<c020837c>]    lr : [<c020c014>] psr: 00000193
> [54580.158697] sp : c03efd98  ip : ef2735d0  fp : c03efda4
> [54580.170194] r10: 60000193  r9 : 00000006  r8 : c03013ec
> [54580.175427] r7 : 000031ac  r6 : d77d6a38  r5 : 00000001  r4 : 00000ef4
> [54580.181965] r3 : ee817c00  r2 : ef2de8c0  r1 : ee804600  r0 : ef273500
> [54580.188504] Flags: nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  
>...
> [54580.576967] Code: eaffffcc c03f6040 c0406068 c0394a20 (c03949f0)
> [54580.583077] ---[ end trace 7ff80fa55787f992 ]---
> [54580.587702] Kernel panic - not syncing: Fatal exception in interrupt
>
>
> Didn't have debug symbols enabled (compiling with them now), but both  
> decodecode and gdb seem to track the problem here:
>
> All code
> ========
>    0:   eaffffcc        b       0xffffff38
>    4:   c03f6040        eorsgt  r6, pc, r0, asr #32
>    8:   c0406068        subgt   r6, r0, r8, rrx
>    c:   c0394a20        eorsgt  r4, r9, r0, lsr #20
>   10:*  c03949f0        ldrshtgt        r4, [r9], -r0 <-- trapping instruction

Thanks for disassembling the Code: line, and providing the code below.

> from gdb with a bit more context:
>
>    0xc020836c <+1984>:  b       0xc02082a4 <quirk_usb_early_handoff+1784>
>    0xc0208370 <+1988>:  eorsgt  r6, pc, r0, asr #32
>    0xc0208374 <+1992>:  subgt   r6, r0, r8, rrx
>    0xc0208378 <+1996>:  eorsgt  r4, r9, r0, lsr #20
>    0xc020837c <+2000>:  ldrshtgt        r4, [r9], -r0
>    0xc0208380 <+2004>:  eorsgt  r4, r9, r4, asr #20
>    0xc0208384 <+2008>:  eorsgt  r4, r9, r0, asr #21
>    0xc0208388 <+2012>:  eorsgt  sp, r7, r12, lsl #4
>    0xc020838c <+2016>:  mlasgt  r9, r4, r10, r4
>    0xc0208390 <+2020>:  eorsgt  r4, r9, r8, ror #20
>    0xc0208394 <+2024>:  eorsgt  r4, r9, r4, lsl #22
>    0xc0208398 <+2028>:  eorsgt  r4, r9, r12, lsr r11
>    0xc020839c <+2032>:  ldrsbtgt        r4, [r9], -r8

This doesn't look like valid ARM code (it doesn't make sense).  Instead,
what it looks like is a literal pool placed after the function (which is
something GCC does all the time.)
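
One way to see the literal-pool pattern: the top four bits of an ARM
instruction word are its condition field, and 0xc there means "gt", so any
word that is really a 0xc0...... kernel address disassembles with a "gt"
suffix. The sketch below checks the four pool words visible in the Code:
line. PAGE_OFFSET = 0xc0000000 and the 16MB window used for "near the
kernel image" are assumptions chosen for illustration, not values taken
from this kernel's configuration.

```python
# ARM condition field (bits 31:28) to mnemonic suffix.
COND_NAMES = ["eq", "ne", "cs", "cc", "mi", "pl", "vs", "vc",
              "hi", "ls", "ge", "lt", "gt", "le", "al", "(uncond)"]

PAGE_OFFSET = 0xC0000000   # assumed 3G/1G user/kernel split
IMAGE_WINDOW = 16 << 20    # assumed: kernel image lies within 16MB of PAGE_OFFSET

def condition(word):
    """Return the condition suffix a disassembler would show for this word."""
    return COND_NAMES[word >> 28]

def looks_like_kernel_pointer(word):
    """True if the word falls in the assumed kernel image address window."""
    return PAGE_OFFSET <= word < PAGE_OFFSET + IMAGE_WINDOW

POOL = [0xC03F6040, 0xC0406068, 0xC0394A20, 0xC03949F0]
for word in POOL:
    print(f"{word:08x}: cond={condition(word)} pointer-like={looks_like_kernel_pointer(word)}")

# The real instruction just before the pool, for contrast.
print(f"eaffffcc: cond={condition(0xEAFFFFCC)} "
      f"pointer-like={looks_like_kernel_pointer(0xEAFFFFCC)}")
```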

The question is: how did you end up trying to execute a literal pool?

Well, if we assume that the link register is intact, we would return to:

	start_unlink_async+0x20 (0xc020c014)

so presumably the instruction at the previous address is the one which
called this (I'm assuming no tail-call optimisation.)
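
The symbol+offset/size notation in the oops carries enough information to
recover function bounds and the likely call site. A small Python sketch of
that arithmetic follows; parse_symbol and function_bounds are hypothetical
helper names, and the "call site is LR - 4" step carries the same
no-tail-call assumption as above.

```python
import re

def parse_symbol(text):
    """Split 'name+0xOFF/0xSIZE' into (name, offset, size)."""
    match = re.fullmatch(r"(.+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", text)
    if match is None:
        raise ValueError(f"not in symbol+offset/size form: {text!r}")
    name, offset, size = match.groups()
    return name, int(offset, 16), int(size, 16)

def function_bounds(address, text):
    """Return (name, start, end) of the function that contains address."""
    name, offset, size = parse_symbol(text)
    start = address - offset
    return name, start, start + size

for address, text in [
    (0xC020C014, "start_unlink_async+0x20/0x2c"),         # LR
    (0xC020837C, "quirk_usb_early_handoff+0x7d0/0x7f4"),  # PC
]:
    name, start, end = function_bounds(address, text)
    print(f"{name}: {start:08x}..{end:08x} ({(end - start) // 4} words)")

# Assuming no tail call, the calling branch sits one instruction before LR.
print(f"likely call site: {0xC020C014 - 4:08x}")
```

With the values from this oops, start_unlink_async spans 0xc020bff4 to
0xc020c020, which is only 11 instructions, and the faulting PC sits near
the very end of quirk_usb_early_handoff's range, where a literal pool
would be expected.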

Well, just to be confusing, the kernel has three functions called
"start_unlink_async".  One of them is quite a big function, so is unlikely
to be 0x2c bytes in size, so the two candidates are:

static void start_unlink_async(struct ehci_hcd *ehci, struct ehci_qh *qh)
{
        /* If the QH isn't linked then there's nothing we can do. */
        if (qh->qh_state != QH_STATE_LINKED)
                return;

        single_unlink_async(ehci, qh);
        start_iaa_cycle(ehci);
}

static void start_unlink_async(struct fusbh200_hcd *fusbh200, struct fusbh200_qh *qh)
{
        /*
         * If the QH isn't linked then there's nothing we can do
         * unless we were called during a giveback, in which case
         * qh_completions() has to deal with it.
         */
        if (qh->qh_state != QH_STATE_LINKED) {
                if (qh->qh_state == QH_STATE_COMPLETING)
                        qh->needs_rescan = 1;
                return;
        }

        single_unlink_async(fusbh200, qh);
        start_iaa_cycle(fusbh200, false);
}

Neither calls quirk_usb_early_handoff().  I'm going to assume that it's
the EHCI one.

The backtrace (and stack) gives us another clue:

> [54580.378225] [<c0208740>] (single_unlink_async+0x0/0x74) from [<c020c014>] (start_unlink_async+0x20/0x2c)
> [54580.387726] [<c020bff4>] (start_unlink_async+0x0/0x2c) from [<c020c0e0>] (unlink_empty_async+0xc0/0xcc)

So the unwinder thinks we entered single_unlink_async().  Given the LR
value, I think that's reasonable (it would be useful to have the complete
disassembly of start_unlink_async() to confirm).

static void single_unlink_async(struct ehci_hcd *ehci, struct ehci_qh *qh)
{
        struct ehci_qh          *prev;

        /* Add to the end of the list of QHs waiting for the next IAAD */
        qh->qh_state = QH_STATE_UNLINK_WAIT;
        list_add_tail(&qh->unlink_node, &ehci->async_unlink);

        /* Unlink it from the schedule */
        prev = ehci->async;
        while (prev->qh_next.qh != qh)
                prev = prev->qh_next.qh;

        prev->hw->hw_next = qh->hw->hw_next;
        prev->qh_next = qh->qh_next;
        if (ehci->qh_scan_next == qh)
                ehci->qh_scan_next = qh->qh_next.qh;
}

Nothing in there does an indirect function call (or any function call).
Again, having the disassembly of that function may be useful.  It would
also help to know how much RAM you have in lowmem, so we know the possible
range of valid kernel addresses.

> The oops is relatively sporadic, perhaps 1-3 times a day.

Is it always the same oops?

From: Jochen De Smet @ 2013-08-31 23:00 UTC
  To: linux-arm-kernel

On 8/31/2013 16:06, Russell King - ARM Linux wrote:
> On Sat, Aug 31, 2013 at 12:31:44PM -0400, Jochen De Smet wrote:
>     0xc0208378 <+1996>:  eorsgt  r4, r9, r0, lsr #20
>     0xc020837c <+2000>:  ldrshtgt        r4, [r9], -r0
>     0xc0208380 <+2004>:  eorsgt  r4, r9, r4, asr #20
>     0xc0208384 <+2008>:  eorsgt  r4, r9, r0, asr #21
>     0xc0208388 <+2012>:  eorsgt  sp, r7, r12, lsl #4
>     0xc020838c <+2016>:  mlasgt  r9, r4, r10, r4
>     0xc0208390 <+2020>:  eorsgt  r4, r9, r8, ror #20
>     0xc0208394 <+2024>:  eorsgt  r4, r9, r4, lsl #22
>     0xc0208398 <+2028>:  eorsgt  r4, r9, r12, lsr r11
>     0xc020839c <+2032>:  ldrsbtgt        r4, [r9], -r8
> This doesn't look like valid ARM code (it doesn't make sense).  Instead,
> what it looks like is a literal pool placed after the function (which is
> something GCC does all the time.)
>
> The question is - how did you end up trying to execute a literal pool.
>
> Well, if we assume that the link register is intact, we would return to:
>
> 	start_unlink_async+0x20 (0xc020c014)
>
> so presumably the instruction at the previous address is the one which
> called this (I'm assuming no tail-call optimisation.)
>
> Well, just to be confusing, the kernel has three functions called
> "start_unlink_async".  One of them is quite a big function, so is unlikely
> to be 0x2c bytes in size, so the two candidates are:
>
> static void start_unlink_async(struct ehci_hcd *ehci, struct ehci_qh *qh)
> {
>          /* If the QH isn't linked then there's nothing we can do. */
>          if (qh->qh_state != QH_STATE_LINKED)
>                  return;
>
>          single_unlink_async(ehci, qh);
>          start_iaa_cycle(ehci);
> }
>
> static void start_unlink_async(struct fusbh200_hcd *fusbh200, struct fusbh200_qh *qh)
> {
>          /*
>           * If the QH isn't linked then there's nothing we can do
>           * unless we were called during a giveback, in which case
>           * qh_completions() has to deal with it.
>           */
>          if (qh->qh_state != QH_STATE_LINKED) {
>                  if (qh->qh_state == QH_STATE_COMPLETING)
>                          qh->needs_rescan = 1;
>                  return;
>          }
>
>          single_unlink_async(fusbh200, qh);
>          start_iaa_cycle(fusbh200, false);
> }
>
> Neither call quirk_usb_early_handoff().  I'm going to assume that it's
> the EHCI one.
Curiously enough, I don't see either one (ehci-q.c or fusbh200-hcd.c) in
the kernel "make" output.  Ah, ehci-q.c gets directly included by
ehci-hcd.c, which I do see.  I don't see anything similar for fusbh200 or
oxu210hp-hcd.c, so I'm pretty sure the EHCI one is the only one I'm
compiling and your guess is right.
> The backtrace (and stack) gives us another clue:
>
>> [54580.378225] [<c0208740>] (single_unlink_async+0x0/0x74) from [<c020c014>] (start_unlink_async+0x20/0x2c)
>> [54580.387726] [<c020bff4>] (start_unlink_async+0x0/0x2c) from [<c020c0e0>] (unlink_empty_async+0xc0/0xcc)
> So the unwinder thinks we entered single_unlink_async().  Given the LR
> value, I think that's reasonable (it would be useful to have the complete
> disassembly of start_unlink_async() to confirm).
(gdb) disassemble /r start_unlink_async
Dump of assembler code for function start_unlink_async:
    0xc020bff4 <+0>:     0d c0 a0 e1     mov     r12, sp
    0xc020bff8 <+4>:     18 d8 2d e9     push    {r3, r4, r11, r12, lr, pc}
    0xc020bffc <+8>:     04 b0 4c e2     sub     r11, r12, #4
    0xc020c000 <+12>:    2c 30 d1 e5     ldrb    r3, [r1, #44]   ; 0x2c
    0xc020c004 <+16>:    00 40 a0 e1     mov     r4, r0
    0xc020c008 <+20>:    01 00 53 e3     cmp     r3, #1
    0xc020c00c <+24>:    18 a8 9d 18     ldmne   sp, {r3, r4, r11, sp, pc}
    0xc020c010 <+28>:    ca f1 ff eb     bl      0xc0208740 <single_unlink_async>
    0xc020c014 <+32>:    04 00 a0 e1     mov     r0, r4
    0xc020c018 <+36>:    40 ff ff eb     bl      0xc020bd20 <start_iaa_cycle>
    0xc020c01c <+40>:    18 a8 9d e8     ldm     sp, {r3, r4, r11, sp, pc}
End of assembler dump.

disassemble /m doesn't seem to work for this; is that normal?  On the
bright side, the address does match what's in the stack trace, so it
should be the right function.
>
> static void single_unlink_async(struct ehci_hcd *ehci, struct ehci_qh *qh)
> {
>          struct ehci_qh          *prev;
>
>          /* Add to the end of the list of QHs waiting for the next IAAD */
>          qh->qh_state = QH_STATE_UNLINK_WAIT;
>          list_add_tail(&qh->unlink_node, &ehci->async_unlink);
>
>          /* Unlink it from the schedule */
>          prev = ehci->async;
>          while (prev->qh_next.qh != qh)
>                  prev = prev->qh_next.qh;
>
>          prev->hw->hw_next = qh->hw->hw_next;
>          prev->qh_next = qh->qh_next;
>          if (ehci->qh_scan_next == qh)
>                  ehci->qh_scan_next = qh->qh_next.qh;
> }
>
> Nothing in there does an indirect function call (or any function call).
> Again, having the disassembly to that function may be useful.  Also
(gdb) disassemble single_unlink_async
Dump of assembler code for function single_unlink_async:
    0xc0208740 <+0>:     mov     r12, sp
    0xc0208744 <+4>:     push    {r11, r12, lr, pc}
    0xc0208748 <+8>:     sub     r11, r12, #4
    0xc020874c <+12>:    mov     r3, #4
    0xc0208750 <+16>:    strb    r3, [r1, #44]   ; 0x2c
    0xc0208754 <+20>:    ldr     r3, [r0, #212]  ; 0xd4
    0xc0208758 <+24>:    add     r2, r1, #32
    0xc020875c <+28>:    add     r12, r0, #208   ; 0xd0
    0xc0208760 <+32>:    str     r2, [r0, #212]  ; 0xd4
    0xc0208764 <+36>:    str     r12, [r1, #32]
    0xc0208768 <+40>:    str     r3, [r1, #36]   ; 0x24
    0xc020876c <+44>:    str     r2, [r3]
    0xc0208770 <+48>:    ldr     r2, [r0, #200]  ; 0xc8
    0xc0208774 <+52>:    b       0xc020877c <single_unlink_async+60>
    0xc0208778 <+56>:    mov     r2, r3
    0xc020877c <+60>:    ldr     r3, [r2, #8]
    0xc0208780 <+64>:    cmp     r3, r1
    0xc0208784 <+68>:    bne     0xc0208778 <single_unlink_async+56>
    0xc0208788 <+72>:    ldr     r12, [r1]
    0xc020878c <+76>:    ldr     r3, [r2]
    0xc0208790 <+80>:    ldr     r12, [r12]
    0xc0208794 <+84>:    str     r12, [r3]
    0xc0208798 <+88>:    ldr     r3, [r1, #8]
    0xc020879c <+92>:    str     r3, [r2, #8]
    0xc02087a0 <+96>:    ldr     r3, [r0, #196]  ; 0xc4
    0xc02087a4 <+100>:   cmp     r3, r1
    0xc02087a8 <+104>:   ldreq   r3, [r1, #8]
    0xc02087ac <+108>:   streq   r3, [r0, #196]  ; 0xc4
    0xc02087b0 <+112>:   ldm     sp, {r11, sp, pc}
End of assembler dump.

> knowing how much RAM you have in lowmem too, so we know the possible
> range of valid kernel addresses.
Sorry, not sure how to get this.  Dumping some of the things that come to
mind:

$ free
             total       used       free     shared    buffers     cached
Mem:       1035324     999140      36184          0       5392     828716
-/+ buffers/cache:     165032     870292
Swap:       499996       1212     498784

$ cat /proc/meminfo
MemTotal:        1035324 kB
MemFree:           36096 kB
Buffers:            5392 kB
Cached:           828716 kB
SwapCached:           28 kB
Active:           227984 kB
Inactive:         661920 kB
Active(anon):      32972 kB
Inactive(anon):    70092 kB
Active(file):     195012 kB
Inactive(file):   591828 kB
Unevictable:        3688 kB
Mlocked:            3688 kB
HighTotal:        270336 kB
HighFree:           1416 kB
LowTotal:         764988 kB
LowFree:           34680 kB
SwapTotal:        499996 kB
SwapFree:         498784 kB
Dirty:               236 kB
Writeback:             0 kB
AnonPages:         59472 kB
Mapped:            59180 kB
Shmem:             44932 kB
Slab:              55144 kB
SReclaimable:      40732 kB
SUnreclaim:        14412 kB
KernelStack:        1160 kB
PageTables:         2756 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1017656 kB
Committed_AS:     359744 kB
VmallocTotal:     245760 kB
VmallocUsed:        3764 kB
VmallocChunk:     233092 kB

>
>> The oops is relatively sporadic, perhaps 1-3 times a day.
> Is it always the same oops?
I'm afraid I didn't save a full copy of the previous ones, but as far as I
remember, yes, it's the same backtrace every time.

J.

From: Russell King - ARM Linux @ 2013-08-31 23:54 UTC
  To: linux-arm-kernel

On Sat, Aug 31, 2013 at 07:00:29PM -0400, Jochen De Smet wrote:
> On 8/31/2013 16:06, Russell King - ARM Linux wrote:
>> Neither call quirk_usb_early_handoff().  I'm going to assume that it's
>> the EHCI one.
> Curiously enough, I don't see either one (ehci-q.c or fusbh200-hcd.c) in  
> the kernel "make" output.
> Ah, ehci-q gets directly included by ehci-hcd.c, which I do see. Don't  
> see anything similar for fusbh200
> or oxu210hp-hcd.c, so I'm pretty sure the EHCI one is the only one I'm  
> compiling and your guess is
> right.

Thanks for confirming.

> (gdb) disassemble /r start_unlink_async
> Dump of assembler code for function start_unlink_async:
>    0xc020bff4 <+0>:     0d c0 a0 e1     mov     r12, sp
>    0xc020bff8 <+4>:     18 d8 2d e9     push    {r3, r4, r11, r12, lr, pc}
>    0xc020bffc <+8>:     04 b0 4c e2     sub     r11, r12, #4
>    0xc020c000 <+12>:    2c 30 d1 e5     ldrb    r3, [r1, #44]   ; 0x2c
>    0xc020c004 <+16>:    00 40 a0 e1     mov     r4, r0
>    0xc020c008 <+20>:    01 00 53 e3     cmp     r3, #1
>    0xc020c00c <+24>:    18 a8 9d 18     ldmne   sp, {r3, r4, r11, sp, pc}
>    0xc020c010 <+28>:    ca f1 ff eb     bl      0xc0208740 <single_unlink_async>
>    0xc020c014 <+32>:    04 00 a0 e1     mov     r0, r4
>    0xc020c018 <+36>:    40 ff ff eb     bl      0xc020bd20 <start_iaa_cycle>
>    0xc020c01c <+40>:    18 a8 9d e8     ldm     sp, {r3, r4, r11, sp, pc}
> End of assembler dump.

Okay, so 0xc020c014 is the location of interest, and it's immediately after
a branch to single_unlink_async().  Okay, that confirms that the suspected
path is valid, and we did enter single_unlink_async from the correct place
in the code.
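
The same conclusion can be reached from the raw bytes in the
"disassemble /r" output, without trusting the disassembler's symbol
lookup. A minimal Python sketch, using the standard ARM B/BL encoding
(24-bit signed word offset, relative to the instruction address plus 8);
the function names are made up for illustration.

```python
def word_from_gdb_bytes(hex_bytes):
    """Turn gdb's little-endian byte dump (e.g. 'ca f1 ff eb') into a 32-bit word."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "little")

def is_branch(word):
    """B and BL have bits 27:25 equal to 0b101."""
    return (word >> 25) & 0b111 == 0b101

def has_link(word):
    """Bit 24 distinguishes BL (sets LR) from plain B."""
    return bool(word & (1 << 24))

def branch_target(address, word):
    """Decode the destination of a B/BL instruction located at address."""
    imm24 = word & 0x00FFFFFF
    if imm24 & 0x800000:          # sign-extend the 24-bit offset
        imm24 -= 1 << 24
    return (address + 8 + (imm24 << 2)) & 0xFFFFFFFF

CALL_SITE = 0xC020C010
word = word_from_gdb_bytes("ca f1 ff eb")
print(f"word   = {word:08x} branch={is_branch(word)} link={has_link(word)}")
print(f"target = {branch_target(CALL_SITE, word):08x}")
print(f"LR     = {CALL_SITE + 4:08x}")
```

The decoded target is 0xc0208740 (single_unlink_async) and the return
address is 0xc020c014, matching the LR value in the oops.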

> disassemble /m  doesn't seem to work for this; is that normal?

Hmm, disassemble /m... I'm not up with gdb I'm afraid.

> (gdb) disassemble single_unlink_async
> Dump of assembler code for function single_unlink_async:
>    0xc0208740 <+0>:     mov     r12, sp
>    0xc0208744 <+4>:     push    {r11, r12, lr, pc}
>    0xc0208748 <+8>:     sub     r11, r12, #4
>    0xc020874c <+12>:    mov     r3, #4
>    0xc0208750 <+16>:    strb    r3, [r1, #44]   ; 0x2c
>    0xc0208754 <+20>:    ldr     r3, [r0, #212]  ; 0xd4
>    0xc0208758 <+24>:    add     r2, r1, #32
>    0xc020875c <+28>:    add     r12, r0, #208   ; 0xd0
>    0xc0208760 <+32>:    str     r2, [r0, #212]  ; 0xd4
>    0xc0208764 <+36>:    str     r12, [r1, #32]
>    0xc0208768 <+40>:    str     r3, [r1, #36]   ; 0x24
>    0xc020876c <+44>:    str     r2, [r3]
>    0xc0208770 <+48>:    ldr     r2, [r0, #200]  ; 0xc8
>    0xc0208774 <+52>:    b       0xc020877c <single_unlink_async+60>
>    0xc0208778 <+56>:    mov     r2, r3
>    0xc020877c <+60>:    ldr     r3, [r2, #8]
>    0xc0208780 <+64>:    cmp     r3, r1
>    0xc0208784 <+68>:    bne     0xc0208778 <single_unlink_async+56>

Okay.  First, here's the stack from your previous post, annotated with
the saved registers:

fd80:                                                       c03efdbc c03efda8
                                                            r11      r12
fda0: c020c014 c020874c ef2735d0 ef273500 c03efdd4 c03efdc0 c020c0e0 c020c000
      lr       pc       r3       r4       r11      r12      lr       pc

Unfortunately, this doesn't really provide much in the way of useful
information other than confirming that the stack layout is as we'd
expect it to be if we got into this function.

Let's now look at the register state:

pc : [<c020837c>]    lr : [<c020c014>] psr: 00000193
sp : c03efd98  ip : ef2735d0  fp : c03efda4
r10: 60000193  r9 : 00000006  r8 : c03013ec
r7 : 000031ac  r6 : d77d6a38  r5 : 00000001  r4 : 00000ef4
r3 : ee817c00  r2 : ef2de8c0  r1 : ee804600  r0 : ef273500

The trick here is to pull out what this tells us based on the code from
the above function.  The first thing to note is that the sp/fp values
are correct: the fp points at the saved PC for this stack frame, which
is what I'd expect.  (Because of prefetching, the saved PC will be ahead
of the instruction which saved it.)
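
The frame checks described above can be written down mechanically. This
Python sketch labels the two innermost frames using the register lists
from the push instructions in the disassembly (lowest-numbered register at
the lowest address) and checks them against sp, fp and the push addresses.
The helper name label_frame is hypothetical, and the "+ 8" for the stored
PC is simply what the values in this dump show.

```python
SP = 0xC03EFD98
FP = 0xC03EFDA4

# Stack words starting at SP, copied from the oops (fd98 onwards).
STACK = [0xC03EFDBC, 0xC03EFDA8, 0xC020C014, 0xC020874C,
         0xEF2735D0, 0xEF273500, 0xC03EFDD4, 0xC03EFDC0, 0xC020C0E0, 0xC020C000]

def label_frame(base, words, registers):
    """Map each pushed register to (stack address, saved value)."""
    return {reg: (base + 4 * i, word)
            for i, (reg, word) in enumerate(zip(registers, words))}

# single_unlink_async: push {r11, r12, lr, pc} at 0xc0208744
inner = label_frame(SP, STACK[:4], ["r11", "r12", "lr", "pc"])
# start_unlink_async: push {r3, r4, r11, r12, lr, pc} at 0xc020bff8
outer = label_frame(SP + 16, STACK[4:], ["r3", "r4", "r11", "r12", "lr", "pc"])

checks = {
    "fp points at inner saved pc": inner["pc"][0] == FP,
    "inner saved pc is push + 8": inner["pc"][1] == 0xC0208744 + 8,
    "outer saved pc is push + 8": outer["pc"][1] == 0xC020BFF8 + 8,
    "inner saved r11 is caller's fp": inner["r11"][1] == outer["pc"][0],
    "inner saved lr matches LR": inner["lr"][1] == 0xC020C014,
}
for description, passed in checks.items():
    print(f"{description}: {passed}")
```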

The second thing to note is this:

ip (ef2735d0) = r0 (ef273500) + 0xd0

That suggests that the instruction at 0xc020875c was executed, which is
fair confirmation that we made it into this function and got that far.
Unfortunately, we can't tell much else from comparing the registers and
this code.  Let's look at the code where we ended up:

c0208374:   c0406068        subgt   r6, r0, r8, rrx
c0208378:   c0394a20        eorsgt  r4, r9, r0, lsr #20
c020837c:   c03949f0        ldrshtgt        r4, [r9], -r0

I've annotated this with the correct address from your previous report.
An important thing to note here is that the PSR flags are zero (NZCV
are all clear) so the 'gt' condition will allow these instructions to
execute.
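
For reference, the PSR decoding behind that statement, as a short Python
sketch. The bit positions (N, Z, C, V in bits 31..28, I in bit 7, F in
bit 6, T in bit 5, mode in bits 4..0) and the GT rule (Z clear and N equal
to V) are the standard ARM definitions; the function names are made up.

```python
def decode_psr(psr):
    """Pull the condition flags, interrupt masks and mode out of a PSR value."""
    return {
        "n": (psr >> 31) & 1,
        "z": (psr >> 30) & 1,
        "c": (psr >> 29) & 1,
        "v": (psr >> 28) & 1,
        "irqs_off": bool(psr & 0x80),
        "fiqs_off": bool(psr & 0x40),
        "thumb": bool(psr & 0x20),
        "mode": psr & 0x1F,
    }

def gt_passes(psr):
    """GT condition: Z clear and N equal to V."""
    flags = decode_psr(psr)
    return flags["z"] == 0 and flags["n"] == flags["v"]

PSR = 0x00000193
print(decode_psr(PSR))
print("gt passes:", gt_passes(PSR))
```

Mode 0x13 is SVC, with IRQs masked and FIQs enabled, which agrees with the
"Flags:" line of the oops.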

So, can we deduce anything from this?  Well, we have this:

r4 (00000ef4) = r9 (00000006) ^ (r0 (ef273500) >> 20)

so it looks like the instruction at c0208378 was executed.  Obviously
the instruction at c020837c caused a fault, so that was definitely
executed.  What about c0208374?

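r6 (d77d6a38) != r0 (ef273500) - (r8 (c03013ec) rrx), which would be
8f0f2b0a or 0f0f2b0a depending on the incoming carry (rrx is rotate right
with extend: a one-bit rotate right through the carry flag, i.e. a rotate
of the 33-bit value formed by carry and r8).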

That doesn't work, so it suggests that the instruction at c0208374 wasn't
executed.
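
The three register comparisons can be reproduced with a few lines of
Python. This is only a sketch of the arithmetic: the register values are
copied from the oops, the subgt at c0208374 is modelled as a subtract, and
since the carry flag going into that instruction is not known, both
possibilities are tried.

```python
MASK = 0xFFFFFFFF

REGS = {
    "r0": 0xEF273500, "r4": 0x00000EF4, "r6": 0xD77D6A38,
    "r8": 0xC03013EC, "r9": 0x00000006, "ip": 0xEF2735D0,
}

def rrx(value, carry):
    """Rotate right with extend: shift right by one, carry enters bit 31."""
    return ((carry << 31) | (value >> 1)) & MASK

# add r12, r0, #208 at 0xc020875c (in single_unlink_async)
add_matches = (REGS["r0"] + 0xD0) & MASK == REGS["ip"]

# eorsgt r4, r9, r0, lsr #20 at 0xc0208378
eors_matches = REGS["r9"] ^ (REGS["r0"] >> 20) == REGS["r4"]

# subgt r6, r0, r8, rrx at 0xc0208374, for either incoming carry value
sub_results = [(REGS["r0"] - rrx(REGS["r8"], carry)) & MASK for carry in (0, 1)]
sub_matches = REGS["r6"] in sub_results

print("add at c020875c executed: ", add_matches)
print("eors at c0208378 executed:", eors_matches)
print("sub at c0208374 executed: ", sub_matches,
      [f"{value:08x}" for value in sub_results])
```

The first two comparisons hold and the third does not, which is consistent
with execution having entered the pool at c0208378 rather than earlier.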

Now.  How can we get from the above function to c0208378?  Nothing in
this function does a call through pointer, and we certainly haven't
loaded anything off the stack.  Did the PC just spontaneously jump
there?  I think not, but there are two branches in the above code.

There is this:

>    0xc0208784 <+68>:    bne     0xc0208778 <single_unlink_async+56>  

Notice the similarity between the destination address and the address of
the first instruction we think was executed: 0xc0208778 vs 0xc0208378.
Here are the instruction opcodes for branches to those two locations:

c02107d8:	1affdfe6 	bne	c0208778
c02107d8:	1affdee6 	bne	c0208378

See the single bit difference there on bit 8?

So, this is what I think: either _something_ has cleared that bit, or
you have a problem with your SDRAM wiring, or your SDRAM containing
this location is going bad and is suffering from a bit error at this
location.
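
The two opcodes quoted above were assembled at an arbitrary address
(c02107d8), so their offset fields differ from those of the real
instruction, but the bit-8 relationship carries over. The Python sketch
below re-derives it at the real location, 0xc0208784. The encoding of the
bne there is computed here from the standard ARM branch format rather than
read from memory (the plain gdb disassembly did not include raw bytes), so
treat that word as an assumption.

```python
def branch_target(address, word):
    """Decode the destination of an ARM B/Bcc instruction located at address."""
    imm24 = word & 0x00FFFFFF
    if imm24 & 0x800000:          # sign-extend the 24-bit offset
        imm24 -= 1 << 24
    return (address + 8 + (imm24 << 2)) & 0xFFFFFFFF

def encode_bne(address, target):
    """Encode 'bne target' for an instruction placed at address."""
    offset = (target - (address + 8)) >> 2
    return 0x1A000000 | (offset & 0x00FFFFFF)

SITE = 0xC0208784   # bne in single_unlink_async
GOOD = 0xC0208778   # intended loop target
BAD = 0xC0208378    # first pool word that appears to have executed

word = encode_bne(SITE, GOOD)
# Try every single-bit flip in the offset field and keep those that hit BAD.
culprits = [bit for bit in range(24)
            if branch_target(SITE, word ^ (1 << bit)) == BAD]

print(f"expected opcode: {word:08x}")
print(f"offset bits whose flip lands on {BAD:08x}: {culprits}")
print(f"corrupted opcode: {word ^ (1 << 8):08x}")
```

Only a flip of bit 8 in the instruction word sends the branch to
c0208378; because the offset is counted in words, that bit corresponds to
the 0x400 difference between the two addresses.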

I'm afraid that I think you have a hardware problem.

From: Jochen De Smet @ 2013-09-01  0:37 UTC
  To: linux-arm-kernel

On 8/31/2013 19:54, Russell King - ARM Linux wrote:
> On Sat, Aug 31, 2013 at 07:00:29PM -0400, Jochen De Smet wrote:
> Hmm, disassemble /m... I'm not up with gdb I'm afraid. 
It's supposed to show mixed source and assembler, I believe.
> Notice the destination addresses similarity to the address of the first
> instruction we think was executed - 0xc0208778 vs c0208378.  Here's
> the instruction opcodes for branches to those two locations:
>
> c02107d8:	1affdfe6 	bne	c0208778
> c02107d8:	1affdee6 	bne	c0208378
>
> See the single bit difference there on bit 8?
>
> So, this is what I think: either _something_ has cleared that bit, or
> you have a problem with your SDRAM wiring, or your SDRAM containing
> this location is going bad and is suffering from a bit error at this
> location.
>
> I'm afraid that I think you have a hardware problem.
The only counter-indication I have is that the 3.10 kernel I've been
running has never had any issues, nor has the default 2.6.x kernel that
came with it.  I might have just gotten lucky with exactly what those
compiled to, though, I suppose?

I've got 2 of these boxes, so I'll update the kernel on the second one to
this version as well and see if it reproduces there.

Thanks a lot for the fast diagnosis.

J.
