* mvneta: oops in __rcu_read_lock on mirabox
@ 2013-09-15 1:05 Ethan Tuttle
2013-09-15 18:57 ` Thomas Petazzoni
0 siblings, 1 reply; 21+ messages in thread
From: Ethan Tuttle @ 2013-09-15 1:05 UTC (permalink / raw)
To: linux-arm-kernel
When I upgraded my mirabox from 3.11-rc4 to 3.11, I started seeing
oopses while receiving network traffic (see below). Sending a flood
ping will trigger the oops within a few minutes.
The stack looks similar, but not identical to, the one reported
earlier by Jochen De Smet[1]. In my case the PC is always
__rcu_read_lock.
A git bisect found a878764 "Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net" to be the
first bad commit... interesting, because neither of the merge parents
produce the oops. I rebased the net changes onto the other merge
parent and bisected that series, which identified 702821f "net: revert
8728c544a9c ("net: dev_pick_tx() fix")" as the first bad commit.
Indeed, reverting 702821f from 3.11 produces a kernel which stands up
to a ping flood for hours.
Each of the times I reproduced this, it was identified as "Unhandled
prefetch abort: unknown 25 (0x409) at 0xc0036ea0", except once when I
got "unknown 16 (0x400)".
I'm assuming this is an mvneta bug that was exposed by 702821f.
That's just a guess, and I don't have the skills to debug this any
further. In any case, I figured the maintainers would want to know
about it.
Thanks much,
Ethan
[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2013-September/196332.html
Unhandled prefetch abort: unknown 25 (0x409) at 0xc0036ea0
Internal error: : 409 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.11.0-ARCH-00005-gecca798 #31
task: c074b140 ti: c0740000 task.ti: c0740000
PC is at __rcu_read_lock+0x1c/0x20
LR is at __netif_receive_skb_core+0x80/0x6fc
pc : [<c0036ea0>] lr : [<c04528d4>] psr: 60000113
sp : c0741de8 ip : 5232ad87 fp : ef181800
r10: c073ede4 r9 : c07494b8 r8 : ef181800
r7 : 00000000 r6 : 00000001 r5 : ee972b40 r4 : ee972b40
r3 : c074b140 r2 : 00000001 r1 : 00000042 r0 : 0000ffff
Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
Control: 10c5387d Table: 2e8cc019 DAC: 00000015
Process swapper/0 (pid: 0, stack limit = 0xc0740240)
Stack: (0xc0741de8 to 0xc0742000)
1de0: 00000000 c0741e28 ee972b40 ee972b40 ef300c00 00000067
1e00: f014f000 ee972b40 ee972b40 ee972b40 ef300c00 00000067 f014f000 00000000
1e20: ef181800 c0455a70 b685321e 13236156 ee972b40 ee972b40 ef300c00 00000067
1e40: f014f000 00000003 ee972b40 c04562a8 00000000 ef181c80 f014fce0 c03bd688
1e60: 00000000 00000000 ef181ccc 00000001 00000001 00000001 c077e190 00000040
1e80: 00000100 00000000 ef181800 ef181c80 ef300c00 00000000 ef181ccc c03bd860
1ea0: 00000001 c076ebf8 c03bd7b0 ef181ccc c1363dc0 00000001 0000012c c1363dc8
1ec0: 00000040 c077d773 c07420c0 c0456018 c074208c 000044fe 0000000c 00000001
1ee0: c0742090 c0740000 0000000a 3f8bdf7c 00000000 00200000 00000101 c0024368
1f00: c077e190 c001bc60 00000000 0000000c 00000003 000044fd c02f91ac 00000017
1f20: 00000000 c0741f78 000003ff c07484c0 561f5811 00000000 00000000 c0024768
1f40: 00000017 c000e6c0 f0002870 c00579e4 c07a4440 c000851c c00579e4 60000013
1f60: ffffffff c0741fac c135f4c0 561f5811 00000000 c00118c0 c13629b0 00000000
1f80: 00000000 00000000 c0740000 c077dd00 c055cc88 c0735450 c135f4c0 561f5811
1fa0: 00000000 00000000 00000002 c0741fc0 c000e980 c00579e4 60000013 ffffffff
1fc0: c0740000 c0711a38 ffffffff ffffffff c0711544 00000000 00000000 c0735450
1fe0: 10c5387d c07484fc c073544c c074c290 00004059 00008074 00000000 00000000
[<c0036ea0>] (__rcu_read_lock+0x1c/0x20) from [<c04528d4>]
(__netif_receive_skb_core+0x80/0x6fc)
[<c04528d4>] (__netif_receive_skb_core+0x80/0x6fc) from [<c0455a70>]
(netif_receive_skb+0x60/0xb8)
[<c0455a70>] (netif_receive_skb+0x60/0xb8) from [<c04562a8>]
(napi_gro_receive+0x48/0x98)
[<c04562a8>] (napi_gro_receive+0x48/0x98) from [<c03bd688>]
(mvneta_rx+0x244/0x36c)
[<c03bd688>] (mvneta_rx+0x244/0x36c) from [<c03bd860>] (mvneta_poll+0xb0/0x15c)
[<c03bd860>] (mvneta_poll+0xb0/0x15c) from [<c0456018>]
(net_rx_action+0x70/0x170)
[<c0456018>] (net_rx_action+0x70/0x170) from [<c0024368>]
(__do_softirq+0xd4/0x1c8)
[<c0024368>] (__do_softirq+0xd4/0x1c8) from [<c0024768>] (irq_exit+0x74/0x88)
[<c0024768>] (irq_exit+0x74/0x88) from [<c000e6c0>] (handle_IRQ+0x68/0x8c)
[<c000e6c0>] (handle_IRQ+0x68/0x8c) from [<c000851c>]
(armada_370_xp_handle_irq+0x44/0xa4)
[<c000851c>] (armada_370_xp_handle_irq+0x44/0xa4) from [<c00118c0>]
(__irq_svc+0x40/0x70)
Exception stack(0xc0741f78 to 0xc0741fc0)
1f60: c13629b0 00000000
1f80: 00000000 00000000 c0740000 c077dd00 c055cc88 c0735450 c135f4c0 561f5811
1fa0: 00000000 00000000 00000002 c0741fc0 c000e980 c00579e4 60000013 ffffffff
[<c00118c0>] (__irq_svc+0x40/0x70) from [<c00579e4>]
(cpu_startup_entry+0xb0/0x114)
[<c00579e4>] (cpu_startup_entry+0xb0/0x114) from [<c0711a38>]
(start_kernel+0x2c8/0x324)
Code: e593300c e59321b4 e2822001 e58321b4 (e12fff1e)
---[ end trace 8f21018165664a9e ]---
Kernel panic - not syncing: Fatal exception in interrupt
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-15 1:05 mvneta: oops in __rcu_read_lock on mirabox Ethan Tuttle
@ 2013-09-15 18:57 ` Thomas Petazzoni
2013-09-16 6:50 ` Willy Tarreau
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Petazzoni @ 2013-09-15 18:57 UTC (permalink / raw)
To: linux-arm-kernel
Hello Ethan,
On Sat, 14 Sep 2013 18:05:32 -0700, Ethan Tuttle wrote:
> When I upgraded my mirabox from 3.11-rc4 to 3.11, I started seeing
> oopses while receiving network traffic (see below). Sending a flood
> ping will trigger the oops within a few minutes.
>
> The stack looks similar, but not identical to, the one reported
> earlier by Jochen De Smet[1]. In my case the PC is always
> __rcu_read_lock.
>
> A git bisect found a878764 "Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net" to be the
> first bad commit... interesting, because neither of the merge parents
> produce the oops. I rebased the net changes onto the other merge
> parent and bisected that series, which identified 702821f "net: revert
> 8728c544a9c ("net: dev_pick_tx() fix")" as the first bad commit.
> Indeed, reverting 702821f from 3.11 produces a kernel which stands up
> to a ping flood for hours.
>
> Each of the times I reproduced this, it was identified as "Unhandled
> prefetch abort: unknown 25 (0x409) at 0xc0036ea0", except once when I
> got "unknown 16 (0x400)".
>
> I'm assuming this is an mvneta bug that was exposed by 702821f.
> That's just a guess, and I don't have the skills to debug this any
> further. In any case, I figured the maintainers would want to know
> about it.
Thanks a lot for the report and the detailed investigation.
Unfortunately, I don't have Armada 370 hardware with me this week, so
I'm unable to test and reproduce the issue.
However, I've added a bunch of Armada 370 people/maintainers in Cc,
hopefully they can at least try to reproduce and confirm that reverting
this patch makes the problem go away, which would confirm that we
should look for a bug in the mvneta driver around this problem.
Thanks!
Thomas
--
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-15 18:57 ` Thomas Petazzoni
@ 2013-09-16 6:50 ` Willy Tarreau
2013-09-16 8:56 ` Ethan Tuttle
2013-09-16 15:51 ` Thomas Petazzoni
0 siblings, 2 replies; 21+ messages in thread
From: Willy Tarreau @ 2013-09-16 6:50 UTC (permalink / raw)
To: linux-arm-kernel
Hi Thomas,
On Sun, Sep 15, 2013 at 08:57:01PM +0200, Thomas Petazzoni wrote:
> Hello Ethan,
>
> On Sat, 14 Sep 2013 18:05:32 -0700, Ethan Tuttle wrote:
> > When I upgraded my mirabox from 3.11-rc4 to 3.11, I started seeing
> > oopses while receiving network traffic (see below). Sending a flood
> > ping will trigger the oops within a few minutes.
> >
> > The stack looks similar, but not identical to, the one reported
> > earlier by Jochen De Smet[1]. In my case the PC is always
> > __rcu_read_lock.
> >
> > A git bisect found a878764 "Merge
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net" to be the
> > first bad commit... interesting, because neither of the merge parents
> > produce the oops. I rebased the net changes onto the other merge
> > parent and bisected that series, which identified 702821f "net: revert
> > 8728c544a9c ("net: dev_pick_tx() fix")" as the first bad commit.
> > Indeed, reverting 702821f from 3.11 produces a kernel which stands up
> > to a ping flood for hours.
> >
> > Each of the times I reproduced this, it was identified as "Unhandled
> > prefetch abort: unknown 25 (0x409) at 0xc0036ea0", except once when I
> > got "unknown 16 (0x400)".
> >
> > I'm assuming this is an mvneta bug that was exposed by 702821f.
> > That's just a guess, and I don't have the skills to debug this any
> > further. In any case, I figured the maintainers would want to know
> > about it.
>
> Thanks a lot for the report and the detailed investigation.
> Unfortunately, I don't have Armada 370 hardware with me this week, so
> I'm unable to test and reproduce the issue.
>
> However, I've added a bunch of Armada 370 people/maintainers in Cc,
> hopefully they can at least try to reproduce and confirm that reverting
> this patch makes the problem go away, which would confirm that we
> should look for a bug in the mvneta driver around this problem.
I'm currently testing on 3.11.1 (which I had here) and am not getting
any issue after 50M packets. My kernel is running in thumb mode and
without SMP.
Ethan, we'll need your config I guess.
Thanks,
Willy
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 6:50 ` Willy Tarreau
@ 2013-09-16 8:56 ` Ethan Tuttle
2013-09-16 15:51 ` Thomas Petazzoni
1 sibling, 0 replies; 21+ messages in thread
From: Ethan Tuttle @ 2013-09-16 8:56 UTC (permalink / raw)
To: linux-arm-kernel
Hi guys. Here's the config I was building with:
https://gist.github.com/anonymous/6578139
It's based on the one I found in archlinuxarm's git repo. I didn't
change any of the options - at least, not manually.
Thanks for the follow up!
Ethan
On Sun, Sep 15, 2013 at 11:50 PM, Willy Tarreau <w@1wt.eu> wrote:
> Hi Thomas,
>
> On Sun, Sep 15, 2013 at 08:57:01PM +0200, Thomas Petazzoni wrote:
>> Hello Ethan,
>>
>> On Sat, 14 Sep 2013 18:05:32 -0700, Ethan Tuttle wrote:
>> > When I upgraded my mirabox from 3.11-rc4 to 3.11, I started seeing
>> > oopses while receiving network traffic (see below). Sending a flood
>> > ping will trigger the oops within a few minutes.
>> >
>> > The stack looks similar, but not identical to, the one reported
>> > earlier by Jochen De Smet[1]. In my case the PC is always
>> > __rcu_read_lock.
>> >
>> > A git bisect found a878764 "Merge
>> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net" to be the
>> > first bad commit... interesting, because neither of the merge parents
>> > produce the oops. I rebased the net changes onto the other merge
>> > parent and bisected that series, which identified 702821f "net: revert
>> > 8728c544a9c ("net: dev_pick_tx() fix")" as the first bad commit.
>> > Indeed, reverting 702821f from 3.11 produces a kernel which stands up
>> > to a ping flood for hours.
>> >
>> > Each of the times I reproduced this, it was identified as "Unhandled
>> > prefetch abort: unknown 25 (0x409) at 0xc0036ea0", except once when I
>> > got "unknown 16 (0x400)".
>> >
>> > I'm assuming this is an mvneta bug that was exposed by 702821f.
>> > That's just a guess, and I don't have the skills to debug this any
>> > further. In any case, I figured the maintainers would want to know
>> > about it.
>>
>> Thanks a lot for the report and the detailed investigation.
>> Unfortunately, I don't have Armada 370 hardware with me this week, so
>> I'm unable to test and reproduce the issue.
>>
>> However, I've added a bunch of Armada 370 people/maintainers in Cc,
>> hopefully they can at least try to reproduce and confirm that reverting
>> this patch makes the problem go away, which would confirm that we
>> should look for a bug in the mvneta driver around this problem.
>
> I'm currently testing on 3.11.1 (which I had here) and am not getting
> any issue after 50M packets. My kernel is running in thumb mode and
> without SMP.
>
> Ethan, we'll need your config I guess.
>
> Thanks,
> Willy
>
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 6:50 ` Willy Tarreau
2013-09-16 8:56 ` Ethan Tuttle
@ 2013-09-16 15:51 ` Thomas Petazzoni
2013-09-16 16:22 ` Russell King - ARM Linux
2013-09-16 16:35 ` Ethan Tuttle
1 sibling, 2 replies; 21+ messages in thread
From: Thomas Petazzoni @ 2013-09-16 15:51 UTC (permalink / raw)
To: linux-arm-kernel
Willy, Ethan,
On Mon, 16 Sep 2013 08:50:47 +0200, Willy Tarreau wrote:
> I'm currently testing on 3.11.1 (which I had here) and am not getting
> any issue after 50M packets. My kernel is running in thumb mode and
> without SMP.
>
> Ethan, we'll need your config I guess.
Can both of you also report the U-Boot version you're using, and the
SoC revision (it's visible in the U-Boot output). Maybe Globalscale is
shipping Mirabox with a different version of the bootloader, or some
hardware difference, that is causing problems? (I'm just speculating
here, but another user already reported having issues with his Mirabox,
and Russell King analyzed the oops as very likely being hardware
problems).
Thomas
--
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 15:51 ` Thomas Petazzoni
@ 2013-09-16 16:22 ` Russell King - ARM Linux
2013-09-16 16:24 ` Thomas Petazzoni
2013-09-16 16:35 ` Ethan Tuttle
1 sibling, 1 reply; 21+ messages in thread
From: Russell King - ARM Linux @ 2013-09-16 16:22 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Sep 16, 2013 at 05:51:52PM +0200, Thomas Petazzoni wrote:
> Willy, Ethan,
>
> On Mon, 16 Sep 2013 08:50:47 +0200, Willy Tarreau wrote:
>
> > I'm currently testing on 3.11.1 (which I had here) and am not getting
> > any issue after 50M packets. My kernel is running in thumb mode and
> > without SMP.
> >
> > Ethan, we'll need your config I guess.
>
> Can both of you also report the U-Boot version you're using, and the
> SoC revision (it's visible in the U-Boot output). Maybe Globalscale is
> shipping Mirabox with a different version of the bootloader, or some
> hardware difference, that is causing problems? (I'm just speculating
> here, but another user already reported having issues with his Mirabox,
> and Russell King analyzed the oops as very likely being hardware
> problems).
One seemed to be a single bit error in an instruction inside the kernel
image. The other was what seems to be an impossible abort.
I still don't see how we could end up with a prefetch abort inside memset()
due to the kernel domain being inaccessible, but still be able to get
an oops out, especially when we dump out the memory for the faulting
instruction by accessing that memory via that apparently inaccessible
domain while running the code which dumps that memory also under this
apparently inaccessible domain. If the domain containing the kernel
really was inaccessible, the system would be completely dead.
The only possibilities I can come up with for that is that abort was
caused by something spurious happening at the hardware level causing
corruption of the instruction TLB (corrupting the domain index stored
in the I-TLB) or other CPU control hardware causing it to spuriously
generate that fault.
As the domain field in the page table L1 entries covers bit 8, and the
single bit error with the instruction was also bit 8, maybe there's a
design weakness on data line bit 8 causing marginal operation.
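
To make the bit 8 observation concrete, here is a minimal sketch in Python. The instruction word and the L1 entry below are hypothetical values picked for illustration, not data from either oops. The only facts assumed are that the short-descriptor L1 domain field occupies bits [8:5] and that the DAC holds two access bits per domain; the DAC value 0x15 is the one printed in Ethan's oops.

```python
# Short-descriptor L1 entries keep the domain index in bits [8:5].
DOMAIN_SHIFT = 5
DOMAIN_MASK = 0xF << DOMAIN_SHIFT  # 0x1e0, which includes bit 8


def flipped_bits(expected, observed):
    """Return the bit positions at which two 32-bit words differ."""
    diff = (expected ^ observed) & 0xFFFFFFFF
    return [bit for bit in range(32) if diff & (1 << bit)]


def domain_of(l1_entry):
    """Extract the 4-bit domain index from an L1 descriptor."""
    return (l1_entry & DOMAIN_MASK) >> DOMAIN_SHIFT


def dac_access(dac, domain):
    """Return the 2-bit DAC field for a domain (0 = no access, 1 = client)."""
    return (dac >> (2 * domain)) & 0x3


# Hypothetical values, for illustration only.
good_insn = 0xE593300C
bad_insn = good_insn ^ (1 << 8)   # same word with bit 8 flipped
good_l1 = 0x00000C1E              # made-up section entry using domain 0
bad_l1 = good_l1 ^ (1 << 8)       # same entry with bit 8 flipped

print(flipped_bits(good_insn, bad_insn))
print(domain_of(good_l1), domain_of(bad_l1))
print(dac_access(0x15, domain_of(good_l1)), dac_access(0x15, domain_of(bad_l1)))
```

Under these assumptions, a flip of bit 8 in an entry that uses domain 0 would move it to domain 8, which DAC 0x15 marks as no access. That would be one way to get a domain fault on memory that is otherwise perfectly reachable.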
To add to this, the abort given in this report gives an IFSR value of
0x409, which equates to "Synchronous parity error on memory access"
in ARMv7. The other value (0x400) equates to "TLB conflict abort"
which can only happen with LPAE support enabled... So this is just
getting more weird!
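
For anyone who wants to check the decoding, a small sketch in Python follows. It assumes the ARMv7 short-descriptor IFSR layout, where FS[4] is bit 10 and FS[3:0] are bits 3:0. The name table only lists the two encodings discussed here, and the helper name is made up for this example.

```python
def ifsr_fault_status(ifsr):
    """Return the 5-bit FS value from a short-descriptor IFSR.

    FS[4] comes from bit 10 and FS[3:0] from bits 3:0.
    """
    return (((ifsr >> 10) & 1) << 4) | (ifsr & 0xF)


# Only the two encodings mentioned in this thread (ARMv7 short-descriptor).
FS_NAMES = {
    0b11001: "Synchronous parity error on memory access",
    0b10000: "TLB conflict abort",
}

for value in (0x409, 0x400):
    fs = ifsr_fault_status(value)
    print(f"IFSR {value:#x}: FS={fs} ({FS_NAMES.get(fs, 'not listed here')})")
```

The FS values come out as 25 and 16, which is where the "unknown 25" and "unknown 16" in the oops messages come from. Whether the Marvell core follows the architectural meaning of those encodings is a separate question.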
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 16:22 ` Russell King - ARM Linux
@ 2013-09-16 16:24 ` Thomas Petazzoni
2013-09-16 17:14 ` Russell King - ARM Linux
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Petazzoni @ 2013-09-16 16:24 UTC (permalink / raw)
To: linux-arm-kernel
Russell,
On Mon, 16 Sep 2013 17:22:09 +0100, Russell King - ARM Linux wrote:
> One seemed to be a single bit error in an instruction inside the kernel
> image. The other was what seems to be an impossible abort.
>
> I still don't see how we could end up with a prefetch abort inside memset()
> due to the kernel domain being inaccessible, but still be able to get
> an oops out, especially when we dump out the memory for the faulting
> instruction by accessing that memory via that apparently inaccessible
> domain while running the code which dumps that memory also under this
> apparently inaccessible domain. If the domain containing the kernel
> really was inaccessible, the system would be completely dead.
>
> The only possibilities I can come up with for that is that abort was
> caused by something spurious happening at the hardware level causing
> corruption of the instruction TLB (corrupting the domain index stored
> in the I-TLB) or other CPU control hardware causing it to spuriously
> generate that fault.
>
> As the domain field in the page table L1 entries covers bit 8, and the
> single bit error with the instruction was also bit 8, maybe there's a
> design weakness on data line bit 8 causing marginal operation.
>
> To add to this, the abort given in this report gives an IFSR value of
> 0x409, which equates to "Synchronous parity error on memory access"
> in ARMv7. The other value (0x400) equates to "TLB conflict abort"
> which can only happen with LPAE support enabled... So this is just
> getting more weird!
Could this be caused by bitflips in the RAM due to bad timings, or
overheating, or that kind of thing?
Thomas
--
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 15:51 ` Thomas Petazzoni
2013-09-16 16:22 ` Russell King - ARM Linux
@ 2013-09-16 16:35 ` Ethan Tuttle
2013-09-16 16:39 ` Willy Tarreau
1 sibling, 1 reply; 21+ messages in thread
From: Ethan Tuttle @ 2013-09-16 16:35 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Sep 16, 2013 at 8:51 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Willy, Ethan,
>
> On Mon, 16 Sep 2013 08:50:47 +0200, Willy Tarreau wrote:
>
>> I'm currently testing on 3.11.1 (which I had here) and am not getting
>> any issue after 50M packets. My kernel is running in thumb mode and
>> without SMP.
>>
>> Ethan, we'll need your config I guess.
>
> Can both of you also report the U-Boot version you're using, and the
> SoC revision (it's visible in the U-Boot output).
Mine says:
U-Boot 2009.08 (Sep 16 2012 - 22:50:06)Marvell version: 1.1.2 NQ
SoC: MV6710 A1
> Maybe Globalscale is
> shipping Mirabox with a different version of the bootloader, or some
> hardware difference, that is causing problems? (I'm just speculating
> here, but another user already reported having issues with his Mirabox,
> and Russell King analyzed the oops as very likely being hardware
> problems).
>
> Thomas
> --
> Thomas Petazzoni, Free Electrons
> Kernel, drivers, real-time and embedded Linux
> development, consulting, training and support.
> http://free-electrons.com
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 16:35 ` Ethan Tuttle
@ 2013-09-16 16:39 ` Willy Tarreau
2013-09-16 16:44 ` Willy Tarreau
0 siblings, 1 reply; 21+ messages in thread
From: Willy Tarreau @ 2013-09-16 16:39 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Sep 16, 2013 at 09:35:16AM -0700, Ethan Tuttle wrote:
> On Mon, Sep 16, 2013 at 8:51 AM, Thomas Petazzoni
> <thomas.petazzoni@free-electrons.com> wrote:
> > Willy, Ethan,
> >
> > On Mon, 16 Sep 2013 08:50:47 +0200, Willy Tarreau wrote:
> >
> >> I'm currently testing on 3.11.1 (which I had here) and am not getting
> >> any issue after 50M packets. My kernel is running in thumb mode and
> >> without SMP.
> >>
> >> Ethan, we'll need your config I guess.
> >
> > Can both of you also report the U-Boot version you're using, and the
> > SoC revision (it's visible in the U-Boot output).
>
> Mine says:
>
> U-Boot 2009.08 (Sep 16 2012 - 22:50:06)Marvell version: 1.1.2 NQ
> SoC: MV6710 A1
I just checked on my old captures and I have the same here, with more
details such as the CPU's revision (Rev 1) :
http://1wt.eu/articles/mirabox-vs-guruplug/
Willy
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 16:39 ` Willy Tarreau
@ 2013-09-16 16:44 ` Willy Tarreau
2013-09-16 17:24 ` Ethan Tuttle
0 siblings, 1 reply; 21+ messages in thread
From: Willy Tarreau @ 2013-09-16 16:44 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Sep 16, 2013 at 06:39:37PM +0200, Willy Tarreau wrote:
> On Mon, Sep 16, 2013 at 09:35:16AM -0700, Ethan Tuttle wrote:
> > On Mon, Sep 16, 2013 at 8:51 AM, Thomas Petazzoni
> > <thomas.petazzoni@free-electrons.com> wrote:
> > > Willy, Ethan,
> > >
> > > On Mon, 16 Sep 2013 08:50:47 +0200, Willy Tarreau wrote:
> > >
> > >> I'm currently testing on 3.11.1 (which I had here) and am not getting
> > >> any issue after 50M packets. My kernel is running in thumb mode and
> > >> without SMP.
> > >>
> > >> Ethan, we'll need your config I guess.
> > >
> > > Can both of you also report the U-Boot version you're using, and the
> > > SoC revision (it's visible in the U-Boot output).
> >
> > Mine says:
> >
> > U-Boot 2009.08 (Sep 16 2012 - 22:50:06)Marvell version: 1.1.2 NQ
> > SoC: MV6710 A1
>
> I just checked on my old captures and I have the same here, with more
> details such as the CPU's revision (Rev 1) :
>
> http://1wt.eu/articles/mirabox-vs-guruplug/
BTW Ethan, I don't know if you have already opened your mirabox, but on the
link above you'll find settings for trying other frequencies for the CPU. It
could be nice to try 1 GHz with L2/DDR @500 instead of 1200/600 to see if the
issue remains or not. If it disappears, there's also a working setting with
CPU at 1.2G, L2 at 800M and DDR at 400M to help find if CPU, L2 or DDR is the culprit.
Willy
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 16:24 ` Thomas Petazzoni
@ 2013-09-16 17:14 ` Russell King - ARM Linux
2013-09-16 17:45 ` Willy Tarreau
0 siblings, 1 reply; 21+ messages in thread
From: Russell King - ARM Linux @ 2013-09-16 17:14 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Sep 16, 2013 at 06:24:50PM +0200, Thomas Petazzoni wrote:
> Could this be caused by bitflips in the RAM due to bad timings, or
> overheating, or that kind of thing?
Well, the SoC is an Armada 370, which uses Marvell's own Sheeva core.
From what I understand, this is a CPU designed entirely by Marvell, so
the interpretation of these codes may not be correct. This is made
harder to diagnose in that Marvell is so secretive with their
documentation; indeed for this CPU there is no information publicly
available (there's only the product briefs).
Bad timings could certainly cause bitflips, as could poor routing of
data line D8 (eg, incorrect termination or routing causing reflections
on the data line - remember that with modern hardware, almost every
signal is a transmission line).
Marginal or noisy power supplies could also be a problem - for example,
if the impedance of the power supply connections is too great, it may
work with some patterns of use but not others.
There are so many possibilities...
However, if the fault codes above really do equate to what's in the ARMv7
Architecture Reference Manual, I think we can rule out the routing and
RAM chips - because a cache parity error points to bit flips in the cache,
or if there is no cache parity checking implemented, it means something
is corrupting the state of the SoC - which could be due to bad power
supplies.
How do we get to the bottom of this? That's a very good question - one
which is going to be very difficult to solve. Ideally, it means working
with the manufacturer's design team to try and work out what's going on
at the board level, probably using logic analysers to capture the bus
activity leading up to the failure. Also, checking the power supplies
at the SoC too - checking that they're within correct tolerance and
checking the amount of noise on them.
I think all we can do at the moment is to wait for further reports to roll
in and see whether a better pattern emerges.
If you want to try something - and you suspect it may be heat related,
you could try putting the board inside a container, monitor the temperature
inside the container, and put it in your freezer! Just be careful of the
temperature of the other devices on the board getting too cold though -
remember, most consumer electronics is only rated for an *operating*
temperature range of 0°C to 70°C and your freezer will be something like
-20°C - so don't let the ambient temperature inside the container go
below 0°C! If the CPU is producing lots of heat though, it may keep the
container sufficiently warm that that's not a problem. The theory is
that by making the ambient 15 to 20°C cooler, you will also lower the
temperature of the hotter parts by a similar amount.
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 16:44 ` Willy Tarreau
@ 2013-09-16 17:24 ` Ethan Tuttle
2013-09-16 17:47 ` Willy Tarreau
0 siblings, 1 reply; 21+ messages in thread
From: Ethan Tuttle @ 2013-09-16 17:24 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Sep 16, 2013 at 9:44 AM, Willy Tarreau <w@1wt.eu> wrote:
> On Mon, Sep 16, 2013 at 06:39:37PM +0200, Willy Tarreau wrote:
>> On Mon, Sep 16, 2013 at 09:35:16AM -0700, Ethan Tuttle wrote:
>> > On Mon, Sep 16, 2013 at 8:51 AM, Thomas Petazzoni
>> > <thomas.petazzoni@free-electrons.com> wrote:
>> > > Willy, Ethan,
>> > >
>> > > On Mon, 16 Sep 2013 08:50:47 +0200, Willy Tarreau wrote:
>> > >
>> > >> I'm currently testing on 3.11.1 (which I had here) and am not getting
>> > >> any issue after 50M packets. My kernel is running in thumb mode and
>> > >> without SMP.
>> > >>
>> > >> Ethan, we'll need your config I guess.
>> > >
>> > > Can both of you also report the U-Boot version you're using, and the
>> > > SoC revision (it's visible in the U-Boot output).
>> >
>> > Mine says:
>> >
>> > U-Boot 2009.08 (Sep 16 2012 - 22:50:06)Marvell version: 1.1.2 NQ
>> > SoC: MV6710 A1
>>
>> I just checked on my old captures and I have the same here, with more
>> details such as the CPU's revision (Rev 1) :
>>
>> http://1wt.eu/articles/mirabox-vs-guruplug/
>
> BTW Ethan, I don't know if you have already opened your mirabox, but on the
> link above you'll find settings for trying other frequencies for the CPU. It
> could be nice to try 1 GHz with L2/DDR @500 instead of 1200/600 to see if the
> issue remains or not. If it disappears, there's also a working setting with
> CPU at 1.2G, L2 at 800M and DDR at 400M to help find if CPU, L2 or DDR is the culprit.
>
> Willy
>
I have not opened my mirabox - but sure, I'll open it up and try those
other settings when I get a chance.
Also, you mentioned that you have SMP disabled in your kernel. It
looks like it's on in my .config. Should I run a test with SMP
disabled?
I'm surprised that nobody else sees this crash given how easy it is
for me to reproduce. BTW, the 3.11 kernel I made with 702821f
reverted has been humming along for days without issue.
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 17:14 ` Russell King - ARM Linux
@ 2013-09-16 17:45 ` Willy Tarreau
2013-09-16 18:25 ` Russell King - ARM Linux
0 siblings, 1 reply; 21+ messages in thread
From: Willy Tarreau @ 2013-09-16 17:45 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Sep 16, 2013 at 06:14:16PM +0100, Russell King - ARM Linux wrote:
> On Mon, Sep 16, 2013 at 06:24:50PM +0200, Thomas Petazzoni wrote:
> > Could this be caused by bitflips in the RAM due to bad timings, or
> > overheating, or that kind of thing?
>
> Well, the SoC is an Armada 370, which uses Marvell's own Sheeva core.
> From what I understand, this is a CPU designed entirely by Marvell, so
> the interpretation of these codes may not be correct. This is made
> harder to diagnose in that Marvell is so secretive with their
> documentation; indeed for this CPU there is no information publicly
> available (there's only the product briefs).
Yes and their salesmen never respond after many attempts in more than one
year now. Looks like they want to keep their chips for themselves only :-(
> Bad timings could certainly cause bitflips, as could poor routing of
> data line D8 (eg, incorrect termination or routing causing reflections
> on the data line - remember that with modern hardware, almost every
> signal is a transmission line).
This board has a really clean routing and placement, chips are very close.
That does not rule out the possibility of a lacking termination, but it
would probably affect more users.
> Marginal or noisy power supplies could also be a problem - for example,
> if the impedance of the power supply connections is too great, it may
> work with some patterns of use but not others.
We have some margin here, I measured less than 1 Amp to boot and something
like 600-700 mA in idle if my memory serves me correctly. The 3A PSU and its
thicker-than-average wires seem safe. I think that Globalscale learned a
lot from the horrible Guruplug design that all this part needs to be done
correctly and they did a very clean job this time.
> There are so many possibilities...
Including faulty components. I'm not aware of an equivalent of cpuburn for
ARM, it would probably help, though it's probably harder to design in a
generic way than on x86 where all systems are the same.
> However, if the fault codes above really do equate to what's in the ARMv7
> Architecture Reference Manual, I think we can rule out the routing and
> RAM chips - because a cache parity error points to bit flips in the cache,
> or if there is no cache parity checking implemented, it means something
> is corrupting the state of the SoC - which could be due to bad power
> supplies.
>
> How do we get to the bottom of this? That's a very good question - one
> which is going to be very difficult to solve. Ideally, it means working
> with the manufacturer's design team to try and work out what's going on
> at the board level, probably using logic analysers to capture the bus
> activity leading up to the failure. Also, checking the power supplies
> at the SoC too - checking that they're within correct tolerance and
> checking the amount of noise on them.
>
> I think all we can do at the moment is to wait for further reports to roll
> in and see whether a better pattern emerges.
Especially since there are also some heavy testers who don't seem to be
impacted :-/
> If you want to try something - and you suspect it may be heat related,
> you could try putting the board inside a container, monitor the temperature
> inside the container, and put it in your freezer! Just be careful of the
> temperature of the other devices on the board getting too cold though -
> remember, most consumer electronics is only rated for an *operating*
> temperature range of 0°C to 70°C and your freezer will be something like
> -20°C - so don't let the ambient temperature inside the container go
> below 0°C! If the CPU is producing lots of heat though, it may keep the
> container sufficiently warm that that's not a problem. The theory is
> that by making the ambient 15 to 20°C cooler, you will also lower the
> temperature of the hotter parts by a similar amount.
Sometimes you can also do the opposite: heat it gently with a hair dryer
while it is working to see if problems happen more frequently. It's often
easier to do than working in a cold place, as you don't have issues with the
wires, and moisture does not accumulate.
I've detected some early failures this way; the NAND in my Iomega Iconnect
is extremely sensitive to heating to the point that I had to stick a heat
sink on it and take the board out of its case to avoid hangs. The hair
dryer quickly revealed the culprit in a few minutes when it took weeks to
get a failure before.
Willy
^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 17:24 ` Ethan Tuttle
@ 2013-09-16 17:47 ` Willy Tarreau
2013-09-16 18:28 ` Russell King - ARM Linux
0 siblings, 1 reply; 21+ messages in thread
From: Willy Tarreau @ 2013-09-16 17:47 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Sep 16, 2013 at 10:24:05AM -0700, Ethan Tuttle wrote:
> I have not opened my mirabox - but sure, I'll open it up and try those
> other settings when I get a chance.
OK
> Also, you mentioned that you have SMP disabled in your kernel. It
> looks like it's on in my .config. Should I run a test with SMP
> disabled?
You may want to try but I wouldn't bet on this.
> I'm surprised that nobody else sees this crash given how easy it is
> for me to reproduce. BTW, the 3.11 kernel I made with 702821f
> reverted has been humming along for days without issue.
I'll have to rebuild with your config and exact 3.11 to test again.
Can you check the packet rate of your ping flood to give an order of
magnitude so that we're sure to be in the same conditions ?
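If tcpdump is not handy, the rate can also be estimated from the kernel's
interface counters. This is only a rough sketch: the interface name "eth0"
and the function names are assumptions, and the counter includes every
received packet, not just ICMP, so it is only meaningful while the flood
dominates the traffic.

```python
import time
from pathlib import Path


def read_rx_packets(iface: str) -> int:
    """Return the cumulative received-packet counter for iface (Linux sysfs)."""
    return int(Path(f"/sys/class/net/{iface}/statistics/rx_packets").read_text())


def packets_per_second(start_count: int, end_count: int, elapsed_s: float) -> float:
    """Average packet rate between two counter samples."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (end_count - start_count) / elapsed_s


def measure_rx_rate(iface: str = "eth0", interval_s: float = 10.0) -> float:
    """Sample the counter twice, interval_s apart, and return packets/s."""
    start = read_rx_packets(iface)
    t0 = time.monotonic()
    time.sleep(interval_s)
    elapsed = time.monotonic() - t0
    return packets_per_second(start, read_rx_packets(iface), elapsed)
```

Calling measure_rx_rate() on the board while the flood is running gives a
figure that can be compared between the two setups.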
Willy
^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 17:45 ` Willy Tarreau
@ 2013-09-16 18:25 ` Russell King - ARM Linux
0 siblings, 0 replies; 21+ messages in thread
From: Russell King - ARM Linux @ 2013-09-16 18:25 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Sep 16, 2013 at 07:45:14PM +0200, Willy Tarreau wrote:
> This board has a really clean routing and placement, chips are very close.
> That does not rule out the possibility of a lacking termination, but it
> would probably affect more users.
True - though in your photographs, we can't see the tracking for the
data bus, because that's all buried in the inner board layers.
However, there is some evidence in there of trace lengths being matched
which is a good sign. :)
> On Mon, Sep 16, 2013 at 06:14:16PM +0100, Russell King - ARM Linux wrote:
> > Marginal or noisy power supplies could also be a problem - for example,
> > if the impedance of the power supply connections is too great, it may
> > work with some patterns of use but not others.
>
> We have some margin here, I measured less than 1 Amp to boot and something
> like 6-700 mA in idle if my memory serves me correctly. The 3A PSU and its
> thicker-than-average wires seem safe. I think that Globalscale learned a
> lot from the horrible Guruplug design that all this part needs to be done
> correctly and they did a very clean job this time.
Not quite the power supply I was referring to - I'm talking about the
on-board regulators which supply the 3.3V and other lower voltages to
the SDRAM and SoC - and the quality of their decoupling. The on-board
regulators will have a certain degree of "line" noise immunity.
If I had to guess, I'd say C366 is probably the output bulk capacitor on
the CPU core supply (which comes via BIT7, C273, L1, with U19 being the
switching regulator chip).
I'd also guess one of C370, C396 or C398 supplies the SDRAM - and of
those C370 is the most likely - the resistors in the boxes marked K and
B, and R123 I suspect may be the SDRAM data bus termination (that
covers R107 to R136), though I only count a total of 30 of those
connecting to U5 pin 4 - and that point looks _well_ decoupled with lots
of capacitors (C8-C16, C287 on one side, C7, C246, C247 on the other.)
The other two? Maybe R105/106 which are on the underside of the CPU,
though they're a long way from that well decoupled point. R137/138?
They're up by the NAND chip and connect to ground. Though... if one
of those is for D8...
Anyway, that's all speculation.
^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 17:47 ` Willy Tarreau
@ 2013-09-16 18:28 ` Russell King - ARM Linux
2013-09-17 3:43 ` Ethan Tuttle
0 siblings, 1 reply; 21+ messages in thread
From: Russell King - ARM Linux @ 2013-09-16 18:28 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Sep 16, 2013 at 07:47:08PM +0200, Willy Tarreau wrote:
> I'll have to rebuild with your config and exact 3.11 to test again.
> Can you check the packet rate of your ping flood to give an order of
> magnitude so that we're sure to be in the same conditions ?
Also, try swapping kernel binaries between yourselves, so that you can
be sure you're running the exact same kernel on different hardware.
^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-16 18:28 ` Russell King - ARM Linux
@ 2013-09-17 3:43 ` Ethan Tuttle
2013-09-17 6:01 ` Willy Tarreau
0 siblings, 1 reply; 21+ messages in thread
From: Ethan Tuttle @ 2013-09-17 3:43 UTC (permalink / raw)
To: linux-arm-kernel
I just built 3.11.1 with the posted config and got the usual crash in
about 2 minutes with a ping flood.
The kernel image is available here:
https://www.dropbox.com/s/cqkqop3jjb1stk3/uImage-dtb.armada-370-mirabox
The md5 is 05f350a193c6c60d9dac40bea810bbdd. You may notice the
version string reveals a patch on top of 3.11.1, this is just a
makefile patch to "Build a uImage with dtb already appended".
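To check that a downloaded image is byte-for-byte the one described here,
something like the following should do. It is a minimal sketch using only
Python's standard hashlib; the function names are made up for illustration.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_matches(path: str, expected_md5: str) -> bool:
    """True if the file's MD5 equals the expected hex digest (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, image_matches("uImage-dtb.armada-370-mirabox",
"05f350a193c6c60d9dac40bea810bbdd") should return True for an intact copy.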
Tcpdump captured about 2,800 icmp packets per second while the ping
flood was running.
Hope this helps! If Willy wants to share a kernel image I'll see if I
can crash it :)
Thanks,
Ethan
On Mon, Sep 16, 2013 at 11:28 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Mon, Sep 16, 2013 at 07:47:08PM +0200, Willy Tarreau wrote:
>> I'll have to rebuild with your config and exact 3.11 to test again.
>> Can you check the packet rate of your ping flood to give an order of
>> magnitude so that we're sure to be in the same conditions ?
>
> Also, try swapping kernel binaries between yourselves, so that you can
> be sure you're running the exact same kernel on different hardware.
^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-17 3:43 ` Ethan Tuttle
@ 2013-09-17 6:01 ` Willy Tarreau
2013-09-18 6:30 ` Ethan Tuttle
0 siblings, 1 reply; 21+ messages in thread
From: Willy Tarreau @ 2013-09-17 6:01 UTC (permalink / raw)
To: linux-arm-kernel
Hi Ethan,
On Mon, Sep 16, 2013 at 08:43:19PM -0700, Ethan Tuttle wrote:
> I just built 3.11.1 with the posted config and got the usual crash in
> about 2 minutes with a ping flood.
>
> The kernel image is available here:
>
> https://www.dropbox.com/s/cqkqop3jjb1stk3/uImage-dtb.armada-370-mirabox
OK thank you. Unfortunately I can't boot it here as my only rootfs is
a squashfs and it is not enabled in this kernel.
> The md5 is 05f350a193c6c60d9dac40bea810bbdd. You may notice the
> version string reveals a patch on top of 3.11.1, this is just a
> makefile patch to "Build a uImage with dtb already appended".
Interesting one, I was not aware of it, I'll probably add it to my
trees to stop relying on build scripts.
> Tcpdump captured about 2,800 icmp packets per second while the ping
> flood was running.
OK I've been running mine at this exact rate as well (2803 pps) for
11 minutes now. I disabled icmp_ratelimit to ensure that I got as
many responses as requests. No problem so far.
> Hope this helps! If Willy wants to share a kernel image I'll see if I
> can crash it :)
I've put my working images here :
http://1wt.eu/ethan-kernel/
One is done with my config, the other one with your config in which
I added support for squashfs and blk_dev_ram that I'm using to boot
a rootfs loaded in memory by the boot loader.
I can't make it fail either. I'm really starting to suspect a hardware
issue...
Next step should be that you test both kernels to be sure.
Cheers,
Willy
^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-17 6:01 ` Willy Tarreau
@ 2013-09-18 6:30 ` Ethan Tuttle
2013-09-18 16:35 ` Thomas Petazzoni
0 siblings, 1 reply; 21+ messages in thread
From: Ethan Tuttle @ 2013-09-18 6:30 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Sep 16, 2013 at 11:01 PM, Willy Tarreau <w@1wt.eu> wrote:
> Next step should be that you test both kernels to be sure.
Thanks for the kernel images, Willy. I'm still experimenting but
initial results are strange: I haven't seen a crash from the -ethan
image you provided, nor by a kernel with that config that I built
myself. The config is only different from my crashing config by a few
options. So perhaps some combination of options prevents the crash.
I'll see if I can narrow it down.
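For narrowing it down, listing exactly which options differ between the two
configs is the obvious first step. The kernel tree ships scripts/diffconfig
for this; the sketch below does roughly the same thing in Python. It assumes
the usual .config syntax ("CONFIG_FOO=y" and "# CONFIG_FOO is not set") and
treats an option that is absent from one file as unset. The names are
illustrative only.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text: str) -> dict:
    """Map option name -> value; options marked 'is not set' map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def diff_configs(text_a: str, text_b: str) -> dict:
    """Return {option: (value_in_a, value_in_b)} for every option that differs."""
    opts_a, opts_b = parse_config(text_a), parse_config(text_b)
    return {
        name: (opts_a.get(name, "n"), opts_b.get(name, "n"))
        for name in sorted(opts_a.keys() | opts_b.keys())
        if opts_a.get(name, "n") != opts_b.get(name, "n")
    }
```

Feeding it the contents of the crashing and non-crashing .config files gives
the short list of candidates to flip one at a time.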
For a moment I thought I found a likely culprit: all along I've been
loading my kernel into 0x02000000 in uboot, while the stock uboot env
(and Willy) uses 0x6400000. But I've seen at least one
__rcu_read_lock oops since switching to 0x6400000. So I guess I can
rule that out.
Thanks,
Ethan
^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-18 6:30 ` Ethan Tuttle
@ 2013-09-18 16:35 ` Thomas Petazzoni
2013-09-18 16:49 ` Willy Tarreau
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Petazzoni @ 2013-09-18 16:35 UTC (permalink / raw)
To: linux-arm-kernel
Dear Ethan Tuttle,
On Tue, 17 Sep 2013 23:30:56 -0700, Ethan Tuttle wrote:
> On Mon, Sep 16, 2013 at 11:01 PM, Willy Tarreau <w@1wt.eu> wrote:
> > Next step should be that you test both kernels to be sure.
>
> Thanks for the kernel images, Willy. I'm still experimenting but
> initial results are strange: I haven't seen a crash from the -ethan
> image you provided, nor by a kernel with that config that I built
> myself. The config is only different from my crashing config by a few
> options. So perhaps some combination of options prevents the crash.
> I'll see if I can narrow it down.
A toolchain generating some crappy code maybe? Ethan, Willy, comparing
your toolchain (compiler version, origin of the toolchain) could be
interesting.
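The compiler version is recorded in the boot banner, so it can be read back
from a running kernel via /proc/version. A small sketch for pulling it out,
assuming the "gcc version X.Y.Z" wording that kernels of this vintage print
(other kernels may format the banner differently); the function name is
hypothetical.

```python
import re
from typing import Optional


def compiler_version(banner: str) -> Optional[str]:
    """Extract the gcc version from a kernel banner, or None if not found."""
    m = re.search(r"gcc version (\S+)", banner)
    return m.group(1) if m else None


def running_kernel_compiler() -> Optional[str]:
    """Read the banner of the running kernel from /proc/version (Linux only)."""
    with open("/proc/version") as f:
        return compiler_version(f.read())
```

Running running_kernel_compiler() on each board, booted with its own kernel,
would show whether the two builds came from the same gcc release.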
Thomas
--
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox
2013-09-18 16:35 ` Thomas Petazzoni
@ 2013-09-18 16:49 ` Willy Tarreau
0 siblings, 0 replies; 21+ messages in thread
From: Willy Tarreau @ 2013-09-18 16:49 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Sep 18, 2013 at 06:35:49PM +0200, Thomas Petazzoni wrote:
> Dear Ethan Tuttle,
>
> On Tue, 17 Sep 2013 23:30:56 -0700, Ethan Tuttle wrote:
> > On Mon, Sep 16, 2013 at 11:01 PM, Willy Tarreau <w@1wt.eu> wrote:
> > > Next step should be that you test both kernels to be sure.
> >
> > Thanks for the kernel images, Willy. I'm still experimenting but
> > initial results are strange: I haven't seen a crash from the -ethan
> > image you provided, nor by a kernel with that config that I built
> > myself. The config is only different from my crashing config by a few
> > options. So perhaps some combination of options prevents the crash.
> > I'll see if I can narrow it down.
>
> A toolchain generating some crappy code maybe? Ethan, Willy, comparing
> your toolchain (compiler version, origin of the toolchain) could be
> interesting.
I thought about this but it looks unlikely; I don't see why the toolchain
would produce random bitflips.
My toolchain is a linaro 4.7 gcc into which I have added support for a
"pj4b" CPU target which is essentially the same as cortex-a9 plus support
for the IDIV instruction in thumb mode.
But I can send it to Ethan if that helps.
Willy
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2013-09-18 16:49 UTC | newest]
Thread overview: 21+ messages
2013-09-15 1:05 mvneta: oops in __rcu_read_lock on mirabox Ethan Tuttle
2013-09-15 18:57 ` Thomas Petazzoni
2013-09-16 6:50 ` Willy Tarreau
2013-09-16 8:56 ` Ethan Tuttle
2013-09-16 15:51 ` Thomas Petazzoni
2013-09-16 16:22 ` Russell King - ARM Linux
2013-09-16 16:24 ` Thomas Petazzoni
2013-09-16 17:14 ` Russell King - ARM Linux
2013-09-16 17:45 ` Willy Tarreau
2013-09-16 18:25 ` Russell King - ARM Linux
2013-09-16 16:35 ` Ethan Tuttle
2013-09-16 16:39 ` Willy Tarreau
2013-09-16 16:44 ` Willy Tarreau
2013-09-16 17:24 ` Ethan Tuttle
2013-09-16 17:47 ` Willy Tarreau
2013-09-16 18:28 ` Russell King - ARM Linux
2013-09-17 3:43 ` Ethan Tuttle
2013-09-17 6:01 ` Willy Tarreau
2013-09-18 6:30 ` Ethan Tuttle
2013-09-18 16:35 ` Thomas Petazzoni
2013-09-18 16:49 ` Willy Tarreau