* mvneta: oops in __rcu_read_lock on mirabox @ 2013-09-15 1:05 Ethan Tuttle 2013-09-15 18:57 ` Thomas Petazzoni 0 siblings, 1 reply; 21+ messages in thread From: Ethan Tuttle @ 2013-09-15 1:05 UTC (permalink / raw) To: linux-arm-kernel When I upgraded my mirabox from 3.11-rc4 to 3.11, I started seeing oopses while receiving network traffic (see below). Sending a flood ping will trigger the oops within a few minutes. The stack looks similar, but not identical to, the one reported earlier by Jochen De Smet[1]. In my case the PC is always __rcu_read_lock. A git bisect found a878764 "Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net" to be the first bad commit... interesting, because neither of the merge parents produce the oops. I rebased the net changes onto the other merge parent and bisected that series, which identified 702821f "net: revert 8728c544a9c ("net: dev_pick_tx() fix")" as the first bad commit. Indeed, reverting 702821f from 3.11 produces a kernel which stands up to a ping flood for hours. Each of the times I reproduced this, it was identified as "Unhandled prefetch abort: unknown 25 (0x409) at 0xc0036ea0", except once when I got "unknown 16 (0x400)". I'm assuming this is an mvneta bug that was exposed by 702821f. That's just a guess, and I don't have the skills to debug this any further. In any case, I figured the maintainers would want to know about it. 
Thanks much, Ethan [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2013-September/196332.html Unhandled prefetch abort: unknown 25 (0x409) at 0xc0036ea0 Internal error: : 409 [#1] PREEMPT SMP ARM Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.11.0-ARCH-00005-gecca798 #31 task: c074b140 ti: c0740000 task.ti: c0740000 PC is at __rcu_read_lock+0x1c/0x20 LR is at __netif_receive_skb_core+0x80/0x6fc pc : [<c0036ea0>] lr : [<c04528d4>] psr: 60000113 sp : c0741de8 ip : 5232ad87 fp : ef181800 r10: c073ede4 r9 : c07494b8 r8 : ef181800 r7 : 00000000 r6 : 00000001 r5 : ee972b40 r4 : ee972b40 r3 : c074b140 r2 : 00000001 r1 : 00000042 r0 : 0000ffff Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel Control: 10c5387d Table: 2e8cc019 DAC: 00000015 Process swapper/0 (pid: 0, stack limit = 0xc0740240) Stack: (0xc0741de8 to 0xc0742000) 1de0: 00000000 c0741e28 ee972b40 ee972b40 ef300c00 00000067 1e00: f014f000 ee972b40 ee972b40 ee972b40 ef300c00 00000067 f014f000 00000000 1e20: ef181800 c0455a70 b685321e 13236156 ee972b40 ee972b40 ef300c00 00000067 1e40: f014f000 00000003 ee972b40 c04562a8 00000000 ef181c80 f014fce0 c03bd688 1e60: 00000000 00000000 ef181ccc 00000001 00000001 00000001 c077e190 00000040 1e80: 00000100 00000000 ef181800 ef181c80 ef300c00 00000000 ef181ccc c03bd860 1ea0: 00000001 c076ebf8 c03bd7b0 ef181ccc c1363dc0 00000001 0000012c c1363dc8 1ec0: 00000040 c077d773 c07420c0 c0456018 c074208c 000044fe 0000000c 00000001 1ee0: c0742090 c0740000 0000000a 3f8bdf7c 00000000 00200000 00000101 c0024368 1f00: c077e190 c001bc60 00000000 0000000c 00000003 000044fd c02f91ac 00000017 1f20: 00000000 c0741f78 000003ff c07484c0 561f5811 00000000 00000000 c0024768 1f40: 00000017 c000e6c0 f0002870 c00579e4 c07a4440 c000851c c00579e4 60000013 1f60: ffffffff c0741fac c135f4c0 561f5811 00000000 c00118c0 c13629b0 00000000 1f80: 00000000 00000000 c0740000 c077dd00 c055cc88 c0735450 c135f4c0 561f5811 1fa0: 00000000 00000000 00000002 c0741fc0 c000e980 
c00579e4 60000013 ffffffff 1fc0: c0740000 c0711a38 ffffffff ffffffff c0711544 00000000 00000000 c0735450 1fe0: 10c5387d c07484fc c073544c c074c290 00004059 00008074 00000000 00000000 [<c0036ea0>] (__rcu_read_lock+0x1c/0x20) from [<c04528d4>] (__netif_receive_skb_core+0x80/0x6fc) [<c04528d4>] (__netif_receive_skb_core+0x80/0x6fc) from [<c0455a70>] (netif_receive_skb+0x60/0xb8) [<c0455a70>] (netif_receive_skb+0x60/0xb8) from [<c04562a8>] (napi_gro_receive+0x48/0x98) [<c04562a8>] (napi_gro_receive+0x48/0x98) from [<c03bd688>] (mvneta_rx+0x244/0x36c) [<c03bd688>] (mvneta_rx+0x244/0x36c) from [<c03bd860>] (mvneta_poll+0xb0/0x15c) [<c03bd860>] (mvneta_poll+0xb0/0x15c) from [<c0456018>] (net_rx_action+0x70/0x170) [<c0456018>] (net_rx_action+0x70/0x170) from [<c0024368>] (__do_softirq+0xd4/0x1c8) [<c0024368>] (__do_softirq+0xd4/0x1c8) from [<c0024768>] (irq_exit+0x74/0x88) [<c0024768>] (irq_exit+0x74/0x88) from [<c000e6c0>] (handle_IRQ+0x68/0x8c) [<c000e6c0>] (handle_IRQ+0x68/0x8c) from [<c000851c>] (armada_370_xp_handle_irq+0x44/0xa4) [<c000851c>] (armada_370_xp_handle_irq+0x44/0xa4) from [<c00118c0>] (__irq_svc+0x40/0x70) Exception stack(0xc0741f78 to 0xc0741fc0) 1f60: c13629b0 00000000 1f80: 00000000 00000000 c0740000 c077dd00 c055cc88 c0735450 c135f4c0 561f5811 1fa0: 00000000 00000000 00000002 c0741fc0 c000e980 c00579e4 60000013 ffffffff [<c00118c0>] (__irq_svc+0x40/0x70) from [<c00579e4>] (cpu_startup_entry+0xb0/0x114) [<c00579e4>] (cpu_startup_entry+0xb0/0x114) from [<c0711a38>] (start_kernel+0x2c8/0x324) Code: e593300c e59321b4 e2822001 e58321b4 (e12fff1e) ---[ end trace 8f21018165664a9e ]--- Kernel panic - not syncing: Fatal exception in interrupt ^ permalink raw reply [flat|nested] 21+ messages in thread
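For readers following the oops above: the "Code:" line lists the instruction words around the faulting PC, with the word at the PC itself in parentheses. The following is a minimal Python sketch that decodes only the three ARM encodings appearing on that line (LDR/STR with an immediate offset, ADD with an immediate, and BX). It is not a general disassembler, and the names `CODE_LINE`, `reg` and `decode` are made up for this illustration.

```python
CODE_LINE = "e593300c e59321b4 e2822001 e58321b4 (e12fff1e)"  # copied from the oops

def reg(n: int) -> str:
    """Conventional names for the special registers, rN otherwise."""
    return {13: "sp", 14: "lr", 15: "pc"}.get(n, f"r{n}")

def decode(word: int) -> str:
    """Decode only the encodings that occur in CODE_LINE (illustrative subset)."""
    if (word & 0x0FFFFFF0) == 0x012FFF10:            # BX Rm
        return f"bx {reg(word & 0xF)}"
    rn, rd = (word >> 16) & 0xF, (word >> 12) & 0xF
    if (word & 0x0FE00000) == 0x02800000:            # ADD Rd, Rn, #imm
        rot = ((word >> 8) & 0xF) * 2
        imm8 = word & 0xFF
        imm = ((imm8 >> rot) | (imm8 << (32 - rot))) & 0xFFFFFFFF
        return f"add {reg(rd)}, {reg(rn)}, #{imm}"
    if (word & 0x0E000000) == 0x04000000:            # LDR/STR, immediate offset
        p, u = (word >> 24) & 1, (word >> 23) & 1
        b, w, load = (word >> 22) & 1, (word >> 21) & 1, (word >> 20) & 1
        if p == 1 and w == 0 and b == 0:             # plain word access, no writeback
            offset = (word & 0xFFF) * (1 if u else -1)
            return f"{'ldr' if load else 'str'} {reg(rd)}, [{reg(rn)}, #{offset}]"
    return "(not covered by this sketch)"

for token in CODE_LINE.split():
    at_pc = token.startswith("(")
    word = int(token.strip("()"), 16)
    print(f"{word:08x}  {decode(word)}{'   <-- PC' if at_pc else ''}")
```

Read this way, the words before the PC load a pointer, increment a word at offset 436 from it, and store it back, and the word at the PC is a plain `bx lr`. That is consistent with the register dump (r2 is 1 and r3 equals the task pointer), so the abort appears to hit the fetch of the return instruction itself rather than any data access.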
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-15 1:05 mvneta: oops in __rcu_read_lock on mirabox Ethan Tuttle @ 2013-09-15 18:57 ` Thomas Petazzoni 2013-09-16 6:50 ` Willy Tarreau 0 siblings, 1 reply; 21+ messages in thread From: Thomas Petazzoni @ 2013-09-15 18:57 UTC (permalink / raw) To: linux-arm-kernel Hello Ethan, On Sat, 14 Sep 2013 18:05:32 -0700, Ethan Tuttle wrote: > When I upgraded my mirabox from 3.11-rc4 to 3.11, I started seeing > oopses while receiving network traffic (see below). Sending a flood > ping will trigger the oops within a few minutes. > > The stack looks similar, but not identical to, the one reported > earlier by Jochen De Smet[1]. In my case the PC is always > __rcu_read_lock. > > A git bisect found a878764 "Merge > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net" to be the > first bad commit... interesting, because neither of the merge parents > produce the oops. I rebased the net changes onto the other merge > parent and bisected that series, which identified 702821f "net: revert > 8728c544a9c ("net: dev_pick_tx() fix")" as the first bad commit. > Indeed, reverting 702821f from 3.11 produces a kernel which stands up > to a ping flood for hours. > > Each of the times I reproduced this, it was identified as "Unhandled > prefetch abort: unknown 25 (0x409) at 0xc0036ea0", except once when I > got "unknown 16 (0x400)". > > I'm assuming this is an mvneta bug that was exposed by 702821f. > That's just a guess, and I don't have the skills to debug this any > further. In any case, I figured the maintainers would want to know > about it. Thanks a lot for the report and the detailed investigation. Unfortunately, I don't have Armada 370 hardware with me this week, so I'm unable to test and reproduce the issue. 
However, I've added a bunch of Armada 370 people/maintainers in Cc, hopefully they can at least try to reproduce and confirm that reverting this patch makes the problem go away, which would confirm that we should look for a bug in the mvneta driver around this problem. Thanks! Thomas -- Thomas Petazzoni, Free Electrons Kernel, drivers, real-time and embedded Linux development, consulting, training and support. http://free-electrons.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-15 18:57 ` Thomas Petazzoni @ 2013-09-16 6:50 ` Willy Tarreau 2013-09-16 8:56 ` Ethan Tuttle 2013-09-16 15:51 ` Thomas Petazzoni 0 siblings, 2 replies; 21+ messages in thread From: Willy Tarreau @ 2013-09-16 6:50 UTC (permalink / raw) To: linux-arm-kernel Hi Thomas, On Sun, Sep 15, 2013 at 08:57:01PM +0200, Thomas Petazzoni wrote: > Hello Ethan, > > On Sat, 14 Sep 2013 18:05:32 -0700, Ethan Tuttle wrote: > > When I upgraded my mirabox from 3.11-rc4 to 3.11, I started seeing > > oopses while receiving network traffic (see below). Sending a flood > > ping will trigger the oops within a few minutes. > > > > The stack looks similar, but not identical to, the one reported > > earlier by Jochen De Smet[1]. In my case the PC is always > > __rcu_read_lock. > > > > A git bisect found a878764 "Merge > > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net" to be the > > first bad commit... interesting, because neither of the merge parents > > produce the oops. I rebased the net changes onto the other merge > > parent and bisected that series, which identified 702821f "net: revert > > 8728c544a9c ("net: dev_pick_tx() fix")" as the first bad commit. > > Indeed, reverting 702821f from 3.11 produces a kernel which stands up > > to a ping flood for hours. > > > > Each of the times I reproduced this, it was identified as "Unhandled > > prefetch abort: unknown 25 (0x409) at 0xc0036ea0", except once when I > > got "unknown 16 (0x400)". > > > > I'm assuming this is an mvneta bug that was exposed by 702821f. > > That's just a guess, and I don't have the skills to debug this any > > further. In any case, I figured the maintainers would want to know > > about it. > > Thanks a lot for the report and the detailed investigation. > Unfortunately, I don't have Armada 370 hardware with me this week, so > I'm unable to test and reproduce the issue. 
> > However, I've added a bunch of Armada 370 people/maintainers in Cc, > hopefully they can at least try to reproduce and confirm that reverting > this patch makes the problem go away, which would confirm that we > should look for a bug in the mvneta driver around this problem. I'm currently testing on 3.11.1 (which I had here) and am not getting any issue after 50M packets. My kernel is running in thumb mode and without SMP. Ethan, we'll need your config I guess. Thanks, Willy ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 6:50 ` Willy Tarreau @ 2013-09-16 8:56 ` Ethan Tuttle 2013-09-16 15:51 ` Thomas Petazzoni 1 sibling, 0 replies; 21+ messages in thread From: Ethan Tuttle @ 2013-09-16 8:56 UTC (permalink / raw) To: linux-arm-kernel Hi guys. Here's the config I was building with: https://gist.github.com/anonymous/6578139 It's based on the one I found in archlinuxarm's git repo. I didn't change any of the options - at least, not manually. Thanks for the follow up! Ethan On Sun, Sep 15, 2013 at 11:50 PM, Willy Tarreau <w@1wt.eu> wrote: > Hi Thomas, > > On Sun, Sep 15, 2013 at 08:57:01PM +0200, Thomas Petazzoni wrote: >> Hello Ethan, >> >> On Sat, 14 Sep 2013 18:05:32 -0700, Ethan Tuttle wrote: >> > When I upgraded my mirabox from 3.11-rc4 to 3.11, I started seeing >> > oopses while receiving network traffic (see below). Sending a flood >> > ping will trigger the oops within a few minutes. >> > >> > The stack looks similar, but not identical to, the one reported >> > earlier by Jochen De Smet[1]. In my case the PC is always >> > __rcu_read_lock. >> > >> > A git bisect found a878764 "Merge >> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net" to be the >> > first bad commit... interesting, because neither of the merge parents >> > produce the oops. I rebased the net changes onto the other merge >> > parent and bisected that series, which identified 702821f "net: revert >> > 8728c544a9c ("net: dev_pick_tx() fix")" as the first bad commit. >> > Indeed, reverting 702821f from 3.11 produces a kernel which stands up >> > to a ping flood for hours. >> > >> > Each of the times I reproduced this, it was identified as "Unhandled >> > prefetch abort: unknown 25 (0x409) at 0xc0036ea0", except once when I >> > got "unknown 16 (0x400)". >> > >> > I'm assuming this is an mvneta bug that was exposed by 702821f. >> > That's just a guess, and I don't have the skills to debug this any >> > further. 
In any case, I figured the maintainers would want to know >> > about it. >> >> Thanks a lot for the report and the detailed investigation. >> Unfortunately, I don't have Armada 370 hardware with me this week, so >> I'm unable to test and reproduce the issue. >> >> However, I've added a bunch of Armada 370 people/maintainers in Cc, >> hopefully they can at least try to reproduce and confirm that reverting >> this patch makes the problem go away, which would confirm that we >> should look for a bug in the mvneta driver around this problem. > > I'm currently testing on 3.11.1 (which I had here) and am not getting > any issue after 50M packets. My kernel is running in thumb mode and > without SMP. > > Ethan, we'll need your config I guess. > > Thanks, > Willy > ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 6:50 ` Willy Tarreau 2013-09-16 8:56 ` Ethan Tuttle @ 2013-09-16 15:51 ` Thomas Petazzoni 2013-09-16 16:22 ` Russell King - ARM Linux 2013-09-16 16:35 ` Ethan Tuttle 1 sibling, 2 replies; 21+ messages in thread From: Thomas Petazzoni @ 2013-09-16 15:51 UTC (permalink / raw) To: linux-arm-kernel Willy, Ethan, On Mon, 16 Sep 2013 08:50:47 +0200, Willy Tarreau wrote: > I'm currently testing on 3.11.1 (which I had here) and am not getting > any issue after 50M packets. My kernel is running in thumb mode and > without SMP. > > Ethan, we'll need your config I guess. Can both of you also report the U-Boot version you're using, and the SoC revision (it's visible in the U-Boot output). Maybe Globalscale is shipping Mirabox with a different version of the bootloader, or some hardware difference, that is causing problems? (I'm just speculating here, but another user already reported having issues with his Mirabox, and Russell King analyzed the oops as very likely being hardware problems). Thomas -- Thomas Petazzoni, Free Electrons Kernel, drivers, real-time and embedded Linux development, consulting, training and support. http://free-electrons.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 15:51 ` Thomas Petazzoni @ 2013-09-16 16:22 ` Russell King - ARM Linux 2013-09-16 16:24 ` Thomas Petazzoni 2013-09-16 16:35 ` Ethan Tuttle 1 sibling, 1 reply; 21+ messages in thread From: Russell King - ARM Linux @ 2013-09-16 16:22 UTC (permalink / raw) To: linux-arm-kernel On Mon, Sep 16, 2013 at 05:51:52PM +0200, Thomas Petazzoni wrote: > Willy, Ethan, > > On Mon, 16 Sep 2013 08:50:47 +0200, Willy Tarreau wrote: > > > I'm currently testing on 3.11.1 (which I had here) and am not getting > > any issue after 50M packets. My kernel is running in thumb mode and > > without SMP. > > > > Ethan, we'll need your config I guess. > > Can both of you also report the U-Boot version you're using, and the > SoC revision (it's visible in the U-Boot output). Maybe Globalscale is > shipping Mirabox with a different version of the bootloader, or some > hardware difference, that is causing problems? (I'm just speculating > here, but another user already reported having issues with his Mirabox, > and Russell King analyzed the oops as very likely being hardware > problems). One seemed to be a single bit error in an instruction inside the kernel image. The other was what seems to be an impossible abort. I still don't see how we could end up with a prefetch abort inside memset() due to the kernel domain being inaccessible, but still be able to get an oops out, especially when we dump out the memory for the faulting instruction by accessing that memory via that apparently inaccessible domain while running the code which dumps that memory also under this apparently inaccessible domain. If the domain containing the kernel really was inaccessible, the system would be completely dead. 
The only possibilities I can come up with for that is that abort was caused by something spurious happening at the hardware level causing corruption of the instruction TLB (corrupting the domain index stored in the I-TLB) or other CPU control hardware causing it to spuriously generate that fault. As the domain field in the page table L1 entries covers bit 8, and the single bit error with the instruction was also bit 8, maybe there's a design weakness on data line bit 8 causing marginal operation. To add to this, the abort given in this report gives an IFSR value of 0x409, which equates to "Synchronous parity error on memory access" in ARMv7. The other value (0x400) equates to "TLB conflict abort" which can only happen with LPAE support enabled... So this is just getting more weird! ^ permalink raw reply [flat|nested] 21+ messages in thread
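As a worked check of the two fault codes discussed above, here is a minimal Python sketch of how the status field is extracted from the IFSR in the ARMv7 short-descriptor format: FS[3:0] sits in bits [3:0] and FS[4] in bit 10. The name table holds only the two meanings quoted in this thread; the function names are made up for illustration.

```python
# Only the two encodings named in this thread are listed here.
FS_NAMES = {
    0b11001: "Synchronous parity error on memory access",
    0b10000: "TLB conflict abort",
}

def fault_status(ifsr: int) -> int:
    """Combine FS[4] (IFSR bit 10) with FS[3:0] (IFSR bits 3:0)."""
    return (((ifsr >> 10) & 1) << 4) | (ifsr & 0xF)

def describe(ifsr: int):
    """Return the numeric status and its name, if this sketch knows it."""
    fs = fault_status(ifsr)
    return fs, FS_NAMES.get(fs, "not listed in this sketch")

for value in (0x409, 0x400):
    fs, name = describe(value)
    print(f"IFSR {value:#05x}: FS = {fs} ({fs:#07b}) -> {name}")
```

The numeric results, 25 and 16, match the "unknown 25 (0x409)" and "unknown 16 (0x400)" strings in the original report. Whether the Sheeva core follows the architectural meaning of these codes is exactly the open question raised later in the thread.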
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 16:22 ` Russell King - ARM Linux @ 2013-09-16 16:24 ` Thomas Petazzoni 2013-09-16 17:14 ` Russell King - ARM Linux 0 siblings, 1 reply; 21+ messages in thread From: Thomas Petazzoni @ 2013-09-16 16:24 UTC (permalink / raw) To: linux-arm-kernel Russell, On Mon, 16 Sep 2013 17:22:09 +0100, Russell King - ARM Linux wrote: > One seemed to be a single bit error in an instruction inside the kernel > image. The other was what seems to be an impossible abort. > > I still don't see how we could end up with a prefetch abort inside memset() > due to the kernel domain being inaccessible, but still be able to get > an oops out, especially when we dump out the memory for the faulting > instruction by accessing that memory via that apparently inaccessible > domain while running the code which dumps that memory also under this > apparently inaccessible domain. If the domain containing the kernel > really was inaccessible, the system would be completely dead. > > The only possibilities I can come up with for that is that abort was > caused by something spurious happening at the hardware level causing > corruption of the instruction TLB (corrupting the domain index stored > in the I-TLB) or other CPU control hardware causing it to spuriously > generate that fault. > > As the domain field in the page table L1 entries covers bit 8, and the > single bit error with the instruction was also bit 8, maybe there's a > design weakness on data line bit 8 causing marginal operation. > > To add to this, the abort given in this report gives an IFSR value of > 0x409, which equates to "Synchronous parity error on memory access" > in ARMv7. The other value (0x400) equates to "TLB conflict abort" > which can only happen with LPAE support enabled... So this is just > getting more weird! Could this be caused by bitflips in the RAM due to bad timings, or overheating, or that kind of thing? 
Thomas -- Thomas Petazzoni, Free Electrons Kernel, drivers, real-time and embedded Linux development, consulting, training and support. http://free-electrons.com ^ permalink raw reply [flat|nested] 21+ messages in thread
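To make the "bit 8" conjecture quoted above concrete, here is a minimal Python sketch of what a single flip of bit 8 would do to the domain field of a first-level page table entry. In the ARMv7 short-descriptor format the domain index occupies bits [8:5], and each domain has a two-bit access field in the DAC register. The DAC value is the one printed in the oops in this thread; the descriptor value is hypothetical, and the idea that the entry normally carries domain 0 is an assumption not stated anywhere in the thread.

```python
DAC = 0x00000015  # "DAC: 00000015" as printed in the oops in this thread

ACCESS = {0b00: "no access", 0b01: "client", 0b10: "reserved", 0b11: "manager"}

def domain_of(l1_descriptor: int) -> int:
    """Domain index: bits [8:5] of a short-descriptor first-level entry."""
    return (l1_descriptor >> 5) & 0xF

def dac_access(dac: int, domain: int) -> str:
    """Access type the DAC grants to the given domain (two bits per domain)."""
    return ACCESS[(dac >> (2 * domain)) & 0b11]

def flip_bit(word: int, bit: int) -> int:
    """Model a single-bit corruption."""
    return word ^ (1 << bit)

good = 0x0000040E          # hypothetical entry; only bits [8:5] matter (domain 0 assumed)
bad = flip_bit(good, 8)    # the same entry with bit 8 corrupted

for label, desc in (("intact", good), ("bit 8 flipped", bad)):
    d = domain_of(desc)
    print(f"{label}: domain {d} -> {dac_access(DAC, d)}")
```

Under these assumptions the flipped entry lands in domain 8, for which this DAC value grants no access, so a lookup through it would raise a domain fault even though the kernel's real domain is still accessible. This only illustrates the mechanism being speculated about; it does not show that this is what happened on the board.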
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 16:24 ` Thomas Petazzoni @ 2013-09-16 17:14 ` Russell King - ARM Linux 2013-09-16 17:45 ` Willy Tarreau 0 siblings, 1 reply; 21+ messages in thread From: Russell King - ARM Linux @ 2013-09-16 17:14 UTC (permalink / raw) To: linux-arm-kernel On Mon, Sep 16, 2013 at 06:24:50PM +0200, Thomas Petazzoni wrote: > Could this be caused by bitflips in the RAM due to bad timings, or > overheating or that kind of things? Well, the SoC is an Armada 370, which uses Marvell's own Sheeva core. From what I understand, this is a CPU designed entirely by Marvell, so the interpretation of these codes may not be correct. This is made harder to diagnose in that Marvell is so secret with their documentation; indeed for this CPU there is no information publicly available (there's only the product briefs). Bad timings could certainly cause bitflips, as could poor routing of data line D8 (eg, incorrect termination or routing causing reflections on the data line - remember that with modern hardware, almost every signal is a transmission line). Marginal or noisy power supplies could also be a problem - for example, if the impedance of the power supply connections is too great, it may work with some patterns of use but not others. There's so many possibilities... However, if the fault codes above really do equate to what's in the ARMv7 Architecture Reference Manual, I think we can rule out the routing and RAM chips - because a cache parity error points to bit flips in the cache, or if there is no cache parity checking implemented, it means something is corrupting the state of the SoC - which could be due to bad power supplies. How do we get to the bottom of this? That's a very good question - one which is going to be very difficult to solve. 
Ideally, it means working with the manufacturer's design team to try and work out what's going on at the board level, probably using logic analysers to capture the bus activity leading up to the failure. Also, checking the power supplies at the SoC too - checking that they're within correct tolerance and checking the amount of noise on them. I think all we can do at the moment is to wait for further reports to roll in and see whether a better pattern emerges. If you want to try something - and you suspect it may be heat related, you could try putting the board inside a container, monitor the temperature inside the container, and put it in your freezer! Just be careful of the temperature of the other devices on the board getting too cold though - remember, most consumer electronics is only rated for an *operating* temperature range of 0°C to 70°C and your freezer will be something like -20°C - so don't let the ambient temperature inside the container go below 0°C! If the CPU is producing lots of heat though, it may keep the container sufficiently warm that that's not a problem. The theory is that by making the ambient 15 to 20°C cooler, you will also lower the temperature of the hotter parts by a similar amount. ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 17:14 ` Russell King - ARM Linux @ 2013-09-16 17:45 ` Willy Tarreau 2013-09-16 18:25 ` Russell King - ARM Linux 0 siblings, 1 reply; 21+ messages in thread From: Willy Tarreau @ 2013-09-16 17:45 UTC (permalink / raw) To: linux-arm-kernel On Mon, Sep 16, 2013 at 06:14:16PM +0100, Russell King - ARM Linux wrote: > On Mon, Sep 16, 2013 at 06:24:50PM +0200, Thomas Petazzoni wrote: > > Could this be caused by bitflips in the RAM due to bad timings, or > > overheating or that kind of things? > > Well, the SoC is an Armada 370, which uses Marvell's own Sheeva core. > From what I understand, this is a CPU designed entirely by Marvell, so > the interpretation of these codes may not be correct. This is made > harder to diagnose in that Marvell is so secret with their > documentation; indeed for this CPU there is no information publicly > available (there's only the product briefs). Yes and their salesmen never respond after many attempts in more than one year now. Looks like they want to keep their chips for themselves only :-( > Bad timings could certainly cause bitflips, as could poor routing of > data line D8 (eg, incorrect termination or routing causing reflections > on the data line - remember that with modern hardware, almost every > signal is a transmission line). This board has a really clean routing and placement, chips are very close. That does not rule out the possibility of a lacking termination, but it would probably affect more users. > Marginal or noisy power supplies could also be a problem - for example, > if the impedance of the power supply connections is too great, it may > work with some patterns of use but not others. We have some margin here, I measured less than 1 Amp to boot and something like 6-700 mA in idle if my memory serves me correctly. The 3A PSU and its thicker-than-average wires seem safe. 
I think that Globalscale learned a lot from the horrible Guruplug design that all this part needs to be done correctly and they did a very clean job this time. > There's so many possibilities... Including faulty components. I'm not aware of an equivalent of cpuburn for ARM, it would probably help, though it's probably harder to design in a generic way than on x86 where all systems are the same. > However, if the fault codes above really do equate to what's in the ARMv7 > Architecture Reference Manual, I think we can rule out the routing and > RAM chips - because a cache parity error points to bit flips in the cache, > or if there is no cache parity checking implemented, it means something > is corrupting the state of the SoC - which could be due to bad power > supplies. > > How do we get to the bottom of this? That's a very good question - one > which is going to be very difficult to solve. Ideally, it means working > with the manufacturer's design team to try and work out what's going on > at the board level, probably using logic analysers to capture the bus > activity leading up to the failure. Also, checking the power supplies > at the SoC too - checking that they're within correct tolerance and > checking the amount of noise on them. > > I think all we can do at the moment is to wait for further reports to roll > in and see whether a better pattern emerges. Especially since there are also some heavy testers who don't seem to be impacted :-/ > If you want to try something - and you suspect it may be heat related, > you could try putting the board inside a container, monitor the temperature > inside the container, and put it in your freezer! Just be careful of the > temperature of the other devices on the board getting too cold though - > remember, most consumer electronics is only rated for an *operating* > temperature range of 0°C to 70°C and your freezer will be something like > -20°C - so don't let the ambient temperature inside the container go > below 0°C! 
If the CPU is producing lots of heat though, it may keep the > container sufficiently warm that that's not a problem. The theory is > that by making the ambient 15 to 20°C cooler, you will also lower the > temperature of the hotter parts by a similar amount. Sometimes you can also do the opposite, heat it gently with a hair dryer while working to see if problems happen more frequently. It's often easier to do than working in a cold place as you don't have issues with the wires, and it does not accumulate moisture. I've detected some early failures this way; the NAND in my Iomega Iconnect is extremely sensitive to heating to the point that I had to stick a heat sink on it and take the board out of its case to avoid hangs. The hair dryer quickly revealed the culprit in a few minutes when it took weeks to get a failure before. Willy ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 17:45 ` Willy Tarreau @ 2013-09-16 18:25 ` Russell King - ARM Linux 0 siblings, 0 replies; 21+ messages in thread From: Russell King - ARM Linux @ 2013-09-16 18:25 UTC (permalink / raw) To: linux-arm-kernel On Mon, Sep 16, 2013 at 07:45:14PM +0200, Willy Tarreau wrote: > This board has a really clean routing and placement, chips are very close. > That does not rule out the possibility of a lacking termination, but it > would probably affect more users. True - though in your photographs, we can't see the tracking for the data bus, because that's all buried in the inner board layers. However, there is some evidence in there of trace lengths being matched which is a good sign. :) > On Mon, Sep 16, 2013 at 06:14:16PM +0100, Russell King - ARM Linux wrote: > > Marginal or noisy power supplies could also be a problem - for example, > > if the impedance of the power supply connections is too great, it may > > work with some patterns of use but not others. > > We have some margin here, I measured less than 1 Amp to boot and something > like 6-700 mA in idle if my memory serves me correctly. The 3A PSU and its > thicker-than-average wires seem safe. I think that Globalscale learned a > lot from the horrible Guruplug design that all this part needs to be done > correctly and they did a very clean job this time. Not quite the power supply I was referring to - I'm talking about the on-board regulators which supply the 3.3V and other lower voltages to the SDRAM and SoC - and the quality of their decoupling. The on-board regulators will have a certain degree of "line" noise immunity. If I had to guess, I'd say C366 is probably the output bulk capacitor on the CPU core supply (which comes via BIT7, C273, L1, U19 being the switching regulator chip. 
I'd also guess one of C370, C396 or C398 supplies the SDRAM - and of those C370 is the most likely - the resistors in the boxes marked K and B, and R123 I suspect may be the SDRAM data bus termination (that covers R107 to R136), though I only count a total of 30 of those connecting to U5 pin 4 - and that point looks _well_ decoupled with lots of capacitors (C8-C16, C287 on one side, C7, C246, C247 on the other.) The other two? Maybe R105/106 which are on the underside of the CPU, though they're a long way from that well decoupled point. R137/138? They're up by the NAND chip and connect to ground. Though... if one of those is for D8... Anyway, that's all speculation. ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 15:51 ` Thomas Petazzoni 2013-09-16 16:22 ` Russell King - ARM Linux @ 2013-09-16 16:35 ` Ethan Tuttle 2013-09-16 16:39 ` Willy Tarreau 1 sibling, 1 reply; 21+ messages in thread From: Ethan Tuttle @ 2013-09-16 16:35 UTC (permalink / raw) To: linux-arm-kernel On Mon, Sep 16, 2013 at 8:51 AM, Thomas Petazzoni <thomas.petazzoni@free-electrons.com> wrote: > Willy, Ethan, > > On Mon, 16 Sep 2013 08:50:47 +0200, Willy Tarreau wrote: > >> I'm currently testing on 3.11.1 (which I had here) and am not getting >> any issue after 50M packets. My kernel is running in thumb mode and >> without SMP. >> >> Ethan, we'll need your config I guess. > > Can both of you also report the U-Boot version you're using, and the > SoC revision (it's visible in the U-Boot output). Mine says: U-Boot 2009.08 (Sep 16 2012 - 22:50:06)Marvell version: 1.1.2 NQ SoC: MV6710 A1 > Maybe Globalscale is > shipping Mirabox with a different version of the bootloader, or some > hardware difference, that is causing problems? (I'm just speculating > here, but another user already reported having issues with his Mirabox, > and Russell King analyzed the oops as very likely being hardware > problems). > > Thomas > -- > Thomas Petazzoni, Free Electrons > Kernel, drivers, real-time and embedded Linux > development, consulting, training and support. > http://free-electrons.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 16:35 ` Ethan Tuttle @ 2013-09-16 16:39 ` Willy Tarreau 2013-09-16 16:44 ` Willy Tarreau 0 siblings, 1 reply; 21+ messages in thread From: Willy Tarreau @ 2013-09-16 16:39 UTC (permalink / raw) To: linux-arm-kernel On Mon, Sep 16, 2013 at 09:35:16AM -0700, Ethan Tuttle wrote: > On Mon, Sep 16, 2013 at 8:51 AM, Thomas Petazzoni > <thomas.petazzoni@free-electrons.com> wrote: > > Willy, Ethan, > > > > On Mon, 16 Sep 2013 08:50:47 +0200, Willy Tarreau wrote: > > > >> I'm currently testing on 3.11.1 (which I had here) and am not getting > >> any issue after 50M packets. My kernel is running in thumb mode and > >> without SMP. > >> > >> Ethan, we'll need your config I guess. > > > > Can both of you also report the U-Boot version you're using, and the > > SoC revision (it's visible in the U-Boot output). > > Mine says: > > U-Boot 2009.08 (Sep 16 2012 - 22:50:06)Marvell version: 1.1.2 NQ > SoC: MV6710 A1 I just checked on my old captures and I have the same here, with more details such as the CPU's revision (Rev 1) : http://1wt.eu/articles/mirabox-vs-guruplug/ Willy ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 16:39 ` Willy Tarreau @ 2013-09-16 16:44 ` Willy Tarreau 2013-09-16 17:24 ` Ethan Tuttle 0 siblings, 1 reply; 21+ messages in thread From: Willy Tarreau @ 2013-09-16 16:44 UTC (permalink / raw) To: linux-arm-kernel On Mon, Sep 16, 2013 at 06:39:37PM +0200, Willy Tarreau wrote: > On Mon, Sep 16, 2013 at 09:35:16AM -0700, Ethan Tuttle wrote: > > On Mon, Sep 16, 2013 at 8:51 AM, Thomas Petazzoni > > <thomas.petazzoni@free-electrons.com> wrote: > > > Willy, Ethan, > > > > > > On Mon, 16 Sep 2013 08:50:47 +0200, Willy Tarreau wrote: > > > > > >> I'm currently testing on 3.11.1 (which I had here) and am not getting > > >> any issue after 50M packets. My kernel is running in thumb mode and > > >> without SMP. > > >> > > >> Ethan, we'll need your config I guess. > > > > > > Can both of you also report the U-Boot version you're using, and the > > > SoC revision (it's visible in the U-Boot output). > > > > Mine says: > > > > U-Boot 2009.08 (Sep 16 2012 - 22:50:06)Marvell version: 1.1.2 NQ > > SoC: MV6710 A1 > > I just checked on my old captures and I have the same here, with more > details such as the CPU's revision (Rev 1) : > > http://1wt.eu/articles/mirabox-vs-guruplug/ BTW Ethan, I don't know if you have already opened your mirabox, but on the link above you'll find settings for trying other frequencies for the CPU. It could be nice to try 1 GHz with L2/DDR @500 instead of 1200/600 to see if the issue remains or not. If it disappears, there's also a working setting with CPU at 1.2G, L2 at 800M and DDR at 400M to help find if CPU, L2 or DDR is the culprit. Willy ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 16:44 ` Willy Tarreau @ 2013-09-16 17:24 ` Ethan Tuttle 2013-09-16 17:47 ` Willy Tarreau 0 siblings, 1 reply; 21+ messages in thread From: Ethan Tuttle @ 2013-09-16 17:24 UTC (permalink / raw) To: linux-arm-kernel On Mon, Sep 16, 2013 at 9:44 AM, Willy Tarreau <w@1wt.eu> wrote: > On Mon, Sep 16, 2013 at 06:39:37PM +0200, Willy Tarreau wrote: >> On Mon, Sep 16, 2013 at 09:35:16AM -0700, Ethan Tuttle wrote: >> > On Mon, Sep 16, 2013 at 8:51 AM, Thomas Petazzoni >> > <thomas.petazzoni@free-electrons.com> wrote: >> > > Willy, Ethan, >> > > >> > > On Mon, 16 Sep 2013 08:50:47 +0200, Willy Tarreau wrote: >> > > >> > >> I'm currently testing on 3.11.1 (which I had here) and am not getting >> > >> any issue after 50M packets. My kernel is running in thumb mode and >> > >> without SMP. >> > >> >> > >> Ethan, we'll need your config I guess. >> > > >> > > Can both of you also report the U-Boot version you're using, and the >> > > SoC revision (it's visible in the U-Boot output). >> > >> > Mine says: >> > >> > U-Boot 2009.08 (Sep 16 2012 - 22:50:06)Marvell version: 1.1.2 NQ >> > SoC: MV6710 A1 >> >> I just checked on my old captures and I have the same here, with more >> details such as the CPU's revision (Rev 1) : >> >> http://1wt.eu/articles/mirabox-vs-guruplug/ > > BTW Ethan, I don't know if you have already opened your mirabox, but on the > link above you'll find settings for trying other frequencies for the CPU. It > could be nice to try 1 GHz with L2/DDR @500 instead of 1200/600 to see if the > issue remains or not. If it disappears, there's also a working setting with > CPU at 1.2G, L2 at 800M and DDR at 400M to help find if CPU, L2 or DDR is the culprit. > > Willy > I have not opened my mirabox - but sure, I'll open it up and try those other settings when I get a chance. Also, you mentioned that you have SMP disabled in your kernel. It looks like it's on in my .config. Should I run a test with SMP disabled? 
I'm surprised that nobody else sees this crash given how easy it is for me to reproduce. BTW, the 3.11 kernel I made with 702821f reverted has been humming along for days without issue. ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 17:24 ` Ethan Tuttle @ 2013-09-16 17:47 ` Willy Tarreau 2013-09-16 18:28 ` Russell King - ARM Linux 0 siblings, 1 reply; 21+ messages in thread From: Willy Tarreau @ 2013-09-16 17:47 UTC (permalink / raw) To: linux-arm-kernel On Mon, Sep 16, 2013 at 10:24:05AM -0700, Ethan Tuttle wrote: > I have not opened my mirabox - but sure, I'll open it up and try those > other settings when I get a chance. OK > Also, you mentioned that you have SMP disabled in your kernel. It > looks like it's on in my .config. Should I run a test with SMP > disabled? You may want to try but I wouldn't bet on this. > I'm surprised that nobody else sees this crash given how easy it is > for me to reproduce. BTW, the 3.11 kernel I made with 702821f > reverted has been humming along for days without issue. I'll have to rebuild with your config and exact 3.11 to test again. Can you check the packet rate of your ping flood to give an order of magnitude so that we're sure to be in the same conditions ? Willy ^ permalink raw reply [flat|nested] 21+ messages in thread
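For anyone wanting to compare packet rates as asked above, here is a minimal sketch of one way to get an order of magnitude without tcpdump: sample the interface's cumulative receive counter from sysfs twice and divide by the elapsed time. The interface name `eth0` and the helper names are assumptions for illustration, and the counter covers all received packets, not only ICMP.

```python
import time
from pathlib import Path


def read_rx_packets(iface, sysfs_root="/sys/class/net"):
    """Return the cumulative received-packet counter for iface."""
    counter = Path(sysfs_root) / iface / "statistics" / "rx_packets"
    return int(counter.read_text().strip())


def packets_per_second(before, after, seconds):
    """Average packet rate between two counter samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (after - before) / seconds


def sample_rate(iface, seconds=10.0, sysfs_root="/sys/class/net"):
    """Sleep for roughly `seconds` and return the observed receive rate."""
    before = read_rx_packets(iface, sysfs_root)
    start = time.monotonic()
    time.sleep(seconds)
    after = read_rx_packets(iface, sysfs_root)
    return packets_per_second(before, after, time.monotonic() - start)
```

Calling `sample_rate("eth0")` on the receiving box while the flood is running should give a figure that can be compared across machines.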
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 17:47 ` Willy Tarreau @ 2013-09-16 18:28 ` Russell King - ARM Linux 2013-09-17 3:43 ` Ethan Tuttle 0 siblings, 1 reply; 21+ messages in thread From: Russell King - ARM Linux @ 2013-09-16 18:28 UTC (permalink / raw) To: linux-arm-kernel On Mon, Sep 16, 2013 at 07:47:08PM +0200, Willy Tarreau wrote: > I'll have to rebuild with your config and exact 3.11 to test again. > Can you check the packet rate of your ping flood to give an order of > magnitude so that we're sure to be in the same conditions ? Also, try swapping kernel binaries between yourselves, so that you can be sure you're running the exact same kernel on different hardware. ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-16 18:28 ` Russell King - ARM Linux @ 2013-09-17 3:43 ` Ethan Tuttle 2013-09-17 6:01 ` Willy Tarreau 0 siblings, 1 reply; 21+ messages in thread From: Ethan Tuttle @ 2013-09-17 3:43 UTC (permalink / raw) To: linux-arm-kernel I just built 3.11.1 with the posted config and got the usual crash in about 2 minutes with a ping flood. The kernel image is available here: https://www.dropbox.com/s/cqkqop3jjb1stk3/uImage-dtb.armada-370-mirabox The md5 is 05f350a193c6c60d9dac40bea810bbdd. You may notice the version string reveals a patch on top of 3.11.1, this is just a makefile patch to "Build a uImage with dtb already appended". Tcpdump captured about 2,800 icmp packets per second while the ping flood was running. Hope this helps! If Willy wants to share a kernel image I'll see if I can crash it :) Thanks, Ethan On Mon, Sep 16, 2013 at 11:28 AM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Mon, Sep 16, 2013 at 07:47:08PM +0200, Willy Tarreau wrote: >> I'll have to rebuild with your config and exact 3.11 to test again. >> Can you check the packet rate of your ping flood to give an order of >> magnitude so that we're sure to be in the same conditions ? > > Also, try swapping kernel binaries between yourselves, so that you can > be sure you're running the exact same kernel on different hardware. ^ permalink raw reply [flat|nested] 21+ messages in thread
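When swapping kernel binaries like this, it is worth confirming that the downloaded image matches the posted md5 before booting it. A minimal sketch follows; the helper names are hypothetical, and the local file path is whatever the image was saved as.

```python
import hashlib

# md5 quoted in the message above for uImage-dtb.armada-370-mirabox
EXPECTED_MD5 = "05f350a193c6c60d9dac40bea810bbdd"


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_matches(path, expected=EXPECTED_MD5):
    """True if the file's md5 equals the expected digest (case-insensitive)."""
    return file_md5(path) == expected.strip().lower()
```

A mismatch would mean the two sides are not actually testing the same binary, which defeats the point of the exchange.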
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-17 3:43 ` Ethan Tuttle @ 2013-09-17 6:01 ` Willy Tarreau 2013-09-18 6:30 ` Ethan Tuttle 0 siblings, 1 reply; 21+ messages in thread From: Willy Tarreau @ 2013-09-17 6:01 UTC (permalink / raw) To: linux-arm-kernel Hi Ethan, On Mon, Sep 16, 2013 at 08:43:19PM -0700, Ethan Tuttle wrote: > I just built 3.11.1 with the posted config and got the usual crash in > about 2 minutes with a ping flood. > > The kernel image is available here: > > https://www.dropbox.com/s/cqkqop3jjb1stk3/uImage-dtb.armada-370-mirabox OK thank you. Unfortunately I can't boot it here as my only rootfs is a squashfs and it is not enabled in this kernel. > The md5 is 05f350a193c6c60d9dac40bea810bbdd. You may notice the > version string reveals a patch on top of 3.11.1, this is just a > makefile patch to "Build a uImage with dtb already appended". Interesting one, I was not aware of it, I'll probably add it to my trees to stop relying on build scripts. > Tcpdump captured about 2,800 icmp packets per second while the ping > flood was running. OK I've been running mine at this exact rate as well (2803 pps) for 11 minutes now. I disabled icmp_ratelimit to ensure that I got as many responses as requests. No problem so far. > Hope this helps! If Willy wants to share a kernel image I'll see if I > can crash it :) I've put my working images here : http://1wt.eu/ethan-kernel/ One is done with my config, the other one with your config in which I added support for squashfs and blk_dev_ram that I'm using to boot a rootfs loaded in memory by the boot loader. I can't make it fail either. I'm really starting to suspect a hardware issue... Next step should be that you test both kernels to be sure. Cheers, Willy ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-17 6:01 ` Willy Tarreau @ 2013-09-18 6:30 ` Ethan Tuttle 2013-09-18 16:35 ` Thomas Petazzoni 0 siblings, 1 reply; 21+ messages in thread From: Ethan Tuttle @ 2013-09-18 6:30 UTC (permalink / raw) To: linux-arm-kernel On Mon, Sep 16, 2013 at 11:01 PM, Willy Tarreau <w@1wt.eu> wrote: > Next step should be that you test both kernels to be sure. Thanks for the kernel images, Willy. I'm still experimenting but initial results are strange: I haven't seen a crash from the -ethan image you provided, nor from a kernel with that config that I built myself. The config differs from my crashing config by only a few options. So perhaps some combination of options prevents the crash. I'll see if I can narrow it down. For a moment I thought I had found a likely culprit: all along I've been loading my kernel into 0x02000000 in U-Boot, while the stock U-Boot env (and Willy) uses 0x6400000. But I've seen at least one __rcu_read_lock oops since switching to 0x6400000. So I guess I can rule that out. Thanks, Ethan ^ permalink raw reply [flat|nested] 21+ messages in thread
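Narrowing down which options differ between the crashing and non-crashing configs can be done mechanically. The kernel tree ships `scripts/diffconfig` for this; the sketch below shows the same idea in a few lines of Python. The function names are hypothetical, and an option absent from a file is treated as not set, which is an assumption that suits this comparison.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map option name -> value for the contents of a kernel .config."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def diff_configs(a_text, b_text):
    """Return {option: (value_in_a, value_in_b)} for options that differ."""
    a, b = parse_config(a_text), parse_config(b_text)
    return {
        name: (a.get(name, "n"), b.get(name, "n"))
        for name in sorted(set(a) | set(b))
        if a.get(name, "n") != b.get(name, "n")
    }
```

Feeding it the two .config files (read with `open(path).read()`) gives the short list of candidates to toggle one at a time.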
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-18 6:30 ` Ethan Tuttle @ 2013-09-18 16:35 ` Thomas Petazzoni 2013-09-18 16:49 ` Willy Tarreau 0 siblings, 1 reply; 21+ messages in thread From: Thomas Petazzoni @ 2013-09-18 16:35 UTC (permalink / raw) To: linux-arm-kernel Dear Ethan Tuttle, On Tue, 17 Sep 2013 23:30:56 -0700, Ethan Tuttle wrote: > On Mon, Sep 16, 2013 at 11:01 PM, Willy Tarreau <w@1wt.eu> wrote: > > Next step should be that you test both kernels to be sure. > > Thanks for the kernel images, Willy. I'm still experimenting but > initial results are strange: I haven't seen a crash from the -ethan > image you provided, nor by a kernel with that config that I built > myself. The config is only different from my crashing config by a few > options. So perhaps some combination of options prevents the crash. > I'll see if I can narrow it down. A toolchain generating some crappy code maybe? Ethan, Willy, comparing your toolchain (compiler version, origin of the toolchain) could be interesting. Thomas -- Thomas Petazzoni, Free Electrons Kernel, drivers, real-time and embedded Linux development, consulting, training and support. http://free-electrons.com ^ permalink raw reply [flat|nested] 21+ messages in thread
* mvneta: oops in __rcu_read_lock on mirabox 2013-09-18 16:35 ` Thomas Petazzoni @ 2013-09-18 16:49 ` Willy Tarreau 0 siblings, 0 replies; 21+ messages in thread From: Willy Tarreau @ 2013-09-18 16:49 UTC (permalink / raw) To: linux-arm-kernel On Wed, Sep 18, 2013 at 06:35:49PM +0200, Thomas Petazzoni wrote: > Dear Ethan Tuttle, > > On Tue, 17 Sep 2013 23:30:56 -0700, Ethan Tuttle wrote: > > On Mon, Sep 16, 2013 at 11:01 PM, Willy Tarreau <w@1wt.eu> wrote: > > > Next step should be that you test both kernels to be sure. > > > > Thanks for the kernel images, Willy. I'm still experimenting but > > initial results are strange: I haven't seen a crash from the -ethan > > image you provided, nor by a kernel with that config that I built > > myself. The config is only different from my crashing config by a few > > options. So perhaps some combination of options prevents the crash. > > I'll see if I can narrow it down. > > A toolchain generating some crappy code maybe? Ethan, Willy, comparing > your toolchain (compiler version, origin of the toolchain) could be > interesting. I thought about this but it looks suspicious, I don't see why the toolchain would produce random bitflips. My toolchain is a linaro 4.7 gcc into which I have added support for a "pj4b" CPU target which is essentially the same as cortex-a9 plus support for the IDIV instruction in thumb mode. But I can send it to Ethan if that helps. Willy ^ permalink raw reply [flat|nested] 21+ messages in thread