* Oops: 17 SMP ARM (v3.16-rc2) [not found] ` <20140626140115.GQ32514@n2100.arm.linux.org.uk> @ 2014-06-26 14:44 ` Mattis Lorentzon 2014-06-26 15:14 ` Russell King - ARM Linux 0 siblings, 1 reply; 44+ messages in thread From: Mattis Lorentzon @ 2014-06-26 14:44 UTC (permalink / raw) To: linux-arm-kernel Thank you for your reply, > On Wed, Jun 25, 2014 at 01:55:05PM +0000, Mattis Lorentzon wrote: > > I have a similar issue with v3.16-rc2 as previously reported by Waldemar > Brodkorb for v3.15-rc4. > > https://lkml.org/lkml/2014/5/9/330 > > This URL returns no useful information. I find that lkml.org is broken more > times than not in recent years. Please use a different archive site when > referring to posts, thanks. http://lkml.iu.edu/hypermail/linux/kernel/1405.1/01114.html > I have had two iMX6 platforms running root-NFS for about the last six to nine > months with various workloads, and have never seen this oops. > Unfortunately, the description above gives very little information for what > the mechanism to trigger this bug may be. For example, if I wanted to > reproduce it, what would I need to do? We have managed to trigger the Oops by just transferring a large file over nfs cat /mnt/foo > /dev/null where foo is a file that is approximately 2 GB. There may be some packet losses on this network, perhaps this differs from your workload? > > The error is sporadic and it seems to occur more frequently when using > perf. > > So it occurs when not using perf? Yes, certainly, see above. We have done some more investigations, please find it in this mail: http://lkml.iu.edu/hypermail/linux/kernel/1406.3/02190.html The Oops seems to have been introduced somewhere between v3.12 and v3.13: - The Oops is reproducible within seconds when running Linux 3.16-rc2. - We have observed the Oops on 8 different hardware units and two different chipsets (Freescale i.MX6 and Xilinx Zynq). - The Oops has not been seen on Linux 3.12 so it appears to be good. 
- The Oops has been seen on Linux 3.13, 3.14, 3.15, 3.16-rc2 so these appear to be bad. Configs and a couple of Oops reports are attached to the linked mail. Best regards, Mattis Lorentzon *************************************************************** Consider the environment before printing this message. To read Autoliv's Information and Confidentiality Notice, follow this link: http://www.autoliv.com/disclaimer.html *************************************************************** ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-06-26 14:44 ` Oops: 17 SMP ARM (v3.16-rc2) Mattis Lorentzon @ 2014-06-26 15:14 ` Russell King - ARM Linux 2014-06-27 11:21 ` Russell King - ARM Linux 0 siblings, 1 reply; 44+ messages in thread From: Russell King - ARM Linux @ 2014-06-26 15:14 UTC (permalink / raw) To: linux-arm-kernel On Thu, Jun 26, 2014 at 02:44:52PM +0000, Mattis Lorentzon wrote: > Thank you for your reply, > > > On Wed, Jun 25, 2014 at 01:55:05PM +0000, Mattis Lorentzon wrote: > > > I have a similar issue with v3.16-rc2 as previously reported by Waldemar > > Brodkorb for v3.15-rc4. > > > https://lkml.org/lkml/2014/5/9/330 > > > > This URL returns no useful information. I find that lkml.org is broken more > > times than not in recent years. Please use a different archive site when > > referring to posts, thanks. > > http://lkml.iu.edu/hypermail/linux/kernel/1405.1/01114.html I remember that report, but it was never resolved as I think no one has any ideas what is causing these, and no one has any idea where to start looking. > We have managed to trigger the Oops by just transferring a large file > over nfs > cat /mnt/foo > /dev/null > where foo is a file that is approximately 2 GB. There may be some > packet losses on this network, perhaps this differs from your workload? That's a similar workload to the one which is mentioned in the previous report. I've just set a similar transfer going, but this will be a 16GB file. > We have done some more investigations, please find it in this mail: > > http://lkml.iu.edu/hypermail/linux/kernel/1406.3/02190.html Yes, I saw that before I replied, and my reply was written with that message in mind. That's what prompted this paragraph in my previous reply: "Your other oops dumps also show various other functions apparantly returning 0xffffffff. I can't believe that there's more than one bug doing this, so I doubt the problem is in these functions. Something else must be going on." 
One of the problems is that there's so much work going on with the kernel by many different parties, pulling it in various directions, that no one really has an overview of all the changes, and so no one has much of a feel for what could be the cause of weird bugs like this. I don't know what to suggest - you could try using git bisect to see if you can track it down to a particular commit, but it sounds like that's going to be very time consuming. You mentioned that 3.12 doesn't show the bug, but 3.13 does - so start off telling git bisect that 3.12 is "good" and 3.13 is "bad". Hopefully there won't be too many breakages during the 3.13 merge window (between 3.12 and 3.13-rc1), but I don't have much faith in that; people seem to have a habit of holding back fixes until -rc1, which makes _exactly_ this kind of bug much harder for people like yourselves to track down - or maybe even impossible. I'm afraid I can't offer very much help beyond this until either I can reproduce it, or someone manages to identify a particular change which caused this. -- FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly improving, and getting towards what was expected from it. ^ permalink raw reply [flat|nested] 44+ messages in thread
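The bisection suggested above can be sketched as a plain binary search. This is only an illustration under stated assumptions: the history is treated as a linear list (real git history is a graph), `is_bad` is a hypothetical stand-in for "build, boot and run the NFS transfer", and every commit after the first bad one is assumed to be bad. With a sporadic bug, a "good" verdict is only as reliable as the length of the test run behind it.

```python
from typing import Callable, Sequence


def bisect_first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    the history flips from good to bad exactly once.
    """
    lo, hi = 0, len(commits) - 1      # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):      # hypothetical test: build + boot + transfer
            hi = mid
        else:
            lo = mid
    return commits[hi]


def max_builds(n_commits: int) -> int:
    """Upper bound on the number of kernels to build and test for n_commits."""
    return (n_commits - 1).bit_length()


if __name__ == "__main__":
    history = [f"c{i}" for i in range(100)]          # made-up commit names
    culprit = bisect_first_bad(history, lambda c: int(c[1:]) >= 37)
    print(culprit, "found within", max_builds(len(history)), "builds")
```

The bound grows only logarithmically with the number of commits, so the cost is dominated by how long each kernel has to run before it can be trusted as "good".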
* Oops: 17 SMP ARM (v3.16-rc2) 2014-06-26 15:14 ` Russell King - ARM Linux @ 2014-06-27 11:21 ` Russell King - ARM Linux 2014-06-27 16:16 ` Fredrik Noring 0 siblings, 1 reply; 44+ messages in thread From: Russell King - ARM Linux @ 2014-06-27 11:21 UTC (permalink / raw) To: linux-arm-kernel On Thu, Jun 26, 2014 at 04:14:24PM +0100, Russell King - ARM Linux wrote: > On Thu, Jun 26, 2014 at 02:44:52PM +0000, Mattis Lorentzon wrote: > > We have managed to trigger the Oops by just transferring a large file > > over nfs > > cat /mnt/foo > /dev/null > > where foo is a file that is approximately 2 GB. There may be some > > packet losses on this network, perhaps this differs from your workload? > > That's a similar workload to the one which is mentioned in the previous > report. I've just set a similar transfer going, but this will be a 16GB > file. I've run this transfer several times, but so far I've been unable to reproduce the issue here. -- FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly improving, and getting towards what was expected from it. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-06-27 11:21 ` Russell King - ARM Linux @ 2014-06-27 16:16 ` Fredrik Noring 2014-06-27 16:31 ` Russell King - ARM Linux 0 siblings, 1 reply; 44+ messages in thread From: Fredrik Noring @ 2014-06-27 16:16 UTC (permalink / raw) To: linux-arm-kernel Hi Russell, > On Thu, Jun 26, 2014 at 04:14:24PM +0100, Russell King - ARM Linux wrote: > > That's a similar workload to the one which is mentioned in the > > previous report. I've just set a similar transfer going, but this > > will be a 16GB file. > > I've run this transfer several times, but so far I've unable to reproduce the > issue here. Many thanks for testing this. We attempted to bisect, but unfortunately the result was not conclusive. One reason might be that the config had to be updated during the process, and so we did not end up with the exact same configuration (things like e.g. IMX_SDMA in DMA_ENGINE etc.). Some runs deadlocked without any visible Oops or printout. Some versions did not have an entirely working console configuration. Please find below a trace that appeared once with 3.16-rc2. Perhaps it is of some interest? (We also had memtester run for days on the i.MX6 hardware, without issues.) 
All the best,
Fredrik

------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x270/0x27c()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-rc2 #19
Backtrace:
[<80012390>] (dump_backtrace) from [<8001266c>] (show_stack+0x18/0x1c)
 r6:00000108 r5:00000000 r4:8064e29c r3:00000000
[<80012654>] (show_stack) from [<8049791c>] (dump_stack+0x8c/0x9c)
[<80497890>] (dump_stack) from [<80024f4c>] (warn_slowpath_common+0x74/0x90)
 r5:00000009 r4:80631d70
[<80024ed8>] (warn_slowpath_common) from [<80024fa0>] (warn_slowpath_fmt+0x38/0x40)
 r8:806320c0 r7:9d85a254 r6:9d879000 r5:9d85a000 r4:00000000
[<80024f6c>] (warn_slowpath_fmt) from [<803b8ff0>] (dev_watchdog+0x270/0x27c)
 r3:9d85a000 r2:805c4790
[<803b8d80>] (dev_watchdog) from [<8002f280>] (call_timer_fn+0x6c/0xe4)
 r10:80630008 r9:9d85a000 r8:803b8d80 r7:00000100 r6:80630000 r5:00000001 r4:80631dd8
[<8002f214>] (call_timer_fn) from [<8002fec8>] (run_timer_softirq+0x1d4/0x254)
 r10:803b8d80 r9:806320c0 r8:9d85a000 r7:00000000 r6:80631e28 r5:80667040 r4:9d85a284
[<8002fcf4>] (run_timer_softirq) from [<8002945c>] (__do_softirq+0x17c/0x30c)
 r10:00000001 r9:80632080 r8:40000001 r7:80630000 r6:00000100 r5:80632084 r4:00000020
[<800292e0>] (__do_softirq) from [<80029920>] (irq_exit+0xd0/0x114)
 r10:80630000 r9:80665f19 r8:00000001 r7:f4000100 r6:00000000 r5:80630008 r4:80630000
[<80029850>] (irq_exit) from [<8000f348>] (handle_IRQ+0x4c/0x98)
 r5:0000001d r4:8062ce44
[<8000f2fc>] (handle_IRQ) from [<80008614>] (gic_handle_irq+0x34/0x64)
 r6:80631f20 r5:80638a40 r4:f400010c r3:000000a0
[<800085e0>] (gic_handle_irq) from [<800131c4>] (__irq_svc+0x44/0x58)
Exception stack(0x80631f20 to 0x80631f68)
1f20: 00000001 00000001 00000000 8063b6f0 8063852c 806384d8 80665f19 804a0040
1f40: 00000001 80665f19 80630000 80631f74 00000000 80631f68 800614b8 8000f6a8
1f60: 200f0013 ffffffff
 r7:80631f54 r6:ffffffff r5:200f0013 r4:8000f6a8
[<8000f67c>] (arch_cpu_idle) from [<8005cbf8>] (cpu_startup_entry+0x10c/0x164)
[<8005caec>] (cpu_startup_entry) from [<80492b68>] (rest_init+0xc8/0xd8)
 r7:80625028 r3:00000000
[<80492aa0>] (rest_init) from [<805f6c5c>] (start_kernel+0x39c/0x3a8)
 r5:00000001 r4:806385d0
[<805f68c0>] (start_kernel) from [<10008074>] (0x10008074)
---[ end trace a7b7109ab2d04e11 ]---

^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-06-27 16:16 ` Fredrik Noring @ 2014-06-27 16:31 ` Russell King - ARM Linux 2014-06-30 6:22 ` Fredrik Noring ` (3 more replies) 0 siblings, 4 replies; 44+ messages in thread From: Russell King - ARM Linux @ 2014-06-27 16:31 UTC (permalink / raw) To: linux-arm-kernel Hi Fredrik, On Fri, Jun 27, 2014 at 04:16:57PM +0000, Fredrik Noring wrote: > Please find below a trace that appeared once with 3.16-rc2. Perhaps it is of > some interest? It's not that serious... I know that the FEC ethernet driver is horrendously racy (I have had a patch set for about the last six months which fixes some of its problems) but as I've had a lot of patches to deal with, and it's been pushed to the back of the queue... The races don't lead to data corruption though, merely timeouts and some lost packets. Now because things have changed during the last merge window, I've got an even bigger problem sorting through that patch set and getting it back into a submittable state. I've just sent out v2 for it onto the netdev at vger.kernel.org mailing list. The initial version (marked RFC) attracted very little interest from testers, or acks. I'd very much like to have some testing of it, so if you want to try it out, I can provide you with a git URL, patches or a combined patch. -- FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly improving, and getting towards what was expected from it. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-06-27 16:31 ` Russell King - ARM Linux @ 2014-06-30 6:22 ` Fredrik Noring 2014-06-30 12:30 ` Fredrik Noring ` (2 subsequent siblings) 3 siblings, 0 replies; 44+ messages in thread From: Fredrik Noring @ 2014-06-30 6:22 UTC (permalink / raw) To: linux-arm-kernel Hi Russell, > -----Original Message----- > It's not that serious... I know that the FEC ethernet driver is horrendously > racy (I have had a patch set for about the last six months which fixes some of > its problems) but as I've had a lot of patches to deal with, and it's been > pushed to the back of the queue... > > The races don't lead to data corruption though, merely timeouts and some > lost packets. The serial port (uart1) and Ethernet are essentially the only things we use. No disks, no graphics, no USB, etc. If not the Ethernet driver, what else is likely to crash NFS so badly? Also, we are happy to change our config if that would simplify things: http://lkml.iu.edu/hypermail/linux/kernel/1406.3/01488/config.gz > Now because things have changed during the last merge window, I've got an > even bigger problem sorting through that patch set and getting it back into a > submittable state. I've just sent out v2 for it onto the > netdev at vger.kernel.org mailing list. > > The initial version (marked RFC) attracted very little interest from testers, or > acks. I'd very much like to have some testing of it, so if you want to try it > out, I can provide you with a git URL, patches or a combined patch. Sure! A combined gzip patch attachment is fine. Git over HTTP probably works too. All the best, Fredrik ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-06-27 16:31 ` Russell King - ARM Linux 2014-06-30 6:22 ` Fredrik Noring @ 2014-06-30 12:30 ` Fredrik Noring 2014-06-30 13:00 ` Nathan Lynch 2014-07-02 6:02 ` Fredrik Noring 2014-08-05 13:31 ` Mattis Lorentzon 3 siblings, 1 reply; 44+ messages in thread From: Fredrik Noring @ 2014-06-30 12:30 UTC (permalink / raw) To: linux-arm-kernel Hi Russell, It seems to be a compiler issue, where (GCC) 4.8.2 does not produce a properly working kernel. Happily, (Fedora 2013.11.24-2.fc19) 4.8.1 appears to do a lot better. No crashes so far with v3.16-rc2! All the best, Fredrik > -----Original Message----- > Hi Fredrik, > > On Fri, Jun 27, 2014 at 04:16:57PM +0000, Fredrik Noring wrote: > > Please find below a trace that appeared once with 3.16-rc2. Perhaps it > > is of some interest? > > It's not that serious... I know that the FEC ethernet driver is horrendously > racy (I have had a patch set for about the last six months which fixes some of > its problems) but as I've had a lot of patches to deal with, and it's been > pushed to the back of the queue... > > The races don't lead to data corruption though, merely timeouts and some > lost packets. > > Now because things have changed during the last merge window, I've got an > even bigger problem sorting through that patch set and getting it back into a > submittable state. I've just sent out v2 for it onto the > netdev at vger.kernel.org mailing list. > > The initial version (marked RFC) attracted very little interest from testers, or > acks. I'd very much like to have some testing of it, so if you want to try it > out, I can provide you with a git URL, patches or a combined patch. > > -- > FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly > improving, and getting towards what was expected from it. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-06-30 12:30 ` Fredrik Noring @ 2014-06-30 13:00 ` Nathan Lynch 0 siblings, 0 replies; 44+ messages in thread From: Nathan Lynch @ 2014-06-30 13:00 UTC (permalink / raw) To: linux-arm-kernel On 06/30/2014 07:30 AM, Fredrik Noring wrote: >> >> On Fri, Jun 27, 2014 at 04:16:57PM +0000, Fredrik Noring wrote: >>> Please find below a trace that appeared once with 3.16-rc2. Perhaps it >>> is of some interest? >> >> It's not that serious... I know that the FEC ethernet driver is horrendously >> racy (I have had a patch set for about the last six months which fixes some of >> its problems) but as I've had a lot of patches to deal with, and it's been >> pushed to the back of the queue... >> >> The races don't lead to data corruption though, merely timeouts and some >> lost packets. > It seems to be a compiler issue, where (GCC) 4.8.2 does not produce a properly > working kernel. Happily, (Fedora 2013.11.24-2.fc19) 4.8.1 appears to do a lot > better. No crashes so far with v3.16-rc2! > Did you narrow it down to a particular GCC bug? The symptoms you reported remind me of: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854 Sadly, unpatched GCC 4.8.1 and 4.8.2 are unsuitable for building ARM kernels. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-06-27 16:31 ` Russell King - ARM Linux 2014-06-30 6:22 ` Fredrik Noring 2014-06-30 12:30 ` Fredrik Noring @ 2014-07-02 6:02 ` Fredrik Noring 2014-08-05 13:31 ` Mattis Lorentzon 3 siblings, 0 replies; 44+ messages in thread From: Fredrik Noring @ 2014-07-02 6:02 UTC (permalink / raw) To: linux-arm-kernel Hi Russell, > -----Original Message----- > > The initial version (marked RFC) attracted very little interest from > > testers, or acks. I'd very much like to have some testing of it, so > > if you want to try it out, I can provide you with a git URL, patches > > or a combined patch. > > Sure! A combined gzip patch attachment is fine. Git over HTTP probably > works too. We are still interested in trying out your patches to improve network performance. We can do some testing this week and in August. Best regards, Fredrik ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-06-27 16:31 ` Russell King - ARM Linux ` (2 preceding siblings ...) 2014-07-02 6:02 ` Fredrik Noring @ 2014-08-05 13:31 ` Mattis Lorentzon 2014-08-05 13:53 ` Fabio Estevam 2014-08-06 9:50 ` Russell King - ARM Linux 3 siblings, 2 replies; 44+ messages in thread From: Mattis Lorentzon @ 2014-08-05 13:31 UTC (permalink / raw) To: linux-arm-kernel Hi Russell! > Now because things have changed during the last merge window, I've got an > even bigger problem sorting through that patch set and getting it back into a > submittable state. I've just sent out v2 for it onto the > netdev at vger.kernel.org mailing list. > > The initial version (marked RFC) attracted very little interest from testers, or > acks. I'd very much like to have some testing of it, so if you want to try it out, > I can provide you with a git URL, patches or a combined patch. We have applied your V2 patch set of 30 patches on top of v3.16-rc2 and are currently running some stability tests. During our first test round we triggered a timeout which caused the fec driver to become unresponsive for several minutes. The attached backtrace was shown when the hardware was rebooted. Best regards, Mattis Lorentzon -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: fec-transmit-queue-timed-out.txt URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140805/048707db/attachment-0001.txt> ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-05 13:31 ` Mattis Lorentzon @ 2014-08-05 13:53 ` Fabio Estevam 2014-08-06 6:48 ` Mattis Lorentzon 2014-08-06 9:50 ` Russell King - ARM Linux 1 sibling, 1 reply; 44+ messages in thread From: Fabio Estevam @ 2014-08-05 13:53 UTC (permalink / raw) To: linux-arm-kernel On Tue, Aug 5, 2014 at 10:31 AM, Mattis Lorentzon <Mattis.Lorentzon@autoliv.com> wrote: > We have applied your V2 patch set of 30 patches on top of v3.16-rc2 and are > currently running some stability tests. > > During our first test round we triggered a timeout which caused the fec driver > to become unresponsive for several minutes. The attached backtrace was > shown when the hardware was rebooted. Could this problem be the same one as reported at: http://www.spinics.net/lists/arm-kernel/msg347914.html ? Which Ethernet PHY do you use? Do you have pull-up in the MDIO line? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-05 13:53 ` Fabio Estevam @ 2014-08-06 6:48 ` Mattis Lorentzon 0 siblings, 0 replies; 44+ messages in thread From: Mattis Lorentzon @ 2014-08-06 6:48 UTC (permalink / raw) To: linux-arm-kernel Hi Fabio, > Could this problem be the same one as reported at: > http://www.spinics.net/lists/arm-kernel/msg347914.html ? The problem you link to describes a permanent issue, our problem seems to be sporadic as most of our tests work fine (at least for a while). > Which Ethernet PHY do you use? Do you have pull-up in the MDIO line? Our hardware has the KSZ9021RN PHY, so the MDIO line should be pull-up. Do you know if there are debug options that could help us determine the cause of the timeout? Best regards, Mattis Lorentzon ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-05 13:31 ` Mattis Lorentzon 2014-08-05 13:53 ` Fabio Estevam @ 2014-08-06 9:50 ` Russell King - ARM Linux 2014-08-06 11:10 ` Mattis Lorentzon 1 sibling, 1 reply; 44+ messages in thread
From: Russell King - ARM Linux @ 2014-08-06 9:50 UTC (permalink / raw)
To: linux-arm-kernel

On Tue, Aug 05, 2014 at 01:31:29PM +0000, Mattis Lorentzon wrote:
> We have applied your V2 patch set of 30 patches on top of v3.16-rc2 and are
> currently running some stability tests.
>
> During our first test round we triggered a timeout which caused the fec driver
> to become unresponsive for several minutes. The attached backtrace was
> shown when the hardware was rebooted.

What is on the other end of the link?

> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x270/0x27c()
> NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
...
> fec 2188000.ethernet eth0: TX ring dump
> Nr    SC     addr       len SKB
>   0   0x1c00 0x00000000  66 (null)
...
>  83   0x1c00 0x00000000  66 (null)
>  84 H 0x1c00 0x00000000  66 (null)
>  85   0x9c00 0x2e205000  66 9e384f00
>  86   0x1c00 0x2e204800  66 9e384d80
>  87   0x1c00 0x2e204000  66 9e384180
...
> 376   0x1c00 0x2e252800  66 81cf6180
> 377   0x1c00 0x2e253000  66 81cf6240
> 378 S 0x1c00 0x00000000  66 (null)

So, the software would insert the next packet into slot 378. However, the slots from 85 to 377 have not been reaped, despite those in 86 to 377 allegedly having been sent. This is because the entry in slot 85 shows that it has yet to be sent.

I've no idea what causes this; it looks like there's something screwed with the hardware which causes the transmitter to skip an entry in the ring under certain circumstances. As I've never been able to reproduce it here, I've not been able to investigate it.
What I would like to do is to stamp each packet in some way with an identifier marking its ring position, and then monitor the network to find out whether the packet at slot 85 was actually transmitted - that's made slightly harder because packets may be dropped at the receiver when operating in promisc mode. This would then allow us to work out some likely causes. Note that after the transmit watchdog, the interface should recover and start operating normally again - and that should not take "several minutes." -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 44+ messages in thread
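The reasoning about slot 85 can be checked mechanically against a ring dump. The sketch below is only illustrative: it assumes the column layout shown in the quoted dump (number, optional H/S marker, status, address, length, skb) and assumes that bit 15 of the status word is the "ready, not yet transmitted" flag. That bit is the only difference between 0x9c00 and 0x1c00 in the dump; its name here (`TX_READY`) is a label chosen for this example.

```python
import re
from typing import List, NamedTuple, Optional

# Assumption: bit 15 of the status word marks a descriptor still waiting to
# be transmitted. It is the only bit that differs between 0x9c00 and 0x1c00.
TX_READY = 0x8000


class Slot(NamedTuple):
    nr: int
    marker: Optional[str]   # "H" or "S" column from the dump, if present
    status: int
    addr: int
    length: int
    skb: str


_ROW = re.compile(
    r"^(\d+)\s+(?:([HS])\s+)?(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\S+)$"
)


def parse_ring_dump(text: str) -> List[Slot]:
    """Parse ring rows, ignoring mail quote markers, headers and '...' lines."""
    slots = []
    for line in text.splitlines():
        m = _ROW.match(line.strip().lstrip(">").strip())
        if m:
            nr, marker, status, addr, length, skb = m.groups()
            slots.append(Slot(int(nr), marker, int(status, 16),
                              int(addr, 16), int(length), skb))
    return slots


def unsent_slots(slots: List[Slot]) -> List[int]:
    """Slot numbers whose status still has the assumed ready bit set."""
    return [s.nr for s in slots if s.status & TX_READY]


SAMPLE = """\
> Nr SC addr len SKB
> 84 H 0x1c00 0x00000000 66 (null)
> 85 0x9c00 0x2e205000 66 9e384f00
> 86 0x1c00 0x2e204800 66 9e384d80
> 378 S 0x1c00 0x00000000 66 (null)
"""

if __name__ == "__main__":
    print(unsent_slots(parse_ring_dump(SAMPLE)))
```

Run against a full dump, this would list every descriptor the hardware appears to have skipped, which is quicker than scanning several hundred rows by eye.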
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-06 9:50 ` Russell King - ARM Linux @ 2014-08-06 11:10 ` Mattis Lorentzon 2014-08-06 12:55 ` Russell King - ARM Linux 0 siblings, 1 reply; 44+ messages in thread From: Mattis Lorentzon @ 2014-08-06 11:10 UTC (permalink / raw) To: linux-arm-kernel Russell, > What is on the other end of the link? 16 ARM cards connected to a 3Com Switch 4400 connected to a Linux FC 20 machine (Intel Corporation 82541PI Gigabit Ethernet Controller rev 05). There may be multiple problems. The backtrace has only been seen a few times, on two different cards. Most of the time, the network for a random card just stalls without any visible backtrace or error messages. The other cards seem to be unaffected when this happens. > What I would like to do is to stamp each packet in some way with an > identifier marking its ring position, and then monitor the network to find out > whether the packet at slot 85 was actually transmitted - that's made slightly > harder because packets may be dropped at the receiver when operating in > promisc mode. This would then allow us to work out some likely causes. We would be glad to run this test on our setup. Do you have more detailed information on how to set it up? > Note that after the transmit watchdog, the interface should recover and start > operating normally again - and that should not take "several minutes." After a network stall, we usually have to power-cycle the ARM hardware to get it back to a usable state. These stalls last at least several minutes, perhaps indefinitely. It does not seem to recover properly, and is no longer reachable via the network. Best regards, Mattis Lorentzon ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-06 11:10 ` Mattis Lorentzon @ 2014-08-06 12:55 ` Russell King - ARM Linux 2014-08-07 11:11 ` Mattis Lorentzon 0 siblings, 1 reply; 44+ messages in thread From: Russell King - ARM Linux @ 2014-08-06 12:55 UTC (permalink / raw) To: linux-arm-kernel On Wed, Aug 06, 2014 at 11:10:06AM +0000, Mattis Lorentzon wrote: > Russell, > > > What is on the other end of the link? > > 16 ARM cards connected to a 3Com Switch 4400 connected to a Linux FC 20 > machine (Intel Corporation 82541PI Gigabit Ethernet Controller rev 05). > > There may be multiple problems. The backtrace has only been seen a few > times, on two different cards. Most of the time, the network for a random > card just stalls without any visible backtrace or error messages. The other > cards seem to be unaffected when this happens. Can you ascertain whether these stalls are a result of some failure of the receive side or the transmit side - you should be able to tell that if you watch the packet counts via ifconfig on the stalled card. Also, it would be useful to know whether the FEC interrupt was firing. I hope you have some kind of serial console on these cards? > > What I would like to do is to stamp each packet in some way with an > > identifier marking its ring position, and then monitor the network to find out > > whether the packet at slot 85 was actually transmitted - that's made slightly > > harder because packets may be dropped at the receiver when operating in > > promisc mode. This would then allow us to work out some likely causes. > > We would be glad to run this test on our setup, do you have more detailed > information on how to set it up? One of the problems is to find some way to stamp each packet with a 10-bit number without having any side effects. I guess one possibility would be to overwrite the source MAC address on transmit, which hopefully should not cause any side effects. 
> After a network stall, we usually have to powercycle the ARM hardware to > get it back to a usable state. These stalls last at least several minutes, > perhaps indefinitely. It does not seem to recover properly, and is no longer > reachable via the network. Hmm. Okay, I think the first thing we need to do is to work out why the silent stalls are happening. -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 44+ messages in thread
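The stamping idea above (a 10-bit ring position carried in the source MAC address) could look roughly like the following. This is a sketch of the encoding only, not driver code: the `02:00:00:00` prefix is an assumption picked for the example because it is a locally administered unicast address, and the function names are hypothetical.

```python
RING_BITS = 10                      # "a 10-bit number", as suggested above
PREFIX = bytes([0x02, 0x00, 0x00, 0x00])   # assumed locally administered prefix


def stamp_mac(slot: int, prefix: bytes = PREFIX) -> bytes:
    """Build a 6-byte source MAC whose last two bytes carry the ring slot."""
    if not 0 <= slot < (1 << RING_BITS):
        raise ValueError("slot does not fit in 10 bits")
    return prefix + slot.to_bytes(2, "big")


def slot_from_mac(mac: bytes) -> int:
    """Recover the ring slot from a stamped source MAC seen in a capture."""
    return int.from_bytes(mac[4:6], "big") & ((1 << RING_BITS) - 1)


def mac_str(mac: bytes) -> str:
    """Format a MAC the way capture tools usually print it."""
    return ":".join(f"{b:02x}" for b in mac)


if __name__ == "__main__":
    for slot in (84, 85, 86):
        print(slot, mac_str(stamp_mac(slot)))
```

On the capture side, a gap in the recovered slot sequence (84, 86, ... with no 85) would suggest the frame never left the transmitter, whereas a complete sequence would point at the descriptor write-back instead.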
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-06 12:55 ` Russell King - ARM Linux @ 2014-08-07 11:11 ` Mattis Lorentzon 2014-08-07 12:12 ` Russell King - ARM Linux 0 siblings, 1 reply; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-07 11:11 UTC (permalink / raw)
To: linux-arm-kernel

Russell,

> Can you ascertain whether these stalls are a result of some failure of the
> receive side or the transmit side - you should be able to tell that if you watch
> the packet counts via ifconfig on the stalled card. Also, it would be useful to
> know whether the FEC interrupt was firing.

grep eth /proc/interrupts
151:          0          0          0          0       GIC 151  2188000.ethernet
166:    1205661          0          0          0  gpio-mxc   6  2188000.ethernet

The interrupt counter 166 increases regularly during the stalls. Ifconfig indicates that the RX and TX counters do not increase.

> I hope you have some kind of serial console on these cards?

Yes, indeed. Local stimuli seem to be able to unstall the network in a somewhat random fashion. Running e.g. ifconfig or ping locally may immediately or after up to about half a minute make the network responsive. However, it usually degenerates again to a complete stall within seconds.

Without local stimuli the network does not appear to recover at all. The card does not even respond to pings (again, most often without any apparent error messages).

Running both of the following commands in parallel from the FC server seems to trigger the problem within minutes (please note that the arm card stops responding to both ping and ssh):

 # while :; do ssh arm-card echo Ok; done
 # ping arm-card

We have noticed the same problem on both the i.MX6 and the Zynq cards (using KSZ9021 and Cadence GEM drivers). However, the number of iterations required to trigger the problem varies. Sometimes it might stall after less than 100, but in other cases the stalls begin after nearly 10000 iterations. Once stalled (and unstalled after stimuli), the network on that particular card degenerates a lot more often.

Apart from the kernel, IP numbers and MAC addresses, the software configurations are identical between the Zynq and the i.MX6. Perhaps the fault is unrelated to the Freescale driver?

> Hmm. Okay, I think the first thing we need to do is to work out why the
> silent stalls are happening.

Would you have any ideas on what to check next?

Best regards,
Mattis Lorentzon

^ permalink raw reply [flat|nested] 44+ messages in thread
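The "is the interrupt still firing during a stall" check above can be automated by comparing two snapshots of /proc/interrupts. The following is a minimal sketch, assuming the line layout shown in the grep output above; the second snapshot's count is invented for the example, and the helper names are hypothetical.

```python
from typing import Dict, List


def parse_interrupts(text: str, device: str) -> Dict[str, int]:
    """Map IRQ label -> count summed over all CPUs, for lines naming device."""
    totals: Dict[str, int] = {}
    for line in text.splitlines():
        if device not in line or ":" not in line:
            continue
        label, rest = line.split(":", 1)
        count = 0
        for token in rest.split():
            if not token.isdigit():
                break               # first non-numeric token ends the CPU columns
            count += int(token)
        totals[label.strip()] = count
    return totals


def advancing(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """IRQ labels whose count grew between the two snapshots."""
    return sorted(irq for irq in after if after[irq] > before.get(irq, 0))


BEFORE = """\
151:          0          0          0          0       GIC 151  2188000.ethernet
166:    1205661          0          0          0  gpio-mxc   6  2188000.ethernet
"""
# Hypothetical later snapshot, taken while the link is stalled.
AFTER = BEFORE.replace("1205661", "1205700")

if __name__ == "__main__":
    dev = "2188000.ethernet"
    print(advancing(parse_interrupts(BEFORE, dev), parse_interrupts(AFTER, dev)))
```

On a live board the two snapshots would come from reading /proc/interrupts a few seconds apart over the serial console, alongside the RX and TX packet counters.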
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-07 11:11 ` Mattis Lorentzon @ 2014-08-07 12:12 ` Russell King - ARM Linux 2014-08-07 14:20 ` Fabio Estevam 2014-08-08 18:09 ` Russell King - ARM Linux 0 siblings, 2 replies; 44+ messages in thread From: Russell King - ARM Linux @ 2014-08-07 12:12 UTC (permalink / raw) To: linux-arm-kernel On Thu, Aug 07, 2014 at 11:11:06AM +0000, Mattis Lorentzon wrote: > Russell, > > > Can you ascertain whether these stalls are a result of some failure of the > > receive side or the transmit side - you should be able to tell that if you watch > > the packet counts via ifconfig on the stalled card. Also, it would be useful to > > know whether the FEC interrupt was firing. > > grep eth /proc/interrupts > 151: 0 0 0 0 GIC 151 2188000.ethernet > 166: 1205661 0 0 0 gpio-mxc 6 2188000.ethernet > > The interrupt counter 166 increases regularly during the stalls. > Ifconfig indicates that the RX and TX counters do not increase. Hmm, I'm slightly confused. On my iMX6Q, I have: 150: 581754 0 0 0 GIC 150 2188000.ethernet 151: 0 0 0 0 GIC 151 2188000.ethernet In the DT file, we have: fec: ethernet at 02188000 { compatible = "fsl,imx6q-fec"; reg = <0x02188000 0x4000>; interrupts-extended = <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>, <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>; clocks = <&clks 117>, <&clks 117>, <&clks 190>; clock-names = "ipg", "ahb", "ptp"; status = "disabled"; }; which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151. Yet you seem to have nothing registered against GIC 150, instead having an interrupt against GPIO 6. This seems very odd, and as this is an on-SoC device, I don't see why you would want to bind the interrupts for the FEC device any differently to standard platforms. This could well be the cause of your stalls. What's GPIO 6 used for on your board? -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 44+ messages in thread
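The arithmetic in the message above (DT SPI number plus 32 gives the GIC interrupt ID) can be written out as a tiny shell sketch. The function name spi_to_gic is hypothetical; the offset of 32 is the one Russell states, since shared peripheral interrupts start at GIC ID 32.

```shell
#!/bin/sh
# spi_to_gic: hypothetical helper, not part of the thread.
# Converts the SPI number from a DT "<&intc 0 N ...>" specifier into the
# GIC interrupt ID shown in the "GIC NNN" column of /proc/interrupts.
spi_to_gic() {
    echo $(( $1 + 32 ))
}

spi_to_gic 118   # prints 150
spi_to_gic 119   # prints 151
```

So a board using the default FEC node would be expected to show GIC 150 and GIC 151, and a missing GIC 150 entry points at an overriding interrupts-extended property.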
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-07 12:12 ` Russell King - ARM Linux @ 2014-08-07 14:20 ` Fabio Estevam 2014-08-07 14:38 ` Fabio Estevam 2014-08-08 14:05 ` Fabio Estevam 2014-08-08 18:09 ` Russell King - ARM Linux 1 sibling, 2 replies; 44+ messages in thread From: Fabio Estevam @ 2014-08-07 14:20 UTC (permalink / raw) To: linux-arm-kernel On Thu, Aug 7, 2014 at 9:12 AM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > Hmm, I'm slightly confused. On my iMX6Q, I have: > > 150: 581754 0 0 0 GIC 150 2188000.ethernet > 151: 0 0 0 0 GIC 151 2188000.ethernet Same here on a mx6qsabresd. > In the DT file, we have: > > fec: ethernet at 02188000 { > compatible = "fsl,imx6q-fec"; > reg = <0x02188000 0x4000>; > interrupts-extended = > <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>, > <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>; > clocks = <&clks 117>, <&clks 117>, <&clks 190>; > clock-names = "ipg", "ahb", "ptp"; > status = "disabled"; > }; > > which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151. > Yet you seem to have nothing registered against GIC 150, instead having > an interrupt against GPIO 6. > > This seems very odd, and as this is an on-SoC device, I don't see why > you would want to bind the interrupts for the FEC device any differently > to standard platforms. > > This could well be the cause of your stalls. > > What's GPIO 6 used for on your board? On a imx6q sabreauto I also get: 151: 0 0 0 0 GIC 151 2188000.ethernet 166: 4577 0 0 0 gpio-mxc 6 2188000.ethernet and the GPIO1_6 interrupt comes from this commit: commit bc20a5d6da718f9d60da0a78f70c653c1cd16af3 Author: Troy Kisky <troy.kisky@boundarydevices.com> Date: Fri Dec 20 11:47:12 2013 -0700 ARM: dts: imx6qdl-sabreauto: use GPIO_6 for FEC interrupt. This works around a hardware bug. Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com> Signed-off-by: Shawn Guo <shawn.guo@linaro.org> ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-07 14:20 ` Fabio Estevam @ 2014-08-07 14:38 ` Fabio Estevam 2014-08-08 1:30 ` Troy Kisky 2014-08-08 14:05 ` Fabio Estevam 1 sibling, 1 reply; 44+ messages in thread From: Fabio Estevam @ 2014-08-07 14:38 UTC (permalink / raw) To: linux-arm-kernel On Thu, Aug 7, 2014 at 11:20 AM, Fabio Estevam <festevam@gmail.com> wrote: > On a imx6q sabreauto I also get: > > 151: 0 0 0 0 GIC 151 2188000.ethernet > 166: 4577 0 0 0 gpio-mxc 6 2188000.ethernet > > and the GPIO1_6 interrupt comes from this commit: > > commit bc20a5d6da718f9d60da0a78f70c653c1cd16af3 > Author: Troy Kisky <troy.kisky@boundarydevices.com> > Date: Fri Dec 20 11:47:12 2013 -0700 > > ARM: dts: imx6qdl-sabreauto: use GPIO_6 for FEC interrupt. > > This works around a hardware bug. > > Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com> > Signed-off-by: Shawn Guo <shawn.guo@linaro.org> Actually a more descriptive commit log can be found here: commit 6261c4c8f13eb91f733e8ba6d67c409a2e841667 Author: Troy Kisky <troy.kisky@boundarydevices.com> Date: Fri Dec 20 11:47:11 2013 -0700 ARM: dts: imx6qdl-sabrelite: use GPIO_6 for FEC interrupt. This works around a hardware bug. From "Chip Errata for the i.MX 6Dual/6Quad" ERR006687 ENET: Only the ENET wake-up interrupt request can wake the system from Wait mode. The ENET block generates many interrupts. Only one of these interrupt lines is connected to the General Power Controller (GPC) block, but a logical OR of all of the ENET interrupts is connected to the General Interrupt Controller (GIC). When the system enters Wait mode, a normal RX Done or TX Done does not wake up the system because the GPC cannot see this interrupt. This impacts performance of the ENET block because its interrupts are serviced only when the chip exits Wait mode due to an interrupt from some other wake-up source. 
Before this patch, ping times of a Sabre Lite board are quite random: ping 192.168.0.13 -i.5 -c5 PING 192.168.0.13 (192.168.0.13) 56(84) bytes of data. 64 bytes from 192.168.0.13: icmp_req=1 ttl=64 time=15.7 ms 64 bytes from 192.168.0.13: icmp_req=2 ttl=64 time=14.4 ms 64 bytes from 192.168.0.13: icmp_req=3 ttl=64 time=13.4 ms 64 bytes from 192.168.0.13: icmp_req=4 ttl=64 time=12.4 ms 64 bytes from 192.168.0.13: icmp_req=5 ttl=64 time=11.4 ms === 192.168.0.13 ping statistics === 5 packets transmitted, 5 received, 0% packet loss, time 2004ms rtt min/avg/max/mdev = 11.431/13.501/15.746/1.508 ms ____________________________________________________ After this patch: ping 192.168.0.13 -i.5 -c5 PING 192.168.0.13 (192.168.0.13) 56(84) bytes of data. 64 bytes from 192.168.0.13: icmp_req=1 ttl=64 time=0.120 ms 64 bytes from 192.168.0.13: icmp_req=2 ttl=64 time=0.175 ms 64 bytes from 192.168.0.13: icmp_req=3 ttl=64 time=0.169 ms 64 bytes from 192.168.0.13: icmp_req=4 ttl=64 time=0.168 ms 64 bytes from 192.168.0.13: icmp_req=5 ttl=64 time=0.172 ms === 192.168.0.13 ping statistics === 5 packets transmitted, 5 received, 0% packet loss, time 1999ms rtt min/avg/max/mdev = 0.120/0.160/0.175/0.026 ms ____________________________________________________ Also, apply same change to imx6qdl-nitrogen6x. This change may not be appropriate for all boards. Sabre Lite uses GPIO6 as a power down output for a ov5642 camera. As this expansion board does not yet work with mainline, this is not yet a conflict. It would be nice to have an alternative fix for boards where this is a problem. For example Sabre SD uses GPIO6 for I2C3_SDA. It also has long ping times currently. But cannot use this fix without giving up a touchscreen. Its ping times are also random. ping 192.168.0.19 -i.5 -c5 PING 192.168.0.19 (192.168.0.19) 56(84) bytes of data. 
64 bytes from 192.168.0.19: icmp_req=1 ttl=64 time=16.0 ms 64 bytes from 192.168.0.19: icmp_req=2 ttl=64 time=15.4 ms 64 bytes from 192.168.0.19: icmp_req=3 ttl=64 time=14.4 ms 64 bytes from 192.168.0.19: icmp_req=4 ttl=64 time=13.4 ms 64 bytes from 192.168.0.19: icmp_req=5 ttl=64 time=12.4 ms === 192.168.0.19 ping statistics --- 5 packets transmitted, 5 received, 0% packet loss, time 2003ms rtt min/avg/max/mdev = 12.451/14.369/16.057/1.316 ms Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com> CC: Ranjani Vaidyanathan <ra5478@freescale.com> Signed-off-by: Shawn Guo <shawn.guo@linaro.org> ,but I am wondering if we should also do: --- a/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi +++ b/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi @@ -66,6 +66,7 @@ pinctrl-0 = <&pinctrl_enet>; phy-mode = "rgmii"; interrupts-extended = <&gpio1 6 IRQ_TYPE_LEVEL_HIGH>, + <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>, <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>; status = "okay"; }; @@ -226,7 +227,7 @@ MX6QDL_PAD_RGMII_RD2__RGMII_RD2 0x1b0b0 MX6QDL_PAD_RGMII_RD3__RGMII_RD3 0x1b0b0 MX6QDL_PAD_RGMII_RX_CTL__RGMII_RX_CTL 0x1b0b0 - MX6QDL_PAD_GPIO_6__ENET_IRQ 0x000b1 + MX6QDL_PAD_GPIO_6__ENET_IRQ 0x400000b1 Since the Workaround for erratum ERR006687 states that the SION bit needs to be used: "All of the interrupts can be selected by MUX and output to pad GPIO6. If GPIO6 is selected to output ENET interrupts and GPIO6 SION is set, the resulting GPIO interrupt will wake the system from Wait mode." ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-07 14:38 ` Fabio Estevam @ 2014-08-08 1:30 ` Troy Kisky 0 siblings, 0 replies; 44+ messages in thread From: Troy Kisky @ 2014-08-08 1:30 UTC (permalink / raw) To: linux-arm-kernel On 8/7/2014 7:38 AM, Fabio Estevam wrote: > On Thu, Aug 7, 2014 at 11:20 AM, Fabio Estevam <festevam@gmail.com> wrote: > > ,but I am wondering if we should also do: > > --- a/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi > +++ b/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi > @@ -66,6 +66,7 @@ > pinctrl-0 = <&pinctrl_enet>; > phy-mode = "rgmii"; > interrupts-extended = <&gpio1 6 IRQ_TYPE_LEVEL_HIGH>, > + <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>, > <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>; > status = "okay"; > }; > @@ -226,7 +227,7 @@ > MX6QDL_PAD_RGMII_RD2__RGMII_RD2 0x1b0b0 > MX6QDL_PAD_RGMII_RD3__RGMII_RD3 0x1b0b0 > MX6QDL_PAD_RGMII_RX_CTL__RGMII_RX_CTL 0x1b0b0 > - MX6QDL_PAD_GPIO_6__ENET_IRQ 0x000b1 > + MX6QDL_PAD_GPIO_6__ENET_IRQ > 0x400000b1 > > Since the Workaround for erratum ERR006687 states that the SION bit > needs to be used: > > "All of the interrupts can be selected by MUX and output to pad GPIO6. > If GPIO6 is selected to > output ENET interrupts and GPIO6 SION is set, the resulting GPIO > interrupt will wake the system > from Wait mode." > arch/arm/boot/dts/imx6q-pinfunc.h:#define MX6QDL_PAD_GPIO_6__ENET_IRQ 0x230 0x600 0x03c 0x11 0xff000609 So, the SION bit should already be set (0x11). But the other way works too. Troy ^ permalink raw reply [flat|nested] 44+ messages in thread
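Troy's point about 0x11 can be checked by splitting the value into its fields. The sketch below is an illustration under stated assumptions: the fourth number in the pinfunc.h tuple is taken to be the mux mode value, SION is assumed to be bit 4 of that value, and the ALT function is assumed to live in the low three bits. The function name decode_mux_mode is made up for this note.

```shell
#!/bin/sh
# decode_mux_mode: hypothetical helper, not part of the thread.
# Assumes SION is bit 4 and the ALT mux mode is bits 2:0 of the value.
decode_mux_mode() {
    val=$(( $1 ))
    echo "sion=$(( (val >> 4) & 1 )) alt=$(( val & 0x7 ))"
}

decode_mux_mode 0x11   # prints: sion=1 alt=1
```

Under those assumptions 0x11 already carries SION together with ALT1, which matches Troy's remark that the bit is set by the pin definition itself and that adding 0x40000000 to the pad configuration is a second way to request the same thing.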
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-07 14:20 ` Fabio Estevam 2014-08-07 14:38 ` Fabio Estevam @ 2014-08-08 14:05 ` Fabio Estevam 1 sibling, 0 replies; 44+ messages in thread From: Fabio Estevam @ 2014-08-08 14:05 UTC (permalink / raw) To: linux-arm-kernel Mattis, On Thu, Aug 7, 2014 at 11:20 AM, Fabio Estevam <festevam@gmail.com> wrote: > On Thu, Aug 7, 2014 at 9:12 AM, Russell King - ARM Linux > <linux@arm.linux.org.uk> wrote: > >> Hmm, I'm slightly confused. On my iMX6Q, I have: >> >> 150: 581754 0 0 0 GIC 150 2188000.ethernet >> 151: 0 0 0 0 GIC 151 2188000.ethernet > > Same here on a mx6qsabresd. > >> In the DT file, we have: >> >> fec: ethernet at 02188000 { >> compatible = "fsl,imx6q-fec"; >> reg = <0x02188000 0x4000>; >> interrupts-extended = >> <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>, >> <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>; >> clocks = <&clks 117>, <&clks 117>, <&clks 190>; >> clock-names = "ipg", "ahb", "ptp"; >> status = "disabled"; >> }; >> >> which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151. >> Yet you seem to have nothing registered against GIC 150, instead having >> an interrupt against GPIO 6. >> >> This seems very odd, and as this is an on-SoC device, I don't see why >> you would want to bind the interrupts for the FEC device any differently >> to standard platforms. >> >> This could well be the cause of your stalls. >> >> What's GPIO 6 used for on your board? > > On a imx6q sabreauto I also get: > > 151: 0 0 0 0 GIC 151 2188000.ethernet > 166: 4577 0 0 0 gpio-mxc 6 2188000.ethernet Could you remove 'interrupts-extended' from the FEC node and also MX6QDL_PAD_GPIO_6__ENET_IRQ from the pinctrl node and test again? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-07 12:12 ` Russell King - ARM Linux 2014-08-07 14:20 ` Fabio Estevam @ 2014-08-08 18:09 ` Russell King - ARM Linux 2014-08-11 13:32 ` Mattis Lorentzon 1 sibling, 1 reply; 44+ messages in thread From: Russell King - ARM Linux @ 2014-08-08 18:09 UTC (permalink / raw) To: linux-arm-kernel On Thu, Aug 07, 2014 at 01:12:48PM +0100, Russell King - ARM Linux wrote: > On Thu, Aug 07, 2014 at 11:11:06AM +0000, Mattis Lorentzon wrote: > > Russell, > > > > > Can you ascertain whether these stalls are a result of some failure of the > > > receive side or the transmit side - you should be able to tell that if you watch > > > the packet counts via ifconfig on the stalled card. Also, it would be useful to > > > know whether the FEC interrupt was firing. > > > > grep eth /proc/interrupts > > 151: 0 0 0 0 GIC 151 2188000.ethernet > > 166: 1205661 0 0 0 gpio-mxc 6 2188000.ethernet > > > > The interrupt counter 166 increases regularly during the stalls. > > Ifconfig indicates that the RX and TX counters do not increase. > > Hmm, I'm slightly confused. On my iMX6Q, I have: > > 150: 581754 0 0 0 GIC 150 2188000.ethernet > 151: 0 0 0 0 GIC 151 2188000.ethernet > > In the DT file, we have: > > fec: ethernet at 02188000 { > compatible = "fsl,imx6q-fec"; > reg = <0x02188000 0x4000>; > interrupts-extended = > <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>, > <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>; > clocks = <&clks 117>, <&clks 117>, <&clks 190>; > clock-names = "ipg", "ahb", "ptp"; > status = "disabled"; > }; > > which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151. > Yet you seem to have nothing registered against GIC 150, instead having > an interrupt against GPIO 6. > > This seems very odd, and as this is an on-SoC device, I don't see why > you would want to bind the interrupts for the FEC device any differently > to standard platforms. > > This could well be the cause of your stalls. > > What's GPIO 6 used for on your board? 
We have a second report of instability with the FEC today, and the problem board (wandboard) is also using GPIO1 6 for the ethernet IRQ. We have confirmation from the reporter that reverting the change (thus making the FEC use the standard interrupt) fixes their problem. Therefore, it seems that the workaround for ERR006687 is itself buggy. I'd be interested to hear whether removing the interrupts-extended = ... property from your board's DT file, thereby causing you to revert back to the default I list above, also fixes the instability you are seeing. Thanks. -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-08 18:09 ` Russell King - ARM Linux @ 2014-08-11 13:32 ` Mattis Lorentzon 2014-08-11 17:41 ` Fabio Estevam 0 siblings, 1 reply; 44+ messages in thread From: Mattis Lorentzon @ 2014-08-11 13:32 UTC (permalink / raw) To: linux-arm-kernel Russell and Fabio, > I'd be interested to hear whether removing the > > interrupts-extended = ... > > property from your board's DT file, thereby causing you to revert back to the > default I list above, also fixes the instability you are seeing. We have tried to remove the board specific interrupts-extended field and the MX6QDL_PAD_GPIO_6__ENET_IRQ entry. Sadly this did not seem to improve the stalls. Our interrupts look like this now: 150: 15519 0 0 0 GIC 150 2188000.ethernet 151: 0 0 0 0 GIC 151 2188000.ethernet Our device tree might still be slightly incorrect. We have noticed that our RGMII_INT is connected to GPIO 19 (P5) which might be nonstandard (we are a bit surprised that this works at all). We are not quite sure how to configure this properly. Best regards, Mattis Lorentzon ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-11 13:32 ` Mattis Lorentzon @ 2014-08-11 17:41 ` Fabio Estevam 2014-08-13 13:39 ` Mattis Lorentzon 2014-08-14 14:43 ` Mattis Lorentzon 0 siblings, 2 replies; 44+ messages in thread From: Fabio Estevam @ 2014-08-11 17:41 UTC (permalink / raw) To: linux-arm-kernel On Mon, Aug 11, 2014 at 10:32 AM, Mattis Lorentzon <Mattis.Lorentzon@autoliv.com> wrote: > Russell and Fabio, > >> I'd be interested to hear whether removing the >> >> interrupts-extended = ... >> >> property from your board's DT file, thereby causing you to revert back to the >> default I list above, also fixes the instability you are seeing. > > We have tried to remove the board specific interrupts-extended field and the > MX6QDL_PAD_GPIO_6__ENET_IRQ entry. Sadly this did not seem to improve > the stalls. Our interrupts look like this now: > > 150: 15519 0 0 0 GIC 150 2188000.ethernet > 151: 0 0 0 0 GIC 151 2188000.ethernet > > Our device tree might still be slightly incorrect. We have noticed that our > RGMII_INT is connected to GPIO 19 (P5) which might be nonstandard (we are > a bit surprised that this works at all). We are not quite sure how to configure > this properly. In order to try to narrow down whether this is a board issue, could you try to run the same kernel on a mx6q development board, such as mx6qsabresd, cubox-i, wandboard, etc? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-11 17:41 ` Fabio Estevam @ 2014-08-13 13:39 ` Mattis Lorentzon 2014-08-25 10:18 ` Russell King - ARM Linux 2014-08-14 14:43 ` Mattis Lorentzon 1 sibling, 1 reply; 44+ messages in thread From: Mattis Lorentzon @ 2014-08-13 13:39 UTC (permalink / raw) To: linux-arm-kernel Fabio and Russell, > In order to try to narrow down whether this is a board issue, could you try to > run the same kernel on a mx6q development board, such as mx6qsabresd, > cubox-i, wandboard, etc? Indeed, we have a Sabrelite development board and have run the same kernel configuration (please find attached). Russell's 30 FEC-related patches are applied. We have also tried with and without the extended interrupts entry in the DT. All our tests seem to behave the same way on the Sabrelite as on our own board. A working theory is that the switch (3Com Switch 4400) triggers the degeneration of the network stack from which Linux does not seem to recover, even if we later bypass the switch and directly connect the board to the server machine. Since the problem is stochastic in nature we are not completely sure if we can trigger the problem without the switch. It's the switch that allows us to run many cards simultaneously and thus trigger the problem more easily. :-) What are your thoughts? Best regards, Mattis Lorentzon -------------- next part -------------- A non-text attachment was scrubbed... Name: config.gz Type: application/x-gzip Size: 14775 bytes Desc: config.gz URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140813/f7f2884e/attachment.bin> ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-13 13:39 ` Mattis Lorentzon @ 2014-08-25 10:18 ` Russell King - ARM Linux 2014-08-26 13:11 ` Iain Paton 0 siblings, 1 reply; 44+ messages in thread From: Russell King - ARM Linux @ 2014-08-25 10:18 UTC (permalink / raw) To: linux-arm-kernel On Wed, Aug 13, 2014 at 01:39:27PM +0000, Mattis Lorentzon wrote: > All our tests seem to behave the same way on the Sabrelite as on our own board. > A working theory is that the switch (3Com Switch 4400) triggers the degeneration > of the network stack from which Linux does not seem to recover, even if we later > bypass the switch and directly connect the board to the server machine. Please can you try something - what happens if you completely disable pause frame support (flow control) on all machines on the switch? -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-25 10:18 ` Russell King - ARM Linux @ 2014-08-26 13:11 ` Iain Paton 0 siblings, 0 replies; 44+ messages in thread From: Iain Paton @ 2014-08-26 13:11 UTC (permalink / raw) To: linux-arm-kernel On 25/08/14 11:18, Russell King - ARM Linux wrote: > On Wed, Aug 13, 2014 at 01:39:27PM +0000, Mattis Lorentzon wrote: >> All our tests seem to behave the same way on the Sabrelite as on our own board. >> A working theory is that the switch (3Com Switch 4400) triggers the degeneration >> of the network stack from which Linux does not seem to recover, even if we later >> bypass the switch and directly connect the board to the server machine. > > Please can you try something - what happens if you completely disable > pause frame support (flow control) on all machines on the switch? Russell, while trying to duplicate this I have flow-control disabled on the switch which leads to it being auto-negotiated off on all devices. Do you think it could be worth turning it on and trying again? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-11 17:41 ` Fabio Estevam 2014-08-13 13:39 ` Mattis Lorentzon @ 2014-08-14 14:43 ` Mattis Lorentzon 2014-08-14 15:30 ` Fabio Estevam 2014-08-22 8:27 ` Russell King - ARM Linux 1 sibling, 2 replies; 44+ messages in thread From: Mattis Lorentzon @ 2014-08-14 14:43 UTC (permalink / raw) To: linux-arm-kernel Fabio and Russell, > A working theory is that the switch (3Com Switch 4400) triggers the > degeneration of the network stack from which Linux does not seem to > recover, even if we later bypass the switch and directly connect the board to > the server machine. After a few more tests we have finally been able to trigger the exact same stalls on the Sabrelite board with a direct network connection (i.e. without the switch). Best regards, Mattis Lorentzon ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-14 14:43 ` Mattis Lorentzon @ 2014-08-14 15:30 ` Fabio Estevam 2014-08-15 5:42 ` Mattis Lorentzon 2014-08-22 8:27 ` Russell King - ARM Linux 1 sibling, 1 reply; 44+ messages in thread From: Fabio Estevam @ 2014-08-14 15:30 UTC (permalink / raw) To: linux-arm-kernel On Thu, Aug 14, 2014 at 11:43 AM, Mattis Lorentzon <Mattis.Lorentzon@autoliv.com> wrote: > After a few more tests we have finally been able to trigger the exact same stalls > on the Sabrelite board with a direct network connection (i.e. without the switch). Do the stalls also happen on a pure 3.16 kernel? How can we reproduce the error? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-14 15:30 ` Fabio Estevam @ 2014-08-15 5:42 ` Mattis Lorentzon 2014-08-17 21:34 ` Iain Paton 0 siblings, 1 reply; 44+ messages in thread From: Mattis Lorentzon @ 2014-08-15 5:42 UTC (permalink / raw) To: linux-arm-kernel Fabio, > Do the stalls also happen on a pure 3.16 kernel? Yes, we just tried this out overnight and we get the same stalls here. We have seen similar problems on a Zynq-based board. It might be worth noting that a common chip between all three boards is, for example, the KSZ9021RN, while the FEC driver, for example, only runs on the two iMX6-boards. > How can we reproduce the error? We mostly run SSH with benchmarks using NFS; it can probably be triggered by using only SSH with the following loop: # while : ; do ssh arm-card date; done Our (pure) 3.16 kernel uses the following config. http://lkml.iu.edu/hypermail/linux/kernel/1408.1/03045/config.gz (We have quite generously disabled a lot of sub-systems in our config.) Best regards, Mattis Lorentzon ^ permalink raw reply [flat|nested] 44+ messages in thread
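Since the number of iterations needed to hit a stall varies so much between runs, a counting variant of the loop in the message above may make reports easier to compare. This is a sketch only: the function name count_until_fail is made up for this note, and the 30-second timeout in the usage comment is an arbitrary assumption, not a value taken from the thread.

```shell
#!/bin/sh
# count_until_fail MAX CMD...: hypothetical helper, not part of the thread.
# Runs CMD up to MAX times and stops at the first non-zero exit status,
# reporting the iteration at which it happened.
count_until_fail() {
    max=$1
    shift
    i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        if ! "$@" >/dev/null 2>&1; then
            echo "failed at iteration $i"
            return 1
        fi
    done
    echo "no failure in $max iterations"
    return 0
}

# Possible use against the board (timeout value is an assumption):
#   count_until_fail 10000 timeout 30 ssh arm-card date
```

Wrapping ssh in timeout means that a hung connection counts as a failure instead of blocking the loop indefinitely, which seems to be what a silent stall would otherwise do.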
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-15 5:42 ` Mattis Lorentzon @ 2014-08-17 21:34 ` Iain Paton 2014-08-17 21:46 ` Fabio Estevam 0 siblings, 1 reply; 44+ messages in thread From: Iain Paton @ 2014-08-17 21:34 UTC (permalink / raw) To: linux-arm-kernel On 15/08/14 06:42, Mattis Lorentzon wrote: > We mostly run SSH with benchmarks using NFS, it can probably be > triggered by using only SSH with the following loop: > > # while : ; do ssh arm-card date; done Mattis, What sort of time does it take for you to see a problem? I've been running the above for nearly two days on 3.16.0 on a board with fec interrupts routed through gpio_6 and haven't seen a hint of a problem. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-17 21:34 ` Iain Paton @ 2014-08-17 21:46 ` Fabio Estevam 2014-08-19 6:03 ` Iain Paton 0 siblings, 1 reply; 44+ messages in thread From: Fabio Estevam @ 2014-08-17 21:46 UTC (permalink / raw) To: linux-arm-kernel Iain, On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton <ipaton0@gmail.com> wrote: > On 15/08/14 06:42, Mattis Lorentzon wrote: > >> We mostly run SSH with benchmarks using NFS, it can probably be >> triggered by using only SSH with the following loop: >> >> # while : ; do ssh arm-card date; done > > Mattis, > > What sort of time does it take for you to see a problem? > > I've been running the above for nearly two days on 3.16.0 on a board > with fec interrupts routed through gpio_6 and haven't seen a hint of > a problem. Thanks for testing. Which mx6 board have you used on this test? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-17 21:46 ` Fabio Estevam @ 2014-08-19 6:03 ` Iain Paton 2014-08-21 9:39 ` Iain Paton 0 siblings, 1 reply; 44+ messages in thread From: Iain Paton @ 2014-08-19 6:03 UTC (permalink / raw) To: linux-arm-kernel On 17/08/14 22:46, Fabio Estevam wrote: > Iain, > > On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton <ipaton0@gmail.com> wrote: >> On 15/08/14 06:42, Mattis Lorentzon wrote: >> >>> We mostly run SSH with benchmarks using NFS, it can probably be >>> triggered by using only SSH with the following loop: >>> >>> # while : ; do ssh arm-card date; done >> >> Mattis, >> >> What sort of time does it take for you to see a problem? >> >> I've been running the above for nearly two days on 3.16.0 on a board >> with fec interrupts routed through gpio_6 and haven't seen a hint of >> a problem. > > Thanks for testing. > > Which mx6 board have you used on this test? It's currently pointed at a RIoTboard (atheros phy) but I'm happy to try it against both a Sabre-Lite and a Wandboard B1, all running the same kernel binary, as well. I'm interested enough in why different people get different results with this that I'll put some time towards testing to try to help narrow down the cause. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-19 6:03 ` Iain Paton @ 2014-08-21 9:39 ` Iain Paton 2014-08-22 0:01 ` Fabio Estevam 2014-08-26 13:12 ` Iain Paton 0 siblings, 2 replies; 44+ messages in thread From: Iain Paton @ 2014-08-21 9:39 UTC (permalink / raw) To: linux-arm-kernel On 19/08/14 07:03, Iain Paton wrote: > On 17/08/14 22:46, Fabio Estevam wrote: >> Iain, >> >> On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton <ipaton0@gmail.com> wrote: >>> On 15/08/14 06:42, Mattis Lorentzon wrote: >>> >>>> We mostly run SSH with benchmarks using NFS, it can probably be >>>> triggered by using only SSH with the following loop: >>>> >>>> # while : ; do ssh arm-card date; done >>> >>> Mattis, >>> >>> What sort of time does it take for you to see a problem? >>> >>> I've been running the above for nearly two days on 3.16.0 on a board >>> with fec interrupts routed through gpio_6 and haven't seen a hint of >>> a problem. >> >> Thanks for testing. >> >> Which mx6 board have you used on this test? > > It's currently pointed at a RIoTboard (atheros phy) but I'm happy to > try it against both a Sabre-Lite and a Wandboard B1, all running the > same kernel binary, as well. > > I'm interested enough in why different people get different results > with this that I'll put some time towards testing to try to help > narrow down the cause. > two and a half days of running this against both a sabre-lite and a wandboard quad B1 and I still have no reason to think there's any sort of a problem. Up to now, my testing has been done with my own config, I'll now repeat the whole thing using the config Mattis posted to see if I can reproduce it that way. Suggestions on a better / easier / quicker way to reproduce it are welcome. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-21 9:39 ` Iain Paton @ 2014-08-22 0:01 ` Fabio Estevam 2014-08-22 6:39 ` Mattis Lorentzon 2014-08-22 10:36 ` Iain Paton 2014-08-26 13:12 ` Iain Paton 1 sibling, 2 replies; 44+ messages in thread From: Fabio Estevam @ 2014-08-22 0:01 UTC (permalink / raw) To: linux-arm-kernel On Thu, Aug 21, 2014 at 6:39 AM, Iain Paton <ipaton0@gmail.com> wrote: > two and a half days of running this against both a sabre-lite and a > wandboard quad B1 and I still have no reason to think there's any > sort of a problem. > > Up to now, my testing has been done with my own config, I'll now > repeat the whole thing using the config Mattis posted to see if > I can reproduce it that way. > > Suggestions on a better / easier / quicker way to reproduce it are > welcome. Thanks, Iain. Mattis, What is the silicon version of the mx6 in your sabrelite? What GCC version do you use? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2) 2014-08-22 0:01 ` Fabio Estevam @ 2014-08-22 6:39 ` Mattis Lorentzon 2014-08-22 10:36 ` Iain Paton 1 sibling, 0 replies; 44+ messages in thread From: Mattis Lorentzon @ 2014-08-22 6:39 UTC (permalink / raw) To: linux-arm-kernel Fabio, > What is the silicon version of the mx6 in your sabrelite? What GCC version do > you use? The silicon version is PCIMX6Q6AVT10AA and the GCC version we use is arm-none-eabi-gcc (Fedora 2013.11.24-2.fc19) 4.8.1. Iain, > Up to now, my testing has been done with my own config, I'll now > repeat the whole thing using the config Mattis posted to see if I can > reproduce it that way. Thanks for testing this. Could you also send me the config that you used for your Sabrelite? Do you know of any options that enable additional debug information about the network driver state (full buffers etc.)? Best regards, Mattis Lorentzon ^ permalink raw reply [flat|nested] 44+ messages in thread
* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-22 0:01 ` Fabio Estevam
  2014-08-22 6:39 ` Mattis Lorentzon
@ 2014-08-22 10:36 ` Iain Paton
  2014-08-27 6:32 ` Mattis Lorentzon
  1 sibling, 1 reply; 44+ messages in thread
From: Iain Paton @ 2014-08-22 10:36 UTC
To: linux-arm-kernel

On 22/08/14 01:01, Fabio Estevam wrote:
> On Thu, Aug 21, 2014 at 6:39 AM, Iain Paton <ipaton0@gmail.com> wrote:
>
>> two and a half days of running this against both a sabre-lite and a
>> wandboard quad B1 and I still have no reason to think there's any
>> sort of a problem.
>>
>> Up to now, my testing has been done with my own config, I'll now
>> repeat the whole thing using the config Mattis posted to see if
>> I can reproduce it that way.
>>
>> Suggestions on a better / easier / quicker way to reproduce it are
>> welcome.
>
> Thanks, Iain.
>
> Mattis,
>
> What is the silicon version of the mx6 in your sabrelite? What GCC
> version do you use?
>

For reference, both my SL and WBQUAD report silicon rev 1.2
The RIoTboard uses a Solo and reports silicon rev 1.1

I'm using vanilla gcc 4.9.1 and compiling the kernel natively on a
sabre-lite.
* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-22 10:36 ` Iain Paton
@ 2014-08-27 6:32 ` Mattis Lorentzon
  2014-08-27 10:43 ` Iain Paton
  0 siblings, 1 reply; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-27 6:32 UTC
To: linux-arm-kernel

Hi Iain, Russell and Fabio,

> The config is attached. Note that there's a lot of additional stuff enabled as
> I'm aiming for a single general purpose kernel that covers i.MX6, AM3359,
> Allwinner A10/A20 along with several versions of boards using those
> particular SoCs.
>
> Same kernel binary on all the boards I've tried this on, only real differences
> will be the devicetree and u-boot

Amazingly we have been able to run a complete nightly test on eight i.MX6
boards without hiccups using Iain's config! We had to modify it slightly to get
it to boot, please find attached the patch and Iain's patched config.

On Russell's suggestion we also began to disable flow control on the
machines. However it did not seem to make a difference because all our Zynq
cards stalled during the same test run (using our own Zynq config).

Iain's config seems promising and we will continue to run tests during
the next couple of days. We will also try to adapt Iain's config to our Zynq
board.

Many thanks for all suggestions, patches and configs so far!

Best regards,
Mattis Lorentzon

-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.patch
Type: application/octet-stream
Size: 692 bytes
Desc: config.patch
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140827/44ff0458/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.gz
Type: application/x-gzip
Size: 23527 bytes
Desc: config.gz
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140827/44ff0458/attachment-0001.bin>
* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-27 6:32 ` Mattis Lorentzon
@ 2014-08-27 10:43 ` Iain Paton
  2014-08-29 10:57 ` Mattis Lorentzon
  0 siblings, 1 reply; 44+ messages in thread
From: Iain Paton @ 2014-08-27 10:43 UTC
To: linux-arm-kernel

On 27/08/14 07:32, Mattis Lorentzon wrote:
> Hi Iain, Russell and Fabio,
>
>> The config is attached. Note that there's a lot of additional stuff enabled as
>> I'm aiming for a single general purpose kernel that covers i.MX6, AM3359,
>> Allwinner A10/A20 along with several versions of boards using those
>> particular SoCs.
>>
>> Same kernel binary on all the boards I've tried this on, only real differences
>> will be the devicetree and u-boot
>
> Amazingly we have been able to run a complete nightly test on eight i.MX6
> boards without hiccups using Iain's config! We had to modify it slightly to get
> it to boot, please find attached patch and Iain's patched config.

Interesting. We obviously have some differences in how we boot, my
changes to your config to get it to boot basically amount to reverting
the patch you attached and then enabling sata and mmc. So far I've been
unable to get your config to fail.

I'm attaching the patch showing what I changed in case it sheds any light
on what's going on, although I don't see why any of the changes make any
difference.

My kernel command line is also fairly obvious with nothing I'd think is
odd:

console=ttymxc1,115200n8 root=/dev/sda1 ro rootfstype=ext2 rootwait video= ahci-imx.hotplug=1

It would be good to know what makes my config work for you, I don't think
I've done anything special with it.

Iain

-------------- next part --------------
3c3
< # Linux/arm 3.16.0 Kernel Configuration
---
> # Linux/arm 3.16.0-rc2 Kernel Configuration
38c38
< CONFIG_KERNEL_GZIP=y
---
> # CONFIG_KERNEL_GZIP is not set
41c41
< # CONFIG_KERNEL_LZO is not set
---
> CONFIG_KERNEL_LZO=y
233,239c233
< CONFIG_PARTITION_ADVANCED=y
< # CONFIG_ACORN_PARTITION is not set
< # CONFIG_AIX_PARTITION is not set
< # CONFIG_OSF_PARTITION is not set
< # CONFIG_AMIGA_PARTITION is not set
< # CONFIG_ATARI_PARTITION is not set
< # CONFIG_MAC_PARTITION is not set
---
> # CONFIG_PARTITION_ADVANCED is not set
241,249d234
< # CONFIG_BSD_DISKLABEL is not set
< # CONFIG_MINIX_SUBPARTITION is not set
< # CONFIG_SOLARIS_X86_PARTITION is not set
< # CONFIG_UNIXWARE_DISKLABEL is not set
< # CONFIG_LDM_PARTITION is not set
< # CONFIG_SGI_PARTITION is not set
< # CONFIG_ULTRIX_PARTITION is not set
< # CONFIG_SUN_PARTITION is not set
< # CONFIG_KARMA_PARTITION is not set
251,252d235
< # CONFIG_SYSV68_PARTITION is not set
< # CONFIG_CMDLINE_PARTITION is not set
265,266d247
< CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
< CONFIG_RWSEM_SPIN_ON_OWNER=y
533,534c514,521
< # CONFIG_ARM_APPENDED_DTB is not set
< CONFIG_CMDLINE=""
---
> CONFIG_ARM_APPENDED_DTB=y
> CONFIG_ARM_ATAG_DTB_COMPAT=y
> CONFIG_ARM_ATAG_DTB_COMPAT_CMDLINE_FROM_BOOTLOADER=y
> # CONFIG_ARM_ATAG_DTB_COMPAT_CMDLINE_EXTEND is not set
> CONFIG_CMDLINE="___console=ttymxc0,115200 ___debug ___LOGLEVEL=8 ___initrd=0x11800040,12383491 ___dyndbg=\"file * +p\""
> CONFIG_CMDLINE_FROM_BOOTLOADER=y
> # CONFIG_CMDLINE_EXTEND is not set
> # CONFIG_CMDLINE_FORCE is not set
591c578
< # CONFIG_PM_TEST_SUSPEND is not set
---
> CONFIG_PM_TEST_SUSPEND=y
919d905
< CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
920a907
> CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
1045,1059c1032
< CONFIG_ATA=y
< # CONFIG_ATA_NONSTANDARD is not set
< CONFIG_ATA_VERBOSE_ERROR=y
< CONFIG_SATA_PMP=y
<
< #
< # Controllers with non-SFF native interface
< #
< CONFIG_SATA_AHCI=y
< CONFIG_SATA_AHCI_PLATFORM=y
< CONFIG_AHCI_IMX=y
< # CONFIG_SATA_INIC162X is not set
< # CONFIG_SATA_ACARD_AHCI is not set
< # CONFIG_SATA_SIL24 is not set
< # CONFIG_ATA_SFF is not set
---
> # CONFIG_ATA is not set
1786,1815c1759
< CONFIG_MMC=y
< # CONFIG_MMC_DEBUG is not set
< # CONFIG_MMC_CLKGATE is not set
<
< #
< # MMC/SD/SDIO Card Drivers
< #
< CONFIG_MMC_BLOCK=y
< CONFIG_MMC_BLOCK_MINORS=8
< CONFIG_MMC_BLOCK_BOUNCE=y
< # CONFIG_SDIO_UART is not set
< # CONFIG_MMC_TEST is not set
<
< #
< # MMC/SD/SDIO Host Controller Drivers
< #
< CONFIG_MMC_SDHCI=y
< CONFIG_MMC_SDHCI_IO_ACCESSORS=y
< # CONFIG_MMC_SDHCI_PCI is not set
< CONFIG_MMC_SDHCI_PLTFM=y
< # CONFIG_MMC_SDHCI_OF_ARASAN is not set
< CONFIG_MMC_SDHCI_ESDHC_IMX=y
< # CONFIG_MMC_SDHCI_PXAV3 is not set
< # CONFIG_MMC_SDHCI_PXAV2 is not set
< # CONFIG_MMC_MXC is not set
< # CONFIG_MMC_TIFM_SD is not set
< # CONFIG_MMC_CB710 is not set
< # CONFIG_MMC_VIA_SDMMC is not set
< # CONFIG_MMC_DW is not set
< # CONFIG_MMC_USDHI6ROL0 is not set
---
> # CONFIG_MMC is not set
1968d1911
< # CONFIG_WIMAX_GDM72XX is not set
* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-27 10:43 ` Iain Paton
@ 2014-08-29 10:57 ` Mattis Lorentzon
  2014-08-29 11:30 ` Fabio Estevam
  2014-12-16 14:50 ` Mattis Lorentzon
  0 siblings, 2 replies; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-29 10:57 UTC
To: linux-arm-kernel

Iain,

> Interesting. We obviously have some differences in how we boot, my
> changes to your config to get it to boot basically amount to reverting the
> patch you attached and then enabling sata and mmc. So far I've been unable
> to get your config to fail.

Our version of U-Boot doesn't support specifying a device tree separate from
the kernel, so we append it to the end of the kernel binary. We also enable
automatic configuration of IP addresses (CONFIG_IP_PNP). Our bootargs are:

  console=ttymxc1,115200
  ip=192.168.2.157:192.168.2.1:192.168.2.1:255.255.255.0:armcard:eth0:on
  earlyprintk enable_wait_mode=off

> It would be good to know what makes my config work for you, I don't think
> I've done anything special with it.

With a couple of modifications (attached) we have been able to get your
config running on our Zynq boards as well, solving our ethernet issues.

The serial port and ethernet are essentially the only things we use. No disks,
no graphics, no USB, etc. which is why we tried to reduce the kernel
configuration to a bare minimum. We have no idea which disabled and/or
enabled options are causing the stalls.

Best regards,
Mattis Lorentzon

-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.patch
Type: application/octet-stream
Size: 2670 bytes
Desc: config.patch
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140829/479869c2/attachment.obj>
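
The ip= boot argument quoted in the message above packs several settings into one colon-separated string. As a minimal sketch, assuming the field order given in the kernel's nfsroot documentation for ip= (client address, server address, gateway, netmask, hostname, device, autoconf), the string can be split into named parts. The names IP_FIELDS and parse_ip_bootarg are hypothetical and exist only for this illustration:

```python
# Field order assumed from the kernel's nfsroot documentation for "ip=".
# Later optional fields (DNS, NTP) are deliberately ignored in this sketch.
IP_FIELDS = ("client_ip", "server_ip", "gateway_ip", "netmask",
             "hostname", "device", "autoconf")


def parse_ip_bootarg(arg):
    """Split an 'ip=a:b:c:...' boot argument into a dict of named fields."""
    if not arg.startswith("ip="):
        raise ValueError("not an ip= argument")
    values = arg[len("ip="):].split(":")
    # Fields omitted at the end of the string are reported as empty strings.
    values += [""] * (len(IP_FIELDS) - len(values))
    return dict(zip(IP_FIELDS, values))


# The value below is the one quoted in the message above.
fields = parse_ip_bootarg(
    "ip=192.168.2.157:192.168.2.1:192.168.2.1:255.255.255.0:armcard:eth0:on")
for name, value in fields.items():
    print(f"{name:12s} {value}")
```

Read this way, the board uses a static address on eth0 with the same host acting as server and gateway.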
* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-29 10:57 ` Mattis Lorentzon
@ 2014-08-29 11:30 ` Fabio Estevam
  2014-12-16 14:50 ` Mattis Lorentzon
  1 sibling, 0 replies; 44+ messages in thread
From: Fabio Estevam @ 2014-08-29 11:30 UTC
To: linux-arm-kernel

Hi Mattis,

On Fri, Aug 29, 2014 at 7:57 AM, Mattis Lorentzon
<Mattis.Lorentzon@autoliv.com> wrote:
> Iain,
>
>> Interesting. We obviously have some differences in how we boot, my
>> changes to your config to get it to boot basically amount to reverting the
>> patch you attached and then enabling sata and mmc. So far I've been unable
>> to get your config to fail.
>
> Our version of U-boot doesn't support specifying a device tree separate from
> the kernel, so we append it to the end of the kernel binary. We also enable
> automatic configuration of IP addresses (CONFIG_IP_PNP). Our bootargs are:
> console=ttymxc1,115200
> ip=192.168.2.157:192.168.2.1:192.168.2.1:255.255.255.0:armcard:eth0:on
> earlyprintk enable_wait_mode=off

I suppose that this 'enable_wait_mode=off' is a leftover from the time
you used the FSL BSP. This is not needed in mainline.

>> It would be good to know what makes my config work for you, I don't think
>> I've done anything special with it.
>
> With a couple of modifications (attached) we have been able to get your
> config running on our Zynq boards as well, solving our ethernet issues.
>
> The serial port and ethernet are essentially the only things we use. No disks,
> no graphics, no USB, etc. which is why we tried to reduce the kernel
> configuration to a bare minimum. We have no idea which disabled and/or
> enabled options are causing the stalls.

It's good to hear you do not have the lockups anymore, but this is
still a big mystery for us, as we have not yet understood the root cause
or which 'guilty' kernel config option makes the FEC work unreliably.
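
One way to start looking for a 'guilty' option is to list every option whose value differs between two configs, so that the candidates can then be flipped in batches. The following is only a sketch under stated assumptions: parse_config and diff_configs are hypothetical helpers, options absent from a file are treated as not set, and the two excerpts are made-up fragments built from option lines that appear in the diff earlier in this thread. It does not by itself say which option matters.

```python
import re

# Matches the "# CONFIG_FOO is not set" form used in kernel .config files.
NOT_SET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each CONFIG_ option to its value; 'is not set' becomes 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = NOT_SET.match(line)
        if match:
            options[match.group(1)] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_configs(left, right):
    """Return sorted (option, left_value, right_value) for differing options.

    Assumption: an option missing from one config counts as 'n'.
    """
    changed = []
    for name in sorted(set(left) | set(right)):
        l_val, r_val = left.get(name, "n"), right.get(name, "n")
        if l_val != r_val:
            changed.append((name, l_val, r_val))
    return changed


# Hypothetical excerpts, not the full configs attached to the thread.
left = parse_config("""
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_LZO is not set
CONFIG_ATA=y
# CONFIG_PM_TEST_SUSPEND is not set
""")
right = parse_config("""
# CONFIG_KERNEL_GZIP is not set
CONFIG_KERNEL_LZO=y
# CONFIG_ATA is not set
CONFIG_PM_TEST_SUSPEND=y
""")
for name, l_val, r_val in diff_configs(left, right):
    print(f"{name}: left={l_val} right={r_val}")
```

Halving the resulting list on each test run would bound the search to roughly log2(N) nightly runs for N differing options, assuming a single option is responsible.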
* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-29 10:57 ` Mattis Lorentzon
  2014-08-29 11:30 ` Fabio Estevam
@ 2014-12-16 14:50 ` Mattis Lorentzon
  1 sibling, 0 replies; 44+ messages in thread
From: Mattis Lorentzon @ 2014-12-16 14:50 UTC
To: linux-arm-kernel

Hi Russell,

> Now because things have changed during the last merge window, I've got
> an even bigger problem sorting through that patch set and getting it
> back into a submittable state. I've just sent out v2 for it onto the
> netdev at vger.kernel.org mailing list.
>
> The initial version (marked RFC) attracted very little interest from
> testers, or acks. I'd very much like to have some testing of it, so
> if you want to try it out, I can provide you with a git URL, patches or a
> combined patch.

We have run v3.16 for about three months now, with many millions of ssh
connections on eight separate systems, both without and with your network
patches.

Our conclusion is that the patches clearly reduce the number of network
timeouts, and this is a great improvement. However, after a month or so of
uptime, the number of timeouts began to increase again, forcing us to reboot
the cards.

Best regards,
Mattis Lorentzon
* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-21 9:39 ` Iain Paton
  2014-08-22 0:01 ` Fabio Estevam
@ 2014-08-26 13:12 ` Iain Paton
  1 sibling, 0 replies; 44+ messages in thread
From: Iain Paton @ 2014-08-26 13:12 UTC
To: linux-arm-kernel

On 21/08/14 10:39, Iain Paton wrote:
> On 19/08/14 07:03, Iain Paton wrote:
>> On 17/08/14 22:46, Fabio Estevam wrote:
>>> Iain,
>>>
>>> On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton <ipaton0@gmail.com> wrote:
>>>> On 15/08/14 06:42, Mattis Lorentzon wrote:
>>>>
>>>>> We mostly run SSH with benchmarks using NFS, it can probably be
>>>>> triggered by using only SSH with the following loop:
>>>>>
>>>>> # while : ; do ssh arm-card date; done
>>>>
>>>> Mattis,
>>>>
>>>> What sort of time does it take for you to see a problem?
>>>>
>>>> I've been running the above for nearly two days on 3.16.0 on a board
>>>> with fec interrupts routed through gpio_6 and haven't seen a hint of
>>>> a problem.
>>>
>>> Thanks for testing.
>>>
>>> Which mx6 board have you used on this test?
>>
>> It's currently pointed at a RIoTboard (atheros phy) but I'm happy to
>> try it against both a Sabre-Lite and a Wandboard B1, all running the
>> same kernel binary, as well.
>>
>> I'm interested enough in why different people get different results
>> with this that I'll put some time towards testing to try to help
>> narrow down the cause.
>>
>
> two and a half days of running this against both a sabre-lite and a
> wandboard quad B1 and I still have no reason to think there's any
> sort of a problem.
>
> Up to now, my testing has been done with my own config, I'll now
> repeat the whole thing using the config Mattis posted to see if
> I can reproduce it that way.
>
> Suggestions on a better / easier / quicker way to reproduce it are
> welcome.
>

So I wasn't able to use Mattis' exact configuration as I couldn't get it
to boot properly on anything.

I made enough changes to enable mmc/sata and to disable the compiled-in
kernel command line and appended devicetree and initrd. Even then it
still won't boot on my WBQUAD. It is running on Sabre-Lite and RIoTboard
though, so useful enough to test against the SL in a similar manner to
Mattis' tests with SL.

I've had the test running against both for approx one day and again no
sign of any problems.

I'm happy to leave this running, but at this stage I'm not expecting
I'll see any problems even if I leave it running for a week.
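
The ssh loop quoted above runs until someone notices a hang by eye. A variant that records which iterations failed, and how, may make long unattended runs easier to compare across boards. This is a sketch only: stress is a hypothetical helper, the 30 second timeout is an arbitrary assumption, and nothing here was used by the participants in this thread.

```python
import subprocess
import time


def stress(cmd, runs, timeout_s):
    """Run cmd repeatedly and return a list of (run_index, reason) failures.

    A run counts as failed if the command exits non-zero or does not
    finish within timeout_s seconds (a possible sign of a network stall).
    """
    failures = []
    for i in range(runs):
        start = time.monotonic()
        try:
            result = subprocess.run(cmd,
                                    stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL,
                                    timeout=timeout_s)
            if result.returncode != 0:
                failures.append((i, f"exit status {result.returncode}"))
        except subprocess.TimeoutExpired:
            elapsed = time.monotonic() - start
            failures.append((i, f"no reply after {elapsed:.1f}s"))
    return failures
```

For the loop from the thread, a call might look like stress(["ssh", "arm-card", "date"], runs=100000, timeout_s=30), where arm-card is the host name used in the quoted command.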
* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-14 14:43 ` Mattis Lorentzon
  2014-08-14 15:30 ` Fabio Estevam
@ 2014-08-22 8:27 ` Russell King - ARM Linux
  1 sibling, 0 replies; 44+ messages in thread
From: Russell King - ARM Linux @ 2014-08-22 8:27 UTC
To: linux-arm-kernel

On Thu, Aug 14, 2014 at 02:43:56PM +0000, Mattis Lorentzon wrote:
> Fabio and Russell,
>
> > A working theory is that the switch (3Com Switch 4400) triggers the
> > degeneration of the network stack from which Linux does not seem to
> > recover, even if we later bypass the switch and directly connect the board to
> > the server machine.
>
> After a few more tests we have finally been able to trigger the exact
> same stalls on the Sabrelite board with a direct network connection
> (i.e. without the switch).

That's a setup which I can't reproduce, as all my MX6 hardware runs
root-NFS, so using a direct connection to a machine to test will result
in the MX6 losing its root filesystem.

That said, on SolidRun hardware, there is some investigation going on
at the moment concerning poor UDP performance - this is an on-going
problem that has been present for a long time.

What we find is that TCP performance achieves around the 600Mbps mark,
but UDP performance can be extremely poor with high packet loss. Adding
a udelay(210) to fec_enet_rx() can perversely (on multi-core SoCs)
increase UDP performance to around 500Mbps at the expense of a
reduction in TCP performance.

This "solution" was tripped over while trying to debug this problem,
and it was found that adding printk()s to the driver increased UDP
performance - so substituting udelay() for printk() was then tried.

I tried to run perf on the kernel yesterday to find out what's going
on, but for some reason, perf gave me impossible call traces, so I
gave up with that idea. For example, perf told me that there was a
high hit rate in memcpy() being called from net_rx_action(), but
net_rx_action() doesn't call memcpy(), nor do any of the called
functions as a tail-call.

That said, I don't think perf could tell us what's going on - what we
need is a trace of the CPU's execution while iperf is running, *without*
affecting the CPU itself. This is something I can't do with the hardware
I have.

My suspicion (unproven) is that a batch of packets gets processed in the
softirq handler called during the FEC interrupt exit path. Then, because
there's more work to be done, ksoftirqd is scheduled, but it takes time
for ksoftirqd to start running - during which time we drop a lot of
packets. ksoftirqd processes some packets, but then finds that it can't
complete the NAPI "work budget", and so stops running, resulting in the
packet processing being triggered by the next FEC interrupt, and the
cycle repeats.

TCP notices this, and adjusts its sending rate to match, whereas UDP
just carries on regardless, resulting in lots of packets dropped each
time we switch from the tail of hardirq processing to ksoftirqd.

With the udelay() in place, processing takes enough time that it gets
bounced onto ksoftirqd, where it stays.

I'm adding this to this thread in case it has any bearing on the
problem(s) you're seeing - yes, it seems like a different problem, but
could it be related...

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
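
The suspicion above hinges on packets being lost while nobody is servicing the receive ring. A toy model, which is an assumption-laden illustration and not a model of the FEC driver or of NAPI, shows the basic effect: a fixed-size ring fed at a constant rate loses packets as soon as the gap between service runs is longer than the time needed to fill it. The function dropped_packets and all the numbers are hypothetical; the NAPI work budget and TCP rate adaptation are deliberately left out.

```python
def dropped_packets(ring_size, arrivals_per_tick, gap_ticks, ticks):
    """Count packets dropped by a fixed-size ring serviced at intervals.

    Every tick, arrivals_per_tick packets arrive; those that do not fit
    in the ring are dropped. The consumer only runs on every
    (gap_ticks + 1)-th tick, where it empties the ring completely.
    gap_ticks stands in for the delay before processing resumes.
    """
    queued = 0
    dropped = 0
    for tick in range(ticks):
        accepted = min(arrivals_per_tick, ring_size - queued)
        dropped += arrivals_per_tick - accepted
        queued += accepted
        if tick % (gap_ticks + 1) == gap_ticks:
            queued = 0  # consumer runs and drains the ring
    return dropped


# Hypothetical numbers: a 64-entry ring receiving 8 packets per tick.
for gap in (0, 7, 15, 31):
    lost = dropped_packets(ring_size=64, arrivals_per_tick=8,
                           gap_ticks=gap, ticks=160)
    print(f"gap of {gap:2d} ticks: {lost} of {8 * 160} packets dropped")
```

In this model nothing is lost while the gap is short enough for the ring to absorb the arrivals, and losses grow quickly once it is not, which is at least consistent with a sender that does not back off (UDP) suffering far more than one that does (TCP).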