Oops: 17 SMP ARM (v3.16-rc2)

linux-arm-kernel.lists.infradead.org archive mirror

* Oops: 17 SMP ARM (v3.16-rc2)
       [not found] ` <20140626140115.GQ32514@n2100.arm.linux.org.uk>
@ 2014-06-26 14:44   ` Mattis Lorentzon
  2014-06-26 15:14     ` Russell King - ARM Linux
  0 siblings, 1 reply; 44+ messages in thread
From: Mattis Lorentzon @ 2014-06-26 14:44 UTC (permalink / raw)
  To: linux-arm-kernel

Thank you for your reply,

> On Wed, Jun 25, 2014 at 01:55:05PM +0000, Mattis Lorentzon wrote:
> > I have a similar issue with v3.16-rc2 as previously reported by Waldemar
> > Brodkorb for v3.15-rc4.
> > https://lkml.org/lkml/2014/5/9/330
> 
> This URL returns no useful information.  I find that lkml.org is broken more
> times than not in recent years.  Please use a different archive site when
> referring to posts, thanks.

http://lkml.iu.edu/hypermail/linux/kernel/1405.1/01114.html

> I have had two iMX6 platforms running root-NFS for about the last six to nine
> months with various workloads, and have never seen this oops.
> Unfortunately, the description above gives very little information for what
> the mechanism to trigger this bug may be.  For example, if I wanted to
> reproduce it, what would I need to do?

We have managed to trigger the Oops just by transferring a large file over NFS:

    cat /mnt/foo > /dev/null

where foo is a file of approximately 2 GB. There may be some packet loss
on this network; perhaps this differs from your workload?

> > The error is sporadic and it seems to occur more frequently when using
> > perf.
> 
> So it occurs when not using perf?

Yes, certainly, see above.

We have done some more investigation; please find it in this mail:

http://lkml.iu.edu/hypermail/linux/kernel/1406.3/02190.html

The Oops seems to have been introduced somewhere between v3.12 and v3.13:

- The Oops is reproducible within seconds when running Linux 3.16-rc2.
- We have observed the Oops on 8 different hardware units and two different chipsets (Freescale i.MX6 and Xilinx Zynq).
- The Oops has not been seen on Linux 3.12 so it appears to be good.
- The Oops has been seen on Linux 3.13, 3.14, 3.15, 3.16-rc2 so these appear to be bad.

Configs and a couple of Oops reports are attached to the linked mail.

Best regards,
Mattis Lorentzon
***************************************************************
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***************************************************************

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-06-26 14:44   ` Oops: 17 SMP ARM (v3.16-rc2) Mattis Lorentzon
@ 2014-06-26 15:14     ` Russell King - ARM Linux
  2014-06-27 11:21       ` Russell King - ARM Linux
  0 siblings, 1 reply; 44+ messages in thread
From: Russell King - ARM Linux @ 2014-06-26 15:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jun 26, 2014 at 02:44:52PM +0000, Mattis Lorentzon wrote:
> Thank you for your reply,
> 
> > On Wed, Jun 25, 2014 at 01:55:05PM +0000, Mattis Lorentzon wrote:
> > > I have a similar issue with v3.16-rc2 as previously reported by Waldemar
> > > Brodkorb for v3.15-rc4.
> > > https://lkml.org/lkml/2014/5/9/330
> >
> > This URL returns no useful information.  I find that lkml.org is broken more
> > times than not in recent years.  Please use a different archive site when
> > referring to posts, thanks.
> 
> http://lkml.iu.edu/hypermail/linux/kernel/1405.1/01114.html

I remember that report, but it was never resolved, as I think no one has
any idea what is causing these, or where to start looking.

> We have managed to trigger the Oops by just transferring a large file
> over nfs
> cat /mnt/foo > /dev/null
> where foo is a file that is approximately 2 GB. There may be some
> packet losses on this network, perhaps this differs from your workload?

That's a similar workload to the one which is mentioned in the previous
report.  I've just set a similar transfer going, but this will be a 16GB
file.

> We have done some more investigations, please find it in this mail:
> 
> http://lkml.iu.edu/hypermail/linux/kernel/1406.3/02190.html

Yes, I saw that before I replied, and my reply was written with that
message in mind.  That's what prompted this paragraph in my previous
reply:

"Your other oops dumps also show various other functions apparently
returning 0xffffffff.  I can't believe that there's more than one bug
doing this, so I doubt the problem is in these functions.  Something
else must be going on."

One of the problems is that there's so much work going on with the
kernel by many different parties, pulling it in various directions,
that no one really has an overview of all the changes, and so no one
has much of a feel for what could be the cause of weird bugs like this.

I don't know what to suggest - you could try using git bisect to see
if you can track it down to a particular commit, but it sounds like
that's going to be very time consuming.  You mentioned that 3.12
doesn't show the bug, but 3.13 does - so start off telling git bisect
that 3.12 is "good" and 3.13 is "bad".

Hopefully there won't be too many breakages during the 3.13 merge
window (between 3.12 and 3.13-rc1), but I don't have much faith in
that; people seem to have a habit of holding back fixes until -rc1,
which makes _exactly_ this kind of bug much harder for people like
yourselves to track down - or maybe even impossible.
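
If the bisection is driven by `git bisect run`, commits that fail to build or boot for unrelated reasons can be skipped rather than misjudged: git treats exit status 0 as good, 125 as "cannot be tested, skip", and any other status from 1 to 127 as bad. A minimal Python sketch of that mapping follows. The `classify` helper and its three inputs are hypothetical, and the steps that build the kernel, boot the board and run the NFS transfer are assumed to exist elsewhere.

```python
# Exit statuses understood by `git bisect run`.
GOOD = 0    # commit behaves correctly
BAD = 1     # commit shows the bug (any status 1..127 except 125 counts as bad)
SKIP = 125  # commit cannot be tested; git picks a nearby commit instead


def classify(built, booted, oops_seen):
    """Map the outcome of one test run onto a `git bisect run` exit status.

    built, booted and oops_seen are hypothetical booleans that a real
    harness would have to produce (build log, serial console, and so on).
    """
    if not built or not booted:
        # A kernel that does not build or boot says nothing about the Oops.
        return SKIP
    return BAD if oops_seen else GOOD
```

A wrapper script would end with `sys.exit(classify(...))`, and the session could be started with `git bisect start v3.13 v3.12` followed by `git bisect run` and the path to that script.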

I'm afraid I can't offer very much help beyond this until either I can
reproduce it, or someone manages to identify a particular change which
caused this.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-06-26 15:14     ` Russell King - ARM Linux
@ 2014-06-27 11:21       ` Russell King - ARM Linux
  2014-06-27 16:16         ` Fredrik Noring
  0 siblings, 1 reply; 44+ messages in thread
From: Russell King - ARM Linux @ 2014-06-27 11:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jun 26, 2014 at 04:14:24PM +0100, Russell King - ARM Linux wrote:
> On Thu, Jun 26, 2014 at 02:44:52PM +0000, Mattis Lorentzon wrote:
> > We have managed to trigger the Oops by just transferring a large file
> > over nfs
> > cat /mnt/foo > /dev/null
> > where foo is a file that is approximately 2 GB. There may be some
> > packet losses on this network, perhaps this differs from your workload?
> 
> That's a similar workload to the one which is mentioned in the previous
> report.  I've just set a similar transfer going, but this will be a 16GB
> file.

I've run this transfer several times, but so far I've been unable to
reproduce the issue here.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-06-27 11:21       ` Russell King - ARM Linux
@ 2014-06-27 16:16         ` Fredrik Noring
  2014-06-27 16:31           ` Russell King - ARM Linux
  0 siblings, 1 reply; 44+ messages in thread
From: Fredrik Noring @ 2014-06-27 16:16 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

> On Thu, Jun 26, 2014 at 04:14:24PM +0100, Russell King - ARM Linux wrote:
> > That's a similar workload to the one which is mentioned in the
> > previous report.  I've just set a similar transfer going, but this
> > will be a 16GB file.
> 
> I've run this transfer several times, but so far I've been unable to
> reproduce the issue here.

Many thanks for testing this. We attempted to bisect, but unfortunately the
result was not conclusive. One reason might be that the config had to be
updated during the process, so we did not end up with exactly the same
configuration throughout (e.g. IMX_SDMA in DMA_ENGINE). Some runs
deadlocked without any visible Oops or printout. Some versions did not have
an entirely working console configuration.

Please find below a trace that appeared once with 3.16-rc2. Perhaps it is of
some interest?

(We also had memtester run for days on the i.MX6 hardware, without issues.)

All the best,
Fredrik

------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x270/0x27c()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-rc2 #19
Backtrace: 
[<80012390>] (dump_backtrace) from [<8001266c>] (show_stack+0x18/0x1c)
 r6:00000108 r5:00000000 r4:8064e29c r3:00000000
[<80012654>] (show_stack) from [<8049791c>] (dump_stack+0x8c/0x9c)
[<80497890>] (dump_stack) from [<80024f4c>] (warn_slowpath_common+0x74/0x90)
 r5:00000009 r4:80631d70
[<80024ed8>] (warn_slowpath_common) from [<80024fa0>] (warn_slowpath_fmt+0x38/0x40)
 r8:806320c0 r7:9d85a254 r6:9d879000 r5:9d85a000 r4:00000000
[<80024f6c>] (warn_slowpath_fmt) from [<803b8ff0>] (dev_watchdog+0x270/0x27c)
 r3:9d85a000 r2:805c4790
[<803b8d80>] (dev_watchdog) from [<8002f280>] (call_timer_fn+0x6c/0xe4)
 r10:80630008 r9:9d85a000 r8:803b8d80 r7:00000100 r6:80630000 r5:00000001
 r4:80631dd8
[<8002f214>] (call_timer_fn) from [<8002fec8>] (run_timer_softirq+0x1d4/0x254)
 r10:803b8d80 r9:806320c0 r8:9d85a000 r7:00000000 r6:80631e28 r5:80667040
 r4:9d85a284
[<8002fcf4>] (run_timer_softirq) from [<8002945c>] (__do_softirq+0x17c/0x30c)
 r10:00000001 r9:80632080 r8:40000001 r7:80630000 r6:00000100 r5:80632084
 r4:00000020
[<800292e0>] (__do_softirq) from [<80029920>] (irq_exit+0xd0/0x114)
 r10:80630000 r9:80665f19 r8:00000001 r7:f4000100 r6:00000000 r5:80630008
 r4:80630000
[<80029850>] (irq_exit) from [<8000f348>] (handle_IRQ+0x4c/0x98)
 r5:0000001d r4:8062ce44
[<8000f2fc>] (handle_IRQ) from [<80008614>] (gic_handle_irq+0x34/0x64)
 r6:80631f20 r5:80638a40 r4:f400010c r3:000000a0
[<800085e0>] (gic_handle_irq) from [<800131c4>] (__irq_svc+0x44/0x58)
Exception stack(0x80631f20 to 0x80631f68)
1f20: 00000001 00000001 00000000 8063b6f0 8063852c 806384d8 80665f19 804a0040
1f40: 00000001 80665f19 80630000 80631f74 00000000 80631f68 800614b8 8000f6a8
1f60: 200f0013 ffffffff
 r7:80631f54 r6:ffffffff r5:200f0013 r4:8000f6a8
[<8000f67c>] (arch_cpu_idle) from [<8005cbf8>] (cpu_startup_entry+0x10c/0x164)
[<8005caec>] (cpu_startup_entry) from [<80492b68>] (rest_init+0xc8/0xd8)
 r7:80625028 r3:00000000
[<80492aa0>] (rest_init) from [<805f6c5c>] (start_kernel+0x39c/0x3a8)
 r5:00000001 r4:806385d0
[<805f68c0>] (start_kernel) from [<10008074>] (0x10008074)
---[ end trace a7b7109ab2d04e11 ]---

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-06-27 16:16         ` Fredrik Noring
@ 2014-06-27 16:31           ` Russell King - ARM Linux
  2014-06-30  6:22             ` Fredrik Noring
                               ` (3 more replies)
  0 siblings, 4 replies; 44+ messages in thread
From: Russell King - ARM Linux @ 2014-06-27 16:31 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Fredrik,

On Fri, Jun 27, 2014 at 04:16:57PM +0000, Fredrik Noring wrote:
> Please find below a trace that appeared once with 3.16-rc2. Perhaps it is of
> some interest?

It's not that serious... I know that the FEC ethernet driver is
horrendously racy (I have had a patch set for about the last six months
which fixes some of its problems) but as I've had a lot of patches to
deal with, and it's been pushed to the back of the queue...

The races don't lead to data corruption though, merely timeouts and
some lost packets.

Now because things have changed during the last merge window, I've got
an even bigger problem sorting through that patch set and getting it
back into a submittable state.  I've just sent out v2 of it to the
netdev@vger.kernel.org mailing list.

The initial version (marked RFC) attracted very little interest from
testers, or acks.  I'd very much like to have some testing of it, so
if you want to try it out, I can provide you with a git URL, patches
or a combined patch.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-06-27 16:31           ` Russell King - ARM Linux
@ 2014-06-30  6:22             ` Fredrik Noring
  2014-06-30 12:30             ` Fredrik Noring
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 44+ messages in thread
From: Fredrik Noring @ 2014-06-30  6:22 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

> -----Original Message-----
> It's not that serious... I know that the FEC ethernet driver is horrendously
> racy (I have had a patch set for about the last six months which fixes some of
> its problems) but as I've had a lot of patches to deal with, and it's been
> pushed to the back of the queue...
> 
> The races don't lead to data corruption though, merely timeouts and some
> lost packets.

The serial port (uart1) and Ethernet are essentially the only things we use.
No disks, no graphics, no USB, etc. If not the Ethernet driver, what else is
likely to crash NFS so badly?

Also, we are happy to change our config if that would simplify things:

http://lkml.iu.edu/hypermail/linux/kernel/1406.3/01488/config.gz

> Now because things have changed during the last merge window, I've got an
> even bigger problem sorting through that patch set and getting it back into a
> submittable state.  I've just sent out v2 for it onto the
> netdev at vger.kernel.org mailing list.
> 
> The initial version (marked RFC) attracted very little interest from testers, or
> acks.  I'd very much like to have some testing of it, so if you want to try it
> out, I can provide you with a git URL, patches or a combined patch.

Sure! A combined gzip patch attachment is fine. Git over HTTP probably works
too.

All the best,
Fredrik


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-06-27 16:31           ` Russell King - ARM Linux
  2014-06-30  6:22             ` Fredrik Noring
@ 2014-06-30 12:30             ` Fredrik Noring
  2014-06-30 13:00               ` Nathan Lynch
  2014-07-02  6:02             ` Fredrik Noring
  2014-08-05 13:31             ` Mattis Lorentzon
  3 siblings, 1 reply; 44+ messages in thread
From: Fredrik Noring @ 2014-06-30 12:30 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

It seems to be a compiler issue: GCC 4.8.2 does not produce a properly
working kernel. Happily, GCC 4.8.1 (Fedora 2013.11.24-2.fc19) appears to do
a lot better. No crashes so far with v3.16-rc2!

All the best,
Fredrik

> -----Original Message-----
> Hi Fredrik,
> 
> On Fri, Jun 27, 2014 at 04:16:57PM +0000, Fredrik Noring wrote:
> > Please find below a trace that appeared once with 3.16-rc2. Perhaps it
> > is of some interest?
> 
> It's not that serious... I know that the FEC ethernet driver is horrendously
> racy (I have had a patch set for about the last six months which fixes some of
> its problems) but as I've had a lot of patches to deal with, and it's been
> pushed to the back of the queue...
> 
> The races don't lead to data corruption though, merely timeouts and some
> lost packets.
> 
> Now because things have changed during the last merge window, I've got an
> even bigger problem sorting through that patch set and getting it back into a
> submittable state.  I've just sent out v2 for it onto the
> netdev at vger.kernel.org mailing list.
> 
> The initial version (marked RFC) attracted very little interest from testers, or
> acks.  I'd very much like to have some testing of it, so if you want to try it
> out, I can provide you with a git URL, patches or a combined patch.
> 
> --
> FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
> improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-06-30 12:30             ` Fredrik Noring
@ 2014-06-30 13:00               ` Nathan Lynch
  0 siblings, 0 replies; 44+ messages in thread
From: Nathan Lynch @ 2014-06-30 13:00 UTC (permalink / raw)
  To: linux-arm-kernel

On 06/30/2014 07:30 AM, Fredrik Noring wrote:
>>
>> On Fri, Jun 27, 2014 at 04:16:57PM +0000, Fredrik Noring wrote:
>>> Please find below a trace that appeared once with 3.16-rc2. Perhaps it
>>> is of some interest?
>>
>> It's not that serious... I know that the FEC ethernet driver is horrendously
>> racy (I have had a patch set for about the last six months which fixes some of
>> its problems) but as I've had a lot of patches to deal with, and it's been
>> pushed to the back of the queue...
>>
>> The races don't lead to data corruption though, merely timeouts and some
>> lost packets.

> It seems to be a compiler issue, where (GCC) 4.8.2 does not produce a
> properly working kernel. Happily, (Fedora 2013.11.24-2.fc19) 4.8.1
> appears to do a lot better. No crashes so far with v3.16-rc2!
>

Did you narrow it down to a particular GCC bug?  The symptoms you
reported remind me of:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854

Sadly, unpatched GCC 4.8.1 and 4.8.2 are unsuitable for building ARM
kernels.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-06-27 16:31           ` Russell King - ARM Linux
  2014-06-30  6:22             ` Fredrik Noring
  2014-06-30 12:30             ` Fredrik Noring
@ 2014-07-02  6:02             ` Fredrik Noring
  2014-08-05 13:31             ` Mattis Lorentzon
  3 siblings, 0 replies; 44+ messages in thread
From: Fredrik Noring @ 2014-07-02  6:02 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

> -----Original Message-----
> > The initial version (marked RFC) attracted very little interest from
> > testers, or acks.  I'd very much like to have some testing of it, so
> > if you want to try it out, I can provide you with a git URL, patches
> > or a combined patch.
> 
> Sure! A combined gzip patch attachment is fine. Git over HTTP probably
> works too.

We are still interested in trying out your patches to improve network
performance. We can do some testing this week and in August.

Best regards,
Fredrik


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-06-27 16:31           ` Russell King - ARM Linux
                               ` (2 preceding siblings ...)
  2014-07-02  6:02             ` Fredrik Noring
@ 2014-08-05 13:31             ` Mattis Lorentzon
  2014-08-05 13:53               ` Fabio Estevam
  2014-08-06  9:50               ` Russell King - ARM Linux
  3 siblings, 2 replies; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-05 13:31 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell!

> Now because things have changed during the last merge window, I've got an
> even bigger problem sorting through that patch set and getting it back into a
> submittable state.  I've just sent out v2 for it onto the
> netdev at vger.kernel.org mailing list.
> 
> The initial version (marked RFC) attracted very little interest from testers, or
> acks.  I'd very much like to have some testing of it, so if you want to try it out,
> I can provide you with a git URL, patches or a combined patch.

We have applied your V2 patch set of 30 patches on top of v3.16-rc2 and are
currently running some stability tests.

During our first test round we triggered a timeout which caused the fec driver
to become unresponsive for several minutes. The attached backtrace was
shown when the hardware was rebooted.

Best regards,
Mattis Lorentzon

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fec-transmit-queue-timed-out.txt
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140805/048707db/attachment-0001.txt>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-05 13:31             ` Mattis Lorentzon
@ 2014-08-05 13:53               ` Fabio Estevam
  2014-08-06  6:48                 ` Mattis Lorentzon
  2014-08-06  9:50               ` Russell King - ARM Linux
  1 sibling, 1 reply; 44+ messages in thread
From: Fabio Estevam @ 2014-08-05 13:53 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Aug 5, 2014 at 10:31 AM, Mattis Lorentzon
<Mattis.Lorentzon@autoliv.com> wrote:

> We have applied your V2 patch set of 30 patches on top of v3.16-rc2 and are
> currently running some stability tests.
>
> During our first test round we triggered a timeout which caused the fec driver
> to become unresponsive for several minutes. The attached backtrace was
> shown when the hardware was rebooted.

Could this problem be the same one as reported at:
http://www.spinics.net/lists/arm-kernel/msg347914.html ?

Which Ethernet PHY do you use? Do you have a pull-up on the MDIO line?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-05 13:53               ` Fabio Estevam
@ 2014-08-06  6:48                 ` Mattis Lorentzon
  0 siblings, 0 replies; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-06  6:48 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Fabio,

> Could this problem be the same one as reported at:
> http://www.spinics.net/lists/arm-kernel/msg347914.html ?

The problem you link to describes a permanent issue; our problem seems
to be sporadic, as most of our tests work fine (at least for a while).

> Which Ethernet PHY do you use? Do you have a pull-up on the MDIO line?

Our hardware has the KSZ9021RN PHY, so the MDIO line should have a pull-up.

Do you know if there are debug options that could help us determine the
cause of the timeout?

Best regards,
Mattis Lorentzon


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-05 13:31             ` Mattis Lorentzon
  2014-08-05 13:53               ` Fabio Estevam
@ 2014-08-06  9:50               ` Russell King - ARM Linux
  2014-08-06 11:10                 ` Mattis Lorentzon
  1 sibling, 1 reply; 44+ messages in thread
From: Russell King - ARM Linux @ 2014-08-06  9:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Aug 05, 2014 at 01:31:29PM +0000, Mattis Lorentzon wrote:
> We have applied your V2 patch set of 30 patches on top of v3.16-rc2 and are
> currently running some stability tests.
> 
> During our first test round we triggered a timeout which caused the fec driver
> to become unresponsive for several minutes. The attached backtrace was
> shown when the hardware was rebooted.

What is on the other end of the link?

> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x270/0x27c()
> NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
...
> fec 2188000.ethernet eth0: TX ring dump
> Nr     SC     addr       len  SKB
>   0    0x1c00 0x00000000   66   (null)
...
>  83    0x1c00 0x00000000   66   (null)
>  84  H 0x1c00 0x00000000   66   (null)
>  85    0x9c00 0x2e205000   66 9e384f00
>  86    0x1c00 0x2e204800   66 9e384d80
>  87    0x1c00 0x2e204000   66 9e384180
...
> 376    0x1c00 0x2e252800   66 81cf6180
> 377    0x1c00 0x2e253000   66 81cf6240
> 378 S  0x1c00 0x00000000   66   (null)

So, the software would insert the next packet into slot 378.  However,
the slots from 85 to 377 have not been reaped, despite those in 86 to
377 allegedly having been sent.  This is because the entry in slot 85
shows that it has yet to be sent.
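
That reading can be mechanised. The Python sketch below parses rows in the format of the ring dump quoted above and lists the descriptors whose status word still has bit 0x8000 set, on the assumption that this bit is the "ready, not yet transmitted" flag (which matches the description of slot 85). The names `parse_ring`, `still_pending` and `unreaped` are illustrative, the H and S markers are carried through without interpretation, and the sample holds only rows visible in the excerpt.

```python
import re
from collections import namedtuple

Entry = namedtuple("Entry", "nr flag status addr length skb")

# Assumption: bit 15 of the status word means "ready, not yet sent".
TX_READY = 0x8000

# Nr, optional H/S marker, status, DMA address, length, SKB pointer.
ROW = re.compile(
    r"^\s*(\d+)\s+([HS]?)\s*(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\S+)\s*$"
)


def parse_ring(text):
    """Return one Entry per descriptor row; other lines are ignored."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line.lstrip("> "))  # tolerate email quote prefixes
        if match is None:
            continue  # header row, "..." elisions, blank lines
        nr, flag, status, addr, length, skb = match.groups()
        entries.append(
            Entry(int(nr), flag, int(status, 16), int(addr, 16), int(length), skb)
        )
    return entries


def still_pending(entries):
    """Descriptors the hardware has apparently not transmitted yet."""
    return [e for e in entries if e.status & TX_READY]


def unreaped(entries):
    """Descriptors that still hold an SKB, i.e. have not been cleaned up."""
    return [e for e in entries if e.skb != "(null)"]


SAMPLE = """\
  83    0x1c00 0x00000000   66   (null)
  84  H 0x1c00 0x00000000   66   (null)
  85    0x9c00 0x2e205000   66 9e384f00
  86    0x1c00 0x2e204800   66 9e384d80
  87    0x1c00 0x2e204000   66 9e384180
 376    0x1c00 0x2e252800   66 81cf6180
 377    0x1c00 0x2e253000   66 81cf6240
 378 S  0x1c00 0x00000000   66   (null)
"""

ring = parse_ring(SAMPLE)
print([e.nr for e in still_pending(ring)])
print([e.nr for e in unreaped(ring)])
```

On a full dump, a single pending descriptor followed by a long run of completed but unreaped ones would be the pattern to look for.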

I've no idea what causes this; it looks like there's something screwed
with the hardware which causes the transmitter to skip an entry in the
ring under certain circumstances.  As I've never been able to reproduce
it here, I've not been able to investigate it.

What I would like to do is to stamp each packet in some way with an
identifier marking its ring position, and then monitor the network to
find out whether the packet at slot 85 was actually transmitted - that's
made slightly harder because packets may be dropped at the receiver
when operating in promisc mode.  This would then allow us to work out
some likely causes.

Note that after the transmit watchdog, the interface should recover and
start operating normally again - and that should not take "several
minutes."

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-06  9:50               ` Russell King - ARM Linux
@ 2014-08-06 11:10                 ` Mattis Lorentzon
  2014-08-06 12:55                   ` Russell King - ARM Linux
  0 siblings, 1 reply; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-06 11:10 UTC (permalink / raw)
  To: linux-arm-kernel

Russell,

> What is on the other end of the link?

16 ARM cards connected to a 3Com Switch 4400 connected to a Linux FC 20
machine (Intel Corporation 82541PI Gigabit Ethernet Controller rev 05).

There may be multiple problems. The backtrace has only been seen a few
times, on two different cards. Most of the time, the network for a random
card just stalls without any visible backtrace or error messages. The other
cards seem to be unaffected when this happens.

> What I would like to do is to stamp each packet in some way with an
> identifier marking its ring position, and then monitor the network to find out
> whether the packet at slot 85 was actually transmitted - that's made slightly
> harder because packets may be dropped at the receiver when operating in
> promisc mode.  This would then allow us to work out some likely causes.

We would be glad to run this test on our setup, do you have more detailed
information on how to set it up?

> Note that after the transmit watchdog, the interface should recover and start
> operating normally again - and that should not take "several minutes."

After a network stall, we usually have to power-cycle the ARM hardware to
get it back to a usable state. These stalls last at least several minutes,
perhaps indefinitely. It does not seem to recover properly, and is no longer
reachable via the network.

Best regards,
Mattis Lorentzon

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-06 11:10                 ` Mattis Lorentzon
@ 2014-08-06 12:55                   ` Russell King - ARM Linux
  2014-08-07 11:11                     ` Mattis Lorentzon
  0 siblings, 1 reply; 44+ messages in thread
From: Russell King - ARM Linux @ 2014-08-06 12:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Aug 06, 2014 at 11:10:06AM +0000, Mattis Lorentzon wrote:
> Russell,
> 
> > What is on the other end of the link?
> 
> 16 ARM cards connected to a 3Com Switch 4400 connected to a Linux FC 20
> machine (Intel Corporation 82541PI Gigabit Ethernet Controller rev 05).
> 
> There may be multiple problems. The backtrace has only been seen a few
> times, on two different cards. Most of the time, the network for a random
> card just stalls without any visible backtrace or error messages. The other
> cards seem to be unaffected when this happens.

Can you ascertain whether these stalls are a result of some failure of the
receive side or the transmit side - you should be able to tell that if you
watch the packet counts via ifconfig on the stalled card.  Also, it would
be useful to know whether the FEC interrupt was firing.

I hope you have some kind of serial console on these cards?

> > What I would like to do is to stamp each packet in some way with an
> > identifier marking its ring position, and then monitor the network to find out
> > whether the packet at slot 85 was actually transmitted - that's made slightly
> > harder because packets may be dropped at the receiver when operating in
> > promisc mode.  This would then allow us to work out some likely causes.
> 
> We would be glad to run this test on our setup, do you have more detailed
> information on how to set it up?

One of the problems is to find some way to stamp each packet with a 10-bit
number without having any side effects.  I guess one possibility would be
to overwrite the source MAC address on transmit, which hopefully should not
cause any side effects.
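
One way to picture the stamping idea: fold the 10-bit ring index into the low bits of the 48-bit source address, then recover it from a capture taken on the wire. The Python sketch below shows only the bit manipulation. The choice of the low 10 bits, the helper names and the example address 02:00:00:00:00:00 are assumptions made for illustration, not part of any posted patch.

```python
STAMP_BITS = 10
STAMP_MASK = (1 << STAMP_BITS) - 1  # 0x3ff, enough for ring slots 0..1023


def stamp_mac(mac, slot):
    """Return a copy of a 6-byte MAC address with `slot` in its low 10 bits."""
    if len(mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    if not 0 <= slot <= STAMP_MASK:
        raise ValueError("slot does not fit in 10 bits")
    value = int.from_bytes(mac, "big")
    value = (value & ~STAMP_MASK) | slot
    return value.to_bytes(6, "big")


def read_stamp(mac):
    """Recover the ring slot from a stamped source address."""
    return int.from_bytes(mac, "big") & STAMP_MASK


def format_mac(mac):
    """Render a MAC address in the usual colon-separated form."""
    return ":".join("{:02x}".format(octet) for octet in mac)


base = bytes.fromhex("020000000000")  # hypothetical locally administered address
stamped = stamp_mac(base, 85)
print(format_mac(stamped), read_stamp(stamped))
```

Leaving the upper octets untouched keeps the first octet's flag bits intact, although a switch would still learn a new address for every slot value.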

> After a network stall, we usually have to power-cycle the ARM hardware to
> get it back to a usable state. These stalls last at least several minutes,
> perhaps indefinitely. It does not seem to recover properly, and is no longer
> reachable via the network.

Hmm.  Okay, I think the first thing we need to do is to work out why
the silent stalls are happening.

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-06 12:55                   ` Russell King - ARM Linux
@ 2014-08-07 11:11                     ` Mattis Lorentzon
  2014-08-07 12:12                       ` Russell King - ARM Linux
  0 siblings, 1 reply; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-07 11:11 UTC (permalink / raw)
  To: linux-arm-kernel

Russell,

> Can you ascertain whether these stalls are a result of some failure of the
> receive side or the transmit side - you should be able to tell that if you watch
> the packet counts via ifconfig on the stalled card.  Also, it would be useful to
> know whether the FEC interrupt was firing.

# grep eth /proc/interrupts
151:          0          0          0          0       GIC 151  2188000.ethernet
166:    1205661          0          0          0  gpio-mxc   6  2188000.ethernet

The counter for interrupt 166 increases regularly during the stalls.
Ifconfig indicates that the RX and TX counters do not increase.
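
This kind of check amounts to sampling the counters twice and seeing which ones moved. A small Python sketch for the interrupt side follows. The first sample is the output quoted above; the count for IRQ 166 in the second sample is invented for illustration, and `irq_totals` and `advancing` are hypothetical helper names.

```python
def irq_totals(text):
    """Sum the per-CPU counts on each numbered line of /proc/interrupts."""
    totals = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # CPU header row or named entries
        total = 0
        for token in rest.split():
            if not token.isdigit():
                break  # reached the interrupt controller name
            total += int(token)
        totals[label.strip()] = total
    return totals


def advancing(before, after):
    """IRQ numbers whose total count grew between the two samples."""
    old, new = irq_totals(before), irq_totals(after)
    return sorted(irq for irq in new if new[irq] > old.get(irq, 0))


BEFORE = """\
151:          0          0          0          0       GIC 151  2188000.ethernet
166:    1205661          0          0          0  gpio-mxc   6  2188000.ethernet
"""

# Second sample: the value for IRQ 166 is made up for this example.
AFTER = """\
151:          0          0          0          0       GIC 151  2188000.ethernet
166:    1205700          0          0          0  gpio-mxc   6  2188000.ethernet
"""

print(advancing(BEFORE, AFTER))
```

The same before/after comparison applies to the RX and TX packet counts reported by ifconfig.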

> I hope you have some kind of serial console on these cards?

Yes, indeed. Local stimuli seem to be able to unstall the network in a
somewhat random fashion. Running e.g. ifconfig or ping locally may
immediately or after up to about half a minute make the network responsive.
However, it usually degenerates again to a complete stall within seconds.
Without local stimuli the network does not appear to recover at all. The card
does not even respond to pings (again, most often without any apparent
error messages).

Running both of the following commands in parallel from the FC server seems
to trigger the problem within minutes (please note that the arm card stops
responding to both ping and ssh):

# while :; do ssh arm-card echo Ok; done
# ping arm-card

We have noticed the same problem on both the i.MX6 and the Zynq cards
(using KSZ9021 and Cadence GEM drivers). However, the number of
iterations required to trigger the problem varies. Sometimes it might stall after
less than 100, but in other cases the stalls begin after nearly 10000 iterations.
Once stalled (and unstalled after stimuli), the network on that particular card
degenerates a lot more often. Apart from the kernel, IP numbers and MAC
addresses, the software configurations are identical between the Zynq and
the i.MX6. Perhaps the fault is unrelated to the Freescale driver?
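
Since the iteration count at the first stall is the figure being compared, a
rough sketch of a loop that records it may be useful. The function name
count_until_failure and the MAX_ITER variable are hypothetical, not part of
our actual test setup.

```shell
# Run the given command repeatedly until it fails, then print how many
# iterations succeeded. MAX_ITER (default 100000) bounds the loop.
count_until_failure() {
    n=0
    while [ "$n" -lt "${MAX_ITER:-100000}" ]; do
        # Stop at the first non-zero exit status.
        "$@" >/dev/null 2>&1 || break
        n=$((n + 1))
    done
    echo "$n"
}
```

For example, "count_until_failure ssh -o ConnectTimeout=10 arm-card echo Ok"
would print the number of successful logins before the first failure. A
session that hangs after connecting would still block, so wrapping the
command in timeout(1) is probably sensible.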

> Hmm.  Okay, I think the first thing we need to do is to work out why the
> silent stalls are happening.

Would you have any ideas on what to check next?

Best regards,
Mattis Lorentzon
***************************************************************
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***************************************************************

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-07 11:11                     ` Mattis Lorentzon
@ 2014-08-07 12:12                       ` Russell King - ARM Linux
  2014-08-07 14:20                         ` Fabio Estevam
  2014-08-08 18:09                         ` Russell King - ARM Linux
  0 siblings, 2 replies; 44+ messages in thread
From: Russell King - ARM Linux @ 2014-08-07 12:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Aug 07, 2014 at 11:11:06AM +0000, Mattis Lorentzon wrote:
> Russell,
> 
> > Can you ascertain whether these stalls are a result of some failure of the
> > receive side or the transmit side - you should be able to tell that if you watch
> > the packet counts via ifconfig on the stalled card.  Also, it would be useful to
> > know whether the FEC interrupt was firing.
> 
> grep eth /proc/interrupts
> 151:          0          0          0          0       GIC 151  2188000.ethernet
> 166:    1205661          0          0          0  gpio-mxc   6  2188000.ethernet
> 
> The interrupt counter 166 increases regularly during the stalls.
> Ifconfig indicates that the RX and TX  counters do not increase.

Hmm, I'm slightly confused.  On my iMX6Q, I have:

150:     581754          0          0          0       GIC 150  2188000.ethernet
151:          0          0          0          0       GIC 151  2188000.ethernet

In the DT file, we have:

                        fec: ethernet@02188000 {
                                compatible = "fsl,imx6q-fec";
                                reg = <0x02188000 0x4000>;
                                interrupts-extended =
                                        <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>,
                                        <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>;
                                clocks = <&clks 117>, <&clks 117>, <&clks 190>;
                                clock-names = "ipg", "ahb", "ptp";
                                status = "disabled";
                        };

which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151.
Yet you seem to have nothing registered against GIC 150, instead having
an interrupt against GPIO 6.
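
The arithmetic above can be sketched as a small helper. The name gic_hwirq is
hypothetical; the offsets follow the GIC binding, where the first cell of the
interrupt specifier selects SPI (0) or PPI (1).

```shell
# Convert a GIC interrupt specifier (<kind number ...>) to the hardware
# interrupt ID that shows up in /proc/interrupts.
gic_hwirq() {
    kind=$1
    number=$2
    case "$kind" in
        0) echo $((number + 32)) ;;   # SPIs start at ID 32
        1) echo $((number + 16)) ;;   # PPIs start at ID 16
        *) echo "unknown interrupt kind: $kind" >&2; return 1 ;;
    esac
}
```

With the FEC node above, "gic_hwirq 0 118" and "gic_hwirq 0 119" give the two
IDs that a standard platform is expected to list.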

This seems very odd, and as this is an on-SoC device, I don't see why
you would want to bind the interrupts for the FEC device any differently
to standard platforms.

This could well be the cause of your stalls.

What's GPIO 6 used for on your board?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-07 12:12                       ` Russell King - ARM Linux
@ 2014-08-07 14:20                         ` Fabio Estevam
  2014-08-07 14:38                           ` Fabio Estevam
  2014-08-08 14:05                           ` Fabio Estevam
  2014-08-08 18:09                         ` Russell King - ARM Linux
  1 sibling, 2 replies; 44+ messages in thread
From: Fabio Estevam @ 2014-08-07 14:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Aug 7, 2014 at 9:12 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:

> Hmm, I'm slightly confused.  On my iMX6Q, I have:
>
> 150:     581754          0          0          0       GIC 150  2188000.ethernet
> 151:          0          0          0          0       GIC 151  2188000.ethernet

Same here on a mx6qsabresd.

> In the DT file, we have:
>
>                         fec: ethernet@02188000 {
>                                 compatible = "fsl,imx6q-fec";
>                                 reg = <0x02188000 0x4000>;
>                                 interrupts-extended =
>                                         <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>,
>                                         <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>;
>                                 clocks = <&clks 117>, <&clks 117>, <&clks 190>;
>                                 clock-names = "ipg", "ahb", "ptp";
>                                 status = "disabled";
>                         };
>
> which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151.
> Yet you seem to have nothing registered against GIC 150, instead having
> an interrupt against GPIO 6.
>
> This seems very odd, and as this is an on-SoC device, I don't see why
> you would want to bind the interrupts for the FEC device any differently
> to standard platforms.
>
> This could well be the cause of your stalls.
>
> What's GPIO 6 used for on your board?

On a imx6q sabreauto I also get:

151:          0          0          0          0       GIC 151  2188000.ethernet
166:       4577          0          0          0  gpio-mxc   6  2188000.ethernet

and the GPIO1_6 interrupt comes from this commit:

commit bc20a5d6da718f9d60da0a78f70c653c1cd16af3
Author: Troy Kisky <troy.kisky@boundarydevices.com>
Date:   Fri Dec 20 11:47:12 2013 -0700

    ARM: dts: imx6qdl-sabreauto: use GPIO_6 for FEC interrupt.

    This works around a hardware bug.

    Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
    Signed-off-by: Shawn Guo <shawn.guo@linaro.org>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-07 14:20                         ` Fabio Estevam
@ 2014-08-07 14:38                           ` Fabio Estevam
  2014-08-08  1:30                             ` Troy Kisky
  2014-08-08 14:05                           ` Fabio Estevam
  1 sibling, 1 reply; 44+ messages in thread
From: Fabio Estevam @ 2014-08-07 14:38 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Aug 7, 2014 at 11:20 AM, Fabio Estevam <festevam@gmail.com> wrote:

> On a imx6q sabreauto I also get:
>
> 151:          0          0          0          0       GIC 151  2188000.ethernet
> 166:       4577          0          0          0  gpio-mxc   6  2188000.ethernet
>
> and the GPIO1_6 interrupt comes from this commit:
>
> commit bc20a5d6da718f9d60da0a78f70c653c1cd16af3
> Author: Troy Kisky <troy.kisky@boundarydevices.com>
> Date:   Fri Dec 20 11:47:12 2013 -0700
>
>     ARM: dts: imx6qdl-sabreauto: use GPIO_6 for FEC interrupt.
>
>     This works around a hardware bug.
>
>     Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
>     Signed-off-by: Shawn Guo <shawn.guo@linaro.org>

Actually a more descriptive commit log can be found here:

commit 6261c4c8f13eb91f733e8ba6d67c409a2e841667
Author: Troy Kisky <troy.kisky@boundarydevices.com>
Date:   Fri Dec 20 11:47:11 2013 -0700

    ARM: dts: imx6qdl-sabrelite: use GPIO_6 for FEC interrupt.

    This works around a hardware bug.
    From "Chip Errata for the i.MX 6Dual/6Quad"

    ERR006687 ENET: Only the ENET wake-up interrupt request can wake the
    system from Wait mode.

    The ENET block generates many interrupts. Only one of these interrupt lines
    is connected to the General Power Controller (GPC) block, but a logical OR
    of all of the ENET interrupts is connected to the General Interrupt
    Controller (GIC). When the system enters Wait mode, a normal RX Done or TX
    Done does not wake up the system because the GPC cannot see this interrupt.
    This impacts performance of the ENET block because its interrupts are
    serviced only when the chip exits Wait mode due to an interrupt from some
    other wake-up source.

    Before this patch, ping times of a Sabre Lite board are quite
    random:
    ping 192.168.0.13 -i.5 -c5
    PING 192.168.0.13 (192.168.0.13) 56(84) bytes of data.
    64 bytes from 192.168.0.13: icmp_req=1 ttl=64 time=15.7 ms
    64 bytes from 192.168.0.13: icmp_req=2 ttl=64 time=14.4 ms
    64 bytes from 192.168.0.13: icmp_req=3 ttl=64 time=13.4 ms
    64 bytes from 192.168.0.13: icmp_req=4 ttl=64 time=12.4 ms
    64 bytes from 192.168.0.13: icmp_req=5 ttl=64 time=11.4 ms

    === 192.168.0.13 ping statistics ===
    5 packets transmitted, 5 received, 0% packet loss, time 2004ms
    rtt min/avg/max/mdev = 11.431/13.501/15.746/1.508 ms
    ____________________________________________________
    After this patch:

    ping 192.168.0.13 -i.5 -c5
    PING 192.168.0.13 (192.168.0.13) 56(84) bytes of data.
    64 bytes from 192.168.0.13: icmp_req=1 ttl=64 time=0.120 ms
    64 bytes from 192.168.0.13: icmp_req=2 ttl=64 time=0.175 ms
    64 bytes from 192.168.0.13: icmp_req=3 ttl=64 time=0.169 ms
    64 bytes from 192.168.0.13: icmp_req=4 ttl=64 time=0.168 ms
    64 bytes from 192.168.0.13: icmp_req=5 ttl=64 time=0.172 ms

    === 192.168.0.13 ping statistics ===
    5 packets transmitted, 5 received, 0% packet loss, time 1999ms
    rtt min/avg/max/mdev = 0.120/0.160/0.175/0.026 ms
    ____________________________________________________

    Also, apply same change to imx6qdl-nitrogen6x.

    This change may not be appropriate for all boards.
    Sabre Lite uses GPIO6 as a power down output for a ov5642
    camera. As this expansion board does not yet work with mainline,
    this is not yet a conflict. It would be nice to have an alternative
    fix for boards where this is a problem.

    For example Sabre SD uses GPIO6 for I2C3_SDA. It also
    has long ping times currently. But cannot use this fix
    without giving up a touchscreen.

    Its ping times are also random.

    ping 192.168.0.19 -i.5 -c5
    PING 192.168.0.19 (192.168.0.19) 56(84) bytes of data.
    64 bytes from 192.168.0.19: icmp_req=1 ttl=64 time=16.0 ms
    64 bytes from 192.168.0.19: icmp_req=2 ttl=64 time=15.4 ms
    64 bytes from 192.168.0.19: icmp_req=3 ttl=64 time=14.4 ms
    64 bytes from 192.168.0.19: icmp_req=4 ttl=64 time=13.4 ms
    64 bytes from 192.168.0.19: icmp_req=5 ttl=64 time=12.4 ms

    === 192.168.0.19 ping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 2003ms
    rtt min/avg/max/mdev = 12.451/14.369/16.057/1.316 ms

    Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
    CC: Ranjani Vaidyanathan <ra5478@freescale.com>
    Signed-off-by: Shawn Guo <shawn.guo@linaro.org>

But I am wondering if we should also do:

--- a/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi
+++ b/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi
@@ -66,6 +66,7 @@
        pinctrl-0 = <&pinctrl_enet>;
        phy-mode = "rgmii";
        interrupts-extended = <&gpio1 6 IRQ_TYPE_LEVEL_HIGH>,
+                             <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>,
                              <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>;
        status = "okay";
 };
@@ -226,7 +227,7 @@
                                MX6QDL_PAD_RGMII_RD2__RGMII_RD2         0x1b0b0
                                MX6QDL_PAD_RGMII_RD3__RGMII_RD3         0x1b0b0
                                MX6QDL_PAD_RGMII_RX_CTL__RGMII_RX_CTL   0x1b0b0
-                               MX6QDL_PAD_GPIO_6__ENET_IRQ             0x000b1
+                               MX6QDL_PAD_GPIO_6__ENET_IRQ             0x400000b1

Since the workaround for erratum ERR006687 states that the SION bit
needs to be used:

"All of the interrupts can be selected by MUX and output to pad GPIO6.
If GPIO6 is selected to output ENET interrupts and GPIO6 SION is set,
the resulting GPIO interrupt will wake the system from Wait mode."

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-07 14:38                           ` Fabio Estevam
@ 2014-08-08  1:30                             ` Troy Kisky
  0 siblings, 0 replies; 44+ messages in thread
From: Troy Kisky @ 2014-08-08  1:30 UTC (permalink / raw)
  To: linux-arm-kernel

On 8/7/2014 7:38 AM, Fabio Estevam wrote:
> On Thu, Aug 7, 2014 at 11:20 AM, Fabio Estevam <festevam@gmail.com> wrote:
> 
> ,but I am wondering if we should also do:
> 
> --- a/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi
> +++ b/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi
> @@ -66,6 +66,7 @@
>         pinctrl-0 = <&pinctrl_enet>;
>         phy-mode = "rgmii";
>         interrupts-extended = <&gpio1 6 IRQ_TYPE_LEVEL_HIGH>,
> +                             <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>,
>                               <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>;
>         status = "okay";
>  };
> @@ -226,7 +227,7 @@
>                                 MX6QDL_PAD_RGMII_RD2__RGMII_RD2         0x1b0b0
>                                 MX6QDL_PAD_RGMII_RD3__RGMII_RD3         0x1b0b0
>                                 MX6QDL_PAD_RGMII_RX_CTL__RGMII_RX_CTL   0x1b0b0
> -                               MX6QDL_PAD_GPIO_6__ENET_IRQ             0x000b1
> +                               MX6QDL_PAD_GPIO_6__ENET_IRQ             0x400000b1
> 
> Since the Workaround for erratum ERR006687 states that the SION bit
> needs to be used:
> 
> "All of the interrupts can be selected by MUX and output to pad GPIO6.
> If GPIO6 is selected to output ENET interrupts and GPIO6 SION is set,
> the resulting GPIO interrupt will wake the system from Wait mode."
> 
arch/arm/boot/dts/imx6q-pinfunc.h:#define MX6QDL_PAD_GPIO_6__ENET_IRQ               0x230 0x600 0x03c 0x11 0xff000609

So, the SION bit should already be set (0x11). But the other way works too.
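
To make the two encodings concrete, here is a small sketch that decodes them.
The helper names are hypothetical. It assumes the i.MX6 mux register layout
(ALT mode in bits 2:0, SION in bit 4) and that bit 30 of the pad config cell
is the flag asking pinctrl to set SION, as the 0x400000b1 value in Fabio's
diff suggests.

```shell
# Decode a mux_mode value from imx6q-pinfunc.h into ALT mode and SION flag.
decode_mux() {
    val=$(($1))
    echo "alt=$((val & 0x7)) sion=$(( (val >> 4) & 1 ))"
}

# Report whether a DT pad config cell carries the SION request flag (bit 30).
pad_requests_sion() {
    val=$(($1))
    echo $(( (val >> 30) & 1 ))
}
```

"decode_mux 0x11" reports ALT1 with SION set, which is the pinfunc.h route;
"pad_requests_sion 0x400000b1" reports 1 and "pad_requests_sion 0x000b1"
reports 0, which is the pad config route.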


Troy

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-07 14:20                         ` Fabio Estevam
  2014-08-07 14:38                           ` Fabio Estevam
@ 2014-08-08 14:05                           ` Fabio Estevam
  1 sibling, 0 replies; 44+ messages in thread
From: Fabio Estevam @ 2014-08-08 14:05 UTC (permalink / raw)
  To: linux-arm-kernel

Mattis,

On Thu, Aug 7, 2014 at 11:20 AM, Fabio Estevam <festevam@gmail.com> wrote:
> On Thu, Aug 7, 2014 at 9:12 AM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
>
>> Hmm, I'm slightly confused.  On my iMX6Q, I have:
>>
>> 150:     581754          0          0          0       GIC 150  2188000.ethernet
>> 151:          0          0          0          0       GIC 151  2188000.ethernet
>
> Same here on a mx6qsabresd.
>
>> In the DT file, we have:
>>
>>                         fec: ethernet@02188000 {
>>                                 compatible = "fsl,imx6q-fec";
>>                                 reg = <0x02188000 0x4000>;
>>                                 interrupts-extended =
>>                                         <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>,
>>                                         <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>;
>>                                 clocks = <&clks 117>, <&clks 117>, <&clks 190>;
>>                                 clock-names = "ipg", "ahb", "ptp";
>>                                 status = "disabled";
>>                         };
>>
>> which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151.
>> Yet you seem to have nothing registered against GIC 150, instead having
>> an interrupt against GPIO 6.
>>
>> This seems very odd, and as this is an on-SoC device, I don't see why
>> you would want to bind the interrupts for the FEC device any differently
>> to standard platforms.
>>
>> This could well be the cause of your stalls.
>>
>> What's GPIO 6 used for on your board?
>
> On a imx6q sabreauto I also get:
>
> 151:          0          0          0          0       GIC 151  2188000.ethernet
> 166:       4577          0          0          0  gpio-mxc   6  2188000.ethernet

Could you remove 'interrupts-extended'  from the FEC node and also
MX6QDL_PAD_GPIO_6__ENET_IRQ from the pinctrl node and test again?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-07 12:12                       ` Russell King - ARM Linux
  2014-08-07 14:20                         ` Fabio Estevam
@ 2014-08-08 18:09                         ` Russell King - ARM Linux
  2014-08-11 13:32                           ` Mattis Lorentzon
  1 sibling, 1 reply; 44+ messages in thread
From: Russell King - ARM Linux @ 2014-08-08 18:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Aug 07, 2014 at 01:12:48PM +0100, Russell King - ARM Linux wrote:
> On Thu, Aug 07, 2014 at 11:11:06AM +0000, Mattis Lorentzon wrote:
> > Russell,
> > 
> > > Can you ascertain whether these stalls are a result of some failure of the
> > > receive side or the transmit side - you should be able to tell that if you watch
> > > the packet counts via ifconfig on the stalled card.  Also, it would be useful to
> > > know whether the FEC interrupt was firing.
> > 
> > grep eth /proc/interrupts
> > 151:          0          0          0          0       GIC 151  2188000.ethernet
> > 166:    1205661          0          0          0  gpio-mxc   6  2188000.ethernet
> > 
> > The interrupt counter 166 increases regularly during the stalls.
> > Ifconfig indicates that the RX and TX  counters do not increase.
> 
> Hmm, I'm slightly confused.  On my iMX6Q, I have:
> 
> 150:     581754          0          0          0       GIC 150  2188000.ethernet
> 151:          0          0          0          0       GIC 151  2188000.ethernet
> 
> In the DT file, we have:
> 
>                         fec: ethernet@02188000 {
>                                 compatible = "fsl,imx6q-fec";
>                                 reg = <0x02188000 0x4000>;
>                                 interrupts-extended =
>                                         <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>,
>                                         <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>;
>                                 clocks = <&clks 117>, <&clks 117>, <&clks 190>;
>                                 clock-names = "ipg", "ahb", "ptp";
>                                 status = "disabled";
>                         };
> 
> which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151.
> Yet you seem to have nothing registered against GIC 150, instead having
> an interrupt against GPIO 6.
> 
> This seems very odd, and as this is an on-SoC device, I don't see why
> you would want to bind the interrupts for the FEC device any differently
> to standard platforms.
> 
> This could well be the cause of your stalls.
> 
> What's GPIO 6 used for on your board?

We have a second report of instability with the FEC today, and the
problem board (wandboard) is also using GPIO1 6 for the ethernet IRQ.
We have confirmation from the reporter that reverting the change
(thus making the FEC use the standard interrupt) fixes their problem.

Therefore, it seems that the workaround for ERR006687 is itself buggy.

I'd be interested to hear whether removing the 

	interrupts-extended = ...

property from your board's DT file, thereby causing you to revert back
to the default I list above, also fixes the instability you are seeing.

Thanks.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-08 18:09                         ` Russell King - ARM Linux
@ 2014-08-11 13:32                           ` Mattis Lorentzon
  2014-08-11 17:41                             ` Fabio Estevam
  0 siblings, 1 reply; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-11 13:32 UTC (permalink / raw)
  To: linux-arm-kernel

Russell and Fabio,

> I'd be interested to hear whether removing the
> 
> 	interrupts-extended = ...
> 
> property from your board's DT file, thereby causing you to revert back to the
> default I list above, also fixes the instability you are seeing.

We have tried to remove the board specific interrupts-extended field and the
MX6QDL_PAD_GPIO_6__ENET_IRQ entry. Sadly this did not seem to improve
the stalls. Our interrupts look like this now:

150:      15519          0          0          0       GIC 150  2188000.ethernet
151:          0          0          0          0       GIC 151  2188000.ethernet

Our device tree might still be slightly incorrect. We have noticed that our
RGMII_INT is connected to GPIO 19 (P5) which might be nonstandard (we are
a bit surprised that this works at all). We are not quite sure how to configure
this properly.

Best regards,
Mattis Lorentzon

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-11 13:32                           ` Mattis Lorentzon
@ 2014-08-11 17:41                             ` Fabio Estevam
  2014-08-13 13:39                               ` Mattis Lorentzon
  2014-08-14 14:43                               ` Mattis Lorentzon
  0 siblings, 2 replies; 44+ messages in thread
From: Fabio Estevam @ 2014-08-11 17:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Aug 11, 2014 at 10:32 AM, Mattis Lorentzon
<Mattis.Lorentzon@autoliv.com> wrote:
> Russell and Fabio,
>
>> I'd be interested to hear whether removing the
>>
>>       interrupts-extended = ...
>>
>> property from your board's DT file, thereby causing you to revert back to the
>> default I list above, also fixes the instability you are seeing.
>
> We have tried to remove the board specific interrupts-extended field and the
> MX6QDL_PAD_GPIO_6__ENET_IRQ entry. Sadly this did not seem to improve
> the stalls. Our interrupts look like this now:
>
> 150:      15519          0          0          0       GIC 150  2188000.ethernet
> 151:          0          0          0          0       GIC 151  2188000.ethernet
>
> Our device tree might still be slightly incorrect. We have noticed that our
> RGMII_INT is connected to GPIO 19 (P5) which might be nonstandard (we are
> a bit surprised that this works at all). We are not quite sure how to configure
> this properly.

In order to try to narrow down whether this is a board issue, could
you try to run the same kernel on a mx6q development board, such as
mx6qsabresd, cubox-i, wandboard, etc?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-11 17:41                             ` Fabio Estevam
@ 2014-08-13 13:39                               ` Mattis Lorentzon
  2014-08-25 10:18                                 ` Russell King - ARM Linux
  2014-08-14 14:43                               ` Mattis Lorentzon
  1 sibling, 1 reply; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-13 13:39 UTC (permalink / raw)
  To: linux-arm-kernel

Fabio and Russell,

> In order to try to narrow down whether this is a board issue, could you try to
> run the same kernel on a mx6q development board, such as mx6qsabresd,
> cubox-i, wandboard, etc?

Indeed, we have a Sabrelite development board and have run the same kernel
configuration (please find attached). Russell's 30 FEC-related patches are applied.
We have also tried with and without the extended interrupts entry in the DT.

All our tests seem to behave the same way on the Sabrelite as on our own board.
A working theory is that the switch (3Com Switch 4400) triggers the degeneration
of the network stack from which Linux does not seem to recover, even if we later
bypass the switch and directly connect the board to the server machine.

Since the problem is stochastic in nature we are not completely sure if we can
trigger the problem without the switch. It's the switch that allows us to run many
cards simultaneously and thus trigger the problem more easily. :-)

What are your thoughts?

Best regards,
Mattis Lorentzon

-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.gz
Type: application/x-gzip
Size: 14775 bytes
Desc: config.gz
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140813/f7f2884e/attachment.bin>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-13 13:39                               ` Mattis Lorentzon
@ 2014-08-25 10:18                                 ` Russell King - ARM Linux
  2014-08-26 13:11                                   ` Iain Paton
  0 siblings, 1 reply; 44+ messages in thread
From: Russell King - ARM Linux @ 2014-08-25 10:18 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Aug 13, 2014 at 01:39:27PM +0000, Mattis Lorentzon wrote:
> All our tests seem to behave the same way on the Sabrelite as on our own board.
> A working theory is that the switch (3Com Switch 4400) triggers the degeneration
> of the network stack from which Linux does not seem to recover, even if we later
> bypass the switch and directly connect the board to the server machine.

Please can you try something - what happens if you completely disable
pause frame support (flow control) on all machines on the switch?
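
On the Linux machines this would typically be done with ethtool. As a rough
sketch, the helper below (pause_off_cmds, a hypothetical name) only prints the
commands for the interfaces given, so they can be reviewed before being run.
Whether a particular driver accepts the settings is not something this thread
confirms.

```shell
# Print an ethtool command that turns off pause autonegotiation and both
# RX and TX pause frames for each interface named on the command line.
pause_off_cmds() {
    for dev in "$@"; do
        echo "ethtool -A $dev autoneg off rx off tx off"
    done
}
```

"pause_off_cmds eth0 | sh" (as root) would apply the change, and
"ethtool -a eth0" shows the resulting pause settings. The interface name eth0
is an assumption.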


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-25 10:18                                 ` Russell King - ARM Linux
@ 2014-08-26 13:11                                   ` Iain Paton
  0 siblings, 0 replies; 44+ messages in thread
From: Iain Paton @ 2014-08-26 13:11 UTC (permalink / raw)
  To: linux-arm-kernel

On 25/08/14 11:18, Russell King - ARM Linux wrote:
> On Wed, Aug 13, 2014 at 01:39:27PM +0000, Mattis Lorentzon wrote:
>> All our tests seem to behave the same way on the Sabrelite as on our own board.
>> A working theory is that the switch (3Com Switch 4400) triggers the degeneration
>> of the network stack from which Linux does not seem to recover, even if we later
>> bypass the switch and directly connect the board to the server machine.
> 
> Please can you try something - what happens if you completely disable
> pause frame support (flow control) on all machines on the switch?

Russell, while trying to duplicate this I have flow-control disabled 
on the switch which leads to it being auto-negotiated off on all devices.
Do you think it could be worth turning it on and trying again?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-11 17:41                             ` Fabio Estevam
  2014-08-13 13:39                               ` Mattis Lorentzon
@ 2014-08-14 14:43                               ` Mattis Lorentzon
  2014-08-14 15:30                                 ` Fabio Estevam
  2014-08-22  8:27                                 ` Russell King - ARM Linux
  1 sibling, 2 replies; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-14 14:43 UTC (permalink / raw)
  To: linux-arm-kernel

Fabio and Russell,

> A working theory is that the switch (3Com Switch 4400) triggers the
> degeneration of the network stack from which Linux does not seem to
> recover, even if we later bypass the switch and directly connect the board to
> the server machine.

After a few more tests we have finally been able to trigger the exact same stalls
on the Sabrelite board with a direct network connection (i.e. without the switch).

Best regards,
Mattis Lorentzon


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-14 14:43                               ` Mattis Lorentzon
@ 2014-08-14 15:30                                 ` Fabio Estevam
  2014-08-15  5:42                                   ` Mattis Lorentzon
  2014-08-22  8:27                                 ` Russell King - ARM Linux
  1 sibling, 1 reply; 44+ messages in thread
From: Fabio Estevam @ 2014-08-14 15:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Aug 14, 2014 at 11:43 AM, Mattis Lorentzon
<Mattis.Lorentzon@autoliv.com> wrote:

> After a few more tests we have finally been able to trigger the exact same stalls
> on the Sabrelite board with a direct network connection (i.e. without the switch).

Do the stalls also happen on a pure 3.16 kernel?

How can we reproduce the error?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-14 15:30                                 ` Fabio Estevam
@ 2014-08-15  5:42                                   ` Mattis Lorentzon
  2014-08-17 21:34                                     ` Iain Paton
  0 siblings, 1 reply; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-15  5:42 UTC (permalink / raw)
  To: linux-arm-kernel

Fabio,

> Do the stalls also happen on a pure 3.16 kernel?

Yes, we just tried this out overnight and we get the same stalls here.
We have seen similar problems on a Zynq-based board. It might be
worth noting that a common chip between all three boards is, for
example, the KSZ9021RN, while the FEC driver, for example, only
runs on the two iMX6-boards.

> How can we reproduce the error?

We mostly run SSH with benchmarks using NFS, it can probably be
triggered by using only SSH with the following loop:

# while : ; do ssh arm-card date; done

Our (pure) 3.16 kernel uses the following config.
http://lkml.iu.edu/hypermail/linux/kernel/1408.1/03045/config.gz

(We have quite generously disabled a lot of sub-systems in our config.)

Best regards,
Mattis Lorentzon


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-15  5:42                                   ` Mattis Lorentzon
@ 2014-08-17 21:34                                     ` Iain Paton
  2014-08-17 21:46                                       ` Fabio Estevam
  0 siblings, 1 reply; 44+ messages in thread
From: Iain Paton @ 2014-08-17 21:34 UTC (permalink / raw)
  To: linux-arm-kernel

On 15/08/14 06:42, Mattis Lorentzon wrote:

> We mostly run SSH with benchmarks using NFS, it can probably be
> triggered by using only SSH with the following loop:
> 
> # while : ; do ssh arm-card date; done

Mattis,

What sort of time does it take for you to see a problem?

I've been running the above for nearly two days on 3.16.0 on a board 
with fec interrupts routed through gpio_6 and haven't seen a hint of 
a problem.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-17 21:34                                     ` Iain Paton
@ 2014-08-17 21:46                                       ` Fabio Estevam
  2014-08-19  6:03                                         ` Iain Paton
  0 siblings, 1 reply; 44+ messages in thread
From: Fabio Estevam @ 2014-08-17 21:46 UTC (permalink / raw)
  To: linux-arm-kernel

Iain,

On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton <ipaton0@gmail.com> wrote:
> On 15/08/14 06:42, Mattis Lorentzon wrote:
>
>> We mostly run SSH with benchmarks using NFS, it can probably be
>> triggered by using only SSH with the following loop:
>>
>> # while : ; do ssh arm-card date; done
>
> Mattis,
>
> What sort of time does it take for you to see a problem?
>
> I've been running the above for nearly two days on 3.16.0 on a board
> with fec interrupts routed through gpio_6 and haven't seen a hint of
> a problem.

Thanks for testing.

Which mx6 board have you used on this test?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-17 21:46                                       ` Fabio Estevam
@ 2014-08-19  6:03                                         ` Iain Paton
  2014-08-21  9:39                                           ` Iain Paton
  0 siblings, 1 reply; 44+ messages in thread
From: Iain Paton @ 2014-08-19  6:03 UTC (permalink / raw)
  To: linux-arm-kernel

On 17/08/14 22:46, Fabio Estevam wrote:
> Iain,
> 
> On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton <ipaton0@gmail.com> wrote:
>> On 15/08/14 06:42, Mattis Lorentzon wrote:
>>
>>> We mostly run SSH with benchmarks using NFS, it can probably be
>>> triggered by using only SSH with the following loop:
>>>
>>> # while : ; do ssh arm-card date; done
>>
>> Mattis,
>>
>> What sort of time does it take for you to see a problem?
>>
>> I've been running the above for nearly two days on 3.16.0 on a board
>> with fec interrupts routed through gpio_6 and haven't seen a hint of
>> a problem.
> 
> Thanks for testing.
> 
> Which mx6 board have you used on this test?

It's currently pointed at a RIoTboard (atheros phy) but I'm happy to 
try it against both a Sabre-Lite and a Wandboard B1, all running the 
same kernel binary, as well. 

I'm interested enough in why different people get different results 
with this that I'll put some time towards testing to try to help 
narrow down the cause.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-19  6:03                                         ` Iain Paton
@ 2014-08-21  9:39                                           ` Iain Paton
  2014-08-22  0:01                                             ` Fabio Estevam
  2014-08-26 13:12                                             ` Iain Paton
  0 siblings, 2 replies; 44+ messages in thread
From: Iain Paton @ 2014-08-21  9:39 UTC (permalink / raw)
  To: linux-arm-kernel

On 19/08/14 07:03, Iain Paton wrote:
> On 17/08/14 22:46, Fabio Estevam wrote:
>> Iain,
>>
>> On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton <ipaton0@gmail.com> wrote:
>>> On 15/08/14 06:42, Mattis Lorentzon wrote:
>>>
>>>> We mostly run SSH with benchmarks using NFS, it can probably be
>>>> triggered by using only SSH with the following loop:
>>>>
>>>> # while : ; do ssh arm-card date; done
>>>
>>> Mattis,
>>>
>>> What sort of time does it take for you to see a problem?
>>>
>>> I've been running the above for nearly two days on 3.16.0 on a board
>>> with fec interrupts routed through gpio_6 and haven't seen a hint of
>>> a problem.
>>
>> Thanks for testing.
>>
>> Which mx6 board have you used on this test?
> 
> It's currently pointed at a RIoTboard (atheros phy) but I'm happy to 
> try it against both a Sabre-Lite and a Wandboard B1, all running the 
> same kernel binary, as well. 
> 
> I'm interested enough in why different people get different results 
> with this that I'll put some time towards testing to try to help 
> narrow down the cause.
> 

two and a half days of running this against both a sabre-lite and a 
wandboard quad B1 and I still have no reason to think there's any 
sort of a problem.

Up to now, my testing has been done with my own config, I'll now
repeat the whole thing using the config Mattis posted to see if 
I can reproduce it that way.

Suggestions on a better / easier / quicker way to reproduce it are 
welcome.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-21  9:39                                           ` Iain Paton
@ 2014-08-22  0:01                                             ` Fabio Estevam
  2014-08-22  6:39                                               ` Mattis Lorentzon
  2014-08-22 10:36                                               ` Iain Paton
  2014-08-26 13:12                                             ` Iain Paton
  1 sibling, 2 replies; 44+ messages in thread
From: Fabio Estevam @ 2014-08-22  0:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Aug 21, 2014 at 6:39 AM, Iain Paton <ipaton0@gmail.com> wrote:

> two and a half days of running this against both a sabre-lite and a
> wandboard quad B1 and I still have no reason to think there's any
> sort of a problem.
>
> Up to now, my testing has been done with my own config, I'll now
> repeat the whole thing using the config Mattis posted to see if
> I can reproduce it that way.
>
> Suggestions on a better / easier / quicker way to reproduce it are
> welcome.

Thanks, Iain.

Mattis,

What is the silicon version of the mx6 in your sabrelite? What GCC
version do you use?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-22  0:01                                             ` Fabio Estevam
@ 2014-08-22  6:39                                               ` Mattis Lorentzon
  2014-08-22 10:36                                               ` Iain Paton
  1 sibling, 0 replies; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-22  6:39 UTC (permalink / raw)
  To: linux-arm-kernel

Fabio,

> What is the silicon version of the mx6 in your sabrelite? What GCC version do
> you use?

The silicon version is PCIMX6Q6AVT10AA and the GCC version we use is
arm-none-eabi-gcc (Fedora 2013.11.24-2.fc19) 4.8.1.

Iain,

> Up to now, my testing has been done with my own config, I'll now
> repeat the whole thing using the config Mattis posted to see if I can
> reproduce it that way.

Thanks for testing this. Could you also send me the config that you used for
your Sabrelite?

Do you know of any options that enable additional debug information about
the network driver state (full buffers etc.)?

Best regards,
Mattis Lorentzon


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-22  0:01                                             ` Fabio Estevam
  2014-08-22  6:39                                               ` Mattis Lorentzon
@ 2014-08-22 10:36                                               ` Iain Paton
  2014-08-27  6:32                                                 ` Mattis Lorentzon
  1 sibling, 1 reply; 44+ messages in thread
From: Iain Paton @ 2014-08-22 10:36 UTC (permalink / raw)
  To: linux-arm-kernel

On 22/08/14 01:01, Fabio Estevam wrote:
> On Thu, Aug 21, 2014 at 6:39 AM, Iain Paton <ipaton0@gmail.com> wrote:
> 
>> two and a half days of running this against both a sabre-lite and a
>> wandboard quad B1 and I still have no reason to think there's any
>> sort of a problem.
>>
>> Up to now, my testing has been done with my own config, I'll now
>> repeat the whole thing using the config Mattis posted to see if
>> I can reproduce it that way.
>>
>> Suggestions on a better / easier / quicker way to reproduce it are
>> welcome.
> 
> Thanks, Iain.
> 
> Mattis,
> 
> What is the silicon version of the mx6 in your sabrelite? What GCC
> version do you use?
> 

For reference, both my SL and WBQUAD report silicon rev 1.2
The RIoTboard uses a Solo and reports silicon rev 1.1

I'm using vanilla gcc 4.9.1 and compiling the kernel natively on a 
sabre-lite.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-22 10:36                                               ` Iain Paton
@ 2014-08-27  6:32                                                 ` Mattis Lorentzon
  2014-08-27 10:43                                                   ` Iain Paton
  0 siblings, 1 reply; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-27  6:32 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Iain, Russell and Fabio,

> The config is attached. Note that there's a lot of additional stuff enabled as
> I'm aiming for a single general purpose kernel that covers i.MX6, AM3359,
> Allwinner A10/A20 along with several versions of boards using those
> particular SoCs.
> 
> Same kernel binary on all the boards I've tried this on, only real differences
> will be the devicetree and u-boot

Amazingly, we have been able to run a complete nightly test on eight i.MX6
boards without hiccups using Iain's config! We had to modify it slightly to get
it to boot; please find attached the patch and Iain's patched config.

On Russell's suggestion we also began to disable flow control on the machines.
However it did not seem to make a difference because all our Zynq cards
stalled during the same test run (using our own Zynq config).

Iain's config seems promising and we will continue to run tests during the
next couple of days. We will also try to adapt Iain's config to our Zynq board.

Many thanks for all suggestions, patches and configs so far!

Best regards,
Mattis Lorentzon

-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.patch
Type: application/octet-stream
Size: 692 bytes
Desc: config.patch
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140827/44ff0458/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.gz
Type: application/x-gzip
Size: 23527 bytes
Desc: config.gz
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140827/44ff0458/attachment-0001.bin>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-27  6:32                                                 ` Mattis Lorentzon
@ 2014-08-27 10:43                                                   ` Iain Paton
  2014-08-29 10:57                                                     ` Mattis Lorentzon
  0 siblings, 1 reply; 44+ messages in thread
From: Iain Paton @ 2014-08-27 10:43 UTC (permalink / raw)
  To: linux-arm-kernel

On 27/08/14 07:32, Mattis Lorentzon wrote:
> Hi Iain, Russell and Fabio,
> 
>> The config is attached. Note that there's a lot of additional stuff enabled as
>> I'm aiming for a single general purpose kernel that covers i.MX6, AM3359,
>> Allwinner A10/A20 along with several versions of boards using those
>> particular SoCs.
>>
>> Same kernel binary on all the boards I've tried this on, only real differences
>> will be the devicetree and u-boot
> 
> Amazingly, we have been able to run a complete nightly test on eight i.MX6
> boards without hiccups using Iain's config! We had to modify it slightly to get
> it to boot; please find attached the patch and Iain's patched config.

Interesting. We obviously have some differences in how we boot, my changes to 
your config to get it to boot basically amount to reverting the patch you attached 
and then enabling sata and mmc. So far I've been unable to get your config to fail.

I'm attaching the patch showing what I changed in case it sheds any light on 
what's going on, although I don't see why any of the changes make any difference.

My kernel command line is also fairly obvious with nothing I'd think is odd:
console=ttymxc1,115200n8 root=/dev/sda1 ro rootfstype=ext2 rootwait video= ahci-imx.hotplug=1

It would be good to know what makes my config work for you, I don't think I've 
done anything special with it.

Iain

-------------- next part --------------
3c3
< # Linux/arm 3.16.0 Kernel Configuration
---
> # Linux/arm 3.16.0-rc2 Kernel Configuration
38c38
< CONFIG_KERNEL_GZIP=y
---
> # CONFIG_KERNEL_GZIP is not set
41c41
< # CONFIG_KERNEL_LZO is not set
---
> CONFIG_KERNEL_LZO=y
233,239c233
< CONFIG_PARTITION_ADVANCED=y
< # CONFIG_ACORN_PARTITION is not set
< # CONFIG_AIX_PARTITION is not set
< # CONFIG_OSF_PARTITION is not set
< # CONFIG_AMIGA_PARTITION is not set
< # CONFIG_ATARI_PARTITION is not set
< # CONFIG_MAC_PARTITION is not set
---
> # CONFIG_PARTITION_ADVANCED is not set
241,249d234
< # CONFIG_BSD_DISKLABEL is not set
< # CONFIG_MINIX_SUBPARTITION is not set
< # CONFIG_SOLARIS_X86_PARTITION is not set
< # CONFIG_UNIXWARE_DISKLABEL is not set
< # CONFIG_LDM_PARTITION is not set
< # CONFIG_SGI_PARTITION is not set
< # CONFIG_ULTRIX_PARTITION is not set
< # CONFIG_SUN_PARTITION is not set
< # CONFIG_KARMA_PARTITION is not set
251,252d235
< # CONFIG_SYSV68_PARTITION is not set
< # CONFIG_CMDLINE_PARTITION is not set
265,266d247
< CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
< CONFIG_RWSEM_SPIN_ON_OWNER=y
533,534c514,521
< # CONFIG_ARM_APPENDED_DTB is not set
< CONFIG_CMDLINE=""
---
> CONFIG_ARM_APPENDED_DTB=y
> CONFIG_ARM_ATAG_DTB_COMPAT=y
> CONFIG_ARM_ATAG_DTB_COMPAT_CMDLINE_FROM_BOOTLOADER=y
> # CONFIG_ARM_ATAG_DTB_COMPAT_CMDLINE_EXTEND is not set
> CONFIG_CMDLINE="___console=ttymxc0,115200 ___debug ___LOGLEVEL=8 ___initrd=0x11800040,12383491 ___dyndbg=\"file * +p\""
> CONFIG_CMDLINE_FROM_BOOTLOADER=y
> # CONFIG_CMDLINE_EXTEND is not set
> # CONFIG_CMDLINE_FORCE is not set
591c578
< # CONFIG_PM_TEST_SUSPEND is not set
---
> CONFIG_PM_TEST_SUSPEND=y
919d905
< CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
920a907
> CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
1045,1059c1032
< CONFIG_ATA=y
< # CONFIG_ATA_NONSTANDARD is not set
< CONFIG_ATA_VERBOSE_ERROR=y
< CONFIG_SATA_PMP=y
< 
< #
< # Controllers with non-SFF native interface
< #
< CONFIG_SATA_AHCI=y
< CONFIG_SATA_AHCI_PLATFORM=y
< CONFIG_AHCI_IMX=y
< # CONFIG_SATA_INIC162X is not set
< # CONFIG_SATA_ACARD_AHCI is not set
< # CONFIG_SATA_SIL24 is not set
< # CONFIG_ATA_SFF is not set
---
> # CONFIG_ATA is not set
1786,1815c1759
< CONFIG_MMC=y
< # CONFIG_MMC_DEBUG is not set
< # CONFIG_MMC_CLKGATE is not set
< 
< #
< # MMC/SD/SDIO Card Drivers
< #
< CONFIG_MMC_BLOCK=y
< CONFIG_MMC_BLOCK_MINORS=8
< CONFIG_MMC_BLOCK_BOUNCE=y
< # CONFIG_SDIO_UART is not set
< # CONFIG_MMC_TEST is not set
< 
< #
< # MMC/SD/SDIO Host Controller Drivers
< #
< CONFIG_MMC_SDHCI=y
< CONFIG_MMC_SDHCI_IO_ACCESSORS=y
< # CONFIG_MMC_SDHCI_PCI is not set
< CONFIG_MMC_SDHCI_PLTFM=y
< # CONFIG_MMC_SDHCI_OF_ARASAN is not set
< CONFIG_MMC_SDHCI_ESDHC_IMX=y
< # CONFIG_MMC_SDHCI_PXAV3 is not set
< # CONFIG_MMC_SDHCI_PXAV2 is not set
< # CONFIG_MMC_MXC is not set
< # CONFIG_MMC_TIFM_SD is not set
< # CONFIG_MMC_CB710 is not set
< # CONFIG_MMC_VIA_SDMMC is not set
< # CONFIG_MMC_DW is not set
< # CONFIG_MMC_USDHI6ROL0 is not set
---
> # CONFIG_MMC is not set
1968d1911
< # CONFIG_WIMAX_GDM72XX is not set

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-27 10:43                                                   ` Iain Paton
@ 2014-08-29 10:57                                                     ` Mattis Lorentzon
  2014-08-29 11:30                                                       ` Fabio Estevam
  2014-12-16 14:50                                                       ` Mattis Lorentzon
  0 siblings, 2 replies; 44+ messages in thread
From: Mattis Lorentzon @ 2014-08-29 10:57 UTC (permalink / raw)
  To: linux-arm-kernel

Iain,

> Interesting. We obviously have some differences in how we boot, my
> changes to your config to get it to boot basically amount to reverting the
> patch you attached and then enabling sata and mmc. So far I've been unable
> to get your config to fail.

Our version of U-boot doesn't support specifying a device tree separate from
the kernel, so we append it to the end of the kernel binary. We also enable
automatic configuration of IP addresses (CONFIG_IP_PNP). Our bootargs are:
console=ttymxc1,115200
ip=192.168.2.157:192.168.2.1:192.168.2.1:255.255.255.0:armcard:eth0:on
earlyprintk enable_wait_mode=off
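
For readers unfamiliar with the ip= argument: the kernel's nfsroot
documentation describes it as a colon-separated list beginning
<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>.
The sketch below simply splits such a string into labelled fields so the
values above are easier to read; parse_ip_arg is a hypothetical name and
the function is not part of any kernel tooling.

```shell
# Hypothetical helper: split a kernel "ip=" boot argument into named fields.
parse_ip_arg() {
    local value client server gw netmask hostname device autoconf
    # Strip an optional leading "ip=".
    value=${1#ip=}
    # Split the remainder on colons, one variable per documented field.
    IFS=: read -r client server gw netmask hostname device autoconf <<EOF
$value
EOF
    echo "client=$client"
    echo "server=$server"
    echo "gateway=$gw"
    echo "netmask=$netmask"
    echo "hostname=$hostname"
    echo "device=$device"
    echo "autoconf=$autoconf"
}

# Usage:
#   parse_ip_arg 'ip=192.168.2.157:192.168.2.1:192.168.2.1:255.255.255.0:armcard:eth0:on'
```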

> It would be good to know what makes my config work for you, I don't think
> I've done anything special with it.

With a couple of modifications (attached) we have been able to get your
config running on our Zynq boards as well, solving our ethernet issues.

The serial port and ethernet are essentially the only things we use. No disks,
no graphics, no USB, etc. which is why we tried to reduce the kernel
configuration to a bare minimum. We have no idea which disabled and/or
enabled options are causing the stalls.
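
One way to narrow this down is to list only the options that differ
between a stalling config and a working one, and then flip them in
halves. The kernel tree ships scripts/diffconfig for this kind of
comparison; the sketch below is a rough stand-alone approximation, with
hypothetical function and file names, and is not what was used in this
thread.

```shell
# Hypothetical helper: reduce a .config to sorted "CONFIG_X=value" lines,
# rewriting "# CONFIG_X is not set" as "CONFIG_X=n" and dropping all other
# comments and blank lines.
normalise_config() {
    sed -n -e 's/^# \(CONFIG_[A-Za-z0-9_]*\) is not set$/\1=n/' \
           -e '/^CONFIG_/p' "$1" | sort
}

# Hypothetical helper: print the options that differ between two configs.
# Lines starting with "<" come from the first file, ">" from the second.
# An option that appears on one side only is absent from the other config.
config_diff() {
    local a b
    a=$(mktemp) && b=$(mktemp) || return 1
    normalise_config "$1" > "$a"
    normalise_config "$2" > "$b"
    diff "$a" "$b" | grep '^[<>]'
    rm -f "$a" "$b"
}

# Usage (file names are placeholders):
#   config_diff config.stalling config.working
```

Because many options depend on each other, a candidate set found this way
still has to be passed through "make oldconfig" and tested on hardware.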

Best regards,
Mattis Lorentzon

-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.patch
Type: application/octet-stream
Size: 2670 bytes
Desc: config.patch
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140829/479869c2/attachment.obj>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-29 10:57                                                     ` Mattis Lorentzon
@ 2014-08-29 11:30                                                       ` Fabio Estevam
  2014-12-16 14:50                                                       ` Mattis Lorentzon
  1 sibling, 0 replies; 44+ messages in thread
From: Fabio Estevam @ 2014-08-29 11:30 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Mattis,

On Fri, Aug 29, 2014 at 7:57 AM, Mattis Lorentzon
<Mattis.Lorentzon@autoliv.com> wrote:
> Iain,
>
>> Interesting. We obviously have some differences in how we boot, my
>> changes to your config to get it to boot basically amount to reverting the
>> patch you attached and then enabling sata and mmc. So far I've been unable
>> to get your config to fail.
>
> Our version of U-boot doesn't support specifying a device tree separate from
> the kernel, so we append it to the end of the kernel binary. We also enable
> automatic configuration of IP addresses (CONFIG_IP_PNP). Our bootargs are:
> console=ttymxc1,115200
> ip=192.168.2.157:192.168.2.1:192.168.2.1:255.255.255.0:armcard:eth0:on
> earlyprintk enable_wait_mode=off

I suppose that this 'enable_wait_mode=off' is a leftover from the
time you used the FSL BSP.

This is not needed in mainline.

>> It would be good to know what makes my config work for you, I don't think
>> I've done anything special with it.
>
> With a couple of modifications (attached) we have been able to get your
> config running on our Zynq boards as well, solving our ethernet issues.
>
> The serial port and ethernet are essentially the only things we use. No disks,
> no graphics, no USB, etc. which is why we tried to reduce the kernel
> configuration to a bare minimum. We have no idea which disabled and/or
> enabled options are causing the stalls.

It's good to hear you do not have the lockups anymore, but this is
still a big mystery to us, as we have not yet understood the root
cause or which 'guilty' kernel config option makes the FEC work
unreliably.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-29 10:57                                                     ` Mattis Lorentzon
  2014-08-29 11:30                                                       ` Fabio Estevam
@ 2014-12-16 14:50                                                       ` Mattis Lorentzon
  1 sibling, 0 replies; 44+ messages in thread
From: Mattis Lorentzon @ 2014-12-16 14:50 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

> Now because things have changed during the last merge window, I've got
> an even bigger problem sorting through that patch set and getting it
> back into a submittable state.  I've just sent out v2 for it onto the
> netdev@vger.kernel.org mailing list.
>
> The initial version (marked RFC) attracted very little interest from
> testers, or acks.  I'd very much like to have some testing of it, so
> if you want to try it out, I can provide you with a git URL, patches or a
> combined patch.

We have run v3.16 for about three months now, and many millions of ssh
connections on eight separate systems, both without and with your network
patches. Our conclusion is that the patches clearly reduce the number of
network timeouts, and this is a great improvement. However, after a month
or so of uptime, the number of timeouts began to increase again, forcing us
to reboot the cards.

Best regards,
Mattis Lorentzon

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-21  9:39                                           ` Iain Paton
  2014-08-22  0:01                                             ` Fabio Estevam
@ 2014-08-26 13:12                                             ` Iain Paton
  1 sibling, 0 replies; 44+ messages in thread
From: Iain Paton @ 2014-08-26 13:12 UTC (permalink / raw)
  To: linux-arm-kernel

On 21/08/14 10:39, Iain Paton wrote:
> On 19/08/14 07:03, Iain Paton wrote:
>> On 17/08/14 22:46, Fabio Estevam wrote:
>>> Iain,
>>>
>>> On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton <ipaton0@gmail.com> wrote:
>>>> On 15/08/14 06:42, Mattis Lorentzon wrote:
>>>>
>>>>> We mostly run SSH with benchmarks using NFS, it can probably be
>>>>> triggered by using only SSH with the following loop:
>>>>>
>>>>> # while : ; do ssh arm-card date; done
>>>>
>>>> Mattis,
>>>>
>>>> What sort of time does it take for you to see a problem?
>>>>
>>>> I've been running the above for nearly two days on 3.16.0 on a board
>>>> with fec interrupts routed through gpio_6 and haven't seen a hint of
>>>> a problem.
>>>
>>> Thanks for testing.
>>>
>>> Which mx6 board have you used on this test?
>>
>> It's currently pointed at a RIoTboard (atheros phy) but I'm happy to 
>> try it against both a Sabre-Lite and a Wandboard B1, all running the 
>> same kernel binary, as well. 
>>
>> I'm interested enough in why different people get different results 
>> with this that I'll put some time towards testing to try to help 
>> narrow down the cause.
>>
> 
> two and a half days of running this against both a sabre-lite and a 
> wandboard quad B1 and I still have no reason to think there's any 
> sort of a problem.
> 
> Up to now, my testing has been done with my own config, I'll now
> repeat the whole thing using the config Mattis posted to see if 
> I can reproduce it that way.
> 
> Suggestions on a better / easier / quicker way to reproduce it are 
> welcome.
> 

So I wasn't able to use Mattis's exact configuration as I couldn't 
get it to boot properly on anything. 

I made just enough changes to enable mmc/sata and to disable the 
compiled-in kernel command line, the appended devicetree and the initrd.
Even then it still won't boot on my WBQUAD. 
It is running on Sabre-Lite and RIoTboard though, so it is useful enough 
to test against the SL in a similar manner to Mattis's tests with the SL.

I've had the test running against both for approx one day and again 
no sign of any problems. I'm happy to leave this running, but at 
this stage I'm not expecting I'll see any problems even if I leave 
it running for a week.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Oops: 17 SMP ARM (v3.16-rc2)
  2014-08-14 14:43                               ` Mattis Lorentzon
  2014-08-14 15:30                                 ` Fabio Estevam
@ 2014-08-22  8:27                                 ` Russell King - ARM Linux
  1 sibling, 0 replies; 44+ messages in thread
From: Russell King - ARM Linux @ 2014-08-22  8:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Aug 14, 2014 at 02:43:56PM +0000, Mattis Lorentzon wrote:
> Fabio and Russell,
> 
> > A working theory is that the switch (3Com Switch 4400) triggers the
> > degeneration of the network stack from which Linux does not seem to
> > recover, even if we later bypass the switch and directly connect the board to
> > the server machine.
> 
> After a few more tests we have finally been able to trigger the exact
> same stalls on the Sabrelite board with a direct network connection
> (i.e. without the switch).

That's a setup which I can't reproduce, as all my MX6 hardware runs
root-NFS, so using a direct connection to a machine to test will
result in the MX6 losing its root filesystem.

That said, on SolidRun hardware, there is some investigation going on
at the moment concerning poor UDP performance - this is an on-going
problem that has been present for a long time.

What we find is that TCP performance achieves around the 600Mbps mark,
but UDP performance can be extremely poor with high packet loss.
Adding a udelay(210) into the fec_enet_rx() can perversely (on
multi-core SoCs) increase UDP performance to around 500Mbps at the
expense of a reduction in TCP performance.

This "solution" was tripped over while trying to debug this problem,
and it was found that adding printk()s to the driver increased UDP
performance - so substituting udelay() for printk() was then tried.

I tried to run perf on the kernel yesterday to find out what's going
on, but for some reason, perf gave me impossible call traces, so I
gave up with that idea.  For example, perf told me that there was a
high hit rate in memcpy() being called from net_rx_action(), but
net_rx_action() doesn't call memcpy(), nor do any of the called
functions as a tail-call.

That said, I don't think perf could tell us what's going on - what
we need is a trace of the CPU's execution while iperf is running,
*without* affecting the CPU itself.  This is something I can't do
with the hardware I have.

My suspicion (unproven) is that a batch of packets get processed in
the softirq handler called during the FEC interrupt exit path.  Then,
because there's more work to be done, ksoftirqd is scheduled, but it
takes time for ksoftirqd to start running - during which time we drop
a lot of packets.  ksoftirqd processes some packets, but then finds
that it can't complete the NAPI "work budget", and so stops running,
resulting in the packet processing being triggered by the next FEC
interrupt, and the cycle repeats.

TCP notices this, and adjusts its sending rate to match, whereas UDP
just carries on regardless, resulting in lots of packets dropped each
time we switch from the tail of hardirq processing to ksoftirqd.

With the udelay() in place, processing takes enough time that it gets
bounced onto ksoftirqd, where it stays.
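
Part of this can be watched from userspace without instrumenting the
driver: /proc/net/softnet_stat has one line of hexadecimal counters per
CPU, where the first column counts packets processed, the second counts
packets dropped at the per-CPU backlog queue, and the third counts how
often net_rx_action() stopped with work still pending because its budget
or time slice ran out. The sketch below only decodes those three columns;
softnet_summary is a hypothetical name, and packets lost in the FEC's own
receive ring would not show up in these counters.

```shell
# Hypothetical helper: decode the first three hex columns of softnet_stat.
# An alternative file can be passed as the first argument.
softnet_summary() {
    local file cpu processed dropped squeezed rest
    file=${1:-/proc/net/softnet_stat}
    cpu=0
    while read -r processed dropped squeezed rest; do
        # Columns are hexadecimal without a 0x prefix; add one for $(( )).
        echo "cpu$cpu processed=$((0x$processed)) dropped=$((0x$dropped)) time_squeeze=$((0x$squeezed))"
        cpu=$((cpu + 1))
    done < "$file"
}

# Usage: take a snapshot before and after an iperf run and compare.
#   softnet_summary
```

A time_squeeze count that climbs quickly during a UDP test would be
consistent with the handover to ksoftirqd described above, although it
would not prove it.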

I'm adding this to this thread in case it has any bearing on the
problem(s) you're seeing - yes, it seems like a different problem, but
could it be related...

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2014-12-16 14:50 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <C4A61A5CA4AD2246942F4BA6F20ACB804356534D@ALVA-EXMB04.alv.autoliv.int>
     [not found] ` <20140626140115.GQ32514@n2100.arm.linux.org.uk>
2014-06-26 14:44   ` Oops: 17 SMP ARM (v3.16-rc2) Mattis Lorentzon
2014-06-26 15:14     ` Russell King - ARM Linux
2014-06-27 11:21       ` Russell King - ARM Linux
2014-06-27 16:16         ` Fredrik Noring
2014-06-27 16:31           ` Russell King - ARM Linux
2014-06-30  6:22             ` Fredrik Noring
2014-06-30 12:30             ` Fredrik Noring
2014-06-30 13:00               ` Nathan Lynch
2014-07-02  6:02             ` Fredrik Noring
2014-08-05 13:31             ` Mattis Lorentzon
2014-08-05 13:53               ` Fabio Estevam
2014-08-06  6:48                 ` Mattis Lorentzon
2014-08-06  9:50               ` Russell King - ARM Linux
2014-08-06 11:10                 ` Mattis Lorentzon
2014-08-06 12:55                   ` Russell King - ARM Linux
2014-08-07 11:11                     ` Mattis Lorentzon
2014-08-07 12:12                       ` Russell King - ARM Linux
2014-08-07 14:20                         ` Fabio Estevam
2014-08-07 14:38                           ` Fabio Estevam
2014-08-08  1:30                             ` Troy Kisky
2014-08-08 14:05                           ` Fabio Estevam
2014-08-08 18:09                         ` Russell King - ARM Linux
2014-08-11 13:32                           ` Mattis Lorentzon
2014-08-11 17:41                             ` Fabio Estevam
2014-08-13 13:39                               ` Mattis Lorentzon
2014-08-25 10:18                                 ` Russell King - ARM Linux
2014-08-26 13:11                                   ` Iain Paton
2014-08-14 14:43                               ` Mattis Lorentzon
2014-08-14 15:30                                 ` Fabio Estevam
2014-08-15  5:42                                   ` Mattis Lorentzon
2014-08-17 21:34                                     ` Iain Paton
2014-08-17 21:46                                       ` Fabio Estevam
2014-08-19  6:03                                         ` Iain Paton
2014-08-21  9:39                                           ` Iain Paton
2014-08-22  0:01                                             ` Fabio Estevam
2014-08-22  6:39                                               ` Mattis Lorentzon
2014-08-22 10:36                                               ` Iain Paton
2014-08-27  6:32                                                 ` Mattis Lorentzon
2014-08-27 10:43                                                   ` Iain Paton
2014-08-29 10:57                                                     ` Mattis Lorentzon
2014-08-29 11:30                                                       ` Fabio Estevam
2014-12-16 14:50                                                       ` Mattis Lorentzon
2014-08-26 13:12                                             ` Iain Paton
2014-08-22  8:27                                 ` Russell King - ARM Linux
