* General firmware stability issue.
From: Ben Greear @ 2014-06-19 18:58 UTC
  To: ath10k

When using our firmware and kernel mods, we often see our AP system
crash the firmware after several days of various testing.

Often after this, it takes a full reboot to bring the system back.

For those with ability to debug firmware source,
at least some of the time, it is a heap list corruption/assert
that crashes us, but I have not nailed down exactly where/why yet.

Based on some email I received, I believe this problem may
happen on standard firmware as well.

I am curious to know if anyone else sees this type of problem,
and with what regularity.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k


* Re: General firmware stability issue.
From: Michal Kazior @ 2014-06-23  6:49 UTC
  To: Ben Greear; +Cc: ath10k

On 19 June 2014 20:58, Ben Greear <greearb@candelatech.com> wrote:
> When using our firmware and kernel mods, we often see our AP system
> crash the firmware after several days of various testing.
>
> Often after this, it takes a full reboot to bring the system back.

Can you elaborate on this? Why does it need a full reboot?


> For those with ability to debug firmware source,
> at least some of the time, it is a heap list corruption/assert
> that crashes us, but I have not nailed down exactly where/why yet.

Some of the time... but what happens the rest of the time? Is there any crash dump?


> Based on some email I received, I believe this problem may
> happen on standard firmware as well.
>
> I am curious to know if anyone else sees this type of problem,
> and with what regularity.

I'm aware of one problem with beaconing now. Since there's no "beacon
tx completed" indication, ath10k is forced to blindly unmap/free the
beacon sk_buff when the next swba event is handled. In some rare cases,
when target wmi pipes get stuck/lag, it's possible to get an IOMMU fault
(provided your platform supports it and it's enabled) that crashes the
target so badly that it's impossible to even use the CE diag window to
read out the crash dump. Warm reset is ineffective after that and only
cold reset is able to bring it up again (but it also hangs the host
sometimes due to a hw bug).
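
The buffer lifetime problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions: it is plain Python, not ath10k code, and every name in it (Host, on_swba, device_fetch, run, IommuFault) is hypothetical. It assumes the host frees the previous beacon buffer at each SWBA event without knowing whether the device has finished with it, and that a lagging device then reads an address that is no longer mapped.

```python
# Toy model of the beacon buffer lifetime issue. All names are hypothetical;
# this is not ath10k code and does not use any kernel API.


class IommuFault(Exception):
    """Raised when the modelled device touches an address that is not mapped."""


class Host:
    def __init__(self):
        self.mapped = set()       # DMA addresses currently mapped for the device
        self.next_addr = 0x1000   # arbitrary starting address for the model
        self.beacon = None        # beacon buffer handed out at the last SWBA

    def on_swba(self):
        """Handle one SWBA event: free the old beacon blindly, map a new one."""
        if self.beacon is not None:
            # There is no "beacon tx completed" indication, so the host cannot
            # tell whether the device is done with the old buffer.
            self.mapped.discard(self.beacon)
        self.beacon = self.next_addr
        self.next_addr += 0x1000
        self.mapped.add(self.beacon)
        return self.beacon


def device_fetch(host, addr):
    """Model the device reading a beacon buffer over DMA."""
    if addr not in host.mapped:
        raise IommuFault(f"DMA to unmapped address {addr:#x}")
    return addr


def run(lag):
    """Run three SWBA intervals; the device fetches each beacon `lag` events late."""
    host = Host()
    pending = []
    for _ in range(3):
        pending.append(host.on_swba())
        if len(pending) > lag:
            device_fetch(host, pending.pop(0))
    return "ok"
```

With run(0) the device always fetches the buffer that is currently mapped. With run(1) the device is one event behind, so its first fetch hits a buffer the host has already unmapped and the model raises IommuFault.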


Michał


* Re: General firmware stability issue.
From: Ben Greear @ 2014-06-23 16:05 UTC
  To: Michal Kazior; +Cc: ath10k



On 06/22/2014 11:49 PM, Michal Kazior wrote:
> On 19 June 2014 20:58, Ben Greear <greearb@candelatech.com> wrote:
>> When using our firmware and kernel mods, we often see our AP system
>> crash the firmware after several days of various testing.
>>
>> Often after this, it takes a full reboot to bring the system back.
>
> Can you elaborate on this? Why does it need a full reboot?

I'll send kernel messages next time it happens, but basically it just
fails cold restart over and over again.

>
>> For those with ability to debug firmware source,
>> at least some of the time, it is a heap list corruption/assert
>> that crashes us, but I have not nailed down exactly where/why yet.
>
> Some of the time... but what happens the rest of the time? Is there any crash dump?

Sometimes I get crashes where the firmware says it cannot even read
the crash dump registers.  Usually this is after an initial dump
(say, heap crash), and shortly after, the cold restart failure problem
happens.

>> Based on some email I received, I believe this problem may
>> happen on standard firmware as well.
>>
>> I am curious to know if anyone else sees this type of problem,
>> and with what regularity.
>
> I'm aware of one problem with beaconing now. Since there's no "beacon
> tx completed" indication ath10k is forced to blindly unmap/free beacon
> sk_buff when next swba event is handled. In some rare cases when
> target wmi pipes get stuck/lag it's possible to get an IOMMU fault
> (provided your platform supports it and it's enabled) that crashes the
> target so badly it's impossible to even use the CE diag window to read
> out the crash dump. Warm reset is ineffective after that and only cold
> reset is able to bring it up again (but also hangs the host sometimes
> due to hw bug).

That is very interesting.  It sounds like that could be the problem
I hit.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: General firmware stability issue.
From: Ben Greear @ 2014-06-23 20:48 UTC
  To: Michal Kazior; +Cc: ath10k

On 06/23/2014 09:05 AM, Ben Greear wrote:
> 
> 
> On 06/22/2014 11:49 PM, Michal Kazior wrote:
>> On 19 June 2014 20:58, Ben Greear <greearb@candelatech.com> wrote:
>>> When using our firmware and kernel mods, we often see our AP system
>>> crash the firmware after several days of various testing.
>>>
>>> Often after this, it takes a full reboot to bring the system back.
>>
>> Can you elaborate on this? Why does it need a full reboot?
> 
> I'll send kernel messages next time it happens, but basically it just
> fails cold restart over and over again.

Here are logs from a station system that had a problem of this nature.  Since
it should not be doing any beaconing, I guess the root cause of at least this
particular problem is different.  This is with our firmware and hacked ath10k
driver, so of course it is possible it is not an upstream problem.

Kernel is 3.14, with most of the ath10k patches from 3.15 backported to it,
plus additional patches.

http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.14.dev.y/.git;a=summary

Jun 22 10:00:00 localhost kernel: ath10k: Creating vdev id: 0  map: 68719476735
Jun 22 10:00:00 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): sta1: link is not ready
Jun 22 10:00:00 localhost kernel: ath10k: Creating vdev id: 1  map: 68719476734
Jun 22 10:00:00 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): sta2: link is not ready
Jun 22 10:00:01 localhost kernel: ath10k: stop, state OFF
Jun 22 10:00:02 localhost kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jun 22 10:00:02 localhost kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
Jun 22 10:00:04 localhost kernel: ath10k: Target ready! transmit resources: 3 size:1792
Jun 22 10:00:04 localhost kernel: ath10k: wmi event firmware message 'P 73 V 36 T 411'
Jun 22 10:00:04 localhost kernel: ath10k: wmi event firmware message 'msdu-desc: 808  sw-crypt: 1'
Jun 22 10:00:04 localhost kernel: ath10k: wmi event firmware message 'alloc rem: 4332 iram: 57220'
Jun 22 10:00:04 localhost kernel: ath10k: start, state going from OFF to ON
Jun 22 10:00:04 localhost kernel: ath10k: Creating vdev id: 0  map: 68719476735
Jun 22 10:00:04 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): sta1: link is not ready
Jun 22 10:00:05 localhost kernel: ath10k: stop, state OFF
Jun 22 10:00:08 localhost kernel: ath10k: failed to receive initialized event from target: 00000000
Jun 22 10:00:08 localhost kernel: ath10k: failed to wait for target to init: -110
Jun 22 10:00:08 localhost kernel: ath10k: failed to power up target using warm reset: -110
Jun 22 10:00:08 localhost kernel: ath10k: trying cold reset
Jun 22 10:00:08 localhost kernel: ath10k: target took longer 5000 us to wake up (awake count 1)
Jun 22 10:00:11 localhost kernel: ath10k: failed to receive initialized event from target: ffffffff
Jun 22 10:00:11 localhost kernel: ath10k: failed to wait for target to init: -110
Jun 22 10:00:11 localhost kernel: ath10k: device crashed - no diagnostics available
Jun 22 10:00:11 localhost kernel: ath10k: target took longer 5000 us to wake up (awake count 2)
Jun 22 10:00:11 localhost kernel: ath10k: failed to wake up target: -110
Jun 22 10:00:11 localhost kernel: ath10k: failed to power up target using cold reset too (-110)
Jun 22 10:00:11 localhost kernel: ath10k: Could not init hif: -110 (state OFF)
Jun 22 10:00:11 localhost kernel: ath10k: target took longer 5000 us to wake up (awake count 2)
Jun 22 10:00:11 localhost kernel: ath10k: failed to wake up target: -110
Jun 22 10:00:11 localhost kernel: ath10k: failed to reset target: -110
Jun 22 10:00:11 localhost kernel: ath10k: failed to power up target using warm reset: -110
.....
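
For readers scanning logs like the one above, here is a minimal sketch that pulls out the failed power-up attempts and names the error code. It assumes the message wording shown in this log (which comes from a modified driver and may differ elsewhere); the helper name failed_resets and the small errno table are hypothetical. The one fact relied on is that errno 110 is ETIMEDOUT on Linux.

```python
import re

# Linux errno values relevant to this log; 110 is ETIMEDOUT on Linux.
# The table is hard-coded so the result does not depend on the host OS.
LINUX_ERRNO = {110: "ETIMEDOUT"}

# Matches both wordings seen in the log:
#   "failed to power up target using warm reset: -110"
#   "failed to power up target using cold reset too (-110)"
RESET_FAIL = re.compile(
    r"ath10k: failed to power up target using (warm|cold) reset"
    r"(?: too)?:? \(?(-\d+)\)?"
)


def failed_resets(lines):
    """Return (reset_kind, errno_name) for each failed power-up attempt."""
    results = []
    for line in lines:
        match = RESET_FAIL.search(line)
        if match:
            code = -int(match.group(2))  # "-110" -> 110
            results.append((match.group(1), LINUX_ERRNO.get(code, str(code))))
    return results


# Three lines copied from the log above.
sample = [
    "Jun 22 10:00:08 localhost kernel: ath10k: failed to power up target using warm reset: -110",
    "Jun 22 10:00:08 localhost kernel: ath10k: trying cold reset",
    "Jun 22 10:00:11 localhost kernel: ath10k: failed to power up target using cold reset too (-110)",
]
```

Calling failed_resets(sample) reports one warm and one cold attempt, both timing out, which matches the "fails cold restart over and over again" description.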



Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: General firmware stability issue.
From: Michal Kazior @ 2014-06-24  5:32 UTC
  To: Ben Greear; +Cc: ath10k

On 23 June 2014 22:48, Ben Greear <greearb@candelatech.com> wrote:
> On 06/23/2014 09:05 AM, Ben Greear wrote:
>>
>>
>> On 06/22/2014 11:49 PM, Michal Kazior wrote:
>>> On 19 June 2014 20:58, Ben Greear <greearb@candelatech.com> wrote:
>>>> When using our firmware and kernel mods, we often see our AP system
>>>> crash the firmware after several days of various testing.
>>>>
>>>> Often after this, it takes a full reboot to bring the system back.
>>>
>>> Can you elaborate on this? Why does it need a full reboot?
>>
>> I'll send kernel messages next time it happens, but basically it just
>> fails cold restart over and over again.
>
> Here are logs from a station system that had a problem of this nature.  Since
> it should not be doing any beaconing, I guess the root cause of at least this
> particular problem is different.  This is with our firmware and hacked ath10k
> driver, so of course it is possible it is not an upstream problem.

The cause may be different but the mechanism might be the same, i.e.
at some point the target accesses an invalid memory address on the host
and the controller goes nopenopenope.


> Kernel is 3.14, with most of the ath10k patches from 3.15 backported to it,
> plus additional patches.
>
> http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.14.dev.y/.git;a=summary
>
> Jun 22 10:00:00 localhost kernel: ath10k: Creating vdev id: 0  map: 68719476735
> Jun 22 10:00:00 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): sta1: link is not ready
> Jun 22 10:00:00 localhost kernel: ath10k: Creating vdev id: 1  map: 68719476734
> Jun 22 10:00:00 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): sta2: link is not ready
> Jun 22 10:00:01 localhost kernel: ath10k: stop, state OFF
> Jun 22 10:00:02 localhost kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jun 22 10:00:02 localhost kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
> Jun 22 10:00:04 localhost kernel: ath10k: Target ready! transmit resources: 3 size:1792
> Jun 22 10:00:04 localhost kernel: ath10k: wmi event firmware message 'P 73 V 36 T 411'
> Jun 22 10:00:04 localhost kernel: ath10k: wmi event firmware message 'msdu-desc: 808  sw-crypt: 1'
> Jun 22 10:00:04 localhost kernel: ath10k: wmi event firmware message 'alloc rem: 4332 iram: 57220'
> Jun 22 10:00:04 localhost kernel: ath10k: start, state going from OFF to ON
> Jun 22 10:00:04 localhost kernel: ath10k: Creating vdev id: 0  map: 68719476735
> Jun 22 10:00:04 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): sta1: link is not ready
> Jun 22 10:00:05 localhost kernel: ath10k: stop, state OFF
> Jun 22 10:00:08 localhost kernel: ath10k: failed to receive initialized event from target: 00000000
> Jun 22 10:00:08 localhost kernel: ath10k: failed to wait for target to init: -110
> Jun 22 10:00:08 localhost kernel: ath10k: failed to power up target using warm reset: -110
> Jun 22 10:00:08 localhost kernel: ath10k: trying cold reset

You might want to try out my warm reset patch from Kalle's tree to
reduce usage of cold reset.


> Jun 22 10:00:08 localhost kernel: ath10k: target took longer 5000 us to wake up (awake count 1)
> Jun 22 10:00:11 localhost kernel: ath10k: failed to receive initialized event from target: ffffffff

0xffffffff from an ioread32()? It looks as if the device was
disconnected from the bus.
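
The all-ones observation can be turned into a small log check. This is a sketch, assuming the "from target: <hex>" wording seen in the quoted log; the helper names are hypothetical and this is not driver code. It relies on the general PCI behaviour that a read from a device that does not respond returns all ones, which is why 0xffffffff is treated separately from 0.

```python
# Hypothetical helpers for reading the value printed after "from target:".
ALL_ONES_32 = 0xFFFFFFFF


def classify_indicator(hex_text):
    """Classify the 32-bit value printed in the 'initialized event' message."""
    value = int(hex_text, 16)
    if value == ALL_ONES_32:
        # Typical of a read that never reached the device.
        return "all ones: device looks disconnected from the bus"
    if value == 0:
        return "zero: register readable, no bits set"
    return "other: 0x%08x" % value


def indicator_from_line(line):
    """Extract and classify the value from one log line, or return None."""
    marker = "from target: "
    index = line.find(marker)
    if index < 0:
        return None
    return classify_indicator(line[index + len(marker):].strip())
```

Applied to the two "failed to receive initialized event" lines in the log, the first (00000000, after warm reset) is classified as zero and the second (ffffffff, after cold reset) as all ones.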

Perhaps your controller is more resilient to the hw cold reset bug and
you just end up with a device that looks as if disconnected, e.g. my
T430 hangs but AP135 just complains with a "data bus error" (both when
cold reset fails). Both cases need a reboot to make stuff work again.


Michał

