* Firmware crash when sending large numbers of forwarded packets
@ 2014-03-07 10:38 Avery Pennarun
2014-03-07 11:15 ` Michal Kazior
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Avery Pennarun @ 2014-03-07 10:38 UTC (permalink / raw)
To: ath10k
Hi,
I'm having a problem where if I transmit too fast out the ath10k
interface in AP mode, I get a near-immediate firmware crash.
Notes:
- I can generate this by running 'iperf -s' on a wifi station and
'iperf -c' on a separate machine on the wired LAN connected to the
ath10k AP. (This transmits data in the downstream direction, ie. to
the wifi interface)
- If I use 'iperf -u' (UDP) it tries to restart the firmware, but the
transmit queue stays dead. If I use iperf in the default TCP mode, it
restarts the firmware and generally recovers okay.
- If I run 'iperf -c' directly on the AP device instead of a separate
machine on the LAN, the crash never occurs.
- If I swap the direction of transmit (to the AP instead of from the
AP) the crash never occurs.
- CPU usage on the AP is always less than 1 CPU core (ARM CPU), so
it's probably not falling behind on processing.
- Using 40 or 80 MHz channel width for the test; also happens with 20
MHz but less frequently.
- Problem mostly doesn't occur until I exceed about 160 Mbit/sec.
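For a sense of scale, the ~160 Mbit/sec threshold can be converted into a packet rate. This is only a rough sketch in C: it assumes 1470-byte UDP payloads (the classic iperf default; the datagram size used in these tests is not stated) and ignores header overhead. The function names are made up for illustration.

```c
/* Packets per second needed to carry rate_bps with payload_bytes of
 * payload per packet. Header overhead is ignored, so the real on-air
 * load is somewhat higher. */
double packets_per_second(double rate_bps, double payload_bytes)
{
    return rate_bps / (payload_bytes * 8.0);
}

/* Average spacing between consecutive packets, in microseconds. */
double packet_interval_us(double rate_bps, double payload_bytes)
{
    return 1e6 / packets_per_second(rate_bps, payload_bytes);
}
```

Under those assumptions, packets_per_second(160e6, 1470) comes out at roughly 13,600 packets/sec, i.e. about one packet every 73.5 microseconds.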
Versions:
- kernel is based on current kvalo/for-linville branch (should I try
something else?) but seems to be the same in linux-next-20140114 so I
don't think this behaviour has changed lately.
- firmware version 10.1.467.2-1, but also tested with 10.1.467.1-1
with no difference.
I assume other people are not experiencing this or they would have
mentioned it by now. What can I do to help debug this?
Logs:
[ 360.717699] ath10k: firmware crashed!
[ 360.721434] ath10k: hardware name qca988x hw2.0 version 0x4100016c
[ 360.727669] ath10k: firmware version: 10.1.467.2-1
[ 360.733695] ath10k: target register Dump Location: 0x0040AC14
[ 360.740655] ath10k: target Register Dump
[ 360.744652] ath10k: [00]: 0x4100016C 0x000015B3 0x009AA69E 0x00955B31
[ 360.751163] ath10k: [04]: 0x009AA69E 0x00060530 0x00000002 0x00000000
[ 360.757636] ath10k: [08]: 0x004139B8 0x00955A00 0x0040EE54 0x00000000
[ 360.764185] ath10k: [12]: 0x00000009 0x00000000 0x0095808C 0x009580A2
[ 360.770663] ath10k: [16]: 0x00958080 0x0094085D 0x00000000 0x00000000
[ 360.777174] ath10k: [20]: 0x409AA69E 0x0040AD24 0x00000001 0x0000013A
[ 360.783672] ath10k: [24]: 0x809A8892 0x0040AD84 0x0040E9F8 0xC09AA69E
[ 360.790145] ath10k: [28]: 0x809A8920 0x0040ADD4 0x0040AE44 0x00400000
[ 360.796640] ath10k: [32]: 0x809A8138 0x0040AE04 0x00000001 0x0040AE44
[ 360.803133] ath10k: [36]: 0x809A7885 0x0040AE24 0x00411124 0x00411148
[ 360.809603] ath10k: [40]: 0x809B3AAC 0x0040AE44 0x00000001 0x00000000
[ 360.816099] ath10k: [44]: 0x809B39B8 0x0040AEA4 0x0041DAAC 0x00411784
[ 360.822587] ath10k: [48]: 0x80942EB3 0x0040AEC4 0x0041DAAC 0x00000001
[ 360.829057] ath10k: [52]: 0x80940F18 0x0040AF14 0x00000011 0x00403AD4
[ 360.835550] ath10k: [56]: 0x80940EEA 0x0040AF44 0x00400000 0x00000000
[ 361.840870] ath10k: suspend timed out - target pause event never came
[ 362.119075] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffd8c000 busy
[ 362.127193] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffd8d000 busy
[ 362.135290] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffd8e000 busy
[ 362.143372] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffd8f000 busy
[ 362.151447] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffda9000 busy
[ 362.159483] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdaa000 busy
[ 362.167543] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdab000 busy
[ 362.175602] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdac000 busy
[ 362.183668] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdad000 busy
[ 362.191725] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdae000 busy
[ 362.199773] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdaf000 busy
[ 362.207834] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdb9000 busy
[ 362.215896] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdba000 busy
[ 362.223959] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdbb000 busy
[ 362.490921] phy1: Hardware restart was requested
[ 363.804817] ath10k: device successfully recovered
Thanks,
Avery
_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-07 10:38 Firmware crash when sending large numbers of forwarded packets Avery Pennarun
@ 2014-03-07 11:15 ` Michal Kazior
2014-03-07 15:53 ` Ben Greear
2014-03-08 8:03 ` Kalle Valo
2 siblings, 0 replies; 15+ messages in thread
From: Michal Kazior @ 2014-03-07 11:15 UTC (permalink / raw)
To: Avery Pennarun; +Cc: ath10k
On 7 March 2014 11:38, Avery Pennarun <apenwarr@gmail.com> wrote:
> Hi,
>
> I'm having a problem where if I transmit too fast out the ath10k
> interface in AP mode, I get a near-immediate firmware crash.
>
> Notes:
[...]
> - If I swap the direction of transmit (to the AP instead of from the
> AP) the crash never occurs.
What is the STA endpoint? Is it an ath10k too? Does it run on similar
ARM board too?
> I assume other people are not experiencing this or they would have
> mentioned it by now. What can I do to help debug this?
Can you try reproducing this with traces? Considering this might get
too big to attach to an email I think it's okay to exclude
ath10k:ath10k_log_dbg_dump from traces.
[...]
> [ 362.119075] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffd8c000 busy
> [ 362.127193] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffd8d000 busy
> [ 362.135290] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffd8e000 busy
> [ 362.143372] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffd8f000 busy
> [ 362.151447] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffda9000 busy
> [ 362.159483] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdaa000 busy
> [ 362.167543] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdab000 busy
> [ 362.175602] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdac000 busy
> [ 362.183668] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdad000 busy
> [ 362.191725] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdae000 busy
> [ 362.199773] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdaf000 busy
> [ 362.207834] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdb9000 busy
> [ 362.215896] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdba000 busy
> [ 362.223959] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffdbb000 busy
These dma pool warnings are very suspicious. I'm guessing tx wasn't
really stopped and perhaps it leaked due to missing locking/checks.
Michał
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-07 10:38 Firmware crash when sending large numbers of forwarded packets Avery Pennarun
2014-03-07 11:15 ` Michal Kazior
@ 2014-03-07 15:53 ` Ben Greear
2014-03-08 0:54 ` Avery Pennarun
2014-03-08 8:03 ` Kalle Valo
2 siblings, 1 reply; 15+ messages in thread
From: Ben Greear @ 2014-03-07 15:53 UTC (permalink / raw)
To: Avery Pennarun, ath10k
On 03/07/2014 02:38 AM, Avery Pennarun wrote:
> Hi,
>
> I'm having a problem where if I transmit too fast out the ath10k
> interface in AP mode, I get a near-immediate firmware crash.
>
> Notes:
> - I can generate this by running 'iperf -s' on a wifi station and
> 'iperf -c' on a separate machine on the wired LAN connected to the
> ath10k AP. (This transmits data in the downstream direction, ie. to
> the wifi interface)
> - If I use 'iperf -u' (UDP) it tries to restart the firmware, but the
> transmit queue stays dead. If I use iperf in the default TCP mode, it
> restarts the firmware and generally recovers okay.
> - If I run 'iperf -c' directly on the AP device instead of a separate
> machine on the LAN, the crash never occurs.
That is interesting, sending from LAN will often burst higher
and cause more packet loss (and perhaps higher periodic packet
loads) but sending locally will typically allow the local stack to back
off more gracefully.
> - If I swap the direction of transmit (to the AP instead of from the
> AP) the crash never occurs.
> - CPU usage on the AP is always less than 1 CPU core (ARM CPU), so
> it's probably not falling behind on processing.
> - Using 40 or 80 MHz channel width for the test; also happens with 20
> MHz but less frequently.
> - Problem mostly doesn't occur until I exceed about 160 Mbit/sec.
>
> Versions:
> - kernel is based on current kvalo/for-linville branch (should I try
> something else?) but seems to be the same in linux-next-20140114 so I
> don't think this behaviour has changed lately.
> - firmware version 10.1.467.2-1, but also tested with 10.1.467.1-1
> with no difference.
>
> I assume other people are not experiencing this or they would have
> mentioned it by now. What can I do to help debug this?
>
> Logs:
>
> [ 360.717699] ath10k: firmware crashed!
> [ 360.721434] ath10k: hardware name qca988x hw2.0 version 0x4100016c
> [ 360.727669] ath10k: firmware version: 10.1.467.2-1
> [ 360.733695] ath10k: target register Dump Location: 0x0040AC14
> [ 360.740655] ath10k: target Register Dump
> [ 360.744652] ath10k: [00]: 0x4100016C 0x000015B3 0x009AA69E 0x00955B31
This appears to be an assert in firmware, not just random crash.
Someone with .467 firmware source and a bit of skill with the tool-chain
should be able to figure out where it is asserting.
I do not know anyone with both of those things, however :P
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-07 15:53 ` Ben Greear
@ 2014-03-08 0:54 ` Avery Pennarun
2014-03-08 1:08 ` Ben Greear
2014-03-10 7:47 ` Michal Kazior
0 siblings, 2 replies; 15+ messages in thread
From: Avery Pennarun @ 2014-03-08 0:54 UTC (permalink / raw)
To: Ben Greear; +Cc: ath10k
On Fri, Mar 7, 2014 at 10:53 AM, Ben Greear <greearb@candelatech.com> wrote:
> That is interesting, sending from LAN will often burst higher
> and cause more packet loss (and perhaps higher periodic packet
> loads) but sending locally will typically allow the local stack to back
> off more gracefully.
That theory sounds right. Do you think there's a way to simulate the
burstiness while avoiding the extra machine? That would make testing
easier. Some way to write a lot of packets to a socket and then
release it all at once? I'm guessing UDP would work better for this
than TCP.
>> [ 360.717699] ath10k: firmware crashed!
>> [ 360.721434] ath10k: hardware name qca988x hw2.0 version 0x4100016c
>> [ 360.727669] ath10k: firmware version: 10.1.467.2-1
>> [ 360.733695] ath10k: target register Dump Location: 0x0040AC14
>> [ 360.740655] ath10k: target Register Dump
>> [ 360.744652] ath10k: [00]: 0x4100016C 0x000015B3 0x009AA69E 0x00955B31
>
> This appears to be an assert in firmware, not just random crash.
>
> Someone with .467 firmware source and a bit of skill with the tool-chain
> should be able to figure out where it is asserting.
>
> I do not know anyone with both of those things, however :P
Maybe I should try harder to get access to the firmware source :)
Michal wrote:
> Avery wrote:
>> - If I swap the direction of transmit (to the AP instead of from the
>> AP) the crash never occurs.
>
> What is the STA endpoint? Is it an ath10k too? Does it run on similar
> ARM board too?
It's a Macbook in this case. Other people here have simulated it with
other kinds of endpoints, including a Veriwave test device.
>> I assume other people are not experiencing this or they would have
>> mentioned it by now. What can I do to help debug this?
>
> Can you try reproducing this with traces? Considering this might get
> too big to attach to an email I think it's okay to exclude
> ath10k:ath10k_log_dbg_dump from traces.
Can you give me a clue how to get started with traces? What do I
enable, and what parts do you want to see?
>> [ 362.119075] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffd8c000 busy
>> [ 362.127193] ath10k_pci 0000:00:00.0: dma_pool_destroy ath10k htt tx pool, ffd8d000 busy
>
> These dma pool warnings are very suspicious. I'm guessing tx wasn't
> really stopped and perhaps it leaked due to missing locking/checks.
Agreed. But I figured that since this part only happens after the
firmware has already crashed, maybe it's of secondary importance.
Have fun,
Avery
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-08 0:54 ` Avery Pennarun
@ 2014-03-08 1:08 ` Ben Greear
2014-03-08 1:21 ` Avery Pennarun
2014-03-10 7:47 ` Michal Kazior
1 sibling, 1 reply; 15+ messages in thread
From: Ben Greear @ 2014-03-08 1:08 UTC (permalink / raw)
To: Avery Pennarun; +Cc: ath10k
On 03/07/2014 04:54 PM, Avery Pennarun wrote:
> On Fri, Mar 7, 2014 at 10:53 AM, Ben Greear <greearb@candelatech.com> wrote:
>> That is interesting, sending from LAN will often burst higher
>> and cause more packet loss (and perhaps higher periodic packet
>> loads) but sending locally will typically allow the local stack to back
>> off more gracefully.
>
> That theory sounds right. Do you think there's a way to simulate the
> burstiness while avoiding the extra machine? That would make testing
> easier. Some way to write a lot of packets to a socket and then
> release it all at once? I'm guessing UDP would work better for this
> than TCP.
UDP has some local backpressure as well, but maybe it would be more
likely to hit it than TCP I guess.
It could also be something else related to forwarding that is not just
rate specific (maybe frames come in with LRO and something special
about transmitting those, or something of that nature?).
Are you using bridging or routing on your AP? Bridging might
be fundamentally different than routing as far as this bug
is concerned. Sending from local AP would likely emulate
routing type behaviour.
>> This appears to be an assert in firmware, not just random crash.
>>
>> Someone with .467 firmware source and a bit of skill with the tool-chain
>> should be able to figure out where it is asserting.
>>
>> I do not know anyone with both of those things, however :P
>
> Maybe I should try harder to get access to the firmware source :)
Good luck..if you do get it and the toolchain, I can point
you towards how to decode (ie, get stack trace) that crash, or at least I can send
QCA email they could forward you (to satisfy any NDA issues).
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-08 1:08 ` Ben Greear
@ 2014-03-08 1:21 ` Avery Pennarun
0 siblings, 0 replies; 15+ messages in thread
From: Avery Pennarun @ 2014-03-08 1:21 UTC (permalink / raw)
To: Ben Greear; +Cc: ath10k
On Fri, Mar 7, 2014 at 7:08 PM, Ben Greear <greearb@candelatech.com> wrote:
> On 03/07/2014 04:54 PM, Avery Pennarun wrote:
>> On Fri, Mar 7, 2014 at 10:53 AM, Ben Greear <greearb@candelatech.com> wrote:
>>> That is interesting, sending from LAN will often burst higher
>>> and cause more packet loss (and perhaps higher periodic packet
>>> loads) but sending locally will typically allow the local stack to back
>>> off more gracefully.
>>
>> That theory sounds right. Do you think there's a way to simulate the
>> burstiness while avoiding the extra machine? That would make testing
>> easier. Some way to write a lot of packets to a socket and then
>> release it all at once? I'm guessing UDP would work better for this
>> than TCP.
>
> UDP has some local backpressure as well, but maybe it would be more
> likely to hit it than TCP I guess.
>
> It could also be something else related to forwarding that is not just
> rate specific (maybe frames come in with LRO and something special
> about transmitting those, or something of that nature?).
Yes, that's possible...
> Are you using bridging or routing on your AP? Bridging might
> be fundamentally different than routing as far as this bug
> is concerned. Sending from local AP would likely emulate
> routing type behaviour.
The wlan1 interface is in a bridge, but the packets were coming from a
wired ethernet interface (WAN port) that is not in the bridge. I'll
try without bridging and see if that affects anything. Thanks for the
suggestion.
>>> This appears to be an assert in firmware, not just random crash.
>>>
>>> Someone with .467 firmware source and a bit of skill with the tool-chain
>>> should be able to figure out where it is asserting.
>>>
>>> I do not know anyone with both of those things, however :P
>>
>> Maybe I should try harder to get access to the firmware source :)
>
> Good luck..if you do get it and the toolchain, I can point
> you towards how to decode (ie, get stack trace) that crash, or at least I can send
> QCA email they could forward you (to satisfy any NDA issues).
Thanks. I'll contact them again and see what happens.
Meanwhile, any suggestions anyone can provide on "what to trace" (and
how to trace) right now given that I don't have firmware access, would
be helpful.
Have fun,
Avery
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-07 10:38 Firmware crash when sending large numbers of forwarded packets Avery Pennarun
2014-03-07 11:15 ` Michal Kazior
2014-03-07 15:53 ` Ben Greear
@ 2014-03-08 8:03 ` Kalle Valo
2014-03-08 8:20 ` Avery Pennarun
2 siblings, 1 reply; 15+ messages in thread
From: Kalle Valo @ 2014-03-08 8:03 UTC (permalink / raw)
To: Avery Pennarun; +Cc: ath10k
Hi Avery,
Avery Pennarun <apenwarr@gmail.com> writes:
> I'm having a problem where if I transmit too fast out the ath10k
> interface in AP mode, I get a near-immediate firmware crash.
[...]
> Versions:
> - kernel is based on current kvalo/for-linville branch (should I try
> something else?) but seems to be the same in linux-next-20140114 so I
> don't think this behaviour has changed lately.
I do not recommend using for-linville branch for anything. As the name
implies, it's only for John Linville to pull ath10k and ath6kl changes
to his tree.
What I recommend is to use the master branch of my ath.git tree. That's
fairly recent wireless-testing (max 2 weeks old) plus latest ath10k +
ath6kl patches I have (ie. merge of wireless-testing and my ath-next
branch).
But if you prefer to have clean history (which wireless-testing doesn't
have), you can also use ath-next branch directly. That contains the
patches which I'm planning to send Linville next.
I wrote some documentation about branches here:
http://wireless.kernel.org/en/users/Drivers/ath10k/sources#Git_branches
But in a nutshell, use the master branch if you can. That way you are
best aligned with the ath10k developers and also spot any regressions
early on.
> - firmware version 10.1.467.2-1, but also tested with 10.1.467.1-1
> with no difference.
>
> I assume other people are not experiencing this or they would have
> mentioned it by now. What can I do to help debug this?
We have reported the issue to the firmware team and got some feedback
already. Hopefully we know more early next week.
--
Kalle Valo
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-08 8:03 ` Kalle Valo
@ 2014-03-08 8:20 ` Avery Pennarun
2014-03-08 16:57 ` Kalle Valo
2014-03-10 10:10 ` Michal Kazior
0 siblings, 2 replies; 15+ messages in thread
From: Avery Pennarun @ 2014-03-08 8:20 UTC (permalink / raw)
To: Kalle Valo; +Cc: ath10k
On Sat, Mar 8, 2014 at 2:03 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
> Avery Pennarun <apenwarr@gmail.com> writes:
>> I'm having a problem where if I transmit too fast out the ath10k
>> interface in AP mode, I get a near-immediate firmware crash.
>
> [...]
>
>> Versions:
>> - kernel is based on current kvalo/for-linville branch (should I try
>> something else?) but seems to be the same in linux-next-20140114 so I
>> don't think this behaviour has changed lately.
>
> I do not recommend using for-linville branch for anything. As the name
> implies, it's only for John Linville to pull ath10k and ath6kl changes
> to his tree.
>
> What I recommend is to use the master branch of my ath.git tree. That's
> fairly recent wireless-testing (max 2 weeks old) plus latest ath10k +
> ath6kl patches I have (ie. merge of wireless-testing and my ath-next
> branch).
Ok, thanks. We're using a fairly old kernel on our device right now
(3.2.26) so we're using the ath10k driver from linux-backports. This
means it's a little tricky to pick an arbitrary version if it has
diverged too far from linux/master or linux-next. I did try a few
different versions though and they did the same thing.
>> - firmware version 10.1.467.2-1, but also tested with 10.1.467.1-1
>> with no difference.
>>
>> I assume other people are not experiencing this or they would have
>> mentioned it by now. What can I do to help debug this?
>
> We have reported the issue to the firmware team and got some feedback
> already. Hopefully we know more early next week.
Thanks!
Another update. On a whim, based on the earlier mention that problems
might be related to extra burstiness of forwarding vs. local traffic
generation, I decided to add a udelay() before transmitting each
packet. I started with udelay(1000) and the problem went away
(although of course performance was terrible). I slowly reduced the
delay until I reached ndelay(1), and the problem stayed gone. So I
tried a mb() instead:
diff --git a/drivers/net/wireless/ath/ath10k/ce.c b/drivers/net/wireless/ath/ath10k/ce.c
index a79499c..a808d82 100644
--- a/drivers/net/wireless/ath/ath10k/ce.c
+++ b/drivers/net/wireless/ath/ath10k/ce.c
@@ -291,6 +291,7 @@ int ath10k_ce_send_nolock(struct ath10k_ce_pipe *ce_state,
 	if (ret)
 		return ret;
 
+	mb();
 	if (unlikely(CE_RING_DELTA(nentries_mask,
 				   write_index, sw_index - 1) <= 0)) {
 		ret = -ENOSR;
--
1.9.0.279.gdc9e3eb
Somehow this eliminates my firmware crashes. It's extremely reliable;
add this line and my crashes go away. Remove this line and my UDP
iperf can crash the firmware in a couple of seconds.
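For readers unfamiliar with the check that follows the added mb() in the patch: it tests whether the copy engine ring is full. Below is a minimal userspace sketch in C. It assumes CE_RING_DELTA is a masked index difference (the macro body is written from memory, not taken from the patch), and ring_is_full() is a hypothetical helper, not a driver function.

```c
/* Masked distance from fromidx to toidx on a power-of-two ring.
 * Assumed to match the driver's macro; treat as illustrative. */
#define CE_RING_DELTA(nentries_mask, fromidx, toidx) \
    (((int)(toidx) - (int)(fromidx)) & (nentries_mask))

/* Hypothetical helper: the ring is full when no free slot remains
 * between the write index and the entry just before the software
 * index. One slot is always left unused, so a ring with N entries
 * holds at most N - 1 pending items. */
int ring_is_full(unsigned int nentries_mask,
                 unsigned int write_index,
                 unsigned int sw_index)
{
    return CE_RING_DELTA(nentries_mask, write_index, sw_index - 1) <= 0;
}
```

With an 8-entry ring (mask 7), ring_is_full(7, 0, 0) is false for an empty ring, while ring_is_full(7, 7, 0) is true once seven entries are outstanding.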
For this particular test I was using a backports built from linux
v3.11.8 merged with your ath10k-stable-3.11-8 tag.
Any idea why this would make any difference?
Thanks,
Avery
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-08 8:20 ` Avery Pennarun
@ 2014-03-08 16:57 ` Kalle Valo
2014-03-10 19:26 ` Avery Pennarun
2014-03-10 10:10 ` Michal Kazior
1 sibling, 1 reply; 15+ messages in thread
From: Kalle Valo @ 2014-03-08 16:57 UTC (permalink / raw)
To: Avery Pennarun; +Cc: ath10k
Avery Pennarun <apenwarr@gmail.com> writes:
> On Sat, Mar 8, 2014 at 2:03 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
>
>> What I recommend is to use the master branch of my ath.git tree. That's
>> fairly recent wireless-testing (max 2 weeks old) plus latest ath10k +
>> ath6kl patches I have (ie. merge of wireless-testing and my ath-next
>> branch).
>
> Ok, thanks. We're using a fairly old kernel on our device right now
> (3.2.26) so we're using the ath10k driver from linux-backports. This
> means it's a little tricky to pick an arbitrary version if it has
> diverged too far from linux/master or linux-next.
Yeah, you are not the only one using ath10k with linux-backports.
Ideally we should have our own "backports-ath10k" which contains latest
and greatest ath10k from my master tree. This would be really handy for
debugging problems and verifying bug fixes, but I don't see any way to
find time for that right now :/
> I did try a few different versions though and they did the same thing.
Yeah, I wasn't expecting master branch to contain any fixes to your
issues. This was more like a generic comment.
--
Kalle Valo
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-08 0:54 ` Avery Pennarun
2014-03-08 1:08 ` Ben Greear
@ 2014-03-10 7:47 ` Michal Kazior
1 sibling, 0 replies; 15+ messages in thread
From: Michal Kazior @ 2014-03-10 7:47 UTC (permalink / raw)
To: Avery Pennarun; +Cc: Ben Greear, ath10k
On 8 March 2014 01:54, Avery Pennarun <apenwarr@gmail.com> wrote:
[...]
> Michal wrote:
>> Can you try reproducing this with traces? Considering this might get
>> too big to attach to an email I think it's okay to exclude
>> ath10k:ath10k_log_dbg_dump from traces.
>
> Can you give me a clue how to get started with traces? What do I
> enable, and what parts do you want to see?
CONFIG_ATH10K_TRACING=y
trace-cmd record \
  -e ath10k:ath10k_log_err -e ath10k:ath10k_log_warn \
  -e ath10k:ath10k_log_info -e ath10k:ath10k_log_dbg \
  -e ath10k:ath10k_wmi_cmd -e ath10k:ath10k_wmi_event \
  -e ath10k:ath10k_htt_stats -e ath10k:ath10k_wmi_dbglog
(this excludes the ath10k:ath10k_log_dbg_dump)
If it's too big to attach/paste the trace.dat file then a reasonable
output of backlog from `trace-cmd report` before FW crashes might
suffice. Considering your other mail this may prevent the bug from
appearing though.
Michał
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-08 8:20 ` Avery Pennarun
2014-03-08 16:57 ` Kalle Valo
@ 2014-03-10 10:10 ` Michal Kazior
2014-03-10 10:16 ` Michal Kazior
1 sibling, 1 reply; 15+ messages in thread
From: Michal Kazior @ 2014-03-10 10:10 UTC (permalink / raw)
To: Avery Pennarun; +Cc: Kalle Valo, ath10k
On 8 March 2014 09:20, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Sat, Mar 8, 2014 at 2:03 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
>> Avery Pennarun <apenwarr@gmail.com> writes:
>>> I'm having a problem where if I transmit too fast out the ath10k
>>> interface in AP mode, I get a near-immediate firmware crash.
>>
>> [...]
>>
>>> Versions:
>>> - kernel is based on current kvalo/for-linville branch (should I try
>>> something else?) but seems to be the same in linux-next-20140114 so I
>>> don't think this behaviour has changed lately.
>>
>> I do not recommend using for-linville branch for anything. As the name
>> implies, it's only for John Linville to pull ath10k and ath6kl changes
>> to his tree.
>>
>> What I recommend is to use the master branch of my ath.git tree. That's
>> fairly recent wireless-testing (max 2 weeks old) plus latest ath10k +
>> ath6kl patches I have (ie. merge of wireless-testing and my ath-next
>> branch).
>
> Ok, thanks. We're using a fairly old kernel on our device right now
> (3.2.26) so we're using the ath10k driver from linux-backports. This
> means it's a little tricky to pick an arbitrary version if it has
> diverged too far from linux/master or linux-next. I did try a few
> different versions though and they did the same thing.
>
>>> - firmware version 10.1.467.2-1, but also tested with 10.1.467.1-1
>>> with no difference.
>>>
>>> I assume other people are not experiencing this or they would have
>>> mentioned it by now. What can I do to help debug this?
>>
>> We have reported the issue to the firmware team and got some feedback
>> already. Hopefully we know more early next week.
>
> Thanks!
>
> Another update. On a whim, based on the earlier mention that problems
> might be related to extra burstiness of forwarding vs. local traffic
> generation, I decided to add a udelay() before transmitting each
> packet. I started with udelay(1000) and the problem went away
> (although of course performance was terrible). I slowly reduced the
> delay until I reached ndelay(1), and the problem stayed gone. So I
> tried a mb() instead:
>
> diff --git a/drivers/net/wireless/ath/ath10k/ce.c b/drivers/net/wireless/ath/ath10k/ce.c
> index a79499c..a808d82 100644
> --- a/drivers/net/wireless/ath/ath10k/ce.c
> +++ b/drivers/net/wireless/ath/ath10k/ce.c
> @@ -291,6 +291,7 @@ int ath10k_ce_send_nolock(struct ath10k_ce_pipe *ce_state,
> if (ret)
> return ret;
>
> + mb();
> if (unlikely(CE_RING_DELTA(nentries_mask,
> write_index, sw_index - 1) <= 0)) {
> ret = -ENOSR;
> --
> 1.9.0.279.gdc9e3eb
>
>
> Somehow this eliminates my firmware crashes. It's extremely reliable;
> add this line and my crashes go away. Remove this line and my UDP
> iperf can crash the firmware in a couple of seconds.
>
> For this particular test I was using a backports built from linux
> v3.11.8 merged with your ath10k-stable-3.11-8 tag.
>
> Any idea why this would make any difference?
The FW dump is supposedly related to it seeing a duplicate msdu_id tx request.
ath10k fills in a tx descriptor. The descriptor contains an id which
is used for completion handling (FW signals which id completed).
ath10k uses a spinlock protected bitmap to manage this metadata.
Descriptors are alloced via dma pool (consistent dma memory).
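The id bookkeeping described above can be pictured with a small userspace sketch in C. It is a stand-in under stated assumptions, not the driver code: a pthread mutex replaces the kernel spinlock, and the names (id_pool, id_alloc, id_free, MAX_IDS) are hypothetical.

```c
#include <pthread.h>
#include <stdint.h>

#define MAX_IDS 64  /* illustrative pool size, not the driver's value */

struct id_pool {
    pthread_mutex_t lock;   /* stands in for the kernel spinlock */
    uint64_t used;          /* bit n set => id n is outstanding */
};

void id_pool_init(struct id_pool *p)
{
    pthread_mutex_init(&p->lock, NULL);
    p->used = 0;
}

/* Returns the lowest free id in [0, MAX_IDS), or -1 if all are in flight. */
int id_alloc(struct id_pool *p)
{
    int id = -1;

    pthread_mutex_lock(&p->lock);
    for (int i = 0; i < MAX_IDS; i++) {
        if (!(p->used & (UINT64_C(1) << i))) {
            p->used |= UINT64_C(1) << i;
            id = i;
            break;
        }
    }
    pthread_mutex_unlock(&p->lock);
    return id;
}

/* Called on tx completion, once the id has been reported as done. */
void id_free(struct id_pool *p, int id)
{
    pthread_mutex_lock(&p->lock);
    p->used &= ~(UINT64_C(1) << id);
    pthread_mutex_unlock(&p->lock);
}
```

As long as the lock works, two callers can never be handed the same id while it is still outstanding, which is why a duplicate msdu_id is unlikely to originate here.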
It is highly unlikely for ath10k to pick a duplicate msdu_id in the
first place - you'd have to assume the spinlock fails, which would suggest
your system would be pretty fun. This leaves two possibilities: either
low-level chunk submission is at play, or DMA goes crazy.
The descriptor is transferred in two chunks over the CE ring. The
ce_send_nolock is used to submit each separately (via pci_tx_sg). The
first contains the msdu id, the other one is the msdu partial as frame
prefetch for the FW classification engine. Once the second chunk is
submitted, the CE ringbuffer index is written to iomap.
If I assume this is a DMA coherency issue, then the msdu_id the device sees
is the old one (that has been overwritten but hasn't been flushed from
CPU caches yet). Then this is a platform bug, not an ath10k one.
If I assume this is a chunk submission ordering issue (the CE ring item is
updated _after_ the ring index in iomap is updated), then the device uses
an old tx descriptor pointer and an old (or re-used and currently in-use)
msdu_id -- remember all descriptors come from the dma pool, which I assume
re-uses memory chunks. Then this is an ath10k bug.
The latter is a little more plausible because mb() fixes. udelay()
might implicitly do the same thing.
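
To make the ordering hypothesis concrete, here is a minimal userspace model in C. It is only a sketch: the names (struct model, submit, index_register_write) and the 4-entry ring are made up for illustration and are not ath10k code, and the reordering is simulated explicitly with a flag rather than produced by real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 4u                  /* hypothetical ring size, power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

/* One ring slot; in this model it carries only the msdu_id. */
struct ring_item {
	uint16_t msdu_id;
};

struct model {
	struct ring_item ring[RING_ENTRIES]; /* host memory the device reads */
	unsigned int write_index;            /* host-side write index */
	uint16_t device_saw;                 /* msdu_id the device fetched last */
};

/*
 * Models the device reacting to a write of the index register: it
 * immediately fetches the slot just before the new index.
 */
static void index_register_write(struct model *m, unsigned int new_index)
{
	assert(new_index < RING_ENTRIES);
	m->device_saw = m->ring[(new_index - 1u) & RING_MASK].msdu_id;
}

/*
 * Submit one item and return what the modelled device fetched.
 * When index_first is true the index write becomes visible before the
 * slot store, which is the reordering a missing barrier could allow.
 */
static uint16_t submit(struct model *m, uint16_t msdu_id, bool index_first)
{
	unsigned int slot = m->write_index & RING_MASK;
	unsigned int next = (m->write_index + 1u) & RING_MASK;

	if (index_first) {
		index_register_write(m, next);   /* device fetches a stale slot */
		m->ring[slot].msdu_id = msdu_id;
	} else {
		m->ring[slot].msdu_id = msdu_id;
		index_register_write(m, next);   /* device fetches the new slot */
	}
	m->write_index = next;
	return m->device_saw;
}
```

Filling the ring with ids 1..4 in the correct order and then calling submit(&m, 5, true) makes the modelled device fetch id 1 again, i.e. an id from the previous lap that may still be in flight, which matches the duplicate msdu_id symptom.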
Michał
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-10 10:10 ` Michal Kazior
@ 2014-03-10 10:16 ` Michal Kazior
2014-05-12 1:43 ` Avery Pennarun
0 siblings, 1 reply; 15+ messages in thread
From: Michal Kazior @ 2014-03-10 10:16 UTC (permalink / raw)
To: Avery Pennarun; +Cc: Kalle Valo, ath10k
On 10 March 2014 11:10, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 8 March 2014 09:20, Avery Pennarun <apenwarr@gmail.com> wrote:
>> On Sat, Mar 8, 2014 at 2:03 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
>>> Avery Pennarun <apenwarr@gmail.com> writes:
>>>> I'm having a problem where if I transmit too fast out the ath10k
>>>> interface in AP mode, I get a near-immediate firmware crash.
>>>
>>> [...]
>>>
>>>> Versions:
>>>> - kernel is based on current kvalo/for-linville branch (should I try
>>>> something else?) but seems to be the same in linux-next-20140114 so I
>>>> don't think this behaviour has changed lately.
>>>
>>> I do not recommend using for-linville branch for anything. As the name
>>> implies, it's only for John Linville to pull ath10k and ath6kl changes
>>> to his tree.
>>>
>>> What I recommend is to use the master branch of my ath.git tree. That's
>>> fairly recent wireless-testing (max 2 weeks old) plus latest ath10k +
>>> ath6kl patches I have (ie. merge of wireless-testing and my ath-next
>>> branch).
>>
>> Ok, thanks. We're using a fairly old kernel on our device right now
>> (3.2.26) so we're using the ath10k driver from linux-backports. This
>> means it's a little tricky to pick an arbitrary version if it has
>> diverged too far from linux/master or linux-next. I did try a few
>> different versions though and they did the same thing.
>>
>>>> - firmware version 10.1.467.2-1, but also tested with 10.1.467.1-1
>>>> with no difference.
>>>>
>>>> I assume other people are not experiencing this or they would have
>>>> mentioned it by now. What can I do to help debug this?
>>>
>>> We have reported the issue to the firmware team and got some feedback
>>> already. Hopefully we know more early next week.
>>
>> Thanks!
>>
>> Another update. On a whim, based on the earlier mention that problems
>> might be related to extra burstiness of forwarding vs. local traffic
>> generation, I decided to add a udelay() before transmitting each
>> packet. I started with udelay(1000) and the problem went away
>> (although of course performance was terrible). I slowly reduced the
>> delay until I reached ndelay(1), and the problem stayed gone. So I
>> tried a mb() instead:
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/ce.c
>> b/drivers/net/wireless/ath/ath10k/ce.c
>> index a79499c..a808d82 100644
>> --- a/drivers/net/wireless/ath/ath10k/ce.c
>> +++ b/drivers/net/wireless/ath/ath10k/ce.c
>> @@ -291,6 +291,7 @@ int ath10k_ce_send_nolock(struct ath10k_ce_pipe *ce_state,
>> if (ret)
>> return ret;
>>
>> + mb();
>> if (unlikely(CE_RING_DELTA(nentries_mask,
>> write_index, sw_index - 1) <= 0)) {
>> ret = -ENOSR;
>> --
>> 1.9.0.279.gdc9e3eb
>>
>>
>> Somehow this eliminates my firmware crashes. It's extremely reliable;
>> add this line and my crashes go away. Remove this line and my UDP
>> iperf can crash the firmware in a couple of seconds.
>>
>> For this particular test I was using a backports built from linux
>> v3.11.8 merged with your ath10k-stable-3.11-8 tag.
>>
>> Any idea why this would make any difference?
>
> The FW dump is supposedly related to it seeing a duplicate msdu_id tx request.
>
> ath10k fills in a tx descriptor. The descriptor contains an id which
> is used for completion handling (FW signals which id completed).
> ath10k uses a spinlock protected bitmap to manage this metadata.
> Descriptors are alloced via dma pool (consistent dma memory).
>
> It is highly unlikely for ath10k to pick duplicate msdu_id in the
> first place - you'd have to assume spinlock fail which would suggest
> your system would be pretty fun. This leaves either low level chunk
> submission is at play or DMA goes crazy.
>
> The descriptor is transferred in two chunks over CE ring. The
> ce_send_nolock is used to submit each separately (via pci_tx_sg). The
> first contains msdu id, the other one is the msdu partial as frame
> prefetch for FW classification engine. Once the second chunk is
> submitted CE ringbuffer index is written to iomap.
>
> If I assume this is DMA coherency issue, then msdu_id the device sees
> is the old one (that has been overwritten but hasn't been flushed from
> CPU caches yet). Then this is a platform bug, not ath10k one.
>
> If I assume this is chunk submission ordering issue (CE ring item is
> updated _after_ ring index in iomap is updated) then the device uses
> an old tx descriptor pointer and an old (or re-used and currently used
> msdu_id -- remember all descriptors come from dma pool which I assume
> re-uses memory chunks). Then this is ath10k bug.
>
> The latter is a little more plausible because mb() fixes it. udelay()
> might implicitly do the same thing.
Now that I think about it, you can probably try placing the mb() after `*desc =
*desc`, or better, right before ath10k_ce_src_ring_write_index_set()
(inside the conditional). This will make sure all prior CE items
are ready and set before the CE ring index is updated.
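
A rough sketch of that placement, again as a userspace model and not the actual ce.c code: SEND_FLAG_GATHER, struct src_ring and ring_send are hypothetical names, and atomic_thread_fence() only stands in for the kernel's mb() (it does not order device DMA the way mb() does on real hardware). The barrier sits inside the conditional, so it runs once per published index rather than once per chunk. Ring-full checking is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_ENTRIES     8u              /* hypothetical ring size, power of two */
#define RING_MASK        (RING_ENTRIES - 1u)
#define SEND_FLAG_GATHER 0x1u            /* hypothetical: more chunks follow */

struct src_desc {
	uint32_t addr;
	uint16_t nbytes;
	uint16_t flags;
};

struct src_ring {
	struct src_desc items[RING_ENTRIES];
	unsigned int write_index;            /* host-private copy of the index */
	_Atomic unsigned int hw_write_index; /* stand-in for the device register */
	unsigned int barriers;               /* counts fences, for demonstration */
};

/* Queue one chunk; publish the index only when the gather flag is clear. */
static void ring_send(struct src_ring *r, uint32_t addr, uint16_t nbytes,
		      unsigned int flags)
{
	struct src_desc *desc = &r->items[r->write_index & RING_MASK];

	assert(nbytes > 0);

	/* Fill the ring item first. */
	desc->addr = addr;
	desc->nbytes = nbytes;
	desc->flags = (uint16_t)flags;
	r->write_index = (r->write_index + 1u) & RING_MASK;

	if (!(flags & SEND_FLAG_GATHER)) {
		/*
		 * Stand-in for mb(): every ring item written so far must be
		 * visible before the index that advertises it.
		 */
		atomic_thread_fence(memory_order_seq_cst);
		r->barriers++;
		atomic_store_explicit(&r->hw_write_index, r->write_index,
				      memory_order_relaxed);
	}
}
```

Sending a first chunk with SEND_FLAG_GATHER and a second chunk without it leaves the published index untouched after the first call and advances it by two after the second, with a single fence covering both ring items.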
Michał
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-08 16:57 ` Kalle Valo
@ 2014-03-10 19:26 ` Avery Pennarun
2014-03-11 7:31 ` Kalle Valo
0 siblings, 1 reply; 15+ messages in thread
From: Avery Pennarun @ 2014-03-10 19:26 UTC (permalink / raw)
To: Kalle Valo; +Cc: ath10k
On Sat, Mar 8, 2014 at 10:57 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
> Avery Pennarun <apenwarr@gmail.com> writes:
>> On Sat, Mar 8, 2014 at 2:03 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
>>> What I recommend is to use the master branch of my ath.git tree. That's
>>> fairly recent wireless-testing (max 2 weeks old) plus latest ath10k +
>>> ath6kl patches I have (ie. merge of wireless-testing and my ath-next
>>> branch).
>>
>> Ok, thanks. We're using a fairly old kernel on our device right now
>> (3.2.26) so we're using the ath10k driver from linux-backports. This
>> means it's a little tricky to pick an arbitrary version if it has
>> diverged too far from linux/master or linux-next.
>
> Yeah, you are not the only one using ath10k with linux-backports.
> Ideally we should have our own "backports-ath10k" which contains latest
> and greatest ath10k from my master tree. This would be really handy for
> debugging problems and verifying bug fixes, but I don't see any way to
> find time for that right now :/
For what it's worth, linux-backports seems to get updated to match
linux-next fairly frequently. They have one for next-20140305, for
example. It's pretty easy for me to test against one of those. It
seems to be not so important to rebase on top of linux-backports as to
rebase on top of (close to) the same version linux-backports is using as
input.

When I looked at the 20140305 release though, it appears that
ath10k development has gotten quite far ahead of linux-next and some
of the patches not yet in linux-next (eg. reset fixes) are pretty
important.

Have fun,
Avery
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-10 19:26 ` Avery Pennarun
@ 2014-03-11 7:31 ` Kalle Valo
0 siblings, 0 replies; 15+ messages in thread
From: Kalle Valo @ 2014-03-11 7:31 UTC (permalink / raw)
To: Avery Pennarun; +Cc: ath10k
Avery Pennarun <apenwarr@gmail.com> writes:
> On Sat, Mar 8, 2014 at 10:57 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
>> Avery Pennarun <apenwarr@gmail.com> writes:
>>> On Sat, Mar 8, 2014 at 2:03 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
>>>> What I recommend is to use the master branch of my ath.git tree. That's
>>>> fairly recent wireless-testing (max 2 weeks old) plus latest ath10k +
>>>> ath6kl patches I have (ie. merge of wireless-testing and my ath-next
>>>> branch).
>>>
>>> Ok, thanks. We're using a fairly old kernel on our device right now
>>> (3.2.26) so we're using the ath10k driver from linux-backports. This
>>> means it's a little tricky to pick an arbitrary version if it has
>>> diverged too far from linux/master or linux-next.
>>
>> Yeah, you are not the only one using ath10k with linux-backports.
>> Ideally we should have our own "backports-ath10k" which contains latest
>> and greatest ath10k from my master tree. This would be really handy for
>> debugging problems and verifying bug fixes, but I don't see any way to
>> find time for that right now :/
>
> For what it's worth, linux-backports seems to get updated to match
> linux-next fairly frequently. They have one for next-20140305 for
> example. It's pretty easy for me to test against one of those. It
> seems to be not so important to rebase on top of linux-backports as to
> rebase on top of (close to) same version linux-backports is using as
> input.
>
> When I looked at the 20140305 release though, it appears that the
> ath10k development has gotten quite far ahead of linux-next and some
> of the patches not yet in linux-next (eg. reset fixes) are pretty
> important.
I haven't looked very closely, but I suspect it will take several weeks
for patches to flow from my ath-next branch to backports. For important
fixes, like the cold reset workarounds, it's too long.
--
Kalle Valo
* Re: Firmware crash when sending large numbers of forwarded packets
2014-03-10 10:16 ` Michal Kazior
@ 2014-05-12 1:43 ` Avery Pennarun
0 siblings, 0 replies; 15+ messages in thread
From: Avery Pennarun @ 2014-05-12 1:43 UTC (permalink / raw)
To: Michal Kazior; +Cc: Kalle Valo, ath10k
On Mon, Mar 10, 2014 at 6:16 AM, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 10 March 2014 11:10, Michal Kazior <michal.kazior@tieto.com> wrote:
>> It is highly unlikely for ath10k to pick duplicate msdu_id in the
>> first place - you'd have to assume spinlock fail which would suggest
>> your system would be pretty fun. This leaves either low level chunk
>> submission is at play or DMA goes crazy.
>>
>> The descriptor is transferred in two chunks over CE ring. The
>> ce_send_nolock is used to submit each separately (via pci_tx_sg). The
>> first contains msdu id, the other one is the msdu partial as frame
>> prefetch for FW classification engine. Once the second chunk is
>> submitted CE ringbuffer index is written to iomap.
>>
>> If I assume this is DMA coherency issue, then msdu_id the device sees
>> is the old one (that has been overwritten but hasn't been flushed from
>> CPU caches yet). Then this is a platform bug, not ath10k one.
>>
>> If I assume this is chunk submission ordering issue (CE ring item is
>> updated _after_ ring index in iomap is updated) then the device uses
>> an old tx descriptor pointer and an old (or re-used and currently used
>> msdu_id -- remember all descriptors come from dma pool which I assume
>> re-uses memory chunks). Then this is ath10k bug.
>>
>> The latter is a little more plausible because mb() fixes it. udelay()
>> might implicitly do the same thing.
>
> Now that I think about it, you can probably try placing the mb() after `*desc =
> *desc`, or better, right before ath10k_ce_src_ring_write_index_set()
> (inside the conditional). This will make sure all prior CE items
> are ready and set before the CE ring index is updated.
Sorry to take a long time to get back on this thread, but I finally
have some better test results.

First, our stability has definitely improved a lot since my
original mb() patch.

Secondly, we still had occasional firmware crashes under certain load
types. Thanks to Ben Greear and some helpful people at QCA, we
diagnosed this as the same problem the original mb() was trying to
fix. After moving the mb() as Michal suggested, this class of
firmware crash (which was much rarer after my original patch)
seems to be gone too. So the new location of the mb() does indeed seem
to be better.

We still have rare firmware crashes for, I think, a different reason
(or reasons), but I have asked or will ask about those in other threads.

Thanks for the help!

Avery
Thread overview: 15+ messages
2014-03-07 10:38 Firmware crash when sending large numbers of forwarded packets Avery Pennarun
2014-03-07 11:15 ` Michal Kazior
2014-03-07 15:53 ` Ben Greear
2014-03-08 0:54 ` Avery Pennarun
2014-03-08 1:08 ` Ben Greear
2014-03-08 1:21 ` Avery Pennarun
2014-03-10 7:47 ` Michal Kazior
2014-03-08 8:03 ` Kalle Valo
2014-03-08 8:20 ` Avery Pennarun
2014-03-08 16:57 ` Kalle Valo
2014-03-10 19:26 ` Avery Pennarun
2014-03-11 7:31 ` Kalle Valo
2014-03-10 10:10 ` Michal Kazior
2014-03-10 10:16 ` Michal Kazior
2014-05-12 1:43 ` Avery Pennarun