* Anyone seeing tx-credits 'hang'? @ 2015-01-08 21:24 Ben Greear 2015-01-09 10:34 ` Michal Kazior 0 siblings, 1 reply; 15+ messages in thread From: Ben Greear @ 2015-01-08 21:24 UTC (permalink / raw) To: ath10k I am still working on tracking down tx-credits hang, where it appears to the driver that firmware does not return tx credits, and the driver then gets lots of -11 errors from htc/wmi and will not recover (well, once it recovered after hanging for about 45 minutes, for reasons that are totally beyond me. I do not normally wait so long). I am using a hacked ath10k driver and CT firmware, but I am suspicious that the problem is not unique to me, though I probably hit the problem much more often due to the types of stress tests I am running. I have implemented a keep-alive between my driver and CT firmware, and firmware will assert if it does not get a message within about 10 seconds. This is a wmi-message, so if we hang due to credits, the firmware will assert and dump a nice crash log (and host can recover). One crash I looked at closely appears to show the firmware thinking it has returned all credits, but driver never received them. What is more, it seems that the driver thought it sent one additional wmi command that the firmware did not receive in the wmi message handling code. I am curious if anyone else has seen these problems (even if very rarely), and if anyone has done any additional debugging on what might be the issue. Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com _______________________________________________ ath10k mailing list ath10k@lists.infradead.org http://lists.infradead.org/mailman/listinfo/ath10k ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Anyone seeing tx-credits 'hang'? 2015-01-08 21:24 Anyone seeing tx-credits 'hang'? Ben Greear @ 2015-01-09 10:34 ` Michal Kazior 2015-01-09 16:55 ` Ben Greear 0 siblings, 1 reply; 15+ messages in thread From: Michal Kazior @ 2015-01-09 10:34 UTC (permalink / raw) To: Ben Greear; +Cc: ath10k On 8 January 2015 at 22:24, Ben Greear <greearb@candelatech.com> wrote: > I am still working on tracking down tx-credits hang, where it appears > to the driver that firmware does not return tx credits, and the driver > then gets lots of -11 errors from htc/wmi and will not recover (well, > once it recovered after hanging for about 45 minutes, for reasons that are totally > beyond me. I do not normally wait so long). > > I am using a hacked ath10k driver and CT firmware, but I am suspicious that the problem > is not unique to me, though I probably hit the problem much more often > due to the types of stress tests I am running. I don't recall seeing it recently. > I have implemented a keep-alive between my driver and CT firmware, > and firmware will assert if it does not get a message within > about 10 seconds. This is a wmi-message, so if we hang due to credits, > the firmware will assert and dump a nice crash log (and host can recover). FYI the default time mgmt tx can be stuck is 10 seconds (vide the tx-credit starvation issue due to hostapd's inactivity measures). > One crash I looked at closely appears to show the firmware thinking it > has returned all credits, but driver never received them. What is more, > it seems that the driver thought it sent one additional wmi command > that the firmware did not receive in the wmi message handling code. Hmm.. A couple of ideas: a) lost interrupt b) silently dropped event buffer (in fw, e.g. 
due to unforeseen lack of resources)
c) memory barrier / ordering issue (delivered/submitted buffer was a mess - I don't know if you're checking the buffer in/out count or analyzed all the way down to copy engine)

You could try adding a few extra mb() (e.g. before copy engine ring indexes are updated) for (c), at least in ath10k.

You could try changing _service_any() to ignore copy engine summary mask and iterate i=0..CE_COUNT-1 and try polling htc-wmi rx pipe (or just simply all of them :P) with ath10k_hif_send_complete_check().

Michal
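Idea (c) can be illustrated with a minimal user-space sketch, assuming a single-producer, single-consumer descriptor ring. The names `demo_ring`, `demo_ring_post` and `demo_ring_poll` are hypothetical and this is not ath10k code; it only shows why the descriptor must be written before the index that publishes it. In the kernel, the analogous step would be an `mb()`/`wmb()` before the ring write index is updated.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_RING_SIZE 8u /* hypothetical ring depth */

struct demo_desc {
    uint32_t addr;   /* stand-in for the buffer's bus address */
    uint32_t nbytes; /* length of the buffer */
};

struct demo_ring {
    struct demo_desc desc[DEMO_RING_SIZE];
    _Atomic uint32_t write_index; /* advanced only by the producer */
    _Atomic uint32_t read_index;  /* advanced only by the consumer */
};

/* Producer: fill the descriptor first, then publish the new index. */
static int demo_ring_post(struct demo_ring *r, uint32_t addr, uint32_t nbytes)
{
    uint32_t wr = atomic_load_explicit(&r->write_index, memory_order_relaxed);
    uint32_t rd = atomic_load_explicit(&r->read_index, memory_order_acquire);

    if (wr - rd == DEMO_RING_SIZE)
        return -1; /* ring full */

    r->desc[wr % DEMO_RING_SIZE].addr = addr;
    r->desc[wr % DEMO_RING_SIZE].nbytes = nbytes;

    /* Release store: the descriptor writes above are visible to any
     * consumer that observes the new index. Without this ordering the
     * consumer could read a stale descriptor (e.g. an old address). */
    atomic_store_explicit(&r->write_index, wr + 1, memory_order_release);
    return 0;
}

/* Consumer: returns 1 and copies out one descriptor if any is pending. */
static int demo_ring_poll(struct demo_ring *r, struct demo_desc *out)
{
    uint32_t rd = atomic_load_explicit(&r->read_index, memory_order_relaxed);
    uint32_t wr = atomic_load_explicit(&r->write_index, memory_order_acquire);

    if (rd == wr)
        return 0; /* nothing pending */

    *out = r->desc[rd % DEMO_RING_SIZE];
    atomic_store_explicit(&r->read_index, rd + 1, memory_order_release);
    return 1;
}
```

Because `demo_ring_poll()` does not depend on any interrupt, it can also be called from a timer, which is the same idea as polling the pipes regardless of the summary mask.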
* Re: Anyone seeing tx-credits 'hang'? 2015-01-09 10:34 ` Michal Kazior @ 2015-01-09 16:55 ` Ben Greear 2015-01-12 8:06 ` Michal Kazior 0 siblings, 1 reply; 15+ messages in thread From: Ben Greear @ 2015-01-09 16:55 UTC (permalink / raw) To: Michal Kazior; +Cc: ath10k On 01/09/2015 02:34 AM, Michal Kazior wrote: > On 8 January 2015 at 22:24, Ben Greear <greearb@candelatech.com> wrote: >> I am still working on tracking down tx-credits hang, where it appears >> to the driver that firmware does not return tx credits, and the driver >> then gets lots of -11 errors from htc/wmi and will not recover (well, >> once it recovered after hanging for about 45 minutes, for reasons that are totally >> beyond me. I do not normally wait so long). >> >> I am using a hacked ath10k driver and CT firmware, but I am suspicious that the problem >> is not unique to me, though I probably hit the problem much more often >> due to the types of stress tests I am running. > > I don't recall seeing it recently. > > >> I have implemented a keep-alive between my driver and CT firmware, >> and firmware will assert if it does not get a message within >> about 10 seconds. This is a wmi-message, so if we hang due to credits, >> the firmware will assert and dump a nice crash log (and host can recover). > > FYI the default time mgmt tx can be stuck is 10 seconds (vide the > tx-credit starvation issue due to hostapd's inactivity measures). One thing I noticed yesterday is that when the driver tries to put a vdev down, the firmware will try to flush, and will delay vdev-down event until fw is flushed. I changed CT firmware to automatically flush in this case, but perhaps the driver should explicitly ask firmware to flush the vdev before putting it down? Once the driver gets out of sync due to timeouts, the firmware is likely to assert soon after if wmi hang doesn't happen because firmware will think vdev is up when it is not, or vice versa. Also, I notice a pattern in the failure case. 
The sequence is almost always something like this: [lots of vdev up/down, re-associate, etc] vdev down (this would have timed out if I didn't put in the flush) * vdev down is usually last wmi cmd firmware receives. driver tries to delete peer, that times out (firmware wmi layer never saw the command) firmware reports one or two more messages to driver, and if it manages to report a dbglog, that shows a tx-timeout message usually within a second of the vdev down. This happens whether or not I flush the vdev bringing it down. At this point, one more request from driver may be sent, after that, it is credit starvation. Firmware continues to run (timers fire, etc). I think that firmware is also waiting on a completion event from the CE layer...I plan to dig into that more today. >> One crash I looked at closely appears to show the firmware thinking it >> has returned all credits, but driver never received them. What is more, >> it seems that the driver thought it sent one additional wmi command >> that the firmware did not receive in the wmi message handling code. > > Hmm.. A couple of ideas: > a) lost interrupt > b) silently dropped event buffer (in fw, e.g. due to unforseen lack > of resources) > c) memory barrier / ordering issue (delivered/submitted buffer was a > mess - I don't know if you're checking the buffer in/out count or > analyzed all the way down to copy engine) > > You could try adding a few extra mb() (e.g. before copy engine ring > indexes are updated) for (c), at least in ath10k. > > You could try changing _service_any() to ignore copy engine summary > mask and iterate i=0..CE_COUNT-1 and try polling htc-wmi rx pipe (or > just simply all of them :P) with ath10k_hif_send_complete_check(). Yes, I suspect CE transport issue...I have not dug into that code yet, but I will do so today. 
Thanks,
Ben

>
>
> Michal
>

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
* Re: Anyone seeing tx-credits 'hang'? 2015-01-09 16:55 ` Ben Greear @ 2015-01-12 8:06 ` Michal Kazior 2015-01-12 16:51 ` Ben Greear 2015-01-13 19:07 ` Ben Greear 0 siblings, 2 replies; 15+ messages in thread From: Michal Kazior @ 2015-01-12 8:06 UTC (permalink / raw) To: Ben Greear; +Cc: ath10k On 9 January 2015 at 17:55, Ben Greear <greearb@candelatech.com> wrote: [...] > One thing I noticed yesterday is that when the driver tries to put a > vdev down, the firmware will try to flush, and will delay vdev-down > event until fw is flushed. I changed CT firmware to automatically > flush in this case, but perhaps the driver should explicitly ask > firmware to flush the vdev before putting it down? I recall the discussion we once had. I do plan on doing a patch for that, eventually. > Once the driver gets out of sync due to timeouts, the firmware > is likely to assert soon after if wmi hang doesn't happen because > firmware will think vdev is up when it is not, or vice versa. > > Also, I notice a pattern in the failure case. > > The sequence is almost always something like this: > > [lots of vdev up/down, re-associate, etc] > > vdev down (this would have timed out if I didn't put in the flush) > * vdev down is usually last wmi cmd firmware receives. > driver tries to delete peer, that times out (firmware wmi layer never > saw the command) So there's a chance htc layer actually did get the buffer but for some reason it decided it isn't a wmi buffer. One reason could be the buffer contained garbage (e.g. due to missing barrier on host so firmware could read some data from an old physical address that was stored in ce descriptor item). > firmware reports one or two more messages to driver, and if it manages to report > a dbglog, that shows a tx-timeout message usually within a second of > the vdev down. This happens whether or not I flush the vdev bringing it > down. > > At this point, one more request from driver may be sent, after that, > it is credit starvation. 
Firmware continues to run (timers fire, etc).
>
> I think that firmware is also waiting on a completion event from the
> CE layer...I plan to dig into that more today.

Hm.. This reminds me of issues hw1.0 had. I'd check if one of the workarounds ath10k had changes anything (see ath10k_ce_src_ring_write_index_set in ce.c in 5e3dd157ce).

Michał
* Re: Anyone seeing tx-credits 'hang'? 2015-01-12 8:06 ` Michal Kazior @ 2015-01-12 16:51 ` Ben Greear 2015-01-13 19:07 ` Ben Greear 1 sibling, 0 replies; 15+ messages in thread
From: Ben Greear @ 2015-01-12 16:51 UTC (permalink / raw)
To: Michal Kazior; +Cc: ath10k

On 01/12/2015 12:06 AM, Michal Kazior wrote:
> On 9 January 2015 at 17:55, Ben Greear <greearb@candelatech.com> wrote:
> [...]
>> One thing I noticed yesterday is that when the driver tries to put a
>> vdev down, the firmware will try to flush, and will delay vdev-down
>> event until fw is flushed. I changed CT firmware to automatically
>> flush in this case, but perhaps the driver should explicitly ask
>> firmware to flush the vdev before putting it down?
>
> I recall the discussion we once had. I do plan on doing a patch for
> that, eventually.

In this case, I am thinking of flushing just a particular vdev instead of the entire set of vdevs. I don't think flushing is the root cause of my problems anyway, as I still see the issue after making my CT firmware flush.

I think upstream firmware might require one message per tid per peer, so it might be an issue to generate that many wmi commands anyway...not sure.

>> Once the driver gets out of sync due to timeouts, the firmware
>> is likely to assert soon after if wmi hang doesn't happen because
>> firmware will think vdev is up when it is not, or vice versa.
>>
>> Also, I notice a pattern in the failure case.
>>
>> The sequence is almost always something like this:
>>
>> [lots of vdev up/down, re-associate, etc]
>>
>> vdev down (this would have timed out if I didn't put in the flush)
>> * vdev down is usually last wmi cmd firmware receives.
>> driver tries to delete peer, that times out (firmware wmi layer never
>> saw the command)
>
> So there's a chance htc layer actually did get the buffer but for some
> reason it decided it isn't a wmi buffer. One reason could be the
> buffer contained garbage (e.g.
due to missing barrier on host so
> firmware could read some data from an old physical address that was
> stored in ce descriptor item).
>
>
>> firmware reports one or two more messages to driver, and if it manages to report
>> a dbglog, that shows a tx-timeout message usually within a second of
>> the vdev down. This happens whether or not I flush the vdev bringing it
>> down.
>>
>> At this point, one more request from driver may be sent, after that,
>> it is credit starvation. Firmware continues to run (timers fire, etc).
>>
>> I think that firmware is also waiting on a completion event from the
>> CE layer...I plan to dig into that more today.
>
> Hm.. This reminds me of issues hw1.0 had. I'd check if one of the
> workarounds ath10k had changes anything (see
> ath10k_ce_src_ring_write_index_set in ce.c in 5e3dd157ce).

Thanks, I'll go take a look at this today.

Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
* Re: Anyone seeing tx-credits 'hang'? 2015-01-12 8:06 ` Michal Kazior 2015-01-12 16:51 ` Ben Greear @ 2015-01-13 19:07 ` Ben Greear 2015-01-14 9:45 ` Michal Kazior 1 sibling, 1 reply; 15+ messages in thread From: Ben Greear @ 2015-01-13 19:07 UTC (permalink / raw) To: Michal Kazior; +Cc: ath10k On 01/12/2015 12:06 AM, Michal Kazior wrote: > On 9 January 2015 at 17:55, Ben Greear <greearb@candelatech.com> wrote: > [...] >> One thing I noticed yesterday is that when the driver tries to put a >> vdev down, the firmware will try to flush, and will delay vdev-down >> event until fw is flushed. I changed CT firmware to automatically >> flush in this case, but perhaps the driver should explicitly ask >> firmware to flush the vdev before putting it down? > > I recall the discussion we once had. I do plan on doing a patch for > that, eventually. > > >> Once the driver gets out of sync due to timeouts, the firmware >> is likely to assert soon after if wmi hang doesn't happen because >> firmware will think vdev is up when it is not, or vice versa. >> >> Also, I notice a pattern in the failure case. >> >> The sequence is almost always something like this: >> >> [lots of vdev up/down, re-associate, etc] >> >> vdev down (this would have timed out if I didn't put in the flush) >> * vdev down is usually last wmi cmd firmware receives. >> driver tries to delete peer, that times out (firmware wmi layer never >> saw the command) > > So there's a chance htc layer actually did get the buffer but for some > reason it decided it isn't a wmi buffer. One reason could be the > buffer contained garbage (e.g. due to missing barrier on host so > firmware could read some data from an old physical address that was > stored in ce descriptor item). I managed to get some better debug out of the firmware. 
I am having a hell of a time figuring out how the code flows through all of the callbacks (in both firmware and driver), but it appears this is what happened:

(I have instrumented transfer-id in both firmware and driver)

firmware sent wmi message with transfer-id of 72.
kernel received this transfer-id
firmware's last send-callback transfer ID is 71.

So, it seems that either ath10k did not do the transfer-complete logic, did it incorrectly, or the firmware did not notice it was done.

I cannot find the transfer-complete code that should be updating the firmware. If you know where it is, can you point me to it?

Thanks,
Ben

>
>
>> firmware reports one or two more messages to driver, and if it manages to report
>> a dbglog, that shows a tx-timeout message usually within a second of
>> the vdev down. This happens whether or not I flush the vdev bringing it
>> down.
>>
>> At this point, one more request from driver may be sent, after that,
>> it is credit starvation. Firmware continues to run (timers fire, etc).
>>
>> I think that firmware is also waiting on a completion event from the
>> CE layer...I plan to dig into that more today.
>
> Hm.. This reminds me of issues hw1.0 had. I'd check if one of the
> workarounds ath10k had changes anything (see
> ath10k_ce_src_ring_write_index_set in ce.c in 5e3dd157ce).
>
>
> Michał
>

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
* Re: Anyone seeing tx-credits 'hang'? 2015-01-13 19:07 ` Ben Greear @ 2015-01-14 9:45 ` Michal Kazior 2015-01-14 17:57 ` Ben Greear 0 siblings, 1 reply; 15+ messages in thread
From: Michal Kazior @ 2015-01-14 9:45 UTC (permalink / raw)
To: Ben Greear; +Cc: ath10k

On 13 January 2015 at 20:07, Ben Greear <greearb@candelatech.com> wrote:
[...]
>
> I managed to get some better debug out of the firmware.
>
> I am having a hell of a time figuring out how the code flows through all
> of the callbacks (in both firmware and driver), but it appears this is what happened:
>
> (I have instrumented transfer-id in both firmware and driver)
>
> firmware sent wmi message with transfer-id of 72.
> kernel received this transfer-id
> firmware's last send-callback transfer ID is 71.
>
> So, it seems that either ath10k did not do the transfer-complete logic,
> did it incorrectly, or the firmware did not notice it was done.
>
> I cannot find where the transfer complete code that should be updating
> firmware is at. If you know, can you point me to it?

I think the send-callback should be called when CE is simply done doing its stuff. There's no need for the other side to ack anything explicitly (it just needs to have a free buffer on its side so CE can copy it over).

Or maybe it is the HOST_IS_COPY_COMPLETE_MASK? Not really sure.

Michał
* Re: Anyone seeing tx-credits 'hang'? 2015-01-14 9:45 ` Michal Kazior @ 2015-01-14 17:57 ` Ben Greear [not found] ` <54B6D67C.4090006@qca.qualcomm.com> 2015-01-15 7:48 ` Michal Kazior 0 siblings, 2 replies; 15+ messages in thread
From: Ben Greear @ 2015-01-14 17:57 UTC (permalink / raw)
To: Michal Kazior; +Cc: ath10k

On 01/14/2015 01:45 AM, Michal Kazior wrote:
> On 13 January 2015 at 20:07, Ben Greear <greearb@candelatech.com> wrote:
> [...]
>>
>> I managed to get some better debug out of the firmware.
>>
>> I am having a hell of a time figuring out how the code flows through all
>> of the callbacks (in both firmware and driver), but it appears this is what happened:
>>
>> (I have instrumented transfer-id in both firmware and driver)
>>
>> firmware sent wmi message with transfer-id of 72.
>> kernel received this transfer-id
>> firmware's last send-callback transfer ID is 71.
>>
>> So, it seems that either ath10k did not do the transfer-complete logic,
>> did it incorrectly, or the firmware did not notice it was done.
>>
>> I cannot find where the transfer complete code that should be updating
>> firmware is at. If you know, can you point me to it?
>
> I think the send-callback should be called when CE is simply done
> doing it's stuff. There's no need for the other side to ack anything
> explicitly (it just needs to have a free buffer on it's side so CE can
> copy it over).
>
> Or maybe it is the HOST_IS_COPY_COMPLETE_MASK? Not really sure.

I am now guessing that some magic IRQ happens when ath10k_ce_src_ring_write_index_set() is called.

I may have narrowed down the problem a bit further now.

I printed out the ring indexes in firmware and driver when the lockup occurred. The target -> host ring ids match fine, but I notice that the firmware appears to have pending entries in its host -> target wmi ring that it has not consumed.

Maybe it missed an irq or has some related race.
I'm going to try forcing a poll of the host -> target wmi queue in the firmware when it detects no wmi keep-alive messages and see if that kicks things back into action, and maybe see if I can find any reason for it to not properly handle the ring in the first place.

If this works, perhaps there is a way to kick the ring from the driver side...maybe send a wmi command (ignoring quota) that has no effect, or something like that?

Thanks,
Ben

>
>
> Michał
>

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
[parent not found: <54B6D67C.4090006@qca.qualcomm.com>]
[parent not found: <54B6DE13.1080609@candelatech.com>]
* Re: Anyone seeing tx-credits 'hang'? [not found] ` <54B6DE13.1080609@candelatech.com> @ 2015-01-15 1:54 ` Peter Oh 0 siblings, 0 replies; 15+ messages in thread From: Peter Oh @ 2015-01-15 1:54 UTC (permalink / raw) To: Ben Greear, Michal Kazior; +Cc: ath10k On 01/14/2015 01:22 PM, Ben Greear wrote: > On 01/14/2015 12:50 PM, Peter Oh wrote: >> On 01/14/2015 09:57 AM, Ben Greear wrote: >>> On 01/14/2015 01:45 AM, Michal Kazior wrote: >>>> On 13 January 2015 at 20:07, Ben Greear <greearb@candelatech.com> wrote: >>>> [...] >>>>> I managed to get some better debug out of the firmware. >>>>> >>>>> I am having a hell of a time figuring out how the code flows through all >>>>> of the callbacks (in both firmware and driver), but it appears this is what happened: >>>>> >>>>> (I have instrumented transfer-id in both firmware and driver) >>>>> >>>>> firmware sent wmi message with transfer-id of 72. >>>>> kernel received this transfer-id >>>>> firmware's last send-callback transfer ID is 71. >>>>> >>>>> So, it seems that either ath10k did not do the transfer-complete logic, >>>>> did it incorrectly, or the firmware did not notice it was done. >>>>> >>>>> I cannot find where the transfer complete code that should be updating >>>>> firmware is at. If you know, can you point me to it? >>>> I think the send-callback should be called when CE is simply done >>>> doing it's stuff. There's no need for the other side to ack anything >>>> explicitly (it just needs to have a free buffer on it's side so CE can >>>> copy it over). >>>> >>>> Or maybe it is the HOST_IS_COPY_COMPLETE_MASK? Not really sure. >>> I am now guessing that some magic IRQ happens when ath10k_ce_src_ring_write_index_set() >>> is called. >> You may already notice it, but to clarify the magic IRQ is DMA interrupts. Copy Engine is almost the same as DMA engine with channels which triggers an >> interrupt automatically when a DMA transfer is completed. 
we have registers to enable it, HOST_IE (offset 0x2c) and TARGET_IE(offset 0x24).
>> ath10k_ce_src_ring_write_index_set (SRC_RING_WR_IND register, offset 0x3c) triggers fetching data automatically using DMA by ASIC design.
> Yes, that makes sense, and I appreciate the extra details.
>
>>> I may have narrowed down the problem a bit further now.
>>>
>>> I printed out the ring indexes in firmware and driver when lockup
>>> occured. The target -> host ring ids match fine, but I notice that
>>> it appears the firmware has pending entries in it's host -> target wmi
>>> ring that it has not consumed.
>>>
>>> Maybe it missed an irq or has some related race.
>> Since the IRQ is a DMA interrupt triggered by ASIC, all the amount of data size must be transferred to trigger the interrupt. If IRQ does not happen even after
>> all the data transferred, then we may call it an ASIC bug otherwise it could be software issues. The corresponding status register is TARGET_IS (offset 0x28)
>> and HOST_IS (offset 0x30), but I'm not sure which registers represent the number of bytes has been transferred. If we have this type of register, it will be
>> easy to determine if DMA is done.
> I found some things that look risky in the firmware CE code, but my attempts at
> fixing them made no improvement, so I am not sure I found any real problems in
> this area yet. I'll be happy to send you the firmware patches for my debugging
> efforts and such if you are interested.

Sure. I'd like to run your changes, but I cannot promise how much effort I can put into them or by when.

>
> As for when bytes are fully read, see this firmware method:
>
> CE_completed_recv_next
>
> At this point, I am trying to make a work-around that will force a re-read of the ring
> buffer (basically, fake an interrupt).
>
>
> Back to the original attempt at debugging this...the problem was quite easy to reproduce
> before I started adding debugging to the firmware..and the debugging I have added is quite light on
> run-time behaviour, so I suspect some sort of race either in software or hardware.
>
> Hard to pin it down though.
>
> Out of curiosity, are you aware of anyone hitting this type of problem with upstream
> firmware?

Sorry, but I have not seen anyone else report this issue.

> Thanks,
> Ben
>
>

Regards,
Peter
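For reference, the register offsets quoted in this message can be collected as constants. This is only a sketch: the offsets are the ones given above, while the `DEMO_` names and the `demo_ce_irq_pending()` helper are hypothetical and have not been checked against the ath10k headers. The bit to test is passed in as `mask` because its value is not given in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Per-CE register offsets as quoted in the message above. */
enum demo_ce_reg {
    DEMO_CE_TARGET_IE       = 0x24, /* target-side interrupt enable */
    DEMO_CE_TARGET_IS       = 0x28, /* target-side interrupt status */
    DEMO_CE_HOST_IE         = 0x2c, /* host-side interrupt enable */
    DEMO_CE_HOST_IS         = 0x30, /* host-side interrupt status */
    DEMO_CE_SRC_RING_WR_IND = 0x3c, /* source ring write index */
};

/* Hypothetical helper: given the values read from an enable register and
 * the matching status register, report whether the event selected by
 * `mask` is both enabled and currently flagged. */
static int demo_ce_irq_pending(uint32_t ie, uint32_t is, uint32_t mask)
{
    return (ie & mask) && (is & mask);
}
```

If the status bit is set and enabled but no handler ever ran, that would point at a lost interrupt rather than an incomplete transfer.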
* Re: Anyone seeing tx-credits 'hang'? 2015-01-14 17:57 ` Ben Greear [not found] ` <54B6D67C.4090006@qca.qualcomm.com> @ 2015-01-15 7:48 ` Michal Kazior 2015-01-15 17:17 ` Ben Greear 2015-01-20 4:34 ` Ben Greear 1 sibling, 2 replies; 15+ messages in thread From: Michal Kazior @ 2015-01-15 7:48 UTC (permalink / raw) To: Ben Greear; +Cc: ath10k On 14 January 2015 at 18:57, Ben Greear <greearb@candelatech.com> wrote: > On 01/14/2015 01:45 AM, Michal Kazior wrote: >> On 13 January 2015 at 20:07, Ben Greear <greearb@candelatech.com> wrote: >> [...] >>> >>> I managed to get some better debug out of the firmware. >>> >>> I am having a hell of a time figuring out how the code flows through all >>> of the callbacks (in both firmware and driver), but it appears this is what happened: >>> >>> (I have instrumented transfer-id in both firmware and driver) >>> >>> firmware sent wmi message with transfer-id of 72. >>> kernel received this transfer-id >>> firmware's last send-callback transfer ID is 71. >>> >>> So, it seems that either ath10k did not do the transfer-complete logic, >>> did it incorrectly, or the firmware did not notice it was done. >>> >>> I cannot find where the transfer complete code that should be updating >>> firmware is at. If you know, can you point me to it? >> >> I think the send-callback should be called when CE is simply done >> doing it's stuff. There's no need for the other side to ack anything >> explicitly (it just needs to have a free buffer on it's side so CE can >> copy it over). >> >> Or maybe it is the HOST_IS_COPY_COMPLETE_MASK? Not really sure. > > I am now guessing that some magic IRQ happens when ath10k_ce_src_ring_write_index_set() > is called. Correct. CE should generate an interrupt (provided it's not masked in CE registers) on the other end when ring index is bumped. > I may have narrowed down the problem a bit further now. > > I printed out the ring indexes in firmware and driver when lockup > occured. 
The target -> host ring ids match fine, but I notice that
> it appears the firmware has pending entries in it's host -> target wmi
> ring that it has not consumed.
>
> Maybe it missed an irq or has some related race.

Hmm.. The host can tell the target it wants tx credit update in the htc host->target buffer. Upstream ath10k does this only when spending last tx credit. Your observation would explain why firmware doesn't send tx credit update to the host - it didn't get to see the need-credit-update. Does your tree change when ATH10K_HTC_FLAG_NEED_CREDIT_UPDATE is set in ath10k?

> I'm going to try forcing a poll of the host -> target wmi queue in the
> firmware when it detects no wmi keep-alive messages and see if that kicks
> things back into action, and maybe see if I can find any reason for it
> to not properly handle the ring in the first place.

Did you try the old workaround ath10k had for hw1.0?

> If this works, perhaps there is a way to kick the ring from the driver
> side...maybe send a wmi command (ignoring quota) that has no affect,
> or something like that?

I think the wmi-echo could be suited for this. It probably doesn't use any extra resources so overcommitting tx-credit to send it should be safe.

Michał
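The credit behaviour described in this message can be sketched as follows. This is a simplified model and not the ath10k implementation: `demo_ep`, `demo_send`, `demo_credit_report` and the flag's bit value are hypothetical. It assumes one credit per command, a need-credit-update flag set only when the last credit is spent, and `-EAGAIN` (-11 on Linux) once no credits remain.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_FLAG_NEED_CREDIT_UPDATE 0x01u /* placeholder bit value */

struct demo_ep {
    int tx_credits; /* credits currently held by the host */
};

/* Try to send one command. On success returns 0 and stores the header
 * flags for that command in *flags; returns -EAGAIN with no credits. */
static int demo_send(struct demo_ep *ep, uint8_t *flags)
{
    if (ep->tx_credits == 0)
        return -EAGAIN; /* the "-11" errors seen by the driver */

    ep->tx_credits--;
    *flags = 0;
    if (ep->tx_credits == 0) /* spending the last credit */
        *flags |= DEMO_FLAG_NEED_CREDIT_UPDATE;
    return 0;
}

/* A credit report from the firmware hands credits back to the host. */
static void demo_credit_report(struct demo_ep *ep, int credits)
{
    ep->tx_credits += credits;
}
```

In this model, if the command carrying the flag is never consumed on the target side, `demo_credit_report()` is never called and every later `demo_send()` fails, which matches the starvation pattern discussed here.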
* Re: Anyone seeing tx-credits 'hang'? 2015-01-15 7:48 ` Michal Kazior @ 2015-01-15 17:17 ` Ben Greear 2015-01-20 4:34 ` Ben Greear 1 sibling, 0 replies; 15+ messages in thread From: Ben Greear @ 2015-01-15 17:17 UTC (permalink / raw) To: Michal Kazior; +Cc: ath10k On 01/14/2015 11:48 PM, Michal Kazior wrote: > On 14 January 2015 at 18:57, Ben Greear <greearb@candelatech.com> wrote: >> On 01/14/2015 01:45 AM, Michal Kazior wrote: >>> On 13 January 2015 at 20:07, Ben Greear <greearb@candelatech.com> wrote: >>> [...] >>>> >>>> I managed to get some better debug out of the firmware. >>>> >>>> I am having a hell of a time figuring out how the code flows through all >>>> of the callbacks (in both firmware and driver), but it appears this is what happened: >>>> >>>> (I have instrumented transfer-id in both firmware and driver) >>>> >>>> firmware sent wmi message with transfer-id of 72. >>>> kernel received this transfer-id >>>> firmware's last send-callback transfer ID is 71. >>>> >>>> So, it seems that either ath10k did not do the transfer-complete logic, >>>> did it incorrectly, or the firmware did not notice it was done. >>>> >>>> I cannot find where the transfer complete code that should be updating >>>> firmware is at. If you know, can you point me to it? >>> >>> I think the send-callback should be called when CE is simply done >>> doing it's stuff. There's no need for the other side to ack anything >>> explicitly (it just needs to have a free buffer on it's side so CE can >>> copy it over). >>> >>> Or maybe it is the HOST_IS_COPY_COMPLETE_MASK? Not really sure. >> >> I am now guessing that some magic IRQ happens when ath10k_ce_src_ring_write_index_set() >> is called. > > Correct. CE should generate an interrupt (provided it's not masked in > CE registers) on the other end when ring index is bumped. > > >> I may have narrowed down the problem a bit further now. >> >> I printed out the ring indexes in firmware and driver when lockup >> occured. 
The target -> host ring ids match fine, but I notice that
>> it appears the firmware has pending entries in it's host -> target wmi
>> ring that it has not consumed.
>>
>> Maybe it missed an irq or has some related race.
>
> Hmm.. The host can tell the target it wants tx credit update in the
> htc host->target buffer. Upstream ath10k does this only when spending
> last tx credit. Your observation would explain why firmware doesn't
> send tx credit update to the host - it didn't get to see the
> need-credit-update. Does your tree modify behaviour of when is set
> ATH10K_HTC_FLAG_NEED_CREDIT_UPDATE in ath10k?

I am running a patch you posted a long time ago that enables credit-req on every frame.

Firmware thinks it has given all credits back to the host. The problem seems to be that the firmware just did not receive the last two requests from the host because it failed to properly read its wmi ring buffer.

My attempt to force a read keeps crashing...I'll be back at debugging that now.

>> I'm going to try forcing a poll of the host -> target wmi queue in the
>> firmware when it detects no wmi keep-alive messages and see if that kicks
>> things back into action, and maybe see if I can find any reason for it
>> to not properly handle the ring in the first place.
>
> Did you try the old workaround ath10k had for hw1.0?

No, what I found looked quite horrible and complicated, and I did not take time to try to fully understand what it was doing.

>> If this works, perhaps there is a way to kick the ring from the driver
>> side...maybe send a wmi command (ignoring quota) that has no affect,
>> or something like that?
>
> I think the wmi-echo could be suited for this. It probably doesn't use
> any extra resources so overcommiting tx-credit to send it should be
> safe.

Worth a try, but since the driver ends up mostly dead-locked in this case, it may be hard to properly trigger the dummy write when needed.
For now, I'm going to focus on having firmware keep-alive timer logic force the re-read of the ring buffer. Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com
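To make the credit-update discussion in the message above concrete, here is a minimal sketch of the two policies mentioned: requesting a credit update only when the last tx credit is spent (what Michal says upstream does) versus requesting it on every frame (the patch Ben says he runs). This is not the ath10k code; the function name, the parameters, and the flag value are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for ATH10K_HTC_FLAG_NEED_CREDIT_UPDATE; the value is an
 * assumption made for this sketch only. */
#define SKETCH_FLAG_NEED_CREDIT_UPDATE 0x01u

/*
 * Hypothetical helper: decide which flags to put on an outgoing frame.
 *   credits - tx credits available before sending
 *   cost    - credits this frame consumes
 *   request_on_every_frame - true models the "credit-req on every frame"
 *                            patch, false models "only on last credit"
 */
uint8_t sketch_htc_tx_flags(unsigned int credits, unsigned int cost,
                            bool request_on_every_frame)
{
    /* The caller must already hold enough credits for this frame. */
    assert(cost > 0 && cost <= credits);

    if (request_on_every_frame || credits - cost == 0)
        return SKETCH_FLAG_NEED_CREDIT_UPDATE;

    return 0;
}
```

Either way the request only helps if the target actually reads the frame from its host -> target ring, which is the failure being chased in this thread.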
* Re: Anyone seeing tx-credits 'hang'? 2015-01-15 7:48 ` Michal Kazior 2015-01-15 17:17 ` Ben Greear @ 2015-01-20 4:34 ` Ben Greear 2015-01-21 7:22 ` Michal Kazior 1 sibling, 1 reply; 15+ messages in thread From: Ben Greear @ 2015-01-20 4:34 UTC (permalink / raw) To: Michal Kazior; +Cc: ath10k Ok, so I think I've mostly got this figured out...at least enough to work around the problem. It seems that the firmware and/or NIC hardware stops doing CE interrupts for the WMI rings (at least). If I force a poll of the rings, then packets are found and may be processed. In one case I looked at closely, it seems IRQs went away for around 30 seconds, and then for no obvious reason IRQs for the rings started being delivered and processed again. ~20 WMI messages were processed due to polling CE rings in this interval. The combination of WMI keep-alive messages sent from host, and timer to check for timeouts (and do CE polling at higher intervals when timeout is detected) appears to be enough. I also check for the IRQ working again and stop the polling at that time. I plan to clean the firmware changes up and commit them to my own repo...but it will require host changes to enable the keep-alive to fully work around this problem. Probably none of this will make it upstream.... Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com
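The workaround described in the message above (keep-alive from the host, a timer that detects a timeout, CE ring polling while the timeout persists, and a stop once IRQs work again) can be sketched as a small state machine. This is only an illustration of the idea, not the CT firmware code: every name is hypothetical, and the 2000 ms threshold is an assumed value, since the thread does not state when polling begins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed threshold: keep-alive silence after which polling starts. */
#define SKETCH_KA_TIMEOUT_MS 2000u

enum sketch_ka_action { SKETCH_KA_IDLE, SKETCH_KA_POLL_RINGS };

struct sketch_ka_monitor {
    uint32_t last_keepalive_ms;    /* when the last keep-alive was seen */
    uint32_t irq_count_at_start;   /* CE IRQ count when polling began */
    bool polling;                  /* true while the rings are being polled */
};

void sketch_ka_init(struct sketch_ka_monitor *m, uint32_t now_ms)
{
    assert(m != NULL);
    m->last_keepalive_ms = now_ms;
    m->irq_count_at_start = 0;
    m->polling = false;
}

/* Call whenever a keep-alive message is processed (via IRQ or via polling). */
void sketch_ka_on_keepalive(struct sketch_ka_monitor *m, uint32_t now_ms)
{
    m->last_keepalive_ms = now_ms;
}

/*
 * Call from a periodic timer. irq_count is a running count of CE
 * interrupts actually serviced; if it moves while polling, interrupts
 * are being delivered again and polling can stop.
 */
enum sketch_ka_action sketch_ka_on_timer(struct sketch_ka_monitor *m,
                                         uint32_t now_ms, uint32_t irq_count)
{
    if (m->polling) {
        if (irq_count != m->irq_count_at_start) {
            m->polling = false;        /* IRQs work again */
            return SKETCH_KA_IDLE;
        }
        return SKETCH_KA_POLL_RINGS;   /* still no IRQs, keep polling */
    }

    /* Unsigned subtraction keeps this correct across counter wrap. */
    if ((uint32_t)(now_ms - m->last_keepalive_ms) >= SKETCH_KA_TIMEOUT_MS) {
        m->polling = true;
        m->irq_count_at_start = irq_count;
        return SKETCH_KA_POLL_RINGS;
    }

    return SKETCH_KA_IDLE;
}
```

Note that a keep-alive found by polling does not end the polling state in this sketch; only a serviced interrupt does, which mirrors "I also check for the IRQ working again and stop the polling at that time."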
* Re: Anyone seeing tx-credits 'hang'? 2015-01-20 4:34 ` Ben Greear @ 2015-01-21 7:22 ` Michal Kazior 2015-01-21 15:42 ` Ben Greear 0 siblings, 1 reply; 15+ messages in thread From: Michal Kazior @ 2015-01-21 7:22 UTC (permalink / raw) To: Ben Greear; +Cc: ath10k On 20 January 2015 at 05:34, Ben Greear <greearb@candelatech.com> wrote: > Ok, so I think I've mostly got this figured out...at least enough to > work around the problem. > > It seems that the firmware and/or NIC hardware stops doing CE interrupts > for the WMI rings (at least). If I force a poll of > the rings, then packets are found and may be processed. So you just keep calling ath10k_hif_send_complete_check() (or ath10k_ce_per_engine_service()) for polling, right? > In one case I looked at closely, it seems IRQs went away for around 30 > seconds, > and then for no obvious reason IRQs for the rings started being delivered > and > processed again. ~20 WMI messages were processed due to polling CE rings in > this > interval. Out of curiosity - what irq mode are you using? Shared or MSI? Or did you try both? > The combination of WMI keep-alive messages sent from host, and > timer to check for timeouts (and do CE polling at higher intervals > when timeout is detected) appears to be enough. I also check > for the IRQ working again and stop the polling at that time. > > I plan to clean the firmware changes up and commit them to my > own repo...but it will require host changes to enable the keep-alive > to fully work around this problem. Probably none of this will make > it upstream.... We could add a watchdog to WMI which uses the `echo` command and looks at echo events and tx credit completion (WMI is notified about that). In case neither comes in in a timely fashion (let's say 1s, which is less than WMI command timeout of 3s) we start polling until things settle down. This should work with standard firmware, no? 
Michał
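A rough sketch of the host-side watchdog proposed in the message above: after a WMI echo is sent, an echo event or a tx-credit completion counts as progress, and if neither arrives within 1s (below the 3s WMI command timeout quoted above) the host begins polling until progress resumes. All names here are hypothetical and this is not existing ath10k code. Ben's reply in this thread reports that host-side polling did not fix his hang, so treat this purely as an illustration of the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_WD_TIMEOUT_MS  1000u  /* "let's say 1s" from the proposal */
#define SKETCH_WMI_TIMEOUT_MS 3000u  /* WMI command timeout quoted above */

/* The watchdog is only useful if it fires before the command times out. */
_Static_assert(SKETCH_WD_TIMEOUT_MS < SKETCH_WMI_TIMEOUT_MS,
               "watchdog must fire before the WMI command timeout");

struct sketch_wmi_watchdog {
    uint32_t echo_sent_ms;  /* when the outstanding echo was sent */
    bool echo_pending;      /* an echo is waiting for its event */
    bool polling;           /* true while the host polls for completions */
};

/* Record that an echo command was handed to the transport. */
void sketch_wd_echo_sent(struct sketch_wmi_watchdog *wd, uint32_t now_ms)
{
    assert(wd != NULL);
    wd->echo_sent_ms = now_ms;
    wd->echo_pending = true;
}

/* Call on an echo event or on a tx-credit completion: things have settled. */
void sketch_wd_progress(struct sketch_wmi_watchdog *wd)
{
    wd->echo_pending = false;
    wd->polling = false;
}

/* Call periodically; returns true while the caller should poll. */
bool sketch_wd_should_poll(struct sketch_wmi_watchdog *wd, uint32_t now_ms)
{
    if (wd->echo_pending &&
        (uint32_t)(now_ms - wd->echo_sent_ms) >= SKETCH_WD_TIMEOUT_MS)
        wd->polling = true;

    return wd->polling;
}
```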
* Re: Anyone seeing tx-credits 'hang'? 2015-01-21 7:22 ` Michal Kazior @ 2015-01-21 15:42 ` Ben Greear 2015-01-22 6:11 ` Michal Kazior 0 siblings, 1 reply; 15+ messages in thread From: Ben Greear @ 2015-01-21 15:42 UTC (permalink / raw) To: Michal Kazior; +Cc: ath10k On 01/20/2015 11:22 PM, Michal Kazior wrote: > On 20 January 2015 at 05:34, Ben Greear <greearb@candelatech.com> wrote: >> Ok, so I think I've mostly got this figured out...at least enough to >> work around the problem. >> >> It seems that the firmware and/or NIC hardware stops doing CE interrupts >> for the WMI rings (at least). If I force a poll of >> the rings, then packets are found and may be processed. > > So you just keep calling ath10k_hif_send_complete_check() (or > ath10k_ce_per_engine_service) for polling, right? The polling is in firmware...but it is calling the firmware variants of these. I did actually add polling in the host as well, but that did not fix the problem. I will back that out and make sure the problem remains fixed with just the firmware changes and host keep-alive messages to enable the firmware changes. >> In one case I looked at closely, it seems IRQs went away for around 30 >> seconds, >> and then for no obvious reason IRQs for the rings started being delivered >> and >> processed again. ~20 WMI messages were processed due to polling CE rings in >> this >> interval. > > Out of curiosity - what irq mode are you using? Shared or MSI? Or did > you try both? Probably MSI, but I don't actually know. Is there an easy way to tell? >> The combination of WMI keep-alive messages sent from host, and >> timer to check for timeouts (and do CE polling at higher intervals >> when timeout is detected) appears to be enough. I also check >> for the IRQ working again and stop the polling at that time. >> >> I plan to clean the firmware changes up and commit them to my >> own repo...but it will require host changes to enable the keep-alive >> to fully work around this problem. 
Probably none of this will make >> it upstream.... > > We could add a watchdog to WMI which uses the `echo` command and looks > at echo events and tx credit completion (WMI is notified about that). > In case neither comes in in a timely fashion (let's say 1s, which is > less than WMI command timeout of 3s) we start polling until things > settle down. This should work with standard firmware, no? Since it is firmware that has to do the CE polling, I don't see any way to resolve this w/out hacking firmware...and you need a new message to send to firmware from host that firmware can be sure is periodic to use as its WMI keep-alive timer. That is why I made a new message type for this (otherwise, cannot really be backwards compat with old kernels that do not send regular keep-alives, but *may* send any other valid message type for whatever reason whenever they want.) Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com
* Re: Anyone seeing tx-credits 'hang'? 2015-01-21 15:42 ` Ben Greear @ 2015-01-22 6:11 ` Michal Kazior 0 siblings, 0 replies; 15+ messages in thread From: Michal Kazior @ 2015-01-22 6:11 UTC (permalink / raw) To: Ben Greear; +Cc: ath10k On 21 January 2015 at 16:42, Ben Greear <greearb@candelatech.com> wrote: > On 01/20/2015 11:22 PM, Michal Kazior wrote: >> On 20 January 2015 at 05:34, Ben Greear <greearb@candelatech.com> wrote: [...] >> Out of curiosity - what irq mode are you using? Shared or MSI? Or did >> you try both? > > > Probably MSI, but I don't actually know. Is there an easy way to tell? When ath10k loads it prints in the kernel log either:
ath10k_pci 0000:00:00.0: pci irq legacy interrupts 0 irq_mode 0 reset_mode 0 (shared)
or
ath10k_pci 0000:00:05.0: pci irq msi interrupts 1 irq_mode 0 reset_mode 0 (MSI)
You can force shared interrupts if you load ath10k_pci with irq_mode=1. >>> The combination of WMI keep-alive messages sent from host, and >>> timer to check for timeouts (and do CE polling at higher intervals >>> when timeout is detected) appears to be enough. I also check >>> for the IRQ working again and stop the polling at that time. >>> >>> I plan to clean the firmware changes up and commit them to my >>> own repo...but it will require host changes to enable the keep-alive >>> to fully work around this problem. Probably none of this will make >>> it upstream.... >> >> >> We could add a watchdog to WMI which uses the `echo` command and looks >> at echo events and tx credit completion (WMI is notified about that). >> In case neither comes in in a timely fashion (let's say 1s, which is >> less than WMI command timeout of 3s) we start polling until things >> settle down. This should work with standard firmware, no? 
> > > Since it is firmware that has to do the CE polling, I don't see any > way to resolve this w/out hacking firmware...and you need a new message to > send to firmware from host that firmware can be sure is periodic to use > as its WMI keep-alive timer. That is why I made a new message type > for this (otherwise, cannot really be backwards compat with old kernels that > do not send regular keep-alives, but *may* send any other valid message type > for > whatever reason whenever they want.) Oh. I totally misunderstood you before. Thanks for clarifying. Michał
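For anyone scripting the irq-mode check from the message above, a small sketch that classifies a kernel log line using only the two strings quoted there ("pci irq legacy" and "pci irq msi"). The function and enum names are hypothetical, and log formats in other driver versions may differ, in which case the line may be reported as unknown or misclassified.

```c
#include <assert.h>
#include <string.h>

enum sketch_irq_kind {
    SKETCH_IRQ_UNKNOWN,
    SKETCH_IRQ_LEGACY_SHARED,
    SKETCH_IRQ_MSI
};

/* Classify one kernel log line by the substrings quoted in the thread. */
enum sketch_irq_kind sketch_classify_irq_line(const char *line)
{
    assert(line != NULL);

    if (strstr(line, "pci irq legacy") != NULL)
        return SKETCH_IRQ_LEGACY_SHARED;
    if (strstr(line, "pci irq msi") != NULL)
        return SKETCH_IRQ_MSI;

    return SKETCH_IRQ_UNKNOWN;
}
```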
Thread overview: 15+ messages
2015-01-08 21:24 Anyone seeing tx-credits 'hang'? Ben Greear
2015-01-09 10:34 ` Michal Kazior
2015-01-09 16:55 ` Ben Greear
2015-01-12 8:06 ` Michal Kazior
2015-01-12 16:51 ` Ben Greear
2015-01-13 19:07 ` Ben Greear
2015-01-14 9:45 ` Michal Kazior
2015-01-14 17:57 ` Ben Greear
[not found] ` <54B6D67C.4090006@qca.qualcomm.com>
[not found] ` <54B6DE13.1080609@candelatech.com>
2015-01-15 1:54 ` Peter Oh
2015-01-15 7:48 ` Michal Kazior
2015-01-15 17:17 ` Ben Greear
2015-01-20 4:34 ` Ben Greear
2015-01-21 7:22 ` Michal Kazior
2015-01-21 15:42 ` Ben Greear
2015-01-22 6:11 ` Michal Kazior