* Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
@ 2016-06-02 13:52 Valo, Kalle
2016-06-02 14:24 ` Valo, Kalle
0 siblings, 1 reply; 14+ messages in thread
From: Valo, Kalle @ 2016-06-02 13:52 UTC (permalink / raw)
To: ath10k@lists.infradead.org, Manoharan, Rajkumar
Cc: linux-wireless@vger.kernel.org
Hi,
there's a regression in ath10k:
https://bugzilla.kernel.org/show_bug.cgi?id=119151
Reporter bisected it to this:
5c86d97bcc1d42ce7f75685a61be4dad34ee8183 is the first bad commit
commit 5c86d97bcc1d42ce7f75685a61be4dad34ee8183
Author: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
Date: Tue Mar 22 17:22:19 2016 +0530
ath10k: combine txrx and replenish task
Since tx completion and rx indication processing are moved out
of txrx tasklet and rx ring lock contention also removed from txrx
for rx_ind messages, it would be efficient to combine both replenish
and txrx tasks. Refill threshold is adjusted for both AP135 and AP148
(low and high end systems). With this adjustment in AP135, TCP DL is
improved from 603 Mbps to 620 Mbps and UDP DL is improved from 758 Mbps
to 803 Mbps. Also no watchdog are observed on UDP BiDi.
Signed-off-by: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
--
Kalle Valo
* Re: Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
2016-06-02 13:52 Bug 119151 - [regression] ath10k no longer authenitcates and freezes system Valo, Kalle
@ 2016-06-02 14:24 ` Valo, Kalle
2016-06-02 15:21 ` Ben Greear
0 siblings, 1 reply; 14+ messages in thread
From: Valo, Kalle @ 2016-06-02 14:24 UTC (permalink / raw)
To: ath10k@lists.infradead.org
Cc: Manoharan, Rajkumar, linux-wireless@vger.kernel.org,
mike@fireburn.co.uk
Kalle Valo <kvalo@qca.qualcomm.com> writes:
> there's a regression in ath10k:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=119151
>
> Reporter bisected it to this:
>
> 5c86d97bcc1d42ce7f75685a61be4dad34ee8183 is the first bad commit
> commit 5c86d97bcc1d42ce7f75685a61be4dad34ee8183
> Author: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
> Date: Tue Mar 22 17:22:19 2016 +0530
>
> ath10k: combine txrx and replenish task
>
> Since tx completion and rx indication processing are moved out
> of txrx tasklet and rx ring lock contention also removed from txrx
> for rx_ind messages, it would be efficient to combine both replenish
> and txrx tasks. Refill threshold is adjusted for both AP135 and AP148
> (low and high end systems). With this adjustment in AP135, TCP DL is
> improved from 603 Mbps to 620 Mbps and UDP DL is improved from 758 Mbps
> to 803 Mbps. Also no watchdog are observed on UDP BiDi.
>
> Signed-off-by: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
> Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Adding Mike, the bug reporter.
--
Kalle Valo
* Re: Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
2016-06-02 14:24 ` Valo, Kalle
@ 2016-06-02 15:21 ` Ben Greear
2016-06-02 15:26 ` Valo, Kalle
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Ben Greear @ 2016-06-02 15:21 UTC (permalink / raw)
To: Valo, Kalle, ath10k@lists.infradead.org
Cc: Manoharan, Rajkumar, linux-wireless@vger.kernel.org,
mike@fireburn.co.uk
On 06/02/2016 07:24 AM, Valo, Kalle wrote:
> Kalle Valo <kvalo@qca.qualcomm.com> writes:
>
>> there's a regression in ath10k:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=119151
>>
>> Reporter bisected it to this:
>>
>> 5c86d97bcc1d42ce7f75685a61be4dad34ee8183 is the first bad commit
>> commit 5c86d97bcc1d42ce7f75685a61be4dad34ee8183
>> Author: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
>> Date: Tue Mar 22 17:22:19 2016 +0530
>>
>> ath10k: combine txrx and replenish task
>>
>> Since tx completion and rx indication processing are moved out
>> of txrx tasklet and rx ring lock contention also removed from txrx
>> for rx_ind messages, it would be efficient to combine both replenish
>> and txrx tasks. Refill threshold is adjusted for both AP135 and AP148
>> (low and high end systems). With this adjustment in AP135, TCP DL is
>> improved from 603 Mbps to 620 Mbps and UDP DL is improved from 758 Mbps
>> to 803 Mbps. Also no watchdog are observed on UDP BiDi.
>>
>> Signed-off-by: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
>> Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
>
> Adding Mike, the bug reporter.
I found a lot of problems with this code as well, and the 5 patches
starting from the URL below fixed the issues for me.
They are stuck as 'NA' in patchwork, but I don't know why.
http://lists.infradead.org/pipermail/ath10k/2016-April/007218.html
You probably need this patch as well, or ath10k will crash when you
enable the debug-mask:
http://permalink.gmane.org/gmane.linux.kernel.wireless.general/151890
It is also 'NA' in patchwork.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
2016-06-02 15:21 ` Ben Greear
@ 2016-06-02 15:26 ` Valo, Kalle
2016-06-02 15:32 ` Ben Greear
2016-06-02 15:34 ` Mohammed Shafi Shajakhan
2016-06-02 17:03 ` Manoharan, Rajkumar
2 siblings, 1 reply; 14+ messages in thread
From: Valo, Kalle @ 2016-06-02 15:26 UTC (permalink / raw)
To: Ben Greear
Cc: ath10k@lists.infradead.org, Manoharan, Rajkumar,
linux-wireless@vger.kernel.org, mike@fireburn.co.uk
Ben Greear <greearb@candelatech.com> writes:
> I found a lot of problems with this code as well, and the 5 patches
> starting from the URL below fixed the issues for me.
>
> They are stuck as 'NA' in patchwork, but I don't know why.
>
> http://lists.infradead.org/pipermail/ath10k/2016-April/007218.html
ath10k has a separate patchwork instance, did you look at the correct
one? I have quite a lot of patches from you in deferred state because of
the patch bomb, but I'm hoping to go through them soon.
https://patchwork.kernel.org/project/ath10k/list/?state=10&delegate=25621&order=date
--
Kalle Valo
* Re: Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
2016-06-02 15:26 ` Valo, Kalle
@ 2016-06-02 15:32 ` Ben Greear
2016-06-03 15:52 ` Valo, Kalle
0 siblings, 1 reply; 14+ messages in thread
From: Ben Greear @ 2016-06-02 15:32 UTC (permalink / raw)
To: Valo, Kalle
Cc: ath10k@lists.infradead.org, Manoharan, Rajkumar,
linux-wireless@vger.kernel.org, mike@fireburn.co.uk
On 06/02/2016 08:26 AM, Valo, Kalle wrote:
> Ben Greear <greearb@candelatech.com> writes:
>
>> I found a lot of problems with this code as well, and the 5 patches
>> starting from the URL below fixed the issues for me.
>>
>> They are stuck as 'NA' in patchwork, but I don't know why.
>>
>> http://lists.infradead.org/pipermail/ath10k/2016-April/007218.html
>
> ath10k has a separate patchwork instance, did you look at the correct
> one? I have quite a lot of patches from you in deferred state because of
> the patch bomb, but I'm hoping to go through them soon.
>
> https://patchwork.kernel.org/project/ath10k/list/?state=10&delegate=25621&order=date
Ok, they are deferred then.
The series of 5 is likely quite useful and fixes some nasty bugs,
and the first patch of the big bomb is also a trivial crash fix
for a regression you added (as best as I can tell).
The rest of the patch bomb is less critical, but making some progress on
that would make me feel good about working on ath10k patches again.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
2016-06-02 15:21 ` Ben Greear
2016-06-02 15:26 ` Valo, Kalle
@ 2016-06-02 15:34 ` Mohammed Shafi Shajakhan
2016-06-02 17:03 ` Manoharan, Rajkumar
2 siblings, 0 replies; 14+ messages in thread
From: Mohammed Shafi Shajakhan @ 2016-06-02 15:34 UTC (permalink / raw)
To: Ben Greear
Cc: Valo, Kalle, ath10k@lists.infradead.org, mike@fireburn.co.uk,
linux-wireless@vger.kernel.org, Manoharan, Rajkumar
On Thu, Jun 02, 2016 at 08:21:41AM -0700, Ben Greear wrote:
> On 06/02/2016 07:24 AM, Valo, Kalle wrote:
> >Kalle Valo <kvalo@qca.qualcomm.com> writes:
> >
> >>there's a regression in ath10k:
> >>
> >>https://bugzilla.kernel.org/show_bug.cgi?id=119151
> >>
> >>Reporter bisected it to this:
> >>
> >>5c86d97bcc1d42ce7f75685a61be4dad34ee8183 is the first bad commit
> >>commit 5c86d97bcc1d42ce7f75685a61be4dad34ee8183
> >>Author: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
> >>Date: Tue Mar 22 17:22:19 2016 +0530
> >>
> >>ath10k: combine txrx and replenish task
> >>
> >>Since tx completion and rx indication processing are moved out
> >>of txrx tasklet and rx ring lock contention also removed from txrx
> >>for rx_ind messages, it would be efficient to combine both replenish
> >>and txrx tasks. Refill threshold is adjusted for both AP135 and AP148
> >>(low and high end systems). With this adjustment in AP135, TCP DL is
> >>improved from 603 Mbps to 620 Mbps and UDP DL is improved from 758 Mbps
> >>to 803 Mbps. Also no watchdog are observed on UDP BiDi.
> >>
> >>Signed-off-by: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
> >>Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
> >
> >Adding Mike, the bug reporter.
>
>
> I found a lot of problems with this code as well, and the 5 patches
> starting from the URL below fixed the issues for me.
>
> They are stuck as 'NA' in patchwork, but I don't know why.
>
> http://lists.infradead.org/pipermail/ath10k/2016-April/007218.html
>
> You probably need this patch as well, or ath10k will crash when you
> enable the debug-mask:
>
> http://permalink.gmane.org/gmane.linux.kernel.wireless.general/151890
>
> It is also 'NA' in patchwork.
[shafi] I think this is already in the pending branch:
https://patchwork.kernel.org/patch/9073471/
>
> Thanks,
> Ben
>
> --
> Ben Greear <greearb@candelatech.com>
> Candela Technologies Inc http://www.candelatech.com
>
* Re: Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
2016-06-02 15:21 ` Ben Greear
2016-06-02 15:26 ` Valo, Kalle
2016-06-02 15:34 ` Mohammed Shafi Shajakhan
@ 2016-06-02 17:03 ` Manoharan, Rajkumar
2016-06-02 17:23 ` Ben Greear
2 siblings, 1 reply; 14+ messages in thread
From: Manoharan, Rajkumar @ 2016-06-02 17:03 UTC (permalink / raw)
To: Ben Greear, Valo, Kalle, ath10k@lists.infradead.org,
Rajkumar Manoharan
Cc: linux-wireless@vger.kernel.org, mike@fireburn.co.uk
On Thursday, June 2, 2016 8:51 PM, Ben Greear <greearb@candelatech.com> wrote:
> On 06/02/2016 07:24 AM, Valo, Kalle wrote:
>> Kalle Valo <kvalo@qca.qualcomm.com> writes:
>>
>>> there's a regression in ath10k:
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=119151
>>>
>>> Reporter bisected it to this:
>>>
>>> 5c86d97bcc1d42ce7f75685a61be4dad34ee8183 is the first bad commit
>>> commit 5c86d97bcc1d42ce7f75685a61be4dad34ee8183
>>> Author: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
>>> Date: Tue Mar 22 17:22:19 2016 +0530
>>>
>>> ath10k: combine txrx and replenish task
>>>
>>> Since tx completion and rx indication processing are moved out
>>> of txrx tasklet and rx ring lock contention also removed from txrx
>>> for rx_ind messages, it would be efficient to combine both replenish
>>> and txrx tasks. Refill threshold is adjusted for both AP135 and AP148
>>> (low and high end systems). With this adjustment in AP135, TCP DL is
>>> improved from 603 Mbps to 620 Mbps and UDP DL is improved from 758 Mbps
>>> to 803 Mbps. Also no watchdog are observed on UDP BiDi.
>>>
>>> Signed-off-by: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
>>> Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
>>
>> Adding Mike, the bug reporter.
>
Mike,
Sorry for the regression. Since the patch combines the txrx and replenish tasklets,
it was validated on low-end embedded devices like AP135 (single-core 720 MHz MIPS processor).
It seems yours is an octa-core processor, so the CPU is not the bottleneck here. I need your help to fix this issue asap.
Can you please try reducing the rx refill threshold as below?
diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
index 2aa407160859..d35d3d48ae6c 100644
--- a/drivers/net/wireless/ath/ath10k/htt.h
+++ b/drivers/net/wireless/ath/ath10k/htt.h
@@ -1734,7 +1734,7 @@ struct htt_rx_desc {
 /* Refill a bunch of RX buffers for each refill round so that FW/HW can handle
  * aggregated traffic more nicely. */
-#define ATH10K_HTT_MAX_NUM_REFILL 100
+#define ATH10K_HTT_MAX_NUM_REFILL 16
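
To illustrate what this define controls, here is a minimal sketch in plain C. It assumes the threshold simply caps how many rx buffers are posted per refill round, with the remainder left for a later round. The struct, its fields, and refill_round() are hypothetical names and are not the driver's actual code.

```c
#include <stddef.h>

/* Hypothetical stand-in for ATH10K_HTT_MAX_NUM_REFILL; 16 is the value
 * suggested in the patch above, 100 the value it replaces. */
#define MAX_NUM_REFILL 16

/* Hypothetical, simplified ring state: not the driver's real structure. */
struct rx_ring_sketch {
	size_t size;      /* total buffer slots in the ring */
	size_t fill_cnt;  /* slots that currently hold a buffer */
};

/* Refill at most MAX_NUM_REFILL buffers in one round.
 * Returns the number of slots still empty afterwards, which a real
 * driver would have to handle in a later round. */
size_t refill_round(struct rx_ring_sketch *ring)
{
	size_t missing = ring->size - ring->fill_cnt;
	size_t batch = missing < MAX_NUM_REFILL ? missing : MAX_NUM_REFILL;

	ring->fill_cnt += batch;  /* stands in for allocating and posting buffers */
	return missing - batch;
}
```

With a 512-slot ring that is 40 buffers short, a cap of 16 fills the ring in three rounds (16, 16, 8), so each round does less work than a single 40-buffer refill under a cap of 100.
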
From the log attached to your bug report, I found a few timeout messages.
May 30 21:09:26 axion kernel: wlan0: deauthenticating from a0:63:91:a7:3c:9f by local choice (Reason: 3=DEAUTH_LEAVING)
May 30 21:09:32 axion kernel: ath10k_pci 0000:3c:00.0: failed to flush transmit queue (skip 0 ar-state 1): 0
May 30 21:09:35 axion kernel: ath10k_pci 0000:3c:00.0: failed to delete peer a0:63:91:a7:3c:9f for vdev 0: -110
May 30 21:09:35 axion kernel: ath10k_pci 0000:3c:00.0: found sta peer a0:63:91:a7:3c:9f entry on vdev 0 after it was supposed
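
The -110 in the "failed to delete peer" line is a negative errno value. On Linux for x86 (assumed here to be the reporter's architecture) 110 is ETIMEDOUT, which is why these lines read as timeouts. A small userspace C sketch for decoding such return codes follows; is_timeout() and describe_ret() are hypothetical helper names, not driver functions.

```c
#include <errno.h>
#include <string.h>

/* Kernel-style return codes are negative errno values. */
int is_timeout(int ret)
{
	return ret == -ETIMEDOUT;
}

/* Human-readable text for a (possibly negative) errno-style code. */
const char *describe_ret(int ret)
{
	return strerror(ret < 0 ? -ret : ret);
}
```

Calling describe_ret(-110) on such a system returns the C library's text for ETIMEDOUT; the exact wording depends on the libc in use.
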
Try disabling pci power save for qca6174 as below.
diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 852f2c18cd11..5e3ba37a8c6a 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -2979,7 +2979,7 @@ static int ath10k_pci_probe(struct pci_dev *pdev,
 	case QCA6164_2_1_DEVICE_ID:
 	case QCA6174_2_1_DEVICE_ID:
 		hw_rev = ATH10K_HW_QCA6174;
-		pci_ps = true;
+		pci_ps = false;
 		pci_soft_reset = ath10k_pci_warm_reset;
 		pci_hard_reset = ath10k_pci_qca6174_chip_reset;
>
> I found a lot of problems with this code as well, and the 5 patches
> starting from the URL below fixed the issues for me.
>
Ben,
Can you please explain the sort of issues you have observed with this change?
-Rajkumar
* Re: Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
2016-06-02 17:03 ` Manoharan, Rajkumar
@ 2016-06-02 17:23 ` Ben Greear
2016-06-02 17:41 ` Rajkumar Manoharan
[not found] ` <CAHbf0-GT0y1pEs-ToxbPAf+aRo7TNAyV_Emies_rjL27R1fk2A@mail.gmail.com>
0 siblings, 2 replies; 14+ messages in thread
From: Ben Greear @ 2016-06-02 17:23 UTC (permalink / raw)
To: Manoharan, Rajkumar, Valo, Kalle, ath10k@lists.infradead.org,
Rajkumar Manoharan
Cc: linux-wireless@vger.kernel.org, mike@fireburn.co.uk
On 06/02/2016 10:03 AM, Manoharan, Rajkumar wrote:
> On Thursday, June 2, 2016 8:51 PM, Ben Greear <greearb@candelatech.com> wrote:
>> On 06/02/2016 07:24 AM, Valo, Kalle wrote:
>>> Kalle Valo <kvalo@qca.qualcomm.com> writes:
>>>
>>>> there's a regression in ath10k:
>>>>
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=119151
>>>>
>>>> Reporter bisected it to this:
>>>>
>>>> 5c86d97bcc1d42ce7f75685a61be4dad34ee8183 is the first bad commit
>>>> commit 5c86d97bcc1d42ce7f75685a61be4dad34ee8183
>>>> Author: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
>>>> Date: Tue Mar 22 17:22:19 2016 +0530
>>>>
>>>> ath10k: combine txrx and replenish task
>>>>
>>>> Since tx completion and rx indication processing are moved out
>>>> of txrx tasklet and rx ring lock contention also removed from txrx
>>>> for rx_ind messages, it would be efficient to combine both replenish
>>>> and txrx tasks. Refill threshold is adjusted for both AP135 and AP148
>>>> (low and high end systems). With this adjustment in AP135, TCP DL is
>>>> improved from 603 Mbps to 620 Mbps and UDP DL is improved from 758 Mbps
>>>> to 803 Mbps. Also no watchdog are observed on UDP BiDi.
>>>>
>>>> Signed-off-by: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
>>>> Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
>>>
>>> Adding Mike, the bug reporter.
>>
> Mike,
>
> Sorry for the regression. Since the patch combines the txrx and replenish tasklets,
> it was validated on low-end embedded devices like AP135 (single-core 720 MHz MIPS processor).
>
> It seems yours is an octa-core processor, so the CPU is not the bottleneck here. I need your help to fix this issue asap.
> Can you please try reducing the rx refill threshold as below?
>
> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
> index 2aa407160859..d35d3d48ae6c 100644
> --- a/drivers/net/wireless/ath/ath10k/htt.h
> +++ b/drivers/net/wireless/ath/ath10k/htt.h
> @@ -1734,7 +1734,7 @@ struct htt_rx_desc {
>
> /* Refill a bunch of RX buffers for each refill round so that FW/HW can handle
> * aggregated traffic more nicely. */
> -#define ATH10K_HTT_MAX_NUM_REFILL 100
> +#define ATH10K_HTT_MAX_NUM_REFILL 16
>
> From the log attached to your bug report, I found a few timeout messages.
> May 30 21:09:26 axion kernel: wlan0: deauthenticating from a0:63:91:a7:3c:9f by local choice (Reason: 3=DEAUTH_LEAVING)
> May 30 21:09:32 axion kernel: ath10k_pci 0000:3c:00.0: failed to flush transmit queue (skip 0 ar-state 1): 0
> May 30 21:09:35 axion kernel: ath10k_pci 0000:3c:00.0: failed to delete peer a0:63:91:a7:3c:9f for vdev 0: -110
> May 30 21:09:35 axion kernel: ath10k_pci 0000:3c:00.0: found sta peer a0:63:91:a7:3c:9f entry on vdev 0 after it was supposed
>
> Try disabling pci power save for qca6174 as below.
>
> diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
> index 852f2c18cd11..5e3ba37a8c6a 100644
> --- a/drivers/net/wireless/ath/ath10k/pci.c
> +++ b/drivers/net/wireless/ath/ath10k/pci.c
> @@ -2979,7 +2979,7 @@ static int ath10k_pci_probe(struct pci_dev *pdev,
> case QCA6164_2_1_DEVICE_ID:
> case QCA6174_2_1_DEVICE_ID:
> hw_rev = ATH10K_HW_QCA6174;
> - pci_ps = true;
> + pci_ps = false;
> pci_soft_reset = ath10k_pci_warm_reset;
> pci_hard_reset = ath10k_pci_qca6174_chip_reset;
>
>>
>> I found a lot of problems with this code as well, and the 5 patches
>> starting from the URL below fixed the issues for me.
>>
> Ben,
>
> Can you please explain the sort of issues you have observed with this change?
I imported a bunch of upstream patches at once, so not sure exactly what commit
caused it. And, this was about 2 months ago... Upon review, I'm not sure I even have
the patch this particular bug was bisected to, so maybe that is some other issue.
But, the problems I saw were deadlocks and memory corruption. A lot of it was
because I was debugging new firmware at the time and so peer creation was failing
sometimes, and things like that. The error handling in ath10k for this was
faulty and racy and such. We have not seen any performance regressions,
but we mostly run on very powerful CPUs.
Please take a look at those 5 patches. A good review would be much appreciated,
and by reading them you will better be able to see the problems I was hitting
and trying to fix.
In case you want to look at the full context of those patches, you can find
them here (around 24 patches down from the top...)
http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=summary
For now, I am sticking with 4.4 + what I pulled in, but will rebase against upstream someday
soon-ish and then we can start testing it all over again :)
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
2016-06-02 17:23 ` Ben Greear
@ 2016-06-02 17:41 ` Rajkumar Manoharan
2016-06-02 18:02 ` Ben Greear
[not found] ` <CAHbf0-GT0y1pEs-ToxbPAf+aRo7TNAyV_Emies_rjL27R1fk2A@mail.gmail.com>
1 sibling, 1 reply; 14+ messages in thread
From: Rajkumar Manoharan @ 2016-06-02 17:41 UTC (permalink / raw)
To: Ben Greear; +Cc: Manoharan, Rajkumar, Valo, Kalle, ath10k, linux-wireless, mike
On 2016-06-02 22:53, Ben Greear wrote:
> On 06/02/2016 10:03 AM, Manoharan, Rajkumar wrote:
>> On Thursday, June 2, 2016 8:51 PM, Ben Greear
>> <greearb@candelatech.com> wrote:
>>> On 06/02/2016 07:24 AM, Valo, Kalle wrote:
>>>> Kalle Valo <kvalo@qca.qualcomm.com> writes:
>>>>
>>>>> there's a regression in ath10k:
>>>>>
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=119151
>>>>>
>>>>> Reporter bisected it to this:
>>>>>
>>>>> 5c86d97bcc1d42ce7f75685a61be4dad34ee8183 is the first bad commit
>>>>> commit 5c86d97bcc1d42ce7f75685a61be4dad34ee8183
>>>>> Author: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
>>>>> Date: Tue Mar 22 17:22:19 2016 +0530
>>>>>
>>>>> ath10k: combine txrx and replenish task
>>>>>
[...]
>>> I found a lot of problems with this code as well, and the 5 patches
>>> starting from the URL below fixed the issues for me.
>>>
>> Ben,
>>
>> Can you please explain the sort of issues you have observed with this
>> change?
>
> I imported a bunch of upstream patches at once, so not sure exactly
> what commit
> caused it. And, this was about 2 months ago... Upon review, I'm not
> sure I even have
> the patch this particular bug was bisected to, so maybe that is some
> other issue.
>
Please keep track of buggy commits and report them asap.
> But, the problems I saw were deadlocks and memory corruption. A lot of
> it was
> because I was debugging new firmware at the time and so peer creation
> was failing
> sometimes, and things like that. The error handling in ath10k for this
> was
> faulty and racy and such. We have not seen any performance
> regressions,
> but we mostly run on very powerful CPUs.
>
> Please take a look at those 5 patches. A good review would be much
> appreciated,
> and by reading them you will better be able to see the problems I was
> hitting
> and trying to fix.
>
The two patches below are critical, and I have already shared my feedback on them.
https://patchwork.kernel.org/patch/8727841/
https://patchwork.kernel.org/patch/9073471/
Others are LGTM.
> In case you want to look at the full context of those patches, you can
> find
> them here (around 24 patches down from the top...)
>
Quite a big list :)
> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=summary
>
> For now, I am sticking with 4.4 + what I pulled in, but will rebase
> against upstream someday
> soon-ish and then we can start testing it all over again :)
>
I will go through the list. If they have not been posted publicly yet, it would be better to do so.
-Rajkumar
* Re: Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
2016-06-02 17:41 ` Rajkumar Manoharan
@ 2016-06-02 18:02 ` Ben Greear
0 siblings, 0 replies; 14+ messages in thread
From: Ben Greear @ 2016-06-02 18:02 UTC (permalink / raw)
To: Rajkumar Manoharan
Cc: Manoharan, Rajkumar, Valo, Kalle, ath10k, linux-wireless, mike
On 06/02/2016 10:41 AM, Rajkumar Manoharan wrote:
> On 2016-06-02 22:53, Ben Greear wrote:
>> On 06/02/2016 10:03 AM, Manoharan, Rajkumar wrote:
>>> On Thursday, June 2, 2016 8:51 PM, Ben Greear <greearb@candelatech.com> wrote:
>>>> On 06/02/2016 07:24 AM, Valo, Kalle wrote:
>>>>> Kalle Valo <kvalo@qca.qualcomm.com> writes:
>>>>>
>>>>>> there's a regression in ath10k:
>>>>>>
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=119151
>>>>>>
>>>>>> Reporter bisected it to this:
>>>>>>
>>>>>> 5c86d97bcc1d42ce7f75685a61be4dad34ee8183 is the first bad commit
>>>>>> commit 5c86d97bcc1d42ce7f75685a61be4dad34ee8183
>>>>>> Author: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
>>>>>> Date: Tue Mar 22 17:22:19 2016 +0530
>>>>>>
>>>>>> ath10k: combine txrx and replenish task
>>>>>>
> [...]
>
>>>> I found a lot of problems with this code as well, and the 5 patches
>>>> starting from the URL below fixed the issues for me.
>>>>
>>> Ben,
>>>
>>> Can you please explain the sort of issues you have observed with this change?
>>
>> I imported a bunch of upstream patches at once, so not sure exactly what commit
>> caused it. And, this was about 2 months ago... Upon review, I'm not
>> sure I even have
>> the patch this particular bug was bisected to, so maybe that is some
>> other issue.
>>
> Please keep track of buggy commits and report them asap.
I posted to the list at the time. When I was debugging this, there
were so many conflicting issues that it was hard to find a single
regression point.
>> But, the problems I saw were deadlocks and memory corruption. A lot of it was
>> because I was debugging new firmware at the time and so peer creation
>> was failing
>> sometimes, and things like that. The error handling in ath10k for this was
>> faulty and racy and such. We have not seen any performance regressions,
>> but we mostly run on very powerful CPUs.
>>
>> Please take a look at those 5 patches. A good review would be much appreciated,
>> and by reading them you will better be able to see the problems I was hitting
>> and trying to fix.
>>
> The two patches below are critical, and I have already shared my feedback on them.
>
> https://patchwork.kernel.org/patch/8727841/
> https://patchwork.kernel.org/patch/9073471/
>
> Others are LGTM.
Not sure what LGTM means.
This one fixes memory corruption:
http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=blobdiff;f=drivers/net/wireless/ath/ath10k/htt_tx.c;h=58e88d392fb56a65304db17d11a9eaf0b0397dc7;hp=07b960e9704f509b3dddf1e45730e76a4c39e51e;hb=fddb6661a0f5772853fbb9feb7232f325d5f74c5;hpb=ed1757f8345064181664e4a62e2b917e694a665e
This one fixes use-after-free memory bugs:
http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=blobdiff;f=drivers/net/wireless/ath/ath10k/mac.c;h=5e5cc9c6c1d82524b9b77a7c6d2c1341c5268732;hp=8783119b9ba84e0ddb292d521e6513bf7d68a40b;hb=5ae13cea64004afc673ecc22cd70ac51179168c6;hpb=fddb6661a0f5772853fbb9feb7232f325d5f74c5
As does this one:
http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=blobdiff;f=drivers/net/wireless/ath/ath10k/mac.c;h=020dd25752224d9786da37a6dfd10a69e646b138;hp=5e5cc9c6c1d82524b9b77a7c6d2c1341c5268732;hb=c4b9566416a5e7b8d4c446d1bad34aabcbeff9f5;hpb=9bd9c11c1a2e61261c268ac2b6d791d4f6b6fe26
>
>> In case you want to look at the full context of those patches, you can find
>> them here (around 24 patches down from the top...)
>>
> Quite a big list :)
>
>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=summary
>>
>> For now, I am sticking with 4.4 + what I pulled in, but will rebase
>> against upstream someday
>> soon-ish and then we can start testing it all over again :)
>>
> I will go through the list. If they have not been posted publicly yet, it would be better to do so.
Many of these patches are related to features only in my firmware. The ~20
patch patch-bomb was a start at adding some of the hopefully less controversial
support. If I can ever get that upstream, then I will pick off another
set of patches and try to get them ready for upstream.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
2016-06-02 15:32 ` Ben Greear
@ 2016-06-03 15:52 ` Valo, Kalle
2016-06-03 16:12 ` Ben Greear
0 siblings, 1 reply; 14+ messages in thread
From: Valo, Kalle @ 2016-06-03 15:52 UTC (permalink / raw)
To: Ben Greear
Cc: ath10k@lists.infradead.org, Manoharan, Rajkumar,
linux-wireless@vger.kernel.org, mike@fireburn.co.uk
Ben Greear <greearb@candelatech.com> writes:
> On 06/02/2016 08:26 AM, Valo, Kalle wrote:
>> Ben Greear <greearb@candelatech.com> writes:
>>
>>> I found a lot of problems with this code as well, and the 5 patches
>>> starting from the URL below fixed the issues for me.
>>>
>>> They are stuck as 'NA' in patchwork, but I don't know why.
>>>
>>> http://lists.infradead.org/pipermail/ath10k/2016-April/007218.html
>>
>> ath10k has a separate patchwork instance, did you look at the correct
>> one? I have quite a lot of patches from you in deferred state because of
>> the patch bomb, but I'm hoping to go through them soon.
>>
>> https://patchwork.kernel.org/project/ath10k/list/?state=10&delegate=25621&order=date
>
> Ok, they are deferred then.
>
> The series of 5 is likely quite useful and fixes some nasty bugs,
> and the first patch of the big bomb is also a trivial crash fix
> for a regression you added (as best as I can tell).
>
> The rest of the patch bomb is less critical, but making some progress on
> that would make me feel good about working on ath10k patches again.
If I get a big patchset like 25 patches it immediately goes to the
bottom of the queue. Organising them a bit better takes like 15 minutes
of your time and makes it a lot easier to review. For example, you could
have split the patches into three sets: important bug fixes, firmware
debugging and the rest. That helps everyone and saves time.
--
Kalle Valo
* Re: Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
2016-06-03 15:52 ` Valo, Kalle
@ 2016-06-03 16:12 ` Ben Greear
0 siblings, 0 replies; 14+ messages in thread
From: Ben Greear @ 2016-06-03 16:12 UTC (permalink / raw)
To: Valo, Kalle
Cc: ath10k@lists.infradead.org, Manoharan, Rajkumar,
linux-wireless@vger.kernel.org, mike@fireburn.co.uk
On 06/03/2016 08:52 AM, Valo, Kalle wrote:
> Ben Greear <greearb@candelatech.com> writes:
>
>> On 06/02/2016 08:26 AM, Valo, Kalle wrote:
>>> Ben Greear <greearb@candelatech.com> writes:
>>>
>>>> I found a lot of problems with this code as well, and the 5 patches
>>>> starting from the URL below fixed the issues for me.
>>>>
>>>> They are stuck as 'NA' in patchwork, but I don't know why.
>>>>
>>>> http://lists.infradead.org/pipermail/ath10k/2016-April/007218.html
>>>
>>> ath10k has a separate patchwork instance, did you look at the correct
>>> one? I have quite a lot of patches from you in deferred state because of
>>> the patch bomb, but I'm hoping to go through them soon.
>>>
>>> https://patchwork.kernel.org/project/ath10k/list/?state=10&delegate=25621&order=date
>>
>> Ok, they are deferred then.
>>
>> The series of 5 is likely quite useful and fixes some nasty bugs,
>> and the first patch of the big bomb is also a trivial crash fix
>> for a regression you added (as best as I can tell).
>>
>> The rest of the patch bomb is less critical, but making some progress on
>> that would make me feel good about working on ath10k patches again.
>
> If I get a big patchset like 25 patches it immediately goes to the
> bottom of the queue. Organising them a bit better takes like 15 minutes
> of your time and makes it a lot easier to review. For example, you could
> have split the patches into three sets: important bug fixes, firmware
> debugging and the rest. That helps everyone and saves time.
The first 5 were posted a month earlier than the 25 patchset, and are
bug fixes. Whatever reason you ignored them, it wasn't because
there were 25 patches from me on the list at the time.
The second big series has the first patch as bug-fix, and clearly noted
in the 0000 description. Grab it, and save the rest for later.
I'll be happy to re-work the big patch-set, but there is a lot of churn in
ath10k, and waiting months before applying patches means they rot and makes
more work for everyone. Let's get these 6 bug-fixes in, and then I'll rebase,
test, and post a smaller patch-set for consideration.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
[not found] ` <CAHbf0-GT0y1pEs-ToxbPAf+aRo7TNAyV_Emies_rjL27R1fk2A@mail.gmail.com>
@ 2016-06-08 15:52 ` Rajkumar Manoharan
2016-06-08 17:41 ` Mike Lothian
0 siblings, 1 reply; 14+ messages in thread
From: Rajkumar Manoharan @ 2016-06-08 15:52 UTC (permalink / raw)
To: Mike Lothian
Cc: Ben Greear, Manoharan, Rajkumar, Valo, Kalle, ath10k,
linux-wireless
On 2016-06-02 23:03, Mike Lothian wrote:
> I've just tried those two changes, the machine now locks up before X
> has even started
>
Mike,
Sorry for the delay. I found the root cause of the deadlock. Can you please
try the change below?
diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
index 3b35c7ab5680..80e645302b54 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -1905,7 +1905,6 @@ static void ath10k_htt_rx_in_ord_ind(struct ath10k *ar, struct sk_buff *skb)
 			return;
 		}
 	}
-	ath10k_htt_rx_msdu_buff_replenish(htt);
 }
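
The thread does not spell out the root cause, so the following is only a sketch under an assumption: that ath10k_htt_rx_in_ord_ind() runs with the rx ring lock already held and that the replenish helper tries to take the same non-recursive lock. The toy lock and every function name below are hypothetical. The model is single-threaded and reports the second acquire as a failure at the point where a real spinlock would spin forever.

```c
#include <stdbool.h>

/* Toy non-recursive lock (hypothetical, single-threaded model). */
struct toy_lock {
	bool held;
};

/* Returns false where a real spinlock would spin forever (self-deadlock). */
bool toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

void toy_lock_release(struct toy_lock *l)
{
	l->held = false;
}

/* Stand-in for the replenish helper: it takes the ring lock itself. */
bool replenish(struct toy_lock *ring_lock)
{
	if (!toy_lock_acquire(ring_lock))
		return false;            /* would deadlock */
	/* ... refill rx buffers here ... */
	toy_lock_release(ring_lock);
	return true;
}

/* Stand-in for an rx handler that is entered with ring_lock held.
 * The flag selects the assumed pre-fix behaviour (replenish inside). */
bool rx_handler_locked(struct toy_lock *ring_lock, bool replenish_inside)
{
	return replenish_inside ? replenish(ring_lock) : true;
}

/* Caller: takes the lock, runs the handler, replenishes after unlocking. */
bool process_rx(struct toy_lock *ring_lock, bool replenish_inside)
{
	bool ok;

	if (!toy_lock_acquire(ring_lock))
		return false;
	ok = rx_handler_locked(ring_lock, replenish_inside);
	toy_lock_release(ring_lock);
	return ok && replenish(ring_lock);
}
```

In this model, process_rx(&lock, true) fails because the handler re-acquires a lock its caller holds, while process_rx(&lock, false) succeeds because replenishing happens only after the lock is dropped. Where the real driver replenishes after this change is not shown in the diff above.
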
-Rajkumar
* Re: Bug 119151 - [regression] ath10k no longer authenitcates and freezes system
2016-06-08 15:52 ` Rajkumar Manoharan
@ 2016-06-08 17:41 ` Mike Lothian
0 siblings, 0 replies; 14+ messages in thread
From: Mike Lothian @ 2016-06-08 17:41 UTC (permalink / raw)
To: Rajkumar Manoharan
Cc: Ben Greear, Manoharan, Rajkumar, Valo, Kalle, ath10k,
linux-wireless
Hi
Yes that fixes things locally
Thanks
Mike
On 8 June 2016 at 16:52, Rajkumar Manoharan <rmanohar@codeaurora.org> wrote:
> On 2016-06-02 23:03, Mike Lothian wrote:
>>
>> I've just tried those two changes, the machine now locks up before X
>> has even started
>>
> Mike,
>
> Sorry for the delay. I found the root cause of the deadlock. Can you please
> try the change below?
>
> diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
> index 3b35c7ab5680..80e645302b54 100644
> --- a/drivers/net/wireless/ath/ath10k/htt_rx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
> @@ -1905,7 +1905,6 @@ static void ath10k_htt_rx_in_ord_ind(struct ath10k *ar, struct sk_buff *skb)
>  			return;
>  		}
>  	}
> -	ath10k_htt_rx_msdu_buff_replenish(htt);
>  }
>
>
> -Rajkumar
end of thread, other threads:[~2016-06-08 17:41 UTC | newest]
Thread overview: 14+ messages
2016-06-02 13:52 Bug 119151 - [regression] ath10k no longer authenitcates and freezes system Valo, Kalle
2016-06-02 14:24 ` Valo, Kalle
2016-06-02 15:21 ` Ben Greear
2016-06-02 15:26 ` Valo, Kalle
2016-06-02 15:32 ` Ben Greear
2016-06-03 15:52 ` Valo, Kalle
2016-06-03 16:12 ` Ben Greear
2016-06-02 15:34 ` Mohammed Shafi Shajakhan
2016-06-02 17:03 ` Manoharan, Rajkumar
2016-06-02 17:23 ` Ben Greear
2016-06-02 17:41 ` Rajkumar Manoharan
2016-06-02 18:02 ` Ben Greear
[not found] ` <CAHbf0-GT0y1pEs-ToxbPAf+aRo7TNAyV_Emies_rjL27R1fk2A@mail.gmail.com>
2016-06-08 15:52 ` Rajkumar Manoharan
2016-06-08 17:41 ` Mike Lothian