linux-wireless.vger.kernel.org archive mirror
* [REGRESSION] ath10k: failed to flush transmit queue
From: Cedric Veilleux @ 2024-07-12  2:23 UTC
  To: linux-wireless

AP mode.
Both 2.4 GHz and 5 GHz channels.

Using WLE600VX (QCA986x/988x), we are seeing the following errors in
kernel logs:

[12978.022077] ath10k_pci 0000:04:00.0: failed to flush transmit queue (skip 0 ar-state 1): 0
[13343.069189] ath10k_pci 0000:04:00.0: failed to flush transmit queue (skip 0 ar-state 1): 0

They are somewhat random but frequent. Can happen once a day or many
times per hour.

They are associated with 3-4 seconds of radio silence and full packet
loss. Then everything resumes normally: STAs are still associated and
traffic resumes.
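
To quantify how often the warning occurs, the kernel log can be scanned for it. The following is a minimal sketch, not part of the original report: the regular expression is an assumption based only on the two log lines quoted above, and the helper names are hypothetical. It does not require the "(skip ...)" tail, since that part may be wrapped onto a separate line.

```python
import re

# Pattern derived from the two warnings quoted in this report (an assumption,
# not taken from the driver source).
FLUSH_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+ath10k_pci\s+(?P<dev>\S+):\s+"
    r"failed to flush transmit queue"
)

def flush_failures(log_text):
    """Return (timestamp_seconds, pci_device) for each flush warning."""
    return [(float(m.group("ts")), m.group("dev"))
            for m in FLUSH_RE.finditer(log_text)]

def gaps(events):
    """Seconds between consecutive warnings."""
    times = [ts for ts, _ in events]
    return [round(later - earlier, 6) for earlier, later in zip(times, times[1:])]

SAMPLE = """\
[12978.022077] ath10k_pci 0000:04:00.0: failed to flush transmit queue
(skip 0 ar-state 1): 0
[13343.069189] ath10k_pci 0000:04:00.0: failed to flush transmit queue
(skip 0 ar-state 1): 0
"""

if __name__ == "__main__":
    events = flush_failures(SAMPLE)
    print(events)
    print(gaps(events))
```

For the two sample lines, the gap between warnings is about 365 seconds.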

I have tested the following kernel versions:

6.1.97: stable (tested for many days on 10+ access points)
6.2.16: stable (tested for a few hours on a single machine)
6.3.13: stable (tested for a few hours on a single machine)

6.4.16: unstable  (we have errors within an hour)
6.5.13: unstable  (we have errors within an hour)
6.6.39: unstable  (we have errors within an hour)
6.7.12: unstable  (we have errors within an hour)
6.8.10: unstable  (we have errors within an hour)
6.9.7: unstable  (we have errors within an hour)

From these tests I believe something changed in the 6.4 series, causing
instabilities and the dreaded "failed to flush transmit queue" error.
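
The test results above narrow the regression to a window between two releases, which is the range a bisect would start from. A small sketch of that reasoning follows; the version-to-result table is copied from the list above, and the function names are hypothetical.

```python
# Results copied from the list above: kernel version -> observed behaviour.
RESULTS = {
    "6.1.97": "stable",
    "6.2.16": "stable",
    "6.3.13": "stable",
    "6.4.16": "unstable",
    "6.5.13": "unstable",
    "6.6.39": "unstable",
    "6.7.12": "unstable",
    "6.8.10": "unstable",
    "6.9.7": "unstable",
}

def version_key(version):
    """Sort key so that 6.9.7 orders numerically, not as a string."""
    return tuple(int(part) for part in version.split("."))

def regression_window(results):
    """Return (last_good, first_bad); raise if results are not monotonic."""
    last_good = None
    first_bad = None
    for version in sorted(results, key=version_key):
        if results[version] == "stable":
            if first_bad is not None:
                raise ValueError(f"{version} is stable but {first_bad} is not")
            last_good = version
        elif first_bad is None:
            first_bad = version
    return last_good, first_bad

if __name__ == "__main__":
    print(regression_window(RESULTS))
```

With the data above, the window is 6.3.13 (last good) to 6.4.16 (first bad), consistent with a change introduced during the 6.4 series.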

This is a custom Linux distribution. The only change is the kernel. All
other packages are the same versions. Everything is rebuilt from source
using bitbake/yocto, with the same linux-firmware files.


Module initialization log output:

[    9.335682] ath10k_pci 0000:04:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[    9.543221] ath10k_pci 0000:04:00.0: qca988x hw2.0 target 0x4100016c chip_id 0x043222ff sub 0000:0000
[    9.543270] ath10k_pci 0000:04:00.0: kconfig debug 1 debugfs 0 tracing 0 dfs 1 testmode 0
[    9.544296] ath10k_pci 0000:04:00.0: firmware ver 10.2.4-1.0-00047 api 5 features no-p2p,raw-mode,mfp,allows-mesh-bcast crc32 35bd9258
[    9.603583] ath10k_pci 0000:04:00.0: board_file api 1 bmi_id N/A crc32 bebc7c08
[   10.985663] ath10k_pci 0000:04:00.0: htt-ver 2.1 wmi-op 5 htt-op 2 cal otp max-sta 128 raw 0 hwcrypto 1

This is followed by hostapd starting and the "failed to flush transmit
queue" errors within an hour.


If there is a way to debug this further and collect more information, please let me know.


Regards,
Cedric


* Re: [REGRESSION] ath10k: failed to flush transmit queue
From: Felix Fietkau @ 2024-07-12  8:08 UTC
  To: Cedric Veilleux, linux-wireless

On 12.07.24 04:23, Cedric Veilleux wrote:
> AP mode.
> Both 2.4 and 5ghz channels.
> 
> Using WLE600VX (QCA986x/988x), we are seeing the following errors in
> kernel logs:
> 
> [12978.022077] ath10k_pci 0000:04:00.0: failed to flush transmit queue
> (skip 0 ar-state 1): 0
> [13343.069189] ath10k_pci 0000:04:00.0: failed to flush transmit queue
> (skip 0 ar-state 1): 0
> 
> They are somewhat random but frequent. Can happen once a day or many
> times per hour.
> 
> They are associated with 3-4 seconds of radio silence. Full packet
> loss. Then everything resumes normally, STA are still associated and
> traffic resumes.
> 
> I have tested with major kernel versions:
> 
> 6.1.97: stable (tested for many days on 10+ access points)
> 6.2.16: stable (tested for few hours single machine)
> 6.3.13: stable (tested for few hours single machine)
> 
> 6.4.16: unstable  (we have errors within an hour)
> 6.5.13: unstable  (we have errors within an hour)
> 6.6.39: unstable  (we have errors within an hour)
> 6.7.12: unstable  (we have errors within an hour)
> 6.8.10: unstable  (we have errors within an hour)
> 6.9.7: unstable  (we have errors within an hour)
> 
>  From these tests I believe something changed in 6.4 series causing
> instabilities and the dreaded "failed to flush transmit queue" error.
> 
> This is a custom linux distribution. Only change is the kernel. All
> other packages are same versions. Everything rebuilt from source using
> bitbake/yocto. Same linux-firmware files.

I'm pretty sure it's caused by this commit:

commit 0b75a1b1e42e07ae84e3a11d2368b418546e2bec
Author: Johannes Berg <johannes.berg@intel.com>
Date:   Fri Mar 31 16:59:16 2023 +0200

     wifi: mac80211: flush queues on STA removal

I guess somebody needs to look into making the queue flush on ath10k 
more reliable (or even better, implement a more lightweight .flush_sta op).

I don't have time to do the work myself, but hopefully this information 
could help somebody else take care of it.

- Felix


* Re: [REGRESSION] ath10k: failed to flush transmit queue
From: Kalle Valo @ 2024-07-31 18:13 UTC
  To: Felix Fietkau; +Cc: Cedric Veilleux, linux-wireless, ath10k

Adding the ath10k list so that everyone sees this.

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


* Re: [REGRESSION] ath10k: failed to flush transmit queue
From: James Prestwood @ 2025-02-20 13:55 UTC
  To: Kalle Valo, Felix Fietkau; +Cc: Cedric Veilleux, linux-wireless, ath10k

Hi All,

I want to revive this thread and provide some additional data. This is
not just something that happens in AP mode, or specifically with the
hardware mentioned. After upgrading from 6.2 to 6.8 we started seeing
this on client devices running the QCA6174 hw 3.2, firmware ver
WLAN.RM.4.4.1-00288- api 6. We see it during disconnects, which isn't as
big a deal; the more concerning case is during roams, where it makes
roams go from less than 200 ms to over 5 seconds.

Based on this report I have tried using Remi's set of patches [1], which
implement flush_sta(), but we end up with the same ~5 second hang, just
in ath10k_flush_sta() instead of ath10k_flush(). I'm unsure if this is a
firmware problem or some race within the driver itself. In the past I
have reduced timeouts [2] to work around these types of things, but it's
really just a band-aid.

I would agree that this was "introduced" by Johannes' commit above, but
the original commit does make sense... This is just an ath10k problem
with flushing the queues.

At this point I'm really left with two options:

  - Revert Johannes' commit that flushes the queues, thereby reducing
security, OR

  - Reduce the timeout from 5 seconds to something more manageable, like
1 second (hopefully someone more in the know can comment here).
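
To illustrate the trade-off in the second option: a flush that waits for pending frames with a fixed timeout can stall the caller for at most that timeout when the queue never drains. The following is a toy model only, not ath10k code; the class and function names are hypothetical, and the timeouts are scaled down so the example runs quickly.

```python
import threading
import time

class TxQueueModel:
    """Toy stand-in for a driver's pending-TX counter (not ath10k code)."""

    def __init__(self, pending):
        self._pending = pending
        self._cond = threading.Condition()

    def complete_one(self):
        """Simulate one TX completion arriving from the device."""
        with self._cond:
            if self._pending > 0:
                self._pending -= 1
            if self._pending == 0:
                self._cond.notify_all()

    def flush(self, timeout_s):
        """Wait until nothing is pending or timeout_s elapses.

        Returns True if the queue drained, False on timeout.
        """
        with self._cond:
            return self._cond.wait_for(lambda: self._pending == 0,
                                       timeout=timeout_s)

def timed_flush(queue, timeout_s):
    """Return (drained, seconds_spent_waiting)."""
    start = time.monotonic()
    drained = queue.flush(timeout_s)
    return drained, time.monotonic() - start

if __name__ == "__main__":
    # A queue that never drains blocks for the whole timeout.
    stuck = TxQueueModel(pending=1)
    print(timed_flush(stuck, 0.2))

    # A queue that drains returns as soon as the completion arrives.
    draining = TxQueueModel(pending=1)
    threading.Timer(0.05, draining.complete_one).start()
    print(timed_flush(draining, 0.2))
```

In this model a shorter timeout only shortens the stall; it does not explain or fix why the queue fails to drain.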

Has anyone else looked at this regression, or maybe found some workaround
other than my options above?

Thanks,

James

[1] https://lore.kernel.org/linux-wireless/17d26d6a3e80ff03939ee7935fdc07f979b61a4f.1732293922.git.repk@triplefau.lt/

[2] https://lore.kernel.org/linux-wireless/20240814164507.996303-2-prestwoj@gmail.com/
