From: James Prestwood <prestwoj@gmail.com>
To: Martin Petzold <martin.petzold@tavla.de>
Cc: iwd@lists.linux.dev, Arend Van Spriel <arend.vanspriel@broadcom.com>
Subject: Re: Connection loss (IWD HEAD with latest OWE / BSS selection patches) - brcmfmac driver
Date: Tue, 12 Nov 2024 04:13:23 -0800
Message-ID: <a2ef307c-e0a9-438f-9bc4-1076b7bc5c1c@gmail.com>
In-Reply-To: <7fa8c348-9834-4a63-a700-eb3117c4ae89@tavla.de>

Hi Martin,

On 11/12/24 1:15 AM, Martin Petzold wrote:
> Dear James,
>
> Am 05.11.24 um 16:16 schrieb Martin Petzold:
>> Dear James, dear Arend,
>>
>> Am 05.11.24 um 14:14 schrieb James Prestwood:
>>> Hi Martin,
>>>
>>> On 11/4/24 3:20 PM, James Prestwood wrote:
>>>> Hi Martin,
>>>>
>>>> On 11/4/24 2:42 PM, Martin Petzold wrote:
>>>>> Dear James,
>>>>>
>>>>> Am 04.11.24 um 13:36 schrieb James Prestwood:
>>>>>>
>>>>>> On 11/3/24 3:13 PM, Martin Petzold wrote:
>>>>>>> Dear James,
>>>>>>>
>>>>>>> Am 25.10.24 um 17:17 schrieb James Prestwood:
>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I open a new thread for this one: During the last weeks 
>>>>>>>>>>>>>>>> I have seen connection losses for 30+ minutes, 
>>>>>>>>>>>>>>>> sometimes even hours or just now even forever (IWD HEAD 
>>>>>>>>>>>>>>>> with v2 OWE / BSS selection patches). Driver is 
>>>>>>>>>>>>>>>> brcmfmac (NXP 6.1.36 kernel) and chip is BCM4339 (Laird 
>>>>>>>>>>>>>>>> LWB5).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It happens in a) single router environment (WPA2-PSK; 
>>>>>>>>>>>>>>>> Touchstone TG3442DE), and b) router + repeater 
>>>>>>>>>>>>>>>> environment (WPA2 CCMP; Fritz!Box + Fritz!Repeater), 
>>>>>>>>>>>>>>>> and maybe also in the WPA3 OWE Transition network 
>>>>>>>>>>>>>>>> (yesterday lost a connection again).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have now again lost 2 of 10 devices in the WPA3 OWE network 
>>>>>>>>>>>>>>> (with roaming). However, they no longer all disappear after a 
>>>>>>>>>>>>>>> short while; it seems to happen later now.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also lost one device in a Router+Repeater WPA2 (CCMP) 
>>>>>>>>>>>>>>> network. The router side confirms that the device has been 
>>>>>>>>>>>>>>> disconnected for more than a day.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We can't do anything without logs. If you suspect it's the 
>>>>>>>>>>>>>> blacklist you can lower the blacklist time in 
>>>>>>>>>>>>>> main.conf:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [
>>>>>>>
>>>>>>> I am still losing devices. Sometimes they come back again, but 
>>>>>>> mostly do not re-connect. I have observed the following:
>>>>>>>
>>>>>>> - Connection exists for several hours until about one day, or 
>>>>>>> two. Then gone for several hours or mostly forever.
>>>>>>> - For FritzBox+FritzRepeater I have seen the connection coming 
>>>>>>> back after like a day (here connection loss was also confirmed 
>>>>>>> on router side!)
>>>>>>> - For the Aruba enterprise environment the connection never came 
>>>>>>> back (until now no AP logs - waiting for an answer)
>>>>>>> - After reboot the connection comes back
>>>>>>> - It occurs only in an environment with multiple APs with same 
>>>>>>> SSID (i.e. roaming environment), however my single AP 
>>>>>>> environments have all strong signal
>>>>>>> - Some devices with identical configuration in this environment 
>>>>>>> DO NOT get lost, those seem to have quite strong signal (maybe 
>>>>>>> they don't roam)
>>>>>>> - Other devices in the same environment work without any 
>>>>>>> problems (Intel+NetworkManager) and the APs are Aruba enterprise 
>>>>>>> grade
>>>>>>> - I see almost the same in the Aruba enterprise environment, but 
>>>>>>> ALSO in a FritzBox + FritzRepeater environment
>>>>>>> - We had a bug in our web socket connection, causing too many IWD 
>>>>>>> requests. However, this was fixed. And why are all the other 
>>>>>>> devices okay? Maybe a coincidence between roaming and something 
>>>>>>> related to dropping and re-connecting the web socket connection.
>>>>>>>
>>>>>>> Please find attached my currently available debug logs (they are 
>>>>>>> a few days old, but I am quite sure this is the connection loss 
>>>>>>> situation). These logs are from the FritzBox+FritzRepeater 
>>>>>>> environment. There are no brcmfmac messages (but also no special 
>>>>>>> debug level configured here)!
>>>>>>>
>>>>>>> I have now also disabled WiFi power saving and will deploy to 
>>>>>>> the environment... hoping for the best.
>>>>>>>
>>>>>>> Maybe you could check the logs and have an idea?
>>>>>>
>>>>>> Looks like the same thing as the last logs you sent. IWD tries to 
>>>>>> connect (sends CMD_CONNECT to the kernel) but gets no associated 
>>>>>> CMD_CONNECT event after that which causes IWD to wait 
>>>>>> indefinitely for that event. This, again, appears to be a driver 
>>>>>> problem because it's expected that the kernel tells userspace the 
>>>>>> result of the CMD_CONNECT request.
>>>>>>
>>>>>> Only similarity I can see between the two sets of logs is there 
>>>>>> is a failed connection just prior to the hang. IWD then attempts 
>>>>>> to connect again but the 4-way handshake is never started and 
>>>>>> this results in a failure with status 16 (group key handshake 
>>>>>> timeout). In your latest set of logs IWD actually again tries to 
>>>>>> connect to a different BSS and gets status 16 before trying yet 
>>>>>> again and hanging.
>>>>>>
>>>>>> This actually seems similar to an issue I encountered with ath10k 
>>>>>> where the network interface would time out being brought up. 
>>>>>> Retrying would succeed but the driver would be in a similar state 
>>>>>> where IWD could authenticate/associate but no data frames (i.e. 
>>>>>> 4-way handshake) would be passed to userspace. Only solution 
>>>>>> (until upstream fixed the bug) was to unload/reload the driver 
>>>>>> when we detected this condition.
>>>>>>
>>>>>> If you are able to physically attach to a device currently in 
>>>>>> this state you may be able to get more info. For example if IWD 
>>>>>> is stuck like this try disconnecting/reconnecting with iwctl or 
>>>>>> restarting IWD to see what happens. If you end up in the same 
>>>>>> state right away I'm 99.9% sure the driver is the entire reason 
>>>>>> you're running into this.
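
The manual checks suggested above could be scripted roughly as follows. This is only a sketch: the interface name "wlan0", the SSID, and the systemd unit name "iwd" are assumptions, and the network is assumed to be already known to IWD so that no passphrase prompt appears. Reconnecting and restarting are alternatives in the text above; the sketch simply lists both in order. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def recovery_commands(device, ssid):
    """Build the command sequence: disconnect, reconnect, then restart IWD.

    `device` and `ssid` are placeholders supplied by the caller.
    """
    return [
        ["iwctl", "station", device, "disconnect"],
        ["iwctl", "station", device, "connect", ssid],
        # Assumption: IWD runs as the systemd unit "iwd".
        ["systemctl", "restart", "iwd"],
    ]


def run_steps(commands, dry_run=True):
    """Return the shell form of each command; execute it unless dry_run."""
    lines = []
    for cmd in commands:
        lines.append(shlex.join(cmd))
        if not dry_run:
            # check=False: a failing step is itself useful diagnostic output.
            subprocess.run(cmd, check=False)
    return lines


if __name__ == "__main__":
    for line in run_steps(recovery_commands("wlan0", "ExampleSSID")):
        print(line)
```

If the device ends up in the same hung state right after these steps, that points at the driver rather than at IWD.
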
>>>>>
>>>>> Are you sure? Maybe you could double-check?
>>>> I'm sure.
>>>>>
>>>>> Because my SOM vendor (Variscite) selling a few hundred thousand 
>>>>> of these do not report any issue with this kernel, firmware, and 
>>>>> NetworkManager (wpa_supplicant)...
>>>>
>>>> Because wpa_supplicant sets internal timers for these commands in 
>>>> case the driver is broken. I would expect you would see this exact 
>>>> behavior with wpa_supplicant, it would just disconnect/reconnect 
>>>> after 5 seconds of no response from the kernel. And something like 
>>>> this either a) goes entirely unnoticed and/or b) works well enough 
>>>> for a hardware vendor to ship it to customers and not care.
>>>>
>>>> This is the commit adding these timers to wpa_supplicant:
>>>>
>>>> commit e29853bbff1eef781099a9108e3b51f26b477ac3
>>>> Author: Ben Greear <greearb@candelatech.com>
>>>> Date:   Thu Feb 24 16:59:46 2011 +0200
>>>>
>>>>     SME: Add timers for authentication and asscoiation
>>>>
>>>>     mac80211 authentication or association operation may get stuck for some
>>>>     reasons, so wpa_supplicant better use an internal timer to recover from
>>>>     this.
>>>>
>>>>     Signed-off-by: Ben Greear <greearb@candelatech.com>
>>>>
>>>> I wish it surprised me that 13 years later this behavior still 
>>>> happens... We don't like adding special driver workarounds like 
>>>> this in IWD because a) it becomes difficult to maintain and b) it 
>>>> just hides the root cause and nobody ever fixes it. But my opinions 
>>>> aside, for a driver like brcmfmac which is very mainstream, I guess 
>>>> we have no choice but to adapt IWD to work around it like 
>>>> wpa_supplicant does.
>>>>
>>>> Thanks,
>>>>
>>>> James
>>>
>>> I've sent a patch to the list which sets a timer within IWD in case 
>>> the connect event never arrives. Note that I cannot test this beyond 
>>> manually commenting out code to "trick" IWD into thinking this 
>>> happens. Applying that patch to the brcmfmac client you're using is 
>>> going to be the true test.
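
To illustrate the idea of such a timer (this is not the actual IWD patch, and not wpa_supplicant's code), here is a minimal Python sketch. All names in it (connect_with_watchdog, responsive_driver, stuck_driver) are hypothetical, and the timeout values are arbitrary. The caller issues the connect request, then waits for the result event only up to a deadline instead of waiting indefinitely.

```python
import asyncio


async def connect_with_watchdog(send_connect, timeout_s):
    """Issue a connect request and wait at most timeout_s for its event.

    send_connect(event) stands in for sending CMD_CONNECT; the (simulated)
    driver is expected to set `event` once it has a result to report.
    """
    result_event = asyncio.Event()
    send_connect(result_event)
    try:
        await asyncio.wait_for(result_event.wait(), timeout=timeout_s)
        return "event"    # the driver reported a result in time
    except asyncio.TimeoutError:
        return "timeout"  # no result: caller can disconnect and retry


def responsive_driver(event):
    # Simulated well-behaved driver: reports a result shortly afterwards.
    asyncio.get_running_loop().call_later(0.01, event.set)


def stuck_driver(event):
    # Simulated broken driver: never reports anything.
    pass


if __name__ == "__main__":
    print(asyncio.run(connect_with_watchdog(responsive_driver, 0.5)))
    print(asyncio.run(connect_with_watchdog(stuck_driver, 0.05)))
```

On a timeout the caller gets control back and can treat the attempt as failed, which is the recovery path that is missing when the connect event never arrives.
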
>>
>> I may be able to test it, however only together with wifi power save 
>> disabled (I also prepared NetworkManager branch, because the customer 
>> will kill me otherwise).
>>
>> @Arend: Maybe you could check all this, as it seems to be related to 
>> some brcmfmac state. Right now I cannot provide logs, and your debug 
>> level for brcmfmac produces a lot of data, which I somehow also 
>> need to handle (limited space).
>
> With this patch and power save disabled, we have a better connection, 
> but still carrier losses several times a day (sometimes minutes, 
> sometimes hours). We think this is still due to driver / daemon. We 
> have checked AP logs and the infrastructure is Aruba. We will try to 
> get some more debug logs from that environment.
>
> My feeling is that there could still be a corner case 
> related to BSS selection in WPA3 OWE. Or it is about failed roaming 
> attempts by the driver.
>
> We will now a) upgrade to 6.6x kernel (driver) and b) add another wifi 
> chip (NXP IW611) to our board and test that one too.
>
> @James: Do you know if NXP IW611 works well with IWD?
Sorry, I have no experience with that chipset.
>
> Thanks,
>
> Martin
>
>
