From: James Prestwood <prestwoj@gmail.com>
To: Martin Petzold <martin.petzold@tavla.de>
Cc: iwd@lists.linux.dev
Subject: Re: Connection loss (IWD HEAD with latest OWE / BSS selection patches) - brcmfmac driver
Date: Fri, 25 Oct 2024 06:18:49 -0700 [thread overview]
Message-ID: <e19e7527-9cc1-4c84-95ac-dcb39f836d58@gmail.com> (raw)
In-Reply-To: <57826e6a-8466-4e45-8906-6cb15968bcc8@tavla.de>
On 10/25/24 6:11 AM, Martin Petzold wrote:
> Hi James,
>
> Am 25.10.24 um 14:54 schrieb James Prestwood:
>> Hi Martin,
>>
>> On 10/25/24 5:28 AM, Martin Petzold wrote:
>>> Hi James,
>>>
>>> Am 25.10.24 um 13:48 schrieb James Prestwood:
>>>> Hi Martin,
>>>>
>>>> On 10/25/24 4:10 AM, Martin Petzold wrote:
>>>>> Hi all,
>>>>>
>>>>> Am 25.10.24 um 12:12 schrieb Martin Petzold:
>>>>>> Hi all,
>>>>>>
>>>>>> I open a new thread for this one: During the last weeks I have
>>>>>> seen connection losses for 30+ minutes, sometimes even hours or
>>>>>> just now even forever (IWD HEAD with v2 OWE / BSS selection
>>>>>> patches). Driver is brcmfmac (NXP 6.1.36 kernel) and chip is
>>>>>> BCM4339 (Laird LWB5).
>>>>>>
>>>>>> It happens in a) single router environment (WPA2-PSK; Touchstone
>>>>>> TG3442DE), and b) router + repeater environment (WPA2 CCMP;
>>>>>> Fritz!Box + Fritz!Repeater), and maybe also in the WPA3 OWE
>>>>>> Transition network (yesterday lost a connection again).
>>>>>
>>>>> I lost now again 2 of 10 devices in the WPA3 OWE network (with
>>>>> roaming). However, now they don't disappear all after a shorter
>>>>> while. It seems to be later.
>>>>>
>>>>> I also lost one device in a Router+Repeater WPA2 (CCMP) network.
>>>>> It is confirmed here on router side, that the device is
>>>>> disconnected. Since more than a day.
>>>>
>>>> We can't do anything without logs. If you suspect its the blacklist
>>>> you can lower the blacklist time down in main.conf:
>>>>
>>>> [Blacklist]
>>>> MaximumTimeout=5
>>>>
>>>> But logs would be best, we would be able to see if that is whats
>>>> happening or if its something else.
>>>>
>>>>
>>> I can also now see this on one of our local devices (which was
>>> connected to Ethernet and WiFi).
>>>
>>> Please find attached the log. The logs start at late 23rd with the
>>> image upgrade.
>>>
>>> I can see on our server side, that devices continuously connect and
>>> disconnect again (missed heartbeat = hard loss of websocket
>>> connection). I see some type of connection loop on client side too
>>> ("org.eclipse.jetty.websocket.api.UpgradeException: 0 null" caused
>>> by "java.util.concurrent.TimeoutException: DNS timeout 15000 ms").
>>> Maybe it alternates between Ethernet and WiFi. You will understand
>>> better. Or the network is re-configured all the time.
>>>
>>> Here is my local status:
>>>
>>> tavla@tavla:~$ iwctl station wlan0 show
>>> Station: wlan0
>>> --------------------------------------------------------------------------------
>>>
>>> Settable Property Value
>>> --------------------------------------------------------------------------------
>>>
>>> Scanning no
>>> State connecting
>>> Connected network XYZ
>>> No IP addresses Is DHCP client configured?
>>>
>>> tavla@tavla:~$ networkctl status
>>> ● State: routable
>>> Online state: unknown
>>> Address: 192.168.178.178 on eth0
>>> 2a0a:a549:da80:0:fadc:7aff:fe67:2e4 on eth0
>>> fe80::fadc:7aff:fe67:2e4 on eth0
>>> fe80::c2ee:40ff:fe8a:dd62 on wlan0
>>> Gateway: 192.168.178.1 on eth0
>>> fe80::3a10:d5ff:fe37:2c79 on eth0
>>> DNS: 192.168.178.1
>>> NTP: 192.168.178.1
>>> IDX LINK TYPE OPERATIONAL SETUP
>>> 1 lo loopback carrier unmanaged
>>> 2 eth0 ether routable configured
>>> 3 wlan0 wlan no-carrier unmanaged
>>>
>>> 3 links listed.
>>
>> Yep, IWD is ultimately hung up waiting for a connect event from the
>> kernel/driver. A few things I noticed prior to that:
>>
>> 1. Not really critical but you have an agent connecting and
>> disconnecting multiple times a second. Are you polling info with
>> iwctl or something?
>>
>> 2. I see this really odd local disconnect with reason code 0. This is
>> likely something from brcmfmac because I doubt the wireless subsystem
>> would use a "success" reason code in a local deauth:
>>
>> Okt 24 06:05:00 tavla iwd[384]: Received Deauthentication event,
>> reason: 0, from_ap: false
>>
>> 3. You had another disconnect from the AP, but IWD reconnected:
>>
>> Okt 24 06:06:12 tavla iwd[384]: Received Deauthentication event,
>> reason: 6, from_ap: true
>>
>> 4. You then got a disconnect from the AP yet again, reason code 3:
>>
>> Okt 25 06:13:21 tavla iwd[384]: Received Deauthentication event,
>> reason: 3, from_ap: true
>>
>> # You got disconnected by the AP here
>
> I stopped my application (all processes gone!). And then tried a
> manual re-connect via iwctl:
>
> -----
>
> tavla@tavla:~$ iwctl station wlan0 connect "XYZ"
>
> Okt 25 15:07:06 tavla iwd[384]: src/agent.c:agent_register() agent
> register called
> Okt 25 15:07:06 tavla iwd[384]: src/agent.c:agent_register() agent
> :1.13733 path /agent/204414
> Okt 25 15:07:06 tavla iwd[384]: src/network.c:network_connect()
> Okt 25 15:07:06 tavla iwd[384]: src/agent.c:agent_disconnect() agent
> :1.13733 disconnected
> Okt 25 15:07:06 tavla iwd[384]: src/agent.c:agent_free() agent free
> 0xaaaae76e8b30
>
> tavla@tavla:~$ iwctl station wlan0 show
> Station: wlan0
> --------------------------------------------------------------------------------
>
> Settable Property Value
> --------------------------------------------------------------------------------
>
> Scanning no
> State connecting
> Connected network XYZ
> No IP addresses Is DHCP client configured?
>
> tavla@tavla:~$ iwctl station wlan0 disconnect
> Operation failed
>
> Okt 25 15:08:56 tavla iwd[384]: src/agent.c:agent_register() agent
> register called
> Okt 25 15:08:56 tavla iwd[384]: src/agent.c:agent_register() agent
> :1.13740 path /agent/204425
> Okt 25 15:08:56 tavla iwd[384]: src/station.c:station_dbus_disconnect()
> Okt 25 15:08:56 tavla iwd[384]:
> src/station.c:station_reset_connection_state() 3
> Okt 25 15:08:56 tavla iwd[384]:
> src/station.c:station_roam_state_clear() 3
> Okt 25 15:08:56 tavla iwd[384]: event: state, old: connecting (auto),
> new: disconnecting
> Okt 25 15:08:56 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_done()
> Work item 19 done
> Okt 25 15:08:56 tavla iwd[384]: src/station.c:station_connect_cb() 3,
> result: 5
> Okt 25 15:08:56 tavla iwd[384]: src/station.c:station_disconnect_cb()
> 3, success: 0
> Okt 25 15:08:56 tavla iwd[384]: event: state, old: disconnecting, new:
> disconnected
> Okt 25 15:08:56 tavla iwd[384]: src/agent.c:agent_disconnect() agent
> :1.13740 disconnected
> Okt 25 15:08:56 tavla iwd[384]: src/agent.c:agent_free() agent free
> 0xaaaae76eaa50
Yes, the driver seems completely stuck. I would recommend
unloading/reloading it with modprobe.
>
> -----
>
>>
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_disconnect_event() 3
>> Okt 25 06:13:21 tavla iwd[384]: event: disconnect-info, reason: 3
>> Okt 25 06:13:21 tavla iwd[384]: src/station.c:station_disassociated() 3
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_reset_connection_state() 3
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_roam_state_clear() 3
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/netconfig.c:netconfig_event_handler() l_netconfig event 2
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/netconfig-commit.c:netconfig_commit_print_addrs() removing
>> address: 192.168.178.113
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/resolve.c:resolve_systemd_revert() ifindex: 3
>> Okt 25 06:13:21 tavla iwd[384]: event: state, old: connected, new:
>> disconnected
>> Okt 25 06:13:21 tavla iwd[384]: event: state, old: disconnected, new:
>> autoconnect_quick
>> Okt 25 06:13:21 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_insert()
>> Inserting work item 17
>> Okt 25 06:13:21 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_next()
>> Starting work item 17
>> Okt 25 06:13:21 tavla iwd[384]: src/wiphy.c:wiphy_reg_notify()
>> Notification of command Reg Change(36)
>> Okt 25 06:13:21 tavla iwd[384]: src/netdev.c:netdev_link_notify()
>> event 16 on ifindex 3
>> Okt 25 06:13:21 tavla iwd[384]: src/scan.c:scan_notify() Scan
>> notification Trigger Scan(33)
>> Okt 25 06:13:21 tavla iwd[384]: src/netdev.c:netdev_link_notify()
>> event 16 on ifindex 3
>> Okt 25 06:13:21 tavla iwd[384]: src/scan.c:scan_request_triggered()
>> Passive scan triggered for wdev 1
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_quick_scan_triggered() Quick scan triggered for
>> wlan0
>> Okt 25 06:13:21 tavla iwd[384]: src/scan.c:scan_notify() Scan
>> notification New Scan Results(34)
>> Okt 25 06:13:21 tavla iwd[384]: src/netdev.c:netdev_link_notify()
>> event 16 on ifindex 3
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_print_scan_bss() Processing BSS
>> '38:10:d5:37:2c:7b' with SSID: XYZ, freq: 2462, rank: 442, strength:
>> -3300, data_rate: 54.0, load: 21/255
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_print_scan_bss() Processing BSS
>> '38:10:d5:7a:45:ca' with SSID: FRITZ!Box 7490, freq: 5580, rank: 16,
>> strength: -8700, data_rate: 2.0, load: 12/255
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_print_scan_bss() Processing BSS
>> '84:47:65:1a:c8:14' with SSID: WLAN-T4MQUD, freq: 5260, rank: 442,
>> strength: -6100, data_rate: 54.0, load: 46/255
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_print_scan_bss() Processing BSS
>> '84:47:65:1a:c8:11' with SSID: WLAN-T4MQUD, freq: 2437, rank: 294,
>> strength: -6700, data_rate: 36.0, load: 63/255
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_autoconnect_start()
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_autoconnect_next() autoconnect: Trying SSID: XYZ
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_autoconnect_next() autoconnect:
>> '38:10:d5:37:2c:7b' freq: 2462, rank: 442, strength: -3300
>> Okt 25 06:13:21 tavla iwd[384]: src/netdev.c:netdev_cqm_rssi_update()
>> Okt 25 06:13:21 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_insert()
>> Inserting work item 18
>> Okt 25 06:13:21 tavla iwd[384]: event: connect-info, ssid: XYZ, bss:
>> 38:10:d5:37:2c:7b, signal: -33, load: 21/255
>> Okt 25 06:13:21 tavla iwd[384]: event: state, old: autoconnect_quick,
>> new: connecting (auto)
>> Okt 25 06:13:21 tavla iwd[384]: src/scan.c:scan_cancel() Trying to
>> cancel scan id 17 for wdev 1
>> Okt 25 06:13:21 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_done()
>> Work item 17 done
>> Okt 25 06:13:21 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_next()
>> Starting work item 18
>> Okt 25 06:13:22 tavla iwd[384]: src/netdev.c:netdev_mlme_notify()
>> MLME notification Connect(46)
>> Okt 25 06:13:22 tavla iwd[384]: src/netdev.c:netdev_connect_event()
>> Okt 25 06:13:22 tavla iwd[384]: src/netdev.c:netdev_connect_event()
>> aborting and ignore_connect_event not set, proceed
>> Okt 25 06:13:22 tavla iwd[384]: src/netdev.c:netdev_connect_event()
>> expect_connect_failure not set, proceed
>> Okt 25 06:13:22 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_done()
>> Work item 18 done
>> Okt 25 06:13:22 tavla iwd[384]: src/station.c:station_connect_cb() 3,
>> result: 2
>> Okt 25 06:13:22 tavla iwd[384]: event: connect-failed, status: 16
>>
>> # AP rejects your next attempt to connect with status 16 (group
>> handshake timeout). Looks like the 4-way handshake never even started
>> so this is kinda odd.
>>
>> Okt 25 06:13:22 tavla iwd[384]: src/netdev.c:netdev_cqm_rssi_update()
>> Okt 25 06:13:22 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_insert()
>> Inserting work item 19
>> Okt 25 06:13:22 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_next()
>> Starting work item 19
>> Okt 25 06:13:22 tavla iwd[384]: event: connect-info, ssid: XYZ, bss:
>> 38:10:d5:37:2c:7b, signal: -33, load: 21/255
>> Okt 25 06:13:22 tavla iwd[384]: src/station.c:station_try_next_bss()
>> Attempting to connect to next BSS 38:10:d5:37:2c:7b
>>
>> # Then IWD tries to connect again and we get no connect even whatsoever.
>>
>> I think brcmfmac got very confused and stuck. There are many cases in
>> IWD where we trust the kernel (and associated driver) will send the
>> events we expect. And in this case that did not happen and IWD waits
>> indefinitely.
>>
>> Unless I'm missing something this seems like a bug in the driver.
>> When IWD issues a CMD_CONNECT call the driver needs to tell us if it
>> succeeded or failed. Not replying at all is going to cause IWD to
>> hang like this.
>>
>> Thanks,
>>
>> James
>>
>>>
>>>>>
>>>>>>
>>>>>> We have 100% reproducable builds, test environments (testing
>>>>>> sites), and web socket-based telemetry. Thus my feeling is, there
>>>>>> is still something wrong with the BSS selection (maybe
>>>>>> blacklisting)! Of course this can also be something about the
>>>>>> driver (brcmfmac) or firmware (roaming?). Or can this be about
>>>>>> IPv6 disabled (my configuration below)?
>>>>>>
>>>>>> This is really really serious, I (still) lose connections to
>>>>>> devices!
>>>>>>
>>>>>> I will investigate further and provide logs later today, if
>>>>>> possible.
>>>>>>
>>>>>> SETUP:
>>>>>>
>>>>>> SoM: Variscite DART-IMX8M-PLUS
>>>>>> SoC: NXP i.MX8M Plus
>>>>>> Wi-Fi Chip: Laird Sterling LWB5 (Infineon CYW43353)
>>>>>> Antenna: Laird FlexPIFA
>>>>>> RootFS: Debian bookworm
>>>>>> Kernel: NXP Kernel 6.1.36 (with Variscite patches)
>>>>>> Driver: brcmfmac (from Kernel, only minor patches - device tree)
>>>>>> Firmware: Laird 12.29.0.22 (loaded via scripts)
>>>>>> Network Daemon: systemd-networkd 252.30
>>>>>> Wi-Fi Daemon: IWD 2.22+ (c459dc75c080dcbff6f9b9389344e11328dec411
>>>>>> + OWE and BSS selection patches) + ELL 0.69
>>>>>> (7bed251f8020ef896043f008276f41cd13fdd43f)
>>>>>>
>>>>>> KERNEL:
>>>>>>
>>>>>> REPOSITORY=https://github.com/tavla/linux-imx.git (fork from
>>>>>> https://github.com/varigit/linux-imx)
>>>>>> BRANCH=lf-6.1.y_var03
>>>>>> COMMIT=a4b3cc1dbbb396284e501b938aa79d2dfd4f1a98
>>>>>>
>>>>>> FIRMWARE:
>>>>>>
>>>>>> summit-lwb5-etsi-firmware-12.29.0.22.tar.bz2
>>>>>>
>>>>>> From:
>>>>>> https://github.com/LairdCP/Sterling-LWB-and-LWB5-Release-Packages/releases/
>>>>>>
>>>>>> IWD
>>>>>>
>>>>>> -----
>>>>>>
>>>>>> [General]
>>>>>> EnableNetworkConfiguration=true
>>>>>>
>>>>>> [Network]
>>>>>> NameResolvingService=systemd
>>>>>> EnableIPv6=false
>>>>>>
>>>>>> [DriverQuirks]
>>>>>> OweDisable=brcmfmac
>>>>>>
>>>>>> -----
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>
next prev parent reply other threads:[~2024-10-25 13:18 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-25 10:12 Connection loss (IWD HEAD with latest OWE / BSS selection patches) - brcmfmac driver Martin Petzold
2024-10-25 11:10 ` Martin Petzold
2024-10-25 11:48 ` James Prestwood
2024-10-25 12:01 ` Martin Petzold
2024-10-25 12:28 ` Martin Petzold
2024-10-25 12:33 ` Martin Petzold
2024-10-25 12:39 ` Martin Petzold
2024-10-25 12:48 ` Martin Petzold
2024-10-25 12:54 ` James Prestwood
2024-10-25 13:05 ` Martin Petzold
2024-10-25 13:17 ` James Prestwood
2024-10-25 13:11 ` Martin Petzold
2024-10-25 13:18 ` James Prestwood [this message]
2024-10-25 15:03 ` Martin Petzold
2024-10-25 15:17 ` James Prestwood
2024-10-25 22:22 ` Martin Petzold
2024-10-26 10:01 ` Martin Petzold
2024-10-26 8:26 ` Arend Van Spriel
2024-11-03 23:13 ` Martin Petzold
2024-11-04 0:43 ` Martin Petzold
2024-11-04 12:36 ` James Prestwood
2024-11-04 22:42 ` Martin Petzold
2024-11-04 23:20 ` James Prestwood
2024-11-05 8:03 ` Martin Petzold
2024-11-05 13:14 ` James Prestwood
2024-11-05 15:16 ` Martin Petzold
2024-11-12 9:15 ` Martin Petzold
2024-11-12 12:13 ` James Prestwood
2024-11-07 13:09 ` Martin Petzold
2024-11-06 20:32 ` Martin Petzold
2024-11-06 21:35 ` James Prestwood
2024-10-25 15:17 ` Martin Petzold
2024-10-26 9:07 ` Arend Van Spriel
2024-10-26 10:08 ` Martin Petzold
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e19e7527-9cc1-4c84-95ac-dcb39f836d58@gmail.com \
--to=prestwoj@gmail.com \
--cc=iwd@lists.linux.dev \
--cc=martin.petzold@tavla.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox