From: James Prestwood <prestwoj@gmail.com>
To: Martin Petzold <martin.petzold@tavla.de>
Cc: iwd@lists.linux.dev
Subject: Re: Connection loss (IWD HEAD with latest OWE / BSS selection patches) - brcmfmac driver
Date: Fri, 25 Oct 2024 06:17:34 -0700 [thread overview]
Message-ID: <9092227f-23d5-4efb-b103-93ad11e8a71d@gmail.com> (raw)
In-Reply-To: <228c149f-24f7-4d46-bf0e-d2dcafeb326a@tavla.de>
On 10/25/24 6:05 AM, Martin Petzold wrote:
> Hi James,
>
> Am 25.10.24 um 14:54 schrieb James Prestwood:
>> Hi Martin,
>>
>> On 10/25/24 5:28 AM, Martin Petzold wrote:
>>> Hi James,
>>>
>>> Am 25.10.24 um 13:48 schrieb James Prestwood:
>>>> Hi Martin,
>>>>
>>>> On 10/25/24 4:10 AM, Martin Petzold wrote:
>>>>> Hi all,
>>>>>
>>>>> Am 25.10.24 um 12:12 schrieb Martin Petzold:
>>>>>> Hi all,
>>>>>>
>>>>>> I open a new thread for this one: During the last weeks I have
>>>>>> seen connection losses for 30+ minutes, sometimes even hours or
>>>>>> just now even forever (IWD HEAD with v2 OWE / BSS selection
>>>>>> patches). Driver is brcmfmac (NXP 6.1.36 kernel) and chip is
>>>>>> BCM4339 (Laird LWB5).
>>>>>>
>>>>>> It happens in a) single router environment (WPA2-PSK; Touchstone
>>>>>> TG3442DE), and b) router + repeater environment (WPA2 CCMP;
>>>>>> Fritz!Box + Fritz!Repeater), and maybe also in the WPA3 OWE
>>>>>> Transition network (yesterday lost a connection again).
>>>>>
>>>>> I lost now again 2 of 10 devices in the WPA3 OWE network (with
>>>>> roaming). However, now they don't disappear all after a shorter
>>>>> while. It seems to be later.
>>>>>
>>>>> I also lost one device in a Router+Repeater WPA2 (CCMP) network.
>>>>> It is confirmed here on router side, that the device is
>>>>> disconnected. Since more than a day.
>>>>
>>>> We can't do anything without logs. If you suspect its the blacklist
>>>> you can lower the blacklist time down in main.conf:
>>>>
>>>> [Blacklist]
>>>> MaximumTimeout=5
>>>>
>>>> But logs would be best, we would be able to see if that is whats
>>>> happening or if its something else.
>>>>
>>>>
>>> I can also now see this on one of our local devices (which was
>>> connected to Ethernet and WiFi).
>>>
>>> Please find attached the log. The logs start at late 23rd with the
>>> image upgrade.
>>>
>>> I can see on our server side, that devices continuously connect and
>>> disconnect again (missed heartbeat = hard loss of websocket
>>> connection). I see some type of connection loop on client side too
>>> ("org.eclipse.jetty.websocket.api.UpgradeException: 0 null" caused
>>> by "java.util.concurrent.TimeoutException: DNS timeout 15000 ms").
>>> Maybe it alternates between Ethernet and WiFi. You will understand
>>> better. Or the network is re-configured all the time.
>>>
>>> Here is my local status:
>>>
>>> tavla@tavla:~$ iwctl station wlan0 show
>>> Station: wlan0
>>> --------------------------------------------------------------------------------
>>>
>>> Settable Property Value
>>> --------------------------------------------------------------------------------
>>>
>>> Scanning no
>>> State connecting
>>> Connected network XYZ
>>> No IP addresses Is DHCP client configured?
>>>
>>> tavla@tavla:~$ networkctl status
>>> ● State: routable
>>> Online state: unknown
>>> Address: 192.168.178.178 on eth0
>>> 2a0a:a549:da80:0:fadc:7aff:fe67:2e4 on eth0
>>> fe80::fadc:7aff:fe67:2e4 on eth0
>>> fe80::c2ee:40ff:fe8a:dd62 on wlan0
>>> Gateway: 192.168.178.1 on eth0
>>> fe80::3a10:d5ff:fe37:2c79 on eth0
>>> DNS: 192.168.178.1
>>> NTP: 192.168.178.1
>>> IDX LINK TYPE OPERATIONAL SETUP
>>> 1 lo loopback carrier unmanaged
>>> 2 eth0 ether routable configured
>>> 3 wlan0 wlan no-carrier unmanaged
>>>
>>> 3 links listed.
>>
>> Yep, IWD is ultimately hung up waiting for a connect event from the
>> kernel/driver. A few things I noticed prior to that:
>
> Why should there be a connect event from the kernel/driver? Why don't
> we get any related logs from brcm?
IWD requests to connect and the expectation is for the kernel to return
us an event with the result. Failed or successful. This is part of the
kernel API. Why we don't see brcmfmac logs I have no idea.
>
>
>>
>> 1. Not really critical but you have an agent connecting and
>> disconnecting multiple times a second. Are you polling info with
>> iwctl or something?
>>
>> 2. I see this really odd local disconnect with reason code 0. This is
>> likely something from brcmfmac because I doubt the wireless subsystem
>> would use a "success" reason code in a local deauth:
>>
>> Okt 24 06:05:00 tavla iwd[384]: Received Deauthentication event,
>> reason: 0, from_ap: false
>>
>> 3. You had another disconnect from the AP, but IWD reconnected:
>>
>> Okt 24 06:06:12 tavla iwd[384]: Received Deauthentication event,
>> reason: 6, from_ap: true
>>
>> 4. You then got a disconnect from the AP yet again, reason code 3:
>>
>> Okt 25 06:13:21 tavla iwd[384]: Received Deauthentication event,
>> reason: 3, from_ap: true
>>
>> # You got disconnected by the AP here
>
> Okay, thanks. I don't do any polling of IWD or other network daemons.
> However, I do have a mechanism for web socket re-connects in case of
> disconnect.
>
> What I also see: networkctl shows some IPv6 for wlan0. However, I have
> IPv6 disabled in the iwd.conf. Could this be a problem?
Disabling IPv6 in IWD just means it won't do DHCPv6 and other IPv6
related operations. IPv6 is still "enabled" from a system point of view.
Apart from addressing the shortcomings in the driver, one suggestion I
have (I also work with headless devices that cannot go offline) is to
implement a watchdog. I did this about a year and a half ago originally
as a quick/temporary fix for a nasty IWD bug causing a hang (since
fixed, long ago) but kept it around just in case. We also found a bug in
ath10k which required unloading/reloading the driver to recover from.
Its unfortunate it has to come to this, but I don't know what else to
say. We can't implement special cases in IWD for every buggy driver and
at some point you just have to either scrap the driver your using, fix
it, or work around the issues it has.
Thanks,
James
> I sent additional log output for networkd and resolved in an earlier
> message.
>
>
>>
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_disconnect_event() 3
>> Okt 25 06:13:21 tavla iwd[384]: event: disconnect-info, reason: 3
>> Okt 25 06:13:21 tavla iwd[384]: src/station.c:station_disassociated() 3
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_reset_connection_state() 3
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_roam_state_clear() 3
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/netconfig.c:netconfig_event_handler() l_netconfig event 2
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/netconfig-commit.c:netconfig_commit_print_addrs() removing
>> address: 192.168.178.113
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/resolve.c:resolve_systemd_revert() ifindex: 3
>> Okt 25 06:13:21 tavla iwd[384]: event: state, old: connected, new:
>> disconnected
>> Okt 25 06:13:21 tavla iwd[384]: event: state, old: disconnected, new:
>> autoconnect_quick
>> Okt 25 06:13:21 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_insert()
>> Inserting work item 17
>> Okt 25 06:13:21 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_next()
>> Starting work item 17
>> Okt 25 06:13:21 tavla iwd[384]: src/wiphy.c:wiphy_reg_notify()
>> Notification of command Reg Change(36)
>> Okt 25 06:13:21 tavla iwd[384]: src/netdev.c:netdev_link_notify()
>> event 16 on ifindex 3
>> Okt 25 06:13:21 tavla iwd[384]: src/scan.c:scan_notify() Scan
>> notification Trigger Scan(33)
>> Okt 25 06:13:21 tavla iwd[384]: src/netdev.c:netdev_link_notify()
>> event 16 on ifindex 3
>> Okt 25 06:13:21 tavla iwd[384]: src/scan.c:scan_request_triggered()
>> Passive scan triggered for wdev 1
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_quick_scan_triggered() Quick scan triggered for
>> wlan0
>> Okt 25 06:13:21 tavla iwd[384]: src/scan.c:scan_notify() Scan
>> notification New Scan Results(34)
>> Okt 25 06:13:21 tavla iwd[384]: src/netdev.c:netdev_link_notify()
>> event 16 on ifindex 3
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_print_scan_bss() Processing BSS
>> '38:10:d5:37:2c:7b' with SSID: XYZ, freq: 2462, rank: 442, strength:
>> -3300, data_rate: 54.0, load: 21/255
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_print_scan_bss() Processing BSS
>> '38:10:d5:7a:45:ca' with SSID: FRITZ!Box 7490, freq: 5580, rank: 16,
>> strength: -8700, data_rate: 2.0, load: 12/255
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_print_scan_bss() Processing BSS
>> '84:47:65:1a:c8:14' with SSID: WLAN-T4MQUD, freq: 5260, rank: 442,
>> strength: -6100, data_rate: 54.0, load: 46/255
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_print_scan_bss() Processing BSS
>> '84:47:65:1a:c8:11' with SSID: WLAN-T4MQUD, freq: 2437, rank: 294,
>> strength: -6700, data_rate: 36.0, load: 63/255
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_autoconnect_start()
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_autoconnect_next() autoconnect: Trying SSID: XYZ
>> Okt 25 06:13:21 tavla iwd[384]:
>> src/station.c:station_autoconnect_next() autoconnect:
>> '38:10:d5:37:2c:7b' freq: 2462, rank: 442, strength: -3300
>> Okt 25 06:13:21 tavla iwd[384]: src/netdev.c:netdev_cqm_rssi_update()
>> Okt 25 06:13:21 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_insert()
>> Inserting work item 18
>> Okt 25 06:13:21 tavla iwd[384]: event: connect-info, ssid: XYZ, bss:
>> 38:10:d5:37:2c:7b, signal: -33, load: 21/255
>> Okt 25 06:13:21 tavla iwd[384]: event: state, old: autoconnect_quick,
>> new: connecting (auto)
>> Okt 25 06:13:21 tavla iwd[384]: src/scan.c:scan_cancel() Trying to
>> cancel scan id 17 for wdev 1
>> Okt 25 06:13:21 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_done()
>> Work item 17 done
>> Okt 25 06:13:21 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_next()
>> Starting work item 18
>> Okt 25 06:13:22 tavla iwd[384]: src/netdev.c:netdev_mlme_notify()
>> MLME notification Connect(46)
>> Okt 25 06:13:22 tavla iwd[384]: src/netdev.c:netdev_connect_event()
>> Okt 25 06:13:22 tavla iwd[384]: src/netdev.c:netdev_connect_event()
>> aborting and ignore_connect_event not set, proceed
>> Okt 25 06:13:22 tavla iwd[384]: src/netdev.c:netdev_connect_event()
>> expect_connect_failure not set, proceed
>> Okt 25 06:13:22 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_done()
>> Work item 18 done
>> Okt 25 06:13:22 tavla iwd[384]: src/station.c:station_connect_cb() 3,
>> result: 2
>> Okt 25 06:13:22 tavla iwd[384]: event: connect-failed, status: 16
>>
>> # AP rejects your next attempt to connect with status 16 (group
>> handshake timeout). Looks like the 4-way handshake never even started
>> so this is kinda odd.
>>
>> Okt 25 06:13:22 tavla iwd[384]: src/netdev.c:netdev_cqm_rssi_update()
>> Okt 25 06:13:22 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_insert()
>> Inserting work item 19
>> Okt 25 06:13:22 tavla iwd[384]: src/wiphy.c:wiphy_radio_work_next()
>> Starting work item 19
>> Okt 25 06:13:22 tavla iwd[384]: event: connect-info, ssid: XYZ, bss:
>> 38:10:d5:37:2c:7b, signal: -33, load: 21/255
>> Okt 25 06:13:22 tavla iwd[384]: src/station.c:station_try_next_bss()
>> Attempting to connect to next BSS 38:10:d5:37:2c:7b
>>
>> # Then IWD tries to connect again and we get no connect even whatsoever.
>>
>> I think brcmfmac got very confused and stuck. There are many cases in
>> IWD where we trust the kernel (and associated driver) will send the
>> events we expect. And in this case that did not happen and IWD waits
>> indefinitely.
>>
>> Unless I'm missing something this seems like a bug in the driver.
>> When IWD issues a CMD_CONNECT call the driver needs to tell us if it
>> succeeded or failed. Not replying at all is going to cause IWD to
>> hang like this.
>>
>> Thanks,
>>
>> James
>>
>>>
>>>>>
>>>>>>
>>>>>> We have 100% reproducable builds, test environments (testing
>>>>>> sites), and web socket-based telemetry. Thus my feeling is, there
>>>>>> is still something wrong with the BSS selection (maybe
>>>>>> blacklisting)! Of course this can also be something about the
>>>>>> driver (brcmfmac) or firmware (roaming?). Or can this be about
>>>>>> IPv6 disabled (my configuration below)?
>>>>>>
>>>>>> This is really really serious, I (still) lose connections to
>>>>>> devices!
>>>>>>
>>>>>> I will investigate further and provide logs later today, if
>>>>>> possible.
>>>>>>
>>>>>> SETUP:
>>>>>>
>>>>>> SoM: Variscite DART-IMX8M-PLUS
>>>>>> SoC: NXP i.MX8M Plus
>>>>>> Wi-Fi Chip: Laird Sterling LWB5 (Infineon CYW43353)
>>>>>> Antenna: Laird FlexPIFA
>>>>>> RootFS: Debian bookworm
>>>>>> Kernel: NXP Kernel 6.1.36 (with Variscite patches)
>>>>>> Driver: brcmfmac (from Kernel, only minor patches - device tree)
>>>>>> Firmware: Laird 12.29.0.22 (loaded via scripts)
>>>>>> Network Daemon: systemd-networkd 252.30
>>>>>> Wi-Fi Daemon: IWD 2.22+ (c459dc75c080dcbff6f9b9389344e11328dec411
>>>>>> + OWE and BSS selection patches) + ELL 0.69
>>>>>> (7bed251f8020ef896043f008276f41cd13fdd43f)
>>>>>>
>>>>>> KERNEL:
>>>>>>
>>>>>> REPOSITORY=https://github.com/tavla/linux-imx.git (fork from
>>>>>> https://github.com/varigit/linux-imx)
>>>>>> BRANCH=lf-6.1.y_var03
>>>>>> COMMIT=a4b3cc1dbbb396284e501b938aa79d2dfd4f1a98
>>>>>>
>>>>>> FIRMWARE:
>>>>>>
>>>>>> summit-lwb5-etsi-firmware-12.29.0.22.tar.bz2
>>>>>>
>>>>>> From:
>>>>>> https://github.com/LairdCP/Sterling-LWB-and-LWB5-Release-Packages/releases/
>>>>>>
>>>>>> IWD
>>>>>>
>>>>>> -----
>>>>>>
>>>>>> [General]
>>>>>> EnableNetworkConfiguration=true
>>>>>>
>>>>>> [Network]
>>>>>> NameResolvingService=systemd
>>>>>> EnableIPv6=false
>>>>>>
>>>>>> [DriverQuirks]
>>>>>> OweDisable=brcmfmac
>>>>>>
>>>>>> -----
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>
next prev parent reply other threads:[~2024-10-25 13:17 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-25 10:12 Connection loss (IWD HEAD with latest OWE / BSS selection patches) - brcmfmac driver Martin Petzold
2024-10-25 11:10 ` Martin Petzold
2024-10-25 11:48 ` James Prestwood
2024-10-25 12:01 ` Martin Petzold
2024-10-25 12:28 ` Martin Petzold
2024-10-25 12:33 ` Martin Petzold
2024-10-25 12:39 ` Martin Petzold
2024-10-25 12:48 ` Martin Petzold
2024-10-25 12:54 ` James Prestwood
2024-10-25 13:05 ` Martin Petzold
2024-10-25 13:17 ` James Prestwood [this message]
2024-10-25 13:11 ` Martin Petzold
2024-10-25 13:18 ` James Prestwood
2024-10-25 15:03 ` Martin Petzold
2024-10-25 15:17 ` James Prestwood
2024-10-25 22:22 ` Martin Petzold
2024-10-26 10:01 ` Martin Petzold
2024-10-26 8:26 ` Arend Van Spriel
2024-11-03 23:13 ` Martin Petzold
2024-11-04 0:43 ` Martin Petzold
2024-11-04 12:36 ` James Prestwood
2024-11-04 22:42 ` Martin Petzold
2024-11-04 23:20 ` James Prestwood
2024-11-05 8:03 ` Martin Petzold
2024-11-05 13:14 ` James Prestwood
2024-11-05 15:16 ` Martin Petzold
2024-11-12 9:15 ` Martin Petzold
2024-11-12 12:13 ` James Prestwood
2024-11-07 13:09 ` Martin Petzold
2024-11-06 20:32 ` Martin Petzold
2024-11-06 21:35 ` James Prestwood
2024-10-25 15:17 ` Martin Petzold
2024-10-26 9:07 ` Arend Van Spriel
2024-10-26 10:08 ` Martin Petzold
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9092227f-23d5-4efb-b103-93ad11e8a71d@gmail.com \
--to=prestwoj@gmail.com \
--cc=iwd@lists.linux.dev \
--cc=martin.petzold@tavla.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox