* [PATCH 0/4] Packet/beacon loss roaming improvements
@ 2023-10-30 13:48 James Prestwood
2023-10-30 13:48 ` [PATCH 1/4] station: rename ap_directed_roam to force_roam James Prestwood
` (4 more replies)
0 siblings, 5 replies; 19+ messages in thread
From: James Prestwood @ 2023-10-30 13:48 UTC (permalink / raw)
To: iwd; +Cc: James Prestwood
The rate limiting patch shouldn't be too controversial, but the
forced roam on beacon loss may be so I wanted to provde some
background. This behavior was only tested on ath10k, so I understand
if we don't want to impose the same behavior for all drivers.
We were seeing beacon loss events not resulting in an immediate
disconnnect (as I have always expected), still eventually but after
plenty of time to roam. I initially added handling for
beacon loss identical to packet loss (try and find a better BSS) but
noticed that if IWD did not find a better candidate it resulted in a
disconnect 100% of the time. I watched a client for a full day and
whenever beacon loss events arrived they were followed by a
disconnect within ~5-6 seconds if IWD did not roam. This led me to
believe that at least on ath10k a beacon loss is still very much a
sign that a disconnect is going to come, we just have a bit more time
than other drivers. This was the motivation behind re-using/re-naming
the 'ap_directed_roam' flag in order to force IWD off the BSS.
Again, this is just one driver. Maybe other drivers can
handle/recover from beacon loss. If we instead want to keep the
behavior the same as packet loss I'm ok with that (I can keep the
patch locally), or put the forced roam behavior behind a user
option e.g. [Roam].ForceRoamOnBeaconLoss
James Prestwood (4):
station: rename ap_directed_roam to force_roam
station: start roam on beacon loss event
netdev: handle/send beacon loss event
station: rate limit packet loss roam scans
src/netdev.c | 6 ++++-
src/netdev.h | 1 +
src/station.c | 61 +++++++++++++++++++++++++++++++++++++++++++--------
3 files changed, 58 insertions(+), 10 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 1/4] station: rename ap_directed_roam to force_roam
2023-10-30 13:48 [PATCH 0/4] Packet/beacon loss roaming improvements James Prestwood
@ 2023-10-30 13:48 ` James Prestwood
2023-10-30 13:48 ` [PATCH 2/4] station: start roam on beacon loss event James Prestwood
` (3 subsequent siblings)
4 siblings, 0 replies; 19+ messages in thread
From: James Prestwood @ 2023-10-30 13:48 UTC (permalink / raw)
To: iwd; +Cc: James Prestwood
This will be used for other situations where a roam is forced rather
than based on ranking.
---
src/station.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/src/station.c b/src/station.c
index 1a886b19..a2358d95 100644
--- a/src/station.c
+++ b/src/station.c
@@ -126,7 +126,7 @@ struct station {
bool preparing_roam : 1;
bool roam_scan_full : 1;
bool signal_low : 1;
- bool ap_directed_roaming : 1;
+ bool force_roam : 1;
bool scanning : 1;
bool autoconnect : 1;
bool autoconnect_can_start : 1;
@@ -2114,7 +2114,7 @@ static void station_roam_retry(struct station *station)
*/
station->preparing_roam = false;
station->roam_scan_full = false;
- station->ap_directed_roaming = false;
+ station->force_roam = false;
if (station->signal_low)
station_roam_timeout_rearm(station, roam_retry_interval);
@@ -2145,7 +2145,7 @@ static void station_roam_failed(struct station *station)
* We were told by the AP to roam, but failed. Try ourselves or
* wait for the AP to tell us to roam again
*/
- if (station->ap_directed_roaming)
+ if (station->force_roam)
goto delayed_retry;
/*
@@ -2425,7 +2425,7 @@ static bool station_try_next_transition(struct station *station,
util_address_to_string(bss->addr));
/* Reset AP roam flag, at this point the roaming behaves the same */
- station->ap_directed_roaming = false;
+ station->force_roam = false;
/* Can we use Fast Transition? */
if (station_can_fast_transition(hs, bss) && !no_ft)
@@ -2583,7 +2583,7 @@ static bool station_roam_scan_notify(int err, struct l_queue *bss_list,
* to occur.
*/
bss = l_queue_find(bss_list, bss_match_bssid, current_bss->addr);
- if (bss && !station->ap_directed_roaming) {
+ if (bss && !station->force_roam) {
cur_bss_rank = bss->rank;
if (hs->mde && bss->mde_present && l_get_le16(bss->mde) == mdid)
@@ -2961,18 +2961,18 @@ static void station_ap_directed_roam(struct station *station,
MAC_STR(hdr->address_3));
/*
- * The ap_directed_roaming flag forces IWD to roam if there are any
+ * The force_roam flag forces IWD to roam if there are any
* candidates, even if they are worse than the current BSS. This isn't
* always a good idea since we may be associated to the best BSS. Where
* this does matter is if the AP indicates its going down or will be
* disassociating us. If either of these bits are set, set the
- * ap_directed_roaming flag. Otherwise still try roaming but don't
+ * force_roam flag. Otherwise still try roaming but don't
* treat it any different than a normal roam.
*/
if (req_mode & (WNM_REQUEST_MODE_DISASSOCIATION_IMMINENT |
WNM_REQUEST_MODE_TERMINATION_IMMINENT |
WNM_REQUEST_MODE_ESS_DISASSOCIATION_IMMINENT))
- station->ap_directed_roaming = true;
+ station->force_roam = true;
if (req_mode & WNM_REQUEST_MODE_TERMINATION_IMMINENT) {
if (pos + 12 > body_len)
--
2.25.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 2/4] station: start roam on beacon loss event
2023-10-30 13:48 [PATCH 0/4] Packet/beacon loss roaming improvements James Prestwood
2023-10-30 13:48 ` [PATCH 1/4] station: rename ap_directed_roam to force_roam James Prestwood
@ 2023-10-30 13:48 ` James Prestwood
2023-10-30 13:48 ` [PATCH 3/4] netdev: handle/send " James Prestwood
` (2 subsequent siblings)
4 siblings, 0 replies; 19+ messages in thread
From: James Prestwood @ 2023-10-30 13:48 UTC (permalink / raw)
To: iwd; +Cc: James Prestwood
In past testing the beacon loss event seemed to always be followed
immediately by a disconnect. It was determined at the time that there
wasn't much point in even attempting a roam because a disconnect
would happen before the roam could finish. This was a bit short
sighted because its hardware dependent when the beacon loss event
is sent.
Recently it was seen that some hardware (ath10k in this case)
actually gives some time before the kernel finally disconnects.
In this example a beacon loss event came and IWD remained connected
for 4 seconds (until a packet loss which issued a roam scan).
11:30:25 iwd[1407046]: src/netdev.c:netdev_cqm_event() Beacon lost event
11:30:29 iwd[1407046]: src/netdev.c:netdev_mlme_notify() MLME notification Notify CQM(64)
11:30:29 iwd[1407046]: src/station.c:station_packets_lost() Packets lost event: 10
11:30:29 iwd[1407046]: src/station.c:station_roam_scan() ifindex: 6
11:30:29 iwd[1407046]: src/wiphy.c:wiphy_radio_work_insert() Inserting work item 144
11:30:29 iwd[1407046]: src/wiphy.c:wiphy_radio_work_next() Starting work item 144
11:30:29 iwd[1407046]: src/station.c:station_start_roam() Using cached neighbor report for roam
11:30:29 iwd[1407046]: src/scan.c:scan_notify() Scan notification Trigger Scan(33)
11:30:29 iwd[1407046]: src/scan.c:scan_request_triggered() Active scan triggered for wdev 4
11:30:29 iwd[1407046]: src/netdev.c:netdev_link_notify() event 16 on ifindex 6
11:30:29 iwd[1407046]: src/netdev.c:netdev_mlme_notify() MLME notification Del Station(20)
11:30:29 iwd[1407046]: src/netdev.c:netdev_mlme_notify() MLME notification Deauthenticate(39)
11:30:29 iwd[1407046]: src/netdev.c:netdev_deauthenticate_event()
11:30:29 iwd[1407046]: src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
11:30:29 iwd[1407046]: src/netdev.c:netdev_disconnect_event()
11:30:29 iwd[1407046]: Received Deauthentication event, reason: 4, from_ap: false
It makes sense to also handle beacon loss events and attempt a
roam, potentially saving the connection. One difference here is
the beacon loss event will force a roam (similar to ap directed
roaming when the imminent bit is set).
---
src/netdev.h | 1 +
src/station.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+)
diff --git a/src/netdev.h b/src/netdev.h
index 73d38c32..f27130f1 100644
--- a/src/netdev.h
+++ b/src/netdev.h
@@ -51,6 +51,7 @@ enum netdev_event {
NETDEV_EVENT_RSSI_LEVEL_NOTIFY,
NETDEV_EVENT_PACKET_LOSS_NOTIFY,
NETDEV_EVENT_FT_ROAMED,
+ NETDEV_EVENT_BEACON_LOSS_NOTIFY,
};
enum netdev_watch_event {
diff --git a/src/station.c b/src/station.c
index a2358d95..8c87b3e0 100644
--- a/src/station.c
+++ b/src/station.c
@@ -3347,6 +3347,20 @@ static void station_packets_lost(struct station *station, uint32_t num_pkts)
station_start_roam(station);
}
+static void station_beacon_lost(struct station *station)
+{
+ l_debug("Beacon lost event");
+
+ if (station_cannot_roam(station))
+ return;
+
+ station->force_roam = true;
+
+ station_debug_event(station, "beacon-loss-roam");
+
+ station_start_roam(station);
+}
+
static void station_netdev_event(struct netdev *netdev, enum netdev_event event,
void *event_data, void *user_data)
{
@@ -3391,6 +3405,9 @@ static void station_netdev_event(struct netdev *netdev, enum netdev_event event,
station_roamed(station);
break;
+ case NETDEV_EVENT_BEACON_LOSS_NOTIFY:
+ station_beacon_lost(station);
+ break;
}
}
--
2.25.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 3/4] netdev: handle/send beacon loss event
2023-10-30 13:48 [PATCH 0/4] Packet/beacon loss roaming improvements James Prestwood
2023-10-30 13:48 ` [PATCH 1/4] station: rename ap_directed_roam to force_roam James Prestwood
2023-10-30 13:48 ` [PATCH 2/4] station: start roam on beacon loss event James Prestwood
@ 2023-10-30 13:48 ` James Prestwood
2023-10-30 13:48 ` [PATCH 4/4] station: rate limit packet loss roam scans James Prestwood
2023-10-30 15:00 ` [PATCH 0/4] Packet/beacon loss roaming improvements Denis Kenzior
4 siblings, 0 replies; 19+ messages in thread
From: James Prestwood @ 2023-10-30 13:48 UTC (permalink / raw)
To: iwd; +Cc: James Prestwood
---
src/netdev.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/src/netdev.c b/src/netdev.c
index 56c6ebd2..86712658 100644
--- a/src/netdev.c
+++ b/src/netdev.c
@@ -1054,6 +1054,7 @@ static void netdev_cqm_event(struct l_genl_msg *msg, struct netdev *netdev)
uint32_t *rssi_event = NULL;
int32_t *rssi_val = NULL;
uint32_t *pkt_event = NULL;
+ bool beacon_loss = false;
if (!l_genl_attr_init(&attr, msg))
return;
@@ -1081,7 +1082,7 @@ static void netdev_cqm_event(struct l_genl_msg *msg, struct netdev *netdev)
break;
case NL80211_ATTR_CQM_BEACON_LOSS_EVENT:
- l_debug("Beacon lost event");
+ beacon_loss = true;
break;
case NL80211_ATTR_CQM_RSSI_LEVEL:
@@ -1111,6 +1112,9 @@ static void netdev_cqm_event(struct l_genl_msg *msg, struct netdev *netdev)
} else if (pkt_event && netdev->event_filter)
netdev->event_filter(netdev, NETDEV_EVENT_PACKET_LOSS_NOTIFY,
pkt_event, netdev->user_data);
+ else if (beacon_loss && netdev->event_filter)
+ netdev->event_filter(netdev, NETDEV_EVENT_BEACON_LOSS_NOTIFY,
+ NULL, netdev->user_data);
}
static void netdev_rekey_offload_event(struct l_genl_msg *msg,
--
2.25.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 4/4] station: rate limit packet loss roam scans
2023-10-30 13:48 [PATCH 0/4] Packet/beacon loss roaming improvements James Prestwood
` (2 preceding siblings ...)
2023-10-30 13:48 ` [PATCH 3/4] netdev: handle/send " James Prestwood
@ 2023-10-30 13:48 ` James Prestwood
2023-10-30 14:48 ` Denis Kenzior
2023-10-30 15:00 ` [PATCH 0/4] Packet/beacon loss roaming improvements Denis Kenzior
4 siblings, 1 reply; 19+ messages in thread
From: James Prestwood @ 2023-10-30 13:48 UTC (permalink / raw)
To: iwd; +Cc: James Prestwood
The packet loss handler puts a higher priority on roaming compared
to the low signal roam path. This is generally beneficial since this
event usually indicates some problem with the BSS and generally is
an indicator that a disconnect will follow sometime soon.
But by immediately issuing a scan we run the risk of causing many
successive scans if more packet loss events arrive following
the roam scans (and if no candidates are found). Logs provided
further.
To help with this handle the first event with priority and
immediately issue a roam scan. If another event comes in within a
certain timeframe (2 seconds) don't immediately scan, but instead
rearm the roam timer instead of issuing a scan. This also handles
the case of a low signal roam scan followed by a packet loss
event. Delaying the roam will at least provide some time for packets
to get out in between roam scans.
Logs were snipped to be less verbose, but this cycled happened
5 times prior. In total 7 scans were issued in 5 seconds which may
very well have been the reason for the local disconnect:
Oct 27 16:23:46 src/station.c:station_roam_failed() 9
Oct 27 16:23:46 src/wiphy.c:wiphy_radio_work_done() Work item 29 done
Oct 27 16:23:47 src/netdev.c:netdev_mlme_notify() MLME notification Notify CQM(64)
Oct 27 16:23:47 src/station.c:station_packets_lost() Packets lost event: 10
Oct 27 16:23:47 src/station.c:station_roam_scan() ifindex: 9
Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_insert() Inserting work item 30
Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_next() Starting work item 30
Oct 27 16:23:47 src/station.c:station_start_roam() Using cached neighbor report for roam
Oct 27 16:23:47 src/scan.c:scan_notify() Scan notification Trigger Scan(33)
Oct 27 16:23:47 src/scan.c:scan_request_triggered() Active scan triggered for wdev a
Oct 27 16:23:47 src/scan.c:scan_notify() Scan notification New Scan Results(34)
Oct 27 16:23:47 src/netdev.c:netdev_link_notify() event 16 on ifindex 9
... scan results ...
Oct 27 16:23:47 src/station.c:station_roam_failed() 9
Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_done() Work item 30 done
Oct 27 16:23:47 src/netdev.c:netdev_mlme_notify() MLME notification Notify CQM(64)
Oct 27 16:23:47 src/station.c:station_packets_lost() Packets lost event: 10
Oct 27 16:23:47 src/station.c:station_roam_scan() ifindex: 9
Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_insert() Inserting work item 31
Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_next() Starting work item 31
Oct 27 16:23:47 src/station.c:station_start_roam() Using cached neighbor report for roam
Oct 27 16:23:47 src/scan.c:scan_notify() Scan notification Trigger Scan(33)
Oct 27 16:23:47 src/scan.c:scan_request_triggered() Active scan triggered for wdev a
Oct 27 16:23:48 src/scan.c:scan_notify() Scan notification New Scan Results(34)
Oct 27 16:23:48 src/netdev.c:netdev_link_notify() event 16 on ifindex 9
... scan results ...
Oct 27 16:23:48 src/station.c:station_roam_failed() 9
Oct 27 16:23:48 src/wiphy.c:wiphy_radio_work_done() Work item 31 done
Oct 27 16:23:48 src/netdev.c:netdev_mlme_notify() MLME notification Notify CQM(64)
Oct 27 16:23:48 src/station.c:station_packets_lost() Packets lost event: 10
Oct 27 16:23:48 src/station.c:station_roam_scan() ifindex: 9
Oct 27 16:23:48 src/wiphy.c:wiphy_radio_work_insert() Inserting work item 32
Oct 27 16:23:48 src/wiphy.c:wiphy_radio_work_next() Starting work item 32
Oct 27 16:23:48 src/station.c:station_start_roam() Using cached neighbor report for roam
Oct 27 16:23:48 src/scan.c:scan_notify() Scan notification Trigger Scan(33)
Oct 27 16:23:48 src/scan.c:scan_request_triggered() Active scan triggered for wdev a
Oct 27 16:23:49 src/netdev.c:netdev_link_notify() event 16 on ifindex 9
Oct 27 16:23:49 src/netdev.c:netdev_mlme_notify() MLME notification Del Station(20)
Oct 27 16:23:49 src/netdev.c:netdev_mlme_notify() MLME notification Deauthenticate(39)
Oct 27 16:23:49 src/netdev.c:netdev_deauthenticate_event()
Oct 27 16:23:49 src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
Oct 27 16:23:49 src/netdev.c:netdev_disconnect_event()
Oct 27 16:23:49 Received Deauthentication event, reason: 4, from_ap: false
---
src/station.c | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)
diff --git a/src/station.c b/src/station.c
index 8c87b3e0..bab690b9 100644
--- a/src/station.c
+++ b/src/station.c
@@ -123,6 +123,8 @@ struct station {
struct wiphy_radio_work_item ft_work;
+ uint64_t last_roam_scan;
+
bool preparing_roam : 1;
bool roam_scan_full : 1;
bool signal_low : 1;
@@ -1708,6 +1710,7 @@ static void station_roam_state_clear(struct station *station)
station->signal_low = false;
station->roam_min_time.tv_sec = 0;
station->netconfig_after_roam = false;
+ station->last_roam_scan = 0;
if (station->roam_scan_id)
scan_cancel(netdev_get_wdev_id(station->netdev),
@@ -2711,6 +2714,8 @@ static int station_roam_scan(struct station *station,
if (L_WARN_ON(scan_freq_set_isempty(params.freqs)))
return -ENOTSUP;
+ station->last_roam_scan = l_time_now();
+
station->roam_scan_id =
scan_active_full(netdev_get_wdev_id(station->netdev), ¶ms,
station_roam_scan_triggered,
@@ -3330,10 +3335,13 @@ static void station_disconnect_event(struct station *station, void *event_data)
l_warn("Unexpected disconnect event");
}
-#define STATION_PKT_LOSS_THRESHOLD 10
+#define STATION_PKT_LOSS_THRESHOLD 10
+#define LOSS_ROAM_RATE_LIMIT 2
static void station_packets_lost(struct station *station, uint32_t num_pkts)
{
+ uint64_t elapsed;
+
l_debug("Packets lost event: %u", num_pkts);
if (num_pkts < STATION_PKT_LOSS_THRESHOLD)
@@ -3344,6 +3352,24 @@ static void station_packets_lost(struct station *station, uint32_t num_pkts)
station_debug_event(station, "packet-loss-roam");
+ elapsed = l_time_diff(station->last_roam_scan, l_time_now());
+
+ /*
+ * If we just issued a roam scan, delay the roam to avoid constant
+ * scanning.
+ */
+ if (LOSS_ROAM_RATE_LIMIT > l_time_to_secs(elapsed)) {
+ l_debug("Too many roam attempts in %u second timeframe, "
+ "delaying roam", LOSS_ROAM_RATE_LIMIT);
+
+ if (station->roam_trigger_timeout)
+ return;
+
+ station_roam_timeout_rearm(station, LOSS_ROAM_RATE_LIMIT);
+
+ return;
+ }
+
station_start_roam(station);
}
--
2.25.1
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 4/4] station: rate limit packet loss roam scans
2023-10-30 13:48 ` [PATCH 4/4] station: rate limit packet loss roam scans James Prestwood
@ 2023-10-30 14:48 ` Denis Kenzior
0 siblings, 0 replies; 19+ messages in thread
From: Denis Kenzior @ 2023-10-30 14:48 UTC (permalink / raw)
To: James Prestwood, iwd
Hi James,
On 10/30/23 08:48, James Prestwood wrote:
> The packet loss handler puts a higher priority on roaming compared
> to the low signal roam path. This is generally beneficial since this
> event usually indicates some problem with the BSS and generally is
> an indicator that a disconnect will follow sometime soon.
>
> But by immediately issuing a scan we run the risk of causing many
> successive scans if more packet loss events arrive following
> the roam scans (and if no candidates are found). Logs provided
> further.
>
> To help with this handle the first event with priority and
> immediately issue a roam scan. If another event comes in within a
> certain timeframe (2 seconds) don't immediately scan, but instead
> rearm the roam timer instead of issuing a scan. This also handles
> the case of a low signal roam scan followed by a packet loss
> event. Delaying the roam will at least provide some time for packets
> to get out in between roam scans.
>
> Logs were snipped to be less verbose, but this cycled happened
> 5 times prior. In total 7 scans were issued in 5 seconds which may
> very well have been the reason for the local disconnect:
>
> Oct 27 16:23:46 src/station.c:station_roam_failed() 9
> Oct 27 16:23:46 src/wiphy.c:wiphy_radio_work_done() Work item 29 done
> Oct 27 16:23:47 src/netdev.c:netdev_mlme_notify() MLME notification Notify CQM(64)
> Oct 27 16:23:47 src/station.c:station_packets_lost() Packets lost event: 10
> Oct 27 16:23:47 src/station.c:station_roam_scan() ifindex: 9
> Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_insert() Inserting work item 30
> Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_next() Starting work item 30
> Oct 27 16:23:47 src/station.c:station_start_roam() Using cached neighbor report for roam
> Oct 27 16:23:47 src/scan.c:scan_notify() Scan notification Trigger Scan(33)
> Oct 27 16:23:47 src/scan.c:scan_request_triggered() Active scan triggered for wdev a
> Oct 27 16:23:47 src/scan.c:scan_notify() Scan notification New Scan Results(34)
> Oct 27 16:23:47 src/netdev.c:netdev_link_notify() event 16 on ifindex 9
> ... scan results ...
> Oct 27 16:23:47 src/station.c:station_roam_failed() 9
> Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_done() Work item 30 done
> Oct 27 16:23:47 src/netdev.c:netdev_mlme_notify() MLME notification Notify CQM(64)
> Oct 27 16:23:47 src/station.c:station_packets_lost() Packets lost event: 10
> Oct 27 16:23:47 src/station.c:station_roam_scan() ifindex: 9
> Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_insert() Inserting work item 31
> Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_next() Starting work item 31
> Oct 27 16:23:47 src/station.c:station_start_roam() Using cached neighbor report for roam
> Oct 27 16:23:47 src/scan.c:scan_notify() Scan notification Trigger Scan(33)
> Oct 27 16:23:47 src/scan.c:scan_request_triggered() Active scan triggered for wdev a
> Oct 27 16:23:48 src/scan.c:scan_notify() Scan notification New Scan Results(34)
> Oct 27 16:23:48 src/netdev.c:netdev_link_notify() event 16 on ifindex 9
> ... scan results ...
> Oct 27 16:23:48 src/station.c:station_roam_failed() 9
> Oct 27 16:23:48 src/wiphy.c:wiphy_radio_work_done() Work item 31 done
> Oct 27 16:23:48 src/netdev.c:netdev_mlme_notify() MLME notification Notify CQM(64)
> Oct 27 16:23:48 src/station.c:station_packets_lost() Packets lost event: 10
> Oct 27 16:23:48 src/station.c:station_roam_scan() ifindex: 9
> Oct 27 16:23:48 src/wiphy.c:wiphy_radio_work_insert() Inserting work item 32
> Oct 27 16:23:48 src/wiphy.c:wiphy_radio_work_next() Starting work item 32
> Oct 27 16:23:48 src/station.c:station_start_roam() Using cached neighbor report for roam
> Oct 27 16:23:48 src/scan.c:scan_notify() Scan notification Trigger Scan(33)
> Oct 27 16:23:48 src/scan.c:scan_request_triggered() Active scan triggered for wdev a
> Oct 27 16:23:49 src/netdev.c:netdev_link_notify() event 16 on ifindex 9
> Oct 27 16:23:49 src/netdev.c:netdev_mlme_notify() MLME notification Del Station(20)
> Oct 27 16:23:49 src/netdev.c:netdev_mlme_notify() MLME notification Deauthenticate(39)
> Oct 27 16:23:49 src/netdev.c:netdev_deauthenticate_event()
> Oct 27 16:23:49 src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
> Oct 27 16:23:49 src/netdev.c:netdev_disconnect_event()
> Oct 27 16:23:49 Received Deauthentication event, reason: 4, from_ap: false
> ---
> src/station.c | 28 +++++++++++++++++++++++++++-
> 1 file changed, 27 insertions(+), 1 deletion(-)
>
Applied, thanks.
Regards,
-Denis
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] Packet/beacon loss roaming improvements
2023-10-30 13:48 [PATCH 0/4] Packet/beacon loss roaming improvements James Prestwood
` (3 preceding siblings ...)
2023-10-30 13:48 ` [PATCH 4/4] station: rate limit packet loss roam scans James Prestwood
@ 2023-10-30 15:00 ` Denis Kenzior
2023-10-30 15:37 ` James Prestwood
4 siblings, 1 reply; 19+ messages in thread
From: Denis Kenzior @ 2023-10-30 15:00 UTC (permalink / raw)
To: James Prestwood, iwd
Hi James,
>
> We were seeing beacon loss events not resulting in an immediate
> disconnnect (as I have always expected), still eventually but after
If I recall correctly, Lost Beacon is sent after several beacons in succession
were lost. You are right that this could just be bad luck and doesn't actually
mean that no packets are getting through. However, in practice mac80211 almost
always disconnected us soon after. Didn't we test this pretty thoroughly?
My memory is fuzzy here, but I seem to recall that power save has an effect on
how lost beacon events are treated by mac80211. Maybe your recent power save
patches had an effect?
> plenty of time to roam. I initially added handling for
> beacon loss identical to packet loss (try and find a better BSS) but
> noticed that if IWD did not find a better candidate it resulted in a
> disconnect 100% of the time. I watched a client for a full day and
> whenever beacon loss events arrived they were followed by a
> disconnect within ~5-6 seconds if IWD did not roam. This led me to
> believe that at least on ath10k a beacon loss is still very much a
> sign that a disconnect is going to come, we just have a bit more time
> than other drivers. This was the motivation behind re-using/re-naming
> the 'ap_directed_roam' flag in order to force IWD off the BSS.
>
ath10k is still a mac80211 driver, no? Given that we did test Lost Beacon event
behavior before, I would like some more data points before being convinced it is
a driver behavior change.
> Again, this is just one driver. Maybe other drivers can
> handle/recover from beacon loss. If we instead want to keep the
> behavior the same as packet loss I'm ok with that (I can keep the
> patch locally), or put the forced roam behavior behind a user
> option e.g. [Roam].ForceRoamOnBeaconLoss
If this is a driver behavior quirk, then this belongs in src/wiphy.c
driver_infos table somehow. I'd really rather not add a bazillion config
options that address the bug-of-the-day.
>
> James Prestwood (4):
> station: rename ap_directed_roam to force_roam
> station: start roam on beacon loss event
> netdev: handle/send beacon loss event
> station: rate limit packet loss roam scans
>
> src/netdev.c | 6 ++++-
> src/netdev.h | 1 +
> src/station.c | 61 +++++++++++++++++++++++++++++++++++++++++++--------
> 3 files changed, 58 insertions(+), 10 deletions(-)
>
Regards,
-Denis
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] Packet/beacon loss roaming improvements
2023-10-30 15:00 ` [PATCH 0/4] Packet/beacon loss roaming improvements Denis Kenzior
@ 2023-10-30 15:37 ` James Prestwood
2023-10-30 17:05 ` Denis Kenzior
0 siblings, 1 reply; 19+ messages in thread
From: James Prestwood @ 2023-10-30 15:37 UTC (permalink / raw)
To: Denis Kenzior, iwd
Hi Denis,
On 10/30/23 8:00 AM, Denis Kenzior wrote:
> Hi James,
>
>>
>> We were seeing beacon loss events not resulting in an immediate
>> disconnnect (as I have always expected), still eventually but after
>
> If I recall correctly, Lost Beacon is sent after several beacons in
> succession were lost. You are right that this could just be bad luck
> and doesn't actually mean that no packets are getting through. However,
> in practice mac80211 almost always disconnected us soon after. Didn't
> we test this pretty thoroughly?
Yes, it appears mac80211 by default waits for 7 missed beacons before
sending the event. It then probes the AP (either nullfunc or probe
request) so apparently the connection could be recovered if the AP
responded. Unfortunately we don't get any notification in userspace if
the AP responded or not...
I can't remember what hardware was tested. But there really wasn't a
consistent way to test this. The testing involved me disabling roaming
and walking away from the AP until I got disconnected. Sometimes this
was due to beacon loss, sometimes the AP disconnected explicitly. But
what I do remember is when beacon loss occurred, a local disconnect
followed near immediately. This is why (I think) we thought there was no
reason to handle this event.
>
> My memory is fuzzy here, but I seem to recall that power save has an
> effect on how lost beacon events are treated by mac80211. Maybe your
> recent power save patches had an effect?
From what I can tell in mac80211 power save doesn't change handling.
Its the driver that tells mac80211 of the beacon loss but maybe the
driver (or firmware) could handle it differently depending on power save.
When I was watching this device power save was disabled.
>
>> plenty of time to roam. I initially added handling for
>> beacon loss identical to packet loss (try and find a better BSS) but
>> noticed that if IWD did not find a better candidate it resulted in a
>> disconnect 100% of the time. I watched a client for a full day and
>> whenever beacon loss events arrived they were followed by a
>> disconnect within ~5-6 seconds if IWD did not roam. This led me to
>> believe that at least on ath10k a beacon loss is still very much a
>> sign that a disconnect is going to come, we just have a bit more time
>> than other drivers. This was the motivation behind re-using/re-naming
>> the 'ap_directed_roam' flag in order to force IWD off the BSS.
>>
>
> ath10k is still a mac80211 driver, no? Given that we did test Lost
> Beacon event behavior before, I would like some more data points before
> being convinced it is a driver behavior change.
>
>> Again, this is just one driver. Maybe other drivers can
>> handle/recover from beacon loss. If we instead want to keep the
>> behavior the same as packet loss I'm ok with that (I can keep the
>> patch locally), or put the forced roam behavior behind a user
>> option e.g. [Roam].ForceRoamOnBeaconLoss
>
> If this is a driver behavior quirk, then this belongs in src/wiphy.c
> driver_infos table somehow. I'd really rather not add a bazillion
> config options that address the bug-of-the-day.
Yeah, adding a driver specific quirk doesn't seem like the right route.
I think for now there is no harm in trying to roam on beacon loss,
basically the same handling as packet loss. If a disconnect comes
immediately the scan would be canceled. Otherwise maybe we get lucky and
be able to roam.
Since our specific hardware/use case seems to benefit from the forced
roam I can keep that change out of tree (just set ap_directed_roaming),
at least until more testing can be done, or if others report similar
behavior.
Thanks,
James
>
>>
>> James Prestwood (4):
>> station: rename ap_directed_roam to force_roam
>> station: start roam on beacon loss event
>> netdev: handle/send beacon loss event
>> station: rate limit packet loss roam scans
>>
>> src/netdev.c | 6 ++++-
>> src/netdev.h | 1 +
>> src/station.c | 61 +++++++++++++++++++++++++++++++++++++++++++--------
>> 3 files changed, 58 insertions(+), 10 deletions(-)
>>
>
> Regards,
> -Denis
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] Packet/beacon loss roaming improvements
2023-10-30 15:37 ` James Prestwood
@ 2023-10-30 17:05 ` Denis Kenzior
2023-10-30 17:37 ` James Prestwood
0 siblings, 1 reply; 19+ messages in thread
From: Denis Kenzior @ 2023-10-30 17:05 UTC (permalink / raw)
To: James Prestwood, iwd
Hi James,
On 10/30/23 10:37, James Prestwood wrote:
> Hi Denis,
>
> On 10/30/23 8:00 AM, Denis Kenzior wrote:
>> Hi James,
>>
>>>
>>> We were seeing beacon loss events not resulting in an immediate
>>> disconnnect (as I have always expected), still eventually but after
>>
>> If I recall correctly, Lost Beacon is sent after several beacons in succession
>> were lost. You are right that this could just be bad luck and doesn't
>> actually mean that no packets are getting through. However, in practice
>> mac80211 almost always disconnected us soon after. Didn't we test this pretty
>> thoroughly?
>
> Yes, it appears mac80211 by default waits for 7 missed beacons before sending
> the event. It then probes the AP (either nullfunc or probe request) so
> apparently the connection could be recovered if the AP responded. Unfortunately
> we don't get any notification in userspace if the AP responded or not...
So this magic here?
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git/tree/net/mac80211/mlme.c#n3215
>
> I can't remember what hardware was tested. But there really wasn't a consistent
> way to test this. The testing involved me disabling roaming and walking away
> from the AP until I got disconnected. Sometimes this was due to beacon loss,
> sometimes the AP disconnected explicitly. But what I do remember is when beacon
> loss occurred, a local disconnect followed near immediately. This is why (I
> think) we thought there was no reason to handle this event.
So what does your ath10k driver/hw do? Does it send nullfuncs or probe requests?
>
>>
>> My memory is fuzzy here, but I seem to recall that power save has an effect on
>> how lost beacon events are treated by mac80211. Maybe your recent power save
>> patches had an effect?
>
> From what I can tell in mac80211 power save doesn't change handling. Its the
> driver that tells mac80211 of the beacon loss but maybe the driver (or firmware)
> could handle it differently depending on power save.
>
> When I was watching this device power save was disabled.
Okay, fair enough.
>> If this is a driver behavior quirk, then this belongs in src/wiphy.c
>> driver_infos table somehow. I'd really rather not add a bazillion config
>> options that address the bug-of-the-day.
>
> Yeah, adding a driver specific quirk doesn't seem like the right route.
>
> I think for now there is no harm in trying to roam on beacon loss, basically the
> same handling as packet loss. If a disconnect comes immediately the scan would
> be canceled. Otherwise maybe we get lucky and be able to roam.
So the problem is, we had the _exact_ same behavior you're proposing here. We
took it out. See commit:
836beb1276d1 ("station/wsc: remove beacon loss handling")
So when we do that, alarm bells start going off. Why did we get rid of it if it
was useful?
7 consecutive lost beacons is actually a lot. That's ~700ms with no connection
with default settings. And you can maintain the connection after that for
another 5-6? Something smells fishy.
If the kernel has a hard limit after which it expects the connection to be
disconnected, we can start a timer for 2-4x that limit? Looks like kernel uses
probe_wait_ms parameter for this with a default of 500ms. Is your setup using
the default values for beacon_loss_count and probe_wait_ms?
Regards,
-Denis
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] Packet/beacon loss roaming improvements
2023-10-30 17:05 ` Denis Kenzior
@ 2023-10-30 17:37 ` James Prestwood
2023-11-01 12:07 ` James Prestwood
0 siblings, 1 reply; 19+ messages in thread
From: James Prestwood @ 2023-10-30 17:37 UTC (permalink / raw)
To: Denis Kenzior, iwd
Hi Denis,
On 10/30/23 10:05 AM, Denis Kenzior wrote:
> Hi James,
>
> On 10/30/23 10:37, James Prestwood wrote:
>> Hi Denis,
>>
>> On 10/30/23 8:00 AM, Denis Kenzior wrote:
>>> Hi James,
>>>
>>>>
>>>> We were seeing beacon loss events not resulting in an immediate
>>>> disconnnect (as I have always expected), still eventually but after
>>>
>>> If I recall correctly, Lost Beacon is sent after several beacons in
>>> succession were lost. You are right that this could just be bad luck
>>> and doesn't actually mean that no packets are getting through.
>>> However, in practice mac80211 almost always disconnected us soon
>>> after. Didn't we test this pretty thoroughly?
>>
>> Yes, it appears mac80211 by default waits for 7 missed beacons before
>> sending the event. It then probes the AP (either nullfunc or probe
>> request) so apparently the connection could be recovered if the AP
>> responded. Unfortunately we don't get any notification in userspace if
>> the AP responded or not...
>
> So this magic here?
> https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git/tree/net/mac80211/mlme.c#n3215
Yep
>
>>
>> I can't remember what hardware was tested. But there really wasn't a
>> consistent way to test this. The testing involved me disabling roaming
>> and walking away from the AP until I got disconnected. Sometimes this
>> was due to beacon loss, sometimes the AP disconnected explicitly. But
>> what I do remember is when beacon loss occurred, a local disconnect
>> followed near immediately. This is why (I think) we thought there was
>> no reason to handle this event.
>
> So what does your ath10k driver/hw do? Does it send nullfuncs or probe
> requests?
Based on the driver it sets REPORTS_TX_ACK_STATUS, so nullfunc. Which
would add an extra second before disconnect by default (2 nullfuncs
500ms apart).
>
>>
>>>
>>> My memory is fuzzy here, but I seem to recall that power save has an
>>> effect on how lost beacon events are treated by mac80211. Maybe your
>>> recent power save patches had an effect?
>>
>> From what I can tell in mac80211 power save doesn't change handling.
>> Its the driver that tells mac80211 of the beacon loss but maybe the
>> driver (or firmware) could handle it differently depending on power save.
>>
>> When I was watching this device power save was disabled.
>
> Okay, fair enough.
>
>>> If this is a driver behavior quirk, then this belongs in src/wiphy.c
>>> driver_infos table somehow. I'd really rather not add a bazillion
>>> config options that address the bug-of-the-day.
>>
>> Yeah, adding a driver specific quirk doesn't seem like the right route.
>>
>> I think for now there is no harm in trying to roam on beacon loss,
>> basically the same handling as packet loss. If a disconnect comes
>> immediately the scan would be canceled. Otherwise maybe we get lucky
>> and be able to roam.
>
> So the problem is, we had the _exact_ same behavior you're proposing
> here. We took it out. See commit:
> 836beb1276d1 ("station/wsc: remove beacon loss handling")
>
> So when we do that, alarm bells start going off. Why did we get rid of
> it if it was useful?
Huh, apparently it was handled previously. I really have no idea,
apparently I thought something changed in the kernel. Maybe based on
testing only certain hardware.
Looking back to kernel 5.3 (since 5.4+ was mentioned in that commit) I
really don't see much difference either. This device was on 5.11, and
again, not much different than current upstream in that regard.
>
> 7 consecutive lost beacons is actually a lot. That's ~700ms with no
> connection with default settings. And you can maintain the connection
> after that for another 5-6? Something smells fishy.
According to the logs, yes. Looking back to the logs in the commit it
was actually 4 seconds.
Still, I'm not sure how it could get that far without a second lost
beacon event (e.g. miss beacons, but get a nullfunc reply, miss 7 more,
nullfunc reply etc.). With a single event the longest it could get drug
out would be 1.7 seconds (7 beacons + 2 nullfuncs) before either
disconnecting or recovering but missing more beacons and generating an
event, unless there is some other reason=4 failure path I'm missing.
> If the kernel has a hard limit after which it expects the connection to
> be disconnected, we can start a timer for 2-4x that limit? Looks like
> kernel uses probe_wait_ms parameter for this with a default of 500ms.
> Is your setup using the default values for beacon_loss_count and
> probe_wait_ms?
Yep, everything is default as far as nl80211/driver options.
>
> Regards,
> -Denis
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] Packet/beacon loss roaming improvements
2023-10-30 17:37 ` James Prestwood
@ 2023-11-01 12:07 ` James Prestwood
2023-11-02 1:39 ` Denis Kenzior
0 siblings, 1 reply; 19+ messages in thread
From: James Prestwood @ 2023-11-01 12:07 UTC (permalink / raw)
To: Denis Kenzior, iwd
Hi Denis,
On 10/30/23 10:37 AM, James Prestwood wrote:
> Hi Denis,
>
> On 10/30/23 10:05 AM, Denis Kenzior wrote:
>> Hi James,
>>
>> On 10/30/23 10:37, James Prestwood wrote:
>>> Hi Denis,
>>>
>>> On 10/30/23 8:00 AM, Denis Kenzior wrote:
>>>> Hi James,
>>>>
>>>>>
>>>>> We were seeing beacon loss events not resulting in an immediate
>>>>> disconnnect (as I have always expected), still eventually but after
>>>>
>>>> If I recall correctly, Lost Beacon is sent after several beacons in
>>>> succession were lost. You are right that this could just be bad
>>>> luck and doesn't actually mean that no packets are getting through.
>>>> However, in practice mac80211 almost always disconnected us soon
>>>> after. Didn't we test this pretty thoroughly?
>>>
>>> Yes, it appears mac80211 by default waits for 7 missed beacons before
>>> sending the event. It then probes the AP (either nullfunc or probe
>>> request) so apparently the connection could be recovered if the AP
>>> responded. Unfortunately we don't get any notification in userspace
>>> if the AP responded or not...
>>
>> So this magic here?
>> https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git/tree/net/mac80211/mlme.c#n3215
>
> Yep
>
>>
>>>
>>> I can't remember what hardware was tested. But there really wasn't a
>>> consistent way to test this. The testing involved me disabling
>>> roaming and walking away from the AP until I got disconnected.
>>> Sometimes this was due to beacon loss, sometimes the AP disconnected
>>> explicitly. But what I do remember is when beacon loss occurred, a
>>> local disconnect followed near immediately. This is why (I think) we
>>> thought there was no reason to handle this event.
>>
>> So what does your ath10k driver/hw do? Does it send nullfuncs or
>> probe requests?
>
> Based on the driver it sets REPORTS_TX_ACK_STATUS, so nullfunc. Which
> would add an extra second before disconnect by default (2 nullfuncs
> 500ms apart).
>
>>
>>>
>>>>
>>>> My memory is fuzzy here, but I seem to recall that power save has an
>>>> effect on how lost beacon events are treated by mac80211. Maybe
>>>> your recent power save patches had an effect?
>>>
>>> From what I can tell in mac80211 power save doesn't change handling.
>>> Its the driver that tells mac80211 of the beacon loss but maybe the
>>> driver (or firmware) could handle it differently depending on power
>>> save.
>>>
>>> When I was watching this device power save was disabled.
>>
>> Okay, fair enough.
>>
>>>> If this is a driver behavior quirk, then this belongs in src/wiphy.c
>>>> driver_infos table somehow. I'd really rather not add a bazillion
>>>> config options that address the bug-of-the-day.
>>>
>>> Yeah, adding a driver specific quirk doesn't seem like the right route.
>>>
>>> I think for now there is no harm in trying to roam on beacon loss,
>>> basically the same handling as packet loss. If a disconnect comes
>>> immediately the scan would be canceled. Otherwise maybe we get lucky
>>> and be able to roam.
>>
>> So the problem is, we had the _exact_ same behavior you're proposing
>> here. We took it out. See commit:
>> 836beb1276d1 ("station/wsc: remove beacon loss handling")
>>
>> So when we do that, alarm bells start going off. Why did we get rid
>> of it if it was useful?
>
> Huh, apparently it was handled previously. I really have no idea,
> apparently I thought something changed in the kernel. Maybe based on
> testing only certain hardware.
>
> Looking back to kernel 5.3 (since 5.4+ was mentioned in that commit) I
> really don't see much difference either. This device was on 5.11, and
> again, not much different than current upstream in that regard.
>
>>
>> 7 consecutive lost beacons is actually a lot. That's ~700ms with no
>> connection with default settings. And you can maintain the connection
>> after that for another 5-6? Something smells fishy.
>
> According to the logs, yes. Looking back to the logs in the commit it
> was actually 4 seconds.
>
> Still, I'm not sure how it could get that far without a second lost
> beacon event (e.g. miss beacons, but get a nullfunc reply, miss 7 more,
> nullfunc reply etc.). With a single event the longest it could get drug
> out would be 1.7 seconds (7 beacons + 2 nullfuncs) before either
> disconnecting or recovering but missing more beacons and generating an
> event, unless there is some other reason=4 failure path I'm missing.
>
>
>> If the kernel has a hard limit after which it expects the connection
>> to be disconnected, we can start a timer for 2-4x that limit? Looks
>> like kernel uses probe_wait_ms parameter for this with a default of
>> 500ms. Is your setup using the default values for beacon_loss_count
>> and probe_wait_ms?
>
> Yep, everything is default as far as nl80211/driver options.
I found an old thread I asked on linux-wireless back when I removed the
beacon loss handling [1]. Johannes explained that behavior did
apparently change in order to support power save (your memory was correct).
I identified the behavior change with hwsim, and apparently took his
word when he said:
"I'm pretty sure real hardware will behave just like hwsim here, albeit
perhaps a bit slower"
And I never added "proper" testing of beacon loss, i.e. block several
beacons as opposed to just tearing down the AP. And this appears closer
to the behavior I'm seeing in reality (AP not going down, just dropping
beacons).
So this is my bad, I didn't take into account the situation of beacons
being dropped but then being picked up again, or the nullfuncs/probes
coming back successfully.
[1]
https://lore.kernel.org/linux-wireless/ada14dfad76b93d654606c3b397de059d968096b.camel@gmail.com/
Thanks,
James
>
>>
>> Regards,
>> -Denis
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] Packet/beacon loss roaming improvements
2023-11-01 12:07 ` James Prestwood
@ 2023-11-02 1:39 ` Denis Kenzior
2023-11-02 11:58 ` James Prestwood
0 siblings, 1 reply; 19+ messages in thread
From: Denis Kenzior @ 2023-11-02 1:39 UTC (permalink / raw)
To: James Prestwood, iwd
Hi James,
>>
>>> If the kernel has a hard limit after which it expects the connection to be
>>> disconnected, we can start a timer for 2-4x that limit? Looks like kernel
>>> uses probe_wait_ms parameter for this with a default of 500ms. Is your setup
>>> using the default values for beacon_loss_count and probe_wait_ms?
>>
>> Yep, everything is default as far as nl80211/driver options.
>
> I found an old thread I asked on linux-wireless back when I removed the beacon
> loss handling [1]. Johannes explained that behavior did apparently change in
> order to support power save (your memory was correct).
>
Hah, funny.
> I identified the behavior change with hwsim, and apparently took his word when
> he said:
>
> "I'm pretty sure real hardware will behave just like hwsim here, albeit perhaps
> a bit slower"
>
> And I never added "proper" testing of beacon loss, i.e. block several beacons as
> opposed to just tearing down the AP. And this appears closer to the behavior I'm
> seeing in reality (AP not going down, just dropping beacons).
Right.
>
> So this is my bad, I didn't take into account the situation of beacons being
> dropped but then being picked up again, or the nullfuncs/probes coming back
> successfully.
So the question is still how we handle this properly. When the kernel sends the
nullfunc / probe request, are we going to be interfering with that logic by
initiating a scan right away? A scan would force the device to go offchannel,
preventing it from receiving the AP's response. Given that nullfunc should be
extremely fast, maybe we should delay any roam action we take by a second or
two? If the AP is truly lost, then we'll get disconnected soon after the
LOST_BEACON event. If it isn't, then an extra second/two is not going to make
much difference since the scan should be limited and quite fast most of the time.
Regards,
-Denis
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] Packet/beacon loss roaming improvements
2023-11-02 1:39 ` Denis Kenzior
@ 2023-11-02 11:58 ` James Prestwood
2023-11-02 14:10 ` Denis Kenzior
0 siblings, 1 reply; 19+ messages in thread
From: James Prestwood @ 2023-11-02 11:58 UTC (permalink / raw)
To: Denis Kenzior, iwd
Hi Denis,
On 11/1/23 6:39 PM, Denis Kenzior wrote:
> Hi James,
>
>>>
>>>> If the kernel has a hard limit after which it expects the connection
>>>> to be disconnected, we can start a timer for 2-4x that limit? Looks
>>>> like kernel uses probe_wait_ms parameter for this with a default of
>>>> 500ms. Is your setup using the default values for beacon_loss_count
>>>> and probe_wait_ms?
>>>
>>> Yep, everything is default as far as nl80211/driver options.
>>
>> I found an old thread I asked on linux-wireless back when I removed
>> the beacon loss handling [1]. Johannes explained that behavior did
>> apparently change in order to support power save (your memory was
>> correct).
>>
>
> Hah, funny.
>
>> I identified the behavior change with hwsim, and apparently took his
>> word when he said:
>>
>> "I'm pretty sure real hardware will behave just like hwsim here,
>> albeit perhaps a bit slower"
>>
>> And I never added "proper" testing of beacon loss, i.e. block several
>> beacons as opposed to just tearing down the AP. And this appears
>> closer to the behavior I'm seeing in reality (AP not going down, just
>> dropping beacons).
>
> Right.
>
>>
>> So this is my bad, I didn't take into account the situation of beacons
>> being dropped but then being picked up again, or the nullfuncs/probes
>> coming back successfully.
>
> So the question is still how we handle this properly. When the kernel
> sends the nullfunc / probe request, are we going to be interfering with
> that logic by initiating a scan right away? A scan would force the
> device to go offchannel, preventing it from receiving the AP's
> response. Given that nullfunc should be extremely fast, maybe we should
> delay any roam action we take by a second or two? If the AP is truly
> lost, then we'll get disconnected soon after the LOST_BEACON event. If
> it isn't, then an extra second/two is not going to make much difference
> since the scan should be limited and quite fast most of the time.
Yes scanning likely kills all chances of recovery. And this may very
well be why I was seeing a 100% rate of disconnections. Granted, our
devices are always moving so its very likely that once beacon/packet
loss starts its only going to get worse, but that's not the case for
everyone. Scanning just makes things worse.
I'm fine adding similar handling that I added for packet loss, except
always delay rather than only on additional events. But I would like to
explore other options in the future.
I'm not sure how, but being able to detect if the AP responded to
nullfunc/probes prior to the kernel blowing away the connection would be
great. (like send our own nullfunc frames or something, not really sure...)
Its taking IWD about 4-5 seconds to reconnect, 3 seconds are the quick
scan and ~1-2 seconds for DHCP. (I need to look at why the quick scan is
taking that long, that seems like something isn't right). So if its at
all possible to roam that is best, obviously.
Thanks,
James
>
> Regards,
> -Denis
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] Packet/beacon loss roaming improvements
2023-11-02 11:58 ` James Prestwood
@ 2023-11-02 14:10 ` Denis Kenzior
2023-11-02 14:33 ` James Prestwood
0 siblings, 1 reply; 19+ messages in thread
From: Denis Kenzior @ 2023-11-02 14:10 UTC (permalink / raw)
To: James Prestwood, iwd
Hi James,
>
> I'm fine adding similar handling that I added for packet loss, except always
> delay rather than only on additional events. But I would like to explore other
> options in the future.
I guess the question is, does adding LOST BEACON handling actually help or
you're speculating that it does? I don't mind if we add this back in with a
delay, but I'm worried it doesn't actually do anything. 7 beacons lost in a row
is likely not recoverable territory.
I'm actually surprised the driver doesn't give you any other indications prior
to the lost beacon event. I would have expected RSSI or packet loss to manifest
itself prior?
>
> I'm not sure how, but being able to detect if the AP responded to
> nullfunc/probes prior to the kernel blowing away the connection would be great.
> (like send our own nullfunc frames or something, not really sure...)
Yes, the current implementation of this event in the kernel is pretty useless.
What we really need is an additional threshold that generates an event out to
userspace _before_ the kernel starts taking potentially irreversible actions.
Something like a pre-beacon-loss event that gets generated when 2-3 beacons are
lost in a row.
>
> Its taking IWD about 4-5 seconds to reconnect, 3 seconds are the quick scan and
> ~1-2 seconds for DHCP. (I need to look at why the quick scan is taking that
That's awful. How many frequencies are you generating? Even a full scan should
be ~1-2 seconds at most. And 1-2 seconds for DHCP also seems fishy. I've
tested our implementation to be sub-300ms or so.
> long, that seems like something isn't right). So if its at all possible to roam
> that is best, obviously.
>
Yep.
Regards,
-Denis
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] Packet/beacon loss roaming improvements
2023-11-02 14:10 ` Denis Kenzior
@ 2023-11-02 14:33 ` James Prestwood
2023-11-02 15:17 ` Denis Kenzior
0 siblings, 1 reply; 19+ messages in thread
From: James Prestwood @ 2023-11-02 14:33 UTC (permalink / raw)
To: Denis Kenzior, iwd
Hi Denis,
On 11/2/23 7:10 AM, Denis Kenzior wrote:
> Hi James,
>
>>
>> I'm fine adding similar handling that I added for packet loss, except
>> always delay rather than only on additional events. But I would like
>> to explore other options in the future.
>
> I guess the question is, does adding LOST BEACON handling actually help
> or you're speculating that it does? I don't mind if we add this back in
> with a delay, but I'm worried it doesn't actually do anything. 7
> beacons lost in a row is likely not recoverable territory.
In the behavior I witnessed which was actually quite reproducible at the
time, and yes roaming on beacon loss definitely helped. It avoided an
inevitable disconnect.
>
> I'm actually surprised the driver doesn't give you any other indications
> prior to the lost beacon event. I would have expected RSSI or packet
> loss to manifest itself prior?
There were packet loss events in some cases, and I'd say 75% of the time
IWD did end up roaming anyways (with a patch that roamed on beacon loss,
identical to packet loss). It was only the cases when we got a beacon
loss and IWD did not roam that the disconnect was always inevitable.
This was the motivation to force the roam for beacon loss.
RSSI wasn't always bad, but it was a very high load environment so I
assume that wasn't helping.
>
>>
>> I'm not sure how, but being able to detect if the AP responded to
>> nullfunc/probes prior to the kernel blowing away the connection would
>> be great. (like send our own nullfunc frames or something, not really
>> sure...)
>
> Yes, the current implementation of this event in the kernel is pretty
> useless. What we really need is an additional threshold that generates
> an event out to userspace _before_ the kernel starts taking potentially
> irreversible actions. Something like a pre-beacon-loss event that gets
> generated when 2-3 beacons are lost in a row.
Best would be an event like "Disconnection Imminent" (after the 2
nullfuncs failed) so userspace could start a roam. Like you mentioned
even if we knew a few beacons were lost, scanning is probably going to
ruin any chance of recovery.
On the bright side a disconnect isn't that terrible since IWD is super
fast at reconnecting, I was just looking for any optimization I could to
maintain the connection.
>
>>
>> Its taking IWD about 4-5 seconds to reconnect, 3 seconds are the quick
>> scan and ~1-2 seconds for DHCP. (I need to look at why the quick scan
>> is taking that
>
> That's awful. How many frequencies are you generating? Even a full
> scan should be ~1-2 seconds at most. And 1-2 seconds for DHCP also
> seems fishy. I've tested our implementation to be sub-300ms or so.
In theory only 5:
known_freq_set = known_networks_get_recent_frequencies(5);
But I'm also seeing a random New Scan Results event prior to the quick
scan. Anyways, its something I gotta look into.
>
>> long, that seems like something isn't right). So if its at all
>> possible to roam that is best, obviously.
>>
>
> Yep.
>
> Regards,
> -Denis
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] Packet/beacon loss roaming improvements
2023-11-02 14:33 ` James Prestwood
@ 2023-11-02 15:17 ` Denis Kenzior
2023-11-02 15:41 ` James Prestwood
0 siblings, 1 reply; 19+ messages in thread
From: Denis Kenzior @ 2023-11-02 15:17 UTC (permalink / raw)
To: James Prestwood, iwd
Hi James,
>
> In the behavior I witnessed which was actually quite reproducible at the time,
> and yes roaming on beacon loss definitely helped. It avoided an inevitable
> disconnect.
>
Okay, then lets put this in with a 1-2 delay to let the kernel disconnect if it
is going to.
>>
>>>
>>> I'm not sure how, but being able to detect if the AP responded to
>>> nullfunc/probes prior to the kernel blowing away the connection would be
>>> great. (like send our own nullfunc frames or something, not really sure...)
>>
>> Yes, the current implementation of this event in the kernel is pretty useless.
>> What we really need is an additional threshold that generates an event out to
>> userspace _before_ the kernel starts taking potentially irreversible actions.
>> Something like a pre-beacon-loss event that gets generated when 2-3 beacons
>> are lost in a row.
>
> Best would be an event like "Disconnection Imminent" (after the 2 nullfuncs
It is too late by that point.
> failed) so userspace could start a roam. Like you mentioned even if we knew a
> few beacons were lost, scanning is probably going to ruin any chance of recovery.
<snip>
>
> In theory only 5:
> known_freq_set = known_networks_get_recent_frequencies(5);
>
No, that picks the top 5 most recently used networks. The known frequencies for
each of the top 5 are added to the quick scan. This works okay most of the
time, but we do not rotate out known frequencies, so if you have a network with
lots of APs spread out over the spectrum, it might end up scanning quite a bit.
Regards,
-Denis
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] Packet/beacon loss roaming improvements
2023-11-02 15:17 ` Denis Kenzior
@ 2023-11-02 15:41 ` James Prestwood
2023-11-02 16:10 ` Denis Kenzior
0 siblings, 1 reply; 19+ messages in thread
From: James Prestwood @ 2023-11-02 15:41 UTC (permalink / raw)
To: Denis Kenzior, iwd
Hi Denis,
>>
>> In theory only 5:
>> known_freq_set = known_networks_get_recent_frequencies(5);
>>
>
> No, that picks the top 5 most recently used networks. The known
> frequencies for each of the top 5 are added to the quick scan. This
> works okay most of the time, but we do not rotate out known frequencies,
> so if you have a network with lots of APs spread out over the spectrum,
> it might end up scanning quite a bit.
Ah, that is exactly the problem then.
I wonder if we should be limiting both known networks and frequencies in
those networks. We do push new frequencies in order that they are added
so it may be nice to just pick the first 5 or so in the queue. Then we'd
be capping 25 frequencies at most instead of the whole spectrum in the
worst case.
>
> Regards,
> -Denis
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] Packet/beacon loss roaming improvements
2023-11-02 15:41 ` James Prestwood
@ 2023-11-02 16:10 ` Denis Kenzior
2023-11-02 16:13 ` James Prestwood
0 siblings, 1 reply; 19+ messages in thread
From: Denis Kenzior @ 2023-11-02 16:10 UTC (permalink / raw)
To: James Prestwood, iwd
Hi James,
>
> I wonder if we should be limiting both known networks and frequencies in those
> networks. We do push new frequencies in order that they are added so it may be
> nice to just pick the first 5 or so in the queue. Then we'd be capping 25
> frequencies at most instead of the whole spectrum in the worst case.
We should, but we don't track them in LRU order at the moment.
Regards,
-Denis
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] Packet/beacon loss roaming improvements
2023-11-02 16:10 ` Denis Kenzior
@ 2023-11-02 16:13 ` James Prestwood
0 siblings, 0 replies; 19+ messages in thread
From: James Prestwood @ 2023-11-02 16:13 UTC (permalink / raw)
To: Denis Kenzior, iwd
Hi Denis,
On 11/2/23 9:10 AM, Denis Kenzior wrote:
> Hi James,
>
>>
>> I wonder if we should be limiting both known networks and frequencies
>> in those networks. We do push new frequencies in order that they are
>> added so it may be nice to just pick the first 5 or so in the queue.
>> Then we'd be capping 25 frequencies at most instead of the whole
>> spectrum in the worst case.
>
> We should, but we don't track them in LRU order at the moment.
Yeah I see that now, we just add them as we see BSS's, not when we
connect. I'll put it in the backlog and see if I can come up with something.
>
> Regards,
> -Denis
>
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2023-11-02 16:13 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-10-30 13:48 [PATCH 0/4] Packet/beacon loss roaming improvements James Prestwood
2023-10-30 13:48 ` [PATCH 1/4] station: rename ap_directed_roam to force_roam James Prestwood
2023-10-30 13:48 ` [PATCH 2/4] station: start roam on beacon loss event James Prestwood
2023-10-30 13:48 ` [PATCH 3/4] netdev: handle/send " James Prestwood
2023-10-30 13:48 ` [PATCH 4/4] station: rate limit packet loss roam scans James Prestwood
2023-10-30 14:48 ` Denis Kenzior
2023-10-30 15:00 ` [PATCH 0/4] Packet/beacon loss roaming improvements Denis Kenzior
2023-10-30 15:37 ` James Prestwood
2023-10-30 17:05 ` Denis Kenzior
2023-10-30 17:37 ` James Prestwood
2023-11-01 12:07 ` James Prestwood
2023-11-02 1:39 ` Denis Kenzior
2023-11-02 11:58 ` James Prestwood
2023-11-02 14:10 ` Denis Kenzior
2023-11-02 14:33 ` James Prestwood
2023-11-02 15:17 ` Denis Kenzior
2023-11-02 15:41 ` James Prestwood
2023-11-02 16:10 ` Denis Kenzior
2023-11-02 16:13 ` James Prestwood
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox