* ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins
@ 2015-11-04 5:20 Avery Pennarun
0 siblings, 0 replies; 7+ messages in thread
From: Avery Pennarun @ 2015-11-04 5:20 UTC (permalink / raw)
To: linux-wireless, ath9k-devel; +Cc: Tim Shepard
[fixed ath9k list address. sorry for the spam]
Hi all,
I have a pretty weird problem I've been chasing for a few weeks and
have narrowed it down, but not quite solved it. It may be caused by
bugs in aggregation-related code.
Steps:
- Set up an ath9k-based Linux AP on an ARM processor (currently using
this version of backports, though I've tried older and newer versions
with no change: "backported from Linux (next-20150525-0-gc201847)
using backports backports-20150525-0-g49969bd")
- Join my iPhone 4S (running iOS 7.1.2) to the network
- Use it for a while
- Eventually it stays connected, but Internet access stops working
- Wireless packet captures show that packets are received *from* the
iPhone, ACKs for those packets are returned by the ath9k, and the
packets are correctly forwarded to the AP's br0 interface. Outgoing
packets, however, show up on br0 and wlan0 in tcpdump but never make
it onto the air.
- Putting the iPhone 4S into airplane mode and then letting it
reconnect fixes it for a few more seconds/minutes before it stops
again.
More details:
- It only seems to happen to my iPhone 4S client (never seen it with a
different client).
- It only seems to happen with my ath9k AP.
- It only seems to happen on my home network (another instance of the
same AP hardware on another network doesn't trigger the problem).
- It only seems to happen when no other 802.11n-capable devices are
connected to the same AP.
- The moment I join an 802.11n-capable device to the AP, traffic
instantly unblocks (see packet capture below).
- Joining an 802.11g-only device (no aggregation) does *not* unblock traffic.
- Disabling encryption and turning wmm_enable on and off have no effect.
- Disabling 802.11n support on the AP (so that everyone has to use
802.11g) makes the problem go away.
- 'ip -s link show dev wlan0' shows tx packet counters continuing to
increase during the outage, even though packets aren't flowing.
- I applied a patch from Tim Shepard to track the most recent tx
attempt, acked tx, and rx packet times inside mac80211. According to
this data, mac80211 thinks rx happened at most a couple of seconds ago
(as expected). The most recent tx was acked, but it was back around
the time the outage started. Note that this disagrees with 'ip -s
link' and tcpdump, which think they transmitted much more recently
than that. (The patch is here:
https://gfiber-review.googlesource.com/#/c/1250/ )
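One way to watch the interface counters during an outage, instead of eyeballing 'ip -s link', is to sample them from sysfs. This is only a minimal sketch, assuming the usual Linux layout /sys/class/net/<iface>/statistics/tx_packets; the function names and the stats_root parameter are made up for illustration and are not part of the setup described in this thread.

```python
from pathlib import Path
import time


def read_tx_packets(iface, stats_root="/sys/class/net"):
    """Return the tx_packets counter for iface as an integer."""
    path = Path(stats_root) / iface / "statistics" / "tx_packets"
    return int(path.read_text().strip())


def tx_delta(iface, interval=1.0, stats_root="/sys/class/net"):
    """Sample the counter twice, `interval` seconds apart, and return the growth."""
    before = read_tx_packets(iface, stats_root)
    time.sleep(interval)
    after = read_tx_packets(iface, stats_root)
    return after - before
```

For example, tx_delta("wlan0", 5.0) returns how many packets the counter gained over five seconds. A positive value while a sniffer sees nothing on the air matches the disagreement described above.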
I captured a pcap of a new 802.11n-capable device joining the network
and unblocking the transmit. The action starts around frame 325:
http://apenwarr.ca/tmp/iPod4-fixing-iPhone4-trimmed.pcap.gz
In this pcap, the main players are:
ath9k AP: 88:dc:96:08:60:50
iPhone 4S with the problem: e4:25:e7:73:e6:31
New client fixing the problem (iPod 4): 18:e7:f4:7e:c1:42
Observations from the pcap:
- Upstream packets (iPhone->ath9k) are received and acked (see e.g. frame 154)
- Beacons from the ath9k show an empty TIM bitmap until the iPod
joins, then it's nonempty and things unblock.
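To check the TIM observation programmatically, the element body can be decoded by hand. The sketch below follows the standard 802.11 TIM layout (DTIM Count, DTIM Period, Bitmap Control, Partial Virtual Bitmap) and takes only the bytes after the element ID and length. The function name is hypothetical, and extracting the element from a beacon in the pcap is left to whatever capture library is in use.

```python
def parse_tim(body):
    """Decode the body of a TIM element (the bytes after the ID and length).

    Returns (dtim_count, dtim_period, multicast_buffered, aids), where aids
    is the sorted list of association IDs flagged as having buffered frames.
    """
    if len(body) < 4:
        raise ValueError("TIM body needs at least 4 bytes")
    dtim_count, dtim_period, bitmap_control = body[0], body[1], body[2]
    partial_bitmap = body[3:]
    multicast_buffered = bool(bitmap_control & 0x01)
    # Bits 1-7 of Bitmap Control hold N1/2, where N1 is the index of the
    # first octet of the full virtual bitmap carried in this element.
    first_octet = (bitmap_control >> 1) * 2
    aids = []
    for i, octet in enumerate(partial_bitmap):
        for bit in range(8):
            if octet & (1 << bit):
                aid = (first_octet + i) * 8 + bit
                if aid != 0:  # AID 0 is reserved for group-addressed traffic
                    aids.append(aid)
    return dtim_count, dtim_period, multicast_buffered, aids
```

An "empty" TIM in the sense used above is one where the returned aids list is empty, i.e. the AP is advertising no buffered unicast frames for any station.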
Does anyone have any thoughts about what to look for here?
Have fun,
Avery
* ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins
@ 2015-11-04 5:03 Avery Pennarun
2016-02-16 21:28 ` Avery Pennarun
0 siblings, 1 reply; 7+ messages in thread
From: Avery Pennarun @ 2015-11-04 5:03 UTC (permalink / raw)
To: linux-wireless, ath9k-devel; +Cc: Tim Shepard
* Re: ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins
2015-11-04 5:03 Avery Pennarun
@ 2016-02-16 21:28 ` Avery Pennarun
2016-02-16 22:05 ` Johannes Berg
0 siblings, 1 reply; 7+ messages in thread
From: Avery Pennarun @ 2016-02-16 21:28 UTC (permalink / raw)
To: linux-wireless, ath9k-devel, johannes, nbd; +Cc: Avery Pennarun
Okay, I've made much more progress on this old thread. I haven't actually
fixed the bug, which I suspect is a race condition only on multicore
machines, but I at least have better reproduction steps and a workaround.
The bug seems to trigger when three things happen at once:
1) Background interference causes retries
2) AP wants to send data to the STA, which has been idle for a while
3) We want to negotiate a new BA session from AP to STA.
Sometimes, the background interference will cause the time between ADDBA
Request (from AP) and ADDBA Response (from STA) to be longer than usual. In
my tests, it's usually <1ms, but in high-interference situations I've seen
it be >3ms. Sometimes, when the delay is longer, I see the symptom that the
agg_status file for the station in question starts showing TID#0's "pending"
column increasing slowly, until it eventually reaches 64. A wifi capture on
a separate sniffer indicates that no data is being transmitted to that
station, although traffic to other stations (and broadcast/multicast)
continues unabated. I guess this means the device's queues are themselves
not stopped, but the station's per-TID aggregation queue is stuck.
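Watching the "pending" column by hand gets tedious, so here is a rough sketch of a parser. It assumes the agg_status file has one row per TID that starts with the TID number and ends with the "pending" count, with header lines above; that layout is my reading of the description above, not something verified against a particular kernel version. Both function names are hypothetical.

```python
def pending_by_tid(agg_status_text):
    """Map TID -> pending count from the text of an agg_status file.

    Assumes one row per TID, starting with the TID number and ending with
    the "pending" column; header and blank lines are skipped.
    """
    pending = {}
    for line in agg_status_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # header or blank line
        try:
            pending[int(fields[0])] = int(fields[-1])
        except ValueError:
            continue  # last column was not a plain decimal number
    return pending


def looks_stuck(counts):
    """True if successive pending counts never drop and have grown overall."""
    if len(counts) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(counts, counts[1:]))
    return never_drops and counts[-1] > counts[0]
```

Feeding looks_stuck() the TID#0 values from a few reads taken a second or so apart gives a crude indicator of the symptom: a pending count that only ever climbs.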
Twiddling the agg_status of a different queue (in this case TID#1) unblocks
TID#0:
echo "tx start 1" >/sys/kernel/debug/ieee80211/phy0/.../agg_status
So does having another aggregation-capable device join the network. Having
an 802.11g-only device join the network does *not* unblock the queue.
However, trying to stop TID#0 doesn't help (and it also doesn't successfully
stop the aggregation):
echo "tx stop 0" >/sys/kernel/debug/ieee80211/phy0/.../agg_status
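For scripting the same commands, a small helper along these lines may be handy. It is a sketch only: the helper name is made up, and the path is passed in by the caller because the station-specific part of the debugfs path is elided ("...") above. It writes the same "tx start N" / "tx stop N" strings as the echo commands, including the trailing newline.

```python
def agg_command(path, action, tid):
    """Write e.g. "tx start 1" to the given agg_status file."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    if not 0 <= tid <= 15:
        raise ValueError("TID must be in the range 0-15")
    command = "tx %s %d" % (action, tid)
    with open(path, "w") as f:
        f.write(command + "\n")  # echo appends a newline too
    return command
```

The workaround described above would then be agg_command(path, "start", 1) for the affected station's agg_status file.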
The following patch makes the problem easier to reproduce by letting you
turn the aggregation timeout way down. For myself, I used a
default_agg_timeout of 500ms and just pinged repeatedly once per second from
the AP to STA. This causes the aggregation sessions to be repeatedly
brought up and torn down, which triggers the problem for me within a few
minutes (when run on a channel with fairly high noise).
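The once-per-second ping part of this reproduction can be scripted roughly as follows. This is a sketch under assumptions: it shells out to the system ping with "-c 1 -W <seconds>" (the iputils flags for one request and a reply timeout), and the min_losses threshold for calling something an outage is an arbitrary value chosen for illustration. It does not set default_agg_timeout; that still has to be done through the debugfs variable the patch adds.

```python
import subprocess
import time


def ping_once(host, timeout_s=1):
    """Send one echo request; True if a reply arrived within the timeout."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ping_loop(host, count, interval_s=1.0, ping=ping_once):
    """Ping `count` times, `interval_s` apart, returning the list of results."""
    results = []
    for _ in range(count):
        results.append(ping(host))
        time.sleep(interval_s)
    return results


def first_outage(results, min_losses=5):
    """Index where the first run of min_losses straight failures starts, or None."""
    run_start, run_len = None, 0
    for i, ok in enumerate(results):
        if ok:
            run_start, run_len = None, 0
            continue
        if run_len == 0:
            run_start = i
        run_len += 1
        if run_len >= min_losses:
            return run_start
    return None
```

Running ping_loop() against the STA's address and passing the result to first_outage() gives an approximate time (in ping intervals) at which traffic stopped, which can then be lined up against a sniffer capture.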
Changing default_agg_timeout to zero (as it is on most non-ath9k drivers)
makes the problem pretty much go away. However, I think it's because I'm
just dodging the code path that triggers a race condition.
Notes:
- I'm using exactly the same ath9k driver (currently 20150525, but we've
tried newer ones with no difference) on two totally different platforms: a
dual-core mindspeed c2k host CPU (ARMv7) with separate ath9k, and a
single-core QCA9531 (MIPS) with on-chip ath9k.
- I've been unable to trigger the problem on the QCA9531, but I have on
MIPS.
The aggregation code is... a little hairy. Does anyone have any guesses
where I might look for the race condition? Or better still, a patch I can
try?
Avery Pennarun (1):
mac80211: add a debugfs var for the default aggregation timeout.
net/mac80211/debugfs_netdev.c | 4 ++++
net/mac80211/rc80211_minstrel_ht.c | 4 +++-
2 files changed, 7 insertions(+), 1 deletion(-)
--
2.7.0.rc3.207.g0ac5344
* Re: ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins
2016-02-16 21:28 ` Avery Pennarun
@ 2016-02-16 22:05 ` Johannes Berg
2016-02-17 4:32 ` Avery Pennarun
0 siblings, 1 reply; 7+ messages in thread
From: Johannes Berg @ 2016-02-16 22:05 UTC (permalink / raw)
To: Avery Pennarun, linux-wireless, ath9k-devel, nbd
On Tue, 2016-02-16 at 16:28 -0500, Avery Pennarun wrote:
>
> Changing default_agg_timeout to zero (as it is on most non-ath9k
> drivers) makes the problem pretty much go away. However, I think
> it's because I'm just dodging the code path that triggers a race
> condition.
That does seem likely. Perhaps you could reproduce it while running
mac80211 tracing? There should be a fair amount of information about
aggregation and queue stops in there, though as you note queue stops
aren't really happening, only aggregation related things. Perhaps the
tracepoints for that aren't quite sufficient.
> Notes:
>
> - I'm using exactly the same ath9k driver (currently 20150525, but
> we've tried newer ones with no difference) on two totally different
> platforms: a dual-core mindspeed c2k host CPU (ARMv7) with separate
> ath9k, and a single-core QCA9531 (MIPS) with on-chip ath9k.
>
> - I've been unable to trigger the problem on the QCA9531, but I have
> on MIPS.
That's ... not what I would have expected, especially since the MIPS is
single core. That makes the races stranger than expected.
> The aggregation code is... a little hairy. Does anyone have any
> guesses where I might look for the race condition? Or better still,
> a patch I can try?
I'm not aware of any race conditions in the code right now :)
johannes
* Re: ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins
2016-02-16 22:05 ` Johannes Berg
@ 2016-02-17 4:32 ` Avery Pennarun
2016-02-17 6:23 ` Krishna Chaitanya
0 siblings, 1 reply; 7+ messages in thread
From: Avery Pennarun @ 2016-02-17 4:32 UTC (permalink / raw)
To: Johannes Berg; +Cc: linux-wireless, ath9k-devel, Felix Fietkau
On Tue, Feb 16, 2016 at 5:05 PM, Johannes Berg
<johannes@sipsolutions.net> wrote:
> On Tue, 2016-02-16 at 16:28 -0500, Avery Pennarun wrote:
>> Changing default_agg_timeout to zero (as it is on most non-ath9k
>> drivers) makes the problem pretty much go away. However, I think
>> it's because I'm just dodging the code path that triggers a race
>> condition.
>
> That does seem likely. Perhaps you could reproduce it while running
> mac80211 tracing? There should be a fair amount of information about
> aggregation and queue stops in there, though as you note queue stops
> aren't really happening, only aggregation related things. Perhaps the
> tracepoints for that aren't quite sufficient.
So far that hasn't seemed to help, although maybe you can read traces
better than I can. The big problem is that the actual queue doesn't
seem to have stopped; it might be an ath9k bug.
>> Notes:
>>
>> - I'm using exactly the same ath9k driver (currently 20150525, but
>> we've tried newer ones with no difference) on two totally different
>> platforms: a dual-core mindspeed c2k host CPU (ARMv7) with separate
>> ath9k, and a single-core QCA9531 (MIPS) with on-chip ath9k.
>>
>> - I've been unable to trigger the problem on the QCA9531, but I have
>> on MIPS.
>
> That's ... not what I would have expected, especially since the MIPS is
> single core. That makes the races stranger than expected.
Oops, typo. The QCA9531 *is* MIPS. The one where it triggers is the
dual-core ARM.
>> The aggregation code is... a little hairy. Does anyone have any
>> guesses where I might look for the race condition? Or better still,
>> a patch I can try?
>
> I'm not aware of any race conditions in the code right now :)
Aw. That would have made it a lot easier!
* Re: ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins
2016-02-17 4:32 ` Avery Pennarun
@ 2016-02-17 6:23 ` Krishna Chaitanya
2016-02-17 7:05 ` Avery Pennarun
0 siblings, 1 reply; 7+ messages in thread
From: Krishna Chaitanya @ 2016-02-17 6:23 UTC (permalink / raw)
To: Avery Pennarun; +Cc: Johannes Berg, linux-wireless, ath9k-devel, Felix Fietkau
On Wed, Feb 17, 2016 at 10:02 AM, Avery Pennarun <apenwarr@gmail.com> wrote:
>
> On Tue, Feb 16, 2016 at 5:05 PM, Johannes Berg
> <johannes@sipsolutions.net> wrote:
> > On Tue, 2016-02-16 at 16:28 -0500, Avery Pennarun wrote:
> >> Changing default_agg_timeout to zero (as it is on most non-ath9k
> >> drivers) makes the problem pretty much go away. However, I think
> >> it's because I'm just dodging the code path that triggers a race
> >> condition.
> >
> > That does seem likely. Perhaps you could reproduce it while running
> > mac80211 tracing? There should be a fair amount of information about
> > aggregation and queue stops in there, though as you note queue stops
> > aren't really happening, only aggregation related things. Perhaps the
> > tracepoints for that aren't quite sufficient.
>
> So far that hasn't seemed to help, although maybe you can read traces
> better than I can. The big problem is that the actual queue doesn't
> seem to have stopped; it might be an ath9k bug.
>
> >> Notes:
> >>
> >> - I'm using exactly the same ath9k driver (currently 20150525, but
> >> we've tried newer ones with no difference) on two totally different
> >> platforms: a dual-core mindspeed c2k host CPU (ARMv7) with separate
> >> ath9k, and a single-core QCA9531 (MIPS) with on-chip ath9k.
> >>
> >> - I've been unable to trigger the problem on the QCA9531, but I have
> >> on MIPS.
> >
> > That's ... not what I would have expected, especially since the MIPS is
> > single core. That makes the races stranger than expected.
>
> Oops, typo. The QCA9531 *is* MIPS. The one where it triggers is the
> dual-core ARM.
>
> >> The aggregation code is... a little hairy. Does anyone have any
> >> guesses where I might look for the race condition? Or better still,
> >> a patch I can try?
> >
> > I'm not aware of any race conditions in the code right now :)
>
> Aw. That would have made it a lot easier!
From a quick glance at the symptoms, I think the patch below is worth
a try, even though I don't see you doing any background scans, which
is the case it applies to.
https://patchwork.kernel.org/patch/8015321/
* Re: ath9k(?): AP stops sending traffic to iPhone 4S until another 802.11n-capable STA joins
2016-02-17 6:23 ` Krishna Chaitanya
@ 2016-02-17 7:05 ` Avery Pennarun
0 siblings, 0 replies; 7+ messages in thread
From: Avery Pennarun @ 2016-02-17 7:05 UTC (permalink / raw)
To: Krishna Chaitanya
Cc: Johannes Berg, linux-wireless, ath9k-devel, Felix Fietkau
On Wed, Feb 17, 2016 at 1:23 AM, Krishna Chaitanya
<chaitanya.mgit@gmail.com> wrote:
> From a quick glance of symptoms, i think the below patch is worth a
> try, even though i don't see you are doing any background scans for
> which this applies.
>
> https://patchwork.kernel.org/patch/8015321/
Thanks, Krishna. We do run background scans occasionally, but none
was in progress around the time of the glitch, and the problem was
still reproducible with background scans disabled. We also aren't
combining AP and STA on the same radio (in this particular use case).
^ permalink raw reply [flat|nested] 7+ messages in thread