From: Oleksandr Natalenko <oleksandr@natalenko.name>
To: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org, linux-wireless@vger.kernel.org,
Jakub Kicinski <kuba@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
Johannes Berg <johannes@sipsolutions.net>,
Felix Fietkau <nbd@nbd.name>,
Lorenzo Bianconi <lorenzo.bianconi83@gmail.com>,
Ryder Lee <ryder.lee@mediatek.com>,
Kalle Valo <kvalo@codeaurora.org>
Subject: Re: Divide error in minstrel_ht_get_tp_avg()
Date: Sat, 21 Aug 2021 13:14:34 +0200 [thread overview]
Message-ID: <2137329.lhgpPQ2jGW@natalenko.name> (raw)
In-Reply-To: <20210607145223.tlxo5ge42mef44m5@spock.localdomain>
Hello.
On pondělí 7. června 2021 16:52:30 CEST Oleksandr Natalenko wrote:
> On Sat, May 29, 2021 at 06:57:29PM +0200, Oleksandr Natalenko wrote:
> > The following woe happened on my home router running as a Wi-Fi AP with
> > MT7612:
> >
> > ```
> > [16592.157962] divide error: 0000 [#1] PREEMPT SMP PTI
> > [16592.163169] CPU: 2 PID: 683 Comm: mt76-tx phy0 Tainted: G C
> > 5.12.0-pf4 #1 [16592.171745] Hardware name: To Be Filled By O.E.M. To
> > Be Filled By O.E.M./J3710-ITX, BIOS P1.50 04/16/2018 [16592.182155] RIP:
> > 0010:minstrel_ht_get_tp_avg+0xb1/0x100 [mac80211] [16592.188795] Code: 04
> > 00 00 00 7f 1c 31 c9 81 fa f1 49 02 00 0f 9c c1 8d 0c cd 08 00 00 00 eb
> > 08 8b 47 30 b9 01 00 00 00 69 c0 e8 03 00 00 31 d2 <f7> f1 ba 66 0e 00 00
> > 39 d6 0f 4f f2 49 63 d1 48 8d 0c 52 48 8d 0c [16592.208796] RSP:
> > 0018:ffffb6a601293be8 EFLAGS: 00010246
> > [16592.214292] RAX: 000000000001a5e0 RBX: ffff9c658ee95170 RCX:
> > 0000000000000000 [16592.222054] RDX: 0000000000000000 RSI:
> > 0000000000000585 RDI: ffff9c658ee94000 [16592.229620] RBP:
> > 0000000000000006 R08: 0000000000000006 R09: 0000000000000012
> > [16592.237254] R10: 0000000000000007 R11: 0000000000000000 R12:
> > ffff9c658ee94000 [16592.244828] R13: 0000000000000012 R14:
> > ffff9c658ee9534c R15: 0000000000000585 [16592.252635] FS:
> > 0000000000000000(0000) GS:ffff9c66f8500000(0000) knlGS:0000000000000000
> > [16592.261086] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [16592.267326] CR2: 0000555d81abe7b8 CR3: 0000000116c10000 CR4:
> > 00000000001006e0 [16592.274904] Call Trace:
> > [16592.277549] minstrel_ht_update_stats+0x1fe/0x1320 [mac80211]
> > [16592.283730] minstrel_ht_tx_status+0x67f/0x710 [mac80211]
> > [16592.289442] rate_control_tx_status+0x6e/0xb0 [mac80211]
> > [16592.295267] ieee80211_tx_status_ext+0x22e/0xb00 [mac80211]
> > [16592.311290] ieee80211_tx_status+0x7d/0xa0 [mac80211]
> > [16592.316714] mt76_tx_status_unlock+0x83/0xa0 [mt76]
> > [16592.321999] mt76x02_send_tx_status+0x1b7/0x400 [mt76x02_lib]
> > [16592.334516] mt76x02_tx_worker+0x8f/0xd0 [mt76x02_lib]
> > [16592.340091] __mt76_worker_fn+0x78/0xb0 [mt76]
> > [16592.345067] kthread+0x183/0x1b0
> > [16592.353378] ret_from_fork+0x22/0x30
> > ```
> >
> > `faddr2line` says it is this:
> >
> > ```
> >
> > 430 int
> > 431 minstrel_ht_get_tp_avg(struct minstrel_ht_sta *mi, int group, int
> > rate,
> > 432 int prob_avg)
> > 433 {
> >
> > …
> >
> > 435 unsigned int ampdu_len = 1;
> >
> > …
> >
> > 441 if (minstrel_ht_is_legacy_group(group))
> >
> > …
> >
> > 443 else
> > 444 ampdu_len = minstrel_ht_avg_ampdu_len(mi);
> >
> > …
> >
> > 446 nsecs = 1000 * overhead / ampdu_len;
> >
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > ```
> >
> > So, it seems `minstrel_ht_avg_ampdu_len()` can return 0, which is not
> > really legitimate.
> >
> > Looking at `minstrel_ht_avg_ampdu_len()`, I see the following:
> >
> > ```
> > 16:#define MINSTREL_SCALE 12
> > …
> > 18:#define MINSTREL_TRUNC(val) ((val) >> MINSTREL_SCALE)
> > ```
> >
> > ```
> >
> > 401 static unsigned int
> > 402 minstrel_ht_avg_ampdu_len(struct minstrel_ht_sta *mi)
> > 403 {
> >
> > …
> >
> > 406 if (mi->avg_ampdu_len)
> > 407 return MINSTREL_TRUNC(mi->avg_ampdu_len);
> >
> > ```
> >
> > So, likely, `mi->avg_ampdu_len` is non-zero, but it's too small, hence
> > right bitshift makes it zero.
> >
> > Should there be some protection against such a situation?
> >
> > Felix, could you maybe check this? Looks like you were the last one to
> > overhaul that part of code.
>
> Quick hack from me:
>
> commit a23d3f77658de968c5bb852ba478402c93958f7f
> Author: Oleksandr Natalenko <oleksandr@natalenko.name>
> Date: Mon Jun 7 16:45:45 2021 +0200
>
> mac80211: minstrel_ht: force ampdu_len to be > 0
>
> This is a hack.
>
> Work around the following crash:
>
> ```
> divide error: 0000 [#1] PREEMPT SMP PTI
> CPU: 2 PID: 683 Comm: mt76-tx phy0 Tainted: G C
> 5.12.0-pf4 #1 Hardware name: To Be Filled By O.E.M. To Be Filled By
> O.E.M./J3710-ITX, BIOS P1.50 04/16/2018 RIP:
> 0010:minstrel_ht_get_tp_avg+0xb1/0x100 [mac80211]
> …
> Call Trace:
> minstrel_ht_update_stats+0x1fe/0x1320 [mac80211]
> minstrel_ht_tx_status+0x67f/0x710 [mac80211]
> rate_control_tx_status+0x6e/0xb0 [mac80211]
> ieee80211_tx_status_ext+0x22e/0xb00 [mac80211]
> ieee80211_tx_status+0x7d/0xa0 [mac80211]
> mt76_tx_status_unlock+0x83/0xa0 [mt76]
> mt76x02_send_tx_status+0x1b7/0x400 [mt76x02_lib]
> mt76x02_tx_worker+0x8f/0xd0 [mt76x02_lib]
> __mt76_worker_fn+0x78/0xb0 [mt76]
> kthread+0x183/0x1b0
> ret_from_fork+0x22/0x30
> ```
>
> Link:
> https://lore.kernel.org/lkml/20210529165728.bskaozwtmwxnvucx@spock.localdom
> ain/ Signed-off-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> ---
> diff --git a/net/mac80211/rc80211_minstrel_ht.c
> b/net/mac80211/rc80211_minstrel_ht.c index ecad9b10984f..6ad188c4101e
> 100644
> --- a/net/mac80211/rc80211_minstrel_ht.c
> +++ b/net/mac80211/rc80211_minstrel_ht.c
> @@ -440,8 +440,13 @@ minstrel_ht_get_tp_avg(struct minstrel_ht_sta *mi, int
> group, int rate,
>
> if (minstrel_ht_is_legacy_group(group))
> overhead = mi->overhead_legacy;
> - else
> + else {
> ampdu_len = minstrel_ht_avg_ampdu_len(mi);
> + if (unlikely(!ampdu_len)) {
> + pr_err_once("minstrel_ht_get_tp_avg: ampdu_len == 0!");
> + ampdu_len = 1;
> + }
> + }
>
> nsecs = 1000 * overhead / ampdu_len;
> nsecs += minstrel_mcs_groups[group].duration[rate] <<
I've also found out that this happens exactly at midnight, IOW, at 00:00:00.
Not every midnight, though.
Does it have something to do with timekeeping? This is strange, I wouldn't
expect kernel to act like that. Probably, some client sends malformed frame?
How to find out?
Thanks.
P.S. The hack works, BTW, the system survives, and AP runs as if nothing
happened.
--
Oleksandr Natalenko (post-factum)
next prev parent reply other threads:[~2021-08-21 11:14 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-05-29 16:57 Divide error in minstrel_ht_get_tp_avg() Oleksandr Natalenko
2021-06-07 14:52 ` Oleksandr Natalenko
2021-08-21 11:14 ` Oleksandr Natalenko [this message]
2021-08-21 12:19 ` Willy Tarreau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2137329.lhgpPQ2jGW@natalenko.name \
--to=oleksandr@natalenko.name \
--cc=davem@davemloft.net \
--cc=johannes@sipsolutions.net \
--cc=kuba@kernel.org \
--cc=kvalo@codeaurora.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-wireless@vger.kernel.org \
--cc=lorenzo.bianconi83@gmail.com \
--cc=nbd@nbd.name \
--cc=netdev@vger.kernel.org \
--cc=ryder.lee@mediatek.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).