From: Nils Radtke <lkml@Think-Future.de>
To: reinette chatre <reinette.chatre@intel.com>
Cc: linux-kernel@vger.kernel.org, linux-wireless@vger.kernel.org
Subject: Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
Date: Mon, 31 May 2010 22:12:25 +0200 [thread overview]
Message-ID: <20100531201225.GA6327@localhost> (raw)
In-Reply-To: <1274380408.2091.9272.camel@rchatre-DESK>
[-- Attachment #1: Type: text/plain, Size: 7370 bytes --]
Hi Reinette,
First off:
Linux mypole 2.6.33.4 #23 SMP PREEMPT Sat May 15 20:27:33 CEST 2010 i686 GNU/Linux
On Thu 2010-05-20 @ 11-33-28AM -0700, reinette chatre wrote:
# On Thu, 2010-05-20 at 05:15 -0700, Nils Radtke wrote:
# > #
# > # To address (1), could you please run with attached debug patch and also
# > # enable rate scaling debugging. That will be "modprobe iwlagn
# > # debug=0x143fff).
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c: In function ‘rs_collect_tx_data’:
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: ‘priv’ undeclared (first use in this function)
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: (Each undeclared identifier is reported only once
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: for each function it appears in.)
# >
# > This happens when compiling w/ the patch applied cleanly against .33.3
# > I'll try to fix it, then conduct the field test.
#
# Sorry ... and thanks.
#
# > For the latter, do
# > you need the same kind of log as for the previous one?
# The goal of this patch is to find the reason behind the error
# "expected_tpt should have been calculated by now". From what I
# understand you only encountered that in one of your tests, not all. Any
# test you can run to reproduce that error will be welcome.
Yep. This expected_tpt stuff happens IRC (see mails for certainty, though)
exclusively on site B.
Ok, so that's the goal. What could I do to advance us an additional step
at a time avoiding pushing hundreds of kilobytes of logs uplink?
To reproduce it, I have to be on site B and just start surfing. It's a matter
of _short_ time until the driver hits the wall.
I'm suspecting still getting those expected_tpt thingies (see below). Though today
it seems I've been lucky, it (maybe) only happened once (or never today).
# Thinking about your question more ... I believe your previous debug logs
# were created with debug flag 0x43fff. For this iteration, please use
# debug flag 0x143fff.
If you insist on a test using .33.3, I will do so but that will have to wait.
Meanwhile I used this patch for .34 to fix the build err from your dbg patch:
drivers/net/wireless/iwlwifi/iwl-agn-rs.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
Index: linux/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
===================================================================
--- linux.orig/drivers/net/wireless/iwlwifi/iwl-agn-rs.c 2010-05-31 13:21:48.000000000 +0200
+++ linux/drivers/net/wireless/iwlwifi/iwl-agn-rs.c 2010-05-31 13:25:52.000000000 +0200
@@ -365,7 +365,8 @@
* packets.
*/
static int rs_collect_tx_data(struct iwl_scale_tbl_info *tbl,
- int scale_index, int attempts, int successes)
+ int scale_index, int attempts, int successes,
+ struct iwl_priv *priv)
{
struct iwl_rate_scale_data *window = NULL;
static const u64 mask = (((u64)1) << (IWL_RATE_MAX_WINDOW - 1));
@@ -868,7 +869,7 @@
&rs_index);
rs_collect_tx_data(curr_tbl, rs_index,
info->status.ampdu_ack_len,
- info->status.ampdu_ack_map);
+ info->status.ampdu_ack_map, priv);
/* Update success/fail counts if not searching for new mode */
if (lq_sta->stay_in_tbl) {
@@ -902,7 +903,7 @@
else
continue;
rs_collect_tx_data(tmp_tbl, rs_index, 1,
- i < retries ? 0 : legacy_success);
+ i < retries ? 0 : legacy_success, priv);
}
/* Update success/fail counts if not searching for new mode */
Test conducted using debug flag 0x143fff.
Test on Site B resulted immediately in a hard crash upon resume, that is I got to X,
activated wireless, echoed 0x143fff to the sysfile and that was it.
But I didn't count on that one so I had no console to watch for it. So right now
I got no clue what caused the crash, there's nothing in the logs, of course..
Next thing was to redo test on site B, but this time I switched to console beforehand.
Sure enough, this time nothing happened..
Log appended as bz2.
Hopefully lines like the following don't indicate essential info getting dropped:
May 31 17:23:17 localhost kernel: [91800.091565] net_ratelimit: 70 callbacks suppressed
This line indicates the first timestamp _after_ the crash:
May 31 17:35:19 localhost kernel: [ 69.488456]
The crash happened after site A and on site B. Just arrived, opened lid and *crash*.
I noticed in iwl-agn-rs.c:2080:
BUG_ON(window->average_tpt != ((window->success_ratio *
tbl->expected_tpt[index] + 64) / 128));
Could that be again the point that hit me today when the machine crashed once?
Would you mind changing this into a milder WARN? That way I wouldn't hit the wall
that hard. And I would notice it anyway while skimming the logs as we still are on the
hunt. It's more maintainable if it's a WARN in the src instead of me patching it w/ any
update..
Wasn't this BUG_ON a WARNING in .33.3? (didn't check..)
The dbg log contains all types of the errs happening here:
- "deauthenticated" msgs w/ "reason 2"
- "request scan called when driver not ready"
- iwl_tx_agg_start
- of course the "expected_tpt should have been[..]" don't show up anymore, the source
has no more WARN regarding this..
And the rest of the 9Mb dbg log..
Could you tell me a bit about your idea of how to track those down? Maybe we can
speed things a little up. The logging and testing stuff takes a lot of time and
most of it I have no clue why that might help or what the goal is..
Are you online in some #chan?
# > # Regarding (2): This is a common issue in busy environments where AP
# > # decides to deathenticate station after it does not receive an ack for
# > # data sent after a few retries. Was this test done in busy environment?
# > Both. This happens in busy environment as well as in an idle one. Can't tell
# > yet whether there're more of those msgs in busy env. I start to feel against
# > Cisco APs..
#
# I don't know ... perhaps these APs have been set up to be strict wrt
# delays.
Sure. May well be.. For sure I'm no fan of this config/policy..
But wait. I noticed no such delays or disassociations using another notebook.
Well, seems I got to investigate in this, too.
# > # Regarding (3): Seems like driver is getting a request to scan after a
# > # request to remove interface. I am still inquiring about this.
# > Probably due to me switching of via RF_KILLSWITCH. But anyway I assume this
# > msg should not happen..
#
# Absolutely. What are the exact steps you run when you encounter this
# issue?
Nothing particular. I.e. after the tests conducted toggle the hw kill switch. Then
a script gets called via acpi callback that in turn 1) kills the respective dhclient3, 2)
terminates wpa_supplicant and 3) removes the wifi modules (for power saving).
Cheers,
Nils
[-- Attachment #2: 2010-05-31_iwlwifi_dbg_filter.bz2 --]
[-- Type: application/octet-stream, Size: 365121 bytes --]
next prev parent reply other threads:[~2010-05-31 20:12 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-05-03 19:17 kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock NilsRadtkelkml
2010-05-03 19:22 ` John W. Linville
2010-05-06 9:14 ` Christian Borntraeger
2010-05-06 16:28 ` reinette chatre
2010-05-11 15:50 ` Christian Borntraeger
2010-05-11 17:21 ` reinette chatre
2010-05-12 15:18 ` Christian Borntraeger
2010-05-10 18:36 ` Nils Radtke
2010-05-10 23:32 ` reinette chatre
2010-05-12 14:39 ` Nils Radtke
2010-05-12 23:14 ` reinette chatre
2010-05-13 10:34 ` Nils Radtke
2010-05-13 11:32 ` Nils Radtke
2010-05-13 16:31 ` reinette chatre
2010-05-14 17:45 ` Nils Radtke
2010-05-13 15:05 ` Nils Radtke
2010-05-17 23:19 ` reinette chatre
2010-05-20 12:15 ` Nils Radtke
2010-05-20 18:33 ` reinette chatre
2010-05-31 20:12 ` Nils Radtke [this message]
2010-06-02 17:51 ` reinette chatre
2010-06-04 16:57 ` Nils Radtke
2010-06-08 17:46 ` reinette chatre
2010-06-10 14:22 ` Nils Radtke
2010-06-10 16:19 ` reinette chatre
2010-05-20 12:31 ` Nils Radtke
2010-05-20 18:26 ` reinette chatre
2010-05-20 22:30 ` David Miller
-- strict thread matches above, loose matches on Subject: below --
2010-05-11 9:41 Nils Radtke
2010-04-29 18:26 kernel BUG in iwl-agn-rs.c:2076,was: iwlagn + some accesspoint = hardlock lkml
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100531201225.GA6327@localhost \
--to=lkml@think-future.de \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-wireless@vger.kernel.org \
--cc=reinette.chatre@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.