Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock

linux-wireless.vger.kernel.org archive mirror

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
@ 2010-05-11  9:41 Nils Radtke
  0 siblings, 0 replies; 16+ messages in thread
From: Nils Radtke @ 2010-05-11  9:41 UTC (permalink / raw)
  To: reinette.chatre; +Cc: linville, linux-kernel, linux-wireless

  Hi,

Thanks a lot for the driver not hanging w/ bug_on() any more. At least the machine 
keeps working and when on battery no repeated reboots are required any more. That alone
already means a lot.

# On Mon, 2010-05-10 at 11:36 -0700, Nils Radtke wrote:
# >   Today weather was fine again, finally. So testing with .33.3 w/ the patch applied: 
# > 
# >   http://marc.info/?l=linux-wireless&m=127290931304496&w=2
# > 
# > The .32 kernel was still running before it crashed immediately on wireless activation.
# > The crash log showed again at least two messages, the last was as already described in my first
# > message, bug from 2010-04-30: I think even the 0x2030 was the same:
# > 
# > EIP rs_tx_status +x8f/x2030 
# 
# You report an issue on 2.6.32 ...
Yes. These errors happened to be the same regardless of .32 or .33

# > W/ .33.3 and the above patch applied:
# 
# ... but then test the patch with 2.6.33.
# 
# Which kernel are you focused on?
Sorry, no intention to confuse or show erratic behaviour.. :)

It's just that the errors occur on both of them. Then I accidentally booted the old one again (now
removed from the system), but again, the error showed up on .32, .33.{1,2,3} . But you always
had an indication which kernel it happened on.

OTOH, it's basically the same, the identical error persists, so I can't see the difference here. 
Except that for a scientific approach one shouldn't do that, ACK. But, hey, I'd like to use the machine in
the meantime and happened to update the kernel source. 

# > Linux mypole 2.6.33.3 #18 SMP PREEMPT Thu May 6 21:51:37 CEST 2010 i686 GNU/Linux
# > 
# > May 10 19:14:11 [   80.586637] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# > May 10 19:23:17 [  626.476078] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# > May 10 19:23:30 [  638.913740] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# > May 10 19:23:32 [  641.232425] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# > May 10 19:23:54 [  663.392697] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# > May 10 19:23:58 [  666.980247] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# > May 10 19:24:02 [  671.121826] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# Can you see any impact on your connection speed that can be connected to
# these messages?
I'm glad you're asking. Yes, indeed, speed is exceptionally low compared to what might be achievable. Around 30k/s
average, bursts of maybe 200k/s, instead of 700k/s.

# > Additionally these were logged, could you tell why they're there and what to do? (also .33.3 w/ patch)
# > 
# > May 10 19:24:16 [  685.079617] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
# > May 10 19:24:22 [  691.026737] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
# > May 10 19:28:02 [  911.406162] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
# > May 10 19:35:38 [ 1367.251240] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
# > 
# > The above "iwl_tx_agg_start" lines happen when connecting - again to a Cisco AP - and the connection gets
# > dropped the exact moment when a download is started. It even often drops when dhcp is still negotiating, has
# > got its IP but the nego isn't finished yet. Conn drops, same procedure again and again. This happens only
# > with this Cisco AP (which is BTW another one from the "expected_tpt should have been calculated by now" 
# > problem).
# It could be that some of the queues get stuck. Can you try with the
# patches in
# http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2037#c113 ? They
# are based on 2.6.33.
Good, no wait, bad, now running on .34-rc7. *sigh

I'll apply the patches to .33. .34-rc7 hadn't brought the desired success w/ the olicard100 usb-umts-stick.

Update: noticed you mean 2.6.33 not .33.x ;) On .33.3 it doesn't apply cleanly for a couple of files..
Any objections if I apply it to .33.3 anyway? (Fixing the rej of course..)

Interestingly enough, quilt import 0001*patch imports and quilt push applies the patch w/o
rej, whereas patch -p1 0001*patch recognizes the patch as already applied and rejects it..

All patches applied successfully, trying again these days.

Thanks for your comments.

Will keep you informed.

Nils

^ permalink raw reply	[flat|nested] 16+ messages in thread

[parent not found: <1274380408.2091.9272.camel@rchatre-DESK>]

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
       [not found] <1274380408.2091.9272.camel@rchatre-DESK>
@ 2010-05-31 20:12 ` Nils Radtke
  2010-06-02 17:51   ` reinette chatre
  0 siblings, 1 reply; 16+ messages in thread
From: Nils Radtke @ 2010-05-31 20:12 UTC (permalink / raw)
  To: reinette chatre; +Cc: linux-kernel, linux-wireless

[-- Attachment #1: Type: text/plain, Size: 7370 bytes --]

  Hi Reinette,

First off: 
Linux mypole 2.6.33.4 #23 SMP PREEMPT Sat May 15 20:27:33 CEST 2010 i686 GNU/Linux

On Thu 2010-05-20 @ 11-33-28AM -0700, reinette chatre wrote: 
# On Thu, 2010-05-20 at 05:15 -0700, Nils Radtke wrote:
# > # 
# > # To address (1), could you please run with attached debug patch and also
# > # enable rate scaling debugging. That will be "modprobe iwlagn
# > # debug=0x143fff".
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c: In function ‘rs_collect_tx_data’:
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: ‘priv’ undeclared (first use in this function)
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: (Each undeclared identifier is reported only once
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: for each function it appears in.)
# > 
# > This happens when compiling w/ the patch applied cleanly against .33.3
# > I'll try to fix it, then conduct the field test.
# 
# Sorry ... and thanks.
# 
# >  For the latter, do 
# > you need the same kind of log as for the previous one? 
# The goal of this patch is to find the reason behind the error
# "expected_tpt should have been calculated by now". From what I
# understand you only encountered that in one of your tests, not all. Any
# test you can run to reproduce that error will be welcome. 
Yep. This expected_tpt stuff happens, IIRC (see mails for certainty, though),
exclusively on site B.

Ok, so that's the goal. What could I do to advance us an additional step
at a time avoiding pushing hundreds of kilobytes of logs uplink?

To reproduce it, I have to be on site B and just start surfing. It's a matter
of _short_ time until the driver hits the wall.

I suspect I'm still getting those expected_tpt thingies (see below). Though today
it seems I've been lucky, it (maybe) only happened once (or never today).

# Thinking about your question more ... I believe your previous debug logs
# were created with debug flag 0x43fff. For this iteration, please use
# debug flag 0x143fff.
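
The two masks quoted above differ in a single bit. A minimal userspace sketch of
that comparison follows; the macro and function names are made up for
illustration and are not driver identifiers. That the extra bit is the
rate-scaling debug level is taken from the thread's wording, not verified by
this code.

```c
#include <stdint.h>

/* Debug masks quoted in the thread (names are hypothetical). */
#define DEBUG_PREVIOUS  UINT32_C(0x043fff)
#define DEBUG_REQUESTED UINT32_C(0x143fff)

/* Bits that are set in 'to' but not in 'from'. */
uint32_t debug_bits_added(uint32_t from, uint32_t to)
{
    return to & ~from;
}

/* Index of the lowest set bit in 'mask', or -1 if no bit is set. */
int lowest_bit_index(uint32_t mask)
{
    int i;

    for (i = 0; i < 32; i++)
        if (mask & (UINT32_C(1) << i))
            return i;
    return -1;
}
```

Calling debug_bits_added(DEBUG_PREVIOUS, DEBUG_REQUESTED) yields 0x100000, i.e.
bit 20, so a log taken with the old mask would be missing only that one class of
messages.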

If you insist on a test using .33.3, I will do so but that will have to wait.

Meanwhile I used this patch for .34 to fix the build err from your dbg patch:

 drivers/net/wireless/iwlwifi/iwl-agn-rs.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
===================================================================
--- linux.orig/drivers/net/wireless/iwlwifi/iwl-agn-rs.c        2010-05-31 13:21:48.000000000 +0200
+++ linux/drivers/net/wireless/iwlwifi/iwl-agn-rs.c     2010-05-31 13:25:52.000000000 +0200
@@ -365,7 +365,8 @@
  * packets.
  */
 static int rs_collect_tx_data(struct iwl_scale_tbl_info *tbl,
-                             int scale_index, int attempts, int successes)
+                             int scale_index, int attempts, int successes,
+                             struct iwl_priv *priv)
 {
        struct iwl_rate_scale_data *window = NULL;
        static const u64 mask = (((u64)1) << (IWL_RATE_MAX_WINDOW - 1));
@@ -868,7 +869,7 @@
                                &rs_index);
                rs_collect_tx_data(curr_tbl, rs_index,
                                   info->status.ampdu_ack_len,
-                                  info->status.ampdu_ack_map);
+                                  info->status.ampdu_ack_map, priv);

                /* Update success/fail counts if not searching for new mode */
                if (lq_sta->stay_in_tbl) {
@@ -902,7 +903,7 @@
                        else
                                continue;
                        rs_collect_tx_data(tmp_tbl, rs_index, 1,
-                                          i < retries ? 0 : legacy_success);
+                                          i < retries ? 0 : legacy_success, priv);
                }

                /* Update success/fail counts if not searching for new mode */

Test conducted using debug flag 0x143fff.

Test on Site B resulted immediately in a hard crash upon resume, that is I got to X,
activated wireless, echoed 0x143fff to the sysfile and that was it.
But I didn't count on that one so I had no console to watch for it. So right now 
I got no clue what caused the crash, there's nothing in the logs, of course..

Next thing was to redo test on site B, but this time I switched to console beforehand.
Sure enough, this time nothing happened..

Log appended as bz2.

Hopefully lines like the following don't indicate essential info getting dropped:
May 31 17:23:17 localhost kernel: [91800.091565] net_ratelimit: 70 callbacks suppressed
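
As a rough illustration of what such a line reports, here is a simplified
userspace model of interval/burst rate limiting. It is a sketch under stated
assumptions, not the kernel's code: the struct, its fields and the function name
are hypothetical, and time is an arbitrary tick counter. The point is only that
messages beyond the burst are counted and discarded, and the count is reported
once the window expires.

```c
#include <stdio.h>

/* Hypothetical model of a rate limiter; not a kernel structure. */
struct ratelimit_model {
    long interval;  /* window length in arbitrary ticks */
    int burst;      /* messages allowed per window */
    long begin;     /* tick at which the current window started */
    int started;    /* nonzero once a window has been opened */
    int printed;    /* messages let through in this window */
    int missed;     /* messages suppressed in this window */
};

/*
 * Returns 1 if a message may be logged at tick 'now', 0 if it is dropped.
 * When a window expires, *reported receives the number of messages that
 * were suppressed during it (the N in "N callbacks suppressed").
 */
int ratelimit_allow(struct ratelimit_model *rl, long now, int *reported)
{
    *reported = 0;

    if (!rl->started) {
        rl->begin = now;
        rl->started = 1;
    }

    if (now - rl->begin > rl->interval) {
        /* Window over: report what was dropped, then start a new one. */
        *reported = rl->missed;
        if (rl->missed)
            printf("ratelimit: %d callbacks suppressed\n", rl->missed);
        rl->begin = now;
        rl->printed = 0;
        rl->missed = 0;
    }

    if (rl->printed < rl->burst) {
        rl->printed++;
        return 1;
    }

    rl->missed++;
    return 0;
}
```

In this model the suppressed messages are gone for good; only their count
survives. Whether the same holds for the messages of interest in the attached
log depends on which of them pass through a rate-limited path.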

This line indicates the first timestamp _after_ the crash:
May 31 17:35:19 localhost kernel: [   69.488456]

The crash happened after site A and on site B. Just arrived, opened lid and *crash*.

I noticed in iwl-agn-rs.c:2080:
  BUG_ON(window->average_tpt != ((window->success_ratio *
        tbl->expected_tpt[index] + 64) / 128));
Could that be again the point that hit me today when the machine crashed once?
Would you mind changing this into a milder WARN? That way I wouldn't hit the wall 
that hard. And I would notice it anyway while skimming the logs as we still are on the
hunt. It's more maintainable if it's a WARN in the src instead of me patching it w/ any
update..

Wasn't this BUG_ON a WARNING in .33.3? (didn't check..)
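
For reference, a minimal userspace sketch of the arithmetic in that check, plus
a warn-and-continue variant of the kind asked for above. The function names are
hypothetical and the reporting is plain fprintf; this is not the driver's code
and not the patch referenced in the thread.

```c
#include <stdio.h>

/*
 * The value the quoted BUG_ON expects: success_ratio * expected_tpt,
 * divided by 128 with the "+ 64" rounding to the nearest integer.
 */
int tpt_expected_average(int success_ratio, int expected_tpt)
{
    return (success_ratio * expected_tpt + 64) / 128;
}

/*
 * Hypothetical WARN-style check: on a mismatch, log it and return 1
 * so the caller can carry on, instead of halting the machine.
 * Returns 0 when the stored average matches.
 */
int tpt_check_warn(int average_tpt, int success_ratio, int expected_tpt)
{
    int want = tpt_expected_average(success_ratio, expected_tpt);

    if (average_tpt != want) {
        fprintf(stderr, "tpt mismatch: have %d, want %d\n",
                average_tpt, want);
        return 1;
    }
    return 0;
}
```

With this shape a stale or inconsistent average_tpt shows up in the logs as one
line per mismatch, which is enough to keep hunting without losing the session.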

The dbg log contains all types of the errs happening here:

  - "deauthenticated" msgs w/ "reason 2"

  - "request scan called when driver not ready"

  - iwl_tx_agg_start

  - of course the "expected_tpt should have been[..]" don't show up anymore, the source
    has no more WARN regarding this..

And the rest of the 9Mb dbg log..

Could you tell me a bit about your idea of how to track those down? Maybe we can
speed things a little up. The logging and testing stuff takes a lot of time and 
most of it I have no clue why that might help or what the goal is..
Are you online in some #chan? 

# > # Regarding (2): This is a common issue in busy environments where AP
# > # decides to deauthenticate station after it does not receive an ack for
# > # data sent after a few retries. Was this test done in busy environment?
# > Both. This happens in busy environment as well as in an idle one. Can't tell
# > yet whether there're more of those msgs in busy env. I start to feel against 
# > Cisco APs..
# 
# I don't know ... perhaps these APs have been set up to be strict wrt
# delays.
Sure. May well be.. For sure I'm no fan of this config/policy.. 
But wait. I noticed no such delays or disassociations using another notebook.
Well, seems I've got to investigate this, too.

# > # Regarding (3): Seems like driver is getting a request to scan after a
# > # request to remove interface. I am still inquiring about this.
# > Probably due to me switching off via RF_KILLSWITCH. But anyway I assume this
# > msg should not happen..
# 
# Absolutely. What are the exact steps you run when you encounter this
# issue?
Nothing particular. I.e. after the tests are conducted I toggle the hw kill switch. Then
a script gets called via acpi callback that in turn 1) kills the respective dhclient3, 2)
terminates wpa_supplicant and 3) removes the wifi modules (for power saving).

        Cheers,

                    Nils

[-- Attachment #2: 2010-05-31_iwlwifi_dbg_filter.bz2 --]
[-- Type: application/octet-stream, Size: 365121 bytes --]


* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-31 20:12 ` Nils Radtke
@ 2010-06-02 17:51   ` reinette chatre
  2010-06-04 16:57     ` Nils Radtke
  0 siblings, 1 reply; 16+ messages in thread
From: reinette chatre @ 2010-06-02 17:51 UTC (permalink / raw)
  To: Nils Radtke; +Cc: linux-kernel@vger.kernel.org, linux-wireless@vger.kernel.org

On Mon, 2010-05-31 at 13:12 -0700, Nils Radtke wrote:

> This line indicates the first timestamp _after_ the crash:
> May 31 17:35:19 localhost kernel: [   69.488456]
> 
> The crash happened after site A and on site B. Just arrived, opened lid and *crash*.
> 
> I noticed in iwl-agn-rs.c:2080:
>   BUG_ON(window->average_tpt != ((window->success_ratio *
>         tbl->expected_tpt[index] + 64) / 128));
> Could that be again the point that hit me today when the machine crashed once?
> Would you mind changing this into a milder WARN? That way I wouldn't hit the wall 
> that hard. And I would notice it anyway while skimming the logs as we still are on the
> hunt. It's more maintainable if it's a WARN in the src instead of me patching it w/ any
> update..
> 
> Wasn't this BUG_ON a WARNING in .33.3? (didn't check..)

Seems like you performed the testing without the patch that we used to
address the hang issue from the beginning of this thread. Please see
http://marc.info/?l=linux-wireless&m=127290931304496&w=2 - that thread
also explains why the patch is not in 2.6.34.

I think it is time to move this discussion to a bug report so that it
can be tracked better. Please open a new bug at
http://bugzilla.intellinuxwireless.org/

Reinette





* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-06-02 17:51   ` reinette chatre
@ 2010-06-04 16:57     ` Nils Radtke
  2010-06-08 17:46       ` reinette chatre
  0 siblings, 1 reply; 16+ messages in thread
From: Nils Radtke @ 2010-06-04 16:57 UTC (permalink / raw)
  To: reinette chatre; +Cc: linville, linux-kernel, linux-wireless

  Hi Reinette,

  BTW, this:
  Jun  3 12:05:43 localhost kernel: [174170.391756] iwlagn 0000:03:00.0: 
    TX Power requested while scanning!
  happened even w/o toggling radio switch, so this seems not uniquely
  related to toggling the radio switch.

On Wed 2010-06-02 @ 10-51-25 -0700, reinette chatre wrote: 
# On Mon, 2010-05-31 at 13:12 -0700, Nils Radtke wrote:
# 
# > This line indicates the first timestamp _after_ the crash:
# > May 31 17:35:19 localhost kernel: [   69.488456]
# > 
# > The crash happened after site A and on site B. Just arrived, opened lid and *crash*.
# > 
# > I noticed in iwl-agn-rs.c:2080:
# >   BUG_ON(window->average_tpt != ((window->success_ratio *
# >         tbl->expected_tpt[index] + 64) / 128));
# > Could that be again the point that hit me today when the machine crashed once?
# > Would you mind changing this into a milder WARN? That way I wouldn't hit the wall 
# > that hard. And I would notice it anyway while skimming the logs as we still are on the
# > hunt. It's more maintainable if it's a WARN in the src instead of me patching it w/ any
# > update..
# > 
# > Wasn't this BUG_ON a WARNING in .33.3? (didn't check..)
# 
# Seems like you performed the testing without the patch that we used to
# address the hang issue from the beginning of this thread. Please see
Indeed, that's what it feels like. It is just so annoying, that one..
You can't work w/ the kernel drivers. That's a shame.
BTW, if the patch for the BUG_ON has been in kernel src since 2.6.28, that might
explain a lot of earlier crashes that I was never able to track down.
Back then I hadn't a chance to do more on this. Unlike now.

# http://marc.info/?l=linux-wireless&m=127290931304496&w=2 - that thread
# also explains why the patch is not in 2.6.34.
It should definitely and absolutely be merged (change the BUG_ON into WARNING).
Even if, like hypothesized, the bug is hidden elsewhere, a BUG_ON doesn't get
me far, it's killing every chance to advance to a solution. How am I supposed
to investigate w/ the kernel crashing? BTW, I don't like working w/ a Linux
kernel that kills my work regularly, I think that's understandable. If I needed
a break from work, I'd set an alarm.

I've seen a bugreport on this issue on the redhat bts referencing my word about
this BUG_ON only getting hit w/ cisco APs. There's a wide range of AP manufacturers 
out there in the city. But only cisco APs are crashing this driver. Admittedly, only 
on one single location, but anyway it's a cisco. Always the same MAC, unless they 
reassign MAC addresses, though.. 

I think it's a tough one, if an AP is able to crash the driver. 

I haven't yet received a comment of yours regarding my many other questions in
my previous message. I am willing to help investigate more, assist in other ways 
than testing only (always only doing testing isn't a way to keep up fun..)

# I think it is time to move this discussion to a bug report so that it
# can be tracked better. Please open a new bug at
# http://bugzilla.intellinuxwireless.org/
As you wish. It's probably a good idea. But I still miss the registration mail
from bz, did register yesterday.

So, please see to it that the patch turning the BUG_ON into a
WARNING finds its way back in.

    Thank you very much,

            Nils Radtke


* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-06-04 16:57     ` Nils Radtke
@ 2010-06-08 17:46       ` reinette chatre
  2010-06-10 14:22         ` Nils Radtke
  0 siblings, 1 reply; 16+ messages in thread
From: reinette chatre @ 2010-06-08 17:46 UTC (permalink / raw)
  To: Nils Radtke
  Cc: linville@tuxdriver.com, linux-kernel@vger.kernel.org,
	linux-wireless@vger.kernel.org

On Fri, 2010-06-04 at 09:57 -0700, Nils Radtke wrote:
> I haven't yet received a comment of yours regarding my many other questions in
> my previous message. I am willing to help investigate more, assist in other ways 
> than testing only (always only doing testing isn't a way to keep up fun..)

Your messages contain references to many issues and it is becoming
increasingly hard to keep track of them all in a single email thread.
Since the system crash is clearly the big issue I would like to focus on
that and get that resolved. This is why I proposed that you create bug
reports to help track your various issues better.

Reinette




* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-06-08 17:46       ` reinette chatre
@ 2010-06-10 14:22         ` Nils Radtke
  2010-06-10 16:19           ` reinette chatre
  0 siblings, 1 reply; 16+ messages in thread
From: Nils Radtke @ 2010-06-10 14:22 UTC (permalink / raw)
  To: reinette chatre
  Cc: linville@tuxdriver.com, linux-kernel@vger.kernel.org,
	linux-wireless@vger.kernel.org

  Hi Reinette,

  Thanks for your message.

  Yes, you're right about the multiple bugs one thread thing.

  Just today I got registered w/ the wireless ml because the
  system just did not send me a registration message. 

  For the bug reports to be created it will take me some time.
  I'll firstly report the main issue, the 2 other ones afterwards.
  Would it be ok to cross-reference, e.g. the log and such,
  between the reports?

  Should I paste all the mail messages in separate report messages 
  (belonging to one bug report, of course) or should I paste some
  links to the thread?

  Cheers,

        Nils

@John: Yes, you're right, but the 2.6.33.4 tree for me still
  has the bug_on in.

On Tue 2010-06-08 @ 10-46-29AM -0700, reinette chatre wrote: 
# On Fri, 2010-06-04 at 09:57 -0700, Nils Radtke wrote:
# > I haven't yet received a comment of yours regarding my many other questions in
# > my previous message. I am willing to help investigate more, assist in other ways 
# > than testing only (always only doing testing isn't a way to keep up fun..)
# 
# Your messages contain references to many issues and it is becoming
# increasingly hard to keep track of them all in a single email thread.
# Since the system crash is clearly the big issue I would like to focus on
# that and get that resolved. This is why I proposed that you create bug
# reports to help track your various issues better.
# 
# Reinette
# 
# 

-- 


* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-06-10 14:22         ` Nils Radtke
@ 2010-06-10 16:19           ` reinette chatre
  0 siblings, 0 replies; 16+ messages in thread
From: reinette chatre @ 2010-06-10 16:19 UTC (permalink / raw)
  To: Nils Radtke
  Cc: linville@tuxdriver.com, linux-kernel@vger.kernel.org,
	linux-wireless@vger.kernel.org

On Thu, 2010-06-10 at 07:22 -0700, Nils Radtke wrote:
>   For the bug reports to be created it will take me some time.
>   I'll firstly report the main issue, the 2 other ones afterwards.

Sounds great. Thanks

>   Would it be ok to cross-reference, e.g. the log and such,
>   between the reports?
>   Should I paste all the mail messages in separate report messages 
>   (belonging to one bug report, of course) or should I paste some
>   links to the thread?

I find it most convenient if all information related to the bug is
contained in the bug report. Links can be used.

Reinette






[parent not found: <1273768269.2295.1144.camel@rchatre-DESK>]

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
       [not found] <1273768269.2295.1144.camel@rchatre-DESK>
@ 2010-05-14 17:45 ` Nils Radtke
  0 siblings, 0 replies; 16+ messages in thread
From: Nils Radtke @ 2010-05-14 17:45 UTC (permalink / raw)
  To: reinette.chatre; +Cc: linux-kernel, linux-wireless

  Hi Reinette,

  Might be of interest:

[63099.789939] eth1: associated
[63166.919257] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63180.322024] Hangcheck: hangcheck value past margin!
[63190.664526] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63193.255873] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63194.941768] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63195.099286] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63196.524065] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63197.417740] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63199.767526] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:aa:aa:aa tid = 0
[63205.689184] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:aa:aa:aa tid = 0
[63210.821316] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:aa:aa:aa tid = 0
[63228.178530] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:aa:aa:aa tid = 0

Happened on site B, with high throughput (280-340k/s). So, it's happening w/ both, fast
and slow conn speed.

Yes, I noticed and had a look into the compat-wireless scripts but preferred to do 
it manually. Thank you for your explanation.

Cheers,

                Nils



[parent not found: <20100503191756.GA3479@localhost>]

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
       [not found] <20100503191756.GA3479@localhost>
@ 2010-05-03 19:22 ` John W. Linville
  2010-05-06  9:14   ` Christian Borntraeger
  2010-05-10 18:36   ` Nils Radtke
  0 siblings, 2 replies; 16+ messages in thread
From: John W. Linville @ 2010-05-03 19:22 UTC (permalink / raw)
  To: NilsRadtkelkml; +Cc: linux-kernel, reinette.chatre, linux-wireless

On Mon, May 03, 2010 at 09:17:56PM +0200, NilsRadtkelkml@Think-Future.de wrote:

> Strangely, as stated in the previous message, this bug only happens at a specific 
> geographical position; so far, that is, it has only happened there. How does this correlate with the designated
> code at line 2076 in iwl-agn-rs.c?:
> 
>   /* Sanity-check TPT calculations */
>   BUG_ON(window->average_tpt != ((window->success_ratio *
>       tbl->expected_tpt[index] + 64) / 128));

Interestingly enough, we have been discussing this line of code today.  Could you try the patch here?

	http://marc.info/?l=linux-wireless&m=127290931304496&w=2

Does it address the issue you are experiencing?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-03 19:22 ` John W. Linville
@ 2010-05-06  9:14   ` Christian Borntraeger
  2010-05-06 16:28     ` reinette chatre
  2010-05-10 18:36   ` Nils Radtke
  1 sibling, 1 reply; 16+ messages in thread
From: Christian Borntraeger @ 2010-05-06  9:14 UTC (permalink / raw)
  To: John W. Linville
  Cc: NilsRadtkelkml, linux-kernel, reinette.chatre, linux-wireless

On Monday 03 May 2010 21:22:19, John W. Linville wrote:
> >   /* Sanity-check TPT calculations */
> >   BUG_ON(window->average_tpt != ((window->success_ratio *
> >       tbl->expected_tpt[index] + 64) / 128));
> 
> Interestingly enough, we have been discussing this line of code today.  Could you try the patch here?
> 
> 	http://marc.info/?l=linux-wireless&m=127290931304496&w=2

I also see a hard lockup some time after connecting to my company's
wireless network. My private network does not seem to trigger that bug.
Unfortunately the kernel is not able to switch back graphics, so I 
cannot tell if I see the same BUG - even if the problem description is 
the same.
For reference, the patch above  does not help on my T61p.

It started soon after 2.6.34-rc4. Before and with rc4 I had to
apply this http://patchwork.ozlabs.org/patch/49850/mbox/ patch to avoid
the other crash. With only this patch on top of rc4 everything seemed to
work fine, so the lockup seems to be triggered by one of the other patches.
Sometimes it takes some minutes to crash, which makes it hard to bisect
the problem.

Christian


* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-06  9:14   ` Christian Borntraeger
@ 2010-05-06 16:28     ` reinette chatre
  2010-05-11 15:50       ` Christian Borntraeger
  0 siblings, 1 reply; 16+ messages in thread
From: reinette chatre @ 2010-05-06 16:28 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: John W. Linville, NilsRadtkelkml@think-future.de,
	linux-kernel@vger.kernel.org, linux-wireless@vger.kernel.org

Hi Christian,

On Thu, 2010-05-06 at 02:14 -0700, Christian Borntraeger wrote:
> On Monday 03 May 2010 21:22:19, John W. Linville wrote:
> > >   /* Sanity-check TPT calculations */
> > >   BUG_ON(window->average_tpt != ((window->success_ratio *
> > >       tbl->expected_tpt[index] + 64) / 128));
> > 
> > Interestingly enough, we have been discussing this line of code today.  Could you try the patch here?
> > 
> > 	http://marc.info/?l=linux-wireless&m=127290931304496&w=2
> 
> I also see a hard lockup some time after connecting to my company's
> wireless network. My private network does not seem to trigger that bug.
> Unfortunately the kernel is not able to switch back graphics, so I 
> cannot tell if I see the same BUG - even if the problem description is 
> the same.

It will be hard to debug this without some logs. Can you perhaps run
with netconsole for a while? Is it possible to trigger this when not in
X to be able to get some information about where issue is?

> For reference, the patch above  does not help on my T61p.
> 
> It started soon after 2.6.34-rc4. Before and with rc4 I had to
> apply this http://patchwork.ozlabs.org/patch/49850/mbox/ patch to avoid
> the other crash. With only this patch on top of rc4 everything seemed to
> work fine, so the lockup seems to be triggered by one of the other patches.
> Sometimes it takes some minutes to crash, which makes it hard to bisect
> the problem.

The seven iwlwifi patches below were added after rc4. If you are unable to
bisect ... perhaps you can run a while by reverting more and more from
this list?

f2fa1b015e9c199e45c836c769d94db595150731 iwlwifi: correct 6000 EEPROM regulatory address
88be026490ed89c2ffead81a52531fbac5507e01 iwlwifi: fix scan races
8b9fce77737ae9983f61ec56cd53f52fb738b2c7 iwlwifi: work around bogus active chains detection
ece6444c2fe80dab679beb5f0d58b091f1933b00 iwlwifi: need check for valid qos packet before free
de0f60ea94e132c858caa64a44b2012e1e8580b0 iwlwifi: avoid Tx queue memory allocation in interface down
04f2dec1c3d375c4072613880f28f43b66524876 iwlwifi: use consistent table for tx data collect
dd48744964296b5713032ea1d66eb9e3d990e287 iwlwifi: fix DMA allocation warnings

Reinette




* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-06 16:28     ` reinette chatre
@ 2010-05-11 15:50       ` Christian Borntraeger
  2010-05-11 17:21         ` reinette chatre
  0 siblings, 1 reply; 16+ messages in thread
From: Christian Borntraeger @ 2010-05-11 15:50 UTC (permalink / raw)
  To: reinette chatre
  Cc: John W. Linville, NilsRadtkelkml@think-future.de,
	linux-kernel@vger.kernel.org, linux-wireless@vger.kernel.org

On Thursday 06 May 2010 18:28:48, reinette chatre wrote:
> The seven iwlwifi patches below were added after rc4. If you are unable to
> bisect ... perhaps you can run a while by reverting more and more from
> this list?
> 
> f2fa1b015e9c199e45c836c769d94db595150731 iwlwifi: correct 6000 EEPROM regulatory address
> 88be026490ed89c2ffead81a52531fbac5507e01 iwlwifi: fix scan races
> 8b9fce77737ae9983f61ec56cd53f52fb738b2c7 iwlwifi: work around bogus active chains detection
> ece6444c2fe80dab679beb5f0d58b091f1933b00 iwlwifi: need check for valid qos packet before free
> de0f60ea94e132c858caa64a44b2012e1e8580b0 iwlwifi: avoid Tx queue memory allocation in interface down
> 04f2dec1c3d375c4072613880f28f43b66524876 iwlwifi: use consistent table for tx data collect
> dd48744964296b5713032ea1d66eb9e3d990e287 iwlwifi: fix DMA allocation warnings

Just to give you some feedback. Sometimes it takes a while until it crashes.
My hand made bisect is currently at two remaining patches:

de0f60ea94e132c858caa64a44b2012e1e8580b0
8b9fce77737ae9983f61ec56cd53f52fb738b2c7

reverting both solves my hard lockup. I will try to isolate the "bad" patch
but this takes some more days since I won't be in the "hazardous environment"
this week.

Christian


* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-11 15:50       ` Christian Borntraeger
@ 2010-05-11 17:21         ` reinette chatre
  2010-05-12 15:18           ` Christian Borntraeger
  0 siblings, 1 reply; 16+ messages in thread
From: reinette chatre @ 2010-05-11 17:21 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: John W. Linville, NilsRadtkelkml@think-future.de,
	linux-kernel@vger.kernel.org, linux-wireless@vger.kernel.org

On Tue, 2010-05-11 at 08:50 -0700, Christian Borntraeger wrote:
> On Thursday 06 May 2010 18:28:48, reinette chatre wrote:
> > Below seven iwlwifi patches were added after rc4. If you are unable to
> > bisect ... perhaps you can run a while by reverting more and more from
> > this list?
> > 
> > f2fa1b015e9c199e45c836c769d94db595150731 iwlwifi: correct 6000 EEPROM regulatory address
> > 88be026490ed89c2ffead81a52531fbac5507e01 iwlwifi: fix scan races
> > 8b9fce77737ae9983f61ec56cd53f52fb738b2c7 iwlwifi: work around bogus active chains detection
> > ece6444c2fe80dab679beb5f0d58b091f1933b00 iwlwifi: need check for valid qos packet before free
> > de0f60ea94e132c858caa64a44b2012e1e8580b0 iwlwifi: avoid Tx queue memory allocation in interface down
> > 04f2dec1c3d375c4072613880f28f43b66524876 iwlwifi: use consistent table for tx data collect
> > dd48744964296b5713032ea1d66eb9e3d990e287 iwlwifi: fix DMA allocation warnings
> 
> Just to give you some feedback: sometimes it takes a while until it crashes.
> My hand-made bisect is currently down to two remaining patches:
> 
> de0f60ea94e132c858caa64a44b2012e1e8580b0
> 8b9fce77737ae9983f61ec56cd53f52fb738b2c7
> 
> Reverting both solves my hard lockup. I will try to isolate the "bad" patch,
> but this will take a few more days since I won't be in the "hazardous environment"
> this week.

Thank you for digging into this. It would be very helpful if you could get
us a trace of the crash - any chance that netconsole may work?
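
[Editor's sketch, not part of the original message.] netconsole sends kernel
log messages as UDP datagrams to a second machine, so capturing a trace needs
a listener on that machine. Below is a minimal receiver, assuming the sending
side was configured to target UDP port 6666 (the usual netconsole default).
The function names and the port choice are illustrative assumptions; nothing
in the thread says how the capture was set up.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket on which netconsole datagrams can be received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def read_message(sock, bufsize=4096):
    """Block until one datagram arrives; return (sender_ip, decoded_text)."""
    data, sender = sock.recvfrom(bufsize)
    return sender[0], data.decode("utf-8", errors="replace")


def run_forever(logfile_path, host="0.0.0.0", port=6666):
    """Append every received message to a file, flushing after each write."""
    sock = open_listener(host, port)
    with open(logfile_path, "a", encoding="utf-8") as log:
        while True:
            ip, text = read_message(sock)
            log.write(f"{ip}: {text}")
            log.flush()  # keep what arrived even if the sender locks up next
```

Calling `run_forever("crash.log")` on the receiving machine before
reproducing the lockup would keep whatever the kernel managed to send. With a
hard lockup the last messages may never leave the machine, so an empty log
does not rule out a trace.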

Reinette




* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-11 17:21         ` reinette chatre
@ 2010-05-12 15:18           ` Christian Borntraeger
  0 siblings, 0 replies; 16+ messages in thread
From: Christian Borntraeger @ 2010-05-12 15:18 UTC (permalink / raw)
  To: reinette chatre
  Cc: John W. Linville, NilsRadtkelkml@think-future.de,
	linux-kernel@vger.kernel.org, linux-wireless@vger.kernel.org

On Tuesday 11 May 2010 19:21:33, reinette chatre wrote:
> > de0f60ea94e132c858caa64a44b2012e1e8580b0
> > 8b9fce77737ae9983f61ec56cd53f52fb738b2c7
> > 
> > Reverting both solves my hard lockup. I will try to isolate the "bad" patch,
> > but this will take a few more days since I won't be in the "hazardous environment"
> > this week.

Drat! Now I got a lockup with these two patches reverted. I don't know if that
was the same bug.

> Thank you for digging into this. It would be very helpful if you could get
> us a trace of the crash - any chance that netconsole may work?

No success so far.





* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-03 19:22 ` John W. Linville
  2010-05-06  9:14   ` Christian Borntraeger
@ 2010-05-10 18:36   ` Nils Radtke
  2010-05-10 23:32     ` reinette chatre
  1 sibling, 1 reply; 16+ messages in thread
From: Nils Radtke @ 2010-05-10 18:36 UTC (permalink / raw)
  To: linville; +Cc: linux-kernel, reinette.chatre, linux-wireless

  Hi John,

  Today the weather was fine again, finally, so I tested .33.3 w/ this patch applied:

  http://marc.info/?l=linux-wireless&m=127290931304496&w=2

The .32 kernel was still running before it crashed immediately on wireless activation.
The crash log showed again at least two messages, the last was as already described in my first
message, bug from 2010-04-30: I think even the 0x2030 was the same:

EIP rs_tx_status +x8f/x2030 

W/ .33.3 and the above patch applied:

Linux mypole 2.6.33.3 #18 SMP PREEMPT Thu May 6 21:51:37 CEST 2010 i686 GNU/Linux

May 10 19:14:11 [   80.586637] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:17 [  626.476078] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:30 [  638.913740] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:32 [  641.232425] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:54 [  663.392697] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:58 [  666.980247] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:24:02 [  671.121826] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now

Additionally these were logged, could you tell why they're there and what to do? (also .33.3 w/ patch)

May 10 19:24:16 [  685.079617] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
May 10 19:24:22 [  691.026737] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
May 10 19:28:02 [  911.406162] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
May 10 19:35:38 [ 1367.251240] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0

The above "iwl_tx_agg_start" lines appear when connecting - again to a Cisco AP - and the connection gets
dropped the exact moment a download is started. It often drops even while DHCP is still negotiating: the
machine has got its IP but the negotiation isn't finished yet. The connection drops, and the same procedure
repeats again and again. This happens only with this Cisco AP (which is, BTW, a different AP from the one
with the "expected_tpt should have been calculated by now" problem).
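
[Editor's sketch, not part of the original message.] To relate these messages
to connection drops, it can help to tally how often each iwlagn message
appears and when it was first seen. The sketch below assumes the syslog line
layout shown in the excerpts above (timestamp, bracketed uptime in seconds,
"iwlagn", PCI address, message). The regular expression and the name
`summarize` are illustrative assumptions, and the sample lines are copied from
this message.

```python
import re
from collections import Counter

# Layout assumed from the excerpts above:
# "May 10 19:14:11 [   80.586637] iwlagn 0000:03:00.0: <message>"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+ \d\d:\d\d:\d\d \[\s*(\d+\.\d+)\] iwlagn (\S+): (.*)$"
)

SAMPLE = [
    "May 10 19:14:11 [   80.586637] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now",
    "May 10 19:23:17 [  626.476078] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now",
    "May 10 19:24:16 [  685.079617] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0",
]


def summarize(lines):
    """Count each distinct iwlagn message and record the uptime of its first occurrence."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not an iwlagn line in the assumed layout
        uptime, _device, message = match.groups()
        message = message.strip()
        counts[message] += 1
        first_seen.setdefault(message, float(uptime))
    return counts, first_seen


def report(lines):
    """Print one summary line per distinct message, most frequent first."""
    counts, first_seen = summarize(lines)
    for message, count in counts.most_common():
        print(f"{count:4d}x first at {first_seen[message]:.1f}s  {message}")
```

Running `report()` over the full log gives a per-message count that can be
set against the times at which the connection dropped. The tally alone does
not show whether the messages cause the drops.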

Another type occurred: (probably .33.2 or something)

May  3 17:58:11 [ 3946.608743] iwlagn 0000:03:00.0: request scan called when driver not ready.
May  3 17:58:12 [ 3948.082684] iwlagn 0000:03:00.0: TX Power requested while scanning!
May  3 18:01:00 [ 4115.282852] iwlagn 0000:03:00.0: RF_KILL bit toggled to disable radio.

Is this "TX Power requested while scanning" because of RF_KILL set to off?

  Thank you,

  Nils


* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-10 18:36   ` Nils Radtke
@ 2010-05-10 23:32     ` reinette chatre
  0 siblings, 0 replies; 16+ messages in thread
From: reinette chatre @ 2010-05-10 23:32 UTC (permalink / raw)
  To: Nils Radtke
  Cc: linville@tuxdriver.com, linux-kernel@vger.kernel.org,
	linux-wireless@vger.kernel.org

On Mon, 2010-05-10 at 11:36 -0700, Nils Radtke wrote:
>   Today weather was fine again, finally. So testing with .33.3 w/ the patch applied: 
> 
>   http://marc.info/?l=linux-wireless&m=127290931304496&w=2
> 
> The .32 kernel was still running before it crashed immediately on wireless activation.
> The crash log showed again at least two messages, the last was as already described in my first
> message, bug from 2010-04-30: I think even the 0x2030 was the same:
> 
> EIP rs_tx_status +x8f/x2030 

You report an issue on 2.6.32 ...

> 
> W/ .33.3 and the above patch applied:

... but then test the patch with 2.6.33.

Which kernel are you focused on?


> Linux mypole 2.6.33.3 #18 SMP PREEMPT Thu May 6 21:51:37 CEST 2010 i686 GNU/Linux
> 
> May 10 19:14:11 [   80.586637] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:17 [  626.476078] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:30 [  638.913740] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:32 [  641.232425] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:54 [  663.392697] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:58 [  666.980247] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:24:02 [  671.121826] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now

Can you see any impact on your connection speed that can be connected to
these messages?

> Additionally these were logged, could you tell why they're there and what to do? (also .33.3 w/ patch)
> 
> May 10 19:24:16 [  685.079617] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
> May 10 19:24:22 [  691.026737] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
> May 10 19:28:02 [  911.406162] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
> May 10 19:35:38 [ 1367.251240] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
> 
> The above "iwl_tx_agg_start" lines happen when connecting - again to a Cisco AP - and the connection gets
> dropped the exact moment when a download is started. It even often drops when dhcp is still negotiating, has
> got its IP but the nego isn't finished yet. Conn drops, same procedure again and again. This happens only
> with this Cisco AP (which is BTW another one from the "expected_tpt should have been calculated by now" 
> problem).

It could be that some of the queues get stuck. Can you try with the
patches in
http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2037#c113 ? They
are based on 2.6.33.

Reinette





end of thread (newest message: 2010-06-10 16:19 UTC)

Thread overview: 16+ messages
2010-05-11  9:41 kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock Nils Radtke
     [not found] <1274380408.2091.9272.camel@rchatre-DESK>
2010-05-31 20:12 ` Nils Radtke
2010-06-02 17:51   ` reinette chatre
2010-06-04 16:57     ` Nils Radtke
2010-06-08 17:46       ` reinette chatre
2010-06-10 14:22         ` Nils Radtke
2010-06-10 16:19           ` reinette chatre
     [not found] <1273768269.2295.1144.camel@rchatre-DESK>
2010-05-14 17:45 ` Nils Radtke
     [not found] <20100503191756.GA3479@localhost>
2010-05-03 19:22 ` John W. Linville
2010-05-06  9:14   ` Christian Borntraeger
2010-05-06 16:28     ` reinette chatre
2010-05-11 15:50       ` Christian Borntraeger
2010-05-11 17:21         ` reinette chatre
2010-05-12 15:18           ` Christian Borntraeger
2010-05-10 18:36   ` Nils Radtke
2010-05-10 23:32     ` reinette chatre
