RE: kernel BUG in iwl-agn-rs.c:2076,was: iwlagn + some accesspoint = hardlock

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* RE: kernel BUG in iwl-agn-rs.c:2076,was: iwlagn + some accesspoint = hardlock
@ 2010-04-29 18:26 lkml
  0 siblings, 0 replies; 30+ messages in thread
From: lkml @ 2010-04-29 18:26 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1986 bytes --]

  Hi,

  In addition to the previous message it seems that:

  - RFKILL switch, switching radio on, (obviously) is a precondition. But then, it only 
    happens shortly after switching radio on,
  - it happens more frequently and faster after switching radio on when kde + firefox + xy is running

Please find attached an image of the bug. Finally I could catch a glimpse of what happens, I longued
for this quite some time.

Strangely, as stated in the previous message, this bug does only happen in conjunction at a specific 
geographical position, so far it only happened there that is. How does this correlate with the designated
code at line 2076 in iwl-agn-rs.c?:

  /* Sanity-check TPT calculations */
  BUG_ON(window->average_tpt != ((window->success_ratio *
      tbl->expected_tpt[index] + 64) / 128));


Furthermore, it seems like there was already an encounter with this bug before. Previously I had no means 
of grabbing more information off it, though, so I wrote down some notes:

2010-03-20: iwl-agn-rs.c line 2076: bug
The file where I kept my note for this bug contains: 

- crashed during wireless AP scan using wpa_supplicant (after RFKILL switching to on..)
- this happend right after a fresh reboot (after crashing for the same bug)

- afterwards, auth.log contained this (bad, bad where's my data gone?):
Mar 20 18:24:01 localhost CRON[16684]: pam_unix(cron:session): session closed for user root
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@i686; it; rv:1.9.2.1) Gecko/20091211 Firefox/3.5.6 (like Firefox/3.5.11; Debian-3.5.11-1)"

Noticed the line in the bug from 2010-03-20? Yep, the same as in the one of today.

Happy to serve with more info if required (and available..).

  Regards,



[-- Attachment #2: 2010-04-29_kernel-oops_iwlagn_2076_cut4www.png --]
[-- Type: image/png, Size: 83116 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
@ 2010-05-03 19:17 NilsRadtkelkml
  2010-05-03 19:22 ` John W. Linville
  0 siblings, 1 reply; 30+ messages in thread
From: NilsRadtkelkml @ 2010-05-03 19:17 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4610 bytes --]

  Hello,

  Some of the previous messages about this BUG seem lost, so here again in a single message:

When connecting to the local university's wireless APs, rather regularly the kernel crashes.

So, connecting to one of those the system hardlocks. No panic LED blinking, no mouse cursor moving, 
no magic sysreq key working. Poweroff is the only way to revive the machine. Until the upcoming
lock-up.

What might serve as a hint: using knemo and kwifimanager (with wpa_supplicant in the bg) it feels like it is 
freezing more frequently. Not running kwifimanager seems to reduce the frequency w/o however eliminating it
completely. Sometimes iceweasel has been touched and then the crash occurred. 
Maybe some TX (or subsequent RX) provokes the crash? Some idea that also came up was that accessing iwlagn 
drivers isn't clean when it's happening from more that one place at the same time 
(->wpa_supplicant + kwifimanager)? [edit: this isn't probably it, or not solely]

This behaviour is currently only known to happen in this exact place/geolocation. So one assumption is that it
is somehow correlated to a particular type of access point, firmware respectively. Nonetheless, even
if there's some fw sending kill packets, the kernel should not crash, obviously.

The access points in range are of type: (obviously, fw versions are unknown..)

  - one 0:1d:8b PIRELLI BROADBAND SOLUTIONS 
  - couple of those 0:40:96 Cisco Systems, Inc.
  - one 0:1a:70 Cisco-Linksys

This notebook has no serial port, so we haven't been able to get a glimpse on what is happening at that 
precise moment. xconsole or dmesg haven't been helpful as the system just freezes and that's it.

Other circumstances leading to a crash (reproducibility):

  - RFKILL switch, switching radio on, (obviously) is a precondition. But then, it mainly 
    happens shortly after switching radio on,
  - it happens more frequently and faster after switching radio on when kde + firefox + xy is running

Please find attached an image of the bug. Finally I could catch a glimpse of what happens, I longued
for this quite some time.

Strangely, as stated in the previous message, this bug does only happen in conjunction at a specific 
geographical position, so far it only happened there that is. How does this correlate with the designated
code at line 2076 in iwl-agn-rs.c?:

  /* Sanity-check TPT calculations */
  BUG_ON(window->average_tpt != ((window->success_ratio *
      tbl->expected_tpt[index] + 64) / 128));

Furthermore, it seems like there was already an encounter with this bug before. Previously I had no means 
of grabbing more information off it, though, so I wrote down some notes:

2010-03-20: iwl-agn-rs.c line 2076: bug
The file where I kept my note for this bug contains: 

  - crashed during wireless AP scan using wpa_supplicant (after RFKILL switching to on..)
  - this happend right after a fresh reboot (after crashing for the same bug)

  - afterwards, auth.log contained this (bad, bad where's my data gone?):
Mar 20 18:24:01 localhost CRON[16684]: pam_unix(cron:session): session closed for user root
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@i686; it; rv:1.9.2.1) Gecko/20091211 Firefox/3.5.6 (like Firefox/3.5.11; Debian-3.5.11-1)"

Noticed the line in the bug from 2010-03-20? Yep, the same as in the one of today.

2010-04-30: One crash, two oopses: Msg restored mainly from remainders left in some brain's corner, w/ 
heavy massaging applied, mainly Code has been written down properly due to outside constrains..
Unfortunately, the proper ordering of lines got lost, so, yes, this is a poor report..
Order is as it's believed it was shown on-screen.

Number 1: nearly lost as subsequent oops made it scroll off-screen, only bottom half visible:

in CPU idle +x3f/x90

EIP rs_tx_status +x8f/x2030    SS: ESP 0668:f749fc4c

Code: 7f 22 69 c7 80 01 00 00 8b 8c 24 fc 01 00 00 8b 84 01 a8 00 00 00 6b 04 b0 64 39 44 24 1c 0f 8e 
      d2 fe ff ff 31 db e9 cb fe ff ff <0f> 0b fe 8b 54 24 4c 8b 8c 24 fc 01 00 00 8b 82 04 00 00 00

in iwl_rx_queue restock xbd/x130   iwlcore

panic -> oops begin 

kernel panic not syncing fatal exec in int

pid 0 comm: swapper tainted GD .33.2

Comments, ideas, fixes welcome... this BUG is highly annoying..

Happy to serve with more info if required (and available..).

  Regards,

      Nils

[-- Attachment #2: 2010-04-29_kernel-oops_iwlagn_2076_cut4www.png --]
[-- Type: image/png, Size: 83116 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-03 19:17 kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock NilsRadtkelkml
@ 2010-05-03 19:22 ` John W. Linville
  2010-05-06  9:14   ` Christian Borntraeger
  2010-05-10 18:36   ` Nils Radtke
  0 siblings, 2 replies; 30+ messages in thread
From: John W. Linville @ 2010-05-03 19:22 UTC (permalink / raw)
  To: NilsRadtkelkml; +Cc: linux-kernel, reinette.chatre, linux-wireless

On Mon, May 03, 2010 at 09:17:56PM +0200, NilsRadtkelkml@Think-Future.de wrote:

> Strangely, as stated in the previous message, this bug does only happen in conjunction at a specific 
> geographical position, so far it only happened there that is. How does this correlate with the designated
> code at line 2076 in iwl-agn-rs.c?:
> 
>   /* Sanity-check TPT calculations */
>   BUG_ON(window->average_tpt != ((window->success_ratio *
>       tbl->expected_tpt[index] + 64) / 128));

Interestingly enough, we have been discussing this line of code today.  Could you try the patch here?

	http://marc.info/?l=linux-wireless&m=127290931304496&w=2

Does it address the issue you are experiencing?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-03 19:22 ` John W. Linville
@ 2010-05-06  9:14   ` Christian Borntraeger
  2010-05-06 16:28     ` reinette chatre
  2010-05-10 18:36   ` Nils Radtke
  1 sibling, 1 reply; 30+ messages in thread
From: Christian Borntraeger @ 2010-05-06  9:14 UTC (permalink / raw)
  To: John W. Linville
  Cc: NilsRadtkelkml, linux-kernel, reinette.chatre, linux-wireless

Am Montag 03 Mai 2010 21:22:19 schrieb John W. Linville:
> >   /* Sanity-check TPT calculations */
> >   BUG_ON(window->average_tpt != ((window->success_ratio *
> >       tbl->expected_tpt[index] + 64) / 128));
> 
> Interestingly enough, we have been discussing this line of code today.  Could you try the patch here?
> 
> 	http://marc.info/?l=linux-wireless&m=127290931304496&w=2

I also see a hard lockup some time after connection to my companies
wireless network. My private network does not seem to trigger that bug.
Unfortunately the kernel is not able to switch back graphics, so I 
cannot tell if I see the same BUG - even if the problem description is 
the same.
For reference, the patch above  does not help on my T61p.

It started soon after 2.6.34-rc4. Before and with rc4 I had to
apply this http://patchwork.ozlabs.org/patch/49850/mbox/ patch to avoid
the other crash. With only this patch on top of rc4 everything seemed to
work fine, so the lockup seems to be triggered by one of the other patches.
Sometimes it takes some minutes to crash, which makes it hard to bisect
the problem.

Christian

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-06  9:14   ` Christian Borntraeger
@ 2010-05-06 16:28     ` reinette chatre
  2010-05-11 15:50       ` Christian Borntraeger
  0 siblings, 1 reply; 30+ messages in thread
From: reinette chatre @ 2010-05-06 16:28 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: John W. Linville, NilsRadtkelkml@think-future.de,
	linux-kernel@vger.kernel.org, linux-wireless@vger.kernel.org

Hi Christian,

On Thu, 2010-05-06 at 02:14 -0700, Christian Borntraeger wrote:
> Am Montag 03 Mai 2010 21:22:19 schrieb John W. Linville:
> > >   /* Sanity-check TPT calculations */
> > >   BUG_ON(window->average_tpt != ((window->success_ratio *
> > >       tbl->expected_tpt[index] + 64) / 128));
> > 
> > Interestingly enough, we have been discussing this line of code today.  Could you try the patch here?
> > 
> > 	http://marc.info/?l=linux-wireless&m=127290931304496&w=2
> 
> I also see a hard lockup some time after connection to my companies
> wireless network. My private network does not seem to trigger that bug.
> Unfortunately the kernel is not able to switch back graphics, so I 
> cannot tell if I see the same BUG - even if the problem description is 
> the same.

It will be hard to debug this without some logs. Can you perhaps run
with netconsole for a while? Is it possible to trigger this when not in
X to be able to get some information about where issue is?

> For reference, the patch above  does not help on my T61p.
> 
> It started soon after 2.6.34-rc4. Before and with rc4 I had to
> apply this http://patchwork.ozlabs.org/patch/49850/mbox/ patch to avoid
> the other crash. With only this patch on top of rc4 everything seemed to
> work fine, so the lockup seems to be triggered by one of the other patches.
> Sometimes it takes some minutes to crash, which makes it hard to bisect
> the problem.

Below seven iwlwifi patches were added after rc4. If you are unable to
bisect ... perhaps you can run a while by reverting more and more from
this list?

f2fa1b015e9c199e45c836c769d94db595150731 iwlwifi: correct 6000 EEPROM regulatory address
88be026490ed89c2ffead81a52531fbac5507e01 iwlwifi: fix scan races
8b9fce77737ae9983f61ec56cd53f52fb738b2c7 iwlwifi: work around bogus active chains detection
ece6444c2fe80dab679beb5f0d58b091f1933b00 iwlwifi: need check for valid qos packet before free
de0f60ea94e132c858caa64a44b2012e1e8580b0 iwlwifi: avoid Tx queue memory allocation in interface down
04f2dec1c3d375c4072613880f28f43b66524876 iwlwifi: use consistent table for tx data collect
dd48744964296b5713032ea1d66eb9e3d990e287 iwlwifi: fix DMA allocation warnings

Reinette



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-06 16:28     ` reinette chatre
@ 2010-05-11 15:50       ` Christian Borntraeger
  2010-05-11 17:21         ` reinette chatre
  0 siblings, 1 reply; 30+ messages in thread
From: Christian Borntraeger @ 2010-05-11 15:50 UTC (permalink / raw)
  To: reinette chatre
  Cc: John W. Linville, NilsRadtkelkml@think-future.de,
	linux-kernel@vger.kernel.org, linux-wireless@vger.kernel.org

Am Donnerstag 06 Mai 2010 18:28:48 schrieb reinette chatre:
> Below seven iwlwifi patches were added after rc4. If you are unable to
> bisect ... perhaps you can run a while by reverting more and more from
> this list?
> 
> f2fa1b015e9c199e45c836c769d94db595150731 iwlwifi: correct 6000 EEPROM regulatory address
> 88be026490ed89c2ffead81a52531fbac5507e01 iwlwifi: fix scan races
> 8b9fce77737ae9983f61ec56cd53f52fb738b2c7 iwlwifi: work around bogus active chains detection
> ece6444c2fe80dab679beb5f0d58b091f1933b00 iwlwifi: need check for valid qos packet before free
> de0f60ea94e132c858caa64a44b2012e1e8580b0 iwlwifi: avoid Tx queue memory allocation in interface down
> 04f2dec1c3d375c4072613880f28f43b66524876 iwlwifi: use consistent table for tx data collect
> dd48744964296b5713032ea1d66eb9e3d990e287 iwlwifi: fix DMA allocation warnings

Just to give you some feedback. Sometimes it takes a while until it crashes.
My hand made bisect is currently at two remaining patches:

de0f60ea94e132c858caa64a44b2012e1e8580b0
8b9fce77737ae9983f61ec56cd53f52fb738b2c7

reverting both solves my hard lockup. I will try to isolate the "bad" patch
but this takes some more days since I wont be in the "hazardous environment"
this week.

Christian

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-11 15:50       ` Christian Borntraeger
@ 2010-05-11 17:21         ` reinette chatre
  2010-05-12 15:18           ` Christian Borntraeger
  0 siblings, 1 reply; 30+ messages in thread
From: reinette chatre @ 2010-05-11 17:21 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: John W. Linville, NilsRadtkelkml@think-future.de,
	linux-kernel@vger.kernel.org, linux-wireless@vger.kernel.org

On Tue, 2010-05-11 at 08:50 -0700, Christian Borntraeger wrote:
> Am Donnerstag 06 Mai 2010 18:28:48 schrieb reinette chatre:
> > Below seven iwlwifi patches were added after rc4. If you are unable to
> > bisect ... perhaps you can run a while by reverting more and more from
> > this list?
> > 
> > f2fa1b015e9c199e45c836c769d94db595150731 iwlwifi: correct 6000 EEPROM regulatory address
> > 88be026490ed89c2ffead81a52531fbac5507e01 iwlwifi: fix scan races
> > 8b9fce77737ae9983f61ec56cd53f52fb738b2c7 iwlwifi: work around bogus active chains detection
> > ece6444c2fe80dab679beb5f0d58b091f1933b00 iwlwifi: need check for valid qos packet before free
> > de0f60ea94e132c858caa64a44b2012e1e8580b0 iwlwifi: avoid Tx queue memory allocation in interface down
> > 04f2dec1c3d375c4072613880f28f43b66524876 iwlwifi: use consistent table for tx data collect
> > dd48744964296b5713032ea1d66eb9e3d990e287 iwlwifi: fix DMA allocation warnings
> 
> Just to give you some feedback. Sometimes it takes a while until it crashes.
> My hand made bisect is currently at two remaining patches:
> 
> de0f60ea94e132c858caa64a44b2012e1e8580b0
> 8b9fce77737ae9983f61ec56cd53f52fb738b2c7
> 
> reverting both solves my hard lockup. I will try to isolate the "bad" patch
> but this takes some more days since I wont be in the "hazardous environment"
> this week.

Thank you for digging into this. It will be very helpful it you can get
us a trace of the crash - any chance that netconsole may work?

Reinette



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-11 17:21         ` reinette chatre
@ 2010-05-12 15:18           ` Christian Borntraeger
  0 siblings, 0 replies; 30+ messages in thread
From: Christian Borntraeger @ 2010-05-12 15:18 UTC (permalink / raw)
  To: reinette chatre
  Cc: John W. Linville, NilsRadtkelkml@think-future.de,
	linux-kernel@vger.kernel.org, linux-wireless@vger.kernel.org

Am Dienstag 11 Mai 2010 19:21:33 schrieb reinette chatre:
> > de0f60ea94e132c858caa64a44b2012e1e8580b0
> > 8b9fce77737ae9983f61ec56cd53f52fb738b2c7
> > 
> > reverting both solves my hard lockup. I will try to isolate the "bad" patch
> > but this takes some more days since I wont be in the "hazardous environment"
> > this week.

Drat! Now I got a lockup with these two patches reverted. Dont know if that  was
the same bug.

> Thank you for digging into this. It will be very helpful it you can get
> us a trace of the crash - any chance that netconsole may work?

No success so far.




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-03 19:22 ` John W. Linville
  2010-05-06  9:14   ` Christian Borntraeger
@ 2010-05-10 18:36   ` Nils Radtke
  2010-05-10 23:32     ` reinette chatre
  1 sibling, 1 reply; 30+ messages in thread
From: Nils Radtke @ 2010-05-10 18:36 UTC (permalink / raw)
  To: linville; +Cc: linux-kernel, reinette.chatre, linux-wireless

  Hi John,

  Today weather was fine again, finally. So testing with .33.3 w/ the patch applied: 

  http://marc.info/?l=linux-wireless&m=127290931304496&w=2

The kernel kernel .32 was still running before it crashed immediately on wireless activation.
The crash log showed again at least two messages, the last was as already described in my first
message, bug from 2010-04-30: I think even the 0x2030 was the same:

EIP rs_tx_status +x8f/x2030 

W/ .33.3 and the above patch applied:

Linux mypole 2.6.33.3 #18 SMP PREEMPT Thu May 6 21:51:37 CEST 2010 i686 GNU/Linux

May 10 19:14:11 [   80.586637] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:17 [  626.476078] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:30 [  638.913740] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:32 [  641.232425] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:54 [  663.392697] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:23:58 [  666.980247] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
May 10 19:24:02 [  671.121826] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now

Additionally these were logged, could you tell why they're there and what to do? (also .33.3 w/ patch)

May 10 19:24:16 [  685.079617] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
May 10 19:24:22 [  691.026737] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
May 10 19:28:02 [  911.406162] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
May 10 19:35:38 [ 1367.251240] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0

The above "iwl_tx_agg_start" lines happen when connecting - again to a Cisco AP - and the connection gets
dropped the exact moment when a download is started. It even often drops when dhcp is still negotiating, has
got it's IP but the nego isn't finished yet. Conn drops, same procedure again and again. This happens only
with this Cisco AP (which is BTW another one from the "expected_tpt should have been calculated by now" 
problem).

Another type occurred: (probably .33.2 or something)

May  3 17:58:11 [ 3946.608743] iwlagn 0000:03:00.0: request scan called when driver not ready.
May  3 17:58:12 [ 3948.082684] iwlagn 0000:03:00.0: TX Power requested while scanning!
May  3 18:01:00 [ 4115.282852] iwlagn 0000:03:00.0: RF_KILL bit toggled to disable radio.

Is this "TX Power requested while scanning" because of RF_KILL set to off?

  Thank you,

  Nils

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-10 18:36   ` Nils Radtke
@ 2010-05-10 23:32     ` reinette chatre
  2010-05-12 14:39       ` Nils Radtke
  0 siblings, 1 reply; 30+ messages in thread
From: reinette chatre @ 2010-05-10 23:32 UTC (permalink / raw)
  To: Nils Radtke
  Cc: linville@tuxdriver.com, linux-kernel@vger.kernel.org,
	linux-wireless@vger.kernel.org

On Mon, 2010-05-10 at 11:36 -0700, Nils Radtke wrote:
>   Today weather was fine again, finally. So testing with .33.3 w/ the patch applied: 
> 
>   http://marc.info/?l=linux-wireless&m=127290931304496&w=2
> 
> The kernel kernel .32 was still running before it crashed immediately on wireless activation.
> The crash log showed again at least two messages, the last was as already described in my first
> message, bug from 2010-04-30: I think even the 0x2030 was the same:
> 
> EIP rs_tx_status +x8f/x2030 

You report an issue on 2.6.32 ...

> 
> W/ .33.3 and the above patch applied:

... but then test the patch with 2.6.33.

Which kernel are you focused on?


> Linux mypole 2.6.33.3 #18 SMP PREEMPT Thu May 6 21:51:37 CEST 2010 i686 GNU/Linux
> 
> May 10 19:14:11 [   80.586637] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:17 [  626.476078] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:30 [  638.913740] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:32 [  641.232425] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:54 [  663.392697] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:23:58 [  666.980247] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
> May 10 19:24:02 [  671.121826] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now

Can you see any impact on your connection speed that can be connected to
these messages?

> Additionally these were logged, could you tell why they're there and what to do? (also .33.3 w/ patch)
> 
> May 10 19:24:16 [  685.079617] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
> May 10 19:24:22 [  691.026737] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
> May 10 19:28:02 [  911.406162] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
> May 10 19:35:38 [ 1367.251240] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
> 
> The above "iwl_tx_agg_start" lines happen when connecting - again to a Cisco AP - and the connection gets
> dropped the exact moment when a download is started. It even often drops when dhcp is still negotiating, has
> got it's IP but the nego isn't finished yet. Conn drops, same procedure again and again. This happens only
> with this Cisco AP (which is BTW another one from the "expected_tpt should have been calculated by now" 
> problem).

It could be that some of the queues get stuck. Can you try with the
patches in
http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2037#c113 ? They
are based on 2.6.33.

Reinette




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-10 23:32     ` reinette chatre
@ 2010-05-12 14:39       ` Nils Radtke
  2010-05-12 23:14         ` reinette chatre
  0 siblings, 1 reply; 30+ messages in thread
From: Nils Radtke @ 2010-05-12 14:39 UTC (permalink / raw)
  To: reinette.chatre; +Cc: linux-kernel

  Hi,

Problem regarding expected tpt calculation: (at site A)
  Tested w/ Linux mypole 2.6.33.3 #21 SMP PREEMPT Tue May 11 11:41:07 CEST 2010 i686 GNU/Linux
  and iwl recovery patches applied.

[28766.975878] iwlagn 0000:03:00.0: loaded firmware version 228.61.2.24
[28770.306101] eth1: direct probe to AP 00:1a:70:aa:bb:cc (try 1)
[28770.506282] eth1: direct probe to AP 00:1a:70:aa:bb:cc (try 2)
[28770.512298] eth1: direct probe responded
[28770.512308] eth1: authenticate with AP 00:1a:70:aa:bb:cc (try 1)
[28770.712518] eth1: authenticate with AP 00:1a:70:aa:bb:cc (try 2)
[28770.714378] eth1: authenticated
[28770.714419] eth1: associate with AP 00:1a:70:aa:bb:cc (try 1)
[28770.914282] eth1: associate with AP 00:1a:70:aa:bb:cc (try 2)
[28770.917618] eth1: RX AssocResp from 00:1a:70:aa:bb:cc (capab=0x401 status=0 aid=2)
[28770.917626] eth1: associated
[28777.113828] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[28777.991207] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[28778.621611] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[28788.508867] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now

Messages of the type of the last line all over the logs, coming in triples, couples or alone..

Then, there are again these msgs: (site A again)

[29001.505362] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:aa:bb:cc tid = 0
[29039.037933] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:aa:bb:cc tid = 0

and so on. Associated w/ poor throughput.

  - - -

Problem regarding the deauths/reconnects (sometimes combined w/ bad throughput):

  Tested w/ Linux mypole 2.6.33.3 #21 SMP PREEMPT Tue May 11 11:41:07 CEST 2010 i686 GNU/Linux
  and iwl recovery patches applied.

RFKILL switch on, wpa_cli stat showed auth.

First auth: conn failed during https login

[28018.602587] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)
[28018.900400] iwlagn 0000:03:00.0: TX Power requested while scanning!
[28018.900411] iwlagn 0000:03:00.0: Error sending TX power (-11)
[28018.921465] iwlagn 0000:03:00.0: request scan called when driver not ready.

started knoppix download w/ 20k/s avg, got 88kb when:
[28225.669850] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)

next try: no deauth but throughput was: --.-K/s
on abortion of wget and retry: deauth:

[28376.132933] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)
[28376.253981] eth1: direct probe to AP 00:23:eb:aa:bb:cc (try 1)
[28376.768958] eth1: direct probe to AP 00:23:eb:aa:bb:cc (try 1)
[28376.771879] eth1: direct probe responded
[28376.771887] eth1: authenticate with AP 00:23:eb:aa:bb:cc (try 1)
[28376.971307] eth1: authenticate with AP 00:23:eb:aa:bb:cc (try 2)
[28376.974411] eth1: authenticated
[28376.974455] eth1: associate with AP 00:23:eb:aa:bb:cc (try 1)
[28377.174290] eth1: associate with AP 00:23:eb:aa:bb:cc (try 2)
[28377.180867] eth1: RX AssocResp from 00:23:eb:aa:bb:cc (capab=0x421 status=0 aid=11)
[28377.180875] eth1: associated

Then again, doing nothing:
[28406.222404] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)
[28409.836210] eth1: direct probe to AP 00:40:96:aa:bb:cc (try 1)
[28410.036330] eth1: direct probe to AP 00:40:96:aa:bb:cc (try 2)
[28410.039365] eth1: direct probe responded
[28410.039377] eth1: authenticate with AP 00:40:96:aa:bb:cc (try 1)
[28410.239290] eth1: authenticate with AP 00:40:96:aa:bb:cc (try 2)
[28410.245375] eth1: authenticated
[28410.245419] eth1: associate with AP 00:40:96:aa:bb:cc (try 1)
[28410.445346] eth1: associate with AP 00:40:96:aa:bb:cc (try 2)
[28410.448273] eth1: RX AssocResp from 00:40:96:aa:bb:cc (capab=0x421 status=0 aid=17)
[28410.448282] eth1: associated

and so on:
[28476.948459] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)
[28477.059629] eth1: direct probe to AP 00:40:96:aa:bb:cc (try 1)
[28477.590389] eth1: direct probe to AP 00:40:96:aa:bb:cc (try 1)
[28477.607096] eth1: direct probe responded
[28477.607106] eth1: authenticate with AP 00:40:96:aa:bb:cc (try 1)
[28477.807275] eth1: authenticate with AP 00:40:96:aa:bb:cc (try 2)
[28477.812340] eth1: authenticated
[28477.812382] eth1: associate with AP 00:40:96:aa:bb:cc (try 1)
[28478.012397] eth1: associate with AP 00:40:96:aa:bb:cc (try 2)
[28478.014914] eth1: RX AssocResp from 00:40:96:aa:bb:cc (capab=0x421 status=0 aid=19)
[28478.014922] eth1: associated

So, this is a real blocker. 

BTW, what is so particular about cisco APs? Again, this is a problem only w/ cisco (different site B, 
though):
0:40:96 Cisco Systems, Inc.

  Thanks.

  Cheers,

            Nils


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-12 14:39       ` Nils Radtke
@ 2010-05-12 23:14         ` reinette chatre
  2010-05-13 10:34           ` Nils Radtke
                             ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: reinette chatre @ 2010-05-12 23:14 UTC (permalink / raw)
  To: Nils Radtke; +Cc: linux-kernel@vger.kernel.org

Hi Nils,

On Wed, 2010-05-12 at 07:39 -0700, Nils Radtke wrote:
> Problem regarding the deauths/reconnects (sometimes combined w/ bad throughput):
> 
>   Tested w/ Linux mypole 2.6.33.3 #21 SMP PREEMPT Tue May 11 11:41:07 CEST 2010 i686 GNU/Linux
>   and iwl recovery patches applied.
> 

Any chance you can try these same tests with latest code
(compat-wireless may be easiest) to help us determine if this is still
an issue?

> RFKILL switch on, wpa_cli stat showed auth.
> 
> First auth: conn failed during https login
> 
> [28018.602587] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)
> [28018.900400] iwlagn 0000:03:00.0: TX Power requested while scanning!
> [28018.900411] iwlagn 0000:03:00.0: Error sending TX power (-11)
> [28018.921465] iwlagn 0000:03:00.0: request scan called when driver not ready.

We could maybe at least figure out what is going on here if the driver
is loaded with some debugging. Please ensure driver is compiled with
debugging (CONFIG_IWLWIFI_DEBUG) and load the module with "modprobe
iwlagn debug=0x43fff". Could you please reproduce under this
environment?

Reinette




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-12 23:14         ` reinette chatre
@ 2010-05-13 10:34           ` Nils Radtke
  2010-05-13 11:32           ` Nils Radtke
  2010-05-13 15:05           ` Nils Radtke
  2 siblings, 0 replies; 30+ messages in thread
From: Nils Radtke @ 2010-05-13 10:34 UTC (permalink / raw)
  To: reinette.chatre; +Cc: linux-kernel

  Hi Reinette,

On Wed, 2010-05-12 at 07:39 -0700, Nils Radtke wrote:
> > Problem regarding the deauths/reconnects (sometimes combined w/ bad throughput):
> > 
> >   Tested w/ Linux mypole 2.6.33.3 #21 SMP PREEMPT Tue May 11 11:41:07 CEST 2010 i686 GNU/Linux
> >   and iwl recovery patches applied.
> > 
> 
> Any chance you can try these same tests with latest code
> (compat-wireless may be easiest) to help us determine if this is still
> an issue?
I'll try those. Have to recompile the kernel though as I used to have config_80211 in instead of M.

> > RFKILL switch on, wpa_cli stat showed auth.
> > 
> > First auth: conn failed during https login
> > 
> > [28018.602587] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)
> > [28018.900400] iwlagn 0000:03:00.0: TX Power requested while scanning!
> > [28018.900411] iwlagn 0000:03:00.0: Error sending TX power (-11)
> > [28018.921465] iwlagn 0000:03:00.0: request scan called when driver not ready.
> 
> We could maybe at least figure out what is going on here if the driver
> is loaded with some debugging. Please ensure driver is compiled with
> debugging (CONFIG_IWLWIFI_DEBUG) and load the module with "modprobe
> iwlagn debug=0x43fff". Could you please reproduce under this
> environment?
Ok.

Today it's been a thunderstorm here but right now the sun is shining. Guess, I'll make it for a tour through 
the city.. :)

  Cheers,

        Nils


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-12 23:14         ` reinette chatre
  2010-05-13 10:34           ` Nils Radtke
@ 2010-05-13 11:32           ` Nils Radtke
  2010-05-13 16:31             ` reinette chatre
  2010-05-13 15:05           ` Nils Radtke
  2 siblings, 1 reply; 30+ messages in thread
From: Nils Radtke @ 2010-05-13 11:32 UTC (permalink / raw)
  To: reinette.chatre; +Cc: linux-kernel

  Hi,

  short feedback:

  - recompiled kernel and config80211 as M
  - debug symbols included. Yet to try.
  - reboot
    Linux mypole 2.6.33.3 #22 SMP PREEMPT Thu May 13 12:52:22 CEST 2010 i686 GNU/Linux
  - renamed in-tree iwlagn and iwlcore, ran depmod -a, confirmed by looking into modules.dep
  - compat wireless compiled ok, modprobe fails w/:

[   23.665814] iwlagn: Intel(R) Wireless WiFi Link AGN driver for Linux, in-tree:
[   23.665817] iwlagn: Copyright(c) 2003-2010 Intel Corporation
[   23.665869] BUG: unable to handle kernel NULL pointer dereference at (null)
[   23.665934] IP: [<f8931b40>] iwl_alloc_all+0x20/0x50 [iwlcore]
[   23.665980] *pdpt = 0000000036448001 *pde = 0000000000000000 
[   23.666019] Oops: 0002 [#1] PREEMPT SMP 
[   23.666019] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/type
[   23.666019] Modules linked in: iwlagn(+) iwlcore mac80211 uhci_hcd e1000e cfg80211
[   23.666019] 
[   23.666019] Pid: 356, comm: modprobe Not tainted 2.6.33.3 #22 7666B4G/7666B4G
[   23.666019] EIP: 0060:[<f8931b40>] EFLAGS: 00010282 CPU: 0
[   23.666019] EIP is at iwl_alloc_all+0x20/0x50 [iwlcore]
[   23.666019] EAX: 00000000 EBX: f8998f20 ECX: f61d0280 EDX: f88e1870
[   23.666019] ESI: f61d0280 EDI: f8998f20 EBP: 00000000 ESP: f64ede38
[   23.666019]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[   23.666019] Process modprobe (pid: 356, ti=f64ec000 task=f64b2030 task.ti=f64ec000)
[   23.666019] Stack:
[   23.666019]  c114241c f64ede58 f752bc00 f89987c0 f897eed5 00000000 a1ffde9c f65affe0
[   23.666019] <0> 000033bf f606dbbc f7535b0c c16f06bd c16f03fc c134d180 f752bc00 f89987c0
[   23.666019] <0> f897ee90 00000000 c134d02b c134d851 f89987c0 f752bc00 f8997e9c f752bc5c
[   23.666019] Call Trace:
[   23.666019]  [<c114241c>] ? sysfs_add_one+0x1c/0x140
[   23.666019]  [<f897eed5>] ? iwl_pci_probe+0x45/0x9c0 [iwlagn]
[   23.666019]  [<c16f06bd>] ? _raw_spin_lock+0xd/0x30
[   23.666019]  [<c16f03fc>] ? _raw_spin_unlock+0xc/0x20
[   23.666019]  [<c134d180>] ? pci_match_device+0xa0/0xc0
[   23.666019]  [<f897ee90>] ? iwl_pci_probe+0x0/0x9c0 [iwlagn]
[   23.666019]  [<c134d02b>] ? local_pci_probe+0xb/0x10
[   23.666019]  [<c134d851>] ? pci_device_probe+0x61/0x80
[   23.666019]  [<c1415dbb>] ? driver_probe_device+0x7b/0x170
[   23.666019]  [<c134d180>] ? pci_match_device+0xa0/0xc0
[   23.666019]  [<c16f0689>] ? _raw_spin_lock_irqsave+0x19/0x40
[   23.671155]  [<c1415f29>] ? __driver_attach+0x79/0x80
[   23.671155]  [<c1415eb0>] ? __driver_attach+0x0/0x80
[   23.671155]  [<c1415622>] ? bus_for_each_dev+0x52/0x80
[   23.671155]  [<c1415c56>] ? driver_attach+0x16/0x20
[   23.671155]  [<c1415eb0>] ? __driver_attach+0x0/0x80
[   23.671155]  [<c1414f1c>] ? bus_add_driver+0xcc/0x2b0
[   23.671155]  [<c134d790>] ? pci_device_remove+0x0/0x40
[   23.671155]  [<c14161c3>] ? driver_register+0x63/0x120
[   23.671155]  [<f899e000>] ? iwl_init+0x0/0x6a [iwlagn]
[   23.671155]  [<f899e000>] ? iwl_init+0x0/0x6a [iwlagn]
[   23.671155]  [<c134da7d>] ? __pci_register_driver+0x3d/0xb0
[   23.671155]  [<f899e04d>] ? iwl_init+0x4d/0x6a [iwlagn]
[   23.671155]  [<c100112f>] ? do_one_initcall+0x2f/0x190
[   23.671155]  [<c107ad54>] ? sys_init_module+0xb4/0x220
[   23.671155]  [<c1002c8c>] ? sysenter_do_call+0x12/0x22
[   23.671155] Code: b6 00 00 00 00 8d bf 00 00 00 00 83 ec 10 89 5c 24 08 89 c3 b8 24 63 00 00 89 74 24 0c e8 59 fa fa ff 85 c0 89 c6 74 13 8b 40 30 <89> 30 89 f0 8b 5c 24 08 8b 74 24 0c 83 c4 10 c3 8b 03 c7 04 24 
[   23.671155] EIP: [<f8931b40>] iwl_alloc_all+0x20/0x50 [iwlcore] SS:ESP 0068:f64ede38
[   23.671155] CR2: 0000000000000000
[   23.693150] ---[ end trace 2f287167f09b7fb3 ]---


So, suspending test w/ compat-wireless drivers, doing w/ those in-tree w/ dbg symbols included and activated.

  Cheers,

              Nils


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-13 11:32           ` Nils Radtke
@ 2010-05-13 16:31             ` reinette chatre
  2010-05-14 17:45               ` Nils Radtke
  0 siblings, 1 reply; 30+ messages in thread
From: reinette chatre @ 2010-05-13 16:31 UTC (permalink / raw)
  To: Nils Radtke; +Cc: linux-kernel@vger.kernel.org

On Thu, 2010-05-13 at 04:32 -0700, Nils Radtke wrote:
>   short feedback:
> 
>   - recompiled kernel and config80211 as M
>   - debug symbols included. Yet to try.
>   - reboot
>     Linux mypole 2.6.33.3 #22 SMP PREEMPT Thu May 13 12:52:22 CEST 2010 i686 GNU/Linux
>   - renamed in-tree iwlagn and iwlcore, ran depmod -a, confirmed by looking into modules.dep

Note that compat-wireless installs its modules in updates/ and thus does
not overwrite your default ones. After default modules are unloaded the
ones in updates/ should be picked up first. No renaming or overwriting
of system's default modules should be done.

Following the instructions that come with compat-wireless should be all
you need to do.

Reinette



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-13 16:31             ` reinette chatre
@ 2010-05-14 17:45               ` Nils Radtke
  0 siblings, 0 replies; 30+ messages in thread
From: Nils Radtke @ 2010-05-14 17:45 UTC (permalink / raw)
  To: reinette.chatre; +Cc: linux-kernel, linux-wireless

  Hi Reinette,

  Might be of interest:

[63099.789939] eth1: associated
[63166.919257] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63180.322024] Hangcheck: hangcheck value past margin!
[63190.664526] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63193.255873] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63194.941768] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63195.099286] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63196.524065] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63197.417740] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[63199.767526] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:aa:aa:aa tid = 0
[63205.689184] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:aa:aa:aa tid = 0
[63210.821316] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:aa:aa:aa tid = 0
[63228.178530] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:aa:aa:aa tid = 0

Happened on site B, with high throughput (280-340k/s). So, it's happening w/ both, fast
and slow conn speed.

Yes, I noticed and had a look into the compat-wireless scripts but preferred to do 
it manually. Thank you for your explanation.

Cheers,

                Nils


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-12 23:14         ` reinette chatre
  2010-05-13 10:34           ` Nils Radtke
  2010-05-13 11:32           ` Nils Radtke
@ 2010-05-13 15:05           ` Nils Radtke
  2010-05-17 23:19             ` reinette chatre
  2 siblings, 1 reply; 30+ messages in thread
From: Nils Radtke @ 2010-05-13 15:05 UTC (permalink / raw)
  To: reinette.chatre; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 967 bytes --]

  Hi Reinette,

Attached the output from the debug-enabled in-tree iwlwifi driver connects.

The order of trial is: site A first, then site B. The sites have thus been visited 
in the same order as last time and the commands executed have been equally the
same. The failures evenly.. All the log entries that caught my eye last time
can be found in the attached one also. I'm sending it bz2 compressed. My uplink
is a poor man's volume limited UMTS one.

The timestamp that is the last entry for site A log is: 4174.659838
Everthing after that timestamp is log for site B, 10 minutes by bike from site A.

We've been lucky, I could just finish site B survey as heaven opened it's doors such
wide that the streets flooded within seconds. Before that mosquitos took the chance 
and bit me a dozen of times (*scratch*). I hope it was worth it and there's something 
to be found in the logs. ;)

If there's anything else, let me know.

  Cheers,

                  Nils

[-- Attachment #2: 2010-05-13_iwlwifi_dbg_filter.bz2 --]
[-- Type: application/octet-stream, Size: 99247 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-13 15:05           ` Nils Radtke
@ 2010-05-17 23:19             ` reinette chatre
  2010-05-20 12:15               ` Nils Radtke
  2010-05-20 12:31               ` Nils Radtke
  0 siblings, 2 replies; 30+ messages in thread
From: reinette chatre @ 2010-05-17 23:19 UTC (permalink / raw)
  To: Nils Radtke; +Cc: linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1510 bytes --]

Hi Nils,

On Thu, 2010-05-13 at 08:05 -0700, Nils Radtke wrote:
> Attached the output from the debug-enabled in-tree iwlwifi driver connects.

To keep track, this is 2.6.33.3.

> 
> The order of trial is: site A first, then site B. The sites have thus been visited 
> in the same order as last time and the commands executed have been equally the
> same. The failures evenly.. All the log entries that caught my eye last time
> can be found in the attached one also. I'm sending it bz2 compressed. My uplink
> is a poor man's volume limited UMTS one.

Thank you very much.

So from what I can tell (to summarize your previous emails) there are
three issues:

1) Error messages like: 
iwlagn 0000:03:00.0: expected_tpt should have been calculated by now

2) Frequent deaths with code like:
eth1: deauthenticated from 00:40:96:aa:aa:aa (Reason: 2)

3) Error as follows:
[ 4148.141064] iwlagn 0000:03:00.0: TX Power requested while scanning!
[ 4148.141070] iwlagn 0000:03:00.0: Error sending TX power (-11)


To address (1), could you please run with attached debug patch and also
enable rate scaling debugging. That will be "modprobe iwlagn
debug=0x143fff).

Regarding (2): This is a common issue in busy environments where AP
decides to deathenticate station after it does not receive an ack for
data sent after a few retries. Was this test done in busy environment?

Regarding (3): Seems like driver is getting a request to scan after a
request to remove interface. I am still inquiring about this.

Reinette


[-- Attachment #2: debug.patch --]
[-- Type: text/x-patch, Size: 1672 bytes --]

diff --git a/drivers/net/wireless/iwlwifi/iwl-agn-rs.c b/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
index b93e491..60e5323 100644
--- a/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
+++ b/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
@@ -361,6 +361,8 @@ static int rs_collect_tx_data(struct iwl_rate_scale_data *windows,
 	 * subtract "1" from the success counter (this is the main reason
 	 * we keep these bitmaps!).
 	 */
+	IWL_DEBUG_RATE(priv, "attempts = %d, window->counter = %d\n",
+			attempts, window->counter);
 	while (attempts > 0) {
 		if (window->counter >= IWL_RATE_MAX_WINDOW) {
 
@@ -389,6 +391,7 @@ static int rs_collect_tx_data(struct iwl_rate_scale_data *windows,
 		attempts--;
 	}
 
+	IWL_DEBUG_RATE(priv, "window counter = %d \n", window->counter);
 	/* Calculate current success ratio, avoid divide-by-0! */
 	if (window->counter > 0)
 		window->success_ratio = 128 * (100 * window->success_counter)
@@ -849,6 +852,7 @@ static void rs_tx_status(void *priv_r, struct ieee80211_supported_band *sband,
 	 * first index into rate scale table.
 	 */
 	if (info->flags & IEEE80211_TX_STAT_AMPDU) {
+		IWL_DEBUG_RATE(priv, "Updating aggregation history\n");
 		tx_rate = le32_to_cpu(table->rs_table[0].rate_n_flags);
 		rs_get_tbl_info_from_mcs(tx_rate, priv->band, &tbl_type,
 				&rs_index);
@@ -868,6 +872,7 @@ static void rs_tx_status(void *priv_r, struct ieee80211_supported_band *sband,
 	 * For legacy, update frame history with for each Tx retry.
 	 */
 		retries = info->status.rates[0].count - 1;
+		IWL_DEBUG_RATE(priv, "Updating legacy history. retries = %d \n", retries);
 		/* HW doesn't send more than 15 retries */
 		retries = min(retries, 15);
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-17 23:19             ` reinette chatre
@ 2010-05-20 12:15               ` Nils Radtke
  2010-05-20 18:33                 ` reinette chatre
  2010-05-20 12:31               ` Nils Radtke
  1 sibling, 1 reply; 30+ messages in thread
From: Nils Radtke @ 2010-05-20 12:15 UTC (permalink / raw)
  To: reinette chatre; +Cc: linux-kernel@vger.kernel.org

    Hi,

# > Attached the output from the debug-enabled in-tree iwlwifi driver connects.
# To keep track, this is 2.6.33.3.

# So from what I can tell (to summarize your previous emails) there are
# three issues:
# 1) Error messages like: 
# iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# 
# 2) Frequent deaths with code like:
# eth1: deauthenticated from 00:40:96:aa:aa:aa (Reason: 2)
Appropriate misspelling, I assume, no pun intended? ;)

# 
# 3) Error as follows:
# [ 4148.141064] iwlagn 0000:03:00.0: TX Power requested while scanning!
# [ 4148.141070] iwlagn 0000:03:00.0: Error sending TX power (-11)

This is correct.

# 
# To address (1), could you please run with attached debug patch and also
# enable rate scaling debugging. That will be "modprobe iwlagn
# debug=0x143fff).
drivers/net/wireless/iwlwifi/iwl-agn-rs.c: In function ‘rs_collect_tx_data’:
drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: ‘priv’ undeclared (first use in this function)
drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: (Each undeclared identifier is reported only once
drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: for each function it appears in.)

This happens when compiling w/ the patch applied cleanly against .33.3
I'll try to fix it, then conduct the field test. For the latter, do 
you need the same kind of log as for the previous one? 

# Regarding (2): This is a common issue in busy environments where AP
# decides to deathenticate station after it does not receive an ack for
# data sent after a few retries. Was this test done in busy environment?
Both. This happens in busy environment as well as in an idle one. Can't tell
yet whether there're more of those msgs in busy env. I start to feel against 
Cisco APs..

# Regarding (3): Seems like driver is getting a request to scan after a
# request to remove interface. I am still inquiring about this.
Probably due to me switching of via RF_KILLSWITCH. But anyway I assume this
msg should not happen..

Thanks, Reginette.

      Cheers,

              Nils

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-20 12:15               ` Nils Radtke
@ 2010-05-20 18:33                 ` reinette chatre
  2010-05-31 20:12                   ` Nils Radtke
  0 siblings, 1 reply; 30+ messages in thread
From: reinette chatre @ 2010-05-20 18:33 UTC (permalink / raw)
  To: Nils Radtke; +Cc: linux-kernel@vger.kernel.org

On Thu, 2010-05-20 at 05:15 -0700, Nils Radtke wrote:
> # 
> # To address (1), could you please run with attached debug patch and also
> # enable rate scaling debugging. That will be "modprobe iwlagn
> # debug=0x143fff).
> drivers/net/wireless/iwlwifi/iwl-agn-rs.c: In function ‘rs_collect_tx_data’:
> drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: ‘priv’ undeclared (first use in this function)
> drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: (Each undeclared identifier is reported only once
> drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: for each function it appears in.)
> 
> This happens when compiling w/ the patch applied cleanly against .33.3
> I'll try to fix it, then conduct the field test.

Sorry ... and thanks.

>  For the latter, do 
> you need the same kind of log as for the previous one? 

The goal of this patch is to find the reason behind the error
"expected_tpt should have been calculated by now". From what I
understand you only encountered that in one of your tests, not all. Any
test you can run to reproduce that error will be welcome. 

Thinking about your question more ... I believe your previous debug logs
were created with debug flag 0x43fff. For this iteration, please use
debug flag 0x143fff.

>  
> # Regarding (2): This is a common issue in busy environments where AP
> # decides to deathenticate station after it does not receive an ack for
> # data sent after a few retries. Was this test done in busy environment?
> Both. This happens in busy environment as well as in an idle one. Can't tell
> yet whether there're more of those msgs in busy env. I start to feel against 
> Cisco APs..

I don't know ... perhaps these APs have been set up to be strict wrt
delays.

 
> # Regarding (3): Seems like driver is getting a request to scan after a
> # request to remove interface. I am still inquiring about this.
> Probably due to me switching of via RF_KILLSWITCH. But anyway I assume this
> msg should not happen..

Absolutely. What are the exact steps you run when you encounter this
issue?

Reinette




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-20 18:33                 ` reinette chatre
@ 2010-05-31 20:12                   ` Nils Radtke
  2010-06-02 17:51                     ` reinette chatre
  0 siblings, 1 reply; 30+ messages in thread
From: Nils Radtke @ 2010-05-31 20:12 UTC (permalink / raw)
  To: reinette chatre; +Cc: linux-kernel, linux-wireless

[-- Attachment #1: Type: text/plain, Size: 7370 bytes --]

  Hi Reinette,

First off: 
Linux mypole 2.6.33.4 #23 SMP PREEMPT Sat May 15 20:27:33 CEST 2010 i686 GNU/Linux

On Thu 2010-05-20 @ 11-33-28AM -0700, reinette chatre wrote: 
# On Thu, 2010-05-20 at 05:15 -0700, Nils Radtke wrote:
# > # 
# > # To address (1), could you please run with attached debug patch and also
# > # enable rate scaling debugging. That will be "modprobe iwlagn
# > # debug=0x143fff).
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c: In function ‘rs_collect_tx_data’:
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: ‘priv’ undeclared (first use in this function)
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: (Each undeclared identifier is reported only once
# > drivers/net/wireless/iwlwifi/iwl-agn-rs.c:364: error: for each function it appears in.)
# > 
# > This happens when compiling w/ the patch applied cleanly against .33.3
# > I'll try to fix it, then conduct the field test.
# 
# Sorry ... and thanks.
# 
# >  For the latter, do 
# > you need the same kind of log as for the previous one? 
# The goal of this patch is to find the reason behind the error
# "expected_tpt should have been calculated by now". From what I
# understand you only encountered that in one of your tests, not all. Any
# test you can run to reproduce that error will be welcome. 
Yep. This expected_tpt stuff happens IRC (see mails for certainty, though) 
exclusively on site B.

Ok, so that's the goal. What could I do to advance us an additional step
at a time avoiding pushing hundreds of kilobytes of logs uplink?

To reproduce it, I have to be on site B and just start surfing. It's a matter
of _short_ time until the driver hits the wall.

I'm suspecting still getting those expected_tpt thingies (see below). Though today
it seems I've been lucky, it (maybe) only happened once (or never today).

# Thinking about your question more ... I believe your previous debug logs
# were created with debug flag 0x43fff. For this iteration, please use
# debug flag 0x143fff.

If you insist on a test using .33.3, I will do so but that will have to wait.

Meanwhile I used this patch for .34 to fix the build err from your dbg patch:

 drivers/net/wireless/iwlwifi/iwl-agn-rs.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
===================================================================
--- linux.orig/drivers/net/wireless/iwlwifi/iwl-agn-rs.c        2010-05-31 13:21:48.000000000 +0200
+++ linux/drivers/net/wireless/iwlwifi/iwl-agn-rs.c     2010-05-31 13:25:52.000000000 +0200
@@ -365,7 +365,8 @@
  * packets.
  */
 static int rs_collect_tx_data(struct iwl_scale_tbl_info *tbl,
-                             int scale_index, int attempts, int successes)
+                             int scale_index, int attempts, int successes,
+                             struct iwl_priv *priv)
 {
        struct iwl_rate_scale_data *window = NULL;
        static const u64 mask = (((u64)1) << (IWL_RATE_MAX_WINDOW - 1));
@@ -868,7 +869,7 @@
                                &rs_index);
                rs_collect_tx_data(curr_tbl, rs_index,
                                   info->status.ampdu_ack_len,
-                                  info->status.ampdu_ack_map);
+                                  info->status.ampdu_ack_map, priv);

                /* Update success/fail counts if not searching for new mode */
                if (lq_sta->stay_in_tbl) {
@@ -902,7 +903,7 @@
                        else
                                continue;
                        rs_collect_tx_data(tmp_tbl, rs_index, 1,
-                                          i < retries ? 0 : legacy_success);
+                                          i < retries ? 0 : legacy_success, priv);
                }

                /* Update success/fail counts if not searching for new mode */

Test conducted using debug flag 0x143fff.

Test on Site B resulted immediately in a hard crash upon resume, that is I got to X,
activated wireless, echoed 0x143fff to the sysfile and that was it.
But I didn't count on that one so I had no console to watch for it. So right now 
I got no clue what caused the crash, there's nothing in the logs, of course..

Next thing was to redo test on site B, but this time I switched to console beforehand.
Sure enough, this time nothing happened..

Log appended as bz2.

Hopefully lines like the following don't indicate essential info getting dropped:
May 31 17:23:17 localhost kernel: [91800.091565] net_ratelimit: 70 callbacks suppressed

This line indicates the first timestamp _after_ the crash:
May 31 17:35:19 localhost kernel: [   69.488456]

The crash happened after site A and on site B. Just arrived, opened lid and *crash*.

I noticed in iwl-agn-rs.c:2080:
  BUG_ON(window->average_tpt != ((window->success_ratio *
        tbl->expected_tpt[index] + 64) / 128));
Could that be again the point that hit me today when the machine crashed once?
Would you mind changing this into a milder WARN? That way I wouldn't hit the wall 
that hard. And I would notice it anyway while skimming the logs as we still are on the
hunt. It's more maintainable if it's a WARN in the src instead of me patching it w/ any
update..

Wasn't this BUG_ON a WARNING in .33.3? (didn't check..)

The dbg log contains all types of the errs happening here:

  - "deauthenticated" msgs w/ "reason 2"

  - "request scan called when driver not ready"

  - iwl_tx_agg_start

  - of course the "expected_tpt should have been[..]" don't show up anymore, the source
    has no more WARN regarding this..

And the rest of the 9Mb dbg log..

Could you tell me a bit about your idea of how to track those down? Maybe we can
speed things a little up. The logging and testing stuff takes a lot of time and 
most of it I have no clue why that might help or what the goal is..
Are you online in some #chan? 

# > # Regarding (2): This is a common issue in busy environments where AP
# > # decides to deathenticate station after it does not receive an ack for
# > # data sent after a few retries. Was this test done in busy environment?
# > Both. This happens in busy environment as well as in an idle one. Can't tell
# > yet whether there're more of those msgs in busy env. I start to feel against 
# > Cisco APs..
# 
# I don't know ... perhaps these APs have been set up to be strict wrt
# delays.
Sure. May well be.. For sure I'm no fan of this config/policy.. 
But wait. I noticed no such delays or disassociations using another notebook.
Well, seems I got to investigate in this, too.

# > # Regarding (3): Seems like driver is getting a request to scan after a
# > # request to remove interface. I am still inquiring about this.
# > Probably due to me switching of via RF_KILLSWITCH. But anyway I assume this
# > msg should not happen..
# 
# Absolutely. What are the exact steps you run when you encounter this
# issue?
Nothing particular. I.e. after the tests conducted toggle the hw kill switch. Then
a script gets called via acpi callback that in turn 1) kills the respective dhclient3, 2)
terminates wpa_supplicant and 3) removes the wifi modules (for power saving).

        Cheers,

                    Nils

[-- Attachment #2: 2010-05-31_iwlwifi_dbg_filter.bz2 --]
[-- Type: application/octet-stream, Size: 365121 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-31 20:12                   ` Nils Radtke
@ 2010-06-02 17:51                     ` reinette chatre
  2010-06-04 16:57                       ` Nils Radtke
  0 siblings, 1 reply; 30+ messages in thread
From: reinette chatre @ 2010-06-02 17:51 UTC (permalink / raw)
  To: Nils Radtke; +Cc: linux-kernel@vger.kernel.org, linux-wireless@vger.kernel.org

On Mon, 2010-05-31 at 13:12 -0700, Nils Radtke wrote:

> This line indicates the first timestamp _after_ the crash:
> May 31 17:35:19 localhost kernel: [   69.488456]
> 
> The crash happened after site A and on site B. Just arrived, opened lid and *crash*.
> 
> I noticed in iwl-agn-rs.c:2080:
>   BUG_ON(window->average_tpt != ((window->success_ratio *
>         tbl->expected_tpt[index] + 64) / 128));
> Could that be again the point that hit me today when the machine crashed once?
> Would you mind changing this into a milder WARN? That way I wouldn't hit the wall 
> that hard. And I would notice it anyway while skimming the logs as we still are on the
> hunt. It's more maintainable if it's a WARN in the src instead of me patching it w/ any
> update..
> 
> Wasn't this BUG_ON a WARNING in .33.3? (didn't check..)

Seems like you performed the testing without the patch that we used to
address the hang issue from the beginning of this thread. Please see
http://marc.info/?l=linux-wireless&m=127290931304496&w=2 - that thread
also explains why the patch is not in 2.6.34.

I think it is time to move this discussion to a bug report so that it
can be tracked better. Please open a new bug at
http://bugzilla.intellinuxwireless.org/

Reinette




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-06-02 17:51                     ` reinette chatre
@ 2010-06-04 16:57                       ` Nils Radtke
  2010-06-08 17:46                         ` reinette chatre
  0 siblings, 1 reply; 30+ messages in thread
From: Nils Radtke @ 2010-06-04 16:57 UTC (permalink / raw)
  To: reinette chatre; +Cc: linville, linux-kernel, linux-wireless

  Hi Reinette,

  BTW, this:
  Jun  3 12:05:43 localhost kernel: [174170.391756] iwlagn 0000:03:00.0: 
    TX Power requested while scanning!
  happened even w/o toggling radio switch, so this seems not uniquely
  related to toggling the radio switch.

On mer 2010-06-02 @ 10-51-25 -0700, reinette chatre wrote: 
# On Mon, 2010-05-31 at 13:12 -0700, Nils Radtke wrote:
# 
# > This line indicates the first timestamp _after_ the crash:
# > May 31 17:35:19 localhost kernel: [   69.488456]
# > 
# > The crash happened after site A and on site B. Just arrived, opened lid and *crash*.
# > 
# > I noticed in iwl-agn-rs.c:2080:
# >   BUG_ON(window->average_tpt != ((window->success_ratio *
# >         tbl->expected_tpt[index] + 64) / 128));
# > Could that be again the point that hit me today when the machine crashed once?
# > Would you mind changing this into a milder WARN? That way I wouldn't hit the wall 
# > that hard. And I would notice it anyway while skimming the logs as we still are on the
# > hunt. It's more maintainable if it's a WARN in the src instead of me patching it w/ any
# > update..
# > 
# > Wasn't this BUG_ON a WARNING in .33.3? (didn't check..)
# 
# Seems like you performed the testing without the patch that we used to
# address the hang issue from the beginning of this thread. Please see
Indeed, that's what it feels like. It is just so annoying, that one..
You can't work w/ the kernel drivers. That's a shame.
BTW, iff the patch for the BUG_ON is in kernel src since 2.6.28, that might
explain a lot of crashes before where I haven't never been able to track it down.
Even more, those days I hadn't a chance to do more on this. Unlike now.

# http://marc.info/?l=linux-wireless&m=127290931304496&w=2 - that thread
# also explains why the patch is not in 2.6.34.
It should definitely and absolutely be merged (change the BUG_ON into WARNING).
Even if, like hypothesized, the bug is hidden elsewhere, a BUG_ON doesn't get
me far, it's killing every chance to advance to a solution. How am I supposed
to investigate w/ the kernel crashing? BTW, I don't like working w/ a Linux
kernel that kills my work regularly, I think that's understandable. If I needed
a break from work, I'd set an alarm.

I've seen a bugreport on this issue on the redhat bts referencing my word about
this BUG_ON only getting hit w/ cisco APs. There's a wide range of AP manufacturers 
out there in the city. But only cisco APs are crashing this driver. Admittedly, only 
on one single location, but anyway it's a cisco. Always the same MAC, unless they 
use to reassign MAC addresses, though.. 

I think it's a tough one, if an AP is able to crash the driver. 

I haven't yet received a comment of yours regarding my many other questions in
my previous message. I am willing to help investigate more, assist in other ways 
than testing only (always only doing testing isn't a way to keep up fun..)

# I think it is time to move this discussion to a bug report so that it
# can be tracked better. Please open a new bug at
# http://bugzilla.intellinuxwireless.org/
As you wish. It's probably a good idea. But I still miss the registration mail
from bz, did register yesterday.

So, please see to it, that the patch rendering the BUG_ON into a
WARNING finds it's way back in.

    Thank you very much,

            Nils Radtke

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-06-04 16:57                       ` Nils Radtke
@ 2010-06-08 17:46                         ` reinette chatre
  2010-06-10 14:22                           ` Nils Radtke
  0 siblings, 1 reply; 30+ messages in thread
From: reinette chatre @ 2010-06-08 17:46 UTC (permalink / raw)
  To: Nils Radtke
  Cc: linville@tuxdriver.com, linux-kernel@vger.kernel.org,
	linux-wireless@vger.kernel.org

On Fri, 2010-06-04 at 09:57 -0700, Nils Radtke wrote:
> I haven't yet received a comment of yours regarding my many other questions in
> my previous message. I am willing to help investigate more, assist in other ways 
> than testing only (always only doing testing isn't a way to keep up fun..)

Your messages contain references to many issues and it is becoming
increasingly hard to keep track of them all in a single email thread.
Since the system crash is clearly the big issue I would like to focus on
that and get that resolved. This is why I proposed that you create bug
reports to help track your various issues better.

Reinette



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-06-08 17:46                         ` reinette chatre
@ 2010-06-10 14:22                           ` Nils Radtke
  2010-06-10 16:19                             ` reinette chatre
  0 siblings, 1 reply; 30+ messages in thread
From: Nils Radtke @ 2010-06-10 14:22 UTC (permalink / raw)
  To: reinette chatre
  Cc: linville@tuxdriver.com, linux-kernel@vger.kernel.org,
	linux-wireless@vger.kernel.org

  Hi Reinette,

  Thanks for your message.

  Yes, you're right about the multiple bugs one thread thing.

  Just today I got registered w/ the wireless ml because the
  system just did not send me a registration message. 

  For the bug reports to be created it will take me some time.
  I'll firstly report the main issue, the 2 other ones afterwards.
  Would it be ok cross referencing i.e. to the log and such 
  between the reports?

  Should I paste all the mail messages in separate report messages 
  (belonging to one bug report, of course) or should I paste some
  links to the thread?

  Cheers,

        Nils

@John: Yes, you're right but the 2.6.33.4 tree which for me still
  has the bug_on in.

On Tue 2010-06-08 @ 10-46-29AM -0700, reinette chatre wrote: 
# On Fri, 2010-06-04 at 09:57 -0700, Nils Radtke wrote:
# > I haven't yet received a comment of yours regarding my many other questions in
# > my previous message. I am willing to help investigate more, assist in other ways 
# > than testing only (always only doing testing isn't a way to keep up fun..)
# 
# Your messages contain references to many issues and it is becoming
# increasingly hard to keep track of them all in a single email thread.
# Since the system crash is clearly the big issue I would like to focus on
# that and get that resolved. This is why I proposed that you create bug
# reports to help track your various issues better.
# 
# Reinette
# 
# 

-- 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-06-10 14:22                           ` Nils Radtke
@ 2010-06-10 16:19                             ` reinette chatre
  0 siblings, 0 replies; 30+ messages in thread
From: reinette chatre @ 2010-06-10 16:19 UTC (permalink / raw)
  To: Nils Radtke
  Cc: linville@tuxdriver.com, linux-kernel@vger.kernel.org,
	linux-wireless@vger.kernel.org

On Thu, 2010-06-10 at 07:22 -0700, Nils Radtke wrote:
>   For the bug reports to be created it will take me some time.
>   I'll firstly report the main issue, the 2 other ones afterwards.

Sounds great. Thanks

>   Would it be ok cross referencing i.e. to the log and such 
>   between the reports?
>   Should I paste all the mail messages in separate report messages 
>   (belonging to one bug report, of course) or should I paste some
>   links to the thread?

I find it most convenient if all information related to the bug is
contained in the bug report. Links can be used.

Reinette





^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-17 23:19             ` reinette chatre
  2010-05-20 12:15               ` Nils Radtke
@ 2010-05-20 12:31               ` Nils Radtke
  2010-05-20 18:26                 ` reinette chatre
  2010-05-20 22:30                 ` David Miller
  1 sibling, 2 replies; 30+ messages in thread
From: Nils Radtke @ 2010-05-20 12:31 UTC (permalink / raw)
  To: reinette chatre; +Cc: linux-kernel@vger.kernel.org


  Hi Reinette,

  Did the fix like the patch below, applied on top of yours. Builds well, 
  not tested yet.

  Do you ACK this patch?

  Thanks,

          Nils


# 1) Error messages like: 
# iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# To address (1), could you please run with attached debug patch and also
# enable rate scaling debugging. That will be "modprobe iwlagn
# debug=0x143fff).

# diff --git a/drivers/net/wireless/iwlwifi/iwl-agn-rs.c b/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
# index b93e491..60e5323 100644
# --- a/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
# +++ b/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
# @@ -361,6 +361,8 @@ static int rs_collect_tx_data(struct iwl_rate_scale_data *windows,
#  	 * subtract "1" from the success counter (this is the main reason
#  	 * we keep these bitmaps!).
#  	 */
# +	IWL_DEBUG_RATE(priv, "attempts = %d, window->counter = %d\n",
# +			attempts, window->counter);
#  	while (attempts > 0) {
#  		if (window->counter >= IWL_RATE_MAX_WINDOW) {
#  
# @@ -389,6 +391,7 @@ static int rs_collect_tx_data(struct iwl_rate_scale_data *windows,
#  		attempts--;
#  	}
#  
# +	IWL_DEBUG_RATE(priv, "window counter = %d \n", window->counter);
#  	/* Calculate current success ratio, avoid divide-by-0! */
#  	if (window->counter > 0)
#  		window->success_ratio = 128 * (100 * window->success_counter)
# @@ -849,6 +852,7 @@ static void rs_tx_status(void *priv_r, struct ieee80211_supported_band *sband,
#  	 * first index into rate scale table.
#  	 */
#  	if (info->flags & IEEE80211_TX_STAT_AMPDU) {
# +		IWL_DEBUG_RATE(priv, "Updating aggregation history\n");
#  		tx_rate = le32_to_cpu(table->rs_table[0].rate_n_flags);
#  		rs_get_tbl_info_from_mcs(tx_rate, priv->band, &tbl_type,
#  				&rs_index);
# @@ -868,6 +872,7 @@ static void rs_tx_status(void *priv_r, struct ieee80211_supported_band *sband,
#  	 * For legacy, update frame history with for each Tx retry.
#  	 */
#  		retries = info->status.rates[0].count - 1;
# +		IWL_DEBUG_RATE(priv, "Updating legacy history. retries = %d \n", retries);
#  		/* HW doesn't send more than 15 retries */
#  		retries = min(retries, 15);
#  

---

 drivers/net/wireless/iwlwifi/iwl-agn-rs.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6.33.3_clean/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
===================================================================
--- linux-2.6.33.3_clean.orig/drivers/net/wireless/iwlwifi/iwl-agn-rs.c	2010-05-20 14:25:42.000000000 +0200
+++ linux-2.6.33.3_clean/drivers/net/wireless/iwlwifi/iwl-agn-rs.c	2010-05-20 14:26:50.000000000 +0200
@@ -341,7 +341,7 @@
  */
 static int rs_collect_tx_data(struct iwl_rate_scale_data *windows,
 			      int scale_index, s32 tpt, int attempts,
-			      int successes)
+			      int successes, struct iwl_priv *priv)
 {
 	struct iwl_rate_scale_data *window = NULL;
 	static const u64 mask = (((u64)1) << (IWL_RATE_MAX_WINDOW - 1));
@@ -859,7 +859,7 @@
 		tpt = get_expected_tpt(curr_tbl, rs_index);
 		rs_collect_tx_data(window, rs_index, tpt,
 				   info->status.ampdu_ack_len,
-				   info->status.ampdu_ack_map);
+				   info->status.ampdu_ack_map, priv);
 
 		/* Update success/fail counts if not searching for new mode */
 		if (lq_sta->stay_in_tbl) {
@@ -897,10 +897,10 @@
 			/* Constants mean 1 transmission, 0 successes */
 			if (i < retries)
 				rs_collect_tx_data(window, rs_index, tpt, 1,
-						0);
+						0, priv);
 			else
 				rs_collect_tx_data(window, rs_index, tpt, 1,
-						legacy_success);
+						legacy_success, priv);
 		}
 
 		/* Update success/fail counts if not searching for new mode */

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-20 12:31               ` Nils Radtke
@ 2010-05-20 18:26                 ` reinette chatre
  2010-05-20 22:30                 ` David Miller
  1 sibling, 0 replies; 30+ messages in thread
From: reinette chatre @ 2010-05-20 18:26 UTC (permalink / raw)
  To: Nils Radtke; +Cc: linux-kernel@vger.kernel.org

On Thu, 2010-05-20 at 05:31 -0700, Nils Radtke wrote:
> Hi Reinette,
> 
>   Did the fix like the patch below, applied on top of yours. Builds well, 
>   not tested yet.
> 
>   Do you ACK this patch?

I do. I'm very sorry about the problem with the one I provided.

Reinette



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
  2010-05-20 12:31               ` Nils Radtke
  2010-05-20 18:26                 ` reinette chatre
@ 2010-05-20 22:30                 ` David Miller
  1 sibling, 0 replies; 30+ messages in thread
From: David Miller @ 2010-05-20 22:30 UTC (permalink / raw)
  To: lkml; +Cc: reinette.chatre, linux-kernel


Wireless bug reports and patches should be posted with
linux-wireless@vger.kernel.org on the CC: list.

Otherwise the wireless developers will not see this discussion.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock
@ 2010-05-11  9:41 Nils Radtke
  0 siblings, 0 replies; 30+ messages in thread
From: Nils Radtke @ 2010-05-11  9:41 UTC (permalink / raw)
  To: reinette.chatre; +Cc: linville, linux-kernel, linux-wireless

  Hi,

Thanks a lot for the driver not hanging w/ bug_on() any more. At least the machine 
keeps working and when on battery no repeated reboots are required any more. That alone
already means a lot.

# On Mon, 2010-05-10 at 11:36 -0700, Nils Radtke wrote:
# >   Today weather was fine again, finally. So testing with .33.3 w/ the patch applied: 
# > 
# >   http://marc.info/?l=linux-wireless&m=127290931304496&w=2
# > 
# > The kernel kernel .32 was still running before it crashed immediately on wireless activation.
# > The crash log showed again at least two messages, the last was as already described in my first
# > message, bug from 2010-04-30: I think even the 0x2030 was the same:
# > 
# > EIP rs_tx_status +x8f/x2030 
# 
# You report an issue on 2.6.32 ...
Yes. These errors happened to be the same regardless of .32 or .33

# > W/ .33.3 and the above patch applied:
# 
# ... but then test the patch with 2.6.33.
# 
# Which kernel are you focused on?
Sorry, no intention to confuse or show erratic behaviour.. :)

It's just that the errors occur on both of them. Then I accidently booted the old one again (now
removed from the system), but again, the error showed up on .32, .33{1,2,3} . But you always had
had an indication which kernel it happened on.

OTH, it's basically the same, the identical error persists, so I can't seem the difference here. 
Except for a scientific approach one shouldn't do that, ACK. But, hey, I'd like to use the machine in
the meantime and happened to update the kernel source. 

# > Linux mypole 2.6.33.3 #18 SMP PREEMPT Thu May 6 21:51:37 CEST 2010 i686 GNU/Linux
# > 
# > May 10 19:14:11 [   80.586637] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# > May 10 19:23:17 [  626.476078] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# > May 10 19:23:30 [  638.913740] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# > May 10 19:23:32 [  641.232425] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# > May 10 19:23:54 [  663.392697] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# > May 10 19:23:58 [  666.980247] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# > May 10 19:24:02 [  671.121826] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
# Can you see any impact on your connection speed that can be connected to
# these messages?
I'm glad you're asking. Yes, indeed, speed it exceptionally low to what might be achievable. Around 30k/s
average, burst with maybe 200k/s, instead of 700k/s.

# > Additionally these were logged, could you tell why they're there and what to do? (also .33.3 w/ patch)
# > 
# > May 10 19:24:16 [  685.079617] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
# > May 10 19:24:22 [  691.026737] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
# > May 10 19:28:02 [  911.406162] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
# > May 10 19:35:38 [ 1367.251240] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:12:23:25 tid = 0
# > 
# > The above "iwl_tx_agg_start" lines happen when connecting - again to a Cisco AP - and the connection gets
# > dropped the exact moment when a download is started. It even often drops when dhcp is still negotiating, has
# > got it's IP but the nego isn't finished yet. Conn drops, same procedure again and again. This happens only
# > with this Cisco AP (which is BTW another one from the "expected_tpt should have been calculated by now" 
# > problem).
# It could be that some of the queues get stuck. Can you try with the
# patches in
# http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2037#c113 ? They
# are based on 2.6.33.
Good, no wait, bad, now running on .34-rc7. *sigh

I'll apply the patches to .33. .34-rc7 hadn't brought the desired success w/ the olicard100 usb-umts-stick.

Update: noticed you mean 2.6.33 not .33.x ;) On .33.3 it doesn't apply cleanly for a couple of files..
Any objections if I apply it to .33.3 anyway? (Fixing the rej of course..)

Interestingly enough, quilt import 0001*patch imports, quilt push patches but it applies the patch w/o
rej. patch -p1 0001*patch does recognize the patch already applied and rejects..

All patches applied successfully, trying again these days.

Thanks for your comments.

Will keep you informed.

Nils

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2010-06-10 16:19 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-04-29 18:26 kernel BUG in iwl-agn-rs.c:2076,was: iwlagn + some accesspoint = hardlock lkml
  -- strict thread matches above, loose matches on Subject: below --
2010-05-03 19:17 kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock NilsRadtkelkml
2010-05-03 19:22 ` John W. Linville
2010-05-06  9:14   ` Christian Borntraeger
2010-05-06 16:28     ` reinette chatre
2010-05-11 15:50       ` Christian Borntraeger
2010-05-11 17:21         ` reinette chatre
2010-05-12 15:18           ` Christian Borntraeger
2010-05-10 18:36   ` Nils Radtke
2010-05-10 23:32     ` reinette chatre
2010-05-12 14:39       ` Nils Radtke
2010-05-12 23:14         ` reinette chatre
2010-05-13 10:34           ` Nils Radtke
2010-05-13 11:32           ` Nils Radtke
2010-05-13 16:31             ` reinette chatre
2010-05-14 17:45               ` Nils Radtke
2010-05-13 15:05           ` Nils Radtke
2010-05-17 23:19             ` reinette chatre
2010-05-20 12:15               ` Nils Radtke
2010-05-20 18:33                 ` reinette chatre
2010-05-31 20:12                   ` Nils Radtke
2010-06-02 17:51                     ` reinette chatre
2010-06-04 16:57                       ` Nils Radtke
2010-06-08 17:46                         ` reinette chatre
2010-06-10 14:22                           ` Nils Radtke
2010-06-10 16:19                             ` reinette chatre
2010-05-20 12:31               ` Nils Radtke
2010-05-20 18:26                 ` reinette chatre
2010-05-20 22:30                 ` David Miller
2010-05-11  9:41 Nils Radtke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).