Can someone please try...

netdev.vger.kernel.org archive mirror

* Can someone please try...
@ 2007-01-16 17:06 Michael Buesch
  2007-01-16 18:29 ` Pavel Roskin
  2007-01-16 19:00 ` Andreas Schwab
  0 siblings, 2 replies; 19+ messages in thread
From: Michael Buesch @ 2007-01-16 17:06 UTC (permalink / raw)
  To: bcm43xx-dev; +Cc: netdev

...the bcm43xx driver in my tree with a 4318 chip?
The code there works excellently with my 4306 now, but I can't
get it to work with my 4318. It's strange, it doesn't seem
to work at all. I don't seem to be able to TX or RX any packets.
Not sure why.

To get it, please avoid cloning the whole tree
from my repository, so as not to waste bandwidth.
If you have a linville-wireless-dev tree, you can do the
following:

cd wireless-dev
git branch mb
git checkout mb
git pull http://bu3sch.de/git/wireless-dev.git master

I think this should also work if you have a linus-2.6 tree
checked out somewhere.

-- 
Greetings Michael.


* Re: Can someone please try...
  2007-01-16 17:06 Can someone please try Michael Buesch
@ 2007-01-16 18:29 ` Pavel Roskin
  2007-01-16 19:23   ` Michael Buesch
  2007-01-16 19:00 ` Andreas Schwab
  1 sibling, 1 reply; 19+ messages in thread
From: Pavel Roskin @ 2007-01-16 18:29 UTC (permalink / raw)
  To: Michael Buesch; +Cc: bcm43xx-dev, netdev

On Tue, 2007-01-16 at 18:06 +0100, Michael Buesch wrote:
> ...the bcm43xx driver in my tree with a 4318 chip?

Things are progressing for me a bit because I observed an association to
an AP with no security.  I still had to use wpa_supplicant.

Unfortunately, there is a bigger issue with the new code.  When I
interrupt wpa_supplicant, the kernel reports several oopses and then
panics, so I have to reboot.  I had to use a serial console just to
capture the messages.

I assume the first message is most relevant.  Here it is:

kernel BUG at /home/proski/src/linux-2.6/mm/slab.c:597!
invalid opcode: 0000 [1]
CPU 0
Modules linked in: bcm43xx_d80211 ssb
Pid: 2984, comm: wpa_supplicant Not tainted 2.6.20-rc3 #2
RIP: 0010:[<ffffffff8020aa5a>]  [<ffffffff8020aa5a>] kfree+0x5c/0x97
RSP: 0018:ffff81000727fd08  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff81001e53a3c0 RCX: 0000000000000001
RDX: ffff810001689c40 RSI: 000000000727c010 RDI: ffff81001de38000
RBP: ffff81001de38000 R08: ffffffff8052c2e0 R09: ffff81001eac80c0
R10: ffff8100066153c0 R11: ffff8100066157c0 R12: 0000000000000286
R13: ffff810006dfb988 R14: ffff81001e23c000 R15: 0000000000000000
FS:  00002b75242c6cd0(0000) GS:ffffffff8056c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003a8ab12e60 CR3: 000000000727a000 CR4: 00000000000006e0
Process wpa_supplicant (pid: 2984, threadinfo ffff81000727e000, task ffff810011e3d0c0)
Stack:  ffff81001e53a3c0 0000000000000013 ffff81001dddc000 ffffffff802237be
 ffff81001eac80c0 ffffffff8801b2f6 ffffffff8056c980 ffffffff8028cb21
 ffff81000707c7b8 ffff81001e8921c0 ffff81001e892000 ffffffff8801b63a
Call Trace:
 [<ffffffff802237be>] kfree_skbmem+0x9/0x73
 [<ffffffff8801b2f6>] :bcm43xx_d80211:bcm43xx_destroy_dmaring+0x1d1/0x205
 [<ffffffff8028cb21>] free_irq+0xd8/0x120
 [<ffffffff8801b63a>] :bcm43xx_d80211:bcm43xx_dma_free+0x89/0xad
 [<ffffffff88008c7e>] :bcm43xx_d80211:bcm43xx_wireless_core_exit+0x29/0x76
 [<ffffffff88008dcc>] :bcm43xx_d80211:bcm43xx_remove_interface+0x101/0x135
 [<ffffffff804422d3>] ieee80211_stop+0xdd/0xf7
 [<ffffffff80407cac>] dev_close+0x52/0x71
 [<ffffffff8040750f>] dev_change_flags+0x5a/0x119
 [<ffffffff8042e57d>] devinet_ioctl+0x235/0x59b
 [<ffffffff804004a6>] sock_ioctl+0x1c8/0x1e5
 [<ffffffff80238f2a>] do_ioctl+0x1b/0x50
 [<ffffffff8022a82a>] vfs_ioctl+0x215/0x227
 [<ffffffff80242166>] sys_ioctl+0x3c/0x5c
 [<ffffffff80250ede>] system_call+0x7e/0x83


Code: 0f 0b eb fe 48 8b 7a 28 48 8b 1f 8b 13 3b 53 04 73 0c 89 d0
RIP  [<ffffffff8020aa5a>] kfree+0x5c/0x97
 RSP <ffff81000727fd08>
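
Each frame in a trace like the one above has the form "[<address>] symbol+offset/size". The module name appears as a ":module:" prefix in this x86_64 dump and as a "[module]" suffix in the i386 dumps later in the thread. The following is a minimal sketch for pulling the driver's own frames out of such a dump. The regular expression and the helper name parse_frames are illustrative assumptions, not part of any kernel tooling.

```python
import re

# One frame looks like either of these (x86_64 style, then i386 style):
#   [<ffffffff8801b2f6>] :bcm43xx_d80211:bcm43xx_destroy_dmaring+0x1d1/0x205
#   [<e03850bc>] bcm43xx_add_interface+0x4f/0xb7 [bcm43xx_d80211]
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?::(?P<mod64>\w+):)?"
    r"(?P<sym>\w+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:[ \t]+\[(?P<mod32>\w+)\])?"
)


def parse_frames(text):
    """Return one dict per call-trace frame found in `text`."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append({
            "addr": int(m.group("addr"), 16),
            # Frames in the core kernel carry no module name (None).
            "module": m.group("mod64") or m.group("mod32"),
            "symbol": m.group("sym"),
            "offset": int(m.group("off"), 16),
            "size": int(m.group("size"), 16),
        })
    return frames


# The first four frames of the trace above.
TRACE = """\
 [<ffffffff802237be>] kfree_skbmem+0x9/0x73
 [<ffffffff8801b2f6>] :bcm43xx_d80211:bcm43xx_destroy_dmaring+0x1d1/0x205
 [<ffffffff8028cb21>] free_irq+0xd8/0x120
 [<ffffffff8801b63a>] :bcm43xx_d80211:bcm43xx_dma_free+0x89/0xad
"""

for frame in parse_frames(TRACE):
    if frame["module"] == "bcm43xx_d80211":
        print(frame["symbol"], hex(frame["offset"]))
```

Filtering on the module name makes it easier to see that the kfree() failure is reached from the driver's DMA ring teardown path.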

That's still the same Dell Latitude D520 with Core 2 Duo and Fedora Core
6, internal PCIe card 14e4:4312.  I'm using your current tree ending
with "bcm43xx-d80211: Various cleanups all over the code"

SMP is disabled this time, just to make things simpler.

-- 
Regards,
Pavel Roskin




* Re: Can someone please try...
  2007-01-16 17:06 Can someone please try Michael Buesch
  2007-01-16 18:29 ` Pavel Roskin
@ 2007-01-16 19:00 ` Andreas Schwab
  2007-01-16 19:24   ` Michael Buesch
  1 sibling, 1 reply; 19+ messages in thread
From: Andreas Schwab @ 2007-01-16 19:00 UTC (permalink / raw)
  To: Michael Buesch; +Cc: bcm43xx-dev, netdev

Michael Buesch <mb@bu3sch.de> writes:

> ...the bcm43xx driver in my tree with a 4318 chip?
> The code there works excellently with my 4306 now, but I can't
> get it to work with my 4318.

Doesn't work for me either.  I cannot get it to associate to the AP.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: Can someone please try...
  2007-01-16 18:29 ` Pavel Roskin
@ 2007-01-16 19:23   ` Michael Buesch
  2007-01-16 21:50     ` Pavel Roskin
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Buesch @ 2007-01-16 19:23 UTC (permalink / raw)
  To: Pavel Roskin; +Cc: bcm43xx-dev, netdev

On Tuesday 16 January 2007 19:29, Pavel Roskin wrote:
> On Tue, 2007-01-16 at 18:06 +0100, Michael Buesch wrote:
> > ...the bcm43xx driver in my tree with a 4318 chip?
> 
> Things are progressing for me a bit because I observed an association to
> an AP with no security.  I still had to use wpa_supplicant.
> 
> Unfortunately, there is a bigger issue with the new code.  When I
> interrupt wpa_supplicant, the kernel reports several oopses and then
> panics, so I have to reboot.  I had to use serial console just to
> capture the messages.
> 
> I assume the first message is most relevant.  Here it is:

A patch for that is already upstream.
It's surprising that it doesn't happen for me, though.
Neither on PPC, nor on i386.

Patch was
[PATCH] bcm43xx-d80211: Fix DMA TX skb doublefree

-- 
Greetings Michael.


* Re: Can someone please try...
  2007-01-16 19:00 ` Andreas Schwab
@ 2007-01-16 19:24   ` Michael Buesch
  0 siblings, 0 replies; 19+ messages in thread
From: Michael Buesch @ 2007-01-16 19:24 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: bcm43xx-dev, netdev

On Tuesday 16 January 2007 20:00, Andreas Schwab wrote:
> Michael Buesch <mb@bu3sch.de> writes:
> 
> > ...the bcm43xx driver in my tree with a 4318 chip?
> > The code there works excellently with my 4306 now, but I can't
> > get it to work with my 4318.
> 
> Doesn't work for me either.  I cannot get it to associate to the AP.

Ok, let's see.
I found a few other bugs. But I can't make any promises when
I'll find all of them. ;)

-- 
Greetings Michael.


* Re: Can someone please try...
  2007-01-16 19:23   ` Michael Buesch
@ 2007-01-16 21:50     ` Pavel Roskin
  2007-01-16 22:07       ` Michael Buesch
  0 siblings, 1 reply; 19+ messages in thread
From: Pavel Roskin @ 2007-01-16 21:50 UTC (permalink / raw)
  To: Michael Buesch; +Cc: bcm43xx-dev, netdev

On Tue, 2007-01-16 at 20:23 +0100, Michael Buesch wrote:

> A patch for that is already upstream.

I don't see it.  It's not in your tree yet.

> It's surprising that it doesn't happen for me, though.
> Neither on PPC, nor on i386.

It did happen for me on i386, as well as on x86_64.  The dump was for
x86_64, as evidenced by the register size.  Maybe you have fewer
debugging options enabled?
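
The "register size" argument can be checked mechanically: an x86_64 dump prints registers as 16 hex digits, an i386 dump as 8. A small sketch, assuming the register lines are passed in as plain text; the function name register_width_bits is made up for illustration:

```python
import re


def register_width_bits(register_lines):
    """Infer the register width from the widest hex value in a register dump."""
    # Matches "RAX: 0000000000000000", "R08: ffffffff8052c2e0", "eax: 00000000", ...
    values = re.findall(r"\b\w{2,3}:\s+([0-9a-f]+)\b", register_lines)
    if not values:
        raise ValueError("no register values found")
    # Each hex digit carries 4 bits; segment registers are narrower,
    # so only the widest value is relevant.
    return 4 * max(len(v) for v in values)


X86_64_LINE = "RAX: 0000000000000000 RBX: ffff81001e53a3c0 RCX: 0000000000000001"
I386_LINE = "eax: 00000000   ebx: c3dab740   ecx: 000000e1   edx: c3493808"

print(register_width_bits(X86_64_LINE), register_width_bits(I386_LINE))
```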

> Patch was
> [PATCH] bcm43xx-d80211: Fix DMA TX skb doublefree

Even with this hint, I cannot spot the bug immediately, so it would be
great if you sync the public repository soon.

-- 
Regards,
Pavel Roskin


* Re: Can someone please try...
  2007-01-16 21:50     ` Pavel Roskin
@ 2007-01-16 22:07       ` Michael Buesch
  2007-01-16 23:51         ` Pavel Roskin
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Buesch @ 2007-01-16 22:07 UTC (permalink / raw)
  To: Pavel Roskin
  Cc: netdev, bcm43xx-dev

On Tuesday 16 January 2007 22:50, Pavel Roskin wrote:
> On Tue, 2007-01-16 at 20:23 +0100, Michael Buesch wrote:
> 
> > A patch for that is already upstream.
> 
> I don't see it.  It's not in your tree yet.

It is on its way upstream to linville.

> > It's surprising that it doesn't happen for me, though.
> > Neither on PPC, nor on i386.
> 
> It did happen for me on i386, as well as on x86_64.  The dump was for
> x86_64, as evidenced by the register size.  Maybe you have fewer
> debugging options enabled?

All.

> > Patch was
> > [PATCH] bcm43xx-d80211: Fix DMA TX skb doublefree
> 
> Even with this hint, I cannot spot the bug immediately, so it would be
> great if you sync the public repository soon.

Linville has to put the patch into his tree first, so I can pull it.
You can find the patch easily by searching bcm43xx-dev or netdev
archives.

-- 
Greetings Michael.


* Re: Can someone please try...
  2007-01-16 22:07       ` Michael Buesch
@ 2007-01-16 23:51         ` Pavel Roskin
  2007-01-17  9:52           ` Michael Buesch
  0 siblings, 1 reply; 19+ messages in thread
From: Pavel Roskin @ 2007-01-16 23:51 UTC (permalink / raw)
  To: Michael Buesch; +Cc: bcm43xx-dev, netdev

On Tue, 2007-01-16 at 23:07 +0100, Michael Buesch wrote:
> On Tuesday 16 January 2007 22:50, Pavel Roskin wrote:
> > On Tue, 2007-01-16 at 20:23 +0100, Michael Buesch wrote:
> > 
> > > A patch for that is already upstream.
> > 
> > I don't see it.  It's not in your tree yet.
> 
> It is on its way upstream to linville.

Well, it's pretty cruel to ask others to test code with known fatal
bugs, IMHO.  Even if git were extremely poor at handling a patch applied
in two branches.  In fact, git is not so bad at all at handling such
situations.

> > > It's surprising that it doesn't happen for me, though.
> > > Neither on PPC, nor on i386.
> > 
> > It did happen for me on i386, as well as on x86_64.  The dump was for
> > x86_64, as evidenced by the register size.  Maybe you have fewer
> > debugging options enabled?
> 
> All.

That's commendable.  I tried the 32-bit kernel without SMP and with
almost all debugging.  One thing I noticed is that scanning ignores the
pure 802.11b AP running HostAP that I was going to use for testing.
Other APs are detected.  The association didn't work, probably for that
reason.

Scanning may trigger many assertion failures:

bcm43xx_d80211: ASSERTION FAILED ((lna & ~0x7) == 0)
at: /home/proski/src/linux-2.6/drivers/net/wireless/d80211/bcm43xx/bcm43xx_lo.c:235:lo_measure_feedthrough()

Finally, interrupting wpa_supplicant hits another bug:

BUG: unable to handle kernel paging request at virtual address c3e2cbf8
 printing eip:
e03835e1
*pde = 0000f067
*pte = 03e2c000
Oops: 0002 [#1]
DEBUG_PAGEALLOC
Modules linked in: bcm43xx_d80211 ssb
CPU:    0
EIP:    0060:[<e03835e1>]    Not tainted VLI
EFLAGS: 00210282   (2.6.20-rc3 #3)
EIP is at bcm43xx_wireless_core_init+0x5a/0x98e [bcm43xx_d80211]
eax: 00000000   ebx: c3dab740   ecx: 000000e1   edx: c3493808
esi: c34937f8   edi: c3e2cbf8   ebp: c3e60e38   esp: c3e60db8
ds: 007b   es: 007b   ss: 0068
Process wpa_supplicant (pid: 2942, ti=c3e60000 task=c3f92590 task.ti=c3e60000)
Stack: c0339db6 c0339db6 00000000 c3f92590 c0339db6 00200246 c3e60df0 c3f30000 
       c3dab740 c0339de4 c3493808 c3e60e0c c3e60e0c 00200246 c3e60e2c c0339dc0 
       00000000 00000002 c0339de4 c3e60e50 c3f92590 22222222 22222222 22222222 
Call Trace:
 [<c010335d>] show_trace_log_lvl+0x1a/0x2f
 [<c010340d>] show_stack_log_lvl+0x9b/0xa3
 [<c01035a6>] show_registers+0x191/0x267
 [<c010378f>] die+0x113/0x212
 [<c011010a>] do_page_fault+0x43a/0x50c
 [<c033b47c>] error_code+0x74/0x7c
 [<e03850bc>] bcm43xx_add_interface+0x4f/0xb7 [bcm43xx_d80211]
 [<c032022f>] ieee80211_open+0x19d/0x27e
 [<c02dbb77>] dev_open+0x2d/0x64
 [<c02da71f>] dev_change_flags+0x51/0xf1
 [<c030b67a>] devinet_ioctl+0x235/0x53a
 [<c030bc38>] inet_ioctl+0x73/0x91
 [<c02d1db8>] sock_ioctl+0x1ac/0x1c9
 [<c015dd64>] do_ioctl+0x1c/0x51
 [<c015df94>] vfs_ioctl+0x1fb/0x212
 [<c015dfdc>] sys_ioctl+0x31/0x49
 [<c0102cba>] sysenter_past_esp+0x5f/0x99
 =======================
Code: 00 80 66 0d ef 8d be 9c 01 00 00 f3 ab 8b 7a 5c 80 62 49 c5 c7 42 4c ff ff ff ff 85 ff c7 42 50 00 00 00 00 74 13 b9 e1 00 00 00 <f3> ab 8b 42 5c 66 c7 80 76 03 00 00 ff ff 8b 4d a8 89 f0 c7 41
EIP: [<e03835e1>] bcm43xx_wireless_core_init+0x5a/0x98e [bcm43xx_d80211] SS:ESP 0068:c3e60db8


Then I used MadWifi on the AP side, and "iwpriv scan" picked it up.
Moreover, wpa_supplicant reported connection!  I interrupted
wpa_supplicant and started it again, and then the kernel oopsed again.
Strangely, the driver is not even mentioned in the backtrace.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000004
 printing eip:
c02d8863
*pde = 00000000
Oops: 0002 [#1]
DEBUG_PAGEALLOC
Modules linked in: bcm43xx_d80211 ssb
CPU:    0
EIP:    0060:[<c02d8863>]    Not tainted VLI
EFLAGS: 00210246   (2.6.20-rc3 #3)
EIP is at datagram_poll+0xba/0xc5
eax: 00000000   ebx: cc252bf8   ecx: 00000049   edx: 00000000
esi: 00000002   edi: 00000004   ebp: cb940b70   esp: cb940b68
ds: 007b   es: 007b   ss: 0068
Process wpa_supplicant (pid: 4344, ti=cb940000 task=c2be0590 task.ti=cb940000)
Stack: c0353220 c9fedf2c cb940b7c c02d1643 00000000 cb940e30 c015ebae c033b3bd
       cb940e54 cb940e50 cb940f9c cb940f50 cb940be0 00000000 00000000 cb940e5c
       cb940e60 cb940e64 cb940e50 cb940e54 cb940e58 00000070 00000000 00000000
Call Trace:
 [<c010335d>] show_trace_log_lvl+0x1a/0x2f
 [<c010340d>] show_stack_log_lvl+0x9b/0xa3
 [<c01035a6>] show_registers+0x191/0x267
 [<c010378f>] die+0x113/0x212
 [<c011010a>] do_page_fault+0x43a/0x50c
 [<c033b47c>] error_code+0x74/0x7c
 [<c02d1643>] sock_poll+0x12/0x15
 [<c015ebae>] do_select+0x2b4/0x4cc
 [<c015f076>] core_sys_select+0x2b0/0x2d5
 [<c015f631>] sys_select+0x99/0x170
 [<c0102cba>] sysenter_past_esp+0x5f/0x99
 =======================
Code: ca 3c 02 74 2b 8b 83 7c 01 00 00 ba 02 00 00 00 89 d6 99 f7 fe 39 83 cc 00 00 00 7d 08 81 c9 04 03 00 00 eb 0b 8b 83 44 02 00 00 <0f> ba 68 04 00 5b 89 c8 5e 5d c3 55 89 e5 57 56 89 c6 53 83 ec
EIP: [<c02d8863>] datagram_poll+0xba/0xc5 SS:ESP 0068:cb940b68

-- 
Regards,
Pavel Roskin




* Re: Can someone please try...
  2007-01-16 23:51         ` Pavel Roskin
@ 2007-01-17  9:52           ` Michael Buesch
  2007-01-18  9:41             ` Pavel Roskin
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Buesch @ 2007-01-17  9:52 UTC (permalink / raw)
  To: Pavel Roskin
  Cc: netdev, bcm43xx-dev

On Wednesday 17 January 2007 00:51, Pavel Roskin wrote:
> On Tue, 2007-01-16 at 23:07 +0100, Michael Buesch wrote:
> > On Tuesday 16 January 2007 22:50, Pavel Roskin wrote:
> > > On Tue, 2007-01-16 at 20:23 +0100, Michael Buesch wrote:
> > > 
> > > > A patch for that is already upstream.
> > > 
> > > I don't see it.  It's not in your tree yet.
> > 
> > It is on its way upstream to linville.
> 
> Well, it's pretty cruel to ask others to test code with known fatal
> bugs, IMHO.

I forgot that the bug was there, because it doesn't trigger on my
machines. I already explained that...

> Even if git were extremely poor at handling a patch applied
> in two branches.  In fact, git is not so bad at all at handling such
> situations.

I have to wait until linville pulls it. Full stop.

> > > > It's surprising that it doesn't happen for me, though.
> > > > Neither on PPC, nor on i386.
> > > 
> > > It did happen for me on i386, as well as on x86_64.  The dump was for
> > > x86_64, as evidenced by the register size.  Maybe you have fewer
> > > debugging options enabled?
> > 
> > All.
> 
> That's commendable.  I tried the 32-bit kernel without SMP and with
> almost all debugging.  One thing I noticed is that scanning ignores the
> pure 802.11b AP running HostAP that I was going to use for testing.
> Other APs are detected.  The association didn't work, probably for that
> reason.

Probably some d80211 bug. Dunno.

> Scanning may trigger many assertion failures:
> 
> bcm43xx_d80211: ASSERTION FAILED ((lna & ~0x7) == 0)
> at: /home/proski/src/linux-2.6/drivers/net/wireless/d80211/bcm43xx/bcm43xx_lo.c:235:lo_measure_feedthrough()

It's not triggered by scanning, it's known and it's nonfatal.

> Finally, interrupting wpa_supplicant hits another bug:
> 
> BUG: unable to handle kernel paging request at virtual address c3e2cbf8
>  printing eip:
> e03835e1
> *pde = 0000f067
> *pte = 03e2c000
> Oops: 0002 [#1]
> DEBUG_PAGEALLOC
> Modules linked in: bcm43xx_d80211 ssb
> CPU:    0
> EIP:    0060:[<e03835e1>]    Not tainted VLI
> EFLAGS: 00210282   (2.6.20-rc3 #3)
> EIP is at bcm43xx_wireless_core_init+0x5a/0x98e [bcm43xx_d80211]
> eax: 00000000   ebx: c3dab740   ecx: 000000e1   edx: c3493808
> esi: c34937f8   edi: c3e2cbf8   ebp: c3e60e38   esp: c3e60db8
> ds: 007b   es: 007b   ss: 0068
> Process wpa_supplicant (pid: 2942, ti=c3e60000 task=c3f92590 task.ti=c3e60000)
> Stack: c0339db6 c0339db6 00000000 c3f92590 c0339db6 00200246 c3e60df0 c3f30000 
>        c3dab740 c0339de4 c3493808 c3e60e0c c3e60e0c 00200246 c3e60e2c c0339dc0 
>        00000000 00000002 c0339de4 c3e60e50 c3f92590 22222222 22222222 22222222 
> Call Trace:
>  [<c010335d>] show_trace_log_lvl+0x1a/0x2f
>  [<c010340d>] show_stack_log_lvl+0x9b/0xa3
>  [<c01035a6>] show_registers+0x191/0x267
>  [<c010378f>] die+0x113/0x212
>  [<c011010a>] do_page_fault+0x43a/0x50c
>  [<c033b47c>] error_code+0x74/0x7c
>  [<e03850bc>] bcm43xx_add_interface+0x4f/0xb7 [bcm43xx_d80211]
>  [<c032022f>] ieee80211_open+0x19d/0x27e
>  [<c02dbb77>] dev_open+0x2d/0x64
>  [<c02da71f>] dev_change_flags+0x51/0xf1
>  [<c030b67a>] devinet_ioctl+0x235/0x53a
>  [<c030bc38>] inet_ioctl+0x73/0x91
>  [<c02d1db8>] sock_ioctl+0x1ac/0x1c9
>  [<c015dd64>] do_ioctl+0x1c/0x51
>  [<c015df94>] vfs_ioctl+0x1fb/0x212
>  [<c015dfdc>] sys_ioctl+0x31/0x49
>  [<c0102cba>] sysenter_past_esp+0x5f/0x99
>  =======================
> Code: 00 80 66 0d ef 8d be 9c 01 00 00 f3 ab 8b 7a 5c 80 62 49 c5 c7 42 4c ff ff ff ff 85 ff c7 42 50 00 00 00 00 74 13 b9 e1 00 00 00 <f3> ab 8b 42 5c 66 c7 80 76 03 00 00 ff ff 8b 4d a8 89 f0 c7 41
> EIP: [<e03835e1>] bcm43xx_wireless_core_init+0x5a/0x98e [bcm43xx_d80211] SS:ESP 0068:c3e60db8

Doesn't happen for me. I have no idea what's happening.
Care to debug it?
But it's weird that _killing_ the supplicant calls add_interface.
I'd expect it to call remove_interface.

> Then I used MadWifi on the AP side, and "iwpriv scan" picked it up.
> Moreover, wpa_supplicant reported connection!  I interrupted
> wpa_supplicant and started it again, and then the kernel oopsed again.
> Strangely, the driver is not even mentioned in the backtrace.
> 
> BUG: unable to handle kernel NULL pointer dereference at virtual address 00000004
>  printing eip:
> c02d8863
> *pde = 00000000
> Oops: 0002 [#1]
> DEBUG_PAGEALLOC
> Modules linked in: bcm43xx_d80211 ssb
> CPU:    0
> EIP:    0060:[<c02d8863>]    Not tainted VLI
> EFLAGS: 00210246   (2.6.20-rc3 #3)
> EIP is at datagram_poll+0xba/0xc5
> eax: 00000000   ebx: cc252bf8   ecx: 00000049   edx: 00000000
> esi: 00000002   edi: 00000004   ebp: cb940b70   esp: cb940b68
> ds: 007b   es: 007b   ss: 0068
> Process wpa_supplicant (pid: 4344, ti=cb940000 task=c2be0590 task.ti=cb940000)
> Stack: c0353220 c9fedf2c cb940b7c c02d1643 00000000 cb940e30 c015ebae c033b3bd
>        cb940e54 cb940e50 cb940f9c cb940f50 cb940be0 00000000 00000000 cb940e5c
>        cb940e60 cb940e64 cb940e50 cb940e54 cb940e58 00000070 00000000 00000000
> Call Trace:
>  [<c010335d>] show_trace_log_lvl+0x1a/0x2f
>  [<c010340d>] show_stack_log_lvl+0x9b/0xa3
>  [<c01035a6>] show_registers+0x191/0x267
>  [<c010378f>] die+0x113/0x212
>  [<c011010a>] do_page_fault+0x43a/0x50c
>  [<c033b47c>] error_code+0x74/0x7c
>  [<c02d1643>] sock_poll+0x12/0x15
>  [<c015ebae>] do_select+0x2b4/0x4cc
>  [<c015f076>] core_sys_select+0x2b0/0x2d5
>  [<c015f631>] sys_select+0x99/0x170
>  [<c0102cba>] sysenter_past_esp+0x5f/0x99
>  =======================
> Code: ca 3c 02 74 2b 8b 83 7c 01 00 00 ba 02 00 00 00 89 d6 99 f7 fe 39 83 cc 00 00 00 7d 08 81 c9 04 03 00 00 eb 0b 8b 83 44 02 00 00 <0f> ba 68 04 00 5b 89 c8 5e 5d c3 55 89 e5 57 56 89 c6 53 83 ec
> EIP: [<c02d8863>] datagram_poll+0xba/0xc5 SS:ESP 0068:cb940b68

I have absolutely no idea. It has not happened a single time for me.
In fact, it's all pretty stable on my machines.

-- 
Greetings Michael.


* Re: Can someone please try...
  2007-01-17  9:52           ` Michael Buesch
@ 2007-01-18  9:41             ` Pavel Roskin
  2007-01-19  7:54               ` Pavel Roskin
  0 siblings, 1 reply; 19+ messages in thread
From: Pavel Roskin @ 2007-01-18  9:41 UTC (permalink / raw)
  To: Michael Buesch; +Cc: bcm43xx-dev, netdev

On Wed, 2007-01-17 at 10:52 +0100, Michael Buesch wrote:

> Doesn't happen for me. I have no idea what's happening.
> Care to debug it?
> But it's weird that _killing_ the supplicant calls add_interface.
> I'd expect it to call remove_interface.

I'm sorry, I was actually running wpa_supplicant again at the time of
the crash.

What I have now is very different behavior in two configurations on the
same machine.

The i386 kernel without SMP with most debug enabled and serial console.
wpa_supplicant times out.  If I restart it, the kernel oopses, every
time in a different place.

The x86_64 kernel with SMP and with very few debug options.
wpa_supplicant connects.  Killing and restarting wpa_supplicant doesn't
cause any problems.  In fact, wpa_supplicant reconnects quickly.  I can
even ping the station from the AP, but the packet loss is horrible.  It
appears that most loss is on the receiving side.

I'll try to debug the problem when I have time.  At least I'll try to
find out if it's specific to the architecture or to another kernel
option.

Anyway, it's exciting that I could send my first packets today!

-- 
Regards,
Pavel Roskin


* Re: Can someone please try...
  2007-01-18  9:41             ` Pavel Roskin
@ 2007-01-19  7:54               ` Pavel Roskin
  2007-01-22 20:06                 ` Michael Buesch
  0 siblings, 1 reply; 19+ messages in thread
From: Pavel Roskin @ 2007-01-19  7:54 UTC (permalink / raw)
  To: Michael Buesch; +Cc: bcm43xx-dev, netdev

Hello, Michael!

I did more testing, and the results are as follows.  It looks like the
oopses and panics on i386 were triggered by 4k stacks.  x86_64 doesn't
have this option.
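
As a rough cross-check of the 4k-stack theory, the stack depth at the moment of an oops can be estimated from the esp and ti (thread_info) values in the two i386 oopses posted earlier in this thread. The sketch below assumes those dumps were taken with 4k stacks, and that the stack grows down from ti + THREAD_SIZE towards ti, with struct thread_info at the bottom. The function name stack_usage is hypothetical.

```python
THREAD_SIZE_4K = 4096  # assumption: i386 kernel built with 4k stacks


def stack_usage(esp, thread_info_base, thread_size=THREAD_SIZE_4K):
    """Return (used, remaining) bytes of kernel stack at the time of the dump.

    `remaining` slightly overstates the usable space, because
    struct thread_info occupies the lowest bytes of the same area.
    """
    top = thread_info_base + thread_size
    if not thread_info_base <= esp <= top:
        raise ValueError("esp is outside the assumed stack area")
    return top - esp, esp - thread_info_base


# (esp, ti) pairs copied from the two i386 oopses earlier in the thread.
OOPSES = {
    "bcm43xx_wireless_core_init": (0xC3E60DB8, 0xC3E60000),
    "datagram_poll": (0xCB940B68, 0xCB940000),
}

for name, (esp, ti) in OOPSES.items():
    used, remaining = stack_usage(esp, ti)
    print(f"{name}: {used} bytes used, about {remaining} bytes remaining")
```

Both oopses occurred with most of the stack still free. That neither proves nor rules out an overflow, since a deeper call chain (an interrupt, for example) could have overrun the stack earlier and left the damage behind.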

Now that I enabled other debug options on both platforms, but not 4k
stacks, I'm seeing exactly the same problem on each platform.  When run
initially, wpa_supplicant connects with no problems (except very poor
reception of the data packets, but it's another story).  If interrupted
and restarted, wpa_supplicant reconnects, but I'm getting messages like
this (i386):

Slab corruption: start=cfdaece0, len=1024
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02d70c2>](skb_release_data+0x7b/0x7f)
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=cfdae8d4, len=1024
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c026ea5a>](device_create+0x2c/0x98)
000: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: ad 4e ad de ff ff ff ff ff ff ff ff 10 3a 6d c0
Next obj: start=cfdaf0ec, len=1024
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<c0165730>](expand_files+0x95/0x2c2)
000: 78 55 39 c7 78 55 39 c7 78 55 39 c7 88 da 52 df
010: d8 18 3b c7 00 00 00 00 00 00 00 00 00 00 00 00

and this (x86_64):

Slab corruption: start=ffff81000ec8a198, len=1024
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<ffffffff8042e916>](skb_release_data+0x94/0x99)
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Next obj: start=ffff81000ec8a5b0, len=1024
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [<ffffffff803be6e9>](device_create+0x5f/0x110)
000: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

I can restart wpa_supplicant again, and it would show similar messages.
The first "Last user" is inevitably skb_release_data.
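
For reading dumps like these: with slab debugging enabled, a freed object is filled with a poison byte, and the checker prints the 16-byte lines that no longer match. The red zone value also tells whether the object was free or in use. The sketch below decodes the first i386 dump on that basis. The constants are quoted from memory of include/linux/poison.h and should be verified against the tree in use; the helper names are made up for illustration.

```python
# Assumed values, as recalled from include/linux/poison.h.
RED_INACTIVE = 0x5A2CF071  # red zone of a freed object
RED_ACTIVE = 0x170FC2A5    # red zone of an allocated object
POISON_FREE = 0x6B         # fill byte of a freed object
POISON_END = 0xA5          # last byte of a freed object


def redzone_state(value):
    """Map a red zone word to 'free', 'in use' or 'damaged'."""
    return {RED_INACTIVE: "free", RED_ACTIVE: "in use"}.get(value, "damaged")


def parse_hexdump(lines):
    """Turn lines like '010: 00 00 ...' into {byte offset: byte value}."""
    data = {}
    for line in lines:
        head, _, rest = line.partition(":")
        base = int(head, 16)
        for i, byte in enumerate(rest.split()):
            data[base + i] = int(byte, 16)
    return data


def overwritten_offsets(lines, obj_len):
    """Offsets whose content differs from the expected free-object poison."""
    bad = []
    for off, byte in sorted(parse_hexdump(lines).items()):
        expected = POISON_END if off == obj_len - 1 else POISON_FREE
        if byte != expected:
            bad.append(off)
    return bad


# The six all-zero lines (000 to 050) of the first i386 dump, len=1024.
DUMP = [f"{off:03x}: " + " ".join(["00"] * 16) for off in range(0, 0x60, 0x10)]

bad = overwritten_offsets(DUMP, obj_len=1024)
print(redzone_state(0x5A2CF071), len(bad), "bytes overwritten, starting at", bad[0])
```

Under those assumptions, the corrupted object was already free, and the visible part of it had been overwritten with zeros rather than poison. That is what one would expect if some code cleared a buffer through a pointer that was no longer valid; the dump alone does not say which code.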

I have no idea how to deal with it.  I think I need a stack trace at the
time when skb_release_data is called.

This is a stack trace at the time when slab corruption is detected.
It's actually incorrect closer to the top, perhaps from gcc
optimizations for static functions.

Slab corruption: start=ffff8100066f81d8, len=1024

Call Trace:
 [<ffffffff80218636>] vsnprintf+0x338/0x5a8
 [<ffffffff8020713d>] check_poison_obj+0x69/0x1ae
 [<ffffffff803c3ff2>] _request_firmware+0x8f/0x326
 [<ffffffff803c3ff2>] _request_firmware+0x8f/0x326


 [<ffffffff8020c09a>] cache_alloc_debugcheck_after+0x32/0x1a2
 [<ffffffff803c3ff2>] _request_firmware+0x8f/0x326
 [<ffffffff802aaae2>] kmem_cache_zalloc+0xaf/0xd8
 [<ffffffff803c3ff2>] _request_firmware+0x8f/0x326
 [<ffffffff880111ea>] :bcm43xx_d80211:bcm43xx_phy_init_tssi2dbm_table+0xf0/0x2ca
 [<ffffffff803c432a>] request_firmware+0xe/0x10
 [<ffffffff88007d75>] :bcm43xx_d80211:bcm43xx_chip_init+0x96/0xaba
 [<ffffffff8020a03d>] kmem_cache_alloc+0xaf/0xbe
 [<ffffffff88009c97>] :bcm43xx_d80211:bcm43xx_wireless_core_init+0x4de/0xa3d
 [<ffffffff8800b4e8>] :bcm43xx_d80211:bcm43xx_add_interface+0x64/0xde
 [<ffffffff8046eaa0>] ieee80211_open+0x1c7/0x2cc
 [<ffffffff804330da>] dev_open+0x36/0x76
 [<ffffffff8043185b>] dev_change_flags+0x5d/0x122
 [<ffffffff8045a1a3>] devinet_ioctl+0x259/0x5e8
 [<ffffffff8045a7f2>] inet_ioctl+0x71/0x8f
 [<ffffffff8042a395>] sock_ioctl+0x1db/0x1fd
 [<ffffffff8023bfa7>] do_ioctl+0x1b/0x50
 [<ffffffff8022c9b2>] vfs_ioctl+0x22a/0x23c
 [<ffffffff80289975>] trace_hardirqs_on+0x124/0x14e
 [<ffffffff802459a2>] sys_ioctl+0x42/0x65
 [<ffffffff8025531e>] system_call+0x7e/0x83

Anyway, I could narrow down this message to the first kzalloc() call in
fw_register_device(), file drivers/base/firmware_class.c.  This only
seems to confirm my suspicion that the actual corruption happened before
this point.  We are just hitting it when trying to allocate more memory.

Help with debugging this problem will be appreciated.  I've never hunted
down such problems, especially in kernel space.

-- 
Regards,
Pavel Roskin




* Re: Can someone please try...
  2007-01-19  7:54               ` Pavel Roskin
@ 2007-01-22 20:06                 ` Michael Buesch
  2007-01-22 20:44                   ` Pavel Roskin
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Buesch @ 2007-01-22 20:06 UTC (permalink / raw)
  To: Pavel Roskin; +Cc: bcm43xx-dev, netdev

On Friday 19 January 2007 08:54, Pavel Roskin wrote:
> Hello, Michael!
> 
> I did more testing, and the results are as follows.  It looks like the
> oopses and panics on i386 were triggered by 4k stacks.  x86_64 doesn't
> have this option.
> 
> Now that I enabled other debug options on both platforms, but not 4k
> stacks, I'm seeing exactly the same problem on each platform.  When run
> initially, wpa_supplicant connects with no problems (except very poor
> reception of the data packets, but it's another story).  If interrupted
> and restarted, wpa_supplicant reconnects, but I'm getting messages like
> this (i386):

That's a very interesting discovery.
Partly, because I don't see this on my i386 machine. ;)

It's obviously some stack/memory corruption. But I'm not
sure if this is a stack overflow. I'd rather say no, it isn't.

Could probably be triggered by something like kfree()ing
a dangling pointer or something...

> Slab corruption: start=cfdaece0, len=1024
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<c02d70c2>](skb_release_data+0x7b/0x7f)
> 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Prev obj: start=cfdae8d4, len=1024
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<c026ea5a>](device_create+0x2c/0x98)
> 000: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 010: ad 4e ad de ff ff ff ff ff ff ff ff 10 3a 6d c0
> Next obj: start=cfdaf0ec, len=1024
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<c0165730>](expand_files+0x95/0x2c2)
> 000: 78 55 39 c7 78 55 39 c7 78 55 39 c7 88 da 52 df
> 010: d8 18 3b c7 00 00 00 00 00 00 00 00 00 00 00 00
> 
> and this (x86_64):
> 
> Slab corruption: start=ffff81000ec8a198, len=1024
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<ffffffff8042e916>](skb_release_data+0x94/0x99)
> 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Next obj: start=ffff81000ec8a5b0, len=1024
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<ffffffff803be6e9>](device_create+0x5f/0x110)
> 000: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> I can restart wpa_supplicant again, and it would show similar messages.
> The first "Last user" is inevitably skb_release_data.
> 
> I have no idea how to deal with it.  I think I need a stack trace at the
> time when skb_release_data is called.
> 
> This is a stack trace at the time when slab corruption is detected.
> It's actually incorrect closer to the top, perhaps from gcc
> optimizations for static functions.
> 
> Slab corruption: start=ffff8100066f81d8, len=1024
> 
> Call Trace:
>  [<ffffffff80218636>] vsnprintf+0x338/0x5a8
>  [<ffffffff8020713d>] check_poison_obj+0x69/0x1ae
>  [<ffffffff803c3ff2>] _request_firmware+0x8f/0x326
>  [<ffffffff803c3ff2>] _request_firmware+0x8f/0x326
> 
> 
>  [<ffffffff8020c09a>] cache_alloc_debugcheck_after+0x32/0x1a2
>  [<ffffffff803c3ff2>] _request_firmware+0x8f/0x326
>  [<ffffffff802aaae2>] kmem_cache_zalloc+0xaf/0xd8
>  [<ffffffff803c3ff2>] _request_firmware+0x8f/0x326
>  [<ffffffff880111ea>] :bcm43xx_d80211:bcm43xx_phy_init_tssi2dbm_table+0xf0/0x2ca
>  [<ffffffff803c432a>] request_firmware+0xe/0x10
>  [<ffffffff88007d75>] :bcm43xx_d80211:bcm43xx_chip_init+0x96/0xaba
>  [<ffffffff8020a03d>] kmem_cache_alloc+0xaf/0xbe
>  [<ffffffff88009c97>] :bcm43xx_d80211:bcm43xx_wireless_core_init+0x4de/0xa3d
>  [<ffffffff8800b4e8>] :bcm43xx_d80211:bcm43xx_add_interface+0x64/0xde
>  [<ffffffff8046eaa0>] ieee80211_open+0x1c7/0x2cc
>  [<ffffffff804330da>] dev_open+0x36/0x76
>  [<ffffffff8043185b>] dev_change_flags+0x5d/0x122
>  [<ffffffff8045a1a3>] devinet_ioctl+0x259/0x5e8
>  [<ffffffff8045a7f2>] inet_ioctl+0x71/0x8f
>  [<ffffffff8042a395>] sock_ioctl+0x1db/0x1fd
>  [<ffffffff8023bfa7>] do_ioctl+0x1b/0x50
>  [<ffffffff8022c9b2>] vfs_ioctl+0x22a/0x23c
>  [<ffffffff80289975>] trace_hardirqs_on+0x124/0x14e
>  [<ffffffff802459a2>] sys_ioctl+0x42/0x65
>  [<ffffffff8025531e>] system_call+0x7e/0x83
> 
> Anyway, I could narrow down this message to the first kzalloc() call in
> fw_register_device(), file drivers/base/firmware_class.c.  This only
> seems to confirm my suspicion that the actual corruption happened before
> this point.  We are just hitting it when trying to allocate more memory.
> 
> Help with debugging this problem will be appreciated.  I've never hunted
> down such problems, especially in kernel space.
> 

-- 
Greetings Michael.


* Re: Can someone please try...
  2007-01-22 20:06                 ` Michael Buesch
@ 2007-01-22 20:44                   ` Pavel Roskin
  2007-01-22 21:00                     ` Michael Buesch
  0 siblings, 1 reply; 19+ messages in thread
From: Pavel Roskin @ 2007-01-22 20:44 UTC (permalink / raw)
  To: Michael Buesch; +Cc: bcm43xx-dev, netdev

Hello, Michael!

On Mon, 2007-01-22 at 21:06 +0100, Michael Buesch wrote: 
> It's obviously some stack/memory corruption. But I'm not
> sure if this is a stackoverflow. I'd rather say no, it isn't.
> 
> Could probably be triggered by something like kfree()ing
> a dangling pointer or something...

Yes.  That's what my patch was for ("Fix major memory corruption bug").
It was pretty hard to catch because I would find the consequences rather
than the offending code.  I got lucky after I enabled some weird
options, such as 64GB support and highmem debugging.  Whether it played
any role or not, the oops finally happened where the driver tried to
erase memory pointed to by the stale phy->lo_control pointer.
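
A minimal user-space sketch of that failure mode and one common defence,
assuming a hypothetical struct phy_state in place of the real bcm43xx_phy
(phy_alloc, phy_free and phy_erase are made-up names for illustration):
clearing the pointer when the buffer is freed lets a later erase notice
that there is nothing left to touch.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the driver's PHY state; not the real bcm43xx_phy. */
struct phy_state {
    unsigned char *lo_control;   /* heap buffer, released on teardown */
    size_t lo_control_len;
};

/* Allocate a zeroed buffer; returns 0 on success, -1 on allocation failure. */
int phy_alloc(struct phy_state *phy, size_t len)
{
    phy->lo_control = calloc(1, len);
    if (!phy->lo_control)
        return -1;
    phy->lo_control_len = len;
    return 0;
}

/* Free the buffer and clear the pointer so it cannot be used while stale. */
void phy_free(struct phy_state *phy)
{
    free(phy->lo_control);
    phy->lo_control = NULL;
    phy->lo_control_len = 0;
}

/* Erase the buffer only if it still exists; returns 1 if erased, 0 if skipped. */
int phy_erase(struct phy_state *phy)
{
    if (!phy->lo_control)
        return 0;
    memset(phy->lo_control, 0, phy->lo_control_len);
    return 1;
}
```

Without the NULL assignment in phy_free(), a later phy_erase() would write
into memory that may already belong to someone else, which is the kind of
damage that only shows up much later.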

Now the situation is as follows.

No more random crashes.  There is still a crash if I rmmod the driver
while wlan0 is up, but it's a separate issue, and it's easy to avoid
(unlike the interface going down).  I hope to look at it soon.

The driver connects to a 802.11b Linksys router just fine.  I can send
and receive data.  The driver is fully functional.  128-bit WEP is
supported.

There are periodic bursts of assertion failures.  Looking at the driver,
I see three places where lna a.k.a. phy->lo_gain[0] is assigned the
value of 32 (written as 0x20 in one place).  It's not surprising that it
exceeds 7 in lo_measure_feedthrough().

I think the assert() should be replaced with a FIXME, which would not
annoy end users so much.  And while at it, it would be great to
replace phy->lo_gain with four fields with descriptive names.
phy->lo_gain is never used as an array.  Alternatively, you could make
it a structure within bcm43xx_phy.
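
A rough sketch of what I mean, under loud assumptions: only "lna" is a
name I can vouch for, the other three field names are placeholders, the
element type is a guess, and the limit of 7 is simply taken from the
failing assertion.

```c
#include <stdint.h>

/* Hypothetical replacement for a four-element lo_gain[] array.
 * Only "lna" comes from the discussion; the other names are placeholders
 * and the uint16_t element type is an assumption. */
struct lo_gain {
    uint16_t lna;       /* was lo_gain[0] */
    uint16_t gain1;     /* was lo_gain[1], meaning not stated here */
    uint16_t gain2;     /* was lo_gain[2], meaning not stated here */
    uint16_t gain3;     /* was lo_gain[3], meaning not stated here */
};

/* Upper bound assumed from the assertion that fires when lna is 32. */
#define LNA_MAX 7

/* Returns 1 if the LNA value passes the assumed range check, else 0. */
int lna_in_range(const struct lo_gain *g)
{
    return g->lna <= LNA_MAX;
}
```

With named fields, an assignment such as g.lna = 0x20 is easy to grep for
and obviously out of range for a check against 7.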

The problems with a MadWifi based AP turn out to be related to 802.11g.
If the AP is configured for 802.11b only, everything is working.  If
802.11g is enabled, strange things are happening.  Judging by what's on
the air, it looks like the driver loses the data frames it receives.
wpa_supplicant connects instantly, but ARP and ping packets from AP to
STA are lost.  The frames are even acknowledged, but not seen on the
station side.  It takes from one to ten minutes until ping suddenly
starts working.

-- 
Regards,
Pavel Roskin


* Re: Can someone please try...
  2007-01-22 20:44                   ` Pavel Roskin
@ 2007-01-22 21:00                     ` Michael Buesch
  2007-01-22 22:04                       ` Larry Finger
  2007-01-23  6:14                       ` Pavel Roskin
  0 siblings, 2 replies; 19+ messages in thread
From: Michael Buesch @ 2007-01-22 21:00 UTC (permalink / raw)
  To: Pavel Roskin; +Cc: bcm43xx-dev, netdev

On Monday 22 January 2007 21:44, Pavel Roskin wrote:
> Hello, Michael!
> 
> On Mon, 2007-01-22 at 21:06 +0100, Michael Buesch wrote: 
> > It's obviously some stack/memory corruption. But I'm not
> > sure if this is a stackoverflow. I'd rather say no, it isn't.
> > 
> > Could probably be triggered by something like kfree()ing
> > a dangling pointer or something...
> 
> Yes.  That's what my patch was for ("Fix major memory corruption bug").
> It was pretty hard to catch because I would find the consequences rather
> than the offending code.  I got lucky after I enabled some weird
> options, such as 64GB support and highmem debugging.  Whether it played
> any role or not, the oops finally happened where the driver tried to
> erase memory pointed to by the stale phy->lo_control pointer.
> 
> Now the situation is following.
> 
> No more random crashes.  There is still a crash if I rmmod the driver
> while wlan0 is up, but it's a separate issue, and it's easy to avoid
> (unlike the interface going down).  I hope to look at it soon.

Did you apply that d80211 rmmod crash fix that Michael Wu posted
recently?  I bet it will fix your issue.

> The driver connects to a 802.11b Linksys router just fine.  I can send
> and receive data.  The driver is fully functional.  128-bit WEP is
> supported.

Nice.

> There are periodic bursts of assertion failures.  Looking at the driver,
> I see three places where lna a.k.a. phy->lo_gain[0] is assigned the
> value of 32 (written as 0x20 in one place).  It's not surprising that it
> exceeds 7 in lo_measure_feedthrough().

I know about these and I am going to fix that, soon.
Ignore it for the time being, please.

> I think the assert() should be replaced with a FIXME, which would not
> annoy end users so much.

Well, no. It's kind of: Michael, go ahead and fix that crap!
So I'd like to keep it to get me to fix it. :D

> And while at that, it would be great to 
> replace phy->lo_gain with four fields with descriptive names.
> phy->lo_gain is never used as an array.  Alternatively, you could make
> it a structure within bcm43xx_phy.

Yeah, one step after the other. ;)
We didn't know the meanings of the values until recently. Of course
I am going to rename them.

> The problems with a MadWifi based AP turn out to be related to 802.11g.
> If the AP is configured for 802.11b only, everything is working.  If
> 802.11g is enabled, strange things are happening.  Judging by what's on
> the air, it looks like the driver loses the data frames it receives.
> wpa_supplicant connects instantly, but ARP and ping packets from AP to
> STA are lost.  The frames are even acknowledged, but not seen on the
> station side.  It takes from one to ten minutes until ping suddenly
> starts working.

Hm, is this 4318? It is known to lose lots of packets.

-- 
Greetings Michael.


* Re: Can someone please try...
  2007-01-22 21:00                     ` Michael Buesch
@ 2007-01-22 22:04                       ` Larry Finger
  2007-01-23  6:14                       ` Pavel Roskin
  1 sibling, 0 replies; 19+ messages in thread
From: Larry Finger @ 2007-01-22 22:04 UTC (permalink / raw)
  To: Michael Buesch; +Cc: Pavel Roskin, netdev, bcm43xx-dev

Michael Buesch wrote:
> On Monday 22 January 2007 21:44, Pavel Roskin wrote:
>> The problems with a MadWifi based AP turn out to be related to 802.11g.
>> If the AP is configured for 802.11b only, everything is working.  If
>> 802.11g is enabled, strange things are happening.  Judging by what's on
>> the air, it looks like the driver loses the data frames it receives.
>> wpa_supplicant connects instantly, but ARP and ping packets from AP to
>> STA are lost.  The frames are even acknowledged, but not seen on the
>> station side.  It takes from one to ten minutes until ping suddenly
>> starts working.
> 
> Hm, is this 4318? It is known to lose lots of packets.

On my 4311 with softmac, the throughput is increased by a factor of 6 by reducing the rate from the
default 11M to 1M. Obviously the success rate is greatly improved. Perhaps the same effect will
happen for 4318's. Does the d80211 version let you change the rate?

Larry


* Re: Can someone please try...
  2007-01-22 21:00                     ` Michael Buesch
  2007-01-22 22:04                       ` Larry Finger
@ 2007-01-23  6:14                       ` Pavel Roskin
  2007-01-23  9:21                         ` Michael Buesch
  1 sibling, 1 reply; 19+ messages in thread
From: Pavel Roskin @ 2007-01-23  6:14 UTC (permalink / raw)
  To: Michael Buesch; +Cc: bcm43xx-dev, netdev

On Mon, 2007-01-22 at 22:00 +0100, Michael Buesch wrote: 
> > No more random crashes.  There is still a crash if I rmmod the driver
> > while wlan0 is up, but it's a separate issue, and it's easy to avoid
> > (unlike the interface going down).  I hope to look at it soon.
> 
> Did you apply that d80211 rmmod crash fix that Michael Wu posted
> recently. I bet it will fix your issue.

I have tried the patch, and it doesn't fix the problem.  It's a separate
problem.  It happens when bcm43xx_interrupt_handler() is called on a
device that has already been removed.  It looks like
bcm43xx_wireless_core_stop() should be called from
bcm43xx_one_core_detach().

Unfortunately, I cannot come to a satisfactory solution yet.  If I call
bcm43xx_wireless_core_stop() with the mutex held, the driver won't
unload if the interface is down.  If I don't hold the mutex, it would
happen when the interface is up.

By the way, I think it's a bad idea to unlock any mutexes or other locks
set outside the function.  The caller assumes that the lock is held
until it (the caller) unlocks it.  Unlocking locks from other functions
breaks this convention. 
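
One way to keep that convention visible is to split a function into a
variant that expects the lock to be held and a wrapper that takes the lock
itself.  A minimal sketch with a toy lock that only tracks ownership (all
names are hypothetical, this is not driver code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock that only records whether it is held; it stands in for a
 * real mutex so that the ownership rules can be checked with assert(). */
struct toy_mutex {
    bool held;
};

void toy_lock(struct toy_mutex *m)
{
    assert(!m->held);   /* taking it twice would be a deadlock */
    m->held = true;
}

void toy_unlock(struct toy_mutex *m)
{
    assert(m->held);    /* only the holder may release it */
    m->held = false;
}

struct device_state {
    struct toy_mutex lock;
    bool running;
};

/* Caller must hold dev->lock; the lock is still held on return. */
void core_stop_locked(struct device_state *dev)
{
    assert(dev->lock.held);
    dev->running = false;
}

/* Takes and releases the lock itself; call with dev->lock not held. */
void core_stop(struct device_state *dev)
{
    toy_lock(&dev->lock);
    core_stop_locked(dev);
    toy_unlock(&dev->lock);
}
```

A caller that already holds the lock uses core_stop_locked() and still
owns the lock afterwards; everyone else calls core_stop().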

> > I think the assert() should be replaced with a FIXME, which would not
> > annoy end users so much.
> 
> Well, no. It's kind of: Michael, go ahead and fix that crap!
> So I'd like to keep it to get me to fix it. :D

I, for one, prefer to keep my to-do items in my to-do list, but I don't
want to distract you with petty arguments from fixing the real problem.

> > And while at that, it would be great to 
> > replace phy->lo_gain with four fields with descriptive names.
> > phy->lo_gain is never used as an array.  Alternatively, you could make
> > it a structure within bcm43xx_phy.
> 
> Yeah, one step after the other. ;)
> We didn't know the meanings of the values until recently. Of course
> I am going to rename them.

Great!

> > The problems with a MadWifi based AP turn out to be related to 802.11g.
> > If the AP is configured for 802.11b only, everything is working.  If
> > 802.11g is enabled, strange things are happening.  Judging by what's on
> > the air, it looks like the driver loses the data frames it receives.
> > wpa_supplicant connects instantly, but ARP and ping packets from AP to
> > STA are lost.  The frames are even acknowledged, but not seen on the
> > station side.  It takes from one to ten minutes until ping suddenly
> > starts working.
> 
> Hm, is this 4318? It is known to lose lots of packets.

No, it's 4312.

-- 
Regards,
Pavel Roskin


* Re: Can someone please try...
  2007-01-23  6:14                       ` Pavel Roskin
@ 2007-01-23  9:21                         ` Michael Buesch
       [not found]                           ` <200701231021.34995.mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Buesch @ 2007-01-23  9:21 UTC (permalink / raw)
  To: Pavel Roskin; +Cc: bcm43xx-dev, netdev

On Tuesday 23 January 2007 07:14, Pavel Roskin wrote:
> On Mon, 2007-01-22 at 22:00 +0100, Michael Buesch wrote: 
> > > No more random crashes.  There is still a crash if I rmmod the driver
> > > while wlan0 is up, but it's a separate issue, and it's easy to avoid
> > > (unlike the interface going down).  I hope to look at it soon.
> > 
> > Did you apply that d80211 rmmod crash fix that Michael Wu posted
> > recently. I bet it will fix your issue.
> 
> I have tried the patch, and it doesn't fix the problem.  It's a separate
> problem.  It happens when bcm43xx_interrupt_handler() is called on a
> device that has already been removed.

That shouldn't happen and doesn't for me.

> It looks like 
> bcm43xx_wireless_core_stop() should be called from
> bcm43xx_one_core_detach().

No, well... remove_interface should have been called by the stack, no?

> Unfortunately, I cannot come to a satisfactory solution yet.  If I call
> bcm43xx_wireless_core_stop() with the mutex held, the driver won't
> unload if the interface is down.  If I don't hold the mutex, it would
> happen when the interface is up.
> 
> By the way, I think it's a bad idea to unlock any mutexes or other locks
> set outside the function.  The caller assumes that the lock is held
> until it (the caller) unlocks it.  Unlocking locks from other functions
> breaks this convention. 

It would result in a deadlock, if we don't unlock it there. That's
perfectly fine.

> > > I think the assert() should be replaced with a FIXME, which would not
> > > annoy end users so much.
> > 
> > Well, no. It's kind of: Michael, go ahead and fix that crap!
> > So I'd like to keep it to get me to fix it. :D
> 
> I, for one, prefer to keep my to-do items in my to-do list, but I don't
> want to distract you with petty arguments from fixing the real problem.

Well, assert() statements are there to find bugs. And if there is a bug,
they trigger. That's pretty much the semantics of an assert() statement.
I'm not sure why you want to hide a bug.

Either way, in this case it seems like the code is right
and just the assert() mask is wrong. But that's only this way by luck.
Could easily have been the other way around. ;)
Specs were slightly wrong at this point.

But as I said, I will commit a fix today.

> > Hm, is this 4318? It is known to lose lots of packets.
> 
> No, it's 4312.

That has got the same problems.

-- 
Greetings Michael.


* Re: Can someone please try...
       [not found]                           ` <200701231021.34995.mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
@ 2007-01-24  5:43                             ` Pavel Roskin
  2007-01-24  8:43                               ` Michael Buesch
  0 siblings, 1 reply; 19+ messages in thread
From: Pavel Roskin @ 2007-01-24  5:43 UTC (permalink / raw)
  To: Michael Buesch
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w

On Tue, 2007-01-23 at 10:21 +0100, Michael Buesch wrote:
> On Tuesday 23 January 2007 07:14, Pavel Roskin wrote:
> > I have tried the patch, and it doesn't fix the problem.  It's a separate
> > problem.  It happens when bcm43xx_interrupt_handler() is called on a
> > device that has already been removed.
> 
> That shouldn't happen and doesn't for me.
> 
> > It looks like 
> > bcm43xx_wireless_core_stop() should be called from
> > bcm43xx_one_core_detach().
> 
> No, well... . remove_interface should have been called by the stack, no?

It is not.  It's called if I bring the interface down with ifconfig.  If
I remove a live interface with "rmmod bcm43xx_d80211",
bcm43xx_one_core_detach() is called first, followed by a kernel panic in
bcm43xx_interrupt_handler().

And that's what I see in the code.  Module removal calls bcm43xx_exit().
It unregisters the ssb driver first.  The ssb layer calls
bcm43xx_remove(), which calls bcm43xx_one_core_detach() before doing
anything with the wireless stack or with interrupts.

I tried moving bcm43xx_one_core_detach() to the end of bcm43xx_remove(),
but the result was the same.  Still, I think the solution lies in that
direction.  We should stop the hardware before dismantling any data
structures.
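
To illustrate the ordering I have in mind, here is a small user-space
simulation.  Everything in it (struct toy_dev, toy_attach, toy_interrupt,
toy_remove) is hypothetical and only models the idea; it is not the
driver's code and does not touch real interrupts.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical device: an "interrupt enabled" flag plus one heap buffer
 * that the interrupt handler touches. */
struct toy_dev {
    bool irq_enabled;
    int *counter;
};

/* Allocate the data and enable interrupts last; returns 0 or -1. */
int toy_attach(struct toy_dev *dev)
{
    dev->counter = calloc(1, sizeof(*dev->counter));
    if (!dev->counter)
        return -1;
    dev->irq_enabled = true;
    return 0;
}

/* Simulated handler; returns 1 if the interrupt was serviced, 0 if it
 * was ignored because the device has been quiesced. */
int toy_interrupt(struct toy_dev *dev)
{
    if (!dev->irq_enabled)
        return 0;
    (*dev->counter)++;
    return 1;
}

/* Teardown in the safe order: stop the event source first, then free
 * the structures the handler would have used. */
void toy_remove(struct toy_dev *dev)
{
    dev->irq_enabled = false;   /* step 1: quiesce the hardware */
    free(dev->counter);         /* step 2: dismantle data */
    dev->counter = NULL;
}
```

If the two steps in toy_remove() were swapped, a handler running between
them would dereference freed memory, which matches the panic I see in the
interrupt handler.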

-- 
Regards,
Pavel Roskin


* Re: Can someone please try...
  2007-01-24  5:43                             ` Pavel Roskin
@ 2007-01-24  8:43                               ` Michael Buesch
  0 siblings, 0 replies; 19+ messages in thread
From: Michael Buesch @ 2007-01-24  8:43 UTC (permalink / raw)
  To: Pavel Roskin; +Cc: bcm43xx-dev, netdev

On Wednesday 24 January 2007 06:43, Pavel Roskin wrote:
> On Tue, 2007-01-23 at 10:21 +0100, Michael Buesch wrote:
> > On Tuesday 23 January 2007 07:14, Pavel Roskin wrote:
> > > I have tried the patch, and it doesn't fix the problem.  It's a separate
> > > problem.  It happens when bcm43xx_interrupt_handler() is called on a
> > > device that has already been removed.
> > 
> > That shouldn't happen and doesn't for me.
> > 
> > > It looks like 
> > > bcm43xx_wireless_core_stop() should be called from
> > > bcm43xx_one_core_detach().
> > 
> > No, well... . remove_interface should have been called by the stack, no?
> 
> It is not.  It's called if I bring the interface down with ifconfig.  If
> I remove live interface with "rmmod bcm43xx_d80211",
> bcm43xx_one_core_detach() is called first, followed by kernel panic in 
> bcm43xx_interrupt_handler().
> 
> And that's what I see in the code.  Module removal calls bcm43xx_exit().
> It unregisters the ssb driver first.  The ssb layer calls
> bcm43xx_remove(), which calls bcm43xx_one_core_detach() before doing
> anything with the wireless stack or with interrupts.
> 
> I tried to put bcm43xx_one_core_detach() to the end of bcm43xx_remove(),
> but the result was the same.  Still, I think the solution lies in that
> direction.  We should stop the hardware before dismantling any data
> structures.

Ok, I see. I will try to debug this.

-- 
Greetings Michael.


end of thread, other threads:[~2007-01-24  8:44 UTC | newest]

Thread overview: 19+ messages
2007-01-16 17:06 Can someone please try Michael Buesch
2007-01-16 18:29 ` Pavel Roskin
2007-01-16 19:23   ` Michael Buesch
2007-01-16 21:50     ` Pavel Roskin
2007-01-16 22:07       ` Michael Buesch
2007-01-16 23:51         ` Pavel Roskin
2007-01-17  9:52           ` Michael Buesch
2007-01-18  9:41             ` Pavel Roskin
2007-01-19  7:54               ` Pavel Roskin
2007-01-22 20:06                 ` Michael Buesch
2007-01-22 20:44                   ` Pavel Roskin
2007-01-22 21:00                     ` Michael Buesch
2007-01-22 22:04                       ` Larry Finger
2007-01-23  6:14                       ` Pavel Roskin
2007-01-23  9:21                         ` Michael Buesch
     [not found]                           ` <200701231021.34995.mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
2007-01-24  5:43                             ` Pavel Roskin
2007-01-24  8:43                               ` Michael Buesch
2007-01-16 19:00 ` Andreas Schwab
2007-01-16 19:24   ` Michael Buesch
