Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server

public inbox for linux-kernel@vger.kernel.org

* Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server
@ 2005-09-30  3:36 Hendrik Visage
  2005-09-30  4:16 ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Hendrik Visage @ 2005-09-30  3:36 UTC
  To: linux-net, linux-kernel

Hi there,

 I've traced a panicking kernel to what appear to be the starfire changes
for 2.6.13 up to 2.6.14_rc2.

During a relatively heavy NFS read (the client is a 32-bit 2.6.13.1 P2-350)
with rsync (a ripped CD archive) I get kernel panics ("Aieee interrupt
handler lost" or something). I also need a way to capture those errors, as
it's a hard panic and needs the reset button :(

I've isolated the problem to going from 2.6.12.5/2.6.12-gentoo-r10 (both
working) to 2.6.13/2.6.13-gentoo/2.6.14_rc2 while NFS is served through the
Adaptec/starfire. Furthermore, the onboard forcedeth (nvidia) serves the
data without hassles (at least on 2.6.14_rc2).

Using gcc 3.4.4

--
Hendrik Visage


* Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server
  2005-09-30  3:36 Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server Hendrik Visage
@ 2005-09-30  4:16 ` Andrew Morton
  2005-09-30  8:14   ` Hendrik Visage
  2005-09-30 16:01   ` Hendrik Visage
  0 siblings, 2 replies; 10+ messages in thread
From: Andrew Morton @ 2005-09-30  4:16 UTC
  To: Hendrik Visage; +Cc: linux-net, linux-kernel, Ion Badulescu

Hendrik Visage <hvjunk@gmail.com> wrote:
>
>  I've traced a panicking kernel to what appear to be the starfire changes
>  for 2.6.13 up to 2.6.14_rc2.
> 
>  During a relatively heavy NFS read (the client is a 32-bit 2.6.13.1 P2-350)
>  with rsync (a ripped CD archive) I get kernel panics ("Aieee interrupt
>  handler lost" or something). I also need a way to capture those errors, as
>  it's a hard panic and needs the reset button :(

A serial console is useful.  Often people will take a digital photo of the
screen, which works OK.  But we do need that info somehow, please.

>  I've isolated the problem to going from 2.6.12.5/2.6.12-gentoo-r10 (both
>  working) to 2.6.13/2.6.13-gentoo/2.6.14_rc2 while NFS is served through the
>  Adaptec/starfire. Furthermore, the onboard forcedeth (nvidia) serves the
>  data without hassles (at least on 2.6.14_rc2).

The starfire changes in 2.6.12->2.6.13 look fairly innocuous.  Need that
trace, please.


* Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server
  2005-09-30  4:16 ` Andrew Morton
@ 2005-09-30  8:14   ` Hendrik Visage
  2005-09-30 16:46     ` Ion Badulescu
  2005-09-30 16:01   ` Hendrik Visage
  1 sibling, 1 reply; 10+ messages in thread
From: Hendrik Visage @ 2005-09-30  8:14 UTC
  To: Andrew Morton; +Cc: linux-net, linux-kernel, Ion Badulescu

On 9/30/05, Andrew Morton <akpm@osdl.org> wrote:
> A serial console is useful.  Often people will take a digital photo of the
> screen, which works OK.  But we do need that info somehow, please.

Busy getting that (and/or lkcd|kdb) set up.

> The starfire changes in 2.6.12->2.6.13 look fairly innocuous.  Need that
> trace, please.

Will do, but perhaps check for some 64-bit uncleanliness in the
scatter-gather stuff that got enabled in 2.6.13 because of the GPL'd
Adaptec firmware, as I recall some skb-related stuff.

--
Hendrik Visage


* Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server
  2005-09-30  8:14   ` Hendrik Visage
@ 2005-09-30 16:46     ` Ion Badulescu
  0 siblings, 0 replies; 10+ messages in thread
From: Ion Badulescu @ 2005-09-30 16:46 UTC
  To: Hendrik Visage; +Cc: Andrew Morton, linux-net, linux-kernel

Hi Hendrik,

On Fri, 30 Sep 2005, Hendrik Visage wrote:

> Will do, but perhaps check for some 64-bit uncleanliness in the
> scatter-gather stuff that got enabled in 2.6.13 because of the GPL'd
> Adaptec firmware, as I recall some skb-related stuff.

There is an easy way to disable the firmware and pretty much all the
changes that went into 2.6.13: load the starfire module with
enable_hw_cksum=0. If you can easily reproduce this problem, try that and
see if you can still hit it. Maybe it's a newly introduced problem in the
upper layer's SG code; your other network driver simply isn't using SG, so
it's not affected.

It's very suspicious that the bug would be in skb_checksum_help(), since 
the starfire driver doesn't do anything with the skb before handing it 
over to skb_checksum_help(). It would mean that the upper layer handed an 
invalid skb to the driver, or that we have some random memory corruption 
somewhere.

Thanks,
Ion

-- 
   It is better to keep your mouth shut and be thought a fool,
             than to open it and remove all doubt.


* Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server
  2005-09-30  4:16 ` Andrew Morton
  2005-09-30  8:14   ` Hendrik Visage
@ 2005-09-30 16:01   ` Hendrik Visage
  2005-09-30 17:40     ` Andrew Morton
  1 sibling, 1 reply; 10+ messages in thread
From: Hendrik Visage @ 2005-09-30 16:01 UTC
  To: Andrew Morton; +Cc: linux-net, linux-kernel, Ion Badulescu

[-- Attachment #1: Type: text/plain, Size: 281 bytes --]

On 9/30/05, Andrew Morton <akpm@osdl.org> wrote:

> The starfire changes in 2.6.12->2.6.13 look fairly innocuous.  Need that
> trace, please.

See attached :)

I'll also test without PREEMPT, as I've noticed that PREEMPT shows up in
the first line of the "problem" :(

--
Hendrik Visage

[-- Attachment #2: crash2.minicom --]
[-- Type: application/octet-stream, Size: 4192 bytes --]

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at net/core/dev.c:1099
invalid operand: 0000 [1] PREEMPT 
CPU 0 
Modules linked in: nvidia nfsd exportfs lockd sunrpc rfcomm l2cap hci_usb bluetooth starfire mii snd_ac97_bus soundcore snd_page_alloc forcedeth i2c_nforce2 dm_mirror dm_mod sbp2 ohci1394 ieee1394 ohci_hcd uhci_hcd usb_storage usbhid ehci_hcd usbcore
Pid: 11252, comm: nfsd Tainted: P      2.6.14-rc2 #3
RIP: 0010:[<ffffffff802cc7ed>] <ffffffff802cc7ed>{skb_checksum_help+157}
RSP: 0000:ffff81003a0bd998  EFLAGS: 00010246
RAX: ffff81003ff01624 RBX: ffff81003ca7f180 RCX: 00000000b7e42194
RDX: 00000000b7e42194 RSI: ffff81003ff01624 RDI: ffff81003b026080
RBP: ffff81003a0bd9b8 R08: 0000000000000000 R09: 0000000000000004
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff81003ca7f180 R15: ffff81003d462218
FS:  00002aaaaade6ae0(0000) GS:ffffffff804fe800(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaaac2000 CR3: 000000003d5a2000 CR4: 00000000000006e0
Process nfsd (pid: 11252, threadinfo ffff81003a0bc000, task ffff81003e0ed0c0)
Stack: ffffffff804cd720 ffff81003d462000 ffff81003d4623e0 ffff81003ca7f180 
       ffff81003a0bda08 ffffffff88104944 ffff81003d462218 000000013a2a8600 
       ffff81003d462000 ffff81003d462000 
Call Trace:<ffffffff88104944>{:starfire:start_tx+164} <ffffffff802db0fc>{qdisc_restart+268}
       <ffffffff802ccad0>{dev_queue_xmit+288} <ffffffff802d29b0>{neigh_resolve_output+672}
       <ffffffff802ebb27>{ip_finish_output+455} <ffffffff802ec5ff>{ip_fragment+863}
       <ffffffff802eb960>{ip_finish_output+0} <ffffffff802eca6c>{ip_output+108}
       <ffffffff8035a708>{_spin_unlock_bh+24} <ffffffff802ee1e7>{ip_push_pending_frames+919}
       <ffffffff80307d7e>{udp_push_pending_frames+574} <ffffffff80308658>{udp_sendpage+280}
       <ffffffff8031001f>{inet_sendpage+111} <ffffffff881411ea>{:sunrpc:svc_sendto+554}
       <ffffffff8818b8f9>{:nfsd:encode_post_op_attr+553} <ffffffff88141893>{:sunrpc:svc_udp_sendto+35}
       <ffffffff88142327>{:sunrpc:svc_send+247} <ffffffff88140854>{:sunrpc:svc_process+1108}
       <ffffffff8817e43e>{:nfsd:nfsd+462} <ffffffff8012e529>{schedule_tail+73}
       <ffffffff8010f61e>{child_rip+8} <ffffffff8817e270>{:nfsd:nfsd+0}
       <ffffffff8010f616>{child_rip+0} 

Code: 0f 0b 68 23 d9 39 80 c2 4b 04 8b 93 8c 00 00 00 8d 42 02 44 
RIP <ffffffff802cc7ed>{skb_checksum_help+157} RSP <ffff81003a0bd998>
 <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():0

Call Trace:<ffffffff8012db7f>{__might_sleep+191} <ffffffff80133c3c>{profile_task_exit+44}
       <ffffffff801350f5>{do_exit+37} <ffffffff8035a5c3>{_spin_unlock_irqrestore+19}
       <ffffffff8035a5cd>{_spin_unlock_irqrestore+29} <ffffffff80110294>{die+84}
       <ffffffff8035ac1e>{do_trap+334} <ffffffff8011058c>{do_invalid_op+172}
       <ffffffff802cc7ed>{skb_checksum_help+157} <ffffffff8010f469>{error_exit+0}
       <ffffffff802cc7ed>{skb_checksum_help+157} <ffffffff802cc7d5>{skb_checksum_help+133}
       <ffffffff88104944>{:starfire:start_tx+164} <ffffffff802db0fc>{qdisc_restart+268}
       <ffffffff802ccad0>{dev_queue_xmit+288} <ffffffff802d29b0>{neigh_resolve_output+672}
       <ffffffff802ebb27>{ip_finish_output+455} <ffffffff802ec5ff>{ip_fragment+863}
       <ffffffff802eb960>{ip_finish_output+0} <ffffffff802eca6c>{ip_output+108}
       <ffffffff8035a708>{_spin_unlock_bh+24} <ffffffff802ee1e7>{ip_push_pending_frames+919}
       <ffffffff80307d7e>{udp_push_pending_frames+574} <ffffffff80308658>{udp_sendpage+280}
       <ffffffff8031001f>{inet_sendpage+111} <ffffffff881411ea>{:sunrpc:svc_sendto+554}
       <ffffffff8818b8f9>{:nfsd:encode_post_op_attr+553} <ffffffff88141893>{:sunrpc:svc_udp_sendto+35}
       <ffffffff88142327>{:sunrpc:svc_send+247} <ffffffff88140854>{:sunrpc:svc_process+1108}
       <ffffffff8817e43e>{:nfsd:nfsd+462} <ffffffff8012e529>{schedule_tail+73}
       <ffffffff8010f61e>{child_rip+8} <ffffffff8817e270>{:nfsd:nfsd+0}
       <ffffffff8010f616>{child_rip+0} 
Kernel panic - not syncing: Aiee, killing interrupt handler!
 


* Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server
  2005-09-30 16:01   ` Hendrik Visage
@ 2005-09-30 17:40     ` Andrew Morton
  2005-09-30 20:10       ` Hendrik Visage
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2005-09-30 17:40 UTC
  To: Hendrik Visage; +Cc: linux-net, linux-kernel, ionut, Jeff Garzik

Hendrik Visage <hvjunk@gmail.com> wrote:
>
> On 9/30/05, Andrew Morton <akpm@osdl.org> wrote:
> 
> > The starfire changes in 2.6.12->2.6.13 look fairly innocuous.  Need that
> > trace, please.
> 
> See attached :)
> 

It helps, thanks.


> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at net/core/dev.c:1099
> invalid operand: 0000 [1] PREEMPT 
> CPU 0 
> Modules linked in: nvidia nfsd exportfs lockd sunrpc rfcomm l2cap hci_usb bluetooth starfire mii snd_ac97_bus soundcore snd_page_alloc forcedeth i2c_nforce2 dm_mirror dm_mod sbp2 ohci1394 ieee1394 ohci_hcd uhci_hcd usb_storage usbhid ehci_hcd usbcore
> Pid: 11252, comm: nfsd Tainted: P      2.6.14-rc2 #3
> RIP: 0010:[<ffffffff802cc7ed>] <ffffffff802cc7ed>{skb_checksum_help+157}
> RSP: 0000:ffff81003a0bd998  EFLAGS: 00010246
> RAX: ffff81003ff01624 RBX: ffff81003ca7f180 RCX: 00000000b7e42194
> RDX: 00000000b7e42194 RSI: ffff81003ff01624 RDI: ffff81003b026080
> RBP: ffff81003a0bd9b8 R08: 0000000000000000 R09: 0000000000000004
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000000 R14: ffff81003ca7f180 R15: ffff81003d462218
> FS:  00002aaaaade6ae0(0000) GS:ffffffff804fe800(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00002aaaaaac2000 CR3: 000000003d5a2000 CR4: 00000000000006e0
> Process nfsd (pid: 11252, threadinfo ffff81003a0bc000, task ffff81003e0ed0c0)
> Stack: ffffffff804cd720 ffff81003d462000 ffff81003d4623e0 ffff81003ca7f180 
>        ffff81003a0bda08 ffffffff88104944 ffff81003d462218 000000013a2a8600 
>        ffff81003d462000 ffff81003d462000 
> Call Trace:<ffffffff88104944>{:starfire:start_tx+164} <ffffffff802db0fc>{qdisc_restart+268}
>        <ffffffff802ccad0>{dev_queue_xmit+288} <ffffffff802d29b0>{neigh_resolve_output+672}
>        <ffffffff802ebb27>{ip_finish_output+455} <ffffffff802ec5ff>{ip_fragment+863}
>        <ffffffff802eb960>{ip_finish_output+0} <ffffffff802eca6c>{ip_output+108}


yep, there's something wrong with the skb which starfire fed into
skb_checksum_help().

	offset = skb->tail - skb->h.raw;
	if (offset <= 0)
		BUG();

And that's a post-2.6.12 driver change.  You can probably work around
it by deleting the #define ZEROCOPY line.
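
For readers following along, the check quoted above can be modelled in
userspace. This is only a minimal sketch, not kernel code: struct toy_skb
and toy_checksum_offset_ok() are hypothetical names made up for
illustration, and only the two pointers the check reads are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for struct sk_buff. */
struct toy_skb {
    unsigned char *h_raw; /* models skb->h.raw (transport header start) */
    unsigned char *tail;  /* models skb->tail (end of the linear data)  */
};

/*
 * Returns true when the quoted check would pass, and false when the
 * kernel would hit BUG(), i.e. when the transport header does not lie
 * strictly before the end of the linear data.
 * Both pointers must point into the same buffer.
 */
bool toy_checksum_offset_ok(const struct toy_skb *skb)
{
    assert(skb != NULL);
    ptrdiff_t offset = skb->tail - skb->h_raw;
    return offset > 0;
}
```

With a single buffer buf, a toy_skb of { buf + 20, buf + 64 } passes the
check, while one whose h_raw is at or beyond tail does not.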


* Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server
  2005-09-30 17:40     ` Andrew Morton
@ 2005-09-30 20:10       ` Hendrik Visage
  2005-09-30 20:55         ` Ion Badulescu
  2005-09-30 22:39         ` Herbert Xu
  0 siblings, 2 replies; 10+ messages in thread
From: Hendrik Visage @ 2005-09-30 20:10 UTC
  To: Andrew Morton; +Cc: linux-net, linux-kernel, ionut, Jeff Garzik

[-- Attachment #1: Type: text/plain, Size: 727 bytes --]

On 9/30/05, Andrew Morton <akpm@osdl.org> wrote:
> > ----------- [cut here ] --------- [please bite here ] ---------
> > Kernel BUG at net/core/dev.c:1099
> > invalid operand: 0000 [1] PREEMPT
>
> yep, there's something wrong with the skb which starfire fed into
> skb_checksum_help().
>
<snip>
>
> And that's a post-2.6.12 driver change.  You can probably work around
> it by deleting the #define ZEROCOPY line.

:)
In any case, here is a non-PREEMPT traceback. What makes this one
interesting is that in the PREEMPT case I had to push the NFS output to
get the panic, but the non-PREEMPT case (attached) sort of just happened,
i.e. when the clients just checked on the server's status :(


--
Hendrik Visage

[-- Attachment #2: non-prempt --]
[-- Type: application/octet-stream, Size: 4110 bytes --]

Kernel BUG at net/core/dev.c:1099
invalid operand: 0000 [1] 
CPU 0 
Modules linked in: nfs nfsd exportfs lockd sunrpc rfcomm l2cap hci_usb bluetooth starfire mii snd_ac97_bus soundcore snd_page_alloc forcedeth i2c_nforce2 dm_mirror dm_mod sbp2 ohci1394 ieee1394 ohci_hcd uhci_hcd usb_storage usbhid ehci_hcd usbcore
Pid: 11169, comm: nfsd Not tainted 2.6.14-rc2 #4
RIP: 0010:[<ffffffff802c803d>] <ffffffff802c803d>{skb_checksum_help+157}
RSP: 0018:ffff81003d3bda08  EFLAGS: 00010246
RAX: ffff81003ac51c24 RBX: ffff81003ac4cd80 RCX: 000000005459cd0b
RDX: 000000005459cd0b RSI: ffff81003ac51c24 RDI: ffff81003d272080
RBP: ffff81003d3bda28 R08: 0000000000000000 R09: 0000000000000006
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff81003ac4cd80 R15: ffff81003a31c218
FS:  00002aaaaade6ae0(0000) GS:ffffffff804f7800(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaabc1190 CR3: 000000003b22b000 CR4: 00000000000006e0
Process nfsd (pid: 11169, threadinfo ffff81003d3bc000, task ffff81003f7c5100)
Stack: ffff81003d3bda48 ffff81003a31c000 ffff81003a31c3e0 ffff81003ac4cd80 
       ffff81003d3bda78 ffffffff88104944 ffff8100b00000d0 0000000100000000 
       ffff81003a31c000 ffff81003a31c000 
Call Trace:<ffffffff88104944>{:starfire:start_tx+164} <ffffffff802d6583>{qdisc_restart+243}
       <ffffffff802c8325>{dev_queue_xmit+293} <ffffffff802e64c7>{ip_finish_output+455}
       <ffffffff802e6f9f>{ip_fragment+863} <ffffffff802e6300>{ip_finish_output+0}
       <ffffffff802e740c>{ip_output+108} <ffffffff8035404e>{_spin_unlock_bh+14}
       <ffffffff802e8b87>{ip_push_pending_frames+919} <ffffffff803024de>{udp_push_pending_frames+574}
       <ffffffff80302db8>{udp_sendpage+280} <ffffffff8030a39f>{inet_sendpage+111}
       <ffffffff881411ca>{:sunrpc:svc_sendto+554} <ffffffff8818b879>{:nfsd:encode_post_op_attr+553}
       <ffffffff88141873>{:sunrpc:svc_udp_sendto+35} <ffffffff88142307>{:sunrpc:svc_send+247}
       <ffffffff88140834>{:sunrpc:svc_process+1108} <ffffffff8817e3c0>{:nfsd:nfsd+448}
       <ffffffff8012dfa9>{schedule_tail+73} <ffffffff8010f50e>{child_rip+8}
       <ffffffff8817e200>{:nfsd:nfsd+0} <ffffffff8010f506>{child_rip+0}
       

Code: 0f 0b 68 0b 6f 39 80 c2 4b 04 8b 93 8c 00 00 00 8d 42 02 44 
RIP <ffffffff802c803d>{skb_checksum_help+157} RSP <ffff81003d3bda08>
 <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():0

Call Trace:<ffffffff8012d6af>{__might_sleep+191} <ffffffff801333dc>{profile_task_exit+44}
       <ffffffff80134895>{do_exit+37} <ffffffff80353ff3>{_spin_unlock_irqrestore+19}
       <ffffffff80110184>{die+84} <ffffffff8035431e>{do_trap+334}
       <ffffffff8011047c>{do_invalid_op+172} <ffffffff802c803d>{skb_checksum_help+157}
       <ffffffff802c27e5>{__alloc_skb+133} <ffffffff802c06dd>{sock_alloc_send_skb+109}
       <ffffffff802e179d>{__ip_route_output_key+1517} <ffffffff8010f359>{error_exit+0}
       <ffffffff802c803d>{skb_checksum_help+157} <ffffffff802c8025>{skb_checksum_help+133}
       <ffffffff88104944>{:starfire:start_tx+164} <ffffffff802d6583>{qdisc_restart+243}
       <ffffffff802c8325>{dev_queue_xmit+293} <ffffffff802e64c7>{ip_finish_output+455}
       <ffffffff802e6f9f>{ip_fragment+863} <ffffffff802e6300>{ip_finish_output+0}
       <ffffffff802e740c>{ip_output+108} <ffffffff8035404e>{_spin_unlock_bh+14}
       <ffffffff802e8b87>{ip_push_pending_frames+919} <ffffffff803024de>{udp_push_pending_frames+574}
       <ffffffff80302db8>{udp_sendpage+280} <ffffffff8030a39f>{inet_sendpage+111}
       <ffffffff881411ca>{:sunrpc:svc_sendto+554} <ffffffff8818b879>{:nfsd:encode_post_op_attr+553}
       <ffffffff88141873>{:sunrpc:svc_udp_sendto+35} <ffffffff88142307>{:sunrpc:svc_send+247}
       <ffffffff88140834>{:sunrpc:svc_process+1108} <ffffffff8817e3c0>{:nfsd:nfsd+448}
       <ffffffff8012dfa9>{schedule_tail+73} <ffffffff8010f50e>{child_rip+8}
       <ffffffff8817e200>{:nfsd:nfsd+0} <ffffffff8010f506>{child_rip+0}
       
Kernel panic - not syncing: Aiee, killing interrupt handler!
 


* Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server
  2005-09-30 20:10       ` Hendrik Visage
@ 2005-09-30 20:55         ` Ion Badulescu
  2005-09-30 22:39         ` Herbert Xu
  1 sibling, 0 replies; 10+ messages in thread
From: Ion Badulescu @ 2005-09-30 20:55 UTC
  To: Hendrik Visage; +Cc: Andrew Morton, linux-net, linux-kernel, Jeff Garzik

On Fri, 30 Sep 2005, Hendrik Visage wrote:

> In any case, here is a non-PREEMPT traceback.

Same trace, pretty much like I expected. Still, starfire must be getting 
a bad skb from the upper layers, because it gets passed __unmodified__ to 
skb_checksum_help().

Either that, or skb_checksum_help() itself got broken at some point, at 
least on 64-bit platforms.

I'll try to reproduce it over the weekend (assuming I can get an x86_64 
box set up, with a starfire inside) and see where the problem is.

> What makes this one interesting is that in the PREEMPT case I had to 
> push the NFS output to get the panic, but the non-PREEMPT case (attached) 
> sort of just happened, i.e. when the clients just checked on the server's 
> status :(

I'm actually surprised you got your panic from nfsd. skb_checksum_help() 
is called only when one of the fragments has length == 1, so the easiest 
way to hit it is to slowly type something into a telnet session.
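
To make the "fragment of length 1" condition concrete, here is a rough
userspace sketch. It is an assumption-laden model rather than the driver's
code: toy_has_bad_length() and its parameters are hypothetical, and the
fragment sizes are passed as a plain array instead of being read from a
real skb.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if the first (linear) part or any later fragment is exactly
 * one byte long, which is the case where the driver falls back to
 * skb_checksum_help(); returns 0 otherwise.
 */
int toy_has_bad_length(size_t first_frag_len,
                       const size_t *frag_len, size_t nr_frags)
{
    size_t i;

    assert(nr_frags == 0 || frag_len != NULL);

    if (first_frag_len == 1)
        return 1;

    for (i = 0; i < nr_frags; i++) {
        if (frag_len[i] == 1)
            return 1;
    }
    return 0;
}
```

A packet whose fragment sizes are { 1480, 1 } would be flagged, while
{ 1480, 568 } would not.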

Thanks,
Ion

-- 
   It is better to keep your mouth shut and be thought a fool,
             than to open it and remove all doubt.


* Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server
  2005-09-30 20:10       ` Hendrik Visage
  2005-09-30 20:55         ` Ion Badulescu
@ 2005-09-30 22:39         ` Herbert Xu
  2005-10-01 19:21           ` Hendrik Visage
  1 sibling, 1 reply; 10+ messages in thread
From: Herbert Xu @ 2005-09-30 22:39 UTC
  To: Hendrik Visage
  Cc: Andrew Morton, linux-net, linux-kernel, ionut, Jeff Garzik,
	netdev

[-- Attachment #1: Type: text/plain, Size: 843 bytes --]

On Fri, Sep 30, 2005 at 08:10:59PM +0000, Hendrik Visage wrote:
>
> In any case, here is a non-PREEMPT traceback. What makes this one
> interesting is that in the PREEMPT case I had to push the NFS output to
> get the panic, but the non-PREEMPT case (attached) sort of just happened,
> i.e. when the clients just checked on the server's status :(

You must never call skb_checksum_help unless the packet is meant to
be checksummed by the hardware.  So starfire is the guilty party here.

This patch makes it do the check and also check for errors from
skb_checksum_help.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[-- Attachment #2: p --]
[-- Type: text/plain, Size: 654 bytes --]

diff --git a/drivers/net/starfire.c b/drivers/net/starfire.c
--- a/drivers/net/starfire.c
+++ b/drivers/net/starfire.c
@@ -1333,7 +1333,7 @@ static int start_tx(struct sk_buff *skb,
 	}
 
 #if defined(ZEROCOPY) && defined(HAS_BROKEN_FIRMWARE)
-	{
+	if (skb->ip_summed == CHECKSUM_HW) {
 		int has_bad_length = 0;
 
 		if (skb_first_frag_len(skb) == 1)
@@ -1346,8 +1346,10 @@ static int start_tx(struct sk_buff *skb,
 				}
 		}
 
-		if (has_bad_length)
-			skb_checksum_help(skb, 0);
+		if (has_bad_length && unlikely(skb_checksum_help(skb, 0))) {
+			dev_kfree_skb(skb);
+			return NETDEV_TX_OK;
+		}
 	}
 #endif /* ZEROCOPY && HAS_BROKEN_FIRMWARE */
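
The control flow this patch introduces can be sketched in userspace as
follows. This is a simplified model under stated assumptions, not the
driver itself: the toy_* enums and toy_tx_decide() are hypothetical names,
helper_err stands in for the return value of skb_checksum_help(), and the
surrounding ZEROCOPY/HAS_BROKEN_FIRMWARE conditional is left out.

```c
#include <assert.h>

/* Hypothetical stand-ins for the skb->ip_summed values that matter here. */
enum toy_ip_summed { TOY_CHECKSUM_NONE, TOY_CHECKSUM_HW };

/* What the transmit path ends up doing with the packet. */
enum toy_tx_action {
    TOY_TX_SEND_AS_IS,   /* helper not called, packet queued unchanged */
    TOY_TX_SEND_SW_CSUM, /* helper succeeded, checksum done in software */
    TOY_TX_DROP          /* helper failed, packet freed */
};

/*
 * Models the patched logic: the software checksum helper is only
 * considered when the stack asked for hardware checksumming, only used
 * when a bad fragment length was seen, and a non-zero helper result
 * drops the packet instead of being ignored.
 */
enum toy_tx_action toy_tx_decide(enum toy_ip_summed ip_summed,
                                 int has_bad_length, int helper_err)
{
    assert(ip_summed == TOY_CHECKSUM_NONE || ip_summed == TOY_CHECKSUM_HW);

    if (ip_summed != TOY_CHECKSUM_HW)
        return TOY_TX_SEND_AS_IS;
    if (!has_bad_length)
        return TOY_TX_SEND_AS_IS;
    if (helper_err)
        return TOY_TX_DROP;
    return TOY_TX_SEND_SW_CSUM;
}
```

The case from the traces corresponds to a packet that was not marked for
hardware checksumming: before the patch it could still reach the helper,
whereas in this model it is sent unchanged.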
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server
  2005-09-30 22:39         ` Herbert Xu
@ 2005-10-01 19:21           ` Hendrik Visage
  0 siblings, 0 replies; 10+ messages in thread
From: Hendrik Visage @ 2005-10-01 19:21 UTC
  To: Herbert Xu
  Cc: Andrew Morton, linux-net, linux-kernel, ionut, Jeff Garzik,
	netdev

On 10/1/05, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> You must never call skb_checksum_help unless the packet is meant to
> be checksummed by the hardware.  So starfire is the guilty party here.
>
> This patch makes it do the check and also check for errors from
> skb_checksum_help.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Thanx Herbert,
 at least on 2.6.14_rc2 the patch appears to work for my stress test :)

--
Hendrik Visage
