netdev.vger.kernel.org archive mirror
* 2.6.2 crash after network link failure
@ 2004-02-09 13:01 Petr Vandrovec
  2004-02-09 22:28 ` David S. Miller
  0 siblings, 1 reply; 4+ messages in thread
From: Petr Vandrovec @ 2004-02-09 13:01 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

Hi,
  on Saturday our network switch lost power due to a scheduled
power outage, and unfortunately one of the systems connected to the switch
(mine :-( ) did not survive it. When I came to the system's console today,
I found (written by hand, made a bit shorter...) on screen:

e100: eth0 NIC Link is Up 100Mbps Full Duplex
kernel/sched.c:1802: spin_is_locked on uninitialized spinlock d6e74e18
Unable to handle NULL kernel pointer dereference at virtual address 00000000
...
EIP: C0120AF1  <__wake_up_common + 0x17/0x57>
...
EAX: D6E74E30   EBX: D6E74E18    ECX: 00000001    EDX: 00000000
ESI: 00000001   EDI: 00000001    EBP: C9F3DEDC    ESP: C9F3DEC0
...
Call Trace:
   __wake_up + 0x7B/0x139
   sock_def_write_space + 0x8C/0x94
   sock_wfree + 0x4B/0x4D
   __kfree_skb + 0x5C/0xD9
   net_tx_action + 0x3E/0x210
   do_softirq + 0x92/0x94
   do_IRQ + 0x207/0x360
   common_interrupt + 0x18/0x20


It looks to me like we've got an skb on the completion_queue that was connected
to a rather unhappy socket - one which had sk->sk_sleep uninitialized. The only
problem is that only af_unix sets skb->destructor to sock_wfree, so I don't see
how this could be triggered by an e100 link change.

Kernel is approx 2.6.2-bk2, UP, APIC, IOAPIC, no-preempt, all DEBUG options except
CONFIG_FRAME_POINTER enabled.

System is 1.6GHz P4/512MB RAM.
						Best regards,
							Petr Vandrovec
							vandrove@vc.cvut.cz


* Re: 2.6.2 crash after network link failure
  2004-02-09 13:01 2.6.2 crash after network link failure Petr Vandrovec
@ 2004-02-09 22:28 ` David S. Miller
  0 siblings, 0 replies; 4+ messages in thread
From: David S. Miller @ 2004-02-09 22:28 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: netdev, linux-kernel, scott.feldman

On Mon, 9 Feb 2004 14:01:34 +0100
Petr Vandrovec <vandrove@vc.cvut.cz> wrote:

> It looks to me like that we've got skb on completion_queue which was connected
> to a bit unhappy socket - one which had sk->sk_sleep uninitialized. Only problem 
> is that only af_unix sets skb->destructor to sock_wfree, so I somehow miss how 
> this could be triggered by e100 link change.

It is not only af_unix: any time we invoke skb_set_owner_w() we get sock_wfree()
as the destructor. Furthermore, sock_def_write_space is the default write-space
handler given to all sockets unless they override it.
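
The ownership chain described above can be sketched as a small userspace
model. This is only an illustration, not kernel code: the structs borrow a
few kernel field names, the model_* helpers are hypothetical, and the
refcounting, atomics and wait-queue handling of the real functions are
left out. It shows why any skb that went through skb_set_owner_w() ends up
calling the socket's write-space handler when it is freed, which matches
the call trace in the original report.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only. These are NOT the kernel definitions. */
struct sock;
struct sk_buff;

struct sock {
    int wmem_alloc;                          /* bytes charged to the socket */
    int wakeups;                             /* write-space callbacks seen  */
    void (*sk_write_space)(struct sock *sk); /* overridable handler         */
};

struct sk_buff {
    struct sock *sk;
    int truesize;
    void (*destructor)(struct sk_buff *skb);
};

/* Default handler: the real one wakes sleepers on sk->sk_sleep;
 * the model just counts the call. */
static void sock_def_write_space(struct sock *sk)
{
    sk->wakeups++;
}

/* Destructor installed by skb_set_owner_w(): uncharge, then notify. */
static void sock_wfree(struct sk_buff *skb)
{
    struct sock *sk = skb->sk;

    sk->wmem_alloc -= skb->truesize;
    sk->sk_write_space(sk);
}

/* Any protocol that charges a TX skb to its socket goes through here. */
static void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
    skb->sk = sk;
    skb->destructor = sock_wfree;
    sk->wmem_alloc += skb->truesize;
}

/* Hypothetical helper: gives a socket the default handler. */
static void model_sock_init(struct sock *sk)
{
    sk->wmem_alloc = 0;
    sk->wakeups = 0;
    sk->sk_write_space = sock_def_write_space;
}

/* Hypothetical helper standing in for the free path: run the
 * destructor once, then detach the skb from its socket. */
static void model_kfree_skb(struct sk_buff *skb)
{
    if (skb->destructor)
        skb->destructor(skb);
    skb->destructor = NULL;
    skb->sk = NULL;
}
```

In this model, freeing the skb after the socket's state has been torn down
(or freeing it twice without clearing the destructor) would follow a stale
pointer in sock_wfree(), which is the kind of failure being discussed.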

Maybe e100 is mangling its TX queue or in fact freeing things twice.

I think what might be happening is that somehow the TX queue is corrupted if
e100_config() runs (due to link UP state change) while there are active normal
SKB packets on the TX queue.  Or perhaps some TX queue handling locking issue.

Scott, any ideas?


* RE: 2.6.2 crash after network link failure
@ 2004-02-09 23:37 Feldman, Scott
  2004-02-10 14:11 ` Petr Vandrovec
  0 siblings, 1 reply; 4+ messages in thread
From: Feldman, Scott @ 2004-02-09 23:37 UTC (permalink / raw)
  To: David S. Miller, Petr Vandrovec; +Cc: netdev, linux-kernel

> I think what might be happening is that somehow the TX queue 
> is corrupted if
> e100_config() runs (due to link UP state change) while there 
> are active normal SKB packets on the TX queue.  Or perhaps 
> some TX queue handling locking issue.
> 
> Scott, any ideas?

e100 hardware will continue to process the hardware's Tx queue even
after link is lost, and then clean up (return skbs) on interrupt.  I
would expect e100 to be holding no Tx skbs when link returns.
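
A toy model of that expectation, as a sketch only: the ring layout, the
slot flags and every function name here are hypothetical and do not come
from the e100 driver. The point it illustrates is that if the hardware
keeps completing queued buffers regardless of link state, and the
interrupt-time cleanup returns every completed buffer, nothing is left
outstanding by the time link comes back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_RING_SIZE 8

struct tx_slot {
    bool in_use;    /* driver queued a buffer in this slot */
    bool complete;  /* "hardware" has finished with it     */
};

struct tx_ring {
    struct tx_slot slot[TX_RING_SIZE];
    bool link_up;   /* tracked, but deliberately ignored by hw_process() */
};

/* Driver side: queue one buffer; returns false when the ring is full. */
static bool tx_queue(struct tx_ring *r)
{
    for (size_t i = 0; i < TX_RING_SIZE; i++) {
        if (!r->slot[i].in_use) {
            r->slot[i].in_use = true;
            r->slot[i].complete = false;
            return true;
        }
    }
    return false;
}

/* Hardware side: completes every queued slot, link or no link. */
static void hw_process(struct tx_ring *r)
{
    for (size_t i = 0; i < TX_RING_SIZE; i++) {
        if (r->slot[i].in_use)
            r->slot[i].complete = true;
    }
}

/* Interrupt-time cleanup: give completed buffers back, return the count. */
static int tx_clean(struct tx_ring *r)
{
    int returned = 0;

    for (size_t i = 0; i < TX_RING_SIZE; i++) {
        if (r->slot[i].in_use && r->slot[i].complete) {
            r->slot[i].in_use = false;
            r->slot[i].complete = false;
            returned++;
        }
    }
    return returned;
}

/* Number of buffers the driver is still holding. */
static int tx_outstanding(const struct tx_ring *r)
{
    int n = 0;

    for (size_t i = 0; i < TX_RING_SIZE; i++) {
        if (r->slot[i].in_use)
            n++;
    }
    return n;
}
```

If a crash like the one reported does involve the Tx path, it would mean
some step in the real driver departs from this model, for example a buffer
being returned twice or the ring being reset while slots are still in use.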

Petr, -mm kernel has an updated (and much simpler) e100 driver.  Is this
something you can try?  The switch failure can be simulated by manually
plugging the cable in/out.

-scott


* Re: 2.6.2 crash after network link failure
  2004-02-09 23:37 Feldman, Scott
@ 2004-02-10 14:11 ` Petr Vandrovec
  0 siblings, 0 replies; 4+ messages in thread
From: Petr Vandrovec @ 2004-02-10 14:11 UTC (permalink / raw)
  To: Feldman, Scott; +Cc: David S. Miller, netdev, linux-kernel

On Mon, Feb 09, 2004 at 03:37:48PM -0800, Feldman, Scott wrote:
> > I think what might be happening is that somehow the TX queue 
> > is corrupted if
> > e100_config() runs (due to link UP state change) while there 
> > are active normal SKB packets on the TX queue.  Or perhaps 
> > some TX queue handling locking issue.
> > 
> > Scott, any ideas?
> 
> e100 hardware will continue to process the hardware's Tx queue even
> after link is lost, and then cleanup (return skbs) on interrupt.  I
> would expect e100 to be holding no Tx skbs when link returned.
> 
> Petr, -mm kernel has an updated (and much simpler) e100 driver.  Is this
> something you can try?  The switch failure can be simulated by manually
> plugging the cable in/out.

Unfortunately it does not seem to be easily triggerable. I spent about one
hour plugging/unplugging the cable while transmitting UDP packets as fast
as possible, sometimes interleaved with 'mii-tool -r', and it refused
to crash...
						Petr Vandrovec

