* 2.6.2 crash after network link failure
From: Petr Vandrovec @ 2004-02-09 13:01 UTC
To: netdev; +Cc: linux-kernel
Hi,
on Saturday our network switch lost power due to a scheduled
power outage, and unfortunately one of the systems connected to the switch
(mine :-( ) did not survive it. When I came to the system's console today,
I found this on screen (copied by hand and shortened a bit):
e100: eth0 NIC Link is Up 100Mbps Full Duplex
kernel/sched.c:1802: spin_is_locked on uninitialized spinlock d6e74e18
Unable to handle NULL kernel pointer dereference at virtual address 00000000
...
EIP: C0120AF1 <__wake_up_common + 0x17/0x57>
...
EAX: D6E74E30 EBX: D6E74E18 ECX: 00000001 EDX: 00000000
ESI: 00000001 EDI: 00000001 EBP: C9F3DEDC ESP: C9F3DEC0
...
Call Trace:
__wake_up + 0x7B/0x139
sock_def_write_space + 0x8C/0x94
sock_wfree + 0x4B/0x4D
__kfree_skb + 0x5C/0xD9
net_tx_action + 0x3E/0x210
do_softirq + 0x92/0x94
do_IRQ + 0x207/0x360
common_interrupt + 0x18/0x20
It looks to me like we've got an skb on completion_queue which was attached
to a somewhat unhappy socket, one which had sk->sk_sleep uninitialized. The only
problem is that only af_unix sets skb->destructor to sock_wfree, so I fail to see
how this could be triggered by an e100 link change.
The kernel is approximately 2.6.2-bk2, UP, APIC, IOAPIC, no-preempt, with all DEBUG
options except CONFIG_FRAME_POINTER enabled.
The system is a 1.6GHz P4 with 512MB RAM.
Best regards,
Petr Vandrovec
vandrove@vc.cvut.cz
* Re: 2.6.2 crash after network link failure
From: David S. Miller @ 2004-02-09 22:28 UTC
To: Petr Vandrovec; +Cc: netdev, linux-kernel, scott.feldman
On Mon, 9 Feb 2004 14:01:34 +0100
Petr Vandrovec <vandrove@vc.cvut.cz> wrote:
> It looks to me like that we've got skb on completion_queue which was connected
> to a bit unhappy socket - one which had sk->sk_sleep uninitialized. Only problem
> is that only af_unix sets skb->destructor to sock_wfree, so I somehow miss how
> this could be triggered by e100 link change.
It is not only af_unix: any time we invoke skb_set_owner_w() we get sock_wfree()
as the destructor. Furthermore, sock_def_write_space is the default write-space
handler given to all sockets unless they override it.
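The ownership chain described above can be sketched in user space. This is
only a simplified model, not the kernel source: the struct layouts, the
wakeups counter (standing in for waking sleepers on sk->sk_sleep), and
model_kfree_skb() are assumptions made for illustration, and the socket
reference counting and flag checks are left out.

```c
#include <assert.h>
#include <stddef.h>

struct sock;

/* Illustrative subset of an skb: only the fields the destructor path needs. */
struct sk_buff {
    struct sock *sk;
    void (*destructor)(struct sk_buff *skb);
    unsigned int truesize;
};

/* Illustrative subset of a socket. */
struct sock {
    unsigned int wmem_alloc;                 /* bytes charged to the socket */
    int wakeups;                             /* hypothetical: counts wake-ups */
    void (*sk_write_space)(struct sock *sk); /* write-space callback */
};

/* Default handler: in this model it only records that a wake-up happened. */
void sock_def_write_space(struct sock *sk)
{
    sk->wakeups++;
}

/* Destructor: uncharge the skb and notify the owning socket. */
void sock_wfree(struct sk_buff *skb)
{
    struct sock *sk = skb->sk;

    assert(sk != NULL);
    sk->wmem_alloc -= skb->truesize;
    sk->sk_write_space(sk);
}

/* Any protocol charging an skb to a socket ends up with this destructor. */
void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
    skb->sk = sk;
    skb->destructor = sock_wfree;
    sk->wmem_alloc += skb->truesize;
}

/* Hypothetical stand-in for the free path: run the destructor if one is set. */
void model_kfree_skb(struct sk_buff *skb)
{
    if (skb->destructor)
        skb->destructor(skb);
}
```

In this model, freeing an owned skb always reaches the socket through
skb->sk, which is why a stale or doubly freed skb would end up touching
whatever socket state that pointer still refers to.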
Maybe e100 is mangling its TX queue or in fact freeing things twice.
I think what might be happening is that somehow the TX queue is corrupted if
e100_config() runs (due to link UP state change) while there are active normal
SKB packets on the TX queue. Or perhaps some TX queue handling locking issue.
Scott, any ideas?
* RE: 2.6.2 crash after network link failure
From: Feldman, Scott @ 2004-02-09 23:37 UTC
To: David S. Miller, Petr Vandrovec; +Cc: netdev, linux-kernel
> I think what might be happening is that somehow the TX queue
> is corrupted if
> e100_config() runs (due to link UP state change) while there
> are active normal SKB packets on the TX queue. Or perhaps
> some TX queue handling locking issue.
>
> Scott, any ideas?
e100 hardware will continue to process the hardware's Tx queue even
after link is lost, and the driver then cleans up (returns skbs) on
interrupt. I would expect e100 to be holding no Tx skbs when link returns.
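The expectation above can be illustrated with a generic Tx ring model. This
is a sketch under stated assumptions, not e100 driver code: the ring size,
the slot layout, and every function name here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8 /* hypothetical ring size */

struct tx_slot {
    void *skb;     /* stand-in for the queued packet */
    bool complete; /* set once the "hardware" has processed the slot */
};

struct tx_ring {
    struct tx_slot slot[RING_SIZE];
    unsigned int to_use;   /* next slot the driver fills */
    unsigned int to_clean; /* oldest slot not yet cleaned */
};

/* Queue a packet; returns -1 when the ring is full. */
int ring_xmit(struct tx_ring *r, void *skb)
{
    unsigned int next;

    assert(r != NULL);
    next = (r->to_use + 1) % RING_SIZE;
    if (next == r->to_clean)
        return -1;
    r->slot[r->to_use].skb = skb;
    r->slot[r->to_use].complete = false;
    r->to_use = next;
    return 0;
}

/* Model of hardware draining its queue, which happens even with no link. */
void hw_process_all(struct tx_ring *r)
{
    for (unsigned int i = r->to_clean; i != r->to_use; i = (i + 1) % RING_SIZE)
        r->slot[i].complete = true;
}

/* Interrupt-time cleanup: release every completed slot, oldest first. */
unsigned int ring_clean(struct tx_ring *r)
{
    unsigned int cleaned = 0;

    while (r->to_clean != r->to_use && r->slot[r->to_clean].complete) {
        r->slot[r->to_clean].skb = NULL; /* a driver would free the skb here */
        r->slot[r->to_clean].complete = false;
        r->to_clean = (r->to_clean + 1) % RING_SIZE;
        cleaned++;
    }
    return cleaned;
}

/* Number of packets the ring is still holding. */
unsigned int ring_pending(const struct tx_ring *r)
{
    return (r->to_use + RING_SIZE - r->to_clean) % RING_SIZE;
}
```

If the hardware has drained the queue and the cleanup has run, the ring
holds nothing when the link comes back, so any skb still referenced at that
point would indicate a problem elsewhere.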
Petr, -mm kernel has an updated (and much simpler) e100 driver. Is this
something you can try? The switch failure can be simulated by manually
plugging the cable in/out.
-scott
* Re: 2.6.2 crash after network link failure
From: Petr Vandrovec @ 2004-02-10 14:11 UTC
To: Feldman, Scott; +Cc: David S. Miller, netdev, linux-kernel
On Mon, Feb 09, 2004 at 03:37:48PM -0800, Feldman, Scott wrote:
> > I think what might be happening is that somehow the TX queue
> > is corrupted if
> > e100_config() runs (due to link UP state change) while there
> > are active normal SKB packets on the TX queue. Or perhaps
> > some TX queue handling locking issue.
> >
> > Scott, any ideas?
>
> e100 hardware will continue to process the hardware's Tx queue even
> after link is lost, and then cleanup (return skbs) on interrupt. I
> would expect e100 to be holding no Tx skbs when link returned.
>
> Petr, -mm kernel has an updated (and much simpler) e100 driver. Is this
> something you can try? The switch failure can be simulated by manually
> plugging the cable in/out.
Unfortunately it does not seem to be easily triggerable. I spent about one
hour plugging and unplugging the cable while transmitting UDP packets as
fast as possible, sometimes interleaved with 'mii-tool -r', and it refused
to crash...
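A load generator for this kind of test could look roughly like the sketch
below. The tool actually used is not stated in the thread, so the function
name udp_blast(), its parameters, and the payload size limit are all
assumptions; it only uses standard BSD socket calls.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send `count` UDP datagrams of `len` bytes to ip:port as fast as the
 * socket accepts them. Returns the number of datagrams handed to the
 * stack, or -1 on bad arguments or socket failure. Individual send
 * errors (expected while the cable is unplugged) are skipped.
 */
long udp_blast(const char *ip, unsigned short port, size_t len, long count)
{
    struct sockaddr_in dst;
    char payload[1472]; /* assumed limit: fits one Ethernet frame */
    long sent = 0;
    int fd;

    assert(ip != NULL);
    if (len == 0 || len > sizeof(payload))
        return -1;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(payload, 0xA5, len);
    for (long i = 0; i < count; i++) {
        ssize_t n = sendto(fd, payload, len, 0,
                           (struct sockaddr *)&dst, sizeof(dst));
        if (n == (ssize_t)len)
            sent++;
    }

    close(fd);
    return sent;
}
```

Calling it in a loop from a small main() while toggling the link would
keep the Tx queue busy across link state changes, which is the condition
the test is trying to hit.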
Petr Vandrovec