Re: Lockup with 2.6.9-ac15 related to netconsole

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: Lockup with 2.6.9-ac15 related to netconsole
       [not found]       ` <20041217233524.GA11202@electric-eye.fr.zoreil.com>
@ 2004-12-20  9:42         ` Mark Broadbent
  2004-12-20 21:14           ` Matt Mackall
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Broadbent @ 2004-12-20  9:42 UTC (permalink / raw)
  To: romieu; +Cc: mpm, linux-kernel, netdev


Francois Romieu said:
> Matt Mackall <mpm@selenic.com> :
> [...]
>> Please try the attached untested, uncompiled patch to add polling to
>> r8169:
> [...]
>> @@ -1839,6 +1842,15 @@
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_NET_POLL_CONTROLLER
>> +static void rtl8169_netpoll(struct net_device *dev)
>> +{
>> +	disable_irq(dev->irq);
>> +	rtl8169_interrupt(dev->irq, netdev, NULL);
>                                    ^^^^^^ -> should be "dev"
>
> The r8169 driver in -mm offers netpoll. A patch which syncs the r8169
> driver from 2.6.10-rc3 with current -mm is available at:
> http://www.fr.zoreil.com/people/francois/misc/20041218-2.6.10-rc3-r8169.c-test.patch>
> Please report success/failure. Cc: netdev@oss.sgi.com is welcome.

Exactly the same happens, I still get a 'NMI Watchdog detected LOCKUP'
with the r8169 device using the above patch on top of 2.6.10-rc3-bk10.
Thanks
Mark

-- 
Mark Broadbent <markb@wetlettuce.com>
Web: http://www.wetlettuce.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-20  9:42         ` Lockup with 2.6.9-ac15 related to netconsole Mark Broadbent
@ 2004-12-20 21:14           ` Matt Mackall
  2004-12-21  0:22             ` Francois Romieu
  0 siblings, 1 reply; 20+ messages in thread
From: Matt Mackall @ 2004-12-20 21:14 UTC (permalink / raw)
  To: Mark Broadbent; +Cc: romieu, linux-kernel, netdev

On Mon, Dec 20, 2004 at 09:42:08AM -0000, Mark Broadbent wrote:
> 
> Exactly the same happens, I still get a 'NMI Watchdog detected LOCKUP'
> with the r8169 device using the above patch on top of 2.6.10-rc3-bk10.

Ok, that suggests a problem localized to netpoll itself. Do you have
spinlock debugging turned on by any chance? 

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-20 21:14           ` Matt Mackall
@ 2004-12-21  0:22             ` Francois Romieu
  2004-12-21  0:55               ` Matt Mackall
  0 siblings, 1 reply; 20+ messages in thread
From: Francois Romieu @ 2004-12-21  0:22 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Mark Broadbent, linux-kernel, netdev

Matt Mackall <mpm@selenic.com> :
> On Mon, Dec 20, 2004 at 09:42:08AM -0000, Mark Broadbent wrote:
> > 
> > Exactly the same happens, I still get a 'NMI Watchdog detected LOCKUP'
> > with the r8169 device using the above patch on top of 2.6.10-rc3-bk10.
> 
> Ok, that suggests a problem localized to netpoll itself. Do you have
> spinlock debugging turned on by any chance? 

Any chance of:
1 dev_queue_xmit
2 dev->xmit_lock taken
3 interruption
4 printk
5 netconsole write
6 dev->xmit_lock again
7 lockup

?

This is probably the silly question of the day.

--
Ueimor

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-21  0:22             ` Francois Romieu
@ 2004-12-21  0:55               ` Matt Mackall
  2004-12-21 10:23                 ` Mark Broadbent
  0 siblings, 1 reply; 20+ messages in thread
From: Matt Mackall @ 2004-12-21  0:55 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Mark Broadbent, linux-kernel, netdev

On Tue, Dec 21, 2004 at 01:22:18AM +0100, Francois Romieu wrote:
> Matt Mackall <mpm@selenic.com> :
> > On Mon, Dec 20, 2004 at 09:42:08AM -0000, Mark Broadbent wrote:
> > > 
> > > Exactly the same happens, I still get a 'NMI Watchdog detected LOCKUP'
> > > with the r8169 device using the above patch on top of 2.6.10-rc3-bk10.
> > 
> > Ok, that suggests a problem localized to netpoll itself. Do you have
> > spinlock debugging turned on by any chance? 
> 
> Any chance of:
> 1 dev_queue_xmit
> 2 dev->xmit_lock taken
> 3 interruption
> 4 printk
> 5 netconsole write
> 6 dev->xmit_lock again
> 7 lockup
> 
> ?
> 
> This is probably the silly question of the day.

Maybe, but the answer isn't obvious to me at the moment as I haven't
been thinking about such stuff enough lately. Silly response of the
day:

Mark, can you try this (again completely untested, but at least
compiles) patch? I'm afraid I don't have a proper test rig to
reproduce this at the moment. This will attempt to grab the lock, and
if it fails, will check for recursion. Then it will try to print a
message on the local console, temporarily disabling netconsole to
allow the printk to get through..

Index: l/net/core/netpoll.c
===================================================================
--- l.orig/net/core/netpoll.c	2004-11-04 10:53:23.388610000 -0800
+++ l/net/core/netpoll.c	2004-12-20 16:45:40.212709000 -0800
@@ -31,6 +31,8 @@
 #define MAX_SKBS 32
 #define MAX_UDP_CHUNK 1460
 
+static int netpoll_kill;
+
 static spinlock_t skb_list_lock = SPIN_LOCK_UNLOCKED;
 static int nr_skbs;
 static struct sk_buff *skbs;
@@ -183,13 +185,24 @@
 	int status;
 
 repeat:
-	if(!np || !np->dev || !netif_running(np->dev)) {
+	if(!np || !np->dev || !netif_running(np->dev) || netpoll_kill) {
 		__kfree_skb(skb);
 		return;
 	}
 
-	spin_lock(&np->dev->xmit_lock);
-	np->dev->xmit_lock_owner = smp_processor_id();
+	if(spin_trylock(&np->dev->xmit_lock))
+		np->dev->xmit_lock_owner = smp_processor_id();
+	else {
+		if(np->dev->xmit_lock_owner == smp_processor_id()) {
+			netpoll_kill = 1;
+			__kfree_skb(skb);
+			printk("Tried to recursively get dev->xmit_lock");
+			netpoll_kill = 0;
+			return;
+		}
+		spin_lock(&np->dev->xmit_lock);
+
+	}
 
 	/*
 	 * network drivers do not expect to be called if the queue is


-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-21  0:55               ` Matt Mackall
@ 2004-12-21 10:23                 ` Mark Broadbent
  2004-12-21 12:37                   ` Francois Romieu
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Broadbent @ 2004-12-21 10:23 UTC (permalink / raw)
  To: mpm; +Cc: romieu, linux-kernel, netdev


Matt Mackall said:
> On Tue, Dec 21, 2004 at 01:22:18AM +0100, Francois Romieu wrote:
>> Matt Mackall <mpm@selenic.com> :
>> > On Mon, Dec 20, 2004 at 09:42:08AM -0000, Mark Broadbent wrote:
>> > >
>> > > Exactly the same happens, I still get a 'NMI Watchdog detected
>> > > LOCKUP' with the r8169 device using the above patch on top of
>> > > 2.6.10-rc3-bk10.
>> >
>> > Ok, that suggests a problem localized to netpoll itself. Do you have
>> > spinlock debugging turned on by any chance?
>>
>> Any chance of:
>> 1 dev_queue_xmit
>> 2 dev->xmit_lock taken
>> 3 interruption
>> 4 printk
>> 5 netconsole write
>> 6 dev->xmit_lock again
>> 7 lockup
>>
>> ?
>>
>> This is probably the silly question of the day.
>
> Maybe, but the answer isn't obvious to me at the moment as I haven't
> been thinking about such stuff enough lately. Silly response of the
> day:
>
> Mark, can you try this (again completely untested, but at least
> compiles) patch? I'm afraid I don't have a proper test rig to
> reproduce this at the moment. This will attempt to grab the lock, and
> if it fails, will check for recursion. Then it will try to print a
> message on the local console, temporarily disabling netconsole to
> allow the printk to get through..

OK, patch applied and spinlock debugging enabled.  Testing with eth1
(r1869) doesn'tyield any additional messages, just the standard 'NMI Watchdog detected
lockup'.
Thanks
Mark

-- 
Mark Broadbent <markb@wetlettuce.com>
Web: http://www.wetlettuce.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-21 10:23                 ` Mark Broadbent
@ 2004-12-21 12:37                   ` Francois Romieu
  2004-12-21 13:29                     ` Mark Broadbent
  0 siblings, 1 reply; 20+ messages in thread
From: Francois Romieu @ 2004-12-21 12:37 UTC (permalink / raw)
  To: Mark Broadbent; +Cc: mpm, romieu, linux-kernel, netdev

Mark Broadbent <markb@wetlettuce.com> :
[...]
> OK, patch applied and spinlock debugging enabled.  Testing with eth1
> (r1869) doesn'tyield any additional messages, just the standard
> 'NMI Watchdog detected lockup'.

Does the modified version below trigger _exactly_ the same hang ?

--- net/core/netpoll.c	2004-12-21 13:09:51.000000000 +0100
+++ net/core/netpoll.c	2004-12-21 13:27:01.000000000 +0100
@@ -31,6 +31,8 @@
 #define MAX_SKBS 32
 #define MAX_UDP_CHUNK 1460
 
+static int netpoll_kill;
+
 static spinlock_t skb_list_lock = SPIN_LOCK_UNLOCKED;
 static int nr_skbs;
 static struct sk_buff *skbs;
@@ -184,11 +186,21 @@ void netpoll_send_skb(struct netpoll *np
 
 repeat:
 	if(!np || !np->dev || !netif_running(np->dev)) {
+too_bad:
 		__kfree_skb(skb);
 		return;
 	}
 
-	spin_lock(&np->dev->xmit_lock);
+	if (!spin_trylock(&np->dev->xmit_lock)) {
+		netpoll_kill = 1;
+		goto too_bad;
+	}
+
+	if (netpoll_kill) {
+		if (net_ratelimit())
+			printk(KERN_ERR "netconsole raced");
+		netpoll_kill = 0;
+	}
 	np->dev->xmit_lock_owner = smp_processor_id();
 
 	/*

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-21 12:37                   ` Francois Romieu
@ 2004-12-21 13:29                     ` Mark Broadbent
  2004-12-21 20:48                       ` Francois Romieu
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Broadbent @ 2004-12-21 13:29 UTC (permalink / raw)
  To: romieu; +Cc: mpm, linux-kernel, netdev


Francois Romieu said:
> Mark Broadbent <markb@wetlettuce.com> :
> [...]
>> OK, patch applied and spinlock debugging enabled.  Testing with eth1
>> (r1869) doesn'tyield any additional messages, just the standard
>> 'NMI Watchdog detected lockup'.
>
> Does the modified version below trigger _exactly_ the same hang ?

Using the patch supplied I get no hang, just the message 'netconsole
raced' output to the console and the packet capture proceeds as normal.
Thanks
Mark

-- 
Mark Broadbent <markb@wetlettuce.com>
Web: http://www.wetlettuce.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-21 13:29                     ` Mark Broadbent
@ 2004-12-21 20:48                       ` Francois Romieu
  2004-12-21 21:27                         ` Matt Mackall
  0 siblings, 1 reply; 20+ messages in thread
From: Francois Romieu @ 2004-12-21 20:48 UTC (permalink / raw)
  To: Mark Broadbent; +Cc: mpm, linux-kernel, netdev

Mark Broadbent <markb@wetlettuce.com> :
[...]
> Using the patch supplied I get no hang, just the message 'netconsole
> raced' output to the console and the packet capture proceeds as normal.
> Thanks

The patch is more a bandaid for debugging than a real fix. The netconsole
will drop some messages until its locking is fixed

If you can issue one more test, I'd like to know if some messages appear
on the VGA console around the time at which tcpdump is started (the test
assumes that netconsole is not used/insmoded at all). Please check that
the console log_level is set high enough as it will be really disappointing
if nothing appears :o)

--
Ueimor

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-21 20:48                       ` Francois Romieu
@ 2004-12-21 21:27                         ` Matt Mackall
  2004-12-21 22:58                           ` Francois Romieu
  0 siblings, 1 reply; 20+ messages in thread
From: Matt Mackall @ 2004-12-21 21:27 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Mark Broadbent, linux-kernel, netdev

On Tue, Dec 21, 2004 at 09:48:53PM +0100, Francois Romieu wrote:
> Mark Broadbent <markb@wetlettuce.com> :
> [...]
> > Using the patch supplied I get no hang, just the message 'netconsole
> > raced' output to the console and the packet capture proceeds as normal.
> > Thanks
> 
> The patch is more a bandaid for debugging than a real fix. The netconsole
> will drop some messages until its locking is fixed

Unfortunately there's no good way to fix its locking in this
circumstance (or the harder case of driver-private locks). I think
I'll just have to come up with some scheme for queueing messages that
arrive when the queue lock is held.

> If you can issue one more test, I'd like to know if some messages appear
> on the VGA console around the time at which tcpdump is started (the test
> assumes that netconsole is not used/insmoded at all). Please check that
> the console log_level is set high enough as it will be really disappointing
> if nothing appears :o)

I think it's the promiscuous mode message itself that's the problem
but I've not had time to reproduce it.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-21 21:27                         ` Matt Mackall
@ 2004-12-21 22:58                           ` Francois Romieu
  2004-12-22  9:34                             ` Patrick McHardy
  0 siblings, 1 reply; 20+ messages in thread
From: Francois Romieu @ 2004-12-21 22:58 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Mark Broadbent, linux-kernel, netdev

Matt Mackall <mpm@selenic.com> :
[...]
> I think it's the promiscuous mode message itself that's the problem

Yes. dev_mc_add takes dev->xmit_lock and the game is over.

Marc, if the patch below happens to work, it should not drop messages
like the previous one (it is an ugly short-term suggestion).

--- net/core/netpoll.c	2004-12-21 13:09:51.000000000 +0100
+++ net/core/netpoll.c	2004-12-21 23:35:25.000000000 +0100
@@ -22,6 +22,7 @@
 #include <net/tcp.h>
 #include <net/udp.h>
 #include <asm/unaligned.h>
+#include <net/pkt_sched.h>
 
 /*
  * We maintain a small pool of fully-sized skbs, to make sure the
@@ -184,11 +187,19 @@ void netpoll_send_skb(struct netpoll *np
 
 repeat:
 	if(!np || !np->dev || !netif_running(np->dev)) {
 		__kfree_skb(skb);
 		return;
 	}
 
-	spin_lock(&np->dev->xmit_lock);
+	while (!spin_trylock(&np->dev->xmit_lock)) {
+		if (np->dev->xmit_lock_owner == smp_processor_id()) {
+			struct Qdisc *q = dev->qdisc;
+
+			q->ops->enqueue(skb, q);
+			return;
+		}
+	}
+
 	np->dev->xmit_lock_owner = smp_processor_id();
 
 	/*

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-21 22:58                           ` Francois Romieu
@ 2004-12-22  9:34                             ` Patrick McHardy
  2004-12-22 10:54                               ` Patrick McHardy
  0 siblings, 1 reply; 20+ messages in thread
From: Patrick McHardy @ 2004-12-22  9:34 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Matt Mackall, Mark Broadbent, linux-kernel, netdev

Francois Romieu wrote:
> Marc, if the patch below happens to work, it should not drop messages
> like the previous one (it is an ugly short-term suggestion).
> 

> -	spin_lock(&np->dev->xmit_lock);
> +	while (!spin_trylock(&np->dev->xmit_lock)) {
> +		if (np->dev->xmit_lock_owner == smp_processor_id()) {
> +			struct Qdisc *q = dev->qdisc;
> +
> +			q->ops->enqueue(skb, q);

Shouldn't this be requeue ?

Regards
Patrick

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-22  9:34                             ` Patrick McHardy
@ 2004-12-22 10:54                               ` Patrick McHardy
  2004-12-22 12:39                                 ` Francois Romieu
  2004-12-22 14:37                                 ` Mark Broadbent
  0 siblings, 2 replies; 20+ messages in thread
From: Patrick McHardy @ 2004-12-22 10:54 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Matt Mackall, Mark Broadbent, linux-kernel, netdev

Patrick McHardy wrote:
> Francois Romieu wrote:
> 
>> Marc, if the patch below happens to work, it should not drop messages
>> like the previous one (it is an ugly short-term suggestion).
>>
> 
>> -    spin_lock(&np->dev->xmit_lock);
>> +    while (!spin_trylock(&np->dev->xmit_lock)) {
>> +        if (np->dev->xmit_lock_owner == smp_processor_id()) {
>> +            struct Qdisc *q = dev->qdisc;
>> +
>> +            q->ops->enqueue(skb, q);
> 
> 
> Shouldn't this be requeue ?

Since the code doesn't dequeue itself, enqueue seems fine to keep
at least the queued messages ordered. But you need to grab
dev->queue_lock, otherwise you risk corrupting qdisc internal data.
You should probably also deal with the noqueue-qdisc, which doesn't
have an enqueue function. So it should look something like this:

while (!spin_trylock(&np->dev->xmit_lock)) {
	if (np->dev->xmit_lock_owner == smp_processor_id()) {
		struct Qdisc *q;

		rcu_read_lock();
		q = rcu_dereference(dev->qdisc);
		if (q->enqueue) {
			spin_lock(&dev->queue_lock);
			q->ops->enqueue(skb, q);
			spin_unlock(&dev->queue_lock);
			netif_schedule(np->dev);
		} else
			kfree_skb(skb);
		rcu_read_unlock();
		return;
	}
}

Regards
Patrick

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-22 10:54                               ` Patrick McHardy
@ 2004-12-22 12:39                                 ` Francois Romieu
  2004-12-22 13:33                                   ` jamal
  2004-12-22 14:57                                   ` Patrick McHardy
  2004-12-22 14:37                                 ` Mark Broadbent
  1 sibling, 2 replies; 20+ messages in thread
From: Francois Romieu @ 2004-12-22 12:39 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Matt Mackall, Mark Broadbent, linux-kernel, netdev

Patrick McHardy <kaber@trash.net> :
[...]
> at least the queued messages ordered. But you need to grab
> dev->queue_lock, otherwise you risk corrupting qdisc internal data.
> You should probably also deal with the noqueue-qdisc, which doesn't
> have an enqueue function. So it should look something like this:

If I am not mistaken, a failure on spin_trylock + the test on
xmit_lock_owner imply that it is safe to directly handle the
queue. It means that qdisc_run() has been interrupted on the
current cpu and the other paths seem fine as well. Counter-example
is welcome (no joke).

Of course the patch is completely ugly and violates any layering
principle one could think of. It was not submitted for inclusion :o)

> while (!spin_trylock(&np->dev->xmit_lock)) {
> 	if (np->dev->xmit_lock_owner == smp_processor_id()) {
> 		struct Qdisc *q;
> 
> 		rcu_read_lock();
> 		q = rcu_dereference(dev->qdisc);
> 		if (q->enqueue) {
> 			spin_lock(&dev->queue_lock);

I'd expect it to deadlock if dev_queue_xmit -> qdisc_run is interrupted
on the current cpu and a printk is issued as dev->queue_lock will have
been taken elsewhere.

--
Ueimor

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-22 12:39                                 ` Francois Romieu
@ 2004-12-22 13:33                                   ` jamal
  2004-12-22 14:57                                   ` Patrick McHardy
  1 sibling, 0 replies; 20+ messages in thread
From: jamal @ 2004-12-22 13:33 UTC (permalink / raw)
  To: Francois Romieu
  Cc: Patrick McHardy, Matt Mackall, Mark Broadbent, linux-kernel,
	netdev

On Wed, 2004-12-22 at 07:39, Francois Romieu wrote:

> If I am not mistaken, a failure on spin_trylock + the test on
> xmit_lock_owner imply that it is safe to directly handle the
> queue. It means that qdisc_run() has been interrupted on the
> current cpu and the other paths seem fine as well. Counter-example
> is welcome (no joke).

Think more than 2 processors ;-> 

cheers,
jamal

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-22 10:54                               ` Patrick McHardy
  2004-12-22 12:39                                 ` Francois Romieu
@ 2004-12-22 14:37                                 ` Mark Broadbent
  1 sibling, 0 replies; 20+ messages in thread
From: Mark Broadbent @ 2004-12-22 14:37 UTC (permalink / raw)
  To: kaber; +Cc: romieu, mpm, linux-kernel, netdev


Patrick McHardy said:
> Patrick McHardy wrote:
>> Francois Romieu wrote:
>>
>>> Marc, if the patch below happens to work, it should not drop messages
>>> like the previous one (it is an ugly short-term suggestion).
>>>
>>
>>> -    spin_lock(&np->dev->xmit_lock);
>>> +    while (!spin_trylock(&np->dev->xmit_lock)) {
>>> +        if (np->dev->xmit_lock_owner == smp_processor_id()) {
>>> +            struct Qdisc *q = dev->qdisc;
>>> +
>>> +            q->ops->enqueue(skb, q);
>>
>>
>> Shouldn't this be requeue ?
>
> Since the code doesn't dequeue itself, enqueue seems fine to keep
> at least the queued messages ordered. But you need to grab
> dev->queue_lock, otherwise you risk corrupting qdisc internal data. You
> should probably also deal with the noqueue-qdisc, which doesn't have an
> enqueue function. So it should look something like this:
>
> while (!spin_trylock(&np->dev->xmit_lock)) {
> 	if (np->dev->xmit_lock_owner == smp_processor_id()) {
> 		struct Qdisc *q;
>
> 		rcu_read_lock();
> 		q = rcu_dereference(dev->qdisc);
> 		if (q->enqueue) {
> 			spin_lock(&dev->queue_lock);
> 			q->ops->enqueue(skb, q);
> 			spin_unlock(&dev->queue_lock);
> 			netif_schedule(np->dev);
> 		} else
> 			kfree_skb(skb);
> 		rcu_read_unlock();
> 		return;
> 	}
> }

I've tried both patches (modified slightly to get them to compile) but
they both produce hard NMI detected lockups (as before).
Thanks
Mark

Patches after modification to allow compilation:

Francois' patch (against 2.6.10-rc3-bk10):

diff -X dontdiff -urN linux-2.6.9-rc3-bk10.orig/net/core/netpoll.c
linux-2.6.9-rc3-bk10/net/core/netpoll.c--- linux-2.6.9-rc3-bk10.orig/net/core/netpoll.c	2004-12-22
12:09:32.000000000 +0000+++ linux-2.6.9-rc3-bk10/net/core/netpoll.c	2004-12-22 14:13:54.000000000
+0000@@ -22,6 +22,7 @@
 #include <net/tcp.h>
 #include <net/udp.h>
 #include <asm/unaligned.h>
+#include <net/pkt_sched.h>

 /*
  * We maintain a small pool of fully-sized skbs, to make sure the
@@ -188,7 +189,15 @@
 		return;
 	}

-	spin_lock(&np->dev->xmit_lock);
+	while (!spin_trylock(&np->dev->xmit_lock)) {
+		if (np->dev->xmit_lock_owner == smp_processor_id()) {
+			struct Qdisc *q = np->dev->qdisc;
+
+			q->ops->enqueue(skb, q);
+			return;
+		}
+	}
+
 	np->dev->xmit_lock_owner = smp_processor_id();

 	/*

Patrick's patch (against 2.6.10-rc3-bk10):

diff -X dontdiff -urN linux-2.6.9-rc3-bk10.orig/net/core/netpoll.c
linux-2.6.9-rc3-bk10/net/core/netpoll.c--- linux-2.6.9-rc3-bk10.orig/net/core/netpoll.c	2004-12-22
12:09:32.000000000 +0000+++ linux-2.6.9-rc3-bk10/net/core/netpoll.c	2004-12-22 11:08:06.000000000
+0000@@ -22,6 +22,7 @@
 #include <net/tcp.h>
 #include <net/udp.h>
 #include <asm/unaligned.h>
+#include <net/pkt_sched.h>

 /*
  * We maintain a small pool of fully-sized skbs, to make sure the
@@ -188,7 +189,24 @@
 		return;
 	}

-	spin_lock(&np->dev->xmit_lock);
+	while (!spin_trylock(&np->dev->xmit_lock)) {
+		if (np->dev->xmit_lock_owner == smp_processor_id()) {
+			struct Qdisc *q;
+
+			rcu_read_lock();
+			q = rcu_dereference(np->dev->qdisc);
+			if (q->enqueue) {
+				spin_lock(&np->dev->queue_lock);
+				q->ops->enqueue(skb, q);
+				spin_unlock(&np->dev->queue_lock);
+				netif_schedule(np->dev);
+			} else
+				__kfree_skb(skb);
+			rcu_read_unlock();
+			return;
+		}
+	}
+
 	np->dev->xmit_lock_owner = smp_processor_id();

 	/*


-- 
Mark Broadbent <markb@wetlettuce.com>
Web: http://www.wetlettuce.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-22 12:39                                 ` Francois Romieu
  2004-12-22 13:33                                   ` jamal
@ 2004-12-22 14:57                                   ` Patrick McHardy
  2004-12-22 17:18                                     ` Matt Mackall
  1 sibling, 1 reply; 20+ messages in thread
From: Patrick McHardy @ 2004-12-22 14:57 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Matt Mackall, Mark Broadbent, linux-kernel, netdev

Francois Romieu wrote:
> Patrick McHardy <kaber@trash.net> :
> [...]
> 
>>at least the queued messages ordered. But you need to grab
>>dev->queue_lock, otherwise you risk corrupting qdisc internal data.
>>You should probably also deal with the noqueue-qdisc, which doesn't
>>have an enqueue function. So it should look something like this:
> 
> 
> If I am not mistaken, a failure on spin_trylock + the test on
> xmit_lock_owner imply that it is safe to directly handle the
> queue. It means that qdisc_run() has been interrupted on the
> current cpu and the other paths seem fine as well. Counter-example
> is welcome (no joke).

enqueue is only protected by dev->queue_lock, and dev->queue_lock
is dropped as soon as dev->xmit_lock is grabbed, so any other CPU
might call enqueue at the same time.

Example:

CPU1					CPU2

dev_queue_xmit				dev_queue_xmit
  lock(dev->queue_lock)			 lock(dev->queue_lock)
q->enqueue
qdisc_run
qdisc_restart
  trylock(dev->xmit_lock), ok
  unlock(dev->queue_lock)
...
printk("something")
...
netpoll_send_skb
  trylock(dev->xmit_lock), fails
q->enqueue				q->enqueue

> Of course the patch is completely ugly and violates any layering
> principle one could think of. It was not submitted for inclusion :o)

Sure, but I think we should have a short-term workaround until
a better solution has been invented. Maybe dropping the packets
would be best for now, it only affects printks issued in paths
starting at qdisc_restart (-> hard_start_xmit -> ...). Queueing
the packets might also cause reordering since not all packets
are queued.

>>while (!spin_trylock(&np->dev->xmit_lock)) {
>>	if (np->dev->xmit_lock_owner == smp_processor_id()) {
>>		struct Qdisc *q;
>>
>>		rcu_read_lock();
>>		q = rcu_dereference(dev->qdisc);
>>		if (q->enqueue) {
>>			spin_lock(&dev->queue_lock);
> 
> 
> I'd expect it to deadlock if dev_queue_xmit -> qdisc_run is interrupted
> on the current cpu and a printk is issued as dev->queue_lock will have
> been taken elsewhere.

Hmm this is complicated, let me think some more about it.

Regards
Patrick

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-22 14:57                                   ` Patrick McHardy
@ 2004-12-22 17:18                                     ` Matt Mackall
  2004-12-25 11:26                                       ` Wish you all a Merry Christmas Pranav
  2004-12-28 13:45                                       ` Lockup with 2.6.9-ac15 related to netconsole jamal
  0 siblings, 2 replies; 20+ messages in thread
From: Matt Mackall @ 2004-12-22 17:18 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Francois Romieu, Mark Broadbent, linux-kernel, netdev

On Wed, Dec 22, 2004 at 03:57:57PM +0100, Patrick McHardy wrote:
> >Of course the patch is completely ugly and violates any layering
> >principle one could think of. It was not submitted for inclusion :o)
> 
> Sure, but I think we should have a short-term workaround until
> a better solution has been invented. Maybe dropping the packets
> would be best for now, it only affects printks issued in paths
> starting at qdisc_restart (-> hard_start_xmit -> ...). Queueing
> the packets might also cause reordering since not all packets
> are queued.

When I mentioned queueing, I was thinking of a netpoll-private queue
that would be hooked to a softirq or some such so that it would be
pushed out as soon as possible. Dropping may be the better approach as
queueing throws away netpoll's immediacy and ordering properties. And
getting netpoll _more_ tangled in the net stack mechanics is
definitely a step in the wrong direction.

More generally, I'm tempted to add some warn_on style functionality so
that printks in such troublesome paths can be lifted out.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Wish you all a Merry Christmas
  2004-12-22 17:18                                     ` Matt Mackall
@ 2004-12-25 11:26                                       ` Pranav
  2004-12-25 11:30                                         ` Jan Engelhardt
  2004-12-28 13:45                                       ` Lockup with 2.6.9-ac15 related to netconsole jamal
  1 sibling, 1 reply; 20+ messages in thread
From: Pranav @ 2004-12-25 11:26 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

Hi everyone,
Wishing you all A Prosperous Merry ChristMas.
Hope Coming years brings Peace,Happiness,blessings of CHRISTmas to you all
,your family and this World.

With Regards,
Pranav.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Wish you all a Merry Christmas
  2004-12-25 11:26                                       ` Wish you all a Merry Christmas Pranav
@ 2004-12-25 11:30                                         ` Jan Engelhardt
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Engelhardt @ 2004-12-25 11:30 UTC (permalink / raw)
  To: Pranav; +Cc: netdev, linux-kernel

>Wishing you all A Prosperous Merry ChristMas.
>Hope Coming years brings Peace,Happiness,blessings of CHRISTmas to you all
>,your family and this World.
>
>With Regards,
>Pranav.

I don't see how this is related to linux-kernel.




Jan Engelhardt
-- 
ENOSPC

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Lockup with 2.6.9-ac15 related to netconsole
  2004-12-22 17:18                                     ` Matt Mackall
  2004-12-25 11:26                                       ` Wish you all a Merry Christmas Pranav
@ 2004-12-28 13:45                                       ` jamal
  1 sibling, 0 replies; 20+ messages in thread
From: jamal @ 2004-12-28 13:45 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Patrick McHardy, Francois Romieu, Mark Broadbent, linux-kernel,
	netdev

On Wed, 2004-12-22 at 12:18, Matt Mackall wrote:
> On Wed, Dec 22, 2004 at 03:57:57PM +0100, Patrick McHardy wrote:
> > >Of course the patch is completely ugly and violates any layering
> > >principle one could think of. It was not submitted for inclusion :o)
> > 
> > Sure, but I think we should have a short-term workaround until
> > a better solution has been invented. Maybe dropping the packets
> > would be best for now, it only affects printks issued in paths
> > starting at qdisc_restart (-> hard_start_xmit -> ...). Queueing
> > the packets might also cause reordering since not all packets
> > are queued.
> 
> When I mentioned queueing, I was thinking of a netpoll-private queue
> that would be hooked to a softirq or some such so that it would be
> pushed out as soon as possible. Dropping may be the better approach 

I think so - just junk those packets. 

cheers,
jamal

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2004-12-28 13:45 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <59719.192.102.214.6.1103214002.squirrel@webmail.wetlettuce.com>
     [not found] ` <20041216211024.GK2767@waste.org>
     [not found]   ` <34721.192.102.214.6.1103274614.squirrel@webmail.wetlettuce.com>
     [not found]     ` <20041217215752.GP2767@waste.org>
     [not found]       ` <20041217233524.GA11202@electric-eye.fr.zoreil.com>
2004-12-20  9:42         ` Lockup with 2.6.9-ac15 related to netconsole Mark Broadbent
2004-12-20 21:14           ` Matt Mackall
2004-12-21  0:22             ` Francois Romieu
2004-12-21  0:55               ` Matt Mackall
2004-12-21 10:23                 ` Mark Broadbent
2004-12-21 12:37                   ` Francois Romieu
2004-12-21 13:29                     ` Mark Broadbent
2004-12-21 20:48                       ` Francois Romieu
2004-12-21 21:27                         ` Matt Mackall
2004-12-21 22:58                           ` Francois Romieu
2004-12-22  9:34                             ` Patrick McHardy
2004-12-22 10:54                               ` Patrick McHardy
2004-12-22 12:39                                 ` Francois Romieu
2004-12-22 13:33                                   ` jamal
2004-12-22 14:57                                   ` Patrick McHardy
2004-12-22 17:18                                     ` Matt Mackall
2004-12-25 11:26                                       ` Wish you all a Merry Christmas Pranav
2004-12-25 11:30                                         ` Jan Engelhardt
2004-12-28 13:45                                       ` Lockup with 2.6.9-ac15 related to netconsole jamal
2004-12-22 14:37                                 ` Mark Broadbent

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).