Re: [PATCH] Prevent netpoll hanging when link is down

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH] Prevent netpoll hanging when link is down
       [not found] <20041006232544.53615761@jack.colino.net>
@ 2004-10-06 21:43 ` Matt Mackall
  2004-10-07  5:53   ` Colin Leroy
  0 siblings, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2004-10-06 21:43 UTC (permalink / raw)
  To: Colin Leroy; +Cc: Andrew Morton, netdev

On Wed, Oct 06, 2004 at 11:25:44PM +0200, Colin Leroy wrote:
> Hi,
> 
> this patch fixes a (quite big) problem with netpoll: when link is down,
> it hangs.
> 
> This patch fixes it. Tested with no carrier (no more hang) and with
> carrier (same behaviour as before).

[cc:ed to netdev]

Well this doesn't look unreasonable, but I haven't run into it with
the NICs I've tested. Nor have I seen this reported before. Which NICs
is this with?
 
> -	if(!np || !np->dev || !netif_running(np->dev)) {
> +	if(!np || !np->dev || !netif_running(np->dev) || 
> +	   !netif_carrier_ok(np->dev)) {

I wonder if netif_running is redundant if netif_carrier_ok.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-06 21:43 ` [PATCH] Prevent netpoll hanging when link is down Matt Mackall
@ 2004-10-07  5:53   ` Colin Leroy
  2004-10-07  6:49     ` David S. Miller
  0 siblings, 1 reply; 34+ messages in thread
From: Colin Leroy @ 2004-10-07  5:53 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Andrew Morton, netdev

On 06 Oct 2004 at 16h10, Matt Mackall wrote:

Hi, 

> On Wed, Oct 06, 2004 at 11:25:44PM +0200, Colin Leroy wrote:
> > Hi,
> > 
> > this patch fixes a (quite big) problem with netpoll: when link is
> > down, it hangs.
> > 
> > This patch fixes it. Tested with no carrier (no more hang) and with
> > carrier (same behaviour as before).
> 
> [cc:ed to netdev]
> 
> Well this doesn't look unreasonable, but I haven't run into it with
> the NICs I've tested. Nor have I seen this reported before. Which NICs
> is this with?

Sungem. I didn't find anything strange in sungem, but it may be...
  
> > -	if(!np || !np->dev || !netif_running(np->dev)) {
> > +	if(!np || !np->dev || !netif_running(np->dev) || 
> > +	   !netif_carrier_ok(np->dev)) {
> 
> I wonder if netif_running is redundant if netif_carrier_ok.
 
Probably... I wanted to do the less modifications possible.

-- 
Colin
  http://colino.net/323/ - Presenting the Mazda 323 Rouge

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07  5:53   ` Colin Leroy
@ 2004-10-07  6:49     ` David S. Miller
  2004-10-07  8:33       ` Colin Leroy
  2004-10-07 14:05       ` Colin Leroy
  0 siblings, 2 replies; 34+ messages in thread
From: David S. Miller @ 2004-10-07  6:49 UTC (permalink / raw)
  To: Colin Leroy; +Cc: mpm, akpm, netdev

On Thu, 7 Oct 2004 07:53:19 +0200
Colin Leroy <colin@colino.net> wrote:

> On 06 Oct 2004 at 16h10, Matt Mackall wrote:
> 
> > On Wed, Oct 06, 2004 at 11:25:44PM +0200, Colin Leroy wrote:
> > Well this doesn't look unreasonable, but I haven't run into it with
> > the NICs I've tested. Nor have I seen this reported before. Which NICs
> > is this with?
> 
> Sungem. I didn't find anything strange in sungem, but it may be...

I think this is very strange that only sungem behaves this
way.

I don't think netpoll is doing anything different than what
would happen, f.e., when bringing an interface up using dhcp.
That should cause the same kind of hang.

The only thing that should make the thing spin is if gp->hw_running
is zero.  This is set non-zero by gem_open() after resetting the
chip to bring it up.

If gem_open() fails, and on entry gp->hw_running was zero, the chip
will be powered back down and gp->hw_running set back to zero.
gem_suspend()/gem_resume() also modify the gp->hw_running state, as
appropriate.

I could see it that if gp->hw_running is non-zero, we could run into
troubles.  np->dev->poll_controller() will run, and it won't do anything
since the gem_interrupt() call is a nop when gp->hw_running is zero.
Then we blindly call ingo np->dev->poll()

Folks debugging this should verify that gp->hw_running is non-zero when
the problematic case runs.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07  6:49     ` David S. Miller
@ 2004-10-07  8:33       ` Colin Leroy
  2004-10-07  8:45         ` Colin Leroy
  2004-10-07 14:05       ` Colin Leroy
  1 sibling, 1 reply; 34+ messages in thread
From: Colin Leroy @ 2004-10-07  8:33 UTC (permalink / raw)
  To: David S. Miller; +Cc: mpm, akpm, netdev

On 06 Oct 2004 at 23h10, David S. Miller wrote:

Hi, 

Thanks for your help.

> I could see it that if gp->hw_running is non-zero, we could run into

You meant zero here ? (or didn't I understand something)

> troubles.  np->dev->poll_controller() will run, and it won't do
> anything since the gem_interrupt() call is a nop when gp->hw_running
> is zero. Then we blindly call ingo np->dev->poll()

Btw, shouldn't gem_poll() check for gp->hw_running, too?

> Folks debugging this should verify that gp->hw_running is non-zero
> when the problematic case runs.

I added a printk at the end of gem_open(), it's 1 even when there's no
link:

...
netconsole: device eth0 not up yet, forcing it
eth0: gp->hw_running after gem_open: 1
netconsole: timeout waiting for carrier
netconsole: network logging started
...

To be more precise about the netpoll-related hanging: it hangs
after a few messages. From what I remember (90% sure, i'm not in front
of the laptop right now so can't make it crash, it'll be harder to debug
after ^^), it hangs after printing hda init stuff, which is 32 lines
after "netconsole: network logging started" (MAX_SKBS == 32 in
netpoll.c)...

How stupid, I just added a printk at the beginning of netpoll_send_skb
and rebooted, and the laptop doesn't get back online. Bitten by
recursion once again... I should have thought harder first.

-- 
Colin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07  8:33       ` Colin Leroy
@ 2004-10-07  8:45         ` Colin Leroy
  0 siblings, 0 replies; 34+ messages in thread
From: Colin Leroy @ 2004-10-07  8:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: mpm, akpm, netdev

On 07 Oct 2004 at 10h10, Colin Leroy wrote:

Hi, 

> How stupid, I just added a printk at the beginning of netpoll_send_skb
> and rebooted, and the laptop doesn't get back online. Bitten by
> recursion once again... I should have thought harder first.

My girlfriend told me what was outputted by my infinite loop:

netpoll: netif_running 1, netif_carrier_ok 0, netif_queue_stopped 0.

Shouldn't sungem stop its queue when there's no carrier ?

(I really hope to be able to spare a few hours in front of the laptop to
sort that out)
-- 
Colin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07  6:49     ` David S. Miller
  2004-10-07  8:33       ` Colin Leroy
@ 2004-10-07 14:05       ` Colin Leroy
  2004-10-07 18:28         ` David S. Miller
  1 sibling, 1 reply; 34+ messages in thread
From: Colin Leroy @ 2004-10-07 14:05 UTC (permalink / raw)
  To: David S. Miller; +Cc: mpm, akpm, netdev

On 06 Oct 2004 at 23h10, David S. Miller wrote:

Hi again, 

> Folks debugging this should verify that gp->hw_running is non-zero
> when the problematic case runs.

I looked a bit more at the code and found out a possible problem.
However it doesn't fix the hang, so either it's not what I found, or
there's something else.

First, my newbie question: is it possible to deadlock a spinlock on a
Uniprocessor kernel ? For example, there's something I find suspect in
netpoll/sungem interaction:

it starts in netpoll_send_skb(), where xmit_lock is acquired; then,
gem_start_xmit() is called. In gem_start_xmit(), if the TX ring is full,
we return 1, but previously log the error via printk(). 
In this condition, isn't netpoll_send_skb() called again (via the whole
console stuff), where it retries to get the lock on xmit_lock? Could
that cause a deadlock?

Also, at the end of netpoll_send_skb() is a check for the return status
of hard_start_xmit. If I understand correctly, hard_start_xmit functions
return 0 for success, -1 for "busy please retry" and 1 for "hard error".
Shouldn't netpoll free the skb and forget about it in case status is 1
(instead of goto'ing repeat) ?

Based on this, I've a new patch that I'm attaching. As said it doesn't
fix the problem, but if my I understood is valid, it may still be
useful ? 

Any insight will be greatly appreciated :)

Patch following:
--- a/net/core/netpoll.c	2004-10-05 21:09:49.000000000 +0200
+++ b/net/core/netpoll.c	2004-10-07 14:11:00.000000000 +0200
@@ -188,7 +188,10 @@
 		return;
 	}

-	spin_lock(&np->dev->xmit_lock);
+	if (!spin_trylock(&np->dev->xmit_lock)) {
+		/* looks like np->dev->hard_start_xmit did a printk */
+		goto bail_free;
+	}
 	np->dev->xmit_lock_owner = smp_processor_id();

 	/*
@@ -204,13 +207,18 @@
 	}

 	status = np->dev->hard_start_xmit(skb, np->dev);
+
 	np->dev->xmit_lock_owner = -1;
 	spin_unlock(&np->dev->xmit_lock);

-	/* transmit busy */
-	if(status) {
+	if(status == -1) {
+		/* transmit busy */
 		netpoll_poll(np);
 		goto repeat;
+	} else if (status == 1) {
+		/* hard error, drop this skb */
+bail_free:
+		__kfree_skb(skb);
 	}
 }

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 14:05       ` Colin Leroy
@ 2004-10-07 18:28         ` David S. Miller
  2004-10-07 18:41           ` Matt Mackall
                             ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: David S. Miller @ 2004-10-07 18:28 UTC (permalink / raw)
  To: Colin Leroy; +Cc: mpm, akpm, netdev

On Thu, 7 Oct 2004 16:05:32 +0200
Colin Leroy <colin@colino.net> wrote:

> First, my newbie question: is it possible to deadlock a spinlock on a
> Uniprocessor kernel ? For example, there's something I find suspect in
> netpoll/sungem interaction:
> 

Oh yes, it appears that netpoll doesn't support NETIF_F_LLTX locking,
crap :(

When a device has NETIF_F_LLTX set, it means that the driver's
dev->hard_start_xmit() routine is what takes the xmit_lock, not
the caller one level up.

Andi Kleen didn't fix up netpoll when he did his LLTX changes, oops.

So, netpoll needs to have the NETIF_F_LLTX stuff added to it.
Basically:

1) If NETIF_F_LLTX is clear, same as before
2) If NETIF_F_LLTX is set:
	a) Do not take xmit_lock
	b) Check ->hard_start_xmit() return value,
	   if it is NETDEV_TX_LOCKED, then
	   spin_trylock(&dev->xmit_lock) failed
           in ->hard_start_xmit()

The best example is in net/sched/sch_generic.c:qdisc_restart()

		unsigned nolock = (dev->features & NETIF_F_LLTX);
		/*
		 * When the driver has LLTX set it does its own locking
		 * in start_xmit. No need to add additional overhead by
		 * locking again. These checks are worth it because
		 * even uncongested locks can be quite expensive.
		 * The driver can do trylock like here too, in case
		 * of lock congestion it should return -1 and the packet
		 * will be requeued.
		 */
		if (!nolock) {
			if (!spin_trylock(&dev->xmit_lock)) {
			collision:
				/* So, someone grabbed the driver. */
				
				/* It may be transient configuration error,
				   when hard_start_xmit() recurses. We detect
				   it by checking xmit owner and drop the
				   packet when deadloop is detected.
				*/
				if (dev->xmit_lock_owner == smp_processor_id()) {
					kfree_skb(skb);
					if (net_ratelimit())
						printk(KERN_DEBUG "Dead loop on netdevice %s, fix it urgently!\n", dev->name);
					return -1;
				}
				__get_cpu_var(netdev_rx_stat).cpu_collision++;
				goto requeue;
			}
			/* Remember that the driver is grabbed by us. */
			dev->xmit_lock_owner = smp_processor_id();
		}
		
		{
			/* And release queue */
			spin_unlock(&dev->queue_lock);

			if (!netif_queue_stopped(dev)) {
				int ret;
				if (netdev_nit)
					dev_queue_xmit_nit(skb, dev);

				ret = dev->hard_start_xmit(skb, dev);
				if (ret == NETDEV_TX_OK) { 
					if (!nolock) {
						dev->xmit_lock_owner = -1;
						spin_unlock(&dev->xmit_lock);
					}
					spin_lock(&dev->queue_lock);
					return -1;
				}
				if (ret == NETDEV_TX_LOCKED && nolock) {
					spin_lock(&dev->queue_lock);
					goto collision; 
				}
			}

			/* NETDEV_TX_BUSY - we need to requeue */
			/* Release the driver */
			if (!nolock) { 
				dev->xmit_lock_owner = -1;
				spin_unlock(&dev->xmit_lock);
			} 
			spin_lock(&dev->queue_lock);
			q = dev->qdisc;
		}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 18:28         ` David S. Miller
@ 2004-10-07 18:41           ` Matt Mackall
  2004-10-07 20:00             ` Colin Leroy
  2004-10-07 18:43           ` Andi Kleen
  2004-10-07 20:44           ` Colin Leroy
  2 siblings, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2004-10-07 18:41 UTC (permalink / raw)
  To: David S. Miller; +Cc: Colin Leroy, akpm, netdev

On Thu, Oct 07, 2004 at 11:28:46AM -0700, David S. Miller wrote:
> On Thu, 7 Oct 2004 16:05:32 +0200
> Colin Leroy <colin@colino.net> wrote:
> 
> > First, my newbie question: is it possible to deadlock a spinlock on a
> > Uniprocessor kernel ? For example, there's something I find suspect in
> > netpoll/sungem interaction:
> > 
> 
> Oh yes, it appears that netpoll doesn't support NETIF_F_LLTX locking,
> crap :(
> 
> When a device has NETIF_F_LLTX set, it means that the driver's
> dev->hard_start_xmit() routine is what takes the xmit_lock, not
> the caller one level up.
> 
> Andi Kleen didn't fix up netpoll when he did his LLTX changes, oops.
> 
> So, netpoll needs to have the NETIF_F_LLTX stuff added to it.
> Basically:
> 
> 1) If NETIF_F_LLTX is clear, same as before
> 2) If NETIF_F_LLTX is set:
> 	a) Do not take xmit_lock
> 	b) Check ->hard_start_xmit() return value,
> 	   if it is NETDEV_TX_LOCKED, then
> 	   spin_trylock(&dev->xmit_lock) failed
>            in ->hard_start_xmit()

Colin, feeling adventurous enough to take a stab at this? It looks
pretty straightforward but I'm going to be even more useless than
usual for the next two weeks.

> 
> The best example is in net/sched/sch_generic.c:qdisc_restart()
> 
> 		unsigned nolock = (dev->features & NETIF_F_LLTX);
> 		/*
> 		 * When the driver has LLTX set it does its own locking
> 		 * in start_xmit. No need to add additional overhead by
> 		 * locking again. These checks are worth it because
> 		 * even uncongested locks can be quite expensive.
> 		 * The driver can do trylock like here too, in case
> 		 * of lock congestion it should return -1 and the packet
> 		 * will be requeued.
> 		 */
> 		if (!nolock) {
> 			if (!spin_trylock(&dev->xmit_lock)) {
> 			collision:
> 				/* So, someone grabbed the driver. */
> 				
> 				/* It may be transient configuration error,
> 				   when hard_start_xmit() recurses. We detect
> 				   it by checking xmit owner and drop the
> 				   packet when deadloop is detected.
> 				*/
> 				if (dev->xmit_lock_owner == smp_processor_id()) {
> 					kfree_skb(skb);
> 					if (net_ratelimit())
> 						printk(KERN_DEBUG "Dead loop on netdevice %s, fix it urgently!\n", dev->name);
> 					return -1;
> 				}
> 				__get_cpu_var(netdev_rx_stat).cpu_collision++;
> 				goto requeue;
> 			}
> 			/* Remember that the driver is grabbed by us. */
> 			dev->xmit_lock_owner = smp_processor_id();
> 		}
> 		
> 		{
> 			/* And release queue */
> 			spin_unlock(&dev->queue_lock);
> 
> 			if (!netif_queue_stopped(dev)) {
> 				int ret;
> 				if (netdev_nit)
> 					dev_queue_xmit_nit(skb, dev);
> 
> 				ret = dev->hard_start_xmit(skb, dev);
> 				if (ret == NETDEV_TX_OK) { 
> 					if (!nolock) {
> 						dev->xmit_lock_owner = -1;
> 						spin_unlock(&dev->xmit_lock);
> 					}
> 					spin_lock(&dev->queue_lock);
> 					return -1;
> 				}
> 				if (ret == NETDEV_TX_LOCKED && nolock) {
> 					spin_lock(&dev->queue_lock);
> 					goto collision; 
> 				}
> 			}
> 
> 			/* NETDEV_TX_BUSY - we need to requeue */
> 			/* Release the driver */
> 			if (!nolock) { 
> 				dev->xmit_lock_owner = -1;
> 				spin_unlock(&dev->xmit_lock);
> 			} 
> 			spin_lock(&dev->queue_lock);
> 			q = dev->qdisc;
> 		}

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 18:41           ` Matt Mackall
@ 2004-10-07 20:00             ` Colin Leroy
  0 siblings, 0 replies; 34+ messages in thread
From: Colin Leroy @ 2004-10-07 20:00 UTC (permalink / raw)
  To: Matt Mackall; +Cc: David S. Miller, akpm, netdev

On 07 Oct 2004 at 13h10, Matt Mackall wrote:

Hi, 

> Colin, feeling adventurous enough to take a stab at this? It looks
> pretty straightforward but I'm going to be even more useless than
> usual for the next two weeks.

I'll try :)

Thanks for all the input !

-- 
Colin
  Recursion: see Recursion

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 18:28         ` David S. Miller
  2004-10-07 18:41           ` Matt Mackall
@ 2004-10-07 18:43           ` Andi Kleen
  2004-10-07 20:44           ` Colin Leroy
  2 siblings, 0 replies; 34+ messages in thread
From: Andi Kleen @ 2004-10-07 18:43 UTC (permalink / raw)
  To: David S. Miller; +Cc: Colin Leroy, mpm, akpm, netdev

On Thu, Oct 07, 2004 at 11:28:46AM -0700, David S. Miller wrote:
> On Thu, 7 Oct 2004 16:05:32 +0200
> Colin Leroy <colin@colino.net> wrote:
> 
> > First, my newbie question: is it possible to deadlock a spinlock on a
> > Uniprocessor kernel ? For example, there's something I find suspect in
> > netpoll/sungem interaction:
> > 
> 
> Oh yes, it appears that netpoll doesn't support NETIF_F_LLTX locking,
> crap :(
> 
> When a device has NETIF_F_LLTX set, it means that the driver's
> dev->hard_start_xmit() routine is what takes the xmit_lock, not
> the caller one level up.

It takes an own lock, not xmit_lock.

It's fine to ignore it completely.  In the worst case the poll
will not be retried, but netpoll has no way to do that anyways I think.

-Andi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 18:28         ` David S. Miller
  2004-10-07 18:41           ` Matt Mackall
  2004-10-07 18:43           ` Andi Kleen
@ 2004-10-07 20:44           ` Colin Leroy
  2004-10-07 21:45             ` Andi Kleen
  2004-10-07 22:08             ` David S. Miller
  2 siblings, 2 replies; 34+ messages in thread
From: Colin Leroy @ 2004-10-07 20:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: mpm, akpm, netdev

On 07 Oct 2004 at 11h10, David S. Miller wrote:

Hi again, 

> So, netpoll needs to have the NETIF_F_LLTX stuff added to it.

This patch should do that. It works OK for me, but I'd like it checked
before sent upstream...

However, it doesn't fix the hang. it looks like this hang is really
coming from sungem.

--- a/net/core/netpoll.c	2004-10-05 21:09:49.000000000 +0200
+++ b/net/core/netpoll.c	2004-10-07 22:48:58.000000000 +0200
@@ -181,6 +181,7 @@
 void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
 {
 	int status;
+	unsigned must_lock;
 
 repeat:
 	if(!np || !np->dev || !netif_running(np->dev)) {
@@ -188,29 +189,51 @@
 		return;
 	}
 
-	spin_lock(&np->dev->xmit_lock);
-	np->dev->xmit_lock_owner = smp_processor_id();
+	must_lock = !(np->dev->features & NETIF_F_LLTX);
+	
+	if (must_lock) {
+		if (!spin_trylock(&np->dev->xmit_lock)) {
+			if (np->dev->xmit_lock_owner ==
smp_processor_id()) {
+				__kfree_skb(skb);
+				return;
+			}
 
+			goto repeat;
+		}
+		np->dev->xmit_lock_owner = smp_processor_id();
+	}
+	
 	/*
 	 * network drivers do not expect to be called if the queue is
 	 * stopped.
 	 */
 	if (netif_queue_stopped(np->dev)) {
 		np->dev->xmit_lock_owner = -1;
-		spin_unlock(&np->dev->xmit_lock);
+		
+		if (must_lock)
+			spin_unlock(&np->dev->xmit_lock);
 
 		netpoll_poll(np);
 		goto repeat;
 	}
 
 	status = np->dev->hard_start_xmit(skb, np->dev);
-	np->dev->xmit_lock_owner = -1;
-	spin_unlock(&np->dev->xmit_lock);
+
+	if (must_lock) {
+		np->dev->xmit_lock_owner = -1;
+		spin_unlock(&np->dev->xmit_lock);
+	}
 
 	/* transmit busy */
-	if(status) {
+	if (status == NETDEV_TX_LOCKED) {
 		netpoll_poll(np);
 		goto repeat;
+	} 
+	
+	/* hard error */
+	if (status == NETDEV_TX_BUSY) {
+		__kfree_skb(skb);
+		return;
 	}
 }
 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 20:44           ` Colin Leroy
@ 2004-10-07 21:45             ` Andi Kleen
  2004-10-07 21:50               ` Matt Mackall
  2004-10-08  7:06               ` Colin Leroy
  2004-10-07 22:08             ` David S. Miller
  1 sibling, 2 replies; 34+ messages in thread
From: Andi Kleen @ 2004-10-07 21:45 UTC (permalink / raw)
  To: Colin Leroy; +Cc: David S. Miller, mpm, akpm, netdev

On Thu, Oct 07, 2004 at 10:44:22PM +0200, Colin Leroy wrote:
> On 07 Oct 2004 at 11h10, David S. Miller wrote:
> 
> Hi again, 
> 
> > So, netpoll needs to have the NETIF_F_LLTX stuff added to it.
> 
> This patch should do that. It works OK for me, but I'd like it checked
> before sent upstream...
> 
> However, it doesn't fix the hang. it looks like this hang is really
> coming from sungem.

IMHO it's not needed. Taking xmit_lock is harmless even when
the NETIF_F_LLTX flag is set. 

(or at least it was with my original patchkit. In theory it's 
possible someone changed their driver to take xmit_lock in hard_start_xmit,
but if they did that I would just consider it a driver bug) 

The only drawback is that there won't be a reply when the driver try
lock fails, but netpoll doesn't have a queue for that anyways. You could
probably poll then, but I'm not sure it's a good idea.

-Andi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 21:45             ` Andi Kleen
@ 2004-10-07 21:50               ` Matt Mackall
  2004-10-07 22:07                 ` David S. Miller
  2004-10-08  7:06               ` Colin Leroy
  1 sibling, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2004-10-07 21:50 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Colin Leroy, David S. Miller, akpm, netdev

On Thu, Oct 07, 2004 at 11:45:05PM +0200, Andi Kleen wrote:
> On Thu, Oct 07, 2004 at 10:44:22PM +0200, Colin Leroy wrote:
> > On 07 Oct 2004 at 11h10, David S. Miller wrote:
> > 
> > Hi again, 
> > 
> > > So, netpoll needs to have the NETIF_F_LLTX stuff added to it.
> > 
> > This patch should do that. It works OK for me, but I'd like it checked
> > before sent upstream...
> > 
> > However, it doesn't fix the hang. it looks like this hang is really
> > coming from sungem.
> 
> IMHO it's not needed. Taking xmit_lock is harmless even when
> the NETIF_F_LLTX flag is set. 
> 
> (or at least it was with my original patchkit. In theory it's 
> possible someone changed their driver to take xmit_lock in hard_start_xmit,
> but if they did that I would just consider it a driver bug) 

Ok, this part makes sense.

> The only drawback is that there won't be a reply when the driver try
> lock fails, but netpoll doesn't have a queue for that anyways. You could
> probably poll then, but I'm not sure it's a good idea.

But your meaning here is not entirely clear.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 21:50               ` Matt Mackall
@ 2004-10-07 22:07                 ` David S. Miller
  2004-10-07 23:43                   ` Matt Mackall
  0 siblings, 1 reply; 34+ messages in thread
From: David S. Miller @ 2004-10-07 22:07 UTC (permalink / raw)
  To: Matt Mackall; +Cc: ak, colin, akpm, netdev

On Thu, 7 Oct 2004 16:50:26 -0500
Matt Mackall <mpm@selenic.com> wrote:

> > The only drawback is that there won't be a reply when the driver try
> > lock fails, but netpoll doesn't have a queue for that anyways. You could
> > probably poll then, but I'm not sure it's a good idea.
> 
> But your meaning here is not entirely clear.

If another thread on another cpu is in the dev->hard_start_xmit() routine,
then it will have it's tx device lock held, and netpoll will simply get an
immediate return from ->hard_start_xmit() with error NETDEV_TX_LOCKED.

The packet will thus not be sent, and because netpoll does not have a
backlog queue for tx packets of any kind the packet lost forever.

NETDEV_TX_LOCKED is a transient condition.  It works for the rest of the
kernel because whoever holds the tx lock on the device, will recheck the
device packet transmit queue when it drops that lock and returns from
->hard_start_xmit().

Andi is merely noting how netpoll's design does not have such a model,
which is why the NETIF_F_LLTX semantics don't mesh very well.

It is unclear if it ise wise that netpoll_send_skb() currently spins
on ->hard_start_xmit() returning NETDEV_TX_LOCKED.  That could
result in some kind of deadlocks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 22:07                 ` David S. Miller
@ 2004-10-07 23:43                   ` Matt Mackall
  2004-10-07 23:50                     ` Andi Kleen
  0 siblings, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2004-10-07 23:43 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, colin, akpm, netdev

On Thu, Oct 07, 2004 at 03:07:56PM -0700, David S. Miller wrote:
> On Thu, 7 Oct 2004 16:50:26 -0500
> Matt Mackall <mpm@selenic.com> wrote:
> 
> > > The only drawback is that there won't be a reply when the driver try
> > > lock fails, but netpoll doesn't have a queue for that anyways. You could
> > > probably poll then, but I'm not sure it's a good idea.
> > 
> > But your meaning here is not entirely clear.
> 
> If another thread on another cpu is in the dev->hard_start_xmit() routine,
> then it will have it's tx device lock held, and netpoll will simply get an
> immediate return from ->hard_start_xmit() with error NETDEV_TX_LOCKED.
> 
> The packet will thus not be sent, and because netpoll does not have a
> backlog queue for tx packets of any kind the packet lost forever.
> 
> NETDEV_TX_LOCKED is a transient condition.  It works for the rest of the
> kernel because whoever holds the tx lock on the device, will recheck the
> device packet transmit queue when it drops that lock and returns from
> ->hard_start_xmit().
> 
> Andi is merely noting how netpoll's design does not have such a model,
> which is why the NETIF_F_LLTX semantics don't mesh very well.
> 
> It is unclear if it ise wise that netpoll_send_skb() currently spins
> on ->hard_start_xmit() returning NETDEV_TX_LOCKED.  That could
> result in some kind of deadlocks.

Deadlocks from recursion, presumably? We could probably throw in a max
retry count, as ugly as that is..

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 23:43                   ` Matt Mackall
@ 2004-10-07 23:50                     ` Andi Kleen
  2004-10-08  6:46                       ` Colin Leroy
  0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2004-10-07 23:50 UTC (permalink / raw)
  To: Matt Mackall; +Cc: David S. Miller, ak, colin, akpm, netdev

> Deadlocks from recursion, presumably? We could probably throw in a max
> retry count, as ugly as that is..

There should not be any recursion, no.

The problem is that the poll is effectively a spinlock. But when
another CPU takes an long interrupt while holding the lock it 
could take quite a long time to grab the lock.

For most network drivers this shouldn't occur though because
the net driver private lock is usually always taken with interrupts
off. 

-Andi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 23:50                     ` Andi Kleen
@ 2004-10-08  6:46                       ` Colin Leroy
  2004-10-08 21:53                         ` Matt Mackall
  0 siblings, 1 reply; 34+ messages in thread
From: Colin Leroy @ 2004-10-08  6:46 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Matt Mackall, David S. Miller, ak, akpm, netdev

On 08 Oct 2004 at 01h10, Andi Kleen wrote:

Hi, 

> > Deadlocks from recursion, presumably? We could probably throw in a
> > max retry count, as ugly as that is..
> 
> There should not be any recursion, no.

If printk() is synchronous, there could be, if there's a printk() in the
codepath taken by dev->hard_start_xmit()... But I don't if it is...

> The problem is that the poll is effectively a spinlock. But when
> another CPU takes an long interrupt while holding the lock it 
> could take quite a long time to grab the lock.
> 
> For most network drivers this shouldn't occur though because
> the net driver private lock is usually always taken with interrupts
> off. 

Second newbie question: how are the interrupts disabled, is it via
local_irq_save()/local_irq_restore()? or is it something else ?

-- 
Colin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-08  6:46                       ` Colin Leroy
@ 2004-10-08 21:53                         ` Matt Mackall
  0 siblings, 0 replies; 34+ messages in thread
From: Matt Mackall @ 2004-10-08 21:53 UTC (permalink / raw)
  To: Colin Leroy; +Cc: Andi Kleen, David S. Miller, akpm, netdev

On Fri, Oct 08, 2004 at 08:46:40AM +0200, Colin Leroy wrote:
> On 08 Oct 2004 at 01h10, Andi Kleen wrote:
> 
> Hi, 
> 
> > > Deadlocks from recursion, presumably? We could probably throw in a
> > > max retry count, as ugly as that is..
> > 
> > There should not be any recursion, no.
> 
> If printk() is synchronous, there could be, if there's a printk() in the
> codepath taken by dev->hard_start_xmit()... But I don't if it is...

Have you looked at the code path that does carrier detect for possibly
recursing printks?

> > For most network drivers this shouldn't occur though because
> > the net driver private lock is usually always taken with interrupts
> > off. 
> 
> Second newbie question: how are the interrupts disabled, is it via
> local_irq_save()/local_irq_restore()? or is it something else ?

Often by virtue of being in an interrupt context already.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 21:45             ` Andi Kleen
  2004-10-07 21:50               ` Matt Mackall
@ 2004-10-08  7:06               ` Colin Leroy
  2004-10-08 22:00                 ` Matt Mackall
  1 sibling, 1 reply; 34+ messages in thread
From: Colin Leroy @ 2004-10-08  7:06 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, mpm, akpm, netdev

On 07 Oct 2004 at 23h10, Andi Kleen wrote:

Hi, 

> > This patch should do that. It works OK for me, but I'd like it
> > checked before sent upstream...
> > 
> > However, it doesn't fix the hang. it looks like this hang is really
> > coming from sungem.
> 
> IMHO it's not needed. Taking xmit_lock is harmless even when
> the NETIF_F_LLTX flag is set. 

Should that be completely dropped, or is it still ok ? (I think
differenciating action based on hard_start_xmit status, that is, don't
goto repeat undefinitely when NETDEV_TX_BUSY, could be a good idea).
I mean, should I rework that patch, forget about it or leave it as-is?

Concerning the hang, I see that Andrew has put my first patch, the one
checking for netif_carrier_ok(), in his tree. Is it an OK solution from
your (net dev hackers) point of view?

Thanks,
-- 
Colin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-08  7:06               ` Colin Leroy
@ 2004-10-08 22:00                 ` Matt Mackall
  2004-10-08 22:18                   ` Andrew Morton
  0 siblings, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2004-10-08 22:00 UTC (permalink / raw)
  To: Colin Leroy; +Cc: Andi Kleen, David S. Miller, akpm, netdev

On Fri, Oct 08, 2004 at 09:06:10AM +0200, Colin Leroy wrote:
> On 07 Oct 2004 at 23h10, Andi Kleen wrote:
> 
> Hi, 
> 
> > > This patch should do that. It works OK for me, but I'd like it
> > > checked before sent upstream...
> > > 
> > > However, it doesn't fix the hang. it looks like this hang is really
> > > coming from sungem.
> > 
> > IMHO it's not needed. Taking xmit_lock is harmless even when
> > the NETIF_F_LLTX flag is set. 
> 
> Should that be completely dropped, or is it still ok ? (I think
> differenciating action based on hard_start_xmit status, that is, don't
> goto repeat undefinitely when NETDEV_TX_BUSY, could be a good idea).
> I mean, should I rework that patch, forget about it or leave it as-is?

Well the purpose of the LLTX flag is to reduce serializing on the
xmit_lock. If we take the lock anyway, that should be harmless as Andi
says. So I'm afraid it looks to be a performance fix at best (which is
a low priority here). Let's back burner it for now.
 
> Concerning the hang, I see that Andrew has put my first patch, the one
> checking for netif_carrier_ok(), in his tree. Is it an OK solution from
> your (net dev hackers) point of view?

It seems to be papering over a driver bug of some sort, which is not
the way we like to fix things.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-08 22:00                 ` Matt Mackall
@ 2004-10-08 22:18                   ` Andrew Morton
  2004-10-11  3:59                     ` David S. Miller
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2004-10-08 22:18 UTC (permalink / raw)
  To: Matt Mackall; +Cc: colin, ak, davem, netdev

Matt Mackall <mpm@selenic.com> wrote:
>
> > Concerning the hang, I see that Andrew has put my first patch, the one
> > checking for netif_carrier_ok(), in his tree.

I dropped it again, waiting for the dust to settle on this one.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-08 22:18                   ` Andrew Morton
@ 2004-10-11  3:59                     ` David S. Miller
  2004-10-11 15:40                       ` Andi Kleen
  0 siblings, 1 reply; 34+ messages in thread
From: David S. Miller @ 2004-10-11  3:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mpm, colin, ak, netdev

Wait, I think I see the problem.

Sungem processes link status in it's ->poll() NAPI handler.
This occurs via calls to gem_pcs_interrupt(), for example.
Non-pcs sungem variants use a timer to poll link status.

When the link changes state, this link state processing
does printk()'s.

So perhaps that is why it deadlocks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-11  3:59                     ` David S. Miller
@ 2004-10-11 15:40                       ` Andi Kleen
  2004-10-11 16:22                         ` Matt Mackall
  0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2004-10-11 15:40 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andrew Morton, mpm, colin, ak, netdev

On Sun, Oct 10, 2004 at 08:59:28PM -0700, David S. Miller wrote:
> 
> Wait, I think I see the problem.
> 
> Sungem processes link status in it's ->poll() NAPI handler.
> This occurs via calls to gem_pcs_interrupt(), for example.
> Non-pcs sungem variants use a timer to poll link status.
> 
> When the link changes state, this link state processing
> does printk()'s.
> 
> So perhaps that is why it deadlocks.

printk handles recursion with the down_trylock on console_sem.
So it shouldn't deadlock.

-Andi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-11 15:40                       ` Andi Kleen
@ 2004-10-11 16:22                         ` Matt Mackall
  2004-10-11 16:32                           ` Andi Kleen
  0 siblings, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2004-10-11 16:22 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, Andrew Morton, colin, netdev

On Mon, Oct 11, 2004 at 05:40:00PM +0200, Andi Kleen wrote:
> On Sun, Oct 10, 2004 at 08:59:28PM -0700, David S. Miller wrote:
> > 
> > Wait, I think I see the problem.
> > 
> > Sungem processes link status in it's ->poll() NAPI handler.
> > This occurs via calls to gem_pcs_interrupt(), for example.
> > Non-pcs sungem variants use a timer to poll link status.
> > 
> > When the link changes state, this link state processing
> > does printk()'s.
> > 
> > So perhaps that is why it deadlocks.
> 
> printk handles recursion with the down_trylock on console_sem.
> So it shouldn't deadlock.

If we're in the ->poll() handler for non-netpoll reasons, and the link
state changes, causing a printk, we'll potentially reenter ->poll() via
netconsole.

This also explains the original fix quite nicely. Colin, can you try
commenting out all the printks in gem_pcs_interrupt and if that works,
we'll start thinking about a proper fix.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-11 16:22                         ` Matt Mackall
@ 2004-10-11 16:32                           ` Andi Kleen
  2004-10-11 16:36                             ` Matt Mackall
  0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2004-10-11 16:32 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Andi Kleen, David S. Miller, Andrew Morton, colin, netdev

On Mon, Oct 11, 2004 at 11:22:24AM -0500, Matt Mackall wrote:
> On Mon, Oct 11, 2004 at 05:40:00PM +0200, Andi Kleen wrote:
> > On Sun, Oct 10, 2004 at 08:59:28PM -0700, David S. Miller wrote:
> > > 
> > > Wait, I think I see the problem.
> > > 
> > > Sungem processes link status in it's ->poll() NAPI handler.
> > > This occurs via calls to gem_pcs_interrupt(), for example.
> > > Non-pcs sungem variants use a timer to poll link status.
> > > 
> > > When the link changes state, this link state processing
> > > does printk()'s.
> > > 
> > > So perhaps that is why it deadlocks.
> > 
> > printk handles recursion with the down_trylock on console_sem.
> > So it shouldn't deadlock.
> 
> If we're in the ->poll() handler for non-netpoll reasons, and the link
> state changes, causing a printk, we'll potentially reenter ->poll() via
> netconsole.

It won't because printk catches the case.

-Andi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-11 16:32                           ` Andi Kleen
@ 2004-10-11 16:36                             ` Matt Mackall
  2004-10-11 16:43                               ` Andi Kleen
  0 siblings, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2004-10-11 16:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, Andrew Morton, colin, netdev

On Mon, Oct 11, 2004 at 06:32:26PM +0200, Andi Kleen wrote:
> On Mon, Oct 11, 2004 at 11:22:24AM -0500, Matt Mackall wrote:
> > On Mon, Oct 11, 2004 at 05:40:00PM +0200, Andi Kleen wrote:
> > > On Sun, Oct 10, 2004 at 08:59:28PM -0700, David S. Miller wrote:
> > > > 
> > > > Wait, I think I see the problem.
> > > > 
> > > > Sungem processes link status in it's ->poll() NAPI handler.
> > > > This occurs via calls to gem_pcs_interrupt(), for example.
> > > > Non-pcs sungem variants use a timer to poll link status.
> > > > 
> > > > When the link changes state, this link state processing
> > > > does printk()'s.
> > > > 
> > > > So perhaps that is why it deadlocks.
> > > 
> > > printk handles recursion with the down_trylock on console_sem.
> > > So it shouldn't deadlock.
> > 
> > If we're in the ->poll() handler for non-netpoll reasons, and the link
> > state changes, causing a printk, we'll potentially reenter ->poll() via
> > netconsole.
> 
> It won't because printk catches the case.

It's not recursion on printk that's a problem, it's recursion on
->poll() and attempting to take whatever internal driver locks.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-11 16:36                             ` Matt Mackall
@ 2004-10-11 16:43                               ` Andi Kleen
  2004-10-11 16:58                                 ` Matt Mackall
  0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2004-10-11 16:43 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Andi Kleen, David S. Miller, Andrew Morton, colin, netdev

> It's not recursion on printk that's a problem, it's recursion on
> ->poll() and attempting to take whatever internal driver locks.

There is no recursion on poll because printk will never call into
the low level console driver when the console sem is already taken.

-Andi

P.S.: I made the same mistake long ago, but akpm set me straight.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-11 16:43                               ` Andi Kleen
@ 2004-10-11 16:58                                 ` Matt Mackall
  2004-10-11 17:41                                   ` Andi Kleen
  2004-10-11 20:45                                   ` Colin Leroy
  0 siblings, 2 replies; 34+ messages in thread
From: Matt Mackall @ 2004-10-11 16:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, Andrew Morton, colin, netdev

On Mon, Oct 11, 2004 at 06:43:15PM +0200, Andi Kleen wrote:
> > It's not recursion on printk that's a problem, it's recursion on
> > ->poll() and attempting to take whatever internal driver locks.
> 
> There is no recursion on poll because printk will never call into
> the low level console driver when the console sem is already taken.

Ergh, you deleted the context. Again, imagine we're originally in
->poll() for _a non-netpoll-related reason_. In other words, the
console sem is not taken, because we're just doing routine network
I/O. While in poll(), we take a private driver lock. Then for whatever
reason, we printk -> netconsole -> netpoll -> poll() again where we
attempt to retake the private driver lock.

> P.S.: I made the same mistake long ago, but akpm set me straight.

I'm pretty sure this case is different.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-11 16:58                                 ` Matt Mackall
@ 2004-10-11 17:41                                   ` Andi Kleen
  2004-10-11 20:45                                   ` Colin Leroy
  1 sibling, 0 replies; 34+ messages in thread
From: Andi Kleen @ 2004-10-11 17:41 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Andi Kleen, David S. Miller, Andrew Morton, colin, netdev

On Mon, Oct 11, 2004 at 11:58:52AM -0500, Matt Mackall wrote:
> On Mon, Oct 11, 2004 at 06:43:15PM +0200, Andi Kleen wrote:
> > > It's not recursion on printk that's a problem, it's recursion on
> > > ->poll() and attempting to take whatever internal driver locks.
> > 
> > There is no recursion on poll because printk will never call into
> > the low level console driver when the console sem is already taken.
> 
> Ergh, you deleted the context. Again, imagine we're originally in
> ->poll() for _a non-netpoll-related reason_. In other words, the
> console sem is not taken, because we're just doing routine network
> I/O. While in poll(), we take a private driver lock. Then for whatever
> reason, we printk -> netconsole -> netpoll -> poll() again where we
> attempt to retake the private driver lock.

You're right, in that case you could get a deadlock. 

-Andi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-11 16:58                                 ` Matt Mackall
  2004-10-11 17:41                                   ` Andi Kleen
@ 2004-10-11 20:45                                   ` Colin Leroy
       [not found]                                     ` <5cac192f0410181443303379e2@mail.gmail.com>
  1 sibling, 1 reply; 34+ messages in thread
From: Colin Leroy @ 2004-10-11 20:45 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Andi Kleen, David S. Miller, Andrew Morton, netdev

On 11 Oct 2004 at 11h10, Matt Mackall wrote:

Hi, 

> Ergh, you deleted the context. Again, imagine we're originally in
> ->poll() for _a non-netpoll-related reason_. In other words, the
> console sem is not taken, because we're just doing routine network
> I/O. While in poll(), we take a private driver lock. Then for whatever
> reason, we printk -> netconsole -> netpoll -> poll() again where we
> attempt to retake the private driver lock.
> 
> > P.S.: I made the same mistake long ago, but akpm set me straight.
> 
> I'm pretty sure this case is different.

I tried replacing spin_lock by spin_trylock in poll() and a few other
functions (and returning accordingly), but it didn't help. I'm beginning
to think I won't be able to debug this one! ;)

-- 
Colin
  "When in doubt, do the right thing."

^ permalink raw reply	[flat|nested] 34+ messages in thread

[parent not found: <5cac192f0410181443303379e2@mail.gmail.com>]

[parent not found: <5cac192f041018145824acce5a@mail.gmail.com>]

[parent not found: <20041020161119.6e30efe5@pirandello>]

[parent not found: <5cac192f0410200848179ccc81@mail.gmail.com>]

* Re: [PATCH] Prevent netpoll hanging when link is down
       [not found]                                           ` <5cac192f0410200848179ccc81@mail.gmail.com>
@ 2004-10-21 16:36                                             ` Colin Leroy
  2004-10-24 15:22                                               ` Eric Lemoine
  0 siblings, 1 reply; 34+ messages in thread
From: Colin Leroy @ 2004-10-21 16:36 UTC (permalink / raw)
  To: Eric Lemoine; +Cc: netdev

On 20 Oct 2004 at 17h10, Eric Lemoine wrote:

Hi Eric, 

> > > > > [deadlock with sungem & netconsole when no carrier]
> > > > Sorry I'm catching up on that problem... Do you have your
> > > > problem solved? If not, did you try removing all printk calls
> > > > from within gem_poll(), as suggested by Matt Mackall?
> > >
> Ok. Keep me posted. If removing the printk's indeed helps I'll cook up
> a proper fix.

Ok, it tried... It doesn't help. I tried the wild way too:

	#define printk(...) do {} while (0);

It didn't help either. I'll give a look to other net drivers, to see if
they do stuff differently, but I fear that I won't find it easily...

-- 
Colin
  "If you can't beat your computer at chess, try kickboxing."

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-21 16:36                                             ` Colin Leroy
@ 2004-10-24 15:22                                               ` Eric Lemoine
  0 siblings, 0 replies; 34+ messages in thread
From: Eric Lemoine @ 2004-10-24 15:22 UTC (permalink / raw)
  To: Colin Leroy; +Cc: netdev

> > > > > > [deadlock with sungem & netconsole when no carrier]
> > > > > Sorry I'm catching up on that problem... Do you have your
> > > > > problem solved? If not, did you try removing all printk calls
> > > > > from within gem_poll(), as suggested by Matt Mackall?
> > > >
> > Ok. Keep me posted. If removing the printk's indeed helps I'll cook up
> > a proper fix.
> 
> Ok, it tried... It doesn't help. I tried the wild way too:
> 
>         #define printk(...) do {} while (0);
> 
> It didn't help either. I'll give a look to other net drivers, to see if
> they do stuff differently, but I fear that I won't find it easily...

Thanks for running this Colin. Unfortunately I can't figure out what's
going on at this point.

-- 
Eric

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 20:44           ` Colin Leroy
  2004-10-07 21:45             ` Andi Kleen
@ 2004-10-07 22:08             ` David S. Miller
  2004-10-08  6:54               ` Colin Leroy
  1 sibling, 1 reply; 34+ messages in thread
From: David S. Miller @ 2004-10-07 22:08 UTC (permalink / raw)
  To: Colin Leroy; +Cc: mpm, akpm, netdev

On Thu, 7 Oct 2004 22:44:22 +0200
Colin Leroy <colin@colino.net> wrote:

> On 07 Oct 2004 at 11h10, David S. Miller wrote:
> 
> Hi again, 
> 
> > So, netpoll needs to have the NETIF_F_LLTX stuff added to it.
> 
> This patch should do that. It works OK for me, but I'd like it checked
> before sent upstream...
> 
> However, it doesn't fix the hang. it looks like this hang is really
> coming from sungem.

Is it hanging inside of the ->hard_start_xmit() call or somewhere
else?  Do you have a way to determine this without adding printk()'s
and thus causing recursion as you mentioned earlier? :-)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] Prevent netpoll hanging when link is down
  2004-10-07 22:08             ` David S. Miller
@ 2004-10-08  6:54               ` Colin Leroy
  0 siblings, 0 replies; 34+ messages in thread
From: Colin Leroy @ 2004-10-08  6:54 UTC (permalink / raw)
  To: David S. Miller; +Cc: mpm, akpm, netdev

On 07 Oct 2004 at 15h10, David S. Miller wrote:

Hi, 

> > However, it doesn't fix the hang. it looks like this hang is really
> > coming from sungem.
> 
> Is it hanging inside of the ->hard_start_xmit() call 

I think so, but my way of discovering it may not be very good: I tested
by replacing 

status = np->dev->hard_start_xmit(...);

by 
status = NETDEV_TX_OK, then status = NETDEV_TX_BUSY, then status =
NETDEV_TX_LOCKED 

in netpoll.c (avoiding to call hard_start_xmit()), and it didn't hang. 

> or somewhere else?  Do you have a way to determine this without adding
> printk()'s and thus causing recursion as you mentioned earlier? :-)

Well, that's my big problem :-) I can't use the spinlock debugging
neither, because I'm on uniprocessor and on PPC.

I tried removing printk()s from gem_start_xmit() codepath, but it didn't
help either, so I don't think the lock comes from a printk()
recursion... 

(It's really hard to debug that kind of stuff! I'm learning quite a few
things :))

-- 
Colin

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2004-10-24 15:22 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20041006232544.53615761@jack.colino.net>
2004-10-06 21:43 ` [PATCH] Prevent netpoll hanging when link is down Matt Mackall
2004-10-07  5:53   ` Colin Leroy
2004-10-07  6:49     ` David S. Miller
2004-10-07  8:33       ` Colin Leroy
2004-10-07  8:45         ` Colin Leroy
2004-10-07 14:05       ` Colin Leroy
2004-10-07 18:28         ` David S. Miller
2004-10-07 18:41           ` Matt Mackall
2004-10-07 20:00             ` Colin Leroy
2004-10-07 18:43           ` Andi Kleen
2004-10-07 20:44           ` Colin Leroy
2004-10-07 21:45             ` Andi Kleen
2004-10-07 21:50               ` Matt Mackall
2004-10-07 22:07                 ` David S. Miller
2004-10-07 23:43                   ` Matt Mackall
2004-10-07 23:50                     ` Andi Kleen
2004-10-08  6:46                       ` Colin Leroy
2004-10-08 21:53                         ` Matt Mackall
2004-10-08  7:06               ` Colin Leroy
2004-10-08 22:00                 ` Matt Mackall
2004-10-08 22:18                   ` Andrew Morton
2004-10-11  3:59                     ` David S. Miller
2004-10-11 15:40                       ` Andi Kleen
2004-10-11 16:22                         ` Matt Mackall
2004-10-11 16:32                           ` Andi Kleen
2004-10-11 16:36                             ` Matt Mackall
2004-10-11 16:43                               ` Andi Kleen
2004-10-11 16:58                                 ` Matt Mackall
2004-10-11 17:41                                   ` Andi Kleen
2004-10-11 20:45                                   ` Colin Leroy
     [not found]                                     ` <5cac192f0410181443303379e2@mail.gmail.com>
     [not found]                                       ` <5cac192f041018145824acce5a@mail.gmail.com>
     [not found]                                         ` <20041020161119.6e30efe5@pirandello>
     [not found]                                           ` <5cac192f0410200848179ccc81@mail.gmail.com>
2004-10-21 16:36                                             ` Colin Leroy
2004-10-24 15:22                                               ` Eric Lemoine
2004-10-07 22:08             ` David S. Miller
2004-10-08  6:54               ` Colin Leroy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).