lockups with netconsole on e1000 on media insertion

public inbox for linux-kernel@vger.kernel.org

* lockups with netconsole on e1000 on media insertion
@ 2005-08-05 11:04 John Bäckstrand
  0 siblings, 0 replies; 18+ messages in thread
From: John Bäckstrand @ 2005-08-05 11:04 UTC (permalink / raw)
  To: linux-kernel

I've been trying to hunt down a hard lockup issue with some hardware of 
mine, but I've possibly hit a kernel bug instead. When using netconsole 
on my e1000, if I unplug the cable and then re-plug it, the machine 
locks up hard. It manages to print the "link up" message on the screen, 
but nothing after that. Now, I wonder if this is supposed to be so? I 
tried this on 4 different configurations, 2.6.13-rc5 and 2.6.12 with and 
without "noapic acpi=off", same result on all of them. I've tried with 1 
and 3 other NICs in the machine at the same time.

It seems to be working fine on other NICs, such as rtl8139 and 3c59x. 
Any ideas on how to debug this further? (Btw, is there an easy way of 
"inserting" dmesg messages manually?)

---
John Bäckstrand

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
       [not found] <42F347D2.7000207@home.se.suse.lists.linux.kernel>
@ 2005-08-05 11:45 ` Andi Kleen
  2005-08-05 12:44   ` John Bäckstrand
                     ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Andi Kleen @ 2005-08-05 11:45 UTC (permalink / raw)
  To: John Bäckstrand; +Cc: linux-kernel, netdev

John Bäckstrand <sandos@home.se> writes:

> I've been trying to hunt down a hard lockup issue with some hardware
> of mine, but I've possibly hit a kernel bug instead. When using
> netconsole on my e1000, if I unplug the cable and then re-plug it, the
> machine locks up hard. It manages to print the "link up" message on
> the screen, but nothing after that. Now, I wonder if this is supposed
> to be so? I tried this on 4 different configurations, 2.6.13-rc5 and
> 2.6.12 with and without "noapic acpi=off", same result on all of
> them. I've tried with 1 and 3 other NICs in the machine at the same
> time.

I ran into the same problem some time ago on e1000. The problem was
that if the link doesn't come up netconsole ends up waiting forever
for it.

The patch was for 2.6.12; I did a quick untested port to 2.6.13-rc5.

-Andi

Only try a limited number of times to send packets in netpoll

Avoids hangs on e1000 when link is not up.

Signed-off-by: Andi Kleen <ak@suse.de>

Index: linux/net/core/netpoll.c
===================================================================
--- linux.orig/net/core/netpoll.c
+++ linux/net/core/netpoll.c
@@ -247,9 +247,11 @@ static void netpoll_send_skb(struct netp
 {
 	int status;
 	struct netpoll_info *npinfo;
+	/* Only try 5 times in case the link is down etc. */
+	int try = 5;
 
 repeat:
-	if(!np || !np->dev || !netif_running(np->dev)) {
+	if(try-- == 0 || !np || !np->dev || !netif_running(np->dev)) {
 		__kfree_skb(skb);
 		return;
 	}
@@ -286,6 +288,9 @@ repeat:
 
 	/* transmit busy */
 	if(status) {
+		/* Don't count spinlock as try */
+		if (status == NETDEV_TX_LOCKED)
+			try++; 
 		netpoll_poll(np);
 		goto repeat;
 	}
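
A minimal user-space sketch of the bounded retry that the patch above introduces, for readers who want to see the loop in isolation. It is not kernel code: send_bounded, tx_status, fake_link and fake_xmit are hypothetical names invented for illustration, and the status values only stand in for the driver's return codes. As in the patch, a "locked" result is not charged against the retry budget.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a driver's transmit result. */
enum tx_status { TX_OK, TX_BUSY, TX_LOCKED };

/* Hypothetical transmit hook: one attempt to hand a packet to the NIC. */
typedef enum tx_status (*xmit_fn)(void *ctx);

/*
 * Try to send one packet, giving up after max_tries busy results.
 * TX_LOCKED means somebody else holds the transmit lock, so it is not
 * counted (the "try++" in the patch has the same effect).
 * Returns true if the packet went out, false if it was dropped.
 */
bool send_bounded(xmit_fn xmit, void *ctx, int max_tries)
{
	int tries = max_tries;

	assert(xmit != NULL && max_tries > 0);
	while (tries > 0) {
		enum tx_status status = xmit(ctx);

		if (status == TX_OK)
			return true;
		if (status != TX_LOCKED)
			tries--;
	}
	return false;
}

/* Hypothetical fake link: reports busy for the first busy_for attempts. */
struct fake_link {
	int busy_for;
	int calls;
};

enum tx_status fake_xmit(void *ctx)
{
	struct fake_link *link = ctx;

	link->calls++;
	if (link->busy_for > 0) {
		link->busy_for--;
		return TX_BUSY;
	}
	return TX_OK;
}
```

With a link that stays busy, send_bounded(fake_xmit, &link, 5) calls the transmit hook five times and then reports the packet as dropped instead of spinning forever. A link that only ever reports "locked" would still spin, which is also true of the patch as posted.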

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 11:45 ` Andi Kleen
@ 2005-08-05 12:44   ` John Bäckstrand
  2005-08-05 13:49   ` Steven Rostedt
  2005-08-05 20:12   ` Matt Mackall
  2 siblings, 0 replies; 18+ messages in thread
From: John Bäckstrand @ 2005-08-05 12:44 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, netdev

Andi Kleen wrote:
> The patch was for 2.6.12; I did a quick untested port to 2.6.13-rc5.
> 
> -Andi
> 
> Only try a limited number of times to send packets in netpoll

Thanks, worked nicely!

---
John Bäckstrand

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 11:45 ` Andi Kleen
  2005-08-05 12:44   ` John Bäckstrand
@ 2005-08-05 13:49   ` Steven Rostedt
  2005-08-05 13:55     ` Andi Kleen
  2005-08-07 21:12     ` John Bäckstrand
  2005-08-05 20:12   ` Matt Mackall
  2 siblings, 2 replies; 18+ messages in thread
From: Steven Rostedt @ 2005-08-05 13:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ingo Molnar, netdev, linux-kernel, John Bäckstrand

On Fri, 2005-08-05 at 13:45 +0200, Andi Kleen wrote:
> John Bäckstrand <sandos@home.se> writes:
> 
> > I've been trying to hunt down a hard lockup issue with some hardware
> > of mine, but I've possibly hit a kernel bug instead. When using
> > netconsole on my e1000, if I unplug the cable and then re-plug it, the
> > machine locks up hard. It manages to print the "link up" message on
> > the screen, but nothing after that. Now, I wonder if this is supposed
> > to be so? I tried this on 4 different configurations, 2.6.13-rc5 and
> > 2.6.12 with and without "noapic acpi=off", same result on all of
> > them. I've tried with 1 and 3 other NICs in the machine at the same
> > time.
> 
> I ran into the same problem some time ago on e1000. The problem was
> that if the link doesn't come up netconsole ends up waiting forever
> for it.
> 
> The patch was for 2.6.12; I did a quick untested port to 2.6.13-rc5.
> 
> -Andi
> 
> Only try a limited number of times to send packets in netpoll
> 
> Avoids hangs on e1000 when link is not up.
> 
> Signed-off-by: Andi Kleen <ak@suse.de>
> 
> Index: linux/net/core/netpoll.c
> ===================================================================
> --- linux.orig/net/core/netpoll.c
> +++ linux/net/core/netpoll.c
> @@ -247,9 +247,11 @@ static void netpoll_send_skb(struct netp
>  {
>  	int status;
>  	struct netpoll_info *npinfo;
> +	/* Only try 5 times in case the link is down etc. */
> +	int try = 5;
>  
>  repeat:
> -	if(!np || !np->dev || !netif_running(np->dev)) {
> +	if(try-- == 0 || !np || !np->dev || !netif_running(np->dev)) {
>  		__kfree_skb(skb);
>  		return;
>  	}
> @@ -286,6 +288,9 @@ repeat:
>  
>  	/* transmit busy */
>  	if(status) {
> +		/* Don't count spinlock as try */
> +		if (status == NETDEV_TX_LOCKED)
> +			try++; 
>  		netpoll_poll(np);
>  		goto repeat;
>  	}
> -

This is fixing the symptom and is not the cure.  Unfortunately I don't
have an e1000 card so I can't try a fix. But I did have an e100 card that
would lock up the same way.  The problem was that netpoll_poll calls the
card's netpoll routine (e1000_netpoll in e1000_main.c).  In the e100
case, when the transmit buffer would fill up, the queue would go down.
But the netpoll routine in the e100 code never put it back up after it
was all transferred. So this would lock up the kernel when that happened.

I believe that the e1000 is suffering from the same problem. I can't fix
it since I don't have an e1000 to test on, but what probably needs to be
done is to check whether the transmit buffer can be cleaned and the
queue brought back up.
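
The failure mode described above can be shown with a toy model, assuming a four-slot transmit ring. This is only a sketch: toy_nic, ring_xmit, ring_poll and RING_SIZE are hypothetical names, and nothing here is taken from the e100 or e1000 sources. The queue is stopped when the ring fills; a poll routine that reclaims descriptors but never wakes the queue leaves the sender stuck even though the ring is empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4 /* hypothetical, tiny for illustration */

struct toy_nic {
	int in_flight;      /* descriptors handed to the hardware */
	bool queue_stopped; /* plays the role of netif_queue_stopped() */
};

/* Queue one packet; stop the queue when the ring becomes full. */
bool ring_xmit(struct toy_nic *nic)
{
	assert(nic != NULL);
	if (nic->queue_stopped)
		return false;
	nic->in_flight++;
	if (nic->in_flight == RING_SIZE)
		nic->queue_stopped = true;
	return true;
}

/*
 * Reclaim up to 'completed' descriptors. A correct poll routine passes
 * wake_queue = true; passing false models a poll that forgets the wake.
 */
void ring_poll(struct toy_nic *nic, int completed, bool wake_queue)
{
	assert(nic != NULL && completed >= 0);
	if (completed > nic->in_flight)
		completed = nic->in_flight;
	nic->in_flight -= completed;
	if (wake_queue && completed > 0 && nic->queue_stopped)
		nic->queue_stopped = false;
}
```

After filling the ring, ring_poll(&nic, RING_SIZE, false) empties it but ring_xmit still fails, which is the situation a caller that retries forever cannot escape.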

e1000_netpoll calls e1000_intr which looks like this:

static irqreturn_t
e1000_intr(int irq, void *data, struct pt_regs *regs)
{
	struct net_device *netdev = data;
	struct e1000_adapter *adapter = netdev_priv(netdev);
	struct e1000_hw *hw = &adapter->hw;
	uint32_t icr = E1000_READ_REG(hw, ICR);
#ifndef CONFIG_E1000_NAPI
	unsigned int i;
#endif

	if(unlikely(!icr))
		return IRQ_NONE;  /* Not our interrupt */

^^^^^^^^
---- Here I'm wondering whether this is what gets returned in the netpoll case.


	if(unlikely(icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))) {
		hw->get_link_status = 1;
		mod_timer(&adapter->watchdog_timer, jiffies);
	}

#ifdef CONFIG_E1000_NAPI
	if(likely(netif_rx_schedule_prep(netdev))) {

		/* Disable interrupts and register for poll. The flush 
		  of the posted write is intentionally left out.
		*/

		atomic_inc(&adapter->irq_sem);
		E1000_WRITE_REG(hw, IMC, ~0);
		__netif_rx_schedule(netdev);
	}
#else
	/* Writing IMC and IMS is needed for 82547.
	   Due to Hub Link bus being occupied, an interrupt
	   de-assertion message is not able to be sent.
	   When an interrupt assertion message is generated later,
	   two messages are re-ordered and sent out.
	   That causes APIC to think 82547 is in de-assertion
	   state, while 82547 is in assertion state, resulting
	   in dead lock. Writing IMC forces 82547 into
	   de-assertion state.
	*/
	if(hw->mac_type == e1000_82547 || hw->mac_type == e1000_82547_rev_2){
		atomic_inc(&adapter->irq_sem);
		E1000_WRITE_REG(hw, IMC, ~0);
	}

	for(i = 0; i < E1000_MAX_INTR; i++)
		if(unlikely(!adapter->clean_rx(adapter) &
		   !e1000_clean_tx_irq(adapter)))
^^^^^
----  This should clean the transmit buffer, but it may not get here.

			break;

	if(hw->mac_type == e1000_82547 || hw->mac_type == e1000_82547_rev_2)
		e1000_irq_enable(adapter);
#endif

	return IRQ_HANDLED;
}



So maybe the patch should be something like:

--- linux-2.6.13-rc3/drivers/net/e1000/e1000_main.c.orig	2005-08-05 09:32:01.000000000 -0400
+++ linux-2.6.13-rc3/drivers/net/e1000/e1000_main.c	2005-08-05 09:33:56.000000000 -0400
@@ -3816,6 +3816,7 @@ e1000_netpoll(struct net_device *netdev)
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 	disable_irq(adapter->pdev->irq);
 	e1000_intr(adapter->pdev->irq, netdev, NULL);
+	e1000_clean_tx_irq(adapter);
 	enable_irq(adapter->pdev->irq);
 }
 #endif


I don't have the card, so I can't test it. But if this works (after
removing the previous patch) then this is the better solution.  If this
does work, then we should probably also add the timeout in netpoll, with
a warning that the driver's netpoll is broken.

Here's a modified version of the other patch that adds the warning, so
we know where the problem is.

#### John, Delete this part if you apply the above. ####

--- linux-2.6.13-rc3/net/core/netpoll.c.orig	2005-08-05 09:37:00.000000000 -0400
+++ linux-2.6.13-rc3/net/core/netpoll.c	2005-08-05 09:44:19.000000000 -0400
@@ -247,9 +247,14 @@ static void netpoll_send_skb(struct netp
 {
 	int status;
 	struct netpoll_info *npinfo;
+	/* only try five times in case the link is down */
+	int try=5;
 
 repeat:
-	if(!np || !np->dev || !netif_running(np->dev)) {
+	if(try-- == 0 || !np || !np->dev || !netif_running(np->dev)) {
+		if (try < 0)
+			printk(KERN_WARNING "net driver is stuck down, maybe a"
+					" problem with the driver's netpoll\n");
 		__kfree_skb(skb);
 		return;
 	}
@@ -286,6 +291,9 @@ repeat:
 
 	/* transmit busy */
 	if(status) {
+		/* Don't count spinlock as try */
+		if (status == NETDEV_TX_LOCKED)
+			try++;
 		netpoll_poll(np);
 		goto repeat;
 	}


-- Steve



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 13:49   ` Steven Rostedt
@ 2005-08-05 13:55     ` Andi Kleen
  2005-08-05 14:10       ` Steven Rostedt
  2005-08-07 21:12     ` John Bäckstrand
  1 sibling, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2005-08-05 13:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Ingo Molnar, netdev, linux-kernel, John Bäckstrand

> This is fixing the symptom and is not the cure.  Unfortunately I don't
> have an e1000 card so I can't try a fix. But I did have an e100 card that
> would lock up the same way.  The problem was that netpoll_poll calls the
> card's netpoll routine (e1000_netpoll in e1000_main.c).  In the e100
> case, when the transmit buffer would fill up, the queue would go down.
> But the netpoll routine in the e100 code never put it back up after it
> was all transferred. So this would lock up the kernel when that happened.

In my case the hang happened when no cable was connected.

There is no way to handle this in any other way. You eventually
have to bail out.

>  
>  repeat:
> -	if(!np || !np->dev || !netif_running(np->dev)) {
> +	if(try-- == 0 || !np || !np->dev || !netif_running(np->dev)) {
> +		if (try < 0)
> +			printk(KERN_WARNING "net driver is stuck down, maybe a"
> +					" problem with the driver's netpoll\n");

... and nobody will see that. It will not even trigger an output.

-Andi


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 13:55     ` Andi Kleen
@ 2005-08-05 14:10       ` Steven Rostedt
  2005-08-05 14:14         ` Andi Kleen
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2005-08-05 14:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ingo Molnar, netdev, linux-kernel, John Bäckstrand

On Fri, 2005-08-05 at 15:55 +0200, Andi Kleen wrote:
> > This is fixing the symptom and is not the cure.  Unfortunately I don't
> > have an e1000 card so I can't try a fix. But I did have an e100 card that
> > would lock up the same way.  The problem was that netpoll_poll calls the
> > card's netpoll routine (e1000_netpoll in e1000_main.c).  In the e100
> > case, when the transmit buffer would fill up, the queue would go down.
> > But the netpoll routine in the e100 code never put it back up after it
> > was all transferred. So this would lock up the kernel when that happened.
> 
> In my case the hang happened when no cable was connected.

But should come back when the cable is reconnected. 

OK, I admit, it shouldn't hang in the first place.

> 
> There is no way to handle this in any other way. You eventually
> have to bail out.
> 
> >  
> >  repeat:
> > -	if(!np || !np->dev || !netif_running(np->dev)) {
> > +	if(try-- == 0 || !np || !np->dev || !netif_running(np->dev)) {
> > +		if (try < 0)
> > +			printk(KERN_WARNING "net driver is stuck down, maybe a"
> > +					" problem with the driver's netpoll\n");
> 
> ... and nobody will see that. It will not even trigger an output.

Since one would be using netconsole, right? :-)   Oops! I forgot that.
Well, it may make it to the logs, since this patch also bails out.
That's why I think your first patch with this warning, as well as a fix
for the e1000, should be submitted, since the e1000 shouldn't lock up
netpoll just because the queue was put down.

Hmm, how bad is it to have a printk in a routine that is registered to
printk?   If this does print, a "static once" variable should be added
so that this is only printed once and not every time it tries to print
this message.
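
A sketch of the "static once" idea in user-space C, assuming a plain stdio stream in place of printk. warn_once is a hypothetical helper written for this illustration, not an existing kernel interface, and it is not safe against concurrent callers as written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write msg to out on the first call only; later calls stay silent.
 * Returns true if the message was actually written.
 */
bool warn_once(FILE *out, const char *msg)
{
	static bool warned; /* zero-initialised: false until the first call */

	assert(out != NULL && msg != NULL);
	if (warned)
		return false;
	warned = true; /* set before writing so a nested call stays quiet */
	fputs(msg, out);
	return true;
}
```

Setting the flag before emitting the message means that a nested call made while the message is being written would be suppressed as well, which is the property wanted for a warning issued from inside a console path.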

-- Steve



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 14:10       ` Steven Rostedt
@ 2005-08-05 14:14         ` Andi Kleen
  2005-08-05 14:27           ` Steven Rostedt
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2005-08-05 14:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Ingo Molnar, netdev, linux-kernel, John Bäckstrand

On Fri, Aug 05, 2005 at 10:10:13AM -0400, Steven Rostedt wrote:
> On Fri, 2005-08-05 at 15:55 +0200, Andi Kleen wrote:
> > > This is fixing the symptom and is not the cure.  Unfortunately I don't
> > > have an e1000 card so I can't try a fix. But I did have an e100 card that
> > > would lock up the same way.  The problem was that netpoll_poll calls the
> > > card's netpoll routine (e1000_netpoll in e1000_main.c).  In the e100
> > > case, when the transmit buffer would fill up, the queue would go down.
> > > But the netpoll routine in the e100 code never put it back up after it
> > > was all transferred. So this would lock up the kernel when that happened.
> > 
> > In my case the hang happened when no cable was connected.
> 
> But should come back when the cable is reconnected. 

Which might be never. Not an option.

> Hmm, how bad is it to have a printk in a routine that is registered to
> printk?   If this does print, a "static once" variable should be added
> so that this is only printed once and not every time it tries to print
> this message.

printk notices it is recursing and will not try to output it.

-Andi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 14:14         ` Andi Kleen
@ 2005-08-05 14:27           ` Steven Rostedt
  2005-08-05 14:36             ` David S. Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2005-08-05 14:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ingo Molnar, netdev, linux-kernel, John Bäckstrand

On Fri, 2005-08-05 at 16:14 +0200, Andi Kleen wrote:
> On Fri, Aug 05, 2005 at 10:10:13AM -0400, Steven Rostedt wrote:
> > On Fri, 2005-08-05 at 15:55 +0200, Andi Kleen wrote:
> > > > This is fixing the symptom and is not the cure.  Unfortunately I don't
> > > > have an e1000 card so I can't try a fix. But I did have an e100 card that
> > > > would lock up the same way.  The problem was that netpoll_poll calls the
> > > > card's netpoll routine (e1000_netpoll in e1000_main.c).  In the e100
> > > > case, when the transmit buffer would fill up, the queue would go down.
> > > > But the netpoll routine in the e100 code never put it back up after it
> > > > was all transferred. So this would lock up the kernel when that happened.
> > > 
> > > In my case the hang happened when no cable was connected.
> > 
> > But should come back when the cable is reconnected. 
> 
> Which might be never. Not an option.

Hey! You removed my admission to this. Don't make me look stupid
here ;-)

> 
> > Hmm, how bad is it to have a printk in a routine that is registered to
> > printk?   If this does print, a "static once" variable should be added
> > so that this is only printed once and not every time it tries to print
> > this message.
> 
> printk notices it is recursing and will not try to output it.

Darn it, since this should really be reported.  Yes, the core netpoll
should bail out, but it is also a problem with the driver and should be
fixed.

Come to think of it, I should have submitted a patch that did what you
did when I discovered the problem with the e100. But that network card
was slow and could easily lock up when doing a sysrq-t.  I wasn't
removing cables, so I just submitted the fix for the e100, not thinking
that the netpoll shouldn't lock up itself.

-- Steve



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 14:27           ` Steven Rostedt
@ 2005-08-05 14:36             ` David S. Miller
  2005-08-05 15:02               ` Steven Rostedt
  0 siblings, 1 reply; 18+ messages in thread
From: David S. Miller @ 2005-08-05 14:36 UTC (permalink / raw)
  To: rostedt; +Cc: ak, mingo, netdev, linux-kernel, sandos

From: Steven Rostedt <rostedt@goodmis.org>
Date: Fri, 05 Aug 2005 10:27:06 -0400

> Darn it, since this should really be reported.  Yes, the core netpoll
> should bail out, but it is also a problem with the driver and should be
> fixed.

I don't get how you can even remotely claim this to
be a problem with the driver.

If there is no cable plugged in, the link never comes
up, and that is a completely normal thing.  The netpoll
code should simply not try forever to wait for the link
to go up.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 14:36             ` David S. Miller
@ 2005-08-05 15:02               ` Steven Rostedt
  0 siblings, 0 replies; 18+ messages in thread
From: Steven Rostedt @ 2005-08-05 15:02 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, mingo, netdev, linux-kernel, sandos

On Fri, 2005-08-05 at 07:36 -0700, David S. Miller wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> Date: Fri, 05 Aug 2005 10:27:06 -0400
> 
> > Darn it, since this should really be reported.  Yes, the core netpoll
> > should bail out, but it is also a problem with the driver and should be
> > fixed.
> 
> I don't get how you can even remotely claim this to
> be a problem with the driver.
> 
> If there is no cable plugged in, the link never comes
> up, and that is a completely normal thing.  The netpoll
> code should simply not try forever to wait for the link
> to go up.

You're right with that case. The problem with the driver is that it
doesn't clean up the transmits if it just happened to overflow the
transmit buffer and shut down the queue.  The netpoll should at least
see that the queue can be brought up again.  That's what I have a
problem with.  

In other words, I see two bugs:

1. The bug with the netpoll.  It locks up if the driver's queue is down
and never comes up. Which is fixed with Andi's patch.

2.  The bug with the driver. Its netpoll doesn't detect that the queue
can come back up again.  With the timeout on netpoll this may no longer
be a bug, since it should clean itself up after netpoll times out and
turns interrupts back on.  But if a timeout is avoidable by netpoll
being a little smarter, then I believe that it should be fixed.

Now do you understand where I'm coming from?

-- Steve

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 11:45 ` Andi Kleen
  2005-08-05 12:44   ` John Bäckstrand
  2005-08-05 13:49   ` Steven Rostedt
@ 2005-08-05 20:12   ` Matt Mackall
  2005-08-05 21:56     ` Andi Kleen
  2 siblings, 1 reply; 18+ messages in thread
From: Matt Mackall @ 2005-08-05 20:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: John Bäckstrand, linux-kernel, netdev

On Fri, Aug 05, 2005 at 01:45:55PM +0200, Andi Kleen wrote:
> John Bäckstrand <sandos@home.se> writes:
> 
> > I've been trying to hunt down a hard lockup issue with some hardware
> > of mine, but I've possibly hit a kernel bug instead. When using
> > netconsole on my e1000, if I unplug the cable and then re-plug it, the
> > machine locks up hard. It manages to print the "link up" message on
> > the screen, but nothing after that. Now, I wonder if this is supposed
> > to be so? I tried this on 4 different configurations, 2.6.13-rc5 and
> > 2.6.12 with and without "noapic acpi=off", same result on all of
> > them. I've tried with 1 and 3 other NICs in the machine at the same
> > time.
> 
> I ran into the same problem some time ago on e1000. The problem was
> that if the link doesn't come up netconsole ends up waiting forever
> for it.

I still don't like this fix. Yes, you're right, it should eventually
give up. But here it gives up way too easily - 5 could easily
translate to 5 microseconds. This is analogous to giving up on serial
transmit if CTS is down for 5 loops.

I'd be much happier if there were some udelay or the like in here so
that we're not giving up on such a short timeframe.
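
To make the suggestion concrete, here is a sketch of a retry loop whose give-up point is expressed in time rather than in loop iterations. It is a user-space model under stated assumptions: send_with_timeout, try_fn, delay_fn and the fake_* helpers are hypothetical names, and the delay callback merely stands in for something like udelay().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*try_fn)(void *ctx);           /* true once the send succeeds */
typedef void (*delay_fn)(unsigned int usec); /* stand-in for udelay() */

/*
 * Retry try_send until it succeeds or about budget_usec of delay has
 * been spent, pausing step_usec between attempts.
 */
bool send_with_timeout(try_fn try_send, void *ctx, delay_fn delay,
		       unsigned int step_usec, unsigned int budget_usec)
{
	unsigned int waited = 0;

	assert(try_send != NULL && delay != NULL && step_usec > 0);
	for (;;) {
		if (try_send(ctx))
			return true;
		if (waited >= budget_usec)
			return false;
		delay(step_usec);
		waited += step_usec;
	}
}

/* Hypothetical test doubles: a fake clock and a link that comes up late. */
static unsigned int fake_elapsed_usec;

void fake_delay(unsigned int usec)
{
	fake_elapsed_usec += usec;
}

bool fake_link_up_after(void *ctx)
{
	const unsigned int *up_at_usec = ctx;

	return fake_elapsed_usec >= *up_at_usec;
}
```

With a 50 usec step and a 1000 usec budget, a link that comes up after 300 usec is caught, while a link that never comes up costs at most about 1000 usec per packet. The budget and step values are illustrative only; the thread does not settle on numbers.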

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 20:12   ` Matt Mackall
@ 2005-08-05 21:56     ` Andi Kleen
  2005-08-05 23:20       ` Matt Mackall
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2005-08-05 21:56 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Andi Kleen, John Bäckstrand, linux-kernel, netdev

> I still don't like this fix. Yes, you're right, it should eventually
> give up. But here it gives up way too easily - 5 could easily
> translate to 5 microseconds. This is analogous to giving up on serial
> transmit if CTS is down for 5 loops.
> 
> I'd be much happier if there were some udelay or the like in here so
> that we're not giving up on such a short timeframe.

Problem is that it could translate to a long aggregate delay
e.g. when the kernel tries to dump the backlog after console_init.
That is why I made the delay so short.

Longer delay would be possible, but then it would need some logic
to detect down links and not delay on them, then retry later, etc.
Would be all far more complicated.

-Andi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 21:56     ` Andi Kleen
@ 2005-08-05 23:20       ` Matt Mackall
  2005-08-05 23:51         ` Andi Kleen
  0 siblings, 1 reply; 18+ messages in thread
From: Matt Mackall @ 2005-08-05 23:20 UTC (permalink / raw)
  To: Andi Kleen; +Cc: John Bäckstrand, linux-kernel, netdev

On Fri, Aug 05, 2005 at 11:56:50PM +0200, Andi Kleen wrote:
> > I still don't like this fix. Yes, you're right, it should eventually
> > give up. But here it gives up way too easily - 5 could easily
> > translate to 5 microseconds. This is analogous to giving up on serial
> > transmit if CTS is down for 5 loops.
> > 
> > I'd be much happier if there were some udelay or the like in here so
> > that we're not giving up on such a short timeframe.
> 
> Problem is that it could translate to a long aggregate delay
> e.g. when the kernel tries to dump the backlog after console_init.
> That is why I made the delay so short.

But why are we in a hurry to dump the backlog on the floor? Why are we
worrying about the performance of netpoll without the cable plugged in
at all? We shouldn't be optimizing the data loss case.

My primary concern here is that the loop have a non-negligible extent
in time. 5 loops is effectively equal to none. I'd be very surprised
if it was even enough for deglitching.

With serial console, we do polled I/O that runs at the serial rate -
milliseconds per line of output.

> Longer delay would be possible, but then it would need some logic
> to detect down links and not delay on them, then retry later, etc.
> Would be all far more complicated.

I think we could probably have subsequent failures be much shorter
without too much added complexity. But I'm not sure it matters.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 23:20       ` Matt Mackall
@ 2005-08-05 23:51         ` Andi Kleen
  2005-08-06  1:22           ` Matt Mackall
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2005-08-05 23:51 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Andi Kleen, John Bäckstrand, linux-kernel, netdev

> But why are we in a hurry to dump the backlog on the floor? Why are we
> worrying about the performance of netpoll without the cable plugged in
> at all? We shouldn't be optimizing the data loss case.

Because a system shouldn't stall for minutes (or forever like right now) 
at boot just because the network cable isn't plugged in.

> 
> My primary concern here is that the loop have a non-negligible extent
> in time. 5 loops is effectively equal to none. I'd be very surprised
> if it was even enough for deglitching.

In the normal case the packets should just be sent out.

-Andi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 23:51         ` Andi Kleen
@ 2005-08-06  1:22           ` Matt Mackall
  2005-08-06  1:37             ` Daniel Phillips
  0 siblings, 1 reply; 18+ messages in thread
From: Matt Mackall @ 2005-08-06  1:22 UTC (permalink / raw)
  To: Andi Kleen; +Cc: John Bäckstrand, linux-kernel, netdev

On Sat, Aug 06, 2005 at 01:51:22AM +0200, Andi Kleen wrote:
> > But why are we in a hurry to dump the backlog on the floor? Why are we
> > worrying about the performance of netpoll without the cable plugged in
> > at all? We shouldn't be optimizing the data loss case.
> 
> Because a system shouldn't stall for minutes (or forever like right now) 
> at boot just because the network cable isn't plugged in.

Using netconsole without a network cable could well be classified as a
serious configuration error. NFS also is a bit sluggish without a
network cable.

I've already agreed that forever is a problem. Can we work towards
agreeing on a non-trivial timeout, please?

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-06  1:22           ` Matt Mackall
@ 2005-08-06  1:37             ` Daniel Phillips
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel Phillips @ 2005-08-06  1:37 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Andi Kleen, John Bäckstrand, linux-kernel, netdev

On Saturday 06 August 2005 11:22, Matt Mackall wrote:
> On Sat, Aug 06, 2005 at 01:51:22AM +0200, Andi Kleen wrote:
> > > But why are we in a hurry to dump the backlog on the floor? Why are we
> > > worrying about the performance of netpoll without the cable plugged in
> > > at all? We shouldn't be optimizing the data loss case.
> >
> > Because a system shouldn't stall for minutes (or forever like right now)
> > at boot just because the network cable isn't plugged in.
>
> Using netconsole without a network cable could well be classified as a
> serious configuration error.

But please don't.  An OS that slows to a crawl or crashes because a cable
isn't plugged in is an OS that deserves to be ridiculed.  Silly timeouts on boot
are scary and a waste of the user's time.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-05 13:49   ` Steven Rostedt
  2005-08-05 13:55     ` Andi Kleen
@ 2005-08-07 21:12     ` John Bäckstrand
  2005-08-08  2:29       ` Steven Rostedt
  1 sibling, 1 reply; 18+ messages in thread
From: John Bäckstrand @ 2005-08-07 21:12 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Andi Kleen, Ingo Molnar, netdev, linux-kernel

Steven Rostedt wrote:
> I don't have the card, so I can't test it. But if this works (after
> removing the previous patch) then this is the better solution. 

I can confirm that this alone does not work for the simple 
unplug/re-plug cycle I described, it still locks up hard. Tried this 
alone on -rc6.

---
John Bäckstrand

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: lockups with netconsole on e1000 on media insertion
  2005-08-07 21:12     ` John Bäckstrand
@ 2005-08-08  2:29       ` Steven Rostedt
  0 siblings, 0 replies; 18+ messages in thread
From: Steven Rostedt @ 2005-08-08  2:29 UTC (permalink / raw)
  To: John Bäckstrand
  Cc: Matt Mackall, Andi Kleen, Ingo Molnar, netdev, linux-kernel

On Sun, 2005-08-07 at 23:12 +0200, John Bäckstrand wrote:
> Steven Rostedt wrote:
> > I don't have the card, so I can't test it. But if this works (after
> > removing the previous patch) then this is the better solution. 
> 
> I can confirm that this alone does not work for the simple 
> unplug/re-plug cycle I described, it still locks up hard. Tried this 
> alone on -rc6.

Darn it.  If I had an e1000 I could debug it. I have other methods of
logging than printks in all their varieties (see relayfs and friends).
I still believe that e1000_netpoll is not turning on the queue for
some reason and that netpoll_send_skb is locking up because of that,
especially since Andi's patch fixes the problem.

e1000_clean_tx_irq, which I added to the e1000_netpoll call, has the
following lines:

        if(unlikely(cleaned && netif_queue_stopped(netdev) &&
                    netif_carrier_ok(netdev)))
                netif_wake_queue(netdev);

netif_queue_stopped is true, since that is what causes the looping in
netpoll_send_skb.  So either it didn't clean any buffers (cleaned is
false) or netif_carrier_ok is false.  I don't know what the e1000 does
when you pull the cable while it's transmitting; does it call
e1000_down? If so, that could cause the carrier_ok check to fail.
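
The elimination argument above can be written down as a small decision helper. This is a sketch with plain booleans in place of the driver state; tx_state, wake_blocker and why_no_wake are hypothetical names and do not exist in the e1000 driver. It only restates the three-way condition quoted from e1000_clean_tx_irq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the three inputs to the wake test. */
struct tx_state {
	bool cleaned;       /* did the clean pass reclaim any buffers? */
	bool queue_stopped; /* netif_queue_stopped(netdev) */
	bool carrier_ok;    /* netif_carrier_ok(netdev) */
};

enum wake_blocker {
	WAKE_WOULD_HAPPEN,
	NOTHING_TO_WAKE,         /* queue is not stopped */
	BLOCKED_NOTHING_CLEANED, /* cleaned is false */
	BLOCKED_NO_CARRIER       /* carrier check is false */
};

/* Report which term keeps the queue from being woken. */
enum wake_blocker why_no_wake(const struct tx_state *s)
{
	assert(s != NULL);
	if (!s->queue_stopped)
		return NOTHING_TO_WAKE;
	if (!s->cleaned)
		return BLOCKED_NOTHING_CLEANED;
	if (!s->carrier_ok)
		return BLOCKED_NO_CARRIER;
	return WAKE_WOULD_HAPPEN;
}
```

Since the queue is known to be stopped in the lockup, only the last two outcomes remain, which is why logging the cleaned count and the carrier state on real hardware would narrow the problem down.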

Oh well, someone with an e1000 card will need to look into this. The
problem should be easily found.  Good luck.

-- Steve

^ permalink raw reply	[flat|nested] 18+ messages in thread
