* lockups with netconsole on e1000 on media insertion
From: John Bäckstrand @ 2005-08-05 11:04 UTC
To: linux-kernel

I've been trying to hunt down a hard lockup issue with some hardware
of mine, but I've possibly hit a kernel bug instead. When using
netconsole on my e1000, if I unplug the cable and then re-plug it, the
machine locks up hard. It manages to print the "link up" message on
the screen, but nothing after that. Now, I wonder if this is supposed
to be so? I tried this on 4 different configurations, 2.6.13-rc5 and
2.6.12 with and without "noapic acpi=off", same result on all of
them. I've tried with 1 and 3 other NICs in the machine at the same
time. It seems to be working fine on other NICs, such as rtl8139 and
3c59x.

Any ideas on how to debug this further? (Btw, is there an easy way of
"inserting" dmesg messages manually?)

---
John Bäckstrand
* Re: lockups with netconsole on e1000 on media insertion
From: Andi Kleen @ 2005-08-05 11:45 UTC
To: John Bäckstrand; +Cc: linux-kernel, netdev

John Bäckstrand <sandos@home.se> writes:

> I've been trying to hunt down a hard lockup issue with some hardware
> of mine, but I've possibly hit a kernel bug instead. When using
> netconsole on my e1000, if I unplug the cable and then re-plug it, the
> machine locks up hard. It manages to print the "link up" message on
> the screen, but nothing after that. Now, I wonder if this is supposed
> to be so? I tried this on 4 different configurations, 2.6.13-rc5 and
> 2.6.12 with and without "noapic acpi=off", same result on all of
> them. I've tried with 1 and 3 other NICs in the machine at the same
> time.

I ran into the same problem some time ago on e1000. The problem was
that if the link doesn't come up netconsole ends up waiting forever
for it.

The patch was for 2.6.12, did a quick untested port to 2.6.13rc5.

-Andi

Only try a limited number of times to send packets in netpoll

Avoids hangs on e1000 when link is not up.

Signed-off-by: Andi Kleen <ak@suse.de>

Index: linux/net/core/netpoll.c
===================================================================
--- linux.orig/net/core/netpoll.c
+++ linux/net/core/netpoll.c
@@ -247,9 +247,11 @@ static void netpoll_send_skb(struct netp
 {
 	int status;
 	struct netpoll_info *npinfo;
+	/* Only try 5 times in case the link is down etc. */
+	int try = 5;
 
 repeat:
-	if(!np || !np->dev || !netif_running(np->dev)) {
+	if(try-- == 0 || !np || !np->dev || !netif_running(np->dev)) {
 		__kfree_skb(skb);
 		return;
 	}
@@ -286,6 +288,9 @@ repeat:
 
 	/* transmit busy */
 	if(status) {
+		/* Don't count spinlock as try */
+		if (status == NETDEV_TX_LOCKED)
+			try++;
 		netpoll_poll(np);
 		goto repeat;
 	}
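The idea in the patch above, retrying a bounded number of times instead of looping until the link comes up, can be sketched in user-space C. This is only an illustration and not the kernel code: send_bounded, xmit_fn, the tx_status values and the two test doubles are hypothetical names invented for this sketch. As in the patch, attempts that fail only because of lock contention are not charged against the budget.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes standing in for a driver's transmit result. */
enum tx_status { TX_OK = 0, TX_BUSY = 1, TX_LOCKED = -1 };

/* Hypothetical transmit callback: one call is one transmit attempt. */
typedef enum tx_status (*xmit_fn)(void *ctx);

/*
 * Attempt to transmit until it succeeds or max_tries busy results have
 * been seen. TX_LOCKED results are retried without using up the budget,
 * so (as in the patch) a lock that is never released would still spin.
 * Returns true on success, false when the caller should drop the packet.
 * *attempts receives the total number of calls made to xmit.
 */
bool send_bounded(xmit_fn xmit, void *ctx, int max_tries, int *attempts)
{
	int tries = max_tries;

	assert(xmit != NULL && attempts != NULL && max_tries > 0);
	*attempts = 0;

	while (tries > 0) {
		enum tx_status st = xmit(ctx);

		(*attempts)++;
		if (st == TX_OK)
			return true;
		if (st != TX_LOCKED)
			tries--;	/* only "busy" counts as a try */
	}
	return false;
}

/* Test double: a link that never comes up. */
enum tx_status link_down(void *ctx)
{
	(void)ctx;
	return TX_BUSY;
}

/* Test double: busy for *ctx more calls, then succeeds. */
enum tx_status busy_then_ok(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return TX_BUSY;
	}
	return TX_OK;
}
```

With a link that never comes up, send_bounded(link_down, NULL, 5, &attempts) gives up after five calls instead of hanging, which is the behaviour the patch is after.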
* Re: lockups with netconsole on e1000 on media insertion
From: John Bäckstrand @ 2005-08-05 12:44 UTC
To: Andi Kleen; +Cc: linux-kernel, netdev

Andi Kleen wrote:
> The patch was for 2.6.12, did a quick untested port to 2.6.13rc5.
>
> -Andi
>
> Only try a limited number of times to send packets in netpoll

Thanks, worked nicely!

---
John Bäckstrand
* Re: lockups with netconsole on e1000 on media insertion
From: Steven Rostedt @ 2005-08-05 13:49 UTC
To: Andi Kleen; +Cc: Ingo Molnar, netdev, linux-kernel, John Bäckstrand

On Fri, 2005-08-05 at 13:45 +0200, Andi Kleen wrote:
> John Bäckstrand <sandos@home.se> writes:
>
> > I've been trying to hunt down a hard lockup issue with some hardware
> > of mine, but I've possibly hit a kernel bug instead. When using
> > netconsole on my e1000, if I unplug the cable and then re-plug it, the
> > machine locks up hard. It manages to print the "link up" message on
> > the screen, but nothing after that. Now, I wonder if this is supposed
> > to be so? I tried this on 4 different configurations, 2.6.13-rc5 and
> > 2.6.12 with and without "noapic acpi=off", same result on all of
> > them. I've tried with 1 and 3 other NICs in the machine at the same
> > time.
>
> I ran into the same problem some time ago on e1000. The problem was
> that if the link doesn't come up netconsole ends up waiting forever
> for it.
>
> The patch was for 2.6.12, did a quick untested port to 2.6.13rc5.
>
> -Andi
>
> Only try a limited number of times to send packets in netpoll
>
> Avoids hangs on e1000 when link is not up.
>
> Signed-off-by: Andi Kleen <ak@suse.de>
>
> Index: linux/net/core/netpoll.c
> ===================================================================
> --- linux.orig/net/core/netpoll.c
> +++ linux/net/core/netpoll.c
> @@ -247,9 +247,11 @@ static void netpoll_send_skb(struct netp
>  {
>  	int status;
>  	struct netpoll_info *npinfo;
> +	/* Only try 5 times in case the link is down etc. */
> +	int try = 5;
>
>  repeat:
> -	if(!np || !np->dev || !netif_running(np->dev)) {
> +	if(try-- == 0 || !np || !np->dev || !netif_running(np->dev)) {
>  		__kfree_skb(skb);
>  		return;
>  	}
> @@ -286,6 +288,9 @@ repeat:
>
>  	/* transmit busy */
>  	if(status) {
> +		/* Don't count spinlock as try */
> +		if (status == NETDEV_TX_LOCKED)
> +			try++;
>  		netpoll_poll(np);
>  		goto repeat;
>  	}

This is fixing the symptom and is not the cure. Unfortunately I don't
have an e1000 card so I can't try a fix. But I did have an e100 card that
would lock up the same way. The problem was that netpoll_poll calls the
card's netpoll routine (in e1000_main.c e1000_netpoll). In the e100
case, when the transmit buffer would fill up, the queue would go down.
But the netpoll routine in the e100 code never put it back up after it
was all transferred. So this would lock up the kernel when that happened.

I believe that the e1000 is suffering the same problem, but I can't fix
it since I don't have an e1000 to test. What probably needs to be
done is to check whether the transmit buffer can be cleaned and the
queue brought back up.

e1000_netpoll calls e1000_intr which looks like this:

static irqreturn_t
e1000_intr(int irq, void *data, struct pt_regs *regs)
{
	struct net_device *netdev = data;
	struct e1000_adapter *adapter = netdev_priv(netdev);
	struct e1000_hw *hw = &adapter->hw;
	uint32_t icr = E1000_READ_REG(hw, ICR);
#ifndef CONFIG_E1000_NAPI
	unsigned int i;
#endif

	if(unlikely(!icr))
		return IRQ_NONE;  /* Not our interrupt */

                ^^^^^^^^
   ---- Here I'm wondering whether, in the netpoll case, this is returned?

	if(unlikely(icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))) {
		hw->get_link_status = 1;
		mod_timer(&adapter->watchdog_timer, jiffies);
	}

#ifdef CONFIG_E1000_NAPI
	if(likely(netif_rx_schedule_prep(netdev))) {

		/* Disable interrupts and register for poll. The flush
		  of the posted write is intentionally left out.
		*/

		atomic_inc(&adapter->irq_sem);
		E1000_WRITE_REG(hw, IMC, ~0);
		__netif_rx_schedule(netdev);
	}
#else
	/* Writing IMC and IMS is needed for 82547.
	   Due to Hub Link bus being occupied, an interrupt
	   de-assertion message is not able to be sent.
	   When an interrupt assertion message is generated later,
	   two messages are re-ordered and sent out.
	   That causes APIC to think 82547 is in de-assertion
	   state, while 82547 is in assertion state, resulting
	   in dead lock. Writing IMC forces 82547 into
	   de-assertion state.
	*/
	if(hw->mac_type == e1000_82547 || hw->mac_type == e1000_82547_rev_2){
		atomic_inc(&adapter->irq_sem);
		E1000_WRITE_REG(hw, IMC, ~0);
	}

	for(i = 0; i < E1000_MAX_INTR; i++)
		if(unlikely(!adapter->clean_rx(adapter) &
		   !e1000_clean_tx_irq(adapter)))
		    ^^^^^
   ---- This should clean the transmit buffer, but it may not get here.
			break;

	if(hw->mac_type == e1000_82547 || hw->mac_type == e1000_82547_rev_2)
		e1000_irq_enable(adapter);
#endif

	return IRQ_HANDLED;
}

So maybe the patch should be something like:

--- linux-2.6.13-rc3/drivers/net/e1000/e1000_main.c.orig	2005-08-05 09:32:01.000000000 -0400
+++ linux-2.6.13-rc3/drivers/net/e1000/e1000_main.c	2005-08-05 09:33:56.000000000 -0400
@@ -3816,6 +3816,7 @@ e1000_netpoll(struct net_device *netdev)
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 	disable_irq(adapter->pdev->irq);
 	e1000_intr(adapter->pdev->irq, netdev, NULL);
+	e1000_clean_tx_irq(adapter);
 	enable_irq(adapter->pdev->irq);
 }
 #endif

I don't have the card, so I can't test it. But if this works (after
removing the previous patch) then this is the better solution.

If this does work, then we should probably add the timeout in netpoll
with a warning that the netpoll of the driver is broken, so we know
where the problem is. Here's a modified version of the other patch:

#### John, delete this part if you apply the above. ####

--- linux-2.6.13-rc3/net/core/netpoll.c.orig	2005-08-05 09:37:00.000000000 -0400
+++ linux-2.6.13-rc3/net/core/netpoll.c	2005-08-05 09:44:19.000000000 -0400
@@ -247,9 +247,14 @@ static void netpoll_send_skb(struct netp
 {
 	int status;
 	struct netpoll_info *npinfo;
+	/* only try five times in case link is down */
+	int try=5;
 
 repeat:
-	if(!np || !np->dev || !netif_running(np->dev)) {
+	if(try-- == 0 || !np || !np->dev || !netif_running(np->dev)) {
+		if (!try)
+			printk(KERN_WARNING "net driver is stuck down, maybe a"
+					" problem with the driver's netpoll\n");
 		__kfree_skb(skb);
 		return;
 	}
@@ -286,6 +291,9 @@ repeat:
 
 	/* transmit busy */
 	if(status) {
+		/* Don't count spinlock as try */
+		if (status == NETDEV_TX_LOCKED)
+			try++;
 		netpoll_poll(np);
 		goto repeat;
 	}

-- Steve
* Re: lockups with netconsole on e1000 on media insertion
From: Andi Kleen @ 2005-08-05 13:55 UTC
To: Steven Rostedt; +Cc: Andi Kleen, Ingo Molnar, netdev, linux-kernel, John Bäckstrand

> This is fixing the symptom and is not the cure. Unfortunately I don't
> have an e1000 card so I can't try a fix. But I did have an e100 card that
> would lock up the same way. The problem was that netpoll_poll calls the
> card's netpoll routine (in e1000_main.c e1000_netpoll). In the e100
> case, when the transmit buffer would fill up, the queue would go down.
> But the netpoll routine in the e100 code never put it back up after it
> was all transferred. So this would lock up the kernel when that happened.

In my case the hang happened when no cable was connected.

There is no way to handle this in any other way. You eventually
have to bail out.

>
>  repeat:
> -	if(!np || !np->dev || !netif_running(np->dev)) {
> +	if(try-- == 0 || !np || !np->dev || !netif_running(np->dev)) {
> +		if (!try)
> +			printk(KERN_WARNING "net driver is stuck down, maybe a"
> +					" problem with the driver's netpoll\n");

... and nobody will see that. It will not even trigger an output.

-Andi
* Re: lockups with netconsole on e1000 on media insertion
From: Steven Rostedt @ 2005-08-05 14:10 UTC
To: Andi Kleen; +Cc: Ingo Molnar, netdev, linux-kernel, John Bäckstrand

On Fri, 2005-08-05 at 15:55 +0200, Andi Kleen wrote:
> > This is fixing the symptom and is not the cure. Unfortunately I don't
> > have an e1000 card so I can't try a fix. But I did have an e100 card that
> > would lock up the same way. The problem was that netpoll_poll calls the
> > card's netpoll routine (in e1000_main.c e1000_netpoll). In the e100
> > case, when the transmit buffer would fill up, the queue would go down.
> > But the netpoll routine in the e100 code never put it back up after it
> > was all transferred. So this would lock up the kernel when that happened.
>
> In my case the hang happened when no cable was connected.

But should come back when the cable is reconnected. OK, I admit, it
shouldn't hang in the first place.

>
> There is no way to handle this in any other way. You eventually
> have to bail out.
>
> >
> >  repeat:
> > -	if(!np || !np->dev || !netif_running(np->dev)) {
> > +	if(try-- == 0 || !np || !np->dev || !netif_running(np->dev)) {
> > +		if (!try)
> > +			printk(KERN_WARNING "net driver is stuck down, maybe a"
> > +					" problem with the driver's netpoll\n");
>
> ... and nobody will see that. It will not even trigger an output.

Since one would be using net console right? :-) Oops! I forgot that.
Well it may make it to the logs, since this patch also bails out.
That's why I think your first patch with this warning as well as a fix
for the e1000 should be submitted, since the e1000 shouldn't lock up
netpoll just because the queue was put down.

Hmm, how bad is it to have a printk in a routine that is registered to
printk? If this does print, a "static once" variable should be added
so that this is only printed once and not every time it tries to print
this message.

-- Steve
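The "static once" suggestion can be sketched in user-space C as follows. This is an illustration under stated assumptions, not kernel code: drop_packet and struct send_stats are hypothetical names, and fprintf stands in for printk. The sketch keys the warning off an explicit timed_out flag; in the quoted hunk the counter has already been post-decremented to -1 by the time `!try` is evaluated on the timeout path, so an explicit flag is the easier thing to reason about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical bookkeeping so the behaviour can be observed from outside. */
struct send_stats {
	int dropped;	/* packets given up on */
	int warnings;	/* warnings actually printed */
};

/*
 * Called whenever a packet is dropped. The warning is printed only when
 * the drop was caused by a retry timeout, and only the first time.
 */
void drop_packet(struct send_stats *stats, bool timed_out)
{
	static bool warned;	/* static: keeps its value across calls */

	assert(stats != NULL);
	stats->dropped++;

	if (timed_out && !warned) {
		warned = true;
		fprintf(stderr, "net driver is stuck down, maybe a"
				" problem with the driver's netpoll\n");
		stats->warnings++;
	}
}
```

Every drop is still counted, but however many times the timeout path runs, the message goes out at most once per process lifetime.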
* Re: lockups with netconsole on e1000 on media insertion
From: Andi Kleen @ 2005-08-05 14:14 UTC
To: Steven Rostedt; +Cc: Andi Kleen, Ingo Molnar, netdev, linux-kernel, John Bäckstrand

On Fri, Aug 05, 2005 at 10:10:13AM -0400, Steven Rostedt wrote:
> On Fri, 2005-08-05 at 15:55 +0200, Andi Kleen wrote:
> > > This is fixing the symptom and is not the cure. Unfortunately I don't
> > > have an e1000 card so I can't try a fix. But I did have an e100 card that
> > > would lock up the same way. The problem was that netpoll_poll calls the
> > > card's netpoll routine (in e1000_main.c e1000_netpoll). In the e100
> > > case, when the transmit buffer would fill up, the queue would go down.
> > > But the netpoll routine in the e100 code never put it back up after it
> > > was all transferred. So this would lock up the kernel when that happened.
> >
> > In my case the hang happened when no cable was connected.
>
> But should come back when the cable is reconnected.

Which might be never. Not an option.

> Hmm, how bad is it to have a printk in a routine that is registered to
> printk? If this does print, a "static once" variable should be added
> so that this is only printed once and not every time it tries to print
> this message.

printk notices it is recursing and will not try to output it.

-Andi
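Why a warning issued from inside the console's own transmit path never reaches the console can be illustrated with a generic reentrancy guard. This is a user-space sketch of the general pattern only; it makes no claim about how printk implements its recursion check, and struct console, log_line and failing_emit are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical console whose emit() hook may itself try to log. */
struct console {
	bool busy;		/* true while a message is being emitted */
	int emitted;		/* messages handed to emit() */
	int suppressed;		/* messages dropped because of re-entry */
	void (*emit)(struct console *con, const char *msg);
};

/* Log one line; a nested call made from within emit() is dropped. */
void log_line(struct console *con, const char *msg)
{
	assert(con != NULL && con->emit != NULL);

	if (con->busy) {
		con->suppressed++;
		return;
	}
	con->busy = true;
	con->emit(con, msg);
	con->emitted++;
	con->busy = false;
}

/* Test double: a transmit path that fails and tries to warn about it. */
void failing_emit(struct console *con, const char *msg)
{
	(void)msg;
	log_line(con, "net driver is stuck down");
}
```

Here the outer message is emitted once and the nested warning is counted as suppressed, which mirrors the objection that nobody would see the warning on the console that is failing.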
* Re: lockups with netconsole on e1000 on media insertion
From: Steven Rostedt @ 2005-08-05 14:27 UTC
To: Andi Kleen; +Cc: Ingo Molnar, netdev, linux-kernel, John Bäckstrand

On Fri, 2005-08-05 at 16:14 +0200, Andi Kleen wrote:
> On Fri, Aug 05, 2005 at 10:10:13AM -0400, Steven Rostedt wrote:
> > On Fri, 2005-08-05 at 15:55 +0200, Andi Kleen wrote:
> > > > This is fixing the symptom and is not the cure. Unfortunately I don't
> > > > have an e1000 card so I can't try a fix. But I did have an e100 card that
> > > > would lock up the same way. The problem was that netpoll_poll calls the
> > > > card's netpoll routine (in e1000_main.c e1000_netpoll). In the e100
> > > > case, when the transmit buffer would fill up, the queue would go down.
> > > > But the netpoll routine in the e100 code never put it back up after it
> > > > was all transferred. So this would lock up the kernel when that happened.
> > >
> > > In my case the hang happened when no cable was connected.
> >
> > But should come back when the cable is reconnected.
>
> Which might be never. Not an option.

Hey! You removed my admission to this. Don't make me look stupid
here ;-)

>
> > Hmm, how bad is it to have a printk in a routine that is registered to
> > printk? If this does print, a "static once" variable should be added
> > so that this is only printed once and not every time it tries to print
> > this message.
>
> printk notices it is recursing and will not try to output it.

Darn it, since this should really be reported. Yes, the core netpoll
should bail out, but it is also a problem with the driver and should be
fixed.

Come to think of it, I should have submitted a patch that did what you
did when I discovered the problem with the e100. But that network card
was slow and could easily lock up when doing a sysrq-t. I wasn't
removing cables, so I just submitted the fix for the e100, not thinking
that the netpoll shouldn't lock up itself.

-- Steve
* Re: lockups with netconsole on e1000 on media insertion
From: David S. Miller @ 2005-08-05 14:36 UTC
To: rostedt; +Cc: ak, mingo, netdev, linux-kernel, sandos

From: Steven Rostedt <rostedt@goodmis.org>
Date: Fri, 05 Aug 2005 10:27:06 -0400

> Darn it, since this should really be reported. Yes, the core netpoll
> should bail out, but it is also a problem with the driver and should be
> fixed.

I don't get how you can even remotely claim this to
be a problem with the driver.

If there is no cable plugged in, the link never comes
up, and that is a completely normal thing. The netpoll
code should simply not try forever to wait for the link
to go up.
* Re: lockups with netconsole on e1000 on media insertion
From: Steven Rostedt @ 2005-08-05 15:02 UTC
To: David S. Miller; +Cc: ak, mingo, netdev, linux-kernel, sandos

On Fri, 2005-08-05 at 07:36 -0700, David S. Miller wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> Date: Fri, 05 Aug 2005 10:27:06 -0400
>
> > Darn it, since this should really be reported. Yes, the core netpoll
> > should bail out, but it is also a problem with the driver and should be
> > fixed.
>
> I don't get how you can even remotely claim this to
> be a problem with the driver.
>
> If there is no cable plugged in, the link never comes
> up, and that is a completely normal thing. The netpoll
> code should simply not try forever to wait for the link
> to go up.

You're right with that case. The problem with the driver is that it
doesn't clean up the transmits if it just happened to overflow the
transmit buffer and shut down the queue. The netpoll should at least
see that the queue can be brought up again. That's what I have a
problem with.

In other words, I see two bugs:

1. The bug with the netpoll. It locks up if the driver's queue is down
   and never comes up. Which is fixed with Andi's patch.

2. The bug with the driver. Its netpoll doesn't detect that the queue
   can come back up again. With the timeout on netpoll this may no
   longer be a bug, since it should clean itself up after netpoll times
   out and turns interrupts back on. But if a timeout is avoidable by
   netpoll being a little smarter, then I believe that it should be
   fixed.

Now do you understand where I'm coming from?

-- Steve
* Re: lockups with netconsole on e1000 on media insertion
From: John Bäckstrand @ 2005-08-07 21:12 UTC
To: Steven Rostedt; +Cc: Andi Kleen, Ingo Molnar, netdev, linux-kernel

Steven Rostedt wrote:
> I don't have the card, so I can't test it. But if this works (after
> removing the previous patch) then this is the better solution.

I can confirm that this alone does not work for the simple
unplug/re-plug cycle I described, it still locks up hard. Tried this
alone on -rc6.

---
John Bäckstrand
* Re: lockups with netconsole on e1000 on media insertion
From: Steven Rostedt @ 2005-08-08 2:29 UTC
To: John Bäckstrand; +Cc: Matt Mackall, Andi Kleen, Ingo Molnar, netdev, linux-kernel

On Sun, 2005-08-07 at 23:12 +0200, John Bäckstrand wrote:
> Steven Rostedt wrote:
> > I don't have the card, so I can't test it. But if this works (after
> > removing the previous patch) then this is the better solution.
>
> I can confirm that this alone does not work for the simple
> unplug/re-plug cycle I described, it still locks up hard. Tried this
> alone on -rc6.

Darn it. If I had an e1000 I could debug it. I have other methods of
logging than printks in all their varieties (see relayfs and friends).

I still believe that the e1000_netpoll is not turning on the queue for
some reason and the netpoll_send_skb is locking up because of that,
especially since Andi's patch fixes the problem.

e1000_clean_tx_irq, which I added to the e1000_netpoll call, has the
following lines:

	if(unlikely(cleaned && netif_queue_stopped(netdev) &&
	            netif_carrier_ok(netdev)))
		netif_wake_queue(netdev);

The netif_queue_stopped is true, since that causes the looping in
netpoll_send_skb. So either it didn't clean any buffers (cleaned is
false) or netif_carrier_ok is false. I don't know what the e1000 does
when you pull the cable while it's transmitting: does it call
e1000_down? If so, it could cause the carrier_ok check to fail.

Oh well, someone with an e1000 card will need to look into this. The
problem should be easily found. Good luck.

-- Steve
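The wake condition discussed in this message can be modelled with a toy transmit ring in user-space C. This is a sketch under stated assumptions, not the e1000 driver: struct fake_nic, fake_xmit and fake_clean_tx are hypothetical, and the only thing taken from the thread is the shape of the test (cleaned, queue stopped, carrier ok). It shows both ways the queue can stay stopped: nothing was cleaned, or the carrier is reported down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical NIC model with a fixed-size transmit ring. */
struct fake_nic {
	int ring_size;		/* descriptors in the transmit ring */
	int in_flight;		/* descriptors handed to the "hardware" */
	int completed;		/* of those, how many have finished */
	bool queue_stopped;	/* transmit queue state seen by the stack */
	bool carrier_ok;	/* link state */
};

/* Queue one packet. Returns 0 on success, -1 if the queue is stopped. */
int fake_xmit(struct fake_nic *nic)
{
	assert(nic != NULL);
	if (nic->queue_stopped)
		return -1;
	nic->in_flight++;
	if (nic->in_flight == nic->ring_size)
		nic->queue_stopped = true;	/* ring full: stop the queue */
	return 0;
}

/*
 * Reclaim finished descriptors. The queue is woken only if something
 * was cleaned, the queue was stopped, and the carrier is up.
 * Returns whether anything was cleaned.
 */
bool fake_clean_tx(struct fake_nic *nic)
{
	bool cleaned;

	assert(nic != NULL && nic->completed <= nic->in_flight);
	cleaned = nic->completed > 0;
	nic->in_flight -= nic->completed;
	nic->completed = 0;

	if (cleaned && nic->queue_stopped && nic->carrier_ok)
		nic->queue_stopped = false;	/* wake the queue */
	return cleaned;
}
```

In this model, a poll routine that never calls fake_clean_tx leaves a full ring stopped forever, and calling it while carrier_ok is false reclaims descriptors but still does not wake the queue, which matches the two possibilities listed above.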
* Re: lockups with netconsole on e1000 on media insertion
From: Matt Mackall @ 2005-08-05 20:12 UTC
To: Andi Kleen; +Cc: John Bäckstrand, linux-kernel, netdev

On Fri, Aug 05, 2005 at 01:45:55PM +0200, Andi Kleen wrote:
> John Bäckstrand <sandos@home.se> writes:
>
> > I've been trying to hunt down a hard lockup issue with some hardware
> > of mine, but I've possibly hit a kernel bug instead. When using
> > netconsole on my e1000, if I unplug the cable and then re-plug it, the
> > machine locks up hard. It manages to print the "link up" message on
> > the screen, but nothing after that. Now, I wonder if this is supposed
> > to be so? I tried this on 4 different configurations, 2.6.13-rc5 and
> > 2.6.12 with and without "noapic acpi=off", same result on all of
> > them. I've tried with 1 and 3 other NICs in the machine at the same
> > time.
>
> I ran into the same problem some time ago on e1000. The problem was
> that if the link doesn't come up netconsole ends up waiting forever
> for it.

I still don't like this fix. Yes, you're right, it should eventually
give up. But here it gives up way too easily - 5 could easily
translate to 5 microseconds. This is analogous to giving up on serial
transmit if CTS is down for 5 loops.

I'd be much happier if there were some udelay or the like in here so
that we're not giving up on such a short timeframe.

-- 
Mathematics is the supreme nostalgia of our time.
* Re: lockups with netconsole on e1000 on media insertion
From: Andi Kleen @ 2005-08-05 21:56 UTC
To: Matt Mackall; +Cc: Andi Kleen, John Bäckstrand, linux-kernel, netdev

> I still don't like this fix. Yes, you're right, it should eventually
> give up. But here it gives up way too easily - 5 could easily
> translate to 5 microseconds. This is analogous to giving up on serial
> transmit if CTS is down for 5 loops.
>
> I'd be much happier if there were some udelay or the like in here so
> that we're not giving up on such a short timeframe.

Problem is that it could translate to a long aggregate delay
e.g. when the kernel tries to dump the backlog after console_init.
That is why I made the delay so short.

Longer delay would be possible, but then it would need some logic
to detect down links and not delay on them and then retry later etc.
Would be all far more complicated.

-Andi
* Re: lockups with netconsole on e1000 on media insertion
From: Matt Mackall @ 2005-08-05 23:20 UTC
To: Andi Kleen; +Cc: John Bäckstrand, linux-kernel, netdev

On Fri, Aug 05, 2005 at 11:56:50PM +0200, Andi Kleen wrote:
> > I still don't like this fix. Yes, you're right, it should eventually
> > give up. But here it gives up way too easily - 5 could easily
> > translate to 5 microseconds. This is analogous to giving up on serial
> > transmit if CTS is down for 5 loops.
> >
> > I'd be much happier if there were some udelay or the like in here so
> > that we're not giving up on such a short timeframe.
>
> Problem is that it could translate to a long aggregate delay
> e.g. when the kernel tries to dump the backlog after console_init.
> That is why I made the delay so short.

But why are we in a hurry to dump the backlog on the floor? Why are we
worrying about the performance of netpoll without the cable plugged in
at all? We shouldn't be optimizing the data loss case.

My primary concern here is that the loop have a non-negligible extent
in time. 5 loops is effectively equal to none. I'd be very surprised
if it was even enough for deglitching.

With serial console, we do polled I/O that runs at the serial rate -
milliseconds per line of output.

> Longer delay would be possible, but then it would need some logic
> to detect down links and not delay on them and then retry later etc.
> Would be all far more complicated.

I think we could probably have subsequent failures be much shorter
without too much added complexity. But I'm not sure it matters.

-- 
Mathematics is the supreme nostalgia of our time.
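One way to read the two positions in this exchange together is a retry loop bounded by elapsed time rather than by iteration count, with a shorter budget once the link has already been seen down. The following user-space C sketch is an assumption-laden illustration, not a proposed kernel patch: wait_for_tx, struct retry_policy, the callback types and the microsecond values are all hypothetical, and the delay callback merely stands in for something like udelay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callbacks: "can we transmit now?" and "busy-wait a bit". */
typedef bool (*ready_fn)(void *ctx);
typedef void (*delay_fn)(unsigned int usecs, void *ctx);

/* Hypothetical policy: a long budget first, a short one after a failure. */
struct retry_policy {
	unsigned int first_budget_us;	/* budget while the link is believed up */
	unsigned int later_budget_us;	/* budget once a timeout has been seen */
	unsigned int step_us;		/* delay between polls, must be > 0 */
	bool link_seen_down;		/* set by a timeout, cleared by success */
};

/*
 * Poll until ready() succeeds or the time budget is used up.
 * Returns true if transmit became possible. *waited_us receives the
 * total delay requested, so callers can see the cost of the call.
 */
bool wait_for_tx(struct retry_policy *p, ready_fn ready, delay_fn delay,
		 void *ctx, unsigned int *waited_us)
{
	unsigned int budget, waited = 0;

	assert(p != NULL && ready != NULL && delay != NULL && waited_us != NULL);
	assert(p->step_us > 0);
	budget = p->link_seen_down ? p->later_budget_us : p->first_budget_us;

	for (;;) {
		if (ready(ctx)) {
			p->link_seen_down = false;
			*waited_us = waited;
			return true;
		}
		if (waited >= budget)
			break;
		delay(p->step_us, ctx);
		waited += p->step_us;
	}
	p->link_seen_down = true;
	*waited_us = waited;
	return false;
}

/* Test doubles: a link that is never or always ready, and a no-op delay. */
bool never_ready(void *ctx) { (void)ctx; return false; }
bool always_ready(void *ctx) { (void)ctx; return true; }
void no_delay(unsigned int usecs, void *ctx) { (void)usecs; (void)ctx; }
```

With a policy of { 1000, 10, 10, false }, the first failed send costs the full 1000 microsecond budget, while each later failure costs only 10 until a send succeeds again, so dumping a long backlog onto a dead link does not multiply the long timeout by the number of messages.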
* Re: lockups with netconsole on e1000 on media insertion
From: Andi Kleen @ 2005-08-05 23:51 UTC
To: Matt Mackall; +Cc: Andi Kleen, John Bäckstrand, linux-kernel, netdev

> But why are we in a hurry to dump the backlog on the floor? Why are we
> worrying about the performance of netpoll without the cable plugged in
> at all? We shouldn't be optimizing the data loss case.

Because a system shouldn't stall for minutes (or forever like right now)
at boot just because the network cable isn't plugged in.

>
> My primary concern here is that the loop have a non-negligible extent
> in time. 5 loops is effectively equal to none. I'd be very surprised
> if it was even enough for deglitching.

In the normal case the packets should just be sent out.

-Andi
* Re: lockups with netconsole on e1000 on media insertion
From: Matt Mackall @ 2005-08-06 1:22 UTC
To: Andi Kleen; +Cc: John Bäckstrand, linux-kernel, netdev

On Sat, Aug 06, 2005 at 01:51:22AM +0200, Andi Kleen wrote:
> > But why are we in a hurry to dump the backlog on the floor? Why are we
> > worrying about the performance of netpoll without the cable plugged in
> > at all? We shouldn't be optimizing the data loss case.
>
> Because a system shouldn't stall for minutes (or forever like right now)
> at boot just because the network cable isn't plugged in.

Using netconsole without a network cable could well be classified as a
serious configuration error. NFS also is a bit sluggish without a
network cable.

I've already agreed that forever is a problem. Can we work towards
agreeing on a non-trivial timeout, please?

-- 
Mathematics is the supreme nostalgia of our time.
* Re: lockups with netconsole on e1000 on media insertion
From: Daniel Phillips @ 2005-08-06 1:37 UTC
To: Matt Mackall; +Cc: Andi Kleen, John Bäckstrand, linux-kernel, netdev

On Saturday 06 August 2005 11:22, Matt Mackall wrote:
> On Sat, Aug 06, 2005 at 01:51:22AM +0200, Andi Kleen wrote:
> > > But why are we in a hurry to dump the backlog on the floor? Why are we
> > > worrying about the performance of netpoll without the cable plugged in
> > > at all? We shouldn't be optimizing the data loss case.
> >
> > Because a system shouldn't stall for minutes (or forever like right now)
> > at boot just because the network cable isn't plugged in.
>
> Using netconsole without a network cable could well be classified as a
> serious configuration error.

But please don't. An OS that slows to a crawl or crashes because a
cable isn't plugged in is an OS that deserves to be ridiculed. Silly
timeouts on boot are scary and a waste of the user's time.

Regards,

Daniel