* [patch] revert: [NET]: Fix races in net_rx_action vs netpoll
@ 2007-07-16 9:12 Ingo Molnar
2007-07-16 10:35 ` Olaf Kirch
` (2 more replies)
0 siblings, 3 replies; 67+ messages in thread
From: Ingo Molnar @ 2007-07-16 9:12 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, olaf.kirch, davem
current -git broke my main testbox. No TCP/IP networking to/from the box
and e1000 would time out in xmit:
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Tx Queue <0>
TDH <95>
TDT <95>
next_to_use <95>
next_to_clean <ea>
buffer_info[next_to_clean]
time_stamp <ffff8f43>
next_to_watch <ea>
jiffies <ffffb5cc>
next_to_watch.status <1>
After a bisection session the bad commit turned out to be:
29578624e354f56143d92510fff33a8b2aaa2c03 is first bad commit
commit 29578624e354f56143d92510fff33a8b2aaa2c03
Author: Olaf Kirch <olaf.kirch@oracle.com>
Date: Wed Jul 11 19:32:02 2007 -0700
[NET]: Fix races in net_rx_action vs netpoll.
Keep netpoll/poll_napi from messing with the poll_list.
Only net_rx_action is allowed to manipulate the list.
Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
and indeed the testbox uses netconsole (it's a laptop so this is the
only viable remote debugging option). Applying the revert patch below
makes it work again.
Ingo
------------------>
Subject: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll
From: Ingo Molnar <mingo@elte.hu>
commit 29578624 causes netconsole failures:
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Tx Queue <0>
TDH <95>
TDT <95>
next_to_use <95>
next_to_clean <ea>
buffer_info[next_to_clean]
time_stamp <ffff8f43>
next_to_watch <ea>
jiffies <ffffb5cc>
next_to_watch.status <1>
revert it for now, to make my testsystem work.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/netdevice.h | 10 ----------
net/core/netpoll.c | 8 --------
2 files changed, 18 deletions(-)
Index: linux/include/linux/netdevice.h
===================================================================
--- linux.orig/include/linux/netdevice.h
+++ linux/include/linux/netdevice.h
@@ -261,8 +261,6 @@ enum netdev_state_t
__LINK_STATE_LINKWATCH_PENDING,
__LINK_STATE_DORMANT,
__LINK_STATE_QDISC_RUNNING,
- /* Set by the netpoll NAPI code */
- __LINK_STATE_POLL_LIST_FROZEN,
};
@@ -1016,14 +1014,6 @@ static inline void netif_rx_complete(str
{
unsigned long flags;
-#ifdef CONFIG_NETPOLL
- /* Prevent race with netpoll - yes, this is a kludge.
- * But at least it doesn't penalize the non-netpoll
- * code path. */
- if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state))
- return;
-#endif
-
local_irq_save(flags);
__netif_rx_complete(dev);
local_irq_restore(flags);
Index: linux/net/core/netpoll.c
===================================================================
--- linux.orig/net/core/netpoll.c
+++ linux/net/core/netpoll.c
@@ -124,13 +124,6 @@ static void poll_napi(struct netpoll *np
if (test_bit(__LINK_STATE_RX_SCHED, &np->dev->state) &&
npinfo->poll_owner != smp_processor_id() &&
spin_trylock(&npinfo->poll_lock)) {
- /* When calling dev->poll from poll_napi, we may end up in
- * netif_rx_complete. However, only the CPU to which the
- * device was queued is allowed to remove it from poll_list.
- * Setting POLL_LIST_FROZEN tells netif_rx_complete
- * to leave the NAPI state alone.
- */
- set_bit(__LINK_STATE_POLL_LIST_FROZEN, &np->dev->state);
npinfo->rx_flags |= NETPOLL_RX_DROP;
atomic_inc(&trapped);
@@ -138,7 +131,6 @@ static void poll_napi(struct netpoll *np
atomic_dec(&trapped);
npinfo->rx_flags &= ~NETPOLL_RX_DROP;
- clear_bit(__LINK_STATE_POLL_LIST_FROZEN, &np->dev->state);
spin_unlock(&npinfo->poll_lock);
}
}
^ permalink raw reply [flat|nested] 67+ messages in thread* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 9:12 [patch] revert: [NET]: Fix races in net_rx_action vs netpoll Ingo Molnar @ 2007-07-16 10:35 ` Olaf Kirch 2007-07-16 11:26 ` David Miller 2007-07-17 5:46 ` Jarek Poplawski 2 siblings, 0 replies; 67+ messages in thread From: Olaf Kirch @ 2007-07-16 10:35 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel, davem On Monday 16 July 2007 11:12, Ingo Molnar wrote: > After a bisection session the bad commit turned out to be: > > 29578624e354f56143d92510fff33a8b2aaa2c03 is first bad commit > commit 29578624e354f56143d92510fff33a8b2aaa2c03 > Author: Olaf Kirch <olaf.kirch@oracle.com> > Date: Wed Jul 11 19:32:02 2007 -0700 > > [NET]: Fix races in net_rx_action vs netpoll. > > Keep netpoll/poll_napi from messing with the poll_list. > Only net_rx_action is allowed to manipulate the list. > > Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> > Signed-off-by: David S. Miller <davem@davemloft.net> Apologies - I'll look into it. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 9:12 [patch] revert: [NET]: Fix races in net_rx_action vs netpoll Ingo Molnar 2007-07-16 10:35 ` Olaf Kirch @ 2007-07-16 11:26 ` David Miller 2007-07-16 12:18 ` Olaf Kirch ` (2 more replies) 2007-07-17 5:46 ` Jarek Poplawski 2 siblings, 3 replies; 67+ messages in thread From: David Miller @ 2007-07-16 11:26 UTC (permalink / raw) To: mingo; +Cc: torvalds, linux-kernel, olaf.kirch From: Ingo Molnar <mingo@elte.hu> Date: Mon, 16 Jul 2007 11:12:36 +0200 > Applying the revert patch below makes it work again. Well, let's figure out why before we revert because it is attempting to fix a legitimate bug. Olaf, any ideas? ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 11:26 ` David Miller @ 2007-07-16 12:18 ` Olaf Kirch 2007-07-16 13:29 ` Ingo Molnar 2007-07-16 21:09 ` Ingo Molnar 2007-07-16 21:40 ` Linus Torvalds 2 siblings, 1 reply; 67+ messages in thread From: Olaf Kirch @ 2007-07-16 12:18 UTC (permalink / raw) To: David Miller; +Cc: mingo, torvalds, linux-kernel On Monday 16 July 2007 13:26, David Miller wrote: > Well, let's figure out why before we revert because it > is attempting to fix a legitimate bug. > > Olaf, any ideas? It seems as if the card is stuck in NAPI mode without being serviced by net_rx_action. Ingo, is this a UP or SMP machine? Are you still getting output from netconsole, or is the network down completely? Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 12:18 ` Olaf Kirch @ 2007-07-16 13:29 ` Ingo Molnar 0 siblings, 0 replies; 67+ messages in thread From: Ingo Molnar @ 2007-07-16 13:29 UTC (permalink / raw) To: Olaf Kirch; +Cc: David Miller, torvalds, linux-kernel * Olaf Kirch <olaf.kirch@oracle.com> wrote: > On Monday 16 July 2007 13:26, David Miller wrote: > > Well, let's figure out why before we revert because it > > is attempting to fix a legitimate bug. > > > > Olaf, any ideas? > > It seems as if the card is stuck in NAPI mode without being serviced > by net_rx_action. > > Ingo, is this a UP or SMP machine? Are you still getting output from > netconsole, or is the network down completely? there is some netconsole output but that too gets stuck after the tx timeout message. The e1000 options are: CONFIG_E1000=y CONFIG_E1000_NAPI=y # CONFIG_E1000_DISABLE_PACKET_SPLIT is not set Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 11:26 ` David Miller 2007-07-16 12:18 ` Olaf Kirch @ 2007-07-16 21:09 ` Ingo Molnar 2007-07-16 22:06 ` David Miller 2007-07-16 21:40 ` Linus Torvalds 2 siblings, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-16 21:09 UTC (permalink / raw) To: David Miller; +Cc: torvalds, linux-kernel, olaf.kirch * David Miller <davem@davemloft.net> wrote: > From: Ingo Molnar <mingo@elte.hu> > Date: Mon, 16 Jul 2007 11:12:36 +0200 > > > Applying the revert patch below makes it work again. > > Well, let's figure out why before we revert because it is attempting > to fix a legitimate bug. yeah, no doubt about it, but this fix breaks my box _for sure_, while whatever problem the commit fixed, it was not something that ever triggered on my boxes =B-) so ... i can promise to test whatever new version of the patch Olaf sends me (the problem is easy to reproduce and easy to test, so i can check it all in a heartbeat), so to get things back on track, and to value the time i spent on bisecting this, could we please apply the revert upstream (temporarily), because Linus' -git tree as of today is still broken for me. (and likely broken for others) Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 21:09 ` Ingo Molnar @ 2007-07-16 22:06 ` David Miller 0 siblings, 0 replies; 67+ messages in thread From: David Miller @ 2007-07-16 22:06 UTC (permalink / raw) To: mingo; +Cc: torvalds, linux-kernel, olaf.kirch From: Ingo Molnar <mingo@elte.hu> Date: Mon, 16 Jul 2007 23:09:24 +0200 > so ... i can promise to test whatever new version of the patch Olaf > sends me (the problem is easy to reproduce and easy to test, so i can > check it all in a heartbeat), so to get things back on track, and to > value the time i spent on bisecting this, could we please apply the > revert upstream (temporarily), because Linus' -git tree as of today is > still broken for me. (and likely broken for others) You can't revert patch from your tree locally in the mean time because??? Ingo, please be reasonable and a little patient. Things break all the time, and it's not like 2.6.23-final is going out tomorrow. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 11:26 ` David Miller 2007-07-16 12:18 ` Olaf Kirch 2007-07-16 21:09 ` Ingo Molnar @ 2007-07-16 21:40 ` Linus Torvalds 2007-07-16 21:51 ` Ingo Molnar ` (2 more replies) 2 siblings, 3 replies; 67+ messages in thread From: Linus Torvalds @ 2007-07-16 21:40 UTC (permalink / raw) To: David Miller; +Cc: mingo, linux-kernel, olaf.kirch On Mon, 16 Jul 2007, David Miller wrote: > > Well, let's figure out why before we revert because it > is attempting to fix a legitimate bug. I'm reverting it. I don't think there is any excuse for not reverting something that provably breaks somebody's machine. I don't want this to be on the regression list - if you figure out why it broke, you can always just resubmit a fixed patch. Keeping a known broken patch around just because you want to figure out why it's broken seems pointless. Just looking at the patch I see two options: - The change seems to always set the LIST_FROZEN bit when calling ->poll(), and at least on e1000, the NAPI poll() routine ends up doing that netif_rx_complete(), so we're *guaranteed* to always take the early exit and not do anything. Looks fundamentally broken to me, and not worth saving. But maybe I'm missing something (and it doesn't really seem to explain the write timeout per se). - The early return from netif_rx_complete() ends up meaning that an edge-triggered interrupt isn't handled properly, and will this never happen again, since it never goes away. With MSI, edge-triggered interrupts are making a comeback in a big way, and yeah, e1000 is one of the drivers that do MSI. Ingo might want to confirm whether it's actually enabled for him, and whether turning it off might hide the problem, but if that's it, then the whole patch is fundamentally broken, and not worth saving. but in either case (or, indeed, even if I didn't see any problem at all), I think reverting a patch that isn't needed is _always_ the right choice. If we don't know what caused a problem in the first place, or if the fix is known to be required for something else and reverting it would cause *another* regression, it would be another issue. But as it is, reverting it would seem to unquestionably get rid of a regression, and is thus a no-brainer. No? Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 21:40 ` Linus Torvalds @ 2007-07-16 21:51 ` Ingo Molnar 2007-07-16 22:09 ` David Miller 2007-07-16 22:08 ` David Miller 2007-07-17 8:16 ` Olaf Kirch 2 siblings, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-16 21:51 UTC (permalink / raw) To: Linus Torvalds; +Cc: David Miller, linux-kernel, olaf.kirch * Linus Torvalds <torvalds@linux-foundation.org> wrote: > With MSI, edge-triggered interrupts are making a comeback in a big > way, and yeah, e1000 is one of the drivers that do MSI. Ingo might > want to confirm whether it's actually enabled for him, and whether > turning it off might hide the problem, but if that's it, then the > whole patch is fundamentally broken, and not worth saving. MSI was off for the test: # CONFIG_PCI_MSI is not set full config is at: http://redhat.com/~mingo/misc/config the hang-log is at: http://redhat.com/~mingo/misc/hang.log netconsole output went silent during the last tx-timeout message. (the above hang.log is from dmesg) > but in either case (or, indeed, even if I didn't see any problem at > all), I think reverting a patch that isn't needed is _always_ the > right choice. > > If we don't know what caused a problem in the first place, or if the > fix is known to be required for something else and reverting it would > cause *another* regression, it would be another issue. But as it is, > reverting it would seem to unquestionably get rid of a regression, and > is thus a no-brainer. > > No? i also offered to quickly try any test-version of the fixed patch, so there's a real and deterministic path towards fixing the patch. The regression is obvious and triggers all the time. Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 21:51 ` Ingo Molnar @ 2007-07-16 22:09 ` David Miller 2007-07-16 22:37 ` Ingo Molnar 0 siblings, 1 reply; 67+ messages in thread From: David Miller @ 2007-07-16 22:09 UTC (permalink / raw) To: mingo; +Cc: torvalds, linux-kernel, olaf.kirch From: Ingo Molnar <mingo@elte.hu> Date: Mon, 16 Jul 2007 23:51:17 +0200 > i also offered to quickly try any test-version of the fixed patch, so > there's a real and deterministic path towards fixing the patch. The > regression is obvious and triggers all the time. For you. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 22:09 ` David Miller @ 2007-07-16 22:37 ` Ingo Molnar 2007-07-16 22:57 ` David Miller 0 siblings, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-16 22:37 UTC (permalink / raw) To: David Miller; +Cc: torvalds, linux-kernel, olaf.kirch * David Miller <davem@davemloft.net> wrote: > From: Ingo Molnar <mingo@elte.hu> > Date: Mon, 16 Jul 2007 23:51:17 +0200 > > > i also offered to quickly try any test-version of the fixed patch, so > > there's a real and deterministic path towards fixing the patch. The > > regression is obvious and triggers all the time. > > For you. I can certainly keep the revert around in my trees. (although it's a complication, i have to take care for it to never leak out into any external trees, etc. - but it's not a big issue) Fundamentally, i trust Olaf to fix this quickly, and i dont want to make a too big fuss about this, but in general it's always better to revert patches causing known regressions (unless the revert is hugely complex and other changes depend on it - but this isnt the case here). I can also run whatever test-patches of Olaf, that would instrument/dump whatever info is needed to fix this. So Olaf's debugging effort is not hindered in any way as far as i can see. I think if you leaned back and thought it through, and if you applied this scenario to a bad scheduler commit from me that broke your box, you'd readily agree with me =B-) (which scenario is purely hypothetical, my scheduler commits are all 100% perfect of course ;-) Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 22:37 ` Ingo Molnar @ 2007-07-16 22:57 ` David Miller 2007-07-17 18:09 ` Ingo Molnar 0 siblings, 1 reply; 67+ messages in thread From: David Miller @ 2007-07-16 22:57 UTC (permalink / raw) To: mingo; +Cc: torvalds, linux-kernel, olaf.kirch From: Ingo Molnar <mingo@elte.hu> Date: Tue, 17 Jul 2007 00:37:18 +0200 > I think if you leaned back and thought it through, and if you applied > this scenario to a bad scheduler commit from me that broke your box, > you'd readily agree with me =B-) (which scenario is purely hypothetical, > my scheduler commits are all 100% perfect of course ;-) Actually I'd probably send you a patch for any bug I found that triggered on sparc64, since that's faster than asking you to fix a bug that you are unlikely to be able to trigger on your own systems. But that's just how I operate. Ask Thomas Gleixner. Every single hrtimers/nohz bug I've found on sparc64, I sent him either a full fix or a full analysis of the bug and a description of exactly what is going on and what needs to be changed so he can compose the bug fix patch with minimal effort. I don't ask people to revert, ever. It's never necessary when the parties involved are extremely competent and responsive. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 22:57 ` David Miller @ 2007-07-17 18:09 ` Ingo Molnar 0 siblings, 0 replies; 67+ messages in thread From: Ingo Molnar @ 2007-07-17 18:09 UTC (permalink / raw) To: David Miller; +Cc: torvalds, linux-kernel, olaf.kirch * David Miller <davem@davemloft.net> wrote: > From: Ingo Molnar <mingo@elte.hu> > Date: Tue, 17 Jul 2007 00:37:18 +0200 > > > I think if you leaned back and thought it through, and if you > > applied this scenario to a bad scheduler commit from me that broke > > your box, you'd readily agree with me =B-) (which scenario is purely > > hypothetical, my scheduler commits are all 100% perfect of course > > ;-) > > Actually I'd probably send you a patch for any bug I found that > triggered on sparc64, since that's faster than asking you to fix a bug > that you are unlikely to be able to trigger on your own systems. > > But that's just how I operate. > > Ask Thomas Gleixner. Every single hrtimers/nohz bug I've found on > sparc64, I sent him either a full fix or a full analysis of the bug > and a description of exactly what is going on and what needs to be > changed so he can compose the bug fix patch with minimal effort. yeah, i too usually try to fix bugs straight away - just check: git-log net drivers/net | grep "Author: Ingo Molnar" but in this current case i was freshly back from a few days off, had quite a backlog after a rather complex set of merges and i just didnt have the time. And i really applaud you for the clockevents/dynticks contributions (i loved your dynticks patches and the clockevents framework fixes that resulted out of it), but i really, really think this case is apples to oranges. I had some urgent hacking to do in some completely different area (not networking) and i ran across that networking regression and spent more than an hour on bisecting it. The bad commit looked simple so i thought the person who is doing it (Olaf) would immediately see the bug (i certainly didnt see it). It's also not that easy to debug that box, the only remote debugging interface to it is ... netconsole. If i were into networking changes right now i'd probably have tried to debug and fix it straight away. So instead i opted to bisect it, to give you the exact bad commit, give you full logs of the incident and an offer to run arbitrary patches immediately. There's also little loss from the revert AFAICS: the commit went into your tree on July 11th and went upstream July 12th and i found it on July 16th. So (unless the commit dates are deceiving me) it does not seem to be an over-tested patch with lots of QA value (yet), and it's not some critical commit that another 100 commits grew to depend on. It's nevertheless a fix we want to have, and i'm willing to test anything. Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 21:40 ` Linus Torvalds 2007-07-16 21:51 ` Ingo Molnar @ 2007-07-16 22:08 ` David Miller 2007-07-16 22:29 ` Linus Torvalds 2007-07-17 7:37 ` Olaf Kirch 2007-07-17 8:16 ` Olaf Kirch 2 siblings, 2 replies; 67+ messages in thread From: David Miller @ 2007-07-16 22:08 UTC (permalink / raw) To: torvalds; +Cc: mingo, linux-kernel, olaf.kirch From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 16 Jul 2007 14:40:38 -0700 (PDT) > If we don't know what caused a problem in the first place, or if the fix > is known to be required for something else and reverting it would cause > *another* regression, it would be another issue. But as it is, reverting > it would seem to unquestionably get rid of a regression, and is thus a > no-brainer. > > No? Sure, but I thought it would be nice to give Olaf a day or two to figure out what's going on rather than have the knee-jerk reaction to just revert. Ingo is the only person hitting and reporting this and last time I checked he is competent enough to revert the thing locally in his own trees, right? :-) ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 22:08 ` David Miller @ 2007-07-16 22:29 ` Linus Torvalds 2007-07-16 22:52 ` David Miller 2007-07-16 23:17 ` Matt Mackall 2007-07-17 7:37 ` Olaf Kirch 1 sibling, 2 replies; 67+ messages in thread From: Linus Torvalds @ 2007-07-16 22:29 UTC (permalink / raw) To: David Miller; +Cc: mingo, linux-kernel, olaf.kirch On Mon, 16 Jul 2007, David Miller wrote: > > Ingo is the only person hitting and reporting this and last time I > checked he is competent enough to revert the thing locally in his own > trees, right? :-) Umm. And your suggestion is what? Wait until -rc1, when non-developers (the kinds of people who are supposed to *not* be competent enough) come along, and report the problem? That's explicitly what I do *not* want to happen. If we knew something was wrong before the -rc1 release, all the better: we can avoid havign that bug in -rc1, and the people who test it will tell us about the problems we did *not* know about. In contrast, if we leave a known bug in, that will just mean that other problems won't be found either, because people get hung up on the known bug. Regressions should be fixed early and often. Otherwise there's no point in making the code available early and often in the first place. A regression is basically MUCH WORSE than just about any possible bug that patch could have fixed, and I want people to really think that way. The rule for *everybody* should be: "Regressions get fixed immediately" with no ifs, buts or maybes. In fact, it would be even better if the kinds of patches that cause regressions wouldn't hit my tree at all, but once they have hit it, and once they've been identified, they get reverted. Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 22:29 ` Linus Torvalds @ 2007-07-16 22:52 ` David Miller 2007-07-16 23:17 ` Matt Mackall 1 sibling, 0 replies; 67+ messages in thread From: David Miller @ 2007-07-16 22:52 UTC (permalink / raw) To: torvalds; +Cc: mingo, linux-kernel, olaf.kirch From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 16 Jul 2007 15:29:15 -0700 (PDT) > If we knew something was wrong before the -rc1 release, all the better: we > can avoid havign that bug in -rc1, and the people who test it will tell us > about the problems we did *not* know about. > > In contrast, if we leave a known bug in, that will just mean that other > problems won't be found either, because people get hung up on the known > bug. Fair enough. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 22:29 ` Linus Torvalds 2007-07-16 22:52 ` David Miller @ 2007-07-16 23:17 ` Matt Mackall 2007-07-16 23:34 ` Linus Torvalds 1 sibling, 1 reply; 67+ messages in thread From: Matt Mackall @ 2007-07-16 23:17 UTC (permalink / raw) To: Linus Torvalds; +Cc: David Miller, mingo, linux-kernel, olaf.kirch On Mon, Jul 16, 2007 at 03:29:15PM -0700, Linus Torvalds wrote: > > > On Mon, 16 Jul 2007, David Miller wrote: > > > > Ingo is the only person hitting and reporting this and last time I > > checked he is competent enough to revert the thing locally in his own > > trees, right? :-) > > Umm. And your suggestion is what? Wait until -rc1, when non-developers > (the kinds of people who are supposed to *not* be competent enough) come > along, and report the problem? > > That's explicitly what I do *not* want to happen. > > If we knew something was wrong before the -rc1 release, all the better: we > can avoid havign that bug in -rc1, and the people who test it will tell us > about the problems we did *not* know about. Unfortunately the particular patch from Olaf is presumably covering up another bug that other people (including Olaf) had hit. So reverting it is going to introduce a different regression. I think this particular problem has _already_ bounced back and forth once. But I haven't been keeping count as most of the relevant patches weren't cc:ed to me. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 23:17 ` Matt Mackall @ 2007-07-16 23:34 ` Linus Torvalds 0 siblings, 0 replies; 67+ messages in thread From: Linus Torvalds @ 2007-07-16 23:34 UTC (permalink / raw) To: Matt Mackall; +Cc: David Miller, mingo, linux-kernel, olaf.kirch On Mon, 16 Jul 2007, Matt Mackall wrote: > > Unfortunately the particular patch from Olaf is presumably covering up > another bug that other people (including Olaf) had hit. So reverting > it is going to introduce a different regression. It's not a regression, it's an old problem. And the rule is: machines that _used_ to work are more important. That rule got hewn into stone when the ACPI layer changes constantly kept breaking things that used to work, while "fixing" other things. So we don't fix bugs by introducing new problems. That way lies madness, and nobody ever knows if you actually make any real progress at all. Is it two steps forwards, one step back, or one step forward and two steps back? Different people will give different answers. That's why regressions are _so_ much more important than new bugfixes. Because it's much more important to make slow but _steady_ progress, and have people know things improve (or at least not "deprove"). We don't want any kind of "brownian motion development". Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 22:08 ` David Miller 2007-07-16 22:29 ` Linus Torvalds @ 2007-07-17 7:37 ` Olaf Kirch 1 sibling, 0 replies; 67+ messages in thread From: Olaf Kirch @ 2007-07-17 7:37 UTC (permalink / raw) To: David Miller; +Cc: torvalds, mingo, linux-kernel On Tuesday 17 July 2007 00:08, David Miller wrote: > Sure, but I thought it would be nice to give Olaf a day or two to > figure out what's going on rather than have the knee-jerk reaction to > just revert. Oh, reverting is fine with me. I'll just resubmit the patch. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 21:40 ` Linus Torvalds 2007-07-16 21:51 ` Ingo Molnar 2007-07-16 22:08 ` David Miller @ 2007-07-17 8:16 ` Olaf Kirch 2 siblings, 0 replies; 67+ messages in thread From: Olaf Kirch @ 2007-07-17 8:16 UTC (permalink / raw) To: Linus Torvalds; +Cc: David Miller, mingo, linux-kernel On Monday 16 July 2007 23:40, Linus Torvalds wrote: > - The change seems to always set the LIST_FROZEN bit when calling > ->poll(), and at least on e1000, the NAPI poll() routine ends up doing > that netif_rx_complete(), so we're *guaranteed* to always take the > early exit and not do anything. That is only in the poll_napi case, where netpoll tries to push pending frames. The default caller for dev->poll() is net_rx_action. We only ever get into this particular code branch in poll_napi when the RX_SCHED bit is already set, ie someone put the device on the NAPI poll_list and raised the softirq. poll_napi is just doing manually what net_rx_action would do anyway once the softirq is serviced. > - The early return from netif_rx_complete() ends up meaning that an > edge-triggered interrupt isn't handled properly, and will this never > happen again, since it never goes away. Sorry, I may be sitting on my brain this morning, but I don't understand how skipping netif_rx_complete would affect ACKing of interrupts. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-16 9:12 [patch] revert: [NET]: Fix races in net_rx_action vs netpoll Ingo Molnar 2007-07-16 10:35 ` Olaf Kirch 2007-07-16 11:26 ` David Miller @ 2007-07-17 5:46 ` Jarek Poplawski 2007-07-17 6:14 ` Jarek Poplawski 2 siblings, 1 reply; 67+ messages in thread From: Jarek Poplawski @ 2007-07-17 5:46 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel, olaf.kirch, davem On 16-07-2007 11:12, Ingo Molnar wrote: > current -git broke my main testbox. No TCP/IP networking to/from the box > and e1000 would time out in xmit: > > NETDEV WATCHDOG: eth0: transmit timed out > e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang ... Olaf, I think this error can trigger in this place: > static void net_rx_action(struct softirq_action *h) > { > struct softnet_data *queue = &__get_cpu_var(softnet_data); > unsigned long start_time = jiffies; > int budget = netdev_budget; > void *have; > > local_irq_disable(); > > while (!list_empty(&queue->poll_list)) { > struct net_device *dev; > > if (budget <= 0 || jiffies - start_time > 1) > goto softnet_break; > > local_irq_enable(); > > dev = list_entry(queue->poll_list.next, > struct net_device, poll_list); > have = netpoll_poll_lock(dev); > > if (dev->quota <= 0 || dev->poll(dev, &budget)) { If after poll_napi dev->quota <= 0 dev->poll is not run and __LINK_STATE_RX_SCHED bit (plus dev->poll_list) stays uncleared. Regards, Jarek P. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 5:46 ` Jarek Poplawski @ 2007-07-17 6:14 ` Jarek Poplawski 2007-07-17 7:55 ` Olaf Kirch 0 siblings, 1 reply; 67+ messages in thread From: Jarek Poplawski @ 2007-07-17 6:14 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel, olaf.kirch, davem On Tue, Jul 17, 2007 at 07:46:39AM +0200, Jarek Poplawski wrote: ... > > static void net_rx_action(struct softirq_action *h) > > { > > struct softnet_data *queue = &__get_cpu_var(softnet_data); > > unsigned long start_time = jiffies; > > int budget = netdev_budget; > > void *have; > > > > local_irq_disable(); > > > > while (!list_empty(&queue->poll_list)) { > > struct net_device *dev; > > > > if (budget <= 0 || jiffies - start_time > 1) > > goto softnet_break; > > > > local_irq_enable(); > > > > dev = list_entry(queue->poll_list.next, > > struct net_device, poll_list); > > have = netpoll_poll_lock(dev); > > > > if (dev->quota <= 0 || dev->poll(dev, &budget)) { > > If after poll_napi dev->quota <= 0 dev->poll is not run and > __LINK_STATE_RX_SCHED bit (plus dev->poll_list) stays uncleared. Or, more precisely dev->poll_list will be cleared just after this, and net_rx_action returns with __LINK_STATE_RX_SCHED bit set. Jarek P. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 6:14 ` Jarek Poplawski @ 2007-07-17 7:55 ` Olaf Kirch 2007-07-17 8:28 ` Olaf Kirch 0 siblings, 1 reply; 67+ messages in thread From: Olaf Kirch @ 2007-07-17 7:55 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Ingo Molnar, Linus Torvalds, linux-kernel, davem On Tuesday 17 July 2007 08:14, Jarek Poplawski wrote: > > If after poll_napi dev->quota <= 0 dev->poll is not run and > > __LINK_STATE_RX_SCHED bit (plus dev->poll_list) stays uncleared. > > Or, more precisely dev->poll_list will be cleared just after this, > and net_rx_action returns with __LINK_STATE_RX_SCHED bit set. I don't think so. dev will remain on poll_list. What I find more problematic about this portion of code though is that once a net_device is over quota, net_rx_action will loop for up to one jiffy, even if there's just this one device on the poll_list. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 7:55 ` Olaf Kirch @ 2007-07-17 8:28 ` Olaf Kirch 2007-07-17 8:57 ` Ingo Molnar 2007-07-17 9:12 ` Jarek Poplawski 0 siblings, 2 replies; 67+ messages in thread From: Olaf Kirch @ 2007-07-17 8:28 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Ingo Molnar, Linus Torvalds, linux-kernel, davem On Tuesday 17 July 2007 09:55, Olaf Kirch wrote: > What I find more problematic about this portion of code though > is that once a net_device is over quota, net_rx_action will > loop for up to one jiffy, even if there's just this one device on > the poll_list. Duh, wrong. For every loop, it'll add dev->weight to dev->quota, so it'll be fine very soon. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 8:28 ` Olaf Kirch @ 2007-07-17 8:57 ` Ingo Molnar 2007-07-17 9:29 ` Jarek Poplawski ` (2 more replies) 2007-07-17 9:12 ` Jarek Poplawski 1 sibling, 3 replies; 67+ messages in thread From: Ingo Molnar @ 2007-07-17 8:57 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem Olaf, i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes the problem go away. So it's somehow also related to jiffies. Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 8:57 ` Ingo Molnar @ 2007-07-17 9:29 ` Jarek Poplawski 2007-07-17 14:07 ` Olaf Kirch 2007-07-17 17:49 ` Linus Torvalds 2 siblings, 0 replies; 67+ messages in thread From: Jarek Poplawski @ 2007-07-17 9:29 UTC (permalink / raw) To: Ingo Molnar; +Cc: Olaf Kirch, Linus Torvalds, linux-kernel, davem On Tue, Jul 17, 2007 at 10:57:48AM +0200, Ingo Molnar wrote: > > Olaf, > > i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes > the problem go away. So it's somehow also related to jiffies. IMHO it could be related with __LINK_STATE_RX_SCHED beeing set too long e.g. between two softirqs. So, it could be really a matter of timing out - not necessarily permanent error state. Jarek P. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 8:57 ` Ingo Molnar 2007-07-17 9:29 ` Jarek Poplawski @ 2007-07-17 14:07 ` Olaf Kirch 2007-07-17 16:57 ` Ingo Molnar 2007-07-17 17:49 ` Linus Torvalds 2 siblings, 1 reply; 67+ messages in thread From: Olaf Kirch @ 2007-07-17 14:07 UTC (permalink / raw) To: Ingo Molnar; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem On Tuesday 17 July 2007 10:57, Ingo Molnar wrote: > i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes > the problem go away. So it's somehow also related to jiffies. There are several "Tx Hang detected" messages in the log, which looks a lot as if net_rx_action never runs, or at least never calls dev->poll on the e1000 nic. Can you check whether/how often it bails out of net_rx_action taking the softnet_break path? My suspicion right now is that dev->quota goes way negative when pushing out netconsole output. Normally, we bump the quota in __net_rx_schedule. But the whole point of the patch is that netpoll has no business removing the device from the poll_list, so it stays there, and we don't end up calling __net_rx_schedule as often as we would otherwise. Can you try what happens if you change netif_rx_complete to something like this: if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)) { dev->quota = dev->weight; return; } This is just a hack to make sure that we don't go to insanely negative quotas while sending packets through netpoll. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 14:07 ` Olaf Kirch @ 2007-07-17 16:57 ` Ingo Molnar 2007-07-17 18:06 ` Olaf Kirch 0 siblings, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-17 16:57 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem * Olaf Kirch <olaf.kirch@oracle.com> wrote: > Can you try what happens if you change netif_rx_complete to something > like this: > > if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)) { > dev->quota = dev->weight; > return; > } > > This is just a hack to make sure that we don't go to insanely negative > quotas while sending packets through netpoll. i've done the patch below, but it did not change the timeouts nor did it solve the 'no network' problem. netconsole output hung earlier as well. i can try other things too. (it would be best if you sent me test/debug-patches, instead of letting me hack in the changes - that's the surest way of ensuring that i test exactly what you intended.) Ingo -------------> Index: linux/include/linux/netdevice.h =================================================================== --- linux.orig/include/linux/netdevice.h +++ linux/include/linux/netdevice.h @@ -1007,6 +1007,10 @@ static inline int netif_rx_reschedule(st */ static inline void __netif_rx_complete(struct net_device *dev) { + if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)) { + dev->quota = dev->weight; + return; + } BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state)); list_del(&dev->poll_list); smp_mb__before_clear_bit(); ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 16:57 ` Ingo Molnar @ 2007-07-17 18:06 ` Olaf Kirch 2007-07-17 18:18 ` Ingo Molnar 0 siblings, 1 reply; 67+ messages in thread From: Olaf Kirch @ 2007-07-17 18:06 UTC (permalink / raw) To: Ingo Molnar; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem Hi Ingo, On Tuesday 17 July 2007 18:57, Ingo Molnar wrote: > i've done the patch below, but it did not change the timeouts nor did it > solve the 'no network' problem. netconsole output hung earlier as well. Hm, pity. To rule out any e1000 problem, can you try the the following please, both with HZ=250 and HZ=1000? Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax --- --- drivers/net/e1000/e1000_main.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) Index: build-2.6/drivers/net/e1000/e1000_main.c =================================================================== --- build-2.6.orig/drivers/net/e1000/e1000_main.c +++ build-2.6/drivers/net/e1000/e1000_main.c @@ -3871,10 +3871,17 @@ e1000_intr(int irq, void *data) adapter->total_rx_bytes = 0; adapter->total_rx_packets = 0; __netif_rx_schedule(netdev); - } else + } else { /* this really should not happen! if it does it is basically a * bug, but not a hard error, so enable ints and continue */ + static unsigned int been_here = 0; + + been_here++; + if (net_ratelimit()) + printk(KERN_NOTICE "e1000_intr and rx_sched set (%u); state=0x%lx\n", + been_here, netdev->state); e1000_irq_enable(adapter); + } #else /* Writing IMC and IMS is needed for 82547. * Due to Hub Link bus being occupied, an interrupt ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 18:06 ` Olaf Kirch @ 2007-07-17 18:18 ` Ingo Molnar 2007-07-17 18:34 ` Olaf Kirch 0 siblings, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-17 18:18 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem * Olaf Kirch <olaf.kirch@oracle.com> wrote: > Hi Ingo, > > On Tuesday 17 July 2007 18:57, Ingo Molnar wrote: > > i've done the patch below, but it did not change the timeouts nor did it > > solve the 'no network' problem. netconsole output hung earlier as well. > Hm, pity. > > To rule out any e1000 problem, can you try the the following please, > both with HZ=250 and HZ=1000? done, find the logs here: http://redhat.com/~mingo/misc/100hz.log http://redhat.com/~mingo/misc/1000hz.log (one is HZ=100, the other HZ=1000. HZ=100 produces a hung network just like HZ=250.) no 'rx_sched set' messages in either case. Network still hung for HZ=100, and is working for HZ=1000. Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 18:18 ` Ingo Molnar @ 2007-07-17 18:34 ` Olaf Kirch 2007-07-17 18:56 ` Ingo Molnar 2007-07-18 11:48 ` Jarek Poplawski 0 siblings, 2 replies; 67+ messages in thread From: Olaf Kirch @ 2007-07-17 18:34 UTC (permalink / raw) To: Ingo Molnar; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem On Tuesday 17 July 2007 20:18, Ingo Molnar wrote: > (one is HZ=100, the other HZ=1000. HZ=100 produces a hung network just > like HZ=250.) > > no 'rx_sched set' messages in either case. Network still hung for > HZ=100, and is working for HZ=1000. Is this from dmesg or the netconsole output? I don't see any Tx Unit Hang messages from e1000 or netdev watchdog messages present in your earlier dmesg logs. So maybe these messages are there, but never get logged? Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 18:34 ` Olaf Kirch @ 2007-07-17 18:56 ` Ingo Molnar 2007-07-18 12:04 ` Olaf Kirch 2007-07-18 11:48 ` Jarek Poplawski 1 sibling, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-17 18:56 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem * Olaf Kirch <olaf.kirch@oracle.com> wrote: > On Tuesday 17 July 2007 20:18, Ingo Molnar wrote: > > (one is HZ=100, the other HZ=1000. HZ=100 produces a hung network just > > like HZ=250.) > > > > no 'rx_sched set' messages in either case. Network still hung for > > HZ=100, and is working for HZ=1000. > > Is this from dmesg or the netconsole output? I don't see any Tx Unit > Hang messages from e1000 or netdev watchdog messages present in your > earlier dmesg logs. So maybe these messages are there, but never get > logged? i logged these not via netconsole but via logging on over the console and using dmesg, so it should include everything. in the 100hz case the following seems to show the anomaly: NETDEV WATCHDOG: eth0: transmit timed out indeed it's not followed by the e1000 messages. (perhaps i didnt wait long enough to reboot the box - i zapped it after a minute of uptime) Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 18:56 ` Ingo Molnar @ 2007-07-18 12:04 ` Olaf Kirch 2007-07-18 12:41 ` Ingo Molnar 2007-07-18 12:48 ` Ingo Molnar 0 siblings, 2 replies; 67+ messages in thread From: Olaf Kirch @ 2007-07-18 12:04 UTC (permalink / raw) To: Ingo Molnar; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem On Tuesday 17 July 2007 20:56, Ingo Molnar wrote: > i logged these not via netconsole but via logging on over the console > and using dmesg, so it should include everything. in the 100hz case the > following seems to show the anomaly: > > NETDEV WATCHDOG: eth0: transmit timed out So, it seems as if for some reason, dev->poll isn't called frequently enough. Here's a debugging patch that tries to locate the problem - can you give it a try, please? Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax -------- patching file drivers/net/e1000/e1000_main.c --- include/linux/netdevice.h | 1 + net/core/dev.c | 7 +++++++ net/core/netpoll.c | 2 ++ 3 files changed, 10 insertions(+) Index: build-2.6/include/linux/netdevice.h =================================================================== --- build-2.6.orig/include/linux/netdevice.h +++ build-2.6/include/linux/netdevice.h @@ -692,6 +692,7 @@ struct softnet_data #ifdef CONFIG_NET_DMA struct dma_chan *net_dma; #endif + unsigned long scheduled; }; DECLARE_PER_CPU(struct softnet_data,softnet_data); Index: build-2.6/net/core/dev.c =================================================================== --- build-2.6.orig/net/core/dev.c +++ build-2.6/net/core/dev.c @@ -1199,6 +1199,7 @@ void __netif_rx_schedule(struct net_devi dev->quota += dev->weight; else dev->quota = dev->weight; + __get_cpu_var(softnet_data).scheduled = jiffies; __raise_softirq_irqoff(NET_RX_SOFTIRQ); local_irq_restore(flags); } @@ -2028,6 +2029,9 @@ static void net_rx_action(struct softirq local_irq_disable(); + /* Complain if we're scheduled way too late. */ + WARN_ON(time_after(jiffies, __get_cpu_var(softnet_data).scheduled + HZ)); + while (!list_empty(&queue->poll_list)) { struct net_device *dev; @@ -2038,8 +2042,10 @@ static void net_rx_action(struct softirq dev = list_entry(queue->poll_list.next, struct net_device, poll_list); + have = netpoll_poll_lock(dev); + BUG_ON(test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)); if (dev->quota <= 0 || dev->poll(dev, &budget)) { netpoll_poll_unlock(have); local_irq_disable(); @@ -2074,6 +2080,7 @@ out: softnet_break: __get_cpu_var(netdev_rx_stat).time_squeeze++; + __get_cpu_var(softnet_data).scheduled = jiffies; __raise_softirq_irqoff(NET_RX_SOFTIRQ); goto out; } Index: build-2.6/net/core/netpoll.c =================================================================== --- build-2.6.orig/net/core/netpoll.c +++ build-2.6/net/core/netpoll.c @@ -130,6 +130,7 @@ static void poll_napi(struct netpoll *np * Setting POLL_LIST_FROZEN tells netif_rx_complete * to leave the NAPI state alone. */ + local_bh_disable(); set_bit(__LINK_STATE_POLL_LIST_FROZEN, &np->dev->state); npinfo->rx_flags |= NETPOLL_RX_DROP; atomic_inc(&trapped); @@ -140,6 +141,7 @@ static void poll_napi(struct netpoll *np npinfo->rx_flags &= ~NETPOLL_RX_DROP; clear_bit(__LINK_STATE_POLL_LIST_FROZEN, &np->dev->state); spin_unlock(&npinfo->poll_lock); + local_bh_enable(); } } ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-18 12:04 ` Olaf Kirch @ 2007-07-18 12:41 ` Ingo Molnar 2007-07-18 12:48 ` Ingo Molnar 1 sibling, 0 replies; 67+ messages in thread From: Ingo Molnar @ 2007-07-18 12:41 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem * Olaf Kirch <olaf.kirch@oracle.com> wrote: > On Tuesday 17 July 2007 20:56, Ingo Molnar wrote: > > i logged these not via netconsole but via logging on over the console > > and using dmesg, so it should include everything. in the 100hz case the > > following seems to show the anomaly: > > > > NETDEV WATCHDOG: eth0: transmit timed out > > So, it seems as if for some reason, dev->poll isn't called frequently > enough. > > Here's a debugging patch that tries to locate the problem - can you > give it a try, please? sure, give me a minute. Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-18 12:04 ` Olaf Kirch 2007-07-18 12:41 ` Ingo Molnar @ 2007-07-18 12:48 ` Ingo Molnar 2007-07-18 14:41 ` Olaf Kirch 1 sibling, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-18 12:48 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem * Olaf Kirch <olaf.kirch@oracle.com> wrote: > On Tuesday 17 July 2007 20:56, Ingo Molnar wrote: > > i logged these not via netconsole but via logging on over the console > > and using dmesg, so it should include everything. in the 100hz case the > > following seems to show the anomaly: > > > > NETDEV WATCHDOG: eth0: transmit timed out > > So, it seems as if for some reason, dev->poll isn't called frequently > enough. > > Here's a debugging patch that tries to locate the problem - can you > give it a try, please? it triggers here: netconsole: device eth0 not up yet, forcing it netconsole: timeout waiting for carrier console [netcon0] enabled WARNING: at kernel/softirq.c:138 local_bh_enable() [<c0105e4a>] show_trace_log_lvl+0x19/0x2e [<c0105f43>] show_trace+0x12/0x14 [<c0105f59>] dump_stack+0x14/0x16 [<c0130f8d>] local_bh_enable+0x95/0x15d [<c03cf35b>] netpoll_poll+0xaf/0x361 [<c03cf248>] netpoll_send_skb+0xe8/0x14c [<c03cf8d4>] netpoll_send_udp+0x258/0x260 [<c02f4016>] write_msg+0x53/0x8d [<c012c87e>] __call_console_drivers+0x4e/0x5a [<c012c8e7>] _call_console_drivers+0x5d/0x61 [<c012cf06>] release_console_sem+0x120/0x1c1 [<c012d70a>] register_console+0x22e/0x236 [<c02f3f98>] init_netconsole+0x55/0x67 [<c05e28e5>] kernel_init+0x154/0x2d9 [<c0105c5f>] kernel_thread_helper+0x7/0x10 I've uploaded the full log to: http://redhat.com/~mingo/misc/100hz.2.log something i noticed: netconsole output seems to trickle through though, but very, very slowly (a packet once every 4 seconds or so). TCP/IP is not functional. also, i'm using netconsole via the command line (both the network driver and netconsole is built into the bzImage), maybe that makes a difference? (if there's any other data you'd like to see, let me know.) Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-18 12:48 ` Ingo Molnar @ 2007-07-18 14:41 ` Olaf Kirch 2007-07-18 16:43 ` Ingo Molnar 0 siblings, 1 reply; 67+ messages in thread From: Olaf Kirch @ 2007-07-18 14:41 UTC (permalink / raw) To: Ingo Molnar; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem On Wednesday 18 July 2007 14:48, Ingo Molnar wrote: > something i noticed: netconsole output seems to trickle through though, > but very, very slowly (a packet once every 4 seconds or so). TCP/IP is > not functional. > > also, i'm using netconsole via the command line (both the network driver > and netconsole is built into the bzImage), maybe that makes a > difference? Possibly - but so far there's nothing in the code that jumped at me. Can you try the following please? I'm still pretty much in the dark; so I'm adding a bunch of printks. Let's hope this doesn't change the timing in a way that makes the bug disappear. Thanks Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ----- --- include/linux/netdevice.h | 4 ++++ net/core/dev.c | 14 ++++++++++++++ 2 files changed, 18 insertions(+) Index: build-2.6/include/linux/netdevice.h =================================================================== --- build-2.6.orig/include/linux/netdevice.h +++ build-2.6/include/linux/netdevice.h @@ -692,6 +692,7 @@ struct softnet_data #ifdef CONFIG_NET_DMA struct dma_chan *net_dma; #endif + unsigned long scheduled; }; DECLARE_PER_CPU(struct softnet_data,softnet_data); @@ -1026,6 +1027,9 @@ static inline void netif_rx_complete(str /* Prevent race with netpoll - yes, this is a kludge. * But at least it doesn't penalize the non-netpoll * code path. */ + printk(KERN_DEBUG "netif_rx_complete(%s)%s\n", dev->name, + test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)? + " - frozen" : ""); if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)) return; #endif Index: build-2.6/net/core/dev.c =================================================================== --- build-2.6.orig/net/core/dev.c +++ build-2.6/net/core/dev.c @@ -1192,6 +1192,7 @@ void __netif_rx_schedule(struct net_devi { unsigned long flags; + printk(KERN_DEBUG "__netif_rx_schedule(%s)\n", dev->name); local_irq_save(flags); dev_hold(dev); list_add_tail(&dev->poll_list, &__get_cpu_var(softnet_data).poll_list); @@ -1199,6 +1200,7 @@ void __netif_rx_schedule(struct net_devi dev->quota += dev->weight; else dev->quota = dev->weight; + __get_cpu_var(softnet_data).scheduled = jiffies; __raise_softirq_irqoff(NET_RX_SOFTIRQ); local_irq_restore(flags); } @@ -2028,6 +2030,9 @@ static void net_rx_action(struct softirq local_irq_disable(); + /* Complain if we're scheduled way too late. */ + WARN_ON(time_after(jiffies, __get_cpu_var(softnet_data).scheduled + HZ)); + while (!list_empty(&queue->poll_list)) { struct net_device *dev; @@ -2038,9 +2043,13 @@ static void net_rx_action(struct softirq dev = list_entry(queue->poll_list.next, struct net_device, poll_list); + have = netpoll_poll_lock(dev); + BUG_ON(test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)); if (dev->quota <= 0 || dev->poll(dev, &budget)) { + printk(KERN_DEBUG "netif_rx_action(%s) - keep on polling\n", + dev->name); netpoll_poll_unlock(have); local_irq_disable(); list_move_tail(&dev->poll_list, &queue->poll_list); @@ -2049,6 +2058,10 @@ static void net_rx_action(struct softirq else dev->quota = dev->weight; } else { + printk(KERN_DEBUG "netif_rx_action(%s) - done%s\n", + dev->name, + test_bit(__LINK_STATE_RX_SCHED, &dev->state)? + "; rx_sched set" : ""); netpoll_poll_unlock(have); dev_put(dev); local_irq_disable(); @@ -2074,6 +2087,7 @@ out: softnet_break: __get_cpu_var(netdev_rx_stat).time_squeeze++; + __get_cpu_var(softnet_data).scheduled = jiffies; __raise_softirq_irqoff(NET_RX_SOFTIRQ); goto out; } ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-18 14:41 ` Olaf Kirch @ 2007-07-18 16:43 ` Ingo Molnar 2007-07-19 9:09 ` Ingo Molnar 0 siblings, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-18 16:43 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem * Olaf Kirch <olaf.kirch@ORACLE.COM> wrote: > > also, i'm using netconsole via the command line (both the network > > driver and netconsole is built into the bzImage), maybe that makes a > > difference? > > Possibly - but so far there's nothing in the code that jumped at me. > > Can you try the following please? I'm still pretty much in the dark; > so I'm adding a bunch of printks. Let's hope this doesn't change the > timing in a way that makes the bug disappear. hm ... the box wouldnt even boot up because the 'keep on polling' and 'done' messages are scrolling across so fast. Perhaps every printk triggers a packet which triggers another printk ... etc? here's the 15MB log of it: http://redhat.com/~mingo/misc/100hz.3.log.bz2 Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-18 16:43 ` Ingo Molnar @ 2007-07-19 9:09 ` Ingo Molnar 2007-07-19 9:44 ` Olaf Kirch 2007-07-19 10:17 ` Ingo Molnar 0 siblings, 2 replies; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 9:09 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem i have your original patch applied to my working tree to be able to observe this bug's behavior, and here's another observation: the problem seems to go away if i turn on CONFIG_NO_HZ. So it looks timing related indeed ... but when the bug happens, it happens all the time, reboot after reboot. When it doesnt happen, networking and netlogging is robust for hours, reboot after reboot. That seems atypical for timing problems. I'm puzzled. the e1000 in this laptop is historically pretty robust. The only problem i ever had with it were some rx/tx hw-engine latency problems [pings from the outside took up to 1 second to propagate] that were quickly fixed by the e1000 driver guys. Maybe that's related. (although it never caused total inavailability of networking - it was only latency problems) Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 9:09 ` Ingo Molnar @ 2007-07-19 9:44 ` Olaf Kirch 2007-07-19 10:01 ` Ingo Molnar 2007-07-19 10:17 ` Ingo Molnar 1 sibling, 1 reply; 67+ messages in thread From: Olaf Kirch @ 2007-07-19 9:44 UTC (permalink / raw) To: Ingo Molnar; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem On Thursday 19 July 2007 11:09, Ingo Molnar wrote: > the e1000 in this laptop is historically pretty robust. The only problem > i ever had with it were some rx/tx hw-engine latency problems [pings > from the outside took up to 1 second to propagate] that were quickly > fixed by the e1000 driver guys. Maybe that's related. (although it never > caused total inavailability of networking - it was only latency > problems) I've been poring over this code for 3 days now, and I'm facing a blank wall, mind-wise :-) - it is pretty clear that net_rx_action is invoked every once in a while only. netdev watchdog timeouts are a pretty unmistakable sign for that. - You say that netconsole output continues to trickle after the network gets wedged. This could be caused by the e1000 watchdog, which triggers a NIC interrupt "to ensure rx ring is cleaned". I assume that this triggers the regular e1000_intr, which succeeds in putting the NIC on the poll_list, and net_rx_action call dev->poll once. If this assumption is true, this means that - once an interrupt gets through, NAPI is working as designed - no other interrupts are arriving (Rx, Tx-completion) So, can you verify whether there are any interrupts arriving on the NIC after the network got wedged? You could also try ethtool -s eth0 msglevel 65535 - would be interesting to see what dmesg contains. If there's little to no debug output from the driver, let it run for 10 seconds or so, in order to catch the e1000 watchdog timer a few times. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 9:44 ` Olaf Kirch @ 2007-07-19 10:01 ` Ingo Molnar 2007-07-19 10:37 ` Olaf Kirch 0 siblings, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 10:01 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem * Olaf Kirch <olaf.kirch@oracle.com> wrote: > - You say that netconsole output continues to trickle after > the network gets wedged. This could be caused by the > e1000 watchdog, which triggers a NIC interrupt "to ensure > rx ring is cleaned". I assume that this triggers the > regular e1000_intr, which succeeds in putting the NIC on > the poll_list, and net_rx_action call dev->poll once. no - it appears that 'trickle' only happened with one of your patches (to which i replied with that 'trickle' mail). With what i have booted now (only your original patch and nothing else, 100 Hz and !dynticks), netconsole output stopped here: Calling initcall 0xc0603f55: netpoll_init+0x0/0x39() initcall 0xc0603f55: netpoll_init+0x0/0x39() returned 0. initcall 0xc0603f55 ran for 0 msecs: netpoll_init+0x0/0x39() Calling initcall 0xc0604257: netlink_proto_init+0x0/0x12a() NET: Registered protocol family 16 and no output ever since - and the box has been up for a few minutes. > So, can you verify whether there are any interrupts arriving on the > NIC after the network got wedged? You could also try ethtool -s eth0 > msglevel 65535 - would be interesting to see what dmesg contains. If > there's little to no debug output from the driver, let it run for 10 > seconds or so, in order to catch the e1000 watchdog timer a few times. eth0's irq count is stuck at 5 interrupts - and has not changed for minutes. i tried ethtool -s eth0 msglvl 65535, but (sa expected) there's no output. I've attached below ifconfig output and ethtool -S output - maybe that tells you something new about the state of eth0. (to me it only tells what we already know: tx timed out once and eth0 is stuck ever since.) Btw., i definitely need your help with this bug as it's now hopelessly out of my league :-/ Ingo ------------------> eth0 Link encap:Ethernet HWaddr 00:16:41:17:49:D2 inet addr:10.0.1.15 Bcast:10.255.255.255 Mask:255.0.0.0 UP BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:873 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 b) TX bytes:87076 (85.0 KiB) Base address:0x2000 Memory:ee000000-ee020000 NIC statistics: rx_packets: 0 tx_packets: 873 rx_bytes: 0 tx_bytes: 87076 rx_broadcast: 0 tx_broadcast: 0 rx_multicast: 0 tx_multicast: 0 rx_errors: 0 tx_errors: 0 tx_dropped: 0 multicast: 0 collisions: 0 rx_length_errors: 0 rx_over_errors: 0 rx_crc_errors: 0 rx_frame_errors: 0 rx_no_buffer_count: 0 rx_missed_errors: 0 tx_aborted_errors: 0 tx_carrier_errors: 0 tx_fifo_errors: 0 tx_heartbeat_errors: 0 tx_window_errors: 0 tx_abort_late_coll: 0 tx_deferred_ok: 0 tx_single_coll_ok: 0 tx_multi_coll_ok: 0 tx_timeout_count: 1 tx_restart_queue: 0 rx_long_length_errors: 0 rx_short_length_errors: 0 rx_align_errors: 0 tx_tcp_seg_good: 0 tx_tcp_seg_failed: 0 rx_flow_control_xon: 0 rx_flow_control_xoff: 0 tx_flow_control_xon: 0 tx_flow_control_xoff: 0 rx_long_byte_count: 0 rx_csum_offload_good: 0 rx_csum_offload_errors: 0 rx_header_split: 0 alloc_rx_buff_failed: 0 tx_smbus: 0 rx_smbus: 0 dropped_smbus: 0 ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 10:01 ` Ingo Molnar @ 2007-07-19 10:37 ` Olaf Kirch 2007-07-19 10:47 ` Ingo Molnar 0 siblings, 1 reply; 67+ messages in thread From: Olaf Kirch @ 2007-07-19 10:37 UTC (permalink / raw) To: Ingo Molnar; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem On Thursday 19 July 2007 12:01, Ingo Molnar wrote: > Calling initcall 0xc0603f55: netpoll_init+0x0/0x39() > initcall 0xc0603f55: netpoll_init+0x0/0x39() returned 0. > initcall 0xc0603f55 ran for 0 msecs: netpoll_init+0x0/0x39() > Calling initcall 0xc0604257: netlink_proto_init+0x0/0x12a() > NET: Registered protocol family 16 > > and no output ever since - and the box has been up for a few minutes. Okay, I need to ask a stupid question - did you verify that it's not spinning on a spinlock? Specifically, I'm wondering whether the net_rx_action softirq may be scheduled while we're in poll_napi holding the poll_lock. net_rx_action would try to take the poll_lock as well, and we'd be hung for good. The patch with local_bh_disable/enable was supposed to test that idea (this is the "trickle" patch) Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 10:37 ` Olaf Kirch @ 2007-07-19 10:47 ` Ingo Molnar 2007-07-19 10:58 ` Ingo Molnar 0 siblings, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 10:47 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem * Olaf Kirch <olaf.kirch@oracle.com> wrote: > On Thursday 19 July 2007 12:01, Ingo Molnar wrote: > > Calling initcall 0xc0603f55: netpoll_init+0x0/0x39() > > initcall 0xc0603f55: netpoll_init+0x0/0x39() returned 0. > > initcall 0xc0603f55 ran for 0 msecs: netpoll_init+0x0/0x39() > > Calling initcall 0xc0604257: netlink_proto_init+0x0/0x12a() > > NET: Registered protocol family 16 > > > > and no output ever since - and the box has been up for a few minutes. > > Okay, I need to ask a stupid question - did you verify that it's not > spinning on a spinlock? the box is fully usable after it has booted up and (as you can see it in the config i've uploaded) i've got every kernel debug option enabled, including lockdep. > Specifically, I'm wondering whether the net_rx_action softirq may be > scheduled while we're in poll_napi holding the poll_lock. > net_rx_action would try to take the poll_lock as well, and we'd be > hung for good. The patch with local_bh_disable/enable was supposed to > test that idea (this is the "trickle" patch) the box isnt hung - it just has no networking. (i couldnt get the logs out if it were hung - netconsole is the only remote debugging option and with that broken i can only get info out if it boots up fine) Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 10:47 ` Ingo Molnar @ 2007-07-19 10:58 ` Ingo Molnar 2007-07-19 12:52 ` Olaf Kirch 2007-07-19 15:07 ` Ingo Molnar 0 siblings, 2 replies; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 10:58 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok * Ingo Molnar <mingo@elte.hu> wrote: > * Olaf Kirch <olaf.kirch@oracle.com> wrote: > > > On Thursday 19 July 2007 12:01, Ingo Molnar wrote: > > > Calling initcall 0xc0603f55: netpoll_init+0x0/0x39() > > > initcall 0xc0603f55: netpoll_init+0x0/0x39() returned 0. > > > initcall 0xc0603f55 ran for 0 msecs: netpoll_init+0x0/0x39() > > > Calling initcall 0xc0604257: netlink_proto_init+0x0/0x12a() > > > NET: Registered protocol family 16 > > > > > > and no output ever since - and the box has been up for a few minutes. > > > > Okay, I need to ask a stupid question - did you verify that it's not > > spinning on a spinlock? ok, i just to make my description clearer: 'netconsole output hung' means that on the netconsole-receiving box (which is not the laptop with the problem) i dont get any netconsole output printed. (i.e. UDP packets are not being sent by the laptop.) The above snippet is the last i get. Otherwise the laptop boots up fine, but has that early tx timeout and subsequently no networking. I can still do things on the laptop as normal, but eth0 irqs do not advance (are stuck at 5) and networking is stuck. i.e. it's the classic 'eth0 got stuck somehow' tx/rx state machine hickup symptoms, with no other bad symptoms such as lockups or crashes. Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 10:58 ` Ingo Molnar @ 2007-07-19 12:52 ` Olaf Kirch 2007-07-19 12:54 ` Olaf Kirch 2007-07-19 15:42 ` Kok, Auke 2007-07-19 15:07 ` Ingo Molnar 1 sibling, 2 replies; 67+ messages in thread From: Olaf Kirch @ 2007-07-19 12:52 UTC (permalink / raw) To: Ingo Molnar Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok On Thursday 19 July 2007 12:58, Ingo Molnar wrote: > i.e. it's the classic 'eth0 got stuck somehow' tx/rx state machine > hickup symptoms, with no other bad symptoms such as lockups or crashes. Duh, I found it. The e1000 poll routine does this to leave polling mode. netif_rx_complete(poll_dev); e1000_irq_enable(adapter); return 0; Which looks innocent enough, except that e1000_irq_enable has this little irq_sem counter: if (likely(atomic_dec_and_test(&adapter->irq_sem))) { E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK); E1000_WRITE_FLUSH(&adapter->hw); } So as poll_napi calls the poll() routine repeatedly, the irq_sem counter is decremented by one each time. During the first call, it re-enables the interrupt. During the next calls, irq_sem goes negative. Then an interrupt comes in, e1000_intr disables the interrupt, increments irq_sem by one, and schedules the device for rx_action. rx_action calls dev->poll(), which finishes cleaning rx/rx rings, and when it finds there's no more work, it calls rx_complete and irq_enable. Except irq_enable doesn't enable anything now, since irq_sem is <= 0, and dec_and_test returns false. The whole irq_sem accounting in the e1000 does not rhyme well with netpoll's way of exercising dev->poll(). The reason my patch triggers the problem reliably for you is that now, we always get at least two invocations of dev->poll: once from poll_napi - where we do not remove the device from the poll list any longer - and another one from net_rx_action. I don't have a fix ready yet - I hope I'll have something later this afternoon. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 12:52 ` Olaf Kirch @ 2007-07-19 12:54 ` Olaf Kirch 2007-07-19 15:42 ` Kok, Auke 1 sibling, 0 replies; 67+ messages in thread From: Olaf Kirch @ 2007-07-19 12:54 UTC (permalink / raw) To: Ingo Molnar Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok On Thursday 19 July 2007 14:52, Olaf Kirch wrote: > On Thursday 19 July 2007 12:58, Ingo Molnar wrote: > > i.e. it's the classic 'eth0 got stuck somehow' tx/rx state machine > > hickup symptoms, with no other bad symptoms such as lockups or crashes. > > Duh, I found it. The following patch should confirm my theory. Can you give it a try, please? Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax --------- Index: build-2.6/drivers/net/e1000/e1000_main.c =================================================================== --- build-2.6.orig/drivers/net/e1000/e1000_main.c +++ build-2.6/drivers/net/e1000/e1000_main.c @@ -358,6 +358,7 @@ e1000_irq_enable(struct e1000_adapter *a E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK); E1000_WRITE_FLUSH(&adapter->hw); } + BUG_ON(atomic_read(&adapter->irq_sem) < 0); } static void ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 12:52 ` Olaf Kirch 2007-07-19 12:54 ` Olaf Kirch @ 2007-07-19 15:42 ` Kok, Auke 2007-07-19 16:07 ` Ingo Molnar 1 sibling, 1 reply; 67+ messages in thread From: Kok, Auke @ 2007-07-19 15:42 UTC (permalink / raw) To: Olaf Kirch Cc: Ingo Molnar, Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok Olaf Kirch wrote: > On Thursday 19 July 2007 12:58, Ingo Molnar wrote: >> i.e. it's the classic 'eth0 got stuck somehow' tx/rx state machine >> hickup symptoms, with no other bad symptoms such as lockups or crashes. > > Duh, I found it. > > The e1000 poll routine does this to leave polling mode. > > netif_rx_complete(poll_dev); > e1000_irq_enable(adapter); > return 0; > > Which looks innocent enough, except that e1000_irq_enable has > this little irq_sem counter: > > if (likely(atomic_dec_and_test(&adapter->irq_sem))) { > E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK); > E1000_WRITE_FLUSH(&adapter->hw); > } > > So as poll_napi calls the poll() routine repeatedly, the irq_sem > counter is decremented by one each time. During the first call, > it re-enables the interrupt. During the next calls, irq_sem goes > negative. > > Then an interrupt comes in, e1000_intr disables the interrupt, > increments irq_sem by one, and schedules the device for rx_action. > rx_action calls dev->poll(), which finishes cleaning rx/rx rings, > and when it finds there's no more work, it calls rx_complete and > irq_enable. Except irq_enable doesn't enable anything now, since > irq_sem is <= 0, and dec_and_test returns false. > > The whole irq_sem accounting in the e1000 does not rhyme well with > netpoll's way of exercising dev->poll(). it's been accused of worse things ;) > The reason my patch triggers > the problem reliably for you is that now, we always get at least > two invocations of dev->poll: once from poll_napi - where we do not > remove the device from the poll list any longer - and another one > from net_rx_action. > > I don't have a fix ready yet - I hope I'll have something later > this afternoon. interesting, you seem to found the cause allright. I can't confirm the problem but I know that netpoll and NAPI has historically been an issue. I look forward to your suggestions... Cheers, Auke ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 15:42 ` Kok, Auke @ 2007-07-19 16:07 ` Ingo Molnar 2007-07-19 19:13 ` Olaf Kirch 0 siblings, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 16:07 UTC (permalink / raw) To: Kok, Auke Cc: Olaf Kirch, Jarek Poplawski, Linus Torvalds, linux-kernel, davem * Kok, Auke <auke-jan.h.kok@intel.com> wrote: > > I don't have a fix ready yet - I hope I'll have something later this > > afternoon. > > interesting, you seem to found the cause allright. I can't confirm the > problem but I know that netpoll and NAPI has historically been an > issue. I look forward to your suggestions... because i dont seem to be able to trigger Olaf's WARN_ON(), can you see anything in the ethtool output that i sent in the previous mail(s)? Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 16:07 ` Ingo Molnar @ 2007-07-19 19:13 ` Olaf Kirch 2007-07-19 19:22 ` Ingo Molnar 0 siblings, 1 reply; 67+ messages in thread From: Olaf Kirch @ 2007-07-19 19:13 UTC (permalink / raw) To: Ingo Molnar Cc: Kok, Auke, Jarek Poplawski, Linus Torvalds, linux-kernel, davem On Thursday 19 July 2007 18:07, Ingo Molnar wrote: > because i dont seem to be able to trigger Olaf's WARN_ON(), can you see > anything in the ethtool output that i sent in the previous mail(s)? If the WARN_ON doesn't trigger, I cannot see how my patch would affect your system. - IF we enter the if() branch in poll_napi, then the following must hold: - an interrupt from the e1000 arrived - e1000_intr disables interrupts, increments irq_sem (which is now 1) and schedules rx_action - Now, inside poll_napi, the following happens: - poll_napi is called, finds the device is marked for polling, invokes dev->poll - dev->poll calls netif_rx_complete (which does *not* remove the device from the poll list), and re-enables interrupts. irq_sem is now 0. - Finally, the rx_action softirq is run, which calls dev->poll again, which ends up invoking netif_rx_complete once more, and tries to re-enable interrupts. The latter doesn't do anything except decrementing irq_sem once more, which now goes negative. Which would trigger the WARN_ON. Now, as you say the WARN_ON is never triggered, it follows that we never end up in the if() branch of poll_napi. But that is where the only substantial modification of my patch is. Here's a somewhat drastic modification that should not change any timing, but just verifies whether my patch is to blame at all. Can you give it a try? Thanks, Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax --- Test patch --- include/linux/netdevice.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: build-2.6/include/linux/netdevice.h =================================================================== --- build-2.6.orig/include/linux/netdevice.h +++ build-2.6/include/linux/netdevice.h @@ -1027,7 +1027,7 @@ static inline void netif_rx_complete(str * But at least it doesn't penalize the non-netpoll * code path. */ if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)) - return; + BUG(); #endif local_irq_save(flags); ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 19:13 ` Olaf Kirch @ 2007-07-19 19:22 ` Ingo Molnar 2007-07-19 19:35 ` Olaf Kirch 0 siblings, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 19:22 UTC (permalink / raw) To: Olaf Kirch Cc: Kok, Auke, Jarek Poplawski, Linus Torvalds, linux-kernel, davem * Olaf Kirch <olaf.kirch@oracle.com> wrote: > Here's a somewhat drastic modification that should not change any > timing, but just verifies whether my patch is to blame at all. Can you > give it a try? > @@ -1027,7 +1027,7 @@ static inline void netif_rx_complete(str > * But at least it doesn't penalize the non-netpoll > * code path. */ > if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)) > - return; > + BUG(); ok, i tried the patch below, and it gave this (single) warning: Calling initcall 0xc02f5c17: init_netconsole+0x0/0x67() netconsole: device eth0 not up yet, forcing it netconsole: timeout waiting for carrier console [netcon0] enabled WARNING: at include/linux/netdevice.h:1030 netif_rx_complete() [<c0105e3e>] show_trace_log_lvl+0x19/0x2e [<c0105f37>] show_trace+0x12/0x14 [<c0105f4d>] dump_stack+0x14/0x16 [<c02c4fef>] e1000_clean+0x1f4/0x26f [<c03d0f6f>] netpoll_poll+0x8b/0x357 [<c03d0e80>] netpoll_send_skb+0xe8/0x14c [<c03d1502>] netpoll_send_udp+0x258/0x260 [<c02f5cea>] write_msg+0x53/0x8d [<c012c5ba>] __call_console_drivers+0x4e/0x5a [<c012c623>] _call_console_drivers+0x5d/0x61 [<c012cc42>] release_console_sem+0x120/0x1c1 [<c012d446>] register_console+0x22e/0x236 [<c02f5c6c>] init_netconsole+0x55/0x67 [<c05e48e5>] kernel_init+0x154/0x2d9 [<c0105c53>] kernel_thread_helper+0x7/0x10 ======================= e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX netconsole: network logging started initcall 0xc02f5c17: init_netconsole+0x0/0x67() returned 0. initcall 0xc02f5c17 ran for 4012 msecs: init_netconsole+0x0/0x67() Calling initcall 0xc0600ff9: spi_transport_init+0x0/0x27() initcall 0xc0600ff9: spi_transport_init+0x0/0x27() returned 0. Ingo --- include/linux/netdevice.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-cfs-2.6.23-git.q.prev3/include/linux/netdevice.h =================================================================== --- linux-cfs-2.6.23-git.q.prev3.orig/include/linux/netdevice.h +++ linux-cfs-2.6.23-git.q.prev3/include/linux/netdevice.h @@ -1027,7 +1027,7 @@ static inline void netif_rx_complete(str * But at least it doesn't penalize the non-netpoll * code path. */ if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)) - return; + WARN_ON(1); #endif local_irq_save(flags); ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 19:22 ` Ingo Molnar @ 2007-07-19 19:35 ` Olaf Kirch 2007-07-19 19:56 ` Ingo Molnar 0 siblings, 1 reply; 67+ messages in thread From: Olaf Kirch @ 2007-07-19 19:35 UTC (permalink / raw) To: Ingo Molnar Cc: Kok, Auke, Jarek Poplawski, Linus Torvalds, linux-kernel, davem Does the following help? Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax Test patch --- Index: build-2.6/drivers/net/netconsole.c =================================================================== --- build-2.6.orig/drivers/net/netconsole.c +++ build-2.6/drivers/net/netconsole.c @@ -70,7 +70,7 @@ static void write_msg(struct console *co int frag, left; unsigned long flags; - if (!np.dev) + if (!np.dev || !netif_running(np.dev)) return; local_irq_save(flags); ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 19:35 ` Olaf Kirch @ 2007-07-19 19:56 ` Ingo Molnar 2007-07-19 20:02 ` Olaf Kirch 0 siblings, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 19:56 UTC (permalink / raw) To: Olaf Kirch Cc: Kok, Auke, Jarek Poplawski, Linus Torvalds, linux-kernel, davem * Olaf Kirch <olaf.kirch@oracle.com> wrote: > Does the following help? > --- build-2.6.orig/drivers/net/netconsole.c > +++ build-2.6/drivers/net/netconsole.c > @@ -70,7 +70,7 @@ static void write_msg(struct console *co > int frag, left; > unsigned long flags; > > - if (!np.dev) > + if (!np.dev || !netif_running(np.dev)) > return; nope - with this patch applied the box still has no network, symptoms are similar. (should i apply the WARN_ON() patch too?) Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 19:56 ` Ingo Molnar @ 2007-07-19 20:02 ` Olaf Kirch 2007-07-20 9:45 ` Ingo Molnar 0 siblings, 1 reply; 67+ messages in thread From: Olaf Kirch @ 2007-07-19 20:02 UTC (permalink / raw) To: Ingo Molnar Cc: Kok, Auke, Jarek Poplawski, Linus Torvalds, linux-kernel, davem On Thursday 19 July 2007 21:56, Ingo Molnar wrote: > nope - with this patch applied the box still has no network, symptoms > are similar. (should i apply the WARN_ON() patch too?) Yes, that would be nice. If that doesn't help, you can also throw in the one below. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax --- Index: build-2.6/net/core/netpoll.c =================================================================== --- build-2.6.orig/net/core/netpoll.c +++ build-2.6/net/core/netpoll.c @@ -650,7 +650,6 @@ int netpoll_setup(struct netpoll *np) return -ENODEV; } - np->dev = ndev; if (!ndev->npinfo) { npinfo = kmalloc(sizeof(*npinfo), GFP_KERNEL); if (!npinfo) { @@ -722,6 +721,8 @@ int netpoll_setup(struct netpoll *np) } } + np->dev = ndev; + if (is_zero_ether_addr(np->local_mac) && ndev->dev_addr) memcpy(np->local_mac, ndev->dev_addr, 6); ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 20:02 ` Olaf Kirch @ 2007-07-20 9:45 ` Ingo Molnar 0 siblings, 0 replies; 67+ messages in thread From: Ingo Molnar @ 2007-07-20 9:45 UTC (permalink / raw) To: Olaf Kirch Cc: Kok, Auke, Jarek Poplawski, Linus Torvalds, linux-kernel, davem * Olaf Kirch <olaf.kirch@oracle.com> wrote: > On Thursday 19 July 2007 21:56, Ingo Molnar wrote: > > nope - with this patch applied the box still has no network, symptoms > > are similar. (should i apply the WARN_ON() patch too?) > > Yes, that would be nice. If that doesn't help, you can also throw in > the one below. > - np->dev = ndev; > if (!ndev->npinfo) { > npinfo = kmalloc(sizeof(*npinfo), GFP_KERNEL); > if (!npinfo) { > @@ -722,6 +721,8 @@ int netpoll_setup(struct netpoll *np) > } > } > > + np->dev = ndev; no change in behavior from this patch. If i add the WARN_ON() it triggers once and it makes the network work as well (probably due to the side-effect of the packets generated by the WARN_ON() printks). Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 10:58 ` Ingo Molnar 2007-07-19 12:52 ` Olaf Kirch @ 2007-07-19 15:07 ` Ingo Molnar 2007-07-19 15:27 ` Olaf Kirch 2007-07-19 15:32 ` Ingo Molnar 1 sibling, 2 replies; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 15:07 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok ugh. Something really weird happened with this e1000 problem. i crashed the laptop in a weird way and had to power-cycle it in an unusual fashion. After that i wanted to try your latest BUG_ON() theory but the network hang went away! For 3 hours i tried to reproduce the hang (i went back to the original git tree and the .config under which i found it, i power cycled it again, i unplugged the power cord to make it go off battery, unplugged the ethernet, recreated a completely new tree, etc. etc.) but with no success! Total Heisenbug - and the really annoying thing is that you just had a good theory about what might be happening. I'm now at kernel build iteration #75 in this tree ... maybe it's not the power-cycling that somehow brought the e1000 out of its weird state but the ethtool ops (and the related firmware readouts, etc.)? (i have your BUG_ON() applied (as a WARN_ON()), but it doesnt trigger. Not that this means anything under these circumstances ...) Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 15:07 ` Ingo Molnar @ 2007-07-19 15:27 ` Olaf Kirch 2007-07-19 15:32 ` Ingo Molnar 1 sibling, 0 replies; 67+ messages in thread From: Olaf Kirch @ 2007-07-19 15:27 UTC (permalink / raw) To: Ingo Molnar Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok On Thursday 19 July 2007 17:07, Ingo Molnar wrote: > i crashed the laptop in a weird way and had to power-cycle it in an > unusual fashion. After that i wanted to try your latest BUG_ON() theory > but the network hang went away! Should I rejoice, or regret? :-) > maybe it's not the power-cycling that somehow brought the e1000 out of > its weird state but the ethtool ops (and the related firmware readouts, > etc.)? Oh, don't tell me there's a --scrub-bad-karma switch in ethtool. > (i have your BUG_ON() applied (as a WARN_ON()), but it doesnt trigger. > Not that this means anything under these circumstances ...) Too bad. Now where do we take it from here? I'm currently thinking of ways to do this patch differently. But that is kind of relying on a good testbed to verify whether a different patch works better for you or not... Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 15:07 ` Ingo Molnar 2007-07-19 15:27 ` Olaf Kirch @ 2007-07-19 15:32 ` Ingo Molnar 2007-07-19 15:52 ` Ingo Molnar 1 sibling, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 15:32 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok * Ingo Molnar <mingo@elte.hu> wrote: > ugh. Something really weird happened with this e1000 problem. > > i crashed the laptop in a weird way and had to power-cycle it in an > unusual fashion. After that i wanted to try your latest BUG_ON() > theory but the network hang went away! > > For 3 hours i tried to reproduce the hang (i went back to the original > git tree and the .config under which i found it, i power cycled it > again, i unplugged the power cord to make it go off battery, unplugged > the ethernet, recreated a completely new tree, etc. etc.) but with no > success! Total Heisenbug - and the really annoying thing is that you > just had a good theory about what might be happening. I'm now at > kernel build iteration #75 in this tree ... ah! Just found the reason: the bug apparently depends on the precise kernel command-line contents. I accidentally dropped ignore_loglevel (found this while comparing with the older logs i sent to you), adding it back in produces hung networking too. So it appears that a netconsole printout while e1000 is initializing (or while some other networking component is initializing) might be the culprit? Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 15:32 ` Ingo Molnar @ 2007-07-19 15:52 ` Ingo Molnar 2007-07-19 16:05 ` Ingo Molnar 0 siblings, 1 reply; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 15:52 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok * Ingo Molnar <mingo@elte.hu> wrote: > ah! Just found the reason: the bug apparently depends on the precise > kernel command-line contents. I accidentally dropped ignore_loglevel > (found this while comparing with the older logs i sent to you), adding > it back in produces hung networking too. So it appears that a > netconsole printout while e1000 is initializing (or while some other > networking component is initializing) might be the culprit? and the WARN_ON() below does not seem to trigger. i'll now check whether removing ignore_on_loglevel (no other changes) makes the hang go away. Maybe ignore_on_loglevel is buggy - or it produces an immediate printk (going out to the interface) during a particularly sensitive period of network initialization. Ingo -----------------> Index: linux/drivers/net/e1000/e1000_main.c =================================================================== --- linux.orig/drivers/net/e1000/e1000_main.c +++ linux/drivers/net/e1000/e1000_main.c @@ -358,6 +358,7 @@ e1000_irq_enable(struct e1000_adapter *a E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK); E1000_WRITE_FLUSH(&adapter->hw); } + WARN_ON_ONCE(atomic_read(&adapter->irq_sem) < 0); } static void ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 15:52 ` Ingo Molnar @ 2007-07-19 16:05 ` Ingo Molnar 2007-07-19 16:13 ` Ingo Molnar 2007-07-19 17:36 ` Olaf Kirch 0 siblings, 2 replies; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 16:05 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok * Ingo Molnar <mingo@elte.hu> wrote: > i'll now check whether removing ignore_on_loglevel (no other changes) > makes the hang go away. Maybe ignore_on_loglevel is buggy - or it > produces an immediate printk (going out to the interface) during a > particularly sensitive period of network initialization. nope, that didnt have any effect. I have another theory: i had a network-intense stress-test going on in the past few hours on the same network (not involving the laptop) and i stopped it recently. Perhaps that network-intense test also produced periodic broadcast packets that got the e1000 out of its weird state before the tx timeout could hit. Now that i've stopped the test, the network is quiescent again and the e1000 hangs. Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 16:05 ` Ingo Molnar @ 2007-07-19 16:13 ` Ingo Molnar 2007-07-19 17:36 ` Olaf Kirch 1 sibling, 0 replies; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 16:13 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok * Ingo Molnar <mingo@elte.hu> wrote: > > i'll now check whether removing ignore_on_loglevel (no other > > changes) makes the hang go away. Maybe ignore_on_loglevel is buggy - > > or it produces an immediate printk (going out to the interface) > > during a particularly sensitive period of network initialization. > > nope, that didnt have any effect. I have another theory: i had a > network-intense stress-test going on in the past few hours on the same > network (not involving the laptop) and i stopped it recently. Perhaps > that network-intense test also produced periodic broadcast packets > that got the e1000 out of its weird state before the tx timeout could > hit. Now that i've stopped the test, the network is quiescent again > and the e1000 hangs. yep - i can confirm this theory now: if i start a broadcast ping on another box while the laptop is booting up, the network hang does not happen. If i stop the ping and do another bootup with the very same kernel, it hangs. So for the hang to happen, the network has to be very quiet. At least one mystery solved :-/ Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 16:05 ` Ingo Molnar 2007-07-19 16:13 ` Ingo Molnar @ 2007-07-19 17:36 ` Olaf Kirch 2007-07-19 17:41 ` Ingo Molnar 2007-07-19 17:51 ` Olaf Kirch 1 sibling, 2 replies; 67+ messages in thread From: Olaf Kirch @ 2007-07-19 17:36 UTC (permalink / raw) To: Ingo Molnar Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok On Thursday 19 July 2007 18:05, Ingo Molnar wrote: > that network-intense test also produced periodic broadcast packets that > got the e1000 out of its weird state before the tx timeout could hit. > Now that i've stopped the test, the network is quiescent again and the > e1000 hangs. Can you confirm this by spraying the laptop with arp packets or broadcast pings while it's booting? Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 17:36 ` Olaf Kirch @ 2007-07-19 17:41 ` Ingo Molnar 2007-07-19 17:51 ` Olaf Kirch 1 sibling, 0 replies; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 17:41 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok * Olaf Kirch <olaf.kirch@oracle.com> wrote: > On Thursday 19 July 2007 18:05, Ingo Molnar wrote: > > that network-intense test also produced periodic broadcast packets that > > got the e1000 out of its weird state before the tx timeout could hit. > > Now that i've stopped the test, the network is quiescent again and the > > e1000 hangs. > > Can you confirm this by spraying the laptop with arp packets or > broadcast pings while it's booting? yes, i did exactly that, and it made the box boot. Ingo ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 17:36 ` Olaf Kirch 2007-07-19 17:41 ` Ingo Molnar @ 2007-07-19 17:51 ` Olaf Kirch 1 sibling, 0 replies; 67+ messages in thread From: Olaf Kirch @ 2007-07-19 17:51 UTC (permalink / raw) To: Ingo Molnar Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok On Thursday 19 July 2007 19:36, Olaf Kirch wrote: > Can you confirm this by spraying the laptop with arp packets > or broadcast pings while it's booting? Sorry for the noise - didn't see your other message where you described just that. This sounds more like a hardware issue - Rx interrupt seems to go through (triggering all the usual NAPI cleanup), but the Tx descriptor writeback interrupt doesn't. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-19 9:09 ` Ingo Molnar 2007-07-19 9:44 ` Olaf Kirch @ 2007-07-19 10:17 ` Ingo Molnar 1 sibling, 0 replies; 67+ messages in thread From: Ingo Molnar @ 2007-07-19 10:17 UTC (permalink / raw) To: Olaf Kirch; +Cc: Jarek Poplawski, Linus Torvalds, linux-kernel, davem, Auke Kok * Ingo Molnar <mingo@elte.hu> wrote: > the e1000 in this laptop is historically pretty robust. The only > problem i ever had with it were some rx/tx hw-engine latency problems > [pings from the outside took up to 1 second to propagate] that were > quickly fixed by the e1000 driver guys. Maybe that's related. > (although it never caused total inavailability of networking - it was > only latency problems) i've Cc:-ed Auke - maybe he has some ideas? I've also attached all the other possibly relevant ethtool dumps of eth0 below, done when the hung interface happens. (Auke, see more details about this problem in this thread on lkml.) another thing: how can i make the tx timeout much longer than it is right now? Maybe an interrupt is delayed for long and we just dont tolerate things long enough - and once the device tx timeout has been noticed we kill the interface from subsequent use in essence? there's one oddity i noticed: 0x02810: RDH (Receive desc head) 0x00000000 0x02818: RDT (Receive desc tail) 0x000000FE 0x03810: TDH (Transmit desc head) 0x00000000 0x03818: TDT (Transmit desc tail) 0x00000000 doesnt this suggest that we have a receive packet, just the irq didnt come and we didnt process it? although .. the tx timeout disabled the transmitter and the receiver of the device, correct? That in itself could have destroyed any evidence of what happened earlier on. But ... i'm really only guessing here. Ingo ------------------------------------> MAC Registers ------------- 0x00000: CTRL (Device control register) 0x00140248 Endian mode (buffers): little Link reset: reset Set link up: 1 Invert Loss-Of-Signal: no Receive flow control: disabled Transmit flow control: disabled VLAN mode: disabled Auto speed detect: disabled Speed select: 1000Mb/s Force speed: no Force duplex: no 0x00008: STATUS (Device status register) 0x80080783 Duplex: full Link up: link config TBI mode: disabled Link speed: 1000Mb/s Bus type: PCI Bus speed: 33MHz Bus width: 32-bit 0x00100: RCTL (Receive control register) 0x00008002 Receiver: enabled Store bad packets: disabled Unicast promiscuous: disabled Multicast promiscuous: disabled Long packet: disabled Descriptor minimum threshold size: 1/2 Broadcast accept mode: accept VLAN filter: disabled Cononical form indicator: disabled Discard pause frames: filtered Pass MAC control frames: don't pass Receive buffer size: 2048 0x02808: RDLEN (Receive desc length) 0x00001000 0x02810: RDH (Receive desc head) 0x00000000 0x02818: RDT (Receive desc tail) 0x000000FE 0x02820: RDTR (Receive delay timer) 0x00000000 0x00400: TCTL (Transmit ctrl register) 0x310000F8 Transmitter: disabled Pad short packets: enabled Software XOFF Transmission: disabled Re-transmit on late collision: enabled 0x03808: TDLEN (Transmit desc length) 0x00001000 0x03810: TDH (Transmit desc head) 0x00000000 0x03818: TDT (Transmit desc tail) 0x00000000 0x03820: TIDV (Transmit delay timer) 0x00000008 PHY type: M88 Offset Values ------ ------ 0x0000 00 16 41 17 49 d2 30 0b b2 ff 51 00 ff ff ff ff 0x0010 53 00 03 01 6b 02 01 20 aa 17 9a 10 86 80 df 80 0x0020 00 00 00 20 54 7e 00 00 14 00 da 00 04 00 00 27 0x0030 c9 6c 50 31 3e 07 0b 04 8b 29 00 00 00 f0 02 0f 0x0040 08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff 0x0050 14 00 1d 00 14 00 1d 00 af aa 1e 00 00 00 1d 00 0x0060 00 01 00 40 1f 12 07 40 ff ff ff ff ff ff ff ff 0x0070 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ef 9f Ring parameters for eth0: Pre-set maximums: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 4096 Current hardware settings: RX: 256 RX Mini: 0 RX Jumbo: 0 TX: 256 driver: e1000 version: 7.3.20-k2-NAPI firmware-version: 0.5-1 bus-info: 0000:02:00.0 Offload parameters for eth0: rx-checksumming: on tx-checksumming: on scatter-gather: on tcp segmentation offload: on udp fragmentation offload: off generic segmentation offload: off NIC statistics: rx_packets: 0 tx_packets: 591 rx_bytes: 0 tx_bytes: 58116 rx_broadcast: 0 tx_broadcast: 0 rx_multicast: 0 tx_multicast: 0 rx_errors: 0 tx_errors: 0 tx_dropped: 0 multicast: 0 collisions: 0 rx_length_errors: 0 rx_over_errors: 0 rx_crc_errors: 0 rx_frame_errors: 0 rx_no_buffer_count: 0 rx_missed_errors: 0 tx_aborted_errors: 0 tx_carrier_errors: 0 tx_fifo_errors: 0 tx_heartbeat_errors: 0 tx_window_errors: 0 tx_abort_late_coll: 0 tx_deferred_ok: 0 tx_single_coll_ok: 0 tx_multi_coll_ok: 0 tx_timeout_count: 1 tx_restart_queue: 0 rx_long_length_errors: 0 rx_short_length_errors: 0 rx_align_errors: 0 tx_tcp_seg_good: 0 tx_tcp_seg_failed: 0 rx_flow_control_xon: 0 rx_flow_control_xoff: 0 tx_flow_control_xon: 0 tx_flow_control_xoff: 0 rx_long_byte_count: 0 rx_csum_offload_good: 0 rx_csum_offload_errors: 0 rx_header_split: 0 alloc_rx_buff_failed: 0 tx_smbus: 0 rx_smbus: 0 dropped_smbus: 0 ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 18:34 ` Olaf Kirch 2007-07-17 18:56 ` Ingo Molnar @ 2007-07-18 11:48 ` Jarek Poplawski 2007-07-19 5:58 ` Jarek Poplawski 1 sibling, 1 reply; 67+ messages in thread From: Jarek Poplawski @ 2007-07-18 11:48 UTC (permalink / raw) To: Olaf Kirch; +Cc: Ingo Molnar, Linus Torvalds, linux-kernel, davem Hi, Here is my proposal of a solution based on dev->state flag, but intended mainly to prevent poll_napi from disturbing while net_rx_action is running and polling the device. It doesn't look very nice or clean but I hope it could guard net_rx_action enough with some room for netpoll too. I'd be very glad if it could be verified and/or tested. Regards, Jarek P. PS: not tested! --- diff -Nurp 2.6.22-git9-/include/linux/netdevice.h 2.6.22-git9/include/linux/netdevice.h --- 2.6.22-git9-/include/linux/netdevice.h 2007-07-18 07:50:56.000000000 +0200 +++ 2.6.22-git9/include/linux/netdevice.h 2007-07-18 13:21:12.000000000 +0200 @@ -262,6 +262,8 @@ enum netdev_state_t __LINK_STATE_LINKWATCH_PENDING, __LINK_STATE_DORMANT, __LINK_STATE_QDISC_RUNNING, + /* Set by net_rx_action vs poll_napi */ + __LINK_STATE_POLL_LIST_FROZEN, }; diff -Nurp 2.6.22-git9-/net/core/dev.c 2.6.22-git9/net/core/dev.c --- 2.6.22-git9-/net/core/dev.c 2007-07-18 07:50:58.000000000 +0200 +++ 2.6.22-git9/net/core/dev.c 2007-07-18 13:18:23.000000000 +0200 @@ -2025,6 +2025,7 @@ static void net_rx_action(struct softirq unsigned long start_time = jiffies; int budget = netdev_budget; void *have; + struct net_device *dev_frozen = NULL; local_irq_disable(); @@ -2040,15 +2041,35 @@ static void net_rx_action(struct softirq struct net_device, poll_list); have = netpoll_poll_lock(dev); +#ifdef CONFIG_NETPOLL + /* prevent a race with poll_napi() */ + if (dev_frozen && dev_frozen != dev) { + clear_bit(__LINK_STATE_POLL_LIST_FROZEN, + &dev_frozen->state); + set_bit(__LINK_STATE_POLL_LIST_FROZEN, + &dev->state); + dev_frozen = dev; + } else if (!dev_frozen) { + set_bit(__LINK_STATE_POLL_LIST_FROZEN, + &dev->state); + dev_frozen = dev; + } +#endif if (dev->quota <= 0 || dev->poll(dev, &budget)) { netpoll_poll_unlock(have); local_irq_disable(); - list_move_tail(&dev->poll_list, &queue->poll_list); + list_move_tail(&dev->poll_list, + &queue->poll_list); if (dev->quota < 0) dev->quota += dev->weight; else dev->quota = dev->weight; } else { +#ifdef CONFIG_NETPOLL + clear_bit(__LINK_STATE_POLL_LIST_FROZEN, + &dev->state); + dev_frozen = NULL; +#endif netpoll_poll_unlock(have); dev_put(dev); local_irq_disable(); @@ -2056,6 +2077,14 @@ static void net_rx_action(struct softirq } out: local_irq_enable(); +#ifdef CONFIG_NETPOLL + if (dev_frozen) { + have = netpoll_poll_lock(dev_frozen); + clear_bit(__LINK_STATE_POLL_LIST_FROZEN, + &dev_frozen->state); + netpoll_poll_unlock(have); + } +#endif #ifdef CONFIG_NET_DMA /* * There may not be any more sk_buffs coming right now, so push diff -Nurp 2.6.22-git9-/net/core/netpoll.c 2.6.22-git9/net/core/netpoll.c --- 2.6.22-git9-/net/core/netpoll.c 2007-07-18 07:50:58.000000000 +0200 +++ 2.6.22-git9/net/core/netpoll.c 2007-07-18 12:09:43.000000000 +0200 @@ -124,13 +124,20 @@ static void poll_napi(struct netpoll *np if (test_bit(__LINK_STATE_RX_SCHED, &np->dev->state) && npinfo->poll_owner != smp_processor_id() && spin_trylock(&npinfo->poll_lock)) { - npinfo->rx_flags |= NETPOLL_RX_DROP; - atomic_inc(&trapped); + /* + * If POLL_LIST_FROZEN is set net_rx_action + * is working with this device. + */ + if (!test_bit(__LINK_STATE_POLL_LIST_FROZEN, + &np->dev->state)) { + npinfo->rx_flags |= NETPOLL_RX_DROP; + atomic_inc(&trapped); - np->dev->poll(np->dev, &budget); + np->dev->poll(np->dev, &budget); - atomic_dec(&trapped); - npinfo->rx_flags &= ~NETPOLL_RX_DROP; + atomic_dec(&trapped); + npinfo->rx_flags &= ~NETPOLL_RX_DROP; + } spin_unlock(&npinfo->poll_lock); } } ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-18 11:48 ` Jarek Poplawski @ 2007-07-19 5:58 ` Jarek Poplawski 0 siblings, 0 replies; 67+ messages in thread From: Jarek Poplawski @ 2007-07-19 5:58 UTC (permalink / raw) To: Olaf Kirch; +Cc: Ingo Molnar, Linus Torvalds, linux-kernel, davem On Wed, Jul 18, 2007 at 01:48:20PM +0200, Jarek Poplawski wrote: ... > I'd be very glad if it could be verified and/or tested. Jarek, This patch is verified crap! Regards, Jarek P. PS: Olaf, You've written earlier that one of the main reasons for poll_napi is to work when the kernel "doesn't even service softirqs anymore". But in your patch poll_napi leaves netif_rx_complete for softirqs, so even if it starts to work for Ingo in normal conditions, probably something else is needed, anyway. ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 8:57 ` Ingo Molnar 2007-07-17 9:29 ` Jarek Poplawski 2007-07-17 14:07 ` Olaf Kirch @ 2007-07-17 17:49 ` Linus Torvalds 2 siblings, 0 replies; 67+ messages in thread From: Linus Torvalds @ 2007-07-17 17:49 UTC (permalink / raw) To: Ingo Molnar; +Cc: Olaf Kirch, Jarek Poplawski, linux-kernel, davem On Tue, 17 Jul 2007, Ingo Molnar wrote: > > i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes > the problem go away. So it's somehow also related to jiffies. No, I suspect it's just related to timing: you need to hit that window when the LIST_FROZEN bit is set, and since it was so reliable for you before (and others didn't see it), your timing probably happened to hit it exactly. And changing HZ probably just changed some timeout or softirq getting called, and you didn't hit the bug any more. Linus ^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll 2007-07-17 8:28 ` Olaf Kirch 2007-07-17 8:57 ` Ingo Molnar @ 2007-07-17 9:12 ` Jarek Poplawski 1 sibling, 0 replies; 67+ messages in thread From: Jarek Poplawski @ 2007-07-17 9:12 UTC (permalink / raw) To: Olaf Kirch; +Cc: Ingo Molnar, Linus Torvalds, linux-kernel, davem On Tue, Jul 17, 2007 at 10:28:34AM +0200, Olaf Kirch wrote: > On Tuesday 17 July 2007 09:55, Olaf Kirch wrote: > > What I find more problematic about this portion of code though > > is that once a net_device is over quota, net_rx_action will > > loop for up to one jiffy, even if there's just this one device on > > the poll_list. > > Duh, wrong. For every loop, it'll add dev->weight to dev->quota, > so it'll be fine very soon. After your patch only two things can be different here: dev->poll_list and this flag. Since Ingo offered testing it should be easy to check after moving the check of "your" flag to __netif_rx_complete, and skipping only list_del on true. Jarek P. ^ permalink raw reply [flat|nested] 67+ messages in thread
end of thread, other threads:[~2007-07-20 9:46 UTC | newest] Thread overview: 67+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-07-16 9:12 [patch] revert: [NET]: Fix races in net_rx_action vs netpoll Ingo Molnar 2007-07-16 10:35 ` Olaf Kirch 2007-07-16 11:26 ` David Miller 2007-07-16 12:18 ` Olaf Kirch 2007-07-16 13:29 ` Ingo Molnar 2007-07-16 21:09 ` Ingo Molnar 2007-07-16 22:06 ` David Miller 2007-07-16 21:40 ` Linus Torvalds 2007-07-16 21:51 ` Ingo Molnar 2007-07-16 22:09 ` David Miller 2007-07-16 22:37 ` Ingo Molnar 2007-07-16 22:57 ` David Miller 2007-07-17 18:09 ` Ingo Molnar 2007-07-16 22:08 ` David Miller 2007-07-16 22:29 ` Linus Torvalds 2007-07-16 22:52 ` David Miller 2007-07-16 23:17 ` Matt Mackall 2007-07-16 23:34 ` Linus Torvalds 2007-07-17 7:37 ` Olaf Kirch 2007-07-17 8:16 ` Olaf Kirch 2007-07-17 5:46 ` Jarek Poplawski 2007-07-17 6:14 ` Jarek Poplawski 2007-07-17 7:55 ` Olaf Kirch 2007-07-17 8:28 ` Olaf Kirch 2007-07-17 8:57 ` Ingo Molnar 2007-07-17 9:29 ` Jarek Poplawski 2007-07-17 14:07 ` Olaf Kirch 2007-07-17 16:57 ` Ingo Molnar 2007-07-17 18:06 ` Olaf Kirch 2007-07-17 18:18 ` Ingo Molnar 2007-07-17 18:34 ` Olaf Kirch 2007-07-17 18:56 ` Ingo Molnar 2007-07-18 12:04 ` Olaf Kirch 2007-07-18 12:41 ` Ingo Molnar 2007-07-18 12:48 ` Ingo Molnar 2007-07-18 14:41 ` Olaf Kirch 2007-07-18 16:43 ` Ingo Molnar 2007-07-19 9:09 ` Ingo Molnar 2007-07-19 9:44 ` Olaf Kirch 2007-07-19 10:01 ` Ingo Molnar 2007-07-19 10:37 ` Olaf Kirch 2007-07-19 10:47 ` Ingo Molnar 2007-07-19 10:58 ` Ingo Molnar 2007-07-19 12:52 ` Olaf Kirch 2007-07-19 12:54 ` Olaf Kirch 2007-07-19 15:42 ` Kok, Auke 2007-07-19 16:07 ` Ingo Molnar 2007-07-19 19:13 ` Olaf Kirch 2007-07-19 19:22 ` Ingo Molnar 2007-07-19 19:35 ` Olaf Kirch 2007-07-19 19:56 ` Ingo Molnar 2007-07-19 20:02 ` Olaf Kirch 2007-07-20 9:45 ` Ingo Molnar 2007-07-19 15:07 ` Ingo Molnar 2007-07-19 15:27 ` Olaf Kirch 2007-07-19 15:32 ` Ingo Molnar 2007-07-19 15:52 ` Ingo Molnar 2007-07-19 16:05 ` Ingo Molnar 2007-07-19 16:13 ` Ingo Molnar 2007-07-19 17:36 ` Olaf Kirch 2007-07-19 17:41 ` Ingo Molnar 2007-07-19 17:51 ` Olaf Kirch 2007-07-19 10:17 ` Ingo Molnar 2007-07-18 11:48 ` Jarek Poplawski 2007-07-19 5:58 ` Jarek Poplawski 2007-07-17 17:49 ` Linus Torvalds 2007-07-17 9:12 ` Jarek Poplawski
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox