* netconsole still hangs
From: Andrew Morton @ 2008-03-12 23:14 UTC
To: Stephen Hemminger; +Cc: netdev, Rafael J. Wysocki

I thought the recent reverts fixed this, but it seems that it's just become
a little harder to hit.

I'm seeing netconsole hangs on two x86_64 systems (2-way t61p laptop, 8-way
server).  Both use e1000.

With current mainline on the 8-way, create a printk storm with

while true
do
	echo t > /proc/sysrq-trigger
done

and the machine goes tits-up after about five seconds.

* Re: netconsole still hangs
From: Andrew Morton @ 2008-03-12 23:16 UTC
To: shemminger, netdev, rjw

On Wed, 12 Mar 2008 16:14:29 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

>
> I thought the recent reverts fixed this, but it seems that it's just become
> a little harder to hit.
>
> I'm seeing netconsole hangs on two x86_64 systems (2-way t61p laptop, 8-way
> server).  Both use e1000.
>
> With current mainline on the 8-way, create a printk storm with
>
> while true
> do
> 	echo t > /proc/sysrq-trigger
> done
>
> and the machine goes tits-up after about five seconds.
>

whoops, hang on, it's still running.  For some reason the receiving machine
went super-slow after it had received a lot of data.

I'll retest on the t61p this evening, see why it is freezing.

* Re: netconsole still hangs
From: Andrew Morton @ 2008-03-12 23:30 UTC
To: shemminger, netdev, rjw

On Wed, 12 Mar 2008 16:16:37 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 12 Mar 2008 16:14:29 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> >
> > I thought the recent reverts fixed this, but it seems that it's just become
> > a little harder to hit.
> >
> > I'm seeing netconsole hangs on two x86_64 systems (2-way t61p laptop, 8-way
> > server).  Both use e1000.
> >
> > With current mainline on the 8-way, create a printk storm with
> >
> > while true
> > do
> > 	echo t > /proc/sysrq-trigger
> > done
> >
> > and the machine goes tits-up after about five seconds.
> >
>
> whoops, hang on, it's still running.

And it's still running!  I killed the above loop five minutes ago and
nothing new is coming out in `dmesg -c', yet data is still flying out over
netconsole.  Hundreds and hundreds of megabytes.

So I'd say that something in netconsole or the console subsystem has
screwed up its buffer indices and it has gone infinite.

I don't know whether that's a regression though.

<does reboot -f>

OK, that stopped it, so the problem isn't buffering at the receiver.  I
already knew that, because the ifconfig "TX bytes" counters were going up
on the sending side.

<runs the sysrq-trigger thing again>

OK, this time it did hang up.  Machine unpingable, no signs of life.

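The sender-side check mentioned in this message (watching the transmit byte counter keep climbing after the loop was killed) can be sketched as a small shell helper. This is an illustration only, not something posted in the thread: the function name `tx_delta` is hypothetical, and it reads a per-interface sysfs counter file instead of parsing ifconfig output, on the assumption that such a file is available on the system under test.

```shell
#!/bin/sh
# Hypothetical helper: print how much a byte counter grew over an interval.
#   $1 = path to a file holding a single integer counter
#        (assumed here: /sys/class/net/<iface>/statistics/tx_bytes)
#   $2 = seconds to wait between the two samples (default 1)
tx_delta() {
    counter=$1
    interval=${2:-1}
    before=$(cat "$counter") || return 1
    sleep "$interval"
    after=$(cat "$counter") || return 1
    echo $((after - before))
}

# Example run against the loopback interface, only if the counter exists.
if [ -r /sys/class/net/lo/statistics/tx_bytes ]; then
    tx_delta /sys/class/net/lo/statistics/tx_bytes 1
fi
```

A result that stays large well after the reproducer has been stopped would match the "still flying out over netconsole" observation; the interface to watch is whichever one netconsole is bound to.
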
* Re: netconsole still hangs
From: Andrew Morton @ 2008-03-12 23:57 UTC
To: shemminger, netdev, rjw

On Wed, 12 Mar 2008 16:30:13 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 12 Mar 2008 16:16:37 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > On Wed, 12 Mar 2008 16:14:29 -0700
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > >
> > > I thought the recent reverts fixed this, but it seems that it's just become
> > > a little harder to hit.
> > >
> > > I'm seeing netconsole hangs on two x86_64 systems (2-way t61p laptop, 8-way
> > > server).  Both use e1000.
> > >
> > > With current mainline on the 8-way, create a printk storm with
> > >
> > > while true
> > > do
> > > 	echo t > /proc/sysrq-trigger
> > > done
> > >
> > > and the machine goes tits-up after about five seconds.
> > >
> >
> > whoops, hang on, it's still running.
>
> And it's still running!  I killed the above loop five minutes ago and
> nothing new is coming out in `dmesg -c', yet data is still flying out over
> netconsole.  Hundreds and hundreds of megabytes.
>
> So I'd say that something in netconsole or the console subsystem has
> screwed up its buffer indices and it has gone infinite.
>
> I don't know whether that's a regression though.
>
> <does reboot -f>
>
> OK, that stopped it, so the problem isn't buffering at the receiver.  I
> already knew that, because the ifconfig "TX bytes" counters were going up
> on the sending side.
>
> <runs the sysrq-trigger thing again>
>
> OK, this time it did hang up.  Machine unpingable, no signs of life.
>

I reran the test on 2.6.24 and all seemed fine: the machine didn't hang and
stopping the script stopped the netconsole output.

* Re: netconsole still hangs
From: David Miller @ 2008-03-13 6:10 UTC
To: akpm; +Cc: shemminger, netdev, rjw

From: Andrew Morton <akpm@linux-foundation.org>
Date: Wed, 12 Mar 2008 16:57:17 -0700

> I reran the test on 2.6.24 and all seemed fine: the machine didn't hang and
> stopping the script stopped the netconsole output.

Can you go back and bisect the tree in one shot to the guilty
commit you found last time, and make sure this test case
works at that point?

With the current data points available, the only conclusion
I have is that I somehow mis-reverted the changes or missed
some important detail.  Because, you said that the problem
went away when you bisected to that specific point.

Thanks.

* Re: netconsole still hangs
From: Andrew Morton @ 2008-03-13 6:52 UTC
To: David Miller; +Cc: shemminger, netdev, rjw

On Wed, 12 Mar 2008 23:10:53 -0700 (PDT) David Miller <davem@davemloft.net> wrote:

> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Wed, 12 Mar 2008 16:57:17 -0700
>
> > I reran the test on 2.6.24 and all seemed fine: the machine didn't hang and
> > stopping the script stopped the netconsole output.
>
> Can you go back and bisect the tree in one shot to the guilty
> commit you found last time,

That was 33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7

> and make sure this test case works at that point?

argh, I'm too git-stupid to know how to get a tree which ends just prior to
33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7.  Help?

> With the current data points available, the only conclusion
> I have is that I somehow mis-reverted the changes or missed
> some important detail.  Because, you said that the problem
> went away when you bisected to that specific point.

Whine.  I can _trivially_ reproduce this stuff on the two machines which I
tried it on.  Has anyone over in net land actually expended some cycles
trying to reproduce these things?

I tried it on the t61p and actually got an oops:

general protection fault: 0000 [1] SMP
last sysfs file: /sys/class/net/wlan0/address
CPU 0
Modules linked in: autofs4 sunrpc nf_conntrack_ipv4 ipt_REJECT iptable_filter
ip_tables nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header
ip6t_REJECT ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand
acpi_cpufreq dm_mirror dm_log dm_multipath dm_mod snd_hda_intel snd_seq_dummy
snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss button arc4
ecb crypto_blkcipher snd_mixer_oss iwl4965 joydev snd_pcm mac80211
firewire_ohci battery ac thinkpad_acpi hwmon cfg80211 snd_timer snd_page_alloc
snd_hwdep i2c_i801 i2c_core pcspkr firewire_core crc_itu_t snd soundcore
sr_mod sg cdrom ata_piix ahci libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd
ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 2916, comm: zsh Not tainted 2.6.25-rc5-mm1 #9
RIP: 0010:[<ffffffff8123a8ee>]  [<ffffffff8123a8ee>] zap_completion_queue+0x54/0x85
RSP: 0018:ffff810072c35938  EFLAGS: 00010002
RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000001
RDX: ffff810072d46000 RSI: 0000000000000001 RDI: ffff81007c87f6c0
RBP: ffff810072c35958 R08: 0000000000000002 R09: ffffffff81226956
R10: 00000000000000c8 R11: ffff81007e0b6fc8 R12: ffff810001069dc0
R13: 6b6b6b6b6b6b6b6b R14: ffff81007cfa8b30 R15: 0000000000000010
FS:  00007f9084c716f0(0000) GS:ffffffff81439000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000069b960 CR3: 0000000072ca7000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process zsh (pid: 2916, threadinfo ffff810072c34000, task ffff810072d46000)
Stack:  00000000000000c8 ffff81007e0b6fc8 0000000000000000 ffff81007e0b6fc8
 ffff810072c359c8 ffffffff8123b5d3 ffff81007dc61298 000000008123b624
 00000000ffffffff ffff81007e0b6fc8 0000000000000046 0000000000000000
Call Trace:
 [<ffffffff8123b5d3>] netpoll_poll+0x394/0x3b3
 [<ffffffff8123b1b0>] netpoll_send_skb+0xf1/0x180
 [<ffffffff8123b8ef>] netpoll_send_udp+0x277/0x283
 [<ffffffff811d9043>] write_msg+0x80/0xbd
 [<ffffffff81033537>] __call_console_drivers+0x6a/0x7b
 [<ffffffff810335a7>] _call_console_drivers+0x5f/0x63
 [<ffffffff81033a11>] release_console_sem+0x12e/0x1ba
 [<ffffffff81033f51>] vprintk+0x32d/0x37a
 [<ffffffff812b1ed1>] ? _spin_unlock_irqrestore+0x38/0x47
 [<ffffffff81033a90>] ? release_console_sem+0x1ad/0x1ba
 [<ffffffff81033a90>] ? release_console_sem+0x1ad/0x1ba
 [<ffffffff81034005>] printk+0x67/0x69
 [<ffffffff81049efa>] ? up+0xf/0x38
 [<ffffffff81034005>] ? printk+0x67/0x69
 [<ffffffff810806c1>] ? next_zone+0x28/0x2a
 [<ffffffff81079e03>] show_free_areas+0x289/0x3b0
 [<ffffffff81049efa>] ? up+0xf/0x38
 [<ffffffff81049efa>] ? up+0xf/0x38
 [<ffffffff8102122c>] show_mem+0x2a/0x16a
 [<ffffffff811a1c3a>] sysrq_handle_showmem+0x9/0xb
 [<ffffffff811a1dd4>] __handle_sysrq+0x9c/0x139
 [<ffffffff810e74ff>] ? write_sysrq_trigger+0x0/0x37
 [<ffffffff810e752c>] write_sysrq_trigger+0x2d/0x37
 [<ffffffff810e0fb0>] proc_reg_write+0x87/0xa4
 [<ffffffff810a2386>] vfs_write+0xae/0x157
 [<ffffffff810a255a>] sys_write+0x47/0x70
 [<ffffffff8100c037>] tracesys+0xdc/0xe1

Code: fa e8 fb 58 e1 ff f6 c7 02 4d 8b 6c 24 68 49 c7 44 24 68 00 00 00 00 75 09 53 9d e8 e1 58 e1 ff eb 2c e8 8d 7e e1 ff 53 9d eb 23 <49> 83 bd 80 00 00 00 00 49 8b 5d 00 74 0a 4c 89 ef e8 06 24 ff
RIP  [<ffffffff8123a8ee>] zap_completion_queue+0x54/0x85
 RSP <ffff810072c35938>
---[ end trace 5edf7491f5a4a975 ]---

* Re: netconsole still hangs
From: David Miller @ 2008-03-13 7:12 UTC
To: akpm; +Cc: shemminger, netdev, rjw

From: Andrew Morton <akpm@linux-foundation.org>
Date: Wed, 12 Mar 2008 23:52:05 -0700

> Whine.  I can _trivially_ reproduce this stuff on the two machines which I
> tried it on.  Has anyone over in net land actually expended some cycles
> trying to reproduce these things?

I would if I weren't in Japan currently giving presentations
and attending BoF's, so whining at me isn't very productive.

I asked you to do the bisect check since that's the fastest
way forward at the current time.

* Re: netconsole still hangs
From: Andrew Morton @ 2008-03-13 7:25 UTC
To: David Miller; +Cc: shemminger, netdev, rjw

On Thu, 13 Mar 2008 00:12:55 -0700 (PDT) David Miller <davem@davemloft.net> wrote:

> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Wed, 12 Mar 2008 23:52:05 -0700
>
> > Whine.  I can _trivially_ reproduce this stuff on the two machines which I
> > tried it on.  Has anyone over in net land actually expended some cycles
> > trying to reproduce these things?
>
> I would if I weren't in Japan currently giving presentations
> and attending BoF's, so whining at me isn't very productive.

Others were cc'ed..

> I asked you to do the bisect check since that's the fastest
> way forward at the current time.

I'm sitting here waiting to do it, but was hoping that you'd
know what the magical git command is.

* Re: netconsole still hangs
From: Jike Song @ 2008-03-13 7:48 UTC
To: Andrew Morton; +Cc: David Miller, shemminger, netdev, rjw

On Thu, Mar 13, 2008 at 3:25 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu, 13 Mar 2008 00:12:55 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>
> > I asked you to do the bisect check since that's the fastest
> > way forward at the current time.
>
> I'm sitting here waiting to do it, but was hoping that you'd
> know what the magical git command is.

Hello Mr Andrew,

I guess you just want this?

$ git archive -v 33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7 | ( cd <pathname> && tar xf -)

The commit prior to this one is 0953864160bdd28dfe45fd46fa462b4d2d53cb96, so
please use the latter instead if you don't want to include
33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7.

Best Regards,

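Since the two hashes named above are child and parent, the tree "just prior to" a commit can also be reached without looking the parent up by hand, using git's `<commit>^` suffix. The following is a sketch in a throwaway repository, not something posted in the thread: the commit messages and the `demo` identity are placeholders, and the variable names are hypothetical.

```shell
#!/bin/sh
# Build a scratch repository with two empty commits, then resolve and
# check out the parent of the second one via the "<commit>^" suffix.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Identity is passed inline so the sketch does not depend on global config.
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "last good commit"
good=$(git rev-parse HEAD)

git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "suspect commit"
suspect=$(git rev-parse HEAD)

# "$suspect^" names the first parent: the tree just before the suspect.
parent=$(git rev-parse "$suspect^")
git checkout -q "$parent"
echo "checked out $(git rev-parse HEAD)"
```

In a kernel tree the equivalent step would be `git checkout 33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7^`, which leaves the working tree at the parent commit in a detached-HEAD state.
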
* [PATCH] Re: netconsole still hangs
From: Jarek Poplawski @ 2008-03-14 23:47 UTC
To: Andrew Morton; +Cc: David Miller, shemminger, netdev, rjw

Andrew Morton wrote, On 03/13/2008 07:52 AM:
...
> I tried it on the t61p and actually got an oops:
>
> general protection fault: 0000 [1] SMP
> last sysfs file: /sys/class/net/wlan0/address
> CPU 0
> Modules linked in: autofs4 sunrpc nf_conntrack_ipv4 ipt_REJECT iptable_filter
> ip_tables nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header
> ip6t_REJECT ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand
> acpi_cpufreq dm_mirror dm_log dm_multipath dm_mod snd_hda_intel snd_seq_dummy
> snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss button arc4
> ecb crypto_blkcipher snd_mixer_oss iwl4965 joydev snd_pcm mac80211
> firewire_ohci battery ac thinkpad_acpi hwmon cfg80211 snd_timer snd_page_alloc
> snd_hwdep i2c_i801 i2c_core pcspkr firewire_core crc_itu_t snd soundcore
> sr_mod sg cdrom ata_piix ahci libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd
> ohci_hcd ehci_hcd [last unloaded: microcode]
> Pid: 2916, comm: zsh Not tainted 2.6.25-rc5-mm1 #9
> RIP: 0010:[<ffffffff8123a8ee>]  [<ffffffff8123a8ee>] zap_completion_queue+0x54/0x85
> RSP: 0018:ffff810072c35938  EFLAGS: 00010002
> RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000001

The 6b6b... pattern looks like POISON_FREE, probably hit in the loop
"while (clist != NULL)". I haven't found a culprit, but it could be
some dev_kfree_skb_irq/_any() user; netpoll is not necessarily to
blame. (The card could matter here: e1000, iwl4965...?)

BTW, here is a patch which isn't supposed to fix this oops, but seems
to be needed near this place.

Regards,
Jarek P.

----------->

[NETPOLL] zap_completion_queue: adjust skb->users counter

zap_completion_queue() retrieves skbs from completion_queue where they
have zero skb->users counter. Before dev_kfree_skb_any() it should be
non-zero yet, so it's increased now.


Reported-and-tested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

(not tested)
---

 net/core/netpoll.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index d0c8bf5..b04d643 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -215,10 +215,12 @@ static void zap_completion_queue(void)
 		while (clist != NULL) {
 			struct sk_buff *skb = clist;
 			clist = clist->next;
-			if (skb->destructor)
+			if (skb->destructor) {
+				atomic_inc(&skb->users);
 				dev_kfree_skb_any(skb); /* put this one back */
-			else
+			} else {
 				__kfree_skb(skb);
+			}
 		}
 	}

* Re: [PATCH] Re: netconsole still hangs
From: Andrew Morton @ 2008-03-17 23:12 UTC
To: Jarek Poplawski; +Cc: davem, shemminger, netdev, rjw

On Sat, 15 Mar 2008 00:47:49 +0100
Jarek Poplawski <jarkao2@gmail.com> wrote:

> [NETPOLL] zap_completion_queue: adjust skb->users counter
>
> zap_completion_queue() retrieves skbs from completion_queue where they
> have zero skb->users counter. Before dev_kfree_skb_any() it should be
> non-zero yet, so it's increased now.
>
>
> Reported-and-tested-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>
> (not tested)
> ---
>
>  net/core/netpoll.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index d0c8bf5..b04d643 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -215,10 +215,12 @@ static void zap_completion_queue(void)
>  		while (clist != NULL) {
>  			struct sk_buff *skb = clist;
>  			clist = clist->next;
> -			if (skb->destructor)
> +			if (skb->destructor) {
> +				atomic_inc(&skb->users);
>  				dev_kfree_skb_any(skb); /* put this one back */
> -			else
> +			} else {
>  				__kfree_skb(skb);
> +			}
>  		}
>  	}

I retested.  This patch doesn't appear to make anything worse, but the hang
is still there.

* Re: [PATCH] Re: netconsole still hangs
From: Jarek Poplawski @ 2008-03-18 8:04 UTC
To: Andrew Morton; +Cc: davem, shemminger, netdev, rjw

On Mon, Mar 17, 2008 at 04:12:22PM -0700, Andrew Morton wrote:
...
> I retested.  This patch doesn't appear to make anything worse, but the hang
> is still there.

Yes, but since this doesn't look like something very common, and we
don't even know if this oops and the hangs are the same bug, more
information is needed, e.g.:

- is it reproducible with e1000E only and no wlan?
- is there a possibility to check this with some other card
  (even wlan while e1000E is off)?
- could you add .config to the bugzilla report:
  http://bugzilla.kernel.org/show_bug.cgi?id=10238
- is it acceptable to send you some patches for debugging this?

Regards,
Jarek P.

* [Bug 10238] Re: [PATCH] Re: netconsole still hangs
From: Andrew Morton @ 2008-03-18 8:50 UTC
To: Jarek Poplawski
Cc: davem, shemminger, netdev, rjw, bugme-daemon@kernel-bugs.osdl.org

On Tue, 18 Mar 2008 08:04:39 +0000 Jarek Poplawski <jarkao2@gmail.com> wrote:

> On Mon, Mar 17, 2008 at 04:12:22PM -0700, Andrew Morton wrote:
> ...
> > I retested.  This patch doesn't appear to make anything worse, but the hang
> > is still there.
>
> Yes, but since this doesn't look like something very common, and we
> don't even know if this oops and the hangs are the same bug, more
> information is needed, e.g.:
>
> - is it reproducible with e1000E only and no wlan?

Yes.

Both the machines I can reproduce this on have both E1000=y and E1000E=y.
From the dmesg (below), one uses e1000 and the other uses e1000e.  Both
crash.

http://userweb.kernel.org/~akpm/config-akpm2.txt
http://userweb.kernel.org/~akpm/dmesg-akpm2.txt

http://userweb.kernel.org/~akpm/config-t61p.txt
http://userweb.kernel.org/~akpm/dmesg-t61p.txt

I used to be able to reproduce the problems with a 2-way i386 e100 system,
but that seems to be fixed now, perhaps from David's revert.

I also used to be able to reproduce the problem on a one-way i386 e100
machine but that also seems to have gone away.

> - is there a possibility to check this with some other card
>   (even wlan while e1000E is off)?

err, dunno.  Perhaps I could try e1000 on the e1000e-using machine and vice
versa, but for that some PCI ID table hacking might be needed.

I cc'ed bugzilla on this thread.

> - could you add .config to the bugzilla report:
>   http://bugzilla.kernel.org/show_bug.cgi?id=10238

See above.

> - is it acceptable to send you some patches for debugging this?

As a last resort.  But it'd surely be better if a net developer could
reproduce this and do some work on it.  It's bog-trivial to reproduce here
and afaik nobody has even tried.  Perhaps you have...

service syslog stop
while true
do
	echo t > /proc/sysrq-trigger
done

and that's it.

* Re: [Bug 10238] Re: [PATCH] Re: netconsole still hangs
From: Jarek Poplawski @ 2008-03-18 21:05 UTC
To: Andrew Morton
Cc: davem, shemminger, netdev, rjw, bugme-daemon@kernel-bugs.osdl.org

Andrew Morton wrote, On 03/18/2008 09:50 AM:
...
> As a last resort.  But it'd surely be better if a net developer could
> reproduce this and do some work on it.  It's bog-trivial to reproduce here
> and afaik nobody has even tried.  Perhaps you have...
>
> service syslog stop
> while true
> do
> 	echo t > /proc/sysrq-trigger
> done
>
> and that's it.

Alas, my testing possibilities, especially with a real network, are very
limited, but I can confirm: yes, the above test really hangs my box, yet
with syslog on and netconsole off. So maybe I'm missing something, but I
don't understand why you expect netconsole to endure this.

IMHO, after the below patch to sched.c you can't compare netconsole to
2.6.24 with this sysrq-trigger test; any bugs found with this could be
something old and not necessarily in netconsole (they could merely be
exposed by netconsole, like the earlier-mentioned, unexplained oops,
probably after a double kfree).

Regards,
Jarek P.

From: Nick Piggin <nickpiggin@yahoo.com.au>
Date: Fri, 25 Jan 2008 20:08:34 +0000 (+0100)
Subject: sched: print backtrace of running tasks too
X-Git-Tag: v2.6.25-rc1~1237^2~3
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=5fb5e6de55860a99c2d8fe7e0c8222d5c53d8464

sched: print backtrace of running tasks too

The attached patch is something really simple that can sometimes help
in getting more info out of a hung system.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 4d3a5a7..524285e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5161,8 +5161,7 @@ void sched_show_task(struct task_struct *p)
 	printk(KERN_CONT "%5lu %5d %6d\n", free,
 		task_pid_nr(p), task_pid_nr(p->real_parent));
 
-	if (state != TASK_RUNNING)
-		show_stack(p, NULL);
+	show_stack(p, NULL);
 }
 
 void show_state_filter(unsigned long state_filter)

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5fb5e6de55860a99c2d8fe7e0c8222d5c53d8464

* Re: [Bug 10238] Re: [PATCH] Re: netconsole still hangs
From: Andrew Morton @ 2008-03-18 21:47 UTC
To: Jarek Poplawski; +Cc: davem, shemminger, netdev, rjw, bugme-daemon

On Tue, 18 Mar 2008 22:05:42 +0100
Jarek Poplawski <jarkao2@gmail.com> wrote:

> Andrew Morton wrote, On 03/18/2008 09:50 AM:
> ...
> > As a last resort.  But it'd surely be better if a net developer could
> > reproduce this and do some work on it.  It's bog-trivial to reproduce here
> > and afaik nobody has even tried.  Perhaps you have...
> >
> > service syslog stop
> > while true
> > do
> > 	echo t > /proc/sysrq-trigger
> > done
> >
> > and that's it.
>
> Alas, my testing possibilities, especially with a real network, are very
> limited, but I can confirm: yes, the above test really hangs my box, yet
> with syslog on and netconsole off. So maybe I'm missing something, but I
> don't understand why you expect netconsole to endure this.

I expect it to fail coz it's recently been filled with bugs ;)

I see that your
netpoll-zap_completion_queue-adjust-skb-users-counter.patch should fix the
oops I earlier hit.  Good.

> IMHO, after the below patch to sched.c you can't compare netconsole to
> 2.6.24 with this sysrq-trigger test; any bugs found with this could be
> something old and not necessarily in netconsole (they could merely be
> exposed by netconsole, like the earlier-mentioned, unexplained oops,
> probably after a double kfree).
>
> Regards,
> Jarek P.
>
> From: Nick Piggin <nickpiggin@yahoo.com.au>
> Date: Fri, 25 Jan 2008 20:08:34 +0000 (+0100)
> Subject: sched: print backtrace of running tasks too
> X-Git-Tag: v2.6.25-rc1~1237^2~3
> X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=5fb5e6de55860a99c2d8fe7e0c8222d5c53d8464
>
> sched: print backtrace of running tasks too
>
> The attached patch is something really simple that can sometimes help
> in getting more info out of a hung system.
>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 4d3a5a7..524285e 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5161,8 +5161,7 @@ void sched_show_task(struct task_struct *p)
>  	printk(KERN_CONT "%5lu %5d %6d\n", free,
>  		task_pid_nr(p), task_pid_nr(p->real_parent));
>
> -	if (state != TASK_RUNNING)
> -		show_stack(p, NULL);
> +	show_stack(p, NULL);
>  }
>
>  void show_state_filter(unsigned long state_filter)

hm.

I tried a few things:

1: cat monstrous-text-file > /dev/kmsg

   Works OK.

2: Disable netconsole, do

   while true
   do
   	echo t > /proc/sysrq-trigger
   done

   Works OK.

3: Enable netconsole, do

   while true
   do
   	echo t > /proc/sysrq-trigger
   done

   Output comes out.  I was able to ^C the while loop.  After a while the
   output stopped.  So that seems OK too.

So right now it's cannot-reproduce.  I'll try things on the other machine
this evening.

I dunno why the sched.c change causes your sysrq-T operation to fail.  Can
you provide more details please?

* Re: [Bug 10238] Re: [PATCH] Re: netconsole still hangs
From: Jarek Poplawski @ 2008-03-18 22:47 UTC
To: Andrew Morton; +Cc: davem, shemminger, netdev, rjw, bugme-daemon

On Tue, Mar 18, 2008 at 02:47:42PM -0700, Andrew Morton wrote:
> On Tue, 18 Mar 2008 22:05:42 +0100
> Jarek Poplawski <jarkao2@gmail.com> wrote:
...
> > IMHO, after the below patch to sched.c you can't compare netconsole to
> > 2.6.24 with this sysrq-trigger test; any bugs found with this could be
...
> hm.
...
> So right now it's cannot-reproduce.  I'll try things on the other machine
> this evening.
>
> I dunno why the sched.c change causes your sysrq-T operation to fail.  Can
> you provide more details please?

...hmm... Doesn't sysrq-t trigger this sched.c function?

Anyway... My first tests seemed to hang the box with syslog only. Now
I can't repeat it with either syslog or netconsole... So, this patch
is a bad hit or it's really about timing.

Jarek P.

* Re: [Bug 10238] Re: [PATCH] Re: netconsole still hangs
From: Jarek Poplawski @ 2008-03-19 19:17 UTC
To: Andrew Morton; +Cc: davem, shemminger, netdev, rjw, bugme-daemon

On Tue, Mar 18, 2008 at 11:47:29PM +0100, Jarek Poplawski wrote:
...
> Anyway... My first tests seemed to hang the box with syslog only. Now
> I can't repeat it with either syslog or netconsole... So, this patch
> is a bad hit or it's really about timing.

I've just repeated this test with syslog only. After letting it
run for ~5 min., I couldn't break it with any keys for at least the
next 10 min., so I turned the power off. Then the same but with this
sched.c patch reverted: ^C worked after a few seconds. It looks like
time can really matter here.

So, maybe it's again something accidental; I don't have another box
around to stay idle while repeating this test, but it seems this may
not be the best way to compare anything with 2.6.24 or older.

Regards,
Jarek P.

* Re: [Bug 10238] Re: [PATCH] Re: netconsole still hangs
From: Andrew Morton @ 2008-03-19 21:20 UTC
To: Jarek Poplawski; +Cc: davem, shemminger, netdev, rjw, bugme-daemon

On Wed, 19 Mar 2008 20:17:25 +0100
Jarek Poplawski <jarkao2@gmail.com> wrote:

> On Tue, Mar 18, 2008 at 11:47:29PM +0100, Jarek Poplawski wrote:
> ...
> > Anyway... My first tests seemed to hang the box with syslog only. Now
> > I can't repeat it with either syslog or netconsole... So, this patch
> > is a bad hit or it's really about timing.
>
> I've just repeated this test with syslog only. After letting it
> run for ~5 min., I couldn't break it with any keys for at least the
> next 10 min., so I turned the power off. Then the same but with this
> sched.c patch reverted: ^C worked after a few seconds. It looks like
> time can really matter here.

Yeah, I was fiddling with that.  If you do

for i in $(seq 100)
do
	echo t > /proc/sysrq-trigger
done

then yes there's no response to ^C and the machine is basically dead.  But
when the loop finishes, things return to normal.

Perhaps it's something to do with longer holds on tasklist_lock, something
like that.

> So, maybe it's again something accidental; I don't have another box
> around to stay idle while repeating this test, but it seems this may
> not be the best way to compare anything with 2.6.24 or older.

No.  I still haven't retested on the other offending machine.  Right now
I'm not sure that we any longer have anything which needs fixing.  Apart
from merging netpoll-zap_completion_queue-adjust-skb-users-counter.patch?

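The bounded variant of the reproducer in this message can be wrapped in a function so that the iteration count and the file being written are explicit. This is only a sketch under stated assumptions: the name `sysrq_storm` and its optional second argument are hypothetical and exist so the loop can be dry-run against an ordinary file; on a real system the default target needs root, and each write makes the kernel dump every task's state to the console.

```shell
#!/bin/sh
# Hypothetical helper: write "t" to a trigger file a fixed number of times.
#   $1 = number of iterations
#   $2 = file to write to (default: /proc/sysrq-trigger, which needs root)
sysrq_storm() {
    count=$1
    trigger=${2:-/proc/sysrq-trigger}
    i=0
    while [ "$i" -lt "$count" ]; do
        echo t > "$trigger" || return 1
        i=$((i + 1))
    done
}

# Dry run against a scratch file, so nothing is sent to the kernel here.
scratch=$(mktemp)
sysrq_storm 3 "$scratch" && echo "wrote to $scratch"
```

Calling `sysrq_storm 100` as root corresponds to the `seq 100` loop quoted above; unlike the `while true` form it ends on its own, which matters when ^C gets no response.
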
* Re: [Bug 10238] Re: [PATCH] Re: netconsole still hangs
From: David Miller @ 2008-03-19 21:31 UTC
To: akpm; +Cc: jarkao2, shemminger, netdev, rjw, bugme-daemon

From: Andrew Morton <akpm@linux-foundation.org>
Date: Wed, 19 Mar 2008 14:20:10 -0700

> I'm not sure that we any longer have anything which needs fixing.  Apart
> from merging netpoll-zap_completion_queue-adjust-skb-users-counter.patch?

I'll take care of merging this, give me a day or two.

* Re: [Bug 10238] Re: [PATCH] Re: netconsole still hangs 2008-03-19 21:20 ` Andrew Morton 2008-03-19 21:31 ` David Miller @ 2008-03-19 21:54 ` Jarek Poplawski 1 sibling, 0 replies; 24+ messages in thread From: Jarek Poplawski @ 2008-03-19 21:54 UTC (permalink / raw) To: Andrew Morton; +Cc: davem, shemminger, netdev, rjw, bugme-daemon On Wed, Mar 19, 2008 at 02:20:10PM -0700, Andrew Morton wrote: ... > No. I still haven't retested on the other offending machine. Right now > I'm not sure that we any longer have anything which needs fixing. Apart > from merging netpoll-zap_completion_queue-adjust-skb-users-counter.patch? I agree that at least there seems to be no proof of a regression which needs fixing. But I bet there are still things in netpoll, like this zap_completion_queue, which could be (not urgently) fixed... Jarek P. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] Re: netconsole still hangs 2008-03-14 23:47 ` [PATCH] " Jarek Poplawski 2008-03-17 23:12 ` Andrew Morton @ 2008-03-20 23:08 ` David Miller 1 sibling, 0 replies; 24+ messages in thread From: David Miller @ 2008-03-20 23:08 UTC (permalink / raw) To: jarkao2; +Cc: akpm, shemminger, netdev, rjw From: Jarek Poplawski <jarkao2@gmail.com> Date: Sat, 15 Mar 2008 00:47:49 +0100 > [NETPOLL] zap_completion_queue: adjust skb->users counter > > zap_completion_queue() retrieves skbs from completion_queue where they > have zero skb->users counter. Before dev_kfree_skb_any() it should be > non-zero yet, so it's increased now. > > > Reported-and-tested-by: Andrew Morton <akpm@linux-foundation.org> > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> I've applied this, thanks everyone. I'll queue it up for -stable too. ^ permalink raw reply [flat|nested] 24+ messages in thread
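Editorial sketch of the counter problem described in the patch above: a user-space toy model in C, not kernel code. It assumes (as the patch description implies) that the free path only releases a buffer when its reference counter drops from one to zero. The names toy_skb, toy_kfree_skb, toy_zap_unpatched and toy_zap_patched are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct sk_buff: only the reference counter is modelled. */
struct toy_skb {
    int users;      /* models skb->users */
    bool released;  /* set once the buffer would have been freed */
};

/* Assumed behaviour of a refcount-checked free: drop one reference and
 * release the buffer only if that was the last one. */
static void toy_kfree_skb(struct toy_skb *skb)
{
    skb->users--;
    if (skb->users == 0)
        skb->released = true;
}

/* Before the patch: the entry comes off the completion queue with a zero
 * counter and is handed straight to the free path. */
static bool toy_zap_unpatched(struct toy_skb *skb)
{
    assert(skb->users == 0);  /* completion-queue entries have zero users */
    toy_kfree_skb(skb);       /* counter goes to -1, nothing is released */
    return skb->released;
}

/* After the patch: the counter is raised first, so the free path sees the
 * one-to-zero transition it expects. */
static bool toy_zap_patched(struct toy_skb *skb)
{
    assert(skb->users == 0);
    skb->users++;             /* the adjustment the patch describes */
    toy_kfree_skb(skb);
    return skb->released;
}
```

In this model the unpatched path leaves the counter negative and the buffer unreleased, while the patched path releases it. Whether the real symptom was a leak or something else is not established by this sketch.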
* Re: netconsole still hangs 2008-03-13 6:10 ` David Miller 2008-03-13 6:52 ` Andrew Morton @ 2008-03-13 7:59 ` Andrew Morton 2008-03-13 15:09 ` Stephen Hemminger 1 sibling, 1 reply; 24+ messages in thread From: Andrew Morton @ 2008-03-13 7:59 UTC (permalink / raw) To: David Miller; +Cc: shemminger, netdev, rjw On Wed, 12 Mar 2008 23:10:53 -0700 (PDT) David Miller <davem@davemloft.net> wrote: > From: Andrew Morton <akpm@linux-foundation.org> > Date: Wed, 12 Mar 2008 16:57:17 -0700 > > > I reran the test on 2.6.24 and all seemed fine: the machine didn't hang and > > stopping the script stopped the netconsole output. > > Can you go back and bisect the tree in one shot to the guilty > commit you found last time, and make sure this test case > works at that point? Plain old git-checkout 33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7 seems to dtrt. git-checkout 33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7 Fails very very easily. Basically the machine never successfully boots. git-checkout 0953864160bdd28dfe45fd46fa462b4d2d53cb96 Works OK. So yes, I'd say that the revert was not complete. aside: running that while loop just slays the machine. It took 30 seconds to respond to ^C (across ssh over the same link). There's some severe starvation happening somewhere. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: netconsole still hangs 2008-03-13 7:59 ` Andrew Morton @ 2008-03-13 15:09 ` Stephen Hemminger 2008-03-13 15:52 ` Andrew Morton 0 siblings, 1 reply; 24+ messages in thread From: Stephen Hemminger @ 2008-03-13 15:09 UTC (permalink / raw) To: Andrew Morton; +Cc: David Miller, netdev, rjw On Thu, 13 Mar 2008 00:59:01 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 12 Mar 2008 23:10:53 -0700 (PDT) David Miller <davem@davemloft.net> wrote: > > > From: Andrew Morton <akpm@linux-foundation.org> > > Date: Wed, 12 Mar 2008 16:57:17 -0700 > > > > > I reran the test on 2.6.24 and all seemed fine: the machine didn't hang and > > > stopping the script stopped the netconsole output. > > > > Can you go back and bisect the tree in one shot to the guilty > > commit you found last time, and make sure this test case > > works at that point? > > Plain old > > git-checkout 33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7 seems to dtrt. > > > > git-checkout 33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7 > > Fails very very easily. Basically the machine never successfully boots. > > > git-checkout 0953864160bdd28dfe45fd46fa462b4d2d53cb96 > > Works OK. > > > So yes, I'd say that the revert was not complete. > > > aside: running that while loop just slays the machine. It took 30 seconds > to respond to ^C (across ssh over the same link). There's some severe > starvation happening somewhere. The other possible candidates for this are: commit 0953864160bdd28dfe45fd46fa462b4d2d53cb96 Author: Stephen Hemminger <shemminger@linux-foundation.org> Date: Mon Nov 19 19:23:29 2007 -0800 [NETPOLL]: no need to store local_mac commit 5106930bd6b57402205e3de54dae9476e215b622 Author: Stephen Hemminger <shemminger@linux-foundation.org> Date: Mon Nov 19 19:18:11 2007 -0800 [NETPOLL]: netpoll_poll() cleanup What hardware is the problem seen on? Perhaps there is something different to look for? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: netconsole still hangs 2008-03-13 15:09 ` Stephen Hemminger @ 2008-03-13 15:52 ` Andrew Morton 0 siblings, 0 replies; 24+ messages in thread From: Andrew Morton @ 2008-03-13 15:52 UTC (permalink / raw) To: Stephen Hemminger; +Cc: David Miller, netdev, rjw On Thu, 13 Mar 2008 08:09:26 -0700 Stephen Hemminger <shemminger@linux-foundation.org> wrote: > On Thu, 13 Mar 2008 00:59:01 -0700 > Andrew Morton <akpm@linux-foundation.org> wrote: > > > On Wed, 12 Mar 2008 23:10:53 -0700 (PDT) David Miller <davem@davemloft.net> wrote: > > > > > From: Andrew Morton <akpm@linux-foundation.org> > > > Date: Wed, 12 Mar 2008 16:57:17 -0700 > > > > > > > I reran the test on 2.6.24 and all seemed fine: the machine didn't hang and > > > > stopping the script stopped the netconsole output. > > > > > > Can you go back and bisect the tree in one shot to the guilty > > > commit you found last time, and make sure this test case > > > works at that point? > > > > Plain old > > > > git-checkout 33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7 seems to dtrt. > > > > > > > > git-checkout 33f807ba0d9259e7c75c7a2ce8bd2787e5b540c7 > > > > Fails very very easily. Basically the machine never successfully boots. > > > > > > git-checkout 0953864160bdd28dfe45fd46fa462b4d2d53cb96 > > > > Works OK. > > > > > > So yes, I'd say that the revert was not complete. > > > > > > aside: running that while loop just slays the machine. It took 30 seconds > > to respond to ^C (across ssh over the same link). There's some severe > > starvation happening somewhere.
> > > The other possible candidates for this are: > > commit 0953864160bdd28dfe45fd46fa462b4d2d53cb96 > Author: Stephen Hemminger <shemminger@linux-foundation.org> > Date: Mon Nov 19 19:23:29 2007 -0800 > > [NETPOLL]: no need to store local_mac > > commit 5106930bd6b57402205e3de54dae9476e215b622 > Author: Stephen Hemminger <shemminger@linux-foundation.org> > Date: Mon Nov 19 19:18:11 2007 -0800 > > [NETPOLL]: netpoll_poll() cleanup Both of those were present in the tree which resulted from git-checkout 0953864160bdd28dfe45fd46fa462b4d2d53cb96 and that tree passed testing. > What hardware is the problem seen on? Perhaps there is something different > to look for? 8-way x86_64 with e1000E 2-way x86_64 with e1000E I previously saw problems with 1-way i386 and a 2-way i386 both with e100 but I haven't retested those since David's revert. Are the problems not reproducible on your test machines? ^ permalink raw reply [flat|nested] 24+ messages in thread