* 2.6.20.4: NETDEV WATCHDOG and lockups
From: Christian Kujau @ 2007-04-02 19:41 UTC
To: linux-kernel; +Cc: netdev, malte
Hi there,
we have serious problems with 2 of our servers: both are shiny new amd64
dual-core boxes with 2GB RAM each, 32bit kernel+userland (Debian/testing).
Both servers have 2 NICs, RTL8139 (eth0, irq10) and RTL8169s
(eth1, irq11).
Both boxes are running fine but after "a while" they lock up and
eventually restart all of a sudden. The last messages in the logfile
are:
14:15:11 db2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
14:15:14 db2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
Then the box reboots, nothing else in the log.
As the servers have been set up recently, we only know that it happened
with Debian's 2.6.17-? kernel. When we upgraded the installation, we
went to 2.6.18-4-k7 and the problem persisted. We're now using vanilla
2.6.20.4 and while the problem persists, it takes longer to lock up (~20h
as opposed to 4-5h). While this is a good thing for us, it's now harder
to reproduce (we have to wait longer).
Searching the archives turned up quite a few results but no real fix and
lots of old postings too. We then disabled ACPI completely and booted
with 'noapic'. Now both boxes are running for > 20h and we're curious
how long they make it. However, booting with 'noapic' slowed down both
servers *a lot*.
From /proc/interrupts we can see that only CPU0 (core 0) is handling
interrupts while CPU1 does not. We compiled with CONFIG_IRQBALANCE=n so
that irqbalance(1) would work - but to no avail.
Please see http://nerdbynature.de/bits/2.6.20.4/ for details for both
hosts and feel free to ask for more details. Although both boxes are in
production we'll be happy to test more boot options/patches and the like.
TIA,
Christian.
--
BOFH excuse #266:
All of the packets are empty.
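
The watchdog line quoted in the report has a fixed shape, so finding the
last occurrences before each reboot can be automated. The following is a
minimal sketch, assuming log lines look exactly like the two quoted above
(time, host, "kernel:", message); real syslog lines usually carry a date
prefix as well, which the pattern tolerates. The names WATCHDOG_RE and
find_watchdog_events are hypothetical helpers, not part of any tool
mentioned in this thread.

```python
import re

# Matches e.g. "14:15:11 db2 kernel: NETDEV WATCHDOG: eth0: transmit timed out"
WATCHDOG_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"NETDEV WATCHDOG: (?P<dev>\S+): transmit timed out"
)


def find_watchdog_events(lines):
    """Return (time, host, device) for every watchdog timeout line."""
    events = []
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m:
            events.append((m.group("time"), m.group("host"), m.group("dev")))
    return events


# The two log lines quoted in the report.
LOG = [
    "14:15:11 db2 kernel: NETDEV WATCHDOG: eth0: transmit timed out",
    "14:15:14 db2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1",
]

if __name__ == "__main__":
    for time, host, dev in find_watchdog_events(LOG):
        print(f"{host}: {dev} transmit timeout at {time}")
```

Collecting the device name per event makes it easy to see whether the
timeouts always hit the same NIC.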

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Chuck Ebbert @ 2007-04-02 20:20 UTC
To: Christian Kujau; +Cc: linux-kernel, netdev, malte

Christian Kujau wrote:
>
> Please see http://nerdbynature.de/bits/2.6.20.4/ for details for both
> hosts and feel free to ask for more details. Although both boxes are in
> production we'll be happy to test more boot options/patches and the like.

Where is the info from before you changed to "noapic"? Or were the
machines always using XT-PIC for all the interrupts???

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Christian Kujau @ 2007-04-02 21:15 UTC
To: Chuck Ebbert; +Cc: linux-kernel, netdev, malte

On Mon, 2 Apr 2007, Chuck Ebbert wrote:
>> Please see http://nerdbynature.de/bits/2.6.20.4/ for details for both
>> hosts and feel free to ask for more details. Although both boxes are in
>> production we'll be happy to test more boot options/patches and the like.
>
> Where is the info from before you changed to "noapic"? Or were the
> machines always using XT-PIC for all the interrupts???

Whoa, I didn't notice the XT-PIC in /proc/interrupts. From the count of
your question marks I figure it's a bad thing then?

@Malte, do we still have this information around?

Thanks,
Christian.
--
BOFH excuse #63:
not properly grounded, please bury computer

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Christian Kujau @ 2007-04-03 5:34 UTC
To: Chuck Ebbert; +Cc: linux-kernel, netdev, malte

On Mon, 2 Apr 2007, Chuck Ebbert wrote:
> Where is the info from before you changed to "noapic"? Or were the
> machines always using XT-PIC for all the interrupts???

XT-PIC has only been in use since we switched to noapic; before that,
both ethernet cards were on IO-APIC-fasteoi and interrupts were balanced
well.

Thanks,
Christian.
--
BOFH excuse #340:
Well fix that in the next (upgrade, update, patch release, service pack).

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Christian Kujau @ 2007-04-03 15:17 UTC
To: Chuck Ebbert; +Cc: linux-kernel, netdev

On Mon, 2 Apr 2007, Chuck Ebbert wrote:
> Where is the info from before you changed to "noapic"? Or were the
> machines always using XT-PIC for all the interrupts???

We booted with 'acpi=off lapic' (with ACPI options compiled in, to be
able to boot with acpi=on later on) and the box locked up again. I'll
try to boot with a slightly different .config later on.

With 'lapic' /proc/interrupts looks like:

           CPU0       CPU1
  0:      63656      63396   IO-APIC-edge      timer
  1:          0          8   IO-APIC-edge      i8042
  2:          0          0   XT-PIC-XT         cascade
  4:          6          1   IO-APIC-edge      serial
  6:          0          3   IO-APIC-edge      floppy
  8:          0          1   IO-APIC-edge      rtc
 17:      17050          1   IO-APIC-fasteoi   eth1
 18:        102      80615   IO-APIC-fasteoi   eth0
 20:     195817      77721   IO-APIC-fasteoi   libata
NMI:          0          0
LOC:     126969     126970
ERR:          0
MIS:          0

Christian.
--
BOFH excuse #146:
Communications satellite used by the military for star wars.
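
Whether interrupts are "balanced" is easier to judge from per-CPU shares
than from raw counts. The following is a minimal sketch, assuming the
/proc/interrupts layout shown in this message (a CPU header row, then
"label: counts... description"); parse_interrupts and cpu_share are
hypothetical helper names, and the sample text is a subset of the output
quoted above.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: (counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Leading numeric fields are per-CPU counts (ERR/MIS have only one).
        for tok in fields[:ncpu]:
            if not tok.isdigit():
                break
            counts.append(int(tok))
        desc = " ".join(fields[len(counts):])
        table[label.strip()] = (counts, desc)
    return table


def cpu_share(counts):
    """Fraction of an IRQ's interrupts handled by each CPU."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


SAMPLE = """\
           CPU0       CPU1
  0:      63656      63396   IO-APIC-edge      timer
 17:      17050          1   IO-APIC-fasteoi   eth1
 18:        102      80615   IO-APIC-fasteoi   eth0
 20:     195817      77721   IO-APIC-fasteoi   libata
ERR:          0
"""

if __name__ == "__main__":
    for label, (counts, desc) in parse_interrupts(SAMPLE).items():
        if len(counts) > 1 and desc:
            shares = ", ".join(f"{s:.1%}" for s in cpu_share(counts))
            print(f"{label:>4}  {desc:<26} {shares}")
```

On a live system the same function could be fed the contents of
/proc/interrupts instead of SAMPLE.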

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Len Brown @ 2007-04-03 5:20 UTC
To: Christian Kujau; +Cc: linux-kernel, netdev, malte

On Monday 02 April 2007 15:41, Christian Kujau wrote:
>
> Hi there,
>
> we have serious problems with 2 of our servers: both shiny new amd64
> dual core, with both 2GB RAM, 32bit kernel+userland (Debian/testing).
> Both servers have 2 NICs, RTL8139 (eth0, irq10) and RTL8169s
> (eth1, irq11).
>
> Both boxes are running fine but after "a while" they lock up and
> eventually restart all of a sudden. The last messages in the logfile
> are:
>
> 14:15:11 db2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
> 14:15:14 db2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
>
> Then the box reboots, nothing else in the log.
>
> As the servers have been set up recently, we only know that it happened
> with Debian's 2.6.17-? kernel. When we upgraded the installation, we
> went to 2.6.18-4-k7 and the problem persisted. We're now using vanilla
> 2.6.20.4 and while the problem persists, it takes longer to lock up (~20h
> as opposed to 4-5h). While this is a good thing for us, it's now harder
> to reproduce (we have to wait longer).
>
> Searching the archives turned up quite a few results but no real fix and
> lots of old postings too. We then disabled ACPI completely and booted
> with 'noapic'. Now both boxes are running for > 20h and we're curious
> how long they make it. However, booting with 'noapic' slowed down both
> servers *a lot*.

Which increased stability, disabling ACPI, or disabling the IOAPIC?

Your box has MPS, so you should be able to use the IOAPIC in either mode.

Note that you can do these both independently at boot-time with
"acpi=off" and "noapic", respectively.

eg. 4 combos

1. <default - no boot params>
2. noapic
3. acpi=off
4. acpi=off noapic

you started with #1, and are running hard-coded #4 now, but skipped #2
and #3

cheers,
-Len

> From /proc/interrupts we can see that only CPU0 (core 0) is handling
> interrupts while CPU1 does not. We compiled with CONFIG_IRQBALANCE=n so
> that irqbalance(1) would work - but to no avail.
>
> Please see http://nerdbynature.de/bits/2.6.20.4/ for details for both
> hosts and feel free to ask for more details. Although both boxes are in
> production we'll be happy to test more boot options/patches and the like.
>
> TIA,
> Christian.
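
The four combinations Len lists are simply every on/off setting of two
independent boot parameters, which makes it easy to track which ones have
not been tried yet. The following is a minimal sketch; boot_param_matrix
is a hypothetical helper name, and the "tested" set reflects only what
Len states above (#1 and #4 were run, #2 and #3 were skipped).

```python
from itertools import product


def boot_param_matrix(toggles):
    """Return every on/off combination of the given kernel boot parameters.

    The empty string stands for "default, no boot params".
    """
    combos = []
    for mask in product([False, True], repeat=len(toggles)):
        combos.append(" ".join(p for p, on in zip(toggles, mask) if on))
    return combos


combos = boot_param_matrix(["acpi=off", "noapic"])

# Per Len's summary: default and "acpi=off noapic" were already run.
tested = {"", "acpi=off noapic"}
untested = [c for c in combos if c not in tested]

if __name__ == "__main__":
    for number, combo in enumerate(combos, start=1):
        state = "tested" if combo in tested else "untested"
        print(f"{number}. {combo or '<default - no boot params>'}: {state}")
```

Adding a third toggle to the list would extend the matrix to eight
entries without further changes.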

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Christian Kujau @ 2007-04-03 5:46 UTC
To: Len Brown; +Cc: linux-kernel, netdev, malte

On Tue, 3 Apr 2007, Len Brown wrote:
> Which increased stability, disabling ACPI, or disabling the IOAPIC?

To be honest, we're not sure. See below.

> Your box has MPS, so you should be able to use the IOAPIC in either mode.

MPS - Multiprocessor Specification? SMP? Yes, it'd be good to use the
IOAPIC again.

> Note that you can do these both independently at boot-time with "acpi=off"
> and "noapic", respectively.
> eg. 4 combos
> 1. <default - no boot params>
> 2. noapic
> 3. acpi=off
> 4. acpi=off noapic
> you started with #1, and are running hard-coded #4 now, but skipped #2
> and #3

Indeed, we skipped quite a few options. As mentioned before, the boxes
are in production already so we don't have much time to play around and
we were just happy when they survived a few hours :(

But yes, we'll try booting with "acpi=off" and enabled IOAPIC again.
@Malte: when will we be able to do so?

Len et al., do you even suggest using ACPI on a server system at all? I
myself always thought of ACPI as evil and something to avoid when
possible (thus switching it off completely on a server system).

Since these NETDEV WATCHDOG issues seem to be a "known issue" (kinda,
given the many postings on the lists in the past), is there something
else we should look into? Would more debug .config options help to find
out why they lock up?

Thanks for your comments,
Christian.
--
BOFH excuse #340:
Well fix that in the next (upgrade, update, patch release, service pack).

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Jarek Poplawski @ 2007-04-03 6:58 UTC
To: Christian Kujau; +Cc: linux-kernel, netdev, malte

On 02-04-2007 21:41, Christian Kujau wrote:
>
> Hi there,
>
> we have serious problems with 2 of our servers: both shiny new amd64
> dual core, with both 2GB RAM, 32bit kernel+userland (Debian/testing).
> Both servers have 2 NICs, RTL8139 (eth0, irq10) and RTL8169s
> (eth1, irq11).

Hi,

Did you try with 8139cp instead of 8139too?
(Maybe even try some other card to narrow the problem?)
You could also try to test without ehci, if it's possible.

Regards,
Jarek P.

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Christian Kujau @ 2007-04-03 9:47 UTC
To: Jarek Poplawski; +Cc: linux-kernel, netdev, malte

On Tue, 3 Apr 2007, Jarek Poplawski wrote:
> Did you try with 8139cp instead of 8139too?

I forgot about that, thanks.

> (Maybe even try some other card to narrow the problem?)

We're trying to convince our hosting provider to replace the NIC with an
e1000.

> You could also try to test without ehci, if it's possible.

Sure, I think USB is not needed.

Thanks for your input!
Christian.
--
BOFH excuse #387:
Your computer's union contract is set to expire at midnight.

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Christian Kujau @ 2007-04-03 15:19 UTC
To: Jarek Poplawski; +Cc: linux-kernel, netdev, malte

On Tue, 3 Apr 2007, Jarek Poplawski wrote:
> Did you try with 8139cp instead of 8139too?

Tried that, 8139cp could not be loaded :(

> (Maybe even try some other card to narrow the problem?)
> You could also try to test without ehci, if it's possible.

USB has been disabled completely. After booting with 'acpi=off lapic'
the box survived ~30min then locked up again and rebooted.

Christian.
--
BOFH excuse #305:
IRQ-problems with the Un-Interruptible-Power-Supply

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Francois Romieu @ 2007-04-03 20:34 UTC
To: Christian Kujau; +Cc: Jarek Poplawski, linux-kernel, netdev, malte

Christian Kujau <christian@g-house.de> :
> On Tue, 3 Apr 2007, Jarek Poplawski wrote:
> >Did you try with 8139cp instead of 8139too?
>
> Tried that, 8139cp could not be loaded :(

It is a different beast.

--
Ueimor

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Jarek Poplawski @ 2007-04-04 11:21 UTC
To: Christian Kujau; +Cc: linux-kernel, netdev, malte, Francois Romieu

On Tue, Apr 03, 2007 at 04:19:46PM +0100, Christian Kujau wrote:
> On Tue, 3 Apr 2007, Jarek Poplawski wrote:
> >Did you try with 8139cp instead of 8139too?
>
> Tried that, 8139cp could not be loaded :(

Sorry for misleading!

> >(Maybe even try some other card to narrow the problem?)
> >You could also try to test without ehci, if it's possible.
>
> USB has been disabled completely. After booting with 'acpi=off lapic'
> the box survived ~30min then locked up again and rebooted.

So, it's a lot sooner than before. (BTW, isn't there anything
in debug log?) I see both CPUs did interrupt handling again.
Maybe it's a real locking problem. Here are some more
suggestions for testing (if you don't find anything better):
- try without SMP, so: 'acpi=off lapic nosmp'
- lock debugging turned on as much as possible
plus maybe for curiosity:
- different CONFIG_HZ
- 8139TOO_PIO = y

...
> IRQ-problems with the Un-Interruptible-Power-Supply

I wouldn't be surprised...

Cheers,
Jarek P.

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Christian Kujau @ 2007-04-04 13:20 UTC
To: Jarek Poplawski; +Cc: Christian Kujau, linux-kernel, netdev, malte, Francois Romieu

On Wed, 4 Apr 2007, Jarek Poplawski wrote:
> So, it's a lot sooner than before. (BTW, isn't there anything
> in debug log?)

No, nothing. I've set up remote syslogging to the other node (node1
logging to node2 and vice versa) - nothing :(

> I see both CPUs did interrupt handling again.

Yes, when booting with 'lapic' both CPUs/cores are handling interrupts
again. However, since 'lapic' seems to lead to crashes here, we would be
more than happy to just boot with 'noapic' but have 'irqbalance'
working. Unfortunately, irqbalance is unable to write to
/proc/irq/*/smp_affinity (disabling CONFIG_IRQBALANCE did not help).

> Maybe it's a real locking problem. Here are some more
> suggestions for testing (if you don't find anything better):
> - try without SMP, so: 'acpi=off lapic nosmp'

Yeah, we'll see if we still have time for trying this. But I figure this
will not be a real (long term) option for us :(

> - lock debugging turned on as much as possible
> plus maybe for curiosity:
> - different CONFIG_HZ
> - 8139TOO_PIO = y

Indeed, that's what I wanted to do too. @Malte: any plans for another
downtime?

Thanks for your comments!
Christian.
--
BOFH excuse #265:
The mouse escaped.
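
For readers unfamiliar with /proc/irq/*/smp_affinity: the file holds a
hexadecimal CPU bitmask, with bit N standing for CPU N. The following is
a minimal sketch of that encoding only; affinity_mask and cpus_from_mask
are hypothetical helper names, and nothing here explains why the write
failed on these boxes, which the thread does not establish.

```python
def affinity_mask(cpus):
    """Hex bitmask string for a collection of CPU numbers (bit N = CPU N)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def cpus_from_mask(hexmask):
    """Inverse of affinity_mask; accepts comma-grouped masks as well."""
    mask = int(hexmask.replace(",", ""), 16)
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


if __name__ == "__main__":
    # On a dual-core box: CPU0 only, CPU1 only, both cores.
    for cpus in ([0], [1], [0, 1]):
        print(cpus, "->", affinity_mask(cpus))
```

Writing such a value to the file needs root, and whether the kernel
accepts it depends on the interrupt controller in use.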

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Jarek Poplawski @ 2007-04-05 6:20 UTC
To: Christian Kujau; +Cc: linux-kernel, netdev, malte, Francois Romieu

On Wed, Apr 04, 2007 at 02:20:23PM +0100, Christian Kujau wrote:
> On Wed, 4 Apr 2007, Jarek Poplawski wrote:
> >So, it's a lot sooner than before. (BTW, isn't there anything
> >in debug log?)
>
> No, nothing. I've set up remote syslogging to the other node (node1
> logging to node2 and vice versa) - nothing :(
>
> >I see both CPUs did interrupt handling again.
>
> Yes, when booting with 'lapic' both CPUs/cores are handling interrupts
> again. However, since 'lapic' seems to lead to crashes here, we would be
> more than happy to just boot with 'noapic' but have 'irqbalance'
> working. Unfortunately, irqbalance is unable to write to
> /proc/irq/*/smp_affinity (disabling CONFIG_IRQBALANCE did not help).

I hope you are right, but maybe it's not lapic's fault?
Probably the fastest way to know would be to try with
some other card.

> >Maybe it's a real locking problem. Here are some more
> >suggestions for testing (if you don't find anything better):
> >- try without SMP, so: 'acpi=off lapic nosmp'

BTW, I'm not sure acpi should be turned off with any modern
hardware. Did you try to compile with CONFIG_ACPI = y, all
other acpi options off, and maybe to tweak only with the
'pci=...' boot parameter?

Regards,
Jarek P.

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Christian Kujau @ 2007-04-06 18:19 UTC
To: Christian Kujau; +Cc: Jarek Poplawski, linux-kernel, netdev, malte, Francois Romieu

On Wed, 4 Apr 2007, Christian Kujau wrote:
>> Maybe it's a real locking problem. Here are some more
>> suggestions for testing (if you don't find anything better):
>> - try without SMP, so: 'acpi=off lapic nosmp'

We were able to get our hosting provider to replace the 8139too with an
E100; the onboard r8169 stayed of course. After this, the box came back
fine...only to lock up again shortly after :(

So again we spoke to our hosting provider and they just took out the 2
SATA disks and put them in a completely new system: amd64 dualcore
again, 2 GB RAM, r8169 onboard NIC, e100 pci-slot NIC. Now booting
2.6.20.4 and even 2.6.18-4-k7 (the Debian kernel) with IOAPIC enabled
seems to work, meaning the box has been up since yesterday evening and
interrupts are shared. Not equally, but still:

# cat /proc/interrupts
           CPU0       CPU1
  0:        111          0   IO-APIC-edge      timer
  1:          7          9   IO-APIC-edge      i8042
  4:        260          1   IO-APIC-edge      serial
  6:          0          3   IO-APIC-edge      floppy
  8:          0          1   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 16:        157     575579   IO-APIC-fasteoi   eth0
 17:    3812553          1   IO-APIC-fasteoi   eth1
 19:     100651    8262484   IO-APIC-fasteoi   libata
NMI:          0          0
LOC:   17272991   17266237
ERR:          0
MIS:          0

While this is a good thing, we now have different problems: our 2nd SATA
drive is not usable any more, but again we doubt hardware problems,
because this disk has been replaced already back in the old box...

But yes, these seem to be different problems. For the curious among
you I've put details here: http://nerdbynature.de/bits/2.6.20.4/db2/

Thanks to all who have replied,
Christian.
--
BOFH excuse #209:
Only people with names beginning with 'A' are getting mail this week (a la Microsoft)

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Christian Kujau @ 2007-04-06 18:27 UTC
To: Christian Kujau; +Cc: linux-kernel, netdev

On Fri, 6 Apr 2007, Christian Kujau wrote:
> but yes, this seem to be different problems, for the curious among you I've
> put details here: http://nerdbynature.de/bits/2.6.20.4/db2/

that's http://nerdbynature.de/bits/2.6.20.4/db1/2/ ....sorry.
--
BOFH excuse #270:
Someone has messed up the kernel pointers

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Jarek Poplawski @ 2007-04-17 12:36 UTC
To: Christian Kujau; +Cc: linux-kernel, netdev, malte, Francois Romieu

On Fri, Apr 06, 2007 at 07:19:25PM +0100, Christian Kujau wrote:
> On Wed, 4 Apr 2007, Christian Kujau wrote:
> >>Maybe it's a real locking problem. Here are some more
> >>suggestions for testing (if you don't find anything better):
> >>- try without SMP, so: 'acpi=off lapic nosmp'
>
> We were able to get our hosting provider to replace the 8139too with an
> E100; the onboard r8169 stayed of course. After this, the box came back
> fine...only to lock up again shortly after :(
>
> So again we spoke to our hosting provider and they just took out the 2
> SATA disks and put them in a completely new system: amd64 dualcore
> again, 2 GB RAM, r8169 onboard NIC, e100 pci-slot NIC. Now booting
> 2.6.20.4 and even 2.6.18-4-k7 (the Debian kernel) with IOAPIC enabled
> seems to work, meaning the box has been up since yesterday evening and
> interrupts are shared. Not equally, but still:
>
> # cat /proc/interrupts
>            CPU0       CPU1
...
>  16:        157     575579   IO-APIC-fasteoi   eth0
>  17:    3812553          1   IO-APIC-fasteoi   eth1
...

Yes! Nobody can deny they are shared. It's a miracle they don't
lock up now!

> While this is a good thing, we now have different problems: our 2nd SATA
> drive is not usable any more, but again we doubt hardware problems,
> because this disk has been replaced already back in the old box...
>
> But yes, these seem to be different problems. For the curious among
> you I've put details here: http://nerdbynature.de/bits/2.6.20.4/db2/

I don't want to waste your time on experiments, so don't feel
obliged to respond or try all of this, but here are some impressions -
what I'd do:

- these disk errors look serious and there is no reason to try
  anything else without removing such a disk and testing it in some
  other place,
- the configs are changed, but they sometimes include many "risky"
  options like: X86_MCE, HOTPLUG, ACPI_BATTERY, ACPI_BUTTON,
  ACPI_PROCESSOR, ENABLE_MEMORY_HOTPLUG etc.; I doubt you need
  IDE at all: probably SATA_VIA and PATA_VIA should be enough for
  your disks; with such problems I'd start with the absolute minimum -
  and no drivers for other models (BTW - maybe I'm wrong, but
  isn't AMD64 MK-8?),
- if, with some config, a lockup is expected soon, I'd turn off
  any watchdogs, turn on a lot of debugging - e.g. lockdep, and try to
  wait and get some oops during a lockup (sometimes it needs a few
  minutes, sometimes SYSRQ is helpful); without this you could never
  be sure it'll work, or your hardware would work at half speed
  with unnecessarily turned off options.

> Thanks to all who have replied,

And I thank you for your cooperation and the interesting problems. But I
doubt anybody here is satisfied with anything but: "it's working"
(and your hardware doesn't look so special that it shouldn't work).

Cheers,
Jarek P.

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Denys @ 2007-04-04 13:53 UTC
To: Jarek Poplawski, Christian Kujau; +Cc: linux-kernel, netdev, malte, Francois Romieu

IMHO it can be a hardware issue also; I had something very similar with
faulty hardware combinations.

On Wed, 4 Apr 2007 13:21:00 +0200, Jarek Poplawski wrote
> On Tue, Apr 03, 2007 at 04:19:46PM +0100, Christian Kujau wrote:
> > On Tue, 3 Apr 2007, Jarek Poplawski wrote:
> > >Did you try with 8139cp instead of 8139too?
> >
> > Tried that, 8139cp could not be loaded :(
>
> Sorry for misleading!
>
> > >(Maybe even try some other card to narrow the problem?)
> > >You could also try to test without ehci, if it's possible.
> >
> > USB has been disabled completely. After booting with 'acpi=off lapic'
> > the box survived ~30min then locked up again and rebooted.
>
> So, it's a lot sooner than before. (BTW, isn't there anything
> in debug log?) I see both CPUs did interrupt handling again.
> Maybe it's a real locking problem. Here are some more
> suggestions for testing (if you don't find anything better):
> - try without SMP, so: 'acpi=off lapic nosmp'
> - lock debugging turned on as much as possible
> plus maybe for curiosity:
> - different CONFIG_HZ
> - 8139TOO_PIO = y
>
> ....
> > IRQ-problems with the Un-Interruptible-Power-Supply
>
> I wouldn't be surprised...
>
> Cheers,
> Jarek P.

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Francois Romieu @ 2007-04-03 20:57 UTC
To: Christian Kujau; +Cc: linux-kernel, netdev, malte

Christian Kujau <evil@g-house.de> :
[...]
> Please see http://nerdbynature.de/bits/2.6.20.4/ for details for both
> hosts and feel free to ask for more details. Although both boxes are in
> production we'll be happy test more bootoptions/patches and the like.

If the apic voodoo makes no difference, you can:
1 - leave it enabled
2 - check that netconsole is not used with the 8139 (it is busted)
3 - check that netconsole is not used at all
4 - try:
    http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc5/r8169-20070402

--
Ueimor

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Christian Kujau @ 2007-04-04 13:12 UTC
To: Francois Romieu; +Cc: linux-kernel, netdev, malte

On Tue, 3 Apr 2007, Francois Romieu wrote:
> Christian Kujau <evil@g-house.de> :
> If the apic voodoo makes no difference, you can:
> 1 - leave it enabled

Well, we tried to boot with ACPI compiled in again, but disabled during
boot:

- acpi=off lapic, crashed after 1h (almost exactly) of service
- acpi=off lapic, crashed again, this time after 4h (almost exactly)
- acpi=off noapic, still running, now 21h.

The 2nd node has been booted with 'noapic' and ACPI *not* compiled in
and has now been running for 2.5 days. However, interrupts are not shared
between cores.

This means we still have to test booting with 'lapic' and ACPI enabled.
Unfortunately there are a few more sub-options to choose from:

- acpi=force -- enable ACPI if default was off
- acpi=noirq -- do not use ACPI for IRQ routing
- acpi=ht -- run only enough ACPI to enable Hyper Threading
- acpi=strict -- Be less tolerant of platforms that are not strictly
  ACPI specification compliant.

> 2 - check that netconsole is not used with the 8139 (it is busted)
> 3 - check that netconsole is not used at all

Actually I was thinking about *using* netconsole, since even setting up
remote (userspace-)syslog left nothing on the syslog-server when the
machine crashed. But if it's b0rked in 8139, I will refrain from doing
so.

> 4 - try:
> http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc5/r8169-20070402

Are they in -rc5 yet or 'not in -rc5 but should be applied to -rc5'?

Thanks for your time,
Christian.
--
BOFH excuse #235:
The new frame relay network hasn't bedded down the software loop transmitter yet.

* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
From: Francois Romieu @ 2007-04-04 18:10 UTC
To: Christian Kujau; +Cc: linux-kernel, netdev, malte

Christian Kujau <christian@g-house.de> :
[...]
> Actually I was thinking about *using* netconsole, since even setting up
> remote (userspace-)syslog left nothing on the syslog-server, when the
> machine crashed. But if it's b0rked in 8139, I will refrain from doing
> so.

Please refrain :o)

No serial cable ?

> >4 - try:
> > http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc5/r8169-20070402
>
> Are they in -rc5 yet or 'not in -rc5 but should be applied to -rc5'?

'not in -rc5 but should be applied to -rc5 though the first two at the
bottom (000[12]-r8169-blah-blah.patch) are now in latest -git'.

--
Ueimor