netdev.vger.kernel.org archive mirror
* 2.6.20.4: NETDEV WATCHDOG and lockups
@ 2007-04-02 19:41 Christian Kujau
  2007-04-02 20:20 ` Chuck Ebbert
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Christian Kujau @ 2007-04-02 19:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, malte


Hi there,

we have serious problems with 2 of our servers: both are shiny new
amd64 dual-core machines with 2GB RAM each, running a 32bit
kernel+userland (Debian/testing).
Both servers have 2 NICs, RTL8139 (eth0, irq10) and RTL8169s
(eth1, irq11).

Both boxes are running fine but after "a while" they lock up and 
eventually restart all of a sudden. The last messages in the logfile 
are:

14:15:11 db2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
14:15:14 db2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1

Then the box reboots, nothing else in the log.
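
For anyone wanting to count these events across a longer logfile, here
is a minimal Python sketch. It assumes the line layout shown above
(time, host, "kernel:", message); the function name and the regular
expression are illustrative only and not taken from any existing tool.

```python
import re

# Matches kernel log lines laid out like the ones quoted above, e.g.
#   14:15:11 db2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
# The layout is an assumption; real syslog lines often carry a date too.
WATCHDOG_RE = re.compile(
    r"^(?P<time>\S+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"NETDEV WATCHDOG:\s+(?P<iface>\w+):\s+transmit timed out"
)

def find_watchdog_timeouts(lines):
    """Return (time, host, interface) for every transmit-timeout line."""
    hits = []
    for line in lines:
        m = WATCHDOG_RE.match(line.strip())
        if m:
            hits.append((m.group("time"), m.group("host"), m.group("iface")))
    return hits

sample = [
    "14:15:11 db2 kernel: NETDEV WATCHDOG: eth0: transmit timed out",
    "14:15:14 db2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1",
]
print(find_watchdog_timeouts(sample))
```

Only the first sample line matches; the "link up" line is ignored.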

As the servers have been set up recently, we only know that it happened
with Debian's 2.6.17-? kernel. When we upgraded the installation, we
went to 2.6.18-4-k7 and the problem persisted. We're now using vanilla
2.6.20.4 and while the problem persists, it takes longer to lock up
(~20h as opposed to 4-5h). While this is a good thing for us, it's now
harder to reproduce (we have to wait longer).

Searching the archives turned up quite a few results, but no real fix
and lots of old postings too. We then disabled ACPI completely and
booted with 'noapic'. Both boxes have now been running for > 20h and
we're curious how long they will last. However, booting with 'noapic'
slowed down both servers *a lot*.

From /proc/interrupts we can see that only CPU0 (core 0) is handling
interrupts while CPU1 does not. We compiled with CONFIG_IRQBALANCE=n so 
that irqbalance(1) would work - but to no avail.

Please see http://nerdbynature.de/bits/2.6.20.4/ for details for both 
hosts and feel free to ask for more details. Although both boxes are in 
production we'll be happy to test more boot options/patches and the like.

TIA,
Christian.
-- 
BOFH excuse #266:

All of the packets are empty.


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-02 19:41 2.6.20.4: NETDEV WATCHDOG and lockups Christian Kujau
@ 2007-04-02 20:20 ` Chuck Ebbert
  2007-04-02 21:15   ` Christian Kujau
                     ` (2 more replies)
  2007-04-03  5:20 ` Len Brown
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 21+ messages in thread
From: Chuck Ebbert @ 2007-04-02 20:20 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, netdev, malte

Christian Kujau wrote:
> 
> Please see http://nerdbynature.de/bits/2.6.20.4/ for details for both
> hosts and feel free to ask for more details. Although both boxes are in
> production we'll be happy test more bootoptions/patches and the like.

Where is the info from before you changed to "noapic"? Or were the
machines always using XT-PIC for all the interrupts???



* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-02 20:20 ` Chuck Ebbert
@ 2007-04-02 21:15   ` Christian Kujau
  2007-04-03  5:34   ` Christian Kujau
  2007-04-03 15:17   ` Christian Kujau
  2 siblings, 0 replies; 21+ messages in thread
From: Christian Kujau @ 2007-04-02 21:15 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-kernel, netdev, malte

On Mon, 2 Apr 2007, Chuck Ebbert wrote:
>> Please see http://nerdbynature.de/bits/2.6.20.4/ for details for both
>> hosts and feel free to ask for more details. Although both boxes are in
>> production we'll be happy test more bootoptions/patches and the like.
>
> Where is the info from before you changed to "noapic"? Or were the
> machines always using XT-PIC for all the interrupts???

Whoa, I didn't notice the XT-PIC in /proc/interrupts. From the number of
your question marks I figure it's a bad thing then?

@Malte, do we still have this information around?

Thanks,
Christian.
-- 
BOFH excuse #63:

not properly grounded, please bury computer


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-02 19:41 2.6.20.4: NETDEV WATCHDOG and lockups Christian Kujau
  2007-04-02 20:20 ` Chuck Ebbert
@ 2007-04-03  5:20 ` Len Brown
  2007-04-03  5:46   ` Christian Kujau
  2007-04-03  6:58 ` Jarek Poplawski
  2007-04-03 20:57 ` Francois Romieu
  3 siblings, 1 reply; 21+ messages in thread
From: Len Brown @ 2007-04-03  5:20 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, netdev, malte

On Monday 02 April 2007 15:41, Christian Kujau wrote:
> 
> Hi there,
> 
> we have serious problems with 2 of our servers: both shiny new amd64 
> dual core, with both 2GB RAM, 32bit kernel+userland (Debian/testing).
> Both servers have 2 NICs, RTL8139 (eth0, irq10) and RTL8169s
> (eth1, irq11).
> 
> Both boxes are running fine but after "a while" they lock up and 
> eventually restart all of a sudden. The last messages in the logfile 
> are:
> 
> 14:15:11 db2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
> 14:15:14 db2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
> 
> Then the box reboots, nothing else in the log.
> 
> As the servers have been set up recently, we only know that it happend 
> with Debian's 2.6.17-? kernel. When we upgraded the installation, we 
> went to 2.6.18-4-k7 and the problem persistent. We're using now vanilla 
> 2.6.20.4 and while the problem persists, it takes longer to lockup (~20h 
> as opposed to 4-5h). While this is a good thing for us, it's now harder
> to reproduce (we have to wait longer).
> 
> Searching the archives turned up quite a few results but no real fix and 
> lots of old postings too. We then disabled ACPI completely and booted 
> with 'noapic'. Now both boxes are running for > 20h and we're curious 
> how long they make it. However, booting with 'noapic' slowed down both 
> servers *a lot*.

Which increased stability, disabling ACPI, or disabling the IOAPIC?
Your box has MPS, so you should be able to use the IOAPIC in either mode.
Note that you can do these both independently at boot-time with "acpi=off"
and "noapic", respectively.
eg. 4 combos
1. <default - no boot params>
2. noapic
3. acpi=off
4. acpi=off noapic

you started with #1, and are running hard-coded #4 now, but skipped #2 and #3
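
The four combinations above are simply every on/off choice of the two
flags. A minimal Python sketch that enumerates them in the same order
(the helper name is hypothetical; only "acpi=off" and "noapic" come from
the list above):

```python
from itertools import product

# The two independent boot parameters named in the list above.
FLAGS = ("acpi=off", "noapic")

def boot_param_combos(flags=FLAGS):
    """Return every on/off combination of the given flags as a command-line string."""
    combos = []
    for mask in product((False, True), repeat=len(flags)):
        chosen = [flag for flag, on in zip(flags, mask) if on]
        combos.append(" ".join(chosen))
    return combos

for n, combo in enumerate(boot_param_combos(), start=1):
    # An empty string means "boot with the defaults".
    print(f"{n}. {combo or '<default - no boot params>'}")
```

Adding a third flag to FLAGS would yield eight combinations, which is a
reminder of how quickly the test matrix grows on a production box.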

cheers,
-Len

> From /proc/interrupts we can see that only CPU0 (core 0) is handling
> interrupts while CPU1 does not. We compiled with CONFIG_IRQBALANCE=n so 
> that irqbalance(1) would work - but to no avail.
> 
> Please see http://nerdbynature.de/bits/2.6.20.4/ for details for both 
> hosts and feel free to ask for more details. Although both boxes are in 
> production we'll be happy test more bootoptions/patches and the like.
> 
> TIA,
> Christian.


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-02 20:20 ` Chuck Ebbert
  2007-04-02 21:15   ` Christian Kujau
@ 2007-04-03  5:34   ` Christian Kujau
  2007-04-03 15:17   ` Christian Kujau
  2 siblings, 0 replies; 21+ messages in thread
From: Christian Kujau @ 2007-04-03  5:34 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-kernel, netdev, malte

On Mon, 2 Apr 2007, Chuck Ebbert wrote:
> Where is the info from before you changed to "noapic"? Or were the
> machines always using XT-PIC for all the interrupts???

XT-PIC has only been in use since we switched to noapic; before that,
both ethernet cards used IO-APIC-fasteoi and interrupts were balanced
well.

Thanks,
Christian.
-- 
BOFH excuse #340:

Well fix that in the next (upgrade, update, patch release, service pack).


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-03  5:20 ` Len Brown
@ 2007-04-03  5:46   ` Christian Kujau
  0 siblings, 0 replies; 21+ messages in thread
From: Christian Kujau @ 2007-04-03  5:46 UTC (permalink / raw)
  To: Len Brown; +Cc: linux-kernel, netdev, malte

On Tue, 3 Apr 2007, Len Brown wrote:
> Which increased stability, disabling ACPI, or disabling the IOAPIC?

To be honest, we're not sure. See below.

> Your box has MPS, so you should be able to use the IOAPIC in either mode.

MPS - Multiprocessor Specification? SMP? Yes, it'd be good to use the 
IOAPIC again.

> Note that you can do these both independently at boot-time with "acpi=off"
> and "noapic", respectively.
> eg. 4 combos
> 1. <default - no boot params>
> 2. noapic
> 3. acpi=off
> 4. acpi=off noapic
> you started with #1, and are running hard-coded #4 now, but skipped #2 
> and #3

Indeed, we skipped quite a few options. As mentioned before, the boxes 
are in production already so we don't have much time to play around and
we were just happy when they survived a few hours :(

But yes, we'll try booting with "acpi=off" and enabled IOAPIC again.

@Malte: when will we be able to do so?

Len et al., would you even recommend using ACPI on a server system at
all? I always thought of ACPI as evil and something to avoid when
possible (thus switching it off completely on a server system).

Since these NETDEV WATCHDOG issues seem to be a "known issue" (kinda,
given the many postings on the lists in the past), is there something
else we should look into? Would more debug .config options help to find
out why the boxes lock up?

Thanks for your comments,
Christian.
-- 
BOFH excuse #340:

Well fix that in the next (upgrade, update, patch release, service pack).


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-02 19:41 2.6.20.4: NETDEV WATCHDOG and lockups Christian Kujau
  2007-04-02 20:20 ` Chuck Ebbert
  2007-04-03  5:20 ` Len Brown
@ 2007-04-03  6:58 ` Jarek Poplawski
  2007-04-03  9:47   ` Christian Kujau
  2007-04-03 15:19   ` Christian Kujau
  2007-04-03 20:57 ` Francois Romieu
  3 siblings, 2 replies; 21+ messages in thread
From: Jarek Poplawski @ 2007-04-03  6:58 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, netdev, malte

On 02-04-2007 21:41, Christian Kujau wrote:
> 
> Hi there,
> 
> we have serious problems with 2 of our servers: both shiny new amd64 
> dual core, with both 2GB RAM, 32bit kernel+userland (Debian/testing).
> Both servers have 2 NICs, RTL8139 (eth0, irq10) and RTL8169s
> (eth1, irq11).

Hi,

Did you try with 8139cp instead of 8139too?
(Maybe even try some other card to narrow the problem?)
You could also try to test without ehci, if it's possible.

Regards,
Jarek P.


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-03  6:58 ` Jarek Poplawski
@ 2007-04-03  9:47   ` Christian Kujau
  2007-04-03 15:19   ` Christian Kujau
  1 sibling, 0 replies; 21+ messages in thread
From: Christian Kujau @ 2007-04-03  9:47 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: linux-kernel, netdev, malte

On Tue, 3 Apr 2007, Jarek Poplawski wrote:
> Did you try with 8139cp instead of 8139too?

I forgot about that, thanks.

> (Maybe even try some other card to narrow the problem?)

We're trying to convince our hosting provider to replace the NIC with an
e1000.

> You could also try to test without ehci, if it's possible.

Sure, I think USB is not needed.

thanks for your input!

Christian.
-- 
BOFH excuse #387:

Your computer's union contract is set to expire at midnight.


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-02 20:20 ` Chuck Ebbert
  2007-04-02 21:15   ` Christian Kujau
  2007-04-03  5:34   ` Christian Kujau
@ 2007-04-03 15:17   ` Christian Kujau
  2 siblings, 0 replies; 21+ messages in thread
From: Christian Kujau @ 2007-04-03 15:17 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-kernel, netdev

On Mon, 2 Apr 2007, Chuck Ebbert wrote:
> Where is the info from before you changed to "noapic"? Or were the
> machines always using XT-PIC for all the interrupts???

We booted with 'acpi=off lapic' (with ACPI options compiled in, to be 
able to boot with acpi=on later on) and the box locked up again.

I'll try to boot with a slightly different .config later on. With 
'lapic' /proc/interrupts looks like:

           CPU0       CPU1
   0:      63656      63396   IO-APIC-edge      timer
   1:          0          8   IO-APIC-edge      i8042
   2:          0          0    XT-PIC-XT        cascade
   4:          6          1   IO-APIC-edge      serial
   6:          0          3   IO-APIC-edge      floppy
   8:          0          1   IO-APIC-edge      rtc
  17:      17050          1   IO-APIC-fasteoi   eth1
  18:        102      80615   IO-APIC-fasteoi   eth0
  20:     195817      77721   IO-APIC-fasteoi   libata
NMI:          0          0
LOC:     126969     126970
ERR:          0
MIS:          0
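
To compare how interrupts are spread over the cores between boots, a
listing like the one above can be parsed with a few lines of Python.
This is only a sketch: the function names are made up, and it assumes
the classic layout with one count column per CPU after the label.

```python
def per_cpu_counts(text):
    """Parse /proc/interrupts-style text into {label: [count per CPU]}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        # Rows such as ERR/MIS carry a single count, so fewer columns are fine.
        table[label.strip()] = [int(f) for f in fields[:ncpus] if f.isdigit()]
    return table

def cpu_share(counts):
    """Return each CPU's fraction of the total for one interrupt line."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

sample = """\
           CPU0       CPU1
   0:      63656      63396   IO-APIC-edge      timer
  17:      17050          1   IO-APIC-fasteoi   eth1
  18:        102      80615   IO-APIC-fasteoi   eth0
ERR:          0
"""

table = per_cpu_counts(sample)
for label, counts in table.items():
    print(label, counts, [round(s, 3) for s in cpu_share(counts)])
```

On a live system the same function could be fed the contents of
/proc/interrupts instead of the sample string.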

Christian.

-- 
BOFH excuse #146:

Communications satellite used by the military for star wars.


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-03  6:58 ` Jarek Poplawski
  2007-04-03  9:47   ` Christian Kujau
@ 2007-04-03 15:19   ` Christian Kujau
  2007-04-03 20:34     ` Francois Romieu
  2007-04-04 11:21     ` Jarek Poplawski
  1 sibling, 2 replies; 21+ messages in thread
From: Christian Kujau @ 2007-04-03 15:19 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: linux-kernel, netdev, malte

On Tue, 3 Apr 2007, Jarek Poplawski wrote:
> Did you try with 8139cp instead of 8139too?

Tried that, 8139cp could not be loaded :(

> (Maybe even try some other card to narrow the problem?)
> You could also try to test without ehci, if it's possible.

USB has been disabled completely. After booting with 'acpi=off lapic' 
the box survived ~30min then locked up again and rebooted.

Christian.
-- 
BOFH excuse #305:

IRQ-problems with the Un-Interruptible-Power-Supply


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-03 15:19   ` Christian Kujau
@ 2007-04-03 20:34     ` Francois Romieu
  2007-04-04 11:21     ` Jarek Poplawski
  1 sibling, 0 replies; 21+ messages in thread
From: Francois Romieu @ 2007-04-03 20:34 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Jarek Poplawski, linux-kernel, netdev, malte

Christian Kujau <christian@g-house.de> :
> On Tue, 3 Apr 2007, Jarek Poplawski wrote:
> >Did you try with 8139cp instead of 8139too?
> 
> Tried that, 8139cp could not be loaded :(

It is a different beast.

-- 
Ueimor


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-02 19:41 2.6.20.4: NETDEV WATCHDOG and lockups Christian Kujau
                   ` (2 preceding siblings ...)
  2007-04-03  6:58 ` Jarek Poplawski
@ 2007-04-03 20:57 ` Francois Romieu
  2007-04-04 13:12   ` Christian Kujau
  3 siblings, 1 reply; 21+ messages in thread
From: Francois Romieu @ 2007-04-03 20:57 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, netdev, malte

Christian Kujau <evil@g-house.de> :
[...]
> Please see http://nerdbynature.de/bits/2.6.20.4/ for details for both 
> hosts and feel free to ask for more details. Although both boxes are in 
> production we'll be happy test more bootoptions/patches and the like.

If the apic voodoo makes no difference, you can:
1 - leave it enabled
2 - check that netconsole is not used with the 8139 (it is busted)
3 - check that netconsole is not used at all
4 - try:
    http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc5/r8169-20070402

-- 
Ueimor


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-03 15:19   ` Christian Kujau
  2007-04-03 20:34     ` Francois Romieu
@ 2007-04-04 11:21     ` Jarek Poplawski
  2007-04-04 13:20       ` Christian Kujau
  2007-04-04 13:53       ` Denys
  1 sibling, 2 replies; 21+ messages in thread
From: Jarek Poplawski @ 2007-04-04 11:21 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, netdev, malte, Francois Romieu

On Tue, Apr 03, 2007 at 04:19:46PM +0100, Christian Kujau wrote:
> On Tue, 3 Apr 2007, Jarek Poplawski wrote:
> >Did you try with 8139cp instead of 8139too?
> 
> Tried that, 8139cp could not be loaded :(

Sorry for misleading!

> >(Maybe even try some other card to narrow the problem?)
> >You could also try to test without ehci, if it's possible.
> 
> USB has been disabled completely. After booting with 'acpi=off lapic' 
> the box survived ~30min then locked up again and rebooted.

So, it's a lot sooner than before. (BTW, isn't there anything
in debug log?) I see both CPUs did interrupt handling again.
Maybe it's a real locking problem. Here are some more
suggestions for testing (if you don't find anything better):
- try without SMP, so: 'acpi=off lapic nosmp'
- lock debugging turned on as much as possible
plus maybe for curiosity:
- different CONFIG_HZ
- 8139TOO_PIO = y
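
Before rebooting into a test kernel it may help to confirm which of
these options actually ended up in the .config. A minimal Python sketch
follows; the list of lock-debugging symbols is an assumption about what
"lock debugging" covers here, and the helper names are hypothetical.

```python
# Assumed set of lock-debugging symbols to look for; adjust to taste.
LOCK_DEBUG_OPTIONS = (
    "CONFIG_DEBUG_SPINLOCK",
    "CONFIG_DEBUG_MUTEXES",
    "CONFIG_DEBUG_LOCK_ALLOC",
    "CONFIG_PROVE_LOCKING",
)

def enabled_options(config_text):
    """Return the set of symbols set to y or m in kernel .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return enabled

def missing_lock_debug(config_text, wanted=LOCK_DEBUG_OPTIONS):
    """List the wanted symbols that are not enabled, in their original order."""
    on = enabled_options(config_text)
    return [opt for opt in wanted if opt not in on]

sample = """\
CONFIG_HZ=250
CONFIG_DEBUG_SPINLOCK=y
# CONFIG_PROVE_LOCKING is not set
CONFIG_8139TOO_PIO=y
"""
print(missing_lock_debug(sample))
```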

...
> IRQ-problems with the Un-Interruptible-Power-Supply

I wouldn't be surprised... 

Cheers,
Jarek P.


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-03 20:57 ` Francois Romieu
@ 2007-04-04 13:12   ` Christian Kujau
  2007-04-04 18:10     ` Francois Romieu
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Kujau @ 2007-04-04 13:12 UTC (permalink / raw)
  To: Francois Romieu; +Cc: linux-kernel, netdev, malte

On Tue, 3 Apr 2007, Francois Romieu wrote:
> Christian Kujau <evil@g-house.de> :
> If the apic voodoo makes no difference, you can:
> 1 - leave it enabled

Well, we tried to boot with ACPI compiled in again, but disabled during 
boot:

- acpi=off lapic, crashed after 1h (almost exactly) of service
- acpi=off lapic, crashed again, this time after 4h (almost exactly)
- acpi=off noapic, still running, now 21h.

The 2nd node has been booted with 'noapic' and ACPI *not* compiled in
and has now been running for 2.5 days. However, interrupts are not
shared between cores.

This means we still have to test booting with 'lapic' and ACPI enabled.
Unfortunately there are a few more sub-options to choose from:

     - acpi=force -- enable ACPI if default was off
     - acpi=noirq -- do not use ACPI for IRQ routing
     - acpi=ht -- run only enough ACPI to enable Hyper Threading
     - acpi=strict -- Be less tolerant of platforms that are
                      not strictly ACPI specification compliant.

> 2 - check that netconsole is not used with the 8139 (it is busted)
> 3 - check that netconsole is not used at all

Actually I was thinking about *using* netconsole, since even setting up 
remote (userspace-)syslog left nothing on the syslog-server, when the 
machine crashed. But if it's b0rked in 8139, I will refrain from doing 
so.

> 4 - try:
>    http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc5/r8169-20070402

Are they in -rc5 yet or 'not in -rc5 but should be applied to -rc5'?

Thanks for your time,
Christian.
-- 
BOFH excuse #235:

The new frame relay network hasn't bedded down the software loop transmitter yet.


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-04 11:21     ` Jarek Poplawski
@ 2007-04-04 13:20       ` Christian Kujau
  2007-04-05  6:20         ` Jarek Poplawski
  2007-04-06 18:19         ` Christian Kujau
  2007-04-04 13:53       ` Denys
  1 sibling, 2 replies; 21+ messages in thread
From: Christian Kujau @ 2007-04-04 13:20 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Christian Kujau, linux-kernel, netdev, malte, Francois Romieu

On Wed, 4 Apr 2007, Jarek Poplawski wrote:
> So, it's a lot sooner than before. (BTW, isn't there anything
> in debug log?)

No, nothing. I've set up remote syslogging to the other node (node1
logging to node2 and vice versa) - nothing :(

> I see both CPUs did interrupt handling again.

Yes, when booting with 'lapic' both CPUs/cores are handling interrupts 
again. However, since 'lapic' seems to lead to crashes here, we would be 
more than happy to just boot with 'noapic' but have 'irqbalance' 
working. Unfortunately, irqbalance is unable to write to 
/proc/irq/*/smp_affinity (did not help to disable CONFIG_IRQBALANCE).
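
For reference, the value written to /proc/irq/<n>/smp_affinity is a
hexadecimal CPU bitmask. Below is a minimal Python sketch of a manual
write, with hypothetical helper names. Whether the kernel accepts the
write depends on the interrupt controller; with XT-PIC the write may
well be rejected, which could be what irqbalance runs into here.

```python
def affinity_mask(cpus):
    """Return the hex bitmask string for a list of CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu  # bit N set means CPU N may handle the IRQ
    return format(mask, "x")

def set_irq_affinity(irq, cpus, proc_root="/proc"):
    """Write the mask to <proc_root>/irq/<irq>/smp_affinity (needs root).

    Raises OSError if the file is missing or the kernel rejects the write.
    """
    path = f"{proc_root}/irq/{irq}/smp_affinity"
    with open(path, "w") as fh:
        fh.write(affinity_mask(cpus))

# Masks for a dual-core box: CPU0 only, CPU1 only, both cores.
print(affinity_mask([0]), affinity_mask([1]), affinity_mask([0, 1]))
```

The write helper is deliberately not called in the sketch, since it
needs root and would change a running system.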

> Maybe it's a real locking problem. Here are some more
> suggestions for testing (if you don't find anything better):
> - try without SMP, so: 'acpi=off lapic nosmp'

Yeah, we'll see if we still have time for trying this. But I figure this 
will not be a real (long term) option for us :(

> - lock debugging turned on as much as possible
>   plus maybe for curiosity:
> - different CONFIG_HZ
> - 8139TOO_PIO = y

Indeed, that's what I too wanted to do.

@Malte: any plans for another downtime?

Thanks for your comments!

Christian.
-- 
BOFH excuse #265:

The mouse escaped.


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-04 11:21     ` Jarek Poplawski
  2007-04-04 13:20       ` Christian Kujau
@ 2007-04-04 13:53       ` Denys
  1 sibling, 0 replies; 21+ messages in thread
From: Denys @ 2007-04-04 13:53 UTC (permalink / raw)
  To: Jarek Poplawski, Christian Kujau
  Cc: linux-kernel, netdev, malte, Francois Romieu

IMHO it could also be a hardware issue; I had something very similar
with faulty hardware combinations.

On Wed, 4 Apr 2007 13:21:00 +0200, Jarek Poplawski wrote
> On Tue, Apr 03, 2007 at 04:19:46PM +0100, Christian Kujau wrote:
> > On Tue, 3 Apr 2007, Jarek Poplawski wrote:
> > >Did you try with 8139cp instead of 8139too?
> > 
> > Tried that, 8139cp could not be loaded :(
> 
> Sorry for misleading!
> 
> > >(Maybe even try some other card to narrow the problem?)
> > >You could also try to test without ehci, if it's possible.
> > 
> > USB has been disabled completely. After booting with 'acpi=off lapic' 
> > the box survived ~30min then locked up again and rebooted.
> 
> So, it's a lot sooner than before. (BTW, isn't there anything
> in debug log?) I see both CPUs did interrupt handling again.
> Maybe it's a real locking problem. Here are some more
> suggestions for testing (if you don't find anything better):
> - try without SMP, so: 'acpi=off lapic nosmp'
> - lock debugging turned on as much as possible
> plus maybe for curiosity:
> - different CONFIG_HZ
> - 8139TOO_PIO = y
> 
> ....
> > IRQ-problems with the Un-Interruptible-Power-Supply
> 
> I wouldn't be surprised...
> 
> Cheers,
> Jarek P.


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.



* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-04 13:12   ` Christian Kujau
@ 2007-04-04 18:10     ` Francois Romieu
  0 siblings, 0 replies; 21+ messages in thread
From: Francois Romieu @ 2007-04-04 18:10 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, netdev, malte

Christian Kujau <christian@g-house.de> :
[...]
> Actually I was thinking about *using* netconsole, since even setting up 
> remote (userspace-)syslog left nothing on the syslog-server, when the 
> machine crashed. But if it's b0rked in 8139, I will refrain from doing 
> so.

Please refrain :o)

No serial cable ?

> >4 - try:
> >   http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.21-rc5/r8169-20070402
> 
> Are they in -rc5 yet or 'not in -rc5 but should be applied to -rc5'?

'not in -rc5 but should be applied to -rc5 though the first two at the
bottom (000[12]-r8169-blah-blah.patch) are now in latest -git'.

-- 
Ueimor


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-04 13:20       ` Christian Kujau
@ 2007-04-05  6:20         ` Jarek Poplawski
  2007-04-06 18:19         ` Christian Kujau
  1 sibling, 0 replies; 21+ messages in thread
From: Jarek Poplawski @ 2007-04-05  6:20 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, netdev, malte, Francois Romieu

On Wed, Apr 04, 2007 at 02:20:23PM +0100, Christian Kujau wrote:
> On Wed, 4 Apr 2007, Jarek Poplawski wrote:
> >So, it's a lot sooner than before. (BTW, isn't there anything
> >in debug log?)
> 
> No, nothing. I've set up remote-syslgging to the other node (node1 
> logging to node2 and vice versa) - nothing :(
> 
> >I see both CPUs did interrupt handling again.
> 
> Yes, when booting with 'lapic' both CPUs/cores are handling interrupts 
> again. However, since 'lapic' seems to lead to crashes here, we would be 
> more than happy to just boot with 'noapic' but have 'irqbalance' 
> working. Unfortunately, irqbalance is unable to write to 
> /proc/irq/*/smp_affinity (did not help to disable CONFIG_IRQBALANCE).

I hope you are right, but maybe it's not lapic's fault?
Probably the fastest way to find out would be to try
some other card.

> >Maybe it's a real locking problem. Here are some more
> >suggestions for testing (if you don't find anything better):
> >- try without SMP, so: 'acpi=off lapic nosmp'

BTW, I'm not sure acpi should be turned off with any
modern hardware. Did you try to compile with
CONFIG_ACPI = y, all other acpi options off, and maybe
tweak only the 'pci=...' boot parameter?

Regards,
Jarek P.


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-04 13:20       ` Christian Kujau
  2007-04-05  6:20         ` Jarek Poplawski
@ 2007-04-06 18:19         ` Christian Kujau
  2007-04-06 18:27           ` Christian Kujau
  2007-04-17 12:36           ` Jarek Poplawski
  1 sibling, 2 replies; 21+ messages in thread
From: Christian Kujau @ 2007-04-06 18:19 UTC (permalink / raw)
  To: Christian Kujau
  Cc: Jarek Poplawski, linux-kernel, netdev, malte, Francois Romieu

On Wed, 4 Apr 2007, Christian Kujau wrote:
>> Maybe it's a real locking problem. Here are some more
>> suggestions for testing (if you don't find anything better):
>> - try without SMP, so: 'acpi=off lapic nosmp'

We were able to get our hosting provider to replace the 8139too with an
E100; the onboard r8169 stayed, of course. After this, the box came back
fine...only to lock up again shortly after :(

So again we spoke to our hosting provider and they just took out the 2
SATA disks and put them in a completely new system: amd64 dualcore
again, 2 GB ram, r8169 onboard NIC, e100 pci-slot NIC. Now booting
2.6.20.4 and even 2.6.18-4-k7 (the debian kernel) with IOAPIC enabled
seems to work, meaning the box has been up since yesterday evening and
interrupts are shared. Not equally, but still:

# cat /proc/interrupts
            CPU0       CPU1
   0:        111          0   IO-APIC-edge      timer
   1:          7          9   IO-APIC-edge      i8042
   4:        260          1   IO-APIC-edge      serial
   6:          0          3   IO-APIC-edge      floppy
   8:          0          1   IO-APIC-edge      rtc
   9:          0          0   IO-APIC-fasteoi   acpi
  16:        157     575579   IO-APIC-fasteoi   eth0
  17:    3812553          1   IO-APIC-fasteoi   eth1
  19:     100651    8262484   IO-APIC-fasteoi   libata
NMI:          0          0
LOC:   17272991   17266237
ERR:          0
MIS:          0

While this is a good thing, we now have different problems: our 2nd sata
drive is not usable any more, but again we doubt hardware problems,
because this disk has already been replaced back in the old box...

but yes, these seem to be different problems; for the curious among
you I've put details here: http://nerdbynature.de/bits/2.6.20.4/db2/

Thanks to all who have replied,
Christian.
-- 
BOFH excuse #209:

Only people with names beginning with 'A' are getting mail this week (a la Microsoft)


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-06 18:19         ` Christian Kujau
@ 2007-04-06 18:27           ` Christian Kujau
  2007-04-17 12:36           ` Jarek Poplawski
  1 sibling, 0 replies; 21+ messages in thread
From: Christian Kujau @ 2007-04-06 18:27 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, netdev

On Fri, 6 Apr 2007, Christian Kujau wrote:
> but yes, this seem to be different problems, for the curious among you I've 
> put details here: http://nerdbynature.de/bits/2.6.20.4/db2/

that's http://nerdbynature.de/bits/2.6.20.4/db1/2/ ....sorry.

-- 
BOFH excuse #270:

Someone has messed up the kernel pointers


* Re: 2.6.20.4: NETDEV WATCHDOG and lockups
  2007-04-06 18:19         ` Christian Kujau
  2007-04-06 18:27           ` Christian Kujau
@ 2007-04-17 12:36           ` Jarek Poplawski
  1 sibling, 0 replies; 21+ messages in thread
From: Jarek Poplawski @ 2007-04-17 12:36 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, netdev, malte, Francois Romieu

On Fri, Apr 06, 2007 at 07:19:25PM +0100, Christian Kujau wrote:
> On Wed, 4 Apr 2007, Christian Kujau wrote:
> >>Maybe it's a real locking problem. Here are some more
> >>suggestions for testing (if you don't find anything better):
> >>- try without SMP, so: 'acpi=off lapic nosmp'
> 
> We were able to have our hosting provider to replace the 8139too with a 
> E100, the onboard r8169 stayed of course. After this, the box came back 
> fine...only to lock up again shortly after :(
> 
> So again we spoke to our hosting provider and they just took out the 2 
> SATA disks and put them in a completely new system: amd64 dualcore 
> again, 2 GB ram, r8169 onboard NIC, e100 pci-slot NIC. Now booting 
> 2.6.20.4 and even 2.6.18-4-k7 (the debian kernel) with IOAPIC eabled 
> seems to work, meaning the box is up since yesterday evening and 
> interrupts are shared. Not equally, but still:
> 
> # cat /proc/interrupts
>            CPU0       CPU1
...
>  16:        157     575579   IO-APIC-fasteoi   eth0
>  17:    3812553          1   IO-APIC-fasteoi   eth1
...

Yes! Nobody can deny they are shared. It's a miracle they
don't lock up now!

> While this is a good thing, we now have different problems: our 2nd sata 
> drive is not usable any more, but we again we doubt hardware problems, 
> because this disk has been replaced already back in the old box...
> 
> but yes, this seem to be different problems, for the curious among 
> you I've put details here: http://nerdbynature.de/bits/2.6.20.4/db2/

I don't want to waste your time on experiments, so don't
feel obliged to respond or try all of this, but here are
some impressions - what I'd do:

- these disk errors look serious and there is no reason
to try anything else without removing that disk and
testing it somewhere else,

- the configs are changed, but they sometimes include many
"risky" options like: X86_MCE, HOTPLUG, ACPI_BATTERY,
ACPI_BUTTON, ACPI_PROCESSOR, ENABLE_MEMORY_HOTPLUG etc.;
I doubt you need IDE at all: probably SATA_VIA and PATA_VIA
should be enough for your disks; with such problems I'd start
with absolute minimum - and no drivers for other models
(BTW - maybe I'm wrong, but isn't AMD64 MK-8?),

- if, with some config, a lockup is expected soon, I'd
turn off any watchdogs, turn on plenty of debugging - e.g.
lockdep, and try to wait and get some oops during a lockup
(sometimes it needs a few minutes, sometimes SYSRQ is helpful);
without this you could never be sure it'll work, or your
hardware would run at half speed with options turned off
unnecessarily.

> Thanks to all who have replied,

And I thank you for cooperation and interesting problems.
But I doubt anybody here is satisfied with anything but:
"it's working" (and your hardware doesn't look so special
it shouldn't work).

Cheers,
Jarek P.

