* Kernel lockups on dual-Athlon board -- help wanted
@ 2001-08-11 10:23 Eric S. Raymond
2001-08-11 10:46 ` Alex Buell
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Eric S. Raymond @ 2001-08-11 10:23 UTC (permalink / raw)
To: Linux Kernel List
Gary Sandine of Los Alamos Computers and I are attempting to qualify
Linux on a Tyan 2462 K7 Thunder motherboard -- dual Athlon 1200 MP
chips supported by an AMD 760 chipset. We have been seeing mysterious
lockups during commands to build things from source, like kernels and X.
We've been trying to track down the problem for about sixteen hours
and have gathered quite a bit of data, but don't have a theory to explain
it.
First, we have established that this is a real kernel hang, not just a
bad device state:
A. Lockups can be induced in either console or X mode. A reliable way to
induce them is to run `make clean' on an X tree (any sufficiently
long-running command seems to do it).
B. We logged in over the network, started a top(1) in the network
session, induced the hang on the console, and watched top(1) freeze.
So
C. The magic AltSysRq command is ineffective when the lockups happen.
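The check in B could also be approximated without watching a terminal. What follows is only a sketch, not something that was run in the tests above: the function name `heartbeat` and the log file name are made up for illustration. It appends a timestamp to a file once per interval and syncs each write, so after a forced power-down the last line gives a rough time for the hang.

```python
import os
import time


def heartbeat(path, interval=1.0, count=None):
    """Append a timestamp to `path` every `interval` seconds, syncing each write.

    `count` limits the number of beats (None means run until killed or hung).
    Returns the number of lines written.
    """
    written = 0
    with open(path, "a") as log:
        while count is None or written < count:
            log.write(f"{time.time():.3f}\n")
            log.flush()
            os.fsync(log.fileno())  # ask the OS to push the line to disk now
            written += 1
            time.sleep(interval)
    return written


if __name__ == "__main__":
    # Hypothetical log location; pick a path on a local disk.
    heartbeat("heartbeat.log")
```

Run it in the background before starting the build; if the timestamps stop while the machine is still powered, the whole system stalled rather than just the display.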
Here's what we know about it:
1. Lockups never occur under a uniprocessor kernel.
2. Configuring APM and ACPI out of the kernel does not prevent the lockups.
Disabling ACPI and power management doesn't stop them either.
3. Changing kernels from 2.4.3 to 2.4.7 doesn't prevent the lockups.
4. The SMP kernel built for either PII or AMD (no APM, no ACPI) locks up.
5. There is an undocumented BIOS setting "Use PCI Interrupt Entries in
MP table." By default it is on. Turning it off doesn't prevent the
lockups.
6. Here's a weird one. When the kernel is running, the power switch
has to be pressed down for 4 seconds to power down the machine. But
during a lockup it powers down the machine instantly.
What we're seeing suggests some bad interaction between the SMP
support and the hardware. But item 7 hints that power management
could be involved, even though we have it configured out.
Anybody have a brilliant insight? Suggestions for further tests?
--
<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
Government should be weak, amateurish and ridiculous. At present, it
fulfills only a third of the role.
-- Edward Abbey
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Kernel lockups on dual-Athlon board -- help wanted
2001-08-11 10:23 Kernel lockups on dual-Athlon board -- help wanted Eric S. Raymond
@ 2001-08-11 10:46 ` Alex Buell
2001-08-11 16:22 ` Eric S. Raymond
2001-08-11 13:19 ` Alan Cox
` (2 subsequent siblings)
3 siblings, 1 reply; 14+ messages in thread
From: Alex Buell @ 2001-08-11 10:46 UTC (permalink / raw)
To: Eric S. Raymond; +Cc: Linux Kernel List
On Sat, 11 Aug 2001, Eric S. Raymond wrote:
> 6. Here's a weird one. When the kernel is running, the power switch
> has to be pressed down for 4 seconds to power down the machine. But
> during a lockup it powers down the machine instantly.
>
> What we're seeing suggests some bad interaction between the SMP
> support and the hardware. But item 7 hints that power management
> could be involved, even though we have it configured out.
You appear to be missing item 7.
--
Sigfault: Witty message dumped.
http://www.tahallah.demon.co.uk
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Kernel lockups on dual-Athlon board -- help wanted
2001-08-11 10:46 ` Alex Buell
@ 2001-08-11 16:22 ` Eric S. Raymond
0 siblings, 0 replies; 14+ messages in thread
From: Eric S. Raymond @ 2001-08-11 16:22 UTC (permalink / raw)
To: Alex Buell; +Cc: Linux Kernel List
Alex Buell <alex.buell@tahallah.demon.co.uk>:
> On Sat, 11 Aug 2001, Eric S. Raymond wrote:
>
> > 6. Here's a weird one. When the kernel is running, the power switch
> > has to be pressed down for 4 seconds to power down the machine. But
> > during a lockup it powers down the machine instantly.
> >
> > What we're seeing suggests some bad interaction between the SMP
> > support and the hardware. But item 7 hints that power management
> > could be involved, even though we have it configured out.
>
> You appear to be missing item 7.
It was 0400 and I was fried. I was referring to the item 6 you quoted.
--
<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
Militias, when properly formed, are in fact the people themselves and
include all men capable of bearing arms. [...] To preserve liberty it is
essential that the whole body of the people always possess arms and be
taught alike, especially when young, how to use them.
-- Senator Richard Henry Lee, 1788, on "militia" in the 2nd Amendment
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Kernel lockups on dual-Athlon board -- help wanted
2001-08-11 10:23 Kernel lockups on dual-Athlon board -- help wanted Eric S. Raymond
2001-08-11 10:46 ` Alex Buell
@ 2001-08-11 13:19 ` Alan Cox
2001-08-11 16:32 ` Eric S. Raymond
2001-08-11 16:08 ` Charles Cazabon
2001-08-11 16:17 ` Eric W. Biederman
3 siblings, 1 reply; 14+ messages in thread
From: Alan Cox @ 2001-08-11 13:19 UTC (permalink / raw)
To: esr; +Cc: Linux Kernel List
> 6. Here's a weird one. When the kernel is running, the power switch
> has to be pressed down for 4 seconds to power down the machine. But
> during a lockup it powers down the machine instantly.
>
> What we're seeing suggests some bad interaction between the SMP
> support and the hardware. But item 7 hints that power management
> could be involved, even though we have it configured out.
Try a completely different board and components. There are folks running rock
solid 2.4 on dual Athlons. I speculate yours is perhaps marginal somewhere.
Alan
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Kernel lockups on dual-Athlon board -- help wanted
2001-08-11 13:19 ` Alan Cox
@ 2001-08-11 16:32 ` Eric S. Raymond
2001-08-11 16:44 ` Johannes Erdfelt
0 siblings, 1 reply; 14+ messages in thread
From: Eric S. Raymond @ 2001-08-11 16:32 UTC (permalink / raw)
To: Alan Cox; +Cc: Linux Kernel List
Alan Cox <alan@lxorguk.ukuu.org.uk>:
> Try a completely different board and components. There are folks
> running rock solid 2.4 on dual Athlons. I speculate yours is perhaps
> marginal somewhere.
OK...
1. By "different board" do you mean different instance of the same design, or
different design? Because the K7 Thunder is, AFAIK, the only dual-Athlon
1200 board that exists right now.
2. Do you know of anyone else successfully running 2.4 over an AMD 760
support chipset?
3. Which components do you think are likely to be implicated? Bad memory
is an obvious guess, I suppose.
--
<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
"Both oligarch and tyrant mistrust the people,
and therefore deprive them of arms."
--Aristotle
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Kernel lockups on dual-Athlon board -- help wanted
2001-08-11 16:32 ` Eric S. Raymond
@ 2001-08-11 16:44 ` Johannes Erdfelt
2001-08-11 16:50 ` Eric S. Raymond
0 siblings, 1 reply; 14+ messages in thread
From: Johannes Erdfelt @ 2001-08-11 16:44 UTC (permalink / raw)
To: Eric S. Raymond, Alan Cox, Linux Kernel List
On Sat, Aug 11, 2001, Eric S. Raymond <esr@thyrsus.com> wrote:
> Alan Cox <alan@lxorguk.ukuu.org.uk>:
> > Try a completely different board and components. There are folks
> > running rock solid 2.4 on dual Athlons. I speculate yours is perhaps
> > marginal somewhere.
>
> OK...
>
> 1. By "different board" do you mean different instance of the same design, or
> different design? Because the K7 Thunder is, AFAIK, the only dual-Athlon
> 1200 board that exists right now.
I suspect he means same design, different physical hardware.
> 2. Do you know of anyone else successfully running 2.4 over an AMD 760
> support chipset?
I am.
I have had some cooling problems with some of the hardware in the past.
> 3. Which components do you think are likely to be implicated? Bad memory
> is an obvious guess, I suppose.
Cooling most likely. What kind of system is this? Rackmount? Desktop
case?
JE
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Kernel lockups on dual-Athlon board -- help wanted
2001-08-11 16:44 ` Johannes Erdfelt
@ 2001-08-11 16:50 ` Eric S. Raymond
2001-08-11 10:09 ` John Heil
2001-08-11 19:31 ` Ben LaHaise
0 siblings, 2 replies; 14+ messages in thread
From: Eric S. Raymond @ 2001-08-11 16:50 UTC (permalink / raw)
To: Johannes Erdfelt; +Cc: Alan Cox, Linux Kernel List
Johannes Erdfelt <johannes@erdfelt.com>:
> > different design? Because the K7 Thunder is, AFAIK, the only dual-Athlon
> > 1200 board that exists right now.
>
> I suspect he means same design, different physical hardware.
Yes, he said so in a reply.
> > 2. Do you know of anyone else successfully running 2.4 over an AMD 760
> > support chipset?
>
> I am.
>
> I have had some cooling problems with some of the hardware in the past.
>
> > 3. Which components do you think are likely to be implicated? Bad memory
> > is an obvious guess, I suppose.
>
> Cooling most likely. What kind of system is this? Rackmount? Desktop
> case?
Server case. Seems to be running pretty cool -- the processor heatsinks
are warm to the touch but not hot. We've got a power-supply fan, two coolers,
and two case fans. Did you find you needed more than that?
--
<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
There's a truism that the road to Hell is often paved with good intentions.
The corollary is that evil is best known not by its motives but by its
*methods*.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Kernel lockups on dual-Athlon board -- help wanted
2001-08-11 16:50 ` Eric S. Raymond
@ 2001-08-11 10:09 ` John Heil
2001-08-11 17:22 ` Eric S. Raymond
2001-08-11 19:31 ` Ben LaHaise
1 sibling, 1 reply; 14+ messages in thread
From: John Heil @ 2001-08-11 10:09 UTC (permalink / raw)
To: Eric S. Raymond; +Cc: Johannes Erdfelt, Alan Cox, Linux Kernel List
On Sat, 11 Aug 2001, Eric S. Raymond wrote:
> Date: Sat, 11 Aug 2001 12:50:35 -0400
> From: "Eric S. Raymond" <esr@thyrsus.com>
> To: Johannes Erdfelt <johannes@erdfelt.com>
> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>,
> Linux Kernel List <linux-kernel@vger.kernel.org>
> Subject: Re: Kernel lockups on dual-Athlon board -- help wanted
>
> Johannes Erdfelt <johannes@erdfelt.com>:
> > > different design? Because the K7 Thunder is, AFAIK, the only dual-Athlon
> > > 1200 board that exists right now.
> >
> > I suspect he means same design, different physical hardware.
>
> Yes, he said so in a reply.
>
> > > 2. Do you know of anyone else successfully running 2.4 over an AMD 760
> > > support chipset?
> >
> > I am.
> >
> > I have had some cooling problems with some of the hardware in the past.
> >
> > > 3. Which components do you think are likely to be implicated? Bad memory
> > > is an obvious guess, I suppose.
> >
> > Cooling most likely. What kind of system is this? Rackmount? Desktop
> > case?
>
> Server case. Seems to be running pretty cool -- the processor heatsinks
> are warm to the touch but not hot. We've got a power-supply fan, two coolers,
> and two case fans. Did you find you needed more than that?
You might try a heat sink & fan on the north bridge chip.
Also your cpu fans ought to be of the 7+K RPM variety.
> --
> <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
>
> There's a truism that the road to Hell is often paved with good intentions.
> The corollary is that evil is best known not by its motives but by its
> *methods*.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
-
-----------------------------------------------------------------
John Heil
South Coast Software
Custom systems software for UNIX and IBM MVS mainframes
1-714-774-6952
johnhscs@sc-software.com
http://www.sc-software.com
-----------------------------------------------------------------
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Kernel lockups on dual-Athlon board -- help wanted
2001-08-11 10:09 ` John Heil
@ 2001-08-11 17:22 ` Eric S. Raymond
2001-08-11 10:30 ` John Heil
2001-08-11 17:57 ` Charles Cazabon
0 siblings, 2 replies; 14+ messages in thread
From: Eric S. Raymond @ 2001-08-11 17:22 UTC (permalink / raw)
To: John Heil; +Cc: Johannes Erdfelt, Alan Cox, Linux Kernel List
John Heil <kerndev@sc-software.com>:
> You might try a heat sink & fan on the north bridge chip.
> Also your cpu fans ought to be of the 7+K RPM variety.
Interesting. We're going to put Silverados on the CPUs as soon as we
can get them -- if you don't know what those are, they're a super-well-
designed cooler that can chill a chip by 24 degrees centigrade. Low-noise,
too, they only emit 37dBA.
Where is the north bridge chip on the board? I have the mobo diagram from the
Tyan site but it doesn't show that.
--
<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
Americans have the will to resist because you have weapons.
If you don't have a gun, freedom of speech has no power.
-- Yoshimi Ishikawa, Japanese author, in the LA Times 15 Oct 1992
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Kernel lockups on dual-Athlon board -- help wanted
2001-08-11 17:22 ` Eric S. Raymond
@ 2001-08-11 10:30 ` John Heil
2001-08-11 17:57 ` Charles Cazabon
1 sibling, 0 replies; 14+ messages in thread
From: John Heil @ 2001-08-11 10:30 UTC (permalink / raw)
To: Eric S. Raymond; +Cc: Johannes Erdfelt, Alan Cox, Linux Kernel List
On Sat, 11 Aug 2001, Eric S. Raymond wrote:
> Date: Sat, 11 Aug 2001 13:22:09 -0400
> From: "Eric S. Raymond" <esr@thyrsus.com>
> To: John Heil <kerndev@sc-software.com>
> Cc: Johannes Erdfelt <johannes@erdfelt.com>,
> Alan Cox <alan@lxorguk.ukuu.org.uk>,
> Linux Kernel List <linux-kernel@vger.kernel.org>
> Subject: Re: Kernel lockups on dual-Athlon board -- help wanted
>
> John Heil <kerndev@sc-software.com>:
> > You might try a heat sink & fan on the north bridge chip.
> > Also your cpu fans ought to be of the 7+K RPM variety.
>
> Interesting. We're going to put Silverados on the CPUs as soon as we
> can get them -- if you don't know what those are, they're a super-well-
> designed cooler that can chill a chip by 24 degrees centigrade. Low-noise,
> too, they only emit 37dBA.
>
> Where is the north bridge chip on the board? I have the mobo diagram from the
> Tyan site but it doesn't show that.
It's the AMD-762 system controller, the one with the metallic cap on top.
> --
> <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
>
> Americans have the will to resist because you have weapons.
> If you don't have a gun, freedom of speech has no power.
> -- Yoshimi Ishikawa, Japanese author, in the LA Times 15 Oct 1992
>
-
-----------------------------------------------------------------
John Heil
South Coast Software
Custom systems software for UNIX and IBM MVS mainframes
1-714-774-6952
johnhscs@sc-software.com
http://www.sc-software.com
-----------------------------------------------------------------
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Kernel lockups on dual-Athlon board -- help wanted
2001-08-11 17:22 ` Eric S. Raymond
2001-08-11 10:30 ` John Heil
@ 2001-08-11 17:57 ` Charles Cazabon
1 sibling, 0 replies; 14+ messages in thread
From: Charles Cazabon @ 2001-08-11 17:57 UTC (permalink / raw)
To: Linux Kernel List
Eric S. Raymond <esr@thyrsus.com> wrote:
> John Heil <kerndev@sc-software.com>:
> > You might try a heat sink & fan on the north bridge chip.
> > Also your cpu fans ought to be of the 7+K RPM variety.
>
> Interesting. We're going to put Silverados on the CPUs as soon as we
> can get them -- if you don't know what those are, they're a super-well-
> designed cooler that can chill a chip by 24 degrees centigrade.
Are you sure of that spec? 24 degrees isn't a whole lot for a CPU
cooler. An Athlon without a fan can reach 70 centigrade before it fries
a few seconds later, and many of the coolers in use with them can bring
them down to high-20s. 24 degrees isn't enough.
Also, I doubt it's memory -- you'd see segfaults or oopses. Bad or
overheated CPU, unstable or underpowered power supply, or faulty
mainboard is more likely for your symptoms.
Charles
--
-----------------------------------------------------------------------
Charles Cazabon <linux@discworld.dyndns.org>
GPL'ed software available at: http://www.qcc.sk.ca/~charlesc/software/
-----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Kernel lockups on dual-Athlon board -- help wanted
2001-08-11 16:50 ` Eric S. Raymond
2001-08-11 10:09 ` John Heil
@ 2001-08-11 19:31 ` Ben LaHaise
1 sibling, 0 replies; 14+ messages in thread
From: Ben LaHaise @ 2001-08-11 19:31 UTC (permalink / raw)
To: Eric S. Raymond; +Cc: Johannes Erdfelt, Alan Cox, Linux Kernel List
On Sat, 11 Aug 2001, Eric S. Raymond wrote:
> Server case. Seems to be running pretty cool -- the processor heatsinks
> are warm to the touch but not hot. We've got a power-supply fan, two coolers,
> and two case fans. Did you find you needed more than that?
Depends on how many drives you've got in the case. The dual Athlon I'm
using has three case fans: two pulling air out from just below the
power supply and one drawing air into the case at the front. The only
crashes I've experienced on the machine were caused by kernel bugs. =)
-ben
--
"The world would be a better place if Larry Wall had been born in
Iceland, or any other country where the native language actually
has syntax" -- Peter da Silva
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Kernel lockups on dual-Athlon board -- help wanted
2001-08-11 10:23 Kernel lockups on dual-Athlon board -- help wanted Eric S. Raymond
2001-08-11 10:46 ` Alex Buell
2001-08-11 13:19 ` Alan Cox
@ 2001-08-11 16:08 ` Charles Cazabon
2001-08-11 16:17 ` Eric W. Biederman
3 siblings, 0 replies; 14+ messages in thread
From: Charles Cazabon @ 2001-08-11 16:08 UTC (permalink / raw)
To: Linux Kernel List
Eric S. Raymond <esr@thyrsus.com> wrote:
>
> A. Lockups can be induced in either console or X mode. A reliable way to
> induce them is to run `make clean' on an X tree (any sufficiently
> long-running command seems to do it).
This sounds like bad hardware.
> 6. Here's a weird one. When the kernel is running, the power switch
> has to be pressed down for 4 seconds to power down the machine. But
> during a lockup it powers down the machine instantly.
This is normal for ATX machines. There's usually a BIOS setting
controlling whether the power switch is instant or delayed, but once the
software isn't running any more, it's always instant.
I really do think it's bad hardware -- CPU, mainboard, power supply, or
some combination of the above.
Charles
--
-----------------------------------------------------------------------
Charles Cazabon <linux@discworld.dyndns.org>
GPL'ed software available at: http://www.qcc.sk.ca/~charlesc/software/
-----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Kernel lockups on dual-Athlon board -- help wanted
2001-08-11 10:23 Kernel lockups on dual-Athlon board -- help wanted Eric S. Raymond
` (2 preceding siblings ...)
2001-08-11 16:08 ` Charles Cazabon
@ 2001-08-11 16:17 ` Eric W. Biederman
3 siblings, 0 replies; 14+ messages in thread
From: Eric W. Biederman @ 2001-08-11 16:17 UTC (permalink / raw)
To: esr; +Cc: Linux Kernel List
"Eric S. Raymond" <esr@thyrsus.com> writes:
> Gary Sandine of Los Alamos Computers and I are attempting to qualify
> Linux on a Tyan 2462 K7 Thunder motherboard -- dual Athlon 1200 MP
> chips supported by an AMD 760 chipset. We have been seeing mysterious
> lockups during commands to build things from source, like kernels and X.
>
> We've been trying to track down the problem for about sixteen hours
> and have gathered quite a bit of data, but don't have a theory to explain
> it.
What kind of case are you running in? I have heard of one other case
that sounds similar, and in that case the system was in a 1U.
> First, we have established that this is a real kernel hang, not just a
> bad device state:
>
> A. Lockups can be induced in either console or X mode. A reliable way to
> induce them is to run `make clean' on an X tree (any sufficiently
> long-running command seems to do it).
>
> B. We logged in over the network, started a top(1) in the network
> session, induced the hang on the console, and watched top(1) freeze.
> So
>
> C. The magic AltSysRq command is ineffective when the lockups happen.
>
> Here's what we know about it:
>
> 1. Lockups never occur under a uniprocessor kernel.
>
> 2. Configuring APM and ACPI out of the kernel does not prevent the lockups.
> Disabling ACPI and power management doesn't stop them either.
>
> 3. Changing kernels from 2.4.3 to 2.4.7 doesn't prevent the lockups.
>
> 4. The SMP kernel built for either PII or AMD (no APM, no ACPI) locks up.
>
> 5. There is an undocumented BIOS setting "Use PCI Interrupt Entries in
> MP table." By default it is on. Turning it off doesn't prevent the
> lockups.
This switches between listing the 4 interrupts that the board uses for PCI
in the ISA range of interrupts and routing them to the IOAPIC
above the normal 16 ISA interrupts.
> 6. Here's a weird one. When the kernel is running, the power switch
> has to be pressed down for 4 seconds to power down the machine. But
> during a lockup it powers down the machine instantly.
>
> What we're seeing suggests some bad interaction between the SMP
> support and the hardware. But item 7 hints that power management
> could be involved, even though we have it configured out.
The board only uses ACPI, so power management isn't a likely candidate.
I think I have to go with Alan that the most likely case is that the
board is marginal in some respect.
Eric
^ permalink raw reply [flat|nested] 14+ messages in thread