* XCP: Crashes on dual Xeon HP ProLiant systems
@ 2010-04-30 16:32 dwight at supercomputer.org
2010-04-30 18:20 ` Pasi Kärkkäinen
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: dwight at supercomputer.org @ 2010-04-30 16:32 UTC (permalink / raw)
To: xen-devel
Is anyone else running the latest XCP on HP ProLiant DL380
systems? Or a similar dual Xeon 8-core system? I'm seeing
spontaneous reboots when under a load.
Specifically, when 4 Windows HVMs are loaded, I haven't noticed
any reboots yet. But when running 7 or 8, the system will
reboot within minutes. Very little information appears on
the console.
I built a debugging version of the hypervisor, which changed
the behavior; the system managed to stay up for 2-3 hours
with 7 VMs running. However, it again spontaneously rebooted,
with no real messages on the console as to why.
I can send out the console log messages this evening, along
with the system information if there's interest. Alas, I
don't have access to these items at the moment.
I have also been running memtest86 overnight. As of 1.5 hours into
the test, there were no errors. But there are 48 GB of RAM
on the system, so the testing wasn't complete when I left.
Any suggestions here? I was going to build a 32-bit kernel
from the latest patches, but it appears CentOS 5.4 Xen is
also not stable on these systems. I had trouble getting
the kernel to build here, with various errors. The most
notable of which was:
----------------------
CC arch/x86/kernel/acpi/processor.o
In file included from arch/x86/kernel/acpi/processor.c:8:
include/linux/kernel.h:185: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://bugzilla.redhat.com/bugzilla> for instructions.
The bug is not reproducible, so it is likely a hardware or OS problem.
make[2]: *** [arch/x86/kernel/acpi/processor.o] Error 1
make[1]: *** [arch/x86/kernel/acpi] Error 2
make: *** [arch/x86/kernel] Error 2
----------------------
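gcc's remark that the bug "is not reproducible" suggests a check that can be repeated by hand: run the same failing command several times and see whether it fails consistently. Below is a minimal sketch in Python. The function name classify_failure and its three labels are hypothetical, chosen for illustration, and are not part of gcc or XCP. It is only a rough heuristic: a failure that comes and goes on identical input points more toward hardware, firmware, or the OS than toward a compiler bug.

```python
import subprocess
import sys


def classify_failure(cmd, attempts=5):
    """Run cmd `attempts` times and classify the pattern of exit codes.

    Returns "stable" if every run succeeds, "deterministic" if every run
    fails, and "intermittent" if the runs disagree with each other.
    """
    codes = [subprocess.run(cmd).returncode for _ in range(attempts)]
    failures = sum(1 for code in codes if code != 0)
    if failures == 0:
        return "stable"
    if failures == attempts:
        return "deterministic"
    return "intermittent"
```

The command passed in has to redo the same work on every attempt (for a compile step, that means removing the object file first), otherwise later runs are no-ops and the result means nothing.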
This was with a 64-bit Dom0 and a 32-bit Fedora 11 VM.
A 64-bit DomU works just fine. I know the stock XCP
kernel is 32-bit. Are there any issues running a 64-bit
XCP kernel, other than a slight degradation in speed?
-dwight-
* Re: XCP: Crashes on dual Xeon HP ProLiant systems
2010-04-30 16:32 XCP: Crashes on dual Xeon HP ProLiant systems dwight at supercomputer.org
@ 2010-04-30 18:20 ` Pasi Kärkkäinen
2010-05-01 21:06 ` dwight at supercomputer.org
2010-04-30 19:15 ` Ian Campbell
2010-05-24 16:35 ` XCP: Epilog - " dwight at supercomputer.org
2 siblings, 1 reply; 6+ messages in thread
From: Pasi Kärkkäinen @ 2010-04-30 18:20 UTC (permalink / raw)
To: dwight at supercomputer.org; +Cc: xen-devel
On Fri, Apr 30, 2010 at 09:32:37AM -0700, dwight at supercomputer.org wrote:
> Is anyone else running the latest XCP on HP ProLiant DL380
> systems? Or a similar dual Xeon 8-core system? I'm seeing
> spontaneous reboots when under a load.
>
> Specifically, when 4 Windows HVMs are loaded, I haven't noticed
> any reboots yet. But when running 7 or 8, the system will
> reboot within minutes. Very little information appears on
> the console.
>
> I built a debugging version of the hypervisor, which changed
> the behavior; the system managed to stay up for 2-3 hours
> with 7 VMs running. However, it again spontaneously rebooted,
> with no real messages on the console as to why.
>
> I can send out the console log messages this evening, along
> with the system information if there's interest. Alas, I
> don't have access to these items at the moment.
>
> I have also been running memtest86 overnight. As of 1.5 hours into
> the test, there were no errors. But there are 48 GB of RAM
> on the system, so the testing wasn't complete when I left.
>
> Any suggestions here? I was going to build a 32-bit kernel
> from the latest patches, but it appears CentOS 5.4 Xen is
> also not stable on these systems. I had trouble getting
> the kernel to build here, with various errors. The most
> notable of which was:
>
> ----------------------
> CC arch/x86/kernel/acpi/processor.o
> In file included from arch/x86/kernel/acpi/processor.c:8:
> include/linux/kernel.h:185: internal compiler error: Segmentation fault
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See <http://bugzilla.redhat.com/bugzilla> for instructions.
> The bug is not reproducible, so it is likely a hardware or OS problem.
> make[2]: *** [arch/x86/kernel/acpi/processor.o] Error 1
> make[1]: *** [arch/x86/kernel/acpi] Error 2
> make: *** [arch/x86/kernel] Error 2
> ----------------------
>
Uhm.. the compiler really shouldn't crash.
Are you sure your hardware is OK? If the stock EL5.4 Xen also crashes,
it could be broken hardware?
Did you try running memtest86+ ?
Is baremetal Linux stable, if you run for example
"make -j8 bzImage && make -j8 modules && make clean" kernel build in a loop?
-- Pasi
* Re: XCP: Crashes on dual Xeon HP ProLiant systems
2010-04-30 16:32 XCP: Crashes on dual Xeon HP ProLiant systems dwight at supercomputer.org
2010-04-30 18:20 ` Pasi Kärkkäinen
@ 2010-04-30 19:15 ` Ian Campbell
2010-05-01 21:07 ` dwight at supercomputer.org
2010-05-24 16:35 ` XCP: Epilog - " dwight at supercomputer.org
2 siblings, 1 reply; 6+ messages in thread
From: Ian Campbell @ 2010-04-30 19:15 UTC (permalink / raw)
To: dwight at supercomputer.org; +Cc: xen-devel@lists.xensource.com
On Fri, 2010-04-30 at 17:32 +0100, dwight at supercomputer.org wrote:
>
> A 64-bit DomU works just fine. I know the stock XCP
> kernel is 32-bits. Are there any issues running a 64-bit
> XCP kernel, other than a slight degradation in speed?
The XCP domain 0 kernel is only tested in 32 bit (PAE) configurations.
I'd expect it to work for 64 bit but wouldn't necessarily bet on it.
Ian
* Re: XCP: Crashes on dual Xeon HP ProLiant systems
2010-04-30 18:20 ` Pasi Kärkkäinen
@ 2010-05-01 21:06 ` dwight at supercomputer.org
0 siblings, 0 replies; 6+ messages in thread
From: dwight at supercomputer.org @ 2010-05-01 21:06 UTC (permalink / raw)
To: Pasi Kärkkäinen; +Cc: xen-devel
On Friday 30 April 2010 11:20:07 am Pasi Kärkkäinen wrote:
> On Fri, Apr 30, 2010 at 09:32:37AM -0700, dwight at supercomputer.org wrote:
> > Is anyone else running the latest XCP on HP ProLiant DL380
> > systems? Or a similar dual Xeon 8-core system? I'm seeing
> > spontaneous reboots when under a load. ...
>
> Uhm.. the compiler really shouldn't crash.
>
> Are you sure your hardware is OK? If the stock EL5.4 Xen also
> crashes, it could be broken hardware?
>
> Did you try running memtest86+ ?
>
> Is baremetal Linux stable, if you run for example
> "make -j8 bzImage && make -j8 modules && make clean" kernel build
> in a loop?
>
> -- Pasi
Thank you for your reply, Pasi.
I agree that the compiler shouldn't crash. That's definitely
rude behavior.
It might well be broken hardware. I had thought it more likely
to be an incompatibility between the older CentOS Xen and this
much newer Xeon hardware, so that the "hardware or OS problem"
gcc was complaining about was really an issue with the
virtualized hardware.
But yesterday I ran into a different issue, which leads me to
believe that it is either a physical hardware or Dom0 OS issue.
On the machine which was running XCP, I tried installing
64-bit CentOS 5.4. The installation crashed. Two separate times.
The first time I didn't have a log file (since it was a video-based
installation). The second time through, though, I used the iLO
virtualized serial port, and I could see that the installation
crashed about halfway through. Again, it was a spontaneous reboot,
just as XCP experienced.
I talked to one of the guys in the lab, who has done far more
installations of these ProLiant (and Dell) boxes than I have,
and he was quite familiar with this. He said that on some of
these boxes (both HP and Dell), the 64-bit CentOS 5.4 install
will crash. But supposedly the 32-bit installation will work.
He also said that CentOS 5.3, both 32- and 64-bit, works fine.
I realize that this is anecdotal, and I don't have any more
information here (as to the CPUs and hardware), but I thought
that this was interesting.
At this point, I don't trust either the hardware or the OS,
so I'm going to start a full diagnostics run using a suite
that I've put together over the past 15 years, which has
served me very well in qualifying boxes.
memtest86 is one of these. I mentioned earlier that I had
started an overnight run of this on both boxes. I can now
report that both have passed. After 12+ hours, they had gone
successfully through two separate runs without error.
Next up is prime95, with the torture test. Nothing else comes
close to exercising the CPU, as indicated by the heat given
off during this test. This will also be a test of the thermal
cooling.
If that passes, then I'm going to exercise the disk subsystem.
One of those tests is very similar to what you suggested: multiple
rebuilds of the kernel, but from scratch each time.
Frankly, though, I'm going to see if I can get a different
ProLiant box. Nonetheless, I want the data on this one.
I'm hoping that I can detect a box which will fail, before I
run XCP on it.
I'll post the results when I have them, hopefully in a
couple of days.
-dwight-
* Re: XCP: Crashes on dual Xeon HP ProLiant systems
2010-04-30 19:15 ` Ian Campbell
@ 2010-05-01 21:07 ` dwight at supercomputer.org
0 siblings, 0 replies; 6+ messages in thread
From: dwight at supercomputer.org @ 2010-05-01 21:07 UTC (permalink / raw)
To: Ian Campbell; +Cc: xen-devel@lists.xensource.com
On Friday 30 April 2010 12:15:38 pm Ian Campbell wrote:
> On Fri, 2010-04-30 at 17:32 +0100, dwight at supercomputer.org wrote:
> > A 64-bit DomU works just fine. I know the stock XCP
> > kernel is 32-bits. Are there any issues running a 64-bit
> > XCP kernel, other than a slight degradation in speed?
>
> The XCP domain 0 kernel is only tested in 32 bit (PAE)
> configurations. I'd expect it to work for 64 bit but wouldn't
> necessarily bet on it.
>
> Ian
Thank you, Ian. That was helpful.
-dwight-
* Re: XCP: Epilog - Crashes on dual Xeon HP ProLiant systems
2010-04-30 16:32 XCP: Crashes on dual Xeon HP ProLiant systems dwight at supercomputer.org
2010-04-30 18:20 ` Pasi Kärkkäinen
2010-04-30 19:15 ` Ian Campbell
@ 2010-05-24 16:35 ` dwight at supercomputer.org
2 siblings, 0 replies; 6+ messages in thread
From: dwight at supercomputer.org @ 2010-05-24 16:35 UTC (permalink / raw)
To: xen-devel
On Friday 30 April 2010 09:32:37 am I wrote:
> Is anyone else running the latest XCP on HP ProLiant DL380
> systems? Or a similar dual Xeon 8-core system? I'm seeing
> spontaneous reboots when under a load.
>
I wanted to follow up to the list on this issue, in case someone
else comes across this in the future with the ProLiant series.
The bottom line is that it was a firmware issue (actually, at least
two different components needed a firmware update). Thanks to Pasi
and Ian for the replies and suggestions. Also, I was able to repeat
the odd behavior of 64-bit CentOS 5.4 not installing, while the
32-bit version worked. This also went away after the firmware
upgrade.
Here are some more details which probably aren't of interest to the
list, but I'm sending them along in the hopes of sparing someone
else who comes across this, and does a Google search.
The key test here was running a continual loop of a -j8 kernel build,
from scratch. One test failed after 14 hours; another after 9 hours.
memtest86 and prime95 in torture test mode worked fine.
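For anyone wanting to repeat the continual -j8 build loop, here is a minimal sketch in Python. It is not the script actually used on these machines. The names run_until_failure and BUILD_CMD, and the exact make targets, are assumptions for illustration. It rebuilds from a cleaned tree each time and stops at the first failing build, reporting how many builds passed and how long the box survived.

```python
import subprocess
import sys
import time


def run_until_failure(cmd, max_iterations=None, cwd=None):
    """Run cmd repeatedly until it exits non-zero.

    Returns (iterations_passed, elapsed_seconds, failed). With
    max_iterations=None the loop only ends when a run fails.
    """
    start = time.monotonic()
    passed = 0
    while max_iterations is None or passed < max_iterations:
        if subprocess.run(cmd, cwd=cwd).returncode != 0:
            return passed, time.monotonic() - start, True
        passed += 1
    return passed, time.monotonic() - start, False


# Assumed build step: clean the tree, then rebuild with 8 parallel jobs.
BUILD_CMD = ["sh", "-c", "make clean && make -j8 bzImage && make -j8 modules"]

if __name__ == "__main__" and len(sys.argv) > 1:
    # argv[1]: path to an already configured kernel tree (hypothetical usage).
    passed, elapsed, failed = run_until_failure(BUILD_CMD, cwd=sys.argv[1])
    print(f"{passed} clean builds in {elapsed / 3600:.1f} h, failed={failed}")
```

Since the failures here took 9 to 14 hours to show up, a loop like this needs to run unattended for at least a day before a box can be called good. A spontaneous reboot will of course kill the script too, so the console or serial log remains the real record.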
It looks like we got some machines from
one of the early manufacturing runs back in July. HP has put in a
lot of effort in fixing a number of issues since then. One needs at
least the general firmware update ISO from their website, which is
presently at Version 9. This is necessary, but not sufficient. One
of our machines would still crash (though 64-bit CentOS would now
install). The final missing piece was a CPLD update, which HP
support was kind enough to quickly send me. With that, all machines
have been running XCP and numerous VMs quite solidly under a heavy
load.
In spite of these problems, I have to give kudos to HP for the
support effort that they've put into fixing all of these problems
over the past year. Some manufacturers wouldn't put nearly as much
effort into it.
Thanks again,
-dwight-