* RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions" during builds
@ 2002-02-21 17:35 Tom Epperly
  2002-02-21 18:19 ` RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions" Alan Cox
  0 siblings, 1 reply; 13+ messages in thread
From: Tom Epperly @ 2002-02-21 17:35 UTC
  To: linux-kernel

I am getting intermittent "Illegal instruction" errors during builds
of the software I am developing, and it appears to be kernel
related. I have been investigating this problem for several weeks, and
I have exhausted all the means of investigation known to me (detailed
below). There is evidence to suggest it is not a RAM problem or a
random hardware problem (see below). I can "solve" the problem by
running the non-SMP kernel and ignoring the second processor, but this
is not a particularly satisfying solution.  I am wondering if someone
can suggest some additional things I can do to understand and fix this
problem.  I would appreciate it if you could CC me on replies.

EVIDENCE OF THE PROBLEM
=======================

Here is an excerpt from the make log to show the effects of the problem:

make[2]: Entering directory `/home/epperly/tmp/nightly_qc/cronjobs/tom-linux-gcc2.96/babel/doc/talks'
rm -rf .libs _libs
rm -f *.lo
make[2]: Leaving directory `/home/epperly/tmp/nightly_qc/cronjobs/tom-linux-gcc2.96/babel/doc/talks'
Making clean in papers
make[1]: *** [clean-recursive] Illegal instruction
make[1]: Leaving directory `/home/epperly/tmp/nightly_qc/cronjobs/tom-linux-gcc2.96/babel/doc'
make: *** [clean-recursive] Error 1
****** make clean failed ******

Sometimes, the build runs to completion and succeeds.  When it fails, it
fails in a different spot each time. It doesn't always list "Illegal
instruction" as the error.  Here is another error message I've seen:

make[3]: *** [installcheck-local] Error 132
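
A note on decoding the status above: shells conventionally report a command
killed by signal N as exit status 128 + N, and make echoes that status, so
"Error 132" would correspond to signal 4, which is SIGILL on Linux. The
minimal Python sketch below only illustrates that arithmetic; the function
name signal_from_exit_status is made up for this illustration and is not
part of the build.

```python
import signal


def signal_from_exit_status(status):
    """Return the signal name encoded in a shell exit status, or None.

    Assumes the shell convention of reporting 128 + N for a command
    that was killed by signal N.
    """
    if status <= 128:
        return None  # ordinary exit code, no signal involved
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None  # 128 + N where N is not a known signal number


# The status from the make log above.
print(signal_from_exit_status(132))
```

On a Linux system this should name the same signal as the "Illegal
instruction" message in the first excerpt, which suggests the two error
forms are the same failure reported in two ways.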

I could show you a lot more examples, but they don't seem to indicate more 
than the examples I've shown here.  The package we build can be downloaded 
here: http://www.llnl.gov/casc/components/docs/babel-0.6.3.tar.gz
The build uses the autotools suite, Sun's JDK 1.3.1_02, gcc, g++, g77, and 
Python.

On one occasion, random processes started dying on the machine. I had to 
reboot to recover.

The failure rate of our nightly build is between 20% and 40%.  These failures
exclude any that relate to things we can trace back to our coding
mistakes. The nightly builds do a sequence of configure, build and
regression testing.

MACHINE DETAILS
===============

HARDWARE	Dell Precision Workstation 530
PROCESSORS	Dual Intel Xeon Processors 1500MHz
RAM		512MB ECC RAM
O/S		RH 7.2 (upgraded from 7.1) running RH's 2.4.9-21 SMP 

$ /sbin/lsmod
Module                  Size  Used by
nfsd                   71232   8 (autoclean)
autofs                 11556   1 (autoclean)
nfs                    79840   3 (autoclean)
lockd                  53184   1 (autoclean) [nfsd nfs]
sunrpc                 64816   1 (autoclean) [nfsd nfs lockd]
3c59x                  26504   1
usb-uhci               21668   0 (unused)
usbcore                51808   1 [usb-uhci]
aic7xxx               114624   6
sd_mod                 11900   6
scsi_mod               98584   2 [aic7xxx sd_mod]

An identical machine has the same intermittent problems that my box does.

WHAT I HAVE ALREADY TRIED
=========================

1. Upgraded from an earlier kernel to RH 2.4.9-21 SMP

   The new kernel didn't change anything.

2. Ran Dell's memory checker on the RAM for an hour. It checked out
   fine.

   The fact that another machine next door has the exact same problems 
   suggests that it isn't a random hardware problem unless they both came 
   from the same bad batch.

3. Opened my case to vent additional heat.

   Someone suggested that the CPUs might be overheating and that opening
   the case might solve the problem.  It didn't solve the problem. My
   machine is in a well air conditioned room, and it didn't seem 
   excessively hot when I opened the case.  I haven't overclocked the
   machine or anything like that.

4. Disabled the X11 server and rebooted to avoid loading the nVIDIA kernel module.

   This may have lowered the frequency of problems, but it did not 
   eliminate the problem. Our nightly build still failed roughly
   20% of the time with "Illegal instruction" errors or other
   unexplainable failures.

5. Ran the non-SMP 2.4.9-21 kernel, effectively turning off the second
   processor (X11 still disabled)

   This seems to have "solved" the problem.  I've run over 22 nightly 
   builds, two at a time, on the system without a single failure.
   Running with just one processor is better than running an unstable
   two processor system, but it seems like I should be able to figure
   out how to have a stable two processor system.
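
As a rough check on how strong the evidence in item 5 is, the sketch below
computes the chance of 22 clean builds in a row if the SMP failure rate
still applied. It assumes each build fails independently with a fixed
probability, which is a simplification introduced here, not something
measured on these machines; the helper name prob_all_pass is illustrative.

```python
def prob_all_pass(failure_rate, builds):
    """Probability that `builds` independent builds all succeed."""
    return (1.0 - failure_rate) ** builds


# Failure rates taken from the 20% to 40% range reported for the SMP kernel.
for rate in (0.20, 0.40):
    p = prob_all_pass(rate, 22)
    print(f"failure rate {rate:.0%}: P(22 clean builds) = {p:.2e}")
```

Under that independence assumption, 22 consecutive successes would be
unlikely (below 1%) if the underlying failure rate were still 20% or
higher, which is consistent with the non-SMP kernel genuinely avoiding
the problem rather than a lucky streak.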

I have not tried compiling my own kernel because I don't have root on this
machine yet. I work in an environment where they try to centralize machine
administration, so I need special permission to get root.  There is also a
desire to stick with generic RH software components. Going through the
process above is part of what I've done to justify getting root, so I can
try installing a more recent kernel.

Do you agree that this is likely to be a kernel problem?  Is upgrading
the kernel my best course of action?

Here is what I get when running the non-SMP kernel. 
$ uname -a
Linux tux06.llnl.gov 2.4.9-21 #1 Thu Jan 17 14:16:30 EST 2002 i686 unknown
$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 15
model		: 0
model name	: Intel(R) Xeon(TM) CPU 1500MHz
stepping	: 10
cpu MHz		: 1495.463
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips	: 2981.88

Thanks in advance,

Tom

--
------------------------------------------------------------------------
Tom Epperly
Center for Applied Scientific Computing   Phone: 925-424-3159
Lawrence Livermore National Laboratory      Fax: 925-424-2477
L-661, P.O. Box 808, Livermore, CA 94551  Email: tepperly@llnl.gov
------------------------------------------------------------------------


* Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions"
@ 2002-03-11 20:26 James Washer
  0 siblings, 0 replies; 13+ messages in thread
From: James Washer @ 2002-03-11 20:26 UTC
  To: Tom Epperly; +Cc: Alan Cox, linux-kernel


Tom,

If I send you a patch to do a bit of debugging, would you be able to run
it? Basically, my plan is just to call a die()-like routine for trap 6, so
that we get a good stack frame (much like an oops). From there, we should
be able to figure out WHY the program is getting an illegal op.

My guesses at this point are bad file I/O (i.e. the executable is
corrupt) or bad memory (check whether the same physical page is being
used, for example).
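
One user-space way to probe the "executable is corrupt" guess, without
root or a kernel patch, might be to read the toolchain binaries several
times and compare digests. This is only a sketch: the function names are
invented here, and the file list is an assumption (the demo hashes the
running Python interpreter so that it works anywhere). Repeated reads are
normally served from the page cache, so a mismatch would hint at
corruption in memory or I/O rather than prove where it happened.

```python
import hashlib
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unstable_files(paths, passes=3):
    """Return the paths whose digest differs between repeated reads."""
    unstable = []
    for path in paths:
        digests = {sha256_of(path) for _ in range(passes)}
        if len(digests) > 1:
            unstable.append(path)
    return unstable


# Demo on the running interpreter; on the build machine the list would
# instead name the tools the build uses (make, gcc, and so on).
print(unstable_files([sys.executable]))
```

Running this in a loop while a build is in progress on the SMP kernel
would show whether file contents ever read back differently; an empty
list on every pass would not rule out the other guess (bad memory).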

 - jim

"Tom Epperly" <tepperly@llnl.gov>@vger.kernel.org on 03/11/2002 08:07:05 AM

Sent by:    linux-kernel-owner@vger.kernel.org


To:    Alan Cox <alan@lxorguk.ukuu.org.uk>
cc:    linux-kernel@vger.kernel.org
Subject:    Re: RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal
       instructions"



Alan Cox wrote:

>>Do you agree that this is likely to be a kernel problem?  Is upgrading
>>the kernel my best course of action?
>>
>
>Almost every other report I have ever seen that looked like that one has
>always turned out to be hardware related. The randomness in particular
>tends to be a pointer to things like cache faults.
>
>You do have ECC main memory which is good.
>
>What other hardware is in the machine ?
>
>
To recap from an earlier email, my nightly build & regression tests (a
roughly 2 hour process involving Sun's JDK, GNU make, gcc, g++, g77 &
Bourne shell scripts) have been failing intermittently on a dual-Xeon
system, usually with an "Illegal instruction" signal. I've tried removing
the sound card and disabling the X11 server to avoid loading the nVidia
kernel mod. The intermittent failures disappear when I run non-SMP. I've
tried swapping the processors on the motherboard, and both processors
appear to work fine individually. Most of the boxes I've run on have >=
512MB ECC RAM. I've run Dell's hardware diagnostics (especially the
memory ones) twice. The diagnostics don't seem to have SMP tests where
both CPUs are being stressed.
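
Since the vendor diagnostics do not appear to load both CPUs at once, a
crude substitute could be to run identical deterministic work in several
processes at the same time and check that every process finishes cleanly
with the same answer. The sketch below is an assumption-laden
illustration, not a validated diagnostic: the names WORKER, run_round and
round_is_consistent are invented here, and nothing in it forces the
scheduler to place the workers on different CPUs.

```python
import subprocess
import sys

# Worker program: a deterministic SHA-256 hash chain. Every healthy run
# with the same iteration count must print the same final digest.
WORKER = (
    "import hashlib, sys\n"
    "d = b'seed'\n"
    "for _ in range(int(sys.argv[1])):\n"
    "    d = hashlib.sha256(d).digest()\n"
    "print(d.hex())\n"
)


def run_round(workers=2, iterations=200_000):
    """Start `workers` identical processes concurrently and collect
    (return code, printed digest) for each one."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER, str(iterations)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(workers)
    ]
    results = []
    for proc in procs:
        out, _ = proc.communicate()
        # On POSIX a negative return code means "killed by that signal",
        # so -4 would indicate SIGILL.
        results.append((proc.returncode, out.strip()))
    return results


def round_is_consistent(results):
    """True if every worker exited with 0 and all digests agree."""
    codes = {code for code, _ in results}
    digests = {digest for _, digest in results}
    return codes == {0} and len(digests) == 1


print(round_is_consistent(run_round()))
```

Looping over run_round for a few hours, with at least as many workers as
processors, would give a small reproducer that is independent of the JDK,
make, and the compilers; a failing round under the SMP kernel but not the
non-SMP kernel would match the pattern described above.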

FYI, when I upgraded to the 2.4.18-1smp kernel, the failure rate went
from 20% to 100%. I have tried running the nightly build & regression on
roughly 6 different dual-processor Pentium III or better machines
(cycling it over and over), and they all have intermittent failures of
one kind or another. All these machines are made by Dell, but they
provide some evidence that it is not a hardware problem.

Tom



Thread overview: 13+ messages
2002-02-21 17:35 RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions" during builds Tom Epperly
2002-02-21 18:19 ` RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions" Alan Cox
2002-02-21 18:26   ` Tom Epperly
2002-02-21 18:33     ` RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal arjan
2002-02-21 19:23       ` Tom Epperly
2002-02-21 19:41         ` Alan Cox
2002-02-21 20:21         ` J Sloan
2002-02-21 19:05     ` RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions" Richard B. Johnson
2002-02-21 19:36     ` RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal Alan Cox
2002-02-22 22:45       ` Tom Epperly
2002-03-11 16:07   ` RH7.2 running 2.4.9-21-SMP (dual Xeon's) yields "Illegal instructions" Tom Epperly
2002-03-11 17:08     ` Alan Cox
  -- strict thread matches above, loose matches on Subject: below --
2002-03-11 20:26 James Washer
