From: Michal Szymanski
Subject: Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
Date: Fri, 12 May 2006 12:54:36 +0200
Message-ID: <20060512105436.GA16850@astrouw.edu.pl>
References: <20060418191102.GA15132@astrouw.edu.pl> <445B5A8A.3060106@tmr.com> <20060505152344.GA8408@boogeyman>
In-Reply-To: <20060505152344.GA8408@boogeyman>
To: SMP list

On Fri, May 05, 2006 at 08:23:44AM -0700, cerise@armory.com wrote:
> > Michal Szymanski wrote:
> >
> > >All systems crash (either hang with some "machine check exception"
> > >kernel messages or reset) when loaded with repeating runs of 1.3gb, CPU
> > >intensive with some I/O. I run 2 or 4 jobs simultaneously and they had
> > >never survived more than a few hours.
>
> Let's try the easy stuff first -- if it's crashing with a machine check
> exception, then let's disable machine check exceptions, and see if things
> still break.
>
> Try booting with the parameter "nomce". Be aware that mce is a mechanism
> for the processor to inform the kernel of thermal issues or component
> failure. You'll only want to disable this mechanism if you aren't having
> thermal problems.

I tried "nomce". The machine no longer halts with MCE kernel panic
messages on screen, but it still resets after 3-4 hours of work under 2 or
more jobs. As I wrote in a response to Robert's message, it seems to be a
memory issue, as there are no crashes with Kingston 1GB memory modules.
One of the machines and the memory went back to the dealer for tests.

> P.S. I came a little late to this party -- I didn't see the original message.
> Did you include the text of the kernel crash?

Below is the kernel message, OCR-ed from a digital photo of the screen :)
It is followed by the decoded message, as advised by the first message:

Fedora Core release 4 (Stentz)
kernel 2.6.16-1.2069_FC4smp on an x86_64

red10 login: HARDWARE ERROR
CPU 0: Machine Check Exception: 4 Bank 4: f604a00200000813
TSC 1504205a42ba ADDR 115e47828
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check

Call Trace: <#MC> {panic+133} (ffffffff801129eb){mcheck_timer+0}
       {do_machine_check+753} {machine_check+127}

------------------
mcelog --ascii output:

HARDWARE ERROR
CPU 0 BANK 4 TSC 1504205a42ba
MCG status:MCIP
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_ADDR register valid
Processor context corrupt
MCA:BUS Generic Originated-request Read Memory-access Request-timeout
Error Model:
STATUS f604a00200000813 MCGSTATUS 4
------------------

regards, Michal.

-- 
Michal Szymanski (msz at astrouw dot edu dot pl)
Warsaw University Observatory, Warszawa, POLAND
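
For anyone who wants to cross-check the mcelog output by hand, here is a minimal sketch that decodes the STATUS word f604a00200000813 from the report above. It assumes the standard x86 MCi_STATUS layout (flag bits 63 down to 57, MCA error code in the low 16 bits). The names STATUS_FLAGS and decode_status are mine, chosen for illustration, and the model-specific bits of this bank are deliberately not interpreted.

```python
# Sketch only: decode the architectural flag bits of an x86 MCi_STATUS value.
# Assumes the standard machine-check layout; model-specific bits are ignored.

STATUS_FLAGS = [
    (63, "VAL"),    # register contents are valid
    (62, "OVER"),   # error overflow (an earlier error was overwritten)
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]


def decode_status(status):
    """Return (set flag names, MCA error code, is-bus-error) for a status word."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    # Bus/interconnect errors use the pattern 0000 1xxx xxxx xxxx.
    is_bus = (mca_code & 0xF800) == 0x0800
    return flags, mca_code, is_bus


flags, mca_code, is_bus = decode_status(0xF604A00200000813)
print("flags:", " ".join(flags))
print("MCA error code: 0x%04x" % mca_code)
print("bus/interconnect error:", is_bus)
```

The flags that come out set (OVER, UC, EN, ADDRV, PCC) correspond to the "Error overflow", "Uncorrected error", "Error enabled", "MCi_ADDR register valid" and "Processor context corrupt" lines in the mcelog output, and the low 16 bits fall in the bus error class that mcelog reports as "MCA:BUS".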