From: Bill Davidsen
Subject: Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
Date: Fri, 05 May 2006 10:00:42 -0400
Message-ID: <445B5A8A.3060106@tmr.com>
In-Reply-To: <20060418191102.GA15132@astrouw.edu.pl>
To: Michal Szymanski
Cc: SMP list

Michal Szymanski wrote:
>Hi all,
>
>I have recently purchased three Supermicro AS1020A-T servers, each
>equipped with two dual-core Opteron 280s, an H8DAR-T motherboard, and
>8 or 12 GB of RAM. The systems run FC4 x86_64 with a proprietary driver
>(made by Adaptec) for the onboard Marvell 88SX6041 SATA controller. The
>original (install) kernel is 2.6.11-1.1369_FC4smp, which unfortunately
>cannot be upgraded because the SATA driver is not available for other
>kernel versions.
>
>All systems crash (either hang with some "machine check exception"
>kernel messages or reset) when loaded with repeated runs of 1.3gb,
>CPU-intensive jobs with some I/O. I run 2 or 4 jobs simultaneously and
>they have never survived more than a few hours.
>
>Suspecting it might be a SATA driver problem, I mounted /tmp as "tmpfs"
>and repeated the tests entirely in /tmp (with plenty of RAM this means
>(IMHO) doing the I/O in memory). No success.
>
>It is somewhat better when I run similar-size jobs with no I/O, but
>these also crash, although less frequently.
>
>I tried installing the i386 version; it also crashes. Same (or even
>worse) with FC3.
>
>Memtest does not show any RAM errors.
>
>Finally, I did two tests which seem to exclude the SATA controller/driver
>as the cause of the crashes:
>
>1. I installed an additional IDE hard disk and put an FC4/x86_64 system
>on it (without the Adaptec driver, so the system does not even see the
>SATA disks), then updated the kernel to the latest (2.6.16). It also
>crashed.
>
>2. I ran a non-SMP 2.6.11 kernel (with the Adaptec driver) on another
>machine. Two repeating 1.3g test jobs have been running on it (each
>getting 50% of the single CPU used by the system) for over 50 hours now,
>with no crashes. Also, a single test job running on the SMP kernel gave
>no crashes in 24 hours.
>
>It seems there is a problem with the SMP kernel and dual-core Opterons,
>at least on this hardware. I am stuck with three top-level machines which
>can work at only 25% of their nominal CPU power. Any hints would be
>appreciated.

What happens if you use only one CPU? Either boot a uniprocessor kernel
(you should have gotten one) or add "maxcpus=1" to the boot command line.

You are running a custom kernel with custom drivers, so you really should
be asking the supplier; all we can do is suggest things which might
provide extra information.

-- 
bill davidsen
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979