* FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
From: Michal Szymanski @ 2006-04-18 19:11 UTC
To: SMP list

Hi all,

I have recently purchased three Supermicro AS1020A-T servers, each
equipped with two dual-core Opteron 280 processors, an H8DAR-T
motherboard, and 8 or 12 GB of RAM. The systems run FC4 x86_64 with a
proprietary driver (made by Adaptec) for the onboard Marvell 88SX6041
SATA controller. They use the original (install) kernel,
2.6.11-1.1369_FC4smp, which unfortunately cannot be upgraded because the
SATA driver is not available for other kernel versions.

All systems crash (either hang with some "machine check exception"
kernel messages or reset) when loaded with repeated runs of 1.3 GB jobs
that are CPU intensive with some I/O. I run 2 or 4 jobs simultaneously
and they have never survived more than a few hours.

Suspecting that the SATA driver might be the problem, I mounted /tmp as
"tmpfs" and repeated the tests entirely in /tmp (with plenty of RAM this
means (IMHO) doing I/O in memory). No success.

It is somewhat better when I run no-I/O jobs of similar size, but these
also crash, although less frequently.

I tried installing the i386 version; it also crashes. Same (or even
worse) with FC3.

Memtest does not show any RAM errors.

Finally, I did two tests which seem to exclude the SATA
controller/driver as the reason for the crashes:

1. I installed an additional IDE hard disk and put an FC4/x86_64 system
   on it (without the Adaptec driver, so the system does not even see
   the SATA disks), then updated the kernel to the latest (2.6.16). It
   also crashed.

2. I ran the non-SMP 2.6.11 kernel (with the Adaptec driver) on another
   machine. Two repeating 1.3 GB test jobs have been running on it (each
   getting 50% of the single CPU used by the system) for over 50 hours
   now, with no crashes. Also, a single test job running on the SMP
   kernel gave no crashes in 24 hours.

It seems there is a problem with the SMP kernel and dual-core Opterons,
at least on this hardware. I am stuck with three top-level machines
which can work only at 25% of nominal CPU power. Any hints would be
appreciated.

regards, Michal.

--
Michal Szymanski (msz at astrouw dot edu dot pl)
Warsaw University Observatory, Warszawa, POLAND
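For readers who want to reproduce this kind of load, here is a minimal sketch of a parallel "fill memory, then read it back" job. It is not the workload used in the message above (that program is not described); the function names `fill_and_verify` and `run_round`, the pattern scheme, and the sizes are all assumptions chosen for illustration. On genuinely faulty hardware the machine would more likely panic or reset than return `False`.

```python
"""Hypothetical load generator: several processes fill and re-check memory."""
import hashlib
from concurrent.futures import ProcessPoolExecutor


def fill_and_verify(size_bytes, seed):
    """Fill a buffer with a repeatable pattern and confirm it reads back intact."""
    # 64-byte pattern derived from the seed (illustrative choice, not from the thread).
    pattern = hashlib.sha512(str(seed).encode()).digest()
    repeats, tail = divmod(size_bytes, len(pattern))
    buf = bytearray(pattern * repeats + pattern[:tail])
    # Walk the buffer in pattern-sized chunks; this is the CPU-bound part.
    for offset in range(0, size_bytes, len(pattern)):
        chunk = buf[offset:offset + len(pattern)]
        if chunk != pattern[:len(chunk)]:
            return False
    return True


def run_round(jobs, size_bytes):
    """Run `jobs` copies of fill_and_verify in parallel, one process each."""
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(fill_and_verify, [size_bytes] * jobs, range(jobs)))
```

Calling something like `run_round(4, 1_300_000_000)` in a loop would approximate "4 simultaneous 1.3 GB jobs", assuming the machine has enough RAM for every worker at once.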
* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
From: Bill Davidsen @ 2006-05-05 14:00 UTC
To: Michal Szymanski; +Cc: SMP list

Michal Szymanski wrote:

>All systems crash (either hang with some "machine check exception"
>kernel messages or reset) when loaded with repeating runs of 1.3gb, CPU
>intensive with some I/O. I run 2 or 4 jobs simultaneously and they had
>never survived more than a few hours.
>...
>2. I ran non-SMP 2.6.11 kernel (with Adaptec driver) on another machine.
>There have been two test repeating 1.3g jobs running on it (each getting 50%
>of the single CPU used by the system) for over 50 hours now, no crashes.
>Also, a single test job running on SMP kernel gave no crashes in 24 hours.
>
>It seems there is a problem with SMP kernel and dual-core Opterons, at
>least on this hardware. I am stuck with three top-level machines which
>can work only at 25% of nominal cpu power. Any hints would be
>appreciated.

What happens if you use only one CPU? Either with a uni kernel (you
should have gotten one) or "maxcpus=1" in the boot commands. You are
running a custom kernel with custom drivers, so you really should be
asking the supplier; all we can do is suggest things which might provide
extra information.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
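After rebooting with "maxcpus=1", it is worth confirming that the option actually took effect before starting a long test. The following is a minimal sketch, assuming a Linux system where the boot command line is readable at /proc/cmdline; the helper names `boot_param` and `report` are made up for this example.

```python
"""Sketch: confirm a boot parameter such as maxcpus=1 is active."""
import os


def boot_param(cmdline, name):
    """Return the value of `name=value`, '' for a bare flag, or None if absent."""
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == name:
            return value
    return None


def report(cmdline_path="/proc/cmdline"):
    """Summarise the maxcpus setting and how many CPUs the OS reports."""
    with open(cmdline_path) as f:
        cmdline = f.read()
    return {
        "maxcpus": boot_param(cmdline, "maxcpus"),
        "cpus_visible": os.cpu_count(),
    }
```

If `report()` shows `maxcpus` as `"1"` but more than one visible CPU, the parameter was probably not picked up by the boot loader entry that was used.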
* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
From: Robert M. Hyatt @ 2006-05-05 15:18 UTC
To: Bill Davidsen; +Cc: Michal Szymanski, SMP list

One note. I am running on a quad 875 system, but am using Suse rather
than FC4. It is running perfectly reliably (this is a 4 CPU, dual-core,
2.2 GHz box, 8 processors total). I had problems with FC4 myself,
although it runs perfectly on my normal dual Xeon boxes...

Robert M. Hyatt, Ph.D.          Computer and Information Sciences
hyatt@uab.edu                   University of Alabama at Birmingham
(205) 934-2213                  136A Campbell Hall
(205) 934-5473 FAX              Birmingham, AL 35294-1170

On Fri, 5 May 2006, Bill Davidsen wrote:

> Michal Szymanski wrote:
>
>> ...
>> It seems there is a problem with SMP kernel and dual-core Opterons, at
>> least on this hardware. I am stuck with three top-level machines which
>> can work only at 25% of nominal cpu power. Any hints would be
>> appreciated.
>
> What happens if you use only one CPU? Either with a uni kernel (you should
> have gotten one) or "maxcpus=1" in the boot commands. You are running a
> custom kernel with custom drivers, so you really should be asking the
> supplier, all we can do is suggest things which might provide extra
> information.
* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
From: cerise @ 2006-05-05 15:28 UTC
To: linux-smp

Hi Robert:

That might be because SuSE's compiled kernel doesn't use MCE. If you can
look in the .config for the compiled kernel (or you can ask one of the
maintainers for SuSE... or you're fortunate enough to have a
/proc/config), I'd be curious whether it has MCE enabled (you'd be
looking for "CONFIG_X86_MCE=y"). That would nicely explain the
discrepancy. 8)

-Phil/CERisE

On Fri, May 05, 2006 at 10:18:36AM -0500, Robert M. Hyatt wrote:
> One note. I am running on a quad 875 system, but am using Suse rather
> than FC4. It is running perfectly reliable (this is a 4 cpu, dual-core,
> 2.2ghz box, 8 processors total). I had problems with FC4 myself,
> although it runs perfectly on my normal dual xeon boxes...
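The config check suggested above can be sketched as follows. This is only an illustration: the helper names are made up, and the two candidate paths are assumptions about where a distribution might expose the config (/proc/config.gz exists only when the kernel was built with CONFIG_IKCONFIG_PROC, and /boot/config-<release> is a common but not universal packaging convention).

```python
"""Sketch: look for CONFIG_X86_MCE=y in the running kernel's config."""
import gzip
import os


def mce_enabled(config_text):
    """True if the config text contains the exact line CONFIG_X86_MCE=y."""
    return any(line.strip() == "CONFIG_X86_MCE=y"
               for line in config_text.splitlines())


def read_kernel_config():
    """Return the config text from the first candidate file found, else None."""
    release = os.uname().release
    # Candidate locations are assumptions; adjust for the distribution at hand.
    candidates = ["/proc/config.gz", f"/boot/config-{release}"]
    for path in candidates:
        if not os.path.exists(path):
            continue
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as f:
            return f.read()
    return None
```

A typical use would be `text = read_kernel_config()` followed by `mce_enabled(text)` when `text` is not `None`; a `None` result only means neither assumed location exists, not that MCE is disabled.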
* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
From: Robert M. Hyatt @ 2006-05-05 16:31 UTC
To: cerise; +Cc: linux-smp

Sorry, not much help here. I've been running Red Hat forever and have
built a zillion kernels. But for Suse, which I am using out at the AMD
development center, I don't know a thing about where they put config
info. There is no /proc/config, so that's out. I could not find any
config file in the places I would look on my Red Hat systems. This is
Suse 10.0. If you can tell me where to look, I'll be happy to peek and
report back...

Robert M. Hyatt, Ph.D.          Computer and Information Sciences
hyatt@uab.edu                   University of Alabama at Birmingham
(205) 934-2213                  136A Campbell Hall
(205) 934-5473 FAX              Birmingham, AL 35294-1170

On Fri, 5 May 2006, cerise@armory.com wrote:

> Hi Robert:
>
> That might be because SuSE's compiled kernel doesn't use mce. If you can look
> in the .config for the compiled kernel (or you can ask one of the maintainers
> for SuSE...or you're fortunate enough to have a /proc/config), I'd be curious
> if it has MCE enabled (you'd be looking for "CONFIG_X86_MCE=y"). That would
> nicely explain the discrepancy. 8)
>
> -Phil/CERisE
* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
From: Michal Szymanski @ 2006-05-09 12:23 UTC
To: SMP list

On Fri, May 05, 2006 at 10:18:36AM -0500, Robert M. Hyatt wrote:
> One note. I am running on a quad 875 system, but am using Suse rather
> than FC4. It is running perfectly reliable (this is a 4 cpu, dual-core,
> 2.2ghz box, 8 processors total). I had problems with FC4 myself,
> although it runs perfectly on my normal dual xeon boxes...
>
> On Fri, 5 May 2006, Bill Davidsen wrote:
>
> >Michal Szymanski wrote:
> >
> >>All systems crash (either hang with some "machine check exception"
> >>kernel messages or reset) when loaded with repeating runs of 1.3gb, CPU
> >>intensive with some I/O. I run 2 or 4 jobs simultaneously and they had
> >>never survived more than a few hours.
> >>
> ...
> >>2. I ran non-SMP 2.6.11 kernel (with Adaptec driver) on another machine.
> >>There have been two test repeating 1.3g jobs running on it (each getting
> >>50%
> >>of the single CPU used by the system) for over 50 hours now, no crashes.
> >>Also, a single test job running on SMP kernel gave no crashes in 24 hours.
> >>
> >What happens if you use only one CPU? Either with a uni kernel (you should
> >have gotten one) or "maxcpus=1" in the boot commands. You are running a
> >custom kernel with custom drivers, so you really should be asking the
> >supplier, all we can do is suggest things which might provide extra
> >information.

Hi all,

I got 3 copies of Robert's message but none of Bill's :-)

Still, I don't quite understand Bill's question ("What happens if you
use only one CPU?"). The answer is quoted just above this question!
There were no crashes with the system running on the non-SMP kernel.

In the meantime I got Kingston 1GB modules from my dealer, for testing.
Strange as it seems, I could not crash the machine with Kingston memory,
even running tests for as long as 72 hours. It seems, then, that it is a
memory issue, although I do not understand why the same memory crashes
the machine in SMP and does not in non-SMP, under similar load. Also,
the Patriot 2GB memory modules (which seem to crash the machines) are on
Supermicro's list of memory recommended for the H8DAR-T mobo.

One of the machines went back to the dealer (actually to their memory
supplier) for tests. The memory guys seem not to trust our crashing
experience. We'll see what happens. I am afraid, however, that they will
say "the memory is OK".

regards, Michal.

--
Michal Szymanski (msz at astrouw dot edu dot pl)
Warsaw University Observatory, Warszawa, POLAND
* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
From: Bill Davidsen @ 2006-05-24 20:23 UTC
To: Michal Szymanski; +Cc: SMP list

Michal Szymanski wrote:

>I got 3 copies of Roberts' message but none of Bill's :-)
>
>Still, I don't quite understand Bill's question ("What happens if you
>use only one CPU?"). The answer is quoted just above this question!
>There were no crashes with the system running on non-SMP kernel.

It's a great answer, but not to my question. I wasn't asking what
happens with a different kernel, but what happens when you run the SMP
kernel and ==>use<== only one CPU by setting the max cpu to one. The uni
kernel doesn't have a lot of the code in an SMP kernel, so it haides a
lot of possible questions.

>In the meantime I got Kingston 1GB modules from my dealer, for testing.
>Strangely as it seems, I could not crash the machine with Kingston
>memory running tests as long as 72 hours. It seems, then, that it is a
>memory issue although I do not understand why the same memory crashes
>the machine in SMP and does not in non-SMP, under similar load. Also,
>the Patriot 2GB memory modules (which seem to crash the machines) are on
>the Supermicro's list of memory recommended for H8DAR-T mobo.
>
>One of the machines went back to the dealer (actually to their memory
>supplier) for tests. The memory guys seem not to trust our crashing
>experience. We'll see what happens. I am afraid, however, that they will
>say "the memory is OK".

The memory may be operating within spec, the timing setup in the BIOS
may be incorrect, etc, etc. Unfortunately it is possible to get a case
where everything is right but it doesn't work. Depending on the BIOS
capabilities, adding 0.05 V or 0.1 V to the memory voltage (can you do
that?) might solve the problem, or I guess make it worse.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
From: Bill Davidsen @ 2006-05-24 20:28 UTC
To: Bill Davidsen; +Cc: Michal Szymanski, SMP list

Bill Davidsen wrote:

> It's a great answer, but not to my question. I wasn't asking what
> happens with a different kernel, but what happens when you run the SMP
> kernel and ==>use<== only one CPU by setting the max cpu to one. The
> uni kernel doesn't have a lot of code in an SMP kernel, so it haides a
> lot of possible questions.

s/haides/hides/

Yes, I know my original question wasn't explicit about what I was
asking. It's just the first thing I would have tried, because I wouldn't
have that uni kernel around.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
From: cerise @ 2006-05-05 15:23 UTC
To: msz, linux-smp

> Michal Szymanski wrote:
>
> >All systems crash (either hang with some "machine check exception"
> >kernel messages or reset) when loaded with repeating runs of 1.3gb, CPU
> >intensive with some I/O. I run 2 or 4 jobs simultaneously and they had
> >never survived more than a few hours.

Let's try the easy stuff first -- if it's crashing with a machine check
exception, then let's disable machine check exceptions, and see if things
still break.

Try booting with the parameter "nomce". Be aware that mce is a mechanism
for the processor to inform the kernel of thermal issues or component
failure. You'll only want to disable this mechanism if you aren't having
thermal problems.

Of course, if you are having thermal problems, it's probably a good idea
to resolve those before cranking up the other 3/4s of your system. ; )

Hope that helps!

-Phil/CERisE

P.S. I came a little late to this party -- I didn't see the original message.
Did you include the text of the kernel crash?
* Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP
From: Michal Szymanski @ 2006-05-12 10:54 UTC
To: SMP list

On Fri, May 05, 2006 at 08:23:44AM -0700, cerise@armory.com wrote:
> Let's try the easy stuff first -- if it's crashing with a machine check
> exception, then let's disable machine check exceptions, and see if things
> still break.
>
> Try booting with the parameter "nomce". Be aware that mce is a mechanism
> for the processor to inform the kernel of thermal issues or component
> failure. You'll only want to disable this mechanism if you aren't having
> thermal problems.

I tried "nomce". The machine no longer halts with MCE kernel panic
messages onscreen; instead it resets after 3-4 hours of work under 2 or
more jobs. As I wrote in my response to Robert's message, it seems to be
a memory issue, as there are no crashes with Kingston 1GB memory
modules. One of the machines and the memory went back to the dealer for
tests.

> P.S. I came a little late to this party -- I didn't see the original message.
> Did you include the text of the kernel crash?

Below is the kernel message as OCR-ed from a digital photo of the screen :)
followed by the decoded message, as advised by the kernel message itself:

    Fedora Core release 4 (Stentz)
    kernel 2.6.16-1.2069_FC4smp on an x86_64

    red10 login: HARDWARE ERROR
    CPU 0: Machine Check Exception: 4 Bank 4: f604a00200000813
    TSC 1504205a42ba ADDR 115e47828
    This is not a software problem!
    Run through mcelog --ascii to decode and contact your hardware vendor
    Kernel panic - not syncing: Machine check

    Call Trace: <#MC> <ffffffff80134e6a>{panic+133} (ffffffff801129eb){mcheck_timer+0}
           <ffffffff801131fc>{do_machine_check+753} <ffffffff8010be43>{machine_check+127}
           <EOE>

------------------
mcelog --ascii output:

    HARDWARE ERROR
    CPU 0 BANK 4 TSC 1504205a42ba
    MCG status:MCIP
    MCi status:
    Error overflow
    Uncorrected error
    Error enabled
    MCi_ADDR register valid
    Processor context corrupt
    MCA:BUS Generic Originated-request Read Memory-access Request-timeout
    Error Model:
    STATUS f604a00200000813 MCGSTATUS 4
------------------

regards, Michal.

--
Michal Szymanski (msz at astrouw dot edu dot pl)
Warsaw University Observatory, Warszawa, POLAND
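The flag lines in the mcelog output can be cross-checked by hand against the raw STATUS value. The sketch below decodes only the high flag bits of the status register (bits 63 down to 57) and the low three bits of the global status, using the bit positions from the x86 machine-check architecture; it deliberately leaves the model-specific and low-order error-code fields alone. The table and function names are made up for this example, and it is not a replacement for mcelog.

```python
"""Sketch: decode the top flag bits of an x86 MCi_STATUS value."""

# (bit position, short name) for the architectural flag bits.
MCI_STATUS_FLAGS = [
    (63, "VAL"),    # register contents valid
    (62, "OVER"),   # error overflow
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

MCG_STATUS_FLAGS = [
    (0, "RIPV"),    # restart IP valid
    (1, "EIPV"),    # error IP valid
    (2, "MCIP"),    # machine check in progress
]


def decode_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name for bit, name in table if (value >> bit) & 1]


status = 0xF604A00200000813   # STATUS value from the report above
mcg_status = 0x4              # MCGSTATUS value from the report above

print(decode_flags(status, MCI_STATUS_FLAGS))
# -> ['VAL', 'OVER', 'UC', 'EN', 'ADDRV', 'PCC']
print(decode_flags(mcg_status, MCG_STATUS_FLAGS))
# -> ['MCIP']
```

The set flags line up with the mcelog text: OVER with "Error overflow", UC with "Uncorrected error", EN with "Error enabled", ADDRV with "MCi_ADDR register valid", PCC with "Processor context corrupt", and MCIP with "MCG status:MCIP".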