From mboxrd@z Thu Jan 1 00:00:00 1970
From: ChristopherHuhn
Subject: Re: Kernel Bug at spinlock.h ?!
Date: Mon, 10 Mar 2003 09:52:04 +0100
Sender: linux-smp-owner@vger.kernel.org
Message-ID: <3E6C5234.8090505@GSI.de>
References: <3E637CDC.3030409@GSI.de> <3E64B0EA.4080109@GSI.de> <3E674A13.5020203@GSI.de>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To:
List-Id:
Content-Type: text/plain; charset="us-ascii"; format="flowed"
To: Zwane Mwaikambo
Cc: linux-smp, linux-kernel@vger.kernel.org, Walter Schoen, support-gsi@credativ.de

Zwane Mwaikambo wrote:

>On Thu, 6 Mar 2003, ChristopherHuhn wrote:
>
>>Hi again,
>>
>>>It looks like a possible race with rpc_execute and possibly the timer,
>>>although i can't be certain where the other cpus are. Do the other oopses
>>>look somewhat similar? Could you supply them?
>>
>>below are some oopses I gathered yesterday and today, all on different
>>machines.
>>I'd like to remark that we experience massive NFS problems at the moment
>>that seem to be caused by our mixed potato 2.2/ woody 2.4 environment,
>>i. e. linking apps on a woody system with the sources mounted via nfs
>>from a potato box leads to obscure IO failures like "no space left on
>>device" (This never happens with woody only). So this might be a clue
>>here as well.
>>
>>The oopses are all written down from the screen, I hopefully made little
>>"transmission" errors.
>
>Some of these are a bit worrying seeing as they are bit flips, also they
>all appear to come from a UP machine(?) this would change things with
>respect to my previous comment about races. Regarding weird io failures
>are you mounting with the 'soft' option?
>
>	Zwane

The machines are all DP Xeons; our SP machines run the same kernel, but
these oopses only occur on DP machines under heavy load.
The machines are recognized as SMP:

# uname -a
Linux lxb000 2.4.20 #2 SMP Tue Dec 17 10:43:29 CET 2002 i686 unknown

but the e7500 chipset seems not to be supported 100%:

Jan 27 15:26:34 lxb000 kernel: found SMP MP-table at 000f6710
Jan 27 15:26:34 lxb000 kernel: hm, page 000f6000 reserved twice.
Jan 27 15:26:34 lxb000 kernel: hm, page 000f7000 reserved twice.
Jan 27 15:26:34 lxb000 kernel: hm, page 0009f000 reserved twice.
Jan 27 15:26:34 lxb000 kernel: hm, page 000a0000 reserved twice.
Jan 27 15:26:34 lxb000 kernel: On node 0 totalpages: 262016
Jan 27 15:26:34 lxb000 kernel: zone(0): 4096 pages.
Jan 27 15:26:34 lxb000 kernel: zone(1): 225280 pages.
Jan 27 15:26:34 lxb000 kernel: zone(2): 32640 pages.
Jan 27 15:26:34 lxb000 kernel: ACPI: Searched entire block, no RSDP was found.
Jan 27 15:26:34 lxb000 kernel: ACPI: Searched entire block, no RSDP was found.
Jan 27 15:26:34 lxb000 kernel: ACPI: System description tables not found
Jan 27 15:26:34 lxb000 kernel: Intel MultiProcessor Specification v1.4
Jan 27 15:26:34 lxb000 kernel: Virtual Wire compatibility mode.
Jan 27 15:26:34 lxb000 kernel: OEM ID: Product ID: Kings Canyon APIC at: 0xFEE00000
Jan 27 15:26:34 lxb000 kernel: Processor #0 Pentium 4(tm) XEON(tm) APIC version 20
Jan 27 15:26:34 lxb000 kernel: Processor #6 Pentium 4(tm) XEON(tm) APIC version 20
Jan 27 15:26:34 lxb000 kernel: Processor #1 Pentium 4(tm) XEON(tm) APIC version 20
Jan 27 15:26:34 lxb000 kernel: Processor #7 Pentium 4(tm) XEON(tm) APIC version 20
Jan 27 15:26:34 lxb000 kernel: I/O APIC #2 Version 32 at 0xFEC00000.
Jan 27 15:26:34 lxb000 kernel: I/O APIC #3 Version 32 at 0xFEC80000.
Jan 27 15:26:34 lxb000 kernel: I/O APIC #4 Version 32 at 0xFEC80400.
Jan 27 15:26:34 lxb000 kernel: I/O APIC #5 Version 32 at 0xFEC81000.
Jan 27 15:26:34 lxb000 kernel: I/O APIC #8 Version 32 at 0xFEC81400.
Jan 27 15:26:34 lxb000 kernel: Processors: 4
...
There might be (are) severe flaws in our NFS configuration and network
performance, but that should not crash the box, should it?

BTW: I just received a link to a bug incl. fix that sounds similar to
our problem:
http://marc.theaimsgroup.com/?l=linux-nfs&m=104716581307294&w=2

With kind regards,
Christopher