From mboxrd@z Thu Jan 1 00:00:00 1970
From: Pau Montero Parés
Subject: Athlon SMP troubles under high load.
Date: Wed, 14 Aug 2002 16:22:52 +0200
Sender: linux-smp-owner@vger.kernel.org
Message-ID: <3D5A67BC.6050802@imente.com>
Mime-Version: 1.0
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
List-Id:
Content-Type: text/plain; charset="iso-8859-1";
To: linux-smp@vger.kernel.org

I wrote to the LKML before, but I got no answer about this problem. I don't know if this is the appropriate forum to ask, but I'm a bit desperate!

----------

[1.] One line summary of the problem:

Hangs on dual Athlon system under heavy load.

[2.] Full description of the problem/report:

The system hangs only under heavy load or during shutdown. It still happens when booting with noapic and mem=nopentium, and after removing the network card or the ATI graphics card. I can't remove the Adaptec 2100S RAID controller, though, because the machine needs it to boot. The system works fine with a kernel compiled with support for only one CPU. The temperature seems fine, at around 55C.

[3.] Keywords (i.e., modules, networking, kernel):

With or without networking, and with or without modules.

[4.] Kernel version (from /proc/version):

Linux version 2.4.19 (pau@lorien) (gcc version 2.95.4 20011002 (Debian prerelease)) #6 SMP

It also happens with the SuSE 7.3, 7.2 and 8 kernels (2.4.16 and 2.4.18).

[5.] Output of Oops..
message (if applicable) with symbolic information

Aug 6 13:53:29 lorien kernel: Unable to handle kernel paging request at virtual address 00009000
Aug 6 13:53:29 lorien kernel: printing eip:
Aug 6 13:53:29 lorien kernel: 00009000
Aug 6 13:53:29 lorien kernel: *pde = 04bde001
Aug 6 13:53:29 lorien kernel: Oops: 0000
Aug 6 13:53:29 lorien kernel: CPU: 1
Aug 6 13:53:29 lorien kernel: EIP: 0010:[<00009000>] Not tainted
Aug 6 13:53:29 lorien kernel: EFLAGS: 00010246
Aug 6 13:53:29 lorien kernel: eax: c4854cc0 ebx: c42ef1c0 ecx: c426d7c0 edx: 00000000
Aug 6 13:53:29 lorien kernel: esi: c4853ec0 edi: 080ed000 ebp: 00009000 esp: c4bbfe94
Aug 6 13:53:29 lorien kernel: ds: 0018 es: 0018 ss: 0018
Aug 6 13:53:29 lorien kernel: Process perl (pid: 392, stackpage=c4bbf000)
Aug 6 13:53:29 lorien kernel: Stack: c0331020 c3b21ac8 48048000 000a5000 fffd3768 3b61d025 00000000 fffd3240
Aug 6 13:53:29 lorien kernel:        00000062 000a5000 000a5000 00000062 080ed000 00000286 c426d7c0 c4853ec0
Aug 6 13:53:29 lorien kernel:        08048000 c426d7c0 c012e59f c012e579 c42ef1c0 c4853ec0 c4bbe000 c4bbe000
Aug 6 13:53:29 lorien kernel: Call Trace: [] [] [] [] []
Aug 6 13:53:29 lorien kernel:    [] [] [] [] []
Aug 6 13:53:29 lorien kernel:
Aug 6 13:53:29 lorien kernel: Code: Bad EIP value.

[6.] A small shell script or example program which triggers the problem (if possible)

The only way I have found to hang the machine quickly is with something like this:

    #!/usr/bin/perl
    # Busy loop: $a never changes, so rand() is called until the process dies.
    my $a = 0;
    while ($a == 0) { rand(); }

It usually dies with a segmentation fault, and the machine hangs a few seconds later. The machine can also freeze while the script is still running. The script runs for anywhere from 10 seconds to 30 minutes before this happens.

[7.] Environment

[7.1.] Software (add the output of the ver_linux script here)

Linux lorien 2.4.19 #6 SMP lun ago 12 19:08:58 CEST 2002 i686 unknown

gcc version 2.95.4 20011002 (Debian prerelease)
GNU Make version 3.79.1
util-linux 2.11n-4
ldd (GNU libc) 2.2.5
procps 2.0.7-8

[7.2.]
Processor information (from /proc/cpuinfo):

processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) Processor
stepping : 4
cpu MHz : 1400.071
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 2791.83

processor : 1
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) Processor
stepping : 4
cpu MHz : 1400.071
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 2798.38

[7.3.] Module information (from /proc/modules):

none

[7.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)

/proc/ioports

0000-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0070-007f : rtc
0080-008f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : fpu
03c0-03df : vga+
0cf8-0cff : PCI conf1
1000-107f : PCI device 10b7:7646
  1000-107f : 00:0c.0
1090-1093 : PCI device 1022:700c
2000-2fff : PCI Bus #01
  2000-20ff : PCI device 1002:5046
f000-f00f : PCI device 1022:7411

/proc/iomem

00000000-0009ebff : System RAM
0009ec00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000f0000-000fffff : System ROM
00100000-7fffffff : System RAM
  00100000-001e32eb : Kernel code
  001e32ec-002139df : Kernel data
f0000000-f000007f : PCI device 10b7:7646
f0001000-f0001fff : PCI device 1022:700c
f0100000-f01fffff : PCI Bus #01
  f0100000-f0103fff : PCI device 1002:5046
f2000000-f3ffffff : PCI device 1044:a501
f4000000-f7ffffff : PCI device 1022:700c
f8000000-fbffffff : PCI Bus #01
  f8000000-fbffffff : PCI device 1002:5046
fec00000-fec0ffff : reserved
fee00000-fee00fff
: reserved
fff80000-ffffffff : reserved

[7.5.] PCI information ('lspci -vvv' as root)

00:00.0 Host bridge: Advanced Micro Devices [AMD]: Unknown device 700c (rev 11)
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
    Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR-

00:01.0 PCI bridge: Advanced Micro Devices [AMD]: Unknown device 700d (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
    Status: Cap- 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- Reset- FastB2B-

00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-765 [Viper] ISA (rev 02)
    Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
    Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- [disabled] [size=128K]
    Capabilities: [dc] Power Management version 1
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
        Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:0d.0 PCI bridge: Distributed Processing Technology PCI Bridge (rev 02) (prog-if 00 [Normal decode])
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
    Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- Reset- FastB2B-
    Capabilities: [68] Power Management version 2
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:0d.1 I2O: Distributed Processing Technology SmartRAID V Controller (rev 02) (prog-if 01)
    Subsystem: Distributed Processing Technology: Unknown device c03c
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
    Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- [disabled] [size=32K]
    Capabilities: [80] Power Management
version 2
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 PME-Enable- DSel=0 DScale=0 PME-

01:05.0 VGA compatible controller: ATI Technologies Inc Rage 128 PF (prog-if 00 [VGA])
    Subsystem: ATI Technologies Inc: Unknown device 0008
    Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B+
    Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- [disabled] [size=128K]
    Capabilities: [50] AGP version 2.0
        Status: RQ=31 SBA+ 64bit- FW- Rate=x1,x2
        Command: RQ=0 SBA+ AGP- 64bit- FW- Rate=
    Capabilities: [5c] Power Management version 2
        Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 PME-Enable- DSel=0 DScale=0 PME-

[7.6.] SCSI information (from /proc/scsi/scsi)

Attached devices:
Host: scsi0 Channel: 00 Id: 02 Lun: 00
  Vendor: ADAPTEC Model: RAID-5 Rev: 370F
  Type: Direct-Access ANSI SCSI revision: 02

[7.7.] Other information that might be relevant to the problem

/proc/interrupts

           CPU0       CPU1
  0:     143027          0    XT-PIC  timer
  1:       5401          0    XT-PIC  keyboard
  2:          0          0    XT-PIC  cascade
  5:       1715          0    XT-PIC  eth0
  8:          3          0    XT-PIC  rtc
 11:       5744          0    XT-PIC  dpti0
NMI:          0          0
LOC:     142944     143078
ERR:          8
MIS:          0

More logs that I can't understand:

Aug 6 13:48:13 lorien kernel: IO-APIC (apicid-pin) 2-0, 2-16, 2-17, 2-18, 2-19, 2-20, 2-21, 2-22, 2-23 not connected.
Aug 6 13:48:14 lorien kernel: : booting with the "noapic" option.

----

Aug 6 13:48:13 lorien kernel: mtrr: detected mtrr type: Intel
Aug 6 13:48:14 lorien kernel: mtrr: your CPUs had inconsistent fixed MTRR settings
Aug 6 13:48:14 lorien kernel: mtrr: probably your BIOS does not setup all CPUs

The system hangs with both the 1.1 and 1.4 SMP specification settings, and with or without the MP interrupt table.

Mainboard: Tyan 2460 with registered ECC memory.

The board also can't reboot normally, but I think that is a BIOS issue; I should update it.

Good luck!

Pau Montero Parés.
http://pau.no-ip.com