* Re: GART Error 11
       [not found] ` <22ELe-oP-47@gated-at.bofh.it>
@ 2004-06-02 18:00   ` Andi Kleen
  2004-06-02 18:35     ` Arthur Perry
  0 siblings, 1 reply; 13+ messages in thread
From: Andi Kleen @ 2004-06-02 18:00 UTC (permalink / raw)
  To: Arthur Perry; +Cc: linux-kernel

Arthur Perry <kernel@linuxfarms.com> writes:

> Hello,
>
> Oops. Sorry I have made a mistake in all of my statements below.
> It was after 5pm yesterday, and it was a long day...
> It's not offset 0x44 that we are interested in.
> My listings were at offset 0x48, which is MCA NB Status Low Register.
> Sorry, did not mean to confuse anybody.

I would recommend just reading the MC* MSRs via /dev/msr.
Accessing the northbridge directly for MCE information has various
quirks and I removed that completely in the 2.6 driver.
They contain the same information.

-Andi

^ permalink raw reply [flat|nested] 13+ messages in thread
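Andi's suggestion to read the MC* MSRs through the msr driver could look roughly like the sketch below. It is not taken from the thread. It assumes the Linux msr driver is loaded and exposes per-CPU nodes at /dev/cpu/N/msr, that the caller is root, and that the northbridge is machine-check bank 4 on K8, so its status register is MC4_STATUS (MSR 0x411). The helper names `msr_path` and `read_msr` are hypothetical.

```python
import os
import struct

# Assumption: on K8 the northbridge is machine-check bank 4, and the
# architectural MCi_STATUS MSR sits at 0x401 + 4*i, which gives 0x411.
MC4_STATUS = 0x401 + 4 * 4


def msr_path(cpu):
    """Path of the msr character device for one CPU (Linux msr driver)."""
    return "/dev/cpu/%d/msr" % cpu


def read_msr(cpu, msr, path=None):
    """Read one 64-bit MSR; the MSR number is used as the file offset.

    `path` overrides the device node, which is handy for dry runs
    against an ordinary file.
    """
    fd = os.open(path or msr_path(cpu), os.O_RDONLY)
    try:
        data = os.pread(fd, 8, msr)
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError("short read from MSR device")
    # MSR contents are returned as a little-endian 64-bit value.
    return struct.unpack("<Q", data)[0]
```

Run as root on the affected machine, `read_msr(0, MC4_STATUS)` would return the raw status word for the first CPU's northbridge; repeating it per CPU covers every node.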
* Re: GART Error 11
  2004-06-02 18:00   ` GART Error 11 Andi Kleen
@ 2004-06-02 18:35     ` Arthur Perry
  2004-06-02 19:48       ` Andi Kleen
  0 siblings, 1 reply; 13+ messages in thread
From: Arthur Perry @ 2004-06-02 18:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

Thanks Andi!
I did not realize there were quirks associated with reading this right from PCI config space.

Perhaps someone can tell me this:
Does anybody know if there is any documented information about the differences between agp driver version 0.99 and 0.100?
I know I can just read the source, but there must be a list of known bugs and what has been addressed by the newer version, right?

The reason why I ask is that both RedHat and SuSE are still using the 0.99 agp driver.
RedHat Enterprise 3.0's 2.4.21-9.0.1EL kernel and SuSE's 2.4.19 kernel have this in common, and I am seeing such GART errors only with their kernels.
The mainline kernel 2.4.27-pre4 using gart 0.100 does not produce this failure condition.

Please let me know if I am going in the wrong direction, but I am going to patch RedHat's kernel with the agp 0.100 driver and see if the problem does indeed go away.
I'll do the same with SuSE.
If this is the case, then I have found the root cause of this particular problem, and I can then address it to the specific distributors.

Thanks!

Best Regards,
Arthur Perry

On Wed, 2 Jun 2004, Andi Kleen wrote:

> Arthur Perry <kernel@linuxfarms.com> writes:
>
> > Hello,
> >
> > Oops. Sorry I have made a mistake in all of my statements below.
> > It was after 5pm yesterday, and it was a long day...
> > It's not offset 0x44 that we are interested in.
> > My listings were at offset 0x48, which is MCA NB Status Low Register.
> > Sorry, did not mean to confuse anybody.
>
> I would recommend just reading the MC* MSRs via /dev/msr.
> Accessing the northbridge directly for MCE information has various
> quirks and I removed that completely in the 2.6 driver.
> They contain the same information.
>
> -Andi
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: GART Error 11
  2004-06-02 18:35     ` Arthur Perry
@ 2004-06-02 19:48       ` Andi Kleen
  0 siblings, 0 replies; 13+ messages in thread
From: Andi Kleen @ 2004-06-02 19:48 UTC (permalink / raw)
  To: Arthur Perry; +Cc: Andi Kleen, linux-kernel

On Wed, Jun 02, 2004 at 02:35:33PM -0400, Arthur Perry wrote:
> Thanks Andi!
> I did not realize there were quirks associated with reading this right from PCI config space.
>
> Perhaps someone can tell me this:
> Does anybody know if there is any documented information about the differences between agp driver version 0.99 and 0.100?
> I know I can just read the source, but there must be a list of known bugs and what has been addressed by the newer version, right?

You can read the BitKeeper logs at http://linux.bkbits.net for the file
in question. Version numbers for kernel subsystems are often meaningless
because there are lots of changes without version number changes.

> The reason why I ask is that both RedHat and SuSE are still using the 0.99 agp driver.
> RedHat Enterprise 3.0's 2.4.21-9.0.1EL kernel and SuSE's 2.4.19 kernel have this in common, and I am seeing such GART errors only with their kernels.

Don't use the 2.4.19 kernel, use the SLES8-SP3 kernel. It will likely
fix this; there was a fix in this area, which also got into 2.4 mainline.

I don't know about RH kernels.

-Andi

^ permalink raw reply [flat|nested] 13+ messages in thread
[parent not found: <Pine.LNX.4.44.0406011436530.4341-200000@eliassen.atmos.colostate.edu>]
* Re: GART Error 11
       [not found] <Pine.LNX.4.44.0406011436530.4341-200000@eliassen.atmos.colostate.edu>
@ 2004-06-01 21:48 ` Arthur Perry
  2004-06-01 21:55   ` Arthur Perry
  2004-06-01 23:10   ` Saurabh Barve
  0 siblings, 2 replies; 13+ messages in thread
From: Arthur Perry @ 2004-06-01 21:48 UTC (permalink / raw)
  To: Saurabh Barve; +Cc: Red Hat AMD64 Mailing List, linux-kernel

Hi Saurabh,

I am working on this issue as we speak.
It is interesting that your machine crashes entirely with iommu disabled.

I am starting to think there is more to this than just the kernel
misreporting other hardware errors (being improperly decoded as GART
errors).
On my machine, I am actually getting GART errors on 3 out of 4 CPUs when I
use RedHat's 2.4.21-9EL kernel. This same kernel when rebuilt from source,
however, will not produce GART errors when built without AGP support.

Here is my Extended error code (bits 19-16) on 0:[18,19,1b]:3 at offset 0x44:
0101 = GART error

So, this is not a translation issue on my side.

Can you do this for me?

pcitweak -r 0:18:3 0x44
and
pcitweak -r 0:19:3 0x44


Thanks!


Arthur Perry
Lead Linux Developer / Linux Systems Architect
Validation, CSU Celestica
Sair/Linux Gnu Certified Professional
Providing professional Linux solutions for 7+ years

On Tue, 1 Jun 2004, Saurabh Barve wrote:

> Hi,
>
> I know this has been posted before on this list, but the solution
> suggested does not seem to work for me.
>
> I have a dual Opteron system with 8 GB of RAM. I am running RHEL 3.0 AS on
> it. The kernel version is 2.4.21-4.ELsmp. The motherboard I am using is
> the Tyan Thunder K8S Pro - 2882 motherboard.
>
> I am getting the following error every two minutes or so:
>
> GART error 11
> Lost an northbridge error
> NB error address some-hex-number
> Error uncorrected
>
> I checked the various postings on the list, and someone suggested that
> passing the iommu=off option to the kernel solved the problem for him.
> However, when I tried that, it caused the kernel to panic. I read somewhere
> that a newer kernel would fix these 'bugs' in the default RHEL kernel.
> However, I am using the onboard SATA controller for my hard disks. This
> requires binary drivers from Tyan. I already downloaded a newer kernel;
> however, it breaks the drivers, so I can't boot into the new kernel.
>
> Here is my output from lspci:
>
> 00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
> 00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
> 00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
> 00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02)
> 00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
> 00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
> 00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
> 00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
> 00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
> 00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 02:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
> 02:09.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
> 03:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
> 03:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
> 03:05.0 Unknown mass storage controller: Silicon Image, Inc. (formerly CMD Technology Inc) Silicon Image SiI 3114 SATARaid Controller (rev 02)
> 03:06.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
> 03:08.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 10)
>
> The dmesg output was too large to include inline, so I am attaching it as
> a text file.
>
> I tried passing the following options to the kernel:
>
> iommu=noagp
> iommu=noforce
> iommu=off (results in kernel-panic)
> mce=off
> mce=0
>
> I tried all the above in various combinations, but none of them worked.
> The machine doesn't crash, and everything else seems to work fine, but I'd
> like to get rid of these errors.
>
> There are some snippets from the dmesg output that I found to be of
> interest:
>
> ------------------------------------------------------------
> Linux agpgart interface v0.99 (c) Jeff Hartmann
> agpgart: Maximum main memory to use for agp memory: 7956M
> agpgart: no supported devices found.
> PCI-DMA: Disabling AGP.
> PCI-DMA: aperture base @ 10000000 size 65536 KB
> PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
> -----------------------------------------------------------
>
> -----------------------------------------------------------
>
> GART error 11
> Lost an northbridge error
> NB error address 00000000fbfe4398
> Error uncorrected
> Northbridge status a40000000005001b
>
> ----------------------------------------------------------
>
>
> Any suggestions?
>
> Thanks,
> Saurabh.
>

^ permalink raw reply [flat|nested] 13+ messages in thread
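The "Northbridge status" word in the quoted dmesg excerpt can be picked apart by hand. The following is a rough sketch, not the kernel's decoder. It assumes the standard x86 machine-check status layout (VAL bit 63, OVER bit 62, UC bit 61, ADDRV bit 58) and takes the extended error code field (bits 19-16) and the "0101 = GART error" meaning from Arthur's message above. `decode_nb_status` is a hypothetical helper name.

```python
def decode_nb_status(status):
    """Split a 64-bit northbridge MCA status word into a few fields."""
    return {
        "valid": bool((status >> 63) & 1),        # VAL: register holds an error
        "overflow": bool((status >> 62) & 1),     # OVER: an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),  # UC: hardware did not correct it
        "addr_valid": bool((status >> 58) & 1),   # ADDRV: address register is valid
        "ext_error_code": (status >> 16) & 0xF,   # bits 19-16
        "error_code": status & 0xFFFF,            # bits 15-0
    }


# Status value copied from the dmesg excerpt in Saurabh's post.
fields = decode_nb_status(0xa40000000005001b)
for name, value in fields.items():
    print(name, value)
```

For this value the extended error code comes out as 5 (binary 0101) and the uncorrected bit is set, which is consistent with the "GART error" and "Error uncorrected" lines in the log.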
* Re: GART Error 11
  2004-06-01 21:48 ` Arthur Perry
@ 2004-06-01 21:55   ` Arthur Perry
  2004-06-01 22:52     ` Saurabh Barve
  2004-06-01 23:10   ` Saurabh Barve
  1 sibling, 1 reply; 13+ messages in thread
From: Arthur Perry @ 2004-06-01 21:55 UTC (permalink / raw)
  To: Saurabh Barve; +Cc: linux-kernel, Red Hat AMD64 Mailing List

Hi Saurabh,

I almost forgot.
Can you also tell me which AMD CPUs you are using?
Preferably by number if you know (starts with OSA I believe), or at least
the CPU speed.
Thanks!

Arthur Perry
Lead Linux Developer / Linux Systems Architect
Validation, CSU Celestica
Sair/Linux Gnu Certified Professional
Providing professional Linux solutions for 7+ years

On Tue, 1 Jun 2004, Arthur Perry wrote:

> Hi Saurabh,
>
> I am working on this issue as we speak.
> It is interesting that your machine crashes entirely with iommu disabled.
>
> I am starting to think there is more to this than just the kernel
> misreporting other hardware errors (being improperly decoded as GART
> errors).
> On my machine, I am actually getting GART errors on 3 out of 4 CPUs when I
> use RedHat's 2.4.21-9EL kernel. This same kernel when rebuilt from source,
> however, will not produce GART errors when built without AGP support.
>
> Here is my Extended error code (bits 19-16) on 0:[18,19,1b]:3 at offset 0x44:
> 0101 = GART error
>
> So, this is not a translation issue on my side.
>
> Can you do this for me?
>
> pcitweak -r 0:18:3 0x44
> and
> pcitweak -r 0:19:3 0x44
>
>
> Thanks!
>
>
> Arthur Perry
> Lead Linux Developer / Linux Systems Architect
> Validation, CSU Celestica
> Sair/Linux Gnu Certified Professional
> Providing professional Linux solutions for 7+ years
>
> --
> amd64-list mailing list
> amd64-list@redhat.com
> https://www.redhat.com/mailman/listinfo/amd64-list
>

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: GART Error 11
  2004-06-01 21:55   ` Arthur Perry
@ 2004-06-01 22:52     ` Saurabh Barve
  0 siblings, 0 replies; 13+ messages in thread
From: Saurabh Barve @ 2004-06-01 22:52 UTC (permalink / raw)
  To: Arthur Perry; +Cc: linux-kernel, Red Hat AMD64 Mailing List

Arthur,

I list all the information I have right off the bat:

AMD Opteron Model 246, 1 MB L2 Cache 64-bit processor

Model               : AMD Opteron Model 246
Core                : Hammer
Operating Frequency : 2 GHz
Cache               : L1/128K, L2/1024K
Socket              : Socket 940

Is that info enough?

I just remembered looking at /var/log/dmesg again. There was a line that
said that IOMMU was not enabled in my BIOS, and that I should enable it.
However, I can't see any option in my BIOS for enabling/disabling IOMMU.

Thanks,
Saurabh.

On Tue, 1 Jun 2004, Arthur Perry wrote:

> Hi Saurabh,
>
> I almost forgot.
> Can you also tell me which AMD CPUs you are using?
> Preferably by number if you know (starts with OSA I believe), or at least
> the CPU speed.
> Thanks!
>
> Arthur Perry
> Lead Linux Developer / Linux Systems Architect
> Validation, CSU Celestica
> Sair/Linux Gnu Certified Professional
> Providing professional Linux solutions for 7+ years
>

--
===============================================================================
Saurabh Barve                             Phone:
System Administrator/Data Specialist      970-491-7714 (voice)
Montgomery Research Group,                970-491-8449 (Fax)
Atmospheric Sciences Department,
Fort Collins, Colorado
Colorado State University

Mail : sa@atmos.colostate.edu
Web : http://fjortoft.atmos.colostate.edu/~sa

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: GART Error 11
  2004-06-01 21:48 ` Arthur Perry
  2004-06-01 21:55   ` Arthur Perry
@ 2004-06-01 23:10   ` Saurabh Barve
  2004-06-02 14:16     ` Arthur Perry
  1 sibling, 1 reply; 13+ messages in thread
From: Saurabh Barve @ 2004-06-01 23:10 UTC (permalink / raw)
  To: Arthur Perry; +Cc: Red Hat AMD64 Mailing List, linux-kernel

Arthur,

Here are the results that I got:

> Can you do this for me?
>
> pcitweak -r 0:18:3 0x44

0x02400040

> and
> pcitweak -r 0:19:3 0x44

0x02400040

Hope this helps,
Saurabh.

--
===============================================================================
Saurabh Barve                             Phone:
System Administrator/Data Specialist      970-491-7714 (voice)
Montgomery Research Group,                970-491-8449 (Fax)
Atmospheric Sciences Department,
Fort Collins, Colorado
Colorado State University

Mail : sa@atmos.colostate.edu
Web : http://fjortoft.atmos.colostate.edu/~sa

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: GART Error 11
  2004-06-01 23:10   ` Saurabh Barve
@ 2004-06-02 14:16     ` Arthur Perry
  2004-06-02 17:18       ` Saurabh Barve
  0 siblings, 1 reply; 13+ messages in thread
From: Arthur Perry @ 2004-06-02 14:16 UTC (permalink / raw)
  To: Saurabh Barve; +Cc: Red Hat AMD64 Mailing List, linux-kernel

Hello,

Oops. Sorry I have made a mistake in all of my statements below.
It was after 5pm yesterday, and it was a long day...
It's not offset 0x44 that we are interested in.
My listings were at offset 0x48, which is MCA NB Status Low Register.
Sorry, did not mean to confuse anybody.

So Saurabh, can you please do this again with the corrected lines?

pcitweak -r 0:18:3 0x48
and
pcitweak -r 0:19:3 0x48

While you are at it, can you send us status high as well?

pcitweak -r 0:18:3 0x4c
and
pcitweak -r 0:19:3 0x4c

Thanks, and sorry about the confusion.

Arthur Perry

On Tue, 1 Jun 2004, Saurabh Barve wrote:

> Arthur,
>
> Here are the results that I got
>
> > Can you do this for me?
> >
> > pcitweak -r 0:18:3 0x44
>
> 0x02400040
>
> > and
> > pcitweak -r 0:19:3 0x44
>
> 0x02400040
>
> Hope this helps,
> Saurabh.
>

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: GART Error 11 2004-06-02 14:16 ` Arthur Perry @ 2004-06-02 17:18 ` Saurabh Barve 2004-06-02 18:39 ` Arthur Perry 0 siblings, 1 reply; 13+ messages in thread From: Saurabh Barve @ 2004-06-02 17:18 UTC (permalink / raw) To: Arthur Perry; +Cc: Red Hat AMD64 Mailing List, linux-kernel Sorry about the delay in my reply. Just got in to work! Here is the output: > pcitweak -r 0:18:3 0x48 0x0005001B > and > pcitweak -r 0:19:3 0x48 0x00000000 > While you are at it, can you send us status high as well? > > pcitweak -r 0:18:3 0x4c 0xA4000000 > and > pcitweak -r 0:19:3 0x4c 0x00000000 I don't know if this would help, but below is a part of my cronwatch log: --------------------- Init Begin ------------------------ **Unmatched Entries** Trying to re-exec init Trying to re-exec init ---------------------- Init End ------------------------- --------------------- Kernel Begin ------------------------ WARNING: Kernel Errors Present uteval-0098: *** Error: Method executio...: 4Time(s) psparse-1121: *** Error: Method executio...: 8Time(s) Error uncorrected...: 538Time(s) GART error 11...: 538Time(s) Lost an northbridge error...: 538Time(s) NB error address 00000000...: 538Time(s) ---------------------- Kernel End ------------------------- --------------------- ModProbe Begin ------------------------ Can't locate these modules: char-major-10-134: 4 Time(s) sound-service-0-3: 6 Time(s) xp0: 3 Time(s) sound-slot-0: 6 Time(s) char-major-188: 15 Time(s) ---------------------- ModProbe End ------------------------- Thanks, Saurabh. -- =============================================================================== Saurabh Barve Phone: System Administrator/Data Specialist 970-491-7714 (voice) Montgomery Research Group, 970-491-8449 (Fax) Atmospheric Sciences Department, Fort Collins, Colorado Colorado State University Mail : sa@atmos.colostate.edu Web : http://fjortoft.atmos.colostate.edu/~sa ^ permalink raw reply [flat|nested] 13+ messages in thread
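The two register reads above can be combined into the single 64-bit value that the kernel prints as "Northbridge status". The following is a minimal sketch, not taken from the thread: the function name `decode_nb_status` is hypothetical, and the flag bit positions are an assumption based on the generic AMD64 machine-check status layout rather than anything stated by the posters.

```python
# Values reported in the thread for node 0 (pcitweak -r 0:18:3 ...).
STATUS_LOW = 0x0005001B   # offset 0x48, MCA NB Status Low
STATUS_HIGH = 0xA4000000  # offset 0x4c, MCA NB Status High

# ASSUMPTION: flag positions follow the generic AMD64 MCi_STATUS layout.
FLAGS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
         "MISCV": 59, "ADDRV": 58, "PCC": 57}


def decode_nb_status(high, low):
    """Combine the high/low halves and split out the commonly used fields."""
    status = (high << 32) | low
    return {
        "status": status,
        "flags": {name: (status >> bit) & 1 for name, bit in FLAGS.items()},
        "ext_error_code": (low >> 16) & 0xF,  # bits 19:16 of the low word
        "error_code": low & 0xFFFF,           # bits 15:0 of the low word
    }


decoded = decode_nb_status(STATUS_HIGH, STATUS_LOW)
print(f"{decoded['status']:016x}")  # a40000000005001b
print(decoded["flags"])
print(decoded["ext_error_code"], hex(decoded["error_code"]))
```

The combined value matches the "Northbridge status a40000000005001b" line from the original dmesg snippet, and under the assumed layout the UC bit is set, which agrees with "Error uncorrected". The extended error code field comes out as 5, which, if the K8 documentation is read correctly, is the GART error case.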
* Re: GART Error 11 2004-06-02 17:18 ` Saurabh Barve @ 2004-06-02 18:39 ` Arthur Perry 2004-06-02 18:44 ` Arthur Perry 2004-06-02 18:50 ` Saurabh Barve 0 siblings, 2 replies; 13+ messages in thread From: Arthur Perry @ 2004-06-02 18:39 UTC (permalink / raw) To: Saurabh Barve; +Cc: Red Hat AMD64 Mailing List, linux-kernel Hi Saurabh, Thanks. It looks like you also have true GART errors as reported by hardware, on CPU0. So our common failure mode here is actual GART errors and not something else being reported as a GART error because of erroneous kernel translation. It's possible that a device driver somewhere is misbehaving and using the GART or IOMMU improperly, or, my guess, it may be the actual AGP device driver used by RedHat. I.e., they may not have patched in the most recent version, which may contain a lot of fixes. Thanks for your feedback. As for making your messages go away, I would tell you to disable the GartTableWalk in MCE, but that does not seem to work on my machine. I'll let you know what does work without turning off Northbridge MC* entirely once I discover it. -Arthur Perry On Wed, 2 Jun 2004, Saurabh Barve wrote: > Sorry about the delay in my reply. Just got in to work! > Here is the output: > > > pcitweak -r 0:18:3 0x48 > > 0x0005001B > > > and > > pcitweak -r 0:19:3 0x48 > > 0x00000000 > > > While you are at it, can you send us status high as well? 
> > > > pcitweak -r 0:18:3 0x4c > > 0xA4000000 > > > and > > pcitweak -r 0:19:3 0x4c > > 0x00000000 > > I don't know if this would help, but below is a part of my cronwatch log: > > --------------------- Init Begin ------------------------ > > **Unmatched Entries** > Trying to re-exec init > Trying to re-exec init > > ---------------------- Init End ------------------------- > > > --------------------- Kernel Begin ------------------------ > > > WARNING: Kernel Errors Present > uteval-0098: *** Error: Method executio...: 4Time(s) > psparse-1121: *** Error: Method executio...: 8Time(s) > Error uncorrected...: 538Time(s) > GART error 11...: 538Time(s) > Lost an northbridge error...: 538Time(s) > NB error address 00000000...: 538Time(s) > > ---------------------- Kernel End ------------------------- > > > --------------------- ModProbe Begin ------------------------ > > > Can't locate these modules: > char-major-10-134: 4 Time(s) > sound-service-0-3: 6 Time(s) > xp0: 3 Time(s) > sound-slot-0: 6 Time(s) > char-major-188: 15 Time(s) > > ---------------------- ModProbe End ------------------------- > > > Thanks, > Saurabh. > > -- > =============================================================================== > Saurabh Barve Phone: > System Administrator/Data Specialist 970-491-7714 (voice) > Montgomery Research Group, 970-491-8449 (Fax) > Atmospheric Sciences Department, > Fort Collins, Colorado > Colorado State University > > Mail : sa@atmos.colostate.edu > Web : http://fjortoft.atmos.colostate.edu/~sa > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: GART Error 11 2004-06-02 18:39 ` Arthur Perry @ 2004-06-02 18:44 ` Arthur Perry 2004-06-02 18:50 ` Saurabh Barve 1 sibling, 0 replies; 13+ messages in thread From: Arthur Perry @ 2004-06-02 18:44 UTC (permalink / raw) To: Saurabh Barve; +Cc: linux-kernel, Red Hat AMD64 Mailing List Or actually, I should say, you "most likely" have this as well, since I asked you to gather the information through the more quirky interface. The bits for this error case match perfectly, so I'd say it's probably a good bet. Arthur Perry On Wed, 2 Jun 2004, Arthur Perry wrote: > Hi Saurabh, > > Thanks. It looks like you also have true GART errors as reported by hardware, on CPU0. > So our common failure mode here is actual GART errors and not something else being reported as a GART error because of erroneous kernel translation. > > It's possible that a device driver somewhere is misbehaving and using the GART or IOMMU improperly, or, my guess, it may be the actual AGP device driver used by RedHat. > I.e., they may not have patched in the most recent version, which may contain a lot of fixes. > > Thanks for your feedback. > > As for making your messages go away, I would tell you to disable the GartTableWalk in MCE, but that does not seem to work on my machine. > I'll let you know what does work without turning off Northbridge MC* entirely once I discover it. > > -Arthur Perry > > > > On Wed, 2 Jun 2004, Saurabh Barve wrote: > > > Sorry about the delay in my reply. Just got in to work! > > Here is the output: > > > > > pcitweak -r 0:18:3 0x48 > > > > 0x0005001B > > > > > and > > > pcitweak -r 0:19:3 0x48 > > > > 0x00000000 > > > > > While you are at it, can you send us status high as well? 
> > > > > > pcitweak -r 0:18:3 0x4c > > > > 0xA4000000 > > > > > and > > > pcitweak -r 0:19:3 0x4c > > > > 0x00000000 > > > > I don't know if this would help, but below is a part of my cronwatch log: > > > > --------------------- Init Begin ------------------------ > > > > **Unmatched Entries** > > Trying to re-exec init > > Trying to re-exec init > > > > ---------------------- Init End ------------------------- > > > > > > --------------------- Kernel Begin ------------------------ > > > > > > WARNING: Kernel Errors Present > > uteval-0098: *** Error: Method executio...: 4Time(s) > > psparse-1121: *** Error: Method executio...: 8Time(s) > > Error uncorrected...: 538Time(s) > > GART error 11...: 538Time(s) > > Lost an northbridge error...: 538Time(s) > > NB error address 00000000...: 538Time(s) > > > > ---------------------- Kernel End ------------------------- > > > > > > --------------------- ModProbe Begin ------------------------ > > > > > > Can't locate these modules: > > char-major-10-134: 4 Time(s) > > sound-service-0-3: 6 Time(s) > > xp0: 3 Time(s) > > sound-slot-0: 6 Time(s) > > char-major-188: 15 Time(s) > > > > ---------------------- ModProbe End ------------------------- > > > > > > Thanks, > > Saurabh. 
> > > > -- > > =============================================================================== > > Saurabh Barve Phone: > > System Administrator/Data Specialist 970-491-7714 (voice) > > Montgomery Research Group, 970-491-8449 (Fax) > > Atmospheric Sciences Department, > > Fort Collins, Colorado > > Colorado State University > > > > Mail : sa@atmos.colostate.edu > > Web : http://fjortoft.atmos.colostate.edu/~sa > > > > - > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > > > > -- > amd64-list mailing list > amd64-list@redhat.com > https://www.redhat.com/mailman/listinfo/amd64-list > ^ permalink raw reply [flat|nested] 13+ messages in thread
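Since reading the status through PCI config space is described in this thread as the quirkier route, the same information can in principle be cross-checked through the machine-check MSRs. The sketch below is only an illustration under stated assumptions: it assumes the Linux `msr` driver is loaded and exposes `/dev/cpu/0/msr`, that the script runs as root, and that the northbridge bank status is MSR 0x411 (MC4_STATUS) on this CPU family. The helper name `read_msr` is hypothetical.

```python
import os
import struct

# ASSUMPTION: MSR number of the northbridge machine-check bank status.
MC4_STATUS = 0x411


def read_msr(path, msr):
    """Read one 64-bit MSR; the msr device uses the MSR number as file offset."""
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, msr)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read from {path} at offset {msr:#x}")
    # MSR contents are returned as a little-endian 64-bit value on x86.
    return struct.unpack("<Q", raw)[0]


if __name__ == "__main__":
    path = "/dev/cpu/0/msr"  # ASSUMPTION: msr driver loaded, running as root
    try:
        print(f"MC4_STATUS = {read_msr(path, MC4_STATUS):016x}")
    except OSError as exc:
        print(f"could not read {path}: {exc}")
```

If the value read this way agrees with the high and low words gathered with pcitweak, that would remove the "most likely" caveat about the PCI config space reads.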
* Re: GART Error 11 2004-06-02 18:39 ` Arthur Perry 2004-06-02 18:44 ` Arthur Perry @ 2004-06-02 18:50 ` Saurabh Barve 1 sibling, 0 replies; 13+ messages in thread From: Saurabh Barve @ 2004-06-02 18:50 UTC (permalink / raw) To: Arthur Perry; +Cc: Red Hat AMD64 Mailing List, linux-kernel Thanks Arthur! The machine seems to work except for the errors. Is there a way to update the drivers in the OS without having to upgrade the kernel? I guess we'll first have to find out which driver is misbehaving! I'll try the 'mce=off' and 'iommu=off' options again. I'll keep you posted. Thanks again, Saurabh. On Wed, 2 Jun 2004, Arthur Perry wrote: > Hi Saurabh, > > Thanks. It looks like you also have true GART errors as reported by hardware, on CPU0. > So our common failure mode here is actual GART errors and not something else being reported as a GART error because of erroneous kernel translation. > > It's possible that a device driver somewhere is misbehaving and using the GART or IOMMU improperly, or, my guess, it may be the actual AGP device driver used by RedHat. > I.e., they may not have patched in the most recent version, which may contain a lot of fixes. > > Thanks for your feedback. > > As for making your messages go away, I would tell you to disable the GartTableWalk in MCE, but that does not seem to work on my machine. > I'll let you know what does work without turning off Northbridge MC* entirely once I discover it. > > -Arthur Perry > > > > On Wed, 2 Jun 2004, Saurabh Barve wrote: > > > Sorry about the delay in my reply. Just got in to work! > > Here is the output: > > > > > pcitweak -r 0:18:3 0x48 > > > > 0x0005001B > > > > > and > > > pcitweak -r 0:19:3 0x48 > > > > 0x00000000 > > > > > While you are at it, can you send us status high as well? 
> > > > > > pcitweak -r 0:18:3 0x4c > > > > 0xA4000000 > > > > > and > > > pcitweak -r 0:19:3 0x4c > > > > 0x00000000 > > > > I don't know if this would help, but below is a part of my cronwatch log: > > > > --------------------- Init Begin ------------------------ > > > > **Unmatched Entries** > > Trying to re-exec init > > Trying to re-exec init > > > > ---------------------- Init End ------------------------- > > > > > > --------------------- Kernel Begin ------------------------ > > > > > > WARNING: Kernel Errors Present > > uteval-0098: *** Error: Method executio...: 4Time(s) > > psparse-1121: *** Error: Method executio...: 8Time(s) > > Error uncorrected...: 538Time(s) > > GART error 11...: 538Time(s) > > Lost an northbridge error...: 538Time(s) > > NB error address 00000000...: 538Time(s) > > > > ---------------------- Kernel End ------------------------- > > > > > > --------------------- ModProbe Begin ------------------------ > > > > > > Can't locate these modules: > > char-major-10-134: 4 Time(s) > > sound-service-0-3: 6 Time(s) > > xp0: 3 Time(s) > > sound-slot-0: 6 Time(s) > > char-major-188: 15 Time(s) > > > > ---------------------- ModProbe End ------------------------- > > > > > > Thanks, > > Saurabh. 
> > > > -- > > =============================================================================== > > Saurabh Barve Phone: > > System Administrator/Data Specialist 970-491-7714 (voice) > > Montgomery Research Group, 970-491-8449 (Fax) > > Atmospheric Sciences Department, > > Fort Collins, Colorado > > Colorado State University > > > > Mail : sa@atmos.colostate.edu > > Web : http://fjortoft.atmos.colostate.edu/~sa > > > > - > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > > -- ============================================================================= Saurabh Barve Phone: System Administrator/Data Specialist 970-491-7714 (voice) Montgomery Research Group, 970-491-8449 (Fax) Atmospheric Sciences Department, Fort Collins, Colorado Colorado State University Mail : sa@atmos.colostate.edu Web : http://fjortoft.atmos.colostate.edu/~sa ^ permalink raw reply [flat|nested] 13+ messages in thread
[parent not found: <Pine.LNX.4.58.0405271023510.17982@tiamat.perryconsulting.net>]
[parent not found: <20040527150623.GP22630@redhat.com>]
* Re: GART error 11 [not found] ` <20040527150623.GP22630@redhat.com> @ 2004-05-27 17:03 ` Arthur Perry 0 siblings, 0 replies; 13+ messages in thread From: Arthur Perry @ 2004-05-27 17:03 UTC (permalink / raw) To: Dave Jones; +Cc: amd64-list, linux-kernel Hi Dave, I have verified that booting the system with iommu=off (with vanilla kernel) does seem to make the problem go away. I apologize for not being fully up to speed with knowing what relationship the iommu has with the gart, but I will find out. I just wanted to post my findings. Thanks again! Best Regards, Arthur Perry Lead Linux Developer / Linux Systems Architect Validation, CSU Celestica Sair/Linux Gnu Certified Professional Providing professional Linux solutions for 7+ years On Thu, 27 May 2004, Dave Jones wrote: > On Thu, May 27, 2004 at 11:02:10AM -0400, Arthur Perry wrote: > > > The question really comes down to: > > Is this problem an oversight of the distributors (silly! the agp driver should not be built into the kernel for server use!) > > or > > Kernel code implementation? (well, if no agp bus is present, then let's not go and set up the GART, right?) > > If you don't have an AGP graphics card, there'll be nothing really > set up, unless you are using the IOMMU feature of the x86-64 kernel. > Setting up of GART tables etc only happens when the graphics > drivers (DRI) asks it to. > > Does booting with iommu=off fix this for you? > It may be that some driver is doing something that it > shouldn't. If it does make the problem go away, what > devices do you have in the system (lspci output please) > > Dave > ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2004-06-02 19:48 UTC | newest]
Thread overview: 13+ messages
[not found] <22qyw-6e7-29@gated-at.bofh.it>
[not found] ` <22ELe-oP-47@gated-at.bofh.it>
2004-06-02 18:00 ` GART Error 11 Andi Kleen
2004-06-02 18:35 ` Arthur Perry
2004-06-02 19:48 ` Andi Kleen
[not found] <Pine.LNX.4.44.0406011436530.4341-200000@eliassen.atmos.colostate.edu>
2004-06-01 21:48 ` Arthur Perry
2004-06-01 21:55 ` Arthur Perry
2004-06-01 22:52 ` Saurabh Barve
2004-06-01 23:10 ` Saurabh Barve
2004-06-02 14:16 ` Arthur Perry
2004-06-02 17:18 ` Saurabh Barve
2004-06-02 18:39 ` Arthur Perry
2004-06-02 18:44 ` Arthur Perry
2004-06-02 18:50 ` Saurabh Barve
[not found] <Pine.LNX.4.58.0405271023510.17982@tiamat.perryconsulting.net>
[not found] ` <20040527150623.GP22630@redhat.com>
2004-05-27 17:03 ` GART error 11 Arthur Perry