public inbox for linux-kernel@vger.kernel.org
* x86-64 bad pmds in 2.6.11.6
@ 2005-03-30 21:44 Dave Jones
  2005-03-31 10:41 ` Andi Kleen
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Jones @ 2005-03-30 21:44 UTC
  To: ak; +Cc: linux-kernel

[apologies to Andi for getting this twice, I goofed the l-k address
 the first time]

 
 I arrived at the office today to find my workstation had this spew
 in its dmesg buffer..
 
 mm/memory.c:97: bad pmd ffff81004b017438(00000038a5500a88).
 mm/memory.c:97: bad pmd ffff81004b017440(0000000000000003).
 mm/memory.c:97: bad pmd ffff81004b017448(00007ffffffff73b).
 mm/memory.c:97: bad pmd ffff81004b017450(00007ffffffff73c).
 mm/memory.c:97: bad pmd ffff81004b017458(00007ffffffff73d).
 mm/memory.c:97: bad pmd ffff81004b017468(00007ffffffff73e).
 mm/memory.c:97: bad pmd ffff81004b017470(00007ffffffff73f).
 mm/memory.c:97: bad pmd ffff81004b017478(00007ffffffff740).
 mm/memory.c:97: bad pmd ffff81004b017480(00007ffffffff741).
 mm/memory.c:97: bad pmd ffff81004b017488(00007ffffffff742).
 mm/memory.c:97: bad pmd ffff81004b017490(00007ffffffff743).
 mm/memory.c:97: bad pmd ffff81004b017498(00007ffffffff744).
 mm/memory.c:97: bad pmd ffff81004b0174a0(00007ffffffff745).
 mm/memory.c:97: bad pmd ffff81004b0174a8(00007ffffffff746).
 mm/memory.c:97: bad pmd ffff81004b0174b0(00007ffffffff747).
 mm/memory.c:97: bad pmd ffff81004b0174b8(00007ffffffff748).
 mm/memory.c:97: bad pmd ffff81004b0174c0(00007ffffffff749).
 mm/memory.c:97: bad pmd ffff81004b0174c8(00007ffffffff74a).
 mm/memory.c:97: bad pmd ffff81004b0174d0(00007ffffffff74b).
 mm/memory.c:97: bad pmd ffff81004b0174d8(00007ffffffff74c).
 mm/memory.c:97: bad pmd ffff81004b0174e0(00007ffffffff74d).
 mm/memory.c:97: bad pmd ffff81004b0174e8(00007ffffffff74e).
 mm/memory.c:97: bad pmd ffff81004b0174f0(00007ffffffff74f).
 mm/memory.c:97: bad pmd ffff81004b0174f8(00007ffffffff750).
 mm/memory.c:97: bad pmd ffff81004b017500(00007ffffffff751).
 mm/memory.c:97: bad pmd ffff81004b017508(00007ffffffff752).
 mm/memory.c:97: bad pmd ffff81004b017510(00007ffffffff753).
 mm/memory.c:97: bad pmd ffff81004b017518(00007ffffffff754).
 mm/memory.c:97: bad pmd ffff81004b017520(00007ffffffff755).
 mm/memory.c:97: bad pmd ffff81004b017528(00007ffffffff756).
 mm/memory.c:97: bad pmd ffff81004b017530(00007ffffffff757).
 mm/memory.c:97: bad pmd ffff81004b017538(00007ffffffff758).
 mm/memory.c:97: bad pmd ffff81004b017540(00007ffffffff759).
 mm/memory.c:97: bad pmd ffff81004b017548(00007ffffffff75a).
 mm/memory.c:97: bad pmd ffff81004b017550(00007ffffffff75b).
 mm/memory.c:97: bad pmd ffff81004b017558(00007ffffffff75c).
 mm/memory.c:97: bad pmd ffff81004b017560(00007ffffffff75d).
 mm/memory.c:97: bad pmd ffff81004b017568(00007ffffffff75e).
 mm/memory.c:97: bad pmd ffff81004b017570(00007ffffffff75f).
 mm/memory.c:97: bad pmd ffff81004b017578(00007ffffffff760).
 mm/memory.c:97: bad pmd ffff81004b017580(00007ffffffff761).
 mm/memory.c:97: bad pmd ffff81004b017588(00007ffffffff762).
 mm/memory.c:97: bad pmd ffff81004b017590(00007ffffffff763).
 mm/memory.c:97: bad pmd ffff81004b017598(00007ffffffff764).
 mm/memory.c:97: bad pmd ffff81004b0175a0(00007ffffffff765).
 mm/memory.c:97: bad pmd ffff81004b0175a8(00007ffffffff766).
 mm/memory.c:97: bad pmd ffff81004b0175b0(00007ffffffff767).
 mm/memory.c:97: bad pmd ffff81004b0175b8(00007ffffffff768).
 mm/memory.c:97: bad pmd ffff81004b0175c0(00007ffffffff769).
 mm/memory.c:97: bad pmd ffff81004b0175c8(00007ffffffff76a).
 mm/memory.c:97: bad pmd ffff81004b0175d0(00007ffffffff76b).
 mm/memory.c:97: bad pmd ffff81004b0175d8(00007ffffffff76c).
 mm/memory.c:97: bad pmd ffff81004b0175e0(00007ffffffff76d).
 mm/memory.c:97: bad pmd ffff81004b0175e8(00007ffffffff76e).
 mm/memory.c:97: bad pmd ffff81004b0175f0(00007ffffffff76f).
 mm/memory.c:97: bad pmd ffff81004b0175f8(00007ffffffff770).
 mm/memory.c:97: bad pmd ffff81004b017600(00007ffffffff771).
 mm/memory.c:97: bad pmd ffff81004b017608(00007ffffffff772).
 mm/memory.c:97: bad pmd ffff81004b017610(00007ffffffff773).
 mm/memory.c:97: bad pmd ffff81004b017618(00007ffffffff774).
 mm/memory.c:97: bad pmd ffff81004b017628(0000000000000010).
 mm/memory.c:97: bad pmd ffff81004b017630(00000000078bfbff).
 mm/memory.c:97: bad pmd ffff81004b017638(0000000000000006).
 mm/memory.c:97: bad pmd ffff81004b017640(0000000000001000).
 mm/memory.c:97: bad pmd ffff81004b017648(0000000000000011).
 mm/memory.c:97: bad pmd ffff81004b017650(0000000000000064).
 mm/memory.c:97: bad pmd ffff81004b017658(0000000000000003).
 mm/memory.c:97: bad pmd ffff81004b017660(0000000000400040).
 mm/memory.c:97: bad pmd ffff81004b017668(0000000000000004).
 mm/memory.c:97: bad pmd ffff81004b017670(0000000000000038).
 mm/memory.c:97: bad pmd ffff81004b017678(0000000000000005).
 mm/memory.c:97: bad pmd ffff81004b017680(0000000000000008).
 mm/memory.c:97: bad pmd ffff81004b017688(0000000000000007).
 mm/memory.c:97: bad pmd ffff81004b017698(0000000000000008).
 mm/memory.c:97: bad pmd ffff81004b0176a8(0000000000000009).
 mm/memory.c:97: bad pmd ffff81004b0176b0(0000000000403840).
 mm/memory.c:97: bad pmd ffff81004b0176b8(000000000000000b).
 mm/memory.c:97: bad pmd ffff81004b0176c0(00000000000001f4).
 mm/memory.c:97: bad pmd ffff81004b0176c8(000000000000000c).
 mm/memory.c:97: bad pmd ffff81004b0176d0(00000000000001f4).
 mm/memory.c:97: bad pmd ffff81004b0176d8(000000000000000d).
 mm/memory.c:97: bad pmd ffff81004b0176e0(00000000000001f4).
 mm/memory.c:97: bad pmd ffff81004b0176e8(000000000000000e).
 mm/memory.c:97: bad pmd ffff81004b0176f0(00000000000001f4).
 mm/memory.c:97: bad pmd ffff81004b0176f8(0000000000000017).
 mm/memory.c:97: bad pmd ffff81004b017708(000000000000000f).
 mm/memory.c:97: bad pmd ffff81004b017710(00007ffffffff734).
 mm/memory.c:97: bad pmd ffff81004b017730(5f36387800000000).
 mm/memory.c:97: bad pmd ffff81004b017738(0000000000003436).
 
 
I've not done a memtest86 run on this (yet), but I'll be very
surprised if this is bad RAM, especially considering other
folks also seem to have hit the same thing when they moved
to 2.6.11.  (My workstation ran 2.6.9/2.6.10 without incident
previously).

http://lkml.org/lkml/2005/3/11/42 for example lists a similar
dump (though obviously differing addresses).
Googling around reveals a bunch of other similar dumps.
 
 		Dave



* Re: x86-64 bad pmds in 2.6.11.6
  2005-03-30 21:44 Dave Jones
@ 2005-03-31 10:41 ` Andi Kleen
  2005-03-31 21:52   ` Dave Jones
  2005-04-07  2:49   ` Dave Jones
  0 siblings, 2 replies; 16+ messages in thread
From: Andi Kleen @ 2005-03-31 10:41 UTC
  To: Dave Jones, ak, linux-kernel

On Wed, Mar 30, 2005 at 04:44:55PM -0500, Dave Jones wrote:
> [apologies to Andi for getting this twice, I goofed the l-k address
>  the first time]
> 
>  
>  I arrived at the office today to find my workstation had this spew
>  in its dmesg buffer..

Looks like random memory corruption to me.

Can you enable slab debugging etc.?

>  mm/memory.c:97: bad pmd ffff81004b017438(00000038a5500a88).
>  mm/memory.c:97: bad pmd ffff81004b017440(0000000000000003).
>  mm/memory.c:97: bad pmd ffff81004b017448(00007ffffffff73b).
>  mm/memory.c:97: bad pmd ffff81004b017450(00007ffffffff73c).
>  [...]
>  
>  
> I've not done a memtest86 run on this (yet), but I'll be very
> surprised if this is bad RAM, especially considering other
> folks also seem to have hit the same thing when they moved
> to 2.6.11.  (My workstation ran 2.6.9/2.6.10 without incident
> previously).
> 
> http://lkml.org/lkml/2005/3/11/42 for example lists a similar
> dump (though obviously differing addresses).
> Googling around reveals a bunch of other similar dumps.

Yes, I saw them, but I supposed it was some driver going bad.
If you want you can collect hardware data and see if there is
a common driver.

-Andi


* Re: x86-64 bad pmds in 2.6.11.6
  2005-03-31 10:41 ` Andi Kleen
@ 2005-03-31 21:52   ` Dave Jones
  2005-04-01 11:52     ` Sergey S. Kostyliov
  2005-04-07  2:49   ` Dave Jones
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Jones @ 2005-03-31 21:52 UTC
  To: Andi Kleen; +Cc: linux-kernel

On Thu, Mar 31, 2005 at 12:41:17PM +0200, Andi Kleen wrote:
 > On Wed, Mar 30, 2005 at 04:44:55PM -0500, Dave Jones wrote:
 > > [apologies to Andi for getting this twice, I goofed the l-k address
 > >  the first time]
 > > 
 > >  
 > >  I arrived at the office today to find my workstation had this spew
 > >  in its dmesg buffer..
 > 
 > Looks like random memory corruption to me.
 > 
 > Can you enable slab debugging etc.?

SLAB_DEBUG=y.  Nothing in the logs.

 > Yes, I saw them, but I supposed it was some driver going bad.
 > If you want you can collect hardware data and see if there is
 > a common driver.

There's quite a bit in this box 

00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
00:07.5 Multimedia audio controller: Advanced Micro Devices [AMD] AMD-8111 AC97 Audio (rev 03)
00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
02:07.0 USB Controller: NEC Corporation USB (rev 41)
02:07.1 USB Controller: NEC Corporation USB (rev 41)
02:07.2 USB Controller: NEC Corporation USB 2.0 (rev 02)
02:08.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1010 66MHz  Ultra3 SCSI Adapter (rev 01)
02:08.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1010 66MHz  Ultra3 SCSI Adapter (rev 01)
02:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703X Gigabit Ethernet (rev 02)
03:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
03:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
03:0a.0 Unknown mass storage controller: Triones Technologies, Inc. HPT366/368/370/370A/372 (rev 03)
03:0b.0 Unknown mass storage controller: Silicon Image, Inc. (formerly CMD Technology Inc) SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
03:0c.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
04:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-8151 System Controller (rev 13)
04:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8151 AGP Bridge (rev 13)
05:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G550 AGP (rev 01)

The SATA & SCSI controllers have no disks attached.  Firewire can be ignored (there's
no actual connector even for it on the board). The various USB controllers
are mostly unused. Only one of them is USB2.0, so that sees occasional
usb-storage use. Not noticed anything going bad there though.

		Dave



* Re: x86-64 bad pmds in 2.6.11.6
  2005-03-31 21:52   ` Dave Jones
@ 2005-04-01 11:52     ` Sergey S. Kostyliov
  0 siblings, 0 replies; 16+ messages in thread
From: Sergey S. Kostyliov @ 2005-04-01 11:52 UTC
  To: Dave Jones; +Cc: Andi Kleen, linux-kernel

On Friday 01 April 2005 01:52, Dave Jones wrote:
> On Thu, Mar 31, 2005 at 12:41:17PM +0200, Andi Kleen wrote:
>  > On Wed, Mar 30, 2005 at 04:44:55PM -0500, Dave Jones wrote:
>  > > [apologies to Andi for getting this twice, I goofed the l-k address
>  > >  the first time]
>  > > 
>  > >  
>  > >  I arrived at the office today to find my workstation had this spew
>  > >  in its dmesg buffer..
>  > 
>  > Looks like random memory corruption to me.
>  > 
>  > Can you enable slab debugging etc.?
> 
> SLAB_DEBUG=y.  Nothing in the logs.
> 
>  > Yes, I saw them, but I supposed it was some driver going bad.
>  > If you want you can collect hardware data and see if there is
>  > a common driver.
> 
> There's quite a bit in this box 
> 
> [...]
> 
> The SATA & SCSI controllers have no disks attached.  Firewire can be ignored (there's
> no actual connector even for it on the board). The various USB controllers
> are mostly unused. Only one of them is USB2.0, so that sees occasional
> usb-storage use. Not noticed anything going bad there though.
> 
> 		Dave

And here is my box (it looks like there are not many hardware drivers
in common).

rathamahata@lights rathamahata $ /sbin/lspci
0000:00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
0000:00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
0000:00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
0000:00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
0000:00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
0000:00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
0000:00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
0000:00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
0000:00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
0000:00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
0000:00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
0000:00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
0000:00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
0000:00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
0000:00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
0000:00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
0000:01:01.0 PCI bridge: IBM PCI-X to PCI-X Bridge (rev 02)
0000:02:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID (rev 02)
0000:03:03.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller
0000:03:04.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller
0000:04:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
0000:04:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
0000:04:06.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
rathamahata@lights rathamahata $

e1000 is handled by Intel's e1000 driver


USB is not compiled in:
rathamahata@lights linux-2.6.11 $ grep CONFIG_USB .config
# CONFIG_USB is not set
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
# CONFIG_USB_GADGET is not set
rathamahata@lights linux-2.6.11 $

-- 
Sergey S. Kostyliov <rathamahata@ehouse.ru>
Jabber ID: rathamahata@jabber.org


* Re: x86-64 bad pmds in 2.6.11.6
  2005-03-31 10:41 ` Andi Kleen
  2005-03-31 21:52   ` Dave Jones
@ 2005-04-07  2:49   ` Dave Jones
  2005-04-07  6:29     ` Andi Kleen
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Jones @ 2005-04-07  2:49 UTC
  To: Andi Kleen; +Cc: linux-kernel

On Thu, Mar 31, 2005 at 12:41:17PM +0200, Andi Kleen wrote:
 > On Wed, Mar 30, 2005 at 04:44:55PM -0500, Dave Jones wrote:
 > > [apologies to Andi for getting this twice, I goofed the l-k address
 > >  the first time]
 > > 
 > >  
 > >  I arrived at the office today to find my workstation had this spew
 > >  in its dmesg buffer..
 > 
 > Looks like random memory corruption to me.
 > 
 > Can you enable slab debugging etc.?
 > 
 > >  mm/memory.c:97: bad pmd ffff81004b017438(00000038a5500a88).
 > >  mm/memory.c:97: bad pmd ffff81004b017440(0000000000000003).
 > >  mm/memory.c:97: bad pmd ffff81004b017448(00007ffffffff73b).
 > >  mm/memory.c:97: bad pmd ffff81004b017450(00007ffffffff73c).
 > > etc..

I realised today that this happens every time X starts up for
the first time.   I did some experiments, and found that with 2.6.12-rc1
it's gone. Either it got fixed accidentally, or it's hidden now
by one of the many changes in the 4-level patches.

I'll try and narrow this down a little more tomorrow, to see if I
can pinpoint the exact -bk snapshot (may be tricky given they were
broken for a while), as it'd be good to get this fixed in 2.6.11.x
if .12 isn't going to show up any time soon.

		Dave



* Re: x86-64 bad pmds in 2.6.11.6
  2005-04-07  2:49   ` Dave Jones
@ 2005-04-07  6:29     ` Andi Kleen
  2005-04-14 13:54       ` Hugh Dickins
  0 siblings, 1 reply; 16+ messages in thread
From: Andi Kleen @ 2005-04-07  6:29 UTC
  To: Dave Jones, Andi Kleen, linux-kernel

> I realised today that this happens every time X starts up for
> the first time.   I did some experiments, and found that with 2.6.12-rc1
> it's gone. Either it got fixed accidentally, or it's hidden now
> by one of the many changes in the 4-level patches.
> 
> I'll try and narrow this down a little more tomorrow, to see if I
> can pinpoint the exact -bk snapshot (may be tricky given they were
> broken for a while), as it'd be good to get this fixed in 2.6.11.x
> if .12 isn't going to show up any time soon.

Can you supply a strace of the /dev/mem, /dev/kmem accesses of 
your X server? (including the mmaps or read/writes if available)

My X server doesn't seem to cause that.

-Andi


* re: x86-64 bad pmds in 2.6.11.6
@ 2005-04-08 16:33 Clem Taylor
  0 siblings, 0 replies; 16+ messages in thread
From: Clem Taylor @ 2005-04-08 16:33 UTC
  To: linux-kernel

Dave Jones reported seeing bad pmd messages in 2.6.11.6. I've been
seeing them with 2.6.11 and today with 2.6.11.6. When I first saw the
problem I ran memtest86 and it didn't catch anything after ~3 hours.
However, I don't see them when X starts. They tend to happen after a
program segfaults:

2.6.11:
Apr  3 23:23:33 klaatu kernel: sh[16361]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007ffffffff020 error 14
Apr  3 23:23:33 klaatu kernel: mm/memory.c:97: bad pmd ffff810027171010(00000000006b68b9).
.. many more ...

2.6.11.6:
Apr  8 12:03:17 klaatu kernel: grep[20971]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007ffffffff090 error 14
Apr  8 12:03:17 klaatu kernel: mm/memory.c:97: bad pmd ffff810095929010(0000000000000015).
.... many more ...
Apr  8 12:03:18 klaatu kernel: mm/memory.c:97: bad pmd ffff8100959299d0(000034365f363878).
Apr  8 12:03:18 klaatu kernel: grep[21116]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007ffffffff0a0 error 14
Apr  8 12:03:18 klaatu kernel: mm/memory.c:97: bad pmd ffff810095f5b000(000000000000000f).
...

At the time I was doing a
find ... -exec grep -H ...
over a linux kernel tree.

I repeated the find and I didn't see segfaults the second run.

                                --Clem


* Re: x86-64 bad pmds in 2.6.11.6
  2005-04-07  6:29     ` Andi Kleen
@ 2005-04-14 13:54       ` Hugh Dickins
  2005-04-14 17:01         ` Andi Kleen
  0 siblings, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2005-04-14 13:54 UTC
  To: Andi Kleen
  Cc: Dave Jones, Sergey S. Kostyliov, Clem Taylor, Chris Wright,
	linux-kernel

On Thu, 7 Apr 2005, Andi Kleen wrote:
> Dave Jones wrote:
> > I realised today that this happens every time X starts up for
> > the first time.   I did some experiments, and found that with 2.6.12-rc1
> > it's gone. Either it got fixed accidentally, or it's hidden now
> > by one of the many changes in the 4-level patches.
> > 
> > I'll try and narrow this down a little more tomorrow, to see if I
> > can pinpoint the exact -bk snapshot (may be tricky given they were
> > broken for a while), as it'd be good to get this fixed in 2.6.11.x
> > if .12 isn't going to show up any time soon.
> 
> Can you supply a strace of the /dev/mem, /dev/kmem accesses of 
> your X server? (including the mmaps or read/writes if available)
> 
> My X server doesn't seem to cause that.

I can't explain why it should appear fixed in 2.6.12-rc1 (probably
other complicating factors at work), but I do believe you've fixed
this in 2.6.12-rc2, and the patch which should go into -stable is
your load_cr3 patch below, which Linus took from Andrew on 28 March.

I say this because I was intrigued by the resemblance between Sergey's
and Dave's corruptions, and spent a while trying to work out where they
come from.  The giveaway is the little ASCII string they share at the
end (seen also in Clem's extract)

 mm/memory.c:97: bad pmd ffff81004b017730(5f36387800000000).
 mm/memory.c:97: bad pmd ffff81004b017738(0000000000003436).

That says "x86_64", and a grep for that as a string shows ELF_PLATFORM,
and a grep for that shows create_elf_tables in fs/binfmt_elf.c.  _All_
this pmd corruption (except for the first line, presumably pushing a
user address on stack) originates from create_elf_tables (the neatly
ascending stack addresses being the argv and envp pointers, incrementing
by 1 because only a NUL-string is found for each, the real strings being
off elsewhere in the intended new stack page, not in this pmd page).

It looks very much as if the mm being created has for pmd a page
which was used for user stack in the outgoing mm; but somehow exec's
exit_mmap TLB flushing hasn't taken effect.  I only now noticed this
patch where you fix just such an issue.

Hugh

From: "Andi Kleen" <ak@suse.de>

Always reload CR3 completely when a lazy MM thread drops a MM.  This avoids
keeping stale mappings around in the TLB that could be run into by the CPU by
itself (e.g.  during prefetches).

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/arch/x86_64/kernel/smp.c         |    3 ++-
 25-akpm/include/asm-x86_64/mmu_context.h |   10 ++++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

diff -puN arch/x86_64/kernel/smp.c~x86_64-always-reload-cr3-completely-when-a-lazy-mm arch/x86_64/kernel/smp.c
--- 25/arch/x86_64/kernel/smp.c~x86_64-always-reload-cr3-completely-when-a-lazy-mm	Wed Mar 23 15:38:58 2005
+++ 25-akpm/arch/x86_64/kernel/smp.c	Wed Mar 23 15:38:58 2005
@@ -25,6 +25,7 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include <asm/mach_apic.h>
+#include <asm/mmu_context.h>
 #include <asm/proto.h>
 
 /*
@@ -52,7 +53,7 @@ static inline void leave_mm (unsigned lo
 	if (read_pda(mmu_state) == TLBSTATE_OK)
 		BUG();
 	clear_bit(cpu, &read_pda(active_mm)->cpu_vm_mask);
-	__flush_tlb();
+	load_cr3(swapper_pg_dir);
 }
 
 /*
diff -puN include/asm-x86_64/mmu_context.h~x86_64-always-reload-cr3-completely-when-a-lazy-mm include/asm-x86_64/mmu_context.h
--- 25/include/asm-x86_64/mmu_context.h~x86_64-always-reload-cr3-completely-when-a-lazy-mm	Wed Mar 23 15:38:58 2005
+++ 25-akpm/include/asm-x86_64/mmu_context.h	Wed Mar 23 15:38:58 2005
@@ -28,6 +28,11 @@ static inline void enter_lazy_tlb(struct
 }
 #endif
 
+static inline void load_cr3(pgd_t *pgd)
+{
+	asm volatile("movq %0,%%cr3" :: "r" (__pa(pgd)) : "memory");
+}
+
 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, 
 			     struct task_struct *tsk)
 {
@@ -40,7 +45,8 @@ static inline void switch_mm(struct mm_s
 		write_pda(active_mm, next);
 #endif
 		set_bit(cpu, &next->cpu_vm_mask);
-		asm volatile("movq %0,%%cr3" :: "r" (__pa(next->pgd)) : "memory");
+		load_cr3(next->pgd);
+
 		if (unlikely(next->context.ldt != prev->context.ldt)) 
 			load_LDT_nolock(&next->context, cpu);
 	}
@@ -54,7 +60,7 @@ static inline void switch_mm(struct mm_s
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
 			 */
-			asm volatile("movq %0,%%cr3" :: "r" (__pa(next->pgd)) : "memory");
+			load_cr3(next->pgd);
 			load_LDT_nolock(&next->context, cpu);
 		}
 	}

* Re: x86-64 bad pmds in 2.6.11.6
  2005-04-14 13:54       ` Hugh Dickins
@ 2005-04-14 17:01         ` Andi Kleen
  2005-04-14 17:34           ` Hugh Dickins
  0 siblings, 1 reply; 16+ messages in thread
From: Andi Kleen @ 2005-04-14 17:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andi Kleen, Dave Jones, Sergey S. Kostyliov, Clem Taylor,
	Chris Wright, linux-kernel

> It looks very much as if the mm being created has for pmd a page
> which was used for user stack in the outgoing mm; but somehow exec's
> exit_mmap TLB flushing hasn't taken effect.  I only now noticed this
> patch where you fix just such an issue.

Thanks for the analysis. However, I doubt the load_cr3 patch can fix
it. All it does is stop the CPU from prefetching mappings (which
can cause different problems). But the Linux code that does the bad pmd
checks never looks at CR3 anyway; it always uses current->mm. If
bad pmd sees a bad page, it must still be in the page tables of the MM,
not a stale TLB entry.

It must be something else. Somehow we get a freed page into
the page table hierarchy. After the initial 4level implementation
I did not make many changes there; my suspicion would rather fall
on the recent memory.c changes.

-Andi

* Re: x86-64 bad pmds in 2.6.11.6
  2005-04-14 17:01         ` Andi Kleen
@ 2005-04-14 17:34           ` Hugh Dickins
  2005-04-14 18:10             ` Andi Kleen
  0 siblings, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2005-04-14 17:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Jones, Sergey S. Kostyliov, Clem Taylor, Chris Wright,
	linux-kernel

On Thu, 14 Apr 2005, Andi Kleen wrote:
> 
> Thanks for the analysis. However I doubt the load_cr3 patch can fix
> it. All it does is to stop the CPU from prefetching mappings (which
> can cause different problems).

I thought that the leave_mm code (before your patch) flushes the TLB, but
restores cr3 to the mm, while removing that cpu from the mm's cpu_vm_mask.

So any speculation, not just prefetching, on that cpu is in danger of
bringing address translations according to that mm back into the TLB.

But when the mm is torn down in exit_mmap, there's no longer any record
that the TLB on that cpu needs flushing, so stale translations remain.

As a rule, we always flush TLB _after_ invalidating, not just before,
for this kind of reason.

My paranoia of speculation may be excessive: I _think_ what I outline
above is a real possibility on Intel, but you and others know AMD much
better than I (and the reports I've seen are on AMD64, not EM64T).

> But the Linux code that does the bad pmd checks
> never looks at CR3 anyway; it always uses current->mm. If
> bad pmd sees a bad page, it must still be in the page tables of the MM,
> not a stale TLB entry.

Sure, the "mm/memory.c:97: bad pmd" messages are coming from
clear_pmd_range, when the corrupted task exits later (but probably
not much later, since its user stack is oddly distributed across
two different pages: some mentioned SIGSEGVs I think).

The pmd really is bad, but it got to be bad because it had stack data
written into it by create_elf_tables, when the TLB mistakenly thought
it already knew what physical page 0x00007ffffffff000 was mapped to
(prior kernel accesses to that user stack are not by user address).

Hugh

* Re: x86-64 bad pmds in 2.6.11.6
  2005-04-14 17:34           ` Hugh Dickins
@ 2005-04-14 18:10             ` Andi Kleen
  0 siblings, 0 replies; 16+ messages in thread
From: Andi Kleen @ 2005-04-14 18:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andi Kleen, Dave Jones, Sergey S. Kostyliov, Clem Taylor,
	Chris Wright, linux-kernel

On Thu, Apr 14, 2005 at 06:34:58PM +0100, Hugh Dickins wrote:
> On Thu, 14 Apr 2005, Andi Kleen wrote:
> > 
> > Thanks for the analysis. However I doubt the load_cr3 patch can fix
> > it. All it does is to stop the CPU from prefetching mappings (which
> > can cause different problems).
> 
> I thought that the leave_mm code (before your patch) flushes the TLB, but
> restores cr3 to the mm, while removing that cpu from the mm's cpu_vm_mask.
> 
> So any speculation, not just prefetching, on that cpu is in danger of
> bringing address translations according to that mm back into the TLB.
> 
> But when the mm is torn down in exit_mmap, there's no longer any record
> that the TLB on that cpu needs flushing, so stale translations remain.
> 
> As a rule, we always flush TLB _after_ invalidating, not just before,
> for this kind of reason.

Yes this is all true. In fact I have several bug fixes for problems
in this area.

But none of this can explain corruption coming from the kernel;
you tend to see problems only when the CPU prefetches something.

Note that with the cr3 reload you end up with init_mm, which
is not any useful mm. So even if there was a store from the kernel
into a stale mapping it would cause -EFAULT now.  But that is
not happening.

> 
> My paranoia of speculation may be excessive: I _think_ what I outline
> above is a real possibility on Intel, but you and others know AMD much
> better than I (and the reports I've seen are on AMD64, not EM64T).

It is not excessive, on either Intel or AMD :) These CPUs do a lot of
prefetching behind your back, and any stale mapping left in the TLB will
eventually cause problems. But different problems from this one.


> Sure, the "mm/memory.c:97: bad pmd" messages are coming from
> clear_pmd_range, when the corrupted task exits later (but probably
> not much later, since its user stack is oddly distributed across
> two different pages: some mentioned SIGSEGVs I think).
> 
> The pmd really is bad, but it got to be bad because it had stack data
> written into it by create_elf_tables, when the TLB mistakenly thought
> it already knew what physical page 0x00007ffffffff000 was mapped to
> (prior kernel accesses to that user stack are not by user address).

What I meant is that the overwriting must come from Linux code
acting on the direct mapping, not from stale TLB entries for addresses < __PAGE_OFFSET.

I will take a closer look at the rc1/rc2 patches later this evening
and see if I can spot something. Can only report back tomorrow though.

-Andi

* Re: x86-64 bad pmds in 2.6.11.6
@ 2005-08-08 16:55 Andy Davidson
  0 siblings, 0 replies; 16+ messages in thread
From: Andy Davidson @ 2005-08-08 16:55 UTC (permalink / raw)
  To: linux-kernel, davej

On Wed, 6 Apr, 2005 22:49:03 -0400, Dave Jones wrote:
> On Thu, Mar 31, 2005 at 12:41:17PM +0200, Andi Kleen wrote:
>  > On Wed, Mar 30, 2005 at 04:44:55PM -0500, Dave Jones wrote:
>  > >  I arrived at the office today to find my workstation had this spew
>  > >  in its dmesg buffer..
>  > Looks like random memory corruption to me.
>  > Can you enable slab debugging etc.?
>  > >  mm/memory.c:97: bad pmd ffff81004b017438(00000038a5500a88).
>  > >  mm/memory.c:97: bad pmd ffff81004b017440(0000000000000003).
>  > >  mm/memory.c:97: bad pmd ffff81004b017448(00007ffffffff73b).
>  > >  mm/memory.c:97: bad pmd ffff81004b017450(00007ffffffff73c).
> I realised today that this happens every time X starts up for
> the first time.   I did some experiments, and found that with 2.6.12rc1
> it's gone. Either it got fixed accidentally, or it's hidden now
> by one of the many changes in the 4-level patches.
> I'll try and narrow this down a little more tomorrow, to see if I
> can pinpoint the exact -bk snapshot (may be tricky given they were
> broken for a while), as it'd be good to get this fixed in 2.6.11.x
> if .12 isn't going to show up any time soon.

Hi, Dave, all --

Does anyone remember if they saw any system instability at the time of 
these messages ?

I'm running 2.6.11 on an SMP Opteron box, which is exhibiting these 
notices.  The box occasionally then behaves like it would during a 
serious memory leak - the load average shoots up, the box becomes 
unresponsive, stops accepting network connections, (but memory resources 
are not entirely starved, and nor does the kernel kill any processes off.)

Then - a few minutes later, the computer returns to normal.  This seems 
to happen maybe twice a week.  Thankfully, it's not ruined my weekend 
with a phone call from support yet, but it might. ;-)

If you do remember instability at this time, which was cured with an 
upgrade, then I will schedule some down time to try this out.


-- 

Regards, Andy Davidson                                andy@ebuyer.com
Systems Administrator,                                Ebuyer (UK) Ltd

* Re: x86-64 bad pmds in 2.6.11.6
@ 2005-09-20 17:12 Charles McCreary
  2005-09-20 17:30 ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Charles McCreary @ 2005-09-20 17:12 UTC (permalink / raw)
  To: linux-kernel

Another datapoint for this thread. The box spewing the bad pmds messages is a 
dual opteron 246 on a TYAN S2885 Thunder K8W motherboard. Kernel is 
2.6.11.4-20a-smp.

Approximately one hour after the bad pmd's, the box was completely 
unresponsive. This machine is either idle or heavily loaded, many threads, 
lots of io and nfs network traffic. Never see this when idle. When heavily 
loaded, it will invariably become unresponsive within 24 hrs. Looks 
reproducible. I'm willing to provide more information and test patches.

Output:
Sep 15 06:42:46 lakeport -- MARK --
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680bc8(00002aaaaaaaba98).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680bd0(0000000000000002).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680bd8(00007ffffffffdcc).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680be0(00007ffffffffdcd).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680bf0(00007ffffffffdce).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680bf8(00007ffffffffdcf).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c00(00007ffffffffdd0).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c08(00007ffffffffdd1).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c10(00007ffffffffdd2).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c18(00007ffffffffdd3).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c20(00007ffffffffdd4).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c28(00007ffffffffdd5).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c30(00007ffffffffdd6).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c38(00007ffffffffdd7).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c40(00007ffffffffdd8).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c48(00007ffffffffdd9).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c50(00007ffffffffdda).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c58(00007ffffffffddb).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c60(00007ffffffffddc).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c68(00007ffffffffddd).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c70(00007ffffffffdde).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c78(00007ffffffffddf).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c80(00007ffffffffde0).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c88(00007ffffffffde1).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c90(00007ffffffffde2).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680c98(00007ffffffffde3).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680ca0(00007ffffffffde4).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680ca8(00007ffffffffde5).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680cb0(00007ffffffffde6).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680cc0(0000000000000010).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680cc8(00000000078bfbff).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680cd0(0000000000000006).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680cd8(0000000000001000).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680ce0(0000000000000011).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680ce8(0000000000000064).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680cf0(0000000000000003).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680cf8(0000000000400040).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d00(0000000000000004).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d08(0000000000000038).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d10(0000000000000005).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d18(0000000000000009).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d20(0000000000000007).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d28(00002aaaaaaab000).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d30(0000000000000008).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d40(0000000000000009).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d48(00000000004010f0).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d50(000000000000000b).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d60(000000000000000c).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d70(000000000000000d).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d80(000000000000000e).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680d90(0000000000000017).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680da0(000000000000000f).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680da8(00007ffffffffdc5).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680dc0(3638780000000000).
Sep 15 06:58:44 lakeport kernel: mm/memory.c:97: bad pmd ffff81013c680dc8(000000000034365f).
Sep 15 07:22:47 lakeport -- MARK --


* Re: x86-64 bad pmds in 2.6.11.6
  2005-09-20 17:12 x86-64 bad pmds in 2.6.11.6 Charles McCreary
@ 2005-09-20 17:30 ` Linus Torvalds
  2005-09-20 19:44   ` Chris Wedgwood
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2005-09-20 17:30 UTC (permalink / raw)
  To: Charles McCreary; +Cc: linux-kernel



On Tue, 20 Sep 2005, Charles McCreary wrote:
>
> Another datapoint for this thread. The box spewing the bad pmds messages is a 
> dual opteron 246 on a TYAN S2885 Thunder K8W motherboard. Kernel is 
> 2.6.11.4-20a-smp.

This is quite possibly the result of an Opteron errata (tlb flush
filtering is broken on SMP) that we worked around as of 2.6.14-rc2.

So either just try 2.6.14-rc2, or try the appended patch (it has since 
been confirmed by many more people).

		Linus

---
diff-tree bc5e8fdfc622b03acf5ac974a1b8b26da6511c99 (from 61ffcafafb3d985e1ab8463be0187b421614775c)
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date:   Sat Sep 17 15:41:04 2005 -0700

    x86-64/smp: fix random SIGSEGV issues
    
    They seem to have been due to AMD errata 63/122; the fix is to disable
    TLB flush filtering in SMP configurations.
    
    Confirmed to fix the problem by Andrew Walrond <andrew@walrond.org>
    
    [ Let's see if we'll have a better fix eventually, this is the Q&D
      "let's get this fixed and out there" version ]
    
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/arch/x86_64/kernel/setup.c b/arch/x86_64/kernel/setup.c
--- a/arch/x86_64/kernel/setup.c
+++ b/arch/x86_64/kernel/setup.c
@@ -831,11 +831,26 @@ static void __init amd_detect_cmp(struct
 #endif
 }
 
+#define HWCR 0xc0010015
+
 static int __init init_amd(struct cpuinfo_x86 *c)
 {
 	int r;
 	int level;
 
+#ifdef CONFIG_SMP
+	unsigned long value;
+
+	// Disable TLB flush filter by setting HWCR.FFDIS:
+	// bit 6 of msr C001_0015
+	//
+	// Errata 63 for SH-B3 steppings
+	// Errata 122 for all(?) steppings
+	rdmsrl(HWCR, value);
+	value |= 1 << 6;
+	wrmsrl(HWCR, value);
+#endif
+
 	/* Bit 31 in normal CPUID used for nonstandard 3DNow ID;
 	   3DNow is IDd by bit 31 in extended CPUID (1*32+31) anyway */
 	clear_bit(0*32+31, &c->x86_capability);

* Re: x86-64 bad pmds in 2.6.11.6
  2005-09-20 17:30 ` Linus Torvalds
@ 2005-09-20 19:44   ` Chris Wedgwood
  2005-09-20 23:23     ` Dave Jones
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Wedgwood @ 2005-09-20 19:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Charles McCreary, linux-kernel

On Tue, Sep 20, 2005 at 10:30:48AM -0700, Linus Torvalds wrote:

> This is quite possibly the result of an Opteron errata (tlb flush
> filtering is broken on SMP) that we worked around as of 2.6.14-rc2.

It would be really interesting to know if this does help.  I was told
em64t also has the 'bad pmd' problem but I can't make it happen here
on opteron or em64t.

* Re: x86-64 bad pmds in 2.6.11.6
  2005-09-20 19:44   ` Chris Wedgwood
@ 2005-09-20 23:23     ` Dave Jones
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Jones @ 2005-09-20 23:23 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Linus Torvalds, Charles McCreary, linux-kernel

On Tue, Sep 20, 2005 at 12:44:46PM -0700, Chris Wedgwood wrote:
 > On Tue, Sep 20, 2005 at 10:30:48AM -0700, Linus Torvalds wrote:
 > 
 > > This is quite possibly the result of an Opteron errata (tlb flush
 > > filtering is broken on SMP) that we worked around as of 2.6.14-rc2.
 > 
 > It would be really interesting to know if this does help.  I was told
 > em64t also has the 'bad pmd' problem but I can't make it happen here
 > on opteron or em64t.

In the dozens of reports of bad pmd that Fedora users filed, there
wasn't a single EM64T user.  In fact, most of the hits were from
very similar product lines, from a handful of vendors (Tyan boards
seemed especially susceptible). It may be that other vendors updated
their BIOSes to include this workaround already, and Tyan and a few
others lagged behind.

		Dave


Thread overview: 16+ messages
2005-09-20 17:12 x86-64 bad pmds in 2.6.11.6 Charles McCreary
2005-09-20 17:30 ` Linus Torvalds
2005-09-20 19:44   ` Chris Wedgwood
2005-09-20 23:23     ` Dave Jones
  -- strict thread matches above, loose matches on Subject: below --
2005-08-08 16:55 Andy Davidson
2005-04-08 16:33 Clem Taylor
2005-03-30 21:44 Dave Jones
2005-03-31 10:41 ` Andi Kleen
2005-03-31 21:52   ` Dave Jones
2005-04-01 11:52     ` Sergey S. Kostyliov
2005-04-07  2:49   ` Dave Jones
2005-04-07  6:29     ` Andi Kleen
2005-04-14 13:54       ` Hugh Dickins
2005-04-14 17:01         ` Andi Kleen
2005-04-14 17:34           ` Hugh Dickins
2005-04-14 18:10             ` Andi Kleen
