public inbox for linux-kernel@vger.kernel.org
* amd64 sata_nv (massive) memory corruption
@ 2008-08-01 17:30 Linas Vepstas
From: Linas Vepstas @ 2008-08-01 17:30 UTC
  To: linux-kernel

Hi,

I'm seeing strong, easily reproducible (and silent) corruption on a
sata-attached disk drive on an amd64 board.  It might be the disk itself,
but I doubt it; googling suggests that it's somehow iommu-related, but I
cannot confirm this.

quickie summary:
-- disk is a brand new WDC WD5000AAKS-00YGA0 500GB disk (well, it
    was brand new a few months ago -- unused, at any rate)
-- passes smartmon with flying colors, including many repeated short and long
   self-tests. Been passing for months.  No hint of bad sectors or other errors
   in smartctl -a display
-- no ide, sata errors in syslog -- no block device errors, no fs errors, etc.
-- No oopses anywhere to be found
-- system works flawlessly with an old PATA disk. (although I'm running it
   with dma turned off with hdparm, out of paranoia)
-- system is amd64 dual core, ASUS M2N-E mobo, 4GB RAM
   Northbridge is nVidia Corporation MCP55 Memory Controller (rev a3)
-- I tried moving the sata cable around to other ports, no effect; also tried
   reseating it on hard drive, no effect.

Corruption is *easily* observed when copying files with cp or dd. Filesystem
metadata is typically corrupted too. Creating even a small ext2 filesystem,
say 1GB, then copying 300MB of files onto it, unmounting it, and running fsck
will return many dozens of errors. Rerunning e2fsck over and over (as
e2fsck -f -y /dev/sda6) will report new errors about 1 out of every 3 times
(on small fs'es -- on big ones it will find new errors every time).
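
The copy-and-check experiment above can be made more precise by comparing
the source and the copy block by block, which shows where the damage lands
rather than only that it happened. What follows is a minimal Python sketch
and not part of the original report: the names copy_with_sync and
mismatched_blocks and the 4096-byte block size are illustrative assumptions.
A read-back served from the page cache never touches the disk, so the
filesystem should be unmounted and remounted between the copy and the
comparison.

```python
import os
import shutil

BLOCK_SIZE = 4096  # assumed comparison granularity (one x86 page)


def copy_with_sync(src, dst):
    """Copy src to dst, then ask the kernel to flush the copy to the device."""
    shutil.copyfile(src, dst)
    with open(dst, "rb+") as f:
        os.fsync(f.fileno())


def mismatched_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Return the indices of the blocks at which the two files differ."""
    bad = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad.append(index)
            index += 1
    return bad
```

An empty list means the copy read back identically. If errors do show up,
the pattern of bad indices (isolated blocks versus long aligned runs) is
extra data worth including in a report, though by itself it does not
identify the cause.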

This behaviour has been observed with two different kernels:
2.6.23.9, compiled for 32-bit, and 2.6.26, compiled
for 64-bit.

Googling this uncovers some Dec 2006 LKML emails suggesting an
iommu problem, which I explored:
-- My default boot complains
    Your BIOS doesn't leave a aperture memory hole
    Please enable the IOMMU option in the BIOS setup
    This costs you 64 MB of RAM
-- I cannot find any option in BIOS that even vaguely hints at IOMMU-like
    function; at best, I can assign interrupts to PCI slots, but that's it.
    There's a bunch of IO options for olde-fashioned superio-like stuff:
    serial, parallel ports, USB stuff, etc. but that's all.
-- booting with iommu=soft does get rid of the aperture memory hole
   message, but does not solve the corruption problem.
-- booting with iommu=force seems to have no effect.
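
When a boot option appears to change nothing, one cheap check is to confirm
that it actually reached the running kernel. A minimal Python sketch,
assuming a Linux system that exposes the boot command line at /proc/cmdline;
the helper names iommu_options and read_cmdline are hypothetical and not
part of any kernel tooling:

```python
def iommu_options(cmdline):
    """Return the values of every iommu= parameter on a kernel command line."""
    values = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "iommu" and sep:
            # Several values may be given in one comma-separated parameter.
            values.extend(value.split(","))
    return values


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()
```

On the affected machine, iommu_options(read_cmdline()) should list "soft" or
"force" after the corresponding boot; an empty list would mean the
bootloader never passed the option.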

I'm running the powernow-k8 cpu frequency driver. On a hunch,
I wondered if this might be the source of the problem; however,
using the "performance" governor to keep the clock speed nailed
at maximum had no effect on the corruption bug.

Also of note:
-- problem was observed earlier, when system had 3GB RAM in it.
-- The integrated nvidia ethernet seems to work great, no errors, etc.
-- A different PCI ethernet card works great too.
-- I'm running graphics on an ancient matrox card in a PCI slot, and
    there's no hint of trouble there.
-- I'm using this system as my day-to-day desktop, and there seem to
   be no other problems. This suggests that if it's some pci iommu
   wackiness, it's certainly not affecting anything that isn't sata.

I really doubt the problem is the hard drive, but I'll have to buy another
one to rule this out. It's possible that there's some problem with the
sata_nv driver, but there have been historical reports of corruption
on amd64 with other sata controllers. I can buy another sata controller
if needed, to experiment.

Other than that, any ideas for any further experiments? What can
I do to narrow the problem?

-- Linas Vepstas

