public inbox for linux-kernel@vger.kernel.org
* File Corruption in Kernel 2.4.18
@ 2002-07-18  2:00 J. Hart
  2002-07-18  3:11 ` Kelledin
  2002-07-18  7:21 ` Ville Herva
  0 siblings, 2 replies; 9+ messages in thread
From: J. Hart @ 2002-07-18  2:00 UTC (permalink / raw)
  To: linux-kernel


     A large directory tree (70652 files, 7.6G) is copied recursively to an
empty destination directory using the following commands:

     mkdir aminet1/
     cp -a aminet aminet1/

     The source and destination directories are then compared using
the following command:

     diff -r aminet aminet1/aminet > difflist

     A few of the files at the copy destination, typically three or four, will
usually be corrupt while the source files will be correct.  Occasionally the
copy will be done without any corrupt files at the destination.  The
mem=nopentium option appears to have no effect on this.  An overnight test using
the memtest86 utility shows no memory errors.  The corruption in each file
occurs in precise 4096 byte blocks.  System logs show no evidence of any trouble,
and no kernel panics, warning messages or crashes are observed.  If there is any
other user activity while the copy is running, the system will frequently lock
up, requiring a hard reset and reboot.  This forces a file system check due to
the lack of a clean unmount.  System logs also show no evidence of any trouble
after the lockup, and no kernel panics or other messages have been observed.
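
     One way to narrow down which 4096 byte blocks differ between a source
file and its copy is a block-by-block comparison.  The following is only a
minimal Python sketch and was not used in the tests described here; the
function name and the example paths in the usage note are hypothetical.

```python
BLOCK_SIZE = 4096  # block size of the corruption reported above


def differing_blocks(src_path, dst_path, block_size=BLOCK_SIZE):
    """Return the indices of block_size-byte blocks in which two files differ.

    Assumes both paths are regular files, so read() returns full blocks
    until end of file. A length mismatch shows up as differing tail blocks.
    """
    bad = []
    with open(src_path, "rb") as src, open(dst_path, "rb") as dst:
        index = 0
        while True:
            a = src.read(block_size)
            b = dst.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad.append(index)
            index += 1
    return bad
```

     Calling differing_blocks("aminet/some/file", "aminet1/aminet/some/file")
on a file listed in difflist would return the damaged block numbers;
multiplying each by 4096 gives the byte offset, which shows whether the damage
is aligned to 4096 byte boundaries.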

     If a tar file is made of the source directory and then extracted, and the
resultant extracted directory compared with the original, similar effects are
observed.

     Are there any kernel boot or build parameters which could be used
to give additional diagnostics ?

motherboard   : ASUS A7V
Linux version : Slackware 8
Kernel        : 2.4.18
hard disk     : ATA100 IBM-DTLA-307045 45GB
hd controller : Promise Technology, Inc. 20265
cpu           : 900MHz AMD Athlon


* Re: File Corruption in Kernel 2.4.18
  2002-07-18  2:00 File Corruption in Kernel 2.4.18 J. Hart
@ 2002-07-18  3:11 ` Kelledin
  2002-07-18  7:21 ` Ville Herva
  1 sibling, 0 replies; 9+ messages in thread
From: Kelledin @ 2002-07-18  3:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: jhart

This could possibly be a problem with your hard drive.  Judging 
from the model number, you have a 45GB IBM DeskStar 75GXP, one 
of the first IBM drives to earn the nickname "DeathStar" for its 
high failure rate.  What does IBM's Drive Fitness Test tell you?

I'll see about performing your test tonight; I've got a hefty 
little DivX directory I can throw around as I wait for 
j2sdk-1.4.0 to finish compiling.  Such a test should be 
sufficient...

This could also be a recurrence of ye olde VIA686B PCI+IDE issue.  
IIRC, some VIA686B motherboards that had that flaw were 
effectively unfixable, simply because certain motherboard 
manufacturers spotted the problem before everyone else (even 
VIA?) and tried their own partial kludge fixes for it.  Gotta 
love VIA.

On Wednesday 17 July 2002 09:00 pm, J. Hart wrote:
>      A large directory tree (70652 files, 7.6G) is copied
> recursively to an empty destination directory using the
> following commands :
>
>      mkdir aminet1/
>      cp -a aminet aminet1/
>
>      The source and destination directories are then compared
> using the following commands:
>
>      diff -r aminet aminet1/aminet > difflist
>
>      A few of the files at the copy destination, typically
> three or four, will usually be corrupt while the source files
> will be correct.  Occasionally the copy will be done without
> any corrupt files at the destination.  The mem=nopentium
> option appears to have no effect on this.  An overnight test
> using the memtest86 utility shows no memory errors.  The
> corruption in each file occurs in precise 4096 byte blocks.
> System logs show no evidence of any trouble, and
> no kernel panics, warning messages or crashes are observed. 
> If there is any other user activity while the copy is running,
> the system will frequently lock up requiring a hard reset and
> reboot.  This forces a file system check due to the lack of a
> clean unmount.  System logs also show no evidence of any
> trouble after the lockup, and no kernel panics or other
> messages have been observed.
>
>      If a tar file is made of the source directory and then
> extracted, and the resultant extracted directory compared with
> the original, similar effects are observed.
>
>      Are there any kernel boot or build parameters which could
> be used to give additional diagnostics ?
>
> motherboard   : ASYS-A7V
> Linux version : Slackware 8
> Kernel        : 2.4.18
> hard disk     : ATA100 IBM-DTLA-307045 45gb
> hd controller : Promise Technology, Inc. 20265
> cpu           : 900mhz AMD Athlon

-- 
Kelledin
"If a server crashes in a server farm and no one pings it, does 
it still cost four figures to fix?"



* Re: File Corruption in Kernel 2.4.18
@ 2002-07-18  4:16 Kelledin
  0 siblings, 0 replies; 9+ messages in thread
From: Kelledin @ 2002-07-18  4:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: jhart

Ok, the test:

I chose directory /home/kelledin/gnutella.  It contains
approximately 10GB of files, ranging in size from ~5MB to
~700MB.  Most are ~600-650MB.

System specs are here:
http://www.anandtech.com/mysystemrig.html?rigid=5092

Only out-of-date info on that page is the kernel--I'm running
2.4.18+XFS-1.0.2+RML-preempt, compiled with gcc-2.95.3.  Kernel
was booted with "acpi=no-idle mem=nopentium" options.

While I was compiling j2sdk-1.4.0, I did the following:

[ kelledin@valhalla ~ ] # mkdir gnutella2
[ kelledin@valhalla ~ ] # cp -a gnutella gnutella2
[ kelledin@valhalla ~ ] # for FNAME in gnutella/*; do
    cmp "$FNAME" "gnutella2/$FNAME"; done

The "cp -a" operation took 19 minutes, during which the system
load reached approximately 4.0 and the CPU temperature held at
54 C.  Ambient case temperature held at 26 C.  Swap usage did
not change.  System was somewhat sluggish but responsive enough
to play an mp3 and allow me to open terminal windows.  j2sdk
compile is still apparently going strong.

The comparison check...well, it finished while I was away
getting a snack.  It printed no output, which means the check
probably completed successfully.  Maybe I'll run some md5 sums
later, just to be sure.
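
The md5 check mentioned above could be scripted roughly as follows.  This is
a minimal Python sketch, not what was actually run here; the function names
are made up for illustration, and it assumes both trees contain only regular
files and directories (a dangling symlink would make open() fail).

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_checksums(root):
    """Map each file's path relative to root to its MD5 digest."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            sums[os.path.relpath(full, root)] = md5_of_file(full)
    return sums


def mismatched_files(src_root, dst_root):
    """Return sorted relative paths that are missing or differ between trees."""
    src = tree_checksums(src_root)
    dst = tree_checksums(dst_root)
    return sorted(p for p in src.keys() | dst.keys() if src.get(p) != dst.get(p))
```

For the copy above, mismatched_files("gnutella", "gnutella2/gnutella") would
return an empty list if every file matched.  Note that a check run right after
the copy may be served partly from the page cache rather than from the disk.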

System load stayed at about 3.0, and temperatures remained
approximately the same as during the copy operation.

The relevant software:

kernel...well, you know.
glibc-2.2.5+linuxthreads+LSB+blowfish+math patches
libacl-2.0.11
libattr-2.0.8
bash-2.05a (Just for you, Hell.Surfers, just for you ;)
fileutils 4.1.8 with ACL patches and a Kelledin special.
Tarball can be found at:

ftp://skarpsey.dyndns.org/fileutils-4.1.8acl-kelledin.tar.bz2

Things that might be causing the corruption in our friend
J.Hart's case:

Buggy chipset (damn VIA!!!)
Faulty CPU (heat damage, chipped core?)
Faulty hard drive (hey, it's a DeathStar.)
Faulty IDE controller (if using offboard IDE)
Flaky cable (80-conductor ATA cable doesn't like being folded,
stacked, crumpled, etc., not even slightly)
Buggy IDE driver in the kernel
Buggy filesystem driver
Buggy fileutils
Buggy VM

I can't really test any of the possible software problems,
because I'm all SCSI, all XFS, bleeding-edge fileutils, and
didn't have any really significant swapping going on.  There's a
production server I could possibly test it on, but...well...it's
a production machine.  Maybe I'll repeat the test a few times
later.

On Wednesday 17 July 2002 10:11 pm, Kelledin wrote:
> This could possibly be a problem with your hard drive.
> Judging from the model number, you have a 45GB IBM DeskStar
> 75GXP, one of the first IBM drives to earn the nickname
> "DeathStar" for its high failure rate.  What does IBM's Drive
> Fitness Test tell you?
>
> I'll see about performing your test tonight; I've got a hefty
> little DivX directory I can throw around as I wait for
> j2sdk-1.4.0 to finish compiling.  Such a test should be
> sufficient...
>
> This could also be a recurrence of ye olde VIA686B PCI+IDE
> issue. IIRC, some VIA686B motherboards that had that flaw were
> effectively unfixable, simply because certain motherboard
> manufacturers spotted the problem before everyone else (even
> VIA?) and tried their own partial kludge fixes for it.  Gotta
> love VIA.

--
Kelledin
"If a server crashes in a server farm and no one pings it, does
it still cost four figures to fix?"


* Re: File Corruption in Kernel 2.4.18
  2002-07-18  2:00 File Corruption in Kernel 2.4.18 J. Hart
  2002-07-18  3:11 ` Kelledin
@ 2002-07-18  7:21 ` Ville Herva
  2002-07-18  7:47   ` Wilfried Weissmann
  1 sibling, 1 reply; 9+ messages in thread
From: Ville Herva @ 2002-07-18  7:21 UTC (permalink / raw)
  To: J. Hart; +Cc: linux-kernel

On Thu, Jul 18, 2002 at 11:00:05AM +0900, you [J. Hart] wrote:
> 
>      A few of the files at the copy destination, typically three or four, will
> usually be corrupt while the source files will be correct.  Occasionally the
> copy will be done without any corrupt files at the destination.  The
> mem=nopentium option appears to have no effect on this.  An overnight test using
> the memtest86 utility shows no memory errors.  The corruption in each file
> occurs in precise 4096 byte blocks.

> motherboard   : ASYS-A7V

Asus A7V is Via KT133 based, right? It has an additional Promise IDE
controller?

> Linux version : Slackware 8
> Kernel        : 2.4.18

Stock 2.4.18, no patches? Which filesystem are you using? Ext2, ext3, other?

> hard disk     : ATA100 IBM-DTLA-307045 45gb
> hd controller : Promise Technology, Inc. 20265

So the harddisk is connected to Promise, not Via? You have no other
harddisks?

> cpu           : 900mhz AMD Athlon

I had enormous trouble with a KT133 (A or not) based mobo (Abit KT7(A)-RAID)
in the past - it would just corrupt data when transferring big files from the
additional IDE controller (HPT370 in this case). The Via IDE controller
didn't show this behaviour.

- This happened on 2.2.20, 2.4.15, 2.4.18preX + ide-patch.
- Memtest86 showed nothing.
- Network activity seemed to have something to do with it.
- Changing the NIC to another PCI slot and tweaking BIOS params seemed to
  help, but eventually it happened again.
- I eventually concluded that KT133 corrupts PCI transfers under load, which
  others on the 'net found as well.
- Tried BIOS updates and contacting Via, Highpoint, Abit. Highpoint and Abit
  never cared to answer. Neither did Via until I spotted a Via employee on
  the viahardware.com forum. She said they'd investigate the issue; I never
  heard from her since.
- Ditched the mobo for good, bought an Abit ST6R, and never had a problem
  since. You may be lucky just switching the drive to the Via IDE.

First reports on the corruption:

http://marc.theaimsgroup.com/?l=linux-kernel&m=100651892331843&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=100669782329815&w=2
http://groups.google.com/groups?q=We+first+reported+disk+corruption+with+a+VIA+KT133A+based+board&hl=en&lr=&ie=UTF-8&oe=utf-8&selm=linux.kernel.00c201c1a033%241cf46700%24b71c64c2%40viasys.com&rnum=3

There was a long thread on forums.viahardware.com as well
(http://forums.viaarena.com/messageview.cfm?catid=6&threadid=7171&start=21),
but it seems they have censored it away for good.

I've also received reports of similar experiences from a number of people
since I wrote to linux-kernel about this.

I reproduced the problem with the wrchk utility I wrote
(http://iki.fi/v/tmp/wrchk.c), but it seems you can do it with your directory
tree copying.
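
The general idea of a write-then-read-back check can be sketched as below.
This is NOT wrchk.c and makes no claim about how wrchk works internally; it is
a minimal Python illustration with hypothetical function names, assuming a
4096-byte block size and a regular file as the target.

```python
import os

BLOCK_SIZE = 4096  # assumption: matches the 4096-byte damage seen in this thread


def pattern_block(index, block_size=BLOCK_SIZE, seed=0):
    """Build a deterministic block whose content encodes its own index."""
    stamp = f"{seed}:{index}:".encode("ascii")
    repeats = block_size // len(stamp) + 1
    return (stamp * repeats)[:block_size]


def write_pattern(path, blocks, block_size=BLOCK_SIZE, seed=0):
    """Write `blocks` pattern blocks to path and push them to the device."""
    with open(path, "wb") as f:
        for index in range(blocks):
            f.write(pattern_block(index, block_size, seed))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path, blocks, block_size=BLOCK_SIZE, seed=0):
    """Read the file back and return indices of blocks that do not match."""
    bad = []
    with open(path, "rb") as f:
        for index in range(blocks):
            if f.read(block_size) != pattern_block(index, block_size, seed):
                bad.append(index)
    return bad
```

Because each block carries its own index, a block written to the wrong place
is detected as well as a block with flipped bits. To exercise the disk rather
than the page cache, the amount written would need to exceed available RAM.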


-- v --

v@iki.fi


* Re: File Corruption in Kernel 2.4.18
  2002-07-18  7:21 ` Ville Herva
@ 2002-07-18  7:47   ` Wilfried Weissmann
  2002-07-21  2:52     ` J. Hart
       [not found]     ` <20020718081630.GX1465@niksula.cs.hut.fi>
  0 siblings, 2 replies; 9+ messages in thread
From: Wilfried Weissmann @ 2002-07-18  7:47 UTC (permalink / raw)
  To: Ville Herva; +Cc: J. Hart, linux-kernel

Ville Herva wrote:
> I had enormous trouble with a KT133(A or not) based mobo (Abit-KT7(A)-RAID
> in past - it would just corrupt data when transferring big files from the
> additional ide controller (HPT370 in this case). The Via ide controller
> didn't show this behaviour.

I have an Abit KT7-RAID with an AMD Thunderbird 800 and have also seen lots
of trouble. I finally figured out that reducing the memory bus
clock to 100MHz (instead of 133MHz) makes my system pretty stable (my
memory modules can take 133MHz! I checked the specs). Maybe the
chipset memory tweaks that the Linux kernel does are not enough to fix
all memory problems...

[snip]
> - Ditched the mobo for good, bought an Abit ST6R, and never had a problem
>   since. You may be lucky just switching the drive to Via ide.

Well, after messing around with the mobo for almost 2 years, it finally 
seems to be stable. But I wish I could have done useful stuff with my 
computer during that time.

[snip]
> I reproduced the problem with wrchk utility I wrote
> (http://iki.fi/v/tmp/wrchk.c) but it seems you can do it with your directory
> tree copying.

I got to check this out!

bye,
Wilfried



* Re: File Corruption in Kernel 2.4.18
  2002-07-18  7:47   ` Wilfried Weissmann
@ 2002-07-21  2:52     ` J. Hart
       [not found]     ` <20020718081630.GX1465@niksula.cs.hut.fi>
  1 sibling, 0 replies; 9+ messages in thread
From: J. Hart @ 2002-07-21  2:52 UTC (permalink / raw)
  To: linux-kernel


     I must apologize for the delay in replying to the many helpful 
responses I received on this problem, and I'd like to say "thank you 
very much" (Domo Arigato Gozaimasu) to the many who looked into this on 
my behalf....:-)

Here's the status so far:

     I ran the e2fsck utility which did not detect any problems on the 
hard drive.  I had not known about the IBM DFT utility until it was 
suggested by one of the responses, so I picked that up and tried it.  I 
ran the quick test, which immediately indicated two corrupt sectors.  I 
ran the Corrupt Sector Repair Utility (which does not seem to be 
documented in the manual that comes with DFT) after backing up, and 
repeated the tests a couple of times.  There were no more complaints 
from DFT.  I will be reloading my test directory, which I had to dump 
before the backup, and I will repeat the copy test on Monday after that 
to see if the file corruption still occurs.  

     I do not know what caused the corrupted sectors, but I am giving
serious thought to a new motherboard (to get rid of the chipset and
Promise controller), and perhaps a replacement for the "IBM DeathStar".
I'll let you all know the outcome of the tests on Monday.  I am still
curious about the precise 4k damaged blocks.







* Re: File Corruption in Kernel 2.4.18
       [not found]     ` <20020718081630.GX1465@niksula.cs.hut.fi>
@ 2002-07-22 10:10       ` Wilfried Weissmann
  0 siblings, 0 replies; 9+ messages in thread
From: Wilfried Weissmann @ 2002-07-22 10:10 UTC (permalink / raw)
  To: Ville Herva, linux-kernel

Ville Herva wrote:
> On Thu, Jul 18, 2002 at 09:47:33AM +0200, you [Wilfried Weissmann] wrote:
> 
>>[snip]
>>
>>>I reproduced the problem with wrchk utility I wrote
>>>(http://iki.fi/v/tmp/wrchk.c) but it seems you can do it with your directory
>>>tree copying.
>>
>>I got to check this out!
> 
> 
> I had the problem to appear almost certainly when doing wrchk to raw disks
> (you should be able to use large files just as well), two writes in parallel
> (eg. /dev/hde, /dev/hdg). Occasionally it took ~50GB of writing before it
> happened (multiple rounds), but it always did.

I did a simultaneous:
wrck /dev/hd[fh] 0 64 2
The two disks were connected to the HPT-370 controller and both were
configured as slave (the masters are configured into an ataraid-0 and
contain my root partition). The test disks were IBM DTLA-307030 (30GB)
with updated firmware. These disks are locked down to ata-44 by the
kernel and I only got a maximum I/O speed of 21.7 MB/s. During the read
phase one of the disks always slowed down, while the other disk
proceeded at normal speed. In the first run I got 7.2 MB/s and in the
second run the other disk slowed down to a crawling 5.3 MB/s, but the test
completed without any errors. *joy* However, I am not sure that the test
stressed the chipset enough to trigger the error, because the
throughput is so low.
My mainboard is an Abit KT7-RAID (VIA KT133 chipset), BIOS version 3R.
Memory bus was reduced to 100 MHz (SDR). Linux kernel 2.4.18 tainted by
NVidia(TM). ;)
DivX 5.0 seems to be a good stability test for VIA chipset based 
motherboards. It finds errors that not even memtest could detect.

greetings,
Wilfried

PS: I will do another run on my raid-0 root partition. The 2 disks that 
are part of the raid run at ata-100 (Maxtor 40GB).



* Re: File Corruption in Kernel 2.4.18
@ 2002-07-23  2:56 J. Hart
  2002-07-23  3:04 ` Thunder from the hill
  0 siblings, 1 reply; 9+ messages in thread
From: J. Hart @ 2002-07-23  2:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: jhart


Here is a further update on the file corruption question :

     I ran the DFT utility which picked up two bad sectors, which I then
repaired.  A rerun of DFT after that gave no further reports of any problems.  I
then tried the directory tree copy (cp -a aminet aminet1) which produced one
corrupted file at the destination.  All the other destination files in the tree
(70652 files, 7.6G) appeared to be correct.

     An additional rerun of the IBM DFT utility after this reported no problems
despite the corrupt copy.

     In order to resolve this issue, my employer is considering the replacement
of my current machine with a new one having the following specifications :


motherboard: Asus P4T AGP Pro/4X
ram        : 1GB
OS         : Linux 2.4.7-10 i686 unknown
CPU        : Intel(R) Pentium(R) 4 CPU 1800MHz
Gfx        : Matrox Graphics, Inc. MGA G400 AGP
drives     : Seagate 40GB UATA ST340810A (two of these)
controller : Intel PIIX4 Ultra 100 Chipset
           : (Intel Corporation 82801BA IDE U100)
chipset    : Intel Corporation 82850 850 (Tehama) Chipset Host Bridge (MCH)

     Are there any outstanding issues with machines of this new configuration as
there seemed to be with my old machine ?

With Thanks,

     J. Hart


* Re: File Corruption in Kernel 2.4.18
  2002-07-23  2:56 J. Hart
@ 2002-07-23  3:04 ` Thunder from the hill
  0 siblings, 0 replies; 9+ messages in thread
From: Thunder from the hill @ 2002-07-23  3:04 UTC (permalink / raw)
  To: J. Hart; +Cc: linux-kernel

Hi,

On Tue, 23 Jul 2002, J. Hart wrote:
> OS         : Linux 2.4.7-10 i686 unknown
> 
>      Are there any outstanding issues with machines of this new
> configuration

Maybe a new kernel. I think the rest should be OK.

							Regards,
							Thunder
-- 
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o?  K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y- 
------END GEEK CODE BLOCK------


