* Data being corrupted on reiserfs 3.6
From: Michael Barnwell @ 2006-01-14 22:41 UTC
To: reiserfs-list
Hi,
I'm experiencing data corruption when creating or copying data to my
reiserfs 3.6 partition mounted under /home. The following extract gives
a pretty clear indication that it's getting corrupted somewhere.
michael@biggs:/tmp$ mount
/dev/md0 on / type ext3 (rw,errors=remount-ro)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
usbfs on /proc/bus/usb type usbfs (rw)
tmpfs on /dev type tmpfs (rw,size=10M,mode=0755)
/dev/md2 on /home type reiserfs (rw)
michael@biggs:/tmp$ dd bs=1024 count=1000k if=/dev/urandom of=./1GB.tst
1024000+0 records in
1024000+0 records out
1048576000 bytes transferred in 231.749782 seconds (4524604 bytes/sec)
michael@biggs:/tmp$ md5sum 1GB.tst
48f46744c7e50c42c061a00d11541a85 1GB.tst
michael@biggs:/tmp$ cp 1GB.tst /home/michael/
michael@biggs:/tmp$ md5sum /home/michael/1GB.tst
042d8c462882f848412679e3cea03fe2 /home/michael/1GB.tst
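The check in the transcript above (hash the source, copy it, hash the
copy) could be scripted roughly as follows. This is only an illustrative
sketch and not part of the original report: the function names
md5_of_file and copies_match are hypothetical, and only the standard
hashlib module is assumed.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a 1 GB file never sits in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path, copy_path):
    """Return True when both files hash to the same MD5 digest."""
    return md5_of_file(source_path) == md5_of_file(copy_path)
```

With the paths from the transcript, copies_match("/tmp/1GB.tst",
"/home/michael/1GB.tst") would be expected to return False on the
affected machine, since the two md5sum values differ.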
I'm running Debian Sarge on an Athlon XP 2200+. /dev/md2 is made up of
four 400GB SATA hard disks on a Silicon Image 3114 controller in RAID 5.
Dmesg shows no errors whatsoever, the RAID array has been stable since I
installed it a couple of weeks ago, and the drive was formatted with
mkfs.reiserfs with no special options.
michael@biggs:/tmp$ uname -a
Linux biggs 2.6.8-2-k7 #1 Tue Aug 16 14:00:15 UTC 2005 i686 GNU/Linux
michael@biggs:/tmp$ cat /proc/mdstat
Personalities : [raid1] [raid5]
md2 : active raid5 sdd1[0] sdc1[3] sdb1[2] sda1[1]
1172126208 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
If anyone has any idea what's going on, I would much appreciate it.
Regards,
Michael Barnwell.
* Re: Data being corrupted on reiserfs 3.6
From: Jan Kara @ 2006-01-15 20:41 UTC
To: Michael Barnwell; +Cc: reiserfs-list
Hello,
> I'm experiencing data corruption when creating or copy data to my
> reiserfs 3.6 partition mounted under /home. The following extract gives
> a pretty clear indication that it's getting corrupted somewhere.
>
> michael@biggs:/tmp$ mount
> /dev/md0 on / type ext3 (rw,errors=remount-ro)
> proc on /proc type proc (rw)
> sysfs on /sys type sysfs (rw)
> devpts on /dev/pts type devpts (rw,gid=5,mode=620)
> tmpfs on /dev/shm type tmpfs (rw)
> usbfs on /proc/bus/usb type usbfs (rw)
> tmpfs on /dev type tmpfs (rw,size=10M,mode=0755)
> /dev/md2 on /home type reiserfs (rw)
>
> michael@biggs:/tmp$ dd bs=1024 count=1000k if=/dev/urandom of=./1GB.tst
> 1024000+0 records in
> 1024000+0 records out
> 1048576000 bytes transferred in 231.749782 seconds (4524604 bytes/sec)
>
> michael@biggs:/tmp$ md5sum 1GB.tst
> 48f46744c7e50c42c061a00d11541a85 1GB.tst
>
> michael@biggs:/tmp$ cp 1GB.tst /home/michael/
>
> michael@biggs:/tmp$ md5sum /home/michael/1GB.tst
> 042d8c462882f848412679e3cea03fe2 /home/michael/1GB.tst
Hmm, that is really strange. Do the files have the same size? Do you
also get an error if you just create a file full of zeros? If so, what
do the differences look like (e.g. any signs of flipped bits or so)?
> I'm running Debian Sarge on an Athlon XP 2200+, /dev/md2 is made up of
> four 400GB SATA hard disks on a Silicon Image 3114 controller in RAID 5.
> Dmesg is showing no errors what so ever, the RAID array has been stable
> since I installed it a couple of weeks ago and the drive was formatted
> with mkfs.reiserfs with no special options.
>
> michael@biggs:/tmp$ uname -a
> Linux biggs 2.6.8-2-k7 #1 Tue Aug 16 14:00:15 UTC 2005 i686 GNU/Linux
Any chance of trying a newer kernel? 2.6.8 is really old...
Honza
* Re: Data being corrupted on reiserfs 3.6
From: Michael Barnwell @ 2006-01-15 21:36 UTC
To: Jan Kara; +Cc: reiserfs-list
Hi,
Jan Kara wrote:
<snip>
> Hmm, that is really strange. Do the files have the same size? Do you
> get an error also if you just create file full of zeros? If so, how do
> the differences look like (e.g. any signs of flipped bits or so?).
>
michael@biggs:/tmp$ dd bs=1024 count=1000k if=/dev/zero of=./1GB.tst
1024000+0 records in
1024000+0 records out
1048576000 bytes transferred in 61.578769 seconds (17028207 bytes/sec)
michael@biggs:/tmp$ ls -l 1GB.tst
-rw-r--r-- 1 michael michael 1048576000 2006-01-15 20:51 1GB.tst
michael@biggs:/tmp$ md5sum 1GB.tst
e5c834fbdaa6bfd8eac5eb9404eefdd4 1GB.tst
michael@biggs:/tmp$ ls -l /home/michael/1GB.tst
-rw-r--r-- 1 michael michael 1048576000 2006-01-15 20:54 /home/michael/1GB.tst
michael@biggs:/tmp$ md5sum /home/michael/1GB.tst
92c51557041ebd6424b4467a878c9f44 /home/michael/1GB.tst
I looked at the file in /home/michael/1GB.tst with xxd for about 5
minutes but couldn't see anything but zeros - I'm not sure how to search
through a binary file for non-zero bytes.
So yes: the error also occurs when the file is all zeros, and the two
files have the same size.
Thanks,
Michael Barnwell.
* Re: Data being corrupted on reiserfs 3.6
From: Pierre Etchemaïté @ 2006-01-15 22:29 UTC
To: reiserfs-list
On Sun, 15 Jan 2006 21:36:20 +0000, Michael Barnwell <xterminate@xterminate.me.uk> wrote:
> I'm not sure how to search
> through a binary file for non-zero bytes.
cmp -b ?
* Re: Data being corrupted on reiserfs 3.6
From: Michael Barnwell @ 2006-01-15 23:02 UTC
To: reiserfs-list
Thanks,
Pierre Etchemaïté wrote:
> On Sun, 15 Jan 2006 21:36:20 +0000, Michael Barnwell <xterminate@xterminate.me.uk> wrote:
>
>> I'm not sure how to search
>> through a binary file for non-zero bytes.
>
> cmp -b ?
michael@biggs:~$ cmp -b /tmp/1GB.tst /home/michael/1GB.tst
/tmp/1GB.tst /home/michael/1GB.tst differ: byte 68494094, line 1 is 0 ^@ 40
That seems to stop after the first difference, so I did:
michael@biggs:~$ cmp -bl /tmp/1GB.tst /home/michael/1GB.tst | wc -l
243
The full output of cmp -bl is at http://pastebin.com/507389
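For readers without cmp at hand, the same kind of listing can be
approximated in Python. This is a minimal sketch under stated
assumptions: differing_bytes is a hypothetical name, offsets are 1-based
to match what cmp prints, byte values are returned as plain integers
rather than the octal that cmp -l shows, and comparison stops at the end
of the shorter file.

```python
def differing_bytes(path_a, path_b, chunk_size=1024 * 1024):
    """Yield (offset, byte_a, byte_b) for each position where the files differ.

    Offsets are 1-based, like the byte numbers printed by cmp.
    """
    offset = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if not chunk_a or not chunk_b:
                break  # reached the end of at least one file
            if chunk_a != chunk_b:
                # Only walk byte by byte through chunks that actually differ.
                for index, (a, b) in enumerate(zip(chunk_a, chunk_b)):
                    if a != b:
                        yield offset + index + 1, a, b
            offset += min(len(chunk_a), len(chunk_b))
```

Counting the results, for example with sum(1 for _ in
differing_bytes(a, b)), plays the same role as piping cmp -l into wc -l.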
Regards,
Michael.
* Re: Data being corrupted on reiserfs 3.6
From: Hans Reiser @ 2006-01-16 9:45 UTC
To: Michael Barnwell; +Cc: Jan Kara, reiserfs-list
Normally we tell people that data corruption bugs in V3 must be bad
hardware, and empirically this has been good advice in the past.
However, my vague memory is that there have been more reports than usual
of problems with V3 recently. Can you try to reproduce it on different
hardware, or look through our mailing list traffic for similar recent
problems? If you can reproduce it on different memory and CPU, or find
someone else posting with a similar experience, then I'll ask the guys
to try to reproduce it also. Bad hardware can fail for reiserfs and not
for ext3 because we make the CPU 1-2C hotter, and things like fans
blocked from turning by cables really can matter for us and not for ext3
(a real user experience of this was just described). In any event, even
if it turns out to be bad hardware, please let me know.
Hans
* Re: Data being corrupted on reiserfs 3.6
From: Jan Kara @ 2006-01-16 10:41 UTC
To: Michael Barnwell; +Cc: reiserfs-list
Hello,
> Jan Kara wrote:
>
> <snip>
> > Hmm, that is really strange. Do the files have the same size? Do you
> >get an error also if you just create file full of zeros? If so, how do
> >the differences look like (e.g. any signs of flipped bits or so?).
> >
>
> michael@biggs:/tmp$ dd bs=1024 count=1000k if=/dev/zero of=./1GB.tst
> 1024000+0 records in
> 1024000+0 records out
> 1048576000 bytes transferred in 61.578769 seconds (17028207 bytes/sec)
> michael@biggs:/tmp$ ls -l 1GB.tst
> -rw-r--r-- 1 michael michael 1048576000 2006-01-15 20:51 1GB.tst
> michael@biggs:/tmp$ md5sum 1GB.tst
> e5c834fbdaa6bfd8eac5eb9404eefdd4 1GB.tst
> michael@biggs:/tmp$ ls -l /home/michael/1GB.tst
> -rw-r--r-- 1 michael michael 1048576000 2006-01-15 20:54 /home/michael/1GB.tst
> michael@biggs:/tmp$ md5sum /home/michael/1GB.tst
> 92c51557041ebd6424b4467a878c9f44 /home/michael/1GB.tst
>
> I looked at the file in /home/michael/1GB.tst with xdd for about 5
> minutes but couldn't see anything but zeros - I'm not sure how to search
> through a binary file for non-zero bytes.
You can use 'od -t x1 <file>' - it collapses runs of identical output
lines into a single '*', so you should see the non-zero bytes easily...
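If od is not convenient, the same search for non-zero bytes can be
sketched in a few lines of Python. This is an illustration only: the
name nonzero_bytes is hypothetical, and offsets here are 0-based (cmp
would report the same positions as one higher).

```python
def nonzero_bytes(path, chunk_size=1024 * 1024):
    """Yield (offset, value) for every non-zero byte in the file at `path`."""
    offset = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Fast path: skip chunks that consist entirely of zero bytes.
            if chunk.count(0) != len(chunk):
                for index, value in enumerate(chunk):
                    if value:
                        yield offset + index, value
            offset += len(chunk)
```

Skipping all-zero chunks keeps the scan quick on a file that is supposed
to be zeros throughout, which is the case being tested here.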
As Hans said, such problems are usually hardware problems (memory,
overheating processor, flaky disk controller etc.).
BTW: I generated the same file as you and the md5sum of the one on
reiserfs is the same as mine. So the file is stored correctly and
something really does go wrong during the copy from /tmp to
/home/michael. I looked at the differences and they don't seem to be
random. It's always a chunk of 3-16 bytes that gets corrupted. The
values written there also do not seem to be random (lots of bytes with
code 16, 54, 128,...). I'll investigate more later...
So this could be some memory corruption. To check this, it would be
useful if you could try to reproduce the problem with a 2.6.15 kernel.
The problem might well be fixed there.
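The observation that the damage comes in short chunks can be checked
mechanically by grouping the differing byte offsets (for instance the
first column of cmp -l output) into contiguous runs. The helper below is
a hypothetical sketch, not something used in this thread, and assumes
the offsets are given in ascending order.

```python
def group_runs(offsets):
    """Group ascending byte offsets into (start, length) runs of consecutive bytes."""
    runs = []
    for offset in offsets:
        if runs and offset == runs[-1][0] + runs[-1][1]:
            # The offset directly follows the current run, so extend it.
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            # A gap was found, so start a new run at this offset.
            runs.append((offset, 1))
    return runs


# Example: group_runs([5, 6, 7, 20, 21, 40]) returns [(5, 3), (20, 2), (40, 1)]
```

Looking at the run lengths and at where the runs start would show
whether the corrupted chunks really are all 3-16 bytes long and whether
they fall at regular positions.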
Honza
* Re: Data being corrupted on reiserfs 3.6
From: Michael Barnwell @ 2006-01-22 12:12 UTC
To: Jan Kara; +Cc: reiserfs-list
Hi,
Jan Kara wrote:
<snip>
>>
> You can use 'od -t x1 <file>' - it should squeeze repeating characters
> so you should see the non-zero ones easily...
> As Hans said usually such problems are hardware problems (memory,
> overheating processor, flaky disk controler etc.).
> BTW: I generated the same file as you and md5sum of the one on
> reiserfs is same as mine. So the file is stored correctly and something
> wrong really happens during the copy from /tmp to /home/michael. I
> looked at the differences and they don't seem to be random. It's always
> a chunk of 3-16 bytes that gets corrupted. Then numbers written there
> also do not seem to be random (lots of characters with code 16, 54,
> 128,...). I'll investigate more later...
> So this could be some memory corruption - for checking out this it
> would be useful if you could try to reproduce the problem with 2.6.15
> kernel. The problem might well be fixed there.
I finally upgraded to 2.6.15-1 and I'm still seeing the same problem
there. It's possibly a memory issue or a flaky disk controller. The
controller is a Silicon Image 3114 PCI card that I had not used before
these hard disks, so it's a more likely culprit than the memory, which
has been going fine for a couple of years without any problems, but I
will run memtest86 when I get the chance.
Oh, and I don't know if I mentioned this before, but the corruption only
ever occurs on writing, not reading.
Can anyone suggest a test to tell if it is the disk controller?
Regards,
Michael.
* Re: Data being corrupted on reiserfs 3.6
From: Jan Kara @ 2006-01-24 14:09 UTC
To: Michael Barnwell; +Cc: reiserfs-list
Hello,
> Jan Kara wrote:
> <snip>
> >>
> > You can use 'od -t x1 <file>' - it should squeeze repeating characters
> >so you should see the non-zero ones easily...
> > As Hans said usually such problems are hardware problems (memory,
> >overheating processor, flaky disk controler etc.).
> > BTW: I generated the same file as you and md5sum of the one on
> >reiserfs is same as mine. So the file is stored correctly and something
> >wrong really happens during the copy from /tmp to /home/michael. I
> >looked at the differences and they don't seem to be random. It's always
> >a chunk of 3-16 bytes that gets corrupted. Then numbers written there
> >also do not seem to be random (lots of characters with code 16, 54,
> >128,...). I'll investigate more later...
> > So this could be some memory corruption - for checking out this it
> >would be useful if you could try to reproduce the problem with 2.6.15
> >kernel. The problem might well be fixed there.
>
> I finally upgraded to 2.6.15-1 and I'm still seeing the same problem
> there - It's possibly its a memory issue or flaky disk controller, it's
> a Silicon Image 3114 PCI card that I've not used before these hard
> disks, it's more likely then memory which has been going fine for a
> couple of years without any problems but I will run memtest86 when I get
> the chance.
Ok. Have you tried to reproduce the problem on some other hardware
(with some other controller)?
> Oh and I don't know if I mentioned this before but the corruption only
> ever occurs on writing not reading.
>
> Can anyone suggest a test to tell if it is the disk controller?
I have no other idea than to try different hardware...
Honza
* Re: Data being corrupted on reiserfs 3.6
From: Barry K. Nathan @ 2006-01-24 17:59 UTC
To: Michael Barnwell; +Cc: Jan Kara, reiserfs-list
On 1/22/06, Michael Barnwell <xterminate@xterminate.me.uk> wrote:
> Can anyone suggest a test to tell if it is the disk controller?
If you don't mind wiping out the entire disk, I guess this would work
(I've never tried it on an actual failing disk controller though):
badblocks -w -s -t random /dev/whateverdisk
(or I guess you could try making a small partition and running it on that)
If badblocks lists tons of supposedly bad blocks, then it's not the
filesystem. The output might be easier to read without -s, but then
you'll have no indication of progress as it runs. Maybe this
incantation would be better:
badblocks -w -s -t random -o /some/file /dev/whateverdisk
Then you can keep a second shell open and check whether /some/file has
any blocks listed in it. If so, then that too eliminates the
filesystem as a suspect.
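The idea behind the write-mode test (write a known pseudo-random
pattern, read it back, list the blocks that do not match) can be
sketched in Python as below. This is not how badblocks itself is
implemented, and the names write_pattern and verify_pattern are
hypothetical. It is also a much weaker test: it works on an ordinary
scratch file rather than the raw device, and the read-back may be served
from the page cache instead of the disk, so a clean result proves
little.

```python
import os
import random


def _pattern(seed, block_count, block_size):
    """Yield the same sequence of pseudo-random blocks for a given seed."""
    rng = random.Random(seed)
    for _ in range(block_count):
        yield bytes(rng.getrandbits(8) for _ in range(block_size))


def write_pattern(path, block_count, block_size=1024, seed=0):
    """Write the seeded pattern to `path` and ask the OS to flush it to disk."""
    with open(path, "wb") as handle:
        for block in _pattern(seed, block_count, block_size):
            handle.write(block)
        handle.flush()
        os.fsync(handle.fileno())


def verify_pattern(path, block_count, block_size=1024, seed=0):
    """Return the numbers (0-based) of blocks that differ from the pattern."""
    bad_blocks = []
    with open(path, "rb") as handle:
        expected_blocks = _pattern(seed, block_count, block_size)
        for number, expected in enumerate(expected_blocks):
            if handle.read(block_size) != expected:
                bad_blocks.append(number)
    return bad_blocks
```

Regenerating the pattern from the seed during verification avoids
keeping a second copy of the data in memory or on another disk.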
--
-Barry K. Nathan <barryn@pobox.com>