how to deal with continuously getting more errors?

linux-raid.vger.kernel.org archive mirror

* how to deal with continuously getting more errors?
@ 2007-07-14 18:15 jeff stern
  2007-07-14 21:03 ` Justin Piszcz
  2007-07-18 23:23 ` Neil Brown
  0 siblings, 2 replies; 4+ messages in thread
From: jeff stern @ 2007-07-14 18:15 UTC
  To: linux-raid

hi, everyone..  i have a problem.

SUMMARY

i've got a linux software RAID1 setup, with 2 SATA drives (/dev/sdf1,
/dev/sdg1) set up to be /dev/md0. these 2 drives together hold my
/home directories. the / and / partitions are on another drive, a
standard parallel IDE (/dev/hda). (I can provide more hardware
information if someone needs it).

the problem is that new errors (mismatch_cnt discrepancies) between
the two disks keep coming up: weekly, even daily, and i don't know what
to do or how to handle it.

How many mismatch_cnts between two almost-new drives running in a
healthy RAID1 array should one expect in a year? in a month? a day?

And more importantly, What do i do now?

EXTENDED DESCRIPTION OF PROBLEM

i first noticed this problem when i downloaded the fedora core 7 .iso,
and did a checksum on it, and it didn't match. with a little more
investigating, i found that i could make a copy of any large file on
disk, and its copy would sometimes match, sometimes not.

here is a typical session:
------------------------------------------------------------------------------------------
$ cp F-7-i386-DVD.iso F.iso
$ cmp F-7-i386-DVD.iso F.iso
F-7-i386-DVD.iso F.iso differ: byte 1033827385, line 3789612
$ cmp F-7-i386-DVD.iso F.iso
$ cmp F-7-i386-DVD.iso F.iso
F-7-i386-DVD.iso F.iso differ: byte 1033827385, line 3789612
$ cmp F-7-i386-DVD.iso F.iso
F-7-i386-DVD.iso F.iso differ: byte 8870221, line 37265
$ cmp F-7-i386-DVD.iso F.iso
F-7-i386-DVD.iso F.iso differ: byte 8870221, line 37265
$ _
------------------------------------------------------------------------------------------

as you can see, sometimes the file matches. more often, it doesn't.
when it doesn't, it's not always even at the same point in the file.

this was a bit confusing.

i tried doing these types of file copy/compares in the /tmp directory
(on the /dev/hda drive), and got 0 problems after many attempts.
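
The copy/compare test described above can be repeated in a loop instead of by hand. This is only a minimal sketch: the function name copy_compare and the scratch file name are made up for illustration, and it assumes a POSIX shell with cp and cmp available. Note that for files smaller than RAM, the comparison may be answered from the page cache rather than from the disks.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC into DIR several times and compare each copy
# against the original, counting how many comparisons fail.
copy_compare() {
    src=$1
    dir=$2
    rounds=${3:-5}
    failures=0
    i=1
    while [ "$i" -le "$rounds" ]; do
        copy="$dir/copycheck.$$.$i"
        cp -- "$src" "$copy" || return 2
        # cmp -s prints nothing; its exit status is non-zero if the files differ
        if ! cmp -s -- "$src" "$copy"; then
            failures=$((failures + 1))
        fi
        rm -f -- "$copy"
        i=$((i + 1))
    done
    echo "$failures of $rounds comparisons failed"
    # Succeed only when every comparison matched
    [ "$failures" -eq 0 ]
}
```

Something like `copy_compare F-7-i386-DVD.iso /home/tmp 5` on the array and `copy_compare F-7-i386-DVD.iso /tmp 5` on the IDE drive would then give two directly comparable counts (the /home/tmp path is just an example).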

"Okay," i said to myself, "it's probably not the RAM or the system in
general: it's either the SATA hard drives or it's their controller."

not knowing how to test the serial ATA controller by itself, i decided
to delve into linux software raid and see what i could find.

i went to the linux software raid how-to
(http://tldp.org/HOWTO/Software-RAID-HOWTO.html), but (rather
disappointingly) there was nothing on this problem that i could find
in that document. after several reads.

i also found a linux software raid faq
(http://www.faqs.org/contrib/linux-raid/x37.html), but again, no
reference to these types of problems.

i googled around a bit, and found this group archived at
http://marc.info/?l=linux-raid&r=1&w=2 , and searched and searched
through the messages. i did not find exactly my problem, but i did see
bits and pieces of advice. a couple of these led me to SMART, so i
tested my 2 disks, and found they were/are healthy (at least as far as
they are reporting): when i ran smartctl -t long /dev/sdf1 (and sdg1),
the tests on each drive completed without error, and all the pre-fail
and old-age attributes are fine on these drives (they are less than a
year old, so that should not be surprising).

looking at more of the archives, i discovered i could do a couple of
tests. YES! finally, how to diagnose the problem! these tests included
this general regimen, apparently:

1. run
   echo check >> /sys/block/md0/md/sync_action
2. monitor progress with
   watch -n1 'cat /proc/mdstat'
3. afterwards:
   cat /sys/block/md0/md/mismatch_cnt
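
Steps 1-3 can be strung together so the count is printed once the check finishes, rather than watching /proc/mdstat by hand. A rough sketch, run as root: the names MD_SYSFS, md_wait_idle and md_check are invented here, and it assumes sync_action reads back "idle" once the check is over and shows "check" as soon as one has been requested (a short sleep before polling may be safer if that does not hold on a given kernel).

```shell
#!/bin/sh
# MD_SYSFS is a hypothetical variable naming the array's sysfs directory.
MD_SYSFS=${MD_SYSFS:-/sys/block/md0/md}

# Poll sync_action in directory $1 every $2 seconds (default 10)
# until it reads "idle", i.e. no check/repair/resync is running.
md_wait_idle() {
    while [ "$(cat "$1/sync_action")" != "idle" ]; do
        sleep "${2:-10}"
    done
}

# Request a check, wait for it to finish, then print mismatch_cnt.
md_check() {
    dir=${1:-$MD_SYSFS}
    echo check > "$dir/sync_action" || return 1
    md_wait_idle "$dir" "${2:-10}"
    cat "$dir/mismatch_cnt"
}
```

Calling `md_check` with no arguments would run against md0; another array's directory can be passed as the first argument.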

when i did this, in step 3, i got:

   102656

"over a hundred thousand mismatches?" i thought. "how did THIS happen?
i've had this disk setup for only 6 months! and isn't this RAID!?
aren't these problems supposed to be managed by RAID? what the heck is
going to happen to my data? are my backups fine? or have those been
compromised, too?"

in more reading through the archives, i found that mismatches can
happen, and that indeed linux software raid does not handle them
automatically. furthermore, that several people have found out the
hard way that backups do not help, either, because (in one case, for
months) people found that all they're doing is backing up erroneous
data. LOVELY.

furthermore, i discovered that there was a way to fix them (i.e.,
"sync" the drives). however, this fixing procedure came with a caveat.
 this caveat was something that i should have realized the importance
of in the first place: that a RAID 1 system with only two drives is
going to have a problem when repairing. the problem is that when
sync'ing the drives, whenever a mismatch is found, a decision must be
made as to which drive has the correct data: drive 1 or drive 2? and
that apparently, it's just a toss-up, and the repair program just
picks randomly.

"WHAAAAT????????????"

yeap. so, it's really better to either go with RAID 5, or to have a
RAID 1 system with 3 or more disks.

"gee, sure would have been nice knowing that going in! is that in the HOWTO?"

not really.

(though it's unclear to me that the linux software raid "echo repair"
facility, if faced with 3 (or more drives) would do the "statistics"
and poll all drives and pick the "answer" most commonly given.. would
it?)

so, with this form of repair, if the mismatch is under a jpeg file,
you might get a pixel different. big deal. but if the mismatch is
under your Quicken/GnuCash/Moneydance data files?

"Houston, we have a problem."

well, but what choice did i have?  i made a backup (another supposedly
erroneous one) and took the dive. i followed the posters'
instructions, and attempted a syncing/repair, this way:

4. run
   echo repair >> /sys/block/md0/md/sync_action
5. monitor progress with
   watch -n1 'cat /proc/mdstat'
6. afterwards:
   cat /sys/block/md0/md/mismatch_cnt

now the first time i ran this, i got a mismatch_cnt of

   102656

..which is perfect, because according to the poster's comments, this
means that 102,656 mismatches were REPAIRED. excellent. also,
according to the poster, should i run steps 1,2 & 3 again, i should
*now* see a mismatch_cnt of 0. i did so, and indeed saw 0 mismatches.
Lovely!

also, according to some other posters, linux software raid does not
manage these mismatches, and one should write one's own scripts to run
these steps on a regular basis and report on them (as well as
monitoring smartd's output).
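
A reporting script of the kind those posters describe might look roughly like this. It is a sketch only: the function name md_log_mismatch and the log format are invented for illustration, and it assumes a check has already completed so that mismatch_cnt holds the latest result.

```shell
#!/bin/sh
# Hypothetical helper: append a timestamped mismatch_cnt reading for the
# array whose sysfs directory is $1 to the log file $2.
md_log_mismatch() {
    dir=$1
    log=$2
    count=$(cat "$dir/mismatch_cnt") || return 2
    printf '%s %s mismatch_cnt=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$dir" "$count" >> "$log"
    # Non-zero exit status when mismatches were found, so a wrapper
    # (for example a cron job) can send an alert.
    [ "$count" -eq 0 ]
}
```

Run weekly, e.g. `md_log_mismatch /sys/block/md0/md /var/log/md-mismatch.log` (the log path is an example), this would leave a history showing whether the count keeps growing between repairs.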

"but wait. if you order now, you also get.."

i did not immediately write scripts, but i waited a week (2 days ago)
and ran steps 1-3 again manually. i found a mismatch_cnt of 512.  "i
got 512 new mismatches in only a week?" i thought. "that's just wrong.
these are essentially new disks, and there just should NOT be that
many errors."

in any case i repaired them (steps 4-6).

i waited 1 day.

i did the tests again. 128 mismatches.

"wait! I just fixed them ***yesterday***!!!! Aaaaaarrrrggghhhh!!!!!"

to wit, my original questions:

what is even the normal mismatch_cnt one could, or should expect 2
drives to have in a year? 3? 10? 0?

what do i do now?  what is the repair or diagnostic procedure at this
point?  any suggestions?  what could be going wrong?  i *really* don't
think 2 almost new drives should be coming up with 128 mismatches in a
single day. so at this point, my RAID array is completely
untrustworthy, and i cannot store any important information on these
drives.

any/all help would be much appreciated.

thank you.


* Re: how to deal with continuously getting more errors?
  2007-07-14 18:15 how to deal with continuously getting more errors? jeff stern
@ 2007-07-14 21:03 ` Justin Piszcz
  2007-07-18 23:23 ` Neil Brown
  1 sibling, 0 replies; 4+ messages in thread
From: Justin Piszcz @ 2007-07-14 21:03 UTC
  To: jeff stern; +Cc: linux-raid



On Sat, 14 Jul 2007, jeff stern wrote:

> hi, everyone..  i have a problem.
>
> SUMMARY
>
> i've got a linux software RAID1 setup, with 2 SATA drives (/dev/sdf1,
> /dev/sdg1) set up to be /dev/md0. these 2 drives together hold my
> /home directories. the / and / partitions are on another drive, a
> standard parallel IDE (/dev/hda). (I can provide more hardware
> information if someone needs it).
>
> the problem is that new errors (mismatch_cnt discrepancies) between
> the two disks keep coming up: weekly, even daily, and i don't know what
> to do or how to handle it.
>
> How many mismatch_cnts between two almost-new drives running in a
> healthy RAID1 array should one expect in a year? in a month? a day?
>
> And more importantly, What do i do now?
>
> EXTENDED DESCRIPTION OF PROBLEM
>
> i first noticed this problem when i downloaded the fedora core 7 .iso,
> and did a checksum on it, and it didn't match. with a little more
> investigating, i found that i could make a copy of any large file on
> disk, and its copy would sometimes match, sometimes not.
>
> here is a typical session:
> ------------------------------------------------------------------------------------------
> $ cp F-7-i386-DVD.iso F.iso
> $ cmp F-7-i386-DVD.iso F.iso
> F-7-i386-DVD.iso F.iso differ: byte 1033827385, line 3789612
> $ cmp F-7-i386-DVD.iso F.iso
> $ cmp F-7-i386-DVD.iso F.iso
> F-7-i386-DVD.iso F.iso differ: byte 1033827385, line 3789612
> $ cmp F-7-i386-DVD.iso F.iso
> F-7-i386-DVD.iso F.iso differ: byte 8870221, line 37265
> $ cmp F-7-i386-DVD.iso F.iso
> F-7-i386-DVD.iso F.iso differ: byte 8870221, line 37265
> $ _
> ------------------------------------------------------------------------------------------


Something sounds very strange here. I have a script that runs the 'check'
once a week for my RAID1 partitions and it is generally 0 every time,
except for the swap partition (occasionally), which Neil has mentioned is
normal.  You bring up a lot of good points, though; however, I am not sure
why you are experiencing so many mismatches.

Justin.


* Re: how to deal with continuously getting more errors?
  2007-07-14 18:15 how to deal with continuously getting more errors? jeff stern
  2007-07-14 21:03 ` Justin Piszcz
@ 2007-07-18 23:23 ` Neil Brown
  2007-07-28  3:55   ` jeff stern
  1 sibling, 1 reply; 4+ messages in thread
From: Neil Brown @ 2007-07-18 23:23 UTC
  To: jeff stern; +Cc: linux-raid

On Saturday July 14, jas.61803+lr@gmail.com wrote:
> 
> EXTENDED DESCRIPTION OF PROBLEM
> 
> i first noticed this problem when i downloaded the fedora core 7 .iso,
> and did a checksum on it, and it didn't match. with a little more
> investigating, i found that i could make a copy of any large file on
> disk, and its copy would sometimes match, sometimes not.
> 
> here is a typical session:
> ------------------------------------------------------------------------------------------
> $ cp F-7-i386-DVD.iso F.iso
> $ cmp F-7-i386-DVD.iso F.iso
> F-7-i386-DVD.iso F.iso differ: byte 1033827385, line 3789612
> $ cmp F-7-i386-DVD.iso F.iso
> $ cmp F-7-i386-DVD.iso F.iso
> F-7-i386-DVD.iso F.iso differ: byte 1033827385, line 3789612
> $ cmp F-7-i386-DVD.iso F.iso
> F-7-i386-DVD.iso F.iso differ: byte 8870221, line 37265
> $ cmp F-7-i386-DVD.iso F.iso
> F-7-i386-DVD.iso F.iso differ: byte 8870221, line 37265
> $ _
> ------------------------------------------------------------------------------------------

This clearly indicates a hardware problem.
You tried in /tmp and didn't get this sort of result, so it probably
isn't RAM/CPU.
Next step is to break the raid1, mount each drive as a separate
filesystem and do the same test on each filesystem.
If one works and the other fails, then it must be something specific
to the faulty device.  If they are on the same controller, it must be
drive or cable, so swap cables and try again.
If they are on different controllers, try swapping controllers too.

If both filesystems show the same problem, it must be something
common, maybe the controller.  Try to find an alternate controller to
test with.  Narrow it down to the faulty component, and replace it.
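
As a rough illustration of the "break the raid1 and mount each drive" step, the sketch below stops the array and mounts each member on its own. Several things here are assumptions rather than anything confirmed in this thread: the names run, split_and_mount and DRY_RUN and the /mnt/member-a and /mnt/member-b mount points are invented, the filesystem is taken to be ext3, and mounting a member directly only works when the md superblock sits at the end of the partition. Once the members have been written to separately they no longer match, so the array would have to be rebuilt from one of them afterwards.

```shell
#!/bin/sh
# With DRY_RUN set, only print each command (prefixed with "+")
# instead of running it, so the plan can be reviewed first.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: stop array $1 and mount members $2 and $3 separately.
split_and_mount() {
    md=$1
    a=$2
    b=$3
    run umount "$md"
    run mdadm --stop "$md"
    run mkdir -p /mnt/member-a /mnt/member-b
    # Filesystem type is an assumption; adjust to what the array holds.
    run mount -t ext3 "$a" /mnt/member-a
    run mount -t ext3 "$b" /mnt/member-b
}
```

With `DRY_RUN=1` set, `split_and_mount /dev/md0 /dev/sdf1 /dev/sdg1` just lists the commands; the copy/compare test can then be repeated under each mount point.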

> 
> 
> furthermore, i discovered that there was a way to fix them (i.e.,
> "sync" the drives). however, this fixing procedure came with a caveat.
>  this caveat was something that i should have realized the importance
> of in the first place: that a RAID 1 system with only two drives is
> going to have a problem when repairing. the problem is that when
> sync'ing the drives, whenever a mismatch is found, a decision must be
> made as to which drive has the correct data: drive 1 or drive 2? and
> that apparently, it's just a toss-up, and the repair program just
> picks randomly.
> 
> "WHAAAAT????????????"
> 
> yeap. so, it's really better to either go with RAID 5, or to have a
> RAID 1 system with 3 or more disks.
> 
This is not true at all.
If the difference is due to the drive subsystem returning bad data
(rather than indicating a read error), then no RAID system is safe.
If the difference is due to the kernel writing different data to the
two drives (as happens sometimes on swap or with memory-mapped files),
then both copies of the data are equally correct, and there isn't
really a problem.

NeilBrown


* Re: how to deal with continuously getting more errors?
  2007-07-18 23:23 ` Neil Brown
@ 2007-07-28  3:55   ` jeff stern
  0 siblings, 0 replies; 4+ messages in thread
From: jeff stern @ 2007-07-28  3:55 UTC
  To: Neil Brown; +Cc: linux-raid

thanks for responding, justin and neil.. and for your suggestions.

well, i tried neil's suggestion.. see my info, below.. i'd be grateful
for any suggestions. thank you.

On 7/18/07, Neil Brown <neilb@suse.de> wrote:
> On Saturday July 14, jas.61803+lr@gmail.com wrote:
> >
> > EXTENDED DESCRIPTION OF PROBLEM
> >
> > i first noticed this problem when i downloaded the fedora core 7 .iso,
> > and did a checksum on it, and it didn't match. with a little more
> > investigating, i found that i could make a copy of any large file on
> > disk, and its copy would sometimes match, sometimes not.
> >
> > here is a typical session:
> > ------------------------------------------------------------------------------------------
> > $ cp F-7-i386-DVD.iso F.iso
> > $ cmp F-7-i386-DVD.iso F.iso
> > F-7-i386-DVD.iso F.iso differ: byte 1033827385, line 3789612
> > $ cmp F-7-i386-DVD.iso F.iso
> > $ cmp F-7-i386-DVD.iso F.iso
> > F-7-i386-DVD.iso F.iso differ: byte 1033827385, line 3789612
> > $ cmp F-7-i386-DVD.iso F.iso
> > F-7-i386-DVD.iso F.iso differ: byte 8870221, line 37265
> > $ cmp F-7-i386-DVD.iso F.iso
> > F-7-i386-DVD.iso F.iso differ: byte 8870221, line 37265
> > $ _
> > ------------------------------------------------------------------------------------------
>
> This clearly indicates a hardware problem.
> You tried in /tmp and didn't get this sort of result, so it probably
> isn't RAM/CPU.
> Next step is to break the raid1, mount each drive as a separate
> filesystem and do the same test on each filesystem.
> If one works and the other fails, then it must be something specific
> to the faulty device.  If they are on the same controller, it must be
> drive or cable, so swap cables and try again.
> If they are on different controllers, try swapping controllers too.

well, i got the weirdest behavior. i did break the raid1 system into 2
drives. again, there were no instructions i could find in the HOWTO on
how to do this, so i just tried commenting out the line in /etc/fstab
for the /dev/md0 raid drive, and rebooting..

however, attempting to manually mount each drive separately gave me an
error saying wrong partition type. so i had to use /sbin/fdisk to
manually change the partition's system id from 'fd' (linux software
raid) to '83' (linux ext2/3) on each of /dev/sde1 and /dev/sdf1.. then
i could mount them.

once i mounted each drive, i tried cp'ing a large file (again,
F-7-i386-DVD.iso) and then cmp'ing the new one to the original 5
times. i did this whole cycle 5 times. guess what? ***0*** errors.
perfect cmp's. and i did this on BOTH drives. no problems at all when
they are mounted separately.

so what could THIS mean? they don't work together in raid but they do
separately? how could this be?

> If both filesystems show the same problem, it must be something
> common, maybe the controller.  Try to find an alternate controller to
> test with.  Narrow it down to the faulty component, and replace it.
>
> >
> >
> > furthermore, i discovered that there was a way to fix them (i.e.,
> > "sync" the drives). however, this fixing procedure came with a caveat.
> >  this caveat was something that i should have realized the importance
> > of in the first place: that a RAID 1 system with only two drives is
> > going to have a problem when repairing. the problem is that when
> > sync'ing the drives, whenever a mismatch is found, a decision must be
> > made as to which drive has the correct data: drive 1 or drive 2? and
> > that apparently, it's just a toss-up, and the repair program just
> > picks randomly.
> >
> > "WHAAAAT????????????"
> >
> > yeap. so, it's really better to either go with RAID 5, or to have a
> > RAID 1 system with 3 or more disks.
> >
> This is not true at all.
> If the difference is due to the drive subsystem returning bad data
> (rather than indicating a read error), then no RAID system is safe.
> If the difference is due to the kernel writing different data to the
> two drives (as happens sometimes on swap or with memory-mapped files),
> then both copies of the data are equally correct, and there isn't
> really a problem.
>
> NeilBrown
>


-- 
"the difference between driving a car and climbing onto a motorcycle
is the difference between watching TV and actually living your life"
(Dave Karlotski, "Season of the Bike",
http://motorcycleinfo.calsci.com/ and http://the751.tri-pixel.com/)

http://www.youtube.com/watch?v=yeMgEuf30G4

