Re: Read errors on raid5 ignored, array still clean .. then disaster !!

From: Roger Heflin <rogerheflin@gmail.com>
To: Giovanni Tessore <giotex@texsoft.it>
Cc: linux-raid@vger.kernel.org
Subject: Re: Read errors on raid5 ignored, array still clean .. then disaster !!
Date: Sun, 31 Jan 2010 08:08:05 -0600
Message-ID: <4B658EC5.9020004@gmail.com>
In-Reply-To: <4B655F63.2000705@texsoft.it>

Giovanni Tessore wrote:
> 
>> I have never seen a properly good disk that gets that high an error 
>> rate actually exposed to the OS.  I have dealt with >5000 disks and 
>> have several years of history on those 5000+ disks.
> I have experience with not so many disks, but I was used to them being 
> quite reliable, and to the first read error reported to the OS being a 
> symptom of an incoming failure; I always replaced them in such cases. This 
> is why I am so amazed that kernel 2.6.15 changed the way it manages read 
> errors (as Asdo also said, it's OK for raid-6, but unsafe for raid-5, 1, 
> 4, 10).

Good disks rescan and replace bad blocks before you ever see them, and 
if you help the disks by doing your own scans, things are better.
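A do-it-yourself scan is just a sequential read of the whole device. The sketch below is a minimal illustration, not a tested tool: `scan_for_read_errors` is a hypothetical helper, the 1 MiB chunk size is an arbitrary choice, and it assumes Linux, where an unreadable sector surfaces as an `EIO` error from the read.

```python
import os


def scan_for_read_errors(path, chunk_size=1 << 20):
    """Read `path` chunk by chunk and return the offsets of chunks that failed.

    Hypothetical helper: `path` would normally be a whole disk such as
    /dev/sdX (needs root), but any regular file works too. Reads go through
    the page cache, so recently cached regions are not re-read from the disk.
    """
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # also works for block devices
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                os.pread(fd, length, offset)
            except OSError:
                # An unreadable sector shows up as EIO; note it and keep going.
                bad_offsets.append(offset)
            offset += length
    finally:
        os.close(fd)
    return bad_offsets
```

On a real disk this would be run as root against something like `/dev/sdX`; a reported offset only narrows the bad sector down to one chunk.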

> 
> Actually I had not had a single read error in 2-3 years on my systems, 
> but now ... in a week, I had 4 disks fail (yes... another one since I 
> started this thread!!) ... that's 30% of the total disks in my systems ... 
> so I'm really puzzled ... I don't know what to trust ... I'm just in 
> the hands of God

That tells me you have one of those "bad" lots.  If disks start 
failing en masse in under 3-4 years, it is usually a bad lot.  You can 
manually scan (read) the whole disk, and if the sectors take weeks to 
go bad, then the disk's normal reallocation will prevent errors, provided 
you are scanning faster than they go fully bad.  The disk will replace a 
sector when its error rate is high, but not yet so high that the disk can 
no longer reconstruct the data internally.   The more often you scan, the 
higher the rate of sectors going bad that can be corrected.
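The scan-frequency argument can be made concrete with a deliberately simplified model, stated here as an assumption rather than a property of any real drive: a failing sector spends a fixed "weak" window in which it is still readable and can be reallocated, each full-disk pass reads it once, and the window starts at a random point relative to the scan schedule. The chance that a pass lands inside the window is then min(1, window / interval). `catch_probability` is a hypothetical name.

```python
def catch_probability(weak_window_days, scan_interval_days):
    """Chance that at least one full-disk scan lands while a sector is weak.

    Hypothetical model: the sector stays readable-but-weak for
    `weak_window_days` before becoming unreadable, scans run every
    `scan_interval_days`, and the window's start is uniformly random
    relative to the scan schedule.
    """
    if weak_window_days < 0 or scan_interval_days <= 0:
        raise ValueError("window must be >= 0 and interval > 0")
    return min(1.0, weak_window_days / scan_interval_days)


# A sector that decays over 2 weeks is always caught by weekly scans,
# but only a quarter of the time by scans every 8 weeks.
for interval in (7, 14, 28, 56):
    print(f"scan every {interval:2d} days: {catch_probability(14, interval):.2f}")
```

Real sectors do not decay on a fixed clock, so this only illustrates the trend: under the model, the shortest weak window that is always caught equals the scan interval.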

The reason md rewrites the bad sector instead of kicking the disk out on a 
read error is that, when you get a read error, you do not know whether you 
can read the other disks.   Consider 5 crappy disks with, say, 1000 read 
errors per disk: the chance of one of the other disks having the same 
sector bad is fairly small.  But given that one disk has a read error, 
another disk is a lot more likely to also have one, especially if none of 
the other disks have been read in several months.
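The "fairly small" chance can be put into numbers under a naive model with hypothetical figures: 1 TB members (1,953,525,168 512-byte sectors each) and bad sectors placed uniformly and independently on each disk. `p_same_stripe_also_bad` is an illustrative helper, not an md function.

```python
def p_same_stripe_also_bad(bad_sectors_per_disk, sectors_per_disk, other_disks):
    """P(at least one of `other_disks` disks is also bad at one given sector).

    Assumes bad sectors are spread uniformly and independently over each
    disk, which is exactly the assumption that breaks down with a bad lot.
    """
    p_one = bad_sectors_per_disk / sectors_per_disk
    return 1.0 - (1.0 - p_one) ** other_disks


# Hypothetical numbers: 5-disk RAID5 of 1 TB drives, 1000 bad sectors each,
# one disk has just returned a read error, so 4 other disks are involved.
p = p_same_stripe_also_bad(1000, 1_953_525_168, other_disks=4)
print(f"{p:.2e}")
```

With these assumptions it comes out at roughly two in a million. The independence assumption is what fails in the situation described above: once one disk from a lot has shown a read error, the others are far more likely to have bad sectors too, especially if they have not been read for months, so the real risk can be much higher.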

What kind of disks are they?   And were you doing checks on the arrays, 
and if so, how often?  If you never do checks, then a weak sector won't get 
checked and moved before it goes fully bad, and it can go months without 
being read while it goes completely bad.
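On Linux md, the array "checks" asked about here are started through sysfs: writing `check` to `/sys/block/<array>/md/sync_action` makes md read every stripe on every member, and a sector that returns a read error is reconstructed from the other disks and rewritten. The Python wrapper below is a hypothetical sketch (the function names and the `sysfs_root` parameter are illustrative, and `md0` in the usage note is an assumed array name); it needs root, and it refuses to start a check when the array is not idle.

```python
from pathlib import Path


def start_md_check(md_name, sysfs_root="/sys/block"):
    """Ask md to read-verify every stripe of an array (a "check" scrub).

    Writes "check" to <sysfs_root>/<md_name>/md/sync_action. Returns False
    instead of interrupting a resync, recovery, or check already running.
    """
    action = Path(sysfs_root) / md_name / "md" / "sync_action"
    if action.read_text().strip() != "idle":
        return False
    action.write_text("check\n")
    return True


def mismatch_count(md_name, sysfs_root="/sys/block"):
    """Read md's mismatch_cnt, meaningful once the check has finished."""
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())
```

A call such as `start_md_check("md0")` from a weekly or monthly cron job is enough; progress appears in `/proc/mdstat`, and `mismatch_count` can be read once `sync_action` is back to `idle`.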

>> Nothing in the error rate indicated that behavior, so if you get a bad 
>> lot it will be very bad; if you don't get a bad lot you very likely 
>> won't have issues.   Now, including the bad lots' data in the overall 
>> error rate may result in the error rate being that high, but your luck 
>> will depend on whether you have a good or bad lot.
> My disks are from the same manufacturer and of the same size, but from 
> different lots, as they were bought at different times, and are different models.
> Systems are well protected by UPS and in different places!
> ... my unlucky week .. or I have a big EM storm over here...
> I've recalled old 120G disks to duty to save some data.

The same manufacturing process usually extends over different sizes 
(same underlying platter density), and over several months.   The last 
time I saw the issue, 160 GB and 250 GB enterprise and non-enterprise 
disks were all affected.   The symptom was that the sectors went bad 
really, really fast.  I suspect there was a process issue with the 
design, manufacturing, or quality control of the platters that resulted 
in the platters going bad at a much higher rate than expected.
