Re: Read errors on raid5 ignored, array still clean .. then disaster !!

From: Giovanni Tessore <giotex@texsoft.it>
To: linux-raid@vger.kernel.org
Subject: Re: Read errors on raid5 ignored, array still clean .. then disaster !!
Date: Wed, 27 Jan 2010 10:56:05 +0100
Message-ID: <4B600DB5.3030309@texsoft.it>

>
> > Is there any way to configure raid in order to have devices marked faulty 
> > on read errors (at least when they clearly become too many)?
> I don't think so
>   
I think it would be useful to be able to configure the number of 
recovered read errors allowed before the device is marked faulty.

> > This could (and for me did) bring to big disasters!
> Don't agree with you, you had all the info from syslog
> You should have run smart tests on the disks and proactively replace a
> failing disk.
>   
It would be nice if md issued a warning on recovered read error events, 
as it does for other md events (device failure, etc.).

> it does _not_ ignore read errors 
> in case of read errors mdadm rewrites the erroring sector, and only if
> this fails it will kick the member out of the array.
> with modern drives it is possible to have some failed sector, which the
> drive firmware will reallocate on write (all modern drives have a range
> of sectors reserved for this very purpose)
> mdadm does not do any bookkeeping on reallocated_sector_count per drive
> the drive does. the data can be accessed with smartctl
> drives showing excessive reallocated_sector_count should be replaced.
>   
Sorry, by "ignore" I meant "it silently manages to recover the read 
error, without alerting anybody".
Btw, as I see from the kernel sources, md does keep track of recovered 
read errors per device.
And only when they are > 256 does it mark the device faulty (I'm 
preparing another post on it).
So, why wait for 256 errors?
I think it should be configurable ... and set to a much lower level, 
for me.
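
A minimal sketch of the kind of configurable limit I have in mind. It is 
only an illustration, not the kernel's code: the class `Member`, the 
field `max_read_errors` and both functions are hypothetical names, and 
the default of 256 is simply the figure mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """One array member with a counter of corrected read errors (hypothetical model)."""
    name: str
    max_read_errors: int = 256  # figure from this post; assumed to be configurable here
    read_errors: int = 0
    faulty: bool = False

    def record_corrected_read_error(self) -> bool:
        """Count one corrected read error and mark the member faulty once
        the count exceeds the limit. Returns the resulting faulty state."""
        self.read_errors += 1
        if self.read_errors > self.max_read_errors:
            self.faulty = True
        return self.faulty


def errors_until_faulty(member: Member) -> int:
    """Number of further corrected errors that would push the member over the limit."""
    return max(0, member.max_read_errors + 1 - member.read_errors)
```

With a lower limit, say `Member("sda", max_read_errors=20)`, the member 
would be failed on its 21st corrected error instead of its 257th.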

> Consider the following scenario:
> raid5 (sda,b,c,d)
> sda has a read error, mdadm kicks it immediately from the array
> a few minutes/hours later sdc fails completely
> lost data and no time to react, that is far worse than having 50 days of
> warnings and ignoring them.
>   
Yes, but suppose that sda has 250 corrected read errors; it's still 
clean.
sdc fails and is kicked out
resync starts
sda gets > 6 read errors during resync and is set as faulty (which is 
likely to happen, as the drive is clearly dying)
data is lost the same way
(this is actually my real scenario; it really happened)

Much difference?
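
The arithmetic of this scenario can be replayed with a small sketch. It 
is a toy model under stated assumptions, not md's resync logic: the 
functions `raid5_survives` and `simulate` are hypothetical, the limit of 
256 is the figure quoted earlier, and a member is assumed to be failed 
as soon as its count of corrected errors exceeds that limit.

```python
def raid5_survives(failed_members: int) -> bool:
    """RAID5 tolerates the loss of at most one member."""
    return failed_members <= 1


def simulate(corrected_before: int, errors_during_resync: int, threshold: int = 256):
    """Replay the scenario: one member (sdc) has already failed outright,
    then another (sda) keeps collecting corrected read errors during resync.

    Returns (final error count of sda, whether the array survives)."""
    failed = 1  # sdc has been kicked out; the resync starts
    count = corrected_before
    for _ in range(errors_during_resync):
        count += 1
        if count > threshold:  # sda is marked faulty in the middle of the resync
            failed += 1
            break
    return count, raid5_survives(failed)
```

Here `simulate(250, 6)` leaves sda at 256 errors with the array still 
alive, while `simulate(250, 7)` pushes it past the limit and the array 
is lost.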

Personally I'd prefer to know as soon as possible that something is 
going wrong: if not by setting the device faulty, then with a warning 
(by mail, like other md events) saying "this is the n-th recovered error 
for this device".
IMHO the admin has to be made clearly aware *by md*, not by other 
monitoring tools, that the array is facing a possibly critical situation.
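
As a rough sketch of such a warning, assuming a hypothetical hook that 
is called once per recovered read error: `ReadErrorNotifier`, `ordinal` 
and the `notify` callback are made-up names, and nothing here reflects 
the events md or mdadm actually emit. The callback stands in for 
whatever delivery is used (mail, syslog, ...).

```python
from collections import defaultdict
from typing import Callable


def ordinal(n: int) -> str:
    """Format 1 -> '1st', 2 -> '2nd', 11 -> '11th', 23 -> '23rd'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


class ReadErrorNotifier:
    """Counts recovered read errors per device and reports each one (hypothetical)."""

    def __init__(self, notify: Callable[[str], None]):
        self.notify = notify            # e.g. a function that sends a mail
        self.counts = defaultdict(int)  # device name -> recovered errors so far

    def recovered_read_error(self, device: str) -> None:
        """Record one recovered read error and warn the admin about it."""
        self.counts[device] += 1
        n = self.counts[device]
        self.notify(f"this is the {ordinal(n)} recovered read error for {device}")
```

For example, `ReadErrorNotifier(print).recovered_read_error("sda")` 
would print the warning for the first recovered error on sda.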

> I'm sorry for your data, hope you had backups.
>   
Thanks.
I am trying to recover by forcing a re-add of the drive that gives read 
errors and using the array in degraded mode ... it seems to work.

Giovanni
