Read errors on raid5 ignored, array still clean .. then disaster !!

linux-raid.vger.kernel.org archive mirror

* Read errors on raid5 ignored, array still clean .. then disaster !!
@ 2010-01-26 22:28 Giovanni Tessore
  2010-01-27  7:41 ` Luca Berra
  2010-01-27  9:01 ` Asdo
  0 siblings, 2 replies; 29+ messages in thread
From: Giovanni Tessore @ 2010-01-26 22:28 UTC (permalink / raw)
  To: linux-raid

Hello everybody!
I'm not an expert on software RAID internals, so I'd like some expert help.

I'm having a big problem with a raid5 array with 6 sata disks: /dev/md3 
made of /dev/sd[acbdef]4
kernel is 2.6.24 (ubuntu 8.04 2.6.24-21-server)
mdadm - v2.6.3 - 20th August 2007

Here is what happened, as read from the logs:
- since the beginning of December, a lot (hundreds) of read errors occurred 
on /dev/sdb, but md3 silently recovered them, WITHOUT setting the device 
as faulty (see the errors reported below) or signaling the situation
- on 18 January a failure occurred on /dev/sdf, and md3 marked it as faulty
- after /dev/sdf was replaced with a new disk and re-added to the array, the 
resync started
- at 98% of the resync, a read error occurred on /dev/sdb (as it was 
clearly in bad shape) and the whole array became unusable !!!

Is this some kind of bug?
Is there any way to configure raid in order to have devices marked 
faulty on read errors (at least when they clearly become too many)?

This could (and for me did) lead to big disasters!
Suppose you have a 4-disk raid with 2 spare disks ready for recovery:
- there are a lot of read errors on disk 1, but md silently recovers them 
without marking the disk as faulty (as it did for me)
- disk 3 fails
- md adds one of the spare disks and starts the resync
- the resync fails due to the read errors on disk 1
- everything is lost, while still having 2 spare disks!!!???
This is not fault tolerance ... it's fault creation!!!

In a post of some months ago of a person who had a similar problem, I 
read as reply that ignoring the read errors is the wanted behaviour of 
md ... but I can't believe this!!

I was able to recover something with
mdadm --create /dev/md3 --assume-clean --level=5 --raid-devices=6 
--spare-devices=0 /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4 /dev/sde4 missing
and use md3 in degraded mode, reapplying the command on each read error 
on /dev/sdb

Thanks in advance


Read errors reported in the log for /dev/sdb long before the failure of 
/dev/sdf were like this (notice the data recovery message at the bottom):

Dec 27 11:40:45 teroknor kernel:           res 
41/40:08:3b:b2:c3/14:00:38:00:00/00 Emask 0x409 (media error) <F>
Dec 27 11:40:45 teroknor kernel: ata2.00: configured for UDMA/133
Dec 27 11:40:45 teroknor kernel:  ata2: EH complete
Dec 27 11:40:45 teroknor kernel:  sd 1:0:0:0: [sdb] 976773168 512-byte 
hardware sectors (500108 MB)
Dec 27 11:40:45 teroknor kernel:  sd 1:0:0:0: [sdb] Write Protect is off
Dec 27 11:40:45 teroknor kernel:  sd 1:0:0:0: [sdb] Write cache: 
enabled, read cache: enabled, doesn't support DPO or FUA
Dec 27 11:40:48 teroknor kernel:           res 
41/40:08:3b:b2:c3/14:00:38:00:00/00 Emask 0x409 (media error) <F>
Dec 27 11:40:48 teroknor kernel:  ata2.00: configured for UDMA/133
Dec 27 11:40:48 teroknor kernel:  ata2: EH complete
Dec 27 11:40:48 teroknor kernel:  sd 1:0:0:0: [sdb] 976773168 512-byte 
hardware sectors (500108 MB)
Dec 27 11:40:48 teroknor kernel:  sd 1:0:0:0: [sdb] Write Protect is off
Dec 27 11:40:48 teroknor kernel:  sd 1:0:0:0: [sdb] Write cache: 
enabled, read cache: enabled, doesn't support DPO or FUA
Dec 27 11:40:51 teroknor kernel:           res 
41/40:08:3b:b2:c3/14:00:38:00:00/00 Emask 0x409 (media error) <F>
Dec 27 11:40:51 teroknor kernel:  ata2.00: configured for UDMA/133
Dec 27 11:40:51 teroknor kernel:  ata2: EH complete
Dec 27 11:40:51 teroknor kernel:  sd 1:0:0:0: [sdb] 976773168 512-byte 
hardware sectors (500108 MB)
Dec 27 11:40:51 teroknor kernel:  sd 1:0:0:0: [sdb] Write Protect is off
Dec 27 11:40:51 teroknor kernel:  sd 1:0:0:0: [sdb] Write cache: 
enabled, read cache: enabled, doesn't support DPO or FUA
Dec 27 11:40:54 teroknor kernel:           res 
41/40:08:3b:b2:c3/14:00:38:00:00/00 Emask 0x409 (media error) <F>
Dec 27 11:40:54 teroknor kernel:  ata2.00: configured for UDMA/133
Dec 27 11:40:54 teroknor kernel:  ata2: EH complete
Dec 27 11:40:54 teroknor kernel:  sd 1:0:0:0: [sdb] 976773168 512-byte 
hardware sectors (500108 MB)
Dec 27 11:40:54 teroknor kernel:  sd 1:0:0:0: [sdb] Write Protect is off
Dec 27 11:40:54 teroknor kernel:  sd 1:0:0:0: [sdb] Write cache: 
enabled, read cache: enabled, doesn't support DPO or FUA
Dec 27 11:40:57 teroknor kernel:           res 
41/40:08:3b:b2:c3/14:00:38:00:00/00 Emask 0x409 (media error) <F>
Dec 27 11:40:57 teroknor kernel:  ata2.00: configured for UDMA/133
Dec 27 11:40:57 teroknor kernel:  ata2: EH complete
Dec 27 11:40:57 teroknor kernel:  sd 1:0:0:0: [sdb] 976773168 512-byte 
hardware sectors (500108 MB)
Dec 27 11:40:57 teroknor kernel:  sd 1:0:0:0: [sdb] Write Protect is off
Dec 27 11:40:57 teroknor kernel:  sd 1:0:0:0: [sdb] Write cache: 
enabled, read cache: enabled, doesn't support DPO or FUA
Dec 27 11:40:59 teroknor kernel:           res 
41/40:08:3b:b2:c3/14:00:38:00:00/00 Emask 0x409 (media error) <F>
Dec 27 11:40:59 teroknor kernel:  ata2.00: configured for UDMA/133
Dec 27 11:40:59 teroknor kernel:  sd 1:0:0:0: [sdb] Result: 
hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Dec 27 11:40:59 teroknor kernel:  sd 1:0:0:0: [sdb] Sense Key : Medium 
Error [current] [descriptor]
Dec 27 11:40:59 teroknor kernel:  Descriptor sense data with sense 
descriptors (in hex):
Dec 27 11:40:59 teroknor kernel:          72 03 11 04 00 00 00 0c 00 0a 
80 00 00 00 00 00
Dec 27 11:40:59 teroknor kernel:          00 00 00 3b
Dec 27 11:40:59 teroknor kernel:  sd 1:0:0:0: [sdb] Add. Sense: 
Unrecovered read error - auto reallocate failed
Dec 27 11:40:59 teroknor kernel:  end_request: I/O error, dev sdb, 
sector 952349242
Dec 27 11:40:59 teroknor kernel:  ata2: EH complete
Dec 27 11:40:59 teroknor kernel:  sd 1:0:0:0: [sdb] 976773168 512-byte 
hardware sectors (500108 MB)
Dec 27 11:40:59 teroknor kernel:  sd 1:0:0:0: [sdb] Write Protect is off
Dec 27 11:40:59 teroknor kernel:  sd 1:0:0:0: [sdb] Write cache: 
enabled, read cache: enabled, doesn't support DPO or FUA
Dec 27 11:41:00 teroknor kernel:  raid5:md3: read error corrected (8 
sectors at 942549592 on sdb4)
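
A minimal sketch (not part of the original post) of how "read error corrected" messages like the last one above could be counted per member device. It assumes each kernel message sits on a single line, as in the real syslog before mail wrapping, and that the message format is exactly the one shown; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "raid5:md3: read error corrected (8 sectors at 942549592 on sdb4)"
PATTERN = re.compile(
    r"raid\d+:(?P<array>md\d+): read error corrected "
    r"\((?P<count>\d+) sectors at (?P<sector>\d+) on (?P<dev>\w+)\)"
)

def count_corrected_errors(lines):
    """Return a Counter mapping (array, device) to the number of correction events."""
    events = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            events[(match.group("array"), match.group("dev"))] += 1
    return events

sample = [
    "Dec 27 11:40:59 teroknor kernel:  ata2: EH complete",
    "Dec 27 11:41:00 teroknor kernel:  raid5:md3: read error corrected "
    "(8 sectors at 942549592 on sdb4)",
]
counts = count_corrected_errors(sample)
print(counts[("md3", "sdb4")])
```

Running something like this over a month of logs would have shown the hundreds of corrections on sdb4 as a single number per device.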

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-26 22:28 Read errors on raid5 ignored, array still clean .. then disaster !! Giovanni Tessore
@ 2010-01-27  7:41 ` Luca Berra
  2010-01-27  9:01   ` Goswin von Brederlow
  2010-01-29 10:48   ` Neil Brown
  2010-01-27  9:01 ` Asdo
  1 sibling, 2 replies; 29+ messages in thread
From: Luca Berra @ 2010-01-27  7:41 UTC (permalink / raw)
  To: linux-raid

On Tue, Jan 26, 2010 at 11:28:03PM +0100, Giovanni Tessore wrote:
> Is this some kind of bug?
No
> Is there any way to configure raid in order to have devices marked faulty 
> on read errors (at least when they clearly become too many)?
I don't think so
> This could (and for me did) lead to big disasters!
I don't agree with you: you had all the info in syslog.
You should have run SMART tests on the disks and proactively replaced the
failing disk.

> In a post of some months ago of a person who had a similar problem, I read 
> as reply that ignoring the read errors is the wanted behaviour of md ... 
> but I can't believe this!!

It does _not_ ignore read errors.
In case of a read error, md rewrites the failing sector, and only if
this fails will it kick the member out of the array.
With modern drives it is possible to have some failed sectors, which the
drive firmware will reallocate on write (all modern drives have a range
of sectors reserved for this very purpose).
md does not do any bookkeeping on reallocated_sector_count per drive;
the drive does. The data can be accessed with smartctl.
Drives showing an excessive reallocated_sector_count should be replaced.

Consider the following scenario:
raid5 (sda,b,c,d)
sda has a read error, md kicks it immediately from the array
a few minutes/hours later sdc fails completely
Data is lost with no time to react; that is far worse than having 50 days of
warnings and ignoring them.

L.

I'm sorry for your data, hope you had backups.

-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-27  7:41 ` Luca Berra
@ 2010-01-27  9:01   ` Goswin von Brederlow
  2010-01-29 10:48   ` Neil Brown
  1 sibling, 0 replies; 29+ messages in thread
From: Goswin von Brederlow @ 2010-01-27  9:01 UTC (permalink / raw)
  To: linux-raid

Luca Berra <bluca@comedia.it> writes:

> On Tue, Jan 26, 2010 at 11:28:03PM +0100, Giovanni Tessore wrote:
>> Is this some kind of bug?
> No
>> Is there any way to configure raid in order to have devices marked
>> faulty on read errors (at least when they clearly become too many)?
> I don't think so
>> This could (and for me did) lead to big disasters!
> I don't agree with you: you had all the info in syslog.
> You should have run SMART tests on the disks and proactively replaced the
> failing disk.
>
>> In a post of some months ago of a person who had a similar problem,
>> I read as reply that ignoring the read errors is the wanted
>> behaviour of md ... but I can't believe this!!
>
> It does _not_ ignore read errors.
> In case of a read error, md rewrites the failing sector, and only if
> this fails will it kick the member out of the array.
> With modern drives it is possible to have some failed sectors, which the
> drive firmware will reallocate on write (all modern drives have a range
> of sectors reserved for this very purpose).
> md does not do any bookkeeping on reallocated_sector_count per drive;
> the drive does. The data can be accessed with smartctl.
> Drives showing an excessive reallocated_sector_count should be replaced.
>
> Consider the following scenario:
> raid5 (sda,b,c,d)
> sda has a read error, md kicks it immediately from the array
> a few minutes/hours later sdc fails completely
> Data is lost with no time to react; that is far worse than having 50 days of
> warnings and ignoring them.

Plus you should have run the raid check as a cron job. In Debian that is
done by default on the first Sunday of every month at 3am. The check
reads every stripe from all disks and checks that the parity
matches. That would have caused all read errors of sda to be repaired
or, when the drive runs out of sectors to remap to, kicked the drive.
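
As a rough sketch of what such a periodic check boils down to: md exposes a `sync_action` file per array in sysfs, and writing `check` to it starts a read-only scrub. The function name and the `sysfs_root` parameter are assumptions for illustration (the parameter only exists so the demo can run against a fake tree), and this is not the script Debian ships.

```python
import tempfile
from pathlib import Path

def request_check(array, sysfs_root="/sys/block"):
    """Start a read-only scrub of `array` if it is idle; return True if started."""
    action = Path(sysfs_root) / array / "md" / "sync_action"
    if action.read_text().strip() != "idle":
        return False  # a resync, recovery or check is already running
    action.write_text("check\n")
    return True

# Demo against a fake sysfs file, so the sketch runs without a real array.
with tempfile.TemporaryDirectory() as root:
    md_dir = Path(root) / "md3" / "md"
    md_dir.mkdir(parents=True)
    (md_dir / "sync_action").write_text("idle\n")
    started = request_check("md3", sysfs_root=root)
    written = (md_dir / "sync_action").read_text()
    started_again = request_check("md3", sysfs_root=root)
```

On a real system this would need root privileges and would be called from cron with the default `sysfs_root`.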

MfG
        Goswin


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-27  7:41 ` Luca Berra
  2010-01-27  9:01   ` Goswin von Brederlow
@ 2010-01-29 10:48   ` Neil Brown
  2010-01-29 11:58     ` Goswin von Brederlow
                       ` (4 more replies)
  1 sibling, 5 replies; 29+ messages in thread
From: Neil Brown @ 2010-01-29 10:48 UTC (permalink / raw)
  To: Luca Berra; +Cc: linux-raid

On Wed, 27 Jan 2010 08:41:38 +0100
Luca Berra <bluca@comedia.it> wrote:

> On Tue, Jan 26, 2010 at 11:28:03PM +0100, Giovanni Tessore wrote:
> > Is this some kind of bug?  
> No

I'm not sure I agree.
If a device is generating lots of read errors, we really should do something
proactive about that.
If there is a hot spare, then building onto that while keeping the original
active (yes, still on the todo list) would be a good thing to do.

v1.x metadata allows the number of corrected errors to be recorded across
restarts so a real long-term value can be used as a trigger.

So there certainly are useful improvements that could be made here.

NeilBrown


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-29 10:48   ` Neil Brown
@ 2010-01-29 11:58     ` Goswin von Brederlow
  2010-01-29 19:14     ` Giovanni Tessore
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Goswin von Brederlow @ 2010-01-29 11:58 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Neil Brown <neilb@suse.de> writes:

> On Wed, 27 Jan 2010 08:41:38 +0100
> Luca Berra <bluca@comedia.it> wrote:
>
>> On Tue, Jan 26, 2010 at 11:28:03PM +0100, Giovanni Tessore wrote:
>> > Is this some kind of bug?  
>> No
>
>
> I'm not sure I agree.
> If a device is generating lots of read errors, we really should do something
> proactive about that.
> If there is a hot spare, then building onto that while keeping the original
> active (yes, still on the todo list) would be a good thing to do.
>
> v1.x metadata allows the number of corrected errors to be recorded across
> restarts so a real long-term value can be used as a trigger.
>
> So there certainly are useful improvements that could be made here.
>
> NeilBrown

Someone mentioned there already is an error count. Maybe throw a warning
message to mdadm every 1,2,4,8,16,32,... errors?

Locally I would see the read errors in the syslog/kernel.log and start
saving for a new drive. So I'm already warned. But for remote systems a
mail from mdadm would be nice.
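
The "warn at 1, 2, 4, 8, ..." idea amounts to a power-of-two test on the error counter. A minimal sketch, with a hypothetical function name; nothing like this exists in mdadm as far as this thread states:

```python
def should_warn(error_count):
    """True when error_count is 1, 2, 4, 8, ... (a power of two)."""
    return error_count > 0 and (error_count & (error_count - 1)) == 0

warn_points = [n for n in range(1, 40) if should_warn(n)]
print(warn_points)  # [1, 2, 4, 8, 16, 32]
```

The exponential spacing keeps a disk with hundreds of corrected errors from flooding the admin with mail while still reporting the first few promptly.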

MfG
        Goswin


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-29 10:48   ` Neil Brown
  2010-01-29 11:58     ` Goswin von Brederlow
@ 2010-01-29 19:14     ` Giovanni Tessore
  2010-01-30  7:58       ` Luca Berra
  2010-01-30  7:54     ` Luca Berra
                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Giovanni Tessore @ 2010-01-29 19:14 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Goswin von Brederlow

Neil Brown wrote:
> If a device is generating lots of read errors, we really should do something
> proactive about that.
> If there is a hot spare, then building onto that while keeping the original
> active (yes, still on the todo list) would be a good thing to do.
>
> v1.x metadata allows the number of corrected errors to be recorded across
> restarts so a real long-term value can be used as a trigger.
>
> So there certainly are useful improvements that could be made here.
>   

That's exactly my opinion.
Using a hot spare, if available, seems to me a very good idea.
About the metadata version, I was quite disappointed to see that the 
default when creating the array is still 0.9 (correct me if newer 
distros behave differently), which does not persist info about the 
corrected read errors.
In a previous post I suggested at least making admins aware of the 
situation:

> - it seems that the max number of read errors allowed is set 
> statically in raid5.c by "conf->max_nr_stripes = NR_STRIPES;" to 
> 256; possibly make it configurable through an entry in /sys/block/mdXX
> - let /proc/mdstat clearly report how many read errors occurred per 
> device, if any
> - let mdadm be configurable in monitor mode to trigger alerts when the 
> number of read errors for a device changes or goes > n
> - explain clearly in the how-to and other user documentation what 
> the behaviour of raid towards read errors is; after a quick survey 
> among my colleagues, I noticed nobody was aware of this, and all 
> of them were sure that raid had the same behaviour for both write and 
> read errors!
I wrote a little patch (just 2 lines of code) for drivers/md/md.c in 
order to let /proc/mdstat report if a device has read errors, and how many.
So my /proc/mdstat now shows something like:

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] 
[raid1] [raid10]
md0 : active raid5 sda1[0] sdb1[1](R:36) sdc1[2]
     4192768 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

where /dev/sdb1 has 36 corrected read errors.
This lets me know at a glance the real health status of my array.
As all the info needed is available through 
/sys/block/mdXX/md/rdXX/errors, I think it would not be difficult to 
implement some monitor as a standalone application or in mdadm.

One thing is clear to me, now that I have faced a disaster on a 6-disk 
raid5 array: it's a *big* *hazard* to have devices that have given read 
errors in an array without md at least signaling the situation 
(through /proc/mdstat or mdadm or anything else). A resync in case of 
another disk failure is likely to fail.

I think it's also bad for the image of the whole linux server 
community: try to explain to a customer that his robust raid system, 
with 6 disks plus 2 hot spares, just died because there were read 
errors, which were well known by the system; and that now all his 
valuable data are lost!!! That customer may say "What a server...!!!", 
kill you, then get a win server for sure!!

Someone may argue that the health status of disk should be monitored by 
smart monitors... but I disagree, imho md driver must not rely on 
external tools, it already has info on read errors and should manage 
them to avoid as much risk as possible by itself. Smart monitoring is 
surely useful... if installed, supported by hardware, properly 
configured... but md should not assume that.

Thanks for your interest.

-- 
Yours faithfully.

Giovanni Tessore


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-29 19:14     ` Giovanni Tessore
@ 2010-01-30  7:58       ` Luca Berra
  2010-01-30 15:52         ` Giovanni Tessore
  0 siblings, 1 reply; 29+ messages in thread
From: Luca Berra @ 2010-01-30  7:58 UTC (permalink / raw)
  To: linux-raid

On Fri, Jan 29, 2010 at 08:14:25PM +0100, Giovanni Tessore wrote:
> Using a hot spare, if available, seems to me a very good idea.
> About the metadata version, I was quite disappointed to see that the 
> default when creating the array is still 0.9 (correct me if newer 
> distros behave differently), which does not persist info about the 
> corrected read errors.
it was changed in mdadm 3.1.1

> In a previous post I suggested at least making admins aware of the 
> situation:
>
> I think it's also bad for the image of the whole linux server community: 
> try to explain to a customer that his robust raid system, with 6 disks plus 
> 2 hot spares, just died because there were read errors, which were well 
> known by the system; and that now all his valuable data are lost!!! That 
> customer may say "What a server...!!!", kill you, then get a win server for 
> sure!!

Oh, please, stop trolling.

L.
-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-30  7:58       ` Luca Berra
@ 2010-01-30 15:52         ` Giovanni Tessore
  0 siblings, 0 replies; 29+ messages in thread
From: Giovanni Tessore @ 2010-01-30 15:52 UTC (permalink / raw)
  To: linux-raid

>> In a previous post I suggested at least making admins aware of the 
>> situation:
>>
>> I think it's also bad for the image of the whole linux server 
>> community: try to explain to a customer that his robust raid system, 
>> with 6 disks plus 2 hot spares, just died because there were read 
>> errors, which were well known by the system; and that now all his 
>> valuable data are lost!!! That customer may say "What a 
>> server...!!!", kill you, then get a win server for sure!!
>
> Oh, please, stop trolling.
>

Ok, maybe I'm a bit nervous due to the data loss... touché.
But the problem exists, and it's not only mine: I just saw another post 
sent today about a similar problem. So it's worth discussing, imho, 
because it may affect many installations.

Suppose you have a single disk: if it gives a read error, you lose some 
data, and then? Do you keep the disk or do you replace it as soon as 
possible? I guess the latter. So I would adopt the same policy if the 
drive is in a raid array too, all the more so as one would expect 
maximum safety from it. Kicking the disk out of the array at the first 
read error is not a good choice either, I agree, as the array can still 
run, BUT the urgency of replacing the disk is the same as for a faulty 
disk, as the array may not survive another disk failure! This should be 
clearly exposed to the admin.

I already posted a little patch for /proc/mdstat.
I'll try to write a little daemon to track /sys/block/mdXX/md/rdYY/errors.
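
A minimal sketch of the polling step such a daemon would need. It assumes the md sysfs layout with one `dev-<name>` directory per member under `/sys/block/mdX/md/`, each holding an `errors` file; the function names and the `sysfs_root` parameter are hypothetical, and the demo builds a fake tree so it runs without a real array.

```python
import tempfile
from pathlib import Path

def read_error_counts(array, sysfs_root="/sys/block"):
    """Return {member_device: corrected_read_errors} for one md array."""
    md_dir = Path(sysfs_root) / array / "md"
    counts = {}
    for errors_file in sorted(md_dir.glob("dev-*/errors")):
        device = errors_file.parent.name[len("dev-"):]
        counts[device] = int(errors_file.read_text().strip())
    return counts

def increased(previous, current):
    """Return the devices whose counter grew since the previous poll."""
    return {dev: n for dev, n in current.items() if n > previous.get(dev, 0)}

# Demo against a fake sysfs tree, so the sketch runs without a real array.
with tempfile.TemporaryDirectory() as root:
    for dev, errors in (("sda1", 0), ("sdb1", 36), ("sdc1", 0)):
        dev_dir = Path(root) / "md0" / "md" / f"dev-{dev}"
        dev_dir.mkdir(parents=True)
        (dev_dir / "errors").write_text(f"{errors}\n")
    before = {"sda1": 0, "sdb1": 30, "sdc1": 0}
    now = read_error_counts("md0", sysfs_root=root)
    alerts = increased(before, now)
    print(alerts)  # {'sdb1': 36}
```

A real daemon would call `read_error_counts` in a loop with a sleep, keep the previous result, and mail the admin whenever `increased` returns anything.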

Giovanni

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-29 10:48   ` Neil Brown
  2010-01-29 11:58     ` Goswin von Brederlow
  2010-01-29 19:14     ` Giovanni Tessore
@ 2010-01-30  7:54     ` Luca Berra
  2010-01-30 10:55     ` Giovanni Tessore
  2010-01-30 18:44     ` Giovanni Tessore
  4 siblings, 0 replies; 29+ messages in thread
From: Luca Berra @ 2010-01-30  7:54 UTC (permalink / raw)
  To: linux-raid

On Fri, Jan 29, 2010 at 09:48:52PM +1100, Neil Brown wrote:
>On Wed, 27 Jan 2010 08:41:38 +0100
>Luca Berra <bluca@comedia.it> wrote:
>
>> On Tue, Jan 26, 2010 at 11:28:03PM +0100, Giovanni Tessore wrote:
>> > Is this some kind of bug?  
>> No
>
>
>I'm not sure I agree.
>If a device is generating lots of read errors, we really should do something
>proactive about that.
>If there is a hot spare, then building onto that while keeping the original
>active (yes, still on the todo list) would be a good thing to do.
>
>v1.x metadata allows the number of corrected errors to be recorded across
>restarts so a real long-term value can be used as a trigger.
Uhm, should we use an absolute value here, or should we consider the
rate of read errors over time? Or both?
The former would indicate a disk that is degrading slowly over time;
the latter might be a symptom of a disk that will die very soon.
We also need to control the threshold on a per-device basis via sysfs
(e.g. mdX/md/dev-FOO/maximum_tolerated_read_errors)
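
To make the "absolute value, rate, or both" question concrete, here is a small sketch of a per-device tracker that checks both a lifetime total and a count inside a sliding time window. The class, its parameters and the threshold values are all hypothetical; nothing here reflects existing md code.

```python
from collections import deque

class ReadErrorTracker:
    """Tracks corrected read errors for one device against two thresholds."""

    def __init__(self, max_total, max_in_window, window_seconds):
        self.max_total = max_total
        self.max_in_window = max_in_window
        self.window_seconds = window_seconds
        self.total = 0
        self.recent = deque()  # timestamps of errors inside the window

    def record(self, timestamp):
        """Record one error; return the list of thresholds now exceeded."""
        self.total += 1
        self.recent.append(timestamp)
        # Drop errors that have aged out of the sliding window.
        while timestamp - self.recent[0] > self.window_seconds:
            self.recent.popleft()
        exceeded = []
        if self.total > self.max_total:
            exceeded.append("total")
        if len(self.recent) > self.max_in_window:
            exceeded.append("rate")
        return exceeded

tracker = ReadErrorTracker(max_total=20, max_in_window=3, window_seconds=3600)
burst = [tracker.record(t) for t in (0, 10, 20, 30)]
print(burst[-1])               # ['rate']
print(tracker.record(100000))  # []
```

The burst of four errors within an hour trips the rate threshold, while a later isolated error does not; the lifetime total would only trip after more than 20 errors.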

>So there certainly are useful improvements that could be made here.
I don't deny that, but i would not define as bugs features that are not
yet designed/implemented.

L.


-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-29 10:48   ` Neil Brown
                       ` (2 preceding siblings ...)
  2010-01-30  7:54     ` Luca Berra
@ 2010-01-30 10:55     ` Giovanni Tessore
  2010-01-30 18:44     ` Giovanni Tessore
  4 siblings, 0 replies; 29+ messages in thread
From: Giovanni Tessore @ 2010-01-30 10:55 UTC (permalink / raw)
  To: linux-raid

Here are the lines I added to my drivers/md/md.c , function md_seq_show, 
to let /proc/mdstat show read errors on devices, if any:

    } else if (rdev->raid_disk < 0)
        seq_printf(seq, "(S)"); /* spare */

+  if (atomic_read(&rdev->read_errors) ||
+      atomic_read(&rdev->corrected_errors) )
+        seq_printf(seq, "(R:%u:%u)",
+            (unsigned int) atomic_read(&rdev->read_errors),
+            (unsigned int) atomic_read(&rdev->corrected_errors));

        sectors += rdev->sectors;
    }

In md.h I see:

atomic_t read_errors;
/* number of consecutive read errors that we have tried to ignore. */
atomic_t corrected_errors;
/* number of corrected read errors, for reporting to userspace and 
storing in superblock. */

OK for the second... but I'm not sure about the meaning of the first 
one... and it seems it's not reported under /sys/block/mdXX/md/dev-YY. Can it 
just be ignored?

Sample output:

md0 : active raid5 sdb1[1](R:0:36) sda1[0] sdc1[2]
      4192768 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

Regards
Giovanni

-- 
Yours faithfully.

Giovanni Tessore


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-29 10:48   ` Neil Brown
                       ` (3 preceding siblings ...)
  2010-01-30 10:55     ` Giovanni Tessore
@ 2010-01-30 18:44     ` Giovanni Tessore
  2010-01-30 21:41       ` Asdo
  4 siblings, 1 reply; 29+ messages in thread
From: Giovanni Tessore @ 2010-01-30 18:44 UTC (permalink / raw)
  To: linux-raid; +Cc: Neil Brown

>>> Is this some kind of bug?  
>>>       
>> No
>>     
> I'm not sure I agree.
>   

Hm funny ... I just read now from md's man:

"In  kernels  prior to about 2.6.15, a read error would cause the same 
effect as a write error.  In later kernels, a read-error will instead 
cause md to attempt a recovery by overwriting the bad block. .... "

So things have changed since 2.6.15 ... I was not so wrong to expect 
"the old behaviour" and to be disappointed.
But something important was missing from this change imho; either:
1) let the old behaviour be the default: add 
/sys/block/mdXX/max_correctable_read_errors, defaulting to 0; or
2) let the new behaviour be the default, but update mdadm and 
/proc/mdstat to report read error events.

I think the situation is now quite clear.
Thanks

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-30 18:44     ` Giovanni Tessore
@ 2010-01-30 21:41       ` Asdo
  2010-01-30 22:20         ` Giovanni Tessore
  2010-01-31 14:31         ` Asdo
  0 siblings, 2 replies; 29+ messages in thread
From: Asdo @ 2010-01-30 21:41 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: linux-raid, Neil Brown

Giovanni Tessore wrote:
>
>>>> Is this some kind of bug?        
>>> No
>>>     
>> I'm not sure I agree.
>>   
>
> Hm funny ... I just read now from md's man:
>
> "In  kernels  prior to about 2.6.15, a read error would cause the same 
> effect as a write error.  In later kernels, a read-error will instead 
> cause md to attempt a recovery by overwriting the bad block. .... "
>
> So things have changed since 2.6.15 ... I was not so wrong to expect 
> "the old behaviour" and to be disappointed.
> But something important was missing from this change imho; either:
> 1) let the old behaviour be the default: add 
> /sys/block/mdXX/max_correctable_read_errors, defaulting to 0; or
> 2) let the new behaviour be the default, but update mdadm and 
> /proc/mdstat to report read error events.
>
> I think the situation is now quite clear.
> Thanks

I have the feeling the current behaviour is the correct one, at least for 
RAID-6.

If you scrub often enough, read errors should be caught while you still 
have enough good disks in that stripe.
At that point the rewrite will kick in.
If the disk has enough relocation sectors available, the sector will be 
relocated; otherwise the disk gets kicked.

As other people have written, disks now are much bigger than in the 
past, and a damaged sector can happen. It's not necessary to kick the 
drive yet.

This is with RAID-6.

RAID-5 unfortunately is inherently insecure, here is why:
If one drive gets kicked, MD starts recovering to a spare.
At that point any single read error during the regeneration (that's a 
scrub) will fail the array.
This is a problem that cannot be overcome in theory.
Even with the old algorithm, any sector failed after the last scrub will 
take the array down when one disk is kicked (array will go down during 
recovery).
So you would need to scrub continuously, or you would need 
hyper-reliable disks.

Yes, kicking a drive as soon as it presents the first unreadable sector 
can be a strategy for trying to select hyper-reliable disks...

Ok after all I might agree this can be a reasonable strategy for 
raid1,4,5...

I'd also agree that with 1.x superblock it would be desirable to be able 
to set the maximum number of corrected read errors before a drive is 
kicked, which could be set by default to 0 for raid 1,4,5 and to... I 
don't know... 20 (50? 100?) for raid-6.

Actually I believe the drives should be kicked for this threshold only 
AFTER the end of the scrub, so that they are used for parity computation 
till the end of the scrub. I would suggest to check for this threshold 
at the end of each scrub, not before, and during normal array operation 
only if a scrub/resync is not in progress (will be checked at the end 
anyway).

Thank you
Asdo


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-30 21:41       ` Asdo
@ 2010-01-30 22:20         ` Giovanni Tessore
  2010-01-31  1:23           ` Roger Heflin
  2010-01-31 14:31         ` Asdo
  1 sibling, 1 reply; 29+ messages in thread
From: Giovanni Tessore @ 2010-01-30 22:20 UTC (permalink / raw)
  To: Asdo; +Cc: linux-raid


> RAID-5 unfortunately is inherently insecure, here is why:
> If one drive gets kicked, MD starts recovering to a spare.
> At that point any single read error during the regeneration (that's a 
> scrub) will fail the array.
> This is a problem that cannot be overcome in theory.
Yes, I was just coming to the same conclusion :-(
Suppose you have 2 TB mainstream disks, with a read error rate of 1 
sector per 1E+14 bits = 1.25E+13 bytes.
It means that you are likely to get an error every 6.25 times you read the 
whole disk!
So in case of a disk failure, you have roughly 1 chance in 6 of failing 
the array during reconstruction.
Simply unacceptable!
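
The arithmetic above can be checked with a short sketch. It takes the spec-sheet figure of one unrecoverable error per 1E+14 bits at face value and additionally assumes errors are independent (a Poisson model), which is an assumption not made in the thread; the function name is hypothetical.

```python
import math

def p_read_error(bytes_read, bits_per_error=1e14):
    """Probability of at least one unrecoverable read error while reading bytes_read."""
    expected_errors = bytes_read * 8 / bits_per_error
    return 1 - math.exp(-expected_errors)

disk = 2e12  # one 2 TB disk, in bytes

full_reads_per_error = (1e14 / 8) / disk
one_disk = p_read_error(disk)
five_disks = p_read_error(5 * disk)

print(full_reads_per_error)   # 6.25
print(round(one_disk, 3))     # 0.148
print(round(five_disks, 3))   # 0.551
```

Reading one whole disk gives about a 15% chance of an error, close to the "1 in 6" estimate. Rebuilding a 6-disk raid5 means reading all five surviving disks, which under the same assumptions pushes the chance above 50%.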

I looked at the specs of some enterprise disks, and the read error rate for 
them is 1 sector per 1E+15 or per 1E+16 bits. Better, but still risky.
Ok.. I'll definitely move to raid-6.

Also raid-1 with fewer than 3 disks becomes useless in the same way :-(
The same goes for raid-10 ...wow

Well, these two threads on read errors came out as kinda instructive ... 
doh!!

Regards

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore




* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-30 22:20         ` Giovanni Tessore
@ 2010-01-31  1:23           ` Roger Heflin
  2010-01-31 10:45             ` Giovanni Tessore
  0 siblings, 1 reply; 29+ messages in thread
From: Roger Heflin @ 2010-01-31  1:23 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: Asdo, linux-raid

Giovanni Tessore wrote:
> 
>> RAID-5 unfortunately is inherently insecure, here is why:
>> If one drive gets kicked, MD starts recovering to a spare.
>> At that point any single read error during the regeneration (that's a 
>> scrub) will fail the array.
>> This is a problem that cannot be overcome in theory.
> Yes, I was just coming to the same conclusion :-(
> Suppose you have 2 TB mainstream disks, with a read error rate of 1 
> sector per 1E+14 bits = 1.25E+13 bytes.
> It means that you are likely to get an error every 6.25 times you read the 
> whole disk!
> So in case of a disk failure, you have roughly 1 chance in 6 of failing 
> the array during reconstruction.
> Simply unacceptable!
> 
> I looked at the specs of some enterprise disks, and the read error rate for 
> them is 1 sector per 1E+15 or per 1E+16 bits. Better, but still risky.
> Ok.. I'll definitely move to raid-6.
> 
> Also raid-1 with fewer than 3 disks becomes useless in the same way :-(
> The same goes for raid-10 ...wow
> 
> Well, these two threads on read errors came out as kinda instructive ... 
> doh!!
> 
> Regards
> 

The manufacturer error numbers don't mean much. Typically good disks 
won't fail a rebuild that often. I have done a lot of rebuilds, and read 
errors during a rebuild are fairly rare, especially if you are doing 
proper patrol reads/scans of the raid arrays. If you have disks 
sitting for long periods of time without read scans, then all bets are 
off and you will have issues.

I have never seen a properly good disk expose that high an error 
rate to the OS, and I have several years of history on more than 
5000 disks.

I have seen a few manufacturer "lots" of disks that had seriously high 
error rates (certain sizes and certain manufacture ranges). In one set 
of disks (2000 desktop drives, and 600 "enterprise" drives) the 
desktop drives were almost perfect (same company as the "enterprise" 
disks, <10 replaced after 2 years). Out of the 600 "enterprise" disks 
we had replaced about 50 (read errors) when we finally got the 
manufacturer to send an engineer on-site to validate what was happening and 
RMA the entire lot (all 550 disks that had not yet been replaced).

Nothing in the error rate indicated that behavior, so if you get a bad 
lot it will be very bad; if you don't get a bad lot you very likely 
won't have issues.   Including the bad lots' data in the overall 
error rate may result in the error rate being that high, but your luck 
will depend on whether you have a good or a bad lot.
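
For reference, the spec-sheet arithmetic quoted at the top of this message can be
sketched as below. This is only an illustration under stated assumptions: read
errors are treated as independent events occurring at exactly the quoted rate
(which, as argued above, real disks may not follow), and the disk size is the
2 TB from the quoted example. The function name is made up for this sketch.

```python
import math

def p_read_error(bytes_read, bits_per_error):
    """Probability of at least one unrecoverable read error while reading
    bytes_read bytes, assuming independent errors at the quoted rate
    (Poisson approximation)."""
    expected_errors = bytes_read * 8 / bits_per_error
    return 1.0 - math.exp(-expected_errors)

DISK_BYTES = 2e12  # one 2 TB disk, as in the quoted example

# Full reads of one disk per expected error: 1.25E+13 / 2E+12
reads_per_error = (1e14 / 8) / DISK_BYTES
print(f"full disk reads per expected error: {reads_per_error}")

for rate in (1e14, 1e15, 1e16):
    one_disk = p_read_error(DISK_BYTES, rate)
    # A rebuild of a 6-disk raid-5 has to read the 5 surviving disks.
    five_disks = p_read_error(5 * DISK_BYTES, rate)
    print(f"1 error per {rate:.0e} bits: "
          f"one disk {one_disk:.1%}, five disks {five_disks:.1%}")
```

Under these assumptions the risk grows with the number of surviving disks that
must be read in full, which is why the same spec figure looks much worse for a
wide raid-5 than for a single disk.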


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-31  1:23           ` Roger Heflin
@ 2010-01-31 10:45             ` Giovanni Tessore
  2010-01-31 14:08               ` Roger Heflin
  0 siblings, 1 reply; 29+ messages in thread
From: Giovanni Tessore @ 2010-01-31 10:45 UTC (permalink / raw)
  To: linux-raid


> I have never seen a properly good disk that gets that high of error 
> rate actually exposed to the OS.  I have dealt with >5000 disk for 
> several years of history on the 5000+ disks.
I have experience with far fewer disks, but I was used to them being 
quite reliable, and to the first read error reported to the OS being a symptom 
of an incoming failure. I always replaced the disk in such a case, and this is 
why I am so amazed that kernel 2.6.15 changed the way it handles read 
errors (as Asdo also said, it's ok for raid-6, but unsafe for raid-5, 1, 
4, 10).

In fact I had not had a single read error in 2-3 years on my systems, 
but now ... in a week, I have had 4 disks fail (yes... another one since I 
started this thread!!) ... that's 30% of all the disks in my systems ... 
so I'm really puzzled ... I don't know what to trust ... I'm just in 
the hands of God
> Nothing in the error rate indicated that behavior, so if you get a bad 
> lot it will be very bad, if you don't get a bad lot you very likely 
> won't have issues.   Now including the bad lots data into the overall 
> error rate, may result in the error rate being that high, but you luck 
> will depend on if you have a good or bad lot.
My disks are from the same manufacturer and the same size, but from different lots, as 
they were bought at different times, and they are different models.
The systems are well protected by UPS and in different places!
 ... my unlucky week .. or I have a big EM storm over here...
I've recalled old 120G disks to duty to save some data.

Cheers

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore




* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-31 10:45             ` Giovanni Tessore
@ 2010-01-31 14:08               ` Roger Heflin
  0 siblings, 0 replies; 29+ messages in thread
From: Roger Heflin @ 2010-01-31 14:08 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: linux-raid

Giovanni Tessore wrote:
> 
>> I have never seen a properly good disk that gets that high of error 
>> rate actually exposed to the OS.  I have dealt with >5000 disk for 
>> several years of history on the 5000+ disks.
> I have experience with not so many disks, but I was used that they are 
> quite reliable, and that the first read error reported to OS is symptom 
> of an incominc failure; I always replaced them in such case, and this is 
> why I am so amazed that kernel 2.6.15 changed the way it manages read 
> errors (as also Asdo said, it's ok for raid-6, but unsafe for raid-5, 1, 
> 4, 10).

Good disks do rescan, and replace the bad blocks before you see them; 
if you help the disks by doing your own scans then things are better.

> 
> Actually I had not a single read error since 2-3 years on my systems, 
> but now ... in a week, I had 4 disk failed (yes... another one since I 
> started this thread!!) ... it's 30% of the total disks in my systems ... 
> so I'm really puzzled out ... I don't know what to trust ... I'm just in 
> the hands of God

That tells me you have one of those "bad" lots.  If the disks start 
failing en masse in <3-4 years it is usually a bad lot.  You can 
manually scan (read) the whole disk, and if the sectors take weeks to 
go bad then the normal disk reallocation will prevent errors, provided you 
are scanning faster than they go fully bad (the disk will reallocate a sector 
when its error rate is high, but not so high that the disk cannot internally 
reconstruct the data).   The more often you scan, the higher the rate 
of sectors going bad that can be corrected.

The reason that md rewrites the bad block instead of kicking out the disk 
on a read error is that when you get a read error you do not know whether 
you can read the other disks.   Consider that if you have 5 crappy disks 
with, say, 1000 read errors per disk, the chance of one of the other disks 
having the same sector bad is fairly small.  But given that one disk has a 
read error, another disk also having a read error somewhere is a lot 
more likely, especially if none of the other disks have been read in 
several months.
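
To put a rough number on "fairly small", here is a minimal sketch. It assumes
bad sectors are scattered uniformly and independently over each disk, and it
assumes a 2 TB disk with 512-byte sectors; neither the disk geometry nor the
function name comes from the discussion above.

```python
def p_same_sector_bad(bad_per_disk, sectors_per_disk, other_disks):
    """Chance that a given bad sector on one disk is also bad at the same
    offset on at least one of the other disks, assuming bad sectors are
    scattered uniformly and independently."""
    p_one = bad_per_disk / sectors_per_disk
    return 1.0 - (1.0 - p_one) ** other_disks

SECTORS = 2e12 / 512  # assumption: 2 TB disk, 512-byte sectors

# 5 disks with 1000 bad sectors each: look at one disk against the other 4.
p = p_same_sector_bad(1000, SECTORS, other_disks=4)
print(f"chance per bad sector of a fatal overlap: {p:.2e}")
print(f"expected overlaps over 1000 bad sectors:  {1000 * p:.2e}")
```

With these assumptions almost every bad sector can still be rebuilt from the
other disks, which is the case for rewriting it rather than dropping the whole
disk.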

What kind of disks are they?   And were you doing checks on the arrays, 
and if so how often?  If you never do checks then a sector won't get 
checked and moved before it goes fully bad, and it can have months of not 
being read in which to go completely bad.

>> Nothing in the error rate indicated that behavior, so if you get a bad 
>> lot it will be very bad, if you don't get a bad lot you very likely 
>> won't have issues.   Now including the bad lots data into the overall 
>> error rate, may result in the error rate being that high, but you luck 
>> will depend on if you have a good or bad lot.
> My disks are form same manufacturer as size, but different lot, as 
> bought in different times, and different models.
> Systems are well protected by UPS and in different places!
> ... my unluky week .. or I have a big EM storm over here...
> I've recall to duty old 120G disks to save some data.

The same manufacturing process usually extends over different sizes 
(same underlying platter density), and over several months.   The last 
time I saw the issue, 160 GB and 250 GB enterprise and non-enterprise 
disks were all affected.   The symptom was that the sectors went bad 
really, really fast. I would suspect that there was a process issue 
with either the design, manufacture, or quality control of the 
platters that resulted in the platters going bad at a much higher rate 
than expected.


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-30 21:41       ` Asdo
  2010-01-30 22:20         ` Giovanni Tessore
@ 2010-01-31 14:31         ` Asdo
  2010-02-01 10:56           ` Giovanni Tessore
  1 sibling, 1 reply; 29+ messages in thread
From: Asdo @ 2010-01-31 14:31 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: linux-raid, Neil Brown

Asdo wrote:
> Giovanni Tessore wrote:
>> Hm funny ... I just read now from md's man:
>>
>> "In  kernels  prior to about 2.6.15, a read error would cause the 
>> same effect as a write error.  In later kernels, a read-error will 
>> instead cause md to attempt a recovery by overwriting the bad block. 
>> .... "
>>
>> So things have changed since 2.6.15 ... I was not so wrong to expect 
>> "the old behaviour" and to be disappointed.
>> [CUT]
>
> I have the feeling the current behaviour is the correct one at least 
> for RAID-6.
>
> [CUT]
>
> This is with RAID-6.
>
> RAID-5 unfortunately is inherently insecure, here is why:
> If one drive gets kicked, MD starts recovering to a spare.
> At that point any single read error during the regeneration (that's a 
> scrub) will fail the array.
> This is a problem that cannot be overcome in theory.
> Even with the old algorithm, any sector failed after the last scrub 
> will take the array down when one disk is kicked (array will go down 
> during recovery).
> So you would need to scrub continuously, or you would need 
> hyper-reliable disks.
>
> Yes, kicking a drive as soon as it presents the first unreadable 
> sector can be a strategy for trying to select hyper-reliable disks...
>
> Ok after all I might agree this can be a reasonable strategy for 
> raid1,4,5...
>
> I'd also agree that with 1.x superblock it would be desirable to be 
> able to set the maximum number of corrected read errors before a drive 
> is kicked, which could be set by default to 0 for raid 1,4,5 and to... 
> I don't know... 20 (50? 100?) for raid-6.
>
> Actually I believe the drives should be kicked for this threshold only 
> AFTER the end of the scrub, so that they are used for parity 
> computation till the end of the scrub. I would suggest to check for 
> this threshold at the end of each scrub, not before, and during normal 
> array operation only if a scrub/resync is not in progress (will be 
> checked at the end anyway).
>
> Thank you

I can add that this situation with raid 1,4,5,10 would be greatly 
ameliorated when the hot-device-replace feature gets implemented.
The failures of raid 1,4,5,10 are due to the zero redundancy you get in 
the time frame from when a drive is kicked to the end of the regeneration.
However if the hot-device-replace feature is added, and gets linked to 
the drive-kicking process, the problem would disappear.

Ideally, instead of kicking (=failing) a drive directly, the 
hot-device-replace feature would be triggered, so the new drive would be 
replicated from the one being kicked (a few damaged blocks can be read 
from parity in case of a read error from the disk being replaced, but 
don't "fail" the drive during the replace process just for this). In this 
way you keep a redundancy of 1 instead of zero during the rebuild, and the chances 
of the array going down during the rebuild process are practically nullified.

I think the "hot-device-replace" action can replace the "fail" action in 
the most common scenarios, which are a drive being kicked due to:
1 - an unrecoverable read error (no relocation sectors left)
2 - surpassing the threshold for max corrected read errors (see above, 
if/when this gets implemented on 1.x superblock)

Why #2 is feasible is obvious.

#1 is more difficult (and it's useless to implement this if threshold 
for max corrected read errors gets implemented, because such threshold 
would trigger before the first unrecoverable read error happens), but I 
think it's still feasible. This would be the algorithm: you don't kick 
the drive, you ignore the write error on the bad disk (the correct data 
for that block can still be stored on the parity). Then you immediately 
trigger the hot-device-replace. When the scrub of the bad disk reaches 
the damaged sector, that one will be unreadable (I hope that it will not 
return the old data), but data can be read from the parity so the 
regeneration process can continue. So it should work, I think.
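
A toy model of that idea, purely as an illustration: the replacement is copied
block by block from the drive being retired, and only the blocks that cannot be
read are rebuilt from the other members. This is a simulation with in-memory
byte strings and simple XOR parity, not md code; all names are invented for the
sketch.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (raid-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def hot_replace(old_disk, other_disks, unreadable):
    """Copy old_disk to a new disk stripe by stripe. `unreadable` is the set
    of stripe numbers that fail to read on old_disk; those are rebuilt from
    the remaining disks (data + parity) instead of failing the drive."""
    new_disk = []
    rebuilt = 0
    for stripe, block in enumerate(old_disk):
        if stripe in unreadable:
            block = xor_blocks([d[stripe] for d in other_disks])
            rebuilt += 1
        new_disk.append(block)
    return new_disk, rebuilt

# Toy array: two data disks plus parity, three stripes of 4 bytes each.
d0 = [b"AAAA", b"BBBB", b"CCCC"]
d1 = [b"1111", b"2222", b"3333"]
parity = [xor_blocks([a, b]) for a, b in zip(d0, d1)]

# Stripe 1 of d0 is unreadable; the copy still comes out complete.
new_d0, rebuilt = hot_replace(d0, [d1, parity], unreadable={1})
print(new_d0 == d0, rebuilt)
```

The point of the sketch is that the other disks are only needed for the few
damaged stripes, so a second disk with scattered bad sectors does not stop the
copy unless it is bad in exactly the same stripe.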

One case when you cannot replace the "fail" with "hot-device-replace" is 
when a disk dies suddenly (e.g. the electronic part dies). Maybe the 
"hot-device-replace" could still be triggered first, but then if the bad 
drive turns out to be completely unresponsive (timeout? number of 
commands without response?) you fall back on "fail".

Thank you
Asdo


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-31 14:31         ` Asdo
@ 2010-02-01 10:56           ` Giovanni Tessore
  2010-02-01 12:45             ` Asdo
  2010-02-01 13:27             ` Luca Berra
  0 siblings, 2 replies; 29+ messages in thread
From: Giovanni Tessore @ 2010-02-01 10:56 UTC (permalink / raw)
  To: Asdo; +Cc: linux-raid, Neil Brown

Asdo wrote:
> Asdo wrote:
>> Giovanni Tessore wrote:
>>> Hm funny ... I just read now from md's man:
>>>
>>> "In  kernels  prior to about 2.6.15, a read error would cause the 
>>> same effect as a write error.  In later kernels, a read-error will 
>>> instead cause md to attempt a recovery by overwriting the bad block. 
>>> .... "
>>>
>>> So things have changed since 2.6.15 ... I was not so wrong to expect 
>>> "the old behaviour" and to be disappointed.
>>> [CUT]
>>
>> I have the feeling the current behaviour is the correct one at least 
>> for RAID-6.
>>
>> [CUT]
>>
>> RAID-5 unfortunately is inherently insecure, here is why:
>> If one drive gets kicked, MD starts recovering to a spare.
>> At that point any single read error during the regeneration (that's a 
>> scrub) will fail the array.
>> This is a problem that cannot be overcome in theory.
>> Even with the old algorithm, any sector failed after the last scrub 
>> will take the array down when one disk is kicked (array will go down 
>> during recovery).
>> So you would need to scrub continuously, or you would need 
>> hyper-reliable disks.
>>
>> Yes, kicking a drive as soon as it presents the first unreadable 
>> sector can be a strategy for trying to select hyper-reliable disks...
>>
>> Ok after all I might agree this can be a reasonable strategy for 
>> raid1,4,5...
Yes, the new behaviour is good for raid-6.
But unsafe for raid 1, 4, 5, 10.
The old behaviour saved me in the past, and would have saved me this 
time too, allowing me to replace the disk as soon as possible.. the new one 
didn't at all...
The new one must at least clearly alert the user that a drive is getting 
read errors on raid 1,4,5,10.
>>
>> I'd also agree that with 1.x superblock it would be desirable to be 
>> able to set the maximum number of corrected read errors before a 
>> drive is kicked, which could be set by default to 0 for raid 1,4,5 
>> and to... I don't know... 20 (50? 100?) for raid-6.
It now seems to be hard-coded to 256 ...
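
The proposal quoted just above (a per-level limit on corrected read errors,
checked only once a scrub has finished) could look roughly like this. It is a
hypothetical policy sketch: md has no such table, the raid-6 value of 20 is
just the first number floated in the quote, and the device names and counts
are illustrative.

```python
# Hypothetical defaults following the proposal in this thread; md itself
# exposes no such per-level table.
DEFAULT_MAX_CORRECTED = {"raid1": 0, "raid4": 0, "raid5": 0, "raid6": 20}

def drives_to_kick(level, corrected_errors, scrub_running,
                   limits=DEFAULT_MAX_CORRECTED):
    """Return the drives whose corrected-read-error count exceeds the limit.
    While a scrub is running nothing is kicked, so every drive keeps
    contributing to parity computation until the scrub ends."""
    if scrub_running:
        return []
    limit = limits[level]
    return sorted(dev for dev, count in corrected_errors.items()
                  if count > limit)

counts = {"sda4": 0, "sdb4": 137, "sdc4": 0}  # illustrative counts
print(drives_to_kick("raid5", counts, scrub_running=True))
print(drives_to_kick("raid5", counts, scrub_running=False))
print(drives_to_kick("raid6", {"sda4": 3, "sdb4": 0}, scrub_running=False))
```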

> I can add that this situation with raid 1,4,5,10 would be greatly 
> ameliorated when the hot-device-replace feature gets implemented.
> The failures of raid 1,4,5,10 are due to the zero redundancy you get 
> in the time frame from when a drive is kicked to the end of the 
> regeneration.
> However if the hot-device-replace feature is added, and gets linked to 
> the drive-kicking process, the problem would disappear.
>
> Ideally instead of kicking (=failing) a drive directly, the 
> hot-device-replace feature would be triggered, so the new drive would 
> be replicated from the one being kicked (a few damaged blocks can be 
> read from parity in case of read error from the disk being replaced, 
> but don't "fail" the drive during the replace process just for this) 
> In this way you get 1 redundancy instead of zero during rebuild, and 
> the chances of the array going down during the rebuild process are 
> pratically nullified.
>
> I think the "hot-device-replace" action can replace the "fail" action 
> in the most used scenarios, which is the drive being kicked due to:
> 1 - unrecoverable read error (end of relocation sectors available)
> 2 - surpassing the threshold for max corrected read errors (see above, 
> if/when this gets implemented on 1.x superblock)
Both solutions seem good to me ... even if, yes, #1 is probably 
made redundant by #2.
And personally I'd keep zero, or a very low value, as the max corrected 
error threshold for raid 1,4,5,10.

I may also suggest this for emergency situations (no hot spares 
available, already degraded array, read error on the remaining disk(s)):
suppose you have a single disk which is getting read errors: maybe you 
lose some data, but you can still do a backup and save most of the data.
If you have a degraded array which gets an unrecoverable read error, 
reconstruction is not feasible any more, the disk is marked failed and the 
whole array fails. Then you have to recreate it with --force or 
--assume-clean and start to back up data .. but on every further read error 
you get the array offline again ... recreate in --force mode .. and so on 
(which needs skill and is error prone).
Maybe it would be useful for unrecoverable read errors on a degraded 
array to:
1) send a big alert to the admin, with detailed info
2) not fail the disk and the whole array, but set the array to readonly mode
3) report the read errors to the OS (as for a single drive)

This would make it possible to do a partial backup and save as much data as 
possible without having to tamper with create --force etc..
Experienced users may still try to overcome the situation by re-adding devices 
(maybe one dropped out simply due to a timeout), with create --force, etc.. 
but many people may have big trouble doing so, and they just see all 
their data gone, when just a few sectors out of many TB are unreadable and 
most of the data can be saved.

Best regards.

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore




* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-02-01 10:56           ` Giovanni Tessore
@ 2010-02-01 12:45             ` Asdo
  2010-02-01 15:11               ` Giovanni Tessore
  2010-02-01 13:27             ` Luca Berra
  1 sibling, 1 reply; 29+ messages in thread
From: Asdo @ 2010-02-01 12:45 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: linux-raid

Giovanni Tessore wrote:
> If you have a degraded array which gets an unrecoverable read error, 
> reconstruction is not feasible any more, the disk is mark failed and 
> the whole array fails. The you have to recreate with --force or 
> --assume-clean, start to backup data.. but on each other read errors 
> you get the array offline again ... recreate in --force mode .. and so 
> on (which needs skill and it's error prone).
> Maybe would be useful to have unrecoverable read errors on degraded 
> array to:
> 1) sent a big alert to admin, with detailed info
> 2) don't fail the disk and whole array, but set it into readonly mode
> 3) report read errors to the OS (as for a single drive)
>
> This would allow to do a partial backup and save as most data as 
> possible without having to tamper with create --force etc..
> Experienced use may still try to overcome the situation readding 
> devices (maybe one gone out simply due to timeout), with create 
> --force, etc.. but many persons may have big troubles doing so, and 
> they just see all their data gone, when just a few sectors over many 
> Tb are unreadable and most data cab be saved.

I think if you set it to readonly mode, it wouldn't degrade further even 
on read error.
I think I saw this from the source code, but now I'm not really sure any 
more.
Do you want to check?

If what I say is correct, you can get data out relatively easily with 1 
operation. The array won't go down.


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-02-01 12:45             ` Asdo
@ 2010-02-01 15:11               ` Giovanni Tessore
  0 siblings, 0 replies; 29+ messages in thread
From: Giovanni Tessore @ 2010-02-01 15:11 UTC (permalink / raw)
  To: Asdo; +Cc: linux-raid

Asdo wrote:
> Giovanni Tessore wrote:
>> Maybe would be useful to have unrecoverable read errors on degraded 
>> array to:
>> 1) sent a big alert to admin, with detailed info
>> 2) don't fail the disk and whole array, but set it into readonly mode
>> 3) report read errors to the OS (as for a single drive)
>
> I think if you set it to readonly mode, it wouldn't degrade further 
> even on read error.
> I think I saw this from the source code, but now I'm not really sure 
> any more.
> Do you want to check?

I certainly had to recreate the degraded array several times because it went 
down while reading from defective sectors.
I'm not sure I always set the array to readonly (because I was in quite a 
panic) ... but I'd guess I did.
I'll try to look into the code.

This has been my first bad experience with raid, I never had to go deep 
inside maintenance mode or to look into source code .. but I had to 
learn fast ;-)

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-02-01 10:56           ` Giovanni Tessore
  2010-02-01 12:45             ` Asdo
@ 2010-02-01 13:27             ` Luca Berra
  2010-02-01 15:51               ` Giovanni Tessore
  1 sibling, 1 reply; 29+ messages in thread
From: Luca Berra @ 2010-02-01 13:27 UTC (permalink / raw)
  To: linux-raid

On Mon, Feb 01, 2010 at 11:56:39AM +0100, Giovanni Tessore wrote:
> Yes, the new behaviour is good for raid-6.
> But unsafe for raid 1, 4, 5, 10.
> The old behaviour saved me in the past, and would have saved also this 
> time, allowing me to replace the disk as soon as possible.. the new one 
> didn't at all...
The 'new' behaviour was implemented because kicking drives out of an
array on a read error may prevent the array from being repaired at all.
Modern drives _have_ correctable read errors, it is a fact.
So if md kicked drives on read errors, multiple failures (read errors on
more than one drive, or read errors when sparing) could lose all the
data, data that could otherwise have been recovered.

> The new one must at least clearly alert the user that a drive is getting 
> read errors on raid 1,4,5,10.
Agreed, now let's define 'clearly alert', besides syslog.

L.

-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-02-01 13:27             ` Luca Berra
@ 2010-02-01 15:51               ` Giovanni Tessore
  0 siblings, 0 replies; 29+ messages in thread
From: Giovanni Tessore @ 2010-02-01 15:51 UTC (permalink / raw)
  To: linux-raid

> modern drives _have_ correctable read errors, it is a fact.
> So if md kicked drives on read error it is also possible to lose all
> data on multiple failures (read errors on more than one drives, or
> read-errors when sparing), that could have been recovered.

But if we assume that modern drives behave like this, we should also
assume that raid 5, 4, 10, and 1 with < 3 devices, are intrinsically
vulnerable, and in some way 'deprecated', because a read error during
reconstruction after a disk failure is likely to occur.

Personally I just reshaped the failed array as a 6-disk raid-6.
I'll also reshape another machine which has 3 disks to have 2 arrays, a
raid-1 with 3 devices and a raid-5, the first to be used for most
valuable data.

>> The new one must at least clearly alert the user that a drive is 
>> getting read errors on raid 1,4,5,10.
> Agreed, now let's define 'clearly alert', besides syslog.

I would use the same event mechanism that mdadm uses now, defining a new
CorrectedReadError event ... for raid-6 it can be info (or warning when
errors become too many, configurable); for other raid levels (the
'vulnerable' ones) the severity should be warning or critical.
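
As a minimal sketch of that mapping: CorrectedReadError is the event name
proposed here, not an existing mdadm event, and the default of 20 errors
before raid-6 escalates to warning is an arbitrary placeholder.

```python
def corrected_read_error_severity(level, error_count, raid6_warn_after=20,
                                  vulnerable_severity="warning"):
    """Severity for the hypothetical CorrectedReadError event.
    raid-6 stays at "info" until the count passes a configurable limit;
    every other level is treated as vulnerable and reports at least
    "warning" (or "critical" if the admin configures it so)."""
    if level == "raid6":
        return "warning" if error_count > raid6_warn_after else "info"
    return vulnerable_severity

print(corrected_read_error_severity("raid6", 3))
print(corrected_read_error_severity("raid6", 50))
print(corrected_read_error_severity("raid5", 1))
print(corrected_read_error_severity("raid1", 1, vulnerable_severity="critical"))
```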

-- 
Cordiali saluti.
Yours faithfully.

Giovanni Tessore


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-26 22:28 Read errors on raid5 ignored, array still clean .. then disaster !! Giovanni Tessore
  2010-01-27  7:41 ` Luca Berra
@ 2010-01-27  9:01 ` Asdo
  2010-01-27 10:09   ` Giovanni Tessore
  1 sibling, 1 reply; 29+ messages in thread
From: Asdo @ 2010-01-27  9:01 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: linux-raid

Giovanni Tessore wrote:
> This could (and for me did) bring to big disasters!
> Suppose you have a 4 disk raid with 2 spare disk ready for recovery
> There are lot of read errors on disk 1, but md silently recovers them 
> whitout marking disk as faulty (as it did for me)
> Disk 3 fails
> md adds one of the spare disks, and starts resync
> resync fails due to the read errors on disk 1
> everything is lost! till having 2 spare disks!!!???
> This is no fault tollerance ... it's fault creation!!!

Other than monitoring & proactively replacing the disk as Luca suggests, 
the thing that you (probably) have missed is periodically performing scrubs.

See man md for "check" or "repair".

With scrubs, your errors in /dev/sdf and /dev/sdb would have been 
detected a long time ago, and the disk in the worst shape would have run 
out of reallocation sectors and been kicked a long time ago, when the other 
disk was still in relatively good shape.

Double failures (in different positions of different disks) are 
relatively likely if you don't scrub the array. If you scrub the array 
they are much less likely.
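
A minimal sketch of starting such a scrub from a script, for example from a
weekly cron job. It writes "check" (or "repair") to the array's sync_action
file in sysfs, as described in man md; the function name and the sysfs_root
parameter (there only so the function can be tried against a scratch
directory) are made up for this example, and it needs root on a real system.

```python
from pathlib import Path

def start_scrub(md_name, action="check", sysfs_root="/sys/block"):
    """Write `action` ("check" or "repair") to the array's sync_action file
    and return the path that was written."""
    if action not in ("check", "repair"):
        raise ValueError(f"unsupported scrub action: {action}")
    path = Path(sysfs_root) / md_name / "md" / "sync_action"
    path.write_text(action + "\n")
    return path
```

Calling start_scrub("md3") would request a check of /dev/md3; progress then
shows up in /proc/mdstat.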

That said, you might still be able to get data out of your array:

1 - reassemble it, possibly with --force if a normal reassemble refuses to 
work  (*)
2 - immediately stop the resync by writing "idle" to 
/sys/block/mdX/md/sync_action
3 - immediately set it as readonly: mdadm --readonly /dev/mdX
4 - mount the array (with a readonly mount) and get data out of it with rsync

The purpose of 2 and 3 is to stop the resync (your array is not clean). 
I hope one of those two does it. You should not see progress with cat 
/proc/mdstat after those two steps.

#3 should also prevent further resyncs from starting, which they normally do 
when you hit an unreadable sector. Remember that if the resync starts, at 
98% of it your array will go down.
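
The four steps can be written out as a dry run that only prints the commands,
so they can be reviewed before anything touches the disks. This is a sketch:
the array name, member partitions and mount point are placeholders taken from
this thread or invented, --readonly is applied to the md device, and the
helper name is made up.

```python
import shlex

def rescue_commands(md_dev, members, mount_point, force=True):
    """Build the command sequence for steps 1-4 without running anything."""
    md_name = md_dev.rsplit("/", 1)[-1]
    assemble = ["mdadm", "--assemble"]
    if force:
        assemble.append("--force")
    return [
        assemble + [md_dev, *members],                                      # step 1
        ["sh", "-c", f"echo idle > /sys/block/{md_name}/md/sync_action"],   # step 2
        ["mdadm", "--readonly", md_dev],                                    # step 3
        ["mount", "-o", "ro", md_dev, mount_point],                         # step 4
    ]

members = ["/dev/sda4", "/dev/sdb4", "/dev/sdc4", "/dev/sdd4", "/dev/sde4"]
for cmd in rescue_commands("/dev/md3", members, "/mnt/rescue"):
    print(shlex.join(cmd))
```

Once the array is mounted read-only, the data can be copied off with rsync as
in step 4.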

Let us know

(*) I don't recommend using --create and --assume-clean like you did; 
it's much more dangerous than --assemble --force. Was it really needed? 
Does --assemble --force really not work?


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-27  9:01 ` Asdo
@ 2010-01-27 10:09   ` Giovanni Tessore
  2010-01-27 10:50     ` Asdo
  2010-01-27 19:33     ` Richard Scobie
  0 siblings, 2 replies; 29+ messages in thread
From: Giovanni Tessore @ 2010-01-27 10:09 UTC (permalink / raw)
  To: Asdo; +Cc: linux-raid

First of all, thanks for the many replies :-)

>> This could (and for me did) bring to big disasters!
>> Suppose you have a 4 disk raid with 2 spare disk ready for recovery
>> There are lot of read errors on disk 1, but md silently recovers them 
>> whitout marking disk as faulty (as it did for me)
>> Disk 3 fails
>> md adds one of the spare disks, and starts resync
>> resync fails due to the read errors on disk 1
>> everything is lost! till having 2 spare disks!!!???
>> This is no fault tollerance ... it's fault creation!!!
>
> Other than monitoring & proactively replacing the disk as Luca 
> suggests, the thing that you (probably) have missed is periodically 
> performing scrubs.
>
> See man md for "check" or "repair".
>
> With scrubs, your errors in /dev/sdf and /dev/sdb would have been 
> detected long time ago, and the disk in the worst shape would have run 
> out of reallocation sectors and be kicked long time ago when the other 
> disk was still relatively in good shape.
I didn't set up SMART monitoring... Luca is right.
But I don't like the idea that md relies on another tool, which might 
not be installed or correctly configured, to warn the user about potentially 
critical situations that directly involve md itself.

According to the logs, a "check" was run on md3 on 4 January 
(first read errors at the beginning of December, failure of the other disk on 
18 January); no read error occurred.
Looks like it did not help much :(
Maybe I was just very unlucky

>
> Double failures (in different positions of different disks) are 
> relatively likely if you don't scrub the array. If you scrub the array 
> they are much less likely.
>
> That said, you might still be able to get data out of your array:
>
> 1 - reassemble it, possibly with --force if normal reassemble refuses 
> to work  (*)
> 2 - immediately stop the resync by writing "idle" on 
> /sys/block/mdX/md/sync_action
> 3 - immediately set it as readonly: mdadm --readonly /dev/mdX
> 4 - mount the array (w/ readonly mount) and get data out of it with 
> rsyncs
>
> The purpose of 2 and 3 is to stop the resync (your array is not 
> clean). I hope one of those two does it. You should not see progress 
> with cat /proc/mdstat after those two steps.
>
> #3 also should prevent further resyncs to start, which normally start 
> when you hit an unreadable sector. Remember that if the resync starts, 
> at 98% of it your array will go down.
>
> Let us know
I recovered the array in degraded mode with:

mdadm --create /dev/md3 --assume-clean --level=5 --raid-devices=6 
--spare-devices=0 /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4 /dev/sde4 missing

then setting md3 readonly and mounting it readonly.
Obviously, when a read error occurs on sdb, the array goes offline again and I 
have to repeat the procedure.
It does not resync because I run it in degraded mode ("missing" in place of 
/dev/sdf4).

I'm not saving all the data, but a lot of it.

What I complain about most is the 'silent' handling of the recovered read errors.
I think it would be much appreciated, not only by me, if a warning were 
issued (by md and nothing else, since it is md that sees and handles the read 
error) to inform the admin that a problem may be coming.
In 2 years of 24/7 activity, none of the other 5 disks in the array gave 
a single read error; only sdb started giving many in December (while sdf 
failed suddenly, in one shot).
In my experience, when a disk begins to give read errors, it's better to 
replace it as soon as possible (with a disk which has done some burn-in, 
and which is tested to be ok).
As raid is meant to be redundant, maybe some redundant warning on 
recoverable read errors could be useful too.
Because what is a recoverable read error while all the other disks are ok 
becomes a fatal error if another disk fails.

By the way, I'm having a look at the md kernel source.
It looks like raid5 is hard-coded to set a device faulty if it gives 
more than 256 recoverable read errors.
It would be nice if this were configurable via /proc/sys/md

I'll follow up with a more detailed post on this point.

Thanks again.

Giovanni


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-27 10:09   ` Giovanni Tessore
@ 2010-01-27 10:50     ` Asdo
  2010-01-27 15:06       ` Goswin von Brederlow
  2010-01-27 16:15       ` Giovanni Tessore
  2010-01-27 19:33     ` Richard Scobie
  1 sibling, 2 replies; 29+ messages in thread
From: Asdo @ 2010-01-27 10:50 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: linux-raid

Giovanni Tessore wrote:
> I didn't setup smart monitoring... Luca is right.
> But I dont like the idea that md relies on another tool, that could be 
> not installed or correctly configured, to warn the user on potential 
> critical situations which involve directly itself.
>
> From the logs, it results that it did a "check" on md3 the 4 january 
> (first read errors at beginning of december, failure of other disk 18 
> january); no read error occurred.
> Looks like it did not help much :(
> Maybe I was just very unluky 

It's strange... quite unlucky I'd say...
Maybe the check needs to be run more frequently, like weekly.
(Asdo taking notes... :-D)
I am also no expert in this kind of statistics.
FWIW I have seen two hardware raid controllers, and they do the scrub 
weekly by default. I don't know if they also run SMART tests.

Also is it possible that you experienced an electricity surge or a 
physical shock on the computer?

Obviously smart checks also help.

Good luck
Asdo


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-27 10:50     ` Asdo
@ 2010-01-27 15:06       ` Goswin von Brederlow
  2010-01-27 16:15       ` Giovanni Tessore
  1 sibling, 0 replies; 29+ messages in thread
From: Goswin von Brederlow @ 2010-01-27 15:06 UTC (permalink / raw)
  To: linux-raid

Asdo <asdo@shiftmail.org> writes:

> Giovanni Tessore wrote:
>> I didn't setup smart monitoring... Luca is right.
>> But I dont like the idea that md relies on another tool, that could
>> be not installed or correctly configured, to warn the user on
>> potential critical situations which involve directly itself.
>>
>> From the logs, it results that it did a "check" on md3 the 4 january
>> (first read errors at beginning of december, failure of other disk
>> 18 january); no read error occurred.
>> Looks like it did not help much :(
>> Maybe I was just very unluky
>
> It's strange... quite unlucky I'd say...
> Maybe the check needs to be run more frequently, like weekly.
> (Asdo taking notes... :-D)
> I also am not expert of this kind of statistics.
> FWIW I have seen two hardware raid controllers, and they do the scrub
> weekly by default. I don't know if they also run SMART tests.
>
> Also is it possible that you experienced an electricity surge or a
> physical shock on the computer?
>
> Obviously smart checks also help.
>
> Good luck
> Asdo

Or the check just does not report errors.

You need to read your syslog/kernel log to see read errors, and check the
mismatch count after a check. Not all distributions report a mismatch
count != 0 after a check. Most don't, I think.
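
A minimal sketch of checking the mismatch count by hand, assuming the 
mismatch_cnt file that the md sysfs interface provides under 
/sys/block/<array>/md/. The function names and the sysfs_root parameter 
are illustrative, not part of any existing tool.

```python
from pathlib import Path


def read_mismatch_cnt(md_name: str, sysfs_root: str = "/sys/block") -> int:
    """Return the mismatch count md recorded for the given array."""
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())


def check_result(md_name: str, sysfs_root: str = "/sys/block") -> str:
    """Build a one-line report suitable for a log line or a mail subject."""
    count = read_mismatch_cnt(md_name, sysfs_root)
    if count != 0:
        return f"WARNING: {md_name}: mismatch_cnt={count} after last check"
    return f"{md_name}: no mismatches reported"
```

Note that this only reports mismatches found by a check; read errors 
still have to be looked up in the syslog/kernel log.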

MfG
        Goswin


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-27 10:50     ` Asdo
  2010-01-27 15:06       ` Goswin von Brederlow
@ 2010-01-27 16:15       ` Giovanni Tessore
  1 sibling, 0 replies; 29+ messages in thread
From: Giovanni Tessore @ 2010-01-27 16:15 UTC (permalink / raw)
  To: linux-raid


> Also is it possible that you experienced an electricity surge or a 
> physical shock on the computer?
No, the machine is well protected by a good UPS unit.


I had a look at the kernel sources (2.6.24; I'll check the latest 
kernel later).
I'm not a kernel expert, and I never needed to look deeply inside it 
before, but:


In drivers/md/raid5.c:

raid5_end_read_request()
{ ...
else if (atomic_read(&rdev->read_errors) > conf->max_nr_stripes)
    printk(KERN_WARNING
           "raid5:%s: Too many read errors, failing device %s.\n",
           mdname(conf->mddev), bdn);
... }

It surely keeps track of how many read errors occurred! So the driver 
detects recovered read errors and counts them!
Later in the same source:

int run(mddev_t *mddev)
{ ...
conf->max_nr_stripes = NR_STRIPES;
... }

Looks like it statically sets a limit of 256 recovered read errors 
before setting the device as faulty.

Moreover, *Documentation/md.txt* itself states that for each md device 
in /sys/block there is a directory for each physical device composing 
the array, like /sys/block/md0/md/dev-sda1, each containing many of the 
device's parameters, among them:

...
errors
        An approximate count of read errors that have been detected on
        this device but have not caused the device to be evicted from
        the array (either because they were corrected or because they
        happened while the array was read-only).  When using version-1
        metadata, this value persists across restarts of the array.
...

So the info on how many read errors occurred on each device is collected 
and available!
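
A minimal sketch of reading these per-device counters, assuming the 
/sys/block/<array>/md/dev-<name>/errors layout quoted from 
Documentation/md.txt above. The function name and the sysfs_root 
parameter are illustrative only.

```python
from pathlib import Path


def read_error_counts(md_name: str, sysfs_root: str = "/sys/block") -> dict:
    """Map each member device of an array to its md 'errors' counter."""
    md_dir = Path(sysfs_root) / md_name / "md"
    counts = {}
    for dev_dir in sorted(md_dir.glob("dev-*")):
        errors_file = dev_dir / "errors"
        if errors_file.is_file():
            # Strip the "dev-" prefix so the key is the plain device name.
            device = dev_dir.name[len("dev-"):]
            counts[device] = int(errors_file.read_text().strip())
    return counts
```

For the array in this thread, read_error_counts("md3") would return one 
entry per member partition, such as sdb4. As the documentation says, the 
count is approximate.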

I would suggest the following, which *would surely help a lot in 
preventing disasters* like mine:

- it seems that the max number of read errors allowed is set statically 
in raid5.c to 256 by "conf->max_nr_stripes = NR_STRIPES;"; it could be 
made configurable through an entry in /sys/block/mdXX
- let /proc/mdstat report clearly how many read errors occurred per 
device, if any
- let mdadm be configurable in monitor mode to trigger alerts when the 
number of read errors for a device changes or goes > n
- explain clearly in the how-to and other user documentation how raid 
behaves on read errors; after a quick survey among my colleagues, I 
noticed that nobody was aware of this, and all of them were sure that 
raid behaves the same way for both write and read errors!
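
The alerting rule from the third point can be sketched in user space as 
follows. This is not an existing mdadm feature; the function name, the 
message wording and the threshold parameter are all hypothetical. The 
two dictionaries are assumed to hold per-device error counts sampled at 
two different times.

```python
def find_alerts(previous: dict, current: dict, threshold: int) -> list:
    """Compare two samples of per-device read error counts.

    Reports a device when its count has risen since the previous sample,
    and again when the count is above the given threshold.
    """
    alerts = []
    for device in sorted(current):
        now = current[device]
        before = previous.get(device, 0)
        if now > before:
            alerts.append(f"{device}: read errors rose from {before} to {now}")
        if now > threshold:
            alerts.append(f"{device}: {now} read errors exceeds limit {threshold}")
    return alerts
```

A cron job could keep the last sample in a file and mail the returned 
lines whenever the list is not empty.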

I examined kernel source 2.6.24 and mdadm 2.6.3; maybe newer versions 
already do this; if so, sorry.
My knowledge of the linux-raid implementation is not good (otherwise I 
would answer here, not ask :P ), but maybe I can help.

Thanks

Giovanni



* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
  2010-01-27 10:09   ` Giovanni Tessore
  2010-01-27 10:50     ` Asdo
@ 2010-01-27 19:33     ` Richard Scobie
  1 sibling, 0 replies; 29+ messages in thread
From: Richard Scobie @ 2010-01-27 19:33 UTC (permalink / raw)
  To: Giovanni Tessore; +Cc: Asdo, linux-raid

Giovanni Tessore wrote:

> I didn't setup smart monitoring... Luca is right.
> But I dont like the idea that md relies on another tool, that could be
> not installed or correctly configured, to warn the user on potential
> critical situations which involve directly itself.

Or you build systems around a controller* that works with smartd, only 
to find that smartd support breaks in later kernel versions and finding 
suitable replacements has proven difficult. :(

*LSI SAS

http://marc.info/?l=linux-scsi&m=125293858330877&w=2

https://bugzilla.redhat.com/show_bug.cgi?id=512613

Regards,

Richard


* Re: Read errors on raid5 ignored, array still clean .. then disaster !!
@ 2010-01-27  9:56 Giovanni Tessore
  0 siblings, 0 replies; 29+ messages in thread
From: Giovanni Tessore @ 2010-01-27  9:56 UTC (permalink / raw)
  To: linux-raid

>
> > Is there any way to configure raid in order to have devices marked faulty 
> > on read errors (at least when they clearly become too many)?
> I don't think so
>   
I think it would be useful to be able to configure the number of 
recovered read errors allowed before the device goes faulty.

> > This could (and for me did) bring to big disasters!
> Don't agree with you, you had all the info from syslog
> You should have run smart tests on the disks and proactively replace a
> failing disk.
>   
It would be nice if md issued a warning on recovered read error events, 
as it does for other md events (device failure, etc.).

> it does _not_ ignore read errors 
> in case of read errors mdadm rewrites the erroring sector, and only if
> this fails it will kick the member out of the array.
> with modern drives it is possible to have some failed sector, which the
> drive firmware will reallocate on write (all modern drives have a range
> of sectors reserved for this very purpose)
> mdadm does not do any bookkeeping on reallocated_sector_count per drive
> the drive does. the data can be accessed with smartctl
> drives showing excessive reallocated_sector_count should be replaced.
>   
Sorry, by "ignore" I mean "it silently manages to recover from the read 
error, without alerting anybody".
Btw, as I see from the kernel sources, it does keep track of recovered 
read errors per device.
And only when they are > 256 does it mark the device faulty (I'm 
preparing another post on it).
So, why wait for exactly 256 errors?
I think it should be configurable ... and set to a much lower level for me.

> Consider the following scenario:
> raid5 (sda,b,c,d)
> sda has a read error, mdadm kicks it immediately from the array
> a few minutes/hours later sdc fails completely
> lost data and no time to react, that is far worse than having 50 days of
> warnings and ignoring them.
>   
Yes, but suppose that sda has 250 corrected read errors; it's still 
clean.
sdc fails and is kicked out
resync starts
sda gets > 6 read errors during resync, and it's set as faulty (which 
is likely to happen, as the drive is clearly dying)
data is lost the same way
(this is my real scenario; it really happened)
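
The arithmetic of this scenario can be sketched as follows. It models 
only the comparison I read in raid5.c (the device is failed when 
read_errors > max_nr_stripes, with the limit at 256); the function name 
is made up, and the real driver may treat the counter in ways this 
sketch does not show.

```python
NR_STRIPES = 256  # limit as read from raid5.c in kernel 2.6.24


def errors_until_failed(read_errors: int, limit: int = NR_STRIPES) -> int:
    """Further read errors needed before read_errors > limit holds."""
    return max(0, limit + 1 - read_errors)
```

With 250 errors already counted, errors_until_failed(250) gives 7, i.e. 
more than 6 further errors during the resync.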

Much difference?

Personally I'd prefer to know as soon as possible that something is 
going wrong: if not by setting the device faulty, then with a warning (by 
mail, like other md events) saying "this is the n-th recovered error for 
this device".
IMHO the admin has to be clearly made aware *by md*, not by other 
monitoring tools, that the array is facing a possibly critical situation.

> I'm sorry for your data, hope you had backups.
>   
Thanks.
I am trying to recover by forcing a re-add of the drive that gives read 
errors and using the array in degraded mode ... it seems to work.

Giovanni


end of thread, other threads:[~2010-02-01 15:51 UTC | newest]

Thread overview: 29+ messages
2010-01-26 22:28 Read errors on raid5 ignored, array still clean .. then disaster !! Giovanni Tessore
2010-01-27  7:41 ` Luca Berra
2010-01-27  9:01   ` Goswin von Brederlow
2010-01-29 10:48   ` Neil Brown
2010-01-29 11:58     ` Goswin von Brederlow
2010-01-29 19:14     ` Giovanni Tessore
2010-01-30  7:58       ` Luca Berra
2010-01-30 15:52         ` Giovanni Tessore
2010-01-30  7:54     ` Luca Berra
2010-01-30 10:55     ` Giovanni Tessore
2010-01-30 18:44     ` Giovanni Tessore
2010-01-30 21:41       ` Asdo
2010-01-30 22:20         ` Giovanni Tessore
2010-01-31  1:23           ` Roger Heflin
2010-01-31 10:45             ` Giovanni Tessore
2010-01-31 14:08               ` Roger Heflin
2010-01-31 14:31         ` Asdo
2010-02-01 10:56           ` Giovanni Tessore
2010-02-01 12:45             ` Asdo
2010-02-01 15:11               ` Giovanni Tessore
2010-02-01 13:27             ` Luca Berra
2010-02-01 15:51               ` Giovanni Tessore
2010-01-27  9:01 ` Asdo
2010-01-27 10:09   ` Giovanni Tessore
2010-01-27 10:50     ` Asdo
2010-01-27 15:06       ` Goswin von Brederlow
2010-01-27 16:15       ` Giovanni Tessore
2010-01-27 19:33     ` Richard Scobie
  -- strict thread matches above, loose matches on Subject: below --
2010-01-27  9:56 Giovanni Tessore
