* Proactive Drive Replacement
@ 2008-10-20 17:35 Jon Nelson
2008-10-20 22:40 ` Mario 'BitKoenig' Holbe
0 siblings, 1 reply; 16+ messages in thread
From: Jon Nelson @ 2008-10-20 17:35 UTC (permalink / raw)
To: LinuxRaid
I was wondering about proactive drive replacement.
Specifically, let's assume we have a RAID5 (or 10 or whatever)
comprised of 3 drives, A, B, and C.
Let's assume we want to replace drive C with drive D, and the array is md0.
We want to minimize our rebuild windows.
The naive approach would be to:
--add drive D to md0
--fail drive C on md0
wait for the rebuild to finish.
(zero the superblock on drive C)
remove drive C
Obviously, this places the array in mortal danger if another drive
should fail during that time.
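For concreteness, the naive sequence in mdadm terms would be roughly this
(untested; /dev/sdc and /dev/sdd standing in for drives C and D):

  mdadm /dev/md0 --add /dev/sdd        # add drive D as a spare
  mdadm /dev/md0 --fail /dev/sdc       # fail drive C; rebuild onto D starts
  cat /proc/mdstat                     # wait for the rebuild to finish
  mdadm /dev/md0 --remove /dev/sdc     # remove drive C from the array
  mdadm --zero-superblock /dev/sdc     # then wipe C's superblock

(I've put --remove before --zero-superblock since mdadm won't touch a device
that is still attached to the array.)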
Could we not do something like this instead?
1. make sure md0 is using bitmaps
2. --fail drive C
3. create a new *single disk* raid1 from drive C
4. --add drive D to md99
5. --add md99 back into md0.
6. wait for md99's rebuild to finish
7. --fail and --remove md99
8. break md99
9. --add drive D to md0
The problem I see with the above is the creation of the raid1 which
overwrites the superblock. Is there some way to avoid that (--build?)?
The advantage is that the amount of time the array spends degraded is,
theoretically, very small. The disadvantages include complexity,
difficulty resuming in the case of more serious error (maybe), and *2*
windows during which the array is mortally vulnerable to a component
failure.
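For concreteness, the commands I have in mind look roughly like this
(completely untested; /dev/sdc, /dev/sdd and md99 are placeholders, and
step 3 is exactly where the superblock problem bites):

  mdadm --grow /dev/md0 --bitmap=internal   # 1. make sure md0 has a bitmap
  mdadm /dev/md0 --fail /dev/sdc            # 2. fail drive C ...
  mdadm /dev/md0 --remove /dev/sdc          #    ... and detach it for reuse
  mdadm --create /dev/md99 --level=1 --raid-devices=2 /dev/sdc missing
                                            # 3. single-disk raid1 from C
                                            #    (writes a new superblock!)
  mdadm /dev/md99 --add /dev/sdd            # 4. add drive D; C -> D mirroring
  mdadm /dev/md0 --add /dev/md99            # 5. add md99 back into md0
  cat /proc/mdstat                          # 6. wait for md99's rebuild
  mdadm /dev/md0 --fail /dev/md99           # 7. fail and remove md99
  mdadm /dev/md0 --remove /dev/md99
  mdadm --stop /dev/md99                    # 8. break md99
  mdadm /dev/md0 --add /dev/sdd             # 9. add drive D to md0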
--
Jon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Proactive Drive Replacement
2008-10-20 17:35 Proactive Drive Replacement Jon Nelson
@ 2008-10-20 22:40 ` Mario 'BitKoenig' Holbe
2008-10-21 8:38 ` David Greaves
0 siblings, 1 reply; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2008-10-20 22:40 UTC (permalink / raw)
To: linux-raid
Jon Nelson <jnelson-linux-raid@jamponi.net> wrote:
> I was wondering about proactive drive replacement.
[bitmaps, raid1 drive to replace and new drive, ...]
I believe I remember a HowTo going over this list somewhere in the past
(early bitmap times?) which recommended exactly your way.
> The problem I see with the above is the creation of the raid1 which
> overwrites the superblock. Is there some way to avoid that (--build?)?
You can build a RAID1 without superblock.
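For example, something along these lines (untested, device names are
placeholders, and I'm not 100% sure --add behaves identically on a
superblock-less array - worth checking first):

  mdadm --build /dev/md99 --level=1 --raid-devices=2 /dev/sdc missing
  mdadm /dev/md99 --add /dev/sdd   # mirror the old drive onto the new one

Since --build writes no metadata of its own, the md0 superblock already
sitting on the old drive should stay intact underneath md99.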
regards
Mario
--
[mod_nessus for iauth]
<delta> "scanning your system...found depreciated OS...found
hole...installing new OS...please reboot and reconnect now"
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Proactive Drive Replacement
2008-10-20 22:40 ` Mario 'BitKoenig' Holbe
@ 2008-10-21 8:38 ` David Greaves
2008-10-21 13:05 ` Jon Nelson
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: David Greaves @ 2008-10-21 8:38 UTC (permalink / raw)
To: Mario 'BitKoenig' Holbe; +Cc: linux-raid, Jon Nelson, neilb
Mario 'BitKoenig' Holbe wrote:
> Jon Nelson <jnelson-linux-raid@jamponi.net> wrote:
>> I was wondering about proactive drive replacement.
> [bitmaps, raid1 drive to replace and new drive, ...]
>
> I believe I remember a HowTo going over this list somewhere in the past
> (early bitmap times?) which recommended exactly your way.
>
>> The problem I see with the above is the creation of the raid1 which
>> overwrites the superblock. Is there some way to avoid that (--build?)?
>
> You can build a RAID1 without superblock.
How nice, an independent request for a feature just a few days later...
See:
"non-degraded component replacement was Re: Distributed spares"
http://marc.info/?l=linux-raid&m=122398583728320&w=2
It references Dean Gaudet's work which explains why the above scenario, although
it seems OK at first glance, isn't good enough.
The main issue is that the drive being replaced almost certainly has a bad
block. This block could be recovered from the raid5 set but won't be.
Worse, the mirror operation may just fail to mirror that block - leaving it
'random' and thus corrupt the set when replaced.
Of course this will work in the happy path ... but raid is about correct
behaviour in the unhappy path.
If you could force the mirroring to complete and note the non-mirrored blocks,
then you could fix it afterwards: identify the bad/unwritten block on the new
device in the raid set, manually set the bitmap for the area around that block
to 'dirty', and force it to be rebuilt from the remaining disks.
Actually, this would be a nice thing to do as a subset of the feature to force a
re-write of SMART identified badblocks using parity calculated values.
David
--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Proactive Drive Replacement
2008-10-21 8:38 ` David Greaves
@ 2008-10-21 13:05 ` Jon Nelson
2008-10-21 13:36 ` David Greaves
2008-10-21 13:50 ` David Lethe
2008-10-21 13:57 ` Mario 'BitKoenig' Holbe
2008-10-24 5:57 ` Luca Berra
2 siblings, 2 replies; 16+ messages in thread
From: Jon Nelson @ 2008-10-21 13:05 UTC (permalink / raw)
To: David Greaves; +Cc: Mario 'BitKoenig' Holbe, LinuxRaid
On Tue, Oct 21, 2008 at 3:38 AM, David Greaves <david@dgreaves.com> wrote:
> Mario 'BitKoenig' Holbe wrote:
>> Jon Nelson <jnelson-linux-raid@jamponi.net> wrote:
>>> I was wondering about proactive drive replacement.
>> [bitmaps, raid1 drive to replace and new drive, ...]
>>
>> I believe I remember a HowTo going over this list somewhere in the past
>> (early bitmap times?) which recommended exactly your way.
>>
>>> The problem I see with the above is the creation of the raid1 which
>>> overwrites the superblock. Is there some way to avoid that (--build?)?
>>
>> You can build a RAID1 without superblock.
>
> How nice, an independent request for a feature just a few days later...
>
> See:
> "non-degraded component replacement was Re: Distributed spares"
> http://marc.info/?l=linux-raid&m=122398583728320&w=2
D'oh! I had skipped that thread before. There are differences, however minor.
> It references Dean Gaudet's work which explains why the above scenario, although
> it seems OK at first glance, isn't good enough.
>
> The main issue is that the drive being replaced almost certainly has a bad
> block. This block could be recovered from the raid5 set but won't be.
> Worse, the mirror operation may just fail to mirror that block - leaving it
> 'random' and thus corrupt the set when replaced.
> Of course this will work in the happy path ... but raid is about correct
> behaviour in the unhappy path.
In my case I was replacing a drive because I didn't like it.
--
Jon
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Proactive Drive Replacement
2008-10-21 13:05 ` Jon Nelson
@ 2008-10-21 13:36 ` David Greaves
2008-10-21 13:50 ` David Lethe
1 sibling, 0 replies; 16+ messages in thread
From: David Greaves @ 2008-10-21 13:36 UTC (permalink / raw)
To: Jon Nelson; +Cc: Mario 'BitKoenig' Holbe, LinuxRaid
Jon Nelson wrote:
> In my case I was replacing a drive because I didn't like it.
Hmm, I suspect drive-ism will possibly not be the most common reason for
swapping drives ;)
--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: Proactive Drive Replacement
2008-10-21 13:05 ` Jon Nelson
2008-10-21 13:36 ` David Greaves
@ 2008-10-21 13:50 ` David Lethe
2008-10-21 14:11 ` Mario 'BitKoenig' Holbe
2008-10-21 19:39 ` David Greaves
1 sibling, 2 replies; 16+ messages in thread
From: David Lethe @ 2008-10-21 13:50 UTC (permalink / raw)
To: Jon Nelson, David Greaves; +Cc: Mario 'BitKoenig' Holbe, LinuxRaid
> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Jon Nelson
> Sent: Tuesday, October 21, 2008 8:06 AM
> To: David Greaves
> Cc: Mario 'BitKoenig' Holbe; LinuxRaid
> Subject: Re: Proactive Drive Replacement
>
> On Tue, Oct 21, 2008 at 3:38 AM, David Greaves <david@dgreaves.com> wrote:
> > Mario 'BitKoenig' Holbe wrote:
> >> Jon Nelson <jnelson-linux-raid@jamponi.net> wrote:
> >>> I was wondering about proactive drive replacement.
> >> [bitmaps, raid1 drive to replace and new drive, ...]
> >>
> >> I believe I remember a HowTo going over this list somewhere in the past
> >> (early bitmap times?) which recommended exactly your way.
> >>
> >>> The problem I see with the above is the creation of the raid1 which
> >>> overwrites the superblock. Is there some way to avoid that (--build?)?
> >>
> >> You can build a RAID1 without superblock.
> >
> > How nice, an independent request for a feature just a few days later...
> >
> > See:
> > "non-degraded component replacement was Re: Distributed spares"
> > http://marc.info/?l=linux-raid&m=122398583728320&w=2
>
> D'oh! I had skipped that thread before. There are differences, however minor.
>
> > It references Dean Gaudet's work which explains why the above scenario,
> > although it seems OK at first glance, isn't good enough.
> >
> > The main issue is that the drive being replaced almost certainly has a bad
> > block. This block could be recovered from the raid5 set but won't be.
> > Worse, the mirror operation may just fail to mirror that block - leaving it
> > 'random' and thus corrupt the set when replaced.
> > Of course this will work in the happy path ... but raid is about correct
> > behaviour in the unhappy path.
>
> In my case I was replacing a drive because I didn't like it.
>
> --
> Jon
S.M.A.R.T. does not, has not, will not, ever ... identify bad blocks.
At most, depending on the firmware, it will trigger a bit if the disk has a
bad block that was discovered as a result of a read already. It will NOT
trigger a bit if there is a bad block that hasn't been read yet by either a
self-test or an I/O request from the host.
For ATA/SATA class drives, the ANSI specification for S.M.A.R.T. provides for
reading some structures which indicate such things as cumulative errors,
temperature, and a Boolean that says if the disk is in a degrading mode and a
S.M.A.R.T. alert is warranted. The ANSI spec is also clear in that everything
but that single pass/fail bit is open to interpretation by the manufacturer
(other than data format for these various registers).
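To make that concrete with smartmontools (device name is just an example):

  smartctl -H /dev/sda   # the single standardized pass/fail health verdict
  smartctl -A /dev/sda   # vendor-defined attributes: reallocated sectors,
                         # temperature, error counts, etc.

Everything -A prints beyond the pass/fail verdict is exactly the
manufacturer-interpreted data described above.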
SCSI/SAS/FC/SAA class devices also have this bit, but the ANSI SCSI spec also
provides for Log pages which are somewhat similar to the structures defined
for ATA/SATA class disks, the difference being that the ANSI spec formalized
such things as exactly where errors and warnings of various types belong.
They also provide for a rich set of vendor-specific pages.
Both families of disks provide for some self-test commands, but these
commands do not scan the entire surface of the disk, so they are incapable of
reporting or indicating where you have a new bad block. They report if you
have a bad block if one is found in the extremely small sample of I/O it ran.
Now some enterprise class drives support something called BGMS (like the
Seagate 15K.5 SAS/FC/SCSI disks), but 99% of the disks out there do not have
such a mechanism.
Sorry about the rant .. but it finally got to me: people keep posting as if
S.M.A.R.T. were this all-knowing mechanism that tells you what is wrong with
the disk and/or where the bad blocks might be. It isn't.
The poster is 100% correct in that parity-protected RAID is all about
recovering when bad things happen. Distributing spares is about performance.
Their objectives are mutually exclusive. If you must have a RAID mechanism
that is fast, safe, and efficient on rebuilds and expansions, then consider
either high-end hardware-based RAID or run ZFS on Solaris. The next best
thing in the Linux world is RAID6.
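For what it's worth, a minimal md RAID6 is just (4 devices is the minimum;
names are placeholders):

  mdadm --create /dev/md0 --level=6 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

The second parity block per stripe is what lets it ride out a bad block on a
second disk during a rebuild - the failure mode being discussed here.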
David @ santools.com
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Proactive Drive Replacement
2008-10-21 8:38 ` David Greaves
2008-10-21 13:05 ` Jon Nelson
@ 2008-10-21 13:57 ` Mario 'BitKoenig' Holbe
2008-10-21 17:29 ` David Greaves
2008-10-24 5:57 ` Luca Berra
2 siblings, 1 reply; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2008-10-21 13:57 UTC (permalink / raw)
To: linux-raid
David Greaves <david@dgreaves.com> wrote:
> The main issue is that the drive being replaced almost certainly has a bad
> block.
Then, the replacement is not pro-active ;)
> This block could be recovered from the raid5 set but won't be.
This is what 'check' and 'repair' operations
(/sys/block/md*/md/sync_action) can be used for.
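e.g. (md0 is just an example):

  echo check  > /sys/block/md0/md/sync_action   # read everything; unreadable
                                                # blocks get rewritten from
                                                # the remaining devices
  cat /sys/block/md0/md/mismatch_cnt            # inconsistencies found
  echo repair > /sys/block/md0/md/sync_action   # also rewrite mismatched
                                                # parity/data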
regards
Mario
--
When Bruce Schneier uses double ROT13 encryption, the ciphertext is totally
unbreakable.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Proactive Drive Replacement
2008-10-21 13:50 ` David Lethe
@ 2008-10-21 14:11 ` Mario 'BitKoenig' Holbe
2008-10-21 15:13 ` David Lethe
2008-10-21 19:39 ` David Greaves
1 sibling, 1 reply; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2008-10-21 14:11 UTC (permalink / raw)
To: linux-raid
David Lethe <david@santools.com> wrote:
> S.M.A.R.T. does not, has not, will not, ever ... identify bad blocks.
Well, as you state yourself later, S.M.A.R.T. defines self-tests which
are able to identify bad blocks. They do have to be triggered, though.
> Both families of disks provide for some self-test commands, but these
> commands do not scan the
> entire surface of the disk
This is not true. The long self-test scans the entire surface of the
disk, at least for ATA devices; I don't know if it does that for SCSI
devices too.
ATA also knows about selective self-tests which are able to scan
definable surface areas - which is, for one, quite nice to identify
more than one bad sector, and, for another, quite nice on bigger
devices as well... my ST31500341AS takes about 4.5 hours for a long
self-test.
> new bad block. They report if you have a bad block if one is found in
> the extremely small sample
> of I/O it ran.
And, at least ATA devices report the LBA_of_first_error in the self-test
log, so you can identify the first bad sector.
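With smartctl that is roughly (device and LBA range are just examples):

  smartctl -t long /dev/sda                    # full-surface self-test
  smartctl -t select,1000000-2000000 /dev/sda  # selective test over an LBA range
  smartctl -l selftest /dev/sda                # self-test log, incl. LBA_of_first_error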
regards
Mario
--
Singing is the lowest form of communication.
-- Homer J. Simpson
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: Re: Proactive Drive Replacement
2008-10-21 14:11 ` Mario 'BitKoenig' Holbe
@ 2008-10-21 15:13 ` David Lethe
2008-10-21 15:30 ` Mario 'BitKoenig' Holbe
0 siblings, 1 reply; 16+ messages in thread
From: David Lethe @ 2008-10-21 15:13 UTC (permalink / raw)
To: Mario 'BitKoenig' Holbe, linux-raid
> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Mario 'BitKoenig' Holbe
> Sent: Tuesday, October 21, 2008 9:12 AM
> To: linux-raid@vger.kernel.org
> Subject: Re: Proactive Drive Replacement
>
> David Lethe <david@santools.com> wrote:
> > S.M.A.R.T. does not, has not, will not, ever ... identify bad blocks.
>
> Well, as you state yourself later, S.M.A.R.T. defines self-tests which
> are able to identify bad blocks. They do have to be triggered, though.
>
> > Both families of disks provide for some self-test commands, but these
> > commands do not scan the entire surface of the disk
>
> This is not true. The long self-test scans the entire surface of the
> disk, at least for ATA devices; I don't know if it does that for SCSI
> devices too.
> ATA also knows about selective self-tests which are able to scan
> definable surface areas - which is, for one, quite nice to identify
> more than one bad sector, and, for another, quite nice on bigger
> devices as well... my ST31500341AS takes about 4.5 hours for a long
> self-test.
>
> > new bad block. They report if you have a bad block if one is found in
> > the extremely small sample of I/O it ran.
>
> And, at least ATA devices report the LBA_of_first_error in the self-test
> log, so you can identify the first bad sector.
>
> regards
> Mario
> --
> Singing is the lowest form of communication.
> -- Homer J. Simpson
The SCSI family of self-test commands terminates after the first media
error. This makes perfect sense: if the disk fails, you ordinarily want to
know that immediately rather than have the disk continue scanning. As such,
the self-test gives you the first bad block and that is it.
As for SATA/ATA self-tests, all logs are limited to 512 bytes. If you run the
right self-test, you will get a PARTIAL list of bad blocks. Specifically, you
get 24 bytes which tell you the starting bad block. You do not even get a
range of bad blocks. You just know that block X is bad. It doesn't tell you
if block X+1 is bad. If block X+2 is bad, it will tell you that, because it
chews up another log entry. There is room for 20 entries.
Not all disks support this type of self-test either. The ANSI spec says it is
optional, and it is a relatively recent introduction.
So, at best, if your disk supports it, you can run self-tests that will take
half a day and give you a partial list of bad blocks between ranges of LBA
numbers you want to scan. This is correctly called the "SMART selective
self-test routine". By the way, this is an OFF-LINE scan.
So bottom line, Mario is correct in that there is a way to get a PARTIAL
list of bad blocks, if you have a disk that supports this command and you're
willing to run an off-line scan (not practical for a parity-protected RAID
environment).
As the original poster wanted to just use SMART to factor known bad blocks
into a rebuild, you can see that there is no viable option unless you already
have a full list of known bad blocks. With these types of disks you have to
find bad blocks as you read from them as part of the rebuild.
It is possible that some vendor has implemented a SATA ON-LINE bad-block
scanning mechanism that reports results and doesn't kill I/O performance. It
would have to give a full list of bad blocks, or at least starting block +
range.
That would be wonderful as you could just read the list at a regular interval
and rebuild stripes as necessary. You'd have self-healing parity. It still
wouldn't protect against a drive failure, but it would ensure that you
wouldn't have any lost chunks due to an unreadable block on one of the
surviving disks in a RAID set.
David
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Proactive Drive Replacement
2008-10-21 15:13 ` David Lethe
@ 2008-10-21 15:30 ` Mario 'BitKoenig' Holbe
0 siblings, 0 replies; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2008-10-21 15:30 UTC (permalink / raw)
To: linux-raid
David Lethe <david@santools.com> wrote:
> is correctly called the "SMART selective
> self-test routine". By the way, this is an OFF-LINE scan.
short, long, conveyance and selective tests are all offline.
> So bottom line, Mario is correct in that there is a way to get a PARTIAL
> list of bad blocks, if you have a disk that supports this command and
> you're willing to run an off-line scan (not practical for a
> parity-protected RAID environment).
Most modern (ATA) disks support "Suspend Offline collection upon new
command". The tests take notably longer on a loaded disk, and infrequent
requests to that disk take notably longer as well (frequent requests just
keep the test suspended), but it works.
> It is possible that some vendor has implemented a SATA ON-LINE bad-block
> scanning mechanism that reports results and doesn't kill I/O performance.
> It would have to give a full list of bad blocks, or at least starting
> block + range.
>
> That would be wonderful as you could just read the list at a regular
> interval and rebuild stripes as necessary. You'd have self-healing parity.
echo check > /sys/block/mdx/md/sync_action
That's indeed way more powerful than any attempt to rely on any
S.M.A.R.T. thingy.
regards
Mario
--
I thought the only thing the internet was good for was porn. -- Futurama
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Proactive Drive Replacement
2008-10-21 13:57 ` Mario 'BitKoenig' Holbe
@ 2008-10-21 17:29 ` David Greaves
0 siblings, 0 replies; 16+ messages in thread
From: David Greaves @ 2008-10-21 17:29 UTC (permalink / raw)
To: Mario 'BitKoenig' Holbe; +Cc: linux-raid
Mario 'BitKoenig' Holbe wrote:
> David Greaves <david@dgreaves.com> wrote:
>> The main issue is that the drive being replaced almost certainly has a bad
>> block.
>
> Then, the replacement is not pro-active ;)
>
>> This block could be recovered from the raid5 set but won't be.
>
> This is what 'check' and 'repair' operations
> (/sys/block/md*/md/sync_action) can be used for.
Well, yes and no.
If I have a bad block then I could use the remaining disks to calculate data to
overwrite it. So yes.
However the overwrite may fail. So no.
If I have an md managed mirror then the overwrite will write to the new disk and
the old one. I don't care if the old one fails.
David
--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Proactive Drive Replacement
2008-10-21 13:50 ` David Lethe
2008-10-21 14:11 ` Mario 'BitKoenig' Holbe
@ 2008-10-21 19:39 ` David Greaves
1 sibling, 0 replies; 16+ messages in thread
From: David Greaves @ 2008-10-21 19:39 UTC (permalink / raw)
To: David Lethe; +Cc: Jon Nelson, Mario 'BitKoenig' Holbe, LinuxRaid
It is also worth saying that this has wandered way off topic.
The comment about parity rebuild yadda yadda was an aside to the real meat: a
drive-replace facility that uses very efficient mirroring for >99.9% of the
disk rebuild and parity for the <0.1% where a read error occurred.
Hmm, it occurs to me that in the event of a highly dodgy failed drive, maybe
it could do >99.9% of the recovery from parity and, in the event of a failure
from one of the remaining drives, attempt a read from the dodgy disk.
David Lethe wrote:
> Sorry about the rant .. but it finally got to me: people keep posting as if
> S.M.A.R.T. were this all-knowing mechanism that tells you what is wrong
> with the disk and/or where the bad blocks might be. It isn't.
No, but I run long self-tests on a weekly basis and when it tells me I have a
bad block I can examine further; attempt a re-write; run another long test and
see if it comes back clean.
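For reference, my smartd.conf line for that looks roughly like this (from
memory; device name is an example):

  /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root

i.e. monitor everything, run a short self-test every night at 02:00 and a
long one every Saturday at 03:00, and mail root when something new shows up.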
David Lethe also wrote:
> As the original poster wanted to just use SMART to factor known bad blocks
> into a rebuild, you can see that there is no viable option unless you
> already have a full list of known bad blocks. With these types of disks you
> have to find bad blocks as you read from them as part of the rebuild.
I did say
force a re-write of SMART identified badblocks using parity calculated values.
and that was inaccurate.
I should have said something like:
when SMART identifies a bad block then force a re-write using parity
calculated values.
I appreciate that SMART isn't that smart - but it has a lot of value way down
here below the top-end enterprise systems.
David
--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Proactive Drive Replacement
2008-10-21 8:38 ` David Greaves
2008-10-21 13:05 ` Jon Nelson
2008-10-21 13:57 ` Mario 'BitKoenig' Holbe
@ 2008-10-24 5:57 ` Luca Berra
2008-10-24 8:09 ` David Greaves
2 siblings, 1 reply; 16+ messages in thread
From: Luca Berra @ 2008-10-24 5:57 UTC (permalink / raw)
To: linux-raid
On Tue, Oct 21, 2008 at 09:38:17AM +0100, David Greaves wrote:
>The main issue is that the drive being replaced almost certainly has a bad
>block. This block could be recovered from the raid5 set but won't be.
>Worse, the mirror operation may just fail to mirror that block - leaving it
>'random' and thus corrupt the set when replaced.
False.
If SMART reports the drive is failing, it just means the number of
_correctable_ errors got too high; remember that hard drives (*) do use
ECC and autonomously remap bad blocks.
You replace a drive based on SMART to prevent it from developing bad blocks.
Ignoring the above, your scenario is still impossible: if you tried to
mirror a source drive with a bad block, md would notice and fail the
mirroring process. You will never end up with one drive with a bad block
and the other with uninitialized data.
If what you are really worried about is not bad block, but silent
corruption, you should run a check (see sync_action in
/usr/src/linux/Documentation/md.txt)
L.
(*) note that I don't write 'modern hard drives'.
--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
X AGAINST HTML MAIL
/ \
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Proactive Drive Replacement
2008-10-24 5:57 ` Luca Berra
@ 2008-10-24 8:09 ` David Greaves
2008-10-25 13:20 ` Luca Berra
0 siblings, 1 reply; 16+ messages in thread
From: David Greaves @ 2008-10-24 8:09 UTC (permalink / raw)
To: linux-raid
Luca Berra wrote:
> On Tue, Oct 21, 2008 at 09:38:17AM +0100, David Greaves wrote:
>> The main issue is that the drive being replaced almost certainly has a
>> bad
>> block. This block could be recovered from the raid5 set but won't be.
>> Worse, the mirror operation may just fail to mirror that block -
>> leaving it
>> 'random' and thus corrupt the set when replaced.
> False.
> If SMART reports the drive is failing, it just means the number of
> _correctable_ errors got too high; remember that hard drives (*) do use
> ECC and autonomously remap bad blocks.
> You replace a drive based on SMART to prevent it from developing bad blocks.
I have just been through a batch of RMAing and re-RMAing 18+ dreadful Samsung
1Tb drives in a 3 and 5 drive level 5 array.
smartd did a great job of alerting me to bad blocks found during nightly short
and weekly long selftests.
Usually by the time the RMA arrived the drive was capable of being fully read
(once, with retries). I manually mirrored the drives using ddrescue since this
stressed the remaining disks less and had a reliable retry* facility.
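Roughly like this, from memory (exact options may differ between ddrescue
versions):

  ddrescue -f -n /dev/sdc /dev/sdd rescue.map   # first pass, skip scraping
                                                # the bad areas
  ddrescue -f -r3 /dev/sdc /dev/sdd rescue.map  # retry the bad areas 3 times

The map/log file is what gives you the resumable, reliable retries.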
About 3 times the drive had unreadable blocks. In those cases I couldn't use
the mirrored drive, which had a tiny bad area (a few KB in 1TB) - I had to do
a rebuild.
In one of these cases I developed a bad block on another component and had to
restore from a backup.
That was entirely avoidable.
> Ignoring the above, your scenario is still impossible: if you tried to
> mirror a source drive with a bad block, md would notice and fail the
> mirroring process. You will never end up with one drive with a bad block
> and the other with uninitialized data.
Well done. Great nit you found <sigh>.
When I wrote that I was thinking about the case above, which wasn't md
mirroring, and re-reading it I realise that I was totally unclear. You're
right; that can't happen.
However you seem to ignore the part of the threads that demonstrate my
understanding of the issue when I talk about mirroring from the failing drive
and the need to have md resort to the remaining components/parity in the event
of a failed block precisely to avoid md failing the mirroring process and
leaving you stuck :)
> If what you are really worried about is not bad block, but silent
> corruption, you should run a check (see sync_action in
> /usr/src/linux/Documentation/md.txt)
No, what I am worried about is having a raid5 develop a bad block on one
component and then, during recovery, develop a bad block (different #) on
another component.
That results in unneeded data loss - the parity is there but nothing reads it.
There was some noise on /. recently when they pointed back to a year-old story
about raid5 being redundant.
Well, IMO this proposal would massively improve raid5/6 reliability when, not
if, drives are replaced.
David
*I was stuck on 2.6.18 due to Xen - though eventually I did recovery using a
rescue disk and 2.6.27.
--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Proactive Drive Replacement
2008-10-24 8:09 ` David Greaves
@ 2008-10-25 13:20 ` Luca Berra
2008-10-25 16:33 ` David Greaves
0 siblings, 1 reply; 16+ messages in thread
From: Luca Berra @ 2008-10-25 13:20 UTC (permalink / raw)
To: linux-raid
On Fri, Oct 24, 2008 at 09:09:33AM +0100, David Greaves wrote:
>However you seem to ignore the part of the threads that demonstrate my
>understanding of the issue when I talk about mirroring from the failing drive
>and the need to have md resort to the remaining components/parity in the event
>of a failed block precisely to avoid md failing the mirroring process and
>leaving you stuck :)
It was not 'ignored' in the sense that I did not read or understand it:
I do agree that hot-sparing of a failing drive should be a native feature
of md. I was just pointing out what were, imho, errors in your reasoning.
L.
--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
X AGAINST HTML MAIL
/ \
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Proactive Drive Replacement
2008-10-25 13:20 ` Luca Berra
@ 2008-10-25 16:33 ` David Greaves
0 siblings, 0 replies; 16+ messages in thread
From: David Greaves @ 2008-10-25 16:33 UTC (permalink / raw)
To: linux-raid
Luca Berra wrote:
> I do agree that hot-sparing of a failing drive should be a native
> feature of md
OK - good to hear.
I suppose I'm just trying to raise the profile of this issue.
Hot-replacing a drive seems massively more valuable than squeezing a bit of
performance out of an idle spare.
David
--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads: [~2008-10-25 16:33 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-20 17:35 Proactive Drive Replacement Jon Nelson
2008-10-20 22:40 ` Mario 'BitKoenig' Holbe
2008-10-21 8:38 ` David Greaves
2008-10-21 13:05 ` Jon Nelson
2008-10-21 13:36 ` David Greaves
2008-10-21 13:50 ` David Lethe
2008-10-21 14:11 ` Mario 'BitKoenig' Holbe
2008-10-21 15:13 ` David Lethe
2008-10-21 15:30 ` Mario 'BitKoenig' Holbe
2008-10-21 19:39 ` David Greaves
2008-10-21 13:57 ` Mario 'BitKoenig' Holbe
2008-10-21 17:29 ` David Greaves
2008-10-24 5:57 ` Luca Berra
2008-10-24 8:09 ` David Greaves
2008-10-25 13:20 ` Luca Berra
2008-10-25 16:33 ` David Greaves