Stopping md from kicking out "bad" drives

* stoppind md from kicking out "bad' drives
@ 2013-11-11  7:28 Michael Tokarev
  2013-11-11  7:41 ` Mikael Abrahamsson
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Michael Tokarev @ 2013-11-11  7:28 UTC (permalink / raw)
  To: linux-raid

Hello.

Yesterday we hit the classic problem of a two-drive
failure in a raid5 configuration.

The scenario was like this:

  - one disk failed (actually it just stopped responding, and
    started working again after a bus reset, but that was much
    later than needed)

  - the failed disk has been kicked out of the array

  - md started synchronizing a hot-spare drive

  - during resync, another drive developed a bad (unreadable)
    sector

  - another drive has been kicked out of the array

  - boom

Now it is obvious that almost all data on the second drive
is intact, except for the area where the bad sector resides
(which is, btw, at the very end of the drive, where most
likely there's no useful data at all).  The hot-spare is
almost ready too (synced up to almost the end of it).  But the
array is non-functional and all filesystems have been switched
to read-only mode...

The question is: what's missing currently to prevent kicking
drives from md arrays at all?  And I really mean preventing
_both_ first failed drive (before start of resync) and second
failed drive?

Can the write-intent bitmap be used in this case, for example
to mark areas of the array where a write to one or another
component device has failed?  Md could then mark a drive
as "semi-failed" and still try to use it in some situations.

This "semi" state can come in different flavours.  In one, md
tries all normal operations on the drive, redirects failed
reads to other drives (with continued attempts to re-write
bad data) and continues writing normally, marking all failed
writes in the bitmap.  Let's call it the "semi-working" state.
In another, no regular I/O goes to the drive except in
critical situations when _another_ drive becomes unreadable
in some place, so md tries to reconstruct that data using
the semi-failed drive in the hope that those places can
be read successfully.  And other variations on the same theme...

At the very least, maybe we should prevent md from kicking
the last component device whose removal makes the array unusable,
like the second failed drive in a raid5 config.  Even if it has a
bad sector, the array was 99.9% fine before md kicked it out,
but after kicking it, the array is 100% dead...  This does
not look right to me.

Also, what's the way to assemble this array now?  We have an
almost-resynced hot spare, a failed-at-the-end drive (the second
failed one), and a non-fresh first failed drive which is in
good condition, just outdated.  Can mdadm be forced to assemble
the array from the good drives plus the second-failed drive, maybe
in read-only mode (this would let us copy the data which is still
readable to another place)?

I'd try to re-write the bad places on the second-failed drive based
on the information on the good drives plus data from the first-failed
drive.  It is obvious that those places can still be reconstructed,
because even though the filesystem was in use during the (attempted)
resync, no changes were made to the problematic areas, so there
the first-failed drive can still be used.  But this, at this stage,
is rather tricky: I'll need to write a program to help me, and
make it bug-free for it to be useful.

All in all, it still looks like md has very good potential for
improvements wrt reliability... ;)

(The system in question belongs to one of the very well-known
organisations in free software, and it is (or was) the main
software repository)

Thank you!

/mjt

* Re: stoppind md from kicking out "bad' drives
  2013-11-11  7:28 stoppind md from kicking out "bad' drives Michael Tokarev
@ 2013-11-11  7:41 ` Mikael Abrahamsson
  2013-11-11  7:51   ` Michael Tokarev
  2013-11-11 15:55 ` Ian Pilcher
  2013-11-23 22:05 ` Michael Tokarev
  2 siblings, 1 reply; 9+ messages in thread
From: Mikael Abrahamsson @ 2013-11-11  7:41 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: linux-raid

On Mon, 11 Nov 2013, Michael Tokarev wrote:

> The question is: what's missing currently to prevent kicking drives from 
> md arrays at all?  And I really mean preventing _both_ first failed 
> drive (before start of resync) and second failed drive?

Cranking up the timeout settings a lot might help (I use 180 seconds); it
would probably have stopped the first drive from being kicked out.
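
A minimal sketch of what raising the timeout involves, assuming the drives sit behind the SCSI/libata layer and expose the usual `/sys/block/<dev>/device/timeout` attribute (value in seconds). The helper name `set_command_timeout` and the `sysfs_root` parameter are illustrative only; writing to the real file needs root and the setting does not survive a reboot.

```python
from pathlib import Path


def set_command_timeout(device: str, seconds: int, sysfs_root: str = "/sys/block") -> Path:
    """Write a new command timeout (in seconds) for one block device.

    `device` is a kernel name such as "sda" (hypothetical here).
    `sysfs_root` exists only so the function can be tried on a scratch tree.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    timeout_file = Path(sysfs_root) / device / "device" / "timeout"
    # The kernel attribute takes a plain decimal number.
    timeout_file.write_text(f"{seconds}\n")
    return timeout_file


# Example (as root, on a real system):
#     for dev in ("sda", "sdb", "sdc"):
#         set_command_timeout(dev, 180)
```

This only changes how long the kernel waits before giving up on a command; it does nothing for a drive that returns a read error promptly.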

But you really should be running RAID6 and not RAID5 to handle the
failure case you just observed.

Write-intent bitmap would have stopped the initial full resync of the 
drive that was kicked out, which might have helped as well.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

* Re: stoppind md from kicking out "bad' drives
  2013-11-11  7:41 ` Mikael Abrahamsson
@ 2013-11-11  7:51   ` Michael Tokarev
  2013-11-11  7:56     ` Mikael Abrahamsson
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Tokarev @ 2013-11-11  7:51 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

11.11.2013 11:41, Mikael Abrahamsson wrote:
> On Mon, 11 Nov 2013, Michael Tokarev wrote:
>
>> The question is: what's missing currently to prevent kicking drives from md arrays at all?  And I really mean preventing _both_ first failed drive (before start of resync) and second failed drive?
>
> Crank up the timeout settings a lot might help (I use 180 seconds), it would probably have stopped the first drive from being kicked out.
>
> But you really should be running RAID6 and not RAID5 (as you now have observed) to handle the failure case you just observed.

No, really, those are not the solutions I was asking for.

Yes, raid6 is better in this context.  But it has exactly the same property
when drives start "semi-failing": one bad sector on each of 3 drives, in
different places, is enough for a catastrophic failure, even though the array
could continue to work normally precisely because the bad sectors are in
different places.

It is the drive kick-off - the decision made by md driver - which makes the
failure catastrophic.

We may reduce the probability of such an event with various configuration
tweaks, but the underlying problem remains.

> Write-intent bitmap would have stopped the initial full resync of the drive that was kicked out, which might have helped as well.

Nope, because the array was (re)syncing a hot spare, not the first failed
drive.

I asked about the write-intent bitmap because it can act as a semi-permanent "list
of bad blocks on component devices": instead of kicking the whole device out,
mark just the "bad place" on it in the bitmap (the place where we weren't
able to write _new_ data) and continue using it, just avoiding reads from
the marked-as-bad places (because even if a read succeeds, the data will
already be wrong).

Thanks,

/mjt

* Re: stoppind md from kicking out "bad' drives
  2013-11-11  7:51   ` Michael Tokarev
@ 2013-11-11  7:56     ` Mikael Abrahamsson
  2013-11-11  8:05       ` Michael Tokarev
  0 siblings, 1 reply; 9+ messages in thread
From: Mikael Abrahamsson @ 2013-11-11  7:56 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: linux-raid

On Mon, 11 Nov 2013, Michael Tokarev wrote:

> No, really, that's not the solutions I was asking for.

Well, it is.

> Yes raid6 is better in this context.  But it has exactly the same properties
> when drives start "semi-failing" - it is enough to have one bad sector in
> different places of 3 drives for a catastrophic failure, while the array
> can even continue to work normally because the bad sectors are in different
> places.

If you have timeouts set properly then md will be able to re-calculate the 
bad sector from parity and re-write it, even with one drive failed.

> It is the drive kick-off - the decision made by md driver - which makes 
> the failure catastrophic.

That's what the timeout problem is. If you're running consumer drives and 
default linux kernel timeouts then the drive will be kicked before it can 
return a read error.

> We may reduce probability of such event by using different configuration 
> tweaks, but the underlying problem remains.

The underlying problem is that you have drives that take longer to return
errors than the timeout you have set for waiting on results from the
drive.

> Nope, because the array were (re)syncing a hot spare, not the first failed
> drive.

I don't understand why you would be running a RAID5+spare instead of 
RAID6 without spare.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

* Re: stoppind md from kicking out "bad' drives
  2013-11-11  7:56     ` Mikael Abrahamsson
@ 2013-11-11  8:05       ` Michael Tokarev
       [not found]         ` <CAPbD+Re7sVfSawjGFyMZMpU4Oaf3ULTW-UA3eoD_upcgxj3GOg@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Tokarev @ 2013-11-11  8:05 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

11.11.2013 11:56, Mikael Abrahamsson wrote:
> On Mon, 11 Nov 2013, Michael Tokarev wrote:
>
>> No, really, that's not the solutions I was asking for.
>
> Well, it is.
>
>> Yes raid6 is better in this context.  But it has exactly the same properties
>> when drives start "semi-failing" - it is enough to have one bad sector in
>> different places of 3 drives for a catastrophic failure, while the array
>> can even continue to work normally because the bad sectors are in different
>> places.
>
> If you have timeouts set properly then md will be able to re-calculate the bad sector from parity and re-write it, even with one drive failed.

Timeouts have nothing to do with this at all.

The first drive was "stuck" somewhere in its firmware or electronics and
didn't respond at all (for several MINUTES), even to a device reset.
It recovered much later when a bus reset was performed.

The second drive returned "I can't read this data" rather quickly.  It
was not a "timeout reading" or some such, it was a confident "sorry guys
I've lost this piece".

>> It is the drive kick-off - the decision made by md driver - which makes the failure catastrophic.
>
> That's what the timeout problem is. If you're running consumer drives and default linux kernel timeouts then the drive will be kicked before it can return a read error.

These are not consumer drives, and again, it has nothing to do with the timeouts.

Even if it really were timeouts, even given an infinite timeout, if the bad
sector can't be read, no games with timeouts will let us recover it.

And it is just ONE bad sector (on the next drive) which makes md kick the
WHOLE device out of the array -- exactly the moment which turns the issue
from "maybe, just maybe, lost some data" into "all data has been lost".
(And yes, I understand perfectly well that md tries to rewrite the place when
it can do that.)

[]
> I don't understand why you would be running a RAID5+spare instead of RAID6 without spare.

Yet again, this is an entirely different question.

Please, pretty please, don't speak if you don't understand the topic... ;)

Thanks,

/mjt

[parent not found: <CAPbD+Re7sVfSawjGFyMZMpU4Oaf3ULTW-UA3eoD_upcgxj3GOg@mail.gmail.com>]

* Re: stoppind md from kicking out "bad' drives
       [not found]         ` <CAPbD+Re7sVfSawjGFyMZMpU4Oaf3ULTW-UA3eoD_upcgxj3GOg@mail.gmail.com>
@ 2013-11-13 15:45           ` Michael Tokarev
  0 siblings, 0 replies; 9+ messages in thread
From: Michael Tokarev @ 2013-11-13 15:45 UTC (permalink / raw)
  To: Guillaume Betous; +Cc: Mikael Abrahamsson, linux-raid

12.11.2013 10:34, Guillaume Betous wrote:
> 
>     And it is just ONE bad sector (on next drive) which makes md to kick the
>     WHOLE device out of the array
> 
> 
> I admit that this policy is good as long as I have a bunch of redundancy (in any way) available. When this is your last chance to keep the service up, this seems a little bit "rude" :)

The "last chance" isn't exactly a well-defined term, really.
For example, if you have a raid5 and pull one drive out of
the fully working array, was this drive your last chance or
not?  From one point of view it was not: it was merely your
last chance to have redundancy.  But once you hit an error
on any of the other drives, you may reconsider...

> Would you mean that you'd prefer an algorithm like :
> 
> if data can be read then
>   read it
>   => NO_ERROR
> else
>   is there another way to get it ?
>   if yes
>     get it
>     rebuild failing sector
>     => NO_ERROR
>   else
>     kick the drive out
>     => ERROR
>   end
> end

No.  Please take a look at the subject again.
What I'm asking is to NOT kick any drives,
at least not when this leads to lack of
redundancy.

> Maybe we could consider this "soft" algorithm in case there is no more redundancy available (just to avoid a complete system failure, which finally is the worst solution).

I described the algorithm which I'd love to see implemented in md
in my previous email.  Here it is again.

When hitting an unrecoverable error on a raid component device
(when the device can't be written), do not kick it just yet,
but instead mark it as "failing".  In this mode, we may still
attempt to read from the device and/or write to it, maybe marking
the newly failed areas in a bitmap so as not to read them again (esp. if
it was a write of new data which failed), or we may just keep the device
around without touching it at all (while still filling in the bitmap
when new writes are skipped).

This way, when some other component fails, we may _try_ to reconstruct
that place from the other, good, drives and this first failed drive,
provided we haven't performed a write to this part of the array (that is,
if this place isn't marked in the bitmap for the first failed drive).

And if we can't re-write and fix the second drive which failed, do not
kick it from the array either, leaving it there just in case, in one
of the two modes again.

This way, we may have, say, a 2-drive array where half of the data
is okay on one drive and the other half is okay on the other drive,
but which is still working.

The bitmap might be permanent, saved to non-volatile memory just
like the current write-intent bitmap is, OR it can be kept
just in memory (if no persistent bitmap has been configured), so
that it is valid until the array is disassembled -- at least this
in-memory bitmap will help to keep the device working until
shutdown...
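
To make the proposal concrete, here is a small sketch of the bookkeeping it implies. It is only a model of the idea, not md code: the names `MemberState`, `record_failed_write`, `stripe_recoverable` and the region size `REGION_SECTORS` are all made up for illustration, and a raid5-style "at most one block missing per stripe" rule is assumed.

```python
from dataclasses import dataclass, field

# Hypothetical bitmap granularity: one bit per 2048 sectors (1 MiB of 512-byte sectors).
REGION_SECTORS = 2048


@dataclass
class MemberState:
    """Per-device state for a member that is kept around instead of being kicked."""
    name: str
    failing: bool = False
    stale_regions: set = field(default_factory=set)  # the per-device "bitmap"

    def record_failed_write(self, sector: int) -> None:
        # A failed or skipped write makes this region outdated on this device;
        # mark the device "failing" and remember the region rather than kicking it.
        self.failing = True
        self.stale_regions.add(sector // REGION_SECTORS)

    def usable_for_read(self, sector: int) -> bool:
        # Data here may still be trusted if no write to this region was missed.
        return sector // REGION_SECTORS not in self.stale_regions


def stripe_recoverable(members, sector, unreadable=()):
    """raid5 rule: the stripe survives if at most one member cannot supply its block.

    `unreadable` names members that returned a media error at `sector`.
    """
    missing = [m for m in members
               if m.name in unreadable or not m.usable_for_read(sector)]
    return len(missing) <= 1
```

With three members where "a" missed a write at sector 5000 and "b" later returns a read error, the array can still serve sector 0 (only "b" is missing there) but not sector 5000 (both "a" and "b" are unusable).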

Thanks,

/mjt

* Re: stoppind md from kicking out "bad' drives
  2013-11-11  7:28 stoppind md from kicking out "bad' drives Michael Tokarev
  2013-11-11  7:41 ` Mikael Abrahamsson
@ 2013-11-11 15:55 ` Ian Pilcher
  2013-11-23 22:05 ` Michael Tokarev
  2 siblings, 0 replies; 9+ messages in thread
From: Ian Pilcher @ 2013-11-11 15:55 UTC (permalink / raw)
  To: linux-raid

On 11/11/2013 01:28 AM, Michael Tokarev wrote:
> The question is: what's missing currently to prevent kicking
> drives from md arrays at all?  And I really mean preventing
> _both_ first failed drive (before start of resync) and second
> failed drive?

I'm becoming increasingly convinced that hot-spares are a bad idea,
particularly when you're one failure away from data loss.  (I.e. I might
be willing to auto-add a hot-spare after the initial failure in a RAID-6
array, but not after a second failure.)

I much prefer to do a manual recovery, after using a badblocks read-only
test to check all of the component devices for bad sectors.  (I also
build my MD arrays out of partitions rather than entire drives to make
this process more manageable.)

-- 
========================================================================
Ian Pilcher                                         arequipeno@gmail.com
           Sent from the cloud -- where it's already tomorrow
========================================================================

* Re: stoppind md from kicking out "bad' drives
  2013-11-11  7:28 stoppind md from kicking out "bad' drives Michael Tokarev
  2013-11-11  7:41 ` Mikael Abrahamsson
  2013-11-11 15:55 ` Ian Pilcher
@ 2013-11-23 22:05 ` Michael Tokarev
  2013-11-24 23:00   ` NeilBrown
  2 siblings, 1 reply; 9+ messages in thread
From: Michael Tokarev @ 2013-11-23 22:05 UTC (permalink / raw)
  To: linux-raid; +Cc: Neil Brown

Neil, I'm sorry for the repost, -- can you comment please?

I think this is important enough to deserve your comments... ;)

Meanwhile, in order to fix the broken raid5 mentioned earlier, I had
to resort to a small perl script which reads each stripe (all
parts which can be read), reconstructs the missing pieces from
the available data when possible, and writes the result to an external
file, displaying the areas which can't be reconstructed.  I then repeated
the procedure for the missing areas using another set of drives...
So in the end we were able to completely restore all data
from the array in question.
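
The script itself is not shown here, so the following is only a sketch of the general technique in Python, not the perl code that was used. It relies on the raid5 property that the XOR of all blocks in a stripe (data plus parity) is zero, so any single missing block equals the XOR of the others. The function names and the in-memory representation (`None` for an unreadable block) are assumptions made for illustration; chunk layout and real device I/O are left out.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equally sized byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_stripes(members):
    """Rebuild what can be rebuilt, stripe by stripe.

    members[m][s] is the block of member m in stripe s, or None if it
    could not be read.  Returns (recovered, lost): recovered[s] is the
    full list of blocks for stripe s (None if unrecoverable), and lost
    lists the stripe numbers that need another pass.
    """
    recovered, lost = [], []
    for stripe_no, stripe in enumerate(zip(*members)):
        missing = [i for i, block in enumerate(stripe) if block is None]
        if len(missing) > 1:
            # More than one block gone: raid5 parity cannot help here.
            recovered.append(None)
            lost.append(stripe_no)
            continue
        blocks = list(stripe)
        if missing:
            present = [block for block in stripe if block is not None]
            blocks[missing[0]] = xor_blocks(present)
        recovered.append(blocks)
    return recovered, lost
```

The stripes reported in `lost` are the ones for which a second pass with a different set of drives (for example, substituting the outdated first-failed drive) would be attempted.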

Thanks,

/mjt

* Re: stoppind md from kicking out "bad' drives
  2013-11-23 22:05 ` Michael Tokarev
@ 2013-11-24 23:00   ` NeilBrown
  0 siblings, 0 replies; 9+ messages in thread
From: NeilBrown @ 2013-11-24 23:00 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: linux-raid

On Sun, 24 Nov 2013 02:05:16 +0400 Michael Tokarev <mjt@tls.msk.ru> wrote:

> Neil, I'm sorry for the repost, -- can you comment please?
> 
> I think this is important enough to deserve your comments... ;)

Sorry.  I had meant to reply.  I even thought about what the reply would be.
Unfortunately my telepath-to-SMTP gateway is a bit flaky and must have
dropped the connection.


> 
> Meanwhile, in order to fix the mentioned broken raid5, I had
> to resort to a small perl script which reads each stripe (all
> parts which can be read), re-constructs missing stuff from
> available data when possible, and writes result to external
> file, displaying areas which can't be reconstructed.  Repeating
> procedure for the missing areas using another set of drives...
> So in the end we were able to completely restore all data
> from the array in question.

Well that is good news.  Well done!  Obviously we don't want everyone to have
to do that though.

> 
> Thanks,
> 
> /mjt
> 
> 11.11.2013 11:28, Michael Tokarev wrote:
> > Hello.
> > 
> > Yesterday we've hit a classical issue of two drives
> > failure in raid5 configuration.
> > 
> > The scenario was like this:
> > 
> > >  - one disk failed (actually just stopped responding, but
> >    started working again after bus reset, but that was much
> >    later than needed)
> > 
> >  - the failed disk has been kicked out of the array
> > 
> > >  - md started synchronizing a hot-spare drive
> > 
> >  - during resync, another drive developed a bad (unreadable)
> >    sector
> > 
> >  - another drive has been kicked out of the array
> > 
> >  - boom
> > 
> > Now it is obvious that almost all data on the second drive
> > is intact, except of the area where the bad sector resides
> > (which is, btw, at the very end of the drive, where most
> > likely there's no useful data at all).  The hot-spare is
> > > almost ready too (up to almost the end of it).  But the array
> > > is non-functional and all filesystems are switched to
> > > read-only mode...
> > 
> > The question is: what's missing currently to prevent kicking
> > drives from md arrays at all?  And I really mean preventing
> > _both_ first failed drive (before start of resync) and second
> > failed drive?

For the first drive failure, the alternatives are:
 - kick the drive from the array.  Easiest, but you don't like that.
 - block all writes which affect that drive until either the drive
   starts responding again, or an administrative decision (whether manual or
   based on some high-level policy and longer timeouts) allows the drive to
   be kicked out. (After all we must be able to handle cases where
   the drive really is completely and totally dead)
 - continue permitting writes and recording a bad-block-list for the
   failed drive on every other drive.
   When an access to some other drive also fails, you then need to decide
   whether to fail the request, or try just that block on the
   first-fail-drive, and in the second case, whether to block or fail if two
   drives cannot respond.

There is a lot of non-trivial policy here, and non-trivial implementation
details.

I might feel comfortable with a configurable policy to block all writes when
a whole-drive appears to have disappeared, but I wouldn't want that to be the
default, and I doubt many people would turn it on, even if they knew about it.

For the second drive failure the answer is the per-device bad-block list.  One
of the key design goals for that functionality was to survive single bad
blocks when recovering to a spare.
It's very new though and I don't remember if it is enabled by default with
mdadm-3.3 or not.
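
As a rough illustration of what such a list looks like to userspace: the sketch below assumes the kernel exposes it per member as a text file (commonly `/sys/block/mdX/md/dev-<name>/bad_blocks`) with one "start length" pair per line, both in sectors. The path, the format and the helper names are assumptions to verify against the kernel's md documentation for the version in use.

```python
def parse_bad_blocks(text):
    """Parse "start length" lines (sectors) into a list of (start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        ranges.append((int(fields[0]), int(fields[1])))
    return ranges


def total_bad_sectors(ranges):
    """Total number of sectors covered by the list."""
    return sum(length for _start, length in ranges)


# Example with made-up contents:
#     ranges = parse_bad_blocks("1000 8\n5000 16\n")
#     total_bad_sectors(ranges)
```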

> > 
> > Can write-intent bitmap be used in this case, to mark areas
> > changed in array which are failed to be written to one or
> > another component device, for example?  Md can mark a drive
> > as "semi-failed" and still try to use it in some situations.

I don't think the granularity of the bitmap is nearly fine enough.  You
really want a bad-block-list.  That will of course be limited in size.
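
A quick back-of-the-envelope calculation shows the granularity gap. The 64 MiB bitmap chunk and 512-byte sector used here are example values chosen for illustration, not a statement about any particular array's settings.

```python
SECTOR_BYTES = 512  # assumed sector size


def sectors_per_granule(granule_bytes, sector_bytes=SECTOR_BYTES):
    """How many sectors are written off when one bad sector taints a whole granule."""
    return granule_bytes // sector_bytes


# One bad sector recorded in a bitmap with a (hypothetical) 64 MiB chunk:
bitmap_cost = sectors_per_granule(64 * 1024 * 1024)
# The same bad sector recorded in a sector-granular bad-block list:
list_cost = sectors_per_granule(SECTOR_BYTES)
print(bitmap_cost, list_cost)
```

Under these assumptions a single unreadable sector would mark 131072 sectors as unusable in the bitmap, against 1 in a bad-block list.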


> > 
> > This "semi" state can be different - f.e., one is where md
> > tries all normal operations on the drive and redirects failed
> > reads to other drives (with continued attempts to re-write
> > bad data) and continues writing normally, marking all failed
> > writes in the bitmap.  Let's say it is "semi-working" state.
> > Another is when no regular I/O is happening to it except of
> > the critical situations when _another_ drive becomes unreadable
> > in some place - so md will try to reconstruct that data based
> > on this semi-failed drive in a hope that those places will
> > be read successfully.  And other variations of the same theme...
> > 
> > At the very least, maybe we should prevent md from kicking
> > the last component device which makes the array unusable, like
> > failed second drive on raid5 config - even if it has a bad
> > sector, the array was 99.9% fine before md kicked it out,
> > but after kicking it, the array is 100% dead...  This does
> > not look right to me.
> > 
> > Also, what's the way to assemble this array now?  We've almost
> > resynced hot spare, a failed-at-the-end drive (the second
> > failed one), and a non-fresh first failed drive which is in
> > good condition, just outdated.  Can mdadm be forced to assemble
> > the array from good drives plus second-failed drive?, maybe in
> > read-only mode (this will let us to copy data which is still
> > readable to another place)?
> > 
> > I'd try to re-write the bad places on second-failed drive based
> > on the information on good drives plus data from first-failed
> > drive, -- it is obvious that those places still can be reconstructed,
> > because even when the filesystem were in use during (attempt to)
> > resync, no changes were made to the problematic areas, so there,
> > first-failed drive still can be used.  But this - at this stage -
> > is rather tricky, i'll need to write a program to help me, and
> > made it bug-free to be useful.
> > 
> > All in all, it still looks like md has very good potential for
> > improvements wrt reliability... ;)

The bad-block-log should help reliability in some of these cases.
It would probably make sense to provide a utility which will access the
bad-block-list for a device and recover the blocks from some other device -
in your case the blocks that could not be recovered to the spare could then be
recovered from the original device.

Also, regular scrubbing should significantly reduce the chance of hitting a
bad read during a recovery.
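
For reference, a scrub is normally requested by writing `check` to the array's `sync_action` attribute in sysfs. The sketch below assumes the usual `/sys/block/<md>/md/sync_action` location; the function name and the `sysfs_root` parameter are illustrative, and writing to the real file requires root.

```python
from pathlib import Path


def start_scrub(array: str, sysfs_root: str = "/sys/block") -> Path:
    """Ask md to read-check every stripe of `array` (e.g. "md0")."""
    action_file = Path(sysfs_root) / array / "md" / "sync_action"
    # "check" reads all members and counts mismatches without rewriting data.
    action_file.write_text("check\n")
    return action_file


# Example (as root, typically from a periodic job):
#     start_scrub("md0")
```

Many distributions already ship a periodic job that does this, so it is worth checking whether one is installed before adding another.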

> > 
> > (The system in question belongs to one of a very well-known
> > organisations in free software, and it is (or was) the main
> > software repository)

so they naturally had backups :-)

NeilBrown
