linux-raid.vger.kernel.org archive mirror
* Swapping a disk without degrading an array
@ 2010-01-25 12:11 Michał Sawicz
  2010-01-25 12:25 ` Majed B.
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Michał Sawicz @ 2010-01-25 12:11 UTC (permalink / raw)
  To: linux-raid


Hi list,

This is something I've discussed on IRC, and we came to the conclusion
that it might be useful, but that the somewhat limited number of use
cases might not justify the effort to implement it.

What I have in mind is allowing a member of an array to be paired with a
spare while the array is on-line. The spare disk would then be filled
with exactly the same data and would, in the end, replace the active
member. The replaced disk could then be hot-removed without the array
ever going into degraded mode.

I wanted to start a discussion on whether this makes sense at all, what
the use cases could be, etc.

-- 
Cheers
Michał (Saviq) Sawicz


* Re: Swapping a disk without degrading an array
  2010-01-25 12:11 Swapping a disk without degrading an array Michał Sawicz
@ 2010-01-25 12:25 ` Majed B.
  2010-01-25 12:53   ` Mikael Abrahamsson
  2010-01-25 14:44 ` Michał Sawicz
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Majed B. @ 2010-01-25 12:25 UTC (permalink / raw)
  To: Michał Sawicz; +Cc: linux-raid

There's a technique called active spare, which is already available on
some hardware RAID controllers. It keeps the hot spare in sync with
the array, so that in the event of a disk failure the spare kicks in
immediately, without wasting time on a resync.

I think what you're proposing is similar to the following scenario:
array0: (assume raid5): disk0, disk1, disk2, disk3(spare)
array1: (raid1): disk0, disk3

Though I'm not sure whether it's feasible to nest RAIDs or to have a
disk be a member of two arrays at the same time.

I think it was proposed before, but I don't know about its priority.

2010/1/25 Michał Sawicz <michal@sawicz.net>:
> Hi list,
>
> This is something I've discussed on IRC, and we came to the conclusion
> that it might be useful, but that the somewhat limited number of use
> cases might not justify the effort to implement it.
>
> What I have in mind is allowing a member of an array to be paired with a
> spare while the array is on-line. The spare disk would then be filled
> with exactly the same data and would, in the end, replace the active
> member. The replaced disk could then be hot-removed without the array
> ever going into degraded mode.
>
> I wanted to start a discussion on whether this makes sense at all, what
> the use cases could be, etc.
>
> --
> Cheers
> Michał (Saviq) Sawicz
>



-- 
       Majed B.

* Re: Swapping a disk without degrading an array
  2010-01-25 12:25 ` Majed B.
@ 2010-01-25 12:53   ` Mikael Abrahamsson
  0 siblings, 0 replies; 11+ messages in thread
From: Mikael Abrahamsson @ 2010-01-25 12:53 UTC (permalink / raw)
  To: Majed B.; +Cc: Michał Sawicz, linux-raid

On Mon, 25 Jan 2010, Majed B. wrote:

> Though I'm not sure whether it's feasible to nest RAIDs or to have a
> disk be a member of two arrays at the same time.

I think the proposal is for the scenario where a drive is being upgraded
to a larger one.

So:

1 Add spare X
2 Tell mdadm to replace drive N with the new spare
3 The information on N is now copied to X while the array stays online
4 When the copy is done, N and X contain the same data, and N is
  converted to a spare by mdadm
5 Hot-remove N

This means that if I want to upgrade to larger drives I can do that
without ever degrading the array.
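
For illustration, here is a rough sketch of how those steps could look
from the command line. This is purely hypothetical -- mdadm has no such
replace operation today, and the device names are made up:

  # 1: add the new, larger disk as a spare
  mdadm /dev/md0 --add /dev/sdc1

  # 2-4: ask md to copy /dev/sdb1 onto the spare while the array stays
  #      online; when the copy finishes, /dev/sdb1 is demoted to a spare
  mdadm /dev/md0 --replace /dev/sdb1 --with /dev/sdc1

  # 5: hot-remove the old disk
  mdadm /dev/md0 --remove /dev/sdb1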

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: Swapping a disk without degrading an array
  2010-01-25 12:11 Swapping a disk without degrading an array Michał Sawicz
  2010-01-25 12:25 ` Majed B.
@ 2010-01-25 14:44 ` Michał Sawicz
  2010-01-25 14:51 ` Asdo
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Michał Sawicz @ 2010-01-25 14:44 UTC (permalink / raw)
  To: linux-raid


On Mon, 2010-01-25 at 13:11 +0100, Michał Sawicz wrote:
> I wanted to start a discussion on whether this makes sense at all,
> what the use cases could be, etc.

It seems it has been on the mdadm to-do list for a year now:

http://neil.brown.name/blog/20090129234603

This suggests my idea wasn't entirely stupid :)

-- 
Cheers
Michał (Saviq) Sawicz


* Re: Swapping a disk without degrading an array
  2010-01-25 12:11 Swapping a disk without degrading an array Michał Sawicz
  2010-01-25 12:25 ` Majed B.
  2010-01-25 14:44 ` Michał Sawicz
@ 2010-01-25 14:51 ` Asdo
  2010-01-25 17:40 ` Goswin von Brederlow
  2010-01-29 11:19 ` Neil Brown
  4 siblings, 0 replies; 11+ messages in thread
From: Asdo @ 2010-01-25 14:51 UTC (permalink / raw)
  To: Michał Sawicz; +Cc: linux-raid

Michał Sawicz wrote:
> ... I wanted to start a discussion on whether this makes sense at all,
> what the use cases could be, etc. ...
This looks like a great feature to me; you get my vote.

I was also thinking about something similar. This is probably the most 
desirable feature request for MD for me right now.


Use cases could be:

- 1 -  the obvious one: you are seeing early warning errors 
(correctable read errors or SMART errors) on a disk and you want to 
replace it without making the array degraded and temporarily vulnerable.

- 2 - recovering a really bad array that has read errors in different 
places on multiple disks (replacing one disk at a time with the feature 
you suggest): consider that while filling each sector of the hot-spare, 
the algorithm has two places to read the data from: first it can try to 
read from the drive being replaced, and if that returns a read error it 
can reconstruct the data from parity. Currently there is no other way to 
do this with this level of redundancy AFAIK, at least not automatically 
and not with the array online. Consider that if you have a bad array as 
described, doing a full scrub would take the array down, i.e. the scrub 
would never successfully finish, and the new drive could never be filled 
with data. With the feature you suggest, on the other hand, there is no 
scrub of the whole array: data is taken only from the drive being 
replaced (that's the only disk being scrubbed), except possibly for a 
few defective sectors on that disk, for which parity is used.


Thank you
Asdo

* Re: Swapping a disk without degrading an array
  2010-01-25 12:11 Swapping a disk without degrading an array Michał Sawicz
                   ` (2 preceding siblings ...)
  2010-01-25 14:51 ` Asdo
@ 2010-01-25 17:40 ` Goswin von Brederlow
  2010-01-29 11:19 ` Neil Brown
  4 siblings, 0 replies; 11+ messages in thread
From: Goswin von Brederlow @ 2010-01-25 17:40 UTC (permalink / raw)
  To: Michał Sawicz; +Cc: linux-raid

Michał Sawicz <michal@sawicz.net> writes:

> Hi list,
>
> This is something I've discussed on IRC, and we came to the conclusion
> that it might be useful, but that the somewhat limited number of use
> cases might not justify the effort to implement it.
>
> What I have in mind is allowing a member of an array to be paired with a
> spare while the array is on-line. The spare disk would then be filled
> with exactly the same data and would, in the end, replace the active
> member. The replaced disk could then be hot-removed without the array
> ever going into degraded mode.
>
> I wanted to start a discussion on whether this makes sense at all, what
> the use cases could be, etc.

I had that discussion last year with Neil. Summary: it totally makes
sense and is not that hard to implement, but it doesn't have a high
priority.

You can sort of do it today with two short downtimes. Shut down the
RAID, set up a dm-mirror target from the old disk to the new one,
restart the RAID on top of the mirror, wait for the mirror to complete,
then shut down again and undo the dm-mirror. Instead of dm-mirror you
can also use a superblock-less RAID1. You get problems on a crash,
though, unless the superblock is mirrored last, because otherwise the
wrong (incomplete) disk might be added to the RAID on boot.
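
As a rough, untested sketch of the superblock-less RAID1 variant
(device names are examples only; double-check everything before trying
this on real data):

  mdadm --stop /dev/md0                 # downtime #1

  # superblock-less RAID1; the resync copies /dev/sdb1 (old) to /dev/sdc1 (new)
  mdadm --build /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1

  # re-assemble the array with the mirror standing in for the old disk
  mdadm --assemble /dev/md0 /dev/sda1 /dev/md1 /dev/sdd1

  cat /proc/mdstat                      # wait for the md1 resync to finish

  # downtime #2: stop md0 and md1, re-assemble md0 with /dev/sdc1 in
  # place of /dev/sdb1, and pull the old disk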


Besides replacing a disk suspected of failing soon, there is also a
second use case: balancing the wear of the active and spare disks. If
you buy 6 new disks and create a 5-disk + spare RAID5, the spare will
remain unused while the remaining disks wear down. So every now and then
it would be nice to rotate the spare disk so the wear is distributed
better. This could be done in parallel with the monthly RAID check,
where you read and verify the full disk data anyway. Copying one disk
out to the spare at the same time and switching them at the end would
cost little extra (given enough controller bandwidth it wouldn't even
slow the check down).
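
For reference, the monthly check mentioned above is just the scrub that
distributions typically trigger from cron via sysfs (md0 is an example
name); the spare rotation is the part that doesn't exist yet:

  echo check > /sys/block/md0/md/sync_action     # start a scrub
  cat /sys/block/md0/md/mismatch_cnt             # inspect afterwards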

MfG
        Goswin

* Re: Swapping a disk without degrading an array
  2010-01-25 12:11 Swapping a disk without degrading an array Michał Sawicz
                   ` (3 preceding siblings ...)
  2010-01-25 17:40 ` Goswin von Brederlow
@ 2010-01-29 11:19 ` Neil Brown
  2010-01-29 15:35   ` Goswin von Brederlow
  4 siblings, 1 reply; 11+ messages in thread
From: Neil Brown @ 2010-01-29 11:19 UTC (permalink / raw)
  To: Michał Sawicz; +Cc: linux-raid

On Mon, 25 Jan 2010 13:11:15 +0100
Michał Sawicz <michal@sawicz.net> wrote:

> Hi list,
> 
> This is something I've discussed on IRC, and we came to the conclusion
> that it might be useful, but that the somewhat limited number of use
> cases might not justify the effort to implement it.
> 
> What I have in mind is allowing a member of an array to be paired with a
> spare while the array is on-line. The spare disk would then be filled
> with exactly the same data and would, in the end, replace the active
> member. The replaced disk could then be hot-removed without the array
> ever going into degraded mode.
> 
> I wanted to start a discussion on whether this makes sense at all, what
> the use cases could be, etc.
> 

As has been noted, this is a really good idea.  It just doesn't seem to get
priority.  Volunteers ???

So time to start:  with a little design work.

1/ The start of such an operation *must* be recorded in the metadata.  If we
   try to create a transparent whole-device copy then we could get confused
   later.  So let's (for now) decide not to support 0.90 metadata, and support
   this in 1.x metadata with:
     - a new feature_flag saying that live spares are present
     - the high bit set in dev_roles[] means that this device is a live spare
       and is only in_sync up to 'recovery_offset'

2/ in sysfs we currently identify devices with a symlink
     md/rd$N -> dev-$X
   for live-spare devices, this would be
     md/ls$N -> dev-$X

3/ We create a live spare by writing 'live-spare' to md/dev-$X/state
   and an appropriate value to md/dev-$X/recovery_start before setting
   md/dev-$X/slot  (a rough shell sketch of 2/ and 3/ follows this list)

4/ When a device fails, if there is a live spare it instantly takes
   the place of the failed device.

5/ This needs to be implemented separately in raid10 and raid456.
   raid1 doesn't really need live spares  but I wouldn't be totally against
   implementing them if it seemed helpful.

6/ There is no dynamic read balancing between a device and its live-spare.
   If the live spare is in-sync up to the end of the read, we read from the
   live-spare, else from the main device.

7/ writes transparently go to both the device and the live-spare, whether they
   are normal data writes or resync writes or whatever.

8/ In raid5.h struct r5dev needs a second 'struct bio' and a second
   'struct bio_vec'.
   'struct disk_info' needs a second mdk_rdev_t.

9/ in raid10.h mirror_info needs another mdk_rdev_t and the anon struct in 
   r10bio_s needs another 'struct bio *'.

10/ Both struct r5dev and r10bio_s need some counter or flag so we can know
    when both writes have completed.

11/ For both r5 and r10, the 'recover' process needs to be enhanced to just
    read from the main device when a live-spare is being built.
    Obviously if this fails there needs to be a fall-back to read from
    elsewhere.
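
As a rough illustration of points 2/ and 3/, the interaction might look
like this from a shell.  Everything here ('live-spare', the ls$N links,
the device names) is part of the proposal above, not of any existing
kernel:

  # pair a live spare (dev-sdc1) with the member currently in slot 2
  echo live-spare > /sys/block/md0/md/dev-sdc1/state
  echo 0          > /sys/block/md0/md/dev-sdc1/recovery_start
  echo 2          > /sys/block/md0/md/dev-sdc1/slot

  # sysfs would then show both the member and its live spare:
  #   md/rd2 -> dev-sdb1    (main device in slot 2)
  #   md/ls2 -> dev-sdc1    (its live spare, in sync up to recovery_offset)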

Probably lots more details, but that might be enough to get me (or someone)
started one day.

There would be lots of work to do in mdadm too, of course, to report on these
extensions and to assemble arrays with live-spares.

NeilBrown

* Re: Swapping a disk without degrading an array
  2010-01-29 11:19 ` Neil Brown
@ 2010-01-29 15:35   ` Goswin von Brederlow
  2010-01-31 15:34     ` Asdo
  0 siblings, 1 reply; 11+ messages in thread
From: Goswin von Brederlow @ 2010-01-29 15:35 UTC (permalink / raw)
  To: linux-raid

Neil Brown <neilb@suse.de> writes:

> So time to start:  with a little design work.
>
> 1/ The start of such an operation *must* be recorded in the metadata.  If we
>    try to create a transparent whole-device copy then we could get confused
>    later.  So let's (for now) decide not to support 0.90 metadata, and support
>    this in 1.x metadata with:
>      - a new feature_flag saying that live spares are present
>      - the high bit set in dev_roles[] means that this device is a live spare
>        and is only in_sync up to 'recovery_offset'

Could the bitmap be used here too?

> 2/ in sysfs we currently identify devices with a symlink
>      md/rd$N -> dev-$X
>    for live-spare devices, this would be
>      md/ls$N -> dev-$X
>
> 3/ We create a live spare by writing 'live-spare' to md/dev-$X/state
>    and an appropriate value to md/dev-$X/recovery_start before setting
>    md/dev-$X/slot
>
> 4/ When a device fails, if there is a live spare it instantly takes
>    the place of the failed device.

Some cases:

1) The mirroring is still going and the error is in an in-sync region.

I think setting the drive to write-mostly and keeping it is better than
kicking the drive and requiring a resync to get the live-spare active.

2) The mirroring is still going and the error is in an out-of-sync region.

If the error is caused by the mirroring itself then the block can also
be restored from parity; then go to case 1. But if that happens often,
fail the drive anyway, as the errors cost too much time. Otherwise,
unless we have bitmaps so we can first repair the region covered by the
bit and then go to case 1, there is not much we can do here: fail the
drive.

It would be good to note that the disk being mirrored had faults and to
fail it immediately once the mirroring is complete.

Also the "often" above should be configurable and include a "never"
option. Say you have 2 disks that are damaged at different locations. By
creating a live-spare with "never" the mirroring would eventualy succeed
and repair the raid while kicking a disk would cause data loss.

3) The mirroring is complete.

There is no sense in keeping the broken disk: fail it and use the
live-spare instead. mdadm should probably have an option to automatically
remove the old disk once the mirroring is done for a live spare.
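
As a side note on the write-mostly idea in case 1: md already has a
write-mostly flag today, although only for RAID1 members. A sketch, with
made-up device names:

  # mark a RAID1 member write-mostly while adding it ...
  mdadm /dev/md0 --add --write-mostly /dev/sdc1

  # ... or toggle the flag on an existing member via sysfs
  echo writemostly > /sys/block/md0/md/dev-sdc1/state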

> 5/ This needs to be implemented separately in raid10 and raid456.
>    raid1 doesn't really need live spares  but I wouldn't be totally against
>    implementing them if it seemed helpful.

Raid1 would only need the "create new mirror without failing existing
disks" mode. The disks in a raid1 might all be damaged, but in different
locations.
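
For RAID1 this can already be approximated by growing the number of
mirrors instead of swapping devices. A sketch, with made-up device names:

  mdadm /dev/md0 --add /dev/sdc1          # add the new disk
  mdadm --grow /dev/md0 --raid-devices=3  # grow from 2 to 3 mirrors
  # ... wait for the recovery to finish (watch /proc/mdstat) ...
  mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
  mdadm --grow /dev/md0 --raid-devices=2  # back to 2 mirrors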

> 6/ There is no dynamic read balancing between a device and its live-spare.
>    If the live spare is in-sync up to the end of the read, we read from the
>    live-spare, else from the main device.

So the old drive is write-mostly. That makes (1) above irrelevant.

> 7/ writes transparently go to both the device and the live-spare, whether they
>    are normal data writes or resync writes or whatever.
>
> 8/ In raid5.h struct r5dev needs a second 'struct bio' and a second
>    'struct bio_vec'.
>    'struct disk_info' needs a second mdk_rdev_t.
>
> 9/ in raid10.h mirror_info needs another mdk_rdev_t and the anon struct in 
>    r10bio_s needs another 'struct bio *'.
>
> 10/ Both struct r5dev and r10bio_s need some counter or flag so we can know
>     when both writes have completed.
>
> 11/ For both r5 and r10, the 'recover' process needs to be enhanced to just
>     read from the main device when a live-spare is being built.
>     Obviously if this fails there needs to be a fall-back to read from
>     elsewhere.

Shouldn't recovery read from the live-spare where the live-spare is
already in-sync, and from the main drive otherwise?

> Probably lots more details, but that might be enough to get me (or someone)
> started one day.
>
> There would be lots of work to do in mdadm too of course to report on these
> extensions and to assemble arrays with live-spares..
>
> NeilBrown

MfG
        Goswin


* Re: Swapping a disk without degrading an array
  2010-01-29 15:35   ` Goswin von Brederlow
@ 2010-01-31 15:34     ` Asdo
  2010-01-31 16:33       ` Gabor Gombas
  0 siblings, 1 reply; 11+ messages in thread
From: Asdo @ 2010-01-31 15:34 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: linux-raid, Neil Brown

Goswin von Brederlow wrote:
> Neil Brown <neilb@suse.de> writes:
>
>   
>> So time to start:  with a little design work.
>>
>> 1/ The start of such an operation *must* be recorded in the metadata.  If we
>>    try to create a transparent whole-device copy then we could get confused
>>    later.  So let's (for now) decide not to support 0.90 metadata, and support
>>    this in 1.x metadata with:
>>      - a new feature_flag saying that live spares are present
>>      - the high bit set in dev_roles[] means that this device is a live spare
>>        and is only in_sync up to 'recovery_offset'
>>     
>
> Could the bitmap be used here too?
>
>   
>> 2/ in sysfs we currently identify devices with a symlink
>>      md/rd$N -> dev-$X
>>    for live-spare devices, this would be
>>      md/ls$N -> dev-$X
>>
>> 3/ We create a live spare by writing 'live-spare' to md/dev-$X/state
>>    and an appropriate value to md/dev-$X/recovery_start before setting
>>    md/dev-$X/slot
>>
>> 4/ When a device fails, if there is a live spare it instantly takes
>>    the place of the failed device.
>>     
>
> Some cases:
>
> 1) The mirroring is still going and the error is in an in-sync region.
>
> I think setting the drive to write-mostly and keeping it is better than
> kicking the drive and requiring a resync to get the live-spare active.
>
> 2) The mirroring is still going and the error is in an out-of-sync region.
>
> If the error is caused by the mirroring itself then the block can also
> be restored from parity; then go to case 1. But if that happens often,
> fail the drive anyway, as the errors cost too much time. Otherwise,
> unless we have bitmaps so we can first repair the region covered by the
> bit and then go to case 1, there is not much we can do here: fail the
> drive.
>
> It would be good to note that the disk being mirrored had faults and to
> fail it immediately once the mirroring is complete.
>
> Also, the "often" above should be configurable and include a "never"
> option. Say you have 2 disks that are damaged at different locations. By
> creating a live-spare with "never", the mirroring would eventually
> succeed and repair the RAID, while kicking a disk would cause data loss.
>
> 3) The mirroring is complete.
>
> There is no sense in keeping the broken disk: fail it and use the
> live-spare instead. mdadm should probably have an option to automatically
> remove the old disk once the mirroring is done for a live spare.
>
>   
>> 5/ This needs to be implemented separately in raid10 and raid456.
>>    raid1 doesn't really need live spares  but I wouldn't be totally against
>>    implementing them if it seemed helpful.
>>     
>
> Raid1 would only need the "create new mirror without failing existing
> disks" mode. The disks in a raid1 might all be damaged, but in different
> locations.
>
>   
>> 6/ There is no dynamic read balancing between a device and its live-spare.
>>    If the live spare is in-sync up to the end of the read, we read from the
>>    live-spare, else from the main device.
>>     
>
> So the old drive is write-mostly. That makes (1) above irrelevant.
>
>   
>> 7/ writes transparently go to both the device and the live-spare, whether they
>>    are normal data writes or resync writes or whatever.
>>
>> 8/ In raid5.h struct r5dev needs a second 'struct bio' and a second
>>    'struct bio_vec'.
>>    'struct disk_info' needs a second mdk_rdev_t.
>>
>> 9/ in raid10.h mirror_info needs another mdk_rdev_t and the anon struct in 
>>    r10bio_s needs another 'struct bio *'.
>>
>> 10/ Both struct r5dev and r10bio_s need some counter or flag so we can know
>>     when both writes have completed.
>>
>> 11/ For both r5 and r10, the 'recover' process needs to be enhanced to just
>>     read from the main device when a live-spare is being built.
>>     Obviously if this fails there needs to be a fall-back to read from
>>     elsewhere.
>
> Shouldn't recovery read from the live-spare where the live-spare is
> already in-sync, and from the main drive otherwise?
>
>   
>> Probably lots more details, but that might be enough to get me (or someone)
>> started one day.
>>
>> There would be lots of work to do in mdadm too of course to report on these
>> extensions and to assemble arrays with live-spares..
>>
>> NeilBrown
>>     
>
> MfG
>         Goswin
>   

The implementation you are proposing is great, very featureful.

However, for a first implementation there is probably a simpler 
alternative that gives most of the benefits and still leaves you the 
chance to add the rest of the features afterwards.

This would be my suggestion:

1/ The live-spare gets filled with data without recording anything in 
any superblock. If there is a power failure and a reboot, the newly 
started MD will know nothing about this; the process has to be restarted.

2/ When the live-spare is full of data, you switch the superblocks in a 
quick (almost atomic) operation. You remove the old device from the 
array and you add the new device in its place.

This doesn't support two copies of a drive running together, but I guess 
most people would use hot-device-replace simply as a replacement for 
"fail" (also see my other post in the thread "Re: Read errors on raid5 
ignored, array still clean .. then disaster !!"). It would already have 
great value for us, judging from what I have read recently on the ML.

What I'd really suggest for the algorithm is: while reading the old 
device for replication, don't fail and kick out the old device if there 
are read errors on a few sectors. Just read from parity and go on. 
Unless the old drive is in a really disastrous state (it doesn't respond 
to anything, times out too many times, or was kicked by the controller), 
try to fail the old device only at the end.

If the parity read also fails, fail just the hot-device-replace operation 
(and log something to dmesg), not the whole old device (failing the 
whole old device would trigger a rebuild and could eventually bring down 
the array). The rationale is that hot-device-replace should be a safe 
operation that the sysadmin can run without anxiety. If the sysadmin 
knows that the operation can bring down the array, the purpose of this 
feature would be partly defeated, IMHO.

E.g. in the case of RAID-6, the algorithm would be:
For each block:
    read the block from the disk being replaced and write it to the hot-spare
    if the read fails:
        read from all other disks
        if you get at least N-2 error-free reads:
            compute the block and write it to the hot-spare
        else:
            fail the hot-device-replace operation (I suggest leaving
            the array up); log something to dmesg; mdadm can send an
            email; also see below (*)


The hot-device-replace feature makes a great addition, especially if 
coupled with the "threshold for max corrected read errors" feature: 
hot-device-replace should get triggered when the threshold for max 
corrected read errors is exceeded. See the motivation for it in my other 
post in the thread "Re: Read errors on raid5 ignored, array still clean .. 
then disaster !!".

(*) If "threshold for max corrected read errors" is surpassed by more 
than 1, it means more than one hot-device-replace actions have failed 
due to too many read errors on the same stripe. I suggest to still keep 
the array up and do not fail disks, however I hope mdadm is set to send 
emails... If the drive then shows an uncorrectable read error probably 
there's no other choice than failing it, however in this case the array 
will certainly go down.
Summing up I suggest to really "fail" the drive (remove from array) only 
if "threshold for max corrected read errors" is surpassed AND "an 
uncorrectable read error happens". When just one of the 2 things happen, 
I suggest to just try triggering an hot-device-replace.

Thank you
Asdo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Swapping a disk without degrading an array
  2010-01-31 15:34     ` Asdo
@ 2010-01-31 16:33       ` Gabor Gombas
  2010-01-31 17:32         ` Goswin von Brederlow
  0 siblings, 1 reply; 11+ messages in thread
From: Gabor Gombas @ 2010-01-31 16:33 UTC (permalink / raw)
  To: Asdo; +Cc: Goswin von Brederlow, linux-raid, Neil Brown

On Sun, Jan 31, 2010 at 04:34:03PM +0100, Asdo wrote:

> 1/ The live-spare gets filled with data without recording anything in
> any superblock. If there is a power failure and a reboot, the newly
> started MD will know nothing about this; the process has to be restarted.

IMHO MD must know about the copy and it must know not to use the new
device before the copying is completed. Otherwise after a reboot mdadm
may either import the new half-written spare instead of the real one if
the superblock is already copied, or other tools like LVM may start
using the new half-written spare instead of the RAID if the MD
superblock is still missing.

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------


* Re: Swapping a disk without degrading an array
  2010-01-31 16:33       ` Gabor Gombas
@ 2010-01-31 17:32         ` Goswin von Brederlow
  0 siblings, 0 replies; 11+ messages in thread
From: Goswin von Brederlow @ 2010-01-31 17:32 UTC (permalink / raw)
  To: Gabor Gombas; +Cc: Asdo, Goswin von Brederlow, linux-raid, Neil Brown

Gabor Gombas <gombasg@sztaki.hu> writes:

> On Sun, Jan 31, 2010 at 04:34:03PM +0100, Asdo wrote:
>
>> 1/ The live-spare gets filled with data without recording anything in
>> any superblock. If there is a power failure and a reboot, the newly
>> started MD will know nothing about this; the process has to be restarted.
>
> IMHO MD must know about the copy and it must know not to use the new
> device before the copying is completed. Otherwise after a reboot mdadm
> may either import the new half-written spare instead of the real one if
> the superblock is already copied, or other tools like LVM may start
> using the new half-written spare instead of the RAID if the MD
> superblock is still missing.
>
> Gabor

No, that is exactly what he means to avoid.

His suggestion is that, at the start, the metadata area of the live-spare
is kept as is, so it remains a simple unused spare. Only the in-memory
data records that it actually is a live-spare, and only the data part of
the device is mirrored.

Then at the end you remove the old disk, add the live-spare and record
the change in the metadata of all drives in a semi-atomic way. If
anything interrupts the operation before this point, the live-spare will
still be recognised as a normal spare when the RAID is reassembled.

MfG
        Goswin



Thread overview: 11+ messages
2010-01-25 12:11 Swapping a disk without degrading an array Michał Sawicz
2010-01-25 12:25 ` Majed B.
2010-01-25 12:53   ` Mikael Abrahamsson
2010-01-25 14:44 ` Michał Sawicz
2010-01-25 14:51 ` Asdo
2010-01-25 17:40 ` Goswin von Brederlow
2010-01-29 11:19 ` Neil Brown
2010-01-29 15:35   ` Goswin von Brederlow
2010-01-31 15:34     ` Asdo
2010-01-31 16:33       ` Gabor Gombas
2010-01-31 17:32         ` Goswin von Brederlow
