Split-Brain Protection for MD arrays

linux-raid.vger.kernel.org archive mirror

* Split-Brain Protection for MD arrays
@ 2011-12-12 18:51 Alexander Lyakas
  2011-12-12 20:18 ` Vincent Pelletier
  2011-12-15  3:02 ` NeilBrown
  0 siblings, 2 replies; 8+ messages in thread
From: Alexander Lyakas @ 2011-12-12 18:51 UTC (permalink / raw)
  To: linux-raid

Hello Neil and all the MD developers.

There have been a couple of emails asking about MD split-brain
situations (well, one from a co-worker, so perhaps that doesn't
count). The simplest example of a split-brain is a 2-drive RAID1
operating in degraded mode where, after a reboot, the array is
re-assembled with the drive that previously failed.

I would like to propose an approach that would detect when assembling
an array may result in split-brain, and at least warn the user. The
proposed approach is documented in a 3-page googledoc, linked here:
https://docs.google.com/document/d/1sgO7NgvIFBDccoI3oXp9FNzB6RA5yMwqVN3_-LMSDNE/edit
(anybody can comment).

The approach is very much based on what MD already has today in the
kernel, with only one possible change. On the mdadm side, only code
that checks things and warns the user needs to be added, i.e., no
extra IOs or non-in-memory operations.

I would very much appreciate a review of the doc, mostly in terms of
my understanding of how MD superblocks work. The doc contains some lines
in bold blue font, which are my questions, and comments are very
welcome. I am in the process of testing the code changes I made in my
system; once I am happy with them, I can post them for review as well,
if there is interest. If the community decides that this has value, I
will be happy to work out the best way to add the required
functionality.

I also have some additional questions that popped up while I was studying
the MD code; any help on these is appreciated.

- When a drive fails, the kernel skips updating its superblock, and
updates all other superblocks to record that this drive is Faulty. How
can it happen that a drive marks itself as Faulty in its own superblock? I
saw code in mdadm checking for this.

- Why does mdadm initialize the dev_roles[] array to 0xFFFF, while the
kernel initializes it to 0xFFFE? Since 0xFFFF also indicates a spare, this
is confusing: we might think that we have 380+ spares...

- Why is an event margin of 1 permitted both in userspace and in the
kernel? Is this for the case where we update all the superblocks in
parallel in the kernel, but crash in the middle?

- Why does the enough() function in mdadm ignore the "clean" parameter for
raid1/10? Is this because, if such an array is unclean, there is no
way of knowing, even with all drives present, which copy contains the
correct data?

- In Assemble.c: update_super(st, &devices[j].i, "assemble") is called
and updates the "chosen_drive" superblock only (which might not even
write this to disk, unless force is given), but later in add_disk the
disk.state might still have the FAULTY flag set
(because it was only cleared in the "chosen_drive" superblock). What
am I missing?

- In Assemble.c: req_cnt = content->array.working_disks: taken from
the "most recent" superblock, but even the most recent superblock may
indicate a FAILED array.
This actually leads to the question that interests me most, and I also
ask it in the doc. Why do we continue updating the superblocks after
the array fails? This way we basically lose the "last known good
configuration", i.e., we don't know the last good set of devices the array
was operating on. Had we known that, it might be useful in assisting
people with recovering their arrays, I think. Otherwise, we need to
guess in what sequence the drives failed until the array died.

Thanks to everybody for taking time reading/answering those....and
please be gentle.
  Alex.


* Re: Split-Brain Protection for MD arrays
  2011-12-12 18:51 Split-Brain Protection for MD arrays Alexander Lyakas
@ 2011-12-12 20:18 ` Vincent Pelletier
  2011-12-13  9:50   ` Alexander Lyakas
  2011-12-15  3:02 ` NeilBrown
  1 sibling, 1 reply; 8+ messages in thread
From: Vincent Pelletier @ 2011-12-12 20:18 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid

On Monday 12 December 2011 19:51:23, you wrote:
> split-brain

I'm participating in the NEO[1] project (an object database server with
redundancy - that last bit is the one relevant to this discussion), which
faces the same kind of problem (storage nodes dying whether or not the cluster
is functional, dead nodes coming back to life later, etc.). So we had to design
some countermeasures to handle split-brain.

I'm happy to recognise some equivalent of the decisions we took on NEO, and 
I'll be following this thread with attention (we didn't try to get a lot of 
reviewing on our design so far).

I would suggest one thing:
Use a fixed increment for the "metadata version" number. Time representation is
not reliable IMHO, especially at times when you need to set up an array:
faulty BIOS battery, old RTC drifting either way, no NTP to correct this
(either none available or no client to access one).
A timestamp that is affected by timezone (and especially DST) makes matters
worse.
Admittedly, a fixed increment exposes the user to problems if he decides to
independently run the two halves of a split brain, lets their data
diverge, reaches a (controllable) point where the version number is at some
convenient value, and then lets the array assemble itself and burst into flames.
However, the user has to jump through hoops to reach this, whereas the
timestamp-based approach fails with nothing more than a non-monotonic RTC.
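
A minimal Python sketch of the fixed-increment caveat described above. The
`Half` class and its fields are hypothetical names used only for illustration;
they are not NEO or MD code. It shows two halves that run independently,
reach the same version number, and still hold different data:

```python
from dataclasses import dataclass, field

@dataclass
class Half:
    """One independently running half of a split array (illustrative only)."""
    version: int                          # fixed-increment metadata version
    writes: list = field(default_factory=list)

    def write(self, data):
        # Each metadata-changing write bumps the version by exactly 1.
        self.writes.append(data)
        self.version += 1

def same_version_but_diverged(a, b):
    # Equal counters say nothing about whether the contents match.
    return a.version == b.version and a.writes != b.writes

left = Half(version=10)
right = Half(version=10)
left.write("x")    # left half runs alone
right.write("y")   # right half runs alone, with different data
diverged = same_version_but_diverged(left, right)
```

The counter alone cannot distinguish the two halves here, which is why
reaching this state requires the user to deliberately run both halves.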

Side note: if anyone knows a time source available to userland which is not
affected by date/ntpd/ntpdate nor timezones nor DST (but can drift when 
computer is powered down - but if possible not when suspended), please tell 
me.

[1] http://pypi.python.org/pypi/neoppod

Regards,
-- 
Vincent Pelletier

* Re: Split-Brain Protection for MD arrays
  2011-12-12 20:18 ` Vincent Pelletier
@ 2011-12-13  9:50   ` Alexander Lyakas
  0 siblings, 0 replies; 8+ messages in thread
From: Alexander Lyakas @ 2011-12-13  9:50 UTC (permalink / raw)
  To: Vincent Pelletier; +Cc: linux-raid

Vincent,
thanks for reviewing.

> I would suggest one thing:
> Use a fixed increment for "metadata version" number.
Yes, that is what's happening in MD. The doc was confusing about the
"timestamp" part.

> Admittedly, fixed increment exposes user to problems if he decides to
> independently run two halves of a split brain, start making their data
> diverge, reach a point (controllable) where version number is at some
> convenient value and then let the array assemble itself and burst into flames.
> Though, user has to jump through hoops to reach this.

Yes, so for that case I was thinking that once the user decides to
ignore the split-brain warning and still go ahead with the assemble,
drives that are not accessible at that point will not be used
from then on (an "external entity" should take care of that). The doc
mentions that as well.

Thanks,
 Alex.


* Re: Split-Brain Protection for MD arrays
  2011-12-12 18:51 Split-Brain Protection for MD arrays Alexander Lyakas
  2011-12-12 20:18 ` Vincent Pelletier
@ 2011-12-15  3:02 ` NeilBrown
  2011-12-15 14:29   ` Alexander Lyakas
  1 sibling, 1 reply; 8+ messages in thread
From: NeilBrown @ 2011-12-15  3:02 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid


On Mon, 12 Dec 2011 20:51:23 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hello Neil and all the MD developers.
> 
> There have been a couple of emails asking about MD split-brain
> situations (well, one from a co-worker, so perhaps that doesn't
> count). The simplest example of a split-brain is a 2-drive RAID1
> operating in degraded mode where, after a reboot, the array is
> re-assembled with the drive that previously failed.
> 
> I would like to propose an approach that would detect when assembling
> an array may result in split-brain, and at least warn the user. The
> proposed approach is documented in a 3-page googledoc, linked here:
> https://docs.google.com/document/d/1sgO7NgvIFBDccoI3oXp9FNzB6RA5yMwqVN3_-LMSDNE/edit
> (anybody can comment).

I much prefer text to be inline in the email.  It is much easier to comment
on.  I really don't even want to think about learning how to comment on a
google-docs thing.

> 
> The approach is very much based on what MD already has today in the
> kernel, with only one possible change. On the mdadm side, only code
> that checks things and warns the user needs to be added, i.e., no
> extra IOs or non-in-memory operations.

This "warns the user" thing concerns me somewhat.

The simplest example of a possible split brain is a 2-device RAID1 where only
one device is available.   Your document seems to suggest that assembling
such an array should require user intervention.  I cannot agree with that.
Even assembling a 2-out-of-4 RAID6 should "just work".

We already have the "--no-degraded" option for anyone who wants to
request failure rather than a possible split brain.  I don't think we want or
need more than that.

> 
> I would very much appreciate a review of the doc, mostly in terms of
> my understanding how MD superblocks work. The doc contains some lines
> in bold blue font, which are my questions, and comments are very
> welcome. I am in the process of testing the code changes I made in my
> system, once I am happy with them, I can post them as well for review,
> if there is interest. If the community decides that this has value, I
> will be happy to work out the best way to add the required
> functionality.
> 
> I also have some additional questions that popped up while I was studying
> the MD code; any help on these is appreciated.
> 
> - When a drive fails, the kernel skips updating its superblock, and
> updates all other superblocks that this drive is Faulty. How can it
> happen that a drive can mark itself as Faulty in its own superblock? I
> saw code in mdadm checking for this.

It cannot, as you say.

I don't remember why mdadm checks for that.  Maybe a very old version of the
kernel code could do that.

> 
> - Why mdadm initializes the dev_roles[] array to 0xFFFF, but kernel
> initializes it to 0xFFFE? Since 0xFFFF also indicates a spare, this is
> confusing, we might think that we have 380+ spares...

"this is confusing" is exactly correct.
I never really sorted out what values I wanted in the dev_roles array.

With the benefit of the extra hindsight I now have, I think there should have
been 3 special values:  'failed', 'spare' and 'missing'.

So we would initialise to 'missing'.  As we add devices their slot first becomes
'spare', and then maybe becomes N (for some role in the array), and then
eventually 'failed' when the device fails (though this is never recorded on
the device itself).

If we re-add a failed device, we give it the same slot and make it 'spare' or
'N' again.

Eventually we could 'use up' all the available slots (no 'missing' slots
left) and so would need to convert some 'failed' slots to 'missing'.

So I guess when I was writing mdadm I thought that missing devices were
'spare' and when I was writing the kernel code I thought that 'missing'
devices were failed. :-(

We cannot safely add another special value now so I think the best way
forward is to treat 'spare' and 'missing' as the same.  So when we add a
spare we cannot just look for a free slot in the array, but must look at all
current spares as well to see what role they hold.  Awkward but not
impractical.

When we mark a device 'failed' it should stay marked as 'failed'.  When the
array is optimal again it is safe to convert all 'failed' slots to
'spare/missing' but not before.
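
A minimal Python sketch of the two rules above, assuming dev_roles[] is a
plain list and using the two special values mentioned in this thread. The
function names and the `spare_slots_in_use` argument are hypothetical; this
is not the kernel or mdadm implementation:

```python
ROLE_SPARE = 0xFFFF    # also stands for 'missing' under the proposal above
ROLE_FAILED = 0xFFFE

def find_slot_for_new_spare(dev_roles, spare_slots_in_use):
    """Pick a dev_roles[] index for a newly added spare.

    A 0xFFFF entry may already belong to an existing spare, so slots held
    by current spares (spare_slots_in_use) have to be skipped.
    """
    for slot, role in enumerate(dev_roles):
        if role == ROLE_SPARE and slot not in spare_slots_in_use:
            return slot
    return None  # no free slot left

def release_failed_slots(dev_roles, raid_disks):
    """Convert 'failed' slots to 'spare/missing' only if the array is optimal."""
    active_roles = {r for r in dev_roles if r < raid_disks}
    if len(active_roles) != raid_disks:
        return list(dev_roles)  # still degraded: keep the 'failed' marks
    return [ROLE_SPARE if r == ROLE_FAILED else r for r in dev_roles]

degraded = [0, ROLE_FAILED, ROLE_SPARE, ROLE_SPARE]
optimal = [0, ROLE_FAILED, 1, ROLE_SPARE]
```

With `raid_disks=2`, the `degraded` list is returned unchanged, while the
`optimal` list has its 'failed' entry released.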

> 
> - Why event margin of 1 is permitted both in user and kernel? Is this
> for the case when we update all the superblocks in parallel in the
> kernel, but crash in the middle?

Exactly.
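
A minimal Python sketch of what an event margin of 1 means in practice. The
function and device names are hypothetical and the real checks in mdadm and
the kernel are more involved; the sketch only shows that a device lagging by
one event (a crash in the middle of a parallel superblock update) is still
treated as current:

```python
EVENT_MARGIN = 1  # tolerated lag in event count, as discussed above

def up_to_date_devices(events_by_dev):
    """Return devices whose event count is within EVENT_MARGIN of the newest."""
    newest = max(events_by_dev.values())
    return sorted(
        dev for dev, events in events_by_dev.items()
        if events + EVENT_MARGIN >= newest
    )

# Crash mid-update: sda got the new count, sdb did not; sdc failed long ago.
current = up_to_date_devices({"sda": 42, "sdb": 41, "sdc": 30})
```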

> 
> - Why enough() function in mdadm ignores the "clean" parameter for
> raid1/10? Is this because if such array is unclean, then there is no
> way of knowing, even with all drives present, which copy contains the
> correct data?

In RAID1/RAID10, if the array is not clean we simply choose the 'first'
working devices (in some arbitrary ordering) and we have good-enough data.

In RAID5/6 if the array is not clean, then we cannot trust the parity so if
any device is missing, then the data for that device cannot be reliably
recovered.

They really are very different situations.
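
A simplified Python sketch of the distinction, assuming only RAID1, RAID5 and
RAID6 (RAID10 depends on the layout and is left out). The signature is
hypothetical and this is not mdadm's actual enough() code:

```python
def enough(level, raid_disks, clean, avail_disks):
    """Can an array with avail_disks working devices be started?"""
    if level == 1:
        # Any one working mirror holds good-enough data; 'clean' is ignored.
        return avail_disks >= 1
    if level == 5:
        # Parity is only trustworthy when clean, so a dirty array needs all.
        tolerated = 1 if clean else 0
        return avail_disks >= raid_disks - tolerated
    if level == 6:
        tolerated = 2 if clean else 0
        return avail_disks >= raid_disks - tolerated
    raise ValueError("level not covered by this sketch")
```

For example, a dirty 3-device RAID5 with one device missing is refused, while
a dirty 2-device RAID1 with one device missing is accepted.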

> 
> - In Assemble.c: update_super(st, &devices[j].i, "assemble") is called
> and updates the "chosen_drive" superblock only (which might not even
> write this to disk, unless force is given), but later in add_disk the
> disk.state might still have the FAULTY flag set
> (because it was only cleared in the "chosen_drive" superblock). What
> am I missing?

The 'chosen' drive is the first one given to the kernel, and the kernel
believes it in preference to subsequent devices.  So rather  than update all
superblocks we only need to update one.

> 
> - In Assemble.c: req_cnt = content->array.working_disks: taken from
> the "most recent" superblock, but even the most recent superblock may
> indicate a FAILED array.
> This actually leads to the question that interests me most, and I also
> ask it in the doc. Why do we continue updating the superblocks after
> the array fails? This way we basically lose "last known good
> configuration", i.e., we don't know the last good set of devices array
> was operating on. Had we known that, that might be useful in assisting
> people on recovering their arrays, I think. Otherwise, we need to
> guess in what sequence drives failed until the array died.

I've wondered that too - but never been quite confident enough to change it.

If you have a working array and you initiate a write of a data block and the
parity block, and if one of those writes fails, then you no longer have a
working array.  Some data blocks in that stripe cannot be recovered.
So we need to make sure that admin knows the array is dead and doesn't just
re-assemble and think everything is OK.

So we go ahead and record the failure.
mdadm -Af can fix it up and allow you to continue with a possibly-corrupt
array. 

If you want other questions answered, best to include them in an Email.

I think to resolve this issue we need 2 things.

1/ When assembling an array, if any device thinks that the 'chosen' device has
   failed, then don't trust that device.
2/ Don't erase 'failed' status from dev_roles[] until the array is
   optimal.
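
A minimal Python sketch of check 1/, assuming each superblock is represented
as a dict holding the device's own slot and its copy of dev_roles[]. The
dict layout, the function name and the device names are hypothetical; this
only illustrates the idea, not an actual mdadm change:

```python
ROLE_FAILED = 0xFFFE  # 'failed' marker in dev_roles[], as discussed above

def devices_distrusting_chosen(chosen, superblocks):
    """Return devices whose own dev_roles[] marks the chosen device as failed."""
    chosen_slot = superblocks[chosen]["slot"]
    return sorted(
        dev for dev, sb in superblocks.items()
        if dev != chosen and sb["dev_roles"][chosen_slot] == ROLE_FAILED
    )

# sda failed earlier and never saw the update; sdb recorded sda's failure.
superblocks = {
    "sda": {"slot": 0, "dev_roles": [0, 1]},
    "sdb": {"slot": 1, "dev_roles": [ROLE_FAILED, 1]},
}
distrusting = devices_distrusting_chosen("sda", superblocks)
```

A non-empty result would be the signal to stop trusting the chosen device.
The check only works if rule 2/ holds, i.e. the 'failed' mark in sdb's
superblock has not been erased in the meantime.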

NeilBrown


* Re: Split-Brain Protection for MD arrays
  2011-12-15  3:02 ` NeilBrown
@ 2011-12-15 14:29   ` Alexander Lyakas
  2011-12-15 19:40     ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Alexander Lyakas @ 2011-12-15 14:29 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Neil,
thanks for the review, and for detailed answers to my questions.

> When we mark a device 'failed' it should stay marked as 'failed'.  When the
> array is optimal again it is safe to convert all 'failed' slots to
> 'spare/missing' but not before.
I did not understand all that reasoning. When you say "slot", you mean
index in the dev_roles[] array, correct? If yes, I don't see what
importance the index has, compared to the value of the entry itself
(which is "role" in your terminology).
Currently, 0xFFFE means both "failed" and "missing", and that makes
perfect sense to me. Basically this means that this entry of
dev_roles[] is unused. When a device fails, it is kicked out of the
array, so its entry in dev_roles[] becomes available.
(You once mentioned that for older arrays, their dev_roles[] index was
also their role, perhaps you are concerned about those too).
In any case, I will be watching for changes in this area, if you
decide to make them (although I think this might break backwards
compatibility, unless a new superblock version is used).

> If you have a working array and you initiate a write of a data block and the
> parity block, and if one of those writes fails, then you no longer have a
> working array.  Some data blocks in that stripe cannot be recovered.
> So we need to make sure that admin knows the array is dead and doesn't just
> re-assemble and think everything is OK.
I see your point. I don't know what's better: to know the "last known
good" configuration, or to know that the array has failed. I guess, I
am just used to the former.

> I think to resolve this issue we need 2 things.
>
> 1/ when assembling an array if any device thinks that the 'chosen' device has
>   failed, then don't trust that device.
I think that if any device thinks that "chosen" has failed, then
either it has a more recent superblock, in which case this device should be
"chosen" instead of the other, or the "chosen" device's superblock is
the one that counts, in which case it doesn't matter what the current device
thinks, because the array will be assembled according to the "chosen"
superblock.

> 2/ Don't erase 'failed' status from dev_roles[] until the array is
> optimal.

Neil, I think both these points don't resolve the following simple
scenario: RAID1 with drives A and B. Drive A fails, and the array continues to
operate on drive B. After reboot, only drive A is accessible. If we go
ahead with assemble, we will see stale data. If, however, after reboot we
see only drive B, then (since A is "faulty" in B's
superblock), we can go ahead and assemble. The change I suggested will
abort in the first case, but will assemble in the second case.
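
A minimal Python sketch of the two cases in this scenario. It assumes the
check is "abort if the superblock we assemble from still regards an absent
device as active"; the function name and the dict layout are hypothetical,
and the actual proposal in the doc may differ:

```python
def may_be_split_brain(superblock_view, present):
    """superblock_view maps device -> "active" or "faulty" as recorded on disk.

    Returns True if some device the superblock believes active is absent,
    because that device may have kept running on its own.
    """
    return any(
        state == "active" and dev not in present
        for dev, state in superblock_view.items()
    )

# Drive A failed first, so its superblock never learned about the failure.
view_on_a = {"A": "active", "B": "active"}
# Drive B kept running and recorded that A is faulty.
view_on_b = {"A": "faulty", "B": "active"}

only_a_visible = may_be_split_brain(view_on_a, present={"A"})
only_b_visible = may_be_split_brain(view_on_b, present={"B"})
```

The first case is flagged (A's data may be stale), while the second is not.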

But obviously, you know better what MD users expect and want.
Thanks again for taking time and reviewing the proposal! And yes, next
time, I will put everything in the email.

Alex.

* Re: Split-Brain Protection for MD arrays
  2011-12-15 14:29   ` Alexander Lyakas
@ 2011-12-15 19:40     ` NeilBrown
  2011-12-16 13:46       ` Roberto Spadim
  2011-12-16 14:30       ` Alexander Lyakas
  0 siblings, 2 replies; 8+ messages in thread
From: NeilBrown @ 2011-12-15 19:40 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid


On Thu, 15 Dec 2011 16:29:12 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Neil,
> thanks for the review, and for detailed answers to my questions.
> 
> > When we mark a device 'failed' it should stay marked as 'failed'.  When the
> > array is optimal again it is safe to convert all 'failed' slots to
> > 'spare/missing' but not before.
> I did not understand all that reasoning. When you say "slot", you mean
> index in the dev_roles[] array, correct? If yes, I don't see what
> importance the index has, compared to the value of the entry itself
> (which is "role" in your terminology).
> Currently, 0xFFFE means both "failed" and "missing", and that makes
> perfect sense to me. Basically this means that this entry of
> dev_roles[] is unused. When a device fails, it is kicked out of the
> array, so its entry in dev_roles[] becomes available.
> (You once mentioned that for older arrays, their dev_roles[] index was
> also their role, perhaps you are concerned about those too).
> In any case, I will be watching for changes in this area, if you
> decide to make them (although I think this might break backwards
> compatibility, unless a new version of superblock will be used).

Maybe...  as I said, "confusing" is a relevant word in this area.

> 
> > If you have a working array and you initiate a write of a data block and the
> > parity block, and if one of those writes fails, then you no longer have a
> > working array.  Some data blocks in that stripe cannot be recovered.
> > So we need to make sure that admin knows the array is dead and doesn't just
> > re-assemble and think everything is OK.
> I see your point. I don't know what's better: to know the "last known
> good" configuration, or to know that the array has failed. I guess, I
> am just used to the former.

Possibly an 'array-has-failed' flag in the metadata would allow us to keep
the last known-good config.  But as it isn't any good any more I don't really
see the point.


> 
> > I think to resolve this issue we need 2 things.
> >
> > 1/ when assembling an array if any device thinks that the 'chosen' device has
> >   failed, then don't trust that device.
> I think that if any device thinks that "chosen" has failed, then
> either it has a more recent superblock, and then this device should be
> "chosen" and not the other. Or, the "chosen" device's superblock is
> the one that counts, then it doesn't matter what current device
> thinks, because array will be assembled according to the "chosen"
> superblock.

This is exactly what the current code does and it allows you to assemble an
array after a split-brain experience.  This is bad.  Checking what other
devices think of the chosen device lets you detect the effect of a
split-brain.


> 
> > 2/ Don't erase 'failed' status from dev_roles[] until the array is
> > optimal.
> 
> Neil, I think both these points don't resolve the following simple
> scenario: RAID1 with drives A and B. Drive A fails, and the array continues to
> operate on drive B. After reboot, only drive A is accessible. If we go
> ahead with assemble, we will see stale data. If, however, after reboot we
> see only drive B, then (since A is "faulty" in B's
> superblock), we can go ahead and assemble. The change I suggested will
> abort in the first case, but will assemble in the second case.

Using --no-degraded will do what you want in both cases.  So no code change
is needed!

> 
> But obviously, you know better what MD users expect and want.

Don't bet on it.
So far I have one vote - from you - that --no-degraded should be the default
(I think that is what you are saying).  If others agree I'll certainly
consider it more.

Note that "--no-degraded" doesn't exactly mean "do not assemble a degraded
array".  It means "don't assemble an array more degraded than it was the last
time it was working", i.e. require that all devices that are working
according to the metadata are actually available.

NeilBrown



> Thanks again for taking time and reviewing the proposal! And yes, next
> time, I will put everything in the email.
> 
> Alex.



* Re: Split-Brain Protection for MD arrays
  2011-12-15 19:40     ` NeilBrown
@ 2011-12-16 13:46       ` Roberto Spadim
  2011-12-16 14:30       ` Alexander Lyakas
  1 sibling, 0 replies; 8+ messages in thread
From: Roberto Spadim @ 2011-12-16 13:46 UTC (permalink / raw)
  To: NeilBrown; +Cc: Alexander Lyakas, linux-raid

Just some points that we shouldn't forget... thinking like an end user
of mdadm, not as a developer...
A disk failure occurs about once in 2 years of heavy use on a desktop SATA disk.
A complex structure just for 1 minute of mdadm --remove, mdadm --add...
that minute should be acceptable to end users; it's just 1 minute in 2 years.
2 years = 730 days = 17520 hours = 1051200 minutes; in other words, 1 minute
~= 1/1,000,000 = 0.0001% downtime, 99.9999% online time. If we
consider turning the server off to add a new disk and remove the old one, let's
say 10 minutes? 0.001% downtime = 99.999% online time.
That's well accepted for desktops and servers...
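
The arithmetic above can be checked with a short Python sketch. The function
name is hypothetical, and the 2-year interval between failures is the
estimate from this message, not a measured figure:

```python
MINUTES_IN_TWO_YEARS = 2 * 365 * 24 * 60   # 730 days -> 1051200 minutes

def downtime_percent(minutes_down, period_minutes=MINUTES_IN_TWO_YEARS):
    """Share of the period spent offline, as a percentage."""
    return 100.0 * minutes_down / period_minutes

one_minute = downtime_percent(1)     # roughly 0.0001 %
ten_minutes = downtime_percent(10)   # roughly 0.001 %
uptime_ten = 100.0 - ten_minutes     # roughly 99.999 %
```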

For raid1 and linear: I don't see the need for really complex logic telling
which block isn't OK; just a counter telling which disk has the more recent
data is welcome.
For raid10, raid5 and raid6: OK, we can allow block-specific logic, since
we could consider a bad disk as many bad blocks, alongside many good blocks
(on the good disk).





-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial

* Re: Split-Brain Protection for MD arrays
  2011-12-15 19:40     ` NeilBrown
  2011-12-16 13:46       ` Roberto Spadim
@ 2011-12-16 14:30       ` Alexander Lyakas
  1 sibling, 0 replies; 8+ messages in thread
From: Alexander Lyakas @ 2011-12-16 14:30 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hello Neil,

I have re-run all my tests with the "--no-degraded" option, and I
happily admit that it provides perfect split-brain protection!
The (okcnt + rebuilding_cnt >= req_cnt) logic provides exactly that.

On the other hand, this option, exactly as you mentioned, doesn't
allow the array to come up "more degraded than it was last time it was
working". So there are cases in which there is no split-brain danger,
but the array will not come up with this option. For example, a
3-drive RAID5 coming up with 2 drives after reboot. In those cases,
my tests failed with "--no-degraded", as expected.
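
A minimal Python sketch of the condition quoted above, applied to the cases
discussed in this thread. The variable names okcnt, rebuilding_cnt and
req_cnt come from the message; the wrapper function is hypothetical, and
req_cnt is assumed to be the working-disk count from the most recent
superblock:

```python
def no_degraded_allows(okcnt, rebuilding_cnt, req_cnt):
    # The condition quoted in the message above.
    return okcnt + rebuilding_cnt >= req_cnt

# 2-drive RAID1, only stale drive A visible: its superblock still expects
# 2 working disks, so assembly is refused.
raid1_stale = no_degraded_allows(okcnt=1, rebuilding_cnt=0, req_cnt=2)

# 2-drive RAID1, only drive B visible: its superblock recorded 1 working disk.
raid1_fresh = no_degraded_allows(okcnt=1, rebuilding_cnt=0, req_cnt=1)

# 3-drive RAID5 last seen with all 3 working, now only 2 visible: refused,
# although the single missing drive could not have run on its own.
raid5_two_of_three = no_degraded_allows(okcnt=2, rebuilding_cnt=0, req_cnt=3)
```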

I agree that most of the users probably don't need this special
"protect-from-split-brain, but allow-to-come-up-degraded" semantics,
which my approach provides.

Thanks again for your insights!
Alex.


