Re: Split-Brain Protection for MD arrays

From: Roberto Spadim <roberto@spadim.com.br>
To: NeilBrown <neilb@suse.de>
Cc: Alexander Lyakas <alex.bolshoy@gmail.com>,
	linux-raid <linux-raid@vger.kernel.org>
Subject: Re: Split-Brain Protection for MD arrays
Date: Fri, 16 Dec 2011 11:46:50 -0200
Message-ID: <CABYL=TrZ9hFTWa9Op8xaoN7qXYYRmdyrZU7AMvAnZWTeJVT=rg@mail.gmail.com>
In-Reply-To: <20111216064003.18a7ab4f@notabene.brown>

Just some points that we shouldn't forget, thinking like an end user
of mdadm, not as a developer.
A disk failure occurs about once every 2 years of heavy use on a desktop SATA disk.
A complex structure just to avoid 1 minute of mdadm --remove, mdadm --add
is hard to justify; end users should accept that minute, it's just 1 minute in 2 years.
2 years = 730 days = 17520 hours = 1051200 minutes; in other words, 1 minute
~= 1/1,000,000 = 0.0001% downtime, 99.9999% online time. If we
consider turning the server off to add a new disk and remove the old one, let's
say 10 minutes: 0.001% downtime = 99.999% online time.
That's well accepted for desktops and servers.
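
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name online_percent is made up for illustration and has nothing to do with mdadm.

```python
# Hypothetical helper (not part of mdadm): percentage of time an array
# stays online, given some minutes of downtime over a period of days.
MINUTES_PER_DAY = 24 * 60


def online_percent(downtime_minutes, period_days=730):
    """Return the percentage of the period during which the array is online."""
    total_minutes = period_days * MINUTES_PER_DAY  # 730 days -> 1051200 minutes
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


if __name__ == "__main__":
    # 1 minute for a hot remove/add, 10 minutes for a power-off disk swap.
    for minutes in (1, 10):
        print(f"{minutes:2d} min offline in 2 years: "
              f"{online_percent(minutes):.4f}% online")
```

The percentages in the text are rounded: 1 minute in 1051200 is slightly less than one in a million.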

For raid1 and linear, I don't see the need for complex logic telling which
block isn't OK; just a counter telling which disk has the more recent data
would be welcome.
For raid10, raid5 and raid6, OK, we can allow block-specific tracking, since
we could treat a bad disk as many bad blocks and many good blocks
(on the good disk).
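
A minimal sketch of the counter idea for raid1, assuming each member carries one update counter that is bumped on every metadata change. The Member class, the device names and the numbers are invented for illustration; this is not mdadm code.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str    # made-up device name, e.g. "sda1"
    events: int  # per-device update counter: higher means more recent


def freshest(members):
    """Return the members that hold the highest counter value."""
    newest = max(m.events for m in members)
    return [m for m in members if m.events == newest]


def stale(members):
    """Return the members whose counter lags behind the newest one."""
    newest = max(m.events for m in members)
    return [m for m in members if m.events < newest]


if __name__ == "__main__":
    mirror = [Member("sda1", 120), Member("sdb1", 97)]
    print("trust:", [m.name for m in freshest(mirror)])
    print("resync:", [m.name for m in stale(mirror)])
```

A counter like this can only rank the disks that are visible at the same time. If only the older disk is present at assembly, there is nothing to compare it against, which is the case discussed in the quoted thread below.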


2011/12/15 NeilBrown <neilb@suse.de>:
> On Thu, 15 Dec 2011 16:29:12 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Neil,
>> thanks for the review, and for detailed answers to my questions.
>>
>> > When we mark a device 'failed' it should stay marked as 'failed'.  When the
>> > array is optimal again it is safe to convert all 'failed' slots to
>> > 'spare/missing' but not before.
>> I did not understand all that reasoning. When you say "slot", you mean
>> index in the dev_roles[] array, correct? If yes, I don't see what
>> importance the index has, compared to the value of the entry itself
>> (which is "role" in your terminology).
>> Currently, 0xFFFE means both "failed" and "missing", and that makes
>> perfect sense to me. Basically this means that this entry of
>> dev_roles[] is unused. When a device fails, it is kicked out of the
>> array, so its entry in dev_roles[] becomes available.
>> (You once mentioned that for older arrays, their dev_roles[] index was
>> also their role, perhaps you are concerned about those too).
>> In any case, I will be watching for changes in this area, if you
>> decide to make them (although I think this might break backwards
>> compatibility, unless a new version of superblock will be used).
>
> Maybe...  as I said, "confusing" is a relevant word in this area.
>
>>
>> > If you have a working array and you initiate a write of a data block and the
>> > parity block, and if one of those writes fails, then you no longer have a
>> > working array.  Some data blocks in that stripe cannot be recovered.
>> > So we need to make sure that admin knows the array is dead and doesn't just
>> > re-assemble and think everything is OK.
>> I see your point. I don't know what's better: to know the "last known
>> good" configuration, or to know that the array has failed. I guess, I
>> am just used to the former.
>
> Possibly an 'array-has-failed' flag in the metadata would allow us to keep
> the last known-good config.  But as it isn't any good any more I don't really
> see the point.
>
>
>>
>> > I think to resolve this issue we need 2 things.
>> >
>> > 1/ when assembling an array if any device thinks that the 'chosen' device has
>> >   failed, then don't trust that device.
>> I think that if any device thinks that "chosen" has failed, then
>> either it has a more recent superblock, and then this device should be
>> "chosen" and not the other. Or, the "chosen" device's superblock is
>> the one that counts, then it doesn't matter what current device
>> thinks, because array will be assembled according to the "chosen"
>> superblock.
>
> This is exactly what the current code does and it allows you to assemble an
> array after a split-brain experience.  This is bad.  Checking what other
> devices think of the chosen device lets you detect the effect of a
> split-brain.
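
A rough sketch of the check described in the quoted paragraph above, assuming we can read from each device's superblock the set of devices it records as failed. The function name and the dictionary layout are hypothetical; this models the rule as stated in the thread, not mdadm's actual assembly code.

```python
def split_brain_suspects(chosen, failed_views):
    """Return devices whose metadata records the chosen device as failed.

    failed_views maps a device name to the set of device names that
    device's own superblock marks as failed (a made-up representation).
    A non-empty result suggests the devices ran independently of each
    other, so those devices should not be trusted at assembly time.
    """
    return sorted(
        dev
        for dev, failed in failed_views.items()
        if dev != chosen and chosen in failed
    )


if __name__ == "__main__":
    # Two halves of a mirror, each of which ran alone for a while and
    # marked the other half as failed.
    views = {"A": {"B"}, "B": {"A"}}
    print(split_brain_suspects("A", views))
```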
>
>
>>
>> > 2/ Don't erase 'failed' status from dev_roles[] until the array is
>> > optimal.
>>
>> Neil, I think both these points don't resolve the following simple
>> scenario: RAID1 with drive A and B. Drive A fails, array continues to
>> operate on drive B. After reboot, only drive A is accessible. If we go
>> ahead with assemble, we will see stale data. If after reboot, we,
>> however, see only drive B, then (since A is "faulty" in B's
>> superblock), we can go ahead and assemble. The change I suggested will
>> abort in the first case, but will assemble in the second case.
>
> Using --no-degraded will do what you want in both cases.  So no code change
> is needed!
>
>>
>> But obviously, you know better what MD users expect and want.
>
> Don't bet on it.
> So far I have one vote - from you - that --no-degraded should be the default
> (I think that is what you are saying).  If others agree I'll certainly
> consider it more.
>
> Note that "--no-degraded" doesn't exactly mean "not assemble a degraded
> array".  It means "don't assemble an array more degraded than it was last
> time it was working".  i.e. require that all devices that are working
> according to the metadata are actually available.
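
The rule in the quoted paragraph above can be modelled as a subset test. This is a sketch under assumptions: the function name is hypothetical, and the input is taken to be the set of devices that the metadata being read marks as working. It is not how mdadm is implemented.

```python
def may_assemble_no_degraded(working_per_metadata, available):
    """Model of the rule: every device the metadata marks as working
    must actually be available, otherwise refuse to assemble."""
    return set(working_per_metadata) <= set(available)


if __name__ == "__main__":
    # RAID1 with drives A and B; A failed first, B kept running.
    # Only A is visible: its stale superblock still lists A and B as working.
    print(may_assemble_no_degraded({"A", "B"}, {"A"}))
    # Only B is visible: its superblock lists just B as working.
    print(may_assemble_no_degraded({"B"}, {"B"}))
```

Under this model the array is refused when only the stale drive is present and assembled when only the surviving drive is present, which matches the two cases of the RAID1 scenario quoted earlier.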
>
> NeilBrown
>
>
>
>> Thanks again for taking time and reviewing the proposal! And yes, next
>> time, I will put everything in the email.
>>
>> Alex.
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial