linux-raid.vger.kernel.org archive mirror
From: NeilBrown <neilb@suse.de>
To: Martin Wilck <mwilck@arcor.de>
Cc: linux-raid <linux-raid@vger.kernel.org>,
	Francis Moreau <francis.moro@gmail.com>
Subject: Re: RFC: incremental container assembly when sequence numbers don't match
Date: Mon, 21 Oct 2013 11:07:27 +1100
Message-ID: <20131021110727.363cdd02@notabene.brown>
In-Reply-To: <523CADFD.9050006@arcor.de>

On Fri, 20 Sep 2013 22:20:13 +0200 Martin Wilck <mwilck@arcor.de> wrote:

> Hi,
> 
> I have spent a few days thinking about the problem of incremental
> container assembly when disk sequence numbers (aka event counters) don't
> match, and how mdadm/mdmon should behave in various situations.
> Before I start coding on this, I'd like to get your opinion - I may be
> overlooking something important.
> 
> The scenario I look at is that sequence numbers don't match during
> incremental assembly. This can occur quite easily. A disk may have been
> missing the last time the array was assembled, and be added again. The
> last incremental assembly may have been interrupted before all disks
> were found, for whatever reason. Etc. The problems Francis reported
> lately all occur in situations of this type.
> 
> A) New disk has a lower seq number than the previously scanned ones:
>    The up-to-date meta data is the meta data previously parsed.
> 
>    For each subarray the new disk is a member in the meta data:
>      A.1) If the subarray is already running, add the new disk as a spare.

If the new disk has old metadata, then it might have failed at some point, so
we shouldn't add it as anything without good reason.
If the most recent metadata records that a device went missing, rather than
actually failed, then it might be justified to add it as a spare.  But in
general I'd prefer things were only added as spares if that was explicitly
requested or if the policy in mdadm.conf encourages it.
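
A minimal sketch of that rule, written in Python only for brevity (mdadm
itself is C). Every name in it (RecordedState, may_add_as_spare, and the
explicit_request, policy_allows_spare and trust_missing flags) is
hypothetical and does not correspond to any mdadm function or mdadm.conf
keyword. Treating a merely missing device as acceptable is modelled as an
opt-in, which is only one possible reading of "might be justified".

```python
from enum import Enum


class RecordedState(Enum):
    """How the most recent metadata describes the stale disk (hypothetical)."""
    MISSING = "missing"  # the device disappeared; no failure was recorded
    FAILED = "failed"    # the device was marked faulty


def may_add_as_spare(state, explicit_request=False,
                     policy_allows_spare=False, trust_missing=False):
    """Return True if a disk carrying old metadata may be added as a spare."""
    # An explicit request or an encouraging policy is always sufficient.
    if explicit_request or policy_allows_spare:
        return True
    # Optional, more permissive reading: a disk that merely went missing
    # (rather than failing) may be acceptable. Off by default.
    if trust_missing and state is RecordedState.MISSING:
        return True
    # Otherwise do not add the disk as anything.
    return False
```

With the defaults, a stale disk is never added automatically, whatever the
metadata says about it.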

>      A.2) check the subarray seqnum; if the subarray seqnum is equal
> between existing and new disks, the new disk can be added as "clean".
> (This requires implementing separate seqnums for every subarray, but
> that can be done quite easily, at least for DDF).
>      A.3) Otherwise, add the new disk as a spare.
> 
>    The added disk may be marked as "Missing" or "Faulty" in the meta
> data. That is already handled by existing code, AFAICS.
> 
> B) New disk has higher seq number than previously scanned ones.
>    The up-to-date meta data is on the new disk. Here it gets tricky.
> 
>    B.1) If mdmon isn't running for this container:
>      B.1.a) reread the meta data (load_container() will automatically
> choose the best meta data).
>      B.1.b) Discard previously made configurations
>      B.1.c) Reassemble the arrays, starting with the new disk. When
> re-adding the drive(s) with the older meta data, act as in A) above.
> 
>    B.2) If mdmon is already running for this container, it means at
> least one subarray is already running, too.
>      B.2.a) If the new disk belongs to an already running and active
> subarray, we have encountered a fatal error. mdadm should refuse to do
> anything with the new disk and emit an alert.
>      B.2.b) If the new disk belongs to an already running read-only
> subarray, and the subarray seqnum of the new disk is lower than that of
> the existing disks, we also have a fatal error - we don't know which
> data is more recent. Human intervention is necessary.
>      B.2.c) Both mdadm and mdmon need to update the meta data as
> described in B.1.a).
>      B.2.d) If the new disk belongs to an already running read-only
> subarray, and the subarray seqnum of the new disk is greater or equal to
> the subarray seqnum of the existing disk(s), it might be possible to add
> the new disk to the array as clean. If the seqnum isn't equal, recovery
> must be started on the previously existing disk(s). Currently the kernel
> doesn't allow adding a new disk as "clean" in any state except
> "inactive", so this special case will not be implemented any time soon.
> It's a general question whether or not mdadm should attempt to be
> "smart" in situations like this.
>      B.2.e) Subarrays that aren't running yet, and which the new disk is
> a member of, can be reassembled as described in A)
>      B.2.f) pre-existing disks that are marked missing or failed in the
> updated meta data must have their status changed. This may cause the
> already running array(s) to degrade or break, even if the new disk
> doesn't belong to them.
>      B.2.g) The status of all subarrays (consistent/initialized) is
> updated according to the new meta data.
> 
> Note that the really difficult cases B.2.a/b/d can't easily happen if
> the Incremental assembly is done without "-R", as it should be. So it
> may be reasonable to just quit with an error if any of these situations
> is encountered.
> 
> An important further question is where this logic should be implemented.
> This is independent of meta data type and thus most of it should be in
> the generic Incremental_container() code path.

Maybe in assemble_container_content?  But mdmon needs to know about some of it
too, of course.
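
To make the quoted cases concrete, here is a rough sketch of the dispatch
such a generic code path would have to perform, again in Python for brevity.
It is not mdadm code. Action, Subarray, classify and the state labels
"active", "readonly" and "stopped" are all hypothetical names, and the
sketch considers only one subarray that the new disk belongs to. Case B.2.d
(adding the disk as clean) is folded into UPDATE_RUNNING, since the proposal
says it will not be implemented soon.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Action(Enum):
    IN_SYNC = auto()         # sequence numbers match; normal assembly
    STALE_DISK = auto()      # case A: previously parsed metadata stays current
    REASSEMBLE = auto()      # case B.1: reread metadata, discard, reassemble
    FATAL = auto()           # cases B.2.a and B.2.b: refuse and alert
    UPDATE_RUNNING = auto()  # cases B.2.c to B.2.g: adopt the new metadata


@dataclass
class Subarray:
    state: str      # "active", "readonly" or "stopped" (hypothetical labels)
    new_seq: int    # subarray seqnum recorded on the new disk
    known_seq: int  # subarray seqnum recorded on the existing disks


def classify(new_seq: int, known_seq: int, mdmon_running: bool,
             subarray: Optional[Subarray] = None) -> Action:
    """Map a newly found disk onto the cases of the proposal."""
    if new_seq == known_seq:
        return Action.IN_SYNC
    if new_seq < known_seq:
        return Action.STALE_DISK
    # From here on the new disk carries the most recent metadata (case B).
    if not mdmon_running:
        return Action.REASSEMBLE
    if subarray is not None:
        # B.2.a: the disk belongs to a running, active subarray.
        if subarray.state == "active":
            return Action.FATAL
        # B.2.b: read-only subarray whose seqnum on the new disk is lower.
        if subarray.state == "readonly" and subarray.new_seq < subarray.known_seq:
            return Action.FATAL
    return Action.UPDATE_RUNNING
```

For example, classify(6, 5, True, Subarray("active", 3, 3)) yields
Action.FATAL, while the same call with mdmon_running=False yields
Action.REASSEMBLE.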

> 
> Feedback welcome.
> Best regards
> Martin

Sounds very sensible, but the devil is in the detail of course. :-)

Thanks,
NeilBrown

Thread overview: 5+ messages
2013-09-20 20:20 RFC: incremental container assembly when sequence numbers don't match Martin Wilck
2013-09-23  7:30 ` Francis Moreau
2013-09-23 20:30   ` Martin Wilck
2013-09-25 20:46 ` Martin Wilck
2013-10-21  0:07 ` NeilBrown [this message]
