* RFC: incremental container assembly when sequence numbers don't match
From: Martin Wilck @ 2013-09-20 20:20 UTC (permalink / raw)
To: linux-raid, NeilBrown, Francis Moreau
Hi,
I have spent a few days thinking about the problem of incremental
container assembly when disk sequence numbers (aka event counters) don't
match, and how mdadm/mdmon should behave in various situations.
Before I start coding on this, I'd like to get your opinion - I may be
overlooking something important.
The scenario I look at is that sequence numbers don't match during
incremental assembly. This can occur quite easily. A disk may have been
missing the last time the array was assembled, and be added again. The
last incremental assembly may have been interrupted before all disks
were found, for whatever reason. Etc. The problems Francis reported
lately all occur in situations of this type.
A) New disk has lower seq number than previously scanned ones:
The up-to-date meta data is the meta data previously parsed.
For each subarray that the new disk is a member of, according to the meta data:
A.1) If the subarray is already running, add the new disk as a spare.
A.2) check the subarray seqnum; if the subarray seqnum is equal
between existing and new disks, the new disk can be added as "clean".
(This requires implementing separate seqnums for every subarray, but
that can be done quite easily, at least for DDF).
A.3) Otherwise, add the new disk as a spare.
The added disk may be marked as "Missing" or "Faulty" in the meta
data. That is already handled by existing code, AFAICS.
B) New disk has higher seq number than previously scanned ones.
The up-to-date meta data is on the new disk. Here it gets tricky.
B.1) If mdmon isn't running for this container:
B.1.a) reread the meta data (load_container() will automatically
choose the best meta data).
B.1.b) Discard previously made configurations
B.1.c) Reassemble the arrays, starting with the new disk. When
re-adding the drive(s) with the older meta data, act as in A) above.
B.2) If mdmon is already running for this container, it means at
least one subarray is already running, too.
B.2.a) If the new disk belongs to an already running and active
subarray, we have encountered a fatal error. mdadm should refuse to do
anything with the new disk and emit an alert.
B.2.b) If the new disk belongs to an already running read-only
subarray, and the subarray seqnum of the new disk is lower than that of
the existing disks, we also have a fatal error - we don't know which
data is more recent. Human intervention is necessary.
B.2.c) Both mdadm and mdmon need to update the meta data as
described in B.1.a).
B.2.d) If the new disk belongs to an already running read-only
subarray, and the subarray seqnum of the new disk is greater or equal to
the subarray seqnum of the existing disk(s), it might be possible to add
the new disk to the array as clean. If the seqnum isn't equal, recovery
must be started on the previously existing disk(s). Currently the kernel
doesn't allow adding a new disk as "clean" in any state except
"inactive", so this special case will not be implemented any time soon.
It's a general question whether or not mdadm should attempt to be
"smart" in situations like this.
B.2.e) Subarrays that aren't running yet, and which the new disk is
a member of, can be reassembled as described in A)
B.2.f) pre-existing disks that are marked missing or failed in the
updated meta data must have their status changed. This may cause the
already running array(s) to degrade or break, even if the new disk
doesn't belong to them.
B.2.g) The status of all subarrays (consistent/initialized) is
updated according to the new meta data.
Note that the really difficult cases B.2.a/b/d can't easily happen if
the Incremental assembly is done without "-R", as it should be. So it
may be reasonable to just quit with an error if any of these situations
is encountered.
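To make the intended logic easier to follow, here is a rough pseudocode
sketch of the decision tree above. All helper names are invented for this
RFC only; the real code would have to go through the usual superswitch and
Incremental helpers:

	/* sketch only -- none of these helpers exist in mdadm today */
	static void handle_new_member(struct container *cont, struct disk *disk)
	{
		if (seqnum(disk) <= seqnum(cont)) {
			/* case A: the metadata we already parsed is current */
			struct subarray *sub;

			for (sub = member_subarrays(cont, disk); sub; sub = sub->next) {
				if (is_running(sub))
					add_as_spare(sub, disk);	/* A.1 */
				else if (subarray_seqnum(disk, sub) ==
					 subarray_seqnum(cont, sub))
					add_as_clean(sub, disk);	/* A.2 */
				else
					add_as_spare(sub, disk);	/* A.3 */
			}
		} else if (!mdmon_running_for(cont)) {
			/* case B.1: new disk has the best metadata, nothing active yet */
			reload_metadata(cont);		/* B.1.a: load_container() picks best */
			discard_partial_setup(cont);	/* B.1.b */
			reassemble_from(cont, disk);	/* B.1.c, older disks as in A */
		} else if (conflicts_with_active_subarray(cont, disk)) {
			alert_and_refuse(disk);		/* B.2.a/b: needs a human */
		} else {
			update_metadata(cont);			/* B.2.c, mdadm and mdmon */
			assemble_idle_subarrays(cont, disk);	/* B.2.e, as in A */
			apply_missing_failed_states(cont);	/* B.2.f */
			update_subarray_states(cont);		/* B.2.g */
		}
	}

(B.2.d is left out on purpose, for the reasons given above.)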
An important further question is where this logic should be implemented.
This is independent of meta data type and thus most of it should be in
the generic Incremental_container() code path.
Feedback welcome.
Best regards
Martin
* Re: RFC: incremental container assembly when sequence numbers don't match
From: Francis Moreau @ 2013-09-23 7:30 UTC (permalink / raw)
To: Martin Wilck; +Cc: linux-raid, NeilBrown
Hello Martin
On Fri, Sep 20, 2013 at 10:20 PM, Martin Wilck <mwilck@arcor.de> wrote:
> Hi,
>
> I have spent a few days thinking about the problem of incremental
> container assembly when disk sequence numbers (aka event counters) don't
> match, and how mdadm/mdmon should behave in various situations.
> Before I start coding on this, I'd like to get your opinion - I may be
> overlooking something important.
I was really surprised to see that this functionality needs to be
implemented, since in my understanding it's the most important one, at
least for RAID1.
Isn't this already implemented for IMSM? If so, can't we use the same strategy?
If not, doesn't dmraid support it? If so, can't we use the same strategy?
Thanks
--
Francis
* Re: RFC: incremental container assembly when sequence numbers don't match
From: Martin Wilck @ 2013-09-23 20:30 UTC (permalink / raw)
To: Francis Moreau; +Cc: linux-raid, NeilBrown
On 09/23/2013 09:30 AM, Francis Moreau wrote:
> Hello Martin
>
> On Fri, Sep 20, 2013 at 10:20 PM, Martin Wilck <mwilck@arcor.de> wrote:
>> Hi,
>>
>> I have spent a few days thinking about the problem of incremental
>> container assembly when disk sequence numbers (aka event counters) don't
>> match, and how mdadm/mdmon should behave in various situations.
>> Before I start coding on this, I'd like to get your opinion - I may be
>> overlooking something important.
>
> I was really surprised to see that this functionality needs to be
> implemented, since in my understanding it's the most important one, at
> least for RAID1.
Please don't confuse this with the problem you are currently seeing,
which looks more like a bug we have yet to track down.
I agree this is important, that's why I wrote this RFC, but it's not the
most important functionality. AFAICS, the cases that won't work
optimally with the current code are pretty rare corner cases. They
should be fixed but it isn't too urgent. The "normal" case is that after
a failure, you add a new disk (possibly the same one again, as you did
in your recent testing), and auto-recovery is started.
> Isn't this already implemented for IMSM? If so, can't we use the same strategy?
I don't know the IMSM code well enough to tell. I had a look at it and
didn't find code treating this situation, but it's a lot of code, so I
may be missing something. I will set up test cases soon.
> If not, doesn't dmraid support it? If so, can't we use the same strategy?
I think dmraid doesn't have this problem because it doesn't do
incremental assembly the way mdadm does. Rather, after udev has settled,
dmraid scans its devices; this is similar to running "mdadm -As" at that
stage. If you do this, you don't have the problem that an already
running array may need to change state because a new disk with more
recent meta data is added. Rather, you choose the "best" meta data at
assembly time. It would be possible to change the scanning behavior for
mdadm by changing the udev rules such that normal assembly is done after
udev has settled, rather than incremental on-the-fly assembly.
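To make the contrast concrete, this is roughly the difference (the exact
rule file and match keys vary between distributions, so treat this as an
illustration, not a patch):

	# today: per-device incremental assembly from a udev rule,
	# along the lines of the shipped md rules:
	SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
		RUN+="/sbin/mdadm -I $env{DEVNAME}"

	# alternative: drop the per-device rule and assemble once, later in boot:
	#	udevadm settle
	#	mdadm --assemble --scan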
Martin
* Re: RFC: incremental container assembly when sequence numbers don't match
From: Martin Wilck @ 2013-09-25 20:46 UTC (permalink / raw)
To: linux-raid
I just submitted the new unit test tests/10ddf-incremental-wrong-order
that illustrates the tricky scenario described in this RFC, case B.2.e).
Martin
* Re: RFC: incremental container assembly when sequence numbers don't match
From: NeilBrown @ 2013-10-21 0:07 UTC (permalink / raw)
To: Martin Wilck; +Cc: linux-raid, Francis Moreau
On Fri, 20 Sep 2013 22:20:13 +0200 Martin Wilck <mwilck@arcor.de> wrote:
> Hi,
>
> I have spent a few days thinking about the problem of incremental
> container assembly when disk sequence numbers (aka event counters) don't
> match, and how mdadm/mdmon should behave in various situations.
> Before I start coding on this, I'd like to get your opinion - I may be
> overlooking something important.
>
> The scenario I look at is that sequence numbers don't match during
> incremental assembly. This can occur quite easily. A disk may have been
> missing the last time the array was assembled, and be added again. The
> last incremental assembly may have been interrupted before all disks
> were found, for whatever reason. Etc. The problems Francis reported
> lately all occur in situations of this type.
>
> A) New disk has lower seq number than previously scanned ones:
> The up-to-date meta data is the meta data previously parsed.
>
> For each subarray that the new disk is a member of, according to the meta data:
> A.1) If the subarray is already running, add the new disk as a spare.
If the new disk has old metadata, then it might have failed at some point, so
we shouldn't add it as anything without good reason.
If the most recent metadata records that a device went missing, rather than
actually failed, then it might be justified to add it as a spare. But in
general I'd prefer things were only added as spares if that was explicitly
requested or if the policy in mdadm.conf encourages it.
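For example, something along these lines in mdadm.conf (just a sketch;
mdadm.conf(5) has the exact keywords and semantics):

	# allow devices found under any path to be used as spares
	POLICY domain=default path=* action=spare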
> A.2) check the subarray seqnum; if the subarray seqnum is equal
> between existing and new disks, the new disk can be added as "clean".
> (This requires implementing separate seqnums for every subarray, but
> that can be done quite easily, at least for DDF).
> A.3) Otherwise, add the new disk as a spare.
>
> The added disk may be marked as "Missing" or "Faulty" in the meta
> data. That is already handled by existing code, AFAICS.
>
> B) New disk has higher seq number than previously scanned ones.
> The up-to-date meta data is on the new disk. Here it gets tricky.
>
> B.1) If mdmon isn't running for this container:
> B.1.a) reread the meta data (load_container() will automatically
> choose the best meta data).
> B.1.b) Discard previously made configurations
> B.1.c) Reassemble the arrays, starting with the new disk. When
> re-adding the drive(s) with the older meta data, act as in A) above.
>
> B.2) If mdmon is already running for this container, it means at
> least one subarray is already running, too.
> B.2.a) If the new disk belongs to an already running and active
> subarray, we have encountered a fatal error. mdadm should refuse to do
> anything with the new disk and emit an alert.
> B.2.b) If the new disk belongs to an already running read-only
> subarray, and the subarray seqnum of the new disk is lower than that of
> the existing disks, we also have a fatal error - we don't know which
> data is more recent. Human intervention is necessary.
> B.2.c) Both mdadm and mdmon need to update the meta data as
> described in B.1.a).
> B.2.d) If the new disk belongs to an already running read-only
> subarray, and the subarray seqnum of the new disk is greater or equal to
> the subarray seqnum of the existing disk(s), it might be possible to add
> the new disk to the array as clean. If the seqnum isn't equal, recovery
> must be started on the previously existing disk(s). Currently the kernel
> doesn't allow adding a new disk as "clean" in any state except
> "inactive", so this special case will not be implemented any time soon.
> It's a general question whether or not mdadm should attempt to be
> "smart" in situations like this.
> B.2.e) Subarrays that aren't running yet, and which the new disk is
> a member of, can be reassembled as described in A)
> B.2.f) pre-existing disks that are marked missing or failed in the
> updated meta data must have their status changed. This may cause the
> already running array(s) to degrade or break, even if the new disk
> doesn't belong to them.
> B.2.g) The status of all subarrays (consistent/initialized) is
> updated according to the new meta data.
>
> Note that the really difficult cases B.2.a/b/d can't easily happen if
> the Incremental assembly is done without "-R", as it should be. So it
> may be reasonable to just quit with an error if any of these situations
> is encountered.
>
> An important further question is where this logic should be implemented.
> This is independent of meta data type and thus most of it should be in
> the generic Incremental_container() code path.
Maybe in assemble_container_content()? But mdmon needs to know about some of it
too, of course.
>
> Feedback welcome.
> Best regards
> Martin
Sounds very sensible, but the devil is in the detail of course. :-)
Thanks,
NeilBrown