From: NeilBrown
Subject: Re: RFC: incremental container assembly when sequence numbers don't match
Date: Mon, 21 Oct 2013 11:07:27 +1100
Message-ID: <20131021110727.363cdd02@notabene.brown>
References: <523CADFD.9050006@arcor.de>
In-Reply-To: <523CADFD.9050006@arcor.de>
Sender: linux-raid-owner@vger.kernel.org
To: Martin Wilck
Cc: linux-raid, Francis Moreau
List-Id: linux-raid.ids

On Fri, 20 Sep 2013 22:20:13 +0200 Martin Wilck wrote:

> Hi,
>
> I have spent a few days thinking about the problem of incremental
> container assembly when disk sequence numbers (aka event counters) don't
> match, and how mdadm/mdmon should behave in various situations.
> Before I start coding on this, I'd like to get your opinion - I may be
> overlooking something important.
>
> The scenario I look at is that sequence numbers don't match during
> incremental assembly. This can occur quite easily. A disk may have been
> missing the last time the array was assembled, and be added again. The
> last incremental assembly may have been interrupted before all disks
> were found, for whatever reason. Etc. The problems Francis reported
> lately all occur in situations of this type.
>
> A) New disk has lower seq number than previously scanned ones:
>    The up-to-date meta data is the meta data previously parsed.
>
>    For each subarray that the new disk is a member of in the meta data:
>    A.1) If the subarray is already running, add the new disk as a spare.

If the new disk has old metadata, then it might have failed at some point,
so we shouldn't add it as anything without good reason.
If the most recent metadata records that a device went missing, rather than
actually failed, then it might be justified to add it as a spare. But in
general I'd prefer things were only added as spares if that was explicitly
requested or if the policy in mdadm.conf encourages it.

>    A.2) check the subarray seqnum; if the subarray seqnum is equal
> between existing and new disks, the new disk can be added as "clean".
> (This requires implementing separate seqnums for every subarray, but
> that can be done quite easily, at least for DDF).
>    A.3) Otherwise, add the new disk as a spare.
>
>    The added disk may be marked as "Missing" or "Faulty" in the meta
> data. That will be handled by existing code already AFAICS.
>
> B) New disk has higher seq number than previously scanned ones.
>    The up-to-date meta data is on the new disk. Here it gets tricky.
>
>    B.1) If mdmon isn't running for this container:
>      B.1.a) reread the meta data (load_container() will automatically
> choose the best meta data).
>      B.1.b) Discard previously made configurations
>      B.1.c) Reassemble the arrays, starting with the new disk. When
> re-adding the drive(s) with the older meta data, act as in A) above.
>
>    B.2) If mdmon is already running for this container, it means at
> least one subarray is already running, too.
>      B.2.a) If the new disk belongs to an already running and active
> subarray, we have encountered a fatal error. mdadm should refuse to do
> anything with the new disk and emit an alert.
>      B.2.b) If the new disk belongs to an already running read-only
> subarray, and the subarray seqnum of the new disk is lower than that of
> the existing disks, we also have a fatal error - we don't know which
> data is more recent. Human intervention is necessary.
>      B.2.c) Both mdadm and mdmon need to update the meta data as
> described in B.1.a).
>      B.2.d) If the new disk belongs to an already running read-only
> subarray, and the subarray seqnum of the new disk is greater than or
> equal to the subarray seqnum of the existing disk(s), it might be
> possible to add the new disk to the array as clean. If the seqnum isn't
> equal, recovery must be started on the previously existing disk(s).
> Currently the kernel doesn't allow adding a new disk as "clean" in any
> state except "inactive", so this special case will not be implemented
> any time soon. It's a general question whether or not mdadm should
> attempt to be "smart" in situations like this.
>      B.2.e) Subarrays that aren't running yet, and which the new disk is
> a member of, can be reassembled as described in A)
>      B.2.f) pre-existing disks that are marked missing or failed in the
> updated meta data must have their status changed. This may cause the
> already running array(s) to degrade or break, even if the new disk
> doesn't belong to them.
>      B.2.g) The status of all subarrays (consistent/initialized) is
> updated according to the new meta data.
>
> Note that the really difficult cases B.2.a/b/d can't easily happen if
> the Incremental assembly is done without "-R", as it should be. So it
> may be reasonable to just quit with an error if any of these situations
> is encountered.
>
> An important further question is where this logic should be implemented.
> This is independent of meta data type and thus most of it should be in
> the generic Incremental_container() code path.

Maybe in assemble_container_content? But mdmon needs to know about some of
it too of course.

>
> Feedback welcome.
> Best regards
> Martin

Sounds very sensible, but the devil is in the detail of course.
:-)

Thanks,
NeilBrown
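
For reference, the case analysis discussed in this thread can be sketched
as a small decision function. This is only an illustrative sketch in
Python, not mdadm or mdmon code: the names State, Action, decide and the
spare_ok flag are hypothetical. Two points are assumptions rather than
anything stated in the thread: the equal-seqnum branch (not covered by the
proposal), and reporting B.2.d as unsupported (the proposal only says that
case will not be implemented soon because the kernel cannot add a disk as
"clean" outside the "inactive" state). The spare_ok flag stands in for
"explicitly requested, or encouraged by policy in mdadm.conf".

```python
from enum import Enum, auto


class State(Enum):
    """State of one subarray that the new disk is a member of."""
    INACTIVE = auto()
    READ_ONLY = auto()
    ACTIVE = auto()


class Action(Enum):
    """Hypothetical outcomes; none of these are real mdadm identifiers."""
    NORMAL = auto()                 # seqnums match: outside the proposal
    ADD_CLEAN = auto()              # A.2
    ADD_SPARE = auto()              # A.1 / A.3, only when policy allows
    LEAVE_OUT = auto()              # old metadata and no reason to add
    REREAD_AND_REASSEMBLE = auto()  # B.1.a-c
    REASSEMBLE_SUBARRAY = auto()    # B.2.e (after the B.2.c update)
    FATAL = auto()                  # B.2.a / B.2.b
    UNSUPPORTED = auto()            # B.2.d


def decide(new_seq, known_seq, new_sub_seq, known_sub_seq,
           state, mdmon_running, spare_ok=False):
    """Pick an action for one subarray when a new disk shows up."""
    if new_seq == known_seq:
        # Assumption: matching container seqnums take the normal path.
        return Action.NORMAL

    if new_seq < known_seq:
        # Case A: the metadata already parsed is the up-to-date copy.
        if state is State.INACTIVE and new_sub_seq == known_sub_seq:
            return Action.ADD_CLEAN
        # A.1 / A.3, with the caveat from the reply: the disk may have
        # failed, so only add it as a spare if that is wanted.
        return Action.ADD_SPARE if spare_ok else Action.LEAVE_OUT

    # Case B: the new disk carries the up-to-date metadata.
    if not mdmon_running:
        return Action.REREAD_AND_REASSEMBLE
    if state is State.ACTIVE:
        return Action.FATAL
    if state is State.READ_ONLY:
        if new_sub_seq < known_sub_seq:
            return Action.FATAL
        return Action.UNSUPPORTED
    return Action.REASSEMBLE_SUBARRAY
```

For example, decide(5, 7, 3, 3, State.ACTIVE, True) leaves the stale disk
out unless spare_ok is set, while decide(9, 7, 3, 3, State.ACTIVE, True)
is the fatal case B.2.a.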