From: Roberto Spadim
Subject: Re: Split-Brain Protection for MD arrays
Date: Fri, 16 Dec 2011 11:46:50 -0200
In-Reply-To: <20111216064003.18a7ab4f@notabene.brown>
To: NeilBrown
Cc: Alexander Lyakas, linux-raid

just some points that we shouldn't forget... thinking like an end user
of mdadm, not as a developer...

a disk failure happens roughly once in 2 years of heavy use on a
desktop SATA disk. a complex structure just to save the 1 minute of
mdadm --remove, mdadm --add shouldn't be needed; end users can accept
that... it's just 1 minute out of 2 years...

2 years = 730 days = 17520 hours = 1,051,200 minutes; in other words,
1 minute of downtime is roughly 1/1,000,000 = 0.0001% stopped, i.e.
99.9999% online time. if we consider turning the server off, adding a
new disk and removing the old one, say 10 minutes, that's 0.001%
downtime = 99.999% online time. that's well accepted for desktops and
servers...

for raid1 and linear - I don't see the need for really complex logic
recording which blocks aren't OK; just a counter telling which disk
has the more recent data would be welcome.

for raid10, raid5 and raid6 - OK, we can allow block-specific
tracking, since we could treat a bad disk as many bad blocks plus the
many good blocks (on the good disk).

2011/12/15 NeilBrown :
> On Thu, 15 Dec 2011 16:29:12 +0200 Alexander Lyakas
> wrote:
>
>> Neil,
>> thanks for the review, and for detailed answers to my questions.
>>
>> > When we mark a device 'failed' it should stay marked as 'failed'. When the
>> > array is optimal again it is safe to convert all 'failed' slots to
>> > 'spare/missing' but not before.
>> I did not understand all that reasoning. When you say "slot", you mean
>> index in the dev_roles[] array, correct? If yes, I don't see what
>> importance the index has, compared to the value of the entry itself
>> (which is "role" in your terminology).
>> Currently, 0xFFFE means both "failed" and "missing", and that makes
>> perfect sense to me. Basically this means that this entry of
>> dev_roles[] is unused. When a device fails, it is kicked out of the
>> array, so its entry in dev_roles[] becomes available.
>> (You once mentioned that for older arrays, their dev_roles[] index was
>> also their role; perhaps you are concerned about those too.)
>> In any case, I will be watching for changes in this area, if you
>> decide to make them (although I think this might break backwards
>> compatibility, unless a new version of the superblock is used).
>
> Maybe... as I said, "confusing" is a relevant word in this area.
>
>> > If you have a working array and you initiate a write of a data block and the
>> > parity block, and if one of those writes fails, then you no longer have a
>> > working array. Some data blocks in that stripe cannot be recovered.
>> > So we need to make sure the admin knows the array is dead and doesn't just
>> > re-assemble and think everything is OK.
>> I see your point. I don't know what's better: to know the "last known
>> good" configuration, or to know that the array has failed. I guess I
>> am just used to the former.
>
> Possibly an 'array-has-failed' flag in the metadata would allow us to keep
> the last known-good config. But as it isn't any good any more I don't really
> see the point.
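just to picture what that flag could look like - a rough sketch only,
assuming a free feature_map bit; MD_FEATURE_ARRAY_FAILED and the struct
excerpt below are hypothetical, not the real mdp_superblock_1 layout:

    /* hypothetical: a persistent "array has failed" marker.
     * the feature bit and struct excerpt are invented for this
     * illustration; the real v1.x superblock is defined in md_p.h. */
    #include <stdint.h>

    #define MD_FEATURE_ARRAY_FAILED (1U << 9)  /* assumed free bit */

    struct sb_excerpt {
            uint32_t feature_map;              /* feature flag bits */
            /* ... rest of the superblock fields ... */
    };

    /* assembly would refuse to proceed if any member carries the
     * flag, so the admin has to force past a dead array on purpose */
    static int array_marked_failed(const struct sb_excerpt *sb)
    {
            return (sb->feature_map & MD_FEATURE_ARRAY_FAILED) != 0;
    }

the point being: the flag survives reboots, so a re-assemble can't
silently pretend the array is healthy again.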
>
>> > I think to resolve this issue we need 2 things.
>> >
>> > 1/ when assembling an array, if any device thinks that the 'chosen' device
>> >    has failed, then don't trust that device.
>> I think that if any device thinks that "chosen" has failed, then
>> either it has a more recent superblock, and then this device should be
>> "chosen" and not the other. Or, the "chosen" device's superblock is
>> the one that counts, and then it doesn't matter what the current device
>> thinks, because the array will be assembled according to the "chosen"
>> superblock.
>
> This is exactly what the current code does, and it allows you to assemble an
> array after a split-brain experience. That is bad. Checking what other
> devices think of the chosen device lets you detect the effect of a
> split-brain.
>
>> > 2/ Don't erase 'failed' status from dev_roles[] until the array is
>> >    optimal.
>>
>> Neil, I think both these points don't resolve the following simple
>> scenario: RAID1 with drives A and B. Drive A fails, the array continues
>> to operate on drive B. After reboot, only drive A is accessible. If we
>> go ahead with assembly, we will see stale data. If, after reboot, we
>> instead see only drive B, then (since A is "faulty" in B's superblock)
>> we can go ahead and assemble. The change I suggested will abort in the
>> first case, but will assemble in the second case.
>
> Using --no-degraded will do what you want in both cases. So no code change
> is needed!
>
>> But obviously, you know better what MD users expect and want.
>
> Don't bet on it.
> So far I have one vote - from you - that --no-degraded should be the default
> (I think that is what you are saying). If others agree I'll certainly
> consider it more.
>
> Note that "--no-degraded" doesn't exactly mean "never assemble a degraded
> array". It means "don't assemble an array more degraded than it was the last
> time it was working", i.e. require that all devices that are working
> according to the metadata are actually available.
>
> NeilBrown
>
>> Thanks again for taking time and reviewing the proposal! And yes, next
>> time I will put everything in the email.
>>
>> Alex.

--
Roberto Spadim
Spadim Technology / SPAEmpresarial