From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michael Tokarev
Subject: Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc]
Date: Tue, 04 Jan 2005 14:57:57 +0300
Message-ID: <41DA84C5.3070403@tls.msk.ru>
References: <200501030916.j039Gqe23568@inv.it.uc3m.es> <200501031846.42950.maarten@ultratux.net> <200501032052.21459.maarten@ultratux.net> <16857.55609.534526.297577@cse.unsw.edu.au> <16857.64086.362458.177296@cse.unsw.edu.au>
Mime-Version: 1.0
Content-Type: text/plain; charset=KOI8-R; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To:
Sender: linux-raid-owner@vger.kernel.org
To: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

Peter T. Breuer wrote:
> Neil Brown wrote:
[]
>>If there is a system crash before correct, consistent data is written,
>>then on restart, disk B will not be read at all until disk A as been
>
> Why do you think so? I know of no mechanism in RAID that records to
> which of the two disks paired data has been written and to which it has
> not!
>
> Please clarify - this is important. If you are thinking of the "event
> count" that is stamped on the superblocks, that is only updated from
> time to time as far as I know! Can you please specify (for my
> curiousity) exactly when it is updated? That would be useful to know.

Yes, this is still the darkest corner of the whole raid stuff for me.
I just looked at the code again and re-read it several times, but the
code is a bit... large to understand in a relatively short time.

This very question has bothered me for quite some time now. How does
the md code "know" which drive has the "more recent" data on it after a
system crash (power loss, whatever) that hits when one drive has
completed the write but another has not? The "event counter" isn't
updated on every write (that would be very expensive in both time and
disk health -- too much seeking and too many writes to the single block
where the superblock is located).
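
To make the problem concrete, here is a toy sketch in Python. It is not
md's code and not its on-disk layout: the Mirror class, its "events"
field and the freshest() helper are all made up for illustration. It
only assumes what is said above, namely that ordinary data writes do
not bump the counter.

```python
from dataclasses import dataclass, field


@dataclass
class Mirror:
    """Toy stand-in for one RAID1 member; not md's real on-disk layout."""
    name: str
    events: int = 0  # hypothetical superblock event counter
    blocks: dict = field(default_factory=dict)

    def write(self, block_no, data):
        # In this model an ordinary data write never touches the counter.
        self.blocks[block_no] = data


def freshest(mirrors):
    """Return the names of all mirrors that carry the highest event count."""
    top = max(m.events for m in mirrors)
    return [m.name for m in mirrors if m.events == top]


a, b = Mirror("A"), Mirror("B")
for m in (a, b):
    m.write(7, "old")

a.write(7, "new")  # drive A completes the write...
# ...and power is lost before drive B receives it.

candidates = freshest([a, b])
print(candidates)                # both drives tie on the counter
print(a.blocks[7], b.blocks[7])  # yet their contents differ
```

In this model the counter alone cannot say which copy of block 7 is the
newer one, which is exactly the situation I am asking about.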

As far as I can see (and I'm just thinking about how it could be done),
the only possible solution in this case is to choose a "random" drive
and declare it "up-to-date" -- it will not necessarily be really
up-to-date. Or, maybe, write to the "first" drive first and to the
"second" next, and assume the first drive has the data written before
the second (no guarantee here because of reordering, differences in
drive speed etc, but it is -- sort of -- a valid assumption).

Speaking of a reasonable filesystem (journalling isn't relevant here;
the key word is "reasonable", that is, a system that makes complex
operations atomic) and of filesystem metadata, choosing a "random"
drive as up-to-date makes some sense: at least the metadata will be
consistent. It will not necessarily be up to date: for example, it is
still possible to lose a mail file which has been acknowledged by the
filesystem AND by the smtp server, because choosing the "wrong" (not
recent) drive "rolls back" that file operation. But it will still be
consistent (I'm not talking about data consistency and integrity,
that's another long story).

Or, maybe it's better to ask the question slightly (?) differently:
recalling "write barriers" etc and raid1 (for simplicity), will the
raid code acknowledge a write only after ALL drives have been written
to? And thus, given a reasonable filesystem (again), will a filesystem
operation (at least on metadata) succeed ONLY after the md layer
reports that ALL disks have the data written? (This way, it really
makes no difference which drive - fresh or not - is considered up to
date after a poweroff in the middle of some write, *at least* for
filesystem metadata, and for applications that implement the "commit"
concept as needed to correctly implement "reasonable" metadata
operations.)

How does it all fit together? Which drive will be declared "fresh"?
What about several (>2) drives in a raid1 array?
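
The "acknowledge only after ALL drives are written" idea can be
sketched as another toy model. Again, this is an assumption about how
it could work, not a description of what md really does; ToyRaid1, its
crash_after parameter and holds_all_acked() are hypothetical names. The
example deliberately uses two different blocks, so it says nothing
about overwriting an already acknowledged block.

```python
class ToyRaid1:
    """Hypothetical mirror set that acknowledges a write only after every
    member has stored it. This models the question, not md's actual code."""

    def __init__(self, n_members):
        self.members = [dict() for _ in range(n_members)]
        self.acked = {}  # what the filesystem has been told is on disk

    def write(self, block_no, data, crash_after=None):
        """Write to each member in turn. crash_after=k simulates a power
        loss once k members hold the data. Returns True if acknowledged."""
        for i, member in enumerate(self.members):
            if crash_after is not None and i >= crash_after:
                return False  # power lost: no acknowledgement is sent
            member[block_no] = data
        self.acked[block_no] = data
        return True


def holds_all_acked(raid, member):
    """True if this member contains every acknowledged block."""
    return all(member.get(k) == v for k, v in raid.acked.items())


raid = ToyRaid1(3)
ok1 = raid.write(1, "committed")                 # reaches all three members
ok2 = raid.write(2, "in flight", crash_after=1)  # only member 0 gets it

survivors = [holds_all_acked(raid, m) for m in raid.members]
in_flight = [m.get(2) for m in raid.members]
print(ok1, ok2)
print(survivors)
print(in_flight)
```

Under that assumption every member holds all acknowledged blocks, so
any of them could be picked as "fresh" without losing a committed
operation; only the unacknowledged block 2 differs between members.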

What about data written without a concept of "commits": if the "wrong"
drive is chosen, will it contain some old data, while another drive
contained the new data but was declared "non fresh" at reconstruction?

And speaking of the previous question, is there any difference here
between an md device and a single disk, which also does various write
reordering and stuff like that? I mean, does the md layer increase the
probability of seeing old data after a reboot caused by a power loss
(for example) if an app (or whatever) was writing some new data during
the power loss (or even when the filesystem had reported the write as
complete)?

A lot of questions... but I think it's really worth understanding how
it all works.

Thanks.

/mjt