From mboxrd@z Thu Jan 1 00:00:00 1970
From: Paul Clements
Subject: Re: [PATCH 1/2] md bitmap bug fixes
Date: Mon, 21 Mar 2005 11:07:06 -0500
Message-ID: <423EF12A.4030207@steeleye.com>
References: <422F7621.8090602@steeleye.com>
 <16949.5768.392061.95882@cse.unsw.edu.au>
 <20050314094454.GK3858@marowsky-bree.de>
 <16949.26113.68948.938529@cse.unsw.edu.au>
 <20050314112403.GT3858@marowsky-bree.de>
 <16950.5692.594941.130741@cse.unsw.edu.au>
 <20050318103326.GA18819@marowsky-bree.de>
 <6ivqg2-qsn.ln1@news.it.uc3m.es>
 <20050318134255.GS18819@marowsky-bree.de>
 <7e6rg2-pj1.ln1@news.it.uc3m.es>
 <423B09EF.8070708@steeleye.com>
 <23krg2-4rr.ln1@news.it.uc3m.es>
 <423B2F7C.3030907@steeleye.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
In-Reply-To:
Sender: linux-raid-owner@vger.kernel.org
To: "Peter T. Breuer"
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

Peter T. Breuer wrote:
> Paul Clements wrote:
>
>>Peter T. Breuer wrote:
>>
>>>I don't see that this solves anything. If you had both sides going at
>>>once, receiving different writes, then you are sc&**ed, and no
>>>resolution of bitmaps will help you, since both sides have received
>>>different (legitimate) data. It doesn't seem relevant to me to consider
>>
>>You're forgetting that journalling filesystems and databases have to
>>replay their journals or transaction logs when they start up.

All I'm saying is that in a split-brain scenario, typical cluster
frameworks will make two (or more) systems active at the same time. This
is not necessarily fatal, because as you pointed out, if only one of
those systems (let's call it system A) is really available to the
outside world then you can usually simply trust the data on A and use it
to sync over the other copies. But, if system B brought up a database or
journalling FS on its copy of the data, then there were writes to that
disk that have to be synced over.
You can't simply use the bitmap on system A; you have to combine the
bitmaps from A and B (or else do a full resync).

>>>What about when A comes back up? We then get a
>>>
>>>           .--------------.
>>>  system A |     system B |
>>>    nbd ---'     [raid1]  |
>>>     |            /   \   |
>>>  [disk]   [disk]  [nbd]-'
>>>situation, and a resync is done (skipping clean sectors).
>>
>>You're forgetting that there may be some data (uncommitted data) that
>>didn't reach B that is on A's disk (or even vice versa).
>
>
> You are saying that the journal on A (presumably not raided itself?) is
> waiting to play some data into its own disk as soon as we have finished
> resyncing it from B? I don't think that would be a good idea at all.

No, I'm simply saying this: when you fail over from system A to system B
(say you lost the network or system A died), there is some amount of
data that could be out of sync (because raid1 submits writes to all
disks simultaneously, so a write in flight at failover may have reached
one disk and not the other). When you take over using the data on system
B, you're presumably going to want to (at some point) get A back to a
state where it has the latest data (in case B later fails, or in case A
is a better system and you want to make it active instead of B). To do
that, you can't simply take the bitmap from B and sync back to A. You've
got to look at the old bitmap on A and combine it with B's bitmap (or
you've got to do a full resync). Until you've done that, the data on A
is worthless.
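
To make "combine the bitmaps" concrete, here is a rough sketch in Python. It is not the md driver's code or its on-disk bitmap format. It assumes each bitmap is a plain byte string with one bit per chunk, set when that chunk may be out of sync. The names combine_bitmaps and dirty_chunks and the example values are made up for illustration. The idea is only that the set of chunks to resync back to A is the union (bitwise OR) of A's old bitmap and B's bitmap.

```python
def combine_bitmaps(bitmap_a: bytes, bitmap_b: bytes) -> bytes:
    """Bitwise OR of two equal-length dirty bitmaps (assumed layout)."""
    if len(bitmap_a) != len(bitmap_b):
        raise ValueError("bitmaps must cover the same number of chunks")
    return bytes(a | b for a, b in zip(bitmap_a, bitmap_b))


def dirty_chunks(bitmap: bytes) -> list:
    """Return the chunk numbers whose bit is set.

    Bit i of byte n stands for chunk n * 8 + i (an assumption of this sketch).
    """
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(bitmap)
        for bit in range(8)
        if byte & (1 << bit)
    ]


# Hypothetical state at the time A rejoins, 16 chunks per bitmap.
old_a = bytes([0b00000110, 0b00000000])  # chunks 1, 2: in flight on A at failover
new_b = bytes([0b00010100, 0b10000000])  # chunks 2, 4, 15: written on B since then

to_resync = dirty_chunks(combine_bitmaps(old_a, new_b))
print(to_resync)  # [1, 2, 4, 15]
```

Syncing from B's bitmap alone would cover chunks 2, 4 and 15 and leave chunk 1 on A holding data that B never saw, which is the case I'm worried about.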