From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mr1.dcs.gla.ac.uk (mr1.dcs.gla.ac.uk [130.209.249.184])
    (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
    (No client certificate requested)
    by mail.linbit.com (LINBIT Mail Daemon) with ESMTP id D11C52D9EAB3
    for ; Thu, 7 Dec 2006 21:25:27 +0100 (CET)
Received: from paraoa.dcs.gla.ac.uk ([130.209.253.109]:42917)
    by mr1.dcs.gla.ac.uk with esmtpa (Exim 4.42) id 1GsPo6-000321-HJ
    for drbd-dev@lists.linbit.com; Thu, 07 Dec 2006 20:25:26 +0000
Message-ID: <457878BE.6040305@dcs.gla.ac.uk>
Date: Thu, 07 Dec 2006 20:25:34 +0000
From: Cristian Zamfir
MIME-Version: 1.0
To: drbd-dev@lists.linbit.com
Subject: Re: [Drbd-dev] lock for reading device state
References: <4576FC63.704@dcs.gla.ac.uk> <20061207130920.GD7521@soda.linbit> <45781C8F.1080400@dcs.gla.ac.uk> <20061207155648.GF7521@soda.linbit>
In-Reply-To: <20061207155648.GF7521@soda.linbit>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: Coordination of development

Lars Ellenberg wrote:
> / 2006-12-07 13:52:15 +0000
> \ Cristian Zamfir:
>>
>> Lars Ellenberg wrote:
>>> / 2006-12-06 17:22:43 +0000
>>> \ Cristian Zamfir:
>>>> Hi,
>>>>
>>>> I am using drbd to implement xen block device migration. Right now I
>>>> am parsing /proc/drbd to find out if the drives are synchronized and I
>>>> can migrate them.
>>> you talk about drbd state "Connected, Consistent",
>>> or what exactly are you parsing?
>> Yes, indeed, I am parsing these values: "cs:Connected st:Secondary/Primary ld:Consistent"
>>
>>
>>>> Is there a way to obtain a lock while reading and processing this
>>>> information and prevent other writes to the primary device?
>>> no. why?
>> I wrote a script that parses /proc/drbd on the primary node. While I am running this script, writes to the primary
>> device are still allowed.
>> If I find that the ld state is "Consistent" then I will make this node secondary and the
>> peer will become primary.
>> The problem is when writes happen while my script is making the peer node primary.
>>
>> A race situation would be the following:
>> At moment X, I read /proc/drbd and see the ld state is consistent.
>> At moment X+1 a write arrives at /dev/drbd1 and the devices are not
>> consistent any more. They start syncing but this may last longer, for
>> instance until moment X+5.
>> Now, at moment X+2, I wrongly believe that the state is still
>> consistent and I decide to make the peer node primary and thus lose
>> the write at moment X+1.
>>
>> Are my assumptions correct so far?
>
> no. you don't "become Inconsistent" because "some write".

Thank you very much for your answer. I guess what I assumed incorrectly was that writes would make the device inconsistent.

>
> "Consistent" in drbd speak is "not Inconsistent".
> oh well.
> so what is Inconsistent.
> drbd starts as being "inconsistent" when the meta data is first
> initialized. then you force one side to think it is Consistent,
> to be able to make it Primary, and the initial full sync starts.
>
> Once the sync is finished, the sync target becomes Connected Consistent.
> If the nodes now disconnect, they still are "Consistent" in the sense of
> "whatever data is on that disk, it is transactionally consistent, though
> maybe it is not 'clean', i.e. you may have to replay some journal to get
> into 'clean' state."
>
> You get into "Inconsistent" only by becoming SyncTarget after
> (re)establishing the connection to the Peer and the handshake determines
> that your data is different from the Peer's, and the Peer's is "better"
> (which typically means "newer").
>
> Because the Resync copies changed blocks linearly over the device,
> while new writes get mirrored already, the data on the SyncTarget is
> "not Consistent" anymore during sync.
> Even if we had data journalling
> during degraded mode, and would replay that during Sync, the SyncTarget
> would stay Consistent but "outdated" until the Resync was completely
> done.
>
>> I'm thinking that there are two solutions: One would be to prevent any writes from Xen's domUs by modifying Xen.
>> The other would be to be able to hold a lock that prevents writes from reaching /dev/drbdX and release it after the
>> processing within the script finishes (that is while I switch the peer device from secondary to primary).
>>
>> I haven't looked at drbd's source yet (I am using 0.7.22 now) but I am considering implementing this lock within
>> drbd if there is no other solution available.
>
> That "lock" does not make sense to me,
> and even if you could do it, it won't solve that "race",
> it would only move it to some other point in time.
>
> Note that a device in Secondary state denies access.
> Also note that you cannot make a device Primary if it sees its Peer as
> being Primary (unless you use drbd8, and explicitly allow
> "two-primaries").

I assume that using drbd8 would make xen block device migration easier because both devices are primary. Am I right?

> And a device that knows it is "Inconsistent" cannot be made Primary,
> unless it is Connected, in which case it would be SyncTarget and get the
> good data from the SyncSource Peer.
>
> So what you need to do for xen migration with drbd 0.7 is:
> Start the migration, once you think you want to switch over, i.e.
> ** once you are done writing on nodeA **
> ** you switch nodeA to Secondary. **
> now, both nodes are Secondary, and neither can write.
> now you can check whether the target nodeB is still Connected, Consistent.
> if so, you make it Primary.
> if not, you abort the migration.

This is exactly what my code is doing now. I was worried that writes would make the drive inconsistent, which is why I thought I needed the lock. Now it is clear that making the transition from primary to secondary is enough.
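
For illustration, the order of operations described above could be sketched in Python roughly as follows. This is a minimal sketch and not the script discussed in this thread. The function names, the injected callables and the leading "N:" minor-number prefix on the status line are assumptions; only the cs:/st:/ld: fields are taken from the /proc/drbd values quoted earlier.

```python
import re

# Per-device status line. The cs/st/ld fields come from the thread; the
# leading "N:" minor-number prefix is an assumption about the file layout.
STATE_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+st:(?P<local>\w+)/(?P<peer>\w+)\s+ld:(?P<ld>\S+)",
    re.MULTILINE,
)


def parse_drbd_state(text, minor):
    """Return cs, local role, peer role and ld for one device, or None."""
    for m in STATE_RE.finditer(text):
        if int(m.group("minor")) == minor:
            return {k: m.group(k) for k in ("cs", "local", "peer", "ld")}
    return None


def switch_over(demote_source, read_target_state, promote_target):
    """Demote nodeA first, then check nodeB, then promote or give up.

    All three arguments are hypothetical callables supplied by the caller.
    Returns True if nodeB was promoted, False if the caller should abort.
    """
    demote_source()                # nodeA becomes Secondary: no more writes
    state = read_target_state()    # parsed state as seen on nodeB
    if state and state["cs"] == "Connected" and state["ld"] == "Consistent":
        promote_target()           # nodeB becomes Primary
        return True
    return False


if __name__ == "__main__":
    sample = " 1: cs:Connected st:Secondary/Secondary ld:Consistent\n"
    steps = []
    ok = switch_over(
        lambda: steps.append("demote nodeA"),
        lambda: parse_drbd_state(sample, 1),
        lambda: steps.append("promote nodeB"),
    )
    print(ok, steps)
```

The state check happens only after nodeA has been demoted, so no write can arrive between the check and the promotion. In a real setup the callables might wrap commands such as "drbdadm secondary" and "drbdadm primary" on the respective nodes; how they are run remotely is left open here.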
>
> "locking" the state of drbd or freezing io while it is Primary on
> migration source nodeA won't help you in any way.
>
>> As a future project, I am also interested if there is anyone working
>> on implementing multiple secondary devices. I am interested in having
>> multiple replicas of the primary node.
>
> here at LINBIT we have some very nice concepts about how we'd implement
> multiple (> 2) nodes and other nice features. But don't ask about timelines.
>

It is great that you are considering this because I will also start working on something similar in the near future.

Thanks,
Cristian