From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:41237) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aNmMJ-00012E-Rj for qemu-devel@nongnu.org; Mon, 25 Jan 2016 13:59:29 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aNmMI-0000MH-Fj for qemu-devel@nongnu.org; Mon, 25 Jan 2016 13:59:27 -0500
Date: Mon, 25 Jan 2016 18:59:13 +0000
From: "Dr. David Alan Gilbert"
Message-ID: <20160125185912.GG2464@work-vm>
References: <20160122193534.GF2482@work-vm> <56A57B43.1040103@cn.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <56A57B43.1040103@cn.fujitsu.com>
Subject: Re: [Qemu-devel] COLO: how to flip a secondary to a primary?
To: Wen Congyang
Cc: Li Zhijian, Changlong Xie, zhanghailiang, qemu block, qemu devel, Luis Tomas, Simon Kollberg, Abel Souza

* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 01/23/2016 03:35 AM, Dr. David Alan Gilbert wrote:
> > Hi,
> >   I've been looking at what's needed to add a new secondary after
> > a primary failed; from the block side it doesn't look as hard
> > as I'd expected, perhaps you can tell me if I'm missing something!
> >
> > The normal primary setup is:
> >
> >    quorum
> >       Real disk
> >       nbd client
>
> quorum
>    real disk
>    replication
>       nbd client
>
> >
> > The normal secondary setup is:
> >    replication
> >       active-disk
> >          hidden-disk
> >             Real-disk
>
> IIRC, we can do it like this:
> quorum
>    replication
>       active-disk
>          hidden-disk
>             real-disk

Yes.

> > With a couple of minor code hacks; I changed the secondary to be:
> >
> >    quorum
> >       replication
> >          active-disk
> >            hidden-disk
> >              Real-disk
> >       dummy-disk
>
> after failover,
> quorum
>    replication(old, mode is secondary)
>      active-disk
>        hidden-disk*
>          real-disk*
>    replication(new, mode is primary)
>      nbd-client

Do you need to keep the old secondary-replication?
Does that just pass straight through?

> In the newest version, we active commit active-disk to real-disk.
> So it will be:
> quorum
>    replication(old, mode is secondary)
>      active-disk(it is real disk now)
>    replication(new, mode is primary)
>      nbd-client

How does that active-commit work?  I didn't think you
could change the real disk until you had the full checkpoint,
since you don't know whether the primary's or the secondary's
changes need to be written?

> > and then after the primary fails, I start a new secondary
> > on another host and then on the old secondary do:
> >
> >   nbd_server_stop
> >   stop
> >   x_block_change top-quorum -d children.0         # deletes use of real disk, leaves dummy
> >   drive_del active-disk0
> >   x_block_change top-quorum -a node-real-disk
> >   x_block_change top-quorum -d children.1         # Seems to have deleted the dummy?!, the disk is now child 0
> >   drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=ibpair,file.port=8889,file.export=colo-disk0,node-name=nbd-client,if=none,cache=none
> >   x_block_change top-quorum -a nbd-client
> >   c
> >   migrate_set_capability x-colo on
> >   migrate -d -b tcp:ibpair:8888
> >
> > and I think that means what was the secondary, has the same disk
> > structure as a normal primary.
> > That's not quite happy yet, and I've not figured out why - but the
> > order/structure of the block devices looks right?
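
To make the ordering in that command sequence easier to reason about, here is
a rough toy model in Python. It is not QEMU code: ToyQuorum, its add/delete
methods and the two helper functions are hypothetical names invented for the
sketch, and the rule it encodes (a delete is refused if it would leave fewer
children than the vote threshold) is an assumption about how quorum behaves,
with the threshold set to 1.

```python
class ToyQuorum:
    """Toy stand-in for a quorum node; hypothetical, not QEMU code."""

    def __init__(self, children, vote_threshold=1):
        self.children = list(children)
        self.vote_threshold = vote_threshold

    def add(self, node):
        # New children are simply appended in this model.
        self.children.append(node)

    def delete(self, node):
        # Assumed rule: refuse any delete that would leave fewer
        # children than the vote threshold.
        if len(self.children) - 1 < self.vote_threshold:
            raise ValueError("children would drop below vote threshold")
        self.children.remove(node)


def failover_with_dummy():
    # Secondary layout with the extra dummy child.
    q = ToyQuorum(["secondary-replication", "dummy-disk"])
    q.delete("secondary-replication")  # dummy keeps the quorum non-empty
    q.add("node-real-disk")
    q.delete("dummy-disk")
    q.add("nbd-client")
    return q.children


def failover_without_dummy():
    # Same first step, but with no dummy to fall back on.
    q = ToyQuorum(["secondary-replication"])
    try:
        q.delete("secondary-replication")
    except ValueError:
        return "refused"
    return q.children


print(failover_with_dummy())
print(failover_without_dummy())
```

In this model the sequence with the dummy ends with the real disk first and
the nbd client second, while the same first delete without a dummy is refused
outright.
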
> >
> > Notes:
> >    a) The dummy serves two purposes, 1) it works around the segfault
> >       I reported in the other mail, 2) when I delete the real disk in the
> >       first x_block_change it means the quorum still has 1 disk so doesn't
> >       get upset.
>
> I don't understand the purpose 2.

quorum won't allow you to delete all its members
('The number of children cannot be lower than the vote threshold 1')
and it's very tricky getting the order correct with add/delete; for example I tried:

drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=ibpair,file.port=8889,file.export=colo-disk0,node-name=nbd-client,if=none,cache=none
# gets children.1
x_block_change top-quorum -a nbd-client
# deletes the secondary replication
x_block_change top-quorum -d children.0
drive_del active-disk0
# ends up as children.0 but in the 2nd slot
x_block_change top-quorum -a node-real-disk

info block shows me:
top-quorum (#block615): json:{"children": [
    {"driver": "replication", "mode": "primary", "file": {"port": "8889", "host": "ibpair", "driver": "nbd", "export": "colo-disk0"}},
    {"driver": "raw", "file": {"driver": "file", "filename": "/home/localvms/bugzilla.raw"}}
  ],
  "driver": "quorum", "blkverify": false, "rewrite-corrupted": false, "vote-threshold": 1} (quorum)
    Cache mode:       writeback

that has the replication first and the file second; that's the opposite
from the normal primary startup - does it matter?

I can't add node-real-disk until I drive_del active-disk0 (which previously
used it); and I can't drive_del until I remove it from the quorum; but I can't
remove that from the quorum first, because that leaves an empty quorum.

> >    b) I had to remove the restriction in quorum_start_replication
> >       on which mode it would run in.
>
> IIRC, this check will be removed.
>
> >    c) I'm not really sure everything knows it's in secondary mode yet, and
> >       I'm not convinced whether the replication is doing the right thing.
> >    d) The migrate -d -b eventually fails on the destination, not worked out why
> >       yet.
>
> Can you give me the error message?

I need to repeat it to check; it was something like a bad flag from the block
migration code; it happened after the block migration hit 100%.

> >    e) Adding/deleting children on quorum is hard having to use the children.0/1
> >       notation when you've added children using node names - it's worrying
> >       which number is which; is there a way to give them a name?
>
> No. I think we can improve 'info block' output.

Yes, that would be good; I thought it was the order in the list; but after
debugging it today I'm not convinced it is; I think it always keeps the same
name - so for example if you start off with [children.0, children.1]; then
delete children.0 you now have [children.1]; if you then add a new child
I *think* that becomes children.0 but you end up with [children.1,children.0]

> >    f) I've not thought about the colo-proxy that much yet - I guess that
> >       existing connections need to keep their sequence number offset but
> >       new connections made by what is now the primary don't need to do anything
> >       special.
>
> Hailiang or Zhijian can answer this question.

Thanks,

> Thanks
> Wen Congyang
>
> >
> > Dave
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> >
> > .
> >
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
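
P.S. For anyone trying to follow the children.N naming puzzle above, this is
roughly the behaviour I *think* I'm seeing, written as a toy Python model.
It is only a guess, not taken from the quorum code: NamedChildren is a
hypothetical class, and the rule it uses (a new child takes the lowest unused
number but is appended at the end of the list, while existing children keep
their names) is an assumption that happens to match the example.

```python
class NamedChildren:
    """Toy model of the guessed naming rule; hypothetical, not QEMU code."""

    def __init__(self, count):
        # Start with children.0 .. children.(count-1) in list order.
        self.names = [f"children.{i}" for i in range(count)]

    def delete(self, name):
        # Remaining children keep their existing names.
        self.names.remove(name)

    def add(self):
        # Assumed rule: take the lowest unused index, append at the end.
        used = {int(n.split(".")[1]) for n in self.names}
        index = 0
        while index in used:
            index += 1
        name = f"children.{index}"
        self.names.append(name)
        return name


c = NamedChildren(2)    # [children.0, children.1]
c.delete("children.0")  # [children.1]
c.add()                 # new child is named children.0, placed last
print(c.names)          # ['children.1', 'children.0']
```

If that guess is right, the number in the name says nothing about the slot
the child occupies, which is why being able to address children by node name
would be a lot less error prone.
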