Date: Fri, 6 Sep 2013 10:45:14 +0200
From: Kevin Wolf
Message-ID: <20130906084513.GE2588@dhcp-200-207.str.redhat.com>
References: <20130903162449.GF5285@irqsave.net> <20130906075606.GD4814@T430s.nay.redhat.com>
In-Reply-To: <20130906075606.GD4814@T430s.nay.redhat.com>
Subject: Re: [Qemu-devel] Block Filters
To: Fam Zheng
Cc: Benoît Canet, jcody@redhat.com, armbru@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com, pbonzini@redhat.com, mreitz@redhat.com

On 06.09.2013 at 09:56, Fam Zheng wrote:
> On Tue, 09/03 18:24, Benoît Canet wrote:
> > 
> > Hello list,
> > 
> > I have been thinking about QEMU block filters lately.
> > 
> > I am not a block.c/blockdev.c expert, so tell me what you think of the
> > following.
> > 
> > The use cases I see would be:
> > 
> > -$user wants to have some real cryptography on top of qcow2/qed or
> >  another format.
> >  Snapshots and other block features should continue to work.
> > 
> > -$user wants to use a RAID-like feature like QUORUM in QEMU.
> >  Other features should continue to work.
> > 
> > -$user wants to use the future SSD deduplication implementation with
> >  metadata on SSD and data on spinning disks.
> >  Other features should continue to work.
> > 
> > -$user wants to I/O throttle one drive of his VM.
> > 
> > -$user wants to do copy on read.
> > 
> > -$user wants to do a combination of the above.
> > 
> > -$developer wants to make the minimum of required steps to keep
> >  changes small.
> > 
> > -$developer wants to keep user interface changes for later.
> > 
> > Let's take an example case of a user wanting to do an I/O throttled,
> > encrypted QUORUM on top of QCOW2.
> > 
> > Assuming we want to implement throttling and encryption as something
> > remotely like a block filter, this makes a pretty complex
> > BlockDriverState tree.
> > 
> > The tree would look like the following:
> > 
> >         I/O throttling BlockDriverState (bs)
> >                         |
> >         Encryption BlockDriverState (bs)
> >                         |
> >         Quorum BlockDriverState (bs)
> >           /             |             \
> >     QCOW2 bs         QCOW2 bs         QCOW2 bs
> >        |                |                |
> >     RAW bs           RAW bs           RAW bs
> > 
> > An external snapshot should result in a tree like the following:
> > 
> >         I/O throttling BlockDriverState (bs)
> >                         |
> >         Encryption BlockDriverState (bs)
> >                         |
> >         Quorum BlockDriverState (bs)
> >           /             |             \
> >     QCOW2 bs         QCOW2 bs         QCOW2 bs
> >        |                |                |
> >     QCOW2 bs         QCOW2 bs         QCOW2 bs
> >        |                |                |
> >     RAW bs           RAW bs           RAW bs
> > 
> > In the current state of QEMU we can code some block drivers to
> > implement this tree.
> > 
> > However, when doing operations like snapshots, blockdev.c would have
> > no real idea of what should be snapshotted and how.
> > (The 3 top bs should be kept on top.)
> > 
> > Moreover, it would have no easy way to manipulate this tree of
> > BlockDriverStates, as each one is encapsulated in its parent.
> > 
> > Also, there is no generic way to tell the block layer that two or
> > more BlockDriverStates are siblings.
> > 
> > This mail proposes some additional structures in order to cope with
> > these problems.
> > 
> > The overall strategy of the proposed structures is to push the
> > BlockDriverState relationships out of each BlockDriverState.
> > 
> > The idea is that it would make it easier for the block layer to
> > manipulate a well-known structure instead of being forced to enter
> > into each BlockDriverState's specificity.
> > 
> > The first structure is the BlockStackNode.
> > 
> > The BlockStackNode would be used to represent the relationship
> > between the various BlockDriverStates:
> > 
> > struct BlockStackNode {
> >     BlockDriverState *bs; /* the BlockDriverState held by this node */
> > 
> >     /* this doubly linked list entry points to the child node and the
> >      * parent node
> >      */
> >     QLIST_ENTRY(BlockStackNode) down;
> > 
> >     /* this doubly linked list entry points to the siblings of this
> >      * node
> >      */
> >     QLIST_ENTRY(BlockStackNode) siblings;
> > 
> >     /* a hash or an array of the siblings of this node for fast
> >      * access; should be recomputed when updating the tree */
> >     QHASH_ENTRY siblings_hash;
> > };
> > 
> > The BlockBackend would be the structure used to hold the "drive" the
> > guest uses.
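To make the shape of this proposal concrete, here is a minimal, self-contained sketch of such a tree in plain C. It deliberately uses bare child/sibling pointers instead of QEMU's QLIST macros and the proposed siblings hash, and all helper names (`bsn_new`, `bsn_add_child`, `bsn_count_format`) are invented for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the real BlockDriverState; only a format name is kept. */
typedef struct BlockDriverState {
    const char *format;
} BlockDriverState;

/* Simplified BlockStackNode: bare pointers instead of QLIST entries. */
typedef struct BlockStackNode {
    BlockDriverState *bs;
    struct BlockStackNode *down;     /* first child */
    struct BlockStackNode *siblings; /* next sibling under the same parent */
} BlockStackNode;

static BlockStackNode *bsn_new(const char *format)
{
    BlockStackNode *node = calloc(1, sizeof(*node));
    node->bs = calloc(1, sizeof(*node->bs));
    node->bs->format = format;
    return node;
}

/* Attach @child as the last child of @parent. */
static void bsn_add_child(BlockStackNode *parent, BlockStackNode *child)
{
    BlockStackNode **p = &parent->down;
    while (*p) {
        p = &(*p)->siblings;
    }
    *p = child;
}

/* Generic walk: count the nodes holding a bs of the given format.
 * The point is that blockdev.c could traverse the whole stack this way
 * without knowing anything driver-specific. */
static int bsn_count_format(const BlockStackNode *node, const char *format)
{
    if (!node) {
        return 0;
    }
    return (strcmp(node->bs->format, format) == 0)
           + bsn_count_format(node->down, format)
           + bsn_count_format(node->siblings, format);
}
```

Building the quorum example from the diagram (throttle -> crypt -> quorum -> 3 x qcow2 -> raw) then reduces to a handful of bsn_add_child() calls, and the snapshot code could find all three QCOW2 siblings with a generic walk.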
> > 
> > struct BlockBackend {
> >     /* the following doubly linked list head points to the top
> >      * BlockStackNode, in our case the one containing the I/O
> >      * throttling bs
> >      */
> >     QLIST_HEAD(, BlockStackNode) block_stack_head;
> >     /* this is a pointer to the topmost node below the block filter
> >      * chain, in our case the first QCOW2 sibling
> >      */
> >     BlockStackNode *top_node_below_filters;
> > };
> > 
> > Updated diagram:
> > 
> > (Here bsn means BlockStackNode.)
> > 
> > ------------------------BlockBackend
> > |                            |
> > |                      block_stack_head
> > |                            |
> > |               I/O throttling BlockStackNode (contains its bs)
> > |                            |
> > |                          down
> > |                            |
> > top_node_below_filters  Encryption BlockStackNode (contains its bs)
> > |                            |
> > |                          down
> > |                            |
> > |               Quorum BlockStackNode (contains its bs)
> > |                          /
> > |                        down
> > |                        /
> > ------QCOW2 bsn--siblings--QCOW2 bsn--siblings--QCOW2 bsn  (each bsn contains a bs)
> >          |                    |                    |
> >         down                 down                 down
> >          |                    |                    |
> >       RAW bsn              RAW bsn              RAW bsn     (each bsn contains a bs)
> > 
> > Block driver point of view:
> > 
> > To construct the tree, each BlockDriver would have some utility
> > functions looking like:
> > 
> > bdrv_register_child_bs(bs, child_bs, int index);
> > 
> > Multiple calls to this function could be done to register multiple
> > sibling children identified by their index.
> > 
> > This way something like quorum could register multiple QCOW2
> > instances.
> > 
> > Drivers would have a
> > 
> > BlockDriverState *bdrv_access_child(bs, int index);
> > 
> > to access their children.
> > 
> > These functions can be implemented without the driver knowing about
> > BlockStackNodes, using container_of.
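As a side note, the container_of trick only works if the BlockDriverState is embedded in the node rather than pointed to, as in the struct sketched earlier. Here is a rough sketch of how bdrv_register_child_bs() and bdrv_access_child() could then be implemented; the types are heavily simplified and the fixed fan-out and field layout are invented for the example:

```c
#include <assert.h>
#include <stddef.h>

/* QEMU-style container_of: recover the struct that embeds @member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

typedef struct BlockDriverState {
    int id; /* placeholder for the real fields */
} BlockDriverState;

/* For container_of to work, the bs must be embedded in the node
 * (not held via a pointer as in the proposal above). */
typedef struct BlockStackNode {
    BlockDriverState bs;
    struct BlockStackNode *children[3]; /* fixed fan-out for the sketch */
} BlockStackNode;

/* The driver only ever sees BlockDriverState pointers; the block layer
 * recovers the enclosing BlockStackNode with container_of. */
static void bdrv_register_child_bs(BlockDriverState *bs,
                                   BlockDriverState *child_bs, int index)
{
    BlockStackNode *node = container_of(bs, BlockStackNode, bs);
    BlockStackNode *child = container_of(child_bs, BlockStackNode, bs);
    node->children[index] = child;
}

static BlockDriverState *bdrv_access_child(BlockDriverState *bs, int index)
{
    BlockStackNode *node = container_of(bs, BlockStackNode, bs);
    return node->children[index] ? &node->children[index]->bs : NULL;
}
```

This is how quorum could register its three QCOW2 children under indices 0..2 and read them back, while its own code never mentions BlockStackNode.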
> > 
> > blockdev point of view: (here I need your help)
> > 
> > When doing a snapshot, blockdev.c would access
> > BlockBackend->top_node_below_filters and make a snapshot of the bs
> > contained in this node and its siblings.
> > 
> Since BlockDriver.bdrv_snapshot_create() is an optional operation,
> blockdev.c can navigate down the tree from the top node until hitting
> some layer where the op is implemented (the QCOW2 bs), so we get rid of
> this top_node_below_filters pointer.

Is it even inherent to a block driver (like a filter) whether a snapshot
is to be taken at its level? Or is it rather a policy decision that
should be made by the user?

In our example, the quorum driver, it's not at all clear to me that you
want to snapshot all children. In order to roll back to a previous
state, one snapshot is enough; you don't need multiple copies of the
same one. Perhaps you want two, so that you can still compare them for
verification. Or all of them, because you can afford the disk space and
want ultimate safety. I don't think qemu can know which one is true.

In the same way, in a typical case you may want to keep I/O throttling
for the whole drive, including the new snapshot. But what if the
throttling was used in order not to overload the network where the image
is stored, and you're now doing a local snapshot to which you want to
stream the image? The I/O throttling should apply only to the backing
file, not to the new snapshot.

So perhaps what we really need is a more flexible snapshot/BDS tree
manipulation command that describes in detail which structure you want
to have in the end.

Kevin
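Fam's suggestion quoted above (walk down from the top until a layer actually implements the snapshot operation) could be sketched as follows. The types are heavily simplified, branching drivers such as quorum are ignored, and a single bs->file link stands in for the real child relationship:

```c
#include <assert.h>
#include <stddef.h>

typedef struct BlockDriverState BlockDriverState;

/* Simplified BlockDriver: the snapshot op is optional, i.e. NULL for
 * drivers (such as filters) that do not implement it themselves. */
typedef struct BlockDriver {
    const char *format_name;
    int (*bdrv_snapshot_create)(BlockDriverState *bs);
} BlockDriver;

struct BlockDriverState {
    const BlockDriver *drv;
    BlockDriverState *file; /* the next layer down; no branching here */
};

static int qcow2_snapshot_create(BlockDriverState *bs)
{
    (void)bs;
    return 0; /* pretend the snapshot was taken */
}

/* Walk down from the top until some layer implements the op; return
 * NULL when no layer in the chain supports snapshots. */
static BlockDriverState *find_snapshot_layer(BlockDriverState *bs)
{
    while (bs && !bs->drv->bdrv_snapshot_create) {
        bs = bs->file;
    }
    return bs;
}
```

For the throttle -> crypt -> qcow2 chain from the example, the walk skips the two filter layers and lands on the QCOW2 bs, which is exactly the behaviour that makes the top_node_below_filters pointer unnecessary; Kevin's policy objection above is that which layer *should* be picked may not be decidable this way.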