From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:34740) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1VHsBV-0001B8-1M for qemu-devel@nongnu.org; Fri, 06 Sep 2013 05:18:39 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1VHsBO-0000xa-VK for qemu-devel@nongnu.org; Fri, 06 Sep 2013 05:18:32 -0400
Received: from mx1.redhat.com ([209.132.183.28]:5006) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1VHsBO-0000xU-LA for qemu-devel@nongnu.org; Fri, 06 Sep 2013 05:18:26 -0400
Date: Fri, 6 Sep 2013 17:18:20 +0800
From: Fam Zheng
Message-ID: <20130906091820.GA24154@T430s.nay.redhat.com>
References: <20130903162449.GF5285@irqsave.net> <20130906075606.GD4814@T430s.nay.redhat.com> <20130906084513.GE2588@dhcp-200-207.str.redhat.com>
In-Reply-To: <20130906084513.GE2588@dhcp-200-207.str.redhat.com>
Subject: Re: [Qemu-devel] Block Filters
Reply-To: famz@redhat.com
To: Kevin Wolf
Cc: Benoît Canet, jcody@redhat.com, armbru@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com, pbonzini@redhat.com, mreitz@redhat.com

On Fri, 09/06 10:45, Kevin Wolf wrote:
> On 06.09.2013 at 09:56, Fam Zheng wrote:
> > On Tue, 09/03 18:24, Benoît Canet wrote:
> > >
> > > Hello list,
> > >
> > > I have been thinking about QEMU block filters lately.
> > >
> > > I am not a block.c/blockdev.c expert so tell me what you think of the
> > > following.
> > >
> > > The use cases I see would be:
> > >
> > > -$user wants to have some real cryptography on top of qcow2/qed or another
> > > format.
> > > snapshots and other block features should continue to work
> > >
> > > -$user wants to use a raid like feature like QUORUM in QEMU.
> > > other features should continue to work
> > >
> > > -$user wants to use the future SSD deduplication implementation with
> > > metadata on SSD and data on spinning disks.
> > > other features should continue to work
> > >
> > > -$user wants to I/O throttle one drive of his vm.
> > >
> > > -$user wants to do Copy On Read
> > >
> > > -$user wants to do a combination of the above
> > >
> > > -$developer wants to make the minimum of required steps to keep changes small
> > >
> > > -$developer wants to keep user interface changes for later
> > >
> > > Let's take an example case of a user wanting to do I/O throttled encrypted
> > > QUORUM on top of QCOW2.
> > >
> > > Assuming we want to implement throttle and encryption as something remotely
> > > being like a block filter this makes a pretty complex BlockDriverState tree.
> > >
> > > The tree would look like the following:
> > >
> > >       I/O throttling BlockDriverState (bs)
> > >                         |
> > >                         |
> > >                         |
> > >                         |
> > >         Encryption BlockDriverState (bs)
> > >                         |
> > >                         |
> > >                         |
> > >                         |
> > >           Quorum BlockDriverState (bs)
> > >                    /    |    \
> > >                  /      |      \
> > >                /        |        \
> > >              /          |          \
> > >       QCOW2 bs      QCOW2 bs      QCOW2 bs
> > >           |             |             |
> > >           |             |             |
> > >           |             |             |
> > >           |             |             |
> > >        RAW bs        RAW bs        RAW bs
> > >
> > > An external snapshot should result in a tree like the following.
> > >
> > >       I/O throttling BlockDriverState (bs)
> > >                         |
> > >                         |
> > >                         |
> > >                         |
> > >         Encryption BlockDriverState (bs)
> > >                         |
> > >                         |
> > >                         |
> > >                         |
> > >           Quorum BlockDriverState (bs)
> > >                    /    |    \
> > >                  /      |      \
> > >                /        |        \
> > >              /          |          \
> > >       QCOW2 bs      QCOW2 bs      QCOW2 bs
> > >           |             |             |
> > >           |             |             |
> > >           |             |             |
> > >           |             |             |
> > >       QCOW2 bs      QCOW2 bs      QCOW2 bs
> > >           |             |             |
> > >           |             |             |
> > >           |             |             |
> > >           |             |             |
> > >        RAW bs        RAW bs        RAW bs
> > >
> > > In the current state of QEMU we can code some block drivers to implement
> > > this tree.
> > >
> > > However when doing operations like snapshots blockdev.c would have no real
> > > idea of what should be snapshotted and how. (The 3 top bs should be kept on
> > > top)
> > >
> > > Moreover it would have no way to manipulate this tree of BlockDriverStates
> > > easily, as each one is encapsulated in its parent.
> > >
> > > Also there is no generic way to tell the block layer that two or more
> > > BlockDriverStates are siblings.
> > >
> > > The current mail is here to propose some additional structures in order to
> > > cope with these problems.
> > >
> > > The overall strategy of the proposed structures is to push the
> > > BlockDriverState relationships out of each BlockDriverState.
> > >
> > > The idea is that it would make it easier for the block layer to manipulate a
> > > well known structure instead of being forced to enter into each
> > > BlockDriverState specificity.
> > >
> > > The first structure is the BlockStackNode.
> > >
> > > The BlockStackNode would be used to represent the relationship between the
> > > various BlockDriverStates
> > >
> > > struct BlockStackNode {
> > >     BlockDriverState *bs; /* the BlockDriverState held by this node */
> > >
> > >     /* this doubly linked list entry points to the child node and the parent
> > >      * node
> > >      */
> > >     QLIST_ENTRY(BlockStackNode) down;
> > >
> > >     /* This doubly linked list entry points to the siblings of this node
> > >      */
> > >     QLIST_ENTRY(BlockStackNode) siblings;
> > >
> > >     /* a hash or an array of the siblings of this node for fast access
> > >      * should be recomputed when updating the tree */
> > >     QHASH_ENTRY siblings_hash;
> > > };
> > >
> > > The BlockBackend would be the structure used to hold the "drive" the guest
> > > uses.
> > >
> > > struct BlockBackend {
> > >     /* the following doubly linked list header points to the top
> > >      * BlockStackNode, in our case the one containing the I/O throttling bs
> > >      */
> > >     QLIST_HEAD(, BlockStackNode) block_stack_head;
> > >     /* this is a pointer to the topmost node below the block filter chain
> > >      * in our case the first QCOW2 sibling
> > >      */
> > >     BlockStackNode *top_node_below_filter;
> > > };
> > >
> > >
> > > Updated diagram:
> > >
> > > (Here bsn means BlockStackNode)
> > >
> > > ------------------------BlockBackend
> > > |                             |
> > > |                      block_stack_head
> > > |                             |
> > > |                             |
> > > |               I/O throttling BlockStackNode (contains its bs)
> > > |                             |
> > > |                            down
> > > |                             |
> > > |                             |
> > > top_node_below_filter  Encryption BlockStackNode (contains its bs)
> > > |                             |
> > > |                            down
> > > |                             |
> > > |                             |
> > > |                 Quorum BlockStackNode (contains its bs)
> > > |                /
> > > |            down
> > > |             /
> > > |            /    S              S
> > > ------ QCOW2 bsn--i---QCOW2 bsn--i------ QCOW2 bsn  (each bsn contains a bs)
> > >            |      b       |      b           |
> > >           down    l      down    l          down
> > >            |      i       |      i           |
> > >            |      n       |      n           |
> > >            |      g       |      g           |
> > >            |      s       |      s           |
> > >            |              |                  |
> > >         RAW bsn        RAW bsn            RAW bsn  (each bsn contains a bs)
> > >
> > >
> > > Block driver point of view:
> > >
> > > To construct the tree each BlockDriver would have some utility functions
> > > looking like:
> > >
> > > bdrv_register_child_bs(bs, child_bs, int index);
> > >
> > > Multiple calls to this function could be done to register multiple sibling
> > > children identified by their index.
> > >
> > > This way something like quorum could register multiple QCOW2 instances.
> > >
> > > The driver would have a
> > > BlockDriverState *bdrv_access_child(bs, int index);
> > >
> > > to access its children.
> > >
> > > These functions can be implemented without the driver knowing about
> > > BlockStackNodes using container_of.
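For illustration, a minimal sketch of how the two helpers quoted above could hide the node from the driver with container_of. It assumes, unlike the quoted struct, that the node embeds its bs by value (container_of cannot recover a container from a pointer member). The fixed-size children array, MAX_CHILDREN, the int return value and the one-field BlockDriverState stand-in are all hypothetical and not taken from QEMU.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 8 /* hypothetical fixed fan-out for the sketch */

/* Stand-in for QEMU's BlockDriverState; only a name for the demo. */
typedef struct BlockDriverState {
    const char *name;
} BlockDriverState;

/* Hypothetical node that embeds its bs by value so container_of works. */
typedef struct BlockStackNode {
    BlockDriverState bs;
    struct BlockStackNode *children[MAX_CHILDREN];
} BlockStackNode;

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Register child_bs as child number index of bs; returns 0 or -1. */
static int bdrv_register_child_bs(BlockDriverState *bs,
                                  BlockDriverState *child_bs, int index)
{
    BlockStackNode *node, *child;

    if (index < 0 || index >= MAX_CHILDREN) {
        return -1;
    }
    node = container_of(bs, BlockStackNode, bs);
    child = container_of(child_bs, BlockStackNode, bs);
    node->children[index] = child;
    return 0;
}

/* Return the bs of child number index, or NULL if none is registered. */
static BlockDriverState *bdrv_access_child(BlockDriverState *bs, int index)
{
    BlockStackNode *node;

    if (index < 0 || index >= MAX_CHILDREN) {
        return NULL;
    }
    node = container_of(bs, BlockStackNode, bs);
    return node->children[index] ? &node->children[index]->bs : NULL;
}
```

The driver only ever passes and receives BlockDriverState pointers; the node layout stays private to the block layer, which is the point of the proposal.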
> > >
> > > blockdev point of view: (here I need your help)
> > >
> > > When doing a snapshot blockdev.c would access
> > > BlockBackend->top_node_below_filter and make a snapshot of the bs contained
> > > in this node and its siblings.
> > >
> > Since BlockDriver.bdrv_snapshot_create() is an optional operation, blockdev.c
> > can navigate down the tree from the top node, until hitting some layer where
> > the op is implemented (the QCow2 bs), so we get rid of this
> > top_node_below_filter pointer.
>
> Is it even inherent to a block driver (like a filter), if a snapshot is
> to be taken at its level? Or is it rather a policy decision that should
> be made by the user?
>

OK, I get the point that the user should have full flexibility and fine
operation granularity. It also stands against
block_backend->top_node_below_filter. Do we really have the assumption that all
the filters are on top of the tree and linear? Shouldn't this be possible?

              Block Backend
                    |
                    |
               Quorum BDS
                /   |   \
       iot filter   |     \
         /          |       \
      qcow2       qcow2     qcow2

So we throttle only a particular image, not the whole device. But this will
make a top_node_below_filter pointer impossible.

> In our example, the quorum driver, it's not at all clear to me that you
> want to snapshot all children. In order to roll back to a previous
> state, one snapshot is enough, you don't need multiple copies of the
> same one. Perhaps you want two so that we can still compare them for
> verification. Or all of them because you can afford the disk space and
> want ultimate safety. I don't think qemu can know which one is true.
>

Only if quorum ever knew about and operated on snapshots should it be
considered specifically, but it doesn't. So we need to achieve this in the
general design: allow the user to take a snapshot, or set throttle limits, on
particular BDSes, as in the graph above.

> In the same way, in a typical case you may want to keep I/O throttling
> for the whole drive, including the new snapshot.
> But what if the throttling was used in order to not overload the network
> where the image is stored, and you're now doing a local snapshot, to which
> you want to stream the image? The I/O throttling should apply only to the
> backing file, not the new snapshot.
>

Yes, and OTOH, throttling is really suited to being a filter only if it can be
a non-top one; otherwise it's no better than what we have now.

> So perhaps what we really need is a more flexible snapshot/BDS tree
> manipulation command that describes in detail which structure you want
> to have in the end.
>

Fam
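
For illustration, a minimal sketch of the per-node manipulation discussed in this reply: putting a throttle filter above one quorum child instead of on top of the whole device. Node, insert_filter, find_node and the fixed MAX_CHILDREN are hypothetical names invented for the sketch; nothing here is existing QEMU API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHILDREN 3 /* hypothetical fan-out, enough for the example graph */

/* Hypothetical graph node standing in for a BDS. */
typedef struct Node {
    const char *name;
    struct Node *children[MAX_CHILDREN];
} Node;

/* Splice filter between parent and its child at index; returns 0 or -1. */
static int insert_filter(Node *parent, int index, Node *filter)
{
    if (index < 0 || index >= MAX_CHILDREN || !parent->children[index]) {
        return -1;
    }
    filter->children[0] = parent->children[index];
    parent->children[index] = filter;
    return 0;
}

/* Depth-first lookup by name, so a command can target one particular node. */
static Node *find_node(Node *root, const char *name)
{
    int i;

    if (!root) {
        return NULL;
    }
    if (strcmp(root->name, name) == 0) {
        return root;
    }
    for (i = 0; i < MAX_CHILDREN; i++) {
        Node *hit = find_node(root->children[i], name);
        if (hit) {
            return hit;
        }
    }
    return NULL;
}
```

With filters allowed anywhere in the graph, operations have to address a node directly (here by name) rather than rely on a single "top node below the filters" pointer.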