Date: Tue, 3 Sep 2013 18:24:49 +0200
From: Benoît Canet
Message-ID: <20130903162449.GF5285@irqsave.net>
Subject: [Qemu-devel] Block Filters
To: qemu-devel@nongnu.org
Cc: kwolf@redhat.com, famz@redhat.com, jcody@redhat.com, armbru@redhat.com,
    mreitz@redhat.com, stefanha@redhat.com, pbonzini@redhat.com

Hello list,

I have been thinking about QEMU block filters lately.

I am not a block.c/blockdev.c expert, so tell me what you think of the
following.

The use cases I see would be:

-$user wants to have some real cryptography on top of qcow2/qed or another
 format. Snapshots and other block features should continue to work.

-$user wants to use a RAID-like feature like QUORUM in QEMU. Other features
 should continue to work.

-$user wants to use the future SSD deduplication implementation with
 metadata on SSD and data on spinning disks. Other features should continue
 to work.

-$user wants to I/O throttle one drive of his VM.

-$user wants to do Copy On Read.

-$user wants to do a combination of the above.

-$developer wants to take the minimum number of steps required, to keep
 changes small.

-$developer wants to keep user interface changes for later.

Let's take the example of a user wanting to do I/O throttled, encrypted
QUORUM on top of QCOW2.

Assuming we want to implement throttling and encryption as something
remotely like a block filter, this makes a pretty complex BlockDriverState
tree.

The tree would look like the following:

    I/O throttling BlockDriverState (bs)
                      |
                      |
                      |
                      |
      Encryption BlockDriverState (bs)
                      |
                      |
                      |
                      |
        Quorum BlockDriverState (bs)
                   /  |  \
                /     |     \
             /        |        \
          /           |           \
    QCOW2 bs      QCOW2 bs      QCOW2 bs
        |             |             |
        |             |             |
        |             |             |
        |             |             |
     RAW bs        RAW bs        RAW bs

An external snapshot should result in a tree like the following:

    I/O throttling BlockDriverState (bs)
                      |
                      |
                      |
                      |
      Encryption BlockDriverState (bs)
                      |
                      |
                      |
                      |
        Quorum BlockDriverState (bs)
                   /  |  \
                /     |     \
             /        |        \
          /           |           \
    QCOW2 bs      QCOW2 bs      QCOW2 bs
        |             |             |
        |             |             |
        |             |             |
        |             |             |
    QCOW2 bs      QCOW2 bs      QCOW2 bs
        |             |             |
        |             |             |
        |             |             |
        |             |             |
     RAW bs        RAW bs        RAW bs

In the current state of QEMU we can code some block drivers to implement
this tree.

However, when doing operations like snapshots, blockdev.c would have no real
idea of what should be snapshotted and how. (The 3 top bs should be kept on
top.)

Moreover, it would have no easy way to manipulate this tree of
BlockDriverStates, as each one is encapsulated in its parent.

Also, there is no generic way to tell the block layer that two or more
BlockDriverStates are siblings.

This mail proposes some additional structures in order to cope with these
problems.

The overall strategy of the proposed structures is to push the relationships
between BlockDriverStates out of each BlockDriverState.

The idea is that it would be easier for the block layer to manipulate a
well-known structure than to be forced to enter into the specifics of each
BlockDriverState.

The first structure is the BlockStackNode.

The BlockStackNode would be used to represent the relationship between the
various BlockDriverStates.

struct BlockStackNode {
    BlockDriverState *bs; /* the BlockDriverState held by this node */

    /* this doubly linked list entry points to the child node and the parent
     * node */
    QLIST_ENTRY(BlockStackNode) down;

    /* this doubly linked list entry points to the siblings of this node */
    QLIST_ENTRY(BlockStackNode) siblings;

    /* a hash or an array of the siblings of this node for fast access;
     * should be recomputed when updating the tree */
    QHASH_ENTRY siblings_hash;
};

The BlockBackend would be the structure used to hold the "drive" the guest
uses.

struct BlockBackend {
    /* the following doubly linked list header points to the top
     * BlockStackNode; in our case it's the one containing the I/O
     * throttling bs */
    QLIST_HEAD(, BlockStackNode) block_stack_head;

    /* this is a pointer to the topmost node below the block filter chain;
     * in our case the first QCOW2 sibling */
    BlockStackNode *top_node_below_filters;
};

Updated diagram:

(Here bsn means BlockStackNode)

 ---------------------------------BlockBackend
 |                                      |
 |                              block_stack_head
 |                                      |
 |                                      |
 |               I/O throttling BlockStackNode (contains its bs)
 |                                      |
 |                                    down
 |                                      |
 |                                      |
top_node_below_filters  Encryption BlockStackNode (contains its bs)
 |                                      |
 |                                    down
 |                                      |
 |                                      |
 |                   Quorum BlockStackNode (contains its bs)
 |                       /
 |                 down
 |               /
 |            /    S              S
 ------ QCOW2 bsn--i---QCOW2 bsn--i------ QCOW2 bsn (each bsn contains a bs)
            |      b       |      b           |
          down     l     down     l         down
            |      i       |      i           |
            |      n       |      n           |
            |      g       |      g           |
            |      s       |      s           |
            |              |                  |
         RAW bsn        RAW bsn            RAW bsn (each bsn contains a bs)

Block driver point of view:

To construct the tree, each BlockDriver would have some utility functions
looking like:

bdrv_register_child_bs(bs, child_bs, int index);

Multiple calls to this function could be made to register multiple sibling
children, identified by their index.

This way something like quorum could register multiple QCOW2 instances.
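
To make the driver-facing helper more concrete, here is a rough sketch of how
registration by index, together with a matching lookup, could behave. It is
only an illustration under loud assumptions: the types are simplified
stand-ins rather than QEMU's real structures, a plain array replaces the
QLIST and hash fields, and the "node" back-pointer inside BlockDriverState is
a hypothetical shortcut to keep the example short. Error handling and freeing
are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for illustration only, NOT QEMU's real types. */
typedef struct BlockStackNode BlockStackNode;

typedef struct BlockDriverState {
    const char *name;      /* illustrative label */
    BlockStackNode *node;  /* assumed back-pointer to the owning node */
} BlockDriverState;

struct BlockStackNode {
    BlockDriverState *bs;       /* the BlockDriverState held by this node */
    BlockStackNode *parent;     /* node above this one, NULL at the top */
    BlockStackNode **children;  /* sibling children, indexed by "index" */
    size_t nb_children;         /* number of slots in children[] */
};

/* Return the node wrapping bs, creating it on first use. */
static BlockStackNode *bsn_get(BlockDriverState *bs)
{
    if (!bs->node) {
        BlockStackNode *n = calloc(1, sizeof(*n));
        if (!n) {
            return NULL;
        }
        n->bs = bs;
        bs->node = n;
    }
    return bs->node;
}

/* Register child_bs as the child number "index" of bs.
 * Returns 0 on success, -1 on a bad index or allocation failure. */
int bdrv_register_child_bs(BlockDriverState *bs, BlockDriverState *child_bs,
                           int index)
{
    BlockStackNode *parent, *child;

    assert(bs != NULL && child_bs != NULL);
    if (index < 0) {
        return -1;
    }
    parent = bsn_get(bs);
    child = bsn_get(child_bs);
    if (!parent || !child) {
        return -1;
    }
    if ((size_t)index >= parent->nb_children) {
        /* Grow the slot array and mark the new slots as empty. */
        size_t new_nb = (size_t)index + 1;
        BlockStackNode **grown = realloc(parent->children,
                                         new_nb * sizeof(*grown));
        if (!grown) {
            return -1;
        }
        for (size_t i = parent->nb_children; i < new_nb; i++) {
            grown[i] = NULL;
        }
        parent->children = grown;
        parent->nb_children = new_nb;
    }
    parent->children[index] = child;
    child->parent = parent;
    return 0;
}

/* Return the child bs registered at "index", or NULL if there is none. */
BlockDriverState *bdrv_access_child(BlockDriverState *bs, int index)
{
    BlockStackNode *node;

    assert(bs != NULL);
    node = bs->node;
    if (!node || index < 0 || (size_t)index >= node->nb_children ||
        !node->children[index]) {
        return NULL;
    }
    return node->children[index]->bs;
}
```

With something like this, a quorum-like driver would call
bdrv_register_child_bs() once per QCOW2 child with indexes 0, 1 and 2, and
never touch a BlockStackNode directly.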

Drivers would have a

BlockDriverState *bdrv_access_child(bs, int index);

to access their children.

These functions can be implemented without the driver knowing about
BlockStackNodes, using container_of.

blockdev point of view: (here I need your help)

When doing a snapshot, blockdev.c would access
BlockBackend->top_node_below_filters and make a snapshot of the bs contained
in this node and its siblings.

After each individual snapshot, the linked lists and the hash/arrays would be
updated to point to the new top bsn.

The snapshot operation can be done without touching any of the top block
filter BlockDriverStates.

What do you think of this idea?
How would this fit in block.c/blockdev.c?

Best regards

Benoît