qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Fam Zheng <famz@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: "Benoît Canet" <benoit.canet@irqsave.net>,
	jcody@redhat.com, armbru@redhat.com, qemu-devel@nongnu.org,
	stefanha@redhat.com, pbonzini@redhat.com, mreitz@redhat.com
Subject: Re: [Qemu-devel] Block Filters
Date: Fri, 6 Sep 2013 17:18:20 +0800	[thread overview]
Message-ID: <20130906091820.GA24154@T430s.nay.redhat.com> (raw)
In-Reply-To: <20130906084513.GE2588@dhcp-200-207.str.redhat.com>

On Fri, 09/06 10:45, Kevin Wolf wrote:
> Am 06.09.2013 um 09:56 hat Fam Zheng geschrieben:
> > On Tue, 09/03 18:24, Benoît Canet wrote:
> > > 
> > > Hello list,
> > > 
> > > I am thinking about QEMU block filters lately.
> > > 
> > > I am not a block.c/blockdev.c expert so tell me what you think of the following.
> > > 
> > > The use cases I see would be:
> > > 
> > > -$user want to have some real cryptography on top of qcow2/qed or another
> > > format.
> > >  snapshots and other block features should continue to work
> > > 
> > > -$user want to use a raid like feature like QUORUM in QEMU.
> > >  other features should continue to work
> > > 
> > > -$user want to use the future SSD deduplication implementation with metadata on
> > > SSD and data on spinning disks.
> > >  other features should continue to work
> > > 
> > > -$user want to I/O throttle one drive of his vm.
> > > 
> > > -$user want to do Copy On Read
> > > 
> > > -$user want to do a combination of the above
> > > 
> > > -$developer want to make the minimum of required steps to keep changes small
> > > 
> > > -$developer want to keep user interface changes for later
> > > 
> > > Lets take a example case of an user wanting to do I/O throttled encrypted QUORUM
> > > on top of QCOW2.
> > > 
> > > Assuming we want to implement throttle and encryption as something remotely
> > > being like a block filter this makes a pretty complex BlockDriverState tree.
> > > 
> > > The tree would look like the following:
> > > 
> > >                     I/O throttling BlockDriverState (bs)
> > >                                |
> > >                                |
> > >                                |
> > >                                |
> > >                     Encryption BlockDriverState (bs)
> > >                                |
> > >                                |
> > >                                |
> > >                                |
> > >                     Quorum BlockDriverState (bs)
> > >                    /           |           \
> > >                   /            |            \
> > >                  /             |             \
> > >                 /              |              \
> > >             QCOW2 bs       QCOW2 b s       QCOW2 bs
> > >                |               |               |
> > >                |               |               |
> > >                |               |               |
> > >                |               |               |
> > >             RAW bs         RAW bs           RAW bs
> > > 
> > > An external snapshot should result in a tree like the following.
> > >                     I/O throttling BlockDriverState (bs)
> > >                                |
> > >                                |
> > >                                |
> > >                                |
> > >                     Encryption BlockDriverState (bs)
> > >                                |
> > >                                |
> > >                                |
> > >                                |
> > >                     Quorum BlockDriverState (bs)
> > >                    /           |           \
> > >                   /            |            \
> > >                  /             |             \
> > >                 /              |              \
> > >             QCOW2 bs       QCOW2 bs         QCOW2 bs
> > >                |               |               |
> > >                |               |               |
> > >                |               |               |
> > >                |               |               |
> > >             QCOW2 bs       QCOW2 bs         QCOW2 bs
> > >                |               |               |
> > >                |               |               |
> > >                |               |               |
> > >                |               |               |
> > >             RAW bs         RAW bs           RAW bs
> > > 
> > > In the current state of QEMU we can code some block drivers to implement this
> > > tree.
> > > 
> > > However when doing operations like snapshots blockdev.c would have no real idea
> > > of what should be snapshotted and how. (The 3 top bs should be kept on top)
> > > 
> > > Moreover it would have no way to manipulate easily this tree of BlockDriverState
> > > has each one is encapsulated in it's parent.
> > > 
> > > Also there no generic way to tell the block layer that two or more BlockDriverState
> > > are siblings.
> > > 
> > > The current mail is here to propose some additionals structures in order to cope
> > > with these problems.
> > > 
> > > The overall strategy of the proposed structures is to push out the
> > > BlockDriverStates relationships out of each BlockDriverState.
> > > 
> > > The idea is that it would make it easier for the block layer to manipulate a
> > > well known structure instead of being forced to enter into each BlockDriverState
> > > specificity.
> > > 
> > > The first structure is the BlockStackNode.
> > > 
> > > The BlockStateNode would be used to represent the relationship between the
> > > various BlockDriverStates
> > > 
> > > struct BlockStackNode {
> > >     BlockDriverState *bs;  /* the BlockDriverState holded by this node */
> > > 
> > >     /* this doubly linked list entry points to the child node and the parent
> > >      * node
> > >      */
> > >     QLIST_ENTRY(BlockStateNode) down;
> > > 
> > >     /* This doubly linked list entry point to the siblings of this node
> > >      */
> > >     QLIST_ENTRY(BlockStateNode) siblings;
> > > 
> > >     /* a hash or an array of the sibbling of this node for fast access
> > >      * should be recomputed when updating the tree */
> > >     QHASH_ENTRY<BlockStateNode, index> sibblings_hash;
> > > }
> > > 
> > > The BlockBackend would be the structure used to hold the "drive" the guest use.
> > > 
> > > struct BlockBackend {
> > >     /* the following doubly linked list header point to the top BlockStackNode
> > >      * in our case it's the one containing the I/O throttling bs
> > >      */
> > >     QLIST_HEAD(, BlockStateNode) block_stack_head;
> > >     /* this is a pointer to the topest node below the block filter chain
> > >      * in our case the first QCOW2 sibling
> > >      */
> > >     BlockStackNode *top_node_below_filters;
> > > }
> > > 
> > > 
> > > Updated diagram:
> > > 
> > > (Here bsn means BlockStacknode)
> > > 
> > >     ------------------------BlockBackend
> > >     |                             |
> > >     |                          block_stack_head
> > >     |                             |
> > >     |                             |
> > >     |                       I/O throttling BlockStackNode (contains it's bs)
> > >     |                             |
> > >     |                            down
> > >     |                             |
> > >     |                             |
> > > top_node_below_filter     Encryption BlockStacknode (contains it's bs)
> > >     |                             |
> > >     |                            down
> > >     |                             |
> > >     |                             |
> > >     |                Quorum BlockStackNode (contain's it's bs)
> > >     |               /
> > >     |             down
> > >     |             /               
> > >     |            /     S              S
> > >     ------  QCOW2 bsn--i---QCOW2 bsn--i------ QCOW2 bsn (each bsn contains a bs)
> > >                |       b       |      b         |
> > >              down      l      down    l        down
> > >                |       i       |      i         |
> > >                |       n       |      n         |
> > >                |       g       |      g         |
> > >                |       s       |      s         |
> > >                |               |                |
> > >             RAW bsn         RAW bsn           RAW bsn  (each bsn contains a bs)
> > > 
> > > 
> > > Block driver point of view:
> > > 
> > > to construct the tree each BlockDriver would have some utility functions looking
> > > like.
> > > 
> > > bdrv_register_child_bs(bs, child_bs, int index);
> > > 
> > > multiples calls to this function could be done to register multiple siblings
> > > childs identified by their index.
> > > 
> > > This way something like quorum could register multiple QCOW2 instances.
> > > 
> > > driver would have a
> > > BlockDriverSTate *bdrv_access_child(bs, int index);
> > > 
> > > to access their childs.
> > > 
> > > These functions can be implemented without the driver knowing about
> > > BlockStateNodes using container_of.
> > > 
> > > blockdev point of view: (here I need your help)
> > > 
> > > When doing a snapshot blockdev.c would access
> > > BlockBackend->top_node_below_filter and make a snapshot of the bs contained in
> > > this node and it's sibblings.
> > > 
> > Since BlockDriver.bdrv_snapshot_create() is an optional operation, blockdev.c
> > can navigate down the tree from top node, until hitting some layer where the op
> > is implemented (the QCow2 bs), so we get rid of this top_node_below_filter
> > pointer.
> 
> Is it even inherent to a block driver (like a filter), if a snapshot is
> to be taken at its level? Or is it rather a policy decision that should
> be made by the user?
> 
OK, getting the point that user should have full flexibility and fine operation
granularity. It also stands against block_backend->top_node_below_filter. Do we
really have the assumption that all the filters are on top of the tree and linear?
Shouldn't this be possible?

                   Block Backend
                         |
                         |
                    Quodrum BDS
                    /    |    \
             iot filter  |     \
                  /      |      \
                qcow2   qcow2   qcow2

So we throttle only a particular image, not the whole device. But this will
make a top_node_below_filter pointer impossible.

> In our example, the quorum driver, it's not at all clear to me that you
> want to snapshot all children. In order to roll back to a previous
> state, one snapshot is enough, you don't need multiple copies of the
> same one. Perhaps you want two so that we can still compare them for
> verification. Or all of them because you can afford the disk space and
> want ultimate safety. I don't think qemu can know which one is true.
> 
Only if quorum ever knows about and operates on snapshots, it should be
considered specifically, but no. So we need to achieve this in the general
design: allow user to take snapshot, or set throttle limits on particular
BDSes, as above graph.

> In the same way, in a typical case you may want to keep I/O throttling
> for the whole drive, including the new snapshot. But what if the
> throttling was used in order to not overload the network where the image
> is stored, and you're now doing a local snapshot, to which you want to
> stream the image? The I/O throttling should apply only to the backing
> file, not the new snapshot.
> 
Yes, and OTOH, throttling really suits to be a filter only if it can be a non
top one, otherwise it's no better than what we have now.

> So perhaps what we really need is a more flexible snapshot/BDS tree
> manipulation command that describes in detail which structure you want
> to have in the end.
> 

Fam

  reply	other threads:[~2013-09-06  9:18 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-09-03 16:24 [Qemu-devel] Block Filters Benoît Canet
2013-09-04  8:29 ` Stefan Hajnoczi
2013-09-04 18:15   ` Benoît Canet
2013-09-05  7:32     ` Kevin Wolf
2013-09-05 10:01     ` Stefan Hajnoczi
2013-09-05 10:18       ` Fam Zheng
2013-09-05 14:59         ` Stefan Hajnoczi
2013-09-05 17:29         ` Benoît Canet
2013-09-06  7:42           ` Kevin Wolf
2013-09-06  7:49             ` Fam Zheng
2013-09-06  7:56 ` Fam Zheng
2013-09-06  8:45   ` Kevin Wolf
2013-09-06  9:18     ` Fam Zheng [this message]
2013-09-06  9:55       ` Kevin Wolf
2013-09-06 10:50         ` Fam Zheng
2013-09-15 18:10         ` Benoît Canet
2013-09-16  7:41           ` Fam Zheng
2013-09-16 10:43             ` Benoît Canet
2013-09-16 11:00             ` Benoît Canet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130906091820.GA24154@T430s.nay.redhat.com \
    --to=famz@redhat.com \
    --cc=armbru@redhat.com \
    --cc=benoit.canet@irqsave.net \
    --cc=jcody@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).