* [Qemu-devel] Block Filters
@ 2013-09-03 16:24 Benoît Canet
2013-09-04 8:29 ` Stefan Hajnoczi
2013-09-06 7:56 ` Fam Zheng
0 siblings, 2 replies; 19+ messages in thread
From: Benoît Canet @ 2013-09-03 16:24 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, famz, jcody, armbru, mreitz, stefanha, pbonzini
Hello list,
I am thinking about QEMU block filters lately.
I am not a block.c/blockdev.c expert so tell me what you think of the following.
The use cases I see would be:
-$user want to have some real cryptography on top of qcow2/qed or another
format.
snapshots and other block features should continue to work
-$user want to use a raid like feature like QUORUM in QEMU.
other features should continue to work
-$user want to use the future SSD deduplication implementation with metadata on
SSD and data on spinning disks.
other features should continue to work
-$user want to I/O throttle one drive of his vm.
-$user want to do Copy On Read
-$user want to do a combination of the above
-$developer want to make the minimum of required steps to keep changes small
-$developer want to keep user interface changes for later
Lets take a example case of an user wanting to do I/O throttled encrypted QUORUM
on top of QCOW2.
Assuming we want to implement throttle and encryption as something remotely
being like a block filter this makes a pretty complex BlockDriverState tree.
The tree would look like the following:
I/O throttling BlockDriverState (bs)
|
|
|
|
Encryption BlockDriverState (bs)
|
|
|
|
Quorum BlockDriverState (bs)
/ | \
/ | \
/ | \
/ | \
QCOW2 bs QCOW2 b s QCOW2 bs
| | |
| | |
| | |
| | |
RAW bs RAW bs RAW bs
An external snapshot should result in a tree like the following.
I/O throttling BlockDriverState (bs)
|
|
|
|
Encryption BlockDriverState (bs)
|
|
|
|
Quorum BlockDriverState (bs)
/ | \
/ | \
/ | \
/ | \
QCOW2 bs QCOW2 bs QCOW2 bs
| | |
| | |
| | |
| | |
QCOW2 bs QCOW2 bs QCOW2 bs
| | |
| | |
| | |
| | |
RAW bs RAW bs RAW bs
In the current state of QEMU we can code some block drivers to implement this
tree.
However when doing operations like snapshots blockdev.c would have no real idea
of what should be snapshotted and how. (The 3 top bs should be kept on top)
Moreover it would have no way to manipulate easily this tree of BlockDriverState
has each one is encapsulated in it's parent.
Also there no generic way to tell the block layer that two or more BlockDriverState
are siblings.
The current mail is here to propose some additionals structures in order to cope
with these problems.
The overall strategy of the proposed structures is to push out the
BlockDriverStates relationships out of each BlockDriverState.
The idea is that it would make it easier for the block layer to manipulate a
well known structure instead of being forced to enter into each BlockDriverState
specificity.
The first structure is the BlockStackNode.
The BlockStateNode would be used to represent the relationship between the
various BlockDriverStates
struct BlockStackNode {
BlockDriverState *bs; /* the BlockDriverState holded by this node */
/* this doubly linked list entry points to the child node and the parent
* node
*/
QLIST_ENTRY(BlockStateNode) down;
/* This doubly linked list entry point to the siblings of this node
*/
QLIST_ENTRY(BlockStateNode) siblings;
/* a hash or an array of the sibbling of this node for fast access
* should be recomputed when updating the tree */
QHASH_ENTRY<BlockStateNode, index> sibblings_hash;
}
The BlockBackend would be the structure used to hold the "drive" the guest use.
struct BlockBackend {
/* the following doubly linked list header point to the top BlockStackNode
* in our case it's the one containing the I/O throttling bs
*/
QLIST_HEAD(, BlockStateNode) block_stack_head;
/* this is a pointer to the topest node below the block filter chain
* in our case the first QCOW2 sibling
*/
BlockStackNode *top_node_below_filters;
}
Updated diagram:
(Here bsn means BlockStacknode)
------------------------BlockBackend
| |
| block_stack_head
| |
| |
| I/O throttling BlockStackNode (contains it's bs)
| |
| down
| |
| |
top_node_below_filter Encryption BlockStacknode (contains it's bs)
| |
| down
| |
| |
| Quorum BlockStackNode (contain's it's bs)
| /
| down
| /
| / S S
------ QCOW2 bsn--i---QCOW2 bsn--i------ QCOW2 bsn (each bsn contains a bs)
| b | b |
down l down l down
| i | i |
| n | n |
| g | g |
| s | s |
| | |
RAW bsn RAW bsn RAW bsn (each bsn contains a bs)
Block driver point of view:
to construct the tree each BlockDriver would have some utility functions looking
like.
bdrv_register_child_bs(bs, child_bs, int index);
multiples calls to this function could be done to register multiple siblings
childs identified by their index.
This way something like quorum could register multiple QCOW2 instances.
driver would have a
BlockDriverSTate *bdrv_access_child(bs, int index);
to access their childs.
These functions can be implemented without the driver knowing about
BlockStateNodes using container_of.
blockdev point of view: (here I need your help)
When doing a snapshot blockdev.c would access
BlockBackend->top_node_below_filter and make a snapshot of the bs contained in
this node and it's sibblings.
After each individual snapshot the linked lists and the hash/arrays would be
updated to point to the new top bsn.
The snapshot operation can be done without violating any of the top block
filter BlockDriverState.
What do you think of this idea ?
How this would fit in block.c/blockdev.c ?
Best regards
Benoît
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-03 16:24 [Qemu-devel] Block Filters Benoît Canet
@ 2013-09-04 8:29 ` Stefan Hajnoczi
2013-09-04 18:15 ` Benoît Canet
2013-09-06 7:56 ` Fam Zheng
1 sibling, 1 reply; 19+ messages in thread
From: Stefan Hajnoczi @ 2013-09-04 8:29 UTC (permalink / raw)
To: Benoît Canet
Cc: kwolf, famz, jcody, armbru, qemu-devel, pbonzini, mreitz
On Tue, Sep 03, 2013 at 06:24:49PM +0200, Benoît Canet wrote:
> -$user want to do Copy On Read
This feature is currently implemented in the read code path in block.c.
Putting it into a separate, stackable module is fine but may require a
per-device request queue. Today every BDS has its own request queue and
I think that starts to become inefficient if every small stackable
feature duplicates all request metadata.
> Moreover it would have no way to manipulate easily this tree of BlockDriverState
> has each one is encapsulated in it's parent.
>
> Also there no generic way to tell the block layer that two or more BlockDriverState
> are siblings.
vmdk also has this issue. block.c provides bs->file and bs->backing_hd
but it doesn't support multi-file VMDK's natively. So the vmdk.c has to
implement this internally.
Perhaps VMDK could be simplified by using generic infrastructure for
siblings.
> When doing a snapshot blockdev.c would access
> BlockBackend->top_node_below_filter and make a snapshot of the bs contained in
> this node and it's sibblings.
I'm not a fan of ->top_node_below_filter. It seems ad-hoc to me.
The snapshot operation should be part of the function interface that
block drivers can implement. Filters would forward it to their only
child in the simple case. Quorum would perform the snapshot on each of
its children. Image formats would allow the snapshot to happen.
If we need slightly different semantics from ->top_node_below_filter
then we'll need to introduce more ad-hoc fields. I think the cleanest
solution is simply to ask the block drivers to perform the operation and
provide default implementations for the common case.
> After each individual snapshot the linked lists and the hash/arrays would be
> updated to point to the new top bsn.
> The snapshot operation can be done without violating any of the top block
> filter BlockDriverState.
I think we should add bs->parent and bs->children[] fields to BDS and
APIs to manipulate the tree.
Propagate operations like snapshot down the tree. block.c is designed
for bs->file/bs->backing_hd kind of BlockDrivers, perhaps it needs to
become a bit more generic to support other types of BlockDrivers
properly.
BTW I don't see a need to create a separate tree structure.
Stefan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-04 8:29 ` Stefan Hajnoczi
@ 2013-09-04 18:15 ` Benoît Canet
2013-09-05 7:32 ` Kevin Wolf
2013-09-05 10:01 ` Stefan Hajnoczi
0 siblings, 2 replies; 19+ messages in thread
From: Benoît Canet @ 2013-09-04 18:15 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Benoît Canet, kwolf, famz, jcody, armbru, qemu-devel,
pbonzini, mreitz
> Propagate operations like snapshot down the tree. block.c is designed
> for bs->file/bs->backing_hd kind of BlockDrivers, perhaps it needs to
> become a bit more generic to support other types of BlockDrivers
> properly.
Shouldn't bs->backing_hd become bs->children[0] and bs->file stay the same ?
Best regards
Benoît
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-04 18:15 ` Benoît Canet
@ 2013-09-05 7:32 ` Kevin Wolf
2013-09-05 10:01 ` Stefan Hajnoczi
1 sibling, 0 replies; 19+ messages in thread
From: Kevin Wolf @ 2013-09-05 7:32 UTC (permalink / raw)
To: Benoît Canet
Cc: famz, jcody, armbru, qemu-devel, Stefan Hajnoczi, pbonzini,
mreitz
Am 04.09.2013 um 20:15 hat Benoît Canet geschrieben:
> > Propagate operations like snapshot down the tree. block.c is designed
> > for bs->file/bs->backing_hd kind of BlockDrivers, perhaps it needs to
> > become a bit more generic to support other types of BlockDrivers
> > properly.
>
> Shouldn't bs->backing_hd become bs->children[0] and bs->file stay the same ?
Intuitively I'd say that bs->children[] would contain _all_ children,
including bs->file and bs->backing_hd.
In the end it all depends on what bs->children[] would be used for.
I'm not even sure if the generic block layer has any use for this
information.
Kevin
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-04 18:15 ` Benoît Canet
2013-09-05 7:32 ` Kevin Wolf
@ 2013-09-05 10:01 ` Stefan Hajnoczi
2013-09-05 10:18 ` Fam Zheng
1 sibling, 1 reply; 19+ messages in thread
From: Stefan Hajnoczi @ 2013-09-05 10:01 UTC (permalink / raw)
To: Benoît Canet
Cc: kwolf, famz, jcody, armbru, qemu-devel, pbonzini, mreitz
On Wed, Sep 04, 2013 at 08:15:36PM +0200, Benoît Canet wrote:
> > Propagate operations like snapshot down the tree. block.c is designed
> > for bs->file/bs->backing_hd kind of BlockDrivers, perhaps it needs to
> > become a bit more generic to support other types of BlockDrivers
> > properly.
>
> Shouldn't bs->backing_hd become bs->children[0] and bs->file stay the same ?
bs->backing_hd and bs->file exist so that block.c can implement generic
functionality shared by a lot of block drivers. They are for code
reuse.
Many places in the block layer have come to assume that there is only
one ->file, for example. That's not true for quorum or even vmdk.
If we forget about code reuse for a second, and just think of BDS trees,
then both ->backing_hd and ->file should be in ->children[].
The block driver can put constraints on ->children[], e.g. how many
children are allowed, read/write access, etc. .bdrv_open() can look at
->children[] and set things up.
I think the problem we have is that too much of QEMU uses ->file and
->backing_hd. It's kind of baked in now to the point where more
flexible block drivers like quorum are hard to represent.
->backing_hd and ->file are mostly image format concepts. Filters and
protocols could do without them.
After saying all that, I don't have a design that makes everything
better :P.
Stefan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-05 10:01 ` Stefan Hajnoczi
@ 2013-09-05 10:18 ` Fam Zheng
2013-09-05 14:59 ` Stefan Hajnoczi
2013-09-05 17:29 ` Benoît Canet
0 siblings, 2 replies; 19+ messages in thread
From: Fam Zheng @ 2013-09-05 10:18 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Benoît Canet, kwolf, jcody, armbru, qemu-devel, pbonzini,
mreitz
On Thu, 09/05 12:01, Stefan Hajnoczi wrote:
> On Wed, Sep 04, 2013 at 08:15:36PM +0200, Benoît Canet wrote:
> > > Propagate operations like snapshot down the tree. block.c is designed
> > > for bs->file/bs->backing_hd kind of BlockDrivers, perhaps it needs to
> > > become a bit more generic to support other types of BlockDrivers
> > > properly.
> >
> > Shouldn't bs->backing_hd become bs->children[0] and bs->file stay the same ?
>
> bs->backing_hd and bs->file exist so that block.c can implement generic
> functionality shared by a lot of block drivers. They are for code
> reuse.
>
> Many places in the block layer have come to assume that there is only
> one ->file, for example. That's not true for quorum or even vmdk.
>
> If we forget about code reuse for a second, and just think of BDS trees,
> then both ->backing_hd and ->file should be in ->children[].
>
> I think the problem we have is that too much of QEMU uses ->file and
> ->backing_hd. It's kind of baked in now to the point where more
> flexible block drivers like quorum are hard to represent.
>
> ->backing_hd and ->file are mostly image format concepts. Filters and
> protocols could do without them.
>
> After saying all that, I don't have a design that makes everything
> better :P.
>
Maybe we could start from a generic scheme and add specific operations upon:
I propose we let bs->children[] keep all the node connections, including
backing_hd and file, then leave the BlockDriver, BlockFilter or BlockProtocol
to implement ->get_backing_hd(), ->get_file_hd(), or even ->get_files(), or
mark an operation as NULL. These operations give semantics to its children (of
course they need some semantics to actually be useful), but it's orthogonal to
management of elements in the object tree.
Fam
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-05 10:18 ` Fam Zheng
@ 2013-09-05 14:59 ` Stefan Hajnoczi
2013-09-05 17:29 ` Benoît Canet
1 sibling, 0 replies; 19+ messages in thread
From: Stefan Hajnoczi @ 2013-09-05 14:59 UTC (permalink / raw)
To: Fam Zheng
Cc: Benoît Canet, kwolf, jcody, armbru, qemu-devel, pbonzini,
mreitz
On Thu, Sep 05, 2013 at 06:18:45PM +0800, Fam Zheng wrote:
> On Thu, 09/05 12:01, Stefan Hajnoczi wrote:
> > On Wed, Sep 04, 2013 at 08:15:36PM +0200, Benoît Canet wrote:
> > > > Propagate operations like snapshot down the tree. block.c is designed
> > > > for bs->file/bs->backing_hd kind of BlockDrivers, perhaps it needs to
> > > > become a bit more generic to support other types of BlockDrivers
> > > > properly.
> > >
> > > Shouldn't bs->backing_hd become bs->children[0] and bs->file stay the same ?
> >
> > bs->backing_hd and bs->file exist so that block.c can implement generic
> > functionality shared by a lot of block drivers. They are for code
> > reuse.
> >
> > Many places in the block layer have come to assume that there is only
> > one ->file, for example. That's not true for quorum or even vmdk.
> >
> > If we forget about code reuse for a second, and just think of BDS trees,
> > then both ->backing_hd and ->file should be in ->children[].
> >
> > I think the problem we have is that too much of QEMU uses ->file and
> > ->backing_hd. It's kind of baked in now to the point where more
> > flexible block drivers like quorum are hard to represent.
> >
> > ->backing_hd and ->file are mostly image format concepts. Filters and
> > protocols could do without them.
> >
> > After saying all that, I don't have a design that makes everything
> > better :P.
> >
>
> Maybe we could start from a generic scheme and add specific operations upon:
>
> I propose we let bs->children[] keep all the node connections, including
> backing_hd and file, then leave the BlockDriver, BlockFilter or BlockProtocol
> to implement ->get_backing_hd(), ->get_file_hd(), or even ->get_files(), or
> mark an operation as NULL. These operations give semantics to its children (of
> course they need some semantics to actually be useful), but it's orthogonal to
> management of elements in the object tree.
Yes, plus a mapping to the command-line/QOM (e.g.
backing.cache=writeback instead of children[1].cache=writeback).
Stefan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-05 10:18 ` Fam Zheng
2013-09-05 14:59 ` Stefan Hajnoczi
@ 2013-09-05 17:29 ` Benoît Canet
2013-09-06 7:42 ` Kevin Wolf
1 sibling, 1 reply; 19+ messages in thread
From: Benoît Canet @ 2013-09-05 17:29 UTC (permalink / raw)
To: Fam Zheng
Cc: Benoît Canet, kwolf, jcody, armbru, qemu-devel,
Stefan Hajnoczi, pbonzini, mreitz
Le Thursday 05 Sep 2013 à 18:18:45 (+0800), Fam Zheng a écrit :
> On Thu, 09/05 12:01, Stefan Hajnoczi wrote:
> > On Wed, Sep 04, 2013 at 08:15:36PM +0200, Benoît Canet wrote:
> > > > Propagate operations like snapshot down the tree. block.c is designed
> > > > for bs->file/bs->backing_hd kind of BlockDrivers, perhaps it needs to
> > > > become a bit more generic to support other types of BlockDrivers
> > > > properly.
> > >
> > > Shouldn't bs->backing_hd become bs->children[0] and bs->file stay the same ?
> >
> > bs->backing_hd and bs->file exist so that block.c can implement generic
> > functionality shared by a lot of block drivers. They are for code
> > reuse.
> >
> > Many places in the block layer have come to assume that there is only
> > one ->file, for example. That's not true for quorum or even vmdk.
> >
> > If we forget about code reuse for a second, and just think of BDS trees,
> > then both ->backing_hd and ->file should be in ->children[].
> >
> > I think the problem we have is that too much of QEMU uses ->file and
> > ->backing_hd. It's kind of baked in now to the point where more
> > flexible block drivers like quorum are hard to represent.
> >
> > ->backing_hd and ->file are mostly image format concepts. Filters and
> > protocols could do without them.
> >
> > After saying all that, I don't have a design that makes everything
> > better :P.
> >
>
> Maybe we could start from a generic scheme and add specific operations upon:
>
> I propose we let bs->children[] keep all the node connections, including
> backing_hd and file, then leave the BlockDriver, BlockFilter or BlockProtocol
> to implement ->get_backing_hd(), ->get_file_hd(), or even ->get_files(), or
> mark an operation as NULL. These operations give semantics to its children (of
> course they need some semantics to actually be useful), but it's orthogonal to
> management of elements in the object tree.
So would bs->children[] be manipulated directly by each BlockDriver, BlockFilter
BlockProtocol implying the their code have an exact knowledge of the index of
each bs in bs->children[] ?
Best regards
Benoît
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-05 17:29 ` Benoît Canet
@ 2013-09-06 7:42 ` Kevin Wolf
2013-09-06 7:49 ` Fam Zheng
0 siblings, 1 reply; 19+ messages in thread
From: Kevin Wolf @ 2013-09-06 7:42 UTC (permalink / raw)
To: Benoît Canet
Cc: Fam Zheng, jcody, armbru, qemu-devel, Stefan Hajnoczi, pbonzini,
mreitz
Am 05.09.2013 um 19:29 hat Benoît Canet geschrieben:
> Le Thursday 05 Sep 2013 à 18:18:45 (+0800), Fam Zheng a écrit :
> > On Thu, 09/05 12:01, Stefan Hajnoczi wrote:
> > > On Wed, Sep 04, 2013 at 08:15:36PM +0200, Benoît Canet wrote:
> > > > > Propagate operations like snapshot down the tree. block.c is designed
> > > > > for bs->file/bs->backing_hd kind of BlockDrivers, perhaps it needs to
> > > > > become a bit more generic to support other types of BlockDrivers
> > > > > properly.
> > > >
> > > > Shouldn't bs->backing_hd become bs->children[0] and bs->file stay the same ?
> > >
> > > bs->backing_hd and bs->file exist so that block.c can implement generic
> > > functionality shared by a lot of block drivers. They are for code
> > > reuse.
> > >
> > > Many places in the block layer have come to assume that there is only
> > > one ->file, for example. That's not true for quorum or even vmdk.
> > >
> > > If we forget about code reuse for a second, and just think of BDS trees,
> > > then both ->backing_hd and ->file should be in ->children[].
> > >
> > > I think the problem we have is that too much of QEMU uses ->file and
> > > ->backing_hd. It's kind of baked in now to the point where more
> > > flexible block drivers like quorum are hard to represent.
> > >
> > > ->backing_hd and ->file are mostly image format concepts. Filters and
> > > protocols could do without them.
> > >
> > > After saying all that, I don't have a design that makes everything
> > > better :P.
> > >
> >
> > Maybe we could start from a generic scheme and add specific operations upon:
> >
> > I propose we let bs->children[] keep all the node connections, including
> > backing_hd and file, then leave the BlockDriver, BlockFilter or BlockProtocol
> > to implement ->get_backing_hd(), ->get_file_hd(), or even ->get_files(), or
> > mark an operation as NULL. These operations give semantics to its children (of
> > course they need some semantics to actually be useful), but it's orthogonal to
> > management of elements in the object tree.
>
> So would bs->children[] be manipulated directly by each BlockDriver, BlockFilter
> BlockProtocol implying the their code have an exact knowledge of the index of
> each bs in bs->children[] ?
I think the driver implementation is the only one who know how to do it.
Is creating external snapshots still the only use case for bs->children[]
in the generic block layer? Because if so, we could indeed rather move
the snapshotting operation into the BlockDrivers.
Oh, and why are you talking about "BlockDriver, BlockFilter,
BlockProtocol"? What's the difference between them and why isn't the one
BlockDriver struct that we have today enough any more?
Kevin
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-06 7:42 ` Kevin Wolf
@ 2013-09-06 7:49 ` Fam Zheng
0 siblings, 0 replies; 19+ messages in thread
From: Fam Zheng @ 2013-09-06 7:49 UTC (permalink / raw)
To: Kevin Wolf
Cc: Benoît Canet, jcody, armbru, qemu-devel, Stefan Hajnoczi,
pbonzini, mreitz
On Fri, 09/06 09:42, Kevin Wolf wrote:
> Am 05.09.2013 um 19:29 hat Benoît Canet geschrieben:
> > Le Thursday 05 Sep 2013 à 18:18:45 (+0800), Fam Zheng a écrit :
> > > On Thu, 09/05 12:01, Stefan Hajnoczi wrote:
> > > > On Wed, Sep 04, 2013 at 08:15:36PM +0200, Benoît Canet wrote:
> > > > > > Propagate operations like snapshot down the tree. block.c is designed
> > > > > > for bs->file/bs->backing_hd kind of BlockDrivers, perhaps it needs to
> > > > > > become a bit more generic to support other types of BlockDrivers
> > > > > > properly.
> > > > >
> > > > > Shouldn't bs->backing_hd become bs->children[0] and bs->file stay the same ?
> > > >
> > > > bs->backing_hd and bs->file exist so that block.c can implement generic
> > > > functionality shared by a lot of block drivers. They are for code
> > > > reuse.
> > > >
> > > > Many places in the block layer have come to assume that there is only
> > > > one ->file, for example. That's not true for quorum or even vmdk.
> > > >
> > > > If we forget about code reuse for a second, and just think of BDS trees,
> > > > then both ->backing_hd and ->file should be in ->children[].
> > > >
> > > > I think the problem we have is that too much of QEMU uses ->file and
> > > > ->backing_hd. It's kind of baked in now to the point where more
> > > > flexible block drivers like quorum are hard to represent.
> > > >
> > > > ->backing_hd and ->file are mostly image format concepts. Filters and
> > > > protocols could do without them.
> > > >
> > > > After saying all that, I don't have a design that makes everything
> > > > better :P.
> > > >
> > >
> > > Maybe we could start from a generic scheme and add specific operations upon:
> > >
> > > I propose we let bs->children[] keep all the node connections, including
> > > backing_hd and file, then leave the BlockDriver, BlockFilter or BlockProtocol
> > > to implement ->get_backing_hd(), ->get_file_hd(), or even ->get_files(), or
> > > mark an operation as NULL. These operations give semantics to its children (of
> > > course they need some semantics to actually be useful), but it's orthogonal to
> > > management of elements in the object tree.
> >
> > So would bs->children[] be manipulated directly by each BlockDriver, BlockFilter
> > BlockProtocol implying the their code have an exact knowledge of the index of
> > each bs in bs->children[] ?
>
> I think the driver implementation is the only one who know how to do it.
>
> Is creating external snapshots still the only use case for bs->children[]
> in the generic block layer? Because if so, we could indeed rather move
> the snapshotting operation into the BlockDrivers.
>
> Oh, and why are you talking about "BlockDriver, BlockFilter,
> BlockProtocol"? What's the difference between them and why isn't the one
> BlockDriver struct that we have today enough any more?
I used the terms with their functional type in mind, didn't mean it's a actual
struct in code, they are just block drivers that implement these functions.
Fam
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-03 16:24 [Qemu-devel] Block Filters Benoît Canet
2013-09-04 8:29 ` Stefan Hajnoczi
@ 2013-09-06 7:56 ` Fam Zheng
2013-09-06 8:45 ` Kevin Wolf
1 sibling, 1 reply; 19+ messages in thread
From: Fam Zheng @ 2013-09-06 7:56 UTC (permalink / raw)
To: Benoît Canet
Cc: kwolf, jcody, armbru, qemu-devel, stefanha, pbonzini, mreitz
On Tue, 09/03 18:24, Benoît Canet wrote:
>
> Hello list,
>
> I am thinking about QEMU block filters lately.
>
> I am not a block.c/blockdev.c expert so tell me what you think of the following.
>
> The use cases I see would be:
>
> -$user want to have some real cryptography on top of qcow2/qed or another
> format.
> snapshots and other block features should continue to work
>
> -$user want to use a raid like feature like QUORUM in QEMU.
> other features should continue to work
>
> -$user want to use the future SSD deduplication implementation with metadata on
> SSD and data on spinning disks.
> other features should continue to work
>
> -$user want to I/O throttle one drive of his vm.
>
> -$user want to do Copy On Read
>
> -$user want to do a combination of the above
>
> -$developer want to make the minimum of required steps to keep changes small
>
> -$developer want to keep user interface changes for later
>
> Lets take a example case of an user wanting to do I/O throttled encrypted QUORUM
> on top of QCOW2.
>
> Assuming we want to implement throttle and encryption as something remotely
> being like a block filter this makes a pretty complex BlockDriverState tree.
>
> The tree would look like the following:
>
> I/O throttling BlockDriverState (bs)
> |
> |
> |
> |
> Encryption BlockDriverState (bs)
> |
> |
> |
> |
> Quorum BlockDriverState (bs)
> / | \
> / | \
> / | \
> / | \
> QCOW2 bs QCOW2 b s QCOW2 bs
> | | |
> | | |
> | | |
> | | |
> RAW bs RAW bs RAW bs
>
> An external snapshot should result in a tree like the following.
> I/O throttling BlockDriverState (bs)
> |
> |
> |
> |
> Encryption BlockDriverState (bs)
> |
> |
> |
> |
> Quorum BlockDriverState (bs)
> / | \
> / | \
> / | \
> / | \
> QCOW2 bs QCOW2 bs QCOW2 bs
> | | |
> | | |
> | | |
> | | |
> QCOW2 bs QCOW2 bs QCOW2 bs
> | | |
> | | |
> | | |
> | | |
> RAW bs RAW bs RAW bs
>
> In the current state of QEMU we can code some block drivers to implement this
> tree.
>
> However when doing operations like snapshots blockdev.c would have no real idea
> of what should be snapshotted and how. (The 3 top bs should be kept on top)
>
> Moreover it would have no way to manipulate easily this tree of BlockDriverState
> has each one is encapsulated in it's parent.
>
> Also there no generic way to tell the block layer that two or more BlockDriverState
> are siblings.
>
> The current mail is here to propose some additionals structures in order to cope
> with these problems.
>
> The overall strategy of the proposed structures is to push out the
> BlockDriverStates relationships out of each BlockDriverState.
>
> The idea is that it would make it easier for the block layer to manipulate a
> well known structure instead of being forced to enter into each BlockDriverState
> specificity.
>
> The first structure is the BlockStackNode.
>
> The BlockStateNode would be used to represent the relationship between the
> various BlockDriverStates
>
> struct BlockStackNode {
> BlockDriverState *bs; /* the BlockDriverState holded by this node */
>
> /* this doubly linked list entry points to the child node and the parent
> * node
> */
> QLIST_ENTRY(BlockStateNode) down;
>
> /* This doubly linked list entry point to the siblings of this node
> */
> QLIST_ENTRY(BlockStateNode) siblings;
>
> /* a hash or an array of the sibbling of this node for fast access
> * should be recomputed when updating the tree */
> QHASH_ENTRY<BlockStateNode, index> sibblings_hash;
> }
>
> The BlockBackend would be the structure used to hold the "drive" the guest use.
>
> struct BlockBackend {
> /* the following doubly linked list header point to the top BlockStackNode
> * in our case it's the one containing the I/O throttling bs
> */
> QLIST_HEAD(, BlockStateNode) block_stack_head;
> /* this is a pointer to the topest node below the block filter chain
> * in our case the first QCOW2 sibling
> */
> BlockStackNode *top_node_below_filters;
> }
>
>
> Updated diagram:
>
> (Here bsn means BlockStacknode)
>
> ------------------------BlockBackend
> | |
> | block_stack_head
> | |
> | |
> | I/O throttling BlockStackNode (contains it's bs)
> | |
> | down
> | |
> | |
> top_node_below_filter Encryption BlockStacknode (contains it's bs)
> | |
> | down
> | |
> | |
> | Quorum BlockStackNode (contain's it's bs)
> | /
> | down
> | /
> | / S S
> ------ QCOW2 bsn--i---QCOW2 bsn--i------ QCOW2 bsn (each bsn contains a bs)
> | b | b |
> down l down l down
> | i | i |
> | n | n |
> | g | g |
> | s | s |
> | | |
> RAW bsn RAW bsn RAW bsn (each bsn contains a bs)
>
>
> Block driver point of view:
>
> to construct the tree each BlockDriver would have some utility functions looking
> like.
>
> bdrv_register_child_bs(bs, child_bs, int index);
>
> multiples calls to this function could be done to register multiple siblings
> childs identified by their index.
>
> This way something like quorum could register multiple QCOW2 instances.
>
> driver would have a
> BlockDriverSTate *bdrv_access_child(bs, int index);
>
> to access their childs.
>
> These functions can be implemented without the driver knowing about
> BlockStateNodes using container_of.
>
> blockdev point of view: (here I need your help)
>
> When doing a snapshot blockdev.c would access
> BlockBackend->top_node_below_filter and make a snapshot of the bs contained in
> this node and it's sibblings.
>
Since BlockDriver.bdrv_snapshot_create() is an optional operation, blockdev.c
can navigate down the tree from top node, until hitting some layer where the op
is implemented (the QCow2 bs), so we get rid of this top_node_below_filter
pointer.
Is this the only use case of top_node_below_filter?
Fam
> After each individual snapshot the linked lists and the hash/arrays would be
> updated to point to the new top bsn.
> The snapshot operation can be done without violating any of the top block
> filter BlockDriverState.
>
> What do you think of this idea ?
> How this would fit in block.c/blockdev.c ?
>
> Best regards
>
> Benoît
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-06 7:56 ` Fam Zheng
@ 2013-09-06 8:45 ` Kevin Wolf
2013-09-06 9:18 ` Fam Zheng
0 siblings, 1 reply; 19+ messages in thread
From: Kevin Wolf @ 2013-09-06 8:45 UTC (permalink / raw)
To: Fam Zheng
Cc: Benoît Canet, jcody, armbru, qemu-devel, stefanha, pbonzini,
mreitz
Am 06.09.2013 um 09:56 hat Fam Zheng geschrieben:
> On Tue, 09/03 18:24, Benoît Canet wrote:
> >
> > Hello list,
> >
> > I am thinking about QEMU block filters lately.
> >
> > I am not a block.c/blockdev.c expert so tell me what you think of the following.
> >
> > The use cases I see would be:
> >
> > -$user want to have some real cryptography on top of qcow2/qed or another
> > format.
> > snapshots and other block features should continue to work
> >
> > -$user want to use a raid like feature like QUORUM in QEMU.
> > other features should continue to work
> >
> > -$user want to use the future SSD deduplication implementation with metadata on
> > SSD and data on spinning disks.
> > other features should continue to work
> >
> > -$user want to I/O throttle one drive of his vm.
> >
> > -$user want to do Copy On Read
> >
> > -$user want to do a combination of the above
> >
> > -$developer want to make the minimum of required steps to keep changes small
> >
> > -$developer want to keep user interface changes for later
> >
> > Lets take a example case of an user wanting to do I/O throttled encrypted QUORUM
> > on top of QCOW2.
> >
> > Assuming we want to implement throttle and encryption as something remotely
> > being like a block filter this makes a pretty complex BlockDriverState tree.
> >
> > The tree would look like the following:
> >
> > I/O throttling BlockDriverState (bs)
> > |
> > |
> > |
> > |
> > Encryption BlockDriverState (bs)
> > |
> > |
> > |
> > |
> > Quorum BlockDriverState (bs)
> > / | \
> > / | \
> > / | \
> > / | \
> > QCOW2 bs QCOW2 b s QCOW2 bs
> > | | |
> > | | |
> > | | |
> > | | |
> > RAW bs RAW bs RAW bs
> >
> > An external snapshot should result in a tree like the following.
> > I/O throttling BlockDriverState (bs)
> > |
> > |
> > |
> > |
> > Encryption BlockDriverState (bs)
> > |
> > |
> > |
> > |
> > Quorum BlockDriverState (bs)
> > / | \
> > / | \
> > / | \
> > / | \
> > QCOW2 bs QCOW2 bs QCOW2 bs
> > | | |
> > | | |
> > | | |
> > | | |
> > QCOW2 bs QCOW2 bs QCOW2 bs
> > | | |
> > | | |
> > | | |
> > | | |
> > RAW bs RAW bs RAW bs
> >
> > In the current state of QEMU we can code some block drivers to implement this
> > tree.
> >
> > However when doing operations like snapshots blockdev.c would have no real idea
> > of what should be snapshotted and how. (The 3 top bs should be kept on top)
> >
> > Moreover it would have no way to manipulate easily this tree of BlockDriverState
> > has each one is encapsulated in it's parent.
> >
> > Also there no generic way to tell the block layer that two or more BlockDriverState
> > are siblings.
> >
> > The current mail is here to propose some additionals structures in order to cope
> > with these problems.
> >
> > The overall strategy of the proposed structures is to push out the
> > BlockDriverStates relationships out of each BlockDriverState.
> >
> > The idea is that it would make it easier for the block layer to manipulate a
> > well known structure instead of being forced to enter into each BlockDriverState
> > specificity.
> >
> > The first structure is the BlockStackNode.
> >
> > The BlockStateNode would be used to represent the relationship between the
> > various BlockDriverStates
> >
> > struct BlockStackNode {
> > BlockDriverState *bs; /* the BlockDriverState holded by this node */
> >
> > /* this doubly linked list entry points to the child node and the parent
> > * node
> > */
> > QLIST_ENTRY(BlockStateNode) down;
> >
> > /* This doubly linked list entry point to the siblings of this node
> > */
> > QLIST_ENTRY(BlockStateNode) siblings;
> >
> > /* a hash or an array of the sibbling of this node for fast access
> > * should be recomputed when updating the tree */
> > QHASH_ENTRY<BlockStateNode, index> sibblings_hash;
> > }
> >
> > The BlockBackend would be the structure used to hold the "drive" the guest use.
> >
> > struct BlockBackend {
> > /* the following doubly linked list header point to the top BlockStackNode
> > * in our case it's the one containing the I/O throttling bs
> > */
> > QLIST_HEAD(, BlockStateNode) block_stack_head;
> > /* this is a pointer to the topest node below the block filter chain
> > * in our case the first QCOW2 sibling
> > */
> > BlockStackNode *top_node_below_filters;
> > }
> >
> >
> > Updated diagram:
> >
> > (Here bsn means BlockStacknode)
> >
> > ------------------------BlockBackend
> > | |
> > | block_stack_head
> > | |
> > | |
> > | I/O throttling BlockStackNode (contains it's bs)
> > | |
> > | down
> > | |
> > | |
> > top_node_below_filter Encryption BlockStacknode (contains it's bs)
> > | |
> > | down
> > | |
> > | |
> > | Quorum BlockStackNode (contain's it's bs)
> > | /
> > | down
> > | /
> > | / S S
> > ------ QCOW2 bsn--i---QCOW2 bsn--i------ QCOW2 bsn (each bsn contains a bs)
> > | b | b |
> > down l down l down
> > | i | i |
> > | n | n |
> > | g | g |
> > | s | s |
> > | | |
> > RAW bsn RAW bsn RAW bsn (each bsn contains a bs)
> >
> >
> > Block driver point of view:
> >
> > to construct the tree each BlockDriver would have some utility functions looking
> > like.
> >
> > bdrv_register_child_bs(bs, child_bs, int index);
> >
> > multiples calls to this function could be done to register multiple siblings
> > childs identified by their index.
> >
> > This way something like quorum could register multiple QCOW2 instances.
> >
> > driver would have a
> > BlockDriverSTate *bdrv_access_child(bs, int index);
> >
> > to access their childs.
> >
> > These functions can be implemented without the driver knowing about
> > BlockStateNodes using container_of.
> >
> > blockdev point of view: (here I need your help)
> >
> > When doing a snapshot blockdev.c would access
> > BlockBackend->top_node_below_filter and make a snapshot of the bs contained in
> > this node and it's sibblings.
> >
> Since BlockDriver.bdrv_snapshot_create() is an optional operation, blockdev.c
> can navigate down the tree from top node, until hitting some layer where the op
> is implemented (the QCow2 bs), so we get rid of this top_node_below_filter
> pointer.
Is it even inherent to a block driver (like a filter), if a snapshot is
to be taken at its level? Or is it rather a policy decision that should
be made by the user?
In our example, the quorum driver, it's not at all clear to me that you
want to snapshot all children. In order to roll back to a previous
state, one snapshot is enough, you don't need multiple copies of the
same one. Perhaps you want two so that we can still compare them for
verification. Or all of them because you can afford the disk space and
want ultimate safety. I don't think qemu can know which one is true.
In the same way, in a typical case you may want to keep I/O throttling
for the whole drive, including the new snapshot. But what if the
throttling was used in order to not overload the network where the image
is stored, and you're now doing a local snapshot, to which you want to
stream the image? The I/O throttling should apply only to the backing
file, not the new snapshot.
So perhaps what we really need is a more flexible snapshot/BDS tree
manipulation command that describes in detail which structure you want
to have in the end.
Kevin
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-06 8:45 ` Kevin Wolf
@ 2013-09-06 9:18 ` Fam Zheng
2013-09-06 9:55 ` Kevin Wolf
0 siblings, 1 reply; 19+ messages in thread
From: Fam Zheng @ 2013-09-06 9:18 UTC (permalink / raw)
To: Kevin Wolf
Cc: Benoît Canet, jcody, armbru, qemu-devel, stefanha, pbonzini,
mreitz
On Fri, 09/06 10:45, Kevin Wolf wrote:
> Am 06.09.2013 um 09:56 hat Fam Zheng geschrieben:
> > On Tue, 09/03 18:24, Benoît Canet wrote:
> > >
> > > Hello list,
> > >
> > > I am thinking about QEMU block filters lately.
> > >
> > > I am not a block.c/blockdev.c expert so tell me what you think of the following.
> > >
> > > The use cases I see would be:
> > >
> > > -$user want to have some real cryptography on top of qcow2/qed or another
> > > format.
> > > snapshots and other block features should continue to work
> > >
> > > -$user want to use a raid like feature like QUORUM in QEMU.
> > > other features should continue to work
> > >
> > > -$user want to use the future SSD deduplication implementation with metadata on
> > > SSD and data on spinning disks.
> > > other features should continue to work
> > >
> > > -$user want to I/O throttle one drive of his vm.
> > >
> > > -$user want to do Copy On Read
> > >
> > > -$user want to do a combination of the above
> > >
> > > -$developer want to make the minimum of required steps to keep changes small
> > >
> > > -$developer want to keep user interface changes for later
> > >
> > > Lets take a example case of an user wanting to do I/O throttled encrypted QUORUM
> > > on top of QCOW2.
> > >
> > > Assuming we want to implement throttle and encryption as something remotely
> > > being like a block filter this makes a pretty complex BlockDriverState tree.
> > >
> > > The tree would look like the following:
> > >
> > > I/O throttling BlockDriverState (bs)
> > > |
> > > |
> > > |
> > > |
> > > Encryption BlockDriverState (bs)
> > > |
> > > |
> > > |
> > > |
> > > Quorum BlockDriverState (bs)
> > > / | \
> > > / | \
> > > / | \
> > > / | \
> > > QCOW2 bs QCOW2 b s QCOW2 bs
> > > | | |
> > > | | |
> > > | | |
> > > | | |
> > > RAW bs RAW bs RAW bs
> > >
> > > An external snapshot should result in a tree like the following.
> > > I/O throttling BlockDriverState (bs)
> > > |
> > > |
> > > |
> > > |
> > > Encryption BlockDriverState (bs)
> > > |
> > > |
> > > |
> > > |
> > > Quorum BlockDriverState (bs)
> > > / | \
> > > / | \
> > > / | \
> > > / | \
> > > QCOW2 bs QCOW2 bs QCOW2 bs
> > > | | |
> > > | | |
> > > | | |
> > > | | |
> > > QCOW2 bs QCOW2 bs QCOW2 bs
> > > | | |
> > > | | |
> > > | | |
> > > | | |
> > > RAW bs RAW bs RAW bs
> > >
> > > In the current state of QEMU we can code some block drivers to implement this
> > > tree.
> > >
> > > However when doing operations like snapshots blockdev.c would have no real idea
> > > of what should be snapshotted and how. (The 3 top bs should be kept on top)
> > >
> > > Moreover it would have no way to manipulate easily this tree of BlockDriverState
> > > has each one is encapsulated in it's parent.
> > >
> > > Also there no generic way to tell the block layer that two or more BlockDriverState
> > > are siblings.
> > >
> > > The current mail is here to propose some additionals structures in order to cope
> > > with these problems.
> > >
> > > The overall strategy of the proposed structures is to push out the
> > > BlockDriverStates relationships out of each BlockDriverState.
> > >
> > > The idea is that it would make it easier for the block layer to manipulate a
> > > well known structure instead of being forced to enter into each BlockDriverState
> > > specificity.
> > >
> > > The first structure is the BlockStackNode.
> > >
> > > The BlockStateNode would be used to represent the relationship between the
> > > various BlockDriverStates
> > >
> > > struct BlockStackNode {
> > > BlockDriverState *bs; /* the BlockDriverState holded by this node */
> > >
> > > /* this doubly linked list entry points to the child node and the parent
> > > * node
> > > */
> > > QLIST_ENTRY(BlockStateNode) down;
> > >
> > > /* This doubly linked list entry point to the siblings of this node
> > > */
> > > QLIST_ENTRY(BlockStateNode) siblings;
> > >
> > > /* a hash or an array of the sibbling of this node for fast access
> > > * should be recomputed when updating the tree */
> > > QHASH_ENTRY<BlockStateNode, index> sibblings_hash;
> > > }
> > >
> > > The BlockBackend would be the structure used to hold the "drive" the guest use.
> > >
> > > struct BlockBackend {
> > > /* the following doubly linked list header point to the top BlockStackNode
> > > * in our case it's the one containing the I/O throttling bs
> > > */
> > > QLIST_HEAD(, BlockStateNode) block_stack_head;
> > > /* this is a pointer to the topest node below the block filter chain
> > > * in our case the first QCOW2 sibling
> > > */
> > > BlockStackNode *top_node_below_filters;
> > > }
> > >
> > >
> > > Updated diagram:
> > >
> > > (Here bsn means BlockStacknode)
> > >
> > > ------------------------BlockBackend
> > > | |
> > > | block_stack_head
> > > | |
> > > | |
> > > | I/O throttling BlockStackNode (contains it's bs)
> > > | |
> > > | down
> > > | |
> > > | |
> > > top_node_below_filter Encryption BlockStacknode (contains it's bs)
> > > | |
> > > | down
> > > | |
> > > | |
> > > | Quorum BlockStackNode (contain's it's bs)
> > > | /
> > > | down
> > > | /
> > > | / S S
> > > ------ QCOW2 bsn--i---QCOW2 bsn--i------ QCOW2 bsn (each bsn contains a bs)
> > > | b | b |
> > > down l down l down
> > > | i | i |
> > > | n | n |
> > > | g | g |
> > > | s | s |
> > > | | |
> > > RAW bsn RAW bsn RAW bsn (each bsn contains a bs)
> > >
> > >
> > > Block driver point of view:
> > >
> > > to construct the tree each BlockDriver would have some utility functions looking
> > > like.
> > >
> > > bdrv_register_child_bs(bs, child_bs, int index);
> > >
> > > multiples calls to this function could be done to register multiple siblings
> > > childs identified by their index.
> > >
> > > This way something like quorum could register multiple QCOW2 instances.
> > >
> > > driver would have a
> > > BlockDriverSTate *bdrv_access_child(bs, int index);
> > >
> > > to access their childs.
> > >
> > > These functions can be implemented without the driver knowing about
> > > BlockStateNodes using container_of.
> > >
> > > blockdev point of view: (here I need your help)
> > >
> > > When doing a snapshot blockdev.c would access
> > > BlockBackend->top_node_below_filter and make a snapshot of the bs contained in
> > > this node and it's sibblings.
> > >
> > Since BlockDriver.bdrv_snapshot_create() is an optional operation, blockdev.c
> > can navigate down the tree from top node, until hitting some layer where the op
> > is implemented (the QCow2 bs), so we get rid of this top_node_below_filter
> > pointer.
>
> Is it even inherent to a block driver (like a filter), if a snapshot is
> to be taken at its level? Or is it rather a policy decision that should
> be made by the user?
>
OK, getting the point that user should have full flexibility and fine operation
granularity. It also stands against block_backend->top_node_below_filter. Do we
really have the assumption that all the filters are on top of the tree and linear?
Shouldn't this be possible?
Block Backend
|
|
Quodrum BDS
/ | \
iot filter | \
/ | \
qcow2 qcow2 qcow2
So we throttle only a particular image, not the whole device. But this will
make a top_node_below_filter pointer impossible.
> In our example, the quorum driver, it's not at all clear to me that you
> want to snapshot all children. In order to roll back to a previous
> state, one snapshot is enough, you don't need multiple copies of the
> same one. Perhaps you want two so that we can still compare them for
> verification. Or all of them because you can afford the disk space and
> want ultimate safety. I don't think qemu can know which one is true.
>
Only if quorum ever knows about and operates on snapshots, it should be
considered specifically, but no. So we need to achieve this in the general
design: allow user to take snapshot, or set throttle limits on particular
BDSes, as above graph.
> In the same way, in a typical case you may want to keep I/O throttling
> for the whole drive, including the new snapshot. But what if the
> throttling was used in order to not overload the network where the image
> is stored, and you're now doing a local snapshot, to which you want to
> stream the image? The I/O throttling should apply only to the backing
> file, not the new snapshot.
>
Yes, and OTOH, throttling really suits to be a filter only if it can be a non
top one, otherwise it's no better than what we have now.
> So perhaps what we really need is a more flexible snapshot/BDS tree
> manipulation command that describes in detail which structure you want
> to have in the end.
>
Fam
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-06 9:18 ` Fam Zheng
@ 2013-09-06 9:55 ` Kevin Wolf
2013-09-06 10:50 ` Fam Zheng
2013-09-15 18:10 ` Benoît Canet
0 siblings, 2 replies; 19+ messages in thread
From: Kevin Wolf @ 2013-09-06 9:55 UTC (permalink / raw)
To: Fam Zheng
Cc: Benoît Canet, jcody, armbru, qemu-devel, stefanha, pbonzini,
mreitz
Am 06.09.2013 um 11:18 hat Fam Zheng geschrieben:
> On Fri, 09/06 10:45, Kevin Wolf wrote:
> > Am 06.09.2013 um 09:56 hat Fam Zheng geschrieben:
> > > Since BlockDriver.bdrv_snapshot_create() is an optional operation, blockdev.c
> > > can navigate down the tree from top node, until hitting some layer where the op
> > > is implemented (the QCow2 bs), so we get rid of this top_node_below_filter
> > > pointer.
> >
> > Is it even inherent to a block driver (like a filter), if a snapshot is
> > to be taken at its level? Or is it rather a policy decision that should
> > be made by the user?
> >
> OK, getting the point that user should have full flexibility and fine operation
> granularity. It also stands against block_backend->top_node_below_filter. Do we
> really have the assumption that all the filters are on top of the tree and linear?
> Shouldn't this be possible?
>
> Block Backend
> |
> |
> Quodrum BDS
> / | \
> iot filter | \
> / | \
> qcow2 qcow2 qcow2
>
> So we throttle only a particular image, not the whole device. But this will
> make a top_node_below_filter pointer impossible.
I was assuming that Benoît's model works for the special case of
snapshotting in one predefined way, but this is actually a very good
example of why it doesn't.
The approach relies on snapshotting siblings together, and in this case
the siblings would be iot/qcow2/qcow2, while iot is still a filter. This
would mean that either iot needs to be top_node_below_filter and
throttling doesn't stay on top, or the left qcow2 is
top_node_below_filter and the other Quorum images aren't snapshotted.
> > In our example, the quorum driver, it's not at all clear to me that you
> > want to snapshot all children. In order to roll back to a previous
> > state, one snapshot is enough, you don't need multiple copies of the
> > same one. Perhaps you want two so that we can still compare them for
> > verification. Or all of them because you can afford the disk space and
> > want ultimate safety. I don't think qemu can know which one is true.
> >
> Only if quorum ever knows about and operates on snapshots, it should be
> considered specifically, but no. So we need to achieve this in the general
> design: allow user to take snapshot, or set throttle limits on particular
> BDSes, as above graph.
>
> > In the same way, in a typical case you may want to keep I/O throttling
> > for the whole drive, including the new snapshot. But what if the
> > throttling was used in order to not overload the network where the image
> > is stored, and you're now doing a local snapshot, to which you want to
> > stream the image? The I/O throttling should apply only to the backing
> > file, not the new snapshot.
> >
> Yes, and OTOH, throttling really suits to be a filter only if it can be a non
> top one, otherwise it's no better than what we have now.
Well, it would be a cleaner architecture in any case, but having it in
the middle of the stack feels useful indeed, so we should support it.
> > So perhaps what we really need is a more flexible snapshot/BDS tree
> > manipulation command that describes in detail which structure you want
> > to have in the end.
Designing the corresponding QMP command is the hard part, I guess.
Kevin
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-06 9:55 ` Kevin Wolf
@ 2013-09-06 10:50 ` Fam Zheng
2013-09-15 18:10 ` Benoît Canet
1 sibling, 0 replies; 19+ messages in thread
From: Fam Zheng @ 2013-09-06 10:50 UTC (permalink / raw)
To: Kevin Wolf
Cc: Benoît Canet, jcody, armbru, qemu-devel, stefanha, pbonzini,
mreitz
On Fri, 09/06 11:55, Kevin Wolf wrote:
> Am 06.09.2013 um 11:18 hat Fam Zheng geschrieben:
> > On Fri, 09/06 10:45, Kevin Wolf wrote:
> > > Am 06.09.2013 um 09:56 hat Fam Zheng geschrieben:
> > > > Since BlockDriver.bdrv_snapshot_create() is an optional operation, blockdev.c
> > > > can navigate down the tree from top node, until hitting some layer where the op
> > > > is implemented (the QCow2 bs), so we get rid of this top_node_below_filter
> > > > pointer.
> > >
> > > Is it even inherent to a block driver (like a filter), if a snapshot is
> > > to be taken at its level? Or is it rather a policy decision that should
> > > be made by the user?
> > >
> > OK, getting the point that user should have full flexibility and fine operation
> > granularity. It also stands against block_backend->top_node_below_filter. Do we
> > really have the assumption that all the filters are on top of the tree and linear?
> > Shouldn't this be possible?
> >
> > Block Backend
> > |
> > |
> > Quodrum BDS
> > / | \
> > iot filter | \
> > / | \
> > qcow2 qcow2 qcow2
> >
> > So we throttle only a particular image, not the whole device. But this will
> > make a top_node_below_filter pointer impossible.
>
> I was assuming that Benoît's model works for the special case of
> snapshotting in one predefined way, but this is actually a very good
> example of why it doesn't.
>
> The approach relies on snapshotting siblings together, and in this case
> the siblings would be iot/qcow2/qcow2, while iot is still a filter. This
> would mean that either iot needs to be top_node_below_filter and
> throttling doesn't stay on top, or the left qcow2 is
> top_node_below_filter and the other Quorum images aren't snapshotted.
>
I think that user must look at what is there first, rather than do it blindly
on the Block Backend, which is only an entry to the tree behind. Because
wheather snapshot is supported, depends on what's in the backend. If the tree
is merely a single raw image, nothing can be done. So, Block Backend doesn't
hide things: I think we split it from BDS for putting device related states in
a right place. Users still operates on the BDS for data operations (backup,
snapshot, throttle, etc...), much the same as they currently do. The point of
this design here is that we are able to assemble them to get flexible and
sophisticated configurations.
IMO, snapshotting Quodrum childrens, doesn't sound much different from
snapshotting guest's rg0-disk0, rg0-disk1, rg0-disk2, it requires the knowledge
that they are bound to work together. It's just that the knowledge is from
guest raid configuration, or from Quodrum driver. If Quodrum requires all the
children's snapshot be taken at the same time, like a transactional snapshot on
rg0-disk{0,1,2} must triggered from QMP by user, Quodrum driver must implement
its ->bdrv_snapshot_create() and ensure it.
> > > In our example, the quorum driver, it's not at all clear to me that you
> > > want to snapshot all children. In order to roll back to a previous
> > > state, one snapshot is enough, you don't need multiple copies of the
> > > same one. Perhaps you want two so that we can still compare them for
> > > verification. Or all of them because you can afford the disk space and
> > > want ultimate safety. I don't think qemu can know which one is true.
> > >
> > Only if quorum ever knows about and operates on snapshots, it should be
> > considered specifically, but no. So we need to achieve this in the general
> > design: allow user to take snapshot, or set throttle limits on particular
> > BDSes, as above graph.
> >
> > > In the same way, in a typical case you may want to keep I/O throttling
> > > for the whole drive, including the new snapshot. But what if the
> > > throttling was used in order to not overload the network where the image
> > > is stored, and you're now doing a local snapshot, to which you want to
> > > stream the image? The I/O throttling should apply only to the backing
> > > file, not the new snapshot.
> > >
> > Yes, and OTOH, throttling really suits to be a filter only if it can be a non
> > top one, otherwise it's no better than what we have now.
>
> Well, it would be a cleaner architecture in any case, but having it in
> the middle of the stack feels useful indeed, so we should support it.
>
> > > So perhaps what we really need is a more flexible snapshot/BDS tree
> > > manipulation command that describes in detail which structure you want
> > > to have in the end.
>
> Designing the corresponding QMP command is the hard part, I guess.
>
Another concern is that we may do all this with substantial duplication work of
QOM.
Fam
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-06 9:55 ` Kevin Wolf
2013-09-06 10:50 ` Fam Zheng
@ 2013-09-15 18:10 ` Benoît Canet
2013-09-16 7:41 ` Fam Zheng
1 sibling, 1 reply; 19+ messages in thread
From: Benoît Canet @ 2013-09-15 18:10 UTC (permalink / raw)
To: Kevin Wolf
Cc: Benoît Canet, Fam Zheng, jcody, armbru, qemu-devel, stefanha,
pbonzini, mreitz
Le Friday 06 Sep 2013 à 11:55:38 (+0200), Kevin Wolf a écrit :
> Am 06.09.2013 um 11:18 hat Fam Zheng geschrieben:
> > On Fri, 09/06 10:45, Kevin Wolf wrote:
> > > Am 06.09.2013 um 09:56 hat Fam Zheng geschrieben:
> > > > Since BlockDriver.bdrv_snapshot_create() is an optional operation, blockdev.c
> > > > can navigate down the tree from top node, until hitting some layer where the op
> > > > is implemented (the QCow2 bs), so we get rid of this top_node_below_filter
> > > > pointer.
> > >
> > > Is it even inherent to a block driver (like a filter), if a snapshot is
> > > to be taken at its level? Or is it rather a policy decision that should
> > > be made by the user?
> > >
> > OK, getting the point that user should have full flexibility and fine operation
> > granularity. It also stands against block_backend->top_node_below_filter. Do we
> > really have the assumption that all the filters are on top of the tree and linear?
> > Shouldn't this be possible?
> >
> > Block Backend
> > |
> > |
> > Quodrum BDS
> > / | \
> > throttle filter | \
> > / | \
> > qcow2 qcow2 qcow2
> >
> > So we throttle only a particular image, not the whole device. But this will
> > make a top_node_below_filter pointer impossible.
>
> I was assuming that Benoît's model works for the special case of
> snapshotting in one predefined way, but this is actually a very good
> example of why it doesn't.
>
> The approach relies on snapshotting siblings together, and in this case
> the siblings would be throttle/qcow2/qcow2, while throttle is still a filter. This
> would mean that either throttle needs to be top_node_below_filter and
> throttling doesn't stay on top, or the left qcow2 is
> top_node_below_filter and the other Quorum images aren't snapshotted.
>
> > > In our example, the quorum driver, it's not at all clear to me that you
> > > want to snapshot all children. In order to roll back to a previous
> > > state, one snapshot is enough, you don't need multiple copies of the
> > > same one. Perhaps you want two so that we can still compare them for
> > > verification. Or all of them because you can afford the disk space and
> > > want ultimate safety. I don't think qemu can know which one is true.
> > >
> > Only if quorum ever knows about and operates on snapshots, it should be
> > considered specifically, but no. So we need to achieve this in the general
> > design: allow user to take snapshot, or set throttle limits on particular
> > BDSes, as above graph.
> >
> > > In the same way, in a typical case you may want to keep I/O throttling
> > > for the whole drive, including the new snapshot. But what if the
> > > throttling was used in order to not overload the network where the image
> > > is stored, and you're now doing a local snapshot, to which you want to
> > > stream the image? The I/O throttling should apply only to the backing
> > > file, not the new snapshot.
> > >
> > Yes, and OTOH, throttling really suits to be a filter only if it can be a non
> > top one, otherwise it's no better than what we have now.
>
> Well, it would be a cleaner architecture in any case, but having it in
> the middle of the stack feels useful indeed, so we should support it.
>
> > > So perhaps what we really need is a more flexible snapshot/BDS tree
> > > manipulation command that describes in detail which structure you want
> > > to have in the end.
>
> Designing the corresponding QMP command is the hard part, I guess.
During my vacation I though about the fact that JSON is pretty good to build a
tree.
QMP, HMP and the command line could take a "block-tree" argument which would
look like the following.
block-tree = { 'quorum': [
{
'throttle' : {
'qcow2' : { 'filename': "img1.qcow2" }
'snapshotable': true,
},
'throttle-iops' : 150,
'throttle-iops-max' : 1000,
},
{
'qcow2' : { 'filename': "img2.qcow2" },
'snapshotable': true,
},
{
'qcow2' : { 'filename': "img3.qcow2" }
'snapshotable': false,
}
] };
This would be passed to QEMU in a compact form without carriage return and
spaces.
The block layer would convert this to C structs like the QMP code would do for a
QMP command and the bs tree would be recursively build from top to bottom by
the Block Backend and each Block driver in the path using the C structs.
Each level would instanciate the lower level until a raw or protocol driver is
reached.
What about this ?
Best regards
Benoît
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-15 18:10 ` Benoît Canet
@ 2013-09-16 7:41 ` Fam Zheng
2013-09-16 10:43 ` Benoît Canet
2013-09-16 11:00 ` Benoît Canet
0 siblings, 2 replies; 19+ messages in thread
From: Fam Zheng @ 2013-09-16 7:41 UTC (permalink / raw)
To: Benoît Canet
Cc: Kevin Wolf, jcody, armbru, qemu-devel, stefanha, pbonzini, mreitz
On Sun, 09/15 20:10, Benoît Canet wrote:
> Le Friday 06 Sep 2013 à 11:55:38 (+0200), Kevin Wolf a écrit :
> > Am 06.09.2013 um 11:18 hat Fam Zheng geschrieben:
> > > On Fri, 09/06 10:45, Kevin Wolf wrote:
> > > > Am 06.09.2013 um 09:56 hat Fam Zheng geschrieben:
> > > > > Since BlockDriver.bdrv_snapshot_create() is an optional operation, blockdev.c
> > > > > can navigate down the tree from top node, until hitting some layer where the op
> > > > > is implemented (the QCow2 bs), so we get rid of this top_node_below_filter
> > > > > pointer.
> > > >
> > > > Is it even inherent to a block driver (like a filter), if a snapshot is
> > > > to be taken at its level? Or is it rather a policy decision that should
> > > > be made by the user?
> > > >
> > > OK, getting the point that user should have full flexibility and fine operation
> > > granularity. It also stands against block_backend->top_node_below_filter. Do we
> > > really have the assumption that all the filters are on top of the tree and linear?
> > > Shouldn't this be possible?
> > >
> > > Block Backend
> > > |
> > > |
> > > Quodrum BDS
> > > / | \
> > > throttle filter | \
> > > / | \
> > > qcow2 qcow2 qcow2
> > >
> > > So we throttle only a particular image, not the whole device. But this will
> > > make a top_node_below_filter pointer impossible.
> >
> > I was assuming that Benoît's model works for the special case of
> > snapshotting in one predefined way, but this is actually a very good
> > example of why it doesn't.
> >
> > The approach relies on snapshotting siblings together, and in this case
> > the siblings would be throttle/qcow2/qcow2, while throttle is still a filter. This
> > would mean that either throttle needs to be top_node_below_filter and
> > throttling doesn't stay on top, or the left qcow2 is
> > top_node_below_filter and the other Quorum images aren't snapshotted.
> >
> > > > In our example, the quorum driver, it's not at all clear to me that you
> > > > want to snapshot all children. In order to roll back to a previous
> > > > state, one snapshot is enough, you don't need multiple copies of the
> > > > same one. Perhaps you want two so that we can still compare them for
> > > > verification. Or all of them because you can afford the disk space and
> > > > want ultimate safety. I don't think qemu can know which one is true.
> > > >
> > > Only if quorum ever knows about and operates on snapshots, it should be
> > > considered specifically, but no. So we need to achieve this in the general
> > > design: allow user to take snapshot, or set throttle limits on particular
> > > BDSes, as above graph.
> > >
> > > > In the same way, in a typical case you may want to keep I/O throttling
> > > > for the whole drive, including the new snapshot. But what if the
> > > > throttling was used in order to not overload the network where the image
> > > > is stored, and you're now doing a local snapshot, to which you want to
> > > > stream the image? The I/O throttling should apply only to the backing
> > > > file, not the new snapshot.
> > > >
> > > Yes, and OTOH, throttling really suits to be a filter only if it can be a non
> > > top one, otherwise it's no better than what we have now.
> >
> > Well, it would be a cleaner architecture in any case, but having it in
> > the middle of the stack feels useful indeed, so we should support it.
> >
> > > > So perhaps what we really need is a more flexible snapshot/BDS tree
> > > > manipulation command that describes in detail which structure you want
> > > > to have in the end.
> >
> > Designing the corresponding QMP command is the hard part, I guess.
>
> During my vacation I though about the fact that JSON is pretty good to build a
> tree.
>
> QMP, HMP and the command line could take a "block-tree" argument which would
> look like the following.
>
> block-tree = { 'quorum': [
> {
> 'throttle' : {
> 'qcow2' : { 'filename': "img1.qcow2" }
> 'snapshotable': true,
What's the 'snapshotable' for?
> },
> 'throttle-iops' : 150,
> 'throttle-iops-max' : 1000,
> },
> {
> 'qcow2' : { 'filename': "img2.qcow2" },
> 'snapshotable': true,
> },
> {
> 'qcow2' : { 'filename': "img3.qcow2" }
> 'snapshotable': false,
> }
> ] };
>
It's not very clear to me. Does this mean a key associated with a dict value
means creating type? What do you put in the inner dict (i.e. why filename here)
and what to put outter besides the key (i.e. snapshotable)? Where to put 'id'?
I think JSON is flexible enough to specify anything we can take, but the format
needs to be designed carefully. And do we really want to use JSON in the
command line options? Very hard to imagine that. :)
Thanks,
Fam
> This would be passed to QEMU in a compact form without carriage return and
> spaces.
>
> The block layer would convert this to C structs like the QMP code would do for a
> QMP command and the bs tree would be recursively build from top to bottom by
> the Block Backend and each Block driver in the path using the C structs.
>
> Each level would instanciate the lower level until a raw or protocol driver is
> reached.
>
> What about this ?
>
> Best regards
>
> Benoît
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-16 7:41 ` Fam Zheng
@ 2013-09-16 10:43 ` Benoît Canet
2013-09-16 11:00 ` Benoît Canet
1 sibling, 0 replies; 19+ messages in thread
From: Benoît Canet @ 2013-09-16 10:43 UTC (permalink / raw)
To: Fam Zheng
Cc: Benoît Canet, Kevin Wolf, jcody, armbru, qemu-devel,
stefanha, pbonzini, mreitz
Le Monday 16 Sep 2013 à 15:41:45 (+0800), Fam Zheng a écrit :
> On Sun, 09/15 20:10, Benoît Canet wrote:
> > Le Friday 06 Sep 2013 à 11:55:38 (+0200), Kevin Wolf a écrit :
> > > Am 06.09.2013 um 11:18 hat Fam Zheng geschrieben:
> > > > On Fri, 09/06 10:45, Kevin Wolf wrote:
> > > > > Am 06.09.2013 um 09:56 hat Fam Zheng geschrieben:
> > > > > > Since BlockDriver.bdrv_snapshot_create() is an optional operation, blockdev.c
> > > > > > can navigate down the tree from top node, until hitting some layer where the op
> > > > > > is implemented (the QCow2 bs), so we get rid of this top_node_below_filter
> > > > > > pointer.
> > > > >
> > > > > Is it even inherent to a block driver (like a filter), if a snapshot is
> > > > > to be taken at its level? Or is it rather a policy decision that should
> > > > > be made by the user?
> > > > >
> > > > OK, getting the point that user should have full flexibility and fine operation
> > > > granularity. It also stands against block_backend->top_node_below_filter. Do we
> > > > really have the assumption that all the filters are on top of the tree and linear?
> > > > Shouldn't this be possible?
> > > >
> > > > Block Backend
> > > > |
> > > > |
> > > > Quodrum BDS
> > > > / | \
> > > > throttle filter | \
> > > > / | \
> > > > qcow2 qcow2 qcow2
> > > >
> > > > So we throttle only a particular image, not the whole device. But this will
> > > > make a top_node_below_filter pointer impossible.
> > >
> > > I was assuming that Benoît's model works for the special case of
> > > snapshotting in one predefined way, but this is actually a very good
> > > example of why it doesn't.
> > >
> > > The approach relies on snapshotting siblings together, and in this case
> > > the siblings would be throttle/qcow2/qcow2, while throttle is still a filter. This
> > > would mean that either throttle needs to be top_node_below_filter and
> > > throttling doesn't stay on top, or the left qcow2 is
> > > top_node_below_filter and the other Quorum images aren't snapshotted.
> > >
> > > > > In our example, the quorum driver, it's not at all clear to me that you
> > > > > want to snapshot all children. In order to roll back to a previous
> > > > > state, one snapshot is enough, you don't need multiple copies of the
> > > > > same one. Perhaps you want two so that we can still compare them for
> > > > > verification. Or all of them because you can afford the disk space and
> > > > > want ultimate safety. I don't think qemu can know which one is true.
> > > > >
> > > > Only if quorum ever knows about and operates on snapshots, it should be
> > > > considered specifically, but no. So we need to achieve this in the general
> > > > design: allow user to take snapshot, or set throttle limits on particular
> > > > BDSes, as above graph.
> > > >
> > > > > In the same way, in a typical case you may want to keep I/O throttling
> > > > > for the whole drive, including the new snapshot. But what if the
> > > > > throttling was used in order to not overload the network where the image
> > > > > is stored, and you're now doing a local snapshot, to which you want to
> > > > > stream the image? The I/O throttling should apply only to the backing
> > > > > file, not the new snapshot.
> > > > >
> > > > Yes, and OTOH, throttling really suits to be a filter only if it can be a non
> > > > top one, otherwise it's no better than what we have now.
> > >
> > > Well, it would be a cleaner architecture in any case, but having it in
> > > the middle of the stack feels useful indeed, so we should support it.
> > >
> > > > > So perhaps what we really need is a more flexible snapshot/BDS tree
> > > > > manipulation command that describes in detail which structure you want
> > > > > to have in the end.
> > >
> > > Designing the corresponding QMP command is the hard part, I guess.
> >
> > During my vacation I though about the fact that JSON is pretty good to build a
> > tree.
> >
> > QMP, HMP and the command line could take a "block-tree" argument which would
> > look like the following.
> >
> > block-tree = { 'quorum': [
> > {
> > 'throttle' : {
> > 'qcow2' : { 'filename': "img1.qcow2" }
> > 'snapshotable': true,
>
> What's the 'snapshotable' for?
>
> > },
> > 'throttle-iops' : 150,
> > 'throttle-iops-max' : 1000,
> > },
> > {
> > 'qcow2' : { 'filename': "img2.qcow2" },
> > 'snapshotable': true,
> > },
> > {
> > 'qcow2' : { 'filename': "img3.qcow2" }
> > 'snapshotable': false,
> > }
> > ] };
> >
>
> It's not very clear to me. Does this mean a key associated with a dict value
> means creating type? What do you put in the inner dict (i.e. why filename here)
> and what to put outter besides the key (i.e. snapshotable)? Where to put 'id'?
This example is here just to give a rough idea. Proper definition of a suitable
format would be still needed.
>
> I think JSON is flexible enough to specify anything we can take, but the format
> needs to be designed carefully. And do we really want to use JSON in the
> command line options? Very hard to imagine that. :)
Why not ? Most QEMU users are programs (libvirt ...) and they would be good at
generating the required JSON. It would also enable to have an easy way to edit
the block filter stack in a GUI.
Best regards
Benoît
>
> Thanks,
>
> Fam
>
> > This would be passed to QEMU in a compact form without carriage return and
> > spaces.
> >
> > The block layer would convert this to C structs like the QMP code would do for a
> > QMP command and the bs tree would be recursively build from top to bottom by
> > the Block Backend and each Block driver in the path using the C structs.
> >
> > Each level would instanciate the lower level until a raw or protocol driver is
> > reached.
> >
> > What about this ?
> >
> > Best regards
> >
> > Benoît
> >
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] Block Filters
2013-09-16 7:41 ` Fam Zheng
2013-09-16 10:43 ` Benoît Canet
@ 2013-09-16 11:00 ` Benoît Canet
1 sibling, 0 replies; 19+ messages in thread
From: Benoît Canet @ 2013-09-16 11:00 UTC (permalink / raw)
To: Fam Zheng
Cc: Benoît Canet, Kevin Wolf, jcody, armbru, qemu-devel,
stefanha, pbonzini, mreitz
Le Monday 16 Sep 2013 à 15:41:45 (+0800), Fam Zheng a écrit :
> On Sun, 09/15 20:10, Benoît Canet wrote:
> > Le Friday 06 Sep 2013 à 11:55:38 (+0200), Kevin Wolf a écrit :
> > > Am 06.09.2013 um 11:18 hat Fam Zheng geschrieben:
> > > > On Fri, 09/06 10:45, Kevin Wolf wrote:
> > > > > Am 06.09.2013 um 09:56 hat Fam Zheng geschrieben:
> > > > > > Since BlockDriver.bdrv_snapshot_create() is an optional operation, blockdev.c
> > > > > > can navigate down the tree from top node, until hitting some layer where the op
> > > > > > is implemented (the QCow2 bs), so we get rid of this top_node_below_filter
> > > > > > pointer.
> > > > >
> > > > > Is it even inherent to a block driver (like a filter), if a snapshot is
> > > > > to be taken at its level? Or is it rather a policy decision that should
> > > > > be made by the user?
> > > > >
> > > > OK, getting the point that user should have full flexibility and fine operation
> > > > granularity. It also stands against block_backend->top_node_below_filter. Do we
> > > > really have the assumption that all the filters are on top of the tree and linear?
> > > > Shouldn't this be possible?
> > > >
> > > > Block Backend
> > > > |
> > > > |
> > > > Quodrum BDS
> > > > / | \
> > > > throttle filter | \
> > > > / | \
> > > > qcow2 qcow2 qcow2
> > > >
> > > > So we throttle only a particular image, not the whole device. But this will
> > > > make a top_node_below_filter pointer impossible.
> > >
> > > I was assuming that Benoît's model works for the special case of
> > > snapshotting in one predefined way, but this is actually a very good
> > > example of why it doesn't.
> > >
> > > The approach relies on snapshotting siblings together, and in this case
> > > the siblings would be throttle/qcow2/qcow2, while throttle is still a filter. This
> > > would mean that either throttle needs to be top_node_below_filter and
> > > throttling doesn't stay on top, or the left qcow2 is
> > > top_node_below_filter and the other Quorum images aren't snapshotted.
> > >
> > > > > In our example, the quorum driver, it's not at all clear to me that you
> > > > > want to snapshot all children. In order to roll back to a previous
> > > > > state, one snapshot is enough, you don't need multiple copies of the
> > > > > same one. Perhaps you want two so that we can still compare them for
> > > > > verification. Or all of them because you can afford the disk space and
> > > > > want ultimate safety. I don't think qemu can know which one is true.
> > > > >
> > > > Only if quorum ever knows about and operates on snapshots, it should be
> > > > considered specifically, but no. So we need to achieve this in the general
> > > > design: allow user to take snapshot, or set throttle limits on particular
> > > > BDSes, as above graph.
> > > >
> > > > > In the same way, in a typical case you may want to keep I/O throttling
> > > > > for the whole drive, including the new snapshot. But what if the
> > > > > throttling was used in order to not overload the network where the image
> > > > > is stored, and you're now doing a local snapshot, to which you want to
> > > > > stream the image? The I/O throttling should apply only to the backing
> > > > > file, not the new snapshot.
> > > > >
> > > > Yes, and OTOH, throttling really suits to be a filter only if it can be a non
> > > > top one, otherwise it's no better than what we have now.
> > >
> > > Well, it would be a cleaner architecture in any case, but having it in
> > > the middle of the stack feels useful indeed, so we should support it.
> > >
> > > > > So perhaps what we really need is a more flexible snapshot/BDS tree
> > > > > manipulation command that describes in detail which structure you want
> > > > > to have in the end.
> > >
> > > Designing the corresponding QMP command is the hard part, I guess.
> >
> > During my vacation I though about the fact that JSON is pretty good to build a
> > tree.
> >
> > QMP, HMP and the command line could take a "block-tree" argument which would
> > look like the following.
> >
> > block-tree = { 'quorum': [
> > {
> > 'throttle' : {
> > 'qcow2' : { 'filename': "img1.qcow2" }
> > 'snapshotable': true,
>
> What's the 'snapshotable' for?
Kevin mentioned the fact that when taking a snapshot with QUORUM we may want to
snapshot only a part of the QUORUM branches.
snapshotable is a way to indicate to the QCOW2 driver that it must create a new
snapshot when the snapshoting order cascade down the bs tree.
Best regards
Benoît
>
> > },
> > 'throttle-iops' : 150,
> > 'throttle-iops-max' : 1000,
> > },
> > {
> > 'qcow2' : { 'filename': "img2.qcow2" },
> > 'snapshotable': true,
> > },
> > {
> > 'qcow2' : { 'filename': "img3.qcow2" }
> > 'snapshotable': false,
> > }
> > ] };
> >
>
> It's not very clear to me. Does this mean a key associated with a dict value
> means creating type? What do you put in the inner dict (i.e. why filename here)
> and what to put outter besides the key (i.e. snapshotable)? Where to put 'id'?
>
> I think JSON is flexible enough to specify anything we can take, but the format
> needs to be designed carefully. And do we really want to use JSON in the
> command line options? Very hard to imagine that. :)
>
> Thanks,
>
> Fam
>
> > This would be passed to QEMU in a compact form without carriage return and
> > spaces.
> >
> > The block layer would convert this to C structs like the QMP code would do for a
> > QMP command and the bs tree would be recursively build from top to bottom by
> > the Block Backend and each Block driver in the path using the C structs.
> >
> > Each level would instanciate the lower level until a raw or protocol driver is
> > reached.
> >
> > What about this ?
> >
> > Best regards
> >
> > Benoît
> >
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2013-09-16 10:58 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-03 16:24 [Qemu-devel] Block Filters Benoît Canet
2013-09-04 8:29 ` Stefan Hajnoczi
2013-09-04 18:15 ` Benoît Canet
2013-09-05 7:32 ` Kevin Wolf
2013-09-05 10:01 ` Stefan Hajnoczi
2013-09-05 10:18 ` Fam Zheng
2013-09-05 14:59 ` Stefan Hajnoczi
2013-09-05 17:29 ` Benoît Canet
2013-09-06 7:42 ` Kevin Wolf
2013-09-06 7:49 ` Fam Zheng
2013-09-06 7:56 ` Fam Zheng
2013-09-06 8:45 ` Kevin Wolf
2013-09-06 9:18 ` Fam Zheng
2013-09-06 9:55 ` Kevin Wolf
2013-09-06 10:50 ` Fam Zheng
2013-09-15 18:10 ` Benoît Canet
2013-09-16 7:41 ` Fam Zheng
2013-09-16 10:43 ` Benoît Canet
2013-09-16 11:00 ` Benoît Canet
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).