From: "Benoît Canet" <benoit.canet@irqsave.net>
To: Kevin Wolf <kwolf@redhat.com>
Cc: "Benoît Canet" <benoit.canet@irqsave.net>,
"Fam Zheng" <famz@redhat.com>,
"Stefan Hajnoczi" <stefanha@gmail.com>,
armbru@redhat.com, qemu-devel <qemu-devel@nongnu.org>,
"Stefan Hajnoczi" <stefanha@redhat.com>,
"Max Reitz" <mreitz@redhat.com>
Subject: Re: [Qemu-devel] n ways block filters
Date: Thu, 27 Mar 2014 16:49:34 +0100 [thread overview]
Message-ID: <20140327154933.GC3945@irqsave.net> (raw)
In-Reply-To: <20140324155855.GD3094@noname.str.redhat.com>
The Monday 24 Mar 2014 à 16:58:55 (+0100), Kevin Wolf wrote :
> Am 24.03.2014 um 15:53 hat Benoît Canet geschrieben:
> > The Thursday 20 Mar 2014 à 17:06:26 (+0100), Benoît Canet wrote :
> > > The Thursday 20 Mar 2014 à 16:12:34 (+0100), Kevin Wolf wrote :
> > > > Am 20.03.2014 um 15:05 hat Benoît Canet geschrieben:
> > > > > The Tuesday 18 Mar 2014 à 14:27:47 (+0100), Kevin Wolf wrote :
> > > > > > Am 17.03.2014 um 17:02 hat Stefan Hajnoczi geschrieben:
> > > > > > > On Mon, Mar 17, 2014 at 4:12 AM, Fam Zheng <famz@redhat.com> wrote:
> > > > > > > > On Fri, 03/14 16:57, Benoît Canet wrote:
> > > > > > > >> I discussed a bit with Stefan on the list and we came to the conclusion that the
> > > > > > > >> block filter API need group support.
> > > > > > > >>
> > > > > > > >> filter group:
> > > > > > > >> -------------
> > > > > > > >>
> > > > > > > >> My current plan to implement this is to add the following fields to the BlockDriver
> > > > > > > >> structure.
> > > > > > > >>
> > > > > > > >> int bdrv_add_filter_group(const char *name, QDict options);
> > > > > > > >> int bdrv_reconfigure_filter_group(const char *name, QDict options);
> > > > > > > >> int bdrv_destroy_filter_group(const char *name);
> > > > > >
> > > > > > Benoît, your mail left me puzzled. You didn't really describe the
> > > > > > problem that you're solving, nor what the QDict options actually
> > > > > > contains or what a filter group even is.
> > > > > >
> > > > > > > >> These three extra method would allow to create, reconfigure or destroy a block
> > > > > > > >> filter group. A block filter group contain the shared or non shared state of the
> > > > > > > >> blockfilter. For throttling it would contains the ThrottleState structure.
> > > > > > > >>
> > > > > > > >> Each block filter driver would contains a linked list of linked list where the
> > > > > > > >> BDS are registered grouped by filter groups state.
> > > > > > > >
> > > > > > > > Sorry I don't fully understand this. Does a filter group contain multiple block
> > > > > > > > filters, and every block filter has effect on multiple BDSes? Could you give an
> > > > > > > > example?
> > > > > > >
> > > > > > > Just to why a "group" mechanism is useful:
> > > > > > >
> > > > > > > You want to impose a 2000 IOPS limit for the entire VM. Currently
> > > > > > > this is not possible because each drive has its own throttling state.
> > > > > > >
> > > > > > > We need a way to say certain drives are part of a group. All drives
> > > > > > > in a group share the same throttling state and therefore a 2000 IOPS
> > > > > > > limit is shared amongst them.
> > > > > >
> > > > > > Now at least I have an idea what you're all talking about, but it's
> > > > > > still not obvious to me how the three functions from above solve your
> > > > > > problem or how they work in detail.
> > > > > >
> > > > > > The obvious solution, using often discussed blockdev-add concepts, is:
> > > > > > ______________
> > > > > > virtio-blk_A --> | | --> qcow2_A --> raw-posix_A
> > > > > > | throttling |
> > > > > > virtio_blk_B --> |____________| --> qcow2_B --> nbd_B
> > > > >
> > > > > My proposal would be:
> > > > > ______________
> > > > > virtio-blk_A --> | BDS 1 | --> qcow2_A --> raw-posix_A
> > > > > |____________|
> > > > > |
> > > > > _____|________
> > > > > | | The shared state is the state of a BDS group
> > > > > | Shared | It's stored in a static linked list of the
> > > > > | State | block/throttle.c module. It has a name and contains a
> > > > > |____________| throttle state structure.
> > > > > |
> > > > > _____|________
> > > > > | BDS 2 |
> > > > > virtio_blk_B --> |____________| --> qcow2_B --> nbd_B
> > > >
> > > > Okay. I think your proposal might be easier to implement in the short
> > > > run, but it introduces an additional type of nodes to the graph (so far
> > > > we have only one type, BlockDriverStates) with their own set of
> > > > functions, and I assume monitor commands, for management.
> > > >
> > > > This makes the whole graph less uniform and consistent. There may be
> > > > cases where this is necessary or at least tolerable because the fully
> > > > generic alternativ isn't doable. I'm not convinced yet that this is the
> > > > case here.
> > > >
> > > > In contrast, my approach would require considerable infrastructure work
> > > > (you somehow seem to attract that kind of things ;-)), but it's merely a
> > > > generalisation of what we already have and as such fits nicely in the
> > > > graph.
> > > >
> > > > We already have multiple children of BDS nodes. And we take it for
> > > > granted that they don't refer to the same data, but that bs->file and
> > > > bs->backing_hd have actually different semantics.
> > > >
> > > > We have recently introduced refcounts for BDSes so that one BDS can now
> > > > have multiple parents, too, as a first step towards symmetry. The
> > > > logical extension is that these parent get different semantics, just
> > > > like the children have different semantics.
> > > >
> > > > Doing the abstraction in one model right instead of adding hacks that
> > > > don't really fit in but are easy to implement has paid off in the past.
> > > > I'm pretty sure that extending the infrastructure this way will find
> > > > more users than just I/O throttling, and that having different parents
> > > > in different roles is universally useful. With qcow2 exposing the
> > > > snapshots, too, I already named a second potential user of the
> > > > infrastructure.
> > > >
> > > > > The name of the shared state is the throttle group name.
> > > > > The three added methods are used to add, configure and destroy such shared
> > > > > states.
> > > > >
> > > > > The benefit of this aproach is that we don't need to add a special slot mechanism
> > > > > and that removing BDS 2 would be easy.
> > > > > Your approach don't deal with the fact that the throttling group membership can
> > > > > be changed dynamically while the vm is running: for example adding qcow2_C and
> > > > > removing qcow2_B should be made easy.
> > > >
> > > > Yes, this is right. But then, the nice thing about it is that I stayed
> > > > fully within the one uniform graph. We just need a way to modify the
> > > > edges in this graph (and we already need that to insert/delete filters)
> > > > and you get this special case and many others for free.
> > > >
> > > > So, I vote for investing into a uniform infrastructure here instead of
> > > > adding new one-off node types.
> > >
> > > Maybe parents BDS could use a generic block function to get a cookie when they
> > > start to use a children BDS.
> > >
> > > The parent would to
> > >
> > > bs->file_cookie = bdrv_get_cookie(file);
> > > bs->file = file;
> > >
> > > when choosing to use file as bs file.
> > >
> > > The get cookie method would be
> > >
> > > uint64_t bdrv_get_cookie(bs) {
> > > bs->cookie = gen_uuid(bs);
> > > return bs->cookie;
> > > }
> > >
> > > gen_uuid would combine a random 64 bit number with a registry to prevent
> > > identical cookie generation.
> > >
> > > After this step every BlockDriver method would receive the cookie as second
> > > parameter.
> > >
> > > For example bdrv_read(bs, cookie, ...)
> > >
> > > So it's easy for a block driver to discriminate based on the cookie and even to
> > > look up which of his own child is associated to this cookie.
> > >
> > > Best regards
> > >
> > > Benoît
> > >
> > > >
> > > > Kevin
> >
> > Kevin: what do you think of this cookie idea ?
> > It seems something doable with reasonable small steps.
>
> Sorry, I was going to reply with some more detailed description of what
> things should look like, but got preempted once again.
>
> So, no, this cookies thing is not directly the right thing to do. The
> idea that the information must be passed is alright, but not as an
> additional int parameter. First thing is that you can simply use another
> opaque pointer instead of the integer so that the driver doesn't have to
> look it up, but can directly use it. The second thing is that there's no
> need to have two parameters, when one of them implies the other one.
>
> So what you end up with is a new type of structure, and you'll split
> today's BlockDriverStates in three parts:
>
> - BlockBackend (the thing that each guest device has)
> - BlockView (a qcow2 snapshot or a "slot" for I/O throttling filters)
> - BlockDriverState (deals with the image file and provides views)
>
> I'm not completely happy with these names, but I have to use something
> for this discussion, so I'll just use them until someone comes up with
> something better.
>
> In the end, we should have something like:
>
> typedef struct BlockDriverState {
> /* Like today, except without the fields covered elsewhere */
> } BlockDriverState;
>
> typedef struct BlockView {
> BlockDriverState *bs;
> const char *view_name;
> uint64_t total_bytes;
> ...
> /* more common per-view data */
> } BlockView;
>
> typedef struct Qcow2View {
> BlockView common;
> uint64_t *l1_table;
> ...
> /* more per-snapshot data */
> }
>
> Then you have the different block layer functions, and some of them
> refer to the whole BlockDriverState (like bdrv_open(), which opens the
> images and creates all of the views) and others operate on a given
> BlockView (like bdrv_co_preadv()).
Does BDS nodes contains pointers to their views ?
What is the relation ship between views and the node graph ?
Best regards
Benoît
>
> Kevin
>
prev parent reply other threads:[~2014-03-27 15:49 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-03-14 15:57 [Qemu-devel] n ways block filters Benoît Canet
2014-03-17 3:12 ` Fam Zheng
2014-03-17 13:52 ` Benoît Canet
2014-03-17 16:02 ` Stefan Hajnoczi
2014-03-18 13:27 ` Kevin Wolf
2014-03-20 14:05 ` Benoît Canet
2014-03-20 15:12 ` Kevin Wolf
2014-03-20 15:47 ` Benoît Canet
2014-03-20 16:06 ` Benoît Canet
2014-03-24 14:53 ` Benoît Canet
2014-03-24 15:58 ` Kevin Wolf
2014-03-27 15:49 ` Benoît Canet [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140327154933.GC3945@irqsave.net \
--to=benoit.canet@irqsave.net \
--cc=armbru@redhat.com \
--cc=famz@redhat.com \
--cc=kwolf@redhat.com \
--cc=mreitz@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@gmail.com \
--cc=stefanha@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).