All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Benoît Canet" <benoit.canet@irqsave.net>
To: "Benoît Canet" <benoit.canet@irqsave.net>
Cc: Kevin Wolf <kwolf@redhat.com>, Fam Zheng <famz@redhat.com>,
	Stefan Hajnoczi <stefanha@gmail.com>,
	qemu-devel <qemu-devel@nongnu.org>, Max Reitz <mreitz@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	armbru@redhat.com
Subject: Re: [Qemu-devel] n ways block filters
Date: Mon, 24 Mar 2014 15:53:19 +0100	[thread overview]
Message-ID: <20140324145319.GG3071@irqsave.net> (raw)
In-Reply-To: <20140320160626.GB3045@irqsave.net>

The Thursday 20 Mar 2014 à 17:06:26 (+0100), Benoît Canet wrote :
> The Thursday 20 Mar 2014 à 16:12:34 (+0100), Kevin Wolf wrote :
> > Am 20.03.2014 um 15:05 hat Benoît Canet geschrieben:
> > > The Tuesday 18 Mar 2014 à 14:27:47 (+0100), Kevin Wolf wrote :
> > > > Am 17.03.2014 um 17:02 hat Stefan Hajnoczi geschrieben:
> > > > > On Mon, Mar 17, 2014 at 4:12 AM, Fam Zheng <famz@redhat.com> wrote:
> > > > > > On Fri, 03/14 16:57, Benoît Canet wrote:
> > > > > >> I discussed a bit with Stefan on the list and we came to the conclusion that the
> > > > > >> block filter API need group support.
> > > > > >>
> > > > > >> filter group:
> > > > > >> -------------
> > > > > >>
> > > > > >> My current plan to implement this is to add the following fields to the BlockDriver
> > > > > >> structure.
> > > > > >>
> > > > > >> int bdrv_add_filter_group(const char *name, QDict options);
> > > > > >> int bdrv_reconfigure_filter_group(const char *name, QDict options);
> > > > > >> int bdrv_destroy_filter_group(const char *name);
> > > > 
> > > > Benoît, your mail left me puzzled. You didn't really describe the
> > > > problem that you're solving, nor what the QDict options actually
> > > > contains or what a filter group even is.
> > > > 
> > > > > >> These three extra method would allow to create, reconfigure or destroy a block
> > > > > >> filter group. A block filter group contain the shared or non shared state of the
> > > > > >> blockfilter. For throttling it would contains the ThrottleState structure.
> > > > > >>
> > > > > >> Each block filter driver would contains a linked list of linked list where the
> > > > > >> BDS are registered grouped by filter groups state.
> > > > > >
> > > > > > Sorry I don't fully understand this. Does a filter group contain multiple block
> > > > > > filters, and every block filter has effect on multiple BDSes? Could you give an
> > > > > > example?
> > > > > 
> > > > > Just to why a "group" mechanism is useful:
> > > > > 
> > > > > You want to impose a 2000 IOPS limit for the entire VM.  Currently
> > > > > this is not possible because each drive has its own throttling state.
> > > > > 
> > > > > We need a way to say certain drives are part of a group.  All drives
> > > > > in a group share the same throttling state and therefore a 2000 IOPS
> > > > > limit is shared amongst them.
> > > > 
> > > > Now at least I have an idea what you're all talking about, but it's
> > > > still not obvious to me how the three functions from above solve your
> > > > problem or how they work in detail.
> > > > 
> > > > The obvious solution, using often discussed blockdev-add concepts, is:
> > > >                  ______________
> > > > virtio-blk_A --> |            | --> qcow2_A --> raw-posix_A
> > > >                  | throttling |
> > > > virtio_blk_B --> |____________| --> qcow2_B --> nbd_B
> > > 
> > > My proposal would be:
> > >                  ______________
> > > virtio-blk_A --> | BDS 1      | --> qcow2_A --> raw-posix_A
> > >                  |____________|
> > >                       |
> > >                  _____|________
> > >                  |            |  The shared state is the state of a BDS group
> > >                  | Shared     |  It's stored in a static linked list of the
> > >                  | State      |  block/throttle.c module. It has a name and contains a
> > >                  |____________|  throttle state structure.
> > >                       |
> > >                  _____|________
> > >                  |  BDS 2     |
> > > virtio_blk_B --> |____________| --> qcow2_B --> nbd_B
> > 
> > Okay. I think your proposal might be easier to implement in the short
> > run, but it introduces an additional type of nodes to the graph (so far
> > we have only one type, BlockDriverStates) with their own set of
> > functions, and I assume monitor commands, for management.
> > 
> > This makes the whole graph less uniform and consistent. There may be
> > cases where this is necessary or at least tolerable because the fully
> > generic alternativ isn't doable. I'm not convinced yet that this is the
> > case here.
> > 
> > In contrast, my approach would require considerable infrastructure work
> > (you somehow seem to attract that kind of things ;-)), but it's merely a
> > generalisation of what we already have and as such fits nicely in the
> > graph.
> > 
> > We already have multiple children of BDS nodes. And we take it for
> > granted that they don't refer to the same data, but that bs->file and
> > bs->backing_hd have actually different semantics.
> > 
> > We have recently introduced refcounts for BDSes so that one BDS can now
> > have multiple parents, too, as a first step towards symmetry. The
> > logical extension is that these parent get different semantics, just
> > like the children have different semantics.
> > 
> > Doing the abstraction in one model right instead of adding hacks that
> > don't really fit in but are easy to implement has paid off in the past.
> > I'm pretty sure that extending the infrastructure this way will find
> > more users than just I/O throttling, and that having different parents
> > in different roles is universally useful. With qcow2 exposing the
> > snapshots, too, I already named a second potential user of the
> > infrastructure.
> > 
> > > The name of the shared state is the throttle group name.
> > > The three added methods are used to add, configure and destroy such shared
> > > states.
> > > 
> > > The benefit of this aproach is that we don't need to add a special slot mechanism
> > > and that removing BDS 2 would be easy.
> > > Your approach don't deal with the fact that the throttling group membership can
> > > be changed dynamically while the vm is running: for example adding qcow2_C and
> > > removing qcow2_B should be made easy.
> > 
> > Yes, this is right. But then, the nice thing about it is that I stayed
> > fully within the one uniform graph. We just need a way to modify the
> > edges in this graph (and we already need that to insert/delete filters)
> > and you get this special case and many others for free.
> > 
> > So, I vote for investing into a uniform infrastructure here instead of
> > adding new one-off node types.
> 
> Maybe parents BDS could use a generic block function to get a cookie when they
> start to use a children BDS.
> 
> The parent would to
> 
> bs->file_cookie = bdrv_get_cookie(file);
> bs->file = file;
> 
> when choosing to use file as bs file.
> 
> The get cookie method would be
> 
> uint64_t bdrv_get_cookie(bs) {
>    bs->cookie = gen_uuid(bs);
>    return bs->cookie;
> }
> 
> gen_uuid would combine a random 64 bit number with a registry to prevent
> identical cookie generation.
> 
> After this step every BlockDriver method would receive the cookie as second
> parameter.
> 
> For example bdrv_read(bs, cookie, ...)
> 
> So it's easy for a block driver to discriminate based on the cookie and even to
> look up which of his own child is associated to this cookie.
> 
> Best regards
> 
> Benoît
> 
> > 
> > Kevin

Kevin: what do you think of this cookie idea ?
It seems something doable with reasonable small steps.

Best regards

Benoît

> > 
> > > > That is, the I/O throttling BDS is referenced by two devices instead of
> > > > just one and it associates one 'input' with one 'output'. Once we have
> > > > BlockBackend, we would have two BBs, but still only one throttling
> > > > BDS.
> > > > 
> > > > The new thing that you get there is that the throttling driver has
> > > > not only multiple parents (that part exists today), but it behaves
> > > > differently depending on who called it. So we need to provide some way
> > > > for one BDS to expose multiple slots or whatever you want to call them
> > > > that users can attach to.
> > > > 
> > > > This is, by the way, the very same thing as would be required for
> > > > exposing qcow2 internal snapshots (read-only) while the VM is running.
> > > > 
> > > > Kevin
> > > > 

  reply	other threads:[~2014-03-24 14:53 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-14 15:57 [Qemu-devel] n ways block filters Benoît Canet
2014-03-17  3:12 ` Fam Zheng
2014-03-17 13:52   ` Benoît Canet
2014-03-17 16:02   ` Stefan Hajnoczi
2014-03-18 13:27     ` Kevin Wolf
2014-03-20 14:05       ` Benoît Canet
2014-03-20 15:12         ` Kevin Wolf
2014-03-20 15:47           ` Benoît Canet
2014-03-20 16:06           ` Benoît Canet
2014-03-24 14:53             ` Benoît Canet [this message]
2014-03-24 15:58               ` Kevin Wolf
2014-03-27 15:49                 ` Benoît Canet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140324145319.GG3071@irqsave.net \
    --to=benoit.canet@irqsave.net \
    --cc=armbru@redhat.com \
    --cc=famz@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@gmail.com \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.