Date: Thu, 20 Mar 2014 16:47:32 +0100
From: Benoît Canet
Message-ID: <20140320154732.GA3045@irqsave.net>
In-Reply-To: <20140320151234.GC3450@noname.redhat.com>
Subject: Re: [Qemu-devel] n ways block filters
To: Kevin Wolf
Cc: Benoît Canet, Fam Zheng, Stefan Hajnoczi, armbru@redhat.com,
 qemu-devel, Stefan Hajnoczi, Max Reitz

On Thursday 20 Mar 2014 at 16:12:34 (+0100), Kevin Wolf wrote:
> On 20.03.2014 at 15:05, Benoît Canet wrote:
> > On Tuesday 18 Mar 2014 at 14:27:47 (+0100), Kevin Wolf wrote:
> > > On 17.03.2014 at 17:02, Stefan Hajnoczi wrote:
> > > > On Mon, Mar 17, 2014 at 4:12 AM, Fam Zheng wrote:
> > > > > On Fri, 03/14 16:57, Benoît Canet wrote:
> > > > >> I discussed a bit with Stefan on the list and we came to the
> > > > >> conclusion that the block filter API needs group support.
> > > > >>
> > > > >> filter group:
> > > > >> -------------
> > > > >>
> > > > >> My current plan to implement this is to add the following
> > > > >> fields to the BlockDriver structure.
> > > > >>
> > > > >> int bdrv_add_filter_group(const char *name, QDict options);
> > > > >> int bdrv_reconfigure_filter_group(const char *name, QDict options);
> > > > >> int bdrv_destroy_filter_group(const char *name);
> > >
> > > Benoît, your mail left me puzzled. You didn't really describe the
> > > problem that you're solving, nor what the QDict options actually
> > > contain or what a filter group even is.
> > >
> > > > >> These three extra methods would allow creating, reconfiguring
> > > > >> or destroying a block filter group. A block filter group
> > > > >> contains the shared or non-shared state of the block filter.
> > > > >> For throttling it would contain the ThrottleState structure.
> > > > >>
> > > > >> Each block filter driver would contain a linked list of linked
> > > > >> lists in which the BDSes are registered, grouped by filter
> > > > >> group state.
> > > > >
> > > > > Sorry, I don't fully understand this. Does a filter group
> > > > > contain multiple block filters, and does every block filter
> > > > > affect multiple BDSes? Could you give an example?
> > > >
> > > > Just to explain why a "group" mechanism is useful:
> > > >
> > > > You want to impose a 2000 IOPS limit for the entire VM. Currently
> > > > this is not possible because each drive has its own throttling
> > > > state.
> > > >
> > > > We need a way to say certain drives are part of a group. All
> > > > drives in a group share the same throttling state and therefore a
> > > > 2000 IOPS limit is shared amongst them.
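
To make the shared state concrete, here is a rough sketch of what such
a group could hold. Everything below is made up for this discussion;
only ThrottleState and the QLIST_* macros exist in the tree today
(include/qemu/throttle.h and include/qemu/queue.h):

#include "qemu/queue.h"      /* QLIST_* intrusive list macros */
#include "qemu/throttle.h"   /* ThrottleState (this one is real) */

typedef struct ThrottleGroup ThrottleGroup;

/* Hypothetical per-BDS private state ('s') of the throttle filter. */
typedef struct BDRVThrottleState {
    ThrottleGroup *group;                  /* the shared state we belong to */
    QLIST_ENTRY(BDRVThrottleState) entry;  /* membership in group->members  */
} BDRVThrottleState;

/* Hypothetical named group: one ThrottleState shared by every member
 * BDS, so a single 2000 IOPS budget is drained by all of them. */
struct ThrottleGroup {
    char *name;                              /* throttle group name          */
    ThrottleState ts;                        /* the one shared bucket state  */
    QLIST_HEAD(, BDRVThrottleState) members; /* BDSes sharing this state     */
    QLIST_ENTRY(ThrottleGroup) list;         /* entry in a static group list */
};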
> > > Now at least I have an idea what you're all talking about, but it's
> > > still not obvious to me how the three functions from above solve
> > > your problem or how they work in detail.
> > >
> > > The obvious solution, using often discussed blockdev-add concepts, is:
> > >
> > >                  ______________
> > > virtio-blk_A --> |            | --> qcow2_A --> raw-posix_A
> > >                  | throttling |
> > > virtio_blk_B --> |____________| --> qcow2_B --> nbd_B
> >
> > My proposal would be:
> >
> >                  ______________
> > virtio-blk_A --> |   BDS 1    | --> qcow2_A --> raw-posix_A
> >                  |____________|
> >                        |
> >                  ______|_______
> >                 |              |  The shared state is the state of a
> >                 |    Shared    |  BDS group. It's stored in a static
> >                 |    State     |  linked list of the block/throttle.c
> >                 |______________|  module. It has a name and contains
> >                        |          a throttle state structure.
> >                  ______|_______
> >                 |              |
> > virtio_blk_B -->|    BDS 2     |--> qcow2_B --> nbd_B
> >                 |______________|
>
> Okay. I think your proposal might be easier to implement in the short
> run, but it introduces an additional type of node to the graph (so far
> we have only one type, BlockDriverStates) with its own set of
> functions and, I assume, monitor commands for management.

BDS 1 and BDS 2 would be regular BDSes. Their associated 's' (private
driver state) structure would contain a pointer to the shared state.
The shared state would be stored in a static list of the block filter.
The shared state is not seen from outside the block filter module.

Adding the 3 methods to define, configure and destroy groups to the
BlockDriver structure would be the only change required.

block.c would manipulate regular BDSes, with no special case involved.
Only blockdev.c would fiddle with the 3 extra BlockDriver methods.

> This makes the whole graph less uniform and consistent. There may be
> cases where this is necessary or at least tolerable because the fully
> generic alternative isn't doable. I'm not convinced yet that this is
> the case here.
>
> In contrast, my approach would require considerable infrastructure
> work (you somehow seem to attract that kind of thing ;-)), but it's
> merely a generalisation of what we already have and as such fits
> nicely in the graph.
>
> We already have multiple children of BDS nodes. And we take it for
> granted that they don't refer to the same data, but that bs->file and
> bs->backing_hd actually have different semantics.
>
> We have recently introduced refcounts for BDSes so that one BDS can
> now have multiple parents, too, as a first step towards symmetry. The
> logical extension is that these parents get different semantics, just
> like the children have different semantics.
>
> Doing the abstraction right in one model instead of adding hacks that
> don't really fit in but are easy to implement has paid off in the
> past. I'm pretty sure that extending the infrastructure this way will
> find more users than just I/O throttling, and that having different
> parents in different roles is universally useful. With qcow2 exposing
> the snapshots, too, I already named a second potential user of the
> infrastructure.
>
> > The name of the shared state is the throttle group name.
> > The three added methods are used to add, configure and destroy such
> > shared states.
> >
> > The benefit of this approach is that we don't need to add a special
> > slot mechanism and that removing BDS 2 would be easy.
> > Your approach doesn't deal with the fact that throttling group
> > membership can be changed dynamically while the VM is running: for
> > example, adding qcow2_C and removing qcow2_B should be made easy.
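
To make the dynamic-membership point concrete, continuing the sketch
above (the helper below is equally made up): switching a running BDS to
another group is just relinking its private state; no graph edge moves.

/* Made-up helper: look up a group by name, creating it if needed. */
static ThrottleGroup *throttle_group_find_or_create(const char *name);

/* Hypothetical: regroup a live BDS.  A list relink plus one pointer
 * update; the BDS graph itself is never touched. */
static void throttle_set_group(BDRVThrottleState *s, const char *group)
{
    ThrottleGroup *tg = throttle_group_find_or_create(group);

    if (s->group) {
        QLIST_REMOVE(s, entry);                 /* leave the old group */
    }
    s->group = tg;                              /* share tg->ts from now on */
    QLIST_INSERT_HEAD(&tg->members, s, entry);  /* join the new group  */
}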
> Yes, this is right. But then, the nice thing about it is that I stayed
> fully within the one uniform graph. We just need a way to modify the
> edges in this graph (and we already need that to insert/delete
> filters) and you get this special case and many others for free.
>
> So, I vote for investing into a uniform infrastructure here instead of
> adding new one-off node types.
>
> Kevin
>
> > > That is, the I/O throttling BDS is referenced by two devices
> > > instead of just one, and it associates one 'input' with one
> > > 'output'. Once we have BlockBackend, we would have two BBs, but
> > > still only one throttling BDS.
> > >
> > > The new thing that you get there is that the throttling driver has
> > > not only multiple parents (that part exists today), but it behaves
> > > differently depending on who called it. So we need to provide some
> > > way for one BDS to expose multiple slots, or whatever you want to
> > > call them, that users can attach to.
> > >
> > > This is, by the way, the very same thing as would be required for
> > > exposing qcow2 internal snapshots (read-only) while the VM is
> > > running.
> > >
> > > Kevin
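
For contrast, my reading of the slot idea in code form, equally
hypothetical; nothing below exists in the tree:

/* A child edge that remembers which input of a shared node it uses,
 * so one throttling BDS can tell its callers apart while draining a
 * single shared limit. */
typedef struct BdrvSlot {
    BlockDriverState *bs;   /* the shared node, e.g. the throttle filter */
    int index;              /* which input of bs this parent attached to */
} BdrvSlot;

Either way the accounting state ends up shared; the difference is
whether the sharing lives inside the graph (slots on one node) or
beside it (a named group that plain nodes point to).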