From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 27 Mar 2014 16:49:34 +0100
From: Benoît Canet
Message-ID: <20140327154933.GC3945@irqsave.net>
In-Reply-To: <20140324155855.GD3094@noname.str.redhat.com>
Subject: Re: [Qemu-devel] n ways block filters
To: Kevin Wolf
Cc: Benoît Canet, Fam Zheng, Stefan Hajnoczi, armbru@redhat.com,
    qemu-devel, Max Reitz

On Monday 24 Mar 2014 at 16:58:55 (+0100), Kevin Wolf wrote:
> On 24.03.2014 at 15:53, Benoît Canet wrote:
> > On Thursday 20 Mar 2014 at 17:06:26 (+0100), Benoît Canet wrote:
> > > On Thursday 20 Mar 2014 at 16:12:34 (+0100), Kevin Wolf wrote:
> > > > On 20.03.2014 at 15:05, Benoît Canet wrote:
> > > > > On Tuesday 18 Mar 2014 at 14:27:47 (+0100), Kevin Wolf wrote:
> > > > > > On 17.03.2014 at 17:02, Stefan Hajnoczi wrote:
> > > > > > > On Mon, Mar 17, 2014 at 4:12 AM, Fam Zheng wrote:
> > > > > > > > On Fri, 03/14 16:57, Benoît Canet wrote:
> > > > > > > >> I discussed a bit with Stefan on the list and we came to
> > > > > > > >> the conclusion that the block filter API needs group
> > > > > > > >> support.
> > > > > > > >>
> > > > > > > >> filter group:
> > > > > > > >> -------------
> > > > > > > >>
> > > > > > > >> My current plan to implement this is to add the
> > > > > > > >> following fields to the BlockDriver structure.
> > > > > > > >>
> > > > > > > >> int bdrv_add_filter_group(const char *name, QDict *options);
> > > > > > > >> int bdrv_reconfigure_filter_group(const char *name, QDict *options);
> > > > > > > >> int bdrv_destroy_filter_group(const char *name);
> > > > > >
> > > > > > Benoît, your mail left me puzzled. You didn't really describe
> > > > > > the problem that you're solving, nor what the QDict options
> > > > > > actually contain or what a filter group even is.
> > > > > >
> > > > > > > >> These three extra methods would allow creating,
> > > > > > > >> reconfiguring or destroying a block filter group. A
> > > > > > > >> block filter group contains the shared or non-shared
> > > > > > > >> state of the block filter. For throttling it would
> > > > > > > >> contain the ThrottleState structure.
> > > > > > > >>
> > > > > > > >> Each block filter driver would contain a linked list of
> > > > > > > >> linked lists where the BDSes are registered, grouped by
> > > > > > > >> filter group state.
> > > > > > > >
> > > > > > > > Sorry, I don't fully understand this. Does a filter group
> > > > > > > > contain multiple block filters, and does every block
> > > > > > > > filter affect multiple BDSes? Could you give an example?
> > > > > > >
> > > > > > > Just to explain why a "group" mechanism is useful:
> > > > > > >
> > > > > > > You want to impose a 2000 IOPS limit for the entire VM.
> > > > > > > Currently this is not possible because each drive has its
> > > > > > > own throttling state.
> > > > > > >
> > > > > > > We need a way to say certain drives are part of a group.
> > > > > > > All drives in a group share the same throttling state and
> > > > > > > therefore a 2000 IOPS limit is shared amongst them.
> > > > > >
> > > > > > Now at least I have an idea what you're all talking about,
> > > > > > but it's still not obvious to me how the three functions from
> > > > > > above solve your problem or how they work in detail.
> > > > > >
> > > > > > The obvious solution, using often discussed blockdev-add
> > > > > > concepts, is:
> > > > > >
> > > > > >                  ______________
> > > > > > virtio-blk_A --> |            | --> qcow2_A --> raw-posix_A
> > > > > >                  | throttling |
> > > > > > virtio_blk_B --> |____________| --> qcow2_B --> nbd_B
> > > > >
> > > > > My proposal would be:
> > > > >
> > > > >                  ______________
> > > > > virtio-blk_A --> |   BDS 1    | --> qcow2_A --> raw-posix_A
> > > > >                  |____________|
> > > > >                        |
> > > > >                  ______|_______
> > > > >                  |            |  The shared state is the state
> > > > >                  |   Shared   |  of a BDS group. It's stored in
> > > > >                  |   State    |  a static linked list of the
> > > > >                  |____________|  block/throttle.c module. It
> > > > >                        |         has a name and contains a
> > > > >                  ______|_______  throttle state structure.
> > > > >                  |   BDS 2    |
> > > > > virtio_blk_B --> |____________| --> qcow2_B --> nbd_B
> > > >
> > > > Okay. I think your proposal might be easier to implement in the
> > > > short run, but it introduces an additional type of node to the
> > > > graph (so far we have only one type, BlockDriverStates) with its
> > > > own set of functions, and I assume monitor commands, for
> > > > management.
> > > >
> > > > This makes the whole graph less uniform and consistent. There may
> > > > be cases where this is necessary or at least tolerable because
> > > > the fully generic alternative isn't doable. I'm not convinced yet
> > > > that this is the case here.
> > > >
> > > > In contrast, my approach would require considerable
> > > > infrastructure work (you somehow seem to attract that kind of
> > > > thing ;-)), but it's merely a generalisation of what we already
> > > > have and as such fits nicely in the graph.
> > > >
> > > > We already have multiple children of BDS nodes. And we take it
> > > > for granted that they don't refer to the same data, but that
> > > > bs->file and bs->backing_hd have actually different semantics.
> > > >
> > > > We have recently introduced refcounts for BDSes so that one BDS
> > > > can now have multiple parents, too, as a first step towards
> > > > symmetry. The logical extension is that these parents get
> > > > different semantics, just like the children have different
> > > > semantics.
> > > >
> > > > Doing the abstraction right in one model instead of adding hacks
> > > > that don't really fit in but are easy to implement has paid off
> > > > in the past. I'm pretty sure that extending the infrastructure
> > > > this way will find more users than just I/O throttling, and that
> > > > having different parents in different roles is universally
> > > > useful. With qcow2 exposing the snapshots, too, I already named a
> > > > second potential user of the infrastructure.
> > > >
> > > > > The name of the shared state is the throttle group name.
> > > > > The three added methods are used to add, configure and destroy
> > > > > such shared states.
> > > > >
> > > > > The benefit of this approach is that we don't need to add a
> > > > > special slot mechanism and that removing BDS 2 would be easy.
> > > > > Your approach doesn't deal with the fact that the throttling
> > > > > group membership can be changed dynamically while the VM is
> > > > > running: for example adding qcow2_C and removing qcow2_B should
> > > > > be made easy.
> > > >
> > > > Yes, this is right. But then, the nice thing about it is that I
> > > > stayed fully within the one uniform graph.
> > > > We just need a way to modify the edges in this graph (and we
> > > > already need that to insert/delete filters) and you get this
> > > > special case and many others for free.
> > > >
> > > > So, I vote for investing into a uniform infrastructure here
> > > > instead of adding new one-off node types.
> > >
> > > Maybe parent BDSes could use a generic block function to get a
> > > cookie when they start to use a child BDS.
> > >
> > > The parent would do
> > >
> > > bs->file_cookie = bdrv_get_cookie(file);
> > > bs->file = file;
> > >
> > > when choosing to use file as bs's file.
> > >
> > > The get cookie method would be
> > >
> > > uint64_t bdrv_get_cookie(BlockDriverState *bs)
> > > {
> > >     bs->cookie = gen_uuid(bs);
> > >     return bs->cookie;
> > > }
> > >
> > > gen_uuid() would combine a random 64-bit number with a registry to
> > > prevent generating identical cookies.
> > >
> > > After this step every BlockDriver method would receive the cookie
> > > as a second parameter, for example bdrv_read(bs, cookie, ...).
> > >
> > > So it's easy for a block driver to discriminate based on the
> > > cookie and even to look up which of its own children is associated
> > > with this cookie.
> > >
> > > Best regards
> > >
> > > Benoît
> > >
> > > > Kevin
> >
> > Kevin: what do you think of this cookie idea?
> > It seems doable in reasonably small steps.
>
> Sorry, I was going to reply with some more detailed description of
> what things should look like, but got preempted once again.
>
> So, no, this cookie thing is not directly the right thing to do. The
> idea that the information must be passed is alright, but not as an
> additional int parameter. The first thing is that you can simply use
> another opaque pointer instead of the integer so that the driver
> doesn't have to look it up, but can directly use it.
> The second thing is that there's no need to have two parameters when
> one of them implies the other one.
>
> So what you end up with is a new type of structure, and you'll split
> today's BlockDriverState into three parts:
>
> - BlockBackend (the thing that each guest device has)
> - BlockView (a qcow2 snapshot or a "slot" for I/O throttling filters)
> - BlockDriverState (deals with the image file and provides views)
>
> I'm not completely happy with these names, but I have to use something
> for this discussion, so I'll just use them until someone comes up with
> something better.
>
> In the end, we should have something like:
>
> typedef struct BlockDriverState {
>     /* Like today, except without the fields covered elsewhere */
> } BlockDriverState;
>
> typedef struct BlockView {
>     BlockDriverState *bs;
>     const char *view_name;
>     uint64_t total_bytes;
>     ...
>     /* more common per-view data */
> } BlockView;
>
> typedef struct Qcow2View {
>     BlockView common;
>     uint64_t *l1_table;
>     ...
>     /* more per-snapshot data */
> } Qcow2View;
>
> Then you have the different block layer functions, and some of them
> refer to the whole BlockDriverState (like bdrv_open(), which opens the
> images and creates all of the views) and others operate on a given
> BlockView (like bdrv_co_preadv()).

Do BDS nodes contain pointers to their views?
What is the relationship between views and the node graph?

Best regards

Benoît

> Kevin