From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:35348) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WS7HK-00087v-1a for qemu-devel@nongnu.org; Mon, 24 Mar 2014 11:59:16 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1WS7HC-00015V-Vv for qemu-devel@nongnu.org; Mon, 24 Mar 2014 11:59:10 -0400 Received: from mx1.redhat.com ([209.132.183.28]:47997) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WS7HC-000158-Nk for qemu-devel@nongnu.org; Mon, 24 Mar 2014 11:59:02 -0400 Date: Mon, 24 Mar 2014 16:58:55 +0100 From: Kevin Wolf Message-ID: <20140324155855.GD3094@noname.str.redhat.com> References: <20140314155756.GC3324@irqsave.net> <20140317031231.GA28582@T430.nay.redhat.com> <20140318132747.GN4607@noname.str.redhat.com> <20140320140509.GA3205@irqsave.net> <20140320151234.GC3450@noname.redhat.com> <20140320160626.GB3045@irqsave.net> <20140324145319.GG3071@irqsave.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline In-Reply-To: <20140324145319.GG3071@irqsave.net> Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] n ways block filters List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: =?iso-8859-1?Q?Beno=EEt?= Canet Cc: Fam Zheng , Stefan Hajnoczi , armbru@redhat.com, qemu-devel , Stefan Hajnoczi , Max Reitz Am 24.03.2014 um 15:53 hat Beno=EEt Canet geschrieben: > The Thursday 20 Mar 2014 =E0 17:06:26 (+0100), Beno=EEt Canet wrote : > > The Thursday 20 Mar 2014 =E0 16:12:34 (+0100), Kevin Wolf wrote : > > > Am 20.03.2014 um 15:05 hat Beno=EEt Canet geschrieben: > > > > The Tuesday 18 Mar 2014 =E0 14:27:47 (+0100), Kevin Wolf wrote : > > > > > Am 17.03.2014 um 17:02 hat Stefan Hajnoczi geschrieben: > > > > > > On Mon, Mar 17, 2014 at 4:12 AM, Fam Zheng = wrote: > > > > > > > On Fri, 03/14 16:57, Beno=EEt Canet wrote: > > > > > > >> I discussed a bit with Stefan on the list and we came to t= he conclusion that the > > > > > > >> block filter API need group support. > > > > > > >> > > > > > > >> filter group: > > > > > > >> ------------- > > > > > > >> > > > > > > >> My current plan to implement this is to add the following = fields to the BlockDriver > > > > > > >> structure. > > > > > > >> > > > > > > >> int bdrv_add_filter_group(const char *name, QDict options)= ; > > > > > > >> int bdrv_reconfigure_filter_group(const char *name, QDict = options); > > > > > > >> int bdrv_destroy_filter_group(const char *name); > > > > >=20 > > > > > Beno=EEt, your mail left me puzzled. You didn't really describe= the > > > > > problem that you're solving, nor what the QDict options actuall= y > > > > > contains or what a filter group even is. > > > > >=20 > > > > > > >> These three extra method would allow to create, reconfigur= e or destroy a block > > > > > > >> filter group. A block filter group contain the shared or n= on shared state of the > > > > > > >> blockfilter. For throttling it would contains the Throttle= State structure. > > > > > > >> > > > > > > >> Each block filter driver would contains a linked list of l= inked list where the > > > > > > >> BDS are registered grouped by filter groups state. > > > > > > > > > > > > > > Sorry I don't fully understand this. Does a filter group co= ntain multiple block > > > > > > > filters, and every block filter has effect on multiple BDSe= s? Could you give an > > > > > > > example? > > > > > >=20 > > > > > > Just to why a "group" mechanism is useful: > > > > > >=20 > > > > > > You want to impose a 2000 IOPS limit for the entire VM. Curr= ently > > > > > > this is not possible because each drive has its own throttlin= g state. > > > > > >=20 > > > > > > We need a way to say certain drives are part of a group. All= drives > > > > > > in a group share the same throttling state and therefore a 20= 00 IOPS > > > > > > limit is shared amongst them. > > > > >=20 > > > > > Now at least I have an idea what you're all talking about, but = it's > > > > > still not obvious to me how the three functions from above solv= e your > > > > > problem or how they work in detail. > > > > >=20 > > > > > The obvious solution, using often discussed blockdev-add concep= ts, is: > > > > > ______________ > > > > > virtio-blk_A --> | | --> qcow2_A --> raw-posix_A > > > > > | throttling | > > > > > virtio_blk_B --> |____________| --> qcow2_B --> nbd_B > > > >=20 > > > > My proposal would be: > > > > ______________ > > > > virtio-blk_A --> | BDS 1 | --> qcow2_A --> raw-posix_A > > > > |____________| > > > > | > > > > _____|________ > > > > | | The shared state is the state of= a BDS group > > > > | Shared | It's stored in a static linked l= ist of the > > > > | State | block/throttle.c module. It has = a name and contains a > > > > |____________| throttle state structure. > > > > | > > > > _____|________ > > > > | BDS 2 | > > > > virtio_blk_B --> |____________| --> qcow2_B --> nbd_B > > >=20 > > > Okay. I think your proposal might be easier to implement in the sho= rt > > > run, but it introduces an additional type of nodes to the graph (so= far > > > we have only one type, BlockDriverStates) with their own set of > > > functions, and I assume monitor commands, for management. > > >=20 > > > This makes the whole graph less uniform and consistent. There may b= e > > > cases where this is necessary or at least tolerable because the ful= ly > > > generic alternativ isn't doable. I'm not convinced yet that this is= the > > > case here. > > >=20 > > > In contrast, my approach would require considerable infrastructure = work > > > (you somehow seem to attract that kind of things ;-)), but it's mer= ely a > > > generalisation of what we already have and as such fits nicely in t= he > > > graph. > > >=20 > > > We already have multiple children of BDS nodes. And we take it for > > > granted that they don't refer to the same data, but that bs->file a= nd > > > bs->backing_hd have actually different semantics. > > >=20 > > > We have recently introduced refcounts for BDSes so that one BDS can= now > > > have multiple parents, too, as a first step towards symmetry. The > > > logical extension is that these parent get different semantics, jus= t > > > like the children have different semantics. > > >=20 > > > Doing the abstraction in one model right instead of adding hacks th= at > > > don't really fit in but are easy to implement has paid off in the p= ast. > > > I'm pretty sure that extending the infrastructure this way will fin= d > > > more users than just I/O throttling, and that having different pare= nts > > > in different roles is universally useful. With qcow2 exposing the > > > snapshots, too, I already named a second potential user of the > > > infrastructure. > > >=20 > > > > The name of the shared state is the throttle group name. > > > > The three added methods are used to add, configure and destroy su= ch shared > > > > states. > > > >=20 > > > > The benefit of this aproach is that we don't need to add a specia= l slot mechanism > > > > and that removing BDS 2 would be easy. > > > > Your approach don't deal with the fact that the throttling group = membership can > > > > be changed dynamically while the vm is running: for example addin= g qcow2_C and > > > > removing qcow2_B should be made easy. > > >=20 > > > Yes, this is right. But then, the nice thing about it is that I sta= yed > > > fully within the one uniform graph. We just need a way to modify th= e > > > edges in this graph (and we already need that to insert/delete filt= ers) > > > and you get this special case and many others for free. > > >=20 > > > So, I vote for investing into a uniform infrastructure here instead= of > > > adding new one-off node types. > >=20 > > Maybe parents BDS could use a generic block function to get a cookie = when they > > start to use a children BDS. > >=20 > > The parent would to > >=20 > > bs->file_cookie =3D bdrv_get_cookie(file); > > bs->file =3D file; > >=20 > > when choosing to use file as bs file. > >=20 > > The get cookie method would be > >=20 > > uint64_t bdrv_get_cookie(bs) { > > bs->cookie =3D gen_uuid(bs); > > return bs->cookie; > > } > >=20 > > gen_uuid would combine a random 64 bit number with a registry to prev= ent > > identical cookie generation. > >=20 > > After this step every BlockDriver method would receive the cookie as = second > > parameter. > >=20 > > For example bdrv_read(bs, cookie, ...) > >=20 > > So it's easy for a block driver to discriminate based on the cookie a= nd even to > > look up which of his own child is associated to this cookie. > >=20 > > Best regards > >=20 > > Beno=EEt > >=20 > > >=20 > > > Kevin >=20 > Kevin: what do you think of this cookie idea ? > It seems something doable with reasonable small steps. Sorry, I was going to reply with some more detailed description of what things should look like, but got preempted once again. So, no, this cookies thing is not directly the right thing to do. The idea that the information must be passed is alright, but not as an additional int parameter. First thing is that you can simply use another opaque pointer instead of the integer so that the driver doesn't have to look it up, but can directly use it. The second thing is that there's no need to have two parameters, when one of them implies the other one. So what you end up with is a new type of structure, and you'll split today's BlockDriverStates in three parts: - BlockBackend (the thing that each guest device has) - BlockView (a qcow2 snapshot or a "slot" for I/O throttling filters) - BlockDriverState (deals with the image file and provides views) I'm not completely happy with these names, but I have to use something for this discussion, so I'll just use them until someone comes up with something better. In the end, we should have something like: typedef struct BlockDriverState { /* Like today, except without the fields covered elsewhere */ } BlockDriverState; typedef struct BlockView { BlockDriverState *bs; const char *view_name; uint64_t total_bytes; ... /* more common per-view data */ } BlockView; typedef struct Qcow2View { BlockView common; uint64_t *l1_table; ... /* more per-snapshot data */ } Then you have the different block layer functions, and some of them refer to the whole BlockDriverState (like bdrv_open(), which opens the images and creates all of the views) and others operate on a given BlockView (like bdrv_co_preadv()). Kevin