From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dario Faggioli Subject: Re: [Design RFC] Towards work-conserving RTDS scheduler Date: Mon, 8 Aug 2016 11:38:31 +0200 Message-ID: <1470649111.9623.79.camel@citrix.com> References: Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============5966683460604918816==" Return-path: Received: from mail6.bemta6.messagelabs.com ([85.158.143.247]) by lists.xenproject.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bWh1b-00068J-6f for xen-devel@lists.xenproject.org; Mon, 08 Aug 2016 09:39:11 +0000 In-Reply-To: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Errors-To: xen-devel-bounces@lists.xen.org Sender: "Xen-devel" To: Meng Xu , "xen-devel@lists.xenproject.org" Cc: George Dunlap List-Id: xen-devel@lists.xenproject.org --===============5966683460604918816== Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="=-g1KYrCjj9wsUb277TEUn" --=-g1KYrCjj9wsUb277TEUn Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Thu, 2016-08-04 at 01:15 -0400, Meng Xu wrote: > Hi Dario, >=20 Hi, > I'm thinking about changing the current RTDS scheduler to > work-conserving version as we briefly discussed before. > Below is a design of the work-conserving RTDS. > I'm hoping to get your feedback about the design ideas first before I > start writing it in code. >=20 Here I am, sorry for the delay. > I think the code change should not be a lot as long as we don't > provide the functionality of switching between work-conserving and > non-work-conserving. Because the following design will keep the > real-time property of the current RTDS scheduler, I don't see the > reason why we should let users switch to non-work-conserving version. > :-) >=20 Oh, but there's a bit one: _money_! :-O If you're a service/cloud provided you may or may not want that a customers that pays for a 40% utilization VM to be able to use more than that. In particular, you may want to ask more money to them, in order to enable that possibility! :-P Anyway, I don't think --with this design of yours-- that it is such a big deal to make it possible to switch work-conserving*ness on and off (see below). Actually, I think it's even possible to to that on a per- vcpu basis, which I think would be quite cool! > --- Below is the design --- >=20 > [...] > > *** Requirement of the work-conserving RTDS scheduler *** > 1) The new RTDS scheduler should be work-conserving, of course. > 2) The new RTDS scheduler should not break any real-time guarantee > provided by the current RTDS scheduler. >=20 > *** Design of=C2=A0=C2=A0Work-Conserving RTDS Scheduler *** > VCPU model > 1) (Period, Budget): Guaranteed time for each > 2) Priority index: It indicates the current=C2=A0=C2=A0priority level of = the > VCPU. When a VCPU=E2=80=99s budget is depleted in the current period, its > priority index will increase by 1 and its budget will be replenished. > 3) A VCPU=E2=80=99s budget and priority index will be reset at the beginn= ing > of each period >=20 Ok, I think I see what you mean and it looks to make sense to me. Just one question/observation. As you know, I come from a CBS mindset. CBS postpones a task/vcpu's deadline when it runs out of budget, and it can, natively, work in work conserving or non-work conserving mode (just by wither continue to consider the vcpu runnable, with the later deadline which mean demoted priority, or block it until the next period, respectively). The nice thing about this is that the scheduling analysis that has been developed works for both modes. Of course, what it says is that you can only guarantee to each vcpu the reserved utilization, and you should not rely on the additional capacity that you may be getting because you're in work conserving mode and the system happened to be idle for a few time this or that other period (so, very similar to what you're proposing). _HOWEVER_, there are evolutions of CBS (called GRUB and SHRUB, I'm sure you'll be able to find the papers), where the 'unused bandwidth' (i.e., the otherwise idle time that you're making use of iff you're in work conserving mode) is distributed in a precise way (according to some weights, IIRC) to the various vcpus, hence making scheduling analysis both possible and useful again. Now, I'm not at all saying that we (you! :-D) should RTDS into using CBS(ish) or anything like that. I'm just thinking out loud and wondering: =C2=A0- could it be useful to have a scheduling analysis in place for the= =C2=A0 =C2=A0 =C2=A0scheduler in work conserving mode (one, of course, that takes = into=C2=A0 =C2=A0 =C2=A0account and give guarantees on the otherwise idle bandwidth...= I=C2=A0 =C2=A0 =C2=A0know that the existing one holds! :-P) ? =C2=A0- if yes, do you already have one --or do you think it will be=C2=A0 =C2=A0 =C2=A0possible to develop one-- for your priority-index based model? Note that I'm not saying you should, and I'd be perfectly fine with a "no analysis, but let's keep things simple for now"... This just came to my mind, and I'm just pointing it ouy, to make sure we consider and think about it, and make a conscious decision. > Scheduling policy: modified gEDF > 1) Priority comparison: > =C2=A0=C2=A0=C2=A0a) VCPUs with lower priority index has higher priority = than VCPUs > with higher priority index > =C2=A0=C2=A0=C2=A0b) VCPUs with same priority index uses gEDF policy to d= ecide the > priority order > 2) Scheduling point > =C2=A0=C2=A0=C2=A0a) VCPU=E2=80=99s budget is depleted for the current pr= iority index > =C2=A0=C2=A0=C2=A0b) VCPU starts a new period > =C2=A0=C2=A0=C2=A0c) VCPU is blocked or waked up > 3) Scheduling decision is made when scheduler is invoked > =C2=A0=C2=A0=C2=A0=C2=A0a) Always pick the current M highest-priority VCP= Us to run on the > M cores. >=20 So, still about the analysis point above, and just out of the top of my head (and without being used to do this things any longer!!), it looks like it's possible think at some analysis for this. In fact, since: =C2=A0- vcpus with different priority indexes are totally disjoint sets, =C2=A0- there's a strict ordering between priority indexes, =C2=A0- vcpus sort of use their scheduling parameters at each priority inde= x This looks to me like vcpus are subject to a "hierarchy" of RTDS schedulers, the one at level x+1 running in the idle time of the one at level x... And I think there's scope for writing down some maths formulas that model this situation. :-) Actually, it's quite likely that you either have already noticed this and done the analysis, or that someone else in literature has done something similar --maybe with other schedulers-- before. Anyway, the idea itself looks fair enough to me. I'd like to hear, if that's fine with you, how you plan to actually implement it, as there of course are multiple different ways to do it, and there are, IMO, a couple of things that should be kept in mind. Finally, about the work-conserving*ness on-off switch, what added difficulty or increase in code complexity prevents us to, instead of this: "2) Priority index: It indicates the current=C2=A0=C2=A0priority level of t= he =C2=A0 =C2=A0 VCPU. When a VCPU=E2=80=99s budget is depleted in the current= period, its =C2=A0 =C2=A0 priority index will increase by 1 and its budget will be =C2=A0 =C2=A0 replenished." do something like this: "2) Priority index: It indicates the current=C2=A0=C2=A0priority level of t= he =C2=A0 =C2=A0 VCPU. When a VCPU's budget is depleted in the current period: =C2=A0 =C2=A0 =C2=A02a) if the VCPU has the work conserving flag set, its p= riority =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0index will be increased by 1, and its bud= get replenished; =C2=A0 =C2=A0 =C2=A02b) if the VCPU has the work conserving flag cleat, it'= s blocked =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0until next period." ? Thanks and Regards, Dario ---=C2=A0 <> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) --=-g1KYrCjj9wsUb277TEUn Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part Content-Transfer-Encoding: 7bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJXqFMlAAoJEBZCeImluHPuZCQP/2e2VI7ikyo13ifr40iSkKBs LbUPIiWnaXqCyqigOivAQCBj1Lhl3105pR7VMJsmoOtAv9Xu7howx7vL7WDuwgTn owKb6KGvyhsyS/fOWVrCShh9JdkJaCT3IRU8gIB9QcRHmtKUX9RUcSr28/aeCqwG 0bLSLNwYCwlhcX2ir8A722aeyfXYhnSUnisk2fwQecOJXLP8K6QdR0BPos3mJQ60 n6qZmwEY1w5ubmk+q7nXMTJYkViy5hXsiI8bOiZblQJYoh7xFSLx4UdYAwiAum24 nBJYEZ4VLeW3f8MucDHtQasJf3MYSGUns7xA5LVSJHOxmfNTED3n/ShpscukIqD+ hSCPhteZUI3ey3ETjA0i4/r4+Rks88V2RgpS83t6QUJZQcJWh9sRdywYewGShECH YjvA74AzhB/K0g/u4XXb3XaneAamaFt+cS5x9vsu2wAu893FyzaHQiKqBBHmGoqR YsWOgiHNJr0bosG/V/+18QiWslpRqEpvWWPNQlHJ9QQaGGHhaXaqcAL2LHq+d5lh nBmWDwOWqRv71Elss9jTFkCub5yiJJoo3tSIOvvmgyFbkdo8MxKxHzkWXcIQN5zj xsy3IvS/srDsrTPHfqM3O9ynXv6TLrXBuzas8MMjEoBi6ycGz6marlk4YCKSJPRP Zp9WS/XbKBgk5ZnTwgrG =ue9z -----END PGP SIGNATURE----- --=-g1KYrCjj9wsUb277TEUn-- --===============5966683460604918816== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: base64 Content-Disposition: inline X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KWGVuLWRldmVs IG1haWxpbmcgbGlzdApYZW4tZGV2ZWxAbGlzdHMueGVuLm9yZwpodHRwczovL2xpc3RzLnhlbi5v cmcveGVuLWRldmVsCg== --===============5966683460604918816==--