From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dario Faggioli <dario.faggioli@citrix.com>
Subject: Re: [Design RFC] Towards work-conserving RTDS scheduler
Date: Mon, 8 Aug 2016 11:38:31 +0200
Message-ID: <1470649111.9623.79.camel@citrix.com>
References: <CAENZ-+mFAexOjwcNtdS7TzHbyPjgX9tYSR5-DfVnnjaWWRe1Gw@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============5966683460604918816=="
Return-path: <xen-devel-bounces@lists.xen.org>
Received: from mail6.bemta6.messagelabs.com ([85.158.143.247])
 by lists.xenproject.org with esmtp (Exim 4.84_2)
 (envelope-from <prvs=021f51101=dario.faggioli@citrix.com>)
 id 1bWh1b-00068J-6f
 for xen-devel@lists.xenproject.org; Mon, 08 Aug 2016 09:39:11 +0000
In-Reply-To: <CAENZ-+mFAexOjwcNtdS7TzHbyPjgX9tYSR5-DfVnnjaWWRe1Gw@mail.gmail.com>
List-Unsubscribe: <https://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
 <mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <https://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
 <mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Errors-To: xen-devel-bounces@lists.xen.org
Sender: "Xen-devel" <xen-devel-bounces@lists.xen.org>
To: Meng Xu <mengxu@cis.upenn.edu>, "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Cc: George Dunlap <george.dunlap@eu.citrix.com>
List-Id: xen-devel@lists.xenproject.org

--===============5966683460604918816==
Content-Type: multipart/signed; micalg=pgp-sha256;
	protocol="application/pgp-signature"; boundary="=-g1KYrCjj9wsUb277TEUn"

--=-g1KYrCjj9wsUb277TEUn
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Thu, 2016-08-04 at 01:15 -0400, Meng Xu wrote:
> Hi Dario,
>=20
Hi,

> I'm thinking about changing the current RTDS scheduler to
> work-conserving version as we briefly discussed before.
> Below is a design of the work-conserving RTDS.
> I'm hoping to get your feedback about the design ideas first before I
> start writing it in code.
>=20
Here I am, sorry for the delay.

> I think the code change should not be a lot as long as we don't
> provide the functionality of switching between work-conserving and
> non-work-conserving. Because the following design will keep the
> real-time property of the current RTDS scheduler, I don't see the
> reason why we should let users switch to non-work-conserving version.
> :-)
>=20
Oh, but there's a bit one: _money_! :-O

If you're a service/cloud provided you may or may not want that a
customers that pays for a 40% utilization VM to be able to use more
than that. In particular, you may want to ask more money to them, in
order to enable that possibility! :-P

Anyway, I don't think --with this design of yours-- that it is such a
big deal to make it possible to switch work-conserving*ness on and off
(see below). Actually, I think it's even possible to to that on a per-
vcpu basis, which I think would be quite cool!

> --- Below is the design ---
>=20
> [...]
>
> *** Requirement of the work-conserving RTDS scheduler ***
> 1) The new RTDS scheduler should be work-conserving, of course.
> 2) The new RTDS scheduler should not break any real-time guarantee
> provided by the current RTDS scheduler.
>=20
> *** Design of=C2=A0=C2=A0Work-Conserving RTDS Scheduler ***
> VCPU model
> 1) (Period, Budget): Guaranteed <Budget> time for each <Period>
> 2) Priority index: It indicates the current=C2=A0=C2=A0priority level of =
the
> VCPU. When a VCPU=E2=80=99s budget is depleted in the current period, its
> priority index will increase by 1 and its budget will be replenished.
> 3) A VCPU=E2=80=99s budget and priority index will be reset at the beginn=
ing
> of each period
>=20
Ok, I think I see what you mean and it looks to make sense to me.

Just one question/observation. As you know, I come from a CBS mindset.
CBS postpones a task/vcpu's deadline when it runs out of budget, and it
can, natively, work in work conserving or non-work conserving mode
(just by wither continue to consider the vcpu runnable, with the later
deadline which mean demoted priority, or block it until the next
period, respectively).

The nice thing about this is that the scheduling analysis that has been
developed works for both modes. Of course, what it says is that you can
only guarantee to each vcpu the reserved utilization, and you should
not rely on the additional capacity that you may be getting because
you're in work conserving mode and the system happened to be idle for a
few time this or that other period (so, very similar to what you're
proposing). _HOWEVER_, there are evolutions of CBS (called GRUB and
SHRUB, I'm sure you'll be able to find the papers), where the 'unused
bandwidth' (i.e., the otherwise idle time that you're making use of iff
you're in work conserving mode) is distributed in a precise way
(according to some weights, IIRC) to the various vcpus, hence making
scheduling analysis both possible and useful again.

Now, I'm not at all saying that we (you! :-D) should RTDS into using
CBS(ish) or anything like that. I'm just thinking out loud and
wondering:
=C2=A0- could it be useful to have a scheduling analysis in place for the=
=C2=A0
=C2=A0 =C2=A0scheduler in work conserving mode (one, of course, that takes =
into=C2=A0
=C2=A0 =C2=A0account and give guarantees on the otherwise idle bandwidth...=
 I=C2=A0
=C2=A0 =C2=A0know that the existing one holds! :-P) ?
=C2=A0- if yes, do you already have one --or do you think it will be=C2=A0
=C2=A0 =C2=A0possible to develop one-- for your priority-index based model?

Note that I'm not saying you should, and I'd be perfectly fine with a
"no analysis, but let's keep things simple for now"... This just came
to my mind, and I'm just pointing it ouy, to make sure we consider and
think about it, and make a conscious decision.

> Scheduling policy: modified gEDF
> 1) Priority comparison:
> =C2=A0=C2=A0=C2=A0a) VCPUs with lower priority index has higher priority =
than VCPUs
> with higher priority index
> =C2=A0=C2=A0=C2=A0b) VCPUs with same priority index uses gEDF policy to d=
ecide the
> priority order
> 2) Scheduling point
> =C2=A0=C2=A0=C2=A0a) VCPU=E2=80=99s budget is depleted for the current pr=
iority index
> =C2=A0=C2=A0=C2=A0b) VCPU starts a new period
> =C2=A0=C2=A0=C2=A0c) VCPU is blocked or waked up
> 3) Scheduling decision is made when scheduler is invoked
> =C2=A0=C2=A0=C2=A0=C2=A0a) Always pick the current M highest-priority VCP=
Us to run on the
> M cores.
>=20
So, still about the analysis point above, and just out of the top of my
head (and without being used to do this things any longer!!), it looks
like it's possible think at some analysis for this.

In fact, since:
=C2=A0- vcpus with different priority indexes are totally disjoint sets,
=C2=A0- there's a strict ordering between priority indexes,
=C2=A0- vcpus sort of use their scheduling parameters at each priority inde=
x

This looks to me like vcpus are subject to a "hierarchy" of RTDS
schedulers, the one at level x+1 running in the idle time of the one at
level x... And I think there's scope for writing down some maths
formulas that model this situation. :-)

Actually, it's quite likely that you either have already noticed this
and done the analysis, or that someone else in literature has done
something similar --maybe with other schedulers-- before.

Anyway, the idea itself looks fair enough to me. I'd like to hear, if
that's fine with you, how you plan to actually implement it, as there
of course are multiple different ways to do it, and there are, IMO, a
couple of things that should be kept in mind.

Finally, about the work-conserving*ness on-off switch, what added
difficulty or increase in code complexity prevents us to, instead of
this:

"2) Priority index: It indicates the current=C2=A0=C2=A0priority level of t=
he
=C2=A0 =C2=A0 VCPU. When a VCPU=E2=80=99s budget is depleted in the current=
 period, its
=C2=A0 =C2=A0 priority index will increase by 1 and its budget will be
=C2=A0 =C2=A0 replenished."

do something like this:

"2) Priority index: It indicates the current=C2=A0=C2=A0priority level of t=
he
=C2=A0 =C2=A0 VCPU. When a VCPU's budget is depleted in the current period:
=C2=A0 =C2=A0 =C2=A02a) if the VCPU has the work conserving flag set, its p=
riority
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0index will be increased by 1, and its bud=
get replenished;
=C2=A0 =C2=A0 =C2=A02b) if the VCPU has the work conserving flag cleat, it'=
s blocked
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0until next period."

?

Thanks and Regards,
Dario
---=C2=A0
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


--=-g1KYrCjj9wsUb277TEUn
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: This is a digitally signed message part
Content-Transfer-Encoding: 7bit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAABCAAGBQJXqFMlAAoJEBZCeImluHPuZCQP/2e2VI7ikyo13ifr40iSkKBs
LbUPIiWnaXqCyqigOivAQCBj1Lhl3105pR7VMJsmoOtAv9Xu7howx7vL7WDuwgTn
owKb6KGvyhsyS/fOWVrCShh9JdkJaCT3IRU8gIB9QcRHmtKUX9RUcSr28/aeCqwG
0bLSLNwYCwlhcX2ir8A722aeyfXYhnSUnisk2fwQecOJXLP8K6QdR0BPos3mJQ60
n6qZmwEY1w5ubmk+q7nXMTJYkViy5hXsiI8bOiZblQJYoh7xFSLx4UdYAwiAum24
nBJYEZ4VLeW3f8MucDHtQasJf3MYSGUns7xA5LVSJHOxmfNTED3n/ShpscukIqD+
hSCPhteZUI3ey3ETjA0i4/r4+Rks88V2RgpS83t6QUJZQcJWh9sRdywYewGShECH
YjvA74AzhB/K0g/u4XXb3XaneAamaFt+cS5x9vsu2wAu893FyzaHQiKqBBHmGoqR
YsWOgiHNJr0bosG/V/+18QiWslpRqEpvWWPNQlHJ9QQaGGHhaXaqcAL2LHq+d5lh
nBmWDwOWqRv71Elss9jTFkCub5yiJJoo3tSIOvvmgyFbkdo8MxKxHzkWXcIQN5zj
xsy3IvS/srDsrTPHfqM3O9ynXv6TLrXBuzas8MMjEoBi6ycGz6marlk4YCKSJPRP
Zp9WS/XbKBgk5ZnTwgrG
=ue9z
-----END PGP SIGNATURE-----

--=-g1KYrCjj9wsUb277TEUn--


--===============5966683460604918816==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Disposition: inline

X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KWGVuLWRldmVs
IG1haWxpbmcgbGlzdApYZW4tZGV2ZWxAbGlzdHMueGVuLm9yZwpodHRwczovL2xpc3RzLnhlbi5v
cmcveGVuLWRldmVsCg==

--===============5966683460604918816==--