From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jeremy Fitzhardinge <jeremy@goop.org>
Subject: Re: [RFC] Scheduler work,	part 1: High-level goals and
	interface.
Date: Thu, 09 Apr 2009 11:41:35 -0700
Message-ID: <49DE415F.3060002@goop.org>
References: <de76405a0904090858g145f07cja3bd7ccbd6b30ce9@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <de76405a0904090858g145f07cja3bd7ccbd6b30ce9@mail.gmail.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>
List-Id: xen-devel@lists.xenproject.org

George Dunlap wrote:
> 1. Design targets
>
> We have three general use cases in mind: Server consolidation, virtual
> desktop providers, and clients (e.g. XenClient).
>
> For servers, our target "sweet spot" for which we will optimize is a
> system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus).
> Ideal performance is expected to be reached at about 80% total system
> cpu utilization; but the system should function reasonably well up to
> a utilization of 800% (e.g., a load of 8).
>   

Is that forward-looking enough?  That hardware is currently available; 
what's going to be commonplace in 2-3 years?

> For virtual desktop systems, we will have a large number of
> interactive VMs with a lot of shared memory.  Most of these will be
> single-vcpu, or at most 2 vcpus.
>
> For client systems, we expect to have 3-4 VMs (including dom0).
> Systems will probably have a single socket with 2 cores and SMT (4
> logical cpus).  Many VMs will be using PCI pass-through to access
> network, video, and audio cards.  They'll also be running video and
> audio workloads, which are extremely latency-sensitive.
>
> 2. Design goals
>
> For each of the target systems and workloads above, we have some
> high-level goals for the scheduler:
>
> * Fairness.  In this context, we define "fairness" as the ability to
> get cpu time proportional to weight.
>
> We want to try to make this true even for latency-sensitive workloads
> such as networking, where long scheduling latency can reduce the
> throughput, and thus the total amount of time the VM can effectively
> use.
>
> * Good scheduling for latency-sensitive workloads.
>
> To the degree we are able, we want this to be true even for those which
> use a significant amount of cpu power: That is, my audio shouldn't
> break up if I start a cpu hog process in the VM playing the audio.
>
> * HT-aware.
>
> Running on a logical processor with an idle peer thread is not the
> same as running on a logical processor with a busy peer thread.  The
> scheduler needs to take this into account when deciding "fairness".
>   

Would it be worth just pair-scheduling HT threads so they're always 
running in the same domain?
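
To illustrate what I mean, here is a toy sketch (Python, purely 
illustrative; pair_schedule and the vcpu names are made up, and it 
ignores weights and fairness entirely).  It only shows the constraint: 
both threads of a core run vcpus from one domain, and a sibling idles if 
the domain has nothing else runnable.

```python
def pair_schedule(num_cores, runnable):
    """Place vcpus so both HT threads of a core belong to one domain.

    runnable: dict mapping domain name -> list of runnable vcpu names.
    Returns one (thread0, thread1) tuple per core; None means idle.
    Hypothetical helper, not an existing Xen interface.
    """
    pairs = []
    for dom, vcpus in runnable.items():
        # Take the domain's vcpus two at a time; an odd one out gets an
        # idle sibling rather than a vcpu from another domain.
        for i in range(0, len(vcpus), 2):
            chunk = vcpus[i:i + 2]
            pairs.append((chunk[0], chunk[1] if len(chunk) > 1 else None))
    placement = pairs[:num_cores]
    # Any cores left over stay fully idle.
    placement += [(None, None)] * (num_cores - len(placement))
    return placement


if __name__ == "__main__":
    print(pair_schedule(2, {"d1": ["d1v0", "d1v1", "d1v2"], "d2": ["d2v0"]}))
```

With two cores, d1's third vcpu occupies a core by itself and d2 has to 
wait, which is the obvious cost of doing it this way.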

> * Power-aware.
>
> Using as many sockets / cores as possible can increase the total cache
> size available to VMs, and thus (in the absence of inter-VM sharing)
> increase total computing power; but keeping multiple sockets and
> cores powered up also increases the electrical power used by the
> system.  We want a configurable way to balance between maximizing
> processing power vs minimizing electrical power.
>   

I don't remember if there's a proper term for this, but what about 
having multiple domains sharing the same scheduling context, so that a 
stub domain can be co-scheduled with its main domain, rather than having 
them treated separately?

Also, a somewhat related point: some kind of directed scheduling, so 
that when one vcpu is synchronously waiting on another vcpu, it can 
directly hand over its pcpu, avoiding cross-cpu overhead and letting the 
target run on cache lines that are already hot.  That would be useful 
for intra-domain IPIs, etc, but also inter-domain context switches 
(domain<->stub, frontend<->backend, etc).
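
Roughly, as a toy model (Python; the Pcpu class, directed_yield and the 
vcpu names are all hypothetical, not an existing hypercall or scheduler 
hook): the waiting vcpu gives its pcpu to the vcpu it is waiting on, 
provided that vcpu is runnable but not currently running anywhere.

```python
class Pcpu:
    """Toy physical cpu: one running vcpu plus a run queue."""

    def __init__(self, ident, current=None, runq=None):
        self.ident = ident
        self.current = current
        self.runq = list(runq or [])


def directed_yield(pcpus, waiter, target):
    """Hand the waiter's pcpu directly to target.

    Returns True if the handoff happened.  Assumes vcpus are plain
    strings and each vcpu appears at most once in the system.
    """
    src = next((p for p in pcpus if p.current == waiter), None)
    if src is None:
        return False          # waiter is not running, nothing to donate
    if any(p.current == target for p in pcpus):
        return False          # target already has a pcpu
    for p in pcpus:
        if target in p.runq:
            p.runq.remove(target)
            # Keep the waiter at the head of its own queue so it resumes
            # on the same pcpu once the target is done.
            src.runq.insert(0, waiter)
            src.current = target
            return True
    return False              # target is not runnable


if __name__ == "__main__":
    cpus = [Pcpu(0, current="d1v0"), Pcpu(1, current="d2v0", runq=["stub0"])]
    print(directed_yield(cpus, "d1v0", "stub0"), cpus[0].current)
```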

> 3. Target interface:
>
> The target interface will be similar to credit1:
>
> * The basic unit is the VM "weight".  When competing for cpu
> resources, VMs will get a share of the resources proportional to their
> weight.  (e.g., two cpu-hog workloads with weights of 256 and 512 will
> get 33% and 67% of the cpu, respectively).
>
> * Additionally, we will be introducing a "reservation" or "floor".
>   (I'm open to name changes on this one.)  This will be a minimum
>   amount of cpu time that a VM can get if it wants it.
>
> For example, one could give dom0 a "reservation" of 50%, but leave the
> weight at 256.  No matter how many other VMs run with a weight of 256,
> dom0 will be guaranteed to get 50% of one cpu if it wants it.
>   

How does the reservation interact with the credits?  Is the reservation 
in addition to its credits, or does using the reservation consume them?
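
To make the two readings concrete, here is a small sketch (Python, 
illustrative only).  It assumes a single pcpu worth 100%, that every VM 
is a cpu hog, and that reservations never add up to more than the 
capacity; the function names are mine, and neither variant is claimed to 
be what the new scheduler will actually do.

```python
def proportional_shares(weights, capacity=100.0):
    """Plain weight-proportional split, as in the 256/512 example."""
    total = sum(weights.values())
    return {n: capacity * w / total for n, w in weights.items()}


def reservation_on_top(weights, reservations, capacity=100.0):
    """Reading 1: the reservation is carved out first, and the VM still
    competes by weight for whatever is left."""
    remaining = capacity - sum(reservations.values())
    total = sum(weights.values())
    return {n: reservations.get(n, 0.0) + remaining * w / total
            for n, w in weights.items()}


def reservation_as_floor(weights, reservations, capacity=100.0):
    """Reading 2: the reservation is only a floor.  A VM whose weight
    share falls below its reservation is pinned at the reservation, and
    the others split the rest by weight."""
    pinned = {}
    while True:
        free = {n: w for n, w in weights.items() if n not in pinned}
        remaining = capacity - sum(pinned.values())
        total = sum(free.values())
        shares = {n: remaining * w / total for n, w in free.items()}
        below = [n for n, s in shares.items() if s < reservations.get(n, 0.0)]
        if not below:
            return {**pinned, **shares}
        for n in below:
            pinned[n] = reservations[n]


if __name__ == "__main__":
    w = {"dom0": 256, "g1": 256, "g2": 256, "g3": 256}
    r = {"dom0": 50.0}
    print(reservation_on_top(w, r))    # dom0 ends up with 62.5
    print(reservation_as_floor(w, r))  # dom0 ends up with 50.0
```

With dom0 reserved at 50% and three other VMs at the same weight, the 
first reading gives dom0 62.5% and the second exactly 50%, so the answer 
matters quite a bit for the guests.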

> * The "cap" functionality of credit1 will be retained.
>
> This is a maximum amount of cpu time that a VM can get: i.e., a VM
> with a cap of 50% will only get half of one cpu, even if the rest of
> the system is completely idle.
>
> * We will also have an interface to the cpu-vs-electrical power.
>
> This is yet to be defined.  At the hypervisor level, it will probably
> be a number representing the "badness" of powering up extra cpus /
> cores.  At the tools level, there will probably be the option of
> either specifying the number, or of using one of 2/3 pre-defined
> values {power, balance, green/battery}.
>   

Is it worth taking into account the power cost of cache misses vs hits?

Do vcpus running on pcpus running at less than 100% speed consume fewer 
credits?
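
That is, something like the following accounting rule (Python sketch; 
credits_consumed, the kHz parameters and the one-credit-per-millisecond 
rate are assumptions for illustration, not how credit1 charges today):

```python
def credits_consumed(run_ms, cur_khz, max_khz, credits_per_ms=1.0):
    """Charge credits for run_ms of wall-clock time, scaled by how fast
    the pcpu was actually clocked relative to its maximum frequency."""
    return run_ms * credits_per_ms * (cur_khz / max_khz)


if __name__ == "__main__":
    # 10ms at half speed is charged like 5ms at full speed.
    print(credits_consumed(10, 1_200_000, 2_400_000))
```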

Is there any explicit interface to cpu power state management, or would 
that be decoupled?

    J