From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeremy Fitzhardinge
Subject: Re: [RFC] Scheduler work, part 1: High-level goals and interface.
Date: Thu, 09 Apr 2009 11:41:35 -0700
Message-ID: <49DE415F.3060002@goop.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: George Dunlap
Cc: "xen-devel@lists.xensource.com"
List-Id: xen-devel@lists.xenproject.org

George Dunlap wrote:
> 1. Design targets
>
> We have three general use cases in mind: Server consolidation, virtual
> desktop providers, and clients (e.g. XenClient).
>
> For servers, our target "sweet spot" for which we will optimize is a
> system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus).
> Ideal performance is expected to be reached at about 80% total system
> cpu utilization; but the system should function reasonably well up to
> a utilization of 800% (e.g., a load of 8).
>

Is that forward-looking enough?  That hardware is currently available;
what's going to be commonplace in 2-3 years?

> For virtual desktop systems, we will have a large number of
> interactive VMs with a lot of shared memory.  Most of these will be
> single-vcpu, or at most 2 vcpus.
>
> For client systems, we expect to have 3-4 VMs (including dom0).
> Systems will probably have a single socket with 2 cores and SMT (4
> logical cpus).  Many VMs will be using PCI pass-through to access
> network, video, and audio cards.  They'll also be running video and
> audio workloads, which are extremely latency-sensitive.
>
> 2. Design goals
>
> For each of the target systems and workloads above, we have some
> high-level goals for the scheduler:
>
> * Fairness.  In this context, we define "fairness" as the ability to
> get cpu time proportional to weight.
>
> We want to try to make this true even for latency-sensitive workloads
> such as networking, where long scheduling latency can reduce the
> throughput, and thus the total amount of time the VM can effectively
> use.
>
> * Good scheduling for latency-sensitive workloads.
>
> To the degree we are able, we want this to be true even for those which
> use a significant amount of cpu power: That is, my audio shouldn't
> break up if I start a cpu hog process in the VM playing the audio.
>
> * HT-aware.
>
> Running on a logical processor with an idle peer thread is not the
> same as running on a logical processor with a busy peer thread.  The
> scheduler needs to take this into account when deciding "fairness".
>

Would it be worth just pair-scheduling HT threads so they're always
running in the same domain?

> * Power-aware.
>
> Using as many sockets / cores as possible can increase the total cache
> size available to VMs, and thus (in the absence of inter-VM sharing)
> increase total computing power; but keeping multiple sockets and
> cores powered up also increases the electrical power used by the
> system.  We want a configurable way to balance between maximizing
> processing power vs minimizing electrical power.
>

I don't remember if there's a proper term for this, but what about
having multiple domains sharing the same scheduling context, so that a
stub domain can be co-scheduled with its main domain, rather than having
them treated separately?

Also, a somewhat related point: some kind of directed scheduling, so
that when one vcpu is synchronously waiting on another vcpu, it can
directly hand over its pcpu to avoid any cross-cpu overhead (including
the ability to take advantage of directly using hot cache lines).  That
would be useful for intra-domain IPIs, etc, but also inter-domain
context switches (domain<->stub, frontend<->backend, etc).

> 3. Target interface:
>
> The target interface will be similar to credit1:
>
> * The basic unit is the VM "weight".
> When competing for cpu resources, VMs will get a share of the
> resources proportional to their weight.  (e.g., two cpu-hog workloads
> with weights of 256 and 512 will get 33% and 67% of the cpu,
> respectively).
>
> * Additionally, we will be introducing a "reservation" or "floor".
> (I'm open to name changes on this one.)  This will be a minimum
> amount of cpu time that a VM can get if it wants it.
>
> For example, one could give dom0 a "reservation" of 50%, but leave the
> weight at 256.  No matter how many other VMs run with a weight of 256,
> dom0 will be guaranteed to get 50% of one cpu if it wants it.
>

How does the reservation interact with the credits?  Is the reservation
in addition to its credits, or does using the reservation consume them?

> * The "cap" functionality of credit1 will be retained.
>
> This is a maximum amount of cpu time that a VM can get: i.e., a VM
> with a cap of 50% will only get half of one cpu, even if the rest of
> the system is completely idle.
>
> * We will also have an interface to the cpu-vs-electrical power.
>
> This is yet to be defined.  At the hypervisor level, it will probably
> be a number representing the "badness" of powering up extra cpus /
> cores.  At the tools level, there will probably be the option of
> either specifying the number, or of using one of 2/3 pre-defined
> values {power, balance, green/battery}.
>

Is it worth taking into account the power cost of cache misses vs hits?

Do vcpus running on pcpus running at less than 100% speed consume fewer
credits?

Is there any explicit interface to cpu power state management, or would
that be decoupled?

    J
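
For illustration only, here is a minimal sketch of how the weight,
reservation and cap described in the quoted interface could combine into
a steady-state cpu share. It is not taken from credit1 or from the
proposed scheduler. The names (VM, allocate) are hypothetical, all
values are in percent of one cpu, and it assumes one possible answer to
the reservation question above: the reservation acts as a floor on the
weight-proportional share rather than being granted on top of it. It
also assumes every VM is a cpu hog and ignores per-vcpu limits.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VM:
    """Hypothetical per-VM parameters; units are percent of one cpu."""
    name: str
    weight: int
    reservation: float = 0.0      # floor ("reservation")
    cap: Optional[float] = None   # ceiling; None means uncapped


def allocate(vms: List[VM], total: float) -> Dict[str, float]:
    """Share `total` cpu among cpu-hog VMs in proportion to weight,
    clamped to each VM's [reservation, cap] range.

    Assumption: the reservation is a floor on the proportional share,
    not an extra amount added to it.
    """
    lo = [vm.reservation for vm in vms]
    hi = [total if vm.cap is None else min(vm.cap, total) for vm in vms]
    if sum(lo) > total:
        raise ValueError("reservations exceed available cpu")
    if sum(hi) <= total:
        # Every VM reaches its cap; the rest of the cpu stays idle.
        return {vm.name: h for vm, h in zip(vms, hi)}

    def share(scale: float) -> List[float]:
        # Proportional share at a given cpu-per-weight scale, clamped.
        return [min(max(scale * vm.weight, l), h)
                for vm, l, h in zip(vms, lo, hi)]

    # Total usage grows monotonically with the scale, so bisect for the
    # scale at which the clamped shares add up to `total`.
    low, high = 0.0, max(h / vm.weight for vm, h in zip(vms, hi))
    for _ in range(100):
        mid = (low + high) / 2
        if sum(share(mid)) < total:
            low = mid
        else:
            high = mid
    return {vm.name: s for vm, s in zip(vms, share(high))}
```

With one cpu (total = 100), weights of 256 and 512 come out at roughly
33% and 67%, as in the quoted example. Under the floor assumption, a
dom0 with weight 256 and a reservation of 50 keeps 50% no matter how
many weight-256 guests compete, and the guests split the remainder
between them.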