* [RFC] Scheduler work, part 1: High-level goals and interface.
@ 2009-04-09 15:58 George Dunlap
From: George Dunlap @ 2009-04-09 15:58 UTC
  To: xen-devel@lists.xensource.com

In the interest of openness (as well as in the interest of taking
advantage of all the smart people out there), I'm posting a very early
design prototype of the credit2 scheduler.  We've had a lot of
contributors to the scheduler recently, so I hope that those with
interest and knowledge will take a look and let me know what they
think at a high level.

This first e-mail will discuss the overall goals: the target "sweet
spot" use cases to consider, measurable goals for the scheduler, and
the target interface / features.  This is for general comment.

The subsequent e-mail(s?) will include some specific algorithms and
changes currently in consideration, as well as some bleeding-edge
patches.  This will be for people who have a specific interest in the
details of the scheduling algorithms.

Please feel free to comment / discuss / suggest improvements.

1. Design targets

We have three general use cases in mind: Server consolidation, virtual
desktop providers, and clients (e.g. XenClient).

For servers, our target "sweet spot" for which we will optimize is a
system with 2 sockets, 4 cores per socket, and SMT (16 logical cpus).
Ideal performance is expected to be reached at about 80% total system
cpu utilization, but the system should still function reasonably well
up to a utilization of 800% (i.e., a load of 8).
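As a quick check of the server numbers, the sketch below encodes the target topology and converts a utilization expressed in "percent of one cpu" into a load. On 16 logical cpus, a load of 8 keeps half of them busy (one per physical core, if spread evenly). The struct and function names are illustrative only and are not taken from Xen.

```c
#include <assert.h>

/* Illustrative (hypothetical) description of the server "sweet spot":
 * 2 sockets x 4 cores per socket x 2 SMT threads per core. */
struct topology {
    int sockets;
    int cores_per_socket;
    int threads_per_core;
};

/* Total number of logical cpus the scheduler sees. */
int logical_cpus(const struct topology *t)
{
    return t->sockets * t->cores_per_socket * t->threads_per_core;
}

/* Convert utilization in "percent of one cpu" (e.g. 800%) to a load (e.g. 8). */
double load_from_percent(double percent)
{
    return percent / 100.0;
}

/* Fraction of all logical cpus kept busy by a given load. */
double fraction_busy(const struct topology *t, double load)
{
    assert(logical_cpus(t) > 0);
    return load / logical_cpus(t);
}
```

For example, with `struct topology t = {2, 4, 2};`, `logical_cpus(&t)` is 16 and `fraction_busy(&t, load_from_percent(800.0))` is 0.5.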

For virtual desktop systems, we will have a large number of
interactive VMs with a lot of shared memory.  Most of these will be
single-vcpu, or at most 2 vcpus.

For client systems, we expect to have 3-4 VMs (including dom0).
Systems will probably have a single socket with 2 cores and SMT (4
logical cpus).  Many VMs will be using PCI pass-through to access
network, video, and audio cards.  They'll also be running video and
audio workloads, which are extremely latency-sensitive.

2. Design goals

For each of the target systems and workloads above, we have some
high-level goals for the scheduler:

* Fairness.  In this context, we define "fairness" as the ability to
get cpu time proportional to weight.

We want to try to make this true even for latency-sensitive workloads
such as networking, where long scheduling latency can reduce
throughput, and thus the total amount of cpu time the VM can
effectively use.

* Good scheduling for latency-sensitive workloads.

To the degree we are able, we want this to be true even for workloads
that use a significant amount of cpu: that is, my audio shouldn't
break up if I start a cpu-hog process in the VM playing the audio.

* HT-aware.

Running on a logical processor with an idle peer thread is not the
same as running on a logical processor with a busy peer thread.  The
scheduler needs to take this into account when deciding "fairness".
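One way to fold SMT into fairness, sketched here purely as an illustration, is to charge a vcpu less credit for time it spent next to a busy sibling thread, since it got less work done in that time. The 60% rate, the macro, and the function name below are hypothetical placeholders, not values or code from credit1 or credit2.

```c
#include <assert.h>

/* HYPOTHETICAL: share of a full core's throughput a thread is assumed to get
 * while its SMT sibling is busy. Real values are workload-dependent; 60 is an
 * illustrative placeholder. Kept in percent so the arithmetic stays integer. */
#define HT_BUSY_SIBLING_PCT 60

/* Credit (in microseconds) to charge a vcpu for one run on a logical cpu.
 * Time with an idle sibling is charged in full; time with a busy sibling is
 * charged at the reduced rate. */
long ht_aware_charge(long us_sibling_idle, long us_sibling_busy)
{
    assert(us_sibling_idle >= 0 && us_sibling_busy >= 0);
    return us_sibling_idle + us_sibling_busy * HT_BUSY_SIBLING_PCT / 100;
}
```

Splitting the run into "sibling idle" and "sibling busy" portions lets the charge track the sibling's state changing mid-timeslice; a 1000us run with the sibling busy half the time is charged 800us under this placeholder rate.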

* Power-aware.

Using as many sockets / cores as possible can increase the total cache
size available to VMs, and thus (in the absence of inter-VM sharing)
increase total computing power; but keeping multiple sockets and
cores powered up also increases the electrical power used by the
system.  We want a configurable way to balance maximizing
processing power against minimizing electrical power.

3. Target interface

The target interface will be similar to credit1:

* The basic unit is the VM "weight".  When competing for cpu
resources, VMs will get a share of the resources proportional to their
weight.  (e.g., two cpu-hog workloads with weights of 256 and 512 will
get 33% and 67% of the cpu, respectively).
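The weight rule amounts to share_i = weight_i / sum(weights) among VMs that are all competing for the cpu. A minimal sketch of that arithmetic, assuming every VM is a cpu hog (the function name is illustrative, not a Xen interface):

```c
#include <assert.h>
#include <stddef.h>

/* Share of the contended cpu each of `n` cpu-hog VMs receives:
 * share[i] = weight[i] / sum(weight). Shares are fractions in [0, 1]. */
void weight_shares(const unsigned *weight, double *share, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += weight[i];
    assert(total > 0);
    for (size_t i = 0; i < n; i++)
        share[i] = (double)weight[i] / (double)total;
}
```

With weights {256, 512} this yields roughly 0.333 and 0.667, matching the 33% / 67% split above.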

* Additionally, we will be introducing a "reservation" or "floor".
  (I'm open to name changes on this one.)  This will be a guaranteed
  minimum amount of cpu time that a VM will get if it wants it.

For example, one could give dom0 a "reservation" of 50%, but leave the
weight at 256.  No matter how many other VMs run with a weight of 256,
dom0 will be guaranteed to get 50% of one cpu if it wants it.
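The dom0 example can be checked with a small sketch. It assumes the simplest possible policy (not necessarily what credit2 will do): on a single cpu where every VM is a cpu hog, the reserved VM gets its weight-proportional share but never less than its reservation, and the remaining equal-weight VMs split whatever is left. Function names are hypothetical.

```c
#include <assert.h>

/* Share of one cpu (fraction) for a VM with `weight` and `reservation`,
 * competing against `n_others` cpu-hog VMs that each have `other_weight`. */
double reserved_share(unsigned weight, double reservation,
                      unsigned other_weight, unsigned n_others)
{
    assert(reservation >= 0.0 && reservation <= 1.0);
    double total = (double)weight + (double)other_weight * n_others;
    assert(total > 0.0);
    double proportional = weight / total;
    return proportional > reservation ? proportional : reservation;
}

/* Each of the `n_others` equal-weight VMs splits what is left over. */
double other_share(double reserved_vm_share, unsigned n_others)
{
    assert(n_others > 0);
    return (1.0 - reserved_vm_share) / n_others;
}
```

With dom0 at weight 256 and a 50% reservation, `reserved_share(256, 0.5, 256, 9)` is 0.5 even though its proportional share would only be 0.1; adding more weight-256 VMs only shrinks `other_share`, never dom0's floor.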

* The "cap" functionality of credit1 will be retained.

This is a maximum amount of cpu time that a VM can get: i.e., a VM
with a cap of 50% will only get half of one cpu, even if the rest of
the system is completely idle.
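A cap differs from a weight in that it applies even when there is no contention. The sketch below (hypothetical function, quantities in percent of one cpu) clamps a VM's grant to its cap regardless of idle capacity; following credit1's convention, a cap of 0 means "no cap".

```c
#include <assert.h>

/* Cpu granted to a VM, in percent of one cpu. `demand` is what it would use
 * unconstrained, `available` is the idle capacity the rest of the system
 * leaves, and `cap` is the credit1-style cap (0 = no cap). */
unsigned capped_grant(unsigned demand, unsigned available, unsigned cap)
{
    unsigned grant = demand < available ? demand : available;
    if (cap != 0 && grant > cap)
        grant = cap;
    assert(grant <= demand);
    return grant;
}
```

On an otherwise idle 16-cpu system, `capped_grant(100, 1600, 50)` is 50: the VM gets half of one cpu even though far more capacity is free.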

* We will also have an interface to control the cpu-vs-electrical-power tradeoff.

This is yet to be defined.  At the hypervisor level, it will probably
be a number representing the "badness" of powering up extra cpus /
cores.  At the tools level, there will probably be the option of
either specifying the number directly or using one of two or three
pre-defined values {power, balance, green/battery}.
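Since this interface is explicitly undefined, the following is only a hypothetical sketch of how the two levels could fit together: tools-level presets map to a hypervisor-level "badness" number, and the scheduler wakes an idle core only when a vcpu's estimated cost of waiting exceeds that badness. It assumes "power" means favoring processing power. The enum, the 0-100 scale, the preset values, and the wake rule are all invented for illustration.

```c
#include <assert.h>

/* HYPOTHETICAL tools-level presets; "power" is read as "favor processing power". */
enum power_policy { POLICY_POWER, POLICY_BALANCE, POLICY_GREEN };

/* HYPOTHETICAL badness of powering up an extra core on a 0-100 scale:
 * 0 = never hesitate, 100 = avoid whenever possible. */
unsigned policy_badness(enum power_policy p)
{
    switch (p) {
    case POLICY_POWER:   return 0;
    case POLICY_BALANCE: return 50;
    case POLICY_GREEN:   return 90;
    }
    assert(!"unknown power policy");
    return 50;
}

/* Wake an idle core only if the (hypothetical, 0-100) estimated cost of the
 * vcpu waiting for an already-powered cpu exceeds the configured badness.
 * `badness` can come from a preset or be specified directly as a number. */
int should_wake_idle_core(unsigned wait_cost, unsigned badness)
{
    return wait_cost > badness;
}
```

Under these placeholder values, a vcpu with a wait cost of 30 would wake an idle core under the "power" preset but not under "green/battery".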
