[RFC] Scheduler work, part 1: High-level goals and interface.

* [RFC] Scheduler work, part 1: High-level goals and interface.
@ 2009-04-09 15:58 George Dunlap
  2009-04-09 18:41 ` Jeremy Fitzhardinge
                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: George Dunlap @ 2009-04-09 15:58 UTC (permalink / raw)
  To: xen-devel@lists.xensource.com

In the interest of openness (as well as in the interest of taking
advantage of all the smart people out there), I'm posting a very early
design prototype of the credit2 scheduler.  We've had a lot of
contributors to the scheduler recently, so I hope that those with
interest and knowledge will take a look and let me know what they
think at a high level.

This first e-mail will discuss the overall goals: the target "sweet
spot" use cases to consider, measurable goals for the scheduler, and
the target interface / features.  This is for general comment.

The subsequent e-mail(s?) will include some specific algorithms and
changes currently in consideration, as well as some bleeding-edge
patches.  This will be for people who have a specific interest in the
details of the scheduling algorithms.

Please feel free to comment / discuss / suggest improvements.

1. Design targets

We have three general use cases in mind: Server consolidation, virtual
desktop providers, and clients (e.g. XenClient).

For servers, our target "sweet spot" for which we will optimize is a
system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus).
Ideal performance is expected to be reached at about 80% total system
cpu utilization; but the system should function reasonably well up to
a utilization of 800% (e.g., a load of 8).

For virtual desktop systems, we will have a large number of
interactive VMs with a lot of shared memory.  Most of these will be
single-vcpu, or at most 2 vcpus.

For client systems, we expect to have 3-4 VMs (including dom0).
Systems will probably have a single socket with 2 cores and SMT (4
logical cpus).  Many VMs will be using PCI pass-through to access
network, video, and audio cards.  They'll also be running video and
audio workloads, which are extremely latency-sensitive.

2. Design goals

For each of the target systems and workloads above, we have some
high-level goals for the scheduler:

* Fairness.  In this context, we define "fairness" as the ability to
get cpu time proportional to weight.

We want to try to make this true even for latency-sensitive workloads
such as networking, where long scheduling latency can reduce the
throughput, and thus the total amount of time the VM can effectively
use.

* Good scheduling for latency-sensitive workloads.

To the degree we are able, we want this to be true even for those which
use a significant amount of cpu power: That is, my audio shouldn't
break up if I start a cpu hog process in the VM playing the audio.

* HT-aware.

Running on a logical processor with an idle peer thread is not the
same as running on a logical processor with a busy peer thread.  The
scheduler needs to take this into account when deciding "fairness".

* Power-aware.

Using as many sockets / cores as possible can increase the total cache
size available to VMs, and thus (in the absence of inter-VM sharing)
increase total computing power; but keeping multiple sockets and
cores powered up also increases the electrical power used by the
system.  We want a configurable way to balance between maximizing
processing power vs minimizing electrical power.

3. Target interface:

The target interface will be similar to credit1:

* The basic unit is the VM "weight".  When competing for cpu
resources, VMs will get a share of the resources proportional to their
weight.  (e.g., two cpu-hog workloads with weights of 256 and 512 will
get 33% and 67% of the cpu, respectively).

* Additionally, we will be introducing a "reservation" or "floor".
  (I'm open to name changes on this one.)  This will be a minimum
  amount of cpu time that a VM can get if it wants it.

For example, one could give dom0 a "reservation" of 50%, but leave the
weight at 256.  No matter how many other VMs run with a weight of 256,
dom0 will be guaranteed to get 50% of one cpu if it wants it.

* The "cap" functionality of credit1 will be retained.

This is a maximum amount of cpu time that a VM can get: i.e., a VM
with a cap of 50% will only get half of one cpu, even if the rest of
the system is completely idle.

* We will also have an interface to the cpu-vs-electrical power.

This is yet to be defined.  At the hypervisor level, it will probably
be a number representing the "badness" of powering up extra cpus /
cores.  At the tools level, there will probably be the option of
either specifying the number, or of using one of 2/3 pre-defined
values {power, balance, green/battery}.
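
As an illustration of the weight and cap arithmetic in the examples
above, here is a minimal C sketch.  It is not part of the credit2
prototype: the function names weight_share_pct and apply_cap_pct are
hypothetical, the cap is modeled as a plain upper bound on one VM's
share of a single cpu, and treating a cap of 0 as "no cap" is an
assumption of the sketch.  How a reservation would interact with
credits is left out, since that is not yet defined.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: percentage of contended cpu time a VM receives,
 * proportional to its weight, rounded to the nearest whole percent.
 */
static unsigned int weight_share_pct(uint32_t weight, uint32_t total_weight)
{
    assert(total_weight > 0 && weight <= total_weight);
    return (unsigned int)(((uint64_t)weight * 100 + total_weight / 2)
                          / total_weight);
}

/*
 * Hypothetical helper: apply a cap expressed as a percentage of one cpu.
 * A cap of 0 is taken to mean "no cap" (an assumption of this sketch).
 */
static unsigned int apply_cap_pct(unsigned int share_pct, unsigned int cap_pct)
{
    if (cap_pct != 0 && share_pct > cap_pct)
        return cap_pct;
    return share_pct;
}
```

With weights of 256 and 512 competing for one cpu, the total weight is
768, so weight_share_pct(256, 768) and weight_share_pct(512, 768) give
the 33% and 67% from the example; apply_cap_pct(100, 50) limits an
otherwise unconstrained VM to half of one cpu.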


* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-09 15:58 [RFC] Scheduler work, part 1: High-level goals and interface George Dunlap
@ 2009-04-09 18:41 ` Jeremy Fitzhardinge
  2009-04-10  0:33   ` Tian, Kevin
  2009-04-15 14:29   ` George Dunlap
  2009-04-10  0:15 ` Tian, Kevin
  2009-04-10  2:28 ` Zhiyuan Shao
  2 siblings, 2 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2009-04-09 18:41 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel@lists.xensource.com

George Dunlap wrote:
> 1. Design targets
>
> We have three general use cases in mind: Server consolidation, virtual
> desktop providers, and clients (e.g. XenClient).
>
> For servers, our target "sweet spot" for which we will optimize is a
> system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus).
> Ideal performance is expected to be reached at about 80% total system
> cpu utilization; but the system should function reasonably well up to
> a utilization of 800% (e.g., a load of 8).
>   

Is that forward-looking enough?  That hardware is currently available; 
what's going to be commonplace in 2-3 years?

> For virtual desktop systems, we will have a large number of
> interactive VMs with a lot of shared memory.  Most of these will be
> single-vcpu, or at most 2 vcpus.
>
> For client systems, we expect to have 3-4 VMs (including dom0).
> Systems will probably have a single socket with 2 cores and SMT (4
> logical cpus).  Many VMs will be using PCI pass-through to access
> network, video, and audio cards.  They'll also be running video and
> audio workloads, which are extremely latency-sensitive.
>
> 2. Design goals
>
> For each of the target systems and workloads above, we have some
> high-level goals for the scheduler:
>
> * Fairness.  In this context, we define "fairness" as the ability to
> get cpu time proportional to weight.
>
> We want to try to make this true even for latency-sensitive workloads
> such as networking, where long scheduling latency can reduce the
> throughput, and thus the total amount of time the VM can effectively
> use.
>
> * Good scheduling for latency-sensitive workloads.
>
> To the degree we are able, we want this to be true even for those which
> use a significant amount of cpu power: That is, my audio shouldn't
> break up if I start a cpu hog process in the VM playing the audio.
>
> * HT-aware.
>
> Running on a logical processor with an idle peer thread is not the
> same as running on a logical processor with a busy peer thread.  The
> scheduler needs to take this into account when deciding "fairness".
>   

Would it be worth just pair-scheduling HT threads so they're always 
running in the same domain?

> * Power-aware.
>
> Using as many sockets / cores as possible can increase the total cache
> size available to VMs, and thus (in the absence of inter-VM sharing)
> increase total computing power; but by keeping multiple sockets and
> cores powered up, also increases the electrical power used by the
> system.  We want a configurable way to balance between maximizing
> processing power vs minimizing electrical power.
>   

I don't remember if there's a proper term for this, but what about 
having multiple domains sharing the same scheduling context, so that a 
stub domain can be co-scheduled with its main domain, rather than having 
them treated separately?

Also, a somewhat related point, some kind of directed schedule so that 
when one vcpu is synchronously waiting on another vcpu, have it directly 
hand over its pcpu to avoid any cross-cpu overhead (including the 
ability to take advantage of directly using hot cache lines).  That 
would be useful for intra-domain IPIs, etc, but also inter-domain 
context switches (domain<->stub, frontend<->backend, etc).

> 3. Target interface:
>
> The target interface will be similar to credit1:
>
> * The basic unit is the VM "weight".  When competing for cpu
> resources, VMs will get a share of the resources proportional to their
> weight.  (e.g., two cpu-hog workloads with weights of 256 and 512 will
> get 33% and 67% of the cpu, respectively).
>
> * Additionally, we will be introducing a "reservation" or "floor".
>   (I'm open to name changes on this one.)  This will be a minimum
>   amount of cpu time that a VM can get if it wants it.
>
> For example, one could give dom0 a "reservation" of 50%, but leave the
> weight at 256.  No matter how many other VMs run with a weight of 256,
> dom0 will be guaranteed to get 50% of one cpu if it wants it.
>   

How does the reservation interact with the credits?  Is the reservation 
in addition to its credits, or does using the reservation consume them?

> * The "cap" functionality of credit1 will be retained.
>
> This is a maximum amount of cpu time that a VM can get: i.e., a VM
> with a cap of 50% will only get half of one cpu, even if the rest of
> the system is completely idle.
>
> * We will also have an interface to the cpu-vs-electrical power.
>
> This is yet to be defined.  At the hypervisor level, it will probably
> be a number representing the "badness" of powering up extra cpus /
> cores.  At the tools level, there will probably be the option of
> either specifying the number, or of using one of 2/3 pre-defined
> values {power, balance, green/battery}.
>   

Is it worth taking into account the power cost of cache misses vs hits?

Do vcpus running on pcpus running at less than 100% speed consume fewer 
credits?

Is there any explicit interface to cpu power state management, or would 
that be decoupled?

    J


* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-09 15:58 [RFC] Scheduler work, part 1: High-level goals and interface George Dunlap
  2009-04-09 18:41 ` Jeremy Fitzhardinge
@ 2009-04-10  0:15 ` Tian, Kevin
  2009-04-15 15:07   ` George Dunlap
  2009-04-10  2:28 ` Zhiyuan Shao
  2 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2009-04-10  0:15 UTC (permalink / raw)
  To: George Dunlap, xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 4245 bytes --]

>From: George Dunlap
>Sent: April 9, 2009 23:59
>
>For servers, our target "sweet spot" for which we will optimize is a
>system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus).
>Ideal performance is expected to be reached at about 80% total system
>cpu utilization; but the system should function reasonably well up to
>a utilization of 800% (e.g., a load of 8).

How is 80%/800% chosen here?

>
>For virtual desktop systems, we will have a large number of
>interactive VMs with a lot of shared memory.  Most of these will be
>single-vcpu, or at most 2 vcpus.

What total number of VMs would you like to support?
>
>* HT-aware.
>
>Running on a logical processor with an idle peer thread is not the
>same as running on a logical processor with a busy peer thread.  The
>scheduler needs to take this into account when deciding "fairness".

Do you mean that the same elapsed time in the above two scenarios will
be translated into different credits?

>
>* Power-aware.
>
>Using as many sockets / cores as possible can increase the total cache
>size available to VMs, and thus (in the absence of inter-VM sharing)
>increase total computing power; but by keeping multiple sockets and
>cores powered up, also increases the electrical power used by the
>system.  We want a configurable way to balance between maximizing
>processing power vs minimizing electrical power.

Xen 3.4 now supports "sched_smt_power_savings" (both as a boot option
and adjustable via xenpm) to change the power/performance preference.
It's a simple implementation that just reverses the span order from the
existing package->core->thread to thread->core->package. More
fine-grained flexibility could be given in future if a hierarchical
scheduling concept could be constructed more clearly, like the domain
scheduler in Linux.

Another possible 'fairness' point affected by power management
would be to take frequency scaling into consideration: credit so far
is calculated simply from elapsed time, while the same elapsed time at
a different frequency actually corresponds to a different number of
consumed cycles.
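
To make the frequency point concrete, here is a rough C sketch of
charging by consumed cycles rather than by wall-clock time. The
function name charged_time_ns and the idea of scaling by the ratio of
current to maximum frequency are illustrative assumptions, not
existing Xen code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical: scale the time charged to a vcpu by the ratio of the
 * pcpu's current frequency to its maximum frequency, so that the charge
 * tracks cycles consumed rather than wall-clock time.
 *
 * The multiplication is done in 64 bits; for multi-GHz frequencies it
 * stays within range for elapsed times up to many minutes.
 */
static uint64_t charged_time_ns(uint64_t elapsed_ns,
                                uint32_t cur_khz, uint32_t max_khz)
{
    assert(max_khz > 0 && cur_khz <= max_khz);
    return elapsed_ns * cur_khz / max_khz;
}
```

Under this sketch, a vcpu that runs for 10ms on a pcpu clocked at half
its maximum frequency would be charged for 5ms.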

>
>3. Target interface:
>
>The target interface will be similar to credit1:
>
>* The basic unit is the VM "weight".  When competing for cpu
>resources, VMs will get a share of the resources proportional to their
>weight.  (e.g., two cpu-hog workloads with weights of 256 and 512 will
>get 33% and 67% of the cpu, respectively).

IMO, weight does not translate directly into care for latency. Could
you elaborate on that? I remember that Nishiguchi-san previously
proposed boosting credit, and Disheng proposed static priority.
Maybe you could write a summary to help people understand exactly how
latency would be ensured in your proposal.

>
>* Additionally, we will be introducing a "reservation" or "floor".
>  (I'm open to name changes on this one.)  This will be a minimum
>  amount of cpu time that a VM can get if it wants it.

This is a good idea.

>
>For example, one could give dom0 a "reservation" of 50%, but leave the
>weight at 256.  No matter how many other VMs run with a weight of 256,
>dom0 will be guaranteed to get 50% of one cpu if it wants it.

Shouldn't there be some way to adjust or limit the use of 'reservation'
when multiple vcpus each claim one and the claims sum to more than the
cpu's computing power, or weaken your general 'weight-as-basic-unit'
idea?

>
>* The "cap" functionality of credit1 will be retained.
>
>This is a maximum amount of cpu time that a VM can get: i.e., a VM
>with a cap of 50% will only get half of one cpu, even if the rest of
>the system is completely idle.
>
>* We will also have an interface to the cpu-vs-electrical power.
>
>This is yet to be defined.  At the hypervisor level, it will probably
>be a number representing the "badness" of powering up extra cpus /
>cores.  At the tools level, there will probably be the option of
>either specifying the number, or of using one of 2/3 pre-defined
>values {power, balance, green/battery}.

I'm not sure how that number will be defined. Maybe we can follow the
current approach and just add individual name-based options matching
their purpose (such as migration_cost and sched_smt_power_savings...)

Thanks,
Kevin

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-09 18:41 ` Jeremy Fitzhardinge
@ 2009-04-10  0:33   ` Tian, Kevin
  2009-04-10 16:15     ` Jeremy Fitzhardinge
  2009-04-15 14:29   ` George Dunlap
  1 sibling, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2009-04-10  0:33 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, George Dunlap; +Cc: xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 3368 bytes --]

>From: Jeremy Fitzhardinge
>Sent: April 10, 2009 2:42
>
>George Dunlap wrote:
>> 1. Design targets
>>
>> We have three general use cases in mind: Server consolidation, virtual
>> desktop providers, and clients (e.g. XenClient).
>>
>> For servers, our target "sweet spot" for which we will optimize is a
>> system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus).
>> Ideal performance is expected to be reached at about 80% total system
>> cpu utilization; but the system should function reasonably well up to
>> a utilization of 800% (e.g., a load of 8).
>>   
>
>Is that forward-looking enough?  That hardware is currently available; 
>what's going to be commonplace in 2-3 years?

good point.

>> * HT-aware.
>>
>> Running on a logical processor with an idle peer thread is not the
>> same as running on a logical processor with a busy peer thread.  The
>> scheduler needs to take this into account when deciding "fairness".
>>   
>
>Would it be worth just pair-scheduling HT threads so they're always 
>running in the same domain?

Running the same domain on both threads doesn't help fairness; instead,
it makes it worse.

>
>> * Power-aware.
>>
>> Using as many sockets / cores as possible can increase the total cache
>> size available to VMs, and thus (in the absence of inter-VM sharing)
>> increase total computing power; but by keeping multiple sockets and
>> cores powered up, also increases the electrical power used by the
>> system.  We want a configurable way to balance between maximizing
>> processing power vs minimizing electrical power.
>>   
>
>I don't remember if there's a proper term for this, but what about 
>having multiple domains sharing the same scheduling context, so that a 
>stub domain can be co-scheduled with its main domain, rather than having 
>them treated separately?

This is really desired.

>
>Also, a somewhat related point, some kind of directed schedule so that 
>when one vcpu is synchronously waiting on another vcpu, have it directly 
>hand over its pcpu to avoid any cross-cpu overhead (including the 
>ability to take advantage of directly using hot cache lines).  That 
>would be useful for intra-domain IPIs, etc, but also inter-domain 
>context switches (domain<->stub, frontend<->backend, etc).

The hard part here is finding a hint as to WHICH vcpu a given vcpu
is waiting on, which is not straightforward. Of course the stub
domain is the most likely example, but that may already be cleanly
addressed if the above co-scheduling could be added? :-)


>> * We will also have an interface to the cpu-vs-electrical power.
>>
>> This is yet to be defined.  At the hypervisor level, it will probably
>> be a number representing the "badness" of powering up extra cpus /
>> cores.  At the tools level, there will probably be the option of
>> either specifying the number, or of using one of 2/3 pre-defined
>> values {power, balance, green/battery}.
>>   
>
>Is it worth taking into account the power cost of cache misses vs hits?
>
>Do vcpus running on pcpus running at less than 100% speed consume fewer 
>credits?
>
>Is there any explicit interface to cpu power state management, or would 
>that be decoupled?
>

CPU power management now has a sysctl interface exposed, and xenpm
is the tool using that interface so far.

Thanks,
Kevin



* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-09 15:58 [RFC] Scheduler work, part 1: High-level goals and interface George Dunlap
  2009-04-09 18:41 ` Jeremy Fitzhardinge
  2009-04-10  0:15 ` Tian, Kevin
@ 2009-04-10  2:28 ` Zhiyuan Shao
  2 siblings, 0 replies; 35+ messages in thread
From: Zhiyuan Shao @ 2009-04-10  2:28 UTC (permalink / raw)
  To: George Dunlap, xen-devel@lists.xensource.com


[-- Attachment #1.1: Type: text/plain, Size: 10443 bytes --]

Hi all,

I think I/O responsiveness is something the scheduling algorithm needs to control. This is especially true for virtualized desktop/client environments, which have many I/O events to handle, unlike the server consolidation case, where many of the tasks are CPU-intensive.
I would like to illustrate this point with a simple scheduling algorithm, attached to this mail, which I wrote last winter (Jan. 2009). At that time I was visiting Intel OTC, and I thank the Intel folks (Disheng, Kevin Tian, etc.) for their help.
The scheduler is named SDP (you have to pass the "sched=sdp" parameter on the Xen kernel line at boot). The intent is to use dynamic-priority ideas so that virtualized clients can meet their usage needs. However, the scheduler is so far only a simple proof-of-concept prototype; I have not implemented the dynamic priority mechanisms yet. The solution used in this simple scheduler is largely ad hoc, and I hope it can contribute something to the development of the next-generation Xen scheduler. BTW, I borrowed a large portion of the code from the Credit scheduler.

This patch should work well on a VT-d platform (it does not do well with emulated devices, since device emulation, especially of video, incurs more overhead than the scheduler can handle). We (thanks again to Intel OTC for the VT-d platform) tested the scheduler on a 3.0 GHz 2-core system. We started 2 HVM guest domains (a primary domain and an auxiliary domain, each with two VCPUs) and pinned each VCPU of each domain to a different PCPU (the VCPUs of domain0 were pinned as well), since I have not yet implemented a proper VCPU migration mechanism in SDP. (Sorry for that; I do not think the aggressive migration mechanism of Credit is appropriate for virtualized clients. I am working on this now and hope to find a suitable one for Xen in the near future.) The sound and video cards are directly assigned to the primary domain, while the auxiliary domain uses emulated ones.

Set the "priority" value of domain0 to 91 and that of the primary domain to 90; the auxiliary domain can be left at the default (80). (The parameter should really be named "I/O responsiveness"; I think I made a mistake here, since the initial objective was to use dynamic-priority ideas.) Please use the attached domain0 command-line tool (i.e., sched_sdp_set [domain_id] [new_priority]) to set the new priority. I am not good at Python, sorry for that!
We tested a scenario in which the primary domain plays a DivX video while the auxiliary domain copies very large files. The video played well! In this use case the result beat BCredit, no matter how we adjusted BCredit's parameters.

Some explanations of SDP:
The "priority" parameter is actually used to control I/O responsiveness. If a VCPU is woken by an I/O event and its "priority" value is higher than that of the currently running VCPU, the current VCPU is preempted and the woken VCPU is scheduled. A "bonus" is given to the woken VCPU to leave it, as far as possible, enough time to complete its I/O handling. The bonus value is the difference between the "priority" parameters of the two VCPUs (the woken one and the current one). This strategy prevents a running VCPU with a high "priority" from being preempted by a VCPU with a lower one, while permitting preemption the other way round. I think this method fits well with the asymmetric domain roles of virtualized clients and desktops.
As for allocating computation resources, the simple scheduler shares the CPU in a round-robin fashion. An I/O event arriving at a high-"priority" VCPU gives it a small "bonus"; after using that up, the VCPU falls back into the round-robin scheduling ring. In this way it maintains some degree of fairness even in a virtualized client environment. E.g., in our test scenario, we found that the file copy in the auxiliary domain proceeded well (although a little slowly) while the primary domain played a DivX video, which generates a high volume of I/O events.
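
The wake-up rule described above can be sketched in C roughly as follows. This is a simplified illustration, not code taken from the attached patch: the struct sdp_vcpu_sketch type and the function sdp_wake_preempts are hypothetical names, and the unit of the bonus is left unspecified because the text above does not give one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a VCPU for this sketch. */
struct sdp_vcpu_sketch {
    uint16_t pri;    /* the "priority" (I/O responsiveness) parameter */
    uint32_t bonus;  /* extra run time granted on an I/O wake-up */
};

/*
 * Called when 'woken' is woken by an I/O event while 'cur' is running.
 * Returns true if 'cur' should be preempted; in that case 'woken' is
 * given a bonus equal to the priority difference.  A VCPU with lower
 * or equal priority never preempts and receives no bonus.
 */
static bool sdp_wake_preempts(struct sdp_vcpu_sketch *woken,
                              const struct sdp_vcpu_sketch *cur)
{
    assert(woken && cur);
    if (woken->pri <= cur->pri)
        return false;
    woken->bonus = (uint32_t)(woken->pri - cur->pri);
    return true;
}
```

With the settings from the test above (domain0 at 91, the primary domain at 90, the auxiliary domain at 80), a primary-domain VCPU woken while an auxiliary VCPU runs would preempt it with a bonus of 10, whereas an auxiliary VCPU woken while a primary-domain VCPU runs would simply wait its turn in the round-robin ring.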

From this experience, I think I/O responsiveness is an important parameter to add in the development of the new scheduler: platforms have their own independent performance metrics, and users can adjust the I/O responsiveness parameter of the domains to make them work well.

Moreover, I think some characteristics of the Credit scheduler do not fit virtualized clients/desktops well (for further discussion if possible). In the virtualized client/desktop scenario, the worst aspect of Credit is the small state space it uses to mark VCPUs (i.e., BOOST, UNDER and OVER). This makes it very inconvenient, to say the least, to differentiate the VCPUs of different domains running different kinds of tasks, although the small state space does work well on consolidated servers. The second inconvenient aspect of the Credit scheduler is the way it "boosts" VCPUs. In the original Credit, a woken VCPU must have enough credits (UNDER state) to be promoted to the BOOST state. However, a domain (VCPU) may have used up its credits and still have a critical task to run. At that moment fairness is a secondary consideration and should be restored in later phases. BCredit brings some changes on this point, although unfortunately it may give fairness too little consideration.
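
For readers less familiar with Credit, the boost rule criticized above can be sketched in C as follows. The enum and function names are this sketch's own (hypothetical), not the identifiers used in the Credit source, and the sketch covers only the wake-up promotion, not how credits are accounted.

```c
#include <assert.h>

/* The three VCPU states mentioned above (names are this sketch's own). */
enum credit_pri_sketch { PRI_BOOST, PRI_UNDER, PRI_OVER };

/*
 * State of a VCPU after it wakes up, under the rule described above:
 * only a VCPU that still has credit (UNDER) is promoted to BOOST.
 * A VCPU that has used up its credit (OVER) stays where it is, even
 * if it has latency-critical work to do.
 */
static enum credit_pri_sketch pri_on_wake(enum credit_pri_sketch before)
{
    assert(before == PRI_BOOST || before == PRI_UNDER || before == PRI_OVER);
    return (before == PRI_UNDER) ? PRI_BOOST : before;
}
```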


Thanks,
Zhiyuan Shao
2009-04-10


[-- Attachment #1.2: Type: text/html, Size: 18995 bytes --]

[-- Attachment #2: sdp-ctl.tar.gz --]
[-- Type: application/octet-stream, Size: 901 bytes --]

[-- Attachment #3: sdp_09.1.8.patch --]
[-- Type: application/octet-stream, Size: 38942 bytes --]

diff -r e2f36d066b7b tools/libxc/Makefile
--- a/tools/libxc/Makefile	Mon Dec 22 13:48:40 2008 +0000
+++ b/tools/libxc/Makefile	Thu Jan 08 01:32:01 2009 -0500
@@ -17,6 +17,7 @@
 CTRL_SRCS-y       += xc_private.c
 CTRL_SRCS-y       += xc_sedf.c
 CTRL_SRCS-y       += xc_csched.c
+CTRL_SRCS-y       += xc_sdp.c
 CTRL_SRCS-y       += xc_tbuf.c
 CTRL_SRCS-y       += xc_pm.c
 CTRL_SRCS-y       += xc_cpu_hotplug.c
diff -r e2f36d066b7b tools/libxc/xc_sdp.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/libxc/xc_sdp.c	Thu Jan 08 01:32:01 2009 -0500
@@ -0,0 +1,53 @@
+/****************************************************************************
+ * (C) 2009 -- Zhiyuan Shao -- Huazhong Univers. of Sci.&Tech. PRC
+ ****************************************************************************
+ *
+ *        File: xc_sdp.c
+ *      Author: Zhiyuan Shao
+ *
+ * Description: XC Interface to the SDP scheduler
+ *
+ */
+#include "xc_private.h"
+
+int
+xc_sdp_domain_set(
+    int xc_handle,
+    uint32_t domid,
+    uint16_t priority )
+{
+    DECLARE_DOMCTL;
+    struct xen_domctl_sched_sdp *p = &domctl.u.scheduler_op.u.sdp;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_SDP;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_putinfo;
+
+    p->pri = priority;
+
+    return do_domctl(xc_handle, &domctl);
+}
+
+int
+xc_sdp_domain_get(
+    int xc_handle,
+    uint32_t domid,
+    uint16_t *priority)
+{
+    DECLARE_DOMCTL;
+    int ret;
+    struct xen_domctl_sched_sdp *p = &domctl.u.scheduler_op.u.sdp;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_SDP;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_getinfo;
+
+    ret = do_domctl(xc_handle, &domctl);
+    if ( ret == 0 )
+        *priority = p->pri;
+
+    return ret;
+}
+
diff -r e2f36d066b7b tools/libxc/xc_sedf.c
--- a/tools/libxc/xc_sedf.c	Mon Dec 22 13:48:40 2008 +0000
+++ b/tools/libxc/xc_sedf.c	Thu Jan 08 01:32:01 2009 -0500
@@ -60,5 +60,6 @@
     *latency   = p->latency;
     *extratime = p->extratime;
     *weight    = p->weight;
+
     return ret;
 }
diff -r e2f36d066b7b tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h	Mon Dec 22 13:48:40 2008 +0000
+++ b/tools/libxc/xenctrl.h	Thu Jan 08 01:32:01 2009 -0500
@@ -447,6 +447,14 @@
 int xc_sched_credit_domain_get(int xc_handle,
                                uint32_t domid,
                                struct xen_domctl_sched_credit *sdom);
+
+//defined for SDP
+int xc_sdp_domain_set(  int xc_handle,
+			uint32_t domid,
+			uint16_t priority );
+int xc_sdp_domain_get(  int xc_handle,
+			uint32_t domid,
+			uint16_t *priority );
 
 /**
  * This function sends a trigger to a domain.
diff -r e2f36d066b7b tools/python/xen/lowlevel/xc/xc.c
--- a/tools/python/xen/lowlevel/xc/xc.c	Mon Dec 22 13:48:40 2008 +0000
+++ b/tools/python/xen/lowlevel/xc/xc.c	Thu Jan 08 01:32:01 2009 -0500
@@ -1180,7 +1180,7 @@
     uint16_t extratime, weight;
     static char *kwd_list[] = { "domid", "period", "slice",
                                 "latency", "extratime", "weight",NULL };
-    
+
     if( !PyArg_ParseTupleAndKeywords(args, kwds, "iLLLhh", kwd_list, 
                                      &domid, &period, &slice,
                                      &latency, &extratime, &weight) )
@@ -1213,6 +1213,41 @@
                          "latency",   latency,
                          "extratime", extratime,
                          "weight",    weight);
+}
+
+static PyObject *pyxc_sched_sdp_domain_set(XcObject *self,
+                                              PyObject *args,
+                                              PyObject *kwds)
+{
+    static char *kwd_list[] = { "domid", "priority", NULL };
+    uint32_t domid;
+    uint16_t priority;
+
+    if( !PyArg_ParseTupleAndKeywords(args, kwds, "ih", kwd_list,
+                                     &domid, &priority) )
+        return NULL;
+
+    if ( xc_sdp_domain_set(self->xc_handle, domid, priority) != 0 )
+        return pyxc_error_to_exception();
+
+    Py_INCREF(zero);
+    return zero;
+}
+
+static PyObject *pyxc_sched_sdp_domain_get(XcObject *self, PyObject *args)
+{
+    uint32_t domid;
+    uint16_t priority;
+
+    if( !PyArg_ParseTuple(args, "i", &domid) )
+        return NULL;
+
+    if ( xc_sdp_domain_get(self->xc_handle, domid, &priority) != 0 )
+        return pyxc_error_to_exception();
+
+    return Py_BuildValue("{s:i,s:H}",
+			 "domid",     domid,
+                         "priority",  priority );
 }
 
 static PyObject *pyxc_shadow_control(PyObject *self,
@@ -1736,6 +1771,25 @@
       "Returns:   [dict]\n"
       " weight    [short]: domain's scheduling weight\n"},
 
+    { "sched_sdp_domain_set",
+      (PyCFunction)pyxc_sched_sdp_domain_set,
+      METH_KEYWORDS, "\n"
+      "Set the scheduling parameters for a domain when running with the\n"
+      "Simple Dynamic Priority scheduler.\n"
+      " domid     [int]:   domain id to set\n"
+      " priority  [short]: domain's priority\n"
+      "Returns: [int] 0 on success; -1 on error.\n" },
+
+    { "sched_sdp_domain_get",
+      (PyCFunction)pyxc_sched_sdp_domain_get,
+      METH_VARARGS, "\n"
+      "Get the scheduling parameters for a domain when running with the\n"
+      "Simple Dynamic Priority scheduler.\n"
+      " domid     [int]:   domain id to query\n"
+      "Returns:   [dict]\n"
+      " domid     [int]:   domain id\n"
+      " priority  [short]: domain's priority\n"},
+
     { "evtchn_alloc_unbound", 
       (PyCFunction)pyxc_evtchn_alloc_unbound,
       METH_VARARGS | METH_KEYWORDS, "\n"
@@ -2051,6 +2105,7 @@
     /* Expose some libxc constants to Python */
     PyModule_AddIntConstant(m, "XEN_SCHEDULER_SEDF", XEN_SCHEDULER_SEDF);
     PyModule_AddIntConstant(m, "XEN_SCHEDULER_CREDIT", XEN_SCHEDULER_CREDIT);
+    PyModule_AddIntConstant(m, "XEN_SCHEDULER_SDP", XEN_SCHEDULER_SDP);
 
 }
 
diff -r e2f36d066b7b tools/python/xen/xend/XendConfig.py
--- a/tools/python/xen/xend/XendConfig.py	Mon Dec 22 13:48:40 2008 +0000
+++ b/tools/python/xen/xend/XendConfig.py	Thu Jan 08 01:32:02 2009 -0500
@@ -589,6 +589,8 @@
             int(sxp.child_value(sxp_cfg, "cpu_weight", 256))
         cfg["vcpus_params"]["cap"] = \
             int(sxp.child_value(sxp_cfg, "cpu_cap", 0))
+        cfg["vcpus_params"]["priority"] = \
+            int(sxp.child_value(sxp_cfg, "cpu_priority", 80))
 
         # Only extract options we know about.
         extract_keys = LEGACY_UNSUPPORTED_BY_XENAPI_CFG + \
diff -r e2f36d066b7b tools/python/xen/xend/XendDomain.py
--- a/tools/python/xen/xend/XendDomain.py	Mon Dec 22 13:48:40 2008 +0000
+++ b/tools/python/xen/xend/XendDomain.py	Thu Jan 08 01:32:02 2009 -0500
@@ -1452,6 +1452,8 @@
         @type domid: int or string.
         @rtype: 0
         """
+	print "domain_cpu_sedf_set is called"
+
         dominfo = self.domain_lookup_nr(domid)
         if not dominfo:
             raise XendInvalidDomain(str(domid))
@@ -1482,6 +1484,44 @@
                     ['latency',   sedf_info['latency']],
                     ['extratime', sedf_info['extratime']],
                     ['weight',    sedf_info['weight']]]
+
+        except Exception, ex:
+            raise XendError(str(ex))
+
+    def domain_cpu_sdp_set(self, domid, priority):
+        """Set SDP scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @rtype: 0
+        """
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+        try:
+            return xc.sched_sdp_domain_set(dominfo.getDomid(), priority)
+        except Exception, ex:
+            raise XendError(str(ex))
+
+    def domain_cpu_sdp_get(self, domid):
+        """Get SDP scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @rtype: SXP object
+        @return: The parameters for sdp scheduler for a domain.
+        """
+	print "domain_cpu_sdp_get is called"
+
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+        try:
+            sdp_info = xc.sched_sdp_domain_get(dominfo.getDomid())
+            # return sxpr
+            return ['sdp',
+                    ['domid',    sdp_info['domid']],
+                    ['priority', sdp_info['priority']]]
 
         except Exception, ex:
             raise XendError(str(ex))
@@ -1529,6 +1569,60 @@
         try:
             return xc.shadow_mem_control(dominfo.getDomid(), mb=mb)
         except Exception, ex:
+            raise XendError(str(ex))
+
+    def domain_sched_sdp_get(self, domid):
+        """Get sdp scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @rtype: dict with keys 'priority'
+        @return: sdp scheduler parameters
+        """
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+
+        if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
+            try:
+                return xc.sched_sdp_domain_get(dominfo.getDomid())
+            except Exception, ex:
+                raise XendError(str(ex))
+        else:
+            return {'priority' : dominfo.getPri()}
+
+    def domain_sched_sdp_set(self, domid, priority = None):
+        """Set sdp scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @type priority: int
+        @rtype: 0
+        """
+        set_priority = False
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+        try:
+            if priority is None:
+                priority = int(80)
+            elif priority < 0 or priority > 100:
+                raise XendError("priority is out of range")
+            else:
+                set_priority = True
+
+            assert type(priority) == int
+
+            rc = 0
+            if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
+                rc = xc.sched_sdp_domain_set(dominfo.getDomid(), priority)
+            if rc == 0:
+                if set_priority:
+                    dominfo.setPri(priority)
+                self.managed_config_save(dominfo)
+            return rc
+        except Exception, ex:
+            log.exception(ex)
             raise XendError(str(ex))
 
     def domain_sched_credit_get(self, domid):
diff -r e2f36d066b7b tools/python/xen/xend/XendNode.py
--- a/tools/python/xen/xend/XendNode.py	Mon Dec 22 13:48:40 2008 +0000
+++ b/tools/python/xen/xend/XendNode.py	Thu Jan 08 01:32:02 2009 -0500
@@ -555,6 +555,8 @@
             return 'sedf'
         elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT:
             return 'credit'
+        elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_SDP:
+            return 'sdp'
         else:
             return 'unknown'
 
@@ -714,6 +716,8 @@
             return 'sedf'
         elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT:
             return 'credit'
+        elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_SDP:
+            return 'sdp'
         else:
             return 'unknown'
 
diff -r e2f36d066b7b tools/python/xen/xend/server/SrvDomain.py
--- a/tools/python/xen/xend/server/SrvDomain.py	Mon Dec 22 13:48:40 2008 +0000
+++ b/tools/python/xen/xend/server/SrvDomain.py	Thu Jan 08 01:32:02 2009 -0500
@@ -136,6 +136,8 @@
 
 
     def op_cpu_sedf_set(self, _, req):
+	print "op_cpu_sedf_set is called"
+
         fn = FormFn(self.xd.domain_cpu_sedf_set,
                     [['dom', 'int'],
                      ['period', 'int'],
@@ -145,7 +147,23 @@
                      ['weight', 'int']])
         val = fn(req.args, {'dom': self.dom.domid})
         return val
-    
+
+    def op_cpu_sdp_get(self, _, req):
+	print "op_cpu_sdp_get is called"
+
+        fn = FormFn(self.xd.domain_cpu_sdp_get,
+                    [['dom', 'int']])
+        val = fn(req.args, {'dom': self.dom.domid})
+        return val
+
+
+    def op_cpu_sdp_set(self, _, req):
+        fn = FormFn(self.xd.domain_cpu_sdp_set,
+                    [['dom', 'int'],
+                     ['priority', 'int']])
+        val = fn(req.args, {'dom': self.dom.domid})
+        return val
+ 
     def op_domain_sched_credit_get(self, _, req):
         fn = FormFn(self.xd.domain_sched_credit_get,
                     [['dom', 'str']])
diff -r e2f36d066b7b tools/python/xen/xm/main.py
--- a/tools/python/xen/xm/main.py	Mon Dec 22 13:48:40 2008 +0000
+++ b/tools/python/xen/xm/main.py	Thu Jan 08 01:32:02 2009 -0500
@@ -152,6 +152,7 @@
     'sched-sedf'  : ('<Domain> [options]', 'Get/set EDF parameters.'),
     'sched-credit': ('[-d <Domain> [-w[=WEIGHT]|-c[=CAP]]]',
                      'Get/set credit scheduler parameters.'),
+    'sched-sdp'   : ('<Domain> [options]', 'Get/set SDP parameters.'),
     'sysrq'       : ('<Domain> <letter>', 'Send a sysrq to a domain.'),
     'debug-keys'  : ('<Keys>', 'Send debug keys to Xen.'),
     'trigger'     : ('<Domain> <nmi|reset|init|s3resume> [<VCPU>]',
@@ -235,6 +236,10 @@
         'Flag (0 or 1) controls if domain can run in extra time.'),
        ('-w [FLOAT]', '--weight[=FLOAT]',
         'CPU Period/slice (do not set with --period/--slice)'),
+    ),
+    'sched-sdp': (
+       ('-d DOMAIN', '--domain=DOMAIN', 'Domain to modify'),
+       ('-p PRIORITY', '--priority=PRIORITY', 'Relative Priority[0,100](int)'),
     ),
     'sched-credit': (
        ('-d DOMAIN', '--domain=DOMAIN', 'Domain to modify'),
@@ -360,6 +365,7 @@
 scheduler_commands = [
     "sched-credit",
     "sched-sedf",
+    "sched-sdp",
     ]
 
 device_commands = [
@@ -955,6 +961,15 @@
         'latency'  : get_info('latency',       int,   -1),
         'extratime': get_info('extratime',     int,   -1),
         'weight'   : get_info('weight',        int,   -1),
+        }
+
+def parse_sdp_info(info):
+    def get_info(n, t, d):
+        return t(sxp.child_value(info, n, d))
+
+    return {
+        'domid'    : get_info('domid',         int,   -1),
+        'priority' : get_info('priority',      int,   -1),
         }
 
 def domid_match(domid, info):
@@ -1544,7 +1559,6 @@
         print '%-33s %4s %-4s %-4s %-7s %-5s %-6s' % \
               ('Name','ID','Period(ms)', 'Slice(ms)', 'Lat(ms)',
                'Extra','Weight')
-    
     for d in doms:
         # fetch current values so as not to clobber them
         try:
@@ -1571,6 +1585,76 @@
         # not setting values, display info
         else:
             print_sedf(sedf_info)
+
+def xm_sched_sdp(args):
+    xenapi_unsupported()
+
+    def print_sdp(info):
+        info['priority']  = info['priority']
+        print( ("%(name)-32s %(domid)5d %(priority)5d") % info)
+
+    check_sched_type('sdp')
+
+    # we want to just display current info if no parameters are passed
+    if len(args) == 0:
+        domid = None
+    else:
+        # we expect at least a domain id (name or number)
+        # and at most a domid plus up to 5 options with values
+        arg_check(args, "sched-sdp", 1, 10)
+        domid = args[0]
+        # drop domid from args since get_opt doesn't recognize it
+        args = args[1:]
+
+    opts = {}
+    try:
+        (options, params) = getopt.gnu_getopt(args, 'p:', ['priority='])
+    except getopt.GetoptError, opterr:
+        err(opterr)
+        usage('sched-sdp')
+
+    # collect the options given on the command line
+    for (k, v) in options:
+        if k in ['-p', '--priority']:
+            opts['priority'] = v
+
+    doms = filter(lambda x : domid_match(domid, x),
+                        [parse_doms_info(dom)
+                         for dom in getDomains(None, 'running')])
+    if domid is not None and doms == []:
+        err("Domain '%s' does not exist." % domid)
+        usage('sched-sdp')
+
+    # print header if we aren't setting any parameters
+    if len(opts.keys()) == 0:
+        print '%-33s %4s %-4s' % \
+              ('Name','ID','Priority')
+
+    for d in doms:
+        # fetch current values so as not to clobber them
+        try:
+            sdp_raw = server.xend.domain.cpu_sdp_get(d['domid'])
+        except xmlrpclib.Fault:
+            # domain does not support sched-sdp?
+	    print "seems platform does not support sched-sdp"
+            sdp_raw = {}
+
+        sdp_info = parse_sdp_info(sdp_raw)
+        sdp_info['name'] = d['name']
+        # update values in case of call to set
+        if len(opts.keys()) > 0:
+            for k in opts.keys():
+                sdp_info[k]=opts[k]
+
+            # send the update, converting user input
+            v = map(int, [sdp_info['priority']])
+            rv = server.xend.domain.cpu_sdp_set(d['domid'], *v)
+            if int(rv) != 0:
+                err("Failed to set sdp parameters (rv=%d)."%(rv))
+
+        # not setting values, display info
+        else:
+            print_sdp(sdp_info)
 
 def xm_sched_credit(args):
     """Get/Set options for Credit Scheduler."""
@@ -2825,6 +2909,7 @@
     # scheduler
     "sched-sedf": xm_sched_sedf,
     "sched-credit": xm_sched_credit,
+    "sched-sdp": xm_sched_sdp,
     # block
     "block-attach": xm_block_attach,
     "block-detach": xm_block_detach,
diff -r e2f36d066b7b xen/common/Makefile
--- a/xen/common/Makefile	Mon Dec 22 13:48:40 2008 +0000
+++ b/xen/common/Makefile	Thu Jan 08 01:32:02 2009 -0500
@@ -13,6 +13,7 @@
 obj-y += rangeset.o
 obj-y += sched_credit.o
 obj-y += sched_sedf.o
+obj-y += sched_sdp.o
 obj-y += schedule.o
 obj-y += shutdown.o
 obj-y += softirq.o
diff -r e2f36d066b7b xen/common/sched_sdp.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/common/sched_sdp.c	Thu Jan 08 01:32:02 2009 -0500
@@ -0,0 +1,736 @@
+/****************************************************************************
+ * (C) 2008-2009 Zhiyuan Shao, Huazhong University of Sci.&Tech. PRC
+ ****************************************************************************
+ *
+ *        File: common/sched_sdp.c
+ *      Author: Zhiyuan Shao
+ *
+ * Description: Scheduler for Client Virtualization
+ */
+
+#include <xen/config.h>
+#include <xen/init.h>
+#include <xen/lib.h>
+#include <xen/sched.h>
+#include <xen/domain.h>
+#include <xen/delay.h>
+#include <xen/event.h>
+#include <xen/time.h>
+#include <xen/perfc.h>
+#include <xen/sched-if.h>
+#include <xen/softirq.h>
+#include <asm/atomic.h>
+#include <xen/errno.h>
+
+/*
+ Macros used below
+*/
+#define SDP_MSECS_PER_TICK       10
+#define SDP_MSECS_PER_TSLICE	 10
+
+#define SDP_PCPU(_c)     \
+    ((struct sdp_pcpu *)per_cpu(schedule_data, _c).sched_priv)
+#define SDP_DOM(_dom)    ((struct sdp_dom *) (_dom)->sched_priv)
+#define SDP_VCPU(_vcpu)  ((struct sdp_vcpu *) (_vcpu)->sched_priv)
+
+#define SDP_PRI_IDLE 		0
+#define SDP_BONUS_IDLE		-1024
+
+#define SDP_PRI_DEFAULT		80
+#define SDP_BONUS_DEFAULT	0
+
+#define RUNQ(_cpu)          (&(SDP_PCPU(_cpu)->runq))
+
+
+/*
+ * Physical CPU
+ */
+struct sdp_pcpu {
+    struct list_head runq;
+    uint32_t runq_sort_last;
+    struct timer ticker;
+    unsigned int tick;
+};
+
+/*
+ * Domain
+ */
+struct sdp_dom {
+    struct list_head active_vcpu;
+    struct list_head active_sdom_elem;
+    struct domain *dom;
+    uint16_t pri;
+    uint16_t active_vcpu_count;
+};
+
+/*
+ * Virtual CPU
+ */
+//A bonus field is added to the per-vcpu scheduler structure.
+// The intention is as follows:
+// when a vcpu is woken up, it receives a bonus of
+// MILLISECS(sdp_vcpu->pri - SDP_PRI_DEFAULT), or 0 if pri <= SDP_PRI_DEFAULT.
+// The runqueue is sorted by this bonus field, not by the priority field!
+// When the vcpu is scheduled out, the time it ran is subtracted from its bonus
+// until the bonus reaches zero again.
+
+//rules of wake up:
+// if the newly woken vcpu has lower priority than current, it will NOT preempt;
+// if the newly woken vcpu has higher priority than current, it preempts.
+struct sdp_vcpu {
+    struct list_head runq_elem;
+    struct list_head active_vcpu_elem;
+    struct sdp_dom *sdom;
+    struct vcpu *vcpu;
+    uint16_t flags;
+    uint16_t pri;
+    s32	   bonus;
+};
+
+/*
+ * System-wide private data
+ */
+struct sdp_private {
+    spinlock_t lock;
+    struct list_head active_sdom;
+    uint32_t ncpus;
+    unsigned int master;
+    cpumask_t idlers;
+    //if resort_before_sched_needed = 1, 
+    // resort it before scheduling, and set it back to 0
+    short resort_before_sched_needed;
+};
+
+/*
+ * Global variables
+ */
+static struct sdp_private sdp_priv;
+
+
+/* ====================================== Utility routines =================================== */
+static inline int
+__cycle_cpu(int cpu, const cpumask_t *mask)
+{
+    int nxt = next_cpu(cpu, *mask);
+    if (nxt == NR_CPUS)
+        nxt = first_cpu(*mask);
+    return nxt;
+}
+
+static inline int
+__vcpu_on_runq(struct sdp_vcpu *svc)
+{
+    return !list_empty(&svc->runq_elem);
+}
+
+static inline struct sdp_vcpu *
+__runq_elem(struct list_head *elem)
+{
+    return list_entry(elem, struct sdp_vcpu, runq_elem);
+}
+
+static inline void
+__runq_insert(unsigned int cpu, struct sdp_vcpu *svc)
+{
+    const struct list_head * const runq = RUNQ(cpu);
+    struct sdp_vcpu * iter_svc;
+    struct list_head *iter;
+
+    BUG_ON( __vcpu_on_runq(svc) );
+    BUG_ON( cpu != svc->vcpu->processor );
+
+    list_for_each( iter, runq )
+    {
+        iter_svc = __runq_elem(iter);
+        if ( svc->bonus > iter_svc->bonus )
+            break;
+    }
+
+    list_add_tail(&svc->runq_elem, iter);
+}
+
+static inline void
+__runq_remove(struct sdp_vcpu *svc)
+{
+    BUG_ON( !__vcpu_on_runq(svc) );
+    list_del_init(&svc->runq_elem);
+}
+
+static inline void
+__runq_tickle(unsigned int cpu, struct sdp_vcpu *new)
+{
+    struct sdp_vcpu * const cur =
+        SDP_VCPU(per_cpu(schedule_data, cpu).curr);
+    cpumask_t mask;
+
+    ASSERT(cur);
+    cpus_clear(mask);
+
+    /* If strictly higher priority than current VCPU, signal the CPU */
+    if ( new->pri > cur->pri )
+    {
+//	printk( "ready to call scheduling procedure on cpu:%d\n", cpu );
+        cpu_set(cpu, mask);
+    }
+//    else{
+//	printk( "but at cpu %d new->pri<=cur->pri, new->pri=%d, new->domid=%d, new->vcpuid=%d\n",
+//	    cpu, new->pri, new->vcpu->domain->domain_id, new->vcpu->vcpu_id );
+//	printk( "cur->pri=%d, cur->domid=%d, cur->vcpuid=%d\n",
+//	    cur->pri, cur->vcpu->domain->domain_id, cur->vcpu->vcpu_id );
+//    }
+
+    /* Send scheduler interrupts to designated CPUs */
+    if ( !cpus_empty(mask) )
+        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);
+}
+/*
+static void
+sdp_tick(void *_cpu)
+{
+    unsigned int cpu = (unsigned long)_cpu;
+    struct sdp_pcpu *spc = SDP_PCPU(cpu);
+
+    spc->tick++;
+
+    set_timer(&spc->ticker, NOW() + MILLISECS(SDP_MSECS_PER_TICK));
+}
+*/
+static int
+sdp_pcpu_init( int cpu )
+{
+    struct sdp_pcpu *spc;
+    unsigned long flags;
+
+//    printk( "sdp_pcpu_init is called, cpu = %d.\n", cpu );
+
+    /* Allocate per-PCPU info */
+    spc = xmalloc(struct sdp_pcpu);
+    if ( spc == NULL )  return -ENOMEM;
+
+    spin_lock_irqsave(&sdp_priv.lock, flags);
+
+    /* Initialize/update system-wide config */
+    if ( sdp_priv.ncpus <= cpu )
+        sdp_priv.ncpus = cpu + 1;
+    if ( sdp_priv.master >= sdp_priv.ncpus )
+        sdp_priv.master = cpu;
+
+//    init_timer(&spc->ticker, sdp_tick, (void *)(unsigned long)cpu, cpu);
+    INIT_LIST_HEAD(&spc->runq);
+    per_cpu(schedule_data, cpu).sched_priv = spc;
+
+    /* Start off idling... */
+    BUG_ON(!is_idle_vcpu(per_cpu(schedule_data, cpu).curr));
+    cpu_set(cpu, sdp_priv.idlers);
+
+    spin_unlock_irqrestore(&sdp_priv.lock, flags);
+
+//    printk( "sdp_pcpu_init returned.\n" );
+
+    return 0;
+}
+
+static void
+sdp_dump_vcpu(struct sdp_vcpu *svc)
+{
+    struct sdp_dom * const sdom = svc->sdom;
+
+    printk("[%i.%i] pri=%i flags=%x cpu=%i bonus=%d",
+            svc->vcpu->domain->domain_id,
+            svc->vcpu->vcpu_id,
+            svc->pri,
+            svc->flags,
+            svc->vcpu->processor,
+	    svc->bonus );
+
+    if ( sdom )
+    {
+        printk(" priority=%d", sdom->pri );
+    }
+
+    printk("\n");
+}
+
+//resort the runq. Note that the items are sorted by the bonus field.
+// a bubble sort is used here
+static void
+sdp_runq_resort( unsigned int cpu )
+{
+    struct sdp_pcpu * const spc = SDP_PCPU(cpu);
+    struct list_head *runq, *elem, *follow;
+    struct sdp_vcpu *svc_elem, *svc_follow;
+    int i, swapped;
+
+    runq = &spc->runq;
+
+    for( i=0; i<10; i++){
+	//allow at most 10 scans, each restarting from the head of the runq.
+	elem = runq->next;
+	swapped = 0;
+	while ( elem != runq )
+    	{
+	    follow = elem;
+	    elem = elem->next;
+	    if (elem == runq ) break;
+
+	    //compare the two svc
+            svc_elem = __runq_elem(elem);
+	    svc_follow = __runq_elem(follow);
+	
+	    if( svc_elem->bonus > svc_follow->bonus ){
+		//swap the two
+		printk( "SWAP running queue. pri->domid:%d vcpuid:%d bonus:%d\t",
+		   svc_follow->vcpu->domain->domain_id, svc_follow->vcpu->vcpu_id, svc_follow->bonus );
+		printk( "next: domid:%d, vcpuid:%d, bonus:%d\n",
+		   svc_elem->vcpu->domain->domain_id, svc_elem->vcpu->vcpu_id, svc_elem->bonus );
+		list_del( elem );
+		list_add_tail( elem, follow );
+		swapped = 1;
+		break;
+	    }
+    	}
+
+	if ( swapped == 0 ) break;
+    }
+//    if( i>0 )
+//	printk( "===============sdp_runq_resort: sorted for %d times===\n", i );
+}
+
+/* ====================================== Exposed routines =================================== */
+static int
+sdp_dom_init(struct domain *dom)
+{
+    struct sdp_dom *sdom;
+
+//    printk( "sdp_dom_init is called\n" );
+
+    if ( is_idle_domain(dom) )
+        return 0;
+
+    sdom = xmalloc(struct sdp_dom);
+    if ( sdom == NULL )  return -ENOMEM;
+
+    /* Initialize */
+    INIT_LIST_HEAD(&sdom->active_vcpu);
+    sdom->active_vcpu_count = 0;
+    INIT_LIST_HEAD(&sdom->active_sdom_elem);
+    sdom->dom = dom;
+    sdom->pri = SDP_PRI_DEFAULT; //set a default value for later change
+    dom->sched_priv = sdom;
+
+    //join the scheduling anyway, by adding the domain to sdp_priv.
+    // nothing depends on this list yet, but it may be useful later
+    // if we find that a global adjustment mechanism is needed.
+    list_add(&sdom->active_sdom_elem, &sdp_priv.active_sdom);
+
+    return 0;
+}
+
+static void
+sdp_dom_destroy(struct domain *dom)
+{
+    struct sdp_dom *sdom = SDP_DOM(dom); //Note, sdom==NULL for IDLE domain!
+
+    list_del( &sdom->active_sdom_elem );
+    xfree(SDP_DOM(dom));
+}
+
+static int
+sdp_vcpu_init(struct vcpu *vc)
+{
+    struct domain * const dom = vc->domain;
+    struct sdp_dom *sdom = SDP_DOM(dom); //Note, sdom==NULL for IDLE domain!
+    struct sdp_vcpu *svc;
+
+//    printk( "sdp_vcpu_init is called, processor:%d\n", vc->processor );
+//    printk( "domid:%d vcpuid:%d\n", dom->domain_id, vc->vcpu_id );
+
+    /* Allocate per-VCPU info */
+    svc = xmalloc(struct sdp_vcpu);
+    if ( svc == NULL )   return -ENOMEM;
+
+    INIT_LIST_HEAD(&svc->runq_elem);
+    INIT_LIST_HEAD(&svc->active_vcpu_elem);
+    svc->sdom = sdom;
+    svc->vcpu = vc;
+    svc->pri = is_idle_domain(dom) ? SDP_PRI_IDLE : sdom->pri;
+    vc->sched_priv = svc;
+
+    //join the sdom now. This is useful because the priority value may change
+    // if a domctl hypercall is made later. The idle domain has no
+    // such sdom, so its vcpus are left out
+    if ( svc->pri != SDP_PRI_IDLE ){
+	svc->bonus = SDP_BONUS_DEFAULT;
+        list_add(&svc->active_vcpu_elem, &sdom->active_vcpu);
+	sdom->active_vcpu_count++;	
+	//the vcpu is not active yet, but it is counted here anyway.
+	// later, the field should get a better name, such as vcpu_count
+    }else
+	svc->bonus = SDP_BONUS_IDLE;
+
+    /* Allocate per-PCPU info */
+    if ( unlikely(!SDP_PCPU(vc->processor)) )
+        if ( sdp_pcpu_init(vc->processor) != 0 )
+            return -1;
+
+    return 0;
+}
+
+static void
+sdp_vcpu_destroy(struct vcpu *vc)
+{
+    struct sdp_vcpu * const svc = SDP_VCPU(vc);
+    struct sdp_dom * const sdom = svc->sdom;
+    unsigned long flags;
+
+//    printk( "sdp_vcpu_destroy is called\n" );
+
+    BUG_ON( sdom == NULL );
+    BUG_ON( !list_empty(&svc->runq_elem) );
+
+    spin_lock_irqsave(&sdp_priv.lock, flags);
+
+    if ( !list_empty(&svc->active_vcpu_elem) )
+	list_del( &svc->active_vcpu_elem );
+
+//    if ( !list_empty(&svc->active_vcpu_elem) )
+//        __csched_vcpu_acct_stop_locked(svc);
+
+    spin_unlock_irqrestore(&sdp_priv.lock, flags);
+
+    xfree(svc);
+
+}
+
+static void
+sdp_vcpu_sleep(struct vcpu *vc)
+{
+    struct sdp_vcpu * const svc = SDP_VCPU(vc);
+
+//    printk( "sdp_vcpu_sleep is called\n" );
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( per_cpu(schedule_data, vc->processor).curr == vc )
+        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
+    else if ( __vcpu_on_runq(svc) )
+        __runq_remove(svc);
+}
+
+//rules of wake up:
+// if the newly woken vcpu has lower priority than current, it will NOT preempt;
+// if the newly woken vcpu has higher priority than current, it preempts.
+static void
+sdp_vcpu_wake(struct vcpu *vc)
+{
+    struct sdp_vcpu * const svc = SDP_VCPU(vc);
+    const unsigned int cpu = vc->processor;
+
+//    printk( "sdp_vcpu_wake is called, woken vcpu id:%d domainid:%d, wake point:%d\n", 
+//	vc->vcpu_id, vc->domain->domain_id, cpu );
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( unlikely(per_cpu(schedule_data, cpu).curr == vc) )
+    {
+	spin_lock_irq(&sdp_priv.lock);
+	if( svc->pri > SDP_PRI_DEFAULT )
+	    svc->bonus = MILLISECS( svc->pri - SDP_PRI_DEFAULT );
+	else
+	    svc->bonus = 0;
+	spin_unlock_irq(&sdp_priv.lock);
+	//no need to preempt since it is running
+        return;
+    }
+    if ( unlikely(__vcpu_on_runq(svc)) )
+    {
+        spin_lock_irq(&sdp_priv.lock);
+        if( svc->pri > SDP_PRI_DEFAULT ){
+            svc->bonus = MILLISECS( svc->pri - SDP_PRI_DEFAULT );
+            sdp_priv.resort_before_sched_needed = 1;
+//	    printk( "wake on runq, at %d PCPU, domid:%d vcpuid:%d, resort needed.\n",
+//		cpu, vc->domain->domain_id, vc->vcpu_id );
+        }else
+	    svc->bonus = 0;
+        spin_unlock_irq(&sdp_priv.lock);
+
+	if( svc->pri > SDP_VCPU(per_cpu(schedule_data, cpu).curr)->pri )
+	    __runq_tickle(cpu, svc);
+	//need to consider preemption
+        return;
+    }
+
+    if( svc->pri > SDP_PRI_DEFAULT )
+        svc->bonus = MILLISECS( svc->pri - SDP_PRI_DEFAULT );
+    else
+	svc->bonus = 0;
+
+    /* Put the VCPU on the runq and tickle CPUs */
+    __runq_insert(cpu, svc);
+
+    if( svc->pri > SDP_VCPU(per_cpu(schedule_data, cpu).curr)->pri )
+        __runq_tickle(cpu, svc);
+}
+
+static int
+sdp_dom_cntl(
+    struct domain *d,
+    struct xen_domctl_scheduler_op *op)
+{
+    struct sdp_dom * const sdom = SDP_DOM(d);
+    unsigned long flags;
+
+    struct list_head *iter_vcpu, *next_vcpu;
+    struct sdp_vcpu * svc;
+
+//    printk( "sdp_dom_cntl is called\n" );
+
+    if ( op->cmd == XEN_DOMCTL_SCHEDOP_getinfo )
+    {
+//	printk( "return priority:%d\n", sdom->pri );
+        op->u.sdp.pri = sdom->pri;
+    }
+    else
+    {
+        ASSERT(op->cmd == XEN_DOMCTL_SCHEDOP_putinfo);
+
+//	printk( "put info, new priority for domain %d is %d\n", d->domain_id, op->u.sdp.pri );
+
+        spin_lock_irqsave(&sdp_priv.lock, flags);
+
+	//update the domain priority (a value of 0 leaves it unchanged), then walk
+	// the domain's active_vcpu list so that every vcpu picks up the new value
+        if ( op->u.sdp.pri != 0 )
+            sdom->pri = op->u.sdp.pri;
+	//propagate the pri value to the vcpus that belong to this domain
+	printk( "number of VCPUs of this domain is %d.\n", sdom->active_vcpu_count );
+	list_for_each_safe( iter_vcpu, next_vcpu, &sdom->active_vcpu ){
+	    svc = list_entry(iter_vcpu, struct sdp_vcpu, active_vcpu_elem);
+            BUG_ON( sdom != svc->sdom );
+	    svc->pri = sdom->pri;
+	    printk( "vcpu :%d, new pri:%d\n", svc->vcpu->vcpu_id, svc->pri );
+	}
+
+        spin_unlock_irqrestore(&sdp_priv.lock, flags);
+    }
+
+    return 0;
+}
+
+static int
+sdp_cpu_pick(struct vcpu *vc)
+{
+    cpumask_t cpus;
+    cpumask_t idlers;
+    int cpu;
+
+//    printk( "sdp_cpu_pick is called. vc->id:%d,vc->processor:%d\n", 
+//	vc->vcpu_id, vc->processor );
+    /*
+     * Pick from online CPUs in VCPU's affinity mask, giving a
+     * preference to its current processor if it's in there.
+     */
+    cpus_and(cpus, cpu_online_map, vc->cpu_affinity);
+    cpu = cpu_isset(vc->processor, cpus)
+            ? vc->processor
+            : __cycle_cpu(vc->processor, &cpus);
+    ASSERT( !cpus_empty(cpus) && cpu_isset(cpu, cpus) );
+
+    /*
+     * Try to find an idle processor within the above constraints.
+     *
+     * In multi-core and multi-threaded CPUs, not all idle execution
+     * vehicles are equal!
+     *
+     * We give preference to the idle execution vehicle with the most
+     * idling neighbours in its grouping. This distributes work across
+     * distinct cores first and guarantees we don't do something stupid
+     * like run two VCPUs on co-hyperthreads while there are idle cores
+     * or sockets.
+     */
+    idlers = sdp_priv.idlers;
+    cpu_set(cpu, idlers);
+    cpus_and(cpus, cpus, idlers);
+    cpu_clear(cpu, cpus);
+
+    while ( !cpus_empty(cpus) )
+    {
+        cpumask_t cpu_idlers;
+        cpumask_t nxt_idlers;
+        int nxt;
+
+        nxt = __cycle_cpu(cpu, &cpus);
+
+        if ( cpu_isset(cpu, cpu_core_map[nxt]) )
+        {
+            ASSERT( cpu_isset(nxt, cpu_core_map[cpu]) );
+            cpus_and(cpu_idlers, idlers, cpu_sibling_map[cpu]);
+            cpus_and(nxt_idlers, idlers, cpu_sibling_map[nxt]);
+        }
+        else
+        {
+            ASSERT( !cpu_isset(nxt, cpu_core_map[cpu]) );
+            cpus_and(cpu_idlers, idlers, cpu_core_map[cpu]);
+            cpus_and(nxt_idlers, idlers, cpu_core_map[nxt]);
+        }
+
+        if ( cpus_weight(cpu_idlers) < cpus_weight(nxt_idlers) )
+        {
+            cpu = nxt;
+            cpu_clear(cpu, cpus);
+        }
+        else
+        {
+            cpus_andnot(cpus, cpus, nxt_idlers);
+        }
+    }
+
+//    printk( "sdp_cpu_pick returns %d\n", cpu );
+    return cpu;
+}
+
+static struct task_slice 
+sdp_schedule(s_time_t now)
+{
+    const int cpu = smp_processor_id();
+    struct list_head * const runq = RUNQ(cpu);
+    struct sdp_vcpu * const scurr = SDP_VCPU(current);
+    struct sdp_vcpu *snext;
+    struct task_slice ret;
+
+//    static int debug_time =0;
+
+    // reorder the queue before scheduling?
+    if ( sdp_priv.resort_before_sched_needed ){
+	sdp_runq_resort( cpu );
+	sdp_priv.resort_before_sched_needed = 0;
+    }
+
+    //recalculate the bonus value of this vcpu
+    if ( scurr->bonus > 0 ){
+	scurr->bonus -= (now - scurr->vcpu->runstate.state_entry_time);
+	if ( scurr->bonus < 0 ) scurr->bonus = 0;
+    }
+
+    /*
+     * Select next runnable local VCPU (ie top of local runq)
+     */
+    if ( vcpu_runnable(current) )
+        __runq_insert(cpu, scurr);
+    else
+        BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
+
+    snext = __runq_elem(runq->next);
+
+    __runq_remove(snext);
+
+    /*
+     * Return task to run next...
+     */
+    ret.time = MILLISECS(SDP_MSECS_PER_TSLICE);
+    ret.task = snext->vcpu;
+
+    return ret;
+}
+
+static void sdp_dump_pcpu(int cpu)
+{
+    struct list_head *runq, *iter;
+    struct sdp_pcpu *spc;
+    struct sdp_vcpu *svc;
+    int loop;
+    char cpustr[100];
+
+    spc = SDP_PCPU(cpu);
+    runq = &spc->runq;
+
+    cpumask_scnprintf(cpustr, sizeof(cpustr), cpu_sibling_map[cpu]);
+    printk("sibling=%s, ", cpustr);
+    cpumask_scnprintf(cpustr, sizeof(cpustr), cpu_core_map[cpu]);
+    printk("core=%s\n", cpustr);
+
+    /* current VCPU */
+    svc = SDP_VCPU(per_cpu(schedule_data, cpu).curr);
+    if ( svc )
+    {
+        printk("\trun: ");
+        sdp_dump_vcpu(svc);
+    }else{
+	printk( "no currently running vcpu!\n" );
+    }
+
+    loop = 0;
+    list_for_each( iter, runq )
+    {
+        svc = __runq_elem(iter);
+        if ( svc )
+        {
+            printk("\t%3d: ", ++loop);
+            sdp_dump_vcpu(svc);
+        }
+    }
+
+}
+
+static void sdp_dump(void)
+{
+//    printk( "sdp_dump is called\n" );
+}
+
+static void sdp_init(void)
+{
+    spin_lock_init(&sdp_priv.lock);
+    INIT_LIST_HEAD(&sdp_priv.active_sdom);
+    sdp_priv.ncpus = 0;
+    sdp_priv.master = UINT_MAX;
+    cpus_clear(sdp_priv.idlers);
+//    printk( "sdp_init is called\n" );
+}
+
+/*
+static __init int sdp_start_tickers(void)
+{
+    struct sdp_pcpu *spc;
+    unsigned int cpu;
+
+    printk( "sdp_start_tickers is called\n" );
+
+    if ( sdp_priv.ncpus == 0 )
+        return 0;
+
+    for_each_online_cpu ( cpu )
+    {
+        spc = SDP_PCPU(cpu);
+        set_timer(&spc->ticker, NOW() + MILLISECS(SDP_MSECS_PER_TICK));
+    }
+
+    printk( "sdp_start_tickers is returned\n" );
+    return 0;
+}
+__initcall(sdp_start_tickers);
+*/
+
+struct scheduler sched_sdp_def = {
+    .name           = "Scheduler for Client Virtualization",
+    .opt_name       = "sdp",
+    .sched_id       = XEN_SCHEDULER_SDP,
+
+    .init_domain    = sdp_dom_init,
+    .destroy_domain = sdp_dom_destroy,
+
+    .init_vcpu      = sdp_vcpu_init,
+    .destroy_vcpu   = sdp_vcpu_destroy,
+
+    .sleep          = sdp_vcpu_sleep,
+    .wake           = sdp_vcpu_wake,
+
+    .adjust         = sdp_dom_cntl,
+
+    .pick_cpu       = sdp_cpu_pick,
+    .do_schedule    = sdp_schedule,
+
+    .dump_cpu_state = sdp_dump_pcpu,
+    .dump_settings  = sdp_dump,
+    .init           = sdp_init,
+};
+
diff -r e2f36d066b7b xen/common/schedule.c
--- a/xen/common/schedule.c	Mon Dec 22 13:48:40 2008 +0000
+++ b/xen/common/schedule.c	Thu Jan 08 01:32:02 2009 -0500
@@ -51,9 +51,11 @@
 
 extern struct scheduler sched_sedf_def;
 extern struct scheduler sched_credit_def;
+extern struct scheduler sched_sdp_def;
 static struct scheduler *schedulers[] = { 
     &sched_sedf_def,
     &sched_credit_def,
+    &sched_sdp_def,
     NULL
 };
 
diff -r e2f36d066b7b xen/include/public/domctl.h
--- a/xen/include/public/domctl.h	Mon Dec 22 13:48:40 2008 +0000
+++ b/xen/include/public/domctl.h	Thu Jan 08 01:32:02 2009 -0500
@@ -294,6 +294,7 @@
 /* Scheduler types. */
 #define XEN_SCHEDULER_SEDF     4
 #define XEN_SCHEDULER_CREDIT   5
+#define XEN_SCHEDULER_SDP      6
 /* Set or get info? */
 #define XEN_DOMCTL_SCHEDOP_putinfo 0
 #define XEN_DOMCTL_SCHEDOP_getinfo 1
@@ -312,6 +313,9 @@
             uint16_t weight;
             uint16_t cap;
         } credit;
+	struct xen_domctl_sched_sdp {
+	    uint16_t pri;
+	} sdp;
     } u;
 };
 typedef struct xen_domctl_scheduler_op xen_domctl_scheduler_op_t;

[-- Attachment #4: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-10  0:33   ` Tian, Kevin
@ 2009-04-10 16:15     ` Jeremy Fitzhardinge
  2009-04-10 17:16       ` Ian Pratt
  2009-04-11  9:52       ` Tian, Kevin
  0 siblings, 2 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2009-04-10 16:15 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: George Dunlap, xen-devel@lists.xensource.com

Tian, Kevin wrote:
>> From: Jeremy Fitzhardinge
>> Sent: April 10, 2009 2:42
>>
>> George Dunlap wrote:
>>> 1. Design targets
>>>
>>> We have three general use cases in mind: Server consolidation, virtual
>>> desktop providers, and clients (e.g. XenClient).
>>>
>>> For servers, our target "sweet spot" for which we will optimize is a
>>> system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus).
>>> Ideal performance is expected to be reached at about 80% total system
>>> cpu utilization; but the system should function reasonably well up to
>>> a utilization of 800% (e.g., a load of 8).
>>>
>> Is that forward-looking enough?  That hardware is currently available;
>> what's going to be commonplace in 2-3 years?
>>
>
> good point.
>
>>> * HT-aware.
>>>
>>> Running on a logical processor with an idle peer thread is not the
>>> same as running on a logical processor with a busy peer thread.  The
>>> scheduler needs to take this into account when deciding "fairness".
>>>
>> Would it be worth just pair-scheduling HT threads so they're always
>> running in the same domain?
>>
>
> running same domain doesn't help fairness and instead, it worsens.
>

I don't know what the performance characteristics of modern-HT are, but
in P4-HT the throughput of a given thread was very dependent on what the
other thread was doing. If it's competing with some other arbitrary
domain, then it's hard to make any estimates about what the throughput of
a given vcpu's thread is.

If we present them as sibling pairs to guests, then it becomes the guest
OS's problem (ie, we don't try to hide the true nature of these pcpus).
That's fairer for the guest, because they know what they're getting, and
Xen can charge the guest for cpu use on a thread-pair, rather than
trying to work out how the two threads compete. In other words, if only
one thread is running, then it can charge max-thread-throughput; if both
are running, it can charge max-core-throughput (possibly scaled by
whatever performance mode the core is running in).
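
To make that charging rule concrete, here is a minimal sketch in C. The
function name, the two constants and the 130% figure are all made up for
illustration (real values would have to be measured per CPU model); the
only thing taken from the paragraph above is the rule itself: charge
thread throughput when one sibling runs, core throughput when both do.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical relative throughput, in percent of one thread running
 * alone on a core.  Placeholder numbers, not measurements. */
#define THREAD_ALONE_PCT 100u  /* "max-thread-throughput" */
#define CORE_BOTH_PCT    130u  /* "max-core-throughput", both threads busy */

/*
 * Charge a guest that owns a whole sibling pair for `ns` nanoseconds of
 * wall time, given how many of its two threads were running (0, 1 or 2).
 * The result is in units of "one thread running alone" nanoseconds.
 */
uint64_t pair_charge(uint64_t ns, unsigned int threads_running)
{
    assert(threads_running <= 2);

    switch (threads_running) {
    case 0:
        return 0;                             /* pair idle: nothing to charge */
    case 1:
        return ns * THREAD_ALONE_PCT / 100;   /* one thread has the core */
    default:
        return ns * CORE_BOTH_PCT / 100;      /* whole core in use */
    }
}
```

The point of charging per pair is that Xen never has to estimate how two
unrelated vcpus interfere with each other; it only needs one scaling
figure per core.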

>>> * Power-aware.
>>>
>>> Using as many sockets / cores as possible can increase the total cache
>>> size available to VMs, and thus (in the absence of inter-VM sharing)
>>> increase total computing power; but by keeping multiple sockets and
>>> cores powered up, also increases the electrical power used by the
>>> system.  We want a configurable way to balance between maximizing
>>> processing power vs minimizing electrical power.
>>>
>> I don't remember if there's a proper term for this, but what about
>> having multiple domains sharing the same scheduling context, so that a
>> stub domain can be co-scheduled with its main domain, rather than having
>> them treated separately?
>>
>
> This is really desired.
>
>> Also, a somewhat related point, some kind of directed schedule so that
>> when one vcpu is synchronously waiting on another vcpu, have it directly
>> hand over its pcpu to avoid any cross-cpu overhead (including the
>> ability to take advantage of directly using hot cache lines).  That
>> would be useful for intra-domain IPIs, etc, but also inter-domain
>> context switches (domain<->stub, frontend<->backend, etc).
>>
>
> The hard part here is finding a hint about WHICH vcpu a given
> cpu is waiting on, which is not straightforward. Of course a stub
> domain is the most likely example, but that may already be cleanly
> addressed if the co-scheduling above could be added? :-)
>

I'm being unclear by conflating two issues.

One is that when dom0 (or driver domain) does some work on behalf of a
guest, it seems like it would be useful for the time used to be credited
against the guest rather than against dom0.

My thought is that, rather than having the scheduler parameters be the
implicit result of "vcpu A belongs to domain X, charge X", each vcpu has
a charging domain which can be updated via (privileged) hypercall. When
dom0 is about to do some work, it updates the charging domain
accordingly (with some machinery to make that a per-task property within
the kernel so that task context switches update the vcpu state
appropriately).

A further extension would be the idea of charging grants, where domain A
could grant domain B charging rights, and B could set its vcpus to
charge A as an unprivileged operation. As with grant tables, revocation
poses some interesting problems.

This is a generalization of coscheduled stub domains, because you could
achieve the same effect by making the stub domain simply switch all its
vcpus to charge its main domain.
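
A rough sketch of what I mean, in C. The structures and function names
here are invented for illustration and are far simpler than Xen's real
`struct vcpu` / `struct domain`; the hypercall plumbing and privilege
checks are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified types; not Xen's real structures. */
struct domain {
    int     domid;
    int64_t credit;         /* remaining scheduler credit */
};

struct vcpu {
    struct domain *home;    /* domain the vcpu belongs to */
    struct domain *charge;  /* domain that currently pays for its cpu time */
};

/* What a (privileged) "set charging domain" hypercall might boil down to.
 * Passing NULL reverts to charging the vcpu's home domain. */
void vcpu_set_charge_domain(struct vcpu *v, struct domain *payer)
{
    assert(v != NULL && v->home != NULL);
    v->charge = payer ? payer : v->home;
}

/* Accounting hook: bill `ns` of run time to whoever is currently paying. */
void vcpu_account(struct vcpu *v, int64_t ns)
{
    v->charge->credit -= ns;
}
```

A co-scheduled stub domain is then just the case where every vcpu of the
stub has `charge` pointing at its main domain for its whole lifetime.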

How to schedule vcpus? They could either be scheduled as if they were
part of the other domain; or be scheduled with their "home" domain, but
their time spent is charged against the other domain. The former is
effectively priority inheritance, and raises all the normal issues -
but it would be appropriate for co-scheduled stub domains. The latter
makes more sense for dom0, but it's less clear what it actually means:
does it consume any home domain credits? What happens if the other
domain's credits are all consumed? Could two domains collude to get more
than their fair share of cpu?

The second issue is trying to share pcpu resources between vcpus where
appropriate. The obvious case is doing some kind of cross-domain copy
operation, where the data could well be hot in cache, so if you use the
same pcpu you can just get cache hits. Of course there's the tradeoff
that you're necessarily serialising things which could be done in
parallel, so perhaps it doesn't work well in practice.

J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-10 16:15     ` Jeremy Fitzhardinge
@ 2009-04-10 17:16       ` Ian Pratt
  2009-04-10 17:19         ` Jeremy Fitzhardinge
                           ` (2 more replies)
  2009-04-11  9:52       ` Tian, Kevin
  1 sibling, 3 replies; 35+ messages in thread
From: Ian Pratt @ 2009-04-10 17:16 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Tian, Kevin
  Cc: George Dunlap, Ian Pratt, xen-devel@lists.xensource.com

> I don't know what the performance characteristics of modern-HT are, but
> in P4-HT the throughput of a given thread was very dependent on what the
> other thread was doing. If it's competing with some other arbitrary
> domain, then it's hard to make any estimates about what the throughput
> of a given vcpu's thread is.

The original Northwood P4s were fairly horrible as regards performance predictability, but things got considerably better with later steppings. Nehalem has some interesting features that ought to make it better yet.

Presenting sibling pairs to guests is probably preferable (it avoids any worries about side channel crypto attacks), but I certainly wouldn't restrict it to just that: server hosted desktop workloads often involve large numbers of single VCPU guests, and you want every logical processor available.

Scaling the accounting if two threads share a core is a good way of ensuring things tend toward longer term fairness.

Possibly having two modes of operation would be a good thing:

 1. explicitly present HT to guests and gang schedule threads

 2. normal free-for-all with HT aware accounting.

Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.


Ian 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-10 17:16       ` Ian Pratt
@ 2009-04-10 17:19         ` Jeremy Fitzhardinge
  2009-04-11 10:00           ` Tian, Kevin
  2009-04-15 13:54           ` George Dunlap
  2009-04-10 17:34         ` Jeremy Fitzhardinge
  2009-04-11  9:57         ` Tian, Kevin
  2 siblings, 2 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2009-04-10 17:19 UTC (permalink / raw)
  To: Ian Pratt; +Cc: George Dunlap, Tian, Kevin, xen-devel@lists.xensource.com

Ian Pratt wrote:
>> I don't know what the performance characteristics of modern-HT are, but
>> in P4-HT the throughput of a given thread was very dependent on what the
>> other thread was doing. If it's competing with some other arbitrary
>> domain, then it's hard to make any estimates about what the throughput
>> of a given vcpu's thread is.
>>
>
> The original Northwood P4s were fairly horrible as regards performance predictability, but things got considerably better with later steppings. Nehalem has some interesting features that ought to make it better yet.
>
> Presenting sibling pairs to guests is probably preferable (it avoids any worries about side channel crypto attacks), but I certainly wouldn't restrict it to just that: server hosted desktop workloads often involve large numbers of single VCPU guests, and you want every logical processor available.
>
> Scaling the accounting if two threads share a core is a good way of ensuring things tend toward longer term fairness.
>
> Possibly having two modes of operation would be a good thing:
>
>  1. explicitly present HT to guests and gang schedule threads
>
>  2. normal free-for-all with HT aware accounting.
>
> Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.
>   

This can probably be extended to Intel's hyper-dynamic flux mode (that 
may not be the real marketing name), where it can overclock one core if 
the other is idle.

    J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-10 17:16       ` Ian Pratt
  2009-04-10 17:19         ` Jeremy Fitzhardinge
@ 2009-04-10 17:34         ` Jeremy Fitzhardinge
  2009-04-11  9:57         ` Tian, Kevin
  2 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2009-04-10 17:34 UTC (permalink / raw)
  To: Ian Pratt; +Cc: George Dunlap, Tian, Kevin, xen-devel@lists.xensource.com

Ian Pratt wrote:
>  1. explicitly present HT to guests and gang schedule threads
>
>  2. normal free-for-all with HT aware accounting.
>
> Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.
>   

Well, we could extend vcpu hotplug to deal with those kinds of cpu
topology changes, but I guess that doesn't help most Windows/hvm
guests.  But if those vcpus stop being siblings, I don't think it
would hurt if we stopped gang scheduling them, so long as they're
kept close (same package, I guess).

    J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-10 16:15     ` Jeremy Fitzhardinge
  2009-04-10 17:16       ` Ian Pratt
@ 2009-04-11  9:52       ` Tian, Kevin
  2009-04-15 15:56         ` George Dunlap
  1 sibling, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2009-04-11  9:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: George Dunlap, xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 5740 bytes --]

>From: Jeremy Fitzhardinge [mailto:jeremy@goop.org] 
>Sent: April 11, 2009 0:15
>>>> * HT-aware.
>>>>
>>>> Running on a logical processor with an idle peer thread is not the
>>>> same as running on a logical processor with a busy peer thread.  The
>>>> scheduler needs to take this into account when deciding "fairness".
>>>>
>>> Would it be worth just pair-scheduling HT threads so they're always
>>> running in the same domain?
>>>
>>
>> running same domain doesn't help fairness and instead, it worsens.
>>
>
>I don't know what the performance characteristics of modern-HT are, but
>in P4-HT the throughput of a given thread was very dependent on what the
>other thread was doing. If it's competing with some other arbitrary
>domain, then it's hard to make any estimates about what the throughput of
>a given vcpu's thread is.
>
>If we present them as sibling pairs to guests, then it becomes the guest
>OS's problem (ie, we don't try to hide the true nature of these pcpus).
>That's fairer for the guest, because they know what they're getting, and
>Xen can charge the guest for cpu use on a thread-pair, rather than
>trying to work out how the two threads compete. In other words, if only
>one thread is running, then it can charge max-thread-throughput; if both
>are running, it can charge max-core-throughput (possibly scaled by
>whatever performance mode the core is running in).

This rests on the assumption that workloads within one VM are more
HT-friendly than workloads across VMs. That may be true in some cases,
but I don't think it is a strong argument for most deployments.

My major worry is the complexity added by exposing such sibling pairs
to the guest. You then have to schedule that VM at core level, since
the HT relationship must always be maintained, or else the VM could see
the reverse effect when it tries to exploit the topology. This makes
trouble for the scheduler. Not all VMs are SMP guests, and a VM that is
exposed to HT is actually treated unfairly, because it carries one more
restriction (it can't use a partially idle core) that other VMs are
immune to. Another tricky part is that you have to gang schedule that
VM, which is fancy in concept, but no one has come up with a solid
implementation in practice.

The above is why I said fairness could get worse in general. It could
be useful in some specific scenarios. One is the client, where however
it's better to expose the full topology rather than just HT. The other
is some mission-critical usages where cpu resources are partitioned,
so exposing HT could also be useful.


>>> Also, a somewhat related point, some kind of directed schedule so that
>>> when one vcpu is synchronously waiting on another vcpu, have it directly
>>> hand over its pcpu to avoid any cross-cpu overhead (including the
>>> ability to take advantage of directly using hot cache lines).  That
>>> would be useful for intra-domain IPIs, etc, but also inter-domain
>>> context switches (domain<->stub, frontend<->backend, etc).
>>>
>>
>> The hard part here is finding a hint about WHICH vcpu a given
>> cpu is waiting on, which is not straightforward. Of course a stub
>> domain is the most likely example, but that may already be cleanly
>> addressed if the co-scheduling above could be added? :-)
>>
>
>I'm being unclear by conflating two issues.
>
>One is that when dom0 (or driver domain) does some work on behalf of a
>guest, it seems like it would be useful for the time used to be credited
>against the guest rather than against dom0.
>
>My thought is that, rather than having the scheduler parameters be the
>implicit result of "vcpu A belongs to domain X, charge X", each vcpu has
>a charging domain which can be updated via (privileged) hypercall. When
>dom0 is about to do some work, it updates the charging domain
>accordingly (with some machinery to make that a per-task property within
>the kernel so that task context switches update the vcpu state
>appropriately).
>
>A further extension would be the idea of charging grants, where domain A
>could grant domain B charging rights, and B could set its vcpus to
>charge A as an unprivileged operation. As with grant tables, revocation
>poses some interesting problems.
>
>This is a generalization of coscheduled stub domains, because you could
>achieve the same effect by making the stub domain simply switch all its
>vcpus to charge its main domain.


Yup. This has long been missing from Xen. The current accounting
mechanism, as seen in xentop, is raw and incomplete. KVM may have an
easier time with this part, since it can lean on containers.


>
>How to schedule vcpus? They could either be scheduled as if they were
>part of the other domain; or be scheduled with their "home" domain, but
>their time spent is charged against the other domain. The former is
>effectively priority inheritance, and raises all the normal issues -
>but it would be appropriate for co-scheduled stub domains. The latter
>makes more sense for dom0, but it's less clear what it actually means:
>does it consume any home domain credits? What happens if the other
>domain's credits are all consumed? Could two domains collude to get more
>than their fair share of cpu?
>
>
>
>The second issue is trying to share pcpu resources between vcpus where
>appropriate. The obvious case is doing some kind of cross-domain copy
>operation, where the data could well be hot in cache, so if you use the
>same pcpu you can just get cache hits. Of course there's the tradeoff
>that you're necessarily serialising things which could be done in
>parallel, so perhaps it doesn't work well in practice.
>
>J
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-10 17:16       ` Ian Pratt
  2009-04-10 17:19         ` Jeremy Fitzhardinge
  2009-04-10 17:34         ` Jeremy Fitzhardinge
@ 2009-04-11  9:57         ` Tian, Kevin
  2009-04-11 17:11           ` Ian Pratt
  2 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2009-04-11  9:57 UTC (permalink / raw)
  To: Ian Pratt, Jeremy Fitzhardinge
  Cc: George Dunlap, xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 1429 bytes --]

>From: Ian Pratt [mailto:Ian.Pratt@eu.citrix.com] 
>Sent: April 11, 2009 1:16
>
>> I don't know what the performance characteristics of modern-HT are, but
>> in P4-HT the throughput of a given thread was very dependent on what the
>> other thread was doing. If it's competing with some other arbitrary
>> domain, then it's hard to make any estimates about what the throughput
>> of a given vcpu's thread is.
>
>The original Northwood P4s were fairly horrible as regards performance predictability, but things got considerably better with later steppings. Nehalem has some interesting features that ought to make it better yet.
>
>Presenting sibling pairs to guests is probably preferable (it avoids any worries about side channel crypto attacks), but I certainly wouldn't restrict it to just that: server hosted desktop workloads often involve large numbers of single VCPU guests, and you want every logical processor available.
>
>Scaling the accounting if two threads share a core is a good way of ensuring things tend toward longer term fairness.
>
>Possibly having two modes of operation would be a good thing:
>
> 1. explicitly present HT to guests and gang schedule threads
>
> 2. normal free-for-all with HT aware accounting.
>
>Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.
>
>

What do you mean by 'free-for-all'?

Thanks,
Kevin


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-10 17:19         ` Jeremy Fitzhardinge
@ 2009-04-11 10:00           ` Tian, Kevin
  2009-04-15 15:47             ` George Dunlap
  2009-04-15 13:54           ` George Dunlap
  1 sibling, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2009-04-11 10:00 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Ian Pratt
  Cc: George Dunlap, xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 2098 bytes --]

>From: Jeremy Fitzhardinge [mailto:jeremy@goop.org] 
>Sent: April 11, 2009 1:20
>Ian Pratt wrote:
>>> I don't know what the performance characteristics of modern-HT are, but
>>> in P4-HT the throughput of a given thread was very dependent on what the
>>> other thread was doing. If it's competing with some other arbitrary
>>> domain, then it's hard to make any estimates about what the throughput
>>> of a given vcpu's thread is.
>>>
>>
>> The original Northwood P4s were fairly horrible as regards performance predictability, but things got considerably better with later steppings. Nehalem has some interesting features that ought to make it better yet.
>>
>> Presenting sibling pairs to guests is probably preferable (it avoids any worries about side channel crypto attacks), but I certainly wouldn't restrict it to just that: server hosted desktop workloads often involve large numbers of single VCPU guests, and you want every logical processor available.
>>
>> Scaling the accounting if two threads share a core is a good way of ensuring things tend toward longer term fairness.
>>
>> Possibly having two modes of operation would be a good thing:
>>
>>  1. explicitly present HT to guests and gang schedule threads
>>
>>  2. normal free-for-all with HT aware accounting.
>>
>> Of course, #1 isn't optimal if guests may migrate between HT and non-HT systems.
>>
>
>This can probably be extended to Intel's hyper-dynamic flux mode (that
>may not be the real marketing name), where it can overclock one core if
>the other is idle.

The usual name for this is Turbo Boost. However, it would be difficult
for software to account for the extra cycles gained from overclocking,
because whether a boost actually happens, and how many cycles it adds,
is controlled entirely by a hardware unit. There is a feedback
mechanism, though, for obtaining the average frequency over an elapsed
interval. However, the cpufreq governor currently runs in a time-based
style with no connection to the scheduler. That's one part we could
enhance further.

Thanks,
Kevin

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-11  9:57         ` Tian, Kevin
@ 2009-04-11 17:11           ` Ian Pratt
  2009-04-12  6:27             ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Ian Pratt @ 2009-04-11 17:11 UTC (permalink / raw)
  To: Tian, Kevin, Jeremy Fitzhardinge
  Cc: George Dunlap, Ian Pratt, xen-devel@lists.xensource.com

> >Possibly having two modes of operation would be good thing:
> >
> > 1. explicitly present HT to guests and gang schedule threads
> >
> > 2. normal free-for-all with HT aware accounting.
> >
> >Of course, #1 isn't optimal if guests may migrate between HT
> >and non-HT systems.
> 
> what do you mean by 'free-for-all'?

Same as today, i.e. we don't gang schedule and all threads are available for running VCPUs. 

I think it's reasonable to have two different modes of operation. For some CPU-intensive server virtualization-type workloads the admin basically wants to partition the machine. In this situation it's reasonable to expose the physical topology to guests (not just hyperthreads, but potentially cores/sockets/nodes and all the gory SLIT/SRAT tables stuff too). 

For more general virtualization workloads, where the total number of VCPUs is considerably greater than the number of physical CPUs, the current behaviour is preferable. HT aware accounting will mean that VCPUs that run concurrently on the same core will be charged less than the full period they are scheduled for.

Thanks,
Ian


* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-11 17:11           ` Ian Pratt
@ 2009-04-12  6:27             ` Tian, Kevin
  0 siblings, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2009-04-12  6:27 UTC (permalink / raw)
  To: Ian Pratt, Jeremy Fitzhardinge
  Cc: George Dunlap, xen-devel@lists.xensource.com


>From: Ian Pratt [mailto:Ian.Pratt@eu.citrix.com] 
>Sent: April 12, 2009 1:12
>> >Possibly having two modes of operation would be good thing:
>> >
>> > 1. explicitly present HT to guests and gang schedule threads
>> >
>> > 2. normal free-for-all with HT aware accounting.
>> >
>> >Of course, #1 isn't optimal if guests may migrate between HT
>> >and non-HT systems.
>> 
>> what do you mean by 'free-for-all'?
>
>Same as today, i.e. we don't gang schedule and all threads are 
>available for running VCPUs. 
>
>I think it's reasonable to have two different modes of 
>operation. For some CPU-intensive server virtualization-type 
>workloads the admin basically wants to partition the machine. 
>In this situation it's reasonable to expose the physical 
>topology to guests (not just hyperthreads, but potentially 
>cores/sockets/nodes and all the gory SLIT/SRAT tables stuff too). 
>
>For more general virtualization workloads where the total 
>number of VCPUs is rather greater than the number of physical 
>CPUs then the current behaviour is preferable. HT aware 
>accounting will mean that VCPUs that run concurrently on the 
>same core will be charged less than the full period they are 
>scheduled for.
>

Agree.

Thanks
Kevin


* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-10 17:19         ` Jeremy Fitzhardinge
  2009-04-11 10:00           ` Tian, Kevin
@ 2009-04-15 13:54           ` George Dunlap
  2009-04-15 16:23             ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 35+ messages in thread
From: George Dunlap @ 2009-04-15 13:54 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Ian Pratt, xen-devel@lists.xensource.com, Tian, Kevin

On Fri, Apr 10, 2009 at 6:19 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> This can probably be extended to Intel's hyper-dynamic flux mode (that may
> not be the real marketing name), where it can overclock one core if the
> other is idle.

Jeremy,

Did you mean we could expose an entire socket to a guest VM, so that
it could schedule so as to take advantage of the effects of Turbo
Boost, just as we can expose thread pairs to a VM and let the guest OS
scheduler deal with threading issues?

 -George


* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-09 18:41 ` Jeremy Fitzhardinge
  2009-04-10  0:33   ` Tian, Kevin
@ 2009-04-15 14:29   ` George Dunlap
  1 sibling, 0 replies; 35+ messages in thread
From: George Dunlap @ 2009-04-15 14:29 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: xen-devel@lists.xensource.com

On Thu, Apr 9, 2009 at 7:41 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> I don't remember if there's a proper term for this, but what about having
> multiple domains sharing the same scheduling context, so that a stub domain
> can be co-scheduled with its main domain, rather than having them treated
> separately?

I think it's been informally called "co-scheduling". :-)  Yes, I'm
going to be looking into that.  One of the things that makes it a
little less easy is that (as I understand it) there is only one stub
domain "vcpu" per VM, which is shared by all of the VM's vcpus.

> Also, a somewhat related point, some kind of directed schedule so that when
> one vcpu is synchronously waiting on another vcpu, have it directly hand
> over its pcpu to avoid any cross-cpu overhead (including the ability to take
> advantage of directly using hot cache lines).  That would be useful for
> intra-domain IPIs, etc, but also inter-domain context switches
> (domain<->stub, frontend<->backend, etc).

The only problem is if the "service" domain has other work that it may
do after it's done.  In my tests on a 2-core box doing scp to an HVM
guest, it's faster if I pin dom0 and domU to separate cores than if I
pin them to the same core.  Looking at the traces, it seems as though
after dom0 has woken up domU, it spends another 50K cycles or so
before blocking.  Stub domains may behave differently; in any case,
it's something that needs experimentation to decide.

>> For example, one could give dom0 a "reservation" of 50%, but leave the
>> weight at 256.  No matter how many other VMs run with a weight of 256,
>> dom0 will be guaranteed to get 50% of one cpu if it wants it.
>>
>
> How does the reservation interact with the credits?  Is the reservation in
> addition to its credits, or does using the reservation consume them?

I think your question is, how does the reservation interact with
weight?  (Credits is the mechanism to implement both.)  The idea is
that a VM would get either an amount of cpu proportional to its
weight, or the reservation, whichever is greater.

So suppose that VMs A, B, and C have weights of 256 on a system with 1
core, no reservations.

If A and B are burning as much cpu as they can and C is idle, then A
and B should get 50% each.

If all of them (A,B,C) are burning as much cpu as they can, they
should get about 33% each.

Now suppose that we give B a reservation of 40%.

If A and B are burning as much as they can and C is idle, then A and B
should again get 50% each.

However, if all of them are burning as much as they can, then B should
get 40% (its reservation), and A and C should each get 30% (i.e., the
remaining 60% divided by weight).

Does that make sense?
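
The arithmetic in the example above can be sketched in a few lines of
Python.  This is only an illustration of the "weight or reservation,
whichever is greater" rule for VMs that all want as much cpu as they can
get; the function name cpu_shares, the percentage units and the
iterative pinning approach are assumptions made for the sketch, not the
credit2 implementation.

```python
def cpu_shares(vms, capacity=100.0):
    """Split `capacity` (in percent) among busy VMs.

    vms maps a VM name to (weight, reservation_percent); weights are
    assumed to be positive.  A VM whose weight-proportional share would
    fall below its reservation is pinned at the reservation, and the
    rest of the capacity is re-divided by weight among the others.
    """
    pinned = {}
    floating = dict(vms)
    while True:
        remaining = capacity - sum(pinned.values())
        total_weight = sum(w for w, _ in floating.values())
        short = [name for name, (w, r) in floating.items()
                 if remaining * w / total_weight < r]
        if not short:
            break
        for name in short:
            pinned[name] = floating.pop(name)[1]
    shares = dict(pinned)
    for name, (w, _) in floating.items():
        shares[name] = remaining * w / total_weight
    return shares


# A and B busy, C idle: the reservation makes no difference.
print(cpu_shares({"A": (256, 0), "B": (256, 40)}))
# All three busy: B is held at its reservation, A and C split the rest.
print(cpu_shares({"A": (256, 0), "B": (256, 40), "C": (256, 0)}))
```

With C idle, A and B come out at 50% each; with all three busy, B is
held at 40% and A and C get 30% each, matching the cases above.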

> Is it worth taking into account the power cost of cache misses vs hits?

If we have a general framework for "goodness" and "badness", and we
have a way of measuring cache hits / misses, we should be able to
extend the scheduler to do so.

> Do vcpus running on pcpus running at less than 100% speed consume fewer
> credits?

Yes, we'll also need to account for cpu frequency states in our accounting.

> Is there any explicit interface to cpu power state management, or would that
> be decoupled?

I think we might be able to fold this in; it depends on how
complicated things get.  Just as one can imagine a "badness factor" for
powering up a second CPU, which we can weigh against the "badness" of
vcpus waiting on the runqueue, we can imagine a "badness factor" for
running at a higher cpu frequency that can be weighed against either
powering up extra cores / cpus or having to wait on the runqueue.

Let's start with a basic "badness factor" and see if we can get it
worked out properly, and then look at extending it to these sorts of
things.


* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-10  0:15 ` Tian, Kevin
@ 2009-04-15 15:07   ` George Dunlap
  2009-04-16  4:58     ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: George Dunlap @ 2009-04-15 15:07 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: xen-devel@lists.xensource.com

2009/4/10 Tian, Kevin <kevin.tian@intel.com>:
> How about VM number in total you'd like to support?

A rule-of-thumb number would be that we want to perform well at 4 VMs
per core, and wouldn't mind having a performance "cliff" past 8 per
core (not thread).  So for a 16-core system, that would be "good" for
64 VMs and "acceptable" up to 128 VMs.

> Do you mean that same elapsed time in above two scenarios will be
> translated into different credits?

Yes.  Ideally, we want to give "processing power" based on weight.
But the "processing power" of a thread whose sibling is idle is
significantly more than the "processing power" of a thread whose
sibling is running.  (Same thing possibly for cpu frequency scaling.)
So we'd want to arrange the credits such that VMs with equal weight
get equal "processing power", not just equal "time on a logical cpu".
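
As a rough sketch of what "charging for processing power rather than
time" could look like: the function below deducts fewer credits for the
part of a run during which the sibling thread was also busy.  The name
credits_charged and the smt_factor value are hypothetical; the real
ratio would have to be measured, and nothing here is taken from the
actual patches.

```python
def credits_charged(run_ns, sibling_busy_ns, smt_factor=0.6):
    """Credits to deduct for a vcpu that ran for run_ns on one thread.

    sibling_busy_ns is how much of that time the sibling hardware
    thread was busy as well.  smt_factor is an assumed ratio: the
    fraction of a full core's processing power that a thread gets
    while its sibling is running.
    """
    if not 0 <= sibling_busy_ns <= run_ns:
        raise ValueError("sibling_busy_ns must be within [0, run_ns]")
    alone_ns = run_ns - sibling_busy_ns
    # Time with the core to itself is charged in full; shared time is
    # charged at the reduced rate.
    return alone_ns + sibling_busy_ns * smt_factor


print(credits_charged(1_000_000, 0))          # sibling idle throughout
print(credits_charged(1_000_000, 1_000_000))  # sibling busy throughout
```

The same shape would work for frequency scaling: multiply by the ratio
of the actual to the nominal frequency instead of by smt_factor.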

> Xen3.4 now supports "sched_smt_power_savings" (both boot option
> and touchable by xenpm) to change power/performance preference.
> It's simple implementation to simply reverse the span order from
> existing package->core->thread to thread->core->package. More
> fine-grained flexibility could be given in future if hierarchical scheduling
> concept could be more clearly constructed like domain scheduler
> in Linux.

I haven't looked at this code.  From your description here it sounds
like a sort of a simple hack to get the effect we want (either
spreading things out or pushing them together) -- is that correct?

My general feeling is that hacks are good short-term solutions, but
not long-term.  Things always get more complicated, and often have
unexpected side-effects.  I think since we're doing scheduler work,
it's worth it to try to see if we can actually solve the
power/performance problem.

> imo, weight is not strictly translated into the care for latency. any
> elaboration on that? I remembered that previously Nishiguchi-san
> gave idea to boost credit, and Disheng proposed static priority.
> Maybe you can make a summary to help people how latency would
> be exactly ensured in your proposal

All of this needs to be run through experiments.  So far, I've had
really good success with putting waking VMs in "boost" priority for
1ms if they still have credits.  (And unlike the credit scheduler, I
try to make sure that a VM rarely runs out of credits.)
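
A minimal sketch of the wake-up rule described above, assuming a
per-vcpu credit counter and a nanosecond clock.  The Vcpu class, the
"boost" / "normal" labels and the BOOST_NS constant are names invented
for the illustration.

```python
from dataclasses import dataclass

BOOST_NS = 1_000_000  # 1ms boost window, as in the experiment above


@dataclass
class Vcpu:
    credits: int
    boost_until_ns: int = 0

    def wake(self, now_ns):
        # Only vcpus that still have credits are boosted on wake-up.
        if self.credits > 0:
            self.boost_until_ns = now_ns + BOOST_NS

    def priority(self, now_ns):
        if self.credits > 0 and now_ns < self.boost_until_ns:
            return "boost"
        return "normal"


v = Vcpu(credits=500)
v.wake(now_ns=10_000_000)
print(v.priority(10_500_000))  # inside the 1ms window
print(v.priority(11_500_000))  # window has expired
```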

> there should be some way to adjust or limit usage of 'reservation' when
> multiple vcpus both claim a desire which however sum up to some
> exceeding cpu's computing power or weaken your general
> 'weight-as-basic-unit' idea?

All "reservations" on the system must add up to no more than the total
processing power of the system.  So a system with 2 cores can't have a
sum of reservations more than 200%.  Xen will check this when setting
the reservation and return an appropriate error message if necessary.
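
The check could look something like the sketch below.  The function
name set_reservation, the ReservationError exception and the dict of
per-VM percentages are all assumptions for the example; the actual
hypercall interface is not defined yet.

```python
class ReservationError(ValueError):
    """Raised when a reservation would oversubscribe the system."""


def set_reservation(reservations, vm, percent, num_cores):
    """Set vm's reservation, rejecting it if the total would exceed
    100% per core.  reservations maps VM name to percent of one cpu."""
    proposed = dict(reservations)
    proposed[vm] = percent
    limit = 100 * num_cores
    total = sum(proposed.values())
    if total > limit:
        raise ReservationError(
            f"total reservation {total}% exceeds {limit}%")
    reservations[vm] = percent


res = {"dom0": 50}
set_reservation(res, "vm1", 100, num_cores=2)      # total 150%: accepted
try:
    set_reservation(res, "vm2", 60, num_cores=2)   # total 210%: rejected
except ReservationError as err:
    print("rejected:", err)
```

Rejecting the new request leaves existing reservations untouched, which
is the simpler of the two behaviours one could choose here.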

>>* We will also have an interface to the cpu-vs-electrical power.
>>
>>This is yet to be defined.  At the hypervisor level, it will probably
>>be a number representing the "badness" of powering up extra cpus /
>>cores.  At the tools level, there will probably be the option of
>>either specifying the number, or of using one of 2/3 pre-defined
>>values {power, balance, green/battery}.
>
> Not sure how that number will be defined. Maybe we can follow
> current way to just add individual name-based options matching
> its purpose (such as migration_cost and sched_smt_power_savings...)

At the scheduler level, I was thinking along the lines of
"core_power_up_cost".  This would be comparable to the cost of having
things waiting on the runqueue.  So (for example) if the cost was 0.1,
then when the load on the current processors reached 1.1, it
would power up another core.  You could set it to 0.5 or 1.0 to save
more power (at the cost of some performance).  I think defining it
that way is the closest to what you really want: a way to define the
performance impact vs power consumption.

Obviously at the user interface level, we might have something more
manageable: e.g., {power, balance, green} => {0, 0.2, 0.8} or
something like that.
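
To make the idea concrete, here is a small sketch.  It assumes "load"
is measured in units of one fully busy core, and it generalizes the
one-core example above to "power up another core once the load exceeds
the active cores by at least the cost"; that generalization, the
function name should_power_up and the PROFILES table are assumptions,
with the profile values taken from the example mapping above.

```python
# Example mapping from a user-level setting to core_power_up_cost.
PROFILES = {"power": 0.0, "balance": 0.2, "green": 0.8}


def should_power_up(load, active_cores, core_power_up_cost):
    """True if the work waiting beyond what the active cores can run
    is judged worse than the cost of powering up one more core."""
    return load >= active_cores + core_power_up_cost


for name, cost in PROFILES.items():
    print(name, should_power_up(load=1.5, active_cores=1,
                                core_power_up_cost=cost))
```

With one active core and a load of 1.5, the "power" and "balance"
settings would bring up a second core, while "green" would keep waiting
until the load approaches 1.8.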

But as I said, the *goal* is to have a useful configurable interface;
the implementation will depend on what actually can be made to work in
practice.

 -George


* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-11 10:00           ` Tian, Kevin
@ 2009-04-15 15:47             ` George Dunlap
  0 siblings, 0 replies; 35+ messages in thread
From: George Dunlap @ 2009-04-15 15:47 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com, Ian Pratt

2009/4/11 Tian, Kevin <kevin.tian@intel.com>:
> the normal name for this is Turbo Boost. However it'd be difficult
> for software to account for extra cycles gained from overclock,
> as whether boost actually happens and how many cycles can
> be boosted are completely controlled by hardware unit.

From the context, it sounded like Jeremy was saying that if we expose
a whole socket to a guest, then the guest can try to schedule things
either to take advantage of multiple cores or to take advantage of
Turbo Boost.  (i.e., punt the Turbo Boost performance optimization to
the guest, just as we could punt the hyperthreading problem to the
guest.)

In any case, even if we can't control it, we may be able to do
some estimates (e.g., we expect this core to run at about 120%).
There will probably be some performance counters that we could use to
estimate how much "boost" a VM actually got and deal with credits
accordingly... but that's yet another level of complication.  I'll put
it in the list of things to look at, and we'll see how far we get.

 -George


* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-11  9:52       ` Tian, Kevin
@ 2009-04-15 15:56         ` George Dunlap
  2009-04-16  5:11           ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: George Dunlap @ 2009-04-15 15:56 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com

2009/4/11 Tian, Kevin <kevin.tian@intel.com>:
> The major worry to me is added complexity by exposing such sibling
> pairs to guest. You then have to schedule at core level for that VM,
> since the implication of HT should be always maintained or else
> reverse effect could be seen when VM does try to utilize that topology.
> This brings trouble to scheduler. Not all VMs are guest SMP, and
> then the VM being exposed with HT is actually treated unfair as one
> more limitation is imposed that partial idle core can't be used by it
> while other VMs is immune. Another tricky part is that you have to
> gang schedule that VM, which is in concept fancy but no one has
> come up with a solid implementation in practice.

I think gang scheduling with this limited scope (a hyper-pair to be
scheduled on a hyper-pair) should be a lot easier than the general
case.  In any case, as long as we assign and deduct credits
appropriately, a threaded VM shouldn't have a disadvantage compared to
a single-thread VM.

 -George


* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-15 13:54           ` George Dunlap
@ 2009-04-15 16:23             ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2009-04-15 16:23 UTC (permalink / raw)
  To: George Dunlap; +Cc: Ian Pratt, xen-devel@lists.xensource.com, Tian, Kevin

George Dunlap wrote:
> On Fri, Apr 10, 2009 at 6:19 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>   
>> This can probably be extended to Intel's hyper-dynamic flux mode (that may
>> not be the real marketing name), where it can overclock one core if the
>> other is idle.
>>     
>
> Jeremy,
>
> Did you mean we could expose an entire socket to a guest VM, so that
> it could schedule so as to take advantage of the effects of Turbo
> Boost, just as we can expose thread pairs to a VM and let the guest OS
> scheduler deal with threading issues?
>   

Yes, precisely.  They're the same in that Xen concurrently schedules two 
(or more?) vcpus to the guest which have interdependent performance.  
One could imagine a case where a guest with a single-threaded workload 
gets best performance by being given a thread/core pair, running their 
work on one while explicitly keeping the other idle.  Of course that 
idle core is lost to the rest of the system in the meantime, so the 
guest should get charged for both.

And some kind of small-scale gang scheduling might be useful for small 
SMP guests anyway, because their spinlocks and IPIs will work as 
expected, and they'll presumably get shared cache at some level.

    J


* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-15 15:07   ` George Dunlap
@ 2009-04-16  4:58     ` Tian, Kevin
  0 siblings, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2009-04-16  4:58 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel@lists.xensource.com


>From: George Dunlap
>Sent: April 15, 2009 23:07
>
>> Do you mean that same elapsed time in above two scenarios will be
>> translated into different credits?
>
>Yes.  Ideally, we want to give "processing power" based on weight.
>But the "processing power" of a thread whose sibling is idle is
>significantly more than the "processing power" of a thread whose
>sibling is running.  (Same thing possibly for cpu frequency scaling.)
>So we'd want to arrange the credits such that VMs with equal weight
>get equal "processing power", not just equal "time on a logical cpu".

Yup, this is one interesting part to be further explored. 

>
>> Xen3.4 now supports "sched_smt_power_savings" (both boot option
>> and touchable by xenpm) to change power/performance preference.
>> It's simple implementation to simply reverse the span order from
>> existing package->core->thread to thread->core->package. More
>> fine-grained flexibility could be given in future if hierarchical
>> scheduling concept could be more clearly constructed like domain
>> scheduler in Linux.
>
>I haven't looked at this code.  From your description here it sounds
>like a sort of a simple hack to get the effect we want (either
>spreading things out or pushing them together) -- is that correct?

yes, spread first vs. fill first.

>
>My general feeling is that hacks are good short-term solutions, but
>not long-term.  Things always get more complicated, and often have
>unexpected side-effects.  I think since we're doing scheduler work,
>it's worth it to try to see if we can actually solve the
>power/performance problem.

Agreed. Have you looked at the domain scheduler idea on the Linux side?
I'm not sure whether that topology-based multi-level scheduler would
help or overcomplicate things here.

>
>> imo, weight is not strictly translated into the care for latency. any
>> elaboration on that? I remembered that previously Nishiguchi-san
>> gave idea to boost credit, and Disheng proposed static priority.
>> Maybe you can make a summary to help people how latency would
>> be exactly ensured in your proposal
>
>All of this needs to be run through experiments.  So far, I've had
>really good success with putting waking VMs in "boost" priority for
>1ms if they still have credits.  (And unlike the credit scheduler, I
>try to make sure that a VM rarely runs out of credits.)

By the way, accurate accounting (at context switch, instead of the
current tick-based accounting) should also be incorporated if you want
to manipulate credits at a fine grain.

>
>> there should be some way to adjust or limit usage of 'reservation' when
>> multiple vcpus both claim a desire which however sum up to some
>> exceeding cpu's computing power or weaken your general
>> 'weight-as-basic-unit' idea?
>
>All "reservations" on the system must add up to less than the total
>processing power of the system.  So a system with 2 cores can't have a
>sum of reservations more than 200%.  Xen will check this when setting
>the reservation and return an appropriate error message if necessary.

Return an error, or scale previously successful reservations down?

>
>>>* We will also have an interface to the cpu-vs-electrical power.
>>>
>>>This is yet to be defined.  At the hypervisor level, it will probably
>>>be a number representing the "badness" of powering up extra cpus /
>>>cores.  At the tools level, there will probably be the option of
>>>either specifying the number, or of using one of 2/3 pre-defined
>>>values {power, balance, green/battery}.
>>
>> Not sure how that number will be defined. Maybe we can follow
>> current way to just add individual name-based options matching
>> its purpose (such as migration_cost and sched_smt_power_savings...)
>
>At the scheduler level, I was thinking along the lines of
>"core_power_up_cost".  This would be comparable to the cost of having
>things waiting on the runqueue.  So (for example) if the cost was 0.1,

Who decides what the cost should be? How would this be easy for an
end customer to use?

>then when the load on the current processors reached 1.1, then it
>would power up another core.  You could set it to 0.5 or 1.0 to save

What do you mean by 'power up'? Boost its frequency, or migrate load
to that core?

>more power (at the cost of some performance).  I think defining it
>that way is the closest to what you really want: a way to define the
>performance impact vs power consumption.

I'm still a bit confused here. What quantity, and in which situation, is
translated into a value comparable to "core_power_up_cost"?

>
>Obviously at the user interface level, we might have something more
>manageable: e.g., {power, balance, green} => {0, 0.2, 0.8} or
>something like that.

Then how is this triple mapped to the above "core_power_up_cost"?

>
>But as I said, the *goal* is to have a useful configurable interface;
>the implementation will depend on what actually can be made to work in
>practice.
>

I agree with this goal, but I'm not convinced by the above example. :-)

Thanks
Kevin


* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-15 15:56         ` George Dunlap
@ 2009-04-16  5:11           ` Tian, Kevin
  2009-04-16 10:27             ` George Dunlap
  0 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2009-04-16  5:11 UTC (permalink / raw)
  To: George Dunlap; +Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com


>From: George Dunlap
>Sent: April 15, 2009 23:56
>
>2009/4/11 Tian, Kevin <kevin.tian@intel.com>:
>> The major worry to me is added complexity by exposing such sibling
>> pairs to guest. You then have to schedule at core level for that VM,
>> since the implication of HT should be always maintained or else
>> reverse effect could be seen when VM does try to utilize that topology.
>> This brings trouble to scheduler. Not all VMs are guest SMP, and
>> then the VM being exposed with HT is actually treated unfair as one
>> more limitation is imposed that partial idle core can't be used by it
>> while other VMs is immune. Another tricky part is that you have to
>> gang schedule that VM, which is in concept fancy but no one has
>> come up with a solid implementation in practice.
>
>I think gang scheduling with this limited scope (a hyper-pair to be
>scheduled on a hyper-pair) should be a lot easier than the general
>case.  In any case, as long as we assign and deduct credits
>appropriately, a threaded VM shouldn't have a disadvantage compared to
>a single-thread VM.
>

Could you elaborate on what is being simplified compared to generic
gang scheduling? I have been scared off by the complexity of having
multiple vcpus sync in and sync out, especially alongside other
heterogeneous VMs (without a gang scheduling requirement). It's possibly
simpler if all VMs in the system are hyper-pair based... :-)

Thanks
Kevin


* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-16  5:11           ` Tian, Kevin
@ 2009-04-16 10:27             ` George Dunlap
  2009-04-16 14:10               ` Dan Magenheimer
  2009-04-17 10:02               ` Tian, Kevin
  0 siblings, 2 replies; 35+ messages in thread
From: George Dunlap @ 2009-04-16 10:27 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com

2009/4/16 Tian, Kevin <kevin.tian@intel.com>:
> Could you elaborate more about what's being simplified compared to
> generic gang scheduling? I used to be scared by complexity to have
> multiple vcpus sync in and sync out, especially with other
> heterogeneous VMs (w/o gang scheduling requirement). It's possibly
> simpler if all VMs in that system are hyper-pair based... :-)

(I've only done some thought experiments re gang scheduling, so take
that into account as you read this description.)

The general gang scheduling problem is that you have P processors and
N VMs, each of which has up to M vcpus.  Each VM may have a
different number of vcpus (at most P), and at any one time a
different number of those vcpus may be runnable (runnable <= vcpus <=
P); this may change at any time.  The general problem is to maximize
the number of the P processors in use (and thus maximize the throughput).
If you have a VM with 4 vcpus, but it's only using 2, you have a
choice: do I run it on 2 cores, and let another VM use the other 2
cores?  Or do I "reserve" 4 cores for it, so that if the other 2 vcpus
wake up they can run immediately?  If you do the former, then if one
of the other vcpus wakes up you have to quickly preempt someone else;
if you do the latter, you risk leaving the two cores idle for the entire
timeslice, effectively throwing away the processing power.  The whole
problem is likely to be NP-complete, and really obnoxious to find good
heuristics for.

In the case of HT gang-scheduling, the problem is significantly constrained:
* The pairs of processors to look at are constrained: each logical cpu
has a pre-defined sibling.
* The quantity we're trying to gang-schedule is significantly
constrained: only 2; and each gang-scheduled vcpu has its pre-defined
HT sibling as well.
* If only one sibling of a HT pair is active, the other one isn't
wasted; the active thread will get more processing power.  So we don't
have that risky choice.

So there are really only a handful of new cases to consider:
* A HT vcpu pair wakes up or comes to the front of the queue when
another HT vcpu pair is running.
 + Simple: order normally.  If this pair is a higher priority
(whatever that means) than the running pair, preempt the running pair
(i.e., preempt the vcpus on both logical cpus).
* A non-HT vcpu becomes runnable (or comes to the front of the
runqueue) when a HT vcpu pair is on a pair of threads
 + If the non-HT vcpu priority is lower, wait until the HT vcpu pair
is finished.
 + If the non-HT vcpu priority is higher, we need to decide whether to
wait longer or whether to preempt both threads.  This may depend on
whether there are other non-HT vcpus waiting to run, and what their
priority is.
* A HT vcpu pair becomes runnable when non-HT vcpus are running on the threads
 + Deciding whether to wait or preempt both threads will depend on the
relative weights of both.

These decisions are a lot easier to deal with than the full
gang-scheduling problem (AFAICT).  One can imagine, for instance, that
each HT pair could share one runqueue.  A vcpu HT pair would be put on
the runqueue as an individual entity.  When it reached the top of the
queue, both threads would preempt what was on them and run the pair of
threads (or idle if the vcpu was idle).  Otherwise, each thread would
take a single non-pair vcpu and execute it.
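
A toy version of that shared-runqueue idea, to show how little state is
involved.  Everything here is an assumption made for the sketch: the
Entity type, the pick_next name, strict FIFO ordering in place of
credit-based priority, and the choice to let the second thread skip
over a queued pair to find a single vcpu.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Entity:
    """One runqueue entry: a single vcpu or a gang-scheduled HT pair."""
    vcpus: tuple

    @property
    def is_pair(self):
        return len(self.vcpus) == 2


def pick_next(runqueue):
    """Choose what the two hardware threads of one core run next.

    Returns a (thread0, thread1) tuple; None means the thread idles.
    """
    if not runqueue:
        return (None, None)
    head = runqueue.popleft()
    if head.is_pair:
        # The pair is one entity and takes both threads together.
        return head.vcpus
    # Head is a single vcpu: the sibling thread takes the next single
    # vcpu in the queue, if there is one.
    idx = next((i for i, e in enumerate(runqueue) if not e.is_pair), None)
    second = None
    if idx is not None:
        second = runqueue[idx].vcpus[0]
        del runqueue[idx]
    return (head.vcpus[0], second)


rq = deque([Entity(("a0", "a1")), Entity(("c",)),
            Entity(("b0", "b1")), Entity(("d",))])
while rq:
    print(pick_next(rq))
```

Letting single vcpus jump ahead of a queued pair keeps both threads
busy, but a real scheduler would need some bound on how long a pair can
be passed over.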

At any rate, that's a step further down the road.  First we need to
address basic credits, latency, and load-balancing. (Implementation
e-mails to come in a bit.)

 -George


* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-16 10:27             ` George Dunlap
@ 2009-04-16 14:10               ` Dan Magenheimer
  2009-04-16 16:32                 ` Jeremy Fitzhardinge
  2009-04-17 10:17                 ` George Dunlap
  2009-04-17 10:02               ` Tian, Kevin
  1 sibling, 2 replies; 35+ messages in thread
From: Dan Magenheimer @ 2009-04-16 14:10 UTC (permalink / raw)
  To: George Dunlap, Tian, Kevin; +Cc: Jeremy Fitzhardinge, xen-devel

From a resource utilization perspective, hyper-pairing may
make sense.  But what about the user perspective?  How would
an administrator specify hyper-pairing?  And more importantly
why?  When consolidating workloads from, say, a group
of dual-core or dual-processor servers onto some future
larger hyperthreaded server, why would anyone say
"please assign this to a hyper-pair", which is essentially
saying "give me less peak performance than I had before"?

Also, in the analysis below, the problem is greatly
simplified because today's (x86) processors are limited
to two hyperthreads.  How soon will we see more threads
per core, given that other non-x86 CPUs already support
four or more?

> -----Original Message-----
> From: George Dunlap [mailto:George.Dunlap@eu.citrix.com]
> Sent: Thursday, April 16, 2009 4:28 AM
> To: Tian, Kevin
> Cc: Jeremy Fitzhardinge; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals
> and interface.
> 
> 
> 2009/4/16 Tian, Kevin <kevin.tian@intel.com>:
> > Could you elaborate more about what's being simplified compared to
> > generic gang scheduling? I used to be scared by complexity to have
> > multiple vcpus sync in and sync out, especially with other
> > heterogeneous VMs (w/o gang scheduling requirement). It's possibly
> > simpler if all VMs in that system are hyper-pair based... :-)
> 
> (I've only done some thought experiments re gang scheduling, so take
> that into account as you read this description.)
> 
> The general gang scheduling problem is that you have P processors, and
> N vms, each of which have up to M processors.  Each vm may have a
> different number of processors N.vcpu < P, and at any one time there
> may be a different number of processors runnable N.runnable < N.vcpu <
> P; this may change at any time.  The general problem is to maximize
> the number of used processors P (and thus maximize the throughput).
> If you have a VM with 4 vcpus, but it's only using 2, you have a
> choice: do I run it on 2 cores, and let another VM use the other 2
> cores?  Or do I "reserve" 4 cores for it, so that if the other 2 vcpus
> wake up they can run immediately?  If you do the former, then if one
> of the other vcpus wakes up you have to quickly preempt someone else;
> if not, you risk leaving the two cores idle for the entire timeslice,
> effectively throwing away the processing power.  The whole problem is
> likely to be NP-complete, and really obnoxious to have good heuristics
> for.
> 
> In the case of HT gang-scheduling, the problem is 
> significantly constrained:
> * The pairs of processors to look at are constrained: each logical cpu
> has a pre-defined sibling.
> * The quantity we're trying to gang-schedule is significantly
> constrained: only 2; and each gang-scheduled vcpu has its pre-defined
> HT sibling as well.
> * If only one sibling of a HT pair is active, the other one isn't
> wasted; the active thread will get more processing power.  So we don't
> have that risky choice.
> 
> So there are really only a handful of new cases to consider:
> * A HT vcpu pair wakes up or comes to the front of the queue when
> another HT vcpu pair is running.
>  + Simple: order normally.  If this pair is a higher priority
> (whatever that means) than the running pair, preempt the running pair
> (i.e., preempt the vcpus on both logical cpus).
> * A non-HT vcpu becomes runnable (or comes to the front of the
> runqueue) when a HT vcpu pair is on a pair of threads
>  + If the non-HT vcpu priority is lower, wait until the HT vcpu pair
> is finished.
>  + If the non-HT vcpu priority is higher, we need to decide whether to
> wait longer or whether to preempt both threads.  This may depend on
> whether there are other non-HT vcpus waiting to run, and what their
> priority is.
> * A HT vcpu pair becomes runnable when non-HT vcpus are 
> running on the threads
>  + Deciding whether to wait or preempt both threads will depend on the
> relative weights of both.
> 
> These decisions are a lot easier to deal with than the full
> gang-scheduling problem (AFAICT).  One can imagine, for instance, that
> each HT pair could share one runqueue.  A vcpu HT pair would be put on
> the runqueue as an individual entity.  When it reached the top of the
> queue, both threads would preempt what was on them and run the pair of
> threads (or idle if the vcpu was idle).  Otherwise, each thread would
> take a single non-pair vcpu and execute it.
> 
> At any rate, that's a step further down the road.  First we need to
> address basic credits, latency, and load-balancing. (Implementation
> e-mails to come in a bit.)
> 
>  -George

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-16 14:10               ` Dan Magenheimer
@ 2009-04-16 16:32                 ` Jeremy Fitzhardinge
  2009-04-16 18:20                   ` Andrew Lyon
  2009-04-17 10:17                 ` George Dunlap
  1 sibling, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2009-04-16 16:32 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: George Dunlap, Tian, Kevin, xen-devel

Dan Magenheimer wrote:
> From a resource utilization perspective, hyper-pairing may
> make sense.  But what about the user perspective?  How would
> an administrator specify hyper-pairing?  And more importantly
> why?  When consolidating workloads from, say, a group
> of dual-core or dual-processor servers onto some future
> larger hyperthreaded server, why would anyone say
> "please assign this to a hyper-pair", which is essentially
> saying "give me less peak performance than I had before"?
>   

I don't see how it makes a difference.  At the moment, you're never sure 
if a pair of vcpus are HT thread pairs, two cores on the same socket, or 
on completely different sockets - all of which will have quite different 
performance characteristics.  And unless your server is under-committed, 
you're always running the risk that one domain is competing with another 
for CPU when it needs it most - and if you're under-committed, you can 
always pin everything in exactly the config you want.

Besides, the chances are good that the single-threaded performance of 
each core on your shiny new server will be fast enough to overcome the 
cost of HT compared to your old server...

> Also, in the analysis below, the problem is greatly
> simplified because today's (x86) processors are limited
> to two hyperthreads.  How soon will we see more threads
> per core, given that other non-x86 CPUs already support
> four or more?
>   

I think the simplifying factor is that the number of threads/cores 
you're ganging together is a relatively small proportion of the total 
number of available threads/cores, so the problem is under-constrained 
and there are lots of nearly-optimal solutions.  If you're trying to 
gang schedule a large proportion of your total resources, then you get 
into tricky boxpacking territory.

    J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-16 16:32                 ` Jeremy Fitzhardinge
@ 2009-04-16 18:20                   ` Andrew Lyon
  2009-04-16 18:28                     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Lyon @ 2009-04-16 18:20 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: George Dunlap, Dan Magenheimer, xen-devel, Tian, Kevin

On Thu, Apr 16, 2009 at 5:32 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Dan Magenheimer wrote:
>>
>> From a resource utilization perspective, hyper-pairing may
>> make sense.  But what about the user perspective?  How would
>> an administrator specify hyper-pairing?  And more importantly
>> why?  When consolidating workloads from, say, a group
>> of dual-core or dual-processor servers onto some future
>> larger hyperthreaded server, why would anyone say
>> "please assign this to a hyper-pair", which is essentially
>> saying "give me less peak performance than I had before"?
>>
>
> I don't see how it makes a difference.  At the moment, you're never sure if
> a pair of vcpus are HT thread pairs, two cores on the same socket, or on
> completely different sockets - all of which will have quite different
> performance characteristics.  And unless your server is under-committed,
> you're always running the risk that one domain is competing with another for
> CPU when it needs it most - and if you're under-committed, you can always
> pin everything in exactly the config you want.
>
> Besides, the chances are good that the single-threaded performance of each
> core on your shiny new server will be fast enough to overcome the cost of HT
> compared to your old server...

Is HT particularly worthwhile for virtualization loads?  We have
several older servers which have HT, and I found that when running
Windows Terminal Services it actually slowed the machine down.  Under
certain circumstances it seemed to cause the system to become
extremely slow, to the point that it had to be rebooted; we disabled
HT and the problem went away.

Many documents on the web recommend disabling HT for specific
workloads, and most benchmarks show that when it is beneficial the
performance gain is quite small.

Or is your plan to make use of ht in a way that gets the most benefit
with no impact under edge cases?

Andy


>
>> Also, in the analysis below, the problem is greatly
>> simplified because today's (x86) processors are limited
>> to two hyperthreads.  How soon will we see more threads
>> per core, given that other non-x86 CPUs already support
>> four or more?
>>
>
> I think the simplifying factor is that the number of threads/cores you're
> ganging together is a relatively small proportion of the total number of
> available threads/cores, so the problem is under-constrained and there are
> lots of nearly-optimal solutions.  If you're trying to gang schedule a large
> proportion of your total resources, then you get into tricky boxpacking
> territory.
>
>   J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-16 18:20                   ` Andrew Lyon
@ 2009-04-16 18:28                     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2009-04-16 18:28 UTC (permalink / raw)
  To: Andrew Lyon; +Cc: George Dunlap, Dan Magenheimer, xen-devel, Tian, Kevin

Andrew Lyon wrote:
> Is HT particularly worthwhile for virtualization loads? we have
> several older servers which have ht and I found that when running
> windows terminal services it actually slowed the machine down, and
> under certain circumstances it seemed to cause the system to become
> extremely slow and had to be rebooted, we disabled ht and the problem
> went away.
>
> Many documents on the web recommend disabling ht for specific
> workloads, and most benchmarks show that when it is beneficial the
> performance gain is quite small.
>
> Or is your plan to make use of ht in a way that gets the most benefit
> with no impact under edge cases?
>   

"New" HT, as reintroduced in i7, is supposed to be "much better".

    J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-16 10:27             ` George Dunlap
  2009-04-16 14:10               ` Dan Magenheimer
@ 2009-04-17 10:02               ` Tian, Kevin
  1 sibling, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2009-04-17 10:02 UTC (permalink / raw)
  To: George Dunlap; +Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com

>From: George Dunlap
>Sent: April 16, 2009 18:28
>
>2009/4/16 Tian, Kevin <kevin.tian@intel.com>:
>> Could you elaborate more about what's being simplified compared to
>> generic gang scheduling? I used to be scared by complexity to have
>> multiple vcpus sync in and sync out, especially with other
>> heterogeneous VMs (w/o gang scheduling requirement). It's possibly
>> simpler if all VMs in that system are hyper-pair based... :-)
>
>(I've only done some thought experiments re gang scheduling, so take
>that into account as you read this description.)

Thanks for writing this down.

>
>The general gang scheduling problem is that you have P processors, and
>N vms, each of which have up to M processors.  Each vm may have a
>different number of processors N.vcpu < P, and at any one time there
>may be a different number of processors runnable N.runnable < N.vcpu <
>P; this may change at any time.  The general problem is to maximize
>the number of used processors P (and thus maximize the throughput).
>If you have a VM with 4 vcpus, but it's only using 2, you have a
>choice: do I run it on 2 cores, and let another VM use the other 2
>cores?  Or do I "reserve" 4 cores for it, so that if the other 2 vcpus
>wake up they can run immediately?  If you do the former, then if one
>of the other vcpus wakes up you have to quickly preempt someone else;
>if not, you risk leaving the two cores idle for the entire timeslice,
>effectively throwing away the processing power.  The whole problem is
>likely to be NP-complete, and really obnoxious to have good heuristics
>for.

This is related to the definition of gang scheduling. As I remember
it, gang scheduling comes from parallel computing, where there is
massive inter-thread communication/synchronization and one blocked
thread also impacts the rest. That normally requires a 'gang'
concept: once one thread is to be scheduled, all other threads in the
same group are scheduled together; on the other hand, context
switches are minimized to keep threads always in a ready state. This
idea is very application specific and thus has not normally found its
way into the general market.

Based on your whole write-up, you don't consider how to ensure that
the vcpus of one VM are scheduled in sync at a given point in time.
Instead, your model is to ensure exclusive use of a thread pair within
a given quantum window. That has nothing to do with 'gang scheduling',
but it's OK for us to call it 'simplified gang scheduling' in this
context, since the key point of this discussion is whether we can do
some optimization for HT, not gang scheduling itself, strictly
speaking. :-)

>
>In the case of HT gang-scheduling, the problem is 
>significantly constrained:
>* The pairs of processors to look at are constrained: each logical cpu
>has a pre-defined sibling.
>* The quantity we're trying to gang-schedule is significantly
>constrained: only 2; and each gang-scheduled vcpu has its pre-defined
>HT sibling as well.

Then I assume only VMs with <=2 vcpus are considered here

>* If only one sibling of a HT pair is active, the other one isn't
>wasted; the active thread will get more processing power.  So we don't
>have that risky choice.

May or may not. Bear in mind that HT was introduced because one
thread normally includes lots of stall cycles, and those stalled
cycles can be used to drive another thread. Whether more processing
power is gained is very workload specific.

>
>So there are really only a handful of new cases to consider:
>* A HT vcpu pair wakes up or comes to the front of the queue when
>another HT vcpu pair is running.
> + Simple: order normally.  If this pair is a higher priority
>(whatever that means) than the running pair, preempt the running pair
>(i.e., preempt the vcpus on both logical cpus).

"vcpu pair"? Your descriptions below are all based on a vcpu pair. How
do you define the status (blocked, runnable, running) of a pair? If
it's the AND of the individual statuses of the two vcpus, that is too
restrictive a condition for being runnable. If it's the OR, you then
need to consider how to define a sane priority (is
one-vcpu-ready-the-other-blocked with higher credits considered higher
priority than two-vcpus-ready with lower credits? etc.)
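
To make the question concrete, here is a minimal sketch of the two
definitions. The State enum, the function names and the credit values
are all made up for illustration; neither priority rule is something
the credit2 prototype is known to use.

```python
from enum import Enum


class State(Enum):
    BLOCKED = "blocked"
    RUNNABLE = "runnable"


def pair_runnable_and(a, b):
    """Strict definition: the pair is runnable only if both vcpus are."""
    return a is State.RUNNABLE and b is State.RUNNABLE


def pair_runnable_or(a, b):
    """Loose definition: the pair is runnable if either vcpu is."""
    return a is State.RUNNABLE or b is State.RUNNABLE


def priority_by_credit(vcpus):
    """Hypothetical rule 1: highest credit among the runnable vcpus.

    Assumes at least one vcpu in the pair is runnable.
    """
    return max(c for s, c in vcpus if s is State.RUNNABLE)


def priority_by_readiness(vcpus):
    """Hypothetical rule 2: number of runnable vcpus first, then credit.

    Assumes at least one vcpu in the pair is runnable.
    """
    ready = [c for s, c in vcpus if s is State.RUNNABLE]
    return (len(ready), max(ready))


# Pair X: one vcpu ready with high credit, the other blocked.
pair_x = [(State.RUNNABLE, 300), (State.BLOCKED, 300)]
# Pair Y: both vcpus ready, but with lower credit.
pair_y = [(State.RUNNABLE, 100), (State.RUNNABLE, 100)]

# Under AND, pair X is not runnable while one vcpu is blocked;
# under OR it is.
x_states = [s for s, _ in pair_x]
print(pair_runnable_and(*x_states), pair_runnable_or(*x_states))

# Under OR, the two priority rules disagree about which pair goes first.
print(priority_by_credit(pair_x) > priority_by_credit(pair_y))
print(priority_by_readiness(pair_x) > priority_by_readiness(pair_y))
```

The point of the sketch is only that the OR definition forces a choice
between rules like these, and the choice changes the ordering.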

>* A non-HT vcpu becomes runnable (or comes to the front of the
>runqueue) when a HT vcpu pair is on a pair of threads
> + If the non-HT vcpu priority is lower, wait until the HT vcpu pair
>is finished.
> + If the non-HT vcpu priority is higher, we need to decide whether to
>wait longer or whether to preempt both threads.  This may depend on
>whether there are other non-HT vcpus waiting to run, and what their
>priority is.
>* A HT vcpu pair becomes runnable when non-HT vcpus are 
>running on the threads
> + Deciding whether to wait or preempt both threads will depend on the
>relative weights of both.

All of the above looks like it adds much complexity to a common
single-vcpu-based scheduler.

>
>These decisions are a lot easier to deal with than the full
>gang-scheduling problem (AFAICT).  One can imagine, for instance, that
>each HT pair could share one runqueue.  A vcpu HT pair would be put on
>the runqueue as an individual entity.  When it reached the top of the
>queue, both threads would preempt what was on them and run the pair of
>threads (or idle if the vcpu was idle).  Otherwise, each thread would
>take a single non-pair vcpu and execute it.

That may bring undesired effects. The more cross-cpu communication you
add to the scheduler, the less efficient the scheduler could be, due
to locks on both runqueues.
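
For reference, the shared-runqueue idea quoted above could look
roughly like the sketch below. Everything in it (the class, the method
names, the priority numbers, and the choice to let the second thread
skip over queued pairs) is hypothetical and only meant to show the
shape of the idea. It takes no locks at all, so it says nothing about
the contention cost.

```python
import heapq
import itertools


class PairRunqueue:
    """One runqueue shared by the two sibling threads of a core."""

    def __init__(self):
        self._heap = []
        # Sequence number keeps FIFO order among equal priorities and
        # guarantees heap entries never compare beyond this field.
        self._seq = itertools.count()

    def enqueue(self, name, priority, is_pair):
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), name, is_pair))

    def _pop_next_single(self):
        """Pop the best queued non-pair vcpu, leaving pairs in place."""
        skipped = []
        found = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            if not entry[3]:
                found = entry[2]
                break
            skipped.append(entry)
        for entry in skipped:
            heapq.heappush(self._heap, entry)
        return found

    def pick(self):
        """Return (thread0, thread1): what each sibling thread runs next."""
        if not self._heap:
            return (None, None)  # both threads idle
        _, _, name, is_pair = heapq.heappop(self._heap)
        if is_pair:
            # A pair at the head takes both sibling threads at once.
            return (name + ".0", name + ".1")
        # Otherwise each thread takes a single non-pair vcpu.
        return (name, self._pop_next_single())


rq = PairRunqueue()
rq.enqueue("a", 5, is_pair=False)
rq.enqueue("p", 10, is_pair=True)
rq.enqueue("b", 3, is_pair=False)
print(rq.pick())  # the pair goes first and occupies both threads
print(rq.pick())  # then the two single vcpus share the core
```

Even in this toy form, both threads have to agree on one queue head,
which is where the extra synchronization would come from.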

>
>At any rate, that's a step further down the road.  First we need to
>address basic credits, latency, and load-balancing. (Implementation
>e-mails to come in a bit.)
>

Yup. The first implementation should take generality, simplicity
and overall efficiency as major goals.

Last, any effort toward better HT efficiency is welcome to me. However,
we need to be careful to avoid introducing unnecessary complexity into
the scheduler. I've learned that the more complex the scheduler is (or
the more flexible its options), the more unexpected side-effects may
occur in other areas. So far I'm not convinced that the above idea
could lead to a fairer system, either for HT vcpu pairs or for non-HT
vcpus. I agree with Ian's idea that topology is exposed in the
partitioned case, where the scheduler doesn't require change, and for
the general case we just need HT accounting to ensure fairness, which
is a simple enhancement. :-)

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-16 14:10               ` Dan Magenheimer
  2009-04-16 16:32                 ` Jeremy Fitzhardinge
@ 2009-04-17 10:17                 ` George Dunlap
  2009-04-17 14:13                   ` Dan Magenheimer
  1 sibling, 1 reply; 35+ messages in thread
From: George Dunlap @ 2009-04-17 10:17 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Tian, Kevin, xen-devel, Jeremy Fitzhardinge

On Thu, Apr 16, 2009 at 3:10 PM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> From a resource utilization perspective, hyper-pairing may
> make sense.  But what about the user perspective?  How would
> an administrator specify hyper-pairing?  And more importantly
> why?  When consolidating workloads from, say, a group
> of dual-core or dual-processor servers onto some future
> larger hyperthreaded server, why would anyone say
> "please assign this to a hyper-pair", which is essentially
> saying "give me less peak performance than I had before"?

I think what you're saying is that when we only expose vcpus to the
guest, we can either run 2 vcpus on HT pairs, or give them an entire
core to themselves; but if we expose them as HT pairs and gang
schedule, then we're promising only to run them on HT pairs, limiting
the peak performance.

Hmm, I'm not sure that's actually true.  We could, if we had a
particularly idle system, split HT pairs and let them run as
independent vcpus.  I'm pretty sure the resulting throughput would
usually be higher.

In any case, it's a bit like asking, "Why would I buy a machine with
two hyperthreads instead of two cores?"  Yes, going from 2 vcpus to 2
vhts (virtual hyperthreads) is a step down in computing power; so
would going from a dual-core processor w/o HT to a single-core
processor with HT.  If you want to monotonically increase power, give
it 4 vhts.

At any rate, I think we can bring these up again when we actually
start to implement this feature.  First things first. :-)

 -George

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-17 10:17                 ` George Dunlap
@ 2009-04-17 14:13                   ` Dan Magenheimer
  2009-04-17 14:55                     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Magenheimer @ 2009-04-17 14:13 UTC (permalink / raw)
  To: George Dunlap; +Cc: Tian, Kevin, xen-devel, Jeremy Fitzhardinge

> In any case, it's a bit like asking, "Why would I buy
> a machine with two hyperthreads instead of two cores?"

Yes.  In a physical machine, the OS takes advantage of all
resources available.  So it doesn't matter if some of the
"processors" are cores and some are hyperthreads.  You
are using ALL of the CPU resources you paid for.

But in a virtualized environment, each VM gets a fraction
of the resources and if grabbing some fixed number of
"processors" sometimes gets hyperthreads and sometimes
gets cores, this will cause interesting issues for some
workloads.

Think about a cloud where one pays for resources used.
You likely would demand to pay less for a hyperpair than
a non-vht pair.

As a result, I think it will be a requirement that
a system administrator be able to specify "I want two
FULL cores" vs "I am willing to accept two hyperthreads".
And once you get beyond hyperpairs, this is going to
get very messy.

> At any rate, I think we can bring these up again when
> we actually start to implement this feature.  First
> things first. :-)

Well yes and no.  While I am a big fan of iterative
prototyping, I wonder if this might be a case where
the architecture should drive the design and the
design should drive the implementation.  IOW, first
think through what choices a system admin should
be able to make (and, if given a mind-boggling array
of choices, which ones he/she WOULD make).

Just my two cents...

> -----Original Message-----
> From: George Dunlap [mailto:George.Dunlap@eu.citrix.com]
> Sent: Friday, April 17, 2009 4:17 AM
> To: Dan Magenheimer
> Cc: Tian, Kevin; Jeremy Fitzhardinge; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] [RFC] Scheduler work, part 1: 
> High-level goals
> and interface.
> 
> 
> On Thu, Apr 16, 2009 at 3:10 PM, Dan Magenheimer
> <dan.magenheimer@oracle.com> wrote:
> > From a resource utilization perspective, hyper-pairing may
> > make sense.  But what about the user perspective?  How would
> > an administrator specify hyper-pairing?  And more importantly
> > why?  When consolidating workloads from, say, a group
> > of dual-core or dual-processor servers onto some future
> > larger hyperthreaded server, why would anyone say
> > "please assign this to a hyper-pair", which is essentially
> > saying "give me less peak performance than I had before"?
> 
> I think what you're saying is that when we only expose vcpus to the
> guest, we can either run 2 vcpus on HT pairs, or give them an entire
> core to themselves; but if we expose them as HT pairs and gang
> schedule, then we're promising only to run them on HT pairs, limiting
> the peak performance.
> 
> Hmm, I'm not sure that's actually true.  We could, if we had a
> particularly idle system, split HT pairs and let them run as
> independent vcpus.  I'm pretty sure the resulting throughput would be
> usually higher.
> 
> In any case, it's a bit like asking, "Why would I buy a machine with
> two hyperthreads instead of two cores?"  Yes, going from 2 vcpus to 2
> vhts (virtual hyperthreads) is a step down in computing power; so
> would going from a dual-core processor w/o HT to a single-core
> processor with HT.  If you want to monotonically increase power, give
> it 4 vhts.
> 
> At any rate, I think we can bring these up again when we actually
> start to implement this feature.  First things first. :-)
> 
>  -George
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-17 14:13                   ` Dan Magenheimer
@ 2009-04-17 14:55                     ` Jeremy Fitzhardinge
  2009-04-17 15:55                       ` Dan Magenheimer
  0 siblings, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2009-04-17 14:55 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: George Dunlap, Tian, Kevin, xen-devel

Dan Magenheimer wrote:
>> In any case, it's a bit like asking, "Why would I buy
>> a machine with two hyperthreads instead of two cores?"
>>     
>
> Yes.  In a physical machine, the OS takes advantage of all
> resources available.  So it doesn't matter if some of the
> "processors" are cores and some are hyperthreads.  You
> are using ALL of the CPU resources you paid for.
>
> But in a virtualized environment, each VM gets a fraction
> of the resources and if grabbing some fixed number of
> "processors" sometimes gets hyperthreads and sometimes
> gets cores, this will cause interesting issues for some
> workloads.
>
> Think about a cloud where one pays for resources used.
> You likely would demand to pay less for a hyperpair than
> a non-vht pair.
>
> As a result, I think it will be a requirement that
> a system administrator be able to specify "I want two
> FULL cores" vs "I am willing to accept two hyperthreads".
> And once you get beyond hyperpairs, this is going to
> get very messy.
>   

I think you're over-complicating it.  At very worst, it will be no worse 
than the current situation where Xen will place the vcpus on 
threads/cores in more or less arbitrary ways.

I think George's proposal can already accommodate the user needs you're 
talking about:

If the scheduler accounts for time spent executing on a contended HT 
thread (ie, the threads are not paired, so the other thread could be 
idle or running any other code) at a lesser rate than a full 
core/uncontended thread, then the charging works out.

If the user has a requirement that domain X's vcpus must be running at 
full speed, then they can set their reservation to 100%.  If we say that 
a contended HT thread is only worth, say, 70% of a "real" core, then 
that not only factors into the charging, but also means that any domain 
with a reservation > 70% is ineligible to run on a contended HT thread.  
(I think in practise this means that any domain with high reservations 
will end up running on gang scheduled thread pairs, just to guarantee 
that the other thread is idle, so the uncontended HT thread can run at 
100%.)

(Another way to look at it is that HT contention is a bit like having 
your vcpu being preempted by Xen, but rather than going from 100% 
running to 0% running, your vcpu drops to 70%.)
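
As a rough sketch of that accounting (the 0.70 factor, the function
names and the millisecond units are placeholders made up for the
example, not anything Xen implements):

```python
# Illustrative value only: how much a contended HT thread is assumed to
# be worth relative to a full core / uncontended thread.
CONTENDED_HT_WORTH = 0.70


def charge(run_time_ms, sibling_busy, contended_worth=CONTENDED_HT_WORTH):
    """Return the CPU time to bill for one stretch of execution.

    Time on a contended thread (the sibling was running something else)
    is billed at the reduced rate; time on an uncontended thread or a
    full core is billed in full.
    """
    rate = contended_worth if sibling_busy else 1.0
    return run_time_ms * rate


def may_run_contended(reservation, contended_worth=CONTENDED_HT_WORTH):
    """A domain whose reservation exceeds what a contended thread can
    deliver is not eligible to be placed on one."""
    return reservation <= contended_worth


# A vcpu that ran 10ms next to a busy sibling is billed less than one
# that had the core to itself for the same 10ms.
print(charge(10, sibling_busy=True), charge(10, sibling_busy=False))

# A domain with a 100% reservation must not share a core; a domain with
# a 50% reservation may.
print(may_run_contended(1.0), may_run_contended(0.5))
```

Whatever the real factor turns out to be, the same number feeds both
the charging and the placement decision.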

    J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-17 14:55                     ` Jeremy Fitzhardinge
@ 2009-04-17 15:55                       ` Dan Magenheimer
  2009-04-17 16:17                         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Magenheimer @ 2009-04-17 15:55 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: George Dunlap, Tian, Kevin, xen-devel

> I think you're over-complicating it.

Perhaps.  Or maybe you are oversimplifying it? ;-)

> At very worst, it will 
> be no worse 
> than the current situation where Xen will place the vcpus on 
> threads/cores in more or less arbitrary ways.

Agreed.  Treating threads as cores is bad.  Since that's
what's happening today, one would think that any fix is
better than nothing.

> a contended HT thread is only worth, say, 70% of a
> "real" core
> :
> (Another way to look at it is that HT contention is a bit like having 
> your vcpu being preempted by Xen, but rather than going from 100% 
> running to 0% running, your vcpu drops to 70%.)

And that's the oversimplification I think.  Just
because Intel provides a rule-of-thumb that the extra
thread increases performance by 30% doesn't mean that
it is a good number to choose for scheduling purposes.

I suspect (and maybe this has even already been proven)
that this varies from 0%-100% depending on the workload,
and may even vary from *negative* to *more* than 100%.
(Yes, I understand that i7 is supposed to be better than
the last round of HT... but is it always better?)

Dan

> -----Original Message-----
> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
> Sent: Friday, April 17, 2009 8:56 AM
> To: Dan Magenheimer
> Cc: George Dunlap; Tian, Kevin; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] [RFC] Scheduler work, part 1: 
> High-level goals
> and interface.
> 
> 
> Dan Magenheimer wrote:
> >> In any case, it's a bit like asking, "Why would I buy
> >> a machine with two hyperthreads instead of two cores?"
> >>     
> >
> > Yes.  In a physical machine, the OS takes advantage of all
> > resources available.  So it doesn't matter if some of the
> > "processors" are cores and some are hyperthreads.  You
> > are using ALL of the CPU resources you paid for.
> >
> > But in a virtualized environment, each VM gets a fraction
> > of the resources and if grabbing some fixed number of
> > "processors" sometimes gets hyperthreads and sometimes
> > gets cores, this will cause interesting issues for some
> > workloads.
> >
> > Think about a cloud where one pays for resources used.
> > You likely would demand to pay less for a hyperpair than
> > a non-vht pair.
> >
> > As a result, I think it will be a requirement that
> > a system administrator be able to specify "I want two
> > FULL cores" vs "I am willing to accept two hyperthreads".
> > And once you get beyond hyperpairs, this is going to
> > get very messy.
> >   
> 
> I think you're over-complicating it.  At very worst, it will 
> be no worse 
> than the current situation where Xen will place the vcpus on 
> threads/cores in more or less arbitrary ways.
> 
> I think George's proposal can already accommodate the user 
> needs you're 
> talking about:
> 
> If the scheduler accounts for time spent executing on a contended HT 
> thread (ie, the threads are not paired, so the other thread could be 
> idle or running any other code) at a lesser rate than a full 
> core/uncontended thread, then the charging works out.
> 
> If the user has a requirement that domain X's vcpus must be 
> running at 
> full speed, then they can set their reservation to 100%.  If 
> we say that 
> a contended HT thread is only worth, say, 70% of a "real" core, then 
> that not only factors into the charging, but also means that 
> any domain 
> with a reservation > 70% is ineligible to run on a contended 
> HT thread.  
> (I think in practise this means that any domain with high 
> reservations 
> will end up running on gang scheduled thread pairs, just to guarantee 
> that the other thread is idle, so the uncontended HT thread 
> can run at 
> 100%.)
> 
> (Another way to look at it is that HT contention is a bit like having 
> your vcpu being preempted by Xen, but rather than going from 100% 
> running to 0% running, your vcpu drops to 70%.)
> 
>     J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-17 15:55                       ` Dan Magenheimer
@ 2009-04-17 16:17                         ` Jeremy Fitzhardinge
  2009-04-17 16:46                           ` Dan Magenheimer
  2009-04-17 17:05                           ` George Dunlap
  0 siblings, 2 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2009-04-17 16:17 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: George Dunlap, Tian, Kevin, xen-devel

Dan Magenheimer wrote:
> And that's the oversimplification I think.  Just
> because Intel provides a rule-of-thumb that the extra
> thread increases performance by 30% doesn't mean that
> it is a good number to choose for scheduling purposes.
>   

Actually the 70% was a number I plucked out of the air with no 
justification at all.

> I suspect (and maybe this has even already been proven)
> that this varies from 0%-100% depending on the workload,
> and may even vary from *negative* to *more* than 100%.
> (Yes, I understand that i7 is supposed to be better than
> the last round of HT... but is it always better?)
>   

The only way to know is by measurement, ideally with some specific 
performance counter which tells you what went on in that last 
timeslice.  But if this is a big issue, you can always disable HT, as 
lots of people did the last time around.

    J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-17 16:17                         ` Jeremy Fitzhardinge
@ 2009-04-17 16:46                           ` Dan Magenheimer
  2009-04-17 17:05                           ` George Dunlap
  1 sibling, 0 replies; 35+ messages in thread
From: Dan Magenheimer @ 2009-04-17 16:46 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: George Dunlap, Tian, Kevin, xen-devel

> But if this is a big issue, you can always disable HT, as 
> lots of people did the last time around.

That would be a shame because HT will almost certainly
provide SOME performance benefit MOST of the time.

After pondering a bit, I guess I am arguing that once
processors have HT, Turbo Boost, and power management,
scheduling as a discipline has to move from the realm
of discrete to the realm of continuous.  A "second of
CPU" no longer has any real meaning when the value
of "a CPU" varies across time and workload.  (I suppose
due to shared cache effects and bus contention, this has
probably always been the case, but to a less obvious
degree.)
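
To make the "continuous" view concrete, here is a minimal sketch of charging a vcpu for effective CPU time rather than wall-clock time. It is illustrative only and not taken from any scheduler in this thread: the function name, the interval format, and the 0.7 weight (the number Jeremy says he plucked out of the air) are all assumptions.

```python
# Sketch only: weight run time by how much "CPU" the vcpu actually got.
# The 0.7 default is the illustrative figure from this thread, not a
# measured value; a real factor would vary with workload and hardware.

def effective_cpu_ns(intervals, contended_factor=0.7):
    """Sum run time, scaling intervals where the sibling thread was busy.

    intervals: iterable of (duration_ns, sibling_busy) tuples.
    This tuple format is hypothetical, chosen for the example.
    """
    total = 0.0
    for duration_ns, sibling_busy in intervals:
        weight = contended_factor if sibling_busy else 1.0
        total += duration_ns * weight
    return total


# A vcpu that ran 10 ms alone, then 10 ms next to a busy sibling thread.
ran = [(10_000_000, False), (10_000_000, True)]
charged_ns = effective_cpu_ns(ran)
```

With a fixed weight this is still a crude model; the point of the paragraph above is that the weight itself is a moving quantity.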

> The only way to know is by measurement, ideally with
> some specific performance counter which tells you
> what went on in that last timeslice.

Indeed.  Even if it is impossible to predict the
throughput of a specific workload on a specific CPU,
it sure would be nice if we could at least roughly
measure the past.

Processor architects take note! ;-)
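
As a rough sketch of what "measuring the past" could look like, assume the hypervisor can read instructions-retired and cycle counts for the last timeslice and has some baseline IPC for the same vcpu running uncontended. Both the baseline and the use of IPC as a proxy for useful work are assumptions made for this example; the function and parameter names are hypothetical.

```python
# Sketch only: estimate how much of "a CPU" a vcpu got in its last
# timeslice from two counter deltas. IPC is an imperfect proxy for
# throughput, and the baseline would itself have to be measured.

def relative_throughput(instructions, cycles, baseline_ipc):
    """Return observed IPC as a fraction of the uncontended baseline IPC."""
    if cycles <= 0 or baseline_ipc <= 0:
        raise ValueError("cycles and baseline_ipc must be positive")
    return (instructions / cycles) / baseline_ipc


# 1.4M instructions in 2M cycles against a baseline IPC of 1.0
# gives a factor of roughly 0.7 for that timeslice.
factor = relative_throughput(1_400_000, 2_000_000, 1.0)
```

The result can exceed 1.0 or be close to 0, which matches the point made earlier in the thread that the benefit varies widely by workload.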




* Re: [RFC] Scheduler work, part 1: High-level goals and interface.
  2009-04-17 16:17                         ` Jeremy Fitzhardinge
  2009-04-17 16:46                           ` Dan Magenheimer
@ 2009-04-17 17:05                           ` George Dunlap
  1 sibling, 0 replies; 35+ messages in thread
From: George Dunlap @ 2009-04-17 17:05 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Tian, Kevin

Jeremy Fitzhardinge wrote:
> The only way to know is by measurement, ideally with some specific 
> performance counter which tells you what went on in that last 
> timeslice.  But if this is a big issue, you can always disable HT, as 
> lots of people did the last time around.
>   
I think measurement, both of total system throughput and individual VM 
throughput, is the final word on all designs.  I certainly plan on 
testing and comparing throughput for a variety of workloads as I develop 
the scheduler.  And I encourage anyone with the time and inclination to 
try to find workloads for which the scheduler performs poorly as new 
features (such as the proposed HT scheduling) are introduced.  :-)

Peace,
 -George



Thread overview: 35+ messages
2009-04-09 15:58 [RFC] Scheduler work, part 1: High-level goals and interface George Dunlap
2009-04-09 18:41 ` Jeremy Fitzhardinge
2009-04-10  0:33   ` Tian, Kevin
2009-04-10 16:15     ` Jeremy Fitzhardinge
2009-04-10 17:16       ` Ian Pratt
2009-04-10 17:19         ` Jeremy Fitzhardinge
2009-04-11 10:00           ` Tian, Kevin
2009-04-15 15:47             ` George Dunlap
2009-04-15 13:54           ` George Dunlap
2009-04-15 16:23             ` Jeremy Fitzhardinge
2009-04-10 17:34         ` Jeremy Fitzhardinge
2009-04-11  9:57         ` Tian, Kevin
2009-04-11 17:11           ` Ian Pratt
2009-04-12  6:27             ` Tian, Kevin
2009-04-11  9:52       ` Tian, Kevin
2009-04-15 15:56         ` George Dunlap
2009-04-16  5:11           ` Tian, Kevin
2009-04-16 10:27             ` George Dunlap
2009-04-16 14:10               ` Dan Magenheimer
2009-04-16 16:32                 ` Jeremy Fitzhardinge
2009-04-16 18:20                   ` Andrew Lyon
2009-04-16 18:28                     ` Jeremy Fitzhardinge
2009-04-17 10:17                 ` George Dunlap
2009-04-17 14:13                   ` Dan Magenheimer
2009-04-17 14:55                     ` Jeremy Fitzhardinge
2009-04-17 15:55                       ` Dan Magenheimer
2009-04-17 16:17                         ` Jeremy Fitzhardinge
2009-04-17 16:46                           ` Dan Magenheimer
2009-04-17 17:05                           ` George Dunlap
2009-04-17 10:02               ` Tian, Kevin
2009-04-15 14:29   ` George Dunlap
2009-04-10  0:15 ` Tian, Kevin
2009-04-15 15:07   ` George Dunlap
2009-04-16  4:58     ` Tian, Kevin
2009-04-10  2:28 ` Zhiyuan Shao
