[PATCH] scheduler rate controller

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] scheduler rate controller
@ 2011-10-24  3:36 Lv, Hui
  2011-10-24 16:17 ` George Dunlap
  0 siblings, 1 reply; 22+ messages in thread
From: Lv, Hui @ 2011-10-24  3:36 UTC (permalink / raw)
  To: xen-devel@lists.xensource.com
  Cc: Duan, Jiangang, Tian, Kevin, keir@xen.org, Dong, Eddie


As one of the topics presented in Xen summit2011 in SC, we proposed one method scheduler rate controller (SRC) to control high frequency of scheduling under some conditions. You can find the slides at 
http://www.slideshare.net/xen_com_mgr/9-hui-lvtacklingthemanagementchallengesofserverconsolidationonmulticoresystems

In the followings, we have tested it with 2-socket multi-core system with many rounds and got the positive results and improve the performance greatly either with the consolidation workload SPECvirt_2010 or some small workloads such as sysbench and SPECjbb. So I posted it here for review.

>From Xen scheduling mechanism, hypervisor kicks related VCPUs by raising schedule softirq during processing external interrupts. Therefore, if the number of IRQ is very large, the scheduling happens more frequent. Frequent scheduling will 
1) bring more overhead for hypervisor and 
2) increase cache miss rate. 

In our consolidation workloads, SPECvirt_sc2010, SR-IOV & iSCSI solution are adopted to bypass software emulation but bring heavy network traffic. Correspondingly, 15k scheduling happened per second on each physical core, which means the average running time is  very short, only 60us. We proposed SRC in XEN to mitigate this problem. 
The performance benefits brought by this patch is very huge at peak throughput with no influence when system loads are low.

SRC improved SPECvirt performance by 14%. 
1)It reduced CPU utilization, which allows more load to be added.
2)Response time (QoS)  became better at the same CPU %.
3)The better response time allowed us to push the CPU % at peak performance to an even higher level (CPU was not saturated in SPECvirt).
SRC reduced context switch rate significantly, resulted in 
2)Smaller Path Length
3)Less cache misses thus lower CPI
4)Better performance for both Guest and Hypervisor sides.

With this patch, from our SPECvirt_sc2010 results, the performance of xen catches up the other open sourced hypervisor. 


Signed-off-by: Hui Lv hui.lv@intel.com


diff -ruNp xen.org/common/schedule.c xen/common/schedule.c
--- xen.org/common/schedule.c	2011-10-20 03:29:44.000000000 -0400
+++ xen/common/schedule.c	2011-10-23 21:41:14.000000000 -0400
@@ -98,6 +98,31 @@ static inline void trace_runstate_change
 
     __trace_var(event, 1/*tsc*/, sizeof(d), &d);
 }
+/*
+ *opt_sched_rate_control:  parameter to turn on/off  scheduler rate controller (SRC)
+ *opt_sched_rate_high: scheduling frequency threshold, default value is 50.
+
+ *Suggest to set the value of opt_sched_rate_high larger than 50.
+ *It means if the scheduling frequency number, calculated during SCHED_SRC_INTERVAL (default 10 millisecond), is larger than opt_sched_rate_high, SRC works.
+*/
+bool_t opt_sched_rate_control = 0;
+unsigned int opt_sched_rate_high = 50;           
+boolean_param("sched_rate_control", opt_sched_rate_control);
+integer_param("sched_rate_high", opt_sched_rate_high);
+
+
+/* The following function is the scheduling rate controller (SRC). It is triggered when
+ * the frequency of scheduling is excessive high. (larger than opt_sched_rate_high)
+ * 
+ * Rules to control the scheduling frequency
+ * 1)if the frequency of scheduling (sd->s_csnum), calculated during the period of SCHED_SRC_INTERVAL,
+ * is larger than the threshold opt_sched_rate_high, SRC is enabled to work by setting sd->s_src_control = 1
+ * 2)if SRC works, it returns previous vcpu directly if previous vcpu is still runnalbe and not the idle vcpu.
+ * This method can decrease the frequency of scheduling when the scheduling frequency is excessive.
+*/
+
+void src_controller(struct schedule_data *sd, struct vcpu *prev, s_time_t now);
+
 
 static inline void trace_continue_running(struct vcpu *v)
 {
@@ -1033,6 +1058,29 @@ static void vcpu_periodic_timer_work(str
     set_timer(&v->periodic_timer, periodic_next_event);
 }
 
+void src_controller(struct schedule_data *sd, struct vcpu *prev, s_time_t now)
+{
+    sd->s_csnum++;
+    if ((now - sd->s_src_loop_begin) >= MILLISECS(SCHED_SRC_INTERVAL))
+    {
+        if (sd->s_csnum >= opt_sched_rate_high)
+               sd->s_src_control = 1;
+        else
+               sd->s_src_control = 0;
+        sd->s_src_loop_begin = now;
+        sd->s_csnum = 0;
+    }
+    if (sd->s_src_control)
+    {
+       if (!is_idle_vcpu(prev) && vcpu_runnable(prev))
+       {
+           perfc_incr(sched_src);
+           return continue_running(prev);
+       }
+       perfc_incr(sched_nosrc);
+    }
+}
+
 /* 
  * The main function
  * - deschedule the current domain (scheduler independent).
@@ -1054,6 +1102,8 @@ static void schedule(void)
 
     sd = &this_cpu(schedule_data);
 
+    if (opt_sched_rate_control)
+        src_controller(sd,prev,now);
     /* Update tasklet scheduling status. */
     switch ( *tasklet_work )
     {
@@ -1197,6 +1247,9 @@ static int cpu_schedule_up(unsigned int 
     sd->curr = idle_vcpu[cpu];
     init_timer(&sd->s_timer, s_timer_fn, NULL, cpu);
     atomic_set(&sd->urgent_count, 0);
+    sd->s_csnum=0;
+    sd->s_src_loop_begin=NOW();
+    sd->s_src_control=0;
 
     /* Boot CPU is dealt with later in schedule_init(). */
     if ( cpu == 0 )
diff -ruNp xen.org/include/xen/perfc_defn.h xen/include/xen/perfc_defn.h
--- xen.org/include/xen/perfc_defn.h	2011-10-20 03:29:44.000000000 -0400
+++ xen/include/xen/perfc_defn.h	2011-10-23 21:08:28.000000000 -0400
@@ -15,6 +15,8 @@ PERFCOUNTER(ipis,                   "#IP
 PERFCOUNTER(sched_irq,              "sched: timer")
 PERFCOUNTER(sched_run,              "sched: runs through scheduler")
 PERFCOUNTER(sched_ctx,              "sched: context switches")
+PERFCOUNTER(sched_src,              "sched: src triggered")
+PERFCOUNTER(sched_nosrc,            "sched: src not triggered")
 
 PERFCOUNTER(vcpu_check,             "csched: vcpu_check")
 PERFCOUNTER(schedule,               "csched: schedule")
diff -ruNp xen.org/include/xen/sched-if.h xen/include/xen/sched-if.h
--- xen.org/include/xen/sched-if.h	2011-10-20 03:29:44.000000000 -0400
+++ xen/include/xen/sched-if.h	2011-10-23 21:20:57.000000000 -0400
@@ -15,6 +15,11 @@ extern struct cpupool *cpupool0;
 
 /* cpus currently in no cpupool */
 extern cpumask_t cpupool_free_cpus;
+/*SRC judge whether to trigger scheduling controller based on the comparison 
+ *between the scheduling frequency, counted during SCHED_SRC_INTERVAL, and the threshold opt_sched_rate_high
+ *Suggest to set SCHED_SRC_INTERVAL to 10 (millisecond)
+*/
+#define SCHED_SRC_INTERVAL      10
 
 /*
  * In order to allow a scheduler to remap the lock->cpu mapping,
@@ -32,6 +37,9 @@ struct schedule_data {
     struct vcpu        *curr;           /* current task                    */
     void               *sched_priv;
     struct timer        s_timer;        /* scheduling timer                */
+    int                 s_csnum;          	/* scheduling number based on last period */
+    s_time_t            s_src_loop_begin;       /* SRC conting start point */
+    bool_t              s_src_control;        	/*indicate whether src should be triggered */
     atomic_t            urgent_count;   /* how many urgent vcpus           */
 };

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] scheduler rate controller
  2011-10-24  3:36 [PATCH] scheduler rate controller Lv, Hui
@ 2011-10-24 16:17 ` George Dunlap
  2011-10-24 16:57   ` Keir Fraser
                     ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: George Dunlap @ 2011-10-24 16:17 UTC (permalink / raw)
  To: Lv, Hui
  Cc: Duan, Jiangang, Tian, Kevin, xen-devel@lists.xensource.com,
	keir@xen.org, Dong, Eddie

On Mon, Oct 24, 2011 at 4:36 AM, Lv, Hui <hui.lv@intel.com> wrote:
>
> As one of the topics presented in Xen summit2011 in SC, we proposed one method scheduler rate controller (SRC) to control high frequency of scheduling under some conditions. You can find the slides at
> http://www.slideshare.net/xen_com_mgr/9-hui-lvtacklingthemanagementchallengesofserverconsolidationonmulticoresystems
>
> In the followings, we have tested it with 2-socket multi-core system with many rounds and got the positive results and improve the performance greatly either with the consolidation workload SPECvirt_2010 or some small workloads such as sysbench and SPECjbb. So I posted it here for review.
>
> >From Xen scheduling mechanism, hypervisor kicks related VCPUs by raising schedule softirq during processing external interrupts. Therefore, if the number of IRQ is very large, the scheduling happens more frequent. Frequent scheduling will
> 1) bring more overhead for hypervisor and
> 2) increase cache miss rate.
>
> In our consolidation workloads, SPECvirt_sc2010, SR-IOV & iSCSI solution are adopted to bypass software emulation but bring heavy network traffic. Correspondingly, 15k scheduling happened per second on each physical core, which means the average running time is  very short, only 60us. We proposed SRC in XEN to mitigate this problem.
> The performance benefits brought by this patch is very huge at peak throughput with no influence when system loads are low.
>
> SRC improved SPECvirt performance by 14%.
> 1)It reduced CPU utilization, which allows more load to be added.
> 2)Response time (QoS)  became better at the same CPU %.
> 3)The better response time allowed us to push the CPU % at peak performance to an even higher level (CPU was not saturated in SPECvirt).
> SRC reduced context switch rate significantly, resulted in
> 2)Smaller Path Length
> 3)Less cache misses thus lower CPI
> 4)Better performance for both Guest and Hypervisor sides.
>
> With this patch, from our SPECvirt_sc2010 results, the performance of xen catches up the other open sourced hypervisor.

Hui,

Thanks for the patch, and the work you've done testing it.  There are
a couple of things to discuss.

* I'm not sure I like the idea of doing this at the generic level than
at the specific scheduler level -- e.g., inside of credit1.  For
better or for worse, all aspects of scheduling work together, and even
small changes tend to have a significant effect on the emergent
behavior.  I understand why you'd want this in the generic scheduling
code; but it seems like it would be better for each scheduler to
implement a rate control independently.

* The actual algorithm you use here isn't described.  It seems to be
as follows (please correct me if I've made a mistake
reverse-engineering the algorithm):

Every 10ms, check to see if there have been more than 50 schedules.
If so, disable pre-emption entirely for 10ms, allowing processes to
run without being interrupted (unless they yield).

It seems like we should be able to do better.  For one, it means in
the general case you will flip back and forth between really frequent
schedules and less frequent schedules.  For two, turning off
preemption entirely will mean that whatever vcpu happens to be running
could, if it wished, run for the full 10ms; and which one got elected
to do that would be really random.  This may work well for SPECvirt,
but it's the kind of algorithm that is likely to have some workloads
on which it works very poorly.  Finally, there's the chance that this
algorithm could be "gamed" -- i.e., if a rogue VM knew that most other
VMs yielded frequently, it might be able to arrange that there would
always be more than 50 context switches a second, while it runs
without preemption and takes up more than its fair share.

Have you tried just making it give each vcpu a minimum amount of
scheduling time, say, 500us or 1ms?

Now a couple of stylistic comments:
* src tends to make me think of "source".  I think sched_rate[_*]
would fit the existing naming convention better.
* src_controller() shouldn't call continue_running() directly.
Instead, scheduler() should call src_controller(); and only call
sched->do_schedule() if src_controller() returns false (or something
like that).
* Whatever the algorithm is should have comments describing what it
does and how it's supposed to work.
* Your patch is malformed; you need to have it apply at the top level,
not from within the xen/ subdirectory.  The easiest way to get a patch
is to use either mercurial queues, or "hg diff".  There are some good
suggestions for making and posting patches here:
http://wiki.xensource.com/xenwiki/SubmittingXenPatches

Thanks again for all your work on this -- we definitely want Xen to
beat the other open-source hypervisor. :-)

 -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] scheduler rate controller
  2011-10-24 16:17 ` George Dunlap
@ 2011-10-24 16:57   ` Keir Fraser
  2011-10-24 21:20   ` Dario Faggioli
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Keir Fraser @ 2011-10-24 16:57 UTC (permalink / raw)
  To: George Dunlap, Lv, Hui
  Cc: Duan, Jiangang, Tian, Kevin, xen-devel@lists.xensource.com,
	Dong, Eddie

On 24/10/2011 17:17, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:

> * I'm not sure I like the idea of doing this at the generic level than
> at the specific scheduler level -- e.g., inside of credit1.  For
> better or for worse, all aspects of scheduling work together, and even
> small changes tend to have a significant effect on the emergent
> behavior.  I understand why you'd want this in the generic scheduling
> code; but it seems like it would be better for each scheduler to
> implement a rate control independently.

Yes, this doesn't belong in schedule.c.

 -- Keir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] scheduler rate controller
  2011-10-24 16:17 ` George Dunlap
  2011-10-24 16:57   ` Keir Fraser
@ 2011-10-24 21:20   ` Dario Faggioli
  2011-10-25  7:57   ` Dario Faggioli
  2011-10-28  2:07   ` Lv, Hui
  3 siblings, 0 replies; 22+ messages in thread
From: Dario Faggioli @ 2011-10-24 21:20 UTC (permalink / raw)
  To: xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 2186 bytes --]

Hello everyone,

On Mon, 2011-10-24 at 17:17 +0100, George Dunlap wrote:
> * The actual algorithm you use here isn't described.  It seems to be
> as follows (please correct me if I've made a mistake
> reverse-engineering the algorithm):
> 
> Every 10ms, check to see if there have been more than 50 schedules.
> If so, disable pre-emption entirely for 10ms, allowing processes to
> run without being interrupted (unless they yield).
> 
> It seems like we should be able to do better.  For one, it means in
> the general case you will flip back and forth between really frequent
> schedules and less frequent schedules.  For two, turning off
> preemption entirely will mean that whatever vcpu happens to be running
> could, if it wished, run for the full 10ms; and which one got elected
> to do that would be really random.  
>
To me, this is  key point... Maybe we can save at least the calls coming
from the timer tick from being skipped, or something like that?

More generally speaking, I think I can see how this feature can be
useful, and I also think it can live in the generic schedule.c code, but
the algorithm with which rate-limiting is happening needs to be well
known, documented and exposed to the user (more than by means of a
couple of perf-counters).

For example this might completely destroy the time guarantees a
scheduler  like sEDF would give, and in such case it must be easy enough
to figure out what's going on and why the scheduler is  not behaving as
one would have expected!

For that reaason, although, again, a mechanism like this could
(according to my opinion) be general enough to be sensible and
meaningful for all the various schedulers, it might be worthwhile to
have it inside credit1 for now, where we know it will probably yield the
most of its benefits.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
PhD Candidate, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] scheduler rate controller
  2011-10-24 16:17 ` George Dunlap
  2011-10-24 16:57   ` Keir Fraser
  2011-10-24 21:20   ` Dario Faggioli
@ 2011-10-25  7:57   ` Dario Faggioli
  2011-10-28  2:07   ` Lv, Hui
  3 siblings, 0 replies; 22+ messages in thread
From: Dario Faggioli @ 2011-10-25  7:57 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, keir@xen.org,
	Dong, Eddie, Lv, Hui, Duan, Jiangang


[-- Attachment #1.1: Type: text/plain, Size: 2523 bytes --]

Hello everyone,

On Mon, 2011-10-24 at 17:17 +0100, George Dunlap wrote:
> * The actual algorithm you use here isn't described.  It seems to be
> as follows (please correct me if I've made a mistake
> reverse-engineering the algorithm):
> 
> Every 10ms, check to see if there have been more than 50 schedules.
> If so, disable pre-emption entirely for 10ms, allowing processes to
> run without being interrupted (unless they yield).
> 
> It seems like we should be able to do better.  For one, it means in
> the general case you will flip back and forth between really frequent
> schedules and less frequent schedules.  For two, turning off
> preemption entirely will mean that whatever vcpu happens to be running
> could, if it wished, run for the full 10ms; and which one got elected
> to do that would be really random.  
>
To me, this is  key point... Maybe we can save at least the calls coming
from the timer tick from being skipped, or something like that?

More generally speaking, I think I can see how this feature can be
useful, and I also think it can live in the generic schedule.c code, but
the algorithm with which rate-limiting is happening needs to be well
known, documented and exposed to the user (more than by means of a
couple of perf-counters).

For example this might completely destroy the time guarantees a
scheduler  like sEDF would give, and in such case it must be easy enough
to figure out what's going on and why the scheduler is  not behaving as
one would have expected!

For that reaason, although, again, a mechanism like this could
(according to my opinion) be general enough to be sensible and
meaningful for all the various schedulers, it might be worthwhile to
have it inside credit1 for now, where we know it will probably yield the
most of its benefits.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
PhD Candidate, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
PhD Candidate, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH] scheduler rate controller
  2011-10-24 16:17 ` George Dunlap
                     ` (2 preceding siblings ...)
  2011-10-25  7:57   ` Dario Faggioli
@ 2011-10-28  2:07   ` Lv, Hui
  2011-10-28  8:10     ` Dario Faggioli
  3 siblings, 1 reply; 22+ messages in thread
From: Lv, Hui @ 2011-10-28  2:07 UTC (permalink / raw)
  To: George Dunlap
  Cc: Duan, Jiangang, Tian, Kevin, xen-devel@lists.xensource.com,
	keir@xen.org, Dong, Eddie



> -----Original Message-----
> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George
> Dunlap
> Sent: Tuesday, October 25, 2011 12:17 AM
> To: Lv, Hui
> Cc: xen-devel@lists.xensource.com; Duan, Jiangang; Tian, Kevin;
> keir@xen.org; Dong, Eddie
> Subject: Re: [Xen-devel] [PATCH] scheduler rate controller
> 
> On Mon, Oct 24, 2011 at 4:36 AM, Lv, Hui <hui.lv@intel.com> wrote:
> >
> > As one of the topics presented in Xen summit2011 in SC, we proposed
> one method scheduler rate controller (SRC) to control high frequency of
> scheduling under some conditions. You can find the slides at
> > http://www.slideshare.net/xen_com_mgr/9-hui-
> lvtacklingthemanagementchallengesofserverconsolidationonmulticoresystem
> s
> >
> > In the followings, we have tested it with 2-socket multi-core system
> with many rounds and got the positive results and improve the
> performance greatly either with the consolidation workload
> SPECvirt_2010 or some small workloads such as sysbench and SPECjbb. So
> I posted it here for review.
> >
> > >From Xen scheduling mechanism, hypervisor kicks related VCPUs by
> raising schedule softirq during processing external interrupts.
> Therefore, if the number of IRQ is very large, the scheduling happens
> more frequent. Frequent scheduling will
> > 1) bring more overhead for hypervisor and
> > 2) increase cache miss rate.
> >
> > In our consolidation workloads, SPECvirt_sc2010, SR-IOV & iSCSI
> solution are adopted to bypass software emulation but bring heavy
> network traffic. Correspondingly, 15k scheduling happened per second on
> each physical core, which means the average running time is  very short,
> only 60us. We proposed SRC in XEN to mitigate this problem.
> > The performance benefits brought by this patch is very huge at peak
> throughput with no influence when system loads are low.
> >
> > SRC improved SPECvirt performance by 14%.
> > 1)It reduced CPU utilization, which allows more load to be added.
> > 2)Response time (QoS)  became better at the same CPU %.
> > 3)The better response time allowed us to push the CPU % at peak
> performance to an even higher level (CPU was not saturated in SPECvirt).
> > SRC reduced context switch rate significantly, resulted in
> > 2)Smaller Path Length
> > 3)Less cache misses thus lower CPI
> > 4)Better performance for both Guest and Hypervisor sides.
> >
> > With this patch, from our SPECvirt_sc2010 results, the performance of
> xen catches up the other open sourced hypervisor.
> 
> Hui,
> 
> Thanks for the patch, and the work you've done testing it.  There are
> a couple of things to discuss.
> 
> * I'm not sure I like the idea of doing this at the generic level than
> at the specific scheduler level -- e.g., inside of credit1.  For
> better or for worse, all aspects of scheduling work together, and even
> small changes tend to have a significant effect on the emergent
> behavior.  I understand why you'd want this in the generic scheduling
> code; but it seems like it would be better for each scheduler to
> implement a rate control independently.
> 
> * The actual algorithm you use here isn't described.  It seems to be
> as follows (please correct me if I've made a mistake
> reverse-engineering the algorithm):
> 
> Every 10ms, check to see if there have been more than 50 schedules.
> If so, disable pre-emption entirely for 10ms, allowing processes to
> run without being interrupted (unless they yield).
> 

Sorry for the lack of description. You are right for the control process.

> It seems like we should be able to do better.  For one, it means in
> the general case you will flip back and forth between really frequent
> schedules and less frequent schedules.  For two, turning off
> preemption entirely will mean that whatever vcpu happens to be running
> could, if it wished, run for the full 10ms; and which one got elected
> to do that would be really random.  This may work well for SPECvirt,
> but it's the kind of algorithm that is likely to have some workloads
> on which it works very poorly.  Finally, there's the chance that this
> algorithm could be "gamed" -- i.e., if a rogue VM knew that most other
> VMs yielded frequently, it might be able to arrange that there would
> always be more than 50 context switches a second, while it runs
> without preemption and takes up more than its fair share.
> 

Yes, agree that, there are more things to do to make a more delicate solution for this in the next step. For example, we can consider per VM status to decide whether to turn on/off the control to make it more fair, such as your point three.

However, as the first step, the current solution is straightforward and effective. 
1) Most importantly, it happens when the scheduling frequency is excessive. User can decide which degree is excessive by setting parameter "opt_sched_rate_high"(default 50). If the system is crucial for latency sensitive tasks, you can choose a higher value that this patch will have little impact on it. User can decide which value is good for their environment. However, from our experience, if the scheduling frequency is too excessive, it also impairs the Qos of latency sensitive tasks due to frequent interrupts by other VMs.
2) Considering the excessive scheduling condition, the preemption is turning off entirely. If current running vcpu, yielded frequently, it cannot run for the full 10ms. If current running vcpu, not yielded frequently, it can possible run as long as up to 10ms. That means, this algorithm roughly protect the un-yielded vcpu to run a long time slice without preemption. This is something similar to your point 3 but in a roughly way. :)
3) Finally, this patch aimed to solve the issue when scheduling frequency is excessive but not influence the normal case (less frequency). We should treat these two cases separately. Since excessive scheduling case cannot guarantee neither performance or Qos.


 
> Have you tried just making it give each vcpu a minimum amount of
> scheduling time, say, 500us or 1ms?
> 
> Now a couple of stylistic comments:
> * src tends to make me think of "source".  I think sched_rate[_*]
> would fit the existing naming convention better.
> * src_controller() shouldn't call continue_running() directly.
> Instead, scheduler() should call src_controller(); and only call
> sched->do_schedule() if src_controller() returns false (or something
> like that).
> * Whatever the algorithm is should have comments describing what it
> does and how it's supposed to work.
> * Your patch is malformed; you need to have it apply at the top level,
> not from within the xen/ subdirectory.  The easiest way to get a patch
> is to use either mercurial queues, or "hg diff".  There are some good
> suggestions for making and posting patches here:
> http://wiki.xensource.com/xenwiki/SubmittingXenPatches
> 

Thanks for the kind information. I think the next version will be better :)

> Thanks again for all your work on this -- we definitely want Xen to
> beat the other open-source hypervisor. :-)
> 
>  -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH] scheduler rate controller
  2011-10-28  2:07   ` Lv, Hui
@ 2011-10-28  8:10     ` Dario Faggioli
  2011-10-28  8:52       ` Lv, Hui
  0 siblings, 1 reply; 22+ messages in thread
From: Dario Faggioli @ 2011-10-28  8:10 UTC (permalink / raw)
  To: Lv, Hui
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, keir@xen.org,
	George Dunlap, Dong, Eddie, Duan, Jiangang


[-- Attachment #1.1: Type: text/plain, Size: 3409 bytes --]

On Fri, 2011-10-28 at 10:07 +0800, Lv, Hui wrote:
> Yes, agree that, there are more things to do to make a more delicate solution for this in the next step. 
>
Hi Hui,

Firt of ll, I took a look at the slides and it really seems there's a
huge amount of work there about benchmarking, analyzing the results and
chasing the culprits of the various sources of overhead... Great job,
indeed! :-)

> For example, we can consider per VM status to decide whether to turn on/off the control to make it more fair, such as your point three.
> 
> However, as the first step, the current solution is straightforward and effective. 
> 1) Most importantly, it happens when the scheduling frequency is excessive. User can decide which degree is excessive by setting parameter "opt_sched_rate_high"(default 50). If the system is crucial for latency sensitive tasks, you can choose a higher value that this patch will have little impact on it. User can decide which value is good for their environment. However, from our experience, if the scheduling frequency is too excessive, it also impairs the Qos of latency sensitive tasks due to frequent interrupts by other VMs.
> 2) Considering the excessive scheduling condition, the preemption is turning off entirely. If current running vcpu, yielded frequently, it cannot run for the full 10ms. If current running vcpu, not yielded frequently, it can possible run as long as up to 10ms. That means, this algorithm roughly protect the un-yielded vcpu to run a long time slice without preemption. This is something similar to your point 3 but in a roughly way. :)
> 3) Finally, this patch aimed to solve the issue when scheduling frequency is excessive but not influence the normal case (less frequency). We should treat these two cases separately. Since excessive scheduling case cannot guarantee neither performance or Qos.
> 
Just something crossed my mind reading the patch and the comments, would
it make sense to rate-limit the calls coming from (non-timer) interrupt
exit paths while still letting the tick able to trigger a scheduling
decision? This just to be sure that at least the time slice enforcing
(if any) happens how expected... Could it make sense?

More generally speaking, I see how this feature can be useful, and I
also think it could live in the generic schedule.c code, but (as George
was saying) the algorithm by which rate-limiting is happening needs to
be well known, documented and exposed to the user (more than by means of
a couple of perf-counters).

For example this might completely destroy the time guarantees a
scheduler  like sEDF would give, and in such case it must be easy enough
to figure out what's going on and why the scheduler is not behaving as
expected!

For that reason, although again, this could be made general enough to be
sensible and meaningful for all the various schedulers, it might be
worthwhile to have it inside credit1 for now, where we know it will
probably yield the most of its benefits.

Just my 2 cents. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
PhD Candidate, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH] scheduler rate controller
  2011-10-28  8:10     ` Dario Faggioli
@ 2011-10-28  8:52       ` Lv, Hui
  2011-10-28 10:09         ` Dario Faggioli
  0 siblings, 1 reply; 22+ messages in thread
From: Lv, Hui @ 2011-10-28  8:52 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, keir@xen.org,
	George Dunlap, Dong, Eddie, Duan, Jiangang

[-- Attachment #1: Type: text/plain, Size: 2101 bytes --]

Thanks Dario for your helpful comments.

> Just something crossed my mind reading the patch and the comments,
> would it make sense to rate-limit the calls coming from (non-timer)
> interrupt exit paths while still letting the tick able to trigger a
> scheduling decision? This just to be sure that at least the time slice
> enforcing (if any) happens how expected... Could it make sense?
> 

Yes, it makes sense. But currently, we lacks the scheduler knowledge such as what caused the scheduler, timer or interrupt? Can we?

> More generally speaking, I see how this feature can be useful, and I
> also think it could live in the generic schedule.c code, but (as George
> was saying) the algorithm by which rate-limiting is happening needs to
> be well known, documented and exposed to the user (more than by means
> of a couple of perf-counters).
> 

One question is that, what is the right palace to document such information? I'd like to make it as clear as possible to the users.

> For example this might completely destroy the time guarantees a
> scheduler  like sEDF would give, and in such case it must be easy
> enough to figure out what's going on and why the scheduler is not
> behaving as expected!
> 
> For that reason, although again, this could be made general enough to
> be sensible and meaningful for all the various schedulers, it might be
> worthwhile to have it inside credit1 for now, where we know it will
> probably yield the most of its benefits.
> 

I think I got your point. More considerations should be taken to avoid the disasters to any of the existing schedulers.
I'm fine to move it to the credit in the current stage. :)

> Just my 2 cents. :-)
> 
> Thanks and Regards,
> Dario
> 
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> ----------------------------------------------------------------------
> Dario Faggioli, http://retis.sssup.it/people/faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) PhD
> Candidate, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH] scheduler rate controller
  2011-10-28  8:52       ` Lv, Hui
@ 2011-10-28 10:09         ` Dario Faggioli
  2011-10-28 16:18           ` George Dunlap
  0 siblings, 1 reply; 22+ messages in thread
From: Dario Faggioli @ 2011-10-28 10:09 UTC (permalink / raw)
  To: Lv, Hui
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, keir@xen.org,
	George Dunlap, Dong, Eddie, Duan, Jiangang


[-- Attachment #1.1: Type: text/plain, Size: 2686 bytes --]

On Fri, 2011-10-28 at 16:52 +0800, Lv, Hui wrote:
> > Just something crossed my mind reading the patch and the comments,
> > would it make sense to rate-limit the calls coming from (non-timer)
> > interrupt exit paths while still letting the tick able to trigger a
> > scheduling decision? This just to be sure that at least the time slice
> > enforcing (if any) happens how expected... Could it make sense?
> > 
> 
> Yes, it makes sense. But currently, we lacks the scheduler knowledge such as what caused the scheduler, timer or interrupt? Can we?
> 
Not sure yet, I can imagine it's tricky and I need to dig a bit more in
the code, but I'll let know if I found a way of doing that...

> > More generally speaking, I see how this feature can be useful, and I
> > also think it could live in the generic schedule.c code, but (as George
> > was saying) the algorithm by which rate-limiting is happening needs to
> > be well known, documented and exposed to the user (more than by means
> > of a couple of perf-counters).
> > 
> 
> One question is that, what is the right palace to document such information? I'd like to make it as clear as possible to the users.
> 
Well, don't know, maybe a WARN (a WARN_ONCE alike thing would probably
be better), or in general something that leave a footstep in the logs,
so that one can find out by means of `xl dmesg' or related. Obviously,
I'm not suggesting of printk-ing each suppressed schedule invocation, or
the overhead would get even worse... :-P

I'm thinking of something that happens the very first time the limiting
fires, or maybe oncee some period/number of suppressions, just to remind
the user that he's getting weird behaviour because _he_enabled_
rate-limiting. Hopefully, that might also be useful for the user itself
to fine tune the limiting parameters, although I think the perf-counters
are already quite well suited for this.

> I think I got your point. More considerations should be taken to avoid the disasters to any of the existing schedulers.
> I'm fine to move it to the credit in the current stage. :)
> 
Yeah, and sorry (to everyone) for having pointed that out in so many
messages very similar to each other, but I was having MUA issues and was
thinking my mails weren't making to the list... Sorry again,

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
PhD Candidate, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH] scheduler rate controller
  2011-10-28 10:09         ` Dario Faggioli
@ 2011-10-28 16:18           ` George Dunlap
  2011-10-29  2:05             ` Lv, Hui
  2011-10-31  9:59             ` Dario Faggioli
  0 siblings, 2 replies; 22+ messages in thread
From: George Dunlap @ 2011-10-28 16:18 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir (Xen.org),
	George Dunlap, Dong, Eddie, Lv, Hui, Duan, Jiangang

On Fri, 2011-10-28 at 11:09 +0100, Dario Faggioli wrote:
> Not sure yet, I can imagine it's tricky and I need to dig a bit more in
> the code, but I'll let know if I found a way of doing that...

There are lots of reasons why the SCHEDULE_SOFTIRQ gets raised.  But I
think we want to focus on the scheduler itself raising it as a result of
the .wake() callback.  Whether the .wake() happens as a result of a HW
interrupt or something else, I don't think really matters.

Dario and Hui,  neither of you have commented on my idea, which is
simply don't preempt a VM if it has run for less than some amount of
time (say, 500us or 1ms).  If a higher-priority VM is woken up, see how
long the current VM has run.  If it's less than 1ms, set a 1ms timer and
call schedule() then.

> > > More generally speaking, I see how this feature can be useful, and I
> > > also think it could live in the generic schedule.c code, but (as George
> > > was saying) the algorithm by which rate-limiting is happening needs to
> > > be well known, documented and exposed to the user (more than by means
> > > of a couple of perf-counters).
> > > 
> > 
> > One question is that, what is the right palace to document such information? I'd like to make it as clear as possible to the users.
> > 
> Well, don't know, maybe a WARN (a WARN_ONCE alike thing would probably
> be better), or in general something that leave a footstep in the logs,
> so that one can find out by means of `xl dmesg' or related. Obviously,
> I'm not suggesting of printk-ing each suppressed schedule invocation, or
> the overhead would get even worse... :-P
> 
> I'm thinking of something that happens the very first time the limiting
> fires, or maybe oncee some period/number of suppressions, just to remind
> the user that he's getting weird behaviour because _he_enabled_
> rate-limiting. Hopefully, that might also be useful for the user itself
> to fine tune the limiting parameters, although I think the perf-counters
> are already quite well suited for this.

As much as possible, we want the system to Just Work.  Under normal
circumstances it wouldn't be too unusual for a VM to have a several-ms
delay between receiving a physical interrupt and being scheduled; I
think that if the 1ms delay works, having it on all the time would
probably be the best solution.  That's another reason I'm in favor of
trying it -- it's simple and easy to understand, and doesn't require
detecting when to "turn it on".

 -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH] scheduler rate controller
  2011-10-28 16:18           ` George Dunlap
@ 2011-10-29  2:05             ` Lv, Hui
  2011-10-31 10:16               ` Dario Faggioli
  2011-11-03  4:28               ` George Dunlap
  2011-10-31  9:59             ` Dario Faggioli
  1 sibling, 2 replies; 22+ messages in thread
From: Lv, Hui @ 2011-10-29  2:05 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir (Xen.org),
	George Dunlap, Dong, Eddie, Duan, Jiangang

[-- Attachment #1: Type: text/plain, Size: 4197 bytes --]

I have tried one way very similar as your idea.
1) to check whether current running vcpu runs less than 1ms, if yes, we will return current vcpu directly without preemption.
It try to guarantee vcpu to run as long as 1ms, if it wants.
It can reduce the scheduling frequency to some degree, but not very significant. Because 1ms is too light/weak with comparison to 10ms delay (SRC patch used).

As you said, if applying the seveal_ms_delay, it will happen whenever system is normal or not (excessive frequency). It may possible have the consequence that 1)under normal condition, it will produce worse Qos than that without applying such delay, 2) under excessive frequency condition, the mitigation effect of 1ms-delay may be too weak. In addition, your idea is to delay scheduling instead of reducing, which means the total number of scheduling would probably not change.
I think one possible solution, is to make the value of 1ms-delay adaptive according to the system status (low load or high load). If so, SRC patch just covered the excessive condition currently :). That's why I mentioned to treat normal and excessive conditions separately and don't influence the normal case as much as possible. Because we never know the consequence without amount of testing work. :)

Some of my stupid thinking :)

Best regards,

Lv, Hui

-----Original Message-----
From: George Dunlap [mailto:george.dunlap@citrix.com] 
Sent: Saturday, October 29, 2011 12:19 AM
To: Dario Faggioli
Cc: Lv, Hui; George Dunlap; Duan, Jiangang; Tian, Kevin; xen-devel@lists.xensource.com; Keir (Xen.org); Dong, Eddie
Subject: RE: [Xen-devel] [PATCH] scheduler rate controller

On Fri, 2011-10-28 at 11:09 +0100, Dario Faggioli wrote:
> Not sure yet, I can imagine it's tricky and I need to dig a bit more 
> in the code, but I'll let know if I found a way of doing that...

There are lots of reasons why the SCHEDULE_SOFTIRQ gets raised.  But I think we want to focus on the scheduler itself raising it as a result of the .wake() callback.  Whether the .wake() happens as a result of a HW interrupt or something else, I don't think really matters.

Dario and Hui,  neither of you have commented on my idea, which is simply don't preempt a VM if it has run for less than some amount of time (say, 500us or 1ms).  If a higher-priority VM is woken up, see how long the current VM has run.  If it's less than 1ms, set a 1ms timer and call schedule() then.

> > > More generally speaking, I see how this feature can be useful, and 
> > > I also think it could live in the generic schedule.c code, but (as 
> > > George was saying) the algorithm by which rate-limiting is 
> > > happening needs to be well known, documented and exposed to the 
> > > user (more than by means of a couple of perf-counters).
> > > 
> > 
> > One question is that, what is the right palace to document such information? I'd like to make it as clear as possible to the users.
> > 
> Well, don't know, maybe a WARN (a WARN_ONCE alike thing would probably 
> be better), or in general something that leave a footstep in the logs, 
> so that one can find out by means of `xl dmesg' or related. Obviously, 
> I'm not suggesting of printk-ing each suppressed schedule invocation, 
> or the overhead would get even worse... :-P
> 
> I'm thinking of something that happens the very first time the 
> limiting fires, or maybe oncee some period/number of suppressions, 
> just to remind the user that he's getting weird behaviour because 
> _he_enabled_ rate-limiting. Hopefully, that might also be useful for 
> the user itself to fine tune the limiting parameters, although I think 
> the perf-counters are already quite well suited for this.

As much as possible, we want the system to Just Work.  Under normal circumstances it wouldn't be too unusual for a VM to have a several-ms delay between receiving a physical interrupt and being scheduled; I think that if the 1ms delay works, having it on all the time would probably be the best solution.  That's another reason I'm in favor of trying it -- it's simple and easy to understand, and doesn't require detecting when to "turn it on".

 -George

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH] scheduler rate controller
  2011-10-28 16:18           ` George Dunlap
  2011-10-29  2:05             ` Lv, Hui
@ 2011-10-31  9:59             ` Dario Faggioli
  1 sibling, 0 replies; 22+ messages in thread
From: Dario Faggioli @ 2011-10-31  9:59 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir (Xen.org),
	George Dunlap, Dong, Eddie, Lv, Hui, Duan, Jiangang


[-- Attachment #1.1: Type: text/plain, Size: 3147 bytes --]

On Fri, 2011-10-28 at 17:18 +0100, George Dunlap wrote:
> On Fri, 2011-10-28 at 11:09 +0100, Dario Faggioli wrote:
> > Not sure yet, I can imagine it's tricky and I need to dig a bit more in
> > the code, but I'll let know if I found a way of doing that...
> 
> There are lots of reasons why the SCHEDULE_SOFTIRQ gets raised.  But I
> think we want to focus on the scheduler itself raising it as a result of
> the .wake() callback.  Whether the .wake() happens as a result of a HW
> interrupt or something else, I don't think really matters.
> 
I fully agree, since these are the events that are likely to cause a
context switch, and it's the context switch that --when happening too
frequently-- trashes the performances, isn't it?

> Dario and Hui,  neither of you have commented on my idea, which is
> simply don't preempt a VM if it has run for less than some amount of
> time (say, 500us or 1ms).  If a higher-priority VM is woken up, see how
> long the current VM has run.  If it's less than 1ms, set a 1ms timer and
> call schedule() then.
> 
Right, I was about to, but the got busy with other stuff. To me, it
sounds a more than valid method for achieving similar results. Maybe I
like it more than pure rate-limiting because it makes it easier to
understand what's going on, especially from the final user point of
view, and thus also easier to configure and to be dealt with.

However, the  effect is still pretty much the same, and this is also
able to cause "scheduling artifact", and affect guarantees and/or
expected behaviour, e.g., in EDF, if the earliest deadline task wakes up
right after a context switch it would need to run, independently on for
how long current is running, and not doing so could jeopardize
schedulability. 

Again, I'm not saying something like that shouldn't be done at all, just
that --if it is something special kicking in from time to time-- it
should make itsself evident.

> As much as possible, we want the system to Just Work.  Under normal
> circumstances it wouldn't be too unusual for a VM to have a several-ms
> delay between receiving a physical interrupt and being scheduled; I
> think that if the 1ms delay works, having it on all the time would
> probably be the best solution.  That's another reason I'm in favor of
> trying it -- it's simple and easy to understand, and doesn't require
> detecting when to "turn it on".
> 
Ok, that's the other way around. Having it all the time is pretty much
self evident so, if we're fine with the latency this introduces, and it
improves fairness and performances , this could be the way. From my
experience in real-time scheduling, ms alike scheduling latency is
something very very huge... But this obviously depend on the workload.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
PhD Candidate, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH] scheduler rate controller
  2011-10-29  2:05             ` Lv, Hui
@ 2011-10-31 10:16               ` Dario Faggioli
  2011-11-03  4:28               ` George Dunlap
  1 sibling, 0 replies; 22+ messages in thread
From: Dario Faggioli @ 2011-10-31 10:16 UTC (permalink / raw)
  To: Lv, Hui
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir (Xen.org),
	George Dunlap, Dong, Eddie, George Dunlap, Duan, Jiangang


[-- Attachment #1.1: Type: text/plain, Size: 2193 bytes --]

On Sat, 2011-10-29 at 10:05 +0800, Lv, Hui wrote:
> As you said, if applying the seveal_ms_delay, it will happen whenever system is normal or not (excessive frequency). It may possible have the consequence that 1)under normal condition, it will produce worse Qos than that without applying such delay, 2) under excessive frequency condition, the mitigation effect of 1ms-delay may be too weak. In addition, your idea is to delay scheduling instead of reducing, which means the total number of scheduling would probably not change.
>
I think it would. I mean, all the interrupts that arrive (and that are
causing TASKLET_SCHEDULE->schedule() right now) and see current vcpu not
having run for 1ms would collapse in just one call to schedule() at the
end of the 1ms delay from the time instant when the very first one
(interrupt) happened... Isn't it?

If yes, that would definitely reduce the number of calls to schedule().

> I think one possible solution, is to make the value of 1ms-delay adaptive according to the system status (low load or high load). If so, SRC patch just covered the excessive condition currently :). That's why I mentioned to treat normal and excessive conditions separately and don't influence the normal case as much as possible. Because we never know the consequence without amount of testing work. :)
> 
Well, again, from my perspective, these numbers (1ms, 10ms) really looks
like ages, considering for example audio processing workloads with
software like JACK needs periodicity of 1.3 or 2.6 ms, and thus having
1ms or (for worse) 10ms eaten by the scheduler would definitely step on
their toes. I understand we're not talking very much about audio
processing VMs here but hey, if it has to be general purpose... :-P

> Some of my stupid thinking :)
> 
Same for mine. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
PhD Candidate, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] scheduler rate controller
  2011-10-29  2:05             ` Lv, Hui
  2011-10-31 10:16               ` Dario Faggioli
@ 2011-11-03  4:28               ` George Dunlap
  2011-11-04 14:08                 ` Lv, Hui
  2011-11-28 17:31                 ` Lv, Hui
  1 sibling, 2 replies; 22+ messages in thread
From: George Dunlap @ 2011-11-03  4:28 UTC (permalink / raw)
  To: Lv, Hui
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir (Xen.org),
	Dong, Eddie, George Dunlap, Duan, Jiangang, Dario Faggioli

[-- Attachment #1: Type: text/plain, Size: 5884 bytes --]

On Sat, Oct 29, 2011 at 11:05 AM, Lv, Hui <hui.lv@intel.com> wrote:
> I have tried one way very similar as your idea.
> 1) to check whether current running vcpu runs less than 1ms, if yes, we will return current vcpu directly without preemption.
> It try to guarantee vcpu to run as long as 1ms, if it wants.
> It can reduce the scheduling frequency to some degree, but not very significant. Because 1ms is too light/weak with comparison to 10ms delay (SRC patch used).

Hey Hui,  Sorry for the delay in response -- FYI I'm at the XenSummit
Korea now, and I'll be on holiday next week.

Do you have the patch that you wrote for the 1ms delay handy, and any
numbers that you ran?  I'm a bit surprised that a 1ms delay didn't
have much effect; but in any case, it seems dialing that up should
have a similar effect -- e.g., if we changed that to 10ms, then it
should have a similar effect to the patch that you sent before.

> As you said, if applying the seveal_ms_delay, it will happen whenever system is normal or not (excessive frequency). It may possible have the consequence that 1)under normal condition, it will produce worse Qos than that without applying such delay,

Perhaps; but the current credit scheduler may already allow a VM to
run exclusively for 30ms, so I don't think that overall it should have
a big influence.

> 2) under excessive frequency condition, the mitigation effect of 1ms-delay may be too weak. In addition, your idea is to delay scheduling instead of reducing, which means the total number of scheduling would probably not change.

Well it will prevent preemption; so as long as at least one VM does
not yield, it will reduce the number of schedule events to 1000 times
per second.  If all VMs yield, then you can't really reduce the number
of scheduling events anyway (even with your preemption-disable patch).

> I think one possible solution, is to make the value of 1ms-delay adaptive according to the system status (low load or high load). If so, SRC patch just covered the excessive condition currently :). That's why I mentioned to treat normal and excessive conditions separately and don't influence the normal case as much as possible. Because we never know the consequence without amount of testing work. :)

Yes, exactly. :-)

> Some of my stupid thinking :)

Well, you've obviously done a lot more looking recently than I have. :-)

I'm attaching a prototype minimum timeslice patch that I threw
together last week.  It currently hangs during boot, but it will give
you the idea of what I was thinking of.

Hui, can you let me know what you think of the idea, and if you find
it interesting, could you try to fix it up, and test it?  Testing it
with bigger values like 5ms would be really interesting.

 -George

>
> Best regards,
>
> Lv, Hui
>
>
> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: Saturday, October 29, 2011 12:19 AM
> To: Dario Faggioli
> Cc: Lv, Hui; George Dunlap; Duan, Jiangang; Tian, Kevin; xen-devel@lists.xensource.com; Keir (Xen.org); Dong, Eddie
> Subject: RE: [Xen-devel] [PATCH] scheduler rate controller
>
> On Fri, 2011-10-28 at 11:09 +0100, Dario Faggioli wrote:
>> Not sure yet, I can imagine it's tricky and I need to dig a bit more
>> in the code, but I'll let know if I found a way of doing that...
>
> There are lots of reasons why the SCHEDULE_SOFTIRQ gets raised.  But I think we want to focus on the scheduler itself raising it as a result of the .wake() callback.  Whether the .wake() happens as a result of a HW interrupt or something else, I don't think really matters.
>
> Dario and Hui,  neither of you have commented on my idea, which is simply don't preempt a VM if it has run for less than some amount of time (say, 500us or 1ms).  If a higher-priority VM is woken up, see how long the current VM has run.  If it's less than 1ms, set a 1ms timer and call schedule() then.
>
>> > > More generally speaking, I see how this feature can be useful, and
>> > > I also think it could live in the generic schedule.c code, but (as
>> > > George was saying) the algorithm by which rate-limiting is
>> > > happening needs to be well known, documented and exposed to the
>> > > user (more than by means of a couple of perf-counters).
>> > >
>> >
>> > One question is that, what is the right palace to document such information? I'd like to make it as clear as possible to the users.
>> >
>> Well, don't know, maybe a WARN (a WARN_ONCE alike thing would probably
>> be better), or in general something that leave a footstep in the logs,
>> so that one can find out by means of `xl dmesg' or related. Obviously,
>> I'm not suggesting of printk-ing each suppressed schedule invocation,
>> or the overhead would get even worse... :-P
>>
>> I'm thinking of something that happens the very first time the
>> limiting fires, or maybe oncee some period/number of suppressions,
>> just to remind the user that he's getting weird behaviour because
>> _he_enabled_ rate-limiting. Hopefully, that might also be useful for
>> the user itself to fine tune the limiting parameters, although I think
>> the perf-counters are already quite well suited for this.
>
> As much as possible, we want the system to Just Work.  Under normal circumstances it wouldn't be too unusual for a VM to have a several-ms delay between receiving a physical interrupt and being scheduled; I think that if the 1ms delay works, having it on all the time would probably be the best solution.  That's another reason I'm in favor of trying it -- it's simple and easy to understand, and doesn't require detecting when to "turn it on".
>
>  -George
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>
>

[-- Attachment #2: credit2-min-tslice.diff --]
[-- Type: text/x-diff, Size: 3544 bytes --]

diff -r 0526644ad2a6 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Thu Oct 27 16:07:18 2011 +0100
+++ b/xen/common/sched_credit.c	Thu Nov 03 03:28:24 2011 +0000
@@ -109,6 +109,11 @@ boolean_param("sched_credit_default_yiel
 boolean_param("sched_credit_default_yield", sched_credit_default_yield);
 static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS;
 integer_param("sched_credit_tslice_ms", sched_credit_tslice_ms);
+
+/*
+ * Scheduler generic parameters
+ */
+extern int sched_ratelimit_us;
 
 /*
  * Physical CPU
@@ -1297,9 +1302,14 @@ csched_schedule(
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_vcpu *snext;
     struct task_slice ret;
+    s_time_t runtime, tslice;
 
     CSCHED_STAT_CRANK(schedule);
     CSCHED_VCPU_CHECK(current);
+
+    runtime = now - current->runstate.state_entry_time;
+    if ( runtime < 0 ) /* Does this ever happen? */
+        runtime = 0;
 
     if ( !is_idle_vcpu(scurr->vcpu) )
     {
@@ -1314,14 +1324,46 @@ csched_schedule(
     }
 
     /*
-     * Select next runnable local VCPU (ie top of local runq)
+     * Choices, choices:
+     * - If we have a tasklet, we need to run the idle vcpu no matter what.
+     * - If sched rate limiting is in effect, and the current vcpu has
+     *   run for less than that amount of time, continue the current one,
+     *   but with a shorter timeslice.
+     * - Otherwise, chose the one with the highest priority (which may
+     *   be the one currently running)
+     * - If the currently running one is TS_OVER, see if there
+     *   is a higher priority one waiting on the runqueue of another
+     *   cpu and steal it.
+     *
+     * Current invariant is that we always put the currently running vcpu
+     * on the runqueue, because we always take him off again below.
      */
+
     if ( vcpu_runnable(current) )
         __runq_insert(cpu, scurr);
     else
         BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
 
-    snext = __runq_elem(runq->next);
+    /* If we have schedule rate limiting enabled, check to see
+     * how long we've run for. */
+    if ( sched_ratelimit_us
+         && vcpu_runnable(current)
+         && !is_idle_vcpu(current)
+         && runtime < MICROSECS(sched_ratelimit_us) )
+    {
+        snext = scurr;
+        /* FIXME: Use prv->tslice_ms if we're also the head of hte queue */
+        tslice = MICROSECS(sched_ratelimit_us);
+    }
+    else
+    {
+        /*
+         * Select next runnable local VCPU (ie top of local runq)
+         */
+        snext = __runq_elem(runq->next);
+        tslice = MILLISECS(prv->tslice_ms));
+    }
+
     ret.migrated = 0;
 
     /* Tasklet work (which runs in idle VCPU context) overrides all else. */
@@ -1371,7 +1413,7 @@ csched_schedule(
      * Return task to run next...
      */
     ret.time = (is_idle_vcpu(snext->vcpu) ?
-                -1 : MILLISECS(prv->tslice_ms));
+                -1 : tslice;
     ret.task = snext->vcpu;
 
     CSCHED_VCPU_CHECK(ret.task);
diff -r 0526644ad2a6 xen/common/schedule.c
--- a/xen/common/schedule.c	Thu Oct 27 16:07:18 2011 +0100
+++ b/xen/common/schedule.c	Thu Nov 03 03:28:24 2011 +0000
@@ -46,6 +46,9 @@ string_param("sched", opt_sched);
  */
 bool_t sched_smt_power_savings = 0;
 boolean_param("sched_smt_power_savings", sched_smt_power_savings);
+/* Default scheduling rate limit: 1ms */
+int sched_ratelimit_us = 1000;
+integer_param("sched_ratelimit_us", sched_ratelimit_us);
 
 /* Various timer handlers. */
 static void s_timer_fn(void *unused);

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH] scheduler rate controller
  2011-11-03  4:28               ` George Dunlap
@ 2011-11-04 14:08                 ` Lv, Hui
  2011-11-14 15:22                   ` George Dunlap
  2011-11-14 15:30                   ` George Dunlap
  2011-11-28 17:31                 ` Lv, Hui
  1 sibling, 2 replies; 22+ messages in thread
From: Lv, Hui @ 2011-11-04 14:08 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir (Xen.org),
	Dong, Eddie, George Dunlap, Duan, Jiangang, Dario Faggioli

[-- Attachment #1: Type: text/plain, Size: 1756 bytes --]

> 
> Hey Hui,  Sorry for the delay in response -- FYI I'm at the XenSummit Korea
> now, and I'll be on holiday next week.
> 

Have a good trip in Korea and the following holiday! And Say Hi to everyone there。 ：）

> I'm attaching a prototype minimum timeslice patch that I threw together last
> week.  It currently hangs during boot, but it will give you the idea of what I
> was thinking of.
> 
> Hui, can you let me know what you think of the idea, and if you find it
> interesting, could you try to fix it up, and test it?  Testing it with bigger values
> like 5ms would be really interesting.

I agree that this idea seems more natural and proper, if it can solve the two problems that I addressed above. We need data to prove/disprove it.
As you mentioned that, this method is supposed to have the similar result as the patch I sent, when setting 10ms as the delay value in the excessive condition.
So that, an idea came to me that may enforce your proposal,
1. we still count the scheduling number during each period (for example 10ms)
2. This scheduling number is used to adaptively decide the delay value.
For example, if scheduling number is very excessive, we can set longer delay time, such as 5ms or 10ms. Or if the scheduling number is small, we can set small delay time, such as 1ms, 500us or even zero. In this way, the delay value is decided adaptively.
I think It can solve the possible problems that I addressed above.
George, how do you think this?
I'd like to try this and see the result. May also to compare the results between different solutions. As you know, SPECvirt workloads is too complex that I need some time to produce this :).
Also we have a set of small workloads to make quick testing.

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] scheduler rate controller
  2011-11-04 14:08                 ` Lv, Hui
@ 2011-11-14 15:22                   ` George Dunlap
  2011-11-14 15:30                   ` George Dunlap
  1 sibling, 0 replies; 22+ messages in thread
From: George Dunlap @ 2011-11-14 15:22 UTC (permalink / raw)
  To: Lv, Hui
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir (Xen.org),
	Dong, Eddie, George Dunlap, Duan, Jiangang, Dario Faggioli

On Fri, Nov 4, 2011 at 2:08 PM, Lv, Hui <hui.lv@intel.com> wrote:
> So that, an idea came to me that may enforce your proposal,
> 1. we still count the scheduling number during each period (for example 10ms)
> 2. This scheduling number is used to adaptively decide the delay value.
> For example, if scheduling number is very excessive, we can set longer delay time, such as 5ms or 10ms. Or if the scheduling number is small, we can set small delay time, such as 1ms, 500us or even zero. In this way, the delay value is decided adaptively.

Setting the value adaptively is good, but only if it's adapting to the
right thing. :-)

For instance, adapting to number of cache misses, or to latency
requirements of guests, seems like a good idea.

But adapting to the number of scheduling events in the last period
doesn't seem very useful -- especially since our whole goal is to
change the number of scheduling events to be fewer. :-)

> I'd like to try this and see the result. May also to compare the results between different solutions. As you know, SPECvirt workloads is too complex that I need some time to produce this :).

I've heard that; thanks for doing the work.

> Also we have a set of small workloads to make quick testing.

What kinds of workloads are these?

Our performance team here is also trying to develop a lighter-weight
(i.e., easier to set up and run) benchmark for scalability /
consolidation.  Hopefully once they get that up and running I can test
the scheduling characteristics as well.

Peace,
 -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] scheduler rate controller
  2011-11-04 14:08                 ` Lv, Hui
  2011-11-14 15:22                   ` George Dunlap
@ 2011-11-14 15:30                   ` George Dunlap
  1 sibling, 0 replies; 22+ messages in thread
From: George Dunlap @ 2011-11-14 15:30 UTC (permalink / raw)
  To: Lv, Hui
  Cc: Duan, Jiangang, Tian, Kevin, xen-devel@lists.xensource.com,
	Dario Faggioli, George Dunlap

On Fri, Nov 4, 2011 at 2:08 PM, Lv, Hui <hui.lv@intel.com> wrote:
> I'd like to try this and see the result. May also to compare the results between different solutions. As you know, SPECvirt workloads is too complex that I need some time to produce this :).
> Also we have a set of small workloads to make quick testing.

BTW, did you take any traces when running SPECVirt for your previous
tests?  If you could take some traces of SPECVirt whenever you get to
doing the new tests, that would be really helpful in understanding
both how the SPECVirt benchmark behaves, and how we can best tweak the
scheduler so that it runs better.

Thanks,
 -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] scheduler rate controller
  2011-11-03  4:28               ` George Dunlap
  2011-11-04 14:08                 ` Lv, Hui
@ 2011-11-28 17:31                 ` Lv, Hui
  2011-12-01 17:13                   ` George Dunlap
  1 sibling, 1 reply; 22+ messages in thread
From: Lv, Hui @ 2011-11-28 17:31 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir (Xen.org),
	Dong, Eddie, George Dunlap, Duan, Jiangang, Dario Faggioli

[-- Attachment #1: Type: text/plain, Size: 6900 bytes --]

Hi, George

Sorry for the late. I also met issues to boot up xen according to your patch, which is same as credit_3.patch that I attached.
So I modified it to the credit_1.patch and credit_2.patch, both of which work well.
1) credit_1 adopts "scheduling frequency counting" to decide the value of sched_ratelimit_us, which makes it adaptive.
2) credit_2 adopts the constant sched_ratelimit_us value 1000. 

Although the performance comparison data is still in process, I want to hear some feedbacks from you. 
I think I shall share the data very soon when the system becomes stable.


Best regards,

Lv, Hui


-----Original Message-----
From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George Dunlap
Sent: Thursday, November 03, 2011 12:29 PM
To: Lv, Hui
Cc: George Dunlap; Dario Faggioli; Tian, Kevin; xen-devel@lists.xensource.com; Keir (Xen.org); Dong, Eddie; Duan, Jiangang
Subject: Re: [Xen-devel] [PATCH] scheduler rate controller

On Sat, Oct 29, 2011 at 11:05 AM, Lv, Hui <hui.lv@intel.com> wrote:
> I have tried one way very similar as your idea.
> 1) to check whether current running vcpu runs less than 1ms, if yes, we will return current vcpu directly without preemption.
> It try to guarantee vcpu to run as long as 1ms, if it wants.
> It can reduce the scheduling frequency to some degree, but not very significant. Because 1ms is too light/weak with comparison to 10ms delay (SRC patch used).

Hey Hui,  Sorry for the delay in response -- FYI I'm at the XenSummit Korea now, and I'll be on holiday next week.

Do you have the patch that you wrote for the 1ms delay handy, and any numbers that you ran?  I'm a bit surprised that a 1ms delay didn't have much effect; but in any case, it seems dialing that up should have a similar effect -- e.g., if we changed that to 10ms, then it should have a similar effect to the patch that you sent before.

> As you said, if applying the seveal_ms_delay, it will happen whenever 
> system is normal or not (excessive frequency). It may possible have 
> the consequence that 1)under normal condition, it will produce worse 
> Qos than that without applying such delay,

Perhaps; but the current credit scheduler may already allow a VM to run exclusively for 30ms, so I don't think that overall it should have a big influence.

> 2) under excessive frequency condition, the mitigation effect of 1ms-delay may be too weak. In addition, your idea is to delay scheduling instead of reducing, which means the total number of scheduling would probably not change.

Well it will prevent preemption; so as long as at least one VM does not yield, it will reduce the number of schedule events to 1000 times per second.  If all VMs yield, then you can't really reduce the number of scheduling events anyway (even with your preemption-disable patch).

> I think one possible solution, is to make the value of 1ms-delay 
> adaptive according to the system status (low load or high load). If 
> so, SRC patch just covered the excessive condition currently :). 
> That's why I mentioned to treat normal and excessive conditions 
> separately and don't influence the normal case as much as possible. 
> Because we never know the consequence without amount of testing work. 
> :)

Yes, exactly. :-)

> Some of my stupid thinking :)

Well, you've obviously done a lot more looking recently than I have. :-)

I'm attaching a prototype minimum timeslice patch that I threw together last week.  It currently hangs during boot, but it will give you the idea of what I was thinking of.

Hui, can you let me know what you think of the idea, and if you find it interesting, could you try to fix it up, and test it?  Testing it with bigger values like 5ms would be really interesting.

 -George

>
> Best regards,
>
> Lv, Hui
>
>
> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: Saturday, October 29, 2011 12:19 AM
> To: Dario Faggioli
> Cc: Lv, Hui; George Dunlap; Duan, Jiangang; Tian, Kevin; 
> xen-devel@lists.xensource.com; Keir (Xen.org); Dong, Eddie
> Subject: RE: [Xen-devel] [PATCH] scheduler rate controller
>
> On Fri, 2011-10-28 at 11:09 +0100, Dario Faggioli wrote:
>> Not sure yet, I can imagine it's tricky and I need to dig a bit more 
>> in the code, but I'll let know if I found a way of doing that...
>
> There are lots of reasons why the SCHEDULE_SOFTIRQ gets raised.  But I think we want to focus on the scheduler itself raising it as a result of the .wake() callback.  Whether the .wake() happens as a result of a HW interrupt or something else, I don't think really matters.
>
> Dario and Hui,  neither of you have commented on my idea, which is simply don't preempt a VM if it has run for less than some amount of time (say, 500us or 1ms).  If a higher-priority VM is woken up, see how long the current VM has run.  If it's less than 1ms, set a 1ms timer and call schedule() then.
>
>> > > More generally speaking, I see how this feature can be useful, 
>> > > and I also think it could live in the generic schedule.c code, 
>> > > but (as George was saying) the algorithm by which rate-limiting 
>> > > is happening needs to be well known, documented and exposed to 
>> > > the user (more than by means of a couple of perf-counters).
>> > >
>> >
>> > One question is that, what is the right palace to document such information? I'd like to make it as clear as possible to the users.
>> >
>> Well, don't know, maybe a WARN (a WARN_ONCE alike thing would 
>> probably be better), or in general something that leave a footstep in 
>> the logs, so that one can find out by means of `xl dmesg' or related. 
>> Obviously, I'm not suggesting of printk-ing each suppressed schedule 
>> invocation, or the overhead would get even worse... :-P
>>
>> I'm thinking of something that happens the very first time the 
>> limiting fires, or maybe oncee some period/number of suppressions, 
>> just to remind the user that he's getting weird behaviour because 
>> _he_enabled_ rate-limiting. Hopefully, that might also be useful for 
>> the user itself to fine tune the limiting parameters, although I 
>> think the perf-counters are already quite well suited for this.
>
> As much as possible, we want the system to Just Work.  Under normal circumstances it wouldn't be too unusual for a VM to have a several-ms delay between receiving a physical interrupt and being scheduled; I think that if the 1ms delay works, having it on all the time would probably be the best solution.  That's another reason I'm in favor of trying it -- it's simple and easy to understand, and doesn't require detecting when to "turn it on".
>
>  -George
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>
>

[-- Attachment #2: credit_1.patch --]
[-- Type: application/octet-stream, Size: 5123 bytes --]

diff -ruN xen.ori/common/sched_credit.c xen/common/sched_credit.c
--- xen.ori/common/sched_credit.c	2011-11-28 08:50:58.000000000 -0500
+++ xen/common/sched_credit.c	2011-11-28 01:41:13.000000000 -0500
@@ -115,6 +115,10 @@
 boolean_param("sched_credit_default_yield", sched_credit_default_yield);
 
 /*
+ * Scheduler generic parameters
+ */
+extern int sched_ratelimit_us;
+/*
  * Physical CPU
  */
 struct csched_pcpu {
@@ -1297,10 +1301,27 @@
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_vcpu *snext;
     struct task_slice ret;
+    s_time_t runtime, tslice;
+    struct schedule_data *sd;
 
     CSCHED_STAT_CRANK(schedule);
     CSCHED_VCPU_CHECK(current);
 
+    sd = &this_cpu(schedule_data);
+    sd->s_csnum++;
+    if ((now - sd->s_src_b) >= MILLISECS(SCHED_SRC_INTERVAL))
+    {
+        if (sd->s_csnum >= SCHED_SRC_THRESHOLD)
+            sched_ratelimit_us = 1000;
+        else
+            sched_ratelimit_us = 0;
+	sd->s_src_b = now;
+	sd->s_csnum = 0;
+    }
+        
+    runtime = now - current->runstate.state_entry_time;
+    if ( runtime < 0 ) /* Does this ever happen? */
+        runtime = 0;
     if ( !is_idle_vcpu(scurr->vcpu) )
     {
         /* Update credits of a non-idle VCPU. */
@@ -1312,7 +1333,33 @@
         /* Re-instate a boosted idle VCPU as normal-idle. */
         scurr->pri = CSCHED_PRI_IDLE;
     }
-
+    /* If we have schedule rate limiting enabled, check to see
+     * how long we've run for. */
+  
+    if ( sched_ratelimit_us
+         && vcpu_runnable(current)
+         && !is_idle_vcpu(current)
+         && runtime < MICROSECS(sched_ratelimit_us) )
+    {
+        snext = scurr;
+        perfc_incr(delay_ms);
+        /* FIXME: Use prv->tslice_ms if we're also the head of hte queue */
+        //tslice = MICROSECS(sched_ratelimit_us);
+        tslice = MILLISECS(sched_ratelimit_us);
+	ret.time = tslice;
+	ret.task = snext->vcpu;
+	ret.migrated = 0;
+	CSCHED_VCPU_CHECK(ret.task);
+    	return ret;
+    }
+    else
+    {
+        /*
+         * Select next runnable local VCPU (ie top of local runq)
+         */
+        tslice = MILLISECS(CSCHED_MSECS_PER_TSLICE);
+    }
+    
     /*
      * Select next runnable local VCPU (ie top of local runq)
      */
@@ -1370,8 +1417,8 @@
     /*
      * Return task to run next...
      */
-    ret.time = (is_idle_vcpu(snext->vcpu) ?
-                -1 : MILLISECS(CSCHED_MSECS_PER_TSLICE));
+    ret.time = (is_idle_vcpu(snext->vcpu)) ?
+                -1 : tslice;
     ret.task = snext->vcpu;
 
     CSCHED_VCPU_CHECK(ret.task);
diff -ruN xen.ori/common/schedule.c xen/common/schedule.c
--- xen.ori/common/schedule.c	2011-11-28 08:50:58.000000000 -0500
+++ xen/common/schedule.c	2011-11-28 01:17:30.000000000 -0500
@@ -47,6 +47,10 @@
 bool_t sched_smt_power_savings = 0;
 boolean_param("sched_smt_power_savings", sched_smt_power_savings);
 
+/* Default scheduling rate limit: 1ms */
+int sched_ratelimit_us = 1000;
+integer_param("sched_ratelimit_us", sched_ratelimit_us);
+
 /* Various timer handlers. */
 static void s_timer_fn(void *unused);
 static void vcpu_periodic_timer_fn(void *data);
@@ -1197,6 +1201,8 @@
     sd->curr = idle_vcpu[cpu];
     init_timer(&sd->s_timer, s_timer_fn, NULL, cpu);
     atomic_set(&sd->urgent_count, 0);
+    sd->s_csnum=0;
+    sd->s_src_b=0;
 
     /* Boot CPU is dealt with later in schedule_init(). */
     if ( cpu == 0 )
diff -ruN xen.ori/include/xen/perfc_defn.h xen/include/xen/perfc_defn.h
--- xen.ori/include/xen/perfc_defn.h	2011-11-28 08:50:58.000000000 -0500
+++ xen/include/xen/perfc_defn.h	2011-11-28 01:17:30.000000000 -0500
@@ -20,6 +20,7 @@
 PERFCOUNTER(schedule,               "csched: schedule")
 PERFCOUNTER(acct_run,               "csched: acct_run")
 PERFCOUNTER(acct_no_work,           "csched: acct_no_work")
+PERFCOUNTER(delay_ms,		    "csched: 1ms delay")
 PERFCOUNTER(acct_balance,           "csched: acct_balance")
 PERFCOUNTER(acct_reorder,           "csched: acct_reorder")
 PERFCOUNTER(acct_min_credit,        "csched: acct_min_credit")
diff -ruN xen.ori/include/xen/sched-if.h xen/include/xen/sched-if.h
--- xen.ori/include/xen/sched-if.h	2011-11-28 08:50:58.000000000 -0500
+++ xen/include/xen/sched-if.h	2011-11-28 01:17:30.000000000 -0500
@@ -15,6 +15,10 @@
 
 /* cpus currently in no cpupool */
 extern cpumask_t cpupool_free_cpus;
+/* The period that SRC updated scheduler frequency */
+#define SCHED_SRC_INTERVAL       10
+/* Threshold to trigger SRC,  */
+#define SCHED_SRC_THRESHOLD     50
 
 /*
  * In order to allow a scheduler to remap the lock->cpu mapping,
@@ -32,6 +36,9 @@
     struct vcpu        *curr;           /* current task                    */
     void               *sched_priv;
     struct timer        s_timer;        /* scheduling timer                */
+    int                 s_csnum;        /* scheduling number based on last period */
+    s_time_t            s_src_b;        /* SRC conting start point */
+    bool_t              s_src_c;        /*indicate whether src should be triggered */
     atomic_t            urgent_count;   /* how many urgent vcpus           */
 };
 

[-- Attachment #3: credit_2.patch --]
[-- Type: application/octet-stream, Size: 3454 bytes --]

diff -ruN xen.ori/common/sched_credit.c xen/common/sched_credit.c
--- xen.ori/common/sched_credit.c	2011-11-28 08:50:58.000000000 -0500
+++ xen/common/sched_credit.c	2011-11-28 10:05:05.000000000 -0500
@@ -115,6 +115,10 @@
 boolean_param("sched_credit_default_yield", sched_credit_default_yield);
 
 /*
+ * Scheduler generic parameters
+ */
+extern int sched_ratelimit_us;
+/*
  * Physical CPU
  */
 struct csched_pcpu {
@@ -1297,10 +1301,15 @@
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_vcpu *snext;
     struct task_slice ret;
+    s_time_t runtime, tslice;
+    struct schedule_data *sd;
 
     CSCHED_STAT_CRANK(schedule);
     CSCHED_VCPU_CHECK(current);
 
+    runtime = now - current->runstate.state_entry_time;
+    if ( runtime < 0 ) /* Does this ever happen? */
+        runtime = 0;
     if ( !is_idle_vcpu(scurr->vcpu) )
     {
         /* Update credits of a non-idle VCPU. */
@@ -1312,7 +1321,33 @@
         /* Re-instate a boosted idle VCPU as normal-idle. */
         scurr->pri = CSCHED_PRI_IDLE;
     }
-
+    /* If we have schedule rate limiting enabled, check to see
+     * how long we've run for. */
+  
+    if ( sched_ratelimit_us
+         && vcpu_runnable(current)
+         && !is_idle_vcpu(current)
+         && runtime < MICROSECS(sched_ratelimit_us) )
+    {
+        snext = scurr;
+        perfc_incr(delay_ms);
+        /* FIXME: Use prv->tslice_ms if we're also the head of hte queue */
+        //tslice = MICROSECS(sched_ratelimit_us);
+        tslice = MILLISECS(sched_ratelimit_us);
+	ret.time = tslice;
+	ret.task = snext->vcpu;
+	ret.migrated = 0;
+	CSCHED_VCPU_CHECK(ret.task);
+    	return ret;
+    }
+    else
+    {
+        /*
+         * Select next runnable local VCPU (ie top of local runq)
+         */
+        tslice = MILLISECS(CSCHED_MSECS_PER_TSLICE);
+    }
+    
     /*
      * Select next runnable local VCPU (ie top of local runq)
      */
@@ -1370,8 +1405,8 @@
     /*
      * Return task to run next...
      */
-    ret.time = (is_idle_vcpu(snext->vcpu) ?
-                -1 : MILLISECS(CSCHED_MSECS_PER_TSLICE));
+    ret.time = (is_idle_vcpu(snext->vcpu)) ?
+                -1 : tslice;
     ret.task = snext->vcpu;
 
     CSCHED_VCPU_CHECK(ret.task);
diff -ruN xen.ori/common/schedule.c xen/common/schedule.c
--- xen.ori/common/schedule.c	2011-11-28 08:50:58.000000000 -0500
+++ xen/common/schedule.c	2011-11-28 10:13:06.000000000 -0500
@@ -47,6 +47,10 @@
 bool_t sched_smt_power_savings = 0;
 boolean_param("sched_smt_power_savings", sched_smt_power_savings);
 
+/* Default scheduling rate limit: 1ms */
+int sched_ratelimit_us = 1000;
+integer_param("sched_ratelimit_us", sched_ratelimit_us);
+
 /* Various timer handlers. */
 static void s_timer_fn(void *unused);
 static void vcpu_periodic_timer_fn(void *data);
diff -ruN xen.ori/include/xen/perfc_defn.h xen/include/xen/perfc_defn.h
--- xen.ori/include/xen/perfc_defn.h	2011-11-28 08:50:58.000000000 -0500
+++ xen/include/xen/perfc_defn.h	2011-11-28 01:17:30.000000000 -0500
@@ -20,6 +20,7 @@
 PERFCOUNTER(schedule,               "csched: schedule")
 PERFCOUNTER(acct_run,               "csched: acct_run")
 PERFCOUNTER(acct_no_work,           "csched: acct_no_work")
+PERFCOUNTER(delay_ms,		    "csched: 1ms delay")
 PERFCOUNTER(acct_balance,           "csched: acct_balance")
 PERFCOUNTER(acct_reorder,           "csched: acct_reorder")
 PERFCOUNTER(acct_min_credit,        "csched: acct_min_credit")

[-- Attachment #4: credit_3.patch --]
[-- Type: application/octet-stream, Size: 3390 bytes --]

diff -ruN xen.ori/common/sched_credit.c xen.dup/common/sched_credit.c
--- xen.ori/common/sched_credit.c	2011-11-28 08:50:58.000000000 -0500
+++ xen.dup/common/sched_credit.c	2011-11-28 11:45:07.000000000 -0500
@@ -115,6 +115,10 @@
 boolean_param("sched_credit_default_yield", sched_credit_default_yield);
 
 /*
+ * Scheduler generic parameters
+ */
+extern int sched_ratelimit_us;
+/*
  * Physical CPU
  */
 struct csched_pcpu {
@@ -1297,10 +1301,14 @@
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_vcpu *snext;
     struct task_slice ret;
+    s_time_t runtime, tslice;
 
     CSCHED_STAT_CRANK(schedule);
     CSCHED_VCPU_CHECK(current);
 
+    runtime = now - current->runstate.state_entry_time;
+    if ( runtime < 0 ) /* Does this ever happen? */
+        runtime = 0;
     if ( !is_idle_vcpu(scurr->vcpu) )
     {
         /* Update credits of a non-idle VCPU. */
@@ -1321,7 +1329,28 @@
     else
         BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
 
-    snext = __runq_elem(runq->next);
+    /* If we have schedule rate limiting enabled, check to see
+     * how long we've run for. */
+    if ( sched_ratelimit_us
+         && vcpu_runnable(current)
+         && !is_idle_vcpu(current)
+         && runtime < MICROSECS(sched_ratelimit_us) )
+    {
+        snext = scurr;
+        perfc_incr(delay_ms);
+        /* FIXME: Use prv->tslice_ms if we're also the head of hte queue */
+        //tslice = MICROSECS(sched_ratelimit_us);
+        tslice = MILLISECS(sched_ratelimit_us);
+    }
+    else
+    {
+        /*
+         * Select next runnable local VCPU (ie top of local runq)
+         */
+        snext = __runq_elem(runq->next);
+        tslice = MILLISECS(CSCHED_MSECS_PER_TSLICE);
+    }
+
     ret.migrated = 0;
 
     /* Tasklet work (which runs in idle VCPU context) overrides all else. */
@@ -1370,8 +1399,8 @@
     /*
      * Return task to run next...
      */
-    ret.time = (is_idle_vcpu(snext->vcpu) ?
-                -1 : MILLISECS(CSCHED_MSECS_PER_TSLICE));
+    ret.time = (is_idle_vcpu(snext->vcpu)) ?
+                -1 : tslice;
     ret.task = snext->vcpu;
 
     CSCHED_VCPU_CHECK(ret.task);
diff -ruN xen.ori/common/schedule.c xen.dup/common/schedule.c
--- xen.ori/common/schedule.c	2011-11-28 08:50:58.000000000 -0500
+++ xen.dup/common/schedule.c	2011-11-28 11:45:23.000000000 -0500
@@ -47,6 +47,10 @@
 bool_t sched_smt_power_savings = 0;
 boolean_param("sched_smt_power_savings", sched_smt_power_savings);
 
+/* Default scheduling rate limit: 1ms */
+int sched_ratelimit_us = 1000;
+integer_param("sched_ratelimit_us", sched_ratelimit_us);
+
 /* Various timer handlers. */
 static void s_timer_fn(void *unused);
 static void vcpu_periodic_timer_fn(void *data);
diff -ruN xen.ori/include/xen/perfc_defn.h xen.dup/include/xen/perfc_defn.h
--- xen.ori/include/xen/perfc_defn.h	2011-11-28 08:50:58.000000000 -0500
+++ xen.dup/include/xen/perfc_defn.h	2011-11-28 01:17:31.000000000 -0500
@@ -20,6 +20,7 @@
 PERFCOUNTER(schedule,               "csched: schedule")
 PERFCOUNTER(acct_run,               "csched: acct_run")
 PERFCOUNTER(acct_no_work,           "csched: acct_no_work")
+PERFCOUNTER(delay_ms,		    "csched: 1ms delay")
 PERFCOUNTER(acct_balance,           "csched: acct_balance")
 PERFCOUNTER(acct_reorder,           "csched: acct_reorder")
 PERFCOUNTER(acct_min_credit,        "csched: acct_min_credit")

[-- Attachment #5: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] scheduler rate controller
  2011-11-28 17:31                 ` Lv, Hui
@ 2011-12-01 17:13                   ` George Dunlap
  2011-12-11 15:27                     ` Lv, Hui
  0 siblings, 1 reply; 22+ messages in thread
From: George Dunlap @ 2011-12-01 17:13 UTC (permalink / raw)
  To: Lv, Hui
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir (Xen.org),
	Dong, Eddie, George Dunlap, Duan, Jiangang, Dario Faggioli

[-- Attachment #1: Type: text/plain, Size: 3696 bytes --]

On Mon, Nov 28, 2011 at 5:31 PM, Lv, Hui <hui.lv@intel.com> wrote:
> Sorry for the late. I also met issues to boot up xen according to your patch, which is same as credit_3.patch that I attached.

Thanks Hui; debugging someone else's buggy code is going beyond
expectations. :-)

> So I modified it to the credit_1.patch and credit_2.patch, both of which work well.

Standard patches for OSS development need to be -p1, not -p0; they
need to work if, in the toplevel of the directory (foo.hg) you type
"patch -p1 < bar.patch". The easiest way to make one of these patches
is to use "hg diff"; the best way (if you're using Mercurial) is to
use Mercurual queues.

> 2) credit_2 adopts the constant sched_ratelimit_us value 1000.

Looks fine overall.  One issue with the patch is that it will not only
fail to preempt for a higher-priority vcpu, it will also fail to
preempt for tasklets.  Tasklet work must be done immediately.  Perhaps
we can add "!tasklet_work_scheduled" to the list of conditions for

Why did you change "MICROSECS" to "MILLISECS" when calculating
timeslice?  In this case, it will set the timeslice to a full second!
Not what we want...

>From a software maintenance perspective, I'm not a fan of early
returns from functions.  I think it's too easy not to notice that
there's a different return path.  In this case, I think I'd prefer
adding a label, and using "goto out;" instead.

> 1) credit_1 adopts "scheduling frequency counting" to decide the value of sched_ratelimit_us, which makes it adaptive.

If you were using mercurial queues, you could put this after the last
one, and it would be easier to see the proposed "adaptive" part of the
code. :-)

Hypervisors are very complicated; it's best to keep things as
absolutely simple as possible.  This kind of mechanism is exactly the
sort of thing that makes it very hard to predict what will happen.  I
think unless you can show that it adds a significant benefit, it's
better just to use the min timeslice.

Regarding this particular code, a couple of items, just for feedback:
* All of the ratelimiting code and data structures should be in the
pluggable scheduler, not in common code.
* This code hard-codes in '1000' as the value it sets the global
variable to, overriding whatever the user may have entered on the
command-line
* Furthermore, global variable is shared by all of the cpus, however;
meaning, you may have one cpu enabling it one moment based on its own
conditions, and have nother cpu disabling it almost immediatly
afterwards, based on conditions on that cpu.  If you're testing with
this at the moment, you might as well stop -- you're going to get a
pretty random result.  If you really wanted this to be an adaptive
solution, you'd need to make a per-cpu variable with the per-cpu rate
scheduling; and then set it to the global variable (which is the user
configuration).
* Finally, this patch doesn't distinguish between schedules that
happen due to a yield, and schedules that happen due to preemption.
The only schedules we have any control over are schedules that happen
due to preemption.  If adaptivity has any value, it should pay
attention to what it can control.

I've taken your two patches, given them the proper formating, and made
them into a patch series (the first adding the ratelimiting, the
second adding the adaptive bit); they are attached.  You should be
able to easily pull them into a mercurial patchqueue using "hg qimport
/path/to/patches/*.diff".  In the future, I will not look at any
patches which do not apply using either "patch -p1" or "hg qimport."

Thanks again for all your work on this -- I hope we can get a simple,
effective solution in place soon.

 -George

[-- Attachment #2: 01-credit-ratelimit.diff --]
[-- Type: text/plain, Size: 3707 bytes --]

# HG changeset patch
# User George Dunlap <george.dunlap@eu.citrix.com>
# Date 1322758205 0
# Node ID e122485df37b57f58dbf13ec81a18169c0a7e90f
# Parent  96bbdc894224821fbc14a33e93b55688920c7fd2
imported patch credit_2.patch

diff -r c613d436ca09 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Tue Nov 22 19:03:04 2011 +0000
+++ b/xen/common/sched_credit.c	Thu Dec 01 17:15:12 2011 +0000
@@ -115,6 +115,10 @@ static bool_t __read_mostly sched_credit
 boolean_param("sched_credit_default_yield", sched_credit_default_yield);
 
 /*
+ * Scheduler generic parameters
+ */
+extern int sched_ratelimit_us;
+/*
  * Physical CPU
  */
 struct csched_pcpu {
@@ -1299,10 +1303,15 @@ csched_schedule(
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_vcpu *snext;
     struct task_slice ret;
+    s_time_t runtime, tslice;
+    struct schedule_data *sd;
 
     CSCHED_STAT_CRANK(schedule);
     CSCHED_VCPU_CHECK(current);
 
+    runtime = now - current->runstate.state_entry_time;
+    if ( runtime < 0 ) /* Does this ever happen? */
+        runtime = 0;
     if ( !is_idle_vcpu(scurr->vcpu) )
     {
         /* Update credits of a non-idle VCPU. */
@@ -1314,7 +1323,33 @@ csched_schedule(
         /* Re-instate a boosted idle VCPU as normal-idle. */
         scurr->pri = CSCHED_PRI_IDLE;
     }
-
+    /* If we have schedule rate limiting enabled, check to see
+     * how long we've run for. */
+  
+    if ( sched_ratelimit_us
+         && vcpu_runnable(current)
+         && !is_idle_vcpu(current)
+         && runtime < MICROSECS(sched_ratelimit_us) )
+    {
+        snext = scurr;
+        perfc_incr(delay_ms);
+        /* FIXME: Use prv->tslice_ms if we're also the head of hte queue */
+        //tslice = MICROSECS(sched_ratelimit_us);
+        tslice = MILLISECS(sched_ratelimit_us);
+	ret.time = tslice;
+	ret.task = snext->vcpu;
+	ret.migrated = 0;
+	CSCHED_VCPU_CHECK(ret.task);
+    	return ret;
+    }
+    else
+    {
+        /*
+         * Select next runnable local VCPU (ie top of local runq)
+         */
+        tslice = MILLISECS(CSCHED_MSECS_PER_TSLICE);
+    }
+    
     /*
      * Select next runnable local VCPU (ie top of local runq)
      */
@@ -1373,7 +1408,7 @@ csched_schedule(
      * Return task to run next...
      */
     ret.time = (is_idle_vcpu(snext->vcpu) ?
-                -1 : MILLISECS(CSCHED_MSECS_PER_TSLICE));
+                -1 : tslice);
     ret.task = snext->vcpu;
 
     CSCHED_VCPU_CHECK(ret.task);
diff -r c613d436ca09 xen/common/schedule.c
--- a/xen/common/schedule.c	Tue Nov 22 19:03:04 2011 +0000
+++ b/xen/common/schedule.c	Thu Dec 01 17:15:12 2011 +0000
@@ -47,6 +47,10 @@ string_param("sched", opt_sched);
 bool_t sched_smt_power_savings = 0;
 boolean_param("sched_smt_power_savings", sched_smt_power_savings);
 
+/* Default scheduling rate limit: 1ms */
+int sched_ratelimit_us = 1000;
+integer_param("sched_ratelimit_us", sched_ratelimit_us);
+
 /* Various timer handlers. */
 static void s_timer_fn(void *unused);
 static void vcpu_periodic_timer_fn(void *data);
diff -r c613d436ca09 xen/include/xen/perfc_defn.h
--- a/xen/include/xen/perfc_defn.h	Tue Nov 22 19:03:04 2011 +0000
+++ b/xen/include/xen/perfc_defn.h	Thu Dec 01 17:15:12 2011 +0000
@@ -20,6 +20,7 @@ PERFCOUNTER(vcpu_check,             "csc
 PERFCOUNTER(schedule,               "csched: schedule")
 PERFCOUNTER(acct_run,               "csched: acct_run")
 PERFCOUNTER(acct_no_work,           "csched: acct_no_work")
+PERFCOUNTER(delay_ms,		    "csched: 1ms delay")
 PERFCOUNTER(acct_balance,           "csched: acct_balance")
 PERFCOUNTER(acct_reorder,           "csched: acct_reorder")
 PERFCOUNTER(acct_min_credit,        "csched: acct_min_credit")

[-- Attachment #3: 02-credit-adaptive.diff --]
[-- Type: text/plain, Size: 2311 bytes --]

diff -r e122485df37b xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Thu Dec 01 16:50:05 2011 +0000
+++ b/xen/common/sched_credit.c	Thu Dec 01 16:55:27 2011 +0000
@@ -1307,6 +1307,18 @@ csched_schedule(
     CSCHED_STAT_CRANK(schedule);
     CSCHED_VCPU_CHECK(current);
 
+    sd = &this_cpu(schedule_data);
+    sd->s_csnum++;
+    if ((now - sd->s_src_b) >= MILLISECS(SCHED_SRC_INTERVAL))
+    {
+        if (sd->s_csnum >= SCHED_SRC_THRESHOLD)
+            sched_ratelimit_us = 1000;
+        else
+            sched_ratelimit_us = 0;
+	sd->s_src_b = now;
+	sd->s_csnum = 0;
+    }
+        
     runtime = now - current->runstate.state_entry_time;
     if ( runtime < 0 ) /* Does this ever happen? */
         runtime = 0;
diff -r e122485df37b xen/common/schedule.c
--- a/xen/common/schedule.c	Thu Dec 01 16:50:05 2011 +0000
+++ b/xen/common/schedule.c	Thu Dec 01 16:55:27 2011 +0000
@@ -1261,6 +1261,8 @@ static int cpu_schedule_up(unsigned int 
     sd->curr = idle_vcpu[cpu];
     init_timer(&sd->s_timer, s_timer_fn, NULL, cpu);
     atomic_set(&sd->urgent_count, 0);
+    sd->s_csnum=0;
+    sd->s_src_b=0;
 
     /* Boot CPU is dealt with later in schedule_init(). */
     if ( cpu == 0 )
diff -r e122485df37b xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h	Thu Dec 01 16:50:05 2011 +0000
+++ b/xen/include/xen/sched-if.h	Thu Dec 01 16:55:27 2011 +0000
@@ -15,6 +15,10 @@ extern struct cpupool *cpupool0;
 
 /* cpus currently in no cpupool */
 extern cpumask_t cpupool_free_cpus;
+/* The period that SRC updated scheduler frequency */
+#define SCHED_SRC_INTERVAL       10
+/* Threshold to trigger SRC,  */
+#define SCHED_SRC_THRESHOLD     50
 
 /*
  * In order to allow a scheduler to remap the lock->cpu mapping,
@@ -32,6 +36,9 @@ struct schedule_data {
     struct vcpu        *curr;           /* current task                    */
     void               *sched_priv;
     struct timer        s_timer;        /* scheduling timer                */
+    int                 s_csnum;        /* scheduling number based on last period */
+    s_time_t            s_src_b;        /* SRC conting start point */
+    bool_t              s_src_c;        /*indicate whether src should be triggered */
     atomic_t            urgent_count;   /* how many urgent vcpus           */
 };
 

[-- Attachment #4: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] scheduler rate controller
  2011-12-01 17:13                   ` George Dunlap
@ 2011-12-11 15:27                     ` Lv, Hui
  2011-12-12 11:43                       ` George Dunlap
  0 siblings, 1 reply; 22+ messages in thread
From: Lv, Hui @ 2011-12-11 15:27 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir (Xen.org),
	Dong, Eddie, George Dunlap, Duan, Jiangang, Dario Faggioli

[-- Attachment #1: Type: text/plain, Size: 3507 bytes --]

Hi George,

Thank you very much for your feedbacks.
I have finished the measurement work based on the delay method. From the comparable results, 1ms-delay can do as well as SRC patch to gain significant performance boost without obvious drawbacks.

1. Basically, the "delay method" can achieve nearly the same benefits as my previous SRC patch, 11% overall performance boost for SPECvirt than original credit scheduler.
2. I have tried 1ms delay and 10ms delay, there is no big difference between these two configurations. (1ms is enough to achieve a good performance)
3. I have compared different load level response time/latency (low, high, peak), "delay method" didn't bring very much Qos increase.
4. 1ms delay can reduce 30% context switch at peak performance, where produces the benefits.

You can find the raw data from the attached excel file.
The attached credit_1.diff patch works stable at my side.


> Looks fine overall.  One issue with the patch is that it will not only fail to
> preempt for a higher-priority vcpu, it will also fail to preempt for tasklets.
> Tasklet work must be done immediately.  Perhaps we can add
> "!tasklet_work_scheduled" to the list of conditions for

Yes, added "!tasklet_work_scheduled". I have done the experiments to compare with/without this, there is not big difference. 
> 
> Why did you change "MICROSECS" to "MILLISECS" when calculating timeslice?
> In this case, it will set the timeslice to a full second!
> Not what we want...
Sorry , it's my typo, I have changed

> From a software maintenance perspective, I'm not a fan of early returns from
> functions.  I think it's too easy not to notice that there's a different return
> path.  In this case, I think I'd prefer adding a label, and using "goto out;"
> instead.

Followed this code style.

> If you were using mercurial queues, you could put this after the last one, and it
> would be easier to see the proposed "adaptive" part of the code. :-)
> 
> Hypervisors are very complicated; it's best to keep things as absolutely simple
> as possible.  This kind of mechanism is exactly the sort of thing that makes it
> very hard to predict what will happen.  I think unless you can show that it adds
> a significant benefit, it's better just to use the min timeslice.

In fact, I have tested the results with 1ms delay and 10ms delay, there is no significant performance improvement. It means, even though we select adaptive method, there is also no significant performance boost with comparison to 1ms.
1ms is good enough insofar. 

> Regarding this particular code, a couple of items, just for feedback:
> * All of the ratelimiting code and data structures should be in the pluggable
> scheduler, not in common code.
Agreed.


> an adaptive solution, you'd need to make a per-cpu variable with the per-cpu
> rate scheduling; and then set it to the global variable (which is the user
> configuration).

It seems no need to make it adaptive now :)

> I've taken your two patches, given them the proper formating, and made them
> into a patch series (the first adding the ratelimiting, the second adding the
> adaptive bit); they are attached.  You should be able to easily pull them into a
> mercurial patchqueue using "hg qimport /path/to/patches/*.diff".  In the
> future, I will not look at any patches which do not apply using either "patch -p1"
> or "hg qimport."

Thanks for the coach to submit patch in a right way.
> 
>  -George

[-- Attachment #2: credit_1.diff --]
[-- Type: application/octet-stream, Size: 4021 bytes --]

diff -r 1c58bb664d8d xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Thu Dec 08 17:15:16 2011 +0000
+++ b/xen/common/sched_credit.c	Sun Dec 11 02:39:07 2011 -0500
@@ -109,7 +109,9 @@ static bool_t __read_mostly sched_credit
 boolean_param("sched_credit_default_yield", sched_credit_default_yield);
 static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS;
 integer_param("sched_credit_tslice_ms", sched_credit_tslice_ms);
-
+/* Scheduler generic parameters
+*/
+extern int sched_ratelimit_us;
 /*
  * Physical CPU
  */
@@ -1297,10 +1299,15 @@ csched_schedule(
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_vcpu *snext;
     struct task_slice ret;
+    s_time_t runtime, tslice;
 
     CSCHED_STAT_CRANK(schedule);
     CSCHED_VCPU_CHECK(current);
 
+    runtime = now - current->runstate.state_entry_time;
+    if ( runtime < 0 ) /* Does this ever happen? */
+        runtime = 0;
+
     if ( !is_idle_vcpu(scurr->vcpu) )
     {
         /* Update credits of a non-idle VCPU. */
@@ -1313,6 +1320,41 @@ csched_schedule(
         scurr->pri = CSCHED_PRI_IDLE;
     }
 
+    /* Choices, choices:
+     * - If we have a tasklet, we need to run the idle vcpu no matter what.
+     * - If sched rate limiting is in effect, and the current vcpu has
+     *   run for less than that amount of time, continue the current one,
+     *   but with a shorter timeslice and return it immediately
+     * - Otherwise, chose the one with the highest priority (which may
+     *   be the one currently running)
+     * - If the currently running one is TS_OVER, see if there
+     *   is a higher priority one waiting on the runqueue of another
+     *   cpu and steal it.
+     */
+
+    /* If we have schedule rate limiting enabled, check to see
+     * how long we've run for. */  
+    if ( sched_ratelimit_us
+	 && !tasklet_work_scheduled
+         && vcpu_runnable(current)
+         && !is_idle_vcpu(current)
+         && runtime < MICROSECS(sched_ratelimit_us) )
+    {
+        snext = scurr;
+	snext->start_time += now;
+        perfc_incr(delay_ms);
+        tslice = MICROSECS(sched_ratelimit_us);
+	ret.migrated = 0;
+	goto out;
+    }
+    else
+    {
+        /*
+         * Select next runnable local VCPU (ie top of local runq)
+         */
+        tslice = MILLISECS(prv->tslice_ms);
+    }
+    
     /*
      * Select next runnable local VCPU (ie top of local runq)
      */
@@ -1367,11 +1409,12 @@ csched_schedule(
     if ( !is_idle_vcpu(snext->vcpu) )
         snext->start_time += now;
 
+out:
     /*
      * Return task to run next...
      */
     ret.time = (is_idle_vcpu(snext->vcpu) ?
-                -1 : MILLISECS(prv->tslice_ms));
+                -1 : tslice);
     ret.task = snext->vcpu;
 
     CSCHED_VCPU_CHECK(ret.task);
diff -r 1c58bb664d8d xen/common/schedule.c
--- a/xen/common/schedule.c	Thu Dec 08 17:15:16 2011 +0000
+++ b/xen/common/schedule.c	Sun Dec 11 02:39:07 2011 -0500
@@ -47,6 +47,10 @@ string_param("sched", opt_sched);
 bool_t sched_smt_power_savings = 0;
 boolean_param("sched_smt_power_savings", sched_smt_power_savings);
 
+/* Default scheduling rate limit: 1ms */
+int sched_ratelimit_us = 1000;
+integer_param("sched_ratelimit_us", sched_ratelimit_us);
+
 /* Various timer handlers. */
 static void s_timer_fn(void *unused);
 static void vcpu_periodic_timer_fn(void *data);
diff -r 1c58bb664d8d xen/include/xen/perfc_defn.h
--- a/xen/include/xen/perfc_defn.h	Thu Dec 08 17:15:16 2011 +0000
+++ b/xen/include/xen/perfc_defn.h	Sun Dec 11 02:39:07 2011 -0500
@@ -15,7 +15,7 @@ PERFCOUNTER(ipis,                   "#IP
 PERFCOUNTER(sched_irq,              "sched: timer")
 PERFCOUNTER(sched_run,              "sched: runs through scheduler")
 PERFCOUNTER(sched_ctx,              "sched: context switches")
-
+PERFCOUNTER(delay_ms,		    "csched: delay")
 PERFCOUNTER(vcpu_check,             "csched: vcpu_check")
 PERFCOUNTER(schedule,               "csched: schedule")
 PERFCOUNTER(acct_run,               "csched: acct_run")

[-- Attachment #3: scheduler comparison.xlsx --]
[-- Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, Size: 12105 bytes --]

[-- Attachment #4: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] scheduler rate controller
  2011-12-11 15:27                     ` Lv, Hui
@ 2011-12-12 11:43                       ` George Dunlap
  2011-12-13  2:24                         ` Lv, Hui
  0 siblings, 1 reply; 22+ messages in thread
From: George Dunlap @ 2011-12-12 11:43 UTC (permalink / raw)
  To: Lv, Hui
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir (Xen.org),
	Dong, Eddie, George Dunlap, Duan, Jiangang, Dario Faggioli

On Sun, Dec 11, 2011 at 3:27 PM, Lv, Hui <hui.lv@intel.com> wrote:
> Hi George,
>
> Thank you very much for your feedbacks.
> I have finished the measurement work based on the delay method. From the comparable results, 1ms-delay can do as well as SRC patch to gain significant performance boost without obvious drawbacks.
> 1. Basically, the "delay method" can achieve nearly the same benefits as my previous SRC patch, 11% overall performance boost for SPECvirt than original credit scheduler.
> 2. I have tried 1ms delay and 10ms delay, there is no big difference between these two configurations. (1ms is enough to achieve a good performance)
> 3. I have compared different load level response time/latency (low, high, peak), "delay method" didn't bring very much Qos increase.

Thanks Hui, those are good results.  Just one question: What's QoS
supposed to measure?  Is this a metric that SPECVirt reports?  Is
higher or lower better?

The patch looks good, but there's one last nitpick: Several of your
lines have hard tab characters in them; tabs are officially verboten
in the Xen code.  Please replace them with spaces.  After that, I
think we're ready to check it in.

One more small request: would it be possible to get some short xen
traces of SPECVirt running, at least with the 1ms-delay patch, and if
possible without it?  I'd like to have a better understanding of what
kind of scheduling workload SPECVirt creates, and how the 1ms delay
affects it.  If you have other priorities, don't worry, I'll wait
until our performance team here gets it set up.  If you do have time,
the command to use is as follows:

xentrace -D -e 0x21000 -T 30 /path/to/file.trace

This will take *just* scheduling traces for 30 seconds.  If you could
run it when the benchmark is going full throttle, that should help me
get an idea what the scheduling looks like.

Thanks,
 -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] scheduler rate controller
  2011-12-12 11:43                       ` George Dunlap
@ 2011-12-13  2:24                         ` Lv, Hui
  0 siblings, 0 replies; 22+ messages in thread
From: Lv, Hui @ 2011-12-13  2:24 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir (Xen.org),
	Dong, Eddie, George Dunlap, Duan, Jiangang, Dario Faggioli

> Thanks Hui, those are good results.  Just one question: What's QoS
> supposed to measure?  Is this a metric that SPECVirt reports?  Is
> higher or lower better?
> 
Yes, it was reported by SPECvirt. The lower, the better.
 
> The patch looks good, but there's one last nitpick: Several of your
> lines have hard tab characters in them; tabs are officially verboten
> in the Xen code.  Please replace them with spaces.  After that, I
> think we're ready to check it in.
>
Great!!
 
> the command to use is as follows:
> 
> xentrace -D -e 0x21000 -T 30 /path/to/file.trace
> 
> This will take *just* scheduling traces for 30 seconds.  If you could
> run it when the benchmark is going full throttle, that should help me
> get an idea what the scheduling looks like.
> 
Yes, I'd like to do this. 
Hope the data size is not very large. Since, for full throttle condition, Xen hypervisor occupied a large CPU time (around 20%). 
Correspondingly, the trace data is very big.
I'll let you know when data is ready.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2011-12-13  2:24 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-10-24  3:36 [PATCH] scheduler rate controller Lv, Hui
2011-10-24 16:17 ` George Dunlap
2011-10-24 16:57   ` Keir Fraser
2011-10-24 21:20   ` Dario Faggioli
2011-10-25  7:57   ` Dario Faggioli
2011-10-28  2:07   ` Lv, Hui
2011-10-28  8:10     ` Dario Faggioli
2011-10-28  8:52       ` Lv, Hui
2011-10-28 10:09         ` Dario Faggioli
2011-10-28 16:18           ` George Dunlap
2011-10-29  2:05             ` Lv, Hui
2011-10-31 10:16               ` Dario Faggioli
2011-11-03  4:28               ` George Dunlap
2011-11-04 14:08                 ` Lv, Hui
2011-11-14 15:22                   ` George Dunlap
2011-11-14 15:30                   ` George Dunlap
2011-11-28 17:31                 ` Lv, Hui
2011-12-01 17:13                   ` George Dunlap
2011-12-11 15:27                     ` Lv, Hui
2011-12-12 11:43                       ` George Dunlap
2011-12-13  2:24                         ` Lv, Hui
2011-10-31  9:59             ` Dario Faggioli

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.