linux-pm.vger.kernel.org archive mirror
* Re: [RFC PATCH v2 1/8] cpuidle: menu: extract prediction functionality
       [not found]   ` <1629755.KbDSmDPDTX@aspire.rjw.lan>
@ 2017-10-16  2:46     ` Li, Aubrey
  0 siblings, 0 replies; 22+ messages in thread
From: Li, Aubrey @ 2017-10-16  2:46 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 8:26, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:27 AM CEST Aubrey Li wrote:
>> There are several factors in the menu governor to predict the next
>> idle interval:
>> - the next timer
>> - the recent idle interval history
>> - the corrected idle interval pattern
>> These factors are common enough to be extracted into a single function.
>>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> 
> This patch alone would break things AFAICS, because it removes code from
> menu_select() without a replacement (and menu_predict() is never called
> just yet).
> 
> Please always do your best to ensure that things will work after *every*
> patch in a series.

okay, I'll correct this in the next version.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry
       [not found]   ` <2672521.fEEa1b19Vu@aspire.rjw.lan>
@ 2017-10-16  3:11     ` Li, Aubrey
  2017-10-17  0:05       ` Rafael J. Wysocki
  0 siblings, 1 reply; 22+ messages in thread
From: Li, Aubrey @ 2017-10-16  3:11 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 8:35, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:28 AM CEST Aubrey Li wrote:
>> Record the overhead of idle entry in microseconds
>>
> 
> What is this needed for?

We need to figure out how short an idle has to be to count as a short idle,
and recording the overhead serves this purpose: the short-idle threshold is
based on this overhead.

> 
>> +void cpuidle_entry_end(void)
>> +{
>> +	struct cpuidle_device *dev = cpuidle_get_device();
>> +	u64 overhead;
>> +	s64 diff;
>> +
>> +	if (dev) {
>> +		dev->idle_stat.entry_end = local_clock();
>> +		overhead = div_u64(dev->idle_stat.entry_end -
>> +				dev->idle_stat.entry_start, NSEC_PER_USEC);
> 
> Is the conversion really necessary?
> 
> If so, then why?

We can choose between nanoseconds and microseconds. Given the short-idle
pattern these workloads produce, I think microsecond resolution is good
enough for the real workload.

Another reason is that the prediction from the idle governor is in
microseconds, so I convert it for comparison purposes.
> 
> And if there is a good reason, what about using right shift to do
> an approximate conversion to avoid the extra division here?

Sure, >> 10 works for me, as I don't think precision is a big deal here.
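
For reference, a standalone sketch of that shift-based conversion (the helper
name is mine; since 2^10 = 1024 rather than 1000, the result reads about 2.4%
low):

#include <stdio.h>

typedef unsigned long long u64;

/* approximate ns -> us: divide by 1024 instead of 1000 */
static u64 ns_to_us_approx(u64 ns)
{
	return ns >> 10;
}

int main(void)
{
	u64 ns = 12345678;

	/* exact: 12345us; approximated: 12056us (~2.4% low) */
	printf("%llu ns ~= %llu us\n", ns, ns_to_us_approx(ns));
	return 0;
}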

> 
>> +		diff = overhead - dev->idle_stat.overhead;
>> +		dev->idle_stat.overhead += diff >> 3;
> 
> Can you please explain what happens in the two lines above?

Online average computing algorithm, stolen from update_avg() @ kernel/sched/core.c.
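
For reference, a minimal standalone sketch of that averaging (same 1/8
weighting as update_avg(); the sample values are made up):

#include <stdio.h>

typedef unsigned long long u64;
typedef long long s64;

/* avg += (sample - avg) / 8, as in update_avg() in kernel/sched/core.c */
static void update_avg(u64 *avg, u64 sample)
{
	s64 diff = sample - *avg;

	*avg += diff >> 3;
}

int main(void)
{
	u64 overhead = 10;			/* current average, in us */
	u64 samples[] = { 8, 14, 9, 12 };	/* new entry/exit measurements */
	int i;

	for (i = 0; i < 4; i++) {
		update_avg(&overhead, samples[i]);
		printf("sample %llu -> avg %llu\n", samples[i], overhead);
	}
	return 0;
}
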
> 
>> +		/*
>> +		 * make sure overhead is at least 1us
>> +		 */
>> +		if (dev->idle_stat.overhead == 0)
>> +			dev->idle_stat.overhead = 1;
>> +	}
>> +}
>> +
>>  /**
>>   * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
>>   */
>> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
>> index fc1e5d7..cad9b71 100644
>> --- a/include/linux/cpuidle.h
>> +++ b/include/linux/cpuidle.h
>> @@ -72,6 +72,15 @@ struct cpuidle_device_kobj;
>>  struct cpuidle_state_kobj;
>>  struct cpuidle_driver_kobj;
>>  
>> +struct cpuidle_stat {
>> +	u64			entry_start;	/* nanosecond */
>> +	u64			entry_end;	/* nanosecond */
>> +	u64			overhead;	/* nanosecond */
>> +	unsigned int		predicted_us;	/* microsecond */
>> +	bool			predicted;	/* ever predicted? */
>> +	bool			fast_idle;	/* fast idle? */
> 
> Some of the fields here are never used in the code after this patch.
> 
> Also it would be good to add a comment describing the meaning of the
> fields.
> 
okay, will add in the next version.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
       [not found]   ` <4523111.uMcC96MW3N@aspire.rjw.lan>
@ 2017-10-16  3:26     ` Li, Aubrey
  2017-10-16  4:45       ` Mike Galbraith
  0 siblings, 1 reply; 22+ messages in thread
From: Li, Aubrey @ 2017-10-16  3:26 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 8:51, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:30 AM CEST Aubrey Li wrote:
>> If the next idle is expected to be a fast idle, we should keep the tick
>> on before going into idle
>>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> 
> This also can be merged with the previous patch (and the [2/8]) IMO.
> 

okay, will merge in the next version.

>> ---
>>  drivers/cpuidle/cpuidle.c | 14 ++++++++++++++
>>  include/linux/cpuidle.h   |  2 ++
>>  kernel/time/tick-sched.c  |  4 ++++
>>  3 files changed, 20 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index ef6f7dd..6cb7e17 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -370,6 +370,20 @@ void cpuidle_predict(void)
>>  }
>>  
>>  /**
>> + * cpuidle_fast_idle - predict whether or not the coming idle is a fast idle
>> + * This function can be called in the irq exit path, so keep it as fast as possible
>> + */
>> +bool cpuidle_fast_idle(void)
>> +{
>> +	struct cpuidle_device *dev = cpuidle_get_device();
>> +
>> +	if (!dev)
>> +		return false;
>> +
>> +	return dev->idle_stat.fast_idle;
> 
> return dev && dev->idle_stat.fast_idle;

Thanks!

>>  
>> @@ -916,6 +917,9 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
>>  			return false;
>>  	}
>>  
>> +	if (cpuidle_fast_idle())
>> +		return false;
>> +
>>  	return true;
> 
> return !cpuidle_fast_idle();

And thanks!
> 
>>  }
>>  
>>
> 
> And IMO there is quite a bit too much marketing in the "fast_idle" name,
> as it seems to be all about avoiding stopping the tick if the predicted idle
> duration is short enough.
> 

Okay, and what's your suggestion? :)
I'll try to move quiet_vmstat() into the normal idle branch if this patch series
is reasonable. Is fast_idle a good indicator for it?

Thanks,
-Aubrey


* Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
  2017-10-16  3:26     ` [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle Li, Aubrey
@ 2017-10-16  4:45       ` Mike Galbraith
  2017-10-16  5:34         ` Li, Aubrey
  0 siblings, 1 reply; 22+ messages in thread
From: Mike Galbraith @ 2017-10-16  4:45 UTC (permalink / raw)
  To: Li, Aubrey, Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On Mon, 2017-10-16 at 11:26 +0800, Li, Aubrey wrote:
> 
> I'll try to move quiet_vmstat() into the normal idle branch if this patch series
> is reasonable. Is fast_idle a good indicator for it?

see x86_tip 62cb1188ed86 sched/idle: Move quiet_vmstate() into the NOHZ code

	-Mike


* Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
  2017-10-16  4:45       ` Mike Galbraith
@ 2017-10-16  5:34         ` Li, Aubrey
  2017-10-16  6:25           ` Mike Galbraith
  0 siblings, 1 reply; 22+ messages in thread
From: Li, Aubrey @ 2017-10-16  5:34 UTC (permalink / raw)
  To: Mike Galbraith, Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/16 12:45, Mike Galbraith wrote:
> On Mon, 2017-10-16 at 11:26 +0800, Li, Aubrey wrote:
>>
>> I'll try to move quiet_vmstat() into the normal idle branch if this patch series
>> is reasonable. Is fast_idle a good indicator for it?
> 
> see x86_tip 62cb1188ed86 sched/idle: Move quiet_vmstate() into the NOHZ code

It looks like this commit makes the tick-stop path performance-critical, as it can
be invoked in the interrupt exit path?

->smp_apic_timer_interrupt
-->irq_exit
--->tick_nohz_irq_exit
---->__tick_nohz_idle_enter
----->tick_nohz_stop_sched_tick
------>quiet_vmstat

Thanks,
-Aubrey


* Re: [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable
       [not found]   ` <2242303.t20yq9Lc6j@aspire.rjw.lan>
@ 2017-10-16  6:00     ` Li, Aubrey
  2017-10-17  0:01       ` Rafael J. Wysocki
  0 siblings, 1 reply; 22+ messages in thread
From: Li, Aubrey @ 2017-10-16  6:00 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 8:59, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:32 AM CEST Aubrey Li wrote:
>> Add a knob to make fast idle threshold tunable
>>
>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> 
> I first of all am not sure about the need to add a tunable for this at all
> in the first place.

Actually I think a fixed value (10) might be good enough, but I'm not quite sure
if there is a requirement to tune it for different scenarios; for example, even
if the predicted idle interval is 100x the overhead, I may still want a fast path
for a better benchmark score?
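
To make that concrete, the check I have in mind is roughly the following
sketch (the helper and its parameters are illustrative, not the exact code in
this series):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned long long u64;

/*
 * Classify the coming idle as "fast" when the predicted interval is
 * within ratio times the measured average entry/exit overhead.
 */
static bool is_fast_idle(u64 predicted_us, u64 overhead_us, unsigned int ratio)
{
	return predicted_us < ratio * overhead_us;
}

int main(void)
{
	/* overhead 10us, ratio 10: idles under 100us take the fast path */
	printf("4us:   %d\n", is_fast_idle(4, 10, 10));		/* 1 */
	printf("500us: %d\n", is_fast_idle(500, 10, 10));	/* 0 */
	return 0;
}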

>> @@ -1229,6 +1230,17 @@ static struct ctl_table kern_table[] = {
>>  		.extra2		= &one,
>>  	},
>>  #endif
>> +#ifdef CONFIG_CPU_IDLE
>> +	{
>> +		.procname       = "fast_idle_ratio",
>> +		.data           = &sysctl_fast_idle_ratio,
>> +		.maxlen         = sizeof(int),
>> +		.mode           = 0644,
>> +		.proc_handler   = proc_dointvec_minmax,
>> +		.extra1         = &one,
>> +		.extra2         = &one_hundred,
>> +	},
>> +#endif
>>  	{ }
>>  };
>>  
> 
> And if there is a good enough reason to add it, shouldn't the tunable be
> there in the cpuidle framework?
> 
sure, if it makes sense, I'll move it into cpuidle/sysfs.c

Thanks,
-Aubrey


* Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
  2017-10-16  5:34         ` Li, Aubrey
@ 2017-10-16  6:25           ` Mike Galbraith
  2017-10-16  6:31             ` Li, Aubrey
  0 siblings, 1 reply; 22+ messages in thread
From: Mike Galbraith @ 2017-10-16  6:25 UTC (permalink / raw)
  To: Li, Aubrey, Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On Mon, 2017-10-16 at 13:34 +0800, Li, Aubrey wrote:
> On 2017/10/16 12:45, Mike Galbraith wrote:
> > On Mon, 2017-10-16 at 11:26 +0800, Li, Aubrey wrote:
> >>
> >> I'll try to move quiet_vmstat() into the normal idle branch if this patch series
> >> is reasonable. Is fast_idle a good indicator for it?
> > 
> > see x86_tip 62cb1188ed86 sched/idle: Move quiet_vmstate() into the NOHZ code
> 
> It looks like this commit makes the tick-stop path performance-critical, as it can
> be invoked in the interrupt exit path?

do_idle() ain't critical?  It is in my book.  Hopefully, you're about
to make that idle_stat.fast_idle thingy liberal enough that we cease
mucking about with the tick on every microscopic idle (and I can then
trash my years old local patch to avoid needlessly eating that cost).

	-Mike


* Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
  2017-10-16  6:25           ` Mike Galbraith
@ 2017-10-16  6:31             ` Li, Aubrey
  0 siblings, 0 replies; 22+ messages in thread
From: Li, Aubrey @ 2017-10-16  6:31 UTC (permalink / raw)
  To: Mike Galbraith, Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/16 14:25, Mike Galbraith wrote:
> On Mon, 2017-10-16 at 13:34 +0800, Li, Aubrey wrote:
>> On 2017/10/16 12:45, Mike Galbraith wrote:
>>> On Mon, 2017-10-16 at 11:26 +0800, Li, Aubrey wrote:
>>>>
>>>> I'll try to move quiet_vmstat() into the normal idle branch if this patch series
>>>> is reasonable. Is fast_idle a good indicator for it?
>>>
>>> see x86_tip 62cb1188ed86 sched/idle: Move quiet_vmstate() into the NOHZ code
>>
>> It looks like this commit makes the tick-stop path performance-critical, as it can
>> be invoked in the interrupt exit path?
> 
> do_idle() ain't critical?  It is in my book.  Hopefully, you're about
> to make that idle_stat.fast_idle thingy liberal enough that we cease
> mucking about with the tick on every microscopic idle (and I can then
> trash my years old local patch to avoid needlessly eating that cost).
> 

Any feedback on the rest of the patch set is welcome, Mike. :)

Thanks,
-Aubrey


* Re: [RFC PATCH v2 5/8] timers: keep sleep length updated as needed
       [not found]   ` <1554921.dz8jk4n8cL@aspire.rjw.lan>
@ 2017-10-16  6:46     ` Li, Aubrey
  2017-10-16 23:58       ` Rafael J. Wysocki
  0 siblings, 1 reply; 22+ messages in thread
From: Li, Aubrey @ 2017-10-16  6:46 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 8:56, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:31 AM CEST Aubrey Li wrote:
>> sleep length indicates how long we'll be idle. Currently, it's updated
>> only when tick nohz enters. This patch series makes a new requirement
>> on the tick, so we should keep the sleep length updated as needed
> 
> So what exactly would be the problem with leaving things as they are?

Previously, ts->sleep_length was only updated when the tick was stopped.

As follows, in

__tick_nohz_idle_enter()
{
	if (can_stop_idle_tick() /* return true */) {
		tick_nohz_stop_sched_tick()
			|
			|-----> update sleep_length
	}
}

Now ts->sleep_length is needed outside of tick_nohz_idle_enter(), so we want
to update sleep_length every time we read it.

If we leave it unchanged, the prediction could read a sleep_length computed a
long time ago if the system keeps ticking.
> 
>> ---
>>  kernel/time/tick-sched.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>> index d663fab..94fb9b8 100644
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -1008,8 +1008,11 @@ void tick_nohz_irq_exit(void)
>>   */
>>  ktime_t tick_nohz_get_sleep_length(void)
>>  {
>> +	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
>>  	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
>>  
>> +	ts->sleep_length = ktime_sub(dev->next_event, ktime_get());
>> +
>>  	return ts->sleep_length;
>>  }
>>  
>>
> 
> I probably wouldn't do it this way ...
> 
> 

May I know your detailed thoughts?

Thanks,
-Aubrey


* Re: [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality
       [not found] ` <3026355.QRuoy6eIZM@aspire.rjw.lan>
@ 2017-10-16  7:44   ` Li, Aubrey
  2017-10-17  0:07     ` Rafael J. Wysocki
  0 siblings, 1 reply; 22+ messages in thread
From: Li, Aubrey @ 2017-10-16  7:44 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 9:14, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:26 AM CEST Aubrey Li wrote:
>> We found that under some latency-intensive workloads, short idle periods
>> occur very frequently, and the idle entry and exit paths start to dominate,
>> so it's important to optimize them. To determine the short-idle pattern, we
>> need to figure out how long the coming idle will be and the threshold of the
>> short idle interval.
>>
>> A cpu idle prediction functionality is introduced in this proposal to catch
>> the short-idle pattern.
>>
>> Firstly, we check the IRQ timings subsystem to see if there is an event
>> coming soon.
>> -- https://lwn.net/Articles/691297/
>>
>> Secondly, we check the idle statistics of the scheduler to see if it's
>> likely we'll go into a short idle.
>> -- https://patchwork.kernel.org/patch/2839221/
>>
>> Thirdly, we predict the next idle interval by using the prediction
>> functionality in the idle governor, if it has one.
>>
>> For the threshold of the short idle interval, we record the timestamps of
>> idle entry and exit, and multiply the measured overhead by a tunable
>> parameter here:
>> -- /proc/sys/kernel/fast_idle_ratio
>>
>> In this proposal, we use the output of the idle prediction to skip turning
>> the tick off if a short idle is determined. Reprogramming the hardware timer
>> twice (off and on) is expensive for a very short idle. There are more
>> potential optimizations that can be done based on the same indicator.
>>
>> I observed that when the system is idle, the idle predictor reports 20/s
>> long idles and ZERO fast idles on one CPU. And when the workload is running,
>> the idle predictor reports 72899/s fast idles and ZERO long idles on the
>> same CPU.
>>
>> Aubrey Li (8):
>>   cpuidle: menu: extract prediction functionality
>>   cpuidle: record the overhead of idle entry
>>   cpuidle: add a new predict interface
>>   tick/nohz: keep tick on for a fast idle
>>   timers: keep sleep length updated as needed
>>   cpuidle: make fast idle threshold tunable
>>   cpuidle: introduce irq timing to make idle prediction
>>   cpuidle: introduce run queue average idle to make idle prediction
>>
>>  drivers/cpuidle/Kconfig          |   1 +
>>  drivers/cpuidle/cpuidle.c        | 109 +++++++++++++++++++++++++++++++++++++++
>>  drivers/cpuidle/governors/menu.c |  69 ++++++++++++++++---------
>>  include/linux/cpuidle.h          |  21 ++++++++
>>  kernel/sched/idle.c              |  14 ++++-
>>  kernel/sysctl.c                  |  12 +++++
>>  kernel/time/tick-sched.c         |   7 +++
>>  7 files changed, 209 insertions(+), 24 deletions(-)
>>
> 
> Overall, it looks like you could avoid stopping the tick every time the
> predicted idle duration is not longer than the tick interval in the first
> place.
> 
> Why don't you do that?

I didn't catch this.

Are you suggesting?

if(!cpu_stat.fast_idle)
	tick_nohz_idle_enter()

Or is your concern why the threshold can't simply be the tick interval?

For the first, can_stop_idle_tick() is a better place to skip tick-off IMHO.
For the latter, if the threshold is close/equal to the tick interval, it's quite
possible that the next event is the tick itself and nothing else.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 3/8] cpuidle: add a new predict interface
       [not found]   ` <2353480.vFnqZDvmsB@aspire.rjw.lan>
@ 2017-10-16  8:04     ` Li, Aubrey
  0 siblings, 0 replies; 22+ messages in thread
From: Li, Aubrey @ 2017-10-16  8:04 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 8:45, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:29 AM CEST Aubrey Li wrote:
>> For governors that have prediction functionality, add a new predict
>> interface to the cpuidle framework to call and use it.
> 
> Care to describe how it is intended to work?
> 
> Also this patch uses data structures introduced in the previous one
> (and the previous one didn't use them), so maybe merge the two?

okay, will refine in the next version.

> 
>> ---
>>  drivers/cpuidle/cpuidle.c        | 34 ++++++++++++++++++++++++++++++++++
>>  drivers/cpuidle/governors/menu.c |  7 +++++++
>>  include/linux/cpuidle.h          |  3 +++
>>  kernel/sched/idle.c              |  1 +
>>  4 files changed, 45 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index 4066308..ef6f7dd 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -336,6 +336,40 @@ void cpuidle_entry_end(void)
>>  }
>>  
>>  /**
>> + * cpuidle_predict - predict whether the coming idle is a fast idle or not
>> + */
>> +void cpuidle_predict(void)
>> +{
>> +	struct cpuidle_device *dev = cpuidle_get_device();
>> +	unsigned int overhead_threshold;
>> +
>> +	if (!dev)
>> +		return;
>> +
>> +	overhead_threshold = dev->idle_stat.overhead;
>> +
>> +	if (cpuidle_curr_governor->predict) {
>> +		dev->idle_stat.predicted_us = cpuidle_curr_governor->predict();
>> +		/*
>> +		 * notify idle governor to avoid reduplicative
>> +		 * prediction computation
>> +		 */
> 
> This comment is hard to understand.
> 
> What does it mean, really?
> 

predict() does a lot of computation. If it's called in cpuidle_predict(),
we don't want it to be called in the menu governor again. So here I use a flag
to tell the menu governor whether the prediction computation has already been
done for the coming idle.

>> +		dev->idle_stat.predicted = true;
>> +		if (dev->idle_stat.predicted_us < overhead_threshold) {
>> +			/*
>> +			 * notify tick subsystem to keep ticking
>> +			 * for the coming idle
>> +			 */
>> +			dev->idle_stat.fast_idle = true;
>> +		} else
>> +			dev->idle_stat.fast_idle = false;
> 
> What about
> 
> 		dev->idle_stat.fast_idle = dev->idle_stat.predicted_us < overhead_threshold;

Ack

> 
> Also why don't you use dev->idle_stat.overhead directly here?

Yeah, I should merge a few patches, because dev->idle_stat.overhead is referenced many times,
for irq timings, for scheduler idle statistics and for idle governor.

> 
> And what is the basic idea here?  Why do we compare the predicted
> idle duration with the entry/exit overhead from the previous cycle
> (if I understood this correctly, that is)?

entry/exit overhead is an average here, and the variance should not be big.

> 
>> +	} else {
>> +		dev->idle_stat.predicted = false;
>> +		dev->idle_stat.fast_idle = false;
>> +	}
>> +}
>> +
>> +/**
>>   * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
>>   */
>>  void cpuidle_install_idle_handler(void)
>> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
>> index 6bed197..90b2a10 100644
>> --- a/drivers/cpuidle/governors/menu.c
>> +++ b/drivers/cpuidle/governors/menu.c
>> @@ -344,6 +344,12 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>>  	if (unlikely(latency_req == 0))
>>  		return 0;
>>  
>> +	/* don't predict again if the idle framework already did it */
>> +	if (!dev->idle_stat.predicted)
>> +		menu_predict();
>> +	else
>> +		dev->idle_stat.predicted = false;
> 
> We provide the callback which is going to be used by the core if present,
> so why would the core not use it after all?

There is a case where, in the loop

while (!need_resched()) {
	cpuidle_idle_call()
}

the CPU receives a timer interrupt, wakes up, finds nothing to be scheduled,
and goes back to call menu and then mwait to sleep again.

In this case, cpuidle_predict() is not reached, but the idle data for the
prediction still needs to be updated.
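
In sketch form, the handshake looks like this (a userspace toy in which the
names mirror the kernel ones but everything is stubbed):

#include <stdbool.h>
#include <stdio.h>

static bool predicted;

/* stands in for cpuidle_predict(), called once on idle-loop entry */
static void cpuidle_predict(void)
{
	predicted = true;
}

/* stands in for menu_select() */
static void menu_select(void)
{
	if (!predicted)
		printf("menu_predict() runs here\n");	/* timer-wakeup case */
	else
		predicted = false;	/* consume the framework's prediction */
}

int main(void)
{
	int wakeups = 3;	/* pretend: timer wakeups with nothing to run */

	cpuidle_predict();
	while (wakeups--)	/* stands in for while (!need_resched()) */
		menu_select();	/* via cpuidle_idle_call() */
	return 0;
}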

Thanks,
-Aubrey


* Re: [RFC PATCH v2 3/8] cpuidle: add a new predict interface
       [not found]   ` <3044561.Ej2KzLJlAU@aspire.rjw.lan>
@ 2017-10-16  9:52     ` Li, Aubrey
  0 siblings, 0 replies; 22+ messages in thread
From: Li, Aubrey @ 2017-10-16  9:52 UTC (permalink / raw)
  To: Rafael J. Wysocki, Aubrey Li
  Cc: tglx, peterz, len.brown, ak, tim.c.chen, linux-pm, linux-kernel

On 2017/10/14 9:27, Rafael J. Wysocki wrote:
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index 0951dac..8704f3c 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -225,6 +225,7 @@ static void do_idle(void)
>>  	 */
>>  	__current_set_polling();
>>  	quiet_vmstat();
>> +	cpuidle_predict();
> 
> One more question here.
> 
> This changes the cpuidle code ordering such that if the ->predict callback
> is present, the governor prediction will run before tick_nohz_idle_enter(),
> whereas without that callback it runs in cpuidle_idle_call().
> 
> Is that actually going to work correctly for the menu governor, in particular?
>

This depends on how the prediction works; patch 1/8 has the details.

The first factor we use is the next clock event from the tick device, which is
not related to the cpuidle call.

The other two factors are derived from the idle interval history; that data is
already maintained in the menu_device data structure.

I'm not sure if this addresses your concern, or if there is anything else I
can provide to help.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 5/8] timers: keep sleep length updated as needed
  2017-10-16  6:46     ` [RFC PATCH v2 5/8] timers: keep sleep length updated as needed Li, Aubrey
@ 2017-10-16 23:58       ` Rafael J. Wysocki
  2017-10-17  6:10         ` Li, Aubrey
  0 siblings, 1 reply; 22+ messages in thread
From: Rafael J. Wysocki @ 2017-10-16 23:58 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On Monday, October 16, 2017 8:46:41 AM CEST Li, Aubrey wrote:
> On 2017/10/14 8:56, Rafael J. Wysocki wrote:
> > On Saturday, September 30, 2017 9:20:31 AM CEST Aubrey Li wrote:
> >> sleep length indicates how long we'll be idle. Currently, it's updated
> >> only when tick nohz enters. This patch series makes a new requirement
> >> on the tick, so we should keep the sleep length updated as needed
> > 
> > So what exactly would be the problem with leaving things as they are?
> 
> Previously, ts->sleep_length was only updated when the tick was stopped.
> 
> As follows, in
> 
> __tick_nohz_idle_enter()
> {
> 	if (can_stop_idle_tick() /* return true */) {
> 		tick_nohz_stop_sched_tick()
> 			|
> 			|-----> update sleep_length
> 	}
> }

Which is logical, because the tick will get in the way if we don't stop it,
won't it?

> 
> Now ts->sleep_length is needed outside of tick_nohz_idle_enter(), so we want
> to update sleep_length every time we read it.
> 
> If we leave it unchanged, the prediction could read a sleep_length computed a
> long time ago if the system keeps ticking.

Well, but does it make sense to estimate the sleep length without stopping
the tick?

> >> ---
> >>  kernel/time/tick-sched.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> >> index d663fab..94fb9b8 100644
> >> --- a/kernel/time/tick-sched.c
> >> +++ b/kernel/time/tick-sched.c
> >> @@ -1008,8 +1008,11 @@ void tick_nohz_irq_exit(void)
> >>   */
> >>  ktime_t tick_nohz_get_sleep_length(void)
> >>  {
> >> +	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
> >>  	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> >>  
> >> +	ts->sleep_length = ktime_sub(dev->next_event, ktime_get());
> >> +
> >>  	return ts->sleep_length;
> >>  }
> >>  
> >>
> > 
> > I probably wouldn't do it this way ...
> > 
> > 
> 
> May I know your detailed thoughts?

That depends on the answer above. :-)

Thanks,
Rafael


* Re: [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable
  2017-10-16  6:00     ` [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable Li, Aubrey
@ 2017-10-17  0:01       ` Rafael J. Wysocki
  2017-10-17  6:12         ` Li, Aubrey
  0 siblings, 1 reply; 22+ messages in thread
From: Rafael J. Wysocki @ 2017-10-17  0:01 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On Monday, October 16, 2017 8:00:45 AM CEST Li, Aubrey wrote:
> On 2017/10/14 8:59, Rafael J. Wysocki wrote:
> > On Saturday, September 30, 2017 9:20:32 AM CEST Aubrey Li wrote:
> >> Add a knob to make fast idle threshold tunable
> >>
> >> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> > 
> > I first of all am not sure about the need to add a tunable for this at all
> > in the first place.
> 
> Actually I think a fixed value (10) might be good enough, but I'm not quite sure
> if there is a requirement to tune it for different scenarios; for example, even
> if the predicted idle interval is 100x the overhead, I may still want a fast path
> for a better benchmark score?

Any new tunables make the test matrix expand considerably, so it generally is
better to err on the conservative side with adding them.

Thanks,
Rafael


* Re: [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry
  2017-10-16  3:11     ` [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry Li, Aubrey
@ 2017-10-17  0:05       ` Rafael J. Wysocki
  2017-10-17  7:04         ` Li, Aubrey
  0 siblings, 1 reply; 22+ messages in thread
From: Rafael J. Wysocki @ 2017-10-17  0:05 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On Monday, October 16, 2017 5:11:57 AM CEST Li, Aubrey wrote:
> On 2017/10/14 8:35, Rafael J. Wysocki wrote:
> > On Saturday, September 30, 2017 9:20:28 AM CEST Aubrey Li wrote:
> >> Record the overhead of idle entry in microseconds
> >>
> > 
> > What is this needed for?
> 
> We need to figure out how short an idle has to be to count as a short idle,
> and recording the overhead serves this purpose: the short-idle threshold is
> based on this overhead.

I don't really understand this statement.

Pretend I'm not familiar with this stuff and try to explain it to me. :-)

> > 
> >> +void cpuidle_entry_end(void)
> >> +{
> >> +	struct cpuidle_device *dev = cpuidle_get_device();
> >> +	u64 overhead;
> >> +	s64 diff;
> >> +
> >> +	if (dev) {
> >> +		dev->idle_stat.entry_end = local_clock();
> >> +		overhead = div_u64(dev->idle_stat.entry_end -
> >> +				dev->idle_stat.entry_start, NSEC_PER_USEC);
> > 
> > Is the conversion really necessary?
> > 
> > If so, then why?
> 
> We can choose between nanoseconds and microseconds. Given the short-idle
> pattern these workloads produce, I think microsecond resolution is good
> enough for the real workload.
> 
> Another reason is that the prediction from the idle governor is in
> microseconds, so I convert it for comparison purposes.
> > 
> > And if there is a good reason, what about using right shift to do
> > an approximate conversion to avoid the extra division here?
> 
> Sure, >> 10 works for me, as I don't think precision is a big deal here.
> 
> > 
> >> +		diff = overhead - dev->idle_stat.overhead;
> >> +		dev->idle_stat.overhead += diff >> 3;
> > 
> > Can you please explain what happens in the two lines above?
> 
> Online average computing algorithm, stolen from update_avg() @ kernel/sched/core.c.

OK

Maybe care to add a comment to that effect?

Thanks,
Rafael


* Re: [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality
  2017-10-16  7:44   ` [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality Li, Aubrey
@ 2017-10-17  0:07     ` Rafael J. Wysocki
  2017-10-17  7:32       ` Li, Aubrey
  0 siblings, 1 reply; 22+ messages in thread
From: Rafael J. Wysocki @ 2017-10-17  0:07 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On Monday, October 16, 2017 9:44:41 AM CEST Li, Aubrey wrote:
> On 2017/10/14 9:14, Rafael J. Wysocki wrote:
> > On Saturday, September 30, 2017 9:20:26 AM CEST Aubrey Li wrote:
> >> We found that under some latency-intensive workloads, short idle periods
> >> occur very frequently, and the idle entry and exit paths start to dominate,
> >> so it's important to optimize them. To determine the short-idle pattern, we
> >> need to figure out how long the coming idle will be and the threshold of the
> >> short idle interval.
> >>
> >> A cpu idle prediction functionality is introduced in this proposal to catch
> >> the short-idle pattern.
> >>
> >> Firstly, we check the IRQ timings subsystem to see if there is an event
> >> coming soon.
> >> -- https://lwn.net/Articles/691297/
> >>
> >> Secondly, we check the idle statistics of the scheduler to see if it's
> >> likely we'll go into a short idle.
> >> -- https://patchwork.kernel.org/patch/2839221/
> >>
> >> Thirdly, we predict the next idle interval by using the prediction
> >> functionality in the idle governor, if it has one.
> >>
> >> For the threshold of the short idle interval, we record the timestamps of
> >> idle entry and exit, and multiply the measured overhead by a tunable
> >> parameter here:
> >> -- /proc/sys/kernel/fast_idle_ratio
> >>
> >> In this proposal, we use the output of the idle prediction to skip turning
> >> the tick off if a short idle is determined. Reprogramming the hardware timer
> >> twice (off and on) is expensive for a very short idle. There are more
> >> potential optimizations that can be done based on the same indicator.
> >>
> >> I observed that when the system is idle, the idle predictor reports 20/s
> >> long idles and ZERO fast idles on one CPU. And when the workload is running,
> >> the idle predictor reports 72899/s fast idles and ZERO long idles on the
> >> same CPU.
> >>
> >> Aubrey Li (8):
> >>   cpuidle: menu: extract prediction functionality
> >>   cpuidle: record the overhead of idle entry
> >>   cpuidle: add a new predict interface
> >>   tick/nohz: keep tick on for a fast idle
> >>   timers: keep sleep length updated as needed
> >>   cpuidle: make fast idle threshold tunable
> >>   cpuidle: introduce irq timing to make idle prediction
> >>   cpuidle: introduce run queue average idle to make idle prediction
> >>
> >>  drivers/cpuidle/Kconfig          |   1 +
> >>  drivers/cpuidle/cpuidle.c        | 109 +++++++++++++++++++++++++++++++++++++++
> >>  drivers/cpuidle/governors/menu.c |  69 ++++++++++++++++---------
> >>  include/linux/cpuidle.h          |  21 ++++++++
> >>  kernel/sched/idle.c              |  14 ++++-
> >>  kernel/sysctl.c                  |  12 +++++
> >>  kernel/time/tick-sched.c         |   7 +++
> >>  7 files changed, 209 insertions(+), 24 deletions(-)
> >>
> > 
> > Overall, it looks like you could avoid stopping the tick every time the
> > predicted idle duration is not longer than the tick interval in the first
> > place.
> > 
> > Why don't you do that?
> 
> I didn't catch this.
> 
> Are you suggesting?
> 
> if(!cpu_stat.fast_idle)
> 	tick_nohz_idle_enter()
> 
> Or is your concern why the threshold can't simply be the tick interval?

That I guess.

> For the first, can_stop_idle_tick() is a better place to skip tick-off IMHO.
> For the latter, if the threshold is close/equal to the tick interval, it's quite
> possible that the next event is the tick itself and nothing else.

Well, I don't quite get that.

What's the reasoning here?

Thanks,
Rafael


* Re: [RFC PATCH v2 5/8] timers: keep sleep length updated as needed
  2017-10-16 23:58       ` Rafael J. Wysocki
@ 2017-10-17  6:10         ` Li, Aubrey
  0 siblings, 0 replies; 22+ messages in thread
From: Li, Aubrey @ 2017-10-17  6:10 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On 2017/10/17 7:58, Rafael J. Wysocki wrote:
> On Monday, October 16, 2017 8:46:41 AM CEST Li, Aubrey wrote:
>> On 2017/10/14 8:56, Rafael J. Wysocki wrote:
>>> On Saturday, September 30, 2017 9:20:31 AM CEST Aubrey Li wrote:
>>>> sleep length indicates how long we'll be idle. Currently, it's updated
>>>> only when tick nohz enters. This patch series makes a new requirement
>>>> on the tick, so we should keep the sleep length updated as needed
>>>
>>> So what exactly would be the problem with leaving things as they are?
>>
>> Previously, ts->sleep_length was only updated when the tick was stopped.
>>
>> As follows, in
>>
>> __tick_nohz_idle_enter()
>> {
>> 	if (can_stop_idle_tick() /* return true */) {
>> 		tick_nohz_stop_sched_tick()
>> 			|
>> 			|-----> update sleep_length
>> 	}
>> }
> 
> Which is logical, because the tick will get in the way if we don't stop it,
> won't it?
> 
The scenario here is that the tick (4ms) is too long for a short idle (e.g. 4us),
and there could be hundreds of short idles within one tick interval.

>>
>> Now ts->sleep_length is needed outside of tick_nohz_idle_enter(), so we want
>> to update sleep_length every time we read it.
>>
>> If we leave it unchanged, the prediction could read a sleep_length computed a
>> long time ago if the system keeps ticking.
> 
> Well, but does it make sense to estimate the sleep length without stopping
> the tick?

For example, at the first short idle we had just turned the tick off, so we
may get sleep length = 3900us. Then we keep the tick on, and when the 100th
short idle comes, the stored sleep length is still 3900us (not updated), but
actually it should be e.g. 100us.
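
In sketch form (a userspace toy; the numbers are the made-up ones from above,
in us):

#include <stdio.h>

typedef long long s64;

/* toy: the next tick fires at 4000us; sleep length is what remains */
static s64 sleep_length(s64 next_event_us, s64 now_us)
{
	return next_event_us - now_us;	/* ktime_sub(dev->next_event, ktime_get()) */
}

int main(void)
{
	s64 next_event = 4000;

	/* read just after the tick was programmed (first short idle): */
	printf("fresh: %lldus\n", sleep_length(next_event, 100));
	/* read again much later (100th short idle), tick still pending: */
	printf("later: %lldus\n", sleep_length(next_event, 3900));
	return 0;
}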

> 
>>>> ---
>>>>  kernel/time/tick-sched.c | 3 +++
>>>>  1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>>>> index d663fab..94fb9b8 100644
>>>> --- a/kernel/time/tick-sched.c
>>>> +++ b/kernel/time/tick-sched.c
>>>> @@ -1008,8 +1008,11 @@ void tick_nohz_irq_exit(void)
>>>>   */
>>>>  ktime_t tick_nohz_get_sleep_length(void)
>>>>  {
>>>> +	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
>>>>  	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
>>>>  
>>>> +	ts->sleep_length = ktime_sub(dev->next_event, ktime_get());
>>>> +
>>>>  	return ts->sleep_length;
>>>>  }
>>>>  
>>>>
>>>
>>> I probably wouldn't do it this way ...
>>>
>>>
>>
>> May I know your detailed thoughts?
> 
> That depends on the answer above. :-)

Does the above explanation address the concern?

Thanks,
-Aubrey


* Re: [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable
  2017-10-17  0:01       ` Rafael J. Wysocki
@ 2017-10-17  6:12         ` Li, Aubrey
  0 siblings, 0 replies; 22+ messages in thread
From: Li, Aubrey @ 2017-10-17  6:12 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On 2017/10/17 8:01, Rafael J. Wysocki wrote:
> On Monday, October 16, 2017 8:00:45 AM CEST Li, Aubrey wrote:
>> On 2017/10/14 8:59, Rafael J. Wysocki wrote:
>>> On Saturday, September 30, 2017 9:20:32 AM CEST Aubrey Li wrote:
>>>> Add a knob to make fast idle threshold tunable
>>>>
>>>> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
>>>
>>> I first of all am not sure about the need to add a tunable for this at all
>>> in the first place.
>>
>> Actually I think a fixed value (10) might be good enough, but I'm not quite sure
>> if there is a requirement to tune it for different scenarios; for example, even
>> if the predicted idle interval is 100x the overhead, I may still want a fast path
>> for a better benchmark score?
> 
> Any new tunables make the test matrix expand considerably, so it generally is
> better to err on the conservative side with adding them.
> 

Okay, it's fine for me without it. I'll remove it in the next version.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry
  2017-10-17  0:05       ` Rafael J. Wysocki
@ 2017-10-17  7:04         ` Li, Aubrey
  0 siblings, 0 replies; 22+ messages in thread
From: Li, Aubrey @ 2017-10-17  7:04 UTC (permalink / raw)
  To: Rafael J. Wysocki, Mike Galbraith
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On 2017/10/17 8:05, Rafael J. Wysocki wrote:
> On Monday, October 16, 2017 5:11:57 AM CEST Li, Aubrey wrote:
>> On 2017/10/14 8:35, Rafael J. Wysocki wrote:
>>> On Saturday, September 30, 2017 9:20:28 AM CEST Aubrey Li wrote:
>>>> Record the overhead of idle entry in microseconds
>>>>
>>>
>>> What is this needed for?
>>
>> We need to figure out how short an idle has to be to count as a short idle,
>> and recording the overhead serves this purpose: the short-idle threshold is
>> based on this overhead.
> 
> I don't really understand this statement.
> 
> Pretend I'm not familiar with this stuff and try to explain it to me. :-)
> 

Okay, let me try, :-)

Today, what we do in the idle loop is as follows:

do_idle {
	idle_entry {
	- deferrable stuff like quiet_vmstat
	- turn off the tick (without looking at the historical/predicted idle interval)
	- rcu idle enter, c-state selection, etc
	}

	idle_call {
	- poll or halt or mwait
	}

	idle_exit {
	- rcu idle exit
	- restore the tick if tick is stopped before enter idle
	}
}

And we already measured that idle_entry and idle_exit cost several
microseconds, say 10us.

Now if idle_call is 1000us, much larger than idle_entry and idle_exit, we can
ignore the time cost in idle_entry and idle_exit.

But for some workloads with a short-idle pattern, like netperf, idle_call
is 2us, so idle_entry and idle_exit start to dominate: with 10us of overhead
around 2us of actual idle, more than 80% of each idle cycle is overhead. If
we can reduce the time in idle_entry and idle_exit, we get significantly
better workload performance.

Modern high-speed networking and low-latency I/O like NVMe disks have this
requirement. Mike's patch was made several years ago, though I don't know the
details. Here is an article related to this.
https://cacm.acm.org/magazines/2017/4/215032-attack-of-the-killer-microseconds/fulltext

>>>
>>>> +void cpuidle_entry_end(void)
>>>> +{
>>>> +	struct cpuidle_device *dev = cpuidle_get_device();
>>>> +	u64 overhead;
>>>> +	s64 diff;
>>>> +
>>>> +	if (dev) {
>>>> +		dev->idle_stat.entry_end = local_clock();
>>>> +		overhead = div_u64(dev->idle_stat.entry_end -
>>>> +				dev->idle_stat.entry_start, NSEC_PER_USEC);
>>>
>>> Is the conversion really necessary?
>>>
>>> If so, then why?
>>
>> We can choose between nanoseconds and microseconds. Given the short-idle
>> pattern these workloads produce, I think microsecond resolution is good
>> enough for the real workload.
>>
>> Another reason is that the prediction from the idle governor is in
>> microseconds, so I convert it for comparison purposes.
>>>
>>> And if there is a good reason, what about using right shift to do
>>> an approximate conversion to avoid the extra division here?
>>
>> Sure, >> 10 works for me, as I don't think precision is a big deal here.
>>
>>>
>>>> +		diff = overhead - dev->idle_stat.overhead;
>>>> +		dev->idle_stat.overhead += diff >> 3;
>>>
>>> Can you please explain what happens in the two lines above?
>>
>> Online average computing algorithm, stolen from update_avg() @ kernel/sched/core.c.
> 
> OK
> 
> Maybe care to add a comment to that effect?

Sure, I'll add in the next version.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality
  2017-10-17  0:07     ` Rafael J. Wysocki
@ 2017-10-17  7:32       ` Li, Aubrey
  0 siblings, 0 replies; 22+ messages in thread
From: Li, Aubrey @ 2017-10-17  7:32 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Aubrey Li, tglx, peterz, len.brown, ak, tim.c.chen, linux-pm,
	linux-kernel

On 2017/10/17 8:07, Rafael J. Wysocki wrote:
> On Monday, October 16, 2017 9:44:41 AM CEST Li, Aubrey wrote:
>>
>> Or is your concern why the threshold can't simply be the tick interval?
> 
> That I guess.
> 
>> For the latter, if the threshold is close/equal to the tick interval, it's quite
>> possible that the next event is the tick itself and nothing else.
> 
> Well, I don't quite get that.
> 
> What's the reasoning here?

If we set the threshold to the tick interval, then once the system keeps the
tick on, it will never get a chance to turn it off again.

Because the tick stays on, and the time of idle entry and idle exit is counted
in, the observed idle interval is always < the tick interval, which means every
idle classifies as a short idle.

Thanks,
-Aubrey


* Re: [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality
       [not found] <1506756034-6340-1-git-send-email-aubrey.li@intel.com>
                   ` (6 preceding siblings ...)
       [not found] ` <1506756034-6340-4-git-send-email-aubrey.li@intel.com>
@ 2017-11-30  1:00 ` Li, Aubrey
  2017-11-30  1:37   ` Rafael J. Wysocki
  7 siblings, 1 reply; 22+ messages in thread
From: Li, Aubrey @ 2017-11-30  1:00 UTC (permalink / raw)
  To: Aubrey Li, tglx, peterz, rjw, len.brown, ak, tim.c.chen
  Cc: linux-pm, linux-kernel

Hi,

Thanks for Rafael's comments on v2. I'd like to ping here to see which
direction this proposal should go and what fundamental changes it
should make.

I'm also open to any suggestions if my proposal is not on the right track.

Thanks,
-Aubrey

On 2017/9/30 15:20, Aubrey Li wrote:
> We found that under some latency-intensive workloads, short idle periods
> occur very frequently, and the idle entry and exit paths start to dominate,
> so it's important to optimize them. To determine the short-idle pattern, we
> need to figure out how long the coming idle will be and the threshold of the
> short idle interval.
> 
> A cpu idle prediction functionality is introduced in this proposal to catch
> the short-idle pattern.
> 
> Firstly, we check the IRQ timings subsystem to see if there is an event
> coming soon.
> -- https://lwn.net/Articles/691297/
> 
> Secondly, we check the idle statistics of the scheduler to see if it's
> likely we'll go into a short idle.
> -- https://patchwork.kernel.org/patch/2839221/
> 
> Thirdly, we predict the next idle interval by using the prediction
> functionality in the idle governor, if it has one.
> 
> For the threshold of the short idle interval, we record the timestamps of
> idle entry and exit, and multiply the measured overhead by a tunable
> parameter here:
> -- /proc/sys/kernel/fast_idle_ratio
> 
> In this proposal, we use the output of the idle prediction to skip turning
> the tick off if a short idle is determined. Reprogramming the hardware timer
> twice (off and on) is expensive for a very short idle. There are more
> potential optimizations that can be done based on the same indicator.
> 
> I observed that when the system is idle, the idle predictor reports 20/s
> long idles and ZERO fast idles on one CPU. And when the workload is running,
> the idle predictor reports 72899/s fast idles and ZERO long idles on the
> same CPU.
> 
> Aubrey Li (8):
>   cpuidle: menu: extract prediction functionality
>   cpuidle: record the overhead of idle entry
>   cpuidle: add a new predict interface
>   tick/nohz: keep tick on for a fast idle
>   timers: keep sleep length updated as needed
>   cpuidle: make fast idle threshold tunable
>   cpuidle: introduce irq timing to make idle prediction
>   cpuidle: introduce run queue average idle to make idle prediction
> 
>  drivers/cpuidle/Kconfig          |   1 +
>  drivers/cpuidle/cpuidle.c        | 109 +++++++++++++++++++++++++++++++++++++++
>  drivers/cpuidle/governors/menu.c |  69 ++++++++++++++++---------
>  include/linux/cpuidle.h          |  21 ++++++++
>  kernel/sched/idle.c              |  14 ++++-
>  kernel/sysctl.c                  |  12 +++++
>  kernel/time/tick-sched.c         |   7 +++
>  7 files changed, 209 insertions(+), 24 deletions(-)
> 


* Re: [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality
  2017-11-30  1:00 ` [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality Li, Aubrey
@ 2017-11-30  1:37   ` Rafael J. Wysocki
  0 siblings, 0 replies; 22+ messages in thread
From: Rafael J. Wysocki @ 2017-11-30  1:37 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, Thomas Gleixner, Peter Zijlstra, Rafael J. Wysocki,
	Len Brown, Andi Kleen, Tim Chen, Linux PM,
	Linux Kernel Mailing List

On Thu, Nov 30, 2017 at 2:00 AM, Li, Aubrey <aubrey.li@linux.intel.com> wrote:
> Hi,
>
> Thanks for Rafael's comments on v2. I'd like to ping here to see which
> direction this proposal should go and what fundamental changes it
> should make.
>
> I'm also open to any suggestions if my proposal is not on the right track.

To me (and I discussed that with Peter briefly last month in Prague)
the way to go would be to avoid stopping the scheduler tick until we
are entirely sure that it should be stopped.  Which is when the
governor (any governor BTW, not just menu) chooses the target idle
state for the CPU and the target residency of that state is long
enough for the tick to be stopped ("long enough" here means that if
the tick is not stopped, the time spent by the CPU in the target idle
state will be shorter than the target residency).  Note that this is
totally independent on what governor is used and it doesn't involve
any "refinements" or "improvements" of the prediction whatever.

Of course, for the governor to make the state selection, it generally
will have to know when the next non-tick timer is going to expire and
getting that information to it will require some effort.

Thanks,
Rafael


Thread overview: 22+ messages
     [not found] <1506756034-6340-1-git-send-email-aubrey.li@intel.com>
     [not found] ` <1506756034-6340-2-git-send-email-aubrey.li@intel.com>
     [not found]   ` <1629755.KbDSmDPDTX@aspire.rjw.lan>
2017-10-16  2:46     ` [RFC PATCH v2 1/8] cpuidle: menu: extract prediction functionality Li, Aubrey
     [not found] ` <1506756034-6340-3-git-send-email-aubrey.li@intel.com>
     [not found]   ` <2672521.fEEa1b19Vu@aspire.rjw.lan>
2017-10-16  3:11     ` [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry Li, Aubrey
2017-10-17  0:05       ` Rafael J. Wysocki
2017-10-17  7:04         ` Li, Aubrey
     [not found] ` <1506756034-6340-5-git-send-email-aubrey.li@intel.com>
     [not found]   ` <4523111.uMcC96MW3N@aspire.rjw.lan>
2017-10-16  3:26     ` [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle Li, Aubrey
2017-10-16  4:45       ` Mike Galbraith
2017-10-16  5:34         ` Li, Aubrey
2017-10-16  6:25           ` Mike Galbraith
2017-10-16  6:31             ` Li, Aubrey
     [not found] ` <1506756034-6340-7-git-send-email-aubrey.li@intel.com>
     [not found]   ` <2242303.t20yq9Lc6j@aspire.rjw.lan>
2017-10-16  6:00     ` [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable Li, Aubrey
2017-10-17  0:01       ` Rafael J. Wysocki
2017-10-17  6:12         ` Li, Aubrey
     [not found] ` <1506756034-6340-6-git-send-email-aubrey.li@intel.com>
     [not found]   ` <1554921.dz8jk4n8cL@aspire.rjw.lan>
2017-10-16  6:46     ` [RFC PATCH v2 5/8] timers: keep sleep length updated as needed Li, Aubrey
2017-10-16 23:58       ` Rafael J. Wysocki
2017-10-17  6:10         ` Li, Aubrey
     [not found] ` <3026355.QRuoy6eIZM@aspire.rjw.lan>
2017-10-16  7:44   ` [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality Li, Aubrey
2017-10-17  0:07     ` Rafael J. Wysocki
2017-10-17  7:32       ` Li, Aubrey
     [not found] ` <1506756034-6340-4-git-send-email-aubrey.li@intel.com>
     [not found]   ` <2353480.vFnqZDvmsB@aspire.rjw.lan>
2017-10-16  8:04     ` [RFC PATCH v2 3/8] cpuidle: add a new predict interface Li, Aubrey
     [not found]   ` <3044561.Ej2KzLJlAU@aspire.rjw.lan>
2017-10-16  9:52     ` Li, Aubrey
2017-11-30  1:00 ` [RFC PATCH v2 0/8] Introduce cpu idle prediction functionality Li, Aubrey
2017-11-30  1:37   ` Rafael J. Wysocki
