Re: [PATCH v4] cpuidle: Fix last_residency division

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Daniel Lezcano <daniel.lezcano@linaro.org>
To: "Shreyas B. Prabhu" <shreyas@linux.vnet.ibm.com>, rjw@rjwysocki.net
Cc: nicolas.pitre@linaro.org, arnd@arndb.de,
	linux-pm@vger.kernel.org, David.Laight@ACULAB.COM,
	anton@samba.org, linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v4] cpuidle: Fix last_residency division
Date: Thu, 30 Jun 2016 16:57:34 +0200	[thread overview]
Message-ID: <5775335E.2040003@linaro.org> (raw)
In-Reply-To: <1467297253-2171-1-git-send-email-shreyas@linux.vnet.ibm.com>

On 06/30/2016 04:34 PM, Shreyas B. Prabhu wrote:
> Snooze is a poll idle state in powernv and pseries platforms. Snooze
> has a timeout so that if a cpu stays in snooze for more than target
> residency of the next available idle state, then it would exit thereby
> giving chance to the cpuidle governor to re-evaluate and
> promote the cpu to a deeper idle state. Therefore whenever snooze exits
> due to this timeout, its last_residency will be target_residency of next
> deeper state.
>
> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
> changed the math around last_residency calculation. Specifically, while
> converting last_residency value from nanoseconds to microseconds it does
> right shift by 10. Due to this, in snooze timeout exit scenarios
> last_residency calculated is roughly 2.3% less than target_residency of
> next available state. This pattern is picked up get_typical_interval()
> in the menu governor and therefore expected_interval in menu_select() is
> frequently less than the target_residency of any state but snooze.
>
> Due to this we are entering snooze at a higher rate, thereby affecting
> the single thread performance.
>
> Fix this by using a better approximation for division by 1000.
>
> Reported-by: Anton Blanchard <anton@samba.org>
> Bisected-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
> Suggested-by David Laight <david.laight@aculab.com>
> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
> ---
> Changes in v4
> =============
>   - Increasing the threshold upto which approximation can be used.
>   - Removed explicit cast. Instead added a comment saying why cast
>     is safe.
>
> Changes in v3
> =============
>   - Using approximation suggested by David
>
> Changes in v2
> =============
>   - Fixing it in the cpuidle core code instead of driver code.
>
>   drivers/cpuidle/cpuidle.c | 11 +++--------
>   drivers/cpuidle/cpuidle.h | 38 ++++++++++++++++++++++++++++++++++++++
>   2 files changed, 41 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index a4d0059..f55ad01 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -174,7 +174,6 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>   	struct cpuidle_state *target_state = &drv->states[index];
>   	bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP);
>   	u64 time_start, time_end;
> -	s64 diff;
>
>   	/*
>   	 * Tell the time framework to switch to a broadcast timer because our
> @@ -218,14 +217,10 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>   		local_irq_enable();
>
>   	/*
> -	 * local_clock() returns the time in nanosecond, let's shift
> -	 * by 10 (divide by 1024) to have microsecond based time.
> +	 * local_clock() returns the time in nanoseconds, convert it to
> +	 * microsecond based time.
>   	 */
> -	diff = (time_end - time_start) >> 10;
> -	if (diff > INT_MAX)
> -		diff = INT_MAX;
> -
> -	dev->last_residency = (int) diff;
> +	dev->last_residency = convert_nsec_to_usec(time_end - time_start);
>
>   	if (entered_state >= 0) {
>   		/* Update cpuidle counters */
> diff --git a/drivers/cpuidle/cpuidle.h b/drivers/cpuidle/cpuidle.h
> index f87f399..a027b35 100644
> --- a/drivers/cpuidle/cpuidle.h
> +++ b/drivers/cpuidle/cpuidle.h
> @@ -68,4 +68,42 @@ static inline void cpuidle_coupled_unregister_device(struct cpuidle_device *dev)
>   }
>   #endif
>
> +/*
> + * To ensure that there is no overflow while approximation
> + * for dividing val by 1000, we must respect -
> + * val + (val >> 5) <= 0xFFFFFFFF
> + * val + val/32 <= 0xFFFFFFFF
> + * val <= (0xFFFFFFFF * 32) / 33
> + * val <= 0xF83E0F82
> + * Hence the threshold for val below which we can use the
> + * approximation is 0xF83E0F82
> + */
> +#define DIV_APPROXIMATION_THRESHOLD 0xF83E0F82UL
> +
> +/*
> + * Used for calculating last_residency in usec. Optimized for case
> + * where last_residency in nsecs is < DIV_APPROXIMATION_THRESHOLD
> + * Approximated value has less than 1% error.
> + */
> +static inline int convert_nsec_to_usec(u64 nsec)
> +{
> +	if (likely(nsec < DIV_APPROXIMATION_THRESHOLD)) {
> +		u32 usec = nsec;
> +
> +		usec += usec >> 5;
> +		usec = usec >> 10;
> +
> +		/* Can safely cast to int since usec is < INT_MAX */
> +		return usec;
> +	} else {
> +		u64 usec = div_u64(nsec, 1000);
> +
> +		if (usec > INT_MAX)
> +			usec = INT_MAX;
> +
> +		/* Can safely cast to int since usec is < INT_MAX */
> +		return usec;
> +	}
> +}


What bothers me with this division is the benefit of adding an extra 
ultra optimized division by 1000 in cpuidle.h while we have already 
ktime_divns which is optimized in ktime.h.

Why not:

ts = ns_to_ktime(local_clock());

...

te = ns_to_ktime(local_clock());


diff = ktime_us_delta(te, ts);





-- 
  <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

WARNING: multiple messages have this Message-ID (diff)

From: Daniel Lezcano <daniel.lezcano@linaro.org>
To: "Shreyas B. Prabhu" <shreyas@linux.vnet.ibm.com>, rjw@rjwysocki.net
Cc: linux-pm@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	anton@samba.org, mpe@ellerman.id.au, bsingharora@gmail.com,
	David.Laight@ACULAB.COM, arnd@arndb.de, nicolas.pitre@linaro.org
Subject: Re: [PATCH v4] cpuidle: Fix last_residency division
Date: Thu, 30 Jun 2016 16:57:34 +0200	[thread overview]
Message-ID: <5775335E.2040003@linaro.org> (raw)
In-Reply-To: <1467297253-2171-1-git-send-email-shreyas@linux.vnet.ibm.com>

On 06/30/2016 04:34 PM, Shreyas B. Prabhu wrote:
> Snooze is a poll idle state in powernv and pseries platforms. Snooze
> has a timeout so that if a cpu stays in snooze for more than target
> residency of the next available idle state, then it would exit thereby
> giving chance to the cpuidle governor to re-evaluate and
> promote the cpu to a deeper idle state. Therefore whenever snooze exits
> due to this timeout, its last_residency will be target_residency of next
> deeper state.
>
> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
> changed the math around last_residency calculation. Specifically, while
> converting last_residency value from nanoseconds to microseconds it does
> right shift by 10. Due to this, in snooze timeout exit scenarios
> last_residency calculated is roughly 2.3% less than target_residency of
> next available state. This pattern is picked up get_typical_interval()
> in the menu governor and therefore expected_interval in menu_select() is
> frequently less than the target_residency of any state but snooze.
>
> Due to this we are entering snooze at a higher rate, thereby affecting
> the single thread performance.
>
> Fix this by using a better approximation for division by 1000.
>
> Reported-by: Anton Blanchard <anton@samba.org>
> Bisected-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
> Suggested-by David Laight <david.laight@aculab.com>
> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
> ---
> Changes in v4
> =============
>   - Increasing the threshold upto which approximation can be used.
>   - Removed explicit cast. Instead added a comment saying why cast
>     is safe.
>
> Changes in v3
> =============
>   - Using approximation suggested by David
>
> Changes in v2
> =============
>   - Fixing it in the cpuidle core code instead of driver code.
>
>   drivers/cpuidle/cpuidle.c | 11 +++--------
>   drivers/cpuidle/cpuidle.h | 38 ++++++++++++++++++++++++++++++++++++++
>   2 files changed, 41 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index a4d0059..f55ad01 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -174,7 +174,6 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>   	struct cpuidle_state *target_state = &drv->states[index];
>   	bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP);
>   	u64 time_start, time_end;
> -	s64 diff;
>
>   	/*
>   	 * Tell the time framework to switch to a broadcast timer because our
> @@ -218,14 +217,10 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>   		local_irq_enable();
>
>   	/*
> -	 * local_clock() returns the time in nanosecond, let's shift
> -	 * by 10 (divide by 1024) to have microsecond based time.
> +	 * local_clock() returns the time in nanoseconds, convert it to
> +	 * microsecond based time.
>   	 */
> -	diff = (time_end - time_start) >> 10;
> -	if (diff > INT_MAX)
> -		diff = INT_MAX;
> -
> -	dev->last_residency = (int) diff;
> +	dev->last_residency = convert_nsec_to_usec(time_end - time_start);
>
>   	if (entered_state >= 0) {
>   		/* Update cpuidle counters */
> diff --git a/drivers/cpuidle/cpuidle.h b/drivers/cpuidle/cpuidle.h
> index f87f399..a027b35 100644
> --- a/drivers/cpuidle/cpuidle.h
> +++ b/drivers/cpuidle/cpuidle.h
> @@ -68,4 +68,42 @@ static inline void cpuidle_coupled_unregister_device(struct cpuidle_device *dev)
>   }
>   #endif
>
> +/*
> + * To ensure that there is no overflow while approximation
> + * for dividing val by 1000, we must respect -
> + * val + (val >> 5) <= 0xFFFFFFFF
> + * val + val/32 <= 0xFFFFFFFF
> + * val <= (0xFFFFFFFF * 32) / 33
> + * val <= 0xF83E0F82
> + * Hence the threshold for val below which we can use the
> + * approximation is 0xF83E0F82
> + */
> +#define DIV_APPROXIMATION_THRESHOLD 0xF83E0F82UL
> +
> +/*
> + * Used for calculating last_residency in usec. Optimized for case
> + * where last_residency in nsecs is < DIV_APPROXIMATION_THRESHOLD
> + * Approximated value has less than 1% error.
> + */
> +static inline int convert_nsec_to_usec(u64 nsec)
> +{
> +	if (likely(nsec < DIV_APPROXIMATION_THRESHOLD)) {
> +		u32 usec = nsec;
> +
> +		usec += usec >> 5;
> +		usec = usec >> 10;
> +
> +		/* Can safely cast to int since usec is < INT_MAX */
> +		return usec;
> +	} else {
> +		u64 usec = div_u64(nsec, 1000);
> +
> +		if (usec > INT_MAX)
> +			usec = INT_MAX;
> +
> +		/* Can safely cast to int since usec is < INT_MAX */
> +		return usec;
> +	}
> +}


What bothers me with this division is the benefit of adding an extra 
ultra optimized division by 1000 in cpuidle.h while we have already 
ktime_divns which is optimized in ktime.h.

Why not:

ts = ns_to_ktime(local_clock());

...

te = ns_to_ktime(local_clock());


diff = ktime_us_delta(te, ts);





-- 
  <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

next prev parent reply	other threads:[~2016-06-30 14:57 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-30 14:34 [PATCH v4] cpuidle: Fix last_residency division Shreyas B. Prabhu
2016-06-30 14:57 ` Daniel Lezcano [this message]
2016-06-30 14:57   ` Daniel Lezcano
2016-06-30 15:37   ` Nicolas Pitre
2016-06-30 15:37     ` Nicolas Pitre
2016-07-01  8:06     ` Daniel Lezcano
2016-07-01 12:41       ` Balbir Singh
2016-07-01 12:41         ` Balbir Singh
2016-07-01 13:02         ` Nicolas Pitre
2016-07-01 13:02           ` Nicolas Pitre
2016-07-01 13:00       ` Nicolas Pitre
2016-07-01 13:00         ` Nicolas Pitre
2016-07-01 14:16       ` Shreyas B Prabhu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5775335E.2040003@linaro.org \
    --to=daniel.lezcano@linaro.org \
    --cc=David.Laight@ACULAB.COM \
    --cc=anton@samba.org \
    --cc=arnd@arndb.de \
    --cc=linux-pm@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=nicolas.pitre@linaro.org \
    --cc=rjw@rjwysocki.net \
    --cc=shreyas@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.