[PATCH] cpufreq: governor: Fix handling of special cases in dbs_update()

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: "Rafael J. Wysocki" <rjw@rjwysocki.net>
To: Linux PM list <linux-pm@vger.kernel.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	Chen Yu <yu.c.chen@intel.com>,
	Timo Valtoaho <timo.valtoaho@gmail.com>
Subject: [PATCH] cpufreq: governor: Fix handling of special cases in dbs_update()
Date: Fri, 06 May 2016 01:30:37 +0200	[thread overview]
Message-ID: <4149261.sebfWj8seH@vostro.rjw.lan> (raw)

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

As reported in KBZ 69821:

"With CONFIG_HZ_PERIODIC=y cpu stays at the lowest frequcency 800MHz
 even if usage goes to 100%, frequency does not scale up, the governor
 in use is ondemand. Neither works conservative. Performance and
 userspace governors work as expected.

 With CONFIG_NO_HZ_IDLE or CONFIG_NO_HZ_FULL cpu scales up with ondemand
 as expected."

Analysis carried out by Chen Yu leads to the conclusion that the
observed issue is due to idle_time in dbs_update() representing a
negative number in which case the function will return 0 as the load
(unless load is greater than 0 for another CPU sharing the policy),
although that need not be the right choice.

Indeed, idle_time representing a negative number means that during
the last sampling interval the CPU was almost 100% busy on the rough
average, so 100 should be returned as the load in that case.

Modify the code accordingly and rearrange it to clarify the handling
of all of the special cases in it.  While at it, also avoid returning
zero as the load if time_elapsed is 0 (it doesn't really make sense
to return 0 then).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=69821
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Timo Valtoaho <timo.valtoaho@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/cpufreq_governor.c |   87 +++++++++++++++++++++----------------
 1 file changed, 51 insertions(+), 36 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -156,47 +156,62 @@ unsigned int dbs_update(struct cpufreq_p
 			j_cdbs->prev_cpu_nice = cur_nice;
 		}
 
-		if (unlikely(!time_elapsed || time_elapsed < idle_time))
-			continue;
-
-		/*
-		 * If the CPU had gone completely idle, and a task just woke up
-		 * on this CPU now, it would be unfair to calculate 'load' the
-		 * usual way for this elapsed time-window, because it will show
-		 * near-zero load, irrespective of how CPU intensive that task
-		 * actually is. This is undesirable for latency-sensitive bursty
-		 * workloads.
-		 *
-		 * To avoid this, we reuse the 'load' from the previous
-		 * time-window and give this task a chance to start with a
-		 * reasonably high CPU frequency. (However, we shouldn't over-do
-		 * this copy, lest we get stuck at a high load (high frequency)
-		 * for too long, even when the current system load has actually
-		 * dropped down. So we perform the copy only once, upon the
-		 * first wake-up from idle.)
-		 *
-		 * Detecting this situation is easy: the governor's utilization
-		 * update handler would not have run during CPU-idle periods.
-		 * Hence, an unusually large 'time_elapsed' (as compared to the
-		 * sampling rate) indicates this scenario.
-		 *
-		 * prev_load can be zero in two cases and we must recalculate it
-		 * for both cases:
-		 * - during long idle intervals
-		 * - explicitly set to zero
-		 */
-		if (unlikely(time_elapsed > 2 * sampling_rate &&
-			     j_cdbs->prev_load)) {
+		if (unlikely(!time_elapsed)) {
+			/*
+			 * That can only happen when this function is called
+			 * twice in a row with a very short interval between the
+			 * calls, so the previous load value can be used then.
+			 */
 			load = j_cdbs->prev_load;
-
+		} else if (unlikely(time_elapsed > 2 * sampling_rate &&
+				    j_cdbs->prev_load)) {
 			/*
-			 * Perform a destructive copy, to ensure that we copy
-			 * the previous load only once, upon the first wake-up
-			 * from idle.
+			 * If the CPU had gone completely idle and a task has
+			 * just woken up on this CPU now, it would be unfair to
+			 * calculate 'load' the usual way for this elapsed
+			 * time-window, because it would show near-zero load,
+			 * irrespective of how CPU intensive that task actually
+			 * was. This is undesirable for latency-sensitive bursty
+			 * workloads.
+			 *
+			 * To avoid this, reuse the 'load' from the previous
+			 * time-window and give this task a chance to start with
+			 * a reasonably high CPU frequency. However, that
+			 * shouldn't be over-done, lest we get stuck at a high
+			 * load (high frequency) for too long, even when the
+			 * current system load has actually dropped down, so
+			 * clear prev_load to guarantee that the load will be
+			 * computed again next time.
+			 *
+			 * Detecting this situation is easy: the governor's
+			 * utilization update handler would not have run during
+			 * CPU-idle periods.  Hence, an unusually large
+			 * 'time_elapsed' (as compared to the sampling rate)
+			 * indicates this scenario.
 			 */
+			load = j_cdbs->prev_load;
 			j_cdbs->prev_load = 0;
 		} else {
-			load = 100 * (time_elapsed - idle_time) / time_elapsed;
+			if (time_elapsed >= idle_time) {
+				load = 100 * (time_elapsed - idle_time) / time_elapsed;
+			} else {
+				/*
+				 * That can happen if idle_time is returned by
+				 * get_cpu_idle_time_jiffy().  In that case
+				 * idle_time is roughly equal to the difference
+				 * between time_elapsed and "busy time" obtained
+				 * from CPU statistics.  Then, the "busy time"
+				 * can end up being greater than time_elapsed
+				 * (for example, if jiffies_64 and the CPU
+				 * statistics are updated by different CPUs),
+				 * so idle_time may in fact be negative.  That
+				 * means, though, that the CPU was busy all
+				 * the time (on the rough average) during the
+				 * last sampling interval and 100 can be
+				 * returned as the load.
+				 */
+				load = (int)idle_time < 0 ? 100 : 0;
+			}
 			j_cdbs->prev_load = load;
 		}

next             reply	other threads:[~2016-05-05 23:27 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-05 23:30 Rafael J. Wysocki [this message]
2016-05-06  6:30 ` [PATCH] cpufreq: governor: Fix handling of special cases in dbs_update() Viresh Kumar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4149261.sebfWj8seH@vostro.rjw.lan \
    --to=rjw@rjwysocki.net \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=timo.valtoaho@gmail.com \
    --cc=viresh.kumar@linaro.org \
    --cc=yu.c.chen@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox