From: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
To: rafael@kernel.org, viresh.kumar@linaro.org
Cc: venkatesh.pallipadi@intel.com, davej@redhat.com, trenn@suse.de,
linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
zhongqiu.han@oss.qualcomm.com
Subject: [PATCH v2 2/2] cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load
Date: Thu, 9 Apr 2026 19:14:07 +0800 [thread overview]
Message-ID: <20260409111407.9775-3-zhongqiu.han@oss.qualcomm.com> (raw)
In-Reply-To: <20260409111407.9775-1-zhongqiu.han@oss.qualcomm.com>
When ignore_nice_load is toggled from 0 to 1 via sysfs, dbs_update()
may run concurrently and observe the new tunable value while
prev_cpu_nice still holds a stale baseline, producing a spurious
massive idle_time that results in an incorrect CPU load value.
The root cause is that prev_cpu_nice is only updated inside dbs_update()
when ignore_nice is true. While ignore_nice is false, prev_cpu_nice is
never advanced, so it accumulates an unbounded debt of nice CPU time.
The moment ignore_nice is flipped to 1, the very next dbs_update() call
computes:
idle_time += cur_nice - j_cdbs->prev_cpu_nice
where prev_cpu_nice is stale (possibly 0 if never updated since boot),
making idle_time artificially large.
The race can be illustrated with two concurrent paths:
Path A (sysfs write, holds attr_set->update_lock):
governor_store()
mutex_lock(&attr_set->update_lock)
ignore_nice_load_store()
dbs_data->ignore_nice_load = 1 /* (A1) */
gov_update_cpu_data(dbs_data)
mutex_lock(&policy_dbs->update_mutex) /* (A2) */
j_cdbs->prev_cpu_nice = kcpustat_field(...)
mutex_unlock(&policy_dbs->update_mutex)
mutex_unlock(&attr_set->update_lock)
Path B (work queue, wins the race between A1 and A2):
dbs_work_handler()
mutex_lock(&policy_dbs->update_mutex) /* acquired before A2 */
dbs_update()
ignore_nice = dbs_data->ignore_nice_load /* sees new value: 1 */
cur_nice = kcpustat_field(...)
idle_time += cur_nice - j_cdbs->prev_cpu_nice /* stale */
j_cdbs->prev_cpu_nice = cur_nice
mutex_unlock(&policy_dbs->update_mutex)
Note that even without the race, the anomaly occurs deterministically
on the very first dbs_update() call after ignore_nice_load is enabled,
because prev_cpu_nice has never been updated while ignore_nice was 0.
The race only widens the window in which the stale read can happen.
Fix this by unconditionally sampling cur_nice and advancing prev_cpu_nice
in dbs_update() on every call, regardless of ignore_nice. With
prev_cpu_nice always reflecting the most recent sample, enabling
ignore_nice_load can never produce a stale-baseline spike: the delta
will always be the nice time accumulated in the last sampling interval,
not since boot.
As a consequence of always tracking prev_cpu_nice:
- gov_update_cpu_data() no longer needs to reset prev_cpu_nice when
ignore_nice_load changes; remove that conditional.
- cpufreq_dbs_governor_start() must unconditionally initialize
prev_cpu_nice so the very first dbs_update() has a valid baseline;
remove the ignore_nice guard and the now-unused ignore_nice variable.
- ignore_nice_load_store() no longer needs to call gov_update_cpu_data()
at all (prev_cpu_nice is always current); remove that call.
Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
---
drivers/cpufreq/cpufreq_conservative.c | 3 ---
drivers/cpufreq/cpufreq_governor.c | 28 +++++++++++++++++---------
drivers/cpufreq/cpufreq_ondemand.c | 3 ---
3 files changed, 18 insertions(+), 16 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index df01d33993d8..5c316d2d3ddd 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -213,9 +213,6 @@ static ssize_t ignore_nice_load_store(struct gov_attr_set *attr_set,
dbs_data->ignore_nice_load = input;
- /* we need to re-evaluate prev_cpu_idle */
- gov_update_cpu_data(dbs_data);
-
return count;
}
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index c0d419c95609..cfbfa5d8bb36 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -92,6 +92,12 @@ EXPORT_SYMBOL_GPL(sampling_rate_store);
*
* Call under the @dbs_data->attr_set.update_lock. The per-policy
* update_mutex is acquired and released internally for each policy.
+ *
+ * Note: prev_cpu_nice is intentionally not reset here. dbs_update() tracks
+ * prev_cpu_nice unconditionally on every sample, so it is always current.
+ * Resetting it here is therefore unnecessary and would only introduce a
+ * one-sample spike if a concurrent dbs_update() ran between the reset and
+ * the next sample.
*/
void gov_update_cpu_data(struct dbs_data *dbs_data)
{
@@ -106,8 +112,6 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_update_time,
dbs_data->io_is_busy);
- if (dbs_data->ignore_nice_load)
- j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
}
mutex_unlock(&policy_dbs->update_mutex);
}
@@ -167,12 +171,18 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
j_cdbs->prev_cpu_idle = cur_idle_time;
- if (ignore_nice) {
- u64 cur_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
+ /*
+ * Always sample cur_nice and advance prev_cpu_nice, regardless
+ * of ignore_nice. This keeps prev_cpu_nice current so that
+ * enabling ignore_nice_load via sysfs never produces a
+ * stale-baseline spike (the delta will be at most one sampling
+ * interval of accumulated nice time, not since boot).
+ */
+ u64 cur_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
+ if (ignore_nice)
idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, NSEC_PER_USEC);
- j_cdbs->prev_cpu_nice = cur_nice;
- }
+ j_cdbs->prev_cpu_nice = cur_nice;
if (unlikely(!time_elapsed)) {
/*
@@ -519,7 +529,7 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
struct dbs_governor *gov = dbs_governor_of(policy);
struct policy_dbs_info *policy_dbs = policy->governor_data;
struct dbs_data *dbs_data = policy_dbs->dbs_data;
- unsigned int sampling_rate, ignore_nice, j;
+ unsigned int sampling_rate, j;
unsigned int io_busy;
if (!policy->cur)
@@ -529,7 +539,6 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
policy_dbs->rate_mult = 1;
sampling_rate = dbs_data->sampling_rate;
- ignore_nice = dbs_data->ignore_nice_load;
io_busy = dbs_data->io_is_busy;
mutex_lock(&policy_dbs->update_mutex);
@@ -542,8 +551,7 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
*/
j_cdbs->prev_load = 0;
- if (ignore_nice)
- j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
+ j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
}
mutex_unlock(&policy_dbs->update_mutex);
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 9942dbb38dae..d8d843183c21 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -261,9 +261,6 @@ static ssize_t ignore_nice_load_store(struct gov_attr_set *attr_set,
}
dbs_data->ignore_nice_load = input;
- /* we need to re-evaluate prev_cpu_idle */
- gov_update_cpu_data(dbs_data);
-
return count;
}
--
2.43.0
prev parent reply other threads:[~2026-04-09 11:14 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-09 11:14 [PATCH v2 0/2] cpufreq: governor: Fix races and stale baseline on prev_cpu_nice Zhongqiu Han
2026-04-09 11:14 ` [PATCH v2 1/2] cpufreq: governor: Fix race between sysfs store and dbs work handler Zhongqiu Han
2026-04-09 11:14 ` Zhongqiu Han [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260409111407.9775-3-zhongqiu.han@oss.qualcomm.com \
--to=zhongqiu.han@oss.qualcomm.com \
--cc=davej@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pm@vger.kernel.org \
--cc=rafael@kernel.org \
--cc=trenn@suse.de \
--cc=venkatesh.pallipadi@intel.com \
--cc=viresh.kumar@linaro.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox