public inbox for linux-pm@vger.kernel.org
* [PATCH v2 0/2] cpufreq: governor: Fix races and stale baseline on prev_cpu_nice
@ 2026-04-09 11:14 Zhongqiu Han
  2026-04-09 11:14 ` [PATCH v2 1/2] cpufreq: governor: Fix race between sysfs store and dbs work handler Zhongqiu Han
  2026-04-09 11:14 ` [PATCH v2 2/2] cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load Zhongqiu Han
  0 siblings, 2 replies; 3+ messages in thread
From: Zhongqiu Han @ 2026-04-09 11:14 UTC (permalink / raw)
  To: rafael, viresh.kumar
  Cc: venkatesh.pallipadi, davej, trenn, linux-pm, linux-kernel,
	zhongqiu.han

Patch 1 fixes a data race between sysfs store callbacks and the DBS
work handler.  gov_update_cpu_data() writes prev_cpu_idle and
prev_cpu_nice while holding only attr_set->update_lock, whereas
dbs_update() reads and writes the same fields while holding only
policy_dbs->update_mutex.  Because these are independent locks, the
two paths are not mutually exclusive.  The fix acquires
policy_dbs->update_mutex inside gov_update_cpu_data() for each
policy, and also holds it around the initialization loop in
cpufreq_dbs_governor_start() to close a similar window against
concurrent sysfs writes.

Patch 2 fixes a stale-baseline spike on prev_cpu_nice that occurs
when ignore_nice_load is enabled via sysfs.  Because prev_cpu_nice
is only advanced in dbs_update() when ignore_nice is true, it
accumulates an unbounded debt of nice CPU time while ignore_nice is
false.  The moment ignore_nice_load is flipped to 1, the next
dbs_update() computes a massive idle_time delta against the stale
baseline, producing an incorrect CPU load value.  The fix
unconditionally samples and advances prev_cpu_nice on every
dbs_update() call, regardless of ignore_nice, so the baseline is
always current.  As a consequence, the prev_cpu_nice reset in
gov_update_cpu_data() and the gov_update_cpu_data() call in
ignore_nice_load_store() are no longer needed and are removed.

Changes in v2:
- Rebase onto the current linux-next base.
- Based on v1 review, patch 1 is updated to add the missing
  protection around cpufreq_dbs_governor_start(), and patch 2/2 is new
  in this series.
- Link to v1: https://lore.kernel.org/all/20260406110113.3475920-1-zhongqiu.han@oss.qualcomm.com/


Zhongqiu Han (2):
  cpufreq: governor: Fix race between sysfs store and dbs work handler
  cpufreq: governor: Fix stale prev_cpu_nice spike when enabling
    ignore_nice_load

 drivers/cpufreq/cpufreq_conservative.c |  3 ---
 drivers/cpufreq/cpufreq_governor.c     | 35 ++++++++++++++++++--------
 drivers/cpufreq/cpufreq_ondemand.c     |  3 ---
 3 files changed, 24 insertions(+), 17 deletions(-)


base-commit: f3e6330d7fe42b204af05a2dbc68b379e0ad179e
-- 
2.43.0



* [PATCH v2 1/2] cpufreq: governor: Fix race between sysfs store and dbs work handler
  2026-04-09 11:14 [PATCH v2 0/2] cpufreq: governor: Fix races and stale baseline on prev_cpu_nice Zhongqiu Han
@ 2026-04-09 11:14 ` Zhongqiu Han
  2026-04-09 11:14 ` [PATCH v2 2/2] cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load Zhongqiu Han
  1 sibling, 0 replies; 3+ messages in thread
From: Zhongqiu Han @ 2026-04-09 11:14 UTC (permalink / raw)
  To: rafael, viresh.kumar
  Cc: venkatesh.pallipadi, davej, trenn, linux-pm, linux-kernel,
	zhongqiu.han

gov_update_cpu_data() resets per-CPU prev_cpu_idle and prev_cpu_nice
for every CPU in the governed domain. It is called from sysfs store
callbacks (e.g. ignore_nice_load_store) which run under
attr_set->update_lock, held by the surrounding governor_store().

Concurrently, dbs_work_handler() calls gov->gov_dbs_update() (which
calls dbs_update()) under policy_dbs->update_mutex. dbs_update() both
reads and writes the same prev_cpu_idle / prev_cpu_nice fields. The
potential race path is:

Path A (sysfs write, holds attr_set->update_lock only):

  governor_store()
    mutex_lock(&attr_set->update_lock)
    ignore_nice_load_store()
      dbs_data->ignore_nice_load = input
      gov_update_cpu_data(dbs_data)
        list_for_each_entry(policy_dbs, ...)
          for_each_cpu(j, ...)
            j_cdbs->prev_cpu_idle = get_cpu_idle_time(...)  /* write */
            j_cdbs->prev_cpu_nice = kcpustat_field(...)     /* write */
    mutex_unlock(&attr_set->update_lock)

Path B (work queue, holds policy_dbs->update_mutex only):

  dbs_work_handler()
    mutex_lock(&policy_dbs->update_mutex)
    gov->gov_dbs_update(policy)
      dbs_update()
        for_each_cpu(j, policy->cpus)
          idle_time = cur - j_cdbs->prev_cpu_idle           /* read  */
          j_cdbs->prev_cpu_idle = cur_idle_time             /* write */
          idle_time += cur_nice - j_cdbs->prev_cpu_nice     /* read  */
          j_cdbs->prev_cpu_nice = cur_nice                  /* write */
    mutex_unlock(&policy_dbs->update_mutex)

Because attr_set->update_lock and policy_dbs->update_mutex are two
completely independent locks, the two paths are not mutually exclusive.
This results in a data race on cpu_dbs_info.prev_cpu_idle and
cpu_dbs_info.prev_cpu_nice.

Fix this by also acquiring policy_dbs->update_mutex in
gov_update_cpu_data() for each policy, so that path A participates in
the mutual exclusion already established by dbs_work_handler(). Also
update the function comment to accurately reflect the two-level locking
contract.

Additionally, cpufreq_dbs_governor_start() initializes prev_cpu_idle
and prev_cpu_nice without holding policy_dbs->update_mutex. After
cpufreq_dbs_governor_init() returns, the new policy is already visible
in attr_set->policy_list and sysfs attributes are accessible. A
concurrent sysfs write can therefore call gov_update_cpu_data() and
race with the initialization loop on the same u64 fields. Fix this by
holding policy_dbs->update_mutex around the initialization loop in
cpufreq_dbs_governor_start() as well.

The root of this race dates back to the original ondemand/conservative
governors. Before commit ee88415caf73 ("[CPUFREQ] Cleanup locking in
conservative governor") and commit 5a75c82828e7 ("[CPUFREQ] Cleanup
locking in ondemand governor"), all accesses to prev_cpu_idle and
prev_cpu_nice in cpufreq_governor_dbs() (path X), store_ignore_nice_load()
(path Y), and do_dbs_timer() (path Z) were serialised by the same
dbs_mutex, so no race existed. Those two commits switched do_dbs_timer()
from dbs_mutex to a per-policy/per-cpu timer_mutex to reduce lock
contention, but left store_ignore_nice_load() still holding dbs_mutex.
As a result, path Y (store) and path Z (do_dbs_timer) no longer shared a
common lock, introducing a potential race on prev_cpu_idle/prev_cpu_nice
between store_ignore_nice_load() and dbs_check_cpu().

Commit 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") then removed
dbs_mutex from store_ignore_nice_load() entirely, introducing an
additional potential race between store_ignore_nice_load() (path Y, now
lockless) and cpufreq_governor_dbs() (path X, still holding dbs_mutex),
while the race between path Y and path Z remained.

Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks")
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
---
 drivers/cpufreq/cpufreq_governor.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 86f35e451914..c0d419c95609 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -90,7 +90,8 @@ EXPORT_SYMBOL_GPL(sampling_rate_store);
  * (that may be a single policy or a bunch of them if governor tunables are
  * system-wide).
  *
- * Call under the @dbs_data mutex.
+ * Call under the @dbs_data->attr_set.update_lock. The per-policy
+ * update_mutex is acquired and released internally for each policy.
  */
 void gov_update_cpu_data(struct dbs_data *dbs_data)
 {
@@ -99,6 +100,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
 	list_for_each_entry(policy_dbs, &dbs_data->attr_set.policy_list, list) {
 		unsigned int j;
 
+		mutex_lock(&policy_dbs->update_mutex);
 		for_each_cpu(j, policy_dbs->policy->cpus) {
 			struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
 
@@ -107,6 +109,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
 			if (dbs_data->ignore_nice_load)
 				j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
 		}
+		mutex_unlock(&policy_dbs->update_mutex);
 	}
 }
 EXPORT_SYMBOL_GPL(gov_update_cpu_data);
@@ -529,6 +532,7 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
 	ignore_nice = dbs_data->ignore_nice_load;
 	io_busy = dbs_data->io_is_busy;
 
+	mutex_lock(&policy_dbs->update_mutex);
 	for_each_cpu(j, policy->cpus) {
 		struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
 
@@ -541,6 +545,7 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
 		if (ignore_nice)
 			j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
 	}
+	mutex_unlock(&policy_dbs->update_mutex);
 
 	gov->start(policy);
 
-- 
2.43.0



* [PATCH v2 2/2] cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load
  2026-04-09 11:14 [PATCH v2 0/2] cpufreq: governor: Fix races and stale baseline on prev_cpu_nice Zhongqiu Han
  2026-04-09 11:14 ` [PATCH v2 1/2] cpufreq: governor: Fix race between sysfs store and dbs work handler Zhongqiu Han
@ 2026-04-09 11:14 ` Zhongqiu Han
  1 sibling, 0 replies; 3+ messages in thread
From: Zhongqiu Han @ 2026-04-09 11:14 UTC (permalink / raw)
  To: rafael, viresh.kumar
  Cc: venkatesh.pallipadi, davej, trenn, linux-pm, linux-kernel,
	zhongqiu.han

When ignore_nice_load is toggled from 0 to 1 via sysfs, dbs_update()
may run concurrently and observe the new tunable value while
prev_cpu_nice still holds a stale baseline, producing a spurious
massive idle_time that results in an incorrect CPU load value.

The root cause is that prev_cpu_nice is only updated inside dbs_update()
when ignore_nice is true.  While ignore_nice is false, prev_cpu_nice is
never advanced, so it accumulates an unbounded debt of nice CPU time.
The moment ignore_nice is flipped to 1, the very next dbs_update() call
computes:

  idle_time += cur_nice - j_cdbs->prev_cpu_nice

where prev_cpu_nice is stale (possibly 0 if never updated since boot),
making idle_time artificially large.

The race can be illustrated with two concurrent paths:

Path A (sysfs write, holds attr_set->update_lock):

  governor_store()
    mutex_lock(&attr_set->update_lock)
    ignore_nice_load_store()
      dbs_data->ignore_nice_load = 1              /* (A1) */
      gov_update_cpu_data(dbs_data)
        mutex_lock(&policy_dbs->update_mutex)     /* (A2) */
          j_cdbs->prev_cpu_nice = kcpustat_field(...)
        mutex_unlock(&policy_dbs->update_mutex)
    mutex_unlock(&attr_set->update_lock)

Path B (work queue, wins the race between A1 and A2):

  dbs_work_handler()
    mutex_lock(&policy_dbs->update_mutex)         /* acquired before A2 */
    dbs_update()
      ignore_nice = dbs_data->ignore_nice_load    /* sees new value: 1 */
      cur_nice = kcpustat_field(...)
      idle_time += cur_nice - j_cdbs->prev_cpu_nice /* stale */
      j_cdbs->prev_cpu_nice = cur_nice
    mutex_unlock(&policy_dbs->update_mutex)

Note that even without the race, the anomaly occurs deterministically
on the very first dbs_update() call after ignore_nice_load is enabled,
because prev_cpu_nice has never been updated while ignore_nice was 0.
The race only widens the window in which the stale read can happen.

Fix this by unconditionally sampling cur_nice and advancing prev_cpu_nice
in dbs_update() on every call, regardless of ignore_nice.  With
prev_cpu_nice always reflecting the most recent sample, enabling
ignore_nice_load can never produce a stale-baseline spike: the delta
will always be the nice time accumulated in the last sampling interval,
not since boot.

As a consequence of always tracking prev_cpu_nice:

  - gov_update_cpu_data() no longer needs to reset prev_cpu_nice when
    ignore_nice_load changes; remove that conditional.
  - cpufreq_dbs_governor_start() must unconditionally initialize
    prev_cpu_nice so the very first dbs_update() has a valid baseline;
    remove the ignore_nice guard and the now-unused ignore_nice variable.
  - ignore_nice_load_store() no longer needs to call gov_update_cpu_data()
    at all (prev_cpu_nice is always current); remove that call.

Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
---
 drivers/cpufreq/cpufreq_conservative.c |  3 ---
 drivers/cpufreq/cpufreq_governor.c     | 28 +++++++++++++++++---------
 drivers/cpufreq/cpufreq_ondemand.c     |  3 ---
 3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index df01d33993d8..5c316d2d3ddd 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -213,9 +213,6 @@ static ssize_t ignore_nice_load_store(struct gov_attr_set *attr_set,
 
 	dbs_data->ignore_nice_load = input;
 
-	/* we need to re-evaluate prev_cpu_idle */
-	gov_update_cpu_data(dbs_data);
-
 	return count;
 }
 
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index c0d419c95609..cfbfa5d8bb36 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -92,6 +92,12 @@ EXPORT_SYMBOL_GPL(sampling_rate_store);
  *
  * Call under the @dbs_data->attr_set.update_lock. The per-policy
  * update_mutex is acquired and released internally for each policy.
+ *
+ * Note: prev_cpu_nice is intentionally not reset here. dbs_update() tracks
+ * prev_cpu_nice unconditionally on every sample, so it is always current.
+ * Resetting it here is therefore unnecessary and would only introduce a
+ * one-sample spike if a concurrent dbs_update() ran between the reset and
+ * the next sample.
  */
 void gov_update_cpu_data(struct dbs_data *dbs_data)
 {
@@ -106,8 +112,6 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
 
 			j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_update_time,
 								  dbs_data->io_is_busy);
-			if (dbs_data->ignore_nice_load)
-				j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
 		}
 		mutex_unlock(&policy_dbs->update_mutex);
 	}
@@ -167,12 +171,18 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
 
 		j_cdbs->prev_cpu_idle = cur_idle_time;
 
-		if (ignore_nice) {
-			u64 cur_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
+		/*
+		 * Always sample cur_nice and advance prev_cpu_nice, regardless
+		 * of ignore_nice.  This keeps prev_cpu_nice current so that
+		 * enabling ignore_nice_load via sysfs never produces a
+		 * stale-baseline spike (the delta will be at most one sampling
+		 * interval of accumulated nice time, not since boot).
+		 */
+		u64 cur_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
 
+		if (ignore_nice)
 			idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, NSEC_PER_USEC);
-			j_cdbs->prev_cpu_nice = cur_nice;
-		}
+		j_cdbs->prev_cpu_nice = cur_nice;
 
 		if (unlikely(!time_elapsed)) {
 			/*
@@ -519,7 +529,7 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
 	struct dbs_governor *gov = dbs_governor_of(policy);
 	struct policy_dbs_info *policy_dbs = policy->governor_data;
 	struct dbs_data *dbs_data = policy_dbs->dbs_data;
-	unsigned int sampling_rate, ignore_nice, j;
+	unsigned int sampling_rate, j;
 	unsigned int io_busy;
 
 	if (!policy->cur)
@@ -529,7 +539,6 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
 	policy_dbs->rate_mult = 1;
 
 	sampling_rate = dbs_data->sampling_rate;
-	ignore_nice = dbs_data->ignore_nice_load;
 	io_busy = dbs_data->io_is_busy;
 
 	mutex_lock(&policy_dbs->update_mutex);
@@ -542,8 +551,7 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
 		 */
 		j_cdbs->prev_load = 0;
 
-		if (ignore_nice)
-			j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
+		j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
 	}
 	mutex_unlock(&policy_dbs->update_mutex);
 
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 9942dbb38dae..d8d843183c21 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -261,9 +261,6 @@ static ssize_t ignore_nice_load_store(struct gov_attr_set *attr_set,
 	}
 	dbs_data->ignore_nice_load = input;
 
-	/* we need to re-evaluate prev_cpu_idle */
-	gov_update_cpu_data(dbs_data);
-
 	return count;
 }
 
-- 
2.43.0


