linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v2 0/2] timers/nohz: Optimize /proc/stat idle/iowait fluctuation
@ 2025-12-10  8:31 Xin Zhao
  2025-12-10  8:31 ` [PATCH v2 1/2] timers/nohz: Revise a comment about broken iowait counter update race Xin Zhao
  2025-12-10  8:31 ` [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug Xin Zhao
  0 siblings, 2 replies; 4+ messages in thread
From: Xin Zhao @ 2025-12-10  8:31 UTC (permalink / raw)
  To: anna-maria, frederic, mingo, tglx, kuba
  Cc: jackzxcui1989, linux-kernel, linux-fsdevel

The idle and iowait statistics in /proc/stat are obtained through
get_idle_time() and get_iowait_time(). Assuming CONFIG_NO_HZ_COMMON is
enabled, when a CPU is online, the idle and iowait values use the
idle_sleeptime and iowait_sleeptime statistics from tick_cpu_sched, but
use the CPUTIME_IDLE and CPUTIME_IOWAIT items from kernel_cpustat when
the CPU is offline. Although /proc/stat does not print statistics for
offline CPUs, it still prints aggregated statistics of all possible CPUs.

tick_cpu_sched and kernel_cpustat are maintained by different logic,
leading to a significant gap between them. The first line of the data
below shows the /proc/stat output when only one CPU remains online after
CPU offline; the second line shows the /proc/stat output after all CPUs
are brought back online:

cpu  2408558 2 916619 4275883 5403 123758 64685 0 0 0
cpu  2408588 2 916693 4200737 4184 123762 64686 0 0 0

The other values do not fluctuate significantly, while the idle/iowait
statistics show a substantial decrease, which makes system CPU
monitoring troublesome.

get_cpu_idle_time_us() calculates the latest CPU idle time based on
idle_entrytime and the current time. If a CPU is idle when it goes
offline, the value returned by get_cpu_idle_time_us() continues to
increase, which is unexpected. get_cpu_iowait_time_us() has similar
calculation logic: if a CPU is in the iowait state when it goes offline,
the value returned by get_cpu_iowait_time_us() continues to increase.

Introduce get_cpu_idle_time_us_offline() as the _offline variant of
get_cpu_idle_time_us(). get_cpu_idle_time_us_offline() simply returns
the value of idle_sleeptime without any further calculation. In this
way, the /proc/stat logic can use it to get a correct CPU idle time,
which remains unchanged while the CPU is offline. As a result, the
aggregated statistics of all possible CPUs printed by /proc/stat no
longer experience significant fluctuation across CPU hotplug. The same
applies to the newly added get_cpu_iowait_time_us_offline().

In addition, revise a comment about broken iowait counter update race
related to the topic.

Xin Zhao (2):
  timers/nohz: Revise a comment about broken iowait counter update race
  timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug

 fs/proc/stat.c           |  4 +++
 include/linux/tick.h     |  4 +++
 kernel/time/tick-sched.c | 58 ++++++++++++++++++++++++++++++++++++++--
 3 files changed, 64 insertions(+), 2 deletions(-)

-- 
2.34.1



* [PATCH v2 1/2] timers/nohz: Revise a comment about broken iowait counter update race
  2025-12-10  8:31 [PATCH v2 0/2] timers/nohz: Optimize /proc/stat idle/iowait fluctuation Xin Zhao
@ 2025-12-10  8:31 ` Xin Zhao
  2025-12-10  8:31 ` [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug Xin Zhao
  1 sibling, 0 replies; 4+ messages in thread
From: Xin Zhao @ 2025-12-10  8:31 UTC (permalink / raw)
  To: anna-maria, frederic, mingo, tglx, kuba
  Cc: jackzxcui1989, linux-kernel, linux-fsdevel

Task migration during wakeup may cause a /proc/stat iowait regression,
as mentioned in the commit message of Frederic Weisbecker's previous
patch. The nr_iowait statistic of a rq can be decreased by a remote CPU
during wakeup, leading to a sudden drop in the /proc/stat iowait
calculation, while the /proc/stat idle statistic experiences a sudden
increase. Excluding the hotplug scenario, in which the /proc/stat idle
statistic may experience a sudden decrease (fixed in a subsequent
patch), the /proc/stat idle statistic never decreases suddenly, as there
is no logic in the kernel that allows a remote CPU to increase a rq's
nr_iowait.

Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
---
 kernel/time/tick-sched.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8ddf74e70..4d089b290 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -812,8 +812,8 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
  * Return the cumulative idle time (since boot) for a given
  * CPU, in microseconds. Note that this is partially broken due to
  * the counter of iowait tasks that can be remotely updated without
- * any synchronization. Therefore it is possible to observe backward
- * values within two consecutive reads.
+ * any synchronization. Therefore it is possible to observe sudden
+ * increases within two consecutive reads.
  *
  * This time is measured via accounting rather than sampling,
  * and is as accurate as ktime_get() is.
-- 
2.34.1



* [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug
  2025-12-10  8:31 [PATCH v2 0/2] timers/nohz: Optimize /proc/stat idle/iowait fluctuation Xin Zhao
  2025-12-10  8:31 ` [PATCH v2 1/2] timers/nohz: Revise a comment about broken iowait counter update race Xin Zhao
@ 2025-12-10  8:31 ` Xin Zhao
  2025-12-11 16:34   ` Frederic Weisbecker
  1 sibling, 1 reply; 4+ messages in thread
From: Xin Zhao @ 2025-12-10  8:31 UTC (permalink / raw)
  To: anna-maria, frederic, mingo, tglx, kuba
  Cc: jackzxcui1989, linux-kernel, linux-fsdevel

The idle and iowait statistics in /proc/stat are obtained through
get_idle_time() and get_iowait_time(). Assuming CONFIG_NO_HZ_COMMON is
enabled, when a CPU is online, the idle and iowait values use the
idle_sleeptime and iowait_sleeptime statistics from tick_cpu_sched, but
use the CPUTIME_IDLE and CPUTIME_IOWAIT items from kernel_cpustat when
the CPU is offline. Although /proc/stat does not print statistics for
offline CPUs, it still prints aggregated statistics of all possible CPUs.

tick_cpu_sched and kernel_cpustat are maintained by different logic,
leading to a significant gap between them. The first line of the data
below shows the /proc/stat output when only one CPU remains online after
CPU offline; the second line shows the /proc/stat output after all CPUs
are brought back online:

cpu  2408558 2 916619 4275883 5403 123758 64685 0 0 0
cpu  2408588 2 916693 4200737 4184 123762 64686 0 0 0

The other values do not fluctuate significantly, while the idle/iowait
statistics show a substantial decrease, which makes system CPU
monitoring troublesome.

get_cpu_idle_time_us() calculates the latest CPU idle time based on
idle_entrytime and the current time. If a CPU is idle when it goes
offline, the value returned by get_cpu_idle_time_us() continues to
increase, which is unexpected. get_cpu_iowait_time_us() has similar
calculation logic: if a CPU is in the iowait state when it goes offline,
the value returned by get_cpu_iowait_time_us() continues to increase.

Introduce get_cpu_idle_time_us_offline() as the _offline variant of
get_cpu_idle_time_us(). get_cpu_idle_time_us_offline() simply returns
the value of idle_sleeptime without any further calculation. In this
way, the /proc/stat logic can use it to get a correct CPU idle time,
which remains unchanged while the CPU is offline. As a result, the
aggregated statistics of all possible CPUs printed by /proc/stat no
longer experience significant fluctuation across CPU hotplug. The same
applies to the newly added get_cpu_iowait_time_us_offline().

Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
---
 fs/proc/stat.c           |  4 +++
 include/linux/tick.h     |  4 +++
 kernel/time/tick-sched.c | 54 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 62 insertions(+)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 8b444e862..9920e7bfc 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -28,6 +28,8 @@ u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
 
 	if (cpu_online(cpu))
 		idle_usecs = get_cpu_idle_time_us(cpu, NULL);
+	else
+		idle_usecs = get_cpu_idle_time_us_offline(cpu);
 
 	if (idle_usecs == -1ULL)
 		/* !NO_HZ or cpu offline so we can rely on cpustat.idle */
@@ -44,6 +46,8 @@ static u64 get_iowait_time(struct kernel_cpustat *kcs, int cpu)
 
 	if (cpu_online(cpu))
 		iowait_usecs = get_cpu_iowait_time_us(cpu, NULL);
+	else
+		iowait_usecs = get_cpu_iowait_time_us_offline(cpu);
 
 	if (iowait_usecs == -1ULL)
 		/* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ac76ae9fa..e5db27657 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -139,7 +139,9 @@ extern ktime_t tick_nohz_get_next_hrtimer(void);
 extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
 extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
+extern u64 get_cpu_idle_time_us_offline(int cpu);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+extern u64 get_cpu_iowait_time_us_offline(int cpu);
 #else /* !CONFIG_NO_HZ_COMMON */
 #define tick_nohz_enabled (0)
 static inline int tick_nohz_tick_stopped(void) { return 0; }
@@ -161,7 +163,9 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
 	return *delta_next;
 }
 static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
+static inline u64 get_cpu_idle_time_us_offline(int cpu) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
+static inline u64 get_cpu_iowait_time_us_offline(int cpu) { return -1; }
 #endif /* !CONFIG_NO_HZ_COMMON */
 
 /*
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4d089b290..2cda08f94 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -829,6 +829,33 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
 
+/**
+ * get_cpu_idle_time_us_offline - get the total idle time of an offline CPU
+ * @cpu: CPU number to query
+ *
+ * Return the idle time (since boot) for a given offline CPU, in microseconds.
+ *
+ * This function does not calculate the latest idle time based on
+ * idle_entrytime and current time like get_cpu_idle_time_us() does.
+ * If get_cpu_idle_time_us() is used to obtain the idle time of an
+ * offline CPU, its value will continue to increase, which is unexpected.
+ *
+ * This time is measured via accounting rather than sampling,
+ * and is as accurate as ktime_get() is.
+ *
+ * This function returns -1 if NOHZ is not enabled.
+ */
+u64 get_cpu_idle_time_us_offline(int cpu)
+{
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+
+	if (!tick_nohz_active)
+		return -1;
+
+	return ktime_to_us(ts->idle_sleeptime);
+}
+EXPORT_SYMBOL_GPL(get_cpu_idle_time_us_offline);
+
 /**
  * get_cpu_iowait_time_us - get the total iowait time of a CPU
  * @cpu: CPU number to query
@@ -855,6 +882,33 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
+/**
+ * get_cpu_iowait_time_us_offline - get the total iowait time of an offline CPU
+ * @cpu: CPU number to query
+ *
+ * Return the iowait time (since boot) for a given offline CPU, in microseconds.
+ *
+ * This function does not calculate the latest iowait time based on
+ * idle_entrytime and current time like get_cpu_iowait_time_us() does.
+ * If get_cpu_iowait_time_us() is used to obtain the iowait time of an
+ * offline CPU, its value will continue to increase, which is unexpected.
+ *
+ * This time is measured via accounting rather than sampling,
+ * and is as accurate as ktime_get() is.
+ *
+ * This function returns -1 if NOHZ is not enabled.
+ */
+u64 get_cpu_iowait_time_us_offline(int cpu)
+{
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+
+	if (!tick_nohz_active)
+		return -1;
+
+	return ktime_to_us(ts->iowait_sleeptime);
+}
+EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us_offline);
+
 static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 {
 	hrtimer_cancel(&ts->sched_timer);
-- 
2.34.1



* Re: [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug
  2025-12-10  8:31 ` [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug Xin Zhao
@ 2025-12-11 16:34   ` Frederic Weisbecker
  0 siblings, 0 replies; 4+ messages in thread
From: Frederic Weisbecker @ 2025-12-11 16:34 UTC (permalink / raw)
  To: Xin Zhao; +Cc: anna-maria, mingo, tglx, kuba, linux-kernel, linux-fsdevel

Hi Xin,

On Wed, Dec 10, 2025 at 04:31:35PM +0800, Xin Zhao wrote:
> The idle and iowait statistics in /proc/stat are obtained through
> get_idle_time() and get_iowait_time(). Assuming CONFIG_NO_HZ_COMMON is
> enabled, when a CPU is online, the idle and iowait values use the
> idle_sleeptime and iowait_sleeptime statistics from tick_cpu_sched, but
> use the CPUTIME_IDLE and CPUTIME_IOWAIT items from kernel_cpustat when
> the CPU is offline. Although /proc/stat does not print statistics for
> offline CPUs, it still prints aggregated statistics of all possible CPUs.
> 
> tick_cpu_sched and kernel_cpustat are maintained by different logic,
> leading to a significant gap between them. The first line of the data
> below shows the /proc/stat output when only one CPU remains online after
> CPU offline; the second line shows the /proc/stat output after all CPUs
> are brought back online:
> 
> cpu  2408558 2 916619 4275883 5403 123758 64685 0 0 0
> cpu  2408588 2 916693 4200737 4184 123762 64686 0 0 0
> 
> The other values do not fluctuate significantly, while the idle/iowait
> statistics show a substantial decrease, which makes system CPU
> monitoring troublesome.
> 
> get_cpu_idle_time_us() calculates the latest CPU idle time based on
> idle_entrytime and the current time. If a CPU is idle when it goes
> offline, the value returned by get_cpu_idle_time_us() continues to
> increase, which is unexpected. get_cpu_iowait_time_us() has similar
> calculation logic: if a CPU is in the iowait state when it goes offline,
> the value returned by get_cpu_iowait_time_us() continues to increase.
> 
> Introduce get_cpu_idle_time_us_offline() as the _offline variant of
> get_cpu_idle_time_us(). get_cpu_idle_time_us_offline() simply returns
> the value of idle_sleeptime without any further calculation. In this
> way, the /proc/stat logic can use it to get a correct CPU idle time,
> which remains unchanged while the CPU is offline. As a result, the
> aggregated statistics of all possible CPUs printed by /proc/stat no
> longer experience significant fluctuation across CPU hotplug. The same
> applies to the newly added get_cpu_iowait_time_us_offline().
> 
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>

So the problem is a bit deeper (warning: lots of bullets)

First of all you shouldn't need get_cpu_iowait_time_us_offline().
get_cpu_idle_time_us() is supposed to stop advancing after
tick_sched_timer_dying().

But since this code:

	if (cpu_is_offline(cpu)) {
		cpuhp_report_idle_dead();
		arch_cpu_idle_dead();
	}

is placed after tick_nohz_idle_enter() in do_idle(), the idle time
moves forward as long as the CPU is offline.

In fact offline handling in idle should happen at the very beginning
of do_idle because:

* There is no tick handling to do
* There is no nohz idle balancing to do
* And polling to TIF_RESCHED is useless
* No need to check need_resched() before offline handling, since
  stop_machine is done and all per-cpu kthreads should be done with
  their job.

So:

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c174afe1dd17..35d79af3286d 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -260,6 +260,12 @@ static void do_idle(void)
 {
 	int cpu = smp_processor_id();
 
+	if (cpu_is_offline(cpu)) {
+		local_irq_disable();
+		cpuhp_report_idle_dead();
+		arch_cpu_idle_dead();
+	}
+
 	/*
 	 * Check if we need to update blocked load
 	 */
@@ -311,11 +317,6 @@ static void do_idle(void)
 		 */
 		local_irq_disable();
 
-		if (cpu_is_offline(cpu)) {
-			cpuhp_report_idle_dead();
-			arch_cpu_idle_dead();
-		}
-
 		arch_cpu_idle_enter();
 		rcu_nocb_flush_deferred_wakeup();
 

But tick_sched_timer_dying() temporarily clears the idle/iowait time.
It's fixable but that's not all. We still have two idle time accounting
with each having their shortcomings:

* The accounting for online CPUs which is based on delta between
  tick_nohz_start_idle() and tick_nohz_stop_idle().

  Pros:
       - Works when the tick is off

       - Has nsecs granularity
  Cons:
       - Ignores steal time and possibly spuriously subtracts it from
         later system accounting.

       - Assumes CONFIG_IRQ_TIME_ACCOUNTING by not accounting IRQs, but
         the idle irqtime is possibly spuriously subtracted from later
         system accounting.

       - Is not accurate when CONFIG_IRQ_TIME_ACCOUNTING=n

       - The windows 1) between idle task scheduling and the first call to
         tick_nohz_start_idle() and 2) between the last tick_nohz_stop_idle()
         and the end of the idle task are blind spots wrt. cputime accounting.

* The accounting for offline CPUs which is based on ticks and the
  jiffies delta during which the tick was stopped.

  Pros:
       - Handles steal time correctly

       - Handles CONFIG_IRQ_TIME_ACCOUNTING=y and CONFIG_IRQ_TIME_ACCOUNTING=n
         correctly.

       - Handles the whole idle task

  Cons:
       - Doesn't elapse when the tick is off

       - Has TICK_NSEC granularity (jiffies)

       - Needs to track the idle ticks that were accounted and subtract
         them from the total jiffies time spent while the tick was stopped.
         This is ugly.

So what I think we should do is having a single one solution always applying
to cpustat which does a hybrid approach with better support:

* Tick based accounting whenever the tick isn't stopped.

* When the tick is stopped, account like we do between tick_nohz_start_idle()
  and tick_nohz_stop_idle() (but only when the tick is actually stopped) and
  handle CONFIG_IRQ_TIME_ACCOUNTING correctly such that:

  - If y, stop on IRQ entry and restart on IRQ exit like we used to. It means
    we must flush cpu_irqtime::tick_delta upon tick restart so that it's not
    subtracted later from system accounting.

  - If n, keep accounting on IRQs (remove calls to tick_nohz_start_idle()
    and tick_nohz_stop_idle() during IRQs)

  During that tick stopped time, tick based accounting must be ignored.

  Idle steal delta time must be recorded at tick stop/tick restart boundaries
  and subtracted from later idle time accounting.

  This logic should be moved to vtime and we still need the offlining fix above.

Just give me a few days to work on that patchset.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs



Thread overview: 4+ messages
2025-12-10  8:31 [PATCH v2 0/2] timers/nohz: Optimize /proc/stat idle/iowait fluctuation Xin Zhao
2025-12-10  8:31 ` [PATCH v2 1/2] timers/nohz: Revise a comment about broken iowait counter update race Xin Zhao
2025-12-10  8:31 ` [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug Xin Zhao
2025-12-11 16:34   ` Frederic Weisbecker
