* [PATCH v2 0/2] timers/nohz: Optimize /proc/stat idle/iowait fluctuation
@ 2025-12-10 8:31 Xin Zhao
2025-12-10 8:31 ` [PATCH v2 1/2] timers/nohz: Revise a comment about broken iowait counter update race Xin Zhao
2025-12-10 8:31 ` [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug Xin Zhao
0 siblings, 2 replies; 4+ messages in thread
From: Xin Zhao @ 2025-12-10 8:31 UTC (permalink / raw)
To: anna-maria, frederic, mingo, tglx, kuba
Cc: jackzxcui1989, linux-kernel, linux-fsdevel
The idle and iowait statistics in /proc/stat are obtained through
get_idle_time() and get_iowait_time(). Assuming CONFIG_NO_HZ_COMMON is
enabled, when CPU is online, the idle and iowait values use the
idle_sleeptime and iowait_sleeptime statistics from tick_cpu_sched, but
use CPUTIME_IDLE and CPUTIME_IOWAIT items from kernel_cpustat when CPU
is offline. Although /proc/stat do not print statistics of offline CPU,
it still print aggregated statistics of all possible CPUs.
ick_cpu_sched and kernel_cpustat are maintained by different logic,
leading to a significant gap. The first line of the data below shows the
/proc/stat output when only one CPU remains after CPU offline, the second
line shows the /proc/stat output after all CPUs are brought back online:
cpu 2408558 2 916619 4275883 5403 123758 64685 0 0 0
cpu 2408588 2 916693 4200737 4184 123762 64686 0 0 0
Obviously, other values do not experience significant fluctuations, while
idle/iowait statistics show a substantial decrease, which make system CPU
monitoring troublesome.
get_cpu_idle_time_us() calculates the latest cpu idle time based on
idle_entrytime and current time. When CPU is idle when offline, the value
return by get_cpu_idle_time_us() will continue to increase, which is
unexpected. get_cpu_iowait_time_us() has the similar calculation logic.
When CPU is in the iowait state when offline, the value return by
get_cpu_iowait_time_us() will continue to increase.
Introduce get_cpu_idle_time_us_offline() as the _offline variants of
get_cpu_idle_time_us(). get_cpu_idle_time_us_offline() just return the
same value of idle_sleeptime without any calculation. In this way,
/proc/stat logic can use it to get a correct CPU idle time, which remains
unchanged during CPU offline period. Also, the aggregated statistics of
all possible CPUs printed by /proc/stat will not experience significant
fluctuation when CPU hotplug.
So as the newly added get_cpu_iowait_time_us_offline().
In addition, revise a comment about broken iowait counter update race
related to the topic.
Xin Zhao (2):
timers/nohz: Revise a comment about broken iowait counter update race
timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug
fs/proc/stat.c | 4 +++
include/linux/tick.h | 4 +++
kernel/time/tick-sched.c | 58 ++++++++++++++++++++++++++++++++++++++--
3 files changed, 64 insertions(+), 2 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH v2 1/2] timers/nohz: Revise a comment about broken iowait counter update race
2025-12-10 8:31 [PATCH v2 0/2] timers/nohz: Optimize /proc/stat idle/iowait fluctuation Xin Zhao
@ 2025-12-10 8:31 ` Xin Zhao
2025-12-10 8:31 ` [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug Xin Zhao
1 sibling, 0 replies; 4+ messages in thread
From: Xin Zhao @ 2025-12-10 8:31 UTC (permalink / raw)
To: anna-maria, frederic, mingo, tglx, kuba
Cc: jackzxcui1989, linux-kernel, linux-fsdevel
Task migration during wakeup may cause /proc/stat iowait regression,
as mentioned in the commit message of Frederic Weisbecker's previous
patch. The nr_iowait statistic of rq can be decreased by remote CPU
during wakeup, leading to a sudden drop in /proc/stat iowait calculation,
while /proc/stat idle statistic experiences a sudden increase.
Excluding the hotplug scenario when /proc/stat idle statistic may
experiences a sudden decrease which is fixed in a subsequent patch,
/proc/stat idle statistic will never decrease suddenly, as there is no
logic in kernel that allows a remote CPU to increase rq nr_iowait.
Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
---
kernel/time/tick-sched.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8ddf74e70..4d089b290 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -812,8 +812,8 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
* Return the cumulative idle time (since boot) for a given
* CPU, in microseconds. Note that this is partially broken due to
* the counter of iowait tasks that can be remotely updated without
- * any synchronization. Therefore it is possible to observe backward
- * values within two consecutive reads.
+ * any synchronization. Therefore it is possible to observe sudden
+ * increases within two consecutive reads.
*
* This time is measured via accounting rather than sampling,
* and is as accurate as ktime_get() is.
--
2.34.1
^ permalink raw reply related [flat|nested] 4+ messages in thread
* [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug
2025-12-10 8:31 [PATCH v2 0/2] timers/nohz: Optimize /proc/stat idle/iowait fluctuation Xin Zhao
2025-12-10 8:31 ` [PATCH v2 1/2] timers/nohz: Revise a comment about broken iowait counter update race Xin Zhao
@ 2025-12-10 8:31 ` Xin Zhao
2025-12-11 16:34 ` Frederic Weisbecker
1 sibling, 1 reply; 4+ messages in thread
From: Xin Zhao @ 2025-12-10 8:31 UTC (permalink / raw)
To: anna-maria, frederic, mingo, tglx, kuba
Cc: jackzxcui1989, linux-kernel, linux-fsdevel
The idle and iowait statistics in /proc/stat are obtained through
get_idle_time() and get_iowait_time(). Assuming CONFIG_NO_HZ_COMMON is
enabled, when CPU is online, the idle and iowait values use the
idle_sleeptime and iowait_sleeptime statistics from tick_cpu_sched, but
use CPUTIME_IDLE and CPUTIME_IOWAIT items from kernel_cpustat when CPU
is offline. Although /proc/stat do not print statistics of offline CPU,
it still print aggregated statistics of all possible CPUs.
tick_cpu_sched and kernel_cpustat are maintained by different logic,
leading to a significant gap. The first line of the data below shows the
/proc/stat output when only one CPU remains after CPU offline, the second
line shows the /proc/stat output after all CPUs are brought back online:
cpu 2408558 2 916619 4275883 5403 123758 64685 0 0 0
cpu 2408588 2 916693 4200737 4184 123762 64686 0 0 0
Obviously, other values do not experience significant fluctuations, while
idle/iowait statistics show a substantial decrease, which make system CPU
monitoring troublesome.
get_cpu_idle_time_us() calculates the latest cpu idle time based on
idle_entrytime and current time. When CPU is idle when offline, the value
return by get_cpu_idle_time_us() will continue to increase, which is
unexpected. get_cpu_iowait_time_us() has the similar calculation logic.
When CPU is in the iowait state when offline, the value return by
get_cpu_iowait_time_us() will continue to increase.
Introduce get_cpu_idle_time_us_offline() as the _offline variants of
get_cpu_idle_time_us(). get_cpu_idle_time_us_offline() just return the
same value of idle_sleeptime without any calculation. In this way,
/proc/stat logic can use it to get a correct CPU idle time, which remains
unchanged during CPU offline period. Also, the aggregated statistics of
all possible CPUs printed by /proc/stat will not experience significant
fluctuation when CPU hotplug.
So as the newly added get_cpu_iowait_time_us_offline().
Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
---
fs/proc/stat.c | 4 +++
include/linux/tick.h | 4 +++
kernel/time/tick-sched.c | 54 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 62 insertions(+)
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 8b444e862..9920e7bfc 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -28,6 +28,8 @@ u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
if (cpu_online(cpu))
idle_usecs = get_cpu_idle_time_us(cpu, NULL);
+ else
+ idle_usecs = get_cpu_idle_time_us_offline(cpu);
if (idle_usecs == -1ULL)
/* !NO_HZ or cpu offline so we can rely on cpustat.idle */
@@ -44,6 +46,8 @@ static u64 get_iowait_time(struct kernel_cpustat *kcs, int cpu)
if (cpu_online(cpu))
iowait_usecs = get_cpu_iowait_time_us(cpu, NULL);
+ else
+ iowait_usecs = get_cpu_iowait_time_us_offline(cpu);
if (iowait_usecs == -1ULL)
/* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ac76ae9fa..e5db27657 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -139,7 +139,9 @@ extern ktime_t tick_nohz_get_next_hrtimer(void);
extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
+extern u64 get_cpu_idle_time_us_offline(int cpu);
extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+extern u64 get_cpu_iowait_time_us_offline(int cpu);
#else /* !CONFIG_NO_HZ_COMMON */
#define tick_nohz_enabled (0)
static inline int tick_nohz_tick_stopped(void) { return 0; }
@@ -161,7 +163,9 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
return *delta_next;
}
static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
+static inline u64 get_cpu_idle_time_us_offline(int cpu) { return -1; }
static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
+static inline u64 get_cpu_iowait_time_us_offline(int cpu) { return -1; }
#endif /* !CONFIG_NO_HZ_COMMON */
/*
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4d089b290..2cda08f94 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -829,6 +829,33 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
}
EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
+/**
+ * get_cpu_idle_time_us_offline - get the total idle time of an offline CPU
+ * @cpu: CPU number to query
+ *
+ * Return the idle time (since boot) for a given offline CPU, in microseconds.
+ *
+ * This function does not calculate the latest idle time based on
+ * idle_entrytime and current time like get_cpu_idle_time_us() does.
+ * If get_cpu_idle_time_us() is used to obtain idle time of an offline CPU,
+ * its value will continue to increase, which is unexpected.
+ *
+ * This time is measured via accounting rather than sampling,
+ * and is as accurate as ktime_get() is.
+ *
+ * This function returns -1 if NOHZ is not enabled.
+ */
+u64 get_cpu_idle_time_us_offline(int cpu)
+{
+ struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+
+ if (!tick_nohz_active)
+ return -1;
+
+ return ktime_to_us(ts->idle_sleeptime);
+}
+EXPORT_SYMBOL_GPL(get_cpu_idle_time_us_offline);
+
/**
* get_cpu_iowait_time_us - get the total iowait time of a CPU
* @cpu: CPU number to query
@@ -855,6 +882,33 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
}
EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
+/**
+ * get_cpu_iowait_time_us_offline - get the total iowait time of an offline CPU
+ * @cpu: CPU number to query
+ *
+ * Return the iowait time (since boot) for a given CPU, in microseconds.
+ *
+ * This function does not calculate the latest iowait time based on
+ * idle_entrytime and current time like get_cpu_iowait_time_us() does.
+ * If get_cpu_iowait_time_us is used to obtain iowait time of an offline CPU,
+ * its value will continue to increase, which is unexpected.
+ *
+ * This time is measured via accounting rather than sampling,
+ * and is as accurate as ktime_get() is.
+ *
+ * This function returns -1 if NOHZ is not enabled.
+ */
+u64 get_cpu_iowait_time_us_offline(int cpu)
+{
+ struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+
+ if (!tick_nohz_active)
+ return -1;
+
+ return ktime_to_us(ts->iowait_sleeptime);
+}
+EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us_offline);
+
static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
{
hrtimer_cancel(&ts->sched_timer);
--
2.34.1
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug
2025-12-10 8:31 ` [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug Xin Zhao
@ 2025-12-11 16:34 ` Frederic Weisbecker
0 siblings, 0 replies; 4+ messages in thread
From: Frederic Weisbecker @ 2025-12-11 16:34 UTC (permalink / raw)
To: Xin Zhao; +Cc: anna-maria, mingo, tglx, kuba, linux-kernel, linux-fsdevel
Hi Xin,
Le Wed, Dec 10, 2025 at 04:31:35PM +0800, Xin Zhao a écrit :
> The idle and iowait statistics in /proc/stat are obtained through
> get_idle_time() and get_iowait_time(). Assuming CONFIG_NO_HZ_COMMON is
> enabled, when CPU is online, the idle and iowait values use the
> idle_sleeptime and iowait_sleeptime statistics from tick_cpu_sched, but
> use CPUTIME_IDLE and CPUTIME_IOWAIT items from kernel_cpustat when CPU
> is offline. Although /proc/stat do not print statistics of offline CPU,
> it still print aggregated statistics of all possible CPUs.
>
> tick_cpu_sched and kernel_cpustat are maintained by different logic,
> leading to a significant gap. The first line of the data below shows the
> /proc/stat output when only one CPU remains after CPU offline, the second
> line shows the /proc/stat output after all CPUs are brought back online:
>
> cpu 2408558 2 916619 4275883 5403 123758 64685 0 0 0
> cpu 2408588 2 916693 4200737 4184 123762 64686 0 0 0
>
> Obviously, other values do not experience significant fluctuations, while
> idle/iowait statistics show a substantial decrease, which make system CPU
> monitoring troublesome.
>
> get_cpu_idle_time_us() calculates the latest cpu idle time based on
> idle_entrytime and current time. When CPU is idle when offline, the value
> return by get_cpu_idle_time_us() will continue to increase, which is
> unexpected. get_cpu_iowait_time_us() has the similar calculation logic.
> When CPU is in the iowait state when offline, the value return by
> get_cpu_iowait_time_us() will continue to increase.
>
> Introduce get_cpu_idle_time_us_offline() as the _offline variants of
> get_cpu_idle_time_us(). get_cpu_idle_time_us_offline() just return the
> same value of idle_sleeptime without any calculation. In this way,
> /proc/stat logic can use it to get a correct CPU idle time, which remains
> unchanged during CPU offline period. Also, the aggregated statistics of
> all possible CPUs printed by /proc/stat will not experience significant
> fluctuation when CPU hotplug.
> So as the newly added get_cpu_iowait_time_us_offline().
>
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
So the problem is a bit deeper (warning: lots of bullets)
First of all you shouldn't need get_cpu_iowait_time_us_offline().
get_cpu_idle_time_us() is supposed to stop advancing after
tick_sched_timer_dying().
But since this code:
if (cpu_is_offline(cpu)) {
cpuhp_report_idle_dead();
arch_cpu_idle_dead();
}
is placed after tick_nohz_idle_enter() in do_idle(), the idle time
moves forward as long as the CPU is offline.
In fact offline handling in idle should happen at the very beginning
of do_idle because:
* There is no tick handling to do
* There is no nohz idle balancing to do
* And polling to TIF_RESCHED is useless
* No need to check if need_resched() before offline handling since
stop_machine is done and all per-cpu kthread should be done with
their job.
So:
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c174afe1dd17..35d79af3286d 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -260,6 +260,12 @@ static void do_idle(void)
{
int cpu = smp_processor_id();
+ if (cpu_is_offline(cpu)) {
+ local_irq_disable();
+ cpuhp_report_idle_dead();
+ arch_cpu_idle_dead();
+ }
+
/*
* Check if we need to update blocked load
*/
@@ -311,11 +317,6 @@ static void do_idle(void)
*/
local_irq_disable();
- if (cpu_is_offline(cpu)) {
- cpuhp_report_idle_dead();
- arch_cpu_idle_dead();
- }
-
arch_cpu_idle_enter();
rcu_nocb_flush_deferred_wakeup();
But tick_sched_timer_dying() temporarily clears the idle/iowait time.
It's fixable but that's not all. We still have two idle time accounting
with each having their shortcomings:
* The accounting for online CPUs which is based on delta between
tick_nohz_start_idle() and tick_nohz_stop_idle().
Pros:
- Works when the tick is off
- Has nsecs granularity
Cons:
- Ignore steal time and possibly spuriously substract it from later
system accounting.
- Assumes CONFIG_IRQ_TIME_ACCOUNTING by not accounting IRQs but the
idle irqtime is possibly spuriously substracted from later system
accounting.
- Is not accurate when CONFIG_IRQ_TIME_ACCOUNTING=n
- The windows between 1) idle task scheduling and the first call to
tick_nohz_start_idle() and 2) idle task between the last
tick_nohz_stop_idle() and the rest of the idle time are blindspots
wrt. cputime accounting.
* The accounting for offline CPUs which is based on ticks and the
jiffies delta during which the tick was stopped.
Pros:
- Handles steal time correctly
- Handle CONFIG_IRQ_TIME_ACCOUNTING=y and CONFIG_IRQ_TIME_ACCOUNTING=n
correctly.
- Handles the whole idle task
Cons:
- Doesn't elapse when the tick is off
- Has TICK_NSEC granularity (jiffies)
- Needs to track the idle ticks that were accounted and substract
them from the total jiffies time spent while the tick was stopped.
This is ugly.
So what I think we should do is having a single one solution always applying
to cpustat which does a hybrid approach with better support:
* Tick based accounting whenever the tick isn't stopped.
* When the tick is stopped, account like we do between tick_nohz_start_idle()
and tick_nohz_stop_idle() (but only when the tick is actually stopped) and
handle CONFIG_IRQ_TIME_ACCOUNTING correctly such that:
- If y, stop on IRQ entry and restart on IRQ exit like we used to. It means
we must flush cpu_irqtime::tick_delta upon tick restart so that it's not
substracted later from system accounting.
- If n, keep accounting on IRQs (remove calls to tick_nohz_start_idle()
and tick_nohz_stop_idle() during IRQs)
During that tick stopped time, tick based accounting must be ignored.
Idle steal delta time must be recorded at tick stop/tick restart boundaries
and substracted from later idle time accounting.
This logic should be moved to vtime and we still need the offlining fix above.
Just give me a few days to work on that patchset.
Thanks.
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply related [flat|nested] 4+ messages in thread
end of thread, other threads:[~2025-12-11 16:34 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-10 8:31 [PATCH v2 0/2] timers/nohz: Optimize /proc/stat idle/iowait fluctuation Xin Zhao
2025-12-10 8:31 ` [PATCH v2 1/2] timers/nohz: Revise a comment about broken iowait counter update race Xin Zhao
2025-12-10 8:31 ` [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug Xin Zhao
2025-12-11 16:34 ` Frederic Weisbecker
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).