linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/2] Optimize /proc/stat idle/iowait fluctuation
@ 2025-11-29 13:35 Xin Zhao
  2025-11-29 13:35 ` [PATCH 1/2] timers/nohz: Revise a comment about broken iowait counter update race Xin Zhao
  2025-11-29 13:35 ` [PATCH 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug Xin Zhao
  0 siblings, 2 replies; 5+ messages in thread
From: Xin Zhao @ 2025-11-29 13:35 UTC (permalink / raw)
  To: anna-maria, frederic, mingo, tglx, kuba
  Cc: jackzxcui1989, linux-kernel, linux-fsdevel

The idle and iowait statistics in /proc/stat are obtained through
get_idle_time and get_iowait_time. Assuming CONFIG_NO_HZ_COMMON is
enabled, when a CPU is online the idle and iowait values come from the
idle_sleeptime and iowait_sleeptime statistics in tick_cpu_sched, but
when a CPU is offline they come from the CPUTIME_IDLE and CPUTIME_IOWAIT
items of kernel_cpustat. This causes idle/iowait fluctuation in the
aggregated statistics printed by /proc/stat.
Introduce get_cpu_idle_time_us_raw and get_cpu_iowait_time_us_raw so
that the /proc/stat logic can use them to read the last raw values of
idle_sleeptime and iowait_sleeptime from tick_cpu_sched, without any
further calculation, when a CPU is offline. This avoids the /proc/stat
idle/iowait fluctuation across CPU hotplug.
In addition, revise a comment about a broken iowait counter update race
related to this topic.

Xin Zhao (2):
  timers/nohz: Revise a comment about broken iowait counter update race
  timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug

 fs/proc/stat.c           |  4 ++++
 include/linux/tick.h     |  4 ++++
 kernel/time/tick-sched.c | 50 ++++++++++++++++++++++++++++++++++++++--
 3 files changed, 56 insertions(+), 2 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] timers/nohz: Revise a comment about broken iowait counter update race
  2025-11-29 13:35 [PATCH 0/2] Optimize /proc/stat idle/iowait fluctuation Xin Zhao
@ 2025-11-29 13:35 ` Xin Zhao
  2025-11-29 13:35 ` [PATCH 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug Xin Zhao
  1 sibling, 0 replies; 5+ messages in thread
From: Xin Zhao @ 2025-11-29 13:35 UTC (permalink / raw)
  To: anna-maria, frederic, mingo, tglx, kuba
  Cc: jackzxcui1989, linux-kernel, linux-fsdevel

Task migration during wakeup may cause a /proc/stat iowait regression,
as mentioned in the commit message of Frederic Weisbecker's previous
patch. The nr_iowait statistic of a runqueue can be decremented by a
remote CPU during wakeup, leading to a sudden drop in the /proc/stat
iowait calculation, while the /proc/stat idle statistic experiences a
sudden increase.
Excluding the hotplug scenario, in which the /proc/stat idle statistic
may experience a sudden decrease (fixed in a subsequent patch), the
/proc/stat idle statistic never decreases suddenly, as there is no
logic in the kernel that allows a remote CPU to increase a runqueue's
nr_iowait.

Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
---
 kernel/time/tick-sched.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8ddf74e70..4d089b290 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -812,8 +812,8 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
  * Return the cumulative idle time (since boot) for a given
  * CPU, in microseconds. Note that this is partially broken due to
  * the counter of iowait tasks that can be remotely updated without
- * any synchronization. Therefore it is possible to observe backward
- * values within two consecutive reads.
+ * any synchronization. Therefore it is possible to observe sudden
+ * increases within two consecutive reads.
  *
  * This time is measured via accounting rather than sampling,
  * and is as accurate as ktime_get() is.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug
  2025-11-29 13:35 [PATCH 0/2] Optimize /proc/stat idle/iowait fluctuation Xin Zhao
  2025-11-29 13:35 ` [PATCH 1/2] timers/nohz: Revise a comment about broken iowait counter update race Xin Zhao
@ 2025-11-29 13:35 ` Xin Zhao
  2025-12-01 21:17   ` Ingo Molnar
  1 sibling, 1 reply; 5+ messages in thread
From: Xin Zhao @ 2025-11-29 13:35 UTC (permalink / raw)
  To: anna-maria, frederic, mingo, tglx, kuba
  Cc: jackzxcui1989, linux-kernel, linux-fsdevel

The idle and iowait statistics in /proc/stat are obtained through
get_idle_time and get_iowait_time. Assuming CONFIG_NO_HZ_COMMON is
enabled, when a CPU is online the idle and iowait values come from the
idle_sleeptime and iowait_sleeptime statistics in tick_cpu_sched, but
when a CPU is offline they come from the CPUTIME_IDLE and CPUTIME_IOWAIT
items of kernel_cpustat. Although /proc/stat does not print per-CPU
statistics for offline CPUs, it still prints aggregated statistics for
all possible CPUs. tick_cpu_sched and kernel_cpustat are maintained by
different logic, leading to a significant gap between them. The first
line of the data below shows the /proc/stat output when only one CPU
remains online after CPU offlining; the second line shows the
/proc/stat output after all CPUs are brought back online:

cpu  2408558 2 916619 4275883 5403 123758 64685 0 0 0
cpu  2408588 2 916693 4200737 4184 123762 64686 0 0 0

The other values do not fluctuate significantly, while the idle/iowait
statistics show a substantial decrease, which makes system CPU
monitoring troublesome.
Introduce get_cpu_idle_time_us_raw and get_cpu_iowait_time_us_raw so
that the /proc/stat logic can use them to read the last raw values of
idle_sleeptime and iowait_sleeptime from tick_cpu_sched, without any
further calculation, when a CPU is offline. This avoids the /proc/stat
idle/iowait fluctuation across CPU hotplug.

Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
---
 fs/proc/stat.c           |  4 ++++
 include/linux/tick.h     |  4 ++++
 kernel/time/tick-sched.c | 46 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 8b444e862..de13a2e1c 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -28,6 +28,8 @@ u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
 
 	if (cpu_online(cpu))
 		idle_usecs = get_cpu_idle_time_us(cpu, NULL);
+	else
+		idle_usecs = get_cpu_idle_time_us_raw(cpu);
 
 	if (idle_usecs == -1ULL)
 		/* !NO_HZ or cpu offline so we can rely on cpustat.idle */
@@ -44,6 +46,8 @@ static u64 get_iowait_time(struct kernel_cpustat *kcs, int cpu)
 
 	if (cpu_online(cpu))
 		iowait_usecs = get_cpu_iowait_time_us(cpu, NULL);
+	else
+		iowait_usecs = get_cpu_iowait_time_us_raw(cpu);
 
 	if (iowait_usecs == -1ULL)
 		/* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ac76ae9fa..bdd38d270 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -139,7 +139,9 @@ extern ktime_t tick_nohz_get_next_hrtimer(void);
 extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
 extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
+extern u64 get_cpu_idle_time_us_raw(int cpu);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+extern u64 get_cpu_iowait_time_us_raw(int cpu);
 #else /* !CONFIG_NO_HZ_COMMON */
 #define tick_nohz_enabled (0)
 static inline int tick_nohz_tick_stopped(void) { return 0; }
@@ -161,7 +163,9 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
 	return *delta_next;
 }
 static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
+static inline u64 get_cpu_idle_time_us_raw(int cpu) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
+static inline u64 get_cpu_iowait_time_us_raw(int cpu) { return -1; }
 #endif /* !CONFIG_NO_HZ_COMMON */
 
 /*
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4d089b290..607fc22d4 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -829,6 +829,29 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
 
+/**
+ * get_cpu_idle_time_us_raw - get the raw total idle time of a CPU
+ * @cpu: CPU number to query
+ *
+ * Return the raw idle time (since boot) for a given CPU, in
+ * microseconds.
+ *
+ * This time is measured via accounting rather than sampling,
+ * and is as accurate as ktime_get() is.
+ *
+ * This function returns -1 if NOHZ is not enabled.
+ */
+u64 get_cpu_idle_time_us_raw(int cpu)
+{
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+
+	if (!tick_nohz_active)
+		return -1;
+
+	return ktime_to_us(ts->idle_sleeptime);
+}
+EXPORT_SYMBOL_GPL(get_cpu_idle_time_us_raw);
+
 /**
  * get_cpu_iowait_time_us - get the total iowait time of a CPU
  * @cpu: CPU number to query
@@ -855,6 +878,29 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
+/**
+ * get_cpu_iowait_time_us_raw - get the raw total iowait time of a CPU
+ * @cpu: CPU number to query
+ *
+ * Return the raw iowait time (since boot) for a given CPU, in
+ * microseconds.
+ *
+ * This time is measured via accounting rather than sampling,
+ * and is as accurate as ktime_get() is.
+ *
+ * This function returns -1 if NOHZ is not enabled.
+ */
+u64 get_cpu_iowait_time_us_raw(int cpu)
+{
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+
+	if (!tick_nohz_active)
+		return -1;
+
+	return ktime_to_us(ts->iowait_sleeptime);
+}
+EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us_raw);
+
 static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 {
 	hrtimer_cancel(&ts->sched_timer);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug
  2025-11-29 13:35 ` [PATCH 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug Xin Zhao
@ 2025-12-01 21:17   ` Ingo Molnar
  2025-12-04 18:21     ` Xin Zhao
  0 siblings, 1 reply; 5+ messages in thread
From: Ingo Molnar @ 2025-12-01 21:17 UTC (permalink / raw)
  To: Xin Zhao; +Cc: anna-maria, frederic, tglx, kuba, linux-kernel, linux-fsdevel


* Xin Zhao <jackzxcui1989@163.com> wrote:

> The idle and iowait statistics in /proc/stat are obtained through
> get_idle_time and get_iowait_time. Assuming CONFIG_NO_HZ_COMMON is
> enabled, when CPU is online, the idle and iowait values use the
> idle_sleeptime and iowait_sleeptime statistics from tick_cpu_sched, but
> use CPUTIME_IDLE and CPUTIME_IOWAIT items from kernel_cpustat when CPU
> is offline. Although /proc/stat do not print statistics of offline CPU,
> it still print aggregated statistics for all possible CPUs.
> tick_cpu_sched and kernel_cpustat are maintained by different logic,
> leading to a significant gap. The first line of the data below shows the
> /proc/stat output when only one CPU remains after CPU offline, the second
> line shows the /proc/stat output after all CPUs are brought back online:
> 
> cpu  2408558 2 916619 4275883 5403 123758 64685 0 0 0
> cpu  2408588 2 916693 4200737 4184 123762 64686 0 0 0

Yeah, that outlier indeed looks suboptimal, and there's 
very little user-space tooling can do to detect it. I 
think your suggestion, to use the 'frozen' values of an 
offline CPU, might as well be the right approach.

What value is printed if the CPU was never online, is 
it properly initialized to zero?


> Obviously, other values do not experience significant fluctuations, while
> idle/iowait statistics show a substantial decrease, which make system CPU
> monitoring troublesome.
> Introduce get_cpu_idle_time_us_raw and get_cpu_iowait_time_us_raw, so that
> /proc/stat logic can use them to get the last raw value of idle_sleeptime
> and iowait_sleeptime from tick_cpu_sched without any calculation when CPU
> is offline. It avoids /proc/stat idle/iowait fluctuation when cpu hotplug.
> 
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
> ---
>  fs/proc/stat.c           |  4 ++++
>  include/linux/tick.h     |  4 ++++
>  kernel/time/tick-sched.c | 46 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 54 insertions(+)
> 
> diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> index 8b444e862..de13a2e1c 100644
> --- a/fs/proc/stat.c
> +++ b/fs/proc/stat.c
> @@ -28,6 +28,8 @@ u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
>  
>  	if (cpu_online(cpu))
>  		idle_usecs = get_cpu_idle_time_us(cpu, NULL);
> +	else
> +		idle_usecs = get_cpu_idle_time_us_raw(cpu);
>  
>  	if (idle_usecs == -1ULL)
>  		/* !NO_HZ or cpu offline so we can rely on cpustat.idle */
> @@ -44,6 +46,8 @@ static u64 get_iowait_time(struct kernel_cpustat *kcs, int cpu)
>  
>  	if (cpu_online(cpu))
>  		iowait_usecs = get_cpu_iowait_time_us(cpu, NULL);
> +	else
> +		iowait_usecs = get_cpu_iowait_time_us_raw(cpu);

So why not just use the get_cpu_idle_time_us() and 
get_cpu_iowait_time_us() values unconditionally, for 
all possible_cpus?

The raw/non-raw distinction makes very little sense in 
this context, the read_seqlock_retry loop will always 
succeed after a single step (because there are no 
writers), so the behavior of the full get_cpu_idle/iowait_time_us()
functions should be close to the _raw() variants.

Patch would be much simpler that way.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug
  2025-12-01 21:17   ` Ingo Molnar
@ 2025-12-04 18:21     ` Xin Zhao
  0 siblings, 0 replies; 5+ messages in thread
From: Xin Zhao @ 2025-12-04 18:21 UTC (permalink / raw)
  To: mingo
  Cc: anna-maria, frederic, jackzxcui1989, kuba, linux-fsdevel,
	linux-kernel, tglx

Dear Ingo,
    Sorry for the late reply; I have been working on another urgent task recently.


On Mon, 1 Dec 2025 22:17:02 +0100 Ingo Molnar <mingo@kernel.org> wrote:

> > The idle and iowait statistics in /proc/stat are obtained through
> > get_idle_time and get_iowait_time. Assuming CONFIG_NO_HZ_COMMON is
> > enabled, when CPU is online, the idle and iowait values use the
> > idle_sleeptime and iowait_sleeptime statistics from tick_cpu_sched, but
> > use CPUTIME_IDLE and CPUTIME_IOWAIT items from kernel_cpustat when CPU
> > is offline. Although /proc/stat do not print statistics of offline CPU,
> > it still print aggregated statistics for all possible CPUs.
> > tick_cpu_sched and kernel_cpustat are maintained by different logic,
> > leading to a significant gap. The first line of the data below shows the
> > /proc/stat output when only one CPU remains after CPU offline, the second
> > line shows the /proc/stat output after all CPUs are brought back online:
> > 
> > cpu  2408558 2 916619 4275883 5403 123758 64685 0 0 0
> > cpu  2408588 2 916693 4200737 4184 123762 64686 0 0 0
> 
> Yeah, that outlier indeed looks suboptimal, and there's 
> very little user-space tooling can do to detect it. I 
> think your suggestion, to use the 'frozen' values of an 
> offline CPU, might as well be the right approach.

Thanks.

> What value is printed if the CPU was never online, is 
> it properly initialized to zero?

I've done an experiment printing the values of tick_sched in secondary_start_kernel,
just before set_cpu_online is executed. It confirmed that the variables in the
tick_sched structure, such as idle_active and idle_entrytime, are all 0. The
tick_sched variable is a per-CPU variable defined at the beginning of tick-sched.c,
named tick_cpu_sched. Like other global variables, per-CPU variables that are not
explicitly initialized to a specific value are zero-initialized by default.



> So why not just use the get_cpu_idle_time_us() and 
> get_cpu_iowait_time_us() values unconditionally, for 
> all possible_cpus?
> 
> The raw/non-raw distinction makes very little sense in 
> this context, the read_seqlock_retry loop will always 
> succeed after a single step (because there are no 
> writers), so the behavior of the full get_cpu_idle/iowait_time_us()
> functions should be close to the _raw() variants.
> 
> Patch would be much simpler that way.

Perhaps I didn't choose a good name; the word "raw" indeed evokes the raw
seqlock variants used to prevent deadlocks, which is not what I intended.
My intention was to distinguish these helpers from the calculation in
get_cpu_sleep_time_us:

delta = ktime_sub(now, ts->idle_entrytime);

which adds this delta to the returned idle time.


If we use the original get_cpu_sleep_time_us interface directly, that
calculation still runs even when the CPU is offline. The following four lines
show the tick_sched state of CPU 3 during its offline period:

[  897.136535] zhaoxin:get_cpu_idle_time_us_raw(3)=823703591,get_cpu_idle_time_us(3)=842938736
[  897.136540] zhaoxin:cpu[3]tick_sched info:[idle_active:1][idle_entrytime:877900502450]
[  935.501687] zhaoxin:get_cpu_idle_time_us_raw(3)=823703591,get_cpu_idle_time_us(3)=881303888
[  935.501694] zhaoxin:cpu[3]tick_sched info:[idle_active:1][idle_entrytime:877900502450]

As shown above, the values obtained from get_cpu_idle_time_us_raw remain
constant, while the values from get_cpu_idle_time_us continue to increase.

Since the CPU is already offline, shouldn't its statistics stop increasing as well?

The following /proc/stat statistics were taken 'just before the CPUs went
offline' and 'just after bringing all CPUs back online'; in between, all
CPUs except CPU0 were offline:

'just before the CPU went offline':
cpu  444829 5 254942 6833361 782 49798 20514 0 0 0
cpu0 68788 1 54126 636019 202 10827 3059 0 0 0
cpu1 58191 0 30832 303281 24 6896 1677 0 0 0
cpu2 7770 0 8268 380039 31 1826 1022 0 0 0
cpu3 7954 0 7476 381024 30 1776 976 0 0 0
cpu4 8319 0 6557 382425 48 1713 805 0 0 0
cpu5 8526 0 5791 383103 37 1681 811 0 0 0
cpu6 22202 0 10746 361746 37 1872 729 0 0 0
cpu7 35019 0 23848 330534 32 3873 1757 0 0 0
cpu8 35065 1 23654 330980 16 3859 1690 0 0 0
cpu9 1065 0 4302 395611 48 1290 861 0 0 0
cpu10 38462 0 12500 362772 19 1879 801 0 0 0
cpu11 37863 0 11913 363654 26 1843 567 0 0 0
cpu12 37717 0 10885 364177 25 1792 845 0 0 0
cpu13 37338 0 10219 365103 32 1763 646 0 0 0
cpu14 14212 0 10272 370534 34 1818 1384 0 0 0
cpu15 1677 0 7933 391188 51 1501 1557 0 0 0
cpu16 9744 0 5015 375124 21 1687 708 0 0 0
cpu17 14908 0 10597 356039 61 1894 611 0 0 0

'just after bringing all CPUs back online':
cpu  447782 5 257163 6861054 797 50101 20624 0 0 0
cpu0 70102 1 55738 661461 215 11111 3163 0 0 0
cpu1 58283 0 30873 303386 26 6897 1677 0 0 0
cpu2 7869 0 8301 380147 31 1828 1022 0 0 0
cpu3 8063 0 7510 381130 30 1777 976 0 0 0
cpu4 8417 0 6594 382540 48 1714 805 0 0 0
cpu5 8605 0 5849 383217 37 1682 811 0 0 0
cpu6 22300 0 10779 361869 37 1874 730 0 0 0
cpu7 35105 0 23897 330659 32 3874 1757 0 0 0
cpu8 35168 1 23694 331102 16 3860 1690 0 0 0
cpu9 1154 0 4338 395750 48 1291 862 0 0 0
cpu10 38565 0 12527 362908 19 1880 801 0 0 0
cpu11 37956 0 11943 363799 26 1844 567 0 0 0
cpu12 37835 0 10901 364316 25 1793 845 0 0 0
cpu13 37410 0 10278 365250 32 1764 646 0 0 0
cpu14 14326 0 10291 370686 34 1819 1385 0 0 0
cpu15 1767 0 7964 391349 51 1502 1560 0 0 0
cpu16 9860 0 5036 375275 21 1688 708 0 0 0
cpu17 14989 0 10641 356203 61 1895 611 0 0 0

As we can see, with the current implementation a CPU's statistics come to a halt
while it is offline.


So, can we keep the current patch as it is, or should we use a different name,
such as get_cpu_idle_time_us_halt?


--
Xin Zhao


^ permalink raw reply	[flat|nested] 5+ messages in thread
