public inbox for linux-kernel@vger.kernel.org
* [PATCH 00/15] tick/sched: Refactor idle cputime accounting
@ 2026-01-16 14:51 Frederic Weisbecker
  2026-01-16 14:51 ` [PATCH 01/15] sched/idle: Handle offlining first in idle loop Frederic Weisbecker
                   ` (16 more replies)
  0 siblings, 17 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:51 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Rafael J . Wysocki, Boqun Feng,
	Thomas Gleixner, Steven Rostedt, Christophe Leroy (CS GROUP),
	Kieran Bingham, Ben Segall, Michael Ellerman, Ingo Molnar,
	Vincent Guittot, Juri Lelli, Neeraj Upadhyay, Xin Zhao,
	Madhavan Srinivasan, Mel Gorman, Valentin Schneider,
	Christian Borntraeger, Jan Kiszka, linuxppc-dev,
	Paul E . McKenney, Viresh Kumar, Anna-Maria Behnsen,
	Uladzislau Rezki, Dietmar Eggemann, Heiko Carstens, linux-pm,
	Alexander Gordeev, Sven Schnelle, Vasily Gorbik, Joel Fernandes,
	Nicholas Piggin, linux-s390, Peter Zijlstra

Hi,

After the issue reported here:

	https://lore.kernel.org/all/20251210083135.3993562-1-jackzxcui1989@163.com/

It turns out that the idle cputime accounting is a big mess that
accumulates in two concurrent statistics, each with its own
shortcomings:

* The accounting for online CPUs which is based on the delta between
  tick_nohz_start_idle() and tick_nohz_stop_idle().

  Pros:
       - Works when the tick is off

       - Has nsecs granularity

  Cons:
       - Accounts idle steal time but doesn't subtract it from idle
         cputime.

       - Assumes CONFIG_IRQ_TIME_ACCOUNTING=y by not accounting IRQs,
         so the IRQ time is simply lost when
         CONFIG_IRQ_TIME_ACCOUNTING=n

       - The windows between 1) the idle task being scheduled and the
         first call to tick_nohz_start_idle() and 2) the last call to
         tick_nohz_stop_idle() and the rest of the idle time are blind
         spots wrt. cputime accounting (though a mostly insignificant
         amount)

       - Relies on private fields outside of kernel stats, with specific
         accessors.

* The accounting for offline CPUs which is based on ticks and the
  jiffies delta during which the tick was stopped.

  Pros:
       - Handles steal time correctly

       - Handles CONFIG_IRQ_TIME_ACCOUNTING=y and
         CONFIG_IRQ_TIME_ACCOUNTING=n correctly.

       - Handles the whole idle task

       - Accounts directly to kernel stats, without midlayer accumulator.

  Cons:
       - Doesn't elapse when the tick is off, which makes it unsuitable
         for online CPUs.

       - Has TICK_NSEC granularity (jiffies)

       - Needs to track the dyntick-idle ticks that were accounted and
         subtract them from the total jiffies time spent while the tick
         was stopped. This is an ugly workaround.

Having two different accountings for a single context is not the only
problem: since those accountings are of different natures, it is
possible to observe the global idle time going backward after a CPU goes
offline, as reported by Xin Zhao.

Clean up the situation by introducing a hybrid approach that stays
coherent, fixes the backward jumps and works for both online and offline
CPUs:

* Tick-based or native vtime accounting operates before the tick is
  stopped and resumes once the tick is restarted.

* When the idle loop starts, switch to dynticks-idle accounting as is
  done currently, except that the statistics accumulate directly to the
  relevant kernel stat fields.

* Private dyntick cputime accounting fields are removed.

* Works in both the online and offline cases.

* Move most of the relevant code to the common sched/cputime subsystem

* Handle CONFIG_IRQ_TIME_ACCOUNTING=n correctly such that the
  dynticks-idle accounting still elapses while in IRQs.

* Correctly subtract idle steal cputime from idle time

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	timers/core

HEAD: 6a3d814ef2f6142714bef862be36def5ca4c9d96
Thanks,
	Frederic
---

Frederic Weisbecker (15):
      sched/idle: Handle offlining first in idle loop
      sched/cputime: Remove superfluous and error prone kcpustat_field() parameter
      sched/cputime: Correctly support generic vtime idle time
      powerpc/time: Prepare to stop elapsing in dynticks-idle
      s390/time: Prepare to stop elapsing in dynticks-idle
      tick/sched: Unify idle cputime accounting
      cpufreq: ondemand: Simplify idle cputime granularity test
      tick/sched: Remove nohz disabled special case in cputime fetch
      tick/sched: Move dyntick-idle cputime accounting to cputime code
      tick/sched: Remove unused fields
      tick/sched: Account tickless idle cputime only when tick is stopped
      tick/sched: Consolidate idle time fetching APIs
      sched/cputime: Consolidate get_cpu_[idle|iowait]_time_us()
      sched/cputime: Handle idle irqtime gracefully
      sched/cputime: Handle dyntick-idle steal time correctly

 arch/powerpc/kernel/time.c         |  41 +++++
 arch/s390/include/asm/idle.h       |  11 +-
 arch/s390/kernel/idle.c            |  13 +-
 arch/s390/kernel/vtime.c           |  57 ++++++-
 drivers/cpufreq/cpufreq.c          |  29 +---
 drivers/cpufreq/cpufreq_governor.c |   6 +-
 drivers/cpufreq/cpufreq_ondemand.c |   7 +-
 drivers/macintosh/rack-meter.c     |   2 +-
 fs/proc/stat.c                     |  40 +----
 fs/proc/uptime.c                   |   8 +-
 include/linux/kernel_stat.h        |  76 ++++++++--
 include/linux/tick.h               |   4 -
 include/linux/vtime.h              |  20 ++-
 kernel/rcu/tree.c                  |   9 +-
 kernel/rcu/tree_stall.h            |   7 +-
 kernel/sched/cputime.c             | 302 +++++++++++++++++++++++++++++++------
 kernel/sched/idle.c                |  11 +-
 kernel/sched/sched.h               |   1 +
 kernel/time/tick-sched.c           | 203 +++++--------------------
 kernel/time/tick-sched.h           |  12 --
 kernel/time/timer_list.c           |   6 +-
 scripts/gdb/linux/timerlist.py     |   4 -
 22 files changed, 505 insertions(+), 364 deletions(-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 01/15] sched/idle: Handle offlining first in idle loop
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
@ 2026-01-16 14:51 ` Frederic Weisbecker
  2026-01-19 12:53   ` Peter Zijlstra
  2026-01-16 14:51 ` [PATCH 02/15] sched/cputime: Remove superfluous and error prone kcpustat_field() parameter Frederic Weisbecker
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:51 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

Offline handling happens from within the inner idle loop,
after the beginning of dyntick cputime accounting, nohz idle
load balancing and TIF_NEED_RESCHED polling.

This is not necessary and even buggy because:

* There is no dyntick handling to do. And calling tick_nohz_idle_enter()
  messes with the struct tick_sched reset that was performed in
  tick_sched_timer_dying().

* There is no nohz idle balancing to do.

* Polling on TIF_NEED_RESCHED is irrelevant at this stage: no more
  tasks are allowed to run.

* There is no need to check need_resched() before offline handling since
  stop_machine is done and all per-CPU kthreads should be done with
  their jobs.

Therefore move the offline handling to the beginning of the idle loop.
This will also ease the later idle cputime unification by not elapsing
idle time while offline through the call to:

	tick_nohz_idle_enter() -> tick_nohz_start_idle()

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/sched/idle.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c174afe1dd17..35d79af3286d 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -260,6 +260,12 @@ static void do_idle(void)
 {
 	int cpu = smp_processor_id();
 
+	if (cpu_is_offline(cpu)) {
+		local_irq_disable();
+		cpuhp_report_idle_dead();
+		arch_cpu_idle_dead();
+	}
+
 	/*
 	 * Check if we need to update blocked load
 	 */
@@ -311,11 +317,6 @@ static void do_idle(void)
 		 */
 		local_irq_disable();
 
-		if (cpu_is_offline(cpu)) {
-			cpuhp_report_idle_dead();
-			arch_cpu_idle_dead();
-		}
-
 		arch_cpu_idle_enter();
 		rcu_nocb_flush_deferred_wakeup();
 
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 02/15] sched/cputime: Remove superfluous and error prone kcpustat_field() parameter
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
  2026-01-16 14:51 ` [PATCH 01/15] sched/idle: Handle offlining first in idle loop Frederic Weisbecker
@ 2026-01-16 14:51 ` Frederic Weisbecker
  2026-01-16 14:51 ` [PATCH 03/15] sched/cputime: Correctly support generic vtime idle time Frederic Weisbecker
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:51 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

The first parameter to kcpustat_field() is a pointer to the CPU kcpustat
to be fetched from. This parameter is error prone because a copy of a
kcpustat could be passed by accident instead of the original one. Also
the kcpustat structure can already be retrieved with the help of the
mandatory CPU argument.

Remove the needless parameter.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 drivers/cpufreq/cpufreq_governor.c | 6 +++---
 drivers/macintosh/rack-meter.c     | 2 +-
 include/linux/kernel_stat.h        | 8 +++-----
 kernel/rcu/tree.c                  | 9 +++------
 kernel/rcu/tree_stall.h            | 7 +++----
 kernel/sched/cputime.c             | 5 ++---
 6 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 1a7fcaf39cc9..b6683628091d 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -105,7 +105,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
 			j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_update_time,
 								  dbs_data->io_is_busy);
 			if (dbs_data->ignore_nice_load)
-				j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
+				j_cdbs->prev_cpu_nice = kcpustat_field(CPUTIME_NICE, j);
 		}
 	}
 }
@@ -165,7 +165,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
 		j_cdbs->prev_cpu_idle = cur_idle_time;
 
 		if (ignore_nice) {
-			u64 cur_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
+			u64 cur_nice = kcpustat_field(CPUTIME_NICE, j);
 
 			idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, NSEC_PER_USEC);
 			j_cdbs->prev_cpu_nice = cur_nice;
@@ -539,7 +539,7 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
 		j_cdbs->prev_load = 0;
 
 		if (ignore_nice)
-			j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
+			j_cdbs->prev_cpu_nice = kcpustat_field(CPUTIME_NICE, j);
 	}
 
 	gov->start(policy);
diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index 896a43bd819f..20b2ecd32340 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -87,7 +87,7 @@ static inline u64 get_cpu_idle_time(unsigned int cpu)
 		 kcpustat->cpustat[CPUTIME_IOWAIT];
 
 	if (rackmeter_ignore_nice)
-		retval += kcpustat_field(kcpustat, CPUTIME_NICE, cpu);
+		retval += kcpustat_field(CPUTIME_NICE, cpu);
 
 	return retval;
 }
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index b97ce2df376f..dd020ecaf67b 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -100,14 +100,12 @@ static inline unsigned long kstat_cpu_irqs_sum(unsigned int cpu)
 }
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern u64 kcpustat_field(struct kernel_cpustat *kcpustat,
-			  enum cpu_usage_stat usage, int cpu);
+extern u64 kcpustat_field(enum cpu_usage_stat usage, int cpu);
 extern void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu);
 #else
-static inline u64 kcpustat_field(struct kernel_cpustat *kcpustat,
-				 enum cpu_usage_stat usage, int cpu)
+static inline u64 kcpustat_field(enum cpu_usage_stat usage, int cpu)
 {
-	return kcpustat->cpustat[usage];
+	return kcpustat_cpu(cpu).cpustat[usage];
 }
 
 static inline void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 293bbd9ac3f4..ceea4b2f755b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -968,14 +968,11 @@ static int rcu_watching_snap_recheck(struct rcu_data *rdp)
 		if (rcu_cpu_stall_cputime && rdp->snap_record.gp_seq != rdp->gp_seq) {
 			int cpu = rdp->cpu;
 			struct rcu_snap_record *rsrp;
-			struct kernel_cpustat *kcsp;
-
-			kcsp = &kcpustat_cpu(cpu);
 
 			rsrp = &rdp->snap_record;
-			rsrp->cputime_irq     = kcpustat_field(kcsp, CPUTIME_IRQ, cpu);
-			rsrp->cputime_softirq = kcpustat_field(kcsp, CPUTIME_SOFTIRQ, cpu);
-			rsrp->cputime_system  = kcpustat_field(kcsp, CPUTIME_SYSTEM, cpu);
+			rsrp->cputime_irq     = kcpustat_field(CPUTIME_IRQ, cpu);
+			rsrp->cputime_softirq = kcpustat_field(CPUTIME_SOFTIRQ, cpu);
+			rsrp->cputime_system  = kcpustat_field(CPUTIME_SYSTEM, cpu);
 			rsrp->nr_hardirqs = kstat_cpu_irqs_sum(cpu) + arch_irq_stat_cpu(cpu);
 			rsrp->nr_softirqs = kstat_cpu_softirqs_sum(cpu);
 			rsrp->nr_csw = nr_context_switches_cpu(cpu);
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index b67532cb8770..cf7ae51cba40 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -479,7 +479,6 @@ static void print_cpu_stat_info(int cpu)
 {
 	struct rcu_snap_record rsr, *rsrp;
 	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
-	struct kernel_cpustat *kcsp = &kcpustat_cpu(cpu);
 
 	if (!rcu_cpu_stall_cputime)
 		return;
@@ -488,9 +487,9 @@ static void print_cpu_stat_info(int cpu)
 	if (rsrp->gp_seq != rdp->gp_seq)
 		return;
 
-	rsr.cputime_irq     = kcpustat_field(kcsp, CPUTIME_IRQ, cpu);
-	rsr.cputime_softirq = kcpustat_field(kcsp, CPUTIME_SOFTIRQ, cpu);
-	rsr.cputime_system  = kcpustat_field(kcsp, CPUTIME_SYSTEM, cpu);
+	rsr.cputime_irq     = kcpustat_field(CPUTIME_IRQ, cpu);
+	rsr.cputime_softirq = kcpustat_field(CPUTIME_SOFTIRQ, cpu);
+	rsr.cputime_system  = kcpustat_field(CPUTIME_SYSTEM, cpu);
 
 	pr_err("\t         hardirqs   softirqs   csw/system\n");
 	pr_err("\t number: %8lld %10d %12lld\n",
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 4f97896887ec..5dcb0f2e01bc 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -961,10 +961,9 @@ static int kcpustat_field_vtime(u64 *cpustat,
 	return 0;
 }
 
-u64 kcpustat_field(struct kernel_cpustat *kcpustat,
-		   enum cpu_usage_stat usage, int cpu)
+u64 kcpustat_field(enum cpu_usage_stat usage, int cpu)
 {
-	u64 *cpustat = kcpustat->cpustat;
+	u64 *cpustat = kcpustat_cpu(cpu).cpustat;
 	u64 val = cpustat[usage];
 	struct rq *rq;
 	int err;
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 03/15] sched/cputime: Correctly support generic vtime idle time
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
  2026-01-16 14:51 ` [PATCH 01/15] sched/idle: Handle offlining first in idle loop Frederic Weisbecker
  2026-01-16 14:51 ` [PATCH 02/15] sched/cputime: Remove superfluous and error prone kcpustat_field() parameter Frederic Weisbecker
@ 2026-01-16 14:51 ` Frederic Weisbecker
  2026-01-19 13:02   ` Peter Zijlstra
  2026-01-16 14:51 ` [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle Frederic Weisbecker
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:51 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

Currently, whether generic vtime is running or not, the idle cputime is
fetched from the nohz accounting.

However generic vtime already does its own idle cputime accounting. Only
the kernel stat accessors are not plugged in to support it.

Read the generic vtime idle cputime when it's running; this will later
allow a cleaner split between nohz and vtime cputime accounting.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 fs/proc/stat.c           |  8 ++++----
 include/linux/vtime.h    |  7 ++++++-
 kernel/sched/cputime.c   | 38 +++++++++++++++++++++++++++++++-------
 kernel/time/tick-sched.c |  2 +-
 4 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 8b444e862319..6ac2a13b8be5 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -30,8 +30,8 @@ u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
 		idle_usecs = get_cpu_idle_time_us(cpu, NULL);
 
 	if (idle_usecs == -1ULL)
-		/* !NO_HZ or cpu offline so we can rely on cpustat.idle */
-		idle = kcs->cpustat[CPUTIME_IDLE];
+		/* !NO_HZ or cpu offline or vtime so we can rely on cpustat.idle */
+		idle = kcpustat_field(CPUTIME_IDLE, cpu);
 	else
 		idle = idle_usecs * NSEC_PER_USEC;
 
@@ -46,8 +46,8 @@ static u64 get_iowait_time(struct kernel_cpustat *kcs, int cpu)
 		iowait_usecs = get_cpu_iowait_time_us(cpu, NULL);
 
 	if (iowait_usecs == -1ULL)
-		/* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
-		iowait = kcs->cpustat[CPUTIME_IOWAIT];
+		/* !NO_HZ or cpu offline or vtime so we can rely on cpustat.iowait */
+		iowait = kcpustat_field(CPUTIME_IOWAIT, cpu);
 	else
 		iowait = iowait_usecs * NSEC_PER_USEC;
 
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 29dd5b91dd7d..737930f66c3e 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -27,6 +27,11 @@ static inline void vtime_guest_exit(struct task_struct *tsk) { }
 static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
 #endif
 
+static inline bool vtime_generic_enabled_cpu(int cpu)
+{
+	return context_tracking_enabled_cpu(cpu);
+}
+
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 extern void vtime_account_irq(struct task_struct *tsk, unsigned int offset);
 extern void vtime_account_softirq(struct task_struct *tsk);
@@ -74,7 +79,7 @@ static inline bool vtime_accounting_enabled(void)
 
 static inline bool vtime_accounting_enabled_cpu(int cpu)
 {
-	return context_tracking_enabled_cpu(cpu);
+	return vtime_generic_enabled_cpu(cpu);
 }
 
 static inline bool vtime_accounting_enabled_this_cpu(void)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5dcb0f2e01bc..f32c169da11a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -761,7 +761,11 @@ EXPORT_SYMBOL_GPL(vtime_guest_exit);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
-	account_idle_time(get_vtime_delta(&tsk->vtime));
+	struct vtime *vtime = &tsk->vtime;
+
+	write_seqcount_begin(&vtime->seqcount);
+	account_idle_time(get_vtime_delta(vtime));
+	write_seqcount_end(&vtime->seqcount);
 }
 
 void vtime_task_switch_generic(struct task_struct *prev)
@@ -912,6 +916,7 @@ static int kcpustat_field_vtime(u64 *cpustat,
 				int cpu, u64 *val)
 {
 	struct vtime *vtime = &tsk->vtime;
+	struct rq *rq = cpu_rq(cpu);
 	unsigned int seq;
 
 	do {
@@ -953,6 +958,14 @@ static int kcpustat_field_vtime(u64 *cpustat,
 			if (state == VTIME_GUEST && task_nice(tsk) > 0)
 				*val += vtime->gtime + vtime_delta(vtime);
 			break;
+		case CPUTIME_IDLE:
+			if (state == VTIME_IDLE && !atomic_read(&rq->nr_iowait))
+				*val += vtime_delta(vtime);
+			break;
+		case CPUTIME_IOWAIT:
+			if (state == VTIME_IDLE && atomic_read(&rq->nr_iowait) > 0)
+				*val += vtime_delta(vtime);
+			break;
 		default:
 			break;
 		}
@@ -1015,8 +1028,8 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
 		*dst = *src;
 		cpustat = dst->cpustat;
 
-		/* Task is sleeping, dead or idle, nothing to add */
-		if (state < VTIME_SYS)
+		/* Task is sleeping or dead, nothing to add */
+		if (state < VTIME_IDLE)
 			continue;
 
 		delta = vtime_delta(vtime);
@@ -1025,15 +1038,17 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
 		 * Task runs either in user (including guest) or kernel space,
 		 * add pending nohz time to the right place.
 		 */
-		if (state == VTIME_SYS) {
+		switch (vtime->state) {
+		case VTIME_SYS:
 			cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
-		} else if (state == VTIME_USER) {
+			break;
+		case VTIME_USER:
 			if (task_nice(tsk) > 0)
 				cpustat[CPUTIME_NICE] += vtime->utime + delta;
 			else
 				cpustat[CPUTIME_USER] += vtime->utime + delta;
-		} else {
-			WARN_ON_ONCE(state != VTIME_GUEST);
+			break;
+		case VTIME_GUEST:
 			if (task_nice(tsk) > 0) {
 				cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta;
 				cpustat[CPUTIME_NICE] += vtime->gtime + delta;
@@ -1041,6 +1056,15 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
 				cpustat[CPUTIME_GUEST] += vtime->gtime + delta;
 				cpustat[CPUTIME_USER] += vtime->gtime + delta;
 			}
+			break;
+		case VTIME_IDLE:
+			if (atomic_read(&cpu_rq(cpu)->nr_iowait) > 0)
+				cpustat[CPUTIME_IOWAIT] += delta;
+			else
+				cpustat[CPUTIME_IDLE] += delta;
+			break;
+		default:
+			WARN_ON_ONCE(1);
 		}
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8ddf74e705d3..f1d07a0276a5 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -780,7 +780,7 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
 	ktime_t now, idle;
 	unsigned int seq;
 
-	if (!tick_nohz_active)
+	if (!tick_nohz_active || vtime_generic_enabled_cpu(cpu))
 		return -1;
 
 	now = ktime_get();
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2026-01-16 14:51 ` [PATCH 03/15] sched/cputime: Correctly support generic vtime idle time Frederic Weisbecker
@ 2026-01-16 14:51 ` Frederic Weisbecker
  2026-02-25 17:53   ` Christophe Leroy (CS GROUP)
  2026-01-16 14:51 ` [PATCH 05/15] s390/time: " Frederic Weisbecker
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:51 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

Currently the tick subsystem stores the idle cputime accounting in
private fields, allowing cohabitation with architecture idle vtime
accounting. The former is fetched on online CPUs, the latter on offline
CPUs.

For consolidation purposes, architecture vtime accounting will continue
to account the cputime but will pause while the idle tick is stopped.
The dyntick cputime accounting will then be relayed by the tick
subsystem so that the idle cputime is still seen advancing coherently
even when the tick isn't there to flush the idle vtime.

Prepare for that and introduce three new APIs which will be used in
subsequent patches:

- vtime_dyntick_start() is meant to be called when idle enters
  dyntick mode. The idle cputime that elapsed so far is accumulated.

- vtime_dyntick_stop() is meant to be called when idle exits from
  dyntick mode. The vtime entry clocks are fast-forwarded to current
  time so that idle accounting restarts elapsing from now.

- vtime_reset() is meant to be called from dynticks idle IRQ entry to
  fast-forward the clock to current time so that the IRQ time is still
  accounted by vtime while nohz cputime is paused.

Also, accumulated vtime won't be flushed from dyntick-idle ticks, to
avoid accounting the idle cputime twice, once by vtime and once by nohz.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 arch/powerpc/kernel/time.c | 41 ++++++++++++++++++++++++++++++++++++++
 include/linux/vtime.h      |  6 ++++++
 2 files changed, 47 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 4bbeb8644d3d..9b3167274653 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -376,6 +376,47 @@ void vtime_task_switch(struct task_struct *prev)
 		acct->starttime = acct0->starttime;
 	}
 }
+
+#ifdef CONFIG_NO_HZ_COMMON
+/**
+ * vtime_reset - Fast forward vtime entry clocks
+ *
+ * Called from dynticks idle IRQ entry to fast-forward the clocks to current time
+ * so that the IRQ time is still accounted by vtime while nohz cputime is paused.
+ */
+void vtime_reset(void)
+{
+	struct cpu_accounting_data *acct = get_accounting(current);
+
+	acct->starttime = mftb();
+#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
+	acct->startspurr = read_spurr(acct->starttime);
+#endif
+}
+
+/**
+ * vtime_dyntick_start - Inform vtime about entry to idle-dynticks
+ *
+ * Called when idle enters in dyntick mode. The idle cputime that elapsed so far
+ * is accumulated and the tick subsystem takes over the idle cputime accounting.
+ */
+void vtime_dyntick_start(void)
+{
+	vtime_account_idle(current);
+}
+
+/**
+ * vtime_dyntick_stop - Inform vtime about exit from idle-dynticks
+ *
+ * Called when idle exits from dyntick mode. The vtime entry clocks are
+ * fast-forwarded to current time so that idle accounting restarts elapsing
+ * from now.
+ */
+void vtime_dyntick_stop(void)
+{
+	vtime_reset();
+}
+#endif /* CONFIG_NO_HZ_COMMON */
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 void __no_kcsan __delay(unsigned long loops)
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 737930f66c3e..10cdb08f960b 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -37,11 +37,17 @@ extern void vtime_account_irq(struct task_struct *tsk, unsigned int offset);
 extern void vtime_account_softirq(struct task_struct *tsk);
 extern void vtime_account_hardirq(struct task_struct *tsk);
 extern void vtime_flush(struct task_struct *tsk);
+extern void vtime_reset(void);
+extern void vtime_dyntick_start(void);
+extern void vtime_dyntick_stop(void);
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 static inline void vtime_account_irq(struct task_struct *tsk, unsigned int offset) { }
 static inline void vtime_account_softirq(struct task_struct *tsk) { }
 static inline void vtime_account_hardirq(struct task_struct *tsk) { }
 static inline void vtime_flush(struct task_struct *tsk) { }
+static inline void vtime_reset(void) { }
+static inline void vtime_dyntick_start(void) { }
+static inline void vtime_dyntick_stop(void) { }
 #endif
 
 /*
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 05/15] s390/time: Prepare to stop elapsing in dynticks-idle
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2026-01-16 14:51 ` [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle Frederic Weisbecker
@ 2026-01-16 14:51 ` Frederic Weisbecker
  2026-01-21 12:17   ` Heiko Carstens
  2026-01-16 14:51 ` [PATCH 06/15] tick/sched: Unify idle cputime accounting Frederic Weisbecker
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:51 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

Currently the tick subsystem stores the idle cputime accounting in
private fields, allowing cohabitation with architecture idle vtime
accounting. The former is fetched on online CPUs, the latter on offline
CPUs.

For consolidation purposes, architecture vtime accounting will continue
to account the cputime but will pause while the idle tick is stopped.
The dyntick cputime accounting will then be relayed by the tick
subsystem so that the idle cputime is still seen advancing coherently
even when the tick isn't there to flush the idle vtime.

Prepare for that and introduce three new APIs which will be used in
subsequent patches:

- vtime_dyntick_start() is meant to be called when idle enters
  dyntick mode. The idle cputime that elapsed so far is accumulated
  and accounted. Also, idle time accounting is ignored.

- vtime_dyntick_stop() is meant to be called when idle exits from
  dyntick mode. The vtime entry clocks are fast-forwarded to current
  time so that idle accounting restarts elapsing from now. Also, idle
  time accounting is resumed.

- vtime_reset() is meant to be called from dynticks idle IRQ entry to
  fast-forward the clock to current time so that the IRQ time is still
  accounted by vtime while nohz cputime is paused.

Also, accumulated vtime won't be flushed from dyntick-idle ticks, to
avoid accounting the idle cputime twice, once by vtime and once by nohz.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 arch/s390/include/asm/idle.h | 11 +++---
 arch/s390/kernel/idle.c      | 13 ++++++--
 arch/s390/kernel/vtime.c     | 65 ++++++++++++++++++++++++++++++------
 3 files changed, 71 insertions(+), 18 deletions(-)

diff --git a/arch/s390/include/asm/idle.h b/arch/s390/include/asm/idle.h
index 09f763b9eb40..2770c4f761e1 100644
--- a/arch/s390/include/asm/idle.h
+++ b/arch/s390/include/asm/idle.h
@@ -12,11 +12,12 @@
 #include <linux/device.h>
 
 struct s390_idle_data {
-	unsigned long idle_count;
-	unsigned long idle_time;
-	unsigned long clock_idle_enter;
-	unsigned long timer_idle_enter;
-	unsigned long mt_cycles_enter[8];
+	bool		idle_dyntick;
+	unsigned long	idle_count;
+	unsigned long	idle_time;
+	unsigned long	clock_idle_enter;
+	unsigned long	timer_idle_enter;
+	unsigned long	mt_cycles_enter[8];
 };
 
 extern struct device_attribute dev_attr_idle_count;
diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c
index 39cb8d0ae348..54bb932184dd 100644
--- a/arch/s390/kernel/idle.c
+++ b/arch/s390/kernel/idle.c
@@ -35,6 +35,12 @@ void account_idle_time_irq(void)
 			this_cpu_add(mt_cycles[i], cycles_new[i] - idle->mt_cycles_enter[i]);
 	}
 
+	WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1);
+
+	/* Dyntick idle time accounted by nohz/scheduler */
+	if (idle->idle_dyntick)
+		return;
+
 	idle_time = lc->int_clock - idle->clock_idle_enter;
 
 	lc->steal_timer += idle->clock_idle_enter - lc->last_update_clock;
@@ -45,7 +51,6 @@ void account_idle_time_irq(void)
 
 	/* Account time spent with enabled wait psw loaded as idle time. */
 	WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time);
-	WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1);
 	account_idle_time(cputime_to_nsecs(idle_time));
 }
 
@@ -61,8 +66,10 @@ void noinstr arch_cpu_idle(void)
 	set_cpu_flag(CIF_ENABLED_WAIT);
 	if (smp_cpu_mtid)
 		stcctm(MT_DIAG, smp_cpu_mtid, (u64 *)&idle->mt_cycles_enter);
-	idle->clock_idle_enter = get_tod_clock_fast();
-	idle->timer_idle_enter = get_cpu_timer();
+	if (!idle->idle_dyntick) {
+		idle->clock_idle_enter = get_tod_clock_fast();
+		idle->timer_idle_enter = get_cpu_timer();
+	}
 	bpon();
 	__load_psw_mask(psw_mask);
 }
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 234a0ba30510..c19528eb4ee3 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -17,6 +17,7 @@
 #include <asm/vtimer.h>
 #include <asm/vtime.h>
 #include <asm/cpu_mf.h>
+#include <asm/idle.h>
 #include <asm/smp.h>
 
 #include "entry.h"
@@ -111,23 +112,30 @@ static void account_system_index_scaled(struct task_struct *p, u64 cputime,
 	account_system_index_time(p, cputime_to_nsecs(cputime), index);
 }
 
-/*
- * Update process times based on virtual cpu times stored by entry.S
- * to the lowcore fields user_timer, system_timer & steal_clock.
- */
-static int do_account_vtime(struct task_struct *tsk)
+static inline void vtime_reset_last_update(struct lowcore *lc)
 {
-	u64 timer, clock, user, guest, system, hardirq, softirq;
-	struct lowcore *lc = get_lowcore();
-
-	timer = lc->last_update_timer;
-	clock = lc->last_update_clock;
 	asm volatile(
 		"	stpt	%0\n"	/* Store current cpu timer value */
 		"	stckf	%1"	/* Store current tod clock value */
 		: "=Q" (lc->last_update_timer),
 		  "=Q" (lc->last_update_clock)
 		: : "cc");
+}
+
+/*
+ * Update process times based on virtual cpu times stored by entry.S
+ * to the lowcore fields user_timer, system_timer & steal_clock.
+ */
+static int do_account_vtime(struct task_struct *tsk)
+{
+	u64 timer, clock, user, guest, system, hardirq, softirq;
+	struct lowcore *lc = get_lowcore();
+
+	timer = lc->last_update_timer;
+	clock = lc->last_update_clock;
+
+	vtime_reset_last_update(lc);
+
 	clock = lc->last_update_clock - clock;
 	timer -= lc->last_update_timer;
 
@@ -261,6 +269,43 @@ void vtime_account_hardirq(struct task_struct *tsk)
 	virt_timer_forward(delta);
 }
 
+#ifdef CONFIG_NO_HZ_COMMON
+/**
+ * vtime_reset - Fast forward vtime entry clocks
+ *
+ * Called from dynticks idle IRQ entry to fast-forward the clocks to current time
+ * so that the IRQ time is still accounted by vtime while nohz cputime is paused.
+ */
+void vtime_reset(void)
+{
+	vtime_reset_last_update(get_lowcore());
+}
+
+/**
+ * vtime_dyntick_start - Inform vtime about entry to idle-dynticks
+ *
+ * Called when idle enters in dyntick mode. The idle cputime that elapsed so far
+ * is flushed and the tick subsystem takes over the idle cputime accounting.
+ */
+void vtime_dyntick_start(void)
+{
+	__this_cpu_write(s390_idle.idle_dyntick, true);
+	vtime_flush(current);
+}
+
+/**
+ * vtime_dyntick_stop - Inform vtime about exit from idle-dynticks
+ *
+ * Called when idle exits from dyntick mode. The vtime entry clocks are
+ * fast-forward to current time and idle accounting resumes.
+ */
+void vtime_dyntick_stop(void)
+{
+	vtime_reset_last_update(get_lowcore());
+	__this_cpu_write(s390_idle.idle_dyntick, false);
+}
+#endif /* CONFIG_NO_HZ_COMMON */
+
 /*
  * Sorted add to a list. List is linear searched until first bigger
  * element is found.
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 06/15] tick/sched: Unify idle cputime accounting
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2026-01-16 14:51 ` [PATCH 05/15] s390/time: " Frederic Weisbecker
@ 2026-01-16 14:51 ` Frederic Weisbecker
  2026-01-19 14:26   ` Peter Zijlstra
  2026-01-16 14:52 ` [PATCH 07/15] cpufreq: ondemand: Simplify idle cputime granularity test Frederic Weisbecker
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:51 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

The non-vtime dynticks-idle cputime accounting is a big mess spread
across two concurrent statistics, each with its own shortcomings:

* The accounting for online CPUs, which is based on the delta between
  tick_nohz_start_idle() and tick_nohz_stop_idle().

  Pros:
       - Works when the tick is off

       - Has nsecs granularity

  Cons:
       - Accounts idle steal time but doesn't subtract it from idle
         cputime.

       - Assumes CONFIG_IRQ_TIME_ACCOUNTING=y by not accounting IRQ
         time as idle, but that IRQ time is simply lost when
         CONFIG_IRQ_TIME_ACCOUNTING=n

       - The windows between 1) idle task scheduling and the first call
         to tick_nohz_start_idle() and 2) the last tick_nohz_stop_idle()
         and the rest of the idle time are blind spots wrt. cputime
         accounting (though the amounts are mostly insignificant)

       - Relies on private fields outside of kernel stats, with specific
         accessors.

* The accounting for offline CPUs, which is based on ticks and the
  jiffies delta during which the tick was stopped.

  Pros:
       - Handles steal time correctly

       - Handles CONFIG_IRQ_TIME_ACCOUNTING=y and
         CONFIG_IRQ_TIME_ACCOUNTING=n correctly.

       - Handles the whole idle task

       - Accounts directly to kernel stats, without midlayer accumulator.

  Cons:
       - Doesn't elapse when the tick is off, which makes it unsuitable
         for online CPUs.

       - Has TICK_NSEC granularity (jiffies)

       - Needs to track the dyntick-idle ticks that were accounted and
         subtract them from the total jiffies time spent while the tick
         was stopped. This is an ugly workaround.

Having two different accountings for a single context is not the only
problem: since those accountings are of different natures, it is
possible to observe the global idle time going backward after a CPU
goes offline.

Clean up the situation by introducing a hybrid approach that stays
coherent and works for both online and offline CPUs:

* Tick-based or native vtime accounting operates before the idle loop
  is entered and resumes once the idle loop prepares to exit.

* When the idle loop starts, switch to dynticks-idle accounting as is
  done currently, except that the statistics accumulate directly to the
  relevant kernel stat fields.

* Private dyntick cputime accounting fields are removed.

* Works in both the online and the offline case.

Further improvements will include:

* Only switch to dynticks-idle cputime accounting when the tick
  actually enters dynticks mode.

* Handle CONFIG_IRQ_TIME_ACCOUNTING=n correctly such that the
  dynticks-idle accounting still elapses during IRQs.

* Correctly subtract idle steal cputime from idle time

Reported-by: Xin Zhao <jackzxcui1989@163.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/kernel_stat.h | 24 ++++++++++---
 include/linux/vtime.h       |  7 +++-
 kernel/sched/cputime.c      | 62 ++++++++++++++++----------------
 kernel/time/tick-sched.c    | 72 +++++++++++--------------------------
 4 files changed, 77 insertions(+), 88 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index dd020ecaf67b..ba65aad308a1 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -34,6 +34,9 @@ enum cpu_usage_stat {
 };
 
 struct kernel_cpustat {
+#ifdef CONFIG_NO_HZ_COMMON
+	int idle_dyntick;
+#endif
 	u64 cpustat[NR_STATS];
 };
 
@@ -99,6 +102,20 @@ static inline unsigned long kstat_cpu_irqs_sum(unsigned int cpu)
 	return kstat_cpu(cpu).irqs_sum;
 }
 
+#ifdef CONFIG_NO_HZ_COMMON
+extern void kcpustat_dyntick_start(void);
+extern void kcpustat_dyntick_stop(void);
+static inline bool kcpustat_idle_dyntick(void)
+{
+	return __this_cpu_read(kernel_cpustat.idle_dyntick);
+}
+#else
+static inline bool kcpustat_idle_dyntick(void)
+{
+	return false;
+}
+#endif /* CONFIG_NO_HZ_COMMON */
+
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern u64 kcpustat_field(enum cpu_usage_stat usage, int cpu);
 extern void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu);
@@ -113,7 +130,7 @@ static inline void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
 	*dst = kcpustat_cpu(cpu);
 }
 
-#endif
+#endif /* !CONFIG_VIRT_CPU_ACCOUNTING_GEN */
 
 extern void account_user_time(struct task_struct *, u64);
 extern void account_guest_time(struct task_struct *, u64);
@@ -127,14 +144,13 @@ extern u64 get_idle_time(struct kernel_cpustat *kcs, int cpu);
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline void account_process_tick(struct task_struct *tsk, int user)
 {
-	vtime_flush(tsk);
+	if (!kcpustat_idle_dyntick())
+		vtime_flush(tsk);
 }
 #else
 extern void account_process_tick(struct task_struct *, int user);
 #endif
 
-extern void account_idle_ticks(unsigned long ticks);
-
 #ifdef CONFIG_SCHED_CORE
 extern void __account_forceidle_time(struct task_struct *tsk, u64 delta);
 #endif
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 10cdb08f960b..43934ff20c4a 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -32,6 +32,11 @@ static inline bool vtime_generic_enabled_cpu(int cpu)
 	return context_tracking_enabled_cpu(cpu);
 }
 
+static inline bool vtime_generic_enabled_this_cpu(void)
+{
+	return context_tracking_enabled_this_cpu();
+}
+
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 extern void vtime_account_irq(struct task_struct *tsk, unsigned int offset);
 extern void vtime_account_softirq(struct task_struct *tsk);
@@ -90,7 +95,7 @@ static inline bool vtime_accounting_enabled_cpu(int cpu)
 
 static inline bool vtime_accounting_enabled_this_cpu(void)
 {
-	return context_tracking_enabled_this_cpu();
+	return vtime_generic_enabled_this_cpu();
 }
 
 extern void vtime_task_switch_generic(struct task_struct *prev);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f32c169da11a..c10fcc3d65b3 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -400,16 +400,30 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	}
 }
 
-static void irqtime_account_idle_ticks(int ticks)
-{
-	irqtime_account_process_tick(current, 0, ticks);
-}
 #else /* !CONFIG_IRQ_TIME_ACCOUNTING: */
-static inline void irqtime_account_idle_ticks(int ticks) { }
 static inline void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 						int nr_ticks) { }
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
+#ifdef CONFIG_NO_HZ_COMMON
+void kcpustat_dyntick_start(void)
+{
+	if (!vtime_generic_enabled_this_cpu()) {
+		vtime_dyntick_start();
+		__this_cpu_write(kernel_cpustat.idle_dyntick, 1);
+	}
+}
+
+void kcpustat_dyntick_stop(void)
+{
+	if (!vtime_generic_enabled_this_cpu()) {
+		__this_cpu_write(kernel_cpustat.idle_dyntick, 0);
+		vtime_dyntick_stop();
+		steal_account_process_time(ULONG_MAX);
+	}
+}
+#endif /* CONFIG_NO_HZ_COMMON */
+
 /*
  * Use precise platform statistics if available:
  */
@@ -423,11 +437,15 @@ void vtime_account_irq(struct task_struct *tsk, unsigned int offset)
 		vtime_account_hardirq(tsk);
 	} else if (pc & SOFTIRQ_OFFSET) {
 		vtime_account_softirq(tsk);
-	} else if (!IS_ENABLED(CONFIG_HAVE_VIRT_CPU_ACCOUNTING_IDLE) &&
-		   is_idle_task(tsk)) {
-		vtime_account_idle(tsk);
+	} else if (!kcpustat_idle_dyntick()) {
+		if (!IS_ENABLED(CONFIG_HAVE_VIRT_CPU_ACCOUNTING_IDLE) &&
+		    is_idle_task(tsk)) {
+			vtime_account_idle(tsk);
+		} else {
+			vtime_account_kernel(tsk);
+		}
 	} else {
-		vtime_account_kernel(tsk);
+		vtime_reset();
 	}
 }
 
@@ -469,6 +487,9 @@ void account_process_tick(struct task_struct *p, int user_tick)
 	if (vtime_accounting_enabled_this_cpu())
 		return;
 
+	if (kcpustat_idle_dyntick())
+		return;
+
 	if (irqtime_enabled()) {
 		irqtime_account_process_tick(p, user_tick, 1);
 		return;
@@ -490,29 +511,6 @@ void account_process_tick(struct task_struct *p, int user_tick)
 		account_idle_time(cputime);
 }
 
-/*
- * Account multiple ticks of idle time.
- * @ticks: number of stolen ticks
- */
-void account_idle_ticks(unsigned long ticks)
-{
-	u64 cputime, steal;
-
-	if (irqtime_enabled()) {
-		irqtime_account_idle_ticks(ticks);
-		return;
-	}
-
-	cputime = ticks * TICK_NSEC;
-	steal = steal_account_process_time(ULONG_MAX);
-
-	if (steal >= cputime)
-		return;
-
-	cputime -= steal;
-	account_idle_time(cputime);
-}
-
 /*
  * Adjust tick based cputime random precision against scheduler runtime
  * accounting.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f1d07a0276a5..74c97ad75856 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -285,8 +285,6 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
 	if (IS_ENABLED(CONFIG_NO_HZ_COMMON) &&
 	    tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
 		touch_softlockup_watchdog_sched();
-		if (is_idle_task(current))
-			ts->idle_jiffies++;
 		/*
 		 * In case the current tick fired too early past its expected
 		 * expiration, make sure we don't bypass the next clock reprogramming
@@ -744,8 +742,12 @@ static void tick_nohz_update_jiffies(ktime_t now)
 
 static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
+	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	ktime_t delta;
 
+	if (vtime_generic_enabled_this_cpu())
+		return;
+
 	if (WARN_ON_ONCE(!tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE)))
 		return;
 
@@ -753,9 +755,9 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	if (nr_iowait_cpu(smp_processor_id()) > 0)
-		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
+		cpustat[CPUTIME_IOWAIT] = ktime_add(cpustat[CPUTIME_IOWAIT], delta);
 	else
-		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+		cpustat[CPUTIME_IDLE] = ktime_add(cpustat[CPUTIME_IDLE], delta);
 
 	ts->idle_entrytime = now;
 	tick_sched_flag_clear(ts, TS_FLAG_IDLE_ACTIVE);
@@ -766,17 +768,21 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 
 static void tick_nohz_start_idle(struct tick_sched *ts)
 {
+	if (vtime_generic_enabled_this_cpu())
+		return;
+
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	ts->idle_entrytime = ktime_get();
 	tick_sched_flag_set(ts, TS_FLAG_IDLE_ACTIVE);
 	write_seqcount_end(&ts->idle_sleeptime_seq);
-
 	sched_clock_idle_sleep_event();
 }
 
-static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
+static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
 				 bool compute_delta, u64 *last_update_time)
 {
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	u64 *cpustat = kcpustat_cpu(cpu).cpustat;
 	ktime_t now, idle;
 	unsigned int seq;
 
@@ -793,9 +799,9 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
 		if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE) && compute_delta) {
 			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
 
-			idle = ktime_add(*sleeptime, delta);
+			idle = ktime_add(cpustat[idx], delta);
 		} else {
-			idle = *sleeptime;
+			idle = cpustat[idx];
 		}
 	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
@@ -822,9 +828,7 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
  */
 u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 {
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-
-	return get_cpu_sleep_time_us(ts, &ts->idle_sleeptime,
+	return get_cpu_sleep_time_us(cpu, CPUTIME_IDLE,
 				     !nr_iowait_cpu(cpu), last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
@@ -848,9 +852,7 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
  */
 u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 {
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-
-	return get_cpu_sleep_time_us(ts, &ts->iowait_sleeptime,
+	return get_cpu_sleep_time_us(cpu, CPUTIME_IOWAIT,
 				     nr_iowait_cpu(cpu), last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
@@ -1250,10 +1252,8 @@ void tick_nohz_idle_stop_tick(void)
 		ts->idle_sleeps++;
 		ts->idle_expires = expires;
 
-		if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
-			ts->idle_jiffies = ts->last_jiffies;
+		if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED))
 			nohz_balance_enter_idle(cpu);
-		}
 	} else {
 		tick_nohz_retain_tick(ts);
 	}
@@ -1282,6 +1282,7 @@ void tick_nohz_idle_enter(void)
 	WARN_ON_ONCE(ts->timer_expires_base);
 
 	tick_sched_flag_set(ts, TS_FLAG_INIDLE);
+	kcpustat_dyntick_start();
 	tick_nohz_start_idle(ts);
 
 	local_irq_enable();
@@ -1407,37 +1408,12 @@ unsigned long tick_nohz_get_idle_calls_cpu(int cpu)
 	return ts->idle_calls;
 }
 
-static void tick_nohz_account_idle_time(struct tick_sched *ts,
-					ktime_t now)
-{
-	unsigned long ticks;
-
-	ts->idle_exittime = now;
-
-	if (vtime_accounting_enabled_this_cpu())
-		return;
-	/*
-	 * We stopped the tick in idle. update_process_times() would miss the
-	 * time we slept, as it does only a 1 tick accounting.
-	 * Enforce that this is accounted to idle !
-	 */
-	ticks = jiffies - ts->idle_jiffies;
-	/*
-	 * We might be one off. Do not randomly account a huge number of ticks!
-	 */
-	if (ticks && ticks < LONG_MAX)
-		account_idle_ticks(ticks);
-}
-
 void tick_nohz_idle_restart_tick(void)
 {
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
-	if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
-		ktime_t now = ktime_get();
-		tick_nohz_restart_sched_tick(ts, now);
-		tick_nohz_account_idle_time(ts, now);
-	}
+	if (tick_sched_flag_test(ts, TS_FLAG_STOPPED))
+		tick_nohz_restart_sched_tick(ts, ktime_get());
 }
 
 static void tick_nohz_idle_update_tick(struct tick_sched *ts, ktime_t now)
@@ -1446,8 +1422,6 @@ static void tick_nohz_idle_update_tick(struct tick_sched *ts, ktime_t now)
 		__tick_nohz_full_update_tick(ts, now);
 	else
 		tick_nohz_restart_sched_tick(ts, now);
-
-	tick_nohz_account_idle_time(ts, now);
 }
 
 /**
@@ -1489,6 +1463,7 @@ void tick_nohz_idle_exit(void)
 
 	if (tick_stopped)
 		tick_nohz_idle_update_tick(ts, now);
+	kcpustat_dyntick_stop();
 
 	local_irq_enable();
 }
@@ -1625,20 +1600,15 @@ void tick_setup_sched_timer(bool hrtimer)
 void tick_sched_timer_dying(int cpu)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-	ktime_t idle_sleeptime, iowait_sleeptime;
 	unsigned long idle_calls, idle_sleeps;
 
 	/* This must happen before hrtimers are migrated! */
 	if (tick_sched_flag_test(ts, TS_FLAG_HIGHRES))
 		hrtimer_cancel(&ts->sched_timer);
 
-	idle_sleeptime = ts->idle_sleeptime;
-	iowait_sleeptime = ts->iowait_sleeptime;
 	idle_calls = ts->idle_calls;
 	idle_sleeps = ts->idle_sleeps;
 	memset(ts, 0, sizeof(*ts));
-	ts->idle_sleeptime = idle_sleeptime;
-	ts->iowait_sleeptime = iowait_sleeptime;
 	ts->idle_calls = idle_calls;
 	ts->idle_sleeps = idle_sleeps;
 }
-- 
2.51.1



* [PATCH 07/15] cpufreq: ondemand: Simplify idle cputime granularity test
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2026-01-16 14:51 ` [PATCH 06/15] tick/sched: Unify idle cputime accounting Frederic Weisbecker
@ 2026-01-16 14:52 ` Frederic Weisbecker
  2026-01-19  5:37   ` Viresh Kumar
  2026-01-19 12:30   ` Rafael J. Wysocki
  2026-01-16 14:52 ` [PATCH 08/15] tick/sched: Remove nohz disabled special case in cputime fetch Frederic Weisbecker
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

cpufreq calls get_cpu_idle_time_us() just to know whether idle cputime
accounting has nanosecond granularity.

Use the appropriate indicator instead to make that deduction.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 drivers/cpufreq/cpufreq_ondemand.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index a6ecc203f7b7..2d52ee035702 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -334,17 +334,12 @@ static void od_free(struct policy_dbs_info *policy_dbs)
 static int od_init(struct dbs_data *dbs_data)
 {
 	struct od_dbs_tuners *tuners;
-	u64 idle_time;
-	int cpu;
 
 	tuners = kzalloc(sizeof(*tuners), GFP_KERNEL);
 	if (!tuners)
 		return -ENOMEM;
 
-	cpu = get_cpu();
-	idle_time = get_cpu_idle_time_us(cpu, NULL);
-	put_cpu();
-	if (idle_time != -1ULL) {
+	if (tick_nohz_enabled) {
 		/* Idle micro accounting is supported. Use finer thresholds */
 		dbs_data->up_threshold = MICRO_FREQUENCY_UP_THRESHOLD;
 	} else {
-- 
2.51.1



* [PATCH 08/15] tick/sched: Remove nohz disabled special case in cputime fetch
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2026-01-16 14:52 ` [PATCH 07/15] cpufreq: ondemand: Simplify idle cputime granularity test Frederic Weisbecker
@ 2026-01-16 14:52 ` Frederic Weisbecker
  2026-01-16 14:52 ` [PATCH 09/15] tick/sched: Move dyntick-idle cputime accounting to cputime code Frederic Weisbecker
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

Even when nohz is not enabled at runtime, the dynticks-idle cputime
accounting can run, and the common idle cputime accessors remain
relevant.

Remove the nohz disabled special case accordingly.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/time/tick-sched.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 74c97ad75856..f0b79e876997 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -786,7 +786,7 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
 	ktime_t now, idle;
 	unsigned int seq;
 
-	if (!tick_nohz_active || vtime_generic_enabled_cpu(cpu))
+	if (vtime_generic_enabled_cpu(cpu))
 		return -1;
 
 	now = ktime_get();
@@ -824,7 +824,7 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
  * This time is measured via accounting rather than sampling,
  * and is as accurate as ktime_get() is.
  *
- * Return: -1 if NOHZ is not enabled, else total idle time of the @cpu
+ * Return: -1 if generic vtime is enabled, else total idle time of the @cpu
  */
 u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 {
@@ -848,7 +848,7 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
  * This time is measured via accounting rather than sampling,
  * and is as accurate as ktime_get() is.
  *
- * Return: -1 if NOHZ is not enabled, else total iowait time of @cpu
+ * Return: -1 if generic vtime is enabled, else total iowait time of @cpu
  */
 u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 {
-- 
2.51.1



* [PATCH 09/15] tick/sched: Move dyntick-idle cputime accounting to cputime code
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2026-01-16 14:52 ` [PATCH 08/15] tick/sched: Remove nohz disabled special case in cputime fetch Frederic Weisbecker
@ 2026-01-16 14:52 ` Frederic Weisbecker
  2026-01-19 14:35   ` Peter Zijlstra
  2026-01-16 14:52 ` [PATCH 10/15] tick/sched: Remove unused fields Frederic Weisbecker
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

Although the dynticks-idle cputime accounting is necessarily tied to
the tick subsystem, the actual related accounting code has no business
residing there and should be part of the scheduler cputime code.

Move the relevant pieces and state machine to where they belong.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/kernel_stat.h |  14 +++-
 kernel/sched/cputime.c      | 145 ++++++++++++++++++++++++++++++--
 kernel/time/tick-sched.c    | 161 +++++++-----------------------------
 3 files changed, 180 insertions(+), 140 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index ba65aad308a1..a906492eb680 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -35,9 +35,12 @@ enum cpu_usage_stat {
 
 struct kernel_cpustat {
 #ifdef CONFIG_NO_HZ_COMMON
-	int idle_dyntick;
+	bool		idle_dyntick;
+	bool		idle_elapse;
+	seqcount_t	idle_sleeptime_seq;
+	ktime_t		idle_entrytime;
 #endif
-	u64 cpustat[NR_STATS];
+	u64		cpustat[NR_STATS];
 };
 
 struct kernel_stat {
@@ -103,8 +106,11 @@ static inline unsigned long kstat_cpu_irqs_sum(unsigned int cpu)
 }
 
 #ifdef CONFIG_NO_HZ_COMMON
-extern void kcpustat_dyntick_start(void);
-extern void kcpustat_dyntick_stop(void);
+extern void kcpustat_dyntick_start(ktime_t now);
+extern void kcpustat_dyntick_stop(ktime_t now);
+extern void kcpustat_irq_enter(ktime_t now);
+extern void kcpustat_irq_exit(ktime_t now);
+
 static inline bool kcpustat_idle_dyntick(void)
 {
 	return __this_cpu_read(kernel_cpustat.idle_dyntick);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index c10fcc3d65b3..16d6730efe6d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -406,22 +406,153 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
 #ifdef CONFIG_NO_HZ_COMMON
-void kcpustat_dyntick_start(void)
+static void kcpustat_idle_stop(struct kernel_cpustat *kc, ktime_t now)
 {
-	if (!vtime_generic_enabled_this_cpu()) {
-		vtime_dyntick_start();
-		__this_cpu_write(kernel_cpustat.idle_dyntick, 1);
-	}
+	u64 *cpustat = kc->cpustat;
+	ktime_t delta;
+
+	if (!kc->idle_elapse)
+		return;
+
+	delta = ktime_sub(now, kc->idle_entrytime);
+
+	write_seqcount_begin(&kc->idle_sleeptime_seq);
+	if (nr_iowait_cpu(smp_processor_id()) > 0)
+		cpustat[CPUTIME_IOWAIT] = ktime_add(cpustat[CPUTIME_IOWAIT], delta);
+	else
+		cpustat[CPUTIME_IDLE] = ktime_add(cpustat[CPUTIME_IDLE], delta);
+
+	kc->idle_entrytime = now;
+	kc->idle_elapse = false;
+	write_seqcount_end(&kc->idle_sleeptime_seq);
 }
 
-void kcpustat_dyntick_stop(void)
+static void kcpustat_idle_start(struct kernel_cpustat *kc, ktime_t now)
 {
+	write_seqcount_begin(&kc->idle_sleeptime_seq);
+	kc->idle_entrytime = now;
+	kc->idle_elapse = true;
+	write_seqcount_end(&kc->idle_sleeptime_seq);
+}
+
+void kcpustat_dyntick_stop(ktime_t now)
+{
+	struct kernel_cpustat *kc = kcpustat_this_cpu;
+
 	if (!vtime_generic_enabled_this_cpu()) {
-		__this_cpu_write(kernel_cpustat.idle_dyntick, 0);
+		WARN_ON_ONCE(!kc->idle_dyntick);
+		kcpustat_idle_stop(kc, now);
+		kc->idle_dyntick = false;
 		vtime_dyntick_stop();
 		steal_account_process_time(ULONG_MAX);
 	}
 }
+
+void kcpustat_dyntick_start(ktime_t now)
+{
+	struct kernel_cpustat *kc = kcpustat_this_cpu;
+
+	if (!vtime_generic_enabled_this_cpu()) {
+		vtime_dyntick_start();
+		kc->idle_dyntick = true;
+		kcpustat_idle_start(kc, now);
+	}
+}
+
+void kcpustat_irq_enter(ktime_t now)
+{
+	struct kernel_cpustat *kc = kcpustat_this_cpu;
+
+	if (!vtime_generic_enabled_this_cpu())
+		kcpustat_idle_stop(kc, now);
+}
+
+void kcpustat_irq_exit(ktime_t now)
+{
+	struct kernel_cpustat *kc = kcpustat_this_cpu;
+
+	if (!vtime_generic_enabled_this_cpu())
+		kcpustat_idle_start(kc, now);
+}
+
+static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
+				 bool compute_delta, u64 *last_update_time)
+{
+	struct kernel_cpustat *kc = &kcpustat_cpu(cpu);
+	u64 *cpustat = kc->cpustat;
+	ktime_t now, idle;
+	unsigned int seq;
+
+	if (vtime_generic_enabled_cpu(cpu))
+		return -1;
+
+	now = ktime_get();
+	if (last_update_time)
+		*last_update_time = ktime_to_us(now);
+
+	do {
+		seq = read_seqcount_begin(&kc->idle_sleeptime_seq);
+
+		if (kc->idle_elapse && compute_delta) {
+			ktime_t delta = ktime_sub(now, kc->idle_entrytime);
+
+			idle = ktime_add(cpustat[idx], delta);
+		} else {
+			idle = cpustat[idx];
+		}
+	} while (read_seqcount_retry(&kc->idle_sleeptime_seq, seq));
+
+	return ktime_to_us(idle);
+}
+
+/**
+ * get_cpu_idle_time_us - get the total idle time of a CPU
+ * @cpu: CPU number to query
+ * @last_update_time: variable to store update time in. Do not update
+ * counters if NULL.
+ *
+ * Return the cumulative idle time (since boot) for a given
+ * CPU, in microseconds. Note that this is partially broken due to
+ * the counter of iowait tasks that can be remotely updated without
+ * any synchronization. Therefore it is possible to observe backward
+ * values within two consecutive reads.
+ *
+ * This time is measured via accounting rather than sampling,
+ * and is as accurate as ktime_get() is.
+ *
+ * Return: -1 if generic vtime is enabled, else total idle time of the @cpu
+ */
+u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
+{
+	return get_cpu_sleep_time_us(cpu, CPUTIME_IDLE,
+				     !nr_iowait_cpu(cpu), last_update_time);
+}
+EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
+
+/**
+ * get_cpu_iowait_time_us - get the total iowait time of a CPU
+ * @cpu: CPU number to query
+ * @last_update_time: variable to store update time in. Do not update
+ * counters if NULL.
+ *
+ * Return the cumulative iowait time (since boot) for a given
+ * CPU, in microseconds. Note this is partially broken due to
+ * the counter of iowait tasks that can be remotely updated without
+ * any synchronization. Therefore it is possible to observe backward
+ * values within two consecutive reads.
+ *
+ * This time is measured via accounting rather than sampling,
+ * and is as accurate as ktime_get() is.
+ *
+ * Return: -1 if generic vtime is enabled, else total iowait time of @cpu
+ */
+u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
+{
+	return get_cpu_sleep_time_us(cpu, CPUTIME_IOWAIT,
+				     nr_iowait_cpu(cpu), last_update_time);
+}
+EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
+
 #endif /* CONFIG_NO_HZ_COMMON */
 
 /*
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f0b79e876997..cbd645fb8df6 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -740,123 +740,6 @@ static void tick_nohz_update_jiffies(ktime_t now)
 	touch_softlockup_watchdog_sched();
 }
 
-static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
-{
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
-	ktime_t delta;
-
-	if (vtime_generic_enabled_this_cpu())
-		return;
-
-	if (WARN_ON_ONCE(!tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE)))
-		return;
-
-	delta = ktime_sub(now, ts->idle_entrytime);
-
-	write_seqcount_begin(&ts->idle_sleeptime_seq);
-	if (nr_iowait_cpu(smp_processor_id()) > 0)
-		cpustat[CPUTIME_IOWAIT] = ktime_add(cpustat[CPUTIME_IOWAIT], delta);
-	else
-		cpustat[CPUTIME_IDLE] = ktime_add(cpustat[CPUTIME_IDLE], delta);
-
-	ts->idle_entrytime = now;
-	tick_sched_flag_clear(ts, TS_FLAG_IDLE_ACTIVE);
-	write_seqcount_end(&ts->idle_sleeptime_seq);
-
-	sched_clock_idle_wakeup_event();
-}
-
-static void tick_nohz_start_idle(struct tick_sched *ts)
-{
-	if (vtime_generic_enabled_this_cpu())
-		return;
-
-	write_seqcount_begin(&ts->idle_sleeptime_seq);
-	ts->idle_entrytime = ktime_get();
-	tick_sched_flag_set(ts, TS_FLAG_IDLE_ACTIVE);
-	write_seqcount_end(&ts->idle_sleeptime_seq);
-	sched_clock_idle_sleep_event();
-}
-
-static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
-				 bool compute_delta, u64 *last_update_time)
-{
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-	u64 *cpustat = kcpustat_cpu(cpu).cpustat;
-	ktime_t now, idle;
-	unsigned int seq;
-
-	if (vtime_generic_enabled_cpu(cpu))
-		return -1;
-
-	now = ktime_get();
-	if (last_update_time)
-		*last_update_time = ktime_to_us(now);
-
-	do {
-		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
-
-		if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE) && compute_delta) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-
-			idle = ktime_add(cpustat[idx], delta);
-		} else {
-			idle = cpustat[idx];
-		}
-	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
-
-	return ktime_to_us(idle);
-
-}
-
-/**
- * get_cpu_idle_time_us - get the total idle time of a CPU
- * @cpu: CPU number to query
- * @last_update_time: variable to store update time in. Do not update
- * counters if NULL.
- *
- * Return the cumulative idle time (since boot) for a given
- * CPU, in microseconds. Note that this is partially broken due to
- * the counter of iowait tasks that can be remotely updated without
- * any synchronization. Therefore it is possible to observe backward
- * values within two consecutive reads.
- *
- * This time is measured via accounting rather than sampling,
- * and is as accurate as ktime_get() is.
- *
- * Return: -1 if generic vtime is enabled, else total idle time of the @cpu
- */
-u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
-{
-	return get_cpu_sleep_time_us(cpu, CPUTIME_IDLE,
-				     !nr_iowait_cpu(cpu), last_update_time);
-}
-EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
-
-/**
- * get_cpu_iowait_time_us - get the total iowait time of a CPU
- * @cpu: CPU number to query
- * @last_update_time: variable to store update time in. Do not update
- * counters if NULL.
- *
- * Return the cumulative iowait time (since boot) for a given
- * CPU, in microseconds. Note this is partially broken due to
- * the counter of iowait tasks that can be remotely updated without
- * any synchronization. Therefore it is possible to observe backward
- * values within two consecutive reads.
- *
- * This time is measured via accounting rather than sampling,
- * and is as accurate as ktime_get() is.
- *
- * Return: -1 if generic vtime is enabled, else total iowait time of @cpu
- */
-u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
-{
-	return get_cpu_sleep_time_us(cpu, CPUTIME_IOWAIT,
-				     nr_iowait_cpu(cpu), last_update_time);
-}
-EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
-
 static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 {
 	hrtimer_cancel(&ts->sched_timer);
@@ -1264,6 +1147,20 @@ void tick_nohz_idle_retain_tick(void)
 	tick_nohz_retain_tick(this_cpu_ptr(&tick_cpu_sched));
 }
 
+static void tick_nohz_clock_sleep(struct tick_sched *ts)
+{
+	tick_sched_flag_set(ts, TS_FLAG_IDLE_ACTIVE);
+	sched_clock_idle_sleep_event();
+}
+
+static void tick_nohz_clock_wakeup(struct tick_sched *ts)
+{
+	if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE)) {
+		tick_sched_flag_clear(ts, TS_FLAG_IDLE_ACTIVE);
+		sched_clock_idle_wakeup_event();
+	}
+}
+
 /**
  * tick_nohz_idle_enter - prepare for entering idle on the current CPU
  *
@@ -1278,12 +1175,10 @@ void tick_nohz_idle_enter(void)
 	local_irq_disable();
 
 	ts = this_cpu_ptr(&tick_cpu_sched);
-
 	WARN_ON_ONCE(ts->timer_expires_base);
-
-	tick_sched_flag_set(ts, TS_FLAG_INIDLE);
-	kcpustat_dyntick_start();
-	tick_nohz_start_idle(ts);
+	ts->idle_entrytime = ktime_get();
+	kcpustat_dyntick_start(ts->idle_entrytime);
+	tick_nohz_clock_sleep(ts);
 
 	local_irq_enable();
 }
@@ -1311,10 +1206,13 @@ void tick_nohz_irq_exit(void)
 {
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
-	if (tick_sched_flag_test(ts, TS_FLAG_INIDLE))
-		tick_nohz_start_idle(ts);
-	else
+	if (tick_sched_flag_test(ts, TS_FLAG_INIDLE)) {
+		ts->idle_entrytime = ktime_get();
+		kcpustat_irq_exit(ts->idle_entrytime);
+		tick_nohz_clock_sleep(ts);
+	} else {
 		tick_nohz_full_update_tick(ts);
+	}
 }
 
 /**
@@ -1459,11 +1357,11 @@ void tick_nohz_idle_exit(void)
 		now = ktime_get();
 
 	if (idle_active)
-		tick_nohz_stop_idle(ts, now);
+		tick_nohz_clock_wakeup(ts);
 
 	if (tick_stopped)
 		tick_nohz_idle_update_tick(ts, now);
-	kcpustat_dyntick_stop();
+	kcpustat_dyntick_stop(now);
 
 	local_irq_enable();
 }
@@ -1519,9 +1417,14 @@ static inline void tick_nohz_irq_enter(void)
 
 	if (!tick_sched_flag_test(ts, TS_FLAG_STOPPED | TS_FLAG_IDLE_ACTIVE))
 		return;
+
 	now = ktime_get();
-	if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE))
-		tick_nohz_stop_idle(ts, now);
+
+	if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE)) {
+		tick_nohz_clock_wakeup(ts);
+		kcpustat_irq_enter(now);
+	}
+
 	/*
 	 * If all CPUs are idle we may need to update a stale jiffies value.
 	 * Note nohz_full is a special case: a timekeeper is guaranteed to stay
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 10/15] tick/sched: Remove unused fields
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (8 preceding siblings ...)
  2026-01-16 14:52 ` [PATCH 09/15] tick/sched: Move dyntick-idle cputime accounting to cputime code Frederic Weisbecker
@ 2026-01-16 14:52 ` Frederic Weisbecker
  2026-01-16 14:52 ` [PATCH 11/15] tick/sched: Account tickless idle cputime only when tick is stopped Frederic Weisbecker
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

Remove the tick_sched fields left unused after the dyntick-idle cputime
accounting migration to the scheduler code.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/time/tick-sched.h       | 12 ------------
 kernel/time/timer_list.c       |  6 +-----
 scripts/gdb/linux/timerlist.py |  4 ----
 3 files changed, 1 insertion(+), 21 deletions(-)

diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
index b4a7822f495d..79b9252047b1 100644
--- a/kernel/time/tick-sched.h
+++ b/kernel/time/tick-sched.h
@@ -44,9 +44,7 @@ struct tick_device {
  *			to resume the tick timer operation in the timeline
  *			when the CPU returns from nohz sleep.
  * @next_tick:		Next tick to be fired when in dynticks mode.
- * @idle_jiffies:	jiffies at the entry to idle for idle time accounting
  * @idle_waketime:	Time when the idle was interrupted
- * @idle_sleeptime_seq:	sequence counter for data consistency
  * @idle_entrytime:	Time when the idle call was entered
  * @last_jiffies:	Base jiffies snapshot when next event was last computed
  * @timer_expires_base:	Base time clock monotonic for @timer_expires
@@ -55,9 +53,6 @@ struct tick_device {
  * @idle_expires:	Next tick in idle, for debugging purpose only
  * @idle_calls:		Total number of idle calls
  * @idle_sleeps:	Number of idle calls, where the sched tick was stopped
- * @idle_exittime:	Time when the idle state was left
- * @idle_sleeptime:	Sum of the time slept in idle with sched tick stopped
- * @iowait_sleeptime:	Sum of the time slept in idle with sched tick stopped, with IO outstanding
  * @tick_dep_mask:	Tick dependency mask - is set, if someone needs the tick
  * @check_clocks:	Notification mechanism about clocksource changes
  */
@@ -73,12 +68,10 @@ struct tick_sched {
 	struct hrtimer			sched_timer;
 	ktime_t				last_tick;
 	ktime_t				next_tick;
-	unsigned long			idle_jiffies;
 	ktime_t				idle_waketime;
 	unsigned int			got_idle_tick;
 
 	/* Idle entry */
-	seqcount_t			idle_sleeptime_seq;
 	ktime_t				idle_entrytime;
 
 	/* Tick stop */
@@ -90,11 +83,6 @@ struct tick_sched {
 	unsigned long			idle_calls;
 	unsigned long			idle_sleeps;
 
-	/* Idle exit */
-	ktime_t				idle_exittime;
-	ktime_t				idle_sleeptime;
-	ktime_t				iowait_sleeptime;
-
 	/* Full dynticks handling */
 	atomic_t			tick_dep_mask;
 
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index 488e47e96e93..e77b512e8597 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -154,14 +154,10 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
 		P_flag(highres, TS_FLAG_HIGHRES);
 		P_ns(last_tick);
 		P_flag(tick_stopped, TS_FLAG_STOPPED);
-		P(idle_jiffies);
 		P(idle_calls);
 		P(idle_sleeps);
 		P_ns(idle_entrytime);
 		P_ns(idle_waketime);
-		P_ns(idle_exittime);
-		P_ns(idle_sleeptime);
-		P_ns(iowait_sleeptime);
 		P(last_jiffies);
 		P(next_timer);
 		P_ns(idle_expires);
@@ -258,7 +254,7 @@ static void timer_list_show_tickdevices_header(struct seq_file *m)
 
 static inline void timer_list_header(struct seq_file *m, u64 now)
 {
-	SEQ_printf(m, "Timer List Version: v0.10\n");
+	SEQ_printf(m, "Timer List Version: v0.11\n");
 	SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES);
 	SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);
 	SEQ_printf(m, "\n");
diff --git a/scripts/gdb/linux/timerlist.py b/scripts/gdb/linux/timerlist.py
index ccc24d30de80..c14ce55674c9 100644
--- a/scripts/gdb/linux/timerlist.py
+++ b/scripts/gdb/linux/timerlist.py
@@ -90,14 +90,10 @@ def print_cpu(hrtimer_bases, cpu, max_clock_bases):
             text += f"  .{'nohz':15s}: {int(bool(ts['flags'] & TS_FLAG_NOHZ))}\n"
             text += f"  .{'last_tick':15s}: {ts['last_tick']}\n"
             text += f"  .{'tick_stopped':15s}: {int(bool(ts['flags'] & TS_FLAG_STOPPED))}\n"
-            text += f"  .{'idle_jiffies':15s}: {ts['idle_jiffies']}\n"
             text += f"  .{'idle_calls':15s}: {ts['idle_calls']}\n"
             text += f"  .{'idle_sleeps':15s}: {ts['idle_sleeps']}\n"
             text += f"  .{'idle_entrytime':15s}: {ts['idle_entrytime']} nsecs\n"
             text += f"  .{'idle_waketime':15s}: {ts['idle_waketime']} nsecs\n"
-            text += f"  .{'idle_exittime':15s}: {ts['idle_exittime']} nsecs\n"
-            text += f"  .{'idle_sleeptime':15s}: {ts['idle_sleeptime']} nsecs\n"
-            text += f"  .{'iowait_sleeptime':15s}: {ts['iowait_sleeptime']} nsecs\n"
             text += f"  .{'last_jiffies':15s}: {ts['last_jiffies']}\n"
             text += f"  .{'next_timer':15s}: {ts['next_timer']}\n"
             text += f"  .{'idle_expires':15s}: {ts['idle_expires']} nsecs\n"
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 11/15] tick/sched: Account tickless idle cputime only when tick is stopped
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (9 preceding siblings ...)
  2026-01-16 14:52 ` [PATCH 10/15] tick/sched: Remove unused fields Frederic Weisbecker
@ 2026-01-16 14:52 ` Frederic Weisbecker
  2026-01-16 14:52 ` [PATCH 12/15] tick/sched: Consolidate idle time fetching APIs Frederic Weisbecker
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

There is no real point in switching to dyntick-idle cputime accounting
mode if the tick is not actually stopped. Doing so just adds needless
overhead, notably a GTOD fetch on each idle exit and each idle IRQ
entry, during short idle trips.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/time/tick-sched.c | 44 ++++++++++++++++++----------------------
 1 file changed, 20 insertions(+), 24 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index cbd645fb8df6..05da130d257a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1135,8 +1135,10 @@ void tick_nohz_idle_stop_tick(void)
 		ts->idle_sleeps++;
 		ts->idle_expires = expires;
 
-		if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED))
+		if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
+			kcpustat_dyntick_start(ts->idle_entrytime);
 			nohz_balance_enter_idle(cpu);
+		}
 	} else {
 		tick_nohz_retain_tick(ts);
 	}
@@ -1177,7 +1179,6 @@ void tick_nohz_idle_enter(void)
 	ts = this_cpu_ptr(&tick_cpu_sched);
 	WARN_ON_ONCE(ts->timer_expires_base);
 	ts->idle_entrytime = ktime_get();
-	kcpustat_dyntick_start(ts->idle_entrytime);
 	tick_nohz_clock_sleep(ts);
 
 	local_irq_enable();
@@ -1207,9 +1208,10 @@ void tick_nohz_irq_exit(void)
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
 	if (tick_sched_flag_test(ts, TS_FLAG_INIDLE)) {
-		ts->idle_entrytime = ktime_get();
-		kcpustat_irq_exit(ts->idle_entrytime);
 		tick_nohz_clock_sleep(ts);
+		ts->idle_entrytime = ktime_get();
+		if (tick_sched_flag_test(ts, TS_FLAG_STOPPED))
+			kcpustat_irq_exit(ts->idle_entrytime);
 	} else {
 		tick_nohz_full_update_tick(ts);
 	}
@@ -1310,8 +1312,11 @@ void tick_nohz_idle_restart_tick(void)
 {
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
-	if (tick_sched_flag_test(ts, TS_FLAG_STOPPED))
-		tick_nohz_restart_sched_tick(ts, ktime_get());
+	if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
+		ktime_t now = ktime_get();
+		kcpustat_dyntick_stop(now);
+		tick_nohz_restart_sched_tick(ts, now);
+	}
 }
 
 static void tick_nohz_idle_update_tick(struct tick_sched *ts, ktime_t now)
@@ -1341,7 +1346,6 @@ static void tick_nohz_idle_update_tick(struct tick_sched *ts, ktime_t now)
 void tick_nohz_idle_exit(void)
 {
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
-	bool idle_active, tick_stopped;
 	ktime_t now;
 
 	local_irq_disable();
@@ -1350,18 +1354,13 @@ void tick_nohz_idle_exit(void)
 	WARN_ON_ONCE(ts->timer_expires_base);
 
 	tick_sched_flag_clear(ts, TS_FLAG_INIDLE);
-	idle_active = tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE);
-	tick_stopped = tick_sched_flag_test(ts, TS_FLAG_STOPPED);
+	tick_nohz_clock_wakeup(ts);
 
-	if (idle_active || tick_stopped)
+	if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
 		now = ktime_get();
-
-	if (idle_active)
-		tick_nohz_clock_wakeup(ts);
-
-	if (tick_stopped)
+		kcpustat_dyntick_stop(now);
 		tick_nohz_idle_update_tick(ts, now);
-	kcpustat_dyntick_stop(now);
+	}
 
 	local_irq_enable();
 }
@@ -1415,15 +1414,13 @@ static inline void tick_nohz_irq_enter(void)
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 	ktime_t now;
 
-	if (!tick_sched_flag_test(ts, TS_FLAG_STOPPED | TS_FLAG_IDLE_ACTIVE))
+	tick_nohz_clock_wakeup(ts);
+
+	if (!tick_sched_flag_test(ts, TS_FLAG_STOPPED))
 		return;
 
 	now = ktime_get();
-
-	if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE)) {
-		tick_nohz_clock_wakeup(ts);
-		kcpustat_irq_enter(now);
-	}
+	kcpustat_irq_enter(now);
 
 	/*
 	 * If all CPUs are idle we may need to update a stale jiffies value.
@@ -1432,8 +1429,7 @@ static inline void tick_nohz_irq_enter(void)
 	 * rare case (typically stop machine). So we must make sure we have a
 	 * last resort.
 	 */
-	if (tick_sched_flag_test(ts, TS_FLAG_STOPPED))
-		tick_nohz_update_jiffies(now);
+	tick_nohz_update_jiffies(now);
 }
 
 #else
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 12/15] tick/sched: Consolidate idle time fetching APIs
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (10 preceding siblings ...)
  2026-01-16 14:52 ` [PATCH 11/15] tick/sched: Account tickless idle cputime only when tick is stopped Frederic Weisbecker
@ 2026-01-16 14:52 ` Frederic Weisbecker
  2026-01-16 14:52 ` [PATCH 13/15] sched/cputime: Consolidate get_cpu_[idle|iowait]_time_us() Frederic Weisbecker
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

The idle cputime can be fetched through a variety of accessors scattered
all over the place, depending on the accounting flavour and the caller's
needs:

- idle vtime generic accounting can be accessed by kcpustat_field(),
  kcpustat_cpu_fetch(), or get_idle/iowait_time() but not by
  get_cpu_idle/iowait_time_us()

- dynticks-idle accounting can only be accessed by get_idle/iowait_time()
  or get_cpu_idle/iowait_time_us()

- CONFIG_NO_HZ_COMMON=n idle accounting can be accessed by kcpustat_field()
  kcpustat_cpu_fetch(), or get_idle/iowait_time() but not by
  get_cpu_idle/iowait_time_us()

Moreover get_idle/iowait_time() relies on get_cpu_idle/iowait_time_us()
with a nonsensical conversion to microseconds and back to nanoseconds
along the way.

Start consolidating the APIs by removing get_idle/iowait_time() and
making kcpustat_field() and kcpustat_cpu_fetch() work in all cases.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 fs/proc/stat.c              | 40 +++-----------------------
 fs/proc/uptime.c            |  8 ++----
 include/linux/kernel_stat.h | 34 +++++++++++++++++++---
 kernel/sched/cputime.c      | 57 ++++++++++++++++++++++++++-----------
 4 files changed, 76 insertions(+), 63 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 6ac2a13b8be5..c00468a83f64 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -22,38 +22,6 @@
 #define arch_irq_stat() 0
 #endif
 
-u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
-{
-	u64 idle, idle_usecs = -1ULL;
-
-	if (cpu_online(cpu))
-		idle_usecs = get_cpu_idle_time_us(cpu, NULL);
-
-	if (idle_usecs == -1ULL)
-		/* !NO_HZ or cpu offline or vtime so we can rely on cpustat.idle */
-		idle = kcpustat_field(CPUTIME_IDLE, cpu);
-	else
-		idle = idle_usecs * NSEC_PER_USEC;
-
-	return idle;
-}
-
-static u64 get_iowait_time(struct kernel_cpustat *kcs, int cpu)
-{
-	u64 iowait, iowait_usecs = -1ULL;
-
-	if (cpu_online(cpu))
-		iowait_usecs = get_cpu_iowait_time_us(cpu, NULL);
-
-	if (iowait_usecs == -1ULL)
-		/* !NO_HZ or cpu offline or vtime so we can rely on cpustat.iowait */
-		iowait = kcpustat_field(CPUTIME_IOWAIT, cpu);
-	else
-		iowait = iowait_usecs * NSEC_PER_USEC;
-
-	return iowait;
-}
-
 static void show_irq_gap(struct seq_file *p, unsigned int gap)
 {
 	static const char zeros[] = " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0";
@@ -105,8 +73,8 @@ static int show_stat(struct seq_file *p, void *v)
 		user		+= cpustat[CPUTIME_USER];
 		nice		+= cpustat[CPUTIME_NICE];
 		system		+= cpustat[CPUTIME_SYSTEM];
-		idle		+= get_idle_time(&kcpustat, i);
-		iowait		+= get_iowait_time(&kcpustat, i);
+		idle		+= cpustat[CPUTIME_IDLE];
+		iowait		+= cpustat[CPUTIME_IOWAIT];
 		irq		+= cpustat[CPUTIME_IRQ];
 		softirq		+= cpustat[CPUTIME_SOFTIRQ];
 		steal		+= cpustat[CPUTIME_STEAL];
@@ -146,8 +114,8 @@ static int show_stat(struct seq_file *p, void *v)
 		user		= cpustat[CPUTIME_USER];
 		nice		= cpustat[CPUTIME_NICE];
 		system		= cpustat[CPUTIME_SYSTEM];
-		idle		= get_idle_time(&kcpustat, i);
-		iowait		= get_iowait_time(&kcpustat, i);
+		idle		= cpustat[CPUTIME_IDLE];
+		iowait		= cpustat[CPUTIME_IOWAIT];
 		irq		= cpustat[CPUTIME_IRQ];
 		softirq		= cpustat[CPUTIME_SOFTIRQ];
 		steal		= cpustat[CPUTIME_STEAL];
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index b5343d209381..433aa947cd57 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -18,12 +18,8 @@ static int uptime_proc_show(struct seq_file *m, void *v)
 	int i;
 
 	idle_nsec = 0;
-	for_each_possible_cpu(i) {
-		struct kernel_cpustat kcs;
-
-		kcpustat_cpu_fetch(&kcs, i);
-		idle_nsec += get_idle_time(&kcs, i);
-	}
+	for_each_possible_cpu(i)
+		idle_nsec += kcpustat_field(CPUTIME_IDLE, i);
 
 	ktime_get_boottime_ts64(&uptime);
 	timens_add_boottime(&uptime);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index a906492eb680..e1efd26e56f0 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -110,32 +110,59 @@ extern void kcpustat_dyntick_start(ktime_t now);
 extern void kcpustat_dyntick_stop(ktime_t now);
 extern void kcpustat_irq_enter(ktime_t now);
 extern void kcpustat_irq_exit(ktime_t now);
+extern u64 kcpustat_field_idle(int cpu);
+extern u64 kcpustat_field_iowait(int cpu);
 
 static inline bool kcpustat_idle_dyntick(void)
 {
 	return __this_cpu_read(kernel_cpustat.idle_dyntick);
 }
 #else
+static inline u64 kcpustat_field_idle(int cpu)
+{
+	return kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
+}
+static inline u64 kcpustat_field_iowait(int cpu)
+{
+	return kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
+}
+
 static inline bool kcpustat_idle_dyntick(void)
 {
 	return false;
 }
 #endif /* CONFIG_NO_HZ_COMMON */
 
+/* Fetch cputime values when vtime is disabled on a CPU */
+static inline u64 kcpustat_field_default(enum cpu_usage_stat usage, int cpu)
+{
+	if (usage == CPUTIME_IDLE)
+		return kcpustat_field_idle(cpu);
+	if (usage == CPUTIME_IOWAIT)
+		return kcpustat_field_iowait(cpu);
+	return kcpustat_cpu(cpu).cpustat[usage];
+}
+
+static inline void kcpustat_cpu_fetch_default(struct kernel_cpustat *dst, int cpu)
+{
+	*dst = kcpustat_cpu(cpu);
+	dst->cpustat[CPUTIME_IDLE] = kcpustat_field_idle(cpu);
+	dst->cpustat[CPUTIME_IOWAIT] = kcpustat_field_iowait(cpu);
+}
+
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern u64 kcpustat_field(enum cpu_usage_stat usage, int cpu);
 extern void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu);
 #else
 static inline u64 kcpustat_field(enum cpu_usage_stat usage, int cpu)
 {
-	return kcpustat_cpu(cpu).cpustat[usage];
+	return kcpustat_field_default(usage, cpu);
 }
 
 static inline void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
 {
-	*dst = kcpustat_cpu(cpu);
+	kcpustat_cpu_fetch_default(dst, cpu);
 }
-
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_GEN */
 
 extern void account_user_time(struct task_struct *, u64);
@@ -145,7 +172,6 @@ extern void account_system_index_time(struct task_struct *, u64,
 				      enum cpu_usage_stat);
 extern void account_steal_time(u64);
 extern void account_idle_time(u64);
-extern u64 get_idle_time(struct kernel_cpustat *kcs, int cpu);
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline void account_process_tick(struct task_struct *tsk, int user)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 16d6730efe6d..9906abe5d7bc 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -475,21 +475,14 @@ void kcpustat_irq_exit(ktime_t now)
 		kcpustat_idle_start(kc, now);
 }
 
-static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
-				 bool compute_delta, u64 *last_update_time)
+static u64 kcpustat_field_dyntick(int cpu, enum cpu_usage_stat idx,
+				  bool compute_delta, ktime_t now)
 {
 	struct kernel_cpustat *kc = &kcpustat_cpu(cpu);
 	u64 *cpustat = kc->cpustat;
-	ktime_t now, idle;
+	ktime_t idle;
 	unsigned int seq;
 
-	if (vtime_generic_enabled_cpu(cpu))
-		return -1;
-
-	now = ktime_get();
-	if (last_update_time)
-		*last_update_time = ktime_to_us(now);
-
 	do {
 		seq = read_seqcount_begin(&kc->idle_sleeptime_seq);
 
@@ -502,7 +495,38 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
 		}
 	} while (read_seqcount_retry(&kc->idle_sleeptime_seq, seq));
 
-	return ktime_to_us(idle);
+	return idle;
+}
+
+u64 kcpustat_field_idle(int cpu)
+{
+	return kcpustat_field_dyntick(cpu, CPUTIME_IDLE,
+				      !nr_iowait_cpu(cpu), ktime_get());
+}
+EXPORT_SYMBOL_GPL(kcpustat_field_idle);
+
+u64 kcpustat_field_iowait(int cpu)
+{
+	return kcpustat_field_dyntick(cpu, CPUTIME_IOWAIT,
+				      nr_iowait_cpu(cpu), ktime_get());
+}
+EXPORT_SYMBOL_GPL(kcpustat_field_iowait);
+
+static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
+				 bool compute_delta, u64 *last_update_time)
+{
+	ktime_t now = ktime_get();
+	u64 res;
+
+	if (vtime_generic_enabled_cpu(cpu))
+		return -1;
+	else
+		res = kcpustat_field_dyntick(cpu, idx, compute_delta, now);
+
+	if (last_update_time)
+		*last_update_time = ktime_to_us(now);
+
+	return ktime_to_us(res);
 }
 
 /**
@@ -552,7 +576,6 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 				     nr_iowait_cpu(cpu), last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
-
 #endif /* CONFIG_NO_HZ_COMMON */
 
 /*
@@ -1110,8 +1133,8 @@ u64 kcpustat_field(enum cpu_usage_stat usage, int cpu)
 	struct rq *rq;
 	int err;
 
-	if (!vtime_accounting_enabled_cpu(cpu))
-		return val;
+	if (!vtime_generic_enabled_cpu(cpu))
+		return kcpustat_field_default(usage, cpu);
 
 	rq = cpu_rq(cpu);
 
@@ -1206,8 +1229,8 @@ void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
 	struct rq *rq;
 	int err;
 
-	if (!vtime_accounting_enabled_cpu(cpu)) {
-		*dst = *src;
+	if (!vtime_generic_enabled_cpu(cpu)) {
+		kcpustat_cpu_fetch_default(dst, cpu);
 		return;
 	}
 
@@ -1220,7 +1243,7 @@ void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
 		curr = rcu_dereference(rq->curr);
 		if (WARN_ON_ONCE(!curr)) {
 			rcu_read_unlock();
-			*dst = *src;
+			kcpustat_cpu_fetch_default(dst, cpu);
 			return;
 		}
 
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 13/15] sched/cputime: Consolidate get_cpu_[idle|iowait]_time_us()
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (11 preceding siblings ...)
  2026-01-16 14:52 ` [PATCH 12/15] tick/sched: Consolidate idle time fetching APIs Frederic Weisbecker
@ 2026-01-16 14:52 ` Frederic Weisbecker
  2026-01-16 14:52 ` [PATCH 14/15] sched/cputime: Handle idle irqtime gracefully Frederic Weisbecker
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

get_cpu_idle/iowait_time_us() may ultimately fail if generic vtime
accounting is enabled.

The ad-hoc replacement solution in cpufreq is to compute jiffies minus
the whole busy cputime. Although the intention is to provide a coherent
low resolution estimation of the idle and iowait time, the
implementation is buggy because jiffies doesn't start at 0.

Instead, enhance get_cpu_[idle|iowait]_time_us() to support generic
vtime accounting.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 drivers/cpufreq/cpufreq.c   | 29 +----------------------------
 include/linux/kernel_stat.h |  3 +++
 include/linux/tick.h        |  4 ----
 kernel/sched/cputime.c      | 14 ++++++++++----
 4 files changed, 14 insertions(+), 36 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 4472bb1ec83c..ecb9634cd06b 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -130,38 +130,11 @@ struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy)
 }
 EXPORT_SYMBOL_GPL(get_governor_parent_kobj);
 
-static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
-{
-	struct kernel_cpustat kcpustat;
-	u64 cur_wall_time;
-	u64 idle_time;
-	u64 busy_time;
-
-	cur_wall_time = jiffies64_to_nsecs(get_jiffies_64());
-
-	kcpustat_cpu_fetch(&kcpustat, cpu);
-
-	busy_time = kcpustat.cpustat[CPUTIME_USER];
-	busy_time += kcpustat.cpustat[CPUTIME_SYSTEM];
-	busy_time += kcpustat.cpustat[CPUTIME_IRQ];
-	busy_time += kcpustat.cpustat[CPUTIME_SOFTIRQ];
-	busy_time += kcpustat.cpustat[CPUTIME_STEAL];
-	busy_time += kcpustat.cpustat[CPUTIME_NICE];
-
-	idle_time = cur_wall_time - busy_time;
-	if (wall)
-		*wall = div_u64(cur_wall_time, NSEC_PER_USEC);
-
-	return div_u64(idle_time, NSEC_PER_USEC);
-}
-
 u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy)
 {
 	u64 idle_time = get_cpu_idle_time_us(cpu, io_busy ? wall : NULL);
 
-	if (idle_time == -1ULL)
-		return get_cpu_idle_time_jiffy(cpu, wall);
-	else if (!io_busy)
+	if (!io_busy)
 		idle_time += get_cpu_iowait_time_us(cpu, wall);
 
 	return idle_time;
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index e1efd26e56f0..e59916477075 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -133,6 +133,9 @@ static inline bool kcpustat_idle_dyntick(void)
 }
 #endif /* CONFIG_NO_HZ_COMMON */
 
+extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
+extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+
 /* Fetch cputime values when vtime is disabled on a CPU */
 static inline u64 kcpustat_field_default(enum cpu_usage_stat usage, int cpu)
 {
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ac76ae9fa36d..1296cba67bee 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -138,8 +138,6 @@ extern bool tick_nohz_idle_got_tick(void);
 extern ktime_t tick_nohz_get_next_hrtimer(void);
 extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
 extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
-extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
-extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 #else /* !CONFIG_NO_HZ_COMMON */
 #define tick_nohz_enabled (0)
 static inline int tick_nohz_tick_stopped(void) { return 0; }
@@ -160,8 +158,6 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
 	*delta_next = TICK_NSEC;
 	return *delta_next;
 }
-static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
-static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
 #endif /* !CONFIG_NO_HZ_COMMON */
 
 /*
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 9906abe5d7bc..f0620b429698 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -511,6 +511,13 @@ u64 kcpustat_field_iowait(int cpu)
 				      nr_iowait_cpu(cpu), ktime_get());
 }
 EXPORT_SYMBOL_GPL(kcpustat_field_iowait);
+#else
+static u64 kcpustat_field_dyntick(int cpu, enum cpu_usage_stat idx,
+				  bool compute_delta, ktime_t now)
+{
+	return kcpustat_cpu(cpu).cpustat[idx];
+}
+#endif /* CONFIG_NO_HZ_COMMON */
 
 static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
 				 bool compute_delta, u64 *last_update_time)
@@ -519,7 +526,7 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
 	u64 res;
 
 	if (vtime_generic_enabled_cpu(cpu))
-		return -1;
+		res = kcpustat_field(idx, cpu);
 	else
 		res = kcpustat_field_dyntick(cpu, idx, compute_delta, now);
 
@@ -544,7 +551,7 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
  * This time is measured via accounting rather than sampling,
  * and is as accurate as ktime_get() is.
  *
- * Return: -1 if generic vtime is enabled, else total idle time of the @cpu
+ * Return: total idle time of the @cpu
  */
 u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 {
@@ -568,7 +575,7 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
  * This time is measured via accounting rather than sampling,
  * and is as accurate as ktime_get() is.
  *
- * Return: -1 if generic vtime is enabled, else total iowait time of @cpu
+ * Return: total iowait time of @cpu
  */
 u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 {
@@ -576,7 +583,6 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 				     nr_iowait_cpu(cpu), last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
-#endif /* CONFIG_NO_HZ_COMMON */
 
 /*
  * Use precise platform statistics if available:
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 14/15] sched/cputime: Handle idle irqtime gracefully
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (12 preceding siblings ...)
  2026-01-16 14:52 ` [PATCH 13/15] sched/cputime: Consolidate get_cpu_[idle|iowait]_time_us() Frederic Weisbecker
@ 2026-01-16 14:52 ` Frederic Weisbecker
  2026-01-16 14:52 ` [PATCH 15/15] sched/cputime: Handle dyntick-idle steal time correctly Frederic Weisbecker
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

The dyntick-idle cputime accounting always assumes that IRQ time
accounting is enabled and consequently stops elapsing the idle time
during dyntick-idle IRQs.

This doesn't mix well with disabled IRQ time accounting because idle
IRQs then become a cputime blind spot. Moreover, this feature is
disabled in most configurations, in which case the overhead of pausing
dyntick-idle accounting while in idle IRQs could be avoided.

Fix this by pausing dyntick-idle accounting during idle IRQs only if
either native vtime (which does IRQ time accounting) or generic IRQ
time accounting is enabled.

Also make sure that the accumulated IRQ time is not accidentally
subtracted from later accounting.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/sched/cputime.c | 24 +++++++++++++++++++++---
 kernel/sched/sched.h   |  1 +
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f0620b429698..3dadfaa92b27 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -45,7 +45,8 @@ static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
 	u64_stats_update_begin(&irqtime->sync);
 	cpustat[idx] += delta;
 	irqtime->total += delta;
-	irqtime->tick_delta += delta;
+	if (!irqtime->idle_dyntick)
+		irqtime->tick_delta += delta;
 	u64_stats_update_end(&irqtime->sync);
 }
 
@@ -80,6 +81,16 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
 		irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
 }
 
+static inline void irqtime_dyntick_start(void)
+{
+	__this_cpu_write(cpu_irqtime.idle_dyntick, true);
+}
+
+static inline void irqtime_dyntick_stop(void)
+{
+	__this_cpu_write(cpu_irqtime.idle_dyntick, false);
+}
+
 static u64 irqtime_tick_accounted(u64 maxtime)
 {
 	struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
@@ -93,6 +104,9 @@ static u64 irqtime_tick_accounted(u64 maxtime)
 
 #else /* !CONFIG_IRQ_TIME_ACCOUNTING: */
 
+static inline void irqtime_dyntick_start(void) { }
+static inline void irqtime_dyntick_stop(void) { }
+
 static u64 irqtime_tick_accounted(u64 dummy)
 {
 	return 0;
@@ -443,6 +457,7 @@ void kcpustat_dyntick_stop(ktime_t now)
 		WARN_ON_ONCE(!kc->idle_dyntick);
 		kcpustat_idle_stop(kc, now);
 		kc->idle_dyntick = false;
+		irqtime_dyntick_stop();
 		vtime_dyntick_stop();
 		steal_account_process_time(ULONG_MAX);
 	}
@@ -454,6 +469,7 @@ void kcpustat_dyntick_start(ktime_t now)
 
 	if (!vtime_generic_enabled_this_cpu()) {
 		vtime_dyntick_start();
+		irqtime_dyntick_start();
 		kc->idle_dyntick = true;
 		kcpustat_idle_start(kc, now);
 	}
@@ -463,7 +479,8 @@ void kcpustat_irq_enter(ktime_t now)
 {
 	struct kernel_cpustat *kc = kcpustat_this_cpu;
 
-	if (!vtime_generic_enabled_this_cpu())
+	if (!vtime_generic_enabled_this_cpu() &&
+	    (irqtime_enabled() || vtime_accounting_enabled_this_cpu()))
 		kcpustat_idle_stop(kc, now);
 }
 
@@ -471,7 +488,8 @@ void kcpustat_irq_exit(ktime_t now)
 {
 	struct kernel_cpustat *kc = kcpustat_this_cpu;
 
-	if (!vtime_generic_enabled_this_cpu())
+	if (!vtime_generic_enabled_this_cpu() &&
+	    (irqtime_enabled() || vtime_accounting_enabled_this_cpu()))
 		kcpustat_idle_start(kc, now);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5..cf677ff12b10 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3307,6 +3307,7 @@ static inline void sched_core_tick(struct rq *rq) { }
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 
 struct irqtime {
+	bool			idle_dyntick;
 	u64			total;
 	u64			tick_delta;
 	u64			irq_start_time;
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 15/15] sched/cputime: Handle dyntick-idle steal time correctly
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (13 preceding siblings ...)
  2026-01-16 14:52 ` [PATCH 14/15] sched/cputime: Handle idle irqtime gracefully Frederic Weisbecker
@ 2026-01-16 14:52 ` Frederic Weisbecker
  2026-01-16 14:57 ` [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
  2026-01-19 14:53 ` Peter Zijlstra
  16 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:52 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

The dyntick-idle steal time is currently accounted when the tick
restarts, but the stolen idle time is not subtracted from the idle time
that was already accounted. This is to avoid observing the idle time
going backward, as the dyntick-idle cputime accessors can't reliably
know the stolen idle time in advance.

In order to maintain a forward-progressing idle cputime while
subtracting idle steal time from it, keep track of the previously
accounted idle stolen time and subtract it from _later_ idle cputime
accounting.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/kernel_stat.h | 1 +
 kernel/sched/cputime.c      | 9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index e59916477075..a5b5a25c3cc1 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -39,6 +39,7 @@ struct kernel_cpustat {
 	bool		idle_elapse;
 	seqcount_t	idle_sleeptime_seq;
 	ktime_t		idle_entrytime;
+	u64		idle_steal;
 #endif
 	u64		cpustat[NR_STATS];
 };
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3dadfaa92b27..749a6ed4d2fa 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -424,18 +424,25 @@ static void kcpustat_idle_stop(struct kernel_cpustat *kc, ktime_t now)
 {
 	u64 *cpustat = kc->cpustat;
 	ktime_t delta;
+	u64 steal, steal_delta;
 
 	if (!kc->idle_elapse)
 		return;
 
 	delta = ktime_sub(now, kc->idle_entrytime);
+	steal = steal_account_process_time(delta);
 
 	write_seqcount_begin(&kc->idle_sleeptime_seq);
+	steal_delta = min_t(u64, kc->idle_steal, delta);
+	delta -= steal_delta;
+	kc->idle_steal -= steal_delta;
+
 	if (nr_iowait_cpu(smp_processor_id()) > 0)
 		cpustat[CPUTIME_IOWAIT] = ktime_add(cpustat[CPUTIME_IOWAIT], delta);
 	else
 		cpustat[CPUTIME_IDLE] = ktime_add(cpustat[CPUTIME_IDLE], delta);
 
+	kc->idle_steal += steal;
 	kc->idle_entrytime = now;
 	kc->idle_elapse = false;
 	write_seqcount_end(&kc->idle_sleeptime_seq);
@@ -459,7 +466,6 @@ void kcpustat_dyntick_stop(ktime_t now)
 		kc->idle_dyntick = false;
 		irqtime_dyntick_stop();
 		vtime_dyntick_stop();
-		steal_account_process_time(ULONG_MAX);
 	}
 }
 
@@ -507,6 +513,7 @@ static u64 kcpustat_field_dyntick(int cpu, enum cpu_usage_stat idx,
 		if (kc->idle_elapse && compute_delta) {
 			ktime_t delta = ktime_sub(now, kc->idle_entrytime);
 
+			delta -= min_t(u64, kc->idle_steal, (u64)delta);
 			idle = ktime_add(cpustat[idx], delta);
 		} else {
 			idle = cpustat[idx];
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/15] tick/sched: Refactor idle cputime accounting
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (14 preceding siblings ...)
  2026-01-16 14:52 ` [PATCH 15/15] sched/cputime: Handle dyntick-idle steal time correctly Frederic Weisbecker
@ 2026-01-16 14:57 ` Frederic Weisbecker
  2026-01-20 12:42   ` Shrikanth Hegde
  2026-01-19 14:53 ` Peter Zijlstra
  16 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-16 14:57 UTC (permalink / raw)
  To: LKML
  Cc: Rafael J . Wysocki, Boqun Feng, Thomas Gleixner, Steven Rostedt,
	Christophe Leroy (CS GROUP), Kieran Bingham, Ben Segall,
	Michael Ellerman, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Neeraj Upadhyay, Xin Zhao, Madhavan Srinivasan, Mel Gorman,
	Valentin Schneider, Christian Borntraeger, Jan Kiszka,
	linuxppc-dev, Paul E . McKenney, Viresh Kumar, Anna-Maria Behnsen,
	Uladzislau Rezki, Dietmar Eggemann, Heiko Carstens, linux-pm,
	Alexander Gordeev, Sven Schnelle, Vasily Gorbik, Joel Fernandes,
	Nicholas Piggin, linux-s390, Peter Zijlstra

I forgot to mention I haven't yet tested CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
(s390 and powerpc).

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 07/15] cpufreq: ondemand: Simplify idle cputime granularity test
  2026-01-16 14:52 ` [PATCH 07/15] cpufreq: ondemand: Simplify idle cputime granularity test Frederic Weisbecker
@ 2026-01-19  5:37   ` Viresh Kumar
  2026-01-19 12:30   ` Rafael J. Wysocki
  1 sibling, 0 replies; 42+ messages in thread
From: Viresh Kumar @ 2026-01-19  5:37 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
	Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
	Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
	Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
	Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
	Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
	Vincent Guittot, Xin Zhao, linux-pm, linux-s390, linuxppc-dev

On 16-01-26, 15:52, Frederic Weisbecker wrote:
> cpufreq calls get_cpu_idle_time_us() just to know if idle cputime
> accounting has a nanoseconds granularity.
> 
> Use the appropriate indicator instead to make that deduction.
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  drivers/cpufreq/cpufreq_ondemand.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
> index a6ecc203f7b7..2d52ee035702 100644
> --- a/drivers/cpufreq/cpufreq_ondemand.c
> +++ b/drivers/cpufreq/cpufreq_ondemand.c
> @@ -334,17 +334,12 @@ static void od_free(struct policy_dbs_info *policy_dbs)
>  static int od_init(struct dbs_data *dbs_data)
>  {
>  	struct od_dbs_tuners *tuners;
> -	u64 idle_time;
> -	int cpu;
>  
>  	tuners = kzalloc(sizeof(*tuners), GFP_KERNEL);
>  	if (!tuners)
>  		return -ENOMEM;
>  
> -	cpu = get_cpu();
> -	idle_time = get_cpu_idle_time_us(cpu, NULL);
> -	put_cpu();
> -	if (idle_time != -1ULL) {
> +	if (tick_nohz_enabled) {
>  		/* Idle micro accounting is supported. Use finer thresholds */
>  		dbs_data->up_threshold = MICRO_FREQUENCY_UP_THRESHOLD;
>  	} else {

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 07/15] cpufreq: ondemand: Simplify idle cputime granularity test
  2026-01-16 14:52 ` [PATCH 07/15] cpufreq: ondemand: Simplify idle cputime granularity test Frederic Weisbecker
  2026-01-19  5:37   ` Viresh Kumar
@ 2026-01-19 12:30   ` Rafael J. Wysocki
  2026-01-19 22:06     ` Frederic Weisbecker
  1 sibling, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2026-01-19 12:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
	Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
	Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
	Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
	Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
	Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
	Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
	linuxppc-dev

On Fri, Jan 16, 2026 at 3:53 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> cpufreq calls get_cpu_idle_time_us() just to know if idle cputime
> accounting has a nanoseconds granularity.
>
> Use the appropriate indicator instead to make that deduction.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>

or please let me know if you want me to take this patch.

> ---
>  drivers/cpufreq/cpufreq_ondemand.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
> index a6ecc203f7b7..2d52ee035702 100644
> --- a/drivers/cpufreq/cpufreq_ondemand.c
> +++ b/drivers/cpufreq/cpufreq_ondemand.c
> @@ -334,17 +334,12 @@ static void od_free(struct policy_dbs_info *policy_dbs)
>  static int od_init(struct dbs_data *dbs_data)
>  {
>         struct od_dbs_tuners *tuners;
> -       u64 idle_time;
> -       int cpu;
>
>         tuners = kzalloc(sizeof(*tuners), GFP_KERNEL);
>         if (!tuners)
>                 return -ENOMEM;
>
> -       cpu = get_cpu();
> -       idle_time = get_cpu_idle_time_us(cpu, NULL);
> -       put_cpu();
> -       if (idle_time != -1ULL) {
> +       if (tick_nohz_enabled) {
>                 /* Idle micro accounting is supported. Use finer thresholds */
>                 dbs_data->up_threshold = MICRO_FREQUENCY_UP_THRESHOLD;
>         } else {
> --
> 2.51.1
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/15] sched/idle: Handle offlining first in idle loop
  2026-01-16 14:51 ` [PATCH 01/15] sched/idle: Handle offlining first in idle loop Frederic Weisbecker
@ 2026-01-19 12:53   ` Peter Zijlstra
  2026-01-19 21:04     ` Frederic Weisbecker
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2026-01-19 12:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
	Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
	Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
	Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
	Steven Rostedt, Sven Schnelle, Thomas Gleixner, Uladzislau Rezki,
	Valentin Schneider, Vasily Gorbik, Vincent Guittot, Viresh Kumar,
	Xin Zhao, linux-pm, linux-s390, linuxppc-dev

On Fri, Jan 16, 2026 at 03:51:54PM +0100, Frederic Weisbecker wrote:

>  kernel/sched/idle.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index c174afe1dd17..35d79af3286d 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -260,6 +260,12 @@ static void do_idle(void)
>  {
>  	int cpu = smp_processor_id();
>  
> +	if (cpu_is_offline(cpu)) {

Does it make sense to make that: if (unlikely(cpu_is_offline(cpu))) ?

> +		local_irq_disable();

Also, do we want something like:

		WARN_ON_ONCE(need_resched());

?

> +		cpuhp_report_idle_dead();
> +		arch_cpu_idle_dead();
> +	}
> +
>  	/*
>  	 * Check if we need to update blocked load
>  	 */
> @@ -311,11 +317,6 @@ static void do_idle(void)
>  		 */
>  		local_irq_disable();
>  
> -		if (cpu_is_offline(cpu)) {
> -			cpuhp_report_idle_dead();
> -			arch_cpu_idle_dead();
> -		}
> -
>  		arch_cpu_idle_enter();
>  		rcu_nocb_flush_deferred_wakeup();
>  
> -- 
> 2.51.1
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 03/15] sched/cputime: Correctly support generic vtime idle time
  2026-01-16 14:51 ` [PATCH 03/15] sched/cputime: Correctly support generic vtime idle time Frederic Weisbecker
@ 2026-01-19 13:02   ` Peter Zijlstra
  2026-01-19 21:35     ` Frederic Weisbecker
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2026-01-19 13:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
	Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
	Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
	Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
	Steven Rostedt, Sven Schnelle, Thomas Gleixner, Uladzislau Rezki,
	Valentin Schneider, Vasily Gorbik, Vincent Guittot, Viresh Kumar,
	Xin Zhao, linux-pm, linux-s390, linuxppc-dev

On Fri, Jan 16, 2026 at 03:51:56PM +0100, Frederic Weisbecker wrote:

> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 8ddf74e705d3..f1d07a0276a5 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -780,7 +780,7 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
>  	ktime_t now, idle;
>  	unsigned int seq;
>  
> -	if (!tick_nohz_active)
> +	if (!tick_nohz_active || vtime_generic_enabled_cpu(cpu))
>  		return -1;
>  
>  	now = ktime_get();

Is this not broken? IIUC this means that you can no longer use
get_cpu_{idle,iowait}_time_us() the moment you have context tracking
enabled.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/15] tick/sched: Unify idle cputime accounting
  2026-01-16 14:51 ` [PATCH 06/15] tick/sched: Unify idle cputime accounting Frederic Weisbecker
@ 2026-01-19 14:26   ` Peter Zijlstra
  2026-01-19 22:00     ` Frederic Weisbecker
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2026-01-19 14:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
	Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
	Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
	Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
	Steven Rostedt, Sven Schnelle, Thomas Gleixner, Uladzislau Rezki,
	Valentin Schneider, Vasily Gorbik, Vincent Guittot, Viresh Kumar,
	Xin Zhao, linux-pm, linux-s390, linuxppc-dev

On Fri, Jan 16, 2026 at 03:51:59PM +0100, Frederic Weisbecker wrote:

> +#ifdef CONFIG_NO_HZ_COMMON
> +void kcpustat_dyntick_start(void)
> +{
> +	if (!vtime_generic_enabled_this_cpu()) {
> +		vtime_dyntick_start();
> +		__this_cpu_write(kernel_cpustat.idle_dyntick, 1);
> +	}
> +}

Why don't we need to make sure steal time is up-to-date at this point?

> +void kcpustat_dyntick_stop(void)
> +{
> +	if (!vtime_generic_enabled_this_cpu()) {
> +		__this_cpu_write(kernel_cpustat.idle_dyntick, 0);
> +		vtime_dyntick_stop();
> +		steal_account_process_time(ULONG_MAX);
> +	}
> +}
> +#endif /* CONFIG_NO_HZ_COMMON */

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 09/15] tick/sched: Move dyntick-idle cputime accounting to cputime code
  2026-01-16 14:52 ` [PATCH 09/15] tick/sched: Move dyntick-idle cputime accounting to cputime code Frederic Weisbecker
@ 2026-01-19 14:35   ` Peter Zijlstra
  2026-01-19 22:08     ` Frederic Weisbecker
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2026-01-19 14:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
	Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
	Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
	Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
	Steven Rostedt, Sven Schnelle, Thomas Gleixner, Uladzislau Rezki,
	Valentin Schneider, Vasily Gorbik, Vincent Guittot, Viresh Kumar,
	Xin Zhao, linux-pm, linux-s390, linuxppc-dev

On Fri, Jan 16, 2026 at 03:52:02PM +0100, Frederic Weisbecker wrote:

> +static void kcpustat_idle_stop(struct kernel_cpustat *kc, ktime_t now)
>  {
> +	u64 *cpustat = kc->cpustat;
> +	ktime_t delta;
> +
> +	if (!kc->idle_elapse)
> +		return;
> +
> +	delta = ktime_sub(now, kc->idle_entrytime);
> +
> +	write_seqcount_begin(&kc->idle_sleeptime_seq);
> +	if (nr_iowait_cpu(smp_processor_id()) > 0)
> +		cpustat[CPUTIME_IOWAIT] = ktime_add(cpustat[CPUTIME_IOWAIT], delta);
> +	else
> +		cpustat[CPUTIME_IDLE] = ktime_add(cpustat[CPUTIME_IDLE], delta);
> +
> +	kc->idle_entrytime = now;
> +	kc->idle_elapse = false;
> +	write_seqcount_end(&kc->idle_sleeptime_seq);
>  }

I realize this is mostly code movement; but do we really want to
preserve ktime_{sub,add}() and all that?

I mean, we killed that 32bit ktime nonsense ages ago.

> -static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
> -{
> -	u64 *cpustat = kcpustat_this_cpu->cpustat;
> -	ktime_t delta;
> -
> -	if (vtime_generic_enabled_this_cpu())
> -		return;
> -
> -	if (WARN_ON_ONCE(!tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE)))
> -		return;
> -
> -	delta = ktime_sub(now, ts->idle_entrytime);
> -
> -	write_seqcount_begin(&ts->idle_sleeptime_seq);
> -	if (nr_iowait_cpu(smp_processor_id()) > 0)
> -		cpustat[CPUTIME_IOWAIT] = ktime_add(cpustat[CPUTIME_IOWAIT], delta);
> -	else
> -		cpustat[CPUTIME_IDLE] = ktime_add(cpustat[CPUTIME_IDLE], delta);
> -
> -	ts->idle_entrytime = now;
> -	tick_sched_flag_clear(ts, TS_FLAG_IDLE_ACTIVE);
> -	write_seqcount_end(&ts->idle_sleeptime_seq);
> -
> -	sched_clock_idle_wakeup_event();
> -}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/15] tick/sched: Refactor idle cputime accounting
  2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
                   ` (15 preceding siblings ...)
  2026-01-16 14:57 ` [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
@ 2026-01-19 14:53 ` Peter Zijlstra
  2026-01-19 22:12   ` Frederic Weisbecker
  16 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2026-01-19 14:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Rafael J . Wysocki, Boqun Feng, Thomas Gleixner,
	Steven Rostedt, Christophe Leroy (CS GROUP), Kieran Bingham,
	Ben Segall, Michael Ellerman, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Neeraj Upadhyay, Xin Zhao, Madhavan Srinivasan,
	Mel Gorman, Valentin Schneider, Christian Borntraeger, Jan Kiszka,
	linuxppc-dev, Paul E . McKenney, Viresh Kumar, Anna-Maria Behnsen,
	Uladzislau Rezki, Dietmar Eggemann, Heiko Carstens, linux-pm,
	Alexander Gordeev, Sven Schnelle, Vasily Gorbik, Joel Fernandes,
	Nicholas Piggin, linux-s390

On Fri, Jan 16, 2026 at 03:51:53PM +0100, Frederic Weisbecker wrote:
>  kernel/sched/cputime.c             | 302 +++++++++++++++++++++++++++++++------

My editor feels strongly about the below; with that it still has one
complaint about paravirt_steal_clock() which does not have a proper
declaration.


diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7ff8dbec7ee3..248232fa6e27 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -2,6 +2,7 @@
 /*
  * Simple CPU accounting cgroup controller
  */
+#include <linux/sched/clock.h>
 #include <linux/sched/cputime.h>
 #include <linux/tsacct_kern.h>
 #include "sched.h"

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/15] sched/idle: Handle offlining first in idle loop
  2026-01-19 12:53   ` Peter Zijlstra
@ 2026-01-19 21:04     ` Frederic Weisbecker
  2026-01-20  4:26       ` K Prateek Nayak
  0 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-19 21:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
	Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
	Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
	Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
	Steven Rostedt, Sven Schnelle, Thomas Gleixner, Uladzislau Rezki,
	Valentin Schneider, Vasily Gorbik, Vincent Guittot, Viresh Kumar,
	Xin Zhao, linux-pm, linux-s390, linuxppc-dev

Le Mon, Jan 19, 2026 at 01:53:47PM +0100, Peter Zijlstra a écrit :
> On Fri, Jan 16, 2026 at 03:51:54PM +0100, Frederic Weisbecker wrote:
> 
> >  kernel/sched/idle.c | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index c174afe1dd17..35d79af3286d 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -260,6 +260,12 @@ static void do_idle(void)
> >  {
> >  	int cpu = smp_processor_id();
> >  
> > +	if (cpu_is_offline(cpu)) {
> 
> Does it make sense to make that: if (unlikely(cpu_is_offline(cpu))) ?

Yes indeed!

> 
> > +		local_irq_disable();
> 
> Also, do we want something like:
> 
> 		WARN_ON_ONCE(need_resched());
> 
> ?

Definitely.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 03/15] sched/cputime: Correctly support generic vtime idle time
  2026-01-19 13:02   ` Peter Zijlstra
@ 2026-01-19 21:35     ` Frederic Weisbecker
  0 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-19 21:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
	Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
	Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
	Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
	Steven Rostedt, Sven Schnelle, Thomas Gleixner, Uladzislau Rezki,
	Valentin Schneider, Vasily Gorbik, Vincent Guittot, Viresh Kumar,
	Xin Zhao, linux-pm, linux-s390, linuxppc-dev

Le Mon, Jan 19, 2026 at 02:02:22PM +0100, Peter Zijlstra a écrit :
> On Fri, Jan 16, 2026 at 03:51:56PM +0100, Frederic Weisbecker wrote:
> 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 8ddf74e705d3..f1d07a0276a5 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -780,7 +780,7 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
> >  	ktime_t now, idle;
> >  	unsigned int seq;
> >  
> > -	if (!tick_nohz_active)
> > +	if (!tick_nohz_active || vtime_generic_enabled_cpu(cpu))
> >  		return -1;
> >  
> >  	now = ktime_get();
> 
> Is this not broken? IIUC this means that you can no longer use
> get_cpu_{idle,iowait}_time_us() the moment you have context tracking
> enabled.

It is supported again in patch 13/15. And it doesn't really break
bisection in the meantime because the sole user is cpufreq, and cpufreq
shouldn't be relevant with nohz_full.

Ok, a few subsystems rely on the resulting cpufreq API get_cpu_idle_time():

- the legacy drivers/macintosh/rack-meter.c
- drivers/scsi/lpfc/lpfc_init.c

But cpufreq provides a low-resolution version in the worst case for nohz_full
(again until 13/15).

Hmm, but you're right, this is confusing. I think I should be able to fix that
in this patch.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/15] tick/sched: Unify idle cputime accounting
  2026-01-19 14:26   ` Peter Zijlstra
@ 2026-01-19 22:00     ` Frederic Weisbecker
  0 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-19 22:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
	Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
	Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
	Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
	Steven Rostedt, Sven Schnelle, Thomas Gleixner, Uladzislau Rezki,
	Valentin Schneider, Vasily Gorbik, Vincent Guittot, Viresh Kumar,
	Xin Zhao, linux-pm, linux-s390, linuxppc-dev

Le Mon, Jan 19, 2026 at 03:26:07PM +0100, Peter Zijlstra a écrit :
> On Fri, Jan 16, 2026 at 03:51:59PM +0100, Frederic Weisbecker wrote:
> 
> > +#ifdef CONFIG_NO_HZ_COMMON
> > +void kcpustat_dyntick_start(void)
> > +{
> > +	if (!vtime_generic_enabled_this_cpu()) {
> > +		vtime_dyntick_start();
> > +		__this_cpu_write(kernel_cpustat.idle_dyntick, 1);
> > +	}
> > +}
> 
> Why don't we need to make sure steal time is up-to-date at this point?

Yes, there could be steal time since the last tick. It will be included
and accounted in kcpustat_dyntick_stop() and not subtracted from system
or idle cputime (but it should be!). This wrong behaviour is the same as
the current upstream behaviour, so there is no known regression.

But check the last patch of the series that tries to fix that:

    sched/cputime: Handle dyntick-idle steal time correctly

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 07/15] cpufreq: ondemand: Simplify idle cputime granularity test
  2026-01-19 12:30   ` Rafael J. Wysocki
@ 2026-01-19 22:06     ` Frederic Weisbecker
  2026-01-20 12:32       ` Rafael J. Wysocki
  0 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-19 22:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: LKML, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Anna-Maria Behnsen, Ben Segall, Boqun Feng, Christian Borntraeger,
	Dietmar Eggemann, Heiko Carstens, Ingo Molnar, Jan Kiszka,
	Joel Fernandes, Juri Lelli, Kieran Bingham, Madhavan Srinivasan,
	Mel Gorman, Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

Le Mon, Jan 19, 2026 at 01:30:07PM +0100, Rafael J. Wysocki a écrit :
> On Fri, Jan 16, 2026 at 3:53 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > cpufreq calls get_cpu_idle_time_us() just to know if idle cputime
> > accounting has a nanoseconds granularity.
> >
> > Use the appropriate indicator instead to make that deduction.
> >
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> 
> Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
> 
> or please let me know if you want me to take this patch.

The patch is standalone but the rest of the patchset depends on it.
Now I don't target this patchset for v6.20-rc1.

So if you manage to sneak this patch in for v6.20-rc1, it works because
I'll rebase on -rc1. Otherwise I'll need to keep it to avoid breaking
some code assumptions.

What do you think?

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 09/15] tick/sched: Move dyntick-idle cputime accounting to cputime code
  2026-01-19 14:35   ` Peter Zijlstra
@ 2026-01-19 22:08     ` Frederic Weisbecker
  0 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-19 22:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
	Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
	Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
	Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
	Steven Rostedt, Sven Schnelle, Thomas Gleixner, Uladzislau Rezki,
	Valentin Schneider, Vasily Gorbik, Vincent Guittot, Viresh Kumar,
	Xin Zhao, linux-pm, linux-s390, linuxppc-dev

Le Mon, Jan 19, 2026 at 03:35:52PM +0100, Peter Zijlstra a écrit :
> On Fri, Jan 16, 2026 at 03:52:02PM +0100, Frederic Weisbecker wrote:
> 
> > +static void kcpustat_idle_stop(struct kernel_cpustat *kc, ktime_t now)
> >  {
> > +	u64 *cpustat = kc->cpustat;
> > +	ktime_t delta;
> > +
> > +	if (!kc->idle_elapse)
> > +		return;
> > +
> > +	delta = ktime_sub(now, kc->idle_entrytime);
> > +
> > +	write_seqcount_begin(&kc->idle_sleeptime_seq);
> > +	if (nr_iowait_cpu(smp_processor_id()) > 0)
> > +		cpustat[CPUTIME_IOWAIT] = ktime_add(cpustat[CPUTIME_IOWAIT], delta);
> > +	else
> > +		cpustat[CPUTIME_IDLE] = ktime_add(cpustat[CPUTIME_IDLE], delta);
> > +
> > +	kc->idle_entrytime = now;
> > +	kc->idle_elapse = false;
> > +	write_seqcount_end(&kc->idle_sleeptime_seq);
> >  }
> 
> I realize this is mostly code movement; but do we really want to
> preserve ktime_{sub,add}() and all that?
> 
> I mean, we killed that 32bit ktime nonsense ages ago.

Good point, this should just be u64.

Thanks!

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/15] tick/sched: Refactor idle cputime accounting
  2026-01-19 14:53 ` Peter Zijlstra
@ 2026-01-19 22:12   ` Frederic Weisbecker
  0 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-19 22:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Rafael J . Wysocki, Boqun Feng, Thomas Gleixner,
	Steven Rostedt, Christophe Leroy (CS GROUP), Kieran Bingham,
	Ben Segall, Michael Ellerman, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Neeraj Upadhyay, Xin Zhao, Madhavan Srinivasan,
	Mel Gorman, Valentin Schneider, Christian Borntraeger, Jan Kiszka,
	linuxppc-dev, Paul E . McKenney, Viresh Kumar, Anna-Maria Behnsen,
	Uladzislau Rezki, Dietmar Eggemann, Heiko Carstens, linux-pm,
	Alexander Gordeev, Sven Schnelle, Vasily Gorbik, Joel Fernandes,
	Nicholas Piggin, linux-s390

Le Mon, Jan 19, 2026 at 03:53:30PM +0100, Peter Zijlstra a écrit :
> On Fri, Jan 16, 2026 at 03:51:53PM +0100, Frederic Weisbecker wrote:
> >  kernel/sched/cputime.c             | 302 +++++++++++++++++++++++++++++++------
> 
> My editor feels strongly about the below; with that it still has one
> complaint about paravirt_steal_clock() which does not have a proper
> declaration.

I guess it happens to be somehow included via the <linux/sched*.h> wave.

> 
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 7ff8dbec7ee3..248232fa6e27 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -2,6 +2,7 @@
>  /*
>   * Simple CPU accounting cgroup controller
>   */
> +#include <linux/sched/clock.h>
>  #include <linux/sched/cputime.h>
>  #include <linux/tsacct_kern.h>
>  #include "sched.h"

Ok I'll include that.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/15] sched/idle: Handle offlining first in idle loop
  2026-01-19 21:04     ` Frederic Weisbecker
@ 2026-01-20  4:26       ` K Prateek Nayak
  2026-01-20 14:52         ` Frederic Weisbecker
  0 siblings, 1 reply; 42+ messages in thread
From: K Prateek Nayak @ 2026-01-20  4:26 UTC (permalink / raw)
  To: Frederic Weisbecker, Peter Zijlstra
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
	Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
	Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
	Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
	Steven Rostedt, Sven Schnelle, Thomas Gleixner, Uladzislau Rezki,
	Valentin Schneider, Vasily Gorbik, Vincent Guittot, Viresh Kumar,
	Xin Zhao, linux-pm, linux-s390, linuxppc-dev

Hello Frederic, Peter,

On 1/20/2026 2:34 AM, Frederic Weisbecker wrote:
> Le Mon, Jan 19, 2026 at 01:53:47PM +0100, Peter Zijlstra a écrit :
>> On Fri, Jan 16, 2026 at 03:51:54PM +0100, Frederic Weisbecker wrote:
>>
>>>  kernel/sched/idle.c | 11 ++++++-----
>>>  1 file changed, 6 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>>> index c174afe1dd17..35d79af3286d 100644
>>> --- a/kernel/sched/idle.c
>>> +++ b/kernel/sched/idle.c
>>> @@ -260,6 +260,12 @@ static void do_idle(void)
>>>  {
>>>  	int cpu = smp_processor_id();
>>>  
>>> +	if (cpu_is_offline(cpu)) {
>>
>> Does it make sense to make that: if (unlikely(cpu_is_offline(cpu))) ?
> 
> Yes indeed!

nit, but don't we already inherit it from:

#define cpu_is_offline(cpu)     unlikely(!cpu_online(cpu))

so it will end up being annotated with unlikely() anyway, no?

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 07/15] cpufreq: ondemand: Simplify idle cputime granularity test
  2026-01-19 22:06     ` Frederic Weisbecker
@ 2026-01-20 12:32       ` Rafael J. Wysocki
  2026-01-20 14:28         ` Frederic Weisbecker
  0 siblings, 1 reply; 42+ messages in thread
From: Rafael J. Wysocki @ 2026-01-20 12:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Rafael J. Wysocki, LKML, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
	Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
	Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
	Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
	Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
	Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
	Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
	linuxppc-dev

On Mon, Jan 19, 2026 at 11:07 PM Frederic Weisbecker
<frederic@kernel.org> wrote:
>
> Le Mon, Jan 19, 2026 at 01:30:07PM +0100, Rafael J. Wysocki a écrit :
> > On Fri, Jan 16, 2026 at 3:53 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > >
> > > cpufreq calls get_cpu_idle_time_us() just to know if idle cputime
> > > accounting has a nanoseconds granularity.
> > >
> > > Use the appropriate indicator instead to make that deduction.
> > >
> > > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> >
> > Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
> >
> > or please let me know if you want me to take this patch.
>
> The patch is standalone but the rest of the patchset depends on it.
> Now I don't target this patchset for v6.20-rc1.
>
> So if you manage to sneak this patch in for v6.20-rc1, it works because
> I'll rebase on -rc1. Otherwise I'll need to keep it to avoid breaking
> some code assumptions.
>
> What do you think?

It can go into -rc1.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/15] tick/sched: Refactor idle cputime accounting
  2026-01-16 14:57 ` [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
@ 2026-01-20 12:42   ` Shrikanth Hegde
  2026-01-21 16:55     ` Frederic Weisbecker
  0 siblings, 1 reply; 42+ messages in thread
From: Shrikanth Hegde @ 2026-01-20 12:42 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Rafael J . Wysocki, Boqun Feng, Thomas Gleixner, Steven Rostedt,
	Christophe Leroy (CS GROUP), Kieran Bingham, Ben Segall,
	Michael Ellerman, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Neeraj Upadhyay, Xin Zhao, Madhavan Srinivasan, Mel Gorman,
	Valentin Schneider, Christian Borntraeger, Jan Kiszka,
	linuxppc-dev, Paul E . McKenney, Viresh Kumar, Anna-Maria Behnsen,
	Uladzislau Rezki, Dietmar Eggemann, Heiko Carstens, linux-pm,
	Alexander Gordeev, Sven Schnelle, Vasily Gorbik, Joel Fernandes,
	Nicholas Piggin, linux-s390, Peter Zijlstra


Hi Frederic.

On 1/16/26 8:27 PM, Frederic Weisbecker wrote:
> I forgot to mention I haven't yet tested CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> (s390 and powerpc).
> 
> Thanks.


tl;dr

I ran this on powerNV(Non virtualized) with 144 CPUs with below config. (default ones)
Patch *breaks* the cpu idle stats most of the time. idle values are wrong.


Detailed info:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In config i have this:
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
# CONFIG_BSD_PROCESS_ACCT is not set

+++++++++

When system is fully idle, i see this.

06:44:26 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:44:27 AM  all    0.01    0.00    0.01    0.00   57.20    0.00    0.00    0.00    0.00   42.79
06:44:28 AM  all    0.02    0.00    0.03    0.00   55.73    0.00    0.00    0.00    0.00   44.22
06:44:29 AM  all    0.01    0.00    0.00    0.00   56.23    0.00    0.00    0.00    0.00   43.77

- Seeing 50%+ in irq time, which is clearly wrong.

+++++++++
When running stress-ng --cpu=72 (expectation is 50% idle time)
06:48:12 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:48:13 AM  all   49.98    0.00    0.01    0.00   15.81    0.00    0.00    0.00    0.00   34.20
06:48:14 AM  all   49.93    0.00    0.00    0.00   15.15    0.00    0.00    0.00    0.00   34.91
06:48:15 AM  all   49.99    0.00    0.01    0.00   15.29    0.00    0.00    0.00    0.00   34.72

- Wrong values again. 50% is expected idle time.

+++++++++
system is idle again.
06:48:46 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:48:47 AM  all    0.00    0.00    0.00    0.00   63.93    0.00    0.00    0.00    0.00   36.07
06:48:48 AM  all    0.02    0.00    0.00    0.00   63.78    0.01    0.00    0.00    0.00   36.18
06:48:49 AM  all    0.00    0.00    0.00    0.00   63.77    0.00    0.00    0.00    0.00   36.23

- Wrong values again. irq increased further.

+++++++++

I have seen the below warnings too.
WARNING: kernel/time/tick-sched.c:1353 at tick_nohz_idle_exit
[    T0] WARNING: kernel/time/tick-sched.c:1353 at tick_nohz_idle_exit+0x148/0x150, CPU#4: swapper/4/0
[    T0] Modules linked in: vmx_crypto gf128mul
[    T0] CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Tainted: G        W           6.19.0-rc5-00683-gbe7e8f3d5116 #61 PREEMPT(full)
[    T0] Tainted: [W]=WARN
[    T0] Hardware name: 0000000000000000 POWER9 0x4e1202 opal:v7.1 PowerNV
[    T0] NIP [c0000000002c8210] tick_nohz_idle_exit+0x148/0x150
[    T0] LR [c00000000022f10c] do_idle+0x1dc/0x328


WARNING: kernel/time/tick-sched.c:1274 at tick_nohz_get_sleep_length
     T0] NIP [c0000000002c7fc0] tick_nohz_get_sleep_length+0x108/0x110
[    T0] LR [c000000000ca1548] menu_select+0x3c0/0x7b4
[    T0] Call Trace:
[    T0] [c000000003197e10] [c000000003197e50] 0xc000000003197e50 (unreliable)
[    T0] [c000000003197e50] [c000000000ca1548] menu_select+0x3c0/0x7b4
[    T0] [c000000003197ed0] [c000000000c9f120] cpuidle_select+0x34/0x48
[    T0] [c000000003197ef0] [c00000000022f184] do_idle+0x254/0x328


+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I went back to baseline to confirm the original behaviour.
(d613f96096e4) Merge timers/vdso into tip/master

07:02:17 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
07:02:18 AM  all    0.01    0.00    0.01    0.01    1.19    0.00    0.00    0.00    0.00   98.77
07:02:19 AM  all    0.01    0.00    0.01    0.00    0.84    0.00    0.00    0.00    0.00   99.14
07:02:20 AM  all    0.00    0.00    0.01    0.00    0.99    0.00    0.00    0.00    0.00   99.00
07:02:21 AM  all    0.01    0.00    0.00    0.00    0.83    0.00    0.00    0.00    0.00   99.16

Which is the working as expected.



PS: Initial data. I haven't gone through the series yet.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 07/15] cpufreq: ondemand: Simplify idle cputime granularity test
  2026-01-20 12:32       ` Rafael J. Wysocki
@ 2026-01-20 14:28         ` Frederic Weisbecker
  0 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-20 14:28 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: LKML, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Anna-Maria Behnsen, Ben Segall, Boqun Feng, Christian Borntraeger,
	Dietmar Eggemann, Heiko Carstens, Ingo Molnar, Jan Kiszka,
	Joel Fernandes, Juri Lelli, Kieran Bingham, Madhavan Srinivasan,
	Mel Gorman, Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

Le Tue, Jan 20, 2026 at 01:32:50PM +0100, Rafael J. Wysocki a écrit :
> On Mon, Jan 19, 2026 at 11:07 PM Frederic Weisbecker
> <frederic@kernel.org> wrote:
> >
> > Le Mon, Jan 19, 2026 at 01:30:07PM +0100, Rafael J. Wysocki a écrit :
> > > On Fri, Jan 16, 2026 at 3:53 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > >
> > > > cpufreq calls get_cpu_idle_time_us() just to know if idle cputime
> > > > accounting has a nanoseconds granularity.
> > > >
> > > > Use the appropriate indicator instead to make that deduction.
> > > >
> > > > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > >
> > > Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
> > >
> > > or please let me know if you want me to take this patch.
> >
> > The patch is standalone but the rest of the patchset depends on it.
> > Now I don't target this patchset for v6.20-rc1.
> >
> > So if you manage to sneak this patch in for v6.20-rc1, it works because
> > I'll rebase on -rc1. Otherwise I'll need to keep it to avoid breaking
> > some code assumptions.
> >
> > What do you think?
> 
> It can go into -rc1.

Very nice, thanks for taking it!


-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/15] sched/idle: Handle offlining first in idle loop
  2026-01-20  4:26       ` K Prateek Nayak
@ 2026-01-20 14:52         ` Frederic Weisbecker
  0 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-20 14:52 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, LKML, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
	Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
	Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
	linuxppc-dev

Le Tue, Jan 20, 2026 at 09:56:12AM +0530, K Prateek Nayak a écrit :
> Hello Frederic, Peter,
> 
> On 1/20/2026 2:34 AM, Frederic Weisbecker wrote:
> > Le Mon, Jan 19, 2026 at 01:53:47PM +0100, Peter Zijlstra a écrit :
> >> On Fri, Jan 16, 2026 at 03:51:54PM +0100, Frederic Weisbecker wrote:
> >>
> >>>  kernel/sched/idle.c | 11 ++++++-----
> >>>  1 file changed, 6 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >>> index c174afe1dd17..35d79af3286d 100644
> >>> --- a/kernel/sched/idle.c
> >>> +++ b/kernel/sched/idle.c
> >>> @@ -260,6 +260,12 @@ static void do_idle(void)
> >>>  {
> >>>  	int cpu = smp_processor_id();
> >>>  
> >>> +	if (cpu_is_offline(cpu)) {
> >>
> >> Does it make sense to make that: if (unlikely(cpu_is_offline(cpu))) ?
> > 
> > Yes indeed!
> 
> nit. but don't we inherit it from:
> 
> #define cpu_is_offline(cpu)     unlikely(!cpu_online(cpu))
> 
> so it will end up being annotated with unlikely() no?

Ah right!

> 
> -- 
> Thanks and Regards,
> Prateek
> 

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 05/15] s390/time: Prepare to stop elapsing in dynticks-idle
  2026-01-16 14:51 ` [PATCH 05/15] s390/time: " Frederic Weisbecker
@ 2026-01-21 12:17   ` Heiko Carstens
  2026-01-21 18:04     ` Frederic Weisbecker
  0 siblings, 1 reply; 42+ messages in thread
From: Heiko Carstens @ 2026-01-21 12:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Ingo Molnar, Jan Kiszka,
	Joel Fernandes, Juri Lelli, Kieran Bingham, Madhavan Srinivasan,
	Mel Gorman, Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

On Fri, Jan 16, 2026 at 03:51:58PM +0100, Frederic Weisbecker wrote:
> diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c
> index 39cb8d0ae348..54bb932184dd 100644
> --- a/arch/s390/kernel/idle.c
> +++ b/arch/s390/kernel/idle.c
> @@ -35,6 +35,12 @@ void account_idle_time_irq(void)
>  			this_cpu_add(mt_cycles[i], cycles_new[i] - idle->mt_cycles_enter[i]);
>  	}
>  
> +	WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1);
> +
> +	/* Dyntick idle time accounted by nohz/scheduler */
> +	if (idle->idle_dyntick)
> +		return;
> +
>  	idle_time = lc->int_clock - idle->clock_idle_enter;
>  
>  	lc->steal_timer += idle->clock_idle_enter - lc->last_update_clock;
> @@ -45,7 +51,6 @@ void account_idle_time_irq(void)
>  
>  	/* Account time spent with enabled wait psw loaded as idle time. */
>  	WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time);
> -	WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1);
>  	account_idle_time(cputime_to_nsecs(idle_time));
>  }

This breaks idle time reporting (aka enabled wait psw time) via the per-cpu
sysfs files (see show_idle_time()). That is: the second WRITE_ONCE() should
also go above the early return statement; but of course this leads to other
dependencies...

Not sure what to do with this. I thought about removing those sysfs files
already in the past, since they are of very limited use; and most likely
nothing in user space would miss them.

Anyway, you need to integrate the trivial patch below, so everything compiles
for s390. It also _seems_ to work.

Guess I need to spend some more time on accounting and see what it would take
to convert to VIRT_CPU_ACCOUNTING_GEN, while keeping the current precision and
functionality.

diff --git a/arch/s390/include/asm/idle.h b/arch/s390/include/asm/idle.h
index 2770c4f761e1..285b3da318d6 100644
--- a/arch/s390/include/asm/idle.h
+++ b/arch/s390/include/asm/idle.h
@@ -8,6 +8,7 @@
 #ifndef _S390_IDLE_H
 #define _S390_IDLE_H
 
+#include <linux/percpu-defs.h>
 #include <linux/types.h>
 #include <linux/device.h>
 
@@ -20,6 +21,8 @@ struct s390_idle_data {
 	unsigned long	mt_cycles_enter[8];
 };
 
+DECLARE_PER_CPU(struct s390_idle_data, s390_idle);
+
 extern struct device_attribute dev_attr_idle_count;
 extern struct device_attribute dev_attr_idle_time_us;
 
diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c
index 54bb932184dd..e3fe64e7adbe 100644
--- a/arch/s390/kernel/idle.c
+++ b/arch/s390/kernel/idle.c
@@ -19,7 +19,7 @@
 #include <asm/smp.h>
 #include "entry.h"
 
-static DEFINE_PER_CPU(struct s390_idle_data, s390_idle);
+DEFINE_PER_CPU(struct s390_idle_data, s390_idle);
 
 void account_idle_time_irq(void)
 {

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/15] tick/sched: Refactor idle cputime accounting
  2026-01-20 12:42   ` Shrikanth Hegde
@ 2026-01-21 16:55     ` Frederic Weisbecker
  0 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-21 16:55 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: LKML, Rafael J . Wysocki, Boqun Feng, Thomas Gleixner,
	Steven Rostedt, Christophe Leroy (CS GROUP), Kieran Bingham,
	Ben Segall, Michael Ellerman, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Neeraj Upadhyay, Xin Zhao, Madhavan Srinivasan,
	Mel Gorman, Valentin Schneider, Christian Borntraeger, Jan Kiszka,
	linuxppc-dev, Paul E . McKenney, Viresh Kumar, Anna-Maria Behnsen,
	Uladzislau Rezki, Dietmar Eggemann, Heiko Carstens, linux-pm,
	Alexander Gordeev, Sven Schnelle, Vasily Gorbik, Joel Fernandes,
	Nicholas Piggin, linux-s390, Peter Zijlstra

Le Tue, Jan 20, 2026 at 06:12:08PM +0530, Shrikanth Hegde a écrit :
> 
> Hi Frederic.
> 
> On 1/16/26 8:27 PM, Frederic Weisbecker wrote:
> > I forgot to mention I haven't yet tested CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> > (s390 and powerpc).
> > 
> > Thanks.
> 
> 
> tl;dr
> 
> I ran this on powerNV(Non virtualized) with 144 CPUs with below config. (default ones)
> Patch *breaks* the cpu idle stats most of the time. idle values are wrong.

Right I somehow lost the TS_FLAG_INIDLE setting in tick_nohz_idle_enter(),
which ruins the whole thing.

You probably think I should have detected that with light testing and you're
right. Not checking dmesg was a bit sloppy from my end...

I'm fixing that and will send a v2 soonish.

Thanks a lot for testing!

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 05/15] s390/time: Prepare to stop elapsing in dynticks-idle
  2026-01-21 12:17   ` Heiko Carstens
@ 2026-01-21 18:04     ` Frederic Weisbecker
  2026-01-22 14:40       ` Heiko Carstens
  0 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-21 18:04 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Ingo Molnar, Jan Kiszka,
	Joel Fernandes, Juri Lelli, Kieran Bingham, Madhavan Srinivasan,
	Mel Gorman, Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

Le Wed, Jan 21, 2026 at 01:17:48PM +0100, Heiko Carstens a écrit :
> On Fri, Jan 16, 2026 at 03:51:58PM +0100, Frederic Weisbecker wrote:
> > diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c
> > index 39cb8d0ae348..54bb932184dd 100644
> > --- a/arch/s390/kernel/idle.c
> > +++ b/arch/s390/kernel/idle.c
> > @@ -35,6 +35,12 @@ void account_idle_time_irq(void)
> >  			this_cpu_add(mt_cycles[i], cycles_new[i] - idle->mt_cycles_enter[i]);
> >  	}
> >  
> > +	WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1);
> > +
> > +	/* Dyntick idle time accounted by nohz/scheduler */
> > +	if (idle->idle_dyntick)
> > +		return;
> > +
> >  	idle_time = lc->int_clock - idle->clock_idle_enter;
> >  
> >  	lc->steal_timer += idle->clock_idle_enter - lc->last_update_clock;
> > @@ -45,7 +51,6 @@ void account_idle_time_irq(void)
> >  
> >  	/* Account time spent with enabled wait psw loaded as idle time. */
> >  	WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time);
> > -	WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1);
> >  	account_idle_time(cputime_to_nsecs(idle_time));
> >  }
> 
> This breaks idle time reporting (aka enabled wait psw time) via the per-cpu
> sysfs files (see show_idle_time()). That is: the second WRITE_ONCE() should
> also go above the early return statement; but of course this leads to other
> dependencies...

Oh right! Will fix that.

BTW here is a question for you: does the timer (as in get_cpu_timer()) still
decrement while in idle? I would assume not, given how lc->system_timer
is updated in account_idle_time_irq().

And another question in this same function is this :

    lc->steal_timer += idle->clock_idle_enter - lc->last_update_clock;

clock_idle_enter is updated right before halting the CPU. But when was
last_update_clock updated last? Could be either task switch to idle, or
a previous idle tick interrupt or a previous idle IRQ entry. In any case
I'm not sure the difference is meaningful as steal time.

I must be missing something.

> Not sure what to do with this. I thought about removing those sysfs files
> already in the past, since they are of very limited use; and most likely
> nothing in user space would miss them.

Perhaps but this file is a good comparison point against /proc/stat because
s390 vtime is much closer to measuring the actual CPU halted time than what
the generic nohz accounting does (which includes more idle code execution).

> 
> Anyway, you need to integrate the trivial patch below, so everything compiles
> for s390. It also _seems_ to work.

Thanks, I'll include that.

> 
> Guess I need to spend some more time on accounting and see what it would take
> to convert to VIRT_CPU_ACCOUNTING_GEN, while keeping the current precision and
> functionality.

I would expect more overhead with VIRT_CPU_ACCOUNTING_GEN, though that has yet
to be measured. In any case you'll lose some idle cputime precision (but
you need to read that through s390 sysfs files) if what we want to measure
here is the actual halted time.

Perhaps we could enhance VIRT_CPU_ACCOUNTING_GEN and nohz idle cputime
accounting to match s390 precision. Though I expect some cost
accessing the clock inevitably more often on some machines.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 05/15] s390/time: Prepare to stop elapsing in dynticks-idle
  2026-01-21 18:04     ` Frederic Weisbecker
@ 2026-01-22 14:40       ` Heiko Carstens
  2026-01-27 14:45         ` Frederic Weisbecker
  0 siblings, 1 reply; 42+ messages in thread
From: Heiko Carstens @ 2026-01-22 14:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Ingo Molnar, Jan Kiszka,
	Joel Fernandes, Juri Lelli, Kieran Bingham, Madhavan Srinivasan,
	Mel Gorman, Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

On Wed, Jan 21, 2026 at 07:04:35PM +0100, Frederic Weisbecker wrote:
> BTW here is a question for you, does the timer (as in get_cpu_timer()) still
> decrements while in idle? I would assume not, given how lc->system_timer
> is updated in account_idle_time_irq().

It is not decremented while in idle (or when the hypervisor schedules
the virtual cpu away). We calculate steal time from the fact that the
cpu timer is not decremented while the virtual cpu is not running,
compared against the real time-of-day clock.

> And another question in this same function is this :
> 
>     lc->steal_timer += idle->clock_idle_enter - lc->last_update_clock;
> 
> clock_idle_enter is updated right before halting the CPU. But when was
> last_update_clock updated last? Could be either task switch to idle, or
> a previous idle tick interrupt or a previous idle IRQ entry. In any case
> I'm not sure the difference is meaningful as steal time.
> 
> I must be missing something.

"It has been like that forever" :) However I do agree that this doesn't seem
to make any sense. At least with the current implementation I cannot see how
that makes sense, since the difference of two time stamps, which do not
include any steal time are added.

Maybe it was broken by one of the many changes over the years, or it was
always wrong, or I am missing something too.

Will investigate and address it if required. Thank you for bringing this up!

> > Not sure what to do with this. I thought about removing those sysfs files
> > already in the past, since they are of very limited use; and most likely
> > nothing in user space would miss them.
> 
> Perhaps but this file is a good comparison point against /proc/stat because
> s390 vtime is much closer to measuring the actual CPU halted time than what
> the generic nohz accounting does (which includes more idle code execution).

Yes, while comparing those files I also see an unexpected difference of
several seconds after two days of uptime; that is before your changes.

In theory the sum of idle and iowait in /proc/stat should be the same as the
per-cpu idle_time_us sysfs file. But there is a difference, which shouldn't be
there as far as I can tell. Yet another thing to look into.

> > Guess I need to spend some more time on accounting and see what it would take
> > to convert to VIRT_CPU_ACCOUNTING_GEN, while keeping the current precision and
> > functionality.
> 
> I would expect more overhead with VIRT_CPU_ACCOUNTING_GEN, though that has yet
> to be measured. In any case you'll lose some idle cputime precision (but
> you need to read that through s390 sysfs files) if what we want to measure
> here is the actual halted time.
> 
> Perhaps we could enhance VIRT_CPU_ACCOUNTING_GEN and nohz idle cputime
> accounting to match s390 precision. Though I expect some cost from
> accessing the clock more often on some machines.

Let me experiment with that, but first I want to understand the oddities
pointed out above.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 05/15] s390/time: Prepare to stop elapsing in dynticks-idle
  2026-01-22 14:40       ` Heiko Carstens
@ 2026-01-27 14:45         ` Frederic Weisbecker
  0 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-01-27 14:45 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
	Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
	Christian Borntraeger, Dietmar Eggemann, Ingo Molnar, Jan Kiszka,
	Joel Fernandes, Juri Lelli, Kieran Bingham, Madhavan Srinivasan,
	Mel Gorman, Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev

On Thu, Jan 22, 2026 at 03:40:45PM +0100, Heiko Carstens wrote:
> On Wed, Jan 21, 2026 at 07:04:35PM +0100, Frederic Weisbecker wrote:
> > BTW here is a question for you: does the timer (as in get_cpu_timer()) still
> > decrement while in idle? I would assume not, given how lc->system_timer
> > is updated in account_idle_time_irq().
> 
> It is not decremented while in idle (or when the hypervisor schedules
> the virtual cpu away). We calculate steal time from the fact that the
> cpu timer is not decremented while the virtual cpu is not running,
> by comparing it against the real time-of-day clock.

Ok, good then!

> 
> > And another question in this same function is this :
> > 
> >     lc->steal_timer += idle->clock_idle_enter - lc->last_update_clock;
> > 
> > clock_idle_enter is updated right before halting the CPU. But when was
> > last_update_clock updated last? Could be either task switch to idle, or
> > a previous idle tick interrupt or a previous idle IRQ entry. In any case
> > I'm not sure the difference is meaningful as steal time.
> > 
> > I must be missing something.
> 
> "It has been like that forever" :) However I do agree that this doesn't seem
> to make any sense. At least with the current implementation I cannot see how
> that makes sense, since the difference of two time stamps, which do not
> include any steal time are added.
> 
> Maybe it was broken by one of the many changes over the years, or it was
> always wrong, or I am missing something too.
> 
> Will investigate and address it if required. Thank you for bringing this up!

Ok, I take some relief from the fact it's not only unclear to me :-)

> 
> > > Not sure what to do with this. I thought about removing those sysfs files
> > > already in the past, since they are of very limited use; and most likely
> > > nothing in user space would miss them.
> > 
> > Perhaps but this file is a good comparison point against /proc/stat because
> > s390 vtime is much closer to measuring the actual CPU halted time than what
> > the generic nohz accounting does (which includes more idle code execution).
> 
> Yes, while comparing those files I also see an unexpected difference of
> several seconds after two days of uptime; that is before your changes.
> 
> In theory the sum of idle and iowait in /proc/stat should be the same as the
> per-cpu idle_time_us sysfs file. But there is a difference, which shouldn't be
> there as far as I can tell. Yet another thing to look into.

Yes, and that's expected both before and after my changes.

* /proc/stat reports the time spent between tick_nohz_idle_enter() and
  tick_nohz_idle_exit() (simplifying a bit, because there are some pauses
  during idle IRQs).

* The s390 idle sysfs file more closely depicts the time spent while the
  CPU is really idle (and not executing idle code).

The semantics differ, and this is why you observe different results. I guess
/proc/stat has higher values (for idle + iowait), and that is expected.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 03/15] sched/cputime: Correctly support generic vtime idle time
  2026-02-06 14:22 [PATCH 00/15 v2] " Frederic Weisbecker
@ 2026-02-06 14:22 ` Frederic Weisbecker
  0 siblings, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev, Shrikanth Hegde

Currently, whether generic vtime is running or not, the idle cputime is
fetched from the nohz accounting.

However, generic vtime already does its own idle cputime accounting; only
the kernel stat accessors are not wired up to use it.

Read the generic vtime idle cputime when it's running. This will later
allow a cleaner split between nohz and vtime cputime accounting.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/vtime.h    |  9 +++++++--
 kernel/sched/cputime.c   | 38 +++++++++++++++++++++++++++++---------
 kernel/time/tick-sched.c | 12 +++++++++---
 3 files changed, 45 insertions(+), 14 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 29dd5b91dd7d..336875bea767 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -10,7 +10,6 @@
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
 extern void vtime_account_kernel(struct task_struct *tsk);
-extern void vtime_account_idle(struct task_struct *tsk);
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
@@ -27,7 +26,13 @@ static inline void vtime_guest_exit(struct task_struct *tsk) { }
 static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
 #endif
 
+static inline bool vtime_generic_enabled_cpu(int cpu)
+{
+	return context_tracking_enabled_cpu(cpu);
+}
+
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+extern void vtime_account_idle(struct task_struct *tsk);
 extern void vtime_account_irq(struct task_struct *tsk, unsigned int offset);
 extern void vtime_account_softirq(struct task_struct *tsk);
 extern void vtime_account_hardirq(struct task_struct *tsk);
@@ -74,7 +79,7 @@ static inline bool vtime_accounting_enabled(void)
 
 static inline bool vtime_accounting_enabled_cpu(int cpu)
 {
-	return context_tracking_enabled_cpu(cpu);
+	return vtime_generic_enabled_cpu(cpu);
 }
 
 static inline bool vtime_accounting_enabled_this_cpu(void)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5dcb0f2e01bc..5613838d0307 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -759,9 +759,9 @@ void vtime_guest_exit(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(vtime_guest_exit);
 
-void vtime_account_idle(struct task_struct *tsk)
+static void __vtime_account_idle(struct vtime *vtime)
 {
-	account_idle_time(get_vtime_delta(&tsk->vtime));
+	account_idle_time(get_vtime_delta(vtime));
 }
 
 void vtime_task_switch_generic(struct task_struct *prev)
@@ -770,7 +770,7 @@ void vtime_task_switch_generic(struct task_struct *prev)
 
 	write_seqcount_begin(&vtime->seqcount);
 	if (vtime->state == VTIME_IDLE)
-		vtime_account_idle(prev);
+		__vtime_account_idle(vtime);
 	else
 		__vtime_account_kernel(prev, vtime);
 	vtime->state = VTIME_INACTIVE;
@@ -912,6 +912,7 @@ static int kcpustat_field_vtime(u64 *cpustat,
 				int cpu, u64 *val)
 {
 	struct vtime *vtime = &tsk->vtime;
+	struct rq *rq = cpu_rq(cpu);
 	unsigned int seq;
 
 	do {
@@ -953,6 +954,14 @@ static int kcpustat_field_vtime(u64 *cpustat,
 			if (state == VTIME_GUEST && task_nice(tsk) > 0)
 				*val += vtime->gtime + vtime_delta(vtime);
 			break;
+		case CPUTIME_IDLE:
+			if (state == VTIME_IDLE && !atomic_read(&rq->nr_iowait))
+				*val += vtime_delta(vtime);
+			break;
+		case CPUTIME_IOWAIT:
+			if (state == VTIME_IDLE && atomic_read(&rq->nr_iowait) > 0)
+				*val += vtime_delta(vtime);
+			break;
 		default:
 			break;
 		}
@@ -1015,8 +1024,8 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
 		*dst = *src;
 		cpustat = dst->cpustat;
 
-		/* Task is sleeping, dead or idle, nothing to add */
-		if (state < VTIME_SYS)
+		/* Task is sleeping or dead, nothing to add */
+		if (state < VTIME_IDLE)
 			continue;
 
 		delta = vtime_delta(vtime);
@@ -1025,15 +1034,17 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
 		 * Task runs either in user (including guest) or kernel space,
 		 * add pending nohz time to the right place.
 		 */
-		if (state == VTIME_SYS) {
+		switch (vtime->state) {
+		case VTIME_SYS:
 			cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
-		} else if (state == VTIME_USER) {
+			break;
+		case VTIME_USER:
 			if (task_nice(tsk) > 0)
 				cpustat[CPUTIME_NICE] += vtime->utime + delta;
 			else
 				cpustat[CPUTIME_USER] += vtime->utime + delta;
-		} else {
-			WARN_ON_ONCE(state != VTIME_GUEST);
+			break;
+		case VTIME_GUEST:
 			if (task_nice(tsk) > 0) {
 				cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta;
 				cpustat[CPUTIME_NICE] += vtime->gtime + delta;
@@ -1041,6 +1052,15 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
 				cpustat[CPUTIME_GUEST] += vtime->gtime + delta;
 				cpustat[CPUTIME_USER] += vtime->gtime + delta;
 			}
+			break;
+		case VTIME_IDLE:
+			if (atomic_read(&cpu_rq(cpu)->nr_iowait) > 0)
+				cpustat[CPUTIME_IOWAIT] += delta;
+			else
+				cpustat[CPUTIME_IDLE] += delta;
+			break;
+		default:
+			WARN_ON_ONCE(1);
 		}
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8ddf74e705d3..9632066aea4d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -774,9 +774,10 @@ static void tick_nohz_start_idle(struct tick_sched *ts)
 	sched_clock_idle_sleep_event();
 }
 
-static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
+static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx, ktime_t *sleeptime,
 				 bool compute_delta, u64 *last_update_time)
 {
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now, idle;
 	unsigned int seq;
 
@@ -787,6 +788,11 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
+	if (vtime_generic_enabled_cpu(cpu)) {
+		idle = kcpustat_field(idx, cpu);
+		return ktime_to_us(idle);
+	}
+
 	do {
 		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 
@@ -824,7 +830,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 
-	return get_cpu_sleep_time_us(ts, &ts->idle_sleeptime,
+	return get_cpu_sleep_time_us(cpu, CPUTIME_IDLE, &ts->idle_sleeptime,
 				     !nr_iowait_cpu(cpu), last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
@@ -850,7 +856,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 
-	return get_cpu_sleep_time_us(ts, &ts->iowait_sleeptime,
+	return get_cpu_sleep_time_us(cpu, CPUTIME_IOWAIT, &ts->iowait_sleeptime,
 				     nr_iowait_cpu(cpu), last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
  2026-01-16 14:51 ` [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle Frederic Weisbecker
@ 2026-02-25 17:53   ` Christophe Leroy (CS GROUP)
  0 siblings, 0 replies; 42+ messages in thread
From: Christophe Leroy (CS GROUP) @ 2026-02-25 17:53 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
	Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
	Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
	linux-s390, linuxppc-dev



On 16/01/2026 at 15:51, Frederic Weisbecker wrote:
> Currently the tick subsystem stores the idle cputime accounting in
> private fields, allowing cohabitation with architecture idle vtime
> accounting. The former is fetched on online CPUs, the latter on offline
> CPUs.
> 
> For consolidation purposes, architecture vtime accounting will continue
> to account the cputime but will pause when the idle tick is
> stopped. The dyntick cputime accounting will then be relayed by the tick
> subsystem so that the idle cputime is still seen advancing coherently
> even when the tick isn't there to flush the idle vtime.
> 
> Prepare for that and introduce three new APIs which will be used in
> subsequent patches:
> 
> - vtime_dyntick_start() is meant to be called when idle enters
>    dyntick mode. The idle cputime that elapsed so far is accumulated.
> 
> - vtime_dyntick_stop() is meant to be called when idle exits from
>    dyntick mode. The vtime entry clocks are fast-forwarded to current time
>    so that idle accounting restarts elapsing from now.
> 
> - vtime_reset() is meant to be called from dyntick-idle IRQ entry to
>    fast-forward the clock to current time so that the IRQ time is still
>    accounted by vtime while nohz cputime is paused.
> 
> Also, accumulated vtime won't be flushed from dyntick-idle ticks, to avoid
> accounting the idle cputime twice along with nohz accounting.
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>   arch/powerpc/kernel/time.c | 41 ++++++++++++++++++++++++++++++++++++++
>   include/linux/vtime.h      |  6 ++++++
>   2 files changed, 47 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 4bbeb8644d3d..9b3167274653 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -376,6 +376,47 @@ void vtime_task_switch(struct task_struct *prev)
>   		acct->starttime = acct0->starttime;
>   	}
>   }
> +
> +#ifdef CONFIG_NO_HZ_COMMON
> +/**
> + * vtime_reset - Fast forward vtime entry clocks
> + *
> + * Called from dynticks idle IRQ entry to fast-forward the clocks to current time
> + * so that the IRQ time is still accounted by vtime while nohz cputime is paused.
> + */
> +void vtime_reset(void)
> +{
> +	struct cpu_accounting_data *acct = get_accounting(current);
> +
> +	acct->starttime = mftb();
> +#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
> +	acct->startspurr = read_spurr(now);

'now' doesn't exist.

> +#endif
> +}
> +
> +/**
> + * vtime_dyntick_start - Inform vtime about entry to idle-dynticks
> + *
> + * Called when idle enters in dyntick mode. The idle cputime that elapsed so far
> + * is accumulated and the tick subsystem takes over the idle cputime accounting.
> + */
> +void vtime_dyntick_start(void)
> +{
> +	vtime_account_idle(current);
> +}
> +
> +/**
> + * vtime_dyntick_stop - Inform vtime about exit from idle-dynticks
> + *
> + * Called when idle exits from dyntick mode. The vtime entry clocks are
> + * fast-forward to current time so that idle accounting restarts elapsing from
> + * now.
> + */
> +void vtime_dyntick_stop(void)
> +{
> +	vtime_reset();
> +}
> +#endif /* CONFIG_NO_HZ_COMMON */
>   #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
>   
>   void __no_kcsan __delay(unsigned long loops)
> diff --git a/include/linux/vtime.h b/include/linux/vtime.h
> index 737930f66c3e..10cdb08f960b 100644
> --- a/include/linux/vtime.h
> +++ b/include/linux/vtime.h
> @@ -37,11 +37,17 @@ extern void vtime_account_irq(struct task_struct *tsk, unsigned int offset);
>   extern void vtime_account_softirq(struct task_struct *tsk);
>   extern void vtime_account_hardirq(struct task_struct *tsk);
>   extern void vtime_flush(struct task_struct *tsk);
> +extern void vtime_reset(void);
> +extern void vtime_dyntick_start(void);
> +extern void vtime_dyntick_stop(void);

The extern keyword is pointless for function prototypes; we should refrain
from adding new ones.

>   #else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
>   static inline void vtime_account_irq(struct task_struct *tsk, unsigned int offset) { }
>   static inline void vtime_account_softirq(struct task_struct *tsk) { }
>   static inline void vtime_account_hardirq(struct task_struct *tsk) { }
>   static inline void vtime_flush(struct task_struct *tsk) { }
> +static inline void vtime_reset(void) { }
> +static inline void vtime_dyntick_start(void) { }
> +extern inline void vtime_dyntick_stop(void) { }

Why extern for that one ?

>   #endif
>   
>   /*


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2026-02-25 17:54 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-16 14:51 [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
2026-01-16 14:51 ` [PATCH 01/15] sched/idle: Handle offlining first in idle loop Frederic Weisbecker
2026-01-19 12:53   ` Peter Zijlstra
2026-01-19 21:04     ` Frederic Weisbecker
2026-01-20  4:26       ` K Prateek Nayak
2026-01-20 14:52         ` Frederic Weisbecker
2026-01-16 14:51 ` [PATCH 02/15] sched/cputime: Remove superfluous and error prone kcpustat_field() parameter Frederic Weisbecker
2026-01-16 14:51 ` [PATCH 03/15] sched/cputime: Correctly support generic vtime idle time Frederic Weisbecker
2026-01-19 13:02   ` Peter Zijlstra
2026-01-19 21:35     ` Frederic Weisbecker
2026-01-16 14:51 ` [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle Frederic Weisbecker
2026-02-25 17:53   ` Christophe Leroy (CS GROUP)
2026-01-16 14:51 ` [PATCH 05/15] s390/time: " Frederic Weisbecker
2026-01-21 12:17   ` Heiko Carstens
2026-01-21 18:04     ` Frederic Weisbecker
2026-01-22 14:40       ` Heiko Carstens
2026-01-27 14:45         ` Frederic Weisbecker
2026-01-16 14:51 ` [PATCH 06/15] tick/sched: Unify idle cputime accounting Frederic Weisbecker
2026-01-19 14:26   ` Peter Zijlstra
2026-01-19 22:00     ` Frederic Weisbecker
2026-01-16 14:52 ` [PATCH 07/15] cpufreq: ondemand: Simplify idle cputime granularity test Frederic Weisbecker
2026-01-19  5:37   ` Viresh Kumar
2026-01-19 12:30   ` Rafael J. Wysocki
2026-01-19 22:06     ` Frederic Weisbecker
2026-01-20 12:32       ` Rafael J. Wysocki
2026-01-20 14:28         ` Frederic Weisbecker
2026-01-16 14:52 ` [PATCH 08/15] tick/sched: Remove nohz disabled special case in cputime fetch Frederic Weisbecker
2026-01-16 14:52 ` [PATCH 09/15] tick/sched: Move dyntick-idle cputime accounting to cputime code Frederic Weisbecker
2026-01-19 14:35   ` Peter Zijlstra
2026-01-19 22:08     ` Frederic Weisbecker
2026-01-16 14:52 ` [PATCH 10/15] tick/sched: Remove unused fields Frederic Weisbecker
2026-01-16 14:52 ` [PATCH 11/15] tick/sched: Account tickless idle cputime only when tick is stopped Frederic Weisbecker
2026-01-16 14:52 ` [PATCH 12/15] tick/sched: Consolidate idle time fetching APIs Frederic Weisbecker
2026-01-16 14:52 ` [PATCH 13/15] sched/cputime: Consolidate get_cpu_[idle|iowait]_time_us() Frederic Weisbecker
2026-01-16 14:52 ` [PATCH 14/15] sched/cputime: Handle idle irqtime gracefully Frederic Weisbecker
2026-01-16 14:52 ` [PATCH 15/15] sched/cputime: Handle dyntick-idle steal time correctly Frederic Weisbecker
2026-01-16 14:57 ` [PATCH 00/15] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
2026-01-20 12:42   ` Shrikanth Hegde
2026-01-21 16:55     ` Frederic Weisbecker
2026-01-19 14:53 ` Peter Zijlstra
2026-01-19 22:12   ` Frederic Weisbecker
  -- strict thread matches above, loose matches on Subject: below --
2026-02-06 14:22 [PATCH 00/15 v2] " Frederic Weisbecker
2026-02-06 14:22 ` [PATCH 03/15] sched/cputime: Correctly support generic vtime idle time Frederic Weisbecker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox