* [PATCH 01/15] sched/idle: Handle offlining first in idle loop
2026-02-06 14:22 [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
@ 2026-02-06 14:22 ` Frederic Weisbecker
2026-02-18 18:22 ` Shrikanth Hegde
2026-02-06 14:22 ` [PATCH 02/15] sched/cputime: Remove superfluous and error prone kcpustat_field() parameter Frederic Weisbecker
` (14 subsequent siblings)
15 siblings, 1 reply; 40+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev, Shrikanth Hegde
Offline handling happens from within the inner idle loop,
after the beginning of dyntick cputime accounting, nohz idle
load balancing and TIF_NEED_RESCHED polling.
This is not necessary and even buggy because:
* There is no dyntick handling to do. And calling tick_nohz_idle_enter()
messes up the struct tick_sched reset that was performed in
tick_sched_timer_dying().
* There is no nohz idle balancing to do.
* Polling on TIF_NEED_RESCHED is irrelevant at this stage, as no more
tasks are allowed to run.
* There is no need to check need_resched() before offline handling since
stop_machine is done and all per-CPU kthreads should be done with
their job.
Therefore move the offline handling to the beginning of the idle loop.
This will also ease the later idle cputime unification by not elapsing
idle time while offline through the call to:
tick_nohz_idle_enter() -> tick_nohz_start_idle()
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/sched/idle.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c174afe1dd17..51764cbec6f3 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -260,6 +260,14 @@ static void do_idle(void)
{
int cpu = smp_processor_id();
+ if (cpu_is_offline(cpu)) {
+ local_irq_disable();
+ /* All per-CPU kernel threads should be done by now. */
+ WARN_ON_ONCE(need_resched());
+ cpuhp_report_idle_dead();
+ arch_cpu_idle_dead();
+ }
+
/*
* Check if we need to update blocked load
*/
@@ -311,11 +319,6 @@ static void do_idle(void)
*/
local_irq_disable();
- if (cpu_is_offline(cpu)) {
- cpuhp_report_idle_dead();
- arch_cpu_idle_dead();
- }
-
arch_cpu_idle_enter();
rcu_nocb_flush_deferred_wakeup();
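The effect of the reordering can be sketched in userspace (purely illustrative, not kernel code: the `_sim` helpers and counters are invented for the example). The point is that an offline CPU now bails out before any dyntick idle accounting can start:

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated state: dyntick idle cputime and the offline report. */
static unsigned long long idle_ns;
static bool reported_dead;

/* Stand-in for tick_nohz_idle_enter(): idle time starts elapsing here. */
static void tick_nohz_idle_enter_sim(unsigned long long ns)
{
	idle_ns += ns;
}

/* One pass of the reworked do_idle() prologue. */
static void do_idle_sim(bool cpu_offline, unsigned long long ns)
{
	if (cpu_offline) {
		/* All per-CPU kernel threads should be done by now. */
		reported_dead = true;
		return;	/* arch_cpu_idle_dead() would never return */
	}
	tick_nohz_idle_enter_sim(ns);	/* only online CPUs elapse idle time */
}
```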
--
2.51.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 01/15] sched/idle: Handle offlining first in idle loop
2026-02-06 14:22 ` [PATCH 01/15] sched/idle: Handle offlining first in idle loop Frederic Weisbecker
@ 2026-02-18 18:22 ` Shrikanth Hegde
0 siblings, 0 replies; 40+ messages in thread
From: Shrikanth Hegde @ 2026-02-18 18:22 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
> Offline handling happens from within the inner idle loop,
> after the beginning of dyntick cputime accounting, nohz idle
> load balancing and TIF_NEED_RESCHED polling.
>
> This is not necessary and even buggy because:
>
> * There is no dyntick handling to do. And calling tick_nohz_idle_enter()
> messes up with the struct tick_sched reset that was performed on
> tick_sched_timer_dying().
>
> * There is no nohz idle balancing to do.
>
> * Polling on TIF_RESCHED is irrelevant at this stage, there are no more
> tasks allowed to run.
>
> * No need to check if need_resched() before offline handling since
> stop_machine is done and all per-cpu kthread should be done with
> their job.
>
> Therefore move the offline handling at the beginning of the idle loop.
> This will also ease the idle cputime unification later by not elapsing
> idle time while offline through the call to:
>
> tick_nohz_idle_enter() -> tick_nohz_start_idle()
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Makes sense to call it outside the loop.
Once the CPU reports idle dead, there is nothing left to do on that CPU.
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> kernel/sched/idle.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index c174afe1dd17..51764cbec6f3 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -260,6 +260,14 @@ static void do_idle(void)
> {
> int cpu = smp_processor_id();
>
> + if (cpu_is_offline(cpu)) {
> + local_irq_disable();
> + /* All per-CPU kernel threads should be done by now. */
> + WARN_ON_ONCE(need_resched());
> + cpuhp_report_idle_dead();
> + arch_cpu_idle_dead();
> + }
> +
> /*
> * Check if we need to update blocked load
> */
> @@ -311,11 +319,6 @@ static void do_idle(void)
> */
> local_irq_disable();
>
> - if (cpu_is_offline(cpu)) {
> - cpuhp_report_idle_dead();
> - arch_cpu_idle_dead();
> - }
> -
> arch_cpu_idle_enter();
> rcu_nocb_flush_deferred_wakeup();
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 02/15] sched/cputime: Remove superfluous and error prone kcpustat_field() parameter
2026-02-06 14:22 [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
2026-02-06 14:22 ` [PATCH 01/15] sched/idle: Handle offlining first in idle loop Frederic Weisbecker
@ 2026-02-06 14:22 ` Frederic Weisbecker
2026-02-18 18:25 ` Shrikanth Hegde
2026-02-06 14:22 ` [PATCH 03/15] sched/cputime: Correctly support generic vtime idle time Frederic Weisbecker
` (13 subsequent siblings)
15 siblings, 1 reply; 40+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev, Shrikanth Hegde
The first parameter to kcpustat_field() is a pointer to the CPU kcpustat
to be fetched from. This parameter is error prone because a copy of a
kcpustat could be passed by accident instead of the original one. Also
the kcpustat structure can already be retrieved with the help of the
mandatory CPU argument.
Remove the needless parameter.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
drivers/cpufreq/cpufreq_governor.c | 6 +++---
drivers/macintosh/rack-meter.c | 2 +-
include/linux/kernel_stat.h | 8 +++-----
kernel/rcu/tree.c | 9 +++------
kernel/rcu/tree_stall.h | 7 +++----
kernel/sched/cputime.c | 5 ++---
6 files changed, 15 insertions(+), 22 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 1a7fcaf39cc9..b6683628091d 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -105,7 +105,7 @@ void gov_update_cpu_data(struct dbs_data *dbs_data)
j_cdbs->prev_cpu_idle = get_cpu_idle_time(j, &j_cdbs->prev_update_time,
dbs_data->io_is_busy);
if (dbs_data->ignore_nice_load)
- j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
+ j_cdbs->prev_cpu_nice = kcpustat_field(CPUTIME_NICE, j);
}
}
}
@@ -165,7 +165,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
j_cdbs->prev_cpu_idle = cur_idle_time;
if (ignore_nice) {
- u64 cur_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
+ u64 cur_nice = kcpustat_field(CPUTIME_NICE, j);
idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, NSEC_PER_USEC);
j_cdbs->prev_cpu_nice = cur_nice;
@@ -539,7 +539,7 @@ int cpufreq_dbs_governor_start(struct cpufreq_policy *policy)
j_cdbs->prev_load = 0;
if (ignore_nice)
- j_cdbs->prev_cpu_nice = kcpustat_field(&kcpustat_cpu(j), CPUTIME_NICE, j);
+ j_cdbs->prev_cpu_nice = kcpustat_field(CPUTIME_NICE, j);
}
gov->start(policy);
diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index 896a43bd819f..20b2ecd32340 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -87,7 +87,7 @@ static inline u64 get_cpu_idle_time(unsigned int cpu)
kcpustat->cpustat[CPUTIME_IOWAIT];
if (rackmeter_ignore_nice)
- retval += kcpustat_field(kcpustat, CPUTIME_NICE, cpu);
+ retval += kcpustat_field(CPUTIME_NICE, cpu);
return retval;
}
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index b97ce2df376f..dd020ecaf67b 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -100,14 +100,12 @@ static inline unsigned long kstat_cpu_irqs_sum(unsigned int cpu)
}
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern u64 kcpustat_field(struct kernel_cpustat *kcpustat,
- enum cpu_usage_stat usage, int cpu);
+extern u64 kcpustat_field(enum cpu_usage_stat usage, int cpu);
extern void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu);
#else
-static inline u64 kcpustat_field(struct kernel_cpustat *kcpustat,
- enum cpu_usage_stat usage, int cpu)
+static inline u64 kcpustat_field(enum cpu_usage_stat usage, int cpu)
{
- return kcpustat->cpustat[usage];
+ return kcpustat_cpu(cpu).cpustat[usage];
}
static inline void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 293bbd9ac3f4..ceea4b2f755b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -968,14 +968,11 @@ static int rcu_watching_snap_recheck(struct rcu_data *rdp)
if (rcu_cpu_stall_cputime && rdp->snap_record.gp_seq != rdp->gp_seq) {
int cpu = rdp->cpu;
struct rcu_snap_record *rsrp;
- struct kernel_cpustat *kcsp;
-
- kcsp = &kcpustat_cpu(cpu);
rsrp = &rdp->snap_record;
- rsrp->cputime_irq = kcpustat_field(kcsp, CPUTIME_IRQ, cpu);
- rsrp->cputime_softirq = kcpustat_field(kcsp, CPUTIME_SOFTIRQ, cpu);
- rsrp->cputime_system = kcpustat_field(kcsp, CPUTIME_SYSTEM, cpu);
+ rsrp->cputime_irq = kcpustat_field(CPUTIME_IRQ, cpu);
+ rsrp->cputime_softirq = kcpustat_field(CPUTIME_SOFTIRQ, cpu);
+ rsrp->cputime_system = kcpustat_field(CPUTIME_SYSTEM, cpu);
rsrp->nr_hardirqs = kstat_cpu_irqs_sum(cpu) + arch_irq_stat_cpu(cpu);
rsrp->nr_softirqs = kstat_cpu_softirqs_sum(cpu);
rsrp->nr_csw = nr_context_switches_cpu(cpu);
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index b67532cb8770..cf7ae51cba40 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -479,7 +479,6 @@ static void print_cpu_stat_info(int cpu)
{
struct rcu_snap_record rsr, *rsrp;
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
- struct kernel_cpustat *kcsp = &kcpustat_cpu(cpu);
if (!rcu_cpu_stall_cputime)
return;
@@ -488,9 +487,9 @@ static void print_cpu_stat_info(int cpu)
if (rsrp->gp_seq != rdp->gp_seq)
return;
- rsr.cputime_irq = kcpustat_field(kcsp, CPUTIME_IRQ, cpu);
- rsr.cputime_softirq = kcpustat_field(kcsp, CPUTIME_SOFTIRQ, cpu);
- rsr.cputime_system = kcpustat_field(kcsp, CPUTIME_SYSTEM, cpu);
+ rsr.cputime_irq = kcpustat_field(CPUTIME_IRQ, cpu);
+ rsr.cputime_softirq = kcpustat_field(CPUTIME_SOFTIRQ, cpu);
+ rsr.cputime_system = kcpustat_field(CPUTIME_SYSTEM, cpu);
pr_err("\t hardirqs softirqs csw/system\n");
pr_err("\t number: %8lld %10d %12lld\n",
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 4f97896887ec..5dcb0f2e01bc 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -961,10 +961,9 @@ static int kcpustat_field_vtime(u64 *cpustat,
return 0;
}
-u64 kcpustat_field(struct kernel_cpustat *kcpustat,
- enum cpu_usage_stat usage, int cpu)
+u64 kcpustat_field(enum cpu_usage_stat usage, int cpu)
{
- u64 *cpustat = kcpustat->cpustat;
+ u64 *cpustat = kcpustat_cpu(cpu).cpustat;
u64 val = cpustat[usage];
struct rq *rq;
int err;
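The aliasing hazard the change removes can be shown with a userspace sketch (the per-CPU machinery is faked with a plain array; only the accessor shapes mirror the kernel's):

```c
#include <assert.h>

enum cpu_usage_stat_sim { CPUTIME_NICE_SIM, NR_STATS_SIM };

struct kernel_cpustat_sim {
	unsigned long long cpustat[NR_STATS_SIM];
};

/* Stand-in for per-CPU data. */
static struct kernel_cpustat_sim cpustat_table[4];
#define kcpustat_cpu_sim(cpu) (cpustat_table[cpu])

/* Old form: the caller may accidentally pass a stale copy instead of
 * the canonical per-CPU structure. */
static unsigned long long
kcpustat_field_old(struct kernel_cpustat_sim *kcpustat,
		   enum cpu_usage_stat_sim usage, int cpu)
{
	(void)cpu;	/* cpu was mandatory yet redundant */
	return kcpustat->cpustat[usage];
}

/* New form: the canonical structure is derived from the mandatory CPU
 * argument, so there is nothing left to alias. */
static unsigned long long
kcpustat_field_new(enum cpu_usage_stat_sim usage, int cpu)
{
	return kcpustat_cpu_sim(cpu).cpustat[usage];
}
```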
--
2.51.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 02/15] sched/cputime: Remove superfluous and error prone kcpustat_field() parameter
2026-02-06 14:22 ` [PATCH 02/15] sched/cputime: Remove superfluous and error prone kcpustat_field() parameter Frederic Weisbecker
@ 2026-02-18 18:25 ` Shrikanth Hegde
0 siblings, 0 replies; 40+ messages in thread
From: Shrikanth Hegde @ 2026-02-18 18:25 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Christophe Leroy (CS GROUP), Rafael J. Wysocki, Alexander Gordeev,
Anna-Maria Behnsen, Ben Segall, Boqun Feng, Christian Borntraeger,
Dietmar Eggemann, Heiko Carstens, Ingo Molnar, Jan Kiszka,
Joel Fernandes, Juri Lelli, Kieran Bingham, Madhavan Srinivasan,
Mel Gorman, Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev
On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
> The first parameter to kcpustat_field() is a pointer to the cpu kcpustat
> to be fetched from. This parameter is error prone because a copy to a
> kcpustat could be passed by accident instead of the original one. Also
> the kcpustat structure can already be retrieved with the help of the
> mandatory CPU argument.
>
> Remove the needless paramater.
nit: s/paramater/parameter
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 03/15] sched/cputime: Correctly support generic vtime idle time
2026-02-06 14:22 [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
2026-02-06 14:22 ` [PATCH 01/15] sched/idle: Handle offlining first in idle loop Frederic Weisbecker
2026-02-06 14:22 ` [PATCH 02/15] sched/cputime: Remove superfluous and error prone kcpustat_field() parameter Frederic Weisbecker
@ 2026-02-06 14:22 ` Frederic Weisbecker
2026-02-06 14:22 ` [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle Frederic Weisbecker
` (12 subsequent siblings)
15 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev, Shrikanth Hegde
Currently, whether generic vtime is running or not, the idle cputime is
fetched from the nohz accounting.
However generic vtime already does its own idle cputime accounting; only
the kernel stat accessors are not wired up to report it.
Read the generic vtime idle cputime when it is running. This will later
allow a cleaner split between nohz and vtime cputime accounting.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/vtime.h | 9 +++++++--
kernel/sched/cputime.c | 38 +++++++++++++++++++++++++++++---------
kernel/time/tick-sched.c | 12 +++++++++---
3 files changed, 45 insertions(+), 14 deletions(-)
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 29dd5b91dd7d..336875bea767 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -10,7 +10,6 @@
*/
#ifdef CONFIG_VIRT_CPU_ACCOUNTING
extern void vtime_account_kernel(struct task_struct *tsk);
-extern void vtime_account_idle(struct task_struct *tsk);
#endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
@@ -27,7 +26,13 @@ static inline void vtime_guest_exit(struct task_struct *tsk) { }
static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
#endif
+static inline bool vtime_generic_enabled_cpu(int cpu)
+{
+ return context_tracking_enabled_cpu(cpu);
+}
+
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+extern void vtime_account_idle(struct task_struct *tsk);
extern void vtime_account_irq(struct task_struct *tsk, unsigned int offset);
extern void vtime_account_softirq(struct task_struct *tsk);
extern void vtime_account_hardirq(struct task_struct *tsk);
@@ -74,7 +79,7 @@ static inline bool vtime_accounting_enabled(void)
static inline bool vtime_accounting_enabled_cpu(int cpu)
{
- return context_tracking_enabled_cpu(cpu);
+ return vtime_generic_enabled_cpu(cpu);
}
static inline bool vtime_accounting_enabled_this_cpu(void)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5dcb0f2e01bc..5613838d0307 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -759,9 +759,9 @@ void vtime_guest_exit(struct task_struct *tsk)
}
EXPORT_SYMBOL_GPL(vtime_guest_exit);
-void vtime_account_idle(struct task_struct *tsk)
+static void __vtime_account_idle(struct vtime *vtime)
{
- account_idle_time(get_vtime_delta(&tsk->vtime));
+ account_idle_time(get_vtime_delta(vtime));
}
void vtime_task_switch_generic(struct task_struct *prev)
@@ -770,7 +770,7 @@ void vtime_task_switch_generic(struct task_struct *prev)
write_seqcount_begin(&vtime->seqcount);
if (vtime->state == VTIME_IDLE)
- vtime_account_idle(prev);
+ __vtime_account_idle(vtime);
else
__vtime_account_kernel(prev, vtime);
vtime->state = VTIME_INACTIVE;
@@ -912,6 +912,7 @@ static int kcpustat_field_vtime(u64 *cpustat,
int cpu, u64 *val)
{
struct vtime *vtime = &tsk->vtime;
+ struct rq *rq = cpu_rq(cpu);
unsigned int seq;
do {
@@ -953,6 +954,14 @@ static int kcpustat_field_vtime(u64 *cpustat,
if (state == VTIME_GUEST && task_nice(tsk) > 0)
*val += vtime->gtime + vtime_delta(vtime);
break;
+ case CPUTIME_IDLE:
+ if (state == VTIME_IDLE && !atomic_read(&rq->nr_iowait))
+ *val += vtime_delta(vtime);
+ break;
+ case CPUTIME_IOWAIT:
+ if (state == VTIME_IDLE && atomic_read(&rq->nr_iowait) > 0)
+ *val += vtime_delta(vtime);
+ break;
default:
break;
}
@@ -1015,8 +1024,8 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
*dst = *src;
cpustat = dst->cpustat;
- /* Task is sleeping, dead or idle, nothing to add */
- if (state < VTIME_SYS)
+ /* Task is sleeping or dead, nothing to add */
+ if (state < VTIME_IDLE)
continue;
delta = vtime_delta(vtime);
@@ -1025,15 +1034,17 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
* Task runs either in user (including guest) or kernel space,
* add pending nohz time to the right place.
*/
- if (state == VTIME_SYS) {
+ switch (vtime->state) {
+ case VTIME_SYS:
cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
- } else if (state == VTIME_USER) {
+ break;
+ case VTIME_USER:
if (task_nice(tsk) > 0)
cpustat[CPUTIME_NICE] += vtime->utime + delta;
else
cpustat[CPUTIME_USER] += vtime->utime + delta;
- } else {
- WARN_ON_ONCE(state != VTIME_GUEST);
+ break;
+ case VTIME_GUEST:
if (task_nice(tsk) > 0) {
cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta;
cpustat[CPUTIME_NICE] += vtime->gtime + delta;
@@ -1041,6 +1052,15 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
cpustat[CPUTIME_GUEST] += vtime->gtime + delta;
cpustat[CPUTIME_USER] += vtime->gtime + delta;
}
+ break;
+ case VTIME_IDLE:
+ if (atomic_read(&cpu_rq(cpu)->nr_iowait) > 0)
+ cpustat[CPUTIME_IOWAIT] += delta;
+ else
+ cpustat[CPUTIME_IDLE] += delta;
+ break;
+ default:
+ WARN_ON_ONCE(1);
}
} while (read_seqcount_retry(&vtime->seqcount, seq));
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8ddf74e705d3..9632066aea4d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -774,9 +774,10 @@ static void tick_nohz_start_idle(struct tick_sched *ts)
sched_clock_idle_sleep_event();
}
-static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
+static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx, ktime_t *sleeptime,
bool compute_delta, u64 *last_update_time)
{
+ struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
ktime_t now, idle;
unsigned int seq;
@@ -787,6 +788,11 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
if (last_update_time)
*last_update_time = ktime_to_us(now);
+ if (vtime_generic_enabled_cpu(cpu)) {
+ idle = kcpustat_field(idx, cpu);
+ return ktime_to_us(idle);
+ }
+
do {
seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
@@ -824,7 +830,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
{
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- return get_cpu_sleep_time_us(ts, &ts->idle_sleeptime,
+ return get_cpu_sleep_time_us(cpu, CPUTIME_IDLE, &ts->idle_sleeptime,
!nr_iowait_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
@@ -850,7 +856,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
{
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- return get_cpu_sleep_time_us(ts, &ts->iowait_sleeptime,
+ return get_cpu_sleep_time_us(cpu, CPUTIME_IOWAIT, &ts->iowait_sleeptime,
nr_iowait_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
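The new CPUTIME_IDLE/CPUTIME_IOWAIT handling can be modeled in userspace (a simplified sketch: the seqcount retry loop and task state handling are omitted, and all names are illustrative):

```c
#include <assert.h>

/* Pending idle vtime is credited to IOWAIT when tasks on the runqueue
 * are blocked on I/O, and to IDLE otherwise. */
enum vtime_state_sim { VTIME_IDLE_SIM, VTIME_SYS_SIM };
enum usage_sim { IDLE_SIM, IOWAIT_SIM };

static unsigned long long
vtime_idle_field_sim(enum vtime_state_sim state, int nr_iowait,
		     unsigned long long delta, enum usage_sim usage)
{
	if (state != VTIME_IDLE_SIM)
		return 0;	/* CPU not idle: nothing pending to add */
	if (usage == IOWAIT_SIM)
		return nr_iowait > 0 ? delta : 0;
	return nr_iowait == 0 ? delta : 0;
}
```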
--
2.51.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-06 14:22 [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
` (2 preceding siblings ...)
2026-02-06 14:22 ` [PATCH 03/15] sched/cputime: Correctly support generic vtime idle time Frederic Weisbecker
@ 2026-02-06 14:22 ` Frederic Weisbecker
2026-02-19 18:30 ` Shrikanth Hegde
2026-02-06 14:22 ` [PATCH 05/15] s390/time: " Frederic Weisbecker
` (11 subsequent siblings)
15 siblings, 1 reply; 40+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev, Shrikanth Hegde
Currently the tick subsystem stores the idle cputime accounting in
private fields, allowing cohabitation with architecture idle vtime
accounting. The former is fetched on online CPUs, the latter on offline
CPUs.
For consolidation purposes, architecture vtime accounting will continue
to account the cputime but will pause while the idle tick is stopped. The
dyntick cputime accounting will then be relayed by the tick subsystem so
that the idle cputime is still seen advancing coherently even when the
tick isn't there to flush the idle vtime.
Prepare for that and introduce three new APIs which will be used in
subsequent patches:
_ vtime_dynticks_start() is deemed to be called when idle enters in
dyntick mode. The idle cputime that elapsed so far is accumulated.
- vtime_dynticks_stop() is deemed to be called when idle exits from
dyntick mode. The vtime entry clocks are fast-forward to current time
so that idle accounting restarts elapsing from now.
- vtime_reset() is deemed to be called from dynticks idle IRQ entry to
fast-forward the clock to current time so that the IRQ time is still
accounted by vtime while nohz cputime is paused.
Also, accumulated vtime won't be flushed from dyntick-idle ticks, to
avoid accounting the idle cputime twice alongside the nohz accounting.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
arch/powerpc/kernel/time.c | 41 ++++++++++++++++++++++++++++++++++++++
include/linux/vtime.h | 6 ++++++
2 files changed, 47 insertions(+)
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 4bbeb8644d3d..18506740f4a4 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -376,6 +376,47 @@ void vtime_task_switch(struct task_struct *prev)
acct->starttime = acct0->starttime;
}
}
+
+#ifdef CONFIG_NO_HZ_COMMON
+/**
+ * vtime_reset - Fast forward vtime entry clocks
+ *
+ * Called from dynticks idle IRQ entry to fast-forward the clocks to current time
+ * so that the IRQ time is still accounted by vtime while nohz cputime is paused.
+ */
+void vtime_reset(void)
+{
+ struct cpu_accounting_data *acct = get_accounting(current);
+
+ acct->starttime = mftb();
+#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
+ acct->startspurr = read_spurr(acct->starttime);
+#endif
+}
+
+/**
+ * vtime_dyntick_start - Inform vtime about entry to idle-dynticks
+ *
+ * Called when idle enters in dyntick mode. The idle cputime that elapsed so far
+ * is accumulated and the tick subsystem takes over the idle cputime accounting.
+ */
+void vtime_dyntick_start(void)
+{
+ vtime_account_idle(current);
+}
+
+/**
+ * vtime_dyntick_stop - Inform vtime about exit from idle-dynticks
+ *
+ * Called when idle exits from dyntick mode. The vtime entry clocks are
+ * fast-forwarded to current time so that idle accounting restarts elapsing from
+ * now.
+ */
+void vtime_dyntick_stop(void)
+{
+ vtime_reset();
+}
+#endif /* CONFIG_NO_HZ_COMMON */
#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
void __no_kcsan __delay(unsigned long loops)
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 336875bea767..61b94c12d7dd 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -37,11 +37,17 @@ extern void vtime_account_irq(struct task_struct *tsk, unsigned int offset);
extern void vtime_account_softirq(struct task_struct *tsk);
extern void vtime_account_hardirq(struct task_struct *tsk);
extern void vtime_flush(struct task_struct *tsk);
+extern void vtime_reset(void);
+extern void vtime_dyntick_start(void);
+extern void vtime_dyntick_stop(void);
#else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
static inline void vtime_account_irq(struct task_struct *tsk, unsigned int offset) { }
static inline void vtime_account_softirq(struct task_struct *tsk) { }
static inline void vtime_account_hardirq(struct task_struct *tsk) { }
static inline void vtime_flush(struct task_struct *tsk) { }
+static inline void vtime_reset(void) { }
+static inline void vtime_dyntick_start(void) { }
+static inline void vtime_dyntick_stop(void) { }
#endif
/*
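The intended handoff between vtime and nohz accounting can be modeled with a plain counter standing in for the clock (a sketch under the assumption that accumulate/fast-forward is all these hooks do; the `_sim` names are invented for the example):

```c
#include <assert.h>

static unsigned long long now;		/* stand-in for mftb()/sched_clock() */
static unsigned long long starttime;	/* acct->starttime */
static unsigned long long vtime_idle;	/* idle cputime flushed by vtime */

/* Accumulate the idle span elapsed so far and restart from now. */
static void vtime_account_idle_sim(void)
{
	vtime_idle += now - starttime;
	starttime = now;
}

/* Tick stops: flush pending idle vtime, nohz takes over. */
static void vtime_dyntick_start_sim(void)
{
	vtime_account_idle_sim();
}

/* Fast-forward the start clock so the stopped span isn't re-counted. */
static void vtime_reset_sim(void)
{
	starttime = now;
}

/* Tick restarts: the span while stopped belongs to nohz accounting. */
static void vtime_dyntick_stop_sim(void)
{
	vtime_reset_sim();
}
```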
--
2.51.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-06 14:22 ` [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle Frederic Weisbecker
@ 2026-02-19 18:30 ` Shrikanth Hegde
2026-02-24 15:41 ` Christophe Leroy (CS GROUP)
0 siblings, 1 reply; 40+ messages in thread
From: Shrikanth Hegde @ 2026-02-19 18:30 UTC (permalink / raw)
To: Frederic Weisbecker, LKML, Madhavan Srinivasan
Cc: Christophe Leroy (CS GROUP), Rafael J. Wysocki, Alexander Gordeev,
Anna-Maria Behnsen, Ben Segall, Boqun Feng, Christian Borntraeger,
Dietmar Eggemann, Heiko Carstens, Ingo Molnar, Jan Kiszka,
Joel Fernandes, Juri Lelli, Kieran Bingham, Mel Gorman,
Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev
On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
> Currently the tick subsystem stores the idle cputime accounting in
> private fields, allowing cohabitation with architecture idle vtime
> accounting. The former is fetched on online CPUs, the latter on offline
> CPUs.
>
> For consolidation purpose, architecture vtime accounting will continue
> to account the cputime but will make a break when the idle tick is
> stopped. The dyntick cputime accounting will then be relayed by the tick
> subsystem so that the idle cputime is still seen advancing coherently
> even when the tick isn't there to flush the idle vtime.
>
> Prepare for that and introduce three new APIs which will be used in
> subsequent patches:
>
> _ vtime_dynticks_start() is deemed to be called when idle enters in
> dyntick mode. The idle cputime that elapsed so far is accumulated.
>
> - vtime_dynticks_stop() is deemed to be called when idle exits from
> dyntick mode. The vtime entry clocks are fast-forward to current time
> so that idle accounting restarts elapsing from now.
>
> - vtime_reset() is deemed to be called from dynticks idle IRQ entry to
> fast-forward the clock to current time so that the IRQ time is still
> accounted by vtime while nohz cputime is paused.
>
> Also accumulated vtime won't be flushed from dyntick-idle ticks to avoid
> accounting twice the idle cputime, along with nohz accounting.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> arch/powerpc/kernel/time.c | 41 ++++++++++++++++++++++++++++++++++++++
> include/linux/vtime.h | 6 ++++++
> 2 files changed, 47 insertions(+)
>
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 4bbeb8644d3d..18506740f4a4 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -376,6 +376,47 @@ void vtime_task_switch(struct task_struct *prev)
> acct->starttime = acct0->starttime;
> }
> }
> +
> +#ifdef CONFIG_NO_HZ_COMMON
> +/**
> + * vtime_reset - Fast forward vtime entry clocks
> + *
> + * Called from dynticks idle IRQ entry to fast-forward the clocks to current time
> + * so that the IRQ time is still accounted by vtime while nohz cputime is paused.
> + */
> +void vtime_reset(void)
> +{
> + struct cpu_accounting_data *acct = get_accounting(current);
> +
> + acct->starttime = mftb();
I figured out why those huge values happen.
This happens because mftb counts from when the system was booted.
I was using kexec to start the new kernel, and mftb wasn't getting
reset.
I thought about this. It is a concern for pseries too, where LPARs
restart but the system doesn't, so mftb keeps running instead of being
reset.
I think we should be using sched_clock instead of mftb here.
Though it is needed in a few more places, along with some cosmetic
changes around it.
Note: some huge values already show up without the series on a few CPUs;
with the series they show up on most of the CPUs.
So I am planning to send out the fix below separately, keeping your
series as a dependency.
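The symptom can be illustrated in userspace (assumed mechanism, simulated: a raw timebase that keeps counting across kexec versus a clock normalized to the current kernel's boot; all `_sim` names are invented):

```c
#include <assert.h>

static unsigned long long tb = 1000000;	/* raw timebase, survives kexec */
static unsigned long long boot_tb;	/* sampled at (new) kernel boot */

static unsigned long long mftb_sim(void)
{
	return tb;			/* raw, never reset */
}

static unsigned long long sched_clock_sim(void)
{
	return tb - boot_tb;		/* normalized to this kernel's boot */
}

/* First delta computed against freshly zeroed accounting data. */
static unsigned long long first_delta(unsigned long long (*clock)(void))
{
	unsigned long long starttime = 0;

	return clock() - starttime;
}
```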
---
arch/powerpc/include/asm/accounting.h | 4 ++--
arch/powerpc/include/asm/cputime.h | 14 +++++++-------
arch/powerpc/kernel/time.c | 22 +++++++++++-----------
3 files changed, 20 insertions(+), 20 deletions(-)
diff --git a/arch/powerpc/include/asm/accounting.h b/arch/powerpc/include/asm/accounting.h
index 6d79c31700e2..50f120646e6d 100644
--- a/arch/powerpc/include/asm/accounting.h
+++ b/arch/powerpc/include/asm/accounting.h
@@ -21,8 +21,8 @@ struct cpu_accounting_data {
unsigned long steal_time;
unsigned long idle_time;
/* Internal counters */
- unsigned long starttime; /* TB value snapshot */
- unsigned long starttime_user; /* TB value on exit to usermode */
+ unsigned long starttime; /* Time value snapshot */
+ unsigned long starttime_user; /* Time value on exit to usermode */
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
unsigned long startspurr; /* SPURR value snapshot */
unsigned long utime_sspurr; /* ->user_time when ->startspurr set */
diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index aff858ca99c0..eb6b629b113f 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -20,9 +20,9 @@
#include <asm/time.h>
#include <asm/param.h>
#include <asm/firmware.h>
+#include <linux/sched/clock.h>
#ifdef __KERNEL__
-#define cputime_to_nsecs(cputime) tb_to_ns(cputime)
/*
* PPC64 uses PACA which is task independent for storing accounting data while
@@ -44,20 +44,20 @@
*/
static notrace inline void account_cpu_user_entry(void)
{
- unsigned long tb = mftb();
+ unsigned long now = sched_clock();
struct cpu_accounting_data *acct = raw_get_accounting(current);
- acct->utime += (tb - acct->starttime_user);
- acct->starttime = tb;
+ acct->utime += (now - acct->starttime_user);
+ acct->starttime = now;
}
static notrace inline void account_cpu_user_exit(void)
{
- unsigned long tb = mftb();
+ unsigned long now = sched_clock();
struct cpu_accounting_data *acct = raw_get_accounting(current);
- acct->stime += (tb - acct->starttime);
- acct->starttime_user = tb;
+ acct->stime += (now - acct->starttime);
+ acct->starttime_user = now;
}
static notrace inline void account_stolen_time(void)
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 18506740f4a4..fb67cdae3bcb 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -215,7 +215,7 @@ static unsigned long vtime_delta(struct cpu_accounting_data *acct,
WARN_ON_ONCE(!irqs_disabled());
- now = mftb();
+ now = sched_clock();
stime = now - acct->starttime;
acct->starttime = now;
@@ -299,9 +299,9 @@ static void vtime_flush_scaled(struct task_struct *tsk,
{
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
if (acct->utime_scaled)
- tsk->utimescaled += cputime_to_nsecs(acct->utime_scaled);
+ tsk->utimescaled += acct->utime_scaled;
if (acct->stime_scaled)
- tsk->stimescaled += cputime_to_nsecs(acct->stime_scaled);
+ tsk->stimescaled += acct->stime_scaled;
acct->utime_scaled = 0;
acct->utime_sspurr = 0;
@@ -321,28 +321,28 @@ void vtime_flush(struct task_struct *tsk)
struct cpu_accounting_data *acct = get_accounting(tsk);
if (acct->utime)
- account_user_time(tsk, cputime_to_nsecs(acct->utime));
+ account_user_time(tsk, acct->utime);
if (acct->gtime)
- account_guest_time(tsk, cputime_to_nsecs(acct->gtime));
+ account_guest_time(tsk, acct->gtime);
if (IS_ENABLED(CONFIG_PPC_SPLPAR) && acct->steal_time) {
- account_steal_time(cputime_to_nsecs(acct->steal_time));
+ account_steal_time(acct->steal_time);
acct->steal_time = 0;
}
if (acct->idle_time)
- account_idle_time(cputime_to_nsecs(acct->idle_time));
+ account_idle_time(acct->idle_time);
if (acct->stime)
- account_system_index_time(tsk, cputime_to_nsecs(acct->stime),
+ account_system_index_time(tsk, acct->stime,
CPUTIME_SYSTEM);
if (acct->hardirq_time)
- account_system_index_time(tsk, cputime_to_nsecs(acct->hardirq_time),
+ account_system_index_time(tsk, acct->hardirq_time,
CPUTIME_IRQ);
if (acct->softirq_time)
- account_system_index_time(tsk, cputime_to_nsecs(acct->softirq_time),
+ account_system_index_time(tsk, acct->softirq_time,
CPUTIME_SOFTIRQ);
vtime_flush_scaled(tsk, acct);
@@ -388,7 +388,7 @@ void vtime_reset(void)
{
struct cpu_accounting_data *acct = get_accounting(current);
- acct->starttime = mftb();
+ acct->starttime = sched_clock();
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
acct->startspurr = read_spurr(acct->starttime);
#endif
--
2.43.0
^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-19 18:30 ` Shrikanth Hegde
@ 2026-02-24 15:41 ` Christophe Leroy (CS GROUP)
2026-02-25 7:46 ` Shrikanth Hegde
0 siblings, 1 reply; 40+ messages in thread
From: Christophe Leroy (CS GROUP) @ 2026-02-24 15:41 UTC (permalink / raw)
To: Shrikanth Hegde, Frederic Weisbecker, LKML, Madhavan Srinivasan
Cc: Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
Hi Hegde,
On 19/02/2026 at 19:30, Shrikanth Hegde wrote:
>
>
> On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
>> Currently the tick subsystem stores the idle cputime accounting in
>> private fields, allowing cohabitation with architecture idle vtime
>> accounting. The former is fetched on online CPUs, the latter on offline
>> CPUs.
>>
>> For consolidation purpose, architecture vtime accounting will continue
>> to account the cputime but will make a break when the idle tick is
>> stopped. The dyntick cputime accounting will then be relayed by the tick
>> subsystem so that the idle cputime is still seen advancing coherently
>> even when the tick isn't there to flush the idle vtime.
>>
>> Prepare for that and introduce three new APIs which will be used in
>> subsequent patches:
>>
>> _ vtime_dynticks_start() is deemed to be called when idle enters in
>> dyntick mode. The idle cputime that elapsed so far is accumulated.
>>
>> - vtime_dynticks_stop() is deemed to be called when idle exits from
>> dyntick mode. The vtime entry clocks are fast-forward to current time
>> so that idle accounting restarts elapsing from now.
>>
>> - vtime_reset() is deemed to be called from dynticks idle IRQ entry to
>> fast-forward the clock to current time so that the IRQ time is still
>> accounted by vtime while nohz cputime is paused.
>>
>> Also accumulated vtime won't be flushed from dyntick-idle ticks to avoid
>> accounting twice the idle cputime, along with nohz accounting.
>>
>> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>
> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>
>> ---
>> arch/powerpc/kernel/time.c | 41 ++++++++++++++++++++++++++++++++++++++
>> include/linux/vtime.h | 6 ++++++
>> 2 files changed, 47 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>> index 4bbeb8644d3d..18506740f4a4 100644
>> --- a/arch/powerpc/kernel/time.c
>> +++ b/arch/powerpc/kernel/time.c
>> @@ -376,6 +376,47 @@ void vtime_task_switch(struct task_struct *prev)
>> acct->starttime = acct0->starttime;
>> }
>> }
>> +
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +/**
>> + * vtime_reset - Fast forward vtime entry clocks
>> + *
>> + * Called from dynticks idle IRQ entry to fast-forward the clocks to
>> current time
>> + * so that the IRQ time is still accounted by vtime while nohz
>> cputime is paused.
>> + */
>> +void vtime_reset(void)
>> +{
>> + struct cpu_accounting_data *acct = get_accounting(current);
>> +
>> + acct->starttime = mftb();
>
> I figured out why those huge values happen.
>
> This happens because mftb is from when the system is booted.
> I was doing kexec to start the new kernel and mftb wasn't getting
> reset.
>
> I thought about this. This is concern for pseries too, where LPAR's
> restart but system won't restart and mftb will continue to run instead of
> reset.
>
> I think we should be using sched_clock instead of mftb here.
> Though we need it a few more places and some cosmetic changes around it.
>
> Note: Some values being huge exists without series for few CPUs, with
> series it
> shows up in most of the CPUs.
>
> So I am planning send out fix below fix separately keeping your
> series as dependency.
>
> ---
> arch/powerpc/include/asm/accounting.h | 4 ++--
> arch/powerpc/include/asm/cputime.h | 14 +++++++-------
> arch/powerpc/kernel/time.c | 22 +++++++++++-----------
> 3 files changed, 20 insertions(+), 20 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/accounting.h b/arch/powerpc/
> include/asm/accounting.h
> index 6d79c31700e2..50f120646e6d 100644
> --- a/arch/powerpc/include/asm/accounting.h
> +++ b/arch/powerpc/include/asm/accounting.h
> @@ -21,8 +21,8 @@ struct cpu_accounting_data {
> unsigned long steal_time;
> unsigned long idle_time;
> /* Internal counters */
> - unsigned long starttime; /* TB value snapshot */
> - unsigned long starttime_user; /* TB value on exit to usermode */
> + unsigned long starttime; /* Time value snapshot */
> + unsigned long starttime_user; /* Time value on exit to usermode */
> #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
> unsigned long startspurr; /* SPURR value snapshot */
> unsigned long utime_sspurr; /* ->user_time when ->startspurr
> set */
> diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/
> asm/cputime.h
> index aff858ca99c0..eb6b629b113f 100644
> --- a/arch/powerpc/include/asm/cputime.h
> +++ b/arch/powerpc/include/asm/cputime.h
> @@ -20,9 +20,9 @@
> #include <asm/time.h>
> #include <asm/param.h>
> #include <asm/firmware.h>
> +#include <linux/sched/clock.h>
>
> #ifdef __KERNEL__
> -#define cputime_to_nsecs(cputime) tb_to_ns(cputime)
>
> /*
> * PPC64 uses PACA which is task independent for storing accounting
> data while
> @@ -44,20 +44,20 @@
> */
> static notrace inline void account_cpu_user_entry(void)
> {
> - unsigned long tb = mftb();
> + unsigned long now = sched_clock();
No way!
By doing that you'll kill performance for no reason. All we need when
accounting time spent in kernel or in user is the difference between
the time at entry and the time at exit, no matter what the time was at boot.
Also, sched_clock() returns nanoseconds, which implies a calculation from
the timebase. That is pointless CPU consumption. The current implementation
calculates nanoseconds at task switch, when calling vtime_flush(). Your
change will now do it at every kernel entry and exit by calling
sched_clock().
Another point is that sched_clock() returns a long long, not a long.
And sched_clock() uses get_tb(), which does both mftb and mftbu. That is
pointless for calculating time deltas unless your application spends
hours without being rescheduled.
> struct cpu_accounting_data *acct = raw_get_accounting(current);
>
> - acct->utime += (tb - acct->starttime_user);
> - acct->starttime = tb;
> + acct->utime += (now - acct->starttime_user);
> + acct->starttime = now;
> }
>
> static notrace inline void account_cpu_user_exit(void)
> {
> - unsigned long tb = mftb();
> + unsigned long now = sched_clock();
> struct cpu_accounting_data *acct = raw_get_accounting(current);
>
> - acct->stime += (tb - acct->starttime);
> - acct->starttime_user = tb;
> + acct->stime += (now - acct->starttime);
> + acct->starttime_user = now;
> }
>
> static notrace inline void account_stolen_time(void)
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 18506740f4a4..fb67cdae3bcb 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -215,7 +215,7 @@ static unsigned long vtime_delta(struct
> cpu_accounting_data *acct,
>
> WARN_ON_ONCE(!irqs_disabled());
>
> - now = mftb();
> + now = sched_clock();
> stime = now - acct->starttime;
> acct->starttime = now;
>
> @@ -299,9 +299,9 @@ static void vtime_flush_scaled(struct task_struct *tsk,
> {
> #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
> if (acct->utime_scaled)
> - tsk->utimescaled += cputime_to_nsecs(acct->utime_scaled);
> + tsk->utimescaled += acct->utime_scaled;
> if (acct->stime_scaled)
> - tsk->stimescaled += cputime_to_nsecs(acct->stime_scaled);
> + tsk->stimescaled += acct->stime_scaled;
>
> acct->utime_scaled = 0;
> acct->utime_sspurr = 0;
> @@ -321,28 +321,28 @@ void vtime_flush(struct task_struct *tsk)
> struct cpu_accounting_data *acct = get_accounting(tsk);
>
> if (acct->utime)
> - account_user_time(tsk, cputime_to_nsecs(acct->utime));
> + account_user_time(tsk, acct->utime);
>
> if (acct->gtime)
> - account_guest_time(tsk, cputime_to_nsecs(acct->gtime));
> + account_guest_time(tsk, acct->gtime);
>
> if (IS_ENABLED(CONFIG_PPC_SPLPAR) && acct->steal_time) {
> - account_steal_time(cputime_to_nsecs(acct->steal_time));
> + account_steal_time(acct->steal_time);
> acct->steal_time = 0;
> }
>
> if (acct->idle_time)
> - account_idle_time(cputime_to_nsecs(acct->idle_time));
> + account_idle_time(acct->idle_time);
>
> if (acct->stime)
> - account_system_index_time(tsk, cputime_to_nsecs(acct->stime),
> + account_system_index_time(tsk, acct->stime,
> CPUTIME_SYSTEM);
>
> if (acct->hardirq_time)
> - account_system_index_time(tsk, cputime_to_nsecs(acct-
> >hardirq_time),
> + account_system_index_time(tsk, acct->hardirq_time,
> CPUTIME_IRQ);
> if (acct->softirq_time)
> - account_system_index_time(tsk, cputime_to_nsecs(acct-
> >softirq_time),
> + account_system_index_time(tsk, acct->softirq_time,
> CPUTIME_SOFTIRQ);
>
> vtime_flush_scaled(tsk, acct);
> @@ -388,7 +388,7 @@ void vtime_reset(void)
> {
> struct cpu_accounting_data *acct = get_accounting(current);
>
> - acct->starttime = mftb();
> + acct->starttime = sched_clock();
> #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
> acct->startspurr = read_spurr(acct->starttime);
> #endif
^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-24 15:41 ` Christophe Leroy (CS GROUP)
@ 2026-02-25 7:46 ` Shrikanth Hegde
2026-02-25 9:45 ` Christophe Leroy (CS GROUP)
2026-02-26 7:32 ` Christophe Leroy (CS GROUP)
0 siblings, 2 replies; 40+ messages in thread
From: Shrikanth Hegde @ 2026-02-25 7:46 UTC (permalink / raw)
To: Christophe Leroy (CS GROUP), Frederic Weisbecker, LKML,
Madhavan Srinivasan
Cc: Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
Hi Christophe,
On 2/24/26 9:11 PM, Christophe Leroy (CS GROUP) wrote:
> Hi Hegde,
>
> On 19/02/2026 at 19:30, Shrikanth Hegde wrote:
>>
>>
>> On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
>>> Currently the tick subsystem stores the idle cputime accounting in
>>> private fields, allowing cohabitation with architecture idle vtime
>>> accounting. The former is fetched on online CPUs, the latter on offline
>>> CPUs.
>>>
>>> For consolidation purpose, architecture vtime accounting will continue
>>> to account the cputime but will make a break when the idle tick is
>>> stopped. The dyntick cputime accounting will then be relayed by the tick
>>> subsystem so that the idle cputime is still seen advancing coherently
>>> even when the tick isn't there to flush the idle vtime.
>>>
>>> Prepare for that and introduce three new APIs which will be used in
>>> subsequent patches:
>>>
>>> _ vtime_dynticks_start() is deemed to be called when idle enters in
>>> dyntick mode. The idle cputime that elapsed so far is accumulated.
>>>
>>> - vtime_dynticks_stop() is deemed to be called when idle exits from
>>> dyntick mode. The vtime entry clocks are fast-forward to current time
>>> so that idle accounting restarts elapsing from now.
>>>
>>> - vtime_reset() is deemed to be called from dynticks idle IRQ entry to
>>> fast-forward the clock to current time so that the IRQ time is still
>>> accounted by vtime while nohz cputime is paused.
>>>
>>> Also accumulated vtime won't be flushed from dyntick-idle ticks to avoid
>>> accounting twice the idle cputime, along with nohz accounting.
>>>
>>> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>>
>> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>
>>> ---
>>> arch/powerpc/kernel/time.c | 41 ++++++++++++++++++++++++++++++++++++++
>>> include/linux/vtime.h | 6 ++++++
>>> 2 files changed, 47 insertions(+)
>>>
>>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>>> index 4bbeb8644d3d..18506740f4a4 100644
>>> --- a/arch/powerpc/kernel/time.c
>>> +++ b/arch/powerpc/kernel/time.c
>>> @@ -376,6 +376,47 @@ void vtime_task_switch(struct task_struct *prev)
>>> acct->starttime = acct0->starttime;
>>> }
>>> }
>>> +
>>> +#ifdef CONFIG_NO_HZ_COMMON
>>> +/**
>>> + * vtime_reset - Fast forward vtime entry clocks
>>> + *
>>> + * Called from dynticks idle IRQ entry to fast-forward the clocks to
>>> current time
>>> + * so that the IRQ time is still accounted by vtime while nohz
>>> cputime is paused.
>>> + */
>>> +void vtime_reset(void)
>>> +{
>>> + struct cpu_accounting_data *acct = get_accounting(current);
>>> +
>>> + acct->starttime = mftb();
>>
>> I figured out why those huge values happen.
>>
>> This happens because mftb is from when the system is booted.
>> I was doing kexec to start the new kernel and mftb wasn't getting
>> reset.
>>
>> I thought about this. This is concern for pseries too, where LPAR's
>> restart but system won't restart and mftb will continue to run instead of
>> reset.
>>
>> I think we should be using sched_clock instead of mftb here.
>> Though we need it a few more places and some cosmetic changes around it.
>>
>> Note: Some values being huge exists without series for few CPUs, with
>> series it
>> shows up in most of the CPUs.
>>
>> So I am planning send out fix below fix separately keeping your
>> series as dependency.
>>
>> ---
>> arch/powerpc/include/asm/accounting.h | 4 ++--
>> arch/powerpc/include/asm/cputime.h | 14 +++++++-------
>> arch/powerpc/kernel/time.c | 22 +++++++++++-----------
>> 3 files changed, 20 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/accounting.h b/arch/powerpc/
>> include/asm/accounting.h
>> index 6d79c31700e2..50f120646e6d 100644
>> --- a/arch/powerpc/include/asm/accounting.h
>> +++ b/arch/powerpc/include/asm/accounting.h
>> @@ -21,8 +21,8 @@ struct cpu_accounting_data {
>> unsigned long steal_time;
>> unsigned long idle_time;
>> /* Internal counters */
>> - unsigned long starttime; /* TB value snapshot */
>> - unsigned long starttime_user; /* TB value on exit to usermode */
>> + unsigned long starttime; /* Time value snapshot */
>> + unsigned long starttime_user; /* Time value on exit to
>> usermode */
>> #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
>> unsigned long startspurr; /* SPURR value snapshot */
>> unsigned long utime_sspurr; /* ->user_time when ->startspurr
>> set */
>> diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/
>> include/ asm/cputime.h
>> index aff858ca99c0..eb6b629b113f 100644
>> --- a/arch/powerpc/include/asm/cputime.h
>> +++ b/arch/powerpc/include/asm/cputime.h
>> @@ -20,9 +20,9 @@
>> #include <asm/time.h>
>> #include <asm/param.h>
>> #include <asm/firmware.h>
>> +#include <linux/sched/clock.h>
>>
>> #ifdef __KERNEL__
>> -#define cputime_to_nsecs(cputime) tb_to_ns(cputime)
>>
>> /*
>> * PPC64 uses PACA which is task independent for storing accounting
>> data while
>> @@ -44,20 +44,20 @@
>> */
>> static notrace inline void account_cpu_user_entry(void)
>> {
>> - unsigned long tb = mftb();
>> + unsigned long now = sched_clock();
>
> Now way !
>
> By doing that you'll kill performance for no reason. All we need when
> accounting time spent in kernel or in user is the difference between
> time at entry and time at exit, no mater what the time was at boot time.
>
No. With this patch there will not be any performance difference.
All it does is use sched_clock instead of mftb at those places.
In arch/powerpc/kernel/time.c we have sched_clock().
notrace unsigned long long sched_clock(void)
{
return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
}
It does the same mftb call, and accounts only the time after boot, which is
what /proc/stat should do as well.
"
the amount of time, measured in units of USER_HZ
(1/100ths of a second on most architectures
user (1) Time spent in user mode.
idle (4) Time spent in the idle task. This value
should be USER_HZ times the second entry in
the /proc/uptime pseudo-file.
"
/proc/uptime is based on sched_clock, so I infer /proc/stat should also
show values relative to boot of the OS.
> Also sched_clock() returns nanoseconds which implies calculation from
> timebase. This is pointless CPU consumption. The current implementation
> calculates nanoseconds at task switch when calling vtime_flush().Your
> change will now do it at every kernel entry and kernel exit by calling
> sched_clock().
This change doesn't add any additional paths. Even without the patches, mftb
would have been called on every kernel entry/exit; see the mftb usage in
account_cpu_user_enter/exit.
Now sched_clock is used instead of mftb, that's all. No additional entry/exit points.
And previously, when accounting, we would have done cputime_to_nsecs; now that
conversion is done inside sched_clock. So overall, computation-wise it should be the same.
What am I missing here?
>
> Another point is that sched_clock() returns a long long not a long.
Thanks for pointing that out.
OK, let me change some of those variables to unsigned long long.
The compiler didn't warn me, so I didn't notice.
>
> And also sched_clock() uses get_tb() which does mftb and mftbu. Which is
> pointless for calculating time deltas unless your application spends
> hours without being re-scheduled.
>
I didn't get this. Currently we also use mftb, so the functionality should be the same.
Could you please explain how?
>
>> struct cpu_accounting_data *acct = raw_get_accounting(current);
>>
>> - acct->utime += (tb - acct->starttime_user);
>> - acct->starttime = tb;
>> + acct->utime += (now - acct->starttime_user);
>> + acct->starttime = now;
>> }
>>
>> static notrace inline void account_cpu_user_exit(void)
>> {
>> - unsigned long tb = mftb();
>> + unsigned long now = sched_clock();
>> struct cpu_accounting_data *acct = raw_get_accounting(current);
>>
>> - acct->stime += (tb - acct->starttime);
>> - acct->starttime_user = tb;
>> + acct->stime += (now - acct->starttime);
>> + acct->starttime_user = now;
>> }
>>
>> static notrace inline void account_stolen_time(void)
>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>> index 18506740f4a4..fb67cdae3bcb 100644
>> --- a/arch/powerpc/kernel/time.c
>> +++ b/arch/powerpc/kernel/time.c
>> @@ -215,7 +215,7 @@ static unsigned long vtime_delta(struct
>> cpu_accounting_data *acct,
>>
>> WARN_ON_ONCE(!irqs_disabled());
>>
>> - now = mftb();
>> + now = sched_clock();
>> stime = now - acct->starttime;
>> acct->starttime = now;
>>
>> @@ -299,9 +299,9 @@ static void vtime_flush_scaled(struct task_struct
>> *tsk,
>> {
>> #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
>> if (acct->utime_scaled)
>> - tsk->utimescaled += cputime_to_nsecs(acct->utime_scaled);
>> + tsk->utimescaled += acct->utime_scaled;
>> if (acct->stime_scaled)
>> - tsk->stimescaled += cputime_to_nsecs(acct->stime_scaled);
>> + tsk->stimescaled += acct->stime_scaled;
>>
>> acct->utime_scaled = 0;
>> acct->utime_sspurr = 0;
>> @@ -321,28 +321,28 @@ void vtime_flush(struct task_struct *tsk)
>> struct cpu_accounting_data *acct = get_accounting(tsk);
>>
>> if (acct->utime)
>> - account_user_time(tsk, cputime_to_nsecs(acct->utime));
>> + account_user_time(tsk, acct->utime);
>>
>> if (acct->gtime)
>> - account_guest_time(tsk, cputime_to_nsecs(acct->gtime));
>> + account_guest_time(tsk, acct->gtime);
>>
>> if (IS_ENABLED(CONFIG_PPC_SPLPAR) && acct->steal_time) {
>> - account_steal_time(cputime_to_nsecs(acct->steal_time));
>> + account_steal_time(acct->steal_time);
>> acct->steal_time = 0;
>> }
>>
>> if (acct->idle_time)
>> - account_idle_time(cputime_to_nsecs(acct->idle_time));
>> + account_idle_time(acct->idle_time);
>>
>> if (acct->stime)
>> - account_system_index_time(tsk, cputime_to_nsecs(acct->stime),
>> + account_system_index_time(tsk, acct->stime,
>> CPUTIME_SYSTEM);
>>
>> if (acct->hardirq_time)
>> - account_system_index_time(tsk, cputime_to_nsecs(acct-
>> >hardirq_time),
>> + account_system_index_time(tsk, acct->hardirq_time,
>> CPUTIME_IRQ);
>> if (acct->softirq_time)
>> - account_system_index_time(tsk, cputime_to_nsecs(acct-
>> >softirq_time),
>> + account_system_index_time(tsk, acct->softirq_time,
>> CPUTIME_SOFTIRQ);
>>
>> vtime_flush_scaled(tsk, acct);
>> @@ -388,7 +388,7 @@ void vtime_reset(void)
>> {
>> struct cpu_accounting_data *acct = get_accounting(current);
>>
>> - acct->starttime = mftb();
>> + acct->starttime = sched_clock();
>> #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
>> acct->startspurr = read_spurr(acct->starttime);
>> #endif
>
PS: I measured the performance with hackbench. I don't see any degradation.
^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-25 7:46 ` Shrikanth Hegde
@ 2026-02-25 9:45 ` Christophe Leroy (CS GROUP)
2026-02-25 10:34 ` Shrikanth Hegde
2026-02-26 7:32 ` Christophe Leroy (CS GROUP)
1 sibling, 1 reply; 40+ messages in thread
From: Christophe Leroy (CS GROUP) @ 2026-02-25 9:45 UTC (permalink / raw)
To: Shrikanth Hegde, Frederic Weisbecker, LKML, Madhavan Srinivasan
Cc: Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
Hi Hegde,
On 25/02/2026 at 08:46, Shrikanth Hegde wrote:
> Hi Christophe,
>
> On 2/24/26 9:11 PM, Christophe Leroy (CS GROUP) wrote:
>> Hi Hegde,
>>
>> On 19/02/2026 at 19:30, Shrikanth Hegde wrote:
>>>
>>>
>>> On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
>>>> Currently the tick subsystem stores the idle cputime accounting in
>>>> private fields, allowing cohabitation with architecture idle vtime
>>>> accounting. The former is fetched on online CPUs, the latter on offline
>>>> CPUs.
>>>>
>>>> For consolidation purpose, architecture vtime accounting will continue
>>>> to account the cputime but will make a break when the idle tick is
>>>> stopped. The dyntick cputime accounting will then be relayed by the
>>>> tick
>>>> subsystem so that the idle cputime is still seen advancing coherently
>>>> even when the tick isn't there to flush the idle vtime.
>>>>
>>>> Prepare for that and introduce three new APIs which will be used in
>>>> subsequent patches:
>>>>
>>>> _ vtime_dynticks_start() is deemed to be called when idle enters in
>>>> dyntick mode. The idle cputime that elapsed so far is accumulated.
>>>>
>>>> - vtime_dynticks_stop() is deemed to be called when idle exits from
>>>> dyntick mode. The vtime entry clocks are fast-forward to current
>>>> time
>>>> so that idle accounting restarts elapsing from now.
>>>>
>>>> - vtime_reset() is deemed to be called from dynticks idle IRQ entry to
>>>> fast-forward the clock to current time so that the IRQ time is still
>>>> accounted by vtime while nohz cputime is paused.
>>>>
>>>> Also accumulated vtime won't be flushed from dyntick-idle ticks to
>>>> avoid
>>>> accounting twice the idle cputime, along with nohz accounting.
>>>>
>>>> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>>>
>>> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>>
>>>> ---
>>>> arch/powerpc/kernel/time.c | 41 ++++++++++++++++++++++++++++++++++
>>>> ++++
>>>> include/linux/vtime.h | 6 ++++++
>>>> 2 files changed, 47 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>>>> index 4bbeb8644d3d..18506740f4a4 100644
>>>> --- a/arch/powerpc/kernel/time.c
>>>> +++ b/arch/powerpc/kernel/time.c
>>>> @@ -376,6 +376,47 @@ void vtime_task_switch(struct task_struct *prev)
>>>> acct->starttime = acct0->starttime;
>>>> }
>>>> }
>>>> +
>>>> +#ifdef CONFIG_NO_HZ_COMMON
>>>> +/**
>>>> + * vtime_reset - Fast forward vtime entry clocks
>>>> + *
>>>> + * Called from dynticks idle IRQ entry to fast-forward the clocks
>>>> to current time
>>>> + * so that the IRQ time is still accounted by vtime while nohz
>>>> cputime is paused.
>>>> + */
>>>> +void vtime_reset(void)
>>>> +{
>>>> + struct cpu_accounting_data *acct = get_accounting(current);
>>>> +
>>>> + acct->starttime = mftb();
>>>
>>> I figured out why those huge values happen.
>>>
>>> This happens because mftb is from when the system is booted.
>>> I was doing kexec to start the new kernel and mftb wasn't getting
>>> reset.
>>>
>>> I thought about this. This is concern for pseries too, where LPAR's
>>> restart but system won't restart and mftb will continue to run
>>> instead of
>>> reset.
>>>
>>> I think we should be using sched_clock instead of mftb here.
>>> Though we need it a few more places and some cosmetic changes around it.
>>>
>>> Note: Some values being huge exists without series for few CPUs, with
>>> series it
>>> shows up in most of the CPUs.
>>>
>>> So I am planning send out fix below fix separately keeping your
>>> series as dependency.
>>>
>>> ---
>>> arch/powerpc/include/asm/accounting.h | 4 ++--
>>> arch/powerpc/include/asm/cputime.h | 14 +++++++-------
>>> arch/powerpc/kernel/time.c | 22 +++++++++++-----------
>>> 3 files changed, 20 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/accounting.h b/arch/powerpc/
>>> include/asm/accounting.h
>>> index 6d79c31700e2..50f120646e6d 100644
>>> --- a/arch/powerpc/include/asm/accounting.h
>>> +++ b/arch/powerpc/include/asm/accounting.h
>>> @@ -21,8 +21,8 @@ struct cpu_accounting_data {
>>> unsigned long steal_time;
>>> unsigned long idle_time;
>>> /* Internal counters */
>>> - unsigned long starttime; /* TB value snapshot */
>>> - unsigned long starttime_user; /* TB value on exit to usermode */
>>> + unsigned long starttime; /* Time value snapshot */
>>> + unsigned long starttime_user; /* Time value on exit to
>>> usermode */
>>> #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
>>> unsigned long startspurr; /* SPURR value snapshot */
>>> unsigned long utime_sspurr; /* ->user_time when ->startspurr
>>> set */
>>> diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/
>>> include/ asm/cputime.h
>>> index aff858ca99c0..eb6b629b113f 100644
>>> --- a/arch/powerpc/include/asm/cputime.h
>>> +++ b/arch/powerpc/include/asm/cputime.h
>>> @@ -20,9 +20,9 @@
>>> #include <asm/time.h>
>>> #include <asm/param.h>
>>> #include <asm/firmware.h>
>>> +#include <linux/sched/clock.h>
>>>
>>> #ifdef __KERNEL__
>>> -#define cputime_to_nsecs(cputime) tb_to_ns(cputime)
>>>
>>> /*
>>> * PPC64 uses PACA which is task independent for storing accounting
>>> data while
>>> @@ -44,20 +44,20 @@
>>> */
>>> static notrace inline void account_cpu_user_entry(void)
>>> {
>>> - unsigned long tb = mftb();
>>> + unsigned long now = sched_clock();
>>
>> Now way !
>>
>> By doing that you'll kill performance for no reason. All we need when
>> accounting time spent in kernel or in user is the difference between
>> time at entry and time at exit, no mater what the time was at boot time.
>>
>
> No. With this patch there will not be any performance difference.
> All it does is, instead of using mftb uses sched_clock at those places.
>
>
> In arch/powerpc/kernel/time.c we have sched_clock().
> notrace unsigned long long sched_clock(void)
> {
> return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) <<
> tb_to_ns_shift;
> }
>
> It does the same mftb call, and accounts only the time after boot, which is
> what /proc/stat should do as well.
>
> "
> the amount of time, measured in units of USER_HZ
> (1/100ths of a second on most architectures
>
> user (1) Time spent in user mode.
>
> idle (4) Time spent in the idle task. This value
> should be USER_HZ times the second entry in
> the /proc/uptime pseudo-file.
> "
> /proc/uptime is based on sched_clock, so i infer /proc/stat also should
> show
> values w.r.t to boot of the OS.
>
>
>> Also sched_clock() returns nanoseconds, which implies calculation from
>> the timebase. This is pointless CPU consumption. The current
>> implementation calculates nanoseconds at task switch when calling
>> vtime_flush(). Your change will now do it at every kernel entry and
>> kernel exit by calling sched_clock().
>
> This change doesn't add any additional paths. Even without the patches,
> mftb would have been called on every kernel entry/exit; see the mftb usage
> in account_cpu_user_entry/exit().
>
> Now sched_clock() is used instead of mftb, that's all. No additional
> entry/exit points. And previously when accounting we would have done
> cputime_to_nsecs(); now that conversion is done automatically in
> sched_clock(). So overall, computation-wise it should be the same.
>
> What am I missing here?
Ok, let's try to explain in more detail:
While a process is running, it will enter and leave the kernel multiple
times without a task switch, for instance for system calls or interrupts.
At every kernel entry and exit, account_cpu_user_entry() and
account_cpu_user_exit() are called. That's a very hot path.
I have added the following functions to see what the code looks like:
+
+void my_account_cpu_user_entry(void);
+void my_account_cpu_user_entry(void)
+{
+ account_cpu_user_entry();
+}
+
+void my_account_cpu_user_exit(void);
+void my_account_cpu_user_exit(void)
+{
+ account_cpu_user_exit();
+}
What we have today is very optimised:
00000148 <my_account_cpu_user_entry>:
148: 7d 0c 42 e6 mftb r8
14c: 80 e2 00 08 lwz r7,8(r2)
150: 81 22 00 28 lwz r9,40(r2)
154: 91 02 00 24 stw r8,36(r2)
158: 7d 29 38 50 subf r9,r9,r7
15c: 7d 29 42 14 add r9,r9,r8
160: 91 22 00 08 stw r9,8(r2)
164: 4e 80 00 20 blr
00000168 <my_account_cpu_user_exit>:
168: 7d 0c 42 e6 mftb r8
16c: 80 e2 00 0c lwz r7,12(r2)
170: 81 22 00 24 lwz r9,36(r2)
174: 91 02 00 28 stw r8,40(r2)
178: 7d 29 38 50 subf r9,r9,r7
17c: 7d 29 42 14 add r9,r9,r8
180: 91 22 00 0c stw r9,12(r2)
184: 4e 80 00 20 blr
With your change we now get a call to sched_clock() instead of a simple
mftb:
00000154 <my_account_cpu_user_entry>:
154: 94 21 ff f0 stwu r1,-16(r1)
158: 7c 08 02 a6 mflr r0
15c: 90 01 00 14 stw r0,20(r1)
160: 48 00 00 01 bl 160 <my_account_cpu_user_entry+0xc>
160: R_PPC_REL24 sched_clock
164: 81 02 00 08 lwz r8,8(r2)
168: 81 22 00 28 lwz r9,40(r2)
16c: 90 82 00 24 stw r4,36(r2)
170: 7d 29 40 50 subf r9,r9,r8
174: 7d 29 22 14 add r9,r9,r4
178: 91 22 00 08 stw r9,8(r2)
17c: 80 01 00 14 lwz r0,20(r1)
180: 38 21 00 10 addi r1,r1,16
184: 7c 08 03 a6 mtlr r0
188: 4e 80 00 20 blr
0000018c <my_account_cpu_user_exit>:
18c: 94 21 ff f0 stwu r1,-16(r1)
190: 7c 08 02 a6 mflr r0
194: 90 01 00 14 stw r0,20(r1)
198: 48 00 00 01 bl 198 <my_account_cpu_user_exit+0xc>
198: R_PPC_REL24 sched_clock
19c: 81 02 00 0c lwz r8,12(r2)
1a0: 81 22 00 24 lwz r9,36(r2)
1a4: 90 82 00 28 stw r4,40(r2)
1a8: 7d 29 40 50 subf r9,r9,r8
1ac: 7d 29 22 14 add r9,r9,r4
1b0: 91 22 00 0c stw r9,12(r2)
1b4: 80 01 00 14 lwz r0,20(r1)
1b8: 38 21 00 10 addi r1,r1,16
1bc: 7c 08 03 a6 mtlr r0
1c0: 4e 80 00 20 blr
And sched_clock() is heavy: first it has the mftbu/mftb/mftbu sequence,
and then it does an awful lot of calculation, including many multiplies:
000004d8 <sched_clock>:
4d8: 7d 2d 42 e6 mftbu r9
4dc: 7d 0c 42 e6 mftb r8
4e0: 7d 4d 42 e6 mftbu r10
4e4: 7c 09 50 40 cmplw r9,r10
4e8: 40 82 ff f0 bne 4d8 <sched_clock>
4ec: 3d 40 00 00 lis r10,0
4ee: R_PPC_ADDR16_HA .data..ro_after_init
4f0: 38 ca 00 00 addi r6,r10,0
4f2: R_PPC_ADDR16_LO .data..ro_after_init
4f4: 3c e0 00 00 lis r7,0
4f6: R_PPC_ADDR16_HA .data..read_mostly
4f8: 38 87 00 00 addi r4,r7,0
4fa: R_PPC_ADDR16_LO .data..read_mostly
4fc: 80 66 00 04 lwz r3,4(r6)
500: 80 e7 00 00 lwz r7,0(r7)
502: R_PPC_ADDR16_LO .data..read_mostly
504: 80 c4 00 04 lwz r6,4(r4)
508: 81 4a 00 00 lwz r10,0(r10)
50a: R_PPC_ADDR16_LO .data..ro_after_init
50c: 7c 63 40 10 subfc r3,r3,r8
510: 7d 0a 49 10 subfe r8,r10,r9
514: 7d 27 19 d6 mullw r9,r7,r3
518: 7d 43 30 16 mulhwu r10,r3,r6
51c: 7c 08 31 d6 mullw r0,r8,r6
520: 7d 4a 48 14 addc r10,r10,r9
524: 7c 67 18 16 mulhwu r3,r7,r3
528: 39 20 00 00 li r9,0
52c: 7c c8 30 16 mulhwu r6,r8,r6
530: 7c a9 49 14 adde r5,r9,r9
534: 7d 67 41 d6 mullw r11,r7,r8
538: 7d 4a 00 14 addc r10,r10,r0
53c: 7c a5 01 94 addze r5,r5
540: 7c 63 30 14 addc r3,r3,r6
544: 7d 29 49 14 adde r9,r9,r9
548: 80 84 00 08 lwz r4,8(r4)
54c: 7c 63 58 14 addc r3,r3,r11
550: 7c e7 40 16 mulhwu r7,r7,r8
554: 7d 29 01 94 addze r9,r9
558: 7c 63 28 14 addc r3,r3,r5
55c: 7d 29 39 14 adde r9,r9,r7
560: 35 44 ff e0 addic. r10,r4,-32
564: 41 80 00 10 blt 574 <sched_clock+0x9c>
568: 7c 63 50 30 slw r3,r3,r10
56c: 38 80 00 00 li r4,0
570: 4e 80 00 20 blr
574: 21 04 00 1f subfic r8,r4,31
578: 54 6a f8 7e srwi r10,r3,1
57c: 7d 29 20 30 slw r9,r9,r4
580: 7d 4a 44 30 srw r10,r10,r8
584: 7c 64 20 30 slw r4,r3,r4
588: 7d 43 4b 78 or r3,r10,r9
58c: 4e 80 00 20 blr
I think the difference is obvious, no need for benchmarking. We should
refrain from calling sched_clock() at every kernel entry/exit.
Converting from timebase to nanoseconds only needs to be done in
vtime_flush(), called by vtime_task_switch() at task switch.
Hope it is more explicit now.
Christophe
^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-25 9:45 ` Christophe Leroy (CS GROUP)
@ 2026-02-25 10:34 ` Shrikanth Hegde
2026-02-25 11:14 ` Christophe Leroy (CS GROUP)
0 siblings, 1 reply; 40+ messages in thread
From: Shrikanth Hegde @ 2026-02-25 10:34 UTC (permalink / raw)
To: Christophe Leroy (CS GROUP), Frederic Weisbecker, LKML,
Madhavan Srinivasan
Cc: Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
Hi Christophe.
On 2/25/26 3:15 PM, Christophe Leroy (CS GROUP) wrote:
> Hi Hegde,
>
> On 25/02/2026 at 08:46, Shrikanth Hegde wrote:
> [...]
> I think the difference is obvious, no need for benchmarking. We should
> refrain from calling sched_clock() at every kernel entry/exit.
> Converting from timebase to nanoseconds only needs to be done in
> vtime_flush(), called by vtime_task_switch() at task switch.
> Hope it is more explicit now.
>
Got it. The main concern was about the additional computation in
sched_clock(), not any additional code paths per se.
Yes, that would be possible.
How about we do the below? It adds only one subtraction and achieves the
same outcome.
---
diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index aff858ca99c0..7afba0202568 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -44,7 +44,7 @@
*/
static notrace inline void account_cpu_user_entry(void)
{
- unsigned long tb = mftb();
+ unsigned long tb = mftb() - get_boot_tb();
struct cpu_accounting_data *acct = raw_get_accounting(current);
acct->utime += (tb - acct->starttime_user);
@@ -53,7 +53,7 @@ static notrace inline void account_cpu_user_entry(void)
static notrace inline void account_cpu_user_exit(void)
{
- unsigned long tb = mftb();
+ unsigned long tb = mftb() - get_boot_tb();
struct cpu_accounting_data *acct = raw_get_accounting(current);
acct->stime += (tb - acct->starttime);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 18506740f4a4..ff5524e6cdc7 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -215,7 +215,7 @@ static unsigned long vtime_delta(struct cpu_accounting_data *acct,
WARN_ON_ONCE(!irqs_disabled());
- now = mftb();
+ now = mftb() - get_boot_tb();
stime = now - acct->starttime;
acct->starttime = now;
@@ -388,7 +388,7 @@ void vtime_reset(void)
{
struct cpu_accounting_data *acct = get_accounting(current);
- acct->starttime = mftb();
+ acct->starttime = mftb() - get_boot_tb();
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
acct->startspurr = read_spurr(acct->starttime);
#endif
^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-25 10:34 ` Shrikanth Hegde
@ 2026-02-25 11:14 ` Christophe Leroy (CS GROUP)
2026-02-25 13:33 ` Shrikanth Hegde
0 siblings, 1 reply; 40+ messages in thread
From: Christophe Leroy (CS GROUP) @ 2026-02-25 11:14 UTC (permalink / raw)
To: Shrikanth Hegde, Frederic Weisbecker, LKML, Madhavan Srinivasan
Cc: Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
Hi Hegde,
On 25/02/2026 at 11:34, Shrikanth Hegde wrote:
> Hi Christophe.
>
> On 2/25/26 3:15 PM, Christophe Leroy (CS GROUP) wrote:
>>
>> Hope it is more explicit now.
>>
>
> Got it. The main concern was about the additional computation in
> sched_clock(), not any additional code paths per se.
>
> Yes, that would be possible.
>
> How about we do the below? It adds only one subtraction and achieves the
> same outcome.
It adds a bit more than just a subtraction. It adds a call to an extern
function.
00000164 <my_account_cpu_user_entry>:
164: 94 21 ff f0 stwu r1,-16(r1)
168: 7c 08 02 a6 mflr r0
16c: 90 01 00 14 stw r0,20(r1)
170: 93 e1 00 0c stw r31,12(r1)
174: 7f ec 42 e6 mftb r31
178: 48 00 00 01 bl 178 <my_account_cpu_user_entry+0x14>
178: R_PPC_REL24 get_boot_tb
17c: 81 02 00 08 lwz r8,8(r2)
180: 81 22 00 28 lwz r9,40(r2)
184: 7c 84 f8 50 subf r4,r4,r31
188: 7d 29 40 50 subf r9,r9,r8
18c: 7d 29 22 14 add r9,r9,r4
190: 90 82 00 24 stw r4,36(r2)
194: 91 22 00 08 stw r9,8(r2)
198: 80 01 00 14 lwz r0,20(r1)
19c: 83 e1 00 0c lwz r31,12(r1)
1a0: 7c 08 03 a6 mtlr r0
1a4: 38 21 00 10 addi r1,r1,16
1a8: 4e 80 00 20 blr
000001ac <my_account_cpu_user_exit>:
1ac: 94 21 ff f0 stwu r1,-16(r1)
1b0: 7c 08 02 a6 mflr r0
1b4: 90 01 00 14 stw r0,20(r1)
1b8: 93 e1 00 0c stw r31,12(r1)
1bc: 7f ec 42 e6 mftb r31
1c0: 48 00 00 01 bl 1c0 <my_account_cpu_user_exit+0x14>
1c0: R_PPC_REL24 get_boot_tb
1c4: 81 02 00 0c lwz r8,12(r2)
1c8: 81 22 00 24 lwz r9,36(r2)
1cc: 7c 84 f8 50 subf r4,r4,r31
1d0: 7d 29 40 50 subf r9,r9,r8
1d4: 7d 29 22 14 add r9,r9,r4
1d8: 90 82 00 28 stw r4,40(r2)
1dc: 91 22 00 0c stw r9,12(r2)
1e0: 80 01 00 14 lwz r0,20(r1)
1e4: 83 e1 00 0c lwz r31,12(r1)
1e8: 7c 08 03 a6 mtlr r0
1ec: 38 21 00 10 addi r1,r1,16
1f0: 4e 80 00 20 blr
I really still can't see the point of this subtraction.
At one place we do
tb1 = mftb1;
acct->utime += (tb1 - acct->starttime_user);
acct->starttime = tb1;
At the other place we do
tb2 = mftb2;
acct->stime += (tb2 - acct->starttime);
acct->starttime_user = tb2;
So at the end we have
acct->utime += mftb1 - mftb2;
acct->stime += mftb2 - mftb1;
You want to change to
tb1 = mftb1 - boot_tb;
tb2 = mftb2 - boot_tb;
At the end we would get
acct->utime += mftb1 - boot_tb - mftb2 + boot_tb = mftb1 - mftb2;
acct->stime += mftb2 - boot_tb - mftb1 + boot_tb = mftb2 - mftb1;
So what's the point in doing such a useless subtraction that disappears at
the end? What am I missing?
Christophe
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-25 11:14 ` Christophe Leroy (CS GROUP)
@ 2026-02-25 13:33 ` Shrikanth Hegde
2026-02-25 13:54 ` Christophe Leroy (CS GROUP)
0 siblings, 1 reply; 40+ messages in thread
From: Shrikanth Hegde @ 2026-02-25 13:33 UTC (permalink / raw)
To: Christophe Leroy (CS GROUP), Frederic Weisbecker, LKML,
Madhavan Srinivasan
Cc: Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
On 2/25/26 4:44 PM, Christophe Leroy (CS GROUP) wrote:
> Hi Hegde,
>
> On 25/02/2026 at 11:34, Shrikanth Hegde wrote:
>> [...]
>
> It adds a bit more than just a subtraction. It adds a call to an extern
> function.
I think we should make it always inline and move it to time.h
> [...]
> So what's the point in doing such a useless subtraction that disappears at
> the end? What am I missing?
>
I had a similar thought, but I saw the data below when I experimented on the system.
These are the stats seen on a PowerNV system with 144 CPUs.
Nothing is running on the system after boot, so it is mostly idle.
======== With the series applied ===
cat /proc/stat | head
cpu 1494 0 135607576 9628633227 16876 142 63 0 0 0
cpu0 0 0 8 67807311 0 2 40 0 0 0
cpu1 0 0 6 67807349 0 0 0 0 0 0
cat /proc/uptime
48.32 96286332.82 << Note this value is way too huge. The system value is also huge.
========= without the series(tip/master) ===============
cat /proc/stat | head
cpu 2003 0 67866261 859414 15923 249 66 0 0 0
cpu0 5 0 23 5595 461 2 38 0 0 0
cpu1 0 0 9 6092 21 0 3 0 0 0
cat /proc/uptime
61.29 8594.82 << This is right: 144*61 = 8784.
But note the system time reported, i.e. 67866261. It is too huge again, and very close
to the actual mftb value rather than the diff; i.e. we have paths where tb1 is not taken,
so tb2 is effectively mftb - 0.
========= with proposed fix of mftb - boot_tb ===============
cat /proc/stat | head
cpu 5187 0 10996 2025690 16566 765 184 0 0 0
cpu0 9 0 28 14096 65 6 108 0 0 0
cpu1 4 0 15 14277 0 0 2 0 0 0
cat /proc/uptime
142.97 20257.42 << Looks correct, since 142*144 is close to 20448
=============================================================
Now let's go to CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
cat /proc/stat | head
cpu 1804 0 3003 791760 15695 0 0 0 0 0
cpu0 22 0 46 5535 0 0 0 0 0 0
cpu1 0 0 7 5637 0 0 0 0 0 0
cat /proc/uptime
56.49 7918.05 << Looks correct, close to 56*144 = 8064.
================================================
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-25 13:33 ` Shrikanth Hegde
@ 2026-02-25 13:54 ` Christophe Leroy (CS GROUP)
2026-02-25 17:47 ` Shrikanth Hegde
0 siblings, 1 reply; 40+ messages in thread
From: Christophe Leroy (CS GROUP) @ 2026-02-25 13:54 UTC (permalink / raw)
To: Shrikanth Hegde, Frederic Weisbecker, LKML, Madhavan Srinivasan
Cc: Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
Hi Hegde,
On 25/02/2026 at 14:33, Shrikanth Hegde wrote:
>
>
> On 2/25/26 4:44 PM, Christophe Leroy (CS GROUP) wrote:
>> Hi Hegde,
>>
>> Le 25/02/2026 à 11:34, Shrikanth Hegde a écrit :
>>> Hi Christophe.
>>>
>>> On 2/25/26 3:15 PM, Christophe Leroy (CS GROUP) wrote:
>>>>
>>>> Hope it is more explicit now.
>>>>
>>>
>>> Got it. The main concern was around with additional computation that
>>> sched_clock,
>>> not any additional paths per se.
>>>
>>> yes, that would be possible,
>>>
>>>
>>> How about we do below? This adds only one subtraction.
>>> This achieves the same outcome.
>>
>> It adds a bit more than just a substration. It adds a call to an
>> extern fonction.
>
> I think we should make it always inline and move it to time.h
>
>>
>> 00000164 <my_account_cpu_user_entry>:
>> 164: 94 21 ff f0 stwu r1,-16(r1)
>> 168: 7c 08 02 a6 mflr r0
>> 16c: 90 01 00 14 stw r0,20(r1)
>> 170: 93 e1 00 0c stw r31,12(r1)
>> 174: 7f ec 42 e6 mftb r31
>> 178: 48 00 00 01 bl 178 <my_account_cpu_user_entry+0x14>
>> 178: R_PPC_REL24 get_boot_tb
>> 17c: 81 02 00 08 lwz r8,8(r2)
>> 180: 81 22 00 28 lwz r9,40(r2)
>> 184: 7c 84 f8 50 subf r4,r4,r31
>> 188: 7d 29 40 50 subf r9,r9,r8
>> 18c: 7d 29 22 14 add r9,r9,r4
>> 190: 90 82 00 24 stw r4,36(r2)
>> 194: 91 22 00 08 stw r9,8(r2)
>> 198: 80 01 00 14 lwz r0,20(r1)
>> 19c: 83 e1 00 0c lwz r31,12(r1)
>> 1a0: 7c 08 03 a6 mtlr r0
>> 1a4: 38 21 00 10 addi r1,r1,16
>> 1a8: 4e 80 00 20 blr
>>
>> 000001ac <my_account_cpu_user_exit>:
>> 1ac: 94 21 ff f0 stwu r1,-16(r1)
>> 1b0: 7c 08 02 a6 mflr r0
>> 1b4: 90 01 00 14 stw r0,20(r1)
>> 1b8: 93 e1 00 0c stw r31,12(r1)
>> 1bc: 7f ec 42 e6 mftb r31
>> 1c0: 48 00 00 01 bl 1c0 <my_account_cpu_user_exit+0x14>
>> 1c0: R_PPC_REL24 get_boot_tb
>> 1c4: 81 02 00 0c lwz r8,12(r2)
>> 1c8: 81 22 00 24 lwz r9,36(r2)
>> 1cc: 7c 84 f8 50 subf r4,r4,r31
>> 1d0: 7d 29 40 50 subf r9,r9,r8
>> 1d4: 7d 29 22 14 add r9,r9,r4
>> 1d8: 90 82 00 28 stw r4,40(r2)
>> 1dc: 91 22 00 0c stw r9,12(r2)
>> 1e0: 80 01 00 14 lwz r0,20(r1)
>> 1e4: 83 e1 00 0c lwz r31,12(r1)
>> 1e8: 7c 08 03 a6 mtlr r0
>> 1ec: 38 21 00 10 addi r1,r1,16
>> 1f0: 4e 80 00 20 blr
>>
>>
>> I really still can't see the point of this substraction.
>>
>> At one place we do
>>
>> tb1 = mftb1;
>>
>> acct->utime += (tb1 - acct->starttime_user);
>> acct->starttime = tb1;
>>
>> At the other place we do
>>
>> tb2 = mftb2;
>>
>> acct->stime += (tb2 - acct->starttime);
>> acct->starttime_user = tb2;
>>
>> So at the end we have
>>
>> acct->utime += mftb1 - mftb2;
>> acct->stime += mftb2 - mftb1;
>>
>> You want to change to
>> tb1 = mftb1 - boot_tb;
>> tb2 = mftb2 - boot_tb;
>>
>> At the end we would get
>>
>> acct->utime += mftb1 - boot_tb - mftb2 + boot_tb = mftb1 - mftb2;
>> acct->stime += mftb2 - boot_tb - mftb1 + boot_tb = mftb2 - mftb1;
>>
>> So what's the point of doing such a useless subtraction that cancels
>> out at the end? What am I missing?
>>
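The cancellation algebra above can be checked with a toy model (not kernel code; the sample values and function names below are hypothetical, standing in for the two accounting sites quoted in the discussion):

```c
#include <assert.h>

/* Toy model of the two accounting sites discussed above. */
struct acct {
	unsigned long utime, stime, starttime, starttime_user;
};

/* kernel entry: close the user-mode interval */
static void on_entry(struct acct *a, unsigned long tb)
{
	a->utime += tb - a->starttime_user;
	a->starttime = tb;
}

/* kernel exit: close the system-mode interval */
static void on_exit(struct acct *a, unsigned long tb)
{
	a->stime += tb - a->starttime;
	a->starttime_user = tb;
}

/*
 * Run the same timebase sample sequence with a constant offset (the
 * boot_tb stand-in) subtracted from every sample. Since only deltas
 * are accumulated, the offset cancels and the totals are identical.
 */
static void run(struct acct *a, unsigned long off)
{
	a->starttime_user = 1000 - off;	/* initialised before first entry */
	on_entry(a, 1100 - off);
	on_exit(a, 1150 - off);
	on_entry(a, 1300 - off);
	on_exit(a, 1320 - off);
}
```

With `off = 0` and any non-zero `off`, `utime` and `stime` come out the same, which is the point being made: the subtraction is invisible in steady state (the first, uninitialised interval is a different matter).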
>
> I had a similar thought, but I saw the data below when I tried this on
> the system.
>
> These are the stats seen on a PowerNV system with 144 CPUs.
> Nothing is running on the system after boot, so it is mostly idle.
>
>
> ======== With the series applied ===
>
> cat /proc/stat | head
> cpu 1494 0 135607576 9628633227 16876 142 63 0 0 0
> cpu0 0 0 8 67807311 0 2 40 0 0 0
> cpu1 0 0 6 67807349 0 0 0 0 0 0
>
> cat /proc/uptime
> 48.32 96286332.82 << Note this value is far too large. The system
> time above is also far too large.
>
> ========= without the series(tip/master) ===============
> cat /proc/stat | head
> cpu 2003 0 67866261 859414 15923 249 66 0 0 0
> cpu0 5 0 23 5595 461 2 38 0 0 0
> cpu1 0 0 9 6092 21 0 3 0 0 0
>
> cat /proc/uptime
> 61.29 8594.82 << This is right. 144*61 = 8784.
>
> But note the system time reported, i.e. 67866261. It is far too large
> again, and very close to the raw mftb value rather than the diff.
> That is, we have paths where tb1 is never taken, so tb2 is effectively
> mftb - 0.
>
>
> ========= with proposed fix of mftb - boot_tb ===============
> cat /proc/stat | head
> cpu 5187 0 10996 2025690 16566 765 184 0 0 0
> cpu0 9 0 28 14096 65 6 108 0 0 0
> cpu1 4 0 15 14277 0 0 2 0 0 0
>
> cat /proc/uptime
> 142.97 20257.42 << Looks correct, since 142*144 = 20448, which is close.
>
> =============================================================
>
> Now lets go to CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
>
> cat /proc/stat | head
> cpu 1804 0 3003 791760 15695 0 0 0 0 0
> cpu0 22 0 46 5535 0 0 0 0 0 0
> cpu1 0 0 7 5637 0 0 0 0 0 0
>
> cat /proc/uptime
> 56.49 7918.05 << Looks correct, close to 56*144 = 8064.
>
>
> ================================================
>
I think I'm starting to understand now.
I think the problem is that acct->starttime has an invalid value the
very first time it is used.
We are probably lacking an initial value in paca->accounting.starttime.
This should likely be initialised from mftb in head_64.S, in
start_here_common for the main CPU and __secondary_start for the other
CPUs, or maybe at a higher level in C, in setup_arch() and
start_secondary().
Christophe
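The diagnosis above can be sketched as a toy model (not kernel code; values and the helper name are hypothetical): when starttime is left at 0 while the timebase has already been running for a long time, as after a kexec where the timebase is not reset, the very first delta accounts the whole elapsed timebase.

```c
#include <assert.h>

/*
 * Toy model of the bug: the first accounting delta uses starttime,
 * which is 0 unless it was initialised at boot. After a kexec the
 * timebase keeps running, so an uninitialised starttime makes the
 * first delta roughly equal to the raw timebase value.
 */
static unsigned long first_delta(unsigned long tb_at_boot,
				 unsigned long tb_at_first_sample,
				 int initialised)
{
	unsigned long starttime = initialised ? tb_at_boot : 0;

	return tb_at_first_sample - starttime;
}
```

With the starttime initialised from mftb at boot, the first delta covers only the time since boot, matching the corrected /proc/stat numbers reported above.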
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-25 13:54 ` Christophe Leroy (CS GROUP)
@ 2026-02-25 17:47 ` Shrikanth Hegde
2026-02-25 17:59 ` Christophe Leroy (CS GROUP)
0 siblings, 1 reply; 40+ messages in thread
From: Shrikanth Hegde @ 2026-02-25 17:47 UTC (permalink / raw)
To: Christophe Leroy (CS GROUP), Frederic Weisbecker, LKML,
Madhavan Srinivasan
Cc: Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
Hi Christophe.
> I think I'm starting to understand now.
>
> I think the problem is that acct->starttime has an invalid value the
> very first time it is used.
>
> We are probably lacking an initial value in paca->accounting.starttime.
> This should likely be initialised from mftb in head_64.S in
> start_here_common for main CPU and __secondary_start for other CPUs or
> maybe at higher level in C in setup_arch() and start_secondary()
>
> Christophe
How about the below? This works too.
---
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 18506740f4a4..af129645b7f7 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -928,9 +928,24 @@ static void __init set_decrementer_max(void)
bits, decrementer_max);
}
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+/*
+ * This is done to initialize starttime correctly. With this,
+ * /proc/stat shows correct values, similar to CONFIG_VIRT_CPU_ACCOUNTING_GEN
+ */
+static void init_cpu_accounting_startime(void)
+{
+ struct cpu_accounting_data *acct = get_accounting(current);
+ acct->starttime = mftb();
+}
+#else
+static void init_cpu_accounting_startime(void) { };
+#endif
+
static void __init init_decrementer_clockevent(void)
{
register_decrementer_clockevent(smp_processor_id());
+ init_cpu_accounting_startime();
}
void secondary_cpu_time_init(void)
@@ -946,6 +961,8 @@ void secondary_cpu_time_init(void)
/* FIME: Should make unrelated change to move snapshot_timebase
* call here ! */
register_decrementer_clockevent(smp_processor_id());
+
+ init_cpu_accounting_startime();
}
/*
* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-25 17:47 ` Shrikanth Hegde
@ 2026-02-25 17:59 ` Christophe Leroy (CS GROUP)
2026-02-26 4:06 ` Shrikanth Hegde
0 siblings, 1 reply; 40+ messages in thread
From: Christophe Leroy (CS GROUP) @ 2026-02-25 17:59 UTC (permalink / raw)
To: Shrikanth Hegde, Frederic Weisbecker, LKML, Madhavan Srinivasan
Cc: Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
Le 25/02/2026 à 18:47, Shrikanth Hegde a écrit :
> Hi Christophe.
>
>> I think I'm starting to understand now.
>>
>> I think the problem is that acct->starttime has an invalid value the
>> very first time it is used.
>>
>> We are probably lacking an initial value in paca->accounting.starttime.
>> This should likely be initialised from mftb in head_64.S in
>> start_here_common for main CPU and __secondary_start for other CPUs or
>> maybe at higher level in C in setup_arch() and start_secondary()
>>
>> Christophe
>
> How about below? this works too.
Fine, it works; it means we found the real problem.
What about using the newly added vtime_reset()? See below (untested).
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 9b3167274653..f4aef85106ac 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -377,7 +377,6 @@ void vtime_task_switch(struct task_struct *prev)
}
}
-#ifdef CONFIG_NO_HZ_COMMON
/**
* vtime_reset - Fast forward vtime entry clocks
*
@@ -394,6 +393,7 @@ void vtime_reset(void)
#endif
}
+#ifdef CONFIG_NO_HZ_COMMON
/**
* vtime_dyntick_start - Inform vtime about entry to idle-dynticks
*
@@ -931,6 +931,7 @@ static void __init set_decrementer_max(void)
static void __init init_decrementer_clockevent(void)
{
register_decrementer_clockevent(smp_processor_id());
+ vtime_reset();
}
void secondary_cpu_time_init(void)
@@ -946,6 +947,7 @@ void secondary_cpu_time_init(void)
/* FIME: Should make unrelated change to move snapshot_timebase
* call here ! */
register_decrementer_clockevent(smp_processor_id());
+ vtime_reset();
}
/*
>
> ---
>
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 18506740f4a4..af129645b7f7 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -928,9 +928,24 @@ static void __init set_decrementer_max(void)
> bits, decrementer_max);
> }
>
> +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> +/*
> + * This is done to initialize starttime correctly. With this,
> + * /proc/stat shows correct values, similar to CONFIG_VIRT_CPU_ACCOUNTING_GEN
> + */
> +static void init_cpu_accounting_startime(void)
> +{
> + struct cpu_accounting_data *acct = get_accounting(current);
> + acct->starttime = mftb();
> +}
> +#else
> +static void init_cpu_accounting_startime(void) { };
> +#endif
> +
> static void __init init_decrementer_clockevent(void)
> {
> register_decrementer_clockevent(smp_processor_id());
> + init_cpu_accounting_startime();
> }
>
> void secondary_cpu_time_init(void)
> @@ -946,6 +961,8 @@ void secondary_cpu_time_init(void)
> /* FIME: Should make unrelated change to move snapshot_timebase
> * call here ! */
> register_decrementer_clockevent(smp_processor_id());
> +
> + init_cpu_accounting_startime();
> }
>
> /*
>
* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-25 17:59 ` Christophe Leroy (CS GROUP)
@ 2026-02-26 4:06 ` Shrikanth Hegde
0 siblings, 0 replies; 40+ messages in thread
From: Shrikanth Hegde @ 2026-02-26 4:06 UTC (permalink / raw)
To: Christophe Leroy (CS GROUP), Frederic Weisbecker, LKML,
Madhavan Srinivasan
Cc: Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
On 2/25/26 11:29 PM, Christophe Leroy (CS GROUP) wrote:
>
>
> Le 25/02/2026 à 18:47, Shrikanth Hegde a écrit :
>> Hi Christophe.
>>
>>> I think I'm starting to understand now.
>>>
>>> I think the problem is that acct->starttime has an invalid value the
>>> very first time it is used.
>>>
>>> We are probably lacking an initial value in paca->accounting.starttime.
>>> This should likely be initialised from mftb in head_64.S in
>>> start_here_common for main CPU and __secondary_start for other CPUs
>>> or maybe at higher level in C in setup_arch() and start_secondary()
>>>
>>> Christophe
>>
>> How about below? this works too.
>
> Fine, it works; it means we found the real problem.
>
> What about using the newly added vtime_reset() ? See below (untested)
>
Thanks for helping, Christophe.
This works too; vtime_reset() does the exact same thing.
Let me write a changelog and a comment on vtime_reset() and send it.
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 9b3167274653..f4aef85106ac 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -377,7 +377,6 @@ void vtime_task_switch(struct task_struct *prev)
> }
> }
>
> -#ifdef CONFIG_NO_HZ_COMMON
> /**
> * vtime_reset - Fast forward vtime entry clocks
> *
> @@ -394,6 +393,7 @@ void vtime_reset(void)
> #endif
> }
>
> +#ifdef CONFIG_NO_HZ_COMMON
> /**
> * vtime_dyntick_start - Inform vtime about entry to idle-dynticks
> *
> @@ -931,6 +931,7 @@ static void __init set_decrementer_max(void)
> static void __init init_decrementer_clockevent(void)
> {
> register_decrementer_clockevent(smp_processor_id());
> + vtime_reset();
> }
>
> void secondary_cpu_time_init(void)
> @@ -946,6 +947,7 @@ void secondary_cpu_time_init(void)
> /* FIME: Should make unrelated change to move snapshot_timebase
> * call here ! */
> register_decrementer_clockevent(smp_processor_id());
> + vtime_reset();
> }
>
> /*
>
>
* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-25 7:46 ` Shrikanth Hegde
2026-02-25 9:45 ` Christophe Leroy (CS GROUP)
@ 2026-02-26 7:32 ` Christophe Leroy (CS GROUP)
2026-02-26 12:57 ` Shrikanth Hegde
1 sibling, 1 reply; 40+ messages in thread
From: Christophe Leroy (CS GROUP) @ 2026-02-26 7:32 UTC (permalink / raw)
To: Shrikanth Hegde, Frederic Weisbecker, LKML, Madhavan Srinivasan
Cc: Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
Hi Hegde,
Le 25/02/2026 à 08:46, Shrikanth Hegde a écrit :
> Hi Christophe,
>
> On 2/24/26 9:11 PM, Christophe Leroy (CS GROUP) wrote:
>> Hi Hegde,
>>
>> Le 19/02/2026 à 19:30, Shrikanth Hegde a écrit :
>>>
>>>
>>> On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
>>>> Currently the tick subsystem stores the idle cputime accounting in
>>>> private fields, allowing cohabitation with architecture idle vtime
>>>> accounting. The former is fetched on online CPUs, the latter on offline
>>>> CPUs.
>>>>
>>>> For consolidation purpose, architecture vtime accounting will continue
>>>> to account the cputime but will make a break when the idle tick is
>>>> stopped. The dyntick cputime accounting will then be relayed by the
>>>> tick
>>>> subsystem so that the idle cputime is still seen advancing coherently
>>>> even when the tick isn't there to flush the idle vtime.
>>>>
>>>> Prepare for that and introduce three new APIs which will be used in
>>>> subsequent patches:
>>>>
>>>> - vtime_dyntick_start() is meant to be called when idle enters
>>>> dyntick mode. The idle cputime that elapsed so far is accumulated.
>>>>
>>>> - vtime_dyntick_stop() is meant to be called when idle exits
>>>> dyntick mode. The vtime entry clocks are fast-forwarded to the
>>>> current time so that idle accounting restarts elapsing from now.
>>>>
>>>> - vtime_reset() is meant to be called from dynticks idle IRQ entry to
>>>> fast-forward the clock to current time so that the IRQ time is still
>>>> accounted by vtime while nohz cputime is paused.
>>>>
>>>> Also, accumulated vtime won't be flushed from dyntick-idle ticks, to
>>>> avoid accounting the idle cputime twice along with nohz accounting.
>>>>
>>>> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>>>
>>> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>>
>>>> ---
>>>> arch/powerpc/kernel/time.c | 41 ++++++++++++++++++++++++++++++++++
>>>> ++++
>>>> include/linux/vtime.h | 6 ++++++
>>>> 2 files changed, 47 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>>>> index 4bbeb8644d3d..18506740f4a4 100644
>>>> --- a/arch/powerpc/kernel/time.c
>>>> +++ b/arch/powerpc/kernel/time.c
>>>> @@ -376,6 +376,47 @@ void vtime_task_switch(struct task_struct *prev)
>>>> acct->starttime = acct0->starttime;
>>>> }
>>>> }
>>>> +
>>>> +#ifdef CONFIG_NO_HZ_COMMON
>>>> +/**
>>>> + * vtime_reset - Fast forward vtime entry clocks
>>>> + *
>>>> + * Called from dynticks idle IRQ entry to fast-forward the clocks
>>>> to current time
>>>> + * so that the IRQ time is still accounted by vtime while nohz
>>>> cputime is paused.
>>>> + */
>>>> +void vtime_reset(void)
>>>> +{
>>>> + struct cpu_accounting_data *acct = get_accounting(current);
>>>> +
>>>> + acct->starttime = mftb();
>>>
>>> I figured out why those huge values happen.
>>>
>>> This happens because mftb is from when the system is booted.
>>> I was doing kexec to start the new kernel and mftb wasn't getting
>>> reset.
>>>
>>> I thought about this. This is concern for pseries too, where LPAR's
>>> restart but system won't restart and mftb will continue to run
>>> instead of
>>> reset.
>>>
>>> I think we should be using sched_clock instead of mftb here.
>>> Though we need it a few more places and some cosmetic changes around it.
>>>
>>> Note: Some values being huge exists without series for few CPUs, with
>>> series it
>>> shows up in most of the CPUs.
>>>
>>> So I am planning send out fix below fix separately keeping your
>>> series as dependency.
>>>
>>> ---
>>> arch/powerpc/include/asm/accounting.h | 4 ++--
>>> arch/powerpc/include/asm/cputime.h | 14 +++++++-------
>>> arch/powerpc/kernel/time.c | 22 +++++++++++-----------
>>> 3 files changed, 20 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/accounting.h b/arch/powerpc/
>>> include/asm/accounting.h
>>> index 6d79c31700e2..50f120646e6d 100644
>>> --- a/arch/powerpc/include/asm/accounting.h
>>> +++ b/arch/powerpc/include/asm/accounting.h
>>> @@ -21,8 +21,8 @@ struct cpu_accounting_data {
>>> unsigned long steal_time;
>>> unsigned long idle_time;
>>> /* Internal counters */
>>> - unsigned long starttime; /* TB value snapshot */
>>> - unsigned long starttime_user; /* TB value on exit to usermode */
>>> + unsigned long starttime; /* Time value snapshot */
>>> + unsigned long starttime_user; /* Time value on exit to
>>> usermode */
>>> #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
>>> unsigned long startspurr; /* SPURR value snapshot */
>>> unsigned long utime_sspurr; /* ->user_time when ->startspurr
>>> set */
>>> diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/
>>> include/ asm/cputime.h
>>> index aff858ca99c0..eb6b629b113f 100644
>>> --- a/arch/powerpc/include/asm/cputime.h
>>> +++ b/arch/powerpc/include/asm/cputime.h
>>> @@ -20,9 +20,9 @@
>>> #include <asm/time.h>
>>> #include <asm/param.h>
>>> #include <asm/firmware.h>
>>> +#include <linux/sched/clock.h>
>>>
>>> #ifdef __KERNEL__
>>> -#define cputime_to_nsecs(cputime) tb_to_ns(cputime)
>>>
>>> /*
>>> * PPC64 uses PACA which is task independent for storing accounting
>>> data while
>>> @@ -44,20 +44,20 @@
>>> */
>>> static notrace inline void account_cpu_user_entry(void)
>>> {
>>> - unsigned long tb = mftb();
>>> + unsigned long now = sched_clock();
>>
>> No way!
>>
>> By doing that you'll kill performance for no reason. All we need when
>> accounting time spent in kernel or in user is the difference between
>> the time at entry and the time at exit, no matter what the time was at
>> boot.
>>
>
> No. With this patch there will not be any performance difference.
> All it does is use sched_clock() instead of mftb() in those places.
>
For the record, I ran a benchmark with
tools/testing/selftests/powerpc/benchmarks/null_syscall on a powerpc 885
microcontroller:
Without your proposed patch:
root@vgoip:~# ./null_syscall
2729.98 ns 360.36 cycles
With your proposed patch below:
root@vgoip:~# ./null_syscall
3370.80 ns 444.95 cycles
So, as expected, it is a huge regression: almost 25% more time to run
the syscall.
Christophe
>
> In arch/powerpc/kernel/time.c we have sched_clock().
> notrace unsigned long long sched_clock(void)
> {
> return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) <<
> tb_to_ns_shift;
> }
>
> It does the same mftb call, and accounts only the time after boot, which is
> what /proc/stat should do as well.
>
> "
> the amount of time, measured in units of USER_HZ
> (1/100ths of a second on most architectures
>
> user (1) Time spent in user mode.
>
> idle (4) Time spent in the idle task. This value
> should be USER_HZ times the second entry in
> the /proc/uptime pseudo-file.
> "
> /proc/uptime is based on sched_clock(), so I infer /proc/stat should
> also show values relative to the boot of the OS.
>
>
>> Also sched_clock() returns nanoseconds, which implies a calculation
>> from the timebase. This is pointless CPU consumption. The current
>> implementation calculates nanoseconds at task switch when calling
>> vtime_flush(). Your change will now do it at every kernel entry and
>> kernel exit by calling sched_clock().
>
> This change doesn't add any additional paths. Even without the
> patches, mftb would have been called on every kernel entry/exit; see
> the mftb usage in account_cpu_user_exit/enter.
>
> Now sched_clock() is used instead of mftb(), that's all. No additional
> entry/exit points.
> And previously when accounting we would have done cputime_to_nsecs();
> now that conversion is done inside sched_clock(). So overall,
> computation-wise it should be the same.
>
> What am I missing here?
>
>>
>> Another point is that sched_clock() returns a long long, not a long.
>
> Thanks for pointing that out.
>
> OK. Let me change some of those variables to unsigned long long.
> The compiler didn't warn me, so I didn't see it.
>
>>
>> And also sched_clock() uses get_tb(), which does mftb and mftbu,
>> which is pointless for calculating time deltas unless your application
>> spends hours without being re-scheduled.
>>
>
> I didn't get this. Currently we also use mftb, so the functionality
> should be the same. Could you please explain how?
>
>>
>>> struct cpu_accounting_data *acct = raw_get_accounting(current);
>>>
>>> - acct->utime += (tb - acct->starttime_user);
>>> - acct->starttime = tb;
>>> + acct->utime += (now - acct->starttime_user);
>>> + acct->starttime = now;
>>> }
>>>
>>> static notrace inline void account_cpu_user_exit(void)
>>> {
>>> - unsigned long tb = mftb();
>>> + unsigned long now = sched_clock();
>>> struct cpu_accounting_data *acct = raw_get_accounting(current);
>>>
>>> - acct->stime += (tb - acct->starttime);
>>> - acct->starttime_user = tb;
>>> + acct->stime += (now - acct->starttime);
>>> + acct->starttime_user = now;
>>> }
>>>
>>> static notrace inline void account_stolen_time(void)
>>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>>> index 18506740f4a4..fb67cdae3bcb 100644
>>> --- a/arch/powerpc/kernel/time.c
>>> +++ b/arch/powerpc/kernel/time.c
>>> @@ -215,7 +215,7 @@ static unsigned long vtime_delta(struct
>>> cpu_accounting_data *acct,
>>>
>>> WARN_ON_ONCE(!irqs_disabled());
>>>
>>> - now = mftb();
>>> + now = sched_clock();
>>> stime = now - acct->starttime;
>>> acct->starttime = now;
>>>
>>> @@ -299,9 +299,9 @@ static void vtime_flush_scaled(struct task_struct
>>> *tsk,
>>> {
>>> #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
>>> if (acct->utime_scaled)
>>> - tsk->utimescaled += cputime_to_nsecs(acct->utime_scaled);
>>> + tsk->utimescaled += acct->utime_scaled;
>>> if (acct->stime_scaled)
>>> - tsk->stimescaled += cputime_to_nsecs(acct->stime_scaled);
>>> + tsk->stimescaled += acct->stime_scaled;
>>>
>>> acct->utime_scaled = 0;
>>> acct->utime_sspurr = 0;
>>> @@ -321,28 +321,28 @@ void vtime_flush(struct task_struct *tsk)
>>> struct cpu_accounting_data *acct = get_accounting(tsk);
>>>
>>> if (acct->utime)
>>> - account_user_time(tsk, cputime_to_nsecs(acct->utime));
>>> + account_user_time(tsk, acct->utime);
>>>
>>> if (acct->gtime)
>>> - account_guest_time(tsk, cputime_to_nsecs(acct->gtime));
>>> + account_guest_time(tsk, acct->gtime);
>>>
>>> if (IS_ENABLED(CONFIG_PPC_SPLPAR) && acct->steal_time) {
>>> - account_steal_time(cputime_to_nsecs(acct->steal_time));
>>> + account_steal_time(acct->steal_time);
>>> acct->steal_time = 0;
>>> }
>>>
>>> if (acct->idle_time)
>>> - account_idle_time(cputime_to_nsecs(acct->idle_time));
>>> + account_idle_time(acct->idle_time);
>>>
>>> if (acct->stime)
>>> - account_system_index_time(tsk, cputime_to_nsecs(acct->stime),
>>> + account_system_index_time(tsk, acct->stime,
>>> CPUTIME_SYSTEM);
>>>
>>> if (acct->hardirq_time)
>>> - account_system_index_time(tsk, cputime_to_nsecs(acct-
>>> >hardirq_time),
>>> + account_system_index_time(tsk, acct->hardirq_time,
>>> CPUTIME_IRQ);
>>> if (acct->softirq_time)
>>> - account_system_index_time(tsk, cputime_to_nsecs(acct-
>>> >softirq_time),
>>> + account_system_index_time(tsk, acct->softirq_time,
>>> CPUTIME_SOFTIRQ);
>>>
>>> vtime_flush_scaled(tsk, acct);
>>> @@ -388,7 +388,7 @@ void vtime_reset(void)
>>> {
>>> struct cpu_accounting_data *acct = get_accounting(current);
>>>
>>> - acct->starttime = mftb();
>>> + acct->starttime = sched_clock();
>>> #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
>>> acct->startspurr = read_spurr(acct->starttime);
>>> #endif
>>
>
> PS: I measured the performance with hackbench. I don't see any degradation.
>
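The cost argument behind the regression above can be sketched with a toy model (not kernel code; the conversion factor and function names are hypothetical): both schemes account the same total time, but a sched_clock()-style sample performs the ticks-to-ns conversion on every entry/exit, while accumulating raw timebase deltas defers it to a single conversion at flush time.

```c
#include <assert.h>

static int conversions;	/* counts tb -> ns conversions performed */

/* Stand-in for the mulhdu/shift conversion done by the powerpc
 * sched_clock(); hypothetical 2 ns per tick. */
static unsigned long long tb_to_ns(unsigned long tb)
{
	conversions++;
	return (unsigned long long)tb * 2;
}

/* Raw-ticks scheme: accumulate tb deltas, convert once at flush. */
static unsigned long long flush_raw(const unsigned long *deltas, int n)
{
	unsigned long sum = 0;

	for (int i = 0; i < n; i++)
		sum += deltas[i];
	return tb_to_ns(sum);
}

/* Per-sample scheme: convert each delta as it is accounted. */
static unsigned long long flush_ns(const unsigned long *deltas, int n)
{
	unsigned long long sum = 0;

	for (int i = 0; i < n; i++)
		sum += tb_to_ns(deltas[i]);
	return sum;
}
```

The totals are equal, but the per-sample scheme performs one conversion per accounted interval instead of one per flush, which is why a null-syscall microbenchmark magnifies the difference while a flush-dominated workload such as hackbench may not show it.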
* Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
2026-02-26 7:32 ` Christophe Leroy (CS GROUP)
@ 2026-02-26 12:57 ` Shrikanth Hegde
0 siblings, 0 replies; 40+ messages in thread
From: Shrikanth Hegde @ 2026-02-26 12:57 UTC (permalink / raw)
To: Christophe Leroy (CS GROUP), Frederic Weisbecker, LKML,
Madhavan Srinivasan
Cc: Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
Hi Christophe.
>> No. With this patch there will not be any performance difference.
>> All it does is, instead of using mftb uses sched_clock at those places.
>>
>
> For the record, I did some benchmark test with tools/testing/selftests/
> powerpc/benchmarks/null_syscall on powerpc 885 microcontroller:
>
> Without your proposed patch:
>
> root@vgoip:~# ./null_syscall
> 2729.98 ns 360.36 cycles
>
> With your proposed patch below:
>
> root@vgoip:~# ./null_syscall
> 3370.80 ns 444.95 cycles
>
> So as expected it is a huge regression, almost 25% more time to run the
> syscall.
>
> Christophe
>
>
>>
Got it. My bad for assuming it may not happen multiple times before
vtime_flush().
Btw, can you try the latest vtime_reset patch? It shouldn't make any
difference.
I tried perf bench syscall basic and ./null_syscall.
I don't see any difference with the vtime_reset patch.
* [PATCH 05/15] s390/time: Prepare to stop elapsing in dynticks-idle
2026-02-06 14:22 [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
` (3 preceding siblings ...)
2026-02-06 14:22 ` [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle Frederic Weisbecker
@ 2026-02-06 14:22 ` Frederic Weisbecker
2026-02-06 14:22 ` [PATCH 06/15] tick/sched: Unify idle cputime accounting Frederic Weisbecker
` (10 subsequent siblings)
15 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev, Shrikanth Hegde
Currently the tick subsystem stores the idle cputime accounting in
private fields, allowing cohabitation with architecture idle vtime
accounting. The former is fetched on online CPUs, the latter on offline
CPUs.
For consolidation purpose, architecture vtime accounting will continue
to account the cputime but will make a break when the idle tick is
stopped. The dyntick cputime accounting will then be relayed by the tick
subsystem so that the idle cputime is still seen advancing coherently
even when the tick isn't there to flush the idle vtime.
Prepare for that and introduce three new APIs which will be used in
subsequent patches:
- vtime_dyntick_start() is meant to be called when idle enters dyntick
mode. The idle cputime that elapsed so far is accumulated and
accounted, and the arch's own idle time accounting is then paused.
- vtime_dyntick_stop() is meant to be called when idle exits dyntick
mode. The vtime entry clocks are fast-forwarded to the current time so
that idle accounting restarts elapsing from now, and the arch's idle
time accounting resumes.
- vtime_reset() is meant to be called from dynticks idle IRQ entry to
fast-forward the clock to current time so that the IRQ time is still
accounted by vtime while nohz cputime is paused.
Also, accumulated vtime won't be flushed from dyntick-idle ticks, to
avoid accounting the idle cputime twice along with nohz accounting.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
arch/s390/include/asm/idle.h | 14 +++++---
arch/s390/kernel/idle.c | 19 +++++++----
arch/s390/kernel/vtime.c | 65 ++++++++++++++++++++++++++++++------
3 files changed, 77 insertions(+), 21 deletions(-)
diff --git a/arch/s390/include/asm/idle.h b/arch/s390/include/asm/idle.h
index 09f763b9eb40..285b3da318d6 100644
--- a/arch/s390/include/asm/idle.h
+++ b/arch/s390/include/asm/idle.h
@@ -8,17 +8,21 @@
#ifndef _S390_IDLE_H
#define _S390_IDLE_H
+#include <linux/percpu-defs.h>
#include <linux/types.h>
#include <linux/device.h>
struct s390_idle_data {
- unsigned long idle_count;
- unsigned long idle_time;
- unsigned long clock_idle_enter;
- unsigned long timer_idle_enter;
- unsigned long mt_cycles_enter[8];
+ bool idle_dyntick;
+ unsigned long idle_count;
+ unsigned long idle_time;
+ unsigned long clock_idle_enter;
+ unsigned long timer_idle_enter;
+ unsigned long mt_cycles_enter[8];
};
+DECLARE_PER_CPU(struct s390_idle_data, s390_idle);
+
extern struct device_attribute dev_attr_idle_count;
extern struct device_attribute dev_attr_idle_time_us;
diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c
index 39cb8d0ae348..614db5ea6ea3 100644
--- a/arch/s390/kernel/idle.c
+++ b/arch/s390/kernel/idle.c
@@ -19,7 +19,7 @@
#include <asm/smp.h>
#include "entry.h"
-static DEFINE_PER_CPU(struct s390_idle_data, s390_idle);
+DEFINE_PER_CPU(struct s390_idle_data, s390_idle);
void account_idle_time_irq(void)
{
@@ -35,7 +35,15 @@ void account_idle_time_irq(void)
this_cpu_add(mt_cycles[i], cycles_new[i] - idle->mt_cycles_enter[i]);
}
+ WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1);
+
+ /* Account time spent with enabled wait psw loaded as idle time. */
idle_time = lc->int_clock - idle->clock_idle_enter;
+ WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time);
+
+ /* Dyntick idle time accounted by nohz/scheduler */
+ if (idle->idle_dyntick)
+ return;
lc->steal_timer += idle->clock_idle_enter - lc->last_update_clock;
lc->last_update_clock = lc->int_clock;
@@ -43,9 +51,6 @@ void account_idle_time_irq(void)
lc->system_timer += lc->last_update_timer - idle->timer_idle_enter;
lc->last_update_timer = lc->sys_enter_timer;
- /* Account time spent with enabled wait psw loaded as idle time. */
- WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time);
- WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1);
account_idle_time(cputime_to_nsecs(idle_time));
}
@@ -61,8 +66,10 @@ void noinstr arch_cpu_idle(void)
set_cpu_flag(CIF_ENABLED_WAIT);
if (smp_cpu_mtid)
stcctm(MT_DIAG, smp_cpu_mtid, (u64 *)&idle->mt_cycles_enter);
- idle->clock_idle_enter = get_tod_clock_fast();
- idle->timer_idle_enter = get_cpu_timer();
+ if (!idle->idle_dyntick) {
+ idle->clock_idle_enter = get_tod_clock_fast();
+ idle->timer_idle_enter = get_cpu_timer();
+ }
bpon();
__load_psw_mask(psw_mask);
}
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 234a0ba30510..c19528eb4ee3 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -17,6 +17,7 @@
#include <asm/vtimer.h>
#include <asm/vtime.h>
#include <asm/cpu_mf.h>
+#include <asm/idle.h>
#include <asm/smp.h>
#include "entry.h"
@@ -111,23 +112,30 @@ static void account_system_index_scaled(struct task_struct *p, u64 cputime,
account_system_index_time(p, cputime_to_nsecs(cputime), index);
}
-/*
- * Update process times based on virtual cpu times stored by entry.S
- * to the lowcore fields user_timer, system_timer & steal_clock.
- */
-static int do_account_vtime(struct task_struct *tsk)
+static inline void vtime_reset_last_update(struct lowcore *lc)
{
- u64 timer, clock, user, guest, system, hardirq, softirq;
- struct lowcore *lc = get_lowcore();
-
- timer = lc->last_update_timer;
- clock = lc->last_update_clock;
asm volatile(
" stpt %0\n" /* Store current cpu timer value */
" stckf %1" /* Store current tod clock value */
: "=Q" (lc->last_update_timer),
"=Q" (lc->last_update_clock)
: : "cc");
+}
+
+/*
+ * Update process times based on virtual cpu times stored by entry.S
+ * to the lowcore fields user_timer, system_timer & steal_clock.
+ */
+static int do_account_vtime(struct task_struct *tsk)
+{
+ u64 timer, clock, user, guest, system, hardirq, softirq;
+ struct lowcore *lc = get_lowcore();
+
+ timer = lc->last_update_timer;
+ clock = lc->last_update_clock;
+
+ vtime_reset_last_update(lc);
+
clock = lc->last_update_clock - clock;
timer -= lc->last_update_timer;
@@ -261,6 +269,43 @@ void vtime_account_hardirq(struct task_struct *tsk)
virt_timer_forward(delta);
}
+#ifdef CONFIG_NO_HZ_COMMON
+/**
+ * vtime_reset - Fast forward vtime entry clocks
+ *
+ * Called from dynticks idle IRQ entry to fast-forward the clocks to current time
+ * so that the IRQ time is still accounted by vtime while nohz cputime is paused.
+ */
+void vtime_reset(void)
+{
+ vtime_reset_last_update(get_lowcore());
+}
+
+/**
+ * vtime_dyntick_start - Inform vtime about entry to idle-dynticks
+ *
+ * Called when idle enters dyntick mode. The idle cputime that elapsed so far
+ * is flushed and the tick subsystem takes over the idle cputime accounting.
+ */
+void vtime_dyntick_start(void)
+{
+ __this_cpu_write(s390_idle.idle_dyntick, true);
+ vtime_flush(current);
+}
+
+/**
+ * vtime_dyntick_stop - Inform vtime about exit from idle-dynticks
+ *
+ * Called when idle exits from dyntick mode. The vtime entry clocks are
+ * fast-forwarded to current time and idle accounting resumes.
+ */
+void vtime_dyntick_stop(void)
+{
+ vtime_reset_last_update(get_lowcore());
+ __this_cpu_write(s390_idle.idle_dyntick, false);
+}
+#endif /* CONFIG_NO_HZ_COMMON */
+
/*
* Sorted add to a list. List is linear searched until first bigger
* element is found.
--
2.51.1
* [PATCH 06/15] tick/sched: Unify idle cputime accounting
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
The non-vtime dynticks-idle cputime accounting is a big mess that
accumulates into two concurrent statistics, each with its own
shortcomings:
* The accounting for online CPUs which is based on the delta between
tick_nohz_start_idle() and tick_nohz_stop_idle().
Pros:
- Works when the tick is off
- Has nsecs granularity
Cons:
- Accounts idle steal time but doesn't subtract it from idle
cputime.
- Assumes CONFIG_IRQ_TIME_ACCOUNTING=y by not accounting IRQ
time as idle, but that IRQ time is simply lost when
CONFIG_IRQ_TIME_ACCOUNTING=n
- The windows between 1) the idle task being scheduled and the
first call to tick_nohz_start_idle() and 2) the last call to
tick_nohz_stop_idle() and the rest of the idle task's runtime
are blind spots wrt. cputime accounting (though a mostly
insignificant amount)
- Relies on private fields outside of kernel stats, with specific
accessors.
* The accounting for offline CPUs which is based on ticks and the
jiffies delta during which the tick was stopped.
Pros:
- Handles steal time correctly
- Handles CONFIG_IRQ_TIME_ACCOUNTING=y and
CONFIG_IRQ_TIME_ACCOUNTING=n correctly.
- Handles the whole idle task
- Accounts directly to kernel stats, without a midlayer accumulator.
Cons:
- Doesn't elapse when the tick is off, which makes it
unsuitable for online CPUs.
- Has TICK_NSEC granularity (jiffies)
- Needs to track the dyntick-idle ticks that were accounted and
subtract them from the total jiffies time spent while the tick
was stopped. This is an ugly workaround.
Having two different accountings for a single context is not the only
problem: since those accountings are of different natures, it is
possible to observe the global idle time going backward after a CPU goes
offline.
Clean up the situation by introducing a hybrid approach that stays
coherent and works for both online and offline CPUs:
* Tick based or native vtime accounting operates before the idle loop
is entered and resumes once the idle loop prepares to exit.
* When the idle loop starts, switch to dynticks-idle accounting as is
done currently, except that the statistics accumulate directly to the
relevant kernel stat fields.
* Private dyntick cputime accounting fields are removed.
* Works in both the online and the offline case.
Further improvements will include:
* Only switch to dynticks-idle cputime accounting when the tick actually
goes into dynticks mode.
* Handle CONFIG_IRQ_TIME_ACCOUNTING=n correctly such that the
dynticks-idle accounting still elapses while in IRQs.
* Correctly subtract idle steal cputime from idle time.
Reported-by: Xin Zhao <jackzxcui1989@163.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/kernel_stat.h | 24 ++++++++++---
include/linux/vtime.h | 7 +++-
kernel/sched/cputime.c | 62 ++++++++++++++++----------------
kernel/time/tick-sched.c | 71 +++++++++++--------------------------
4 files changed, 76 insertions(+), 88 deletions(-)
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index dd020ecaf67b..ba65aad308a1 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -34,6 +34,9 @@ enum cpu_usage_stat {
};
struct kernel_cpustat {
+#ifdef CONFIG_NO_HZ_COMMON
+ int idle_dyntick;
+#endif
u64 cpustat[NR_STATS];
};
@@ -99,6 +102,20 @@ static inline unsigned long kstat_cpu_irqs_sum(unsigned int cpu)
return kstat_cpu(cpu).irqs_sum;
}
+#ifdef CONFIG_NO_HZ_COMMON
+extern void kcpustat_dyntick_start(void);
+extern void kcpustat_dyntick_stop(void);
+static inline bool kcpustat_idle_dyntick(void)
+{
+ return __this_cpu_read(kernel_cpustat.idle_dyntick);
+}
+#else
+static inline bool kcpustat_idle_dyntick(void)
+{
+ return false;
+}
+#endif /* CONFIG_NO_HZ_COMMON */
+
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
extern u64 kcpustat_field(enum cpu_usage_stat usage, int cpu);
extern void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu);
@@ -113,7 +130,7 @@ static inline void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
*dst = kcpustat_cpu(cpu);
}
-#endif
+#endif /* !CONFIG_VIRT_CPU_ACCOUNTING_GEN */
extern void account_user_time(struct task_struct *, u64);
extern void account_guest_time(struct task_struct *, u64);
@@ -127,14 +144,13 @@ extern u64 get_idle_time(struct kernel_cpustat *kcs, int cpu);
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
static inline void account_process_tick(struct task_struct *tsk, int user)
{
- vtime_flush(tsk);
+ if (!kcpustat_idle_dyntick())
+ vtime_flush(tsk);
}
#else
extern void account_process_tick(struct task_struct *, int user);
#endif
-extern void account_idle_ticks(unsigned long ticks);
-
#ifdef CONFIG_SCHED_CORE
extern void __account_forceidle_time(struct task_struct *tsk, u64 delta);
#endif
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 61b94c12d7dd..a4506336002d 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -31,6 +31,11 @@ static inline bool vtime_generic_enabled_cpu(int cpu)
return context_tracking_enabled_cpu(cpu);
}
+static inline bool vtime_generic_enabled_this_cpu(void)
+{
+ return context_tracking_enabled_this_cpu();
+}
+
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
extern void vtime_account_idle(struct task_struct *tsk);
extern void vtime_account_irq(struct task_struct *tsk, unsigned int offset);
@@ -90,7 +95,7 @@ static inline bool vtime_accounting_enabled_cpu(int cpu)
static inline bool vtime_accounting_enabled_this_cpu(void)
{
- return context_tracking_enabled_this_cpu();
+ return vtime_generic_enabled_this_cpu();
}
extern void vtime_task_switch_generic(struct task_struct *prev);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5613838d0307..d67f93e845a7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -400,16 +400,30 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
}
}
-static void irqtime_account_idle_ticks(int ticks)
-{
- irqtime_account_process_tick(current, 0, ticks);
-}
#else /* !CONFIG_IRQ_TIME_ACCOUNTING: */
-static inline void irqtime_account_idle_ticks(int ticks) { }
static inline void irqtime_account_process_tick(struct task_struct *p, int user_tick,
int nr_ticks) { }
#endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
+#ifdef CONFIG_NO_HZ_COMMON
+void kcpustat_dyntick_start(void)
+{
+ if (!vtime_generic_enabled_this_cpu()) {
+ vtime_dyntick_start();
+ __this_cpu_write(kernel_cpustat.idle_dyntick, 1);
+ }
+}
+
+void kcpustat_dyntick_stop(void)
+{
+ if (!vtime_generic_enabled_this_cpu()) {
+ __this_cpu_write(kernel_cpustat.idle_dyntick, 0);
+ vtime_dyntick_stop();
+ steal_account_process_time(ULONG_MAX);
+ }
+}
+#endif /* CONFIG_NO_HZ_COMMON */
+
/*
* Use precise platform statistics if available:
*/
@@ -423,11 +437,15 @@ void vtime_account_irq(struct task_struct *tsk, unsigned int offset)
vtime_account_hardirq(tsk);
} else if (pc & SOFTIRQ_OFFSET) {
vtime_account_softirq(tsk);
- } else if (!IS_ENABLED(CONFIG_HAVE_VIRT_CPU_ACCOUNTING_IDLE) &&
- is_idle_task(tsk)) {
- vtime_account_idle(tsk);
+ } else if (!kcpustat_idle_dyntick()) {
+ if (!IS_ENABLED(CONFIG_HAVE_VIRT_CPU_ACCOUNTING_IDLE) &&
+ is_idle_task(tsk)) {
+ vtime_account_idle(tsk);
+ } else {
+ vtime_account_kernel(tsk);
+ }
} else {
- vtime_account_kernel(tsk);
+ vtime_reset();
}
}
@@ -469,6 +487,9 @@ void account_process_tick(struct task_struct *p, int user_tick)
if (vtime_accounting_enabled_this_cpu())
return;
+ if (kcpustat_idle_dyntick())
+ return;
+
if (irqtime_enabled()) {
irqtime_account_process_tick(p, user_tick, 1);
return;
@@ -490,29 +511,6 @@ void account_process_tick(struct task_struct *p, int user_tick)
account_idle_time(cputime);
}
-/*
- * Account multiple ticks of idle time.
- * @ticks: number of stolen ticks
- */
-void account_idle_ticks(unsigned long ticks)
-{
- u64 cputime, steal;
-
- if (irqtime_enabled()) {
- irqtime_account_idle_ticks(ticks);
- return;
- }
-
- cputime = ticks * TICK_NSEC;
- steal = steal_account_process_time(ULONG_MAX);
-
- if (steal >= cputime)
- return;
-
- cputime -= steal;
- account_idle_time(cputime);
-}
-
/*
* Adjust tick based cputime random precision against scheduler runtime
* accounting.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9632066aea4d..21ac561a8545 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -285,8 +285,6 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
if (IS_ENABLED(CONFIG_NO_HZ_COMMON) &&
tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
touch_softlockup_watchdog_sched();
- if (is_idle_task(current))
- ts->idle_jiffies++;
/*
* In case the current tick fired too early past its expected
* expiration, make sure we don't bypass the next clock reprogramming
@@ -744,8 +742,12 @@ static void tick_nohz_update_jiffies(ktime_t now)
static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
{
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
ktime_t delta;
+ if (vtime_generic_enabled_this_cpu())
+ return;
+
if (WARN_ON_ONCE(!tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE)))
return;
@@ -753,9 +755,9 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
write_seqcount_begin(&ts->idle_sleeptime_seq);
if (nr_iowait_cpu(smp_processor_id()) > 0)
- ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
+ cpustat[CPUTIME_IOWAIT] = ktime_add(cpustat[CPUTIME_IOWAIT], delta);
else
- ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+ cpustat[CPUTIME_IDLE] = ktime_add(cpustat[CPUTIME_IDLE], delta);
ts->idle_entrytime = now;
tick_sched_flag_clear(ts, TS_FLAG_IDLE_ACTIVE);
@@ -766,18 +768,21 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
static void tick_nohz_start_idle(struct tick_sched *ts)
{
+ if (vtime_generic_enabled_this_cpu())
+ return;
+
write_seqcount_begin(&ts->idle_sleeptime_seq);
ts->idle_entrytime = ktime_get();
tick_sched_flag_set(ts, TS_FLAG_IDLE_ACTIVE);
write_seqcount_end(&ts->idle_sleeptime_seq);
-
sched_clock_idle_sleep_event();
}
-static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx, ktime_t *sleeptime,
+static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
bool compute_delta, u64 *last_update_time)
{
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+ u64 *cpustat = kcpustat_cpu(cpu).cpustat;
ktime_t now, idle;
unsigned int seq;
@@ -799,9 +804,9 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx, ktime_t *slee
if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE) && compute_delta) {
ktime_t delta = ktime_sub(now, ts->idle_entrytime);
- idle = ktime_add(*sleeptime, delta);
+ idle = ktime_add(cpustat[idx], delta);
} else {
- idle = *sleeptime;
+ idle = cpustat[idx];
}
} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
@@ -828,9 +833,7 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx, ktime_t *slee
*/
u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
{
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-
- return get_cpu_sleep_time_us(cpu, CPUTIME_IDLE, &ts->idle_sleeptime,
+ return get_cpu_sleep_time_us(cpu, CPUTIME_IDLE,
!nr_iowait_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
@@ -854,9 +857,7 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
*/
u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
{
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-
- return get_cpu_sleep_time_us(cpu, CPUTIME_IOWAIT, &ts->iowait_sleeptime,
+ return get_cpu_sleep_time_us(cpu, CPUTIME_IOWAIT,
nr_iowait_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
@@ -1256,10 +1257,8 @@ void tick_nohz_idle_stop_tick(void)
ts->idle_sleeps++;
ts->idle_expires = expires;
- if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
- ts->idle_jiffies = ts->last_jiffies;
+ if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED))
nohz_balance_enter_idle(cpu);
- }
} else {
tick_nohz_retain_tick(ts);
}
@@ -1288,6 +1287,7 @@ void tick_nohz_idle_enter(void)
WARN_ON_ONCE(ts->timer_expires_base);
tick_sched_flag_set(ts, TS_FLAG_INIDLE);
+ kcpustat_dyntick_start();
tick_nohz_start_idle(ts);
local_irq_enable();
@@ -1413,37 +1413,12 @@ unsigned long tick_nohz_get_idle_calls_cpu(int cpu)
return ts->idle_calls;
}
-static void tick_nohz_account_idle_time(struct tick_sched *ts,
- ktime_t now)
-{
- unsigned long ticks;
-
- ts->idle_exittime = now;
-
- if (vtime_accounting_enabled_this_cpu())
- return;
- /*
- * We stopped the tick in idle. update_process_times() would miss the
- * time we slept, as it does only a 1 tick accounting.
- * Enforce that this is accounted to idle !
- */
- ticks = jiffies - ts->idle_jiffies;
- /*
- * We might be one off. Do not randomly account a huge number of ticks!
- */
- if (ticks && ticks < LONG_MAX)
- account_idle_ticks(ticks);
-}
-
void tick_nohz_idle_restart_tick(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
- if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
- ktime_t now = ktime_get();
- tick_nohz_restart_sched_tick(ts, now);
- tick_nohz_account_idle_time(ts, now);
- }
+ if (tick_sched_flag_test(ts, TS_FLAG_STOPPED))
+ tick_nohz_restart_sched_tick(ts, ktime_get());
}
static void tick_nohz_idle_update_tick(struct tick_sched *ts, ktime_t now)
@@ -1452,8 +1427,6 @@ static void tick_nohz_idle_update_tick(struct tick_sched *ts, ktime_t now)
__tick_nohz_full_update_tick(ts, now);
else
tick_nohz_restart_sched_tick(ts, now);
-
- tick_nohz_account_idle_time(ts, now);
}
/**
@@ -1495,6 +1468,7 @@ void tick_nohz_idle_exit(void)
if (tick_stopped)
tick_nohz_idle_update_tick(ts, now);
+ kcpustat_dyntick_stop();
local_irq_enable();
}
@@ -1631,20 +1605,15 @@ void tick_setup_sched_timer(bool hrtimer)
void tick_sched_timer_dying(int cpu)
{
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- ktime_t idle_sleeptime, iowait_sleeptime;
unsigned long idle_calls, idle_sleeps;
/* This must happen before hrtimers are migrated! */
if (tick_sched_flag_test(ts, TS_FLAG_HIGHRES))
hrtimer_cancel(&ts->sched_timer);
- idle_sleeptime = ts->idle_sleeptime;
- iowait_sleeptime = ts->iowait_sleeptime;
idle_calls = ts->idle_calls;
idle_sleeps = ts->idle_sleeps;
memset(ts, 0, sizeof(*ts));
- ts->idle_sleeptime = idle_sleeptime;
- ts->iowait_sleeptime = iowait_sleeptime;
ts->idle_calls = idle_calls;
ts->idle_sleeps = idle_sleeps;
}
--
2.51.1
* [PATCH 07/15] cpufreq: ondemand: Simplify idle cputime granularity test
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
cpufreq calls get_cpu_idle_time_us() just to know whether idle cputime
accounting has nanosecond granularity.
Use the appropriate indicator instead to make that deduction.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
drivers/cpufreq/cpufreq_ondemand.c | 7 +------
include/linux/tick.h | 2 ++
kernel/time/hrtimer.c | 2 +-
kernel/time/tick-internal.h | 2 --
kernel/time/tick-sched.c | 8 +++++++-
kernel/time/timer.c | 2 +-
6 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index a6ecc203f7b7..bb7db82930e4 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -334,17 +334,12 @@ static void od_free(struct policy_dbs_info *policy_dbs)
static int od_init(struct dbs_data *dbs_data)
{
struct od_dbs_tuners *tuners;
- u64 idle_time;
- int cpu;
tuners = kzalloc(sizeof(*tuners), GFP_KERNEL);
if (!tuners)
return -ENOMEM;
- cpu = get_cpu();
- idle_time = get_cpu_idle_time_us(cpu, NULL);
- put_cpu();
- if (idle_time != -1ULL) {
+ if (tick_nohz_is_active()) {
/* Idle micro accounting is supported. Use finer thresholds */
dbs_data->up_threshold = MICRO_FREQUENCY_UP_THRESHOLD;
} else {
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ac76ae9fa36d..738007d6f577 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -126,6 +126,7 @@ enum tick_dep_bits {
#ifdef CONFIG_NO_HZ_COMMON
extern bool tick_nohz_enabled;
+extern bool tick_nohz_is_active(void);
extern bool tick_nohz_tick_stopped(void);
extern bool tick_nohz_tick_stopped_cpu(int cpu);
extern void tick_nohz_idle_stop_tick(void);
@@ -142,6 +143,7 @@ extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
#else /* !CONFIG_NO_HZ_COMMON */
#define tick_nohz_enabled (0)
+static inline bool tick_nohz_is_active(void) { return false; }
static inline int tick_nohz_tick_stopped(void) { return 0; }
static inline int tick_nohz_tick_stopped_cpu(int cpu) { return 0; }
static inline void tick_nohz_idle_stop_tick(void) { }
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index f8ea8c8fc895..e1bbf883dfa8 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -943,7 +943,7 @@ void clock_was_set(unsigned int bases)
cpumask_var_t mask;
int cpu;
- if (!hrtimer_hres_active(cpu_base) && !tick_nohz_active)
+ if (!hrtimer_hres_active(cpu_base) && !tick_nohz_is_active())
goto out_timerfd;
if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 4e4f7bbe2a64..597d816d22e8 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -156,7 +156,6 @@ static inline void tick_nohz_init(void) { }
#endif
#ifdef CONFIG_NO_HZ_COMMON
-extern unsigned long tick_nohz_active;
extern void timers_update_nohz(void);
extern u64 get_jiffies_update(unsigned long *basej);
# ifdef CONFIG_SMP
@@ -171,7 +170,6 @@ extern void timer_expire_remote(unsigned int cpu);
# endif
#else /* CONFIG_NO_HZ_COMMON */
static inline void timers_update_nohz(void) { }
-#define tick_nohz_active (0)
#endif
DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 21ac561a8545..81c619bf662c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -691,7 +691,7 @@ void __init tick_nohz_init(void)
* NO HZ enabled ?
*/
bool tick_nohz_enabled __read_mostly = true;
-unsigned long tick_nohz_active __read_mostly;
+static unsigned long tick_nohz_active __read_mostly;
/*
* Enable / Disable tickless mode
*/
@@ -702,6 +702,12 @@ static int __init setup_tick_nohz(char *str)
__setup("nohz=", setup_tick_nohz);
+bool tick_nohz_is_active(void)
+{
+ return tick_nohz_active;
+}
+EXPORT_SYMBOL_GPL(tick_nohz_is_active);
+
bool tick_nohz_tick_stopped(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 1f2364126894..7e1e3bde6b8b 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -281,7 +281,7 @@ DEFINE_STATIC_KEY_FALSE(timers_migration_enabled);
static void timers_update_migration(void)
{
- if (sysctl_timer_migration && tick_nohz_active)
+ if (sysctl_timer_migration && tick_nohz_is_active())
static_branch_enable(&timers_migration_enabled);
else
static_branch_disable(&timers_migration_enabled);
--
2.51.1
* [PATCH 08/15] tick/sched: Remove nohz disabled special case in cputime fetch
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
Even when nohz is not runtime enabled, the dynticks idle cputime
accounting can run and the common idle cputime accessors are still
relevant.
Remove the nohz disabled special case accordingly.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/time/tick-sched.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 81c619bf662c..2b58cdc326c3 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -792,9 +792,6 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
ktime_t now, idle;
unsigned int seq;
- if (!tick_nohz_active)
- return -1;
-
now = ktime_get();
if (last_update_time)
*last_update_time = ktime_to_us(now);
@@ -835,7 +832,7 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
* This time is measured via accounting rather than sampling,
* and is as accurate as ktime_get() is.
*
- * Return: -1 if NOHZ is not enabled, else total idle time of the @cpu
+ * Return: -1 if generic vtime is enabled, else total idle time of the @cpu
*/
u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
{
@@ -859,7 +856,7 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
* This time is measured via accounting rather than sampling,
* and is as accurate as ktime_get() is.
*
- * Return: -1 if NOHZ is not enabled, else total iowait time of @cpu
+ * Return: -1 if generic vtime is enabled, else total iowait time of @cpu
*/
u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
{
--
2.51.1
* [PATCH 09/15] tick/sched: Move dyntick-idle cputime accounting to cputime code
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
Although the dynticks-idle cputime accounting is necessarily tied to
the tick subsystem, the actual related accounting code has no business
residing there and should be part of the scheduler cputime code.
Move the relevant pieces and state machine to where they belong.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/kernel_stat.h | 14 +++-
kernel/sched/cputime.c | 149 +++++++++++++++++++++++++++++++--
kernel/time/tick-sched.c | 162 +++++++-----------------------------
3 files changed, 184 insertions(+), 141 deletions(-)
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index ba65aad308a1..9343353ac7a3 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -35,9 +35,12 @@ enum cpu_usage_stat {
struct kernel_cpustat {
#ifdef CONFIG_NO_HZ_COMMON
- int idle_dyntick;
+ bool idle_dyntick;
+ bool idle_elapse;
+ seqcount_t idle_sleeptime_seq;
+ u64 idle_entrytime;
#endif
- u64 cpustat[NR_STATS];
+ u64 cpustat[NR_STATS];
};
struct kernel_stat {
@@ -103,8 +106,11 @@ static inline unsigned long kstat_cpu_irqs_sum(unsigned int cpu)
}
#ifdef CONFIG_NO_HZ_COMMON
-extern void kcpustat_dyntick_start(void);
-extern void kcpustat_dyntick_stop(void);
+extern void kcpustat_dyntick_start(u64 now);
+extern void kcpustat_dyntick_stop(u64 now);
+extern void kcpustat_irq_enter(u64 now);
+extern void kcpustat_irq_exit(u64 now);
+
static inline bool kcpustat_idle_dyntick(void)
{
return __this_cpu_read(kernel_cpustat.idle_dyntick);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index d67f93e845a7..d2cad4d8dc10 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -2,6 +2,7 @@
/*
* Simple CPU accounting cgroup controller
*/
+#include <linux/sched/clock.h>
#include <linux/sched/cputime.h>
#include <linux/tsacct_kern.h>
#include "sched.h"
@@ -406,22 +407,156 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
#endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
#ifdef CONFIG_NO_HZ_COMMON
-void kcpustat_dyntick_start(void)
+static void kcpustat_idle_stop(struct kernel_cpustat *kc, u64 now)
{
- if (!vtime_generic_enabled_this_cpu()) {
- vtime_dyntick_start();
- __this_cpu_write(kernel_cpustat.idle_dyntick, 1);
- }
+ u64 *cpustat = kc->cpustat;
+ u64 delta;
+
+ if (!kc->idle_elapse)
+ return;
+
+ delta = now - kc->idle_entrytime;
+
+ write_seqcount_begin(&kc->idle_sleeptime_seq);
+ if (nr_iowait_cpu(smp_processor_id()) > 0)
+ cpustat[CPUTIME_IOWAIT] += delta;
+ else
+ cpustat[CPUTIME_IDLE] += delta;
+
+ kc->idle_entrytime = now;
+ kc->idle_elapse = false;
+ write_seqcount_end(&kc->idle_sleeptime_seq);
}
-void kcpustat_dyntick_stop(void)
+static void kcpustat_idle_start(struct kernel_cpustat *kc, u64 now)
{
+ write_seqcount_begin(&kc->idle_sleeptime_seq);
+ kc->idle_entrytime = now;
+ kc->idle_elapse = true;
+ write_seqcount_end(&kc->idle_sleeptime_seq);
+}
+
+void kcpustat_dyntick_stop(u64 now)
+{
+ struct kernel_cpustat *kc = kcpustat_this_cpu;
+
if (!vtime_generic_enabled_this_cpu()) {
- __this_cpu_write(kernel_cpustat.idle_dyntick, 0);
+ WARN_ON_ONCE(!kc->idle_dyntick);
+ kcpustat_idle_stop(kc, now);
+ kc->idle_dyntick = false;
vtime_dyntick_stop();
steal_account_process_time(ULONG_MAX);
}
}
+
+void kcpustat_dyntick_start(u64 now)
+{
+ struct kernel_cpustat *kc = kcpustat_this_cpu;
+
+ if (!vtime_generic_enabled_this_cpu()) {
+ vtime_dyntick_start();
+ kc->idle_dyntick = true;
+ kcpustat_idle_start(kc, now);
+ }
+}
+
+void kcpustat_irq_enter(u64 now)
+{
+ struct kernel_cpustat *kc = kcpustat_this_cpu;
+
+ if (!vtime_generic_enabled_this_cpu())
+ kcpustat_idle_stop(kc, now);
+}
+
+void kcpustat_irq_exit(u64 now)
+{
+ struct kernel_cpustat *kc = kcpustat_this_cpu;
+
+ if (!vtime_generic_enabled_this_cpu())
+ kcpustat_idle_start(kc, now);
+}
+
+static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
+ bool compute_delta, u64 *last_update_time)
+{
+ struct kernel_cpustat *kc = &kcpustat_cpu(cpu);
+ u64 *cpustat = kc->cpustat;
+ unsigned int seq;
+ ktime_t now;
+ u64 idle;
+
+ now = ktime_get();
+ if (last_update_time)
+ *last_update_time = ktime_to_us(now);
+
+ if (vtime_generic_enabled_cpu(cpu)) {
+ idle = kcpustat_field(idx, cpu);
+ goto to_us;
+ }
+
+ do {
+ seq = read_seqcount_begin(&kc->idle_sleeptime_seq);
+
+ if (kc->idle_elapse && compute_delta)
+ idle = cpustat[idx] + (now - kc->idle_entrytime);
+ else
+ idle = cpustat[idx];
+ } while (read_seqcount_retry(&kc->idle_sleeptime_seq, seq));
+
+to_us:
+ do_div(idle, NSEC_PER_USEC);
+
+ return idle;
+}
+
+/**
+ * get_cpu_idle_time_us - get the total idle time of a CPU
+ * @cpu: CPU number to query
+ * @last_update_time: variable to store update time in. Do not update
+ * counters if NULL.
+ *
+ * Return the cumulative idle time (since boot) for a given
+ * CPU, in microseconds. Note that this is partially broken due to
+ * the counter of iowait tasks that can be remotely updated without
+ * any synchronization. Therefore it is possible to observe backward
+ * values within two consecutive reads.
+ *
+ * This time is measured via accounting rather than sampling,
+ * and is as accurate as ktime_get() is.
+ *
+ * Return: -1 if generic vtime is enabled, else total idle time of the @cpu
+ */
+u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
+{
+ return get_cpu_sleep_time_us(cpu, CPUTIME_IDLE,
+ !nr_iowait_cpu(cpu), last_update_time);
+}
+EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
+
+/**
+ * get_cpu_iowait_time_us - get the total iowait time of a CPU
+ * @cpu: CPU number to query
+ * @last_update_time: variable to store update time in. Do not update
+ * counters if NULL.
+ *
+ * Return the cumulative iowait time (since boot) for a given
+ * CPU, in microseconds. Note this is partially broken due to
+ * the counter of iowait tasks that can be remotely updated without
+ * any synchronization. Therefore it is possible to observe backward
+ * values within two consecutive reads.
+ *
+ * This time is measured via accounting rather than sampling,
+ * and is as accurate as ktime_get() is.
+ *
+ * Return: -1 if generic vtime is enabled, else total iowait time of @cpu
+ */
+u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
+{
+ return get_cpu_sleep_time_us(cpu, CPUTIME_IOWAIT,
+ nr_iowait_cpu(cpu), last_update_time);
+}
+EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
+
#endif /* CONFIG_NO_HZ_COMMON */
/*
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 2b58cdc326c3..aa36c8d218e2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -746,125 +746,6 @@ static void tick_nohz_update_jiffies(ktime_t now)
touch_softlockup_watchdog_sched();
}
-static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
-{
- u64 *cpustat = kcpustat_this_cpu->cpustat;
- ktime_t delta;
-
- if (vtime_generic_enabled_this_cpu())
- return;
-
- if (WARN_ON_ONCE(!tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE)))
- return;
-
- delta = ktime_sub(now, ts->idle_entrytime);
-
- write_seqcount_begin(&ts->idle_sleeptime_seq);
- if (nr_iowait_cpu(smp_processor_id()) > 0)
- cpustat[CPUTIME_IOWAIT] = ktime_add(cpustat[CPUTIME_IOWAIT], delta);
- else
- cpustat[CPUTIME_IDLE] = ktime_add(cpustat[CPUTIME_IDLE], delta);
-
- ts->idle_entrytime = now;
- tick_sched_flag_clear(ts, TS_FLAG_IDLE_ACTIVE);
- write_seqcount_end(&ts->idle_sleeptime_seq);
-
- sched_clock_idle_wakeup_event();
-}
-
-static void tick_nohz_start_idle(struct tick_sched *ts)
-{
- if (vtime_generic_enabled_this_cpu())
- return;
-
- write_seqcount_begin(&ts->idle_sleeptime_seq);
- ts->idle_entrytime = ktime_get();
- tick_sched_flag_set(ts, TS_FLAG_IDLE_ACTIVE);
- write_seqcount_end(&ts->idle_sleeptime_seq);
- sched_clock_idle_sleep_event();
-}
-
-static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
- bool compute_delta, u64 *last_update_time)
-{
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- u64 *cpustat = kcpustat_cpu(cpu).cpustat;
- ktime_t now, idle;
- unsigned int seq;
-
- now = ktime_get();
- if (last_update_time)
- *last_update_time = ktime_to_us(now);
-
- if (vtime_generic_enabled_cpu(cpu)) {
- idle = kcpustat_field(idx, cpu);
- return ktime_to_us(idle);
- }
-
- do {
- seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
-
- if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE) && compute_delta) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-
- idle = ktime_add(cpustat[idx], delta);
- } else {
- idle = cpustat[idx];
- }
- } while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
-
- return ktime_to_us(idle);
-
-}
-
-/**
- * get_cpu_idle_time_us - get the total idle time of a CPU
- * @cpu: CPU number to query
- * @last_update_time: variable to store update time in. Do not update
- * counters if NULL.
- *
- * Return the cumulative idle time (since boot) for a given
- * CPU, in microseconds. Note that this is partially broken due to
- * the counter of iowait tasks that can be remotely updated without
- * any synchronization. Therefore it is possible to observe backward
- * values within two consecutive reads.
- *
- * This time is measured via accounting rather than sampling,
- * and is as accurate as ktime_get() is.
- *
- * Return: -1 if generic vtime is enabled, else total idle time of the @cpu
- */
-u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
-{
- return get_cpu_sleep_time_us(cpu, CPUTIME_IDLE,
- !nr_iowait_cpu(cpu), last_update_time);
-}
-EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
-
-/**
- * get_cpu_iowait_time_us - get the total iowait time of a CPU
- * @cpu: CPU number to query
- * @last_update_time: variable to store update time in. Do not update
- * counters if NULL.
- *
- * Return the cumulative iowait time (since boot) for a given
- * CPU, in microseconds. Note this is partially broken due to
- * the counter of iowait tasks that can be remotely updated without
- * any synchronization. Therefore it is possible to observe backward
- * values within two consecutive reads.
- *
- * This time is measured via accounting rather than sampling,
- * and is as accurate as ktime_get() is.
- *
- * Return: -1 if generic vtime is enabled, else total iowait time of @cpu
- */
-u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
-{
- return get_cpu_sleep_time_us(cpu, CPUTIME_IOWAIT,
- nr_iowait_cpu(cpu), last_update_time);
-}
-EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
-
static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
{
hrtimer_cancel(&ts->sched_timer);
@@ -1272,6 +1153,20 @@ void tick_nohz_idle_retain_tick(void)
tick_nohz_retain_tick(this_cpu_ptr(&tick_cpu_sched));
}
+static void tick_nohz_clock_sleep(struct tick_sched *ts)
+{
+ tick_sched_flag_set(ts, TS_FLAG_IDLE_ACTIVE);
+ sched_clock_idle_sleep_event();
+}
+
+static void tick_nohz_clock_wakeup(struct tick_sched *ts)
+{
+ if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE)) {
+ tick_sched_flag_clear(ts, TS_FLAG_IDLE_ACTIVE);
+ sched_clock_idle_wakeup_event();
+ }
+}
+
/**
* tick_nohz_idle_enter - prepare for entering idle on the current CPU
*
@@ -1286,12 +1181,11 @@ void tick_nohz_idle_enter(void)
local_irq_disable();
ts = this_cpu_ptr(&tick_cpu_sched);
-
WARN_ON_ONCE(ts->timer_expires_base);
-
tick_sched_flag_set(ts, TS_FLAG_INIDLE);
- kcpustat_dyntick_start();
- tick_nohz_start_idle(ts);
+ ts->idle_entrytime = ktime_get();
+ kcpustat_dyntick_start(ts->idle_entrytime);
+ tick_nohz_clock_sleep(ts);
local_irq_enable();
}
@@ -1319,10 +1213,13 @@ void tick_nohz_irq_exit(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
- if (tick_sched_flag_test(ts, TS_FLAG_INIDLE))
- tick_nohz_start_idle(ts);
- else
+ if (tick_sched_flag_test(ts, TS_FLAG_INIDLE)) {
+ ts->idle_entrytime = ktime_get();
+ kcpustat_irq_exit(ts->idle_entrytime);
+ tick_nohz_clock_sleep(ts);
+ } else {
tick_nohz_full_update_tick(ts);
+ }
}
/**
@@ -1467,11 +1364,11 @@ void tick_nohz_idle_exit(void)
now = ktime_get();
if (idle_active)
- tick_nohz_stop_idle(ts, now);
+ tick_nohz_clock_wakeup(ts);
if (tick_stopped)
tick_nohz_idle_update_tick(ts, now);
- kcpustat_dyntick_stop();
+ kcpustat_dyntick_stop(now);
local_irq_enable();
}
@@ -1527,9 +1424,14 @@ static inline void tick_nohz_irq_enter(void)
if (!tick_sched_flag_test(ts, TS_FLAG_STOPPED | TS_FLAG_IDLE_ACTIVE))
return;
+
now = ktime_get();
- if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE))
- tick_nohz_stop_idle(ts, now);
+
+ if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE)) {
+ tick_nohz_clock_wakeup(ts);
+ kcpustat_irq_enter(now);
+ }
+
/*
* If all CPUs are idle we may need to update a stale jiffies value.
* Note nohz_full is a special case: a timekeeper is guaranteed to stay
--
2.51.1
^ permalink raw reply related [flat|nested] 40+ messages in thread

* [PATCH 10/15] tick/sched: Remove unused fields
2026-02-06 14:22 [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
` (8 preceding siblings ...)
2026-02-06 14:22 ` [PATCH 09/15] tick/sched: Move dyntick-idle cputime accounting to cputime code Frederic Weisbecker
@ 2026-02-06 14:22 ` Frederic Weisbecker
2026-02-06 14:22 ` [PATCH 11/15] tick/sched: Account tickless idle cputime only when tick is stopped Frederic Weisbecker
` (5 subsequent siblings)
15 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev, Shrikanth Hegde
Remove the tick_sched fields left unused after the dyntick-idle cputime
accounting migrated to the scheduler code.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/time/tick-sched.h | 12 ------------
kernel/time/timer_list.c | 6 +-----
scripts/gdb/linux/timerlist.py | 4 ----
3 files changed, 1 insertion(+), 21 deletions(-)
diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
index b4a7822f495d..79b9252047b1 100644
--- a/kernel/time/tick-sched.h
+++ b/kernel/time/tick-sched.h
@@ -44,9 +44,7 @@ struct tick_device {
* to resume the tick timer operation in the timeline
* when the CPU returns from nohz sleep.
* @next_tick: Next tick to be fired when in dynticks mode.
- * @idle_jiffies: jiffies at the entry to idle for idle time accounting
* @idle_waketime: Time when the idle was interrupted
- * @idle_sleeptime_seq: sequence counter for data consistency
* @idle_entrytime: Time when the idle call was entered
* @last_jiffies: Base jiffies snapshot when next event was last computed
* @timer_expires_base: Base time clock monotonic for @timer_expires
@@ -55,9 +53,6 @@ struct tick_device {
* @idle_expires: Next tick in idle, for debugging purpose only
* @idle_calls: Total number of idle calls
* @idle_sleeps: Number of idle calls, where the sched tick was stopped
- * @idle_exittime: Time when the idle state was left
- * @idle_sleeptime: Sum of the time slept in idle with sched tick stopped
- * @iowait_sleeptime: Sum of the time slept in idle with sched tick stopped, with IO outstanding
* @tick_dep_mask: Tick dependency mask - is set, if someone needs the tick
* @check_clocks: Notification mechanism about clocksource changes
*/
@@ -73,12 +68,10 @@ struct tick_sched {
struct hrtimer sched_timer;
ktime_t last_tick;
ktime_t next_tick;
- unsigned long idle_jiffies;
ktime_t idle_waketime;
unsigned int got_idle_tick;
/* Idle entry */
- seqcount_t idle_sleeptime_seq;
ktime_t idle_entrytime;
/* Tick stop */
@@ -90,11 +83,6 @@ struct tick_sched {
unsigned long idle_calls;
unsigned long idle_sleeps;
- /* Idle exit */
- ktime_t idle_exittime;
- ktime_t idle_sleeptime;
- ktime_t iowait_sleeptime;
-
/* Full dynticks handling */
atomic_t tick_dep_mask;
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index 488e47e96e93..e77b512e8597 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -154,14 +154,10 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
P_flag(highres, TS_FLAG_HIGHRES);
P_ns(last_tick);
P_flag(tick_stopped, TS_FLAG_STOPPED);
- P(idle_jiffies);
P(idle_calls);
P(idle_sleeps);
P_ns(idle_entrytime);
P_ns(idle_waketime);
- P_ns(idle_exittime);
- P_ns(idle_sleeptime);
- P_ns(iowait_sleeptime);
P(last_jiffies);
P(next_timer);
P_ns(idle_expires);
@@ -258,7 +254,7 @@ static void timer_list_show_tickdevices_header(struct seq_file *m)
static inline void timer_list_header(struct seq_file *m, u64 now)
{
- SEQ_printf(m, "Timer List Version: v0.10\n");
+ SEQ_printf(m, "Timer List Version: v0.11\n");
SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES);
SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);
SEQ_printf(m, "\n");
diff --git a/scripts/gdb/linux/timerlist.py b/scripts/gdb/linux/timerlist.py
index ccc24d30de80..c14ce55674c9 100644
--- a/scripts/gdb/linux/timerlist.py
+++ b/scripts/gdb/linux/timerlist.py
@@ -90,14 +90,10 @@ def print_cpu(hrtimer_bases, cpu, max_clock_bases):
text += f" .{'nohz':15s}: {int(bool(ts['flags'] & TS_FLAG_NOHZ))}\n"
text += f" .{'last_tick':15s}: {ts['last_tick']}\n"
text += f" .{'tick_stopped':15s}: {int(bool(ts['flags'] & TS_FLAG_STOPPED))}\n"
- text += f" .{'idle_jiffies':15s}: {ts['idle_jiffies']}\n"
text += f" .{'idle_calls':15s}: {ts['idle_calls']}\n"
text += f" .{'idle_sleeps':15s}: {ts['idle_sleeps']}\n"
text += f" .{'idle_entrytime':15s}: {ts['idle_entrytime']} nsecs\n"
text += f" .{'idle_waketime':15s}: {ts['idle_waketime']} nsecs\n"
- text += f" .{'idle_exittime':15s}: {ts['idle_exittime']} nsecs\n"
- text += f" .{'idle_sleeptime':15s}: {ts['idle_sleeptime']} nsecs\n"
- text += f" .{'iowait_sleeptime':15s}: {ts['iowait_sleeptime']} nsecs\n"
text += f" .{'last_jiffies':15s}: {ts['last_jiffies']}\n"
text += f" .{'next_timer':15s}: {ts['next_timer']}\n"
text += f" .{'idle_expires':15s}: {ts['idle_expires']} nsecs\n"
--
2.51.1
* [PATCH 11/15] tick/sched: Account tickless idle cputime only when tick is stopped
2026-02-06 14:22 [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
` (9 preceding siblings ...)
2026-02-06 14:22 ` [PATCH 10/15] tick/sched: Remove unused fields Frederic Weisbecker
@ 2026-02-06 14:22 ` Frederic Weisbecker
2026-02-06 14:22 ` [PATCH 12/15] tick/sched: Consolidate idle time fetching APIs Frederic Weisbecker
` (4 subsequent siblings)
15 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev, Shrikanth Hegde
There is no real point in switching to dyntick-idle cputime accounting
mode if the tick is not actually stopped. This just adds overhead,
notably fetching the GTOD, on each idle exit and each idle IRQ entry for
no reason during short idle trips.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/time/tick-sched.c | 44 ++++++++++++++++++----------------------
1 file changed, 20 insertions(+), 24 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index aa36c8d218e2..bceed0c4dd2c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1141,8 +1141,10 @@ void tick_nohz_idle_stop_tick(void)
ts->idle_sleeps++;
ts->idle_expires = expires;
- if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED))
+ if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
+ kcpustat_dyntick_start(ts->idle_entrytime);
nohz_balance_enter_idle(cpu);
+ }
} else {
tick_nohz_retain_tick(ts);
}
@@ -1184,7 +1186,6 @@ void tick_nohz_idle_enter(void)
WARN_ON_ONCE(ts->timer_expires_base);
tick_sched_flag_set(ts, TS_FLAG_INIDLE);
ts->idle_entrytime = ktime_get();
- kcpustat_dyntick_start(ts->idle_entrytime);
tick_nohz_clock_sleep(ts);
local_irq_enable();
@@ -1214,9 +1215,10 @@ void tick_nohz_irq_exit(void)
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
if (tick_sched_flag_test(ts, TS_FLAG_INIDLE)) {
- ts->idle_entrytime = ktime_get();
- kcpustat_irq_exit(ts->idle_entrytime);
tick_nohz_clock_sleep(ts);
+ ts->idle_entrytime = ktime_get();
+ if (tick_sched_flag_test(ts, TS_FLAG_STOPPED))
+ kcpustat_irq_exit(ts->idle_entrytime);
} else {
tick_nohz_full_update_tick(ts);
}
@@ -1317,8 +1319,11 @@ void tick_nohz_idle_restart_tick(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
- if (tick_sched_flag_test(ts, TS_FLAG_STOPPED))
- tick_nohz_restart_sched_tick(ts, ktime_get());
+ if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
+ ktime_t now = ktime_get();
+ kcpustat_dyntick_stop(now);
+ tick_nohz_restart_sched_tick(ts, now);
+ }
}
static void tick_nohz_idle_update_tick(struct tick_sched *ts, ktime_t now)
@@ -1348,7 +1353,6 @@ static void tick_nohz_idle_update_tick(struct tick_sched *ts, ktime_t now)
void tick_nohz_idle_exit(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
- bool idle_active, tick_stopped;
ktime_t now;
local_irq_disable();
@@ -1357,18 +1361,13 @@ void tick_nohz_idle_exit(void)
WARN_ON_ONCE(ts->timer_expires_base);
tick_sched_flag_clear(ts, TS_FLAG_INIDLE);
- idle_active = tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE);
- tick_stopped = tick_sched_flag_test(ts, TS_FLAG_STOPPED);
+ tick_nohz_clock_wakeup(ts);
- if (idle_active || tick_stopped)
+ if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
now = ktime_get();
-
- if (idle_active)
- tick_nohz_clock_wakeup(ts);
-
- if (tick_stopped)
+ kcpustat_dyntick_stop(now);
tick_nohz_idle_update_tick(ts, now);
- kcpustat_dyntick_stop(now);
+ }
local_irq_enable();
}
@@ -1422,15 +1421,13 @@ static inline void tick_nohz_irq_enter(void)
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
ktime_t now;
- if (!tick_sched_flag_test(ts, TS_FLAG_STOPPED | TS_FLAG_IDLE_ACTIVE))
+ tick_nohz_clock_wakeup(ts);
+
+ if (!tick_sched_flag_test(ts, TS_FLAG_STOPPED))
return;
now = ktime_get();
-
- if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE)) {
- tick_nohz_clock_wakeup(ts);
- kcpustat_irq_enter(now);
- }
+ kcpustat_irq_enter(now);
/*
* If all CPUs are idle we may need to update a stale jiffies value.
@@ -1439,8 +1436,7 @@ static inline void tick_nohz_irq_enter(void)
* rare case (typically stop machine). So we must make sure we have a
* last resort.
*/
- if (tick_sched_flag_test(ts, TS_FLAG_STOPPED))
- tick_nohz_update_jiffies(now);
+ tick_nohz_update_jiffies(now);
}
#else
--
2.51.1
* [PATCH 12/15] tick/sched: Consolidate idle time fetching APIs
2026-02-06 14:22 [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
` (10 preceding siblings ...)
2026-02-06 14:22 ` [PATCH 11/15] tick/sched: Account tickless idle cputime only when tick is stopped Frederic Weisbecker
@ 2026-02-06 14:22 ` Frederic Weisbecker
2026-02-06 22:35 ` Frederic Weisbecker
2026-02-06 14:22 ` [PATCH 13/15] sched/cputime: Provide get_cpu_[idle|iowait]_time_us() off-case Frederic Weisbecker
` (3 subsequent siblings)
15 siblings, 1 reply; 40+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev, Shrikanth Hegde
The idle cputime can be fetched through a variety of accessors scattered
all over the place, depending on the accounting flavour and the caller's
needs:
- idle vtime generic accounting can be accessed by kcpustat_field(),
kcpustat_cpu_fetch(), get_idle/iowait_time() and
get_cpu_idle/iowait_time_us()
- dynticks-idle accounting can only be accessed by get_idle/iowait_time()
or get_cpu_idle/iowait_time_us()
- CONFIG_NO_HZ_COMMON=n idle accounting can be accessed by kcpustat_field()
kcpustat_cpu_fetch(), or get_idle/iowait_time() but not by
get_cpu_idle/iowait_time_us()
Moreover get_idle/iowait_time() relies on get_cpu_idle/iowait_time_us(),
with a nonsensical conversion to microseconds and back to nanoseconds
on the way.
Start consolidating the APIs with removing get_idle/iowait_time() and
make kcpustat_field() and kcpustat_cpu_fetch() work for all cases.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
fs/proc/stat.c | 40 +++---------------------
fs/proc/uptime.c | 8 ++---
include/linux/kernel_stat.h | 34 ++++++++++++++++++---
kernel/sched/cputime.c | 61 ++++++++++++++++++++++++-------------
4 files changed, 76 insertions(+), 67 deletions(-)
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 8b444e862319..c00468a83f64 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -22,38 +22,6 @@
#define arch_irq_stat() 0
#endif
-u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
-{
- u64 idle, idle_usecs = -1ULL;
-
- if (cpu_online(cpu))
- idle_usecs = get_cpu_idle_time_us(cpu, NULL);
-
- if (idle_usecs == -1ULL)
- /* !NO_HZ or cpu offline so we can rely on cpustat.idle */
- idle = kcs->cpustat[CPUTIME_IDLE];
- else
- idle = idle_usecs * NSEC_PER_USEC;
-
- return idle;
-}
-
-static u64 get_iowait_time(struct kernel_cpustat *kcs, int cpu)
-{
- u64 iowait, iowait_usecs = -1ULL;
-
- if (cpu_online(cpu))
- iowait_usecs = get_cpu_iowait_time_us(cpu, NULL);
-
- if (iowait_usecs == -1ULL)
- /* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
- iowait = kcs->cpustat[CPUTIME_IOWAIT];
- else
- iowait = iowait_usecs * NSEC_PER_USEC;
-
- return iowait;
-}
-
static void show_irq_gap(struct seq_file *p, unsigned int gap)
{
static const char zeros[] = " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0";
@@ -105,8 +73,8 @@ static int show_stat(struct seq_file *p, void *v)
user += cpustat[CPUTIME_USER];
nice += cpustat[CPUTIME_NICE];
system += cpustat[CPUTIME_SYSTEM];
- idle += get_idle_time(&kcpustat, i);
- iowait += get_iowait_time(&kcpustat, i);
+ idle += cpustat[CPUTIME_IDLE];
+ iowait += cpustat[CPUTIME_IOWAIT];
irq += cpustat[CPUTIME_IRQ];
softirq += cpustat[CPUTIME_SOFTIRQ];
steal += cpustat[CPUTIME_STEAL];
@@ -146,8 +114,8 @@ static int show_stat(struct seq_file *p, void *v)
user = cpustat[CPUTIME_USER];
nice = cpustat[CPUTIME_NICE];
system = cpustat[CPUTIME_SYSTEM];
- idle = get_idle_time(&kcpustat, i);
- iowait = get_iowait_time(&kcpustat, i);
+ idle = cpustat[CPUTIME_IDLE];
+ iowait = cpustat[CPUTIME_IOWAIT];
irq = cpustat[CPUTIME_IRQ];
softirq = cpustat[CPUTIME_SOFTIRQ];
steal = cpustat[CPUTIME_STEAL];
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index b5343d209381..433aa947cd57 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -18,12 +18,8 @@ static int uptime_proc_show(struct seq_file *m, void *v)
int i;
idle_nsec = 0;
- for_each_possible_cpu(i) {
- struct kernel_cpustat kcs;
-
- kcpustat_cpu_fetch(&kcs, i);
- idle_nsec += get_idle_time(&kcs, i);
- }
+ for_each_possible_cpu(i)
+ idle_nsec += kcpustat_field(CPUTIME_IDLE, i);
ktime_get_boottime_ts64(&uptime);
timens_add_boottime(&uptime);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 9343353ac7a3..3680519d7b2c 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -110,32 +110,59 @@ extern void kcpustat_dyntick_start(u64 now);
extern void kcpustat_dyntick_stop(u64 now);
extern void kcpustat_irq_enter(u64 now);
extern void kcpustat_irq_exit(u64 now);
+extern u64 kcpustat_field_idle(int cpu);
+extern u64 kcpustat_field_iowait(int cpu);
static inline bool kcpustat_idle_dyntick(void)
{
return __this_cpu_read(kernel_cpustat.idle_dyntick);
}
#else
+static inline u64 kcpustat_field_idle(int cpu)
+{
+ return kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
+}
+static inline u64 kcpustat_field_iowait(int cpu)
+{
+ return kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
+}
+
static inline bool kcpustat_idle_dyntick(void)
{
return false;
}
#endif /* CONFIG_NO_HZ_COMMON */
+/* Fetch cputime values when vtime is disabled on a CPU */
+static inline u64 kcpustat_field_default(enum cpu_usage_stat usage, int cpu)
+{
+ if (usage == CPUTIME_IDLE)
+ return kcpustat_field_idle(cpu);
+ if (usage == CPUTIME_IOWAIT)
+ return kcpustat_field_iowait(cpu);
+ return kcpustat_cpu(cpu).cpustat[usage];
+}
+
+static inline void kcpustat_cpu_fetch_default(struct kernel_cpustat *dst, int cpu)
+{
+ *dst = kcpustat_cpu(cpu);
+ dst->cpustat[CPUTIME_IDLE] = kcpustat_field_idle(cpu);
+ dst->cpustat[CPUTIME_IOWAIT] = kcpustat_field_iowait(cpu);
+}
+
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
extern u64 kcpustat_field(enum cpu_usage_stat usage, int cpu);
extern void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu);
#else
static inline u64 kcpustat_field(enum cpu_usage_stat usage, int cpu)
{
- return kcpustat_cpu(cpu).cpustat[usage];
+ return kcpustat_field_default(usage, cpu);
}
static inline void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
{
- *dst = kcpustat_cpu(cpu);
+ kcpustat_cpu_fetch_default(dst, cpu);
}
-
#endif /* !CONFIG_VIRT_CPU_ACCOUNTING_GEN */
extern void account_user_time(struct task_struct *, u64);
@@ -145,7 +172,6 @@ extern void account_system_index_time(struct task_struct *, u64,
enum cpu_usage_stat);
extern void account_steal_time(u64);
extern void account_idle_time(u64);
-extern u64 get_idle_time(struct kernel_cpustat *kcs, int cpu);
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
static inline void account_process_tick(struct task_struct *tsk, int user)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index d2cad4d8dc10..057fdc00dbc6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -476,24 +476,14 @@ void kcpustat_irq_exit(u64 now)
kcpustat_idle_start(kc, now);
}
-static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
- bool compute_delta, u64 *last_update_time)
+static u64 kcpustat_field_dyntick(int cpu, enum cpu_usage_stat idx,
+ bool compute_delta, u64 now)
{
struct kernel_cpustat *kc = &kcpustat_cpu(cpu);
u64 *cpustat = kc->cpustat;
unsigned int seq;
- ktime_t now;
u64 idle;
- now = ktime_get();
- if (last_update_time)
- *last_update_time = ktime_to_us(now);
-
- if (vtime_generic_enabled_cpu(cpu)) {
- idle = kcpustat_field(idx, cpu);
- goto to_us;
- }
-
do {
seq = read_seqcount_begin(&kc->idle_sleeptime_seq);
@@ -503,12 +493,42 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
idle = cpustat[idx];
} while (read_seqcount_retry(&kc->idle_sleeptime_seq, seq));
-to_us:
- do_div(idle, NSEC_PER_USEC);
-
return idle;
}
+u64 kcpustat_field_idle(int cpu)
+{
+ return kcpustat_field_dyntick(cpu, CPUTIME_IDLE,
+ !nr_iowait_cpu(cpu), ktime_get());
+}
+EXPORT_SYMBOL_GPL(kcpustat_field_idle);
+
+u64 kcpustat_field_iowait(int cpu)
+{
+ return kcpustat_field_dyntick(cpu, CPUTIME_IOWAIT,
+ nr_iowait_cpu(cpu), ktime_get());
+}
+EXPORT_SYMBOL_GPL(kcpustat_field_iowait);
+
+static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
+ bool compute_delta, u64 *last_update_time)
+{
+ ktime_t now = ktime_get();
+ u64 res;
+
+ if (vtime_generic_enabled_cpu(cpu))
+ res = kcpustat_field(idx, cpu);
+ else
+ res = kcpustat_field_dyntick(cpu, idx, compute_delta, now);
+
+ do_div(res, NSEC_PER_USEC);
+
+ if (last_update_time)
+ *last_update_time = res;
+
+ return res;
+}
+
/**
* get_cpu_idle_time_us - get the total idle time of a CPU
* @cpu: CPU number to query
@@ -556,7 +576,6 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
nr_iowait_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
-
#endif /* CONFIG_NO_HZ_COMMON */
/*
@@ -1110,8 +1129,8 @@ u64 kcpustat_field(enum cpu_usage_stat usage, int cpu)
struct rq *rq;
int err;
- if (!vtime_accounting_enabled_cpu(cpu))
- return val;
+ if (!vtime_generic_enabled_cpu(cpu))
+ return kcpustat_field_default(usage, cpu);
rq = cpu_rq(cpu);
@@ -1206,8 +1225,8 @@ void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
struct rq *rq;
int err;
- if (!vtime_accounting_enabled_cpu(cpu)) {
- *dst = *src;
+ if (!vtime_generic_enabled_cpu(cpu)) {
+ kcpustat_cpu_fetch_default(dst, cpu);
return;
}
@@ -1220,7 +1239,7 @@ void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
curr = rcu_dereference(rq->curr);
if (WARN_ON_ONCE(!curr)) {
rcu_read_unlock();
- *dst = *src;
+ kcpustat_cpu_fetch_default(dst, cpu);
return;
}
--
2.51.1
* Re: [PATCH 12/15] tick/sched: Consolidate idle time fetching APIs
2026-02-06 14:22 ` [PATCH 12/15] tick/sched: Consolidate idle time fetching APIs Frederic Weisbecker
@ 2026-02-06 22:35 ` Frederic Weisbecker
0 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 22:35 UTC (permalink / raw)
To: LKML
Cc: Christophe Leroy (CS GROUP), Rafael J. Wysocki, Alexander Gordeev,
Anna-Maria Behnsen, Ben Segall, Boqun Feng, Christian Borntraeger,
Dietmar Eggemann, Heiko Carstens, Ingo Molnar, Jan Kiszka,
Joel Fernandes, Juri Lelli, Kieran Bingham, Madhavan Srinivasan,
Mel Gorman, Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev, Shrikanth Hegde
On Fri, Feb 06, 2026 at 03:22:42PM +0100, Frederic Weisbecker wrote:
> +static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
> + bool compute_delta, u64 *last_update_time)
> +{
> + ktime_t now = ktime_get();
> + u64 res;
> +
> + if (vtime_generic_enabled_cpu(cpu))
> + res = kcpustat_field(idx, cpu);
> + else
> + res = kcpustat_field_dyntick(cpu, idx, compute_delta, now);
> +
> + do_div(res, NSEC_PER_USEC);
> +
> + if (last_update_time)
> + *last_update_time = res;
Urgh, this should be *last_update_time = ktime_to_us(now)
--
Frederic Weisbecker
SUSE Labs
* [PATCH 13/15] sched/cputime: Provide get_cpu_[idle|iowait]_time_us() off-case
2026-02-06 14:22 [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
` (11 preceding siblings ...)
2026-02-06 14:22 ` [PATCH 12/15] tick/sched: Consolidate idle time fetching APIs Frederic Weisbecker
@ 2026-02-06 14:22 ` Frederic Weisbecker
2026-02-06 14:22 ` [PATCH 14/15] sched/cputime: Handle idle irqtime gracefully Frederic Weisbecker
` (2 subsequent siblings)
15 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev, Shrikanth Hegde
The only remaining reason why get_cpu_idle/iowait_time_us() may return
-1 is a configuration without nohz support (CONFIG_NO_HZ_COMMON=n).
cpufreq's ad-hoc fallback in that case is to compute jiffies minus the
whole busy cputime. Although the intention is to provide a coherent
low-resolution estimation of the idle and iowait time, the
implementation is buggy because jiffies don't start at 0.
Instead, provide a real get_cpu_[idle|iowait]_time_us() off-case.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
drivers/cpufreq/cpufreq.c | 29 +----------------------------
include/linux/kernel_stat.h | 3 +++
include/linux/tick.h | 4 ----
kernel/sched/cputime.c | 12 +++++++++---
4 files changed, 13 insertions(+), 35 deletions(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 4472bb1ec83c..ecb9634cd06b 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -130,38 +130,11 @@ struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy)
}
EXPORT_SYMBOL_GPL(get_governor_parent_kobj);
-static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
-{
- struct kernel_cpustat kcpustat;
- u64 cur_wall_time;
- u64 idle_time;
- u64 busy_time;
-
- cur_wall_time = jiffies64_to_nsecs(get_jiffies_64());
-
- kcpustat_cpu_fetch(&kcpustat, cpu);
-
- busy_time = kcpustat.cpustat[CPUTIME_USER];
- busy_time += kcpustat.cpustat[CPUTIME_SYSTEM];
- busy_time += kcpustat.cpustat[CPUTIME_IRQ];
- busy_time += kcpustat.cpustat[CPUTIME_SOFTIRQ];
- busy_time += kcpustat.cpustat[CPUTIME_STEAL];
- busy_time += kcpustat.cpustat[CPUTIME_NICE];
-
- idle_time = cur_wall_time - busy_time;
- if (wall)
- *wall = div_u64(cur_wall_time, NSEC_PER_USEC);
-
- return div_u64(idle_time, NSEC_PER_USEC);
-}
-
u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy)
{
u64 idle_time = get_cpu_idle_time_us(cpu, io_busy ? wall : NULL);
- if (idle_time == -1ULL)
- return get_cpu_idle_time_jiffy(cpu, wall);
- else if (!io_busy)
+ if (!io_busy)
idle_time += get_cpu_iowait_time_us(cpu, wall);
return idle_time;
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 3680519d7b2c..512104b0ff49 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -133,6 +133,9 @@ static inline bool kcpustat_idle_dyntick(void)
}
#endif /* CONFIG_NO_HZ_COMMON */
+extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
+extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+
/* Fetch cputime values when vtime is disabled on a CPU */
static inline u64 kcpustat_field_default(enum cpu_usage_stat usage, int cpu)
{
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 738007d6f577..1cf4651f09ad 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -139,8 +139,6 @@ extern bool tick_nohz_idle_got_tick(void);
extern ktime_t tick_nohz_get_next_hrtimer(void);
extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
-extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
-extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
#else /* !CONFIG_NO_HZ_COMMON */
#define tick_nohz_enabled (0)
static inline bool tick_nohz_is_active(void) { return false; }
@@ -162,8 +160,6 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
*delta_next = TICK_NSEC;
return *delta_next;
}
-static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
-static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
#endif /* !CONFIG_NO_HZ_COMMON */
/*
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 057fdc00dbc6..d588a4a50e57 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -509,6 +509,13 @@ u64 kcpustat_field_iowait(int cpu)
nr_iowait_cpu(cpu), ktime_get());
}
EXPORT_SYMBOL_GPL(kcpustat_field_iowait);
+#else
+static u64 kcpustat_field_dyntick(int cpu, enum cpu_usage_stat idx,
+ bool compute_delta, ktime_t now)
+{
+ return kcpustat_cpu(cpu).cpustat[idx];
+}
+#endif /* CONFIG_NO_HZ_COMMON */
static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
bool compute_delta, u64 *last_update_time)
@@ -544,7 +551,7 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
* This time is measured via accounting rather than sampling,
* and is as accurate as ktime_get() is.
*
- * Return: -1 if generic vtime is enabled, else total idle time of the @cpu
+ * Return: total idle time of the @cpu
*/
u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
{
@@ -568,7 +575,7 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
* This time is measured via accounting rather than sampling,
* and is as accurate as ktime_get() is.
*
- * Return: -1 if generic vtime is enabled, else total iowait time of @cpu
+ * Return: total iowait time of @cpu
*/
u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
{
@@ -576,7 +583,6 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
nr_iowait_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
-#endif /* CONFIG_NO_HZ_COMMON */
/*
* Use precise platform statistics if available:
--
2.51.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 14/15] sched/cputime: Handle idle irqtime gracefully
2026-02-06 14:22 [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
` (12 preceding siblings ...)
2026-02-06 14:22 ` [PATCH 13/15] sched/cputime: Provide get_cpu_[idle|iowait]_time_us() off-case Frederic Weisbecker
@ 2026-02-06 14:22 ` Frederic Weisbecker
2026-03-03 11:11 ` Shrikanth Hegde
2026-02-06 14:22 ` [PATCH 15/15] sched/cputime: Handle dyntick-idle steal time correctly Frederic Weisbecker
2026-02-11 13:43 ` [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Shrikanth Hegde
15 siblings, 1 reply; 40+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev, Shrikanth Hegde
The dyntick-idle cputime accounting always assumes that IRQ time
accounting is enabled and consequently stops elapsing the idle time
during dyntick-idle IRQs.
This doesn't mix well with disabled IRQ time accounting because idle
IRQs then become a cputime blind spot. Also this feature is disabled
on most configurations, where the overhead of pausing dyntick-idle
accounting while in idle IRQs could be avoided.
Fix the situation by pausing dyntick-idle accounting during idle IRQs
only when either native vtime (which does IRQ time accounting) or
generic IRQ time accounting is enabled.
Also make sure that the accumulated IRQ time is not accidentally
subtracted from later accounting.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/sched/cputime.c | 24 +++++++++++++++++++++---
kernel/sched/sched.h | 1 +
2 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index d588a4a50e57..92fa2f037b6e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -46,7 +46,8 @@ static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
u64_stats_update_begin(&irqtime->sync);
cpustat[idx] += delta;
irqtime->total += delta;
- irqtime->tick_delta += delta;
+ if (!irqtime->idle_dyntick)
+ irqtime->tick_delta += delta;
u64_stats_update_end(&irqtime->sync);
}
@@ -81,6 +82,16 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
}
+static inline void irqtime_dyntick_start(void)
+{
+ __this_cpu_write(cpu_irqtime.idle_dyntick, true);
+}
+
+static inline void irqtime_dyntick_stop(void)
+{
+ __this_cpu_write(cpu_irqtime.idle_dyntick, false);
+}
+
static u64 irqtime_tick_accounted(u64 maxtime)
{
struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
@@ -94,6 +105,9 @@ static u64 irqtime_tick_accounted(u64 maxtime)
#else /* !CONFIG_IRQ_TIME_ACCOUNTING: */
+static inline void irqtime_dyntick_start(void) { }
+static inline void irqtime_dyntick_stop(void) { }
+
static u64 irqtime_tick_accounted(u64 dummy)
{
return 0;
@@ -444,6 +458,7 @@ void kcpustat_dyntick_stop(u64 now)
WARN_ON_ONCE(!kc->idle_dyntick);
kcpustat_idle_stop(kc, now);
kc->idle_dyntick = false;
+ irqtime_dyntick_stop();
vtime_dyntick_stop();
steal_account_process_time(ULONG_MAX);
}
@@ -455,6 +470,7 @@ void kcpustat_dyntick_start(u64 now)
if (!vtime_generic_enabled_this_cpu()) {
vtime_dyntick_start();
+ irqtime_dyntick_start();
kc->idle_dyntick = true;
kcpustat_idle_start(kc, now);
}
@@ -464,7 +480,8 @@ void kcpustat_irq_enter(u64 now)
{
struct kernel_cpustat *kc = kcpustat_this_cpu;
- if (!vtime_generic_enabled_this_cpu())
+ if (!vtime_generic_enabled_this_cpu() &&
+ (irqtime_enabled() || vtime_accounting_enabled_this_cpu()))
kcpustat_idle_stop(kc, now);
}
@@ -472,7 +489,8 @@ void kcpustat_irq_exit(u64 now)
{
struct kernel_cpustat *kc = kcpustat_this_cpu;
- if (!vtime_generic_enabled_this_cpu())
+ if (!vtime_generic_enabled_this_cpu() &&
+ (irqtime_enabled() || vtime_accounting_enabled_this_cpu()))
kcpustat_idle_start(kc, now);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5..cf677ff12b10 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3307,6 +3307,7 @@ static inline void sched_core_tick(struct rq *rq) { }
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
struct irqtime {
+ bool idle_dyntick;
u64 total;
u64 tick_delta;
u64 irq_start_time;
--
2.51.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 14/15] sched/cputime: Handle idle irqtime gracefully
2026-02-06 14:22 ` [PATCH 14/15] sched/cputime: Handle idle irqtime gracefully Frederic Weisbecker
@ 2026-03-03 11:11 ` Shrikanth Hegde
2026-03-20 14:32 ` Frederic Weisbecker
0 siblings, 1 reply; 40+ messages in thread
From: Shrikanth Hegde @ 2026-03-03 11:11 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Christophe Leroy (CS GROUP), Rafael J. Wysocki, Alexander Gordeev,
Anna-Maria Behnsen, Ben Segall, Boqun Feng, Christian Borntraeger,
Dietmar Eggemann, Heiko Carstens, Ingo Molnar, Jan Kiszka,
Joel Fernandes, Juri Lelli, Kieran Bingham, Madhavan Srinivasan,
Mel Gorman, Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev
Hi Frederic,
On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
> The dyntick-idle cputime accounting always assumes that IRQ time
> accounting is enabled and consequently stops elapsing the idle time
> during dyntick-idle IRQs.
>
> This doesn't mix well with disabled IRQ time accounting because idle
> IRQs then become a cputime blind spot. Also this feature is disabled
> on most configurations, where the overhead of pausing dyntick-idle
> accounting while in idle IRQs could be avoided.
>
> Fix the situation by pausing dyntick-idle accounting during idle IRQs
> only when either native vtime (which does IRQ time accounting) or
> generic IRQ time accounting is enabled.
>
> Also make sure that the accumulated IRQ time is not accidentally
> subtracted from later accounting.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> kernel/sched/cputime.c | 24 +++++++++++++++++++++---
> kernel/sched/sched.h | 1 +
> 2 files changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index d588a4a50e57..92fa2f037b6e 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -46,7 +46,8 @@ static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
> u64_stats_update_begin(&irqtime->sync);
> cpustat[idx] += delta;
> irqtime->total += delta;
> - irqtime->tick_delta += delta;
> + if (!irqtime->idle_dyntick)
> + irqtime->tick_delta += delta;
Wouldn't kcpustat_idle_dyntick achieve the same thing?
> u64_stats_update_end(&irqtime->sync);
> }
>
> @@ -81,6 +82,16 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
> irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
> }
>
> +static inline void irqtime_dyntick_start(void)
> +{
> + __this_cpu_write(cpu_irqtime.idle_dyntick, true);
> +}
> +
> +static inline void irqtime_dyntick_stop(void)
> +{
> + __this_cpu_write(cpu_irqtime.idle_dyntick, false);
> +}
> +
> static u64 irqtime_tick_accounted(u64 maxtime)
> {
> struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
> @@ -94,6 +105,9 @@ static u64 irqtime_tick_accounted(u64 maxtime)
>
> #else /* !CONFIG_IRQ_TIME_ACCOUNTING: */
>
> +static inline void irqtime_dyntick_start(void) { }
> +static inline void irqtime_dyntick_stop(void) { }
> +
> static u64 irqtime_tick_accounted(u64 dummy)
> {
> return 0;
> @@ -444,6 +458,7 @@ void kcpustat_dyntick_stop(u64 now)
> WARN_ON_ONCE(!kc->idle_dyntick);
> kcpustat_idle_stop(kc, now);
> kc->idle_dyntick = false;
> + irqtime_dyntick_stop();
> vtime_dyntick_stop();
> steal_account_process_time(ULONG_MAX);
> }
> @@ -455,6 +470,7 @@ void kcpustat_dyntick_start(u64 now)
>
> if (!vtime_generic_enabled_this_cpu()) {
> vtime_dyntick_start();
> + irqtime_dyntick_start();
> kc->idle_dyntick = true;
> kcpustat_idle_start(kc, now);
> }
> @@ -464,7 +480,8 @@ void kcpustat_irq_enter(u64 now)
> {
> struct kernel_cpustat *kc = kcpustat_this_cpu;
>
> - if (!vtime_generic_enabled_this_cpu())
> + if (!vtime_generic_enabled_this_cpu() &&
> + (irqtime_enabled() || vtime_accounting_enabled_this_cpu()))
> kcpustat_idle_stop(kc, now);
> }
Scenario: context_tracking is not enabled (since nohz_full or isolcpus is not specified) and
irqtime/native is not enabled (config is CONFIG_VIRT_CPU_ACCOUNTING_GEN + IRQ_TIME=n).
The cpu goes into tickless mode and gets irqs, but kcpustat_irq_enter/exit is a nop.
Is the time it spent in irq then still accounted as idle time during kcpustat_dyntick_stop?
Who is going to account the irq time in this case?
>
> @@ -472,7 +489,8 @@ void kcpustat_irq_exit(u64 now)
> {
> struct kernel_cpustat *kc = kcpustat_this_cpu;
>
> - if (!vtime_generic_enabled_this_cpu())
> + if (!vtime_generic_enabled_this_cpu() &&
> + (irqtime_enabled() || vtime_accounting_enabled_this_cpu()))
> kcpustat_idle_start(kc, now);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d30cca6870f5..cf677ff12b10 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3307,6 +3307,7 @@ static inline void sched_core_tick(struct rq *rq) { }
> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>
> struct irqtime {
> + bool idle_dyntick;
> u64 total;
> u64 tick_delta;
> u64 irq_start_time;
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 14/15] sched/cputime: Handle idle irqtime gracefully
2026-03-03 11:11 ` Shrikanth Hegde
@ 2026-03-20 14:32 ` Frederic Weisbecker
0 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2026-03-20 14:32 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
Le Tue, Mar 03, 2026 at 04:41:18PM +0530, Shrikanth Hegde a écrit :
> Hi Frederic,
>
> On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
> > The dyntick-idle cputime accounting always assumes that IRQ time
> > accounting is enabled and consequently stops elapsing the idle time
> > during dyntick-idle IRQs.
> >
> > This doesn't mix well with disabled IRQ time accounting because idle
> > IRQs then become a cputime blind spot. Also this feature is disabled
> > on most configurations, where the overhead of pausing dyntick-idle
> > accounting while in idle IRQs could be avoided.
> >
> > Fix the situation by pausing dyntick-idle accounting during idle IRQs
> > only when either native vtime (which does IRQ time accounting) or
> > generic IRQ time accounting is enabled.
> >
> > Also make sure that the accumulated IRQ time is not accidentally
> > subtracted from later accounting.
> >
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > ---
> > kernel/sched/cputime.c | 24 +++++++++++++++++++++---
> > kernel/sched/sched.h | 1 +
> > 2 files changed, 22 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index d588a4a50e57..92fa2f037b6e 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -46,7 +46,8 @@ static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
> > u64_stats_update_begin(&irqtime->sync);
> > cpustat[idx] += delta;
> > irqtime->total += delta;
> > - irqtime->tick_delta += delta;
> > + if (!irqtime->idle_dyntick)
> > + irqtime->tick_delta += delta;
>
> Wouldn't kcpustat_idle_dyntick achieve the same thing?
Yes indeed.
>
> > u64_stats_update_end(&irqtime->sync);
> > }
> > @@ -81,6 +82,16 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
> > irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
> > }
> > +static inline void irqtime_dyntick_start(void)
> > +{
> > + __this_cpu_write(cpu_irqtime.idle_dyntick, true);
> > +}
> > +
> > +static inline void irqtime_dyntick_stop(void)
> > +{
> > + __this_cpu_write(cpu_irqtime.idle_dyntick, false);
> > +}
> > +
> > static u64 irqtime_tick_accounted(u64 maxtime)
> > {
> > struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
> > @@ -94,6 +105,9 @@ static u64 irqtime_tick_accounted(u64 maxtime)
> > #else /* !CONFIG_IRQ_TIME_ACCOUNTING: */
> > +static inline void irqtime_dyntick_start(void) { }
> > +static inline void irqtime_dyntick_stop(void) { }
> > +
> > static u64 irqtime_tick_accounted(u64 dummy)
> > {
> > return 0;
> > @@ -444,6 +458,7 @@ void kcpustat_dyntick_stop(u64 now)
> > WARN_ON_ONCE(!kc->idle_dyntick);
> > kcpustat_idle_stop(kc, now);
> > kc->idle_dyntick = false;
> > + irqtime_dyntick_stop();
> > vtime_dyntick_stop();
> > steal_account_process_time(ULONG_MAX);
> > }
> > @@ -455,6 +470,7 @@ void kcpustat_dyntick_start(u64 now)
> > if (!vtime_generic_enabled_this_cpu()) {
> > vtime_dyntick_start();
> > + irqtime_dyntick_start();
> > kc->idle_dyntick = true;
> > kcpustat_idle_start(kc, now);
> > }
> > @@ -464,7 +480,8 @@ void kcpustat_irq_enter(u64 now)
> > {
> > struct kernel_cpustat *kc = kcpustat_this_cpu;
> > - if (!vtime_generic_enabled_this_cpu())
> > + if (!vtime_generic_enabled_this_cpu() &&
> > + (irqtime_enabled() || vtime_accounting_enabled_this_cpu()))
> > kcpustat_idle_stop(kc, now);
> > }
> Scenario: context_tracking is not enabled (since nohz_full or isolcpus is not specified) and
> irqtime/native is not enabled (config is CONFIG_VIRT_CPU_ACCOUNTING_GEN + IRQ_TIME=n).
>
>
> The cpu goes into tickless mode and gets irqs, but kcpustat_irq_enter/exit is a nop.
> Is the time it spent in irq then still accounted as idle time during
> kcpustat_dyntick_stop?
Right! As is the case for IRQs firing during system and user time. Basically this
just consolidates the IRQ time accounting behaviour in the CONFIG_VIRT_CPU_ACCOUNTING_GEN=n case.
> Who is going to account the irq time in this case?
Nothing, it's part of idle time.
We could also decide to account the idle IRQ time as system time. I guess it's a
matter of which semantic we want to give. Though that would be more overhead.
Thanks.
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 15/15] sched/cputime: Handle dyntick-idle steal time correctly
2026-02-06 14:22 [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
` (13 preceding siblings ...)
2026-02-06 14:22 ` [PATCH 14/15] sched/cputime: Handle idle irqtime gracefully Frederic Weisbecker
@ 2026-02-06 14:22 ` Frederic Weisbecker
2026-03-03 11:17 ` Shrikanth Hegde
2026-02-11 13:43 ` [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Shrikanth Hegde
15 siblings, 1 reply; 40+ messages in thread
From: Frederic Weisbecker @ 2026-02-06 14:22 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev, Shrikanth Hegde
The dyntick-idle steal time is currently accounted when the tick
restarts but the stolen idle time is not substracted from the idle time
that was already accounted. This is to avoid observing the idle time
going backward as the dyntick-idle cputime accessors can't reliably know
in advance the stolen idle time.
In order to maintain a forward progressing idle cputime while
substracting idle steal time from it, keep track of the previously
accounted idle stolen time and substract it from _later_ idle cputime
accounting.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/kernel_stat.h | 1 +
kernel/sched/cputime.c | 21 +++++++++++++++------
2 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 512104b0ff49..24a54a6151ba 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -39,6 +39,7 @@ struct kernel_cpustat {
bool idle_elapse;
seqcount_t idle_sleeptime_seq;
u64 idle_entrytime;
+ u64 idle_stealtime;
#endif
u64 cpustat[NR_STATS];
};
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 92fa2f037b6e..7e79288eb327 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -424,19 +424,25 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
static void kcpustat_idle_stop(struct kernel_cpustat *kc, u64 now)
{
u64 *cpustat = kc->cpustat;
- u64 delta;
+ u64 delta, steal, steal_delta;
if (!kc->idle_elapse)
return;
delta = now - kc->idle_entrytime;
+ steal = steal_account_process_time(delta);
write_seqcount_begin(&kc->idle_sleeptime_seq);
+ steal_delta = min_t(u64, kc->idle_stealtime, delta);
+ delta -= steal_delta;
+ kc->idle_stealtime -= steal_delta;
+
if (nr_iowait_cpu(smp_processor_id()) > 0)
cpustat[CPUTIME_IOWAIT] += delta;
else
cpustat[CPUTIME_IDLE] += delta;
+ kc->idle_stealtime += steal;
kc->idle_entrytime = now;
kc->idle_elapse = false;
write_seqcount_end(&kc->idle_sleeptime_seq);
@@ -460,7 +466,6 @@ void kcpustat_dyntick_stop(u64 now)
kc->idle_dyntick = false;
irqtime_dyntick_stop();
vtime_dyntick_stop();
- steal_account_process_time(ULONG_MAX);
}
}
@@ -505,10 +510,14 @@ static u64 kcpustat_field_dyntick(int cpu, enum cpu_usage_stat idx,
do {
seq = read_seqcount_begin(&kc->idle_sleeptime_seq);
- if (kc->idle_elapse && compute_delta)
- idle = cpustat[idx] + (now - kc->idle_entrytime);
- else
- idle = cpustat[idx];
+ idle = cpustat[idx];
+
+ if (kc->idle_elapse && compute_delta) {
+ u64 delta = now - kc->idle_entrytime;
+
+ delta -= min_t(u64, kc->idle_stealtime, delta);
+ idle += delta;
+ }
} while (read_seqcount_retry(&kc->idle_sleeptime_seq, seq));
return idle;
--
2.51.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 15/15] sched/cputime: Handle dyntick-idle steal time correctly
2026-02-06 14:22 ` [PATCH 15/15] sched/cputime: Handle dyntick-idle steal time correctly Frederic Weisbecker
@ 2026-03-03 11:17 ` Shrikanth Hegde
2026-03-24 14:53 ` Frederic Weisbecker
0 siblings, 1 reply; 40+ messages in thread
From: Shrikanth Hegde @ 2026-03-03 11:17 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Christophe Leroy (CS GROUP), Rafael J. Wysocki, Alexander Gordeev,
Anna-Maria Behnsen, Ben Segall, Boqun Feng, Christian Borntraeger,
Dietmar Eggemann, Heiko Carstens, Ingo Molnar, Jan Kiszka,
Joel Fernandes, Juri Lelli, Kieran Bingham, Madhavan Srinivasan,
Mel Gorman, Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
Paul E . McKenney, Peter Zijlstra, Steven Rostedt, Sven Schnelle,
Thomas Gleixner, Uladzislau Rezki, Valentin Schneider,
Vasily Gorbik, Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm,
linux-s390, linuxppc-dev
On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
> The dyntick-idle steal time is currently accounted when the tick
> restarts but the stolen idle time is not substracted from the idle time
> that was already accounted. This is to avoid observing the idle time
> going backward as the dyntick-idle cputime accessors can't reliably know
> in advance the stolen idle time.
>
> In order to maintain a forward progressing idle cputime while
> substracting idle steal time from it, keep track of the previously
> accounted idle stolen time and substract it from _later_ idle cputime
> accounting.
>
s/substract/subtract ?
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> include/linux/kernel_stat.h | 1 +
> kernel/sched/cputime.c | 21 +++++++++++++++------
> 2 files changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
> index 512104b0ff49..24a54a6151ba 100644
> --- a/include/linux/kernel_stat.h
> +++ b/include/linux/kernel_stat.h
> @@ -39,6 +39,7 @@ struct kernel_cpustat {
> bool idle_elapse;
> seqcount_t idle_sleeptime_seq;
> u64 idle_entrytime;
> + u64 idle_stealtime;
> #endif
> u64 cpustat[NR_STATS];
> };
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 92fa2f037b6e..7e79288eb327 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -424,19 +424,25 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
> static void kcpustat_idle_stop(struct kernel_cpustat *kc, u64 now)
> {
> u64 *cpustat = kc->cpustat;
> - u64 delta;
> + u64 delta, steal, steal_delta;
>
> if (!kc->idle_elapse)
> return;
>
> delta = now - kc->idle_entrytime;
> + steal = steal_account_process_time(delta);
>
> write_seqcount_begin(&kc->idle_sleeptime_seq);
> + steal_delta = min_t(u64, kc->idle_stealtime, delta);
> + delta -= steal_delta;
I didn't get this logic. Why do we need idle_stealtime?
Let's say 10ms was steal time and 50ms was delta. But idle_stealtime is the
sum of past accumulated steal time; we only need to subtract the steal time there, no?
Shouldn't this be delta -= steal ?
> + kc->idle_stealtime -= steal_delta;
> +
> if (nr_iowait_cpu(smp_processor_id()) > 0)
> cpustat[CPUTIME_IOWAIT] += delta;
> else
> cpustat[CPUTIME_IDLE] += delta;
>
> + kc->idle_stealtime += steal;
> kc->idle_entrytime = now;
> kc->idle_elapse = false;
> write_seqcount_end(&kc->idle_sleeptime_seq);
> @@ -460,7 +466,6 @@ void kcpustat_dyntick_stop(u64 now)
> kc->idle_dyntick = false;
> irqtime_dyntick_stop();
> vtime_dyntick_stop();
> - steal_account_process_time(ULONG_MAX);
> }
> }
>
> @@ -505,10 +510,14 @@ static u64 kcpustat_field_dyntick(int cpu, enum cpu_usage_stat idx,
> do {
> seq = read_seqcount_begin(&kc->idle_sleeptime_seq);
>
> - if (kc->idle_elapse && compute_delta)
> - idle = cpustat[idx] + (now - kc->idle_entrytime);
> - else
> - idle = cpustat[idx];
> + idle = cpustat[idx];
> +
> + if (kc->idle_elapse && compute_delta) {
> + u64 delta = now - kc->idle_entrytime;
> +
> + delta -= min_t(u64, kc->idle_stealtime, delta);
> + idle += delta;
> + }
> } while (read_seqcount_retry(&kc->idle_sleeptime_seq, seq));
>
> return idle;
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 15/15] sched/cputime: Handle dyntick-idle steal time correctly
2026-03-03 11:17 ` Shrikanth Hegde
@ 2026-03-24 14:53 ` Frederic Weisbecker
0 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2026-03-24 14:53 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: LKML, Christophe Leroy (CS GROUP), Rafael J. Wysocki,
Alexander Gordeev, Anna-Maria Behnsen, Ben Segall, Boqun Feng,
Christian Borntraeger, Dietmar Eggemann, Heiko Carstens,
Ingo Molnar, Jan Kiszka, Joel Fernandes, Juri Lelli,
Kieran Bingham, Madhavan Srinivasan, Mel Gorman, Michael Ellerman,
Neeraj Upadhyay, Nicholas Piggin, Paul E . McKenney,
Peter Zijlstra, Steven Rostedt, Sven Schnelle, Thomas Gleixner,
Uladzislau Rezki, Valentin Schneider, Vasily Gorbik,
Vincent Guittot, Viresh Kumar, Xin Zhao, linux-pm, linux-s390,
linuxppc-dev
Le Tue, Mar 03, 2026 at 04:47:45PM +0530, Shrikanth Hegde a écrit :
>
>
> On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
> > The dyntick-idle steal time is currently accounted when the tick
> > restarts but the stolen idle time is not substracted from the idle time
> > that was already accounted. This is to avoid observing the idle time
> > going backward as the dyntick-idle cputime accessors can't reliably know
> > in advance the stolen idle time.
> >
> > In order to maintain a forward progressing idle cputime while
> > substracting idle steal time from it, keep track of the previously
> > accounted idle stolen time and substract it from _later_ idle cputime
> > accounting.
> >
>
> s/substract/subtract ?
Right.
>
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > ---
> > include/linux/kernel_stat.h | 1 +
> > kernel/sched/cputime.c | 21 +++++++++++++++------
> > 2 files changed, 16 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
> > index 512104b0ff49..24a54a6151ba 100644
> > --- a/include/linux/kernel_stat.h
> > +++ b/include/linux/kernel_stat.h
> > @@ -39,6 +39,7 @@ struct kernel_cpustat {
> > bool idle_elapse;
> > seqcount_t idle_sleeptime_seq;
> > u64 idle_entrytime;
> > + u64 idle_stealtime;
> > #endif
> > u64 cpustat[NR_STATS];
> > };
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index 92fa2f037b6e..7e79288eb327 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -424,19 +424,25 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
> > static void kcpustat_idle_stop(struct kernel_cpustat *kc, u64 now)
> > {
> > u64 *cpustat = kc->cpustat;
> > - u64 delta;
> > + u64 delta, steal, steal_delta;
> > if (!kc->idle_elapse)
> > return;
> > delta = now - kc->idle_entrytime;
> > + steal = steal_account_process_time(delta);
> > write_seqcount_begin(&kc->idle_sleeptime_seq);
> > + steal_delta = min_t(u64, kc->idle_stealtime, delta);
> > + delta -= steal_delta;
>
> I didn't get this logic. Why do we need idle_stealtime?
>
> Let's say 10ms was steal time and 50ms was delta. But idle_stealtime is the
> sum of past accumulated steal time; we only need to subtract the steal time there, no?
>
> Shouldn't this be delta -= steal ?
That would risk observing backward idle accounting:
Time     CPU 0                                 CPU 1
----     -----                                 -----
0 sec    kcpustat_idle_start()
         &lt;#VMEXIT&gt;
         ...
1 sec    &lt;/#VMEXIT&gt;
         arch_cpu_idle()
                                               kcpustat_field(CPUTIME_IDLE, 0)
                                               // returns 2
2 sec    kcpustat_idle_stop()
         cpustat[CPUTIME_IDLE] = 2 - 1
                                               kcpustat_field(CPUTIME_IDLE, 0)
                                               // returns 1
We could instead read the paravirt clock remotely, but then
steal_account_process_time() would always need to hold ->idle_sleeptime_seq,
though it should happen to work without it given the ordering.
Anyway, to avoid any surprise I accumulate the steal time of an idle cycle to be
subtracted on the next idle cycle.
Thanks.
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting
2026-02-06 14:22 [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Frederic Weisbecker
` (14 preceding siblings ...)
2026-02-06 14:22 ` [PATCH 15/15] sched/cputime: Handle dyntick-idle steal time correctly Frederic Weisbecker
@ 2026-02-11 13:43 ` Shrikanth Hegde
2026-02-11 17:06 ` Frederic Weisbecker
15 siblings, 1 reply; 40+ messages in thread
From: Shrikanth Hegde @ 2026-02-11 13:43 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Vasily Gorbik, Vincent Guittot, Kieran Bingham, Ingo Molnar,
Xin Zhao, Joel Fernandes, Neeraj Upadhyay, Sven Schnelle,
Boqun Feng, Mel Gorman, Dietmar Eggemann, Ben Segall,
Michael Ellerman, Rafael J. Wysocki, Paul E . McKenney,
Anna-Maria Behnsen, Alexander Gordeev, Madhavan Srinivasan,
linux-s390, Jan Kiszka, Juri Lelli, Christophe Leroy (CS GROUP),
linux-pm, Uladzislau Rezki, Peter Zijlstra, Steven Rostedt,
Thomas Gleixner, Nicholas Piggin, Heiko Carstens, linuxppc-dev,
Christian Borntraeger, Valentin Schneider, Viresh Kumar
Hi Frederic,
Gave this series a spin on the same system as v1.
On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
> Hi,
>
> After the issue reported here:
>
> https://lore.kernel.org/all/20251210083135.3993562-1-jackzxcui1989@163.com/
>
> It turns out that the idle cputime accounting is a big mess that
> accumulates within two concurrent statistics, each with its own
> shortcomings:
>
> * The accounting for online CPUs which is based on the delta between
> tick_nohz_start_idle() and tick_nohz_stop_idle().
>
> Pros:
> - Works when the tick is off
>
> - Has nsecs granularity
>
> Cons:
> - Accounts idle steal time but doesn't subtract it from idle
> cputime.
>
> - Assumes CONFIG_IRQ_TIME_ACCOUNTING=y by not accounting IRQs,
> but the IRQ time is simply ignored when
> CONFIG_IRQ_TIME_ACCOUNTING=n
>
> - The windows between 1) idle task scheduling and the first call
> to tick_nohz_start_idle() and 2) idle task between the last
> tick_nohz_stop_idle() and the rest of the idle time are
> blindspots wrt. cputime accounting (though mostly insignificant
> amount)
>
> - Relies on private fields outside of kernel stats, with specific
> accessors.
>
> * The accounting for offline CPUs which is based on ticks and the
> jiffies delta during which the tick was stopped.
>
> Pros:
> - Handles steal time correctly
>
> - Handles CONFIG_IRQ_TIME_ACCOUNTING=y and
> CONFIG_IRQ_TIME_ACCOUNTING=n correctly.
>
> - Handles the whole idle task
>
> - Accounts directly to kernel stats, without midlayer accumulator.
>
> Cons:
> - Doesn't elapse when the tick is off, which makes it
> unsuitable for online CPUs.
>
> - Has TICK_NSEC granularity (jiffies)
>
> - Needs to track the dyntick-idle ticks that were accounted and
> subtract them from the total jiffies time spent while the tick
> was stopped. This is an ugly workaround.
>
> Having two different accounting for a single context is not the only
> problem: since those accountings are of different natures, it is
> possible to observe the global idle time going backward after a CPU goes
> offline, as reported by Xin Zhao.
>
> Clean up the situation by introducing a hybrid approach that stays
> coherent, fixes the backward jumps and works for both online and offline
> CPUs:
>
> * Tick based or native vtime accounting operates before the tick is
> stopped and resumes once the tick is restarted.
>
> * When the idle loop starts, switch to dynticks-idle accounting as is
> done currently, except that the statistics accumulate directly to the
> relevant kernel stat fields.
>
> * Private dyntick cputime accounting fields are removed.
>
> * Works in both the online and offline cases.
>
> * Move most of the relevant code to the common sched/cputime subsystem
>
> * Handle CONFIG_IRQ_TIME_ACCOUNTING=n correctly such that the
> dynticks-idle accounting still elapses while on IRQs.
>
> * Correctly subtract idle steal cputime from idle time
>
> Changes since v1:
>
> - Fix deadlock involving double seq count lock on idle
>
> - Fix build breakage on powerpc
>
> - Fix build breakage on s390 (Heiko)
>
> - Fix broken sysfs s390 idle time file (Heiko)
>
> - Convert most ktime usage here into u64 (Peterz)
>
> - Add missing (or too implicit) <linux/sched/clock.h> (Peterz)
>
> - Fix whole idle time accounting breakage due to missing TS_FLAG_ set
> on idle entry (Shrikanth Hegde)
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> timers/core-v2
>
> HEAD: 21458b98c80a0567d48131240317b7b73ba34c3c
> Thanks,
> Frederic
idle and runtime utilization with mpstat while running stress-ng looks
correct now.
However, when running hackbench I am noticing the below data. hackbench shows
severe regressions.
base: tip/master at 9c61ebbdb587a3950072700ab74a9310afe3ad73.
(nit: patch 7 is already part of tip, so I skipped applying it)
+-----------------------------------------------+-------+---------+-----------+
| Test | base | +series | % Diff |
+-----------------------------------------------+-------+---------+-----------+
| HackBench Process 10 groups | 2.23 | 3.05 | -36.77% |
| HackBench Process 20 groups | 4.17 | 5.82 | -39.57% |
| HackBench Process 30 groups | 6.04 | 8.49 | -40.56% |
| HackBench Process 40 groups | 7.90 | 11.10 | -40.51% |
| HackBench thread 10 | 2.44 | 3.36 | -37.70% |
| HackBench thread 20 | 4.57 | 6.35 | -38.95% |
| HackBench Process(Pipe) 10 | 1.76 | 2.29 | -30.11% |
| HackBench Process(Pipe) 20 | 3.49 | 4.76 | -36.39% |
| HackBench Process(Pipe) 30 | 5.21 | 7.13 | -36.85% |
| HackBench Process(Pipe) 40 | 6.89 | 9.31 | -35.12% |
| HackBench thread(Pipe) 10 | 1.91 | 2.50 | -30.89% |
| HackBench thread(Pipe) 20 | 3.74 | 5.16 | -37.97% |
+-----------------------------------------------+-------+---------+-----------+
I have these in .config and I don't have nohz_full or isolated cpus.
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_SCHED_AVG_IRQ=y
I did a git bisect and below is what it says.
git bisect start
# status: waiting for both good and bad commits
# bad: [6821315886a3b5267ea31d29dba26fd34647fbbc] sched/cputime: Handle dyntick-idle steal time correctly
git bisect bad 6821315886a3b5267ea31d29dba26fd34647fbbc
# status: waiting for good commit(s), bad commit known
# good: [9c61ebbdb587a3950072700ab74a9310afe3ad73] Merge branch into tip/master: 'x86/sev'
git bisect good 9c61ebbdb587a3950072700ab74a9310afe3ad73
# good: [dc8bb3c84d162f7d9aa6becf9f8392474f92655a] tick/sched: Remove nohz disabled special case in cputime fetch
git bisect good dc8bb3c84d162f7d9aa6becf9f8392474f92655a
# good: [5070a778a581cd668f5d717f85fb22b078d8c20c] tick/sched: Account tickless idle cputime only when tick is stopped
git bisect good 5070a778a581cd668f5d717f85fb22b078d8c20c
# bad: [1e0ccc25a9a74b188b239c4de716fde279adbf8e] sched/cputime: Provide get_cpu_[idle|iowait]_time_us() off-case
git bisect bad 1e0ccc25a9a74b188b239c4de716fde279adbf8e
# bad: [ee7c735b76071000d401869fc2883c451ee3fa61] tick/sched: Consolidate idle time fetching APIs
git bisect bad ee7c735b76071000d401869fc2883c451ee3fa61
# first bad commit: [ee7c735b76071000d401869fc2883c451ee3fa61] tick/sched: Consolidate idle time fetching APIs
I did a perf diff between the two (collected perf record -a for hackbench 60 process 10000 loops)
perf diff base series:
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ........................... ................................................
#
+5.43% [kernel.kallsyms] [k] __update_freelist_slow
0.00% +4.55% [kernel.kallsyms] [k] _raw_spin_lock
+3.35% [kernel.kallsyms] [k] __memcg_slab_free_hook
0.55% +2.58% [kernel.kallsyms] [k] sock_wfree
+2.51% [kernel.kallsyms] [k] __account_obj_stock
2.29% -2.29% [kernel.kallsyms] [k] _raw_write_lock_irq
+2.25% [kernel.kallsyms] [k] _copy_from_iter
+1.96% [kernel.kallsyms] [k] fdget_pos
+1.87% [kernel.kallsyms] [k] _copy_to_iter
+1.69% [kernel.kallsyms] [k] sock_def_readable
2.32% -1.68% [kernel.kallsyms] [k] mod_memcg_lruvec_state
0.82% +1.67% [kernel.kallsyms] [k] skb_set_owner_w
0.08% +1.65% [kernel.kallsyms] [k] vfs_read
0.42% +1.57% [kernel.kallsyms] [k] kmem_cache_alloc_node_noprof
1.53% -1.53% [kernel.kallsyms] [k] kmem_cache_alloc_lru_noprof
1.56% -1.41% [kernel.kallsyms] [k] simple_copy_to_iter
0.27% +1.32% [kernel.kallsyms] [k] kfree
0.01% +1.25% [kernel.kallsyms] [k] __slab_free
0.19% +1.24% [kernel.kallsyms] [k] kmem_cache_free
1.23% -1.23% [kernel.kallsyms] [k] __pcs_replace_full_main
0.35% +1.21% [kernel.kallsyms] [k] __skb_datagram_iter
0.21% +1.13% [kernel.kallsyms] [k] sock_alloc_send_pskb
+1.09% [kernel.kallsyms] [k] mutex_lock
+0.98% [kernel.kallsyms].head.text [k] 0x0000000000013004
I haven't gone through the series yet; trying to go through it meanwhile.
Maybe a different allocation scheme, or more allocation/free every time instead of
pre-allocated percpu variables?
Thought of reporting it first. Let me know if you need any additional data.
^ permalink raw reply [flat|nested] 40+ messages in thread

* Re: [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting
2026-02-11 13:43 ` [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting Shrikanth Hegde
@ 2026-02-11 17:06 ` Frederic Weisbecker
2026-02-12 7:02 ` Shrikanth Hegde
2026-02-18 18:11 ` Shrikanth Hegde
0 siblings, 2 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2026-02-11 17:06 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: LKML, Vasily Gorbik, Vincent Guittot, Kieran Bingham, Ingo Molnar,
Xin Zhao, Joel Fernandes, Neeraj Upadhyay, Sven Schnelle,
Boqun Feng, Mel Gorman, Dietmar Eggemann, Ben Segall,
Michael Ellerman, Rafael J. Wysocki, Paul E . McKenney,
Anna-Maria Behnsen, Alexander Gordeev, Madhavan Srinivasan,
linux-s390, Jan Kiszka, Juri Lelli, Christophe Leroy (CS GROUP),
linux-pm, Uladzislau Rezki, Peter Zijlstra, Steven Rostedt,
Thomas Gleixner, Nicholas Piggin, Heiko Carstens, linuxppc-dev,
Christian Borntraeger, Valentin Schneider, Viresh Kumar
On Wed, Feb 11, 2026 at 07:13:45PM +0530, Shrikanth Hegde wrote:
> Hi Frederic,
> Gave this series a spin on the same system as v1.
>
> [...]
>
> I did a git bisect and below is what it says.
>
> git bisect start
> # status: waiting for both good and bad commits
> # bad: [6821315886a3b5267ea31d29dba26fd34647fbbc] sched/cputime: Handle dyntick-idle steal time correctly
> git bisect bad 6821315886a3b5267ea31d29dba26fd34647fbbc
> # status: waiting for good commit(s), bad commit known
> # good: [9c61ebbdb587a3950072700ab74a9310afe3ad73] Merge branch into tip/master: 'x86/sev'
> git bisect good 9c61ebbdb587a3950072700ab74a9310afe3ad73
> # good: [dc8bb3c84d162f7d9aa6becf9f8392474f92655a] tick/sched: Remove nohz disabled special case in cputime fetch
> git bisect good dc8bb3c84d162f7d9aa6becf9f8392474f92655a
> # good: [5070a778a581cd668f5d717f85fb22b078d8c20c] tick/sched: Account tickless idle cputime only when tick is stopped
> git bisect good 5070a778a581cd668f5d717f85fb22b078d8c20c
> # bad: [1e0ccc25a9a74b188b239c4de716fde279adbf8e] sched/cputime: Provide get_cpu_[idle|iowait]_time_us() off-case
> git bisect bad 1e0ccc25a9a74b188b239c4de716fde279adbf8e
> # bad: [ee7c735b76071000d401869fc2883c451ee3fa61] tick/sched: Consolidate idle time fetching APIs
> git bisect bad ee7c735b76071000d401869fc2883c451ee3fa61
> # first bad commit: [ee7c735b76071000d401869fc2883c451ee3fa61] tick/sched:
> Consolidate idle time fetching APIs
I see. Can you try this? (or fetch timers/core-v3 from my tree)
Perhaps that mistake had some impact on cpufreq.
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 057fdc00dbc6..08550a6d9469 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -524,7 +524,7 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
do_div(res, NSEC_PER_USEC);
if (last_update_time)
- *last_update_time = res;
+ *last_update_time = ktime_to_us(now);
return res;
}
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting
2026-02-11 17:06 ` Frederic Weisbecker
@ 2026-02-12 7:02 ` Shrikanth Hegde
2026-02-18 18:11 ` Shrikanth Hegde
1 sibling, 0 replies; 40+ messages in thread
From: Shrikanth Hegde @ 2026-02-12 7:02 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Vasily Gorbik, Vincent Guittot, Kieran Bingham, Ingo Molnar,
Xin Zhao, Joel Fernandes, Neeraj Upadhyay, Sven Schnelle,
Boqun Feng, Mel Gorman, Dietmar Eggemann, Ben Segall,
Michael Ellerman, Rafael J. Wysocki, Paul E . McKenney,
Anna-Maria Behnsen, Alexander Gordeev, Madhavan Srinivasan,
linux-s390, Jan Kiszka, Juri Lelli, Christophe Leroy (CS GROUP),
linux-pm, Uladzislau Rezki, Peter Zijlstra, Steven Rostedt,
Thomas Gleixner, Nicholas Piggin, Heiko Carstens, linuxppc-dev,
Christian Borntraeger, Valentin Schneider, Viresh Kumar
On 2/11/26 10:36 PM, Frederic Weisbecker wrote:
> On Wed, Feb 11, 2026 at 07:13:45PM +0530, Shrikanth Hegde wrote:
>> [...]
>
> I see. Can you try this? (or fetch timers/core-v3 from my tree)
> Perhaps that mistake had some impact on cpufreq.
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 057fdc00dbc6..08550a6d9469 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -524,7 +524,7 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
> do_div(res, NSEC_PER_USEC);
>
> if (last_update_time)
> - *last_update_time = res;
> + *last_update_time = ktime_to_us(now);
>
> return res;
> }
>
Yes, this diff helps. Now the data is almost the same.
+-----------------------------------------------+-------+-------------+-----------+
| Test | base | series + above diff | % Diff |
+-----------------------------------------------+-------+-------------+-----------+
| HackBench Process 10 groups | 2.23 | 2.25 | -0.90% |
| HackBench Process 20 groups | 4.17 | 4.21 | -0.96% |
| HackBench Process 30 groups | 6.04 | 6.15 | -1.82% |
| HackBench Process 40 groups | 7.90 | 8.06 | -2.03% |
| HackBench thread 10 | 2.44 | 2.46 | -0.82% |
| HackBench thread 20 | 4.57 | 4.61 | -0.88% |
| HackBench Process(Pipe) 10 | 1.76 | 1.73 | 1.70% |
| HackBench Process(Pipe) 20 | 3.49 | 3.50 | -0.29% |
| HackBench Process(Pipe) 30 | 5.21 | 5.22 | -0.19% |
| HackBench Process(Pipe) 40 | 6.89 | 6.96 | -1.02% |
| HackBench thread(Pipe) 10 | 1.91 | 1.88 | 1.57% |
| HackBench thread(Pipe) 20 | 3.74 | 3.81 | -1.87% |
+-----------------------------------------------+-------+-------------+-----------+
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 00/15 v2] tick/sched: Refactor idle cputime accounting
2026-02-11 17:06 ` Frederic Weisbecker
2026-02-12 7:02 ` Shrikanth Hegde
@ 2026-02-18 18:11 ` Shrikanth Hegde
1 sibling, 0 replies; 40+ messages in thread
From: Shrikanth Hegde @ 2026-02-18 18:11 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Vasily Gorbik, Vincent Guittot, Kieran Bingham, Ingo Molnar,
Xin Zhao, Joel Fernandes, Neeraj Upadhyay, Sven Schnelle,
Boqun Feng, Mel Gorman, Dietmar Eggemann, Ben Segall,
Michael Ellerman, Rafael J. Wysocki, Paul E . McKenney,
Anna-Maria Behnsen, Alexander Gordeev, Madhavan Srinivasan,
linux-s390, Jan Kiszka, Juri Lelli, Christophe Leroy (CS GROUP),
linux-pm, Uladzislau Rezki, Peter Zijlstra, Steven Rostedt,
Thomas Gleixner, Nicholas Piggin, Heiko Carstens, linuxppc-dev,
Christian Borntraeger, Valentin Schneider, Viresh Kumar
Hi Frederic,
On 2/11/26 10:36 PM, Frederic Weisbecker wrote:
> On Wed, Feb 11, 2026 at 07:13:45PM +0530, Shrikanth Hegde wrote:
>> [...]
>
> I see. Can you try this? (or fetch timers/core-v3 from my tree)
> Perhaps that mistake had some impact on cpufreq.
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 057fdc00dbc6..08550a6d9469 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -524,7 +524,7 @@ static u64 get_cpu_sleep_time_us(int cpu, enum cpu_usage_stat idx,
> do_div(res, NSEC_PER_USEC);
>
> if (last_update_time)
> - *last_update_time = res;
> + *last_update_time = ktime_to_us(now);
>
> return res;
> }
>
>
>
I have done testing with the below cases on a PowerNV (Power9) box.
1. CONFIG_VIRT_CPU_ACCOUNTING_GEN + CONFIG_IRQ_TIME_ACCOUNTING=y.
This is the common case of having VTIME_GEN + IRQ_TIME enabled.
2. CONFIG_VIRT_CPU_ACCOUNTING_GEN only.
IRQ_TIME is not selected.
3. CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y (for this I had to disable CONFIG_NO_HZ_FULL)
CONFIG_NO_HZ_IDLE=y and CONFIG_NO_HZ_FULL=n and VTIME_GEN=n
4. CONFIG_TICK_CPU_ACCOUNTING=y
(CONFIG_NO_HZ_FULL=n and CONFIG_NO_HZ_IDLE=y)
In all cases the idle time and iowait time don't go backwards.
So that's a clear win.
Without the patches, iowait did go backwards.
So, with that, for the series:
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
However, with the series and NATIVE=y, I am seeing one peculiar thing.
without series: cpu0 0 0 9 60800 4 2 90 0 0 0 << 608 seconds after boot. That's ok.
with series: cpu0 1 0 17 9122062 0 3 140 0 0 0 << 91220 seconds?? Strange.
However, I see that the passage of time looks normal.
If I do something like: cat /proc/stat; sleep 5; cat /proc/stat;
then I see the same time difference with/without the series.
So timekeeping works as expected.
Almost all CPUs show similar stats. I am wondering if there is a bug or some
kind of wrapping in mftb which raises an irq, and during that particular period
the values become very large. Even without the series, I see one or two CPUs
with the same huge system time. Maybe since the series handles the irq case
now, it might be showing up on all CPUs.
This is a slightly older system. I will give this a try on Power10 when I get
the systems in a few weeks' time.