public inbox for linux-s390@vger.kernel.org
* [PATCH 00/14 v3] tick/sched: Refactor idle cputime accounting
@ 2026-03-31 13:16 Frederic Weisbecker
From: Frederic Weisbecker @ 2026-03-31 13:16 UTC
  To: LKML
  Cc: Frederic Weisbecker, Christophe Leroy (CS GROUP),
	Rafael J. Wysocki, Alexander Gordeev, Anna-Maria Behnsen,
	Ben Segall, Boqun Feng, Christian Borntraeger, Dietmar Eggemann,
	Heiko Carstens, Ingo Molnar, Jan Kiszka, Joel Fernandes,
	Juri Lelli, Kieran Bingham, Madhavan Srinivasan, Mel Gorman,
	Michael Ellerman, Neeraj Upadhyay, Nicholas Piggin,
	Paul E . McKenney, Peter Zijlstra, Shrikanth Hegde,
	Steven Rostedt, Sven Schnelle, Thomas Gleixner, Uladzislau Rezki,
	Valentin Schneider, Vasily Gorbik, Vincent Guittot, Viresh Kumar,
	Xin Zhao, linux-pm, linux-s390, linuxppc-dev

Hi,

After the issue reported here:

        https://lore.kernel.org/all/20251210083135.3993562-1-jackzxcui1989@163.com/

It turns out that the idle cputime accounting is a big mess that
accumulates within two concurrent statistics, each having its own
shortcomings:

* The accounting for online CPUs which is based on the delta between
  tick_nohz_start_idle() and tick_nohz_stop_idle().

  Pros:
       - Works when the tick is off

       - Has nsecs granularity

  Cons:
       - Accounts idle steal time but doesn't subtract it from idle
         cputime.

       - Assumes CONFIG_IRQ_TIME_ACCOUNTING=y by not accounting IRQs,
         so the IRQ time is simply ignored when
         CONFIG_IRQ_TIME_ACCOUNTING=n

       - Two windows are blind spots wrt. cputime accounting (though
         the amounts are mostly insignificant): 1) between the idle
         task being scheduled in and the first call to
         tick_nohz_start_idle(), and 2) between the last
         tick_nohz_stop_idle() and the rest of the idle time.

       - Relies on private fields outside of kernel stats, with specific
         accessors.

* The accounting for offline CPUs which is based on ticks and the
  jiffies delta during which the tick was stopped.

  Pros:
       - Handles steal time correctly

       - Handles CONFIG_IRQ_TIME_ACCOUNTING=y and
         CONFIG_IRQ_TIME_ACCOUNTING=n correctly.

       - Handles the whole idle task

       - Accounts directly to kernel stats, without midlayer accumulator.

  Cons:
       - Doesn't elapse when the tick is off, which doesn't make it
         suitable for online CPUs.

       - Has TICK_NSEC granularity (jiffies)

       - Needs to track the dyntick-idle ticks that were accounted and
         subtract them from the total jiffies time spent while the tick
         was stopped. This is an ugly workaround.

Having two different accountings for a single context is not the only
problem: since those accountings are of different natures, it is
possible to observe the global idle time going backward after a CPU goes
offline, as reported by Xin Zhao.
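
To illustrate how a reader that switches statistics at offline time can
see idle time go backward, here is a toy userspace model. It is only a
sketch: the struct, the toy_*() helpers and TOY_TICK_NSEC (assuming
HZ=250) are hypothetical names, not kernel code, and granularity is just
one of the divergence sources listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this toy model: HZ=250, so one tick lasts 4 ms. */
#define TOY_TICK_NSEC 4000000ULL

/* Hypothetical per-CPU state, not the kernel's data structures. */
struct toy_cpu {
	bool online;
	uint64_t nohz_idle_ns;  /* delta-based accumulator, nsec granularity */
	uint64_t kstat_idle_ns; /* tick-based accumulator, whole ticks only */
};

/* Account one idle period of @ns nanoseconds in both schemes. */
static void toy_account_idle(struct toy_cpu *cpu, uint64_t ns)
{
	cpu->nohz_idle_ns += ns;
	/* The tick-based side only ever sees whole ticks. */
	cpu->kstat_idle_ns += (ns / TOY_TICK_NSEC) * TOY_TICK_NSEC;
	assert(cpu->kstat_idle_ns <= cpu->nohz_idle_ns);
}

/* Reader that picks a statistic depending on the hotplug state. */
static uint64_t toy_read_idle(const struct toy_cpu *cpu)
{
	return cpu->online ? cpu->nohz_idle_ns : cpu->kstat_idle_ns;
}

/* Return true if the reported idle time decreases across offlining. */
static bool toy_goes_backward(uint64_t idle_ns)
{
	struct toy_cpu cpu = { .online = true };
	uint64_t before, after;

	toy_account_idle(&cpu, idle_ns);
	before = toy_read_idle(&cpu);
	cpu.online = false;
	after = toy_read_idle(&cpu);

	return after < before;
}
```

In this model, an idle period that is not a whole number of ticks is
enough for the value read after offlining to be smaller than the value
read just before it.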

Clean up the situation by introducing a hybrid approach that stays
coherent, fixes the backward jumps and works for both online and offline
CPUs:

* Tick-based or native vtime accounting operates before the tick is
  stopped and resumes once the tick is restarted.

* When the idle loop starts, switch to dynticks-idle accounting as is
  done currently, except that the statistics accumulate directly to the
  relevant kernel stat fields.

* Private dyntick cputime accounting fields are removed.

* Works in both the online and offline cases.

* Move most of the relevant code to the common sched/cputime subsystem

* Handle CONFIG_IRQ_TIME_ACCOUNTING=n correctly such that the
  dynticks-idle accounting still elapses while on IRQs.

* Correctly subtract idle steal cputime from idle time
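
The core of the hybrid approach can be sketched as a toy userspace
model with a single idle field that both accounting modes feed. This is
not the code from the series: struct hybrid_kcpustat and the hybrid_*()
helpers are hypothetical names, timestamps are supplied by the caller,
and steal and IRQ time handling are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU kernel stat idle field. */
struct hybrid_kcpustat {
	uint64_t idle_ns;       /* the only idle accumulator */
	bool tick_stopped;      /* true while dynticks-idle accounting runs */
	uint64_t idle_entry_ns; /* timestamp taken when the tick was stopped */
};

/* Tick-based accounting: only operates while the tick is running. */
static void hybrid_tick(struct hybrid_kcpustat *ks, uint64_t tick_ns)
{
	if (!ks->tick_stopped)
		ks->idle_ns += tick_ns;
}

/* Switch to dynticks-idle accounting. */
static void hybrid_dyntick_start(struct hybrid_kcpustat *ks, uint64_t now_ns)
{
	assert(!ks->tick_stopped);
	ks->tick_stopped = true;
	ks->idle_entry_ns = now_ns;
}

/* Flush the tickless delta directly into the same idle field. */
static void hybrid_dyntick_stop(struct hybrid_kcpustat *ks, uint64_t now_ns)
{
	assert(ks->tick_stopped);
	ks->idle_ns += now_ns - ks->idle_entry_ns;
	ks->tick_stopped = false;
}

/*
 * Reader: one field for online and offline CPUs alike, plus the
 * in-flight delta while the tick is stopped.
 */
static uint64_t hybrid_read_idle(const struct hybrid_kcpustat *ks,
				 uint64_t now_ns)
{
	uint64_t idle = ks->idle_ns;

	if (ks->tick_stopped)
		idle += now_ns - ks->idle_entry_ns;

	return idle;
}
```

Because the reader never switches to a different statistic, the value
it returns in this model does not depend on whether the CPU is online.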

Changes since v2:

- Add tags

- Fix frenglish

- Add fixup from Heiko to s390 patch

- Drop "cpufreq: ondemand: Simplify idle cputime granularity test" as it's upstream

- Fix cpufreq regression reported by Shrikanth

- Simplify irqtime handling by relying on kcpustat_idle_dyntick()

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
        timers/core-v3

HEAD: e37a063888aac70d4c598ce2ed367f8ce3552a69

Thanks!

Frederic Weisbecker (14):
  sched/idle: Handle offlining first in idle loop
  sched/cputime: Remove superfluous and error prone kcpustat_field()
    parameter
  sched/cputime: Correctly support generic vtime idle time
  powerpc/time: Prepare to stop elapsing in dynticks-idle
  s390/time: Prepare to stop elapsing in dynticks-idle
  tick/sched: Unify idle cputime accounting
  tick/sched: Remove nohz disabled special case in cputime fetch
  tick/sched: Move dyntick-idle cputime accounting to cputime code
  tick/sched: Remove unused fields
  tick/sched: Account tickless idle cputime only when tick is stopped
  tick/sched: Consolidate idle time fetching APIs
  sched/cputime: Provide get_cpu_[idle|iowait]_time_us() off-case
  sched/cputime: Handle idle irqtime gracefully
  sched/cputime: Handle dyntick-idle steal time correctly

 arch/powerpc/kernel/time.c         |  41 ++++
 arch/s390/include/asm/idle.h       |   2 +
 arch/s390/kernel/idle.c            |   5 +-
 arch/s390/kernel/vtime.c           |  57 +++++-
 drivers/cpufreq/cpufreq.c          |  29 +--
 drivers/cpufreq/cpufreq_governor.c |   6 +-
 drivers/macintosh/rack-meter.c     |   2 +-
 fs/proc/stat.c                     |  40 +---
 fs/proc/uptime.c                   |   8 +-
 include/linux/kernel_stat.h        |  76 ++++++--
 include/linux/tick.h               |   4 -
 include/linux/vtime.h              |  22 ++-
 kernel/rcu/tree.c                  |   9 +-
 kernel/rcu/tree_stall.h            |   7 +-
 kernel/sched/cputime.c             | 289 ++++++++++++++++++++++++-----
 kernel/sched/idle.c                |  13 +-
 kernel/time/tick-sched.c           | 202 ++++----------------
 kernel/time/tick-sched.h           |  12 --
 kernel/time/timer_list.c           |   6 +-
 scripts/gdb/linux/timerlist.py     |   4 -
 20 files changed, 481 insertions(+), 353 deletions(-)

-- 
2.53.0

