[GIT PULL] Scheduler updates for v6.17

* [GIT PULL] Scheduler updates for v6.17
@ 2025-07-28  6:48 Ingo Molnar
  2025-07-30  3:39 ` pr-tracker-bot
  2025-07-31  3:31 ` Linus Torvalds
  0 siblings, 2 replies; 11+ messages in thread
From: Ingo Molnar @ 2025-07-28  6:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Tejun Heo, Valentin Schneider, Shrikanth Hegde

Linus,

Please pull the latest sched/core Git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2025-07-28

   # HEAD: 1b5f1454091e9e9fb5c944b3161acf4ec0894d0d sched/idle: Remove play_idle()

Scheduler updates for v6.17:

Core scheduler changes:

 - Better tracking of the maximum lag of tasks in the presence of
   different slice durations, for better handling of lag in the fair
   scheduler. (Vincent Guittot)

 - Clean up and standardize #if/#else/#endif markers throughout
   the entire scheduler code base (Ingo Molnar)

 - Make SMP unconditional: build the SMP scheduler's
   data structures and logic on UP kernels too, even though
   they are not used, to simplify the scheduler and remove
   around 200 #ifdef/[#else]/#endif blocks from the
   scheduler. (Ingo Molnar)

 - Reorganize cgroup bandwidth control interface handling
   for better interfacing with sched_ext (Tejun Heo)

Balancing:

 - Bump sd->max_newidle_lb_cost when newidle balance fails (Chris Mason)
 - Remove sched_domain_topology_level::flags to simplify the code (Prateek Nayak)
 - Simplify and clean up build_sched_topology() (Li Chen)
 - Optimize build_sched_topology() on large machines (Li Chen)

Real-time scheduling:

 - Add an initial version of proxy execution: a mechanism for mutex-owning
   tasks to inherit the scheduling context of higher priority waiters.
   It is currently limited to a single runqueue, is conditional on
   CONFIG_EXPERT, and has other limitations.
   (John Stultz, Peter Zijlstra, Valentin Schneider)

 - Deadline scheduler (Juri Lelli):

   - Fix dl_servers initialization order (Juri Lelli)
   - Fix DL scheduler's root domain reinitialization logic (Juri Lelli)
   - Fix accounting bugs after global limits change (Juri Lelli)
   - Fix scalability regression by implementing less aggressive dl_server handling
     (Peter Zijlstra)

PSI:

 - Improve scalability by optimizing psi_group_change() cpu_clock() usage
   (Peter Zijlstra)

Rust changes:

 - Make Task, CondVar and PollCondVar methods inline to avoid unnecessary
   function calls (Kunwu Chan, Panagiotis Foliadis)

 - Add might_sleep() support for Rust code: Rust's "#[track_caller]"
   mechanism is used so that Rust's might_sleep() doesn't need to be
   defined as a macro (Fujita Tomonori)

 - Introduce file_from_location() (Boqun Feng)

Debugging & instrumentation:

 - Make clangd usable with scheduler source code files again (Peter Zijlstra)

 - tools: Add root_domains_dump.py which dumps root domains info (Juri Lelli)

 - tools: Add dl_bw_dump.py for printing bandwidth accounting info (Juri Lelli)

Misc cleanups & fixes:

 - Remove play_idle() (Feng Lee)

 - Fix check_preemption_disabled() (Sebastian Andrzej Siewior)

 - Do not call __put_task_struct() on RT if pi_blocked_on is set
   (Luis Claudio R. Goncalves)

 - Correct the comment in place_entity() (wang wei)

 Thanks,

	Ingo

------------------>
Boqun Feng (1):
      rust: Introduce file_from_location()

Chris Mason (1):
      sched/fair: Bump sd->max_newidle_lb_cost when newidle balance fails

FUJITA Tomonori (1):
      rust: task: Add Rust version of might_sleep()

Feng Lee (1):
      sched/idle: Remove play_idle()

Ingo Molnar (43):
      sched: Clean up and standardize #if/#else/#endif markers in sched/autogroup.[ch]
      sched: Clean up and standardize #if/#else/#endif markers in sched/clock.c
      sched: Clean up and standardize #if/#else/#endif markers in sched/core.c
      sched: Clean up and standardize #if/#else/#endif markers in sched/cpufreq_schedutil.c
      sched: Clean up and standardize #if/#else/#endif markers in sched/cpupri.h
      sched: Clean up and standardize #if/#else/#endif markers in sched/cputime.c
      sched: Clean up and standardize #if/#else/#endif markers in sched/deadline.c
      sched: Clean up and standardize #if/#else/#endif markers in sched/debug.c
      sched: Clean up and standardize #if/#else/#endif markers in sched/fair.c
      sched: Clean up and standardize #if/#else/#endif markers in sched/idle.c
      sched: Clean up and standardize #if/#else/#endif markers in sched/loadavg.c
      sched: Clean up and standardize #if/#else/#endif markers in sched/pelt.[ch]
      sched: Clean up and standardize #if/#else/#endif markers in sched/psi.c
      sched: Clean up and standardize #if/#else/#endif markers in sched/rt.c
      sched: Clean up and standardize #if/#else/#endif markers in sched/sched.h
      sched: Clean up and standardize #if/#else/#endif markers in sched/stats.[ch]
      sched: Clean up and standardize #if/#else/#endif markers in sched/syscalls.c
      sched: Clean up and standardize #if/#else/#endif markers in sched/topology.c
      sched/smp: Always define sched_domains_mutex_lock()/unlock(), def_root_domain and sched_domains_mutex
      sched/smp: Make SMP unconditional
      sched/smp: Always define is_percpu_thread() and scheduler_ipi()
      sched/smp: Always define rq->hrtick_csd
      sched/smp: Use the SMP version of try_to_wake_up()
      sched/smp: Use the SMP version of __task_needs_rq_lock()
      sched/smp: Use the SMP version of wake_up_new_task()
      sched/smp: Use the SMP version of sched_exec()
      sched/smp: Use the SMP version of idle_thread_set_boot_cpu()
      sched/smp: Use the SMP version of the RT scheduling class
      sched/smp: Use the SMP version of the deadline scheduling class
      sched/smp: Use the SMP version of scheduler debugging data
      sched/smp: Use the SMP version of schedstats
      sched/smp: Use the SMP version of the scheduler syscalls
      sched/smp: Use the SMP version of sched_update_asym_prefer_cpu()
      sched/smp: Use the SMP version of the idle scheduling class
      sched/smp: Use the SMP version of the stop-CPU scheduling class
      sched/smp: Use the SMP version of cpu_of()
      sched/smp: Use the SMP version of is_migration_disabled()
      sched/smp: Use the SMP version of rq_pin_lock()
      sched/smp: Use the SMP version of task_on_cpu()
      sched/smp: Use the SMP version of WF_ and SD_ flag sanity checks
      sched/smp: Use the SMP version of ENQUEUE_MIGRATED
      sched/smp: Use the SMP version of add_nr_running()
      sched/smp: Use the SMP version of double_rq_clock_clear_update()

John Stultz (4):
      sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
      sched: Move update_curr_task logic into update_curr_se
      sched: Fix runtime accounting w/ split exec & sched contexts
      sched: Add an initial sketch of the find_proxy_task() function

Juri Lelli (5):
      sched/deadline: Initialize dl_servers after SMP
      sched/deadline: Reset extra_bw to max_bw when clearing root domains
      sched/deadline: Fix accounting after global limits change
      tools/sched: Add root_domains_dump.py which dumps root domains info
      tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info

K Prateek Nayak (1):
      sched/topology: Remove sched_domain_topology_level::flags

Kunwu Chan (2):
      rust: sync: Mark CondVar::notify_*() inline
      rust: sync: Mark PollCondVar::drop() inline

Li Chen (4):
      smpboot: introduce SDTL_INIT() helper to tidy sched topology setup
      x86/smpboot: remove redundant CONFIG_SCHED_SMT
      x86/smpboot: moves x86_topology to static initialize and truncate
      x86/smpboot: avoid SMT domain attach/destroy if SMT is not enabled

Luis Claudio R. Goncalves (1):
      sched: Do not call __put_task_struct() on rt if pi_blocked_on is set

Panagiotis Foliadis (1):
      rust: task: Mark Task methods inline

Peter Zijlstra (5):
      sched: Make clangd usable
      sched/psi: Optimize psi_group_change() cpu_clock() usage
      sched/deadline: Less agressive dl_server handling
      locking/mutex: Rework task_struct::blocked_on
      sched: Start blocked_on chain processing in find_proxy_task()

Sebastian Andrzej Siewior (1):
      lib/smp_processor_id: Make migration check unconditional of SMP

Tejun Heo (4):
      sched/fair: Move max_cfs_quota_period decl and default_cfs_period() def from fair.c to sched.h
      sched/core: Relocate tg_get_cfs_*() and cpu_cfs_*_read_*()
      sched/core: Reorganize cgroup bandwidth control interface file reads
      sched/core: Reorganize cgroup bandwidth control interface file writes

Valentin Schneider (2):
      locking/mutex: Add p->blocked_on wrappers for correctness checks
      sched: Fix proxy/current (push,pull)ability

Vincent Guittot (6):
      sched/fair: Use protect_slice() instead of direct comparison
      sched/fair: Fix NO_RUN_TO_PARITY case
      sched/fair: Remove spurious shorter slice preemption
      sched/fair: Limit run to parity to the min slice of enqueued entities
      sched/fair: Fix entity's lag with run to parity
      sched/fair: Always trigger resched at the end of a protected period

wang wei (1):
      sched/eevdf: Correct the comment in place_entity


 Documentation/admin-guide/kernel-parameters.txt |   5 +
 MAINTAINERS                                     |   1 +
 arch/powerpc/kernel/smp.c                       |  25 +-
 arch/s390/kernel/topology.c                     |  10 +-
 arch/x86/kernel/smpboot.c                       |  51 +-
 include/linux/cpu.h                             |   5 -
 include/linux/preempt.h                         |   9 -
 include/linux/psi_types.h                       |   6 +-
 include/linux/sched.h                           | 148 ++--
 include/linux/sched/deadline.h                  |   4 -
 include/linux/sched/idle.h                      |   4 -
 include/linux/sched/nohz.h                      |   4 +-
 include/linux/sched/sd_flags.h                  |   8 -
 include/linux/sched/task.h                      |  31 +-
 include/linux/sched/topology.h                  |  39 +-
 init/Kconfig                                    |  15 +
 kernel/fork.c                                   |   3 +-
 kernel/locking/mutex-debug.c                    |   9 +-
 kernel/locking/mutex.c                          |  18 +
 kernel/locking/mutex.h                          |   3 +-
 kernel/locking/ww_mutex.h                       |  16 +-
 kernel/sched/autogroup.c                        |   9 +-
 kernel/sched/autogroup.h                        |   6 +-
 kernel/sched/build_policy.c                     |   6 +-
 kernel/sched/build_utility.c                    |   9 +-
 kernel/sched/clock.c                            |   7 +-
 kernel/sched/completion.c                       |   5 +
 kernel/sched/core.c                             | 869 ++++++++++++++----------
 kernel/sched/core_sched.c                       |   2 +
 kernel/sched/cpuacct.c                          |   2 +
 kernel/sched/cpudeadline.c                      |   1 +
 kernel/sched/cpudeadline.h                      |   4 +-
 kernel/sched/cpufreq.c                          |   1 +
 kernel/sched/cpufreq_schedutil.c                |   6 +-
 kernel/sched/cpupri.c                           |   1 +
 kernel/sched/cpupri.h                           |   5 +-
 kernel/sched/cputime.c                          |  17 +-
 kernel/sched/deadline.c                         | 208 +++---
 kernel/sched/debug.c                            |  47 +-
 kernel/sched/fair.c                             | 408 ++++-------
 kernel/sched/idle.c                             |  15 +-
 kernel/sched/isolation.c                        |   2 +
 kernel/sched/loadavg.c                          |   6 +-
 kernel/sched/membarrier.c                       |   2 +
 kernel/sched/pelt.c                             |   5 +-
 kernel/sched/pelt.h                             |  67 +-
 kernel/sched/psi.c                              | 129 ++--
 kernel/sched/rt.c                               | 112 +--
 kernel/sched/sched-pelt.h                       |   1 +
 kernel/sched/sched.h                            | 243 ++-----
 kernel/sched/smp.h                              |   7 +
 kernel/sched/stats.c                            |   5 +-
 kernel/sched/stats.h                            |  10 +-
 kernel/sched/stop_task.c                        |   5 +-
 kernel/sched/swait.c                            |   1 +
 kernel/sched/syscalls.c                         |  15 +-
 kernel/sched/topology.c                         |  57 +-
 kernel/sched/wait.c                             |   1 +
 kernel/sched/wait_bit.c                         |   3 +
 kernel/smpboot.c                                |   4 -
 lib/smp_processor_id.c                          |   2 -
 rust/helpers/task.c                             |   6 +
 rust/kernel/lib.rs                              |  48 ++
 rust/kernel/sync/condvar.rs                     |   3 +
 rust/kernel/sync/poll.rs                        |   1 +
 rust/kernel/task.rs                             |  33 +
 tools/sched/dl_bw_dump.py                       |  57 ++
 tools/sched/root_domains_dump.py                |  68 ++
 68 files changed, 1472 insertions(+), 1463 deletions(-)
 create mode 100644 tools/sched/dl_bw_dump.py
 create mode 100644 tools/sched/root_domains_dump.py


* Re: [GIT PULL] Scheduler updates for v6.17
  2025-07-28  6:48 [GIT PULL] Scheduler updates for v6.17 Ingo Molnar
@ 2025-07-30  3:39 ` pr-tracker-bot
  2025-07-31  3:31 ` Linus Torvalds
  1 sibling, 0 replies; 11+ messages in thread
From: pr-tracker-bot @ 2025-07-30  3:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Tejun Heo, Valentin Schneider, Shrikanth Hegde

The pull request you sent on Mon, 28 Jul 2025 08:48:44 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2025-07-28

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/bf76f23aa1c178e9115eba17f699fa726aed669b

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


* Re: [GIT PULL] Scheduler updates for v6.17
  2025-07-28  6:48 [GIT PULL] Scheduler updates for v6.17 Ingo Molnar
  2025-07-30  3:39 ` pr-tracker-bot
@ 2025-07-31  3:31 ` Linus Torvalds
  2025-08-02 18:43   ` Linus Torvalds
  2025-08-04 16:50   ` Steven Rostedt
  1 sibling, 2 replies; 11+ messages in thread
From: Linus Torvalds @ 2025-07-31  3:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Tejun Heo, Valentin Schneider, Shrikanth Hegde

On Sun, 27 Jul 2025 at 23:48, Ingo Molnar <mingo@kernel.org> wrote:
>
> PSI:
>
>  - Improve scalability by optimizing psi_group_change() cpu_clock() usage
>    (Peter Zijlstra)

I suspect this is buggy.

Maybe this is coincidence, but that sounds very unlikely:

  watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:3:7996]
  CPU#0 Utilization every 4s during lockup:
          #1: 100% system,          0% softirq,          0% hardirq,          0% idle
          #2: 100% system,          1% softirq,          1% hardirq,          0% idle
          #3: 100% system,          0% softirq,          0% hardirq,          0% idle
          #4: 101% system,          0% softirq,          0% hardirq,          0% idle
          #5: 100% system,          0% softirq,          0% hardirq,          0% idle
  Modules linked in: uinput rfcomm nf_nat_tftp nf_conntrack_tftp bridge stp llc ccm nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet [...]
  CPU: 0 UID: 0 PID: 7996 Comm: kworker/0:3 Not tainted 6.16.0-06574-gd9104cec3e8f #164 VOLUNTARY
  Hardware name: Dell Inc. XPS 13 9380/0KTW76, BIOS 1.26.0 09/11/2023
  Workqueue: events psi_avgs_work
  RIP: 0010:collect_percpu_times+0x2f6/0x320
  Code: c0 0f b6 c0 c1 e0 09 41 09 c5 e9 14 ff ff ff 49 8b 0f 48 89 4c 24 48 49 8b 4f 08 48 89 4c 24 50 e9 6e fe ff ff 4c 89 c0 f3 90 <4a> 8b 14 ed c0 3c 20 93
  RSP: 0018:ffffd4d3cc113d60 EFLAGS: 00000202
  RAX: ffffffff93b26880 RBX: fffff4d3bfba0ed4 RCX: 000000000000622d
  RDX: ffff8ced1e597880 RSI: fffffffc6684cefc RDI: 0000000000000000
  RBP: ffffd4d3cc113db8 R08: ffffffff93b26880 R09: 0000000000000000
  R10: 00001386e5a9adc7 R11: 000000000000eda9 R12: ffffd4d3cc113dd8
  R13: 0000000000000006 R14: 0000000000000006 R15: fffff4d3bfba0ec0
  FS:  0000000000000000(0000) GS:ffff8ced8a8f1000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000027f400c50010 CR3: 00000001b641e005 CR4: 00000000003726f0
  Call Trace:
   <TASK>
   psi_avgs_work+0x31/0xa0
   process_one_work+0x135/0x220
   worker_thread+0x2e7/0x420
   kthread+0xbd/0x1a0
   ret_from_fork+0x133/0x160
   ret_from_fork_asm+0x11/0x20
   </TASK>

and yeah, the laptop was dead at that point. Thankfully it had been
alive enough that the watchdog messages made it into the logs.

There was more than one of those reports (34 of them, to be exact), but
they all look pretty much the same. RIP is always the same at that
collect_percpu_times+0x2f6/0x320, but that's just the instruction
after the 'pause' instruction that is from

   psi_read_begin ->
       return read_seqcount_begin(per_cpu_ptr(&psi_seq, cpu));

which is from that __read_seqcount_begin() code that waits for the
writer to go away:

        while (unlikely((__seq = seqprop_sequence(s)) & 1))             \
                cpu_relax();                                            \

and clearly it never does.

Why? I have no idea. But hopefully this makes somebody go "D'oh!" and
send me a trivial fix.

Please?

           Linus
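
The wait loop quoted above can be sketched in user space. This is only a
minimal illustration using C11 atomics, not the kernel's seqcount
implementation: struct demo_seqcount and the demo_*() helpers are
hypothetical names, and the reader is given a bounded spin count
(max_spins, also an assumption) so that a writer that never leaves shows
up as a failed read instead of a hang.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the kernel's seqcount_t. */
struct demo_seqcount {
	atomic_uint sequence;	/* odd while a writer is inside its section */
};

/* Writer entry: bump the counter to an odd value. */
static inline void demo_write_begin(struct demo_seqcount *s)
{
	atomic_fetch_add(&s->sequence, 1);
}

/* Writer exit: bump the counter back to an even value. */
static inline void demo_write_end(struct demo_seqcount *s)
{
	unsigned int old = atomic_fetch_add(&s->sequence, 1);

	assert(old & 1);	/* must pair with a preceding demo_write_begin() */
}

/*
 * Reader entry, shaped like the quoted loop but giving up after
 * max_spins polls instead of spinning forever. Returns true and stores
 * the even sequence value in *seq on success; returns false if the
 * counter was odd on every poll.
 */
static inline bool demo_read_begin_bounded(struct demo_seqcount *s,
					   unsigned int max_spins,
					   unsigned int *seq)
{
	for (unsigned int i = 0; i < max_spins; i++) {
		unsigned int v = atomic_load(&s->sequence);

		if (!(v & 1)) {
			*seq = v;
			return true;
		}
		/* the kernel loop calls cpu_relax() at this point */
	}
	return false;
}
```

With a balanced demo_write_begin()/demo_write_end() pair the bounded
reader succeeds at once; if only demo_write_begin() has run, the counter
stays odd and the reader fails, which is the condition the unbounded
kernel loop would spin on.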


* Re: [GIT PULL] Scheduler updates for v6.17
  2025-07-31  3:31 ` Linus Torvalds
@ 2025-08-02 18:43   ` Linus Torvalds
  2025-08-02 19:46     ` Steven Rostedt
  2025-08-03 17:50     ` Jeff Johnson
  2025-08-04 16:50   ` Steven Rostedt
  1 sibling, 2 replies; 11+ messages in thread
From: Linus Torvalds @ 2025-08-02 18:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Tejun Heo, Valentin Schneider, Shrikanth Hegde

On Wed, 30 Jul 2025 at 20:31, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, 27 Jul 2025 at 23:48, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > PSI:
> >
> >  - Improve scalability by optimizing psi_group_change() cpu_clock() usage
> >    (Peter Zijlstra)
>
> I suspect this is buggy.
>
> Maybe this is coincidence, but that sounds very unlikely:
>
>   watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:3:7996]
>   CPU#0 Utilization every 4s during lockup:

Happened again this morning, and as far as I can tell the machine was
just sitting there idle at the desktop.

I've only seen this on my laptop, so maybe it's some hw dependency,
but it *really* smells like commit 570c8efd5eb7 ("sched/psi: Optimize
psi_group_change() cpu_clock() usage") from the symptoms. It's
literally hanging on that psi_read_begin(), which is that
read_seqcount_begin() on that new per-cpu psi_seq counter.

Now, I'm not seeing how it could possibly trigger - I looked through
all the psi_write_begin() users, and they all *seem* to be (a) under
rq_lock_irq and (b) paired with a psi_write_end() with the same cpu.

But the symptoms have been very consistent both times it happened: the
watchdog always reports RIP in collect_percpu_times(), always at that
'pause' in the "wait for seqcount to be even".

It's typically been in that psi_avgs_work kworker, but once it was
systemd-oomd that apparently had done a "read()" on it, so it went
through "psi_show()" instead.

Now, the *writers* all take the proper locks, but the readers don't.
And my laptop has CONFIG_PREEMPT_VOLUNTARY in its config (random old
setting).

I'm not seeing why that would matter, since the seq count should
become even at some point, but it does mean that the seqcount read
loop looks like it's an endless kernel loop when it triggers. I don't
see how that would make a difference, since the seqcount should become
even on the writer side and the writers shouldn't be preempted and get
some kind of priority inversion with a reader that doesn't go away,
but *if* there is some bug in this area, maybe that config is why I'm
seeing it and others aren't?

Any ideas, people?

              Linus
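
The pairing invariant described above (every write-begin matched by a
write-end on the same CPU) can be illustrated with a small user-space
sketch. It is not a claim about what actually caused the lockup;
DEMO_NR_CPUS, demo_seq and the demo_cpu_*() helpers are hypothetical
stand-ins for a per-CPU seqcount such as psi_seq. It only shows that a
begin/end pair split across two CPUs leaves a counter odd, which is the
state a reader would wait on.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_NR_CPUS 4	/* hypothetical CPU count for the sketch */

/* Hypothetical stand-in for a per-CPU seqcount array such as psi_seq. */
static atomic_uint demo_seq[DEMO_NR_CPUS];

/* Enter the write section for one CPU: its counter becomes odd. */
static inline void demo_cpu_write_begin(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	atomic_fetch_add(&demo_seq[cpu], 1);
}

/* Leave the write section for one CPU: its counter becomes even again. */
static inline void demo_cpu_write_end(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	atomic_fetch_add(&demo_seq[cpu], 1);
}

/* Return the first CPU whose counter is odd, or -1 if all are even. */
static inline int demo_first_stuck_cpu(void)
{
	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		if (atomic_load(&demo_seq[cpu]) & 1)
			return cpu;
	return -1;
}
```

A check like demo_first_stuck_cpu() is the kind of assertion one could
run after a suspected unbalanced section: it returns -1 as long as every
begin was ended on the same CPU, and the index of the first odd counter
otherwise.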


* Re: [GIT PULL] Scheduler updates for v6.17
  2025-08-02 18:43   ` Linus Torvalds
@ 2025-08-02 19:46     ` Steven Rostedt
  2025-08-03 19:10       ` Linus Torvalds
  2025-08-03 17:50     ` Jeff Johnson
  1 sibling, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2025-08-02 19:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Mel Gorman,
	Tejun Heo, Valentin Schneider, Shrikanth Hegde

On Sat, 2 Aug 2025 11:43:40 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I'm not seeing why that would matter, since the seq count should
> become even at some point, but it does mean that the seqcount read
> loop looks like it's an endless kernel loop when it triggers. I don't
> see how that would make a difference, since the seqcount should become
> even on the writer side and the writers shouldn't be preempted and get
> some kind of priority inversion with a reader that doesn't go away,
> but *if* there is some bug in this area, maybe that config is why I'm
> seeing it and others aren't?
> 
> Any ideas, people?

You could try enabling the function tracer, stopping the trace with the
patch below, and seeing where it happened.

 # echo function > /sys/kernel/tracing/current_tracer
 # echo 1 > /sys/kernel/tracing/tracing_on

After it happens you can take a look at:

  # cat /sys/kernel/tracing/trace

where it would have stopped at the soft lockup. Now the function tracer
will fill up the buffer quickly and it may only have a fraction of a second's
worth of data, thus it will not have the locked up task, but it may give
you an idea of what is keeping it from getting out of the read_seq loop.

Note that the function tracer will have a noticeable impact on performance.
But it may open up the race window even wider.

-- Steve

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 80b56c002c7f..7ac934efd8af 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -795,6 +795,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 #ifdef CONFIG_SYSFS
 		++softlockup_count;
 #endif
+		trace_printk("SOFT LOCK UP DETECTED\n");
+		tracing_off();
 
 		/*
 		 * Prevent multiple soft-lockup reports if one cpu is already


* Re: [GIT PULL] Scheduler updates for v6.17
  2025-08-02 18:43   ` Linus Torvalds
  2025-08-02 19:46     ` Steven Rostedt
@ 2025-08-03 17:50     ` Jeff Johnson
  1 sibling, 0 replies; 11+ messages in thread
From: Jeff Johnson @ 2025-08-03 17:50 UTC (permalink / raw)
  To: Linus Torvalds, Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Tejun Heo, Valentin Schneider, Shrikanth Hegde

On 8/2/25 11:43, Linus Torvalds wrote:
> On Wed, 30 Jul 2025 at 20:31, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> On Sun, 27 Jul 2025 at 23:48, Ingo Molnar <mingo@kernel.org> wrote:
>>>
>>> PSI:
>>>
>>>  - Improve scalability by optimizing psi_group_change() cpu_clock() usage
>>>    (Peter Zijlstra)
>>
>> I suspect this is buggy.
>>
>> Maybe this is coincidence, but that sounds very unlikely:
>>
>>   watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:3:7996]
>>   CPU#0 Utilization every 4s during lockup:
> 
> Happened again this morning, and as far as I can tell the machine was
> just sitting there idle at the desktop.
> 
> I've only seen this on my laptop, so maybe it's some hw dependency,
> but it *really* smells like commit 570c8efd5eb7 ("sched/psi: Optimize
> psi_group_change() cpu_clock() usage") from the symptoms. It's
> literally hanging on that psi_read_begin(), which is that
> read_seqcount_begin() on that new per-cpu psi_seq counter.
> 
> Now, I'm not seeing how it could possibly trigger - I looked through
> all the psi_write_begin() users, and they all *seem* to be (a) under
> rq_lock_irq and (b) paired with a psi_write_end() with the same cpu.
> 
> But the symptoms have been very consistent both times it happened: the
> RIP always a watchdog in collect_percpu_times(), always at that
> 'pause' in the "wait for seqcount to be even".
> 
> It's typically been in that psi_avgs_work kworker, but once it was
> systemd-oomd that apparently had done a "read()" on it, so it went
> through "psi_show()" instead.
> 
> Now, the *writers* all take the proper locks, but the readers don't.
> And my laptop has CONFIG_PREEMPT_VOLUNTARY in its config (random old
> setting).
> 
> I'm not seeing why that would matter, since the seq count should
> become even at some point, but it does mean that the seqcount read
> loop looks like it's an endless kernel loop when it triggers. I don't
> see how that would make a difference, since the seqcount should become
> even on the writer side and the writers shouldn't be preempted and get
> some kind of priority inversion with a reader that doesn't go away,
> but *if* there is some bug in this area, maybe that config is why I'm
> seeing it and others aren't?
> 
> Any ideas, people?

FWIW I'm seeing the same thing.

Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 21s! [kworker/3:0:3977]
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: Modules linked in: snd_seq_dummy snd_hrtimer ccm michael_mic bnep amdgpu snd_hda_codec_hdmi amdxcp gpu_sched drm_panel_backlight_quirks rmi_smbus rmi_core qrtr_mhi snd_hda_codec_generic at24 intel_rapl_msr binfmt_misc snd_hda_intel snd_hda_codec intel_rapl_common mei_hdcp snd_hda_core x86_pkg_temp_thermal qrtr snd_intel_dspcfg snd_intel_sdw_acpi intel_powerclamp snd_hwdep snd_pcm uvcvideo ath12k coretemp videobuf2_vmalloc qmi_helpers ghash_clmulni_intel nls_iso8859_1 aesni_intel uvc rapl snd_seq_midi videobuf2_memops mac80211 wmi_bmof snd_seq_midi_event libarc4 intel_cstate i2c_i801 videobuf2_v4l2 i915 i2c_mux radeon snd_rawmidi videobuf2_common drm_ttm_helper drm_buddy cfg80211 drm_exec i2c_smbus videodev ttm btusb drm_suballoc_helper snd_seq mc drm_client_lib btrtl btintel drm_display_helper btbcm mhi snd_seq_device btmtk cec snd_timer rc_core drm_kms_helper bluetooth mei_me snd lpc_ich mei i2c_algo_bit soundcore wireless_hotkey tpm_infineon input_leds joydev mac_hid serio_raw msr parport_pc ppdev lp
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  parport efi_pstore drm nfnetlink dmi_sysfs autofs4 rtsx_pci_sdmmc video cdc_ether usbnet mii psmouse ahci rtsx_pci libahci e1000e wmi
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: irq event stamp: 198926
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: hardirqs last  enabled at (198925): [<ffffffffa240150a>] asm_sysvec_apic_timer_interrupt+0x1a/0x20
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: hardirqs last disabled at (198926): [<ffffffffa5714d90>] sysvec_apic_timer_interrupt+0x10/0xb0
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: softirqs last  enabled at (198904): [<ffffffffa29a4ff3>] __irq_exit_rcu+0xb3/0xe0
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: softirqs last disabled at (198899): [<ffffffffa29a4ff3>] __irq_exit_rcu+0xb3/0xe0
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: CPU: 3 UID: 0 PID: 3977 Comm: kworker/3:0 Not tainted 6.16.0+ #146 PREEMPT(voluntary) 
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: Hardware name: Hewlett-Packard HP ZBook 14 G2/2216, BIOS M71 Ver. 01.31 02/24/2020
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: Workqueue: events psi_avgs_work
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: RIP: 0010:collect_percpu_times+0x77a/0xe80
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: Code: 41 5d 41 5e 41 5f c3 cc cc cc cc 48 8b 54 24 68 49 c7 c1 00 b0 51 a8 49 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 4c 01 c2 f3 90 <49> 81 ff 00 20 00 00 0f 83 93 04 00 00 80 3a 00 0f 85 38 06 00 00
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: RSP: 0018:ffff888132d3f9f0 EFLAGS: 00000202
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: RAX: 0000000000000003 RBX: ffffe8ffffdf65c0 RCX: 0000000000000000
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: RDX: fffffbfff4c7018b RSI: 0000000000000000 RDI: ffffffffa2b19025
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: RBP: fffffbfff4c7018b R08: dffffc0000000000 R09: ffffffffa851b000
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffffffffa851b000
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: R13: 000000000000085b R14: dffffc0000000000 R15: 0000000000000003
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: FS:  0000000000000000(0000) GS:ffff888467291000(0000) knlGS:0000000000000000
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: CR2: 00007f3350001158 CR3: 00000001040bc001 CR4: 00000000003706f0
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel: Call Trace:
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  <TASK>
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? do_raw_spin_lock+0x12d/0x270
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? __pfx_collect_percpu_times+0x10/0x10
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? __pfx___mutex_lock+0x10/0x10
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? _raw_spin_unlock_irqrestore+0x27/0x60
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  psi_avgs_work+0x96/0x200
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? lock_acquire+0x154/0x2d0
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? __pfx_psi_avgs_work+0x10/0x10
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? lock_release+0xc6/0x2a0
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  process_one_work+0x86e/0x14b0
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? __pfx_process_one_work+0x10/0x10
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? assign_work+0x16c/0x240
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  worker_thread+0x5d0/0xfc0
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? __pfx_worker_thread+0x10/0x10
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  kthread+0x375/0x750
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? __pfx_kthread+0x10/0x10
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? ret_from_fork+0x1f/0x2f0
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? lock_release+0xc6/0x2a0
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? __pfx_kthread+0x10/0x10
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ret_from_fork+0x215/0x2f0
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ? __pfx_kthread+0x10/0x10
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  ret_from_fork_asm+0x1a/0x30
Aug 03 10:17:26 qca-HP-ZBook-14-G2 kernel:  </TASK>

just a bit before, if it matters (this sequence occurred 3 times)...
Aug 03 10:14:02 qca-HP-ZBook-14-G2 rtkit-daemon[1557]: The canary thread is apparently starving. Taking action.
Aug 03 10:14:02 qca-HP-ZBook-14-G2 rtkit-daemon[1557]: Demoting known real-time threads.
Aug 03 10:14:02 qca-HP-ZBook-14-G2 rtkit-daemon[1557]: Successfully demoted thread 1861 of process 1789.
Aug 03 10:14:02 qca-HP-ZBook-14-G2 rtkit-daemon[1557]: Successfully demoted thread 1556 of process 1506.
Aug 03 10:14:02 qca-HP-ZBook-14-G2 rtkit-daemon[1557]: Successfully demoted thread 1567 of process 1505.
Aug 03 10:14:02 qca-HP-ZBook-14-G2 rtkit-daemon[1557]: Successfully demoted thread 1505 of process 1505.
Aug 03 10:14:02 qca-HP-ZBook-14-G2 rtkit-daemon[1557]: Successfully demoted thread 1568 of process 1510.
Aug 03 10:14:02 qca-HP-ZBook-14-G2 rtkit-daemon[1557]: Successfully demoted thread 1510 of process 1510.
Aug 03 10:14:02 qca-HP-ZBook-14-G2 rtkit-daemon[1557]: Successfully demoted thread 1559 of process 1509.
Aug 03 10:14:02 qca-HP-ZBook-14-G2 rtkit-daemon[1557]: Successfully demoted thread 1509 of process 1509.
Aug 03 10:14:02 qca-HP-ZBook-14-G2 rtkit-daemon[1557]: Demoted 8 threads.



* Re: [GIT PULL] Scheduler updates for v6.17
  2025-08-02 19:46     ` Steven Rostedt
@ 2025-08-03 19:10       ` Linus Torvalds
  2025-08-03 19:24         ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2025-08-03 19:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Mel Gorman,
	Tejun Heo, Valentin Schneider, Shrikanth Hegde

On Sat, 2 Aug 2025 at 12:46, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> You could try to enable function tracer and stop the trace with the patch
> below and see where it happened.
>
>  # echo function > /sys/kernel/tracing/current_tracer
>  # echo 1 > /sys/kernel/tracing/tracing_on
>
> After it happens you can take a look at:
>
>   # cat /sys/kernel/tracing/trace

Note that when this happens, the machine is dead.

It seems to be alive enough to get this logged, but it's dead from a
functional standpoint. There's no "when it happens, do this".

                Linus

* Re: [GIT PULL] Scheduler updates for v6.17
  2025-08-03 19:10       ` Linus Torvalds
@ 2025-08-03 19:24         ` Steven Rostedt
  2025-08-03 19:36           ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2025-08-03 19:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Mel Gorman,
	Tejun Heo, Valentin Schneider, Shrikanth Hegde

On Sun, 3 Aug 2025 12:10:56 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 2 Aug 2025 at 12:46, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > You could try to enable function tracer and stop the trace with the patch
> > below and see where it happened.
> >
> >  # echo function > /sys/kernel/tracing/current_tracer
> >  # echo 1 > /sys/kernel/tracing/tracing_on
> >
> > After it happens you can take a look at:
> >
> >   # cat /sys/kernel/tracing/trace  
> 
> Note that when this happens, the machine is dead.
> 
> It seems to be alive enough to get this logged, but it's dead from a
> functional standpoint. There's no "when it happens, do this".

Can you trigger a forced soft reboot? 

 echo 1 > /proc/sys/kernel/panic_on_warn

With the kernel command line "panic=1".

If your machine doesn't clear the memory on reboot, you could use the
persistent ring buffer too:

Adding to the kernel command line:

  reserve_mem=20M:12M:trace trace_instance=boot_mapped@trace

Then use:

  # echo function > /sys/kernel/tracing/instances/boot_mapped/current_tracer
  # echo 1 > /sys/kernel/tracing/instances/boot_mapped/tracing_on

After a crash, if the memory is persistent it should have everything up to
the crash:

   # cat /sys/kernel/tracing/instances/boot_mapped/trace  


This is the exact scenario that this was created for.
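
As a rough sketch, the two echo commands above could be wrapped in a small
helper that first checks that the boot_mapped instance exists. The function
name and its optional tracefs-root argument are hypothetical, added only so
the helper can be tried against a scratch directory; on a real system it
would be run as root with no argument.

```shell
#!/bin/sh
# Start the function tracer in the persistent "boot_mapped" instance.
# $1 is an optional tracefs root (hypothetical parameter for this sketch);
# it defaults to /sys/kernel/tracing, the path used in the commands above.
start_boot_mapped_trace() {
    inst="${1:-/sys/kernel/tracing}/instances/boot_mapped"
    if [ ! -d "$inst" ]; then
        # The instance only exists if trace_instance=boot_mapped@trace
        # was given on the kernel command line.
        echo "no boot_mapped instance at $inst; check the kernel command line" >&2
        return 1
    fi
    echo function > "$inst/current_tracer" || return 1
    echo 1 > "$inst/tracing_on"
}
```

After a crash and reboot, the trace file of the same instance is the one to
read, provided the memory survived the reboot.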

-- Steve


* Re: [GIT PULL] Scheduler updates for v6.17
  2025-08-03 19:24         ` Steven Rostedt
@ 2025-08-03 19:36           ` Steven Rostedt
  0 siblings, 0 replies; 11+ messages in thread
From: Steven Rostedt @ 2025-08-03 19:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Mel Gorman,
	Tejun Heo, Valentin Schneider, Shrikanth Hegde

On Sun, 3 Aug 2025 15:24:11 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> If your machine doesn't clear the memory on reboot, you could use the
> persistent ring buffer too:
> 
> Adding to the kernel command line:
> 
>   reserve_mem=20M:12M:trace trace_instance=boot_mapped@trace

You can tell whether this works by adding the above to the kernel command
line and booting once. Then reboot, and if the dmesg of the second boot has
something like:

  Ring buffer meta [0] is from previous boot!

You're good to go. But if it has:

  Ring buffer boot meta[0] mismatch of magic or struct size

Then it likely will not work.
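
A minimal sketch of that check, assuming the two messages appear in the
kernel log with the wording quoted above (the exact text may differ between
kernel versions). The function name is hypothetical; it reads log text on
stdin so it can be fed from dmesg or from a saved log file.

```shell
#!/bin/sh
# Classify the persistent ring buffer state from kernel log text on stdin.
# Prints "preserved", "mismatch", or "unknown".
# The match strings come from the messages quoted above (assumed wording).
ring_buffer_status() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'Ring buffer meta.*is from previous boot'; then
        echo preserved
    elif printf '%s\n' "$log" | grep -q 'Ring buffer.*mismatch of magic or struct size'; then
        echo mismatch
    else
        echo unknown
    fi
}

# Typical use on the second boot (reading the kernel log may need root):
#   dmesg | ring_buffer_status
```

An "unknown" result most likely means the reserve_mem/trace_instance
parameters were not picked up at all, which is worth checking in
/proc/cmdline.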

-- Steve

* Re: [GIT PULL] Scheduler updates for v6.17
  2025-07-31  3:31 ` Linus Torvalds
  2025-08-02 18:43   ` Linus Torvalds
@ 2025-08-04 16:50   ` Steven Rostedt
  2025-08-04 17:52     ` Linus Torvalds
  1 sibling, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2025-08-04 16:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Mel Gorman,
	Tejun Heo, Valentin Schneider, Shrikanth Hegde, Johannes Weiner,
	Chris Bainbridge

On Wed, 30 Jul 2025 20:31:44 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 27 Jul 2025 at 23:48, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > PSI:
> >
> >  - Improve scalability by optimizing psi_group_change() cpu_clock() usage
> >    (Peter Zijlstra)  
> 
> I suspect this is buggy.
> 

I forgot about this change, thinking it was added, until I saw this email:

  https://lore.kernel.org/all/20250804133240.GA1303466@cmpxchg.org/

It appears that Peter never sent in the change of:

  https://lore.kernel.org/lkml/20250716104050.GR1613200@noisy.programming.kicks-ass.net/

Looks like this could be your issue.

-- Steve

* Re: [GIT PULL] Scheduler updates for v6.17
  2025-08-04 16:50   ` Steven Rostedt
@ 2025-08-04 17:52     ` Linus Torvalds
  0 siblings, 0 replies; 11+ messages in thread
From: Linus Torvalds @ 2025-08-04 17:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Mel Gorman,
	Tejun Heo, Valentin Schneider, Shrikanth Hegde, Johannes Weiner,
	Chris Bainbridge

On Mon, 4 Aug 2025 at 09:50, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> It appears that Peter never sent in the change of:
>
>   https://lore.kernel.org/lkml/20250716104050.GR1613200@noisy.programming.kicks-ass.net/
>
> Looks like this could be your issue.

Ack. I assume Peter is on vacation, so I just applied that one
directly, because yes, that looks like a likely culprit.

            Linus
