* [PATCH v2 01/35] sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
@ 2024-05-28 0:34 ` Ankur Arora
2024-06-06 17:45 ` [tip: sched/core] " tip-bot2 for Sean Christopherson
2024-05-28 0:34 ` [PATCH v2 02/35] sched/core: Drop spinlocks on contention iff kernel is preemptible Ankur Arora
` (35 subsequent siblings)
36 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:34 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Sean Christopherson, Ankur Arora
From: Sean Christopherson <seanjc@google.com>
Move the declarations and inlined implementations of the preempt_model_*()
helpers to preempt.h so that they can be referenced in spinlock.h without
creating a potential circular dependency between spinlock.h and sched.h.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/linux/preempt.h | 41 +++++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 41 -----------------------------------------
2 files changed, 41 insertions(+), 41 deletions(-)
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 7233e9cf1bab..ce76f1a45722 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -481,4 +481,45 @@ DEFINE_LOCK_GUARD_0(preempt, preempt_disable(), preempt_enable())
DEFINE_LOCK_GUARD_0(preempt_notrace, preempt_disable_notrace(), preempt_enable_notrace())
DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
+#ifdef CONFIG_PREEMPT_DYNAMIC
+
+extern bool preempt_model_none(void);
+extern bool preempt_model_voluntary(void);
+extern bool preempt_model_full(void);
+
+#else
+
+static inline bool preempt_model_none(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_NONE);
+}
+static inline bool preempt_model_voluntary(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY);
+}
+static inline bool preempt_model_full(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT);
+}
+
+#endif
+
+static inline bool preempt_model_rt(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_RT);
+}
+
+/*
+ * Does the preemption model allow non-cooperative preemption?
+ *
+ * For !CONFIG_PREEMPT_DYNAMIC kernels this is an exact match with
+ * CONFIG_PREEMPTION; for CONFIG_PREEMPT_DYNAMIC this doesn't work as the
+ * kernel is *built* with CONFIG_PREEMPTION=y but may run with e.g. the
+ * PREEMPT_NONE model.
+ */
+static inline bool preempt_model_preemptible(void)
+{
+ return preempt_model_full() || preempt_model_rt();
+}
+
#endif /* __LINUX_PREEMPT_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3c2abbc587b4..73a3402843c6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2060,47 +2060,6 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
__cond_resched_rwlock_write(lock); \
})
-#ifdef CONFIG_PREEMPT_DYNAMIC
-
-extern bool preempt_model_none(void);
-extern bool preempt_model_voluntary(void);
-extern bool preempt_model_full(void);
-
-#else
-
-static inline bool preempt_model_none(void)
-{
- return IS_ENABLED(CONFIG_PREEMPT_NONE);
-}
-static inline bool preempt_model_voluntary(void)
-{
- return IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY);
-}
-static inline bool preempt_model_full(void)
-{
- return IS_ENABLED(CONFIG_PREEMPT);
-}
-
-#endif
-
-static inline bool preempt_model_rt(void)
-{
- return IS_ENABLED(CONFIG_PREEMPT_RT);
-}
-
-/*
- * Does the preemption model allow non-cooperative preemption?
- *
- * For !CONFIG_PREEMPT_DYNAMIC kernels this is an exact match with
- * CONFIG_PREEMPTION; for CONFIG_PREEMPT_DYNAMIC this doesn't work as the
- * kernel is *built* with CONFIG_PREEMPTION=y but may run with e.g. the
- * PREEMPT_NONE model.
- */
-static inline bool preempt_model_preemptible(void)
-{
- return preempt_model_full() || preempt_model_rt();
-}
-
static __always_inline bool need_resched(void)
{
return unlikely(tif_need_resched());
--
2.31.1
* [tip: sched/core] sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
2024-05-28 0:34 ` [PATCH v2 01/35] sched/core: Move preempt_model_*() helpers from sched.h to preempt.h Ankur Arora
@ 2024-06-06 17:45 ` tip-bot2 for Sean Christopherson
0 siblings, 0 replies; 95+ messages in thread
From: tip-bot2 for Sean Christopherson @ 2024-06-06 17:45 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sean Christopherson, Peter Zijlstra (Intel), Ankur Arora, x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: f0dc887f21d18791037c0166f652c67da761f16f
Gitweb: https://git.kernel.org/tip/f0dc887f21d18791037c0166f652c67da761f16f
Author: Sean Christopherson <seanjc@google.com>
AuthorDate: Mon, 27 May 2024 17:34:47 -07:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 05 Jun 2024 16:52:36 +02:00
sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
Move the declarations and inlined implementations of the preempt_model_*()
helpers to preempt.h so that they can be referenced in spinlock.h without
creating a potential circular dependency between spinlock.h and sched.h.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
Link: https://lkml.kernel.org/r/20240528003521.979836-2-ankur.a.arora@oracle.com
---
include/linux/preempt.h | 41 ++++++++++++++++++++++++++++++++++++++++-
include/linux/sched.h | 41 +----------------------------------------
2 files changed, 41 insertions(+), 41 deletions(-)
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 7233e9c..ce76f1a 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -481,4 +481,45 @@ DEFINE_LOCK_GUARD_0(preempt, preempt_disable(), preempt_enable())
DEFINE_LOCK_GUARD_0(preempt_notrace, preempt_disable_notrace(), preempt_enable_notrace())
DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
+#ifdef CONFIG_PREEMPT_DYNAMIC
+
+extern bool preempt_model_none(void);
+extern bool preempt_model_voluntary(void);
+extern bool preempt_model_full(void);
+
+#else
+
+static inline bool preempt_model_none(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_NONE);
+}
+static inline bool preempt_model_voluntary(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY);
+}
+static inline bool preempt_model_full(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT);
+}
+
+#endif
+
+static inline bool preempt_model_rt(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_RT);
+}
+
+/*
+ * Does the preemption model allow non-cooperative preemption?
+ *
+ * For !CONFIG_PREEMPT_DYNAMIC kernels this is an exact match with
+ * CONFIG_PREEMPTION; for CONFIG_PREEMPT_DYNAMIC this doesn't work as the
+ * kernel is *built* with CONFIG_PREEMPTION=y but may run with e.g. the
+ * PREEMPT_NONE model.
+ */
+static inline bool preempt_model_preemptible(void)
+{
+ return preempt_model_full() || preempt_model_rt();
+}
+
#endif /* __LINUX_PREEMPT_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac..90691d9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2064,47 +2064,6 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
__cond_resched_rwlock_write(lock); \
})
-#ifdef CONFIG_PREEMPT_DYNAMIC
-
-extern bool preempt_model_none(void);
-extern bool preempt_model_voluntary(void);
-extern bool preempt_model_full(void);
-
-#else
-
-static inline bool preempt_model_none(void)
-{
- return IS_ENABLED(CONFIG_PREEMPT_NONE);
-}
-static inline bool preempt_model_voluntary(void)
-{
- return IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY);
-}
-static inline bool preempt_model_full(void)
-{
- return IS_ENABLED(CONFIG_PREEMPT);
-}
-
-#endif
-
-static inline bool preempt_model_rt(void)
-{
- return IS_ENABLED(CONFIG_PREEMPT_RT);
-}
-
-/*
- * Does the preemption model allow non-cooperative preemption?
- *
- * For !CONFIG_PREEMPT_DYNAMIC kernels this is an exact match with
- * CONFIG_PREEMPTION; for CONFIG_PREEMPT_DYNAMIC this doesn't work as the
- * kernel is *built* with CONFIG_PREEMPTION=y but may run with e.g. the
- * PREEMPT_NONE model.
- */
-static inline bool preempt_model_preemptible(void)
-{
- return preempt_model_full() || preempt_model_rt();
-}
-
static __always_inline bool need_resched(void)
{
return unlikely(tif_need_resched());
* [PATCH v2 02/35] sched/core: Drop spinlocks on contention iff kernel is preemptible
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
2024-05-28 0:34 ` [PATCH v2 01/35] sched/core: Move preempt_model_*() helpers from sched.h to preempt.h Ankur Arora
@ 2024-05-28 0:34 ` Ankur Arora
2024-05-28 0:34 ` [PATCH v2 03/35] sched: make test_*_tsk_thread_flag() return bool Ankur Arora
` (34 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:34 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Sean Christopherson, Valentin Schneider, Marco Elver,
Frederic Weisbecker, David Matlack, Friedrich Weber, Ankur Arora,
Chen Yu
From: Sean Christopherson <seanjc@google.com>
Use preempt_model_preemptible() to detect a preemptible kernel when
deciding whether or not to reschedule in order to drop a contended
spinlock or rwlock. Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
preemption model is "none" or "voluntary". In short, make kernels with
dynamically selected models behave the same as kernels with statically
selected models.
Somewhat counter-intuitively, NOT yielding a lock can provide better
latency for the relevant tasks/processes. E.g. KVM x86's mmu_lock, a
rwlock, is often contended between an invalidation event (takes mmu_lock
for write) and a vCPU servicing a guest page fault (takes mmu_lock for
read). For _some_ setups, letting the invalidation task complete even
if there is mmu_lock contention provides lower latency for *all* tasks,
i.e. the invalidation completes sooner *and* the vCPU services the guest
page fault sooner.
But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
can vary depending on the host VMM, the guest workload, the number of
vCPUs, the number of pCPUs in the host, why there is lock contention, etc.
In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
opposite and removing contention yielding entirely) needs to come with a
big pile of data proving that changing the status quo is a net positive.
Opportunistically document this side effect of preempt=full, as yielding
contended spinlocks can have significant, user-visible impact.
Fixes: c597bfddc9e9 ("sched: Provide Kconfig support for default dynamic preempt mode")
Link: https://lore.kernel.org/kvm/ef81ff36-64bb-4cfe-ae9b-e3acf47bff24@proxmox.com
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Marco Elver <elver@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Friedrich Weber <f.weber@proxmox.com>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
Documentation/admin-guide/kernel-parameters.txt | 4 +++-
include/linux/spinlock.h | 14 ++++++--------
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 396137ee018d..2d693300ab57 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4722,7 +4722,9 @@
none - Limited to cond_resched() calls
voluntary - Limited to cond_resched() and might_sleep() calls
full - Any section that isn't explicitly preempt disabled
- can be preempted anytime.
+ can be preempted anytime. Tasks will also yield
+ contended spinlocks (if the critical section isn't
+ explicitly preempt disabled beyond the lock itself).
print-fatal-signals=
[KNL] debug: print fatal signals
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 3fcd20de6ca8..63dd8cf3c3c2 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -462,11 +462,10 @@ static __always_inline int spin_is_contended(spinlock_t *lock)
*/
static inline int spin_needbreak(spinlock_t *lock)
{
-#ifdef CONFIG_PREEMPTION
+ if (!preempt_model_preemptible())
+ return 0;
+
return spin_is_contended(lock);
-#else
- return 0;
-#endif
}
/*
@@ -479,11 +478,10 @@ static inline int spin_needbreak(spinlock_t *lock)
*/
static inline int rwlock_needbreak(rwlock_t *lock)
{
-#ifdef CONFIG_PREEMPTION
+ if (!preempt_model_preemptible())
+ return 0;
+
return rwlock_is_contended(lock);
-#else
- return 0;
-#endif
}
/*
--
2.31.1
* [PATCH v2 03/35] sched: make test_*_tsk_thread_flag() return bool
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
2024-05-28 0:34 ` [PATCH v2 01/35] sched/core: Move preempt_model_*() helpers from sched.h to preempt.h Ankur Arora
2024-05-28 0:34 ` [PATCH v2 02/35] sched/core: Drop spinlocks on contention iff kernel is preemptible Ankur Arora
@ 2024-05-28 0:34 ` Ankur Arora
2024-05-28 0:34 ` [PATCH v2 04/35] preempt: introduce CONFIG_PREEMPT_AUTO Ankur Arora
` (33 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:34 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora
All users of test_*_tsk_thread_flag() treat the result as boolean.
This is also true for the underlying test_and_*_bit() operations.
Change the return type to bool.
Cc: Peter Zijlstra <peterz@infradead.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
---
include/linux/sched.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 73a3402843c6..4808e5dd4f69 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1937,17 +1937,17 @@ static inline void update_tsk_thread_flag(struct task_struct *tsk, int flag,
update_ti_thread_flag(task_thread_info(tsk), flag, value);
}
-static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
}
-static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
}
-static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_ti_thread_flag(task_thread_info(tsk), flag);
}
@@ -1962,7 +1962,7 @@ static inline void clear_tsk_need_resched(struct task_struct *tsk)
clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}
-static inline int test_tsk_need_resched(struct task_struct *tsk)
+static inline bool test_tsk_need_resched(struct task_struct *tsk)
{
return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
}
--
2.31.1
* [PATCH v2 04/35] preempt: introduce CONFIG_PREEMPT_AUTO
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (2 preceding siblings ...)
2024-05-28 0:34 ` [PATCH v2 03/35] sched: make test_*_tsk_thread_flag() return bool Ankur Arora
@ 2024-05-28 0:34 ` Ankur Arora
2024-06-03 15:04 ` Shrikanth Hegde
2024-05-28 0:34 ` [PATCH v2 05/35] thread_info: selector for TIF_NEED_RESCHED[_LAZY] Ankur Arora
` (32 subsequent siblings)
36 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:34 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Jonathan Corbet
PREEMPT_AUTO adds a new scheduling model which, like PREEMPT_DYNAMIC,
allows dynamic switching between a none/voluntary/full preemption
model. However, unlike PREEMPT_DYNAMIC, it doesn't use explicit
preemption points for the voluntary models.
It works by depending on CONFIG_PREEMPTION (and thus PREEMPT_COUNT),
allowing the scheduler to always know when it is safe to preempt
for all three preemption models.
In addition, it uses an additional need-resched bit
(TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED allows the
scheduler to express two kinds of rescheduling intent: schedule at
the earliest opportunity (the usual TIF_NEED_RESCHED semantics), or
express a need for rescheduling while allowing the task on the
runqueue to run to timeslice completion (TIF_NEED_RESCHED_LAZY).
The scheduler chooses the specific need-resched flag based on
the preemption model:
                  TIF_NEED_RESCHED      TIF_NEED_RESCHED_LAZY

none              never                 always [*]
voluntary         higher sched class    other tasks [*]
full              always                never
[*] when preempting idle, or for kernel tasks that are 'urgent' in
some way (ex. resched_cpu() used as an RCU hammer), we use
TIF_NEED_RESCHED.
The other part is when preemption happens -- or, when are the
need-resched flags checked:
                    exit-to-user    ret-to-kernel    preempt_count()

NEED_RESCHED_LAZY        Y                N                 N
NEED_RESCHED             Y                Y                 Y
Exposed under CONFIG_EXPERT for now.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
.../admin-guide/kernel-parameters.txt | 1 +
include/linux/thread_info.h | 12 ++++++
init/Makefile | 1 +
kernel/Kconfig.preempt | 37 +++++++++++++++++--
4 files changed, 48 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 2d693300ab57..16a91090d167 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4719,6 +4719,7 @@
preempt= [KNL]
Select preemption mode if you have CONFIG_PREEMPT_DYNAMIC
+ or CONFIG_PREEMPT_AUTO.
none - Limited to cond_resched() calls
voluntary - Limited to cond_resched() and might_sleep() calls
full - Any section that isn't explicitly preempt disabled
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 9ea0b28068f4..06e13e7acbe2 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -59,6 +59,18 @@ enum syscall_work_bit {
#include <asm/thread_info.h>
+/*
+ * Fall back to the default need-resched flag when an architecture does not
+ * define TIF_NEED_RESCHED_LAZY.
+ *
+ * Note: with !PREEMPT_AUTO, code should not be setting TIF_NEED_RESCHED_LAZY
+ * anywhere. Define this here because we will explicitly test for this bit.
+ */
+#ifndef TIF_NEED_RESCHED_LAZY
+#define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
+#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
+#endif
+
#ifdef __KERNEL__
#ifndef arch_set_restart_data
diff --git a/init/Makefile b/init/Makefile
index cbac576c57d6..da1dba3116dc 100644
--- a/init/Makefile
+++ b/init/Makefile
@@ -27,6 +27,7 @@ smp-flag-$(CONFIG_SMP) := SMP
preempt-flag-$(CONFIG_PREEMPT_BUILD) := PREEMPT
preempt-flag-$(CONFIG_PREEMPT_DYNAMIC) := PREEMPT_DYNAMIC
preempt-flag-$(CONFIG_PREEMPT_RT) := PREEMPT_RT
+preempt-flag-$(CONFIG_PREEMPT_AUTO) := PREEMPT_AUTO
build-version = $(or $(KBUILD_BUILD_VERSION), $(build-version-auto))
build-timestamp = $(or $(KBUILD_BUILD_TIMESTAMP), $(build-timestamp-auto))
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c2f1fd95a821..fe83040ad755 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -11,13 +11,17 @@ config PREEMPT_BUILD
select PREEMPTION
select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
+config HAVE_PREEMPT_AUTO
+ bool
+
choice
prompt "Preemption Model"
default PREEMPT_NONE
config PREEMPT_NONE
bool "No Forced Preemption (Server)"
- select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
+ select PREEMPT_NONE_BUILD if (!PREEMPT_DYNAMIC && !PREEMPT_AUTO)
+
help
This is the traditional Linux preemption model, geared towards
throughput. It will still provide good latencies most of the
@@ -32,7 +36,7 @@ config PREEMPT_NONE
config PREEMPT_VOLUNTARY
bool "Voluntary Kernel Preemption (Desktop)"
depends on !ARCH_NO_PREEMPT
- select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
+ select PREEMPT_VOLUNTARY_BUILD if (!PREEMPT_DYNAMIC && !PREEMPT_AUTO)
help
This option reduces the latency of the kernel by adding more
"explicit preemption points" to the kernel code. These new
@@ -95,7 +99,7 @@ config PREEMPTION
config PREEMPT_DYNAMIC
bool "Preemption behaviour defined on boot"
- depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
+ depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
select PREEMPT_BUILD
default y if HAVE_PREEMPT_DYNAMIC_CALL
@@ -115,6 +119,33 @@ config PREEMPT_DYNAMIC
Interesting if you want the same pre-built kernel should be used for
both Server and Desktop workloads.
+config PREEMPT_AUTO
+ bool "Scheduler controlled preemption model"
+ depends on EXPERT && HAVE_PREEMPT_AUTO && !ARCH_NO_PREEMPT
+ select PREEMPT_BUILD
+ help
+ This option allows to define the preemption model on the kernel
+ command line parameter and thus override the default preemption
+ model selected during compile time.
+
+ However, note that the compile time choice of preemption model
+ might impact other kernel options like the specific RCU model.
+
+ This feature makes the latency of the kernel configurable by
+ allowing the scheduler to choose when to preempt based on
+ the preemption policy in effect. It does this without needing
+ voluntary preemption points.
+
+ With PREEMPT_NONE: the scheduler allows a task (executing in
+ user or kernel context) to run to completion, at least until
+ its current tick expires.
+
+ With PREEMPT_VOLUNTARY: similar to PREEMPT_NONE, but the scheduler
+ will also preempt for higher priority class of processes but not
+ lower.
+
+ With PREEMPT: the scheduler preempts at the earliest opportunity.
+
config SCHED_CORE
bool "Core Scheduling for SMT"
depends on SCHED_SMT
--
2.31.1
* Re: [PATCH v2 04/35] preempt: introduce CONFIG_PREEMPT_AUTO
2024-05-28 0:34 ` [PATCH v2 04/35] preempt: introduce CONFIG_PREEMPT_AUTO Ankur Arora
@ 2024-06-03 15:04 ` Shrikanth Hegde
2024-06-04 17:52 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Shrikanth Hegde @ 2024-06-03 15:04 UTC (permalink / raw)
To: Ankur Arora
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, boris.ostrovsky, konrad.wilk,
Jonathan Corbet, LKML
On 5/28/24 6:04 AM, Ankur Arora wrote:
> PREEMPT_AUTO adds a new scheduling model which, like PREEMPT_DYNAMIC,
> allows dynamic switching between a none/voluntary/full preemption
> model. However, unlike PREEMPT_DYNAMIC, it doesn't use explicit
> preemption points for the voluntary models.
>
> It works by depending on CONFIG_PREEMPTION (and thus PREEMPT_COUNT),
> allowing the scheduler to always know when it is safe to preempt
> for all three preemption models.
>
> In addition, it uses an additional need-resched bit
> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED allows the
> scheduler to express two kinds of rescheduling intent: schedule at
> the earliest opportunity (the usual TIF_NEED_RESCHED semantics), or
> express a need for rescheduling while allowing the task on the
> runqueue to run to timeslice completion (TIF_NEED_RESCHED_LAZY).
>
> The scheduler chooses the specific need-resched flag based on
> the preemption model:
>
>                   TIF_NEED_RESCHED      TIF_NEED_RESCHED_LAZY
>
> none              never                 always [*]
> voluntary         higher sched class    other tasks [*]
> full              always                never
>
> [*] when preempting idle, or for kernel tasks that are 'urgent' in
> some way (ex. resched_cpu() used as an RCU hammer), we use
> TIF_NEED_RESCHED.
>
> The other part is when preemption happens -- or, when are the
> need-resched flags checked:
>
>                     exit-to-user    ret-to-kernel    preempt_count()
>
> NEED_RESCHED_LAZY        Y                N                 N
> NEED_RESCHED             Y                Y                 Y
>
> Exposed under CONFIG_EXPERT for now.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> .../admin-guide/kernel-parameters.txt | 1 +
> include/linux/thread_info.h | 12 ++++++
> init/Makefile | 1 +
> kernel/Kconfig.preempt | 37 +++++++++++++++++--
> 4 files changed, 48 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 2d693300ab57..16a91090d167 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4719,6 +4719,7 @@
>
> preempt= [KNL]
> Select preemption mode if you have CONFIG_PREEMPT_DYNAMIC
> + or CONFIG_PREEMPT_AUTO.
> none - Limited to cond_resched() calls
> voluntary - Limited to cond_resched() and might_sleep() calls
> full - Any section that isn't explicitly preempt disabled
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 9ea0b28068f4..06e13e7acbe2 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -59,6 +59,18 @@ enum syscall_work_bit {
>
> #include <asm/thread_info.h>
>
> +/*
> + * Fall back to the default need-resched flag when an architecture does not
> + * define TIF_NEED_RESCHED_LAZY.
> + *
> + * Note: with !PREEMPT_AUTO, code should not be setting TIF_NEED_RESCHED_LAZY
> + * anywhere. Define this here because we will explicitly test for this bit.
> + */
Is this comment still valid?
I see that flag has been set without any checks in arch file.
> +#ifndef TIF_NEED_RESCHED_LAZY
> +#define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
> +#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
> +#endif
> +
> #ifdef __KERNEL__
>
> #ifndef arch_set_restart_data
* Re: [PATCH v2 04/35] preempt: introduce CONFIG_PREEMPT_AUTO
2024-06-03 15:04 ` Shrikanth Hegde
@ 2024-06-04 17:52 ` Ankur Arora
0 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-06-04 17:52 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Ankur Arora, tglx, peterz, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, boris.ostrovsky,
konrad.wilk, Jonathan Corbet, LKML
Shrikanth Hegde <sshegde@linux.ibm.com> writes:
> On 5/28/24 6:04 AM, Ankur Arora wrote:
[...]
>> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
>> index 9ea0b28068f4..06e13e7acbe2 100644
>> --- a/include/linux/thread_info.h
>> +++ b/include/linux/thread_info.h
>> @@ -59,6 +59,18 @@ enum syscall_work_bit {
>>
>> #include <asm/thread_info.h>
>>
>> +/*
>> + * Fall back to the default need-resched flag when an architecture does not
>> + * define TIF_NEED_RESCHED_LAZY.
>> + *
>> + * Note: with !PREEMPT_AUTO, code should not be setting TIF_NEED_RESCHED_LAZY
>> + * anywhere. Define this here because we will explicitly test for this bit.
>> + */
>
>
> Is this comment still valid?
> I see that flag has been set without any checks in arch file.
Thanks for pointing this out. There is a typo in this comment.
Should have said "with !HAVE_PREEMPT_AUTO" instead of "with
!PREEMPT_AUTO" above.
So, an architecture should define HAVE_PREEMPT_AUTO only if it also
defines TIF_NEED_RESCHED_LAZY and whatever else is necessary to support
PREEMPT_AUTO.
Ankur
>> +#ifndef TIF_NEED_RESCHED_LAZY
>> +#define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
>> +#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
>> +#endif
>> +
>> #ifdef __KERNEL__
>>
>> #ifndef arch_set_restart_data
--
ankur
* [PATCH v2 05/35] thread_info: selector for TIF_NEED_RESCHED[_LAZY]
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (3 preceding siblings ...)
2024-05-28 0:34 ` [PATCH v2 04/35] preempt: introduce CONFIG_PREEMPT_AUTO Ankur Arora
@ 2024-05-28 0:34 ` Ankur Arora
2024-05-28 15:55 ` Peter Zijlstra
2024-05-28 0:34 ` [PATCH v2 06/35] thread_info: define __tif_need_resched(resched_t) Ankur Arora
` (31 subsequent siblings)
36 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:34 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora
Define tif_resched() to serve as selector for the specific
need-resched flag: with tif_resched() mapping to TIF_NEED_RESCHED
or to TIF_NEED_RESCHED_LAZY.
For !CONFIG_PREEMPT_AUTO, tif_resched() always evaluates
to TIF_NEED_RESCHED, preserving existing scheduling behaviour.
Cc: Peter Zijlstra <peterz@infradead.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/linux/thread_info.h | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 06e13e7acbe2..65e5beedc915 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -71,6 +71,31 @@ enum syscall_work_bit {
#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
#endif
+typedef enum {
+ RESCHED_NOW = 0,
+ RESCHED_LAZY = 1,
+} resched_t;
+
+/*
+ * tif_resched(r) maps to TIF_NEED_RESCHED[_LAZY] with CONFIG_PREEMPT_AUTO.
+ *
+ * For !CONFIG_PREEMPT_AUTO, both tif_resched(RESCHED_NOW) and
+ * tif_resched(RESCHED_LAZY) reduce to the same value (TIF_NEED_RESCHED)
+ * leaving any scheduling behaviour unchanged.
+ */
+static __always_inline int tif_resched(resched_t rs)
+{
+ if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
+ return (rs == RESCHED_NOW) ? TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY;
+ else
+ return TIF_NEED_RESCHED;
+}
+
+static __always_inline int _tif_resched(resched_t rs)
+{
+ return 1 << tif_resched(rs);
+}
+
#ifdef __KERNEL__
#ifndef arch_set_restart_data
--
2.31.1
* Re: [PATCH v2 05/35] thread_info: selector for TIF_NEED_RESCHED[_LAZY]
2024-05-28 0:34 ` [PATCH v2 05/35] thread_info: selector for TIF_NEED_RESCHED[_LAZY] Ankur Arora
@ 2024-05-28 15:55 ` Peter Zijlstra
2024-05-30 9:07 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Peter Zijlstra @ 2024-05-28 15:55 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk
On Mon, May 27, 2024 at 05:34:51PM -0700, Ankur Arora wrote:
> Define tif_resched() to serve as selector for the specific
> need-resched flag: with tif_resched() mapping to TIF_NEED_RESCHED
> or to TIF_NEED_RESCHED_LAZY.
>
> For !CONFIG_PREEMPT_AUTO, tif_resched() always evaluates
> to TIF_NEED_RESCHED, preserving existing scheduling behaviour.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> include/linux/thread_info.h | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 06e13e7acbe2..65e5beedc915 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -71,6 +71,31 @@ enum syscall_work_bit {
> #define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
> #endif
>
> +typedef enum {
> + RESCHED_NOW = 0,
> + RESCHED_LAZY = 1,
> +} resched_t;
> +
> +/*
> + * tif_resched(r) maps to TIF_NEED_RESCHED[_LAZY] with CONFIG_PREEMPT_AUTO.
> + *
> + * For !CONFIG_PREEMPT_AUTO, both tif_resched(RESCHED_NOW) and
> + * tif_resched(RESCHED_LAZY) reduce to the same value (TIF_NEED_RESCHED)
> + * leaving any scheduling behaviour unchanged.
> + */
> +static __always_inline int tif_resched(resched_t rs)
> +{
> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
> + return (rs == RESCHED_NOW) ? TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY;
> + else
> + return TIF_NEED_RESCHED;
> +}
Perhaps:
if (IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
return TIF_NEED_RESCHED_LAZY;
return TIF_NEED_RESCHED;
hmm?
> +
> +static __always_inline int _tif_resched(resched_t rs)
> +{
> + return 1 << tif_resched(rs);
> +}
> +
> #ifdef __KERNEL__
>
> #ifndef arch_set_restart_data
> --
> 2.31.1
>
* Re: [PATCH v2 05/35] thread_info: selector for TIF_NEED_RESCHED[_LAZY]
2024-05-28 15:55 ` Peter Zijlstra
@ 2024-05-30 9:07 ` Ankur Arora
0 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-30 9:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk
Peter Zijlstra <peterz@infradead.org> writes:
> On Mon, May 27, 2024 at 05:34:51PM -0700, Ankur Arora wrote:
>> Define tif_resched() to serve as selector for the specific
>> need-resched flag: with tif_resched() mapping to TIF_NEED_RESCHED
>> or to TIF_NEED_RESCHED_LAZY.
>>
>> For !CONFIG_PREEMPT_AUTO, tif_resched() always evaluates
>> to TIF_NEED_RESCHED, preserving existing scheduling behaviour.
>>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> include/linux/thread_info.h | 25 +++++++++++++++++++++++++
>> 1 file changed, 25 insertions(+)
>>
>> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
>> index 06e13e7acbe2..65e5beedc915 100644
>> --- a/include/linux/thread_info.h
>> +++ b/include/linux/thread_info.h
>> @@ -71,6 +71,31 @@ enum syscall_work_bit {
>> #define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
>> #endif
>>
>> +typedef enum {
>> + RESCHED_NOW = 0,
>> + RESCHED_LAZY = 1,
>> +} resched_t;
>> +
>> +/*
>> + * tif_resched(r) maps to TIF_NEED_RESCHED[_LAZY] with CONFIG_PREEMPT_AUTO.
>> + *
>> + * For !CONFIG_PREEMPT_AUTO, both tif_resched(RESCHED_NOW) and
>> + * tif_resched(RESCHED_LAZY) reduce to the same value (TIF_NEED_RESCHED)
>> + * leaving any scheduling behaviour unchanged.
>> + */
>> +static __always_inline int tif_resched(resched_t rs)
>> +{
>> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
>> + return (rs == RESCHED_NOW) ? TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY;
>> + else
>> + return TIF_NEED_RESCHED;
>> +}
>
> Perhaps:
>
> if (IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
> return TIF_NEED_RESCHED_LAZY;
>
> return TIF_NEED_RESCHED;
This and other similar interface changes make it much clearer.
Thanks. Will fix.
--
ankur
* [PATCH v2 06/35] thread_info: define __tif_need_resched(resched_t)
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (4 preceding siblings ...)
2024-05-28 0:34 ` [PATCH v2 05/35] thread_info: selector for TIF_NEED_RESCHED[_LAZY] Ankur Arora
@ 2024-05-28 0:34 ` Ankur Arora
2024-05-28 16:03 ` Peter Zijlstra
2024-05-28 0:34 ` [PATCH v2 07/35] sched: define *_tsk_need_resched_lazy() helpers Ankur Arora
` (30 subsequent siblings)
36 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:34 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Arnd Bergmann, Ingo Molnar,
Vincent Guittot
Define __tif_need_resched(), which takes a resched_t parameter to
select the immediacy of the need-resched flag being checked.
Update need_resched() and should_resched() so they both check for
__tif_need_resched(RESCHED_NOW), which keeps the current semantics.
Non scheduling code -- which only cares about any immediately required
preemption -- can continue unchanged since the commonly used interfaces
(need_resched(), should_resched(), tif_need_resched()) stay the same.
This also allows lazy preemption to just be a scheduler detail.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/linux/preempt.h | 2 +-
include/linux/sched.h | 7 ++++++-
include/linux/thread_info.h | 34 ++++++++++++++++++++++++++++------
kernel/trace/trace.c | 2 +-
4 files changed, 36 insertions(+), 9 deletions(-)
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index ce76f1a45722..d453f5e34390 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -312,7 +312,7 @@ do { \
} while (0)
#define preempt_fold_need_resched() \
do { \
- if (tif_need_resched()) \
+ if (__tif_need_resched(RESCHED_NOW)) \
set_preempt_need_resched(); \
} while (0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4808e5dd4f69..37a51115b691 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2062,7 +2062,12 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
static __always_inline bool need_resched(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(__tif_need_resched(RESCHED_NOW));
+}
+
+static __always_inline bool need_resched_lazy(void)
+{
+ return unlikely(__tif_need_resched(RESCHED_LAZY));
}
/*
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 65e5beedc915..e246b01553a5 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -216,22 +216,44 @@ static __always_inline unsigned long read_ti_thread_flags(struct thread_info *ti
#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
-static __always_inline bool tif_need_resched(void)
+static __always_inline bool __tif_need_resched_bitop(int nr_flag)
{
- return arch_test_bit(TIF_NEED_RESCHED,
- (unsigned long *)(&current_thread_info()->flags));
+ return arch_test_bit(nr_flag,
+ (unsigned long *)(&current_thread_info()->flags));
}
#else
-static __always_inline bool tif_need_resched(void)
+static __always_inline bool __tif_need_resched_bitop(int nr_flag)
{
- return test_bit(TIF_NEED_RESCHED,
- (unsigned long *)(&current_thread_info()->flags));
+ return test_bit(nr_flag,
+ (unsigned long *)(&current_thread_info()->flags));
}
#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
+static __always_inline bool __tif_need_resched(resched_t rs)
+{
+ /*
+ * With !PREEMPT_AUTO, this check is only meaningful if we
+ * are checking if tif_resched(RESCHED_NOW) is set.
+ */
+ if (IS_ENABLED(CONFIG_PREEMPT_AUTO) || rs == RESCHED_NOW)
+ return __tif_need_resched_bitop(tif_resched(rs));
+ else
+ return false;
+}
+
+static __always_inline bool tif_need_resched(void)
+{
+ return __tif_need_resched(RESCHED_NOW);
+}
+
+static __always_inline bool tif_need_resched_lazy(void)
+{
+ return __tif_need_resched(RESCHED_LAZY);
+}
+
#ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
static inline int arch_within_stack_frames(const void * const stack,
const void * const stackend,
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 233d1af39fff..ed229527be05 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2511,7 +2511,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)
if (softirq_count() >> (SOFTIRQ_SHIFT + 1))
trace_flags |= TRACE_FLAG_BH_OFF;
- if (tif_need_resched())
+ if (__tif_need_resched(RESCHED_NOW))
trace_flags |= TRACE_FLAG_NEED_RESCHED;
if (test_preempt_need_resched())
trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
--
2.31.1
* Re: [PATCH v2 06/35] thread_info: define __tif_need_resched(resched_t)
2024-05-28 0:34 ` [PATCH v2 06/35] thread_info: define __tif_need_resched(resched_t) Ankur Arora
@ 2024-05-28 16:03 ` Peter Zijlstra
0 siblings, 0 replies; 95+ messages in thread
From: Peter Zijlstra @ 2024-05-28 16:03 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Arnd Bergmann, Ingo Molnar, Vincent Guittot
On Mon, May 27, 2024 at 05:34:52PM -0700, Ankur Arora wrote:
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 65e5beedc915..e246b01553a5 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -216,22 +216,44 @@ static __always_inline unsigned long read_ti_thread_flags(struct thread_info *ti
>
> #ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
>
> -static __always_inline bool tif_need_resched(void)
> +static __always_inline bool __tif_need_resched_bitop(int nr_flag)
> {
> - return arch_test_bit(TIF_NEED_RESCHED,
> - (unsigned long *)(&current_thread_info()->flags));
> + return arch_test_bit(nr_flag,
> + (unsigned long *)(&current_thread_info()->flags));
> }
>
> #else
>
> -static __always_inline bool tif_need_resched(void)
> +static __always_inline bool __tif_need_resched_bitop(int nr_flag)
> {
> - return test_bit(TIF_NEED_RESCHED,
> - (unsigned long *)(&current_thread_info()->flags));
> + return test_bit(nr_flag,
> + (unsigned long *)(&current_thread_info()->flags));
> }
:se cino=(0:0
That is, you're wrecking the indentation here.
>
> #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
>
> +static __always_inline bool __tif_need_resched(resched_t rs)
> +{
> + /*
> + * With !PREEMPT_AUTO, this check is only meaningful if we
> + * are checking if tif_resched(RESCHED_NOW) is set.
> + */
> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO) || rs == RESCHED_NOW)
> + return __tif_need_resched_bitop(tif_resched(rs));
> + else
> + return false;
> +}
if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
return false;
return __tif_need_resched_bitop(tif_resched(rs));
> +
> +static __always_inline bool tif_need_resched(void)
> +{
> + return __tif_need_resched(RESCHED_NOW);
> +}
> +
> +static __always_inline bool tif_need_resched_lazy(void)
> +{
> + return __tif_need_resched(RESCHED_LAZY);
> +}
> +
> #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
> static inline int arch_within_stack_frames(const void * const stack,
> const void * const stackend,
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 233d1af39fff..ed229527be05 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2511,7 +2511,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)
> if (softirq_count() >> (SOFTIRQ_SHIFT + 1))
> trace_flags |= TRACE_FLAG_BH_OFF;
>
> - if (tif_need_resched())
> + if (__tif_need_resched(RESCHED_NOW))
> trace_flags |= TRACE_FLAG_NEED_RESCHED;
Per the above this is a NO-OP.
* [PATCH v2 07/35] sched: define *_tsk_need_resched_lazy() helpers
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (5 preceding siblings ...)
2024-05-28 0:34 ` [PATCH v2 06/35] thread_info: define __tif_need_resched(resched_t) Ankur Arora
@ 2024-05-28 0:34 ` Ankur Arora
2024-05-28 16:09 ` Peter Zijlstra
2024-05-29 8:25 ` Peter Zijlstra
2024-05-28 0:34 ` [PATCH v2 08/35] entry: handle lazy rescheduling at user-exit Ankur Arora
` (29 subsequent siblings)
36 siblings, 2 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:34 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
Define __{set,test}_tsk_need_resched(), which take a resched_t
parameter specifying the immediacy of the need-resched flag.
The current helpers, {set,test}_tsk_need_resched(...) stay the same.
In scheduler code, switch to the more explicit variants,
__set_tsk_need_resched(...), __test_tsk_need_resched(...).
Note that clear_tsk_need_resched() is only used from __schedule()
to clear the flags before switching context. Now it clears all the
need-resched flags.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/linux/sched.h | 45 +++++++++++++++++++++++++++++++++++++----
kernel/sched/core.c | 9 +++++----
kernel/sched/deadline.c | 4 ++--
kernel/sched/fair.c | 2 +-
kernel/sched/rt.c | 4 ++--
5 files changed, 51 insertions(+), 13 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 37a51115b691..804a76e6f3c5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1952,19 +1952,56 @@ static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
return test_ti_thread_flag(task_thread_info(tsk), flag);
}
-static inline void set_tsk_need_resched(struct task_struct *tsk)
+/*
+ * With !CONFIG_PREEMPT_AUTO, tif_resched(RESCHED_LAZY) reduces to
+ * tif_resched(RESCHED_NOW). Add a check in the helpers below to ensure
+ * we don't touch the tif_resched(RESCHED_NOW) bit unnecessarily.
+ */
+static inline void __set_tsk_need_resched(struct task_struct *tsk, resched_t rs)
{
- set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
+ if (IS_ENABLED(CONFIG_PREEMPT_AUTO) || rs == RESCHED_NOW)
+ set_tsk_thread_flag(tsk, tif_resched(rs));
+ else
+ /*
+ * RESCHED_LAZY is only touched under CONFIG_PREEMPT_AUTO.
+ */
+ BUG();
}
static inline void clear_tsk_need_resched(struct task_struct *tsk)
{
- clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
+ clear_tsk_thread_flag(tsk, tif_resched(RESCHED_NOW));
+
+ if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
+ clear_tsk_thread_flag(tsk, tif_resched(RESCHED_LAZY));
+}
+
+static inline bool __test_tsk_need_resched(struct task_struct *tsk, resched_t rs)
+{
+ if (IS_ENABLED(CONFIG_PREEMPT_AUTO) || rs == RESCHED_NOW)
+ return unlikely(test_tsk_thread_flag(tsk, tif_resched(rs)));
+ else
+ return false;
}
static inline bool test_tsk_need_resched(struct task_struct *tsk)
{
- return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
+ return __test_tsk_need_resched(tsk, RESCHED_NOW);
+}
+
+static inline bool test_tsk_need_resched_lazy(struct task_struct *tsk)
+{
+ return __test_tsk_need_resched(tsk, RESCHED_LAZY);
+}
+
+static inline void set_tsk_need_resched(struct task_struct *tsk)
+{
+ return __set_tsk_need_resched(tsk, RESCHED_NOW);
+}
+
+static inline void set_tsk_need_resched_lazy(struct task_struct *tsk)
+{
+ return __set_tsk_need_resched(tsk, RESCHED_LAZY);
}
/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7019a40457a6..d00d7b45303e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -933,7 +933,7 @@ static bool set_nr_if_polling(struct task_struct *p)
#else
static inline bool set_nr_and_not_polling(struct task_struct *p)
{
- set_tsk_need_resched(p);
+ __set_tsk_need_resched(p, RESCHED_NOW);
return true;
}
@@ -1045,13 +1045,13 @@ void resched_curr(struct rq *rq)
lockdep_assert_rq_held(rq);
- if (test_tsk_need_resched(curr))
+ if (__test_tsk_need_resched(curr, RESCHED_NOW))
return;
cpu = cpu_of(rq);
if (cpu == smp_processor_id()) {
- set_tsk_need_resched(curr);
+ __set_tsk_need_resched(curr, RESCHED_NOW);
set_preempt_need_resched();
return;
}
@@ -2245,7 +2245,8 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update.
*/
- if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
+ if (task_on_rq_queued(rq->curr) &&
+ __test_tsk_need_resched(rq->curr, RESCHED_NOW))
rq_clock_skip_update(rq);
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a04a436af8cc..d24d6bfee293 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2035,7 +2035,7 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
* let us try to decide what's the best thing to do...
*/
if ((p->dl.deadline == rq->curr->dl.deadline) &&
- !test_tsk_need_resched(rq->curr))
+ !__test_tsk_need_resched(rq->curr, RESCHED_NOW))
check_preempt_equal_dl(rq, p);
#endif /* CONFIG_SMP */
}
@@ -2564,7 +2564,7 @@ static void pull_dl_task(struct rq *this_rq)
static void task_woken_dl(struct rq *rq, struct task_struct *p)
{
if (!task_on_cpu(rq, p) &&
- !test_tsk_need_resched(rq->curr) &&
+ !__test_tsk_need_resched(rq->curr, RESCHED_NOW) &&
p->nr_cpus_allowed > 1 &&
dl_task(rq->curr) &&
(rq->curr->nr_cpus_allowed < 2 ||
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c62805dbd608..c5171c247466 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8316,7 +8316,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* prevents us from potentially nominating it as a false LAST_BUDDY
* below.
*/
- if (test_tsk_need_resched(curr))
+ if (__test_tsk_need_resched(curr, RESCHED_NOW))
return;
/* Idle tasks are by definition preempted by non-idle tasks. */
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3261b067b67e..f0a6c9bb890b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1680,7 +1680,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
* to move current somewhere else, making room for our non-migratable
* task.
*/
- if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
+ if (p->prio == rq->curr->prio && !__test_tsk_need_resched(rq->curr, RESCHED_NOW))
check_preempt_equal_prio(rq, p);
#endif
}
@@ -2415,7 +2415,7 @@ static void pull_rt_task(struct rq *this_rq)
static void task_woken_rt(struct rq *rq, struct task_struct *p)
{
bool need_to_push = !task_on_cpu(rq, p) &&
- !test_tsk_need_resched(rq->curr) &&
+ !__test_tsk_need_resched(rq->curr, RESCHED_NOW) &&
p->nr_cpus_allowed > 1 &&
(dl_task(rq->curr) || rt_task(rq->curr)) &&
(rq->curr->nr_cpus_allowed < 2 ||
--
2.31.1
* Re: [PATCH v2 07/35] sched: define *_tsk_need_resched_lazy() helpers
2024-05-28 0:34 ` [PATCH v2 07/35] sched: define *_tsk_need_resched_lazy() helpers Ankur Arora
@ 2024-05-28 16:09 ` Peter Zijlstra
2024-05-30 9:02 ` Ankur Arora
2024-05-29 8:25 ` Peter Zijlstra
1 sibling, 1 reply; 95+ messages in thread
From: Peter Zijlstra @ 2024-05-28 16:09 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot
On Mon, May 27, 2024 at 05:34:53PM -0700, Ankur Arora wrote:
> Define __{set,test}_tsk_need_resched() to test for the immediacy of the
> need-resched.
>
> The current helpers, {set,test}_tsk_need_resched(...) stay the same.
>
> In scheduler code, switch to the more explicit variants,
> __set_tsk_need_resched(...), __test_tsk_need_resched(...).
>
> Note that clear_tsk_need_resched() is only used from __schedule()
> to clear the flags before switching context. Now it clears all the
> need-resched flags.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> include/linux/sched.h | 45 +++++++++++++++++++++++++++++++++++++----
> kernel/sched/core.c | 9 +++++----
> kernel/sched/deadline.c | 4 ++--
> kernel/sched/fair.c | 2 +-
> kernel/sched/rt.c | 4 ++--
> 5 files changed, 51 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 37a51115b691..804a76e6f3c5 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1952,19 +1952,56 @@ static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
> return test_ti_thread_flag(task_thread_info(tsk), flag);
> }
>
> -static inline void set_tsk_need_resched(struct task_struct *tsk)
> +/*
> + * With !CONFIG_PREEMPT_AUTO, tif_resched(RESCHED_LAZY) reduces to
> + * tif_resched(RESCHED_NOW). Add a check in the helpers below to ensure
> + * we don't touch the tif_resched(RESCHED_NOW) bit unnecessarily.
> + */
> +static inline void __set_tsk_need_resched(struct task_struct *tsk, resched_t rs)
> {
> - set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO) || rs == RESCHED_NOW)
> + set_tsk_thread_flag(tsk, tif_resched(rs));
> + else
> + /*
> + * RESCHED_LAZY is only touched under CONFIG_PREEMPT_AUTO.
> + */
> + BUG();
> }
This straight up violates coding style and would require a dose of {}.
if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
BUG();
set_tsk_thread_flag(tsk, tif_resched(rs));
seems much saner to me.
> static inline void clear_tsk_need_resched(struct task_struct *tsk)
> {
> - clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
> + clear_tsk_thread_flag(tsk, tif_resched(RESCHED_NOW));
> +
> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
> + clear_tsk_thread_flag(tsk, tif_resched(RESCHED_LAZY));
> +}
> +
> +static inline bool __test_tsk_need_resched(struct task_struct *tsk, resched_t rs)
> +{
> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO) || rs == RESCHED_NOW)
> + return unlikely(test_tsk_thread_flag(tsk, tif_resched(rs)));
> + else
> + return false;
> }
if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
return false;
return unlikely(test_tsk_thread_flag(tsk, tif_resched(rs)));
>
> static inline bool test_tsk_need_resched(struct task_struct *tsk)
> {
> - return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
> + return __test_tsk_need_resched(tsk, RESCHED_NOW);
> +}
> +
> +static inline bool test_tsk_need_resched_lazy(struct task_struct *tsk)
> +{
> + return __test_tsk_need_resched(tsk, RESCHED_LAZY);
> +}
> +
> +static inline void set_tsk_need_resched(struct task_struct *tsk)
> +{
> + return __set_tsk_need_resched(tsk, RESCHED_NOW);
> +}
> +
> +static inline void set_tsk_need_resched_lazy(struct task_struct *tsk)
> +{
> + return __set_tsk_need_resched(tsk, RESCHED_LAZY);
> }
>
> /*
So far so good, however:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7019a40457a6..d00d7b45303e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -933,7 +933,7 @@ static bool set_nr_if_polling(struct task_struct *p)
> #else
> static inline bool set_nr_and_not_polling(struct task_struct *p)
> {
> - set_tsk_need_resched(p);
> + __set_tsk_need_resched(p, RESCHED_NOW);
> return true;
> }
>
> @@ -1045,13 +1045,13 @@ void resched_curr(struct rq *rq)
>
> lockdep_assert_rq_held(rq);
>
> - if (test_tsk_need_resched(curr))
> + if (__test_tsk_need_resched(curr, RESCHED_NOW))
> return;
>
> cpu = cpu_of(rq);
>
> if (cpu == smp_processor_id()) {
> - set_tsk_need_resched(curr);
> + __set_tsk_need_resched(curr, RESCHED_NOW);
> set_preempt_need_resched();
> return;
> }
> @@ -2245,7 +2245,8 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
> * A queue event has occurred, and we're going to schedule. In
> * this case, we can save a useless back to back clock update.
> */
> - if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
> + if (task_on_rq_queued(rq->curr) &&
> + __test_tsk_need_resched(rq->curr, RESCHED_NOW))
> rq_clock_skip_update(rq);
> }
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index a04a436af8cc..d24d6bfee293 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2035,7 +2035,7 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
> * let us try to decide what's the best thing to do...
> */
> if ((p->dl.deadline == rq->curr->dl.deadline) &&
> - !test_tsk_need_resched(rq->curr))
> + !__test_tsk_need_resched(rq->curr, RESCHED_NOW))
> check_preempt_equal_dl(rq, p);
> #endif /* CONFIG_SMP */
> }
> @@ -2564,7 +2564,7 @@ static void pull_dl_task(struct rq *this_rq)
> static void task_woken_dl(struct rq *rq, struct task_struct *p)
> {
> if (!task_on_cpu(rq, p) &&
> - !test_tsk_need_resched(rq->curr) &&
> + !__test_tsk_need_resched(rq->curr, RESCHED_NOW) &&
> p->nr_cpus_allowed > 1 &&
> dl_task(rq->curr) &&
> (rq->curr->nr_cpus_allowed < 2 ||
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c62805dbd608..c5171c247466 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8316,7 +8316,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * prevents us from potentially nominating it as a false LAST_BUDDY
> * below.
> */
> - if (test_tsk_need_resched(curr))
> + if (__test_tsk_need_resched(curr, RESCHED_NOW))
> return;
>
> /* Idle tasks are by definition preempted by non-idle tasks. */
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 3261b067b67e..f0a6c9bb890b 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1680,7 +1680,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
> * to move current somewhere else, making room for our non-migratable
> * task.
> */
> - if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
> + if (p->prio == rq->curr->prio && !__test_tsk_need_resched(rq->curr, RESCHED_NOW))
> check_preempt_equal_prio(rq, p);
> #endif
> }
> @@ -2415,7 +2415,7 @@ static void pull_rt_task(struct rq *this_rq)
> static void task_woken_rt(struct rq *rq, struct task_struct *p)
> {
> bool need_to_push = !task_on_cpu(rq, p) &&
> - !test_tsk_need_resched(rq->curr) &&
> + !__test_tsk_need_resched(rq->curr, RESCHED_NOW) &&
> p->nr_cpus_allowed > 1 &&
> (dl_task(rq->curr) || rt_task(rq->curr)) &&
> (rq->curr->nr_cpus_allowed < 2 ||
These are all NO-OPs.... Changelog says:
> In scheduler code, switch to the more explicit variants,
> __set_tsk_need_resched(...), __test_tsk_need_resched(...).
But leaves me wondering *WHY* ?!?
I can't help but feel this patch attempts to do 2 things and fails to
justify at least one of them.
* Re: [PATCH v2 07/35] sched: define *_tsk_need_resched_lazy() helpers
2024-05-28 16:09 ` Peter Zijlstra
@ 2024-05-30 9:02 ` Ankur Arora
0 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-30 9:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Ingo Molnar, Vincent Guittot
Peter Zijlstra <peterz@infradead.org> writes:
> On Mon, May 27, 2024 at 05:34:53PM -0700, Ankur Arora wrote:
>> Define __{set,test}_tsk_need_resched() to test for the immediacy of the
>> need-resched.
>>
>> The current helpers, {set,test}_tsk_need_resched(...) stay the same.
>>
>> In scheduler code, switch to the more explicit variants,
>> __set_tsk_need_resched(...), __test_tsk_need_resched(...).
>>
>> Note that clear_tsk_need_resched() is only used from __schedule()
>> to clear the flags before switching context. Now it clears all the
>> need-resched flags.
>>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> include/linux/sched.h | 45 +++++++++++++++++++++++++++++++++++++----
>> kernel/sched/core.c | 9 +++++----
>> kernel/sched/deadline.c | 4 ++--
>> kernel/sched/fair.c | 2 +-
>> kernel/sched/rt.c | 4 ++--
>> 5 files changed, 51 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 37a51115b691..804a76e6f3c5 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1952,19 +1952,56 @@ static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
>> return test_ti_thread_flag(task_thread_info(tsk), flag);
>> }
>>
>> -static inline void set_tsk_need_resched(struct task_struct *tsk)
>> +/*
>> + * With !CONFIG_PREEMPT_AUTO, tif_resched(RESCHED_LAZY) reduces to
>> + * tif_resched(RESCHED_NOW). Add a check in the helpers below to ensure
>> + * we don't touch the tif_resched(RESCHED_NOW) bit unnecessarily.
>> + */
>> +static inline void __set_tsk_need_resched(struct task_struct *tsk, resched_t rs)
>> {
>> - set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
>> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO) || rs == RESCHED_NOW)
>> + set_tsk_thread_flag(tsk, tif_resched(rs));
>> + else
>> + /*
>> + * RESCHED_LAZY is only touched under CONFIG_PREEMPT_AUTO.
>> + */
>> + BUG();
>> }
>
> This straight up violates coding style and would require a dose of {}.
>
> if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
> BUG();
>
> set_tsk_thread_flag(tsk, tif_resched(rs));
>
> seems much saner to me.
>
>> static inline void clear_tsk_need_resched(struct task_struct *tsk)
>> {
>> - clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
>> + clear_tsk_thread_flag(tsk, tif_resched(RESCHED_NOW));
>> +
>> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
>> + clear_tsk_thread_flag(tsk, tif_resched(RESCHED_LAZY));
>> +}
>> +
>> +static inline bool __test_tsk_need_resched(struct task_struct *tsk, resched_t rs)
>> +{
>> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO) || rs == RESCHED_NOW)
>> + return unlikely(test_tsk_thread_flag(tsk, tif_resched(rs)));
>> + else
>> + return false;
>> }
>
> if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
> return false;
>
> return unlikely(test_tsk_thread_flag(tsk, tif_resched(rs)));
>
>>
>> static inline bool test_tsk_need_resched(struct task_struct *tsk)
>> {
>> - return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
>> + return __test_tsk_need_resched(tsk, RESCHED_NOW);
>> +}
>> +
>> +static inline bool test_tsk_need_resched_lazy(struct task_struct *tsk)
>> +{
>> + return __test_tsk_need_resched(tsk, RESCHED_LAZY);
>> +}
>> +
>> +static inline void set_tsk_need_resched(struct task_struct *tsk)
>> +{
>> + return __set_tsk_need_resched(tsk, RESCHED_NOW);
>> +}
>> +
>> +static inline void set_tsk_need_resched_lazy(struct task_struct *tsk)
>> +{
>> + return __set_tsk_need_resched(tsk, RESCHED_LAZY);
>> }
>>
>> /*
>
> So far so good, however:
>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 7019a40457a6..d00d7b45303e 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -933,7 +933,7 @@ static bool set_nr_if_polling(struct task_struct *p)
>> #else
>> static inline bool set_nr_and_not_polling(struct task_struct *p)
>> {
>> - set_tsk_need_resched(p);
>> + __set_tsk_need_resched(p, RESCHED_NOW);
>> return true;
>> }
>>
>> @@ -1045,13 +1045,13 @@ void resched_curr(struct rq *rq)
>>
>> lockdep_assert_rq_held(rq);
>>
>> - if (test_tsk_need_resched(curr))
>> + if (__test_tsk_need_resched(curr, RESCHED_NOW))
>> return;
>>
>> cpu = cpu_of(rq);
>>
>> if (cpu == smp_processor_id()) {
>> - set_tsk_need_resched(curr);
>> + __set_tsk_need_resched(curr, RESCHED_NOW);
>> set_preempt_need_resched();
>> return;
>> }
>> @@ -2245,7 +2245,8 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
>> * A queue event has occurred, and we're going to schedule. In
>> * this case, we can save a useless back to back clock update.
>> */
>> - if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
>> + if (task_on_rq_queued(rq->curr) &&
>> + __test_tsk_need_resched(rq->curr, RESCHED_NOW))
>> rq_clock_skip_update(rq);
>> }
>>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index a04a436af8cc..d24d6bfee293 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -2035,7 +2035,7 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
>> * let us try to decide what's the best thing to do...
>> */
>> if ((p->dl.deadline == rq->curr->dl.deadline) &&
>> - !test_tsk_need_resched(rq->curr))
>> + !__test_tsk_need_resched(rq->curr, RESCHED_NOW))
>> check_preempt_equal_dl(rq, p);
>> #endif /* CONFIG_SMP */
>> }
>> @@ -2564,7 +2564,7 @@ static void pull_dl_task(struct rq *this_rq)
>> static void task_woken_dl(struct rq *rq, struct task_struct *p)
>> {
>> if (!task_on_cpu(rq, p) &&
>> - !test_tsk_need_resched(rq->curr) &&
>> + !__test_tsk_need_resched(rq->curr, RESCHED_NOW) &&
>> p->nr_cpus_allowed > 1 &&
>> dl_task(rq->curr) &&
>> (rq->curr->nr_cpus_allowed < 2 ||
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c62805dbd608..c5171c247466 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8316,7 +8316,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> * prevents us from potentially nominating it as a false LAST_BUDDY
>> * below.
>> */
>> - if (test_tsk_need_resched(curr))
>> + if (__test_tsk_need_resched(curr, RESCHED_NOW))
>> return;
>>
>> /* Idle tasks are by definition preempted by non-idle tasks. */
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 3261b067b67e..f0a6c9bb890b 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -1680,7 +1680,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
>> * to move current somewhere else, making room for our non-migratable
>> * task.
>> */
>> - if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
>> + if (p->prio == rq->curr->prio && !__test_tsk_need_resched(rq->curr, RESCHED_NOW))
>> check_preempt_equal_prio(rq, p);
>> #endif
>> }
>> @@ -2415,7 +2415,7 @@ static void pull_rt_task(struct rq *this_rq)
>> static void task_woken_rt(struct rq *rq, struct task_struct *p)
>> {
>> bool need_to_push = !task_on_cpu(rq, p) &&
>> - !test_tsk_need_resched(rq->curr) &&
>> + !__test_tsk_need_resched(rq->curr, RESCHED_NOW) &&
>> p->nr_cpus_allowed > 1 &&
>> (dl_task(rq->curr) || rt_task(rq->curr)) &&
>> (rq->curr->nr_cpus_allowed < 2 ||
>
> These are all NO-OPs.... Changelog says:
>
>> In scheduler code, switch to the more explicit variants,
>> __set_tsk_need_resched(...), __test_tsk_need_resched(...).
>
> But leaves me wondering *WHY* ?!?
>
> I can't help but feel this patch attempts to do 2 things and fails to
> justify at least one of them.
So, yes, all of the scheduler changes are NOPs. In later patches the
scheduler will care about the specific resched_t type being set or
tested, and so cannot use need_resched() or need_resched_lazy().
I changed that here to minimize the interface-change noise later on.
Does something like the following help justify why they should be here?
In scheduler code, switch to the more explicit variants,
__set_tsk_need_resched(...), __test_tsk_need_resched(...) as a
preparatory step for PREEMPT_AUTO support.
Thanks for all the comments, btw.
--
ankur
^ permalink raw reply [flat|nested] 95+ messages in thread
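The __test_tsk_need_resched() helper discussed above gates the lazy flag on the build configuration. A minimal stand-alone sketch of that pattern follows; the TIF_* bit values, the IS_ENABLED() stand-in, and the reduced plain-word flags are all hypothetical stand-ins for the kernel's real per-arch machinery:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's flag machinery (illustration only). */
#define TIF_NEED_RESCHED      3
#define TIF_NEED_RESCHED_LAZY 4
#define CONFIG_PREEMPT_AUTO   1          /* pretend the option is enabled */
#define IS_ENABLED(x)         (x)

typedef enum { RESCHED_NOW, RESCHED_LAZY } resched_t;

static int tif_resched(resched_t rs)
{
	return rs == RESCHED_NOW ? TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY;
}

/* A task's thread-info flags, reduced to a plain word for the sketch. */
static bool test_bit_in(unsigned long flags, int bit)
{
	return (flags >> bit) & 1;
}

/* Mirrors __test_tsk_need_resched(): the lazy bit only exists under
 * PREEMPT_AUTO, so without it a RESCHED_LAZY query is statically false. */
static bool test_need_resched(unsigned long flags, resched_t rs)
{
	if (IS_ENABLED(CONFIG_PREEMPT_AUTO) || rs == RESCHED_NOW)
		return test_bit_in(flags, tif_resched(rs));
	return false;
}
```

With CONFIG_PREEMPT_AUTO set to 0, the RESCHED_LAZY branch folds away at compile time, which is the point of the IS_ENABLED() guard.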
* Re: [PATCH v2 07/35] sched: define *_tsk_need_resched_lazy() helpers
2024-05-28 0:34 ` [PATCH v2 07/35] sched: define *_tsk_need_resched_lazy() helpers Ankur Arora
2024-05-28 16:09 ` Peter Zijlstra
@ 2024-05-29 8:25 ` Peter Zijlstra
2024-05-30 9:08 ` Ankur Arora
1 sibling, 1 reply; 95+ messages in thread
From: Peter Zijlstra @ 2024-05-29 8:25 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot
On Mon, May 27, 2024 at 05:34:53PM -0700, Ankur Arora wrote:
> static inline void clear_tsk_need_resched(struct task_struct *tsk)
> {
> - clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
> + clear_tsk_thread_flag(tsk, tif_resched(RESCHED_NOW));
> +
> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
> + clear_tsk_thread_flag(tsk, tif_resched(RESCHED_LAZY));
> +}
(using tif_resched() here is really uncalled for)
So this will generate rather sub-optimal code, namely 2 atomics that
really should be one.
Ideally we'd write this something like:
unsigned long mask = _TIF_NEED_RESCHED;
if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
mask |= _TIF_NEED_RESCHED_LAZY;
atomic_long_andnot(mask, (atomic_long_t *)task_thread_info(tsk)->flags);
Which will clear both bits with a single atomic.
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [PATCH v2 07/35] sched: define *_tsk_need_resched_lazy() helpers
2024-05-29 8:25 ` Peter Zijlstra
@ 2024-05-30 9:08 ` Ankur Arora
0 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-30 9:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Ingo Molnar, Vincent Guittot
Peter Zijlstra <peterz@infradead.org> writes:
> On Mon, May 27, 2024 at 05:34:53PM -0700, Ankur Arora wrote:
>
>> static inline void clear_tsk_need_resched(struct task_struct *tsk)
>> {
>> - clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
>> + clear_tsk_thread_flag(tsk, tif_resched(RESCHED_NOW));
>> +
>> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
>> + clear_tsk_thread_flag(tsk, tif_resched(RESCHED_LAZY));
>> +}
>
> (using tif_resched() here is really uncalled for)
>
> So this will generate rather sub-optimal code, namely 2 atomics that
> really should be one.
>
> Ideally we'd write this something like:
>
> unsigned long mask = _TIF_NEED_RESCHED;
> if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
> mask |= _TIF_NEED_RESCHED_LAZY;
>
> atomic_long_andnot(mask, (atomic_long_t *)task_thread_info(tsk)->flags);
>
> Which will clear both bits with a single atomic.
Much better. Will fix.
--
ankur
^ permalink raw reply [flat|nested] 95+ messages in thread
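Peter's single-atomic suggestion above can be sketched in plain C11: build the mask of both need-resched bits, then clear them with one atomic and-not instead of two clear_tsk_thread_flag() calls. The _TIF_* bit positions are made-up values for illustration, not the real per-arch definitions:

```c
#include <stdatomic.h>
#include <assert.h>

#define _TIF_NEED_RESCHED       (1UL << 3)   /* hypothetical bit positions */
#define _TIF_NEED_RESCHED_LAZY  (1UL << 4)

/* Clear both need-resched bits with one atomic RMW: the mask holds the
 * always-present NOW bit, plus the LAZY bit when PREEMPT_AUTO is in play. */
static void clear_need_resched(atomic_ulong *flags, int preempt_auto)
{
	unsigned long mask = _TIF_NEED_RESCHED;

	if (preempt_auto)
		mask |= _TIF_NEED_RESCHED_LAZY;

	/* the "andnot": one atomic operation clears both bits */
	atomic_fetch_and(flags, ~mask);
}
```

Unrelated flag bits in the same word survive the operation, which is why and-not with a mask is used rather than a plain store.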
* [PATCH v2 08/35] entry: handle lazy rescheduling at user-exit
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (6 preceding siblings ...)
2024-05-28 0:34 ` [PATCH v2 07/35] sched: define *_tsk_need_resched_lazy() helpers Ankur Arora
@ 2024-05-28 0:34 ` Ankur Arora
2024-05-28 16:12 ` Peter Zijlstra
2024-05-28 0:34 ` [PATCH v2 09/35] entry/kvm: handle lazy rescheduling at guest-entry Ankur Arora
` (28 subsequent siblings)
36 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:34 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Andy Lutomirski
The scheduling policy for TIF_NEED_RESCHED_LAZY is to allow the
running task to voluntarily schedule out, running it to completion.
For archs with GENERIC_ENTRY, do this by adding a check in
exit_to_user_mode_loop().
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/linux/entry-common.h | 2 +-
kernel/entry/common.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index b0fb775a600d..f5bb19369973 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -65,7 +65,7 @@
#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
- ARCH_EXIT_TO_USER_MODE_WORK)
+ _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
/**
* arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 90843cc38588..bcb23c866425 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -98,7 +98,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
local_irq_enable_exit_to_user(ti_work);
- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();
if (ti_work & _TIF_UPROBE)
--
2.31.1
^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: [PATCH v2 08/35] entry: handle lazy rescheduling at user-exit
2024-05-28 0:34 ` [PATCH v2 08/35] entry: handle lazy rescheduling at user-exit Ankur Arora
@ 2024-05-28 16:12 ` Peter Zijlstra
0 siblings, 0 replies; 95+ messages in thread
From: Peter Zijlstra @ 2024-05-28 16:12 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Andy Lutomirski
On Mon, May 27, 2024 at 05:34:54PM -0700, Ankur Arora wrote:
> The scheduling policy for TIF_NEED_RESCHED_LAZY is to allow the
> running task to voluntarily schedule out, running it to completion.
>
> For archs with GENERIC_ENTRY, do this by adding a check in
> exit_to_user_mode_loop().
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> include/linux/entry-common.h | 2 +-
> kernel/entry/common.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index b0fb775a600d..f5bb19369973 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -65,7 +65,7 @@
> #define EXIT_TO_USER_MODE_WORK \
> (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
> _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
> - ARCH_EXIT_TO_USER_MODE_WORK)
> + _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)
Should we be wanting both TIF_NEED_RESCHED flags side-by-side?
^ permalink raw reply [flat|nested] 95+ messages in thread
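The mechanics of patch 08 above, adding the lazy flag to the exit-to-user work mask so either resched flavour triggers schedule(), can be sketched with plain bitmasks. The flag bit values here are hypothetical; the real ones are per-arch:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration; real ones are per-arch. */
#define _TIF_SIGPENDING         (1UL << 0)
#define _TIF_NEED_RESCHED       (1UL << 3)
#define _TIF_NEED_RESCHED_LAZY  (1UL << 4)

/* The work mask with the lazy bit included, as the patch does. */
#define EXIT_TO_USER_MODE_WORK \
	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)

/* exit_to_user_mode_loop() only runs while some work bit is set ... */
static bool has_exit_work(unsigned long ti_work)
{
	return ti_work & EXIT_TO_USER_MODE_WORK;
}

/* ... and inside it, either resched flavour leads to schedule(). */
static bool wants_schedule(unsigned long ti_work)
{
	return ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY);
}
```

Without the lazy bit in EXIT_TO_USER_MODE_WORK, a task with only TIF_NEED_RESCHED_LAZY set would never enter the loop, so the schedule() check inside it would never see the flag — which is why both hunks of the patch are needed together.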
* [PATCH v2 09/35] entry/kvm: handle lazy rescheduling at guest-entry
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (7 preceding siblings ...)
2024-05-28 0:34 ` [PATCH v2 08/35] entry: handle lazy rescheduling at user-exit Ankur Arora
@ 2024-05-28 0:34 ` Ankur Arora
2024-05-28 16:13 ` Peter Zijlstra
2024-05-28 0:34 ` [PATCH v2 10/35] entry: irqentry_exit only preempts for TIF_NEED_RESCHED Ankur Arora
` (27 subsequent siblings)
36 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:34 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Paolo Bonzini, Andy Lutomirski
Archs defining CONFIG_KVM_XFER_TO_GUEST_WORK call
xfer_to_guest_mode_handle_work() from various KVM vcpu-run
loops to check for any task work including rescheduling.
Handle TIF_NEED_RESCHED_LAZY alongside TIF_NEED_RESCHED.
Also, while at it, remove the explicit check for need_resched() in
the exit condition as that is already covered in the loop condition.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/linux/entry-kvm.h | 2 +-
kernel/entry/kvm.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 6813171afccb..674a622c91be 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -18,7 +18,7 @@
#define XFER_TO_GUEST_MODE_WORK \
(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
- _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+ _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
struct kvm_vcpu;
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 2e0f75bcb7fd..8485f63863af 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
return -EINTR;
}
- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();
if (ti_work & _TIF_NOTIFY_RESUME)
@@ -24,7 +24,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
return ret;
ti_work = read_thread_flags();
- } while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
+ } while (ti_work & XFER_TO_GUEST_MODE_WORK);
return 0;
}
--
2.31.1
^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: [PATCH v2 09/35] entry/kvm: handle lazy rescheduling at guest-entry
2024-05-28 0:34 ` [PATCH v2 09/35] entry/kvm: handle lazy rescheduling at guest-entry Ankur Arora
@ 2024-05-28 16:13 ` Peter Zijlstra
2024-05-30 9:04 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Peter Zijlstra @ 2024-05-28 16:13 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Paolo Bonzini, Andy Lutomirski
On Mon, May 27, 2024 at 05:34:55PM -0700, Ankur Arora wrote:
> Archs defining CONFIG_KVM_XFER_TO_GUEST_WORK call
> xfer_to_guest_mode_handle_work() from various KVM vcpu-run
> loops to check for any task work including rescheduling.
>
> Handle TIF_NEED_RESCHED_LAZY alongside TIF_NEED_RESCHED.
>
> Also, while at it, remove the explicit check for need_resched() in
> the exit condition as that is already covered in the loop condition.
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> include/linux/entry-kvm.h | 2 +-
> kernel/entry/kvm.c | 4 ++--
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
> index 6813171afccb..674a622c91be 100644
> --- a/include/linux/entry-kvm.h
> +++ b/include/linux/entry-kvm.h
> @@ -18,7 +18,7 @@
>
> #define XFER_TO_GUEST_MODE_WORK \
> (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
> + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
Same as last patch, it seems weird to have both RESCHED flags so far
apart.
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [PATCH v2 09/35] entry/kvm: handle lazy rescheduling at guest-entry
2024-05-28 16:13 ` Peter Zijlstra
@ 2024-05-30 9:04 ` Ankur Arora
0 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-30 9:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Paolo Bonzini, Andy Lutomirski
Peter Zijlstra <peterz@infradead.org> writes:
> On Mon, May 27, 2024 at 05:34:55PM -0700, Ankur Arora wrote:
>> Archs defining CONFIG_KVM_XFER_TO_GUEST_WORK call
>> xfer_to_guest_mode_handle_work() from various KVM vcpu-run
>> loops to check for any task work including rescheduling.
>>
>> Handle TIF_NEED_RESCHED_LAZY alongside TIF_NEED_RESCHED.
>>
>> Also, while at it, remove the explicit check for need_resched() in
>> the exit condition as that is already covered in the loop condition.
>>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> include/linux/entry-kvm.h | 2 +-
>> kernel/entry/kvm.c | 4 ++--
>> 2 files changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
>> index 6813171afccb..674a622c91be 100644
>> --- a/include/linux/entry-kvm.h
>> +++ b/include/linux/entry-kvm.h
>> @@ -18,7 +18,7 @@
>>
>> #define XFER_TO_GUEST_MODE_WORK \
>> (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
>> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
>> + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
>
> Same as last patch, it seems weird to have both RESCHED flags so far
> apart.
True. Will fix this and the other.
--
ankur
^ permalink raw reply [flat|nested] 95+ messages in thread
* [PATCH v2 10/35] entry: irqentry_exit only preempts for TIF_NEED_RESCHED
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (8 preceding siblings ...)
2024-05-28 0:34 ` [PATCH v2 09/35] entry/kvm: handle lazy rescheduling at guest-entry Ankur Arora
@ 2024-05-28 0:34 ` Ankur Arora
2024-05-28 16:18 ` Peter Zijlstra
2024-05-28 0:34 ` [PATCH v2 11/35] sched: __schedule_loop() doesn't need to check for need_resched_lazy() Ankur Arora
` (26 subsequent siblings)
36 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:34 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Andy Lutomirski
Use __tif_need_resched(RESCHED_NOW) instead of need_resched() to be
explicit that this path only reschedules if it is needed imminently.
Also, add a comment about why we need a need-resched check here at
all, given that the top level conditional has already checked the
preempt_count().
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/entry/common.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bcb23c866425..c684385921de 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -307,7 +307,16 @@ void raw_irqentry_exit_cond_resched(void)
rcu_irq_exit_check_preempt();
if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
WARN_ON_ONCE(!on_thread_stack());
- if (need_resched())
+
+ /*
+ * Check if we need to preempt eagerly.
+ *
+ * Note: we need an explicit check here because some
+ * architectures don't fold TIF_NEED_RESCHED in the
+ * preempt_count. For archs that do, this is already covered
+ * in the conditional above.
+ */
+ if (__tif_need_resched(RESCHED_NOW))
preempt_schedule_irq();
}
}
--
2.31.1
^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: [PATCH v2 10/35] entry: irqentry_exit only preempts for TIF_NEED_RESCHED
2024-05-28 0:34 ` [PATCH v2 10/35] entry: irqentry_exit only preempts for TIF_NEED_RESCHED Ankur Arora
@ 2024-05-28 16:18 ` Peter Zijlstra
2024-05-30 9:03 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Peter Zijlstra @ 2024-05-28 16:18 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Andy Lutomirski
On Mon, May 27, 2024 at 05:34:56PM -0700, Ankur Arora wrote:
> Use __tif_need_resched(RESCHED_NOW) instead of need_resched() to be
> explicit that this path only reschedules if it is needed imminently.
>
> Also, add a comment about why we need a need-resched check here at
> all, given that the top level conditional has already checked the
> preempt_count().
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> kernel/entry/common.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bcb23c866425..c684385921de 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -307,7 +307,16 @@ void raw_irqentry_exit_cond_resched(void)
> rcu_irq_exit_check_preempt();
> if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
> WARN_ON_ONCE(!on_thread_stack());
> - if (need_resched())
> +
> + /*
> + * Check if we need to preempt eagerly.
> + *
> + * Note: we need an explicit check here because some
> + * architectures don't fold TIF_NEED_RESCHED in the
> + * preempt_count. For archs that do, this is already covered
> + * in the conditional above.
> + */
> + if (__tif_need_resched(RESCHED_NOW))
> preempt_schedule_irq();
Seeing how you introduced need_resched_lazy() and kept need_resched() to
be the NOW thing, I really don't see the point of using the long form
here?
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [PATCH v2 10/35] entry: irqentry_exit only preempts for TIF_NEED_RESCHED
2024-05-28 16:18 ` Peter Zijlstra
@ 2024-05-30 9:03 ` Ankur Arora
0 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-30 9:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Andy Lutomirski
Peter Zijlstra <peterz@infradead.org> writes:
> On Mon, May 27, 2024 at 05:34:56PM -0700, Ankur Arora wrote:
>> Use __tif_need_resched(RESCHED_NOW) instead of need_resched() to be
>> explicit that this path only reschedules if it is needed imminently.
>>
>> Also, add a comment about why we need a need-resched check here at
>> all, given that the top level conditional has already checked the
>> preempt_count().
>>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> kernel/entry/common.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>> index bcb23c866425..c684385921de 100644
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -307,7 +307,16 @@ void raw_irqentry_exit_cond_resched(void)
>> rcu_irq_exit_check_preempt();
>> if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
>> WARN_ON_ONCE(!on_thread_stack());
>> - if (need_resched())
>> +
>> + /*
>> + * Check if we need to preempt eagerly.
>> + *
>> + * Note: we need an explicit check here because some
>> + * architectures don't fold TIF_NEED_RESCHED in the
>> + * preempt_count. For archs that do, this is already covered
>> + * in the conditional above.
>> + */
>> + if (__tif_need_resched(RESCHED_NOW))
>> preempt_schedule_irq();
>
> Seeing how you introduced need_resched_lazy() and kept need_resched() to
> be the NOW thing, I really don't see the point of using the long form
> here?
So, the reason I used the lower-level interface here (and in the scheduler)
was to spell out exactly what is happening here.
Basically, keep need_resched()/need_resched_lazy() for the non-core code.
--
ankur
^ permalink raw reply [flat|nested] 95+ messages in thread
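The comment in patch 10 above refers to architectures that fold TIF_NEED_RESCHED into the preempt count. A rough sketch of the folding trick, in the style x86 uses: the *inverse* of the need-resched bit lives in the top bit of preempt_count, so a single compare against zero answers "preemptible and resched needed" at once. The constant and the plain-word representation are simplifications of the real per-cpu implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* The inverted need-resched bit, folded into the preempt count.
 * The bit is SET when no resched is needed, and cleared when one is,
 * so count == 0 means "preempt_count zero AND resched needed". */
#define PREEMPT_NEED_RESCHED 0x80000000u

/* Called when TIF_NEED_RESCHED is set: clear the inverted bit. */
static unsigned int fold_need_resched(unsigned int pc)
{
	return pc & ~PREEMPT_NEED_RESCHED;
}

/* The single test the fold enables. */
static bool should_preempt(unsigned int pc)
{
	return pc == 0;
}
```

Architectures without this fold have preempt_count alone, which is why raw_irqentry_exit_cond_resched() still needs the explicit __tif_need_resched(RESCHED_NOW) check after testing the count.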
* [PATCH v2 11/35] sched: __schedule_loop() doesn't need to check for need_resched_lazy()
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (9 preceding siblings ...)
2024-05-28 0:34 ` [PATCH v2 10/35] entry: irqentry_exit only preempts for TIF_NEED_RESCHED Ankur Arora
@ 2024-05-28 0:34 ` Ankur Arora
2024-05-28 0:34 ` [PATCH v2 12/35] sched: separate PREEMPT_DYNAMIC config logic Ankur Arora
` (25 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:34 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
Various scheduling loops recheck need_resched() to avoid a missed
scheduling opportunity.
Explicitly note that we don't need to check for need_resched_lazy()
since that only needs to be handled at exit-to-user.
Also update the comment above __schedule() to describe
TIF_NEED_RESCHED_LAZY semantics.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/core.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d00d7b45303e..0c26b60c1101 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6582,20 +6582,23 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*
* 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
*
- * 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
- * paths. For example, see arch/x86/entry_64.S.
+ * 2. TIF_NEED_RESCHED flag is checked on interrupt and TIF_NEED_RESCHED[_LAZY]
+ * flags on userspace return paths. For example, see kernel/entry/common.c
*
- * To drive preemption between tasks, the scheduler sets the flag in timer
- * interrupt handler scheduler_tick().
+ * To drive preemption between tasks, the scheduler sets one of the need-
+ * resched flags in the timer interrupt handler scheduler_tick():
+ * - !CONFIG_PREEMPT_AUTO: TIF_NEED_RESCHED.
+ * - CONFIG_PREEMPT_AUTO: TIF_NEED_RESCHED or TIF_NEED_RESCHED_LAZY
+ * depending on the preemption model.
*
* 3. Wakeups don't really cause entry into schedule(). They add a
* task to the run-queue and that's it.
*
* Now, if the new task added to the run-queue preempts the current
- * task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
- * called on the nearest possible occasion:
+ * task, then the wakeup sets TIF_NEED_RESCHED[_LAZY] and schedule()
+ * gets called on the nearest possible occasion:
*
- * - If the kernel is preemptible (CONFIG_PREEMPTION=y):
+ * - If the kernel is running under preempt_model_preemptible():
*
* - in syscall or exception context, at the next outmost
* preempt_enable(). (this might be as soon as the wake_up()'s
@@ -6604,8 +6607,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* - in IRQ context, return from interrupt-handler to
* preemptible context
*
- * - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
- * then at the next:
+ * - If the kernel is running under preempt_model_none(), or
+ * preempt_model_voluntary(), then at the next:
*
* - cond_resched() call
* - explicit schedule() call
@@ -6823,6 +6826,11 @@ static __always_inline void __schedule_loop(unsigned int sched_mode)
preempt_disable();
__schedule(sched_mode);
sched_preempt_enable_no_resched();
+
+ /*
+ * We don't check for need_resched_lazy() here, since it is
+ * always handled at exit-to-user.
+ */
} while (need_resched());
}
@@ -6928,7 +6936,7 @@ static void __sched notrace preempt_schedule_common(void)
preempt_enable_no_resched_notrace();
/*
- * Check again in case we missed a preemption opportunity
+ * Check again in case we missed an eager preemption opportunity
* between schedule and now.
*/
} while (need_resched());
--
2.31.1
^ permalink raw reply related [flat|nested] 95+ messages in thread
* [PATCH v2 12/35] sched: separate PREEMPT_DYNAMIC config logic
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (10 preceding siblings ...)
2024-05-28 0:34 ` [PATCH v2 11/35] sched: __schedule_loop() doesn't need to check for need_resched_lazy() Ankur Arora
@ 2024-05-28 0:34 ` Ankur Arora
2024-05-28 16:25 ` Peter Zijlstra
2024-05-28 0:34 ` [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO Ankur Arora
` (24 subsequent siblings)
36 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:34 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
Pull out the PREEMPT_DYNAMIC setup logic to allow other preemption
models to dynamically configure preemption.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/core.c | 165 +++++++++++++++++++++++---------------------
1 file changed, 86 insertions(+), 79 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c26b60c1101..349f6257fdcd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8713,6 +8713,89 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
}
EXPORT_SYMBOL(__cond_resched_rwlock_write);
+#if defined(CONFIG_PREEMPT_DYNAMIC)
+
+#define PREEMPT_MODE "Dynamic Preempt"
+
+enum {
+ preempt_dynamic_undefined = -1,
+ preempt_dynamic_none,
+ preempt_dynamic_voluntary,
+ preempt_dynamic_full,
+};
+
+int preempt_dynamic_mode = preempt_dynamic_undefined;
+static DEFINE_MUTEX(sched_dynamic_mutex);
+
+int sched_dynamic_mode(const char *str)
+{
+ if (!strcmp(str, "none"))
+ return preempt_dynamic_none;
+
+ if (!strcmp(str, "voluntary"))
+ return preempt_dynamic_voluntary;
+
+ if (!strcmp(str, "full"))
+ return preempt_dynamic_full;
+
+ return -EINVAL;
+}
+
+static void __sched_dynamic_update(int mode);
+void sched_dynamic_update(int mode)
+{
+ mutex_lock(&sched_dynamic_mutex);
+ __sched_dynamic_update(mode);
+ mutex_unlock(&sched_dynamic_mutex);
+}
+
+static void __init preempt_dynamic_init(void)
+{
+ if (preempt_dynamic_mode == preempt_dynamic_undefined) {
+ if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
+ sched_dynamic_update(preempt_dynamic_none);
+ } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
+ sched_dynamic_update(preempt_dynamic_voluntary);
+ } else {
+ /* Default static call setting, nothing to do */
+ WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
+ preempt_dynamic_mode = preempt_dynamic_full;
+ pr_info("%s: full\n", PREEMPT_MODE);
+ }
+ }
+}
+
+static int __init setup_preempt_mode(char *str)
+{
+ int mode = sched_dynamic_mode(str);
+ if (mode < 0) {
+ pr_warn("%s: unsupported mode: %s\n", PREEMPT_MODE, str);
+ return 0;
+ }
+
+ sched_dynamic_update(mode);
+ return 1;
+}
+__setup("preempt=", setup_preempt_mode);
+
+#define PREEMPT_MODEL_ACCESSOR(mode) \
+ bool preempt_model_##mode(void) \
+ { \
+ WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
+ return preempt_dynamic_mode == preempt_dynamic_##mode; \
+ } \
+ EXPORT_SYMBOL_GPL(preempt_model_##mode)
+
+PREEMPT_MODEL_ACCESSOR(none);
+PREEMPT_MODEL_ACCESSOR(voluntary);
+PREEMPT_MODEL_ACCESSOR(full);
+
+#else /* !CONFIG_PREEMPT_DYNAMIC */
+
+static inline void preempt_dynamic_init(void) { }
+
+#endif /* !CONFIG_PREEMPT_DYNAMIC */
+
#ifdef CONFIG_PREEMPT_DYNAMIC
#ifdef CONFIG_GENERIC_ENTRY
@@ -8749,29 +8832,6 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* irqentry_exit_cond_resched <- irqentry_exit_cond_resched
*/
-enum {
- preempt_dynamic_undefined = -1,
- preempt_dynamic_none,
- preempt_dynamic_voluntary,
- preempt_dynamic_full,
-};
-
-int preempt_dynamic_mode = preempt_dynamic_undefined;
-
-int sched_dynamic_mode(const char *str)
-{
- if (!strcmp(str, "none"))
- return preempt_dynamic_none;
-
- if (!strcmp(str, "voluntary"))
- return preempt_dynamic_voluntary;
-
- if (!strcmp(str, "full"))
- return preempt_dynamic_full;
-
- return -EINVAL;
-}
-
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
#define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
@@ -8782,7 +8842,6 @@ int sched_dynamic_mode(const char *str)
#error "Unsupported PREEMPT_DYNAMIC mechanism"
#endif
-static DEFINE_MUTEX(sched_dynamic_mutex);
static bool klp_override;
static void __sched_dynamic_update(int mode)
@@ -8807,7 +8866,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
if (mode != preempt_dynamic_mode)
- pr_info("Dynamic Preempt: none\n");
+ pr_info("%s: none\n", PREEMPT_MODE);
break;
case preempt_dynamic_voluntary:
@@ -8818,7 +8877,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
if (mode != preempt_dynamic_mode)
- pr_info("Dynamic Preempt: voluntary\n");
+ pr_info("%s: voluntary\n", PREEMPT_MODE);
break;
case preempt_dynamic_full:
@@ -8829,20 +8888,13 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);
if (mode != preempt_dynamic_mode)
- pr_info("Dynamic Preempt: full\n");
+ pr_info("%s: full\n", PREEMPT_MODE);
break;
}
preempt_dynamic_mode = mode;
}
-void sched_dynamic_update(int mode)
-{
- mutex_lock(&sched_dynamic_mutex);
- __sched_dynamic_update(mode);
- mutex_unlock(&sched_dynamic_mutex);
-}
-
#ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
static int klp_cond_resched(void)
@@ -8873,51 +8925,6 @@ void sched_dynamic_klp_disable(void)
#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
-static int __init setup_preempt_mode(char *str)
-{
- int mode = sched_dynamic_mode(str);
- if (mode < 0) {
- pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
- return 0;
- }
-
- sched_dynamic_update(mode);
- return 1;
-}
-__setup("preempt=", setup_preempt_mode);
-
-static void __init preempt_dynamic_init(void)
-{
- if (preempt_dynamic_mode == preempt_dynamic_undefined) {
- if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
- sched_dynamic_update(preempt_dynamic_none);
- } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
- sched_dynamic_update(preempt_dynamic_voluntary);
- } else {
- /* Default static call setting, nothing to do */
- WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
- preempt_dynamic_mode = preempt_dynamic_full;
- pr_info("Dynamic Preempt: full\n");
- }
- }
-}
-
-#define PREEMPT_MODEL_ACCESSOR(mode) \
- bool preempt_model_##mode(void) \
- { \
- WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
- return preempt_dynamic_mode == preempt_dynamic_##mode; \
- } \
- EXPORT_SYMBOL_GPL(preempt_model_##mode)
-
-PREEMPT_MODEL_ACCESSOR(none);
-PREEMPT_MODEL_ACCESSOR(voluntary);
-PREEMPT_MODEL_ACCESSOR(full);
-
-#else /* !CONFIG_PREEMPT_DYNAMIC */
-
-static inline void preempt_dynamic_init(void) { }
-
#endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
/**
--
2.31.1
* Re: [PATCH v2 12/35] sched: separate PREEMPT_DYNAMIC config logic
2024-05-28 0:34 ` [PATCH v2 12/35] sched: separate PREEMPT_DYNAMIC config logic Ankur Arora
@ 2024-05-28 16:25 ` Peter Zijlstra
2024-05-30 9:30 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Peter Zijlstra @ 2024-05-28 16:25 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot
On Mon, May 27, 2024 at 05:34:58PM -0700, Ankur Arora wrote:
> Pull out the PREEMPT_DYNAMIC setup logic to allow other preemption
> models to dynamically configure preemption.
Uh what?!? What's the point of creating back-to-back #ifdef sections?
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> kernel/sched/core.c | 165 +++++++++++++++++++++++---------------------
> 1 file changed, 86 insertions(+), 79 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0c26b60c1101..349f6257fdcd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8713,6 +8713,89 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
> }
> EXPORT_SYMBOL(__cond_resched_rwlock_write);
>
> +#if defined(CONFIG_PREEMPT_DYNAMIC)
> +
> +#define PREEMPT_MODE "Dynamic Preempt"
> +
> +enum {
> + preempt_dynamic_undefined = -1,
> + preempt_dynamic_none,
> + preempt_dynamic_voluntary,
> + preempt_dynamic_full,
> +};
> +
> +int preempt_dynamic_mode = preempt_dynamic_undefined;
> +static DEFINE_MUTEX(sched_dynamic_mutex);
> +
> +int sched_dynamic_mode(const char *str)
> +{
> + if (!strcmp(str, "none"))
> + return preempt_dynamic_none;
> +
> + if (!strcmp(str, "voluntary"))
> + return preempt_dynamic_voluntary;
> +
> + if (!strcmp(str, "full"))
> + return preempt_dynamic_full;
> +
> + return -EINVAL;
> +}
> +
> +static void __sched_dynamic_update(int mode);
> +void sched_dynamic_update(int mode)
> +{
> + mutex_lock(&sched_dynamic_mutex);
> + __sched_dynamic_update(mode);
> + mutex_unlock(&sched_dynamic_mutex);
> +}
> +
> +static void __init preempt_dynamic_init(void)
> +{
> + if (preempt_dynamic_mode == preempt_dynamic_undefined) {
> + if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
> + sched_dynamic_update(preempt_dynamic_none);
> + } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
> + sched_dynamic_update(preempt_dynamic_voluntary);
> + } else {
> + /* Default static call setting, nothing to do */
> + WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
> + preempt_dynamic_mode = preempt_dynamic_full;
> + pr_info("%s: full\n", PREEMPT_MODE);
> + }
> + }
> +}
> +
> +static int __init setup_preempt_mode(char *str)
> +{
> + int mode = sched_dynamic_mode(str);
> + if (mode < 0) {
> + pr_warn("%s: unsupported mode: %s\n", PREEMPT_MODE, str);
> + return 0;
> + }
> +
> + sched_dynamic_update(mode);
> + return 1;
> +}
> +__setup("preempt=", setup_preempt_mode);
> +
> +#define PREEMPT_MODEL_ACCESSOR(mode) \
> + bool preempt_model_##mode(void) \
> + { \
> + WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
> + return preempt_dynamic_mode == preempt_dynamic_##mode; \
> + } \
> + EXPORT_SYMBOL_GPL(preempt_model_##mode)
> +
> +PREEMPT_MODEL_ACCESSOR(none);
> +PREEMPT_MODEL_ACCESSOR(voluntary);
> +PREEMPT_MODEL_ACCESSOR(full);
> +
> +#else /* !CONFIG_PREEMPT_DYNAMIC */
> +
> +static inline void preempt_dynamic_init(void) { }
> +
> +#endif /* !CONFIG_PREEMPT_DYNAMIC */
> +
> #ifdef CONFIG_PREEMPT_DYNAMIC
>
> #ifdef CONFIG_GENERIC_ENTRY
> @@ -8749,29 +8832,6 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
> * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> */
>
> -enum {
> - preempt_dynamic_undefined = -1,
> - preempt_dynamic_none,
> - preempt_dynamic_voluntary,
> - preempt_dynamic_full,
> -};
> -
> -int preempt_dynamic_mode = preempt_dynamic_undefined;
> -
> -int sched_dynamic_mode(const char *str)
> -{
> - if (!strcmp(str, "none"))
> - return preempt_dynamic_none;
> -
> - if (!strcmp(str, "voluntary"))
> - return preempt_dynamic_voluntary;
> -
> - if (!strcmp(str, "full"))
> - return preempt_dynamic_full;
> -
> - return -EINVAL;
> -}
> -
> #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> #define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
> #define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
> @@ -8782,7 +8842,6 @@ int sched_dynamic_mode(const char *str)
> #error "Unsupported PREEMPT_DYNAMIC mechanism"
> #endif
>
> -static DEFINE_MUTEX(sched_dynamic_mutex);
> static bool klp_override;
>
> static void __sched_dynamic_update(int mode)
> @@ -8807,7 +8866,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_disable(preempt_schedule_notrace);
> preempt_dynamic_disable(irqentry_exit_cond_resched);
> if (mode != preempt_dynamic_mode)
> - pr_info("Dynamic Preempt: none\n");
> + pr_info("%s: none\n", PREEMPT_MODE);
> break;
>
> case preempt_dynamic_voluntary:
> @@ -8818,7 +8877,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_disable(preempt_schedule_notrace);
> preempt_dynamic_disable(irqentry_exit_cond_resched);
> if (mode != preempt_dynamic_mode)
> - pr_info("Dynamic Preempt: voluntary\n");
> + pr_info("%s: voluntary\n", PREEMPT_MODE);
> break;
>
> case preempt_dynamic_full:
> @@ -8829,20 +8888,13 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_enable(preempt_schedule_notrace);
> preempt_dynamic_enable(irqentry_exit_cond_resched);
> if (mode != preempt_dynamic_mode)
> - pr_info("Dynamic Preempt: full\n");
> + pr_info("%s: full\n", PREEMPT_MODE);
> break;
> }
>
> preempt_dynamic_mode = mode;
> }
>
> -void sched_dynamic_update(int mode)
> -{
> - mutex_lock(&sched_dynamic_mutex);
> - __sched_dynamic_update(mode);
> - mutex_unlock(&sched_dynamic_mutex);
> -}
> -
> #ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
>
> static int klp_cond_resched(void)
> @@ -8873,51 +8925,6 @@ void sched_dynamic_klp_disable(void)
>
> #endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>
> -static int __init setup_preempt_mode(char *str)
> -{
> - int mode = sched_dynamic_mode(str);
> - if (mode < 0) {
> - pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
> - return 0;
> - }
> -
> - sched_dynamic_update(mode);
> - return 1;
> -}
> -__setup("preempt=", setup_preempt_mode);
> -
> -static void __init preempt_dynamic_init(void)
> -{
> - if (preempt_dynamic_mode == preempt_dynamic_undefined) {
> - if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
> - sched_dynamic_update(preempt_dynamic_none);
> - } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
> - sched_dynamic_update(preempt_dynamic_voluntary);
> - } else {
> - /* Default static call setting, nothing to do */
> - WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
> - preempt_dynamic_mode = preempt_dynamic_full;
> - pr_info("Dynamic Preempt: full\n");
> - }
> - }
> -}
> -
> -#define PREEMPT_MODEL_ACCESSOR(mode) \
> - bool preempt_model_##mode(void) \
> - { \
> - WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
> - return preempt_dynamic_mode == preempt_dynamic_##mode; \
> - } \
> - EXPORT_SYMBOL_GPL(preempt_model_##mode)
> -
> -PREEMPT_MODEL_ACCESSOR(none);
> -PREEMPT_MODEL_ACCESSOR(voluntary);
> -PREEMPT_MODEL_ACCESSOR(full);
> -
> -#else /* !CONFIG_PREEMPT_DYNAMIC */
> -
> -static inline void preempt_dynamic_init(void) { }
> -
> #endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
>
> /**
> --
> 2.31.1
>
* Re: [PATCH v2 12/35] sched: separate PREEMPT_DYNAMIC config logic
2024-05-28 16:25 ` Peter Zijlstra
@ 2024-05-30 9:30 ` Ankur Arora
0 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-30 9:30 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Ingo Molnar, Vincent Guittot
Peter Zijlstra <peterz@infradead.org> writes:
> On Mon, May 27, 2024 at 05:34:58PM -0700, Ankur Arora wrote:
>> Pull out the PREEMPT_DYNAMIC setup logic to allow other preemption
>> models to dynamically configure preemption.
>
> Uh what?!? What's the point of creating back-to-back #ifdef sections?
Now that you mention it, it does seem quite odd.
Assuming I keep the separation, maybe it makes sense to make the runtime
configuration its own configuration option, say CONFIG_PREEMPT_RUNTIME.
And PREEMPT_AUTO and PREEMPT_DYNAMIC could select it?
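A sketch of how that could look in kernel/Kconfig.preempt (CONFIG_PREEMPT_RUNTIME
is a proposed symbol, not an existing option, and the prompts are assumptions):

```kconfig
# Hypothetical: hide the runtime mode-selection machinery behind one
# invisible symbol that both user-selectable models pull in.
config PREEMPT_RUNTIME
	bool

config PREEMPT_DYNAMIC
	bool "Preemption behaviour defined on boot"
	select PREEMPT_RUNTIME

config PREEMPT_AUTO
	bool "Scheduler controlled preemption model"
	depends on ARCH_HAS_PREEMPT_LAZY
	select PREEMPT_RUNTIME
```

The shared code (sched_dynamic_mode(), the debugfs knob, the accessors) would
then compile under CONFIG_PREEMPT_RUNTIME instead of an #if defined() pair.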
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> kernel/sched/core.c | 165 +++++++++++++++++++++++---------------------
>> 1 file changed, 86 insertions(+), 79 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0c26b60c1101..349f6257fdcd 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8713,6 +8713,89 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
>> }
>> EXPORT_SYMBOL(__cond_resched_rwlock_write);
>>
>> +#if defined(CONFIG_PREEMPT_DYNAMIC)
>> +
>> +#define PREEMPT_MODE "Dynamic Preempt"
>> +
>> +enum {
>> + preempt_dynamic_undefined = -1,
>> + preempt_dynamic_none,
>> + preempt_dynamic_voluntary,
>> + preempt_dynamic_full,
>> +};
>> +
>> +int preempt_dynamic_mode = preempt_dynamic_undefined;
>> +static DEFINE_MUTEX(sched_dynamic_mutex);
>> +
>> +int sched_dynamic_mode(const char *str)
>> +{
>> + if (!strcmp(str, "none"))
>> + return preempt_dynamic_none;
>> +
>> + if (!strcmp(str, "voluntary"))
>> + return preempt_dynamic_voluntary;
>> +
>> + if (!strcmp(str, "full"))
>> + return preempt_dynamic_full;
>> +
>> + return -EINVAL;
>> +}
>> +
>> +static void __sched_dynamic_update(int mode);
>> +void sched_dynamic_update(int mode)
>> +{
>> + mutex_lock(&sched_dynamic_mutex);
>> + __sched_dynamic_update(mode);
>> + mutex_unlock(&sched_dynamic_mutex);
>> +}
>> +
>> +static void __init preempt_dynamic_init(void)
>> +{
>> + if (preempt_dynamic_mode == preempt_dynamic_undefined) {
>> + if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
>> + sched_dynamic_update(preempt_dynamic_none);
>> + } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
>> + sched_dynamic_update(preempt_dynamic_voluntary);
>> + } else {
>> + /* Default static call setting, nothing to do */
>> + WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
>> + preempt_dynamic_mode = preempt_dynamic_full;
>> + pr_info("%s: full\n", PREEMPT_MODE);
>> + }
>> + }
>> +}
>> +
>> +static int __init setup_preempt_mode(char *str)
>> +{
>> + int mode = sched_dynamic_mode(str);
>> + if (mode < 0) {
>> + pr_warn("%s: unsupported mode: %s\n", PREEMPT_MODE, str);
>> + return 0;
>> + }
>> +
>> + sched_dynamic_update(mode);
>> + return 1;
>> +}
>> +__setup("preempt=", setup_preempt_mode);
>> +
>> +#define PREEMPT_MODEL_ACCESSOR(mode) \
>> + bool preempt_model_##mode(void) \
>> + { \
>> + WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
>> + return preempt_dynamic_mode == preempt_dynamic_##mode; \
>> + } \
>> + EXPORT_SYMBOL_GPL(preempt_model_##mode)
>> +
>> +PREEMPT_MODEL_ACCESSOR(none);
>> +PREEMPT_MODEL_ACCESSOR(voluntary);
>> +PREEMPT_MODEL_ACCESSOR(full);
>> +
>> +#else /* !CONFIG_PREEMPT_DYNAMIC */
>> +
>> +static inline void preempt_dynamic_init(void) { }
>> +
>> +#endif /* !CONFIG_PREEMPT_DYNAMIC */
>> +
>> #ifdef CONFIG_PREEMPT_DYNAMIC
>>
>> #ifdef CONFIG_GENERIC_ENTRY
>> @@ -8749,29 +8832,6 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
>> * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>> */
>>
>> -enum {
>> - preempt_dynamic_undefined = -1,
>> - preempt_dynamic_none,
>> - preempt_dynamic_voluntary,
>> - preempt_dynamic_full,
>> -};
>> -
>> -int preempt_dynamic_mode = preempt_dynamic_undefined;
>> -
>> -int sched_dynamic_mode(const char *str)
>> -{
>> - if (!strcmp(str, "none"))
>> - return preempt_dynamic_none;
>> -
>> - if (!strcmp(str, "voluntary"))
>> - return preempt_dynamic_voluntary;
>> -
>> - if (!strcmp(str, "full"))
>> - return preempt_dynamic_full;
>> -
>> - return -EINVAL;
>> -}
>> -
>> #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>> #define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
>> #define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
>> @@ -8782,7 +8842,6 @@ int sched_dynamic_mode(const char *str)
>> #error "Unsupported PREEMPT_DYNAMIC mechanism"
>> #endif
>>
>> -static DEFINE_MUTEX(sched_dynamic_mutex);
>> static bool klp_override;
>>
>> static void __sched_dynamic_update(int mode)
>> @@ -8807,7 +8866,7 @@ static void __sched_dynamic_update(int mode)
>> preempt_dynamic_disable(preempt_schedule_notrace);
>> preempt_dynamic_disable(irqentry_exit_cond_resched);
>> if (mode != preempt_dynamic_mode)
>> - pr_info("Dynamic Preempt: none\n");
>> + pr_info("%s: none\n", PREEMPT_MODE);
>> break;
>>
>> case preempt_dynamic_voluntary:
>> @@ -8818,7 +8877,7 @@ static void __sched_dynamic_update(int mode)
>> preempt_dynamic_disable(preempt_schedule_notrace);
>> preempt_dynamic_disable(irqentry_exit_cond_resched);
>> if (mode != preempt_dynamic_mode)
>> - pr_info("Dynamic Preempt: voluntary\n");
>> + pr_info("%s: voluntary\n", PREEMPT_MODE);
>> break;
>>
>> case preempt_dynamic_full:
>> @@ -8829,20 +8888,13 @@ static void __sched_dynamic_update(int mode)
>> preempt_dynamic_enable(preempt_schedule_notrace);
>> preempt_dynamic_enable(irqentry_exit_cond_resched);
>> if (mode != preempt_dynamic_mode)
>> - pr_info("Dynamic Preempt: full\n");
>> + pr_info("%s: full\n", PREEMPT_MODE);
>> break;
>> }
>>
>> preempt_dynamic_mode = mode;
>> }
>>
>> -void sched_dynamic_update(int mode)
>> -{
>> - mutex_lock(&sched_dynamic_mutex);
>> - __sched_dynamic_update(mode);
>> - mutex_unlock(&sched_dynamic_mutex);
>> -}
>> -
>> #ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
>>
>> static int klp_cond_resched(void)
>> @@ -8873,51 +8925,6 @@ void sched_dynamic_klp_disable(void)
>>
>> #endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>>
>> -static int __init setup_preempt_mode(char *str)
>> -{
>> - int mode = sched_dynamic_mode(str);
>> - if (mode < 0) {
>> - pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
>> - return 0;
>> - }
>> -
>> - sched_dynamic_update(mode);
>> - return 1;
>> -}
>> -__setup("preempt=", setup_preempt_mode);
>> -
>> -static void __init preempt_dynamic_init(void)
>> -{
>> - if (preempt_dynamic_mode == preempt_dynamic_undefined) {
>> - if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
>> - sched_dynamic_update(preempt_dynamic_none);
>> - } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
>> - sched_dynamic_update(preempt_dynamic_voluntary);
>> - } else {
>> - /* Default static call setting, nothing to do */
>> - WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
>> - preempt_dynamic_mode = preempt_dynamic_full;
>> - pr_info("Dynamic Preempt: full\n");
>> - }
>> - }
>> -}
>> -
>> -#define PREEMPT_MODEL_ACCESSOR(mode) \
>> - bool preempt_model_##mode(void) \
>> - { \
>> - WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
>> - return preempt_dynamic_mode == preempt_dynamic_##mode; \
>> - } \
>> - EXPORT_SYMBOL_GPL(preempt_model_##mode)
>> -
>> -PREEMPT_MODEL_ACCESSOR(none);
>> -PREEMPT_MODEL_ACCESSOR(voluntary);
>> -PREEMPT_MODEL_ACCESSOR(full);
>> -
>> -#else /* !CONFIG_PREEMPT_DYNAMIC */
>> -
>> -static inline void preempt_dynamic_init(void) { }
>> -
>> #endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
>>
>> /**
>> --
>> 2.31.1
>>
--
ankur
* [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (11 preceding siblings ...)
2024-05-28 0:34 ` [PATCH v2 12/35] sched: separate PREEMPT_DYNAMIC config logic Ankur Arora
@ 2024-05-28 0:34 ` Ankur Arora
2024-05-28 16:27 ` Peter Zijlstra
2024-05-28 0:35 ` [PATCH v2 14/35] rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO Ankur Arora
` (23 subsequent siblings)
36 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:34 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
Reuse sched_dynamic_update() and related logic to enable choosing
the preemption model at boot or runtime for PREEMPT_AUTO.
The interface is identical to PREEMPT_DYNAMIC.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Changelog:
change title
---
include/linux/preempt.h | 2 +-
kernel/sched/core.c | 31 +++++++++++++++++++++++++++----
kernel/sched/debug.c | 6 +++---
kernel/sched/sched.h | 2 +-
4 files changed, 32 insertions(+), 9 deletions(-)
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index d453f5e34390..d4f568606eda 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -481,7 +481,7 @@ DEFINE_LOCK_GUARD_0(preempt, preempt_disable(), preempt_enable())
DEFINE_LOCK_GUARD_0(preempt_notrace, preempt_disable_notrace(), preempt_enable_notrace())
DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
-#ifdef CONFIG_PREEMPT_DYNAMIC
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)
extern bool preempt_model_none(void);
extern bool preempt_model_voluntary(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 349f6257fdcd..d7804e29182d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8713,9 +8713,13 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
}
EXPORT_SYMBOL(__cond_resched_rwlock_write);
-#if defined(CONFIG_PREEMPT_DYNAMIC)
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)
+#ifdef CONFIG_PREEMPT_DYNAMIC
#define PREEMPT_MODE "Dynamic Preempt"
+#else
+#define PREEMPT_MODE "Preempt Auto"
+#endif
enum {
preempt_dynamic_undefined = -1,
@@ -8790,11 +8794,11 @@ PREEMPT_MODEL_ACCESSOR(none);
PREEMPT_MODEL_ACCESSOR(voluntary);
PREEMPT_MODEL_ACCESSOR(full);
-#else /* !CONFIG_PREEMPT_DYNAMIC */
+#else /* !CONFIG_PREEMPT_DYNAMIC && !CONFIG_PREEMPT_AUTO */
static inline void preempt_dynamic_init(void) { }
-#endif /* !CONFIG_PREEMPT_DYNAMIC */
+#endif /* !CONFIG_PREEMPT_DYNAMIC && !CONFIG_PREEMPT_AUTO */
#ifdef CONFIG_PREEMPT_DYNAMIC
@@ -8925,7 +8929,26 @@ void sched_dynamic_klp_disable(void)
#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
-#endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
+#elif defined(CONFIG_PREEMPT_AUTO)
+
+static void __sched_dynamic_update(int mode)
+{
+ switch (mode) {
+ case preempt_dynamic_none:
+ preempt_dynamic_mode = preempt_dynamic_undefined;
+ break;
+
+ case preempt_dynamic_voluntary:
+ preempt_dynamic_mode = preempt_dynamic_undefined;
+ break;
+
+ case preempt_dynamic_full:
+ preempt_dynamic_mode = preempt_dynamic_undefined;
+ break;
+ }
+}
+
+#endif /* CONFIG_PREEMPT_AUTO */
/**
* yield - yield the current processor to other threads.
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8d5d98a5834d..e53f1b73bf4a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -216,7 +216,7 @@ static const struct file_operations sched_scaling_fops = {
#endif /* SMP */
-#ifdef CONFIG_PREEMPT_DYNAMIC
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)
static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
size_t cnt, loff_t *ppos)
@@ -276,7 +276,7 @@ static const struct file_operations sched_dynamic_fops = {
.release = single_release,
};
-#endif /* CONFIG_PREEMPT_DYNAMIC */
+#endif /* CONFIG_PREEMPT_DYNAMIC || CONFIG_PREEMPT_AUTO */
__read_mostly bool sched_debug_verbose;
@@ -343,7 +343,7 @@ static __init int sched_init_debug(void)
debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
debugfs_create_file_unsafe("verbose", 0644, debugfs_sched, &sched_debug_verbose, &sched_verbose_fops);
-#ifdef CONFIG_PREEMPT_DYNAMIC
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)
debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ae50f212775e..c9239c0b0095 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3231,7 +3231,7 @@ extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *w
extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
-#ifdef CONFIG_PREEMPT_DYNAMIC
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)
extern int preempt_dynamic_mode;
extern int sched_dynamic_mode(const char *str);
extern void sched_dynamic_update(int mode);
--
2.31.1
* Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO
2024-05-28 0:34 ` [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO Ankur Arora
@ 2024-05-28 16:27 ` Peter Zijlstra
2024-05-30 9:29 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Peter Zijlstra @ 2024-05-28 16:27 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot
On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
> Reuse sched_dynamic_update() and related logic to enable choosing
> the preemption model at boot or runtime for PREEMPT_AUTO.
>
> The interface is identical to PREEMPT_DYNAMIC.
Colour me confused, why?!? What are you doing and why aren't you just
adding AUTO to the existing DYNAMIC thing?
* Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO
2024-05-28 16:27 ` Peter Zijlstra
@ 2024-05-30 9:29 ` Ankur Arora
2024-06-06 11:51 ` Peter Zijlstra
0 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-30 9:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Ingo Molnar, Vincent Guittot
Peter Zijlstra <peterz@infradead.org> writes:
> On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
>> Reuse sched_dynamic_update() and related logic to enable choosing
>> the preemption model at boot or runtime for PREEMPT_AUTO.
>>
>> The interface is identical to PREEMPT_DYNAMIC.
>
> Colour me confused, why?!? What are you doing and why aren't you just
> adding AUTO to the existing DYNAMIC thing?
You mean have a single __sched_dynamic_update()? AUTO doesn't use any
of the static_call/static_key stuff so I'm not sure how that would work.
Or am I missing the point of what you are saying?
--
ankur
* Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO
2024-05-30 9:29 ` Ankur Arora
@ 2024-06-06 11:51 ` Peter Zijlstra
2024-06-06 15:11 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Peter Zijlstra @ 2024-06-06 11:51 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot
On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
>
> Peter Zijlstra <peterz@infradead.org> writes:
>
> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
> >> Reuse sched_dynamic_update() and related logic to enable choosing
> >> the preemption model at boot or runtime for PREEMPT_AUTO.
> >>
> >> The interface is identical to PREEMPT_DYNAMIC.
> >
> > Colour me confused, why?!? What are you doing and why aren't you just
> > adding AUTO to the existing DYNAMIC thing?
>
> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
> of the static_call/static_key stuff so I'm not sure how that would work.
*sigh*... see the below, seems to work.
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 6 +-
include/linux/entry-common.h | 3 +-
include/linux/entry-kvm.h | 5 +-
include/linux/sched.h | 10 +++-
include/linux/thread_info.h | 21 +++++--
kernel/Kconfig.preempt | 11 ++++
kernel/entry/common.c | 2 +-
kernel/entry/kvm.c | 4 +-
kernel/sched/core.c | 110 ++++++++++++++++++++++++++++++++-----
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 4 +-
kernel/sched/sched.h | 1 +
13 files changed, 148 insertions(+), 32 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e8837116704ce..61f86b69524d7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -91,6 +91,7 @@ config X86
select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PMEM_API if X86_64
+ select ARCH_HAS_PREEMPT_LAZY
select ARCH_HAS_PTE_DEVMAP if X86_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_HW_PTE_YOUNG
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 12da7dfd5ef13..75bb390f7baf5 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -87,8 +87,9 @@ struct thread_info {
#define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
-#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
-#define TIF_SSBD 5 /* Speculative store bypass disable */
+#define TIF_NEED_RESCHED_LAZY 4 /* rescheduling necessary */
+#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
+#define TIF_SSBD 6 /* Speculative store bypass disable */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
@@ -110,6 +111,7 @@ struct thread_info {
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index b0fb775a600d9..e66c8a7c113f4 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -64,7 +64,8 @@
#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
- _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
+ _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \
+ _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
ARCH_EXIT_TO_USER_MODE_WORK)
/**
diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 6813171afccb2..16149f6625e48 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -17,8 +17,9 @@
#endif
#define XFER_TO_GUEST_MODE_WORK \
- (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
- _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+ (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_SIGPENDING | \
+ _TIF_NOTIFY_SIGNAL | _TIF_NOTIFY_RESUME | \
+ ARCH_XFER_TO_GUEST_MODE_WORK)
struct kvm_vcpu;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7635045b2395c..5900d84e08b3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1968,7 +1968,8 @@ static inline void set_tsk_need_resched(struct task_struct *tsk)
static inline void clear_tsk_need_resched(struct task_struct *tsk)
{
- clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
+ atomic_long_andnot(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY,
+ (atomic_long_t *)&task_thread_info(tsk)->flags);
}
static inline int test_tsk_need_resched(struct task_struct *tsk)
@@ -2074,6 +2075,7 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
extern bool preempt_model_none(void);
extern bool preempt_model_voluntary(void);
extern bool preempt_model_full(void);
+extern bool preempt_model_lazy(void);
#else
@@ -2089,6 +2091,10 @@ static inline bool preempt_model_full(void)
{
return IS_ENABLED(CONFIG_PREEMPT);
}
+static inline bool preempt_model_lazy(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_LAZY);
+}
#endif
@@ -2107,7 +2113,7 @@ static inline bool preempt_model_rt(void)
*/
static inline bool preempt_model_preemptible(void)
{
- return preempt_model_full() || preempt_model_rt();
+ return preempt_model_full() || preempt_model_lazy() || preempt_model_rt();
}
static __always_inline bool need_resched(void)
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 9ea0b28068f49..cf2446c9c30d4 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -59,6 +59,14 @@ enum syscall_work_bit {
#include <asm/thread_info.h>
+#ifndef TIF_NEED_RESCHED_LAZY
+#ifdef CONFIG_ARCH_HAS_PREEMPT_LAZY
+#error Inconsistent PREEMPT_LAZY
+#endif
+#define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
+#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
+#endif
+
#ifdef __KERNEL__
#ifndef arch_set_restart_data
@@ -179,22 +187,27 @@ static __always_inline unsigned long read_ti_thread_flags(struct thread_info *ti
#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
-static __always_inline bool tif_need_resched(void)
+static __always_inline bool tif_test_bit(int bit)
{
- return arch_test_bit(TIF_NEED_RESCHED,
+ return arch_test_bit(bit,
(unsigned long *)(&current_thread_info()->flags));
}
#else
-static __always_inline bool tif_need_resched(void)
+static __always_inline bool tif_test_bit(int bit)
{
- return test_bit(TIF_NEED_RESCHED,
+ return test_bit(bit,
(unsigned long *)(&current_thread_info()->flags));
}
#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
+static __always_inline bool tif_need_resched(void)
+{
+ return tif_test_bit(TIF_NEED_RESCHED);
+}
+
#ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
static inline int arch_within_stack_frames(const void * const stack,
const void * const stackend,
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c2f1fd95a8214..1a2e3849e3e5f 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -11,6 +11,9 @@ config PREEMPT_BUILD
select PREEMPTION
select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
+config ARCH_HAS_PREEMPT_LAZY
+ bool
+
choice
prompt "Preemption Model"
default PREEMPT_NONE
@@ -67,6 +70,14 @@ config PREEMPT
embedded system with latency requirements in the milliseconds
range.
+config PREEMPT_LAZY
+ bool "Scheduler controlled preemption model"
+ depends on !ARCH_NO_PREEMPT
+ depends on ARCH_HAS_PREEMPT_LAZY
+ select PREEMPT_BUILD
+ help
+ Hamsters in your brain...
+
config PREEMPT_RT
bool "Fully Preemptible Kernel (Real-Time)"
depends on EXPERT && ARCH_SUPPORTS_RT
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 90843cc385880..bcb23c866425e 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -98,7 +98,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
local_irq_enable_exit_to_user(ti_work);
- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();
if (ti_work & _TIF_UPROBE)
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 2e0f75bcb7fd1..8485f63863afc 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
return -EINTR;
}
- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();
if (ti_work & _TIF_NOTIFY_RESUME)
@@ -24,7 +24,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
return ret;
ti_work = read_thread_flags();
- } while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
+ } while (ti_work & XFER_TO_GUEST_MODE_WORK);
return 0;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 965e6464e68e9..c32de809283cf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -904,10 +904,9 @@ static inline void hrtick_rq_init(struct rq *rq)
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
{
- struct thread_info *ti = task_thread_info(p);
- return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+ return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG);
}
/*
@@ -932,9 +931,9 @@ static bool set_nr_if_polling(struct task_struct *p)
}
#else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
{
- set_tsk_need_resched(p);
+ atomic_long_or(1 << tif, (atomic_long_t *)&ti->flags);
return true;
}
@@ -1039,28 +1038,66 @@ void wake_up_q(struct wake_q_head *head)
* might also involve a cross-CPU call to trigger the scheduler on
* the target CPU.
*/
-void resched_curr(struct rq *rq)
+static void __resched_curr(struct rq *rq, int tif)
{
struct task_struct *curr = rq->curr;
+ struct thread_info *cti = task_thread_info(curr);
int cpu;
lockdep_assert_rq_held(rq);
- if (test_tsk_need_resched(curr))
+ if (is_idle_task(curr) && tif == TIF_NEED_RESCHED_LAZY)
+ tif = TIF_NEED_RESCHED;
+
+ if (cti->flags & ((1 << tif) | _TIF_NEED_RESCHED))
return;
cpu = cpu_of(rq);
if (cpu == smp_processor_id()) {
- set_tsk_need_resched(curr);
- set_preempt_need_resched();
+ set_ti_thread_flag(cti, tif);
+ if (tif == TIF_NEED_RESCHED)
+ set_preempt_need_resched();
return;
}
- if (set_nr_and_not_polling(curr))
- smp_send_reschedule(cpu);
- else
+ if (set_nr_and_not_polling(cti, tif)) {
+ if (tif == TIF_NEED_RESCHED)
+ smp_send_reschedule(cpu);
+ } else {
trace_sched_wake_idle_without_ipi(cpu);
+ }
+}
+
+void resched_curr(struct rq *rq)
+{
+ __resched_curr(rq, TIF_NEED_RESCHED);
+}
+
+#ifdef CONFIG_PREEMPT_DYNAMIC
+static DEFINE_STATIC_KEY_FALSE(sk_dynamic_preempt_lazy);
+static __always_inline bool dynamic_preempt_lazy(void)
+{
+ return static_branch_unlikely(&sk_dynamic_preempt_lazy);
+}
+#else
+static __always_inline bool dynamic_preempt_lazy(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_LAZY);
+}
+#endif
+
+static __always_inline int tif_need_resched_lazy(void)
+{
+ if (dynamic_preempt_lazy())
+ return TIF_NEED_RESCHED_LAZY;
+
+ return TIF_NEED_RESCHED;
+}
+
+void resched_curr_lazy(struct rq *rq)
+{
+ __resched_curr(rq, tif_need_resched_lazy());
}
void resched_cpu(int cpu)
@@ -1155,7 +1192,7 @@ static void wake_up_idle_cpu(int cpu)
* and testing of the above solutions didn't appear to report
* much benefits.
*/
- if (set_nr_and_not_polling(rq->idle))
+ if (set_nr_and_not_polling(task_thread_info(rq->idle), TIF_NEED_RESCHED))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
@@ -5537,6 +5574,10 @@ void sched_tick(void)
update_rq_clock(rq);
hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
+
+ if (dynamic_preempt_lazy() && tif_test_bit(TIF_NEED_RESCHED_LAZY))
+ resched_curr(rq);
+
curr->sched_class->task_tick(rq, curr, 0);
if (sched_feat(LATENCY_WARN))
resched_latency = cpu_resched_latency(rq);
@@ -7245,6 +7286,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* preempt_schedule <- NOP
* preempt_schedule_notrace <- NOP
* irqentry_exit_cond_resched <- NOP
+ * dynamic_preempt_lazy <- false
*
* VOLUNTARY:
* cond_resched <- __cond_resched
@@ -7252,6 +7294,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* preempt_schedule <- NOP
* preempt_schedule_notrace <- NOP
* irqentry_exit_cond_resched <- NOP
+ * dynamic_preempt_lazy <- false
*
* FULL:
* cond_resched <- RET0
@@ -7259,6 +7302,15 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* preempt_schedule <- preempt_schedule
* preempt_schedule_notrace <- preempt_schedule_notrace
* irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ * dynamic_preempt_lazy <- false
+ *
+ * LAZY:
+ * cond_resched <- RET0
+ * might_resched <- RET0
+ * preempt_schedule <- preempt_schedule
+ * preempt_schedule_notrace <- preempt_schedule_notrace
+ * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ * dynamic_preempt_lazy <- true
*/
enum {
@@ -7266,6 +7318,7 @@ enum {
preempt_dynamic_none,
preempt_dynamic_voluntary,
preempt_dynamic_full,
+ preempt_dynamic_lazy,
};
int preempt_dynamic_mode = preempt_dynamic_undefined;
@@ -7281,15 +7334,23 @@ int sched_dynamic_mode(const char *str)
if (!strcmp(str, "full"))
return preempt_dynamic_full;
+#ifdef CONFIG_ARCH_HAS_PREEMPT_LAZY
+ if (!strcmp(str, "lazy"))
+ return preempt_dynamic_lazy;
+#endif
+
return -EINVAL;
}
+#define preempt_dynamic_key_enable(f) static_key_enable(&sk_dynamic_##f.key)
+#define preempt_dynamic_key_disable(f) static_key_disable(&sk_dynamic_##f.key)
+
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
#define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-#define preempt_dynamic_enable(f) static_key_enable(&sk_dynamic_##f.key)
-#define preempt_dynamic_disable(f) static_key_disable(&sk_dynamic_##f.key)
+#define preempt_dynamic_enable(f) preempt_dynamic_key_enable(f)
+#define preempt_dynamic_disable(f) preempt_dynamic_key_disable(f)
#else
#error "Unsupported PREEMPT_DYNAMIC mechanism"
#endif
@@ -7309,6 +7370,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_disable(preempt_lazy);
switch (mode) {
case preempt_dynamic_none:
@@ -7318,6 +7380,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: none\n");
break;
@@ -7329,6 +7392,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: voluntary\n");
break;
@@ -7340,9 +7404,22 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: full\n");
break;
+
+ case preempt_dynamic_lazy:
+ if (!klp_override)
+ preempt_dynamic_disable(cond_resched);
+ preempt_dynamic_disable(might_resched);
+ preempt_dynamic_enable(preempt_schedule);
+ preempt_dynamic_enable(preempt_schedule_notrace);
+ preempt_dynamic_enable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_enable(preempt_lazy);
+ if (mode != preempt_dynamic_mode)
+ pr_info("Dynamic Preempt: lazy\n");
+ break;
}
preempt_dynamic_mode = mode;
@@ -7405,6 +7482,8 @@ static void __init preempt_dynamic_init(void)
sched_dynamic_update(preempt_dynamic_none);
} else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
sched_dynamic_update(preempt_dynamic_voluntary);
+ } else if (IS_ENABLED(CONFIG_PREEMPT_LAZY)) {
+ sched_dynamic_update(preempt_dynamic_lazy);
} else {
/* Default static call setting, nothing to do */
WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
@@ -7425,6 +7504,7 @@ static void __init preempt_dynamic_init(void)
PREEMPT_MODEL_ACCESSOR(none);
PREEMPT_MODEL_ACCESSOR(voluntary);
PREEMPT_MODEL_ACCESSOR(full);
+PREEMPT_MODEL_ACCESSOR(lazy);
#else /* !CONFIG_PREEMPT_DYNAMIC: */
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1bc24410ae501..87309cf247c68 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -245,7 +245,7 @@ static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
static int sched_dynamic_show(struct seq_file *m, void *v)
{
static const char * preempt_modes[] = {
- "none", "voluntary", "full"
+ "none", "voluntary", "full", "lazy",
};
int i;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b5d50dbc79dc..71b4112cadde0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1007,7 +1007,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
* The task has consumed its request, reschedule.
*/
if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
+ resched_curr_lazy(rq_of(cfs_rq));
clear_buddies(cfs_rq, se);
}
}
@@ -8615,7 +8615,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
return;
preempt:
- resched_curr(rq);
+ resched_curr_lazy(rq);
}
static struct task_struct *pick_task_fair(struct rq *rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 041d8e00a1568..48a4617a5b28b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2494,6 +2494,7 @@ extern void init_sched_fair_class(void);
extern void reweight_task(struct task_struct *p, int prio);
extern void resched_curr(struct rq *rq);
+extern void resched_curr_lazy(struct rq *rq);
extern void resched_cpu(int cpu);
extern struct rt_bandwidth def_rt_bandwidth;
* Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO
2024-06-06 11:51 ` Peter Zijlstra
@ 2024-06-06 15:11 ` Ankur Arora
2024-06-06 17:32 ` Peter Zijlstra
0 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-06-06 15:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Ingo Molnar, Vincent Guittot
Peter Zijlstra <peterz@infradead.org> writes:
> On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
>>
>> Peter Zijlstra <peterz@infradead.org> writes:
>>
>> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
>> >> Reuse sched_dynamic_update() and related logic to enable choosing
>> >> the preemption model at boot or runtime for PREEMPT_AUTO.
>> >>
>> >> The interface is identical to PREEMPT_DYNAMIC.
>> >
>> > Colour me confused, why?!? What are you doing and why aren't just just
>> > adding AUTO to the existing DYNAMIC thing?
>>
>> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
>> of the static_call/static_key stuff so I'm not sure how that would work.
>
> *sigh*... see the below, seems to work.
Sorry, didn't mean for you to have to do all that work to prove the
point.
I phrased it badly. I do understand how lazy can be folded in as
you do here:
> + case preempt_dynamic_lazy:
> + if (!klp_override)
> + preempt_dynamic_disable(cond_resched);
> + preempt_dynamic_disable(might_resched);
> + preempt_dynamic_enable(preempt_schedule);
> + preempt_dynamic_enable(preempt_schedule_notrace);
> + preempt_dynamic_enable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_enable(preempt_lazy);
> + if (mode != preempt_dynamic_mode)
> + pr_info("Dynamic Preempt: lazy\n");
> + break;
> }
But, if the long term goal (at least as I understand it) is to get rid
of cond_resched() -- to allow optimizations that needing to call cond_resched()
makes impossible -- does it make sense to pull all of these together?
Say, eventually preempt_dynamic_lazy and preempt_dynamic_full are the
only two models left. Then we will have (modulo figuring out how to
switch over klp from cond_resched() to a different unwinding technique):
static void __sched_dynamic_update(int mode)
{
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);
switch (mode) {
case preempt_dynamic_full:
preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("%s: full\n", PREEMPT_MODE);
break;
case preempt_dynamic_lazy:
preempt_dynamic_key_enable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: lazy\n");
break;
}
preempt_dynamic_mode = mode;
}
Which is pretty similar to what the PREEMPT_AUTO code was doing.
Thanks
Ankur
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/thread_info.h | 6 +-
> include/linux/entry-common.h | 3 +-
> include/linux/entry-kvm.h | 5 +-
> include/linux/sched.h | 10 +++-
> include/linux/thread_info.h | 21 +++++--
> kernel/Kconfig.preempt | 11 ++++
> kernel/entry/common.c | 2 +-
> kernel/entry/kvm.c | 4 +-
> kernel/sched/core.c | 110 ++++++++++++++++++++++++++++++++-----
> kernel/sched/debug.c | 2 +-
> kernel/sched/fair.c | 4 +-
> kernel/sched/sched.h | 1 +
> 13 files changed, 148 insertions(+), 32 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index e8837116704ce..61f86b69524d7 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -91,6 +91,7 @@ config X86
> select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
> select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
> select ARCH_HAS_PMEM_API if X86_64
> + select ARCH_HAS_PREEMPT_LAZY
> select ARCH_HAS_PTE_DEVMAP if X86_64
> select ARCH_HAS_PTE_SPECIAL
> select ARCH_HAS_HW_PTE_YOUNG
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 12da7dfd5ef13..75bb390f7baf5 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -87,8 +87,9 @@ struct thread_info {
> #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
> #define TIF_SIGPENDING 2 /* signal pending */
> #define TIF_NEED_RESCHED 3 /* rescheduling necessary */
> -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
> -#define TIF_SSBD 5 /* Speculative store bypass disable */
> +#define TIF_NEED_RESCHED_LAZY 4 /* rescheduling necessary */
> +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
> +#define TIF_SSBD 6 /* Speculative store bypass disable */
> #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
> #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
> #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
> @@ -110,6 +111,7 @@ struct thread_info {
> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
> +#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
> #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
> #define _TIF_SSBD (1 << TIF_SSBD)
> #define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index b0fb775a600d9..e66c8a7c113f4 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -64,7 +64,8 @@
>
> #define EXIT_TO_USER_MODE_WORK \
> (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
> - _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
> + _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \
> + _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
> ARCH_EXIT_TO_USER_MODE_WORK)
>
> /**
> diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
> index 6813171afccb2..16149f6625e48 100644
> --- a/include/linux/entry-kvm.h
> +++ b/include/linux/entry-kvm.h
> @@ -17,8 +17,9 @@
> #endif
>
> #define XFER_TO_GUEST_MODE_WORK \
> - (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
> + (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_SIGPENDING | \
> + _TIF_NOTIFY_SIGNAL | _TIF_NOTIFY_RESUME | \
> + ARCH_XFER_TO_GUEST_MODE_WORK)
>
> struct kvm_vcpu;
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 7635045b2395c..5900d84e08b3c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1968,7 +1968,8 @@ static inline void set_tsk_need_resched(struct task_struct *tsk)
>
> static inline void clear_tsk_need_resched(struct task_struct *tsk)
> {
> - clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
> + atomic_long_andnot(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY,
> + (atomic_long_t *)&task_thread_info(tsk)->flags);
> }
>
> static inline int test_tsk_need_resched(struct task_struct *tsk)
> @@ -2074,6 +2075,7 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
> extern bool preempt_model_none(void);
> extern bool preempt_model_voluntary(void);
> extern bool preempt_model_full(void);
> +extern bool preempt_model_lazy(void);
>
> #else
>
> @@ -2089,6 +2091,10 @@ static inline bool preempt_model_full(void)
> {
> return IS_ENABLED(CONFIG_PREEMPT);
> }
> +static inline bool preempt_model_lazy(void)
> +{
> + return IS_ENABLED(CONFIG_PREEMPT_LAZY);
> +}
>
> #endif
>
> @@ -2107,7 +2113,7 @@ static inline bool preempt_model_rt(void)
> */
> static inline bool preempt_model_preemptible(void)
> {
> - return preempt_model_full() || preempt_model_rt();
> + return preempt_model_full() || preempt_model_lazy() || preempt_model_rt();
> }
>
> static __always_inline bool need_resched(void)
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 9ea0b28068f49..cf2446c9c30d4 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -59,6 +59,14 @@ enum syscall_work_bit {
>
> #include <asm/thread_info.h>
>
> +#ifndef TIF_NEED_RESCHED_LAZY
> +#ifdef CONFIG_ARCH_HAS_PREEMPT_LAZY
> +#error Inconsistent PREEMPT_LAZY
> +#endif
> +#define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
> +#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
> +#endif
> +
> #ifdef __KERNEL__
>
> #ifndef arch_set_restart_data
> @@ -179,22 +187,27 @@ static __always_inline unsigned long read_ti_thread_flags(struct thread_info *ti
>
> #ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
>
> -static __always_inline bool tif_need_resched(void)
> +static __always_inline bool tif_test_bit(int bit)
> {
> - return arch_test_bit(TIF_NEED_RESCHED,
> + return arch_test_bit(bit,
> (unsigned long *)(&current_thread_info()->flags));
> }
>
> #else
>
> -static __always_inline bool tif_need_resched(void)
> +static __always_inline bool tif_test_bit(int bit)
> {
> - return test_bit(TIF_NEED_RESCHED,
> + return test_bit(bit,
> (unsigned long *)(&current_thread_info()->flags));
> }
>
> #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
>
> +static __always_inline bool tif_need_resched(void)
> +{
> + return tif_test_bit(TIF_NEED_RESCHED);
> +}
> +
> #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
> static inline int arch_within_stack_frames(const void * const stack,
> const void * const stackend,
> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
> index c2f1fd95a8214..1a2e3849e3e5f 100644
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -11,6 +11,9 @@ config PREEMPT_BUILD
> select PREEMPTION
> select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
>
> +config ARCH_HAS_PREEMPT_LAZY
> + bool
> +
> choice
> prompt "Preemption Model"
> default PREEMPT_NONE
> @@ -67,6 +70,14 @@ config PREEMPT
> embedded system with latency requirements in the milliseconds
> range.
>
> +config PREEMPT_LAZY
> + bool "Scheduler controlled preemption model"
> + depends on !ARCH_NO_PREEMPT
> + depends on ARCH_HAS_PREEMPT_LAZY
> + select PREEMPT_BUILD
> + help
> + Hamsters in your brain...
> +
> config PREEMPT_RT
> bool "Fully Preemptible Kernel (Real-Time)"
> depends on EXPERT && ARCH_SUPPORTS_RT
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 90843cc385880..bcb23c866425e 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -98,7 +98,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>
> local_irq_enable_exit_to_user(ti_work);
>
> - if (ti_work & _TIF_NEED_RESCHED)
> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> schedule();
>
> if (ti_work & _TIF_UPROBE)
> diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
> index 2e0f75bcb7fd1..8485f63863afc 100644
> --- a/kernel/entry/kvm.c
> +++ b/kernel/entry/kvm.c
> @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
> return -EINTR;
> }
>
> - if (ti_work & _TIF_NEED_RESCHED)
> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> schedule();
>
> if (ti_work & _TIF_NOTIFY_RESUME)
> @@ -24,7 +24,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
> return ret;
>
> ti_work = read_thread_flags();
> - } while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
> + } while (ti_work & XFER_TO_GUEST_MODE_WORK);
> return 0;
> }
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 965e6464e68e9..c32de809283cf 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -904,10 +904,9 @@ static inline void hrtick_rq_init(struct rq *rq)
> * this avoids any races wrt polling state changes and thereby avoids
> * spurious IPIs.
> */
> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> +static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
> {
> - struct thread_info *ti = task_thread_info(p);
> - return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
> + return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG);
> }
>
> /*
> @@ -932,9 +931,9 @@ static bool set_nr_if_polling(struct task_struct *p)
> }
>
> #else
> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> +static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
> {
> - set_tsk_need_resched(p);
> + atomic_long_or(1 << tif, (atomic_long_t *)&ti->flags);
> return true;
> }
>
> @@ -1039,28 +1038,66 @@ void wake_up_q(struct wake_q_head *head)
> * might also involve a cross-CPU call to trigger the scheduler on
> * the target CPU.
> */
> -void resched_curr(struct rq *rq)
> +static void __resched_curr(struct rq *rq, int tif)
> {
> struct task_struct *curr = rq->curr;
> + struct thread_info *cti = task_thread_info(curr);
> int cpu;
>
> lockdep_assert_rq_held(rq);
>
> - if (test_tsk_need_resched(curr))
> + if (is_idle_task(curr) && tif == TIF_NEED_RESCHED_LAZY)
> + tif = TIF_NEED_RESCHED;
> +
> + if (cti->flags & ((1 << tif) | _TIF_NEED_RESCHED))
> return;
>
> cpu = cpu_of(rq);
>
> if (cpu == smp_processor_id()) {
> - set_tsk_need_resched(curr);
> - set_preempt_need_resched();
> + set_ti_thread_flag(cti, tif);
> + if (tif == TIF_NEED_RESCHED)
> + set_preempt_need_resched();
> return;
> }
>
> - if (set_nr_and_not_polling(curr))
> - smp_send_reschedule(cpu);
> - else
> + if (set_nr_and_not_polling(cti, tif)) {
> + if (tif == TIF_NEED_RESCHED)
> + smp_send_reschedule(cpu);
> + } else {
> trace_sched_wake_idle_without_ipi(cpu);
> + }
> +}
> +
> +void resched_curr(struct rq *rq)
> +{
> + __resched_curr(rq, TIF_NEED_RESCHED);
> +}
> +
> +#ifdef CONFIG_PREEMPT_DYNAMIC
> +static DEFINE_STATIC_KEY_FALSE(sk_dynamic_preempt_lazy);
> +static __always_inline bool dynamic_preempt_lazy(void)
> +{
> + return static_branch_unlikely(&sk_dynamic_preempt_lazy);
> +}
> +#else
> +static __always_inline bool dynamic_preempt_lazy(void)
> +{
> + return IS_ENABLED(CONFIG_PREEMPT_LAZY);
> +}
> +#endif
> +
> +static __always_inline int tif_need_resched_lazy(void)
> +{
> + if (dynamic_preempt_lazy())
> + return TIF_NEED_RESCHED_LAZY;
> +
> + return TIF_NEED_RESCHED;
> +}
> +
> +void resched_curr_lazy(struct rq *rq)
> +{
> + __resched_curr(rq, tif_need_resched_lazy());
> }
>
> void resched_cpu(int cpu)
> @@ -1155,7 +1192,7 @@ static void wake_up_idle_cpu(int cpu)
> * and testing of the above solutions didn't appear to report
> * much benefits.
> */
> - if (set_nr_and_not_polling(rq->idle))
> + if (set_nr_and_not_polling(task_thread_info(rq->idle), TIF_NEED_RESCHED))
> smp_send_reschedule(cpu);
> else
> trace_sched_wake_idle_without_ipi(cpu);
> @@ -5537,6 +5574,10 @@ void sched_tick(void)
> update_rq_clock(rq);
> hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
> +
> + if (dynamic_preempt_lazy() && tif_test_bit(TIF_NEED_RESCHED_LAZY))
> + resched_curr(rq);
> +
> curr->sched_class->task_tick(rq, curr, 0);
> if (sched_feat(LATENCY_WARN))
> resched_latency = cpu_resched_latency(rq);
> @@ -7245,6 +7286,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
> * preempt_schedule <- NOP
> * preempt_schedule_notrace <- NOP
> * irqentry_exit_cond_resched <- NOP
> + * dynamic_preempt_lazy <- false
> *
> * VOLUNTARY:
> * cond_resched <- __cond_resched
> @@ -7252,6 +7294,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
> * preempt_schedule <- NOP
> * preempt_schedule_notrace <- NOP
> * irqentry_exit_cond_resched <- NOP
> + * dynamic_preempt_lazy <- false
> *
> * FULL:
> * cond_resched <- RET0
> @@ -7259,6 +7302,15 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
> * preempt_schedule <- preempt_schedule
> * preempt_schedule_notrace <- preempt_schedule_notrace
> * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> + * dynamic_preempt_lazy <- false
> + *
> + * LAZY:
> + * cond_resched <- RET0
> + * might_resched <- RET0
> + * preempt_schedule <- preempt_schedule
> + * preempt_schedule_notrace <- preempt_schedule_notrace
> + * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> + * dynamic_preempt_lazy <- true
> */
>
> enum {
> @@ -7266,6 +7318,7 @@ enum {
> preempt_dynamic_none,
> preempt_dynamic_voluntary,
> preempt_dynamic_full,
> + preempt_dynamic_lazy,
> };
>
> int preempt_dynamic_mode = preempt_dynamic_undefined;
> @@ -7281,15 +7334,23 @@ int sched_dynamic_mode(const char *str)
> if (!strcmp(str, "full"))
> return preempt_dynamic_full;
>
> +#ifdef CONFIG_ARCH_HAS_PREEMPT_LAZY
> + if (!strcmp(str, "lazy"))
> + return preempt_dynamic_lazy;
> +#endif
> +
> return -EINVAL;
> }
>
> +#define preempt_dynamic_key_enable(f) static_key_enable(&sk_dynamic_##f.key)
> +#define preempt_dynamic_key_disable(f) static_key_disable(&sk_dynamic_##f.key)
> +
> #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> #define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
> #define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
> #elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
> -#define preempt_dynamic_enable(f) static_key_enable(&sk_dynamic_##f.key)
> -#define preempt_dynamic_disable(f) static_key_disable(&sk_dynamic_##f.key)
> +#define preempt_dynamic_enable(f) preempt_dynamic_key_enable(f)
> +#define preempt_dynamic_disable(f) preempt_dynamic_key_disable(f)
> #else
> #error "Unsupported PREEMPT_DYNAMIC mechanism"
> #endif
> @@ -7309,6 +7370,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_enable(preempt_schedule);
> preempt_dynamic_enable(preempt_schedule_notrace);
> preempt_dynamic_enable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_disable(preempt_lazy);
>
> switch (mode) {
> case preempt_dynamic_none:
> @@ -7318,6 +7380,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_disable(preempt_schedule);
> preempt_dynamic_disable(preempt_schedule_notrace);
> preempt_dynamic_disable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_disable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("Dynamic Preempt: none\n");
> break;
> @@ -7329,6 +7392,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_disable(preempt_schedule);
> preempt_dynamic_disable(preempt_schedule_notrace);
> preempt_dynamic_disable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_disable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("Dynamic Preempt: voluntary\n");
> break;
> @@ -7340,9 +7404,22 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_enable(preempt_schedule);
> preempt_dynamic_enable(preempt_schedule_notrace);
> preempt_dynamic_enable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_disable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("Dynamic Preempt: full\n");
> break;
> +
> + case preempt_dynamic_lazy:
> + if (!klp_override)
> + preempt_dynamic_disable(cond_resched);
> + preempt_dynamic_disable(might_resched);
> + preempt_dynamic_enable(preempt_schedule);
> + preempt_dynamic_enable(preempt_schedule_notrace);
> + preempt_dynamic_enable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_enable(preempt_lazy);
> + if (mode != preempt_dynamic_mode)
> + pr_info("Dynamic Preempt: lazy\n");
> + break;
> }
>
> preempt_dynamic_mode = mode;
> @@ -7405,6 +7482,8 @@ static void __init preempt_dynamic_init(void)
> sched_dynamic_update(preempt_dynamic_none);
> } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
> sched_dynamic_update(preempt_dynamic_voluntary);
> + } else if (IS_ENABLED(CONFIG_PREEMPT_LAZY)) {
> + sched_dynamic_update(preempt_dynamic_lazy);
> } else {
> /* Default static call setting, nothing to do */
> WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
> @@ -7425,6 +7504,7 @@ static void __init preempt_dynamic_init(void)
> PREEMPT_MODEL_ACCESSOR(none);
> PREEMPT_MODEL_ACCESSOR(voluntary);
> PREEMPT_MODEL_ACCESSOR(full);
> +PREEMPT_MODEL_ACCESSOR(lazy);
>
> #else /* !CONFIG_PREEMPT_DYNAMIC: */
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 1bc24410ae501..87309cf247c68 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -245,7 +245,7 @@ static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
> static int sched_dynamic_show(struct seq_file *m, void *v)
> {
> static const char * preempt_modes[] = {
> - "none", "voluntary", "full"
> + "none", "voluntary", "full", "lazy",
> };
> int i;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5b5d50dbc79dc..71b4112cadde0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1007,7 +1007,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> * The task has consumed its request, reschedule.
> */
> if (cfs_rq->nr_running > 1) {
> - resched_curr(rq_of(cfs_rq));
> + resched_curr_lazy(rq_of(cfs_rq));
> clear_buddies(cfs_rq, se);
> }
> }
> @@ -8615,7 +8615,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> return;
>
> preempt:
> - resched_curr(rq);
> + resched_curr_lazy(rq);
> }
>
> static struct task_struct *pick_task_fair(struct rq *rq)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 041d8e00a1568..48a4617a5b28b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2494,6 +2494,7 @@ extern void init_sched_fair_class(void);
> extern void reweight_task(struct task_struct *p, int prio);
>
> extern void resched_curr(struct rq *rq);
> +extern void resched_curr_lazy(struct rq *rq);
> extern void resched_cpu(int cpu);
>
> extern struct rt_bandwidth def_rt_bandwidth;
--
ankur
^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO
2024-06-06 15:11 ` Ankur Arora
@ 2024-06-06 17:32 ` Peter Zijlstra
2024-06-09 0:46 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Peter Zijlstra @ 2024-06-06 17:32 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot
On Thu, Jun 06, 2024 at 08:11:41AM -0700, Ankur Arora wrote:
>
> Peter Zijlstra <peterz@infradead.org> writes:
>
> > On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
> >>
> >> Peter Zijlstra <peterz@infradead.org> writes:
> >>
> >> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
> >> >> Reuse sched_dynamic_update() and related logic to enable choosing
> >> >> the preemption model at boot or runtime for PREEMPT_AUTO.
> >> >>
> >> >> The interface is identical to PREEMPT_DYNAMIC.
> >> >
> >> > Colour me confused, why?!? What are you doing and why aren't just just
> >> > adding AUTO to the existing DYNAMIC thing?
> >>
> >> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
> >> of the static_call/static_key stuff so I'm not sure how that would work.
> >
> > *sigh*... see the below, seems to work.
>
> Sorry, didn't mean for you to have to do all that work to prove the
> point.
Well, for a large part it was needed for me to figure out what your
patches were actually doing anyway. Peel away all the layers and this is
what remains.
> I phrased it badly. I do understand how lazy can be folded in as
> you do here:
>
> > + case preempt_dynamic_lazy:
> > + if (!klp_override)
> > + preempt_dynamic_disable(cond_resched);
> > + preempt_dynamic_disable(might_resched);
> > + preempt_dynamic_enable(preempt_schedule);
> > + preempt_dynamic_enable(preempt_schedule_notrace);
> > + preempt_dynamic_enable(irqentry_exit_cond_resched);
> > + preempt_dynamic_key_enable(preempt_lazy);
> > + if (mode != preempt_dynamic_mode)
> > + pr_info("Dynamic Preempt: lazy\n");
> > + break;
> > }
>
> But, if the long term goal (at least as I understand it) is to get rid
> of cond_resched() -- to allow optimizations that needing to call cond_resched()
> makes impossible -- does it make sense to pull all of these together?
It certainly doesn't make sense to add yet another configurable thing. We
have one, so yes add it here.
> Say, eventually preempt_dynamic_lazy and preempt_dynamic_full are the
> only two models left. Then we will have (modulo figuring out how to
> switch over klp from cond_resched() to a different unwinding technique):
>
> static void __sched_dynamic_update(int mode)
> {
> preempt_dynamic_enable(preempt_schedule);
> preempt_dynamic_enable(preempt_schedule_notrace);
> preempt_dynamic_enable(irqentry_exit_cond_resched);
>
> switch (mode) {
> case preempt_dynamic_full:
> preempt_dynamic_key_disable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("%s: full\n", PREEMPT_MODE);
> break;
>
> case preempt_dynamic_lazy:
> preempt_dynamic_key_enable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("Dynamic Preempt: lazy\n");
> break;
> }
>
> preempt_dynamic_mode = mode;
> }
>
> Which is pretty similar to what the PREEMPT_AUTO code was doing.
Right, but without duplicating all that stuff in the interim.
* Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO
2024-06-06 17:32 ` Peter Zijlstra
@ 2024-06-09 0:46 ` Ankur Arora
2024-06-12 18:10 ` Paul E. McKenney
0 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-06-09 0:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Ingo Molnar, Vincent Guittot
Peter Zijlstra <peterz@infradead.org> writes:
> On Thu, Jun 06, 2024 at 08:11:41AM -0700, Ankur Arora wrote:
>>
>> Peter Zijlstra <peterz@infradead.org> writes:
>>
>> > On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
>> >>
>> >> Peter Zijlstra <peterz@infradead.org> writes:
>> >>
>> >> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
>> >> >> Reuse sched_dynamic_update() and related logic to enable choosing
>> >> >> the preemption model at boot or runtime for PREEMPT_AUTO.
>> >> >>
>> >> >> The interface is identical to PREEMPT_DYNAMIC.
>> >> >
>> >> > Colour me confused, why?!? What are you doing and why aren't just just
>> >> > adding AUTO to the existing DYNAMIC thing?
>> >>
>> >> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
>> >> of the static_call/static_key stuff so I'm not sure how that would work.
>> >
>> > *sigh*... see the below, seems to work.
>>
>> Sorry, didn't mean for you to have to do all that work to prove the
>> point.
>
> Well, for a large part it was needed for me to figure out what your
> patches were actually doing anyway. Peel away all the layers and this is
> what remains.
>
>> I phrased it badly. I do understand how lazy can be folded in as
>> you do here:
>>
>> > + case preempt_dynamic_lazy:
>> > + if (!klp_override)
>> > + preempt_dynamic_disable(cond_resched);
>> > + preempt_dynamic_disable(might_resched);
>> > + preempt_dynamic_enable(preempt_schedule);
>> > + preempt_dynamic_enable(preempt_schedule_notrace);
>> > + preempt_dynamic_enable(irqentry_exit_cond_resched);
>> > + preempt_dynamic_key_enable(preempt_lazy);
>> > + if (mode != preempt_dynamic_mode)
>> > + pr_info("Dynamic Preempt: lazy\n");
>> > + break;
>> > }
>>
>> But, if the long term goal (at least as I understand it) is to get rid
>> of cond_resched() -- to allow optimizations that needing to call cond_resched()
>> makes impossible -- does it make sense to pull all of these together?
>
> It certainly doesn't make sense to add yet another configurable thing. We
> have one, so yes add it here.
>
>> Say, eventually preempt_dynamic_lazy and preempt_dynamic_full are the
>> only two models left. Then we will have (modulo figuring out how to
>> switch over klp from cond_resched() to a different unwinding technique):
>>
>> static void __sched_dynamic_update(int mode)
>> {
>> preempt_dynamic_enable(preempt_schedule);
>> preempt_dynamic_enable(preempt_schedule_notrace);
>> preempt_dynamic_enable(irqentry_exit_cond_resched);
>>
>> switch (mode) {
>> case preempt_dynamic_full:
>> preempt_dynamic_key_disable(preempt_lazy);
>> if (mode != preempt_dynamic_mode)
>> pr_info("%s: full\n", PREEMPT_MODE);
>> break;
>>
>> case preempt_dynamic_lazy:
>> preempt_dynamic_key_enable(preempt_lazy);
>> if (mode != preempt_dynamic_mode)
>> pr_info("Dynamic Preempt: lazy\n");
>> break;
>> }
>>
>> preempt_dynamic_mode = mode;
>> }
>>
>> Which is pretty similar to what the PREEMPT_AUTO code was doing.
>
> Right, but without duplicating all that stuff in the interim.
Yeah, that makes sense. Joel had suggested something on these lines
earlier [1], to which I was resistant.
However, the duplication (and the fact that the voluntary model
was quite thin) should have told me that (AUTO, preempt=voluntary)
should just be folded under PREEMPT_DYNAMIC.
I'll rework the series to do that.
That should also simplify the RCU related choices, which I think Paul
will like. Given that the lazy model is meant to eventually replace
none/voluntary, the PREEMPT_RCU configuration can just be:
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -18,7 +18,7 @@ config TREE_RCU
config PREEMPT_RCU
bool
- default y if PREEMPTION
+ default y if PREEMPTION && !PREEMPT_LAZY
Or, maybe we should instead have this:
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -18,7 +18,7 @@ config TREE_RCU
config PREEMPT_RCU
bool
- default y if PREEMPTION
+ default y if PREEMPT || PREEMPT_RT
select TREE_RCU
Though this would be a change in behaviour for current PREEMPT_DYNAMIC
users.
[1] https://lore.kernel.org/lkml/fd48ea5c-bc74-4914-a621-d12c9741c014@joelfernandes.org/
Thanks
--
ankur
* Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO
2024-06-09 0:46 ` Ankur Arora
@ 2024-06-12 18:10 ` Paul E. McKenney
0 siblings, 0 replies; 95+ messages in thread
From: Paul E. McKenney @ 2024-06-12 18:10 UTC (permalink / raw)
To: Ankur Arora
Cc: Peter Zijlstra, linux-kernel, tglx, torvalds, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Ingo Molnar, Vincent Guittot
On Sat, Jun 08, 2024 at 05:46:26PM -0700, Ankur Arora wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> > On Thu, Jun 06, 2024 at 08:11:41AM -0700, Ankur Arora wrote:
> >> Peter Zijlstra <peterz@infradead.org> writes:
> >> > On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
> >> >> Peter Zijlstra <peterz@infradead.org> writes:
> >> >> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
> >> >> >> Reuse sched_dynamic_update() and related logic to enable choosing
> >> >> >> the preemption model at boot or runtime for PREEMPT_AUTO.
> >> >> >>
> >> >> >> The interface is identical to PREEMPT_DYNAMIC.
> >> >> >
> >> >> > Colour me confused, why?!? What are you doing and why aren't just just
> >> >> > adding AUTO to the existing DYNAMIC thing?
> >> >>
> >> >> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
> >> >> of the static_call/static_key stuff so I'm not sure how that would work.
> >> >
> >> > *sigh*... see the below, seems to work.
> >>
> >> Sorry, didn't mean for you to have to do all that work to prove the
> >> point.
> >
> > Well, for a large part it was needed for me to figure out what your
> > patches were actually doing anyway. Peel away all the layers and this is
> > what remains.
> >
> >> I phrased it badly. I do understand how lazy can be folded in as
> >> you do here:
> >>
> >> > + case preempt_dynamic_lazy:
> >> > + if (!klp_override)
> >> > + preempt_dynamic_disable(cond_resched);
> >> > + preempt_dynamic_disable(might_resched);
> >> > + preempt_dynamic_enable(preempt_schedule);
> >> > + preempt_dynamic_enable(preempt_schedule_notrace);
> >> > + preempt_dynamic_enable(irqentry_exit_cond_resched);
> >> > + preempt_dynamic_key_enable(preempt_lazy);
> >> > + if (mode != preempt_dynamic_mode)
> >> > + pr_info("Dynamic Preempt: lazy\n");
> >> > + break;
> >> > }
> >>
> >> But, if the long term goal (at least as I understand it) is to get rid
> >> of cond_resched() -- to allow optimizations that needing to call cond_resched()
> >> makes impossible -- does it make sense to pull all of these together?
> >
> > It certainly doesn't make sense to add yet another configurable thing. We
> > have one, so yes add it here.
> >
> >> Say, eventually preempt_dynamic_lazy and preempt_dynamic_full are the
> >> only two models left. Then we will have (modulo figuring out how to
> >> switch over klp from cond_resched() to a different unwinding technique):
> >>
> >> static void __sched_dynamic_update(int mode)
> >> {
> >> preempt_dynamic_enable(preempt_schedule);
> >> preempt_dynamic_enable(preempt_schedule_notrace);
> >> preempt_dynamic_enable(irqentry_exit_cond_resched);
> >>
> >> switch (mode) {
> >> case preempt_dynamic_full:
> >> preempt_dynamic_key_disable(preempt_lazy);
> >> if (mode != preempt_dynamic_mode)
> >> pr_info("%s: full\n", PREEMPT_MODE);
> >> break;
> >>
> >> case preempt_dynamic_lazy:
> >> preempt_dynamic_key_enable(preempt_lazy);
> >> if (mode != preempt_dynamic_mode)
> >> pr_info("Dynamic Preempt: lazy\n");
> >> break;
> >> }
> >>
> >> preempt_dynamic_mode = mode;
> >> }
> >>
> >> Which is pretty similar to what the PREEMPT_AUTO code was doing.
> >
> > Right, but without duplicating all that stuff in the interim.
>
> Yeah, that makes sense. Joel had suggested something on these lines
> earlier [1], to which I was resistant.
>
> However, the duplication (and the fact that the voluntary model
> was quite thin) should have told me that (AUTO, preempt=voluntary)
> should just be folded under PREEMPT_DYNAMIC.
>
> I'll rework the series to do that.
>
> That should also simplify the RCU related choices, which I think Paul
> will like. Given that the lazy model is meant to eventually replace
> none/voluntary, the PREEMPT_RCU configuration can just be:
>
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -18,7 +18,7 @@ config TREE_RCU
>
> config PREEMPT_RCU
> bool
> - default y if PREEMPTION
> + default y if PREEMPTION && !PREEMPT_LAZY
Given that PREEMPT_DYNAMIC selects PREEMPT_BUILD which in turn selects
PREEMPTION, this should work.
> Or, maybe we should instead have this:
>
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -18,7 +18,7 @@ config TREE_RCU
>
> config PREEMPT_RCU
> bool
> - default y if PREEMPTION
> + default y if PREEMPT || PREEMPT_RT
> select TREE_RCU
>
> Though this would be a change in behaviour for current PREEMPT_DYNAMIC
> users.
Which I believe to be a no-go. I believe that PREEMPT_DYNAMIC users
really need their preemptible kernels to include preemptible RCU.
If PREEMPT_LAZY causes PREEMPT_DYNAMIC non-preemptible kernels to become
lazily preemptible, that is a topic to discuss with PREEMPT_DYNAMIC users.
On the other hand, if PREEMPT_LAZY does not cause PREEMPT_DYNAMIC
kernels to become lazily preemptible, then I would expect there to be
hard questions about removing cond_resched() and might_sleep(), or,
for that matter changing their semantics. Which I again must leave
to PREEMPT_DYNAMIC users.
Thanx, Paul
> [1] https://lore.kernel.org/lkml/fd48ea5c-bc74-4914-a621-d12c9741c014@joelfernandes.org/
>
> Thanks
> --
> ankur
* [PATCH v2 14/35] rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (12 preceding siblings ...)
2024-05-28 0:34 ` [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 15/35] rcu: fix header guard for rcu_all_qs() Ankur Arora
` (22 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora
Under PREEMPT_AUTO, CONFIG_PREEMPTION is enabled, and much like
PREEMPT_DYNAMIC, PREEMPT_AUTO also allows for dynamic switching
of preemption models.
The RCU model, however, is fixed at compile time.
Now, RCU typically selects PREEMPT_RCU if CONFIG_PREEMPTION is enabled.
Given the trade-offs between PREEMPT_RCU=y and PREEMPT_RCU=n, some
configurations might prefer the stronger forward-progress guarantees
of PREEMPT_RCU=n.
Accordingly, select PREEMPT_RCU=y only if the user selects
CONFIG_PREEMPT at compile time.
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/rcu/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index e7d2dd267593..9dedb70ac2e6 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -18,7 +18,7 @@ config TREE_RCU
config PREEMPT_RCU
bool
- default y if PREEMPTION
+ default y if (PREEMPT || PREEMPT_DYNAMIC || PREEMPT_RT)
select TREE_RCU
help
This option selects the RCU implementation that is
--
2.31.1
* [PATCH v2 15/35] rcu: fix header guard for rcu_all_qs()
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (13 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 14/35] rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full Ankur Arora
` (21 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora
rcu_all_qs() is defined for !CONFIG_PREEMPT_RCU but the declaration
is conditioned on CONFIG_PREEMPTION.
With CONFIG_PREEMPT_AUTO, you can have configurations where
CONFIG_PREEMPTION is enabled without also enabling CONFIG_PREEMPT_RCU.
Decouple the two.
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Changelog:
Might be going away
---
include/linux/rcutree.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 254244202ea9..be2b77c81a6d 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -103,7 +103,7 @@ extern int rcu_scheduler_active;
void rcu_end_inkernel_boot(void);
bool rcu_inkernel_boot_has_ended(void);
bool rcu_is_watching(void);
-#ifndef CONFIG_PREEMPTION
+#ifndef CONFIG_PREEMPT_RCU
void rcu_all_qs(void);
#endif
--
2.31.1
* [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (14 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 15/35] rcu: fix header guard for rcu_all_qs() Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-29 8:14 ` Peter Zijlstra
2024-05-28 0:35 ` [PATCH v2 17/35] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y Ankur Arora
` (20 subsequent siblings)
36 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
works at cross purposes: the RCU read side critical sections disable
preemption, while preempt=full schedules eagerly to minimize
latency.
Warn if the user is switching to full preemption with PREEMPT_RCU=n.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/core.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d7804e29182d..df8e333f2d8b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
break;
case preempt_dynamic_full:
+ if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
+ pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
+ PREEMPT_MODE);
+
preempt_dynamic_mode = preempt_dynamic_undefined;
break;
}
--
2.31.1
* Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
2024-05-28 0:35 ` [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full Ankur Arora
@ 2024-05-29 8:14 ` Peter Zijlstra
2024-05-30 18:32 ` Paul E. McKenney
2024-05-30 23:04 ` Ankur Arora
0 siblings, 2 replies; 95+ messages in thread
From: Peter Zijlstra @ 2024-05-29 8:14 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot
On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
> The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
> works at cross purposes: the RCU read side critical sections disable
> preemption, while preempt=full schedules eagerly to minimize
> latency.
>
> Warn if the user is switching to full preemption with PREEMPT_RCU=n.
>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> kernel/sched/core.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d7804e29182d..df8e333f2d8b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
> break;
>
> case preempt_dynamic_full:
> + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
> + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
> + PREEMPT_MODE);
> +
Yeah, so I don't believe this is a viable strategy.
Firstly, none of these RCU patches are actually about the whole LAZY
preempt scheme, they apply equally well (arguably better) to the
existing PREEMPT_DYNAMIC thing.
Secondly, esp. with the LAZY thing, you are effectively running FULL at
all times. It's just that some of the preemptions, typically those of
the normal scheduling class are somewhat delayed. However RT/DL classes
are still insta preempt.
Meaning that if you run anything in the realtime classes you're running
a fully preemptible kernel. As such, RCU had better be able to deal with
it.
So no, I don't believe this is right.
* Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
2024-05-29 8:14 ` Peter Zijlstra
@ 2024-05-30 18:32 ` Paul E. McKenney
2024-05-30 23:05 ` Ankur Arora
2024-05-30 23:04 ` Ankur Arora
1 sibling, 1 reply; 95+ messages in thread
From: Paul E. McKenney @ 2024-05-30 18:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot
On Wed, May 29, 2024 at 10:14:04AM +0200, Peter Zijlstra wrote:
> On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
> > The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
> > works at cross purposes: the RCU read side critical sections disable
> > preemption, while preempt=full schedules eagerly to minimize
> > latency.
> >
> > Warn if the user is switching to full preemption with PREEMPT_RCU=n.
> >
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > ---
> > kernel/sched/core.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d7804e29182d..df8e333f2d8b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
> > break;
> >
> > case preempt_dynamic_full:
> > + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
> > + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
> > + PREEMPT_MODE);
> > +
>
> Yeah, so I don't believe this is a viable strategy.
>
> Firstly, none of these RCU patches are actually about the whole LAZY
> preempt scheme, they apply equally well (arguably better) to the
> existing PREEMPT_DYNAMIC thing.
>
> Secondly, esp. with the LAZY thing, you are effectively running FULL at
> all times. It's just that some of the preemptions, typically those of
> the normal scheduling class are somewhat delayed. However RT/DL classes
> are still insta preempt.
>
> Meaning that if you run anything in the realtime classes you're running
> a fully preemptible kernel. As such, RCU had better be able to deal with
> it.
>
> So no, I don't believe this is right.
At one point, lazy preemption selected PREEMPT_COUNT (which I am
not seeing in this version, perhaps due to blindness on my part).
Of course, selecting PREEMPT_COUNT would result in !PREEMPT_RCU kernel's
rcu_read_lock() explicitly disabling preemption, thus avoiding preemption
(including lazy preemption) in RCU read-side critical sections.
Ankur, what am I missing here?
Thanx, Paul
* Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
2024-05-30 18:32 ` Paul E. McKenney
@ 2024-05-30 23:05 ` Ankur Arora
2024-05-30 23:15 ` Paul E. McKenney
0 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-30 23:05 UTC (permalink / raw)
To: paulmck
Cc: Peter Zijlstra, Ankur Arora, linux-kernel, tglx, torvalds,
rostedt, mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Ingo Molnar, Vincent Guittot
Paul E. McKenney <paulmck@kernel.org> writes:
> On Wed, May 29, 2024 at 10:14:04AM +0200, Peter Zijlstra wrote:
>> On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
>> > The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
>> > works at cross purposes: the RCU read side critical sections disable
>> > preemption, while preempt=full schedules eagerly to minimize
>> > latency.
>> >
>> > Warn if the user is switching to full preemption with PREEMPT_RCU=n.
>> >
>> > Cc: Ingo Molnar <mingo@redhat.com>
>> > Cc: Peter Zijlstra <peterz@infradead.org>
>> > Cc: Juri Lelli <juri.lelli@redhat.com>
>> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
>> > Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
>> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> > ---
>> > kernel/sched/core.c | 4 ++++
>> > 1 file changed, 4 insertions(+)
>> >
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index d7804e29182d..df8e333f2d8b 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
>> > break;
>> >
>> > case preempt_dynamic_full:
>> > + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
>> > + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
>> > + PREEMPT_MODE);
>> > +
>>
>> Yeah, so I don't believe this is a viable strategy.
>>
>> Firstly, none of these RCU patches are actually about the whole LAZY
>> preempt scheme, they apply equally well (arguably better) to the
>> existing PREEMPT_DYNAMIC thing.
>>
>> Secondly, esp. with the LAZY thing, you are effectively running FULL at
>> all times. It's just that some of the preemptions, typically those of
>> the normal scheduling class are somewhat delayed. However RT/DL classes
>> are still insta preempt.
>>
>> Meaning that if you run anything in the realtime classes you're running
>> a fully preemptible kernel. As such, RCU had better be able to deal with
>> it.
>>
>> So no, I don't believe this is right.
>
> At one point, lazy preemption selected PREEMPT_COUNT (which I am
> not seeing in this version, perhaps due to blindness on my part).
> Of course, selecting PREEMPT_COUNT would result in !PREEMPT_RCU kernel's
> rcu_read_lock() explicitly disabling preemption, thus avoiding preemption
> (including lazy preemption) in RCU read-side critical sections.
That should be still happening, just transitively. PREEMPT_AUTO selects
PREEMPT_BUILD, which selects PREEMPTION, and that in turn selects
PREEMPT_COUNT.
--
ankur
* Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
2024-05-30 23:05 ` Ankur Arora
@ 2024-05-30 23:15 ` Paul E. McKenney
0 siblings, 0 replies; 95+ messages in thread
From: Paul E. McKenney @ 2024-05-30 23:15 UTC (permalink / raw)
To: Ankur Arora
Cc: Peter Zijlstra, linux-kernel, tglx, torvalds, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Ingo Molnar, Vincent Guittot
On Thu, May 30, 2024 at 04:05:26PM -0700, Ankur Arora wrote:
>
> Paul E. McKenney <paulmck@kernel.org> writes:
>
> > On Wed, May 29, 2024 at 10:14:04AM +0200, Peter Zijlstra wrote:
> >> On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
> >> > The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
> >> > works at cross purposes: the RCU read side critical sections disable
> >> > preemption, while preempt=full schedules eagerly to minimize
> >> > latency.
> >> >
> >> > Warn if the user is switching to full preemption with PREEMPT_RCU=n.
> >> >
> >> > Cc: Ingo Molnar <mingo@redhat.com>
> >> > Cc: Peter Zijlstra <peterz@infradead.org>
> >> > Cc: Juri Lelli <juri.lelli@redhat.com>
> >> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> >> > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> >> > Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
> >> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> > ---
> >> > kernel/sched/core.c | 4 ++++
> >> > 1 file changed, 4 insertions(+)
> >> >
> >> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> > index d7804e29182d..df8e333f2d8b 100644
> >> > --- a/kernel/sched/core.c
> >> > +++ b/kernel/sched/core.c
> >> > @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
> >> > break;
> >> >
> >> > case preempt_dynamic_full:
> >> > + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
> >> > + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
> >> > + PREEMPT_MODE);
> >> > +
> >>
> >> Yeah, so I don't believe this is a viable strategy.
> >>
> >> Firstly, none of these RCU patches are actually about the whole LAZY
> >> preempt scheme, they apply equally well (arguably better) to the
> >> existing PREEMPT_DYNAMIC thing.
> >>
> >> Secondly, esp. with the LAZY thing, you are effectively running FULL at
> >> all times. It's just that some of the preemptions, typically those of
> >> the normal scheduling class are somewhat delayed. However RT/DL classes
> >> are still insta preempt.
> >>
> >> Meaning that if you run anything in the realtime classes you're running
> >> a fully preemptible kernel. As such, RCU had better be able to deal with
> >> it.
> >>
> >> So no, I don't believe this is right.
> >
> > At one point, lazy preemption selected PREEMPT_COUNT (which I am
> > not seeing in this version, perhaps due to blindness on my part).
> > Of course, selecting PREEMPT_COUNT would result in !PREEMPT_RCU kernel's
> > rcu_read_lock() explicitly disabling preemption, thus avoiding preemption
> > (including lazy preemption) in RCU read-side critical sections.
>
> That should be still happening, just transitively. PREEMPT_AUTO selects
> PREEMPT_BUILD, which selects PREEMPTION, and that in turn selects
> PREEMPT_COUNT.
Ah, I gave up too soon. Thank you!
Thanx, Paul
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
2024-05-29 8:14 ` Peter Zijlstra
2024-05-30 18:32 ` Paul E. McKenney
@ 2024-05-30 23:04 ` Ankur Arora
2024-05-30 23:20 ` Paul E. McKenney
1 sibling, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-30 23:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Ingo Molnar, Vincent Guittot
Peter Zijlstra <peterz@infradead.org> writes:
> On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
>> The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
>> works at cross purposes: the RCU read side critical sections disable
>> preemption, while preempt=full schedules eagerly to minimize
>> latency.
>>
>> Warn if the user is switching to full preemption with PREEMPT_RCU=n.
>>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
>> Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> kernel/sched/core.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index d7804e29182d..df8e333f2d8b 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
>> break;
>>
>> case preempt_dynamic_full:
>> + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
>> + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
>> + PREEMPT_MODE);
>> +
>
> Yeah, so I don't believe this is a viable strategy.
>
> Firstly, none of these RCU patches are actually about the whole LAZY
> preempt scheme, they apply equally well (arguably better) to the
> existing PREEMPT_DYNAMIC thing.
Agreed.
> Secondly, esp. with the LAZY thing, you are effectively running FULL at
> all times. It's just that some of the preemptions, typically those of
> the normal scheduling class are somewhat delayed. However RT/DL classes
> are still insta preempt.
Also, agreed.
> Meaning that if you run anything in the realtime classes you're running
> a fully preemptible kernel. As such, RCU had better be able to deal with
> it.
So, RCU can deal with (PREEMPT_RCU=y, PREEMPT_AUTO=y, preempt=none/voluntary/full),
since that's basically what PREEMPT_DYNAMIC already works with.
The other combination, (PREEMPT_RCU=n, PREEMPT_AUTO,
preempt=none/voluntary) would generally be business as usual, except, as
you say, it is really PREEMPT_RCU=n, preempt=full in disguise.
However, as Paul says __rcu_read_lock(), for PREEMPT_RCU=n is defined as:
static inline void __rcu_read_lock(void)
{
preempt_disable();
}
So, this combination -- though non-standard -- should also work.
The reason for adding the warning is that Paul had warned in earlier
discussions (see here for instance:
https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/)
that PREEMPT_FULL=y with PREEMPT_RCU=n is basically useless. But at
least in my understanding that's primarily a performance concern, not a
correctness one. Paul can probably speak to that more.
"PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
combination. All of the gains from PREEMPT_FULL=y are more than lost
due to PREEMPT_RCU=n, especially when the kernel decides to do something
like walk a long task list under RCU protection. We should not waste
people's time getting burned by this combination, nor should we waste
cycles testing it."
--
ankur
* Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
2024-05-30 23:04 ` Ankur Arora
@ 2024-05-30 23:20 ` Paul E. McKenney
2024-06-06 11:53 ` Peter Zijlstra
0 siblings, 1 reply; 95+ messages in thread
From: Paul E. McKenney @ 2024-05-30 23:20 UTC (permalink / raw)
To: Ankur Arora
Cc: Peter Zijlstra, linux-kernel, tglx, torvalds, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Ingo Molnar, Vincent Guittot
On Thu, May 30, 2024 at 04:04:41PM -0700, Ankur Arora wrote:
>
> Peter Zijlstra <peterz@infradead.org> writes:
>
> > On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
> >> The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
> >> works at cross purposes: the RCU read side critical sections disable
> >> preemption, while preempt=full schedules eagerly to minimize
> >> latency.
> >>
> >> Warn if the user is switching to full preemption with PREEMPT_RCU=n.
> >>
> >> Cc: Ingo Molnar <mingo@redhat.com>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Juri Lelli <juri.lelli@redhat.com>
> >> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> >> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> >> Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> ---
> >> kernel/sched/core.c | 4 ++++
> >> 1 file changed, 4 insertions(+)
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index d7804e29182d..df8e333f2d8b 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
> >> break;
> >>
> >> case preempt_dynamic_full:
> >> + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
> >> + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
> >> + PREEMPT_MODE);
> >> +
> >
> > Yeah, so I don't believe this is a viable strategy.
> >
> > Firstly, none of these RCU patches are actually about the whole LAZY
> > preempt scheme, they apply equally well (arguably better) to the
> > existing PREEMPT_DYNAMIC thing.
>
> Agreed.
>
> > Secondly, esp. with the LAZY thing, you are effectively running FULL at
> > all times. It's just that some of the preemptions, typically those of
> > the normal scheduling class are somewhat delayed. However RT/DL classes
> > are still insta preempt.
>
> Also, agreed.
>
> > Meaning that if you run anything in the realtime classes you're running
> > a fully preemptible kernel. As such, RCU had better be able to deal with
> > it.
>
> So, RCU can deal with (PREEMPT_RCU=y, PREEMPT_AUTO=y, preempt=none/voluntary/full),
> since that's basically what PREEMPT_DYNAMIC already works with.
>
> The other combination, (PREEMPT_RCU=n, PREEMPT_AUTO,
> preempt=none/voluntary) would generally be business as usual, except, as
> you say, it is really PREEMPT_RCU=n, preempt=full in disguise.
>
> However, as Paul says __rcu_read_lock(), for PREEMPT_RCU=n is defined as:
>
> static inline void __rcu_read_lock(void)
> {
> preempt_disable();
> }
>
> So, this combination -- though non-standard -- should also work.
>
> The reason for adding the warning is that Paul had warned in earlier
> discussions (see here for instance:
> https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/)
>
> that PREEMPT_FULL=y with PREEMPT_RCU=n is basically useless. But at
> least in my understanding that's primarily a performance concern, not a
> correctness one. Paul can probably speak to that more.
>
> "PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
> combination. All of the gains from PREEMPT_FULL=y are more than lost
> due to PREEMPT_RCU=n, especially when the kernel decides to do something
> like walk a long task list under RCU protection. We should not waste
> people's time getting burned by this combination, nor should we waste
> cycles testing it."
My selfish motivation here is to avoid testing this combination unless
and until someone actually has a good use for it. I do not think that
anyone will ever need it, but perhaps I am suffering from a failure
of imagination. If so, they hit that WARN, complain and explain their
use case, and at that point I start testing it (and fixing whatever bugs
have accumulated in the meantime). But until that time, I save time by
avoiding testing it.
Thanx, Paul
* Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
2024-05-30 23:20 ` Paul E. McKenney
@ 2024-06-06 11:53 ` Peter Zijlstra
2024-06-06 13:38 ` Paul E. McKenney
0 siblings, 1 reply; 95+ messages in thread
From: Peter Zijlstra @ 2024-06-06 11:53 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Ankur Arora, linux-kernel, tglx, torvalds, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot
On Thu, May 30, 2024 at 04:20:26PM -0700, Paul E. McKenney wrote:
> My selfish motivation here is to avoid testing this combination unless
> and until someone actually has a good use for it.
That doesn't make sense, the whole LAZY thing is fundamentally identical
to FULL, except it sometimes delays the preemption a wee bit. But all
the preemption scenarios from FULL are possible.
As such, it makes far more sense to only test FULL.
* Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
2024-06-06 11:53 ` Peter Zijlstra
@ 2024-06-06 13:38 ` Paul E. McKenney
2024-06-17 15:54 ` Paul E. McKenney
0 siblings, 1 reply; 95+ messages in thread
From: Paul E. McKenney @ 2024-06-06 13:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot
On Thu, Jun 06, 2024 at 01:53:25PM +0200, Peter Zijlstra wrote:
> On Thu, May 30, 2024 at 04:20:26PM -0700, Paul E. McKenney wrote:
>
> > My selfish motivation here is to avoid testing this combination unless
> > and until someone actually has a good use for it.
>
> That doesn't make sense, the whole LAZY thing is fundamentally identical
> to FULL, except it sometimes delays the preemption a wee bit. But all
> the preemption scenarios from FULL are possible.
As noted earlier in this thread, this is not the case for non-preemptible
RCU, which disables preemption across its read-side critical sections.
In addition, from a performance/throughput viewpoint, it is not just
the possibility of preemption that matters, but also the probability.
> As such, it makes far more sense to only test FULL.
You have considerable work left to do in order to convince me of this one.
Thanx, Paul
* Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
2024-06-06 13:38 ` Paul E. McKenney
@ 2024-06-17 15:54 ` Paul E. McKenney
2024-06-18 16:29 ` Paul E. McKenney
0 siblings, 1 reply; 95+ messages in thread
From: Paul E. McKenney @ 2024-06-17 15:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot
On Thu, Jun 06, 2024 at 06:38:57AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 06, 2024 at 01:53:25PM +0200, Peter Zijlstra wrote:
> > On Thu, May 30, 2024 at 04:20:26PM -0700, Paul E. McKenney wrote:
> >
> > > My selfish motivation here is to avoid testing this combination unless
> > > and until someone actually has a good use for it.
> >
> > That doesn't make sense, the whole LAZY thing is fundamentally identical
> > to FULL, except it sometimes delays the preemption a wee bit. But all
> > the preemption scenarios from FULL are possible.
>
> As noted earlier in this thread, this is not the case for non-preemptible
> RCU, which disables preemption across its read-side critical sections.
> In addition, from a performance/throughput viewpoint, it is not just
> the possibility of preemption that matters, but also the probability.
>
> > As such, it makes far more sense to only test FULL.
>
> You have considerable work left to do in order to convince me of this one.
On the other hand, it does make sense to select Tiny SRCU for all !SMP
kernels, whether preemptible or not. And it also makes sense to test
(but not *only* test and definitely *not* support) non-preemptible
RCU running in a preemptible kernel, perhaps as a one-off, perhaps
longer term. The point being of course that the increased preemption
rate of a fully preemptible kernel should uncover bugs that might appear
in lazy-preemptible kernels only very rarely.
This might have been what you were getting at, and if so, apologies!
But in my defense, you did say "only test FULL" above. ;-)
Thanx, Paul
* Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
2024-06-17 15:54 ` Paul E. McKenney
@ 2024-06-18 16:29 ` Paul E. McKenney
0 siblings, 0 replies; 95+ messages in thread
From: Paul E. McKenney @ 2024-06-18 16:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, tglx, torvalds, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot
On Mon, Jun 17, 2024 at 08:54:49AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 06, 2024 at 06:38:57AM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 06, 2024 at 01:53:25PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 30, 2024 at 04:20:26PM -0700, Paul E. McKenney wrote:
> > >
> > > > My selfish motivation here is to avoid testing this combination unless
> > > > and until someone actually has a good use for it.
> > >
> > > That doesn't make sense, the whole LAZY thing is fundamentally identical
> > > to FULL, except it sometimes delays the preemption a wee bit. But all
> > > the preemption scenarios from FULL are possible.
> >
> > As noted earlier in this thread, this is not the case for non-preemptible
> > RCU, which disables preemption across its read-side critical sections.
> > In addition, from a performance/throughput viewpoint, it is not just
> > the possibility of preemption that matters, but also the probability.
> >
> > > As such, it makes far more sense to only test FULL.
> >
> > You have considerable work left to do in order to convince me of this one.
>
> On the other hand, it does make sense to select Tiny SRCU for all !SMP
> kernels, whether preemptible or not.
Except that testing made a liar out of me. SRCU priority boosting would
be required. So this one is also strictly for pre-testing lazy
preemption.
As usual, it seemed like a good idea at the time... ;-)
Thanx, Paul
> And it also makes sense to test
> (but not *only* test and definitely *not* support) non-preemptible
> RCU running in a preemptible kernel, perhaps as a one-off, perhaps
> longer term. The point being of course that the increased preemption
> rate of a fully preemptible kernel should uncover bugs that might appear
> in lazy-preemptible kernels only very rarely.
>
> This might have been what you were getting at, and if so, apologies!
> But in my defense, you did say "only test FULL" above. ;-)
>
> Thanx, Paul
* [PATCH v2 17/35] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (15 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 18/35] rcu: force context-switch " Ankur Arora
` (19 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora
With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent
states for read-side critical sections via rcu_all_qs().
One reason why this was necessary: lacking preempt-count, the tick
handler has no way of knowing whether it is executing in a read-side
critical section or not.
With PREEMPT_AUTO=y, there can be configurations with (PREEMPT_COUNT=y,
PREEMPT_RCU=n). This means that cond_resched() is a stub which does
not provide for quiescent states via rcu_all_qs().
So, use the availability of preempt_count() to report quiescent states
in rcu_flavor_sched_clock_irq().
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/rcu/tree_plugin.h | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 36a8b5dbf5b5..741476c841a1 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -963,13 +963,16 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
*/
static void rcu_flavor_sched_clock_irq(int user)
{
- if (user || rcu_is_cpu_rrupt_from_idle()) {
+ if (user || rcu_is_cpu_rrupt_from_idle() ||
+ (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
+ !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
/*
* Get here if this CPU took its interrupt from user
- * mode or from the idle loop, and if this is not a
- * nested interrupt. In this case, the CPU is in
- * a quiescent state, so note it.
+ * mode, from the idle loop without this being a nested
+ * interrupt, or while not holding a preempt count.
+ * In this case, the CPU is in a quiescent state, so note
+ * it.
*
* No memory barrier is required here because rcu_qs()
* references only CPU-local variables that other CPUs
--
2.31.1
* [PATCH v2 18/35] rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (16 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 17/35] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 19/35] x86/thread_info: define TIF_NEED_RESCHED_LAZY Ankur Arora
` (18 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora
With (PREEMPT_RCU=n, PREEMPT_COUNT=y), rcu_flavor_sched_clock_irq()
registers urgently needed quiescent states when preempt_count() is
available and no task or softirq is in a non-preemptible section.
This, however, does nothing for long running loops where preemption
is only temporarily enabled, since the tick is unlikely to neatly fall
in the preemptible() section.
Handle that by forcing a context-switch when we require a quiescent
state urgently but are holding a preempt_count().
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/rcu/tree.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d9642dd06c25..3a0e1d0b939c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2286,8 +2286,17 @@ void rcu_sched_clock_irq(int user)
raw_cpu_inc(rcu_data.ticks_this_gp);
/* The load-acquire pairs with the store-release setting to true. */
if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
- /* Idle and userspace execution already are quiescent states. */
- if (!rcu_is_cpu_rrupt_from_idle() && !user) {
+ /*
+ * Idle and userspace execution already are quiescent states.
+ * If, however, we came here from a nested interrupt in the
+ * kernel, or if we have PREEMPT_RCU=n but are holding a
+ * preempt_count() (say, with CONFIG_PREEMPT_AUTO=y), then
+ * force a context switch.
+ */
+ if ((!rcu_is_cpu_rrupt_from_idle() && !user) ||
+ ((!IS_ENABLED(CONFIG_PREEMPT_RCU) &&
+ IS_ENABLED(CONFIG_PREEMPT_COUNT)) &&
+ (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
set_tsk_need_resched(current);
set_preempt_need_resched();
}
--
2.31.1
* [PATCH v2 19/35] x86/thread_info: define TIF_NEED_RESCHED_LAZY
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (17 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 18/35] rcu: force context-switch " Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 20/35] powerpc: add support for PREEMPT_AUTO Ankur Arora
` (17 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Borislav Petkov,
Dave Hansen
Define TIF_NEED_RESCHED_LAZY which, with TIF_NEED_RESCHED provides the
scheduler with two kinds of rescheduling intent: TIF_NEED_RESCHED,
for the usual rescheduling at the next safe preemption point;
TIF_NEED_RESCHED_LAZY expressing an intent to reschedule at some
time in the future while allowing the current task to run to
completion.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 6 ++++--
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 928820e61cb5..5cd83b12f6fd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -277,6 +277,7 @@ config X86
select HAVE_STATIC_CALL
select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
select HAVE_PREEMPT_DYNAMIC_CALL
+ select HAVE_PREEMPT_AUTO
select HAVE_RSEQ
select HAVE_RUST if X86_64
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 12da7dfd5ef1..6862bbbb98ab 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -87,8 +87,9 @@ struct thread_info {
#define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
-#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
-#define TIF_SSBD 5 /* Speculative store bypass disable */
+#define TIF_NEED_RESCHED_LAZY 4 /* Lazy rescheduling */
+#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
+#define TIF_SSBD 6 /* Speculative store bypass disable */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
@@ -110,6 +111,7 @@ struct thread_info {
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
--
2.31.1
* [PATCH v2 20/35] powerpc: add support for PREEMPT_AUTO
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (18 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 19/35] x86/thread_info: define TIF_NEED_RESCHED_LAZY Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 21/35] sched: prepare for lazy rescheduling in resched_curr() Ankur Arora
` (16 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora
From: Shrikanth Hegde <sshegde@linux.ibm.com>
Add PowerPC arch support for PREEMPT_AUTO by defining LAZY bits.
Since PowerPC doesn't use the generic exit-to-user/kernel functions, add
the NR_LAZY check to the exit-to-user and the exit-to-kernel-from-interrupt
routines.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
[ Changed TIF_NEED_RESCHED_LAZY to now be defined unconditionally. ]
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/thread_info.h | 5 ++++-
arch/powerpc/kernel/interrupt.c | 5 +++--
3 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..11e7008f5dd3 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -268,6 +268,7 @@ config PPC
select HAVE_PERF_EVENTS_NMI if PPC64
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
+ select HAVE_PREEMPT_AUTO
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE
select HAVE_RSEQ
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 15c5691dd218..0d170e2be2b6 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -117,11 +117,14 @@ void arch_setup_new_exec(void);
#endif
#define TIF_POLLING_NRFLAG 19 /* true if poll_idle() is polling TIF_NEED_RESCHED */
#define TIF_32BIT 20 /* 32 bit binary */
+#define TIF_NEED_RESCHED_LAZY 21 /* Lazy rescheduling */
/* as above, but as bit values */
#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
#define _TIF_SIGPENDING (1<<TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
+
#define _TIF_NOTIFY_SIGNAL (1<<TIF_NOTIFY_SIGNAL)
#define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG)
#define _TIF_32BIT (1<<TIF_32BIT)
@@ -144,7 +147,7 @@ void arch_setup_new_exec(void);
#define _TIF_USER_WORK_MASK (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
_TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_RESTORE_TM | _TIF_PATCH_PENDING | \
- _TIF_NOTIFY_SIGNAL)
+ _TIF_NOTIFY_SIGNAL | _TIF_NEED_RESCHED_LAZY)
#define _TIF_PERSYSCALL_MASK (_TIF_RESTOREALL|_TIF_NOERROR)
/* Bits in local_flags */
diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index eca293794a1e..0b97cdd4b94e 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -185,7 +185,7 @@ interrupt_exit_user_prepare_main(unsigned long ret, struct pt_regs *regs)
ti_flags = read_thread_flags();
while (unlikely(ti_flags & (_TIF_USER_WORK_MASK & ~_TIF_RESTORE_TM))) {
local_irq_enable();
- if (ti_flags & _TIF_NEED_RESCHED) {
+ if (ti_flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
schedule();
} else {
/*
@@ -396,7 +396,8 @@ notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs)
/* Returning to a kernel context with local irqs enabled. */
WARN_ON_ONCE(!(regs->msr & MSR_EE));
again:
- if (IS_ENABLED(CONFIG_PREEMPT)) {
+
+ if (IS_ENABLED(CONFIG_PREEMPTION)) {
/* Return to preemptible kernel context */
if (unlikely(read_thread_flags() & _TIF_NEED_RESCHED)) {
if (preempt_count() == 0)
--
2.31.1
* [PATCH v2 21/35] sched: prepare for lazy rescheduling in resched_curr()
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (19 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 20/35] powerpc: add support for PREEMPT_AUTO Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-29 9:32 ` Peter Zijlstra
2024-05-28 0:35 ` [PATCH v2 22/35] sched: default preemption policy for PREEMPT_AUTO Ankur Arora
` (15 subsequent siblings)
36 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
Handle RESCHED_LAZY in resched_curr(), by registering an intent to
reschedule at exit-to-user.
Given that the rescheduling is not imminent, skip the preempt folding
and the resched IPI.
Also, update set_nr_and_not_polling() to handle RESCHED_LAZY. Note that
there are no changes to set_nr_if_polling(), since lazy rescheduling
is not meaningful for idle.
And finally, now that there are two need-resched bits, enforce a
priority order while setting them.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/core.c | 35 +++++++++++++++++++++++------------
1 file changed, 23 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index df8e333f2d8b..27b908cc9134 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -899,14 +899,14 @@ static inline void hrtick_rq_init(struct rq *rq)
#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
/*
- * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
+ * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, resched_t rs)
{
struct thread_info *ti = task_thread_info(p);
- return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+ return !(fetch_or(&ti->flags, _tif_resched(rs)) & _TIF_POLLING_NRFLAG);
}
/*
@@ -931,9 +931,9 @@ static bool set_nr_if_polling(struct task_struct *p)
}
#else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, resched_t rs)
{
- __set_tsk_need_resched(p, RESCHED_NOW);
+ __set_tsk_need_resched(p, rs);
return true;
}
@@ -1041,25 +1041,34 @@ void wake_up_q(struct wake_q_head *head)
void resched_curr(struct rq *rq)
{
struct task_struct *curr = rq->curr;
+ resched_t rs = RESCHED_NOW;
int cpu;
lockdep_assert_rq_held(rq);
- if (__test_tsk_need_resched(curr, RESCHED_NOW))
+ /*
+ * TIF_NEED_RESCHED is the higher priority bit, so if it is already
+ * set, nothing more to be done.
+ */
+ if (__test_tsk_need_resched(curr, RESCHED_NOW) ||
+ (rs == RESCHED_LAZY && __test_tsk_need_resched(curr, RESCHED_LAZY)))
return;
cpu = cpu_of(rq);
if (cpu == smp_processor_id()) {
- __set_tsk_need_resched(curr, RESCHED_NOW);
- set_preempt_need_resched();
+ __set_tsk_need_resched(curr, rs);
+ if (rs == RESCHED_NOW)
+ set_preempt_need_resched();
return;
}
- if (set_nr_and_not_polling(curr))
- smp_send_reschedule(cpu);
- else
+ if (set_nr_and_not_polling(curr, rs)) {
+ if (rs == RESCHED_NOW)
+ smp_send_reschedule(cpu);
+ } else {
trace_sched_wake_idle_without_ipi(cpu);
+ }
}
void resched_cpu(int cpu)
@@ -1154,7 +1163,7 @@ static void wake_up_idle_cpu(int cpu)
* and testing of the above solutions didn't appear to report
* much benefits.
*/
- if (set_nr_and_not_polling(rq->idle))
+ if (set_nr_and_not_polling(rq->idle, RESCHED_NOW))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
@@ -6704,6 +6713,8 @@ static void __sched notrace __schedule(unsigned int sched_mode)
}
next = pick_next_task(rq, prev, &rf);
+
+ /* Clear both TIF_NEED_RESCHED, TIF_NEED_RESCHED_LAZY */
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
--
2.31.1
^ permalink raw reply related [flat|nested] 95+ messages in thread

* Re: [PATCH v2 21/35] sched: prepare for lazy rescheduling in resched_curr()
2024-05-28 0:35 ` [PATCH v2 21/35] sched: prepare for lazy rescheduling in resched_curr() Ankur Arora
@ 2024-05-29 9:32 ` Peter Zijlstra
0 siblings, 0 replies; 95+ messages in thread
From: Peter Zijlstra @ 2024-05-29 9:32 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot
On Mon, May 27, 2024 at 05:35:07PM -0700, Ankur Arora wrote:
> @@ -1041,25 +1041,34 @@ void wake_up_q(struct wake_q_head *head)
> void resched_curr(struct rq *rq)
> {
> struct task_struct *curr = rq->curr;
> + resched_t rs = RESCHED_NOW;
> int cpu;
>
> lockdep_assert_rq_held(rq);
>
> - if (__test_tsk_need_resched(curr, RESCHED_NOW))
> + /*
> + * TIF_NEED_RESCHED is the higher priority bit, so if it is already
> + * set, nothing more to be done.
> + */
> + if (__test_tsk_need_resched(curr, RESCHED_NOW) ||
> + (rs == RESCHED_LAZY && __test_tsk_need_resched(curr, RESCHED_LAZY)))
> return;
>
> cpu = cpu_of(rq);
>
> if (cpu == smp_processor_id()) {
> - __set_tsk_need_resched(curr, RESCHED_NOW);
> - set_preempt_need_resched();
> + __set_tsk_need_resched(curr, rs);
> + if (rs == RESCHED_NOW)
> + set_preempt_need_resched();
> return;
> }
>
> - if (set_nr_and_not_polling(curr))
> - smp_send_reschedule(cpu);
> - else
> + if (set_nr_and_not_polling(curr, rs)) {
> + if (rs == RESCHED_NOW)
> + smp_send_reschedule(cpu);
I'm thinking this wants at least something like:
WARN_ON_ONCE(rs == RESCHED_LAZY && is_idle_task(curr));
> + } else {
> trace_sched_wake_idle_without_ipi(cpu);
> + }
> }
>
> void resched_cpu(int cpu)
* [PATCH v2 22/35] sched: default preemption policy for PREEMPT_AUTO
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (20 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 21/35] sched: prepare for lazy rescheduling in resched_curr() Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 23/35] sched: handle idle preemption " Ankur Arora
` (14 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
Add resched_opt_translate(), which determines the particular
need-resched flag based on the scheduling policy in effect.
Preemption models other than PREEMPT_AUTO: continue to use
tif_resched(RESCHED_NOW).
PREEMPT_AUTO: use tif_resched(RESCHED_LAZY) to reschedule at
the next exit-to-user.
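The translation described above can be sketched in plain C. This is an illustrative userspace model, not the kernel implementation: the names `resched_t`, `RESCHED_NOW`, and `RESCHED_LAZY` mirror the patch, while `cfg_preempt_auto` is a hypothetical stand-in for `IS_ENABLED(CONFIG_PREEMPT_AUTO)`.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the flag selection described above; not kernel
 * code. cfg_preempt_auto stands in for IS_ENABLED(CONFIG_PREEMPT_AUTO). */
typedef enum { RESCHED_NOW, RESCHED_LAZY } resched_t;

static resched_t resched_translate(bool cfg_preempt_auto)
{
	/* Preemption models other than PREEMPT_AUTO: schedule eagerly. */
	if (!cfg_preempt_auto)
		return RESCHED_NOW;

	/* PREEMPT_AUTO: defer rescheduling to the next exit-to-user. */
	return RESCHED_LAZY;
}
```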
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/core.c | 30 ++++++++++++++++++++++++------
kernel/sched/sched.h | 12 +++++++++++-
2 files changed, 35 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 27b908cc9134..ee846dc9133b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1032,20 +1032,38 @@ void wake_up_q(struct wake_q_head *head)
}
/*
- * resched_curr - mark rq's current task 'to be rescheduled now'.
+ * For preemption models other than PREEMPT_AUTO: always schedule
+ * eagerly.
*
- * On UP this means the setting of the need_resched flag, on SMP it
- * might also involve a cross-CPU call to trigger the scheduler on
- * the target CPU.
+ * For PREEMPT_AUTO: allow everything else to finish its time quanta, and
+ * mark for rescheduling at the next exit to user.
*/
-void resched_curr(struct rq *rq)
+static resched_t resched_opt_translate(struct task_struct *curr,
+ enum resched_opt opt)
+{
+ if (!IS_ENABLED(CONFIG_PREEMPT_AUTO))
+ return RESCHED_NOW;
+
+ return RESCHED_LAZY;
+}
+
+/*
+ * __resched_curr - mark rq's current task 'to be rescheduled now'.
+ *
+ * On UP this means the setting of the appropriate need_resched flag.
+ * On SMP, in addition it might also involve a cross-CPU call to
+ * trigger the scheduler on the target CPU.
+ */
+void __resched_curr(struct rq *rq, enum resched_opt opt)
{
struct task_struct *curr = rq->curr;
- resched_t rs = RESCHED_NOW;
+ resched_t rs;
int cpu;
lockdep_assert_rq_held(rq);
+ rs = resched_opt_translate(curr, opt);
+
/*
* TIF_NEED_RESCHED is the higher priority bit, so if it is already
* set, nothing more to be done.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c9239c0b0095..7013bd054a2f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2464,7 +2464,17 @@ extern void init_sched_fair_class(void);
extern void reweight_task(struct task_struct *p, int prio);
-extern void resched_curr(struct rq *rq);
+enum resched_opt {
+ RESCHED_DEFAULT,
+};
+
+extern void __resched_curr(struct rq *rq, enum resched_opt opt);
+
+static inline void resched_curr(struct rq *rq)
+{
+ __resched_curr(rq, RESCHED_DEFAULT);
+}
+
extern void resched_cpu(int cpu);
extern struct rt_bandwidth def_rt_bandwidth;
--
2.31.1
* [PATCH v2 23/35] sched: handle idle preemption for PREEMPT_AUTO
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (21 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 22/35] sched: default preemption policy for PREEMPT_AUTO Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 24/35] sched: schedule eagerly in resched_cpu() Ankur Arora
` (13 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
When running the idle task, we always want to schedule out immediately.
Use tif_resched(RESCHED_NOW) to do that.
This path should be identical across preemption models, which is borne
out by comparing latency via perf bench sched pipe (5 runs):
PREEMPT_AUTO: 4.430 usecs/op +- 0.080 ( +- 1.800% )
PREEMPT_DYNAMIC: 4.400 usecs/op +- 0.100 ( +- 2.270% )
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee846dc9133b..1b930b84eb59 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1044,6 +1044,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (!IS_ENABLED(CONFIG_PREEMPT_AUTO))
return RESCHED_NOW;
+ if (is_idle_task(curr))
+ return RESCHED_NOW;
+
return RESCHED_LAZY;
}
--
2.31.1
* [PATCH v2 24/35] sched: schedule eagerly in resched_cpu()
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (22 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 23/35] sched: handle idle preemption " Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 25/35] sched/fair: refactor update_curr(), entity_tick() Ankur Arora
` (12 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
resched_cpu() is used as an RCU hammer of last resort. Force
rescheduling eagerly with tif_resched(RESCHED_NOW).
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/core.c | 14 +++++++++++---
kernel/sched/sched.h | 1 +
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1b930b84eb59..e838328d93d1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1035,8 +1035,9 @@ void wake_up_q(struct wake_q_head *head)
* For preemption models other than PREEMPT_AUTO: always schedule
* eagerly.
*
- * For PREEMPT_AUTO: allow everything else to finish its time quanta, and
- * mark for rescheduling at the next exit to user.
+ * For PREEMPT_AUTO: schedule idle threads eagerly, allow everything else
+ * to finish its time quanta, and mark for rescheduling at the next exit
+ * to user.
*/
static resched_t resched_opt_translate(struct task_struct *curr,
enum resched_opt opt)
@@ -1044,6 +1045,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (!IS_ENABLED(CONFIG_PREEMPT_AUTO))
return RESCHED_NOW;
+ if (opt == RESCHED_FORCE)
+ return RESCHED_NOW;
+
if (is_idle_task(curr))
return RESCHED_NOW;
@@ -1099,7 +1103,11 @@ void resched_cpu(int cpu)
raw_spin_rq_lock_irqsave(rq, flags);
if (cpu_online(cpu) || cpu == smp_processor_id())
- resched_curr(rq);
+ /*
+ * resched_cpu() is typically used as an RCU hammer.
+ * Mark for imminent resched.
+ */
+ __resched_curr(rq, RESCHED_FORCE);
raw_spin_rq_unlock_irqrestore(rq, flags);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7013bd054a2f..e5e4747fbef2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2466,6 +2466,7 @@ extern void reweight_task(struct task_struct *p, int prio);
enum resched_opt {
RESCHED_DEFAULT,
+ RESCHED_FORCE,
};
extern void __resched_curr(struct rq *rq, enum resched_opt opt);
--
2.31.1
* [PATCH v2 25/35] sched/fair: refactor update_curr(), entity_tick()
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (23 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 24/35] sched: schedule eagerly in resched_cpu() Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 26/35] sched/fair: handle tick expiry under lazy preemption Ankur Arora
` (11 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
When updating the task's runtime statistics via update_curr()
or entity_tick(), we call resched_curr() to reschedule if needed.
Refactor update_curr() and entity_tick() to only update the stats,
deferring any rescheduling needed to task_tick_fair() or
update_curr().
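The shape of the refactor can be sketched as follows. This is a hedged userspace illustration, not the kernel code: `entity_tick()` here just reports a precomputed flag instead of updating real statistics, and `struct entity` is a hypothetical stand-in for the sched_entity hierarchy.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the refactored tick path: each entity_tick() only reports
 * whether a reschedule is needed; the caller aggregates the results and
 * makes a single resched decision at the end. Illustrative only. */
struct entity { bool deadline_expired; };

static bool entity_tick(const struct entity *se)
{
	/* Stand-in for the real stats update and deadline check. */
	return se->deadline_expired;
}

/* Returns true if the caller should reschedule the current task. */
static bool task_tick(const struct entity *hier, int depth)
{
	bool resched = false;
	int i;

	for (i = 0; i < depth; i++)
		resched |= entity_tick(&hier[i]);

	return resched;
}
```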
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/fair.c | 54 ++++++++++++++++++++++-----------------------
1 file changed, 27 insertions(+), 27 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c5171c247466..dd34709f294c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -981,10 +981,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
* XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
* this is probably good enough.
*/
-static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if ((s64)(se->vruntime - se->deadline) < 0)
- return;
+ return false;
/*
* For EEVDF the virtual time slope is determined by w_i (iow.
@@ -1002,9 +1002,11 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
* The task has consumed its request, reschedule.
*/
if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
clear_buddies(cfs_rq, se);
+ return true;
}
+
+ return false;
}
#include "pelt.h"
@@ -1159,26 +1161,35 @@ s64 update_curr_common(struct rq *rq)
/*
* Update the current task's runtime statistics.
*/
-static void update_curr(struct cfs_rq *cfs_rq)
+static bool __update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
s64 delta_exec;
+ bool resched;
if (unlikely(!curr))
- return;
+ return false;
delta_exec = update_curr_se(rq_of(cfs_rq), curr);
if (unlikely(delta_exec <= 0))
- return;
+ return false;
curr->vruntime += calc_delta_fair(delta_exec, curr);
- update_deadline(cfs_rq, curr);
+ resched = update_deadline(cfs_rq, curr);
update_min_vruntime(cfs_rq);
if (entity_is_task(curr))
update_curr_task(task_of(curr), delta_exec);
account_cfs_rq_runtime(cfs_rq, delta_exec);
+
+ return resched;
+}
+
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+ if (__update_curr(cfs_rq))
+ resched_curr(rq_of(cfs_rq));
}
static void update_curr_fair(struct rq *rq)
@@ -5499,13 +5510,13 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
cfs_rq->curr = NULL;
}
-static void
-entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
+static bool
+entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
/*
* Update run-time statistics of the 'current'.
*/
- update_curr(cfs_rq);
+ bool resched = __update_curr(cfs_rq);
/*
* Ensure that runnable average is periodically updated.
@@ -5513,22 +5524,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
update_load_avg(cfs_rq, curr, UPDATE_TG);
update_cfs_group(curr);
-#ifdef CONFIG_SCHED_HRTICK
- /*
- * queued ticks are scheduled to match the slice, so don't bother
- * validating it and just reschedule.
- */
- if (queued) {
- resched_curr(rq_of(cfs_rq));
- return;
- }
- /*
- * don't let the period tick interfere with the hrtick preemption
- */
- if (!sched_feat(DOUBLE_TICK) &&
- hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
- return;
-#endif
+ return resched;
}
@@ -12611,12 +12607,16 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
+ bool resched = false;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- entity_tick(cfs_rq, se, queued);
+ resched |= entity_tick(cfs_rq, se);
}
+ if (resched)
+ resched_curr(rq);
+
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
--
2.31.1
* [PATCH v2 26/35] sched/fair: handle tick expiry under lazy preemption
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (24 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 25/35] sched/fair: refactor update_curr(), entity_tick() Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 27/35] sched: support preempt=none under PREEMPT_AUTO Ankur Arora
` (10 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
The default policy for lazy scheduling is to schedule at the next
exit-to-user. So, do that for all but deadline tasks. For deadline
tasks, once the task is no longer leftmost, force it to be scheduled
out immediately.
Always scheduling lazily, however, runs into the 'hog' problem -- the
target task might be running in the kernel and might not relinquish
the CPU on its own.
Handle that by upgrading the ignored tif_resched(RESCHED_LAZY) bit to
tif_resched(RESCHED_NOW) at the next tick.
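The upgrade step can be sketched as below. This is an illustrative userspace model; the bit values are arbitrary stand-ins and do not match the kernel's actual TIF flag layout.

```c
#include <assert.h>

/* Sketch of the tick-expiry upgrade described above: if the lazy bit
 * is still set when the next tick fires, escalate to an eager resched.
 * Illustrative only; bit values do not match the kernel's TIF flags. */
#define TIF_NEED_RESCHED_LAZY	0x1u
#define TIF_NEED_RESCHED	0x2u

static unsigned int tick_resched_flags(unsigned int flags)
{
	if (flags & TIF_NEED_RESCHED_LAZY) {
		/* Task ignored the lazy bit for a full tick: force it out. */
		flags &= ~TIF_NEED_RESCHED_LAZY;
		flags |= TIF_NEED_RESCHED;
	} else {
		/* First tick: mark for lazy rescheduling at exit-to-user. */
		flags |= TIF_NEED_RESCHED_LAZY;
	}
	return flags;
}
```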
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/core.c | 8 ++++++++
kernel/sched/deadline.c | 5 ++++-
kernel/sched/fair.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 6 ++++++
5 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e838328d93d1..2bc7f636267d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1051,6 +1051,14 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (is_idle_task(curr))
return RESCHED_NOW;
+ if (opt == RESCHED_TICK &&
+ unlikely(__test_tsk_need_resched(curr, RESCHED_LAZY)))
+ /*
+ * If the task hasn't switched away by the second tick,
+ * force it away by upgrading to TIF_NEED_RESCHED.
+ */
+ return RESCHED_NOW;
+
return RESCHED_LAZY;
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d24d6bfee293..cb0dd77508b1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1378,8 +1378,11 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
}
+ /*
+ * We are not leftmost anymore. Reschedule straight away.
+ */
if (!is_leftmost(dl_se, &rq->dl))
- resched_curr(rq);
+ __resched_curr(rq, RESCHED_FORCE);
}
/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd34709f294c..faa6afe0af0d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12615,7 +12615,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
}
if (resched)
- resched_curr(rq);
+ resched_curr_tick(rq);
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f0a6c9bb890b..4713783bbdef 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1023,7 +1023,7 @@ static void update_curr_rt(struct rq *rq)
rt_rq->rt_time += delta_exec;
exceeded = sched_rt_runtime_exceeded(rt_rq);
if (exceeded)
- resched_curr(rq);
+ resched_curr_tick(rq);
raw_spin_unlock(&rt_rq->rt_runtime_lock);
if (exceeded)
do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e5e4747fbef2..107c5fc2b7bb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2467,6 +2467,7 @@ extern void reweight_task(struct task_struct *p, int prio);
enum resched_opt {
RESCHED_DEFAULT,
RESCHED_FORCE,
+ RESCHED_TICK,
};
extern void __resched_curr(struct rq *rq, enum resched_opt opt);
@@ -2476,6 +2477,11 @@ static inline void resched_curr(struct rq *rq)
__resched_curr(rq, RESCHED_DEFAULT);
}
+static inline void resched_curr_tick(struct rq *rq)
+{
+ __resched_curr(rq, RESCHED_TICK);
+}
+
extern void resched_cpu(int cpu);
extern struct rt_bandwidth def_rt_bandwidth;
--
2.31.1
* [PATCH v2 27/35] sched: support preempt=none under PREEMPT_AUTO
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (25 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 26/35] sched/fair: handle tick expiry under lazy preemption Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 28/35] sched: support preempt=full " Ankur Arora
` (9 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
The default preemption policy for the no-forced-preemption model under
PREEMPT_AUTO is to always schedule lazily for well-behaved, non-idle
tasks, preempting at exit-to-user.
We already have that, so enable it.
Comparing a scheduling/IPC workload:
# perf stat -a -e cs --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
PREEMPT_AUTO, preempt=none
3,173,961 context-switches ( +- 0.60% )
3.03058 +- 0.00621 seconds time elapsed ( +- 0.20% )
PREEMPT_DYNAMIC, preempt=none
2,942,664 context-switches ( +- 0.49% )
3.18924 +- 0.00483 seconds time elapsed ( +- 0.15% )
Both perform similarly, but we incur a slightly higher number of
context-switches with PREEMPT_AUTO.
Drilling down, we see that both voluntary and involuntary
context-switches are higher for this test:
PREEMPT_AUTO, preempt=none
2286219.90 +- 39510.80 voluntary context-switches ( +- 1.72% )
887741.80 +- 20137.63 involuntary context-switches ( +- 2.26% )
PREEMPT_DYNAMIC, preempt=none
2125750.40 +- 29593.55 voluntary context-switches ( +- 1.39% )
816914.20 +- 13723.46 involuntary context-switches ( +- 1.67% )
Assuming that voluntary context-switches due to explicit blocking are
similar, we expect PREEMPT_AUTO to incur a larger number of
context-switches at exit-to-user (counted as voluntary), since that is
its default rescheduling point.
Involuntary context-switches under PREEMPT_AUTO are seen when a task
has exceeded its time quanta by a tick. Under PREEMPT_DYNAMIC, they
are incurred when a task needs to be rescheduled and then encounters
a cond_resched().
So, these two numbers aren't directly comparable.
Comparing a kernbench workload:
# Half load (-j 32)
PREEMPT_AUTO PREEMPT_DYNAMIC
wall 74.41 +- 0.45 ( +- 0.60% ) 74.20 +- 0.33 sec ( +- 0.45% )
utime 1419.78 +- 2.04 ( +- 0.14% ) 1416.40 +- 6.07 sec ( +- 0.42% )
stime 247.70 +- 0.88 ( +- 0.35% ) 246.23 +- 1.20 sec ( +- 0.49% )
%cpu 2240.20 +- 16.03 ( +- 0.71% ) 2240.20 +- 19.34 ( +- 0.86% )
inv-csw 13056.00 +- 427.58 ( +- 3.27% ) 18750.60 +- 771.21 ( +- 4.11% )
vol-csw 191000.00 +- 1623.25 ( +- 0.84% ) 182857.00 +- 2373.12 ( +- 1.29% )
The runtimes are basically identical for both of these. Voluntary
context-switches, as above (and in the optimal, maximal runs below),
are higher, which, as mentioned above, does add up.
However, unlike the sched-messaging workload, the involuntary
context-switches are generally lower (also true for the optimal,
maximal runs). One reason might be that kbuild spends ~20% of its
time executing in the kernel, while sched-messaging spends ~95% of
its time in the kernel, which means a greater likelihood of being
preempted for exceeding its time quanta.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/core.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2bc7f636267d..c3ba33c77053 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8983,7 +8983,9 @@ static void __sched_dynamic_update(int mode)
{
switch (mode) {
case preempt_dynamic_none:
- preempt_dynamic_mode = preempt_dynamic_undefined;
+ if (mode != preempt_dynamic_mode)
+ pr_info("%s: none\n", PREEMPT_MODE);
+ preempt_dynamic_mode = mode;
break;
case preempt_dynamic_voluntary:
--
2.31.1
* [PATCH v2 28/35] sched: support preempt=full under PREEMPT_AUTO
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (26 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 27/35] sched: support preempt=none under PREEMPT_AUTO Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 29/35] sched: handle preempt=voluntary " Ankur Arora
` (8 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
The default preemption policy for preempt-full under PREEMPT_AUTO is
to minimize latency, and thus to always schedule eagerly. This is
identical to CONFIG_PREEMPT, and so should result in similar
performance.
Comparing scheduling/IPC workload:
# perf stat -a -e cs --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
PREEMPT_AUTO, preempt=full
3,080,508 context-switches ( +- 0.64% )
3.65171 +- 0.00654 seconds time elapsed ( +- 0.18% )
PREEMPT_DYNAMIC, preempt=full
3,087,527 context-switches ( +- 0.33% )
3.60163 +- 0.00633 seconds time elapsed ( +- 0.18% )
Looking at the breakup between voluntary and involuntary
context-switches, we see almost identical behaviour as well.
PREEMPT_AUTO, preempt=full
2087910.00 +- 34720.95 voluntary context-switches ( +- 1.660% )
784437.60 +- 19827.79 involuntary context-switches ( +- 2.520% )
PREEMPT_DYNAMIC, preempt=full
2102879.60 +- 22767.11 voluntary context-switches ( +- 1.080% )
801189.90 +- 21324.18 involuntary context-switches ( +- 2.660% )
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/core.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c3ba33c77053..c25cccc09b65 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1035,9 +1035,10 @@ void wake_up_q(struct wake_q_head *head)
* For preemption models other than PREEMPT_AUTO: always schedule
* eagerly.
*
- * For PREEMPT_AUTO: schedule idle threads eagerly, allow everything else
- * to finish its time quanta, and mark for rescheduling at the next exit
- * to user.
+ * For PREEMPT_AUTO: schedule idle threads eagerly, and under full
+ * preemption, all tasks eagerly. Otherwise, allow everything else
+ * to finish its time quanta, and mark for rescheduling at the next
+ * exit to user.
*/
static resched_t resched_opt_translate(struct task_struct *curr,
enum resched_opt opt)
@@ -1048,6 +1049,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (opt == RESCHED_FORCE)
return RESCHED_NOW;
+ if (preempt_model_preemptible())
+ return RESCHED_NOW;
+
if (is_idle_task(curr))
return RESCHED_NOW;
@@ -8997,7 +9001,9 @@ static void __sched_dynamic_update(int mode)
pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
PREEMPT_MODE);
- preempt_dynamic_mode = preempt_dynamic_undefined;
+ if (mode != preempt_dynamic_mode)
+ pr_info("%s: full\n", PREEMPT_MODE);
+ preempt_dynamic_mode = mode;
break;
}
}
--
2.31.1
* [PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (27 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 28/35] sched: support preempt=full " Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-06-17 3:20 ` Tianchen Ding
2024-05-28 0:35 ` [PATCH v2 30/35] sched: latency warn for TIF_NEED_RESCHED_LAZY Ankur Arora
` (7 subsequent siblings)
36 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
The default preemption policy for voluntary preemption under
PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
class, and lazily for well-behaved, non-idle tasks.
This is the same policy as preempt=none, with an eager handling of
higher priority scheduling classes.
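Taken together with the earlier patches, the flag-selection policy at this point in the series can be consolidated into one sketch. This is an illustrative userspace model under stated assumptions, not the kernel implementation: `struct policy` and `translate()` are hypothetical names, with the fields standing in for the config and boot-time preemption model checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Consolidated sketch of resched-flag selection as of this patch;
 * illustrative only. Field names stand in for the kernel's config
 * and preempt-model checks. */
enum resched_opt {
	RESCHED_DEFAULT,
	RESCHED_FORCE,		/* e.g. resched_cpu(), the RCU hammer */
	RESCHED_TICK,		/* tick-expiry path */
	RESCHED_PRIORITY,	/* wakeup of a higher scheduling class */
};

typedef enum { RESCHED_NOW, RESCHED_LAZY } resched_t;

struct policy {
	bool preempt_auto;	/* CONFIG_PREEMPT_AUTO */
	bool model_full;	/* preempt=full */
	bool model_voluntary;	/* preempt=voluntary */
};

static resched_t translate(const struct policy *p, enum resched_opt opt,
			   bool idle_task, bool lazy_pending)
{
	if (!p->preempt_auto)
		return RESCHED_NOW;
	if (opt == RESCHED_FORCE)
		return RESCHED_NOW;
	if (p->model_full)
		return RESCHED_NOW;
	if (p->model_voluntary && opt == RESCHED_PRIORITY)
		return RESCHED_NOW;
	if (idle_task)
		return RESCHED_NOW;
	/* Lazy bit ignored until the tick: upgrade to eager. */
	if (opt == RESCHED_TICK && lazy_pending)
		return RESCHED_NOW;
	return RESCHED_LAZY;
}
```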
Comparing a cyclictest workload with a background kernel load of
'stress-ng --mmap', shows that both the average and the maximum
latencies improve:
# stress-ng --mmap 0 &
# cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -q -D 300
Min ( %stdev ) Act ( %stdev ) Avg ( %stdev ) Max ( %stdev )
PREEMPT_AUTO, preempt=voluntary 1.73 ( +- 25.43% ) 62.16 ( +- 303.39% ) 14.92 ( +- 17.96% ) 2778.22 ( +- 15.04% )
PREEMPT_DYNAMIC, preempt=voluntary 1.83 ( +- 20.76% ) 253.45 ( +- 233.21% ) 18.70 ( +- 15.88% ) 2992.45 ( +- 15.95% )
The table above shows the aggregated latencies across all CPUs.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/core.c | 12 ++++++++----
kernel/sched/sched.h | 6 ++++++
2 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c25cccc09b65..2bc3ae21a9d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1052,6 +1052,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (preempt_model_preemptible())
return RESCHED_NOW;
+ if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
+ return RESCHED_NOW;
+
if (is_idle_task(curr))
return RESCHED_NOW;
@@ -2289,7 +2292,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
if (p->sched_class == rq->curr->sched_class)
rq->curr->sched_class->wakeup_preempt(rq, p, flags);
else if (sched_class_above(p->sched_class, rq->curr->sched_class))
- resched_curr(rq);
+ resched_curr_priority(rq);
/*
* A queue event has occurred, and we're going to schedule. In
@@ -8989,11 +8992,11 @@ static void __sched_dynamic_update(int mode)
case preempt_dynamic_none:
if (mode != preempt_dynamic_mode)
pr_info("%s: none\n", PREEMPT_MODE);
- preempt_dynamic_mode = mode;
break;
case preempt_dynamic_voluntary:
- preempt_dynamic_mode = preempt_dynamic_undefined;
+ if (mode != preempt_dynamic_mode)
+ pr_info("%s: voluntary\n", PREEMPT_MODE);
break;
case preempt_dynamic_full:
@@ -9003,9 +9006,10 @@ static void __sched_dynamic_update(int mode)
if (mode != preempt_dynamic_mode)
pr_info("%s: full\n", PREEMPT_MODE);
- preempt_dynamic_mode = mode;
break;
}
+
+ preempt_dynamic_mode = mode;
}
#endif /* CONFIG_PREEMPT_AUTO */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 107c5fc2b7bb..ee8e99a9a677 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2468,6 +2468,7 @@ enum resched_opt {
RESCHED_DEFAULT,
RESCHED_FORCE,
RESCHED_TICK,
+ RESCHED_PRIORITY,
};
extern void __resched_curr(struct rq *rq, enum resched_opt opt);
@@ -2482,6 +2483,11 @@ static inline void resched_curr_tick(struct rq *rq)
__resched_curr(rq, RESCHED_TICK);
}
+static inline void resched_curr_priority(struct rq *rq)
+{
+ __resched_curr(rq, RESCHED_PRIORITY);
+}
+
extern void resched_cpu(int cpu);
extern struct rt_bandwidth def_rt_bandwidth;
--
2.31.1
^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: [PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO
2024-05-28 0:35 ` [PATCH v2 29/35] sched: handle preempt=voluntary " Ankur Arora
@ 2024-06-17 3:20 ` Tianchen Ding
2024-06-21 18:58 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Tianchen Ding @ 2024-06-17 3:20 UTC (permalink / raw)
To: Ankur Arora
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot, linux-kernel
On 2024/5/28 08:35, Ankur Arora wrote:
> The default preemption policy for voluntary preemption under
> PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
> class, and lazily for well-behaved, non-idle tasks.
>
> This is the same policy as preempt=none, with an eager handling of
> higher priority scheduling classes.
>
> Comparing a cyclictest workload with a background kernel load of
> 'stress-ng --mmap', shows that both the average and the maximum
> latencies improve:
>
> # stress-ng --mmap 0 &
> # cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -q -D 300
>
> Min ( %stdev ) Act ( %stdev ) Avg ( %stdev ) Max ( %stdev )
>
> PREEMPT_AUTO, preempt=voluntary 1.73 ( +- 25.43% ) 62.16 ( +- 303.39% ) 14.92 ( +- 17.96% ) 2778.22 ( +- 15.04% )
> PREEMPT_DYNAMIC, preempt=voluntary 1.83 ( +- 20.76% ) 253.45 ( +- 233.21% ) 18.70 ( +- 15.88% ) 2992.45 ( +- 15.95% )
>
> The table above shows the aggregated latencies across all CPUs.
>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> kernel/sched/core.c | 12 ++++++++----
> kernel/sched/sched.h | 6 ++++++
> 2 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c25cccc09b65..2bc3ae21a9d0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1052,6 +1052,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
> if (preempt_model_preemptible())
> return RESCHED_NOW;
>
> + if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
> + return RESCHED_NOW;
> +
> if (is_idle_task(curr))
> return RESCHED_NOW;
>
> @@ -2289,7 +2292,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
> if (p->sched_class == rq->curr->sched_class)
> rq->curr->sched_class->wakeup_preempt(rq, p, flags);
> else if (sched_class_above(p->sched_class, rq->curr->sched_class))
> - resched_curr(rq);
> + resched_curr_priority(rq);
>
Besides the higher-class condition, can we also do resched_curr_priority() within the same class?
For example, in the fair class, we could do it for SCHED_NORMAL vs SCHED_IDLE.
Maybe something like:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..eedb70234bdd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8352,6 +8352,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int cse_is_idle, pse_is_idle;
+ enum resched_opt opt = RESCHED_PRIORITY;
if (unlikely(se == pse))
return;
@@ -8385,7 +8386,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
/* Idle tasks are by definition preempted by non-idle tasks. */
if (unlikely(task_has_idle_policy(curr)) &&
likely(!task_has_idle_policy(p)))
- goto preempt;
+ goto preempt; /* RESCHED_PRIORITY */
/*
* Batch and idle tasks do not preempt non-idle tasks (their preemption
@@ -8405,7 +8406,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* in the inverse case).
*/
if (cse_is_idle && !pse_is_idle)
- goto preempt;
+ goto preempt; /* RESCHED_PRIORITY */
if (cse_is_idle != pse_is_idle)
return;
@@ -8415,13 +8416,15 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
/*
* XXX pick_eevdf(cfs_rq) != se ?
*/
- if (pick_eevdf(cfs_rq) == pse)
+ if (pick_eevdf(cfs_rq) == pse) {
+ opt = RESCHED_DEFAULT;
goto preempt;
+ }
return;
preempt:
- resched_curr(rq);
+ __resched_curr(rq, opt);
}
#ifdef CONFIG_SMP
* Re: [PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO
2024-06-17 3:20 ` Tianchen Ding
@ 2024-06-21 18:58 ` Ankur Arora
2024-06-24 2:35 ` Tianchen Ding
0 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-06-21 18:58 UTC (permalink / raw)
To: Tianchen Ding
Cc: Ankur Arora, tglx, peterz, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Ingo Molnar, Vincent Guittot,
linux-kernel
Tianchen Ding <dtcccc@linux.alibaba.com> writes:
> On 2024/5/28 08:35, Ankur Arora wrote:
>> The default preemption policy for voluntary preemption under
>> PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
>> class, and lazily for well-behaved, non-idle tasks.
>> This is the same policy as preempt=none, with an eager handling of
>> higher priority scheduling classes.
>> Comparing a cyclictest workload with a background kernel load of
>> 'stress-ng --mmap', shows that both the average and the maximum
>> latencies improve:
>> # stress-ng --mmap 0 &
>> # cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -q -D 300
>> Min ( %stdev ) Act ( %stdev ) Avg ( %stdev ) Max ( %stdev )
>> PREEMPT_AUTO, preempt=voluntary 1.73 ( +- 25.43% ) 62.16 ( +- 303.39% ) 14.92 ( +- 17.96% ) 2778.22 ( +- 15.04% )
>> PREEMPT_DYNAMIC, preempt=voluntary 1.83 ( +- 20.76% ) 253.45 ( +- 233.21% ) 18.70 ( +- 15.88% ) 2992.45 ( +- 15.95% )
>> The table above shows the aggregated latencies across all CPUs.
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> kernel/sched/core.c | 12 ++++++++----
>> kernel/sched/sched.h | 6 ++++++
>> 2 files changed, 14 insertions(+), 4 deletions(-)
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c25cccc09b65..2bc3ae21a9d0 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1052,6 +1052,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
>> if (preempt_model_preemptible())
>> return RESCHED_NOW;
>> + if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
>> + return RESCHED_NOW;
>> +
>> if (is_idle_task(curr))
>> return RESCHED_NOW;
>> @@ -2289,7 +2292,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
>> if (p->sched_class == rq->curr->sched_class)
>> rq->curr->sched_class->wakeup_preempt(rq, p, flags);
>> else if (sched_class_above(p->sched_class, rq->curr->sched_class))
>> - resched_curr(rq);
>> + resched_curr_priority(rq);
>>
> Besides the conditions about higher class, can we do resched_curr_priority() in the same class?
> For example, in fair class, we can do it when SCHED_NORMAL vs SCHED_IDLE.
So, I agree about the specific case of SCHED_NORMAL vs SCHED_IDLE.
(And, that case is already handled by resched_opt_translate() explicitly
promoting idle tasks to TIF_NEED_RESCHED.)
But, on the general question of doing resched_curr_priority() in the
same class: I did consider it. But, it seemed to me that we want to
keep run to completion semantics for lazy scheduling, and so not
enforcing priority in a scheduling class was a good line.
(Note that resched_curr_priority(), at least as it stands, is going away
for v3. I'll be folding lazy scheduling as a single model under
PREEMPT_DYNAMIC. So, no separate lazy=none, lazy=voluntary.)
Thanks
--
ankur
* Re: [PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO
2024-06-21 18:58 ` Ankur Arora
@ 2024-06-24 2:35 ` Tianchen Ding
2024-06-25 1:12 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Tianchen Ding @ 2024-06-24 2:35 UTC (permalink / raw)
To: Ankur Arora
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot, linux-kernel
On 2024/6/22 02:58, Ankur Arora wrote:
>
> Tianchen Ding <dtcccc@linux.alibaba.com> writes:
>
>> On 2024/5/28 08:35, Ankur Arora wrote:
>>> The default preemption policy for voluntary preemption under
>>> PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
>>> class, and lazily for well-behaved, non-idle tasks.
>>> This is the same policy as preempt=none, with an eager handling of
>>> higher priority scheduling classes.
>>> Comparing a cyclictest workload with a background kernel load of
>>> 'stress-ng --mmap', shows that both the average and the maximum
>>> latencies improve:
>>> # stress-ng --mmap 0 &
>>> # cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -q -D 300
>>> Min ( %stdev ) Act ( %stdev ) Avg ( %stdev ) Max ( %stdev )
>>> PREEMPT_AUTO, preempt=voluntary 1.73 ( +- 25.43% ) 62.16 ( +- 303.39% ) 14.92 ( +- 17.96% ) 2778.22 ( +- 15.04% )
>>> PREEMPT_DYNAMIC, preempt=voluntary 1.83 ( +- 20.76% ) 253.45 ( +- 233.21% ) 18.70 ( +- 15.88% ) 2992.45 ( +- 15.95% )
>>> The table above shows the aggregated latencies across all CPUs.
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Juri Lelli <juri.lelli@redhat.com>
>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>> ---
>>> kernel/sched/core.c | 12 ++++++++----
>>> kernel/sched/sched.h | 6 ++++++
>>> 2 files changed, 14 insertions(+), 4 deletions(-)
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index c25cccc09b65..2bc3ae21a9d0 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -1052,6 +1052,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
>>> if (preempt_model_preemptible())
>>> return RESCHED_NOW;
>>> + if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
>>> + return RESCHED_NOW;
>>> +
>>> if (is_idle_task(curr))
>>> return RESCHED_NOW;
>>> @@ -2289,7 +2292,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
>>> if (p->sched_class == rq->curr->sched_class)
>>> rq->curr->sched_class->wakeup_preempt(rq, p, flags);
>>> else if (sched_class_above(p->sched_class, rq->curr->sched_class))
>>> - resched_curr(rq);
>>> + resched_curr_priority(rq);
>>>
>> Besides the conditions about higher class, can we do resched_curr_priority() in the same class?
>> For example, in fair class, we can do it when SCHED_NORMAL vs SCHED_IDLE.
>
> So, I agree about the specific case of SCHED_NORMAL vs SCHED_IDLE.
> (And, that case is already handled by resched_opt_translate() explicitly
> promoting idle tasks to TIF_NEED_RESCHED.)
>
> But, on the general question of doing resched_curr_priority() in the
> same class: I did consider it. But, it seemed to me that we want to
> keep run to completion semantics for lazy scheduling, and so not
> enforcing priority in a scheduling class was a good line.
>
OK, on the general question, this is just a suggestion :-)
Actually, my key point is about SCHED_IDLE. A SCHED_IDLE task is not a real idle task, but a
normal task with the lowest priority. So the is_idle_task() check in resched_opt_translate()
does not cover it; task_has_idle_policy() should be added.
However, even using task_has_idle_policy() may still not be enough, because of two
properties of the SCHED_IDLE policy:
1. It has the lowest priority, but still belongs to fair_sched_class, the same
class as SCHED_NORMAL.
2. Not only tasks but also *cgroup sched entities* can be SCHED_IDLE (introduced
by commit 304000390f88d).
So for the special case of SCHED_NORMAL vs SCHED_IDLE, I suggest still doing some of the
work in fair.c.
Thanks.
* Re: [PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO
2024-06-24 2:35 ` Tianchen Ding
@ 2024-06-25 1:12 ` Ankur Arora
2024-06-26 2:43 ` Tianchen Ding
0 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-06-25 1:12 UTC (permalink / raw)
To: Tianchen Ding
Cc: Ankur Arora, tglx, peterz, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk, Ingo Molnar, Vincent Guittot,
linux-kernel
Tianchen Ding <dtcccc@linux.alibaba.com> writes:
> On 2024/6/22 02:58, Ankur Arora wrote:
>> Tianchen Ding <dtcccc@linux.alibaba.com> writes:
>>
>>> On 2024/5/28 08:35, Ankur Arora wrote:
>>>> The default preemption policy for voluntary preemption under
>>>> PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
>>>> class, and lazily for well-behaved, non-idle tasks.
>>>> This is the same policy as preempt=none, with an eager handling of
>>>> higher priority scheduling classes.
>>>> Comparing a cyclictest workload with a background kernel load of
>>>> 'stress-ng --mmap', shows that both the average and the maximum
>>>> latencies improve:
>>>> # stress-ng --mmap 0 &
>>>> # cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -q -D 300
>>>> Min ( %stdev ) Act ( %stdev ) Avg ( %stdev ) Max ( %stdev )
>>>> PREEMPT_AUTO, preempt=voluntary 1.73 ( +- 25.43% ) 62.16 ( +- 303.39% ) 14.92 ( +- 17.96% ) 2778.22 ( +- 15.04% )
>>>> PREEMPT_DYNAMIC, preempt=voluntary 1.83 ( +- 20.76% ) 253.45 ( +- 233.21% ) 18.70 ( +- 15.88% ) 2992.45 ( +- 15.95% )
>>>> The table above shows the aggregated latencies across all CPUs.
>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Cc: Juri Lelli <juri.lelli@redhat.com>
>>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>>>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>> ---
>>>> kernel/sched/core.c | 12 ++++++++----
>>>> kernel/sched/sched.h | 6 ++++++
>>>> 2 files changed, 14 insertions(+), 4 deletions(-)
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index c25cccc09b65..2bc3ae21a9d0 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -1052,6 +1052,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
>>>> if (preempt_model_preemptible())
>>>> return RESCHED_NOW;
>>>> + if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
>>>> + return RESCHED_NOW;
>>>> +
>>>> if (is_idle_task(curr))
>>>> return RESCHED_NOW;
>>>> @@ -2289,7 +2292,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
>>>> if (p->sched_class == rq->curr->sched_class)
>>>> rq->curr->sched_class->wakeup_preempt(rq, p, flags);
>>>> else if (sched_class_above(p->sched_class, rq->curr->sched_class))
>>>> - resched_curr(rq);
>>>> + resched_curr_priority(rq);
>>>>
>>> Besides the conditions about higher class, can we do resched_curr_priority() in the same class?
>>> For example, in fair class, we can do it when SCHED_NORMAL vs SCHED_IDLE.
>> So, I agree about the specific case of SCHED_NORMAL vs SCHED_IDLE.
>> (And, that case is already handled by resched_opt_translate() explicitly
>> promoting idle tasks to TIF_NEED_RESCHED.)
>> But, on the general question of doing resched_curr_priority() in the
>> same class: I did consider it. But, it seemed to me that we want to
>> keep run to completion semantics for lazy scheduling, and so not
>> enforcing priority in a scheduling class was a good line.
>>
>
> OK, on general question, this is just a suggestion :-)
>
> Actually, my key point is about SCHED_IDLE. It's not a real idle task, but a
> normal task with lowest priority. So is_idle_task() in resched_opt_translate()
> does not fit it. Should add task_has_idle_policy().
>
> However, even using task_has_idle_policy() may be still not enough. Because
> SCHED_IDLE policy:
> 1. It is the lowest priority, but still belongs to fair_sched_class, which is
> the same as SCHED_NORMAL.
> 2. Not only tasks, *se of cgroup* can be SCHED_IDLE, too. (introduced by
> commit 304000390f88d)
Thanks. That is useful to know. Let me see how best to incorporate this.
Side question: are there any benchmarks that would exercise various types
of sched policy, idle and otherwise?
--
ankur
* Re: [PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO
2024-06-25 1:12 ` Ankur Arora
@ 2024-06-26 2:43 ` Tianchen Ding
0 siblings, 0 replies; 95+ messages in thread
From: Tianchen Ding @ 2024-06-26 2:43 UTC (permalink / raw)
To: Ankur Arora
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ingo Molnar, Vincent Guittot, linux-kernel
On 2024/6/25 09:12, Ankur Arora wrote:
>
> Side question: are there any benchmarks that would exercise various types
> of sched policy, idle and otherwise?
>
AFAIK, no.
You may need to combine workloads with different sched policies set manually...
Or let's see if anyone else can offer suggestions.
Thanks.
* [PATCH v2 30/35] sched: latency warn for TIF_NEED_RESCHED_LAZY
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (28 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 29/35] sched: handle preempt=voluntary " Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 31/35] tracing: support lazy resched Ankur Arora
` (6 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Ingo Molnar, Vincent Guittot
resched_latency_warn() now also warns if TIF_NEED_RESCHED_LAZY is set
without rescheduling for more than the latency_warn_ms period.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/core.c | 2 +-
kernel/sched/debug.c | 7 +++++--
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2bc3ae21a9d0..4f0ac90b7d47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5673,7 +5673,7 @@ static u64 cpu_resched_latency(struct rq *rq)
if (sysctl_resched_latency_warn_once && warned_once)
return 0;
- if (!need_resched() || !latency_warn_ms)
+ if ((!need_resched() && !need_resched_lazy()) || !latency_warn_ms)
return 0;
if (system_state == SYSTEM_BOOTING)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e53f1b73bf4a..a1be40101844 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1114,9 +1114,12 @@ void proc_sched_set_task(struct task_struct *p)
void resched_latency_warn(int cpu, u64 latency)
{
static DEFINE_RATELIMIT_STATE(latency_check_ratelimit, 60 * 60 * HZ, 1);
+ char *nr;
+
+ nr = __tif_need_resched(RESCHED_NOW) ? "need_resched" : "need_resched_lazy";
WARN(__ratelimit(&latency_check_ratelimit),
- "sched: CPU %d need_resched set for > %llu ns (%d ticks) "
+ "sched: CPU %d %s set for > %llu ns (%d ticks) "
"without schedule\n",
- cpu, latency, cpu_rq(cpu)->ticks_without_resched);
+ cpu, nr, latency, cpu_rq(cpu)->ticks_without_resched);
}
--
2.31.1
* [PATCH v2 31/35] tracing: support lazy resched
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (29 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 30/35] sched: latency warn for TIF_NEED_RESCHED_LAZY Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 32/35] Documentation: tracing: add TIF_NEED_RESCHED_LAZY Ankur Arora
` (5 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Masami Hiramatsu
trace_entry::flags is full, so reuse the TRACE_FLAG_IRQS_NOSUPPORT
bit for this. The flag is safe to reuse since it is only used in
old archs that don't support lockdep irq tracing.
Also, now that we have a variety of need-resched combinations, document
these in the tracing headers.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/linux/trace_events.h | 6 +++---
kernel/trace/trace.c | 28 ++++++++++++++++++----------
kernel/trace/trace_output.c | 16 ++++++++++++++--
3 files changed, 35 insertions(+), 15 deletions(-)
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 6f9bdfb09d1d..329002785b4d 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -184,7 +184,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status);
enum trace_flag_type {
TRACE_FLAG_IRQS_OFF = 0x01,
- TRACE_FLAG_IRQS_NOSUPPORT = 0x02,
+ TRACE_FLAG_NEED_RESCHED_LAZY = 0x02,
TRACE_FLAG_NEED_RESCHED = 0x04,
TRACE_FLAG_HARDIRQ = 0x08,
TRACE_FLAG_SOFTIRQ = 0x10,
@@ -211,11 +211,11 @@ static inline unsigned int tracing_gen_ctx(void)
static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
{
- return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+ return tracing_gen_ctx_irq_test(0);
}
static inline unsigned int tracing_gen_ctx(void)
{
- return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+ return tracing_gen_ctx_irq_test(0);
}
#endif
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ed229527be05..7941e9ec979a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2513,6 +2513,8 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)
if (__tif_need_resched(RESCHED_NOW))
trace_flags |= TRACE_FLAG_NEED_RESCHED;
+ if (__tif_need_resched(RESCHED_LAZY))
+ trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
if (test_preempt_need_resched())
trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
@@ -4096,17 +4098,23 @@ unsigned long trace_total_entries(struct trace_array *tr)
return entries;
}
+#ifdef CONFIG_PREEMPT_AUTO
+#define NR_LEGEND "l: lazy, n: now, p: preempt, b: l|n, L: l|p, N: n|p, B: l|n|p"
+#else
+#define NR_LEGEND "n: now, p: preempt, N: n|p"
+#endif
+
static void print_lat_help_header(struct seq_file *m)
{
- seq_puts(m, "# _------=> CPU# \n"
- "# / _-----=> irqs-off/BH-disabled\n"
- "# | / _----=> need-resched \n"
- "# || / _---=> hardirq/softirq \n"
- "# ||| / _--=> preempt-depth \n"
- "# |||| / _-=> migrate-disable \n"
- "# ||||| / delay \n"
- "# cmd pid |||||| time | caller \n"
- "# \\ / |||||| \\ | / \n");
+ seq_printf(m, "# _------=> CPU# \n"
+ "# / _-----=> irqs-off/BH-disabled\n"
+ "# | / _----=> need-resched ( %s ) \n"
+ "# || / _---=> hardirq/softirq \n"
+ "# ||| / _--=> preempt-depth \n"
+ "# |||| / _-=> migrate-disable \n"
+ "# ||||| / delay \n"
+ "# cmd pid |||||| time | caller \n"
+ "# \\ / |||||| \\ | / \n", NR_LEGEND);
}
static void print_event_info(struct array_buffer *buf, struct seq_file *m)
@@ -4141,7 +4149,7 @@ static void print_func_help_header_irq(struct array_buffer *buf, struct seq_file
print_event_info(buf, m);
seq_printf(m, "# %.*s _-----=> irqs-off/BH-disabled\n", prec, space);
- seq_printf(m, "# %.*s / _----=> need-resched\n", prec, space);
+ seq_printf(m, "# %.*s / _----=> need-resched ( %s )\n", prec, space, NR_LEGEND);
seq_printf(m, "# %.*s| / _---=> hardirq/softirq\n", prec, space);
seq_printf(m, "# %.*s|| / _--=> preempt-depth\n", prec, space);
seq_printf(m, "# %.*s||| / _-=> migrate-disable\n", prec, space);
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..4f58a196e14c 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
bh_off ? 'b' :
- (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
+ !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
'.';
- switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
+ switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
TRACE_FLAG_PREEMPT_RESCHED)) {
+ case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+ need_resched = 'B';
+ break;
case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
need_resched = 'N';
break;
+ case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+ need_resched = 'L';
+ break;
+ case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
+ need_resched = 'b';
+ break;
case TRACE_FLAG_NEED_RESCHED:
need_resched = 'n';
break;
+ case TRACE_FLAG_NEED_RESCHED_LAZY:
+ need_resched = 'l';
+ break;
case TRACE_FLAG_PREEMPT_RESCHED:
need_resched = 'p';
break;
--
2.31.1
* [PATCH v2 32/35] Documentation: tracing: add TIF_NEED_RESCHED_LAZY
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (30 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 31/35] tracing: support lazy resched Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 33/35] osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y Ankur Arora
` (4 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Masami Hiramatsu, Jonathan Corbet
Document various combinations of resched flags.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
Documentation/trace/ftrace.rst | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 7e7b8ec17934..7f20c0bae009 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1036,8 +1036,12 @@ explains which is which.
be printed here.
need-resched:
- - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
+ - 'B' all three, TIF_NEED_RESCHED, TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
+ - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set,
+ - 'L' both TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
+ - 'b' both TIF_NEED_RESCHED and TIF_NEED_RESCHED_LAZY are set,
- 'n' only TIF_NEED_RESCHED is set,
+ - 'l' only TIF_NEED_RESCHED_LAZY is set,
- 'p' only PREEMPT_NEED_RESCHED is set,
- '.' otherwise.
--
2.31.1
* [PATCH v2 33/35] osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (31 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 32/35] Documentation: tracing: add TIF_NEED_RESCHED_LAZY Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 13:12 ` Daniel Bristot de Oliveira
2024-05-28 0:35 ` [PATCH v2 34/35] kconfig: decompose ARCH_NO_PREEMPT Ankur Arora
` (3 subsequent siblings)
36 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Daniel Bristot de Oliveira
To reduce RCU noise for nohz_full configurations, osnoise depends
on cond_resched() providing quiescent states for PREEMPT_RCU=n
configurations. For PREEMPT_RCU=y configurations, it does this
by directly calling rcu_momentary_dyntick_idle().
With PREEMPT_AUTO=y, however, we can have configurations with
(PREEMPTION=y, PREEMPT_RCU=n), which means neither of the above can
help.
Handle that by falling back to explicit quiescent states via
rcu_momentary_dyntick_idle().
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/trace/trace_osnoise.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index a8e28f9b9271..88d2cd2593c4 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1532,18 +1532,20 @@ static int run_osnoise(void)
/*
* In some cases, notably when running on a nohz_full CPU with
* a stopped tick PREEMPT_RCU has no way to account for QSs.
- * This will eventually cause unwarranted noise as PREEMPT_RCU
- * will force preemption as the means of ending the current
- * grace period. We avoid this problem by calling
- * rcu_momentary_dyntick_idle(), which performs a zero duration
- * EQS allowing PREEMPT_RCU to end the current grace period.
- * This call shouldn't be wrapped inside an RCU critical
- * section.
+ * This will eventually cause unwarranted noise as RCU forces
+ * preemption as the means of ending the current grace period.
+ * We avoid this by calling rcu_momentary_dyntick_idle(),
+ * which performs a zero duration EQS allowing RCU to end the
+ * current grace period. This call shouldn't be wrapped inside
+ * an RCU critical section.
*
- * Note that in non PREEMPT_RCU kernels QSs are handled through
- * cond_resched()
+ * For non-PREEMPT_RCU kernels with cond_resched() (non-
+ * PREEMPT_AUTO configurations), QSs are handled through
+ * cond_resched(). For PREEMPT_AUTO kernels, we fallback to the
+ * zero duration QS via rcu_momentary_dyntick_idle().
*/
- if (IS_ENABLED(CONFIG_PREEMPT_RCU)) {
+ if (IS_ENABLED(CONFIG_PREEMPT_RCU) ||
+ (!IS_ENABLED(CONFIG_PREEMPT_RCU) && IS_ENABLED(CONFIG_PREEMPTION))) {
if (!disable_irq)
local_irq_disable();
--
2.31.1
* Re: [PATCH v2 33/35] osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y
2024-05-28 0:35 ` [PATCH v2 33/35] osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y Ankur Arora
@ 2024-05-28 13:12 ` Daniel Bristot de Oliveira
0 siblings, 0 replies; 95+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-05-28 13:12 UTC (permalink / raw)
To: Ankur Arora, linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk
On 5/28/24 02:35, Ankur Arora wrote:
> To reduce RCU noise for nohz_full configurations, osnoise depends
> on cond_resched() providing quiescent states for PREEMPT_RCU=n
> configurations. And, for PREEMPT_RCU=y configurations does this
> by directly calling rcu_momentary_dyntick_idle().
>
> With PREEMPT_AUTO=y, however, we can have configurations with
> (PREEMPTION=y, PREEMPT_RCU=n), which means neither of the above can
> help.
>
> Handle that by fallback to the explicit quiescent states via
> rcu_momentary_dyntick_idle().
>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Thanks
-- Daniel
* [PATCH v2 34/35] kconfig: decompose ARCH_NO_PREEMPT
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (32 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 33/35] osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-28 0:35 ` [PATCH v2 35/35] arch: " Ankur Arora
` (2 subsequent siblings)
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Andy Lutomirski, Andrew Morton,
Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
Brian Cain, Geert Uytterhoeven, Richard Weinberger, Anton Ivanov,
Johannes Berg
We have two sets of preemption points where we evaluate preempt_count
to decide if scheduling is safe:
- preempt_enable()
- irqentry_exit_cond_resched()
Architectures with untested/incomplete preempt_count support define
ARCH_NO_PREEMPT.
Now, if an architecture wants to support preemption, ensuring that
preemption at irqentry_exit_cond_resched() is safe is difficult --
preemption can happen at any arbitrary location where preempt_count
evaluates to zero.
The preempt_enable() case -- whether executed in common code or in
architecture-specific code -- mostly requires verification of the local
context and can be done piecewise.
So, decompose ARCH_NO_PREEMPT into ARCH_NO_PREEMPT_IRQ and
ARCH_NO_PREEMPT_TOGGLE, with the first stating that the architecture
code has possibly incomplete preempt_count accounting and thus
preempting in irqentry_exit_cond_resched() is unsafe, and the second
stating that the preempt_count accounting is incorrect so preempting
anywhere is unsafe.
ARCH_NO_PREEMPT now depends only on ARCH_NO_PREEMPT_TOGGLE.
Additionally, only invoke irqentry_exit_cond_resched() if
ARCH_NO_PREEMPT_IRQ=n.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/Kconfig | 7 +++++++
kernel/entry/common.c | 3 ++-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 30f7930275d8..dc09306aeca0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1295,8 +1295,15 @@ config COMPAT_32BIT_TIME
This is relevant on all 32-bit architectures, and 64-bit architectures
as part of compat syscall handling.
+config ARCH_NO_PREEMPT_IRQ
+ bool
+
+config ARCH_NO_PREEMPT_TOGGLE
+ bool
+
config ARCH_NO_PREEMPT
bool
+ default y if ARCH_NO_PREEMPT_TOGGLE
config ARCH_SUPPORTS_RT
bool
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index c684385921de..b18175961374 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -359,7 +359,8 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
}
instrumentation_begin();
- if (IS_ENABLED(CONFIG_PREEMPTION))
+ if (IS_ENABLED(CONFIG_PREEMPTION) &&
+ !IS_ENABLED(CONFIG_ARCH_NO_PREEMPT_IRQ))
irqentry_exit_cond_resched();
/* Covers both tracing and lockdep */
--
2.31.1
* [PATCH v2 35/35] arch: decompose ARCH_NO_PREEMPT
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (33 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 34/35] kconfig: decompose ARCH_NO_PREEMPT Ankur Arora
@ 2024-05-28 0:35 ` Ankur Arora
2024-05-29 6:16 ` [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Shrikanth Hegde
2024-06-05 15:44 ` Sean Christopherson
36 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-05-28 0:35 UTC (permalink / raw)
To: linux-kernel
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, sshegde, boris.ostrovsky,
konrad.wilk, Ankur Arora, Richard Henderson, Ivan Kokshaysky,
Matt Turner, Brian Cain, Geert Uytterhoeven, Richard Weinberger,
Anton Ivanov, Johannes Berg
Now that ARCH_NO_PREEMPT is conditioned only on ARCH_NO_PREEMPT_TOGGLE,
switch the architectures that select ARCH_NO_PREEMPT over to selecting
ARCH_NO_PREEMPT_IRQ and ARCH_NO_PREEMPT_TOGGLE instead. This allows
architecture code to selectively enable one or the other.
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/alpha/Kconfig | 3 ++-
arch/hexagon/Kconfig | 3 ++-
arch/m68k/Kconfig | 3 ++-
arch/um/Kconfig | 3 ++-
4 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 3afd042150f8..7fd1d9dcad8d 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -6,7 +6,8 @@ config ALPHA
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
- select ARCH_NO_PREEMPT
+ select ARCH_NO_PREEMPT_TOGGLE
+ select ARCH_NO_PREEMPT_IRQ
select ARCH_NO_SG_CHAIN
select ARCH_USE_CMPXCHG_LOCKREF
select DMA_OPS if PCI
diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig
index e233b5efa276..3a33a26e1b81 100644
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -6,7 +6,8 @@ config HEXAGON
def_bool y
select ARCH_32BIT_OFF_T
select ARCH_HAS_SYNC_DMA_FOR_DEVICE
- select ARCH_NO_PREEMPT
+ select ARCH_NO_PREEMPT_TOGGLE
+ select ARCH_NO_PREEMPT_IRQ
select ARCH_WANT_FRAME_POINTERS
select DMA_GLOBAL_POOL
select HAVE_PAGE_SIZE_4KB
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index 6ffa29585194..3f7d675849ed 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -11,7 +11,8 @@ config M68K
select ARCH_HAS_SYNC_DMA_FOR_DEVICE if M68K_NONCOHERENT_DMA
select ARCH_HAVE_NMI_SAFE_CMPXCHG if RMW_INSNS
select ARCH_MIGHT_HAVE_PC_PARPORT if ISA
- select ARCH_NO_PREEMPT if !COLDFIRE
+ select ARCH_NO_PREEMPT_TOGGLE if !COLDFIRE
+ select ARCH_NO_PREEMPT_IRQ if !COLDFIRE
select ARCH_USE_MEMTEST if MMU_MOTOROLA
select ARCH_WANT_IPC_PARSE_VERSION
select BINFMT_FLAT_ARGVP_ENVP_ON_STACK
diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 93a5a8999b07..390328e97261 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -11,7 +11,8 @@ config UML
select ARCH_HAS_KCOV
select ARCH_HAS_STRNCPY_FROM_USER
select ARCH_HAS_STRNLEN_USER
- select ARCH_NO_PREEMPT
+ select ARCH_NO_PREEMPT_TOGGLE
+ select ARCH_NO_PREEMPT_IRQ
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_KASAN if X86_64
select HAVE_ARCH_KASAN_VMALLOC if HAVE_ARCH_KASAN
--
2.31.1
^ permalink raw reply related [flat|nested] 95+ messages in thread* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (34 preceding siblings ...)
2024-05-28 0:35 ` [PATCH v2 35/35] arch: " Ankur Arora
@ 2024-05-29 6:16 ` Shrikanth Hegde
2024-06-01 11:47 ` Ankur Arora
2024-06-05 15:44 ` Sean Christopherson
36 siblings, 1 reply; 95+ messages in thread
From: Shrikanth Hegde @ 2024-05-29 6:16 UTC (permalink / raw)
To: Ankur Arora
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, boris.ostrovsky, konrad.wilk,
LKML
On 5/28/24 6:04 AM, Ankur Arora wrote:
> Hi,
>
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.
>
> The series is based on Thomas' original proposal which he outlined
> in [1], [2] and in his PoC [3].
>
> v2 mostly reworks v1, with one of the main changes having less
> noisy need-resched-lazy related interfaces.
> More details in the changelog below.
>
Hi Ankur. Thanks for the series.
nit: had to manually apply patches 11, 12, and 13 since they didn't apply
cleanly on tip/master and tip/sched/core. Mostly due to some word
differences in the change.
tip/master was at:
commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
Merge: 5d145493a139 47ff30cc1be7
Author: Ingo Molnar <mingo@kernel.org>
Date: Tue May 28 12:44:26 2024 +0200
Merge branch into tip/master: 'x86/percpu'
> The v1 of the series is at [4] and the RFC at [5].
>
> Design
> ==
>
> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
> PREEMPT_COUNT). This means that the scheduler can always safely
> preempt. (This is identical to CONFIG_PREEMPT.)
>
> Having that, the next step is to make the rescheduling policy dependent
> on the chosen scheduling model. Currently, the scheduler uses a single
> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
> reschedule is needed.
> PREEMPT_AUTO extends this by adding an additional need-resched bit
> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
> scheduler to express two kinds of rescheduling intent: schedule at
> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
> rescheduling while allowing the task on the runqueue to run to
> timeslice completion (TIF_NEED_RESCHED_LAZY).
>
> The scheduler decides which need-resched bits are chosen based on
> the preemption model in use:
>
> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY
>
> none never always [*]
> voluntary higher sched class other tasks [*]
> full always never
>
> [*] some details elided.
>
> The last part of the puzzle is, when does preemption happen, or
> alternately stated, when are the need-resched bits checked:
>
> exit-to-user ret-to-kernel preempt_count()
>
> NEED_RESCHED_LAZY Y N N
> NEED_RESCHED Y Y Y
>
> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
> none/voluntary preemption policies are in effect. And eager semantics
> under full preemption.
>
> In addition, since this is driven purely by the scheduler (not
> depending on cond_resched() placement and the like), there is enough
> flexibility in the scheduler to cope with edge cases -- ex. a kernel
> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
> simply upgrading to a full NEED_RESCHED which can use more coercive
> instruments like resched IPI to induce a context-switch.
>
> Performance
> ==
> The performance in the basic tests (perf bench sched messaging, kernbench,
> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
> (See patches
> "sched: support preempt=none under PREEMPT_AUTO"
> "sched: support preempt=full under PREEMPT_AUTO"
> "sched: handle preempt=voluntary under PREEMPT_AUTO")
>
> For a macro test, a colleague in Oracle's Exadata team tried two
> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
> backported.)
>
> In both tests the data was cached on remote nodes (cells), and the
> database nodes (compute) served client queries, with clients being
> local in the first test and remote in the second.
>
> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>
>
> PREEMPT_VOLUNTARY PREEMPT_AUTO
> (preempt=voluntary)
> ============================== =============================
> clients throughput cpu-usage throughput cpu-usage Gain
> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %)
> ------- ---------- ----------------- ---------- ----------------- -------
>
>
> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7%
> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6%
> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8%
>
>
> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5%
> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6%
> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3%
> 90/10 RW ratio)
>
>
> (Both sets of tests have a fair amount of NW traffic since the query
> tables etc are cached on the cells. Additionally, the first set,
> given the local clients, stress the scheduler a bit more than the
> second.)
>
> The comparative performance for both the tests is fairly close,
> more or less within a margin of error.
>
> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
>
> "
> a) Base kernel (6.7),
> b) v1, PREEMPT_AUTO, preempt=voluntary
> c) v1, PREEMPT_DYNAMIC, preempt=voluntary
> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>
> Workloads I tested and their %gain,
> case b case c case d
> NAS +2.7% +1.9% +2.1%
> Hashjoin, +0.0% +0.0% +0.0%
> Graph500, -6.0% +0.0% +0.0%
> XSBench +1.7% +0.0% +1.2%
>
> (Note about the Graph500 numbers at [8].)
>
> Did kernbench etc test from Mel's mmtests suite also. Did not notice
> much difference.
> "
>
> One case where there is a significant performance drop is on powerpc,
> seen running hackbench on a 320 core system (a test on a smaller system is
> fine.) In theory there's no reason for this to only happen on powerpc
> since most of the code is common, but I haven't been able to reproduce
> it on x86 so far.
>
> All in all, I think the tests above show that this scheduling model has legs.
> However, the none/voluntary models under PREEMPT_AUTO are conceptually
> different enough from the current none/voluntary models that there
> likely are workloads where performance would be subpar. That needs more
> extensive testing to figure out the weak points.
>
>
>
Tested it again on PowerPC. Unfortunately the numbers show there is still a
regression compared to 6.10-rc1. This is done with preempt=none. I tried again
on the smaller system too to confirm. For now I have done the comparison for
hackbench, where the highest regression was seen in v1.
perf stat collected for 20 iterations shows higher context switches and higher
migrations. Could it be that the LAZY bit is causing more context switches? Or
could it be something else? Could it be that more exit-to-user transitions
happen on PowerPC? Will continue to debug.
Meanwhile, will do more tests with other micro-benchmarks and post the results.
More details below.
CONFIG_HZ = 100
./hackbench -pipe 60 process 100000 loops
====================================================================================
On the larger system. (40 Cores, 320CPUS)
====================================================================================
6.10-rc1 +preempt_auto
preempt=none preempt=none
20 iterations avg value
hackbench pipe(60) 26.403 32.368 ( -31.1%)
++++++++++++++++++
baseline 6.10-rc1:
++++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% )
6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% )
246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% )
1,759,232 page-faults # 19.716 /sec ( +- 6.61% )
577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% )
166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% )
26.403 +- 0.166 seconds time elapsed ( +- 0.63% )
++++++++++++
preempt auto
++++++++++++
Performance counter stats for 'system wide' (20 runs):
207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% )
9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% )
631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% )
1,756,583 page-faults # 16.112 /sec ( +- 6.59% )
700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )
32.368 +- 0.200 seconds time elapsed ( +- 0.62% )
============================================================================================
Smaller system ( 12Cores, 96CPUS)
============================================================================================
6.10-rc1 +preempt_auto
preempt=none preempt=none
20 iterations avg value
hackbench pipe(60) 55.930 65.75 ( -17.6%)
++++++++++++++++++
baseline 6.10-rc1:
++++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% )
1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% )
44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% )
1,698,710 page-faults # 30.010 /sec ( +- 6.58% )
412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% )
99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% )
55.930 +- 0.509 seconds time elapsed ( +- 0.91% )
+++++++++++++++++
v2_preempt_auto
+++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% )
2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% )
147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% )
1,710,637 page-faults # 25.593 /sec ( +- 6.55% )
483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% )
134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% )
65.75 +- 1.06 seconds time elapsed ( +- 1.61% )
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-05-29 6:16 ` [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Shrikanth Hegde
@ 2024-06-01 11:47 ` Ankur Arora
2024-06-04 7:32 ` Shrikanth Hegde
0 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-06-01 11:47 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Ankur Arora, tglx, peterz, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, boris.ostrovsky,
konrad.wilk, LKML
Shrikanth Hegde <sshegde@linux.ibm.com> writes:
> On 5/28/24 6:04 AM, Ankur Arora wrote:
>> Hi,
>>
>> This series adds a new scheduling model PREEMPT_AUTO, which like
>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
>> on explicit preemption points for the voluntary models.
>>
>> The series is based on Thomas' original proposal which he outlined
>> in [1], [2] and in his PoC [3].
>>
>> v2 mostly reworks v1, with one of the main changes having less
>> noisy need-resched-lazy related interfaces.
>> More details in the changelog below.
>>
>
> Hi Ankur. Thanks for the series.
>
> nit: had to manually patch 11,12,13 since it didnt apply cleanly on
> tip/master and tip/sched/core. Mostly due some word differences in the change.
>
> tip/master was at:
> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
> Merge: 5d145493a139 47ff30cc1be7
> Author: Ingo Molnar <mingo@kernel.org>
> Date: Tue May 28 12:44:26 2024 +0200
>
> Merge branch into tip/master: 'x86/percpu'
>
>
>
>> [...]
> Did test it again on PowerPC. Unfortunately numbers shows there is regression
> still compared to 6.10-rc1. This is done with preempt=none. I tried again on the
> smaller system too to confirm. For now I have done the comparison for the hackbench
> where highest regression was seen in v1.
>
> perf stat collected for 20 iterations show higher context switch and higher migrations.
> Could it be that LAZY bit is causing more context switches? or could it be something
> else? Could it be that more exit-to-user happens in PowerPC? will continue to debug.
Thanks for trying it out.
As you point out, context-switches and migrations are significantly higher.
Definitely unexpected. I ran the same test on an x86 box
(Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.
6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )
6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )
Clearly something different is happening on powerpc. I'm travelling
right now, but will dig deeper into this once I get back.
Meanwhile can you check if the increased context-switches are voluntary or
involuntary (or what the division is)?
Thanks
Ankur
> Meanwhile, will do more test with other micro-benchmarks and post the results.
>
>
> More details below.
> CONFIG_HZ = 100
> ./hackbench -pipe 60 process 100000 loops
>
> ====================================================================================
> On the larger system. (40 Cores, 320CPUS)
> ====================================================================================
> 6.10-rc1 +preempt_auto
> preempt=none preempt=none
> 20 iterations avg value
> hackbench pipe(60) 26.403 32.368 ( -31.1%)
>
> ++++++++++++++++++
> baseline 6.10-rc1:
> ++++++++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% )
> 6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% )
> 246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% )
> 1,759,232 page-faults # 19.716 /sec ( +- 6.61% )
> 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
> 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
> 37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% )
> 166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% )
>
> 26.403 +- 0.166 seconds time elapsed ( +- 0.63% )
>
> ++++++++++++
> preempt auto
> ++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% )
> 9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% )
> 631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% )
> 1,756,583 page-faults # 16.112 /sec ( +- 6.59% )
> 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
> 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
> 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
> 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )
>
> 32.368 +- 0.200 seconds time elapsed ( +- 0.62% )
>
>
> ============================================================================================
> Smaller system ( 12Cores, 96CPUS)
> ============================================================================================
> 6.10-rc1 +preempt_auto
> preempt=none preempt=none
> 20 iterations avg value
> hackbench pipe(60) 55.930 65.75 ( -17.6%)
>
> ++++++++++++++++++
> baseline 6.10-rc1:
> ++++++++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% )
> 1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% )
> 44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% )
> 1,698,710 page-faults # 30.010 /sec ( +- 6.58% )
> 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
> 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
> 30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% )
> 99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% )
>
> 55.930 +- 0.509 seconds time elapsed ( +- 0.91% )
>
>
> +++++++++++++++++
> v2_preempt_auto
> +++++++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% )
> 2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% )
> 147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% )
> 1,710,637 page-faults # 25.593 /sec ( +- 6.55% )
> 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
> 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
> 33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% )
> 134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% )
>
> 65.75 +- 1.06 seconds time elapsed ( +- 1.61% )
So, the context-switches are meaningfully higher.
--
ankur
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-06-01 11:47 ` Ankur Arora
@ 2024-06-04 7:32 ` Shrikanth Hegde
2024-06-07 16:48 ` Shrikanth Hegde
0 siblings, 1 reply; 95+ messages in thread
From: Shrikanth Hegde @ 2024-06-04 7:32 UTC (permalink / raw)
To: Ankur Arora
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, boris.ostrovsky, konrad.wilk,
LKML
On 6/1/24 5:17 PM, Ankur Arora wrote:
>
> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>
>> On 5/28/24 6:04 AM, Ankur Arora wrote:
>>> Hi,
>>>
>>> This series adds a new scheduling model PREEMPT_AUTO, which like
>>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
>>> on explicit preemption points for the voluntary models.
>>>
>>> The series is based on Thomas' original proposal which he outlined
>>> in [1], [2] and in his PoC [3].
>>>
>>> v2 mostly reworks v1, with one of the main changes having less
>>> noisy need-resched-lazy related interfaces.
>>> More details in the changelog below.
>>>
>>
>> Hi Ankur. Thanks for the series.
>>
>> nit: had to manually patch 11,12,13 since it didnt apply cleanly on
>> tip/master and tip/sched/core. Mostly due some word differences in the change.
>>
>> tip/master was at:
>> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
>> Merge: 5d145493a139 47ff30cc1be7
>> Author: Ingo Molnar <mingo@kernel.org>
>> Date: Tue May 28 12:44:26 2024 +0200
>>
>> Merge branch into tip/master: 'x86/percpu'
>>
>>
>>
>>> [...]
>>> Performance
>>> ==
>>> The performance in the basic tests (perf bench sched messaging, kernbench,
>>> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
>>> (See patches
>>> "sched: support preempt=none under PREEMPT_AUTO"
>>> "sched: support preempt=full under PREEMPT_AUTO"
>>> "sched: handle preempt=voluntary under PREEMPT_AUTO")
>>>
>>> For a macro test, a colleague in Oracle's Exadata team tried two
>>> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
>>> backported.)
>>>
>>> In both tests the data was cached on remote nodes (cells), and the
>>> database nodes (compute) served client queries, with clients being
>>> local in the first test and remote in the second.
>>>
>>> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
>>> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>>>
>>>
>>> PREEMPT_VOLUNTARY PREEMPT_AUTO
>>> (preempt=voluntary)
>>> ============================== =============================
>>> clients throughput cpu-usage throughput cpu-usage Gain
>>> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %)
>>> ------- ---------- ----------------- ---------- ----------------- -------
>>>
>>>
>>> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7%
>>> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6%
>>> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8%
>>>
>>>
>>> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5%
>>> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6%
>>> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3%
>>> 90/10 RW ratio)
>>>
>>>
>>> (Both sets of tests have a fair amount of NW traffic since the query
>>> tables etc are cached on the cells. Additionally, the first set,
>>> given the local clients, stress the scheduler a bit more than the
>>> second.)
>>>
>>> The comparative performance for both the tests is fairly close,
>>> more or less within a margin of error.
>>>
>>> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
>>>
>>> "
>>> a) Base kernel (6.7),
>>> b) v1, PREEMPT_AUTO, preempt=voluntary
>>> c) v1, PREEMPT_DYNAMIC, preempt=voluntary
>>> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>>>
>>> Workloads I tested and their %gain,
>>> case b case c case d
>>> NAS +2.7% +1.9% +2.1%
>>> Hashjoin, +0.0% +0.0% +0.0%
>>> Graph500, -6.0% +0.0% +0.0%
>>> XSBench +1.7% +0.0% +1.2%
>>>
>>> (Note about the Graph500 numbers at [8].)
>>>
>>> Did kernbench etc test from Mel's mmtests suite also. Did not notice
>>> much difference.
>>> "
>>>
>>> One case where there is a significant performance drop is on powerpc,
>>> seen running hackbench on a 320 core system (a test on a smaller system is
>>> fine.) In theory there's no reason for this to only happen on powerpc
>>> since most of the code is common, but I haven't been able to reproduce
>>> it on x86 so far.
>>>
>>> All in all, I think the tests above show that this scheduling model has legs.
>>> However, the none/voluntary models under PREEMPT_AUTO are conceptually
>>> different enough from the current none/voluntary models that there
>>> likely are workloads where performance would be subpar. That needs more
>>> extensive testing to figure out the weak points.
>>>
>>>
>>>
>> Did test it again on PowerPC. Unfortunately the numbers show there is still a
>> regression compared to 6.10-rc1. This is done with preempt=none. I tried again on the
>> smaller system too to confirm. For now I have done the comparison for hackbench,
>> where the highest regression was seen in v1.
>>
>> perf stat collected over 20 iterations shows higher context switches and higher migrations.
>> Could it be that the LAZY bit is causing more context switches? Or could it be something
>> else? Could it be that more exit-to-user transitions happen on PowerPC? Will continue to debug.
>
> Thanks for trying it out.
>
> As you point out, context-switches and migrations are significantly higher.
>
> Definitely unexpected. I ran the same test on an x86 box
> (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.
>
> 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
> 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
> 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )
>
> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )
>
> Clearly there's something different going on powerpc. I'm travelling
> right now, but will dig deeper into this once I get back.
>
> Meanwhile can you check if the increased context-switches are voluntary or
> involuntary (or what the division is)?
Used "pidstat -w -p ALL 1 10" to capture 10 seconds of data at 1-second intervals for
context switches per second while running "hackbench -pipe 60 process 100000 loops".
preempt=none 6.10 preempt_auto
=============================================================================
voluntary context switches 7632166.19 9391636.34(+23%)
involuntary context switches 2305544.07 3527293.94(+53%)
Numbers vary between runs, but the trend seems similar: both kinds of context switches increase, and
the involuntary ones seem to increase at a higher rate.
BTW, ran Unixbench as well; it shows a slight regression. stress-ng numbers didn't seem conclusive.
schbench (old) showed slightly lower latency when the number of threads was low, and higher tail
latency at higher thread counts. But the numbers don't seem very convincing.
All these were done under preempt=none on both 6.10 and preempt_auto.
Unixbench 6.10 preempt_auto
=====================================================================
1 X Execl Throughput : 5345.70, 5109.68 (-4.42%)
4 X Execl Throughput : 15610.54, 15087.92 (-3.35%)
1 X Pipe-based Context Switching : 183172.30, 177069.52 (-3.33%)
4 X Pipe-based Context Switching : 615471.66, 602773.74 (-2.06%)
1 X Process Creation : 10778.92, 10443.76 (-3.11%)
4 X Process Creation : 24327.06, 25150.42 (+3.38%)
1 X Shell Scripts (1 concurrent) : 10416.76, 10222.28 (-1.87%)
4 X Shell Scripts (1 concurrent) : 36051.00, 35206.90 (-2.34%)
1 X Shell Scripts (8 concurrent) : 5004.22, 4907.32 (-1.94%)
4 X Shell Scripts (8 concurrent) : 12676.08, 12418.18 (-2.03%)
>
>
> Thanks
> Ankur
>
>> Meanwhile, will do more test with other micro-benchmarks and post the results.
>>
>>
>> More details below.
>> CONFIG_HZ = 100
>> ./hackbench -pipe 60 process 100000 loops
>>
>> ====================================================================================
>> On the larger system. (40 Cores, 320CPUS)
>> ====================================================================================
>> 6.10-rc1 +preempt_auto
>> preempt=none preempt=none
>> 20 iterations avg value
>> hackbench pipe(60) 26.403 32.368 ( -31.1%)
>>
>> ++++++++++++++++++
>> baseline 6.10-rc1:
>> ++++++++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% )
>> 6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% )
>> 246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% )
>> 1,759,232 page-faults # 19.716 /sec ( +- 6.61% )
>> 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
>> 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
>> 37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% )
>> 166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% )
>>
>> 26.403 +- 0.166 seconds time elapsed ( +- 0.63% )
>>
>> ++++++++++++
>> preempt auto
>> ++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% )
>> 9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% )
>> 631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% )
>> 1,756,583 page-faults # 16.112 /sec ( +- 6.59% )
>> 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
>> 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
>> 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
>> 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )
>>
>> 32.368 +- 0.200 seconds time elapsed ( +- 0.62% )
>>
>>
>> ============================================================================================
>> Smaller system ( 12Cores, 96CPUS)
>> ============================================================================================
>> 6.10-rc1 +preempt_auto
>> preempt=none preempt=none
>> 20 iterations avg value
>> hackbench pipe(60) 55.930 65.75 ( -17.6%)
>>
>> ++++++++++++++++++
>> baseline 6.10-rc1:
>> ++++++++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% )
>> 1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% )
>> 44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% )
>> 1,698,710 page-faults # 30.010 /sec ( +- 6.58% )
>> 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
>> 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
>> 30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% )
>> 99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% )
>>
>> 55.930 +- 0.509 seconds time elapsed ( +- 0.91% )
>>
>>
>> +++++++++++++++++
>> v2_preempt_auto
>> +++++++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% )
>> 2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% )
>> 147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% )
>> 1,710,637 page-faults # 25.593 /sec ( +- 6.55% )
>> 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
>> 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
>> 33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% )
>> 134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% )
>>
>> 65.75 +- 1.06 seconds time elapsed ( +- 1.61% )
>
> So, the context-switches are meaningfully higher.
>
> --
> ankur
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-06-04 7:32 ` Shrikanth Hegde
@ 2024-06-07 16:48 ` Shrikanth Hegde
2024-06-10 7:23 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Shrikanth Hegde @ 2024-06-07 16:48 UTC (permalink / raw)
To: Ankur Arora
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, boris.ostrovsky, konrad.wilk,
LKML
On 6/4/24 1:02 PM, Shrikanth Hegde wrote:
>
>
> On 6/1/24 5:17 PM, Ankur Arora wrote:
>>
>> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>>
>>> On 5/28/24 6:04 AM, Ankur Arora wrote:
>>>> [ ... cover letter and earlier review comments snipped; quoted in full at the top of the thread ... ]
>>> Did test it again on PowerPC. Unfortunately the numbers show there is still a
>>> regression compared to 6.10-rc1. This is done with preempt=none. I tried again on the
>>> smaller system too to confirm. For now I have done the comparison for hackbench,
>>> where the highest regression was seen in v1.
>>>
>>> perf stat collected over 20 iterations shows higher context switches and higher migrations.
>>> Could it be that the LAZY bit is causing more context switches? Or could it be something
>>> else? Could it be that more exit-to-user transitions happen on PowerPC? Will continue to debug.
>>
>> Thanks for trying it out.
>>
>> As you point out, context-switches and migrations are significantly higher.
>>
>> Definitely unexpected. I ran the same test on an x86 box
>> (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.
>>
>> 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
>> 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
>> 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )
>>
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )
>>
>> Clearly there's something different going on powerpc. I'm travelling
>> right now, but will dig deeper into this once I get back.
>>
>> Meanwhile can you check if the increased context-switches are voluntary or
>> involuntary (or what the division is)?
>
>
> Used "pidstat -w -p ALL 1 10" to capture 10 seconds of data at 1-second intervals for
> context switches per second while running "hackbench -pipe 60 process 100000 loops".
>
>
> preempt=none 6.10 preempt_auto
> =============================================================================
> voluntary context switches 7632166.19 9391636.34(+23%)
> involuntary context switches 2305544.07 3527293.94(+53%)
>
> Numbers vary between runs, but the trend seems similar: both kinds of context switches increase, and
> the involuntary ones seem to increase at a higher rate.
>
>
Continued data from the hackbench regression. preempt=none in both cases.
From mpstat, I see slightly higher idle time and more irq time with preempt_auto.
6.10-rc1:
=========
10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
09:45:23 AM all 4.14 0.00 77.57 0.00 16.92 0.00 0.00 0.00 0.00 1.37
09:45:24 AM all 4.42 0.00 77.62 0.00 16.76 0.00 0.00 0.00 0.00 1.20
09:45:25 AM all 4.43 0.00 77.45 0.00 16.94 0.00 0.00 0.00 0.00 1.18
09:45:26 AM all 4.45 0.00 77.87 0.00 16.68 0.00 0.00 0.00 0.00 0.99
PREEMPT_AUTO:
===========
10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:09:56 AM all 3.11 0.00 72.59 0.00 21.34 0.00 0.00 0.00 0.00 2.96
10:09:57 AM all 3.31 0.00 73.10 0.00 20.99 0.00 0.00 0.00 0.00 2.60
10:09:58 AM all 3.40 0.00 72.83 0.00 20.85 0.00 0.00 0.00 0.00 2.92
10:10:00 AM all 3.21 0.00 72.87 0.00 21.19 0.00 0.00 0.00 0.00 2.73
10:10:01 AM all 3.02 0.00 72.18 0.00 21.08 0.00 0.00 0.00 0.00 3.71
Used the bcc tools hardirq and softirq to see if irqs are increasing. softirq indicated there are more
timer and sched softirqs. Numbers vary between samples, but the trend seems similar.
6.10-rc1:
=========
SOFTIRQ TOTAL_usecs
tasklet 71
block 145
net_rx 7914
rcu 136988
timer 304357
sched 1404497
PREEMPT_AUTO:
===========
SOFTIRQ TOTAL_usecs
tasklet 80
block 139
net_rx 6907
rcu 223508
timer 492767
sched 1794441
Would any specific setting of RCU matter for this?
This is what I have in config.
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TREE_SRCU=y
CONFIG_NEED_SRCU_NMI_SAFE=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_NEED_TASKS_RCU=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
# CONFIG_RCU_LAZY is not set
# end of RCU Subsystem
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_CONTEXT_TRACKING_USER=y
# CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-06-07 16:48 ` Shrikanth Hegde
@ 2024-06-10 7:23 ` Ankur Arora
2024-06-15 15:04 ` Shrikanth Hegde
0 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-06-10 7:23 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Ankur Arora, tglx, peterz, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, boris.ostrovsky,
konrad.wilk, LKML
Shrikanth Hegde <sshegde@linux.ibm.com> writes:
> On 6/4/24 1:02 PM, Shrikanth Hegde wrote:
>>
>>
>> On 6/1/24 5:17 PM, Ankur Arora wrote:
>>>
>>> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>>>
>>>> On 5/28/24 6:04 AM, Ankur Arora wrote:
>>>>> [ ... cover letter and earlier review comments snipped; quoted in full at the top of the thread ... ]
>>>> Did test it again on PowerPC. Unfortunately the numbers show there is still a
>>>> regression compared to 6.10-rc1. This is done with preempt=none. I tried again on the
>>>> smaller system too to confirm. For now I have done the comparison for hackbench,
>>>> where the highest regression was seen in v1.
>>>>
>>>> perf stat collected over 20 iterations shows higher context switches and higher migrations.
>>>> Could it be that the LAZY bit is causing more context switches? Or could it be something
>>>> else? Could it be that more exit-to-user transitions happen on PowerPC? Will continue to debug.
>>>
>>> Thanks for trying it out.
>>>
>>> As you point out, context-switches and migrations are significantly higher.
>>>
>>> Definitely unexpected. I ran the same test on an x86 box
>>> (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.
>>>
>>> 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
>>> 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
>>> 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )
>>>
>>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
>>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
>>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )
>>>
>>> Clearly there's something different going on on powerpc. I'm travelling
>>> right now, but will dig deeper into this once I get back.
>>>
>>> Meanwhile can you check if the increased context-switches are voluntary or
>>> involuntary (or what the division is)?
>>
>>
>> Used "pidstat -w -p ALL 1 10" to capture 10 seconds of data at 1-second intervals for
>> context switches per second while running "hackbench -pipe 60 process 100000 loops"
>>
>>
>>                                  6.10 preempt=none    preempt_auto
>> =============================================================================
>> voluntary context switches       7632166.19           9391636.34 (+23%)
>> involuntary context switches     2305544.07           3527293.94 (+53%)
>>
>> Numbers vary between runs, but the trend seems similar: both kinds of context switches increase,
>> with involuntary increasing at the higher rate.
>>
>>
>
>
> Continued data from hackbench regression. preempt=none in both the cases.
> From mpstat, I see slightly higher idle time and more irq time with preempt_auto.
>
> 6.10-rc1:
> =========
> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 09:45:23 AM all 4.14 0.00 77.57 0.00 16.92 0.00 0.00 0.00 0.00 1.37
> 09:45:24 AM all 4.42 0.00 77.62 0.00 16.76 0.00 0.00 0.00 0.00 1.20
> 09:45:25 AM all 4.43 0.00 77.45 0.00 16.94 0.00 0.00 0.00 0.00 1.18
> 09:45:26 AM all 4.45 0.00 77.87 0.00 16.68 0.00 0.00 0.00 0.00 0.99
>
> PREEMPT_AUTO:
> ===========
> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 10:09:56 AM all 3.11 0.00 72.59 0.00 21.34 0.00 0.00 0.00 0.00 2.96
> 10:09:57 AM all 3.31 0.00 73.10 0.00 20.99 0.00 0.00 0.00 0.00 2.60
> 10:09:58 AM all 3.40 0.00 72.83 0.00 20.85 0.00 0.00 0.00 0.00 2.92
> 10:10:00 AM all 3.21 0.00 72.87 0.00 21.19 0.00 0.00 0.00 0.00 2.73
> 10:10:01 AM all 3.02 0.00 72.18 0.00 21.08 0.00 0.00 0.00 0.00 3.71
>
> Used the bcc tools hardirq and softirq to see if irqs are increasing. softirq shows more
> timer and sched softirqs. Numbers vary between samples, but the trend seems similar.
Yeah, the %sys is lower and %irq, higher. Can you also see where the
increased %irq is? For instance, are the resched IPI counts greater?
> 6.10-rc1:
> =========
> SOFTIRQ TOTAL_usecs
> tasklet 71
> block 145
> net_rx 7914
> rcu 136988
> timer 304357
> sched 1404497
>
>
>
> PREEMPT_AUTO:
> ===========
> SOFTIRQ TOTAL_usecs
> tasklet 80
> block 139
> net_rx 6907
> rcu 223508
> timer 492767
> sched 1794441
>
>
> Would any specific setting of RCU matter for this?
> This is what I have in config.
Don't see how it could matter unless the RCU settings are changing
between the two tests? In my testing I'm also using TREE_RCU=y,
PREEMPT_RCU=n.
Let me see if I can find a test which shows a similar trend to what you
are seeing. And, then maybe see if tracing sched-switch might point to
an interesting difference between x86 and powerpc.
Thanks for all the detail.
Ankur
> # RCU Subsystem
> #
> CONFIG_TREE_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_TREE_SRCU=y
> CONFIG_NEED_SRCU_NMI_SAFE=y
> CONFIG_TASKS_RCU_GENERIC=y
> CONFIG_NEED_TASKS_RCU=y
> CONFIG_TASKS_RCU=y
> CONFIG_TASKS_RUDE_RCU=y
> CONFIG_TASKS_TRACE_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> CONFIG_RCU_NOCB_CPU=y
> # CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
> # CONFIG_RCU_LAZY is not set
> # end of RCU Subsystem
>
>
> # Timers subsystem
> #
> CONFIG_TICK_ONESHOT=y
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_HZ_PERIODIC is not set
> # CONFIG_NO_HZ_IDLE is not set
> CONFIG_NO_HZ_FULL=y
> CONFIG_CONTEXT_TRACKING_USER=y
> # CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
> CONFIG_NO_HZ=y
> CONFIG_HIGH_RES_TIMERS=y
> # end of Timers subsystem
--
ankur
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-06-10 7:23 ` Ankur Arora
@ 2024-06-15 15:04 ` Shrikanth Hegde
2024-06-18 18:27 ` Shrikanth Hegde
0 siblings, 1 reply; 95+ messages in thread
From: Shrikanth Hegde @ 2024-06-15 15:04 UTC (permalink / raw)
To: Ankur Arora
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, boris.ostrovsky, konrad.wilk,
LKML
On 6/10/24 12:53 PM, Ankur Arora wrote:
>
From mpstat, I see slightly higher idle time and more irq time with preempt_auto.
>>
>> 6.10-rc1:
>> =========
>> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>> 09:45:23 AM all 4.14 0.00 77.57 0.00 16.92 0.00 0.00 0.00 0.00 1.37
>> 09:45:24 AM all 4.42 0.00 77.62 0.00 16.76 0.00 0.00 0.00 0.00 1.20
>> 09:45:25 AM all 4.43 0.00 77.45 0.00 16.94 0.00 0.00 0.00 0.00 1.18
>> 09:45:26 AM all 4.45 0.00 77.87 0.00 16.68 0.00 0.00 0.00 0.00 0.99
>>
>> PREEMPT_AUTO:
>> ===========
>> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>> 10:09:56 AM all 3.11 0.00 72.59 0.00 21.34 0.00 0.00 0.00 0.00 2.96
>> 10:09:57 AM all 3.31 0.00 73.10 0.00 20.99 0.00 0.00 0.00 0.00 2.60
>> 10:09:58 AM all 3.40 0.00 72.83 0.00 20.85 0.00 0.00 0.00 0.00 2.92
>> 10:10:00 AM all 3.21 0.00 72.87 0.00 21.19 0.00 0.00 0.00 0.00 2.73
>> 10:10:01 AM all 3.02 0.00 72.18 0.00 21.08 0.00 0.00 0.00 0.00 3.71
>>
>> Used bcc tools hardirq and softirq to see if irq are increasing. softirq implied there are more
>> timer,sched softirq. Numbers vary between different samples, but trend seems to be similar.
>
> Yeah, the %sys is lower and %irq, higher. Can you also see where the
> increased %irq is? For instance are the resched IPIs numbers greater?
Hi Ankur,
Used mpstat -I ALL to capture this info for 20 seconds.
HARDIRQ per second:
===================
6.10:
===================
18 19 22 23 48 49 50 51 LOC BCT LOC2 SPU PMI MCE NMI WDG DBL
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
417956.86 1114642.30 1712683.65 2058664.99 0.00 0.00 18.30 0.39 31978.37 0.00 0.35 351.98 0.00 0.00 0.00 6405.54 329189.45
Preempt_auto:
===================
18 19 22 23 48 49 50 51 LOC BCT LOC2 SPU PMI MCE NMI WDG DBL
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
609509.69 1910413.99 1923503.52 2061876.33 0.00 0.00 19.14 0.30 31916.59 0.00 0.45 497.88 0.00 0.00 0.00 6825.49 88247.85
18, 19, 22 and 23 are XIVE interrupts; these are IPIs. I am not sure which type of IPI these are; will have to see why they are increasing.
SOFTIRQ per second:
===================
6.10:
===================
HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
0.00 3966.47 0.00 18.25 0.59 0.00 0.34 12811.00 0.00 9693.95
Preempt_auto:
===================
HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
0.00 4871.67 0.00 18.94 0.40 0.00 0.25 13518.66 0.00 15732.77
Note: RCU softirq seems to increase significantly. Not sure what triggers it; still trying to figure out why.
It may be irqs triggering softirqs, or softirqs causing more IPIs.
Also, noticed the below config difference, which goes away with preempt auto. This happens because PREEMPTION makes them N. Made the changes in kernel/Kconfig.locks to get them
enabled. I still see the same regression in hackbench. These configs may still need attention?
6.10 | preempt auto
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y | CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK=y | ----------------------------------------------------------------------------
CONFIG_INLINE_READ_UNLOCK_IRQ=y | ----------------------------------------------------------------------------
CONFIG_INLINE_WRITE_UNLOCK=y | ----------------------------------------------------------------------------
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y | ----------------------------------------------------------------------------
>
>> 6.10-rc1:
>> =========
>> SOFTIRQ TOTAL_usecs
>> tasklet 71
>> block 145
>> net_rx 7914
>> rcu 136988
>> timer 304357
>> sched 1404497
>>
>>
>>
>> PREEMPT_AUTO:
>> ===========
>> SOFTIRQ TOTAL_usecs
>> tasklet 80
>> block 139
>> net_rx 6907
>> rcu 223508
>> timer 492767
>> sched 1794441
>>
>>
>> Would any specific setting of RCU matter for this?
>> This is what I have in config.
>
> Don't see how it could matter unless the RCU settings are changing
> between the two tests? In my testing I'm also using TREE_RCU=y,
> PREEMPT_RCU=n.
>
> Let me see if I can find a test which shows a similar trend to what you
> are seeing. And, then maybe see if tracing sched-switch might point to
> an interesting difference between x86 and powerpc.
>
>
> Thanks for all the detail.
>
> Ankur
>
>> # RCU Subsystem
>> #
>> CONFIG_TREE_RCU=y
>> # CONFIG_RCU_EXPERT is not set
>> CONFIG_TREE_SRCU=y
>> CONFIG_NEED_SRCU_NMI_SAFE=y
>> CONFIG_TASKS_RCU_GENERIC=y
>> CONFIG_NEED_TASKS_RCU=y
>> CONFIG_TASKS_RCU=y
>> CONFIG_TASKS_RUDE_RCU=y
>> CONFIG_TASKS_TRACE_RCU=y
>> CONFIG_RCU_STALL_COMMON=y
>> CONFIG_RCU_NEED_SEGCBLIST=y
>> CONFIG_RCU_NOCB_CPU=y
>> # CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
>> # CONFIG_RCU_LAZY is not set
>> # end of RCU Subsystem
>>
>>
>> # Timers subsystem
>> #
>> CONFIG_TICK_ONESHOT=y
>> CONFIG_NO_HZ_COMMON=y
>> # CONFIG_HZ_PERIODIC is not set
>> # CONFIG_NO_HZ_IDLE is not set
>> CONFIG_NO_HZ_FULL=y
>> CONFIG_CONTEXT_TRACKING_USER=y
>> # CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
>> CONFIG_NO_HZ=y
>> CONFIG_HIGH_RES_TIMERS=y
>> # end of Timers subsystem
>
>
> --
> ankur
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-06-15 15:04 ` Shrikanth Hegde
@ 2024-06-18 18:27 ` Shrikanth Hegde
2024-06-19 2:40 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Shrikanth Hegde @ 2024-06-18 18:27 UTC (permalink / raw)
To: Ankur Arora
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, boris.ostrovsky, konrad.wilk,
LKML
On 6/15/24 8:34 PM, Shrikanth Hegde wrote:
>
>
> On 6/10/24 12:53 PM, Ankur Arora wrote:
>>
> From mpstat, I see slightly higher idle time and more irq time with preempt_auto.
>>>
>>> 6.10-rc1:
>>> =========
>>> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>>> 09:45:23 AM all 4.14 0.00 77.57 0.00 16.92 0.00 0.00 0.00 0.00 1.37
>>> 09:45:24 AM all 4.42 0.00 77.62 0.00 16.76 0.00 0.00 0.00 0.00 1.20
>>> 09:45:25 AM all 4.43 0.00 77.45 0.00 16.94 0.00 0.00 0.00 0.00 1.18
>>> 09:45:26 AM all 4.45 0.00 77.87 0.00 16.68 0.00 0.00 0.00 0.00 0.99
>>>
>>> PREEMPT_AUTO:
>>> ===========
>>> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>>> 10:09:56 AM all 3.11 0.00 72.59 0.00 21.34 0.00 0.00 0.00 0.00 2.96
>>> 10:09:57 AM all 3.31 0.00 73.10 0.00 20.99 0.00 0.00 0.00 0.00 2.60
>>> 10:09:58 AM all 3.40 0.00 72.83 0.00 20.85 0.00 0.00 0.00 0.00 2.92
>>> 10:10:00 AM all 3.21 0.00 72.87 0.00 21.19 0.00 0.00 0.00 0.00 2.73
>>> 10:10:01 AM all 3.02 0.00 72.18 0.00 21.08 0.00 0.00 0.00 0.00 3.71
>>>
>>> Used bcc tools hardirq and softirq to see if irq are increasing. softirq implied there are more
>>> timer,sched softirq. Numbers vary between different samples, but trend seems to be similar.
>>
>> Yeah, the %sys is lower and %irq, higher. Can you also see where the
>> increased %irq is? For instance are the resched IPIs numbers greater?
>
> Hi Ankur,
>
>
> Used mpstat -I ALL to capture this info for 20 seconds.
>
> HARDIRQ per second:
> ===================
> 6.10:
> ===================
> 18 19 22 23 48 49 50 51 LOC BCT LOC2 SPU PMI MCE NMI WDG DBL
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 417956.86 1114642.30 1712683.65 2058664.99 0.00 0.00 18.30 0.39 31978.37 0.00 0.35 351.98 0.00 0.00 0.00 6405.54 329189.45
>
> Preempt_auto:
> ===================
> 18 19 22 23 48 49 50 51 LOC BCT LOC2 SPU PMI MCE NMI WDG DBL
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 609509.69 1910413.99 1923503.52 2061876.33 0.00 0.00 19.14 0.30 31916.59 0.00 0.45 497.88 0.00 0.00 0.00 6825.49 88247.85
>
> 18,19,22,23 are called XIVE interrupts. These are IPI interrupts. I am not sure which type of IPI are these. will have to see why its increasing.
>
>
> SOFTIRQ per second:
> ===================
> 6.10:
> ===================
> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
> 0.00 3966.47 0.00 18.25 0.59 0.00 0.34 12811.00 0.00 9693.95
>
> Preempt_auto:
> ===================
> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
> 0.00 4871.67 0.00 18.94 0.40 0.00 0.25 13518.66 0.00 15732.77
>
> Note: RCU softirq seems to increase significantly. Not sure which one triggers. still trying to figure out why.
> It maybe irq triggering to softirq or softirq causing more IPI.
>
>
>
> Also, Noticed a below config difference which gets removed in preempt auto. This happens because PREEMPTION make them as N. Made the changes in kernel/Kconfig.locks to get them
> enabled. I still see the same regression in hackbench. These configs still may need attention?
>
> 6.10 | preempt auto
> CONFIG_INLINE_SPIN_UNLOCK_IRQ=y | CONFIG_UNINLINE_SPIN_UNLOCK=y
> CONFIG_INLINE_READ_UNLOCK=y | ----------------------------------------------------------------------------
> CONFIG_INLINE_READ_UNLOCK_IRQ=y | ----------------------------------------------------------------------------
> CONFIG_INLINE_WRITE_UNLOCK=y | ----------------------------------------------------------------------------
> CONFIG_INLINE_WRITE_UNLOCK_IRQ=y | ----------------------------------------------------------------------------
>
>
Did an experiment keeping the number of CPUs constant while changing the number of sockets they span.
When all the CPUs belong to the same socket, there is no regression with PREEMPT_AUTO. The regression starts when the CPUs
span sockets.
Since preempt auto enables preempt count by default, I think that may be causing the regression. I see powerpc uses the generic implementation,
which may not scale well. Will try to shift to a percpu-based method and see; will get back if I can get that done successfully.
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-06-18 18:27 ` Shrikanth Hegde
@ 2024-06-19 2:40 ` Ankur Arora
2024-06-24 18:37 ` Shrikanth Hegde
0 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-06-19 2:40 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Ankur Arora, tglx, peterz, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, boris.ostrovsky,
konrad.wilk, LKML
Shrikanth Hegde <sshegde@linux.ibm.com> writes:
> On 6/15/24 8:34 PM, Shrikanth Hegde wrote:
>>
>>
>> On 6/10/24 12:53 PM, Ankur Arora wrote:
>>>
>> From mpstat, I see slightly higher idle time and more irq time with preempt_auto.
>>>>
>>>> 6.10-rc1:
>>>> =========
>>>> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>>>> 09:45:23 AM all 4.14 0.00 77.57 0.00 16.92 0.00 0.00 0.00 0.00 1.37
>>>> 09:45:24 AM all 4.42 0.00 77.62 0.00 16.76 0.00 0.00 0.00 0.00 1.20
>>>> 09:45:25 AM all 4.43 0.00 77.45 0.00 16.94 0.00 0.00 0.00 0.00 1.18
>>>> 09:45:26 AM all 4.45 0.00 77.87 0.00 16.68 0.00 0.00 0.00 0.00 0.99
>>>>
>>>> PREEMPT_AUTO:
>>>> ===========
>>>> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>>>> 10:09:56 AM all 3.11 0.00 72.59 0.00 21.34 0.00 0.00 0.00 0.00 2.96
>>>> 10:09:57 AM all 3.31 0.00 73.10 0.00 20.99 0.00 0.00 0.00 0.00 2.60
>>>> 10:09:58 AM all 3.40 0.00 72.83 0.00 20.85 0.00 0.00 0.00 0.00 2.92
>>>> 10:10:00 AM all 3.21 0.00 72.87 0.00 21.19 0.00 0.00 0.00 0.00 2.73
>>>> 10:10:01 AM all 3.02 0.00 72.18 0.00 21.08 0.00 0.00 0.00 0.00 3.71
>>>>
>>>> Used bcc tools hardirq and softirq to see if irq are increasing. softirq implied there are more
>>>> timer,sched softirq. Numbers vary between different samples, but trend seems to be similar.
>>>
>>> Yeah, the %sys is lower and %irq, higher. Can you also see where the
>>> increased %irq is? For instance are the resched IPIs numbers greater?
>>
>> Hi Ankur,
>>
>>
>> Used mpstat -I ALL to capture this info for 20 seconds.
>>
>> HARDIRQ per second:
>> ===================
>> 6.10:
>> ===================
>> 18 19 22 23 48 49 50 51 LOC BCT LOC2 SPU PMI MCE NMI WDG DBL
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 417956.86 1114642.30 1712683.65 2058664.99 0.00 0.00 18.30 0.39 31978.37 0.00 0.35 351.98 0.00 0.00 0.00 6405.54 329189.45
>>
>> Preempt_auto:
>> ===================
>> 18 19 22 23 48 49 50 51 LOC BCT LOC2 SPU PMI MCE NMI WDG DBL
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 609509.69 1910413.99 1923503.52 2061876.33 0.00 0.00 19.14 0.30 31916.59 0.00 0.45 497.88 0.00 0.00 0.00 6825.49 88247.85
>>
>> 18,19,22,23 are called XIVE interrupts. These are IPI interrupts. I am not sure which type of IPI are these. will have to see why its increasing.
>>
>>
>> SOFTIRQ per second:
>> ===================
>> 6.10:
>> ===================
>> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
>> 0.00 3966.47 0.00 18.25 0.59 0.00 0.34 12811.00 0.00 9693.95
>>
>> Preempt_auto:
>> ===================
>> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
>> 0.00 4871.67 0.00 18.94 0.40 0.00 0.25 13518.66 0.00 15732.77
>>
>> Note: RCU softirq seems to increase significantly. Not sure which one triggers. still trying to figure out why.
>> It maybe irq triggering to softirq or softirq causing more IPI.
>>
>>
>>
>> Also, Noticed a below config difference which gets removed in preempt auto. This happens because PREEMPTION make them as N. Made the changes in kernel/Kconfig.locks to get them
>> enabled. I still see the same regression in hackbench. These configs still may need attention?
>>
>> 6.10 | preempt auto
>> CONFIG_INLINE_SPIN_UNLOCK_IRQ=y | CONFIG_UNINLINE_SPIN_UNLOCK=y
>> CONFIG_INLINE_READ_UNLOCK=y | ----------------------------------------------------------------------------
>> CONFIG_INLINE_READ_UNLOCK_IRQ=y | ----------------------------------------------------------------------------
>> CONFIG_INLINE_WRITE_UNLOCK=y | ----------------------------------------------------------------------------
>> CONFIG_INLINE_WRITE_UNLOCK_IRQ=y | ----------------------------------------------------------------------------
>>
>>
>
> Did an experiment keeping the number of CPU constant, while changing the number of sockets they span across.
> When all CPU belong to same socket, there is no regression w.r.t to PREEMPT_AUTO. Regression starts when the CPUs start
> spanning across sockets.
Ah. That's really interesting. So, up to 160 CPUs was okay?
> Since Preempt auto by default enables preempt count, I think that may cause the regression. I see Powerpc uses generic implementation
> which may not scale well.
Yeah this would explain why I don't see similar behaviour on a 384 CPU
x86 box.
Also, IIRC the powerpc numbers on preempt=full were significantly worse
than preempt=none. That test might also be worth doing once you have the
percpu based method working.
> Will try to shift to percpu based method and see. will get back if I can get that done successfully.
Sounds good to me.
Thanks
Ankur
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-06-19 2:40 ` Ankur Arora
@ 2024-06-24 18:37 ` Shrikanth Hegde
2024-06-27 2:50 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Shrikanth Hegde @ 2024-06-24 18:37 UTC (permalink / raw)
To: Ankur Arora
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, boris.ostrovsky, konrad.wilk,
LKML, Michael Ellerman, Nicholas Piggin
On 6/19/24 8:10 AM, Ankur Arora wrote:
>>>
>>> SOFTIRQ per second:
>>> ===================
>>> 6.10:
>>> ===================
>>> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
>>> 0.00 3966.47 0.00 18.25 0.59 0.00 0.34 12811.00 0.00 9693.95
>>>
>>> Preempt_auto:
>>> ===================
>>> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
>>> 0.00 4871.67 0.00 18.94 0.40 0.00 0.25 13518.66 0.00 15732.77
>>>
>>> Note: RCU softirq seems to increase significantly. Not sure which one triggers. still trying to figure out why.
>>> It maybe irq triggering to softirq or softirq causing more IPI.
>>
>> Did an experiment keeping the number of CPU constant, while changing the number of sockets they span across.
>> When all CPU belong to same socket, there is no regression w.r.t to PREEMPT_AUTO. Regression starts when the CPUs start
>> spanning across sockets.
>
> Ah. That's really interesting. So, upto 160 CPUs was okay?
No. In both cases the CPUs are limited to 96. In one case they are within a single NUMA node and in the other they are spread across two NUMA nodes.
>
>> Since Preempt auto by default enables preempt count, I think that may cause the regression. I see Powerpc uses generic implementation
>> which may not scale well.
>
> Yeah this would explain why I don't see similar behaviour on a 384 CPU
> x86 box.
>
> Also, IIRC the powerpc numbers on preempt=full were significantly worse
> than preempt=none. That test might also be worth doing once you have the
> percpu based method working.
>
>> Will try to shift to percpu based method and see. will get back if I can get that done successfully.
>
> Sounds good to me.
>
Did give it a try. Made the preempt count per-CPU by adding it as a paca field. Unfortunately it didn't
improve the performance. It's more or less the same as preempt_auto.
The issue remains elusive. The likely crux is that IPI interrupts and softirqs somehow increase
with preempt_auto. Doing some more data collection with perf/ftrace; will share that soon.
This is the patch I tried, making it per-CPU for powerpc. It boots and runs the workload.
Implemented a simpler scheme instead of folding need-resched into the preempt count. Avoided the
tif_need_resched() calls in a hacky way, since that didn't affect the throughput, hence kept it simple. Below is the patch
for reference. It didn't help fix the regression, unless I implemented it wrongly.
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 1d58da946739..374642288061 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -268,6 +268,7 @@ struct paca_struct {
u16 slb_save_cache_ptr;
#endif
#endif /* CONFIG_PPC_BOOK3S_64 */
+ int preempt_count;
#ifdef CONFIG_STACKPROTECTOR
unsigned long canary;
#endif
diff --git a/arch/powerpc/include/asm/preempt.h b/arch/powerpc/include/asm/preempt.h
new file mode 100644
index 000000000000..406dad1a0cf6
--- /dev/null
+++ b/arch/powerpc/include/asm/preempt.h
@@ -0,0 +1,106 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_PREEMPT_H
+#define __ASM_PREEMPT_H
+
+#include <linux/thread_info.h>
+
+#ifdef CONFIG_PPC64
+#include <asm/paca.h>
+#endif
+#include <asm/percpu.h>
+#include <asm/smp.h>
+
+#define PREEMPT_ENABLED (0)
+
+/*
+ * We mask the PREEMPT_NEED_RESCHED bit so as not to confuse all current users
+ * that think a non-zero value indicates we cannot preempt.
+ */
+static __always_inline int preempt_count(void)
+{
+ return READ_ONCE(local_paca->preempt_count);
+}
+
+static __always_inline void preempt_count_set(int pc)
+{
+ WRITE_ONCE(local_paca->preempt_count, pc);
+}
+
+/*
+ * must be macros to avoid header recursion hell
+ */
+#define init_task_preempt_count(p) do { } while (0)
+
+#define init_idle_preempt_count(p, cpu) do { } while (0)
+
+static __always_inline void set_preempt_need_resched(void)
+{
+}
+
+static __always_inline void clear_preempt_need_resched(void)
+{
+}
+
+static __always_inline bool test_preempt_need_resched(void)
+{
+ return false;
+}
+
+/*
+ * The various preempt_count add/sub methods
+ */
+
+static __always_inline void __preempt_count_add(int val)
+{
+ preempt_count_set(preempt_count() + val);
+}
+
+static __always_inline void __preempt_count_sub(int val)
+{
+ preempt_count_set(preempt_count() - val);
+}
+
+static __always_inline bool __preempt_count_dec_and_test(void)
+{
+ /*
+ * Because of load-store architectures cannot do per-cpu atomic
+ * operations; we cannot use PREEMPT_NEED_RESCHED because it might get
+ * lost.
+ */
+ preempt_count_set(preempt_count() - 1);
+ if (preempt_count() == 0 && tif_need_resched())
+ return true;
+ else
+ return false;
+}
+
+/*
+ * Returns true when we need to resched and can (barring IRQ state).
+ */
+static __always_inline bool should_resched(int preempt_offset)
+{
+ return unlikely(preempt_count() == preempt_offset && tif_need_resched());
+}
+
+//EXPORT_SYMBOL(per_cpu_preempt_count);
+
+#ifdef CONFIG_PREEMPTION
+extern asmlinkage void preempt_schedule(void);
+extern asmlinkage void preempt_schedule_notrace(void);
+
+#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
+
+void dynamic_preempt_schedule(void);
+void dynamic_preempt_schedule_notrace(void);
+#define __preempt_schedule() dynamic_preempt_schedule()
+#define __preempt_schedule_notrace() dynamic_preempt_schedule_notrace()
+
+#else /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_KEY*/
+
+#define __preempt_schedule() preempt_schedule()
+#define __preempt_schedule_notrace() preempt_schedule_notrace()
+
+#endif /* CONFIG_PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_KEY*/
+#endif /* CONFIG_PREEMPTION */
+
+#endif /* __ASM_PREEMPT_H */
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 0d170e2be2b6..bf2199384751 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -52,8 +52,8 @@
* low level task data.
*/
struct thread_info {
- int preempt_count; /* 0 => preemptable,
- <0 => BUG */
+ //int preempt_count; // 0 => preemptable,
+ // <0 => BUG
#ifdef CONFIG_SMP
unsigned int cpu;
#endif
@@ -77,7 +77,6 @@ struct thread_info {
*/
#define INIT_THREAD_INFO(tsk) \
{ \
- .preempt_count = INIT_PREEMPT_COUNT, \
.flags = 0, \
}
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 7502066c3c53..f90245b8359f 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -204,6 +204,7 @@ void __init initialise_paca(struct paca_struct *new_paca, int cpu)
#ifdef CONFIG_PPC_64S_HASH_MMU
new_paca->slb_shadow_ptr = NULL;
#endif
+ new_paca->preempt_count = PREEMPT_DISABLED;
#ifdef CONFIG_PPC_BOOK3E_64
/* For now -- if we have threads this will be adjusted later */
diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
index 85050be08a23..2adab682aab9 100644
--- a/arch/powerpc/kexec/core_64.c
+++ b/arch/powerpc/kexec/core_64.c
@@ -33,6 +33,8 @@
#include <asm/ultravisor.h>
#include <asm/crashdump-ppc64.h>
+#include <linux/percpu-defs.h>
+
int machine_kexec_prepare(struct kimage *image)
{
int i;
@@ -324,7 +326,7 @@ void default_machine_kexec(struct kimage *image)
* XXX: the task struct will likely be invalid once we do the copy!
*/
current_thread_info()->flags = 0;
- current_thread_info()->preempt_count = HARDIRQ_OFFSET;
+ local_paca->preempt_count = HARDIRQ_OFFSET;
/* We need a static PACA, too; copy this CPU's PACA over and switch to
* it. Also poison per_cpu_offset and NULL lppaca to catch anyone using
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-06-24 18:37 ` Shrikanth Hegde
@ 2024-06-27 2:50 ` Ankur Arora
2024-06-27 5:56 ` Michael Ellerman
0 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-06-27 2:50 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Ankur Arora, tglx, peterz, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, boris.ostrovsky,
konrad.wilk, LKML, Michael Ellerman, Nicholas Piggin
Shrikanth Hegde <sshegde@linux.ibm.com> writes:
> On 6/19/24 8:10 AM, Ankur Arora wrote:
>
>
>>>>
>>>> SOFTIRQ per second:
>>>> ===================
>>>> 6.10:
>>>> ===================
>>>> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
>>>> 0.00 3966.47 0.00 18.25 0.59 0.00 0.34 12811.00 0.00 9693.95
>>>>
>>>> Preempt_auto:
>>>> ===================
>>>> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
>>>> 0.00 4871.67 0.00 18.94 0.40 0.00 0.25 13518.66 0.00 15732.77
>>>>
>>>> Note: RCU softirq seems to increase significantly. Not sure which one triggers. still trying to figure out why.
>>>> It maybe irq triggering to softirq or softirq causing more IPI.
>>>
>>> Did an experiment keeping the number of CPU constant, while changing the number of sockets they span across.
>>> When all CPU belong to same socket, there is no regression w.r.t to PREEMPT_AUTO. Regression starts when the CPUs start
>>> spanning across sockets.
>>
>> Ah. That's really interesting. So, upto 160 CPUs was okay?
>
> No. In both the cases CPUs are limited to 96. In one case its in single NUMA node and in other case its across two NUMA nodes.
>
>>
>>> Since Preempt auto by default enables preempt count, I think that may cause the regression. I see Powerpc uses generic implementation
>>> which may not scale well.
>>
>> Yeah this would explain why I don't see similar behaviour on a 384 CPU
>> x86 box.
>>
>> Also, IIRC the powerpc numbers on preempt=full were significantly worse
>> than preempt=none. That test might also be worth doing once you have the
>> percpu based method working.
>>
>>> Will try to shift to percpu based method and see. will get back if I can get that done successfully.
>>
>> Sounds good to me.
>>
>
> Did give a try. Made the preempt count per CPU by adding it in paca field. Unfortunately it didn't
> improve the the performance. Its more or less same as preempt_auto.
>
> Issue still remains illusive. Likely crux is that somehow IPI-interrupts and SOFTIRQs are increasing
> with preempt_auto. Doing some more data collection with perf/ftrace. Will share that soon.
True. But, just looking at IPC for now:
>> baseline 6.10-rc1:
>> ++++++++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
>> 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
>> preempt auto
>> Performance counter stats for 'system wide' (20 runs):
>> 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
>> 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
>> 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
>> 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )
Not sure if comparing IPC is worthwhile given the substantially higher
number of instructions under execution. But, that is meaningfully worse.
This was also true on the 12 core system:
>> baseline 6.10-rc1:
>> Performance counter stats for 'system wide' (20 runs):
>> 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
>> 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
>> v2_preempt_auto
>> Performance counter stats for 'system wide' (20 runs):
>> 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
>> 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
Just to get rid of the preempt_auto aspect completely, maybe you could
try seeing what perf stat -d shows for:
CONFIG_PREEMPT vs CONFIG_PREEMPT_NONE vs (CONFIG_PREEMPT_DYNAMIC, preempt=none).
> This is the patch I tried, making the preempt count per-CPU for powerpc: it boots and runs the workload.
> I implemented a simpler version instead of folding need-resched into the preempt count, and as a hack
> avoided the tif_need_resched calls since they didn't affect the throughput. Hence kept it simple. Below is the patch
> for reference. It didn't help fix the regression, unless I implemented it wrongly.
>
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index 1d58da946739..374642288061 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -268,6 +268,7 @@ struct paca_struct {
> u16 slb_save_cache_ptr;
> #endif
> #endif /* CONFIG_PPC_BOOK3S_64 */
> + int preempt_count;
I don't know powerpc at all. But, would this cacheline be hotter
than current_thread_info()::preempt_count?
Thanks
Ankur
> #ifdef CONFIG_STACKPROTECTOR
> unsigned long canary;
> #endif
> diff --git a/arch/powerpc/include/asm/preempt.h b/arch/powerpc/include/asm/preempt.h
> new file mode 100644
> index 000000000000..406dad1a0cf6
> --- /dev/null
> +++ b/arch/powerpc/include/asm/preempt.h
> @@ -0,0 +1,106 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __ASM_PREEMPT_H
> +#define __ASM_PREEMPT_H
> +
> +#include <linux/thread_info.h>
> +
> +#ifdef CONFIG_PPC64
> +#include <asm/paca.h>
> +#endif
> +#include <asm/percpu.h>
> +#include <asm/smp.h>
> +
> +#define PREEMPT_ENABLED (0)
> +
> +/*
> + * We mask the PREEMPT_NEED_RESCHED bit so as not to confuse all current users
> + * that think a non-zero value indicates we cannot preempt.
> + */
> +static __always_inline int preempt_count(void)
> +{
> + return READ_ONCE(local_paca->preempt_count);
> +}
> +
> +static __always_inline void preempt_count_set(int pc)
> +{
> + WRITE_ONCE(local_paca->preempt_count, pc);
> +}
> +
> +/*
> + * must be macros to avoid header recursion hell
> + */
> +#define init_task_preempt_count(p) do { } while (0)
> +
> +#define init_idle_preempt_count(p, cpu) do { } while (0)
> +
> +static __always_inline void set_preempt_need_resched(void)
> +{
> +}
> +
> +static __always_inline void clear_preempt_need_resched(void)
> +{
> +}
> +
> +static __always_inline bool test_preempt_need_resched(void)
> +{
> + return false;
> +}
> +
> +/*
> + * The various preempt_count add/sub methods
> + */
> +
> +static __always_inline void __preempt_count_add(int val)
> +{
> + preempt_count_set(preempt_count() + val);
> +}
> +
> +static __always_inline void __preempt_count_sub(int val)
> +{
> + preempt_count_set(preempt_count() - val);
> +}
> +
> +static __always_inline bool __preempt_count_dec_and_test(void)
> +{
> + /*
> + * Because load-store architectures cannot do per-cpu atomic
> + * operations, we cannot use PREEMPT_NEED_RESCHED because it might get
> + * lost.
> + */
> + preempt_count_set(preempt_count() - 1);
> + if (preempt_count() == 0 && tif_need_resched())
> + return true;
> + else
> + return false;
> +}
> +
> +/*
> + * Returns true when we need to resched and can (barring IRQ state).
> + */
> +static __always_inline bool should_resched(int preempt_offset)
> +{
> + return unlikely(preempt_count() == preempt_offset && tif_need_resched());
> +}
> +
> +//EXPORT_SYMBOL(per_cpu_preempt_count);
> +
> +#ifdef CONFIG_PREEMPTION
> +extern asmlinkage void preempt_schedule(void);
> +extern asmlinkage void preempt_schedule_notrace(void);
> +
> +#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
> +
> +void dynamic_preempt_schedule(void);
> +void dynamic_preempt_schedule_notrace(void);
> +#define __preempt_schedule() dynamic_preempt_schedule()
> +#define __preempt_schedule_notrace() dynamic_preempt_schedule_notrace()
> +
> +#else /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_KEY*/
> +
> +#define __preempt_schedule() preempt_schedule()
> +#define __preempt_schedule_notrace() preempt_schedule_notrace()
> +
> +#endif /* CONFIG_PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_KEY*/
> +#endif /* CONFIG_PREEMPTION */
> +
> +#endif /* __ASM_PREEMPT_H */
> diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
> index 0d170e2be2b6..bf2199384751 100644
> --- a/arch/powerpc/include/asm/thread_info.h
> +++ b/arch/powerpc/include/asm/thread_info.h
> @@ -52,8 +52,8 @@
> * low level task data.
> */
> struct thread_info {
> - int preempt_count; /* 0 => preemptable,
> - <0 => BUG */
> + //int preempt_count; // 0 => preemptable,
> + // <0 => BUG
> #ifdef CONFIG_SMP
> unsigned int cpu;
> #endif
> @@ -77,7 +77,6 @@ struct thread_info {
> */
> #define INIT_THREAD_INFO(tsk) \
> { \
> - .preempt_count = INIT_PREEMPT_COUNT, \
> .flags = 0, \
> }
>
> diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
> index 7502066c3c53..f90245b8359f 100644
> --- a/arch/powerpc/kernel/paca.c
> +++ b/arch/powerpc/kernel/paca.c
> @@ -204,6 +204,7 @@ void __init initialise_paca(struct paca_struct *new_paca, int cpu)
> #ifdef CONFIG_PPC_64S_HASH_MMU
> new_paca->slb_shadow_ptr = NULL;
> #endif
> + new_paca->preempt_count = PREEMPT_DISABLED;
>
> #ifdef CONFIG_PPC_BOOK3E_64
> /* For now -- if we have threads this will be adjusted later */
> diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
> index 85050be08a23..2adab682aab9 100644
> --- a/arch/powerpc/kexec/core_64.c
> +++ b/arch/powerpc/kexec/core_64.c
> @@ -33,6 +33,8 @@
> #include <asm/ultravisor.h>
> #include <asm/crashdump-ppc64.h>
>
> +#include <linux/percpu-defs.h>
> +
> int machine_kexec_prepare(struct kimage *image)
> {
> int i;
> @@ -324,7 +326,7 @@ void default_machine_kexec(struct kimage *image)
> * XXX: the task struct will likely be invalid once we do the copy!
> */
> current_thread_info()->flags = 0;
> - current_thread_info()->preempt_count = HARDIRQ_OFFSET;
> + local_paca->preempt_count = HARDIRQ_OFFSET;
>
> /* We need a static PACA, too; copy this CPU's PACA over and switch to
> * it. Also poison per_cpu_offset and NULL lppaca to catch anyone using
--
ankur
^ permalink raw reply [flat|nested] 95+ messages in thread

* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-06-27 2:50 ` Ankur Arora
@ 2024-06-27 5:56 ` Michael Ellerman
2024-06-27 15:44 ` Shrikanth Hegde
0 siblings, 1 reply; 95+ messages in thread
From: Michael Ellerman @ 2024-06-27 5:56 UTC (permalink / raw)
To: Ankur Arora, Shrikanth Hegde
Cc: Ankur Arora, tglx, peterz, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, boris.ostrovsky,
konrad.wilk, LKML, Nicholas Piggin
Ankur Arora <ankur.a.arora@oracle.com> writes:
> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>> ...
>> This is the patch I tried, making the preempt count per-CPU for powerpc: it boots and runs the workload.
>> I implemented a simpler version instead of folding need-resched into the preempt count, and as a hack
>> avoided the tif_need_resched calls since they didn't affect the throughput. Hence kept it simple. Below is the patch
>> for reference. It didn't help fix the regression, unless I implemented it wrongly.
>>
>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>> index 1d58da946739..374642288061 100644
>> --- a/arch/powerpc/include/asm/paca.h
>> +++ b/arch/powerpc/include/asm/paca.h
>> @@ -268,6 +268,7 @@ struct paca_struct {
>> u16 slb_save_cache_ptr;
>> #endif
>> #endif /* CONFIG_PPC_BOOK3S_64 */
>> + int preempt_count;
>
> I don't know powerpc at all. But, would this cacheline be hotter
> than current_thread_info()::preempt_count?
>
>> #ifdef CONFIG_STACKPROTECTOR
>> unsigned long canary;
>> #endif
Assuming stack protector is enabled (it is in defconfig), that cache
line should be quite hot, because the canary is loaded as part of the
epilogue of many functions.
Putting preempt_count in the paca also means it's a single load/store to
access the value, just paca (in r13) + static offset. With the
preempt_count in thread_info it's two loads, one to load current from
the paca and then another to get the preempt_count.
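As a userspace illustration of that load-count difference — the struct layout below is a toy model, not the real paca:

```c
#include <assert.h>

/* Toy model: thread_info hangs off the per-CPU area via a pointer. */
struct thread_info {
        int preempt_count;
};

struct paca {
        struct thread_info *current_ti;   /* load 1: get current        */
        int preempt_count;                /* candidate new home in paca */
};

/* thread_info variant: two dependent loads
 * (paca->current_ti, then current_ti->preempt_count). */
static int count_via_thread_info(struct paca *paca)
{
        return paca->current_ti->preempt_count;
}

/* paca variant: a single load at a static offset from the
 * paca base (which powerpc keeps in r13). */
static int count_via_paca(struct paca *paca)
{
        return paca->preempt_count;
}
```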
It could be worthwhile to move preempt_count into the paca, but I'm not
convinced preempt_count is accessed enough for it to be a major
performance issue.
cheers
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-06-27 5:56 ` Michael Ellerman
@ 2024-06-27 15:44 ` Shrikanth Hegde
2024-07-03 5:27 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Shrikanth Hegde @ 2024-06-27 15:44 UTC (permalink / raw)
To: Michael Ellerman, Ankur Arora
Cc: tglx, peterz, torvalds, paulmck, rostedt, mark.rutland,
juri.lelli, joel, raghavendra.kt, boris.ostrovsky, konrad.wilk,
LKML, Nicholas Piggin
On 6/27/24 11:26 AM, Michael Ellerman wrote:
> Ankur Arora <ankur.a.arora@oracle.com> writes:
>> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>>> ...
>>> This is the patch I tried, making the preempt count per-CPU for powerpc: it boots and runs the workload.
>>> I implemented a simpler version instead of folding need-resched into the preempt count, and as a hack
>>> avoided the tif_need_resched calls since they didn't affect the throughput. Hence kept it simple. Below is the patch
>>> for reference. It didn't help fix the regression, unless I implemented it wrongly.
>>>
>>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>>> index 1d58da946739..374642288061 100644
>>> --- a/arch/powerpc/include/asm/paca.h
>>> +++ b/arch/powerpc/include/asm/paca.h
>>> @@ -268,6 +268,7 @@ struct paca_struct {
>>> u16 slb_save_cache_ptr;
>>> #endif
>>> #endif /* CONFIG_PPC_BOOK3S_64 */
>>> + int preempt_count;
>>
>> I don't know powerpc at all. But, would this cacheline be hotter
>> than current_thread_info()::preempt_count?
>>
>>> #ifdef CONFIG_STACKPROTECTOR
>>> unsigned long canary;
>>> #endif
>
> Assuming stack protector is enabled (it is in defconfig), that cache
> line should be quite hot, because the canary is loaded as part of the
> epilogue of many functions.
Thanks Michael for taking a look at it.
Yes. CONFIG_STACKPROTECTOR=y
Which cacheline it lands in is still a question, if we are going to pursue this.
> Putting preempt_count in the paca also means it's a single load/store to
> access the value, just paca (in r13) + static offset. With the
> preempt_count in thread_info it's two loads, one to load current from
> the paca and then another to get the preempt_count.
>
> It could be worthwhile to move preempt_count into the paca, but I'm not
> convinced preempt_count is accessed enough for it to be a major
> performance issue.
With PREEMPT_COUNT enabled, this would be exercised on every preempt_enable()/preempt_disable().
That means on every spin lock/unlock, get_cpu()/put_cpu(), etc. Those might be
quite frequent, no? But w.r.t. preempt_auto it didn't change the performance per se.
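A userspace model of what the patch above does on each disable/enable pair may make the cost question concrete — this is a sketch, with the per-CPU paca access replaced by a plain global:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for local_paca->preempt_count: plain loads and stores, no
 * atomics, which is safe only because the real counter is strictly
 * per-CPU and only touched by the local CPU. */
static int preempt_count_model;
static bool need_resched_model;   /* stand-in for tif_need_resched() */

static void model_preempt_disable(void)
{
        preempt_count_model++;
}

/* Mirrors __preempt_count_dec_and_test() in the patch: preemption is
 * possible only when the count drops to zero and a resched is pending. */
static bool model_preempt_enable(void)
{
        preempt_count_model--;
        return preempt_count_model == 0 && need_resched_model;
}
```

Every spin lock/unlock pair under PREEMPT_COUNT executes the equivalent of one such increment and one such decrement-and-test, which is the frequency concern above.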
>
> cheers
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-06-27 15:44 ` Shrikanth Hegde
@ 2024-07-03 5:27 ` Ankur Arora
2024-08-12 17:32 ` Shrikanth Hegde
0 siblings, 1 reply; 95+ messages in thread
From: Ankur Arora @ 2024-07-03 5:27 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Michael Ellerman, Ankur Arora, tglx, peterz, torvalds, paulmck,
rostedt, mark.rutland, juri.lelli, joel, raghavendra.kt,
boris.ostrovsky, konrad.wilk, LKML, Nicholas Piggin
Shrikanth Hegde <sshegde@linux.ibm.com> writes:
> On 6/27/24 11:26 AM, Michael Ellerman wrote:
>> Ankur Arora <ankur.a.arora@oracle.com> writes:
>>> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>>>> ...
>>>> This is the patch I tried, making the preempt count per-CPU for powerpc: it boots and runs the workload.
>>>> I implemented a simpler version instead of folding need-resched into the preempt count, and as a hack
>>>> avoided the tif_need_resched calls since they didn't affect the throughput. Hence kept it simple. Below is the patch
>>>> for reference. It didn't help fix the regression, unless I implemented it wrongly.
>>>>
>>>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>>>> index 1d58da946739..374642288061 100644
>>>> --- a/arch/powerpc/include/asm/paca.h
>>>> +++ b/arch/powerpc/include/asm/paca.h
>>>> @@ -268,6 +268,7 @@ struct paca_struct {
>>>> u16 slb_save_cache_ptr;
>>>> #endif
>>>> #endif /* CONFIG_PPC_BOOK3S_64 */
>>>> + int preempt_count;
>>>
>>> I don't know powerpc at all. But, would this cacheline be hotter
>>> than current_thread_info()::preempt_count?
>>>
>>>> #ifdef CONFIG_STACKPROTECTOR
>>>> unsigned long canary;
>>>> #endif
>>
>> Assuming stack protector is enabled (it is in defconfig), that cache
>> line should be quite hot, because the canary is loaded as part of the
>> epilogue of many functions.
>
> Thanks Michael for taking a look at it.
>
> Yes. CONFIG_STACKPROTECTOR=y
> Which cacheline it lands in is still a question, if we are going to pursue this.
>> Putting preempt_count in the paca also means it's a single load/store to
>> access the value, just paca (in r13) + static offset. With the
>> preempt_count in thread_info it's two loads, one to load current from
>> the paca and then another to get the preempt_count.
>>
>> It could be worthwhile to move preempt_count into the paca, but I'm not
>> convinced preempt_count is accessed enough for it to be a major
>> performance issue.
Yeah, that makes sense. I'm working on making the x86 preempt_count
and related code similar to powerpc. Let's see how that does on x86.
> With PREEMPT_COUNT enabled, this would be exercised on every preempt_enable()/preempt_disable().
> That means on every spin lock/unlock, get_cpu()/put_cpu(), etc. Those might be
> quite frequent, no? But w.r.t. preempt_auto it didn't change the performance per se.
Yeah and you had mentioned that folding the NR bit (or not) doesn't
seem to matter either. Hackbench does a lot of remote wakeups, which
should mean that the target's thread_info::flags cacheline would be
bouncing around, so I would have imagined that that would be noticeable.
--
ankur
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-07-03 5:27 ` Ankur Arora
@ 2024-08-12 17:32 ` Shrikanth Hegde
2024-08-12 21:07 ` Linus Torvalds
0 siblings, 1 reply; 95+ messages in thread
From: Shrikanth Hegde @ 2024-08-12 17:32 UTC (permalink / raw)
To: Ankur Arora
Cc: Michael Ellerman, tglx, peterz, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, boris.ostrovsky,
konrad.wilk, LKML, Nicholas Piggin
On 7/3/24 10:57, Ankur Arora wrote:
>
> Shrikanth Hegde <sshegde@linux.ibm.com> writes:
>
Hi.
Sorry for the delayed response.
I could see this hackbench pipe regression with a preempt=full kernel on 6.10-rc as well, i.e. without PREEMPT_AUTO.
There seem to be more wakeups in the read path, which implies the pipe was more often empty. Correspondingly, there is
more contention on the pipe mutex with preempt=full. But why, I'm not sure. One powerpc difference is the page size,
but here the pipe isn't getting full; it's not the write side that is blocked.
preempt=none: Time taken for 20 groups in seconds : 25.70
preempt=full: Time taken for 20 groups in seconds : 54.56
----------------
hackbench (pipe)
----------------
top 3 callstacks of __schedule collected with bpftrace.
preempt=none preempt=full
__schedule+12 |@[
schedule+64 | __schedule+12
interrupt_exit_user_prepare_main+600 | preempt_schedule+84
interrupt_exit_user_prepare+88 | _raw_spin_unlock_irqrestore+124
interrupt_return_srr_user+8 | __wake_up_sync_key+108
, hackbench]: 482228 | pipe_write+1772
@[ | vfs_write+1052
__schedule+12 | ksys_write+248
schedule+64 | system_call_exception+296
pipe_write+1452 | system_call_vectored_common+348
vfs_write+940 |, hackbench]: 538591
ksys_write+248 |@[
system_call_exception+292 | __schedule+12
system_call_vectored_common+348 | schedule+76
, hackbench]: 1427161 | schedule_preempt_disabled+52
@[ | __mutex_lock.constprop.0+1748
__schedule+12 | pipe_write+132
schedule+64 | vfs_write+1052
interrupt_exit_user_prepare_main+600 | ksys_write+248
syscall_exit_prepare+336 | system_call_exception+296
system_call_vectored_common+360 | system_call_vectored_common+348
, hackbench]: 8151309 |, hackbench]: 5388301
@[ |@[
__schedule+12 | __schedule+12
schedule+64 | schedule+76
pipe_read+1100 | pipe_read+1100
vfs_read+716 | vfs_read+716
ksys_read+252 | ksys_read+252
system_call_exception+292 | system_call_exception+296
system_call_vectored_common+348 | system_call_vectored_common+348
, hackbench]: 18132753 |, hackbench]: 64424110
--------------------------------------------
hackbench (messaging) - one that uses sockets
--------------------------------------------
Here there is no regression with preempt=full.
preempt=none: Time taken for 20 groups in seconds : 55.51
preempt=full: Time taken for 20 groups in seconds : 55.10
Similar bpftrace data was collected for the socket-based hackbench. The highest callers of __schedule don't change much.
preempt=none preempt=full
| __schedule+12
| preempt_schedule+84
| _raw_spin_unlock+108
@[ | unix_stream_sendmsg+660
__schedule+12 | sock_write_iter+372
schedule+64 | vfs_write+1052
schedule_timeout+412 | ksys_write+248
sock_alloc_send_pskb+684 | system_call_exception+296
unix_stream_sendmsg+448 | system_call_vectored_common+348
sock_write_iter+372 |, hackbench]: 819290
vfs_write+940 |@[
ksys_write+248 | __schedule+12
system_call_exception+292 | schedule+76
system_call_vectored_common+348 | schedule_timeout+476
, hackbench]: 3424197 | sock_alloc_send_pskb+684
@[ | unix_stream_sendmsg+444
__schedule+12 | sock_write_iter+372
schedule+64 | vfs_write+1052
interrupt_exit_user_prepare_main+600 | ksys_write+248
syscall_exit_prepare+336 | system_call_exception+296
system_call_vectored_common+360 | system_call_vectored_common+348
, hackbench]: 9800144 |, hackbench]: 3386594
@[ |@[
__schedule+12 | __schedule+12
schedule+64 | schedule+76
schedule_timeout+412 | schedule_timeout+476
unix_stream_data_wait+528 | unix_stream_data_wait+468
unix_stream_read_generic+872 | unix_stream_read_generic+804
unix_stream_recvmsg+196 | unix_stream_recvmsg+196
sock_recvmsg+164 | sock_recvmsg+156
sock_read_iter+200 | sock_read_iter+200
vfs_read+716 | vfs_read+716
ksys_read+252 | ksys_read+252
system_call_exception+292 | system_call_exception+296
system_call_vectored_common+348 | system_call_vectored_common+348
, hackbench]: 25375142 |, hackbench]: 27275685
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-08-12 17:32 ` Shrikanth Hegde
@ 2024-08-12 21:07 ` Linus Torvalds
2024-08-13 5:40 ` Ankur Arora
0 siblings, 1 reply; 95+ messages in thread
From: Linus Torvalds @ 2024-08-12 21:07 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Ankur Arora, Michael Ellerman, tglx, peterz, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, boris.ostrovsky,
konrad.wilk, LKML, Nicholas Piggin
On Mon, 12 Aug 2024 at 10:33, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>
> top 3 callstacks of __schedule collected with bpftrace.
>
> preempt=none preempt=full
>
> __schedule+12 |@[
> schedule+64 | __schedule+12
> interrupt_exit_user_prepare_main+600 | preempt_schedule+84
> interrupt_exit_user_prepare+88 | _raw_spin_unlock_irqrestore+124
> interrupt_return_srr_user+8 | __wake_up_sync_key+108
> , hackbench]: 482228 | pipe_write+1772
> @[ | vfs_write+1052
> __schedule+12 | ksys_write+248
> schedule+64 | system_call_exception+296
> pipe_write+1452 | system_call_vectored_common+348
> vfs_write+940 |, hackbench]: 538591
> ksys_write+248 |@[
> system_call_exception+292 | __schedule+12
> system_call_vectored_common+348 | schedule+76
> , hackbench]: 1427161 | schedule_preempt_disabled+52
> @[ | __mutex_lock.constprop.0+1748
> __schedule+12 | pipe_write+132
> schedule+64 | vfs_write+1052
> interrupt_exit_user_prepare_main+600 | ksys_write+248
> syscall_exit_prepare+336 | system_call_exception+296
> system_call_vectored_common+360 | system_call_vectored_common+348
> , hackbench]: 8151309 |, hackbench]: 5388301
> @[ |@[
> __schedule+12 | __schedule+12
> schedule+64 | schedule+76
> pipe_read+1100 | pipe_read+1100
> vfs_read+716 | vfs_read+716
> ksys_read+252 | ksys_read+252
> system_call_exception+292 | system_call_exception+296
> system_call_vectored_common+348 | system_call_vectored_common+348
> , hackbench]: 18132753 |, hackbench]: 64424110
>
So the pipe performance is very sensitive, partly because the pipe
overhead is normally very low.
So we've seen it in lots of benchmarks where the benchmark then gets
wildly different results depending on whether you get the good "optimal
pattern".
And I think your "preempt=none" pattern is the one you really want,
where all the pipe IO scheduling is basically done at exactly the
(optimized) pipe points, ie where the writer blocks because there is
no room (if it's a throughput benchmark), and the reader blocks
because there is no data (for the ping-pong or pipe ring latency
benchmarks).
And then when you get that "perfect" behavior, you typically also get
the best performance when all readers and all writers are on the same
CPU, so you get no unnecessary cache ping-pong either.
And that's a *very* typical pipe benchmark, where there are no costs
to generating the pipe data and no costs involved with consuming it
(ie the actual data isn't really *used* by the benchmark).
In real (non-benchmark) loads, you typically want to spread the
consumer and producer apart on different CPUs, so that the real load
then uses multiple CPUs on the data. But the benchmark case - having
no real data load - likes the "stay on the same CPU" thing.
Your traces for "preempt=none" very much look like that "both reader
and writer sleep synchronously" case, which is the optimal benchmark
case.
And then with "preempt=full", you see that "oh damn, reader and writer
actually hit the pipe mutex contention, because they are presumably
running at the same time on different CPUs, and didn't get into that
nice serial synchronous pattern. So now you not only have that mutex
overhead (which doesn't exist when the reader and writer synchronize),
you also end up with the cost of cache misses *and* the cost of
scheduling on two different CPU's where both of them basically go into
idle while waiting for the other end.
I'm not convinced this is solvable, because it really is an effect
that comes from "benchmarking is doing something odd that we
*shouldn't* generally optimize for".
I also absolutely detest the pipe mutex - 99% of what it protects
should be using either just atomic cmpxchg or possibly a spinlock, and
that's actually what the "use pipes for events" code does. However,
the actual honest user read()/write() code needs to do user space
accesses, and so it wants a sleeping lock.
We could - and probably at some point should - split the pipe mutex
into two: one that protects the writer side, one that protects the
reader side. Then with the common situation of a single reader and a
single writer, the mutex would never be contended. Then the rendezvous
between that "one reader" and "one writer" would be done using
atomics.
But it would be more complex, and it's already complicated by the
whole "you can also use pipes for atomic messaging for watch-queues".
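A very rough userspace sketch of that split — two locks plus an atomic occupancy count as the rendezvous. This illustrates the idea only; it is not the pipe code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 16

/* Ring with separate writer-side and reader-side locks; with a single
 * reader and a single writer, neither lock is ever contended.  The
 * atomic 'used' count is the rendezvous between the two sides. */
struct ring {
        pthread_mutex_t write_lock;   /* serializes writers only */
        pthread_mutex_t read_lock;    /* serializes readers only */
        atomic_int used;
        unsigned int head, tail;
        int buf[RING_SIZE];
};

static bool ring_write(struct ring *r, int v)
{
        bool ok = false;

        pthread_mutex_lock(&r->write_lock);
        if (atomic_load(&r->used) < RING_SIZE) {
                r->buf[r->head++ % RING_SIZE] = v;
                atomic_fetch_add(&r->used, 1);  /* publish to the reader */
                ok = true;
        }
        pthread_mutex_unlock(&r->write_lock);
        return ok;
}

static bool ring_read(struct ring *r, int *v)
{
        bool ok = false;

        pthread_mutex_lock(&r->read_lock);
        if (atomic_load(&r->used) > 0) {
                *v = r->buf[r->tail++ % RING_SIZE];
                atomic_fetch_sub(&r->used, 1);
                ok = true;
        }
        pthread_mutex_unlock(&r->read_lock);
        return ok;
}
```

The real pipe would be messier (blocking, wakeups, the watch-queue messaging), but the point stands: each side only ever takes its own lock, and the two sides meet through the atomic count.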
Anyway, preempt=none has always excelled at certain things. This is
one of them.
Linus
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-08-12 21:07 ` Linus Torvalds
@ 2024-08-13 5:40 ` Ankur Arora
0 siblings, 0 replies; 95+ messages in thread
From: Ankur Arora @ 2024-08-13 5:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Shrikanth Hegde, Ankur Arora, Michael Ellerman, tglx, peterz,
paulmck, rostedt, mark.rutland, juri.lelli, joel, raghavendra.kt,
boris.ostrovsky, konrad.wilk, LKML, Nicholas Piggin
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, 12 Aug 2024 at 10:33, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>>
>> top 3 callstacks of __schedule collected with bpftrace.
>>
>> preempt=none preempt=full
>>
>> __schedule+12 |@[
>> schedule+64 | __schedule+12
>> interrupt_exit_user_prepare_main+600 | preempt_schedule+84
>> interrupt_exit_user_prepare+88 | _raw_spin_unlock_irqrestore+124
>> interrupt_return_srr_user+8 | __wake_up_sync_key+108
>> , hackbench]: 482228 | pipe_write+1772
>> @[ | vfs_write+1052
>> __schedule+12 | ksys_write+248
>> schedule+64 | system_call_exception+296
>> pipe_write+1452 | system_call_vectored_common+348
>> vfs_write+940 |, hackbench]: 538591
>> ksys_write+248 |@[
>> system_call_exception+292 | __schedule+12
>> system_call_vectored_common+348 | schedule+76
>> , hackbench]: 1427161 | schedule_preempt_disabled+52
>> @[ | __mutex_lock.constprop.0+1748
>> __schedule+12 | pipe_write+132
>> schedule+64 | vfs_write+1052
>> interrupt_exit_user_prepare_main+600 | ksys_write+248
>> syscall_exit_prepare+336 | system_call_exception+296
>> system_call_vectored_common+360 | system_call_vectored_common+348
>> , hackbench]: 8151309 |, hackbench]: 5388301
>> @[ |@[
>> __schedule+12 | __schedule+12
>> schedule+64 | schedule+76
>> pipe_read+1100 | pipe_read+1100
>> vfs_read+716 | vfs_read+716
>> ksys_read+252 | ksys_read+252
>> system_call_exception+292 | system_call_exception+296
>> system_call_vectored_common+348 | system_call_vectored_common+348
>> , hackbench]: 18132753 |, hackbench]: 64424110
>>
>
> So the pipe performance is very sensitive, partly because the pipe
> overhead is normally very low.
>
> So we've seen it in lots of benchmarks where the benchmark then gets
> wildly different results depending on whether you get the good "optimal
> pattern".
>
> And I think your "preempt=none" pattern is the one you really want,
> where all the pipe IO scheduling is basically done at exactly the
> (optimized) pipe points, ie where the writer blocks because there is
> no room (if it's a throughput benchmark), and the reader blocks
> because there is no data (for the ping-pong or pipe ring latency
> benchmarks).
>
> And then when you get that "perfect" behavior, you typically also get
> the best performance when all readers and all writers are on the same
> CPU, so you get no unnecessary cache ping-pong either.
>
> And that's a *very* typical pipe benchmark, where there are no costs
> to generating the pipe data and no costs involved with consuming it
> (ie the actual data isn't really *used* by the benchmark).
>
> In real (non-benchmark) loads, you typically want to spread the
> consumer and producer apart on different CPUs, so that the real load
> then uses multiple CPUs on the data. But the benchmark case - having
> no real data load - likes the "stay on the same CPU" thing.
>
> Your traces for "preempt=none" very much look like that "both reader
> and writer sleep synchronously" case, which is the optimal benchmark
> case.
>
> And then with "preempt=full", you see that "oh damn, reader and writer
> actually hit the pipe mutex contention, because they are presumably
> running at the same time on different CPUs, and didn't get into that
> nice serial synchronous pattern. So now you not only have that mutex
> overhead (which doesn't exist when the reader and writer synchronize),
> you also end up with the cost of cache misses *and* the cost of
> scheduling on two different CPU's where both of them basically go into
> idle while waiting for the other end.
Thanks. That was very clarifying.
--
ankur
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-05-28 0:34 [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Ankur Arora
` (35 preceding siblings ...)
2024-05-29 6:16 ` [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling Shrikanth Hegde
@ 2024-06-05 15:44 ` Sean Christopherson
2024-06-05 17:45 ` Peter Zijlstra
36 siblings, 1 reply; 95+ messages in thread
From: Sean Christopherson @ 2024-06-05 15:44 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, tglx, peterz, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk
On Mon, May 27, 2024, Ankur Arora wrote:
> Patches 1,2
> "sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
> "sched/core: Drop spinlocks on contention iff kernel is preemptible"
> condition spin_needbreak() on the dynamic preempt_model_*().
...
> Not really required but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.
> Sean Christopherson (2):
> sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
> sched/core: Drop spinlocks on contention iff kernel is preemptible
Peter and/or Thomas, would it be possible to get these applied to tip-tree sooner
than later? They fix a real bug that affects KVM to varying degrees.
* Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
2024-06-05 15:44 ` Sean Christopherson
@ 2024-06-05 17:45 ` Peter Zijlstra
0 siblings, 0 replies; 95+ messages in thread
From: Peter Zijlstra @ 2024-06-05 17:45 UTC (permalink / raw)
To: Sean Christopherson
Cc: Ankur Arora, linux-kernel, tglx, torvalds, paulmck, rostedt,
mark.rutland, juri.lelli, joel, raghavendra.kt, sshegde,
boris.ostrovsky, konrad.wilk
On Wed, Jun 05, 2024 at 08:44:50AM -0700, Sean Christopherson wrote:
> On Mon, May 27, 2024, Ankur Arora wrote:
> > Patches 1,2
> > "sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
> > "sched/core: Drop spinlocks on contention iff kernel is preemptible"
> > condition spin_needbreak() on the dynamic preempt_model_*().
>
> ...
>
> > Not really required but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.
> > Sean Christopherson (2):
> > sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
> > sched/core: Drop spinlocks on contention iff kernel is preemptible
>
> Peter and/or Thomas, would it be possible to get these applied to tip-tree sooner
> than later? They fix a real bug that affects KVM to varying degrees.
It so happens I've queued them for sched/core earlier today (see
queue/sched/core). If the robot comes back happy, I'll push them into
tip.
Thanks!