* [PATCH V7 00/11] Scheduler time slice extension
From: Prakash Sangappa @ 2025-07-24 16:16 UTC
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

Based on v6.16-rc3.

Patches 7-11 in this series are an attempt to implement the API/mechanism
suggested by Thomas Gleixner: an RT thread can indicate that its scheduling
should not be delayed when the thread running on the cpu requests extending
its time slice. This addresses the concern that, with the proposed scheduler
time slice extension feature, a normal thread could delay the scheduling of
an RT thread.

This requires a new TIF flag (TIF_NEED_RESCHED_NODELAY), which is set on
the running thread when such an RT thread gets woken up and enqueued.
The API is only allowed for RT (RR, FIFO) threads.

The TIF_NEED_RESCHED_NODELAY patches are implemented along the lines of
how TIF_NEED_RESCHED_LAZY was added. However, TIF_NEED_RESCHED_NODELAY
takes effect only with the scheduler time slice extension feature
(i.e., when the CONFIG_RSEQ_RESCHED_DELAY config option is enabled).

This series introduces prctl APIs to set and get the sched_nodelay flag,
and adds a new 1-bit member (sched_nodelay) to struct task_struct to store
it, as there is no room left for a new PF_* flag. The flag is inherited
across fork and exec.

The API provides per-thread control over whether the thread's scheduling
can be delayed. Also, a kernel parameter is added to disable delaying the
scheduling of all RT threads, if necessary, when the scheduler time slice
extension feature is enabled.
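
For illustration, a minimal usage sketch of the prctl API. It assumes the
PR_SET_SCHED_NODELAY/PR_GET_SCHED_NODELAY values proposed in patch 7 and a
caller that already runs under SCHED_FIFO or SCHED_RR:

#include <stdio.h>
#include <sys/prctl.h>

/* Values proposed in patch 7; not in released <linux/prctl.h> yet. */
#ifndef PR_SET_SCHED_NODELAY
#define PR_SET_SCHED_NODELAY	79
#define PR_GET_SCHED_NODELAY	80
#endif

int main(void)
{
	/* Only RT (SCHED_FIFO/SCHED_RR) threads may set the flag. */
	if (prctl(PR_SET_SCHED_NODELAY, 1, 0, 0, 0))
		perror("PR_SET_SCHED_NODELAY");

	/* Returns 1 if sched_nodelay is set, 0 otherwise. */
	printf("sched_nodelay: %d\n", prctl(PR_GET_SCHED_NODELAY, 0, 0, 0, 0));
	return 0;
}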

The above change is more of an RFC; feedback on the approach is welcome.

Patches 1-6 have been updated based on comments on the V6 patch series.

---------------- cover letter previously sent --------------------------------
A user thread can get preempted in the middle of executing a critical
section in user space while holding locks, which can have an undesirable
effect on performance. Having a way for the thread to request additional
execution time on the cpu, so that it can complete the critical section,
would be useful in such a scenario. The request can be made by setting a
bit in mapped memory, which the kernel can also access, to check and grant
extra execution time on the cpu.

There have been a couple of proposals[1][2] for such a feature, which
attempt to address the above scenario by granting one extra tick of
execution time. The patch thread [1] posted by Steven Rostedt contains
ample discussion of the need for this feature.

However, the concern has been that this can lead to abuse. One extra tick
can be a long time (about a millisecond or more). Peter Zijlstra in response
posted a prototype solution[5], which grants only a 50us execution time
extension. This is achieved with the help of a timer started on that cpu at
the time of granting the extra execution time. When the timer fires, the
thread is preempted if it is still running.

This patchset implements the above solution as suggested, using the
restartable sequences (rseq) structure for the API. Refer to [3][4] for
further discussions.


v7:
- Addressed comments & suggestions from Thomas Gleixner & Prateek Nayak.
  Renamed 'sched_time_delay' to 'rseq_delay_resched'. Made it a 2-bit 
  member to store 3 states NONE, PROBE & REQUESTED as suggested by
  Thomas Gleixner. Also refactored some code in patch 1.
- Renamed the config option to 'CONFIG_RSEQ_RESCHED_DELAY' and
  added it in patch 1. Added SCHED_HRTICK dependency.
- Patches 7-11 are an attempt to implement the API/mechanism Thomas
  suggested. They introduce a prctl() API which lets an RT thread
  indicate that its scheduling should not be delayed when a thread
  running on the cpu requests extending its time slice.

v6:
https://lore.kernel.org/all/20250701003749.50525-1-prakash.sangappa@oracle.com/
- Rebased onto v6.16-rc3. 
  syscall_exit_to_user_mode_prepare() & __syscall_exit_to_user_mode_work()
  routines have been deleted. Moved changes to the consolidated routine
  syscall_exit_to_user_mode_work() (patch 1).
- Introduced a new config option for scheduler time slice extension
  CONFIG_SCHED_PREEMPT_DELAY which is dependent on CONFIG_RSEQ.
  Enabled by default (new patch 7). Is this reasonable?
- Modified tracepoint to a conditional tracepoint(patch 5), as suggested
  by Steven Rostedt.
- Added kernel parameters documentation for the tunable
  'sysctl_sched_preempt_delay_us' (patch 3).

v5:
https://lore.kernel.org/all/20250603233654.1838967-1-prakash.sangappa@oracle.com/
- Added #ifdef CONFIG_RSEQ and CONFIG_PROC_SYSCTL for sysctl tunable
  changes (patch 3).
- Added #ifdef CONFIG_RSEQ for scheduler stat changes (patch 4).
- Removed deprecated flags from the supported flags returned, as
  pointed out by Mathieu Desnoyers (patch 6).
- Added an IS_ENABLED(CONFIG_SCHED_HRTICK) check before returning the
  supported delay resched flags.

v4:
https://lore.kernel.org/all/20250513214554.4160454-1-prakash.sangappa@oracle.com
- Changed default sched delay extension time to 30us
- Added patch to indicate to userspace if the thread got preempted in
  the extended cpu time granted. Uses another bit in rseq cs flags for it.
  This should help the application to check and avoid having to call a
  system call to yield cpu, especially sched_yield() as pointed out
  by Steven Rostedt.
- Moved tracepoint call towards end of exit_to_user_mode_loop().
- Added a pr_warn() message when the 'sched_preempt_delay_us' tunable is
  set higher than the default value of 30us.
- Added a patch with an API to query if the sched time extension feature
  is supported. A new flag for the sys_rseq flags argument, called
  'RSEQ_FLAG_QUERY_CS_FLAGS', is added, as suggested by Mathieu Desnoyers.
  It returns a bitmask of all the supported rseq cs flags in the
  rseq->flags field.

v3:
https://lore.kernel.org/all/20250502015955.3146733-1-prakash.sangappa@oracle.com
- Addressing review comments by Sebastian and Prateek.
  * Renamed rseq_sched_delay -> sched_time_delay. Moved it near the other
    bits in struct task_struct so it fits in an existing word.
  * Use IS_ENABLED(CONFIG_RSEQ) instead of #ifdef to access
    'sched_time_delay'.
  * removed rseq_delay_resched_tick() call from hrtick_clear().
  * Introduced a patch to add a tracepoint in exit_to_user_mode_loop(),
    suggested by Sebastian.
  * Added comments to describe RSEQ_CS_FLAG_DELAY_RESCHED flag.

v2:
https://lore.kernel.org/all/20250418193410.2010058-1-prakash.sangappa@oracle.com/
- Based on discussions in [3], expecting the user application to call
  sched_yield() to yield the cpu at the end of the critical section may
  not be advisable, as pointed out by Linus.

  So a check was added in the return path from a system call, to reschedule
  if a time slice extension was granted to the thread. The check could as
  well be in the syscall entry path from user mode.
  This allows the application thread to call any system call to yield the
  cpu. Which system call should be suggested? getppid(2) works.

  Do we still need the change in sched_yield() to reschedule when the thread
  has current->rseq_sched_delay set?

- Added a patch to introduce a sysctl tunable parameter to specify the
  duration of the time slice extension in microseconds (us), called
  'sched_preempt_delay_us'.
  Can take a value in the range 0 to 100. Default is set to 50us.
  Setting this tunable to 0 disables the scheduler time slice extension feature.

v1: 
https://lore.kernel.org/all/20250215005414.224409-1-prakash.sangappa@oracle.com/


[1] https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
[2] https://lore.kernel.org/lkml/1395767870-28053-1-git-send-email-khalid.aziz@oracle.com/
[3] https://lore.kernel.org/all/20250131225837.972218232@goodmis.org/
[4] https://lore.kernel.org/all/20241113000126.967713-1-prakash.sangappa@oracle.com/
[5] https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net/
[6] https://lore.kernel.org/all/1631147036-13597-1-git-send-email-prakash.sangappa@oracle.com/

Prakash Sangappa (11):
  sched: Scheduler time slice extension
  sched: Indicate if thread got rescheduled
  sched: Tunable to specify duration of time slice extension
  sched: Add scheduler stat for cpu time slice extension
  sched: Add tracepoint for sched time slice extension
  Add API to query supported rseq cs flags
  sched: Add API to indicate not to delay scheduling
  sched: Add TIF_NEED_RESCHED_NODELAY infrastructure
  sched: Add nodelay scheduling
  sched, x86: Enable nodelay scheduling
  sched: Add kernel parameter to enable delaying RT threads

 .../admin-guide/kernel-parameters.txt         |  8 ++
 Documentation/admin-guide/sysctl/kernel.rst   |  8 ++
 arch/x86/Kconfig                              |  1 +
 arch/x86/include/asm/thread_info.h            |  2 +
 include/linux/entry-common.h                  | 18 ++--
 include/linux/entry-kvm.h                     |  4 +-
 include/linux/sched.h                         | 47 +++++++++-
 include/linux/thread_info.h                   | 11 ++-
 include/trace/events/sched.h                  | 31 +++++++
 include/uapi/linux/prctl.h                    |  3 +
 include/uapi/linux/rseq.h                     | 19 ++++
 init/Kconfig                                  |  7 ++
 kernel/Kconfig.preempt                        |  3 +
 kernel/entry/common.c                         | 36 ++++++-
 kernel/entry/kvm.c                            |  3 +-
 kernel/rseq.c                                 | 71 ++++++++++++++
 kernel/sched/core.c                           | 93 ++++++++++++++++++-
 kernel/sched/debug.c                          |  4 +
 kernel/sched/rt.c                             | 10 +-
 kernel/sched/sched.h                          |  1 +
 kernel/sched/syscalls.c                       |  4 +
 kernel/sys.c                                  | 18 ++++
 22 files changed, 380 insertions(+), 22 deletions(-)

-- 
2.43.5



* [PATCH V7 01/11] sched: Scheduler time slice extension
From: Prakash Sangappa @ 2025-07-24 16:16 UTC
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

Add support for a thread to request extending its execution time slice on
the cpu. The extra cpu time granted helps the thread complete the critical
section and drop any locks without getting preempted. The thread requests
this cpu time extension by setting a bit in the restartable sequences
(rseq) structure registered with the kernel.

The kernel grants a 30us extension on the cpu when it sees the bit set.
With the help of a timer, the kernel force-preempts the thread if it is
still running on the cpu when the 30us timer expires. The thread should
yield the cpu by making a system call after completing the critical section.
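
For illustration, a rough sketch of the intended userspace pattern. It
assumes 'rs' points to this thread's registered struct rseq (e.g. located
via glibc's __rseq_offset) and uses the flag value proposed in this patch:

#include <linux/rseq.h>
#include <sched.h>

/* Flag proposed in this patch; not in released <linux/rseq.h> yet. */
#ifndef RSEQ_CS_FLAG_DELAY_RESCHED
#define RSEQ_CS_FLAG_DELAY_RESCHED	(1U << 3)
#endif

static void locked_update(struct rseq *rs)
{
	/* Request up to 30us of extra time if preemption is attempted. */
	rs->flags |= RSEQ_CS_FLAG_DELAY_RESCHED;

	/* ... acquire lock, run the critical section, release lock ... */

	if (rs->flags & RSEQ_CS_FLAG_DELAY_RESCHED) {
		/* No preemption was attempted; withdraw the request. */
		rs->flags &= ~RSEQ_CS_FLAG_DELAY_RESCHED;
	} else {
		/*
		 * The kernel cleared the bit when it granted the extension;
		 * yield the cpu with a system call now.
		 */
		sched_yield();
	}
}

Any system call would do for the final yield; sched_yield() is shown here
because this patch wires it up explicitly.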

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/entry-common.h | 14 +++++++++----
 include/linux/sched.h        | 28 ++++++++++++++++++++++++++
 include/uapi/linux/rseq.h    |  7 +++++++
 init/Kconfig                 |  7 +++++++
 kernel/entry/common.c        | 28 ++++++++++++++++++++++----
 kernel/rseq.c                | 38 ++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          | 15 ++++++++++++++
 kernel/sched/syscalls.c      |  4 ++++
 8 files changed, 133 insertions(+), 8 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f94f3fdf15fc..7b258d2510f8 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -304,7 +304,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs);
  * exit_to_user_mode_loop - do any pending work before leaving to user space
  */
 unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-				     unsigned long ti_work);
+				     unsigned long ti_work, bool irq);
 
 /**
  * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
@@ -316,7 +316,7 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
  *    EXIT_TO_USER_MODE_WORK are set
  * 4) check that interrupts are still disabled
  */
-static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs, bool irq)
 {
 	unsigned long ti_work;
 
@@ -327,7 +327,10 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
 
 	ti_work = read_thread_flags();
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
-		ti_work = exit_to_user_mode_loop(regs, ti_work);
+		ti_work = exit_to_user_mode_loop(regs, ti_work, irq);
+
+	if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) && irq)
+		rseq_delay_resched_arm_timer();
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
@@ -396,6 +399,9 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
 
 	CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
 
+	/* Reschedule if scheduler time delay was granted */
+	rseq_delay_set_need_resched();
+
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
 		if (WARN(irqs_disabled(), "syscall %lu left IRQs disabled", nr))
 			local_irq_enable();
@@ -411,7 +417,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
 	if (unlikely(work & SYSCALL_WORK_EXIT))
 		syscall_exit_work(regs, work);
 	local_irq_disable_exit_to_user();
-	exit_to_user_mode_prepare(regs);
+	exit_to_user_mode_prepare(regs, false);
 }
 
 /**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5bcf44ae6c79..5d2819afd481 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -338,6 +338,7 @@ extern int __must_check io_schedule_prepare(void);
 extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
+extern void hrtick_local_start(u64 delay);
 
 /* wrapper function to trace from this header file */
 DECLARE_TRACEPOINT(sched_set_state_tp);
@@ -1048,6 +1049,7 @@ struct task_struct {
 	unsigned                        in_thrashing:1;
 #endif
 	unsigned			in_nf_duplicate:1;
+	unsigned			rseq_delay_resched:2;
 #ifdef CONFIG_PREEMPT_RT
 	struct netdev_xmit		net_xmit;
 #endif
@@ -1711,6 +1713,13 @@ static inline char task_state_to_char(struct task_struct *tsk)
 
 extern struct pid *cad_pid;
 
+/*
+ * Used in tsk->rseq_delay_resched.
+ */
+#define	RSEQ_RESCHED_DELAY_NONE		0	/* tsk->rseq not registered */
+#define	RSEQ_RESCHED_DELAY_PROBE	1	/* Evaluate tsk->rseq->flags */
+#define	RSEQ_RESCHED_DELAY_REQUESTED	2	/* Request to delay reschedule successful */
+
 /*
  * Per process flags
  */
@@ -2245,6 +2254,25 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 unsigned long sched_cpu_util(int cpu);
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_RSEQ_RESCHED_DELAY
+extern bool __rseq_delay_resched(void);
+extern void rseq_delay_resched_arm_timer(void);
+extern void rseq_delay_resched_tick(void);
+static inline bool rseq_delay_set_need_resched(void)
+{
+	if (current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED) {
+		set_tsk_need_resched(current);
+		return true;
+	}
+	return false;
+}
+#else
+static inline bool __rseq_delay_resched(void) { return false; }
+static inline void rseq_delay_resched_arm_timer(void) { }
+static inline void rseq_delay_resched_tick(void) { }
+static inline bool rseq_delay_set_need_resched(void) { return false; }
+#endif
+
 #ifdef CONFIG_SCHED_CORE
 extern void sched_core_free(struct task_struct *tsk);
 extern void sched_core_fork(struct task_struct *p);
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac9..25fc636b17d5 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -26,6 +26,7 @@ enum rseq_cs_flags_bit {
 	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
+	RSEQ_CS_FLAG_DELAY_RESCHED_BIT		= 3,
 };
 
 enum rseq_cs_flags {
@@ -35,6 +36,8 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	=
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+	RSEQ_CS_FLAG_DELAY_RESCHED		=
+		(1U << RSEQ_CS_FLAG_DELAY_RESCHED_BIT),
 };
 
 /*
@@ -128,6 +131,10 @@ struct rseq {
 	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
 	 *     Inhibit instruction sequence block restart on migration for
 	 *     this thread.
+	 * - RSEQ_CS_FLAG_DELAY_RESCHED
+	 *     Request by user thread to delay preemption. With use
+	 *     of a timer, the kernel grants extra cpu time up to 30us for this
+	 *     thread before being rescheduled.
 	 */
 	__u32 flags;
 
diff --git a/init/Kconfig b/init/Kconfig
index ce76e913aa2b..3005abab77cf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1130,6 +1130,13 @@ config SCHED_MM_CID
 	def_bool y
 	depends on SMP && RSEQ
 
+config RSEQ_RESCHED_DELAY
+	def_bool y
+	depends on SMP && RSEQ && SCHED_HRTICK
+	help
+	  This feature enables a thread to request extending its time slice on
+	  the cpu by delaying preemption.
+
 config UCLAMP_TASK_GROUP
 	bool "Utilization clamping per group of tasks"
 	depends on CGROUP_SCHED
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index a8dd1f27417c..3d2d670980ec 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -82,13 +82,31 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
 
+static inline bool rseq_delay_resched(unsigned long ti_work)
+{
+	if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
+		return false;
+
+	if (unlikely(current->rseq_delay_resched != RSEQ_RESCHED_DELAY_PROBE))
+		return false;
+
+	if (!(ti_work & (_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_LAZY)))
+		return false;
+
+	if (__rseq_delay_resched()) {
+		clear_tsk_need_resched(current);
+		return true;
+	}
+	return false;
+}
+
 /**
  * exit_to_user_mode_loop - do any pending work before leaving to user space
  * @regs:	Pointer to pt_regs on entry stack
  * @ti_work:	TIF work flags as read by the caller
  */
 __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-						     unsigned long ti_work)
+						     unsigned long ti_work, bool irq)
 {
 	/*
 	 * Before returning to user space ensure that all pending work
@@ -98,8 +116,10 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
-			schedule();
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+		       if (likely(!irq || !rseq_delay_resched(ti_work)))
+			       schedule();
+		}
 
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
@@ -181,7 +201,7 @@ noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
 noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 {
 	instrumentation_begin();
-	exit_to_user_mode_prepare(regs);
+	exit_to_user_mode_prepare(regs, true);
 	instrumentation_end();
 	exit_to_user_mode();
 }
diff --git a/kernel/rseq.c b/kernel/rseq.c
index b7a1ec327e81..8b6af4e12142 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -448,6 +448,40 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	force_sigsegv(sig);
 }
 
+#ifdef	CONFIG_RSEQ_RESCHED_DELAY
+bool __rseq_delay_resched(void)
+{
+	struct task_struct *t = current;
+	u32 flags;
+
+	if (copy_from_user_nofault(&flags, &t->rseq->flags, sizeof(flags)))
+		return false;
+
+	if (!(flags & RSEQ_CS_FLAG_DELAY_RESCHED))
+		return false;
+
+	flags &= ~RSEQ_CS_FLAG_DELAY_RESCHED;
+	if (copy_to_user_nofault(&t->rseq->flags, &flags, sizeof(flags)))
+		return false;
+
+	t->rseq_delay_resched = RSEQ_RESCHED_DELAY_REQUESTED;
+
+	return true;
+}
+
+void rseq_delay_resched_arm_timer(void)
+{
+	if (unlikely(current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED))
+		hrtick_local_start(30 * NSEC_PER_USEC);
+}
+
+void rseq_delay_resched_tick(void)
+{
+	if (current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
+		set_tsk_need_resched(current);
+}
+#endif /* CONFIG_RSEQ_RESCHED_DELAY */
+
 #ifdef CONFIG_DEBUG_RSEQ
 
 /*
@@ -493,6 +527,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		current->rseq = NULL;
 		current->rseq_sig = 0;
 		current->rseq_len = 0;
+		if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
+			current->rseq_delay_resched = RSEQ_RESCHED_DELAY_NONE;
 		return 0;
 	}
 
@@ -561,6 +597,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 	current->rseq = rseq;
 	current->rseq_len = rseq_len;
 	current->rseq_sig = sig;
+	if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
+		current->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
 
 	/*
 	 * If rseq was previously inactive, and has just been
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ad7cf3cfdca..e75ecbb2c1f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -845,6 +845,8 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
 
 	WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
 
+	rseq_delay_resched_tick();
+
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
 	rq->donor->sched_class->task_tick(rq, rq->curr, 1);
@@ -918,6 +920,16 @@ void hrtick_start(struct rq *rq, u64 delay)
 
 #endif /* CONFIG_SMP */
 
+void hrtick_local_start(u64 delay)
+{
+	struct rq *rq = this_rq();
+	struct rq_flags rf;
+
+	rq_lock(rq, &rf);
+	hrtick_start(rq, delay);
+	rq_unlock(rq, &rf);
+}
+
 static void hrtick_rq_init(struct rq *rq)
 {
 #ifdef CONFIG_SMP
@@ -6740,6 +6752,9 @@ static void __sched notrace __schedule(int sched_mode)
 picked:
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
+	if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) &&
+	    prev->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
+		prev->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
 	rq->last_seen_need_resched_ns = 0;
 
 	is_switch = prev != next;
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index ee5641757838..e684a77ed1fb 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -1379,6 +1379,10 @@ static void do_sched_yield(void)
  */
 SYSCALL_DEFINE0(sched_yield)
 {
+	/* Reschedule if scheduler time delay was granted */
+	if (rseq_delay_set_need_resched())
+		return 0;
+
 	do_sched_yield();
 	return 0;
 }
-- 
2.43.5



* [PATCH V7 02/11] sched: Indicate if thread got rescheduled
From: Prakash Sangappa @ 2025-07-24 16:16 UTC
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

Use a bit in the rseq flags to indicate if the thread got rescheduled
after the cpu time extension was granted. The user thread can check this
flag before calling sched_yield() to yield the cpu.
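
For illustration, the end-of-critical-section check from the patch 1 sketch
could then be refined as below; the flag values are the ones proposed in
patches 1 and 2, and 'rs' is the thread's registered struct rseq:

#include <linux/rseq.h>
#include <sched.h>

/* Flag values as proposed in patches 1 and 2; not in released headers. */
#ifndef RSEQ_CS_FLAG_DELAY_RESCHED
#define RSEQ_CS_FLAG_DELAY_RESCHED	(1U << 3)
#define RSEQ_CS_FLAG_RESCHEDULED	(1U << 4)
#endif

static void cs_exit(struct rseq *rs)
{
	__u32 flags = rs->flags;

	if (flags & RSEQ_CS_FLAG_DELAY_RESCHED) {
		/* No extension was granted; withdraw the request. */
		rs->flags = flags & ~RSEQ_CS_FLAG_DELAY_RESCHED;
	} else if (!(flags & RSEQ_CS_FLAG_RESCHEDULED)) {
		/* Extension granted and still on the cpu: yield now. */
		sched_yield();
	}
	/* If RESCHEDULED is set, we were already preempted; skip the syscall. */
}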

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/sched.h     |  2 ++
 include/uapi/linux/rseq.h | 10 ++++++++++
 kernel/rseq.c             | 13 +++++++++++++
 kernel/sched/core.c       |  5 ++---
 4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5d2819afd481..5df055f2dd9e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2258,6 +2258,7 @@ unsigned long sched_cpu_util(int cpu);
 extern bool __rseq_delay_resched(void);
 extern void rseq_delay_resched_arm_timer(void);
 extern void rseq_delay_resched_tick(void);
+extern void rseq_delay_resched_clear(struct task_struct *tsk);
 static inline bool rseq_delay_set_need_resched(void)
 {
 	if (current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED) {
@@ -2271,6 +2272,7 @@ static inline bool __rseq_delay_resched(void) { return false; }
 static inline void rseq_delay_resched_arm_timer(void) { }
 static inline void rseq_delay_resched_tick(void) { }
 static inline bool rseq_delay_set_need_resched(void) { return false; }
+static inline void rseq_delay_resched_clear(struct task_struct *tsk) { }
 #endif
 
 #ifdef CONFIG_SCHED_CORE
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 25fc636b17d5..f4813d931387 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -27,6 +27,7 @@ enum rseq_cs_flags_bit {
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
 	RSEQ_CS_FLAG_DELAY_RESCHED_BIT		= 3,
+	RSEQ_CS_FLAG_RESCHEDULED_BIT		= 4,
 };
 
 enum rseq_cs_flags {
@@ -38,6 +39,9 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
 	RSEQ_CS_FLAG_DELAY_RESCHED		=
 		(1U << RSEQ_CS_FLAG_DELAY_RESCHED_BIT),
+	RSEQ_CS_FLAG_RESCHEDULED		=
+		(1U << RSEQ_CS_FLAG_RESCHEDULED_BIT),
+
 };
 
 /*
@@ -135,6 +139,12 @@ struct rseq {
 	 *     Request by user thread to delay preemption. With use
 	 *     of a timer, the kernel grants extra cpu time up to 30us for this
 	 *     thread before being rescheduled.
+	 * - RSEQ_CS_FLAG_RESCHEDULED
+	 *     Set by kernel if the thread was rescheduled in the extra time
+	 *     granted due to the RSEQ_CS_FLAG_DELAY_RESCHED request. This bit
+	 *     is checked by the thread before calling sched_yield() to yield
+	 *     the cpu. The user thread sets this bit to 0 when setting
+	 *     RSEQ_CS_FLAG_DELAY_RESCHED to request preemption delay.
 	 */
 	__u32 flags;
 
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 8b6af4e12142..6331b653b402 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -480,6 +480,19 @@ void rseq_delay_resched_tick(void)
 	if (current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
 		set_tsk_need_resched(current);
 }
+
+void rseq_delay_resched_clear(struct task_struct *tsk)
+{
+	u32 flags;
+
+	if (tsk->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED) {
+		tsk->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
+		if (copy_from_user_nofault(&flags, &tsk->rseq->flags, sizeof(flags)))
+			return;
+		flags |= RSEQ_CS_FLAG_RESCHEDULED;
+		copy_to_user_nofault(&tsk->rseq->flags, &flags, sizeof(flags));
+	}
+}
 #endif /* CONFIG_RSEQ_RESCHED_DELAY */
 
 #ifdef CONFIG_DEBUG_RSEQ
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e75ecbb2c1f7..ba1e4f6981cd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6752,9 +6752,8 @@ static void __sched notrace __schedule(int sched_mode)
 picked:
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
-	if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) &&
-	    prev->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
-		prev->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
+	if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
+		rseq_delay_resched_clear(prev);
 	rq->last_seen_need_resched_ns = 0;
 
 	is_switch = prev != next;
-- 
2.43.5



* [PATCH V7 03/11] sched: Tunable to specify duration of time slice extension
From: Prakash Sangappa @ 2025-07-24 16:16 UTC
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

Add a tunable to specify the duration of the scheduler time slice
extension. The default is set to 30us and the max value that can be
specified is 100us. Setting it to 0 disables scheduler time slice extension.
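
The tunable is exposed as /proc/sys/kernel/sched_preempt_delay_us by the
sysctl table below. A minimal sketch of adjusting it at run time, here to
50us (writing "0" would disable the feature):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/sys/kernel/sched_preempt_delay_us", O_WRONLY);

	if (fd < 0 || write(fd, "50", 2) != 2)
		perror("sched_preempt_delay_us");
	if (fd >= 0)
		close(fd);
	return 0;
}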

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 .../admin-guide/kernel-parameters.txt         |  8 ++++
 Documentation/admin-guide/sysctl/kernel.rst   |  8 ++++
 include/linux/sched.h                         |  5 +++
 include/uapi/linux/rseq.h                     |  5 ++-
 kernel/rseq.c                                 |  8 +++-
 kernel/sched/core.c                           | 40 +++++++++++++++++++
 6 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0ee6c5314637..1e0f86cda0db 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6398,6 +6398,14 @@
 
 	sched_verbose	[KNL,EARLY] Enables verbose scheduler debug messages.
 
+	sched_preempt_delay_us=	[KNL]
+			Scheduler preemption delay in microseconds.
+			Allowed range is 0 to 100us. A thread can request
+			extending its scheduler time slice on the cpu by
+			delaying preemption. Duration of preemption delay
+			granted is specified by this parameter. Setting it
+			to 0 will disable this feature.
+
 	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
 			Allowed values are enable and disable. This feature
 			incurs a small amount of overhead in the scheduler
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index dd49a89a62d3..f446347215c3 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1232,6 +1232,14 @@ Documentation/accounting/delay-accounting.rst. Enabling this feature incurs
 a small amount of overhead in the scheduler but is useful for debugging
 and performance tuning. It is required by some tools such as iotop.
 
+sched_preempt_delay_us
+======================
+
+Scheduler preemption delay in microseconds.  Allowed range is 0 to 100us.
+A thread can request extending its scheduler time slice on the cpu by
+delaying preemption. Duration of preemption delay granted is specified by
+this parameter. Setting it to 0 will disable this feature.
+
 sched_schedstats
 ================
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5df055f2dd9e..5ba3e33f6252 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -406,6 +406,11 @@ static inline void sched_domains_mutex_lock(void) { }
 static inline void sched_domains_mutex_unlock(void) { }
 #endif
 
+#ifdef CONFIG_RSEQ_RESCHED_DELAY
+/* Scheduler time slice extension duration */
+extern unsigned int sysctl_sched_preempt_delay_us;
+#endif
+
 struct sched_param {
 	int sched_priority;
 };
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index f4813d931387..015534f064af 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -137,8 +137,9 @@ struct rseq {
 	 *     this thread.
 	 * - RSEQ_CS_FLAG_DELAY_RESCHED
 	 *     Request by user thread to delay preemption. With use
-	 *     of a timer, the kernel grants extra cpu time up to 30us for this
-	 *     thread before being rescheduled.
+	 *     of a timer, the kernel grants extra cpu time up to the tunable
+	 *     'sched_preempt_delay_us' value for this thread before it gets
+	 *     rescheduled.
 	 * - RSEQ_CS_FLAG_RESCHEDULED
 	 *     Set by kernel if the thread was rescheduled in the extra time
 	 *     granted due to the RSEQ_CS_FLAG_DELAY_RESCHED request. This bit
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 6331b653b402..3107bbc9b77c 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -454,6 +454,9 @@ bool __rseq_delay_resched(void)
 	struct task_struct *t = current;
 	u32 flags;
 
+	if (!sysctl_sched_preempt_delay_us)
+		return false;
+
 	if (copy_from_user_nofault(&flags, &t->rseq->flags, sizeof(flags)))
 		return false;
 
@@ -471,8 +474,9 @@ bool __rseq_delay_resched(void)
 
 void rseq_delay_resched_arm_timer(void)
 {
-	if (unlikely(current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED))
-		hrtick_local_start(30 * NSEC_PER_USEC);
+	if (unlikely(sysctl_sched_preempt_delay_us &&
+	    current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED))
+		hrtick_local_start(sysctl_sched_preempt_delay_us * NSEC_PER_USEC);
 }
 
 void rseq_delay_resched_tick(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ba1e4f6981cd..03834ac426d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -149,6 +149,16 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
  */
 __read_mostly unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
 
+#ifdef CONFIG_RSEQ_RESCHED_DELAY
+/*
+ * Scheduler time slice extension, duration in microsecs.
+ * Max value allowed 100us, default is 30us.
+ * If set to 0, scheduler time slice extension is disabled.
+ */
+#define SCHED_PREEMPT_DELAY_DEFAULT_US	30
+__read_mostly unsigned int sysctl_sched_preempt_delay_us = SCHED_PREEMPT_DELAY_DEFAULT_US;
+#endif
+
 __read_mostly int scheduler_running;
 
 #ifdef CONFIG_SCHED_CORE
@@ -4678,6 +4688,25 @@ static int sysctl_schedstats(const struct ctl_table *table, int write, void *buf
 #endif /* CONFIG_PROC_SYSCTL */
 #endif /* CONFIG_SCHEDSTATS */
 
+#ifdef CONFIG_PROC_SYSCTL
+#ifdef CONFIG_RSEQ_RESCHED_DELAY
+static int sysctl_sched_preempt_delay(const struct ctl_table *table, int write,
+		void *buffer, size_t *lenp, loff_t *ppos)
+{
+	int err;
+
+	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (sysctl_sched_preempt_delay_us > SCHED_PREEMPT_DELAY_DEFAULT_US)
+		pr_warn("Sched preemption delay set to %d us is higher than the default value %d us\n",
+			sysctl_sched_preempt_delay_us, SCHED_PREEMPT_DELAY_DEFAULT_US);
+
+	return err;
+}
+#endif /* CONFIG_RSEQ_RESCHED_DELAY */
+#endif /* CONFIG_PROC_SYSCTL */
+
 #ifdef CONFIG_SYSCTL
 static const struct ctl_table sched_core_sysctls[] = {
 #ifdef CONFIG_SCHEDSTATS
@@ -4725,6 +4754,17 @@ static const struct ctl_table sched_core_sysctls[] = {
 		.extra2		= SYSCTL_FOUR,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_RSEQ_RESCHED_DELAY
+	{
+		.procname	= "sched_preempt_delay_us",
+		.data		= &sysctl_sched_preempt_delay_us,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_preempt_delay,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE_HUNDRED,
+	},
+#endif /* CONFIG_RSEQ_RESCHED_DELAY */
 };
 static int __init sched_core_sysctl_init(void)
 {
-- 
2.43.5



* [PATCH V7 04/11] sched: Add scheduler stat for cpu time slice extension
From: Prakash Sangappa @ 2025-07-24 16:16 UTC
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

Add a scheduler stat to record the number of times a thread was granted a
cpu time slice extension.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/sched.h | 7 +++++++
 kernel/rseq.c         | 1 +
 kernel/sched/core.c   | 7 +++++++
 kernel/sched/debug.c  | 4 ++++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5ba3e33f6252..5c5868c555f0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -339,6 +339,9 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 extern void hrtick_local_start(u64 delay);
+#ifdef CONFIG_RSEQ_RESCHED_DELAY
+extern void update_stat_preempt_delayed(struct task_struct *t);
+#endif
 
 /* wrapper function to trace from this header file */
 DECLARE_TRACEPOINT(sched_set_state_tp);
@@ -569,6 +572,10 @@ struct sched_statistics {
 	u64				nr_wakeups_passive;
 	u64				nr_wakeups_idle;
 
+#ifdef CONFIG_RSEQ_RESCHED_DELAY
+	u64				nr_preempt_delay_granted;
+#endif
+
 #ifdef CONFIG_SCHED_CORE
 	u64				core_forceidle_sum;
 #endif
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 3107bbc9b77c..6ca3ca959b66 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -468,6 +468,7 @@ bool __rseq_delay_resched(void)
 		return false;
 
 	t->rseq_delay_resched = RSEQ_RESCHED_DELAY_REQUESTED;
+	update_stat_preempt_delayed(t);
 
 	return true;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 03834ac426d0..1ddb45b4b46a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -940,6 +940,13 @@ void hrtick_local_start(u64 delay)
 	rq_unlock(rq, &rf);
 }
 
+#ifdef CONFIG_RSEQ_RESCHED_DELAY
+void update_stat_preempt_delayed(struct task_struct *t)
+{
+	schedstat_inc(t->stats.nr_preempt_delay_granted);
+}
+#endif
+
 static void hrtick_rq_init(struct rq *rq)
 {
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 9135d5c2edea..831b5bbeb805 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1225,6 +1225,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_wakeups_passive);
 		P_SCHEDSTAT(nr_wakeups_idle);
 
+#ifdef	CONFIG_RSEQ_RESCHED_DELAY
+		P_SCHEDSTAT(nr_preempt_delay_granted);
+#endif
+
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
 			avg_atom = div64_ul(avg_atom, nr_switches);
-- 
2.43.5



* [PATCH V7 05/11] sched: Add tracepoint for sched time slice extension
From: Prakash Sangappa @ 2025-07-24 16:16 UTC
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

Trace when a thread's preemption gets delayed, which can occur if the
running thread requested extra time on the cpu. Also record the
NEED_RESCHED flag that was set on the thread and gets cleared.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/trace/events/sched.h | 31 +++++++++++++++++++++++++++++++
 kernel/entry/common.c        |  2 ++
 2 files changed, 33 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 4e6b2910cec3..a4846579f377 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -296,6 +296,37 @@ TRACE_EVENT(sched_migrate_task,
 		  __entry->orig_cpu, __entry->dest_cpu)
 );
 
+/*
+ * Tracepoint for delayed resched requested by task:
+ */
+TRACE_EVENT_CONDITION(sched_delay_resched,
+
+	TP_PROTO(struct task_struct *p, unsigned int ti_work_cleared),
+
+	TP_ARGS(p, ti_work_cleared),
+
+	TP_CONDITION(ti_work_cleared),
+
+	TP_STRUCT__entry(
+		__array( char, comm, TASK_COMM_LEN	)
+		__field( pid_t, pid			)
+		__field( int, cpu			)
+		__field( int, flg			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->cpu 		= task_cpu(p);
+		__entry->flg		= ti_work_cleared & (_TIF_NEED_RESCHED |
+					_TIF_NEED_RESCHED_LAZY);
+	),
+
+	TP_printk("comm=%s pid=%d cpu=%d resched_flg_cleared=0x%x",
+		__entry->comm, __entry->pid, __entry->cpu, __entry->flg)
+
+);
+
 DECLARE_EVENT_CLASS(sched_process_template,
 
 	TP_PROTO(struct task_struct *p),
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 3d2d670980ec..2635fecb83ff 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -12,6 +12,7 @@
 
 #include "common.h"
 
+#include <trace/events/sched.h>
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
@@ -95,6 +96,7 @@ static inline bool rseq_delay_resched(unsigned long ti_work)
 
 	if (__rseq_delay_resched()) {
 		clear_tsk_need_resched(current);
+		trace_sched_delay_resched(current, ti_work);
 		return true;
 	}
 	return false;
-- 
2.43.5



* [PATCH V7 06/11] Add API to query supported rseq cs flags
From: Prakash Sangappa @ 2025-07-24 16:16 UTC
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

For the API, add a new flag to the sys_rseq 'flags' argument called
RSEQ_FLAG_QUERY_CS_FLAGS.

When this flag is passed, the kernel returns a bitmask of all the supported
rseq cs flags in the user-provided rseq struct's 'flags' member.
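
For illustration, a minimal query sketch; the flag value is the one
proposed in this patch, and sys_rseq is invoked via syscall(2) since it
has no glibc wrapper:

#include <linux/rseq.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Flag proposed in this patch; not in released <linux/rseq.h> yet. */
#ifndef RSEQ_FLAG_QUERY_CS_FLAGS
#define RSEQ_FLAG_QUERY_CS_FLAGS	(1 << 1)
#endif

int main(void)
{
	struct rseq rs = { 0 };

	/* Only rs.flags is written by the kernel for this query. */
	if (syscall(SYS_rseq, &rs, sizeof(rs), RSEQ_FLAG_QUERY_CS_FLAGS, 0))
		perror("rseq query");
	else
		printf("supported rseq cs flags: 0x%x\n", rs.flags);
	return 0;
}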

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/uapi/linux/rseq.h |  1 +
 kernel/rseq.c             | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 015534f064af..44baea9dd10a 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -20,6 +20,7 @@ enum rseq_cpu_id_state {
 
 enum rseq_flags {
 	RSEQ_FLAG_UNREGISTER = (1 << 0),
+	RSEQ_FLAG_QUERY_CS_FLAGS = (1 << 1),
 };
 
 enum rseq_cs_flags_bit {
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 6ca3ca959b66..7f4daeba6d0d 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -550,6 +550,21 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		return 0;
 	}
 
+	/*
+	 * Return supported rseq_cs flags.
+	 */
+	if (flags & RSEQ_FLAG_QUERY_CS_FLAGS) {
+		u32 rseq_csflags = RSEQ_CS_FLAG_DELAY_RESCHED |
+				   RSEQ_CS_FLAG_RESCHEDULED;
+		if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
+			return -EINVAL;
+		if (!rseq)
+			return -EINVAL;
+		if (copy_to_user(&rseq->flags, &rseq_csflags, sizeof(u32)))
+			return -EFAULT;
+		return 0;
+	}
+
 	if (unlikely(flags))
 		return -EINVAL;
 
-- 
2.43.5



* [PATCH V7 07/11] sched: Add API to indicate not to delay scheduling
From: Prakash Sangappa @ 2025-07-24 16:16 UTC
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

Add an API for user threads to request that the scheduler not delay
scheduling them when woken up to run. This is allowed only for RT threads.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/sched.h      |  1 +
 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c               | 18 ++++++++++++++++++
 3 files changed, 22 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5c5868c555f0..3e8eb64658d1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1062,6 +1062,7 @@ struct task_struct {
 #endif
 	unsigned			in_nf_duplicate:1;
 	unsigned			rseq_delay_resched:2;
+	unsigned			sched_nodelay:1;
 #ifdef CONFIG_PREEMPT_RT
 	struct netdev_xmit		net_xmit;
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 6f9912c65595..907300cd4469 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -379,4 +379,7 @@ struct prctl_mm_map {
 # define PR_FUTEX_HASH_GET_SLOTS	2
 # define PR_FUTEX_HASH_GET_IMMUTABLE	3
 
+/* TASK sched nodelay request */
+#define PR_SET_SCHED_NODELAY		79
+#define PR_GET_SCHED_NODELAY		80
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index a088a6b1ac23..2f8b4512c6e4 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2890,6 +2890,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_FUTEX_HASH:
 		error = futex_hash_prctl(arg2, arg3, arg4);
 		break;
+	case PR_SET_SCHED_NODELAY:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		if (current->sched_class != &rt_sched_class)
+			return -EINVAL;
+		if (arg2)
+			current->sched_nodelay = 1;
+		else
+			current->sched_nodelay = 0;
+		break;
+	case PR_GET_SCHED_NODELAY:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		if (current->sched_class != &rt_sched_class)
+			return -EINVAL;
+		error = (current->sched_nodelay == 1);
+		break;
+
 	default:
 		trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
 		error = -EINVAL;
-- 
2.43.5



* [PATCH V7 08/11] sched: Add TIF_NEED_RESCHED_NODELAY infrastructure
From: Prakash Sangappa @ 2025-07-24 16:16 UTC
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

Add basic infrastructure to introduce a bit for nodelay resched. This is
mainly used by RT threads to indicate that their scheduling should not be
delayed by the thread running on the cpu that has requested extending its
cpu time slice.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/entry-common.h |  4 ++--
 include/linux/entry-kvm.h    |  4 ++--
 include/linux/sched.h        |  3 ++-
 include/linux/thread_info.h  | 11 ++++++++++-
 kernel/entry/common.c        |  3 ++-
 kernel/entry/kvm.c           |  3 ++-
 kernel/sched/core.c          |  4 ++--
 7 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 7b258d2510f8..79510895f87a 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -66,8 +66,8 @@
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
 	 _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY |			\
-	 _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |			\
-	 ARCH_EXIT_TO_USER_MODE_WORK)
+	 _TIF_NEED_RESCHED_NODELAY | _TIF_PATCH_PENDING |		\
+	 _TIF_NOTIFY_SIGNAL | ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
  * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 16149f6625e4..eb59f8185f42 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -18,8 +18,8 @@
 
 #define XFER_TO_GUEST_MODE_WORK						\
 	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_SIGPENDING | \
-	 _TIF_NOTIFY_SIGNAL | _TIF_NOTIFY_RESUME |			\
-	 ARCH_XFER_TO_GUEST_MODE_WORK)
+	 _TIF_NEED_RESCHED_NODELAY | _TIF_NOTIFY_SIGNAL |		\
+	 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
 
 struct kvm_vcpu;
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3e8eb64658d1..af3bf1923509 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2093,7 +2093,8 @@ static inline void set_tsk_need_resched(struct task_struct *tsk)
 
 static inline void clear_tsk_need_resched(struct task_struct *tsk)
 {
-	atomic_long_andnot(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY,
+	atomic_long_andnot(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY |
+			   _TIF_NEED_RESCHED_NODELAY,
 			   (atomic_long_t *)&task_thread_info(tsk)->flags);
 }
 
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index dd925d84fa46..ee7fa1f8f242 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -67,6 +67,14 @@ enum syscall_work_bit {
 #define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
 #endif
 
+#ifndef TIF_NEED_RESCHED_NODELAY
+#ifdef CONFIG_ARCH_HAS_PREEMPT_NODELAY
+#error Inconsistent PREEMPT_NODELAY
+#endif
+#define TIF_NEED_RESCHED_NODELAY TIF_NEED_RESCHED
+#define _TIF_NEED_RESCHED_NODELAY _TIF_NEED_RESCHED
+#endif
+
 #ifdef __KERNEL__
 
 #ifndef arch_set_restart_data
@@ -205,7 +213,8 @@ static __always_inline bool tif_test_bit(int bit)
 
 static __always_inline bool tif_need_resched(void)
 {
-	return tif_test_bit(TIF_NEED_RESCHED);
+	return (tif_test_bit(TIF_NEED_RESCHED) ||
+		tif_test_bit(TIF_NEED_RESCHED_NODELAY));
 }
 
 #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 2635fecb83ff..15ddf335ad4a 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -118,7 +118,8 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY |
+		    _TIF_NEED_RESCHED_NODELAY)) {
 		       if (likely(!irq || !rseq_delay_resched(ti_work)))
 			       schedule();
 		}
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 8485f63863af..f4c10bbb42ac 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -13,7 +13,8 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 			return -EINTR;
 		}
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY |
+		    _TIF_NEED_RESCHED_NODELAY))
 			schedule();
 
 		if (ti_work & _TIF_NOTIFY_RESUME)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1ddb45b4b46a..035eec8911c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1141,13 +1141,13 @@ static void __resched_curr(struct rq *rq, int tif)
 
 	if (cpu == smp_processor_id()) {
 		set_ti_thread_flag(cti, tif);
-		if (tif == TIF_NEED_RESCHED)
+		if (tif == TIF_NEED_RESCHED || tif == TIF_NEED_RESCHED_NODELAY)
 			set_preempt_need_resched();
 		return;
 	}
 
 	if (set_nr_and_not_polling(cti, tif)) {
-		if (tif == TIF_NEED_RESCHED)
+		if (tif == TIF_NEED_RESCHED || tif == TIF_NEED_RESCHED_NODELAY)
 			smp_send_reschedule(cpu);
 	} else {
 		trace_sched_wake_idle_without_ipi(cpu);
-- 
2.43.5



* [PATCH V7 09/11] sched: Add nodelay scheduling
From: Prakash Sangappa @ 2025-07-24 16:16 UTC
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

Latency-sensitive realtime threads can indicate that they should not be
delayed by a thread running on the cpu that has requested a scheduler time
slice extension.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 init/Kconfig           |  2 +-
 kernel/Kconfig.preempt |  3 +++
 kernel/sched/core.c    | 14 ++++++++++++++
 kernel/sched/rt.c      | 10 +++++-----
 kernel/sched/sched.h   |  1 +
 5 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 3005abab77cf..119448f0b9e1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1132,7 +1132,7 @@ config SCHED_MM_CID
 
 config RSEQ_RESCHED_DELAY
 	def_bool y
-	depends on SMP && RSEQ && SCHED_HRTICK
+	depends on SMP && RSEQ && SCHED_HRTICK && ARCH_HAS_PREEMPT_NODELAY
 	help
 	  This feature enables a thread to request extending its time slice on
 	  the cpu by delaying preemption.
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 54ea59ff8fbe..96809d8d8bcb 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -14,6 +14,9 @@ config PREEMPT_BUILD
 config ARCH_HAS_PREEMPT_LAZY
 	bool
 
+config ARCH_HAS_PREEMPT_NODELAY
+	bool
+
 choice
 	prompt "Preemption Model"
 	default PREEMPT_NONE
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 035eec8911c2..e9be8a6b8851 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1185,6 +1185,20 @@ void resched_curr_lazy(struct rq *rq)
 	__resched_curr(rq, get_lazy_tif_bit());
 }
 
+#ifdef	CONFIG_RSEQ_RESCHED_DELAY
+void resched_curr_nodelay(struct rq *rq, struct task_struct *p)
+{
+	int tif = p->sched_nodelay ? TIF_NEED_RESCHED_NODELAY : TIF_NEED_RESCHED;
+
+	__resched_curr(rq, tif);
+}
+#else
+void resched_curr_nodelay(struct rq *rq, struct task_struct *p)
+{
+	__resched_curr(rq, TIF_NEED_RESCHED);
+}
+#endif
+
 void resched_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e40422c37033..1beae971799e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1027,7 +1027,7 @@ static void update_curr_rt(struct rq *rq)
 			rt_rq->rt_time += delta_exec;
 			exceeded = sched_rt_runtime_exceeded(rt_rq);
 			if (exceeded)
-				resched_curr(rq);
+				resched_curr_nodelay(rq, rq->curr);
 			raw_spin_unlock(&rt_rq->rt_runtime_lock);
 			if (exceeded)
 				do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
@@ -1634,7 +1634,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 	 * to try and push the current task away:
 	 */
 	requeue_task_rt(rq, p, 1);
-	resched_curr(rq);
+	resched_curr_nodelay(rq, p);
 }
 
 static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
@@ -1663,7 +1663,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
 	struct task_struct *donor = rq->donor;
 
 	if (p->prio < donor->prio) {
-		resched_curr(rq);
+		resched_curr_nodelay(rq, p);
 		return;
 	}
 
@@ -1999,7 +1999,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 	 * just reschedule current.
 	 */
 	if (unlikely(next_task->prio < rq->donor->prio)) {
-		resched_curr(rq);
+		resched_curr_nodelay(rq, next_task);
 		return 0;
 	}
 
@@ -2087,7 +2087,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 	}
 
 	move_queued_task_locked(rq, lowest_rq, next_task);
-	resched_curr(lowest_rq);
+	resched_curr_nodelay(lowest_rq, next_task);
 	ret = 1;
 
 	double_unlock_balance(rq, lowest_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f213f9e68aa6..b81354dfed3c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2698,6 +2698,7 @@ extern void init_sched_fair_class(void);
 extern void resched_curr(struct rq *rq);
 extern void resched_curr_lazy(struct rq *rq);
 extern void resched_cpu(int cpu);
+extern void resched_curr_nodelay(struct rq *rq, struct task_struct *p);
 
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread
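
The intended effect of the new helper, sketched outside the diff using the
series' flag names: waking an RT thread with sched_nodelay set raises
TIF_NEED_RESCHED_NODELAY on the running task, and the rseq delay path is
expected to treat that bit as non-delayable, roughly:

	/* sketch only: in the entry-path delay check, a NODELAY request
	 * is never eligible for the time slice extension */
	if (ti_work & _TIF_NEED_RESCHED_NODELAY)
		return false;	/* preempt immediately */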

* [PATCH V7 10/11] sched, x86: Enable nodelay scheduling
  2025-07-24 16:16 [PATCH V7 00/11] Scheduler time slice extension Prakash Sangappa
                   ` (8 preceding siblings ...)
  2025-07-24 16:16 ` [PATCH V7 09/11] sched: Add nodelay scheduling Prakash Sangappa
@ 2025-07-24 16:16 ` Prakash Sangappa
  2025-07-24 16:16 ` [PATCH V7 11/11] sched: Add kernel parameter to enable delaying RT threads Prakash Sangappa
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Prakash Sangappa @ 2025-07-24 16:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

Add the TIF_NEED_RESCHED_NODELAY bit on x86 and select ARCH_HAS_PREEMPT_NODELAY
in Kconfig.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 arch/x86/Kconfig                   | 1 +
 arch/x86/include/asm/thread_info.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 71019b3b54ea..8925af10b9b5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -101,6 +101,7 @@ config X86
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_PREEMPT_LAZY
+	select ARCH_HAS_PREEMPT_NODELAY
 	select ARCH_HAS_PTE_DEVMAP		if X86_64
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_HW_PTE_YOUNG
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 9282465eea21..00ef128cea9d 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -90,6 +90,7 @@ struct thread_info {
 #define TIF_NEED_RESCHED_LAZY	4	/* Lazy rescheduling needed */
 #define TIF_SINGLESTEP		5	/* reenable singlestep on user return*/
 #define TIF_SSBD		6	/* Speculative store bypass disable */
+#define TIF_NEED_RESCHED_NODELAY	7	/* No delay rescheduling needed */
 #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
 #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
@@ -114,6 +115,7 @@ struct thread_info {
 #define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)
 #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
 #define _TIF_SSBD		(1 << TIF_SSBD)
+#define _TIF_NEED_RESCHED_NODELAY	(1 << TIF_NEED_RESCHED_NODELAY)
 #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
 #define _TIF_SPEC_L1D_FLUSH	(1 << TIF_SPEC_L1D_FLUSH)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread
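
Generalized, the arch-side opt-in this patch performs for x86 amounts to the
following sketch for any architecture that still has a free TIF bit (the bit
number below is arbitrary):

	# arch Kconfig
	select ARCH_HAS_PREEMPT_NODELAY

	/* arch thread_info.h */
	#define TIF_NEED_RESCHED_NODELAY	7	/* pick a free bit */
	#define _TIF_NEED_RESCHED_NODELAY	(1 << TIF_NEED_RESCHED_NODELAY)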

* [PATCH V7 11/11] sched: Add kernel parameter to enable delaying RT threads
  2025-07-24 16:16 [PATCH V7 00/11] Scheduler time slice extension Prakash Sangappa
                   ` (9 preceding siblings ...)
  2025-07-24 16:16 ` [PATCH V7 10/11] sched, x86: Enable " Prakash Sangappa
@ 2025-07-24 16:16 ` Prakash Sangappa
  2025-07-25 15:52   ` kernel test robot
  2025-08-06 16:03 ` [PATCH V7 00/11] Scheduler time slice extension Prakash Sangappa
  2025-08-06 16:30 ` Thomas Gleixner
  12 siblings, 1 reply; 38+ messages in thread
From: Prakash Sangappa @ 2025-07-24 16:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

Add a kernel parameter to enable or disable delaying an RT thread from being
scheduled on the cpu when a thread running on the cpu has requested extending
its time slice.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/sched.h |  1 +
 kernel/entry/common.c |  7 ++++++-
 kernel/sched/core.c   | 14 ++++++++++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index af3bf1923509..2e65aafeef23 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -412,6 +412,7 @@ static inline void sched_domains_mutex_unlock(void) { }
 #ifdef CONFIG_RSEQ_RESCHED_DELAY
 /* Scheduler time slice extension duration */
 extern unsigned int sysctl_sched_preempt_delay_us;
+extern unsigned int sysctl_sched_delay_rt;
 #endif
 
 struct sched_param {
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 15ddf335ad4a..912565a24cca 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -85,13 +85,18 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
 
 static inline bool rseq_delay_resched(unsigned long ti_work)
 {
+	unsigned long tiflag;
+
 	if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
 		return false;
 
 	if (unlikely(current->rseq_delay_resched != RSEQ_RESCHED_DELAY_PROBE))
 		return false;
 
-	if (!(ti_work & (_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_LAZY)))
+	tiflag = sysctl_sched_delay_rt ? _TIF_NEED_RESCHED|_TIF_NEED_RESCHED_LAZY :
+		 _TIF_NEED_RESCHED_LAZY;
+
+	if (!(ti_work & tiflag))
 		return false;
 
 	if (__rseq_delay_resched()) {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e9be8a6b8851..bf16e11a3c27 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -157,6 +157,11 @@ __read_mostly unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
  */
 #define SCHED_PREEMPT_DELAY_DEFAULT_US	30
 __read_mostly unsigned int sysctl_sched_preempt_delay_us = SCHED_PREEMPT_DELAY_DEFAULT_US;
+
+/*
+ * Scheduler time slice extension - Enable delaying RT threads. Disabled by default.
+ */
+__read_mostly unsigned int sysctl_sched_delay_rt = 0;
 #endif
 
 __read_mostly int scheduler_running;
@@ -4785,6 +4790,15 @@ static const struct ctl_table sched_core_sysctls[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE_HUNDRED,
 	},
+	{
+		.procname	= "sched_delay_rt",
+		.data		= &sysctl_sched_delay_rt,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1         = SYSCTL_ZERO,
+		.extra2         = SYSCTL_ONE,
+	},
 #endif /* CONFIG_RSEQ_RESCHED_DELAY */
 };
 static int __init sched_core_sysctl_init(void)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread
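
A runtime usage sketch, assuming the ctl_table above exposes the knob as
/proc/sys/kernel/sched_delay_rt (default 0, i.e. RT threads are never
delayed):

	#include <fcntl.h>
	#include <unistd.h>

	/* opt the whole system in to delaying RT threads as well */
	int fd = open("/proc/sys/kernel/sched_delay_rt", O_WRONLY);
	if (fd >= 0) {
		write(fd, "1", 1);	/* write "0" to restore the default */
		close(fd);
	}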

* Re: [PATCH V7 07/11] sched: Add API to indicate not to delay scheduling
  2025-07-24 16:16 ` [PATCH V7 07/11] sched: Add API to indicate not to delay scheduling Prakash Sangappa
@ 2025-07-25 14:30   ` kernel test robot
  0 siblings, 0 replies; 38+ messages in thread
From: kernel test robot @ 2025-07-25 14:30 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel
  Cc: oe-kbuild-all, peterz, rostedt, mathieu.desnoyers, tglx, bigeasy,
	kprateek.nayak, vineethr, prakash.sangappa

Hi Prakash,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/x86/core]
[also build test ERROR on linus/master v6.16-rc7]
[cannot apply to tip/sched/core tip/core/entry next-20250725]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Prakash-Sangappa/sched-Scheduler-time-slice-extension/20250725-002052
base:   tip/x86/core
patch link:    https://lore.kernel.org/r/20250724161625.2360309-8-prakash.sangappa%40oracle.com
patch subject: [PATCH V7 07/11] sched: Add API to indicate not to delay scheduling
config: riscv-randconfig-001-20250725 (https://download.01.org/0day-ci/archive/20250725/202507252250.gOKHNKiq-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 10.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250725/202507252250.gOKHNKiq-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507252250.gOKHNKiq-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/sys.c: In function '__do_sys_prctl':
>> kernel/sys.c:2830:32: error: 'rt_sched_class' undeclared (first use in this function); did you mean 'sched_class'?
    2830 |   if (current->sched_class != &rt_sched_class)
         |                                ^~~~~~~~~~~~~~
         |                                sched_class
   kernel/sys.c:2830:32: note: each undeclared identifier is reported only once for each function it appears in


vim +2830 kernel/sys.c

  2473	
  2474	SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
  2475			unsigned long, arg4, unsigned long, arg5)
  2476	{
  2477		struct task_struct *me = current;
  2478		unsigned char comm[sizeof(me->comm)];
  2479		long error;
  2480	
  2481		error = security_task_prctl(option, arg2, arg3, arg4, arg5);
  2482		if (error != -ENOSYS)
  2483			return error;
  2484	
  2485		error = 0;
  2486		switch (option) {
  2487		case PR_SET_PDEATHSIG:
  2488			if (!valid_signal(arg2)) {
  2489				error = -EINVAL;
  2490				break;
  2491			}
  2492			me->pdeath_signal = arg2;
  2493			break;
  2494		case PR_GET_PDEATHSIG:
  2495			error = put_user(me->pdeath_signal, (int __user *)arg2);
  2496			break;
  2497		case PR_GET_DUMPABLE:
  2498			error = get_dumpable(me->mm);
  2499			break;
  2500		case PR_SET_DUMPABLE:
  2501			if (arg2 != SUID_DUMP_DISABLE && arg2 != SUID_DUMP_USER) {
  2502				error = -EINVAL;
  2503				break;
  2504			}
  2505			set_dumpable(me->mm, arg2);
  2506			break;
  2507	
  2508		case PR_SET_UNALIGN:
  2509			error = SET_UNALIGN_CTL(me, arg2);
  2510			break;
  2511		case PR_GET_UNALIGN:
  2512			error = GET_UNALIGN_CTL(me, arg2);
  2513			break;
  2514		case PR_SET_FPEMU:
  2515			error = SET_FPEMU_CTL(me, arg2);
  2516			break;
  2517		case PR_GET_FPEMU:
  2518			error = GET_FPEMU_CTL(me, arg2);
  2519			break;
  2520		case PR_SET_FPEXC:
  2521			error = SET_FPEXC_CTL(me, arg2);
  2522			break;
  2523		case PR_GET_FPEXC:
  2524			error = GET_FPEXC_CTL(me, arg2);
  2525			break;
  2526		case PR_GET_TIMING:
  2527			error = PR_TIMING_STATISTICAL;
  2528			break;
  2529		case PR_SET_TIMING:
  2530			if (arg2 != PR_TIMING_STATISTICAL)
  2531				error = -EINVAL;
  2532			break;
  2533		case PR_SET_NAME:
  2534			comm[sizeof(me->comm) - 1] = 0;
  2535			if (strncpy_from_user(comm, (char __user *)arg2,
  2536					      sizeof(me->comm) - 1) < 0)
  2537				return -EFAULT;
  2538			set_task_comm(me, comm);
  2539			proc_comm_connector(me);
  2540			break;
  2541		case PR_GET_NAME:
  2542			get_task_comm(comm, me);
  2543			if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
  2544				return -EFAULT;
  2545			break;
  2546		case PR_GET_ENDIAN:
  2547			error = GET_ENDIAN(me, arg2);
  2548			break;
  2549		case PR_SET_ENDIAN:
  2550			error = SET_ENDIAN(me, arg2);
  2551			break;
  2552		case PR_GET_SECCOMP:
  2553			error = prctl_get_seccomp();
  2554			break;
  2555		case PR_SET_SECCOMP:
  2556			error = prctl_set_seccomp(arg2, (char __user *)arg3);
  2557			break;
  2558		case PR_GET_TSC:
  2559			error = GET_TSC_CTL(arg2);
  2560			break;
  2561		case PR_SET_TSC:
  2562			error = SET_TSC_CTL(arg2);
  2563			break;
  2564		case PR_TASK_PERF_EVENTS_DISABLE:
  2565			error = perf_event_task_disable();
  2566			break;
  2567		case PR_TASK_PERF_EVENTS_ENABLE:
  2568			error = perf_event_task_enable();
  2569			break;
  2570		case PR_GET_TIMERSLACK:
  2571			if (current->timer_slack_ns > ULONG_MAX)
  2572				error = ULONG_MAX;
  2573			else
  2574				error = current->timer_slack_ns;
  2575			break;
  2576		case PR_SET_TIMERSLACK:
  2577			if (rt_or_dl_task_policy(current))
  2578				break;
  2579			if (arg2 <= 0)
  2580				current->timer_slack_ns =
  2581						current->default_timer_slack_ns;
  2582			else
  2583				current->timer_slack_ns = arg2;
  2584			break;
  2585		case PR_MCE_KILL:
  2586			if (arg4 | arg5)
  2587				return -EINVAL;
  2588			switch (arg2) {
  2589			case PR_MCE_KILL_CLEAR:
  2590				if (arg3 != 0)
  2591					return -EINVAL;
  2592				current->flags &= ~PF_MCE_PROCESS;
  2593				break;
  2594			case PR_MCE_KILL_SET:
  2595				current->flags |= PF_MCE_PROCESS;
  2596				if (arg3 == PR_MCE_KILL_EARLY)
  2597					current->flags |= PF_MCE_EARLY;
  2598				else if (arg3 == PR_MCE_KILL_LATE)
  2599					current->flags &= ~PF_MCE_EARLY;
  2600				else if (arg3 == PR_MCE_KILL_DEFAULT)
  2601					current->flags &=
  2602							~(PF_MCE_EARLY|PF_MCE_PROCESS);
  2603				else
  2604					return -EINVAL;
  2605				break;
  2606			default:
  2607				return -EINVAL;
  2608			}
  2609			break;
  2610		case PR_MCE_KILL_GET:
  2611			if (arg2 | arg3 | arg4 | arg5)
  2612				return -EINVAL;
  2613			if (current->flags & PF_MCE_PROCESS)
  2614				error = (current->flags & PF_MCE_EARLY) ?
  2615					PR_MCE_KILL_EARLY : PR_MCE_KILL_LATE;
  2616			else
  2617				error = PR_MCE_KILL_DEFAULT;
  2618			break;
  2619		case PR_SET_MM:
  2620			error = prctl_set_mm(arg2, arg3, arg4, arg5);
  2621			break;
  2622		case PR_GET_TID_ADDRESS:
  2623			error = prctl_get_tid_address(me, (int __user * __user *)arg2);
  2624			break;
  2625		case PR_SET_CHILD_SUBREAPER:
  2626			me->signal->is_child_subreaper = !!arg2;
  2627			if (!arg2)
  2628				break;
  2629	
  2630			walk_process_tree(me, propagate_has_child_subreaper, NULL);
  2631			break;
  2632		case PR_GET_CHILD_SUBREAPER:
  2633			error = put_user(me->signal->is_child_subreaper,
  2634					 (int __user *)arg2);
  2635			break;
  2636		case PR_SET_NO_NEW_PRIVS:
  2637			if (arg2 != 1 || arg3 || arg4 || arg5)
  2638				return -EINVAL;
  2639	
  2640			task_set_no_new_privs(current);
  2641			break;
  2642		case PR_GET_NO_NEW_PRIVS:
  2643			if (arg2 || arg3 || arg4 || arg5)
  2644				return -EINVAL;
  2645			return task_no_new_privs(current) ? 1 : 0;
  2646		case PR_GET_THP_DISABLE:
  2647			if (arg2 || arg3 || arg4 || arg5)
  2648				return -EINVAL;
  2649			error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
  2650			break;
  2651		case PR_SET_THP_DISABLE:
  2652			if (arg3 || arg4 || arg5)
  2653				return -EINVAL;
  2654			if (mmap_write_lock_killable(me->mm))
  2655				return -EINTR;
  2656			if (arg2)
  2657				set_bit(MMF_DISABLE_THP, &me->mm->flags);
  2658			else
  2659				clear_bit(MMF_DISABLE_THP, &me->mm->flags);
  2660			mmap_write_unlock(me->mm);
  2661			break;
  2662		case PR_MPX_ENABLE_MANAGEMENT:
  2663		case PR_MPX_DISABLE_MANAGEMENT:
  2664			/* No longer implemented: */
  2665			return -EINVAL;
  2666		case PR_SET_FP_MODE:
  2667			error = SET_FP_MODE(me, arg2);
  2668			break;
  2669		case PR_GET_FP_MODE:
  2670			error = GET_FP_MODE(me);
  2671			break;
  2672		case PR_SVE_SET_VL:
  2673			error = SVE_SET_VL(arg2);
  2674			break;
  2675		case PR_SVE_GET_VL:
  2676			error = SVE_GET_VL();
  2677			break;
  2678		case PR_SME_SET_VL:
  2679			error = SME_SET_VL(arg2);
  2680			break;
  2681		case PR_SME_GET_VL:
  2682			error = SME_GET_VL();
  2683			break;
  2684		case PR_GET_SPECULATION_CTRL:
  2685			if (arg3 || arg4 || arg5)
  2686				return -EINVAL;
  2687			error = arch_prctl_spec_ctrl_get(me, arg2);
  2688			break;
  2689		case PR_SET_SPECULATION_CTRL:
  2690			if (arg4 || arg5)
  2691				return -EINVAL;
  2692			error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
  2693			break;
  2694		case PR_PAC_RESET_KEYS:
  2695			if (arg3 || arg4 || arg5)
  2696				return -EINVAL;
  2697			error = PAC_RESET_KEYS(me, arg2);
  2698			break;
  2699		case PR_PAC_SET_ENABLED_KEYS:
  2700			if (arg4 || arg5)
  2701				return -EINVAL;
  2702			error = PAC_SET_ENABLED_KEYS(me, arg2, arg3);
  2703			break;
  2704		case PR_PAC_GET_ENABLED_KEYS:
  2705			if (arg2 || arg3 || arg4 || arg5)
  2706				return -EINVAL;
  2707			error = PAC_GET_ENABLED_KEYS(me);
  2708			break;
  2709		case PR_SET_TAGGED_ADDR_CTRL:
  2710			if (arg3 || arg4 || arg5)
  2711				return -EINVAL;
  2712			error = SET_TAGGED_ADDR_CTRL(arg2);
  2713			break;
  2714		case PR_GET_TAGGED_ADDR_CTRL:
  2715			if (arg2 || arg3 || arg4 || arg5)
  2716				return -EINVAL;
  2717			error = GET_TAGGED_ADDR_CTRL();
  2718			break;
  2719		case PR_SET_IO_FLUSHER:
  2720			if (!capable(CAP_SYS_RESOURCE))
  2721				return -EPERM;
  2722	
  2723			if (arg3 || arg4 || arg5)
  2724				return -EINVAL;
  2725	
  2726			if (arg2 == 1)
  2727				current->flags |= PR_IO_FLUSHER;
  2728			else if (!arg2)
  2729				current->flags &= ~PR_IO_FLUSHER;
  2730			else
  2731				return -EINVAL;
  2732			break;
  2733		case PR_GET_IO_FLUSHER:
  2734			if (!capable(CAP_SYS_RESOURCE))
  2735				return -EPERM;
  2736	
  2737			if (arg2 || arg3 || arg4 || arg5)
  2738				return -EINVAL;
  2739	
  2740			error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
  2741			break;
  2742		case PR_SET_SYSCALL_USER_DISPATCH:
  2743			error = set_syscall_user_dispatch(arg2, arg3, arg4,
  2744							  (char __user *) arg5);
  2745			break;
  2746	#ifdef CONFIG_SCHED_CORE
  2747		case PR_SCHED_CORE:
  2748			error = sched_core_share_pid(arg2, arg3, arg4, arg5);
  2749			break;
  2750	#endif
  2751		case PR_SET_MDWE:
  2752			error = prctl_set_mdwe(arg2, arg3, arg4, arg5);
  2753			break;
  2754		case PR_GET_MDWE:
  2755			error = prctl_get_mdwe(arg2, arg3, arg4, arg5);
  2756			break;
  2757		case PR_PPC_GET_DEXCR:
  2758			if (arg3 || arg4 || arg5)
  2759				return -EINVAL;
  2760			error = PPC_GET_DEXCR_ASPECT(me, arg2);
  2761			break;
  2762		case PR_PPC_SET_DEXCR:
  2763			if (arg4 || arg5)
  2764				return -EINVAL;
  2765			error = PPC_SET_DEXCR_ASPECT(me, arg2, arg3);
  2766			break;
  2767		case PR_SET_VMA:
  2768			error = prctl_set_vma(arg2, arg3, arg4, arg5);
  2769			break;
  2770		case PR_GET_AUXV:
  2771			if (arg4 || arg5)
  2772				return -EINVAL;
  2773			error = prctl_get_auxv((void __user *)arg2, arg3);
  2774			break;
  2775	#ifdef CONFIG_KSM
  2776		case PR_SET_MEMORY_MERGE:
  2777			if (arg3 || arg4 || arg5)
  2778				return -EINVAL;
  2779			if (mmap_write_lock_killable(me->mm))
  2780				return -EINTR;
  2781	
  2782			if (arg2)
  2783				error = ksm_enable_merge_any(me->mm);
  2784			else
  2785				error = ksm_disable_merge_any(me->mm);
  2786			mmap_write_unlock(me->mm);
  2787			break;
  2788		case PR_GET_MEMORY_MERGE:
  2789			if (arg2 || arg3 || arg4 || arg5)
  2790				return -EINVAL;
  2791	
  2792			error = !!test_bit(MMF_VM_MERGE_ANY, &me->mm->flags);
  2793			break;
  2794	#endif
  2795		case PR_RISCV_V_SET_CONTROL:
  2796			error = RISCV_V_SET_CONTROL(arg2);
  2797			break;
  2798		case PR_RISCV_V_GET_CONTROL:
  2799			error = RISCV_V_GET_CONTROL();
  2800			break;
  2801		case PR_RISCV_SET_ICACHE_FLUSH_CTX:
  2802			error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
  2803			break;
  2804		case PR_GET_SHADOW_STACK_STATUS:
  2805			if (arg3 || arg4 || arg5)
  2806				return -EINVAL;
  2807			error = arch_get_shadow_stack_status(me, (unsigned long __user *) arg2);
  2808			break;
  2809		case PR_SET_SHADOW_STACK_STATUS:
  2810			if (arg3 || arg4 || arg5)
  2811				return -EINVAL;
  2812			error = arch_set_shadow_stack_status(me, arg2);
  2813			break;
  2814		case PR_LOCK_SHADOW_STACK_STATUS:
  2815			if (arg3 || arg4 || arg5)
  2816				return -EINVAL;
  2817			error = arch_lock_shadow_stack_status(me, arg2);
  2818			break;
  2819		case PR_TIMER_CREATE_RESTORE_IDS:
  2820			if (arg3 || arg4 || arg5)
  2821				return -EINVAL;
  2822			error = posixtimer_create_prctl(arg2);
  2823			break;
  2824		case PR_FUTEX_HASH:
  2825			error = futex_hash_prctl(arg2, arg3, arg4);
  2826			break;
  2827		case PR_SET_SCHED_NODELAY:
  2828			if (arg3 || arg4 || arg5)
  2829				return -EINVAL;
> 2830			if (current->sched_class != &rt_sched_class)
  2831				return -EINVAL;
  2832			if (arg2)
  2833				current->sched_nodelay = 1;
  2834			else
  2835				current->sched_nodelay = 0;
  2836			break;
  2837		case PR_GET_SCHED_NODELAY:
  2838			if (arg2 || arg3 || arg4 || arg5)
  2839				return -EINVAL;
  2840			if (current->sched_class != &rt_sched_class)
  2841				return -EINVAL;
  2842			error = (current->sched_nodelay == 1);
  2843			break;
  2844	
  2845		default:
  2846			trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
  2847			error = -EINVAL;
  2848			break;
  2849		}
  2850		return error;
  2851	}
  2852	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 38+ messages in thread
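
The error is expected: rt_sched_class is declared in kernel/sched/sched.h,
which kernel/sys.c does not include. One possible fix, sketched under the
assumption that the prctl stays restricted to SCHED_FIFO/SCHED_RR as the
cover letter states, is a policy check that needs only uapi constants:

	case PR_SET_SCHED_NODELAY:
		if (arg3 || arg4 || arg5)
			return -EINVAL;
		/* avoid the scheduler-internal rt_sched_class symbol */
		if (current->policy != SCHED_FIFO && current->policy != SCHED_RR)
			return -EINVAL;
		current->sched_nodelay = !!arg2;
		break;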

* Re: [PATCH V7 11/11] sched: Add kernel parameter to enable delaying RT threads
  2025-07-24 16:16 ` [PATCH V7 11/11] sched: Add kernel parameter to enable delaying RT threads Prakash Sangappa
@ 2025-07-25 15:52   ` kernel test robot
  0 siblings, 0 replies; 38+ messages in thread
From: kernel test robot @ 2025-07-25 15:52 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel
  Cc: oe-kbuild-all, peterz, rostedt, mathieu.desnoyers, tglx, bigeasy,
	kprateek.nayak, vineethr, prakash.sangappa

Hi Prakash,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/x86/core]
[also build test ERROR on linus/master v6.16-rc7]
[cannot apply to tip/sched/core tip/core/entry next-20250725]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Prakash-Sangappa/sched-Scheduler-time-slice-extension/20250725-002052
base:   tip/x86/core
patch link:    https://lore.kernel.org/r/20250724161625.2360309-12-prakash.sangappa%40oracle.com
patch subject: [PATCH V7 11/11] sched: Add kernel parameter to enable delaying RT threads
config: riscv-randconfig-001-20250725 (https://download.01.org/0day-ci/archive/20250725/202507252357.llkbFyOC-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 10.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250725/202507252357.llkbFyOC-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507252357.llkbFyOC-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/entry/common.c: In function 'rseq_delay_resched':
>> kernel/entry/common.c:96:11: error: 'sysctl_sched_delay_rt' undeclared (first use in this function)
      96 |  tiflag = sysctl_sched_delay_rt ? _TIF_NEED_RESCHED|_TIF_NEED_RESCHED_LAZY :
         |           ^~~~~~~~~~~~~~~~~~~~~
   kernel/entry/common.c:96:11: note: each undeclared identifier is reported only once for each function it appears in


vim +/sysctl_sched_delay_rt +96 kernel/entry/common.c

    85	
    86	static inline bool rseq_delay_resched(unsigned long ti_work)
    87	{
    88		unsigned long tiflag;
    89	
    90		if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
    91			return false;
    92	
    93		if (unlikely(current->rseq_delay_resched != RSEQ_RESCHED_DELAY_PROBE))
    94			return false;
    95	
  > 96		tiflag = sysctl_sched_delay_rt ? _TIF_NEED_RESCHED|_TIF_NEED_RESCHED_LAZY :
    97			 _TIF_NEED_RESCHED_LAZY;
    98	
    99		if (!(ti_work & tiflag))
   100			return false;
   101	
   102		if (__rseq_delay_resched()) {
   103			clear_tsk_need_resched(current);
   104			trace_sched_delay_resched(current, ti_work);
   105			return true;
   106		}
   107		return false;
   108	}
   109	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 38+ messages in thread
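
This is a config-coverage issue: the extern in <linux/sched.h> sits under
CONFIG_RSEQ_RESCHED_DELAY, and this riscv config builds with the option off,
yet the IS_ENABLED() early return in rseq_delay_resched() still leaves the
identifier visible to the compiler. A sketch of one fix, providing a stub so
the dead code compiles and is eliminated:

	#ifdef CONFIG_RSEQ_RESCHED_DELAY
	/* Scheduler time slice extension duration */
	extern unsigned int sysctl_sched_preempt_delay_us;
	extern unsigned int sysctl_sched_delay_rt;
	#else
	/* keep rseq_delay_resched() compilable when the option is off */
	#define sysctl_sched_delay_rt	0
	#endif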

* Re: [PATCH V7 00/11] Scheduler time slice extension
  2025-07-24 16:16 [PATCH V7 00/11] Scheduler time slice extension Prakash Sangappa
                   ` (10 preceding siblings ...)
  2025-07-24 16:16 ` [PATCH V7 11/11] sched: Add kernel parameter to enable delaying RT threads Prakash Sangappa
@ 2025-08-06 16:03 ` Prakash Sangappa
  2025-08-06 16:24   ` Thomas Gleixner
  2025-08-06 16:30 ` Thomas Gleixner
  12 siblings, 1 reply; 38+ messages in thread
From: Prakash Sangappa @ 2025-08-06 16:03 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org
  Cc: peterz@infradead.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, tglx@linutronix.de,
	bigeasy@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

Any comments?
Thanks
-Prakash

> On Jul 24, 2025, at 9:16 AM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
> 
> Based on v6.16-rc3.
> 
> Patches 7-11 in this series are an attempt to implement the API/mechanism 
> for an RT thread to indicate not to delay scheduling it when the thread
> running on the cpu requests extending its time slice, as suggested by
> Thomas Gleixner. This is required to address the concern that with the use 
> of the proposed scheduler time slice extension feature, a normal thread 
> can delay the scheduling of an RT thread.
> 
> This will require a new TIF flag(TIF_NEED_RESCHED_NODELAY), which will be 
> set on the running thread when this RT thread gets woken up and is enqueued. 
> The API is only allowed for use by RT(RR, FIFO) threads. 
> 
> Implementation of TIF_NEED_RESCHED_NODELAY patches is on the lines of
> how TIF_NEED_RESCHED_LAZY was added. However, TIF_NEED_RESCHED_NODELAY
> will be effective only with the scheduler time slice extension feature
> (i.e, when CONFIG_RSEQ_RESCHED_DELAY config option is enabled).
> 
> Introduces prctl APIs to set and get the sched_nodelay flag. Adds a
> new 1-bit member(sched_nodelay) to the struct task_struct to store this
> flag, as there is no more room for a new PF* flag. This flag will be 
> inherited across fork and exec. 
> 
> The API provides per-thread control to decide if it can be delayed
> or not. Also, a kernel parameter is added to disable delaying scheduling of
> all RT threads, if necessary, when the scheduler time slice extension feature
> is enabled.
> 
> The above change is more of an RFC, looking for feedback on the
> approach. 
> 
> Patches 1-6  have been updated based on comments from the V6 patch series.
> 
> ---------------- cover letter previously sent --------------------------------
> A user thread can get preempted in the middle of executing a critical
> section in user space while holding locks, which can have undesirable affect
> on performance. Having a way for the thread to request additional execution
> time on cpu, so that it can complete the critical section will be useful in
> such scenario. The request can be made by setting a bit in mapped memory,
> such that the kernel can also access to check and grant extra execution time
> on the cpu. 
> 
> There have been couple of proposals[1][2] for such a feature, which attempt
> to address the above scenario by granting one extra tick of execution time.
> In patch thread [1] posted by Steven Rostedt, there is ample discussion about
> need for this feature.
> 
> However, the concern has been that this can lead to abuse. One extra tick can
> be a long time(about a millisec or more). Peter Zijlstra in response posted a 
> prototype solution[5], which grants 50us execution time extension only.
> This is achieved with the help of a timer started on that cpu at the time of
> granting extra execution time. When the timer fires the thread will be
> preempted, if still running. 
> 
> This patchset implements above solution as suggested, with use of restartable
> sequences(rseq) structure for API. Refer [3][4] for further discussions.
> 
> 
> v7:
> - Addressed comments & suggestions from Thomas Gleixner & Prateek Nayak.
>  Renamed 'sched_time_delay' to 'rseq_delay_resched'. Made it a 2-bit 
>  member to store 3 states NONE, PROBE & REQUESTED as suggested by
>  Thomas Gleixner. Also refactored some code in patch 1.
> - Renamed the config option to 'CONFIG_RSEQ_RESCHED_DELAY' and
>  added it in patch 1. Added SCHED_HRTICK dependency.
> - Patches 7-11 are an attempt to implement the API/mechanism 
>  Thomas suggested. They introduce a prctl() api which lets an RT thread
>  indicate not to delay scheduling it when some thread running on
>  the cpu requests extending its time slice.
> 
> v6:
> https://lore.kernel.org/all/20250701003749.50525-1-prakash.sangappa@oracle.com/
> - Rebased onto v6.16-rc3. 
>  syscall_exit_to_user_mode_prepare() & __syscall_exit_to_user_mode_work()
>  routines have been deleted. Moved changes to the consolidated routine
>  syscall_exit_to_user_mode_work()(patch 1).
> - Introduced a new config option for scheduler time slice extension
>  CONFIG_SCHED_PREEMPT_DELAY which is dependent on CONFIG_RSEQ.
>  Enabled by default(new patch 7). Is this reasonable?
> - Modified tracepoint to a conditional tracepoint(patch 5), as suggested
>  by Steven Rostedt.
> - Added kernel parameters documentation for the tunable
>  'sysctl_sched_preempt_delay_us'(patch 3)
> 
> v5:
> https://lore.kernel.org/all/20250603233654.1838967-1-prakash.sangappa@oracle.com/
> - Added #ifdef CONFIG_RSEQ and CONFIG_PROC_SYSCTL for sysctl tunable
>  changes(patch 3).
> - Added #ifdef CONFIG_RSEQ for schedular stat changes(patch 4).
> - Removed deprecated flags from the supported flags returned, as
>  pointed out by Mathieu Desnoyers(patch 6).
> - Added IF_ENABLED(CONFIG_SCHED_HRTICK) check before returning supported
>  delay resched flags.
> 
> v4:
> https://lore.kernel.org/all/20250513214554.4160454-1-prakash.sangappa@oracle.com
> - Changed default sched delay extension time to 30us
> - Added patch to indicate to userspace if the thread got preempted in
>  the extended cpu time granted. Uses another bit in rseq cs flags for it.
>  This should help the application to check and avoid having to call a
>  system call to yield cpu, especially sched_yield() as pointed out
>  by Steven Rostedt.
> - Moved tracepoint call towards end of exit_to_user_mode_loop().
> - Added a pr_warn() message when the 'sched_preempt_delay_us' tunable is
>  set higher than the default value of 30us.
> - Patch to add an API to query if sched time extension feature is supported. 
>  A new flag to sys_rseq flags argument called 'RSEQ_FLAG_QUERY_CS_FLAGS',
>  is added, as suggested by Mathieu Desnoyers. 
>  Returns bitmask of all the supported rseq cs flags, in rseq->flags field.
> 
> v3:
> https://lore.kernel.org/all/20250502015955.3146733-1-prakash.sangappa@oracle.com
> - Addressing review comments by Sebastian and Prateek.
>  * Rename rseq_sched_delay -> sched_time_delay. Move its place in
>    struct task_struct near other bits so it fits in existing word.
>  * Use IS_ENABLED(CONFIG_RSEQ) instead of #ifdef to access
>    'sched_time_delay'.
>  * removed rseq_delay_resched_tick() call from hrtick_clear().
>  * Introduced a patch to add a tracepoint in exit_to_user_mode_loop(),
>    suggested by Sebastian.
>  * Added comments to describe RSEQ_CS_FLAG_DELAY_RESCHED flag.
> 
> v2:
> https://lore.kernel.org/all/20250418193410.2010058-1-prakash.sangappa@oracle.com/
> - Based on discussions in [3], expecting user application to call sched_yield()
>  to yield the cpu at the end of the critical section may not be advisable as
>  pointed out by Linus.  
> 
>  So added a check in return path from a system call to reschedule if time
>  slice extension was granted to the thread. The check could as well be in
>  syscall enter path from user mode.
>  This would allow application thread to call any system call to yield the cpu. 
>  Which system call should be suggested? getppid(2) works.
> 
>  Do we still need the change in sched_yield() to reschedule when the thread
>  has current->rseq_sched_delay set?
> 
> - Added patch to introduce a sysctl tunable parameter to specify duration of
>  the time slice extension in micro seconds(us), called 'sched_preempt_delay_us'.
>  Can take a value in the range 0 to 100. Default is set to 50us.
>  Setting this tunable to 0 disables the scheduler time slice extension feature.
> 
> v1: 
> https://lore.kernel.org/all/20250215005414.224409-1-prakash.sangappa@oracle.com/
> 
> 
> [1] https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
> [2] https://lore.kernel.org/lkml/1395767870-28053-1-git-send-email-khalid.aziz@oracle.com/
> [3] https://lore.kernel.org/all/20250131225837.972218232@goodmis.org/
> [4] https://lore.kernel.org/all/20241113000126.967713-1-prakash.sangappa@oracle.com/
> [5] https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net/
> [6] https://lore.kernel.org/all/1631147036-13597-1-git-send-email-prakash.sangappa@oracle.com/
> 
> Prakash Sangappa (11):
>  sched: Scheduler time slice extension
>  sched: Indicate if thread got rescheduled
>  sched: Tunable to specify duration of time slice extension
>  sched: Add scheduler stat for cpu time slice extension
>  sched: Add tracepoint for sched time slice extension
>  Add API to query supported rseq cs flags
>  sched: Add API to indicate not to delay scheduling
>  sched: Add TIF_NEED_RESCHED_NODELAY infrastructure
>  sched: Add nodelay scheduling
>  sched, x86: Enable nodelay scheduling
>  sched: Add kernel parameter to enable delaying RT threads
> 
> .../admin-guide/kernel-parameters.txt         |  8 ++
> Documentation/admin-guide/sysctl/kernel.rst   |  8 ++
> arch/x86/Kconfig                              |  1 +
> arch/x86/include/asm/thread_info.h            |  2 +
> include/linux/entry-common.h                  | 18 ++--
> include/linux/entry-kvm.h                     |  4 +-
> include/linux/sched.h                         | 47 +++++++++-
> include/linux/thread_info.h                   | 11 ++-
> include/trace/events/sched.h                  | 31 +++++++
> include/uapi/linux/prctl.h                    |  3 +
> include/uapi/linux/rseq.h                     | 19 ++++
> init/Kconfig                                  |  7 ++
> kernel/Kconfig.preempt                        |  3 +
> kernel/entry/common.c                         | 36 ++++++-
> kernel/entry/kvm.c                            |  3 +-
> kernel/rseq.c                                 | 71 ++++++++++++++
> kernel/sched/core.c                           | 93 ++++++++++++++++++-
> kernel/sched/debug.c                          |  4 +
> kernel/sched/rt.c                             | 10 +-
> kernel/sched/sched.h                          |  1 +
> kernel/sched/syscalls.c                       |  4 +
> kernel/sys.c                                  | 18 ++++
> 22 files changed, 380 insertions(+), 22 deletions(-)
> 
> -- 
> 2.43.5
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V7 00/11] Scheduler time slice extension
  2025-08-06 16:03 ` [PATCH V7 00/11] Scheduler time slice extension Prakash Sangappa
@ 2025-08-06 16:24   ` Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: Thomas Gleixner @ 2025-08-06 16:24 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel@vger.kernel.org
  Cc: peterz@infradead.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, bigeasy@linutronix.de,
	kprateek.nayak@amd.com, vineethr@linux.ibm.com

On Wed, Aug 06 2025 at 16:03, Prakash Sangappa wrote:

Please don't top-post, and trim your replies. We all have this mail in
our inboxes.

> Any comments?

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#merge-window

and people are on vacation ....

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V7 00/11] Scheduler time slice extension
  2025-07-24 16:16 [PATCH V7 00/11] Scheduler time slice extension Prakash Sangappa
                   ` (11 preceding siblings ...)
  2025-08-06 16:03 ` [PATCH V7 00/11] Scheduler time slice extension Prakash Sangappa
@ 2025-08-06 16:30 ` Thomas Gleixner
  2025-08-07  6:52   ` Prakash Sangappa
  12 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2025-08-06 16:30 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
> Based on v6.16-rc3.

This is useless. At the point of posting, a massive amount of changes
had been queued for the 6.17 merge window. So why can't you submit
against the relevant tree (tip) as asked for in Documentation?

I'm going to look at it from a conceptual level nevertheless to spare
you the extra round.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V7 01/11] sched: Scheduler time slice extension
  2025-07-24 16:16 ` [PATCH V7 01/11] sched: " Prakash Sangappa
@ 2025-08-06 20:34   ` Thomas Gleixner
  2025-08-07 14:07     ` Thomas Gleixner
                       ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Thomas Gleixner @ 2025-08-06 20:34 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
> @@ -304,7 +304,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs);
>   * exit_to_user_mode_loop - do any pending work before leaving to user space
>   */
>  unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> -				     unsigned long ti_work);
> +				     unsigned long ti_work, bool irq);

I know the kernel-doc already lacks the description for the existing
arguments, but adding more undocumented ones is not the right thing
either.

Also please name this argument 'from_irq' to make it clear what this is
about.

>  /**
>   * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
> @@ -316,7 +316,7 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   *    EXIT_TO_USER_MODE_WORK are set
>   * 4) check that interrupts are still disabled
>   */
> -static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
> +static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs, bool irq)

New argument not documented in kernel-doc.

>  {
>  	unsigned long ti_work;
>  
> @@ -327,7 +327,10 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
>  
>  	ti_work = read_thread_flags();
>  	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> -		ti_work = exit_to_user_mode_loop(regs, ti_work);
> +		ti_work = exit_to_user_mode_loop(regs, ti_work, irq);
> +
> +	if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) && irq)
> +		rseq_delay_resched_arm_timer();

This is still an unconditional function call which is a NOOP for
everyone who does not use this. It's not that hard to inline the
check. How often do I have to explain that?

>  	arch_exit_to_user_mode_prepare(regs, ti_work);
>  
> @@ -396,6 +399,9 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
>  
>  	CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
>  
> +	/* Reschedule if scheduler time delay was granted */

This is not rescheduling. It sets NEED_RESCHED, which is a completely
different thing.

> +	rseq_delay_set_need_resched();

I fundamentally hate this hack as it goes out to user space with
NEED_RESCHED set and absolutely zero debug mechanism which validates
it. Currently going out with NEED_RESCHED set is a plain bug, rightfully
so.

But now this muck comes along and sets the flag, which is semantically
just wrong and ill defined.

The point is that NEED_RESCHED has been cleared by requesting and
granting the extension, which means the task can go out to userspace,
until it either relinquishes the CPU or hrtick() whacks it over the
head.

And your muck requires this insane hack with sched_yield():

>  SYSCALL_DEFINE0(sched_yield)
>  {
> +	/* Reschedule if scheduler time delay was granted */
> +	if (rseq_delay_set_need_resched())
> +		return 0;
> +
>  	do_sched_yield();
>  	return 0;
>  }

That's just completely wrong. Relinquishing the CPU should be possible
by any arbitrary syscall and not require making sched_yield() more
ill-defined than it already is.

The obvious way to solve both issues is to clear NEED_RESCHED when
the delay is granted and then do in syscall_enter_from_user_mode_work()

        rseq_delay_sys_enter()
        {
                if (unlikely(current->rseq_delay_resched == GRANTED)) {
                        set_tsk_need_resched(current);
                        schedule();
                }
        }

No?

It's debatable whether the schedule() there is necessary. Removing it
would allow the task to either complete the syscall and reschedule on
exit to user space or go to sleep in the syscall. But that's a trivial
detail.

The important point is that the NEED_RESCHED semantics stay sane and the
problem is solved right on the next syscall entry.

This delay is not for extending CPU time across syscalls, it's solely
to allow user space to complete a _user space_ critical
section. Everything else is just wrong and we don't implement it as an
invitation for abuse.

For the record: I used GRANTED on purpose, because REQUESTED is
bogus. At the point where __rseq_delay_resched() is invoked _AND_
observes the user space request, it grants the delay, no?

This random nomenclature is just making this stuff annoyingly hard to
follow.

> +static inline bool rseq_delay_resched(unsigned long ti_work)
> +{
> +	if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
> +		return false;
> +
> +	if (unlikely(current->rseq_delay_resched != RSEQ_RESCHED_DELAY_PROBE))
> +		return false;

Why unlikely? The majority of applications do not use this.

> +
> +	if (!(ti_work & (_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_LAZY)))
> +		return false;

The caller already established that one of these flags is set, no?

> +	if (__rseq_delay_resched()) {
> +		clear_tsk_need_resched(current);

Why does this have to be inline instead of being done in __rseq_delay_resched()?

> +		return true;
> +	}
> +	return false;

>  /**
>   * exit_to_user_mode_loop - do any pending work before leaving to user space
>   * @regs:	Pointer to pt_regs on entry stack
>   * @ti_work:	TIF work flags as read by the caller
>   */
>  __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> -						     unsigned long ti_work)
> +						     unsigned long ti_work, bool irq)
>  {

Same comments as above.

> +#ifdef	CONFIG_RSEQ_RESCHED_DELAY
> +bool __rseq_delay_resched(void)
> +{
> +	struct task_struct *t = current;
> +	u32 flags;
> +
> +	if (copy_from_user_nofault(&flags, &t->rseq->flags, sizeof(flags)))
> +		return false;
> +
> +	if (!(flags & RSEQ_CS_FLAG_DELAY_RESCHED))
> +		return false;
> +
> +	flags &= ~RSEQ_CS_FLAG_DELAY_RESCHED;
> +	if (copy_to_user_nofault(&t->rseq->flags, &flags, sizeof(flags)))
> +		return false;
> +
> +	t->rseq_delay_resched = RSEQ_RESCHED_DELAY_REQUESTED;
> +
> +	return true;
> +}
> +
> +void rseq_delay_resched_arm_timer(void)
> +{
> +	if (unlikely(current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED))
> +		hrtick_local_start(30 * NSEC_PER_USEC);
> +}
> +
> +void rseq_delay_resched_tick(void)
> +{
> +	if (current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
> +		set_tsk_need_resched(current);

Small enough to inline into hrtick() with a IS_ENABLED() guard, no?

> +}
> +#endif /* CONFIG_RSEQ_RESCHED_DELAY */
> +
>  #ifdef CONFIG_DEBUG_RSEQ
>  
>  /*
> @@ -493,6 +527,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
>  		current->rseq = NULL;
>  		current->rseq_sig = 0;
>  		current->rseq_len = 0;
> +		if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
> +			current->rseq_delay_resched = RSEQ_RESCHED_DELAY_NONE;

What's that conditional for?

t->rseq_delay_resched is unconditionally available. Your choice of
optimizing the irrelevant places is amazing.

>  		return 0;
>  	}
>  
> @@ -561,6 +597,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
>  	current->rseq = rseq;
>  	current->rseq_len = rseq_len;
>  	current->rseq_sig = sig;
> +	if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
> +		current->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;

Why is this done unconditionally for rseq?

So that any rseq user needs to do a function call and a copy_from_user()
just for nothing?

A task, which needs this muck, can very well opt-in for this and leave
everybody else unaffected, no?

prctl() exists for a reason and that allows even filtering out the
request to enable it if the sysadmin sets up filters accordingly.

As code which wants to utilize this has to be modified anyway, adding
the prctl() is not an unreasonable requirement.

>  	clear_preempt_need_resched();
> +	if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) &&
> +	    prev->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
> +		prev->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;

Yet another code conditional for no reason. These are two bits and you can
use them smartly:

#define ENABLED		1
#define GRANTED		3

So you can just go and do

   prev->rseq_delay_resched &= RSEQ_RESCHED_DELAY_ENABLED;

which clears the GRANTED bit without a conditional and that's correct
whether the ENABLED bit was set or not.

In the syscall exit path you then do:

static inline bool rseq_delay_resched(void)
{
	if (current->rseq_delay_resched != ENABLED)
		return false;
	return __rseq_delay_resched();
}

and __rseq_delay_resched() does:

    rseq_delay_resched = GRANTED;

No?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 38+ messages in thread
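
A self-contained sketch of the encoding proposed above (the constant values
are illustrative): GRANTED contains the ENABLED bit, so masking with ENABLED
demotes a granted state back to enabled and leaves NONE and ENABLED
untouched, without any conditional:

	#include <assert.h>

	#define RSEQ_RESCHED_DELAY_NONE		0
	#define RSEQ_RESCHED_DELAY_ENABLED	1	/* bit 0 */
	#define RSEQ_RESCHED_DELAY_GRANTED	3	/* ENABLED | grant bit */

	int main(void)
	{
		unsigned int s;

		/* clearing the grant is a plain mask, no conditional */
		s = RSEQ_RESCHED_DELAY_GRANTED;
		s &= RSEQ_RESCHED_DELAY_ENABLED;
		assert(s == RSEQ_RESCHED_DELAY_ENABLED);

		/* ENABLED and NONE pass through unchanged */
		s = RSEQ_RESCHED_DELAY_ENABLED;
		assert((s & RSEQ_RESCHED_DELAY_ENABLED) == RSEQ_RESCHED_DELAY_ENABLED);

		s = RSEQ_RESCHED_DELAY_NONE;
		assert((s & RSEQ_RESCHED_DELAY_ENABLED) == RSEQ_RESCHED_DELAY_NONE);

		return 0;
	}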

* Re: [PATCH V7 00/11] Scheduler time slice extension
  2025-08-06 16:30 ` Thomas Gleixner
@ 2025-08-07  6:52   ` Prakash Sangappa
  0 siblings, 0 replies; 38+ messages in thread
From: Prakash Sangappa @ 2025-08-07  6:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	bigeasy@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Aug 6, 2025, at 9:30 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
>> Based on v6.16-rc3.
> 
> This is useless. At the point of posting, a massive amount of changes
> had been queued for the 6.17 merge window. So why can't you submit
> against the relevant tree (tip) as asked for in Documentation?
> 
> I'm going to look at it from a conceptual level nevertheless to spare
> you the extra round.

Thanks,
-Prakash.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V7 02/11] sched: Indicate if thread got rescheduled
  2025-07-24 16:16 ` [PATCH V7 02/11] sched: Indicate if thread got rescheduled Prakash Sangappa
@ 2025-08-07 13:06   ` Thomas Gleixner
  2025-08-07 16:15     ` Prakash Sangappa
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2025-08-07 13:06 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:

Indicate this to whom? Can you please write descriptive subject lines
which summarize the change in a way that is comprehensible?

> +void rseq_delay_resched_clear(struct task_struct *tsk)
> +{
> +	u32 flags;
> +
> +	if (tsk->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED) {
> +		tsk->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
> +		if (copy_from_user_nofault(&flags, &tsk->rseq->flags, sizeof(flags)))
> +                        return;
> +                flags |= RSEQ_CS_FLAG_RESCHEDULED;
> +                copy_to_user_nofault(&tsk->rseq->flags, &flags, sizeof(flags));
> +	}
> +}
>  #endif /* CONFIG_RSEQ_RESCHED_DELAY */
>  
>  #ifdef CONFIG_DEBUG_RSEQ
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e75ecbb2c1f7..ba1e4f6981cd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6752,9 +6752,8 @@ static void __sched notrace __schedule(int sched_mode)
>  picked:
>  	clear_tsk_need_resched(prev);
>  	clear_preempt_need_resched();
> -	if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) &&
> -	    prev->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
> -		prev->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
> +	if(IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
> +		rseq_delay_resched_clear(prev);

Yet another unconditional function call for the sake of something which
is only used by special applications. This is the scheduler hotpath and
not a dumping ground for random functionality, which is even completely
redundant. Why redundant?

The kernel already handles in rseq, that a task was scheduled out:

    schedule()
       prepare_task_switch()
         rseq_preempt()

rseq_preempt() sets RSEQ_EVENT_PREEMPT_BIT and TIF_NOTIFY_RESUME, which
causes exit to userspace to invoke __rseq_handle_notify_resume(). That's
the obvious place to handle this instead of inflicting it into the
scheduler hotpath.

No?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 38+ messages in thread
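
A sketch of the suggested placement; the state names are from the series,
and rseq_update_cs_flags() is a hypothetical helper wrapping the usual
copy_from_user_nofault()/copy_to_user_nofault() sequence:

	/* in __rseq_handle_notify_resume(), sketch only */
	struct task_struct *t = current;

	if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) &&
	    t->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED) {
		t->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
		/* hypothetical helper: ORs RSEQ_CS_FLAG_RESCHEDULED
		 * into t->rseq->flags */
		rseq_update_cs_flags(t, RSEQ_CS_FLAG_RESCHEDULED);
	}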

* Re: [PATCH V7 01/11] sched: Scheduler time slice extension
  2025-08-06 20:34   ` Thomas Gleixner
@ 2025-08-07 14:07     ` Thomas Gleixner
  2025-08-07 16:45       ` Prakash Sangappa
  2025-08-07 15:49     ` Sebastian Andrzej Siewior
  2025-08-07 16:13     ` Prakash Sangappa
  2 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2025-08-07 14:07 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

On Wed, Aug 06 2025 at 22:34, Thomas Gleixner wrote:
> On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
>> @@ -396,6 +399,9 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
>>  
>>  	CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
>>  
>> +	/* Reschedule if scheduler time delay was granted */
>
> This is not rescheduling. It sets NEED_RESCHED, which is a completely
> different thing.
>
>> +	rseq_delay_set_need_resched();
>
> I fundamentally hate this hack as it goes out to user space with
> NEED_RESCHED set and absolutely zero debug mechanism which validates
> it. Currently going out with NEED_RESCHED set is a plain bug, rigthfully
> so.
>
> But now this muck comes along and sets the flag, which is semantically
> just wrong and ill defined.
>
> The point is that NEED_RESCHED has been cleared by requesting and
> granting the extension, which means the task can go out to userspace,
> until it either relinquishes the CPU or hrtick() whacks it over the
> head.

Sorry. I misread this. It's placed before it enters the exit work loop
and not afterwards. I got lost in this maze. :(

> The obvious way to solve both issues is to clear NEED_RESCHED when
> the delay is granted and then do in syscall_enter_from_user_mode_work()
>
>         rseq_delay_sys_enter()
>         {
>              if (unlikely(current->rseq_delay_resched == GRANTED)) {
> 		    set_tsk_need_resched(current);
>                     schedule();
>              }       
>         }     	
>
> No?
>
> It's debatable whether the schedule() there is necessary. Removing it
> would allow the task to either complete the syscall and reschedule on
> exit to user space or go to sleep in the syscall. But that's a trivial
> detail.

But the most important thing is that doing it at entry allows debugging
this stuff for correctness.

I can kinda see that a sched_yield() shortcut might be the right thing
to do for relinquishing the CPU, but if that's the user space contract,
then any other syscall needs to be caught and not silently papered over
at return from syscall.

Let me think about this some more.




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V7 01/11] sched: Scheduler time slice extension
  2025-08-06 20:34   ` Thomas Gleixner
  2025-08-07 14:07     ` Thomas Gleixner
@ 2025-08-07 15:49     ` Sebastian Andrzej Siewior
  2025-08-07 16:56       ` Prakash Sangappa
  2025-08-07 16:13     ` Prakash Sangappa
  2 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-08-07 15:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Prakash Sangappa, linux-kernel, peterz, rostedt,
	mathieu.desnoyers, kprateek.nayak, vineethr

On 2025-08-06 22:34:00 [+0200], Thomas Gleixner wrote:
> On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
> 
> The obvious way to solve both issues is to clear NEED_RESCHED when
> the delay is granted and then do in syscall_enter_from_user_mode_work()
> 
>         rseq_delay_sys_enter()
>         {
>              if (unlikely(current->rseq_delay_resched == GRANTED)) {
> 		    set_tsk_need_resched(current);
>                     schedule();
>              }       
>         }     	
> 
> No?
> 
> It's debatable whether the schedule() there is necessary. Removing it
> would allow the task to either complete the syscall and reschedule on
> exit to user space or go to sleep in the syscall. But that's a trivial
> detail.

Either schedule() or setting NEED_RESCHED is enough.

> The important point is that the NEED_RESCHED semantics stay sane and the
> problem is solved right on the next syscall entry.
> 
> > +static inline bool rseq_delay_resched(unsigned long ti_work)
> > +{
> > +	if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
> > +		return false;
> > +
> > +	if (unlikely(current->rseq_delay_resched != RSEQ_RESCHED_DELAY_PROBE))

The function and the task_struct member field share the same name.

> > +		return false;
> 
> Why unlikely? The majority of applications do not use this.
> 
> > +
> > +	if (!(ti_work & (_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_LAZY)))
> > +		return false;
> 
> The caller already established that one of these flags is set, no?

Correct, and if they are set, this never evaluates to false.

> > +	if (__rseq_delay_resched()) {
> > +		clear_tsk_need_resched(current);
> 
> Why does this have to be inline instead of being done in __rseq_delay_resched()?

A SCHED_OTHER wakeup sets _TIF_NEED_RESCHED_LAZY, so
clear_tsk_need_resched() will revoke it when granting an extension.

The RT/DL wakeup will set _TIF_NEED_RESCHED and
clear_tsk_need_resched() will also clear it. However, this one
additionally invokes set_preempt_need_resched(), so the next preempt
disable/enable combo will lead to a scheduling event. A remote wakeup
will trigger an IPI (scheduler_ipi()) which also does
set_preempt_need_resched().

If I understand this correctly, then an RT/DL wakeup while the task is in
kernel mode should lead to a scheduling event, assuming we pass a
spinlock_t (ignoring the irq argument).
Should the task be in user mode, we return to user mode with the TIF
flag cleared and the NEED_RESCHED flag folded into the preemption
counter.

I am once again asking to limit this to _TIF_NEED_RESCHED_LAZY.

> > +		return true;
> > +	}
> > +	return false;
> 
> Thanks,
> 
>         tglx

Sebastian

^ permalink raw reply	[flat|nested] 38+ messages in thread
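
If the delay were limited to _TIF_NEED_RESCHED_LAZY as requested here, the
entry-path check would reduce to a sketch like:

	/* only lazy (SCHED_OTHER) preemption requests are delayable; a
	 * hard _TIF_NEED_RESCHED, e.g. from an RT/DL wakeup, is honored
	 * at once */
	if (!(ti_work & _TIF_NEED_RESCHED_LAZY) || (ti_work & _TIF_NEED_RESCHED))
		return false;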

* Re: [PATCH V7 01/11] sched: Scheduler time slice extension
  2025-08-06 20:34   ` Thomas Gleixner
  2025-08-07 14:07     ` Thomas Gleixner
  2025-08-07 15:49     ` Sebastian Andrzej Siewior
@ 2025-08-07 16:13     ` Prakash Sangappa
  2 siblings, 0 replies; 38+ messages in thread
From: Prakash Sangappa @ 2025-08-07 16:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	bigeasy@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Aug 6, 2025, at 1:34 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
>> @@ -304,7 +304,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs);
>>  * exit_to_user_mode_loop - do any pending work before leaving to user space
>>  */
>> unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> -				     unsigned long ti_work);
>> +				     unsigned long ti_work, bool irq);
> 
> I know the kernel-doc already lacks the description for the existing
> arguments, but adding more undocumented ones is not the right thing
> either.
> 
> Also please name this argument 'from_irq' to make it clear what this is
> about.

Ok, will change it.

> 
>> /**
>>  * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
>> @@ -316,7 +316,7 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>>  *    EXIT_TO_USER_MODE_WORK are set
>>  * 4) check that interrupts are still disabled
>>  */
>> -static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
>> +static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs, bool irq)
> 
> New argument not documented in kernel-doc.

Will add necessary documentation.

> 
>> {
>> 	unsigned long ti_work;
>> 
>> @@ -327,7 +327,10 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
>> 
>> 	ti_work = read_thread_flags();
>> 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>> -		ti_work = exit_to_user_mode_loop(regs, ti_work);
>> +		ti_work = exit_to_user_mode_loop(regs, ti_work, irq);
>> +
>> +	if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) && irq)
>> +		rseq_delay_resched_arm_timer();
> 
> This is still an unconditional function call which is a NOOP for
> everyone who does not use this. It's not that hard to inline the
> check. How often do I have to explain that?

Will fix.

> 
>> arch_exit_to_user_mode_prepare(regs, ti_work);
>> 
>> @@ -396,6 +399,9 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
>> 
>> CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
>> 
>> + /* Reschedule if scheduler time delay was granted */
> 
> This is not rescheduling. It sets NEED_RESCHED, which is a completely
> different thing.
> 
>> + rseq_delay_set_need_resched();
> 
> I fundamentally hate this hack as it goes out to user space with
> NEED_RESCHED set and absolutely zero debug mechanism which validates
> it. Currently going out with NEED_RESCHED set is a plain bug, rightfully
> so.
> 
> But now this muck comes along and sets the flag, which is semantically
> just wrong and ill defined.
> 
> The point is that NEED_RESCHED has been cleared by requesting and
> granting the extension, which means the task can go out to userspace,
> until it either relinquishes the CPU or hrtick() whacks it over the
> head.
> 
> And your muck requires this insane hack with sched_yield():
> 
>> SYSCALL_DEFINE0(sched_yield)
>> {
>> + /* Reschedule if scheduler time delay was granted */
>> + if (rseq_delay_set_need_resched())
>> + return 0;
>> +
>> do_sched_yield();
>> return 0;
>> }
> 
> That's just completely wrong. Relinquishing the CPU should be possible
> by any arbitrary syscall and not require making sched_yield() more
> ill-defined than it is already.
> 
> The obvious way to solve both issues is to clear NEED_RESCHED when
> the delay is granted and then do in syscall_enter_from_user_mode_work()
> 
>        rseq_delay_sys_enter()
>        {
>             if (unlikely(current->rseq_delay_resched == GRANTED)) {
>                    set_tsk_need_resched(current);
>                    schedule();
>             }       
>        }      
> 
> No?
> 
> It's debatable whether the schedule() there is necessary. Removing it
> would allow the task to either complete the syscall and reschedule on
> exit to user space or go to sleep in the syscall. But that's a trivial
> detail.
> 
> The important point is that the NEED_RESCHED semantics stay sane and the
> problem is solved right on the next syscall entry.
> 
> This delay is not for extending CPU time across syscalls, it's solely
> to allow user space to complete a _user space_ critical
> section. Everything else is just wrong and we don't implement it as an
> invitation for abuse.
> 
> For the record: I used GRANTED on purpose, because REQUESTED is
> bogus. At the point where __rseq_delay_resched() is invoked _AND_
> observes the user space request, it grants the delay, no?
> 
> This random nomenclature is just making this stuff annoyingly hard to
> follow.
> 

Ok, I can move the check to relinquish the cpu to syscall_enter_from_user_mode_work()
instead of syscall_exit_to_user_mode_work().


>> +static inline bool rseq_delay_resched(unsigned long ti_work)
>> +{
>> + if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>> + return false;
>> +
>> + if (unlikely(current->rseq_delay_resched != RSEQ_RESCHED_DELAY_PROBE))
>> + return false;
> 
> Why unlikely? The majority of applications do not use this.

Will change to likely().

> 
>> +
>> + if (!(ti_work & (_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_LAZY)))
>> + return false;
> 
> The caller already established that one of these flags is set, no?

That is right, will delete this check here. 

> 
>> + if (__rseq_delay_resched()) {
>> + clear_tsk_need_resched(current);
> 
> Why has this to be inline and is not done in __rseq_delay_resched()?

Sure, it could be in __rseq_delay_resched(). 

> 
>> + return true;
>> + }
>> + return false;
> 
>> /**
>>  * exit_to_user_mode_loop - do any pending work before leaving to user space
>>  * @regs: Pointer to pt_regs on entry stack
>>  * @ti_work: TIF work flags as read by the caller
>>  */
>> __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> -     unsigned long ti_work)
>> +     unsigned long ti_work, bool irq)
>> {
> 
> Same comments as above.
> 
>> +
>> +void rseq_delay_resched_tick(void)
>> +{
>> + if (current->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
>> + set_tsk_need_resched(current);
> 
> Small enough to inline into hrtick() with an IS_ENABLED() guard, no?

 I can move it to hrtick() and delete the rseq_delay_resched_tick() routine.

> 
>> +}
>> +#endif /* CONFIG_RSEQ_RESCHED_DELAY */
>> +
>> #ifdef CONFIG_DEBUG_RSEQ
>> 
>> /*
>> @@ -493,6 +527,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
>> current->rseq = NULL;
>> current->rseq_sig = 0;
>> current->rseq_len = 0;
>> + if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>> + current->rseq_delay_resched = RSEQ_RESCHED_DELAY_NONE;
> 
> What's that conditional for?
> 
> t->rseq_delay_resched is unconditionally available. Your choice of
> optimizing the irrelevant places is amazing.

Will fix.

> 
>> return 0;
>> }
>> 
>> @@ -561,6 +597,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
>> current->rseq = rseq;
>> current->rseq_len = rseq_len;
>> current->rseq_sig = sig;
>> + if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>> + current->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
> 
> Why is this done unconditionally for rseq?
> 
> So that any rseq user needs to do a function call and a copy_from_user()
> just for nothing?
> 
> A task, which needs this muck, can very well opt-in for this and leave
> everybody else unaffected, no?

Sure, that seems reasonable.

> 
> prctl() exists for a reason and that allows even filtering out the
> request to enable it if the sysadmin sets up filters accordingly.
> 
> As code which wants to utilize this has to be modified anyway, adding
> the prctl() is not an unreasonable requirement.
> 
>> clear_preempt_need_resched();
>> + if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) &&
>> +    prev->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
>> + prev->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
> 
> Yet another code conditional for no reason. These are two bits and you can
> use them smart:
> 
> #define ENABLED 1
> #define GRANTED 3
> 
> So you can just go and do
> 
>   prev->rseq_delay_resched &= RSEQ_RESCHED_DELAY_ENABLED;
> 
> which clears the GRANTED bit without a conditional and that's correct
> whether the ENABLED bit was set or not.
> 
> In the syscall exit path you then do:
> 
> static inline bool rseq_delay_resched(void)
> {
>   if (prev->rseq_delay_resched != ENABLED)
>    return false;
>   return __rseq_delay_resched();
> }
> 
> and __rseq_delay_resched() does:
> 
>    rseq_delay_resched = GRANTED;
> 
> No?

That is nice. I can add a prctl() call to enable and disable this functionality,
which would help avoid unnecessary copy_from_user() calls.

In addition to registering the ‘rseq’ struct, the application that needs the functionality
will have to make the prctl() call to enable it, which I think should be reasonable.
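
For illustration only, the opt-in from user space could then look
roughly like this (the prctl() command name is a hypothetical
placeholder, not a defined ABI):

    #include <stdio.h>
    #include <sys/prctl.h>

    /* PR_RSEQ_DELAY_RESCHED_ENABLE is a placeholder; the final ABI is TBD */
    static void enable_rseq_delay_resched(void)
    {
    	/* Assumes the rseq area is already registered for this thread */
    	if (prctl(PR_RSEQ_DELAY_RESCHED_ENABLE, 1, 0, 0, 0))
    		perror("prctl");
    }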

Thanks,
-Prakash.


> 
> Thanks,
> 
>        tglx



* Re: [PATCH V7 02/11] sched: Indicate if thread got rescheduled
  2025-08-07 13:06   ` Thomas Gleixner
@ 2025-08-07 16:15     ` Prakash Sangappa
  2025-08-11  9:45       ` Thomas Gleixner
  0 siblings, 1 reply; 38+ messages in thread
From: Prakash Sangappa @ 2025-08-07 16:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	bigeasy@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Aug 7, 2025, at 6:06 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
> 
> Indicate this to whom? Can you please write descriptive subject lines
> which summarize the change in a way that is comprehensible?
> 
>> +void rseq_delay_resched_clear(struct task_struct *tsk)
>> +{
>> + u32 flags;
>> +
>> + if (tsk->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED) {
>> + tsk->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
>> + if (copy_from_user_nofault(&flags, &tsk->rseq->flags, sizeof(flags)))
>> +                        return;
>> +                flags |= RSEQ_CS_FLAG_RESCHEDULED;
>> +                copy_to_user_nofault(&tsk->rseq->flags, &flags, sizeof(flags));
>> + }
>> +}
>> #endif /* CONFIG_RSEQ_RESCHED_DELAY */
>> 
>> #ifdef CONFIG_DEBUG_RSEQ
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index e75ecbb2c1f7..ba1e4f6981cd 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6752,9 +6752,8 @@ static void __sched notrace __schedule(int sched_mode)
>> picked:
>> clear_tsk_need_resched(prev);
>> clear_preempt_need_resched();
>> - if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) &&
>> -    prev->rseq_delay_resched == RSEQ_RESCHED_DELAY_REQUESTED)
>> - prev->rseq_delay_resched = RSEQ_RESCHED_DELAY_PROBE;
>> + if(IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>> + rseq_delay_resched_clear(prev);
> 
> Yet another unconditional function call for the sake of something which
> is only used by special applications. This is the scheduler hotpath and
> not a dump ground for random functionality, which is even completely
> redundant. Why redundant?
> 
> The kernel already handles in rseq, that a task was scheduled out:
> 
>    schedule()
>       prepare_task_switch()
>         rseq_preempt()
> 
> rseq_preempt() sets RSEQ_EVENT_PREEMPT_BIT and TIF_NOTIFY_RESUME, which
> causes exit to userspace to invoke __rseq_handle_notify_resume(). That's
> the obvious place to handle this instead of inflicting it into the
> scheduler hotpath.
> 
> No?

Sure, I will look at moving the rseq_delay_resched_clear() call to __rseq_handle_notify_resume().
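
For illustration, roughly along these lines (a sketch only, reusing the
names discussed above):

    /* In __rseq_handle_notify_resume(), before the ip fixup: */
    if (IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY) && t->rseq_event_mask)
    	rseq_delay_resched_clear(t);
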
-Prakash

> 
> Thanks,
> 
>        tglx



* Re: [PATCH V7 01/11] sched: Scheduler time slice extension
  2025-08-07 14:07     ` Thomas Gleixner
@ 2025-08-07 16:45       ` Prakash Sangappa
  0 siblings, 0 replies; 38+ messages in thread
From: Prakash Sangappa @ 2025-08-07 16:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	bigeasy@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Aug 7, 2025, at 7:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Wed, Aug 06 2025 at 22:34, Thomas Gleixner wrote:
>> On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
>>> @@ -396,6 +399,9 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
>>> 
>>> CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
>>> 
>>> + /* Reschedule if scheduler time delay was granted */
>> 
>> This is not rescheduling. It sets NEED_RESCHED, which is a completely
>> different thing.
>> 
>>> + rseq_delay_set_need_resched();
>> 
>> I fundamentally hate this hack as it goes out to user space with
>> NEED_RESCHED set and absolutely zero debug mechanism which validates
>> it. Currently going out with NEED_RESCHED set is a plain bug, rightfully
>> so.
>> 
>> But now this muck comes along and sets the flag, which is semantically
>> just wrong and ill defined.
>> 
>> The point is that NEED_RESCHED has been cleared by requesting and
>> granting the extension, which means the task can go out to userspace,
>> until it either relinquishes the CPU or hrtick() whacks it over the
>> head.
> 
> Sorry. I misread this. It's placed before it enters the exit work loop
> and not afterwards. I got lost in this maze. :(

Yes.

> 
>> The obvious way to solve both issues is to clear NEED_RESCHED when
>> the delay is granted and then do in syscall_enter_from_user_mode_work()
>> 
>>        rseq_delay_sys_enter()
>>        {
>>             if (unlikely(current->rseq_delay_resched == GRANTED)) {
>>                    set_tsk_need_resched(current);
>>                    schedule();
>>             }       
>>        }      
>> 
>> No?
>> 
>> It's debatable whether the schedule() there is necessary. Removing it
>> would allow the task to either complete the syscall and reschedule on
>> exit to user space or go to sleep in the syscall. But that's a trivial
>> detail.
> 
> But, the most important thing is that doing it at entry allows to debug
> this stuff for correctness.
> 
> I can kinda see that a sched_yield() shortcut might be the right thing
> to do for relinguishing the CPU, but if that's the user space contract,
> then any other syscall needs to be caught and not silently papered over
> at return from syscall.

Sure.  The check to see if delay was GRANTED in syscall_exit_to_user_mode_work() 
would catch any other system calls. 

> 
> Let me think about this some more.

Sure,
We will need a recommended system call, which the application can call
to relinquish the cpu after extra cpu time was granted. sched_yield(2) seems
appropriate. The shortcut in sched_yield() was to avoid going through do_sched_yield()
when called in the extended time. If we move the GRANTED check to
syscall_enter_from_user_mode_work(), then the shortcut in sched_yield()
cannot be implemented.

Thanks,
-Prakash


> 
> 
> 



* Re: [PATCH V7 01/11] sched: Scheduler time slice extension
  2025-08-07 15:49     ` Sebastian Andrzej Siewior
@ 2025-08-07 16:56       ` Prakash Sangappa
  2025-08-08  9:59         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 38+ messages in thread
From: Prakash Sangappa @ 2025-08-07 16:56 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, linux-kernel@vger.kernel.org,
	peterz@infradead.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Aug 7, 2025, at 8:49 AM, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> On 2025-08-06 22:34:00 [+0200], Thomas Gleixner wrote:
>> On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
>> 
>> The obvious way to solve both issues is to clear NEED_RESCHED when
>> the delay is granted and then do in syscall_enter_from_user_mode_work()
>> 
>>        rseq_delay_sys_enter()
>>        {
>>             if (unlikely(current->rseq_delay_resched == GRANTED)) {
>>                    set_tsk_need_resched(current);
>>                    schedule();
>>             }       
>>        }      
>> 
>> No?
>> 
>> It's debatable whether the schedule() there is necessary. Removing it
>> would allow the task to either complete the syscall and reschedule on
>> exit to user space or go to sleep in the syscall. But that's a trivial
>> detail.
> 
> Either schedule() or setting NEED_RESCHED is enough.
> 
>> The important point is that the NEED_RESCHED semantics stay sane and the
>> problem is solved right on the next syscall entry.
>> 
> …
>>> +static inline bool rseq_delay_resched(unsigned long ti_work)
>>> +{
>>> + if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>>> + return false;
>>> +
>>> + if (unlikely(current->rseq_delay_resched != RSEQ_RESCHED_DELAY_PROBE))
> 
> The functions and the task_struct member field share the same name.

I can look at modifying the names of the functions.

> 
>>> + return false;
>> 
>> Why unlikely? The majority of applications do not use this.
>> 
>>> +
>>> + if (!(ti_work & (_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_LAZY)))
>>> + return false;
>> 
>> The caller already established that one of these flags is set, no?
> 
> Correct, and if they are set, this never reaches the return false.

Will fix it.

> 
>>> + if (__rseq_delay_resched()) {
>>> + clear_tsk_need_resched(current);
>> 
>> Why has this to be inline and is not done in __rseq_delay_resched()?
> 
> A SCHED_OTHER wake up sets _TIF_NEED_RESCHED_LAZY, so
> clear_tsk_need_resched() will revoke this, granting an extension.
> 
> The RT/DL wake up will set _TIF_NEED_RESCHED and
> clear_tsk_need_resched() will also clear it. However this one
> additionally sets set_preempt_need_resched() so the next preempt
> disable/ enable combo will lead to a scheduling event. A remote wakeup
> will trigger an IPI (scheduler_ipi()) which also does
> set_preempt_need_resched().
> 
> If I understand this correctly, then an RT/DL wake up while the task is in
> kernel-mode should lead to a scheduling event assuming we pass a
> spinlock_t (ignoring the irq argument).
> Should the task be in user-mode then we return to user mode with the TIF
> flag cleared and the NEED-RESCHED flag folded into the preemption
> counter.
> 
> I am once again asking to limit this to _TIF_NEED_RESCHED_LAZY.

Would the proposal (patches 7-11) to have an API/mechanism, as Thomas suggested,
for RT threads to indicate they should not be delayed address the concern?
Also, there is the proposal to have a kernel parameter to disable delaying
RT threads in general when granting extra time to the running task.

Thanks,
-Prakash

> 
>>> + return true;
>>> + }
>>> + return false;
>> 
> 
> …
> 
>> Thanks,
>> 
>>        tglx
> 
> Sebastian



* Re: [PATCH V7 01/11] sched: Scheduler time slice extension
  2025-08-07 16:56       ` Prakash Sangappa
@ 2025-08-08  9:59         ` Sebastian Andrzej Siewior
  2025-08-08 17:00           ` Prakash Sangappa
  0 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-08-08  9:59 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Thomas Gleixner, linux-kernel@vger.kernel.org,
	peterz@infradead.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

On 2025-08-07 16:56:33 [+0000], Prakash Sangappa wrote:
> >>> + if (__rseq_delay_resched()) {
> >>> + clear_tsk_need_resched(current);
> >> 
> >> Why has this to be inline and is not done in __rseq_delay_resched()?
> > 
> > A SCHED_OTHER wake up sets _TIF_NEED_RESCHED_LAZY, so
> > clear_tsk_need_resched() will revoke this, granting an extension.
> > 
> > The RT/DL wake up will set _TIF_NEED_RESCHED and
> > clear_tsk_need_resched() will also clear it. However this one
> > additionally sets set_preempt_need_resched() so the next preempt
> > disable/ enable combo will lead to a scheduling event. A remote wakeup
> > will trigger an IPI (scheduler_ipi()) which also does
> > set_preempt_need_resched().
> > 
> > If I understand this correctly, then an RT/DL wake up while the task is in
> > kernel-mode should lead to a scheduling event assuming we pass a
> > spinlock_t (ignoring the irq argument).
> > Should the task be in user-mode then we return to user mode with the TIF
> > flag cleared and the NEED-RESCHED flag folded into the preemption
> > counter.
> > 
> > I am once again asking to limit this to _TIF_NEED_RESCHED_LAZY.
> 
> Would the proposal(patches 7-11) to have an API/Mechanism, as Thomas suggested,
> for RT threads to indicate not to be delayed address the concern?.  
> Also there is the proposal to have a kernel parameter to disable delaying 
> RT threads in general, when granting extra time to the running task.

While I appreciate the effort, I don't see the need for this
functionality atm. I would say just get the basic infrastructure in
place, focusing on LAZY preempt, and ignore the wakes for tasks with
elevated priority. If this works reliably and people indeed ask for
delayed wakes for RT threads, then this can be added, assuming you have
enough flexibility in the API to allow it. Then you would also have a
use-case guiding how to implement it.

Looking at 07/11, you set task_struct::sched_nodelay if this is
requested. In 09/11 you set TIF_NEED_RESCHED_NODELAY if that flag is
set. In 08/11 you use that flag additionally for wake ups and propagate
it for the architecture. Phew.
If a task needs to set this flag first in order to be excluded from the
delayed wake ups, then I don't see how this can work for kernel threads
such as threaded interrupts or a user thread which is PI-boosted and
inherits the RT priority.

On the other hand, let's assume you check and clear only
TIF_NEED_RESCHED_LAZY. Let's say people ask to extend the delayed wakes
to certain userland RT threads. Then you could add a prctl() to turn
TIF_NEED_RESCHED into TIF_NEED_RESCHED_LAZY for the "marked" threads,
saying: I don't mind if this particular thread gets delayed.
If this is needed for all threads in the system, you could do a
system-wide sysctl and so on.
You would get all this without another TIF bit, and tracing would keep
reliably showing an N or L flag.
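
As a rough sketch of that wake up side demotion (the opt-in field name
is invented for illustration):

    /* Pick the resched flavor based on the thread's opt-in */
    static inline int resched_tif_for(struct task_struct *curr)
    {
    	/* An opted-in thread accepts the delayable LAZY flavor ... */
    	if (curr->rseq_delay_rt_ok)
    		return TIF_NEED_RESCHED_LAZY;
    	/* ... everybody else keeps the immediate reschedule */
    	return TIF_NEED_RESCHED;
    }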

> Thanks,
> -Prakash
> 
Sebastian


* Re: [PATCH V7 09/11] sched: Add nodelay scheduling
  2025-07-24 16:16 ` [PATCH V7 09/11] sched: Add nodelay scheduling Prakash Sangappa
@ 2025-08-08 13:26   ` Thomas Gleixner
  2025-08-08 16:54     ` Prakash Sangappa
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2025-08-08 13:26 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, bigeasy, kprateek.nayak,
	vineethr, prakash.sangappa

On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c

You forgot dl.c :)

> @@ -1027,7 +1027,7 @@ static void update_curr_rt(struct rq *rq)
>  			rt_rq->rt_time += delta_exec;
>  			exceeded = sched_rt_runtime_exceeded(rt_rq);
>  			if (exceeded)
> -				resched_curr(rq);
> +				resched_curr_nodelay(rq, rq->curr);

How is this possibly correct?

If the current task has nodelay set, then this means it asks not to be
affected by a slice extension of a lower priority task.

But that aside, I agree with Sebastian, that this is overly complex and
yet another TIF RESCHED flag is just horrible. We should avoid it in the
first place unless there is a real use case.

RT uses the LAZY flag for non-RT tasks, which means if the regular
RESCHED is set on RT, then we just go and preempt and decline the
extension.

If there is a real use case somewhere down the road, we can revisit the
problem later. Keep it simple for now.

Thanks,

        tglx



* Re: [PATCH V7 09/11] sched: Add nodelay scheduling
  2025-08-08 13:26   ` Thomas Gleixner
@ 2025-08-08 16:54     ` Prakash Sangappa
  0 siblings, 0 replies; 38+ messages in thread
From: Prakash Sangappa @ 2025-08-08 16:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	bigeasy@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Aug 8, 2025, at 6:26 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Thu, Jul 24 2025 at 16:16, Prakash Sangappa wrote:
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
> 
> You forgot dl.c :)
> 
>> @@ -1027,7 +1027,7 @@ static void update_curr_rt(struct rq *rq)
>> rt_rq->rt_time += delta_exec;
>> exceeded = sched_rt_runtime_exceeded(rt_rq);
>> if (exceeded)
>> - resched_curr(rq);
>> + resched_curr_nodelay(rq, rq->curr);
> 
> How is this possibly correct?
> 
> If the current task has nodelay set, then this means it asks not to be
> affected by a slice extension of a lower priority task.
> 
> But that aside, I agree with Sebastian, that this is overly complex and
> yet another TIF RESCHED flag is just horrible. We should avoid it in the
> first place unless there is a real use case.
> 

This was a prototype. It appears it would get complex.

> RT uses the LAZY flag for non-RT tasks, which means if the regular
> RESCHED is set on RT, then we just go and preempt and decline the
> extension.

So we allow the extension only if LAZY is set.

> 
> If there is a real use case somewhere down the road, we can revisit the
> problem later. Keep it simple for now.

OK, I will drop these patches in the next round.
> 

Thanks,
-Prakash.
> Thanks,
> 
>        tglx
> 



* Re: [PATCH V7 01/11] sched: Scheduler time slice extension
  2025-08-08  9:59         ` Sebastian Andrzej Siewior
@ 2025-08-08 17:00           ` Prakash Sangappa
  2025-08-11  6:28             ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 38+ messages in thread
From: Prakash Sangappa @ 2025-08-08 17:00 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, linux-kernel@vger.kernel.org,
	peterz@infradead.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Aug 8, 2025, at 2:59 AM, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> On 2025-08-07 16:56:33 [+0000], Prakash Sangappa wrote:
>>>>> + if (__rseq_delay_resched()) {
>>>>> + clear_tsk_need_resched(current);
>>>> 
>>>> Why has this to be inline and is not done in __rseq_delay_resched()?
>>> 
>>> A SCHED_OTHER wake up sets _TIF_NEED_RESCHED_LAZY, so
>>> clear_tsk_need_resched() will revoke this, granting an extension.
>>> 
>>> The RT/DL wake up will set _TIF_NEED_RESCHED and
>>> clear_tsk_need_resched() will also clear it. However this one
>>> additionally sets set_preempt_need_resched() so the next preempt
>>> disable/ enable combo will lead to a scheduling event. A remote wakeup
>>> will trigger an IPI (scheduler_ipi()) which also does
>>> set_preempt_need_resched().
>>> 
>>> If I understand this correctly, then an RT/DL wake up while the task is in
>>> kernel-mode should lead to a scheduling event assuming we pass a
>>> spinlock_t (ignoring the irq argument).
>>> Should the task be in user-mode then we return to user mode with the TIF
>>> flag cleared and the NEED-RESCHED flag folded into the preemption
>>> counter.
>>> 
>>> I am once again asking to limit this to _TIF_NEED_RESCHED_LAZY.
>> 
> Would the proposal (patches 7-11) to have an API/mechanism, as Thomas suggested,
> for RT threads to indicate they should not be delayed address the concern?
> Also, there is the proposal to have a kernel parameter to disable delaying
> RT threads in general when granting extra time to the running task.
> 
> While I appreciate the effort, I don't see the need for this
> functionality atm. I would say just get the basic infrastructure in
> place, focusing on LAZY preempt, and ignore the wakes for tasks with
> elevated priority. If this works reliably and people indeed ask for
> delayed wakes for RT threads, then this can be added, assuming you have
> enough flexibility in the API to allow it. Then you would also have a
> use-case guiding how to implement it.
> 
> Looking at 07/11, you set task_struct::sched_nodelay if this is
> requested. In 09/11 you set TIF_NEED_RESCHED_NODELAY if that flag is
> set. In 08/11 you use that flag additionally for wake ups and propagate
> it for the architecture. Phew.
> If a task needs to set this flag first in order to be excluded from the
> delayed wake ups, then I don't see how this can work for kernel threads
> such as threaded interrupts or a user thread which is PI-boosted and
> inherits the RT priority.
> 
> On the other hand, let's assume you check and clear only
> TIF_NEED_RESCHED_LAZY. Let's say people ask to extend the delayed wakes
> to certain userland RT threads. Then you could add a prctl() to turn
> TIF_NEED_RESCHED into TIF_NEED_RESCHED_LAZY for the "marked" threads,
> saying: I don't mind if this particular thread gets delayed.
> If this is needed for all threads in the system, you could do a
> system-wide sysctl and so on.
> You would get all this without another TIF bit, and tracing would keep
> reliably showing an N or L flag.

Ok, will drop these patches in the next round.

Should we just consider adding a sysctl to choose if we want to delay if
TIF_NEED_RESCHED is set?

-Prakash





> 
>> Thanks,
>> -Prakash
>> 
> Sebastian



* Re: [PATCH V7 01/11] sched: Scheduler time slice extension
  2025-08-08 17:00           ` Prakash Sangappa
@ 2025-08-11  6:28             ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-08-11  6:28 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Thomas Gleixner, linux-kernel@vger.kernel.org,
	peterz@infradead.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

On 2025-08-08 17:00:27 [+0000], Prakash Sangappa wrote:
> Should we just consider adding a sysctl to choose if we want to delay if
> TIF_NEED_RESCHED is set?

Please don't. Just focus on the LAZY wake up.

> -Prakash

Sebastian


* Re: [PATCH V7 02/11] sched: Indicate if thread got rescheduled
  2025-08-07 16:15     ` Prakash Sangappa
@ 2025-08-11  9:45       ` Thomas Gleixner
  2025-08-13 16:19         ` bigeasy
  2025-08-14  7:18         ` Prakash Sangappa
  0 siblings, 2 replies; 38+ messages in thread
From: Thomas Gleixner @ 2025-08-11  9:45 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	bigeasy@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

On Thu, Aug 07 2025 at 16:15, Prakash Sangappa wrote:
>> On Aug 7, 2025, at 6:06 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> rseq_preempt() sets RSEQ_EVENT_PREEMPT_BIT and TIF_NOTIFY_RESUME, which
>> causes exit to userspace to invoke __rseq_handle_notify_resume(). That's
>> the obvious place to handle this instead of inflicting it into the
>> scheduler hotpath.
>> 
>> No?
>
> Sure, I will look at moving rseq_delay_resched_clear() call to __rseq_handle_notify_resume().

I looked deeper into it and it does not completely solve the problem.

The approach of having a request bit and then a disconnected rescheduled
bit is not working. You need a proper contract between kernel and
userspace and you have to enforce it.

You gracefully avoided providing an actual ABI description and a user
space test case for this...

You need two bits in rseq::flags: REQUEST and GRANTED

The flow is:

    local_set_bit(REQUEST, &rseq->flags);
    critical_section();
    if (!local_test_and_clear_bit(REQUEST, &rseq->flags)) {
    	if (local_test_bit(GRANTED, &rseq->flags))
        	sched_yield();
    }

local_set_bit() could be a simple

            rseq->flags |= REQUEST;

operation when and only when there is no other usage of rseq->flags than
this extension muck. Otherwise the resulting RMW would race against the
kernel updating flags.

If that's not guaranteed, and it won't be because flags might be used for
other things later, then local_set_bit() must use an instruction which is
atomic on the local CPU vs. interrupts, e.g. ORB on X86. There is no
LOCK prefix required as there is no cross CPU concurrency possible due
to rseq being strictly thread local.

The only way to avoid that is to provide a distinct new rseq field for
this, but that's a different debate to be had.

local_test_and_clear_bit() on the other hand _must_ always be thread
local atomic to prevent the obvious RMW race. On X86 this is a simple
BTR without LOCK prefix. Only ALPHA, LONGARCH, MIPS, POWERPC and X86
provide such local operations; on everything else you need to fall back
to a full atomic.
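
For illustration, the two RMW operations could look like this on X86
(a sketch, assuming a 32bit flags word):

    static inline void local_set_bit(unsigned int bit, unsigned int *flags)
    {
    	/* No LOCK prefix: atomic versus the local CPU (interrupts) only */
    	asm volatile("orl %1, %0" : "+m" (*flags) : "r" (1U << bit));
    }

    static inline bool local_test_and_clear_bit(unsigned int bit, unsigned int *flags)
    {
    	bool ret;

    	/* BTR without LOCK: the previous bit value is returned in CF */
    	asm volatile("btrl %2, %0" : "+m" (*flags), "=@ccc" (ret) : "r" (bit));
    	return ret;
    }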

local_test_bit() has no atomicity requirements as there is obviously a
race which cannot be avoided:

    	if (local_test_bit(GRANTED))
->        
        	sched_yield();

If the interrupt hits between the test and the actual syscall entry,
then the kernel might reschedule and clear the grant.

And no, local_test_and_clear(GRANTED) does not help either because if
that evaluates to true, then the syscall has to be issued anyway to
reset the kernel state for the price of a brief period of inconsistent
state between kernel and user space, which is not an option at all.

The kernel side does in the NEED_RESCHED check:

    if (!tsk->state)
    	return false;

    if (tsk->state == GRANTED) {
    	tsk->rseq->flags &= ~GRANTED;
    	tsk->state = ENABLED;
    	return false;
    }

    if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
    	return false;

    if (!(tsk->rseq->flags & REQUEST))
    	return false;

    tsk->rseq->flags &= ~REQUEST;
    tsk->rseq->flags |= GRANTED;
    tsk->state = GRANTED;
    return true;

and sched_yield() does:

    if (tsk->state == GRANTED) {
    	tsk->rseq->flags &= ~GRANTED;
        set_need_resched();
    }

This obviously needs some sanity checks to see whether user space violated the
contract, but they are cheap because the operations on the user space
flags are RMW and the value is already loaded into a register.

Now, taking the rseq event_mask into account, the NEED_RESCHED side
additionally needs:

    if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
    	return false;

+    if (tsk->rseq_event_mask)
+    	return false;

    if (!(tsk->rseq->flags & REQUEST))
    	return false;

Because if that is set, then the task was preempted, migrated, or had a
signal delivered, and then the slice extension is moot.

The sched_yield() abuse wants some sanity checking. The simplest way to
achieve that is to create SYSCALL_WORK for it.

When granted: 
   set_task_syscall_work(t, SYSCALL_RSEQ_SLICE);

On reset
   clear_task_syscall_work(t, SYSCALL_RSEQ_SLICE);

Plus the corresponding syscall work function, which sets NEED_RESCHED
and clears the kernel and user space state. Along with state checks and
a check whether syscallnr == sched_yield. If not, kill the beast.

You _CANNOT_ rely on user space behaving correctly; you need to enforce
it, and inconsistent state is not an option. Look how strict the RSEQ
critical section code or the futex code is about that. There is no room
for assumptions.

It's still required to hook into __rseq_handle_notify_resume() to reset
the grant when event_mask is not empty. This handles the case where the
task is scheduled out in exit work, e.g. through a cond_resched() or
blocking on a mutex. Then the subsequent iteration in the loop won't
have NEED_RESCHED set, but the GRANTED state is still there and makes no
sense anymore. In that case TIF_NOTIFY_RESUME is set, which ends up
calling into the rseq code.

TBH, my interest in staring at yet another variant of undocumented hacks,
which inflict pointless overhead into the hotpaths, is very close to
zero.

As I've already analyzed this in depth, I sat down for half a day
and converted the analysis into code.

See combo patch below. I still need to address a few details and write
change logs for the 17 patches, which introduce this gradually and in a
reviewable way. I'll send that out in the next days.

What's interesting is that the selftest does not expose a big advantage
vs. the rescheduled case.

 # Success        1491820
 # Yielded         123792
 # Raced                0
 # Scheduled           27

but that might depend on the actual scheduling interference pattern.

The Success number might be misleading as the kernel might still have
rescheduled without touching the user space bits, but enforcing an
update for that is just extra pointless overhead.

I wasn't able to trigger the sched_yield() race yet, but that's
obviously a question of interrupt and scheduling patterns as well.

Thanks,

        tglx
---
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -2,11 +2,12 @@
 #ifndef __LINUX_IRQENTRYCOMMON_H
 #define __LINUX_IRQENTRYCOMMON_H
 
+#include <linux/context_tracking.h>
+#include <linux/kmsan.h>
+#include <linux/rseq.h>
 #include <linux/static_call_types.h>
 #include <linux/syscalls.h>
-#include <linux/context_tracking.h>
 #include <linux/tick.h>
-#include <linux/kmsan.h>
 #include <linux/unwind_deferred.h>
 
 #include <asm/entry-common.h>
@@ -67,6 +68,7 @@ static __always_inline bool arch_in_rcu_
 
 /**
  * enter_from_user_mode - Establish state when coming from user mode
+ * @regs:	Pointer to current's pt_regs
  *
  * Syscall/interrupt entry disables interrupts, but user mode is traced as
  * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
@@ -195,15 +197,13 @@ static __always_inline void arch_exit_to
  */
 void arch_do_signal_or_restart(struct pt_regs *regs);
 
-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- */
-unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-				     unsigned long ti_work);
+/* Handle pending TIF work */
+unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work, bool from_irq);
 
 /**
  * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
  * @regs:	Pointer to pt_regs on entry stack
+ * @from_irq:	Exiting to user space from an interrupt
  *
  * 1) check that interrupts are disabled
  * 2) call tick_nohz_user_enter_prepare()
@@ -211,7 +211,7 @@ unsigned long exit_to_user_mode_loop(str
  *    EXIT_TO_USER_MODE_WORK are set
  * 4) check that interrupts are still disabled
  */
-static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs, bool from_irq)
 {
 	unsigned long ti_work;
 
@@ -222,16 +222,28 @@ static __always_inline void exit_to_user
 
 	ti_work = read_thread_flags();
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
-		ti_work = exit_to_user_mode_loop(regs, ti_work);
+		ti_work = exit_to_user_mode_loop(regs, ti_work, from_irq);
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
+	rseq_exit_to_user_mode();
+
 	/* Ensure that kernel state is sane for a return to userspace */
 	kmap_assert_nomap();
 	lockdep_assert_irqs_disabled();
 	lockdep_sys_exit();
 }
 
+static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+	exit_to_user_mode_prepare(regs, false);
+}
+
+static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+	exit_to_user_mode_prepare(regs, true);
+}
+
 /**
  * exit_to_user_mode - Fixup state when exiting to user mode
  *
@@ -354,6 +366,7 @@ irqentry_state_t noinstr irqentry_enter(
  * Conditional reschedule with additional sanity checks.
  */
 void raw_irqentry_exit_cond_resched(void);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 #define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -4,6 +4,7 @@
 
 #ifdef CONFIG_RSEQ
 
+#include <linux/jump_label.h>
 #include <linux/preempt.h>
 #include <linux/sched.h>
 
@@ -61,6 +62,20 @@ static inline void rseq_migrate(struct t
 	rseq_set_notify_resume(t);
 }
 
+static __always_inline void rseq_slice_extension_timer(void);
+
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+	rseq_slice_extension_timer();
+	/*
+	 * Clear the event mask so it does not contain stale bits when
+	 * coming back from user space.
+	 */
+	current->rseq_event_mask = 0;
+}
+
+static inline void rseq_slice_fork(struct task_struct *t, bool inherit);
+
 /*
  * If parent process has a registered restartable sequences area, the
  * child inherits. Unregister rseq for a clone with CLONE_VM set.
@@ -72,11 +87,13 @@ static inline void rseq_fork(struct task
 		t->rseq_len = 0;
 		t->rseq_sig = 0;
 		t->rseq_event_mask = 0;
+		rseq_slice_fork(t, false);
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
 		t->rseq_event_mask = current->rseq_event_mask;
+		rseq_slice_fork(t, true);
 	}
 }
 
@@ -86,46 +103,127 @@ static inline void rseq_execve(struct ta
 	t->rseq_len = 0;
 	t->rseq_sig = 0;
 	t->rseq_event_mask = 0;
+	rseq_slice_fork(t, false);
 }
 
-#else
+#else /* CONFIG_RSEQ */
+static inline void rseq_set_notify_resume(struct task_struct *t) { }
+static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_preempt(struct task_struct *t) { }
+static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
+static inline void rseq_execve(struct task_struct *t) { }
+static inline void rseq_exit_to_user_mode(void) { }
+#endif  /* !CONFIG_RSEQ */
 
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
-}
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
-					     struct pt_regs *regs)
-{
-}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
-				       struct pt_regs *regs)
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+/*
+ * Constants for task::rseq_slice_extension:
+ *
+ * ENABLED is set when the task enables it via prctl()
+ * GRANTED is set when the kernel grants an extension on interrupt return
+ *	   to user space. Implies ENABLED
+ */
+#define RSEQ_SLICE_EXTENSION_ENABLED	0x1
+#define RSEQ_SLICE_EXTENSION_GRANTED	0x2
+
+DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static inline bool rseq_slice_extension_enabled(void)
 {
+	return static_branch_likely(&rseq_slice_extension_key);
 }
-static inline void rseq_preempt(struct task_struct *t)
+
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
+bool rseq_syscall_enter_work(long syscall);
+void __rseq_slice_extension_timer(void);
+bool __rseq_grant_slice_extension(unsigned int slext);
+
+#ifdef CONFIG_PREEMPT_RT
+#define	RSEQ_TIF_DENY	(_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | _TIF_NEED_RESCHED)
+#else
+#define	RSEQ_TIF_DENY	(_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)
+#endif
+
+static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work)
 {
+	unsigned int slext;
+
+	if (!rseq_slice_extension_enabled())
+		return false;
+
+	slext = current->rseq_slice_extension;
+	if (likely(!slext))
+		return false;
+
+	/*
+	 * Two quick check conditions where a grant is not possible:
+	 *  1) Signal is pending, which means the task will return
+	 *     to the signal handler and not to the interrupted code
+	 *
+	 *  2) On RT, when NEED_RESCHED is set. RT grants only when
+	 *     NEED_RESCHED_LAZY is set.
+	 *
+	 * In both cases __rseq_grant_slice_extension() has to be invoked
+	 * when the extension was already granted to clear it.
+	 */
+	if (ti_work & RSEQ_TIF_DENY && !(slext & RSEQ_SLICE_EXTENSION_GRANTED))
+		return false;
+	return __rseq_grant_slice_extension(slext);
+}
+
+static inline bool rseq_slice_extension_resched(void)
+{
+	if (!rseq_slice_extension_enabled())
+		return false;
+
+	if (unlikely(current->rseq_slice_extension & RSEQ_SLICE_EXTENSION_GRANTED)) {
+		set_tsk_need_resched(current);
+		return true;
+	}
+	return false;
 }
-static inline void rseq_migrate(struct task_struct *t)
+
+static __always_inline void rseq_slice_extension_timer(void)
 {
+	if (!rseq_slice_extension_enabled())
+		return;
+
+	if (unlikely(current->rseq_slice_extension & RSEQ_SLICE_EXTENSION_GRANTED))
+		__rseq_slice_extension_timer();
 }
-static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
+
+static inline void rseq_slice_fork(struct task_struct *t, bool inherit)
 {
+	if (inherit)
+		t->rseq_slice_extension = current->rseq_slice_extension;
+	else
+		t->rseq_slice_extension = 0;
 }
-static inline void rseq_execve(struct task_struct *t)
+
+static inline void rseq_slice_extension_disable(void)
 {
+	current->rseq_slice_extension = 0;
 }
 
-#endif
-
-#ifdef CONFIG_DEBUG_RSEQ
-
-void rseq_syscall(struct pt_regs *regs);
-
-#else
-
-static inline void rseq_syscall(struct pt_regs *regs)
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline bool rseq_slice_extension_enabled(void) { return false; }
+static inline bool rseq_slice_extension_resched(void) { return false; }
+static inline bool rseq_syscall_enter_work(long syscall) { return false; }
+static __always_inline void rseq_slice_extension_timer(void) { }
+static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
 {
+	return -EINVAL;
 }
+static inline void rseq_slice_fork(struct task_struct *t, bool inherit) { }
+static inline void rseq_slice_extension_disable(void) { }
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
 
-#endif
+#ifdef CONFIG_DEBUG_RSEQ
+void rseq_debug_syscall_exit(struct pt_regs *regs);
+#else /* CONFIG_DEBUG_RSEQ */
+static inline void rseq_debug_syscall_exit(struct pt_regs *regs) { }
+#endif /* !CONFIG_DEBUG_RSEQ */
 
 #endif /* _LINUX_RSEQ_H */
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -8,24 +8,36 @@
  * Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
  */
 
+#include <linux/prctl.h>
+#include <linux/ratelimit.h>
+#include <linux/rseq.h>
 #include <linux/sched.h>
-#include <linux/uaccess.h>
 #include <linux/syscalls.h>
-#include <linux/rseq.h>
+#include <linux/sysctl.h>
 #include <linux/types.h>
-#include <linux/ratelimit.h>
+#include <linux/uaccess.h>
+
 #include <asm/ptrace.h>
 
+#include "sched/hrtick.h"
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/rseq.h>
 
 /* The original rseq structure size (including padding) is 32 bytes. */
 #define ORIG_RSEQ_SIZE		32
 
-#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
-				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
+#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT |	\
+				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL |	\
 				  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
 
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+#define RSEQ_CS_VALID_FLAGS	 (RSEQ_CS_FLAG_SLICE_EXT_REQUEST |	\
+				  RSEQ_CS_FLAG_SLICE_EXT_GRANTED)
+#else
+#define RSEQ_CS_VALID_FLAGS	 (0)
+#endif
+
 #ifdef CONFIG_DEBUG_RSEQ
 static struct rseq *rseq_kernel_fields(struct task_struct *t)
 {
@@ -313,12 +325,12 @@ static bool rseq_warn_flags(const char *
 {
 	u32 test_flags;
 
-	if (!flags)
+	if (!(flags & ~RSEQ_CS_VALID_FLAGS))
 		return false;
 	test_flags = flags & RSEQ_CS_NO_RESTART_FLAGS;
 	if (test_flags)
 		pr_warn_once("Deprecated flags (%u) in %s ABI structure", test_flags, str);
-	test_flags = flags & ~RSEQ_CS_NO_RESTART_FLAGS;
+	test_flags = flags & ~(RSEQ_CS_NO_RESTART_FLAGS | RSEQ_CS_VALID_FLAGS);
 	if (test_flags)
 		pr_warn_once("Unknown flags (%u) in %s ABI structure", test_flags, str);
 	return true;
@@ -410,6 +422,8 @@ static int rseq_ip_fixup(struct pt_regs
 	return 0;
 }
 
+static inline bool rseq_reset_slice_extension(struct task_struct *t);
+
 /*
  * This resume handler must always be executed between any of:
  * - preemption,
@@ -430,11 +444,16 @@ void __rseq_handle_notify_resume(struct
 		return;
 
 	/*
-	 * regs is NULL if and only if the caller is in a syscall path.  Skip
-	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
-	 * kill a misbehaving userspace on debug kernels.
+	 * If invoked from hypervisors or io_uring, @regs is a NULL pointer,
+	 * so fixup cannot be done. If the syscall which led to this
+	 * invocation was invoked inside a critical section, then it can
+	 * only be detected on a debug kernel in rseq_debug_syscall_exit(),
+	 * which will detect and kill a misbehaving userspace.
 	 */
 	if (regs) {
+		if (!rseq_reset_slice_extension(t))
+			goto error;
+
 		ret = rseq_ip_fixup(regs);
 		if (unlikely(ret < 0))
 			goto error;
@@ -454,7 +473,7 @@ void __rseq_handle_notify_resume(struct
  * Terminate the process if a syscall is issued within a restartable
  * sequence.
  */
-void rseq_syscall(struct pt_regs *regs)
+void rseq_debug_syscall_exit(struct pt_regs *regs)
 {
 	unsigned long ip = instruction_pointer(regs);
 	struct task_struct *t = current;
@@ -490,6 +509,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 		ret = rseq_reset_rseq_cpu_node_id(current);
 		if (ret)
 			return ret;
+		rseq_slice_extension_disable();
 		current->rseq = NULL;
 		current->rseq_sig = 0;
 		current->rseq_len = 0;
@@ -571,3 +591,189 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 
 	return 0;
 }
+
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static unsigned int rseq_slice_ext_nsecs __read_mostly = 30 * NSEC_PER_USEC;
+
+#ifdef CONFIG_SYSCTL
+static const unsigned int rseq_slice_ext_nsecs_min = 10 * NSEC_PER_USEC;
+static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC;
+
+static const struct ctl_table rseq_slice_ext_sysctl[] = {
+	{
+		.procname	= "rseq_slice_extension_nsec",
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec_minmax,
+		.extra1		= (unsigned int *)&rseq_slice_ext_nsecs_min,
+		.extra2		= (unsigned int *)&rseq_slice_ext_nsecs_max,
+	},
+};
+
+static int __init rseq_sysctl_init(void)
+{
+	register_sysctl("kernel", rseq_slice_ext_sysctl);
+	return 0;
+}
+device_initcall(rseq_sysctl_init);
+#endif /* !CONFIG_SYSCTL */
+
+static inline bool rseq_clear_slice_granted(struct task_struct *curr, u32 rflags)
+{
+	/* Check whether user space violated the contract */
+	if (rflags & RSEQ_CS_FLAG_SLICE_EXT_REQUEST)
+		return false;
+	if (!(rflags & RSEQ_CS_FLAG_SLICE_EXT_GRANTED))
+		return false;
+
+	rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_GRANTED;
+	return !put_user(rflags, &curr->rseq->flags);
+}
+
+static bool __rseq_reset_slice_extension(struct task_struct *curr)
+{
+	u32 rflags;
+
+	if (get_user(rflags, &curr->rseq->flags))
+		return false;
+	return rseq_clear_slice_granted(curr, rflags);
+}
+
+static inline bool rseq_reset_slice_extension(struct task_struct *curr)
+{
+	if (!rseq_slice_extension_enabled())
+		return true;
+
+	if (likely(!(curr->rseq_slice_extension & RSEQ_SLICE_EXTENSION_GRANTED)))
+		return true;
+	if (likely(!curr->rseq_event_mask))
+		return true;
+
+	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
+
+	return __rseq_reset_slice_extension(curr);
+}
+
+/*
+ * Invoked from syscall entry if a time slice extension was granted and the
+ * kernel did not clear it before user space left the critical section.
+ */
+bool rseq_syscall_enter_work(long syscall)
+{
+	struct task_struct *curr = current;
+	unsigned int slext = curr->rseq_slice_extension;
+
+	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
+
+	/*
+	 * Kernel internal state inconsistency. SYSCALL_RSEQ_SLICE can only
+	 * be set when state is GRANTED!
+	 */
+	if (WARN_ON_ONCE(slext != RSEQ_SLICE_EXTENSION_GRANTED))
+		return false;
+
+	set_tsk_need_resched(curr);
+
+	if (unlikely(!__rseq_reset_slice_extension(curr) || syscall != __NR_sched_yield))
+		force_sigsegv(0);
+
+	/* Abort syscall to reschedule immediately */
+	return true;
+}
+
+bool __rseq_grant_slice_extension(unsigned int slext)
+{
+	struct task_struct *curr = current;
+	u32 rflags;
+
+	if (unlikely(get_user(rflags, &curr->rseq->flags)))
+		goto die;
+
+	/*
+	 * Happens when exit_to_user_mode_loop() loops and has
+	 * TIF_NEED_RESCHED* set again. Clear the grant and schedule.
+	 */
+	if (unlikely(slext == RSEQ_SLICE_EXTENSION_GRANTED)) {
+		curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
+		clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+		if (!rseq_clear_slice_granted(curr, rflags))
+			goto die;
+		return false;
+	}
+
+	/* User space set the flag. That's a violation of the contract. */
+	if (unlikely(rflags & RSEQ_CS_FLAG_SLICE_EXT_GRANTED))
+		goto die;
+
+	/* User space is not interested. */
+	if (likely(!(rflags & RSEQ_CS_FLAG_SLICE_EXT_REQUEST)))
+		return false;
+
+	/*
+	 * Don't bother if the rseq event mask has bits pending. The task
+	 * was preempted.
+	 */
+	if (curr->rseq_event_mask)
+		return false;
+
+	/* Grant the request and update user space */
+	rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_REQUEST;
+	rflags |= RSEQ_CS_FLAG_SLICE_EXT_GRANTED;
+	if (unlikely(put_user(rflags, &curr->rseq->flags)))
+		goto die;
+
+	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_GRANTED;
+	set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+	clear_tsk_need_resched(curr);
+	return true;
+die:
+	force_sig(SIGSEGV);
+	return false;
+}
+
+void __rseq_slice_extension_timer(void)
+{
+	hrtick_extend_timeslice(rseq_slice_ext_nsecs);
+}
+
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+	switch (arg2) {
+	case PR_RSEQ_SLICE_EXTENSION_GET:
+		if (arg3)
+			return -EINVAL;
+		return current->rseq_slice_extension ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
+
+	case PR_RSEQ_SLICE_EXTENSION_SET:
+		if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
+			return -EINVAL;
+		if (!rseq_slice_extension_enabled() || !current->rseq)
+			return -ENOTSUPP;
+		current->rseq_slice_extension = (arg3 & PR_RSEQ_SLICE_EXT_ENABLE) ?
+			RSEQ_SLICE_EXTENSION_ENABLED : 0;
+		return 0;
+
+	default:
+		return -EINVAL;
+	}
+}
+
+static int __init rseq_slice_cmdline(char *str)
+{
+	bool on;
+
+	if (kstrtobool(str, &on))
+		return -EINVAL;
+
+	if (!on)
+		static_branch_disable(&rseq_slice_extension_key);
+	return 0;
+}
+__setup("rseq_slice_ext=", rseq_slice_cmdline);
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline bool rseq_reset_slice_extension(struct task_struct *t) { return true; }
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -3,11 +3,11 @@
 #define __LINUX_ENTRYCOMMON_H
 
 #include <linux/irq-entry-common.h>
+#include <linux/livepatch.h>
 #include <linux/ptrace.h>
+#include <linux/resume_user_mode.h>
 #include <linux/seccomp.h>
 #include <linux/sched.h>
-#include <linux/livepatch.h>
-#include <linux/resume_user_mode.h>
 
 #include <asm/entry-common.h>
 #include <asm/syscall.h>
@@ -36,6 +36,7 @@
 				 SYSCALL_WORK_SYSCALL_EMU |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_SYSCALL_RSEQ_SLICE |	\
 				 ARCH_SYSCALL_WORK_ENTER)
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
@@ -61,8 +62,7 @@
  */
 void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
 
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
-			 unsigned long work);
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
 
 /**
  * syscall_enter_from_user_mode_work - Check and handle work before invoking
@@ -162,7 +162,7 @@ static __always_inline void syscall_exit
 			local_irq_enable();
 	}
 
-	rseq_syscall(regs);
+	rseq_debug_syscall_exit(regs);
 
 	/*
 	 * Do one-time syscall specific work. If these work items are
@@ -172,7 +172,7 @@ static __always_inline void syscall_exit
 	if (unlikely(work & SYSCALL_WORK_EXIT))
 		syscall_exit_work(regs, work);
 	local_irq_disable_exit_to_user();
-	exit_to_user_mode_prepare(regs);
+	syscall_exit_to_user_mode_prepare(regs);
 }
 
 /**
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -15,9 +15,10 @@ void __weak arch_do_signal_or_restart(st
  * exit_to_user_mode_loop - do any pending work before leaving to user space
  * @regs:	Pointer to pt_regs on entry stack
  * @ti_work:	TIF work flags as read by the caller
+ * @from_irq:	Exiting to user space from an interrupt
  */
-__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-						     unsigned long ti_work)
+__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work,
+						     bool from_irq)
 {
 	/*
 	 * Before returning to user space ensure that all pending work
@@ -27,8 +28,15 @@ void __weak arch_do_signal_or_restart(st
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
-			schedule();
+		/*
+		 * FIXME: This should actually take the execution time
+		 *        of the rest of the loop into account and refuse
+		 *        the extension if there is other work to do.
+		 */
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+			if (!from_irq || !rseq_grant_slice_extension(ti_work))
+				schedule();
+		}
 
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
@@ -70,7 +78,7 @@ noinstr void irqentry_enter_from_user_mo
 noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 {
 	instrumentation_begin();
-	exit_to_user_mode_prepare(regs);
+	irqentry_exit_to_user_mode_prepare(regs);
 	instrumentation_end();
 	exit_to_user_mode();
 }
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
    ebpf/index
    ioctl/index
    mseal
+   rseq
 
 Security-related interfaces
 ===========================
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,92 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per-thread userspace memory area
+to be used as an ABI between kernel and user-space for three purposes:
+
+ * user-space restartable sequences
+
+ * quick access to read the current CPU number, node ID from user-space
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow user-space to perform update operations on
+per-cpu data without requiring heavy-weight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows implementing per-CPU data efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section, to avoid contention on a resource when the thread is
+scheduled out inside the critical section.
+
+The prerequisites for this functionality are:
+
+    * Enabled in Kconfig
+
+    * Enabled at boot time (default is enabled)
+
+    * A rseq user space pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+If granted the thread can request a time slice extension by setting the
+``RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT`` in the rseq::flags field. If the
+thread is interrupted and the interrupt results in a reschedule request in
+the kernel, then the kernel can grant a time slice extension and return to
+user space instead of scheduling out. The kernel indicates the grant by
+clearing ``RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT`` and setting
+``RSEQ_CS_FLAG_SLICE_EXT_GRANTED_BIT`` in the rseq::flags field. If there
+is a reschedule of the thread after granting the extension, the kernel
+clears the granted bit to indicate that to user space.
+
+If the request bit is still set when leaving the critical section, user
+space can clear it and continue.
+
+If the granted bit is set, then user space has to invoke sched_yield() when
+leaving the critical section to relinquish the CPU. The kernel enforces
+this by arming a timer to prevent misbehaving user space from abusing this
+mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by user space.
+
+The required code flow is as follows::
+
+    local_set_bit(REQUEST, &rseq->flags);
+    critical_section();
+    if (!local_test_and_clear_bit(REQUEST, &rseq->flags)) {
+        if (local_test_bit(GRANTED, &rseq->flags))
+                sched_yield();
+    }
+
+The local bit operations on the flags, except for local_test_bit(), have
+to be atomic versus the local CPU to prevent the obvious RMW race against
+an interrupt. On X86 this can be achieved with ORB and BTRL without the
+LOCK prefix. On architectures which do not provide lightweight CPU-local
+atomics this needs to be implemented with regular atomic operations.
+
+local_test_bit() has no atomicity requirements as there is obviously a
+race which cannot be avoided at all::
+
+        if (local_test_bit(GRANTED))
+        -> Interrupt results in schedule and grant revocation
+                sched_yield();
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1408,7 +1408,10 @@ struct task_struct {
 	 * RmW on rseq_event_mask must be performed atomically
 	 * with respect to preemption.
 	 */
-	unsigned long rseq_event_mask;
+	unsigned long			rseq_event_mask;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+	unsigned int			rseq_slice_extension;
+#endif
 # ifdef CONFIG_DEBUG_RSEQ
 	/*
 	 * This is a place holder to save a copy of the rseq fields for
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -26,6 +26,8 @@ enum rseq_cs_flags_bit {
 	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
+	RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT	= 3,
+	RSEQ_CS_FLAG_SLICE_EXT_GRANTED_BIT	= 4,
 };
 
 enum rseq_cs_flags {
@@ -35,6 +37,10 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	=
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+	RSEQ_CS_FLAG_SLICE_EXT_REQUEST		=
+		(1U << RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT),
+	RSEQ_CS_FLAG_SLICE_EXT_GRANTED		=
+		(1U << RSEQ_CS_FLAG_SLICE_EXT_GRANTED_BIT),
 };
 
 /*
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1883,6 +1883,18 @@ config RSEQ
 
 	  If unsure, say Y.
 
+config RSEQ_SLICE_EXTENSION
+	bool "Enable rseq based time slice extension mechanism"
+	depends on RSEQ && SCHED_HRTICK
+	help
+          Allows userspace to request a limited time slice extension when
+	  returning from an interrupt to user space via the RSEQ shared
+	  data ABI. If granted, that allows to complete a critical section,
+	  so that other threads are not stuck on a conflicted resource,
+	  while the task is scheduled out.
+
+	  If unsure, say N.
+
 config DEBUG_RSEQ
 	default n
 	bool "Enable debugging of rseq() system call" if EXPERT
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -376,4 +376,14 @@ struct prctl_mm_map {
 # define PR_FUTEX_HASH_SET_SLOTS	1
 # define PR_FUTEX_HASH_GET_SLOTS	2
 
+/* RSEQ time slice extensions */
+#define PR_RSEQ_SLICE_EXTENSION			79
+# define PR_RSEQ_SLICE_EXTENSION_GET		1
+# define PR_RSEQ_SLICE_EXTENSION_SET		2
+/*
+ * Bits for RSEQ_SLICE_EXTENSION_GET/SET
+ * PR_RSEQ_SLICE_EXT_ENABLE:	Enable
+ */
+# define PR_RSEQ_SLICE_EXT_ENABLE		0x01
+
 #endif /* _LINUX_PRCTL_H */
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -53,6 +53,7 @@
 #include <linux/time_namespace.h>
 #include <linux/binfmts.h>
 #include <linux/futex.h>
+#include <linux/rseq.h>
 
 #include <linux/sched.h>
 #include <linux/sched/autogroup.h>
@@ -2805,6 +2806,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 	case PR_FUTEX_HASH:
 		error = futex_hash_prctl(arg2, arg3, arg4);
 		break;
+	case PR_RSEQ_SLICE_EXTENSION:
+		if (arg4 || arg5)
+			return -EINVAL;
+		error = rseq_slice_extension_prctl(arg2, arg3);
+		break;
 	default:
 		trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
 		error = -EINVAL;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -90,6 +90,7 @@
 #include "stats.h"
 
 #include "autogroup.h"
+#include "hrtick.h"
 #include "pelt.h"
 #include "smp.h"
 
@@ -873,6 +874,10 @@ static enum hrtimer_restart hrtick(struc
 
 	WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
 
+	// CHECKME: Is this correct?
+	if (rseq_slice_extension_resched())
+		return HRTIMER_NORESTART;
+
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
 	rq->donor->sched_class->task_tick(rq, rq->curr, 1);
@@ -902,6 +907,14 @@ static void __hrtick_start(void *arg)
 	rq_unlock(rq, &rf);
 }
 
+void hrtick_extend_timeslice(ktime_t nsecs)
+{
+	struct rq *rq = this_rq();
+
+	guard(rq_lock_irqsave)(rq);
+	hrtimer_start(&rq->hrtick_timer, nsecs, HRTIMER_MODE_REL_PINNED_HARD);
+}
+
 /*
  * Called to set the hrtick timer state.
  *
--- /dev/null
+++ b/kernel/sched/hrtick.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _KERNEL_SCHED_HRTICK_H
+#define _KERNEL_SCHED_HRTICK_H
+
+/*
+ * Scheduler internal method to support time slice extensions,
+ * shared with rseq.
+ */
+void hrtick_extend_timeslice(ktime_t nsecs);
+
+#endif /* _KERNEL_SCHED_HRTICK_H */
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,15 +46,17 @@ enum syscall_work_bit {
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
 	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 	SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+	SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE,
 };
 
-#define SYSCALL_WORK_SECCOMP		BIT(SYSCALL_WORK_BIT_SECCOMP)
-#define SYSCALL_WORK_SYSCALL_TRACEPOINT	BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
-#define SYSCALL_WORK_SYSCALL_TRACE	BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
-#define SYSCALL_WORK_SYSCALL_EMU	BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
-#define SYSCALL_WORK_SYSCALL_AUDIT	BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
-#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
-#define SYSCALL_WORK_SYSCALL_EXIT_TRAP	BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SECCOMP			BIT(SYSCALL_WORK_BIT_SECCOMP)
+#define SYSCALL_WORK_SYSCALL_TRACEPOINT		BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
+#define SYSCALL_WORK_SYSCALL_TRACE		BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
+#define SYSCALL_WORK_SYSCALL_EMU		BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
+#define SYSCALL_WORK_SYSCALL_AUDIT		BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
+#define SYSCALL_WORK_SYSCALL_USER_DISPATCH	BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
+#define SYSCALL_WORK_SYSCALL_EXIT_TRAP		BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_RSEQ_SLICE		BIT(SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE)
 #endif
 
 #include <asm/thread_info.h>
--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -17,8 +17,7 @@ static inline void syscall_enter_audit(s
 	}
 }
 
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
-				unsigned long work)
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work)
 {
 	long ret = 0;
 
@@ -32,6 +31,16 @@ long syscall_trace_enter(struct pt_regs
 			return -1L;
 	}
 
+	/*
+	 * User space got a time slice extension granted and relinquishes
+	 * the CPU. Abort the syscall right away. If it's not sched_yield()
+	 * rseq_syscall_enter_work() sends a SIGSEGV.
+	 */
+	if (work & SYSCALL_WORK_SYSCALL_RSEQ_SLICE) {
+		if (rseq_syscall_enter_work(syscall))
+			return -1L;
+	}
+
 	/* Handle ptrace */
 	if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
 		ret = ptrace_report_syscall_entry(regs);
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -10,3 +10,4 @@ param_test_mm_cid
 param_test_mm_cid_benchmark
 param_test_mm_cid_compare_twice
 syscall_errors_test
+slice_test
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
 		param_test_benchmark param_test_compare_twice param_test_mm_cid \
 		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
-		syscall_errors_test
+		syscall_errors_test slice_test
 
 TEST_GEN_PROGS_EXTENDED = librseq.so
 
@@ -59,3 +59,6 @@ include ../lib.mk
 $(OUTPUT)/syscall_errors_test: syscall_errors_test.c $(TEST_GEN_PROGS_EXTENDED) \
 					rseq.h rseq-*.h
 	$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
+	$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
--- /dev/null
+++ b/tools/testing/selftests/rseq/slice_test.c
@@ -0,0 +1,205 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <linux/prctl.h>
+#include <sys/prctl.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+#include "../kselftest_harness.h"
+
+#define BITS_PER_INT	32
+#define BITS_PER_BYTE	8
+
+#ifndef PR_RSEQ_SLICE_EXTENSION
+# define PR_RSEQ_SLICE_EXTENSION		79
+#  define PR_RSEQ_SLICE_EXTENSION_GET		1
+#  define PR_RSEQ_SLICE_EXTENSION_SET		2
+#  define PR_RSEQ_SLICE_EXT_ENABLE		0x01
+#endif
+
+#ifndef RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT
+# define RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT	3
+# define RSEQ_CS_FLAG_SLICE_EXT_GRANTED_BIT	4
+#endif
+
+#ifndef asm_inline
+# define asm_inline	asm __inline
+#endif
+
+#if defined(__x86_64__) || defined(__i386__)
+
+static __always_inline bool local_test_and_clear_bit(unsigned int bit,
+						     volatile unsigned int *addr)
+{
+	bool res;
+
	asm_inline volatile("btrl %[__bit], %[__addr]\n"
			    : [__addr] "+m" (*addr), "=@ccc" (res)
+			    : [__bit] "Ir" (bit)
+			    : "memory");
+	return res;
+}
+
+static __always_inline void local_set_bit(unsigned int bit, volatile unsigned int *addr)
+{
+	volatile char *caddr = (void *)(addr) + (bit / BITS_PER_BYTE);
+
	asm_inline volatile("orb %b[__bit],%[__addr]\n"
+			    : [__addr] "+m" (*caddr)
+			    : [__bit] "iq" (1U << (bit & (BITS_PER_BYTE - 1)))
+			    : "memory");
+}
+
+static __always_inline bool local_test_bit(unsigned int bit, const volatile unsigned int *addr)
+{
+	return !!(addr[bit / BITS_PER_INT] & ((1U << (bit & (BITS_PER_INT - 1)))));
+}
+
+#else
+# error unsupported target
+#endif
+
+#define NSEC_PER_SEC	1000000000L
+#define NSEC_PER_USEC	      1000L
+
+struct noise_params {
+	int	noise_nsecs;
+	int	sleep_nsecs;
+	int	run;
+};
+
+FIXTURE(slice_ext)
+{
+	pthread_t		noise_thread;
+	struct noise_params	noise_params;
+};
+
+FIXTURE_VARIANT(slice_ext)
+{
+	int64_t	total_nsecs;
+	int	slice_nsecs;
+	int	noise_nsecs;
+	int	sleep_nsecs;
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n2_2_50)
+{
+	.total_nsecs	=  5 * NSEC_PER_SEC,
+	.slice_nsecs	=  2 * NSEC_PER_USEC,
+	.noise_nsecs    =  2 * NSEC_PER_USEC,
+	.sleep_nsecs	= 50 * NSEC_PER_USEC,
+};
+
+static inline bool elapsed(struct timespec *start, struct timespec *now,
+			   int64_t span)
+{
+	int64_t delta = now->tv_sec - start->tv_sec;
+
+	delta *= NSEC_PER_SEC;
+	delta += now->tv_nsec - start->tv_nsec;
+	return delta >= span;
+}
+
+static void *noise_thread(void *arg)
+{
+	struct noise_params *p = arg;
+
+	while (RSEQ_READ_ONCE(p->run)) {
+		struct timespec ts_start, ts_now;
+
+		clock_gettime(CLOCK_MONOTONIC, &ts_start);
+		do {
+			clock_gettime(CLOCK_MONOTONIC, &ts_now);
+		} while (!elapsed(&ts_start, &ts_now, p->noise_nsecs));
+
+		ts_start.tv_sec = 0;
+		ts_start.tv_nsec = p->sleep_nsecs;
+		clock_nanosleep(CLOCK_MONOTONIC, 0, &ts_start, NULL);
+	}
+	return NULL;
+}
+
+FIXTURE_SETUP(slice_ext)
+{
+	cpu_set_t affinity;
+
+	ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0);
+
+	/* Pin it on a single CPU. Avoid CPU 0 */
+	for (int i = 1; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &affinity))
+			continue;
+
+		CPU_ZERO(&affinity);
+		CPU_SET(i, &affinity);
+		ASSERT_EQ(sched_setaffinity(0, sizeof(affinity), &affinity), 0);
+		break;
+	}
+
+	ASSERT_EQ(rseq_register_current_thread(), 0);
+
+	ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+			PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0);
+
+	self->noise_params.noise_nsecs = variant->noise_nsecs;
+	self->noise_params.sleep_nsecs = variant->sleep_nsecs;
+	self->noise_params.run = 1;
+
+	ASSERT_EQ(pthread_create(&self->noise_thread, NULL, noise_thread, &self->noise_params), 0);
+}
+
+FIXTURE_TEARDOWN(slice_ext)
+{
+	self->noise_params.run = 0;
+	pthread_join(self->noise_thread, NULL);
+}
+
+TEST_F(slice_ext, slice_test)
+{
+	unsigned long success = 0, yielded = 0, raced = 0, scheduled = 0;
+	struct rseq_abi *rs = rseq_get_abi();
+	struct timespec ts_start, ts_now;
+
+	ASSERT_NE(rs, NULL);
+
+	clock_gettime(CLOCK_MONOTONIC, &ts_start);
+	do {
+		struct timespec ts_cs;
+
+		clock_gettime(CLOCK_MONOTONIC, &ts_cs);
+
		/* Request an extension, then spin for the slice duration */
		local_set_bit(RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, &rs->flags);
		do {
			clock_gettime(CLOCK_MONOTONIC, &ts_now);
		} while (!elapsed(&ts_cs, &ts_now, variant->slice_nsecs));
+
+		if (!local_test_and_clear_bit(RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, &rs->flags)) {
+			if (local_test_bit(RSEQ_CS_FLAG_SLICE_EXT_GRANTED_BIT, &rs->flags)) {
+				yielded++;
+				if (!sched_yield())
+					raced++;
+			} else {
+				scheduled++;
+			}
+		} else {
+			success++;
+		}
+
+		clock_gettime(CLOCK_MONOTONIC, &ts_now);
+	} while (!elapsed(&ts_start, &ts_now, variant->total_nsecs));
+
+	printf("# Success   %12ld\n", success);
+	printf("# Yielded   %12ld\n", yielded);
+	printf("# Raced     %12ld\n", raced);
+	printf("# Scheduled %12ld\n", scheduled);
+}
+
+TEST_HARNESS_MAIN






* Re: [PATCH V7 02/11] sched: Indicate if thread got rescheduled
  2025-08-11  9:45       ` Thomas Gleixner
@ 2025-08-13 16:19         ` bigeasy
  2025-08-13 16:56           ` Thomas Gleixner
  2025-08-14  7:18         ` Prakash Sangappa
  1 sibling, 1 reply; 38+ messages in thread
From: bigeasy @ 2025-08-13 16:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Prakash Sangappa, linux-kernel@vger.kernel.org,
	peterz@infradead.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

I spent some time on the review. I tried to test it, but for some reason
userland always segfaults. This is not caused by your changes, because
param_test (from tools/testing/selftests/rseq) also segfaults, also on
Debian with v6.12. So this must be something else, maybe glibc related.

On 2025-08-11 11:45:11 [+0200], Thomas Gleixner wrote:
…
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -4,6 +4,7 @@
>  
>  #ifdef CONFIG_RSEQ
>  
> +#include <linux/jump_label.h>
>  #include <linux/preempt.h>
>  #include <linux/sched.h>
>  
> @@ -61,6 +62,20 @@ static inline void rseq_migrate(struct t
>  	rseq_set_notify_resume(t);
>  }
>  
> +static __always_inline void rseq_slice_extension_timer(void);
> +
> +static __always_inline void rseq_exit_to_user_mode(void)

Is this __always_inline required?

> +{
> +	rseq_slice_extension_timer();
> +	/*
> +	 * Clear the event mask so it does not contain stale bits when
> +	 * coming back from user space.
> +	 */
> +	current->rseq_event_mask = 0;
> +}
> +
> +static inline void rseq_slice_fork(struct task_struct *t, bool inherit);
> +
>  /*
>   * If parent process has a registered restartable sequences area, the
>   * child inherits. Unregister rseq for a clone with CLONE_VM set.
> @@ -86,46 +103,127 @@ static inline void rseq_execve(struct ta
> -#else
> -
> -static inline void rseq_syscall(struct pt_regs *regs)
> +#else /* CONFIG_RSEQ_SLICE_EXTENSION */
> +static inline bool rseq_slice_extension_enabled(void) { return false; }
> +static inline bool rseq_slice_extension_resched(void) { return false; }
> +static inline bool rseq_syscall_enter_work(long syscall) { return false; }
> +static __always_inline void rseq_slice_extension_timer(void) { }

why is this one so special and needs __always_inline while the others
are fine with inline?

> +static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
>  {
> +	return -EINVAL;
>  }
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -571,3 +591,189 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
> +static bool __rseq_reset_slice_extension(struct task_struct *curr)
> +{
> +	u32 rflags;
> +
> +	if (get_user(rflags, &curr->rseq->flags))
> +		return false;
> +	return rseq_clear_slice_granted(curr, rflags);
> +}
> +
> +static inline bool rseq_reset_slice_extension(struct task_struct *curr)
> +{
> +	if (!rseq_slice_extension_enabled())
> +		return true;
> +
> +	if (likely(!(curr->rseq_slice_extension & RSEQ_SLICE_EXTENSION_GRANTED)))
> +		return true;

We shouldn't get preempted because this would require an interrupt. But
we could receive a signal which would bring us here, right?

If an extension was not granted but userland enabled it and set
RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, shouldn't we clear
RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT to indicate that we scheduled?
Or do we keep things as they are because the signal handler is subject to
the same kind of extensions? The signal handler has a list of functions
which are signal safe, and those might end up in a syscall.

> +	if (likely(!curr->rseq_event_mask))
> +		return true;

Why don't you need to clear SYSCALL_RSEQ_SLICE if !rseq_event_mask ?

> +
> +	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
> +	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
> +
> +	return __rseq_reset_slice_extension(curr);
> +}
> +
> +/*
> + * Invoked from syscall entry if a time slice extension was granted and the
> + * kernel did not clear it before user space left the critical section.
> + */
> +bool rseq_syscall_enter_work(long syscall)
> +{
> +	struct task_struct *curr = current;
> +	unsigned int slext = curr->rseq_slice_extension;
> +
> +	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
> +	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
> +
> +	/*
> +	 * Kernel internal state inconsistency. SYSCALL_RSEQ_SLICE can only
> +	 * be set when state is GRANTED!
> +	 */
> +	if (WARN_ON_ONCE(slext != RSEQ_SLICE_EXTENSION_GRANTED))
> +		return false;
> +
> +	set_tsk_need_resched(curr);
> +
> +	if (unlikely(!__rseq_reset_slice_extension(curr) || syscall != __NR_sched_yield))
> +		force_sigsegv(0);
> +
> +	/* Abort syscall to reschedule immediately */

If the syscall is the sched_yield() as expected then you still abort it.
You avoid the "scheduling" request from do_sched_yield() (and everything
else the syscall does) and perform your schedule request via the
NEED_RESCHED flag set above, in exit_to_user_mode_loop().
This explains why sched_yield(2) returns a return code != 0 even though the
man page and the kernel function always return 0. errno will be set in
userland and a syscall tracer will not see sched_yield in its trace.
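
For illustration, user space can observe that; a minimal sketch, assuming
the glibc wrapper which maps the aborted syscall to a -1 return with errno
set:

	/* After the critical section, with the GRANTED bit still set */
	if (sched_yield()) {
		/* != 0: the kernel consumed the grant and aborted the
		 * syscall to reschedule immediately */
	} else {
		/* 0: raced - the grant was revoked before kernel entry,
		 * so sched_yield() ran and returned normally */
	}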


> +	return true;
> +}
> +
> +bool __rseq_grant_slice_extension(unsigned int slext)
> +{
> +	struct task_struct *curr = current;
> +	u32 rflags;
> +
> +	if (unlikely(get_user(rflags, &curr->rseq->flags)))
> +		goto die;
> +
> +	/*
> +	 * Happens when exit_to_user_mode_loop() loops and has
> +	 * TIF_NEED_RESCHED* set again. Clear the grant and schedule.
> +	 */

Not only that. It also happens if userland does not finish its critical
section before the next scheduling request arrives.

> +	if (unlikely(slext == RSEQ_SLICE_EXTENSION_GRANTED)) {
> +		curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
> +		clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
> +		if (!rseq_clear_slice_granted(curr, rflags))
> +			goto die;
> +		return false;
> +	}
> +
> +	/* User space set the flag. That's a violation of the contract. */
> +	if (unlikely(rflags & RSEQ_CS_FLAG_SLICE_EXT_GRANTED))
> +		goto die;
> +
> +	/* User space is not interested. */
> +	if (likely(!(rflags & RSEQ_CS_FLAG_SLICE_EXT_REQUEST)))
> +		return false;
> +
> +	/*
> +	 * Don't bother if the rseq event mask has bits pending. The task
> +	 * was preempted.
> +	 */
> +	if (curr->rseq_event_mask)
> +		return false;
> +
> +	/* Grant the request and update user space */
> +	rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_REQUEST;
> +	rflags |= RSEQ_CS_FLAG_SLICE_EXT_GRANTED;
> +	if (unlikely(put_user(rflags, &curr->rseq->flags)))
> +		goto die;
> +
> +	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_GRANTED;
> +	set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
> +	clear_tsk_need_resched(curr);

If you keep doing this also for NEED_RESCHED then you should clear the
preemption counter via
	clear_preempt_need_resched();

otherwise you could stumble upon a spinlock_t on your way out and visit
the scheduler anyway.
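
Roughly, the grant path would then end like this (a sketch only; the
quoted hunk plus the suggested call):

	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_GRANTED;
	set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
	clear_tsk_need_resched(curr);
	/* Drop the hint in the preempt counter as well, otherwise the
	 * next spinlock_t release reschedules on the way out */
	clear_preempt_need_resched();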

> +	return true;
> +die:
> +	force_sig(SIGSEGV);
> +	return false;
> +}
> --- /dev/null
> +++ b/Documentation/userspace-api/rseq.rst
> @@ -0,0 +1,92 @@
> +If the request bit is still set when leaving the critical section, user
> +space can clear it and continue.
> +
> +If the granted bit is set, then user space has to invoke sched_yield() when
                                                            sched_yield(2)

> +leaving the critical section to relinquish the CPU. The kernel enforces
> +this by arming a timer to prevent misbehaving user space from abusing this
> +mechanism.
> +

Enforcing is one thing. The documentation should mention that you must
not invoke any syscalls other than sched_yield() after setting
RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT or you get the segfault thrown at
you.
Your testcase does clock_gettime(). This works as long as the syscall
can be handled via vDSO.
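
For illustration, a sketch assuming an x86-64 glibc where CLOCK_MONOTONIC
is served from the vDSO:

	struct timespec ts;

	/* vDSO fast path, no kernel entry, therefore harmless */
	clock_gettime(CLOCK_MONOTONIC, &ts);

	/* A real syscall here, e.g. write(1, "x", 1), would trip
	 * SYSCALL_WORK_SYSCALL_RSEQ_SLICE and end in SIGSEGV */

	sched_yield();	/* the only permitted kernel entry */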

…
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1883,6 +1883,18 @@ config RSEQ
>  
>  	  If unsure, say Y.
>  
> +config RSEQ_SLICE_EXTENSION
> +	bool "Enable rseq based time slice extension mechanism"
> +	depends on RSEQ && SCHED_HRTICK
> +	help
> +          Allows userspace to request a limited time slice extension when
an expanded tab

> +	  returning from an interrupt to user space via the RSEQ shared
> +	  data ABI. If granted, that allows to complete a critical section,
> +	  so that other threads are not stuck on a conflicted resource,
> +	  while the task is scheduled out.
> +
> +	  If unsure, say N.
> +
>  config DEBUG_RSEQ
>  	default n
>  	bool "Enable debugging of rseq() system call" if EXPERT
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -873,6 +874,10 @@ static enum hrtimer_restart hrtick(struc
>  
>  	WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
>  
> +	// CHECKME: Is this correct?
> +	if (rseq_slice_extension_resched())
> +		return HRTIMER_NORESTART;
> +

You shouldn't need to return HRTIMER_NORESTART early in hrtick().
If the extension is not yet granted then rseq_slice_extension_resched()
returns false and the task_tick() below does the usual thing, setting
RESCHED_LAZY. This will be cleared on return to userland when granting an
extension, arming the timer again.
If it fires a second time then let sched_class->task_tick() do the usual
thing and set RESCHED_LAZY. Given that we return from an IRQ,
exit_to_user_mode_loop() will clear the grant and go to schedule().

>  	rq_lock(rq, &rf);
>  	update_rq_clock(rq);
>  	rq->donor->sched_class->task_tick(rq, rq->curr, 1);
> @@ -902,6 +907,14 @@ static void __hrtick_start(void *arg)
>  	rq_unlock(rq, &rf);
>  }
>  
> +void hrtick_extend_timeslice(ktime_t nsecs)
> +{
> +	struct rq *rq = this_rq();
> +
> +	guard(rq_lock_irqsave)(rq);
> +	hrtimer_start(&rq->hrtick_timer, nsecs, HRTIMER_MODE_REL_PINNED_HARD);

You arm the timer after granting an extension. So the task ran for some
time, got a scheduling request, and now you extend it and keep the timer
to honour it. If the user does yield before the timer fires then schedule()
should clear the timer. I *think* you need to update __schedule() because
it has
|         if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
|                 hrtick_clear(rq);

and HRTICK is disabled by default
| grep -i hrtick --color /sys/kernel/debug/sched/features
| PLACE_LAG … NO_HRTICK NO_HRTICK_DL …

> +}
> +
>  /*
>   * Called to set the hrtick timer state.
>   *
> --- /dev/null
> +++ b/tools/testing/selftests/rseq/slice_test.c
> @@ -0,0 +1,205 @@
> +#if defined(__x86_64__) || defined(__i386__)
> +
> +static __always_inline bool local_test_and_clear_bit(unsigned int bit,
> +						     volatile unsigned int *addr)
> +{
> +	bool res;
> +
> +	asm_inline volatile("btrl %[__bit], %[__addr]\n"
> +			    : [__addr] "+m" (*addr), "=@ccc" (res)
> +			    : [__bit] "Ir" (bit)
> +			    : "memory");
> +	return res;
> +}
> +
> +static __always_inline void local_set_bit(unsigned int bit, volatile unsigned int *addr)
> +{
> +	volatile char *caddr = (void *)(addr) + (bit / BITS_PER_BYTE);
> +
> +	asm_inline volatile("orb %b[__bit],%[__addr]\n"
> +			    : [__addr] "+m" (*caddr)
> +			    : [__bit] "iq" (1U << (bit & (BITS_PER_BYTE - 1)))
> +			    : "memory");
> +}

gcc has __atomic_fetch_and() and __atomic_fetch_or() provided as
built-ins.
There is atomic_fetch_and_explicit() and atomic_fetch_or_explicit()
provided by <stdatomic.h>. Mostly the same magic.

If you use this like
|  static inline int test_and_clear_bit(unsigned long *ptr, unsigned int bit)
|  {
|          return __atomic_fetch_and(ptr, ~(1 << bit), __ATOMIC_RELAXED) & (1 << bit);
|  }

the gcc will emit btr. Sadly the lock prefix will be there, too. On the
plus side you would have logic for every architecture.
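
A generic fallback along those lines could look like this (a sketch;
relaxed ordering is sufficient as only CPU-local atomicity is required,
although x86 still pays for the LOCK prefix):

	static inline void local_set_bit(unsigned int bit, volatile unsigned int *addr)
	{
		__atomic_fetch_or(addr, 1U << bit, __ATOMIC_RELAXED);
	}

	static inline bool local_test_and_clear_bit(unsigned int bit,
						    volatile unsigned int *addr)
	{
		unsigned int mask = 1U << bit;

		return __atomic_fetch_and(addr, ~mask, __ATOMIC_RELAXED) & mask;
	}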

…

Sebastian


* Re: [PATCH V7 02/11] sched: Indicate if thread got rescheduled
  2025-08-13 16:19         ` bigeasy
@ 2025-08-13 16:56           ` Thomas Gleixner
  2025-08-18 13:16             ` bigeasy
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2025-08-13 16:56 UTC (permalink / raw)
  To: bigeasy@linutronix.de
  Cc: Prakash Sangappa, linux-kernel@vger.kernel.org,
	peterz@infradead.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

On Wed, Aug 13 2025 at 18:19, bigeasy@linutronix.de wrote:
> I spent some time on the review. I tried to test it, but for some reason
> userland always segfaults. This is not caused by your changes, because
> param_test (from tools/testing/selftests/rseq) also segfaults, also on
> Debian with v6.12. So this must be something else, maybe glibc related.

Hrm. I did not run the rseq tests. I only used the test I wrote, but
that works, and the underlying glibc uses rseq too, but I might have
screwed up there. As I said, it's a POC. I'm about to send out the polished
version, which survives the selftests nicely :)

> On 2025-08-11 11:45:11 [+0200], Thomas Gleixner wrote:
>> +static __always_inline void rseq_slice_extension_timer(void);
>> +
>> +static __always_inline void rseq_exit_to_user_mode(void)
>
> Is this __always_inline required?

To prevent stupid compilers from putting it out of line.

>> +static inline bool rseq_slice_extension_enabled(void) { return false; }
>> +static inline bool rseq_slice_extension_resched(void) { return false; }
>> +static inline bool rseq_syscall_enter_work(long syscall) { return false; }
>> +static __always_inline void rseq_slice_extension_timer(void) { }
>
> why is this one so special and needs __always_inline while the others
> are fine with inline?

Copy and pasta :)

>> +static inline bool rseq_reset_slice_extension(struct task_struct *curr)
>> +{
>> +	if (!rseq_slice_extension_enabled())
>> +		return true;
>> +
>> +	if (likely(!(curr->rseq_slice_extension & RSEQ_SLICE_EXTENSION_GRANTED)))
>> +		return true;
>
> We shouldn't get preempted because this would require an interrupt. But
> we could receive a signal which would bring us here, right?

A signal, or another round through the decision function when
NEED_RESCHED was raised again.

> If an extension was not granted but userland enabled it and set
> RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, shouldn't we clear
> RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT to indicate that we scheduled?

No. The user space flow is:

    set(REQUEST);
    critical_section();
    if (!test_and_clear_bit(REQUEST)) {
    	if (test_bit(GRANTED))
           sched_yield();
    }

This is not meant as a 'we scheduled' indicator, which is useless after
the fact. That's what critical sections are for.
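
Spelled out with the actual ABI names, a sketch built on the local_*
helpers from slice_test.c and librseq's rseq_get_abi():

	struct rseq_abi *rs = rseq_get_abi();

	local_set_bit(RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, &rs->flags);
	critical_section();
	if (!local_test_and_clear_bit(RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT, &rs->flags)) {
		if (local_test_bit(RSEQ_CS_FLAG_SLICE_EXT_GRANTED_BIT, &rs->flags))
			sched_yield();
	}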

> Or do we keep things as they are because the signal handler is subject to
> the same kind of extensions? The signal handler has a list of functions
> which are signal safe, and those might end up in a syscall.
>
>> +	if (likely(!curr->rseq_event_mask))
>> +		return true;
>
> Why don't you need to clear SYSCALL_RSEQ_SLICE if !rseq_event_mask ?

The problem is that rseq_handle_notify_resume() is invoked
unconditionally when TIF_NOTIFY_RESUME is set, which can be set by other
functionalities too. So when nothing happened (no migration, signal,
preemption) then there is no point in revoking it, no?

My rework addresses that:

  https://lore.kernel.org/lkml/20250813155941.014821755@linutronix.de/

That's just preparatory work for this time slice muck :)

>> +
>> +	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
>> +	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
>> +
>> +	return __rseq_reset_slice_extension(curr);
>> +}
>> +
>> +/*
>> + * Invoked from syscall entry if a time slice extension was granted and the
>> + * kernel did not clear it before user space left the critical section.
>> + */
>> +bool rseq_syscall_enter_work(long syscall)
>> +{
>> +	struct task_struct *curr = current;
>> +	unsigned int slext = curr->rseq_slice_extension;
>> +
>> +	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
>> +	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
>> +
>> +	/*
>> +	 * Kernel internal state inconsistency. SYSCALL_RSEQ_SLICE can only
>> +	 * be set when state is GRANTED!
>> +	 */
>> +	if (WARN_ON_ONCE(slext != RSEQ_SLICE_EXTENSION_GRANTED))
>> +		return false;
>> +
>> +	set_tsk_need_resched(curr);
>> +
>> +	if (unlikely(!__rseq_reset_slice_extension(curr) || syscall != __NR_sched_yield))
>> +		force_sigsegv(0);
>> +
>> +	/* Abort syscall to reschedule immediately */
>
> If the syscall is the sched_yield() as expected then you still abort it.
> You avoid the "scheduling" request from do_sched_yield() (and everything
> else the syscall does) and perform your schedule request via the
> NEED_RESCHED flag set above, in exit_to_user_mode_loop().
> This explains why sched_yield(2) returns a return code != 0 even though the
> man page and the kernel function always return 0. errno will be set in
> userland and a syscall tracer will not see sched_yield in its trace.

I took the liberty of optimizing it that way. It's also useful to see
whether this raced against the kernel:

    	if (test_bit(GRANTED))
-> Interrupt
		sched_yield();

that race can't be avoided. If the kernel wins, then sched_yield()
returns 0. We might change that later, but this makes a lot of sense
conceptually. Ideally we have a dedicated mechanism instead of relying
on sched_yield(), but that's bikeshed painting territory.

>> +	return true;
>> +}
>> +
>> +bool __rseq_grant_slice_extension(unsigned int slext)
>> +{
>> +	struct task_struct *curr = current;
>> +	u32 rflags;
>> +
>> +	if (unlikely(get_user(rflags, &curr->rseq->flags)))
>> +		goto die;
>> +
>> +	/*
>> +	 * Happens when exit_to_user_mode_loop() loops and has
>> +	 * TIF_NEED_RESCHED* set again. Clear the grant and schedule.
>> +	 */
>
> Not only that. Also if userland does not finish its critical section
> before a subsequent scheduling request happens.

Correct.

>> +	if (unlikely(slext == RSEQ_SLICE_EXTENSION_GRANTED)) {
>> +		curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
>> +		clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
>> +		if (!rseq_clear_slice_granted(curr, rflags))
>> +			goto die;
>> +		return false;
>> +	}
>> +
>> +	/* User space set the flag. That's a violation of the contract. */
>> +	if (unlikely(rflags & RSEQ_CS_FLAG_SLICE_EXT_GRANTED))
>> +		goto die;
>> +
>> +	/* User space is not interested. */
>> +	if (likely(!(rflags & RSEQ_CS_FLAG_SLICE_EXT_REQUEST)))
>> +		return false;
>> +
>> +	/*
>> +	 * Don't bother if the rseq event mask has bits pending. The task
>> +	 * was preempted.
>> +	 */
>> +	if (curr->rseq_event_mask)
>> +		return false;
>> +
>> +	/* Grant the request and update user space */
>> +	rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_REQUEST;
>> +	rflags |= RSEQ_CS_FLAG_SLICE_EXT_GRANTED;
>> +	if (unlikely(put_user(rflags, &curr->rseq->flags)))
>> +		goto die;
>> +
>> +	curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_GRANTED;
>> +	set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
>> +	clear_tsk_need_resched(curr);
>
> If you keep doing this also for NEED_RESCHED then you should clear the
> preemption counter via
> 	clear_preempt_need_resched();
>
> otherwise you could stumble upon a spinlock_t on your way out and visit
> the scheduler anyway.

Hmm. Good point.

> Enforcing is one thing. The documentation should mention that you must
> not invoke any syscalls other than sched_yield() after setting
> RSEQ_CS_FLAG_SLICE_EXT_REQUEST_BIT or you get the segfault thrown at
> you.
> Your testcase does clock_gettime(). This works as long as the syscall
> can be handled via vDSO.

Of course :)

>> +	// CHECKME: Is this correct?
>> +	if (rseq_slice_extension_resched())
>> +		return HRTIMER_NORESTART;
>> +
>
> You shouldn't need to return HRTIMER_NORESTART early in hrtick().
> If the extension is not yet granted then rseq_slice_extension_resched()
> returns false and the task_tick() below does the usual thing, setting
> RESCHED_LAZY. This will be cleared on return to userland when granting an
> extension, arming the timer again.
> If it fires a second time then let sched_class->task_tick() do the usual
> thing and set RESCHED_LAZY. Given that we return from an IRQ,
> exit_to_user_mode_loop() will clear the grant and go to schedule().

No. This is all wrong and I implemented a dedicated timer for this, as
abusing HRTICK is daft and, depending on the scheduler state (HRTICK can
be disabled), might cause hard-to-diagnose subtle surprises.

>> +void hrtick_extend_timeslice(ktime_t nsecs)
>> +{
>> +	struct rq *rq = this_rq();
>> +
>> +	guard(rq_lock_irqsave)(rq);
>> +	hrtimer_start(&rq->hrtick_timer, nsecs, HRTIMER_MODE_REL_PINNED_HARD);
>
> You arm the timer after granting an extension. So the task ran for some
> time, got a scheduling request, and now you extend it and keep the timer
> to honour it. If the user does yield before the timer fires then schedule()
> should clear the timer. I *think* you need to update __schedule() because
> it has
> |         if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
> |                 hrtick_clear(rq);
>
> and HRTICK is disabled by default
> | grep -i hrtick --color /sys/kernel/debug/sched/features
> | PLACE_LAG … NO_HRTICK NO_HRTICK_DL …

See new code :)
> gcc has __atomic_fetch_and() and __atomic_fetch_or() provided as
> built-ins.
> There is atomic_fetch_and_explicit() and atomic_fetch_or_explicit()
> provided by <stdatomic.h>. Mostly the same magic.
>
> If you use this like
> |  static inline int test_and_clear_bit(unsigned long *ptr, unsigned int bit)
> |  {
> |          return __atomic_fetch_and(ptr, ~(1 << bit), __ATOMIC_RELAXED) & (1 << bit);
> |  }
>
> the gcc will emit btr. Sadly the lock prefix will be there, too. On the
> plus side you would have logic for every architecture.

I know, but the whole point is to avoid the LOCK prefix because it's not
necessary in this context and slows things down. The only requirement is
CPU local atomicity vs. an interrupt/exception/NMI or whatever the CPU
uses to mess things up. You need LOCK if you have cross CPU concurrency,
which is not the case here. The LOCK overhead is very measurable when you
use this pattern at high frequency, and that's exactly what the people who
long for this feature do :)
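
The difference shows up directly in the generated code. Assumed compiler
output for the REQUEST bit (mask 0x8), check the disassembly:

	lock orl $0x8, (%rdi)	# __atomic_fetch_or(): bus-locked RMW
	orl $0x8, (%rdi)	# CPU-local variant: plain RMW, atomic
				# only versus this CPU's own interrupts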

Thanks,

        tglx







* Re: [PATCH V7 02/11] sched: Indicate if thread got rescheduled
  2025-08-11  9:45       ` Thomas Gleixner
  2025-08-13 16:19         ` bigeasy
@ 2025-08-14  7:18         ` Prakash Sangappa
  2025-08-14 18:20           ` Thomas Gleixner
  1 sibling, 1 reply; 38+ messages in thread
From: Prakash Sangappa @ 2025-08-14  7:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	bigeasy@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Aug 11, 2025, at 2:45 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Thu, Aug 07 2025 at 16:15, Prakash Sangappa wrote:
>>> On Aug 7, 2025, at 6:06 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> rseq_preempt() sets RSEQ_EVENT_PREEMPT_BIT and TIF_NOTIFY_RESUME, which
>>> causes exit to userspace to invoke __rseq_handle_notify_resume(). That's
>>> the obvious place to handle this instead of inflicting it into the
>>> scheduler hotpath.
>>> 
>>> No?
>> 
>> Sure, I will look at moving rseq_delay_resched_clear() call to __rseq_handle_notify_resume().
> 
> I looked deeper into it and it does not completely solve the problem.

Thanks for taking a deeper look.

> 
> +bool rseq_syscall_enter_work(long syscall)
> +{
> + struct task_struct *curr = current;
> + unsigned int slext = curr->rseq_slice_extension;
> +
> + clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
> + curr->rseq_slice_extension = RSEQ_SLICE_EXTENSION_ENABLED;
> +
> + /*
> + * Kernel internal state inconsistency. SYSCALL_RSEQ_SLICE can only
> + * be set when state is GRANTED!
> + */
> + if (WARN_ON_ONCE(slext != RSEQ_SLICE_EXTENSION_GRANTED))
> + return false;
> +
> + set_tsk_need_resched(curr);
> +
> + if (unlikely(!__rseq_reset_slice_extension(curr) || syscall != __NR_sched_yield))
> + force_sigsegv(0);
> +
> + /* Abort syscall to reschedule immediately */
> + return true;
> +}

Is it ok to fail the sched_yield(2) syscall? The man page says
sched_yield(2) always succeeds (returns 0).

Also, is it necessary to force kill the process here with SIGSEGV, if some other 
system call was made?

Ideally it would be expected that the process does not make any system
call other than sched_yield(2), to relinquish the cpu, while it is in the
critical section using a time slice extension. However, an application
process could have a signal handler that gets invoked while in the
critical section, which can potentially make some system call that is not
sched_yield(2).

Thanks,
-Prakash





* Re: [PATCH V7 02/11] sched: Indicate if thread got rescheduled
  2025-08-14  7:18         ` Prakash Sangappa
@ 2025-08-14 18:20           ` Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: Thomas Gleixner @ 2025-08-14 18:20 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	bigeasy@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

On Thu, Aug 14 2025 at 07:18, Prakash Sangappa wrote:
>> On Aug 11, 2025, at 2:45 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> Is it ok to fail the sched_yield(2) syscall? The man page says
> sched_yield(2) always succeeds (returns 0).

I used it because it's simple. In practice we need a new syscall w/o
side effects.
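
Purely hypothetical, name and shape made up for illustration:

	/* Relinquish the CPU after a granted slice extension. The syscall
	 * entry work has already revoked the grant and set NEED_RESCHED,
	 * so there is nothing left to do here, and unlike sched_yield()
	 * there are no other side effects. */
	SYSCALL_DEFINE0(rseq_slice_yield)
	{
		return 0;
	}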

> Also, is it necessary to force kill the process here with SIGSEGV, if
> some other system call was made?

Yes, because we do not trust user space and any violation of the
contract has consequences. Any kernel facility which interacts in such a
way with user space has to be defensive by design. Assuming that user
space is neither stupid nor malicious is naive at best and has been a
source of big gaping holes forever.

> Ideally it would be expected that the process does not make any system
> call other than sched_yield(2), to relinquish the cpu, while it is in the
> critical section using a time slice extension. However, an application
> process could have a signal handler that gets invoked while in the
> critical section, which can potentially make some system call that is not
> sched_yield(2).

The timeslice extension is canceled when a signal is pending, so nothing
bad happens. The kernel has already revoked it, similar to how rseq aborts
the critical section on signal delivery.

If it doesn't work with the POC, that may well be. With the stuff I'm
polishing now it works, because I tested it :)

Thanks

        tglx


* Re: [PATCH V7 02/11] sched: Indicate if thread got rescheduled
  2025-08-13 16:56           ` Thomas Gleixner
@ 2025-08-18 13:16             ` bigeasy
  2025-08-19  8:12               ` Thomas Gleixner
  0 siblings, 1 reply; 38+ messages in thread
From: bigeasy @ 2025-08-18 13:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Prakash Sangappa, linux-kernel@vger.kernel.org,
	peterz@infradead.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

On 2025-08-13 18:56:16 [+0200], Thomas Gleixner wrote:
> On Wed, Aug 13 2025 at 18:19, bigeasy@linutronix.de wrote:
> > I spent some time on the review. I tried to test it, but for some reason
> > userland always segfaults. This is not caused by your changes, because
> > param_test (from tools/testing/selftests/rseq) also segfaults, also on
> > Debian with v6.12. So this must be something else, maybe glibc related.
> 
> Hrm. I did not run the rseq tests. I only used the test I wrote, but
> that works, and the underlying glibc uses rseq too, but I might have
> screwed up there. As I said, it's a POC. I'm about to send out the polished
> version, which survives the selftests nicely :)

It was not your code. Everything exploded here. Am I right to assume that
you were testing in a recent/current Debian Trixie environment? My guess is
that glibc or gcc got out of sync.

> > gcc has __atomic_fetch_and() and __atomic_fetch_or() provided as
> > built-ins.
> > There is atomic_fetch_and_explicit() and atomic_fetch_or_explicit()
> > provided by <stdatomic.h>. Mostly the same magic.
> >
> > If you use this like
> > |  static inline int test_and_clear_bit(unsigned long *ptr, unsigned int bit)
> > |  {
> > |          return __atomic_fetch_and(ptr, ~(1 << bit), __ATOMIC_RELAXED) & (1 << bit);
> > |  }
> >
> > the gcc will emit btr. Sadly the lock prefix will be there, too. On the
> > plus side you would have logic for every architecture.
> 
> I know, but the whole point is to avoid the LOCK prefix because it's not
> necessary in this context and slows things down. The only requirement is
> CPU local atomicity vs. an interrupt/exception/NMI or whatever the CPU
> uses to mess things up. You need LOCK if you have cross CPU concurrency,
> which is not the case here. The LOCK is very measurable when you use
> this pattern with a high frequency and that's what the people who long
> for this do :)

Sure. You can keep it on x86 and use the generic one in the else case
rather than aborting with an error.
Looking at arch___test_and_clear_bit() in the kernel, there is x86 with
its custom implementation. s390 points to generic___test_and_clear_bit(),
which is a surprise. alpha's and sh's aren't atomic, so those do not look
right. hexagon and m68k might be okay and are candidates.

> Thanks,
> 
>         tglx

Sebastian


* Re: [PATCH V7 02/11] sched: Indicate if thread got rescheduled
  2025-08-18 13:16             ` bigeasy
@ 2025-08-19  8:12               ` Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: Thomas Gleixner @ 2025-08-19  8:12 UTC (permalink / raw)
  To: bigeasy@linutronix.de
  Cc: Prakash Sangappa, linux-kernel@vger.kernel.org,
	peterz@infradead.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

On Mon, Aug 18 2025 at 15:16, bigeasy@linutronix.de wrote:
> On 2025-08-13 18:56:16 [+0200], Thomas Gleixner wrote:
>> On Wed, Aug 13 2025 at 18:19, bigeasy@linutronix.de wrote:
>> > I spent some time on the review. I tried to test it, but for some reason
>> > userland always segfaults. This is not caused by your changes, because
>> > param_test (from tools/testing/selftests/rseq) also segfaults, also on
>> > Debian with v6.12. So this must be something else, maybe glibc related.
>> 
>> Hrm. I did not run the rseq tests. I only used the test I wrote, but
>> that works, and the underlying glibc uses rseq too, but I might have
>> screwed up there. As I said, it's a POC. I'm about to send out the polished
>> version, which survives the selftests nicely :)
>
> It was not your code. Everything exploded here. Am I right to assume that
> you were testing in a recent/current Debian Trixie environment? My guess is
> that glibc or gcc got out of sync.

  https://lore.kernel.org/lkml/aKODByTQMYFs3WVN@google.com

:)

>> > gcc has __atomic_fetch_and() and __atomic_fetch_or() provided as
>> > built-ins.
>> > There is atomic_fetch_and_explicit() and atomic_fetch_or_explicit()
>> > provided by <stdatomic.h>. Mostly the same magic.
>> >
>> > If you use this like
>> > |  static inline int test_and_clear_bit(unsigned long *ptr, unsigned int bit)
>> > |  {
>> > |          return __atomic_fetch_and(ptr, ~(1 << bit), __ATOMIC_RELAXED) & (1 << bit);
>> > |  }
>> >
>> > the gcc will emit btr. Sadly the lock prefix will be there, too. On the
>> > plus side you would have logic for every architecture.
>> 
>> I know, but the whole point is to avoid the LOCK prefix because it's not
>> necessary in this context and slows things down. The only requirement is
>> CPU local atomicity vs. an interrupt/exception/NMI or whatever the CPU
>> uses to mess things up. You need LOCK if you have cross CPU concurrency,
>> which is not the case here. The LOCK is very measurable when you use
>> this pattern with a high frequency and that's what the people who long
>> for this do :)
>
> Sure. You can keep it on x86 and use the generic one in the else case
> rather than aborting with an error.
> Looking at arch___test_and_clear_bit() in the kernel, there is x86 with
> its custom implementation. s390 points to generic___test_and_clear_bit(),
> which is a surprise. alpha's and sh's aren't atomic, so those do not look
> right. hexagon and m68k might be okay and are candidates.

Right, I'll look into that after I've sorted out the underlying rseq
mess. See the context of the link above. Once that is solved, the
integration of this timeslice muck becomes way simpler (famous last words).

Thanks,

        tglx




end of thread, other threads:[~2025-08-19  8:12 UTC | newest]

Thread overview: 38+ messages
2025-07-24 16:16 [PATCH V7 00/11] Scheduler time slice extension Prakash Sangappa
2025-07-24 16:16 ` [PATCH V7 01/11] sched: " Prakash Sangappa
2025-08-06 20:34   ` Thomas Gleixner
2025-08-07 14:07     ` Thomas Gleixner
2025-08-07 16:45       ` Prakash Sangappa
2025-08-07 15:49     ` Sebastian Andrzej Siewior
2025-08-07 16:56       ` Prakash Sangappa
2025-08-08  9:59         ` Sebastian Andrzej Siewior
2025-08-08 17:00           ` Prakash Sangappa
2025-08-11  6:28             ` Sebastian Andrzej Siewior
2025-08-07 16:13     ` Prakash Sangappa
2025-07-24 16:16 ` [PATCH V7 02/11] sched: Indicate if thread got rescheduled Prakash Sangappa
2025-08-07 13:06   ` Thomas Gleixner
2025-08-07 16:15     ` Prakash Sangappa
2025-08-11  9:45       ` Thomas Gleixner
2025-08-13 16:19         ` bigeasy
2025-08-13 16:56           ` Thomas Gleixner
2025-08-18 13:16             ` bigeasy
2025-08-19  8:12               ` Thomas Gleixner
2025-08-14  7:18         ` Prakash Sangappa
2025-08-14 18:20           ` Thomas Gleixner
2025-07-24 16:16 ` [PATCH V7 03/11] sched: Tunable to specify duration of time slice extension Prakash Sangappa
2025-07-24 16:16 ` [PATCH V7 04/11] sched: Add scheduler stat for cpu " Prakash Sangappa
2025-07-24 16:16 ` [PATCH V7 05/11] sched: Add tracepoint for sched " Prakash Sangappa
2025-07-24 16:16 ` [PATCH V7 06/11] Add API to query supported rseq cs flags Prakash Sangappa
2025-07-24 16:16 ` [PATCH V7 07/11] sched: Add API to indicate not to delay scheduling Prakash Sangappa
2025-07-25 14:30   ` kernel test robot
2025-07-24 16:16 ` [PATCH V7 08/11] sched: Add TIF_NEED_RESCHED_NODELAY infrastructure Prakash Sangappa
2025-07-24 16:16 ` [PATCH V7 09/11] sched: Add nodelay scheduling Prakash Sangappa
2025-08-08 13:26   ` Thomas Gleixner
2025-08-08 16:54     ` Prakash Sangappa
2025-07-24 16:16 ` [PATCH V7 10/11] sched, x86: Enable " Prakash Sangappa
2025-07-24 16:16 ` [PATCH V7 11/11] sched: Add kernel parameter to enable delaying RT threads Prakash Sangappa
2025-07-25 15:52   ` kernel test robot
2025-08-06 16:03 ` [PATCH V7 00/11] Scheduler time slice extension Prakash Sangappa
2025-08-06 16:24   ` Thomas Gleixner
2025-08-06 16:30 ` Thomas Gleixner
2025-08-07  6:52   ` Prakash Sangappa
