[PATCH V6 0/7] Scheduler time slice extension

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH V6 0/7] Scheduler time slice extension
@ 2025-07-01  0:37 Prakash Sangappa
  2025-07-01  0:37 ` [PATCH V6 1/7] Sched: " Prakash Sangappa
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Prakash Sangappa @ 2025-07-01  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

Rebased to v6.16-rc3.

A user thread can get preempted in the middle of executing a critical
section in user space while holding locks, which can have undesirable affect
on performance. Having a way for the thread to request additional execution
time on cpu, so that it can complete the critical section will be useful in
such scenario. The request can be made by setting a bit in mapped memory,
such that the kernel can also access to check and grant extra execution time
on the cpu. 

There have been couple of proposals[1][2] for such a feature, which attempt
to address the above scenario by granting one extra tick of execution time.
In patch thread [1] posted by Steven Rostedt, there is ample discussion about
need for this feature.

However, the concern has been that this can lead to abuse. One extra tick can
be a long time(about a millisec or more). Peter Zijlstra in response posted a 
prototype solution[5], which grants 50us execution time extension only.
This is achieved with the help of a timer started on that cpu at the time of
granting extra execution time. When the timer fires the thread will be
preempted, if still running. 

This patchset implements above solution as suggested, with use of restartable
sequences(rseq) structure for API. Refer [3][4] for further discussions.

v6:
- Rebased onto v6.16-rc3. 
  syscall_exit_to_user_mode_prepare() & __syscall_exit_to_user_mode_work()
  routines have been deleted. Moved changes to the consolidated routine
  syscall_exit_to_user_mode_work()(patch 1).
- Introduced a new config option for scheduler time slice extension
  CONFIG_SCHED_PREEMPT_DELAY which is dependent on CONFIG_RSEQ.
  Enabled by default(new patch 7). Is this reasonable?
- Modified tracepoint to a conditional tracepoint(patch 5), as suggested
  by Steven Rostedt.
- Added kernel parameters documentation for the tunable
  'sysctl_sched_preempt_delay_us'(patch 3)

v5:
https://lore.kernel.org/all/20250603233654.1838967-1-prakash.sangappa@oracle.com/
- Added #ifdef CONFIG_RSEQ and CONFIG_PROC_SYSCTL for sysctl tunable
  changes(patch 3).
- Added #ifdef CONFIG_RSEQ for schedular stat changes(patch 4).
- Removed deprecated flags from the supported flags returned, as
  pointed out by Mathieu Desnoyers(patch 6).
- Added IF_ENABLED(CONFIG_SCHED_HRTICK) check before returning supported
  delay resched flags.

v4:
https://lore.kernel.org/all/20250513214554.4160454-1-prakash.sangappa@oracle.com
- Changed default sched delay extension time to 30us
- Added patch to indicate to userspace if the thread got preempted in
  the extended cpu time granted. Uses another bit in rseq cs flags for it.
  This should help the application to check and avoid having to call a
  system call to yield cpu, especially sched_yield() as pointed out
  by Steven Rostedt.
- Moved tracepoint call towards end of exit_to_user_mode_loop().
- Added a pr_warn() message when the 'sched_preempt_delay_us' tunable is
  set higher then the default value of 30us.
- Patch to add an API to query if sched time extension feature is supported. 
  A new flag to sys_rseq flags argument called 'RSEQ_FLAG_QUERY_CS_FLAGS',
  is added, as suggested by Mathieu Desnoyers. 
  Returns bitmask of all the supported rseq cs flags, in rseq->flags field.

v3:
https://lore.kernel.org/all/20250502015955.3146733-1-prakash.sangappa@oracle.com
- Addressing review comments by Sebastian and Prateek.
  * Rename rseq_sched_delay -> sched_time_delay. Move its place in
    struct task_struct near other bits so it fits in existing word.
  * Use IS_ENABLED(CONFIG_RSEQ) instead of #ifdef to access
    'sched_time_delay'.
  * removed rseq_delay_resched_tick() call from hrtick_clear().
  * Introduced a patch to add a tracepoint in exit_to_user_mode_loop(),
    suggested by Sebastian.
  * Added comments to describe RSEQ_CS_FLAG_DELAY_RESCHED flag.

v2:
https://lore.kernel.org/all/20250418193410.2010058-1-prakash.sangappa@oracle.com/
- Based on discussions in [3], expecting user application to call sched_yield()
  to yield the cpu at the end of the critical section may not be advisable as
  pointed out by Linus.  

  So added a check in return path from a system call to reschedule if time
  slice extension was granted to the thread. The check could as well be in
  syscall enter path from user mode.
  This would allow application thread to call any system call to yield the cpu. 
  Which system call should be suggested? getppid(2) works.

  Do we still need the change in sched_yield() to reschedule when the thread
  has current->rseq_sched_delay set?

- Added patch to introduce a sysctl tunable parameter to specify duration of
  the time slice extension in micro seconds(us), called 'sched_preempt_delay_us'.
  Can take a value in the range 0 to 100. Default is set to 50us.
  Setting this tunable to 0 disables the scheduler time slice extension feature.

v1: 
https://lore.kernel.org/all/20250215005414.224409-1-prakash.sangappa@oracle.com/


[1] https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
[2] https://lore.kernel.org/lkml/1395767870-28053-1-git-send-email-khalid.aziz@oracle.com/
[3] https://lore.kernel.org/all/20250131225837.972218232@goodmis.org/
[4] https://lore.kernel.org/all/20241113000126.967713-1-prakash.sangappa@oracle.com/
[5] https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net/
[6] https://lore.kernel.org/all/1631147036-13597-1-git-send-email-prakash.sangappa@oracle.com/


Prakash Sangappa (7):
  Sched: Scheduler time slice extension
  Sched: Indicate if thread got rescheduled
  Sched: Tunable to specify duration of time slice extension
  Sched: Add scheduler stat for cpu time slice extension
  Sched: Add tracepoint for sched time slice extension
  Add API to query supported rseq cs flags
  Introduce a config option for scheduler time slice extension feature

 .../admin-guide/kernel-parameters.txt         |  8 ++
 include/linux/entry-common.h                  | 17 +++-
 include/linux/sched.h                         | 30 ++++++
 include/trace/events/sched.h                  | 31 ++++++
 include/uapi/linux/rseq.h                     | 19 ++++
 init/Kconfig                                  |  7 ++
 kernel/entry/common.c                         | 19 +++-
 kernel/rseq.c                                 | 97 +++++++++++++++++++
 kernel/sched/core.c                           | 60 ++++++++++++
 kernel/sched/debug.c                          |  4 +
 kernel/sched/syscalls.c                       |  6 ++
 11 files changed, 290 insertions(+), 8 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH V6 1/7] Sched: Scheduler time slice extension
  2025-07-01  0:37 [PATCH V6 0/7] Scheduler time slice extension Prakash Sangappa
@ 2025-07-01  0:37 ` Prakash Sangappa
  2025-07-01  8:42   ` Thomas Gleixner
  2025-07-01  0:37 ` [PATCH V6 2/7] Sched: Indicate if thread got rescheduled Prakash Sangappa
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Prakash Sangappa @ 2025-07-01  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

Add support for a thread to request extending its execution time slice on
the cpu. The extra cpu time granted would help in allowing the thread to
complete executing the critical section and drop any locks without getting
preempted. The thread would request this cpu time extension, by setting a
bit in the restartable sequences(rseq) structure registered with the kernel.

Kernel will grant a 30us extension on the cpu, when it sees the bit set.
With the help of a timer, kernel force preempts the thread if it is still
running on the cpu when the 30us timer expires. The thread should yield
the cpu by making a system call after completing the critical section.

Suggested-by: Peter Ziljstra <peterz@infradead.org>
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/entry-common.h | 17 ++++++++---
 include/linux/sched.h        | 16 +++++++++++
 include/uapi/linux/rseq.h    |  7 +++++
 kernel/entry/common.c        | 13 ++++++---
 kernel/rseq.c                | 56 ++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          | 14 +++++++++
 kernel/sched/syscalls.c      |  5 ++++
 7 files changed, 120 insertions(+), 8 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f94f3fdf15fc..d4fa952e394e 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -304,7 +304,8 @@ void arch_do_signal_or_restart(struct pt_regs *regs);
  * exit_to_user_mode_loop - do any pending work before leaving to user space
  */
 unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-				     unsigned long ti_work);
+				     unsigned long ti_work,
+				     bool irq);
 
 /**
  * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
@@ -316,7 +317,8 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
  *    EXIT_TO_USER_MODE_WORK are set
  * 4) check that interrupts are still disabled
  */
-static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs,
+						bool irq)
 {
 	unsigned long ti_work;
 
@@ -327,7 +329,10 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
 
 	ti_work = read_thread_flags();
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
-		ti_work = exit_to_user_mode_loop(regs, ti_work);
+		ti_work = exit_to_user_mode_loop(regs, ti_work, irq);
+
+	if (irq)
+		rseq_delay_resched_fini();
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
@@ -396,6 +401,10 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
 
 	CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
 
+	/* reschedule if sched delay was granted */
+	if (IS_ENABLED(CONFIG_RSEQ) && current->sched_time_delay)
+		set_tsk_need_resched(current);
+
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
 		if (WARN(irqs_disabled(), "syscall %lu left IRQs disabled", nr))
 			local_irq_enable();
@@ -411,7 +420,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
 	if (unlikely(work & SYSCALL_WORK_EXIT))
 		syscall_exit_work(regs, work);
 	local_irq_disable_exit_to_user();
-	exit_to_user_mode_prepare(regs);
+	exit_to_user_mode_prepare(regs, false);
 }
 
 /**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5bcf44ae6c79..9b4670d85131 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -338,6 +338,7 @@ extern int __must_check io_schedule_prepare(void);
 extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
+extern void hrtick_local_start(u64 delay);
 
 /* wrapper function to trace from this header file */
 DECLARE_TRACEPOINT(sched_set_state_tp);
@@ -1263,6 +1264,7 @@ struct task_struct {
 	int				softirq_context;
 	int				irq_config;
 #endif
+	unsigned			sched_time_delay:1;
 #ifdef CONFIG_PREEMPT_RT
 	int				softirq_disable_cnt;
 #endif
@@ -2245,6 +2247,20 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 unsigned long sched_cpu_util(int cpu);
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_RSEQ
+
+extern bool rseq_delay_resched(void);
+extern void rseq_delay_resched_fini(void);
+extern void rseq_delay_resched_tick(void);
+
+#else
+
+static inline bool rseq_delay_resched(void) { return false; }
+static inline void rseq_delay_resched_fini(void) { }
+static inline void rseq_delay_resched_tick(void) { }
+
+#endif
+
 #ifdef CONFIG_SCHED_CORE
 extern void sched_core_free(struct task_struct *tsk);
 extern void sched_core_fork(struct task_struct *p);
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac9..25fc636b17d5 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -26,6 +26,7 @@ enum rseq_cs_flags_bit {
 	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
+	RSEQ_CS_FLAG_DELAY_RESCHED_BIT		= 3,
 };
 
 enum rseq_cs_flags {
@@ -35,6 +36,8 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	=
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+	RSEQ_CS_FLAG_DELAY_RESCHED		=
+		(1U << RSEQ_CS_FLAG_DELAY_RESCHED_BIT),
 };
 
 /*
@@ -128,6 +131,10 @@ struct rseq {
 	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
 	 *     Inhibit instruction sequence block restart on migration for
 	 *     this thread.
+	 * - RSEQ_CS_FLAG_DELAY_RESCHED
+	 *     Request by user thread to delay preemption. With use
+	 *     of a timer, kernel grants extra cpu time upto 30us for this
+	 *     thread before being rescheduled.
 	 */
 	__u32 flags;
 
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index a8dd1f27417c..8769c3592e26 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -88,7 +88,8 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
  * @ti_work:	TIF work flags as read by the caller
  */
 __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-						     unsigned long ti_work)
+						     unsigned long ti_work,
+						     bool irq)
 {
 	/*
 	 * Before returning to user space ensure that all pending work
@@ -98,8 +99,12 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
-			schedule();
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+		       if (irq && rseq_delay_resched())
+			       clear_tsk_need_resched(current);
+		       else
+			       schedule();
+		}
 
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
@@ -181,7 +186,7 @@ noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
 noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 {
 	instrumentation_begin();
-	exit_to_user_mode_prepare(regs);
+	exit_to_user_mode_prepare(regs, true);
 	instrumentation_end();
 	exit_to_user_mode();
 }
diff --git a/kernel/rseq.c b/kernel/rseq.c
index b7a1ec327e81..dba44ca9f624 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -448,6 +448,62 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	force_sigsegv(sig);
 }
 
+bool rseq_delay_resched(void)
+{
+	struct task_struct *t = current;
+	u32 flags;
+
+	if (!IS_ENABLED(CONFIG_SCHED_HRTICK))
+		return false;
+
+	if (!t->rseq)
+		return false;
+
+	if (t->sched_time_delay)
+		return false;
+
+	if (copy_from_user_nofault(&flags, &t->rseq->flags, sizeof(flags)))
+		return false;
+
+	if (!(flags & RSEQ_CS_FLAG_DELAY_RESCHED))
+		return false;
+
+	flags &= ~RSEQ_CS_FLAG_DELAY_RESCHED;
+	if (copy_to_user_nofault(&t->rseq->flags, &flags, sizeof(flags)))
+		return false;
+
+	t->sched_time_delay = 1;
+
+	return true;
+}
+
+void rseq_delay_resched_fini(void)
+{
+#ifdef CONFIG_SCHED_HRTICK
+	extern void hrtick_local_start(u64 delay);
+	struct task_struct *t = current;
+	/*
+	 * IRQs off, guaranteed to return to userspace, start timer on this CPU
+	 * to limit the resched-overdraft.
+	 *
+	 * If your critical section is longer than 30 us you get to keep the
+	 * pieces.
+	 */
+	if (t->sched_time_delay)
+		hrtick_local_start(30 * NSEC_PER_USEC);
+#endif
+}
+
+void rseq_delay_resched_tick(void)
+{
+#ifdef CONFIG_SCHED_HRTICK
+	struct task_struct *t = current;
+
+	if (t->sched_time_delay)
+		set_tsk_need_resched(t);
+#endif
+}
+
 #ifdef CONFIG_DEBUG_RSEQ
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ad7cf3cfdca..c1b64879115f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -845,6 +845,8 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
 
 	WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
 
+	rseq_delay_resched_tick();
+
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
 	rq->donor->sched_class->task_tick(rq, rq->curr, 1);
@@ -918,6 +920,16 @@ void hrtick_start(struct rq *rq, u64 delay)
 
 #endif /* CONFIG_SMP */
 
+void hrtick_local_start(u64 delay)
+{
+	struct rq *rq = this_rq();
+	struct rq_flags rf;
+
+	rq_lock(rq, &rf);
+	hrtick_start(rq, delay);
+	rq_unlock(rq, &rf);
+}
+
 static void hrtick_rq_init(struct rq *rq)
 {
 #ifdef CONFIG_SMP
@@ -6740,6 +6752,8 @@ static void __sched notrace __schedule(int sched_mode)
 picked:
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
+	if (IS_ENABLED(CONFIG_RSEQ))
+		prev->sched_time_delay = 0;
 	rq->last_seen_need_resched_ns = 0;
 
 	is_switch = prev != next;
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index ee5641757838..d9a4e3a2e064 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -1379,6 +1379,11 @@ static void do_sched_yield(void)
  */
 SYSCALL_DEFINE0(sched_yield)
 {
+	if (IS_ENABLED(CONFIG_RSEQ) && current->sched_time_delay) {
+		schedule();
+		return 0;
+	}
+
 	do_sched_yield();
 	return 0;
 }
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 2/7] Sched: Indicate if thread got rescheduled
  2025-07-01  0:37 [PATCH V6 0/7] Scheduler time slice extension Prakash Sangappa
  2025-07-01  0:37 ` [PATCH V6 1/7] Sched: " Prakash Sangappa
@ 2025-07-01  0:37 ` Prakash Sangappa
  2025-07-01  0:37 ` [PATCH V6 3/7] Sched: Tunable to specify duration of time slice extension Prakash Sangappa
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Prakash Sangappa @ 2025-07-01  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

Use a bit in rseq flags to indicate if the thread got rescheduled
after the cpu time extension was graned. The user thread can check this
flag before calling sched_yield() to yield the cpu.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/sched.h     |  2 ++
 include/uapi/linux/rseq.h | 10 ++++++++++
 kernel/rseq.c             | 19 +++++++++++++++++++
 kernel/sched/core.c       |  3 +--
 4 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9b4670d85131..0a6d564d2745 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2252,12 +2252,14 @@ unsigned long sched_cpu_util(int cpu);
 extern bool rseq_delay_resched(void);
 extern void rseq_delay_resched_fini(void);
 extern void rseq_delay_resched_tick(void);
+extern void rseq_delay_schedule(struct task_struct *tsk);
 
 #else
 
 static inline bool rseq_delay_resched(void) { return false; }
 static inline void rseq_delay_resched_fini(void) { }
 static inline void rseq_delay_resched_tick(void) { }
+static inline void rseq_delay_schedule(struct task_struct *tsk) { }
 
 #endif
 
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 25fc636b17d5..f4813d931387 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -27,6 +27,7 @@ enum rseq_cs_flags_bit {
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
 	RSEQ_CS_FLAG_DELAY_RESCHED_BIT		= 3,
+	RSEQ_CS_FLAG_RESCHEDULED_BIT		= 4,
 };
 
 enum rseq_cs_flags {
@@ -38,6 +39,9 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
 	RSEQ_CS_FLAG_DELAY_RESCHED		=
 		(1U << RSEQ_CS_FLAG_DELAY_RESCHED_BIT),
+	RSEQ_CS_FLAG_RESCHEDULED		=
+		(1U << RSEQ_CS_FLAG_RESCHEDULED_BIT),
+
 };
 
 /*
@@ -135,6 +139,12 @@ struct rseq {
 	 *     Request by user thread to delay preemption. With use
 	 *     of a timer, kernel grants extra cpu time upto 30us for this
 	 *     thread before being rescheduled.
+	 * - RSEQ_CS_FLAG_RESCHEDULED
+	 *     Set by kernel if the thread was rescheduled in the extra time
+	 *     granted due to request RSEQ_CS_DELAY_RESCHED. This bit is
+	 *     checked by the thread before calling sched_yield() to yield
+	 *     cpu. User thread sets this bit to 0, when setting
+	 *     RSEQ_CS_DELAY_RESCHED to request preemption delay.
 	 */
 	__u32 flags;
 
diff --git a/kernel/rseq.c b/kernel/rseq.c
index dba44ca9f624..eb20622634ef 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -504,6 +504,25 @@ void rseq_delay_resched_tick(void)
 #endif
 }
 
+void rseq_delay_schedule(struct task_struct *tsk)
+{
+#ifdef CONFIG_SCHED_HRTICK
+	u32 flags;
+
+	if (tsk->sched_time_delay) {
+		tsk->sched_time_delay = 0;
+		if (!tsk->rseq)
+			return;
+		if (copy_from_user_nofault(&flags, &tsk->rseq->flags,
+                        sizeof(flags)))
+                        return;
+                flags |= RSEQ_CS_FLAG_RESCHEDULED;
+                copy_to_user_nofault(&tsk->rseq->flags, &flags,
+                        sizeof(flags));
+	}
+#endif
+}
+
 #ifdef CONFIG_DEBUG_RSEQ
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c1b64879115f..e163822d5381 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6752,8 +6752,7 @@ static void __sched notrace __schedule(int sched_mode)
 picked:
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
-	if (IS_ENABLED(CONFIG_RSEQ))
-		prev->sched_time_delay = 0;
+	rseq_delay_schedule(prev);
 	rq->last_seen_need_resched_ns = 0;
 
 	is_switch = prev != next;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 3/7] Sched: Tunable to specify duration of time slice extension
  2025-07-01  0:37 [PATCH V6 0/7] Scheduler time slice extension Prakash Sangappa
  2025-07-01  0:37 ` [PATCH V6 1/7] Sched: " Prakash Sangappa
  2025-07-01  0:37 ` [PATCH V6 2/7] Sched: Indicate if thread got rescheduled Prakash Sangappa
@ 2025-07-01  0:37 ` Prakash Sangappa
  2025-07-01  3:59   ` K Prateek Nayak
  2025-07-01  0:37 ` [PATCH V6 4/7] Sched: Add scheduler stat for cpu " Prakash Sangappa
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Prakash Sangappa @ 2025-07-01  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

Add a tunable to specify duration of scheduler time slice extension.
The default will be set to 30us and the max value that can be specified
is 100us. Setting it to 0, disables scheduler time slice extension.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
v5 -> v6
- Added documentation for tunable 'sysctl_sched_preempt_delay_us'.
---
 .../admin-guide/kernel-parameters.txt         |  8 ++++
 include/linux/sched.h                         |  5 +++
 include/uapi/linux/rseq.h                     |  5 ++-
 kernel/rseq.c                                 |  7 +++-
 kernel/sched/core.c                           | 40 +++++++++++++++++++
 5 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0ee6c5314637..1e0f86cda0db 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6398,6 +6398,14 @@
 
 	sched_verbose	[KNL,EARLY] Enables verbose scheduler debug messages.
 
+	sched_preempt_delay_us=	[KNL]
+			Scheduler preemption delay in microseconds.
+			Allowed range is 0 to 100us. A thread can request
+			extending its scheduler time slice on the cpu by
+			delaying preemption. Duration of preemption delay
+			granted is specified by this parameter. Setting it
+			to 0 will disable this feature.
+
 	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
 			Allowed values are enable and disable. This feature
 			incurs a small amount of overhead in the scheduler
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0a6d564d2745..a0661f1d423b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -406,6 +406,11 @@ static inline void sched_domains_mutex_lock(void) { }
 static inline void sched_domains_mutex_unlock(void) { }
 #endif
 
+#ifdef CONFIG_RSEQ
+/* Scheduler time slice extension */
+extern unsigned int sysctl_sched_preempt_delay_us;
+#endif
+
 struct sched_param {
 	int sched_priority;
 };
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index f4813d931387..015534f064af 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -137,8 +137,9 @@ struct rseq {
 	 *     this thread.
 	 * - RSEQ_CS_FLAG_DELAY_RESCHED
 	 *     Request by user thread to delay preemption. With use
-	 *     of a timer, kernel grants extra cpu time upto 30us for this
-	 *     thread before being rescheduled.
+	 *     of a timer, kernel grants extra cpu time upto the tunable
+	 *     'sched_preempt_delay_us' value for this thread before it gets
+	 *     rescheduled.
 	 * - RSEQ_CS_FLAG_RESCHEDULED
 	 *     Set by kernel if the thread was rescheduled in the extra time
 	 *     granted due to request RSEQ_CS_DELAY_RESCHED. This bit is
diff --git a/kernel/rseq.c b/kernel/rseq.c
index eb20622634ef..545123ca60b0 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -456,6 +456,8 @@ bool rseq_delay_resched(void)
 	if (!IS_ENABLED(CONFIG_SCHED_HRTICK))
 		return false;
 
+	if (!sysctl_sched_preempt_delay_us)
+		return false;
 	if (!t->rseq)
 		return false;
 
@@ -489,8 +491,9 @@ void rseq_delay_resched_fini(void)
 	 * If your critical section is longer than 30 us you get to keep the
 	 * pieces.
 	 */
-	if (t->sched_time_delay)
-		hrtick_local_start(30 * NSEC_PER_USEC);
+	if (sysctl_sched_preempt_delay_us && t->sched_time_delay)
+		hrtick_local_start(sysctl_sched_preempt_delay_us *
+				   NSEC_PER_USEC);
 #endif
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e163822d5381..6d50eff9be8c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -149,6 +149,17 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
  */
 __read_mostly unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
 
+#ifdef CONFIG_RSEQ
+/*
+ * Scheduler time slice extension, duration in microsecs.
+ * Max value allowed 100us, default is 30us.
+ * If set to 0, scheduler time slice extension is disabled.
+ */
+#define SCHED_PREEMPT_DELAY_DEFAULT_US	30
+__read_mostly unsigned int sysctl_sched_preempt_delay_us =
+	SCHED_PREEMPT_DELAY_DEFAULT_US;
+#endif
+
 __read_mostly int scheduler_running;
 
 #ifdef CONFIG_SCHED_CORE
@@ -4678,6 +4689,24 @@ static int sysctl_schedstats(const struct ctl_table *table, int write, void *buf
 #endif /* CONFIG_PROC_SYSCTL */
 #endif /* CONFIG_SCHEDSTATS */
 
+#ifdef CONFIG_PROC_SYSCTL
+#ifdef CONFIG_RSEQ
+static int sysctl_sched_preempt_delay(const struct ctl_table *table, int write,
+		void *buffer, size_t *lenp, loff_t *ppos)
+{
+	int err;
+
+	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (sysctl_sched_preempt_delay_us > SCHED_PREEMPT_DELAY_DEFAULT_US)
+		pr_warn("Sched preemption delay time set higher then default value %d us\n",
+			SCHED_PREEMPT_DELAY_DEFAULT_US);
+	return err;
+}
+#endif /* CONFIG_RSEQ */
+#endif /* CONFIG_PROC_SYSCTL */
+
 #ifdef CONFIG_SYSCTL
 static const struct ctl_table sched_core_sysctls[] = {
 #ifdef CONFIG_SCHEDSTATS
@@ -4725,6 +4754,17 @@ static const struct ctl_table sched_core_sysctls[] = {
 		.extra2		= SYSCTL_FOUR,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_RSEQ
+	{
+		.procname	= "sched_preempt_delay_us",
+		.data		= &sysctl_sched_preempt_delay_us,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_preempt_delay,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE_HUNDRED,
+	},
+#endif /* CONFIG_RSEQ */
 };
 static int __init sched_core_sysctl_init(void)
 {
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 4/7] Sched: Add scheduler stat for cpu time slice extension
  2025-07-01  0:37 [PATCH V6 0/7] Scheduler time slice extension Prakash Sangappa
                   ` (2 preceding siblings ...)
  2025-07-01  0:37 ` [PATCH V6 3/7] Sched: Tunable to specify duration of time slice extension Prakash Sangappa
@ 2025-07-01  0:37 ` Prakash Sangappa
  2025-07-01  0:37 ` [PATCH V6 5/7] Sched: Add tracepoint for sched " Prakash Sangappa
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Prakash Sangappa @ 2025-07-01  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

Add scheduler stat to record number of times the thread was granted
cpu time slice extension.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/sched.h | 7 +++++++
 kernel/rseq.c         | 1 +
 kernel/sched/core.c   | 7 +++++++
 kernel/sched/debug.c  | 4 ++++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a0661f1d423b..90d7989a0185 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -339,6 +339,9 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 extern void hrtick_local_start(u64 delay);
+#ifdef CONFIG_RSEQ
+extern void update_stat_preempt_delayed(struct task_struct *t);
+#endif
 
 /* wrapper function to trace from this header file */
 DECLARE_TRACEPOINT(sched_set_state_tp);
@@ -569,6 +572,10 @@ struct sched_statistics {
 	u64				nr_wakeups_passive;
 	u64				nr_wakeups_idle;
 
+#ifdef CONFIG_RSEQ
+	u64				nr_preempt_delay_granted;
+#endif
+
 #ifdef CONFIG_SCHED_CORE
 	u64				core_forceidle_sum;
 #endif
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 545123ca60b0..99aa263c3a07 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -475,6 +475,7 @@ bool rseq_delay_resched(void)
 		return false;
 
 	t->sched_time_delay = 1;
+	update_stat_preempt_delayed(t);
 
 	return true;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6d50eff9be8c..fd572053a955 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -941,6 +941,13 @@ void hrtick_local_start(u64 delay)
 	rq_unlock(rq, &rf);
 }
 
+#ifdef CONFIG_RSEQ
+void update_stat_preempt_delayed(struct task_struct *t)
+{
+	schedstat_inc(t->stats.nr_preempt_delay_granted);
+}
+#endif
+
 static void hrtick_rq_init(struct rq *rq)
 {
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 9135d5c2edea..3a2efd9505e1 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1225,6 +1225,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_wakeups_passive);
 		P_SCHEDSTAT(nr_wakeups_idle);
 
+#ifdef CONFIG_RSEQ
+		P_SCHEDSTAT(nr_preempt_delay_granted);
+#endif
+
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
 			avg_atom = div64_ul(avg_atom, nr_switches);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 5/7] Sched: Add tracepoint for sched time slice extension
  2025-07-01  0:37 [PATCH V6 0/7] Scheduler time slice extension Prakash Sangappa
                   ` (3 preceding siblings ...)
  2025-07-01  0:37 ` [PATCH V6 4/7] Sched: Add scheduler stat for cpu " Prakash Sangappa
@ 2025-07-01  0:37 ` Prakash Sangappa
  2025-07-01  0:37 ` [PATCH V6 6/7] Add API to query supported rseq cs flags Prakash Sangappa
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Prakash Sangappa @ 2025-07-01  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

Trace thread's preemption getting delayed. Which can occur if
the running thread requested extra time on cpu.  Also, indicate
the NEED_RESCHED flag, that is set on the thread, getting cleared.

Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
v5 -> v6
- Changed tracepoint to tracepoint condition.
---
 include/trace/events/sched.h | 31 +++++++++++++++++++++++++++++++
 kernel/entry/common.c        | 10 ++++++++--
 2 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 4e6b2910cec3..a4846579f377 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -296,6 +296,37 @@ TRACE_EVENT(sched_migrate_task,
 		  __entry->orig_cpu, __entry->dest_cpu)
 );
 
+/*
+ * Tracepoint for delayed resched requested by task:
+ */
+TRACE_EVENT_CONDITION(sched_delay_resched,
+
+	TP_PROTO(struct task_struct *p, unsigned int ti_work_cleared),
+
+	TP_ARGS(p, ti_work_cleared),
+
+	TP_CONDITION(ti_work_cleared),
+
+	TP_STRUCT__entry(
+		__array( char, comm, TASK_COMM_LEN	)
+		__field( pid_t, pid			)
+		__field( int, cpu			)
+		__field( int, flg			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->cpu 		= task_cpu(p);
+		__entry->flg		= ti_work_cleared & (_TIF_NEED_RESCHED |
+					_TIF_NEED_RESCHED_LAZY);
+	),
+
+	TP_printk("comm=%s pid=%d cpu=%d resched_flg_cleared=0x%x",
+		__entry->comm, __entry->pid, __entry->cpu, __entry->flg)
+
+);
+
 DECLARE_EVENT_CLASS(sched_process_template,
 
 	TP_PROTO(struct task_struct *p),
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 8769c3592e26..ca3c91f0ea99 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -12,6 +12,7 @@
 
 #include "common.h"
 
+#include <trace/events/sched.h>
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
@@ -91,6 +92,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 						     unsigned long ti_work,
 						     bool irq)
 {
+	unsigned long ti_work_cleared = 0;
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
@@ -100,10 +102,12 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		local_irq_enable_exit_to_user(ti_work);
 
 		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
-		       if (irq && rseq_delay_resched())
+		       if (irq && rseq_delay_resched()) {
 			       clear_tsk_need_resched(current);
-		       else
+			       ti_work_cleared = ti_work;
+		       } else {
 			       schedule();
+		       }
 		}
 
 		if (ti_work & _TIF_UPROBE)
@@ -134,6 +138,8 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		ti_work = read_thread_flags();
 	}
 
+	trace_sched_delay_resched(current, ti_work_cleared);
+
 	/* Return the latest work state for arch_exit_to_user_mode() */
 	return ti_work;
 }
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 6/7] Add API to query supported rseq cs flags
  2025-07-01  0:37 [PATCH V6 0/7] Scheduler time slice extension Prakash Sangappa
                   ` (4 preceding siblings ...)
  2025-07-01  0:37 ` [PATCH V6 5/7] Sched: Add tracepoint for sched " Prakash Sangappa
@ 2025-07-01  0:37 ` Prakash Sangappa
  2025-07-01  0:37 ` [PATCH V6 7/7] Introduce a config option for scheduler time slice extension feature Prakash Sangappa
  2025-07-01  4:30 ` [PATCH V6 0/7] Scheduler time slice extension K Prateek Nayak
  7 siblings, 0 replies; 24+ messages in thread
From: Prakash Sangappa @ 2025-07-01  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

For the API, add a new flag to sys_rseq 'flags' argument called
RSEQ_FLAG_QUERY_CS_FLAGS.

When this flag is passed it returns a bit mask of all the supported
rseq cs flags in the user provided rseq struct's 'flags' member.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/uapi/linux/rseq.h |  1 +
 kernel/rseq.c             | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 015534f064af..44baea9dd10a 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -20,6 +20,7 @@ enum rseq_cpu_id_state {
 
 enum rseq_flags {
 	RSEQ_FLAG_UNREGISTER = (1 << 0),
+	RSEQ_FLAG_QUERY_CS_FLAGS = (1 << 1),
 };
 
 enum rseq_cs_flags_bit {
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 99aa263c3a07..7710a209433b 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -575,6 +575,21 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		return 0;
 	}
 
+	/*
+	 * Return supported rseq_cs flags.
+	 */
+	if (flags & RSEQ_FLAG_QUERY_CS_FLAGS) {
+		u32 rseq_csflags = RSEQ_CS_FLAG_DELAY_RESCHED |
+				   RSEQ_CS_FLAG_RESCHEDULED;
+		if (!IS_ENABLED(CONFIG_SCHED_HRTICK))
+			return -EINVAL;
+		if (!rseq)
+			return -EINVAL;
+		if (copy_to_user(&rseq->flags, &rseq_csflags, sizeof(u32)))
+			return -EFAULT;
+		return 0;
+	}
+
 	if (unlikely(flags))
 		return -EINVAL;
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 7/7] Introduce a config option for scheduler time slice extension feature
  2025-07-01  0:37 [PATCH V6 0/7] Scheduler time slice extension Prakash Sangappa
                   ` (5 preceding siblings ...)
  2025-07-01  0:37 ` [PATCH V6 6/7] Add API to query supported rseq cs flags Prakash Sangappa
@ 2025-07-01  0:37 ` Prakash Sangappa
  2025-07-01  3:12   ` K Prateek Nayak
  2025-07-01  8:46   ` Thomas Gleixner
  2025-07-01  4:30 ` [PATCH V6 0/7] Scheduler time slice extension K Prateek Nayak
  7 siblings, 2 replies; 24+ messages in thread
From: Prakash Sangappa @ 2025-07-01  0:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

Add a config option to enable schedule time slice extension.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/entry-common.h |  2 +-
 include/linux/sched.h        |  8 ++++----
 init/Kconfig                 |  7 +++++++
 kernel/rseq.c                |  5 ++++-
 kernel/sched/core.c          | 12 ++++++------
 kernel/sched/debug.c         |  2 +-
 kernel/sched/syscalls.c      |  3 ++-
 7 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index d4fa952e394e..351c9dc159bc 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -402,7 +402,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
 	CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
 
 	/* reschedule if sched delay was granted */
-	if (IS_ENABLED(CONFIG_RSEQ) && current->sched_time_delay)
+	if (IS_ENABLED(CONFIG_SCHED_PREEMPT_DELAY) && current->sched_time_delay)
 		set_tsk_need_resched(current);
 
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90d7989a0185..ca2b461b7662 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -339,7 +339,7 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 extern void hrtick_local_start(u64 delay);
-#ifdef CONFIG_RSEQ
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
 extern void update_stat_preempt_delayed(struct task_struct *t);
 #endif
 
@@ -409,7 +409,7 @@ static inline void sched_domains_mutex_lock(void) { }
 static inline void sched_domains_mutex_unlock(void) { }
 #endif
 
-#ifdef CONFIG_RSEQ
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
 /* Scheduler time slice extension */
 extern unsigned int sysctl_sched_preempt_delay_us;
 #endif
@@ -572,7 +572,7 @@ struct sched_statistics {
 	u64				nr_wakeups_passive;
 	u64				nr_wakeups_idle;
 
-#ifdef CONFIG_RSEQ
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
 	u64				nr_preempt_delay_granted;
 #endif
 
@@ -2259,7 +2259,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 unsigned long sched_cpu_util(int cpu);
 #endif /* CONFIG_SMP */
 
-#ifdef CONFIG_RSEQ
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
 
 extern bool rseq_delay_resched(void);
 extern void rseq_delay_resched_fini(void);
diff --git a/init/Kconfig b/init/Kconfig
index ce76e913aa2b..2f5f603d175a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1130,6 +1130,13 @@ config SCHED_MM_CID
 	def_bool y
 	depends on SMP && RSEQ
 
+config SCHED_PREEMPT_DELAY
+	def_bool y
+	depends on SMP && RSEQ
+	help
+	  This feature enables a thread to request extending its time slice on
+	  the cpu by delaying preemption.
+
 config UCLAMP_TASK_GROUP
 	bool "Utilization clamping per group of tasks"
 	depends on CGROUP_SCHED
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 7710a209433b..440fa4002be5 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -448,6 +448,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	force_sigsegv(sig);
 }
 
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
 bool rseq_delay_resched(void)
 {
 	struct task_struct *t = current;
@@ -526,6 +527,7 @@ void rseq_delay_schedule(struct task_struct *tsk)
 	}
 #endif
 }
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
 
 #ifdef CONFIG_DEBUG_RSEQ
 
@@ -581,7 +583,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 	if (flags & RSEQ_FLAG_QUERY_CS_FLAGS) {
 		u32 rseq_csflags = RSEQ_CS_FLAG_DELAY_RESCHED |
 				   RSEQ_CS_FLAG_RESCHEDULED;
-		if (!IS_ENABLED(CONFIG_SCHED_HRTICK))
+		if (!IS_ENABLED(CONFIG_SCHED_PREEMPT_DELAY) ||
+					!IS_ENABLED(CONFIG_SCHED_HRTICK))
 			return -EINVAL;
 		if (!rseq)
 			return -EINVAL;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fd572053a955..d28c0e75b4f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -149,7 +149,7 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
  */
 __read_mostly unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
 
-#ifdef CONFIG_RSEQ
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
 /*
  * Scheduler time slice extension, duration in microsecs.
  * Max value allowed 100us, default is 30us.
@@ -941,7 +941,7 @@ void hrtick_local_start(u64 delay)
 	rq_unlock(rq, &rf);
 }
 
-#ifdef CONFIG_RSEQ
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
 void update_stat_preempt_delayed(struct task_struct *t)
 {
 	schedstat_inc(t->stats.nr_preempt_delay_granted);
@@ -4697,7 +4697,7 @@ static int sysctl_schedstats(const struct ctl_table *table, int write, void *buf
 #endif /* CONFIG_SCHEDSTATS */
 
 #ifdef CONFIG_PROC_SYSCTL
-#ifdef CONFIG_RSEQ
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
 static int sysctl_sched_preempt_delay(const struct ctl_table *table, int write,
 		void *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -4711,7 +4711,7 @@ static int sysctl_sched_preempt_delay(const struct ctl_table *table, int write,
 			SCHED_PREEMPT_DELAY_DEFAULT_US);
 	return err;
 }
-#endif /* CONFIG_RSEQ */
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
 #endif /* CONFIG_PROC_SYSCTL */
 
 #ifdef CONFIG_SYSCTL
@@ -4761,7 +4761,7 @@ static const struct ctl_table sched_core_sysctls[] = {
 		.extra2		= SYSCTL_FOUR,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
-#ifdef CONFIG_RSEQ
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
 	{
 		.procname	= "sched_preempt_delay_us",
 		.data		= &sysctl_sched_preempt_delay_us,
@@ -4771,7 +4771,7 @@ static const struct ctl_table sched_core_sysctls[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE_HUNDRED,
 	},
-#endif /* CONFIG_RSEQ */
+#endif /* CONFIG_SCHED_PREEMPT_DELAY */
 };
 static int __init sched_core_sysctl_init(void)
 {
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 3a2efd9505e1..45ae09447624 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1225,7 +1225,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_wakeups_passive);
 		P_SCHEDSTAT(nr_wakeups_idle);
 
-#ifdef CONFIG_RSEQ
+#ifdef CONFIG_SCHED_PREEMPT_DELAY
 		P_SCHEDSTAT(nr_preempt_delay_granted);
 #endif
 
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index d9a4e3a2e064..f86eac7e2b43 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -1379,7 +1379,8 @@ static void do_sched_yield(void)
  */
 SYSCALL_DEFINE0(sched_yield)
 {
-	if (IS_ENABLED(CONFIG_RSEQ) && current->sched_time_delay) {
+	if (IS_ENABLED(CONFIG_SCHED_PREEMPT_DELAY) &&
+				current->sched_time_delay) {
 		schedule();
 		return 0;
 	}
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 7/7] Introduce a config option for scheduler time slice extension feature
  2025-07-01  0:37 ` [PATCH V6 7/7] Introduce a config option for scheduler time slice extension feature Prakash Sangappa
@ 2025-07-01  3:12   ` K Prateek Nayak
  2025-07-01 17:47     ` Prakash Sangappa
  2025-07-01  8:46   ` Thomas Gleixner
  1 sibling, 1 reply; 24+ messages in thread
From: K Prateek Nayak @ 2025-07-01  3:12 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, vineethr

Hello Prakash,

Couple of nits. inlined below.

On 7/1/2025 6:07 AM, Prakash Sangappa wrote:
> Add a config option to enable schedule time slice extension.
> 
> Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
> ---
>   include/linux/entry-common.h |  2 +-
>   include/linux/sched.h        |  8 ++++----
>   init/Kconfig                 |  7 +++++++
>   kernel/rseq.c                |  5 ++++-
>   kernel/sched/core.c          | 12 ++++++------
>   kernel/sched/debug.c         |  2 +-
>   kernel/sched/syscalls.c      |  3 ++-
>   7 files changed, 25 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index d4fa952e394e..351c9dc159bc 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -402,7 +402,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
>   	CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
>   
>   	/* reschedule if sched delay was granted */
> -	if (IS_ENABLED(CONFIG_RSEQ) && current->sched_time_delay)
> +	if (IS_ENABLED(CONFIG_SCHED_PREEMPT_DELAY) && current->sched_time_delay)

A wrapper around this would be nice. Something like
sched_delay_resched()? It can also be reused in do_sched_yield() then.
Thoughts?

>   		set_tsk_need_resched(current);
>   
>   	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {

[..snip..]

> diff --git a/init/Kconfig b/init/Kconfig
> index ce76e913aa2b..2f5f603d175a 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1130,6 +1130,13 @@ config SCHED_MM_CID
>   	def_bool y
>   	depends on SMP && RSEQ
>   
> +config SCHED_PREEMPT_DELAY
> +	def_bool y
> +	depends on SMP && RSEQ

         && SCHED_HRTICK

and then you can avoid the ugly "!IS_ENABLED(CONFIG_SCHED_HRTICK)"
checks and keep all the SCHED_PREEMPT_DELAY bits in one place
without the need to put them in the "#ifdef  CONFIG_SCHED_HRTICK"
block.

Also, are we settling for 30us delay for PREEMPT_RT too or should
this also include "&& !PREEMPT_RT"?

> +	help
> +	  This feature enables a thread to request extending its time slice on
> +	  the cpu by delaying preemption.
> +
>   config UCLAMP_TASK_GROUP
>   	bool "Utilization clamping per group of tasks"
>   	depends on CGROUP_SCHED

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 3/7] Sched: Tunable to specify duration of time slice extension
  2025-07-01  0:37 ` [PATCH V6 3/7] Sched: Tunable to specify duration of time slice extension Prakash Sangappa
@ 2025-07-01  3:59   ` K Prateek Nayak
  0 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-07-01  3:59 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, vineethr

Hello Prakash,

On 7/1/2025 6:07 AM, Prakash Sangappa wrote:
> Add a tunable to specify duration of scheduler time slice extension.
> The default will be set to 30us and the max value that can be specified
> is 100us. Setting it to 0, disables scheduler time slice extension.
> 
> Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
> ---
> v5 -> v6
> - Added documentation for tunable 'sysctl_sched_preempt_delay_us'.
> ---
>   .../admin-guide/kernel-parameters.txt         |  8 ++++
>   include/linux/sched.h                         |  5 +++
>   include/uapi/linux/rseq.h                     |  5 ++-
>   kernel/rseq.c                                 |  7 +++-
>   kernel/sched/core.c                           | 40 +++++++++++++++++++
>   5 files changed, 61 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 0ee6c5314637..1e0f86cda0db 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -6398,6 +6398,14 @@
>   
>   	sched_verbose	[KNL,EARLY] Enables verbose scheduler debug messages.
>   
> +	sched_preempt_delay_us=	[KNL]
> +			Scheduler preemption delay in microseconds.
> +			Allowed range is 0 to 100us. A thread can request
> +			extending its scheduler time slice on the cpu by
> +			delaying preemption. Duration of preemption delay
> +			granted is specified by this parameter. Setting it
> +			to 0 will disable this feature.
> +

Shouldn't these bits go into Documentation/admin-guide/sysctl/kernel.rst
since this is a sysctl and not a kernel parameter?

>   	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
>   			Allowed values are enable and disable. This feature
>   			incurs a small amount of overhead in the scheduler

[..snip..]

> @@ -4678,6 +4689,24 @@ static int sysctl_schedstats(const struct ctl_table *table, int write, void *buf
>   #endif /* CONFIG_PROC_SYSCTL */
>   #endif /* CONFIG_SCHEDSTATS */
>   
> +#ifdef CONFIG_PROC_SYSCTL
> +#ifdef CONFIG_RSEQ
> +static int sysctl_sched_preempt_delay(const struct ctl_table *table, int write,
> +		void *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	int err;
> +
> +	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +	if (err < 0)
> +		return err;
> +	if (sysctl_sched_preempt_delay_us > SCHED_PREEMPT_DELAY_DEFAULT_US)
> +		pr_warn("Sched preemption delay time set higher then default value %d us\n",

s/then/than the/

Should we print the current value too? That way, the dmesg can read how
badly the user wanted to shoot themselves in the foot if this ever goes
sideways / has unintended effects on PREEMPT_RT.

> +			SCHED_PREEMPT_DELAY_DEFAULT_US);
> +	return err;
> +}
> +#endif /* CONFIG_RSEQ */
> +#endif /* CONFIG_PROC_SYSCTL */
> +
>   #ifdef CONFIG_SYSCTL
>   static const struct ctl_table sched_core_sysctls[] = {
>   #ifdef CONFIG_SCHEDSTATS

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 0/7] Scheduler time slice extension
  2025-07-01  0:37 [PATCH V6 0/7] Scheduler time slice extension Prakash Sangappa
                   ` (6 preceding siblings ...)
  2025-07-01  0:37 ` [PATCH V6 7/7] Introduce a config option for scheduler time slice extension feature Prakash Sangappa
@ 2025-07-01  4:30 ` K Prateek Nayak
  2025-07-01 19:04   ` Prakash Sangappa
  7 siblings, 1 reply; 24+ messages in thread
From: K Prateek Nayak @ 2025-07-01  4:30 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, vineethr

Hello Prakash,

On 7/1/2025 6:07 AM, Prakash Sangappa wrote:
> Prakash Sangappa (7):
>    Sched: Scheduler time slice extension
>    Sched: Indicate if thread got rescheduled
>    Sched: Tunable to specify duration of time slice extension
>    Sched: Add scheduler stat for cpu time slice extension
>    Sched: Add tracepoint for sched time slice extension
>    Add API to query supported rseq cs flags
>    Introduce a config option for scheduler time slice extension feature

nit.

IMO, the ordering of these patches can be improved. Introduction of
CONFIG_SCHED_PREEMPT_DELAY can come first followed by incrementally
adding the scheduler bits, followed by "rseq: Add API to query supported
rseq cs flags" and then finally introduce the bits that introduces
"RSEQ_CS_FLAG_DELAY_RESCHED" and allows the user to set.

This way all the CONFIG_SCHED_PREEMPT_DELAY can live in one place and
make it easier to review the entire series.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 1/7] Sched: Scheduler time slice extension
  2025-07-01  0:37 ` [PATCH V6 1/7] Sched: " Prakash Sangappa
@ 2025-07-01  8:42   ` Thomas Gleixner
  2025-07-01 10:56     ` Peter Zijlstra
  2025-07-01 18:40     ` Prakash Sangappa
  0 siblings, 2 replies; 24+ messages in thread
From: Thomas Gleixner @ 2025-07-01  8:42 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, bigeasy, kprateek.nayak,
	vineethr

On Tue, Jul 01 2025 at 00:37, Prakash Sangappa wrote:

The subsystem prefix for the scheduler is 'sched:' It's not that hard to
figure out.

>  unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> -				     unsigned long ti_work);
> +				     unsigned long ti_work,
> +				     bool irq);

No need for a new line
  
>  /**
>   * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
> @@ -316,7 +317,8 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>   *    EXIT_TO_USER_MODE_WORK are set
>   * 4) check that interrupts are still disabled
>   */
> -static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
> +static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs,
> +						bool irq)

Ditto. 100 characters line width, please use it. And if you need a line
break, please align the second lines arguments properly. This is
documented:

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html

>  {
>  	unsigned long ti_work;
>  
> @@ -327,7 +329,10 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
>  
>  	ti_work = read_thread_flags();
>  	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> -		ti_work = exit_to_user_mode_loop(regs, ti_work);
> +		ti_work = exit_to_user_mode_loop(regs, ti_work, irq);
> +
> +	if (irq)
> +		rseq_delay_resched_fini();

This is an unconditional function call for every interrupt return and
it's even done when the whole thing is known to be non-functional at
compile time:

> +void rseq_delay_resched_fini(void)
> +{
> +#ifdef CONFIG_SCHED_HRTICK
  ....
> +#endif
> +}

Seriously?

>  	arch_exit_to_user_mode_prepare(regs, ti_work);
>  
> @@ -396,6 +401,10 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
>  
>  	CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
>  
> +	/* reschedule if sched delay was granted */

Sentences start with an upper case letter and please use full words and
not arbitrary abbreviations. This is neither twatter nor SMS.

> +	if (IS_ENABLED(CONFIG_RSEQ) && current->sched_time_delay)
> +		set_tsk_need_resched(current);
> +
>  	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
>  		if (WARN(irqs_disabled(), "syscall %lu left IRQs disabled", nr))
>  			local_irq_enable();
> @@ -411,7 +420,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
>  	if (unlikely(work & SYSCALL_WORK_EXIT))
>  		syscall_exit_work(regs, work);
>  	local_irq_disable_exit_to_user();
> -	exit_to_user_mode_prepare(regs);
> +	exit_to_user_mode_prepare(regs, false);
>  }
>  
>  /**
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5bcf44ae6c79..9b4670d85131 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -338,6 +338,7 @@ extern int __must_check io_schedule_prepare(void);
>  extern void io_schedule_finish(int token);
>  extern long io_schedule_timeout(long timeout);
>  extern void io_schedule(void);
> +extern void hrtick_local_start(u64 delay);
>  
>  /* wrapper function to trace from this header file */
>  DECLARE_TRACEPOINT(sched_set_state_tp);
> @@ -1263,6 +1264,7 @@ struct task_struct {
>  	int				softirq_context;
>  	int				irq_config;
>  #endif
> +	unsigned			sched_time_delay:1;

Find an arbitrary place by rolling a dice and stick it in, right?

There is already a section with bit fields in this struct. So it's more
than bloody obvious to stick it there instead of creating a hole in the
middle of task struct.

>  #ifdef CONFIG_PREEMPT_RT
>  	int				softirq_disable_cnt;
>  #endif
> @@ -2245,6 +2247,20 @@ static inline bool owner_on_cpu(struct task_struct *owner)
>  unsigned long sched_cpu_util(int cpu);
>  #endif /* CONFIG_SMP */
>  
> +#ifdef CONFIG_RSEQ
> +

Remove these newlines please. They have zero value.

> +extern bool rseq_delay_resched(void);
> +extern void rseq_delay_resched_fini(void);
> +extern void rseq_delay_resched_tick(void);
> +
> +#else

> @@ -98,8 +99,12 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>  
>  		local_irq_enable_exit_to_user(ti_work);
>  
> -		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> -			schedule();
> +		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
> +		       if (irq && rseq_delay_resched())

unlikely() and again this results in an unconditional function call for
every interrupt when CONFIG_RSEQ is enabled. A pointless exercise for
the majority of use cases.

What's worse is that it breaks the LAZY semantics. I explained this to
you before and this thing needs to be tied on the LAZY bit otherwise a
SCHED_OTHER task can prevent a real-time task from running, which is
fundamentally wrong.

So this wants to be:

	if (likely(!irq || !rseq_delay_resched(ti_work))
        	schedule();

and

static inline bool rseq_delay_resched(unsigned long ti_work)
{
        // Set when all Kconfig conditions are met
        if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
        	return false;

        // Only NEED_RESCHED_LAZY can be delayed
	if (ti_work & _TIF_NEED_RESCHED)
        	return false;

        // NONE indicates that current::rseq == NULL
        // PROBE indicates that current::rseq::flags needs to be
        // evaluated.
        // REQUESTED indicates that there was a successful request
        // already.
        if (likely(current->rseq_delay_resched != RSEQ_RESCHED_DELAY_PROBE))
        	return false;

        return __rseq_delay_resched();
}

or something like that.

> +bool rseq_delay_resched(void)
> +{
> +	struct task_struct *t = current;
> +	u32 flags;
> +
> +	if (!IS_ENABLED(CONFIG_SCHED_HRTICK))
> +		return false;
> +
> +	if (!t->rseq)
> +		return false;
> +
> +	if (t->sched_time_delay)
> +		return false;

Then all of the above conditions go away.

> +	if (copy_from_user_nofault(&flags, &t->rseq->flags, sizeof(flags)))
> +		return false;
> +
> +	if (!(flags & RSEQ_CS_FLAG_DELAY_RESCHED))
> +		return false;
> +
> +	flags &= ~RSEQ_CS_FLAG_DELAY_RESCHED;
> +	if (copy_to_user_nofault(&t->rseq->flags, &flags, sizeof(flags)))
> +		return false;
> +
> +	t->sched_time_delay = 1;

and this becomes:

       	t->rseq_delay_resched = RSEQ_RESCHED_DELAY_REQUESTED;

> +	return true;
> +}
> +
> +void rseq_delay_resched_fini(void)

What's does _fini() mean here? Absolutely nothing. This wants to be a
self explaining function name and see below

> +{
> +#ifdef CONFIG_SCHED_HRTICK

You really are fond of pointless function calls. Obviously performance
is not really a concern in your work.

> +	extern void hrtick_local_start(u64 delay);

header files with prototypes exist for a reason....

> +	struct task_struct *t = current;
> +	/*
> +	 * IRQs off, guaranteed to return to userspace, start timer on this CPU
> +	 * to limit the resched-overdraft.
> +	 *
> +	 * If your critical section is longer than 30 us you get to keep the
> +	 * pieces.
> +	 */
> +	if (t->sched_time_delay)
> +		hrtick_local_start(30 * NSEC_PER_USEC);
> +#endif

This whole thing can be condensed into:

static inline void rseq_delay_resched_arm_timer(void)
{
	if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
        	return;
        if (unlikely(current->rseq_delay_resched != RSEQ_RESCHED_DELAY_REQUESTED))
        	hrtick_local_start(...);
}

> +}
> +
> +void rseq_delay_resched_tick(void)
> +{
> +#ifdef CONFIG_SCHED_HRTICK
> +	struct task_struct *t = current;
> +
> +	if (t->sched_time_delay)
> +		set_tsk_need_resched(t);
> +#endif

Oh well.....

> +}
> +
>  #ifdef CONFIG_DEBUG_RSEQ
>  
>  /*
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4ad7cf3cfdca..c1b64879115f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -845,6 +845,8 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
>  
>  	WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
>  
> +	rseq_delay_resched_tick();
> +
>  	rq_lock(rq, &rf);
>  	update_rq_clock(rq);
>  	rq->donor->sched_class->task_tick(rq, rq->curr, 1);
> @@ -918,6 +920,16 @@ void hrtick_start(struct rq *rq, u64 delay)
>  
>  #endif /* CONFIG_SMP */
>  
> +void hrtick_local_start(u64 delay)

How is this supposed to compile cleanly without a prototype?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 7/7] Introduce a config option for scheduler time slice extension feature
  2025-07-01  0:37 ` [PATCH V6 7/7] Introduce a config option for scheduler time slice extension feature Prakash Sangappa
  2025-07-01  3:12   ` K Prateek Nayak
@ 2025-07-01  8:46   ` Thomas Gleixner
  2025-07-01 19:04     ` Prakash Sangappa
  1 sibling, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2025-07-01  8:46 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, bigeasy, kprateek.nayak,
	vineethr

On Tue, Jul 01 2025 at 00:37, Prakash Sangappa wrote:
> Add a config option to enable schedule time slice extension.

This is so backwards that it's not even funny anymore.

> +config SCHED_PREEMPT_DELAY
> +	def_bool y
> +	depends on SMP && RSEQ

and hilariously fails to include a SCHED_HRTICK dependency.

Impressive....

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 1/7] Sched: Scheduler time slice extension
  2025-07-01  8:42   ` Thomas Gleixner
@ 2025-07-01 10:56     ` Peter Zijlstra
  2025-07-01 11:28       ` K Prateek Nayak
  2025-07-01 12:36       ` Thomas Gleixner
  2025-07-01 18:40     ` Prakash Sangappa
  1 sibling, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2025-07-01 10:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Prakash Sangappa, linux-kernel, rostedt, mathieu.desnoyers,
	bigeasy, kprateek.nayak, vineethr

On Tue, Jul 01, 2025 at 10:42:36AM +0200, Thomas Gleixner wrote:

> What's worse is that it breaks the LAZY semantics. I explained this to
> you before and this thing needs to be tied on the LAZY bit otherwise a
> SCHED_OTHER task can prevent a real-time task from running, which is
> fundamentally wrong.

So here we disagree, I don't want this tied to LAZY.

SCHED_OTHER can already inhibit a RT task from getting ran by doing a
syscall, this syscall will have non-preemptible sections and the RT task
will get delayed.

I very much want this thing to be limited to a time frame where a
userspace critical section (this thing) is smaller than such a kernel
critical section.

That is, there should be no observable difference between the effects of
this new thing and a syscall doing preempt_disable().

That said; the reason I don't want this tied to LAZY is that RT itself
is not subject to LAZY and this then means that RT threads cannot make
use of this new facility, whereas I think it makes perfect sense for
them to use this.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 1/7] Sched: Scheduler time slice extension
  2025-07-01 10:56     ` Peter Zijlstra
@ 2025-07-01 11:28       ` K Prateek Nayak
  2025-07-01 11:40         ` Peter Zijlstra
  2025-07-01 12:36       ` Thomas Gleixner
  1 sibling, 1 reply; 24+ messages in thread
From: K Prateek Nayak @ 2025-07-01 11:28 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: Prakash Sangappa, linux-kernel, rostedt, mathieu.desnoyers,
	bigeasy, vineethr

Hello Peter,

On 7/1/2025 4:26 PM, Peter Zijlstra wrote:
> On Tue, Jul 01, 2025 at 10:42:36AM +0200, Thomas Gleixner wrote:
> 
>> What's worse is that it breaks the LAZY semantics. I explained this to
>> you before and this thing needs to be tied on the LAZY bit otherwise a
>> SCHED_OTHER task can prevent a real-time task from running, which is
>> fundamentally wrong.
> 
> So here we disagree, I don't want this tied to LAZY.
> 
> SCHED_OTHER can already inhibit a RT task from getting ran by doing a
> syscall, this syscall will have non-preemptible sections and the RT task
> will get delayed.
> 
> I very much want this thing to be limited to a time frame where a
> userspace critical section (this thing) is smaller than such a kernel
> critical section.
> 
> That is, there should be no observable difference between the effects of
> this new thing and a syscall doing preempt_disable().
> 
> 
> That said; the reason I don't want this tied to LAZY is that RT itself
> is not subject to LAZY and this then means that RT threads cannot make
> use of this new facility, whereas I think it makes perfect sense for
> them to use this.

Thinking out loud: I know we are trying to keep the overhead to a
minimum but is it acceptable to go through with schedule() and decide
on extending the time slice in pick_next_task_fair() / pick_task_rt()?

Then, a higher priority task can always preempt us when preemption is
enabled and between the tasks of same class, it is just a redundant
schedule() loop.

It'll require some additional care to start accounting for delay from
the time when NEED_RESCHED was set and not when schedule() is actually
called but would the overhead be that bad?

Or would we like to prevent preemption from RT tasks too on
!PREMMPT_RT since whatever the task asking for the extended slice is
doing is considered important enough?

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 1/7] Sched: Scheduler time slice extension
  2025-07-01 11:28       ` K Prateek Nayak
@ 2025-07-01 11:40         ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2025-07-01 11:40 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Thomas Gleixner, Prakash Sangappa, linux-kernel, rostedt,
	mathieu.desnoyers, bigeasy, vineethr

On Tue, Jul 01, 2025 at 04:58:26PM +0530, K Prateek Nayak wrote:
> Hello Peter,
> 
> On 7/1/2025 4:26 PM, Peter Zijlstra wrote:
> > On Tue, Jul 01, 2025 at 10:42:36AM +0200, Thomas Gleixner wrote:
> > 
> > > What's worse is that it breaks the LAZY semantics. I explained this to
> > > you before and this thing needs to be tied on the LAZY bit otherwise a
> > > SCHED_OTHER task can prevent a real-time task from running, which is
> > > fundamentally wrong.
> > 
> > So here we disagree, I don't want this tied to LAZY.
> > 
> > SCHED_OTHER can already inhibit a RT task from getting ran by doing a
> > syscall, this syscall will have non-preemptible sections and the RT task
> > will get delayed.
> > 
> > I very much want this thing to be limited to a time frame where a
> > userspace critical section (this thing) is smaller than such a kernel
> > critical section.
> > 
> > That is, there should be no observable difference between the effects of
> > this new thing and a syscall doing preempt_disable().
> > 
> > 
> > That said; the reason I don't want this tied to LAZY is that RT itself
> > is not subject to LAZY and this then means that RT threads cannot make
> > use of this new facility, whereas I think it makes perfect sense for
> > them to use this.
> 
> Thinking out loud: I know we are trying to keep the overhead to a
> minimum but is it acceptable to go through with schedule() and decide
> on extending the time slice in pick_next_task_fair() / pick_task_rt()?
> 
> Then, a higher priority task can always preempt us when preemption is
> enabled and between the tasks of same class, it is just a redundant
> schedule() loop.
> 
> It'll require some additional care to start accounting for delay from
> the time when NEED_RESCHED was set and not when schedule() is actually
> called but would the overhead be that bad?

Probably not -- if care was taken to make sure all callers have an
up-to-date rq->clock (many will have today, some might need updating).
Then its just a matter of saving a copy.

Basically stick assert_clock_updated() in __resched_curr() and make all
the splats go away.

> Or would we like to prevent preemption from RT tasks too on
> !PREMMPT_RT since whatever the task asking for the extended slice is
> doing is considered important enough?

I'm not sure I see the need for this complication -- under the premise
that the duration is strictly limited to less than what syscalls can
already inflict upon us, there should be no observable difference in
worst case timing.

But yes, if this makes some people feel better, then I suppose can look
at this.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 1/7] Sched: Scheduler time slice extension
  2025-07-01 10:56     ` Peter Zijlstra
  2025-07-01 11:28       ` K Prateek Nayak
@ 2025-07-01 12:36       ` Thomas Gleixner
  2025-07-01 14:49         ` Steven Rostedt
  2025-07-03  5:38         ` Prakash Sangappa
  1 sibling, 2 replies; 24+ messages in thread
From: Thomas Gleixner @ 2025-07-01 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Prakash Sangappa, linux-kernel, rostedt, mathieu.desnoyers,
	bigeasy, kprateek.nayak, vineethr

On Tue, Jul 01 2025 at 12:56, Peter Zijlstra wrote:
> On Tue, Jul 01, 2025 at 10:42:36AM +0200, Thomas Gleixner wrote:
>
>> What's worse is that it breaks the LAZY semantics. I explained this to
>> you before and this thing needs to be tied on the LAZY bit otherwise a
>> SCHED_OTHER task can prevent a real-time task from running, which is
>> fundamentally wrong.
>
> So here we disagree, I don't want this tied to LAZY.
>
> SCHED_OTHER can already inhibit a RT task from getting ran by doing a
> syscall, this syscall will have non-preemptible sections and the RT task
> will get delayed.
>
> I very much want this thing to be limited to a time frame where a
> userspace critical section (this thing) is smaller than such a kernel
> critical section.
>
> That is, there should be no observable difference between the effects of
> this new thing and a syscall doing preempt_disable().
>
> That said; the reason I don't want this tied to LAZY is that RT itself
> is not subject to LAZY and this then means that RT threads cannot make
> use of this new facility, whereas I think it makes perfect sense for
> them to use this.

Fair enough, but can we pretty please have this explained and documented
and not just burried in some gory implementation details, which nobody
will understand in 3 months down the road.

Also if we go there and allow non-RT tasks to delay scheduling, then we
need a control mechanism to enable/disable this mechanism on a per task
or process basis. That way a RT system designer can prevent random
user space tasks, which think they are the most important piece, from
interfering with truly relevant RT tasks w/o going to chase down source
code and hack it into submission.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 1/7] Sched: Scheduler time slice extension
  2025-07-01 12:36       ` Thomas Gleixner
@ 2025-07-01 14:49         ` Steven Rostedt
  2025-07-03  5:38         ` Prakash Sangappa
  1 sibling, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2025-07-01 14:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Prakash Sangappa, linux-kernel, mathieu.desnoyers,
	bigeasy, kprateek.nayak, vineethr

On Tue, 01 Jul 2025 14:36:32 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> > That said; the reason I don't want this tied to LAZY is that RT itself
> > is not subject to LAZY and this then means that RT threads cannot make
> > use of this new facility, whereas I think it makes perfect sense for
> > them to use this.  
> 
> Fair enough, but can we pretty please have this explained and documented
> and not just burried in some gory implementation details, which nobody
> will understand in 3 months down the road.
> 
> Also if we go there and allow non-RT tasks to delay scheduling, then we
> need a control mechanism to enable/disable this mechanism on a per task
> or process basis. That way a RT system designer can prevent random
> user space tasks, which think they are the most important piece, from
> interfering with truly relevant RT tasks w/o going to chase down source
> code and hack it into submission.

BTW, I already showed[1] that any amount of delay this adds will build up
on top of the current worse case latency. So just saying "we only delay
5us which is in the noise" is incorrect when you have a system that has
a worse case latency of 30us. Because that 5us now makes it 35us.

Which is why I said this must be able to be disabled. I wouldn't want
this on any RT system, unless it can be configured as Thomas states
that it can be limited to specific tasks and is default off for
anything that the admin doesn't explicitly state it's for.

[1] https://lore.kernel.org/all/20250609165532.3265e142@gandalf.local.home/

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 7/7] Introduce a config option for scheduler time slice extension feature
  2025-07-01  3:12   ` K Prateek Nayak
@ 2025-07-01 17:47     ` Prakash Sangappa
  0 siblings, 0 replies; 24+ messages in thread
From: Prakash Sangappa @ 2025-07-01 17:47 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	tglx@linutronix.de, bigeasy@linutronix.de, vineethr@linux.ibm.com



> On Jun 30, 2025, at 8:12 PM, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> 
> Hello Prakash,
> 
> Couple of nits. inlined below.
> 
> On 7/1/2025 6:07 AM, Prakash Sangappa wrote:
>> Add a config option to enable schedule time slice extension.
>> Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
>> ---
>>  include/linux/entry-common.h |  2 +-
>>  include/linux/sched.h        |  8 ++++----
>>  init/Kconfig                 |  7 +++++++
>>  kernel/rseq.c                |  5 ++++-
>>  kernel/sched/core.c          | 12 ++++++------
>>  kernel/sched/debug.c         |  2 +-
>>  kernel/sched/syscalls.c      |  3 ++-
>>  7 files changed, 25 insertions(+), 14 deletions(-)
>> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
>> index d4fa952e394e..351c9dc159bc 100644
>> --- a/include/linux/entry-common.h
>> +++ b/include/linux/entry-common.h
>> @@ -402,7 +402,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
>>   CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
>>     /* reschedule if sched delay was granted */
>> - if (IS_ENABLED(CONFIG_RSEQ) && current->sched_time_delay)
>> + if (IS_ENABLED(CONFIG_SCHED_PREEMPT_DELAY) && current->sched_time_delay)
> 
> A wrapper around this would be nice. Something like
> sched_delay_resched()? It can also be reused in do_sched_yield() then.
> Thoughts?
> 

Ok, will do that.

>>   set_tsk_need_resched(current);
>>     if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> 
> [..snip..]
> 
>> diff --git a/init/Kconfig b/init/Kconfig
>> index ce76e913aa2b..2f5f603d175a 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1130,6 +1130,13 @@ config SCHED_MM_CID
>>   def_bool y
>>   depends on SMP && RSEQ
>>  +config SCHED_PREEMPT_DELAY
>> + def_bool y
>> + depends on SMP && RSEQ
> 
>        && SCHED_HRTICK
> 
> and then you can avoid the ugly "!IS_ENABLED(CONFIG_SCHED_HRTICK)"
> checks and keep all the SCHED_PREEMPT_DELAY bits in one place
> without the need to put them in the "#ifdef  CONFIG_SCHED_HRTICK"
> block.

Sure, I should have included SCHED_HRTICK. 
Will make that change.

> 
> Also, are we settling for 30us delay for PREEMPT_RT too or should
> this also include "&& !PREEMPT_RT"?

Yes, 30us is the default. This needs to be decided. 
If we decide that scheduler time slice extension should be disabled for PREEMPT_RT
then I will include !PREEMPT_RT.

> 
>> + help
>> +  This feature enables a thread to request extending its time slice on
>> +  the cpu by delaying preemption.
>> +
>>  config UCLAMP_TASK_GROUP
>>   bool "Utilization clamping per group of tasks"
>>   depends on CGROUP_SCHED
> 
> —

Thanks for looking into it.
-Prakash

> Thanks and Regards,
> Prateek
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 1/7] Sched: Scheduler time slice extension
  2025-07-01  8:42   ` Thomas Gleixner
  2025-07-01 10:56     ` Peter Zijlstra
@ 2025-07-01 18:40     ` Prakash Sangappa
  1 sibling, 0 replies; 24+ messages in thread
From: Prakash Sangappa @ 2025-07-01 18:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	bigeasy@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Jul 1, 2025, at 1:42 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Tue, Jul 01 2025 at 00:37, Prakash Sangappa wrote:
> 
> The subsystem prefix for the scheduler is 'sched:' It's not that hard to
> figure out.

Will fix that. 

> 
>> unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> -     unsigned long ti_work);
>> +     unsigned long ti_work,
>> +     bool irq);
> 
> No need for a new line

Ok.

> 
>> /**
>>  * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
>> @@ -316,7 +317,8 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>>  *    EXIT_TO_USER_MODE_WORK are set
>>  * 4) check that interrupts are still disabled
>>  */
>> -static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
>> +static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs,
>> + bool irq)
> 
> Ditto. 100 characters line width, please use it. And if you need a line
> break, please align the second lines arguments properly. This is
> documented:
> 
> https://urldefense.com/v3/__https://www.kernel.org/doc/html/latest/process/maintainer-tip.html__;!!ACWV5N9M2RV99hQ!KeHcAVTEgkkOTIZCEsRYgcQtDwYHZ4s77sjO9fKsjO530m5mavRMOnDMB_v6fm5TBvEfRPV2tcR2KTqt1865VZ0$

Got it.

> 
>> {
>> unsigned long ti_work;
>> 
>> @@ -327,7 +329,10 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
>> 
>> ti_work = read_thread_flags();
>> if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>> - ti_work = exit_to_user_mode_loop(regs, ti_work);
>> + ti_work = exit_to_user_mode_loop(regs, ti_work, irq);
>> +
>> + if (irq)
>> + rseq_delay_resched_fini();
> 
> This is an unconditional function call for every interrupt return and
> it's even done when the whole thing is known to be non-functional at
> compile time:

Will include IS_CONFIG check. 

> 
>> +void rseq_delay_resched_fini(void)
>> +{
>> +#ifdef CONFIG_SCHED_HRTICK
>  ....
>> +#endif
>> +}
> 
> Seriously?

Will make the new config CONFIG_SCHED_PREEMPT_DELAY dependent not SCHED_HRTICK,
So, I can remove these #ifdef, #endif.

> 
>> arch_exit_to_user_mode_prepare(regs, ti_work);
>> 
>> @@ -396,6 +401,10 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
>> 
>> CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
>> 
>> + /* reschedule if sched delay was granted */
> 
> Sentences start with an upper case letter and please use full words and
> not arbitrary abbreviations. This is neither twatter nor SMS.

Will fix. 

> 
>> + if (IS_ENABLED(CONFIG_RSEQ) && current->sched_time_delay)
>> + set_tsk_need_resched(current);
>> +
>> if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
>> if (WARN(irqs_disabled(), "syscall %lu left IRQs disabled", nr))
>> local_irq_enable();
>> @@ -411,7 +420,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
>> if (unlikely(work & SYSCALL_WORK_EXIT))
>> syscall_exit_work(regs, work);
>> local_irq_disable_exit_to_user();
>> - exit_to_user_mode_prepare(regs);
>> + exit_to_user_mode_prepare(regs, false);
>> }
>> 
>> /**
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 5bcf44ae6c79..9b4670d85131 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -338,6 +338,7 @@ extern int __must_check io_schedule_prepare(void);
>> extern void io_schedule_finish(int token);
>> extern long io_schedule_timeout(long timeout);
>> extern void io_schedule(void);
>> +extern void hrtick_local_start(u64 delay);
>> 
>> /* wrapper function to trace from this header file */
>> DECLARE_TRACEPOINT(sched_set_state_tp);
>> @@ -1263,6 +1264,7 @@ struct task_struct {
>> int softirq_context;
>> int irq_config;
>> #endif
>> + unsigned sched_time_delay:1;
> 
> Find an arbitrary place by rolling a dice and stick it in, right?

Sorry, merging issue. I had it next to the following 
>         unsigned                        in_thrashing:1;

Will fix it. 



> 
> There is already a section with bit fields in this struct. So it's more
> than bloody obvious to stick it there instead of creating a hole in the
> middle of task struct.
> 
>> #ifdef CONFIG_PREEMPT_RT
>> int softirq_disable_cnt;
>> #endif
>> @@ -2245,6 +2247,20 @@ static inline bool owner_on_cpu(struct task_struct *owner)
>> unsigned long sched_cpu_util(int cpu);
>> #endif /* CONFIG_SMP */
>> 
>> +#ifdef CONFIG_RSEQ
>> +
> 
> Remove these newlines please. They have zero value.

Ok

> 
>> +extern bool rseq_delay_resched(void);
>> +extern void rseq_delay_resched_fini(void);
>> +extern void rseq_delay_resched_tick(void);
>> +
>> +#else
> 
>> @@ -98,8 +99,12 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> 
>> local_irq_enable_exit_to_user(ti_work);
>> 
>> - if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>> - schedule();
>> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
>> +       if (irq && rseq_delay_resched())
> 
> unlikely() and again this results in an unconditional function call for
> every interrupt when CONFIG_RSEQ is enabled. A pointless exercise for
> the majority of use cases.
> 
> What's worse is that it breaks the LAZY semantics. I explained this to
> you before and this thing needs to be tied on the LAZY bit otherwise a
> SCHED_OTHER task can prevent a real-time task from running, which is
> fundamentally wrong.
> 
> So this wants to be:
> 
> if (likely(!irq || !rseq_delay_resched(ti_work))
>         schedule();
> 
> and
> 
> static inline bool rseq_delay_resched(unsigned long ti_work)
> {
>        // Set when all Kconfig conditions are met
>        if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>         return false;
> 
>        // Only NEED_RESCHED_LAZY can be delayed
> if (ti_work & _TIF_NEED_RESCHED)
>         return false;
> 
>        // NONE indicates that current::rseq == NULL
>        // PROBE indicates that current::rseq::flags needs to be
>        // evaluated.
>        // REQUESTED indicates that there was a successful request
>        // already.
>        if (likely(current->rseq_delay_resched != RSEQ_RESCHED_DELAY_PROBE))
>         return false;
> 
>        return __rseq_delay_resched();
> }
> 
> or something like that.

Will refactor the code.

> 
>> +bool rseq_delay_resched(void)
>> +{
>> + struct task_struct *t = current;
>> + u32 flags;
>> +
>> + if (!IS_ENABLED(CONFIG_SCHED_HRTICK))
>> + return false;
>> +
>> + if (!t->rseq)
>> + return false;
>> +
>> + if (t->sched_time_delay)
>> + return false;
> 
> Then all of the above conditions go away.
> 
>> + if (copy_from_user_nofault(&flags, &t->rseq->flags, sizeof(flags)))
>> + return false;
>> +
>> + if (!(flags & RSEQ_CS_FLAG_DELAY_RESCHED))
>> + return false;
>> +
>> + flags &= ~RSEQ_CS_FLAG_DELAY_RESCHED;
>> + if (copy_to_user_nofault(&t->rseq->flags, &flags, sizeof(flags)))
>> + return false;
>> +
>> + t->sched_time_delay = 1;
> 
> and this becomes:
> 
>        t->rseq_delay_resched = RSEQ_RESCHED_DELAY_REQUESTED;
> 
>> + return true;
>> +}
>> +
>> +void rseq_delay_resched_fini(void)
> 
> What's does _fini() mean here? Absolutely nothing. This wants to be a
> self explaining function name and see below
> 
>> +{
>> +#ifdef CONFIG_SCHED_HRTICK
> 
> You really are fond of pointless function calls. Obviously performance
> is not really a concern in your work.
> 
>> + extern void hrtick_local_start(u64 delay);
> 
> header files with prototypes exist for a reason....
> 
>> + struct task_struct *t = current;
>> + /*
>> + * IRQs off, guaranteed to return to userspace, start timer on this CPU
>> + * to limit the resched-overdraft.
>> + *
>> + * If your critical section is longer than 30 us you get to keep the
>> + * pieces.
>> + */
>> + if (t->sched_time_delay)
>> + hrtick_local_start(30 * NSEC_PER_USEC);
>> +#endif
> 
> This whole thing can be condensed into:
> 
> static inline void rseq_delay_resched_arm_timer(void)
> {
> if (!IS_ENABLED(CONFIG_RSEQ_RESCHED_DELAY))
>         return;
>        if (unlikely(current->rseq_delay_resched != RSEQ_RESCHED_DELAY_REQUESTED))
>         hrtick_local_start(...);
> }

Got it, will make the changes.

> 
>> +}
>> +
>> +void rseq_delay_resched_tick(void)
>> +{
>> +#ifdef CONFIG_SCHED_HRTICK
>> + struct task_struct *t = current;
>> +
>> + if (t->sched_time_delay)
>> + set_tsk_need_resched(t);
>> +#endif
> 
> Oh well.....
> 
>> +}
>> +
>> #ifdef CONFIG_DEBUG_RSEQ
>> 
>> /*
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4ad7cf3cfdca..c1b64879115f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -845,6 +845,8 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
>> 
>> WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
>> 
>> + rseq_delay_resched_tick();
>> +
>> rq_lock(rq, &rf);
>> update_rq_clock(rq);
>> rq->donor->sched_class->task_tick(rq, rq->curr, 1);
>> @@ -918,6 +920,16 @@ void hrtick_start(struct rq *rq, u64 delay)
>> 
>> #endif /* CONFIG_SMP */
>> 
>> +void hrtick_local_start(u64 delay)
> 
> How is this supposed to compile cleanly without a prototype?

Will fix.

Thanks for your comments.
-Prakash
> 
> Thanks,
> 
>        tglx


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 0/7] Scheduler time slice extension
  2025-07-01  4:30 ` [PATCH V6 0/7] Scheduler time slice extension K Prateek Nayak
@ 2025-07-01 19:04   ` Prakash Sangappa
  0 siblings, 0 replies; 24+ messages in thread
From: Prakash Sangappa @ 2025-07-01 19:04 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	tglx@linutronix.de, bigeasy@linutronix.de, vineethr@linux.ibm.com



> On Jun 30, 2025, at 9:30 PM, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> 
> Hello Prakash,
> 
> On 7/1/2025 6:07 AM, Prakash Sangappa wrote:
>> Prakash Sangappa (7):
>>   Sched: Scheduler time slice extension
>>   Sched: Indicate if thread got rescheduled
>>   Sched: Tunable to specify duration of time slice extension
>>   Sched: Add scheduler stat for cpu time slice extension
>>   Sched: Add tracepoint for sched time slice extension
>>   Add API to query supported rseq cs flags
>>   Introduce a config option for scheduler time slice extension feature
> 
> nit.
> 
> IMO, the ordering of these patches can be improved. Introduction of
> CONFIG_SCHED_PREEMPT_DELAY can come first followed by incrementally
> adding the scheduler bits, followed by "rseq: Add API to query supported
> rseq cs flags" and then finally introduce the bits that introduces
> "RSEQ_CS_FLAG_DELAY_RESCHED" and allows the user to set.

Ok, I can introduce the CONFIG_SCHED_PREEMPT_DELAY changes into patch 1 itself, instead of
It being a separate patch.   

Thanks,
-Prakash.

> 
> This way all the CONFIG_SCHED_PREEMPT_DELAY can live in one place and
> make it easier to review the entire series.
> 
> -- 
> Thanks and Regards,
> Prateek
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 7/7] Introduce a config option for scheduler time slice extension feature
  2025-07-01  8:46   ` Thomas Gleixner
@ 2025-07-01 19:04     ` Prakash Sangappa
  0 siblings, 0 replies; 24+ messages in thread
From: Prakash Sangappa @ 2025-07-01 19:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	bigeasy@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Jul 1, 2025, at 1:46 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Tue, Jul 01 2025 at 00:37, Prakash Sangappa wrote:
>> Add a config option to enable schedule time slice extension.
> 
> This is so backwards that it's not even funny anymore.
> 
>> +config SCHED_PREEMPT_DELAY
>> + def_bool y
>> + depends on SMP && RSEQ
> 
> and hilariously fails to include a SCHED_HRTICK dependency.
> 

Will fix that.
-Prakash

> Impressive….



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 1/7] Sched: Scheduler time slice extension
  2025-07-01 12:36       ` Thomas Gleixner
  2025-07-01 14:49         ` Steven Rostedt
@ 2025-07-03  5:38         ` Prakash Sangappa
  2025-07-03  8:32           ` Thomas Gleixner
  1 sibling, 1 reply; 24+ messages in thread
From: Prakash Sangappa @ 2025-07-03  5:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, bigeasy@linutronix.de,
	kprateek.nayak@amd.com, vineethr@linux.ibm.com



> On Jul 1, 2025, at 5:36 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Tue, Jul 01 2025 at 12:56, Peter Zijlstra wrote:
>> On Tue, Jul 01, 2025 at 10:42:36AM +0200, Thomas Gleixner wrote:
>> 
>>> What's worse is that it breaks the LAZY semantics. I explained this to
>>> you before and this thing needs to be tied on the LAZY bit otherwise a
>>> SCHED_OTHER task can prevent a real-time task from running, which is
>>> fundamentally wrong.
>> 
>> So here we disagree, I don't want this tied to LAZY.
>> 
>> SCHED_OTHER can already inhibit a RT task from getting ran by doing a
>> syscall, this syscall will have non-preemptible sections and the RT task
>> will get delayed.
>> 
>> I very much want this thing to be limited to a time frame where a
>> userspace critical section (this thing) is smaller than such a kernel
>> critical section.
>> 
>> That is, there should be no observable difference between the effects of
>> this new thing and a syscall doing preempt_disable().
>> 
>> That said; the reason I don't want this tied to LAZY is that RT itself
>> is not subject to LAZY and this then means that RT threads cannot make
>> use of this new facility, whereas I think it makes perfect sense for
>> them to use this.
> 
> Fair enough, but can we pretty please have this explained and documented
> and not just burried in some gory implementation details, which nobody
> will understand in 3 months down the road.
> 
> Also if we go there and allow non-RT tasks to delay scheduling, then we
> need a control mechanism to enable/disable this mechanism on a per task
> or process basis. That way a RT system designer can prevent random
> user space tasks, which think they are the most important piece, from
> interfering with truly relevant RT tasks w/o going to chase down source
> code and hack it into submission.

Could the per task  control mechanism be thru /proc?
Wonder how easy it will be to administer such control.

Alternatively, can we have a config option to apply to LAZY only?
This will not provide the finer  control as you suggested. 

Thanks,
-Prakash.

> 
> Thanks,
> 
>        tglx


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 1/7] Sched: Scheduler time slice extension
  2025-07-03  5:38         ` Prakash Sangappa
@ 2025-07-03  8:32           ` Thomas Gleixner
  0 siblings, 0 replies; 24+ messages in thread
From: Thomas Gleixner @ 2025-07-03  8:32 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, bigeasy@linutronix.de,
	kprateek.nayak@amd.com, vineethr@linux.ibm.com

On Thu, Jul 03 2025 at 05:38, Prakash Sangappa wrote:
>> On Jul 1, 2025, at 5:36 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Also if we go there and allow non-RT tasks to delay scheduling, then we
>> need a control mechanism to enable/disable this mechanism on a per task
>> or process basis. That way a RT system designer can prevent random
>> user space tasks, which think they are the most important piece, from
>> interfering with truly relevant RT tasks w/o going to chase down source
>> code and hack it into submission.
>
> Could the per task  control mechanism be thru /proc?

Is that a serious question?

> Wonder how easy it will be to administer such control.

Obviously it's horrible.

That's what prctl() is for. Plus a proper inheritance mechanism on
fork/exec along with a system wide default which can be controlled via
the kernel command line.

> Alternatively, can we have a config option to apply to LAZY only?
> This will not provide the finer  control as you suggested. 

A config option is not solving anything; it's just a lazy hack to avoid
the hard work of a proper and future proof ABI design.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2025-07-03  8:32 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-01  0:37 [PATCH V6 0/7] Scheduler time slice extension Prakash Sangappa
2025-07-01  0:37 ` [PATCH V6 1/7] Sched: " Prakash Sangappa
2025-07-01  8:42   ` Thomas Gleixner
2025-07-01 10:56     ` Peter Zijlstra
2025-07-01 11:28       ` K Prateek Nayak
2025-07-01 11:40         ` Peter Zijlstra
2025-07-01 12:36       ` Thomas Gleixner
2025-07-01 14:49         ` Steven Rostedt
2025-07-03  5:38         ` Prakash Sangappa
2025-07-03  8:32           ` Thomas Gleixner
2025-07-01 18:40     ` Prakash Sangappa
2025-07-01  0:37 ` [PATCH V6 2/7] Sched: Indicate if thread got rescheduled Prakash Sangappa
2025-07-01  0:37 ` [PATCH V6 3/7] Sched: Tunable to specify duration of time slice extension Prakash Sangappa
2025-07-01  3:59   ` K Prateek Nayak
2025-07-01  0:37 ` [PATCH V6 4/7] Sched: Add scheduler stat for cpu " Prakash Sangappa
2025-07-01  0:37 ` [PATCH V6 5/7] Sched: Add tracepoint for sched " Prakash Sangappa
2025-07-01  0:37 ` [PATCH V6 6/7] Add API to query supported rseq cs flags Prakash Sangappa
2025-07-01  0:37 ` [PATCH V6 7/7] Introduce a config option for scheduler time slice extension feature Prakash Sangappa
2025-07-01  3:12   ` K Prateek Nayak
2025-07-01 17:47     ` Prakash Sangappa
2025-07-01  8:46   ` Thomas Gleixner
2025-07-01 19:04     ` Prakash Sangappa
2025-07-01  4:30 ` [PATCH V6 0/7] Scheduler time slice extension K Prateek Nayak
2025-07-01 19:04   ` Prakash Sangappa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox