* [PATCH V5 0/6] Scheduler time slice extension
@ 2025-06-03 23:36 Prakash Sangappa
  2025-06-03 23:36 ` [PATCH V5 1/6] Sched: " Prakash Sangappa
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Prakash Sangappa @ 2025-06-03 23:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

A user thread can get preempted in the middle of executing a critical
section in user space while holding locks, which can have an undesirable
effect on performance. A way for the thread to request additional
execution time on the cpu, so that it can complete the critical section,
would be useful in such a scenario. The request can be made by setting a
bit in mapped memory, which the kernel can also access to check the
request and grant extra execution time on the cpu.
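
For illustration, the intended usage from user space would look roughly
like the sketch below. The flag name and bit value are the ones this
series defines; the rseq area is assumed to be the glibc-registered one,
located via __rseq_offset (patch 2 of the series refines the final yield
with an indication bit):

#define _GNU_SOURCE
#include <sched.h>		/* sched_yield() */
#include <sys/rseq.h>		/* glibc: __rseq_offset, struct rseq */

#define RSEQ_CS_FLAG_DELAY_RESCHED	(1U << 3)	/* from this series */

static __thread volatile struct rseq *rs;

static void init_rseq_area(void)
{
	/* thread's registered rseq area, as set up by glibc */
	rs = (volatile struct rseq *)
		((char *)__builtin_thread_pointer() + __rseq_offset);
}

static void run_critical_section(void)
{
	/* ask the kernel for extra time if preemption comes up */
	rs->flags |= RSEQ_CS_FLAG_DELAY_RESCHED;

	/* ... short user space critical section, e.g. under a lock ... */

	rs->flags &= ~RSEQ_CS_FLAG_DELAY_RESCHED;
	sched_yield();	/* give the cpu back if an extension was granted */
}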

There have been a couple of proposals[1][2] for such a feature, both of
which attempt to address the above scenario by granting one extra tick of
execution time. In the patch thread [1] posted by Steven Rostedt, there
is ample discussion about the need for this feature.

However, the concern has been that this can lead to abuse. One extra tick
can be a long time (about a millisecond or more). Peter Zijlstra in
response posted a prototype solution[5], which grants only a 50us
execution time extension. This is achieved with the help of a timer
started on that cpu at the time of granting the extra execution time.
When the timer fires, the thread is preempted if it is still running.

This patchset implements the above solution as suggested, using the
restartable sequences (rseq) structure for the API. Refer to [3][4] for
further discussion.

v5:
- Added #ifdef CONFIG_RSEQ and CONFIG_PROC_SYSCTL for the sysctl tunable
  changes (patch 3).
- Added #ifdef CONFIG_RSEQ for the scheduler stat changes (patch 4).
- Removed deprecated flags from the supported flags returned, as
  pointed out by Mathieu Desnoyers(patch 6).
- Added an IS_ENABLED(CONFIG_SCHED_HRTICK) check before returning the
  supported delay resched flags.

v4:
https://lore.kernel.org/all/20250513214554.4160454-1-prakash.sangappa@oracle.com
- Changed default sched delay extension time to 30us
- Added a patch to indicate to userspace if the thread got preempted in
  the extended cpu time granted. It uses another bit in the rseq cs flags.
  This should help the application check and avoid having to make a
  system call to yield the cpu, especially sched_yield(), as pointed out
  by Steven Rostedt.
- Moved tracepoint call towards end of exit_to_user_mode_loop().
- Added a pr_warn() message when the 'sched_preempt_delay_us' tunable is
  set higher than the default value of 30us.
- Added a patch with an API to query whether the sched time extension
  feature is supported. A new flag for the sys_rseq flags argument,
  called 'RSEQ_FLAG_QUERY_CS_FLAGS', is added, as suggested by Mathieu
  Desnoyers. It returns a bitmask of all the supported rseq cs flags in
  the rseq->flags field.

v3:
https://lore.kernel.org/all/20250502015955.3146733-1-prakash.sangappa@oracle.com
- Addressing review comments by Sebastian and Prateek.
  * Rename rseq_sched_delay -> sched_time_delay. Move its place in
    struct task_struct near other bits so it fits in existing word.
  * Use IS_ENABLED(CONFIG_RSEQ) instead of #ifdef to access
    'sched_time_delay'.
  * Removed rseq_delay_resched_tick() call from hrtick_clear().
  * Introduced a patch to add a tracepoint in exit_to_user_mode_loop(),
    suggested by Sebastian.
  * Added comments to describe RSEQ_CS_FLAG_DELAY_RESCHED flag.

v2:
https://lore.kernel.org/all/20250418193410.2010058-1-prakash.sangappa@oracle.com/
- Based on discussions in [3], expecting the user application to call
  sched_yield() to yield the cpu at the end of the critical section may
  not be advisable, as pointed out by Linus.

  So a check was added in the return path from a system call, to
  reschedule if a time slice extension was granted to the thread. The
  check could just as well be in the syscall entry path from user mode.
  This allows the application thread to call any system call to yield the
  cpu. Which system call should be suggested? getppid(2) works.

  Do we still need the change in sched_yield() to reschedule when the thread
  has current->rseq_sched_delay set?

- Added a patch to introduce a sysctl tunable parameter,
  'sched_preempt_delay_us', which specifies the duration of the time
  slice extension in microseconds (us). It can take a value in the range
  0 to 100; the default is 50us. Setting this tunable to 0 disables the
  scheduler time slice extension feature.

v1: 
https://lore.kernel.org/all/20250215005414.224409-1-prakash.sangappa@oracle.com/


[1] https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
[2] https://lore.kernel.org/lkml/1395767870-28053-1-git-send-email-khalid.aziz@oracle.com/
[3] https://lore.kernel.org/all/20250131225837.972218232@goodmis.org/
[4] https://lore.kernel.org/all/20241113000126.967713-1-prakash.sangappa@oracle.com/
[5] https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net/
[6] https://lore.kernel.org/all/1631147036-13597-1-git-send-email-prakash.sangappa@oracle.com/

Prakash Sangappa (6):
  Sched: Scheduler time slice extension
  Sched: Indicate if thread got rescheduled
  Sched: Tunable to specify duration of time slice extension
  Sched: Add scheduler stat for cpu time slice extension
  Sched: Add tracepoint for sched time slice extension
  Add API to query supported rseq cs flags

 include/linux/entry-common.h | 11 +++--
 include/linux/sched.h        | 30 +++++++++++
 include/trace/events/sched.h | 28 +++++++++++
 include/uapi/linux/rseq.h    | 19 +++++++
 kernel/entry/common.c        | 27 ++++++++--
 kernel/rseq.c                | 96 ++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          | 60 ++++++++++++++++++++++
 kernel/sched/debug.c         |  4 ++
 kernel/sched/syscalls.c      |  5 ++
 9 files changed, 272 insertions(+), 8 deletions(-)

-- 
2.43.5



* [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-03 23:36 [PATCH V5 0/6] Scheduler time slice extension Prakash Sangappa
@ 2025-06-03 23:36 ` Prakash Sangappa
  2025-06-04 14:31   ` Steven Rostedt
  2025-06-03 23:36 ` [PATCH V5 2/6] Sched: Indicate if thread got rescheduled Prakash Sangappa
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Prakash Sangappa @ 2025-06-03 23:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

Add support for a thread to request extending its execution time slice
on the cpu. The extra cpu time granted helps the thread complete the
critical section and drop any locks without getting preempted. The
thread requests this cpu time extension by setting a bit in the
restartable sequences (rseq) structure registered with the kernel.

The kernel grants a 30us extension on the cpu when it sees the bit set.
With the help of a timer, the kernel force-preempts the thread if it is
still running on the cpu when the 30us timer expires. The thread should
yield the cpu by making a system call after completing the critical
section.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
v4:
- Changed default sched delay extension time to 30us
v3:
- Rename rseq_sched_delay -> sched_time_delay and move near other bits
  in struct task_struct.
- Use IS_ENABLED() check to access 'sched_time_delay' instead of #ifdef
- Modify comment describing RSEQ_CS_FLAG_DELAY_RESCHED flag.
- Remove rseq_delay_resched_tick() call from hrtick_clear().
v2:
- Add check in syscall_exit_to_user_mode_prepare() and reschedule if
  thread has 'rseq_sched_delay' set.
---
 include/linux/entry-common.h | 11 +++++--
 include/linux/sched.h        | 16 +++++++++++
 include/uapi/linux/rseq.h    |  7 +++++
 kernel/entry/common.c        | 19 ++++++++----
 kernel/rseq.c                | 56 ++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          | 14 +++++++++
 kernel/sched/syscalls.c      |  5 ++++
 7 files changed, 120 insertions(+), 8 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index fc61d0205c97..cec343f95210 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -303,7 +303,8 @@ void arch_do_signal_or_restart(struct pt_regs *regs);
  * exit_to_user_mode_loop - do any pending work before leaving to user space
  */
 unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-				     unsigned long ti_work);
+				     unsigned long ti_work,
+				     bool irq);
 
 /**
  * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
@@ -315,7 +316,8 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
  *    EXIT_TO_USER_MODE_WORK are set
  * 4) check that interrupts are still disabled
  */
-static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs,
+						bool irq)
 {
 	unsigned long ti_work;
 
@@ -326,7 +328,10 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
 
 	ti_work = read_thread_flags();
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
-		ti_work = exit_to_user_mode_loop(regs, ti_work);
+		ti_work = exit_to_user_mode_loop(regs, ti_work, irq);
+
+	if (irq)
+		rseq_delay_resched_fini();
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c08fd199be4e..14bf0508bfca 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -339,6 +339,7 @@ extern int __must_check io_schedule_prepare(void);
 extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
+extern void hrtick_local_start(u64 delay);
 
 /* wrapper function to trace from this header file */
 DECLARE_TRACEPOINT(sched_set_state_tp);
@@ -1044,6 +1045,7 @@ struct task_struct {
 	/* delay due to memory thrashing */
 	unsigned                        in_thrashing:1;
 #endif
+	unsigned			sched_time_delay:1;
 #ifdef CONFIG_PREEMPT_RT
 	struct netdev_xmit		net_xmit;
 #endif
@@ -2249,6 +2251,20 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 unsigned long sched_cpu_util(int cpu);
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_RSEQ
+
+extern bool rseq_delay_resched(void);
+extern void rseq_delay_resched_fini(void);
+extern void rseq_delay_resched_tick(void);
+
+#else
+
+static inline bool rseq_delay_resched(void) { return false; }
+static inline void rseq_delay_resched_fini(void) { }
+static inline void rseq_delay_resched_tick(void) { }
+
+#endif
+
 #ifdef CONFIG_SCHED_CORE
 extern void sched_core_free(struct task_struct *tsk);
 extern void sched_core_fork(struct task_struct *p);
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac9..25fc636b17d5 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -26,6 +26,7 @@ enum rseq_cs_flags_bit {
 	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
+	RSEQ_CS_FLAG_DELAY_RESCHED_BIT		= 3,
 };
 
 enum rseq_cs_flags {
@@ -35,6 +36,8 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	=
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+	RSEQ_CS_FLAG_DELAY_RESCHED		=
+		(1U << RSEQ_CS_FLAG_DELAY_RESCHED_BIT),
 };
 
 /*
@@ -128,6 +131,10 @@ struct rseq {
 	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
 	 *     Inhibit instruction sequence block restart on migration for
 	 *     this thread.
+	 * - RSEQ_CS_FLAG_DELAY_RESCHED
+	 *     Request by user thread to delay preemption. With use
+	 *     of a timer, the kernel grants up to 30us of extra cpu time
+	 *     for this thread before it is rescheduled.
 	 */
 	__u32 flags;
 
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 20154572ede9..b26adccb32df 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -88,7 +88,8 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
  * @ti_work:	TIF work flags as read by the caller
  */
 __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-						     unsigned long ti_work)
+						     unsigned long ti_work,
+						     bool irq)
 {
 	/*
 	 * Before returning to user space ensure that all pending work
@@ -98,8 +99,12 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
-			schedule();
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+		       if (irq && rseq_delay_resched())
+			       clear_tsk_need_resched(current);
+		       else
+			       schedule();
+		}
 
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
@@ -184,6 +189,10 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
 
 	CT_WARN_ON(ct_state() != CT_STATE_KERNEL);
 
+	/* reschedule if sched delay was granted */
+	if (IS_ENABLED(CONFIG_RSEQ) && current->sched_time_delay)
+		set_tsk_need_resched(current);
+
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
 		if (WARN(irqs_disabled(), "syscall %lu left IRQs disabled", nr))
 			local_irq_enable();
@@ -204,7 +213,7 @@ static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *reg
 {
 	syscall_exit_to_user_mode_prepare(regs);
 	local_irq_disable_exit_to_user();
-	exit_to_user_mode_prepare(regs);
+	exit_to_user_mode_prepare(regs, false);
 }
 
 void syscall_exit_to_user_mode_work(struct pt_regs *regs)
@@ -228,7 +237,7 @@ noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
 noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 {
 	instrumentation_begin();
-	exit_to_user_mode_prepare(regs);
+	exit_to_user_mode_prepare(regs, true);
 	instrumentation_end();
 	exit_to_user_mode();
 }
diff --git a/kernel/rseq.c b/kernel/rseq.c
index b7a1ec327e81..dba44ca9f624 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -448,6 +448,62 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	force_sigsegv(sig);
 }
 
+bool rseq_delay_resched(void)
+{
+	struct task_struct *t = current;
+	u32 flags;
+
+	if (!IS_ENABLED(CONFIG_SCHED_HRTICK))
+		return false;
+
+	if (!t->rseq)
+		return false;
+
+	if (t->sched_time_delay)
+		return false;
+
+	if (copy_from_user_nofault(&flags, &t->rseq->flags, sizeof(flags)))
+		return false;
+
+	if (!(flags & RSEQ_CS_FLAG_DELAY_RESCHED))
+		return false;
+
+	flags &= ~RSEQ_CS_FLAG_DELAY_RESCHED;
+	if (copy_to_user_nofault(&t->rseq->flags, &flags, sizeof(flags)))
+		return false;
+
+	t->sched_time_delay = 1;
+
+	return true;
+}
+
+void rseq_delay_resched_fini(void)
+{
+#ifdef CONFIG_SCHED_HRTICK
+	extern void hrtick_local_start(u64 delay);
+	struct task_struct *t = current;
+	/*
+	 * IRQs off, guaranteed to return to userspace, start timer on this CPU
+	 * to limit the resched-overdraft.
+	 *
+	 * If your critical section is longer than 30 us you get to keep the
+	 * pieces.
+	 */
+	if (t->sched_time_delay)
+		hrtick_local_start(30 * NSEC_PER_USEC);
+#endif
+}
+
+void rseq_delay_resched_tick(void)
+{
+#ifdef CONFIG_SCHED_HRTICK
+	struct task_struct *t = current;
+
+	if (t->sched_time_delay)
+		set_tsk_need_resched(t);
+#endif
+}
+
 #ifdef CONFIG_DEBUG_RSEQ
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4de24eefe661..8c8960245ec0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -844,6 +844,8 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
 
 	WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
 
+	rseq_delay_resched_tick();
+
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
 	rq->donor->sched_class->task_tick(rq, rq->curr, 1);
@@ -917,6 +919,16 @@ void hrtick_start(struct rq *rq, u64 delay)
 
 #endif /* CONFIG_SMP */
 
+void hrtick_local_start(u64 delay)
+{
+	struct rq *rq = this_rq();
+	struct rq_flags rf;
+
+	rq_lock(rq, &rf);
+	hrtick_start(rq, delay);
+	rq_unlock(rq, &rf);
+}
+
 static void hrtick_rq_init(struct rq *rq)
 {
 #ifdef CONFIG_SMP
@@ -6722,6 +6734,8 @@ static void __sched notrace __schedule(int sched_mode)
 picked:
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
+	if (IS_ENABLED(CONFIG_RSEQ))
+		prev->sched_time_delay = 0;
 	rq->last_seen_need_resched_ns = 0;
 
 	is_switch = prev != next;
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index cd38f4e9899d..1b2b64fe0fb1 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -1378,6 +1378,11 @@ static void do_sched_yield(void)
  */
 SYSCALL_DEFINE0(sched_yield)
 {
+	if (IS_ENABLED(CONFIG_RSEQ) && current->sched_time_delay) {
+		schedule();
+		return 0;
+	}
+
 	do_sched_yield();
 	return 0;
 }
-- 
2.43.5



* [PATCH V5 2/6] Sched: Indicate if thread got rescheduled
  2025-06-03 23:36 [PATCH V5 0/6] Scheduler time slice extension Prakash Sangappa
  2025-06-03 23:36 ` [PATCH V5 1/6] Sched: " Prakash Sangappa
@ 2025-06-03 23:36 ` Prakash Sangappa
  2025-06-03 23:36 ` [PATCH V5 3/6] Sched: Tunable to specify duration of time slice extension Prakash Sangappa
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Prakash Sangappa @ 2025-06-03 23:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

Use a bit in the rseq flags to indicate if the thread got rescheduled
after the cpu time extension was granted. The user thread can check this
flag before calling sched_yield() to yield the cpu.
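
For illustration, the unlock-side pattern in user space could look like
this minimal sketch (flag names and bit values as defined by this series;
'rs' is assumed to point at the thread's registered rseq area, and this
mirrors the unextend() helper in the extend-sched.c program that appears
later in this thread):

#include <sched.h>		/* sched_yield() */
#include <stdint.h>
#include <sys/rseq.h>		/* struct rseq */

#define RSEQ_CS_FLAG_DELAY_RESCHED	(1U << 3)
#define RSEQ_CS_FLAG_RESCHEDULED	(1U << 4)

static void critical_section_end(volatile struct rseq *rs)
{
	uint32_t flags = rs->flags;

	/* drop the request along with the kernel's indication bit */
	rs->flags = flags &
		~(RSEQ_CS_FLAG_DELAY_RESCHED | RSEQ_CS_FLAG_RESCHEDULED);

	/* yield only when the kernel flagged the grant; saves a syscall */
	if (flags & RSEQ_CS_FLAG_RESCHEDULED)
		sched_yield();
}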

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/linux/sched.h     |  2 ++
 include/uapi/linux/rseq.h | 10 ++++++++++
 kernel/rseq.c             | 20 ++++++++++++++++++++
 kernel/sched/core.c       |  3 +--
 4 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 14bf0508bfca..71e6c8221c1e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2256,12 +2256,14 @@ unsigned long sched_cpu_util(int cpu);
 extern bool rseq_delay_resched(void);
 extern void rseq_delay_resched_fini(void);
 extern void rseq_delay_resched_tick(void);
+extern void rseq_delay_schedule(void);
 
 #else
 
 static inline bool rseq_delay_resched(void) { return false; }
 static inline void rseq_delay_resched_fini(void) { }
 static inline void rseq_delay_resched_tick(void) { }
+static inline void rseq_delay_schedule(void) { }
 
 #endif
 
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 25fc636b17d5..f4813d931387 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -27,6 +27,7 @@ enum rseq_cs_flags_bit {
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
 	RSEQ_CS_FLAG_DELAY_RESCHED_BIT		= 3,
+	RSEQ_CS_FLAG_RESCHEDULED_BIT		= 4,
 };
 
 enum rseq_cs_flags {
@@ -38,6 +39,9 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
 	RSEQ_CS_FLAG_DELAY_RESCHED		=
 		(1U << RSEQ_CS_FLAG_DELAY_RESCHED_BIT),
+	RSEQ_CS_FLAG_RESCHEDULED		=
+		(1U << RSEQ_CS_FLAG_RESCHEDULED_BIT),
+
 };
 
 /*
@@ -135,6 +139,12 @@ struct rseq {
 	 *     Request by user thread to delay preemption. With use
 	 *     of a timer, the kernel grants up to 30us of extra cpu time
 	 *     for this thread before it is rescheduled.
+	 * - RSEQ_CS_FLAG_RESCHEDULED
+	 *     Set by kernel if the thread was rescheduled in the extra time
+	 *     granted due to the RSEQ_CS_FLAG_DELAY_RESCHED request. This
+	 *     bit is checked by the thread before calling sched_yield() to
+	 *     yield the cpu. The user thread clears this bit when setting
+	 *     RSEQ_CS_FLAG_DELAY_RESCHED to request a preemption delay.
 	 */
 	__u32 flags;
 
diff --git a/kernel/rseq.c b/kernel/rseq.c
index dba44ca9f624..9355654e9b38 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -504,6 +504,26 @@ void rseq_delay_resched_tick(void)
 #endif
 }
 
+void rseq_delay_schedule(void)
+{
+#ifdef CONFIG_SCHED_HRTICK
+	struct task_struct *t = current;
+	u32 flags;
+
+	if (t->sched_time_delay) {
+		t->sched_time_delay = 0;
+		if (!t->rseq)
+			return;
+		if (copy_from_user_nofault(&flags, &t->rseq->flags,
+					   sizeof(flags)))
+			return;
+		flags |= RSEQ_CS_FLAG_RESCHEDULED;
+		copy_to_user_nofault(&t->rseq->flags, &flags,
+				     sizeof(flags));
+	}
+#endif
+}
+
 #ifdef CONFIG_DEBUG_RSEQ
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8c8960245ec0..86583fb72914 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6734,8 +6734,7 @@ static void __sched notrace __schedule(int sched_mode)
 picked:
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
-	if (IS_ENABLED(CONFIG_RSEQ))
-		prev->sched_time_delay = 0;
+	rseq_delay_schedule();
 	rq->last_seen_need_resched_ns = 0;
 
 	is_switch = prev != next;
-- 
2.43.5



* [PATCH V5 3/6] Sched: Tunable to specify duration of time slice extension
  2025-06-03 23:36 [PATCH V5 0/6] Scheduler time slice extension Prakash Sangappa
  2025-06-03 23:36 ` [PATCH V5 1/6] Sched: " Prakash Sangappa
  2025-06-03 23:36 ` [PATCH V5 2/6] Sched: Indicate if thread got rescheduled Prakash Sangappa
@ 2025-06-03 23:36 ` Prakash Sangappa
  2025-06-03 23:36 ` [PATCH V5 4/6] Sched: Add scheduler stat for cpu " Prakash Sangappa
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Prakash Sangappa @ 2025-06-03 23:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

Add a tunable to specify the duration of the scheduler time slice
extension. The default is 30us and the maximum value that can be
specified is 100us. Setting it to 0 disables the scheduler time slice
extension.
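
For example, a privileged process could adjust or disable the grant at
run time by writing the tunable (a sketch; the procfs path matches the
sysctl registered by this patch):

#include <stdio.h>

static int set_preempt_delay(unsigned int us)
{
	FILE *f = fopen("/proc/sys/kernel/sched_preempt_delay_us", "w");

	if (!f)
		return -1;
	/* 0 disables the feature; values above 30 trigger a pr_warn() */
	fprintf(f, "%u\n", us);
	return fclose(f);
}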

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
v5:
- Added #ifdef CONFIG_RSEQ & CONFIG_PROC_SYSCTL 
---
 include/linux/sched.h     |  5 +++++
 include/uapi/linux/rseq.h |  5 +++--
 kernel/rseq.c             |  7 +++++--
 kernel/sched/core.c       | 40 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 71e6c8221c1e..14069ebe26e2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -407,6 +407,11 @@ static inline void sched_domains_mutex_lock(void) { }
 static inline void sched_domains_mutex_unlock(void) { }
 #endif
 
+#ifdef CONFIG_RSEQ
+/* Scheduler time slice extension */
+extern unsigned int sysctl_sched_preempt_delay_us;
+#endif
+
 struct sched_param {
 	int sched_priority;
 };
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index f4813d931387..015534f064af 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -137,8 +137,9 @@ struct rseq {
 	 *     this thread.
 	 * - RSEQ_CS_FLAG_DELAY_RESCHED
 	 *     Request by user thread to delay preemption. With use
-	 *     of a timer, the kernel grants up to 30us of extra cpu time
-	 *     for this thread before it is rescheduled.
+	 *     of a timer, the kernel grants extra cpu time up to the tunable
+	 *     'sched_preempt_delay_us' value for this thread before it gets
+	 *     rescheduled.
 	 * - RSEQ_CS_FLAG_RESCHEDULED
 	 *     Set by kernel if the thread was rescheduled in the extra time
 	 *     granted due to the RSEQ_CS_FLAG_DELAY_RESCHED request. This
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9355654e9b38..44d0f3ae0cd3 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -456,6 +456,8 @@ bool rseq_delay_resched(void)
 	if (!IS_ENABLED(CONFIG_SCHED_HRTICK))
 		return false;
 
+	if (!sysctl_sched_preempt_delay_us)
+		return false;
 	if (!t->rseq)
 		return false;
 
@@ -489,8 +491,9 @@ void rseq_delay_resched_fini(void)
 	 * If your critical section is longer than 30 us you get to keep the
 	 * pieces.
 	 */
-	if (t->sched_time_delay)
-		hrtick_local_start(30 * NSEC_PER_USEC);
+	if (sysctl_sched_preempt_delay_us && t->sched_time_delay)
+		hrtick_local_start(sysctl_sched_preempt_delay_us *
+				   NSEC_PER_USEC);
 #endif
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 86583fb72914..e5307389b30a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -148,6 +148,17 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
  */
 __read_mostly unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
 
+#ifdef CONFIG_RSEQ
+/*
+ * Scheduler time slice extension, duration in microsecs.
+ * Max value allowed 100us, default is 30us.
+ * If set to 0, scheduler time slice extension is disabled.
+ */
+#define SCHED_PREEMPT_DELAY_DEFAULT_US	30
+__read_mostly unsigned int sysctl_sched_preempt_delay_us =
+	SCHED_PREEMPT_DELAY_DEFAULT_US;
+#endif
+
 __read_mostly int scheduler_running;
 
 #ifdef CONFIG_SCHED_CORE
@@ -4664,6 +4675,24 @@ static int sysctl_schedstats(const struct ctl_table *table, int write, void *buf
 #endif /* CONFIG_PROC_SYSCTL */
 #endif /* CONFIG_SCHEDSTATS */
 
+#ifdef CONFIG_PROC_SYSCTL
+#ifdef CONFIG_RSEQ
+static int sysctl_sched_preempt_delay(const struct ctl_table *table, int write,
+		void *buffer, size_t *lenp, loff_t *ppos)
+{
+	int err;
+
+	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (sysctl_sched_preempt_delay_us > SCHED_PREEMPT_DELAY_DEFAULT_US)
+		pr_warn("Sched preemption delay time set higher than the default value %d us\n",
+			SCHED_PREEMPT_DELAY_DEFAULT_US);
+	return err;
+}
+#endif /* CONFIG_RSEQ */
+#endif /* CONFIG_PROC_SYSCTL */
+
 #ifdef CONFIG_SYSCTL
 static const struct ctl_table sched_core_sysctls[] = {
 #ifdef CONFIG_SCHEDSTATS
@@ -4711,6 +4740,17 @@ static const struct ctl_table sched_core_sysctls[] = {
 		.extra2		= SYSCTL_FOUR,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_RSEQ
+	{
+		.procname	= "sched_preempt_delay_us",
+		.data		= &sysctl_sched_preempt_delay_us,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_preempt_delay,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE_HUNDRED,
+	},
+#endif /* CONFIG_RSEQ */
 };
 static int __init sched_core_sysctl_init(void)
 {
-- 
2.43.5



* [PATCH V5 4/6] Sched: Add scheduler stat for cpu time slice extension
  2025-06-03 23:36 [PATCH V5 0/6] Scheduler time slice extension Prakash Sangappa
                   ` (2 preceding siblings ...)
  2025-06-03 23:36 ` [PATCH V5 3/6] Sched: Tunable to specify duration of time slice extension Prakash Sangappa
@ 2025-06-03 23:36 ` Prakash Sangappa
  2025-06-03 23:36 ` [PATCH V5 5/6] Sched: Add tracepoint for sched " Prakash Sangappa
  2025-06-03 23:36 ` [PATCH V5 6/6] Add API to query supported rseq cs flags Prakash Sangappa
  5 siblings, 0 replies; 22+ messages in thread
From: Prakash Sangappa @ 2025-06-03 23:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

Add a scheduler stat to record the number of times a thread was granted
the cpu time slice extension.

Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
v5:
- Added #ifdef CONFIG_RSEQ
---
 include/linux/sched.h | 7 +++++++
 kernel/rseq.c         | 1 +
 kernel/sched/core.c   | 7 +++++++
 kernel/sched/debug.c  | 4 ++++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 14069ebe26e2..6c2e9d30c2fc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -340,6 +340,9 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 extern void hrtick_local_start(u64 delay);
+#ifdef CONFIG_RSEQ
+extern void update_stat_preempt_delayed(struct task_struct *t);
+#endif
 
 /* wrapper function to trace from this header file */
 DECLARE_TRACEPOINT(sched_set_state_tp);
@@ -566,6 +569,10 @@ struct sched_statistics {
 	u64				nr_wakeups_passive;
 	u64				nr_wakeups_idle;
 
+#ifdef CONFIG_RSEQ
+	u64				nr_preempt_delay_granted;
+#endif
+
 #ifdef CONFIG_SCHED_CORE
 	u64				core_forceidle_sum;
 #endif
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 44d0f3ae0cd3..c4bc52f8ba9c 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -475,6 +475,7 @@ bool rseq_delay_resched(void)
 		return false;
 
 	t->sched_time_delay = 1;
+	update_stat_preempt_delayed(t);
 
 	return true;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e5307389b30a..95fce557a294 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -940,6 +940,13 @@ void hrtick_local_start(u64 delay)
 	rq_unlock(rq, &rf);
 }
 
+#ifdef CONFIG_RSEQ
+void update_stat_preempt_delayed(struct task_struct *t)
+{
+	schedstat_inc(t->stats.nr_preempt_delay_granted);
+}
+#endif
+
 static void hrtick_rq_init(struct rq *rq)
 {
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4cba21f5d24d..b178cb0e2904 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1217,6 +1217,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_wakeups_passive);
 		P_SCHEDSTAT(nr_wakeups_idle);
 
+#ifdef CONFIG_RSEQ
+		P_SCHEDSTAT(nr_preempt_delay_granted);
+#endif
+
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
 			avg_atom = div64_ul(avg_atom, nr_switches);
-- 
2.43.5



* [PATCH V5 5/6] Sched: Add tracepoint for sched time slice extension
  2025-06-03 23:36 [PATCH V5 0/6] Scheduler time slice extension Prakash Sangappa
                   ` (3 preceding siblings ...)
  2025-06-03 23:36 ` [PATCH V5 4/6] Sched: Add scheduler stat for cpu " Prakash Sangappa
@ 2025-06-03 23:36 ` Prakash Sangappa
  2025-06-04 14:36   ` Steven Rostedt
  2025-06-03 23:36 ` [PATCH V5 6/6] Add API to query supported rseq cs flags Prakash Sangappa
  5 siblings, 1 reply; 22+ messages in thread
From: Prakash Sangappa @ 2025-06-03 23:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

Trace the thread's preemption getting delayed, which can occur if the
running thread requested extra time on the cpu. Also indicate the
NEED_RESCHED flags that were set on the thread and got cleared.
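
For example, the new event can be enabled through tracefs like any other
sched tracepoint (a sketch; the usual tracefs mount point is assumed):

#include <stdio.h>

static int enable_sched_delay_resched_event(void)
{
	FILE *f = fopen("/sys/kernel/tracing/events/sched/"
			"sched_delay_resched/enable", "w");

	if (!f)
		return -1;
	/* events then show up in /sys/kernel/tracing/trace_pipe */
	fputs("1", f);
	return fclose(f);
}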

Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
 include/trace/events/sched.h | 28 ++++++++++++++++++++++++++++
 kernel/entry/common.c        | 12 ++++++++++--
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 8994e97d86c1..4aa04044b14a 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -296,6 +296,34 @@ TRACE_EVENT(sched_migrate_task,
 		  __entry->orig_cpu, __entry->dest_cpu)
 );
 
+/*
+ * Tracepoint for delayed resched requested by task:
+ */
+TRACE_EVENT(sched_delay_resched,
+
+	TP_PROTO(struct task_struct *p, unsigned int resched_flg),
+
+	TP_ARGS(p, resched_flg),
+
+	TP_STRUCT__entry(
+		__array( char, comm, TASK_COMM_LEN	)
+		__field( pid_t, pid			)
+		__field( int, cpu			)
+		__field( int, flg			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->cpu 		= task_cpu(p);
+		__entry->flg		= resched_flg;
+	),
+
+	TP_printk("comm=%s pid=%d cpu=%d resched_flg_cleared=0x%x",
+		__entry->comm, __entry->pid, __entry->cpu, __entry->flg)
+
+);
+
 DECLARE_EVENT_CLASS(sched_process_template,
 
 	TP_PROTO(struct task_struct *p),
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index b26adccb32df..cd0f076920fd 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -12,6 +12,7 @@
 
 #include "common.h"
 
+#include <trace/events/sched.h>
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
@@ -91,6 +92,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 						     unsigned long ti_work,
 						     bool irq)
 {
+	unsigned long ti_work_cleared = 0;
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
@@ -100,10 +102,12 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		local_irq_enable_exit_to_user(ti_work);
 
 		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
-		       if (irq && rseq_delay_resched())
+		       if (irq && rseq_delay_resched()) {
 			       clear_tsk_need_resched(current);
-		       else
+			       ti_work_cleared = ti_work;
+		       } else {
 			       schedule();
+		       }
 		}
 
 		if (ti_work & _TIF_UPROBE)
@@ -134,6 +138,10 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		ti_work = read_thread_flags();
 	}
 
+	if (ti_work_cleared)
+		trace_sched_delay_resched(current, ti_work_cleared &
+			(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY));
+
 	/* Return the latest work state for arch_exit_to_user_mode() */
 	return ti_work;
 }
-- 
2.43.5



* [PATCH V5 6/6] Add API to query supported rseq cs flags
  2025-06-03 23:36 [PATCH V5 0/6] Scheduler time slice extension Prakash Sangappa
                   ` (4 preceding siblings ...)
  2025-06-03 23:36 ` [PATCH V5 5/6] Sched: Add tracepoint for sched " Prakash Sangappa
@ 2025-06-03 23:36 ` Prakash Sangappa
  5 siblings, 0 replies; 22+ messages in thread
From: Prakash Sangappa @ 2025-06-03 23:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, rostedt, mathieu.desnoyers, tglx, bigeasy, kprateek.nayak,
	vineethr

For the API, add a new flag to the sys_rseq 'flags' argument called
RSEQ_FLAG_QUERY_CS_FLAGS.

When this flag is passed, the kernel returns a bitmask of all the
supported rseq cs flags in the user-provided rseq struct's 'flags'
member.
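
For example (a sketch: the arguments are assumed to be the same rseq
pointer, length and signature the thread used at registration time, and
the kernel writes the bitmask into rseq->flags on success):

#include <stdint.h>
#include <sys/rseq.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef RSEQ_FLAG_QUERY_CS_FLAGS
#define RSEQ_FLAG_QUERY_CS_FLAGS	(1 << 1)	/* from this patch */
#endif

static uint32_t query_rseq_cs_flags(struct rseq *rs, uint32_t len, uint32_t sig)
{
	if (syscall(__NR_rseq, rs, len, RSEQ_FLAG_QUERY_CS_FLAGS, sig))
		return 0;	/* -EINVAL: kernel without this support */
	return rs->flags;	/* bitmask of supported rseq cs flags */
}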

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
---
v5:
- Removed deprecated flags from supported cs flags returned.
- Added IS_ENABLED(CONFIG_SCHED_HRTICK)
---
 include/uapi/linux/rseq.h |  1 +
 kernel/rseq.c             | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 015534f064af..44baea9dd10a 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -20,6 +20,7 @@ enum rseq_cpu_id_state {
 
 enum rseq_flags {
 	RSEQ_FLAG_UNREGISTER = (1 << 0),
+	RSEQ_FLAG_QUERY_CS_FLAGS = (1 << 1),
 };
 
 enum rseq_cs_flags_bit {
diff --git a/kernel/rseq.c b/kernel/rseq.c
index c4bc52f8ba9c..d2b010dccff5 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -576,6 +576,22 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		return 0;
 	}
 
+	/*
+	 * Return supported rseq_cs flags.
+	 */
+	if (flags & RSEQ_FLAG_QUERY_CS_FLAGS) {
+		u32 rseq_csflags = RSEQ_CS_FLAG_DELAY_RESCHED |
+				   RSEQ_CS_FLAG_RESCHEDULED;
+		/* Following is required for delay resched support */
+		if (!IS_ENABLED(CONFIG_SCHED_HRTICK))
+			return -EINVAL;
+		if (!rseq)
+			return -EINVAL;
+		if (copy_to_user(&rseq->flags, &rseq_csflags, sizeof(u32)))
+			return -EFAULT;
+		return 0;
+	}
+
 	if (unlikely(flags))
 		return -EINVAL;
 
-- 
2.43.5



* Re: [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-03 23:36 ` [PATCH V5 1/6] Sched: " Prakash Sangappa
@ 2025-06-04 14:31   ` Steven Rostedt
  2025-06-04 14:54     ` Sebastian Andrzej Siewior
  2025-06-04 17:09     ` Prakash Sangappa
  0 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-06-04 14:31 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: linux-kernel, peterz, mathieu.desnoyers, tglx, bigeasy,
	kprateek.nayak, vineethr

On Tue,  3 Jun 2025 23:36:49 +0000
Prakash Sangappa <prakash.sangappa@oracle.com> wrote:

> @@ -2249,6 +2251,20 @@ static inline bool owner_on_cpu(struct task_struct *owner)
>  unsigned long sched_cpu_util(int cpu);
>  #endif /* CONFIG_SMP */
>  
> +#ifdef CONFIG_RSEQ
> +
> +extern bool rseq_delay_resched(void);
> +extern void rseq_delay_resched_fini(void);
> +extern void rseq_delay_resched_tick(void);
> +
> +#else
> +
> +static inline bool rseq_delay_resched(void) { return false; }
> +static inline void rseq_delay_resched_fini(void) { }
> +static inline void rseq_delay_resched_tick(void) { }
> +
> +#endif
> +

Can we add a config to make this optional? I don't want to allow any
task to have an extended timeslice over RT tasks, regardless of how small
the delay is.

-- Steve


* Re: [PATCH V5 5/6] Sched: Add tracepoint for sched time slice extension
  2025-06-03 23:36 ` [PATCH V5 5/6] Sched: Add tracepoint for sched " Prakash Sangappa
@ 2025-06-04 14:36   ` Steven Rostedt
  2025-06-04 17:10     ` Prakash Sangappa
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-06-04 14:36 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: linux-kernel, peterz, mathieu.desnoyers, tglx, bigeasy,
	kprateek.nayak, vineethr

On Tue,  3 Jun 2025 23:36:53 +0000
Prakash Sangappa <prakash.sangappa@oracle.com> wrote:

> @@ -134,6 +138,10 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>  		ti_work = read_thread_flags();
>  	}
>  
> +	if (ti_work_cleared)
> +		trace_sched_delay_resched(current, ti_work_cleared &
> +			(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY));
> +

Please make the above into a conditional tracepoint, and you can also
just pass in ti_work_cleared. No reason to do that masking outside the
tracepoint, as the above is always checked regardless of whether tracing
is enabled or not.

TRACE_EVENT_CONDITION(sched_delay_resched,

	TP_PROTO(struct task_struct *p, unsigned int ti_work_cleared),

	TP_ARGS(p, ti_work_cleared),

	TP_CONDITION(ti_work_cleared),

	TP_STRUCT__entry(
		__array( char, comm, TASK_COMM_LEN	)
		__field( pid_t, pid			)
		__field( int, cpu			)
		__field( int, flg			)
	),

	TP_fast_assign(
		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
		__entry->pid		= p->pid;
		__entry->cpu 		= task_cpu(p);
		__entry->flg		= ti_work_cleared & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY);
	),


-- Steve


* Re: [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-04 14:31   ` Steven Rostedt
@ 2025-06-04 14:54     ` Sebastian Andrzej Siewior
  2025-06-04 17:29       ` Prakash Sangappa
  2025-06-04 17:09     ` Prakash Sangappa
  1 sibling, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-04 14:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Prakash Sangappa, linux-kernel, peterz, mathieu.desnoyers, tglx,
	kprateek.nayak, vineethr

On 2025-06-04 10:31:06 [-0400], Steven Rostedt wrote:
> On Tue,  3 Jun 2025 23:36:49 +0000
> Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
> 
> > @@ -2249,6 +2251,20 @@ static inline bool owner_on_cpu(struct task_struct *owner)
> >  unsigned long sched_cpu_util(int cpu);
> >  #endif /* CONFIG_SMP */
> >  
> > +#ifdef CONFIG_RSEQ
> > +
> > +extern bool rseq_delay_resched(void);
> > +extern void rseq_delay_resched_fini(void);
> > +extern void rseq_delay_resched_tick(void);
> > +
> > +#else
> > +
> > +static inline bool rseq_delay_resched(void) { return false; }
> > +static inline void rseq_delay_resched_fini(void) { }
> > +static inline void rseq_delay_resched_tick(void) { }
> > +
> > +#endif
> > +
> 
> Can we add a config to make this optional. I don't want to allow any task
> to have an extended timeslice over RT tasks regardless of how small the
> delay is.

I asked to get RT tasks excluded from this extension and it was ignored.
Maybe there were benefits mentioned somewhere…

> -- Steve

Sebastian


* Re: [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-04 14:31   ` Steven Rostedt
  2025-06-04 14:54     ` Sebastian Andrzej Siewior
@ 2025-06-04 17:09     ` Prakash Sangappa
  1 sibling, 0 replies; 22+ messages in thread
From: Prakash Sangappa @ 2025-06-04 17:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	mathieu.desnoyers@efficios.com, tglx@linutronix.de,
	bigeasy@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Jun 4, 2025, at 7:31 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Tue,  3 Jun 2025 23:36:49 +0000
> Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
> 
>> @@ -2249,6 +2251,20 @@ static inline bool owner_on_cpu(struct task_struct *owner)
>> unsigned long sched_cpu_util(int cpu);
>> #endif /* CONFIG_SMP */
>> 
>> +#ifdef CONFIG_RSEQ
>> +
>> +extern bool rseq_delay_resched(void);
>> +extern void rseq_delay_resched_fini(void);
>> +extern void rseq_delay_resched_tick(void);
>> +
>> +#else
>> +
>> +static inline bool rseq_delay_resched(void) { return false; }
>> +static inline void rseq_delay_resched_fini(void) { }
>> +static inline void rseq_delay_resched_tick(void) { }
>> +
>> +#endif
>> +
> 
> Can we add a config to make this optional. I don't want to allow any task
> to have an extended timeslice over RT tasks regardless of how small the
> delay is.

Are you suggesting including a CONFIG to enable this feature, or making
it applicable to PREEMPT_LAZY tasks only?


> 
> -- Steve



* Re: [PATCH V5 5/6] Sched: Add tracepoint for sched time slice extension
  2025-06-04 14:36   ` Steven Rostedt
@ 2025-06-04 17:10     ` Prakash Sangappa
  0 siblings, 0 replies; 22+ messages in thread
From: Prakash Sangappa @ 2025-06-04 17:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	mathieu.desnoyers@efficios.com, tglx@linutronix.de,
	bigeasy@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Jun 4, 2025, at 7:36 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Tue,  3 Jun 2025 23:36:53 +0000
> Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
> 
>> @@ -134,6 +138,10 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> ti_work = read_thread_flags();
>> }
>> 
>> + if (ti_work_cleared)
>> + trace_sched_delay_resched(current, ti_work_cleared &
>> + (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY));
>> +
> 
> Please make the above into a conditional tracepoint and you can also just
> pass in ti_work_cleared. No reason to do that outside the tracepoint. As
> the above is always checked regardless if tracing is enabled or not.
> 
> TRACE_EVENT_CONDITION(sched_delay_resched,
> 
> TP_PROTO(struct task_struct *p, unsigned int ti_work_cleared),
> 
> TP_ARGS(p, ti_work_cleared),
> 
> TP_CONDITION(ti_work_cleared),
> 
> TP_STRUCT__entry(
> __array( char, comm, TASK_COMM_LEN )
> __field( pid_t, pid )
> __field( int, cpu )
> __field( int, flg )
> ),
> 
> TP_fast_assign(
> memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
> __entry->pid = p->pid;
> __entry->cpu = task_cpu(p);
> __entry->flg = ti_work_cleared & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY);
> ),

Ok, will make that change.
Thanks
-Prakash

> 
> 
> -- Steve



* Re: [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-04 14:54     ` Sebastian Andrzej Siewior
@ 2025-06-04 17:29       ` Prakash Sangappa
  2025-06-04 19:23         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 22+ messages in thread
From: Prakash Sangappa @ 2025-06-04 17:29 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Steven Rostedt, linux-kernel@vger.kernel.org,
	peterz@infradead.org, mathieu.desnoyers@efficios.com,
	tglx@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Jun 4, 2025, at 7:54 AM, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> On 2025-06-04 10:31:06 [-0400], Steven Rostedt wrote:
>> On Tue,  3 Jun 2025 23:36:49 +0000
>> Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
>> 
>>> @@ -2249,6 +2251,20 @@ static inline bool owner_on_cpu(struct task_struct *owner)
>>> unsigned long sched_cpu_util(int cpu);
>>> #endif /* CONFIG_SMP */
>>> 
>>> +#ifdef CONFIG_RSEQ
>>> +
>>> +extern bool rseq_delay_resched(void);
>>> +extern void rseq_delay_resched_fini(void);
>>> +extern void rseq_delay_resched_tick(void);
>>> +
>>> +#else
>>> +
>>> +static inline bool rseq_delay_resched(void) { return false; }
>>> +static inline void rseq_delay_resched_fini(void) { }
>>> +static inline void rseq_delay_resched_tick(void) { }
>>> +
>>> +#endif
>>> +
>> 
>> Can we add a config to make this optional. I don't want to allow any task
>> to have an extended timeslice over RT tasks regardless of how small the
>> delay is.
> 
> I asked to get RT tasks excluded from this extensions and it is ignored.
> Maybe they were benefits mentioned somewhere…

Don’t know if there were benefits mentioned when RT tasks are involved.

I had shared some benchmark results in this thread showing the benefit of using the scheduler time extension.
https://lore.kernel.org/all/20241113000126.967713-1-prakash.sangappa@oracle.com/
The workload did not include RT tasks.

-Prakash


> 
>> -- Steve
> 
> Sebastian



* Re: [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-04 17:29       ` Prakash Sangappa
@ 2025-06-04 19:23         ` Sebastian Andrzej Siewior
  2025-06-09 20:55           ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-04 19:23 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Steven Rostedt, linux-kernel@vger.kernel.org,
	peterz@infradead.org, mathieu.desnoyers@efficios.com,
	tglx@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

On 2025-06-04 17:29:44 [+0000], Prakash Sangappa wrote:
> Don’t know if there were benefits mentioned when RT tasks are involved.
> 
> I had shared some benchmark results in this thread showing benefit of using scheduler time extension.
> https://lore.kernel.org/all/20241113000126.967713-1-prakash.sangappa@oracle.com/
> The workload did not include RT tasks.

I don't question the mechanism/approach. I just don't want RT tasks
delayed.

> -Prakash

Sebastian


* Re: [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-04 19:23         ` Sebastian Andrzej Siewior
@ 2025-06-09 20:55           ` Steven Rostedt
  2025-06-09 21:33             ` Steven Rostedt
  2025-06-09 21:52             ` Steven Rostedt
  0 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-06-09 20:55 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Prakash Sangappa, linux-kernel@vger.kernel.org,
	peterz@infradead.org, mathieu.desnoyers@efficios.com,
	tglx@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

On Wed, 4 Jun 2025 21:23:27 +0200
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> On 2025-06-04 17:29:44 [+0000], Prakash Sangappa wrote:
> > Don’t know if there were benefits mentioned when RT tasks are involved.
> > 
> > I had shared some benchmark results in this thread showing benefit of using scheduler time extension.
> > https://lore.kernel.org/all/20241113000126.967713-1-prakash.sangappa@oracle.com/
> > The workload did not include RT tasks.  
> 
> I don't question the mechanism/ approach. I just don't want RT tasks
> delayed.
> 

So I applied your patches and fixed up my "extend-sched.c" program to use
your method. I booted on bare-metal PREEMPT_RT and ran:

~# cyclictest --smp -p95 -m -s --system -l 100000  -b 1000
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 0.35 0.75 0.38 1/549 4219          

T: 0 ( 4213) P:95 I:1000 C:   5163 Min:      3 Act:    3 Avg:    3 Max:      12
T: 1 ( 4214) P:95 I:1500 C:   3444 Min:      3 Act:    4 Avg:    3 Max:       9
T: 2 ( 4215) P:95 I:2000 C:   2582 Min:      3 Act:    3 Avg:    3 Max:       8
T: 3 ( 4216) P:95 I:2500 C:   2066 Min:      3 Act:    4 Avg:    3 Max:       9
T: 4 ( 4217) P:95 I:3000 C:   1721 Min:      3 Act:    4 Avg:    3 Max:       7
T: 5 ( 4218) P:95 I:3500 C:   1474 Min:      3 Act:    4 Avg:    4 Max:      11
T: 6 ( 4219) P:95 I:4000 C:   1290 Min:      3 Act:    3 Avg:    3 Max:       9

In another window, I ran the "extend-sched" and cyclictest immediately turned into:

T: 0 ( 4372) P:95 I:1000 C:  33235 Min:      3 Act:    4 Avg:    3 Max:      36
T: 1 ( 4373) P:95 I:1500 C:  22182 Min:      3 Act:    4 Avg:    3 Max:      39
T: 2 ( 4374) P:95 I:2000 C:  16647 Min:      3 Act:    5 Avg:    3 Max:      35
T: 3 ( 4375) P:95 I:2500 C:  13321 Min:      3 Act:    5 Avg:    3 Max:      36
T: 4 ( 4376) P:95 I:3000 C:  11103 Min:      3 Act:    4 Avg:    3 Max:      35
T: 5 ( 4377) P:95 I:3500 C:   9518 Min:      3 Act:    5 Avg:    3 Max:      36
T: 6 ( 4378) P:95 I:4000 C:   8330 Min:      3 Act:    5 Avg:    3 Max:      35

It went from 12us to 39us. That's more than triple the max latency.

I noticed that the delay was set to 30, so I switched it to 5 and tried again:

~# cat /proc/sys/kernel/sched_preempt_delay_us 
30
~# echo 5 > /proc/sys/kernel/sched_preempt_delay_us 
~# cat /proc/sys/kernel/sched_preempt_delay_us 
5

T: 0 ( 4296) P:95 I:1000 C:  15324 Min:      3 Act:    3 Avg:    4 Max:      21
T: 1 ( 4297) P:95 I:1500 C:  10228 Min:      3 Act:    3 Avg:    4 Max:      21
T: 2 ( 4298) P:95 I:2000 C:   7676 Min:      3 Act:    3 Avg:    4 Max:      21
T: 3 ( 4299) P:95 I:2500 C:   6143 Min:      3 Act:    3 Avg:    4 Max:      20
T: 4 ( 4300) P:95 I:3000 C:   5119 Min:      3 Act:    3 Avg:    4 Max:      21
T: 5 ( 4301) P:95 I:3500 C:   4388 Min:      3 Act:    3 Avg:    4 Max:      20
T: 6 ( 4302) P:95 I:4000 C:   3840 Min:      3 Act:    3 Avg:    4 Max:      19

It went from a max of 12us to 21us. That's almost double, and this is with just 5us.

The point we are making with this is that it's NOT NOISE! It's an
addition to the worst case scenario.

If we have 30us as the worst case latency, using this with 5us will make
the worst case latency 35us (or more, as there's some overhead with this).

You cannot say "oh, the system causes 5us latency in general, so we can
just make it 5us", because this adds on top of it. If the system has 5us
latency in general and you set the extended scheduler slice to 5us, then
the system now has a 10us latency in general.

This is why it should be turned off with PREEMPT_RT.

-- Steve


* Re: [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-09 20:55           ` Steven Rostedt
@ 2025-06-09 21:33             ` Steven Rostedt
  2025-06-10 16:31               ` Prakash Sangappa
  2025-06-09 21:52             ` Steven Rostedt
  1 sibling, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-06-09 21:33 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Prakash Sangappa, linux-kernel@vger.kernel.org,
	peterz@infradead.org, mathieu.desnoyers@efficios.com,
	tglx@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com


Now I put the machine under load.

In one window I ran:

  $ cd linux.git
  $ make -j20

[ This is just an 8 core machine. I just noticed that I have isolcpus=3 so
only 7 are running ]

And in another window I ran:

  $ while :; ./hackbench 50; done

This made the system have:

~# cyclictest --smp -p95 -m -s --system -l 100000  -b 1000
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 38.84 19.89 8.05 29/2609 80387           

T: 0 (71748) P:95 I:1000 C:  23386 Min:      5 Act:   10 Avg:    9 Max:      30
T: 1 (71749) P:95 I:1500 C:  15635 Min:      5 Act:    7 Avg:    9 Max:      24
T: 2 (71750) P:95 I:2000 C:  11735 Min:      6 Act:   11 Avg:   10 Max:      27
T: 3 (71751) P:95 I:2500 C:   9388 Min:      6 Act:    9 Avg:   10 Max:      24
T: 4 (71753) P:95 I:3000 C:   7823 Min:      6 Act:   10 Avg:   10 Max:      24
T: 5 (71755) P:95 I:3500 C:   6699 Min:      6 Act:   10 Avg:   10 Max:      23
T: 6 (71756) P:95 I:4000 C:   5865 Min:      6 Act:   10 Avg:    9 Max:      23

Then running my extend-sched with 5us delay, it jumped up slightly.

T: 0 (104507) P:95 I:1000 C:  69385 Min:      4 Act:   10 Avg:    8 Max:      34
T: 1 (104509) P:95 I:1500 C:  46378 Min:      4 Act:   14 Avg:    9 Max:      29
T: 2 (104510) P:95 I:2000 C:  34829 Min:      5 Act:   13 Avg:    9 Max:      27
T: 3 (104511) P:95 I:2500 C:  27885 Min:      5 Act:   11 Avg:    9 Max:      28
T: 4 (104512) P:95 I:3000 C:  23246 Min:      5 Act:   12 Avg:    9 Max:      29
T: 5 (104514) P:95 I:3500 C:  19931 Min:      5 Act:   11 Avg:    9 Max:      32
T: 6 (104518) P:95 I:4000 C:  17446 Min:      5 Act:   11 Avg:    9 Max:      24

This is more in the noise but still slightly noticeable. I still argue
that this extends any worst case scenario with the delay: if the path
that causes the worst case scenario happens while the extended slice is
in effect, they are combined. Which is the definition of the worst case
scenario.

-- Steve


* Re: [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-09 20:55           ` Steven Rostedt
  2025-06-09 21:33             ` Steven Rostedt
@ 2025-06-09 21:52             ` Steven Rostedt
  2025-06-09 22:06               ` Steven Rostedt
  2025-06-10 15:40               ` Prakash Sangappa
  1 sibling, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-06-09 21:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Prakash Sangappa, linux-kernel@vger.kernel.org,
	peterz@infradead.org, mathieu.desnoyers@efficios.com,
	tglx@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

[-- Attachment #1: Type: text/plain, Size: 659 bytes --]

On Mon, 9 Jun 2025 16:55:32 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> So I applied your patches and fixed up my "extend-sched.c" program to use
> your method. I booted on bare-metal PREEMPT_RT and ran:

In case anyone else wants to play, I'm attaching the source of extend-sched.c

I ran it with: sleep 5; ./extend-sched

Then I switched over to cyclictest, counted to five, and it was pretty
noticeable when it triggered.

To build, simply do:

  $ cd linux.git
  $ mkdir /tmp/extend
  $ cp tools/testing/selftests/rseq/rseq-abi.h /tmp/extend
  $ cd /tmp/extend

  [ download extend-sched.c here ]

  $ gcc extend-sched.c -o extend-sched


-- Steve

[-- Attachment #2: extend-sched.c --]
[-- Type: text/x-c++src, Size: 9188 bytes --]


// Run with: GLIBC_TUNABLES=glibc.pthread.rseq=0 

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdbool.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/time.h>

#ifdef ENABLE_TRACEFS
#include <tracefs.h>
#else
static inline void tracefs_printf(void *inst, const char *fmt, ...) { }
static inline void tracefs_print_init(void *inst) { }
#endif

#include <sys/rseq.h>
#include "rseq-abi.h"

static bool no_rseq;
static bool extend_wait;

/* In case we want to play with priorities */
static int busy_prio = 0;
static int lock_prio = 0;

static int loop_spin = 15000;

//#define barrier() asm volatile ("" ::: "memory")
#define rmb() asm volatile ("lfence" ::: "memory")
#define wmb() asm volatile ("sfence" ::: "memory")

#define NR_BUSY_THREADS 5

static pthread_barrier_t pbarrier;

static __thread struct rseq_abi *rseq_map;

static void init_extend_map(void)
{
	if (no_rseq)
		return;

	rseq_map = (void *)__builtin_thread_pointer() + __rseq_offset;
}

struct data;

struct thread_data {
	unsigned long long			x_count;
	unsigned long long			total;
	unsigned long long			max;
	unsigned long long			min;
	unsigned long long			total_wait;
	unsigned long long			max_wait;
	unsigned long long			min_wait;
	unsigned long long			contention;
	unsigned long long			extended;
	struct data				*data;
	int					cpu;
};

struct data {
	unsigned long long		x;
	unsigned long			lock;
	struct thread_data		*tdata;
	bool				done;
};

static inline unsigned long
cmpxchg(volatile unsigned long *ptr, unsigned long old, unsigned long new)
{
        unsigned long prev;

	asm volatile("lock; cmpxchg %b1,%2"
		     : "=a"(prev)
		     : "q"(new), "m"(*(ptr)), "0"(old)
		     : "memory");
        return prev;
}

static void extend(void)
{
	if (no_rseq)
		return;

	/* bit 3 == RSEQ_CS_FLAG_DELAY_RESCHED: request a time slice extension */
	rseq_map->flags |= 1 << 3;
}

static int unextend(void)
{
	int flags;

	if (no_rseq)
		return 0;

	flags = rseq_map->flags;
	/* clear bit 3 (DELAY_RESCHED) and bit 4 (RESCHEDULED) */
	rseq_map->flags &= ~((1 << 3) | (1 << 4));
	/* if the kernel did not set RESCHEDULED, no yield is owed */
	if (!(flags & (1 << 4)))
		return 0;

	tracefs_printf(NULL, "Yield!\n");
	sched_yield();
	return 1;
}

#define sec2usec(sec) ((sec) * 1000000ULL)
#define usec2sec(usec) ((usec) / 1000000ULL)

static unsigned long long get_time(void)
{
	struct timeval tv;
	unsigned long long time;

	gettimeofday(&tv, NULL);

	time = sec2usec(tv.tv_sec);
	time += tv.tv_usec;

	return time;
}

static void do_sleep(unsigned usecs)
{
	struct timespec ts;

	ts.tv_sec = 0;
	ts.tv_nsec = usecs * 1000;
	nanosleep(&ts, NULL);
}

static void grab_lock(struct thread_data *tdata, struct data *data)
{
	unsigned long long start_wait, start, end, delta;
	unsigned long long end_wait;
	unsigned long prev;
	bool contention = false;

	start_wait = get_time();

	rmb();
	while (data->lock && !data->done) {
		contention = true;
		rmb();
	}

	tracefs_printf(NULL, "Grab lock\n");
	if (extend_wait)
		extend();
	do {
		if (!extend_wait)
			extend();
		start = get_time();
		prev = cmpxchg(&data->lock, 0, 1);
		if (prev) {
			contention = true;
			if (!extend_wait && unextend())
				tdata->extended++;
			while (data->lock && !data->done)
				rmb();
		}
	} while (prev && !data->done);

	if (contention)
		tdata->contention++;

	if (data->done)
		return;

	end_wait = get_time();

	tracefs_printf(NULL, "Have lock!\n");

	delta = end_wait - start_wait;
	if (!tdata->total_wait || tdata->max_wait < delta)
		tdata->max_wait = delta;
	if (!tdata->total_wait || tdata->min_wait > delta)
		tdata->min_wait = delta;
	tdata->total_wait += delta;

	data->x++;

	if (data->lock != 1) {
		printf("Failed locking\n");
		exit(-1);
	}

	/* Spin while holding the lock, to simulate critical-section work */
	for (int i = 0; i < loop_spin; i++)
		wmb();

	prev = cmpxchg(&data->lock, 1, 0);
	end = get_time();
	tracefs_printf(NULL, "released lock!\n");
	if (unextend())
		tdata->extended++;
	if (prev != 1) {
		printf("Failed unlocking\n");
		exit(-1);
	}

	delta = end - start;
	if (!tdata->total || tdata->max < delta) {
		tracefs_printf(NULL, "New max: %lld\n", delta);
		tdata->max = delta;
	}

	if (!tdata->total || tdata->min > delta)
		tdata->min = delta;

	tdata->total += delta;
	tdata->x_count++;
}

static void *busy_thread(void *d)
{
	struct data *data = d;
	int i;

	nice(busy_prio);

	while (!data->done) {
		for (i = 0; i < 100; i++)
			wmb();
		do_sleep(10);
		rmb();
	}
	return NULL;
}

static void *run_thread(void *d)
{
	struct thread_data *tdata = d;
	struct data *data = tdata->data;

	init_extend_map();

	nice(lock_prio);

	pthread_barrier_wait(&pbarrier);

	while (!data->done) {
		grab_lock(tdata, data);
		/* Make slightly different waits */
		/* 100us + cpu * 27us */
		do_sleep(100 + tdata->cpu * 27);
		rmb();
	}
	return NULL;
}

int main (int argc, char **argv)
{
	unsigned long long total_wait = 0;
	unsigned long long total_held = 0;
	unsigned long long total_contention = 0;
	unsigned long long total_extended = 0;
	unsigned long long max_wait = 0;
	unsigned long long max = 0;
	unsigned long long secs;
	unsigned long long avg_wait;
	unsigned long long avg_secs;
	unsigned long long avg_held;
	unsigned long long avg_held_secs;
	unsigned long long total_count = 0;
	bool verbose = false;
	pthread_t *threads;
	cpu_set_t *save_affinity;
	cpu_set_t *set_affinity;
	size_t cpu_size;
	struct data data;
	int cpus;
	int ch;
	int i;

	while ((ch = getopt(argc, argv, "dwv")) >= 0) {
		switch (ch) {
			case 'd':
				no_rseq = true;
				break;
			case 'w':
				extend_wait = true;
				break;
			case 'v':
				verbose = true;
				break;
			default:
				fprintf(stderr, "usage: extend-sched [-d|-w|-v]\n"
						"  -d: disable rseq\n"
						"  -w: extend while trying to get lock\n"
						"  -v: verbose output\n");
				exit(-1);
		}
	}
	memset(&data, 0, sizeof(data));

	cpus = sysconf(_SC_NPROCESSORS_CONF);

	cpu_size = CPU_ALLOC_SIZE(cpus);
	save_affinity = CPU_ALLOC(cpus);
	set_affinity = CPU_ALLOC(cpus);
	if (!save_affinity || !set_affinity) {
		perror("Allocating CPU sets");
		exit(-1);
	}
	if (sched_getaffinity(0, cpu_size, save_affinity) < 0) {
		perror("Getting affinity");
		exit(-1);
	}

	/* Create NR_BUSY_THREADS + 1 threads for every CPU: one lock grabber plus the busy tasks */
	threads = calloc(cpus * (NR_BUSY_THREADS + 1), sizeof(*threads));
	if (!threads) {
		perror("threads");
		exit(-1);
	}

	/* Allocate the data for the lock grabbers */
	data.tdata = calloc(cpus, sizeof(*data.tdata));
	if (!data.tdata) {
		perror("Allocating tdata");
		exit(-1);
	}

	tracefs_print_init(NULL);
	pthread_barrier_init(&pbarrier, NULL, cpus + 1);

	/* Create the lock-grabber and busy threads, one set per CPU */
	for (i = 0; i < cpus; i++) {
		int ret;

		/* Set the affinity to this CPU as threads will inherit it */
		CPU_ZERO_S(cpu_size, set_affinity);
		CPU_SET_S(i, cpu_size, set_affinity);
		if (sched_setaffinity(0, cpu_size, set_affinity) < 0) {
			perror("Setting affinity");
			fprintf(stderr, " Setting cpu %d\n", i);
			exit(-1);
		}

		data.tdata[i].data = &data;
		data.tdata[i].cpu = i;

		ret = pthread_create(&threads[i], NULL, run_thread, &data.tdata[i]);
		if (ret < 0) {
			perror("creating lock threads");
			exit(-1);
		}

		for (int n = 1; n <= NR_BUSY_THREADS; n++) {
			ret = pthread_create(&threads[i + cpus * n], NULL, busy_thread, &data);
			if (ret < 0) {
				perror("creating busy threads");
				exit(-1);
			}
		}
	}

	if (sched_setaffinity(0, cpu_size, save_affinity) < 0) {
		perror("Setting saved affinity");
		exit(-1);
	}

	pthread_barrier_wait(&pbarrier);
	sleep(5);

	printf("Finish up\n");
	data.done = true;
	wmb();

	for (i = 0; i < cpus; i++) {
		for (int n = 1; n <= NR_BUSY_THREADS; n++)
			pthread_join(threads[i + cpus * n], NULL);
	}

	for (i = 0; i < cpus; i++) {
		pthread_join(threads[i], NULL);
		if (verbose) {
			printf("thread %i:\n", i);
			printf("   count:\t%lld\n", data.tdata[i].x_count);
			printf("   total:\t%lld\n", data.tdata[i].total);
			printf("     max:\t%lld\n", data.tdata[i].max);
			printf("     min:\t%lld\n", data.tdata[i].min);
			printf("   total wait:\t%lld\n", data.tdata[i].total_wait);
			printf("     max wait:\t%lld\n", data.tdata[i].max_wait);
			printf("     min wait:\t%lld\n", data.tdata[i].min_wait);
			printf("   contention:\t%lld\n", data.tdata[i].contention);
			printf("     extended:\t%lld\n", data.tdata[i].extended);
		}
		total_count += data.tdata[i].x_count;
		total_wait += data.tdata[i].total_wait;
		total_contention += data.tdata[i].contention;
		total_held += data.tdata[i].total;
		total_extended += data.tdata[i].extended;
		if (data.tdata[i].max_wait > max_wait)
			max_wait = data.tdata[i].max_wait;
		if (data.tdata[i].max > max)
			max = data.tdata[i].max;
	}

	secs = usec2sec(total_wait);
	avg_wait = total_count ? total_wait / total_count : 0;
	avg_secs = usec2sec(avg_wait);
	avg_held = total_count ? total_held / total_count : 0;
	avg_held_secs = usec2sec(avg_held);

	printf("Ran for %lld times\n", data.x);
	printf("Total wait time: %llu.%06llu  (avg: %llu.%06llu)\n", secs, total_wait - sec2usec(secs),
				avg_secs, avg_wait - sec2usec(avg_secs));
	printf("Total contention: %lld\n", total_contention);
	printf("Total extended: %lld\n", total_extended);
	printf("      max wait: %lld\n", max_wait);
	printf("           max: %lld (avg: %llu.%06llu)\n", max, avg_held_secs, avg_held - sec2usec(avg_held_secs));
	return 0;
}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-09 21:52             ` Steven Rostedt
@ 2025-06-09 22:06               ` Steven Rostedt
  2025-06-10 15:40               ` Prakash Sangappa
  1 sibling, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-06-09 22:06 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Prakash Sangappa, linux-kernel@vger.kernel.org,
	peterz@infradead.org, mathieu.desnoyers@efficios.com,
	tglx@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

On Mon, 9 Jun 2025 17:52:51 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> // Run with: GLIBC_TUNABLES=glibc.pthread.rseq=0 

You can ignore the above. I modified the rseq handling for my original
version and had to use a custom rseq area that glibc didn't use. Although,
since I was using reserved bits, it probably didn't matter.

-- Steve


> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <stdbool.h>
> #include <pthread.h>
> #include <unistd.h>
> #include <sys/time.h>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-09 21:52             ` Steven Rostedt
  2025-06-09 22:06               ` Steven Rostedt
@ 2025-06-10 15:40               ` Prakash Sangappa
  1 sibling, 0 replies; 22+ messages in thread
From: Prakash Sangappa @ 2025-06-10 15:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sebastian Andrzej Siewior, linux-kernel@vger.kernel.org,
	peterz@infradead.org, mathieu.desnoyers@efficios.com,
	tglx@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Jun 9, 2025, at 2:52 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Mon, 9 Jun 2025 16:55:32 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
>> So I applied your patches and fixed up my "extend-sched.c" program to use
>> your method. I booted on bare-metal PREEMPT_RT and ran:
> 
> In case anyone else wants to play, I'm attaching the source of extend-sched.c
> 
> I ran it with: sleep 5; ./extend-sched
> 
> Then I switched over to cyclictest, counted to five, and it was pretty
> noticeable when it triggered.
> 
> To build, simply do:
> 
>  $ cd linux.git
>  $ mkdir /tmp/extend
>  $ cp tools/testing/selftests/rseq/rseq-abi.h /tmp/extend
>  $ cd /tmp/extend
> 
>  [ download extend-sched.c here ]
> 
>  $ gcc extend-sched.c -o extend-sched
> 
> 
> -- Steve
> <extend-sched.c>

Thanks for sharing the test program.

In the test program, unextend() should be slightly modified as follows (the
diff line numbers refer to the attachment as originally posted); the
resulting function is shown after the diff.

101c101
< 	if (!(flags & (1 << 4)))
---
> 	if (flags & (1 << 3))
103,106c103,107
< 
< 	tracefs_printf(NULL, "Yield!\n");
< 	sched_yield();
< 		return 1;
---
> 	if (!(flags & (1 << 4))) {
> 		tracefs_printf(NULL, "Yield!\n");
> 		sched_yield();
> 	}
> 	return 1;
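
Applied, the resulting function would look roughly like this (the bit
semantics in the comments are my reading of the patches, not authoritative):

static int unextend(void)
{
	int flags;

	if (no_rseq)
		return 0;

	flags = rseq_map->flags;
	rseq_map->flags &= ~((1 << 3) | (1 << 4));
	/* Request bit still set: the kernel never granted an extension */
	if (flags & (1 << 3))
		return 0;
	/* Extension granted but not yet rescheduled: give the cpu back */
	if (!(flags & (1 << 4))) {
		tracefs_printf(NULL, "Yield!\n");
		sched_yield();
	}
	return 1;
}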


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-09 21:33             ` Steven Rostedt
@ 2025-06-10 16:31               ` Prakash Sangappa
  2025-06-10 16:40                 ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Prakash Sangappa @ 2025-06-10 16:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sebastian Andrzej Siewior, linux-kernel@vger.kernel.org,
	peterz@infradead.org, mathieu.desnoyers@efficios.com,
	tglx@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Jun 9, 2025, at 2:33 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> 
> Now I put the machine under load.
> 
> In one window I ran:
> 
>  $ cd linux.git
>  $ make -j20
> 
> [ This is just an 8-core machine. I just noticed that I have isolcpus=3, so
> only 7 CPUs are running ]
> 
> And in another window I ran:
> 
>  $ while :; do ./hackbench 50; done
> 
> This made the system have:
> 
> ~# cyclictest --smp -p95 -m -s --system -l 100000  -b 1000
> # /dev/cpu_dma_latency set to 0us
> policy: fifo: loadavg: 38.84 19.89 8.05 29/2609 80387           
> 
> T: 0 (71748) P:95 I:1000 C:  23386 Min:      5 Act:   10 Avg:    9 Max:      30
> T: 1 (71749) P:95 I:1500 C:  15635 Min:      5 Act:    7 Avg:    9 Max:      24
> T: 2 (71750) P:95 I:2000 C:  11735 Min:      6 Act:   11 Avg:   10 Max:      27
> T: 3 (71751) P:95 I:2500 C:   9388 Min:      6 Act:    9 Avg:   10 Max:      24
> T: 4 (71753) P:95 I:3000 C:   7823 Min:      6 Act:   10 Avg:   10 Max:      24
> T: 5 (71755) P:95 I:3500 C:   6699 Min:      6 Act:   10 Avg:   10 Max:      23
> T: 6 (71756) P:95 I:4000 C:   5865 Min:      6 Act:   10 Avg:    9 Max:      23
> 
> Then running my extend-sched with 5us delay, it jumped up slightly.
> 
> T: 0 (104507) P:95 I:1000 C:  69385 Min:      4 Act:   10 Avg:    8 Max:      34
> T: 1 (104509) P:95 I:1500 C:  46378 Min:      4 Act:   14 Avg:    9 Max:      29
> T: 2 (104510) P:95 I:2000 C:  34829 Min:      5 Act:   13 Avg:    9 Max:      27
> T: 3 (104511) P:95 I:2500 C:  27885 Min:      5 Act:   11 Avg:    9 Max:      28
> T: 4 (104512) P:95 I:3000 C:  23246 Min:      5 Act:   12 Avg:    9 Max:      29
> T: 5 (104514) P:95 I:3500 C:  19931 Min:      5 Act:   11 Avg:    9 Max:      32
> T: 6 (104518) P:95 I:4000 C:  17446 Min:      5 Act:   11 Avg:    9 Max:      24
> 
> This is more in the noise but still slightly noticeable. I still argue that
> this extends any worst-case scenario by the delay: if the path that causes
> the worst case runs while an extended slice is in effect, the two latencies
> are combined, which is the definition of the worst-case scenario.

Ok, adding load also seems to increase the max latency.

It is up to Peter to decide if the scheduler time extension should be restricted to non-RT threads.

We could add a new CONFIG option which can be used to disable this
feature for PREEMPT_RT, if that is what you are suggesting.

Would setting the tunable 'sched_preempt_delay_us' to 0, which disables the
feature, not suffice?
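
For example, disabling it system-wide would presumably be just (assuming the
tunable is exposed under /proc/sys/kernel/):

  # echo 0 > /proc/sys/kernel/sched_preempt_delay_us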

-Prakash

> -- Steve


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-10 16:31               ` Prakash Sangappa
@ 2025-06-10 16:40                 ` Steven Rostedt
  2025-07-01  0:48                   ` Prakash Sangappa
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-06-10 16:40 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Sebastian Andrzej Siewior, linux-kernel@vger.kernel.org,
	peterz@infradead.org, mathieu.desnoyers@efficios.com,
	tglx@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com

On Tue, 10 Jun 2025 16:31:05 +0000
Prakash Sangappa <prakash.sangappa@oracle.com> wrote:

> Ok, adding load also seems to increase the max latency.

Right.

> 
> It is up to Peter to decide if the scheduler time extension should be restricted to non-RT threads.

Peter was against restricting it to just non-RT threads because he said it
wouldn't make a difference. He asked for benchmarks that say otherwise.

I'm now supplying the benchmarks that say it does make a difference.

Hopefully Peter will now change his mind.

-- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V5 1/6] Sched: Scheduler time slice extension
  2025-06-10 16:40                 ` Steven Rostedt
@ 2025-07-01  0:48                   ` Prakash Sangappa
  0 siblings, 0 replies; 22+ messages in thread
From: Prakash Sangappa @ 2025-07-01  0:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sebastian Andrzej Siewior, linux-kernel@vger.kernel.org,
	peterz@infradead.org, mathieu.desnoyers@efficios.com,
	tglx@linutronix.de, kprateek.nayak@amd.com,
	vineethr@linux.ibm.com



> On Jun 10, 2025, at 9:40 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Tue, 10 Jun 2025 16:31:05 +0000
> Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
> 
>> Ok, adding load also seems to increase the max latency.
> 
> Right.
> 
>> 
>> It is up to Peter to decide if the scheduler time extension should be restricted to non-RT threads.
> 
> Peter was against restricting it to just non-RT threads because he said it
> wouldn't make a difference. He asked for benchmarks that say otherwise.
> 
> I'm now supplying the benchmarks that say it does make a difference.
> 
> Hopefully Peter will now change his mind.

Did not see a response from Peter.

I sent out the V6 version, which adds a config option for this feature:
https://lore.kernel.org/all/20250701003749.50525-1-prakash.sangappa@oracle.com/

-Prakash
> 
> -- Steve


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-07-01  0:48 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-03 23:36 [PATCH V5 0/6] Scheduler time slice extension Prakash Sangappa
2025-06-03 23:36 ` [PATCH V5 1/6] Sched: " Prakash Sangappa
2025-06-04 14:31   ` Steven Rostedt
2025-06-04 14:54     ` Sebastian Andrzej Siewior
2025-06-04 17:29       ` Prakash Sangappa
2025-06-04 19:23         ` Sebastian Andrzej Siewior
2025-06-09 20:55           ` Steven Rostedt
2025-06-09 21:33             ` Steven Rostedt
2025-06-10 16:31               ` Prakash Sangappa
2025-06-10 16:40                 ` Steven Rostedt
2025-07-01  0:48                   ` Prakash Sangappa
2025-06-09 21:52             ` Steven Rostedt
2025-06-09 22:06               ` Steven Rostedt
2025-06-10 15:40               ` Prakash Sangappa
2025-06-04 17:09     ` Prakash Sangappa
2025-06-03 23:36 ` [PATCH V5 2/6] Sched: Indicate if thread got rescheduled Prakash Sangappa
2025-06-03 23:36 ` [PATCH V5 3/6] Sched: Tunable to specify duration of time slice extension Prakash Sangappa
2025-06-03 23:36 ` [PATCH V5 4/6] Sched: Add scheduler stat for cpu " Prakash Sangappa
2025-06-03 23:36 ` [PATCH V5 5/6] Sched: Add tracepoint for sched " Prakash Sangappa
2025-06-04 14:36   ` Steven Rostedt
2025-06-04 17:10     ` Prakash Sangappa
2025-06-03 23:36 ` [PATCH V5 6/6] Add API to query supported rseq cs flags Prakash Sangappa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).