linux-arch.vger.kernel.org archive mirror
* [patch V2 00/12] rseq: Implement time slice extension mechanism
@ 2025-10-22 12:57 Thomas Gleixner
  2025-10-22 12:57 ` [patch V2 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
                   ` (12 more replies)
  0 siblings, 13 replies; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-22 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

This is a follow-up to the V1 version:

     https://lore.kernel.org/20250908225709.144709889@linutronix.de

Time slice extensions are an attempt to provide an opportunistic priority
ceiling without the overhead of an actual priority ceiling protocol, but
also without the guarantees such a protocol provides.

The intent is to avoid situations where a user space thread is interrupted
in a critical section and scheduled out while holding a resource on which
the preempting thread or other threads in the system might block. In the
worst case that obviously prevents those threads from making progress for
at least a full time slice. This is especially a problem for user space
spinlocks, which are a patently bad idea to begin with, but it affects
other mechanisms as well.

This series uses the existing RSEQ user memory to implement it.
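The request/grant protocol the series implements can be sketched as a purely
CPU-local simulation. The helper names below are hypothetical stand-ins; the
real ABI lives in the struct rseq user memory and the new rseq_slice_yield()
syscall introduced later in the series:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the rseq::slice_ctrl ABI bytes described in
 * this series: request is written by user space, granted by the kernel. */
struct slice_ctrl {
	uint8_t request;
	uint8_t granted;
};

/* Models the kernel side when an interrupt wants to reschedule: a pending
 * request is converted into a grant instead of scheduling the task out. */
static int kernel_maybe_grant(struct slice_ctrl *ctrl)
{
	if (ctrl->request) {
		ctrl->request = 0;
		ctrl->granted = 1;
		return 1;	/* extension granted, return to user space */
	}
	return 0;		/* no request, schedule out as usual */
}

/* Models the documented user space flow around a critical section.
 * Returns 1 if the thread had to yield via rseq_slice_yield(). */
static int user_critical_section(struct slice_ctrl *ctrl, int interrupted)
{
	ctrl->request = 1;
	/* ... critical section body; an interrupt may fire here ... */
	if (interrupted)
		kernel_maybe_grant(ctrl);
	if (ctrl->granted) {
		ctrl->granted = 0;	/* stands in for rseq_slice_yield() */
		return 1;
	}
	ctrl->request = 0;	/* no grant happened: just clear the request */
	return 0;
}
```

As everything is strictly CPU local, no atomic operations are needed in
either direction; the simulation reflects that.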

Changes vs. V1:

   - Rebase on the newest RSEQ and uaccess changes

   - Use separate bytes for request and grant and lift the atomic operation
     requirement for user space - Mathieu

   - Kconfig indentation, fix typos and expressions - Randy

   - Provide an extra stub for the !RSEQ case - Prateek

   - Use the proper name in sys_ni.c and add comment - Prateek

   - Return 1 from __setup() - Prateek


The uaccess and RSEQ modifications on which this series is based can be
found here:

    https://lore.kernel.org/20251022104005.907410538@linutronix.de/

and in git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid

For your convenience all of it is also available as a conglomerate from
git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice

Thanks,

	tglx
---
Peter Zijlstra (1):
      sched: Provide and use set_need_resched_current()

Thomas Gleixner (11):
      rseq: Add fields and constants for time slice extension
      rseq: Provide static branch for time slice extensions
      rseq: Add statistics for time slice extensions
      rseq: Add prctl() to enable time slice extensions
      rseq: Implement sys_rseq_slice_yield()
      rseq: Implement syscall entry work for time slice extensions
      rseq: Implement time slice extension enforcement timer
      rseq: Reset slice extension when scheduled
      rseq: Implement rseq_grant_slice_extension()
      entry: Hook up rseq time slice extension
      selftests/rseq: Implement time slice extension test

 Documentation/userspace-api/index.rst       |    1 
 Documentation/userspace-api/rseq.rst        |  118 ++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 
 arch/arm/tools/syscall.tbl                  |    1 
 arch/arm64/tools/syscall_32.tbl             |    1 
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
 arch/s390/kernel/syscalls/syscall.tbl       |    1 
 arch/s390/mm/pfault.c                       |    3 
 arch/sh/kernel/syscalls/syscall.tbl         |    1 
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
 include/linux/entry-common.h                |    2 
 include/linux/rseq.h                        |   11 +
 include/linux/rseq_entry.h                  |  190 ++++++++++++++++-
 include/linux/rseq_types.h                  |   28 ++
 include/linux/sched.h                       |    7 
 include/linux/syscalls.h                    |    1 
 include/linux/thread_info.h                 |   16 -
 include/uapi/asm-generic/unistd.h           |    5 
 include/uapi/linux/prctl.h                  |   10 
 include/uapi/linux/rseq.h                   |   38 +++
 init/Kconfig                                |   12 +
 kernel/entry/common.c                       |   14 +
 kernel/entry/syscall-common.c               |   11 -
 kernel/rcu/tiny.c                           |    8 
 kernel/rcu/tree.c                           |   14 -
 kernel/rcu/tree_exp.h                       |    3 
 kernel/rcu/tree_plugin.h                    |    9 
 kernel/rcu/tree_stall.h                     |    3 
 kernel/rseq.c                               |  304 ++++++++++++++++++++++++++++
 kernel/sys.c                                |    6 
 kernel/sys_ni.c                             |    1 
 scripts/syscall.tbl                         |    1 
 tools/testing/selftests/rseq/.gitignore     |    1 
 tools/testing/selftests/rseq/Makefile       |    5 
 tools/testing/selftests/rseq/rseq-abi.h     |   27 ++
 tools/testing/selftests/rseq/slice_test.c   |  198 ++++++++++++++++++
 45 files changed, 1011 insertions(+), 52 deletions(-)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch V2 01/12] sched: Provide and use set_need_resched_current()
  2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
@ 2025-10-22 12:57 ` Thomas Gleixner
  2025-10-27  8:59   ` Sebastian Andrzej Siewior
  2025-10-22 12:57 ` [patch V2 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-22 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

set_tsk_need_resched(current) requires a subsequent call to
set_preempt_need_resched() to work correctly outside of the scheduler.

Provide set_need_resched_current() which wraps this correctly and replace
all the open coded instances.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/s390/mm/pfault.c    |    3 +--
 include/linux/sched.h    |    7 +++++++
 kernel/rcu/tiny.c        |    8 +++-----
 kernel/rcu/tree.c        |   14 +++++---------
 kernel/rcu/tree_exp.h    |    3 +--
 kernel/rcu/tree_plugin.h |    9 +++------
 kernel/rcu/tree_stall.h  |    3 +--
 7 files changed, 21 insertions(+), 26 deletions(-)

--- a/arch/s390/mm/pfault.c
+++ b/arch/s390/mm/pfault.c
@@ -199,8 +199,7 @@ static void pfault_interrupt(struct ext_
 			 * return to userspace schedule() to block.
 			 */
 			__set_current_state(TASK_UNINTERRUPTIBLE);
-			set_tsk_need_resched(tsk);
-			set_preempt_need_resched();
+			set_need_resched_current();
 		}
 	}
 out:
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2033,6 +2033,13 @@ static inline int test_tsk_need_resched(
 	return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
 }
 
+static inline void set_need_resched_current(void)
+{
+	lockdep_assert_irqs_disabled();
+	set_tsk_need_resched(current);
+	set_preempt_need_resched();
+}
+
 /*
  * cond_resched() and cond_resched_lock(): latency reduction via
  * explicit rescheduling in places that are safe. The return
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -70,12 +70,10 @@ void rcu_qs(void)
  */
 void rcu_sched_clock_irq(int user)
 {
-	if (user) {
+	if (user)
 		rcu_qs();
-	} else if (rcu_ctrlblk.donetail != rcu_ctrlblk.curtail) {
-		set_tsk_need_resched(current);
-		set_preempt_need_resched();
-	}
+	else if (rcu_ctrlblk.donetail != rcu_ctrlblk.curtail)
+		set_need_resched_current();
 }
 
 /*
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2696,10 +2696,8 @@ void rcu_sched_clock_irq(int user)
 	/* The load-acquire pairs with the store-release setting to true. */
 	if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
 		/* Idle and userspace execution already are quiescent states. */
-		if (!rcu_is_cpu_rrupt_from_idle() && !user) {
-			set_tsk_need_resched(current);
-			set_preempt_need_resched();
-		}
+		if (!rcu_is_cpu_rrupt_from_idle() && !user)
+			set_need_resched_current();
 		__this_cpu_write(rcu_data.rcu_urgent_qs, false);
 	}
 	rcu_flavor_sched_clock_irq(user);
@@ -2824,7 +2822,6 @@ static void strict_work_handler(struct w
 /* Perform RCU core processing work for the current CPU.  */
 static __latent_entropy void rcu_core(void)
 {
-	unsigned long flags;
 	struct rcu_data *rdp = raw_cpu_ptr(&rcu_data);
 	struct rcu_node *rnp = rdp->mynode;
 
@@ -2837,8 +2834,8 @@ static __latent_entropy void rcu_core(vo
 	if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && (!(preempt_count() & PREEMPT_MASK))) {
 		rcu_preempt_deferred_qs(current);
 	} else if (rcu_preempt_need_deferred_qs(current)) {
-		set_tsk_need_resched(current);
-		set_preempt_need_resched();
+		guard(irqsave)();
+		set_need_resched_current();
 	}
 
 	/* Update RCU state based on any recent quiescent states. */
@@ -2847,10 +2844,9 @@ static __latent_entropy void rcu_core(vo
 	/* No grace period and unregistered callbacks? */
 	if (!rcu_gp_in_progress() &&
 	    rcu_segcblist_is_enabled(&rdp->cblist) && !rcu_rdp_is_offloaded(rdp)) {
-		local_irq_save(flags);
+		guard(irqsave)();
 		if (!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
 			rcu_accelerate_cbs_unlocked(rnp, rdp);
-		local_irq_restore(flags);
 	}
 
 	rcu_check_gp_start_stall(rnp, rdp, rcu_jiffies_till_stall_check());
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -729,8 +729,7 @@ static void rcu_exp_need_qs(void)
 	__this_cpu_write(rcu_data.cpu_no_qs.b.exp, true);
 	/* Store .exp before .rcu_urgent_qs. */
 	smp_store_release(this_cpu_ptr(&rcu_data.rcu_urgent_qs), true);
-	set_tsk_need_resched(current);
-	set_preempt_need_resched();
+	set_need_resched_current();
 }
 
 #ifdef CONFIG_PREEMPT_RCU
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -753,8 +753,7 @@ static void rcu_read_unlock_special(stru
 			// Also if no expediting and no possible deboosting,
 			// slow is OK.  Plus nohz_full CPUs eventually get
 			// tick enabled.
-			set_tsk_need_resched(current);
-			set_preempt_need_resched();
+			set_need_resched_current();
 			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
 			    needs_exp && rdp->defer_qs_iw_pending != DEFER_QS_PENDING &&
 			    cpu_online(rdp->cpu)) {
@@ -813,10 +812,8 @@ static void rcu_flavor_sched_clock_irq(i
 	if (rcu_preempt_depth() > 0 ||
 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
 		/* No QS, force context switch if deferred. */
-		if (rcu_preempt_need_deferred_qs(t)) {
-			set_tsk_need_resched(t);
-			set_preempt_need_resched();
-		}
+		if (rcu_preempt_need_deferred_qs(t))
+			set_need_resched_current();
 	} else if (rcu_preempt_need_deferred_qs(t)) {
 		rcu_preempt_deferred_qs(t); /* Report deferred QS. */
 		return;
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -763,8 +763,7 @@ static void print_cpu_stall(unsigned lon
 	 * progress and it could be we're stuck in kernel space without context
 	 * switches for an entirely unreasonable amount of time.
 	 */
-	set_tsk_need_resched(current);
-	set_preempt_need_resched();
+	set_need_resched_current();
 }
 
 static bool csd_lock_suppress_rcu_stall;



* [patch V2 02/12] rseq: Add fields and constants for time slice extension
  2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
  2025-10-22 12:57 ` [patch V2 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
@ 2025-10-22 12:57 ` Thomas Gleixner
  2025-10-22 17:28   ` Randy Dunlap
  2025-10-22 12:57 ` [patch V2 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-22 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
	Arnd Bergmann, linux-arch

Aside from a Kconfig knob, add the following items:

   - Two flag bits for the rseq user space ABI, which allow user space to
     query the availability and enablement without a syscall.

   - A new member to the user space ABI struct rseq, which is going to be
     used to communicate request and grant between kernel and user space.

   - A rseq state struct to hold the kernel state of this mechanism

   - Documentation of the new mechanism
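
For illustration, the new ABI pieces can be re-declared as a standalone
snippet mirroring the uapi additions in this patch (flag bits 4 and 5 and
the slice_ctrl byte layout); the authoritative definitions are in
include/uapi/linux/rseq.h below:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User read-only feature flags in rseq::flags (bits 0-2 are historical). */
#define RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE	(1U << 4)
#define RSEQ_CS_FLAG_SLICE_EXT_ENABLED		(1U << 5)

/* Byte layout of the new control word: request at offset 0 (written by
 * user space), granted at offset 1 (written by the kernel). */
struct rseq_slice_ctrl {
	union {
		uint32_t	all;
		struct {
			uint8_t		request;
			uint8_t		granted;
			uint16_t	__reserved;
		};
	};
};
```

The separate bytes allow both sides to update their field with plain stores;
the compound `all` member lets the kernel initialize the whole word at once.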

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
V2: Fix Kconfig indentation, fix typos and expressions - Randy
    Make the control fields a struct and remove the atomicity requirement - Mathieu
---
 Documentation/userspace-api/index.rst |    1 
 Documentation/userspace-api/rseq.rst  |  118 ++++++++++++++++++++++++++++++++++
 include/linux/rseq_types.h            |   26 +++++++
 include/uapi/linux/rseq.h             |   38 ++++++++++
 init/Kconfig                          |   12 +++
 kernel/rseq.c                         |    7 ++
 6 files changed, 202 insertions(+)

--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
    ebpf/index
    ioctl/index
    mseal
+   rseq
 
 Security-related interfaces
 ===========================
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,118 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per-thread userspace memory area
+to be used as an ABI between kernel and userspace for three purposes:
+
+ * userspace restartable sequences
+
+ * quick access to read the current CPU number, node ID from userspace
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow userspace to perform update operations on
+per-cpu data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows implementing per-CPU data efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section, to avoid contention on a resource in case the thread
+would otherwise be scheduled out inside the critical section.
+
+The prerequisites for this functionality are:
+
+    * Enabled in Kconfig
+
+    * Enabled at boot time (default is enabled)
+
+    * A rseq userspace pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success; otherwise it returns one of the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL	  Functionality not available or invalid function arguments.
+          Note: arg4 and arg5 must be zero
+ENOTSUPP  Functionality was disabled on the kernel command line
+ENXIO	  Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can be also queried via prctl(2)::
+
+  prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when the mechanism is enabled
+or 0 if disabled. Otherwise it returns one of the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL	  Functionality not available or invalid function arguments.
+          Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status are also exposed via the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and only for informational purposes.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting rseq::slice_ctrl.request to 1. If the thread is
+interrupted and the interrupt results in a reschedule request in the
+kernel, then the kernel can grant a time slice extension and return to
+userspace instead of scheduling out.
+
+The kernel indicates the grant by clearing rseq::slice_ctrl::request and
+setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
+thread after granting the extension, the kernel clears the granted bit to
+indicate that to userspace.
+
+If the request bit is still set when leaving the critical section,
+userspace can clear it and continue.
+
+If the granted bit is set, then userspace has to invoke rseq_slice_yield()
+when leaving the critical section to relinquish the CPU. The kernel
+enforces this by arming a timer to prevent misbehaving userspace from
+abusing this mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by userspace.
+
+The required code flow is as follows::
+
+    rseq->slice_ctrl.request = 1;
+    critical_section();
+    if (rseq->slice_ctrl.granted)
+         rseq_slice_yield();
+
+As all of this is strictly CPU local, there are no atomicity requirements.
+Checking the granted state is inherently racy, but that cannot be avoided::
+
+    if (rseq->slice_ctrl & GRANTED)
+      -> Interrupt results in schedule and grant revocation
+        rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -73,12 +73,35 @@ struct rseq_ids {
 };
 
 /**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state:	Compound to access the overall state
+ * @enabled:	Time slice extension is enabled for the task
+ * @granted:	Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+	u16			state;
+	struct {
+		u8		enabled;
+		u8		granted;
+	};
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state:	Time slice extension state
+ */
+struct rseq_slice {
+	union rseq_slice_state	state;
+};
+
+/**
  * struct rseq_data - Storage for all rseq related data
  * @usrptr:	Pointer to the registered user space RSEQ memory
  * @len:	Length of the RSEQ region
  * @sig:	Signature of critial section abort IPs
  * @event:	Storage for event management
  * @ids:	Storage for cached CPU ID and MM CID
+ * @slice:	Storage for time slice extension data
  */
 struct rseq_data {
 	struct rseq __user		*usrptr;
@@ -86,6 +109,9 @@ struct rseq_data {
 	u32				sig;
 	struct rseq_event		event;
 	struct rseq_ids			ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+	struct rseq_slice		slice;
+#endif
 };
 
 #else /* CONFIG_RSEQ */
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -23,9 +23,15 @@ enum rseq_flags {
 };
 
 enum rseq_cs_flags_bit {
+	/* Historical and unsupported bits */
 	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
+	/* (3) Intentional gap to put new bits into a separate byte */
+
+	/* User read only feature flags */
+	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT	= 4,
+	RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT	= 5,
 };
 
 enum rseq_cs_flags {
@@ -35,6 +41,11 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	=
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE	=
+		(1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+	RSEQ_CS_FLAG_SLICE_EXT_ENABLED		=
+		(1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
 };
 
 /*
@@ -53,6 +64,27 @@ struct rseq_cs {
 	__u64 abort_ip;
 } __attribute__((aligned(4 * sizeof(__u64))));
 
+/**
+ * struct rseq_slice_ctrl - Time slice extension control structure
+ * @all:	Compound value
+ * @request:	Request for a time slice extension
+ * @granted:	Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space.  @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_slice_ctrl {
+	union {
+		__u32		all;
+		struct {
+			__u8	request;
+			__u8	granted;
+			__u16	__reserved;
+		};
+	};
+};
+
 /*
  * struct rseq is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line.
@@ -142,6 +174,12 @@ struct rseq {
 	__u32 mm_cid;
 
 	/*
+	 * Time slice extension control structure. CPU local updates from
+	 * kernel and user space.
+	 */
+	struct rseq_slice_ctrl slice_ctrl;
+
+	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
 	char end[];
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1913,6 +1913,18 @@ config RSEQ
 
 	  If unsure, say Y.
 
+config RSEQ_SLICE_EXTENSION
+	bool "Enable rseq-based time slice extension mechanism"
+	depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+	help
+	  Allows userspace to request a limited time slice extension when
+	  returning from an interrupt to user space via the RSEQ shared
+	  data ABI. If granted, this allows the task to complete a critical
+	  section, so that other threads are not stuck on a contended
+	  resource while the task is scheduled out.
+
+	  If unsure, say N.
+
 config RSEQ_STATS
 	default n
 	bool "Enable lightweight statistics of restartable sequences" if EXPERT
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -389,6 +389,8 @@ static bool rseq_reset_ids(void)
  */
 SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
 {
+	u32 rseqfl = 0;
+
 	if (flags & RSEQ_FLAG_UNREGISTER) {
 		if (flags & ~RSEQ_FLAG_UNREGISTER)
 			return -EINVAL;
@@ -440,6 +442,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	if (!access_ok(rseq, rseq_len))
 		return -EFAULT;
 
+	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+		rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
 	scoped_user_write_access(rseq, efault) {
 		/*
 		 * If the rseq_cs pointer is non-NULL on registration, clear it to
@@ -450,11 +455,13 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 		 */
 		unsafe_put_user(0UL, &rseq->rseq_cs, efault);
 		unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+		unsafe_put_user(rseqfl, &rseq->flags, efault);
 		/* Initialize IDs in user space */
 		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
 		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
 		unsafe_put_user(0U, &rseq->node_id, efault);
 		unsafe_put_user(0U, &rseq->mm_cid, efault);
+		unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
 	}
 
 	/*



* [patch V2 03/12] rseq: Provide static branch for time slice extensions
  2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
  2025-10-22 12:57 ` [patch V2 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
  2025-10-22 12:57 ` [patch V2 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
@ 2025-10-22 12:57 ` Thomas Gleixner
  2025-10-27  9:29   ` Sebastian Andrzej Siewior
  2025-10-22 12:57 ` [patch V2 04/12] rseq: Add statistics " Thomas Gleixner
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-22 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

Guard the time slice extension functionality with a static key, which can
be disabled on the kernel command line.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V2: Return 1 from __setup() - Prateek
---
 include/linux/rseq_entry.h |   11 +++++++++++
 kernel/rseq.c              |   17 +++++++++++++++++
 2 files changed, 28 insertions(+)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -75,6 +75,17 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
 #define rseq_inline __always_inline
 #endif
 
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static __always_inline bool rseq_slice_extension_enabled(void)
+{
+	return static_branch_likely(&rseq_slice_extension_key);
+}
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline bool rseq_slice_extension_enabled(void) { return false; }
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
 bool rseq_debug_validate_ids(struct task_struct *t);
 
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -484,3 +484,20 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 efault:
 	return -EFAULT;
 }
+
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static int __init rseq_slice_cmdline(char *str)
+{
+	bool on;
+
+	if (kstrtobool(str, &on))
+		return -EINVAL;
+
+	if (!on)
+		static_branch_disable(&rseq_slice_extension_key);
+	return 1;
+}
+__setup("rseq_slice_ext=", rseq_slice_cmdline);
+#endif /* CONFIG_RSEQ_SLICE_EXTENSION */



* [patch V2 04/12] rseq: Add statistics for time slice extensions
  2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (2 preceding siblings ...)
  2025-10-22 12:57 ` [patch V2 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
@ 2025-10-22 12:57 ` Thomas Gleixner
  2025-10-22 12:57 ` [patch V2 05/12] rseq: Add prctl() to enable " Thomas Gleixner
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-22 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

Extend the quick statistics with time slice specific fields.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rseq_entry.h |    4 ++++
 kernel/rseq.c              |   12 ++++++++++++
 2 files changed, 16 insertions(+)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -15,6 +15,10 @@ struct rseq_stats {
 	unsigned long	cs;
 	unsigned long	clear;
 	unsigned long	fixup;
+	unsigned long	s_granted;
+	unsigned long	s_expired;
+	unsigned long	s_revoked;
+	unsigned long	s_yielded;
 };
 
 DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -138,6 +138,12 @@ static int rseq_stats_show(struct seq_fi
 		stats.cs	+= data_race(per_cpu(rseq_stats.cs, cpu));
 		stats.clear	+= data_race(per_cpu(rseq_stats.clear, cpu));
 		stats.fixup	+= data_race(per_cpu(rseq_stats.fixup, cpu));
+		if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+			stats.s_granted	+= data_race(per_cpu(rseq_stats.s_granted, cpu));
+			stats.s_expired	+= data_race(per_cpu(rseq_stats.s_expired, cpu));
+			stats.s_revoked	+= data_race(per_cpu(rseq_stats.s_revoked, cpu));
+			stats.s_yielded	+= data_race(per_cpu(rseq_stats.s_yielded, cpu));
+		}
 	}
 
 	seq_printf(m, "exit:   %16lu\n", stats.exit);
@@ -148,6 +154,12 @@ static int rseq_stats_show(struct seq_fi
 	seq_printf(m, "cs:     %16lu\n", stats.cs);
 	seq_printf(m, "clear:  %16lu\n", stats.clear);
 	seq_printf(m, "fixup:  %16lu\n", stats.fixup);
+	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+		seq_printf(m, "sgrant: %16lu\n", stats.s_granted);
+		seq_printf(m, "sexpir: %16lu\n", stats.s_expired);
+		seq_printf(m, "srevok: %16lu\n", stats.s_revoked);
+		seq_printf(m, "syield: %16lu\n", stats.s_yielded);
+	}
 	return 0;
 }
 



* [patch V2 05/12] rseq: Add prctl() to enable time slice extensions
  2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (3 preceding siblings ...)
  2025-10-22 12:57 ` [patch V2 04/12] rseq: Add statistics " Thomas Gleixner
@ 2025-10-22 12:57 ` Thomas Gleixner
  2025-10-27  9:40   ` Sebastian Andrzej Siewior
  2025-10-22 12:57 ` [patch V2 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-22 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

Implement a prctl() so that tasks can enable the time slice extension
mechanism. This fails when time slice extensions are disabled at compile
time or on the kernel command line, or when no rseq pointer is registered
in the kernel.

That allows implementing a single trivial check in the exit-to-user-mode
hotpath to decide whether the whole mechanism needs to be invoked.
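
The GET/SET semantics can be sketched as a standalone simulation of the
return value logic (constants re-defined from this patch since they are not
in released uapi headers; the ENOTSUPP command-line-disabled case is
omitted, and `struct task_sim` stands in for the per-task rseq state):

```c
#include <assert.h>
#include <errno.h>

/* Constants from the patch, re-defined here for illustration. */
#define PR_RSEQ_SLICE_EXTENSION		79
#define PR_RSEQ_SLICE_EXTENSION_GET	1
#define PR_RSEQ_SLICE_EXTENSION_SET	2
#define PR_RSEQ_SLICE_EXT_ENABLE	0x01

/* Simulated per-task state; stands in for current->rseq in the kernel. */
struct task_sim {
	int rseq_registered;	/* an rseq area was registered */
	int slice_enabled;	/* extensions enabled via prctl() */
};

/* Models the return value logic of rseq_slice_extension_prctl(). */
static int slice_prctl_sim(struct task_sim *t, unsigned long arg2,
			   unsigned long arg3)
{
	switch (arg2) {
	case PR_RSEQ_SLICE_EXTENSION_GET:
		if (arg3)
			return -EINVAL;
		return t->slice_enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
	case PR_RSEQ_SLICE_EXTENSION_SET:
		if (arg3 & ~(unsigned long)PR_RSEQ_SLICE_EXT_ENABLE)
			return -EINVAL;
		if (!t->rseq_registered)
			return -ENXIO;
		t->slice_enabled = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
		return 0;
	}
	return -EINVAL;
}
```

User space is expected to register rseq first, then enable extensions once;
the GET operation exists so libraries can probe the current state cheaply.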

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/rseq.h       |    9 +++++++
 include/uapi/linux/prctl.h |   10 ++++++++
 kernel/rseq.c              |   52 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c               |    6 +++++
 4 files changed, 77 insertions(+)

--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -164,4 +164,13 @@ void rseq_syscall(struct pt_regs *regs);
 static inline void rseq_syscall(struct pt_regs *regs) { }
 #endif /* !CONFIG_DEBUG_RSEQ */
 
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+	return -EINVAL;
+}
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
 #endif /* _LINUX_RSEQ_H */
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -386,4 +386,14 @@ struct prctl_mm_map {
 # define PR_FUTEX_HASH_SET_SLOTS	1
 # define PR_FUTEX_HASH_GET_SLOTS	2
 
+/* RSEQ time slice extensions */
+#define PR_RSEQ_SLICE_EXTENSION			79
+# define PR_RSEQ_SLICE_EXTENSION_GET		1
+# define PR_RSEQ_SLICE_EXTENSION_SET		2
+/*
+ * Bits for RSEQ_SLICE_EXTENSION_GET/SET
+ * PR_RSEQ_SLICE_EXT_ENABLE:	Enable
+ */
+# define PR_RSEQ_SLICE_EXT_ENABLE		0x01
+
 #endif /* _LINUX_PRCTL_H */
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,7 @@
 #define RSEQ_BUILD_SLOW_PATH
 
 #include <linux/debugfs.h>
+#include <linux/prctl.h>
 #include <linux/ratelimit.h>
 #include <linux/rseq_entry.h>
 #include <linux/sched.h>
@@ -500,6 +501,57 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 #ifdef CONFIG_RSEQ_SLICE_EXTENSION
 DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
 
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+	switch (arg2) {
+	case PR_RSEQ_SLICE_EXTENSION_GET:
+		if (arg3)
+			return -EINVAL;
+		return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
+
+	case PR_RSEQ_SLICE_EXTENSION_SET: {
+		u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+		bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
+
+		if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
+			return -EINVAL;
+		if (!rseq_slice_extension_enabled())
+			return -ENOTSUPP;
+		if (!current->rseq.usrptr)
+			return -ENXIO;
+
+		/* No change? */
+		if (enable == !!current->rseq.slice.state.enabled)
+			return 0;
+
+		if (get_user(rflags, &current->rseq.usrptr->flags))
+			goto die;
+
+		if (current->rseq.slice.state.enabled)
+			valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+		if ((rflags & valid) != valid)
+			goto die;
+
+		rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+		rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+		if (enable)
+			rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+		if (put_user(rflags, &current->rseq.usrptr->flags))
+			goto die;
+
+		current->rseq.slice.state.enabled = enable;
+		return 0;
+	}
+	default:
+		return -EINVAL;
+	}
+die:
+	force_sig(SIGSEGV);
+	return -EFAULT;
+}
+
 static int __init rseq_slice_cmdline(char *str)
 {
 	bool on;
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -53,6 +53,7 @@
 #include <linux/time_namespace.h>
 #include <linux/binfmts.h>
 #include <linux/futex.h>
+#include <linux/rseq.h>
 
 #include <linux/sched.h>
 #include <linux/sched/autogroup.h>
@@ -2868,6 +2869,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 	case PR_FUTEX_HASH:
 		error = futex_hash_prctl(arg2, arg3, arg4);
 		break;
+	case PR_RSEQ_SLICE_EXTENSION:
+		if (arg4 || arg5)
+			return -EINVAL;
+		error = rseq_slice_extension_prctl(arg2, arg3);
+		break;
 	default:
 		trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
 		error = -EINVAL;


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch V2 06/12] rseq: Implement sys_rseq_slice_yield()
  2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (4 preceding siblings ...)
  2025-10-22 12:57 ` [patch V2 05/12] rseq: Add prctl() to enable " Thomas Gleixner
@ 2025-10-22 12:57 ` Thomas Gleixner
  2025-10-22 12:57 ` [patch V2 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-22 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Arnd Bergmann, linux-arch, Peter Zijlstra,
	Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
	Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
	Steven Rostedt, Sebastian Andrzej Siewior

Provide a new syscall whose sole purpose is to yield the CPU after the
kernel granted a time slice extension.

sched_yield() is not suitable for that because it unconditionally
schedules, but the end of a time slice extension does not require a
schedule when the task was already preempted. A dedicated syscall also
allows a strict termination check to catch user space invoking random
syscalls, including sched_yield(), from a time slice extension region.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arch@vger.kernel.org
---
V2: Use the proper name in sys_ni.c and add comment - Prateek
---
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/tools/syscall_32.tbl             |    1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 include/linux/syscalls.h                    |    1 +
 include/uapi/asm-generic/unistd.h           |    5 ++++-
 kernel/rseq.c                               |   21 +++++++++++++++++++++
 kernel/sys_ni.c                             |    1 +
 scripts/syscall.tbl                         |    1 +
 21 files changed, 44 insertions(+), 1 deletion(-)

--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -509,3 +509,4 @@
 577	common	open_tree_attr			sys_open_tree_attr
 578	common	file_getattr			sys_file_getattr
 579	common	file_setattr			sys_file_setattr
+580	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -484,3 +484,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -481,3 +481,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -469,3 +469,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -408,3 +408,4 @@
 467	n32	open_tree_attr			sys_open_tree_attr
 468	n32	file_getattr			sys_file_getattr
 469	n32	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -384,3 +384,4 @@
 467	n64	open_tree_attr			sys_open_tree_attr
 468	n64	file_getattr			sys_file_getattr
 469	n64	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -457,3 +457,4 @@
 467	o32	open_tree_attr			sys_open_tree_attr
 468	o32	file_getattr			sys_file_getattr
 469	o32	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -468,3 +468,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -560,3 +560,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	nospu	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -472,3 +472,4 @@
 467  common	open_tree_attr		sys_open_tree_attr		sys_open_tree_attr
 468  common	file_getattr		sys_file_getattr		sys_file_getattr
 469  common	file_setattr		sys_file_setattr		sys_file_setattr
+470  common	rseq_slice_yield	sys_rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -475,3 +475,4 @@
 467	i386	open_tree_attr		sys_open_tree_attr
 468	i386	file_getattr		sys_file_getattr
 469	i386	file_setattr		sys_file_setattr
+470	i386	rseq_slice_yield	sys_rseq_slice_yield
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -394,6 +394,7 @@
 467	common	open_tree_attr		sys_open_tree_attr
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
+470	common	rseq_slice_yield	sys_rseq_slice_yield
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -957,6 +957,7 @@ asmlinkage long sys_statx(int dfd, const
 			  unsigned mask, struct statx __user *buffer);
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
 			 int flags, uint32_t sig);
+asmlinkage long sys_rseq_slice_yield(void);
 asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
 asmlinkage long sys_open_tree_attr(int dfd, const char __user *path,
 				   unsigned flags,
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -858,8 +858,11 @@
 #define __NR_file_setattr 469
 __SYSCALL(__NR_file_setattr, sys_file_setattr)
 
+#define __NR_rseq_slice_yield 470
+__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+
 #undef __NR_syscalls
-#define __NR_syscalls 470
+#define __NR_syscalls 471
 
 /*
  * 32 bit systems traditionally used different
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -552,6 +552,27 @@ int rseq_slice_extension_prctl(unsigned
 	return -EFAULT;
 }
 
+/**
+ * sys_rseq_slice_yield - yield the current processor when a task which was
+ *			  granted a time slice extension is done with its
+ *			  critical work before being forced out
+ *
+ * On entry from user space, syscall_entry_work() ensures that NEED_RESCHED is
+ * set if the task was granted a slice extension before arriving here.
+ *
+ * Return: 1 if the task successfully yielded the CPU within the granted slice.
+ *         0 if the slice extension was either never granted or was revoked by
+ *	     exceeding the granted extension or by being scheduled out earlier.
+ */
+SYSCALL_DEFINE0(rseq_slice_yield)
+{
+	if (need_resched()) {
+		schedule();
+		return 1;
+	}
+	return 0;
+}
+
 static int __init rseq_slice_cmdline(char *str)
 {
 	bool on;
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -390,6 +390,7 @@ COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+COND_SYSCALL(rseq_slice_yield);
 
 COND_SYSCALL(uretprobe);
 COND_SYSCALL(uprobe);
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -410,3 +410,4 @@
 467	common	open_tree_attr			sys_open_tree_attr
 468	common	file_getattr			sys_file_getattr
 469	common	file_setattr			sys_file_setattr
+470	common	rseq_slice_yield		sys_rseq_slice_yield


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch V2 07/12] rseq: Implement syscall entry work for time slice extensions
  2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (5 preceding siblings ...)
  2025-10-22 12:57 ` [patch V2 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
@ 2025-10-22 12:57 ` Thomas Gleixner
  2025-10-22 12:57 ` [patch V2 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-22 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice
extension. This allows handling the rseq_slice_yield() syscall, which user
space uses to relinquish the CPU after finishing the critical section for
which it requested the extension.

In case the kernel state is still GRANTED, the kernel resets both kernel
and user space state with a set of sanity checks. If the kernel state is
already cleared, then the syscall raced against the timer or some other
interrupt and only the work bit is cleared.

Doing this in the syscall entry work allows catching misbehaving user
space which issues an arbitrary syscall from the critical section. A wrong
syscall or inconsistent user space state results in SIGSEGV.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/entry-common.h  |    2 -
 include/linux/rseq.h          |    2 +
 include/linux/thread_info.h   |   16 ++++----
 kernel/entry/syscall-common.c |   11 ++++-
 kernel/rseq.c                 |   80 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 101 insertions(+), 10 deletions(-)

--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -36,8 +36,8 @@
 				 SYSCALL_WORK_SYSCALL_EMU |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_SYSCALL_RSEQ_SLICE |	\
 				 ARCH_SYSCALL_WORK_ENTER)
-
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -165,8 +165,10 @@ static inline void rseq_syscall(struct p
 #endif /* !CONFIG_DEBUG_RSEQ */
 
 #ifdef CONFIG_RSEQ_SLICE_EXTENSION
+void rseq_syscall_enter_work(long syscall);
 int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
 #else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline void rseq_syscall_enter_work(long syscall) { }
 static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
 {
 	return -EINVAL;
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,15 +46,17 @@ enum syscall_work_bit {
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
 	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 	SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+	SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE,
 };
 
-#define SYSCALL_WORK_SECCOMP		BIT(SYSCALL_WORK_BIT_SECCOMP)
-#define SYSCALL_WORK_SYSCALL_TRACEPOINT	BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
-#define SYSCALL_WORK_SYSCALL_TRACE	BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
-#define SYSCALL_WORK_SYSCALL_EMU	BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
-#define SYSCALL_WORK_SYSCALL_AUDIT	BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
-#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
-#define SYSCALL_WORK_SYSCALL_EXIT_TRAP	BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SECCOMP			BIT(SYSCALL_WORK_BIT_SECCOMP)
+#define SYSCALL_WORK_SYSCALL_TRACEPOINT		BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
+#define SYSCALL_WORK_SYSCALL_TRACE		BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
+#define SYSCALL_WORK_SYSCALL_EMU		BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
+#define SYSCALL_WORK_SYSCALL_AUDIT		BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
+#define SYSCALL_WORK_SYSCALL_USER_DISPATCH	BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
+#define SYSCALL_WORK_SYSCALL_EXIT_TRAP		BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_RSEQ_SLICE		BIT(SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE)
 #endif
 
 #include <asm/thread_info.h>
--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -17,8 +17,7 @@ static inline void syscall_enter_audit(s
 	}
 }
 
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
-				unsigned long work)
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work)
 {
 	long ret = 0;
 
@@ -32,6 +31,14 @@ long syscall_trace_enter(struct pt_regs
 			return -1L;
 	}
 
+	/*
+	 * User space got a time slice extension granted and relinquishes
+	 * the CPU. The work stops the slice timer to avoid an extra round
+	 * through hrtimer_interrupt().
+	 */
+	if (work & SYSCALL_WORK_SYSCALL_RSEQ_SLICE)
+		rseq_syscall_enter_work(syscall);
+
 	/* Handle ptrace */
 	if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
 		ret = ptrace_report_syscall_entry(regs);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -501,6 +501,86 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 #ifdef CONFIG_RSEQ_SLICE_EXTENSION
 DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
 
+static inline void rseq_slice_set_need_resched(struct task_struct *curr)
+{
+	/*
+	 * The interrupt guard is required to prevent inconsistent state in
+	 * this case:
+	 *
+	 * set_tsk_need_resched()
+	 * --> Interrupt
+	 *       wakeup()
+	 *        set_tsk_need_resched()
+	 *	  set_preempt_need_resched()
+	 *     schedule_on_return()
+	 *        clear_tsk_need_resched()
+	 *	  clear_preempt_need_resched()
+	 * set_preempt_need_resched()		<- Inconsistent state
+	 *
+	 * This is safe vs. a remote set of TIF_NEED_RESCHED because that
+	 * only sets the already set bit and does not create inconsistent
+	 * state.
+	 */
+	scoped_guard(irq)
+		set_need_resched_current();
+}
+
+static void rseq_slice_validate_ctrl(u32 expected)
+{
+	u32 __user *sctrl = &current->rseq.usrptr->slice_ctrl.all;
+	u32 uval;
+
+	if (!get_user_scoped(uval, sctrl) || uval != expected)
+		force_sig(SIGSEGV);
+}
+
+/*
+ * Invoked from syscall entry if a time slice extension was granted and the
+ * kernel did not clear it before user space left the critical section.
+ */
+void rseq_syscall_enter_work(long syscall)
+{
+	struct task_struct *curr = current;
+	struct rseq_slice_ctrl ctrl = { .granted = curr->rseq.slice.state.granted };
+
+	clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+
+	if (static_branch_unlikely(&rseq_debug_enabled))
+		rseq_slice_validate_ctrl(ctrl.all);
+
+	/*
+	 * The kernel might have raced, revoked the grant and updated
+	 * userspace, but kept the SLICE work set.
+	 */
+	if (!ctrl.granted)
+		return;
+
+	rseq_stat_inc(rseq_stats.s_yielded);
+
+	/*
+	 * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
+	 * kernels.
+	 */
+	scoped_guard(preempt) {
+		/*
+		 * Now that preemption is disabled, quickly check whether
+		 * the task was already rescheduled before arriving here.
+		 */
+		if (!curr->rseq.event.sched_switch)
+			rseq_slice_set_need_resched(curr);
+	}
+
+	curr->rseq.slice.state.granted = false;
+	/*
+	 * Clear the grant in user space and check whether this was the
+	 * correct syscall to yield. If the user access fails or the task
+	 * used an arbitrary syscall, terminate it.
+	 */
+	if (!put_user_scoped(0U, &curr->rseq.usrptr->slice_ctrl.all) ||
+	    syscall != __NR_rseq_slice_yield)
+		force_sig(SIGSEGV);
+}
+
 int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
 {
 	switch (arg2) {


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch V2 08/12] rseq: Implement time slice extension enforcement timer
  2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (6 preceding siblings ...)
  2025-10-22 12:57 ` [patch V2 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
@ 2025-10-22 12:57 ` Thomas Gleixner
  2025-10-27 11:38   ` Sebastian Andrzej Siewior
  2025-10-22 12:57 ` [patch V2 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-22 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.

It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:

   1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
      independently of CONFIG_HIGH_RES_TIMERS

   2) HRTICK usage in the scheduler can be runtime disabled or is only used
      for certain aspects of scheduling.

   3) The function calls into scheduler code, which might have unexpected
      consequences when it is invoked due to a time slice enforcement
      expiry, especially when the task managed to clear the grant via
      sched_yield().

It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.

Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.

The timer is armed when an extension was granted right before actually
returning to user mode in rseq_exit_to_user_mode_restart().

It is disarmed when the task relinquishes the CPU. This is expensive, as
the timer is probably the first expiring timer on the CPU, which means the
hardware has to be reprogrammed. But that's still less expensive than going
through a full hrtimer interrupt cycle for nothing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/rseq_entry.h |   38 ++++++++++----
 include/linux/rseq_types.h |    2 
 kernel/rseq.c              |  119 ++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 147 insertions(+), 12 deletions(-)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -86,8 +86,24 @@ static __always_inline bool rseq_slice_e
 {
 	return static_branch_likely(&rseq_slice_extension_key);
 }
+
+extern unsigned int rseq_slice_ext_nsecs;
+bool __rseq_arm_slice_extension_timer(void);
+
+static __always_inline bool rseq_arm_slice_extension_timer(void)
+{
+	if (!rseq_slice_extension_enabled())
+		return false;
+
+	if (likely(!current->rseq.slice.state.granted))
+		return false;
+
+	return __rseq_arm_slice_extension_timer();
+}
+
 #else /* CONFIG_RSEQ_SLICE_EXTENSION */
 static inline bool rseq_slice_extension_enabled(void) { return false; }
+static inline bool rseq_arm_slice_extension_timer(void) { return false; }
 #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -543,17 +559,19 @@ static __always_inline void clear_tif_rs
 static __always_inline bool
 rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
 {
-	if (likely(!test_tif_rseq(ti_work)))
-		return false;
-
-	if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
-		current->rseq.event.slowpath = true;
-		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
-		return true;
+	if (unlikely(test_tif_rseq(ti_work))) {
+		if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
+			current->rseq.event.slowpath = true;
+			set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+			return true;
+		}
+		clear_tif_rseq();
 	}
-
-	clear_tif_rseq();
-	return false;
+	/*
+	 * Arm the slice extension timer if nothing to do anymore and the
+	 * task really goes out to user space.
+	 */
+	return rseq_arm_slice_extension_timer();
 }
 
 #endif /* CONFIG_GENERIC_ENTRY */
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -89,9 +89,11 @@ union rseq_slice_state {
 /**
  * struct rseq_slice - Status information for rseq time slice extension
  * @state:	Time slice extension state
+ * @expires:	The time when a grant expires
  */
 struct rseq_slice {
 	union rseq_slice_state	state;
+	u64			expires;
 };
 
 /**
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,8 @@
 #define RSEQ_BUILD_SLOW_PATH
 
 #include <linux/debugfs.h>
+#include <linux/hrtimer.h>
+#include <linux/percpu.h>
 #include <linux/prctl.h>
 #include <linux/ratelimit.h>
 #include <linux/rseq_entry.h>
@@ -500,8 +502,82 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 }
 
 #ifdef CONFIG_RSEQ_SLICE_EXTENSION
+struct slice_timer {
+	struct hrtimer	timer;
+	void		*cookie;
+};
+
+unsigned int rseq_slice_ext_nsecs __read_mostly = 30 * NSEC_PER_USEC;
+static DEFINE_PER_CPU(struct slice_timer, slice_timer);
 DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
 
+static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
+{
+	struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
+
+	if (st->cookie == current && current->rseq.slice.state.granted) {
+		rseq_stat_inc(rseq_stats.s_expired);
+		set_need_resched_current();
+	}
+	return HRTIMER_NORESTART;
+}
+
+bool __rseq_arm_slice_extension_timer(void)
+{
+	struct slice_timer *st = this_cpu_ptr(&slice_timer);
+	struct task_struct *curr = current;
+
+	lockdep_assert_irqs_disabled();
+
+	/*
+	 * This check prevents that a granted time slice extension exceeds
+	 * the maximum scheduling latency when the grant expired before
+	 * going out to user space. Don't bother to clear the grant here,
+	 * it will be cleaned up automatically before going out to user
+	 * space.
+	 */
+	if (unlikely(curr->rseq.slice.expires < ktime_get_mono_fast_ns())) {
+		set_need_resched_current();
+		return true;
+	}
+
+	/*
+	 * Store the task pointer as a cookie for comparison in the timer
+	 * function. This is safe as the timer is CPU local and cannot be
+	 * in the expiry function at this point.
+	 */
+	st->cookie = curr;
+	hrtimer_start(&st->timer, curr->rseq.slice.expires, HRTIMER_MODE_ABS_PINNED_HARD);
+	/* Arm the syscall entry work */
+	set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+	return false;
+}
+
+static void rseq_cancel_slice_extension_timer(void)
+{
+	struct slice_timer *st = this_cpu_ptr(&slice_timer);
+
+	/*
+	 * st->cookie can be safely read as preemption is disabled and the
+	 * timer is CPU local. The active check can obviously race with the
+	 * hrtimer interrupt, but that's better than disabling interrupts
+	 * unconditionally right away.
+	 *
+	 * As this is most probably the first expiring timer, the cancel is
+	 * expensive as it has to reprogram the hardware, but that's less
+	 * expensive than going through a full hrtimer_interrupt() cycle
+	 * for nothing.
+	 *
+	 * hrtimer_try_to_cancel() is sufficient here as with interrupts
+	 * disabled the timer callback cannot be running and the timer base
+	 * is well determined as the timer is pinned on the local CPU.
+	 */
+	if (st->cookie == current && hrtimer_active(&st->timer)) {
+		scoped_guard(irq)
+			hrtimer_try_to_cancel(&st->timer);
+	}
+}
+
 static inline void rseq_slice_set_need_resched(struct task_struct *curr)
 {
 	/*
@@ -559,10 +635,11 @@ void rseq_syscall_enter_work(long syscal
 	rseq_stat_inc(rseq_stats.s_yielded);
 
 	/*
-	 * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
-	 * kernels.
+	 * Required to stabilize the per CPU timer pointer and to make
+	 * set_tsk_need_resched() correct on PREEMPT[RT] kernels.
 	 */
 	scoped_guard(preempt) {
+		rseq_cancel_slice_extension_timer();
 		/*
 		 * Now that preemption is disabled, quickly check whether
 		 * the task was already rescheduled before arriving here.
@@ -654,6 +731,31 @@ SYSCALL_DEFINE0(rseq_slice_yield)
 	return 0;
 }
 
+#ifdef CONFIG_SYSCTL
+static const unsigned int rseq_slice_ext_nsecs_min = 10 * NSEC_PER_USEC;
+static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC;
+
+static const struct ctl_table rseq_slice_ext_sysctl[] = {
+	{
+		.procname	= "rseq_slice_extension_nsec",
+		.data		= &rseq_slice_ext_nsecs,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec_minmax,
+		.extra1		= (unsigned int *)&rseq_slice_ext_nsecs_min,
+		.extra2		= (unsigned int *)&rseq_slice_ext_nsecs_max,
+	},
+};
+
+static void rseq_slice_sysctl_init(void)
+{
+	if (rseq_slice_extension_enabled())
+		register_sysctl_init("kernel", rseq_slice_ext_sysctl);
+}
+#else /* CONFIG_SYSCTL */
+static inline void rseq_slice_sysctl_init(void) { }
+#endif  /* !CONFIG_SYSCTL */
+
 static int __init rseq_slice_cmdline(char *str)
 {
 	bool on;
@@ -666,4 +768,17 @@ static int __init rseq_slice_cmdline(cha
 	return 1;
 }
 __setup("rseq_slice_ext=", rseq_slice_cmdline);
+
+static int __init rseq_slice_init(void)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		hrtimer_setup(per_cpu_ptr(&slice_timer.timer, cpu), rseq_slice_expired,
+			      CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
+	}
+	rseq_slice_sysctl_init();
+	return 0;
+}
+device_initcall(rseq_slice_init);
 #endif /* CONFIG_RSEQ_SLICE_EXTENSION */


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch V2 09/12] rseq: Reset slice extension when scheduled
  2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (7 preceding siblings ...)
  2025-10-22 12:57 ` [patch V2 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
@ 2025-10-22 12:57 ` Thomas Gleixner
  2025-10-22 12:57 ` [patch V2 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-22 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

When a time slice extension was granted in the need_resched() check on
exit to user space, the task can still be scheduled out by one of the
other pending work items. When it is scheduled back in and need_resched()
is not set, the stale grant would be preserved, which is just wrong.

RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the
critical section and ID update mechanisms.

Utilize them and clear the user space slice control member of struct rseq
unconditionally within the existing user access sections. That is just one
additional unconditional store in that path.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/rseq_entry.h |   29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -101,9 +101,17 @@ static __always_inline bool rseq_arm_sli
 	return __rseq_arm_slice_extension_timer();
 }
 
+static __always_inline void rseq_slice_clear_grant(struct task_struct *t)
+{
+	if (IS_ENABLED(CONFIG_RSEQ_STATS) && t->rseq.slice.state.granted)
+		rseq_stat_inc(rseq_stats.s_revoked);
+	t->rseq.slice.state.granted = false;
+}
+
 #else /* CONFIG_RSEQ_SLICE_EXTENSION */
 static inline bool rseq_slice_extension_enabled(void) { return false; }
 static inline bool rseq_arm_slice_extension_timer(void) { return false; }
+static inline void rseq_slice_clear_grant(struct task_struct *t) { }
 #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -390,6 +398,13 @@ bool rseq_set_ids_get_csaddr(struct task
 		unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
 		if (csaddr)
 			unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
+
+		/* Open coded, so it's in the same user access region */
+		if (rseq_slice_extension_enabled()) {
+			/* Unconditionally clear it, no point in conditionals */
+			unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+			rseq_slice_clear_grant(t);
+		}
 	}
 
 	/* Cache the new values */
@@ -487,8 +502,16 @@ static __always_inline bool rseq_exit_us
 		 */
 		u64 csaddr;
 
-		if (unlikely(!get_user_scoped(csaddr, &rseq->rseq_cs)))
-			return false;
+		scoped_user_rw_access(rseq, efault) {
+			unsafe_get_user(csaddr, &rseq->rseq_cs, efault);
+
+			/* Open coded, so it's in the same user access region */
+			if (rseq_slice_extension_enabled()) {
+				/* Unconditionally clear it, no point in conditionals */
+				unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+				rseq_slice_clear_grant(t);
+			}
+		}
 
 		if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
 			if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
@@ -504,6 +527,8 @@ static __always_inline bool rseq_exit_us
 	u32 node_id = cpu_to_node(ids.cpu_id);
 
 	return rseq_update_usr(t, regs, &ids, node_id);
+efault:
+	return false;
 }
 
 static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch V2 10/12] rseq: Implement rseq_grant_slice_extension()
  2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (8 preceding siblings ...)
  2025-10-22 12:57 ` [patch V2 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
@ 2025-10-22 12:57 ` Thomas Gleixner
  2025-10-22 12:57 ` [patch V2 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-22 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

Provide the actual decision function, which decides whether a time slice
extension is granted in the exit to user mode path when NEED_RESCHED is
evaluated.

The decision is made in two stages. First an inline quick check avoids
going into the actual decision function. This checks whether:

 #1 the functionality is enabled

 #2 the exit is a return from interrupt to user mode

 #3 any TIF bit which causes extra work is set. That includes TIF_RSEQ,
    which means the task was already scheduled out.

The slow path, which implements the actual user space ABI, is invoked
when:

  A) #1 is true, #2 is true and #3 is false

     It checks whether user space requested a slice extension by setting
     the request bit in the rseq slice_ctrl field. If so, it grants the
     extension and stores the slice expiry time, so that the actual exit
     code can double check whether the slice is already exhausted before
     going back.

  B) #1 - #3 are true _and_ a slice extension was granted in a previous
     loop iteration

     In this case the grant is revoked.

In case the user space access faults or invalid state is detected, the
task is terminated with SIGSEGV.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V2: Provide an extra stub for the !RSEQ case - Prateek
---
 include/linux/rseq_entry.h |  108 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -41,6 +41,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
 #ifdef CONFIG_RSEQ
 #include <linux/jump_label.h>
 #include <linux/rseq.h>
+#include <linux/sched/signal.h>
 #include <linux/uaccess.h>
 
 #include <linux/tracepoint-defs.h>
@@ -108,10 +109,116 @@ static __always_inline void rseq_slice_c
 	t->rseq.slice.state.granted = false;
 }
 
+static __always_inline bool rseq_grant_slice_extension(bool work_pending)
+{
+	struct task_struct *curr = current;
+	struct rseq_slice_ctrl usr_ctrl;
+	union rseq_slice_state state;
+	struct rseq __user *rseq;
+
+	if (!rseq_slice_extension_enabled())
+		return false;
+
+	/* If not enabled or not a return from interrupt, nothing to do. */
+	state = curr->rseq.slice.state;
+	state.enabled &= curr->rseq.event.user_irq;
+	if (likely(!state.state))
+		return false;
+
+	rseq = curr->rseq.usrptr;
+	scoped_user_rw_access(rseq, efault) {
+
+		/*
+		 * Quick check conditions where a grant is not possible or
+		 * needs to be revoked.
+		 *
+		 *  1) Any TIF bit which needs to do extra work aside of
+		 *     rescheduling prevents a grant.
+		 *
+		 *  2) A previous rescheduling request resulted in a slice
+		 *     extension grant.
+		 */
+		if (unlikely(work_pending || state.granted)) {
+			/* Clear user control unconditionally. No point in checking */
+			unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+			rseq_slice_clear_grant(curr);
+			return false;
+		}
+
+		unsafe_get_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault);
+		if (likely(!(usr_ctrl.request)))
+			return false;
+
+	/* Grant the slice extension */
+		usr_ctrl.request = 0;
+		usr_ctrl.granted = 1;
+		unsafe_put_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault);
+	}
+
+	rseq_stat_inc(rseq_stats.s_granted);
+
+	curr->rseq.slice.state.granted = true;
+	/* Store expiry time for arming the timer on the way out */
+	curr->rseq.slice.expires = data_race(rseq_slice_ext_nsecs) + ktime_get_mono_fast_ns();
+	/*
+	 * This is racy against a remote CPU setting TIF_NEED_RESCHED in
+	 * several ways:
+	 *
+	 * 1)
+	 *	CPU0			CPU1
+	 *	clear_tsk()
+	 *				set_tsk()
+	 *	clear_preempt()
+	 *				Raise scheduler IPI on CPU0
+	 *	--> IPI
+	 *	    fold_need_resched() -> Folds correctly
+	 * 2)
+	 *	CPU0			CPU1
+	 *				set_tsk()
+	 *	clear_tsk()
+	 *	clear_preempt()
+	 *				Raise scheduler IPI on CPU0
+	 *	--> IPI
+	 *	    fold_need_resched() <- NOOP as TIF_NEED_RESCHED is false
+	 *
+	 * #1 is not any different from a regular remote reschedule as it
+	 *    sets the previously not set bit and then raises the IPI which
+	 *    folds it into the preempt counter
+	 *
+	 * #2 is obviously incorrect from a scheduler POV, but it's not
+	 *    differently incorrect than the code below clearing the
+	 *    reschedule request with the safety net of the timer.
+	 *
+	 * The important part is that the clearing is protected against the
+	 * scheduler IPI and also against any other interrupt which might
+	 * end up waking up a task and setting the bits in the middle of
+	 * the operation:
+	 *
+	 *	clear_tsk()
+	 *	---> Interrupt
+	 *		wakeup_on_this_cpu()
+	 *		set_tsk()
+	 *		set_preempt()
+	 *	clear_preempt()
+	 *
+	 * which would be inconsistent state.
+	 */
+	scoped_guard(irq) {
+		clear_tsk_need_resched(curr);
+		clear_preempt_need_resched();
+	}
+	return true;
+
+efault:
+	force_sig(SIGSEGV);
+	return false;
+}
+
 #else /* CONFIG_RSEQ_SLICE_EXTENSION */
 static inline bool rseq_slice_extension_enabled(void) { return false; }
 static inline bool rseq_arm_slice_extension_timer(void) { return false; }
 static inline void rseq_slice_clear_grant(struct task_struct *t) { }
+static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
 #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -645,6 +752,7 @@ static inline bool rseq_exit_to_user_mod
 static inline void rseq_syscall_exit_to_user_mode(void) { }
 static inline void rseq_irqentry_exit_to_user_mode(void) { }
 static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
+static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
 #endif /* !CONFIG_RSEQ */
 
 #endif /* _LINUX_RSEQ_ENTRY_H */


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch V2 11/12] entry: Hook up rseq time slice extension
  2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (9 preceding siblings ...)
  2025-10-22 12:57 ` [patch V2 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
@ 2025-10-22 12:57 ` Thomas Gleixner
  2025-10-22 12:57 ` [patch V2 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
  2025-10-27 17:30 ` [patch V2 00/12] rseq: Implement time slice extension mechanism Sebastian Andrzej Siewior
  12 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-22 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Peter Zilstra, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

Wire the grant decision function up in exit_to_user_mode_loop().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 kernel/entry/common.c |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -17,6 +17,14 @@ void __weak arch_do_signal_or_restart(st
 #define EXIT_TO_USER_MODE_WORK_LOOP	(EXIT_TO_USER_MODE_WORK)
 #endif
 
+/* TIF bits, which prevent a time slice extension. */
+#ifdef CONFIG_PREEMPT_RT
+# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED_LAZY)
+#else
+# define TIF_SLICE_EXT_SCHED	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
+#endif
+#define TIF_SLICE_EXT_DENY	(EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED)
+
 static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
 							      unsigned long ti_work)
 {
@@ -28,8 +36,10 @@ static __always_inline unsigned long __e
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
-			schedule();
+		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+			if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+				schedule();
+		}
 
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch V2 12/12] selftests/rseq: Implement time slice extension test
  2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (10 preceding siblings ...)
  2025-10-22 12:57 ` [patch V2 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
@ 2025-10-22 12:57 ` Thomas Gleixner
  2025-10-27 17:30 ` [patch V2 00/12] rseq: Implement time slice extension mechanism Sebastian Andrzej Siewior
  12 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-22 12:57 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zilstra, Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch

Provide an initial test case to evaluate the functionality. This needs to be
extended to cover the ABI violations and expose the race condition between
observing granted and arriving in rseq_slice_yield().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 tools/testing/selftests/rseq/.gitignore   |    1 
 tools/testing/selftests/rseq/Makefile     |    5 
 tools/testing/selftests/rseq/rseq-abi.h   |   27 ++++
 tools/testing/selftests/rseq/slice_test.c |  198 ++++++++++++++++++++++++++++++
 4 files changed, 230 insertions(+), 1 deletion(-)

--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -10,3 +10,4 @@ param_test_mm_cid
 param_test_mm_cid_benchmark
 param_test_mm_cid_compare_twice
 syscall_errors_test
+slice_test
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
 		param_test_benchmark param_test_compare_twice param_test_mm_cid \
 		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
-		syscall_errors_test
+		syscall_errors_test slice_test
 
 TEST_GEN_PROGS_EXTENDED = librseq.so
 
@@ -59,3 +59,6 @@ include ../lib.mk
 $(OUTPUT)/syscall_errors_test: syscall_errors_test.c $(TEST_GEN_PROGS_EXTENDED) \
 					rseq.h rseq-*.h
 	$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
+	$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -53,6 +53,27 @@ struct rseq_abi_cs {
 	__u64 abort_ip;
 } __attribute__((aligned(4 * sizeof(__u64))));
 
+/**
+ * rseq_slice_ctrl - Time slice extension control structure
+ * @all:	Compound value
+ * @request:	Request for a time slice extension
+ * @granted:	Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space.  @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_slice_ctrl {
+	union {
+		__u32		all;
+		struct {
+			__u8	request;
+			__u8	granted;
+			__u16	__reserved;
+		};
+	};
+};
+
 /*
  * struct rseq_abi is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line.
@@ -165,6 +186,12 @@ struct rseq_abi {
 	__u32 mm_cid;
 
 	/*
+	 * Time slice extension control structure. CPU local updates from
+	 * kernel and user space.
+	 */
+	struct rseq_slice_ctrl slice_ctrl;
+
+	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
 	char end[];
--- /dev/null
+++ b/tools/testing/selftests/rseq/slice_test.c
@@ -0,0 +1,198 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include <linux/prctl.h>
+#include <sys/prctl.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+#include "../kselftest_harness.h"
+
+#ifndef __NR_rseq_slice_yield
+# define __NR_rseq_slice_yield	470
+#endif
+
+#define BITS_PER_INT	32
+#define BITS_PER_BYTE	8
+
+#ifndef PR_RSEQ_SLICE_EXTENSION
+# define PR_RSEQ_SLICE_EXTENSION		79
+#  define PR_RSEQ_SLICE_EXTENSION_GET		1
+#  define PR_RSEQ_SLICE_EXTENSION_SET		2
+#  define PR_RSEQ_SLICE_EXT_ENABLE		0x01
+#endif
+
+#ifndef RSEQ_SLICE_EXT_REQUEST_BIT
+# define RSEQ_SLICE_EXT_REQUEST_BIT	0
+# define RSEQ_SLICE_EXT_GRANTED_BIT	1
+#endif
+
+#ifndef asm_inline
+# define asm_inline	asm __inline
+#endif
+
+#define NSEC_PER_SEC	1000000000L
+#define NSEC_PER_USEC	      1000L
+
+struct noise_params {
+	int	noise_nsecs;
+	int	sleep_nsecs;
+	int	run;
+};
+
+FIXTURE(slice_ext)
+{
+	pthread_t		noise_thread;
+	struct noise_params	noise_params;
+};
+
+FIXTURE_VARIANT(slice_ext)
+{
+	int64_t	total_nsecs;
+	int	slice_nsecs;
+	int	noise_nsecs;
+	int	sleep_nsecs;
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n2_2_50)
+{
+	.total_nsecs	=  5 * NSEC_PER_SEC,
+	.slice_nsecs	=  2 * NSEC_PER_USEC,
+	.noise_nsecs    =  2 * NSEC_PER_USEC,
+	.sleep_nsecs	= 50 * NSEC_PER_USEC,
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n50_2_50)
+{
+	.total_nsecs	=  5 * NSEC_PER_SEC,
+	.slice_nsecs	= 50 * NSEC_PER_USEC,
+	.noise_nsecs    =  2 * NSEC_PER_USEC,
+	.sleep_nsecs	= 50 * NSEC_PER_USEC,
+};
+
+static inline bool elapsed(struct timespec *start, struct timespec *now,
+			   int64_t span)
+{
+	int64_t delta = now->tv_sec - start->tv_sec;
+
+	delta *= NSEC_PER_SEC;
+	delta += now->tv_nsec - start->tv_nsec;
+	return delta >= span;
+}
+
+static void *noise_thread(void *arg)
+{
+	struct noise_params *p = arg;
+
+	while (RSEQ_READ_ONCE(p->run)) {
+		struct timespec ts_start, ts_now;
+
+		clock_gettime(CLOCK_MONOTONIC, &ts_start);
+		do {
+			clock_gettime(CLOCK_MONOTONIC, &ts_now);
+		} while (!elapsed(&ts_start, &ts_now, p->noise_nsecs));
+
+		ts_start.tv_sec = 0;
+		ts_start.tv_nsec = p->sleep_nsecs;
+		clock_nanosleep(CLOCK_MONOTONIC, 0, &ts_start, NULL);
+	}
+	return NULL;
+}
+
+FIXTURE_SETUP(slice_ext)
+{
+	cpu_set_t affinity;
+
+	ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0);
+
+	/* Pin it on a single CPU. Avoid CPU 0 */
+	for (int i = 1; i < CPU_SETSIZE; i++) {
+		if (!CPU_ISSET(i, &affinity))
+			continue;
+
+		CPU_ZERO(&affinity);
+		CPU_SET(i, &affinity);
+		ASSERT_EQ(sched_setaffinity(0, sizeof(affinity), &affinity), 0);
+		break;
+	}
+
+	ASSERT_EQ(rseq_register_current_thread(), 0);
+
+	ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+			PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0);
+
+	self->noise_params.noise_nsecs = variant->noise_nsecs;
+	self->noise_params.sleep_nsecs = variant->sleep_nsecs;
+	self->noise_params.run = 1;
+
+	ASSERT_EQ(pthread_create(&self->noise_thread, NULL, noise_thread, &self->noise_params), 0);
+}
+
+FIXTURE_TEARDOWN(slice_ext)
+{
+	self->noise_params.run = 0;
+	pthread_join(self->noise_thread, NULL);
+}
+
+TEST_F(slice_ext, slice_test)
+{
+	unsigned long success = 0, yielded = 0, scheduled = 0, raced = 0;
+	struct rseq_abi *rs = rseq_get_abi();
+	struct timespec ts_start, ts_now;
+
+	ASSERT_NE(rs, NULL);
+
+	clock_gettime(CLOCK_MONOTONIC, &ts_start);
+	do {
+		struct timespec ts_cs;
+		bool req = false;
+
+		clock_gettime(CLOCK_MONOTONIC, &ts_cs);
+
+		RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 1);
+		do {
+			clock_gettime(CLOCK_MONOTONIC, &ts_now);
+		} while (!elapsed(&ts_cs, &ts_now, variant->slice_nsecs));
+
+		/*
+		 * The request could be cleared unconditionally, but to make
+		 * the stats work it is checked first before clearing.
+		 */
+		if (RSEQ_READ_ONCE(rs->slice_ctrl.request)) {
+			RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 0);
+			/* Race between check and clear! */
+			req = true;
+			success++;
+		}
+
+		if (RSEQ_READ_ONCE(rs->slice_ctrl.granted)) {
+			/* The above raced against a late grant */
+			if (req)
+				success--;
+			yielded++;
+			if (!syscall(__NR_rseq_slice_yield))
+				raced++;
+		} else {
+			if (!req)
+				scheduled++;
+		}
+
+		clock_gettime(CLOCK_MONOTONIC, &ts_now);
+	} while (!elapsed(&ts_start, &ts_now, variant->total_nsecs));
+
+	printf("# Success   %12ld\n", success);
+	printf("# Yielded   %12ld\n", yielded);
+	printf("# Scheduled %12ld\n", scheduled);
+	printf("# Raced     %12ld\n", raced);
+}
+
+TEST_HARNESS_MAIN


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch V2 02/12] rseq: Add fields and constants for time slice extension
  2025-10-22 12:57 ` [patch V2 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
@ 2025-10-22 17:28   ` Randy Dunlap
  0 siblings, 0 replies; 29+ messages in thread
From: Randy Dunlap @ 2025-10-22 17:28 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
	Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
	K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
	Arnd Bergmann, linux-arch



On 10/22/25 5:57 AM, Thomas Gleixner wrote:
> Aside of a Kconfig knob add the following items:
> 
>    - Two flag bits for the rseq user space ABI, which allow user space to
>      query the availability and enablement without a syscall.
> 
>    - A new member to the user space ABI struct rseq, which is going to be
>      used to communicate request and grant between kernel and user space.
> 
>    - A rseq state struct to hold the kernel state of this
> 
>    - Documentation of the new mechanism
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
> Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> V2: Fix Kconfig indentation, fix typos and expressions - Randy
>     Make the control fields a struct and remove the atomicity requirement - Mathieu
> ---
>  Documentation/userspace-api/index.rst |    1 
>  Documentation/userspace-api/rseq.rst  |  118 ++++++++++++++++++++++++++++++++++
>  include/linux/rseq_types.h            |   26 +++++++
>  include/uapi/linux/rseq.h             |   38 ++++++++++
>  init/Kconfig                          |   12 +++
>  kernel/rseq.c                         |    7 ++
>  6 files changed, 202 insertions(+)
> 

> --- /dev/null
> +++ b/Documentation/userspace-api/rseq.rst
> @@ -0,0 +1,118 @@
> +=====================
> +Restartable Sequences
> +=====================
> +
> +Restartable Sequences allow to register a per thread userspace memory area
> +to be used as an ABI between kernel and userspace for three purposes:
> +
> + * userspace restartable sequences
> +
> + * quick access to read the current CPU number, node ID from userspace
> +
> + * scheduler time slice extensions
> +
> +Restartable sequences (per-cpu atomics)
> +---------------------------------------
> +
> +Restartables sequences allow userspace to perform update operations on

   Restartable

> +per-cpu data without requiring heavyweight atomic operations. The actual
> +ABI is unfortunately only available in the code and selftests.
> +
> +Quick access to CPU number, node ID
> +-----------------------------------
> +
> +Allows to implement per CPU data efficiently. Documentation is in code and
> +selftests. :(
> +
> +Scheduler time slice extensions
> +-------------------------------
> +
> +This allows a thread to request a time slice extension when it enters a
> +critical section to avoid contention on a resource when the thread is
> +scheduled out inside of the critical section.
> +
> +The prerequisites for this functionality are:
> +
> +    * Enabled in Kconfig
> +
> +    * Enabled at boot time (default is enabled)
> +
> +    * A rseq userspace pointer has been registered for the thread

I would write:  An rseq ...
but it depends on how someone treats (or speaks or thinks) "rseq."
I say/think of it as are-seq, so using "An" makes sense.

> +
> +The thread has to enable the functionality via prctl(2)::
> +
> +    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> +          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
> +
> +prctl() returns 0 on success and otherwise with the following error codes:

                                or

> +
> +========= ==============================================================
> +Errorcode Meaning
> +========= ==============================================================
> +EINVAL	  Functionality not available or invalid function arguments.
> +          Note: arg4 and arg5 must be zero
> +ENOTSUPP  Functionality was disabled on the kernel command line
> +ENXIO	  Available, but no rseq user struct registered
> +========= ==============================================================


[snip]> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -73,12 +73,35 @@ struct rseq_ids {
>  };
>  
>  /**
> + * union rseq_slice_state - Status information for rseq time slice extension
> + * @state:	Compound to access the overall state
> + * @enabled:	Time slice extension is enabled for the task
> + * @granted:	Time slice extension was granted to the task
> + */
> +union rseq_slice_state {
> +	u16			state;
> +	struct {
> +		u8		enabled;
> +		u8		granted;
> +	};
> +};
> +
> +/**
> + * struct rseq_slice - Status information for rseq time slice extension
> + * @state:	Time slice extension state
> + */
> +struct rseq_slice {
> +	union rseq_slice_state	state;
> +};
> +
> +/**
>   * struct rseq_data - Storage for all rseq related data
>   * @usrptr:	Pointer to the registered user space RSEQ memory
>   * @len:	Length of the RSEQ region
>   * @sig:	Signature of critial section abort IPs

                             critical

>   * @event:	Storage for event management
>   * @ids:	Storage for cached CPU ID and MM CID
> + * @slice:	Storage for time slice extension data
>   */
>  struct rseq_data {
>  	struct rseq __user		*usrptr;
> @@ -86,6 +109,9 @@ struct rseq_data {
>  	u32				sig;
>  	struct rseq_event		event;
>  	struct rseq_ids			ids;
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +	struct rseq_slice		slice;
> +#endif
>  };
>  
>  #else /* CONFIG_RSEQ */
-- 
~Randy


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch V2 01/12] sched: Provide and use set_need_resched_current()
  2025-10-22 12:57 ` [patch V2 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
@ 2025-10-27  8:59   ` Sebastian Andrzej Siewior
  2025-10-27 11:13     ` Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-10-27  8:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zilstra, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Arnd Bergmann, linux-arch

On 2025-10-22 14:57:29 [+0200], Thomas Gleixner wrote:
> set_tsk_need_resched(current) requires set_preempt_need_resched(current) to
> work correctly outside of the scheduler.
> 
> Provide set_need_resched_current() which wraps this correctly and replace
> all the open coded instances.
> 
> Signed-off-by: Peter Zilstra <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Peter's SOB comes first but he is not the author. Is this meant to be
Co-developed-by or is his authorship lost?

Sebastian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch V2 03/12] rseq: Provide static branch for time slice extensions
  2025-10-22 12:57 ` [patch V2 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
@ 2025-10-27  9:29   ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-10-27  9:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Arnd Bergmann, linux-arch

On 2025-10-22 14:57:32 [+0200], Thomas Gleixner wrote:
> Guard the time slice extension functionality with a static key, which can
> be disabled on the kernel command line.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Might want to fold:

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6c42061ca20e5..34325cf61b8de 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6482,6 +6482,10 @@
 
 	rootflags=	[KNL] Set root filesystem mount option string
 
+	rseq_slice_ext= [KNL] RSEQ slice extension
+			Disable the slice extension at boot time by
+			setting it to "off". Default is "on".
+
 	initramfs_options= [KNL]
                         Specify mount options for for the initramfs mount.
 

Sebastian

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [patch V2 05/12] rseq: Add prctl() to enable time slice extensions
  2025-10-22 12:57 ` [patch V2 05/12] rseq: Add prctl() to enable " Thomas Gleixner
@ 2025-10-27  9:40   ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-10-27  9:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Arnd Bergmann, linux-arch

On 2025-10-22 14:57:34 [+0200], Thomas Gleixner wrote:
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -164,4 +164,13 @@ void rseq_syscall(struct pt_regs *regs);
>  static inline void rseq_syscall(struct pt_regs *regs) { }
>  #endif /* !CONFIG_DEBUG_RSEQ */
>  
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
> +#else /* CONFIG_RSEQ_SLICE_EXTENSION */
> +static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
> +{
> +	return -EINVAL;

This should be -ENOTSUPP as in the !rseq_slice_extension_enabled() case.
After all it is the same condition.

> +}

Sebastian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch V2 01/12] sched: Provide and use set_need_resched_current()
  2025-10-27  8:59   ` Sebastian Andrzej Siewior
@ 2025-10-27 11:13     ` Thomas Gleixner
  0 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-27 11:13 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: LKML, Peter Zilstra, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Arnd Bergmann, linux-arch

On Mon, Oct 27 2025 at 09:59, Sebastian Andrzej Siewior wrote:
> On 2025-10-22 14:57:29 [+0200], Thomas Gleixner wrote:
>> set_tsk_need_resched(current) requires set_preempt_need_resched(current) to
>> work correctly outside of the scheduler.
>> 
>> Provide set_need_resched_current() which wraps this correctly and replace
>> all the open coded instances.
>> 
>> Signed-off-by: Peter Zilstra <peterz@infradead.org>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> Peter's SOB comes first but he is not the author. Is this meant to be
> Co-developed-by or is his authorship lost?

Bah. I dropped the From: line accidentally

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch V2 08/12] rseq: Implement time slice extension enforcement timer
  2025-10-22 12:57 ` [patch V2 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
@ 2025-10-27 11:38   ` Sebastian Andrzej Siewior
  2025-10-27 16:26     ` Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-10-27 11:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Arnd Bergmann, linux-arch

On 2025-10-22 14:57:38 [+0200], Thomas Gleixner wrote:
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -500,8 +502,82 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>  }
>  
>  #ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +struct slice_timer {
> +	struct hrtimer	timer;
> +	void		*cookie;
> +};
> +
> +unsigned int rseq_slice_ext_nsecs __read_mostly = 30 * NSEC_PER_USEC;
> +static DEFINE_PER_CPU(struct slice_timer, slice_timer);
>  DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
>  
> +static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
> +{
> +	struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
> +
> +	if (st->cookie == current && current->rseq.slice.state.granted) {
> +		rseq_stat_inc(rseq_stats.s_expired);
> +		set_need_resched_current();
> +	}

You arm the timer while leaving to userland. Once in userland the task
can be migrated to another CPU. Once migrated, this CPU can host another
task while the timer fires and does nothing.

> +	return HRTIMER_NORESTART;
> +}
> +
> +static void rseq_cancel_slice_extension_timer(void)
> +{
> +	struct slice_timer *st = this_cpu_ptr(&slice_timer);
> +
> +	/*
> +	 * st->cookie can be safely read as preemption is disabled and the
> +	 * timer is CPU local. The active check can obviously race with the
> +	 * hrtimer interrupt, but that's better than disabling interrupts
> +	 * unconditionally right away.
> +	 *
> +	 * As this is most probably the first expiring timer, the cancel is
> +	 * expensive as it has to reprogram the hardware, but that's less
> +	 * expensive than going through a full hrtimer_interrupt() cycle
> +	 * for nothing.
> +	 *
> +	 * hrtimer_try_to_cancel() is sufficient here as with interrupts
> +	 * disabled the timer callback cannot be running and the timer base
> +	 * is well determined as the timer is pinned on the local CPU.
> +	 */
> +	if (st->cookie == current && hrtimer_active(&st->timer)) {
> +		scoped_guard(irq)
> +			hrtimer_try_to_cancel(&st->timer);

I don't see why hrtimer_active() and IRQ-disable are a benefit here,
unless you want to avoid a branch to hrtimer_try_to_cancel().

The function has its own hrtimer_active() check and disables interrupts
while accessing the hrtimer_base lock. Since preemption is disabled,
st->cookie remains stable.
It can fire right after the hrtimer_active() here. You could just

	if (st->cookie == current)
		hrtimer_try_to_cancel(&st->timer);

at the expense of a branch to hrtimer_try_to_cancel() if the timer
already expired (no interrupts off/on).

> +}
> +
>  static inline void rseq_slice_set_need_resched(struct task_struct *curr)
>  {
>  	/*
> @@ -654,6 +731,31 @@ SYSCALL_DEFINE0(rseq_slice_yield)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_SYSCTL
> +static const unsigned int rseq_slice_ext_nsecs_min = 10 * NSEC_PER_USEC;
> +static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC;
> +
> +static const struct ctl_table rseq_slice_ext_sysctl[] = {
> +	{
> +		.procname	= "rseq_slice_extension_nsec",
> +		.data		= &rseq_slice_ext_nsecs,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_douintvec_minmax,
> +		.extra1		= (unsigned int *)&rseq_slice_ext_nsecs_min,
> +		.extra2		= (unsigned int *)&rseq_slice_ext_nsecs_max,
…

maybe +

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index f3ee807b5d8b3..ed34d21ed94e4 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1228,6 +1228,12 @@ reboot-cmd (SPARC only)
 ROM/Flash boot loader. Maybe to tell it what to do after
 rebooting. ???
 
+rseq_slice_extension_nsec
+=========================
+
+A task may ask to delay its scheduling if it is in a critical section via the
+prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sets the maximum allowed
+extension in nanoseconds before a mandatory scheduling of the task is forced.
 
 sched_energy_aware
 ==================


Sebastian


* Re: [patch V2 08/12] rseq: Implement time slice extension enforcement timer
  2025-10-27 11:38   ` Sebastian Andrzej Siewior
@ 2025-10-27 16:26     ` Thomas Gleixner
  2025-10-28  8:33       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-27 16:26 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: LKML, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Arnd Bergmann, linux-arch

On Mon, Oct 27 2025 at 12:38, Sebastian Andrzej Siewior wrote:
> On 2025-10-22 14:57:38 [+0200], Thomas Gleixner wrote:
>> +static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
>> +{
>> +	struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
>> +
>> +	if (st->cookie == current && current->rseq.slice.state.granted) {
>> +		rseq_stat_inc(rseq_stats.s_expired);
>> +		set_need_resched_current();
>> +	}
>
> You arm the timer while leaving to userland. Once in userland the task
> can be migrated to another CPU. Once migrated, this CPU can host another
> task while the timer fires and does nothing.

That's inevitable. If the scheduler decides to do that then there is
nothing which can be done about it and that's why the cookie pointer
exists.

>> +	return HRTIMER_NORESTART;
>> +}
>> +
> …
>> +static void rseq_cancel_slice_extension_timer(void)
>> +{
>> +	struct slice_timer *st = this_cpu_ptr(&slice_timer);
>> +
>> +	/*
>> +	 * st->cookie can be safely read as preemption is disabled and the
>> +	 * timer is CPU local. The active check can obviously race with the
>> +	 * hrtimer interrupt, but that's better than disabling interrupts
>> +	 * unconditionally right away.
>> +	 *
>> +	 * As this is most probably the first expiring timer, the cancel is
>> +	 * expensive as it has to reprogram the hardware, but that's less
>> +	 * expensive than going through a full hrtimer_interrupt() cycle
>> +	 * for nothing.
>> +	 *
>> +	 * hrtimer_try_to_cancel() is sufficient here as with interrupts
>> +	 * disabled the timer callback cannot be running and the timer base
>> +	 * is well determined as the timer is pinned on the local CPU.
>> +	 */
>> +	if (st->cookie == current && hrtimer_active(&st->timer)) {
>> +		scoped_guard(irq)
>> +			hrtimer_try_to_cancel(&st->timer);
>
> I don't see why hrtimer_active() and IRQ-disable are a benefit here,
> unless you want to avoid a branch to hrtimer_try_to_cancel().
>
> The function has its own hrtimer_active() check and disables interrupts
> while accessing the hrtimer_base lock. Since preemption is disabled,
> st->cookie remains stable.
> It can fire right after the hrtimer_active() here. You could just
>
> 	if (st->cookie == current)
> 		hrtimer_try_to_cancel(&st->timer);
>
> at the expense of a branch to hrtimer_try_to_cancel() if the timer
> already expired (no interrupts off/on).

That's not equivalent. As this is CPU local the interrupt disable
ensures that the timer is not running on this CPU. Otherwise you need
hrtimer_cancel(). Read the comment. :)

If it fired already, then the task is reaching this code too
late. Nothing to see there.

>> +		.extra1		= (unsigned int *)&rseq_slice_ext_nsecs_min,
>> +		.extra2		= (unsigned int *)&rseq_slice_ext_nsecs_max,
> …
>
> maybe +
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index f3ee807b5d8b3..ed34d21ed94e4 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -1228,6 +1228,12 @@ reboot-cmd (SPARC only)
>  ROM/Flash boot loader. Maybe to tell it what to do after
>  rebooting. ???
>  
> +rseq_slice_extension_nsec
> +=========================
> +
> +A task may ask to delay its scheduling if it is in a critical section via the
> +prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sets the maximum allowed
> +extension in nanoseconds before a mandatory scheduling of the task is forced.

Yes. Forgot about it as I already documented it in the time slice
extension docs. Let me add that.

Thanks,

        tglx


* Re: [patch V2 00/12] rseq: Implement time slice extension mechanism
  2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
                   ` (11 preceding siblings ...)
  2025-10-22 12:57 ` [patch V2 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
@ 2025-10-27 17:30 ` Sebastian Andrzej Siewior
  2025-10-27 18:48   ` Thomas Gleixner
  12 siblings, 1 reply; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-10-27 17:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zilstra, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Arnd Bergmann, linux-arch

On 2025-10-22 14:57:28 [+0200], Thomas Gleixner wrote:
> Time slice extensions are an attempt to provide opportunistic priority
> ceiling without the overhead of an actual priority ceiling protocol, but
> also without the guarantees such a protocol provides.
> 
> The intent is to avoid situations where a user space thread is interrupted
> in a critical section and scheduled out, while holding a resource on which
> the preempting thread or other threads in the system might block on. That
> obviously prevents those threads from making progress in the worst case for
> at least a full time slice. Especially in the context of user space
> spinlocks, which are a patently bad idea to begin with, but that's also
> true for other mechanisms.

I've been playing with it a bit with RT enabled and started to debug
this:

|       slice_test-2903    [001] d.h..  2313.285439: local_timer_entry: vector=236
|       slice_test-2903    [001] d.h1.  2313.285440: hrtimer_cancel: hrtimer=000000000507e6d5
|       slice_test-2903    [001] d.h..  2313.285440: hrtimer_expire_entry: hrtimer=000000000507e6d5 function=tick_nohz_handler now=2313208001152
|       slice_test-2903    [001] d.h1.  2313.285449: sched_stat_runtime: comm=slice_test pid=2903 runtime=3982905 [ns]
|       slice_test-2903    [001] dlh..  2313.285452: softirq_raise: vec=7 [action=SCHED]
|       slice_test-2903    [001] dlh..  2313.285452: hrtimer_expire_exit: hrtimer=000000000507e6d5
|       slice_test-2903    [001] dlh1.  2313.285452: hrtimer_start: hrtimer=000000000507e6d5 function=tick_nohz_handler expires=2313212000000 softexpires=2313212000000 mode=ABS
|       slice_test-2903    [001] dlh..  2313.285453: local_timer_exit: vector=236
|       slice_test-2903    [001] dl.2.  2313.285453: sched_waking: comm=ksoftirqd/1 pid=32 prio=120 target_cpu=001
|       slice_test-2903    [001] dl.3.  2313.285456: sched_wakeup: comm=ksoftirqd/1 pid=32 prio=120 target_cpu=001
|       slice_test-2903    [001] d....  2313.285457: irqentry_exit: rseq_grant_slice_extension(216)

granting the extension and removing the lazy wakeup. We are still on
return from IRQ but the 'h' flag has already been removed…

|       slice_test-2903    [001] d..1.  2313.285458: hrtimer_start: hrtimer=0000000030a688cc function=rseq_slice_expired expires=2313208047790 softexpires=2313208047790 mode=ABS|PINNED|HARD
|       slice_test-2903    [001] d....  2313.285458: __rseq_arm_slice_extension_timer: timer
|       slice_test-2903    [001] d..2.  2313.285484: hrtimer_cancel: hrtimer=0000000030a688cc
Extension granted, timer started, then cancelled, and need-resched set.

|       slice_test-2903    [001] dN.2.  2313.285487: sched_stat_runtime: comm=slice_test pid=2903 runtime=36886 [ns]
This is coming from schedule() already. It took me a while since I was
hunting a missing clear of need-resched.

|       slice_test-2903    [001] d..2.  2313.285489: sched_switch: prev_comm=slice_test prev_pid=2903 prev_prio=120 prev_state=R+ ==> next_comm=ksoftirqd/1 next_pid=32 next_prio=120
|      ksoftirqd/1-32      [001] ..s.1  2313.285490: softirq_entry: vec=7 [action=SCHED]
|      ksoftirqd/1-32      [001] ..s.1  2313.285501: softirq_exit: vec=7 [action=SCHED]
|      ksoftirqd/1-32      [001] d..2.  2313.285502: sched_stat_runtime: comm=ksoftirqd/1 pid=32 runtime=16438 [ns]
|      ksoftirqd/1-32      [001] d..2.  2313.285503: sched_switch: prev_comm=ksoftirqd/1 prev_pid=32 prev_prio=120 prev_state=S ==> next_comm=slice_test next_pid=2904 next_prio=120
|       slice_test-2904    [001] .....  2313.285507: sys_enter: NR 230 (1, 0, 7f4692c7baa0, 0, 0, 0)
|       slice_test-2904    [001] .....  2313.285507: hrtimer_setup: hrtimer=00000000f2d53899 clockid=CLOCK_MONOTONIC mode=REL
|       slice_test-2904    [001] d..1.  2313.285507: hrtimer_start: hrtimer=00000000f2d53899 function=hrtimer_wakeup expires=2313208168792 softexpires=2313208118792 mode=REL
|       slice_test-2904    [001] d..2.  2313.285508: sched_stat_runtime: comm=slice_test pid=2904 runtime=6149 [ns]
|       slice_test-2904    [001] d..2.  2313.285510: sched_switch: prev_comm=slice_test prev_pid=2904 prev_prio=120 prev_state=S ==> next_comm=slice_test next_pid=2903 next_prio=120
|       slice_test-2903    [001] .....  2313.285510: sys_enter: NR 470 (7fffc04f1ff0, c350, 11a0e0, 0, 7f4692e99000, 0)

slice_test-2903 enters rseq_slice_yield() only _now_, so it must have
been in userland during the suppressed wakeup at 2313.285457.
But a few iterations later it turns out that this trace event is recorded
_after_ the rseq magic happens at sys_enter time. We entered
rseq_slice_yield() a few cycles after the extension was granted. Buh.
So it seems to work as intended, but it is not obvious to tell from the
tracing why it behaves this way.

Sebastian


* Re: [patch V2 00/12] rseq: Implement time slice extension mechanism
  2025-10-27 17:30 ` [patch V2 00/12] rseq: Implement time slice extension mechanism Sebastian Andrzej Siewior
@ 2025-10-27 18:48   ` Thomas Gleixner
  2025-10-28  8:53     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-27 18:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: LKML, Peter Zilstra, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Arnd Bergmann, linux-arch

On Mon, Oct 27 2025 at 18:30, Sebastian Andrzej Siewior wrote:

> |       slice_test-2903    [001] d..2.  2313.285484: hrtimer_cancel: hrtimer=0000000030a688cc
> extension granted, timer started and revoked and set need resched.
>
> |       slice_test-2903    [001] dN.2.  2313.285487: sched_stat_runtime: comm=slice_test pid=2903 runtime=36886 [ns]
> This is coming from schedule() already. It took me a while since I was
> hunting a missing clear of need-resched.
>
> |       slice_test-2903    [001] d..2.  2313.285489: sched_switch: prev_comm=slice_test prev_pid=2903 prev_prio=120 prev_state=R+ ==> next_comm=ksoftirqd/1 next_pid=32 next_prio=120
> |      ksoftirqd/1-32      [001] ..s.1  2313.285490: softirq_entry: vec=7 [action=SCHED]
> |      ksoftirqd/1-32      [001] ..s.1  2313.285501: softirq_exit: vec=7 [action=SCHED]
> |      ksoftirqd/1-32      [001] d..2.  2313.285502: sched_stat_runtime: comm=ksoftirqd/1 pid=32 runtime=16438 [ns]
> |      ksoftirqd/1-32      [001] d..2.  2313.285503: sched_switch: prev_comm=ksoftirqd/1 prev_pid=32 prev_prio=120 prev_state=S ==> next_comm=slice_test next_pid=2904 next_prio=120
> |       slice_test-2904    [001] .....  2313.285507: sys_enter: NR 230 (1, 0, 7f4692c7baa0, 0, 0, 0)
> |       slice_test-2904    [001] .....  2313.285507: hrtimer_setup: hrtimer=00000000f2d53899 clockid=CLOCK_MONOTONIC mode=REL
> |       slice_test-2904    [001] d..1.  2313.285507: hrtimer_start: hrtimer=00000000f2d53899 function=hrtimer_wakeup expires=2313208168792 softexpires=2313208118792 mode=REL
> |       slice_test-2904    [001] d..2.  2313.285508: sched_stat_runtime: comm=slice_test pid=2904 runtime=6149 [ns]
> |       slice_test-2904    [001] d..2.  2313.285510: sched_switch: prev_comm=slice_test prev_pid=2904 prev_prio=120 prev_state=S ==> next_comm=slice_test next_pid=2903 next_prio=120
> |       slice_test-2903    [001] .....  2313.285510: sys_enter: NR 470 (7fffc04f1ff0, c350, 11a0e0, 0, 7f4692e99000, 0)
>
> slice_test-2903 enters rseq_slice_yield() only _now_, so it must have
> been in userland during the suppressed wakeup at 2313.285457.
> But a few iterations later it turns out that this trace event is recorded
> _after_ the rseq magic happens at sys_enter time. We entered
> rseq_slice_yield() a few cycles after the extension was granted. Buh.
> So it seems to work as intended, but it is not obvious to tell from the
> tracing why it behaves this way.

Tracing of the syscall happens _after_ syscall_trace_enter() invoked
rseq_syscall_enter_work() which canceled the timer and set
NEED_RESCHED. That immediately rescheduled _after_ the preempt enable:

  syscall()
    do_syscall_64()
      syscall_enter_from_user_mode() {
        syscall_enter_from_user_mode_work()
          syscall_trace_enter()
            rseq_syscall_enter_work()
              preempt_disable()
              hrtimer_try_to_cancel()
                remove_hrtimer()                <- tracepoint
              set_need_resched()
              preempt_enable()
                schedule()
           ...
           trace_sys_enter()                    <- tracepoint

Even if it would not reschedule immediately the ordering would be
reverse.

Thanks,

        tglx


* Re: [patch V2 08/12] rseq: Implement time slice extension enforcement timer
  2025-10-27 16:26     ` Thomas Gleixner
@ 2025-10-28  8:33       ` Sebastian Andrzej Siewior
  2025-10-28  8:51         ` K Prateek Nayak
  2025-10-28 13:04         ` Thomas Gleixner
  0 siblings, 2 replies; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-10-28  8:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Arnd Bergmann, linux-arch

On 2025-10-27 17:26:29 [+0100], Thomas Gleixner wrote:
> On Mon, Oct 27 2025 at 12:38, Sebastian Andrzej Siewior wrote:
> > On 2025-10-22 14:57:38 [+0200], Thomas Gleixner wrote:
> >> +static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
> >> +{
> >> +	struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
> >> +
> >> +	if (st->cookie == current && current->rseq.slice.state.granted) {
> >> +		rseq_stat_inc(rseq_stats.s_expired);
> >> +		set_need_resched_current();
> >> +	}
> >
> > You arm the timer while leaving to userland. Once in userland the task
> > can be migrated to another CPU. Once migrated, this CPU can host another
> > task while the timer fires and does nothing.
> 
> That's inevitable. If the scheduler decides to do that then there is
> nothing which can be done about it and that's why the cookie pointer
> exists.

Without an interrupt on the target CPU, there is nothing stopping the
task from overstepping its fair share.

> >> +	return HRTIMER_NORESTART;
> >> +}
> >> +
> > …
> >> +static void rseq_cancel_slice_extension_timer(void)
> >> +{
> >> +	struct slice_timer *st = this_cpu_ptr(&slice_timer);
> >> +
> >> +	/*
> >> +	 * st->cookie can be safely read as preemption is disabled and the
> >> +	 * timer is CPU local. The active check can obviously race with the
> >> +	 * hrtimer interrupt, but that's better than disabling interrupts
> >> +	 * unconditionally right away.
> >> +	 *
> >> +	 * As this is most probably the first expiring timer, the cancel is
> >> +	 * expensive as it has to reprogram the hardware, but that's less
> >> +	 * expensive than going through a full hrtimer_interrupt() cycle
> >> +	 * for nothing.
> >> +	 *
> >> +	 * hrtimer_try_to_cancel() is sufficient here as with interrupts
> >> +	 * disabled the timer callback cannot be running and the timer base
> >> +	 * is well determined as the timer is pinned on the local CPU.
> >> +	 */
> >> +	if (st->cookie == current && hrtimer_active(&st->timer)) {
> >> +		scoped_guard(irq)
> >> +			hrtimer_try_to_cancel(&st->timer);
> >
> > I don't see why hrtimer_active() and IRQ-disable are a benefit here,
> > unless you want to avoid a branch to hrtimer_try_to_cancel().
> >
> > The function has its own hrtimer_active() check and disables interrupts
> > while accessing the hrtimer_base lock. Since preemption is disabled,
> > st->cookie remains stable.
> > It can fire right after the hrtimer_active() here. You could just
> >
> > 	if (st->cookie == current)
> > 		hrtimer_try_to_cancel(&st->timer);
> >
> > at the expense of a branch to hrtimer_try_to_cancel() if the timer
> > already expired (no interrupts off/on).
> 
> That's not equivalent. As this is CPU local the interrupt disable
> ensures that the timer is not running on this CPU. Otherwise you need
> hrtimer_cancel(). Read the comment. :)

Since it is a CPU-local timer which is HRTIMER_MODE_HARD, from this CPU's
perspective it is either about to run or it did run. Therefore
hrtimer_try_to_cancel() can't return -1 due to
hrtimer_callback_running() == true.
If you drop the hrtimer_active() check and the scoped_guard(irq),
hrtimer_try_to_cancel() will do the same hrtimer_active() check as you
have above, followed by disabling interrupts via lock_hrtimer_base().
There hrtimer_callback_running() can't return true, because interrupts
are disabled and the timer can't run on a remote CPU, since it is a
CPU-local timer.

So you merely avoid a branch to hrtimer_try_to_cancel() if the timer
already fired.

> Thanks,
> 
>         tglx

Sebastian


* Re: [patch V2 08/12] rseq: Implement time slice extension enforcement timer
  2025-10-28  8:33       ` Sebastian Andrzej Siewior
@ 2025-10-28  8:51         ` K Prateek Nayak
  2025-10-28  9:00           ` Sebastian Andrzej Siewior
  2025-10-28 13:04         ` Thomas Gleixner
  1 sibling, 1 reply; 29+ messages in thread
From: K Prateek Nayak @ 2025-10-28  8:51 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Thomas Gleixner
  Cc: LKML, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, Steven Rostedt, Arnd Bergmann, linux-arch

Hello Sebastian,

On 10/28/2025 2:03 PM, Sebastian Andrzej Siewior wrote:
> On 2025-10-27 17:26:29 [+0100], Thomas Gleixner wrote:
>> On Mon, Oct 27 2025 at 12:38, Sebastian Andrzej Siewior wrote:
>>> On 2025-10-22 14:57:38 [+0200], Thomas Gleixner wrote:
>>>> +static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
>>>> +{
>>>> +	struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
>>>> +
>>>> +	if (st->cookie == current && current->rseq.slice.state.granted) {
>>>> +		rseq_stat_inc(rseq_stats.s_expired);
>>>> +		set_need_resched_current();
>>>> +	}
>>>
>>> You arm the timer while leaving to userland. Once in userland the task
>>> can be migrated to another CPU. Once migrated, this CPU can host another
>>> task while the timer fires and does nothing.
>>
>> That's inevitable. If the scheduler decides to do that then there is
>> nothing which can be done about it and that's why the cookie pointer
>> exists.
> 
> Without an interrupt on the target CPU, there is nothing stopping the
> task from overstepping its fair share.

When the task moves CPU, the rseq_exit_user_update() would clear all
of the slice extension state before running the task again. The task
will start off again with "rseq->slice_ctrl.request" and
"rseq->slice_ctrl.granted" both at 0 signifying the task was
rescheduled.

As for overstepping the limits on the previous CPU, the EEVDF
algorithm (using the task's "vlag" - the vruntime deviation from the
"avg_vruntime") would penalize it accordingly when enqueued.

The previous CPU would just get a spurious interrupt and since the
timer cookie doesn't match with "current", the handler does
nothing and goes away.

-- 
Thanks and Regards,
Prateek



* Re: [patch V2 00/12] rseq: Implement time slice extension mechanism
  2025-10-27 18:48   ` Thomas Gleixner
@ 2025-10-28  8:53     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-10-28  8:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zilstra, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Arnd Bergmann, linux-arch

On 2025-10-27 19:48:56 [+0100], Thomas Gleixner wrote:
> 
> Tracing of the syscall happens _after_ syscall_trace_enter() invoked
> rseq_syscall_enter_work() which canceled the timer and set
> NEED_RESCHED. That immediately rescheduled _after_ the preempt enable:
> 
>   syscall()
>     do_syscall_64()
>       syscall_enter_from_user_mode() {
>         syscall_enter_from_user_mode_work()
>           syscall_trace_enter()
>             rseq_syscall_enter_work()
>               preempt_disable()
>               hrtimer_try_to_cancel()
>                 remove_hrtimer()                <- tracepoint
>               set_need_resched()
>               preempt_enable()
>                 schedule()
>            ...
>            trace_sys_enter()                    <- tracepoint
> 
> Even if it would not reschedule immediately the ordering would be
> reverse.

I know that now, after doing the tracing. But with only the sched
events it looked like the slice gets granted and scheduling happens a
few usecs later. Adding interrupt and syscall events kept pointing in
the wrong direction.
Maybe the lack of events here is okay if you know what you are doing and
what to expect in terms of available trace events.

In that spirit, I did test it and didn't find anything wrong with it ;)

> Thanks,
> 
>         tglx

Sebastian


* Re: [patch V2 08/12] rseq: Implement time slice extension enforcement timer
  2025-10-28  8:51         ` K Prateek Nayak
@ 2025-10-28  9:00           ` Sebastian Andrzej Siewior
  2025-10-28  9:22             ` K Prateek Nayak
  0 siblings, 1 reply; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-10-28  9:00 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Thomas Gleixner, LKML, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, Steven Rostedt, Arnd Bergmann, linux-arch

On 2025-10-28 14:21:24 [+0530], K Prateek Nayak wrote:
> Hello Sebastian,
> 
> On 10/28/2025 2:03 PM, Sebastian Andrzej Siewior wrote:
> > On 2025-10-27 17:26:29 [+0100], Thomas Gleixner wrote:
> >> On Mon, Oct 27 2025 at 12:38, Sebastian Andrzej Siewior wrote:
> >>> On 2025-10-22 14:57:38 [+0200], Thomas Gleixner wrote:
> >>>> +static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
> >>>> +{
> >>>> +	struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
> >>>> +
> >>>> +	if (st->cookie == current && current->rseq.slice.state.granted) {
> >>>> +		rseq_stat_inc(rseq_stats.s_expired);
> >>>> +		set_need_resched_current();
> >>>> +	}
> >>>
> >>> You arm the timer while leaving to userland. Once in userland the task
> >>> can be migrated to another CPU. Once migrated, this CPU can host another
> >>> task while the timer fires and does nothing.
> >>
> >> That's inevitable. If the scheduler decides to do that then there is
> >> nothing which can be done about it and that's why the cookie pointer
> >> exists.
> > 
> > Without an interrupt on the target CPU, there is nothing stopping the
> > task from overstepping its fair share.
> 
> When the task moves CPU, the rseq_exit_user_update() would clear all
> of the slice extension state before running the task again. The task
> will start off again with "rseq->slice_ctrl.request" and
> "rseq->slice_ctrl.granted" both at 0 signifying the task was
> rescheduled.

I wasn't aware this is done once the task is in userland and then
relocated to another CPU.

> As for overstepping the limits on the previous CPU, the EEVDF
> algorithm (using the task's "vlag" - the vruntime deviation from the
> "avg_vruntime") would penalize it accordingly when enqueued.

So it wouldn't be the initial delay which is enforced by the timer, but
the regular scheduler that would put an end to it. Somehow forgot that
we still have a scheduler…

> The previous CPU would just get a spurious interrupt and since the
> timer cookie doesn't match with "current", the handler does
> nothing and goes away.

Yeah, that is fine.

Sebastian


* Re: [patch V2 08/12] rseq: Implement time slice extension enforcement timer
  2025-10-28  9:00           ` Sebastian Andrzej Siewior
@ 2025-10-28  9:22             ` K Prateek Nayak
  2025-10-28 10:22               ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 29+ messages in thread
From: K Prateek Nayak @ 2025-10-28  9:22 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, LKML, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, Steven Rostedt, Arnd Bergmann, linux-arch

On 10/28/2025 2:30 PM, Sebastian Andrzej Siewior wrote:
>>> Without an interrupt on the target CPU, there is nothing stopping the
>>> task from overstepping its fair share.
>>
>> When the task moves CPU, the rseq_exit_user_update() would clear all
>> of the slice extension state before running the task again. The task
>> will start off again with "rseq->slice_ctrl.request" and
>> "rseq->slice_ctrl.granted" both at 0 signifying the task was
>> rescheduled.
> 
> I wasn't aware this is done once the task is in userland and then
> relocated to another CPU.

The exact path based on my understanding is:

  /* Task migrates to another CPU; Has to resume from kernel. */
  __schedule()
    context_switch()
      rseq_sched_switch_event()
        t->rseq.event.sched_switch = true;
        set_tsk_thread_flag(t, TIF_RSEQ);

    ...
    exit_to_user_mode_loop()
      rseq_exit_to_user_mode_restart()
        __rseq_exit_to_user_mode_restart()
          /* Sees t->rseq.event.sched_switch to be true. */
          rseq_exit_user_update()
            if (rseq_slice_extension_enabled())
              unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); /* Unconditionally clears all of "rseq_ctrl" */

-- 
Thanks and Regards,
Prateek



* Re: [patch V2 08/12] rseq: Implement time slice extension enforcement timer
  2025-10-28  9:22             ` K Prateek Nayak
@ 2025-10-28 10:22               ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 29+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-10-28 10:22 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Thomas Gleixner, LKML, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, Steven Rostedt, Arnd Bergmann, linux-arch

On 2025-10-28 14:52:09 [+0530], K Prateek Nayak wrote:
> On 10/28/2025 2:30 PM, Sebastian Andrzej Siewior wrote:
> >>> Without an interrupt on the target CPU, there is nothing stopping the
> >>> task from overstepping its fair share.
> >>
> >> When the task moves CPU, the rseq_exit_user_update() would clear all
> >> of the slice extension state before running the task again. The task
> >> will start off again with "rseq->slice_ctrl.request" and
> >> "rseq->slice_ctrl.granted" both at 0 signifying the task was
> >> rescheduled.
> > 
> > I wasn't aware this is done once the task is in userland and then
> > relocated to another CPU.
> 
> The exact path based on my understanding is:
> 
>   /* Task migrates to another CPU; Has to resume from kernel. */
>   __schedule()
>     context_switch()
>       rseq_sched_switch_event()
>         t->rseq.event.sched_switch = true;
>         set_tsk_thread_flag(t, TIF_RSEQ);
> 
>     ...
>     exit_to_user_mode_loop()
>       rseq_exit_to_user_mode_restart()
>         __rseq_exit_to_user_mode_restart()
>           /* Sees t->rseq.event.sched_switch to be true. */
>           rseq_exit_user_update()
>             if (rseq_slice_extension_enabled())
>               unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); /* Unconditionally clears all of "slice_ctrl" */

You are right. The migration thread preempts it on the old CPU and then
it gets scheduled in on the new CPU.

Sebastian


* Re: [patch V2 08/12] rseq: Implement time slice extension enforcement timer
  2025-10-28  8:33       ` Sebastian Andrzej Siewior
  2025-10-28  8:51         ` K Prateek Nayak
@ 2025-10-28 13:04         ` Thomas Gleixner
  1 sibling, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2025-10-28 13:04 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: LKML, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Jonathan Corbet, Prakash Sangappa,
	Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
	Arnd Bergmann, linux-arch

On Tue, Oct 28 2025 at 09:33, Sebastian Andrzej Siewior wrote:
> On 2025-10-27 17:26:29 [+0100], Thomas Gleixner wrote:
>> On Mon, Oct 27 2025 at 12:38, Sebastian Andrzej Siewior wrote:
>> > On 2025-10-22 14:57:38 [+0200], Thomas Gleixner wrote:
>> >> +static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
>> >> +{
>> >> +	struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
>> >> +
>> >> +	if (st->cookie == current && current->rseq.slice.state.granted) {
>> >> +		rseq_stat_inc(rseq_stats.s_expired);
>> >> +		set_need_resched_current();
>> >> +	}
>> >
>> > You arm the timer while leaving to userland. Once in userland the task
>> > can be migrated to another CPU. Once migrated, this CPU can host another
>> > task while the timer fires and does nothing.
>> 
>> That's inevitable. If the scheduler decides to do that then there is
>> nothing which can be done about it and that's why the cookie pointer
>> exists.
>
> Without an interrupt on the target CPU, there is nothing stopping the
> task from overstepping its fair share.

If a task gets migrated then it can't overstep the share, because the
migration brings it back into the kernel, schedules it out and
schedules it in on the new CPU. So the whole accounting starts over
fresh. That's the same as if the task gets the extension granted, goes
to user space and gets interrupted again. If that interrupt sets
NEED_RESCHED the grant is "revoked" and the timer fires for nothing.

> Since it is a CPU-local timer which is HRTIMER_MODE_HARD, from this CPU's
> perspective it is either about to run or it did run. Therefore
> hrtimer_try_to_cancel() can't return -1 due to
> hrtimer_callback_running() == true.
> If you drop the hrtimer_active() check and the scoped_guard(irq),
> hrtimer_try_to_cancel() will do the same hrtimer_active() check as you
> have above, followed by disabling interrupts via lock_hrtimer_base(), and
> there hrtimer_callback_running() can't return true because interrupts are
> disabled and the timer can't run on a remote CPU because it is a
> CPU-local timer.
>
> So you avoid a branch to hrtimer_try_to_cancel() if the timer already
> fired.

Yes you are right. Seems I've suffered from brain congestion. Let me
remove it.

Thanks,

        tglx


end of thread, other threads:[~2025-10-28 13:04 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
2025-10-22 12:57 [patch V2 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-10-22 12:57 ` [patch V2 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
2025-10-27  8:59   ` Sebastian Andrzej Siewior
2025-10-27 11:13     ` Thomas Gleixner
2025-10-22 12:57 ` [patch V2 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
2025-10-22 17:28   ` Randy Dunlap
2025-10-22 12:57 ` [patch V2 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
2025-10-27  9:29   ` Sebastian Andrzej Siewior
2025-10-22 12:57 ` [patch V2 04/12] rseq: Add statistics " Thomas Gleixner
2025-10-22 12:57 ` [patch V2 05/12] rseq: Add prctl() to enable " Thomas Gleixner
2025-10-27  9:40   ` Sebastian Andrzej Siewior
2025-10-22 12:57 ` [patch V2 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
2025-10-22 12:57 ` [patch V2 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
2025-10-22 12:57 ` [patch V2 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
2025-10-27 11:38   ` Sebastian Andrzej Siewior
2025-10-27 16:26     ` Thomas Gleixner
2025-10-28  8:33       ` Sebastian Andrzej Siewior
2025-10-28  8:51         ` K Prateek Nayak
2025-10-28  9:00           ` Sebastian Andrzej Siewior
2025-10-28  9:22             ` K Prateek Nayak
2025-10-28 10:22               ` Sebastian Andrzej Siewior
2025-10-28 13:04         ` Thomas Gleixner
2025-10-22 12:57 ` [patch V2 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
2025-10-22 12:57 ` [patch V2 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
2025-10-22 12:57 ` [patch V2 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
2025-10-22 12:57 ` [patch V2 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
2025-10-27 17:30 ` [patch V2 00/12] rseq: Implement time slice extension mechanism Sebastian Andrzej Siewior
2025-10-27 18:48   ` Thomas Gleixner
2025-10-28  8:53     ` Sebastian Andrzej Siewior
