* [patch V6 00/11] rseq: Implement time slice extension mechanism
@ 2025-12-15 16:52 Thomas Gleixner
2025-12-15 16:52 ` [patch V6 01/11] rseq: Add fields and constants for time slice extension Thomas Gleixner
` (11 more replies)
0 siblings, 12 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 16:52 UTC
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
This is a follow up on the V5 version:
https://lore.kernel.org/20251128225931.959481199@linutronix.de
V1 contains a detailed explanation:
https://lore.kernel.org/20250908225709.144709889@linutronix.de
TLDR: Time slice extensions are an attempt to provide opportunistic
priority ceiling without the overhead of an actual priority ceiling
protocol, but also without the guarantees such a protocol provides.
The intent is to avoid situations where a user space thread is interrupted
in a critical section and scheduled out while holding a resource on which
the preempting thread or other threads in the system might block. In the
worst case that prevents those threads from making progress for at least a
full time slice. This is especially true for user space spinlocks, which
are a patently bad idea to begin with, but it also applies to other
mechanisms.
This series uses the existing RSEQ user memory to implement it.
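For orientation, the user space side of the mechanism can be sketched
roughly as below. This is only a model of the code flow documented in
patch 01, not the real ABI: struct slice_ctrl mirrors the layout of
struct rseq_slice_ctrl, while fake_rseq_slice_yield(),
run_with_slice_request() and the two cs_*() helpers are hypothetical names
invented for this illustration. The kernel side is simulated by a callback
that flips the bits the way a grant would.

```c
#include <stdint.h>

/* Mirrors the layout of struct rseq_slice_ctrl from patch 01 */
struct slice_ctrl {
	union {
		uint32_t all;
		struct {
			uint8_t  request;
			uint8_t  granted;
			uint16_t reserved;
		};
	};
};

/* Compiler barrier (GCC/Clang syntax) */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-in for the rseq_slice_yield() syscall; counts calls */
static int yield_calls;

static void fake_rseq_slice_yield(void)
{
	yield_calls++;
}

/* Runs cs() as a critical section with a time slice extension request */
static void run_with_slice_request(struct slice_ctrl *ctrl,
				   void (*cs)(struct slice_ctrl *))
{
	ctrl->request = 1;
	barrier();		/* Prevent compiler reordering */
	cs(ctrl);
	barrier();		/* Prevent compiler reordering */
	ctrl->request = 0;
	if (ctrl->granted)	/* Racy by design, see patch 01 documentation */
		fake_rseq_slice_yield();
}

/* Critical section during which no reschedule request happened */
static void cs_undisturbed(struct slice_ctrl *ctrl)
{
	(void)ctrl;
}

/* Simulates the kernel granting an extension in the middle of the section */
static void cs_with_grant(struct slice_ctrl *ctrl)
{
	ctrl->request = 0;
	ctrl->granted = 1;
}
```

With cs_undisturbed() the yield stand-in is never called. With
cs_with_grant() it is called exactly once, which corresponds to the case
where real user space would relinquish the CPU after leaving the critical
section.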
Changes vs. V5:
- Rebase on v6.19-rc1
- Fold typo fixes - Sebastian
- Switch to syscall number 471
The series is based on v6.19-rc1 and is also available from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
Thanks,
tglx
* [patch V6 01/11] rseq: Add fields and constants for time slice extension
2025-12-15 16:52 [patch V6 00/11] rseq: Implement time slice extension mechanism Thomas Gleixner
@ 2025-12-15 16:52 ` Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
` (3 more replies)
2025-12-15 16:52 ` [patch V6 02/11] rseq: Provide static branch for time slice extensions Thomas Gleixner
` (10 subsequent siblings)
11 siblings, 4 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 16:52 UTC
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Aside from a Kconfig knob, add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- An rseq state struct to hold the kernel state of this mechanism
- Documentation of the new mechanism
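As a rough illustration of the two user visible ABI items, the sketch
below checks the flag bits and pins down the layout of the control word.
The bit positions (4 and 5) and the member order are taken from the diff
in this patch. The names SLICE_EXT_AVAILABLE, SLICE_EXT_ENABLED,
struct slice_ctrl and the two helper functions are local, hypothetical
names so that the example compiles without the updated uapi header.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions copied from enum rseq_cs_flags_bit in this patch */
#define SLICE_EXT_AVAILABLE	(1U << 4)
#define SLICE_EXT_ENABLED	(1U << 5)

/* Local mirror of struct rseq_slice_ctrl */
struct slice_ctrl {
	union {
		uint32_t all;
		struct {
			uint8_t  request;
			uint8_t  granted;
			uint16_t reserved;
		};
	};
};

/* The control structure is one 32-bit word with two byte sized fields */
_Static_assert(sizeof(struct slice_ctrl) == sizeof(uint32_t), "32-bit word");
_Static_assert(offsetof(struct slice_ctrl, request) == 0, "request is byte 0");
_Static_assert(offsetof(struct slice_ctrl, granted) == 1, "granted is byte 1");

/* Takes the value read from the flags field of the registered struct rseq */
static bool slice_ext_available(uint32_t rseq_flags)
{
	return (rseq_flags & SLICE_EXT_AVAILABLE) != 0;
}

static bool slice_ext_enabled(uint32_t rseq_flags)
{
	return (rseq_flags & SLICE_EXT_ENABLED) != 0;
}
```

Because request and granted live in separate bytes, user space can store
to request without a read-modify-write of the byte the kernel owns, and a
single store to the compound member resets both, as the registration path
in this patch does with slice_ctrl.all.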
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
V6: Fix typos - Bigeasy
V5: Document behaviour of arbitrary syscalls
V4: Make the example correct - Prakash
V3: Fix more typos and expressions - Randy
V2: Fix Kconfig indentation, fix typos and expressions - Randy
Make the control fields a struct and remove the atomicity requirement - Mathieu
---
Documentation/userspace-api/index.rst | 1
Documentation/userspace-api/rseq.rst | 135 ++++++++++++++++++++++++++++++++++
include/linux/rseq_types.h | 28 ++++++-
include/uapi/linux/rseq.h | 38 +++++++++
init/Kconfig | 12 +++
kernel/rseq.c | 7 +
6 files changed, 220 insertions(+), 1 deletion(-)
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
ebpf/index
ioctl/index
mseal
+ rseq
Security-related interfaces
===========================
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,135 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow to register a per thread userspace memory area
+to be used as an ABI between kernel and userspace for three purposes:
+
+ * userspace restartable sequences
+
+ * quick access to read the current CPU number, node ID from userspace
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow userspace to perform update operations on
+per-cpu data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows to implement per CPU data efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+ * Enabled in Kconfig
+
+ * Enabled at boot time (default is enabled)
+
+ * A rseq userspace pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success or otherwise with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg4 and arg5 must be zero
+ENOTSUPP Functionality was disabled on the kernel command line
+ENXIO Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can be also queried via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status is also exposed via the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and only for informational purposes.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
+interrupted and the interrupt results in a reschedule request in the
+kernel, then the kernel can grant a time slice extension and return to
+userspace instead of scheduling out. The length of the extension is
+determined by the ``rseq_slice_extension_nsec`` sysctl.
+
+The kernel indicates the grant by clearing rseq::slice_ctrl::request and
+setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
+thread after granting the extension, the kernel clears the granted bit to
+indicate that to userspace.
+
+If the request bit is still set when leaving the critical section,
+userspace can clear it and continue.
+
+If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
+leaving the critical section to relinquish the CPU. The kernel enforces
+this by arming a timer to prevent misbehaving userspace from abusing this
+mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by userspace.
+
+The required code flow is as follows::
+
+ rseq->slice_ctrl.request = 1;
+ barrier(); // Prevent compiler reordering
+ critical_section();
+ barrier(); // Prevent compiler reordering
+ rseq->slice_ctrl.request = 0;
+ if (rseq->slice_ctrl.granted)
+ rseq_slice_yield();
+
+As all of this is strictly CPU local, there are no atomicity requirements.
+Checking the granted state is racy, but that cannot be avoided at all::
+
+ if (rseq->slice_ctrl.granted)
+ -> Interrupt results in schedule and grant revocation
+ rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+If the thread issues a syscall other than rseq_slice_yield(2) within the
+granted timeslice extension, the grant is also revoked and the CPU is
+relinquished immediately when entering the kernel. This is required as
+syscalls might consume arbitrary CPU time until they reach a scheduling
+point when the preemption model is either NONE or VOLUNTARY and therefore
+might exceed the grant by far.
+
+The preferred solution for user space is to use rseq_slice_yield(2) which
+is side effect free. The support for arbitrary syscalls is required to
+support onion layer architectured applications, where the code handling the
+critical section and requesting the time slice extension has no control
+over the code within the critical section.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -73,12 +73,35 @@ struct rseq_ids {
};
/**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state: Compound to access the overall state
+ * @enabled: Time slice extension is enabled for the task
+ * @granted: Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+ u16 state;
+ struct {
+ u8 enabled;
+ u8 granted;
+ };
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state: Time slice extension state
+ */
+struct rseq_slice {
+ union rseq_slice_state state;
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
- * @sig: Signature of critial section abort IPs
+ * @sig: Signature of critical section abort IPs
* @event: Storage for event management
* @ids: Storage for cached CPU ID and MM CID
+ * @slice: Storage for time slice extension data
*/
struct rseq_data {
struct rseq __user *usrptr;
@@ -86,6 +109,9 @@ struct rseq_data {
u32 sig;
struct rseq_event event;
struct rseq_ids ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+ struct rseq_slice slice;
+#endif
};
#else /* CONFIG_RSEQ */
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -23,9 +23,15 @@ enum rseq_flags {
};
enum rseq_cs_flags_bit {
+ /* Historical and unsupported bits */
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+ /* (3) Intentional gap to put new bits into a separate byte */
+
+ /* User read only feature flags */
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
};
enum rseq_cs_flags {
@@ -35,6 +41,11 @@ enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
};
/*
@@ -53,6 +64,27 @@ struct rseq_cs {
__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));
+/**
+ * rseq_slice_ctrl - Time slice extension control structure
+ * @all: Compound value
+ * @request: Request for a time slice extension
+ * @granted: Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_slice_ctrl {
+ union {
+ __u32 all;
+ struct {
+ __u8 request;
+ __u8 granted;
+ __u16 __reserved;
+ };
+ };
+};
+
/*
* struct rseq is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
@@ -142,6 +174,12 @@ struct rseq {
__u32 mm_cid;
/*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_slice_ctrl slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1938,6 +1938,18 @@ config RSEQ
If unsure, say Y.
+config RSEQ_SLICE_EXTENSION
+ bool "Enable rseq-based time slice extension mechanism"
+ depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+ help
+ Allows userspace to request a limited time slice extension when
+ returning from an interrupt to user space via the RSEQ shared
+ data ABI. If granted, that allows to complete a critical section,
+ so that other threads are not stuck on a conflicted resource,
+ while the task is scheduled out.
+
+ If unsure, say N.
+
config RSEQ_STATS
default n
bool "Enable lightweight statistics of restartable sequences" if EXPERT
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -389,6 +389,8 @@ static bool rseq_reset_ids(void)
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
+ u32 rseqfl = 0;
+
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -440,6 +442,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
if (!access_ok(rseq, rseq_len))
return -EFAULT;
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+ rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
scoped_user_write_access(rseq, efault) {
/*
* If the rseq_cs pointer is non-NULL on registration, clear it to
@@ -449,11 +454,13 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
* clearing the fields. Don't bother reading it, just reset it.
*/
unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+ unsafe_put_user(rseqfl, &rseq->flags, efault);
/* Initialize IDs in user space */
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
unsafe_put_user(0U, &rseq->node_id, efault);
unsafe_put_user(0U, &rseq->mm_cid, efault);
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
}
/*
* [patch V6 02/11] rseq: Provide static branch for time slice extensions
2025-12-15 16:52 [patch V6 00/11] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-12-15 16:52 ` [patch V6 01/11] rseq: Add fields and constants for time slice extension Thomas Gleixner
@ 2025-12-15 16:52 ` Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2025-12-15 16:52 ` [patch V6 03/11] rseq: Add statistics " Thomas Gleixner
` (9 subsequent siblings)
11 siblings, 2 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 16:52 UTC
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Guard the time slice extension functionality with a static key, which can
be disabled on the kernel command line.
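The behaviour of the command line handler can be modelled in plain user
space C as below. This is a simplified sketch, not kernel code:
slice_ext_on stands in for the static key, parse_bool() is a reduced
stand-in for kstrtobool() that only looks at the leading characters, and
slice_cmdline() follows the return convention used in this patch (0 when
the value cannot be parsed, 1 when it was handled). All three names are
hypothetical.

```c
#include <stdbool.h>

/* Models DEFINE_STATIC_KEY_TRUE(): the feature defaults to enabled */
static bool slice_ext_on = true;

/* Reduced stand-in for kstrtobool(): returns 0 on success, -1 otherwise */
static int parse_bool(const char *s, bool *res)
{
	if (!s)
		return -1;

	switch (s[0]) {
	case 'y': case 'Y': case 't': case 'T': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case 'f': case 'F': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		/* "on" and "off" need the second character */
		switch (s[1]) {
		case 'n': case 'N':
			*res = true;
			return 0;
		case 'f': case 'F':
			*res = false;
			return 0;
		}
		break;
	}
	return -1;
}

/* Models rseq_slice_cmdline(): can only turn the feature off */
static int slice_cmdline(const char *str)
{
	bool on;

	if (parse_bool(str, &on))
		return 0;

	if (!on)
		slice_ext_on = false;
	return 1;
}
```

Note that the handler only ever disables the key. A value of 'on' is
accepted but changes nothing, because the key is already true by default.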
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V4: Return 0 on error in __setup() - Randy
V3: Document command line parameter - Sebastian
V2: Return 1 from __setup() - Prateek
---
Documentation/admin-guide/kernel-parameters.txt | 5 +++++
include/linux/rseq_entry.h | 11 +++++++++++
kernel/rseq.c | 17 +++++++++++++++++
3 files changed, 33 insertions(+)
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6482,6 +6482,11 @@
rootflags= [KNL] Set root filesystem mount option string
+ rseq_slice_ext= [KNL] RSEQ based time slice extension
+ Format: boolean
+ Control enablement of RSEQ based time slice extension.
+ Default is 'on'.
+
initramfs_options= [KNL]
Specify mount options for for the initramfs mount.
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -75,6 +75,17 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
#define rseq_inline __always_inline
#endif
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static __always_inline bool rseq_slice_extension_enabled(void)
+{
+ return static_branch_likely(&rseq_slice_extension_key);
+}
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline bool rseq_slice_extension_enabled(void) { return false; }
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
bool rseq_debug_validate_ids(struct task_struct *t);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -483,3 +483,20 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
efault:
return -EFAULT;
}
+
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static int __init rseq_slice_cmdline(char *str)
+{
+ bool on;
+
+ if (kstrtobool(str, &on))
+ return 0;
+
+ if (!on)
+ static_branch_disable(&rseq_slice_extension_key);
+ return 1;
+}
+__setup("rseq_slice_ext=", rseq_slice_cmdline);
+#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
* [patch V6 03/11] rseq: Add statistics for time slice extensions
2025-12-15 16:52 [patch V6 00/11] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-12-15 16:52 ` [patch V6 01/11] rseq: Add fields and constants for time slice extension Thomas Gleixner
2025-12-15 16:52 ` [patch V6 02/11] rseq: Provide static branch for time slice extensions Thomas Gleixner
@ 2025-12-15 16:52 ` Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2025-12-15 16:52 ` [patch V6 04/11] rseq: Add prctl() to enable " Thomas Gleixner
` (8 subsequent siblings)
11 siblings, 2 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 16:52 UTC
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Extend the quick statistics with time slice specific fields.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5: Add s_aborted to account for arbitrary syscalls
---
include/linux/rseq_entry.h | 5 +++++
kernel/rseq.c | 14 ++++++++++++++
2 files changed, 19 insertions(+)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -15,6 +15,11 @@ struct rseq_stats {
unsigned long cs;
unsigned long clear;
unsigned long fixup;
+ unsigned long s_granted;
+ unsigned long s_expired;
+ unsigned long s_revoked;
+ unsigned long s_yielded;
+ unsigned long s_aborted;
};
DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -138,6 +138,13 @@ static int rseq_stats_show(struct seq_fi
stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu));
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+ stats.s_granted += data_race(per_cpu(rseq_stats.s_granted, cpu));
+ stats.s_expired += data_race(per_cpu(rseq_stats.s_expired, cpu));
+ stats.s_revoked += data_race(per_cpu(rseq_stats.s_revoked, cpu));
+ stats.s_yielded += data_race(per_cpu(rseq_stats.s_yielded, cpu));
+ stats.s_aborted += data_race(per_cpu(rseq_stats.s_aborted, cpu));
+ }
}
seq_printf(m, "exit: %16lu\n", stats.exit);
@@ -148,6 +155,13 @@ static int rseq_stats_show(struct seq_fi
seq_printf(m, "cs: %16lu\n", stats.cs);
seq_printf(m, "clear: %16lu\n", stats.clear);
seq_printf(m, "fixup: %16lu\n", stats.fixup);
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+ seq_printf(m, "sgrant: %16lu\n", stats.s_granted);
+ seq_printf(m, "sexpir: %16lu\n", stats.s_expired);
+ seq_printf(m, "srevok: %16lu\n", stats.s_revoked);
+ seq_printf(m, "syield: %16lu\n", stats.s_yielded);
+ seq_printf(m, "sabort: %16lu\n", stats.s_aborted);
+ }
return 0;
}
* [patch V6 04/11] rseq: Add prctl() to enable time slice extensions
2025-12-15 16:52 [patch V6 00/11] rseq: Implement time slice extension mechanism Thomas Gleixner
` (2 preceding siblings ...)
2025-12-15 16:52 ` [patch V6 03/11] rseq: Add statistics " Thomas Gleixner
@ 2025-12-15 16:52 ` Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2025-12-15 16:52 ` [patch V6 05/11] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
` (7 subsequent siblings)
11 siblings, 2 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 16:52 UTC
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Implement a prctl() so that tasks can enable the time slice extension
mechanism. This fails when time slice extensions are disabled at compile
time or on the kernel command line, or when no rseq pointer is registered
in the kernel.
That allows implementing a single trivial check in the exit to user mode
hotpath to decide whether the whole mechanism needs to be invoked.
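From the user space side, enabling the mechanism could look roughly like
the sketch below. The PR_RSEQ_SLICE_* values are copied from this patch
and only defined locally when the installed uapi headers lack them. The
error mapping follows the table in the documentation added by patch 01.
KERNEL_ENOTSUPP, enum slice_ext_status, slice_ext_classify() and
slice_ext_enable() are hypothetical names for this example. The value 524
is the kernel internal ENOTSUPP, which libc headers usually do not
provide.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Values copied from this patch; older uapi headers do not define them */
#ifndef PR_RSEQ_SLICE_EXTENSION
# define PR_RSEQ_SLICE_EXTENSION	79
# define PR_RSEQ_SLICE_EXTENSION_GET	1
# define PR_RSEQ_SLICE_EXTENSION_SET	2
# define PR_RSEQ_SLICE_EXT_ENABLE	0x01
#endif

/* Kernel internal ENOTSUPP; assumed not to be defined by libc */
#define KERNEL_ENOTSUPP	524

enum slice_ext_status {
	SLICE_EXT_OK,			/* Enabled for this thread */
	SLICE_EXT_UNAVAILABLE,		/* EINVAL: not available or bad args */
	SLICE_EXT_BOOT_DISABLED,	/* ENOTSUPP: disabled on command line */
	SLICE_EXT_NO_RSEQ,		/* ENXIO: no rseq area registered */
	SLICE_EXT_OTHER,		/* Anything else */
};

/* Maps an errno value (0 for success) to the documented meaning */
static enum slice_ext_status slice_ext_classify(int err)
{
	switch (err) {
	case 0:
		return SLICE_EXT_OK;
	case EINVAL:
		return SLICE_EXT_UNAVAILABLE;
	case KERNEL_ENOTSUPP:
		return SLICE_EXT_BOOT_DISABLED;
	case ENXIO:
		return SLICE_EXT_NO_RSEQ;
	default:
		return SLICE_EXT_OTHER;
	}
}

/* Tries to enable time slice extensions for the calling thread */
static enum slice_ext_status slice_ext_enable(void)
{
	/* arg4 and arg5 must be zero; pass full-width zeros explicitly */
	int ret = prctl(PR_RSEQ_SLICE_EXTENSION,
			(unsigned long)PR_RSEQ_SLICE_EXTENSION_SET,
			(unsigned long)PR_RSEQ_SLICE_EXT_ENABLE, 0UL, 0UL);

	return slice_ext_classify(ret ? errno : 0);
}
```

On a kernel without this series the prctl() option is unknown, so the
call is expected to end up in the SLICE_EXT_UNAVAILABLE case and callers
can simply fall back to running without extensions.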
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V3: Use -ENOTSUPP for the stub inline - Sebastian
---
include/linux/rseq.h | 9 +++++++
include/uapi/linux/prctl.h | 10 ++++++++
kernel/rseq.c | 52 +++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 6 +++++
4 files changed, 77 insertions(+)
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -163,4 +163,13 @@ void rseq_syscall(struct pt_regs *regs);
static inline void rseq_syscall(struct pt_regs *regs) { }
#endif /* !CONFIG_DEBUG_RSEQ */
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+ return -ENOTSUPP;
+}
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
#endif /* _LINUX_RSEQ_H */
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -386,4 +386,14 @@ struct prctl_mm_map {
# define PR_FUTEX_HASH_SET_SLOTS 1
# define PR_FUTEX_HASH_GET_SLOTS 2
+/* RSEQ time slice extensions */
+#define PR_RSEQ_SLICE_EXTENSION 79
+# define PR_RSEQ_SLICE_EXTENSION_GET 1
+# define PR_RSEQ_SLICE_EXTENSION_SET 2
+/*
+ * Bits for RSEQ_SLICE_EXTENSION_GET/SET
+ * PR_RSEQ_SLICE_EXT_ENABLE: Enable
+ */
+# define PR_RSEQ_SLICE_EXT_ENABLE 0x01
+
#endif /* _LINUX_PRCTL_H */
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,7 @@
#define RSEQ_BUILD_SLOW_PATH
#include <linux/debugfs.h>
+#include <linux/prctl.h>
#include <linux/ratelimit.h>
#include <linux/rseq_entry.h>
#include <linux/sched.h>
@@ -501,6 +502,57 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+ switch (arg2) {
+ case PR_RSEQ_SLICE_EXTENSION_GET:
+ if (arg3)
+ return -EINVAL;
+ return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
+
+ case PR_RSEQ_SLICE_EXTENSION_SET: {
+ u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+ bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
+
+ if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
+ return -EINVAL;
+ if (!rseq_slice_extension_enabled())
+ return -ENOTSUPP;
+ if (!current->rseq.usrptr)
+ return -ENXIO;
+
+ /* No change? */
+ if (enable == !!current->rseq.slice.state.enabled)
+ return 0;
+
+ if (get_user(rflags, &current->rseq.usrptr->flags))
+ goto die;
+
+ if (current->rseq.slice.state.enabled)
+ valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+ if ((rflags & valid) != valid)
+ goto die;
+
+ rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+ rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+ if (enable)
+ rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+ if (put_user(rflags, &current->rseq.usrptr->flags))
+ goto die;
+
+ current->rseq.slice.state.enabled = enable;
+ return 0;
+ }
+ default:
+ return -EINVAL;
+ }
+die:
+ force_sig(SIGSEGV);
+ return -EFAULT;
+}
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -53,6 +53,7 @@
#include <linux/time_namespace.h>
#include <linux/binfmts.h>
#include <linux/futex.h>
+#include <linux/rseq.h>
#include <linux/sched.h>
#include <linux/sched/autogroup.h>
@@ -2868,6 +2869,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
case PR_FUTEX_HASH:
error = futex_hash_prctl(arg2, arg3, arg4);
break;
+ case PR_RSEQ_SLICE_EXTENSION:
+ if (arg4 || arg5)
+ return -EINVAL;
+ error = rseq_slice_extension_prctl(arg2, arg3);
+ break;
default:
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
* [patch V6 05/11] rseq: Implement sys_rseq_slice_yield()
2025-12-15 16:52 [patch V6 00/11] rseq: Implement time slice extension mechanism Thomas Gleixner
` (3 preceding siblings ...)
2025-12-15 16:52 ` [patch V6 04/11] rseq: Add prctl() to enable " Thomas Gleixner
@ 2025-12-15 16:52 ` Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
` (2 more replies)
2025-12-15 16:52 ` [patch V6 06/11] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
` (6 subsequent siblings)
11 siblings, 3 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 16:52 UTC
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Provide a new syscall whose only purpose is to yield the CPU after the
kernel granted a time slice extension.
sched_yield() is not suitable for that because it unconditionally
schedules, but the end of the time slice extension is not required to
schedule when the task was already preempted. This also allows a
strict check for termination to catch user space invoking random syscalls,
including sched_yield(), from a time slice extension region.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arch@vger.kernel.org
---
Note: This still uses 470 which conflicts with -next, but this is scheduled
for post -rc1 and basing it on -next makes it more complicated for
now. Will be changed in the final submission.
---
V5: Rework to adjust to support for arbitrary syscall changes
Use n32/n64/o32 for MIPS - Arnd
V2: Use the proper name in sys_ni.c and add comment - Prateek
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/tools/syscall_32.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/rseq_types.h | 2 ++
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 ++++-
kernel/rseq.c | 21 +++++++++++++++++++++
kernel/sys_ni.c | 1 +
scripts/syscall.tbl | 1 +
22 files changed, 46 insertions(+), 1 deletion(-)
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -510,3 +510,4 @@
578 common file_getattr sys_file_getattr
579 common file_setattr sys_file_setattr
580 common listns sys_listns
+581 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -485,3 +485,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -482,3 +482,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -470,3 +470,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -476,3 +476,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -409,3 +409,4 @@
468 n32 file_getattr sys_file_getattr
469 n32 file_setattr sys_file_setattr
470 n32 listns sys_listns
+471 n32 rseq_slice_yield sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -385,3 +385,4 @@
468 n64 file_getattr sys_file_getattr
469 n64 file_setattr sys_file_setattr
470 n64 listns sys_listns
+471 n64 rseq_slice_yield sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -458,3 +458,4 @@
468 o32 file_getattr sys_file_getattr
469 o32 file_setattr sys_file_setattr
470 o32 listns sys_listns
+471 o32 rseq_slice_yield sys_rseq_slice_yield
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -469,3 +469,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -561,3 +561,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 nospu rseq_slice_yield sys_rseq_slice_yield
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -397,3 +397,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -474,3 +474,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -516,3 +516,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -476,3 +476,4 @@
468 i386 file_getattr sys_file_getattr
469 i386 file_setattr sys_file_setattr
470 i386 listns sys_listns
+471 i386 rseq_slice_yield sys_rseq_slice_yield
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -395,6 +395,7 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
#
# Due to a historical design error, certain syscalls are numbered differently
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -89,9 +89,11 @@ union rseq_slice_state {
/**
* struct rseq_slice - Status information for rseq time slice extension
* @state: Time slice extension state
+ * @yielded: Indicator for rseq_slice_yield()
*/
struct rseq_slice {
union rseq_slice_state state;
+ u8 yielded;
};
/**
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -961,6 +961,7 @@ asmlinkage long sys_statx(int dfd, const
unsigned mask, struct statx __user *buffer);
asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
int flags, uint32_t sig);
+asmlinkage long sys_rseq_slice_yield(void);
asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
asmlinkage long sys_open_tree_attr(int dfd, const char __user *path,
unsigned flags,
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -860,8 +860,11 @@
#define __NR_listns 470
__SYSCALL(__NR_listns, sys_listns)
+#define __NR_rseq_slice_yield 471
+__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+
#undef __NR_syscalls
-#define __NR_syscalls 471
+#define __NR_syscalls 472
/*
* 32 bit systems traditionally used different
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -553,6 +553,27 @@ int rseq_slice_extension_prctl(unsigned
return -EFAULT;
}
+/**
+ * sys_rseq_slice_yield - yield the current processor side effect free if a
+ * task granted with a time slice extension is done with
+ * the critical work before being forced out.
+ *
+ * Return: 1 if the task successfully yielded the CPU within the granted slice.
+ * 0 if the slice extension was either never granted or was revoked by
+ * going over the granted extension, using a syscall other than this one
+ * or being scheduled out earlier due to a subsequent interrupt.
+ *
+ * The syscall does not schedule because the syscall entry work immediately
+ * relinquishes the CPU and schedules if required.
+ */
+SYSCALL_DEFINE0(rseq_slice_yield)
+{
+ int yielded = !!current->rseq.slice.yielded;
+
+ current->rseq.slice.yielded = 0;
+ return yielded;
+}
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -390,6 +390,7 @@ COND_SYSCALL(setuid16);
/* restartable sequence */
COND_SYSCALL(rseq);
+COND_SYSCALL(rseq_slice_yield);
COND_SYSCALL(uretprobe);
COND_SYSCALL(uprobe);
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -411,3 +411,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
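
For illustration only, a user space caller could wrap the new syscall
roughly as follows. This is a sketch and not part of the patch: the names
slice_yield() and slice_yield_classify() are hypothetical, the fallback
define assumes that the installed headers do not know the number yet, and
471 is only valid for the syscall tables touched above. On a kernel without
this series the raw call fails with ENOSYS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Number assigned by this series; fallback for headers which lack it */
#ifndef __NR_rseq_slice_yield
#define __NR_rseq_slice_yield	471
#endif

enum slice_yield_result {
	SLICE_YIELD_FAILED	= -1,	/* syscall(2) reported an error, e.g. ENOSYS */
	SLICE_YIELD_NO_GRANT	=  0,	/* Never granted or already revoked */
	SLICE_YIELD_DONE	=  1,	/* CPU relinquished within the grant */
};

/* Hypothetical helper: map the raw return value to a result */
enum slice_yield_result slice_yield_classify(long ret)
{
	/* syscall(2) returns -1 on error, the syscall itself returns 0 or 1 */
	assert(ret >= -1 && ret <= 1);

	if (ret < 0)
		return SLICE_YIELD_FAILED;
	return ret ? SLICE_YIELD_DONE : SLICE_YIELD_NO_GRANT;
}

/* Hypothetical wrapper: invoke at the end of the critical section */
enum slice_yield_result slice_yield(void)
{
	return slice_yield_classify(syscall(__NR_rseq_slice_yield));
}
```

A caller would typically ignore SLICE_YIELD_NO_GRANT, as it only means that
the kernel had nothing to take back at that point.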
* [patch V6 06/11] rseq: Implement syscall entry work for time slice extensions
2025-12-15 16:52 [patch V6 00/11] rseq: Implement time slice extension mechanism Thomas Gleixner
` (4 preceding siblings ...)
2025-12-15 16:52 ` [patch V6 05/11] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
@ 2025-12-15 16:52 ` Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
` (2 more replies)
2025-12-15 16:52 ` [patch V6 07/11] rseq: Implement time slice extension enforcement timer Thomas Gleixner
` (5 subsequent siblings)
11 siblings, 3 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 16:52 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
The kernel sets SYSCALL_WORK_SYSCALL_RSEQ_SLICE when it grants a time slice
extension. This allows the syscall entry work to handle the
rseq_slice_yield() syscall, which user space uses to relinquish the CPU
after finishing the critical section for which it requested an extension.
If the kernel state is still GRANTED, the kernel resets both the kernel and
the user space state with a set of sanity checks. If the kernel state is
already cleared, then the syscall raced against the timer or some other
interrupt and the work code just clears the work bit.
Doing it in syscall entry work allows catching user space which issues an
arbitrary syscall, i.e. not rseq_slice_yield(), from the critical section.
Contrary to the initial strict requirement to use rseq_slice_yield(),
arbitrary syscalls are no longer considered a violation of the ABI
contract. This allows onion architecture applications, which cannot control
the code inside a critical section, to utilize the mechanism as well.
If the code detects inconsistent user space state, this results in a
SIGSEGV for the application.
If the grant was still active and the task was not preempted yet, the work
code reschedules immediately before continuing through the syscall.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V5: Allow arbitrary syscalls
V3: Use get/put_user()
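
The entry work described above can be approximated by a small user space
model. This is only a sketch for discussion: struct slice_model and the
model_*() functions are made up for this note, and the model leaves out the
preemption and interrupt guards, the user access fault handling and the
debug validation of the real code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_RSEQ_SLICE_YIELD	471	/* Syscall number from this series */

/* Simplified stand-in for the per task state (model only) */
struct slice_model {
	bool		granted;	/* Kernel side grant state */
	bool		sched_switch;	/* Task was scheduled since the grant */
	bool		yielded;	/* Reported by rseq_slice_yield() */
	bool		need_resched;	/* Reschedule requested */
	unsigned int	user_ctrl;	/* User space slice_ctrl word */
	unsigned int	s_yielded;	/* Statistics: yield syscalls */
	unsigned int	s_aborted;	/* Statistics: other syscalls */
};

/* Model of the syscall entry work for a task with the work bit set */
void model_syscall_enter_work(struct slice_model *m, long syscall)
{
	assert(m);

	/* Grant already revoked by the timer or an interrupt: nothing to do */
	if (!m->granted)
		return;

	/* Only request a reschedule if the task was not scheduled already */
	if (!m->sched_switch) {
		m->need_resched = true;
		if (syscall == NR_RSEQ_SLICE_YIELD) {
			m->s_yielded++;
			m->yielded = true;
		} else {
			m->s_aborted++;
		}
	}

	/* Clear the grant in kernel state and user space */
	m->granted = false;
	m->user_ctrl = 0;
}

/* Model of the syscall body: report and clear the yielded indicator */
int model_rseq_slice_yield(struct slice_model *m)
{
	int yielded = m->yielded;

	m->yielded = false;
	return yielded;
}
```

In this model an arbitrary syscall ends the grant exactly like
rseq_slice_yield() does, but only the latter reports 1 to the caller.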
---
include/linux/entry-common.h | 2
include/linux/rseq.h | 2
include/linux/thread_info.h | 16 ++++---
kernel/entry/syscall-common.c | 11 ++++-
kernel/rseq.c | 91 ++++++++++++++++++++++++++++++++++++++++++
5 files changed, 112 insertions(+), 10 deletions(-)
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -36,8 +36,8 @@
SYSCALL_WORK_SYSCALL_EMU | \
SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
+ SYSCALL_WORK_SYSCALL_RSEQ_SLICE | \
ARCH_SYSCALL_WORK_ENTER)
-
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -164,8 +164,10 @@ static inline void rseq_syscall(struct p
#endif /* !CONFIG_DEBUG_RSEQ */
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+void rseq_syscall_enter_work(long syscall);
int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline void rseq_syscall_enter_work(long syscall) { }
static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
{
return -ENOTSUPP;
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,15 +46,17 @@ enum syscall_work_bit {
SYSCALL_WORK_BIT_SYSCALL_AUDIT,
SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+ SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE,
};
-#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
-#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
-#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
-#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
-#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
-#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
-#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
+#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
+#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
+#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
+#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
+#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
+#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_RSEQ_SLICE BIT(SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE)
#endif
#include <asm/thread_info.h>
--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -17,8 +17,7 @@ static inline void syscall_enter_audit(s
}
}
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
- unsigned long work)
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work)
{
long ret = 0;
@@ -32,6 +31,14 @@ long syscall_trace_enter(struct pt_regs
return -1L;
}
+ /*
+ * User space got a time slice extension granted and relinquishes
+ * the CPU. The work stops the slice timer to avoid an extra round
+ * through hrtimer_interrupt().
+ */
+ if (work & SYSCALL_WORK_SYSCALL_RSEQ_SLICE)
+ rseq_syscall_enter_work(syscall);
+
/* Handle ptrace */
if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
ret = ptrace_report_syscall_entry(regs);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -502,6 +502,97 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+static inline void rseq_slice_set_need_resched(struct task_struct *curr)
+{
+ /*
+ * The interrupt guard is required to prevent inconsistent state in
+ * this case:
+ *
+ * set_tsk_need_resched()
+ * --> Interrupt
+ * wakeup()
+ * set_tsk_need_resched()
+ * set_preempt_need_resched()
+ * schedule_on_return()
+ * clear_tsk_need_resched()
+ * clear_preempt_need_resched()
+ * set_preempt_need_resched() <- Inconsistent state
+ *
+ * This is safe vs. a remote set of TIF_NEED_RESCHED because that
+ * only sets the already set bit and does not create inconsistent
+ * state.
+ */
+ scoped_guard(irq)
+ set_need_resched_current();
+}
+
+static void rseq_slice_validate_ctrl(u32 expected)
+{
+ u32 __user *sctrl = &current->rseq.usrptr->slice_ctrl.all;
+ u32 uval;
+
+ if (get_user(uval, sctrl) || uval != expected)
+ force_sig(SIGSEGV);
+}
+
+/*
+ * Invoked from syscall entry if a time slice extension was granted and the
+ * kernel did not clear it before user space left the critical section.
+ *
+ * While the recommended way to relinquish the CPU side effect free is
+ * rseq_slice_yield(2), any syscall within a granted slice terminates the
+ * grant and immediately reschedules if required. This supports onion layer
+ * applications, where the code requesting the grant cannot control the
+ * code within the critical section.
+ */
+void rseq_syscall_enter_work(long syscall)
+{
+ struct task_struct *curr = current;
+ struct rseq_slice_ctrl ctrl = { .granted = curr->rseq.slice.state.granted };
+
+ clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+
+ if (static_branch_unlikely(&rseq_debug_enabled))
+ rseq_slice_validate_ctrl(ctrl.all);
+
+ /*
+ * The kernel might have raced, revoked the grant and updated
+ * userspace, but kept the SLICE work set.
+ */
+ if (!ctrl.granted)
+ return;
+
+ /*
+ * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
+ * kernels. Leaving the scope will reschedule on preemption models
+ * FULL, LAZY and RT if necessary.
+ */
+ scoped_guard(preempt) {
+ /*
+ * Now that preemption is disabled, quickly check whether
+ * the task was already rescheduled before arriving here.
+ */
+ if (!curr->rseq.event.sched_switch) {
+ rseq_slice_set_need_resched(curr);
+
+ if (syscall == __NR_rseq_slice_yield) {
+ rseq_stat_inc(rseq_stats.s_yielded);
+ /* Update the yielded state for syscall return */
+ curr->rseq.slice.yielded = 1;
+ } else {
+ rseq_stat_inc(rseq_stats.s_aborted);
+ }
+ }
+ }
+ /* Reschedule on NONE/VOLUNTARY preemption models */
+ cond_resched();
+
+ /* Clear the grant in kernel state and user space */
+ curr->rseq.slice.state.granted = false;
+ if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all))
+ force_sig(SIGSEGV);
+}
+
int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
{
switch (arg2) {
* [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2025-12-15 16:52 [patch V6 00/11] rseq: Implement time slice extension mechanism Thomas Gleixner
` (5 preceding siblings ...)
2025-12-15 16:52 ` [patch V6 06/11] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
@ 2025-12-15 16:52 ` Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
` (7 more replies)
2025-12-15 16:52 ` [patch V6 08/11] rseq: Reset slice extension when scheduled Thomas Gleixner
` (4 subsequent siblings)
11 siblings, 8 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 16:52 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.
It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:
1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
independently of CONFIG_HIGH_RES_TIMERS.
2) HRTICK usage in the scheduler can be runtime disabled or is only used
for certain aspects of scheduling.
3) The hrtick() function calls into the scheduler code, which might have
unexpected consequences when it is invoked due to a time slice
enforcement expiry, especially when the task managed to clear the
grant via sched_yield().
It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.
Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.
The timer is armed when an extension was granted right before actually
returning to user mode in rseq_exit_to_user_mode_restart().
It is disarmed when the task relinquishes the CPU. This is expensive as
the timer is probably the first expiring timer on the CPU, which means the
cancellation has to reprogram the hardware. But that's less expensive than
going through a full hrtimer interrupt cycle for nothing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V5: Document the slice extension range - PeterZ
V4: Update comment - Steven
V3: Add sysctl documentation, simplify timer cancelation - Sebastian
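
The time arithmetic behind the enforcement can be sketched in plain C. This
is a model under stated assumptions, not kernel code: the function and enum
names are invented for this note, the limits mirror the sysctl range of this
patch, and the clock is passed in as a plain nanosecond value instead of
being read via ktime_get_mono_fast_ns().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_USEC		1000U
#define SLICE_EXT_MIN_NS	(10 * NSEC_PER_USEC)
#define SLICE_EXT_DEFAULT_NS	(30 * NSEC_PER_USEC)
#define SLICE_EXT_MAX_NS	(50 * NSEC_PER_USEC)

/* Model of the sysctl range check: values outside 10us..50us are invalid */
bool slice_ext_valid(unsigned int nsecs)
{
	return nsecs >= SLICE_EXT_MIN_NS && nsecs <= SLICE_EXT_MAX_NS;
}

/* Expiry time which is stored when the extension is granted */
uint64_t slice_expiry(uint64_t now_ns, unsigned int ext_ns)
{
	assert(slice_ext_valid(ext_ns));
	return now_ns + ext_ns;
}

enum arm_action {
	ARM_TIMER,	/* Grant still valid: arm the timer and go to user space */
	RESCHEDULE,	/* Grant expired on the way out: reschedule instead */
};

/* Decision taken right before returning to user mode */
enum arm_action slice_arm_decision(uint64_t expires_ns, uint64_t now_ns)
{
	return expires_ns < now_ns ? RESCHEDULE : ARM_TIMER;
}
```

With the default of 30us, a grant issued at t = 1000ns expires at 31000ns. A
task which reaches the exit path later than that never gets the timer armed
and is rescheduled right away.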
---
Documentation/admin-guide/sysctl/kernel.rst | 8 +
include/linux/rseq_entry.h | 38 +++++---
include/linux/rseq_types.h | 2
kernel/rseq.c | 132 +++++++++++++++++++++++++++-
4 files changed, 167 insertions(+), 13 deletions(-)
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1228,6 +1228,14 @@ reboot-cmd (SPARC only)
ROM/Flash boot loader. Maybe to tell it what to do after
rebooting. ???
+rseq_slice_extension_nsec
+=========================
+
+A task can request to delay its scheduling if it is in a critical section
+via the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sysctl sets the
+maximum allowed extension in nanoseconds before scheduling of the task is
+enforced. The default value is 30000ns (30us). The possible range is
+10000ns (10us) to 50000ns (50us).
sched_energy_aware
==================
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -87,8 +87,24 @@ static __always_inline bool rseq_slice_e
{
return static_branch_likely(&rseq_slice_extension_key);
}
+
+extern unsigned int rseq_slice_ext_nsecs;
+bool __rseq_arm_slice_extension_timer(void);
+
+static __always_inline bool rseq_arm_slice_extension_timer(void)
+{
+ if (!rseq_slice_extension_enabled())
+ return false;
+
+ if (likely(!current->rseq.slice.state.granted))
+ return false;
+
+ return __rseq_arm_slice_extension_timer();
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
+static inline bool rseq_arm_slice_extension_timer(void) { return false; }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -543,17 +559,19 @@ static __always_inline void clear_tif_rs
static __always_inline bool
rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
{
- if (likely(!test_tif_rseq(ti_work)))
- return false;
-
- if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
- current->rseq.event.slowpath = true;
- set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
- return true;
+ if (unlikely(test_tif_rseq(ti_work))) {
+ if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
+ current->rseq.event.slowpath = true;
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ return true;
+ }
+ clear_tif_rseq();
}
-
- clear_tif_rseq();
- return false;
+ /*
+ * Arm the slice extension timer if nothing to do anymore and the
+ * task really goes out to user space.
+ */
+ return rseq_arm_slice_extension_timer();
}
#else /* CONFIG_GENERIC_ENTRY */
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -89,10 +89,12 @@ union rseq_slice_state {
/**
* struct rseq_slice - Status information for rseq time slice extension
* @state: Time slice extension state
+ * @expires: The time when a grant expires
* @yielded: Indicator for rseq_slice_yield()
*/
struct rseq_slice {
union rseq_slice_state state;
+ u64 expires;
u8 yielded;
};
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,8 @@
#define RSEQ_BUILD_SLOW_PATH
#include <linux/debugfs.h>
+#include <linux/hrtimer.h>
+#include <linux/percpu.h>
#include <linux/prctl.h>
#include <linux/ratelimit.h>
#include <linux/rseq_entry.h>
@@ -500,8 +502,91 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
}
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+struct slice_timer {
+ struct hrtimer timer;
+ void *cookie;
+};
+
+unsigned int rseq_slice_ext_nsecs __read_mostly = 30 * NSEC_PER_USEC;
+static DEFINE_PER_CPU(struct slice_timer, slice_timer);
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+/*
+ * When the timer expires and the task is still in user space, the return
+ * from interrupt will revoke the grant and schedule. If the task already
+ * entered the kernel via a syscall and the timer fires before the syscall
+ * work was able to cancel it, then depending on the preemption model this
+ * will either reschedule on return from interrupt or in the syscall work
+ * below.
+ */
+static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
+{
+ struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
+
+ /*
+ * Validate that the task which armed the timer is still on the
+ * CPU. It could have been scheduled out without canceling the
+ * timer.
+ */
+ if (st->cookie == current && current->rseq.slice.state.granted) {
+ rseq_stat_inc(rseq_stats.s_expired);
+ set_need_resched_current();
+ }
+ return HRTIMER_NORESTART;
+}
+
+bool __rseq_arm_slice_extension_timer(void)
+{
+ struct slice_timer *st = this_cpu_ptr(&slice_timer);
+ struct task_struct *curr = current;
+
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * This check prevents a task, which got a time slice extension
+ * granted, from exceeding the maximum scheduling latency when the
+ * grant expired before going out to user space. Don't bother to
+ * clear the grant here, it will be cleaned up automatically before
+ * going out to user space after being scheduled back in.
+ */
+ if ((unlikely(curr->rseq.slice.expires < ktime_get_mono_fast_ns()))) {
+ set_need_resched_current();
+ return true;
+ }
+
+ /*
+ * Store the task pointer as a cookie for comparison in the timer
+ * function. This is safe as the timer is CPU local and cannot be
+ * in the expiry function at this point.
+ */
+ st->cookie = curr;
+ hrtimer_start(&st->timer, curr->rseq.slice.expires, HRTIMER_MODE_ABS_PINNED_HARD);
+ /* Arm the syscall entry work */
+ set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+ return false;
+}
+
+static void rseq_cancel_slice_extension_timer(void)
+{
+ struct slice_timer *st = this_cpu_ptr(&slice_timer);
+
+ /*
+ * st->cookie can be safely read as preemption is disabled and the
+ * timer is CPU local.
+ *
+ * As this is most probably the first expiring timer, the cancel is
+ * expensive as it has to reprogram the hardware, but that's less
+ * expensive than going through a full hrtimer_interrupt() cycle
+ * for nothing.
+ *
+ * hrtimer_try_to_cancel() is sufficient here as the timer is CPU
+ * local and once the hrtimer code disabled interrupts the timer
+ * callback cannot be running.
+ */
+ if (st->cookie == current)
+ hrtimer_try_to_cancel(&st->timer);
+}
+
static inline void rseq_slice_set_need_resched(struct task_struct *curr)
{
/*
@@ -563,11 +648,14 @@ void rseq_syscall_enter_work(long syscal
return;
/*
- * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
- * kernels. Leaving the scope will reschedule on preemption models
- * FULL, LAZY and RT if necessary.
+ * Required to stabilize the per CPU timer pointer and to make
+ * set_tsk_need_resched() correct on PREEMPT[RT] kernels.
+ *
+ * Leaving the scope will reschedule on preemption models FULL,
+ * LAZY and RT if necessary.
*/
scoped_guard(preempt) {
+ rseq_cancel_slice_extension_timer();
/*
* Now that preemption is disabled, quickly check whether
* the task was already rescheduled before arriving here.
@@ -665,6 +753,31 @@ SYSCALL_DEFINE0(rseq_slice_yield)
return yielded;
}
+#ifdef CONFIG_SYSCTL
+static const unsigned int rseq_slice_ext_nsecs_min = 10 * NSEC_PER_USEC;
+static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC;
+
+static const struct ctl_table rseq_slice_ext_sysctl[] = {
+ {
+ .procname = "rseq_slice_extension_nsec",
+ .data = &rseq_slice_ext_nsecs,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = (unsigned int *)&rseq_slice_ext_nsecs_min,
+ .extra2 = (unsigned int *)&rseq_slice_ext_nsecs_max,
+ },
+};
+
+static void rseq_slice_sysctl_init(void)
+{
+ if (rseq_slice_extension_enabled())
+ register_sysctl_init("kernel", rseq_slice_ext_sysctl);
+}
+#else /* CONFIG_SYSCTL */
+static inline void rseq_slice_sysctl_init(void) { }
+#endif /* !CONFIG_SYSCTL */
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
@@ -677,4 +790,17 @@ static int __init rseq_slice_cmdline(cha
return 1;
}
__setup("rseq_slice_ext=", rseq_slice_cmdline);
+
+static int __init rseq_slice_init(void)
+{
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ hrtimer_setup(per_cpu_ptr(&slice_timer.timer, cpu), rseq_slice_expired,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
+ }
+ rseq_slice_sysctl_init();
+ return 0;
+}
+device_initcall(rseq_slice_init);
#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
* [patch V6 08/11] rseq: Reset slice extension when scheduled
2025-12-15 16:52 [patch V6 00/11] rseq: Implement time slice extension mechanism Thomas Gleixner
` (6 preceding siblings ...)
2025-12-15 16:52 ` [patch V6 07/11] rseq: Implement time slice extension enforcement timer Thomas Gleixner
@ 2025-12-15 16:52 ` Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
` (2 more replies)
2025-12-15 16:52 ` [patch V6 09/11] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
` (3 subsequent siblings)
11 siblings, 3 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 16:52 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
When a time slice extension was granted in the need_resched() check on exit
to user space, the task can still be scheduled out in one of the other
pending work items. When it gets scheduled back in and need_resched() is
not set, the stale grant would be preserved, which is wrong.
RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the
critical section and ID update mechanisms.
Utilize them and clear the user space slice control member of struct rseq
unconditionally within the existing user access sections. That costs just
one additional unconditional store in that path.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
include/linux/rseq_entry.h | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -102,9 +102,17 @@ static __always_inline bool rseq_arm_sli
return __rseq_arm_slice_extension_timer();
}
+static __always_inline void rseq_slice_clear_grant(struct task_struct *t)
+{
+ if (IS_ENABLED(CONFIG_RSEQ_STATS) && t->rseq.slice.state.granted)
+ rseq_stat_inc(rseq_stats.s_revoked);
+ t->rseq.slice.state.granted = false;
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
static inline bool rseq_arm_slice_extension_timer(void) { return false; }
+static inline void rseq_slice_clear_grant(struct task_struct *t) { }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -391,8 +399,15 @@ bool rseq_set_ids_get_csaddr(struct task
unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
if (csaddr)
unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
+
+ /* Open coded, so it's in the same user access region */
+ if (rseq_slice_extension_enabled()) {
+ /* Unconditionally clear it, no point in conditionals */
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+ }
}
+ rseq_slice_clear_grant(t);
/* Cache the new values */
t->rseq.ids.cpu_cid = ids->cpu_cid;
rseq_stat_inc(rseq_stats.ids);
@@ -488,8 +503,17 @@ static __always_inline bool rseq_exit_us
*/
u64 csaddr;
- if (unlikely(get_user_inline(csaddr, &rseq->rseq_cs)))
- return false;
+ scoped_user_rw_access(rseq, efault) {
+ unsafe_get_user(csaddr, &rseq->rseq_cs, efault);
+
+ /* Open coded, so it's in the same user access region */
+ if (rseq_slice_extension_enabled()) {
+ /* Unconditionally clear it, no point in conditionals */
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+ }
+ }
+
+ rseq_slice_clear_grant(t);
if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
@@ -505,6 +529,8 @@ static __always_inline bool rseq_exit_us
u32 node_id = cpu_to_node(ids.cpu_id);
return rseq_update_usr(t, regs, &ids, node_id);
+efault:
+ return false;
}
static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
* [patch V6 09/11] rseq: Implement rseq_grant_slice_extension()
2025-12-15 16:52 [patch V6 00/11] rseq: Implement time slice extension mechanism Thomas Gleixner
` (7 preceding siblings ...)
2025-12-15 16:52 ` [patch V6 08/11] rseq: Reset slice extension when scheduled Thomas Gleixner
@ 2025-12-15 16:52 ` Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
` (2 more replies)
2025-12-15 16:52 ` [patch V6 10/11] entry: Hook up rseq time slice extension Thomas Gleixner
` (2 subsequent siblings)
11 siblings, 3 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 16:52 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Provide the actual decision function, which decides whether a time slice
extension is granted in the exit to user mode path when NEED_RESCHED is
evaluated.
The decision is made in two stages. First an inline quick check to avoid
going into the actual decision function. This checks whether:
#1 the functionality is enabled
#2 the exit is a return from interrupt to user mode
#3 any TIF bit, which causes extra work is set. That includes TIF_RSEQ,
which means the task was already scheduled out.
The slow path, which implements the actual user space ABI, is invoked
when:
A) #1 is true, #2 is true and #3 is false
It checks whether user space requested a slice extension by setting
the request bit in the rseq slice_ctrl field. If so, it grants the
extension and stores the slice expiry time, so that the actual exit
code can double check whether the slice is already exhausted before
going back.
B) #1 - #3 are true _and_ a slice extension was granted in a previous
loop iteration
In this case the grant is revoked.
If the user space access faults or invalid state is detected, the task is
terminated with SIGSEGV.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V2: Provide an extra stub for the !RSEQ case - Prateek
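
The two stage decision can be written down as a small model to make the
cases A) and B) above easier to follow. This is a sketch only: struct
grant_model and model_grant_slice_extension() are hypothetical names, the
user space control word is reduced to two booleans, and the quick check
assumes that the kernel state word covers both the enabled and the granted
member. Fault handling and the TIF_NEED_RESCHED manipulation are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for kernel and user space state (model only) */
struct grant_model {
	bool enabled;		/* Extension enabled for the task */
	bool user_irq;		/* Exit is a return from interrupt to user mode */
	bool granted;		/* Kernel side grant state */
	bool usr_request;	/* Request bit in the user space control word */
	bool usr_granted;	/* Granted bit in the user space control word */
};

/* Returns true if the reschedule is deferred, i.e. a grant was issued */
bool model_grant_slice_extension(struct grant_model *m, bool work_pending)
{
	assert(m);

	/* Quick check: nothing to grant and nothing to revoke */
	if (!(m->enabled && m->user_irq) && !m->granted)
		return false;

	/* Extra work pending or a grant from a previous iteration: revoke */
	if (work_pending || m->granted) {
		m->usr_request = false;
		m->usr_granted = false;
		m->granted = false;
		return false;
	}

	/* No request from user space: reschedule as usual */
	if (!m->usr_request)
		return false;

	/* Grant the extension and tell user space about it */
	m->usr_request = false;
	m->usr_granted = true;
	m->granted = true;
	return true;
}
```

Calling the model twice in a row with a pending request shows the intended
behaviour: the first call grants, the second call revokes the grant again.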
---
include/linux/rseq_entry.h | 108 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 108 insertions(+)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -42,6 +42,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
#ifdef CONFIG_RSEQ
#include <linux/jump_label.h>
#include <linux/rseq.h>
+#include <linux/sched/signal.h>
#include <linux/uaccess.h>
#include <linux/tracepoint-defs.h>
@@ -109,10 +110,116 @@ static __always_inline void rseq_slice_c
t->rseq.slice.state.granted = false;
}
+static __always_inline bool rseq_grant_slice_extension(bool work_pending)
+{
+ struct task_struct *curr = current;
+ struct rseq_slice_ctrl usr_ctrl;
+ union rseq_slice_state state;
+ struct rseq __user *rseq;
+
+ if (!rseq_slice_extension_enabled())
+ return false;
+
+ /* If not enabled or not a return from interrupt, nothing to do. */
+ state = curr->rseq.slice.state;
+ state.enabled &= curr->rseq.event.user_irq;
+ if (likely(!state.state))
+ return false;
+
+ rseq = curr->rseq.usrptr;
+ scoped_user_rw_access(rseq, efault) {
+
+ /*
+ * Quick check conditions where a grant is not possible or
+ * needs to be revoked.
+ *
+ * 1) Any TIF bit which needs to do extra work aside of
+ * rescheduling prevents a grant.
+ *
+ * 2) A previous rescheduling request resulted in a slice
+ * extension grant.
+ */
+ if (unlikely(work_pending || state.granted)) {
+ /* Clear user control unconditionally. No point in checking */
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+ rseq_slice_clear_grant(curr);
+ return false;
+ }
+
+ unsafe_get_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault);
+ if (likely(!(usr_ctrl.request)))
+ return false;
+
+ /* Grant the slice extension */
+ usr_ctrl.request = 0;
+ usr_ctrl.granted = 1;
+ unsafe_put_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault);
+ }
+
+ rseq_stat_inc(rseq_stats.s_granted);
+
+ curr->rseq.slice.state.granted = true;
+ /* Store expiry time for arming the timer on the way out */
+ curr->rseq.slice.expires = data_race(rseq_slice_ext_nsecs) + ktime_get_mono_fast_ns();
+ /*
+ * This is racy against a remote CPU setting TIF_NEED_RESCHED in
+ * several ways:
+ *
+ * 1)
+ * CPU0 CPU1
+ * clear_tsk()
+ * set_tsk()
+ * clear_preempt()
+ * Raise scheduler IPI on CPU0
+ * --> IPI
+ * fold_need_resched() -> Folds correctly
+ * 2)
+ * CPU0 CPU1
+ * set_tsk()
+ * clear_tsk()
+ * clear_preempt()
+ * Raise scheduler IPI on CPU0
+ * --> IPI
+ * fold_need_resched() <- NOOP as TIF_NEED_RESCHED is false
+ *
+ * #1 is not any different from a regular remote reschedule as it
+ * sets the previously not set bit and then raises the IPI which
+ * folds it into the preempt counter
+ *
+ * #2 is obviously incorrect from a scheduler POV, but it's not
+ * differently incorrect than the code below clearing the
+ * reschedule request with the safety net of the timer.
+ *
+ * The important part is that the clearing is protected against the
+ * scheduler IPI and also against any other interrupt which might
+ * end up waking up a task and setting the bits in the middle of
+ * the operation:
+ *
+ * clear_tsk()
+ * ---> Interrupt
+ * wakeup_on_this_cpu()
+ * set_tsk()
+ * set_preempt()
+ * clear_preempt()
+ *
+ * which would be inconsistent state.
+ */
+ scoped_guard(irq) {
+ clear_tsk_need_resched(curr);
+ clear_preempt_need_resched();
+ }
+ return true;
+
+efault:
+ force_sig(SIGSEGV);
+ return false;
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
static inline bool rseq_arm_slice_extension_timer(void) { return false; }
static inline void rseq_slice_clear_grant(struct task_struct *t) { }
+static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -671,6 +778,7 @@ static inline void rseq_syscall_exit_to_
static inline void rseq_irqentry_exit_to_user_mode(void) { }
static inline void rseq_exit_to_user_mode_legacy(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
+static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
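Not part of the patch: a minimal user-space model of the decision flow in
rseq_grant_slice_extension(), for readers who want to trace the grant and
revoke paths without the uaccess and TIF details. struct task_model and
model_grant() are hypothetical names, and the user memory accesses are
replaced by plain struct fields, so faults and the SIGSEGV path are not
modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the task state and the user space slice_ctrl. */
struct task_model {
	bool	enabled;	/* Mechanism was enabled via prctl() */
	bool	from_user_irq;	/* Return to user space from an interrupt */
	bool	granted;	/* Kernel side grant state */
	uint8_t	usr_request;	/* Models rseq->slice_ctrl.request */
	uint8_t	usr_granted;	/* Models rseq->slice_ctrl.granted */
};

/* Returns true if the extension is granted and schedule() can be skipped. */
static bool model_grant(struct task_model *t, bool work_pending)
{
	bool eligible = t->enabled && t->from_user_irq;

	/* Neither eligible nor an outstanding grant: nothing to do. */
	if (!eligible && !t->granted)
		return false;

	/* Other pending work or a previous grant: revoke and reschedule. */
	if (work_pending || t->granted) {
		t->usr_request = 0;
		t->usr_granted = 0;
		t->granted = false;
		return false;
	}

	/* No request from user space: regular reschedule. */
	if (!t->usr_request)
		return false;

	/* Grant: consume the request and flag the grant on both sides. */
	t->usr_request = 0;
	t->usr_granted = 1;
	t->granted = true;
	return true;
}
```

In this model a second reschedule request while a grant is outstanding
always takes the revoke path, which matches condition 2) in the comment of
the patch.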
^ permalink raw reply [flat|nested] 78+ messages in thread
* [patch V6 10/11] entry: Hook up rseq time slice extension
2025-12-15 16:52 [patch V6 00/11] rseq: Implement time slice extension mechanism Thomas Gleixner
` (8 preceding siblings ...)
2025-12-15 16:52 ` [patch V6 09/11] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
@ 2025-12-15 16:52 ` Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
` (2 more replies)
2025-12-15 16:52 ` [patch V6 11/11] selftests/rseq: Implement time slice extension test Thomas Gleixner
2025-12-15 18:24 ` [patch V6 00/11] rseq: Implement time slice extension mechanism Thomas Gleixner
11 siblings, 3 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 16:52 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Wire the grant decision function up in exit_to_user_mode_loop().
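
For illustration only, a user-space sketch of how the deny mask in this
patch appears to partition the exit work bits. The WORK_* bit values and the
function names are made up for the example and are not the kernel's TIF
layout. The preempt_rt argument stands in for CONFIG_PREEMPT_RT, where only
the lazy reschedule bit is eligible for an extension.

```c
#include <stdbool.h>

/* Hypothetical work bits. The real _TIF_* values are architecture specific. */
#define WORK_NEED_RESCHED	(1UL << 0)
#define WORK_NEED_RESCHED_LAZY	(1UL << 1)
#define WORK_SIGPENDING		(1UL << 2)
#define WORK_UPROBE		(1UL << 3)
#define WORK_ALL		(WORK_NEED_RESCHED | WORK_NEED_RESCHED_LAZY | \
				 WORK_SIGPENDING | WORK_UPROBE)

enum exit_action { ACT_NONE, ACT_TRY_GRANT, ACT_SCHEDULE };

/* Mirrors TIF_SLICE_EXT_DENY: all work bits except the eligible resched bits. */
static unsigned long slice_ext_deny_mask(bool preempt_rt)
{
	unsigned long sched = WORK_NEED_RESCHED_LAZY;

	if (!preempt_rt)
		sched |= WORK_NEED_RESCHED;
	return WORK_ALL & ~sched;
}

/* Decide what the reschedule branch of the exit loop would do. */
static enum exit_action exit_work_action(unsigned long ti_work, bool preempt_rt)
{
	if (!(ti_work & (WORK_NEED_RESCHED | WORK_NEED_RESCHED_LAZY)))
		return ACT_NONE;
	/* Any denied bit forces the regular schedule() path. */
	if (ti_work & slice_ext_deny_mask(preempt_rt))
		return ACT_SCHEDULE;
	/* A grant is possible if user space requested one. */
	return ACT_TRY_GRANT;
}
```

With these assumptions a pending signal next to a reschedule request always
ends in schedule(), and on PREEMPT_RT a non-lazy reschedule request does so
as well.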
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
kernel/entry/common.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -17,6 +17,14 @@ void __weak arch_do_signal_or_restart(st
#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK)
#endif
+/* TIF bits, which prevent a time slice extension. */
+#ifdef CONFIG_PREEMPT_RT
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY)
+#else
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
+#endif
+#define TIF_SLICE_EXT_DENY (EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED)
+
static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
@@ -28,8 +36,10 @@ static __always_inline unsigned long __e
local_irq_enable_exit_to_user(ti_work);
- if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
- schedule();
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+ if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+ schedule();
+ }
if (ti_work & _TIF_UPROBE)
uprobe_notify_resume(regs);
^ permalink raw reply [flat|nested] 78+ messages in thread
* [patch V6 11/11] selftests/rseq: Implement time slice extension test
2025-12-15 16:52 [patch V6 00/11] rseq: Implement time slice extension mechanism Thomas Gleixner
` (9 preceding siblings ...)
2025-12-15 16:52 ` [patch V6 10/11] entry: Hook up rseq time slice extension Thomas Gleixner
@ 2025-12-15 16:52 ` Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
2026-01-22 10:15 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2025-12-15 18:24 ` [patch V6 00/11] rseq: Implement time slice extension mechanism Thomas Gleixner
11 siblings, 2 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 16:52 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Provide an initial test case to evaluate the functionality. This needs to be
extended to cover the ABI violations and expose the race condition between
observing granted and arriving in rseq_slice_yield().
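
As a reading aid, the accounting in the test loop can be condensed into a
small classification helper. This is a sketch of how the counters appear to
be derived from the two flags after the critical section, not code from the
test. The enum and function names are hypothetical.

```c
#include <stdbool.h>

enum slice_outcome {
	OUTCOME_UNDISTURBED,	/* Request still set, nothing happened */
	OUTCOME_GRANTED,	/* Grant observed, yield or syscall follows */
	OUTCOME_SCHEDULED,	/* Both flags clear, the task was scheduled */
};

/*
 * Classify one iteration from the state observed after the critical
 * section. A late grant wins over a request that was seen as still set,
 * which mirrors the success-- correction in the test.
 */
static enum slice_outcome classify_outcome(bool request_still_set, bool granted)
{
	if (granted)
		return OUTCOME_GRANTED;
	return request_still_set ? OUTCOME_UNDISTURBED : OUTCOME_SCHEDULED;
}
```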
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5: Add a test for a random syscall
---
tools/testing/selftests/rseq/.gitignore | 1
tools/testing/selftests/rseq/Makefile | 5
tools/testing/selftests/rseq/rseq-abi.h | 27 +++
tools/testing/selftests/rseq/slice_test.c | 219 ++++++++++++++++++++++++++++++
4 files changed, 251 insertions(+), 1 deletion(-)
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -10,3 +10,4 @@ param_test_mm_cid
param_test_mm_cid_benchmark
param_test_mm_cid_compare_twice
syscall_errors_test
+slice_test
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
param_test_benchmark param_test_compare_twice param_test_mm_cid \
param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
- syscall_errors_test
+ syscall_errors_test slice_test
TEST_GEN_PROGS_EXTENDED = librseq.so
@@ -59,3 +59,6 @@ include ../lib.mk
$(OUTPUT)/syscall_errors_test: syscall_errors_test.c $(TEST_GEN_PROGS_EXTENDED) \
rseq.h rseq-*.h
$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
+ $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -53,6 +53,27 @@ struct rseq_abi_cs {
__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));
+/**
+ * struct rseq_abi_slice_ctrl - Time slice extension control structure
+ * @all: Compound value
+ * @request: Request for a time slice extension
+ * @granted: Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_abi_slice_ctrl {
+ union {
+ __u32 all;
+ struct {
+ __u8 request;
+ __u8 granted;
+ __u16 __reserved;
+ };
+ };
+};
+
/*
* struct rseq_abi is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
@@ -165,6 +186,12 @@ struct rseq_abi {
__u32 mm_cid;
/*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_abi_slice_ctrl slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
--- /dev/null
+++ b/tools/testing/selftests/rseq/slice_test.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include <linux/prctl.h>
+#include <sys/prctl.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+#include "../kselftest_harness.h"
+
+#ifndef __NR_rseq_slice_yield
+# define __NR_rseq_slice_yield 471
+#endif
+
+#define BITS_PER_INT 32
+#define BITS_PER_BYTE 8
+
+#ifndef PR_RSEQ_SLICE_EXTENSION
+# define PR_RSEQ_SLICE_EXTENSION 79
+# define PR_RSEQ_SLICE_EXTENSION_GET 1
+# define PR_RSEQ_SLICE_EXTENSION_SET 2
+# define PR_RSEQ_SLICE_EXT_ENABLE 0x01
+#endif
+
+#ifndef RSEQ_SLICE_EXT_REQUEST_BIT
+# define RSEQ_SLICE_EXT_REQUEST_BIT 0
+# define RSEQ_SLICE_EXT_GRANTED_BIT 1
+#endif
+
+#ifndef asm_inline
+# define asm_inline asm __inline
+#endif
+
+#define NSEC_PER_SEC 1000000000L
+#define NSEC_PER_USEC 1000L
+
+struct noise_params {
+ int64_t noise_nsecs;
+ int64_t sleep_nsecs;
+ int64_t run;
+};
+
+FIXTURE(slice_ext)
+{
+ pthread_t noise_thread;
+ struct noise_params noise_params;
+};
+
+FIXTURE_VARIANT(slice_ext)
+{
+ int64_t total_nsecs;
+ int64_t slice_nsecs;
+ int64_t noise_nsecs;
+ int64_t sleep_nsecs;
+ bool no_yield;
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n2_2_50)
+{
+ .total_nsecs = 5 * NSEC_PER_SEC,
+ .slice_nsecs = 2 * NSEC_PER_USEC,
+ .noise_nsecs = 2 * NSEC_PER_USEC,
+ .sleep_nsecs = 50 * NSEC_PER_USEC,
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n50_2_50)
+{
+ .total_nsecs = 5 * NSEC_PER_SEC,
+ .slice_nsecs = 50 * NSEC_PER_USEC,
+ .noise_nsecs = 2 * NSEC_PER_USEC,
+ .sleep_nsecs = 50 * NSEC_PER_USEC,
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n2_2_50_no_yield)
+{
+ .total_nsecs = 5 * NSEC_PER_SEC,
+ .slice_nsecs = 2 * NSEC_PER_USEC,
+ .noise_nsecs = 2 * NSEC_PER_USEC,
+ .sleep_nsecs = 50 * NSEC_PER_USEC,
+ .no_yield = true,
+};
+
+
+static inline bool elapsed(struct timespec *start, struct timespec *now,
+ int64_t span)
+{
+ int64_t delta = now->tv_sec - start->tv_sec;
+
+ delta *= NSEC_PER_SEC;
+ delta += now->tv_nsec - start->tv_nsec;
+ return delta >= span;
+}
+
+static void *noise_thread(void *arg)
+{
+ struct noise_params *p = arg;
+
+ while (RSEQ_READ_ONCE(p->run)) {
+ struct timespec ts_start, ts_now;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+ do {
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_start, &ts_now, p->noise_nsecs));
+
+ ts_start.tv_sec = 0;
+ ts_start.tv_nsec = p->sleep_nsecs;
+ clock_nanosleep(CLOCK_MONOTONIC, 0, &ts_start, NULL);
+ }
+ return NULL;
+}
+
+FIXTURE_SETUP(slice_ext)
+{
+ cpu_set_t affinity;
+
+ ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0);
+
+ /* Pin it on a single CPU. Avoid CPU 0 */
+ for (int i = 1; i < CPU_SETSIZE; i++) {
+ if (!CPU_ISSET(i, &affinity))
+ continue;
+
+ CPU_ZERO(&affinity);
+ CPU_SET(i, &affinity);
+ ASSERT_EQ(sched_setaffinity(0, sizeof(affinity), &affinity), 0);
+ break;
+ }
+
+ ASSERT_EQ(rseq_register_current_thread(), 0);
+
+ ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0);
+
+ self->noise_params.noise_nsecs = variant->noise_nsecs;
+ self->noise_params.sleep_nsecs = variant->sleep_nsecs;
+ self->noise_params.run = 1;
+
+ ASSERT_EQ(pthread_create(&self->noise_thread, NULL, noise_thread, &self->noise_params), 0);
+}
+
+FIXTURE_TEARDOWN(slice_ext)
+{
+ self->noise_params.run = 0;
+ pthread_join(self->noise_thread, NULL);
+}
+
+TEST_F(slice_ext, slice_test)
+{
+ unsigned long success = 0, yielded = 0, scheduled = 0, raced = 0;
+ unsigned long total = 0, aborted = 0;
+ struct rseq_abi *rs = rseq_get_abi();
+ struct timespec ts_start, ts_now;
+
+ ASSERT_NE(rs, NULL);
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+ do {
+ struct timespec ts_cs;
+ bool req = false;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_cs);
+
+ total++;
+ RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 1);
+ do {
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_cs, &ts_now, variant->slice_nsecs));
+
+ /*
+ * request can be cleared unconditionally, but for making
+ * the stats work this is actually checking it first
+ */
+ if (RSEQ_READ_ONCE(rs->slice_ctrl.request)) {
+ RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 0);
+ /* Race between check and clear! */
+ req = true;
+ success++;
+ }
+
+ if (RSEQ_READ_ONCE(rs->slice_ctrl.granted)) {
+ /* The above raced against a late grant */
+ if (req)
+ success--;
+ if (variant->no_yield) {
+ syscall(__NR_getpid);
+ aborted++;
+ } else {
+ yielded++;
+ if (!syscall(__NR_rseq_slice_yield))
+ raced++;
+ }
+ } else {
+ if (!req)
+ scheduled++;
+ }
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_start, &ts_now, variant->total_nsecs));
+
+ printf("# Total %12lu\n", total);
+ printf("# Success %12lu\n", success);
+ printf("# Yielded %12lu\n", yielded);
+ printf("# Aborted %12lu\n", aborted);
+ printf("# Scheduled %12lu\n", scheduled);
+ printf("# Raced %12lu\n", raced);
+}
+
+TEST_HARNESS_MAIN
^ permalink raw reply [flat|nested] 78+ messages in thread
* [patch V6 01/11] rseq: Add fields and constants for time slice extension
2025-12-15 16:52 ` [patch V6 01/11] rseq: Add fields and constants for time slice extension Thomas Gleixner
@ 2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 14:36 ` Mathieu Desnoyers
` (2 subsequent siblings)
3 siblings, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 18:24 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Aside from a Kconfig knob, add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- An rseq state struct to hold the kernel state of this mechanism
- Documentation of the new mechanism
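
To make the request/grant protocol concrete, here is a minimal user-space
sketch of the documented code flow. It is an illustration under
assumptions: struct slice_ctrl_model is a local stand-in that mirrors the
layout of struct rseq_slice_ctrl, the kernel is simulated by the
section_with_grant() callback, and model_yield() stands in for
rseq_slice_yield(2). None of these names exist in the patch.

```c
#include <stdint.h>

/* Local stand-in mirroring the layout of struct rseq_slice_ctrl. */
struct slice_ctrl_model {
	union {
		uint32_t all;
		struct {
			uint8_t  request;
			uint8_t  granted;
			uint16_t reserved;
		};
	};
};

/* Compiler barrier, as in the documented flow (GCC/Clang syntax). */
#define barrier() __asm__ __volatile__("" ::: "memory")

typedef void (*section_fn)(struct slice_ctrl_model *ctrl);
typedef void (*yield_fn)(struct slice_ctrl_model *ctrl);

static unsigned int yield_calls;

/* Simulated rseq_slice_yield(): the kernel clears the grant. */
static void model_yield(struct slice_ctrl_model *ctrl)
{
	ctrl->granted = 0;
	yield_calls++;
}

/* Critical section during which the simulated kernel grants an extension. */
static void section_with_grant(struct slice_ctrl_model *ctrl)
{
	ctrl->request = 0;
	ctrl->granted = 1;
}

/* Critical section which is not interrupted at all. */
static void section_undisturbed(struct slice_ctrl_model *ctrl)
{
	(void)ctrl;
}

/* Returns 1 if a grant was observed and the CPU was yielded, 0 otherwise. */
static int run_with_slice_request(struct slice_ctrl_model *ctrl,
				  section_fn critical_section, yield_fn slice_yield)
{
	ctrl->request = 1;
	barrier();		/* Prevent compiler reordering */
	critical_section(ctrl);
	barrier();		/* Prevent compiler reordering */
	ctrl->request = 0;
	if (ctrl->granted) {
		slice_yield(ctrl);
		return 1;
	}
	return 0;
}
```

The request is cleared unconditionally and only the granted flag decides
whether the yield is issued, which is the reason why no atomic operations
are needed for this CPU local protocol.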
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
V6: Fix typos - Bigeasy
V5: Document behaviour of arbitrary syscalls
V4: Make the example correct - Prakash
V3: Fix more typos and expressions - Randy
V2: Fix Kconfig indentation, fix typos and expressions - Randy
Make the control fields a struct and remove the atomicity requirement - Mathieu
---
Documentation/userspace-api/index.rst | 1
Documentation/userspace-api/rseq.rst | 135 ++++++++++++++++++++++++++++++++++
include/linux/rseq_types.h | 28 ++++++-
include/uapi/linux/rseq.h | 38 +++++++++
init/Kconfig | 12 +++
kernel/rseq.c | 7 +
6 files changed, 220 insertions(+), 1 deletion(-)
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
ebpf/index
ioctl/index
mseal
+ rseq
Security-related interfaces
===========================
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,135 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per-thread userspace memory area
+to be used as an ABI between kernel and userspace for three purposes:
+
+ * userspace restartable sequences
+
+ * quick access to the current CPU number and node ID from userspace
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow userspace to perform update operations on
+per-cpu data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows implementing per-CPU data efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+ * Enabled in Kconfig
+
+ * Enabled at boot time (default is enabled)
+
+ * A rseq userspace pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success. Otherwise it fails with one of the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg4 and arg5 must be zero
+ENOTSUPP Functionality was disabled on the kernel command line
+ENXIO Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can also be queried via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status are also exposed in the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and only for informational purposes.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
+interrupted and the interrupt results in a reschedule request in the
+kernel, then the kernel can grant a time slice extension and return to
+userspace instead of scheduling out. The length of the extension is
+determined by the ``rseq_slice_extension_nsec`` sysctl.
+
+The kernel indicates the grant by clearing rseq::slice_ctrl::request and
+setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
+thread after granting the extension, the kernel clears the granted bit to
+indicate that to userspace.
+
+If the request bit is still set when leaving the critical section,
+userspace can clear it and continue.
+
+If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
+leaving the critical section to relinquish the CPU. The kernel enforces
+this by arming a timer to prevent misbehaving userspace from abusing this
+mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by userspace.
+
+The required code flow is as follows::
+
+ rseq->slice_ctrl.request = 1;
+ barrier(); // Prevent compiler reordering
+ critical_section();
+ barrier(); // Prevent compiler reordering
+ rseq->slice_ctrl.request = 0;
+ if (rseq->slice_ctrl.granted)
+ rseq_slice_yield();
+
+As all of this is strictly CPU local, there are no atomicity requirements.
+Checking the granted state is racy, but that cannot be avoided at all::
+
+ if (rseq->slice_ctrl.granted)
+ -> Interrupt results in schedule and grant revocation
+ rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+If the thread issues a syscall other than rseq_slice_yield(2) within the
+granted timeslice extension, the grant is also revoked and the CPU is
+relinquished immediately when entering the kernel. This is required as
+syscalls might consume arbitrary CPU time until they reach a scheduling
+point when the preemption model is either NONE or VOLUNTARY and therefore
+might exceed the grant by far.
+
+The preferred solution for user space is to use rseq_slice_yield(2) which
+is side effect free. The support for arbitrary syscalls is required to
+support applications with a layered (onion) architecture, where the code
+handling the critical section and requesting the time slice extension has
+no control over the code within the critical section.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -73,12 +73,35 @@ struct rseq_ids {
};
/**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state: Compound to access the overall state
+ * @enabled: Time slice extension is enabled for the task
+ * @granted: Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+ u16 state;
+ struct {
+ u8 enabled;
+ u8 granted;
+ };
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state: Time slice extension state
+ */
+struct rseq_slice {
+ union rseq_slice_state state;
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
- * @sig: Signature of critial section abort IPs
+ * @sig: Signature of critical section abort IPs
* @event: Storage for event management
* @ids: Storage for cached CPU ID and MM CID
+ * @slice: Storage for time slice extension data
*/
struct rseq_data {
struct rseq __user *usrptr;
@@ -86,6 +109,9 @@ struct rseq_data {
u32 sig;
struct rseq_event event;
struct rseq_ids ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+ struct rseq_slice slice;
+#endif
};
#else /* CONFIG_RSEQ */
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -23,9 +23,15 @@ enum rseq_flags {
};
enum rseq_cs_flags_bit {
+ /* Historical and unsupported bits */
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+ /* (3) Intentional gap to put new bits into a separate byte */
+
+ /* User read only feature flags */
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
};
enum rseq_cs_flags {
@@ -35,6 +41,11 @@ enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
};
/*
@@ -53,6 +64,27 @@ struct rseq_cs {
__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));
+/**
+ * struct rseq_slice_ctrl - Time slice extension control structure
+ * @all: Compound value
+ * @request: Request for a time slice extension
+ * @granted: Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_slice_ctrl {
+ union {
+ __u32 all;
+ struct {
+ __u8 request;
+ __u8 granted;
+ __u16 __reserved;
+ };
+ };
+};
+
/*
* struct rseq is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
@@ -142,6 +174,12 @@ struct rseq {
__u32 mm_cid;
/*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_slice_ctrl slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1938,6 +1938,18 @@ config RSEQ
If unsure, say Y.
+config RSEQ_SLICE_EXTENSION
+ bool "Enable rseq-based time slice extension mechanism"
+ depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+ help
+ Allows userspace to request a limited time slice extension when
+ returning from an interrupt to user space via the RSEQ shared
+ data ABI. If granted, the task can complete its critical section,
+ so that other threads are not stuck on a contended resource
+ while the task is scheduled out.
+
+ If unsure, say N.
+
config RSEQ_STATS
default n
bool "Enable lightweight statistics of restartable sequences" if EXPERT
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -389,6 +389,8 @@ static bool rseq_reset_ids(void)
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
+ u32 rseqfl = 0;
+
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -440,6 +442,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
if (!access_ok(rseq, rseq_len))
return -EFAULT;
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+ rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
scoped_user_write_access(rseq, efault) {
/*
* If the rseq_cs pointer is non-NULL on registration, clear it to
@@ -449,11 +454,13 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
* clearing the fields. Don't bother reading it, just reset it.
*/
unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+ unsafe_put_user(rseqfl, &rseq->flags, efault);
/* Initialize IDs in user space */
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
unsafe_put_user(0U, &rseq->node_id, efault);
unsafe_put_user(0U, &rseq->mm_cid, efault);
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
}
/*
^ permalink raw reply [flat|nested] 78+ messages in thread
* [patch V6 02/11] rseq: Provide static branch for time slice extensions
2025-12-15 16:52 ` [patch V6 02/11] rseq: Provide static branch for time slice extensions Thomas Gleixner
@ 2025-12-15 18:24 ` Thomas Gleixner
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
1 sibling, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 18:24 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Guard the time slice extension functionality with a static key, which can
be disabled on the kernel command line.
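
A rough user-space model of the command line handling, assuming the
semantics described above: the feature defaults to enabled, a valid
"false" value disables it, and an invalid value is reported as unhandled.
parse_bool() is a simplified stand-in for the kernel's kstrtobool() and
slice_ext_enabled models the static key. Both names are hypothetical.

```c
#include <stdbool.h>

/* Models the static key: enabled by default. */
static bool slice_ext_enabled = true;

/* Simplified stand-in for kstrtobool(): 0 on success, -1 on invalid input. */
static int parse_bool(const char *s, bool *res)
{
	if (!s)
		return -1;

	switch (s[0]) {
	case 'y': case 'Y': case 't': case 'T': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case 'f': case 'F': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		/* "on" and "off" are distinguished by the second character. */
		if (s[1] == 'n' || s[1] == 'N') {
			*res = true;
			return 0;
		}
		if (s[1] == 'f' || s[1] == 'F') {
			*res = false;
			return 0;
		}
		return -1;
	default:
		return -1;
	}
}

/* Returns 1 when the parameter was handled, 0 when the value was invalid. */
static int model_slice_cmdline(const char *str)
{
	bool on;

	if (parse_bool(str, &on))
		return 0;

	if (!on)
		slice_ext_enabled = false;
	return 1;
}
```

An invalid value leaves the default untouched, so a typo on the command
line does not silently disable the mechanism in this model.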
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V4: Return 0 on error in __setup() - Randy
V3: Document command line parameter - Sebastian
V2: Return 1 from __setup() - Prateek
---
Documentation/admin-guide/kernel-parameters.txt | 5 +++++
include/linux/rseq_entry.h | 11 +++++++++++
kernel/rseq.c | 17 +++++++++++++++++
3 files changed, 33 insertions(+)
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6482,6 +6482,11 @@
rootflags= [KNL] Set root filesystem mount option string
+ rseq_slice_ext= [KNL] RSEQ based time slice extension
+ Format: boolean
+ Control enablement of RSEQ based time slice extension.
+ Default is 'on'.
+
initramfs_options= [KNL]
Specify mount options for for the initramfs mount.
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -75,6 +75,17 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
#define rseq_inline __always_inline
#endif
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static __always_inline bool rseq_slice_extension_enabled(void)
+{
+ return static_branch_likely(&rseq_slice_extension_key);
+}
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline bool rseq_slice_extension_enabled(void) { return false; }
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
bool rseq_debug_validate_ids(struct task_struct *t);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -483,3 +483,20 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
efault:
return -EFAULT;
}
+
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static int __init rseq_slice_cmdline(char *str)
+{
+ bool on;
+
+ if (kstrtobool(str, &on))
+ return 0;
+
+ if (!on)
+ static_branch_disable(&rseq_slice_extension_key);
+ return 1;
+}
+__setup("rseq_slice_ext=", rseq_slice_cmdline);
+#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
^ permalink raw reply [flat|nested] 78+ messages in thread
* [patch V6 03/11] rseq: Add statistics for time slice extensions
2025-12-15 16:52 ` [patch V6 03/11] rseq: Add statistics " Thomas Gleixner
@ 2025-12-15 18:24 ` Thomas Gleixner
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
1 sibling, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 18:24 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Extend the quick statistics with time slice specific fields.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5: Add s_aborted to account for arbitrary syscalls
---
include/linux/rseq_entry.h | 5 +++++
kernel/rseq.c | 14 ++++++++++++++
2 files changed, 19 insertions(+)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -15,6 +15,11 @@ struct rseq_stats {
unsigned long cs;
unsigned long clear;
unsigned long fixup;
+ unsigned long s_granted;
+ unsigned long s_expired;
+ unsigned long s_revoked;
+ unsigned long s_yielded;
+ unsigned long s_aborted;
};
DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -138,6 +138,13 @@ static int rseq_stats_show(struct seq_fi
stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu));
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+ stats.s_granted += data_race(per_cpu(rseq_stats.s_granted, cpu));
+ stats.s_expired += data_race(per_cpu(rseq_stats.s_expired, cpu));
+ stats.s_revoked += data_race(per_cpu(rseq_stats.s_revoked, cpu));
+ stats.s_yielded += data_race(per_cpu(rseq_stats.s_yielded, cpu));
+ stats.s_aborted += data_race(per_cpu(rseq_stats.s_aborted, cpu));
+ }
}
seq_printf(m, "exit: %16lu\n", stats.exit);
@@ -148,6 +155,13 @@ static int rseq_stats_show(struct seq_fi
seq_printf(m, "cs: %16lu\n", stats.cs);
seq_printf(m, "clear: %16lu\n", stats.clear);
seq_printf(m, "fixup: %16lu\n", stats.fixup);
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+ seq_printf(m, "sgrant: %16lu\n", stats.s_granted);
+ seq_printf(m, "sexpir: %16lu\n", stats.s_expired);
+ seq_printf(m, "srevok: %16lu\n", stats.s_revoked);
+ seq_printf(m, "syield: %16lu\n", stats.s_yielded);
+ seq_printf(m, "sabort: %16lu\n", stats.s_aborted);
+ }
return 0;
}
* [patch V6 04/11] rseq: Add prctl() to enable time slice extensions
2025-12-15 16:52 ` [patch V6 04/11] rseq: Add prctl() to enable " Thomas Gleixner
@ 2025-12-15 18:24 ` Thomas Gleixner
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
1 sibling, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 18:24 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Implement a prctl() so that tasks can enable the time slice extension
mechanism. The prctl() fails when time slice extensions are disabled at
compile time or on the kernel command line, or when no rseq pointer is
registered in the kernel.
That allows implementing a single trivial check in the exit to user mode
hotpath to decide whether the whole mechanism needs to be invoked.
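For illustration, a minimal user space sketch of the enable path, assuming
the constants from the include/uapi/linux/prctl.h hunk of this patch. The
helper names slice_ext_enable() and slice_ext_enabled() are made up for this
example, and the fallback defines are only needed as long as the installed
headers lack the new constants.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallback definitions, values taken from the uapi hunk of this patch */
#ifndef PR_RSEQ_SLICE_EXTENSION
# define PR_RSEQ_SLICE_EXTENSION	79
# define PR_RSEQ_SLICE_EXTENSION_GET	1
# define PR_RSEQ_SLICE_EXTENSION_SET	2
# define PR_RSEQ_SLICE_EXT_ENABLE	0x01
#endif

/*
 * Hypothetical helper: enable time slice extensions for the calling task.
 * Returns 0 on success or a negative errno value, e.g. -EINVAL on kernels
 * which do not know the prctl() and -ENXIO when no rseq area is registered.
 */
int slice_ext_enable(void)
{
	if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
		  (unsigned long)PR_RSEQ_SLICE_EXT_ENABLE, 0UL, 0UL))
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: query the current state.
 * Returns 1 if enabled, 0 if disabled, or a negative errno value.
 */
int slice_ext_enabled(void)
{
	int ret = prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET,
			0UL, 0UL, 0UL);

	if (ret < 0)
		return -errno;
	return !!(ret & PR_RSEQ_SLICE_EXT_ENABLE);
}
```

arg4 and arg5 have to be zero because the kernel/sys.c hunk rejects anything
else with -EINVAL, and GET additionally requires arg3 to be zero.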
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V3: Use -ENOTSUPP for the stub inline - Sebastian
---
include/linux/rseq.h | 9 +++++++
include/uapi/linux/prctl.h | 10 ++++++++
kernel/rseq.c | 52 +++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 6 +++++
4 files changed, 77 insertions(+)
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -163,4 +163,13 @@ void rseq_syscall(struct pt_regs *regs);
static inline void rseq_syscall(struct pt_regs *regs) { }
#endif /* !CONFIG_DEBUG_RSEQ */
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+ return -ENOTSUPP;
+}
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
#endif /* _LINUX_RSEQ_H */
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -386,4 +386,14 @@ struct prctl_mm_map {
# define PR_FUTEX_HASH_SET_SLOTS 1
# define PR_FUTEX_HASH_GET_SLOTS 2
+/* RSEQ time slice extensions */
+#define PR_RSEQ_SLICE_EXTENSION 79
+# define PR_RSEQ_SLICE_EXTENSION_GET 1
+# define PR_RSEQ_SLICE_EXTENSION_SET 2
+/*
+ * Bits for RSEQ_SLICE_EXTENSION_GET/SET
+ * PR_RSEQ_SLICE_EXT_ENABLE: Enable
+ */
+# define PR_RSEQ_SLICE_EXT_ENABLE 0x01
+
#endif /* _LINUX_PRCTL_H */
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,7 @@
#define RSEQ_BUILD_SLOW_PATH
#include <linux/debugfs.h>
+#include <linux/prctl.h>
#include <linux/ratelimit.h>
#include <linux/rseq_entry.h>
#include <linux/sched.h>
@@ -501,6 +502,57 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+ switch (arg2) {
+ case PR_RSEQ_SLICE_EXTENSION_GET:
+ if (arg3)
+ return -EINVAL;
+ return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
+
+ case PR_RSEQ_SLICE_EXTENSION_SET: {
+ u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+ bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
+
+ if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
+ return -EINVAL;
+ if (!rseq_slice_extension_enabled())
+ return -ENOTSUPP;
+ if (!current->rseq.usrptr)
+ return -ENXIO;
+
+ /* No change? */
+ if (enable == !!current->rseq.slice.state.enabled)
+ return 0;
+
+ if (get_user(rflags, &current->rseq.usrptr->flags))
+ goto die;
+
+ if (current->rseq.slice.state.enabled)
+ valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+ if ((rflags & valid) != valid)
+ goto die;
+
+ rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+ rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+ if (enable)
+ rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+ if (put_user(rflags, &current->rseq.usrptr->flags))
+ goto die;
+
+ current->rseq.slice.state.enabled = enable;
+ return 0;
+ }
+ default:
+ return -EINVAL;
+ }
+die:
+ force_sig(SIGSEGV);
+ return -EFAULT;
+}
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -53,6 +53,7 @@
#include <linux/time_namespace.h>
#include <linux/binfmts.h>
#include <linux/futex.h>
+#include <linux/rseq.h>
#include <linux/sched.h>
#include <linux/sched/autogroup.h>
@@ -2868,6 +2869,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
case PR_FUTEX_HASH:
error = futex_hash_prctl(arg2, arg3, arg4);
break;
+ case PR_RSEQ_SLICE_EXTENSION:
+ if (arg4 || arg5)
+ return -EINVAL;
+ error = rseq_slice_extension_prctl(arg2, arg3);
+ break;
default:
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
* [patch V6 05/11] rseq: Implement sys_rseq_slice_yield()
2025-12-15 16:52 ` [patch V6 05/11] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
@ 2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 14:59 ` Mathieu Desnoyers
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 18:24 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Provide a new syscall whose only purpose is to yield the CPU after the
kernel granted a time slice extension.
sched_yield() is not suitable for that because it unconditionally
schedules, but the end of the time slice extension is not required to
schedule when the task was already preempted. A dedicated syscall also
allows a strict check for termination, which catches user space invoking
random syscalls, including sched_yield(), from a time slice extension
region.
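A minimal sketch of how user space could invoke the new syscall before libc
grows a wrapper. The wrapper name slice_yield() is made up for this example,
and the fallback numbers are the ones from the syscall tables in this patch.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Fallback for headers which predate this patch. Most syscall tables in
 * this patch use 471, the alpha table uses 581.
 */
#ifndef __NR_rseq_slice_yield
# ifdef __alpha__
#  define __NR_rseq_slice_yield	581
# else
#  define __NR_rseq_slice_yield	471
# endif
#endif

/*
 * Hypothetical wrapper.
 *
 * Returns 1 if the task yielded the CPU within a granted slice, 0 if no
 * grant was active anymore, or a negative errno value (e.g. -ENOSYS on
 * kernels without the syscall).
 */
long slice_yield(void)
{
	long ret = syscall(__NR_rseq_slice_yield);

	return ret < 0 ? -errno : ret;
}
```

The syscall takes no arguments and has no side effects beyond the yield, so
calling it without an active grant is harmless and simply returns 0.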
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arch@vger.kernel.org
---
V6: Switch to syscall NR 471
V5: Rework to adjust to support for arbitrary syscall changes
Use n32/n64/o32 for MIPS - Arnd
V2: Use the proper name in sys_ni.c and add comment - Prateek
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/tools/syscall_32.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/rseq_types.h | 2 ++
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 ++++-
kernel/rseq.c | 21 +++++++++++++++++++++
kernel/sys_ni.c | 1 +
scripts/syscall.tbl | 1 +
22 files changed, 46 insertions(+), 1 deletion(-)
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -510,3 +510,4 @@
578 common file_getattr sys_file_getattr
579 common file_setattr sys_file_setattr
580 common listns sys_listns
+581 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -485,3 +485,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -482,3 +482,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -470,3 +470,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -476,3 +476,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -409,3 +409,4 @@
468 n32 file_getattr sys_file_getattr
469 n32 file_setattr sys_file_setattr
470 n32 listns sys_listns
+471 n32 rseq_slice_yield sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -385,3 +385,4 @@
468 n64 file_getattr sys_file_getattr
469 n64 file_setattr sys_file_setattr
470 n64 listns sys_listns
+471 n64 rseq_slice_yield sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -458,3 +458,4 @@
468 o32 file_getattr sys_file_getattr
469 o32 file_setattr sys_file_setattr
470 o32 listns sys_listns
+471 o32 rseq_slice_yield sys_rseq_slice_yield
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -469,3 +469,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -561,3 +561,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 nospu rseq_slice_yield sys_rseq_slice_yield
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -397,3 +397,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -474,3 +474,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -516,3 +516,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -476,3 +476,4 @@
468 i386 file_getattr sys_file_getattr
469 i386 file_setattr sys_file_setattr
470 i386 listns sys_listns
+471 i386 rseq_slice_yield sys_rseq_slice_yield
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -395,6 +395,7 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
#
# Due to a historical design error, certain syscalls are numbered differently
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -89,9 +89,11 @@ union rseq_slice_state {
/**
* struct rseq_slice - Status information for rseq time slice extension
* @state: Time slice extension state
+ * @yielded: Indicator for rseq_slice_yield()
*/
struct rseq_slice {
union rseq_slice_state state;
+ u8 yielded;
};
/**
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -961,6 +961,7 @@ asmlinkage long sys_statx(int dfd, const
unsigned mask, struct statx __user *buffer);
asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
int flags, uint32_t sig);
+asmlinkage long sys_rseq_slice_yield(void);
asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
asmlinkage long sys_open_tree_attr(int dfd, const char __user *path,
unsigned flags,
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -860,8 +860,11 @@
#define __NR_listns 470
__SYSCALL(__NR_listns, sys_listns)
+#define __NR_rseq_slice_yield 471
+__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+
#undef __NR_syscalls
-#define __NR_syscalls 471
+#define __NR_syscalls 472
/*
* 32 bit systems traditionally used different
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -553,6 +553,27 @@ int rseq_slice_extension_prctl(unsigned
return -EFAULT;
}
+/**
+ * sys_rseq_slice_yield - yield the current processor side effect free if a
+ * task granted with a time slice extension is done with
+ * the critical work before being forced out.
+ *
+ * Return: 1 if the task successfully yielded the CPU within the granted slice.
+ * 0 if the slice extension was either never granted or was revoked by
+ * going over the granted extension, using a syscall other than this one
+ * or being scheduled out earlier due to a subsequent interrupt.
+ *
+ * The syscall does not schedule because the syscall entry work immediately
+ * relinquishes the CPU and schedules if required.
+ */
+SYSCALL_DEFINE0(rseq_slice_yield)
+{
+ int yielded = !!current->rseq.slice.yielded;
+
+ current->rseq.slice.yielded = 0;
+ return yielded;
+}
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -390,6 +390,7 @@ COND_SYSCALL(setuid16);
/* restartable sequence */
COND_SYSCALL(rseq);
+COND_SYSCALL(rseq_slice_yield);
COND_SYSCALL(uretprobe);
COND_SYSCALL(uprobe);
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -411,3 +411,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
* [patch V6 06/11] rseq: Implement syscall entry work for time slice extensions
2025-12-15 16:52 ` [patch V6 06/11] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
@ 2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 15:05 ` Mathieu Desnoyers
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 18:24 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
The kernel sets SYSCALL_WORK_SYSCALL_RSEQ_SLICE when it grants a time slice
extension. This allows handling the rseq_slice_yield() syscall, which is
used by user space to relinquish the CPU after finishing the critical
section for which it requested an extension.
If the kernel state is still GRANTED, the kernel resets both kernel
and user space state with a set of sanity checks. If the kernel state is
already cleared, then this raced against the timer or some other interrupt
and the code just clears the work bit.
Doing it in syscall entry work allows catching misbehaving user space,
which issues an arbitrary syscall, i.e. not rseq_slice_yield(), from the
critical section. Contrary to the initial strict requirement to use
rseq_slice_yield(), arbitrary syscalls are no longer considered a violation
of the ABI contract. This allows onion architecture applications, which
cannot control the code inside a critical section, to utilize the
mechanism as well.
If the code detects inconsistent user space state, that results in a
SIGSEGV for the application.
If the grant was still active and the task was not preempted yet, the work
code reschedules immediately before continuing through the syscall.
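The decision logic described above can be sketched as a small stand-alone
model. This is not the kernel code: it leaves out the timer, the debug
validation and the user space accesses, and all names in it are made up for
illustration.

```c
#include <stdbool.h>

/* Illustrative only: the number this patch assigns in most syscall tables */
#define MODEL_NR_RSEQ_SLICE_YIELD	471

enum slice_outcome {
	SLICE_NOT_GRANTED,	/* Grant already revoked, nothing to do */
	SLICE_ALREADY_SWITCHED,	/* Task was scheduled out before the syscall */
	SLICE_YIELDED,		/* Ended via rseq_slice_yield() */
	SLICE_ABORTED,		/* Ended via any other syscall */
};

/* Hypothetical stand-in for the per task state used by the entry work */
struct slice_model {
	bool granted;
	bool sched_switch;
	bool need_resched;
};

enum slice_outcome model_syscall_enter_work(struct slice_model *s, long syscall)
{
	enum slice_outcome ret = SLICE_ALREADY_SWITCHED;

	/* Raced with the timer or another interrupt which revoked the grant */
	if (!s->granted)
		return SLICE_NOT_GRANTED;

	/* Only request a reschedule if the task was not preempted already */
	if (!s->sched_switch) {
		s->need_resched = true;
		ret = syscall == MODEL_NR_RSEQ_SLICE_YIELD ?
		      SLICE_YIELDED : SLICE_ABORTED;
	}

	/* Any syscall terminates the grant */
	s->granted = false;
	return ret;
}
```

The two accounting outcomes correspond to the s_yielded and s_aborted
counters. In both cases the grant is cleared and a reschedule is requested.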
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V5: Allow arbitrary syscalls
V3: Use get/put_user()
---
include/linux/entry-common.h | 2
include/linux/rseq.h | 2
include/linux/thread_info.h | 16 ++++---
kernel/entry/syscall-common.c | 11 ++++-
kernel/rseq.c | 91 ++++++++++++++++++++++++++++++++++++++++++
5 files changed, 112 insertions(+), 10 deletions(-)
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -36,8 +36,8 @@
SYSCALL_WORK_SYSCALL_EMU | \
SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
+ SYSCALL_WORK_SYSCALL_RSEQ_SLICE | \
ARCH_SYSCALL_WORK_ENTER)
-
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -164,8 +164,10 @@ static inline void rseq_syscall(struct p
#endif /* !CONFIG_DEBUG_RSEQ */
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+void rseq_syscall_enter_work(long syscall);
int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline void rseq_syscall_enter_work(long syscall) { }
static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
{
return -ENOTSUPP;
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,15 +46,17 @@ enum syscall_work_bit {
SYSCALL_WORK_BIT_SYSCALL_AUDIT,
SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+ SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE,
};
-#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
-#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
-#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
-#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
-#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
-#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
-#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
+#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
+#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
+#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
+#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
+#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
+#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_RSEQ_SLICE BIT(SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE)
#endif
#include <asm/thread_info.h>
--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -17,8 +17,7 @@ static inline void syscall_enter_audit(s
}
}
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
- unsigned long work)
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work)
{
long ret = 0;
@@ -32,6 +31,14 @@ long syscall_trace_enter(struct pt_regs
return -1L;
}
+ /*
+ * User space got a time slice extension granted and relinquishes
+ * the CPU. The work stops the slice timer to avoid an extra round
+ * through hrtimer_interrupt().
+ */
+ if (work & SYSCALL_WORK_SYSCALL_RSEQ_SLICE)
+ rseq_syscall_enter_work(syscall);
+
/* Handle ptrace */
if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
ret = ptrace_report_syscall_entry(regs);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -502,6 +502,97 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+static inline void rseq_slice_set_need_resched(struct task_struct *curr)
+{
+ /*
+ * The interrupt guard is required to prevent inconsistent state in
+ * this case:
+ *
+ * set_tsk_need_resched()
+ * --> Interrupt
+ * wakeup()
+ * set_tsk_need_resched()
+ * set_preempt_need_resched()
+ * schedule_on_return()
+ * clear_tsk_need_resched()
+ * clear_preempt_need_resched()
+ * set_preempt_need_resched() <- Inconsistent state
+ *
+ * This is safe vs. a remote set of TIF_NEED_RESCHED because that
+ * only sets the already set bit and does not create inconsistent
+ * state.
+ */
+ scoped_guard(irq)
+ set_need_resched_current();
+}
+
+static void rseq_slice_validate_ctrl(u32 expected)
+{
+ u32 __user *sctrl = &current->rseq.usrptr->slice_ctrl.all;
+ u32 uval;
+
+ if (get_user(uval, sctrl) || uval != expected)
+ force_sig(SIGSEGV);
+}
+
+/*
+ * Invoked from syscall entry if a time slice extension was granted and the
+ * kernel did not clear it before user space left the critical section.
+ *
+ * While the recommended way to relinquish the CPU side effect free is
+ * rseq_slice_yield(2), any syscall within a granted slice terminates the
+ * grant and immediately reschedules if required. This supports onion layer
+ * applications, where the code requesting the grant cannot control the
+ * code within the critical section.
+ */
+void rseq_syscall_enter_work(long syscall)
+{
+ struct task_struct *curr = current;
+ struct rseq_slice_ctrl ctrl = { .granted = curr->rseq.slice.state.granted };
+
+ clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+
+ if (static_branch_unlikely(&rseq_debug_enabled))
+ rseq_slice_validate_ctrl(ctrl.all);
+
+ /*
+ * The kernel might have raced, revoked the grant and updated
+ * userspace, but kept the SLICE work set.
+ */
+ if (!ctrl.granted)
+ return;
+
+ /*
+ * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
+ * kernels. Leaving the scope will reschedule on preemption models
+ * FULL, LAZY and RT if necessary.
+ */
+ scoped_guard(preempt) {
+ /*
+ * Now that preemption is disabled, quickly check whether
+ * the task was already rescheduled before arriving here.
+ */
+ if (!curr->rseq.event.sched_switch) {
+ rseq_slice_set_need_resched(curr);
+
+ if (syscall == __NR_rseq_slice_yield) {
+ rseq_stat_inc(rseq_stats.s_yielded);
+ /* Update the yielded state for syscall return */
+ curr->rseq.slice.yielded = 1;
+ } else {
+ rseq_stat_inc(rseq_stats.s_aborted);
+ }
+ }
+ }
+ /* Reschedule on NONE/VOLUNTARY preemption models */
+ cond_resched();
+
+ /* Clear the grant in kernel state and user space */
+ curr->rseq.slice.state.granted = false;
+ if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all))
+ force_sig(SIGSEGV);
+}
+
int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
{
switch (arg2) {
* [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2025-12-15 16:52 ` [patch V6 07/11] rseq: Implement time slice extension enforcement timer Thomas Gleixner
@ 2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 7:18 ` Randy Dunlap
` (6 subsequent siblings)
7 siblings, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 18:24 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.
It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:
1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
independently of CONFIG_HIGH_RES_TIMERS
2) HRTICK usage in the scheduler can be runtime disabled or is only used
for certain aspects of scheduling.
3) The function is calling into the scheduler code and that might have
unexpected consequences when this is invoked due to a time slice
enforcement expiry, especially when the task managed to clear the
grant via sched_yield().
It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.
Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.
The timer is armed in rseq_exit_to_user_mode_restart(), right before
actually returning to user mode, when an extension was granted.
It is disarmed when the task relinquishes the CPU. This is expensive as
the timer is probably the first expiring timer on the CPU, which means it
has to reprogram the hardware. But that's less expensive than going through
a full hrtimer interrupt cycle for nothing.
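The arming decision on the way out to user mode can be modeled in a few
lines. This is a simplified sketch and not the kernel implementation: times
are plain nanosecond values instead of hrtimer state, and the names are made
up for illustration. The range constants are the ones documented in the
sysctl hunk of this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Range and default as documented for rseq_slice_extension_nsec */
#define SLICE_EXT_NSEC_MIN	10000U
#define SLICE_EXT_NSEC_DEFAULT	30000U
#define SLICE_EXT_NSEC_MAX	50000U

enum arm_result {
	ARM_NOT_GRANTED,	/* No grant, plain return to user mode */
	ARM_EXPIRED_RESCHED,	/* Grant expired already, reschedule instead */
	ARM_TIMER_STARTED,	/* Enforcement timer armed for the expiry time */
};

/* Hypothetical check mirroring the documented sysctl range */
bool slice_ext_nsecs_in_range(unsigned int nsecs)
{
	return nsecs >= SLICE_EXT_NSEC_MIN && nsecs <= SLICE_EXT_NSEC_MAX;
}

/*
 * Model of the exit to user mode decision: a grant which expired before
 * the task reaches user space must not extend the scheduling latency, so
 * it results in a reschedule instead of an armed timer.
 */
enum arm_result model_arm_slice_timer(bool granted, uint64_t expires_ns,
				      uint64_t now_ns)
{
	if (!granted)
		return ARM_NOT_GRANTED;
	if (expires_ns < now_ns)
		return ARM_EXPIRED_RESCHED;
	return ARM_TIMER_STARTED;
}
```

The common case is the first one, which is why the inline wrapper in
rseq_entry.h checks the granted bit before calling into the out of line
function.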
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V5: Document the slice extension range - PeterZ
V4: Update comment - Steven
V3: Add sysctl documentation, simplify timer cancelation - Sebastian
---
Documentation/admin-guide/sysctl/kernel.rst | 8 +
include/linux/rseq_entry.h | 38 +++++---
include/linux/rseq_types.h | 2
kernel/rseq.c | 132 +++++++++++++++++++++++++++-
4 files changed, 167 insertions(+), 13 deletions(-)
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1228,6 +1228,14 @@ reboot-cmd (SPARC only)
ROM/Flash boot loader. Maybe to tell it what to do after
rebooting. ???
+rseq_slice_extension_nsec
+=========================
+
+A task can request to delay its scheduling if it is in a critical section
+via the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sets the maximum
+allowed extension in nanoseconds before scheduling of the task is enforced.
+Default value is 30000ns (30us). The possible range is 10000ns (10us) to
+50000ns (50us).
sched_energy_aware
==================
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -87,8 +87,24 @@ static __always_inline bool rseq_slice_e
{
return static_branch_likely(&rseq_slice_extension_key);
}
+
+extern unsigned int rseq_slice_ext_nsecs;
+bool __rseq_arm_slice_extension_timer(void);
+
+static __always_inline bool rseq_arm_slice_extension_timer(void)
+{
+ if (!rseq_slice_extension_enabled())
+ return false;
+
+ if (likely(!current->rseq.slice.state.granted))
+ return false;
+
+ return __rseq_arm_slice_extension_timer();
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
+static inline bool rseq_arm_slice_extension_timer(void) { return false; }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -543,17 +559,19 @@ static __always_inline void clear_tif_rs
static __always_inline bool
rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
{
- if (likely(!test_tif_rseq(ti_work)))
- return false;
-
- if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
- current->rseq.event.slowpath = true;
- set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
- return true;
+ if (unlikely(test_tif_rseq(ti_work))) {
+ if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
+ current->rseq.event.slowpath = true;
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ return true;
+ }
+ clear_tif_rseq();
}
-
- clear_tif_rseq();
- return false;
+ /*
+ * Arm the slice extension timer if nothing to do anymore and the
+ * task really goes out to user space.
+ */
+ return rseq_arm_slice_extension_timer();
}
#else /* CONFIG_GENERIC_ENTRY */
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -89,10 +89,12 @@ union rseq_slice_state {
/**
* struct rseq_slice - Status information for rseq time slice extension
* @state: Time slice extension state
+ * @expires: The time when a grant expires
* @yielded: Indicator for rseq_slice_yield()
*/
struct rseq_slice {
union rseq_slice_state state;
+ u64 expires;
u8 yielded;
};
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,8 @@
#define RSEQ_BUILD_SLOW_PATH
#include <linux/debugfs.h>
+#include <linux/hrtimer.h>
+#include <linux/percpu.h>
#include <linux/prctl.h>
#include <linux/ratelimit.h>
#include <linux/rseq_entry.h>
@@ -500,8 +502,91 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
}
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+struct slice_timer {
+ struct hrtimer timer;
+ void *cookie;
+};
+
+unsigned int rseq_slice_ext_nsecs __read_mostly = 30 * NSEC_PER_USEC;
+static DEFINE_PER_CPU(struct slice_timer, slice_timer);
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+/*
+ * When the timer expires and the task is still in user space, the return
+ * from interrupt will revoke the grant and schedule. If the task already
+ * entered the kernel via a syscall and the timer fires before the syscall
+ * work was able to cancel it, then depending on the preemption model this
+ * will either reschedule on return from interrupt or in the syscall work
+ * below.
+ */
+static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
+{
+ struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
+
+ /*
+ * Validate that the task which armed the timer is still on the
+ * CPU. It could have been scheduled out without canceling the
+ * timer.
+ */
+ if (st->cookie == current && current->rseq.slice.state.granted) {
+ rseq_stat_inc(rseq_stats.s_expired);
+ set_need_resched_current();
+ }
+ return HRTIMER_NORESTART;
+}
+
+bool __rseq_arm_slice_extension_timer(void)
+{
+ struct slice_timer *st = this_cpu_ptr(&slice_timer);
+ struct task_struct *curr = current;
+
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * This check prevents a task, which got a time slice extension
+ * granted, from exceeding the maximum scheduling latency when the
+ * grant expired before going out to user space. Don't bother to
+ * clear the grant here, it will be cleaned up automatically before
+ * going out to user space after being scheduled back in.
+ */
+ if ((unlikely(curr->rseq.slice.expires < ktime_get_mono_fast_ns()))) {
+ set_need_resched_current();
+ return true;
+ }
+
+ /*
+ * Store the task pointer as a cookie for comparison in the timer
+ * function. This is safe as the timer is CPU local and cannot be
+ * in the expiry function at this point.
+ */
+ st->cookie = curr;
+ hrtimer_start(&st->timer, curr->rseq.slice.expires, HRTIMER_MODE_ABS_PINNED_HARD);
+ /* Arm the syscall entry work */
+ set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+ return false;
+}
+
+static void rseq_cancel_slice_extension_timer(void)
+{
+ struct slice_timer *st = this_cpu_ptr(&slice_timer);
+
+ /*
+ * st->cookie can be safely read as preemption is disabled and the
+ * timer is CPU local.
+ *
+ * As this is most probably the first expiring timer, the cancel is
+ * expensive as it has to reprogram the hardware, but that's less
+ * expensive than going through a full hrtimer_interrupt() cycle
+ * for nothing.
+ *
+ * hrtimer_try_to_cancel() is sufficient here as the timer is CPU
+ * local and once the hrtimer code disabled interrupts the timer
+ * callback cannot be running.
+ */
+ if (st->cookie == current)
+ hrtimer_try_to_cancel(&st->timer);
+}
+
static inline void rseq_slice_set_need_resched(struct task_struct *curr)
{
/*
@@ -563,11 +648,14 @@ void rseq_syscall_enter_work(long syscal
return;
/*
- * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
- * kernels. Leaving the scope will reschedule on preemption models
- * FULL, LAZY and RT if necessary.
+ * Required to stabilize the per CPU timer pointer and to make
+ * set_tsk_need_resched() correct on PREEMPT[RT] kernels.
+ *
+ * Leaving the scope will reschedule on preemption models FULL,
+ * LAZY and RT if necessary.
*/
scoped_guard(preempt) {
+ rseq_cancel_slice_extension_timer();
/*
* Now that preemption is disabled, quickly check whether
* the task was already rescheduled before arriving here.
@@ -665,6 +753,31 @@ SYSCALL_DEFINE0(rseq_slice_yield)
return yielded;
}
+#ifdef CONFIG_SYSCTL
+static const unsigned int rseq_slice_ext_nsecs_min = 10 * NSEC_PER_USEC;
+static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC;
+
+static const struct ctl_table rseq_slice_ext_sysctl[] = {
+ {
+ .procname = "rseq_slice_extension_nsec",
+ .data = &rseq_slice_ext_nsecs,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = (unsigned int *)&rseq_slice_ext_nsecs_min,
+ .extra2 = (unsigned int *)&rseq_slice_ext_nsecs_max,
+ },
+};
+
+static void rseq_slice_sysctl_init(void)
+{
+ if (rseq_slice_extension_enabled())
+ register_sysctl_init("kernel", rseq_slice_ext_sysctl);
+}
+#else /* CONFIG_SYSCTL */
+static inline void rseq_slice_sysctl_init(void) { }
+#endif /* !CONFIG_SYSCTL */
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
@@ -677,4 +790,17 @@ static int __init rseq_slice_cmdline(cha
return 1;
}
__setup("rseq_slice_ext=", rseq_slice_cmdline);
+
+static int __init rseq_slice_init(void)
+{
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ hrtimer_setup(per_cpu_ptr(&slice_timer.timer, cpu), rseq_slice_expired,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
+ }
+ rseq_slice_sysctl_init();
+ return 0;
+}
+device_initcall(rseq_slice_init);
#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
^ permalink raw reply [flat|nested] 78+ messages in thread
* [patch V6 08/11] rseq: Reset slice extension when scheduled
2025-12-15 16:52 ` [patch V6 08/11] rseq: Reset slice extension when scheduled Thomas Gleixner
@ 2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 15:17 ` Mathieu Desnoyers
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 18:24 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
When a time slice extension was granted in the need_resched() check on exit
to user space, the task can still be scheduled out in one of the other
pending work items. When it gets scheduled back in, and need_resched() is
not set, then the stale grant would be preserved, which is just wrong.
RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the
critical section and ID update mechanisms.
Utilize them and clear the user space slice control member of struct rseq
unconditionally within the existing user access sections. That's just one
more unconditional store in that path.
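As an illustration of why a single store is sufficient, here is a minimal user space sketch. It assumes the control word has the layout of struct rseq_abi_slice_ctrl from the selftest header in this series; the names model_slice_ctrl and model_clear_slice_ctrl() are made up for this example and do not exist in the kernel.

```c
#include <stdint.h>

/* Assumed layout, mirroring struct rseq_abi_slice_ctrl with stdint types */
struct model_slice_ctrl {
	union {
		uint32_t all;
		struct {
			uint8_t  request;
			uint8_t  granted;
			uint16_t reserved;
		};
	};
};

/* One 32-bit store resets request and granted at the same time */
static void model_clear_slice_ctrl(struct model_slice_ctrl *ctrl)
{
	ctrl->all = 0;
}
```

Because both flags live in the same 32-bit word, the kernel side does not need to read the word first to decide what to clear.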
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
include/linux/rseq_entry.h | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -102,9 +102,17 @@ static __always_inline bool rseq_arm_sli
return __rseq_arm_slice_extension_timer();
}
+static __always_inline void rseq_slice_clear_grant(struct task_struct *t)
+{
+ if (IS_ENABLED(CONFIG_RSEQ_STATS) && t->rseq.slice.state.granted)
+ rseq_stat_inc(rseq_stats.s_revoked);
+ t->rseq.slice.state.granted = false;
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
static inline bool rseq_arm_slice_extension_timer(void) { return false; }
+static inline void rseq_slice_clear_grant(struct task_struct *t) { }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -391,8 +399,15 @@ bool rseq_set_ids_get_csaddr(struct task
unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
if (csaddr)
unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
+
+ /* Open coded, so it's in the same user access region */
+ if (rseq_slice_extension_enabled()) {
+ /* Unconditionally clear it, no point in conditionals */
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+ }
}
+ rseq_slice_clear_grant(t);
/* Cache the new values */
t->rseq.ids.cpu_cid = ids->cpu_cid;
rseq_stat_inc(rseq_stats.ids);
@@ -488,8 +503,17 @@ static __always_inline bool rseq_exit_us
*/
u64 csaddr;
- if (unlikely(get_user_inline(csaddr, &rseq->rseq_cs)))
- return false;
+ scoped_user_rw_access(rseq, efault) {
+ unsafe_get_user(csaddr, &rseq->rseq_cs, efault);
+
+ /* Open coded, so it's in the same user access region */
+ if (rseq_slice_extension_enabled()) {
+ /* Unconditionally clear it, no point in conditionals */
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+ }
+ }
+
+ rseq_slice_clear_grant(t);
if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
@@ -505,6 +529,8 @@ static __always_inline bool rseq_exit_us
u32 node_id = cpu_to_node(ids.cpu_id);
return rseq_update_usr(t, regs, &ids, node_id);
+efault:
+ return false;
}
static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
^ permalink raw reply [flat|nested] 78+ messages in thread
* [patch V6 09/11] rseq: Implement rseq_grant_slice_extension()
2025-12-15 16:52 ` [patch V6 09/11] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
@ 2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 15:25 ` Mathieu Desnoyers
2026-01-22 10:15 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 18:24 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Provide the actual decision function, which decides whether a time slice
extension is granted in the exit to user mode path when NEED_RESCHED is
evaluated.
The decision is made in two stages. First an inline quick check to avoid
going into the actual decision function. This checks whether:
#1 the functionality is enabled
#2 the exit is a return from interrupt to user mode
#3 any TIF bit which causes extra work is set. That includes TIF_RSEQ,
which means the task was already scheduled out.
The slow path, which implements the actual user space ABI, is invoked
when:
A) #1 is true, #2 is true and #3 is false
It checks whether user space requested a slice extension by setting
the request bit in the rseq slice_ctrl field. If so, it grants the
extension and stores the slice expiry time, so that the actual exit
code can double check whether the slice is already exhausted before
going back.
B) #1 - #3 are true _and_ a slice extension was granted in a previous
loop iteration
In this case the grant is revoked.
If the user space access faults or invalid state is detected, the task is
terminated with SIGSEGV.
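The two stage decision can be modelled in plain user space C. This is only a sketch under assumptions, not the kernel implementation: struct model_task and model_grant_slice_extension() are hypothetical names, the user space control word is reduced to two bytes, and the quick check is assumed to let a task with a stale grant through so that the grant can be revoked. The fault path (SIGSEGV) is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per task and user space state */
struct model_task {
	bool	enabled;	/* #1: slice extension enabled for the task */
	bool	user_irq;	/* #2: exit is a return from interrupt */
	bool	granted;	/* kernel side record of an earlier grant */
	uint8_t	usr_request;	/* models rseq->slice_ctrl.request */
	uint8_t	usr_granted;	/* models rseq->slice_ctrl.granted */
};

/* Returns true if the reschedule is delayed, false if schedule() runs */
static bool model_grant_slice_extension(struct model_task *t, bool work_pending)
{
	/* Quick check: nothing to grant and nothing to revoke */
	if (!(t->enabled && t->user_irq) && !t->granted)
		return false;

	/* Case B and the deny case: clear user control, drop the grant */
	if (work_pending || t->granted) {
		t->usr_request = 0;
		t->usr_granted = 0;
		t->granted = false;
		return false;
	}

	/* Case A: grant only if user space asked for it */
	if (!t->usr_request)
		return false;

	t->usr_request = 0;
	t->usr_granted = 1;
	t->granted = true;
	return true;
}
```

Calling the function twice in a row on a task with a pending request shows the intended behaviour: the first call grants, the second call (the next loop iteration) revokes.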
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V2: Provide an extra stub for the !RSEQ case - Prateek
---
include/linux/rseq_entry.h | 108 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 108 insertions(+)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -42,6 +42,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
#ifdef CONFIG_RSEQ
#include <linux/jump_label.h>
#include <linux/rseq.h>
+#include <linux/sched/signal.h>
#include <linux/uaccess.h>
#include <linux/tracepoint-defs.h>
@@ -109,10 +110,116 @@ static __always_inline void rseq_slice_c
t->rseq.slice.state.granted = false;
}
+static __always_inline bool rseq_grant_slice_extension(bool work_pending)
+{
+ struct task_struct *curr = current;
+ struct rseq_slice_ctrl usr_ctrl;
+ union rseq_slice_state state;
+ struct rseq __user *rseq;
+
+ if (!rseq_slice_extension_enabled())
+ return false;
+
+ /* If not enabled or not a return from interrupt, nothing to do. */
+ state = curr->rseq.slice.state;
+ state.enabled &= curr->rseq.event.user_irq;
+ if (likely(!state.state))
+ return false;
+
+ rseq = curr->rseq.usrptr;
+ scoped_user_rw_access(rseq, efault) {
+
+ /*
+ * Quick check conditions where a grant is not possible or
+ * needs to be revoked.
+ *
+ * 1) Any TIF bit which needs to do extra work aside of
+ * rescheduling prevents a grant.
+ *
+ * 2) A previous rescheduling request resulted in a slice
+ * extension grant.
+ */
+ if (unlikely(work_pending || state.granted)) {
+ /* Clear user control unconditionally. No point in checking */
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+ rseq_slice_clear_grant(curr);
+ return false;
+ }
+
+ unsafe_get_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault);
+ if (likely(!(usr_ctrl.request)))
+ return false;
+
+ /* Grant the slice extension */
+ usr_ctrl.request = 0;
+ usr_ctrl.granted = 1;
+ unsafe_put_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault);
+ }
+
+ rseq_stat_inc(rseq_stats.s_granted);
+
+ curr->rseq.slice.state.granted = true;
+ /* Store expiry time for arming the timer on the way out */
+ curr->rseq.slice.expires = data_race(rseq_slice_ext_nsecs) + ktime_get_mono_fast_ns();
+ /*
+ * This is racy against a remote CPU setting TIF_NEED_RESCHED in
+ * several ways:
+ *
+ * 1)
+ * CPU0 CPU1
+ * clear_tsk()
+ * set_tsk()
+ * clear_preempt()
+ * Raise scheduler IPI on CPU0
+ * --> IPI
+ * fold_need_resched() -> Folds correctly
+ * 2)
+ * CPU0 CPU1
+ * set_tsk()
+ * clear_tsk()
+ * clear_preempt()
+ * Raise scheduler IPI on CPU0
+ * --> IPI
+ * fold_need_resched() <- NOOP as TIF_NEED_RESCHED is false
+ *
+ * #1 is not any different from a regular remote reschedule as it
+ * sets the previously not set bit and then raises the IPI which
+ * folds it into the preempt counter
+ *
+ * #2 is obviously incorrect from a scheduler POV, but it's not
+ * differently incorrect than the code below clearing the
+ * reschedule request with the safety net of the timer.
+ *
+ * The important part is that the clearing is protected against the
+ * scheduler IPI and also against any other interrupt which might
+ * end up waking up a task and setting the bits in the middle of
+ * the operation:
+ *
+ * clear_tsk()
+ * ---> Interrupt
+ * wakeup_on_this_cpu()
+ * set_tsk()
+ * set_preempt()
+ * clear_preempt()
+ *
+ * which would be inconsistent state.
+ */
+ scoped_guard(irq) {
+ clear_tsk_need_resched(curr);
+ clear_preempt_need_resched();
+ }
+ return true;
+
+efault:
+ force_sig(SIGSEGV);
+ return false;
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
static inline bool rseq_arm_slice_extension_timer(void) { return false; }
static inline void rseq_slice_clear_grant(struct task_struct *t) { }
+static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -671,6 +778,7 @@ static inline void rseq_syscall_exit_to_
static inline void rseq_irqentry_exit_to_user_mode(void) { }
static inline void rseq_exit_to_user_mode_legacy(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
+static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
^ permalink raw reply [flat|nested] 78+ messages in thread
* [patch V6 10/11] entry: Hook up rseq time slice extension
2025-12-15 16:52 ` [patch V6 10/11] entry: Hook up rseq time slice extension Thomas Gleixner
@ 2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 15:37 ` Mathieu Desnoyers
2026-01-22 10:15 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 18:24 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Wire the grant decision function up in exit_to_user_mode_loop().
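The work_pending argument handed to the decision function is derived from the pending TIF bits by masking out the reschedule bits. A small sketch of that mask arithmetic follows; the M_* bit positions and the reduced work mask are made up for this example (the real values are architecture specific), and only the non PREEMPT_RT variant is modelled.

```c
#include <stdbool.h>

/* Made-up bit positions, for illustration only */
#define M_TIF_NEED_RESCHED	(1UL << 0)
#define M_TIF_NEED_RESCHED_LAZY	(1UL << 1)
#define M_TIF_SIGPENDING	(1UL << 2)
#define M_TIF_UPROBE		(1UL << 3)

/* Reduced stand-in for EXIT_TO_USER_MODE_WORK */
#define M_EXIT_TO_USER_MODE_WORK					\
	(M_TIF_NEED_RESCHED | M_TIF_NEED_RESCHED_LAZY |		\
	 M_TIF_SIGPENDING | M_TIF_UPROBE)

/* Non PREEMPT_RT variant: both reschedule bits are tolerated */
#define M_SLICE_EXT_SCHED	(M_TIF_NEED_RESCHED | M_TIF_NEED_RESCHED_LAZY)
#define M_SLICE_EXT_DENY	(M_EXIT_TO_USER_MODE_WORK & ~M_SLICE_EXT_SCHED)

/* True if any work bit other than a reschedule request is pending */
static bool model_work_pending(unsigned long ti_work)
{
	return (ti_work & M_SLICE_EXT_DENY) != 0;
}
```

With PREEMPT_RT only the lazy bit is part of the tolerated set, so a pending _TIF_NEED_RESCHED ends up in the deny mask and prevents a grant.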
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
kernel/entry/common.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -17,6 +17,14 @@ void __weak arch_do_signal_or_restart(st
#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK)
#endif
+/* TIF bits, which prevent a time slice extension. */
+#ifdef CONFIG_PREEMPT_RT
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY)
+#else
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
+#endif
+#define TIF_SLICE_EXT_DENY (EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED)
+
static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
@@ -28,8 +36,10 @@ static __always_inline unsigned long __e
local_irq_enable_exit_to_user(ti_work);
- if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
- schedule();
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+ if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+ schedule();
+ }
if (ti_work & _TIF_UPROBE)
uprobe_notify_resume(regs);
^ permalink raw reply [flat|nested] 78+ messages in thread
* [patch V6 11/11] selftests/rseq: Implement time slice extension test
2025-12-15 16:52 ` [patch V6 11/11] selftests/rseq: Implement time slice extension test Thomas Gleixner
@ 2025-12-15 18:24 ` Thomas Gleixner
2026-01-22 10:15 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
1 sibling, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-15 18:24 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
Provide an initial test case to evaluate the functionality. This needs to be
extended to cover the ABI violations and expose the race condition between
observing granted and arriving in rseq_slice_yield().
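The user space side of the protocol which the test exercises can be sketched without any kernel support. The struct and the helper names below (model_user_ctrl, model_cs_enter(), model_cs_exit()) are assumptions made for this illustration; the real test operates on rs->slice_ctrl with RSEQ_WRITE_ONCE() and RSEQ_READ_ONCE().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of the slice control word to its two flags */
struct model_user_ctrl {
	volatile uint8_t request;	/* written by user space */
	volatile uint8_t granted;	/* written by the kernel only */
};

/* Entering the critical section: ask for an extension */
static void model_cs_enter(struct model_user_ctrl *c)
{
	c->request = 1;
}

/*
 * Leaving the critical section: drop the request and tell the caller
 * whether the kernel granted an extension in the meantime. If so, the
 * caller is expected to relinquish the CPU.
 */
static bool model_cs_exit(struct model_user_ctrl *c)
{
	c->request = 0;
	return c->granted != 0;
}
```

When model_cs_exit() returns true, the test would invoke syscall(__NR_rseq_slice_yield), which is syscall number 471 in this series. The window between reading granted and arriving in that syscall is the race mentioned above.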
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5: Add a test for a random syscall
---
tools/testing/selftests/rseq/.gitignore | 1
tools/testing/selftests/rseq/Makefile | 5
tools/testing/selftests/rseq/rseq-abi.h | 27 +++
tools/testing/selftests/rseq/slice_test.c | 219 ++++++++++++++++++++++++++++++
4 files changed, 251 insertions(+), 1 deletion(-)
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -10,3 +10,4 @@ param_test_mm_cid
param_test_mm_cid_benchmark
param_test_mm_cid_compare_twice
syscall_errors_test
+slice_test
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
param_test_benchmark param_test_compare_twice param_test_mm_cid \
param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
- syscall_errors_test
+ syscall_errors_test slice_test
TEST_GEN_PROGS_EXTENDED = librseq.so
@@ -59,3 +59,6 @@ include ../lib.mk
$(OUTPUT)/syscall_errors_test: syscall_errors_test.c $(TEST_GEN_PROGS_EXTENDED) \
rseq.h rseq-*.h
$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
+ $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -53,6 +53,27 @@ struct rseq_abi_cs {
__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));
+/**
+ * rseq_abi_slice_ctrl - Time slice extension control structure
+ * @all: Compound value
+ * @request: Request for a time slice extension
+ * @granted: Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_abi_slice_ctrl {
+ union {
+ __u32 all;
+ struct {
+ __u8 request;
+ __u8 granted;
+ __u16 __reserved;
+ };
+ };
+};
+
/*
* struct rseq_abi is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
@@ -165,6 +186,12 @@ struct rseq_abi {
__u32 mm_cid;
/*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_abi_slice_ctrl slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
--- /dev/null
+++ b/tools/testing/selftests/rseq/slice_test.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include <linux/prctl.h>
+#include <sys/prctl.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+#include "../kselftest_harness.h"
+
+#ifndef __NR_rseq_slice_yield
+# define __NR_rseq_slice_yield 471
+#endif
+
+#define BITS_PER_INT 32
+#define BITS_PER_BYTE 8
+
+#ifndef PR_RSEQ_SLICE_EXTENSION
+# define PR_RSEQ_SLICE_EXTENSION 79
+# define PR_RSEQ_SLICE_EXTENSION_GET 1
+# define PR_RSEQ_SLICE_EXTENSION_SET 2
+# define PR_RSEQ_SLICE_EXT_ENABLE 0x01
+#endif
+
+#ifndef RSEQ_SLICE_EXT_REQUEST_BIT
+# define RSEQ_SLICE_EXT_REQUEST_BIT 0
+# define RSEQ_SLICE_EXT_GRANTED_BIT 1
+#endif
+
+#ifndef asm_inline
+# define asm_inline asm __inline
+#endif
+
+#define NSEC_PER_SEC 1000000000L
+#define NSEC_PER_USEC 1000L
+
+struct noise_params {
+ int64_t noise_nsecs;
+ int64_t sleep_nsecs;
+ int64_t run;
+};
+
+FIXTURE(slice_ext)
+{
+ pthread_t noise_thread;
+ struct noise_params noise_params;
+};
+
+FIXTURE_VARIANT(slice_ext)
+{
+ int64_t total_nsecs;
+ int64_t slice_nsecs;
+ int64_t noise_nsecs;
+ int64_t sleep_nsecs;
+ bool no_yield;
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n2_2_50)
+{
+ .total_nsecs = 5 * NSEC_PER_SEC,
+ .slice_nsecs = 2 * NSEC_PER_USEC,
+ .noise_nsecs = 2 * NSEC_PER_USEC,
+ .sleep_nsecs = 50 * NSEC_PER_USEC,
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n50_2_50)
+{
+ .total_nsecs = 5 * NSEC_PER_SEC,
+ .slice_nsecs = 50 * NSEC_PER_USEC,
+ .noise_nsecs = 2 * NSEC_PER_USEC,
+ .sleep_nsecs = 50 * NSEC_PER_USEC,
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n2_2_50_no_yield)
+{
+ .total_nsecs = 5 * NSEC_PER_SEC,
+ .slice_nsecs = 2 * NSEC_PER_USEC,
+ .noise_nsecs = 2 * NSEC_PER_USEC,
+ .sleep_nsecs = 50 * NSEC_PER_USEC,
+ .no_yield = true,
+};
+
+
+static inline bool elapsed(struct timespec *start, struct timespec *now,
+ int64_t span)
+{
+ int64_t delta = now->tv_sec - start->tv_sec;
+
+ delta *= NSEC_PER_SEC;
+ delta += now->tv_nsec - start->tv_nsec;
+ return delta >= span;
+}
+
+static void *noise_thread(void *arg)
+{
+ struct noise_params *p = arg;
+
+ while (RSEQ_READ_ONCE(p->run)) {
+ struct timespec ts_start, ts_now;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+ do {
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_start, &ts_now, p->noise_nsecs));
+
+ ts_start.tv_sec = 0;
+ ts_start.tv_nsec = p->sleep_nsecs;
+ clock_nanosleep(CLOCK_MONOTONIC, 0, &ts_start, NULL);
+ }
+ return NULL;
+}
+
+FIXTURE_SETUP(slice_ext)
+{
+ cpu_set_t affinity;
+
+ ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0);
+
+ /* Pin it on a single CPU. Avoid CPU 0 */
+ for (int i = 1; i < CPU_SETSIZE; i++) {
+ if (!CPU_ISSET(i, &affinity))
+ continue;
+
+ CPU_ZERO(&affinity);
+ CPU_SET(i, &affinity);
+ ASSERT_EQ(sched_setaffinity(0, sizeof(affinity), &affinity), 0);
+ break;
+ }
+
+ ASSERT_EQ(rseq_register_current_thread(), 0);
+
+ ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0);
+
+ self->noise_params.noise_nsecs = variant->noise_nsecs;
+ self->noise_params.sleep_nsecs = variant->sleep_nsecs;
+ self->noise_params.run = 1;
+
+ ASSERT_EQ(pthread_create(&self->noise_thread, NULL, noise_thread, &self->noise_params), 0);
+}
+
+FIXTURE_TEARDOWN(slice_ext)
+{
+ self->noise_params.run = 0;
+ pthread_join(self->noise_thread, NULL);
+}
+
+TEST_F(slice_ext, slice_test)
+{
+ unsigned long success = 0, yielded = 0, scheduled = 0, raced = 0;
+ unsigned long total = 0, aborted = 0;
+ struct rseq_abi *rs = rseq_get_abi();
+ struct timespec ts_start, ts_now;
+
+ ASSERT_NE(rs, NULL);
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+ do {
+ struct timespec ts_cs;
+ bool req = false;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_cs);
+
+ total++;
+ RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 1);
+ do {
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_cs, &ts_now, variant->slice_nsecs));
+
+ /*
+ * request can be cleared unconditionally, but for making
+ * the stats work this is actually checking it first
+ */
+ if (RSEQ_READ_ONCE(rs->slice_ctrl.request)) {
+ RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 0);
+ /* Race between check and clear! */
+ req = true;
+ success++;
+ }
+
+ if (RSEQ_READ_ONCE(rs->slice_ctrl.granted)) {
+ /* The above raced against a late grant */
+ if (req)
+ success--;
+ if (variant->no_yield) {
+ syscall(__NR_getpid);
+ aborted++;
+ } else {
+ yielded++;
+ if (!syscall(__NR_rseq_slice_yield))
+ raced++;
+ }
+ } else {
+ if (!req)
+ scheduled++;
+ }
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_start, &ts_now, variant->total_nsecs));
+
+ printf("# Total %12lu\n", total);
+ printf("# Success %12lu\n", success);
+ printf("# Yielded %12lu\n", yielded);
+ printf("# Aborted %12lu\n", aborted);
+ printf("# Scheduled %12lu\n", scheduled);
+ printf("# Raced %12lu\n", raced);
+}
+
+TEST_HARNESS_MAIN
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2025-12-15 16:52 ` [patch V6 07/11] rseq: Implement time slice extension enforcement timer Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
@ 2025-12-16 7:18 ` Randy Dunlap
2025-12-16 17:55 ` Prakash Sangappa
2025-12-16 8:26 ` [patch V6.1 " Thomas Gleixner
` (5 subsequent siblings)
7 siblings, 1 reply; 78+ messages in thread
From: Randy Dunlap @ 2025-12-16 7:18 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Peter Zijlstra, Ron Geva, Waiman Long
Ouch. Would you mind rearranging parts of the first sentence?
On 12/15/25 10:24 AM, Thomas Gleixner wrote:
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -1228,6 +1228,14 @@ reboot-cmd (SPARC only)
> ROM/Flash boot loader. Maybe to tell it what to do after
> rebooting. ???
>
> +rseq_slice_extension_nsec
> +=========================
> +
> +A task can request to delay its scheduling if it is in a critical section
> +via the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sets the maximum
> +allowed extension in nanoseconds before scheduling of the task is enforced.
> +Default value is 30000ns (30us). The possible range is 10000ns (10us) to
> +50000ns (50us).
Maybe
A task that is in a critical section can request to delay its scheduling via
the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism.
--
~Randy
^ permalink raw reply [flat|nested] 78+ messages in thread
* [patch V6.1 07/11] rseq: Implement time slice extension enforcement timer
2025-12-15 16:52 ` [patch V6 07/11] rseq: Implement time slice extension enforcement timer Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 7:18 ` Randy Dunlap
@ 2025-12-16 8:26 ` Thomas Gleixner
2025-12-16 15:13 ` [patch V6 " Mathieu Desnoyers
` (4 subsequent siblings)
7 siblings, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-16 8:26 UTC (permalink / raw)
To: LKML
Cc: Mathieu Desnoyers, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.
It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:
1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
independently of CONFIG_HIGHRES_TIMERS
2) HRTICK usage in the scheduler can be runtime disabled or is only used
for certain aspects of scheduling.
3) The function is calling into the scheduler code and that might have
unexpected consequences when this is invoked due to a time slice
enforcement expiry. Especially when the task managed to clear the
grant via sched_yield().
It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.
Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.
The timer is armed when an extension was granted right before actually
returning to user mode in rseq_exit_to_user_mode_restart().
It is disarmed when the task relinquishes the CPU. This is expensive as
the timer is probably the first expiring timer on the CPU, which means it
has to reprogram the hardware. But that's less expensive than going through
a full hrtimer interrupt cycle for nothing.
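The arm, expire and cancel rules can be illustrated with a small user space model. This is a sketch under assumptions and not the hrtimer based code: struct model_slice_timer and the model_*() helpers are hypothetical names, and time is passed in as plain nanosecond values instead of being read from a clock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per CPU timer; not struct hrtimer */
struct model_slice_timer {
	const void	*cookie;	/* task which armed the timer */
	uint64_t	expires;	/* absolute expiry in nanoseconds */
	bool		armed;
};

/* Returns true if the grant already expired and the task must reschedule */
static bool model_arm(struct model_slice_timer *st, const void *curr,
		      uint64_t expires, uint64_t now)
{
	if (expires < now)
		return true;

	st->cookie = curr;
	st->expires = expires;
	st->armed = true;
	return false;
}

/* Expiry callback: returns true if need_resched would be set */
static bool model_expired(struct model_slice_timer *st, const void *curr,
			  bool granted)
{
	st->armed = false;
	return st->cookie == curr && granted;
}

/* Syscall entry: cancel only if the current task armed the timer */
static void model_cancel(struct model_slice_timer *st, const void *curr)
{
	if (st->cookie == curr)
		st->armed = false;
}
```

The cookie comparison is what makes a stale timer harmless: if a different task is on the CPU when the timer fires, the expiry is ignored.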
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V6.1: Reword documentation - Randy
V5: Document the slice extension range - PeterZ
V4: Update comment - Steven
V3: Add sysctl documentation, simplify timer cancelation - Sebastian
---
Documentation/admin-guide/sysctl/kernel.rst | 8 +
include/linux/rseq_entry.h | 38 +++++---
include/linux/rseq_types.h | 2
kernel/rseq.c | 132 +++++++++++++++++++++++++++-
4 files changed, 167 insertions(+), 13 deletions(-)
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1248,6 +1248,14 @@ reboot-cmd (SPARC only)
ROM/Flash boot loader. Maybe to tell it what to do after
rebooting. ???
+rseq_slice_extension_nsec
+=========================
+
+A task that is in a critical section can request to delay its scheduling
+via the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sets the maximum
+allowed extension in nanoseconds before scheduling of the task is enforced.
+Default value is 30000ns (30us). The possible range is 10000ns (10us) to
+50000ns (50us).
sched_energy_aware
==================
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -87,8 +87,24 @@ static __always_inline bool rseq_slice_e
{
return static_branch_likely(&rseq_slice_extension_key);
}
+
+extern unsigned int rseq_slice_ext_nsecs;
+bool __rseq_arm_slice_extension_timer(void);
+
+static __always_inline bool rseq_arm_slice_extension_timer(void)
+{
+ if (!rseq_slice_extension_enabled())
+ return false;
+
+ if (likely(!current->rseq.slice.state.granted))
+ return false;
+
+ return __rseq_arm_slice_extension_timer();
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
+static inline bool rseq_arm_slice_extension_timer(void) { return false; }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -543,17 +559,19 @@ static __always_inline void clear_tif_rs
static __always_inline bool
rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
{
- if (likely(!test_tif_rseq(ti_work)))
- return false;
-
- if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
- current->rseq.event.slowpath = true;
- set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
- return true;
+ if (unlikely(test_tif_rseq(ti_work))) {
+ if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
+ current->rseq.event.slowpath = true;
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ return true;
+ }
+ clear_tif_rseq();
}
-
- clear_tif_rseq();
- return false;
+ /*
+ * Arm the slice extension timer if nothing to do anymore and the
+ * task really goes out to user space.
+ */
+ return rseq_arm_slice_extension_timer();
}
#else /* CONFIG_GENERIC_ENTRY */
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -89,10 +89,12 @@ union rseq_slice_state {
/**
* struct rseq_slice - Status information for rseq time slice extension
* @state: Time slice extension state
+ * @expires: The time when a grant expires
* @yielded: Indicator for rseq_slice_yield()
*/
struct rseq_slice {
union rseq_slice_state state;
+ u64 expires;
u8 yielded;
};
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,8 @@
#define RSEQ_BUILD_SLOW_PATH
#include <linux/debugfs.h>
+#include <linux/hrtimer.h>
+#include <linux/percpu.h>
#include <linux/prctl.h>
#include <linux/ratelimit.h>
#include <linux/rseq_entry.h>
@@ -500,8 +502,91 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
}
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+struct slice_timer {
+ struct hrtimer timer;
+ void *cookie;
+};
+
+unsigned int rseq_slice_ext_nsecs __read_mostly = 30 * NSEC_PER_USEC;
+static DEFINE_PER_CPU(struct slice_timer, slice_timer);
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+/*
+ * When the timer expires and the task is still in user space, the return
+ * from interrupt will revoke the grant and schedule. If the task already
+ * entered the kernel via a syscall and the timer fires before the syscall
+ * work was able to cancel it, then depending on the preemption model this
+ * will either reschedule on return from interrupt or in the syscall work
+ * below.
+ */
+static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
+{
+ struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
+
+ /*
+ * Validate that the task which armed the timer is still on the
+ * CPU. It could have been scheduled out without canceling the
+ * timer.
+ */
+ if (st->cookie == current && current->rseq.slice.state.granted) {
+ rseq_stat_inc(rseq_stats.s_expired);
+ set_need_resched_current();
+ }
+ return HRTIMER_NORESTART;
+}
+
+bool __rseq_arm_slice_extension_timer(void)
+{
+ struct slice_timer *st = this_cpu_ptr(&slice_timer);
+ struct task_struct *curr = current;
+
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * This check prevents a task, which got a time slice extension
+ * granted, from exceeding the maximum scheduling latency when the
+ * grant expired before going out to user space. Don't bother to
+ * clear the grant here, it will be cleaned up automatically before
+ * going out to user space after being scheduled back in.
+ */
+ if (unlikely(curr->rseq.slice.expires < ktime_get_mono_fast_ns())) {
+ set_need_resched_current();
+ return true;
+ }
+
+ /*
+ * Store the task pointer as a cookie for comparison in the timer
+ * function. This is safe as the timer is CPU local and cannot be
+ * in the expiry function at this point.
+ */
+ st->cookie = curr;
+ hrtimer_start(&st->timer, curr->rseq.slice.expires, HRTIMER_MODE_ABS_PINNED_HARD);
+ /* Arm the syscall entry work */
+ set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+ return false;
+}
+
+static void rseq_cancel_slice_extension_timer(void)
+{
+ struct slice_timer *st = this_cpu_ptr(&slice_timer);
+
+ /*
+ * st->cookie can be safely read as preemption is disabled and the
+ * timer is CPU local.
+ *
+ * As this is most probably the first expiring timer, the cancel is
+ * expensive as it has to reprogram the hardware, but that's less
+ * expensive than going through a full hrtimer_interrupt() cycle
+ * for nothing.
+ *
+ * hrtimer_try_to_cancel() is sufficient here as the timer is CPU
+ * local and once the hrtimer code disabled interrupts the timer
+ * callback cannot be running.
+ */
+ if (st->cookie == current)
+ hrtimer_try_to_cancel(&st->timer);
+}
+
static inline void rseq_slice_set_need_resched(struct task_struct *curr)
{
/*
@@ -563,11 +648,14 @@ void rseq_syscall_enter_work(long syscal
return;
/*
- * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
- * kernels. Leaving the scope will reschedule on preemption models
- * FULL, LAZY and RT if necessary.
+ * Required to stabilize the per CPU timer pointer and to make
+ * set_tsk_need_resched() correct on PREEMPT[RT] kernels.
+ *
+ * Leaving the scope will reschedule on preemption models FULL,
+ * LAZY and RT if necessary.
*/
scoped_guard(preempt) {
+ rseq_cancel_slice_extension_timer();
/*
* Now that preemption is disabled, quickly check whether
* the task was already rescheduled before arriving here.
@@ -665,6 +753,31 @@ SYSCALL_DEFINE0(rseq_slice_yield)
return yielded;
}
+#ifdef CONFIG_SYSCTL
+static const unsigned int rseq_slice_ext_nsecs_min = 10 * NSEC_PER_USEC;
+static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC;
+
+static const struct ctl_table rseq_slice_ext_sysctl[] = {
+ {
+ .procname = "rseq_slice_extension_nsec",
+ .data = &rseq_slice_ext_nsecs,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = (unsigned int *)&rseq_slice_ext_nsecs_min,
+ .extra2 = (unsigned int *)&rseq_slice_ext_nsecs_max,
+ },
+};
+
+static void rseq_slice_sysctl_init(void)
+{
+ if (rseq_slice_extension_enabled())
+ register_sysctl_init("kernel", rseq_slice_ext_sysctl);
+}
+#else /* CONFIG_SYSCTL */
+static inline void rseq_slice_sysctl_init(void) { }
+#endif /* !CONFIG_SYSCTL */
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
@@ -677,4 +790,17 @@ static int __init rseq_slice_cmdline(cha
return 1;
}
__setup("rseq_slice_ext=", rseq_slice_cmdline);
+
+static int __init rseq_slice_init(void)
+{
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ hrtimer_setup(per_cpu_ptr(&slice_timer.timer, cpu), rseq_slice_expired,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
+ }
+ rseq_slice_sysctl_init();
+ return 0;
+}
+device_initcall(rseq_slice_init);
#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2025-12-15 16:52 ` [patch V6 01/11] rseq: Add fields and constants for time slice extension Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
@ 2025-12-16 14:36 ` Mathieu Desnoyers
2025-12-18 23:21 ` Thomas Gleixner
2026-01-19 10:10 ` Peter Zijlstra
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
3 siblings, 1 reply; 78+ messages in thread
From: Mathieu Desnoyers @ 2025-12-16 14:36 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long,
Florian Weimer, carlos@redhat.com
On 2025-12-15 13:24, Thomas Gleixner wrote:
[...]
> +The thread has to enable the functionality via prctl(2)::
> +
> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
Although it is not documented, it appears that a thread can
also use this prctl to disable slice extension.
How is it meant to compose once we have libc trying to use slice
extension internally and the application also using it or wishing to
disable it, unaware that libc is also trying to use it?
Applications are composed of various libraries, each of which may want
to use the feature. It's unclear to me how the per-thread slice
extension enable/disable state fits in this context. Unless we address
this, it will become either:
- Owned and used by a single library, or
- Owned and used by the application, unavailable to libraries.
This goes against the design goals of RSEQ features.
[...]
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -23,9 +23,15 @@ enum rseq_flags {
> };
>
> enum rseq_cs_flags_bit {
> + /* Historical and unsupported bits */
> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
> + /* (3) Intentional gap to put new bits into a separate byte */
Aren't there 8 bits in a byte? What am I missing?
> +
> + /* User read only feature flags */
> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
> };
>
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 05/11] rseq: Implement sys_rseq_slice_yield()
2025-12-15 16:52 ` [patch V6 05/11] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
@ 2025-12-16 14:59 ` Mathieu Desnoyers
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 78+ messages in thread
From: Mathieu Desnoyers @ 2025-12-16 14:59 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
On 2025-12-15 13:24, Thomas Gleixner wrote:
> Provide a new syscall which has the only purpose to yield the CPU after the
> kernel granted a time slice extension.
>
> sched_yield() is not suitable for that because it unconditionally
> schedules, but the end of the time slice extension is not required to
> schedule when the task was already preempted. This also allows to have a
> strict check for termination to catch user space invoking random syscalls
> including sched_yield() from a time slice extension region.
>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 06/11] rseq: Implement syscall entry work for time slice extensions
2025-12-15 16:52 ` [patch V6 06/11] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
@ 2025-12-16 15:05 ` Mathieu Desnoyers
2025-12-18 22:28 ` Thomas Gleixner
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2 siblings, 1 reply; 78+ messages in thread
From: Mathieu Desnoyers @ 2025-12-16 15:05 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
On 2025-12-15 13:24, Thomas Gleixner wrote:
[...]
>
> If the code detects inconsistent user space, that results in a SIGSEGV for
> the application.
With these last updates that allow userspace to call arbitrary system
calls to terminate the grant, the only scenario which triggers SIGSEGV
is if the put_user() storing to the rseq area fails. Perhaps we should
update the wording above to be clearer on which situation triggers
SIGSEGV.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2025-12-15 16:52 ` [patch V6 07/11] rseq: Implement time slice extension enforcement timer Thomas Gleixner
` (2 preceding siblings ...)
2025-12-16 8:26 ` [patch V6.1 " Thomas Gleixner
@ 2025-12-16 15:13 ` Mathieu Desnoyers
2025-12-18 15:05 ` Peter Zijlstra
` (3 subsequent siblings)
7 siblings, 0 replies; 78+ messages in thread
From: Mathieu Desnoyers @ 2025-12-16 15:13 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
On 2025-12-15 13:24, Thomas Gleixner wrote:
> If a time slice extension is granted and the reschedule delayed, the kernel
> has to ensure that user space cannot abuse the extension and exceed the
> maximum granted time.
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 08/11] rseq: Reset slice extension when scheduled
2025-12-15 16:52 ` [patch V6 08/11] rseq: Reset slice extension when scheduled Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
@ 2025-12-16 15:17 ` Mathieu Desnoyers
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 78+ messages in thread
From: Mathieu Desnoyers @ 2025-12-16 15:17 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
On 2025-12-15 13:24, Thomas Gleixner wrote:
> When a time slice extension was granted in the need_resched() check on exit
> to user space, the task can still be scheduled out in one of the other
> pending work items. When it gets scheduled back in, and need_resched() is
> not set, then the stale grant would be preserved, which is just wrong.
>
> RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the
> critical section and ID update mechanisms.
>
> Utilize them and clear the user space slice control member of struct rseq
> unconditionally within the existing user access sections. That's just an
> unconditional store more in that path.
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 09/11] rseq: Implement rseq_grant_slice_extension()
2025-12-15 16:52 ` [patch V6 09/11] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
@ 2025-12-16 15:25 ` Mathieu Desnoyers
2025-12-18 23:28 ` Thomas Gleixner
2026-01-22 10:15 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2 siblings, 1 reply; 78+ messages in thread
From: Mathieu Desnoyers @ 2025-12-16 15:25 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
On 2025-12-15 13:24, Thomas Gleixner wrote:
[...]
> In case that the user space access faults or invalid state is detected, the
> task is terminated with SIGSEGV.
It appears that only access faults trigger SIGSEGV. Perhaps
"or invalid state is detected" should be removed, or is the code missing
some state validation?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 10/11] entry: Hook up rseq time slice extension
2025-12-15 16:52 ` [patch V6 10/11] entry: Hook up rseq time slice extension Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
@ 2025-12-16 15:37 ` Mathieu Desnoyers
2025-12-19 11:07 ` Peter Zijlstra
2026-01-22 10:15 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
2 siblings, 1 reply; 78+ messages in thread
From: Mathieu Desnoyers @ 2025-12-16 15:37 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
On 2025-12-15 13:24, Thomas Gleixner wrote:
> Wire the grant decision function up in exit_to_user_mode_loop()
>
[...]
>
> +/* TIF bits, which prevent a time slice extension. */
> +#ifdef CONFIG_PREEMPT_RT
> +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY)
> +#else
> +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
It would be relevant to explain the difference between RT and non-RT
in the commit message.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2025-12-16 7:18 ` Randy Dunlap
@ 2025-12-16 17:55 ` Prakash Sangappa
0 siblings, 0 replies; 78+ messages in thread
From: Prakash Sangappa @ 2025-12-16 17:55 UTC (permalink / raw)
To: Randy Dunlap
Cc: Thomas Gleixner, LKML, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org, Peter Zijlstra,
Ron Geva, Waiman Long
> On Dec 15, 2025, at 11:18 PM, Randy Dunlap <rdunlap@infradead.org> wrote:
>
> Ouch. Would you mind rearranging parts of the first sentence?
>
> On 12/15/25 10:24 AM, Thomas Gleixner wrote:
>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>> @@ -1228,6 +1228,14 @@ reboot-cmd (SPARC only)
>> ROM/Flash boot loader. Maybe to tell it what to do after
>> rebooting. ???
>>
>> +rseq_slice_extension_nsec
>> +=========================
>> +
>> +A task can request to delay its scheduling if it is in a critical section
>> +via the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sets the maximum
>> +allowed extension in nanoseconds before scheduling of the task is enforced.
>> +Default value is 30000ns (30us). The possible range is 10000ns (10us) to
>> +50000ns (50us).
>
> Maybe
> A task that is in a critical section can request to delay its scheduling via
> the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism.
>
Well, this prctl call is to enable the time slice extension mechanism for the thread.
Actually requesting to delay scheduling (time slice extension) is done
by updating a member in the rseq structure. Perhaps this needs to be clarified.
-Prakash
> --
> ~Randy
>
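The range in the quoted documentation (10us to 50us, default 30us) can be made concrete with a small user-space sketch of the bounds check. The limits are the ones from the patch (rseq_slice_ext_nsecs_min/max and the initial value of rseq_slice_ext_nsecs). The helper name slice_ext_nsecs_in_range() is hypothetical and does not exist in the kernel. It is also an assumption that out-of-range writes are rejected rather than clamped by the sysctl handler.

```c
#include <stdbool.h>

/* Values taken from the patch: 10us minimum, 50us maximum, 30us default. */
#define NSEC_PER_USEC            1000U
#define SLICE_EXT_NSECS_MIN      (10 * NSEC_PER_USEC)
#define SLICE_EXT_NSECS_MAX      (50 * NSEC_PER_USEC)
#define SLICE_EXT_NSECS_DEFAULT  (30 * NSEC_PER_USEC)

/*
 * Hypothetical helper, not kernel code: returns true if a value written
 * to the sysctl would fall inside the documented range.
 */
bool slice_ext_nsecs_in_range(unsigned int nsecs)
{
	return nsecs >= SLICE_EXT_NSECS_MIN && nsecs <= SLICE_EXT_NSECS_MAX;
}
```

Note that this only models the sysctl limit. As pointed out above, the prctl merely enables the mechanism and the actual request is made through the rseq area.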
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2025-12-15 16:52 ` [patch V6 07/11] rseq: Implement time slice extension enforcement timer Thomas Gleixner
` (3 preceding siblings ...)
2025-12-16 15:13 ` [patch V6 " Mathieu Desnoyers
@ 2025-12-18 15:05 ` Peter Zijlstra
2025-12-18 23:26 ` Thomas Gleixner
2025-12-18 15:18 ` Peter Zijlstra
` (2 subsequent siblings)
7 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2025-12-18 15:05 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Ron Geva, Waiman Long
On Mon, Dec 15, 2025 at 05:52:22PM +0100, Thomas Gleixner wrote:
> V5: Document the slice extension range - PeterZ
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -1228,6 +1228,14 @@ reboot-cmd (SPARC only)
> ROM/Flash boot loader. Maybe to tell it what to do after
> rebooting. ???
>
> +rseq_slice_extension_nsec
> +=========================
> +
> +A task can request to delay its scheduling if it is in a critical section
> +via the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sets the maximum
> +allowed extension in nanoseconds before scheduling of the task is enforced.
> +Default value is 30000ns (30us). The possible range is 10000ns (10us) to
> +50000ns (50us).
The important bit: we're not going to increase these numbers. If
anything, I would like the default to be 10us and taint the kernel if
you up it.
I also think we want some tracing/tool to find the actual length of the
extension used (min/avg/max etc.). That is the time between the kernel
finding the extension bit set and arming the timer and the slice_yield()
syscall.
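A minimal sketch of the min/avg/max bookkeeping suggested above, assuming the length of each extension is available as a duration in nanoseconds (the time between arming the timer and the slice_yield() syscall). The struct and function names are hypothetical; nothing like this is part of the series.

```c
#include <stdint.h>

/* Hypothetical accumulator for observed extension lengths in nanoseconds. */
struct slice_ext_stats {
	uint64_t count;
	uint64_t sum_ns;
	uint64_t min_ns;
	uint64_t max_ns;
};

/* Record one observed extension length. */
void slice_ext_stats_add(struct slice_ext_stats *s, uint64_t dur_ns)
{
	/* The first sample initializes the minimum. */
	if (!s->count || dur_ns < s->min_ns)
		s->min_ns = dur_ns;
	if (dur_ns > s->max_ns)
		s->max_ns = dur_ns;
	s->sum_ns += dur_ns;
	s->count++;
}

/* Arithmetic mean of all samples, 0 if nothing was recorded. */
uint64_t slice_ext_stats_avg(const struct slice_ext_stats *s)
{
	return s->count ? s->sum_ns / s->count : 0;
}
```

A zero-initialized struct slice_ext_stats is the expected starting state.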
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2025-12-15 16:52 ` [patch V6 07/11] rseq: Implement time slice extension enforcement timer Thomas Gleixner
` (4 preceding siblings ...)
2025-12-18 15:05 ` Peter Zijlstra
@ 2025-12-18 15:18 ` Peter Zijlstra
2025-12-18 23:25 ` Thomas Gleixner
2026-01-17 9:57 ` Peter Zijlstra
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
7 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2025-12-18 15:18 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Ron Geva, Waiman Long
On Mon, Dec 15, 2025 at 05:52:22PM +0100, Thomas Gleixner wrote:
> +static void rseq_cancel_slice_extension_timer(void)
> +{
> + struct slice_timer *st = this_cpu_ptr(&slice_timer);
> +
> + /*
> + * st->cookie can be safely read as preemption is disabled and the
> + * timer is CPU local.
> + *
> + * As this is most probably the first expiring timer, the cancel is
> + * expensive as it has to reprogram the hardware, but that's less
> + * expensive than going through a full hrtimer_interrupt() cycle
> + * for nothing.
So I have these hrtick patches that skip some of that reprogramming --
at the cost of causing those spurious interrupts. Overall that was a
win.
Should we look at the cost of a spurious hrtimer interrupt? IIRC each
base will stop at the first iteration if the timer is 'early', which
wasn't that bad.
> + * hrtimer_try_to_cancel() is sufficient here as the timer is CPU
> + * local and once the hrtimer code disabled interrupts the timer
> + * callback cannot be running.
> + */
> + if (st->cookie == current)
> + hrtimer_try_to_cancel(&st->timer);
> +}
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 06/11] rseq: Implement syscall entry work for time slice extensions
2025-12-16 15:05 ` Mathieu Desnoyers
@ 2025-12-18 22:28 ` Thomas Gleixner
2025-12-18 22:30 ` Mathieu Desnoyers
0 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-18 22:28 UTC (permalink / raw)
To: Mathieu Desnoyers, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
On Tue, Dec 16 2025 at 10:05, Mathieu Desnoyers wrote:
> On 2025-12-15 13:24, Thomas Gleixner wrote:
> [...]
>>
>> If the code detects inconsistent user space, that results in a SIGSEGV for
>> the application.
>
> With these last updates that allow userspace to call arbitrary system
> calls to terminate the grant, the only scenario which triggers SIGSEGV
> is if the put_user() storing to the rseq area fails. Perhaps we should
Nope. There is also the debug path which will catch offenders which
fiddle with the state.
Thanks,
tglx
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 06/11] rseq: Implement syscall entry work for time slice extensions
2025-12-18 22:28 ` Thomas Gleixner
@ 2025-12-18 22:30 ` Mathieu Desnoyers
0 siblings, 0 replies; 78+ messages in thread
From: Mathieu Desnoyers @ 2025-12-18 22:30 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
On 2025-12-18 17:28, Thomas Gleixner wrote:
> Nope. There is also the debug path which will catch offenders which
> fiddle with the state.
I missed that one. Makes sense, thanks!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2025-12-16 14:36 ` Mathieu Desnoyers
@ 2025-12-18 23:21 ` Thomas Gleixner
2026-01-07 21:11 ` Mathieu Desnoyers
2026-01-17 9:36 ` Peter Zijlstra
0 siblings, 2 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-18 23:21 UTC (permalink / raw)
To: Mathieu Desnoyers, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long,
Florian Weimer, carlos@redhat.com
On Tue, Dec 16 2025 at 09:36, Mathieu Desnoyers wrote:
> On 2025-12-15 13:24, Thomas Gleixner wrote:
> [...]
>> +The thread has to enable the functionality via prctl(2)::
>> +
>> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
>> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
>
> Although it is not documented, it appears that a thread can
> also use this prctl to disable slice extension.
Obviously. Controls are supposed to be symmetrical.
> How is it meant to compose once we have libc trying to use slice
> extension internally and the application also using it or wishing to
> disable it, unaware that libc is also trying to use it ?
Tons of prctls have the same "issue". What's so special about this?
> Applications are composed of various libraries, each of which may want
I'm well aware of that fact.
> to use the feature. It's unclear to me how the per-thread slice
> extension enable/disable state fits in this context. Unless we address
> this, it will become either:
>
> - Owned and used by a single library, or
>
> - Owned and used by the application, unavailable to libraries.
The prctl allows you to query the state, so all parties can make
informed decisions. It's not any different from other mechanisms, which
require coordination between different parts.
> This goes against the design goals of RSEQ features.
These goals are documented where?
What I've seen so far at least from the implementation is that it aims
to enable the maximum amount of features, aka. overhead, unconditionally
even if nothing uses them, e.g. CID.
Your vision/goal of RSEQ being useful everywhere simply does not match
the reality.
As I pointed out in the previous submission, the benefits of time slice
extensions are limited. In low contention scenarios they result in
measurable regressions, so it's not the magic panacea which solves all
locking/critical section problems at once.
The idea that cobbling random libraries together in the hope that
everything goes well has never worked. That's simply a wet dream and
Java has proven that to the maximum extent decades ago. Nevertheless all
other programming models went down the same yawning abyss and everyone
expects that the kernel is magically solving their problems by adding
more abusable [mis]features.
Systems have to be designed carefully as a whole if you want to achieve
the maximum performance. That's not any different from other targets
like real-time. A real-time enabled kernel does not magically create a
real-time system.
TBH, the prctl should be the least of your worries. There are worse
problems with uncoordinated usage:
set(REQUEST)
....
-> Interrupt
clr(REQUEST)
set(GRANTED)
lib1fn()
set(REQUEST) <- Inconsistent state
if (a) {
lib2fn()
syscall() <- RSEQ debug will kill the task....
} else {
...
-> Interrupt
<- RSEQ debug will kill the task....
And no, we are not going to lift this restriction because it allows
abuse of the mechanism unless we track more state and inflict more
overhead on the kernel for no good reason.
Thanks,
tglx
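The uncoordinated-usage sequence above can be replayed with a toy model of the two bits. This is only a sketch: the struct is not the real rseq layout, all names are hypothetical, and the "interrupt" step assumes a reschedule was pending so that the kernel turns the request into a grant.

```c
#include <stdbool.h>

/* Toy model of the REQUEST and GRANTED bits; not the real struct rseq. */
struct slice_ctrl_model {
	bool request;
	bool granted;
};

/* User space asks for an extension before entering a critical section. */
void model_set_request(struct slice_ctrl_model *c)
{
	c->request = true;
}

/*
 * Interrupt with a pending reschedule: the request is consumed and
 * replaced by a grant, as in the clr(REQUEST)/set(GRANTED) step.
 */
void model_interrupt_grant(struct slice_ctrl_model *c)
{
	if (c->request) {
		c->request = false;
		c->granted = true;
	}
}

/* REQUEST and GRANTED set at the same time is the inconsistent state. */
bool model_consistent(const struct slice_ctrl_model *c)
{
	return !(c->request && c->granted);
}
```

A second set(REQUEST) from a nested library call after the grant is what makes model_consistent() return false.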
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2025-12-18 15:18 ` Peter Zijlstra
@ 2025-12-18 23:25 ` Thomas Gleixner
0 siblings, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-18 23:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Ron Geva, Waiman Long
On Thu, Dec 18 2025 at 16:18, Peter Zijlstra wrote:
> On Mon, Dec 15, 2025 at 05:52:22PM +0100, Thomas Gleixner wrote:
>
>> +static void rseq_cancel_slice_extension_timer(void)
>> +{
>> + struct slice_timer *st = this_cpu_ptr(&slice_timer);
>> +
>> + /*
>> + * st->cookie can be safely read as preemption is disabled and the
>> + * timer is CPU local.
>> + *
>> + * As this is most probably the first expiring timer, the cancel is
>> + * expensive as it has to reprogram the hardware, but that's less
>> + * expensive than going through a full hrtimer_interrupt() cycle
>> + * for nothing.
>
> So I have these hrtick patches that skip some of that reprogramming --
> at the cost of causing those spurious interrupts. Overall that was a
> win.
>
> Should we look at the cost of a spurious hrtimer interrupt? IIRC each
> base will stop at the first iteration if the timer is 'early', which
> wasn't that bad.
Correct, but it's still going to reprogram the timer, so contrary to
cancel this takes the full overhead of the interrupt and in this case
because the expiry is short it will trigger most of the time.
Thanks,
tglx
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2025-12-18 15:05 ` Peter Zijlstra
@ 2025-12-18 23:26 ` Thomas Gleixner
2025-12-19 10:05 ` Peter Zijlstra
0 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-18 23:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Ron Geva, Waiman Long
On Thu, Dec 18 2025 at 16:05, Peter Zijlstra wrote:
> On Mon, Dec 15, 2025 at 05:52:22PM +0100, Thomas Gleixner wrote:
>
>> V5: Document the slice extension range - PeterZ
>
>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>> @@ -1228,6 +1228,14 @@ reboot-cmd (SPARC only)
>> ROM/Flash boot loader. Maybe to tell it what to do after
>> rebooting. ???
>>
>> +rseq_slice_extension_nsec
>> +=========================
>> +
>> +A task can request to delay its scheduling if it is in a critical section
>> +via the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sets the maximum
>> +allowed extension in nanoseconds before scheduling of the task is enforced.
>> +Default value is 30000ns (30us). The possible range is 10000ns (10us) to
>> +50000ns (50us).
>
> The important bit: we're not going to increase these numbers. If
> anything, I would like the default to be 10us and taint the kernel if
> you up it.
Fine with me.
> I also think we want some tracing/tool to find the actual length of the
> extension used (min/avg/max etc.). That is the time between the kernel
> finding the extension bit set and arming the timer and the slice_yield()
> syscall.
I could probably integrate that easily into the RSEQ stats mechanism.
Thanks,
tglx
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 09/11] rseq: Implement rseq_grant_slice_extension()
2025-12-16 15:25 ` Mathieu Desnoyers
@ 2025-12-18 23:28 ` Thomas Gleixner
2026-01-11 10:22 ` Thomas Gleixner
0 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2025-12-18 23:28 UTC (permalink / raw)
To: Mathieu Desnoyers, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
On Tue, Dec 16 2025 at 10:25, Mathieu Desnoyers wrote:
> On 2025-12-15 13:24, Thomas Gleixner wrote:
> [...]
>> In case that the user space access faults or invalid state is detected, the
>> task is terminated with SIGSEGV.
>
> It appears that only access faults trigger SIGSEGV. Perhaps
> "or invalid state is detected" should be removed, or is the code missing
> some state validation?
Seems I dropped a debug path somewhere down the road.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2025-12-18 23:26 ` Thomas Gleixner
@ 2025-12-19 10:05 ` Peter Zijlstra
2026-01-16 18:15 ` Peter Zijlstra
0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2025-12-19 10:05 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Ron Geva, Waiman Long
On Fri, Dec 19, 2025 at 12:26:46AM +0100, Thomas Gleixner wrote:
> On Thu, Dec 18 2025 at 16:05, Peter Zijlstra wrote:
> > On Mon, Dec 15, 2025 at 05:52:22PM +0100, Thomas Gleixner wrote:
> >
> >> V5: Document the slice extension range - PeterZ
> >
> >> --- a/Documentation/admin-guide/sysctl/kernel.rst
> >> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> >> @@ -1228,6 +1228,14 @@ reboot-cmd (SPARC only)
> >> ROM/Flash boot loader. Maybe to tell it what to do after
> >> rebooting. ???
> >>
> >> +rseq_slice_extension_nsec
> >> +=========================
> >> +
> >> +A task can request to delay its scheduling if it is in a critical section
> >> +via the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sets the maximum
> >> +allowed extension in nanoseconds before scheduling of the task is enforced.
> >> +Default value is 30000ns (30us). The possible range is 10000ns (10us) to
> >> +50000ns (50us).
> >
> > The important bit: we're not going to increase these numbers. If
> > anything, I would like the default to be 10us and taint the kernel if
> > you up it.
>
> Fine with me.
Thanks; the thinking is that it will be very hard to shrink this number
due to unknown workloads in the wild and all that, so starting on the
small end is the conservative option.
> > I also think we want some tracing/tool to find the actual length of the
> > extension used (min/avg/max etc.). That is the time between the kernel
> > finding the extension bit set and arming the timer and the slice_yield()
> > syscall.
>
> I could probably integrate that easily into the RSEQ stats mechanism.
I was thinking that perhaps the hrtimer tracepoints, filtered on this
specific timer, might just do. Arming the timer is the point where the
extension is granted, cancelling the timer is on the slice_yield() (or
any other random syscall :/), and the timer actually firing is on fail.
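A minimal sketch of that pairing, assuming the trace has already been reduced to (timestamp_ns, event_name, timer_id) tuples for the slice extension timer only. The tuple layout and the function names are made up for illustration; this is not the tracecmd API:

```python
from collections import Counter


def slice_durations(events):
    """Pair each hrtimer_start with the next cancel/expire of the same timer.

    events: iterable of (timestamp_ns, event_name, timer_id), sorted by time.
    Returns (yielded, expired), two lists of durations in nanoseconds.
    """
    armed = {}      # timer_id -> timestamp of the pending hrtimer_start
    yielded = []    # extension ended by slice_yield() or another syscall
    expired = []    # extension ended by the enforcement timer firing
    for ts, name, timer in events:
        if name == "hrtimer_start":
            armed[timer] = ts
        elif name in ("hrtimer_cancel", "hrtimer_expire_entry") and timer in armed:
            delta = ts - armed.pop(timer)
            (yielded if name == "hrtimer_cancel" else expired).append(delta)
    return yielded, expired


def histogram_us(durations_ns):
    """Bucket durations into whole microseconds."""
    return Counter(d // 1000 for d in durations_ns)
```

The sketch assumes that an expiring timer does not also emit a cancel event ahead of its expire entry. If the real trace does that, the pairing has to look ahead one event before classifying a cancel as a yield.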
Normally I would suggest using a Poisson distribution to find the
'average', but this case is more complicated because the start of the
extension is lost.
Let me ask one of these fancy AI things. Ah, it says this is "a classic
example of Length-Biased Sampling combined with Left-Truncation". It
then further suggests:
If you cannot assume a distribution, you should use a Weighting
Method. Since the probability of catching an event of length L is
proportional to L, you must weight each observation by 1/L.
1. For each event, record the observed duration d_i
2. Calculate the weighted mean:
avg(x)_true = \Sum (d_i * 1/d_i) / \Sum (1/d_i) = n / \Sum (1/d_i)
This is the Harmonic Mean of your observed durations. The harmonic
mean effectively "penalizes" the long events you were more likely to
catch.
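A minimal sketch of the 1/d_i weighting quoted above, assuming the observed durations are available as a plain list of strictly positive numbers. The function name is made up, and whether this estimator suits the slice extension data at all is exactly what remains to be verified:

```python
def length_bias_corrected_mean(durations):
    """Weight each observed duration d_i by 1/d_i and return the weighted mean.

    With these weights the numerator collapses to n, so the result is the
    harmonic mean n / sum(1/d_i) of the observations.
    """
    if not durations:
        raise ValueError("need at least one observation")
    if any(d <= 0 for d in durations):
        raise ValueError("durations must be strictly positive")
    weights = [1.0 / d for d in durations]
    weighted_sum = sum(d * w for d, w in zip(durations, weights))
    return weighted_sum / sum(weights)
```

Zero-length samples would have to be dropped or recorded at a finer resolution first, since 1/d_i is undefined for them.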
It also babbled something about an Inspection Paradox:
If your sampling rate is constant (a Poisson process) and the system is
in a "steady state," the most robust and mathematically elegant way to
find the true average duration (μ) is surprisingly simple.
In a steady-state system where you catch an event in progress:
The time from the start of the event to your arrival is U
(unobserved).
The time from your arrival to the end of the event is V (observed).
Under these specific conditions, the expected value of the observed
remaining duration (V) is exactly equal to the mean of the length-biased
distribution. However, because long events are over-sampled, the mean of
the durations you catch is actually higher than the true mean of all
events. For many common distributions (like the Exponential
distribution), the relationship is: μ=E[V]
Wait, if you ignore the part you missed (U) and only average the parts
you saw (V), you often arrive back at the true mean. This is known as
the Inspection Paradox.
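One way to check part of that claim is a small simulation. The sketch below covers only the exponential case: it lays events of exponentially distributed length back to back, inspects at uniformly random times, and reports both the mean length of the events that were caught and the mean remaining time V after the inspection. All names and parameters are illustrative and have nothing to do with the kernel side:

```python
import bisect
import random


def inspect(mu=10.0, n_events=200_000, n_samples=20_000, seed=1):
    """Return (mean length of caught events, mean remaining time V)."""
    rng = random.Random(seed)
    ends = []           # cumulative end time of each back-to-back event
    t = 0.0
    for _ in range(n_events):
        t += rng.expovariate(1.0 / mu)   # event length with true mean mu
        ends.append(t)
    caught = 0.0
    remaining = 0.0
    for _ in range(n_samples):
        now = rng.uniform(0.0, t)        # inspection time
        # Index of the event in progress at 'now' (clamped for rounding).
        i = min(bisect.bisect_right(ends, now), n_events - 1)
        start = ends[i - 1] if i else 0.0
        caught += ends[i] - start        # full length of the caught event
        remaining += ends[i] - now       # observed part V
    return caught / n_samples, remaining / n_samples
```

For exponential lengths the caught events should average about twice the true mean, while the observed remainder V should average about the true mean. Other distributions need not behave this way.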
Now I suppose I should do the real research to see how much of that is a
hallucination :-)
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 10/11] entry: Hook up rseq time slice extension
2025-12-16 15:37 ` Mathieu Desnoyers
@ 2025-12-19 11:07 ` Peter Zijlstra
2026-01-11 11:01 ` Thomas Gleixner
0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2025-12-19 11:07 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Thomas Gleixner, LKML, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Ron Geva, Waiman Long
On Tue, Dec 16, 2025 at 10:37:24AM -0500, Mathieu Desnoyers wrote:
> On 2025-12-15 13:24, Thomas Gleixner wrote:
> > Wire the grant decision function up in exit_to_user_mode_loop()
> >
> [...]
> > +/* TIF bits, which prevent a time slice extension. */
> > +#ifdef CONFIG_PREEMPT_RT
> > +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY)
> > +#else
> > +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
>
> It would be relevant to explain the difference between RT and non-RT
> in the commit message.
So if you include TIF_NEED_RESCHED the extension period directly affects
the minimum scheduler delay like:
min(extension_period, min_sched_delay)
because this is strictly a from-userspace thing. That is, it is
equivalent to the in-kernel preemption/IRQ disabled regions -- with the
exception of the scheduler critical sections themselves.
As I've argued many times -- I don't see a fundamental reason to not do
this for RT -- but perhaps further reduce the magic number such that its
impact cannot be observed on a 'good' machine.
But yes, if/when we do this on RT it needs the promise to aggressively
decrease the magic number any time it can actually be measured to impact
performance.
cyclictest should probably get a mode where it (ab)uses the feature to
failure before we do this.
Anyway, I don't mind excluding RT for now, but it *does* deserve a
comment.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2025-12-18 23:21 ` Thomas Gleixner
@ 2026-01-07 21:11 ` Mathieu Desnoyers
2026-01-11 17:11 ` Thomas Gleixner
2026-01-17 9:36 ` Peter Zijlstra
1 sibling, 1 reply; 78+ messages in thread
From: Mathieu Desnoyers @ 2026-01-07 21:11 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long,
Florian Weimer, carlos@redhat.com, Michael Jeanson
On 2025-12-18 18:21, Thomas Gleixner wrote:
> On Tue, Dec 16 2025 at 09:36, Mathieu Desnoyers wrote:
>> On 2025-12-15 13:24, Thomas Gleixner wrote:
>> [...]
>>> +The thread has to enable the functionality via prctl(2)::
>>> +
>>> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
>>> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
>>
>> Although it is not documented, it appears that a thread can
>> also use this prctl to disable slice extension.
>
> Obviously. Controls are supposed to be symmetrical.
I agree that the vast majority of prctl are symmetrical, but
there are exceptions, e.g. PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP.
>> How is it meant to compose once we have libc trying to use slice
>> extension internally and the application also using it or wishing to
>> disable it, unaware that libc is also trying to use it ?
>
> Tons of prctls have the same "issue". What's so special about this?
What is special about this is the fact that we want to allow userspace
to specialize its fast-path code at runtime based on availability of an
rseq feature.
If we allow slice extension to be disabled by the program or any
library within the process, this means that either the program or any
other library cannot assume slice extension availability to stay
invariant after it has been set up. This therefore requires adding
additional feature availability tests on the fast-path. And if this
state is per-thread, this means testing flags within the rseq area
on every use. Even if this is a simple load from an address at
thread pointer + offset, test, and branch, the overhead adds up quickly
for fast-paths.
Moreover, if the prctl enables the feature independently for each
thread (rather than for the whole process), this requires a conditional
state check on every use because it can be enabled or disabled
depending on the thread. This prevents code specialization that would
select the appropriate code at process startup through either ifunc
resolver, code patching or other means.
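The difference between the two approaches can be sketched as a toy model. Python has no ifunc resolver, so this only mimics the selection pattern. The function names, the thread_state dictionary and the log list are made up for illustration:

```python
def make_lock_path(slice_ext_available):
    """Choose the lock fast path once, at startup (stand-in for an ifunc resolver)."""
    def with_slice_ext(critical_section, log):
        log.append("set_request")
        critical_section()
        log.append("clear_request")

    def fallback(critical_section, log):
        critical_section()

    return with_slice_ext if slice_ext_available else fallback


def lock_path_checked(thread_state, critical_section, log):
    """Per-use variant: re-test the per-thread flag on every call."""
    if thread_state["slice_ext_enabled"]:
        log.append("set_request")
        critical_section()
        log.append("clear_request")
    else:
        critical_section()
```

The first variant is only valid if the availability cannot change after the selection was made. The second variant stays correct if it does change, at the price of a load, test and branch on every use.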
[...]
> The prctl allows you to query the state, so all parties can make
> informed decisions. It's not any different from other mechanisms, which
> require coordination between different parts.
I'm fine with having prctl enable the feature (for the whole process)
and query its state.
The part I'm concerned with is the prctl disabling the feature, as
we're losing the availability invariant after setup.
>
>> This goes against the design goals of RSEQ features.
>
> These goals are documented where?
We should clarify those design goals somewhere. So far those have
been enforced by me when vetting new features, but that approach
is not good in the long term.
Is Documentation/userspace-api/rseq.rst a good location for this ?
>
> What I've seen so far at least from the implementation is that it aims
> to enable the maximum amount of features, aka. overhead, unconditionally
> even if nothing uses them, e.g. CID.
I don't mind having things disabled on process startup and then opt-in.
What I care about though is that the enabled state stays invariant across
the entire process after setting this up at program startup.
I agree with you in retrospect that this opt-in approach should have been
taken for CID.
> Your vision/goal of RSEQ being useful everywhere simply does not match
> the reality.
Again, I don't mind the opt-in approach, only that the state stays invariant
after program startup.
> As I pointed out in the previous submission, the benefits of time slice
> extensions are limited. In low contention scenarios they result in
> measurable regressions, so it's not the magic panacea which solves all
> locking/critical section problems at once.
I agree that whatever code we add to an uncontended spinlock fast path
will show up in microbenchmark measurements.
>
> The idea that cobbling random libraries together in the hope that
> everything goes well has never worked. That's simply a wet dream and
> Java has proven that to the maximum extent decades ago. Nevertheless all
> other programming models went down the same yawning abyss and everyone
> expects that the kernel is magically solving their problems by adding
> more abusable [mis]features.
>
> Systems have to be designed carefully as a whole if you want to achieve
> the maximum performance. That's not any different from other targets
> like real-time. A real-time enabled kernel does not magically create a
> real-time system.
[...]
I think we are talking about two different program/libraries composition
use-cases here.
AFAIU, the aspect you are focused on is whether we should allow users of
slice extension to nest. I agree with you that we should document this
as unsupported, since the goal of slice extension is really for short
spinlock critical sections, and nesting of those goes against that
basic definition.
The concern I am raising here is different. It's about just _using_
slice extension from various entities (program, libraries) within a
process, without any nesting of slice extension requests.
If libc successfully enables slice extension in its startup, the
kernel should guarantee that it stays invariant for the lifetime
of the program so libc can optimize its code accordingly, or use
a fallback, without requiring additional per-thread variable checks
in its fast paths.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 09/11] rseq: Implement rseq_grant_slice_extension()
2025-12-18 23:28 ` Thomas Gleixner
@ 2026-01-11 10:22 ` Thomas Gleixner
0 siblings, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2026-01-11 10:22 UTC (permalink / raw)
To: Mathieu Desnoyers, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long
On Fri, Dec 19 2025 at 00:28, Thomas Gleixner wrote:
> On Tue, Dec 16 2025 at 10:25, Mathieu Desnoyers wrote:
>> On 2025-12-15 13:24, Thomas Gleixner wrote:
>> [...]
>>> In case that the user space access faults or invalid state is detected, the
>>> task is terminated with SIGSEGV.
>>
>> It appears that only access faults trigger SIGSEGV. Perhaps
>> "or invalid state is detected" should be removed, or the code is missing
>> some state validation?
>
> Seems I dropped a debug path somewhere down the road.
On purpose, so I reword the change log.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 10/11] entry: Hook up rseq time slice extension
2025-12-19 11:07 ` Peter Zijlstra
@ 2026-01-11 11:01 ` Thomas Gleixner
2026-01-17 9:51 ` Peter Zijlstra
0 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2026-01-11 11:01 UTC (permalink / raw)
To: Peter Zijlstra, Mathieu Desnoyers
Cc: LKML, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Ron Geva, Waiman Long
On Fri, Dec 19 2025 at 12:07, Peter Zijlstra wrote:
> On Tue, Dec 16, 2025 at 10:37:24AM -0500, Mathieu Desnoyers wrote:
>> On 2025-12-15 13:24, Thomas Gleixner wrote:
>> > Wire the grant decision function up in exit_to_user_mode_loop()
>> >
>> [...]
>> > +/* TIF bits, which prevent a time slice extension. */
>> > +#ifdef CONFIG_PREEMPT_RT
>> > +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY)
>> > +#else
>> > +# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
>>
>> It would be relevant to explain the difference between RT and non-RT
>> in the commit message.
>
> So if you include TIF_NEED_RESCHED the extension period directly affects
> the minimum scheduler delay like:
>
> min(extension_period, min_sched_delay)
>
> because this is strictly a from-userspace thing. That is, it is
> equivalent to the in-kernel preemption/IRQ disabled regions -- with the
> exception of the scheduler critical sections themselves.
>
> As I've argued many times -- I don't see a fundamental reason to not do
> this for RT -- but perhaps further reduce the magic number such that its
> impact cannot be observed on a 'good' machine.
>
> But yes, if/when we do this on RT it needs the promise to aggressively
> decrease the magic number any time it can actually be measured to impact
> performance.
>
> cyclictest should probably get a mode where it (ab)uses the feature to
> failure before we do this.
>
> Anyway, I don't mind excluding RT for now, but it *does* deserve a
> comment.
I know you argued about this many times, but I still maintain my point
of view that TIF_PREEMPT and TIF_PREEMPT_LAZY are fundamentally different:
TIF_PREEMPT_LAZY grants a non-RT task to complete until it reaches
return to user
TIF_PREEMPT enforces preemption at the next possible preemption
point
My main concern is this scenario:
sched_other_task()
request_slice_extension()
---> interrupt
RT task is woken up
return_to_user()
grant_extension()
...
which means the RT task is delayed until the OTHER task relinquishes the
CPU voluntarily or via timeout.
That might be desired _if_ both tasks are using the same lock, but in
case of fully independent tasks it's not necessarily a good idea. If an
RT application uses locks in the RT tasks, then obviously latency is not
so much of a concern, but for optimized RT applications the side effect
of other processes getting a free pass to increase latency is troublesome.
So I prefer to keep the current semantics for RT. This can be revisited
of course when a proper evaluation has been done, but IMO there are too
many moving parts in an RT system to make this actually work correctly
under all circumstances.
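The distinction can be sketched as a toy model. It assumes that the TIF_SLICE_EXT_SCHED mask names the reschedule bits for which an extension may still be granted, that any other pending work bit denies it, and that waking an RT task sets the non-lazy reschedule bit. The bit values and function names are hypothetical and not taken from the kernel:

```python
NEED_RESCHED = 1 << 0        # hypothetical bit values, not the kernel's
NEED_RESCHED_LAZY = 1 << 1
OTHER_WORK = 1 << 2          # stands in for any other exit-to-user work bit


def slice_ext_sched_mask(preempt_rt):
    """Model of the TIF_SLICE_EXT_SCHED define quoted above."""
    if preempt_rt:
        return NEED_RESCHED_LAZY
    return NEED_RESCHED | NEED_RESCHED_LAZY


def may_grant(pending, requested, preempt_rt):
    """True if a requested extension could be granted for these pending bits."""
    mask = slice_ext_sched_mask(preempt_rt)
    resched_covered = bool(pending & mask)
    other_pending = bool(pending & ~mask)
    return bool(requested) and resched_covered and not other_pending
```

In this model a lazy reschedule can be extended on both configurations, while the RT wakeup from the scenario above can only be delayed on a non-RT kernel.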
I'll add proper comments to that effect.
Thanks,
tglx
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2026-01-07 21:11 ` Mathieu Desnoyers
@ 2026-01-11 17:11 ` Thomas Gleixner
2026-01-13 23:45 ` Florian Weimer
0 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2026-01-11 17:11 UTC (permalink / raw)
To: Mathieu Desnoyers, LKML
Cc: Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long,
Florian Weimer, carlos@redhat.com, Michael Jeanson
On Wed, Jan 07 2026 at 16:11, Mathieu Desnoyers wrote:
> On 2025-12-18 18:21, Thomas Gleixner wrote:
>> On Tue, Dec 16 2025 at 09:36, Mathieu Desnoyers wrote:
>>> On 2025-12-15 13:24, Thomas Gleixner wrote:
>>> [...]
>>>> +The thread has to enable the functionality via prctl(2)::
>>>> +
>>>> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
>>>> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
>>>
>>> Although it is not documented, it appears that a thread can
>>> also use this prctl to disable slice extension.
>>
>> Obviously. Controls are supposed to be symmetrical.
>
> I agree that the vast majority of prctl are symmetrical, but
> there are exceptions, e.g. PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP.
Which have security requirements and are therefore different.
>>> How is it meant to compose once we have libc trying to use slice
>>> extension internally and the application also using it or wishing to
>>> disable it, unaware that libc is also trying to use it ?
>>
>> Tons of prctls have the same "issue". What's so special about this?
>
> What is special about this is the fact that we want to allow userspace
> to specialize its fast-path code at runtime based on availability of an
> rseq feature.
>
> If we allow slice extension to be disabled by the program or any
> library within the process, this means that either the program or any
> other library cannot assume slice extension availability to stay
> invariant after it has been set up.
That's really a non-problem. This is not any different from other
tunables and there is really no reason to make up theoretical cases
where a library enables and another one disables. If user space can't
get its act together then so be it. It's not the kernel's problem and as
this is not a security feature with strict semantics, there is no reason
to let the kernel implement policy.
> Moreover, if the prctl enables the feature independently for each
> thread (rather than for the whole process), this requires a conditional
> state check on every use because it can be enabled or disabled
> depending on the thread. This prevents code specialization that would
> select the appropriate code at process startup through either ifunc
> resolver, code patching or other means.
I'm not completely opposed to making it process wide. For threads created
after enablement, that's trivial because that can be done when the per
thread RSEQ is registered. But when it gets enabled _after_ threads have
been created already then we need code to chase the threads and enable
it after the fact because we are not going to query the enablement in
curr->mm::whatever just to have another conditional and another
cacheline to access.
The only option is to reject enablement when there is already more than
one thread in the process, but there is a reasonable argument that a
process might only enable it for a subset of threads, which have actual
lock interaction and not bother with it for other things. I'm not seeing
a reason to restrict the flexibility of configuration just because you
envision magic use cases all over the place.
On the other hand there is no guarantee that libc registers RSEQ when a
thread is started as it can be disabled or not supported, so you have
exactly the same problem there that the code which wants to use it needs
to ensure that a RSEQ area is registered, no?
> [...]
>
>> The prctl allows you to query the state, so all parties can make
>> informed decisions. It's not any different from other mechanisms, which
>> require coordination between different parts.
>
> I'm fine with having prctl enable the feature (for the whole process)
> and query its state.
>
> The part I'm concerned with is the prctl disabling the feature, as
> we're losing the availability invariant after setup.
close(0);
has the same problem. How many instances of bugs in that area have you
seen so far?
>> What I've seen so far at least from the implementation is that it aims
>> to enable the maximum amount of features, aka. overhead, unconditionally
>> even if nothing uses them, e.g. CID.
>
> I don't mind having things disabled on process startup and then opt-in.
> What I care about though is that the enabled state stays invariant across
> the entire process after setting this up at program startup.
Userspace is perfectly equipped to do so and the kernel is not there to
prevent user space from shooting itself in the foot.
>> As I pointed out in the previous submission, the benefits of time slice
>> extensions are limited. In low contention scenarios they result in
>> measurable regressions, so it's not the magic panacea which solves all
>> locking/critical section problems at once.
>
> I agree that whatever code we add to an uncontended spinlock fast path
> will show up in microbenchmark measurements.
It not only shows up in microbenchmarks. It shows up in real world
scenarios too. So enabling and using it in random places just because
you can will not necessarily result in any performance gain, it might
actually get worse.
>> The idea that cobbling random libraries together in the hope that
>> everything goes well has never worked. That's simply a wet dream and
>> Java has proven that to the maximum extent decades ago. Nevertheless all
>> other programming models went down the same yawning abyss and everyone
>> expects that the kernel is magically solving their problems by adding
>> more abusable [mis]features.
>>
>> Systems have to be designed carefully as a whole if you want to achieve
>> the maximum performance. That's not any different from other targets
>> like real-time. A real-time enabled kernel does not magically create a
>> real-time system.
> [...]
>
> I think we are talking about two different program/libraries composition
> use-cases here.
>
> AFAIU, the aspect you are focused on is whether we should allow users of
> slice extension to nest. I agree with you that we should document this
> as unsupported, since the goal of slice extension is really for short
> spinlock critical sections, and nesting of those goes against that
> basic definition.
This is not about nesting. This is about the completely unrealistic idea
that combining random libraries will result in a functional optimized
system. If you want to ensure that nothing can disable it then implement
a syscall filter which rejects the disable command. That's user space
policy, not kernel side hardcoded policy.
> The concern I am raising here is different. It's about just _using_
> slice extension from various entities (program, libraries) within a
> process, without any nesting of slice extension requests.
>
> If libc successfully enables slice extension in its startup, the
> kernel should guarantee that it stays invariant for the lifetime
> of the program so libc can optimize its code accordingly, or use
> a fallback, without requiring additional per-thread variable checks
> in its fast paths.
Even if libc enables it and something else disables it, then the only
downside is that user space pointlessly does the request dance:
set_request()
critical_section()
clear_request()
if (granted()) // Guaranteed to be false
sys_rseq_slice_yield()
The resulting harm is that requests are ignored by the kernel, so the
"optimized" code is not getting what it expects and executes 3
instructions for nothing. That's all. So where is your problem?
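The request dance can be modeled with a toy sketch. It assumes a per-thread control word with a request flag and a granted flag, and a kernel side that only looks at the request when the feature is enabled for the thread. The class, the function names and the bookkeeping (who clears which flag) are simplified stand-ins, not the real rseq ABI:

```python
class SliceCtl:
    """Toy stand-in for the per-thread slice extension state."""

    def __init__(self, enabled):
        self.enabled = enabled    # set or cleared via the prctl in the real thing
        self.request = False
        self.granted = False
        self.yields = 0           # counts simulated sys_rseq_slice_yield() calls


def kernel_resched_on_exit(ctl):
    """Model a return to user space with a reschedule pending.

    Returns True if the task keeps the CPU because an extension was granted.
    """
    if ctl.enabled and ctl.request:
        ctl.request = False
        ctl.granted = True
        return True
    return False


def critical_section(ctl, body):
    """User space side of the dance shown above."""
    ctl.request = True            # set_request()
    body()                        # critical_section()
    ctl.request = False           # clear_request()
    if ctl.granted:               # never true once the feature is disabled
        ctl.granted = False
        ctl.yields += 1           # stands in for sys_rseq_slice_yield()
```

With the feature disabled, the only cost left in this model is setting and clearing the request flag plus one test that is never true.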
Thanks,
tglx
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2026-01-11 17:11 ` Thomas Gleixner
@ 2026-01-13 23:45 ` Florian Weimer
2026-01-14 21:59 ` Thomas Gleixner
2026-01-17 16:16 ` Mathieu Desnoyers
0 siblings, 2 replies; 78+ messages in thread
From: Florian Weimer @ 2026-01-13 23:45 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Mathieu Desnoyers, LKML, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva,
Waiman Long, carlos@redhat.com, Michael Jeanson
* Thomas Gleixner:
> I'm not completely opposed to making it process wide. For threads created
> after enablement, that's trivial because that can be done when the per
> thread RSEQ is registered. But when it gets enabled _after_ threads have
> been created already then we need code to chase the threads and enable
> it after the fact because we are not going to query the enablement in
> curr->mm::whatever just to have another conditional and another
> cacheline to access.
In glibc, we make sure that the registration for restartable sequences
happens before any user code (with the exception of IFUNC resolvers) can
run. This includes code from signal handlers. We started masking
signals on newly created threads for this reason, to make these
partially initialized states unobservable.
It's not clear to me what the expected outcome is. If we ever want to
offer deadline extension as a mutex attribute (for example), then we
have to switch this on at process start unconditionally because we don't
know if this new API will be used by the new process (potentially after
dlopen, so we can't even use things like analyzing the symbol
footprint ahead of time).
> The only option is to reject enablement when there is already more than
> one thread in the process, but there is a reasonable argument that a
> process might only enable it for a subset of threads, which have actual
> lock interaction and not bother with it for other things. I'm not seeing
> a reason to restrict the flexibility of configuration just because you
> envision magic use cases all over the place.
Sure, but it looks like this needs a custom/minimal libc. It's like
repurposing set_robust_list for something else. It can be done, but it
has a significant cost in terms of compatibility because some
functionality (that other libraries in the process depend on) will stop
working.
> On the other hand there is no guarantee that libc registers RSEQ when a
> thread is started as it can be disabled or not supported, so you have
> exactly the same problem there that the code which wants to use it needs
> to ensure that a RSEQ area is registered, no?
With glibc, if RSEQ is registered on the main thread, it will be
registered on all other threads, too. Technically, it's possible to
unregister RSEQ with the kernel, of course, but that's totally
undefined, like unmapping memory originally returned from malloc.
>>> The prctl allows you to query the state, so all parties can make
>>> informed decisions. It's not any different from other mechanisms, which
>>> require coordination between different parts.
>>
>> I'm fine with having prctl enable the feature (for the whole process)
>> and query its state.
>>
>> The part I'm concerned with is the prctl disabling the feature, as
>> we're losing the availability invariant after setup.
>
> close(0);
>
> has the same problem. How many instances of bugs in that area have you
> seen so far?
We've had significant issues due to incorrect close calls (maybe not
close(0) in particular, but definitely with double-closes removing
descriptors created by other threads).
We need the prctl to unregister for CRIU, though, otherwise CRIU won't
be able to use glibc directly (or would have to re-exec itself in a new
configuration).
Thanks,
Florian
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2026-01-13 23:45 ` Florian Weimer
@ 2026-01-14 21:59 ` Thomas Gleixner
2026-01-17 16:16 ` Mathieu Desnoyers
1 sibling, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2026-01-14 21:59 UTC (permalink / raw)
To: Florian Weimer
Cc: Mathieu Desnoyers, LKML, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva,
Waiman Long, carlos@redhat.com, Michael Jeanson
On Wed, Jan 14 2026 at 00:45, Florian Weimer wrote:
> * Thomas Gleixner:
>> I'm not completely opposed to making it process wide. For threads created
>> after enablement, that's trivial because that can be done when the per
>> thread RSEQ is registered. But when it gets enabled _after_ threads have
>> been created already then we need code to chase the threads and enable
>> it after the fact because we are not going to query the enablement in
>> curr->mm::whatever just to have another conditional and another
>> cacheline to access.
>
> In glibc, we make sure that the registration for restartable sequences
> happens before any user code (with the exception of IFUNC resolvers) can
> run. This includes code from signal handlers. We started masking
> signals on newly created threads for this reason, to make these
> partially initialized states unobservable.
>
> It's not clear to me what the expected outcome is. If we ever want to
> offer deadline extension as a mutex attribute (for example), then we
> have to switch this on at process start unconditionally because we don't
> know if this new API will be used by the new process (potentially after
> dlopen, so we can't even use things like analyzing the symbol
> footprint ahead of time).
Sure, but then you can enable it at each thread start, no?
>> The only option is to reject enablement when there is already more than
>> one thread in the process, but there is a reasonable argument that a
>> process might only enable it for a subset of threads, which have actual
>> lock interaction and not bother with it for other things. I'm not seeing
>> a reason to restrict the flexibility of configuration just because you
>> envision magic use cases all over the place.
>
> Sure, but it looks like this needs a custom/minimal libc. It's like
> repurposing set_robust_list for something else. It can be done, but it
> has a significant cost in terms of compatibility because some
> functionality (that other libraries in the process depend on) will stop
> working.
The kernel is not there to cater magic user space expectations. It
provides interfaces and the minimal amount of policy.
If glibc wants to use it for mutexes (for all the wrong reasons) then
glibc needs to take care of enabling it like it does for registering
RSEQ for each newly created thread.
If glibc does not and the application does care for their particular
concurrency control, then it is the application's problem to ensure that
it is enabled for the threads it cares about, right?
>> On the other hand there is no guarantee that libc registers RSEQ when a
>> thread is started as it can be disabled or not supported, so you have
>> exactly the same problem there that the code which wants to use it needs
>> to ensure that a RSEQ area is registered, no?
>
> With glibc, if RSEQ is registered on the main thread, it will be
> registered on all other threads, too. Technically, it's possible to
> unregister RSEQ with the kernel, of course, but that's totally
> undefined, like unmapping memory originally returned from malloc.
This is again user land policy. glibc decides to register RSEQ for each
new thread, but the kernel does not care whether it does or not.
>>>> The prctl allows you to query the state, so all parties can make
>>>> informed decisions. It's not any different from other mechanisms, which
>>>> require coordination between different parts.
>>>
>>> I'm fine with having prctl enable the feature (for the whole process)
>>> and query its state.
>>>
>>> The part I'm concerned with is the prctl disabling the feature, as
>>> we're losing the availability invariant after setup.
>>
>> close(0);
>>
>> has the same problem. How many instances of bugs in that area have you
>> seen so far?
>
> We've had significant issues due to incorrect close calls (maybe not
> close(0) in particular, but definitely with double-closes removing
> descriptors created by other threads).
That's again not a kernel problem. The primary UNIX design principle is
to allow user space to shoot itself in the foot. There is zero reason
to change that unless it's a justified security issue.
Time slice extension best effort magic does definitely not qualify for
that. It's harmless as the only side effect is that user space wastes
cycles...
Thanks,
tglx
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2025-12-19 10:05 ` Peter Zijlstra
@ 2026-01-16 18:15 ` Peter Zijlstra
2026-01-18 10:46 ` Thomas Gleixner
0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2026-01-16 18:15 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Ron Geva, Waiman Long
On Fri, Dec 19, 2025 at 11:05:17AM +0100, Peter Zijlstra wrote:
> I was thinking that perhaps the hrtimer tracepoints, filtered on this
> specific timer, might just do. Arming the timer is the point where the
> extension is granted, cancelling the timer is on the slice_yield() (or
> any other random syscall :/), and the timer actually firing is on fail.
Here, I google pasted this together. I don't actually speak much snake
(as you well know). Nor does it fully work; the handle_expire() thing is
busted, I definitely have expire entries in the trace, but they're not
showing up.
$ trace-cmd record -e hrtimer_start -e hrtimer_cancel -e hrtimer_expire_entry -- ./slice_test
$ ./foo.py
========================================
RSEQ SLICE HISTOGRAM (us)
========================================
Task: slice_test Mean: 375.577 ns
Latency (us) | Count
------------------------------
0 us | 142031
1 us | 292
2 us | 67
3 us | 33
4 us | 34
5 us | 27
6 us | 15
7 us | 14
8 us | 24
9 us | 33
10 us | 691
---
#!/usr/bin/python3
#
# trace-cmd record -e hrtimer_start -e hrtimer_cancel -e hrtimer_expire_entry -- $cmd
#
from tracecmd import *

def load_kallsyms(file_path='/proc/kallsyms'):
    """
    Parses /proc/kallsyms into a dictionary.
    Returns: { address_int: symbol_name }
    """
    kallsyms_map = {}

    try:
        with open(file_path, 'r') as f:
            for line in f:
                # The format is: [address] [type] [name] [module]
                parts = line.split()
                if len(parts) < 3:
                    continue

                addr = int(parts[0], 16)
                name = parts[2]

                kallsyms_map[addr] = name

    except PermissionError:
        print(f"Error: Permission denied reading {file_path}. Try running with sudo.")
    except FileNotFoundError:
        print(f"Error: {file_path} not found.")

    return kallsyms_map

ksyms = load_kallsyms()

# pending[timer_ptr] = {'ts': timestamp, 'comm': comm}
pending = {}

# histograms[comm][bucket] = count
histograms = {}

class OnlineHarmonicMean:
    def __init__(self):
        self.n = 0    # Count of elements
        self.S = 0.0  # Cumulative sum of reciprocals

    def update(self, x):
        if x == 0:
            raise ValueError("Harmonic mean is undefined for zero.")

        self.n += 1
        self.S += 1.0 / x
        return self.n / self.S

    @property
    def mean(self):
        return self.n / self.S if self.n > 0 else 0

ohms = {}

def handle_start(record):
    func_name = ksyms[record.num_field("function")]
    if "rseq_slice_expired" in func_name:
        timer_ptr = record.num_field("hrtimer")
        pending[timer_ptr] = {
            'ts': record.ts,
            'comm': record.comm
        }
    return None

def handle_cancel(record):
    timer_ptr = record.num_field("hrtimer")

    if timer_ptr in pending:
        start_data = pending.pop(timer_ptr)
        duration_ns = record.ts - start_data['ts']
        duration_us = duration_ns // 1000

        comm = start_data['comm']

        if comm not in ohms:
            ohms[comm] = OnlineHarmonicMean()

        ohms[comm].update(duration_ns)

        if comm not in histograms:
            histograms[comm] = {}

        histograms[comm][duration_us] = histograms[comm].get(duration_us, 0) + 1
    return None

def handle_expire(record):
    timer_ptr = record.num_field("hrtimer")

    if timer_ptr in pending:
        start_data = pending.pop(timer_ptr)
        comm = start_data['comm']

        if comm not in histograms:
            histograms[comm] = {}

        # Record -1 bucket for expired (failed to cancel)
        histograms[comm][-1] = histograms[comm].get(-1, 0) + 1
    return None

if __name__ == "__main__":
    t = Trace("trace.dat")
    for cpu in range(0, t.cpus):
        ev = t.read_event(cpu)
        while ev:
            if "hrtimer_start" in ev.name:
                handle_start(ev)
            if "hrtimer_cancel" in ev.name:
                handle_cancel(ev)
            if "hrtimer_expire_entry" in ev.name:
                handle_expire(ev)

            ev = t.read_event(cpu)

    print("\n" + "="*40)
    print("RSEQ SLICE HISTOGRAM (us)")
    print("="*40)
    for comm, buckets in histograms.items():
        print(f"\nTask: {comm} Mean: {ohms[comm].mean:.3f} ns")
        print(f" {'Latency (us)':<15} | {'Count'}")
        print(f" {'-'*30}")
        # Sort buckets numerically, putting -1 at the top
        for bucket in sorted(buckets.keys()):
            label = "EXPIRED" if bucket == -1 else f"{bucket} us"
            print(f" {label:<15} | {buckets[bucket]}")
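
A note on the "Mean" value printed above: the script accumulates reciprocals, so it reports a harmonic mean rather than an arithmetic one. That keeps the number close to the short durations even when the 10 us bucket is well populated. A minimal sketch with made-up durations (the sample values are assumptions, not taken from the trace):

```python
from statistics import harmonic_mean, mean

# Hypothetical cancel latencies in nanoseconds: mostly short, one 10 us outlier.
durations_ns = [300, 300, 300, 300, 10_000]

arith = mean(durations_ns)          # pulled up by the single outlier
harm = harmonic_mean(durations_ns)  # n / sum(1/x), same formula as OnlineHarmonicMean

print(f"arithmetic: {arith:.1f} ns, harmonic: {harm:.1f} ns")
```

So a sub-microsecond "Mean" does not by itself say much about the tail; the histogram buckets are the better place to look for that.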
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2025-12-18 23:21 ` Thomas Gleixner
2026-01-07 21:11 ` Mathieu Desnoyers
@ 2026-01-17 9:36 ` Peter Zijlstra
1 sibling, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2026-01-17 9:36 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Mathieu Desnoyers, LKML, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Ron Geva, Waiman Long,
Florian Weimer, carlos@redhat.com
On Fri, Dec 19, 2025 at 12:21:30AM +0100, Thomas Gleixner wrote:
> On Tue, Dec 16 2025 at 09:36, Mathieu Desnoyers wrote:
> > On 2025-12-15 13:24, Thomas Gleixner wrote:
> > [...]
> >> +The thread has to enable the functionality via prctl(2)::
> >> +
> >> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> >> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
> >
> > Although it is not documented, it appears that a thread can
> > also use this prctl to disable slice extension.
>
> Obviously. Controls are supposed to be symmetrical.
>
> > How is it meant to compose once we have libc trying to use slice
> > extension internally and the application also using it or wishing to
> > disable it, unaware that libc is also trying to use it ?
>
> Tons of prctls have the same "issue". What's so special about this?
So I've read this whole thread, and I'm with Thomas on this.
Yes this interface has sharp edges, but I don't think anything here
makes a case for adding more complexity.
As Thomas already stated; the very worst possible outcome is that slice
extensions are always denied -- this is a performance issue, not a
correctness issue.
To me it really reads like: Doctor, it hurts when I hit my hand with a
hammer.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 10/11] entry: Hook up rseq time slice extension
2026-01-11 11:01 ` Thomas Gleixner
@ 2026-01-17 9:51 ` Peter Zijlstra
0 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2026-01-17 9:51 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Mathieu Desnoyers, LKML, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Ron Geva, Waiman Long
On Sun, Jan 11, 2026 at 12:01:31PM +0100, Thomas Gleixner wrote:
> I know you argued about this many times, but I still maintain my point
> of view that TIF_PREEMPT and TIF_PREEMPT_LAZY are fundamentally different:
>
> TIF_PREEMPT_LAZY grants a non-RT task to complete until it reaches
> return to user
>
> TIF_PREEMPT enforces preemption at the next possible preemption
> point
This is only true for lazy preemption; and that is not the only possible
model.
> My main concern is this scenario:
>
> sched_other_task()
> request_slice_extension()
>
> ---> interrupt
> RT task is woken up
>
> return_to_user()
> grant_extension()
> ...
>
> which means the RT task is delayed until the OTHER task relinquishes the
> CPU voluntarily or via timeout.
Which is exactly the same as if there were a kernel preempt_disable()
region.
> So I prefer to keep the current semantics for RT. This can be revisited
> of course when a proper evaluation has been done, but IMO there are too
> many moving parts in a RT system to make this actually work correctly
> under all circumstances.
>
> I'll add proper comments to that effect.
I've added:
+/*
+ * Since rseq slice ext has a direct correlation to the worst case
+ * scheduling latency (schedule is delayed after all), only have it affect
+ * LAZY reschedules on PREEMPT_RT for now.
+ *
+ * However, since this delay is only applicable to userspace, a value
+ * for rseq_slice_extension_nsec that is strictly less than the worst case
+ * kernel space preempt_disable() region, should mean the scheduling latency
+ * is not affected, even for !LAZY.
+ *
+ * However, since this value depends on the hardware at hand, it cannot be
+ * pre-determined in any sensible way. Hence punt on this problem for now.
+ */
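
The argument in that comment can be put as a toy model, assuming (as the comment implies) that the user space delay and a kernel preempt_disable() region do not stack, so the worst case is simply the larger of the two. The function name and all numbers below are hypothetical; real values depend on the hardware at hand:

```python
def worst_case_latency_ns(preempt_off_ns: int, slice_ext_ns: int) -> int:
    """Toy model: a wakeup is delayed by whichever region happens to be active."""
    return max(preempt_off_ns, slice_ext_ns)

# Hypothetical machine whose longest preempt_disable() region is 40 us.
kernel_worst_ns = 40_000

for ext_ns in (10_000, 30_000, 50_000):
    raised = worst_case_latency_ns(kernel_worst_ns, ext_ns) > kernel_worst_ns
    verdict = "raises the worst case" if raised else "hidden behind the kernel worst case"
    print(f"extension {ext_ns} ns: {verdict}")
```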
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2025-12-15 16:52 ` [patch V6 07/11] rseq: Implement time slice extension enforcement timer Thomas Gleixner
` (5 preceding siblings ...)
2025-12-18 15:18 ` Peter Zijlstra
@ 2026-01-17 9:57 ` Peter Zijlstra
2026-01-23 17:38 ` Prakash Sangappa
2026-01-23 17:41 ` Prakash Sangappa
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
7 siblings, 2 replies; 78+ messages in thread
From: Peter Zijlstra @ 2026-01-17 9:57 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Ron Geva, Waiman Long
On Mon, Dec 15, 2025 at 05:52:22PM +0100, Thomas Gleixner wrote:
> +rseq_slice_extension_nsec
> +=========================
> +
> +A task can request to delay its scheduling if it is in a critical section
> +via the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sets the maximum
> +allowed extension in nanoseconds before scheduling of the task is enforced.
> +Default value is 30000ns (30us). The possible range is 10000ns (10us) to
> +50000ns (50us).
+
+This value has a direct correlation to the worst case scheduling latency;
+increment at your own risk.
> +unsigned int rseq_slice_ext_nsecs __read_mostly = 30 * NSEC_PER_USEC;
Changed default to 10us
Also, given the results of that slice_test thing, we might possibly get
away with a much lower value still.
Prakash, could you possibly capture a trace of hrtimer_start,
hrtimer_cancel and hrtimer_expire_entry for your Oracle workload and run
that python thing on it?
> +#ifdef CONFIG_SYSCTL
> +static const unsigned int rseq_slice_ext_nsecs_min = 10 * NSEC_PER_USEC;
> +static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC;
> +
> +static const struct ctl_table rseq_slice_ext_sysctl[] = {
> + {
> + .procname = "rseq_slice_extension_nsec",
> + .data = &rseq_slice_ext_nsecs,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_douintvec_minmax,
> + .extra1 = (unsigned int *)&rseq_slice_ext_nsecs_min,
> + .extra2 = (unsigned int *)&rseq_slice_ext_nsecs_max,
> + },
> +};
> +
> +static void rseq_slice_sysctl_init(void)
> +{
> + if (rseq_slice_extension_enabled())
> + register_sysctl_init("kernel", rseq_slice_ext_sysctl);
> +}
> +#else /* CONFIG_SYSCTL */
> +static inline void rseq_slice_sysctl_init(void) { }
> +#endif /* !CONFIG_SYSCTL */
And I was contemplating moving this to DebugFS rather than sysctl.
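
For reference, with the table as quoted the knob would show up as /proc/sys/kernel/rseq_slice_extension_nsec and only accept values in the 10us to 50us window. A minimal user space sketch that mirrors the bounds; the helper names are hypothetical, the path is derived from the procname and the "kernel" directory in the patch, and it would obviously change if this moves to DebugFS:

```python
from pathlib import Path

# Path derived from register_sysctl_init("kernel", ...) and .procname above;
# it only exists on a kernel with the series applied and the feature enabled.
SYSCTL = Path("/proc/sys/kernel/rseq_slice_extension_nsec")

NSEC_PER_USEC = 1000
EXT_MIN_NS = 10 * NSEC_PER_USEC
EXT_MAX_NS = 50 * NSEC_PER_USEC

def check_slice_ext_ns(value: int) -> int:
    """Mirror the min/max bounds of the quoted ctl_table (hypothetical helper)."""
    if not EXT_MIN_NS <= value <= EXT_MAX_NS:
        raise ValueError(f"{value} ns outside [{EXT_MIN_NS}, {EXT_MAX_NS}] ns")
    return value

def read_slice_ext_ns():
    """Return the current setting, or None when the knob is not available."""
    try:
        return int(SYSCTL.read_text())
    except OSError:
        return None

print("current setting:", read_slice_ext_ns())
```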
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2026-01-13 23:45 ` Florian Weimer
2026-01-14 21:59 ` Thomas Gleixner
@ 2026-01-17 16:16 ` Mathieu Desnoyers
2026-01-19 10:21 ` Peter Zijlstra
1 sibling, 1 reply; 78+ messages in thread
From: Mathieu Desnoyers @ 2026-01-17 16:16 UTC (permalink / raw)
To: Florian Weimer, Thomas Gleixner
Cc: LKML, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Randy Dunlap, Peter Zijlstra, Ron Geva, Waiman Long,
carlos@redhat.com, Michael Jeanson
On 2026-01-13 18:45, Florian Weimer wrote:
> * Thomas Gleixner:
>
>> I'm not completely opposed to make it process wide. For threads created
>> after enablement, that's trivial because that can be done when the per
>> thread RSEQ is registered. But when it gets enabled _after_ threads have
>> been created already then we need code to chase the threads and enable
>> it after the fact because we are not going to query the enablement in
>> curr->mm::whatever just to have another conditional and another
>> cacheline to access.
>
> In glibc, we make sure that the registration for restartable sequences
> happens before any user code (with the exception of IFUNC resolvers) can
> run. This includes code from signal handlers. We started masking
> signals on newly created threads for this reason, to make these
> partially initialized states unobservable.
>
> It's not clear to me what the expected outcome is. If we ever want to
> offer deadline extension as a mutex attribute (for example), then we
> have to switch this on at process start unconditionally because we don't
> know if this new API will be used by the new process (potentially after
> dlopen, so we can't even use things like analyzing the symbol
> footprint ahead of time).
>
>> The only option is to reject enablement when there is already more than
>> one thread in the process, but there is a reasonable argument that a
>> process might only enable it for a subset of threads, which have actual
>> lock interaction and not bother with it for other things. I'm not seeing
>> a reason to restrict the flexibility of configuration just because you
>> envision magic use cases all over the place.
>
> Sure, but it looks like this needs a custom/minimal libc. It's like
> repurposing set_robust_list for something else. It can be done, but it
> has a significant cost in terms of compatibility because some
> functionality (that other libraries in the process depend on) will stop
> working.
My main concern is about the overhead of added system calls at thread
creation. I recall that doing an additional rseq system call at thread
creation was analyzed thoroughly for performance regressions at the
libc level. I would not want to start requiring libc to issue a
handful of additional prctl system calls per thread creation for no good
reason.
I don't mind that much whether we enable slice extension per
process or per thread, but what I do mind in the case of per-thread
enabling is whether the enabling scheme can be batched, so a user
enables a set of rseq features in one go, ideally at rseq registration.
This is missing with the prctl approach proposed by Thomas.
If the enabling is per-process, it's not so bad because there is
already a lot happening on exec, so I would not mind a prctl that
much, but for per-thread enabling I see the many individual system
calls as an issue we need to address.
> We need the prctl to unregister for CRIU, though, otherwise CRIU won't
> be able to use glibc directly (or would have to re-exec itself in a new
> configuration).
Good point that the unregister is needed. Which means the prctl is
probably needed then. But it does not solve the "handful of prctl
per thread creation" issue, which probably calls for something more
at the rseq system call level.
So I wonder if we could extend the rseq thread registration to also
specify a set of "features to enable" somehow ? This would still be
per-thread, but would not require additional prctl on thread creation.
Thanks Florian and Thomas for your input, this helps me corner the
issue that's nagging at me.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2026-01-16 18:15 ` Peter Zijlstra
@ 2026-01-18 10:46 ` Thomas Gleixner
2026-01-19 10:01 ` Peter Zijlstra
0 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2026-01-18 10:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Ron Geva, Waiman Long
On Fri, Jan 16 2026 at 19:15, Peter Zijlstra wrote:
> On Fri, Dec 19, 2025 at 11:05:17AM +0100, Peter Zijlstra wrote:
>
>> I was thinking that perhaps the hrtimer tracepoints, filtered on this
>> specific timer, might just do. Arming the timer is the point where the
>> extension is granted, cancelling the timer is on the slice_yield() (or
>> any other random syscall :/), and the timer actually firing is on fail.
>
> Here, I google pasted this together. I don't actually speak much snake
> (as you well know). Nor does it fully work; the handle_expire() thing is
> busted, I definitely have expire entries in the trace, but they're not
> showing up.
You want the below. Then you get:
Task: slice_test Mean: 350.266 ns
Latency (us) | Count
------------------------------
EXPIRED | 238
0 us | 143189
1 us | 167
2 us | 26
3 us | 11
4 us | 28
5 us | 31
6 us | 22
7 us | 23
8 us | 32
9 us | 16
10 us | 35
Thanks
tglx
---
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1742,7 +1742,7 @@ static void __run_hrtimer(struct hrtimer
lockdep_assert_held(&cpu_base->lock);
- debug_deactivate(timer);
+ debug_hrtimer_deactivate(timer);
base->running = timer;
/*
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2026-01-18 10:46 ` Thomas Gleixner
@ 2026-01-19 10:01 ` Peter Zijlstra
0 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2026-01-19 10:01 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Ron Geva, Waiman Long
On Sun, Jan 18, 2026 at 11:46:18AM +0100, Thomas Gleixner wrote:
> On Fri, Jan 16 2026 at 19:15, Peter Zijlstra wrote:
> > On Fri, Dec 19, 2025 at 11:05:17AM +0100, Peter Zijlstra wrote:
> >
> >> I was thinking that perhaps the hrtimer tracepoints, filtered on this
> >> specific timer, might just do. Arming the timer is the point where the
> >> extension is granted, cancelling the timer is on the slice_yield() (or
> >> any other random syscall :/), and the timer actually firing is on fail.
> >
> > Here, I google pasted this together. I don't actually speak much snake
> > (as you well know). Nor does it fully work; the handle_expire() thing is
> > busted, I definitely have expire entries in the trace, but they're not
> > showing up.
>
> You want the below. Then you get:
>
> Task: slice_test Mean: 350.266 ns
> Latency (us) | Count
> ------------------------------
> EXPIRED | 238
> 0 us | 143189
> 1 us | 167
> 2 us | 26
> 3 us | 11
> 4 us | 28
> 5 us | 31
> 6 us | 22
> 7 us | 23
> 8 us | 32
> 9 us | 16
> 10 us | 35
>
> Thanks
>
> tglx
> ---
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1742,7 +1742,7 @@ static void __run_hrtimer(struct hrtimer
>
> lockdep_assert_held(&cpu_base->lock);
>
> - debug_deactivate(timer);
> + debug_hrtimer_deactivate(timer);
> base->running = timer;
D'0h.
I suppose doing this makes more sense than fixing the script. Want me to
write it up?
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2025-12-15 16:52 ` [patch V6 01/11] rseq: Add fields and constants for time slice extension Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 14:36 ` Mathieu Desnoyers
@ 2026-01-19 10:10 ` Peter Zijlstra
2026-01-22 10:16 ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2026-01-19 10:10 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, Randy Dunlap, Ron Geva, Waiman Long
On Mon, Dec 15, 2025 at 05:52:04PM +0100, Thomas Gleixner wrote:
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -23,9 +23,15 @@ enum rseq_flags {
> };
>
> enum rseq_cs_flags_bit {
> + /* Historical and unsupported bits */
> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
> + /* (3) Intentional gap to put new bits into a separate byte */
> +
> + /* User read only feature flags */
> + RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
> + RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
> };
Either 's/byte/nibble/' or 's/= [45]/= [89]/' I suppose.
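
To spell out the mismatch: bits 4 and 5 still live in byte 0, just in its upper nibble; a separate byte only starts at bit 8. A quick sketch (the helper name is made up for illustration, the bit numbers are the ones from the quoted enum):

```python
def byte_and_nibble(bit: int) -> tuple[int, int]:
    """Return (byte index, nibble index) that a flag bit number falls into."""
    return bit // 8, bit // 4

for name, bit in [("NO_RESTART_ON_MIGRATE", 2),
                  ("SLICE_EXT_AVAILABLE", 4),
                  ("SLICE_EXT_ENABLED", 5),
                  ("first bit of the next byte", 8)]:
    byte, nibble = byte_and_nibble(bit)
    print(f"bit {bit:2d} ({name}): byte {byte}, nibble {nibble}")
```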
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2026-01-17 16:16 ` Mathieu Desnoyers
@ 2026-01-19 10:21 ` Peter Zijlstra
2026-01-19 10:30 ` Mathieu Desnoyers
2026-01-19 10:46 ` Florian Weimer
0 siblings, 2 replies; 78+ messages in thread
From: Peter Zijlstra @ 2026-01-19 10:21 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Florian Weimer, Thomas Gleixner, LKML, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Ron Geva, Waiman Long, carlos@redhat.com,
Michael Jeanson
On Sat, Jan 17, 2026 at 05:16:16PM +0100, Mathieu Desnoyers wrote:
> My main concern is about the overhead of added system calls at thread
> creation. I recall that doing an additional rseq system call at thread
> creation was analyzed thoroughly for performance regressions at the
> libc level. I would not want to start requiring libc to issue a
> handful of additional prctl system calls per thread creation for no good
> reason.
A wee something like so?
That would allow registering rseq with RSEQ_FLAG_SLICE_EXT_DEFAULT_ON
set and if all the stars align, it will then have it on at the end.
---
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -424,7 +424,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
return 0;
}
- if (unlikely(flags))
+ if (unlikely(flags & ~(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)))
return -EINVAL;
if (current->rseq.usrptr) {
@@ -459,8 +459,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
if (!access_ok(rseq, rseq_len))
return -EFAULT;
- if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+ if (rseq_slice_extension_enabled() &&
+ flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)
+ rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+ }
scoped_user_write_access(rseq, efault) {
/*
@@ -488,6 +492,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
current->rseq.len = rseq_len;
current->rseq.sig = sig;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+ current->rseq.slice.state.enabled = !!(rseqfl & RSEQ_CS_FLAG_SLICE_EXT_ENABLED);
+#endif
+
/*
* If rseq was previously inactive, and has just been
* registered, ensure the cpu_id_start and cpu_id fields
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -19,7 +19,8 @@ enum rseq_cpu_id_state {
};
enum rseq_flags {
- RSEQ_FLAG_UNREGISTER = (1 << 0),
+ RSEQ_FLAG_UNREGISTER = (1 << 0),
+ RSEQ_FLAG_SLICE_EXT_DEFAULT_ON = (1 << 1),
};
enum rseq_cs_flags_bit {
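
The first hunk turns "reject any flag" into "reject any flag that is not known". A small Python sketch of that mask test, using the two values from the enum in the diff; the function name is hypothetical, and the unregister path is handled earlier in the syscall, so it is not modelled here:

```python
import errno

RSEQ_FLAG_UNREGISTER = 1 << 0            # existing uapi value
RSEQ_FLAG_SLICE_EXT_DEFAULT_ON = 1 << 1  # proposed in the diff above

def check_register_flags(flags: int) -> int:
    """Mimic `if (flags & ~(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)) return -EINVAL;`."""
    if flags & ~RSEQ_FLAG_SLICE_EXT_DEFAULT_ON:
        return -errno.EINVAL
    return 0

for flags in (0, RSEQ_FLAG_SLICE_EXT_DEFAULT_ON, RSEQ_FLAG_UNREGISTER, 1 << 2):
    print(f"flags={flags:#x} -> {check_register_flags(flags)}")
```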
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2026-01-19 10:21 ` Peter Zijlstra
@ 2026-01-19 10:30 ` Mathieu Desnoyers
2026-01-19 11:03 ` Peter Zijlstra
2026-01-19 10:46 ` Florian Weimer
1 sibling, 1 reply; 78+ messages in thread
From: Mathieu Desnoyers @ 2026-01-19 10:30 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Florian Weimer, Thomas Gleixner, LKML, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Ron Geva, Waiman Long, carlos@redhat.com,
Michael Jeanson
On 2026-01-19 05:21, Peter Zijlstra wrote:
> On Sat, Jan 17, 2026 at 05:16:16PM +0100, Mathieu Desnoyers wrote:
>
>> My main concern is about the overhead of added system calls at thread
>> creation. I recall that doing an additional rseq system call at thread
>> creation was analyzed thoroughly for performance regressions at the
>> libc level. I would not want to start requiring libc to issue a
>> handful of additional prctl system calls per thread creation for no good
>> reason.
>
> A wee something like so?
>
> That would allow registering rseq with RSEQ_FLAG_SLICE_EXT_DEFAULT_ON
> set and if all the stars align, it will then have it on at the end.
That's a very good step in the right direction. I just wonder how
userspace is expected to learn that it runs on a kernel which
accepts the RSEQ_FLAG_SLICE_EXT_DEFAULT_ON flag ?
I think it could expect it when getauxval for AT_RSEQ_FEATURE_SIZE
includes the slice ext field. This gives us a cheap way to know
from userspace whether this new flag is supported or not.
One nit below:
[...]
> - if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
> + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
> rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> + if (rseq_slice_extension_enabled() &&
> + flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)
I think you want to surround flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON with
parentheses () to have the expected operator priority.
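
For what it's worth, in C the bitwise & binds tighter than &&, so the unparenthesized form should already parse as intended; the parentheses mainly make that explicit to the reader. Python has the same relative ordering between `&` and `and`, so the equivalence can be checked there. The flag value is taken from the diff; the boolean is a stand-in for rseq_slice_extension_enabled():

```python
RSEQ_FLAG_SLICE_EXT_DEFAULT_ON = 1 << 1

def enabled_plain(feature_on: bool, flags: int) -> bool:
    # No parentheses: `&` is evaluated before `and`.
    return bool(feature_on and flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)

def enabled_parens(feature_on: bool, flags: int) -> bool:
    # Explicit parentheses: same result, clearer intent.
    return bool(feature_on and (flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON))

for feature_on in (False, True):
    for flags in range(4):
        assert enabled_plain(feature_on, flags) == enabled_parens(feature_on, flags)
```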
Thanks!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2026-01-19 10:21 ` Peter Zijlstra
2026-01-19 10:30 ` Mathieu Desnoyers
@ 2026-01-19 10:46 ` Florian Weimer
1 sibling, 0 replies; 78+ messages in thread
From: Florian Weimer @ 2026-01-19 10:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mathieu Desnoyers, Thomas Gleixner, LKML, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Ron Geva, Waiman Long, carlos@redhat.com,
Michael Jeanson
* Peter Zijlstra:
> On Sat, Jan 17, 2026 at 05:16:16PM +0100, Mathieu Desnoyers wrote:
>
>> My main concern is about the overhead of added system calls at thread
>> creation. I recall that doing an additional rseq system call at thread
>> creation was analyzed thoroughly for performance regressions at the
>> libc level. I would not want to start requiring libc to issue a
>> handful of additional prctl system calls per thread creation for no good
>> reason.
>
> A wee something like so?
>
> That would allow registering rseq with RSEQ_FLAG_SLICE_EXT_DEFAULT_ON
> set and if all the stars align, it will then have it on at the end.
I think this would work for glibc because it will only show up in
__rseq_flags if we set the flag on process startup, and then all threads
would get it. It doesn't matter that __rseq_flags is not per-thread.
Thanks,
Florian
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2026-01-19 10:30 ` Mathieu Desnoyers
@ 2026-01-19 11:03 ` Peter Zijlstra
2026-01-19 11:10 ` Mathieu Desnoyers
0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2026-01-19 11:03 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Florian Weimer, Thomas Gleixner, LKML, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Ron Geva, Waiman Long, carlos@redhat.com,
Michael Jeanson
On Mon, Jan 19, 2026 at 11:30:53AM +0100, Mathieu Desnoyers wrote:
> On 2026-01-19 05:21, Peter Zijlstra wrote:
> > On Sat, Jan 17, 2026 at 05:16:16PM +0100, Mathieu Desnoyers wrote:
> >
> > > My main concern is about the overhead of added system calls at thread
> > > creation. I recall that doing an additional rseq system call at thread
> > > creation was analyzed thoroughly for performance regressions at the
> > > libc level. I would not want to start requiring libc to issue a
> > > handful of additional prctl system calls per thread creation for no good
> > > reason.
> >
> > A wee something like so?
> >
> > That would allow registering rseq with RSEQ_FLAG_SLICE_EXT_DEFAULT_ON
> > set and if all the stars align, it will then have it on at the end.
>
> That's a very good step in the right direction. I just wonder how
> userspace is expected to learn that it runs on a kernel which
> accepts the RSEQ_FLAG_SLICE_EXT_DEFAULT_ON flag ?
>
> I think it could expect it when getauxval for AT_RSEQ_FEATURE_SIZE
> includes the slice ext field. This gives us a cheap way to know
> from userspace whether this new flag is supported or not.
struct rseq vs struct rseq_data. I don't think that slice field is
exposed on the user side of things.
I was thinking it could just try with the flag the first time, and then
record if that worked or not and use the 'correct' value for all future
rseq calls.
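
Roughly like this, as a sketch only; `register` is a hypothetical stand-in for whatever wraps the rseq syscall and returns 0 or a negative errno:

```python
import errno

RSEQ_FLAG_SLICE_EXT_DEFAULT_ON = 1 << 1  # value from the proposed uapi change

class RseqRegistrar:
    """Try the new flag once and remember the outcome for later threads."""

    def __init__(self, register):
        self._register = register  # callable(flags) -> 0 or -errno
        self._flags = RSEQ_FLAG_SLICE_EXT_DEFAULT_ON

    def register_thread(self) -> int:
        ret = self._register(self._flags)
        if ret == -errno.EINVAL and self._flags:
            # Assumption: EINVAL here means the kernel does not know the flag.
            # Drop it for this and all future registrations.
            self._flags = 0
            ret = self._register(0)
        return ret

# Fake kernel that predates the flag and rejects any non-zero flags value.
calls = []

def old_kernel(flags):
    calls.append(flags)
    return -errno.EINVAL if flags else 0

reg = RseqRegistrar(old_kernel)
first = reg.register_thread()   # costs one extra call on an old kernel
second = reg.register_thread()  # goes straight to flags == 0
```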
> One nit below:
>
> [...]
> > - if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
> > + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
> > rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> > + if (rseq_slice_extension_enabled() &&
> > + flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)
>
> I think you want to surround flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON with
> parentheses () to have the expected operator priority.
Moo, done (I added that rseq_slice_extension_enabled() test later).
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2026-01-19 11:03 ` Peter Zijlstra
@ 2026-01-19 11:10 ` Mathieu Desnoyers
2026-01-19 11:27 ` Peter Zijlstra
0 siblings, 1 reply; 78+ messages in thread
From: Mathieu Desnoyers @ 2026-01-19 11:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Florian Weimer, Thomas Gleixner, LKML, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Ron Geva, Waiman Long, carlos@redhat.com,
Michael Jeanson
On 2026-01-19 06:03, Peter Zijlstra wrote:
> On Mon, Jan 19, 2026 at 11:30:53AM +0100, Mathieu Desnoyers wrote:
>> On 2026-01-19 05:21, Peter Zijlstra wrote:
>>> On Sat, Jan 17, 2026 at 05:16:16PM +0100, Mathieu Desnoyers wrote:
>>>
>>>> My main concern is about the overhead of added system calls at thread
>>>> creation. I recall that doing an additional rseq system call at thread
>>>> creation was analyzed thoroughly for performance regressions at the
>>>> libc level. I would not want to start requiring libc to issue a
>>>> handful of additional prctl system calls per thread creation for no good
>>>> reason.
>>>
>>> A wee something like so?
>>>
>>> That would allow registering rseq with RSEQ_FLAG_SLICE_EXT_DEFAULT_ON
>>> set and if all the stars align, it will then have it on at the end.
>>
>> That's a very good step in the right direction. I just wonder how
>> userspace is expected to learn that it runs on a kernel which
>> accepts the RSEQ_FLAG_SLICE_EXT_DEFAULT_ON flag ?
>>
>> I think it could expect it when getauxval for AT_RSEQ_FEATURE_SIZE
>> includes the slice ext field. This gives us a cheap way to know
>> from userspace whether this new flag is supported or not.
>
> struct rseq vs struct rseq_data. I don't think that slice field is
> exposed on the user side of things.
Yes it is. (unless I'm missing something ?)
See the original patch of this thread at https://lore.kernel.org/lkml/20251215155708.669472597@linutronix.de/
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
[...]
@@ -142,6 +174,12 @@ struct rseq {
__u32 mm_cid;
/*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_slice_ctrl slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
>
> I was thinking it could just try with the flag the first time, and then
> record if that worked or not and use the 'correct' value for all future
> rseq calls.
That would work too, but would waste a system call on process startup in case
of failure. Not a big deal, but checking with getauxval would be better because
this information is already exported to userspace at program execution and
available without doing any system call.
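A minimal sketch of that getauxval() check, assuming the layout from the
patch quoted above (slice_ctrl directly after mm_cid). The mirror struct
rseq_layout_sketch and both helper names are hypothetical; the fallback
value 27 for AT_RSEQ_FEATURE_SIZE is the one from the uapi auxvec.h:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/auxv.h>

#ifndef AT_RSEQ_FEATURE_SIZE
# define AT_RSEQ_FEATURE_SIZE	27	/* from uapi linux/auxvec.h */
#endif

/* Hypothetical mirror of the struct rseq fields up to slice_ctrl. */
struct rseq_layout_sketch {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;
	uint32_t flags;
	uint32_t node_id;
	uint32_t mm_cid;
	uint32_t slice_ctrl;	/* struct rseq_slice_ctrl is one __u32 wide */
};

static_assert(offsetof(struct rseq_layout_sketch, slice_ctrl) == 28,
	      "slice_ctrl is expected at byte offset 28 in this sketch");

/* True if a feature size reported by the kernel covers slice_ctrl. */
static inline bool feature_size_covers_slice_ctrl(unsigned long feature_size)
{
	return feature_size >=
	       offsetof(struct rseq_layout_sketch, slice_ctrl) + sizeof(uint32_t);
}

/* No system call involved: the auxiliary vector is set up at execve(). */
static inline bool kernel_has_slice_ctrl(void)
{
	return feature_size_covers_slice_ctrl(getauxval(AT_RSEQ_FEATURE_SIZE));
}
```

A libc could evaluate kernel_has_slice_ctrl() once at startup and only
then pass the new flag at rseq registration. getauxval() returns 0 for an
unknown entry, which the helper treats as "not supported".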
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 01/11] rseq: Add fields and constants for time slice extension
2026-01-19 11:10 ` Mathieu Desnoyers
@ 2026-01-19 11:27 ` Peter Zijlstra
0 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2026-01-19 11:27 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Florian Weimer, Thomas Gleixner, LKML, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
Randy Dunlap, Ron Geva, Waiman Long, carlos@redhat.com,
Michael Jeanson
On Mon, Jan 19, 2026 at 12:10:27PM +0100, Mathieu Desnoyers wrote:
> > struct rseq vs struct rseq_data. I don't think that slice field is
> > exposed on the user side of things.
>
> Yes it is. (unless I'm missing something ?)
*sigh*, I was looking on the wrong machine, that didn't have the patches
applied :-(
I need to go wake up or something...
^ permalink raw reply [flat|nested] 78+ messages in thread
* [tip: sched/core] selftests/rseq: Implement time slice extension test
2025-12-15 16:52 ` [patch V6 11/11] selftests/rseq: Implement time slice extension test Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
@ 2026-01-22 10:15 ` tip-bot2 for Thomas Gleixner
1 sibling, 0 replies; 78+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-01-22 10:15 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 830969e7821af377bdc1bb016929ff28c78490e8
Gitweb: https://git.kernel.org/tip/830969e7821af377bdc1bb016929ff28c78490e8
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 15 Dec 2025 17:52:34 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:19 +01:00
selftests/rseq: Implement time slice extension test
Provide an initial test case to evaluate the functionality. This needs to be
extended to cover the ABI violations and expose the race condition between
observing granted and arriving in rseq_slice_yield().
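The user space side of the protocol exercised by this test can be
sketched roughly as below. This is a simplified model, not the selftest
itself: struct slice_ctrl_sketch mirrors rseq_abi_slice_ctrl, the helper
names are hypothetical, and volatile accesses stand in for
RSEQ_READ_ONCE()/RSEQ_WRITE_ONCE():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of struct rseq_abi_slice_ctrl from the patch. */
struct slice_ctrl_sketch {
	union {
		uint32_t all;
		struct {
			uint8_t  request;
			uint8_t  granted;
			uint16_t reserved;
		};
	};
};

static_assert(sizeof(struct slice_ctrl_sketch) == sizeof(uint32_t),
	      "control word must stay one 32-bit value");

/* Entering the critical section: ask for an extension. */
static inline void slice_cs_enter(volatile struct slice_ctrl_sketch *c)
{
	c->request = 1;
}

/*
 * Leaving the critical section: drop the request and report whether the
 * kernel granted an extension in the meantime. If it did, the caller is
 * expected to hand the CPU back via rseq_slice_yield().
 */
static inline bool slice_cs_leave(volatile struct slice_ctrl_sketch *c)
{
	c->request = 0;
	return c->granted != 0;
}
```

User space only ever writes request; granted is read-only from its point
of view, which matches the kernel-doc comment on the structure.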
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155709.320325431@linutronix.de
---
tools/testing/selftests/rseq/.gitignore | 1 +-
tools/testing/selftests/rseq/Makefile | 5 +-
tools/testing/selftests/rseq/rseq-abi.h | 27 +++-
tools/testing/selftests/rseq/slice_test.c | 219 +++++++++++++++++++++-
4 files changed, 251 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/rseq/slice_test.c
diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index 0fda241..ec01d16 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -10,3 +10,4 @@ param_test_mm_cid
param_test_mm_cid_benchmark
param_test_mm_cid_compare_twice
syscall_errors_test
+slice_test
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 0d0a5fa..4ef9082 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
param_test_benchmark param_test_compare_twice param_test_mm_cid \
param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
- syscall_errors_test
+ syscall_errors_test slice_test
TEST_GEN_PROGS_EXTENDED = librseq.so
@@ -59,3 +59,6 @@ $(OUTPUT)/param_test_mm_cid_compare_twice: param_test.c $(TEST_GEN_PROGS_EXTENDE
$(OUTPUT)/syscall_errors_test: syscall_errors_test.c $(TEST_GEN_PROGS_EXTENDED) \
rseq.h rseq-*.h
$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
+ $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
diff --git a/tools/testing/selftests/rseq/rseq-abi.h b/tools/testing/selftests/rseq/rseq-abi.h
index fb4ec8a..ecef315 100644
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -53,6 +53,27 @@ struct rseq_abi_cs {
__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));
+/**
+ * rseq_abi_slice_ctrl - Time slice extension control structure
+ * @all: Compound value
+ * @request: Request for a time slice extension
+ * @granted: Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_abi_slice_ctrl {
+ union {
+ __u32 all;
+ struct {
+ __u8 request;
+ __u8 granted;
+ __u16 __reserved;
+ };
+ };
+};
+
/*
* struct rseq_abi is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
@@ -165,6 +186,12 @@ struct rseq_abi {
__u32 mm_cid;
/*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_abi_slice_ctrl slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
diff --git a/tools/testing/selftests/rseq/slice_test.c b/tools/testing/selftests/rseq/slice_test.c
new file mode 100644
index 0000000..357122d
--- /dev/null
+++ b/tools/testing/selftests/rseq/slice_test.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include <linux/prctl.h>
+#include <sys/prctl.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+#include "../kselftest_harness.h"
+
+#ifndef __NR_rseq_slice_yield
+# define __NR_rseq_slice_yield 471
+#endif
+
+#define BITS_PER_INT 32
+#define BITS_PER_BYTE 8
+
+#ifndef PR_RSEQ_SLICE_EXTENSION
+# define PR_RSEQ_SLICE_EXTENSION 79
+# define PR_RSEQ_SLICE_EXTENSION_GET 1
+# define PR_RSEQ_SLICE_EXTENSION_SET 2
+# define PR_RSEQ_SLICE_EXT_ENABLE 0x01
+#endif
+
+#ifndef RSEQ_SLICE_EXT_REQUEST_BIT
+# define RSEQ_SLICE_EXT_REQUEST_BIT 0
+# define RSEQ_SLICE_EXT_GRANTED_BIT 1
+#endif
+
+#ifndef asm_inline
+# define asm_inline asm __inline
+#endif
+
+#define NSEC_PER_SEC 1000000000L
+#define NSEC_PER_USEC 1000L
+
+struct noise_params {
+ int64_t noise_nsecs;
+ int64_t sleep_nsecs;
+ int64_t run;
+};
+
+FIXTURE(slice_ext)
+{
+ pthread_t noise_thread;
+ struct noise_params noise_params;
+};
+
+FIXTURE_VARIANT(slice_ext)
+{
+ int64_t total_nsecs;
+ int64_t slice_nsecs;
+ int64_t noise_nsecs;
+ int64_t sleep_nsecs;
+ bool no_yield;
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n2_2_50)
+{
+ .total_nsecs = 5LL * NSEC_PER_SEC,
+ .slice_nsecs = 2LL * NSEC_PER_USEC,
+ .noise_nsecs = 2LL * NSEC_PER_USEC,
+ .sleep_nsecs = 50LL * NSEC_PER_USEC,
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n50_2_50)
+{
+ .total_nsecs = 5LL * NSEC_PER_SEC,
+ .slice_nsecs = 50LL * NSEC_PER_USEC,
+ .noise_nsecs = 2LL * NSEC_PER_USEC,
+ .sleep_nsecs = 50LL * NSEC_PER_USEC,
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n2_2_50_no_yield)
+{
+ .total_nsecs = 5LL * NSEC_PER_SEC,
+ .slice_nsecs = 2LL * NSEC_PER_USEC,
+ .noise_nsecs = 2LL * NSEC_PER_USEC,
+ .sleep_nsecs = 50LL * NSEC_PER_USEC,
+ .no_yield = true,
+};
+
+
+static inline bool elapsed(struct timespec *start, struct timespec *now,
+ int64_t span)
+{
+ int64_t delta = now->tv_sec - start->tv_sec;
+
+ delta *= NSEC_PER_SEC;
+ delta += now->tv_nsec - start->tv_nsec;
+ return delta >= span;
+}
+
+static void *noise_thread(void *arg)
+{
+ struct noise_params *p = arg;
+
+ while (RSEQ_READ_ONCE(p->run)) {
+ struct timespec ts_start, ts_now;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+ do {
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_start, &ts_now, p->noise_nsecs));
+
+ ts_start.tv_sec = 0;
+ ts_start.tv_nsec = p->sleep_nsecs;
+ clock_nanosleep(CLOCK_MONOTONIC, 0, &ts_start, NULL);
+ }
+ return NULL;
+}
+
+FIXTURE_SETUP(slice_ext)
+{
+ cpu_set_t affinity;
+
+ ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0);
+
+ /* Pin it on a single CPU. Avoid CPU 0 */
+ for (int i = 1; i < CPU_SETSIZE; i++) {
+ if (!CPU_ISSET(i, &affinity))
+ continue;
+
+ CPU_ZERO(&affinity);
+ CPU_SET(i, &affinity);
+ ASSERT_EQ(sched_setaffinity(0, sizeof(affinity), &affinity), 0);
+ break;
+ }
+
+ ASSERT_EQ(rseq_register_current_thread(), 0);
+
+ ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0);
+
+ self->noise_params.noise_nsecs = variant->noise_nsecs;
+ self->noise_params.sleep_nsecs = variant->sleep_nsecs;
+ self->noise_params.run = 1;
+
+ ASSERT_EQ(pthread_create(&self->noise_thread, NULL, noise_thread, &self->noise_params), 0);
+}
+
+FIXTURE_TEARDOWN(slice_ext)
+{
+ self->noise_params.run = 0;
+ pthread_join(self->noise_thread, NULL);
+}
+
+TEST_F(slice_ext, slice_test)
+{
+ unsigned long success = 0, yielded = 0, scheduled = 0, raced = 0;
+ unsigned long total = 0, aborted = 0;
+ struct rseq_abi *rs = rseq_get_abi();
+ struct timespec ts_start, ts_now;
+
+ ASSERT_NE(rs, NULL);
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+ do {
+ struct timespec ts_cs;
+ bool req = false;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_cs);
+
+ total++;
+ RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 1);
+ do {
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_cs, &ts_now, variant->slice_nsecs));
+
+ /*
+ * request can be cleared unconditionally, but for making
+ * the stats work this is actually checking it first
+ */
+ if (RSEQ_READ_ONCE(rs->slice_ctrl.request)) {
+ RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 0);
+ /* Race between check and clear! */
+ req = true;
+ success++;
+ }
+
+ if (RSEQ_READ_ONCE(rs->slice_ctrl.granted)) {
+ /* The above raced against a late grant */
+ if (req)
+ success--;
+ if (variant->no_yield) {
+ syscall(__NR_getpid);
+ aborted++;
+ } else {
+ yielded++;
+ if (!syscall(__NR_rseq_slice_yield))
+ raced++;
+ }
+ } else {
+ if (!req)
+ scheduled++;
+ }
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_start, &ts_now, variant->total_nsecs));
+
+ printf("# Total %12lu\n", total);
+ printf("# Success %12lu\n", success);
+ printf("# Yielded %12lu\n", yielded);
+ printf("# Aborted %12lu\n", aborted);
+ printf("# Scheduled %12lu\n", scheduled);
+ printf("# Raced %12lu\n", raced);
+}
+
+TEST_HARNESS_MAIN
^ permalink raw reply related [flat|nested] 78+ messages in thread
* [tip: sched/core] entry: Hook up rseq time slice extension
2025-12-15 16:52 ` [patch V6 10/11] entry: Hook up rseq time slice extension Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 15:37 ` Mathieu Desnoyers
@ 2026-01-22 10:15 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 78+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-01-22 10:15 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 3c78aaec19b0621bf952756670c8b066a55202fe
Gitweb: https://git.kernel.org/tip/3c78aaec19b0621bf952756670c8b066a55202fe
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 15 Dec 2025 17:52:31 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:19 +01:00
entry: Hook up rseq time slice extension
Wire the grant decision function up in exit_to_user_mode_loop()
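The effect of the two TIF masks in this patch can be illustrated with a
small user space model. The bit values and the SK_ names below are
hypothetical (the real TIF bits are architecture specific, and the real
work mask contains more bits); only the mask arithmetic follows the
patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, for illustration only. */
#define SK_NEED_RESCHED		(1UL << 0)
#define SK_NEED_RESCHED_LAZY	(1UL << 1)
#define SK_SIGPENDING		(1UL << 2)
#define SK_NOTIFY_RESUME	(1UL << 3)
#define SK_EXIT_WORK		(SK_NEED_RESCHED | SK_NEED_RESCHED_LAZY | \
				 SK_SIGPENDING | SK_NOTIFY_RESUME)

static_assert((SK_EXIT_WORK & ~(SK_NEED_RESCHED | SK_NEED_RESCHED_LAZY)) != 0,
	      "the deny mask must not be empty");

/* Reschedule bits which an extension may defer, per preemption model. */
static inline unsigned long slice_sched_bits(bool preempt_rt)
{
	return preempt_rt ? SK_NEED_RESCHED_LAZY
			  : (SK_NEED_RESCHED | SK_NEED_RESCHED_LAZY);
}

/* Non-zero result: some pending work forbids granting an extension. */
static inline unsigned long slice_deny_bits(unsigned long ti_work,
					    bool preempt_rt)
{
	return ti_work & (SK_EXIT_WORK & ~slice_sched_bits(preempt_rt));
}
```

On PREEMPT_RT a plain NEED_RESCHED therefore lands in the deny mask, so
only LAZY reschedules can be deferred there.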
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155709.258157362@linutronix.de
---
kernel/entry/common.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 5c792b3..9ef63e4 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -17,6 +17,27 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK)
#endif
+/* TIF bits, which prevent a time slice extension. */
+#ifdef CONFIG_PREEMPT_RT
+/*
+ * Since rseq slice ext has a direct correlation to the worst case
+ * scheduling latency (schedule is delayed after all), only have it affect
+ * LAZY reschedules on PREEMPT_RT for now.
+ *
+ * However, since this delay is only applicable to userspace, a value
+ * for rseq_slice_extension_nsec that is strictly less than the worst case
+ * kernel space preempt_disable() region, should mean the scheduling latency
+ * is not affected, even for !LAZY.
+ *
+ * However, since this value depends on the hardware at hand, it cannot be
+ * pre-determined in any sensible way. Hence punt on this problem for now.
+ */
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY)
+#else
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
+#endif
+#define TIF_SLICE_EXT_DENY (EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED)
+
static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
@@ -28,8 +49,10 @@ static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *re
local_irq_enable_exit_to_user(ti_work);
- if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
- schedule();
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+ if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+ schedule();
+ }
if (ti_work & _TIF_UPROBE)
uprobe_notify_resume(regs);
^ permalink raw reply related [flat|nested] 78+ messages in thread
* [tip: sched/core] rseq: Implement rseq_grant_slice_extension()
2025-12-15 16:52 ` [patch V6 09/11] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 15:25 ` Mathieu Desnoyers
@ 2026-01-22 10:15 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 78+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-01-22 10:15 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: dfb630f548a7c715efb0651c6abf334dca75cd52
Gitweb: https://git.kernel.org/tip/dfb630f548a7c715efb0651c6abf334dca75cd52
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 15 Dec 2025 17:52:28 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:18 +01:00
rseq: Implement rseq_grant_slice_extension()
Provide the actual decision function, which decides whether a time slice
extension is granted in the exit to user mode path when NEED_RESCHED is
evaluated.
The decision is made in two stages. First an inline quick check to avoid
going into the actual decision function. This checks whether:
#1 the functionality is enabled
#2 the exit is a return from interrupt to user mode
#3 any TIF bit, which causes extra work is set. That includes TIF_RSEQ,
which means the task was already scheduled out.
The slow path, which implements the actual user space ABI, is invoked
when:
A) #1 is true, #2 is true and #3 is false
It checks whether user space requested a slice extension by setting
the request bit in the rseq slice_ctrl field. If so, it grants the
extension and stores the slice expiry time, so that the actual exit
code can double check whether the slice is already exhausted before
going back.
B) #1 - #3 are true _and_ a slice extension was granted in a previous
loop iteration
In this case the grant is revoked.
If the user space access faults or invalid state is detected, the
task is terminated with SIGSEGV.
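The decision flow can be modelled in user space roughly as follows. This
is a simplified sketch of the logic in the diff below, not kernel code:
the enum, the struct and slice_decide() are hypothetical names, and it
assumes that the enabled and granted bits share one state word, so a
stale grant is still processed when the exit is not a return from
interrupt:

```c
#include <assert.h>
#include <stdbool.h>

enum slice_decision { SLICE_NONE, SLICE_GRANT, SLICE_REVOKE };

struct slice_inputs {
	bool enabled;		/* #1: the task enabled the feature */
	bool user_irq;		/* #2: return from interrupt to user mode */
	bool work_pending;	/* #3: a TIF bit besides the resched bits is set */
	bool granted;		/* an extension was granted in an earlier iteration */
	bool request;		/* user space set slice_ctrl.request */
};

static inline enum slice_decision slice_decide(const struct slice_inputs *in)
{
	assert(in);

	/* Quick check: not active for this exit and nothing to revoke. */
	if (!(in->enabled && in->user_irq) && !in->granted)
		return SLICE_NONE;

	/* Extra work or an earlier grant: clear slice_ctrl, drop the grant. */
	if (in->work_pending || in->granted)
		return SLICE_REVOKE;

	/* Only now is the request bit in user memory consulted. */
	return in->request ? SLICE_GRANT : SLICE_NONE;
}
```

Only SLICE_GRANT corresponds to a true return value of
rseq_grant_slice_extension(); in both other cases the caller goes on to
schedule().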
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155709.195303303@linutronix.de
---
include/linux/rseq_entry.h | 108 ++++++++++++++++++++++++++++++++++++-
1 file changed, 108 insertions(+)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index fc77b9d..cbc4a79 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -42,6 +42,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
#ifdef CONFIG_RSEQ
#include <linux/jump_label.h>
#include <linux/rseq.h>
+#include <linux/sched/signal.h>
#include <linux/uaccess.h>
#include <linux/tracepoint-defs.h>
@@ -109,10 +110,116 @@ static __always_inline void rseq_slice_clear_grant(struct task_struct *t)
t->rseq.slice.state.granted = false;
}
+static __always_inline bool rseq_grant_slice_extension(bool work_pending)
+{
+ struct task_struct *curr = current;
+ struct rseq_slice_ctrl usr_ctrl;
+ union rseq_slice_state state;
+ struct rseq __user *rseq;
+
+ if (!rseq_slice_extension_enabled())
+ return false;
+
+ /* If not enabled or not a return from interrupt, nothing to do. */
+ state = curr->rseq.slice.state;
+ state.enabled &= curr->rseq.event.user_irq;
+ if (likely(!state.state))
+ return false;
+
+ rseq = curr->rseq.usrptr;
+ scoped_user_rw_access(rseq, efault) {
+
+ /*
+ * Quick check conditions where a grant is not possible or
+ * needs to be revoked.
+ *
+ * 1) Any TIF bit which needs to do extra work aside of
+ * rescheduling prevents a grant.
+ *
+ * 2) A previous rescheduling request resulted in a slice
+ * extension grant.
+ */
+ if (unlikely(work_pending || state.granted)) {
+ /* Clear user control unconditionally. No point for checking */
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+ rseq_slice_clear_grant(curr);
+ return false;
+ }
+
+ unsafe_get_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault);
+ if (likely(!(usr_ctrl.request)))
+ return false;
+
+ /* Grant the slice extension */
+ usr_ctrl.request = 0;
+ usr_ctrl.granted = 1;
+ unsafe_put_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault);
+ }
+
+ rseq_stat_inc(rseq_stats.s_granted);
+
+ curr->rseq.slice.state.granted = true;
+ /* Store expiry time for arming the timer on the way out */
+ curr->rseq.slice.expires = data_race(rseq_slice_ext_nsecs) + ktime_get_mono_fast_ns();
+ /*
+ * This is racy against a remote CPU setting TIF_NEED_RESCHED in
+ * several ways:
+ *
+ * 1)
+ * CPU0 CPU1
+ * clear_tsk()
+ * set_tsk()
+ * clear_preempt()
+ * Raise scheduler IPI on CPU0
+ * --> IPI
+ * fold_need_resched() -> Folds correctly
+ * 2)
+ * CPU0 CPU1
+ * set_tsk()
+ * clear_tsk()
+ * clear_preempt()
+ * Raise scheduler IPI on CPU0
+ * --> IPI
+ * fold_need_resched() <- NOOP as TIF_NEED_RESCHED is false
+ *
+ * #1 is not any different from a regular remote reschedule as it
+ * sets the previously not set bit and then raises the IPI which
+ * folds it into the preempt counter
+ *
+ * #2 is obviously incorrect from a scheduler POV, but it's not
+ * differently incorrect than the code below clearing the
+ * reschedule request with the safety net of the timer.
+ *
+ * The important part is that the clearing is protected against the
+ * scheduler IPI and also against any other interrupt which might
+ * end up waking up a task and setting the bits in the middle of
+ * the operation:
+ *
+ * clear_tsk()
+ * ---> Interrupt
+ * wakeup_on_this_cpu()
+ * set_tsk()
+ * set_preempt()
+ * clear_preempt()
+ *
+ * which would be inconsistent state.
+ */
+ scoped_guard(irq) {
+ clear_tsk_need_resched(curr);
+ clear_preempt_need_resched();
+ }
+ return true;
+
+efault:
+ force_sig(SIGSEGV);
+ return false;
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
static inline bool rseq_arm_slice_extension_timer(void) { return false; }
static inline void rseq_slice_clear_grant(struct task_struct *t) { }
+static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -671,6 +778,7 @@ static inline void rseq_syscall_exit_to_user_mode(void) { }
static inline void rseq_irqentry_exit_to_user_mode(void) { }
static inline void rseq_exit_to_user_mode_legacy(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
+static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
^ permalink raw reply related [flat|nested] 78+ messages in thread
* [tip: sched/core] rseq: Reset slice extension when scheduled
2025-12-15 16:52 ` [patch V6 08/11] rseq: Reset slice extension when scheduled Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 15:17 ` Mathieu Desnoyers
@ 2026-01-22 10:16 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 78+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-01-22 10:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 7ee58f98b59b0ec32ea8a92f0bc85cb46fcd3de3
Gitweb: https://git.kernel.org/tip/7ee58f98b59b0ec32ea8a92f0bc85cb46fcd3de3
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 15 Dec 2025 17:52:26 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:18 +01:00
rseq: Reset slice extension when scheduled
When a time slice extension was granted in the need_resched() check on exit
to user space, the task can still be scheduled out in one of the other
pending work items. When it gets scheduled back in, and need_resched() is
not set, then the stale grant would be preserved, which is just wrong.
RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the
critical section and ID update mechanisms.
Utilize them and clear the user space slice control member of struct rseq
unconditionally within the existing user access sections. That's just an
unconditional store more in that path.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251215155709.131081527@linutronix.de
---
include/linux/rseq_entry.h | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 8d04611..fc77b9d 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -102,9 +102,17 @@ static __always_inline bool rseq_arm_slice_extension_timer(void)
return __rseq_arm_slice_extension_timer();
}
+static __always_inline void rseq_slice_clear_grant(struct task_struct *t)
+{
+ if (IS_ENABLED(CONFIG_RSEQ_STATS) && t->rseq.slice.state.granted)
+ rseq_stat_inc(rseq_stats.s_revoked);
+ t->rseq.slice.state.granted = false;
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
static inline bool rseq_arm_slice_extension_timer(void) { return false; }
+static inline void rseq_slice_clear_grant(struct task_struct *t) { }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -391,8 +399,15 @@ bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
if (csaddr)
unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
+
+ /* Open coded, so it's in the same user access region */
+ if (rseq_slice_extension_enabled()) {
+ /* Unconditionally clear it, no point in conditionals */
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+ }
}
+ rseq_slice_clear_grant(t);
/* Cache the new values */
t->rseq.ids.cpu_cid = ids->cpu_cid;
rseq_stat_inc(rseq_stats.ids);
@@ -488,8 +503,17 @@ static __always_inline bool rseq_exit_user_update(struct pt_regs *regs, struct t
*/
u64 csaddr;
- if (unlikely(get_user_inline(csaddr, &rseq->rseq_cs)))
- return false;
+ scoped_user_rw_access(rseq, efault) {
+ unsafe_get_user(csaddr, &rseq->rseq_cs, efault);
+
+ /* Open coded, so it's in the same user access region */
+ if (rseq_slice_extension_enabled()) {
+ /* Unconditionally clear it, no point in conditionals */
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+ }
+ }
+
+ rseq_slice_clear_grant(t);
if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
@@ -505,6 +529,8 @@ static __always_inline bool rseq_exit_user_update(struct pt_regs *regs, struct t
u32 node_id = cpu_to_node(ids.cpu_id);
return rseq_update_usr(t, regs, &ids, node_id);
+efault:
+ return false;
}
static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
^ permalink raw reply related [flat|nested] 78+ messages in thread
* [tip: sched/core] rseq: Implement time slice extension enforcement timer
2025-12-15 16:52 ` [patch V6 07/11] rseq: Implement time slice extension enforcement timer Thomas Gleixner
` (6 preceding siblings ...)
2026-01-17 9:57 ` Peter Zijlstra
@ 2026-01-22 10:16 ` tip-bot2 for Thomas Gleixner
7 siblings, 0 replies; 78+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-01-22 10:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 0ac3b5c3dc45085b28a10ee730fb2860841f08ef
Gitweb: https://git.kernel.org/tip/0ac3b5c3dc45085b28a10ee730fb2860841f08ef
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 15 Dec 2025 17:52:22 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:18 +01:00
rseq: Implement time slice extension enforcement timer
If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.
It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:
1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
independently of CONFIG_HIGHRES_TIMERS
2) HRTICK usage in the scheduler can be runtime disabled or is only used
for certain aspects of scheduling.
3) The function is calling into the scheduler code and that might have
unexpected consequences when this is invoked due to a time slice
enforcement expiry. Especially when the task managed to clear the
grant via sched_yield(0).
It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.
Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.
The timer is armed when an extension was granted right before actually
returning to user mode in rseq_exit_to_user_mode_restart().
It is disarmed, when the task relinquishes the CPU. This is expensive as
the timer is probably the first expiring timer on the CPU, which means it
has to reprogram the hardware. But that's less expensive than going through
a full hrtimer interrupt cycle for nothing.
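The expiry handling around the timer can be sketched with plain
arithmetic. The helper names and the SK_ constants are hypothetical; the
10us default is the one stated in the sysctl documentation hunk below,
and the comparison mirrors the check in
__rseq_arm_slice_extension_timer():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_NSEC_PER_USEC	1000ULL
#define SK_SLICE_EXT_DEFAULT	(10 * SK_NSEC_PER_USEC)

static_assert(SK_SLICE_EXT_DEFAULT == 10000, "default extension is 10us");

/* At grant time: remember when the extension runs out. */
static inline uint64_t slice_expiry(uint64_t now_ns, uint64_t ext_ns)
{
	return now_ns + ext_ns;
}

/*
 * Right before returning to user mode: true means the timer gets armed
 * for expires_ns, false means the grant already ran out and the task has
 * to reschedule instead of going out to user space.
 */
static inline bool slice_should_arm(uint64_t expires_ns, uint64_t now_ns)
{
	return expires_ns >= now_ns;
}
```

The second helper covers the case where the remaining exit work took
longer than the extension itself, so no timer is needed at all.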
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251215155709.068329497@linutronix.de
---
Documentation/admin-guide/sysctl/kernel.rst | 11 ++-
include/linux/rseq_entry.h | 38 +++--
include/linux/rseq_types.h | 2 +-
kernel/rseq.c | 132 ++++++++++++++++++-
4 files changed, 170 insertions(+), 13 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 239da22..b09d18e 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1248,6 +1248,17 @@ reboot-cmd (SPARC only)
ROM/Flash boot loader. Maybe to tell it what to do after
rebooting. ???
+rseq_slice_extension_nsec
+=========================
+
+A task can request to delay its scheduling if it is in a critical section
+via the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sets the maximum
+allowed extension in nanoseconds before scheduling of the task is enforced.
+Default value is 10000ns (10us). The possible range is 10000ns (10us) to
+50000ns (50us).
+
+This value has a direct correlation to the worst case scheduling latency;
+increment at your own risk.
sched_energy_aware
==================
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 54d8e33..8d04611 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -87,8 +87,24 @@ static __always_inline bool rseq_slice_extension_enabled(void)
{
return static_branch_likely(&rseq_slice_extension_key);
}
+
+extern unsigned int rseq_slice_ext_nsecs;
+bool __rseq_arm_slice_extension_timer(void);
+
+static __always_inline bool rseq_arm_slice_extension_timer(void)
+{
+ if (!rseq_slice_extension_enabled())
+ return false;
+
+ if (likely(!current->rseq.slice.state.granted))
+ return false;
+
+ return __rseq_arm_slice_extension_timer();
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
+static inline bool rseq_arm_slice_extension_timer(void) { return false; }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -543,17 +559,19 @@ static __always_inline void clear_tif_rseq(void) { }
static __always_inline bool
rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
{
- if (likely(!test_tif_rseq(ti_work)))
- return false;
-
- if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
- current->rseq.event.slowpath = true;
- set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
- return true;
+ if (unlikely(test_tif_rseq(ti_work))) {
+ if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
+ current->rseq.event.slowpath = true;
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ return true;
+ }
+ clear_tif_rseq();
}
-
- clear_tif_rseq();
- return false;
+ /*
+ * Arm the slice extension timer if nothing to do anymore and the
+ * task really goes out to user space.
+ */
+ return rseq_arm_slice_extension_timer();
}
#else /* CONFIG_GENERIC_ENTRY */
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 8c540e7..8a2e76c 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -89,10 +89,12 @@ union rseq_slice_state {
/**
* struct rseq_slice - Status information for rseq time slice extension
* @state: Time slice extension state
+ * @expires: The time when a grant expires
* @yielded: Indicator for rseq_slice_yield()
*/
struct rseq_slice {
union rseq_slice_state state;
+ u64 expires;
u8 yielded;
};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 8aa4821..275d701 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,8 @@
#define RSEQ_BUILD_SLOW_PATH
#include <linux/debugfs.h>
+#include <linux/hrtimer.h>
+#include <linux/percpu.h>
#include <linux/prctl.h>
#include <linux/ratelimit.h>
#include <linux/rseq_entry.h>
@@ -500,8 +502,91 @@ efault:
}
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+struct slice_timer {
+ struct hrtimer timer;
+ void *cookie;
+};
+
+unsigned int rseq_slice_ext_nsecs __read_mostly = 10 * NSEC_PER_USEC;
+static DEFINE_PER_CPU(struct slice_timer, slice_timer);
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+/*
+ * When the timer expires and the task is still in user space, the return
+ * from interrupt will revoke the grant and schedule. If the task already
+ * entered the kernel via a syscall and the timer fires before the syscall
+ * work was able to cancel it, then depending on the preemption model this
+ * will either reschedule on return from interrupt or in the syscall work
+ * below.
+ */
+static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
+{
+ struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
+
+ /*
+ * Validate that the task which armed the timer is still on the
+ * CPU. It could have been scheduled out without canceling the
+ * timer.
+ */
+ if (st->cookie == current && current->rseq.slice.state.granted) {
+ rseq_stat_inc(rseq_stats.s_expired);
+ set_need_resched_current();
+ }
+ return HRTIMER_NORESTART;
+}
+
+bool __rseq_arm_slice_extension_timer(void)
+{
+ struct slice_timer *st = this_cpu_ptr(&slice_timer);
+ struct task_struct *curr = current;
+
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * This check prevents a task, which got a time slice extension
+ * granted, from exceeding the maximum scheduling latency when the
+ * grant expired before going out to user space. Don't bother to
+ * clear the grant here, it will be cleaned up automatically before
+ * going out to user space after being scheduled back in.
+ */
+ if ((unlikely(curr->rseq.slice.expires < ktime_get_mono_fast_ns()))) {
+ set_need_resched_current();
+ return true;
+ }
+
+ /*
+ * Store the task pointer as a cookie for comparison in the timer
+ * function. This is safe as the timer is CPU local and cannot be
+ * in the expiry function at this point.
+ */
+ st->cookie = curr;
+ hrtimer_start(&st->timer, curr->rseq.slice.expires, HRTIMER_MODE_ABS_PINNED_HARD);
+ /* Arm the syscall entry work */
+ set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+ return false;
+}
+
+static void rseq_cancel_slice_extension_timer(void)
+{
+ struct slice_timer *st = this_cpu_ptr(&slice_timer);
+
+ /*
+ * st->cookie can be safely read as preemption is disabled and the
+ * timer is CPU local.
+ *
+ * As this is most probably the first expiring timer, the cancel is
+ * expensive as it has to reprogram the hardware, but that's less
+ * expensive than going through a full hrtimer_interrupt() cycle
+ * for nothing.
+ *
+ * hrtimer_try_to_cancel() is sufficient here as the timer is CPU
+ * local and once the hrtimer code disabled interrupts the timer
+ * callback cannot be running.
+ */
+ if (st->cookie == current)
+ hrtimer_try_to_cancel(&st->timer);
+}
+
static inline void rseq_slice_set_need_resched(struct task_struct *curr)
{
/*
@@ -563,11 +648,14 @@ void rseq_syscall_enter_work(long syscall)
return;
/*
- * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
- * kernels. Leaving the scope will reschedule on preemption models
- * FULL, LAZY and RT if necessary.
+ * Required to stabilize the per CPU timer pointer and to make
+ * set_tsk_need_resched() correct on PREEMPT[RT] kernels.
+ *
+ * Leaving the scope will reschedule on preemption models FULL,
+ * LAZY and RT if necessary.
*/
scoped_guard(preempt) {
+ rseq_cancel_slice_extension_timer();
/*
* Now that preemption is disabled, quickly check whether
* the task was already rescheduled before arriving here.
@@ -665,6 +753,31 @@ SYSCALL_DEFINE0(rseq_slice_yield)
return yielded;
}
+#ifdef CONFIG_SYSCTL
+static const unsigned int rseq_slice_ext_nsecs_min = 10 * NSEC_PER_USEC;
+static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC;
+
+static const struct ctl_table rseq_slice_ext_sysctl[] = {
+ {
+ .procname = "rseq_slice_extension_nsec",
+ .data = &rseq_slice_ext_nsecs,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = (unsigned int *)&rseq_slice_ext_nsecs_min,
+ .extra2 = (unsigned int *)&rseq_slice_ext_nsecs_max,
+ },
+};
+
+static void rseq_slice_sysctl_init(void)
+{
+ if (rseq_slice_extension_enabled())
+ register_sysctl_init("kernel", rseq_slice_ext_sysctl);
+}
+#else /* CONFIG_SYSCTL */
+static inline void rseq_slice_sysctl_init(void) { }
+#endif /* !CONFIG_SYSCTL */
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
@@ -677,4 +790,17 @@ static int __init rseq_slice_cmdline(char *str)
return 1;
}
__setup("rseq_slice_ext=", rseq_slice_cmdline);
+
+static int __init rseq_slice_init(void)
+{
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ hrtimer_setup(per_cpu_ptr(&slice_timer.timer, cpu), rseq_slice_expired,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
+ }
+ rseq_slice_sysctl_init();
+ return 0;
+}
+device_initcall(rseq_slice_init);
#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
* [tip: sched/core] rseq: Implement syscall entry work for time slice extensions
2025-12-15 16:52 ` [patch V6 06/11] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 15:05 ` Mathieu Desnoyers
@ 2026-01-22 10:16 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 78+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-01-22 10:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: dd0a04606937af5810e9117d343ee3792635bd3d
Gitweb: https://git.kernel.org/tip/dd0a04606937af5810e9117d343ee3792635bd3d
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 15 Dec 2025 17:52:19 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:18 +01:00
rseq: Implement syscall entry work for time slice extensions
The kernel sets SYSCALL_WORK_SYSCALL_RSEQ_SLICE when it grants a time slice
extension. This allows handling the rseq_slice_yield() syscall, which is
used by user space to relinquish the CPU after finishing the critical
section for which it requested an extension.
If the kernel state is still GRANTED, the kernel resets both kernel
and user space state with a set of sanity checks. If the kernel state is
already cleared, then the syscall raced against the timer or some other
interrupt and the work code just clears the work bit.
Doing it in syscall entry work allows catching misbehaving user space,
which issues an arbitrary syscall, i.e. not rseq_slice_yield(), from the
critical section. Contrary to the initial strict requirement to use
rseq_slice_yield(), arbitrary syscalls are no longer considered a violation
of the ABI contract. This allows onion architecture applications, which
cannot control the code inside a critical section, to utilize the mechanism
as well.
If the code detects inconsistent user space state, the application receives
a SIGSEGV.
If the grant was still active and the task was not preempted yet, the work
code reschedules immediately before continuing through the syscall.
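The decision flow described above can be sketched as a small user space
model. This is not kernel code: the struct, the enum and the function name
are hypothetical and only mirror the fields and counters this patch touches
(state.granted, event.sched_switch, slice.yielded, s_yielded, s_aborted).
The syscall number 471 is the generic one from this series and is assumed
here.

```c
#include <stdbool.h>

/* Assumption: generic syscall number from this series (alpha uses 581). */
#define MODEL_NR_RSEQ_SLICE_YIELD 471

/* Hypothetical names which only mirror the patch's counters and flags. */
enum slice_outcome {
	SLICE_NO_GRANT,		/* grant already revoked, nothing to do */
	SLICE_PREEMPTED,	/* task was scheduled out before arriving here */
	SLICE_YIELDED,		/* would be counted as s_yielded */
	SLICE_ABORTED,		/* would be counted as s_aborted */
};

struct slice_model {
	bool granted;		/* models rseq.slice.state.granted */
	bool sched_switch;	/* models rseq.event.sched_switch */
	bool yielded;		/* models rseq.slice.yielded */
	bool need_resched;	/* models the need-resched request */
};

static inline enum slice_outcome
model_syscall_enter_work(struct slice_model *m, long nr)
{
	enum slice_outcome out = SLICE_PREEMPTED;

	/* Raced against the timer or an interrupt: only the work bit is left. */
	if (!m->granted)
		return SLICE_NO_GRANT;

	if (!m->sched_switch) {
		/* Grant still active and not preempted: reschedule right away. */
		m->need_resched = true;
		if (nr == MODEL_NR_RSEQ_SLICE_YIELD) {
			m->yielded = true;
			out = SLICE_YIELDED;
		} else {
			out = SLICE_ABORTED;
		}
	}

	/* Entering any syscall terminates the grant. */
	m->granted = false;
	return out;
}
```

The model leaves out the debug validation of the user space control word
and the final put_user(), both of which can raise SIGSEGV in the real code.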
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155709.005777059@linutronix.de
---
include/linux/entry-common.h | 2 +-
include/linux/rseq.h | 2 +-
include/linux/thread_info.h | 16 +++---
kernel/entry/syscall-common.c | 11 +++-
kernel/rseq.c | 91 ++++++++++++++++++++++++++++++++++-
5 files changed, 112 insertions(+), 10 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 87efb38..026201a 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -36,8 +36,8 @@
SYSCALL_WORK_SYSCALL_EMU | \
SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
+ SYSCALL_WORK_SYSCALL_RSEQ_SLICE | \
ARCH_SYSCALL_WORK_ENTER)
-
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 3c194a0..7a01a07 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -164,8 +164,10 @@ static inline void rseq_syscall(struct pt_regs *regs) { }
#endif /* !CONFIG_DEBUG_RSEQ */
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+void rseq_syscall_enter_work(long syscall);
int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline void rseq_syscall_enter_work(long syscall) { }
static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
{
return -ENOTSUPP;
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index b40de9b..051e429 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,15 +46,17 @@ enum syscall_work_bit {
SYSCALL_WORK_BIT_SYSCALL_AUDIT,
SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+ SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE,
};
-#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
-#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
-#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
-#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
-#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
-#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
-#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
+#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
+#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
+#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
+#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
+#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
+#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_RSEQ_SLICE BIT(SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE)
#endif
#include <asm/thread_info.h>
diff --git a/kernel/entry/syscall-common.c b/kernel/entry/syscall-common.c
index 940a597..f7ee25b 100644
--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -17,8 +17,7 @@ static inline void syscall_enter_audit(struct pt_regs *regs, long syscall)
}
}
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
- unsigned long work)
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work)
{
long ret = 0;
@@ -32,6 +31,14 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
return -1L;
}
+ /*
+ * User space got a time slice extension granted and relinquishes
+ * the CPU. The work stops the slice timer to avoid an extra round
+ * through hrtimer_interrupt().
+ */
+ if (work & SYSCALL_WORK_SYSCALL_RSEQ_SLICE)
+ rseq_syscall_enter_work(syscall);
+
/* Handle ptrace */
if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
ret = ptrace_report_syscall_entry(regs);
diff --git a/kernel/rseq.c b/kernel/rseq.c
index d8e1992..8aa4821 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -502,6 +502,97 @@ efault:
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+static inline void rseq_slice_set_need_resched(struct task_struct *curr)
+{
+ /*
+ * The interrupt guard is required to prevent inconsistent state in
+ * this case:
+ *
+ * set_tsk_need_resched()
+ * --> Interrupt
+ * wakeup()
+ * set_tsk_need_resched()
+ * set_preempt_need_resched()
+ * schedule_on_return()
+ * clear_tsk_need_resched()
+ * clear_preempt_need_resched()
+ * set_preempt_need_resched() <- Inconsistent state
+ *
+ * This is safe vs. a remote set of TIF_NEED_RESCHED because that
+ * only sets the already set bit and does not create inconsistent
+ * state.
+ */
+ scoped_guard(irq)
+ set_need_resched_current();
+}
+
+static void rseq_slice_validate_ctrl(u32 expected)
+{
+ u32 __user *sctrl = &current->rseq.usrptr->slice_ctrl.all;
+ u32 uval;
+
+ if (get_user(uval, sctrl) || uval != expected)
+ force_sig(SIGSEGV);
+}
+
+/*
+ * Invoked from syscall entry if a time slice extension was granted and the
+ * kernel did not clear it before user space left the critical section.
+ *
+ * While the recommended way to relinquish the CPU side effect free is
+ * rseq_slice_yield(2), any syscall within a granted slice terminates the
+ * grant and immediately reschedules if required. This supports onion layer
+ * applications, where the code requesting the grant cannot control the
+ * code within the critical section.
+ */
+void rseq_syscall_enter_work(long syscall)
+{
+ struct task_struct *curr = current;
+ struct rseq_slice_ctrl ctrl = { .granted = curr->rseq.slice.state.granted };
+
+ clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+
+ if (static_branch_unlikely(&rseq_debug_enabled))
+ rseq_slice_validate_ctrl(ctrl.all);
+
+ /*
+ * The kernel might have raced, revoked the grant and updated
+ * userspace, but kept the SLICE work set.
+ */
+ if (!ctrl.granted)
+ return;
+
+ /*
+ * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
+ * kernels. Leaving the scope will reschedule on preemption models
+ * FULL, LAZY and RT if necessary.
+ */
+ scoped_guard(preempt) {
+ /*
+ * Now that preemption is disabled, quickly check whether
+ * the task was already rescheduled before arriving here.
+ */
+ if (!curr->rseq.event.sched_switch) {
+ rseq_slice_set_need_resched(curr);
+
+ if (syscall == __NR_rseq_slice_yield) {
+ rseq_stat_inc(rseq_stats.s_yielded);
+ /* Update the yielded state for syscall return */
+ curr->rseq.slice.yielded = 1;
+ } else {
+ rseq_stat_inc(rseq_stats.s_aborted);
+ }
+ }
+ }
+ /* Reschedule on NONE/VOLUNTARY preemption models */
+ cond_resched();
+
+ /* Clear the grant in kernel state and user space */
+ curr->rseq.slice.state.granted = false;
+ if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all))
+ force_sig(SIGSEGV);
+}
+
int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
{
switch (arg2) {
* [tip: sched/core] rseq: Implement sys_rseq_slice_yield()
2025-12-15 16:52 ` [patch V6 05/11] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
2025-12-16 14:59 ` Mathieu Desnoyers
@ 2026-01-22 10:16 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 78+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-01-22 10:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers,
Arnd Bergmann, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 99d2592023e5d0a31f5f5a83c694df48239a1e6c
Gitweb: https://git.kernel.org/tip/99d2592023e5d0a31f5f5a83c694df48239a1e6c
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 15 Dec 2025 17:52:15 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:17 +01:00
rseq: Implement sys_rseq_slice_yield()
Provide a new syscall whose only purpose is to yield the CPU after the
kernel granted a time slice extension.
sched_yield() is not suitable for that because it unconditionally
schedules, but the end of the time slice extension is not required to
schedule when the task was already preempted. This also allows a
strict check for termination to catch user space invoking random syscalls
including sched_yield() from a time slice extension region.
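From user space the new syscall would probably be invoked through a thin
wrapper along these lines. This is a minimal sketch, not part of the patch:
the wrapper name is hypothetical, and the fallback number 471 is taken from
the syscall tables in this series (alpha uses 581) because installed libc
headers may not define it yet.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Assumption: number assigned by this series on most architectures. */
#ifndef __NR_rseq_slice_yield
#define __NR_rseq_slice_yield 471
#endif

/*
 * Hypothetical wrapper. According to the kernel-doc in this patch the
 * syscall returns 1 if the task yielded within the granted slice and 0
 * if the extension was never granted or was already revoked. On kernels
 * without the syscall, syscall(2) returns -1 with errno set, typically
 * to ENOSYS.
 */
static inline long rseq_slice_yield(void)
{
	return syscall(__NR_rseq_slice_yield);
}
```

A caller would invoke rseq_slice_yield() right after leaving the critical
section, and only when the kernel indicated a grant in the rseq area. The
return value is mostly useful for statistics, since the rescheduling itself
already happened in the syscall entry work.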
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20251215155708.929634896@linutronix.de
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +-
arch/arm/tools/syscall.tbl | 1 +-
arch/arm64/tools/syscall_32.tbl | 1 +-
arch/m68k/kernel/syscalls/syscall.tbl | 1 +-
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +-
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +-
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +-
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +-
arch/parisc/kernel/syscalls/syscall.tbl | 1 +-
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +-
arch/s390/kernel/syscalls/syscall.tbl | 1 +-
arch/sh/kernel/syscalls/syscall.tbl | 1 +-
arch/sparc/kernel/syscalls/syscall.tbl | 1 +-
arch/x86/entry/syscalls/syscall_32.tbl | 1 +-
arch/x86/entry/syscalls/syscall_64.tbl | 1 +-
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +-
include/linux/rseq_types.h | 2 ++-
include/linux/syscalls.h | 1 +-
include/uapi/asm-generic/unistd.h | 5 ++++-
kernel/rseq.c | 21 ++++++++++++++++++++-
kernel/sys_ni.c | 1 +-
scripts/syscall.tbl | 1 +-
22 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3fed974..f31b7af 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -510,3 +510,4 @@
578 common file_getattr sys_file_getattr
579 common file_setattr sys_file_setattr
580 common listns sys_listns
+581 common rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index fd09afa..94351e2 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -485,3 +485,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl
index 8cdfe5d..62d93d8 100644
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -482,3 +482,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 871a5d6..2489342 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -470,3 +470,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 022fc85..223d263 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -476,3 +476,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 8cedc83..7430714 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -409,3 +409,4 @@
468 n32 file_getattr sys_file_getattr
469 n32 file_setattr sys_file_setattr
470 n32 listns sys_listns
+471 n32 rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 9b92bdd..630aab9 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -385,3 +385,4 @@
468 n64 file_getattr sys_file_getattr
469 n64 file_setattr sys_file_setattr
470 n64 listns sys_listns
+471 n64 rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index f810b8a..1286531 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -458,3 +458,4 @@
468 o32 file_getattr sys_file_getattr
469 o32 file_setattr sys_file_setattr
470 o32 listns sys_listns
+471 o32 rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 39bdaca..f6e2d03 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -469,3 +469,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index ec4458c..4fcc7c5 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -561,3 +561,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 nospu rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 417ed16..09a7ef0 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -397,3 +397,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 969c113..70b315c 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -474,3 +474,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 39aa26b..d5b1a71 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -516,3 +516,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index e979a3e..f832ebd 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -476,3 +476,4 @@
468 i386 file_getattr sys_file_getattr
469 i386 file_setattr sys_file_setattr
470 i386 listns sys_listns
+471 i386 rseq_slice_yield sys_rseq_slice_yield
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 8a4ac48..524155d 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -395,6 +395,7 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 438a3b1..a9bca4e 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 67e40c0..8c540e7 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -89,9 +89,11 @@ union rseq_slice_state {
/**
* struct rseq_slice - Status information for rseq time slice extension
* @state: Time slice extension state
+ * @yielded: Indicator for rseq_slice_yield()
*/
struct rseq_slice {
union rseq_slice_state state;
+ u8 yielded;
};
/**
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index cf84d98..6c8a570 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -961,6 +961,7 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);
asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
int flags, uint32_t sig);
+asmlinkage long sys_rseq_slice_yield(void);
asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
asmlinkage long sys_open_tree_attr(int dfd, const char __user *path,
unsigned flags,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 942370b..a627acc 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -860,8 +860,11 @@ __SYSCALL(__NR_file_setattr, sys_file_setattr)
#define __NR_listns 470
__SYSCALL(__NR_listns, sys_listns)
+#define __NR_rseq_slice_yield 471
+__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+
#undef __NR_syscalls
-#define __NR_syscalls 471
+#define __NR_syscalls 472
/*
* 32 bit systems traditionally used different
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 09848bb..d8e1992 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -553,6 +553,27 @@ die:
return -EFAULT;
}
+/**
+ * sys_rseq_slice_yield - yield the current processor side effect free if a
+ * task granted with a time slice extension is done with
+ * the critical work before being forced out.
+ *
+ * Return: 1 if the task successfully yielded the CPU within the granted slice.
+ * 0 if the slice extension was either never granted or was revoked by
+ * going over the granted extension, using a syscall other than this one
+ * or being scheduled out earlier due to a subsequent interrupt.
+ *
+ * The syscall does not schedule because the syscall entry work immediately
+ * relinquishes the CPU and schedules if required.
+ */
+SYSCALL_DEFINE0(rseq_slice_yield)
+{
+ int yielded = !!current->rseq.slice.yielded;
+
+ current->rseq.slice.yielded = 0;
+ return yielded;
+}
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index bf5d05c..add3032 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -390,6 +390,7 @@ COND_SYSCALL(setuid16);
/* restartable sequence */
COND_SYSCALL(rseq);
+COND_SYSCALL(rseq_slice_yield);
COND_SYSCALL(uretprobe);
COND_SYSCALL(uprobe);
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index e74868b..7a42b32 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -411,3 +411,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common rseq_slice_yield sys_rseq_slice_yield
* [tip: sched/core] rseq: Add prctl() to enable time slice extensions
2025-12-15 16:52 ` [patch V6 04/11] rseq: Add prctl() to enable " Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
@ 2026-01-22 10:16 ` tip-bot2 for Thomas Gleixner
1 sibling, 0 replies; 78+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-01-22 10:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 28621ec2d46c6adf7d33a6facbd83e2fa566bd34
Gitweb: https://git.kernel.org/tip/28621ec2d46c6adf7d33a6facbd83e2fa566bd34
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 15 Dec 2025 17:52:12 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:17 +01:00
rseq: Add prctl() to enable time slice extensions
Implement a prctl() so that tasks can enable the time slice extension
mechanism. This fails when time slice extensions are disabled at compile
time or on the kernel command line, or when no rseq pointer is registered
in the kernel.
That allows implementing a single trivial check in the exit to user mode
hotpath, to decide whether the whole mechanism needs to be invoked.
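A user space caller could wrap the new prctl() roughly as follows. This is
a sketch under assumptions: the helper names are hypothetical, and the
constants are copied from the uapi hunk of this patch as a fallback for
headers which do not carry them yet.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Values from the uapi hunk of this patch; older headers lack them. */
#ifndef PR_RSEQ_SLICE_EXTENSION
#define PR_RSEQ_SLICE_EXTENSION		79
#define PR_RSEQ_SLICE_EXTENSION_GET	1
#define PR_RSEQ_SLICE_EXTENSION_SET	2
#define PR_RSEQ_SLICE_EXT_ENABLE	0x01
#endif

/* Hypothetical helper: 1 if enabled, 0 if disabled, -errno on failure. */
static inline int slice_ext_get(void)
{
	int ret = prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET,
			0UL, 0UL, 0UL);

	if (ret < 0)
		return -errno;
	return (ret & PR_RSEQ_SLICE_EXT_ENABLE) ? 1 : 0;
}

/* Hypothetical helper: 0 on success, -errno on failure. */
static inline int slice_ext_set(int enable)
{
	unsigned long arg = enable ? PR_RSEQ_SLICE_EXT_ENABLE : 0UL;

	if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
		  arg, 0UL, 0UL) < 0)
		return -errno;
	return 0;
}
```

Going by the code in this patch, slice_ext_set() would see EINVAL for
unknown bits or on kernels without the prctl, ENXIO when no rseq area is
registered, and the kernel internal ENOTSUPP value (524, which has no
definition in the user space errno.h) when the feature is disabled.
Inconsistent flags in the user space rseq area result in SIGSEGV rather
than an error return.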
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.858717691@linutronix.de
---
include/linux/rseq.h | 9 ++++++-
include/uapi/linux/prctl.h | 10 +++++++-
kernel/rseq.c | 52 +++++++++++++++++++++++++++++++++++++-
kernel/sys.c | 6 ++++-
4 files changed, 77 insertions(+)
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 2266f4d..3c194a0 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -163,4 +163,13 @@ void rseq_syscall(struct pt_regs *regs);
static inline void rseq_syscall(struct pt_regs *regs) { }
#endif /* !CONFIG_DEBUG_RSEQ */
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+ return -ENOTSUPP;
+}
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
#endif /* _LINUX_RSEQ_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 51c4e8c..79944b7 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -386,4 +386,14 @@ struct prctl_mm_map {
# define PR_FUTEX_HASH_SET_SLOTS 1
# define PR_FUTEX_HASH_GET_SLOTS 2
+/* RSEQ time slice extensions */
+#define PR_RSEQ_SLICE_EXTENSION 79
+# define PR_RSEQ_SLICE_EXTENSION_GET 1
+# define PR_RSEQ_SLICE_EXTENSION_SET 2
+/*
+ * Bits for RSEQ_SLICE_EXTENSION_GET/SET
+ * PR_RSEQ_SLICE_EXT_ENABLE: Enable
+ */
+# define PR_RSEQ_SLICE_EXT_ENABLE 0x01
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 415d75b..09848bb 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,7 @@
#define RSEQ_BUILD_SLOW_PATH
#include <linux/debugfs.h>
+#include <linux/prctl.h>
#include <linux/ratelimit.h>
#include <linux/rseq_entry.h>
#include <linux/sched.h>
@@ -501,6 +502,57 @@ efault:
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+ switch (arg2) {
+ case PR_RSEQ_SLICE_EXTENSION_GET:
+ if (arg3)
+ return -EINVAL;
+ return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
+
+ case PR_RSEQ_SLICE_EXTENSION_SET: {
+ u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+ bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
+
+ if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
+ return -EINVAL;
+ if (!rseq_slice_extension_enabled())
+ return -ENOTSUPP;
+ if (!current->rseq.usrptr)
+ return -ENXIO;
+
+ /* No change? */
+ if (enable == !!current->rseq.slice.state.enabled)
+ return 0;
+
+ if (get_user(rflags, &current->rseq.usrptr->flags))
+ goto die;
+
+ if (current->rseq.slice.state.enabled)
+ valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+ if ((rflags & valid) != valid)
+ goto die;
+
+ rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+ rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+ if (enable)
+ rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+ if (put_user(rflags, &current->rseq.usrptr->flags))
+ goto die;
+
+ current->rseq.slice.state.enabled = enable;
+ return 0;
+ }
+ default:
+ return -EINVAL;
+ }
+die:
+ force_sig(SIGSEGV);
+ return -EFAULT;
+}
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
diff --git a/kernel/sys.c b/kernel/sys.c
index 8b58eec..af71987 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -53,6 +53,7 @@
#include <linux/time_namespace.h>
#include <linux/binfmts.h>
#include <linux/futex.h>
+#include <linux/rseq.h>
#include <linux/sched.h>
#include <linux/sched/autogroup.h>
@@ -2868,6 +2869,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_FUTEX_HASH:
error = futex_hash_prctl(arg2, arg3, arg4);
break;
+ case PR_RSEQ_SLICE_EXTENSION:
+ if (arg4 || arg5)
+ return -EINVAL;
+ error = rseq_slice_extension_prctl(arg2, arg3);
+ break;
default:
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
* [tip: sched/core] rseq: Add statistics for time slice extensions
2025-12-15 16:52 ` [patch V6 03/11] rseq: Add statistics " Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
@ 2026-01-22 10:16 ` tip-bot2 for Thomas Gleixner
1 sibling, 0 replies; 78+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-01-22 10:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: b5b8282441bc4f8f1ff505e19d566dbd7b805761
Gitweb: https://git.kernel.org/tip/b5b8282441bc4f8f1ff505e19d566dbd7b805761
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 15 Dec 2025 17:52:09 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:17 +01:00
rseq: Add statistics for time slice extensions
Extend the quick statistics with time slice specific fields.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.795202254@linutronix.de
---
include/linux/rseq_entry.h | 5 +++++
kernel/rseq.c | 14 ++++++++++++++
2 files changed, 19 insertions(+)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index d0ec471..54d8e33 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -15,6 +15,11 @@ struct rseq_stats {
unsigned long cs;
unsigned long clear;
unsigned long fixup;
+ unsigned long s_granted;
+ unsigned long s_expired;
+ unsigned long s_revoked;
+ unsigned long s_yielded;
+ unsigned long s_aborted;
};
DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
diff --git a/kernel/rseq.c b/kernel/rseq.c
index bf75268..415d75b 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -138,6 +138,13 @@ static int rseq_stats_show(struct seq_file *m, void *p)
stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu));
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+ stats.s_granted += data_race(per_cpu(rseq_stats.s_granted, cpu));
+ stats.s_expired += data_race(per_cpu(rseq_stats.s_expired, cpu));
+ stats.s_revoked += data_race(per_cpu(rseq_stats.s_revoked, cpu));
+ stats.s_yielded += data_race(per_cpu(rseq_stats.s_yielded, cpu));
+ stats.s_aborted += data_race(per_cpu(rseq_stats.s_aborted, cpu));
+ }
}
seq_printf(m, "exit: %16lu\n", stats.exit);
@@ -148,6 +155,13 @@ static int rseq_stats_show(struct seq_file *m, void *p)
seq_printf(m, "cs: %16lu\n", stats.cs);
seq_printf(m, "clear: %16lu\n", stats.clear);
seq_printf(m, "fixup: %16lu\n", stats.fixup);
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+ seq_printf(m, "sgrant: %16lu\n", stats.s_granted);
+ seq_printf(m, "sexpir: %16lu\n", stats.s_expired);
+ seq_printf(m, "srevok: %16lu\n", stats.s_revoked);
+ seq_printf(m, "syield: %16lu\n", stats.s_yielded);
+ seq_printf(m, "sabort: %16lu\n", stats.s_aborted);
+ }
return 0;
}
^ permalink raw reply related [flat|nested] 78+ messages in thread
* [tip: sched/core] rseq: Provide static branch for time slice extensions
2025-12-15 16:52 ` [patch V6 02/11] rseq: Provide static branch for time slice extensions Thomas Gleixner
2025-12-15 18:24 ` Thomas Gleixner
@ 2026-01-22 10:16 ` tip-bot2 for Thomas Gleixner
1 sibling, 0 replies; 78+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-01-22 10:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: f8380f976804533df4c6c3d3a0b2cd03c2d262bc
Gitweb: https://git.kernel.org/tip/f8380f976804533df4c6c3d3a0b2cd03c2d262bc
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 15 Dec 2025 17:52:06 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:16 +01:00
rseq: Provide static branch for time slice extensions
Guard the time slice extension functionality with a static key, which can
be disabled on the kernel command line.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.733429292@linutronix.de
---
Documentation/admin-guide/kernel-parameters.txt | 5 +++++-
include/linux/rseq_entry.h | 11 ++++++++++-
kernel/rseq.c | 17 ++++++++++++++++-
3 files changed, 33 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a8d0afd..f2348bc 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6600,6 +6600,11 @@ Kernel parameters
rootflags= [KNL] Set root filesystem mount option string
+ rseq_slice_ext= [KNL] RSEQ based time slice extension
+ Format: boolean
+ Control enablement of RSEQ based time slice extension.
+ Default is 'on'.
+
initramfs_options= [KNL]
Specify mount options for for the initramfs mount.
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index a36b472..d0ec471 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -75,6 +75,17 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
#define rseq_inline __always_inline
#endif
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static __always_inline bool rseq_slice_extension_enabled(void)
+{
+ return static_branch_likely(&rseq_slice_extension_key);
+}
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline bool rseq_slice_extension_enabled(void) { return false; }
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
bool rseq_debug_validate_ids(struct task_struct *t);
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 07c324d..bf75268 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -483,3 +483,20 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
efault:
return -EFAULT;
}
+
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static int __init rseq_slice_cmdline(char *str)
+{
+ bool on;
+
+ if (kstrtobool(str, &on))
+ return 0;
+
+ if (!on)
+ static_branch_disable(&rseq_slice_extension_key);
+ return 1;
+}
+__setup("rseq_slice_ext=", rseq_slice_cmdline);
+#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
^ permalink raw reply related [flat|nested] 78+ messages in thread
* [tip: sched/core] rseq: Add fields and constants for time slice extension
2025-12-15 16:52 ` [patch V6 01/11] rseq: Add fields and constants for time slice extension Thomas Gleixner
` (2 preceding siblings ...)
2026-01-19 10:10 ` Peter Zijlstra
@ 2026-01-22 10:16 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 78+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2026-01-22 10:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: d7a5da7a0f7fa7ff081140c4f6f971db98882703
Gitweb: https://git.kernel.org/tip/d7a5da7a0f7fa7ff081140c4f6f971db98882703
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 15 Dec 2025 17:52:04 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 22 Jan 2026 11:11:16 +01:00
rseq: Add fields and constants for time slice extension
Aside of a Kconfig knob add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- A rseq state struct to hold the kernel state of this
- Documentation of the new mechanism
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de
---
Documentation/userspace-api/index.rst | 1 +-
Documentation/userspace-api/rseq.rst | 135 +++++++++++++++++++++++++-
include/linux/rseq_types.h | 28 ++++-
include/uapi/linux/rseq.h | 38 +++++++-
init/Kconfig | 12 ++-
kernel/rseq.c | 7 +-
6 files changed, 220 insertions(+), 1 deletion(-)
create mode 100644 Documentation/userspace-api/rseq.rst
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 8a61ac4..fa0fe8a 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
ebpf/index
ioctl/index
mseal
+ rseq
Security-related interfaces
===========================
diff --git a/Documentation/userspace-api/rseq.rst b/Documentation/userspace-api/rseq.rst
new file mode 100644
index 0000000..e1fdb0d
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,135 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow to register a per thread userspace memory area
+to be used as an ABI between kernel and userspace for three purposes:
+
+ * userspace restartable sequences
+
+ * quick access to read the current CPU number, node ID from userspace
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow userspace to perform update operations on
+per-cpu data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows to implement per CPU data efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+ * Enabled in Kconfig
+
+ * Enabled at boot time (default is enabled)
+
+ * A rseq userspace pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success or otherwise with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg4 and arg5 must be zero
+ENOTSUPP Functionality was disabled on the kernel command line
+ENXIO Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can be also queried via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status is also exposed via the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and only for informational purposes.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
+interrupted and the interrupt results in a reschedule request in the
+kernel, then the kernel can grant a time slice extension and return to
+userspace instead of scheduling out. The length of the extension is
+determined by the ``rseq_slice_extension_nsec`` sysctl.
+
+The kernel indicates the grant by clearing rseq::slice_ctrl::request and
+setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
+thread after granting the extension, the kernel clears the granted bit to
+indicate that to userspace.
+
+If the request bit is still set when leaving the critical section,
+userspace can clear it and continue.
+
+If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
+leaving the critical section to relinquish the CPU. The kernel enforces
+this by arming a timer to prevent misbehaving userspace from abusing this
+mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by userspace.
+
+The required code flow is as follows::
+
+ rseq->slice_ctrl.request = 1;
+ barrier(); // Prevent compiler reordering
+ critical_section();
+ barrier(); // Prevent compiler reordering
+ rseq->slice_ctrl.request = 0;
+ if (rseq->slice_ctrl.granted)
+ rseq_slice_yield();
+
+As all of this is strictly CPU local, there are no atomicity requirements.
+Checking the granted state is racy, but that cannot be avoided at all::
+
+ if (rseq->slice_ctrl.granted)
+ -> Interrupt results in schedule and grant revocation
+ rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+If the thread issues a syscall other than rseq_slice_yield(2) within the
+granted timeslice extension, the grant is also revoked and the CPU is
+relinquished immediately when entering the kernel. This is required as
+syscalls might consume arbitrary CPU time until they reach a scheduling
+point when the preemption model is either NONE or VOLUNTARY and therefore
+might exceed the grant by far.
+
+The preferred solution for user space is to use rseq_slice_yield(2) which
+is side effect free. The support for arbitrary syscalls is required to
+support onion layer architectured applications, where the code handling the
+critical section and requesting the time slice extension has no control
+over the code within the critical section.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 332dc14..67e40c0 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -73,12 +73,35 @@ struct rseq_ids {
};
/**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state: Compound to access the overall state
+ * @enabled: Time slice extension is enabled for the task
+ * @granted: Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+ u16 state;
+ struct {
+ u8 enabled;
+ u8 granted;
+ };
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state: Time slice extension state
+ */
+struct rseq_slice {
+ union rseq_slice_state state;
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
- * @sig: Signature of critial section abort IPs
+ * @sig: Signature of critical section abort IPs
* @event: Storage for event management
* @ids: Storage for cached CPU ID and MM CID
+ * @slice: Storage for time slice extension data
*/
struct rseq_data {
struct rseq __user *usrptr;
@@ -86,6 +109,9 @@ struct rseq_data {
u32 sig;
struct rseq_event event;
struct rseq_ids ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+ struct rseq_slice slice;
+#endif
};
#else /* CONFIG_RSEQ */
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 1b76d50..6afc219 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -23,9 +23,15 @@ enum rseq_flags {
};
enum rseq_cs_flags_bit {
+ /* Historical and unsupported bits */
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+ /* (3) Intentional gap to put new bits into a separate byte */
+
+ /* User read only feature flags */
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
};
enum rseq_cs_flags {
@@ -35,6 +41,11 @@ enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
};
/*
@@ -53,6 +64,27 @@ struct rseq_cs {
__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));
+/**
+ * rseq_slice_ctrl - Time slice extension control structure
+ * @all: Compound value
+ * @request: Request for a time slice extension
+ * @granted: Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_slice_ctrl {
+ union {
+ __u32 all;
+ struct {
+ __u8 request;
+ __u8 granted;
+ __u16 __reserved;
+ };
+ };
+};
+
/*
* struct rseq is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
@@ -142,6 +174,12 @@ struct rseq {
__u32 mm_cid;
/*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_slice_ctrl slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
diff --git a/init/Kconfig b/init/Kconfig
index fa79feb..00c6fbb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1938,6 +1938,18 @@ config RSEQ
If unsure, say Y.
+config RSEQ_SLICE_EXTENSION
+ bool "Enable rseq-based time slice extension mechanism"
+ depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+ help
+ Allows userspace to request a limited time slice extension when
+ returning from an interrupt to user space via the RSEQ shared
+ data ABI. If granted, that allows to complete a critical section,
+ so that other threads are not stuck on a conflicted resource,
+ while the task is scheduled out.
+
+ If unsure, say N.
+
config RSEQ_STATS
default n
bool "Enable lightweight statistics of restartable sequences" if EXPERT
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 395d8b0..07c324d 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -389,6 +389,8 @@ static bool rseq_reset_ids(void)
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
+ u32 rseqfl = 0;
+
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -440,6 +442,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
if (!access_ok(rseq, rseq_len))
return -EFAULT;
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+ rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
scoped_user_write_access(rseq, efault) {
/*
* If the rseq_cs pointer is non-NULL on registration, clear it to
@@ -449,11 +454,13 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
* clearing the fields. Don't bother reading it, just reset it.
*/
unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+ unsafe_put_user(rseqfl, &rseq->flags, efault);
/* Initialize IDs in user space */
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
unsafe_put_user(0U, &rseq->node_id, efault);
unsafe_put_user(0U, &rseq->mm_cid, efault);
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
}
/*
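For readers who want to poke at the user space side of the ABI described above, here is a minimal sketch in Python. The two flag values follow from the bit numbers in enum rseq_cs_flags_bit (bits 4 and 5), and the exit logic mirrors the three cases spelled out in rseq.rst. The helper names decode_flags() and exit_action() are hypothetical and not part of the patch; real code would read struct rseq directly and call rseq_slice_yield(2).

```python
# Bit positions taken from enum rseq_cs_flags_bit in the patch above.
RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE = 1 << 4
RSEQ_CS_FLAG_SLICE_EXT_ENABLED = 1 << 5


def decode_flags(flags):
    """Hypothetical helper: split the rseq flags word into the two feature bits."""
    return {
        "available": bool(flags & RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE),
        "enabled": bool(flags & RSEQ_CS_FLAG_SLICE_EXT_ENABLED),
    }


def exit_action(request, granted):
    """Hypothetical helper: what user space does when leaving the critical section."""
    if granted:
        return "yield"   # extension was granted: invoke rseq_slice_yield(2)
    if request:
        return "clear"   # no extension was needed: clear the request bit
    return "none"        # grant was revoked: nothing left to do


print(decode_flags(0x30))
print(exit_action(request=1, granted=0))
print(exit_action(request=0, granted=1))
print(exit_action(request=0, granted=0))
```

The model deliberately ignores the race between checking the granted bit and yielding, which the documentation states cannot be avoided.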
^ permalink raw reply related [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2026-01-17 9:57 ` Peter Zijlstra
@ 2026-01-23 17:38 ` Prakash Sangappa
2026-01-23 17:41 ` Prakash Sangappa
1 sibling, 0 replies; 78+ messages in thread
From: Prakash Sangappa @ 2026-01-23 17:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, LKML, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org, Randy Dunlap, Ron Geva,
Waiman Long
> On Jan 17, 2026, at 1:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Dec 15, 2025 at 05:52:22PM +0100, Thomas Gleixner wrote:
>
>> +rseq_slice_extension_nsec
>> +=========================
>> +
>> +A task can request to delay its scheduling if it is in a critical section
>> +via the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sets the maximum
>> +allowed extension in nanoseconds before scheduling of the task is enforced.
>> +Default value is 30000ns (30us). The possible range is 10000ns (10us) to
>> +50000ns (50us).
> +
> +This value has a direct correlation to the worst case scheduling latency;
> +increment at your own risk.
>
>
>> +unsigned int rseq_slice_ext_nsecs __read_mostly = 30 * NSEC_PER_USEC;
>
> Changed default to 10us
>
> Also, given the results of that slice_test thing, we might possibly get
> away with a much lower value still.
>
> Prakash, could you possibly capture a trace of hrtimer_start,
> hrtimer_cancel and hrtimer_expire_entry for your Oracle workload and run
> that python thing on it?
The database setup is on an older environment, and the python script provided does not run as is.
So I modified it to gather the stats by parsing the trace-cmd report output.
The modified python script is below.
Here are rseq stats from the benchmark run.
# cat /sys/kernel/debug/rseq/stats
[…]
sgrant: 707530
sexpir: 19717
srevok: 26548
syield: 680982
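To put these counters in proportion, a small sketch that expresses each one as a share of the grants. The numbers are copied from the stats output above; reading them as "share of grants" is an interpretation on my side, not something the stats file itself states.

```python
# Counters copied from the rseq stats output above.
stats = {"sgrant": 707530, "sexpir": 19717, "srevok": 26548, "syield": 680982}

granted = stats["sgrant"]
share = {}
for name in ("syield", "srevok", "sexpir"):
    share[name] = 100.0 * stats[name] / granted
    print(f"{name}: {share[name]:.2f}% of grants")

# Observation for this run only: yielded + revoked adds up to granted.
print(stats["syield"] + stats["srevok"] == granted)
```

Roughly 96% of the grants ended in a yield in this run, and fewer than 3% ran into the enforcement timer.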
Here is a snippet of the histogram data, showing typical usage.
Gathered from 10 sec trace-cmd samples collected during the benchmark run.
# trace-cmd record -e hrtimer_start -e hrtimer_cancel -e hrtimer_expire_entry -- sleep 10
# trace-cmd report -t > trace.report
This is with slice size of 30us.
The kernel includes the following fix
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1742,7 +1742,7 @@ static void __run_hrtimer(struct hrtimer
>
> lockdep_assert_held(&cpu_base->lock);
>
> - debug_deactivate(timer);
> + debug_hrtimer_deactivate(timer);
> base->running = timer;
========================================
RSEQ SLICE HISTOGRAM (us)
========================================
Task: ora_dbwr_lmb-666257 Mean: 1593.058 ns
Latency (us) | Count
------------------------------
EXPIRED | 3
0 us | 8
1 us | 116
2 us | 25
3 us | 2
4 us | 1
7 us | 1
Task: ora_dbwd_lmb-666196 Mean: 1548.641 ns
Latency (us) | Count
------------------------------
EXPIRED | 6
0 us | 19
1 us | 203
2 us | 33
4 us | 2
5 us | 1
[..]
Task: oracle_668951_l-668951 Mean: 1797.796 ns
Latency (us) | Count
------------------------------
1 us | 5
2 us | 1
3 us | 2
Task: oracle_671571_l-671571 Mean: 3285.425 ns
Latency (us) | Count
------------------------------
1 us | 1
3 us | 1
5 us | 1
6 us | 1
10 us | 1
Task: oracle_672277_l-672277 Mean: 2361.600 ns
Latency (us) | Count
------------------------------
1 us | 4
2 us | 3
5 us | 1
7 us | 1
9 us | 1
11 us | 1
Task: ora_dbwb_lmb-666192 Mean: 1548.157 ns
Latency (us) | Count
------------------------------
EXPIRED | 10
0 us | 24
1 us | 182
2 us | 39
3 us | 2
4 us | 4
5 us | 1
14 us | 1
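Note that the "Mean" value printed per task is a harmonic mean (class OnlineHarmonicMean in the script), which is dominated by the short durations and skips samples that fall into the 0 us bucket. A minimal illustration of how it differs from the arithmetic mean, using made-up durations rather than measured ones:

```python
# Made-up sample durations in nanoseconds, for illustration only.
durations_ns = [500, 1500, 2500, 14000]

arithmetic = sum(durations_ns) / len(durations_ns)
harmonic = len(durations_ns) / sum(1.0 / d for d in durations_ns)

print(f"arithmetic mean: {arithmetic:.1f} ns")
print(f"harmonic mean:   {harmonic:.1f} ns")
```

A single long outlier, like the 14 us sample, pulls the arithmetic mean up much more than the harmonic mean, so the printed means should not be read as typical extension lengths without looking at the buckets.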
———
#!/usr/bin/python3
#
# trace-cmd record -e hrtimer_start -e hrtimer_cancel -e hrtimer_expire_entry -- $cmd
# trace-cmd report -t >trace.report

# pending[timer_ptr] = {'ts': timestamp, 'comm': comm}
pending = {}

# histograms[comm][bucket] = count
histograms = {}

class OnlineHarmonicMean:
    def __init__(self):
        self.n = 0      # Count of elements
        self.S = 0.0    # Cumulative sum of reciprocals

    def update(self, x):
        if x == 0:
            raise ValueError("Harmonic mean is undefined for zero.")
        self.n += 1
        self.S += 1.0 / x
        return self.n / self.S

    @property
    def mean(self):
        return self.n / self.S if self.n > 0 else 0

# ohms[comm] = running harmonic mean of the durations in ns
ohms = {}

def handle_start(comm, ts, timer_ptr, func):
    # Only track the rseq slice extension timer
    if "rseq_slice_expired" in func:
        ts = ts.replace(':', '')
        pending[timer_ptr] = {
            'ts': ts,
            'comm': comm
        }
    return None

def handle_cancel(ts, timer_ptr):
    if timer_ptr in pending:
        start_data = pending.pop(timer_ptr)
        ts = ts.replace(':', '')
        # Timestamps are in seconds; convert the difference to ns
        duration_ns = float(ts) - float(start_data['ts'])
        duration_ns = int(duration_ns * 1000000000)
        duration_us = duration_ns // 1000
        comm = start_data['comm']
        if comm not in ohms:
            ohms[comm] = OnlineHarmonicMean()
        # Samples in the 0 us bucket are not included in the mean
        if duration_us > 0:
            ohms[comm].update(duration_ns)
        if comm not in histograms:
            histograms[comm] = {}
        histograms[comm][duration_us] = histograms[comm].get(duration_us, 0) + 1
    return None

def handle_expire(timer_ptr):
    if timer_ptr in pending:
        start_data = pending.pop(timer_ptr)
        comm = start_data['comm']
        if comm not in histograms:
            histograms[comm] = {}
        # Record -1 bucket for expired (failed to cancel)
        histograms[comm][-1] = histograms[comm].get(-1, 0) + 1
    return None

if __name__ == "__main__":
    file_path = "./trace.report"
    try:
        with open(file_path, 'r') as f:
            for line in f:
                # format description of trace-cmd report:
                # parts[0] task, parts[2] timestamp, parts[3] event,
                # parts[4] timer pointer, parts[5] timer function
                parts = line.split()
                if len(parts) < 5:
                    continue
                if "hrtimer_cancel" in parts[3]:
                    handle_cancel(parts[2], parts[4])
                    continue
                if len(parts) < 6:
                    continue
                if "hrtimer_start" in parts[3]:
                    handle_start(parts[0], parts[2], parts[4], parts[5])
                    continue
                if "hrtimer_expire_entry" in parts[3]:
                    handle_expire(parts[4])
    except PermissionError:
        print(f"Error: Permission denied reading {file_path}")
    except FileNotFoundError:
        print(f"Error: {file_path} not found.")

    print("\n" + "="*40)
    print("RSEQ SLICE HISTOGRAM (us)")
    print("="*40)
    for comm, buckets in histograms.items():
        # A task with only EXPIRED samples has no entry in ohms
        mean = ohms[comm].mean if comm in ohms else 0
        print(f"\nTask: {comm} Mean: {mean:.3f} ns")
        print(f" {'Latency (us)':<15} | {'Count'}")
        print(f" {'-'*30}")
        # Sort buckets numerically, putting -1 at the top
        for bucket in sorted(buckets.keys()):
            label = "EXPIRED" if bucket == -1 else f"{bucket} us"
            print(f" {label:<15} | {buckets[bucket]}")
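The script subtracts the timestamps as floats, which can lose nanosecond precision once the uptime gets large. A possible alternative, shown as a minimal sketch: parse the timestamp text into integer nanoseconds before subtracting. The helper name ts_to_ns() and the two sample lines are hypothetical, and the column layout is assumed to be the one the script indexes (parts[0] task, parts[2] timestamp, parts[3] event, parts[4] timer pointer).

```python
def ts_to_ns(ts):
    """Convert a 'seconds.fraction:' timestamp string to integer nanoseconds."""
    sec, _, frac = ts.rstrip(":").partition(".")
    frac = (frac + "0" * 9)[:9]   # pad or truncate the fraction to 9 digits
    return int(sec) * 1_000_000_000 + int(frac)


# Hypothetical trace lines; the exact trace-cmd report layout is an assumption.
sample_start = ("ora_dbwr_lmb-666257 [003] 1234.000001500: hrtimer_start: "
                "hrtimer=0xffff888100000000 function=rseq_slice_expired")
sample_cancel = ("ora_dbwr_lmb-666257 [003] 1234.000003100: hrtimer_cancel: "
                 "hrtimer=0xffff888100000000")

start = sample_start.split()
cancel = sample_cancel.split()

duration_ns = ts_to_ns(cancel[2]) - ts_to_ns(start[2])
print(duration_ns, "ns ->", duration_ns // 1000, "us bucket")
# prints: 1600 ns -> 1 us bucket
```

With integer arithmetic the bucket boundaries do not depend on how close the timestamp is to the limits of double precision.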
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2026-01-17 9:57 ` Peter Zijlstra
2026-01-23 17:38 ` Prakash Sangappa
@ 2026-01-23 17:41 ` Prakash Sangappa
2026-01-27 18:48 ` Peter Zijlstra
1 sibling, 1 reply; 78+ messages in thread
From: Prakash Sangappa @ 2026-01-23 17:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, LKML, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org, Randy Dunlap, Ron Geva,
Waiman Long
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [patch V6 07/11] rseq: Implement time slice extension enforcement timer
2026-01-23 17:41 ` Prakash Sangappa
@ 2026-01-27 18:48 ` Peter Zijlstra
0 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2026-01-27 18:48 UTC (permalink / raw)
To: Prakash Sangappa
Cc: Thomas Gleixner, LKML, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org, Randy Dunlap, Ron Geva,
Waiman Long
On Fri, Jan 23, 2026 at 05:41:16PM +0000, Prakash Sangappa wrote:
> The database setup is on an older environment. The python script provided does not run as is.
> So, modified it to gather the stats by parsing trace-cmd report output.
I 'wrote' it against this thing:
https://github.com/rostedt/trace-cmd/blob/master/python/tracecmd.py
> ========================================
> RSEQ SLICE HISTOGRAM (us)
> ========================================
>
> Task: ora_dbwr_lmb-666257 Mean: 1593.058 ns
> Latency (us) | Count
> ------------------------------
> EXPIRED | 3
> 0 us | 8
> 1 us | 116
> 2 us | 25
> 3 us | 2
> 4 us | 1
> 7 us | 1
>
> Task: ora_dbwd_lmb-666196 Mean: 1548.641 ns
> Latency (us) | Count
> ------------------------------
> EXPIRED | 6
> 0 us | 19
> 1 us | 203
> 2 us | 33
> 4 us | 2
> 5 us | 1
>
>
> [..]
>
> Task: oracle_668951_l-668951 Mean: 1797.796 ns
> Latency (us) | Count
> ------------------------------
> 1 us | 5
> 2 us | 1
> 3 us | 2
>
> Task: oracle_671571_l-671571 Mean: 3285.425 ns
> Latency (us) | Count
> ------------------------------
> 1 us | 1
> 3 us | 1
> 5 us | 1
> 6 us | 1
> 10 us | 1
>
> Task: oracle_672277_l-672277 Mean: 2361.600 ns
> Latency (us) | Count
> ------------------------------
> 1 us | 4
> 2 us | 3
> 5 us | 1
> 7 us | 1
> 9 us | 1
> 11 us | 1
>
> Task: ora_dbwb_lmb-666192 Mean: 1548.157 ns
> Latency (us) | Count
> ------------------------------
> EXPIRED | 10
> 0 us | 24
> 1 us | 182
> 2 us | 39
> 3 us | 2
> 4 us | 4
> 5 us | 1
> 14 us | 1
Thanks! That seems to confirm that 5us should be good enough for you
guys too.
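
As a rough cross-check of the 5us figure, the quoted buckets for one task can be summed directly. This is a minimal sketch, not part of either script in this thread: the numbers are copied from the ora_dbwb_lmb-666192 histogram quoted above, and fraction_below() is a made-up helper name. Since the parsing script derives the bucket with an integer division by 1000, bucket N holds durations from N us up to just under N+1 us, so "below 5us" means buckets 0 through 4. EXPIRED entries are counted separately.

```python
# Buckets copied from the quoted ora_dbwb_lmb-666192 histogram.
# Key -1 is the EXPIRED bucket (timer fired before it was cancelled).
buckets = {-1: 10, 0: 24, 1: 182, 2: 39, 3: 2, 4: 4, 5: 1, 14: 1}


def fraction_below(buckets, limit_us):
    """Share of cancelled timers whose duration was below limit_us.

    Hypothetical helper for illustration; EXPIRED entries (key -1)
    are excluded from both the numerator and the total.
    """
    cancelled = {us: n for us, n in buckets.items() if us >= 0}
    total = sum(cancelled.values())
    if total == 0:
        return 0.0
    below = sum(n for us, n in cancelled.items() if us < limit_us)
    return below / total


if __name__ == "__main__":
    print(f"cancelled below 5us: {fraction_below(buckets, 5):.1%}")
    print(f"expired: {buckets.get(-1, 0)} of {sum(buckets.values())} timers")
```

The same helper can be run over the other quoted tasks by swapping in their bucket counts.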