* [patch V3 00/12] rseq: Implement time slice extension mechanism
@ 2025-10-29 13:22 Thomas Gleixner
2025-10-29 13:22 ` [patch V3 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
` (13 more replies)
0 siblings, 14 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 13:22 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
This is a follow-up to the V2 version:
https://lore.kernel.org/20251022110646.839870156@linutronix.de
V1 contains a detailed explanation:
https://lore.kernel.org/20250908225709.144709889@linutronix.de
TLDR: Time slice extensions are an attempt to provide opportunistic
priority ceiling without the overhead of an actual priority ceiling
protocol, but also without the guarantees such a protocol provides.
The intent is to avoid situations where a user space thread is interrupted
in a critical section and scheduled out while holding a resource on which
the preempting thread or other threads in the system might block. That
obviously prevents those threads from making progress, in the worst case
for at least a full time slice. This is especially relevant for user space
spinlocks, which are a patently bad idea to begin with, but it is also true
for other mechanisms.
This series uses the existing RSEQ user memory to implement it.
Changes vs. V2:
- Rebase on the newest RSEQ and uaccess changes
- Document the command line parameter - Sebastian
- Use ENOTSUPP in the stub inline to be consistent - Sebastian
- Add sysctl documentation - Sebastian
- Simplify timer cancelation - Sebastian
- Restore the dropped 'From: Peter...' line in patch 1 - Sebastian
- More documentation/comment fixes - Randy
The uaccess and RSEQ modifications on which this series is based can be
found here:
https://lore.kernel.org/20251029123717.886619142@linutronix.de
and in git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
For your convenience all of it is also available as a conglomerate from
git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
Thanks,
tglx
---
Peter Zijlstra (1):
sched: Provide and use set_need_resched_current()
Thomas Gleixner (11):
rseq: Add fields and constants for time slice extension
rseq: Provide static branch for time slice extensions
rseq: Add statistics for time slice extensions
rseq: Add prctl() to enable time slice extensions
rseq: Implement sys_rseq_slice_yield()
rseq: Implement syscall entry work for time slice extensions
rseq: Implement time slice extension enforcement timer
rseq: Reset slice extension when scheduled
rseq: Implement rseq_grant_slice_extension()
entry: Hook up rseq time slice extension
selftests/rseq: Implement time slice extension test
Documentation/admin-guide/kernel-parameters.txt | 5
Documentation/admin-guide/sysctl/kernel.rst | 6
Documentation/userspace-api/index.rst | 1
Documentation/userspace-api/rseq.rst | 118 +++++++++
arch/alpha/kernel/syscalls/syscall.tbl | 1
arch/arm/tools/syscall.tbl | 1
arch/arm64/tools/syscall_32.tbl | 1
arch/m68k/kernel/syscalls/syscall.tbl | 1
arch/microblaze/kernel/syscalls/syscall.tbl | 1
arch/mips/kernel/syscalls/syscall_n32.tbl | 1
arch/mips/kernel/syscalls/syscall_n64.tbl | 1
arch/mips/kernel/syscalls/syscall_o32.tbl | 1
arch/parisc/kernel/syscalls/syscall.tbl | 1
arch/powerpc/kernel/syscalls/syscall.tbl | 1
arch/s390/kernel/syscalls/syscall.tbl | 1
arch/s390/mm/pfault.c | 3
arch/sh/kernel/syscalls/syscall.tbl | 1
arch/sparc/kernel/syscalls/syscall.tbl | 1
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
arch/xtensa/kernel/syscalls/syscall.tbl | 1
include/linux/entry-common.h | 2
include/linux/rseq.h | 11
include/linux/rseq_entry.h | 191 ++++++++++++++-
include/linux/rseq_types.h | 30 ++
include/linux/sched.h | 7
include/linux/syscalls.h | 1
include/linux/thread_info.h | 16 -
include/uapi/asm-generic/unistd.h | 5
include/uapi/linux/prctl.h | 10
include/uapi/linux/rseq.h | 38 +++
init/Kconfig | 12
kernel/entry/common.c | 14 -
kernel/entry/syscall-common.c | 11
kernel/rcu/tiny.c | 8
kernel/rcu/tree.c | 14 -
kernel/rcu/tree_exp.h | 3
kernel/rcu/tree_plugin.h | 9
kernel/rcu/tree_stall.h | 3
kernel/rseq.c | 299 ++++++++++++++++++++++++
kernel/sys.c | 6
kernel/sys_ni.c | 1
scripts/syscall.tbl | 1
tools/testing/selftests/rseq/.gitignore | 1
tools/testing/selftests/rseq/Makefile | 5
tools/testing/selftests/rseq/rseq-abi.h | 27 ++
tools/testing/selftests/rseq/slice_test.c | 198 +++++++++++++++
47 files changed, 1019 insertions(+), 53 deletions(-)
^ permalink raw reply [flat|nested] 63+ messages in thread
* [patch V3 01/12] sched: Provide and use set_need_resched_current()
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
@ 2025-10-29 13:22 ` Thomas Gleixner
2025-10-29 13:22 ` [patch V3 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
` (12 subsequent siblings)
13 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 13:22 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
From: Peter Zijlstra <peterz@infradead.org>
set_tsk_need_resched(current) requires a matching set_preempt_need_resched()
to work correctly outside of the scheduler.
Provide set_need_resched_current() which wraps this correctly and replace
all the open coded instances.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
arch/s390/mm/pfault.c | 3 +--
include/linux/sched.h | 7 +++++++
kernel/rcu/tiny.c | 8 +++-----
kernel/rcu/tree.c | 14 +++++---------
kernel/rcu/tree_exp.h | 3 +--
kernel/rcu/tree_plugin.h | 9 +++------
kernel/rcu/tree_stall.h | 3 +--
7 files changed, 21 insertions(+), 26 deletions(-)
--- a/arch/s390/mm/pfault.c
+++ b/arch/s390/mm/pfault.c
@@ -199,8 +199,7 @@ static void pfault_interrupt(struct ext_
* return to userspace schedule() to block.
*/
__set_current_state(TASK_UNINTERRUPTIBLE);
- set_tsk_need_resched(tsk);
- set_preempt_need_resched();
+ set_need_resched_current();
}
}
out:
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2033,6 +2033,13 @@ static inline int test_tsk_need_resched(
return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
}
+static inline void set_need_resched_current(void)
+{
+ lockdep_assert_irqs_disabled();
+ set_tsk_need_resched(current);
+ set_preempt_need_resched();
+}
+
/*
* cond_resched() and cond_resched_lock(): latency reduction via
* explicit rescheduling in places that are safe. The return
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -70,12 +70,10 @@ void rcu_qs(void)
*/
void rcu_sched_clock_irq(int user)
{
- if (user) {
+ if (user)
rcu_qs();
- } else if (rcu_ctrlblk.donetail != rcu_ctrlblk.curtail) {
- set_tsk_need_resched(current);
- set_preempt_need_resched();
- }
+ else if (rcu_ctrlblk.donetail != rcu_ctrlblk.curtail)
+ set_need_resched_current();
}
/*
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2696,10 +2696,8 @@ void rcu_sched_clock_irq(int user)
/* The load-acquire pairs with the store-release setting to true. */
if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
/* Idle and userspace execution already are quiescent states. */
- if (!rcu_is_cpu_rrupt_from_idle() && !user) {
- set_tsk_need_resched(current);
- set_preempt_need_resched();
- }
+ if (!rcu_is_cpu_rrupt_from_idle() && !user)
+ set_need_resched_current();
__this_cpu_write(rcu_data.rcu_urgent_qs, false);
}
rcu_flavor_sched_clock_irq(user);
@@ -2824,7 +2822,6 @@ static void strict_work_handler(struct w
/* Perform RCU core processing work for the current CPU. */
static __latent_entropy void rcu_core(void)
{
- unsigned long flags;
struct rcu_data *rdp = raw_cpu_ptr(&rcu_data);
struct rcu_node *rnp = rdp->mynode;
@@ -2837,8 +2834,8 @@ static __latent_entropy void rcu_core(vo
if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && (!(preempt_count() & PREEMPT_MASK))) {
rcu_preempt_deferred_qs(current);
} else if (rcu_preempt_need_deferred_qs(current)) {
- set_tsk_need_resched(current);
- set_preempt_need_resched();
+ guard(irqsave)();
+ set_need_resched_current();
}
/* Update RCU state based on any recent quiescent states. */
@@ -2847,10 +2844,9 @@ static __latent_entropy void rcu_core(vo
/* No grace period and unregistered callbacks? */
if (!rcu_gp_in_progress() &&
rcu_segcblist_is_enabled(&rdp->cblist) && !rcu_rdp_is_offloaded(rdp)) {
- local_irq_save(flags);
+ guard(irqsave)();
if (!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
rcu_accelerate_cbs_unlocked(rnp, rdp);
- local_irq_restore(flags);
}
rcu_check_gp_start_stall(rnp, rdp, rcu_jiffies_till_stall_check());
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -729,8 +729,7 @@ static void rcu_exp_need_qs(void)
__this_cpu_write(rcu_data.cpu_no_qs.b.exp, true);
/* Store .exp before .rcu_urgent_qs. */
smp_store_release(this_cpu_ptr(&rcu_data.rcu_urgent_qs), true);
- set_tsk_need_resched(current);
- set_preempt_need_resched();
+ set_need_resched_current();
}
#ifdef CONFIG_PREEMPT_RCU
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -753,8 +753,7 @@ static void rcu_read_unlock_special(stru
// Also if no expediting and no possible deboosting,
// slow is OK. Plus nohz_full CPUs eventually get
// tick enabled.
- set_tsk_need_resched(current);
- set_preempt_need_resched();
+ set_need_resched_current();
if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
needs_exp && rdp->defer_qs_iw_pending != DEFER_QS_PENDING &&
cpu_online(rdp->cpu)) {
@@ -813,10 +812,8 @@ static void rcu_flavor_sched_clock_irq(i
if (rcu_preempt_depth() > 0 ||
(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
/* No QS, force context switch if deferred. */
- if (rcu_preempt_need_deferred_qs(t)) {
- set_tsk_need_resched(t);
- set_preempt_need_resched();
- }
+ if (rcu_preempt_need_deferred_qs(t))
+ set_need_resched_current();
} else if (rcu_preempt_need_deferred_qs(t)) {
rcu_preempt_deferred_qs(t); /* Report deferred QS. */
return;
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -763,8 +763,7 @@ static void print_cpu_stall(unsigned lon
* progress and it could be we're stuck in kernel space without context
* switches for an entirely unreasonable amount of time.
*/
- set_tsk_need_resched(current);
- set_preempt_need_resched();
+ set_need_resched_current();
}
static bool csd_lock_suppress_rcu_stall;
* [patch V3 02/12] rseq: Add fields and constants for time slice extension
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-10-29 13:22 ` [patch V3 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
@ 2025-10-29 13:22 ` Thomas Gleixner
2025-10-30 22:01 ` Prakash Sangappa
` (2 more replies)
2025-10-29 13:22 ` [patch V3 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
` (11 subsequent siblings)
13 siblings, 3 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 13:22 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
Aside of a Kconfig knob add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member for the user space ABI struct rseq, which is going to be
used to communicate requests and grants between kernel and user space.
- A rseq state struct to hold the kernel state of this mechanism
- Documentation of the new mechanism
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
V3: Fix more typos and expressions - Randy
V2: Fix Kconfig indentation, fix typos and expressions - Randy
Make the control fields a struct and remove the atomicity requirement - Mathieu
---
Documentation/userspace-api/index.rst | 1
Documentation/userspace-api/rseq.rst | 118 ++++++++++++++++++++++++++++++++++
include/linux/rseq_types.h | 28 +++++++-
include/uapi/linux/rseq.h | 38 ++++++++++
init/Kconfig | 12 +++
kernel/rseq.c | 7 ++
6 files changed, 203 insertions(+), 1 deletion(-)
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
ebpf/index
ioctl/index
mseal
+ rseq
Security-related interfaces
===========================
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,118 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per-thread userspace memory area
+to be used as an ABI between kernel and userspace for three purposes:
+
+ * userspace restartable sequences
+
+ * quick access to read the current CPU number, node ID from userspace
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow userspace to perform update operations on
+per-cpu data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows implementing per-CPU data efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+ * Enabled in Kconfig
+
+ * Enabled at boot time (default is enabled)
+
+ * A rseq userspace pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success, or fails with one of the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg4 and arg5 must be zero
+ENOTSUPP Functionality was disabled on the kernel command line
+ENXIO Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can also be queried via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when the mechanism is enabled
+or 0 if it is disabled. Otherwise it fails with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status are also exposed via the flags field of the
+rseq ABI struct, via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and purely informational.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting rseq::slice_ctrl.request to 1. If the thread is
+interrupted and the interrupt results in a reschedule request in the
+kernel, then the kernel can grant a time slice extension and return to
+userspace instead of scheduling out.
+
+The kernel indicates the grant by clearing rseq::slice_ctrl::request and
+setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
+thread after granting the extension, the kernel clears the granted bit to
+indicate that to userspace.
+
+If the request bit is still set when leaving the critical section,
+userspace can clear it and continue.
+
+If the granted bit is set, then userspace has to invoke rseq_slice_yield()
+when leaving the critical section to relinquish the CPU. The kernel
+enforces this by arming a timer to prevent misbehaving userspace from
+abusing this mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by userspace.
+
+The required code flow is as follows::
+
+ rseq->slice_ctrl.request = 1;
+ critical_section();
+ if (rseq->slice_ctrl.granted)
+ rseq_slice_yield();
+
+As all of this is strictly CPU local, there are no atomicity requirements.
+Checking the granted state is racy, but that is unavoidable::
+
+ if (rseq->slice_ctrl & GRANTED)
+ -> Interrupt results in schedule and grant revocation
+ rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -73,12 +73,35 @@ struct rseq_ids {
};
/**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state: Compound to access the overall state
+ * @enabled: Time slice extension is enabled for the task
+ * @granted: Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+ u16 state;
+ struct {
+ u8 enabled;
+ u8 granted;
+ };
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state: Time slice extension state
+ */
+struct rseq_slice {
+ union rseq_slice_state state;
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
- * @sig: Signature of critial section abort IPs
+ * @sig: Signature of critical section abort IPs
* @event: Storage for event management
* @ids: Storage for cached CPU ID and MM CID
+ * @slice: Storage for time slice extension data
*/
struct rseq_data {
struct rseq __user *usrptr;
@@ -86,6 +109,9 @@ struct rseq_data {
u32 sig;
struct rseq_event event;
struct rseq_ids ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+ struct rseq_slice slice;
+#endif
};
#else /* CONFIG_RSEQ */
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -23,9 +23,15 @@ enum rseq_flags {
};
enum rseq_cs_flags_bit {
+ /* Historical and unsupported bits */
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+ /* (3) Intentional gap to put new bits into a separate byte */
+
+ /* User read only feature flags */
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
};
enum rseq_cs_flags {
@@ -35,6 +41,11 @@ enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
};
/*
@@ -53,6 +64,27 @@ struct rseq_cs {
__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));
+/**
+ * struct rseq_slice_ctrl - Time slice extension control structure
+ * @all: Compound value
+ * @request: Request for a time slice extension
+ * @granted: Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_slice_ctrl {
+ union {
+ __u32 all;
+ struct {
+ __u8 request;
+ __u8 granted;
+ __u16 __reserved;
+ };
+ };
+};
+
/*
* struct rseq is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
@@ -142,6 +174,12 @@ struct rseq {
__u32 mm_cid;
/*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_slice_ctrl slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1913,6 +1913,18 @@ config RSEQ
If unsure, say Y.
+config RSEQ_SLICE_EXTENSION
+ bool "Enable rseq-based time slice extension mechanism"
+ depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+ help
+ Allows userspace to request a limited time slice extension when
+ returning from an interrupt to user space via the RSEQ shared
+ data ABI. If granted, this allows the task to complete a critical
+ section, so that other threads are not stuck on a contended resource
+ while the task is scheduled out.
+
+ If unsure, say N.
+
config RSEQ_STATS
default n
bool "Enable lightweight statistics of restartable sequences" if EXPERT
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -389,6 +389,8 @@ static bool rseq_reset_ids(void)
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
+ u32 rseqfl = 0;
+
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -440,6 +442,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
if (!access_ok(rseq, rseq_len))
return -EFAULT;
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+ rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
scoped_user_write_access(rseq, efault) {
/*
* If the rseq_cs pointer is non-NULL on registration, clear it to
@@ -449,11 +454,13 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
* clearing the fields. Don't bother reading it, just reset it.
*/
unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+ unsafe_put_user(rseqfl, &rseq->flags, efault);
/* Initialize IDs in user space */
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
unsafe_put_user(0U, &rseq->node_id, efault);
unsafe_put_user(0U, &rseq->mm_cid, efault);
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
}
/*
* [patch V3 03/12] rseq: Provide static branch for time slice extensions
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-10-29 13:22 ` [patch V3 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
2025-10-29 13:22 ` [patch V3 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
@ 2025-10-29 13:22 ` Thomas Gleixner
2025-10-29 17:23 ` Randy Dunlap
2025-10-31 19:34 ` Mathieu Desnoyers
2025-10-29 13:22 ` [patch V3 04/12] rseq: Add statistics " Thomas Gleixner
` (10 subsequent siblings)
13 siblings, 2 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 13:22 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
Guard the time slice extension functionality with a static key, which can
be disabled on the kernel command line.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V3: Document command line parameter - Sebastian
V2: Return 1 from __setup() - Prateek
---
Documentation/admin-guide/kernel-parameters.txt | 5 +++++
include/linux/rseq_entry.h | 11 +++++++++++
kernel/rseq.c | 17 +++++++++++++++++
3 files changed, 33 insertions(+)
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6482,6 +6482,11 @@
rootflags= [KNL] Set root filesystem mount option string
+ rseq_slice_ext= [KNL] RSEQ based time slice extension
+ Format: boolean
+ Control enablement of RSEQ based time slice extension.
+ Default is 'on'.
+
initramfs_options= [KNL]
Specify mount options for the initramfs mount.
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -75,6 +75,17 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
#define rseq_inline __always_inline
#endif
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DECLARE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static __always_inline bool rseq_slice_extension_enabled(void)
+{
+ return static_branch_likely(&rseq_slice_extension_key);
+}
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline bool rseq_slice_extension_enabled(void) { return false; }
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
bool rseq_debug_validate_ids(struct task_struct *t);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -484,3 +484,20 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
efault:
return -EFAULT;
}
+
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+
+static int __init rseq_slice_cmdline(char *str)
+{
+ bool on;
+
+ if (kstrtobool(str, &on))
+ return -EINVAL;
+
+ if (!on)
+ static_branch_disable(&rseq_slice_extension_key);
+ return 1;
+}
+__setup("rseq_slice_ext=", rseq_slice_cmdline);
+#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
* [patch V3 04/12] rseq: Add statistics for time slice extensions
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (2 preceding siblings ...)
2025-10-29 13:22 ` [patch V3 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
@ 2025-10-29 13:22 ` Thomas Gleixner
2025-10-31 19:36 ` Mathieu Desnoyers
2025-10-29 13:22 ` [patch V3 05/12] rseq: Add prctl() to enable " Thomas Gleixner
` (9 subsequent siblings)
13 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 13:22 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
Extend the quick statistics with time slice specific fields.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
include/linux/rseq_entry.h | 4 ++++
kernel/rseq.c | 12 ++++++++++++
2 files changed, 16 insertions(+)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -15,6 +15,10 @@ struct rseq_stats {
unsigned long cs;
unsigned long clear;
unsigned long fixup;
+ unsigned long s_granted;
+ unsigned long s_expired;
+ unsigned long s_revoked;
+ unsigned long s_yielded;
};
DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -138,6 +138,12 @@ static int rseq_stats_show(struct seq_fi
stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu));
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+ stats.s_granted += data_race(per_cpu(rseq_stats.s_granted, cpu));
+ stats.s_expired += data_race(per_cpu(rseq_stats.s_expired, cpu));
+ stats.s_revoked += data_race(per_cpu(rseq_stats.s_revoked, cpu));
+ stats.s_yielded += data_race(per_cpu(rseq_stats.s_yielded, cpu));
+ }
}
seq_printf(m, "exit: %16lu\n", stats.exit);
@@ -148,6 +154,12 @@ static int rseq_stats_show(struct seq_fi
seq_printf(m, "cs: %16lu\n", stats.cs);
seq_printf(m, "clear: %16lu\n", stats.clear);
seq_printf(m, "fixup: %16lu\n", stats.fixup);
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+ seq_printf(m, "sgrant: %16lu\n", stats.s_granted);
+ seq_printf(m, "sexpir: %16lu\n", stats.s_expired);
+ seq_printf(m, "srevok: %16lu\n", stats.s_revoked);
+ seq_printf(m, "syield: %16lu\n", stats.s_yielded);
+ }
return 0;
}
* [patch V3 05/12] rseq: Add prctl() to enable time slice extensions
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (3 preceding siblings ...)
2025-10-29 13:22 ` [patch V3 04/12] rseq: Add statistics " Thomas Gleixner
@ 2025-10-29 13:22 ` Thomas Gleixner
2025-10-31 19:43 ` Mathieu Desnoyers
2025-10-29 13:22 ` [patch V3 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
` (8 subsequent siblings)
13 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 13:22 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
Implement a prctl() so that tasks can enable the time slice extension
mechanism. This fails when time slice extensions are disabled at compile
time or on the kernel command line, or when no rseq pointer is registered
in the kernel.
That allows implementing a single trivial check in the exit to user mode
hotpath to decide whether the whole mechanism needs to be invoked.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V3: Use -ENOTSUPP for the stub inline - Sebastian
---
include/linux/rseq.h | 9 +++++++
include/uapi/linux/prctl.h | 10 ++++++++
kernel/rseq.c | 52 +++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 6 +++++
4 files changed, 77 insertions(+)
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -164,4 +164,13 @@ void rseq_syscall(struct pt_regs *regs);
static inline void rseq_syscall(struct pt_regs *regs) { }
#endif /* !CONFIG_DEBUG_RSEQ */
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
+#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+ return -ENOTSUPP;
+}
+#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
+
#endif /* _LINUX_RSEQ_H */
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -386,4 +386,14 @@ struct prctl_mm_map {
# define PR_FUTEX_HASH_SET_SLOTS 1
# define PR_FUTEX_HASH_GET_SLOTS 2
+/* RSEQ time slice extensions */
+#define PR_RSEQ_SLICE_EXTENSION 79
+# define PR_RSEQ_SLICE_EXTENSION_GET 1
+# define PR_RSEQ_SLICE_EXTENSION_SET 2
+/*
+ * Bits for RSEQ_SLICE_EXTENSION_GET/SET
+ * PR_RSEQ_SLICE_EXT_ENABLE: Enable
+ */
+# define PR_RSEQ_SLICE_EXT_ENABLE 0x01
+
#endif /* _LINUX_PRCTL_H */
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,7 @@
#define RSEQ_BUILD_SLOW_PATH
#include <linux/debugfs.h>
+#include <linux/prctl.h>
#include <linux/ratelimit.h>
#include <linux/rseq_entry.h>
#include <linux/sched.h>
@@ -500,6 +501,57 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
+{
+ switch (arg2) {
+ case PR_RSEQ_SLICE_EXTENSION_GET:
+ if (arg3)
+ return -EINVAL;
+ return current->rseq.slice.state.enabled ? PR_RSEQ_SLICE_EXT_ENABLE : 0;
+
+ case PR_RSEQ_SLICE_EXTENSION_SET: {
+ u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+ bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
+
+ if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
+ return -EINVAL;
+ if (!rseq_slice_extension_enabled())
+ return -ENOTSUPP;
+ if (!current->rseq.usrptr)
+ return -ENXIO;
+
+ /* No change? */
+ if (enable == !!current->rseq.slice.state.enabled)
+ return 0;
+
+ if (get_user(rflags, &current->rseq.usrptr->flags))
+ goto die;
+
+ if (current->rseq.slice.state.enabled)
+ valid |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+ if ((rflags & valid) != valid)
+ goto die;
+
+ rflags &= ~RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+ rflags |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+ if (enable)
+ rflags |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
+
+ if (put_user(rflags, &current->rseq.usrptr->flags))
+ goto die;
+
+ current->rseq.slice.state.enabled = enable;
+ return 0;
+ }
+ default:
+ return -EINVAL;
+ }
+die:
+ force_sig(SIGSEGV);
+ return -EFAULT;
+}
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -53,6 +53,7 @@
#include <linux/time_namespace.h>
#include <linux/binfmts.h>
#include <linux/futex.h>
+#include <linux/rseq.h>
#include <linux/sched.h>
#include <linux/sched/autogroup.h>
@@ -2868,6 +2869,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
case PR_FUTEX_HASH:
error = futex_hash_prctl(arg2, arg3, arg4);
break;
+ case PR_RSEQ_SLICE_EXTENSION:
+ if (arg4 || arg5)
+ return -EINVAL;
+ error = rseq_slice_extension_prctl(arg2, arg3);
+ break;
default:
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
^ permalink raw reply [flat|nested] 63+ messages in thread
* [patch V3 06/12] rseq: Implement sys_rseq_slice_yield()
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (4 preceding siblings ...)
2025-10-29 13:22 ` [patch V3 05/12] rseq: Add prctl() to enable " Thomas Gleixner
@ 2025-10-29 13:22 ` Thomas Gleixner
2025-10-31 19:46 ` Mathieu Desnoyers
2025-10-29 13:22 ` [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
` (7 subsequent siblings)
13 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 13:22 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Arnd Bergmann, linux-arch, Mathieu Desnoyers,
Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior
Provide a new syscall whose only purpose is to yield the CPU after the
kernel granted a time slice extension.
sched_yield() is not suitable for that because it unconditionally
schedules, while the end of a time slice extension only requires a
schedule if the preemption is still pending. A dedicated syscall also
allows a strict termination check to catch user space invoking random
syscalls, including sched_yield(), from a time slice extension region.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arch@vger.kernel.org
---
V2: Use the proper name in sys_ni.c and add comment - Prateek
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/tools/syscall_32.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 ++++-
kernel/rseq.c | 21 +++++++++++++++++++++
kernel/sys_ni.c | 1 +
scripts/syscall.tbl | 1 +
21 files changed, 44 insertions(+), 1 deletion(-)
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -509,3 +509,4 @@
577 common open_tree_attr sys_open_tree_attr
578 common file_getattr sys_file_getattr
579 common file_setattr sys_file_setattr
+580 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -484,3 +484,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -481,3 +481,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -469,3 +469,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -408,3 +408,4 @@
467 n32 open_tree_attr sys_open_tree_attr
468 n32 file_getattr sys_file_getattr
469 n32 file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -384,3 +384,4 @@
467 n64 open_tree_attr sys_open_tree_attr
468 n64 file_getattr sys_file_getattr
469 n64 file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -457,3 +457,4 @@
467 o32 open_tree_attr sys_open_tree_attr
468 o32 file_getattr sys_file_getattr
469 o32 file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -468,3 +468,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -560,3 +560,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 nospu rseq_slice_yield sys_rseq_slice_yield
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -472,3 +472,4 @@
467 common open_tree_attr sys_open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield sys_rseq_slice_yield
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -475,3 +475,4 @@
467 i386 open_tree_attr sys_open_tree_attr
468 i386 file_getattr sys_file_getattr
469 i386 file_setattr sys_file_setattr
+470 i386 rseq_slice_yield sys_rseq_slice_yield
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -394,6 +394,7 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
#
# Due to a historical design error, certain syscalls are numbered differently
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -957,6 +957,7 @@ asmlinkage long sys_statx(int dfd, const
unsigned mask, struct statx __user *buffer);
asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
int flags, uint32_t sig);
+asmlinkage long sys_rseq_slice_yield(void);
asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
asmlinkage long sys_open_tree_attr(int dfd, const char __user *path,
unsigned flags,
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -858,8 +858,11 @@
#define __NR_file_setattr 469
__SYSCALL(__NR_file_setattr, sys_file_setattr)
+#define __NR_rseq_slice_yield 470
+__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+
#undef __NR_syscalls
-#define __NR_syscalls 470
+#define __NR_syscalls 471
/*
* 32 bit systems traditionally used different
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -552,6 +552,27 @@ int rseq_slice_extension_prctl(unsigned
return -EFAULT;
}
+/**
+ * sys_rseq_slice_yield - yield the current processor if a task granted
+ * with a time slice extension is done with the
+ * critical work before being forced out.
+ *
+ * On entry from user space, syscall_entry_work() ensures that NEED_RESCHED is
+ * set if the task was granted a slice extension before arriving here.
+ *
+ * Return: 1 if the task successfully yielded the CPU within the granted slice.
+ * 0 if the slice extension was either never granted or was revoked by
+ * going over the granted extension or being scheduled out earlier
+ */
+SYSCALL_DEFINE0(rseq_slice_yield)
+{
+ if (need_resched()) {
+ schedule();
+ return 1;
+ }
+ return 0;
+}
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -390,6 +390,7 @@ COND_SYSCALL(setuid16);
/* restartable sequence */
COND_SYSCALL(rseq);
+COND_SYSCALL(rseq_slice_yield);
COND_SYSCALL(uretprobe);
COND_SYSCALL(uprobe);
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -410,3 +410,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common rseq_slice_yield sys_rseq_slice_yield
* [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (5 preceding siblings ...)
2025-10-29 13:22 ` [patch V3 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
@ 2025-10-29 13:22 ` Thomas Gleixner
2025-10-31 19:53 ` Mathieu Desnoyers
2025-11-19 0:20 ` Prakash Sangappa
2025-10-29 13:22 ` [patch V3 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
` (6 subsequent siblings)
13 siblings, 2 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 13:22 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice
extension. This allows handling the rseq_slice_yield() syscall, which is
used by user space to relinquish the CPU after finishing the critical
section for which it requested an extension.
If the kernel state is still GRANTED, the kernel resets both the kernel
and the user space state with a set of sanity checks. If the kernel state
is already cleared, then this raced against the timer or some other
interrupt, and only the work bit is cleared.
Doing this in syscall entry work allows catching misbehaving user space,
which issues a syscall from within the critical section. A wrong syscall
or inconsistent user space state results in a SIGSEGV.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V3: Use get/put_user()
---
include/linux/entry-common.h | 2 -
include/linux/rseq.h | 2 +
include/linux/thread_info.h | 16 ++++----
kernel/entry/syscall-common.c | 11 ++++-
kernel/rseq.c | 79 ++++++++++++++++++++++++++++++++++++++++++
5 files changed, 100 insertions(+), 10 deletions(-)
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -36,8 +36,8 @@
SYSCALL_WORK_SYSCALL_EMU | \
SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
+ SYSCALL_WORK_SYSCALL_RSEQ_SLICE | \
ARCH_SYSCALL_WORK_ENTER)
-
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -165,8 +165,10 @@ static inline void rseq_syscall(struct p
#endif /* !CONFIG_DEBUG_RSEQ */
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+void rseq_syscall_enter_work(long syscall);
int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3);
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
+static inline void rseq_syscall_enter_work(long syscall) { }
static inline int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
{
return -ENOTSUPP;
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,15 +46,17 @@ enum syscall_work_bit {
SYSCALL_WORK_BIT_SYSCALL_AUDIT,
SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+ SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE,
};
-#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
-#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
-#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
-#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
-#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
-#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
-#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
+#define SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_BIT_SYSCALL_TRACEPOINT)
+#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
+#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
+#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
+#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
+#define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_RSEQ_SLICE BIT(SYSCALL_WORK_BIT_SYSCALL_RSEQ_SLICE)
#endif
#include <asm/thread_info.h>
--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -17,8 +17,7 @@ static inline void syscall_enter_audit(s
}
}
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
- unsigned long work)
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work)
{
long ret = 0;
@@ -32,6 +31,14 @@ long syscall_trace_enter(struct pt_regs
return -1L;
}
+ /*
+ * User space got a time slice extension granted and relinquishes
+ * the CPU. The work stops the slice timer to avoid an extra round
+ * through hrtimer_interrupt().
+ */
+ if (work & SYSCALL_WORK_SYSCALL_RSEQ_SLICE)
+ rseq_syscall_enter_work(syscall);
+
/* Handle ptrace */
if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
ret = ptrace_report_syscall_entry(regs);
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -501,6 +501,85 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+static inline void rseq_slice_set_need_resched(struct task_struct *curr)
+{
+ /*
+ * The interrupt guard is required to prevent inconsistent state in
+ * this case:
+ *
+ * set_tsk_need_resched()
+ * --> Interrupt
+ * wakeup()
+ * set_tsk_need_resched()
+ * set_preempt_need_resched()
+ * schedule_on_return()
+ * clear_tsk_need_resched()
+ * clear_preempt_need_resched()
+ * set_preempt_need_resched() <- Inconsistent state
+ *
+ * This is safe vs. a remote set of TIF_NEED_RESCHED because that
+ * only sets the already set bit and does not create inconsistent
+ * state.
+ */
+ scoped_guard(irq)
+ set_need_resched_current();
+}
+
+static void rseq_slice_validate_ctrl(u32 expected)
+{
+ u32 __user *sctrl = &current->rseq.usrptr->slice_ctrl.all;
+ u32 uval;
+
+ if (get_user(uval, sctrl) || uval != expected)
+ force_sig(SIGSEGV);
+}
+
+/*
+ * Invoked from syscall entry if a time slice extension was granted and the
+ * kernel did not clear it before user space left the critical section.
+ */
+void rseq_syscall_enter_work(long syscall)
+{
+ struct task_struct *curr = current;
+ struct rseq_slice_ctrl ctrl = { .granted = curr->rseq.slice.state.granted };
+
+ clear_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+
+ if (static_branch_unlikely(&rseq_debug_enabled))
+ rseq_slice_validate_ctrl(ctrl.all);
+
+ /*
+ * The kernel might have raced, revoked the grant and updated
+ * userspace, but kept the SLICE work set.
+ */
+ if (!ctrl.granted)
+ return;
+
+ rseq_stat_inc(rseq_stats.s_yielded);
+
+ /*
+ * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
+ * kernels.
+ */
+ scoped_guard(preempt) {
+ /*
+ * Now that preemption is disabled, quickly check whether
+ * the task was already rescheduled before arriving here.
+ */
+ if (!curr->rseq.event.sched_switch)
+ rseq_slice_set_need_resched(curr);
+ }
+
+ curr->rseq.slice.state.granted = false;
+ /*
+ * Clear the grant in user space and check whether this was the
+ * correct syscall to yield. If the user access fails or the task
+ * used an arbitrary syscall, terminate it.
+ */
+ if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all) || syscall != __NR_rseq_slice_yield)
+ force_sig(SIGSEGV);
+}
+
int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
{
switch (arg2) {
* [patch V3 08/12] rseq: Implement time slice extension enforcement timer
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (6 preceding siblings ...)
2025-10-29 13:22 ` [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
@ 2025-10-29 13:22 ` Thomas Gleixner
2025-10-29 18:45 ` Steven Rostedt
2025-10-31 19:59 ` Mathieu Desnoyers
2025-10-29 13:22 ` [patch V3 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
` (5 subsequent siblings)
13 siblings, 2 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 13:22 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.
It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:
1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
independently of CONFIG_HIGHRES_TIMERS
2) HRTICK usage in the scheduler can be runtime disabled or is only used
for certain aspects of scheduling.
3) The function calls into the scheduler code, which might have
unexpected consequences when it is invoked due to a time slice
enforcement expiry, especially when the task managed to clear the
grant via sched_yield().
It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.
Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.
The timer is armed when an extension was granted right before actually
returning to user mode in rseq_exit_to_user_mode_restart().
It is disarmed when the task relinquishes the CPU. This is expensive, as
the timer is probably the first expiring timer on the CPU, which means the
cancelation has to reprogram the hardware. But that's still cheaper than
going through a full hrtimer interrupt cycle for nothing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V3: Add sysctl documentation, simplify timer cancelation - Sebastian
---
Documentation/admin-guide/sysctl/kernel.rst | 6 +
include/linux/rseq_entry.h | 38 ++++++---
include/linux/rseq_types.h | 2
kernel/rseq.c | 115 +++++++++++++++++++++++++++-
4 files changed, 149 insertions(+), 12 deletions(-)
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1228,6 +1228,12 @@ reboot-cmd (SPARC only)
ROM/Flash boot loader. Maybe to tell it what to do after
rebooting. ???
+rseq_slice_extension_nsec
+=========================
+
+A task can request to delay its scheduling if it is in a critical section
+via the prctl(PR_RSEQ_SLICE_EXTENSION_SET) mechanism. This sets the maximum
+allowed extension in nanoseconds before scheduling of the task is enforced.
sched_energy_aware
==================
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -86,8 +86,24 @@ static __always_inline bool rseq_slice_e
{
return static_branch_likely(&rseq_slice_extension_key);
}
+
+extern unsigned int rseq_slice_ext_nsecs;
+bool __rseq_arm_slice_extension_timer(void);
+
+static __always_inline bool rseq_arm_slice_extension_timer(void)
+{
+ if (!rseq_slice_extension_enabled())
+ return false;
+
+ if (likely(!current->rseq.slice.state.granted))
+ return false;
+
+ return __rseq_arm_slice_extension_timer();
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
+static inline bool rseq_arm_slice_extension_timer(void) { return false; }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -542,17 +558,19 @@ static __always_inline void clear_tif_rs
static __always_inline bool
rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
{
- if (likely(!test_tif_rseq(ti_work)))
- return false;
-
- if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
- current->rseq.event.slowpath = true;
- set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
- return true;
+ if (unlikely(test_tif_rseq(ti_work))) {
+ if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
+ current->rseq.event.slowpath = true;
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ return true;
+ }
+ clear_tif_rseq();
}
-
- clear_tif_rseq();
- return false;
+ /*
+ * Arm the slice extension timer if nothing to do anymore and the
+ * task really goes out to user space.
+ */
+ return rseq_arm_slice_extension_timer();
}
#endif /* CONFIG_GENERIC_ENTRY */
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -89,9 +89,11 @@ union rseq_slice_state {
/**
* struct rseq_slice - Status information for rseq time slice extension
* @state: Time slice extension state
+ * @expires: The time when a grant expires
*/
struct rseq_slice {
union rseq_slice_state state;
+ u64 expires;
};
/**
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -71,6 +71,8 @@
#define RSEQ_BUILD_SLOW_PATH
#include <linux/debugfs.h>
+#include <linux/hrtimer.h>
+#include <linux/percpu.h>
#include <linux/prctl.h>
#include <linux/ratelimit.h>
#include <linux/rseq_entry.h>
@@ -499,8 +501,78 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
}
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+struct slice_timer {
+ struct hrtimer timer;
+ void *cookie;
+};
+
+unsigned int rseq_slice_ext_nsecs __read_mostly = 30 * NSEC_PER_USEC;
+static DEFINE_PER_CPU(struct slice_timer, slice_timer);
DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
+static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
+{
+ struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
+
+ if (st->cookie == current && current->rseq.slice.state.granted) {
+ rseq_stat_inc(rseq_stats.s_expired);
+ set_need_resched_current();
+ }
+ return HRTIMER_NORESTART;
+}
+
+bool __rseq_arm_slice_extension_timer(void)
+{
+ struct slice_timer *st = this_cpu_ptr(&slice_timer);
+ struct task_struct *curr = current;
+
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * This check prevents that a granted time slice extension exceeds
+ * the maximum scheduling latency when the grant expired before
+ * going out to user space. Don't bother to clear the grant here,
+ * it will be cleaned up automatically before going out to user
+ * space.
+ */
+ if ((unlikely(curr->rseq.slice.expires < ktime_get_mono_fast_ns()))) {
+ set_need_resched_current();
+ return true;
+ }
+
+ /*
+ * Store the task pointer as a cookie for comparison in the timer
+ * function. This is safe as the timer is CPU local and cannot be
+ * in the expiry function at this point.
+ */
+ st->cookie = curr;
+ hrtimer_start(&st->timer, curr->rseq.slice.expires, HRTIMER_MODE_ABS_PINNED_HARD);
+ /* Arm the syscall entry work */
+ set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
+ return false;
+}
+
+static void rseq_cancel_slice_extension_timer(void)
+{
+ struct slice_timer *st = this_cpu_ptr(&slice_timer);
+
+ /*
+ * st->cookie can be safely read as preemption is disabled and the
+ * timer is CPU local.
+ *
+ * As this is most probably the first expiring timer, the cancel is
+ * expensive as it has to reprogram the hardware, but that's less
+ * expensive than going through a full hrtimer_interrupt() cycle
+ * for nothing.
+ *
+ * hrtimer_try_to_cancel() is sufficient here as the timer is CPU
+ * local and once the hrtimer code disabled interrupts the timer
+ * callback cannot be running.
+ */
+ if (st->cookie == current)
+ hrtimer_try_to_cancel(&st->timer);
+}
+
static inline void rseq_slice_set_need_resched(struct task_struct *curr)
{
/*
@@ -558,10 +630,11 @@ void rseq_syscall_enter_work(long syscal
rseq_stat_inc(rseq_stats.s_yielded);
/*
- * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
- * kernels.
+ * Required to stabilize the per CPU timer pointer and to make
+ * set_tsk_need_resched() correct on PREEMPT[RT] kernels.
*/
scoped_guard(preempt) {
+ rseq_cancel_slice_extension_timer();
/*
* Now that preemption is disabled, quickly check whether
* the task was already rescheduled before arriving here.
@@ -652,6 +725,31 @@ SYSCALL_DEFINE0(rseq_slice_yield)
return 0;
}
+#ifdef CONFIG_SYSCTL
+static const unsigned int rseq_slice_ext_nsecs_min = 10 * NSEC_PER_USEC;
+static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC;
+
+static const struct ctl_table rseq_slice_ext_sysctl[] = {
+ {
+ .procname = "rseq_slice_extension_nsec",
+ .data = &rseq_slice_ext_nsecs,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = (unsigned int *)&rseq_slice_ext_nsecs_min,
+ .extra2 = (unsigned int *)&rseq_slice_ext_nsecs_max,
+ },
+};
+
+static void rseq_slice_sysctl_init(void)
+{
+ if (rseq_slice_extension_enabled())
+ register_sysctl_init("kernel", rseq_slice_ext_sysctl);
+}
+#else /* CONFIG_SYSCTL */
+static inline void rseq_slice_sysctl_init(void) { }
+#endif /* !CONFIG_SYSCTL */
+
static int __init rseq_slice_cmdline(char *str)
{
bool on;
@@ -664,4 +762,17 @@ static int __init rseq_slice_cmdline(cha
return 1;
}
__setup("rseq_slice_ext=", rseq_slice_cmdline);
+
+static int __init rseq_slice_init(void)
+{
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ hrtimer_setup(per_cpu_ptr(&slice_timer.timer, cpu), rseq_slice_expired,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
+ }
+ rseq_slice_sysctl_init();
+ return 0;
+}
+device_initcall(rseq_slice_init);
#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
* [patch V3 09/12] rseq: Reset slice extension when scheduled
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (7 preceding siblings ...)
2025-10-29 13:22 ` [patch V3 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
@ 2025-10-29 13:22 ` Thomas Gleixner
2025-10-31 20:03 ` Mathieu Desnoyers
2025-10-29 13:22 ` [patch V3 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
` (4 subsequent siblings)
13 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 13:22 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
When a time slice extension was granted in the need_resched() check on exit
to user space, the task can still be scheduled out by one of the other
pending work items. When it is scheduled back in and need_resched() is
not set, the stale grant would be preserved, which is just wrong.
RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the
critical section and ID update mechanisms.
Utilize them and clear the user space slice control member of struct rseq
unconditionally within the existing user access sections. That's just one
more unconditional store in that path.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
include/linux/rseq_entry.h | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -101,9 +101,17 @@ static __always_inline bool rseq_arm_sli
return __rseq_arm_slice_extension_timer();
}
+static __always_inline void rseq_slice_clear_grant(struct task_struct *t)
+{
+ if (IS_ENABLED(CONFIG_RSEQ_STATS) && t->rseq.slice.state.granted)
+ rseq_stat_inc(rseq_stats.s_revoked);
+ t->rseq.slice.state.granted = false;
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
static inline bool rseq_arm_slice_extension_timer(void) { return false; }
+static inline void rseq_slice_clear_grant(struct task_struct *t) { }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -390,8 +398,15 @@ bool rseq_set_ids_get_csaddr(struct task
unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
if (csaddr)
unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
+
+ /* Open coded, so it's in the same user access region */
+ if (rseq_slice_extension_enabled()) {
+ /* Unconditionally clear it, no point in conditionals */
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+ }
}
+ rseq_slice_clear_grant(t);
/* Cache the new values */
t->rseq.ids.cpu_cid = ids->cpu_cid;
rseq_stat_inc(rseq_stats.ids);
@@ -487,8 +502,17 @@ static __always_inline bool rseq_exit_us
*/
u64 csaddr;
- if (unlikely(!get_user_inline(csaddr, &rseq->rseq_cs)))
- return false;
+ scoped_user_rw_access(rseq, efault) {
+ unsafe_get_user(csaddr, &rseq->rseq_cs, efault);
+
+ /* Open coded, so it's in the same user access region */
+ if (rseq_slice_extension_enabled()) {
+ /* Unconditionally clear it, no point in conditionals */
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+ }
+ }
+
+ rseq_slice_clear_grant(t);
if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
@@ -504,6 +528,8 @@ static __always_inline bool rseq_exit_us
u32 node_id = cpu_to_node(ids.cpu_id);
return rseq_update_usr(t, regs, &ids, node_id);
+efault:
+ return false;
}
static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
^ permalink raw reply [flat|nested] 63+ messages in thread
* [patch V3 10/12] rseq: Implement rseq_grant_slice_extension()
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (8 preceding siblings ...)
2025-10-29 13:22 ` [patch V3 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
@ 2025-10-29 13:22 ` Thomas Gleixner
2025-10-29 20:08 ` Steven Rostedt
2025-10-29 13:22 ` [patch V3 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
` (3 subsequent siblings)
13 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 13:22 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
Provide the actual decision function, which decides whether a time slice
extension is granted in the exit to user mode path when NEED_RESCHED is
evaluated.
The decision is made in two stages. First, an inline quick check avoids
going into the actual decision function. It checks whether:
#1 the functionality is enabled
#2 the exit is a return from interrupt to user mode
#3 any TIF bit that causes extra work is set. That includes TIF_RSEQ,
which means the task was already scheduled out.
The slow path, which implements the actual user space ABI, is invoked
when:
A) #1 is true, #2 is true and #3 is false
It checks whether user space requested a slice extension by setting
the request bit in the rseq slice_ctrl field. If so, it grants the
extension and stores the slice expiry time, so that the actual exit
code can double check whether the slice is already exhausted before
going back.
B) #1 - #3 are true _and_ a slice extension was granted in a previous
loop iteration
In this case the grant is revoked.
If the user space access faults or invalid state is detected, the task is
terminated with SIGSEGV.
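For illustration, the user space side of this protocol could look roughly
like the sketch below. It only mirrors the ABI bits described above
(request set by user space, granted set and cleared by the kernel,
rseq_slice_yield() to give the CPU back). The struct layout and the
syscall number are taken from the selftest in this series; the function
names and everything else are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_rseq_slice_yield
# define __NR_rseq_slice_yield 470	/* from the selftest in this series */
#endif

/* Mirrors the layout of rseq->slice_ctrl in this series */
struct slice_ctrl_sketch {
	unsigned char	request;	/* set by user space */
	unsigned char	granted;	/* set and cleared by the kernel */
};

static void cs_enter(volatile struct slice_ctrl_sketch *ctrl)
{
	/* Ask for a slice extension before entering the critical section */
	ctrl->request = 1;
}

static void cs_exit(volatile struct slice_ctrl_sketch *ctrl)
{
	/* Done with the critical section: withdraw the request ... */
	ctrl->request = 0;
	/* ... and if the kernel granted an extension meanwhile, yield */
	if (ctrl->granted)
		syscall(__NR_rseq_slice_yield);
}
```

In the real ABI the control pointer would be the current thread's
rseq area, i.e. &rseq_get_abi()->slice_ctrl as the selftest at the end
of the series does.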
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V2: Provide an extra stub for the !RSEQ case - Prateek
---
include/linux/rseq_entry.h | 108 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 108 insertions(+)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -41,6 +41,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
#ifdef CONFIG_RSEQ
#include <linux/jump_label.h>
#include <linux/rseq.h>
+#include <linux/sched/signal.h>
#include <linux/uaccess.h>
#include <linux/tracepoint-defs.h>
@@ -108,10 +109,116 @@ static __always_inline void rseq_slice_c
t->rseq.slice.state.granted = false;
}
+static __always_inline bool rseq_grant_slice_extension(bool work_pending)
+{
+ struct task_struct *curr = current;
+ struct rseq_slice_ctrl usr_ctrl;
+ union rseq_slice_state state;
+ struct rseq __user *rseq;
+
+ if (!rseq_slice_extension_enabled())
+ return false;
+
+ /* If not enabled or not a return from interrupt, nothing to do. */
+ state = curr->rseq.slice.state;
+ state.enabled &= curr->rseq.event.user_irq;
+ if (likely(!state.state))
+ return false;
+
+ rseq = curr->rseq.usrptr;
+ scoped_user_rw_access(rseq, efault) {
+
+ /*
+ * Quick check conditions where a grant is not possible or
+ * needs to be revoked.
+ *
+ * 1) Any TIF bit which needs to do extra work aside from
+ * rescheduling prevents a grant.
+ *
+ * 2) A previous rescheduling request resulted in a slice
+ * extension grant.
+ */
+ if (unlikely(work_pending || state.granted)) {
+ /* Clear user control unconditionally. No point in checking */
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+ rseq_slice_clear_grant(curr);
+ return false;
+ }
+
+ unsafe_get_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault);
+ if (likely(!(usr_ctrl.request)))
+ return false;
+
+ /* Grant the slice extension */
+ usr_ctrl.request = 0;
+ usr_ctrl.granted = 1;
+ unsafe_put_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault);
+ }
+
+ rseq_stat_inc(rseq_stats.s_granted);
+
+ curr->rseq.slice.state.granted = true;
+ /* Store expiry time for arming the timer on the way out */
+ curr->rseq.slice.expires = data_race(rseq_slice_ext_nsecs) + ktime_get_mono_fast_ns();
+ /*
+ * This is racy against a remote CPU setting TIF_NEED_RESCHED in
+ * several ways:
+ *
+ * 1)
+ * CPU0 CPU1
+ * clear_tsk()
+ * set_tsk()
+ * clear_preempt()
+ * Raise scheduler IPI on CPU0
+ * --> IPI
+ * fold_need_resched() -> Folds correctly
+ * 2)
+ * CPU0 CPU1
+ * set_tsk()
+ * clear_tsk()
+ * clear_preempt()
+ * Raise scheduler IPI on CPU0
+ * --> IPI
+ * fold_need_resched() <- NOOP as TIF_NEED_RESCHED is false
+ *
+ * #1 is no different from a regular remote reschedule: it sets the
+ * previously unset bit and then raises the IPI, which folds it into
+ * the preempt counter.
+ *
+ * #2 is obviously incorrect from a scheduler POV, but it is no more
+ * incorrect than the code below, which clears the reschedule request
+ * with the safety net of the timer.
+ *
+ * The important part is that the clearing is protected against the
+ * scheduler IPI and also against any other interrupt which might
+ * end up waking up a task and setting the bits in the middle of
+ * the operation:
+ *
+ * clear_tsk()
+ * ---> Interrupt
+ * wakeup_on_this_cpu()
+ * set_tsk()
+ * set_preempt()
+ * clear_preempt()
+ *
+ * which would be inconsistent state.
+ */
+ scoped_guard(irq) {
+ clear_tsk_need_resched(curr);
+ clear_preempt_need_resched();
+ }
+ return true;
+
+efault:
+ force_sig(SIGSEGV);
+ return false;
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static inline bool rseq_slice_extension_enabled(void) { return false; }
static inline bool rseq_arm_slice_extension_timer(void) { return false; }
static inline void rseq_slice_clear_grant(struct task_struct *t) { }
+static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -646,6 +753,7 @@ static inline bool rseq_exit_to_user_mod
static inline void rseq_syscall_exit_to_user_mode(void) { }
static inline void rseq_irqentry_exit_to_user_mode(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
+static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
* [patch V3 11/12] entry: Hook up rseq time slice extension
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (9 preceding siblings ...)
2025-10-29 13:22 ` [patch V3 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
@ 2025-10-29 13:22 ` Thomas Gleixner
2025-10-29 13:22 ` [patch V3 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
` (2 subsequent siblings)
13 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 13:22 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
Wire up the grant decision function in exit_to_user_mode_loop().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
kernel/entry/common.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -17,6 +17,14 @@ void __weak arch_do_signal_or_restart(st
#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK)
#endif
+/* TIF bits, which prevent a time slice extension. */
+#ifdef CONFIG_PREEMPT_RT
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED_LAZY)
+#else
+# define TIF_SLICE_EXT_SCHED (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
+#endif
+#define TIF_SLICE_EXT_DENY (EXIT_TO_USER_MODE_WORK & ~TIF_SLICE_EXT_SCHED)
+
static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
@@ -28,8 +36,10 @@ static __always_inline unsigned long __e
local_irq_enable_exit_to_user(ti_work);
- if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
- schedule();
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
+ if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+ schedule();
+ }
if (ti_work & _TIF_UPROBE)
uprobe_notify_resume(regs);
* [patch V3 12/12] selftests/rseq: Implement time slice extension test
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (10 preceding siblings ...)
2025-10-29 13:22 ` [patch V3 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
@ 2025-10-29 13:22 ` Thomas Gleixner
2025-10-29 15:10 ` [patch V3 00/12] rseq: Implement time slice extension mechanism Sebastian Andrzej Siewior
2025-11-06 17:28 ` Prakash Sangappa
13 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 13:22 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
Provide an initial test case to evaluate the functionality. This needs to be
extended to cover the ABI violations and expose the race condition between
observing granted and arriving in rseq_slice_yield().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
tools/testing/selftests/rseq/.gitignore | 1
tools/testing/selftests/rseq/Makefile | 5
tools/testing/selftests/rseq/rseq-abi.h | 27 ++++
tools/testing/selftests/rseq/slice_test.c | 198 ++++++++++++++++++++++++++++++
4 files changed, 230 insertions(+), 1 deletion(-)
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -10,3 +10,4 @@ param_test_mm_cid
param_test_mm_cid_benchmark
param_test_mm_cid_compare_twice
syscall_errors_test
+slice_test
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
param_test_benchmark param_test_compare_twice param_test_mm_cid \
param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
- syscall_errors_test
+ syscall_errors_test slice_test
TEST_GEN_PROGS_EXTENDED = librseq.so
@@ -59,3 +59,6 @@ include ../lib.mk
$(OUTPUT)/syscall_errors_test: syscall_errors_test.c $(TEST_GEN_PROGS_EXTENDED) \
rseq.h rseq-*.h
$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
+ $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -53,6 +53,27 @@ struct rseq_abi_cs {
__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));
+/**
+ * rseq_slice_ctrl - Time slice extension control structure
+ * @all: Compound value
+ * @request: Request for a time slice extension
+ * @granted: Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_slice_ctrl {
+ union {
+ __u32 all;
+ struct {
+ __u8 request;
+ __u8 granted;
+ __u16 __reserved;
+ };
+ };
+};
+
/*
* struct rseq_abi is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
@@ -165,6 +186,12 @@ struct rseq_abi {
__u32 mm_cid;
/*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_slice_ctrl slice_ctrl;
+
+ /*
* Flexible array member at end of structure, after last feature field.
*/
char end[];
--- /dev/null
+++ b/tools/testing/selftests/rseq/slice_test.c
@@ -0,0 +1,198 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include <linux/prctl.h>
+#include <sys/prctl.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+#include "../kselftest_harness.h"
+
+#ifndef __NR_rseq_slice_yield
+# define __NR_rseq_slice_yield 470
+#endif
+
+#define BITS_PER_INT 32
+#define BITS_PER_BYTE 8
+
+#ifndef PR_RSEQ_SLICE_EXTENSION
+# define PR_RSEQ_SLICE_EXTENSION 79
+# define PR_RSEQ_SLICE_EXTENSION_GET 1
+# define PR_RSEQ_SLICE_EXTENSION_SET 2
+# define PR_RSEQ_SLICE_EXT_ENABLE 0x01
+#endif
+
+#ifndef RSEQ_SLICE_EXT_REQUEST_BIT
+# define RSEQ_SLICE_EXT_REQUEST_BIT 0
+# define RSEQ_SLICE_EXT_GRANTED_BIT 1
+#endif
+
+#ifndef asm_inline
+# define asm_inline asm __inline
+#endif
+
+#define NSEC_PER_SEC 1000000000L
+#define NSEC_PER_USEC 1000L
+
+struct noise_params {
+ int noise_nsecs;
+ int sleep_nsecs;
+ int run;
+};
+
+FIXTURE(slice_ext)
+{
+ pthread_t noise_thread;
+ struct noise_params noise_params;
+};
+
+FIXTURE_VARIANT(slice_ext)
+{
+ int64_t total_nsecs;
+ int slice_nsecs;
+ int noise_nsecs;
+ int sleep_nsecs;
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n2_2_50)
+{
+ .total_nsecs = 5 * NSEC_PER_SEC,
+ .slice_nsecs = 2 * NSEC_PER_USEC,
+ .noise_nsecs = 2 * NSEC_PER_USEC,
+ .sleep_nsecs = 50 * NSEC_PER_USEC,
+};
+
+FIXTURE_VARIANT_ADD(slice_ext, n50_2_50)
+{
+ .total_nsecs = 5 * NSEC_PER_SEC,
+ .slice_nsecs = 50 * NSEC_PER_USEC,
+ .noise_nsecs = 2 * NSEC_PER_USEC,
+ .sleep_nsecs = 50 * NSEC_PER_USEC,
+};
+
+static inline bool elapsed(struct timespec *start, struct timespec *now,
+ int64_t span)
+{
+ int64_t delta = now->tv_sec - start->tv_sec;
+
+ delta *= NSEC_PER_SEC;
+ delta += now->tv_nsec - start->tv_nsec;
+ return delta >= span;
+}
+
+static void *noise_thread(void *arg)
+{
+ struct noise_params *p = arg;
+
+ while (RSEQ_READ_ONCE(p->run)) {
+ struct timespec ts_start, ts_now;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+ do {
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_start, &ts_now, p->noise_nsecs));
+
+ ts_start.tv_sec = 0;
+ ts_start.tv_nsec = p->sleep_nsecs;
+ clock_nanosleep(CLOCK_MONOTONIC, 0, &ts_start, NULL);
+ }
+ return NULL;
+}
+
+FIXTURE_SETUP(slice_ext)
+{
+ cpu_set_t affinity;
+
+ ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0);
+
+ /* Pin it on a single CPU. Avoid CPU 0 */
+ for (int i = 1; i < CPU_SETSIZE; i++) {
+ if (!CPU_ISSET(i, &affinity))
+ continue;
+
+ CPU_ZERO(&affinity);
+ CPU_SET(i, &affinity);
+ ASSERT_EQ(sched_setaffinity(0, sizeof(affinity), &affinity), 0);
+ break;
+ }
+
+ ASSERT_EQ(rseq_register_current_thread(), 0);
+
+ ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0);
+
+ self->noise_params.noise_nsecs = variant->noise_nsecs;
+ self->noise_params.sleep_nsecs = variant->sleep_nsecs;
+ self->noise_params.run = 1;
+
+ ASSERT_EQ(pthread_create(&self->noise_thread, NULL, noise_thread, &self->noise_params), 0);
+}
+
+FIXTURE_TEARDOWN(slice_ext)
+{
+ self->noise_params.run = 0;
+ pthread_join(self->noise_thread, NULL);
+}
+
+TEST_F(slice_ext, slice_test)
+{
+ unsigned long success = 0, yielded = 0, scheduled = 0, raced = 0;
+ struct rseq_abi *rs = rseq_get_abi();
+ struct timespec ts_start, ts_now;
+
+ ASSERT_NE(rs, NULL);
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+ do {
+ struct timespec ts_cs;
+ bool req = false;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_cs);
+
+ RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 1);
+ do {
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_cs, &ts_now, variant->slice_nsecs));
+
+ /*
+ * The request bit could be cleared unconditionally, but to make
+ * the statistics below work it is checked first.
+ */
+ if (RSEQ_READ_ONCE(rs->slice_ctrl.request)) {
+ RSEQ_WRITE_ONCE(rs->slice_ctrl.request, 0);
+ /* Race between check and clear! */
+ req = true;
+ success++;
+ }
+
+ if (RSEQ_READ_ONCE(rs->slice_ctrl.granted)) {
+ /* The above raced against a late grant */
+ if (req)
+ success--;
+ yielded++;
+ if (!syscall(__NR_rseq_slice_yield))
+ raced++;
+ } else {
+ if (!req)
+ scheduled++;
+ }
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_now);
+ } while (!elapsed(&ts_start, &ts_now, variant->total_nsecs));
+
+ printf("# Success %12ld\n", success);
+ printf("# Yielded %12ld\n", yielded);
+ printf("# Scheduled %12ld\n", scheduled);
+ printf("# Raced %12ld\n", raced);
+}
+
+TEST_HARNESS_MAIN
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (11 preceding siblings ...)
2025-10-29 13:22 ` [patch V3 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
@ 2025-10-29 15:10 ` Sebastian Andrzej Siewior
2025-10-29 15:40 ` Steven Rostedt
2025-11-06 17:28 ` Prakash Sangappa
13 siblings, 1 reply; 63+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-10-29 15:10 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Arnd Bergmann, linux-arch
On 2025-10-29 14:22:11 [+0100], Thomas Gleixner wrote:
> Changes vs. V2:
>
> - Rebase on the newest RSEQ and uaccess changes
…
> and in git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
>
> For your convenience all of it is also available as a conglomerate from
> git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
rseq/slice is older than rseq/cid. rseq/slice has the
__put_kernel_nofault typo. rseq/cid looks correct.
> Thanks,
>
> tglx
Sebastian
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-10-29 15:10 ` [patch V3 00/12] rseq: Implement time slice extension mechanism Sebastian Andrzej Siewior
@ 2025-10-29 15:40 ` Steven Rostedt
2025-10-29 21:49 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Steven Rostedt @ 2025-10-29 15:40 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Thomas Gleixner, LKML, Peter Zijlstra, Mathieu Desnoyers,
Paul E. McKenney, Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Arnd Bergmann, linux-arch
On Wed, 29 Oct 2025 16:10:55 +0100
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > and in git:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
> >
> > For your convenience all of it is also available as a conglomerate from
> > git:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>
> rseq/slice is older than rseq/cid. rseq/slice has the
> __put_kernel_nofault typo. rseq/cid looks correct.
Yeah, I started looking at both too, and checking out rseq/slice and trying
to do a rebase on top of rseq/cid causes a bunch of conflicts.
I'm continuing the rebase and just skipping the changed commits.
-- Steve
* Re: [patch V3 03/12] rseq: Provide static branch for time slice extensions
2025-10-29 13:22 ` [patch V3 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
@ 2025-10-29 17:23 ` Randy Dunlap
2025-10-29 21:12 ` Thomas Gleixner
2025-10-31 19:34 ` Mathieu Desnoyers
1 sibling, 1 reply; 63+ messages in thread
From: Randy Dunlap @ 2025-10-29 17:23 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
On 10/29/25 6:22 AM, Thomas Gleixner wrote:
> Guard the time slice extension functionality with a static key, which can
> be disabled on the kernel command line.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> ---
> V3: Document command line parameter - Sebastian
> V2: Return 1 from __setup() - Prateek
> ---
> Documentation/admin-guide/kernel-parameters.txt | 5 +++++
> include/linux/rseq_entry.h | 11 +++++++++++
> kernel/rseq.c | 17 +++++++++++++++++
> 3 files changed, 33 insertions(+)
>
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -484,3 +484,20 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
> efault:
> return -EFAULT;
> }
> +
> +#ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
> +
> +static int __init rseq_slice_cmdline(char *str)
> +{
> + bool on;
> +
> + if (kstrtobool(str, &on))
> + return -EINVAL;
The norm for __setup function returns is:
return 0; /* param not handled - will be added to ENV */
or
return 1; /* param is handled (anything non-zero) */
Anything non-zero means param is handled, so maybe -EINVAL is OK here,
since return 0 means that the string is added to init's environment.
If the parsing function recognizes the cmdline option string
(rseq_slice_ext) but the value is invalid, it should pr_error()
or something like that but still return 1; (IMHO).
No need to have "rseq_slice_ext=foo" added to init's ENV.
So return -EINVAL is like return 1 in this case.
IOW it works as needed. :)
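As a plain user space illustration of the convention Randy describes
(non-zero return means "handled", zero means the string ends up in init's
environment): the sketch below stubs kstrtobool() and the enable flag and
is deliberately not kernel code, only the return-value logic matches.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool slice_ext_enabled = true;

/* Stand-in for the kernel's kstrtobool(): 0 on success, non-zero on error */
static int kstrtobool_stub(const char *s, bool *res)
{
	if (!strcmp(s, "1") || !strcmp(s, "on") || !strcmp(s, "y")) {
		*res = true;
		return 0;
	}
	if (!strcmp(s, "0") || !strcmp(s, "off") || !strcmp(s, "n")) {
		*res = false;
		return 0;
	}
	return -1;	/* would be -EINVAL in the kernel */
}

/*
 * __setup() handler convention: non-zero return means "parameter
 * handled", a zero return hands the string to init's environment.
 */
static int rseq_slice_cmdline_sketch(const char *str)
{
	bool on;

	if (kstrtobool_stub(str, &on)) {
		/* Complain, but claim the parameter so it stays out of init's env */
		fprintf(stderr, "rseq_slice_ext: invalid value '%s'\n", str);
		return 1;
	}
	if (!on)
		slice_ext_enabled = false;
	return 1;
}
```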
> +
> + if (!on)
> + static_branch_disable(&rseq_slice_extension_key);
> + return 1;
> +}
> +__setup("rseq_slice_ext=", rseq_slice_cmdline);
> +#endif /* CONFIG_RSEQ_SLICE_EXTENSION */
--
~Randy
* Re: [patch V3 08/12] rseq: Implement time slice extension enforcement timer
2025-10-29 13:22 ` [patch V3 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
@ 2025-10-29 18:45 ` Steven Rostedt
2025-10-29 21:37 ` Thomas Gleixner
2025-10-31 19:59 ` Mathieu Desnoyers
1 sibling, 1 reply; 63+ messages in thread
From: Steven Rostedt @ 2025-10-29 18:45 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
On Wed, 29 Oct 2025 14:22:26 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -86,8 +86,24 @@ static __always_inline bool rseq_slice_e
> {
> return static_branch_likely(&rseq_slice_extension_key);
> }
> +
> +extern unsigned int rseq_slice_ext_nsecs;
> +bool __rseq_arm_slice_extension_timer(void);
> +
> +static __always_inline bool rseq_arm_slice_extension_timer(void)
> +{
> + if (!rseq_slice_extension_enabled())
> + return false;
> +
> + if (likely(!current->rseq.slice.state.granted))
> + return false;
> +
> + return __rseq_arm_slice_extension_timer();
> +}
> +
> #else /* CONFIG_RSEQ_SLICE_EXTENSION */
> static inline bool rseq_slice_extension_enabled(void) { return false; }
> +static inline bool rseq_arm_slice_extension_timer(void) { return false; }
> #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
>
> bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
> @@ -542,17 +558,19 @@ static __always_inline void clear_tif_rs
> static __always_inline bool
> rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
> {
> - if (likely(!test_tif_rseq(ti_work)))
> - return false;
> -
> - if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
> - current->rseq.event.slowpath = true;
> - set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> - return true;
> + if (unlikely(test_tif_rseq(ti_work))) {
> + if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
> + current->rseq.event.slowpath = true;
> + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> + return true;
Just to make sure I understand this: by setting TIF_NOTIFY_RESUME and
returning true it can still come back to set the timer?
I guess this also raises the question of whether user space can use
restartable sequences at the same time as requesting an extended time slice?
> + }
> + clear_tif_rseq();
> }
> -
> - clear_tif_rseq();
> - return false;
> + /*
> + * Arm the slice extension timer if nothing to do anymore and the
> + * task really goes out to user space.
> + */
> + return rseq_arm_slice_extension_timer();
> }
>
> #endif /* CONFIG_GENERIC_ENTRY */
> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -89,9 +89,11 @@ union rseq_slice_state {
> /**
> * struct rseq_slice - Status information for rseq time slice extension
> * @state: Time slice extension state
> + * @expires: The time when a grant expires
> */
> struct rseq_slice {
> union rseq_slice_state state;
> + u64 expires;
> };
>
> /**
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -71,6 +71,8 @@
> #define RSEQ_BUILD_SLOW_PATH
>
> #include <linux/debugfs.h>
> +#include <linux/hrtimer.h>
> +#include <linux/percpu.h>
> #include <linux/prctl.h>
> #include <linux/ratelimit.h>
> #include <linux/rseq_entry.h>
> @@ -499,8 +501,78 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
> }
>
> #ifdef CONFIG_RSEQ_SLICE_EXTENSION
> +struct slice_timer {
> + struct hrtimer timer;
> + void *cookie;
> +};
> +
> +unsigned int rseq_slice_ext_nsecs __read_mostly = 30 * NSEC_PER_USEC;
> +static DEFINE_PER_CPU(struct slice_timer, slice_timer);
> DEFINE_STATIC_KEY_TRUE(rseq_slice_extension_key);
>
> +static enum hrtimer_restart rseq_slice_expired(struct hrtimer *tmr)
> +{
> + struct slice_timer *st = container_of(tmr, struct slice_timer, timer);
> +
> + if (st->cookie == current && current->rseq.slice.state.granted) {
> + rseq_stat_inc(rseq_stats.s_expired);
> + set_need_resched_current();
> + }
> + return HRTIMER_NORESTART;
> +}
> +
> +bool __rseq_arm_slice_extension_timer(void)
> +{
> + struct slice_timer *st = this_cpu_ptr(&slice_timer);
> + struct task_struct *curr = current;
> +
> + lockdep_assert_irqs_disabled();
> +
> + /*
> + * This check prevents that a granted time slice extension exceeds
This check prevents a granted time slice ...
> + * the maximum scheduling latency when the grant expired before
> + * going out to user space. Don't bother to clear the grant here,
> + * it will be cleaned up automatically before going out to user
> + * space.
> + */
> + if ((unlikely(curr->rseq.slice.expires < ktime_get_mono_fast_ns()))) {
> + set_need_resched_current();
> + return true;
> + }
> +
> + /*
> + * Store the task pointer as a cookie for comparison in the timer
> + * function. This is safe as the timer is CPU local and cannot be
> + * in the expiry function at this point.
> + */
I'm just curious about this scenario:
1) Task A requests an extension and is granted.
st->cookie = Task A
hrtimer_start();
2) Before getting back to user space, a RT kernel thread wakes up and
preempts Task A. Does this clear the timer?
3) RT kernel thread finishes but then schedules Task B within the expiry.
4) Task B requests an extension (assuming it had a short time slice that
allowed it to end before the expiry of the original timer).
I guess it doesn't matter that st->cookie = Task B, as Task A was already
scheduled out. But would calling hrtimer_start() on an existing timer cause
any issue?
I guess it doesn't matter as it looks like the code in hrtimer_start() does
indeed remove an existing timer.
> + st->cookie = curr;
> + hrtimer_start(&st->timer, curr->rseq.slice.expires, HRTIMER_MODE_ABS_PINNED_HARD);
> + /* Arm the syscall entry work */
> + set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
> + return false;
> +}
> +
> +static void rseq_cancel_slice_extension_timer(void)
> +{
> + struct slice_timer *st = this_cpu_ptr(&slice_timer);
> +
> + /*
> + * st->cookie can be safely read as preemption is disabled and the
> + * timer is CPU local.
> + *
> + * As this is most probably the first expiring timer, the cancel is
As this is probably the first ...
> + * expensive as it has to reprogram the hardware, but that's less
> + * expensive than going through a full hrtimer_interrupt() cycle
> + * for nothing.
> + *
> + * hrtimer_try_to_cancel() is sufficient here as the timer is CPU
> + * local and once the hrtimer code disabled interrupts the timer
> + * callback cannot be running.
> + */
> + if (st->cookie == current)
> + hrtimer_try_to_cancel(&st->timer);
If the above scenario did happen, the timer would go off, as
st->cookie == current would likely be false?
Hmm, if it does go off and the task did schedule back in, would it get its
need_resched set? This is a very unlikely scenario, so I guess it doesn't
really matter.
I'm just thinking about corner cases and how it could affect this code and
possibly cause noticeable issues.
-- Steve
> +}
> +
> static inline void rseq_slice_set_need_resched(struct task_struct *curr)
> {
> /*
> @@ -558,10 +630,11 @@ void rseq_syscall_enter_work(long syscal
> rseq_stat_inc(rseq_stats.s_yielded);
>
> /*
> - * Required to make set_tsk_need_resched() correct on PREEMPT[RT]
> - * kernels.
> + * Required to stabilize the per CPU timer pointer and to make
> + * set_tsk_need_resched() correct on PREEMPT[RT] kernels.
> */
> scoped_guard(preempt) {
> + rseq_cancel_slice_extension_timer();
> /*
> * Now that preemption is disabled, quickly check whether
> * the task was already rescheduled before arriving here.
> @@ -652,6 +725,31 @@ SYSCALL_DEFINE0(rseq_slice_yield)
> return 0;
> }
>
> +#ifdef CONFIG_SYSCTL
> +static const unsigned int rseq_slice_ext_nsecs_min = 10 * NSEC_PER_USEC;
> +static const unsigned int rseq_slice_ext_nsecs_max = 50 * NSEC_PER_USEC;
> +
> +static const struct ctl_table rseq_slice_ext_sysctl[] = {
> + {
> + .procname = "rseq_slice_extension_nsec",
> + .data = &rseq_slice_ext_nsecs,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_douintvec_minmax,
> + .extra1 = (unsigned int *)&rseq_slice_ext_nsecs_min,
> + .extra2 = (unsigned int *)&rseq_slice_ext_nsecs_max,
> + },
> +};
> +
> +static void rseq_slice_sysctl_init(void)
> +{
> + if (rseq_slice_extension_enabled())
> + register_sysctl_init("kernel", rseq_slice_ext_sysctl);
> +}
> +#else /* CONFIG_SYSCTL */
> +static inline void rseq_slice_sysctl_init(void) { }
> +#endif /* !CONFIG_SYSCTL */
> +
> static int __init rseq_slice_cmdline(char *str)
> {
> bool on;
> @@ -664,4 +762,17 @@ static int __init rseq_slice_cmdline(cha
> return 1;
> }
> __setup("rseq_slice_ext=", rseq_slice_cmdline);
> +
> +static int __init rseq_slice_init(void)
> +{
> + unsigned int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + hrtimer_setup(per_cpu_ptr(&slice_timer.timer, cpu), rseq_slice_expired,
> + CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED_HARD);
> + }
> + rseq_slice_sysctl_init();
> + return 0;
> +}
> +device_initcall(rseq_slice_init);
> #endif /* CONFIG_RSEQ_SLICE_EXTENSION */
* Re: [patch V3 10/12] rseq: Implement rseq_grant_slice_extension()
2025-10-29 13:22 ` [patch V3 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
@ 2025-10-29 20:08 ` Steven Rostedt
2025-10-29 21:46 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Steven Rostedt @ 2025-10-29 20:08 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
On Wed, 29 Oct 2025 14:22:30 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:
> +static __always_inline bool rseq_grant_slice_extension(bool work_pending)
> +{
> + struct task_struct *curr = current;
> + struct rseq_slice_ctrl usr_ctrl;
> + union rseq_slice_state state;
> + struct rseq __user *rseq;
> +
> + if (!rseq_slice_extension_enabled())
> + return false;
> +
> + /* If not enabled or not a return from interrupt, nothing to do. */
> + state = curr->rseq.slice.state;
> + state.enabled &= curr->rseq.event.user_irq;
> + if (likely(!state.state))
> + return false;
> +
> + rseq = curr->rseq.usrptr;
> + scoped_user_rw_access(rseq, efault) {
> +
> + /*
> + * Quick check conditions where a grant is not possible or
> + * needs to be revoked.
> + *
> + * 1) Any TIF bit which needs to do extra work aside of
> + * rescheduling prevents a grant.
> + *
I'm curious as to why any other TIF bit causes this to refuse a grant?
If deferred unwinding gets implemented, and profiling is enabled, it uses
task_work. From my understanding, task_work will set a TIF bit. Would this
mean that we would not be able to profile this feature with the deferred
unwinder, as profiling would prevent it from being used?
-- Steve
> + * 2) A previous rescheduling request resulted in a slice
> + * extension grant.
> + */
> + if (unlikely(work_pending || state.granted)) {
> + /* Clear user control unconditionally. No point in checking */
> + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
> + rseq_slice_clear_grant(curr);
> + return false;
> + }
> +
> + unsafe_get_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault);
> + if (likely(!(usr_ctrl.request)))
> + return false;
> +
> + /* Grant the slice extension */
> + usr_ctrl.request = 0;
> + usr_ctrl.granted = 1;
> + unsafe_put_user(usr_ctrl.all, &rseq->slice_ctrl.all, efault);
> + }
> +
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 03/12] rseq: Provide static branch for time slice extensions
2025-10-29 17:23 ` Randy Dunlap
@ 2025-10-29 21:12 ` Thomas Gleixner
0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 21:12 UTC (permalink / raw)
To: Randy Dunlap, LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
On Wed, Oct 29 2025 at 10:23, Randy Dunlap wrote:
> On 10/29/25 6:22 AM, Thomas Gleixner wrote:
>> +static int __init rseq_slice_cmdline(char *str)
>> +{
>> + bool on;
>> +
>> + if (kstrtobool(str, &on))
>> + return -EINVAL;
>
> The norm for __setup function returns is:
>
> return 0; /* param not handled - will be added to ENV */
> or
> return 1; /* param is handled (anything non-zero) */
>
>
> Anything non-zero means param is handled, so maybe -EINVAL is OK here,
> since return 0 means that the string is added to init's environment.
>
> If the parsing function recognizes the cmdline option string
> (rseq_slice_ext) but the value is invalid, it should pr_error()
> or something like that but still return 1; (IMHO).
> No need to have "rseq_slice_ext=foo" added to init's ENV.
>
> So return -EINVAL is like return 1 in this case.
> IOW it works as needed. :)
Bah. I hate this logic so much and I never will memorize it.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 08/12] rseq: Implement time slice extension enforcement timer
2025-10-29 18:45 ` Steven Rostedt
@ 2025-10-29 21:37 ` Thomas Gleixner
2025-10-29 23:53 ` Steven Rostedt
0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 21:37 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
On Wed, Oct 29 2025 at 14:45, Steven Rostedt wrote:
> On Wed, 29 Oct 2025 14:22:26 +0100 (CET)
> Thomas Gleixner <tglx@linutronix.de> wrote:
>> rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
>> {
>> - if (likely(!test_tif_rseq(ti_work)))
>> - return false;
>> -
>> - if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
>> - current->rseq.event.slowpath = true;
>> - set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
>> - return true;
>> + if (unlikely(test_tif_rseq(ti_work))) {
>> + if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
>> + current->rseq.event.slowpath = true;
>> + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
>> + return true;
>
> Just to make sure I understand this. By setting TIF_NOTIFY_RESUME and
> returning true it can still comeback to set the timer?
No. NOTIFY_RESUME is only set when the access faults or when the user
space memory is corrupted and the grant is moot in that case.
But if TIF_RSEQ is set then a previously granted extension is revoked
anyway, because that means:
granted();
---> preemption (possibly migration): Sets TIF_RSEQ
schedule()
rseq_exit_to_user_mode_restart()
if (TIF_RSEQ is set)
handle_rseq()
revoke_grant()
> I guess this also begs the question of if user space can use both the
> restartable sequences at the same time as requesting an extended time slice?
It can and that actually makes sense.
enter_cs()
request_grant()
set_cs()
...
interrupt
set_need_resched()
exit_to_user_mode()
if (need_resched())
grant_extension() // clears NEED_RESCHED
...
rseq_exit_to_user_mode_restart()
if (TIF_RSEQ is set) // Branch not taken
...
arm_timer()
return_to_user()
leave_cs()
if (granted)
sys_rseq_sched_yield()
which means the extension grant prevented the critical section from
being aborted. If the extension is not granted, or is revoked, then this
behaves like a regular RSEQ CS abort.
>> + * This check prevents that a granted time slice extension exceeds
>
> This check prevents a granted time slice ...
>
>> + * the maximum scheduling latency when the grant expired before
I'm not a native speaker, but your suggested edit is bogus. Let me
put it into the full sentence:
This check prevents a granted time slice extension exceeds
the maximum ....
Can you spot the fail?
>> + /*
>> + * Store the task pointer as a cookie for comparison in the timer
>> + * function. This is safe as the timer is CPU local and cannot be
>> + * in the expiry function at this point.
>> + */
>
> I'm just curious in this scenario:
>
> 1) Task A requests an extension and is granted.
> st->cookie = Task A
> hrtimer_start();
>
> 2) Before getting back to user space, a RT kernel thread wakes up and
> preempts Task A. Does this clear the timer?
No.
> 3) RT kernel thread finishes but then schedules Task B within the expiry.
>
> 4) Task B requests an extension (assuming it had a short time slice that
> allowed it to end before the expiry of the original timer).
>
> I guess it doesn't matter that st->cookie = Task B, as Task A was already
> scheduled out. But would calling hrtimer_start() on an existing timer cause
> any issue?
No. The timer is canceled and reprogrammed.
> I guess it doesn't matter as it looks like the code in hrtimer_start() does
> indeed remove an existing timer.
You guessed right :)
>> + st->cookie = curr;
>> + hrtimer_start(&st->timer, curr->rseq.slice.expires, HRTIMER_MODE_ABS_PINNED_HARD);
>> + /* Arm the syscall entry work */
>> + set_task_syscall_work(curr, SYSCALL_RSEQ_SLICE);
>> + return false;
>> +}
>> +
>> +static void rseq_cancel_slice_extension_timer(void)
>> +{
>> + struct slice_timer *st = this_cpu_ptr(&slice_timer);
>> +
>> + /*
>> + * st->cookie can be safely read as preemption is disabled and the
>> + * timer is CPU local.
>> + *
>> + * As this is most probably the first expiring timer, the cancel is
>
> As this is probably the first ...
>
>> + * expensive as it has to reprogram the hardware, but that's less
>> + * expensive than going through a full hrtimer_interrupt() cycle
>> + * for nothing.
>> + *
>> + * hrtimer_try_to_cancel() is sufficient here as the timer is CPU
>> + * local and once the hrtimer code disabled interrupts the timer
>> + * callback cannot be running.
>> + */
>> + if (st->cookie == current)
>> + hrtimer_try_to_cancel(&st->timer);
>
> If the above scenario did happen, the timer will go off as
> st->cookie == current would likely be false?
>
> Hmm, if it does go off and the task did schedule back in, would it get its
> need_resched set? This is a very unlikely scenario thus I guess it doesn't
> really matter.
Correct.
> I'm just thinking about corner cases and how it could affect this code and
> possibly cause noticeable issues.
Right. That corner case exists and there is not much to be done about it
unless you inflict the timer cancelation into schedule(), which is not
an option at all.
> -- Steve
/me trims 50+ lines of pointless quotation.
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 10/12] rseq: Implement rseq_grant_slice_extension()
2025-10-29 20:08 ` Steven Rostedt
@ 2025-10-29 21:46 ` Thomas Gleixner
2025-10-29 22:04 ` Steven Rostedt
0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 21:46 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
On Wed, Oct 29 2025 at 16:08, Steven Rostedt wrote:
> On Wed, 29 Oct 2025 14:22:30 +0100 (CET)
> Thomas Gleixner <tglx@linutronix.de> wrote:
>> + /*
>> + * Quick check conditions where a grant is not possible or
>> + * needs to be revoked.
>> + *
>> + * 1) Any TIF bit which needs to do extra work aside of
>> + * rescheduling prevents a grant.
>> + *
>
> I'm curious as to why any other TIF bit causes this to refuse a grant?
>
> If deferred unwinding gets implemented, and profiling is enabled, it uses
> task_work. From my understanding, task_work will set a TIF bit. Would this
> mean that we would not be able to profile this feature with the deferred
> unwinder, as profiling would prevent it from being used?
You still can use it. The point is that a set TIF bit will do extra
work, which means extra scheduling latency. The extra work might be
short enough to still make the grant useful, but that's something which
needs to be worked out and analyzed. Quite a few of the TIF bits actually
end up with another reschedule request.
As this whole thing is an opportunistic poor man's priority ceiling
attempt, I opted for the simple decision of not granting it when other
TIF bits are set. KISS rules :)
That's not set in stone and has no user space ABI relevance because it's
solely a kernel implementation detail.
> -- Steve
Can you please trim your replies as anybody else does?
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-10-29 15:40 ` Steven Rostedt
@ 2025-10-29 21:49 ` Thomas Gleixner
0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-29 21:49 UTC (permalink / raw)
To: Steven Rostedt, Sebastian Andrzej Siewior
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Arnd Bergmann, linux-arch
On Wed, Oct 29 2025 at 11:40, Steven Rostedt wrote:
> On Wed, 29 Oct 2025 16:10:55 +0100
> Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>
>> > and in git:
>> >
>> > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
>> >
>> > For your convenience all of it is also available as a conglomerate from
>> > git:
>> >
>> > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>>
>> rseq/slice is older than rseq/cid. rseq/slice has the
>> __put_kernel_nofault typo. rseq/cid looks correct.
>
> Yeah, I started looking at both too, and checking out req/slice and trying
> to do a rebase on top of rseq/cid causes a bunch of conflicts.
>
> I'm continuing the rebase and just skipping the changed commits.
Forgot to push the updated branch out....
Fixed now.
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 10/12] rseq: Implement rseq_grant_slice_extension()
2025-10-29 21:46 ` Thomas Gleixner
@ 2025-10-29 22:04 ` Steven Rostedt
2025-10-31 14:33 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Steven Rostedt @ 2025-10-29 22:04 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
On Wed, 29 Oct 2025 22:46:12 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:
> Can you please trim your replies as anybody else does?
I did trim it. I only kept the function in question, but deleted everything
else.
I do like to keep a bit of context, as sometimes I find people tend to trim
a bit too much.
-- Steve
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 08/12] rseq: Implement time slice extension enforcement timer
2025-10-29 21:37 ` Thomas Gleixner
@ 2025-10-29 23:53 ` Steven Rostedt
0 siblings, 0 replies; 63+ messages in thread
From: Steven Rostedt @ 2025-10-29 23:53 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
On Wed, 29 Oct 2025 22:37:17 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:
> >> + * This check prevents that a granted time slice extension exceeds
> >
> > This check prevents a granted time slice ...
> >
> >> + * the maximum scheduling latency when the grant expired before
>
> I'm not a native speaker, but your suggested edit is bogus. Let me
> put it into the full sentence:
>
> This check prevents a granted time slice extension exceeds
> the maximum ....
>
> Can you spot the fail?
Ah, I should have updated the entire sentence, as the original still sounds
funny to me, but you are correct, that update wasn't enough.
Perhaps:
This check prevents a granted time slice extension from exceeding the
maximum scheduling latency when the grant expires before going back out
to user space.
-- Steve
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 02/12] rseq: Add fields and constants for time slice extension
2025-10-29 13:22 ` [patch V3 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
@ 2025-10-30 22:01 ` Prakash Sangappa
2025-10-31 14:32 ` Thomas Gleixner
2025-10-31 19:31 ` Mathieu Desnoyers
2025-11-04 0:20 ` Steven Rostedt
2 siblings, 1 reply; 63+ messages in thread
From: Prakash Sangappa @ 2025-10-30 22:01 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
> On Oct 29, 2025, at 6:22 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Aside of a Kconfig knob add the following items:
>
> - Two flag bits for the rseq user space ABI, which allow user space to
> query the availability and enablement without a syscall.
>
> - A new member to the user space ABI struct rseq, which is going to be
> used to communicate request and grant between kernel and user space.
>
> - A rseq state struct to hold the kernel state of this
>
> - Documentation of the new mechanism
>
[…]
> +
> +If both the request bit and the granted bit are false when leaving the
> +critical section, then this indicates that a grant was revoked and no
> +further action is required by userspace.
> +
> +The required code flow is as follows::
> +
> + rseq->slice_ctrl.request = 1;
> + critical_section();
> + if (rseq->slice_ctrl.granted)
> + rseq_slice_yield();
> +
> +As all of this is strictly CPU local, there are no atomicity requirements.
> +Checking the granted state is racy, but that cannot be avoided at all::
> +
> + if (rseq->slice_ctrl & GRANTED)
Could this be?
if (rseq->slice_ctrl.granted)
> + -> Interrupt results in schedule and grant revocation
> + rseq_slice_yield();
> +
-Prakash
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 02/12] rseq: Add fields and constants for time slice extension
2025-10-30 22:01 ` Prakash Sangappa
@ 2025-10-31 14:32 ` Thomas Gleixner
0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-31 14:32 UTC (permalink / raw)
To: Prakash Sangappa
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
On Thu, Oct 30 2025 at 22:01, Prakash Sangappa wrote:
>> On Oct 29, 2025, at 6:22 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Aside of a Kconfig knob add the following items:
>>
>> - Two flag bits for the rseq user space ABI, which allow user space to
>> query the availability and enablement without a syscall.
>>
>> - A new member to the user space ABI struct rseq, which is going to be
>> used to communicate request and grant between kernel and user space.
>>
>> - A rseq state struct to hold the kernel state of this
>>
>> - Documentation of the new mechanism
>>
> […]
>> +
>> +If both the request bit and the granted bit are false when leaving the
>> +critical section, then this indicates that a grant was revoked and no
>> +further action is required by userspace.
>> +
>> +The required code flow is as follows::
>> +
>> + rseq->slice_ctrl.request = 1;
>> + critical_section();
>> + if (rseq->slice_ctrl.granted)
>> + rseq_slice_yield();
>> +
>> +As all of this is strictly CPU local, there are no atomicity requirements.
>> +Checking the granted state is racy, but that cannot be avoided at all::
>> +
>> + if (rseq->slice_ctrl & GRANTED)
> Could this be?
> if (rseq->slice_ctrl.granted)
Yes.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 10/12] rseq: Implement rseq_grant_slice_extension()
2025-10-29 22:04 ` Steven Rostedt
@ 2025-10-31 14:33 ` Thomas Gleixner
0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-31 14:33 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch
On Wed, Oct 29 2025 at 18:04, Steven Rostedt wrote:
> On Wed, 29 Oct 2025 22:46:12 +0100
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> Can you please trim your replies as anybody else does?
>
> I did trim it. I only kept the function in question, but deleted everything
> else.
>
> I do like to keep a bit of context, as sometimes I find people tend to trim
> a bit too much.
That's fine, but leaving stale quotes after
Steve
is not.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 02/12] rseq: Add fields and constants for time slice extension
2025-10-29 13:22 ` [patch V3 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
2025-10-30 22:01 ` Prakash Sangappa
@ 2025-10-31 19:31 ` Mathieu Desnoyers
2025-10-31 20:58 ` Thomas Gleixner
2025-11-04 0:20 ` Steven Rostedt
2 siblings, 1 reply; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-10-31 19:31 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 2025-10-29 09:22, Thomas Gleixner wrote:
[...]
> +
> +The thread has to enable the functionality via prctl(2)::
> +
> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
Enabling specifically for each thread requires hooking into thread
creation, and is not a good fit for enabling this from an executable or
library constructor function.
What is the use-case for enabling it only for a few threads within
a process rather than for the entire process ?
> +
> +The kernel indicates the grant by clearing rseq::slice_ctrl::reqeust and
reqeust -> request
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 03/12] rseq: Provide static branch for time slice extensions
2025-10-29 13:22 ` [patch V3 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
2025-10-29 17:23 ` Randy Dunlap
@ 2025-10-31 19:34 ` Mathieu Desnoyers
1 sibling, 0 replies; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-10-31 19:34 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 2025-10-29 09:22, Thomas Gleixner wrote:
> Guard the time slice extension functionality with a static key, which can
> be disabled on the kernel command line.
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 04/12] rseq: Add statistics for time slice extensions
2025-10-29 13:22 ` [patch V3 04/12] rseq: Add statistics " Thomas Gleixner
@ 2025-10-31 19:36 ` Mathieu Desnoyers
0 siblings, 0 replies; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-10-31 19:36 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 2025-10-29 09:22, Thomas Gleixner wrote:
> Extend the quick statistics with time slice specific fields.
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 05/12] rseq: Add prctl() to enable time slice extensions
2025-10-29 13:22 ` [patch V3 05/12] rseq: Add prctl() to enable " Thomas Gleixner
@ 2025-10-31 19:43 ` Mathieu Desnoyers
2025-10-31 21:05 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-10-31 19:43 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 2025-10-29 09:22, Thomas Gleixner wrote:
> Implement a prctl() so that tasks can enable the time slice extension
> mechanism. This fails, when time slice extensions are disabled at compile
> time or on the kernel command line and when no rseq pointer is registered
> in the kernel.
I'm still unsure that going for enabling per-thread vs per-process is
the right approach. Enabling per-thread requires either modifying each
thread's startup code or integrating this into libc's thread startup.
Enabling per-process makes it easy to invoke from program or library
constructor.
[...]
>
> +int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
> +{
[...]
> + case PR_RSEQ_SLICE_EXTENSION_SET: {
> + u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> + bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
> +
> + if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
> + return -EINVAL;
> + if (!rseq_slice_extension_enabled())
> + return -ENOTSUPP;
> + if (!current->rseq.usrptr)
> + return -ENXIO;
> +
So what happens if we have an (unlikely) scenario of:
- thread startup
- thread registration to rseq
- prctl PR_RSEQ_SLICE_EXTENSION_SET
- rseq unregistration
- rseq registration
--> What's the status of slice extension here ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 06/12] rseq: Implement sys_rseq_slice_yield()
2025-10-29 13:22 ` [patch V3 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
@ 2025-10-31 19:46 ` Mathieu Desnoyers
2025-10-31 21:07 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-10-31 19:46 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Arnd Bergmann, linux-arch, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior
On 2025-10-29 09:22, Thomas Gleixner wrote:
>
> +/**
> + * sys_rseq_slice_yield - yield the current processor if a task granted
> + * with a time slice extension is done with the
> + * critical work before being forced out.
> + *
> + * On entry from user space, syscall_entry_work() ensures that NEED_RESCHED is
> + * set if the task was granted a slice extension before arriving here.
> + *
> + * Return: 1 if the task successfully yielded the CPU within the granted slice.
> + * 0 if the slice extension was either never granted or was revoked by
> + * going over the granted extension or being scheduled out earlier
I notice the presence of tabs in those comments. You will likely want
to convert those to spaces.
Other than that:
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions
2025-10-29 13:22 ` [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
@ 2025-10-31 19:53 ` Mathieu Desnoyers
2025-11-19 0:20 ` Prakash Sangappa
1 sibling, 0 replies; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-10-31 19:53 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 2025-10-29 09:22, Thomas Gleixner wrote:
> The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice
> extension. This allows handling the rseq_slice_yield() syscall, which is
> used by user space to relinquish the CPU after finishing the critical
> section for which it requested an extension.
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 08/12] rseq: Implement time slice extension enforcement timer
2025-10-29 13:22 ` [patch V3 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
2025-10-29 18:45 ` Steven Rostedt
@ 2025-10-31 19:59 ` Mathieu Desnoyers
1 sibling, 0 replies; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-10-31 19:59 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 2025-10-29 09:22, Thomas Gleixner wrote:
>
> 3) The function is calling into the scheduler code and that might have
> unexpected consequences when this is invoked due to a time slice
> enforcement expiry. Especially when the task managed to clear the
> grant via sched_yield(0).
Do you mean sys_rseq_slice_yield here ?
> ---
> V3: Add sysctl documentation, simplify timer cancelation - Sebastian
^ cancellation
Other than those nits:
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 09/12] rseq: Reset slice extension when scheduled
2025-10-29 13:22 ` [patch V3 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
@ 2025-10-31 20:03 ` Mathieu Desnoyers
0 siblings, 0 replies; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-10-31 20:03 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On 2025-10-29 09:22, Thomas Gleixner wrote:
> When a time slice extension was granted in the need_resched() check on exit
> to user space, the task can still be scheduled out in one of the other
> pending work items. When it gets scheduled back in, and need_resched() is
> not set, then the stale grant would be preserved, which is just wrong.
>
> RSEQ already keeps track of that and sets TIF_RSEQ, which invokes the
> critical section and ID update mechanisms.
>
> Utilize them and clear the user space slice control member of struct rseq
> unconditionally within the existing user access sections. That's just an
> unconditional store more in that path.
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 02/12] rseq: Add fields and constants for time slice extension
2025-10-31 19:31 ` Mathieu Desnoyers
@ 2025-10-31 20:58 ` Thomas Gleixner
2025-11-01 22:53 ` Thomas Gleixner
2025-11-03 17:00 ` Mathieu Desnoyers
0 siblings, 2 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-31 20:58 UTC (permalink / raw)
To: Mathieu Desnoyers, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On Fri, Oct 31 2025 at 15:31, Mathieu Desnoyers wrote:
> On 2025-10-29 09:22, Thomas Gleixner wrote:
> [...]
>> +
>> +The thread has to enable the functionality via prctl(2)::
>> +
>> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
>> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
>
> Enabling specifically for each thread requires hooking into thread
> creation, and is not a good fit for enabling this from an executable or
> library constructor function.
Where is the problem? It's not rocket science to handle that in user
space.
> What is the use-case for enabling it only for a few threads within
> a process rather than for the entire process ?
My general approach to all of this is to reduce overhead by default and
to provide the ability of fine grained control.
Using time slice extensions requires special care and a use case which
justifies the extra work to be done. So those people really can be asked
to do the extra work of enabling it, no?
I really don't get your attitude of enabling everything by default and
thereby inflicting the maximum amount of overhead on everything.
I've just wasted weeks to cure the fallout of that approach and it's
still unsatisfying because the whole CID management crud and related
overhead is there unconditionally with exactly zero users on any
distro. The special use cases of the uncompilable gurgle tcmalloc and
the esoteric librseq are not a justification at all to inflict that on
everyone.
Sadly nobody noticed when this got merged and now with RSEQ being widely
used by glibc it's even harder to turn the clock back. I'm still tempted
to break this half thought out ABI and make CID opt-in and default to
CID = CPUID if not activated.
Seriously the kernel is there to manage resources and provide resource
control, but it's not there to accomodate the laziness of user space
programmers and to proliferate the 'I envision this to be widely used'
wishful thinking mindset.
That said I'm not completely against making this per process, but then
it has to be enabled on the main thread _before_ it spawns threads and
rejected otherwise.
That said I just went down the obvious road of making it opt-in and
therefore low overhead and flexible by default. That is correct, simple
and straightforward. No?
Thanks,
tglx
* Re: [patch V3 05/12] rseq: Add prctl() to enable time slice extensions
2025-10-31 19:43 ` Mathieu Desnoyers
@ 2025-10-31 21:05 ` Thomas Gleixner
0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-31 21:05 UTC (permalink / raw)
To: Mathieu Desnoyers, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On Fri, Oct 31 2025 at 15:43, Mathieu Desnoyers wrote:
> On 2025-10-29 09:22, Thomas Gleixner wrote:
>> + case PR_RSEQ_SLICE_EXTENSION_SET: {
>> + u32 rflags, valid = RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
>> + bool enable = !!(arg3 & PR_RSEQ_SLICE_EXT_ENABLE);
>> +
>> + if (arg3 & ~PR_RSEQ_SLICE_EXT_ENABLE)
>> + return -EINVAL;
>> + if (!rseq_slice_extension_enabled())
>> + return -ENOTSUPP;
>> + if (!current->rseq.usrptr)
>> + return -ENXIO;
>> +
>
> So what happens if we have an (unlikely) scenario of:
>
> - thread startup
> - thread registration to rseq
> - prctl PR_RSEQ_SLICE_EXTENSION_SET
> - rseq unregistration
> - rseq registration
> --> What's the status of slice extension here ?
On unregister it's cleared and you have to re-register it when you
register a new rseq. It's part of the rseq state so obviously it's all
set back to zero.
* Re: [patch V3 06/12] rseq: Implement sys_rseq_slice_yield()
2025-10-31 19:46 ` Mathieu Desnoyers
@ 2025-10-31 21:07 ` Thomas Gleixner
2025-11-03 17:07 ` Mathieu Desnoyers
0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2025-10-31 21:07 UTC (permalink / raw)
To: Mathieu Desnoyers, LKML
Cc: Peter Zijlstra, Arnd Bergmann, linux-arch, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior
On Fri, Oct 31 2025 at 15:46, Mathieu Desnoyers wrote:
> On 2025-10-29 09:22, Thomas Gleixner wrote:
>>
>> +/**
>> + * sys_rseq_slice_yield - yield the current processor if a task granted
>> + * with a time slice extension is done with the
>> + * critical work before being forced out.
>> + *
>> + * On entry from user space, syscall_entry_work() ensures that NEED_RESCHED is
>> + * set if the task was granted a slice extension before arriving here.
>> + *
>> + * Return: 1 if the task successfully yielded the CPU within the granted slice.
>> + * 0 if the slice extension was either never granted or was revoked by
>> + * going over the granted extension or being scheduled out earlier
>
> I notice the presence of tabs in those comments. You will likely want
> to convert those to spaces.
And why so? It's perfectly formatted and there is ZERO reason to use
spaces in comments when you want to have aligned formatting.
Thanks,
tglx
* Re: [patch V3 02/12] rseq: Add fields and constants for time slice extension
2025-10-31 20:58 ` Thomas Gleixner
@ 2025-11-01 22:53 ` Thomas Gleixner
2025-11-03 17:00 ` Mathieu Desnoyers
1 sibling, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-11-01 22:53 UTC (permalink / raw)
To: Mathieu Desnoyers, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch
On Fri, Oct 31 2025 at 21:58, Thomas Gleixner wrote:
> On Fri, Oct 31 2025 at 15:31, Mathieu Desnoyers wrote:
> That said I'm not completely against making this per process, but then
> it has to be enabled on the main thread _before_ it spawns threads and
> rejected otherwise.
It's actually not so trivial because, contrary to CID, which is per MM,
this is per process. As a newly created thread must register its RSEQ
memory first, there needs to be some 'inherited enablement on thread
creation' marker, which then needs to be taken into account when the new
thread registers its RSEQ memory with the kernel.
And no, we are not going to make this unconditionally enabled when RSEQ
is registered. That's just wrong as that 'oh so tiny overhead' of user
space access accumulates nicely in high frequency scheduling scenarios
as you can see from the numbers provided with the rseq and cid cleanups.
So while it's doable the real question is whether this is worth the
trouble and extra state handling all over the place. I doubt it is and
keeping the kernel simple is definitely not the wrong approach.
Thanks,
tglx
* Re: [patch V3 02/12] rseq: Add fields and constants for time slice extension
2025-10-31 20:58 ` Thomas Gleixner
2025-11-01 22:53 ` Thomas Gleixner
@ 2025-11-03 17:00 ` Mathieu Desnoyers
2025-11-03 19:19 ` Florian Weimer
1 sibling, 1 reply; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-11-03 17:00 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Prakash Sangappa, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch, Florian Weimer, carlos@redhat.com, Dmitry Vyukov,
Marco Elver, Peter Oskolkov
On 2025-10-31 16:58, Thomas Gleixner wrote:
> On Fri, Oct 31 2025 at 15:31, Mathieu Desnoyers wrote:
>> On 2025-10-29 09:22, Thomas Gleixner wrote:
>> [...]
>>> +
>>> +The thread has to enable the functionality via prctl(2)::
>>> +
>>> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
>>> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
>>
>> Enabling specifically for each thread requires hooking into thread
>> creation, and is not a good fit for enabling this from executable or
>> library constructor function.
>
> Where is the problem? It's not rocket science to handle that in user
> space.
Overhead at thread creation is a metric that is closely followed
by glibc developers. If we want a fine-grained per-thread control over
the slice extension mechanism, it would be good if we can think of a way
to allow userspace to enable it through either clone3 or rseq so
we don't add another round-trip to the kernel at thread creation.
This could be done either as an addition to this prctl, or as a
replacement if we don't want to add two ways to do the same thing.
AFAIU executable startup is not in the same ballpark performance
wise as thread creation, so adding the overhead of an additional
system call there is less frowned upon. This is why I am asking
whether per-thread granularity is needed.
>
>> What is the use-case for enabling it only for a few threads within
>> a process rather than for the entire process ?
>
> My general approach to all of this is to reduce overhead by default and
> to provide the ability of fine grained control.
That is a sound approach, I agree.
>
> Using time slice extensions requires special care and a use case which
> justifies the extra work to be done. So those people really can be asked
> to do the extra work of enabling it, no?
I don't mind that it needs to be enabled explicitly at all. What
I am asking here is what is the best way to express this enablement
ABI.
Here I'm just trying to put myself into the shoes of userspace
library developers to see whether the proposed ABI is a good fit, or
if due to resource conflicts with other pieces of the ecosystem
or because of overhead reasons it will be unusable by the core userspace
libraries like libc and limited to be used by niche users.
>
> I really don't get your attitude of enabling everything by default and
> thereby inflicting the maximum amount of overhead on everything.
>
> I've just wasted weeks to cure the fallout of that approach and it's
> still unsatisfying because the whole CID management crud and related
> overhead is there unconditionally with exactly zero users on any
> distro. The special use cases of the uncompilable gurgle tcmalloc and
> the esoteric librseq are not a justification at all to inflict that on
> everyone.
>
> Sadly nobody noticed when this got merged and now with RSEQ being widely
> used by glibc it's even harder to turn the clock back. I'm still tempted
> to break this half thought out ABI and make CID opt-in and default to
> CID = CPUID if not activated.
That's a good idea. Making mm_cid use cpu_id by default would not break
anything in terms of hard limits. Sure it's not close to 0 anymore, but
no application should misbehave because of this.
Then there is the question of how it should be enabled. For mm_cid,
it really only makes sense per-process.
One possibility here would be to introduce an "rseq features" enablement
prctl that affects the entire process. It would have to be done while
the process is single threaded. This could gate both mm_cid and time
slice extension.
>
> Seriously the kernel is there to manage resources and provide resource
> control, but it's not there to accommodate the laziness of user space
> programmers and to proliferate the 'I envision this to be widely used'
> wishful thinking mindset.
Gating the mm_cid with a default to cpu_id is fine with me.
>
> That said I'm not completely against making this per process, but then
> it has to be enabled on the main thread _before_ it spawns threads and
> rejected otherwise.
Agreed. And it would be nice if we can achieve rseq feature enablement
in a way that is relatively common for all rseq features (e.g. through
a single prctl option, applying per-process, requiring single-threaded
state).
>
> That said I just went down the obvious road of making it opt-in and
> therefore low overhead and flexible by default. That is correct, simple
> and straightforward. No?
As I pointed out above, I'm simply trying to find the way to express
this feature enablement in a way that's the best fit for the userspace
ecosystem as well, without that being too much trouble on the kernel
side.
I note your wish to gate the mm_cid with a similar enablement ABI, and
I'm OK with this unless an existing mm_cid user considers this a
significant ABI break. As maintainer of librseq I'm OK with this,
we should ask the tcmalloc maintainers.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [patch V3 06/12] rseq: Implement sys_rseq_slice_yield()
2025-10-31 21:07 ` Thomas Gleixner
@ 2025-11-03 17:07 ` Mathieu Desnoyers
0 siblings, 0 replies; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-11-03 17:07 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Peter Zijlstra, Arnd Bergmann, linux-arch, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior
On 2025-10-31 17:07, Thomas Gleixner wrote:
> On Fri, Oct 31 2025 at 15:46, Mathieu Desnoyers wrote:
>
>> On 2025-10-29 09:22, Thomas Gleixner wrote:
>>>
>>> +/**
>>> + * sys_rseq_slice_yield - yield the current processor if a task granted
>>> + * with a time slice extension is done with the
>>> + * critical work before being forced out.
>>> + *
>>> + * On entry from user space, syscall_entry_work() ensures that NEED_RESCHED is
>>> + * set if the task was granted a slice extension before arriving here.
>>> + *
>>> + * Return: 1 if the task successfully yielded the CPU within the granted slice.
>>> + * 0 if the slice extension was either never granted or was revoked by
>>> + * going over the granted extension or being scheduled out earlier
>>
>> I notice the presence of tabs in those comments. You will likely want
>> to convert those to spaces.
>
> And why so? It's perfectly formatted and there is ZERO reason to use
> spaces in comments when you want to have aligned formatting.
You're right, nevermind.
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [patch V3 02/12] rseq: Add fields and constants for time slice extension
2025-11-03 17:00 ` Mathieu Desnoyers
@ 2025-11-03 19:19 ` Florian Weimer
0 siblings, 0 replies; 63+ messages in thread
From: Florian Weimer @ 2025-11-03 19:19 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Thomas Gleixner, LKML, Peter Zijlstra, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann, linux-arch,
carlos@redhat.com, Dmitry Vyukov, Marco Elver, Peter Oskolkov
* Mathieu Desnoyers:
> On 2025-10-31 16:58, Thomas Gleixner wrote:
>> On Fri, Oct 31 2025 at 15:31, Mathieu Desnoyers wrote:
>>> On 2025-10-29 09:22, Thomas Gleixner wrote:
>>> [...]
>>>> +
>>>> +The thread has to enable the functionality via prctl(2)::
>>>> +
>>>> + prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
>>>> + PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
>>>
>>> Enabling specifically for each thread requires hooking into thread
>>> creation, and is not a good fit for enabling this from executable or
>>> library constructor function.
>> Where is the problem? It's not rocket science to handle that in user
>> space.
>
> Overhead at thread creation is a metric that is closely followed
> by glibc developers. If we want a fine-grained per-thread control over
> the slice extension mechanism, it would be good if we can think of a way
> to allow userspace to enable it through either clone3 or rseq so
> we don't add another round-trip to the kernel at thread creation.
> This could be done either as an addition to this prctl, or as a
> replacement if we don't want to add two ways to do the same thing.
I think this is a bit exaggerated. 8-)
I'm more concerned about this: If it's a separate system call like the
quoted prctl, we'll likely have cases where the program launches and
this feature automatically gets enabled for the main thread by glibc.
Then the application installs a seccomp filter that doesn't allow the
prctl, and calls pthread_create. At this point we either end up with a
partially enabled feature (depending on which thread the code runs), or
we have to fail the pthread_create call. Neither option is great.
So something enabled by rseq flags seems better to me. Maybe
default-enable and disable with a non-zero flag if backwards
compatibility is sufficient? As far as I understand it, this series has
performance improvements that more than offset the slice extension cost?
Thanks,
Florian
* Re: [patch V3 02/12] rseq: Add fields and constants for time slice extension
2025-10-29 13:22 ` [patch V3 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
2025-10-30 22:01 ` Prakash Sangappa
2025-10-31 19:31 ` Mathieu Desnoyers
@ 2025-11-04 0:20 ` Steven Rostedt
2 siblings, 0 replies; 63+ messages in thread
From: Steven Rostedt @ 2025-11-04 0:20 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Prakash Sangappa,
Madadi Vineeth Reddy, K Prateek Nayak, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch, vineethrp
On Wed, 29 Oct 2025 14:22:14 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:
> +/**
> + * rseq_slice_ctrl - Time slice extension control structure
> + * @all: Compound value
> + * @request: Request for a time slice extension
> + * @granted: Granted time slice extension
> + *
> + * @request is set by user space and can be cleared by user space or kernel
> + * space. @granted is set and cleared by the kernel and must only be read
> + * by user space.
> + */
> +struct rseq_slice_ctrl {
> + union {
> + __u32 all;
> + struct {
> + __u8 request;
> + __u8 granted;
> + __u16 __reserved;
> + };
> + };
> +};
> +
> /*
> * struct rseq is aligned on 4 * 8 bytes to ensure it is always
> * contained within a single cache-line.
> @@ -142,6 +174,12 @@ struct rseq {
> __u32 mm_cid;
>
> /*
> + * Time slice extension control structure. CPU local updates from
> + * kernel and user space.
> + */
> + struct rseq_slice_ctrl slice_ctrl;
> +
BTW, Google is interested in expanding this feature for VMs, as a VM kernel
spinlock also happens to be a user space spinlock. The KVM folks would
rather have this implemented via the normal user space method than do
anything specific to the KVM internal code. Or at least, keep it as
non-intrusive as possible.
I talked with Mathieu and the KVM folks on how it could use the rseq
method, and it was suggested that qemu would set up a shared memory region
between the qemu thread and the virtual CPU and possibly submit a driver
that would expose this memory region. This could hook to a paravirt
spinlock that would set the bit stating the system is in a critical section
and clear it when all spin locks are released. If the vCPU was granted an
extra time slice, then it would call a hypercall that would do the yield.
When I mentioned this to Mathieu, he was against sharing the qemu thread's
rseq with the guest VM, as that would expose much more than what is needed
to the guest, especially since it needs to be a writable memory location.
What could be done is that another memory range is mapped between the qemu
thread and the vCPU memory, and the rseq would have a pointer to that memory.
To implement that, the slice_ctrl would need to be a pointer, where the
kernel would need to do another indirection to follow that pointer to
another location within the thread's memory.
Now I do believe that the return back to guest goes through a different
path. So this doesn't actually need to use rseq. But it would require a way
for the qemu thread to pass the memory to the kernel. I'm guessing that the
return to guest logic could share the code with the return to user logic
with just passing a struct rseq_slice_ctrl pointer to a function?
I'm bringing this up so that this use case is considered when implementing
the extended time slice, as I believe this would be a more common case than
the user space spin lock would be.
Thanks,
-- Steve
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
` (12 preceding siblings ...)
2025-10-29 15:10 ` [patch V3 00/12] rseq: Implement time slice extension mechanism Sebastian Andrzej Siewior
@ 2025-11-06 17:28 ` Prakash Sangappa
2025-11-10 14:23 ` Mathieu Desnoyers
13 siblings, 1 reply; 63+ messages in thread
From: Prakash Sangappa @ 2025-11-06 17:28 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
> On Oct 29, 2025, at 6:22 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> This is a follow up on the V2 version:
>
> https://lore.kernel.org/20251022110646.839870156@linutronix.de
>
> V1 contains a detailed explanation:
>
> https://lore.kernel.org/20250908225709.144709889@linutronix.de
>
> TLDR: Time slice extensions are an attempt to provide opportunistic
> priority ceiling without the overhead of an actual priority ceiling
> protocol, but also without the guarantees such a protocol provides.
[…]
>
>
> The uaccess and RSEQ modifications on which this series is based can be
> found here:
>
> https://lore.kernel.org/20251029123717.886619142@linutronix.de
>
> and in git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/cid
>
> For your convenience all of it is also available as a conglomerate from
> git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>
Hit this watchdog panic.
Using the following tree. Assume this is the latest.
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/slice
Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
-Prakash
-------------------------------------------------------
watchdog: CPU152: Watchdog detected hard LOCKUP on cpu 152
..
[ 93.093858] RIP: 0010:mm_get_cid+0x7e/0xd0
[ 93.093866] Code: 4c eb 63 f3 90 8b 05 f1 6a 66 02 8b 35 d7 bc 8e 01 83 c0 3f 48 89 f5 c1 e8 03 25 f8 ff ff 1f 48 8d 3c 43 e8 24 ce 62 00 89 c1 <39> e8 73 d5 8b 35 c8 6a 66 02 89 c0 8d 56 3f c1 ea 03 81 e2 f8 ff
[ 93.093867] RSP: 0018:ff734c4591c6bc38 EFLAGS: 00000046
[ 93.093869] RAX: 0000000000000180 RBX: ff3c42cea15ec2c0 RCX: 0000000000000180
[ 93.093871] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 93.093872] RBP: 0000000000000180 R08: 0000000000000000 R09: 0000000000000000
[ 93.093873] R10: 0000000000000000 R11: 00000000fffffff4 R12: ff3c42cea15ebd30
[ 93.093874] R13: ffa54c453ba41640 R14: ff3c42cea15ebd28 R15: ff3c42cea15ebd27
[ 93.093875] FS: 00007f92b1482740(0000) GS:ff3c43e8d55ef000(0000) knlGS:0000000000000000
[ 93.093876] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 93.093877] CR2: 00007f8ebe7fbfb8 CR3: 00000126c9f61004 CR4: 0000000000f71ef0
[ 93.093878] PKRU: 55555554
[ 93.093879] Call Trace:
[ 93.093882] <TASK>
[ 93.093887] sched_mm_cid_fork+0x22d/0x300
[ 93.093895] copy_process+0x92a/0x1670
[ 93.093902] kernel_clone+0xbc/0x490
[ 93.093903] ? srso_alias_return_thunk+0x5/0xfbef5
[ 93.093907] ? __lruvec_stat_mod_folio+0x83/0xd0
[ 93.093911] __do_sys_clone+0x65/0xa0
[ 93.093916] do_syscall_64+0x7f/0x8a0
> Thanks,
>
> tglx
>
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-11-06 17:28 ` Prakash Sangappa
@ 2025-11-10 14:23 ` Mathieu Desnoyers
2025-11-10 17:05 ` Mathieu Desnoyers
2025-11-11 16:42 ` Mathieu Desnoyers
0 siblings, 2 replies; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-11-10 14:23 UTC (permalink / raw)
To: Prakash Sangappa, Thomas Gleixner
Cc: LKML, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch@vger.kernel.org
On 2025-11-06 12:28, Prakash Sangappa wrote:
[...]
> Hit this watchdog panic.
>
> Using following tree. Assume this Is the latest.
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/slice
>
> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
When this happened during the development of the "complex" mm_cid
scheme, this was typically caused by a stale "mm_cid" being kept around
by a task even though it was not actually scheduled, thus causing
over-reservation of concurrency IDs beyond the max_cids threshold. This
ends up looping in:
static inline unsigned int mm_get_cid(struct mm_struct *mm)
{
unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
while (cid == MM_CID_UNSET) {
cpu_relax();
cid = __mm_get_cid(mm, num_possible_cpus());
}
return cid;
}
Based on the stacktrace you provided, it seems to happen within
sched_mm_cid_fork() within copy_process, so perhaps it's simply an
initialization issue in fork, or an issue when cloning a new thread ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-11-10 14:23 ` Mathieu Desnoyers
@ 2025-11-10 17:05 ` Mathieu Desnoyers
2025-11-11 16:42 ` Mathieu Desnoyers
1 sibling, 0 replies; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-11-10 17:05 UTC (permalink / raw)
To: Prakash Sangappa, Thomas Gleixner
Cc: LKML, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch@vger.kernel.org
On 2025-11-10 09:23, Mathieu Desnoyers wrote:
> On 2025-11-06 12:28, Prakash Sangappa wrote:
> [...]
>> Hit this watchdog panic.
>>
>> Using following tree. Assume this Is the latest.
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/
>> slice
>>
>> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
>> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
>
> When this happened during the development of the "complex" mm_cid
> scheme, this was typically caused by a stale "mm_cid" being kept around
> by a task even though it was not actually scheduled, thus causing
> over-reservation of concurrency IDs beyond the max_cids threshold. This
> ends up looping in:
>
> static inline unsigned int mm_get_cid(struct mm_struct *mm)
> {
> unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
>
> while (cid == MM_CID_UNSET) {
> cpu_relax();
> cid = __mm_get_cid(mm, num_possible_cpus());
> }
> return cid;
> }
>
> Based on the stacktrace you provided, it seems to happen within
> sched_mm_cid_fork() within copy_process, so perhaps it's simply an
> initialization issue in fork, or an issue when cloning a new thread ?
One possible issue here: I note that kernel/sched/core.c:mm_init_cid()
misses the following initialization:
mm->mm_cid.transit = 0;
Thanks,
Mathieu
>
> Thanks,
>
> Mathieu
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-11-10 14:23 ` Mathieu Desnoyers
2025-11-10 17:05 ` Mathieu Desnoyers
@ 2025-11-11 16:42 ` Mathieu Desnoyers
2025-11-12 6:30 ` Prakash Sangappa
2025-11-12 20:31 ` Thomas Gleixner
1 sibling, 2 replies; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-11-11 16:42 UTC (permalink / raw)
To: Prakash Sangappa, Thomas Gleixner
Cc: LKML, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch@vger.kernel.org
On 2025-11-10 09:23, Mathieu Desnoyers wrote:
> On 2025-11-06 12:28, Prakash Sangappa wrote:
> [...]
>> Hit this watchdog panic.
>>
>> Using following tree. Assume this Is the latest.
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/
>> slice
>>
>> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
>> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
>
> When this happened during the development of the "complex" mm_cid
> scheme, this was typically caused by a stale "mm_cid" being kept around
> by a task even though it was not actually scheduled, thus causing
> over-reservation of concurrency IDs beyond the max_cids threshold. This
> ends up looping in:
>
> static inline unsigned int mm_get_cid(struct mm_struct *mm)
> {
> unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
>
> while (cid == MM_CID_UNSET) {
> cpu_relax();
> cid = __mm_get_cid(mm, num_possible_cpus());
> }
> return cid;
> }
>
> Based on the stacktrace you provided, it seems to happen within
> sched_mm_cid_fork() within copy_process, so perhaps it's simply an
> initialization issue in fork, or an issue when cloning a new thread ?
I've spent some time digging through Thomas' implementation of
mm_cid management. I've spotted something which may explain
the watchdog panic. Here is the scenario:
1) A process is constrained to a subset of the possible CPUs,
and has enough threads to swap from per-thread to per-cpu mm_cid
mode. It runs happily in that per-cpu mode.
2) The number of allowed CPUs is increased for a process, thus invoking
mm_update_cpus_allowed. This switches the mode back to per-thread,
but delays invocation of mm_cid_work_fn to some point in the future,
in thread context, through irq_work + schedule_work.
At that point, because only __mm_update_max_cids was called by
mm_update_cpus_allowed, the max_cids is updated, but mc->transit
is still zero.
Also, until mm_cid_fixup_cpus_to_tasks is invoked by either the
scheduled work or near the end of sched_mm_cid_fork, or by
sched_mm_cid_exit, we are in a state where mm_cids are still
owned by CPUs, but we are now in per-thread mm_cid mode, which
means that the mc->max_cids value depends on the number of threads.
3) At that point, a new thread is cloned, thus invoking
sched_mm_cid_fork. Calling sched_mm_cid_add_user increases the user
count and invokes mm_update_max_cids, which updates the mc->max_cids
limit, but does not set the mc->transit flag because this call does not
swap from per-cpu to per-task mode (the mode is already per-task).
Immediately after the call to sched_mm_cid_add_user, sched_mm_cid_fork()
attempts to call mm_get_cid while the mm_cid mutex and mm_cid lock
are held, and loops forever because the mm_cid mask has all
the max_cids IDs reserved due to the stale per-cpu CIDs.
I see two possible issues here:
A) mm_update_cpus_allowed can transition from per-cpu to per-task mm_cid
mode without setting the mc->transit flag.
B) sched_mm_cid_fork calls mm_get_cid() before invoking
mm_cid_fixup_cpus_to_tasks(), which would reclaim stale per-cpu
mm_cids and make them available for mm_get_cid().
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-11-11 16:42 ` Mathieu Desnoyers
@ 2025-11-12 6:30 ` Prakash Sangappa
2025-11-12 20:40 ` Mathieu Desnoyers
2025-11-12 21:57 ` Thomas Gleixner
2025-11-12 20:31 ` Thomas Gleixner
1 sibling, 2 replies; 63+ messages in thread
From: Prakash Sangappa @ 2025-11-12 6:30 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Thomas Gleixner, LKML, Peter Zijlstra, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
> On Nov 11, 2025, at 8:42 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> On 2025-11-10 09:23, Mathieu Desnoyers wrote:
>> On 2025-11-06 12:28, Prakash Sangappa wrote:
>> [...]
>>> Hit this watchdog panic.
>>>
>>> Using following tree. Assume this Is the latest.
>>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/ rseq/ slice
>>>
>>> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
>>> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
>> When this happened during the development of the "complex" mm_cid
>> scheme, this was typically caused by a stale "mm_cid" being kept around
>> by a task even though it was not actually scheduled, thus causing
>> over-reservation of concurrency IDs beyond the max_cids threshold. This
>> ends up looping in:
>> static inline unsigned int mm_get_cid(struct mm_struct *mm)
>> {
>> unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
>> while (cid == MM_CID_UNSET) {
>> cpu_relax();
>> cid = __mm_get_cid(mm, num_possible_cpus());
>> }
>> return cid;
>> }
>> Based on the stacktrace you provided, it seems to happen within
>> sched_mm_cid_fork() within copy_process, so perhaps it's simply an
>> initialization issue in fork, or an issue when cloning a new thread ?
>
> I've spent some time digging through Thomas' implementation of
> mm_cid management. I've spotted something which may explain
> the watchdog panic. Here is the scenario:
[..]
> I see two possible issues here:
>
> A) mm_update_cpus_allowed can transition from per-cpu to per-task mm_cid
> mode without setting the mc->transit flag.
>
> B) sched_mm_cid_fork calls mm_get_cpu() before invoking
> mm_cid_fixup_cpus_to_tasks() which would reclaim stale per-cpu
> mm_cids and make them available for mm_get_cpu().
>
> Thoughts ?
The problem reproduces on a 2-socket AMD (384 CPUs) bare metal system.
It occurs soon after system boot up. It does not reproduce on a 64-CPU VM.
I managed to grep the ‘mksquashfs’ command that was executing, which triggers the panic.
#ps -ef |grep mksquash.
root 16614 10829 0 05:55 ? 00:00:00 mksquashfs /dev/null /var/tmp/dracut.iLs0z0/.squash-test.img -no-progress -comp xz
I added the following printk’s to mm_get_cid():
static inline unsigned int mm_get_cid(struct mm_struct *mm)
{
	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
+	int max_cids = READ_ONCE(mm->mm_cid.max_cids);
+	long *addr = mm_cidmask(mm);
+
+	if (cid == MM_CID_UNSET) {
+		printk(KERN_INFO "pid %d, exec %s, maxcids %d percpu %d pcputhr %d, users %d nrcpus_allwd %d\n",
+		       mm->owner->pid, mm->owner->comm,
+		       max_cids,
+		       mm->mm_cid.percpu,
+		       mm->mm_cid.pcpu_thrs,
+		       mm->mm_cid.users,
+		       mm->mm_cid.nr_cpus_allowed);
+		printk(KERN_INFO "cid bitmask %lx %lx %lx %lx %lx %lx\n",
+		       addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]);
+	}
	while (cid == MM_CID_UNSET) {
		cpu_relax();
Got the following trace (trimmed).
[ 65.139543] pid 16614, exec mksquashfs, maxcids 82 percpu 0 pcputhr 0, users 66 nrcpus_allwd 384
[ 65.139544] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 494e495f43455357 44455a494c414954
[ 65.139597] pid 16614, exec mksquashfs, maxcids 83 percpu 0 pcputhr 0, users 67 nrcpus_allwd 384
[ 65.139599] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 494e495f4345535f 44455a494c414954
..
[ 65.142665] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 44455a5fffffffff
[ 65.142750] pid 16614, exec mksquashfs, maxcids 155 percpu 0 pcputhr 0, users 124 nrcpus_allwd 384
[ 65.142752] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 44455a7fffffffff
..
[ 65.143712] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
[ 65.143767] pid 16614, exec mksquashfs, maxcids 175 percpu 0 pcputhr 0, users 140 nrcpus_allwd 384
[ 65.143769] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
Followed by the panic.
[ 99.979256] watchdog: CPU114: Watchdog detected hard LOCKUP on cpu 114
..
[ 99.979340] RIP: 0010:mm_get_cid+0xf5/0x150
[ 99.979346] Code: 4d 8b 44 24 18 48 c7 c7 e0 07 86 b6 49 8b 4c 24 10 49 8b 54 24 08 41 ff 74 24 28 49 8b 34 24 e8 c1 b7 04 00 48 83 c4 18 f3 90 <8b> 05 65 ae ec 01 8b 35 eb e0 68 01 83 c0 3f 48 89 f5 c1 e8 03 25
[ 99.979348] RSP: 0018:ff75650cf9717d20 EFLAGS: 00000046
[ 99.979349] RAX: 0000000000000180 RBX: ff424236e5d55c40 RCX: 0000000000000180
[ 99.979351] RDX: 0000000000000000 RSI: 0000000000000180 RDI: ff424236e5d55cd0
[ 99.979352] RBP: 0000000000000180 R08: 0000000000000180 R09: c0000000fffdffff
[ 99.979352] R10: 0000000000000001 R11: ff75650cf9717a80 R12: ff424236e5d55ca0
[ 99.979353] R13: ff424236e5d55668 R14: ffa7650cba2841c0 R15: ff42423881a5aa80
[ 99.979355] FS: 00007f469ed6b740(0000) GS:ff424351c24d6000(0000) knlGS:0000000000000000
[ 99.979356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 99.979357] CR2: 00007f443b7fdfb8 CR3: 0000012724555006 CR4: 0000000000771ef0
[ 99.979358] PKRU: 55555554
[ 99.979359] Call Trace:
[ 99.979361] <TASK>
[ 99.979364] sched_mm_cid_fork+0x3fb/0x590
[ 99.979369] copy_process+0xd1a/0x2130
[ 99.979375] kernel_clone+0x9d/0x3b0
[ 99.979379] __do_sys_clone+0x65/0x90
[ 99.979384] do_syscall_64+0x64/0x670
[ 99.979388] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 99.979391] RIP: 0033:0x7f469d77d8c5
As you can see, at least when it cannot find available cids, it is in per-task mm cid mode.
Perhaps it is taking longer to drop used cids? I have not delved into the mm cid management.
Hopefully you can make out something from the above trace.
Let me know if you want me to add more tracing.
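As a side note for readers of the archive, the bitmask words in the trace above can be decoded by simply counting set bits. A short sketch (plain Python, independent of the kernel code) shows why the loop can never terminate: in the last trace line all six 64-bit words are fully set, i.e. all 384 possible IDs are marked reserved, while max_cids is only 175.

```python
# Decode the final "cid bitmask" line from the trace above. Each word is a
# 64-bit chunk of the mm_cid allocation bitmap (384 possible CPUs -> 6 words).
words = [0xFFFFFFFFFFFFFFFF] * 6  # all six words fully set in the last line

# Count the total number of reserved concurrency IDs in the bitmap
set_bits = sum(bin(w).count("1") for w in words)
print(set_bits)  # 384

# max_cids was only 175 at that point: every allowed ID (and more) is already
# marked in use, so the mm_get_cid() loop can never find a free one.
assert set_bits == 384 and set_bits > 175
```

This matches the observed hard lockup: __mm_get_cid() spins forever because the bitmap is over-reserved relative to max_cids.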
-Prakash
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-11-11 16:42 ` Mathieu Desnoyers
2025-11-12 6:30 ` Prakash Sangappa
@ 2025-11-12 20:31 ` Thomas Gleixner
2025-11-12 20:46 ` Mathieu Desnoyers
1 sibling, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2025-11-12 20:31 UTC (permalink / raw)
To: Mathieu Desnoyers, Prakash Sangappa
Cc: LKML, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch@vger.kernel.org
On Tue, Nov 11 2025 at 11:42, Mathieu Desnoyers wrote:
> On 2025-11-10 09:23, Mathieu Desnoyers wrote:
> I've spent some time digging through Thomas' implementation of
> mm_cid management. I've spotted something which may explain
> the watchdog panic. Here is the scenario:
>
> 1) A process is constrained to a subset of the possible CPUs,
> and has enough threads to swap from per-thread to per-cpu mm_cid
> mode. It runs happily in that per-cpu mode.
>
> 2) The number of allowed CPUs is increased for a process, thus invoking
> mm_update_cpus_allowed. This switches the mode back to per-thread,
> but delays invocation of mm_cid_work_fn to some point in the future,
> in thread context, through irq_work + schedule_work.
>
> At that point, because only __mm_update_max_cids was called by
> mm_update_cpus_allowed, the max_cids is updated, but mc->transit
> is still zero.
>
> Also, until mm_cid_fixup_cpus_to_tasks is invoked by either the
> scheduled work or near the end of sched_mm_cid_fork, or by
> sched_mm_cid_exit, we are in a state where mm_cids are still
> owned by CPUs, but we are now in per-thread mm_cid mode, which
> means that the mc->max_cids value depends on the number of threads.
No. It stays in per CPU mode. The mode switch itself happens either in
the worker or on fork/exit whatever comes first.
> 3) At that point, a new thread is cloned, thus invoking
> sched_mm_cid_fork. Calling sched_mm_cid_add_user increases the user
> count and invokes mm_update_max_cids, which updates the mc->max_cids
> limit, but does not set the mc->transit flag because this call does not
> swap from per-cpu to per-task mode (the mode is already per-task).
No. mm::mm_cid::percpu is still set. So mm::mm_cid::transit is irrelevant.
> Immediately after the call to sched_mm_cid_add_user, sched_mm_cid_fork()
> attempts to call mm_get_cid while the mm_cid mutex and mm_cid lock
> are held, and loops forever because the mm_cid mask has all
> the max_cids IDs reserved because of the stale per-cpu CIDs.
Definitely not. sched_mm_cid_add_user() invokes mm_update_max_cids()
which does the mode switch in mm_cid, sets transit and returns true,
which means that fork() goes and does the transition game and allocates
the CID for the new task after that completed.
There was an issue in V3 with the not-initialized transit member and a
off by one in one of the transition functions. It's fixed in the git
tree, but I haven't posted it yet because I was AFK for a week.
I did not notice the V3 issue because tests passed on a small machine,
but after I did a rebase to the tip rseq and uaccess bits, I noticed the
failure because I tested on a larger box.
Thanks,
tglx
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-11-12 6:30 ` Prakash Sangappa
@ 2025-11-12 20:40 ` Mathieu Desnoyers
2025-11-12 21:57 ` Thomas Gleixner
1 sibling, 0 replies; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-11-12 20:40 UTC (permalink / raw)
To: Prakash Sangappa
Cc: Thomas Gleixner, LKML, Peter Zijlstra, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
On 2025-11-12 01:30, Prakash Sangappa wrote:
[...]
>
> The problem reproduces on a 2 socket AMD(384 cpus) bare metal system.
> It occurs soon after system boot up. Does not reproduce on a 64cpu VM.
>
> Managed to grep the ‘mksquashfs’ command that was executing, which triggers the panic.
>
> #ps -ef |grep mksquash.
> root 16614 10829 0 05:55 ? 00:00:00 mksquashfs /dev/null /var/tmp/dracut.iLs0z0/.squash-test.img -no-progress -comp xz
>
>
[...]
> ..
> [ 65.143712] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
> [ 65.143767] pid 16614, exec mksquashfs, maxcids 175 percpu 0 pcputhr 0, users 140 nrcpus_allwd 384
> [ 65.143769] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
>
It's weird that the cid bitmask is all f values (all bits set). Aren't those
zeroed on mm init?
> Followed by the panic.
> [ 99.979256] watchdog: CPU114: Watchdog detected hard LOCKUP on cpu 114
> ..
[...]
>
> As you can see, at least when it cannot find available cid’s it is in per-task mm cid mode.
> Perhaps it is taking longer to drop used cid’s? I have not delved into the mm cid management.
> Hopeful you can make out something from the above trace.
>
> Let me know if you want me to add more tracing.
How soon is that after boot up?
I'm starting to wonder if the num_possible_cpus() value used in
mm_cid_size() and mm_init_cid used respectively for mm allocation
and initialization may be read before it is initialized by the boot up
sequence ?
That's far fetched, but it would be good if we can double-check that
those are never called before the last call to init_cpu_possible and
set_cpu_possible().
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-11-12 20:31 ` Thomas Gleixner
@ 2025-11-12 20:46 ` Mathieu Desnoyers
2025-11-12 21:54 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Mathieu Desnoyers @ 2025-11-12 20:46 UTC (permalink / raw)
To: Thomas Gleixner, Prakash Sangappa
Cc: LKML, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch@vger.kernel.org
On 2025-11-12 15:31, Thomas Gleixner wrote:
> On Tue, Nov 11 2025 at 11:42, Mathieu Desnoyers wrote:
>> On 2025-11-10 09:23, Mathieu Desnoyers wrote:
>> I've spent some time digging through Thomas' implementation of
>> mm_cid management. I've spotted something which may explain
>> the watchdog panic. Here is the scenario:
>>
>> 1) A process is constrained to a subset of the possible CPUs,
>> and has enough threads to swap from per-thread to per-cpu mm_cid
>> mode. It runs happily in that per-cpu mode.
>>
>> 2) The number of allowed CPUs is increased for a process, thus invoking
>> mm_update_cpus_allowed. This switches the mode back to per-thread,
>> but delays invocation of mm_cid_work_fn to some point in the future,
>> in thread context, through irq_work + schedule_work.
>>
>> At that point, because only __mm_update_max_cids was called by
>> mm_update_cpus_allowed, the max_cids is updated, but mc->transit
>> is still zero.
>>
>> Also, until mm_cid_fixup_cpus_to_tasks is invoked by either the
>> scheduled work or near the end of sched_mm_cid_fork, or by
>> sched_mm_cid_exit, we are in a state where mm_cids are still
>> owned by CPUs, but we are now in per-thread mm_cid mode, which
>> means that the mc->max_cids value depends on the number of threads.
>
> No. It stays in per CPU mode. The mode switch itself happens either in
> the worker or on fork/exit whatever comes first.
Ah, that's what I missed. All good then.
[...]
>
> There was an issue in V3 with the not-initialized transit member and a
> off by one in one of the transition functions. It's fixed in the git
> tree, but I haven't posted it yet because I was AFK for a week.
>
> I did not notice the V3 issue because tests passed on a small machine,
> but after I did a rebase to the tip rseq and uaccess bits, I noticed the
> failure because I tested on a larger box.
Good ! We'll see if this fixes the issue observed by Prakash. If not,
I'm curious to validate that num_possible_cpus() is always set to its
final value before _any_ mm is created.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-11-12 20:46 ` Mathieu Desnoyers
@ 2025-11-12 21:54 ` Thomas Gleixner
0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-11-12 21:54 UTC (permalink / raw)
To: Mathieu Desnoyers, Prakash Sangappa
Cc: LKML, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch@vger.kernel.org
On Wed, Nov 12 2025 at 15:46, Mathieu Desnoyers wrote:
> On 2025-11-12 15:31, Thomas Gleixner wrote:
>> I did not notice the V3 issue because tests passed on a small machine,
>> but after I did a rebase to the tip rseq and uaccess bits, I noticed the
>> failure because I tested on a larger box.
>
> Good ! We'll see if this fixes the issue observed by Prakash. If not,
> I'm curious to validate that num_possible_cpus() is always set to its
> final value before _any_ mm is created.
It _is_ set to its final value in start_kernel() before
setup_per_cpu_areas() is invoked. Otherwise the kernel would not work at
all.
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-11-12 6:30 ` Prakash Sangappa
2025-11-12 20:40 ` Mathieu Desnoyers
@ 2025-11-12 21:57 ` Thomas Gleixner
2025-11-12 23:17 ` Prakash Sangappa
1 sibling, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2025-11-12 21:57 UTC (permalink / raw)
To: Prakash Sangappa, Mathieu Desnoyers
Cc: LKML, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Jonathan Corbet, Madadi Vineeth Reddy, K Prateek Nayak,
Steven Rostedt, Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch@vger.kernel.org
On Wed, Nov 12 2025 at 06:30, Prakash Sangappa wrote:
> The problem reproduces on a 2 socket AMD(384 cpus) bare metal system.
> It occurs soon after system boot up. Does not reproduce on a 64cpu VM.
Can you verify that the top most commit of the rseq/slice branch is:
d2eb5c9c0693 ("selftests/rseq: Implement time slice extension test")
Thanks,
tglx
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-11-12 21:57 ` Thomas Gleixner
@ 2025-11-12 23:17 ` Prakash Sangappa
2025-11-13 2:34 ` Prakash Sangappa
0 siblings, 1 reply; 63+ messages in thread
From: Prakash Sangappa @ 2025-11-12 23:17 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Mathieu Desnoyers, LKML, Peter Zijlstra, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
> On Nov 12, 2025, at 1:57 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, Nov 12 2025 at 06:30, Prakash Sangappa wrote:
>> The problem reproduces on a 2 socket AMD(384 cpus) bare metal system.
>> It occurs soon after system boot up. Does not reproduce on a 64cpu VM.
>
> Can you verify that the top most commit of the rseq/slice branch is:
>
> d2eb5c9c0693 ("selftests/rseq: Implement time slice extension test")
No, it is:
c46f12a1166058764da8e84a215a6b66cae2fe0a
selftests/rseq: Implement time slice extension test
I can refresh and try.
-Prakash
>
> Thanks,
>
> tglx
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-11-12 23:17 ` Prakash Sangappa
@ 2025-11-13 2:34 ` Prakash Sangappa
2025-11-13 14:38 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Prakash Sangappa @ 2025-11-13 2:34 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Mathieu Desnoyers, LKML, Peter Zijlstra, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
> On Nov 12, 2025, at 3:17 PM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
>
>
>
>> On Nov 12, 2025, at 1:57 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Wed, Nov 12 2025 at 06:30, Prakash Sangappa wrote:
>>> The problem reproduces on a 2 socket AMD(384 cpus) bare metal system.
>>> It occurs soon after system boot up. Does not reproduce on a 64cpu VM.
>>
>> Can you verify that the top most commit of the rseq/slice branch is:
>>
>> d2eb5c9c0693 ("selftests/rseq: Implement time slice extension test")
>
> No it is
>
> c46f12a1166058764da8e84a215a6b66cae2fe0a
> selftests/rseq: Implement time slice extension test
>
Tested the latest from rseq/slice with the topmost commit you mentioned above, and the
watchdog panic does not reproduce anymore.
Thanks,
-Prakash
> I can refresh and try.
> -Prakash
>
>>
>> Thanks,
>>
>> tglx
>
* Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
2025-11-13 2:34 ` Prakash Sangappa
@ 2025-11-13 14:38 ` Thomas Gleixner
0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-11-13 14:38 UTC (permalink / raw)
To: Prakash Sangappa
Cc: Mathieu Desnoyers, LKML, Peter Zijlstra, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
On Thu, Nov 13 2025 at 02:34, Prakash Sangappa wrote:
>> On Nov 12, 2025, at 3:17 PM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
>>> On Nov 12, 2025, at 1:57 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> Can you verify that the top most commit of the rseq/slice branch is:
>>>
>>> d2eb5c9c0693 ("selftests/rseq: Implement time slice extension test")
>>
>> No it is
>>
>> c46f12a1166058764da8e84a215a6b66cae2fe0a
>> selftests/rseq: Implement time slice extension test
>>
>
> Tested the latest from rseq/slice with the top most commit you mentioned above and the
> watchdog panic does not reproduce anymore.
Thanks for checking. I'll post a V4 soonish.
Thanks,
tglx
* Re: [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions
2025-10-29 13:22 ` [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
2025-10-31 19:53 ` Mathieu Desnoyers
@ 2025-11-19 0:20 ` Prakash Sangappa
2025-11-19 15:25 ` Thomas Gleixner
1 sibling, 1 reply; 63+ messages in thread
From: Prakash Sangappa @ 2025-11-19 0:20 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
> On Oct 29, 2025, at 6:22 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice
> extension. This allows to handle the rseq_slice_yield() syscall, which is
> used by user space to relinquish the CPU after finishing the critical
> section for which it requested an extension.
>
> In case the kernel state is still GRANTED, the kernel resets both kernel
> and user space state with a set of sanity checks. If the kernel state is
> already cleared, then this raced against the timer or some other interrupt
> and just clears the work bit.
>
> Doing it in syscall entry work allows to catch misbehaving user space,
> which issues a syscall from the critical section. Wrong syscall and
> inconsistent user space result in a SIGSEGV.
>
>
[…]
> +/*
> + * Invoked from syscall entry if a time slice extension was granted and the
> + * kernel did not clear it before user space left the critical section.
> + */
> +void rseq_syscall_enter_work(long syscall)
> +{
[…]
>
> + curr->rseq.slice.state.granted = false;
> + /*
> + * Clear the grant in user space and check whether this was the
> + * correct syscall to yield. If the user access fails or the task
> + * used an arbitrary syscall, terminate it.
> + */
> + if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all) || syscall != __NR_rseq_slice_yield)
> + force_sig(SIGSEGV);
> +}
I have been trying to get our Database team to implement changes to use the slice extension API.
They encounter an issue when a system call is made within the slice extension window: the
process dies with SIGSEGV.
Apparently it will be hard to enforce not calling a system call in the slice extension window due to layering.
For the DB use case, it is fine to terminate the slice extension if a system call is made, but the process
getting killed will not work.
Thanks,
-Prakash
* Re: [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions
2025-11-19 0:20 ` Prakash Sangappa
@ 2025-11-19 15:25 ` Thomas Gleixner
2025-11-20 7:37 ` Prakash Sangappa
0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2025-11-19 15:25 UTC (permalink / raw)
To: Prakash Sangappa
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
On Wed, Nov 19 2025 at 00:20, Prakash Sangappa wrote:
>> On Oct 29, 2025, at 6:22 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> + if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all) || syscall != __NR_rseq_slice_yield)
>> + force_sig(SIGSEGV);
>> +}
>
> I have been trying to get our Database team to implement changes to
> use the slice extension API. They encounter the issue with a system
> call being made within the slice extension window and the process dies
> with SEGV.
Good. Works as designed.
> Apparently it will be hard to enforce not calling a system call in the
> slice extension window due to layering.
Why do I have a smell of rotten onions in my nose right now?
> For the DB use case, It is fine to terminate the slice extension if a
> system call is made, but the process getting killed will not work.
That's not a question of being fine or not.
The point is that on PREEMPT_NONE/VOLUNTARY that arbitrary syscall can
consume tons of CPU cycles until it either schedules out voluntarily or
reaches __exit_to_user_mode_loop(), which is defeating the whole
mechanism. The timer does not help in that case because once the task is
in the kernel it won't be preempted on return from interrupt.
sys_rseq_slice_yield() is time bound, which is why it was implemented
that way.
I was absolutely right when I asked to tie this mechanism to
PREEMPT_LAZY|FULL in the first place. That would nicely avoid the whole
problem.
Something like the uncompiled and untested below should work. Though I
hate it with a passion.
Thanks,
tglx
---
Subject: rseq/slice: Handle rotten onions gracefully
From: Thomas Gleixner <tglx@linutronix.de>
Date: Wed, 19 Nov 2025 16:07:15 +0100
Add rant here.
Not-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
kernel/rseq.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -643,13 +643,21 @@ void rseq_syscall_enter_work(long syscal
}
curr->rseq.slice.state.granted = false;
+ /* Clear the grant in user space. */
+ if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all))
+ force_sig(SIGSEGV);
+
/*
- * Clear the grant in user space and check whether this was the
- * correct syscall to yield. If the user access fails or the task
- * used an arbitrary syscall, terminate it.
+ * Grudgingly support onion layer applications which cannot
+ * guarantee that rseq_slice_yield() is used to yield the CPU for
+ * terminating a grant. This is a NOP on PREEMPT_FULL/LAZY because
+ * enabling preemption above already scheduled, but required for
+ * PREEMPT_NONE/VOLUNTARY to prevent that the slice is further
+ * expanded up to the point where the syscall code schedules
+ * voluntarily or reaches exit_to_user_mode_loop().
*/
- if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all) || syscall != __NR_rseq_slice_yield)
- force_sig(SIGSEGV);
+ if (syscall != __NR_rseq_slice_yield)
+ cond_resched();
}
int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
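To make the behavioral change in the patch above concrete, the syscall-entry decision can be modeled as a small executable sketch (plain Python, not kernel code; the syscall number and the returned action strings are placeholders for illustration only):

```python
NR_RSEQ_SLICE_YIELD = 1000  # placeholder syscall number, illustrative only

def enter_work(syscall, put_user_ok, onion_mode):
    """Model of rseq_syscall_enter_work() when an extension was granted.

    onion_mode=False models the original V3 behavior; onion_mode=True models
    the 'rotten onions' patch. Returns the action the kernel takes.
    """
    # The grant is always cleared in user space first; a failed write is fatal
    if not put_user_ok:
        return "SIGSEGV"
    if syscall != NR_RSEQ_SLICE_YIELD:
        # V3: any syscall other than rseq_slice_yield() kills the task.
        # Patched: the grant is simply ended by scheduling, so that on
        # PREEMPT_NONE/VOLUNTARY the slice cannot be stretched further.
        return "cond_resched" if onion_mode else "SIGSEGV"
    return "ok"

print(enter_work(42, True, False))                  # V3: SIGSEGV
print(enter_work(42, True, True))                   # patched: cond_resched
print(enter_work(NR_RSEQ_SLICE_YIELD, True, True))  # proper yield: ok
```

Note that the put_user() failure path remains fatal in both variants; only the "wrong syscall" path is relaxed.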
* Re: [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions
2025-11-19 15:25 ` Thomas Gleixner
@ 2025-11-20 7:37 ` Prakash Sangappa
2025-11-20 11:31 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Prakash Sangappa @ 2025-11-20 7:37 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
> On Nov 19, 2025, at 7:25 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, Nov 19 2025 at 00:20, Prakash Sangappa wrote:
>>> On Oct 29, 2025, at 6:22 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> + if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all) || syscall != __NR_rseq_slice_yield)
>>> + force_sig(SIGSEGV);
>>> +}
>>
>> I have been trying to get our Database team to implement changes to
>> use the slice extension API. They encounter the issue with a system
>> call being made within the slice extension window and the process dies
>> with SEGV.
>
> Good. Works as designed.
>
>> Apparently it will be hard to enforce not calling a system call in the
>> slice extension window due to layering.
>
> Why do I have a smell of rotten onions in my nose right now?
>
>> For the DB use case, It is fine to terminate the slice extension if a
>> system call is made, but the process getting killed will not work.
>
> That's not a question of being fine or not.
>
> The point is that on PREEMPT_NONE/VOLUNTARY that arbitrary syscall can
> consume tons of CPU cycles until it either schedules out voluntarily or
> reaches __exit_to_user_mode_loop(), which is defeating the whole
> mechanism. The timer does not help in that case because once the task is
> in the kernel it won't be preempted on return from interrupt.
>
> sys_rseq_sched_yield() is time bound, which is why it was implemented
> that way.
>
> I was absolutely right when I asked to tie this mechanism to
> PREEMPT_LAZY|FULL in the first place. That would nicely avoid the whole
> problem.
>
> Something like the uncompiled and untested below should work. Though I
> hate it with a passion.
That works. It addresses the DB issue.
> + * Grudgingly support onion layer applications which cannot
> + * guarantee that rseq_slice_yield() is used to yield the CPU for
> + * terminating a grant. This is a NOP on PREEMPT_FULL/LAZY because
> + * enabling preemption above already scheduled, but required for
> + * PREEMPT_NONE/VOLUNTARY to prevent that the slice is further
> + * expanded up to the point where the syscall code schedules
> + * voluntarily or reaches exit_to_user_mode_loop().
> */
> - if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all) || syscall != __NR_rseq_slice_yield)
> - force_sig(SIGSEGV);
> + if (syscall != __NR_rseq_slice_yield)
> + cond_resched();
> }
With this change, here are the ‘swingbench’ performance results I received from our Database team.
https://www.dominicgiles.com/swingbench/
Kernel based on rseq/slice v3 + above change.
System: 2 socket AMD.
Cached DB config - i.e DB files cached on tmpfs.
Response from Database performance engineer:-
Overall the results are very positive and consistent with the earlier findings; we see a clear benefit from the optimization running the same tests as earlier.
• The sgrant figure in /sys/kernel/debug/rseq/stats increases with the DB side optimization enabled, while it stays flat when disabled. I believe this indicates that both the kernel-side code & the DB side triggers are working as expected.
• Due to the contentious nature of the workload these tests produce highly erratic results, but the optimization is showing improved performance across 3x tests with/without use of time slice extension.
• Swingbench throughput with use of time slice optimization
• Run 1: 50,008.10
• Run 2: 59,160.60
• Run 3: 67,342.70
• Swingbench throughput without use of time slice optimization
• Run 1: 36,422.80
• Run 2: 33,186.00
• Run 3: 44,309.80
• The application performs 55% better on average with the optimization.
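The quoted 55% figure checks out against the run numbers above; comparing the means of the two sets of three runs gives roughly 55% (a trivial sanity check, shown here only to make the arithmetic explicit):

```python
# Swingbench throughput runs quoted above
with_ext = [50008.10, 59160.60, 67342.70]     # time slice optimization on
without_ext = [36422.80, 33186.00, 44309.80]  # time slice optimization off

avg_with = sum(with_ext) / len(with_ext)             # ~58837.1
avg_without = sum(without_ext) / len(without_ext)    # ~37972.9

# Relative improvement of the averages, in percent
gain = (avg_with / avg_without - 1) * 100
print(f"{gain:.1f}% improvement on average")  # ~54.9%, i.e. the quoted ~55%
```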
-Prakash
* Re: [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions
2025-11-20 7:37 ` Prakash Sangappa
@ 2025-11-20 11:31 ` Thomas Gleixner
2025-11-21 0:12 ` Prakash Sangappa
2025-11-21 9:28 ` david laight
0 siblings, 2 replies; 63+ messages in thread
From: Thomas Gleixner @ 2025-11-20 11:31 UTC (permalink / raw)
To: Prakash Sangappa
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
On Thu, Nov 20 2025 at 07:37, Prakash Sangappa wrote:
>> On Nov 19, 2025, at 7:25 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> Something like the uncompiled and untested below should work. Though I
>> hate it with a passion.
>
> That works. It addresses DB issue.
>
> With this change, here are the ’swingbench’ performance results I received from our Database team.
> https://www.dominicgiles.com/swingbench/
>
> Kernel based on rseq/slice v3 + above change.
> System: 2 socket AMD.
> Cached DB config - i.e DB files cached on tmpfs.
>
> Response from Database performance engineer:-
>
> Overall the results are very positive and consistent with the earlier
> findings, we see a clear benefit from the optimization running the
> same tests as earlier.
>
> • The sgrant figure in /sys/kernel/debug/rseq/stats increases with the
> DB side optimization enabled, while it stays flat when disabled. I
> believe this indicates that both the kernel-side code & the DB side
> triggers are working as expected.
Correct.
> • Due to the contentious nature of the workload these tests produce
> highly erratic results, but the optimization is showing improved
> performance across 3x tests with/without use of time slice extension.
>
> • Swingbench throughput with use of time slice optimization
> • Run 1: 50,008.10
> • Run 2: 59,160.60
> • Run 3: 67,342.70
> • Swingbench throughput without use of time slice optimization
> • Run 1: 36,422.80
> • Run 2: 33,186.00
> • Run 3: 44,309.80
> • The application performs 55% better on average with the optimization.
55% is insane.
Could you please ask your performance guys to provide numbers for the
below configurations to see how the different parts of this work are
affecting the overall result:
1) Linux 6.17 (no rseq rework, no slice)
2) Linux 6.17 + your initial attempt to enable slice extension
We already have the numbers for the full new stack above (with and
without slice), so that should give us the full picture.
Thanks,
tglx
* Re: [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions
2025-11-20 11:31 ` Thomas Gleixner
@ 2025-11-21 0:12 ` Prakash Sangappa
2025-11-26 22:02 ` Prakash Sangappa
2025-11-21 9:28 ` david laight
1 sibling, 1 reply; 63+ messages in thread
From: Prakash Sangappa @ 2025-11-21 0:12 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
> On Nov 20, 2025, at 3:31 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Nov 20 2025 at 07:37, Prakash Sangappa wrote:
>>> On Nov 19, 2025, at 7:25 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> Something like the uncompiled and untested below should work. Though I
>>> hate it with a passion.
>>
>> That works. It addresses the DB issue.
>>
>> With this change, here are the 'swingbench' performance results I received from our Database team.
>> https://www.dominicgiles.com/swingbench/
>>
>> Kernel based on rseq/slice v3 + above change.
>> System: 2 socket AMD.
>> Cached DB config - i.e. DB files cached on tmpfs.
>>
>> Response from the Database performance engineer:
>>
>> Overall the results are very positive and consistent with the earlier
>> findings, we see a clear benefit from the optimization running the
>> same tests as earlier.
>>
>> • The sgrant figure in /sys/kernel/debug/rseq/stats increases with the
>> DB side optimization enabled, while it stays flat when disabled. I
>> believe this indicates that both the kernel-side code & the DB side
>> triggers are working as expected.
>
> Correct.
>
>> • Due to the contentious nature of the workload these tests produce
>> highly erratic results, but the optimization is showing improved
>> performance across 3x tests with/without use of time slice extension.
>>
>> • Swingbench throughput with use of time slice optimization
>> • Run 1: 50,008.10
>> • Run 2: 59,160.60
>> • Run 3: 67,342.70
>> • Swingbench throughput without use of time slice optimization
>> • Run 1: 36,422.80
>> • Run 2: 33,186.00
>> • Run 3: 44,309.80
>> • The application performs 55% better on average with the optimization.
>
> 55% is insane.
>
> Could you please ask your performance guys to provide numbers for the
> below configurations to see how the different parts of this work are
> affecting the overall result:
>
> 1) Linux 6.17 (no rseq rework, no slice)
>
> 2) Linux 6.17 + your initial attempt to enable slice extension
>
> We already have the numbers for the full new stack above (with and
> without slice), so that should give us the full picture.
>
Ok, will ask him to run these.
-Prakash.
> Thanks,
>
> tglx
* Re: [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions
2025-11-20 11:31 ` Thomas Gleixner
2025-11-21 0:12 ` Prakash Sangappa
@ 2025-11-21 9:28 ` david laight
1 sibling, 0 replies; 63+ messages in thread
From: david laight @ 2025-11-21 9:28 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Prakash Sangappa, LKML, Peter Zijlstra, Mathieu Desnoyers,
Paul E. McKenney, Boqun Feng, Jonathan Corbet,
Madadi Vineeth Reddy, K Prateek Nayak, Steven Rostedt,
Sebastian Andrzej Siewior, Arnd Bergmann,
linux-arch@vger.kernel.org
On Thu, 20 Nov 2025 12:31:54 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:
...
> > • Due to the contentious nature of the workload these tests produce
> > highly erratic results, but the optimization is showing improved
> > performance across 3x tests with/without use of time slice extension.
> >
> > • Swingbench throughput with use of time slice optimization
> > • Run 1: 50,008.10
> > • Run 2: 59,160.60
> > • Run 3: 67,342.70
> > • Swingbench throughput without use of time slice optimization
> > • Run 1: 36,422.80
> > • Run 2: 33,186.00
> > • Run 3: 44,309.80
> > • The application performs 55% better on average with the optimization.
>
> 55% is insane.
>
> Could you please ask your performance guys to provide numbers for the
> below configurations to see how the different parts of this work are
> affecting the overall result:
>
> 1) Linux 6.17 (no rseq rework, no slice)
>
> 2) Linux 6.17 + your initial attempt to enable slice extension
>
> We already have the numbers for the full new stack above (with and
> without slice), so that should give us the full picture.
It is also worth checking that you don't have a single (or limited)
thread test where the busy thread is being bounced between cpus.
While busy the cpu frequency is increased; when moved to an idle
cpu it will initially run at the low frequency and then speed up.
This effect doubled the execution time of a (mostly) single threaded
fpga compile from 10 minutes to 20 minutes - all caused by one of
the mitigations slowing down syscall entry/exit enough that a load
of basically idle processes which woke every 10ms were all active
at once.
You've also got the underlying problem that you can't disable
interrupts in userspace.
If an ISR happens in your 'critical region' you just lose 'big time'.
Any threads that contend pretty much have to wait for the ISR
(and any non-threaded softints) to complete.
With heavy network traffic that can easily exceed 1ms.
Nothing you can do to the scheduler will change it.
David
* Re: [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions
2025-11-21 0:12 ` Prakash Sangappa
@ 2025-11-26 22:02 ` Prakash Sangappa
0 siblings, 0 replies; 63+ messages in thread
From: Prakash Sangappa @ 2025-11-26 22:02 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, Madadi Vineeth Reddy,
K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
Arnd Bergmann, linux-arch@vger.kernel.org
> On Nov 20, 2025, at 4:12 PM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
>
>
>
>> On Nov 20, 2025, at 3:31 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Thu, Nov 20 2025 at 07:37, Prakash Sangappa wrote:
>>>> On Nov 19, 2025, at 7:25 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>> Something like the uncompiled and untested below should work. Though I
>>>> hate it with a passion.
>>>
>>> That works. It addresses DB issue.
>>>
>>> With this change, here are the ’swingbench’ performance results I received from our Database team.
>>> https://www.dominicgiles.com/swingbench/
>>>
>>> Kernel based on rseq/slice v3 + above change.
>>> System: 2 socket AMD.
>>> Cached DB config - i.e DB files cached on tmpfs.
>>>
>>> Response from Database performance engineer:-
>>>
>>> Overall the results are very positive and consistent with the earlier
>>> findings, we see a clear benefit from the optimization running the
>>> same tests as earlier.
>>>
>>> • The sgrant figure in /sys/kernel/debug/rseq/stats increases with the
>>> DB side optimization enabled, while it stays flat when disabled. I
>>> believe this indicates that both the kernel-side code & the DB side
>>> triggers are working as expected.
>>
>> Correct.
>>
>>> • Due to the contentious nature of the workload these tests produce
>>> highly erratic results, but the optimization is showing improved
>>> performance across 3x tests with/without use of time slice extension.
>>>
>>> • Swingbench throughput with use of time slice optimization
>>> • Run 1: 50,008.10
>>> • Run 2: 59,160.60
>>> • Run 3: 67,342.70
>>> • Swingbench throughput without use of time slice optimization
>>> • Run 1: 36,422.80
>>> • Run 2: 33,186.00
>>> • Run 3: 44,309.80
>>> • The application performs 55% better on average with the optimization.
>>
>> 55% is insane.
>>
>> Could you please ask your performance guys to provide numbers for the
>> below configurations to see how the different parts of this work are
>> affecting the overall result:
>>
>> 1) Linux 6.17 (no rseq rework, no slice)
>>
>> 2) Linux 6.17 + your initial attempt to enable slice extension
>>
>> We already have the numbers for the full new stack above (with and
>> without slice), so that should give us the full picture.
>>
>
My previous (initial) implementation on the v6.17 kernel was showing higher numbers.
So, to keep things comparable to the rseq/slice kernel, I got the following numbers from the DB engineer
with the previous implementation built on a v6.18-rc4 kernel.
Swingbench throughput with use of slice extension (previous implementation)
* Run 1: 50824.10
* Run 2: 54058.30
* Run 3: 30212.50
Swingbench throughput without use of the optimization.
* Run 1: 33036.50
* Run 2: 35939.60
* Run 3: 40461.70
Performs 23% better with the time slice optimization.
The workload shows a lot of variability. However, the overall trend seems consistent (i.e. we see
improvement with slice extension).
I think the above should give an idea of the potential gains the underlying rseq framework optimization adds.
Thanks,
-Prakash
> Ok, will ask him to run these.
> -Prakash.
>
>> Thanks,
>>
>> tglx
>
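The 23% figure for the previous implementation can be cross-checked the same way from the runs quoted above (again just arithmetic on the reported numbers, independent of any kernel change):

```python
# Swingbench throughput, previous slice-extension implementation on v6.18-rc4.
with_prev = [50824.10, 54058.30, 30212.50]     # slice extension enabled
without_opt = [33036.50, 35939.60, 40461.70]   # slice extension disabled

avg_with = sum(with_prev) / len(with_prev)        # ~45031.63
avg_without = sum(without_opt) / len(without_opt) # ~36479.27

gain_pct = (avg_with / avg_without - 1) * 100     # ~23.4%
print(f"previous implementation gain: {gain_pct:.0f}%")
```

Note the large run-to-run spread (Run 3 with the extension is below all three baseline runs), which is consistent with the "lot of variability" caveat; the 23% is a ratio of averages, not a per-run guarantee.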
end of thread, other threads:[~2025-11-26 22:03 UTC | newest]
Thread overview: 63+ messages
2025-10-29 13:22 [patch V3 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-10-29 13:22 ` [patch V3 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
2025-10-29 13:22 ` [patch V3 02/12] rseq: Add fields and constants for time slice extension Thomas Gleixner
2025-10-30 22:01 ` Prakash Sangappa
2025-10-31 14:32 ` Thomas Gleixner
2025-10-31 19:31 ` Mathieu Desnoyers
2025-10-31 20:58 ` Thomas Gleixner
2025-11-01 22:53 ` Thomas Gleixner
2025-11-03 17:00 ` Mathieu Desnoyers
2025-11-03 19:19 ` Florian Weimer
2025-11-04 0:20 ` Steven Rostedt
2025-10-29 13:22 ` [patch V3 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
2025-10-29 17:23 ` Randy Dunlap
2025-10-29 21:12 ` Thomas Gleixner
2025-10-31 19:34 ` Mathieu Desnoyers
2025-10-29 13:22 ` [patch V3 04/12] rseq: Add statistics " Thomas Gleixner
2025-10-31 19:36 ` Mathieu Desnoyers
2025-10-29 13:22 ` [patch V3 05/12] rseq: Add prctl() to enable " Thomas Gleixner
2025-10-31 19:43 ` Mathieu Desnoyers
2025-10-31 21:05 ` Thomas Gleixner
2025-10-29 13:22 ` [patch V3 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
2025-10-31 19:46 ` Mathieu Desnoyers
2025-10-31 21:07 ` Thomas Gleixner
2025-11-03 17:07 ` Mathieu Desnoyers
2025-10-29 13:22 ` [patch V3 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
2025-10-31 19:53 ` Mathieu Desnoyers
2025-11-19 0:20 ` Prakash Sangappa
2025-11-19 15:25 ` Thomas Gleixner
2025-11-20 7:37 ` Prakash Sangappa
2025-11-20 11:31 ` Thomas Gleixner
2025-11-21 0:12 ` Prakash Sangappa
2025-11-26 22:02 ` Prakash Sangappa
2025-11-21 9:28 ` david laight
2025-10-29 13:22 ` [patch V3 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
2025-10-29 18:45 ` Steven Rostedt
2025-10-29 21:37 ` Thomas Gleixner
2025-10-29 23:53 ` Steven Rostedt
2025-10-31 19:59 ` Mathieu Desnoyers
2025-10-29 13:22 ` [patch V3 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
2025-10-31 20:03 ` Mathieu Desnoyers
2025-10-29 13:22 ` [patch V3 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
2025-10-29 20:08 ` Steven Rostedt
2025-10-29 21:46 ` Thomas Gleixner
2025-10-29 22:04 ` Steven Rostedt
2025-10-31 14:33 ` Thomas Gleixner
2025-10-29 13:22 ` [patch V3 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
2025-10-29 13:22 ` [patch V3 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
2025-10-29 15:10 ` [patch V3 00/12] rseq: Implement time slice extension mechanism Sebastian Andrzej Siewior
2025-10-29 15:40 ` Steven Rostedt
2025-10-29 21:49 ` Thomas Gleixner
2025-11-06 17:28 ` Prakash Sangappa
2025-11-10 14:23 ` Mathieu Desnoyers
2025-11-10 17:05 ` Mathieu Desnoyers
2025-11-11 16:42 ` Mathieu Desnoyers
2025-11-12 6:30 ` Prakash Sangappa
2025-11-12 20:40 ` Mathieu Desnoyers
2025-11-12 21:57 ` Thomas Gleixner
2025-11-12 23:17 ` Prakash Sangappa
2025-11-13 2:34 ` Prakash Sangappa
2025-11-13 14:38 ` Thomas Gleixner
2025-11-12 20:31 ` Thomas Gleixner
2025-11-12 20:46 ` Mathieu Desnoyers
2025-11-12 21:54 ` Thomas Gleixner