From: Thomas Gleixner <tglx@linutronix.de>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Prakash Sangappa <prakash.sangappa@oracle.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Arnd Bergmann <arnd@arndb.de>,
	linux-arch@vger.kernel.org
Subject: [patch 02/12] rseq: Add fields and constants for time slice extension
Date: Tue,  9 Sep 2025 00:59:51 +0200 (CEST)
Message-ID: <20250908225752.679815003@linutronix.de>
In-Reply-To: 20250908225709.144709889@linutronix.de

Aside from a Kconfig knob, add the following items:

   - Two flag bits for the rseq user space ABI, which allow user space to
     query the availability and enablement without a syscall.

   - A new member to the user space ABI struct rseq, which is going to be
     used to communicate request and grant between kernel and user space.

   - A rseq state struct to hold the kernel state of this mechanism

   - Documentation of the new mechanism

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 Documentation/userspace-api/index.rst |    1 
 Documentation/userspace-api/rseq.rst  |  129 ++++++++++++++++++++++++++++++++++
 include/linux/rseq_types.h            |   26 ++++++
 include/uapi/linux/rseq.h             |   28 +++++++
 init/Kconfig                          |   12 +++
 kernel/rseq.c                         |    8 ++
 6 files changed, 204 insertions(+)

--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -21,6 +21,7 @@ System calls
    ebpf/index
    ioctl/index
    mseal
+   rseq
 
 Security-related interfaces
 ===========================
--- /dev/null
+++ b/Documentation/userspace-api/rseq.rst
@@ -0,0 +1,129 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per-thread user-space memory area
+to be used as an ABI between kernel and user-space for three purposes:
+
+ * user-space restartable sequences
+
+ * quick access to the current CPU number and node ID from user-space
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow user-space to perform update operations on
+per-cpu data without requiring heavy-weight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows user-space to implement per-CPU data efficiently. Documentation is
+in code and selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+    * Enabled in Kconfig
+
+    * Enabled at boot time (default is enabled)
+
+    * A rseq user space pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success; otherwise it fails with one of these error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL	  Functionality not available or invalid function arguments.
+          Note: arg4 and arg5 must be zero
+ENOTSUPP  Functionality was disabled on the kernel command line
+ENXIO	  Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can also be queried via prctl(2)::
+
+  prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it fails with the following error code:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL	  Functionality not available or invalid function arguments.
+          Note: arg3, arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status are also exposed in the flags field of the rseq
+ABI struct via ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and serve informational purposes only.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting the ``RSEQ_SLICE_EXT_REQUEST_BIT`` in the struct
+rseq slice_ctrl field. If the thread is interrupted and the interrupt
+results in a reschedule request in the kernel, then the kernel can grant a
+time slice extension and return to user space instead of scheduling
+out.
+
+The kernel indicates the grant by clearing ``RSEQ_SLICE_EXT_REQUEST_BIT``
+and setting ``RSEQ_SLICE_EXT_GRANTED_BIT`` in the rseq::slice_ctrl
+field. If there is a reschedule of the thread after granting the extension,
+the kernel clears the granted bit to indicate that to user space.
+
+If the request bit is still set when leaving the critical section, user
+space can clear it and continue.
+
+If the granted bit is set, then user space has to invoke rseq_slice_yield()
+when leaving the critical section to relinquish the CPU. The kernel
+enforces this by arming a timer to prevent misbehaving user space from
+abusing this mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by user space.
+
+The required code flow is as follows::
+
+    rseq->slice_ctrl = REQUEST;
+    critical_section();
+    if (!local_test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
+        if (rseq->slice_ctrl & GRANTED)
+                rseq_slice_yield();
+    }
+
+local_test_and_clear_bit() has to be local CPU atomic to prevent the
+obvious RMW race versus an interrupt. On X86 this can be achieved with BTRL
+without LOCK prefix. On architectures which do not provide lightweight
+CPU-local atomics, this needs to be implemented with regular atomic operations.
+
+Setting REQUEST has no atomicity requirements as there is no concurrency
+vs. the GRANTED bit.
+
+Checking the GRANTED bit has no atomicity requirements as there is
+obviously a race which cannot be avoided at all::
+
+    if (rseq->slice_ctrl & GRANTED)
+      -> Interrupt results in schedule and grant revocation
+        rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -71,12 +71,35 @@ struct rseq_ids {
 };
 
 /**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state:	Compound to access the overall state
+ * @enabled:	Time slice extension is enabled for the task
+ * @granted:	Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+	u16			state;
+	struct {
+		u8		enabled;
+		u8		granted;
+	};
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state:	Time slice extension state
+ */
+struct rseq_slice {
+	union rseq_slice_state	state;
+};
+
+/**
  * struct rseq_data - Storage for all rseq related data
  * @usrptr:	Pointer to the registered user space RSEQ memory
  * @len:	Length of the RSEQ region
  * @sig:	Signature of critial section abort IPs
  * @event:	Storage for event management
  * @ids:	Storage for cached CPU ID and MM CID
+ * @slice:	Storage for time slice extension data
  */
 struct rseq_data {
 	struct rseq __user		*usrptr;
@@ -84,6 +107,9 @@ struct rseq_data {
 	u32				sig;
 	struct rseq_event		event;
 	struct rseq_ids			ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+	struct rseq_slice		slice;
+#endif
 };
 
 #else /* CONFIG_RSEQ */
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -23,9 +23,15 @@ enum rseq_flags {
 };
 
 enum rseq_cs_flags_bit {
+	/* Historical and unsupported bits */
 	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
+	/* (3) Intentional gap to put new bits into a separate byte */
+
+	/* User read only feature flags */
+	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT	= 4,
+	RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT	= 5,
 };
 
 enum rseq_cs_flags {
@@ -35,6 +41,22 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	=
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE	=
+		(1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+	RSEQ_CS_FLAG_SLICE_EXT_ENABLED		=
+		(1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
+};
+
+enum rseq_slice_bits {
+	/* Time slice extension ABI bits */
+	RSEQ_SLICE_EXT_REQUEST_BIT		= 0,
+	RSEQ_SLICE_EXT_GRANTED_BIT		= 1,
+};
+
+enum rseq_slice_masks {
+	RSEQ_SLICE_EXT_REQUEST	= (1U << RSEQ_SLICE_EXT_REQUEST_BIT),
+	RSEQ_SLICE_EXT_GRANTED	= (1U << RSEQ_SLICE_EXT_GRANTED_BIT),
 };
 
 /*
@@ -142,6 +164,12 @@ struct rseq {
 	__u32 mm_cid;
 
 	/*
+	 * Time slice extension control word. CPU local atomic updates from
+	 * kernel and user space.
+	 */
+	__u32 slice_ctrl;
+
+	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
 	char end[];
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1908,6 +1908,18 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
 
 	  If unsure, say N.
 
+config RSEQ_SLICE_EXTENSION
+	bool "Enable rseq based time slice extension mechanism"
+	depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+	help
+	  Allows userspace to request a limited time slice extension when
+	  returning from an interrupt to user space via the RSEQ shared
+	  data ABI. If granted, this allows the task to complete a critical
+	  section, so that other threads are not stuck on a contended
+	  resource while the task is scheduled out.
+
+	  If unsure, say N.
+
 config DEBUG_RSEQ
 	default n
 	bool "Enable debugging of rseq() system call" if EXPERT
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -387,6 +387,8 @@ static bool rseq_reset_ids(void)
  */
 SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
 {
+	u32 rseqfl = 0;
+
 	if (flags & RSEQ_FLAG_UNREGISTER) {
 		if (flags & ~RSEQ_FLAG_UNREGISTER)
 			return -EINVAL;
@@ -448,6 +450,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	if (put_user_masked_u64(0UL, &rseq->rseq_cs))
 		return -EFAULT;
 
+	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+		rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
+	if (put_user_masked_u32(rseqfl, &rseq->flags))
+		return -EFAULT;
+
 	/*
 	 * Activate the registration by setting the rseq area address, length
 	 * and signature in the task struct.
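
A minimal user-space sketch of the code flow documented in the rseq.rst hunk
above, for illustration only. It does not touch the real rseq area: struct
mock_rseq, the SLICE_EXT_* macros, rseq_slice_yield_stub() and
mock_kernel_grant() are hypothetical stand-ins. The test-and-clear step uses
a regular C11 atomic, which is the fallback the documentation names for
architectures without lightweight CPU-local atomics; it is stronger than
strictly required.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the masks added to include/uapi/linux/rseq.h (assumed values). */
#define SLICE_EXT_REQUEST	(1U << 0)	/* RSEQ_SLICE_EXT_REQUEST */
#define SLICE_EXT_GRANTED	(1U << 1)	/* RSEQ_SLICE_EXT_GRANTED */

/* Hypothetical stand-in for the slice_ctrl member of struct rseq. */
struct mock_rseq {
	_Atomic uint32_t slice_ctrl;
};

/* Counts stub invocations so that the flow can be checked. */
static int yield_calls;

/* Hypothetical stub; real code would relinquish the CPU to the kernel here. */
static void rseq_slice_yield_stub(void)
{
	yield_calls++;
}

/* Request an extension right before entering the critical section. */
static void slice_request(struct mock_rseq *rs)
{
	/* A plain store would do: GRANTED cannot be set concurrently here. */
	atomic_store(&rs->slice_ctrl, SLICE_EXT_REQUEST);
}

/* Leave the critical section. Returns true if the CPU was yielded. */
static bool slice_release(struct mock_rseq *rs)
{
	/* Test and clear REQUEST in one read-modify-write operation. */
	uint32_t old = atomic_fetch_and(&rs->slice_ctrl, ~SLICE_EXT_REQUEST);

	if (old & SLICE_EXT_REQUEST)
		return false;	/* Request was still pending, nothing to do */

	/* Unavoidably racy check, see the documentation. */
	if (atomic_load(&rs->slice_ctrl) & SLICE_EXT_GRANTED) {
		rseq_slice_yield_stub();
		return true;
	}
	return false;		/* Grant was revoked, nothing to do */
}

/* Test helper which mimics the kernel turning REQUEST into GRANTED. */
static void mock_kernel_grant(struct mock_rseq *rs)
{
	assert(atomic_load(&rs->slice_ctrl) & SLICE_EXT_REQUEST);
	atomic_store(&rs->slice_ctrl, SLICE_EXT_GRANTED);
}
```

A caller would wrap its critical section between slice_request() and
slice_release(). mock_kernel_grant() exists only to exercise the granted
path without a kernel which implements the mechanism.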

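The kernel-side union rseq_slice_state in the rseq_types.h hunk overlays a
16-bit compound on the two byte-sized members. A small stand-alone sketch of
what that layout permits, using user-space types and the hypothetical names
slice_state_sketch, slice_state_reset() and slice_state_idle(); how the rest
of the series actually uses the compound is not shown in this patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space mirror of union rseq_slice_state (u8/u16 replaced by stdint types). */
union slice_state_sketch {
	uint16_t		state;
	struct {
		uint8_t		enabled;
		uint8_t		granted;
	};
};

/* The compound has to cover both members exactly. */
static_assert(sizeof(union slice_state_sketch) == sizeof(uint16_t),
	      "compound must span both members");

/* Clear enabled and granted with a single store to the compound. */
static void slice_state_reset(union slice_state_sketch *s)
{
	s->state = 0;
}

/* Check with a single load whether both members are zero. */
static bool slice_state_idle(const union slice_state_sketch *s)
{
	return s->state == 0;
}
```

The benefit of such a layout is that both members can be tested or cleared
in one access instead of two.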


Thread overview: 54+ messages
2025-09-08 22:59 [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-09-08 22:59 ` [patch 01/12] sched: Provide and use set_need_resched_current() Thomas Gleixner
2025-09-08 22:59 ` Thomas Gleixner [this message]
2025-09-09  0:04   ` [patch 02/12] rseq: Add fields and constants for time slice extension Randy Dunlap
2025-09-11 15:41   ` Mathieu Desnoyers
2025-09-11 15:49     ` Mathieu Desnoyers
2025-09-22  5:28   ` Prakash Sangappa
2025-09-22  5:57     ` K Prateek Nayak
2025-09-22 13:57       ` Mathieu Desnoyers
2025-09-22 13:55     ` Mathieu Desnoyers
2025-09-23  0:57       ` Prakash Sangappa
2025-09-08 22:59 ` [patch 03/12] rseq: Provide static branch for time slice extensions Thomas Gleixner
2025-09-09  3:10   ` K Prateek Nayak
2025-09-09  4:11     ` Randy Dunlap
2025-09-09 12:12       ` Thomas Gleixner
2025-09-09 16:01         ` Randy Dunlap
2025-09-11 15:42   ` Mathieu Desnoyers
2025-09-08 22:59 ` [patch 04/12] rseq: Add statistics " Thomas Gleixner
2025-09-11 15:43   ` Mathieu Desnoyers
2025-09-08 22:59 ` [patch 05/12] rseq: Add prctl() to enable " Thomas Gleixner
2025-09-11 15:50   ` Mathieu Desnoyers
2025-09-11 16:52     ` K Prateek Nayak
2025-09-11 17:18       ` Mathieu Desnoyers
2025-09-08 23:00 ` [patch 06/12] rseq: Implement sys_rseq_slice_yield() Thomas Gleixner
2025-09-09  9:52   ` K Prateek Nayak
2025-09-09 12:23     ` Thomas Gleixner
2025-09-10 11:15   ` K Prateek Nayak
2025-09-08 23:00 ` [patch 07/12] rseq: Implement syscall entry work for time slice extensions Thomas Gleixner
2025-09-10  5:22   ` K Prateek Nayak
2025-09-10  7:49     ` Thomas Gleixner
2025-09-08 23:00 ` [patch 08/12] rseq: Implement time slice extension enforcement timer Thomas Gleixner
2025-09-10 11:20   ` K Prateek Nayak
2025-09-08 23:00 ` [patch 09/12] rseq: Reset slice extension when scheduled Thomas Gleixner
2025-09-08 23:00 ` [patch 10/12] rseq: Implement rseq_grant_slice_extension() Thomas Gleixner
2025-09-09  8:14   ` K Prateek Nayak
2025-09-09 12:16     ` Thomas Gleixner
2025-09-08 23:00 ` [patch 11/12] entry: Hook up rseq time slice extension Thomas Gleixner
2025-09-08 23:00 ` [patch 12/12] selftests/rseq: Implement time slice extension test Thomas Gleixner
2025-09-10 11:23   ` K Prateek Nayak
2025-09-09 12:37 ` [patch 00/12] rseq: Implement time slice extension mechanism Thomas Gleixner
2025-09-10  4:42   ` K Prateek Nayak
2025-09-10 11:28 ` K Prateek Nayak
2025-09-10 14:50   ` Thomas Gleixner
2025-09-11  3:03     ` K Prateek Nayak
2025-09-11  7:36       ` Prakash Sangappa
2025-09-11 15:27 ` Mathieu Desnoyers
2025-09-11 20:18   ` Thomas Gleixner
2025-09-12 12:33     ` Mathieu Desnoyers
2025-09-12 16:31       ` Thomas Gleixner
2025-09-12 19:26         ` Mathieu Desnoyers
2025-09-13 13:02           ` Thomas Gleixner
2025-09-19 17:30             ` Prakash Sangappa
2025-09-22 14:09               ` Mathieu Desnoyers
2025-09-23  1:01                 ` Prakash Sangappa
