From: Thomas Gleixner
To: LKML
Cc: Peter Zijlstra, Mathieu Desnoyers, "Paul E. McKenney", Boqun Feng,
    Jonathan Corbet, Prakash Sangappa, Madadi Vineeth Reddy,
    K Prateek Nayak, Steven Rostedt, Sebastian Andrzej Siewior,
    Arnd Bergmann, linux-arch@vger.kernel.org
Subject: [patch 00/12] rseq: Implement time slice extension mechanism
Date: Tue, 9 Sep 2025 00:59:47 +0200 (CEST)
Message-ID: <20250908225709.144709889@linutronix.de>
X-Mailing-List: linux-arch@vger.kernel.org

This is the proper implementation of the PoC code, which I posted in reply
to the latest iteration of Prakash's time slice extension patches:

    https://lore.kernel.org/all/87o6smb3a0.ffs@tglx

Time slice extensions are an attempt to provide opportunistic priority
ceiling without the overhead of an actual priority ceiling protocol, but
also without the guarantees such a protocol provides.

The intent is to avoid situations where a user space thread is interrupted
in a critical section and scheduled out, while holding a resource on which
the preempting thread or other threads in the system might block.
That obviously prevents those threads from making progress, in the worst
case for at least a full time slice. This is especially true for user
space spinlocks, which are a patently bad idea to begin with, but it also
applies to other mechanisms.

People have been trying to solve this for at least a decade, but so far
this went nowhere. The recent attempts, which started to integrate with
the already existing RSEQ mechanism, are at least going in the right
direction. The full history is partially in the above mentioned mail
thread and its ancestors, but also in various threads in the LKML
archives, which require archaeological efforts to retrieve.

When trying to morph the PoC into actual mergeable code, I stumbled over
various shortcomings in the RSEQ code, which have been addressed in a
separate effort. The latest iteration can be found here:

    https://lore.kernel.org/all/20250908212737.353775467@linutronix.de

That is a prerequisite for this series as it allows a tight integration
into the RSEQ code without inflicting a lot of extra overhead on the hot
paths.

The main change vs. the PoC and the previous attempts is that it utilizes
a new field in the user space ABI rseq struct, which allows reducing the
atomic operations in user space to a bare minimum. If the architecture
supports CPU local atomics, which protect against the obvious RMW race
vs. an interrupt, then no actual overhead, e.g. a LOCK prefix on x86, is
required.

The kernel user space ABI consists only of two bits in this new field:

	REQUEST and GRANTED

User space sets REQUEST at the beginning of the critical section. If it
finishes the critical section without interruption then it can clear the
bit and move on.

If it is interrupted and the interrupt return path in the kernel observes
a rescheduling request, then the kernel can grant a time slice extension.
The kernel clears the REQUEST bit and sets the GRANTED bit with a simple
non-atomic store operation.
If it does not grant the extension, only the REQUEST bit is cleared.

If user space observes the REQUEST bit cleared when it has finished the
critical section, then it has to check the GRANTED bit. If that is set,
then it has to invoke the rseq_slice_yield() syscall to terminate the
extension and yield the CPU.

The code flow in user space is:

	// Simple store as there is no concurrency vs. the GRANTED bit
	rseq->slice_ctrl = REQUEST;

	critical_section();

	// CPU local atomic required here:
	if (!test_and_clear_bit(REQUEST, &rseq->slice_ctrl)) {
		// Non-atomic check is sufficient as this can race
		// against an interrupt, which revokes the grant
		//
		// If not set, then the request was either cleared by the kernel
		// without grant or the grant was revoked.
		//
		// If set, tell the kernel that the critical section is done
		// so it can reschedule
		if (rseq->slice_ctrl & GRANTED)
			rseq_slice_yield();
	}

The other details, which differ from earlier attempts and the PoC, are:

 - A separate syscall for terminating the extension to avoid side effects
   and overloading of the already ill defined sched_yield(2)

 - A separate per CPU timer, which again does not inflict side effects on
   the scheduler internal hrtick timer. The hrtick timer can be disabled
   at run-time and an expiry can cause interesting problems in the
   scheduler code when it is unexpectedly invoked.

 - Tight integration into the rseq exit to user mode code. It utilizes
   the path when TIF_RSEQ is not set at the end of exit_to_user_mode() to
   arm the timer if an extension was granted. TIF_RSEQ indicates that the
   task was scheduled and therefore would revoke the grant anyway.

 - A futile attempt to make this "work" on the PREEMPT_LAZY preemption
   model which is utilized by PREEMPT_RT.

   It allows the extension to be granted when TIF_PREEMPT_LAZY is set,
   but not TIF_PREEMPT.

   Pretending that this can be made to work for TIF_PREEMPT on a fully
   preemptible kernel is just wishful thinking as the chance that
   TIF_PREEMPT is set in exit_to_user_mode() is close to zero for obvious
   reasons.

   This only "works" by some definition of works, i.e. on a best effort
   basis, for the PREEMPT_NONE model and nothing else. Though given the
   problems PREEMPT_NONE and also PREEMPT_VOLUNTARY have vs. long running
   code sections, the days of these models should be hopefully numbered
   and everything consolidated on the LAZY model.

   That makes this distinction moot and everything restricted to
   TIF_PREEMPT_LAZY unless someone is crazy enough to inflict the slice
   extension mechanism on the scheduler hotpath. I'm sure there will be
   attempts to do that as there is no lack of crazy folks out there...

 - Actual documentation of the user space ABI and an initial self test.

The RSEQ modifications on which this series is based can be found here:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf

For your convenience all of it is also available as a conglomerate from
git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice

Thanks,

	tglx
---
 Documentation/userspace-api/index.rst       |   1
 Documentation/userspace-api/rseq.rst        | 129 ++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |   1
 arch/arm/tools/syscall.tbl                  |   1
 arch/arm64/tools/syscall_32.tbl             |   1
 arch/m68k/kernel/syscalls/syscall.tbl       |   1
 arch/microblaze/kernel/syscalls/syscall.tbl |   1
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   1
 arch/parisc/kernel/syscalls/syscall.tbl     |   1
 arch/powerpc/kernel/syscalls/syscall.tbl    |   1
 arch/s390/kernel/syscalls/syscall.tbl       |   1
 arch/s390/mm/pfault.c                       |   3
 arch/sh/kernel/syscalls/syscall.tbl         |   1
 arch/sparc/kernel/syscalls/syscall.tbl      |   1
 arch/x86/entry/syscalls/syscall_32.tbl      |   1
 arch/x86/entry/syscalls/syscall_64.tbl      |   1
 arch/xtensa/kernel/syscalls/syscall.tbl     |   1
 include/linux/entry-common.h                |   2
 include/linux/rseq.h                        |  11 +
 include/linux/rseq_entry.h                  | 176 ++++++++++++++++
 include/linux/rseq_types.h                  |  28 ++
 include/linux/sched.h                       |   7
 include/linux/syscalls.h                    |   1
 include/linux/thread_info.h                 |  16 -
 include/uapi/asm-generic/unistd.h           |   5
 include/uapi/linux/prctl.h                  |  10
 include/uapi/linux/rseq.h                   |  28 ++
 init/Kconfig                                |  12 +
 kernel/entry/common.c                       |  14 +
 kernel/entry/syscall-common.c               |  11 -
 kernel/rcu/tiny.c                           |   8
 kernel/rcu/tree.c                           |  14 -
 kernel/rcu/tree_exp.h                       |   3
 kernel/rcu/tree_plugin.h                    |   9
 kernel/rcu/tree_stall.h                     |   3
 kernel/rseq.c                               | 293 ++++++++++++++++++++++++++++
 kernel/sys.c                                |   6
 kernel/sys_ni.c                             |   1
 scripts/syscall.tbl                         |   1
 tools/testing/selftests/rseq/.gitignore     |   1
 tools/testing/selftests/rseq/Makefile       |   5
 tools/testing/selftests/rseq/rseq-abi.h     |   2
 tools/testing/selftests/rseq/slice_test.c   | 217 ++++++++++++++++++++
 45 files changed, 991 insertions(+), 42 deletions(-)
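
The REQUEST/GRANTED handshake described in this cover letter can be
exercised without a patched kernel. What follows is only a minimal sketch
under stated assumptions, not the ABI from the series: the bit values
SLICE_REQUEST and SLICE_GRANTED, struct slice_sim, and the helpers
sim_kernel_resched(), sim_slice_yield(), slice_begin() and slice_end() are
all hypothetical names made up for illustration. The kernel side is
simulated by a plain function and no real syscall is issued. It also uses
a portable C11 atomic RMW, which is stronger than the CPU local atomic
that the text above says is sufficient.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the real ones come from the series' uapi header. */
#define SLICE_REQUEST (1u << 0)
#define SLICE_GRANTED (1u << 1)

/* Stand-in for the new rseq field plus a counter for simulated yields. */
struct slice_sim {
	_Atomic uint32_t slice_ctrl;
	unsigned int yields;
};

/* Simulated interrupt return path that observed a rescheduling request. */
static void sim_kernel_resched(struct slice_sim *s, bool grant)
{
	uint32_t ctrl = atomic_load_explicit(&s->slice_ctrl, memory_order_relaxed);

	if (!(ctrl & SLICE_REQUEST))
		return;
	/* REQUEST is always cleared, GRANTED is set only on a grant. */
	atomic_store_explicit(&s->slice_ctrl, grant ? SLICE_GRANTED : 0u,
			      memory_order_relaxed);
}

/* Stand-in for rseq_slice_yield(): only counts the call. */
static void sim_slice_yield(struct slice_sim *s)
{
	s->yields++;
}

/* Start of the critical section: simple store of REQUEST. */
static void slice_begin(struct slice_sim *s)
{
	atomic_store_explicit(&s->slice_ctrl, SLICE_REQUEST, memory_order_relaxed);
}

/* End of the critical section. Returns true if a yield was issued. */
static bool slice_end(struct slice_sim *s)
{
	/* Portable full atomic RMW; a CPU local atomic would be sufficient. */
	uint32_t old = atomic_fetch_and_explicit(&s->slice_ctrl, ~SLICE_REQUEST,
						 memory_order_relaxed);

	/* In this simulation both bits are never set at the same time. */
	assert((old & (SLICE_REQUEST | SLICE_GRANTED)) !=
	       (SLICE_REQUEST | SLICE_GRANTED));

	if (old & SLICE_REQUEST)
		return false;	/* Not touched by the kernel, nothing to do */

	if (atomic_load_explicit(&s->slice_ctrl, memory_order_relaxed) & SLICE_GRANTED) {
		sim_slice_yield(s);
		return true;
	}
	return false;		/* Request cleared without a grant */
}
```

A caller would bracket its critical section with slice_begin() and
slice_end(). Calling sim_kernel_resched() in between plays the role of an
interrupt that found a rescheduling request pending, with or without a
grant.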