* [PATCH v2 3/3] Documentation: Add real-time to core-api
From: Sebastian Andrzej Siewior @ 2025-08-15 9:38 UTC
To: linux-doc, linux-kernel, linux-rt-devel
Cc: Boqun Feng, Clark Williams, Frederic Weisbecker, Ingo Molnar,
John Ogness, Jonathan Corbet, Peter Zijlstra, Steven Rostedt,
Thomas Gleixner, Valentin Schneider, Waiman Long, Will Deacon,
Sebastian Andrzej Siewior
The documents explain the design concepts behind PREEMPT_RT and highlight key
differences necessary to achieve it.
They also include a list of requirements that must be fulfilled to support
PREEMPT_RT on a given architecture.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
Documentation/core-api/index.rst | 1 +
.../real-time/architecture-porting.rst | 109 ++++++++
.../core-api/real-time/differences.rst | 242 ++++++++++++++++++
Documentation/core-api/real-time/index.rst | 16 ++
Documentation/core-api/real-time/theory.rst | 116 +++++++++
5 files changed, 484 insertions(+)
create mode 100644 Documentation/core-api/real-time/architecture-porting.rst
create mode 100644 Documentation/core-api/real-time/differences.rst
create mode 100644 Documentation/core-api/real-time/index.rst
create mode 100644 Documentation/core-api/real-time/theory.rst
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index a03a99c2cac56..6cbdcbfa79c30 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -24,6 +24,7 @@ it.
printk-index
symbol-namespaces
asm-annotations
+ real-time/index
Data structures and low-level utilities
=======================================
diff --git a/Documentation/core-api/real-time/architecture-porting.rst b/Documentation/core-api/real-time/architecture-porting.rst
new file mode 100644
index 0000000000000..d822fac29922d
--- /dev/null
+++ b/Documentation/core-api/real-time/architecture-porting.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================
+Porting an architecture to support PREEMPT_RT
+=============================================
+
+:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
+
+This list outlines the architecture-specific requirements that must be
+implemented in order to enable PREEMPT_RT. Once all required features are
+implemented, ARCH_SUPPORTS_RT can be selected in the architecture's Kconfig to
+make PREEMPT_RT selectable.
+Many prerequisites (genirq support for example) are enforced by the common code
+and are omitted here.
+
+The optional features are not strictly required, but they are worth
+considering.
+
+Requirements
+------------
+
+Forced threaded interrupts
+ CONFIG_IRQ_FORCED_THREADING must be selected. Any interrupts that must
+ remain in hard-IRQ context must be marked with IRQF_NO_THREAD. This
+ requirement applies for instance to clocksource event interrupts,
+ perf interrupts and cascading interrupt-controller handlers.
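+
+ As a sketch, keeping a handler in hard-IRQ context looks as follows (the
+ handler and device names are illustrative)::
+
+   static irqreturn_t my_hardirq_handler(int irq, void *dev_id)
+   {
+           /* Hard-IRQ context even on PREEMPT_RT: no sleeping locks here. */
+           return IRQ_HANDLED;
+   }
+
+   ret = request_irq(irq, my_hardirq_handler, IRQF_NO_THREAD, "my-dev", dev);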
+
+PREEMPTION support
+ Kernel preemption must be supported and requires that
+ CONFIG_ARCH_NO_PREEMPT remain unselected. Scheduling requests, such as those
+ issued from an interrupt or other exception handler, must be processed
+ immediately.
+
+POSIX CPU timers and KVM
+ POSIX CPU timers must expire from thread context rather than directly within
+ the timer interrupt. This behavior is enabled by setting the configuration
+ option CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK.
+ When KVM is enabled, CONFIG_KVM_XFER_TO_GUEST_WORK must also be set to ensure
+ that any pending work, such as POSIX timer expiration, is handled before
+ transitioning into guest mode.
+
+Hard-IRQ and Soft-IRQ stacks
+ Soft interrupts are handled in the thread context in which they are raised. If
+ a soft interrupt is triggered from hard-IRQ context, its execution is deferred
+ to the ksoftirqd thread. Preemption is never disabled during soft interrupt
+ handling, which makes soft interrupts preemptible.
+ If an architecture provides a custom __do_softirq() implementation that uses a
+ separate stack, it must select CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK. The
+ functionality should only be enabled when CONFIG_SOFTIRQ_ON_OWN_STACK is set.
+
+FPU and SIMD access in kernel mode
+ FPU and SIMD registers are typically not used in kernel mode and are therefore
+ not saved during kernel preemption. As a result, any kernel code that uses
+ these registers must be enclosed within a kernel_fpu_begin() and
+ kernel_fpu_end() section.
+ The kernel_fpu_begin() function usually invokes local_bh_disable() to prevent
+ interruptions from softirqs and to disable regular preemption. This allows the
+ protected code to run safely in both thread and softirq contexts.
+ On PREEMPT_RT kernels, however, kernel_fpu_begin() must not call
+ local_bh_disable(). Instead, it should use preempt_disable(), since softirqs
+ are always handled in thread context under PREEMPT_RT. In this case, disabling
+ preemption alone is sufficient.
+ The crypto subsystem operates on memory pages and requires users to "walk and
+ map" these pages while processing a request. This operation must occur outside
+ the kernel_fpu_begin()/kernel_fpu_end() section because it requires preemption
+ to be enabled. These preemption points are generally sufficient to avoid
+ excessive scheduling latency.
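+
+ A minimal sketch of guarding SIMD usage in kernel mode (the x86 API is
+ shown; the processing function is illustrative)::
+
+   #include <asm/fpu/api.h>
+
+   kernel_fpu_begin();
+   /* FPU/SIMD registers may only be used inside this section. */
+   do_simd_processing(data);
+   kernel_fpu_end();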
+
+Exception handlers
+ Exception handlers, such as the page fault handler, typically enable interrupts
+ early, before invoking any generic code to process the exception. This is
+ necessary because handling a page fault may involve operations that can sleep.
+ Enabling interrupts is especially important on PREEMPT_RT, where certain
+ locks, such as spinlock_t, become sleepable. For example, handling an
+ invalid opcode may result in sending a SIGILL signal to the user task. A
+ debug exception will send a SIGTRAP signal.
+ In both cases, if the exception occurred in user space, it is safe to enable
+ interrupts early. Sending a signal requires both interrupts and kernel
+ preemption to be enabled.
+
+Optional features
+-----------------
+
+Timer and clocksource
+ A high-resolution clocksource and clockevents device are recommended. The
+ clockevents device should support the CLOCK_EVT_FEAT_ONESHOT feature for
+ optimal timer behavior. In most cases, microsecond-level accuracy is
+ sufficient.
+
+Lazy preemption
+ This mechanism allows an in-kernel scheduling request for non-real-time tasks
+ to be delayed until the task is about to return to user space. It helps avoid
+ preempting a task that holds a sleeping lock at the time of the scheduling
+ request.
+ With CONFIG_GENERIC_IRQ_ENTRY enabled, supporting this feature requires
+ defining a bit for TIF_NEED_RESCHED_LAZY, preferably near TIF_NEED_RESCHED.
+
+Serial console with NBCON
+ With PREEMPT_RT enabled, all console output is handled by a dedicated thread
+ rather than directly from the context in which printk() is invoked. This design
+ allows printk() to be safely used in atomic contexts.
+ However, this also means that if the kernel crashes and cannot switch to the
+ printing thread, no output will be visible, preventing the system from printing
+ its final messages.
+ There are exceptions for immediate output, such as during panic() handling. To
+ support this, the console driver must implement new-style lock handling. This
+ involves setting the CON_NBCON flag in console::flags and providing
+ implementations for the write_atomic, write_thread, device_lock, and
+ device_unlock callbacks.
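+
+ A hedged sketch of the registration side (the driver name and callbacks are
+ illustrative; their bodies are omitted)::
+
+   static struct console my_nbcon_console = {
+           .name          = "myser",
+           .flags         = CON_PRINTBUFFER | CON_NBCON,
+           .write_atomic  = my_write_atomic,
+           .write_thread  = my_write_thread,
+           .device_lock   = my_device_lock,
+           .device_unlock = my_device_unlock,
+   };
+
+   register_console(&my_nbcon_console);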
diff --git a/Documentation/core-api/real-time/differences.rst b/Documentation/core-api/real-time/differences.rst
new file mode 100644
index 0000000000000..50d994a31e11c
--- /dev/null
+++ b/Documentation/core-api/real-time/differences.rst
@@ -0,0 +1,242 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+Significant differences
+========================
+
+:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
+
+Preface
+=======
+
+With forced-threaded interrupts and sleeping spin locks, code paths that
+previously caused long scheduling latencies have been made preemptible and
+moved into process context. This allows the scheduler to manage them more
+effectively and respond to higher-priority tasks with reduced latency.
+
+The following chapters provide an overview of key differences between a
+PREEMPT_RT kernel and a standard, non-PREEMPT_RT kernel.
+
+Locking
+=======
+
+Spinning locks such as spinlock_t are used to provide synchronization for data
+structures accessed from both interrupt context and process context. For this
+reason, locking functions are also available with the _irq() or _irqsave()
+suffixes, which disable interrupts before acquiring the lock. This ensures that
+the lock can be safely acquired in process context when interrupts are enabled.
+
+However, on a PREEMPT_RT system, interrupts are forced-threaded and no longer
+run in hard IRQ context. As a result, there is no need to disable interrupts as
+part of the locking procedure when using spinlock_t.
+
+For low-level core components such as interrupt handling, the scheduler, or the
+timer subsystem, the kernel uses raw_spinlock_t. This lock type preserves
+traditional semantics: it disables preemption and, when used with _irq() or
+_irqsave(), also disables interrupts. This ensures proper synchronization in
+critical sections that must remain non-preemptible or with interrupts disabled.
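+
+A brief sketch contrasting the two lock types (the locks and the update helpers
+are illustrative)::
+
+  static DEFINE_SPINLOCK(data_lock);     /* sleeping lock on PREEMPT_RT */
+  static DEFINE_RAW_SPINLOCK(hw_lock);   /* spinning lock on all kernels */
+
+  unsigned long flags;
+
+  /* On PREEMPT_RT this may sleep under contention; interrupts stay enabled. */
+  spin_lock_irqsave(&data_lock, flags);
+  update_shared_data();
+  spin_unlock_irqrestore(&data_lock, flags);
+
+  /* Traditional semantics: disables preemption and interrupts everywhere. */
+  raw_spin_lock_irqsave(&hw_lock, flags);
+  program_hardware();
+  raw_spin_unlock_irqrestore(&hw_lock, flags);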
+
+Execution context
+=================
+
+Interrupt handling in a PREEMPT_RT system is invoked in process context through
+the use of threaded interrupts. Other parts of the kernel also shift their
+execution into threaded context by different mechanisms. The goal is to keep
+execution paths preemptible, allowing the scheduler to interrupt them when a
+higher-priority task needs to run.
+
+Below is an overview of the kernel subsystems involved in this transition to
+threaded, preemptible execution.
+
+Interrupt handling
+------------------
+
+All interrupts are forced-threaded in a PREEMPT_RT system. The exceptions are
+interrupts that are requested with the IRQF_NO_THREAD, IRQF_PERCPU, or
+IRQF_ONESHOT flags.
+
+The IRQF_ONESHOT flag is used together with threaded interrupts, meaning those
+registered using request_threaded_irq() and providing only a threaded handler.
+Its purpose is to keep the interrupt line masked until the threaded handler has
+completed.
+
+If a primary handler is also provided in this case, it is essential that the
+handler does not acquire any sleeping locks, as it will not be threaded. The
+handler should be minimal and must avoid introducing delays, such as
+busy-waiting on hardware registers.
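+
+A minimal sketch of a purely threaded handler relying on IRQF_ONESHOT (the
+handler and device names are illustrative)::
+
+  static irqreturn_t my_thread_fn(int irq, void *dev_id)
+  {
+          /* Process context: sleeping locks and longer work are fine. */
+          return IRQ_HANDLED;
+  }
+
+  /* No primary handler: the line stays masked until my_thread_fn() returns. */
+  ret = request_threaded_irq(irq, NULL, my_thread_fn, IRQF_ONESHOT,
+                             "my-dev", dev);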
+
+
+Soft interrupts, bottom half handling
+-------------------------------------
+
+Soft interrupts are raised by the interrupt handler and are executed after the
+handler returns. Since they run in thread context, they can be preempted by
+other threads. Do not assume that softirq context runs with preemption
+disabled. This means you must not rely on mechanisms like local_bh_disable() in
+process context to protect per-CPU variables. Because softirq handlers are
+preemptible under PREEMPT_RT, this approach does not provide reliable
+synchronization.
+
+If this kind of protection is required for performance reasons, consider using
+local_lock_nested_bh(). On non-PREEMPT_RT kernels, this allows lockdep to
+verify that bottom halves are disabled. On PREEMPT_RT systems, it adds the
+necessary locking to ensure proper protection.
+
+Using local_lock_nested_bh() also makes the locking scope explicit and easier
+for readers and maintainers to understand.
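+
+A minimal sketch, assuming a per-CPU statistics structure updated from BH
+context (the structure and field names are illustrative)::
+
+  struct pcpu_stats {
+          local_lock_t bh_lock;
+          u64 packets;
+  };
+  static DEFINE_PER_CPU(struct pcpu_stats, pcpu_stats) = {
+          .bh_lock = INIT_LOCAL_LOCK(bh_lock),
+  };
+
+  /* In softirq/BH context: */
+  local_lock_nested_bh(&pcpu_stats.bh_lock);
+  this_cpu_inc(pcpu_stats.packets);
+  local_unlock_nested_bh(&pcpu_stats.bh_lock);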
+
+
+per-CPU variables
+-----------------
+
+Protecting access to per-CPU variables solely by using preempt_disable() should
+be avoided, especially if the critical section has unbounded runtime or may
+call APIs that can sleep.
+
+If using a spinlock_t is considered too costly for performance reasons,
+consider using local_lock_t. On non-PREEMPT_RT configurations, this introduces
+no runtime overhead when lockdep is disabled. With lockdep enabled, it verifies
+that the lock is only acquired in process context and never from softirq or
+hard IRQ context.
+
+On a PREEMPT_RT kernel, local_lock_t is implemented using a per-CPU spinlock_t,
+which provides safe local protection for per-CPU data while keeping the system
+preemptible.
+
+Because spinlock_t on PREEMPT_RT does not disable preemption, it cannot be used
+to protect per-CPU data by relying on implicit preemption disabling. If this
+inherited preemption disabling is essential and if local_lock_t cannot be used
+due to performance constraints, brevity of the code, or abstraction boundaries
+within an API, then preempt_disable_nested() may be a suitable alternative. On
+non-PREEMPT_RT kernels, it verifies with lockdep that preemption is already
+disabled. On PREEMPT_RT, it explicitly disables preemption.
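+
+A minimal sketch of protecting per-CPU data with local_lock_t (the cache
+structure and the refill helper are illustrative)::
+
+  struct pcpu_cache {
+          local_lock_t lock;
+          struct page *page;
+  };
+  static DEFINE_PER_CPU(struct pcpu_cache, pcpu_cache) = {
+          .lock = INIT_LOCAL_LOCK(lock),
+  };
+
+  local_lock(&pcpu_cache.lock);
+  /* On PREEMPT_RT a per-CPU spinlock_t is held here instead. */
+  refill_cache(this_cpu_ptr(&pcpu_cache));
+  local_unlock(&pcpu_cache.lock);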
+
+Timers
+------
+
+By default, an hrtimer is executed in hard interrupt context. The exception is
+timers initialized with the HRTIMER_MODE_SOFT flag, which are executed in
+softirq context.
+
+On a PREEMPT_RT kernel, this behavior is reversed: hrtimers are executed in
+softirq context by default, typically within the ktimersd thread. This thread
+runs at the lowest real-time priority, ensuring it executes before any
+SCHED_OTHER tasks but does not interfere with higher-priority real-time
+threads. To explicitly request execution in hard interrupt context on
+PREEMPT_RT, the timer must be marked with the HRTIMER_MODE_HARD flag.
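+
+A minimal sketch of explicitly requesting hard-IRQ expiry (the timer and its
+callback are illustrative; newer kernels can use hrtimer_setup() instead)::
+
+  static struct hrtimer my_timer;
+
+  static enum hrtimer_restart my_timer_fn(struct hrtimer *t)
+  {
+          /* Runs in hard interrupt context even on PREEMPT_RT. */
+          return HRTIMER_NORESTART;
+  }
+
+  hrtimer_init(&my_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+  my_timer.function = my_timer_fn;
+  hrtimer_start(&my_timer, ms_to_ktime(10), HRTIMER_MODE_REL_HARD);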
+
+Memory allocation
+-----------------
+
+The memory allocation APIs, such as kmalloc() and alloc_pages(), require a
+gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it is
+necessary to use GFP_ATOMIC when allocating memory from interrupt context or
+from sections where preemption is disabled. This is because the allocator must
+not sleep in these contexts waiting for memory to become available.
+
+However, this approach does not work on PREEMPT_RT kernels. The memory
+allocator in PREEMPT_RT uses sleeping locks internally, which cannot be
+acquired when preemption is disabled. Fortunately, this is generally not a
+problem, because PREEMPT_RT moves most contexts that would traditionally run
+with preemption or interrupts disabled into threaded context, where sleeping is
+allowed.
+
+What remains problematic is code that explicitly disables preemption or
+interrupts. In such cases, memory allocation must be performed outside the
+critical section.
+
+This restriction also applies to memory deallocation routines such as kfree()
+and free_pages(), which may also involve internal locking and must not be
+called from non-preemptible contexts.
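+
+A sketch of moving the allocation out of the non-preemptible region (the list,
+lock, and item type are illustrative)::
+
+  struct item *item;
+  unsigned long flags;
+
+  /* Allocate while sleeping is still allowed. */
+  item = kmalloc(sizeof(*item), GFP_KERNEL);
+  if (!item)
+          return -ENOMEM;
+
+  raw_spin_lock_irqsave(&item_lock, flags);
+  list_add(&item->node, &item_list);
+  raw_spin_unlock_irqrestore(&item_lock, flags);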
+
+IRQ work
+--------
+
+The irq_work API provides a mechanism to schedule a callback in interrupt
+context. It is designed for use in contexts where traditional scheduling is not
+possible, such as from within NMI handlers or from inside the scheduler, where
+using a workqueue would be unsafe.
+
+On non-PREEMPT_RT systems, all irq_work items are executed immediately in
+interrupt context. Items marked with IRQ_WORK_LAZY are deferred until the next
+timer tick but are still executed in interrupt context.
+
+On PREEMPT_RT systems, the execution model changes. Because irq_work callbacks
+may acquire sleeping locks or have unbounded execution time, they are handled
+in thread context by a per-CPU irq_work kernel thread. This thread runs at the
+lowest real-time priority, ensuring it executes before any SCHED_OTHER tasks
+but does not interfere with higher-priority real-time threads.
+
+The exceptions are work items marked with IRQ_WORK_HARD_IRQ, which are still
+executed in hard interrupt context. Lazy items (IRQ_WORK_LAZY) continue to be
+deferred until the next timer tick and are likewise executed by the per-CPU
+irq_work/ kernel thread.
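+
+A minimal sketch of queueing irq_work (the callback is illustrative; the
+IRQ_WORK_INIT_HARD() initializer would keep it in hard-IRQ context)::
+
+  static void my_irq_work_fn(struct irq_work *work)
+  {
+          /* On PREEMPT_RT this runs in the per-CPU irq_work thread. */
+  }
+
+  static struct irq_work my_work = IRQ_WORK_INIT(my_irq_work_fn);
+
+  /* Safe from NMI and other contexts where scheduling is impossible. */
+  irq_work_queue(&my_work);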
+
+RCU callbacks
+-------------
+
+RCU callbacks are invoked by default in softirq context. Their execution is
+important because, depending on the use case, they either free memory or ensure
+progress in state transitions. Running these callbacks as part of the softirq
+chain can lead to undesired situations, such as contention for CPU resources
+with other SCHED_OTHER tasks when executed within ksoftirqd.
+
+To avoid running callbacks in softirq context, the RCU subsystem provides a
+mechanism to execute them in process context instead. This behavior can be
+enabled by setting the boot command-line parameter rcutree.use_softirq=0. This
+setting is enforced in kernels configured with PREEMPT_RT.
+
+Spin until ready
+================
+
+The "spin until ready" pattern involves repeatedly checking (spinning on) the
+state of a data structure until it becomes available. This pattern assumes that
+preemption, soft interrupts, or interrupts are disabled. If the data structure
+is marked busy, it is presumed to be in use by another CPU, and spinning should
+eventually succeed as that CPU makes progress.
+
+Some examples are hrtimer_cancel() or timer_delete_sync(). These functions
+cancel timers that execute with interrupts or soft interrupts disabled. If a
+thread attempts to cancel a timer and finds it active, spinning until the
+callback completes is safe because the callback can only run on another CPU and
+will eventually finish.
+
+On PREEMPT_RT kernels, however, timer callbacks run in thread context. This
+introduces a challenge: a higher-priority thread attempting to cancel the timer
+may preempt the timer callback thread. Since the scheduler cannot migrate the
+callback thread to another CPU due to affinity constraints, spinning can result
+in livelock even on multiprocessor systems.
+
+To avoid this, both the canceling and callback sides must use a handshake
+mechanism that supports priority inheritance. This allows the canceling thread
+to suspend until the callback completes, ensuring forward progress without
+risking livelock.
+
+To solve this problem at the API level, sequence locks were extended to allow
+a proper handover between the spinning reader and the possibly blocked writer.
+
+Sequence locks
+--------------
+
+Sequence counters and sequential locks are documented in
+Documentation/locking/seqlock.rst.
+
+The interface has been extended to ensure proper preemption states for the
+writer and spinning reader contexts. This is achieved by embedding the writer
+serialization lock directly into the sequence counter type, resulting in
+composite types such as seqcount_spinlock_t or seqcount_mutex_t.
+
+These composite types allow readers to detect an ongoing write and actively
+boost the writer’s priority to help it complete its update instead of spinning
+and waiting for its completion.
+
+If the plain seqcount_t is used, extra care must be taken to synchronize the
+reader with the writer during updates. The writer must ensure its update is
+serialized and non-preemptible relative to the reader. This cannot be achieved
+using a regular spinlock_t because spinlock_t on PREEMPT_RT does not disable
+preemption. In such cases, using seqcount_spinlock_t is the preferred solution.
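+
+A minimal sketch of a seqcount_spinlock_t protected update (the data and the
+update/read helpers are illustrative)::
+
+  static DEFINE_SPINLOCK(data_lock);
+  static seqcount_spinlock_t data_seq =
+          SEQCNT_SPINLOCK_ZERO(data_seq, &data_lock);
+
+  /* Writer: the associated lock serializes updates and enables boosting. */
+  spin_lock(&data_lock);
+  write_seqcount_begin(&data_seq);
+  update_data();
+  write_seqcount_end(&data_seq);
+  spin_unlock(&data_lock);
+
+  /* Reader: retries if an update was in progress. */
+  unsigned int seq;
+
+  do {
+          seq = read_seqcount_begin(&data_seq);
+          read_data();
+  } while (read_seqcount_retry(&data_seq, seq));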
+
+However, if there is no spinning involved, i.e., if the reader only needs to
+detect whether a write has started and not serialize against it, then using
+seqcount_t is reasonable.
diff --git a/Documentation/core-api/real-time/index.rst b/Documentation/core-api/real-time/index.rst
new file mode 100644
index 0000000000000..7e14c4ea3d592
--- /dev/null
+++ b/Documentation/core-api/real-time/index.rst
@@ -0,0 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Real-time preemption
+=====================
+
+This documentation is intended for Linux kernel developers and contributors
+interested in the inner workings of PREEMPT_RT. It explains key concepts and
+the required changes compared to a non-PREEMPT_RT configuration.
+
+.. toctree::
+ :maxdepth: 2
+
+ theory
+ differences
+ architecture-porting
diff --git a/Documentation/core-api/real-time/theory.rst b/Documentation/core-api/real-time/theory.rst
new file mode 100644
index 0000000000000..43d0120737f87
--- /dev/null
+++ b/Documentation/core-api/real-time/theory.rst
@@ -0,0 +1,116 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Theory of operation
+=====================
+
+:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
+
+Preface
+=======
+
+PREEMPT_RT transforms the Linux kernel into a real-time kernel. It achieves
+this by replacing locking primitives, such as spinlock_t, with a preemptible
+and priority-inheritance aware implementation known as rtmutex, and by enforcing
+the use of threaded interrupts. As a result, the kernel becomes fully
+preemptible, with the exception of a few critical code paths, including entry
+code, the scheduler, and low-level interrupt handling routines.
+
+This transformation places the majority of kernel execution contexts under the
+control of the scheduler and significantly increases the number of preemption
+points. Consequently, it reduces the latency between a high-priority task
+becoming runnable and its actual execution on the CPU.
+
+Scheduling
+==========
+
+The core principles of Linux scheduling and the associated user-space API are
+documented in the `sched(7) <https://man7.org/linux/man-pages/man7/sched.7.html>`_
+man page.
+By default, the Linux kernel uses the SCHED_OTHER scheduling policy. Under
+this policy, a task is preempted when the scheduler determines that it has
+consumed a fair share of CPU time relative to other runnable tasks. However,
+the policy does not guarantee immediate preemption when a new SCHED_OTHER task
+becomes runnable. The currently running task may continue executing.
+
+This behavior differs from that of real-time scheduling policies such as
+SCHED_FIFO. When a task with a real-time policy becomes runnable, the
+scheduler immediately selects it for execution if it has a higher priority than
+the currently running task. The task continues to run until it voluntarily
+yields the CPU, typically by blocking on an event.
+
+Sleeping spin locks
+===================
+
+The various lock types and their behavior under real-time configurations are
+described in detail in Documentation/locking/locktypes.rst.
+In a non-PREEMPT_RT configuration, a spinlock_t is acquired by first disabling
+preemption and then actively spinning until the lock becomes available. Once
+the lock is released, preemption is enabled. From a real-time perspective,
+this approach is undesirable because disabling preemption prevents the
+scheduler from switching to a higher-priority task, potentially increasing
+latency.
+
+To address this, PREEMPT_RT replaces spinning locks with sleeping spin locks
+that do not disable preemption. On PREEMPT_RT, spinlock_t is implemented using
+rtmutex. Instead of spinning, a task attempting to acquire a contended lock
+disables CPU migration, donates its priority to the lock owner (priority
+inheritance), and voluntarily schedules out while waiting for the lock to
+become available.
+
+Disabling CPU migration provides the same effect as disabling preemption, while
+still allowing preemption and ensuring that the task continues to run on the
+same CPU while holding a sleeping lock.
+
+Priority inheritance
+====================
+
+Lock types such as spinlock_t and mutex_t in a PREEMPT_RT enabled kernel are
+implemented on top of rtmutex, which provides support for priority inheritance
+(PI). When a task blocks on such a lock, the PI mechanism temporarily
+propagates the blocked task’s scheduling parameters to the lock owner.
+
+For example, if a SCHED_FIFO task A blocks on a lock currently held by a
+SCHED_OTHER task B, task A’s scheduling policy and priority are temporarily
+inherited by task B. After this inheritance, task A is put to sleep while
+waiting for the lock, and task B effectively becomes the highest-priority task
+in the system. This allows B to continue executing, make progress, and
+eventually release the lock.
+
+Once B releases the lock, it reverts to its original scheduling parameters, and
+task A can resume execution.
+
+Threaded interrupts
+===================
+
+Interrupt handlers are another source of code that executes with preemption
+disabled and outside the control of the scheduler. To bring interrupt handling
+under scheduler control, PREEMPT_RT enforces threaded interrupt handlers.
+
+With forced threading, interrupt handling is split into two stages. The first
+stage, the primary handler, is executed in IRQ context with interrupts disabled.
+Its sole responsibility is to wake the associated threaded handler. The second
+stage, the threaded handler, is the function passed to request_irq() as the
+interrupt handler. It runs in process context, scheduled by the kernel.
+
+From waking the interrupt thread until threaded handling is completed, the
+interrupt source is masked in the interrupt controller. This ensures that the
+device interrupt remains pending but does not retrigger the CPU, allowing the
+system to exit IRQ context and handle the interrupt in a scheduled thread.
+
+By default, the threaded handler executes with the SCHED_FIFO scheduling policy
+and a priority of 50 (MAX_RT_PRIO / 2), which is midway between the minimum and
+maximum real-time priorities.
+
+If the threaded interrupt handler raises any soft interrupts during its
+execution, those soft interrupt routines are invoked after the threaded handler
+completes, within the same thread. Preemption remains enabled during the
+execution of the soft interrupt handler.
+
+Summary
+=======
+
+By using sleeping locks and forced-threaded interrupts, PREEMPT_RT
+significantly reduces sections of code where interrupts or preemption is
+disabled, allowing the scheduler to preempt the current execution context and
+switch to a higher-priority task.
--
2.50.1