public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] sched: Further restrict the preemption modes
@ 2025-12-19 10:15 Peter Zijlstra
  2026-01-06 15:23 ` Valentin Schneider
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-12-19 10:15 UTC (permalink / raw)
  To: mingo, Thomas Gleixner, Sebastian Andrzej Siewior
  Cc: peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, bigeasy, clrkwllms, linux-kernel,
	linux-rt-devel, Linus Torvalds


[ with 6.18 being an LTS release, it might be a good time for this ]

The introduction of PREEMPT_LAZY was for multiple reasons:

  - PREEMPT_RT suffered from over-scheduling, hurting performance compared to
    !PREEMPT_RT.

  - the introduction of (more) features that rely on preemption; like
    folio_zero_user() which can do large memset() without preemption checks.

    (Xen already had a horrible hack to deal with long running hypercalls)

  - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
    cult or in response to hard-to-replicate workloads.

By moving to a model that is fundamentally preemptible, these things become
manageable and we avoid introducing more horrible hacks.

Since this is a requirement, limit PREEMPT_NONE to architectures that do not
support preemption at all. Further limit PREEMPT_VOLUNTARY to those
architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
to make this the empty set and completely remove voluntary preemption and
cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)

This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86) with only two preemption models: full and lazy (like PREEMPT_RT).

While Lazy has been the recommended setting for a while, not all distributions
have managed to make the switch yet. Force things along. Keep the patch minimal
in case hard-to-address regressions pop up.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/Kconfig.preempt |    3 +++
 kernel/sched/core.c    |    2 +-
 kernel/sched/debug.c   |    2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -16,11 +16,13 @@ config ARCH_HAS_PREEMPT_LAZY
 
 choice
 	prompt "Preemption Model"
+	default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
 	default PREEMPT_NONE
 
 config PREEMPT_NONE
 	bool "No Forced Preemption (Server)"
 	depends on !PREEMPT_RT
+	depends on ARCH_NO_PREEMPT
 	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
 	help
 	  This is the traditional Linux preemption model, geared towards
@@ -35,6 +37,7 @@ config PREEMPT_NONE
 
 config PREEMPT_VOLUNTARY
 	bool "Voluntary Kernel Preemption (Desktop)"
+	depends on !ARCH_HAS_PREEMPT_LAZY
 	depends on !ARCH_NO_PREEMPT
 	depends on !PREEMPT_RT
 	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7553,7 +7553,7 @@ int preempt_dynamic_mode = preempt_dynam
 
 int sched_dynamic_mode(const char *str)
 {
-# ifndef CONFIG_PREEMPT_RT
+# if !(defined(CONFIG_PREEMPT_RT) || defined(CONFIG_ARCH_HAS_PREEMPT_LAZY))
 	if (!strcmp(str, "none"))
 		return preempt_dynamic_none;
 
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -243,7 +243,7 @@ static ssize_t sched_dynamic_write(struc
 
 static int sched_dynamic_show(struct seq_file *m, void *v)
 {
-	int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2;
+	int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2;
 	int j;
 
 	/* Count entries in NULL terminated preempt_modes */

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2025-12-19 10:15 [PATCH] sched: Further restrict the preemption modes Peter Zijlstra
@ 2026-01-06 15:23 ` Valentin Schneider
  2026-01-06 16:40 ` Steven Rostedt
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Valentin Schneider @ 2026-01-06 15:23 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, Thomas Gleixner, Sebastian Andrzej Siewior
  Cc: peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bigeasy, clrkwllms, linux-kernel,
	linux-rt-devel, Linus Torvalds

On 19/12/25 11:15, Peter Zijlstra wrote:
> [ with 6.18 being an LTS release, it might be a good time for this ]
>
> The introduction of PREEMPT_LAZY was for multiple reasons:
>
>   - PREEMPT_RT suffered from over-scheduling, hurting performance compared to
>     !PREEMPT_RT.
>
>   - the introduction of (more) features that rely on preemption; like
>     folio_zero_user() which can do large memset() without preemption checks.
>
>     (Xen already had a horrible hack to deal with long running hypercalls)
>
>   - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
>     cult or in response to poor to replicate workloads.
>
> By moving to a model that is fundamentally preemptable these things become
> manageable and avoid needing to introduce more horrible hacks.
>
> Since this is a requirement; limit PREEMPT_NONE to architectures that do not
> support preemption at all. Further limit PREEMPT_VOLUNTARY to those
> architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
> to make this the empty set and completely remove voluntary preemption and
> cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)
>
> This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
> x86) with only two preemption models: full and lazy (like PREEMPT_RT).
>
> While Lazy has been the recommended setting for a while, not all distributions
> have managed to make the switch yet. Force things along. Keep the patch minimal
> in case of hard to address regressions that might pop up.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Valentin Schneider <vschneid@redhat.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2025-12-19 10:15 [PATCH] sched: Further restrict the preemption modes Peter Zijlstra
  2026-01-06 15:23 ` Valentin Schneider
@ 2026-01-06 16:40 ` Steven Rostedt
  2026-01-09 11:23 ` Shrikanth Hegde
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2026-01-06 16:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, Thomas Gleixner, Sebastian Andrzej Siewior, juri.lelli,
	vincent.guittot, dietmar.eggemann, bsegall, mgorman, vschneid,
	clrkwllms, linux-kernel, linux-rt-devel, Linus Torvalds

On Fri, 19 Dec 2025 11:15:02 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -16,11 +16,13 @@ config ARCH_HAS_PREEMPT_LAZY
>  
>  choice
>  	prompt "Preemption Model"
> +	default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
>  	default PREEMPT_NONE

I think you can just make this:

	default PREEMPT_LAZY

and remove the PREEMPT_NONE.

As PREEMPT_NONE now depends on ARCH_NO_PREEMPT and all the other options
depend on !ARCH_NO_PREEMPT, the default will be PREEMPT_LAZY whenever it
is available, and PREEMPT_NONE will only be chosen when it is the sole
visible option.

I added default PREEMPT_LAZY and did a:

   $ mkdir /tmp/build
   $ make O=/tmp/build ARCH=alpha defconfig

And the result is:

CONFIG_PREEMPT_NONE_BUILD=y
CONFIG_PREEMPT_NONE=y
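The Kconfig behaviour at work can be sketched with a hypothetical fragment
(symbol names invented for illustration): a choice `default` naming a symbol
that is not currently visible is ignored, and when no listed default is
visible the first visible entry of the choice is selected:

```kconfig
# Hypothetical fragment, not from the kernel tree: with ARCH_FOO=n, B is
# both visible and the default; with ARCH_FOO=y only A is visible, so the
# "default B" line is ignored and A, the first visible entry, is chosen.
choice
	prompt "Example choice"
	default B

config A
	bool "Option A"
	depends on ARCH_FOO

config B
	bool "Option B"
	depends on !ARCH_FOO

endchoice
```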

-- Steve


>  
>  config PREEMPT_NONE
>  	bool "No Forced Preemption (Server)"
>  	depends on !PREEMPT_RT
> +	depends on ARCH_NO_PREEMPT
>  	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
>  	help
>  	  This is the traditional Linux preemption model, geared towards
> @@ -35,6 +37,7 @@ config PREEMPT_NONE
>  
>  config PREEMPT_VOLUNTARY
>  	bool "Voluntary Kernel Preemption (Desktop)"
> +	depends on !ARCH_HAS_PREEMPT_LAZY
>  	depends on !ARCH_NO_PREEMPT
>  	depends on !PREEMPT_RT
>  	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2025-12-19 10:15 [PATCH] sched: Further restrict the preemption modes Peter Zijlstra
  2026-01-06 15:23 ` Valentin Schneider
  2026-01-06 16:40 ` Steven Rostedt
@ 2026-01-09 11:23 ` Shrikanth Hegde
  2026-02-25 10:53   ` Peter Zijlstra
  2026-01-12  8:03 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2026-02-24 15:45 ` [PATCH] " Ciunas Bennett
  4 siblings, 1 reply; 22+ messages in thread
From: Shrikanth Hegde @ 2026-01-09 11:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clrkwllms, linux-kernel, linux-rt-devel,
	Linus Torvalds, mingo, Thomas Gleixner, Sebastian Andrzej Siewior

Hi Peter.

On 12/19/25 3:45 PM, Peter Zijlstra wrote:
> 
> [ with 6.18 being an LTS release, it might be a good time for this ]
> 
> The introduction of PREEMPT_LAZY was for multiple reasons:
> 
>    - PREEMPT_RT suffered from over-scheduling, hurting performance compared to
>      !PREEMPT_RT.
> 
>    - the introduction of (more) features that rely on preemption; like
>      folio_zero_user() which can do large memset() without preemption checks.
> 
>      (Xen already had a horrible hack to deal with long running hypercalls)
> 
>    - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
>      cult or in response to poor to replicate workloads.
> 
> By moving to a model that is fundamentally preemptable these things become
> manageable and avoid needing to introduce more horrible hacks.
> 
> Since this is a requirement; limit PREEMPT_NONE to architectures that do not
> support preemption at all. Further limit PREEMPT_VOLUNTARY to those
> architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
> to make this the empty set and completely remove voluntary preemption and
> cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)
> 
> This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
> x86) with only two preemption models: full and lazy (like PREEMPT_RT).
> 
> While Lazy has been the recommended setting for a while, not all distributions
> have managed to make the switch yet. Force things along. Keep the patch minimal
> in case of hard to address regressions that might pop up.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   kernel/Kconfig.preempt |    3 +++
>   kernel/sched/core.c    |    2 +-
>   kernel/sched/debug.c   |    2 +-
>   3 files changed, 5 insertions(+), 2 deletions(-)
> 
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -16,11 +16,13 @@ config ARCH_HAS_PREEMPT_LAZY
>   
>   choice
>   	prompt "Preemption Model"
> +	default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
>   	default PREEMPT_NONE
>   
>   config PREEMPT_NONE
>   	bool "No Forced Preemption (Server)"
>   	depends on !PREEMPT_RT
> +	depends on ARCH_NO_PREEMPT
>   	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
>   	help
>   	  This is the traditional Linux preemption model, geared towards
> @@ -35,6 +37,7 @@ config PREEMPT_NONE
>   
>   config PREEMPT_VOLUNTARY
>   	bool "Voluntary Kernel Preemption (Desktop)"
> +	depends on !ARCH_HAS_PREEMPT_LAZY
>   	depends on !ARCH_NO_PREEMPT
>   	depends on !PREEMPT_RT
>   	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7553,7 +7553,7 @@ int preempt_dynamic_mode = preempt_dynam
>   
>   int sched_dynamic_mode(const char *str)
>   {
> -# ifndef CONFIG_PREEMPT_RT
> +# if !(defined(CONFIG_PREEMPT_RT) || defined(CONFIG_ARCH_HAS_PREEMPT_LAZY))
>   	if (!strcmp(str, "none"))
>   		return preempt_dynamic_none;
>   
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -243,7 +243,7 @@ static ssize_t sched_dynamic_write(struc
>   
>   static int sched_dynamic_show(struct seq_file *m, void *v)
>   {
> -	int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2;
> +	int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2;
>   	int j;
>   
>   	/* Count entries in NULL terminated preempt_modes */

Maybe only change the default to LAZY, but keep the other options possible
via dynamic update?

- When the kernel changes to lazy being the default, the scheduling
pattern can change and it may affect workloads. Having the ability to
dynamically change to none/voluntary could help one figure out where
it is regressing. We could document cases where regression is expected.

- With preempt=full/lazy we will likely never see softlockups. How are
we going to find longer kernel paths (some maybe by design, some may be
bugs) apart from observing workload regressions?


Also, is the softlockup code of any use in preempt=full/lazy?



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tip: sched/core] sched: Further restrict the preemption modes
  2025-12-19 10:15 [PATCH] sched: Further restrict the preemption modes Peter Zijlstra
                   ` (2 preceding siblings ...)
  2026-01-09 11:23 ` Shrikanth Hegde
@ 2026-01-12  8:03 ` tip-bot2 for Peter Zijlstra
  2026-02-24 15:45 ` [PATCH] " Ciunas Bennett
  4 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-01-12  8:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     7dadeaa6e851e7d67733f3e24fc53ee107781d0f
Gitweb:        https://git.kernel.org/tip/7dadeaa6e851e7d67733f3e24fc53ee107781d0f
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Thu, 18 Dec 2025 15:25:10 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 08 Jan 2026 12:43:57 +01:00

sched: Further restrict the preemption modes

The introduction of PREEMPT_LAZY was for multiple reasons:

  - PREEMPT_RT suffered from over-scheduling, hurting performance compared to
    !PREEMPT_RT.

  - the introduction of (more) features that rely on preemption; like
    folio_zero_user() which can do large memset() without preemption checks.

    (Xen already had a horrible hack to deal with long running hypercalls)

  - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
    cult or in response to hard-to-replicate workloads.

By moving to a model that is fundamentally preemptible, these things become
manageable and we avoid introducing more horrible hacks.

Since this is a requirement, limit PREEMPT_NONE to architectures that do not
support preemption at all. Further limit PREEMPT_VOLUNTARY to those
architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
to make this the empty set and completely remove voluntary preemption and
cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)

This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86) with only two preemption models: full and lazy.

While Lazy has been the recommended setting for a while, not all distributions
have managed to make the switch yet. Force things along. Keep the patch minimal
in case hard-to-address regressions pop up.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://patch.msgid.link/20251219101502.GB1132199@noisy.programming.kicks-ass.net
---
 kernel/Kconfig.preempt | 3 +++
 kernel/sched/core.c    | 2 +-
 kernel/sched/debug.c   | 2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index da32680..88c594c 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -16,11 +16,13 @@ config ARCH_HAS_PREEMPT_LAZY
 
 choice
 	prompt "Preemption Model"
+	default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
 	default PREEMPT_NONE
 
 config PREEMPT_NONE
 	bool "No Forced Preemption (Server)"
 	depends on !PREEMPT_RT
+	depends on ARCH_NO_PREEMPT
 	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
 	help
 	  This is the traditional Linux preemption model, geared towards
@@ -35,6 +37,7 @@ config PREEMPT_NONE
 
 config PREEMPT_VOLUNTARY
 	bool "Voluntary Kernel Preemption (Desktop)"
+	depends on !ARCH_HAS_PREEMPT_LAZY
 	depends on !ARCH_NO_PREEMPT
 	depends on !PREEMPT_RT
 	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b17d8e..fa72075 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7553,7 +7553,7 @@ int preempt_dynamic_mode = preempt_dynamic_undefined;
 
 int sched_dynamic_mode(const char *str)
 {
-# ifndef CONFIG_PREEMPT_RT
+# if !(defined(CONFIG_PREEMPT_RT) || defined(CONFIG_ARCH_HAS_PREEMPT_LAZY))
 	if (!strcmp(str, "none"))
 		return preempt_dynamic_none;
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22..5f9b771 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -243,7 +243,7 @@ static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
 
 static int sched_dynamic_show(struct seq_file *m, void *v)
 {
-	int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2;
+	int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2;
 	int j;
 
 	/* Count entries in NULL terminated preempt_modes */

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2025-12-19 10:15 [PATCH] sched: Further restrict the preemption modes Peter Zijlstra
                   ` (3 preceding siblings ...)
  2026-01-12  8:03 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2026-02-24 15:45 ` Ciunas Bennett
  2026-02-24 17:11   ` Sebastian Andrzej Siewior
  2026-02-25  2:30   ` Ilya Leoshkevich
  4 siblings, 2 replies; 22+ messages in thread
From: Ciunas Bennett @ 2026-02-24 15:45 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, Thomas Gleixner, Sebastian Andrzej Siewior
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clrkwllms, linux-kernel, linux-rt-devel,
	Linus Torvalds, linux-s390



On 19/12/2025 10:15, Peter Zijlstra wrote:


Hi Peter,
We are observing a performance regression on s390 since enabling PREEMPT_LAZY.

Test Environment

Architecture: s390

Setup:
- Single KVM host running two identical guests
- Guests are connected virtually via Open vSwitch
- Workload: uperf streaming read test with 50 parallel connections
- One guest acts as the uperf client, the other as the server

Open vSwitch configuration:
- OVS bridge with two ports
- Guests attached via virtio-net
- Each guest configured with 4 vhost queues

Problem Description

When comparing PREEMPT_LAZY against full PREEMPT, we see a substantial drop in
throughput, on some systems up to 50%.

Observed Behaviour

By tracing packets inside Open vSwitch (ovs_do_execute_action), we see:
- Packet drops
- Retransmissions
- Reductions in packet size (from 64K down to 32K)

Capturing traffic inside the VM and inspecting it in Wireshark shows the following TCP‑level differences between PREEMPT_FULL and PREEMPT_LAZY:
|--------------------------------------+--------------+--------------+------------------|
| Wireshark Warning / Note             | PREEMPT_FULL | PREEMPT_LAZY | (lazy vs full)   |
|--------------------------------------+--------------+--------------+------------------|
| D-SACK Sequence                      |          309 |         2603 | ×8.4             |
| Partial Acknowledgement of a segment |           54 |          279 | ×5.2             |
| Ambiguous ACK (Karn)                 |           32 |          747 | ×23              |
| (Suspected) spurious retransmission  |          205 |          857 | ×4.2             |
| (Suspected) fast retransmission      |           54 |         1622 | ×30              |
| Duplicate ACK                        |          504 |         3446 | ×6.8             |
| Packet length exceeds MSS (TSO/GRO)  |        13172 |        34790 | ×2.6             |
| Previous segment(s) not captured     |         9205 |         6730 | -27%             |
| ACKed segment that wasn't captured   |         7022 |         8272 | +18%             |
| (Suspected) out-of-order segment     |          436 |          303 | -31%             |
|--------------------------------------+--------------+--------------+------------------|
This pattern indicates reordering, loss, or scheduling‑related delays, but it is still unclear why PREEMPT_LAZY is causing this behaviour in this workload.

Additional observations:

- Monitoring the guest CPU run time shows that it drops from 16% with
  PREEMPT_FULL to 9% with PREEMPT_LAZY.

- The workload is dominated by voluntary preemption (schedule()), and
  PREEMPT_LAZY is, as far as I understand, mainly concerned with forced
  preemption. It is therefore not obvious why PREEMPT_LAZY has an impact
  here.

- Changing the guest configuration to disable mergeable RX buffers:
      <host mrg_rxbuf="off"/>
  had a clear effect on throughput:
  PREEMPT_LAZY: throughput improved from 40 Gb/s → 60 Gb/s


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-24 15:45 ` [PATCH] " Ciunas Bennett
@ 2026-02-24 17:11   ` Sebastian Andrzej Siewior
  2026-02-25  9:56     ` Ciunas Bennett
  2026-02-25  2:30   ` Ilya Leoshkevich
  1 sibling, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-24 17:11 UTC (permalink / raw)
  To: Ciunas Bennett
  Cc: Peter Zijlstra, mingo, Thomas Gleixner, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, clrkwllms, linux-kernel, linux-rt-devel, Linus Torvalds,
	linux-s390

On 2026-02-24 15:45:39 [+0000], Ciunas Bennett wrote:
> Monitoring the guest CPU run time shows that it drops from 16% with
> PREEMPT_FULL to 9% with PREEMPT_LAZY.
> 
> The workload is dominated by voluntary preemption (schedule()), and
> PREEMPT_LAZY is, as far as I understand, mainly concerned with forced
> preemption.
> It is therefore not obvious why PREEMPT_LAZY has an impact here.

PREEMPT_FULL schedules immediately if there is a preemption request,
either due to a wake-up of a task or because the time slice is used up
(while in the kernel).
PREEMPT_LAZY delays the preemption request caused by the scheduling
event, either until the task returns to userland or until the next HZ
tick.

A voluntary schedule() invocation shouldn't be affected by FULL -> LAZY,
but I guess FULL scheduled more often after a wake-up, which worked in
this workload's favour.

> Changing guest configuration to disable mergeable RX buffers:
>       <host mrg_rxbuf="off"/>
>       had a clear effect on throughput:
>       PREEMPT_LAZY: throughput improved from 40 Gb/s → 60 Gb/s
> 

Does this bring the workload/test to the PREEMPT_FULL level?

Sebastian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-24 15:45 ` [PATCH] " Ciunas Bennett
  2026-02-24 17:11   ` Sebastian Andrzej Siewior
@ 2026-02-25  2:30   ` Ilya Leoshkevich
  2026-02-25 16:33     ` Christian Borntraeger
  1 sibling, 1 reply; 22+ messages in thread
From: Ilya Leoshkevich @ 2026-02-25  2:30 UTC (permalink / raw)
  To: Ciunas Bennett, Peter Zijlstra, mingo, Thomas Gleixner,
	Sebastian Andrzej Siewior, Christian Borntraeger
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clrkwllms, linux-kernel, linux-rt-devel,
	Linus Torvalds, linux-s390

On 2/24/26 16:45, Ciunas Bennett wrote:
>
>
> On 19/12/2025 10:15, Peter Zijlstra wrote:
>
>
> Hi Peter,
> We are observing a performance regression on s390 since enabling 
> PREEMPT_LAZY.
> Test Environment
> Architecture: s390
> Setup:
>
> Single KVM host running two identical guests
> Guests are connected virtually via Open vSwitch
> Workload: uperf streaming read test with 50 parallel connections
> One guest acts as the uperf client, the other as the server
>
> Open vSwitch configuration:
>
> OVS bridge with two ports
> Guests attached via virtio‑net
> Each guest configured with 4 vhost‑queues
>
> Problem Description
> When comparing PREEMPT_LAZY against full PREEMPT, we see a substantial 
> drop in throughput—on some systems up to 50%.
>
> Observed Behaviour
> By tracing packets inside Open vSwitch (ovs_do_execute_action), we see:
> Packet drops
> Retransmissions
> Reductions in packet size (from 64K down to 32K)
>
> Capturing traffic inside the VM and inspecting it in Wireshark shows 
> the following TCP‑level differences between PREEMPT_FULL and 
> PREEMPT_LAZY:
> |--------------------------------------+--------------+--------------+------------------|
> | Wireshark Warning / Note             | PREEMPT_FULL | PREEMPT_LAZY | (lazy vs full)   |
> |--------------------------------------+--------------+--------------+------------------|
> | D-SACK Sequence                      |          309 |         2603 | ×8.4             |
> | Partial Acknowledgement of a segment |           54 |          279 | ×5.2             |
> | Ambiguous ACK (Karn)                 |           32 |          747 | ×23              |
> | (Suspected) spurious retransmission  |          205 |          857 | ×4.2             |
> | (Suspected) fast retransmission      |           54 |         1622 | ×30              |
> | Duplicate ACK                        |          504 |         3446 | ×6.8             |
> | Packet length exceeds MSS (TSO/GRO)  |        13172 |        34790 | ×2.6             |
> | Previous segment(s) not captured     |         9205 |         6730 | -27%             |
> | ACKed segment that wasn't captured   |         7022 |         8272 | +18%             |
> | (Suspected) out-of-order segment     |          436 |          303 | -31%             |
> |--------------------------------------+--------------+--------------+------------------|
>
> This pattern indicates reordering, loss, or scheduling‑related delays, 
> but it is still unclear why PREEMPT_LAZY is causing this behaviour in 
> this workload.
>
> Additional observations:
>
> Monitoring the guest CPU run time shows that it drops from 16% with 
> PREEMPT_FULL to 9% with PREEMPT_LAZY.
>
> The workload is dominated by voluntary preemption (schedule()), and 
> PREEMPT_LAZY is, as far as I understand, mainly concerned with forced 
> preemption.
> It is therefore not obvious why PREEMPT_LAZY has an impact here.
>
> Changing guest configuration to disable mergeable RX buffers:
>       <host mrg_rxbuf="off"/>
>       had a clear effect on throughput:
>       PREEMPT_LAZY: throughput improved from 40 Gb/s → 60 Gb/s 


When I look at top sched_switch kstacks on s390 with this workload, 20% 
of them are worker_thread() -> schedule(), both with CONFIG_PREEMPT and 
CONFIG_PREEMPT_LAZY. The others are vhost and idle.

On x86 I see only vhost and idle, but not worker_thread().


According to runqlat.bt, average run queue latency goes up from 4us to 
18us when switching from CONFIG_PREEMPT to CONFIG_PREEMPT_LAZY.

I modified the script to show per-comm latencies, and it shows 
that worker_thread() is disproportionately penalized: the latency 
increases from 2us to 60us!

For vhost it's better: 5us -> 2us, and for KVM it's better too: 8us -> 2us.


Finally, what is the worker doing? I looked at __queue_work() kstacks, 
and they all come from irqfd_wakeup().

irqfd_wakeup() calls arch-specific kvm_arch_set_irq_inatomic(), which is 
implemented on x86 and not implemented on s390.


This may explain why we on s390 are the first to see this.


Christian, do you think it would make sense to
implement kvm_arch_set_irq_inatomic() on s390?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-24 17:11   ` Sebastian Andrzej Siewior
@ 2026-02-25  9:56     ` Ciunas Bennett
  0 siblings, 0 replies; 22+ messages in thread
From: Ciunas Bennett @ 2026-02-25  9:56 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, mingo, Thomas Gleixner, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, clrkwllms, linux-kernel, linux-rt-devel, Linus Torvalds,
	linux-s390



On 24/02/2026 17:11, Sebastian Andrzej Siewior wrote:
> On 2026-02-24 15:45:39 [+0000], Ciunas Bennett wrote:

>> Changing guest configuration to disable mergeable RX buffers:
>>        <host mrg_rxbuf="off"/>
>>        had a clear effect on throughput:
>>        PREEMPT_LAZY: throughput improved from 40 Gb/s → 60 Gb/s
>>
> 
> Does this bring the workload/test to the PREEMPT_FULL level?
>

Sorry, I was not clear here: when I enable this there is also an improvement
with PREEMPT_FULL, from 55 Gb/s -> 60 Gb/s.

So I see an improvement in both test cases:
PREEMPT_LAZY: throughput improved from 40 Gb/s → 60 Gb/s
PREEMPT_FULL: throughput improved from 55 Gb/s → 60 Gb/s

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-01-09 11:23 ` Shrikanth Hegde
@ 2026-02-25 10:53   ` Peter Zijlstra
  2026-02-25 12:56     ` Shrikanth Hegde
  2026-02-26  0:48     ` Steven Rostedt
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2026-02-25 10:53 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clrkwllms, linux-kernel, linux-rt-devel,
	Linus Torvalds, mingo, Thomas Gleixner, Sebastian Andrzej Siewior

On Fri, Jan 09, 2026 at 04:53:04PM +0530, Shrikanth Hegde wrote:

> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -243,7 +243,7 @@ static ssize_t sched_dynamic_write(struc
> >   static int sched_dynamic_show(struct seq_file *m, void *v)
> >   {
> > -	int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2;
> > +	int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2;
> >   	int j;
> >   	/* Count entries in NULL terminated preempt_modes */
> 
> Maybe only change the default to LAZY, but keep other options possible via
> dynamic update?
> 
> - When the kernel changes to lazy being the default, the scheduling pattern
> can change and it may affect the workloads. having ability to dynamically
> change to none/voluntary could help one to figure out where
> it is regressing. we could document cases where regression is expected.

I suppose we could do this. I just worry people will end up with 'echo
volatile > /debug/sched/preempt' in their startup script, rather than
trying to actually debug their issues.

Anybody with enough knowledge to be useful, can edit this line on their
own, rebuild the kernel and go forth.

Also, I've already heard people are interested in compile-time removal of
the cond_resched() infrastructure for ARCH_HAS_PREEMPT_LAZY, so this
would be short-lived indeed.

> - with preempt=full/lazy we will likely never see softlockups. How are we
> going to find out longer kernel paths(some maybe design, some may be bugs)
> apart from observing workload regression?

Given the utter cargo cult placement of cond_resched(); I don't think
we've actually lost much here. You wouldn't have seen the softlockup
thing anyway, because of cond_resched().

Anyway, you can always build on top of function graph tracing, create a
flame graph of stuff and see just where all your runtime went. I'm sure
there's tools that do this already. Perhaps if you're handy with the BPF
stuff you can even create a 'watchdog' of sorts that will scream if any
function takes longer than X us to run or whatever.

Oh, that reminds me, Steve, would it make sense to have
task_struct::se.sum_exec_runtime as a trace-clock?

> Also, is the softlockup code of any use in preempt=full/lazy?

Softlockup has always seemed of dubious value to me -- then again, I've
been running preempt=y kernels from about the day that became an option
:-)

I think it still trips if you lose a wakeup or something.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-25 10:53   ` Peter Zijlstra
@ 2026-02-25 12:56     ` Shrikanth Hegde
  2026-02-26  0:48     ` Steven Rostedt
  1 sibling, 0 replies; 22+ messages in thread
From: Shrikanth Hegde @ 2026-02-25 12:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clrkwllms, linux-kernel, linux-rt-devel,
	Linus Torvalds, mingo, Thomas Gleixner, Sebastian Andrzej Siewior



On 2/25/26 4:23 PM, Peter Zijlstra wrote:
> On Fri, Jan 09, 2026 at 04:53:04PM +0530, Shrikanth Hegde wrote:
> 
>>> --- a/kernel/sched/debug.c
>>> +++ b/kernel/sched/debug.c
>>> @@ -243,7 +243,7 @@ static ssize_t sched_dynamic_write(struc
>>>    static int sched_dynamic_show(struct seq_file *m, void *v)
>>>    {
>>> -	int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2;
>>> +	int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2;
>>>    	int j;
>>>    	/* Count entries in NULL terminated preempt_modes */
>>
>> Maybe only change the default to LAZY, but keep other options possible via
>> dynamic update?
>>
>> - When the kernel changes to lazy being the default, the scheduling pattern
>> can change and it may affect workloads. Having the ability to dynamically
>> change to none/voluntary could help one figure out where it is regressing.
>> We could document cases where a regression is expected.
> 
> I suppose we could do this. I just worry people will end up with 'echo
> voluntary > /debug/sched/preempt' in their startup script, rather than
> trying to actually debug their issues.

Ack.

> 
> Anybody with enough knowledge to be useful can edit this line on their
> own, rebuild the kernel, and go forth.
> 
> Also, I've already heard people are interested in compile-time removing
> of cond_resched() infrastructure for ARCH_HAS_PREEMPT_LAZY, so this
> would be short lived indeed.
> 
>> - With preempt=full/lazy we will likely never see softlockups. How are we
>> going to find long-running kernel paths (some may be by design, some may be
>> bugs) apart from observing workload regressions?
> 
> Given the utter cargo cult placement of cond_resched(); I don't think
> we've actually lost much here. You wouldn't have seen the softlockup
> thing anyway, because of cond_resched().
> 
> Anyway, you can always build on top of function graph tracing, create a
> flame graph of stuff and see just where all your runtime went. I'm sure
> there's tools that do this already. Perhaps if you're handy with the BPF
> stuff you can even create a 'watchdog' of sorts that will scream if any
> function takes longer than X us to run or whatever.
> 
> Oh, that reminds me, Steve, would it make sense to have
> task_struct::se.sum_exec_runtime as a trace-clock?
> 
>> Also, is the softlockup code of any use in preempt=full/lazy?
> 
> Softlockup has always seemed of dubious value to me -- then again, I've
> been running preempt=y kernels from about the day that became an option
> :-)
> 
> I think it still trips if you lose a wakeup or something.
> 

That's probably the hung task report, right?
IIUC that would be independent of the preemption model.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-25  2:30   ` Ilya Leoshkevich
@ 2026-02-25 16:33     ` Christian Borntraeger
  2026-02-25 18:30       ` Douglas Freimuth
  0 siblings, 1 reply; 22+ messages in thread
From: Christian Borntraeger @ 2026-02-25 16:33 UTC (permalink / raw)
  To: Ilya Leoshkevich, Ciunas Bennett, Peter Zijlstra, mingo,
	Thomas Gleixner, Sebastian Andrzej Siewior
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clrkwllms, linux-kernel, linux-rt-devel,
	Linus Torvalds, linux-s390, Douglas Freimuth, Matthew Rosato,
	Hendrik Brueckner

Am 24.02.26 um 21:30 schrieb Ilya Leoshkevich:
> Finally, what is the worker doing? I looked at __queue_work() kstacks, and they all come from irqfd_wakeup().
> 
> irqfd_wakeup() calls arch-specific kvm_arch_set_irq_inatomic(), which is implemented on x86 and not implemented on s390.
> 
> 
> This may explain why we on s390 are the first to see this.
> 
> 
> Christian, do you think it would make sense to implement kvm_arch_set_irq_inatomic() on s390?

So in fact Doug is working on that at the moment. There are some corner
cases where we had concerns, as we have to pin the guest pages holding
the interrupt bits. This was for secure execution; I need to follow up on
whether we have already solved those cases. But we can try the current
patch and see if it helps this particular problem.

If yes, then we can try to speed up the work on this.

Christian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-25 16:33     ` Christian Borntraeger
@ 2026-02-25 18:30       ` Douglas Freimuth
  2026-03-03  9:15         ` Ciunas Bennett
  0 siblings, 1 reply; 22+ messages in thread
From: Douglas Freimuth @ 2026-02-25 18:30 UTC (permalink / raw)
  To: Christian Borntraeger, Ilya Leoshkevich, Ciunas Bennett,
	Peter Zijlstra, mingo, Thomas Gleixner, Sebastian Andrzej Siewior
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clrkwllms, linux-kernel, linux-rt-devel,
	Linus Torvalds, linux-s390, Matthew Rosato, Hendrik Brueckner



On 2/25/26 11:33 AM, Christian Borntraeger wrote:
> Am 24.02.26 um 21:30 schrieb Ilya Leoshkevich:
>> Finally, what is the worker doing? I looked at __queue_work() kstacks, 
>> and they all come from irqfd_wakeup().
>>
>> irqfd_wakeup() calls arch-specific kvm_arch_set_irq_inatomic(), which 
>> is implemented on x86 and not implemented on s390.
>>
>>
>> This may explain why we on s390 are the first to see this.
>>
>>
>> Christian, do you think it would make sense to
>> implement kvm_arch_set_irq_inatomic() on s390?
> 
> So in fact Doug is working on that at the moment. There are some corner
> cases where we had concerns, as we have to pin the guest pages holding
> the interrupt bits. This was for secure execution; I need to follow up on
> whether we have already solved those cases. But we can try the current
> patch and see if it helps this particular problem.
> 
> If yes, then we can try to speed up the work on this.
> 
> Christian

Christian, the patch is very close to ready. As the last step, I rebased
on master today to pick up the latest changes to interrupt.c. I am
building that now and will test in both non-SE and SE environments. I
have been testing my solution in SE environments for a few weeks and it
seems to cover the use cases I have tested.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-25 10:53   ` Peter Zijlstra
  2026-02-25 12:56     ` Shrikanth Hegde
@ 2026-02-26  0:48     ` Steven Rostedt
  2026-02-26  5:30       ` Shrikanth Hegde
  1 sibling, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2026-02-26  0:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Shrikanth Hegde, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, mgorman, vschneid, clrkwllms, linux-kernel,
	linux-rt-devel, Linus Torvalds, mingo, Thomas Gleixner,
	Sebastian Andrzej Siewior

On Wed, 25 Feb 2026 11:53:45 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> Oh, that reminds me, Steve, would it make sense to have
> task_struct::se.sum_exec_runtime as a trace-clock?

That's unique per task right? As tracing is global it requires the
clock to be monotonic, and I'm guessing a single sched_switch will
break that.
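For illustration, the non-monotonicity is easy to see with a toy example
(invented numbers): interleave two tasks' per-task runtime clocks into one
global buffer and the timestamps jump backwards at every switch.

```python
# Toy illustration: sum_exec_runtime advances per task, so stamping a
# global trace buffer with it goes backwards at sched_switch.
# All numbers are invented.

# (global_time, task, task_sum_exec_runtime) -- two tasks interleaved
trace = [
    (10, "A", 5),
    (20, "A", 15),
    (30, "B", 2),    # switch to B: B's runtime clock is lower
    (40, "B", 12),
    (50, "A", 25),   # back to A
]

stamps = [rt for _, _, rt in trace]
print(stamps)                                            # -> [5, 15, 2, 12, 25]
print(all(a <= b for a, b in zip(stamps, stamps[1:])))   # monotonic? -> False
```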

Now if one wants to trace how long kernel paths are, I'm sure we could
trivially make a new tracer to do so.

  echo max_kernel_time > current_tracer

or something like that, that could act like a latency tracer that
monitors how long any kernel thread runs without being preempted.

-- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-26  0:48     ` Steven Rostedt
@ 2026-02-26  5:30       ` Shrikanth Hegde
  2026-02-26 17:22         ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Shrikanth Hegde @ 2026-02-26  5:30 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, bsegall, mgorman,
	vschneid, clrkwllms, linux-kernel, linux-rt-devel, Linus Torvalds,
	mingo, Thomas Gleixner, Sebastian Andrzej Siewior,
	Madhavan Srinivasan, Nicholas Piggin



On 2/26/26 6:18 AM, Steven Rostedt wrote:
> On Wed, 25 Feb 2026 11:53:45 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> Oh, that reminds me, Steve, would it make sense to have
>> task_struct::se.sum_exec_runtime as a trace-clock?
> 
> That's unique per task right? As tracing is global it requires the
> clock to be monotonic, and I'm guessing a single sched_switch will
> break that.
> 
> Now if one wants to trace how long kernel paths are, I'm sure we could
> trivially make a new tracer to do so.
> 
>    echo max_kernel_time > current_tracer

That is a good idea.

> 
> or something like that, that could act like a latency tracer that
> monitors how long any kernel thread runs without being preempted.
> 
> -- Steve

With preempt=full/lazy a long-running kernel task can get
preempted if it is running in a preemptible section. That's okay.

My intent was to have a tracer that can say: look, this kernel task took this much time
before it completed. For some tasks, such as a long page walk, we know it is okay since
it is expected to take time, but some tasks, such as reading the watchdog, shouldn't take
time. But on large systems even the global variable update itself may take a long time;
updating less often was the fix for that lockup IIRC. So how can we identify such
opportunities? Hopefully I am making sense.

Earlier, one would have got a softlockup when things were making very slow
progress (the paths which didn't have a cond_resched).
Now, we don't know unless we see a workload regression.


If we don't have a tracer/mechanism today which reports kernel tasks running
longer than a time limit, then having a new one would help.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-26  5:30       ` Shrikanth Hegde
@ 2026-02-26 17:22         ` Steven Rostedt
  2026-02-27  9:09           ` Shrikanth Hegde
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2026-02-26 17:22 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Peter Zijlstra, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, mgorman, vschneid, clrkwllms, linux-kernel,
	linux-rt-devel, Linus Torvalds, mingo, Thomas Gleixner,
	Sebastian Andrzej Siewior, Madhavan Srinivasan, Nicholas Piggin

On Thu, 26 Feb 2026 11:00:14 +0530
Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> On 2/26/26 6:18 AM, Steven Rostedt wrote:
> > On Wed, 25 Feb 2026 11:53:45 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >   
> >> Oh, that reminds me, Steve, would it make sense to have
> >> task_struct::se.sum_exec_runtime as a trace-clock?  
> > 
> > That's unique per task right? As tracing is global it requires the
> > clock to be monotonic, and I'm guessing a single sched_switch will
> > break that.
> > 
> > Now if one wants to trace how long kernel paths are, I'm sure we could
> > trivially make a new tracer to do so.
> > 
> >    echo max_kernel_time > current_tracer  
> 
> That is a good idea.

Yeah, I think something like this should be added, now that LAZY will
prevent us from knowing what is really going on in the kernel for a long
time.

> 
> > 
> > or something like that, that could act like a latency tracer that
> > monitors how long any kernel thread runs without being preempted.
> > 
> > -- Steve  
> 
> With preempt=full/lazy a long-running kernel task can get
> preempted if it is running in a preemptible section. That's okay.
> 
> My intent was to have a tracer that can say: look, this kernel task took this much time
> before it completed. For some tasks, such as a long page walk, we know it is okay since

Tracers can be set to only watch a single task. The function and function
graph tracers use set_ftrace_pid. I could extend that to other tracers.
Hmm, that may even be useful for the preemptirq tracer!

> it is expected to take time, but some tasks, such as reading the watchdog, shouldn't take
> time. But on large systems even the global variable update itself may take a long time;
> updating less often was the fix for that lockup IIRC. So how can we identify such
> opportunities? Hopefully I am making sense.

Not really. Can you explain in more detail, or give specific examples of
what constitutes a path you want to trace and one that you do not?

-- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-26 17:22         ` Steven Rostedt
@ 2026-02-27  9:09           ` Shrikanth Hegde
  2026-02-27 14:53             ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Shrikanth Hegde @ 2026-02-27  9:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, mgorman, vschneid, clrkwllms, linux-kernel,
	linux-rt-devel, Linus Torvalds, mingo, Thomas Gleixner,
	Sebastian Andrzej Siewior, Madhavan Srinivasan, Nicholas Piggin

Hi Steven.

On 2/26/26 10:52 PM, Steven Rostedt wrote:
> On Thu, 26 Feb 2026 11:00:14 +0530
> Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> 
>> On 2/26/26 6:18 AM, Steven Rostedt wrote:
>>> On Wed, 25 Feb 2026 11:53:45 +0100
>>> Peter Zijlstra <peterz@infradead.org> wrote:
>>>    
>>>> Oh, that reminds me, Steve, would it make sense to have
>>>> task_struct::se.sum_exec_runtime as a trace-clock?
>>>
>>> That's unique per task right? As tracing is global it requires the
>>> clock to be monotonic, and I'm guessing a single sched_switch will
>>> break that.
>>>
>>> Now if one wants to trace how long kernel paths are, I'm sure we could
>>> trivially make a new tracer to do so.
>>>
>>>     echo max_kernel_time > current_tracer
>>
>> That is a good idea.
> 
> Yeah, I think something like this should be added, now that LAZY will
> prevent us from knowing what is really going on in the kernel for a long
> time.
> 

That would be the goal.

>>
>>>
>>> or something like that, that could act like a latency tracer that
>>> monitors how long any kernel thread runs without being preempted.
>>>
>>> -- Steve
>>
>> With preempt=full/lazy a long-running kernel task can get
>> preempted if it is running in a preemptible section. That's okay.
>>
>> My intent was to have a tracer that can say: look, this kernel task took this much time
>> before it completed. For some tasks, such as a long page walk, we know it is okay since
> 
> Tracers can be set to only watch a single task. The function and function
> graph tracers use set_ftrace_pid. I could extend that to other tracers.
> Hmm, that may even be useful for the preemptirq tracer!
> 
>> it is expected to take time, but some tasks, such as reading the watchdog, shouldn't take
>> time. But on large systems even the global variable update itself may take a long time;
>> updating less often was the fix for that lockup IIRC. So how can we identify such

That was a hardlockup. Wrong example.

>> opportunities. Hopefully I am making sense.
> 
> Not really. Can you explain in more detail, or give specific examples of
> what constitutes a path you want to trace and one that you do not?

All I was saying is that there have been fixes which solved softlockup issues
without using cond_resched. But seeing the softlockup was important to know
that the issue existed.

Some reference commits I think did this:
a8c861f401b4 xfs: avoid busy loops in GCD
e1b849cfa6b6 writeback: Avoid contention on wb->list_lock when switching inodes
0ddfb62f5d01 fix the softlockups in attach_recursive_mnt()


I am afraid we will have to trace all functions to begin with (which is
expensive), but filter out those which took minimal time (say, less than 1s or
so). That would eventually leave only the few functions that actually took
more than 1s (which should have limited overhead).
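A post-processing sketch of that filtering, assuming function_graph output
roughly like the sample lines below (the format is reproduced from memory, so
treat it as an approximation, not the exact tracer output):

```python
import re

# Filter function_graph-style duration lines for calls above a threshold.
# Matches "<duration> us|ms | } /* func */" fragments; lines without a
# trailing comment get '?' as the function name.
DUR_RE = re.compile(r'([\d.]+)\s*(us|ms)\s*\|\s*\}?\s*(?:/\*\s*(\S+)\s*\*/)?')

def slow_calls(lines, threshold_us):
    out = []
    for line in lines:
        m = DUR_RE.search(line)
        if not m:
            continue
        dur = float(m.group(1)) * (1000 if m.group(2) == 'ms' else 1)
        if dur > threshold_us:
            out.append((m.group(3) or '?', dur))
    return out

sample = [
    " 0)   0.639 us    |        rcu_read_unlock();",
    " 0) ! 250.130 us  |      } /* zap_pte_range */",
    " 0) # 1200000.0 us |    } /* folio_zero_user */",
]
print(slow_calls(sample, threshold_us=1_000_000))  # -> [('folio_zero_user', 1200000.0)]
```

In practice you would pipe trace_pipe through something like this rather than
keep the whole (huge) trace around.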

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-27  9:09           ` Shrikanth Hegde
@ 2026-02-27 14:53             ` Steven Rostedt
  2026-02-27 15:28               ` Shrikanth Hegde
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2026-02-27 14:53 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Peter Zijlstra, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, mgorman, vschneid, clrkwllms, linux-kernel,
	linux-rt-devel, Linus Torvalds, mingo, Thomas Gleixner,
	Sebastian Andrzej Siewior, Madhavan Srinivasan, Nicholas Piggin

On Fri, 27 Feb 2026 14:39:42 +0530
Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> I am afraid we will have to trace all functions to begin with (which is
> expensive), but filter out those which took minimal time (say, less than 1s
> or so). That would eventually leave only the few functions that actually
> took more than 1s (which should have limited overhead).

Well, I think the detection can be done with timings between schedules.
What's the longest running task without any voluntary schedule. Then you
can add function graph tracing to it where it can possibly trigger in the
location that detected the issue.

On a detection of a long schedule, a stack trace can be recorded. Using
that stack trace, you could use the function graph tracer to see what is
happening.

Anyway, something to think about, and this could be a topic at this years
Linux Plumbers Tracing MC ;-)

-- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-27 14:53             ` Steven Rostedt
@ 2026-02-27 15:28               ` Shrikanth Hegde
  2026-03-09  9:13                 ` Shrikanth Hegde
  0 siblings, 1 reply; 22+ messages in thread
From: Shrikanth Hegde @ 2026-02-27 15:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, mgorman, vschneid, clrkwllms, linux-kernel,
	linux-rt-devel, Linus Torvalds, mingo, Thomas Gleixner,
	Sebastian Andrzej Siewior, Madhavan Srinivasan, Nicholas Piggin,
	sathvika

Hi Steve.

On 2/27/26 8:23 PM, Steven Rostedt wrote:
> On Fri, 27 Feb 2026 14:39:42 +0530
> Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> 
>> I am afraid we will have to trace all functions to begin with (which is
>> expensive), but filter out those which took minimal time (say, less than 1s
>> or so). That would eventually leave only the few functions that actually
>> took more than 1s (which should have limited overhead).
> 
> Well, I think the detection can be done with timings between schedules.
> What's the longest running task without any voluntary schedule. Then you
> can add function graph tracing to it where it can possibly trigger in the
> location that detected the issue.
> 
> On a detection of a long schedule, a stack trace can be recorded. Using
> that stack trace, you could use the function graph tracer to see what is
> happening.
> 
> Anyway, something to think about, and this could be a topic at this years
> Linux Plumbers Tracing MC ;-)
> 

Yep. Will try to do this.

Someone from our tracing team wanted to give this a try too. Let's see.

> -- Steve


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-25 18:30       ` Douglas Freimuth
@ 2026-03-03  9:15         ` Ciunas Bennett
  2026-03-03 11:52           ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Ciunas Bennett @ 2026-03-03  9:15 UTC (permalink / raw)
  To: Douglas Freimuth, Christian Borntraeger, Ilya Leoshkevich,
	Peter Zijlstra, mingo, Thomas Gleixner, Sebastian Andrzej Siewior
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clrkwllms, linux-kernel, linux-rt-devel,
	Linus Torvalds, linux-s390, Matthew Rosato, Hendrik Brueckner

A quick update on the issue.
Introducing kvm_arch_set_irq_inatomic() appears to make the problem go away on my setup.
That said, this still raises the question: why does irqfd_wakeup behave differently (or poorly) in this scenario compared to the in-atomic IRQ injection path?
Is there a known interaction with workqueues, contexts, or locking that would explain the divergence here?

Observations:
irqfd_wakeup: triggers the problematic behaviour.
Forcing in-atomic IRQ injection (kvm_arch_set_irq_inatomic): issue not observed.

@Peter Zijlstra — Peter, do you have thoughts on how the workqueue scheduling context here could differ enough to cause this regression?
Any pointers on what to trace specifically in irqfd_wakeup and the work item path would be appreciated.
Thanks,
Ciunas Bennett

On 25/02/2026 18:30, Douglas Freimuth wrote:
> 
> Christian, the patch is very close to ready. As the last step, I rebased on master today to pick up the latest changes to interrupt.c. I am building that now and will test in both non-SE and SE environments. I have been testing my solution in SE environments for a few weeks and it seems to cover the use cases I have tested.
> 
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-03-03  9:15         ` Ciunas Bennett
@ 2026-03-03 11:52           ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2026-03-03 11:52 UTC (permalink / raw)
  To: Ciunas Bennett
  Cc: Douglas Freimuth, Christian Borntraeger, Ilya Leoshkevich, mingo,
	Thomas Gleixner, Sebastian Andrzej Siewior, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, clrkwllms, linux-kernel, linux-rt-devel, Linus Torvalds,
	linux-s390, Matthew Rosato, Hendrik Brueckner

On Tue, Mar 03, 2026 at 09:15:55AM +0000, Ciunas Bennett wrote:
> A quick update on the issue.
> Introducing kvm_arch_set_irq_inatomic() appears to make the problem go away on my setup.
> That said, this still raises the question: why does irqfd_wakeup behave differently (or poorly) in this scenario compared to the in-atomic IRQ injection path?
> Is there a known interaction with workqueues, contexts, or locking that would explain the divergence here?
> 
> Observations:
> irqfd_wakeup: triggers the problematic behaviour.
> Forcing in-atomic IRQ injection (kvm_arch_set_irq_inatomic): issue not observed.
> 
> @Peter Zijlstra — Peter, do you have thoughts on how the workqueue scheduling context here could differ enough to cause this regression?
> Any pointers on what to trace specifically in irqfd_wakeup and the work item path would be appreciated.

So the thing that LAZY does differently from FULL is that it delays
preemption a bit.

This has two ramifications:

1) some ping-pong workloads will turn into block+wakeup, adding
overhead.

 FULL: running your task A, an interrupt would come in, wake task B and
 set Need Resched and the interrupt return path calls schedule() and
 you're task B. B does its thing, 'wakes' A and blocks.

 LAZY: running your task A, an interrupt would come in, wake task B (no
 NR set), you continue running A, A blocks for it needs something of B,
 now you schedule() [*] B runs, does its thing, does an actual wakeup of
 A and blocks.

The distinct difference here is that LAZY does a block of A and
consequently B has to do a full wakeup of A, whereas FULL doesn't do a
block of A, and hence the wakeup of A is NOP as well.


2) Since the schedule() is delayed, it might happen that by the time it
does get around to it, your task B is no longer the most eligible
option.

Same as above, except now, C is also woken, and the schedule marked with
[*] picks C, this then results in a detour, delaying things further.
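As a toy model of the two sequences above (purely illustrative: it just
replays the event orderings described and counts real blocks and non-NOP
wakeups per round trip):

```python
# FULL vs LAZY ping-pong, transcribed from the description above.
# A "wake" only counts as a full wakeup if the target really blocked.

FULL = [
    ("irq_wake", "B"),   # sets need-resched, immediate preempt
    ("run", "B"),
    ("wake", "A"),       # A never blocked -> wakeup is a NOP
    ("block", "B"),
    ("run", "A"),
]

LAZY = [
    ("irq_wake", "B"),   # no need-resched yet
    ("run", "A"),        # A keeps running...
    ("block", "A"),      # ...until it needs B and blocks
    ("run", "B"),
    ("wake", "A"),       # A really blocked -> full wakeup needed
    ("block", "B"),
    ("run", "A"),
]

def cost(seq):
    blocked = set()
    blocks = full_wakeups = 0
    for ev, task in seq:
        if ev == "block":
            blocked.add(task); blocks += 1
        elif ev == "wake" and task in blocked:
            blocked.discard(task); full_wakeups += 1
    return blocks, full_wakeups

print("FULL:", cost(FULL))   # -> FULL: (1, 0)
print("LAZY:", cost(LAZY))   # -> LAZY: (2, 1)
```

That extra block+full-wakeup pair per round trip is the overhead referred to
in point 1.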



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-27 15:28               ` Shrikanth Hegde
@ 2026-03-09  9:13                 ` Shrikanth Hegde
  0 siblings, 0 replies; 22+ messages in thread
From: Shrikanth Hegde @ 2026-03-09  9:13 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, bsegall, mgorman,
	vschneid, clrkwllms, linux-kernel, linux-rt-devel, Linus Torvalds,
	mingo, Thomas Gleixner, Sebastian Andrzej Siewior,
	Madhavan Srinivasan, Nicholas Piggin, sathvika



On 2/27/26 8:58 PM, Shrikanth Hegde wrote:
> Hi Steve.
> 
> On 2/27/26 8:23 PM, Steven Rostedt wrote:
>> On Fri, 27 Feb 2026 14:39:42 +0530
>> Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>>
>>> I am afraid we will have trace all functions to begin with (which is 
>>> expensive), but filter
>>> out those which took minimal time (like less than a 1s or so). that 
>>> would eventually leave only a
>>> few functions that actually took more than 1s(that should have 
>>> limited overhead).
>>

Is it possible to remove tracing for some functions after they were enabled
in the kernel? Or can this only be done from userspace by looking at the
trace buffer?

Even if it is doable, this would allow us to trace functions that took a lot
of time. But shouldn't we be aiming to find the kernel paths that took a lot
of time?

>> Well, I think the detection can be done with timings between schedules.
>> What's the longest running task without any voluntary schedule. Then you
>> can add function graph tracing to it where it can possibly trigger in the
>> location that detected the issue.
>>

This would not work either. We will have sched in/sched out even when
running in userspace.

Let's say the user makes a syscall; the process will continue to be in R
state. We only need to track long running times in the kernel, not in
userspace.

>> On a detection of a long schedule, a stack trace can be recorded. Using
>> that stack trace, you could use the function graph tracer to see what is
>> happening.
>>
>> Anyway, something to think about, and this could be a topic at this years
>> Linux Plumbers Tracing MC ;-)
>>
>

We could track the kernel paths, i.e. the different entry/exit points into
the kernel:

1. syscall entry/exit.
2. irq entry/exit.
3. kworker threads.

For 1 and 2 we have tracepoints already. For 3, we can use sched in/sched out
tracepoints to see if and when it takes a long time.

All of them could be combined in one BPF program. Any thoughts?

Getting a stack trace for 3 is doable, I guess, i.e. when sched_out happens
while in R state and the time check has failed. But for 1 and 2, getting a
stack is going to be difficult.

Please add any kernel paths I have missed where we want detection to happen.
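A minimal userspace sketch of the pairing logic for 1 and 2 (the event tuples
are hypothetical stand-ins for the raw tracepoint data a BPF program would
see; a real implementation would keep the per-task state in a BPF map):

```python
# Pair per-task kernel entry/exit events and flag any in-kernel stretch
# longer than a limit. Event format is invented for illustration.

def long_kernel_paths(events, limit_us):
    """events: (ts_us, pid, 'enter'|'exit', path). Returns offenders."""
    entered = {}   # pid -> (entry ts, path)
    hits = []
    for ts, pid, kind, path in events:
        if kind == "enter":
            entered[pid] = (ts, path)
        elif kind == "exit" and pid in entered:
            t0, p = entered.pop(pid)
            if ts - t0 > limit_us:
                hits.append((pid, p, ts - t0))
    return hits

events = [
    (0,         42, "enter", "syscall:read"),
    (80,        42, "exit",  "syscall:read"),    # fast, ignored
    (100,       99, "enter", "syscall:mount"),
    (2_100_000, 99, "exit",  "syscall:mount"),   # ~2.1s in the kernel
]
print(long_kernel_paths(events, limit_us=1_000_000))  # -> [(99, 'syscall:mount', 2099900)]
```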

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2026-03-09  9:13 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-19 10:15 [PATCH] sched: Further restrict the preemption modes Peter Zijlstra
2026-01-06 15:23 ` Valentin Schneider
2026-01-06 16:40 ` Steven Rostedt
2026-01-09 11:23 ` Shrikanth Hegde
2026-02-25 10:53   ` Peter Zijlstra
2026-02-25 12:56     ` Shrikanth Hegde
2026-02-26  0:48     ` Steven Rostedt
2026-02-26  5:30       ` Shrikanth Hegde
2026-02-26 17:22         ` Steven Rostedt
2026-02-27  9:09           ` Shrikanth Hegde
2026-02-27 14:53             ` Steven Rostedt
2026-02-27 15:28               ` Shrikanth Hegde
2026-03-09  9:13                 ` Shrikanth Hegde
2026-01-12  8:03 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2026-02-24 15:45 ` [PATCH] " Ciunas Bennett
2026-02-24 17:11   ` Sebastian Andrzej Siewior
2026-02-25  9:56     ` Ciunas Bennett
2026-02-25  2:30   ` Ilya Leoshkevich
2026-02-25 16:33     ` Christian Borntraeger
2026-02-25 18:30       ` Douglas Freimuth
2026-03-03  9:15         ` Ciunas Bennett
2026-03-03 11:52           ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox