public inbox for linux-s390@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH] sched: Further restrict the preemption modes
       [not found] <20251219101502.GB1132199@noisy.programming.kicks-ass.net>
@ 2026-02-24 15:45 ` Ciunas Bennett
  2026-02-24 17:11   ` Sebastian Andrzej Siewior
  2026-02-25  2:30   ` Ilya Leoshkevich
  0 siblings, 2 replies; 8+ messages in thread
From: Ciunas Bennett @ 2026-02-24 15:45 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, Thomas Gleixner, Sebastian Andrzej Siewior
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clrkwllms, linux-kernel, linux-rt-devel,
	Linus Torvalds, linux-s390



On 19/12/2025 10:15, Peter Zijlstra wrote:


Hi Peter,
We are observing a performance regression on s390 since enabling PREEMPT_LAZY.
Test Environment:
- Architecture: s390
- Single KVM host running two identical guests
- Guests connected virtually via Open vSwitch
- Workload: uperf streaming read test with 50 parallel connections
- One guest acts as the uperf client, the other as the server

Open vSwitch configuration:
- OVS bridge with two ports
- Guests attached via virtio-net
- Each guest configured with 4 vhost queues

Problem Description
When comparing PREEMPT_LAZY against full PREEMPT, we see a substantial drop in throughput, up to 50% on some systems.

Observed Behaviour
By tracing packets inside Open vSwitch (ovs_do_execute_action), we see:
- Packet drops
- Retransmissions
- Reductions in packet size (from 64K down to 32K)

Capturing traffic inside the VM and inspecting it in Wireshark shows the following TCP‑level differences between PREEMPT_FULL and PREEMPT_LAZY:
|--------------------------------------+--------------+--------------+------------------|
| Wireshark Warning / Note             | PREEMPT_FULL | PREEMPT_LAZY | (lazy vs full)   |
|--------------------------------------+--------------+--------------+------------------|
| D-SACK Sequence                      |          309 |         2603 | ×8.4             |
| Partial Acknowledgement of a segment |           54 |          279 | ×5.2             |
| Ambiguous ACK (Karn)                 |           32 |          747 | ×23              |
| (Suspected) spurious retransmission  |          205 |          857 | ×4.2             |
| (Suspected) fast retransmission      |           54 |         1622 | ×30              |
| Duplicate ACK                        |          504 |         3446 | ×6.8             |
| Packet length exceeds MSS (TSO/GRO)  |        13172 |        34790 | ×2.6             |
| Previous segment(s) not captured     |         9205 |         6730 | -27%             |
| ACKed segment that wasn't captured   |         7022 |         8272 | +18%             |
| (Suspected) out-of-order segment     |          436 |          303 | -31%             |
|--------------------------------------+--------------+--------------+------------------|
This pattern indicates reordering, loss, or scheduling‑related delays, but it is still unclear why PREEMPT_LAZY is causing this behaviour in this workload.
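For reference, the (lazy vs full) column is just the ratio of the two counts; a quick check of a few rows (plain Python, counts copied from the table above):

```python
# Ratio of PREEMPT_LAZY count to PREEMPT_FULL count for a few of the
# Wireshark notes above (counts copied from the table).
counts = {
    "D-SACK Sequence": (309, 2603),
    "(Suspected) fast retransmission": (54, 1622),
    "Duplicate ACK": (504, 3446),
}
for name, (full, lazy) in counts.items():
    print(f"{name}: x{lazy / full:.1f}")  # x8.4, x30.0, x6.8
```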

Additional observations:

Monitoring the guest CPU run time shows that it drops from 16% with PREEMPT_FULL to 9% with PREEMPT_LAZY.

The workload is dominated by voluntary preemption (schedule()), and PREEMPT_LAZY is, as far as I understand, mainly concerned with forced preemption.
It is therefore not obvious why PREEMPT_LAZY has an impact here.

Changing the guest configuration to disable mergeable RX buffers:

       <host mrg_rxbuf="off"/>

had a clear effect on throughput:
PREEMPT_LAZY: throughput improved from 40 Gb/s → 60 Gb/s
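For context, this knob sits in the virtio-net <driver> element of the libvirt domain XML; a minimal sketch of the interface definition (the bridge name is a placeholder, the queue count matches the 4 vhost queues above):

```xml
<interface type='bridge'>
  <source bridge='ovsbr0'/>
  <virtualport type='openvswitch'/>
  <model type='virtio'/>
  <driver name='vhost' queues='4'>
    <!-- disable mergeable RX buffers -->
    <host mrg_rxbuf='off'/>
  </driver>
</interface>
```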


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-24 15:45 ` [PATCH] sched: Further restrict the preemption modes Ciunas Bennett
@ 2026-02-24 17:11   ` Sebastian Andrzej Siewior
  2026-02-25  9:56     ` Ciunas Bennett
  2026-02-25  2:30   ` Ilya Leoshkevich
  1 sibling, 1 reply; 8+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-24 17:11 UTC (permalink / raw)
  To: Ciunas Bennett
  Cc: Peter Zijlstra, mingo, Thomas Gleixner, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, clrkwllms, linux-kernel, linux-rt-devel, Linus Torvalds,
	linux-s390

On 2026-02-24 15:45:39 [+0000], Ciunas Bennett wrote:
> Monitoring the guest CPU run time shows that it drops from 16% with
> PREEMPT_FULL to 9% with PREEMPT_LAZY.
> 
> The workload is dominated by voluntary preemption (schedule()), and
> PREEMPT_LAZY is, as far as I understand, mainly concerned with forced
> preemption.
> It is therefore not obvious why PREEMPT_LAZY has an impact here.

PREEMPT_FULL schedules immediately if there is a preemption request
either due to a wake up of a task, or because the time slice is used up
(while in kernel).
PREEMPT_LAZY delays the preemption request, caused by the scheduling
event, either until the task returns to userland or the next HZ tick.

The voluntary schedule() invocation shouldn't be affected by FULL -> LAZY,
but I guess FULL scheduled more often after a wake up, which works in its
favour.
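To make the timing difference concrete, a toy model (plain Python, not kernel code; the only lazy preemption points modelled are the next tick and the return to userland):

```python
def resched_time(mode, irq_ret_t, tick_t, user_ret_t):
    """When does a wakeup-triggered preemption request take effect?

    FULL honours the request at the first preemption point, e.g. the
    interrupt return right after the wakeup. LAZY defers it until the
    task returns to userland or the next HZ tick, whichever is first.
    """
    if mode == "FULL":
        return irq_ret_t
    return min(tick_t, user_ret_t)

# wakeup at t=0; irq returns at t=1us; next tick at t=1000us (HZ=1000);
# the current task returns to userland at t=400us
print(resched_time("FULL", 1, 1000, 400))  # 1
print(resched_time("LAZY", 1, 1000, 400))  # 400
```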

> Changing guest configuration to disable mergeable RX buffers:
>       <host mrg_rxbuf="off"/>
>       had a clear effect on throughput:
>       PREEMPT_LAZY: throughput improved from 40 Gb/s → 60 Gb/s
> 

Does this bring the workload/test to the PREEMPT_FULL level?

Sebastian

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-24 15:45 ` [PATCH] sched: Further restrict the preemption modes Ciunas Bennett
  2026-02-24 17:11   ` Sebastian Andrzej Siewior
@ 2026-02-25  2:30   ` Ilya Leoshkevich
  2026-02-25 16:33     ` Christian Borntraeger
  1 sibling, 1 reply; 8+ messages in thread
From: Ilya Leoshkevich @ 2026-02-25  2:30 UTC (permalink / raw)
  To: Ciunas Bennett, Peter Zijlstra, mingo, Thomas Gleixner,
	Sebastian Andrzej Siewior, Christian Borntraeger
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clrkwllms, linux-kernel, linux-rt-devel,
	Linus Torvalds, linux-s390

On 2/24/26 16:45, Ciunas Bennett wrote:
>
>
> On 19/12/2025 10:15, Peter Zijlstra wrote:
>
>
> Hi Peter,
> We are observing a performance regression on s390 since enabling 
> PREEMPT_LAZY.
> Test Environment
> Architecture: s390
> Setup:
>
> Single KVM host running two identical guests
> Guests are connected virtually via Open vSwitch
> Workload: uperf streaming read test with 50 parallel connections
> One guest acts as the uperf client, the other as the server
>
> Open vSwitch configuration:
>
> OVS bridge with two ports
> Guests attached via virtio‑net
> Each guest configured with 4 vhost‑queues
>
> Problem Description
> When comparing PREEMPT_LAZY against full PREEMPT, we see a substantial 
> drop in throughput—on some systems up to 50%.
>
> Observed Behaviour
> By tracing packets inside Open vSwitch (ovs_do_execute_action), we see:
> Packet drops
> Retransmissions
> Reductions in packet size (from 64K down to 32K)
>
> Capturing traffic inside the VM and inspecting it in Wireshark shows 
> the following TCP‑level differences between PREEMPT_FULL and 
> PREEMPT_LAZY:
> |--------------------------------------+--------------+--------------+------------------|
> | Wireshark Warning / Note             | PREEMPT_FULL | PREEMPT_LAZY | (lazy vs full)   |
> |--------------------------------------+--------------+--------------+------------------|
> | D-SACK Sequence                      |          309 |         2603 | ×8.4             |
> | Partial Acknowledgement of a segment |           54 |          279 | ×5.2             |
> | Ambiguous ACK (Karn)                 |           32 |          747 | ×23              |
> | (Suspected) spurious retransmission  |          205 |          857 | ×4.2             |
> | (Suspected) fast retransmission      |           54 |         1622 | ×30              |
> | Duplicate ACK                        |          504 |         3446 | ×6.8             |
> | Packet length exceeds MSS (TSO/GRO)  |        13172 |        34790 | ×2.6             |
> | Previous segment(s) not captured     |         9205 |         6730 | -27%             |
> | ACKed segment that wasn't captured   |         7022 |         8272 | +18%             |
> | (Suspected) out-of-order segment     |          436 |          303 | -31%             |
> |--------------------------------------+--------------+--------------+------------------|
>
> This pattern indicates reordering, loss, or scheduling‑related delays, 
> but it is still unclear why PREEMPT_LAZY is causing this behaviour in 
> this workload.
>
> Additional observations:
>
> Monitoring the guest CPU run time shows that it drops from 16% with 
> PREEMPT_FULL to 9% with PREEMPT_LAZY.
>
> The workload is dominated by voluntary preemption (schedule()), and 
> PREEMPT_LAZY is, as far as I understand, mainly concerned with forced 
> preemption.
> It is therefore not obvious why PREEMPT_LAZY has an impact here.
>
> Changing guest configuration to disable mergeable RX buffers:
>       <host mrg_rxbuf="off"/>
>       had a clear effect on throughput:
>       PREEMPT_LAZY: throughput improved from 40 Gb/s → 60 Gb/s 


When I look at top sched_switch kstacks on s390 with this workload, 20% 
of them are worker_thread() -> schedule(), both with CONFIG_PREEMPT and 
CONFIG_PREEMPT_LAZY. The others are vhost and idle.

On x86 I see only vhost and idle, but not worker_thread().


According to runqlat.bt, average run queue latency goes up from 4us to 
18us when switching from CONFIG_PREEMPT to CONFIG_PREEMPT_LAZY.

I modified the script to show per-comm latencies, and it shows 
that worker_thread() is disproportionately penalized: the latency 
increases from 2us to 60us!

For vhost it's better: 5us -> 2us, and for KVM it's better too: 8us -> 2us.
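For reference, the per-comm change is conceptually just keying the latency samples by comm before averaging; a sketch of that aggregation (plain Python with made-up sample values, not the actual runqlat.bt script):

```python
from collections import defaultdict

# (comm, run-queue latency in us); real samples would come from
# sched_wakeup -> sched_switch deltas, the values here are made up
samples = [("kworker/0:1", 60), ("kworker/0:1", 58),
           ("vhost-1234", 2), ("vhost-1234", 3),
           ("CPU 0/KVM", 2)]

stats = defaultdict(lambda: [0, 0])  # comm -> [total_us, count]
for comm, lat_us in samples:
    stats[comm][0] += lat_us
    stats[comm][1] += 1

for comm, (total, n) in sorted(stats.items()):
    print(f"{comm}: avg {total / n:.1f} us")
```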


Finally, what is the worker doing? I looked at __queue_work() kstacks, 
and they all come from irqfd_wakeup().

irqfd_wakeup() calls arch-specific kvm_arch_set_irq_inatomic(), which is 
implemented on x86 and not implemented on s390.


This may explain why we on s390 are the first to see this.


Christian, do you think it would make sense to
implement kvm_arch_set_irq_inatomic() on s390?
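For context, irqfd delivery first tries the arch's atomic injection hook and only falls back to queueing a work item when that hook returns -EWOULDBLOCK, which is what happens when an arch provides no implementation; the deferred work item then has to win a run-queue slot. A Python sketch of that dispatch logic (the real code is C in virt/kvm/eventfd.c; this is only an illustration):

```python
EWOULDBLOCK = 11  # stand-in for the kernel's -EWOULDBLOCK

def set_irq_inatomic_default(irq):
    # An arch without kvm_arch_set_irq_inatomic() effectively always
    # refuses, forcing the workqueue fallback below.
    return -EWOULDBLOCK

def irqfd_wakeup(irq, set_irq_inatomic, queue_work):
    """Sketch of the irqfd wakeup path: inject inline if the arch can
    do it in atomic context, otherwise defer to a kworker."""
    if set_irq_inatomic(irq) != -EWOULDBLOCK:
        return "injected inline"
    queue_work(irq)  # the injection now waits on kworker scheduling
    return "deferred to kworker"

pending = []
result = irqfd_wakeup(1, set_irq_inatomic_default, pending.append)
print(result, pending)  # deferred to kworker [1]
```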


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-24 17:11   ` Sebastian Andrzej Siewior
@ 2026-02-25  9:56     ` Ciunas Bennett
  0 siblings, 0 replies; 8+ messages in thread
From: Ciunas Bennett @ 2026-02-25  9:56 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, mingo, Thomas Gleixner, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, clrkwllms, linux-kernel, linux-rt-devel, Linus Torvalds,
	linux-s390



On 24/02/2026 17:11, Sebastian Andrzej Siewior wrote:
> On 2026-02-24 15:45:39 [+0000], Ciunas Bennett wrote:

>> Changing guest configuration to disable mergeable RX buffers:
>>        <host mrg_rxbuf="off"/>
>>        had a clear effect on throughput:
>>        PREEMPT_LAZY: throughput improved from 40 Gb/s → 60 Gb/s
>>
> 
> Brings this the workload/ test to PREEMPT_FULL level?
>

Sorry, I was not clear here: when I enable this there is also an improvement with PREEMPT_FULL,
from 55 Gb/s → 60 Gb/s.

So I see an improvement in both test cases:
PREEMPT_LAZY: throughput improved from 40 Gb/s → 60 Gb/s
PREEMPT_FULL: throughput improved from 55 Gb/s → 60 Gb/s

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-25  2:30   ` Ilya Leoshkevich
@ 2026-02-25 16:33     ` Christian Borntraeger
  2026-02-25 18:30       ` Douglas Freimuth
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Borntraeger @ 2026-02-25 16:33 UTC (permalink / raw)
  To: Ilya Leoshkevich, Ciunas Bennett, Peter Zijlstra, mingo,
	Thomas Gleixner, Sebastian Andrzej Siewior
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clrkwllms, linux-kernel, linux-rt-devel,
	Linus Torvalds, linux-s390, Douglas Freimuth, Matthew Rosato,
	Hendrik Brueckner

Am 24.02.26 um 21:30 schrieb Ilya Leoshkevich:
> Finally, what is the worker doing? I looked at __queue_work() kstacks, and they all come from irqfd_wakeup().
> 
> irqfd_wakeup() calls arch-specific kvm_arch_set_irq_inatomic(), which is implemented on x86 and not implemented on s390.
> 
> 
> This may explain why we on s390 are the first to see this.
> 
> 
> Christian, do you think it would make sense to implement kvm_arch_set_irq_inatomic() on s390?

So in fact Doug is working on that at the moment. There are some corner
cases where we had concerns, as we have to pin the guest pages holding
the interrupt bits. This was for secure execution; I need to follow up on
whether we have already solved those cases. But we can try whether the
current patch helps this particular problem.

If yes, then we can try to speed up the work on this.

Christian

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-25 16:33     ` Christian Borntraeger
@ 2026-02-25 18:30       ` Douglas Freimuth
  2026-03-03  9:15         ` Ciunas Bennett
  0 siblings, 1 reply; 8+ messages in thread
From: Douglas Freimuth @ 2026-02-25 18:30 UTC (permalink / raw)
  To: Christian Borntraeger, Ilya Leoshkevich, Ciunas Bennett,
	Peter Zijlstra, mingo, Thomas Gleixner, Sebastian Andrzej Siewior
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clrkwllms, linux-kernel, linux-rt-devel,
	Linus Torvalds, linux-s390, Matthew Rosato, Hendrik Brueckner



On 2/25/26 11:33 AM, Christian Borntraeger wrote:
> Am 24.02.26 um 21:30 schrieb Ilya Leoshkevich:
>> Finally, what is the worker doing? I looked at __queue_work() kstacks, 
>> and they all come from irqfd_wakeup().
>>
>> irqfd_wakeup() calls arch-specific kvm_arch_set_irq_inatomic(), which 
>> is implemented on x86 and not implemented on s390.
>>
>>
>> This may explain why we on s390 are the first to see this.
>>
>>
>> Christian, do you think it would make sense to
>> implement kvm_arch_set_irq_inatomic() on s390?
> 
> So in fact Doug is working on that at the moment. There are some corner
> cases where we had concerns, as we have to pin the guest pages holding
> the interrupt bits. This was for secure execution; I need to follow up on
> whether we have already solved those cases. But we can try whether the
> current patch helps this particular problem.
> 
> If yes, then we can try to speed up the work on this.
> 
> Christian

Christian, the patch is very close to ready. As the last step, I rebased on
master today to pick up the latest changes to interrupt.c. I am building
that now and will test in non-SE and SE environments. I have been testing
my solution in SE environments for a few weeks and it seems to cover the
use cases I have tested.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-02-25 18:30       ` Douglas Freimuth
@ 2026-03-03  9:15         ` Ciunas Bennett
  2026-03-03 11:52           ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Ciunas Bennett @ 2026-03-03  9:15 UTC (permalink / raw)
  To: Douglas Freimuth, Christian Borntraeger, Ilya Leoshkevich,
	Peter Zijlstra, mingo, Thomas Gleixner, Sebastian Andrzej Siewior
  Cc: juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clrkwllms, linux-kernel, linux-rt-devel,
	Linus Torvalds, linux-s390, Matthew Rosato, Hendrik Brueckner

A quick update on the issue.
Introducing kvm_arch_set_irq_inatomic() appears to make the problem go away on my setup.
That said, this raises the question: why does irqfd_wakeup() behave differently (or poorly) in this scenario compared to the in-atomic IRQ injection path?
Is there a known interaction with workqueues, contexts, or locking that would explain the divergence here?

Observations:
- irqfd_wakeup(): triggers the problematic behaviour.
- Forcing in-atomic IRQ injection (kvm_arch_set_irq_inatomic()): issue not observed.

Peter, do you have thoughts on how the workqueue scheduling context here could differ enough to cause this regression?
Any pointers on what to trace specifically in irqfd_wakeup() and the work item path would be appreciated.
Thanks,
Ciunas Bennett

On 25/02/2026 18:30, Douglas Freimuth wrote:
> 
> Christian, the patch is very close to ready. As the last step, I rebased on master today to pick up the latest changes to interrupt.c. I am building that now and will test in non-SE and SE environments. I have been testing my solution in SE environments for a few weeks and it seems to cover the use cases I have tested.
> 
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] sched: Further restrict the preemption modes
  2026-03-03  9:15         ` Ciunas Bennett
@ 2026-03-03 11:52           ` Peter Zijlstra
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2026-03-03 11:52 UTC (permalink / raw)
  To: Ciunas Bennett
  Cc: Douglas Freimuth, Christian Borntraeger, Ilya Leoshkevich, mingo,
	Thomas Gleixner, Sebastian Andrzej Siewior, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, clrkwllms, linux-kernel, linux-rt-devel, Linus Torvalds,
	linux-s390, Matthew Rosato, Hendrik Brueckner

On Tue, Mar 03, 2026 at 09:15:55AM +0000, Ciunas Bennett wrote:
> A quick update on the issue.
> Introducing kvm_arch_set_irq_inatomic() appears to make the problem go away on my setup.
> That said, this still begs the question: why does irqfd_wakeup behave differently (or poorly) in this scenario compared to the in-atomic IRQ injection path?
> Is there a known interaction with workqueues, contexts, or locking that would explain the divergence here?
> 
> Observations:
> irqfd_wakeup: triggers the problematic behaviour.
> Forcing in-atomic IRQ injection (kvm_arch_set_irq_inatomic): issue not observed.
> 
> @Peter Zijlstra — Peter, do you have thoughts on how the workqueue scheduling context here could differ enough to cause this regression?
> Any pointers on what to trace specifically in irqfd_wakeup and the work item path would be appreciated.

So the thing that LAZY does different from FULL is that it delays
preemption a bit.

This has two ramifications:

1) some ping-pong workloads will turn into block+wakeup, adding
overhead.

 FULL: running your task A, an interrupt would come in, wake task B and
 set Need Resched and the interrupt return path calls schedule() and
 you're task B. B does its thing, 'wakes' A and blocks.

 LAZY: running your task A, an interrupt would come in, wake task B (no
 NR set), you continue running A, A blocks for it needs something of B,
 now you schedule() [*] B runs, does its thing, does an actual wakeup of
 A and blocks.

The distinct difference here is that LAZY does a block of A and
consequently B has to do a full wakeup of A, whereas FULL doesn't do a
block of A, and hence the wakeup of A is NOP as well.


2) Since the schedule() is delayed, it might happen that by the time it
does get around to it, your task B is no longer the most eligible
option.

Same as above, except now, C is also woken, and the schedule marked with
[*] picks C, this then results in a detour, delaying things further.
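The ping-pong difference in (1) can be counted in a toy model (plain Python; deliberately minimal, and B's own block at the end of each round is omitted since it is common to both modes):

```python
def round_trip(mode):
    """Count A's blocks and full wakeups per A<->B hand-off.

    FULL: the interrupt's wakeup of B preempts A immediately; A stays
    runnable, so B's later 'wakeup' of A is a no-op.
    LAZY: the wakeup of B is deferred, A runs on until it blocks waiting
    for B, and B must then do a full wakeup of A.
    """
    if mode == "FULL":
        return {"a_blocks": 0, "full_wakeups": 1}  # irq wakes B
    return {"a_blocks": 1, "full_wakeups": 2}      # irq wakes B, B wakes A

for mode in ("FULL", "LAZY"):
    print(mode, round_trip(mode))
```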



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-03-03 11:52 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20251219101502.GB1132199@noisy.programming.kicks-ass.net>
2026-02-24 15:45 ` [PATCH] sched: Further restrict the preemption modes Ciunas Bennett
2026-02-24 17:11   ` Sebastian Andrzej Siewior
2026-02-25  9:56     ` Ciunas Bennett
2026-02-25  2:30   ` Ilya Leoshkevich
2026-02-25 16:33     ` Christian Borntraeger
2026-02-25 18:30       ` Douglas Freimuth
2026-03-03  9:15         ` Ciunas Bennett
2026-03-03 11:52           ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox