Re: [PATCH] sched: Further restrict the preemption modes

public inbox for linux-s390@vger.kernel.org

From: Ilya Leoshkevich <iii@linux.ibm.com>
To: Ciunas Bennett <ciunas@linux.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	mingo@kernel.org, Thomas Gleixner <tglx@linutronix.de>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
	clrkwllms@kernel.org, linux-kernel@vger.kernel.org,
	linux-rt-devel@lists.linux.dev,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-s390@vger.kernel.org
Subject: Re: [PATCH] sched: Further restrict the preemption modes
Date: Wed, 25 Feb 2026 03:30:04 +0100
Message-ID: <a7180379-04f5-4f61-b60a-0ff7cf85134d@linux.ibm.com>
In-Reply-To: <182f110b-ac63-4db4-8b01-0e841639bc39@linux.ibm.com>

On 2/24/26 16:45, Ciunas Bennett wrote:
>
>
> On 19/12/2025 10:15, Peter Zijlstra wrote:
>
>
> Hi Peter,
> We are observing a performance regression on s390 since enabling 
> PREEMPT_LAZY.
> Test Environment
> Architecture: s390
> Setup:
>
> Single KVM host running two identical guests
> Guests are connected virtually via Open vSwitch
> Workload: uperf streaming read test with 50 parallel connections
> One guest acts as the uperf client, the other as the server
>
> Open vSwitch configuration:
>
> OVS bridge with two ports
> Guests attached via virtio‑net
> Each guest configured with 4 vhost‑queues
>
> Problem Description
> When comparing PREEMPT_LAZY against full PREEMPT, we see a substantial 
> drop in throughput—on some systems up to 50%.
>
> Observed Behaviour
> By tracing packets inside Open vSwitch (ovs_do_execute_action), we see:
> Packet drops
> Retransmissions
> Reductions in packet size (from 64K down to 32K)
>
> Capturing traffic inside the VM and inspecting it in Wireshark shows 
> the following TCP‑level differences between PREEMPT_FULL and 
> PREEMPT_LAZY:
> |--------------------------------------+--------------+--------------+------------------|
> | Wireshark Warning / Note             | PREEMPT_FULL | PREEMPT_LAZY | (lazy vs full)   |
> |--------------------------------------+--------------+--------------+------------------|
> | D-SACK Sequence                      |          309 |         2603 | ×8.4             |
> | Partial Acknowledgement of a segment |           54 |          279 | ×5.2             |
> | Ambiguous ACK (Karn)                 |           32 |          747 | ×23              |
> | (Suspected) spurious retransmission  |          205 |          857 | ×4.2             |
> | (Suspected) fast retransmission      |           54 |         1622 | ×30              |
> | Duplicate ACK                        |          504 |         3446 | ×6.8             |
> | Packet length exceeds MSS (TSO/GRO)  |        13172 |        34790 | ×2.6             |
> | Previous segment(s) not captured     |         9205 |         6730 | -27%             |
> | ACKed segment that wasn't captured   |         7022 |         8272 | +18%             |
> | (Suspected) out-of-order segment     |          436 |          303 | -31%             |
> |--------------------------------------+--------------+--------------+------------------|
>
> This pattern indicates reordering, loss, or scheduling‑related delays, 
> but it is still unclear why PREEMPT_LAZY is causing this behaviour in 
> this workload.
>
> Additional observations:
>
> Monitoring the guest CPU run time shows that it drops from 16% with 
> PREEMPT_FULL to 9% with PREEMPT_LAZY.
>
> The workload is dominated by voluntary preemption (schedule()), and 
> PREEMPT_LAZY is, as far as I understand, mainly concerned with forced 
> preemption.
> It is therefore not obvious why PREEMPT_LAZY has an impact here.
>
> Changing guest configuration to disable mergeable RX buffers:
>       <host mrg_rxbuf="off"/>
>       had a clear effect on throughput:
>       PREEMPT_LAZY: throughput improved from 40 Gb/s → 60 Gb/s 


When I look at the top sched_switch kstacks on s390 with this workload,
20% of them are worker_thread() -> schedule(), both with CONFIG_PREEMPT
and CONFIG_PREEMPT_LAZY. The others are vhost and idle.

On x86 I see only vhost and idle, but not worker_thread().


According to runqlat.bt, average run queue latency goes up from 4us to 
18us when switching from CONFIG_PREEMPT to CONFIG_PREEMPT_LAZY.

I modified the script to show per-comm latencies, and it shows 
that worker_thread() is disproportionately penalized: the latency 
increases from 2us to 60us!

For vhost it's better: 5us -> 2us, and for KVM it's better too: 8us -> 2us.
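
To make the per-comm breakdown concrete, here is a minimal user-space
sketch in C of the aggregation idea: run queue latency (wakeup to
switch-in) is accumulated per task name rather than globally. This is
not the modified runqlat.bt itself; the struct and function names
(lat_table, lat_record, lat_avg_us) and the table sizes are made up
for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_COMMS 16   /* hypothetical table size */
#define COMM_LEN  16   /* same length as the kernel's TASK_COMM_LEN */

struct comm_lat {
	char comm[COMM_LEN];
	uint64_t total_ns;  /* sum of wakeup -> switch-in delays */
	uint64_t samples;
};

struct lat_table {
	struct comm_lat e[MAX_COMMS];
	int used;
};

/* Account one run queue delay to the task name that was switched in.
 * Returns 0 on success, -1 if the table is full. */
static int lat_record(struct lat_table *t, const char *comm,
		      uint64_t wakeup_ns, uint64_t switch_in_ns)
{
	char key[COMM_LEN];
	int i;

	/* Truncate the name the same way for store and lookup. */
	snprintf(key, sizeof(key), "%s", comm);

	for (i = 0; i < t->used; i++)
		if (strcmp(t->e[i].comm, key) == 0)
			break;
	if (i == t->used) {
		if (t->used == MAX_COMMS)
			return -1;
		memcpy(t->e[i].comm, key, sizeof(key));
		t->e[i].total_ns = 0;
		t->e[i].samples = 0;
		t->used++;
	}
	t->e[i].total_ns += switch_in_ns - wakeup_ns;
	t->e[i].samples++;
	return 0;
}

/* Average run queue latency in microseconds, 0 if comm was never seen. */
static uint64_t lat_avg_us(const struct lat_table *t, const char *comm)
{
	char key[COMM_LEN];
	int i;

	snprintf(key, sizeof(key), "%s", comm);
	for (i = 0; i < t->used; i++)
		if (strcmp(t->e[i].comm, key) == 0 && t->e[i].samples)
			return t->e[i].total_ns / t->e[i].samples / 1000;
	return 0;
}
```

A single global average hides exactly the case seen here, where one
task class gets much worse while the others get better.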


Finally, what is the worker doing? I looked at __queue_work() kstacks, 
and they all come from irqfd_wakeup().

irqfd_wakeup() calls the arch-specific kvm_arch_set_irq_inatomic(), which
is implemented on x86 but not on s390.


This may explain why we on s390 are the first to see this.
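
The suspected mechanism can be modelled in user space. The sketch
below assumes, based on my reading of virt/kvm/eventfd.c, that
irqfd_wakeup() queues work whenever the arch hook returns
-EWOULDBLOCK, and that an arch without its own implementation always
gets that return value. Everything here (irqfd_model,
irqfd_wakeup_model, the two hook stand-ins) is a hypothetical model,
not kernel code.

```c
#include <errno.h>

/* Hypothetical stand-in for the per-arch hook: returns 0 if the
 * interrupt was injected from atomic context, or -EWOULDBLOCK if the
 * caller has to defer the injection. */
typedef int (*set_irq_inatomic_fn)(int gsi);

struct irqfd_model {
	set_irq_inatomic_fn set_irq_inatomic;
	unsigned int injected_inline;     /* delivered directly from the wakeup */
	unsigned int deferred_to_worker;  /* handed to a workqueue worker */
};

/* Arch without a fast path: every injection must be deferred. */
static int inatomic_unimplemented(int gsi)
{
	(void)gsi;
	return -EWOULDBLOCK;
}

/* Arch with a fast path: inject immediately (assumed to always succeed
 * in this model; a real implementation may still defer some cases). */
static int inatomic_fast(int gsi)
{
	(void)gsi;
	return 0;
}

/* Model of the wakeup path: try the atomic hook first and fall back to
 * a worker only when it reports -EWOULDBLOCK. */
static void irqfd_wakeup_model(struct irqfd_model *m, int gsi)
{
	if (m->set_irq_inatomic(gsi) == -EWOULDBLOCK)
		m->deferred_to_worker++;  /* the kernel would queue work here */
	else
		m->injected_inline++;
}
```

In this model, every interrupt on the arch without a fast path depends
on a worker being scheduled promptly, which is where the run queue
latency above would hurt.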


Christian, do you think it would make sense to implement
kvm_arch_set_irq_inatomic() on s390?

Thread overview: 8+ messages
     [not found] <20251219101502.GB1132199@noisy.programming.kicks-ass.net>
2026-02-24 15:45 ` [PATCH] sched: Further restrict the preemption modes Ciunas Bennett
2026-02-24 17:11   ` Sebastian Andrzej Siewior
2026-02-25  9:56     ` Ciunas Bennett
2026-02-25  2:30   ` Ilya Leoshkevich [this message]
2026-02-25 16:33     ` Christian Borntraeger
2026-02-25 18:30       ` Douglas Freimuth
2026-03-03  9:15         ` Ciunas Bennett
2026-03-03 11:52           ` Peter Zijlstra
