* [PATCH 1/3] KVM: x86: Enhance kvm_vcpu_eligible_for_directed_yield to detect golden targets
[not found] <20260215140402.24659-1-76824143@qq.com>
@ 2026-02-15 14:04 ` 76824143
2026-02-17 15:37 ` Sean Christopherson
2026-02-15 14:04 ` [PATCH 2/3] KVM: x86: Skip IN_GUEST_MODE vCPUs in kvm_vcpu_on_spin main loop 76824143
2026-02-15 14:04 ` [PATCH 3/3] KVM: x86: Use dynamic try count based on vCPU count 76824143
2 siblings, 1 reply; 10+ messages in thread
From: 76824143 @ 2026-02-15 14:04 UTC (permalink / raw)
To: pbonzini; +Cc: kvm, zhanghao
From: zhanghao <zhanghao1@kylinos.cn>
Detect "golden targets" - vCPUs that are preempted and ready.
These are ideal yield targets as they can be immediately scheduled.
This check reduces unnecessary yield attempts to vCPUs that are
unlikely to benefit from directed yield.
Signed-off-by: zhanghao <zhanghao1@kylinos.cn>
---
virt/kvm/kvm_main.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 61dca8d37abc..476ecdb18bdd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3927,6 +3927,9 @@ static bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
bool eligible;
+ if (READ_ONCE(vcpu->preempted) && READ_ONCE(vcpu->mode) == IN_GUEST_MODE)
+ return true;
+
eligible = !vcpu->spin_loop.in_spin_loop ||
vcpu->spin_loop.dy_eligible;
--
2.39.2
* [PATCH 2/3] KVM: x86: Skip IN_GUEST_MODE vCPUs in kvm_vcpu_on_spin main loop
[not found] <20260215140402.24659-1-76824143@qq.com>
2026-02-15 14:04 ` [PATCH 1/3] KVM: x86: Enhance kvm_vcpu_eligible_for_directed_yield to detect golden targets 76824143
@ 2026-02-15 14:04 ` 76824143
2026-02-17 15:41 ` Sean Christopherson
2026-02-15 14:04 ` [PATCH 3/3] KVM: x86: Use dynamic try count based on vCPU count 76824143
2 siblings, 1 reply; 10+ messages in thread
From: 76824143 @ 2026-02-15 14:04 UTC (permalink / raw)
To: pbonzini; +Cc: kvm, zhanghao
From: zhanghao <zhanghao1@kylinos.cn>
Add a check in the kvm_vcpu_on_spin() main loop to skip vCPUs
that are already running in guest mode.
This reduces unnecessary yield_to() calls and VM exits.
Signed-off-by: zhanghao <zhanghao1@kylinos.cn>
---
virt/kvm/kvm_main.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 476ecdb18bdd..663df3a121c8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4026,6 +4026,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
vcpu = xa_load(&kvm->vcpu_array, idx);
if (!READ_ONCE(vcpu->ready))
continue;
+
+ if (READ_ONCE(vcpu->mode) == IN_GUEST_MODE)
+ continue;
+
if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
continue;
--
2.39.2
* [PATCH 3/3] KVM: x86: Use dynamic try count based on vCPU count
[not found] <20260215140402.24659-1-76824143@qq.com>
2026-02-15 14:04 ` [PATCH 1/3] KVM: x86: Enhance kvm_vcpu_eligible_for_directed_yield to detect golden targets 76824143
2026-02-15 14:04 ` [PATCH 2/3] KVM: x86: Skip IN_GUEST_MODE vCPUs in kvm_vcpu_on_spin main loop 76824143
@ 2026-02-15 14:04 ` 76824143
2026-02-17 16:21 ` Sean Christopherson
2 siblings, 1 reply; 10+ messages in thread
From: 76824143 @ 2026-02-15 14:04 UTC (permalink / raw)
To: pbonzini; +Cc: kvm, zhanghao
From: zhanghao <zhanghao1@kylinos.cn>
Replace the fixed try count (3) with a dynamic calculation based
on the number of online vCPUs. This allows larger VMs to try more
candidates before giving up, while keeping small VMs efficient.
Formula: clamp(ilog2(nr_vcpus + 1), 3, 10)
- 4 vCPUs: try = 3
- 64 vCPUs: try = 6
- 256 vCPUs: try = 8
Signed-off-by: zhanghao <zhanghao1@kylinos.cn>
---
virt/kvm/kvm_main.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 663df3a121c8..7f83e434e39a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3984,12 +3984,13 @@ bool __weak kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu)
void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
{
- int nr_vcpus, start, i, idx, yielded;
+ int nr_vcpus, start = 0, i, idx, yielded;
struct kvm *kvm = me->kvm;
struct kvm_vcpu *vcpu;
- int try = 3;
+ int try;
nr_vcpus = atomic_read(&kvm->online_vcpus);
+ try = clamp(ilog2(nr_vcpus + 1), 3, 10);
if (nr_vcpus < 2)
return;
--
2.39.2
* Re: [PATCH 1/3] KVM: x86: Enhance kvm_vcpu_eligible_for_directed_yield to detect golden targets
2026-02-15 14:04 ` [PATCH 1/3] KVM: x86: Enhance kvm_vcpu_eligible_for_directed_yield to detect golden targets 76824143
@ 2026-02-17 15:37 ` Sean Christopherson
0 siblings, 0 replies; 10+ messages in thread
From: Sean Christopherson @ 2026-02-17 15:37 UTC (permalink / raw)
To: 76824143; +Cc: pbonzini, kvm, zhanghao
On Sun, Feb 15, 2026, 76824143@qq.com wrote:
> From: zhanghao <zhanghao1@kylinos.cn>
>
> Detect "golden targets" - vCPUs that are preempted and ready.
> These are ideal yield targets as they can be immediately scheduled.
>
> This check reduces unnecessary yield attempts to vCPUs that are
> unlikely to benefit from directed yield.
>
> Signed-off-by: zhanghao <zhanghao1@kylinos.cn>
> ---
> virt/kvm/kvm_main.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 61dca8d37abc..476ecdb18bdd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3927,6 +3927,9 @@ static bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
> #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
> bool eligible;
>
> + if (READ_ONCE(vcpu->preempted) && READ_ONCE(vcpu->mode) == IN_GUEST_MODE)
This is nonsensical. It should be impossible for a vCPU to be preempted while
IN_GUEST_MODE is true. Even if a host IRQ arrives while the guest is active,
KVM should set vcpu->mode back to OUTSIDE_GUEST_MODE prior to servicing the IRQ.
Even more confusing, the next patch explicitly rejects IN_GUEST_MODE vCPUs from
kvm_vcpu_on_spin().
> + return true;
> +
> eligible = !vcpu->spin_loop.in_spin_loop ||
> vcpu->spin_loop.dy_eligible;
>
> --
> 2.39.2
>
* Re: [PATCH 2/3] KVM: x86: Skip IN_GUEST_MODE vCPUs in kvm_vcpu_on_spin main loop
2026-02-15 14:04 ` [PATCH 2/3] KVM: x86: Skip IN_GUEST_MODE vCPUs in kvm_vcpu_on_spin main loop 76824143
@ 2026-02-17 15:41 ` Sean Christopherson
2026-03-12 5:57 ` zhanghao
0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2026-02-17 15:41 UTC (permalink / raw)
To: 76824143; +Cc: pbonzini, kvm, zhanghao
On Sun, Feb 15, 2026, 76824143@qq.com wrote:
> From: zhanghao <zhanghao1@kylinos.cn>
>
> Add a check in the kvm_vcpu_on_spin() main loop to skip vCPUs
> that are already running in guest mode.
>
> Reduces unnecessary yield_to() calls and VM exits.
>
> Signed-off-by: zhanghao <zhanghao1@kylinos.cn>
> ---
> virt/kvm/kvm_main.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 476ecdb18bdd..663df3a121c8 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4026,6 +4026,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> vcpu = xa_load(&kvm->vcpu_array, idx);
> if (!READ_ONCE(vcpu->ready))
> continue;
> +
> + if (READ_ONCE(vcpu->mode) == IN_GUEST_MODE)
> + continue;
This should generally not happen, as vcpu->ready should only be true when a vCPU
is scheduled out. Although it does look like there's a race in kvm_vcpu_wake_up()
where vcpu->ready could be left %true, e.g. if the task was delyed or preempted
after __kvm_vcpu_wake_up(), before the "WRITE_ONCE(vcpu->ready, true)". Not sure
how best to handle that scenario.
> +
> if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
> continue;
>
> --
> 2.39.2
>
* Re: [PATCH 3/3] KVM: x86: Use dynamic try count based on vCPU count
2026-02-15 14:04 ` [PATCH 3/3] KVM: x86: Use dynamic try count based on vCPU count 76824143
@ 2026-02-17 16:21 ` Sean Christopherson
2026-03-12 6:24 ` zhanghao
0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2026-02-17 16:21 UTC (permalink / raw)
To: 76824143; +Cc: pbonzini, kvm, zhanghao
On Sun, Feb 15, 2026, 76824143@qq.com wrote:
> From: zhanghao <zhanghao1@kylinos.cn>
>
> Replace the fixed try count (3) with a dynamic calculation based
> on the number of online vCPUs. This allows larger VMs to try more
> candidates before giving up, while keeping small VMs efficient.
>
> Formula: clamp(ilog2(nr_vcpus + 1), 3, 10)
> - 4 vCPUs: try = 3
> - 64 vCPUs: try = 6
> - 256 vCPUs: try = 8
Why do larger VMs warrant more attempts though? E.g. what are practical downsides
of trying min(nr_vcpus - 1, 8) times?
* Re: [PATCH 2/3] KVM: x86: Skip IN_GUEST_MODE vCPUs in kvm_vcpu_on_spin main loop
2026-02-17 15:41 ` Sean Christopherson
@ 2026-03-12 5:57 ` zhanghao
2026-03-13 1:02 ` Sean Christopherson
0 siblings, 1 reply; 10+ messages in thread
From: zhanghao @ 2026-03-12 5:57 UTC (permalink / raw)
To: Sean Christopherson; +Cc: pbonzini, kvm, zhanghao
On Tue, Feb 17, 2026, Sean Christopherson wrote:
> On Sun, Feb 15, 2026, 76824143@qq.com wrote:
> > From: zhanghao <zhanghao1@kylinos.cn>
> >
> > Add a check in the kvm_vcpu_on_spin() main loop to skip vCPUs
> > that are already running in guest mode.
> >
> > Reduces unnecessary yield_to() calls and VM exits.
> >
> > Signed-off-by: zhanghao <zhanghao1@kylinos.cn>
> > ---
> > virt/kvm/kvm_main.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 476ecdb18bdd..663df3a121c8 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4026,6 +4026,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> > vcpu = xa_load(&kvm->vcpu_array, idx);
> > if (!READ_ONCE(vcpu->ready))
> > continue;
> > +
> > + if (READ_ONCE(vcpu->mode) == IN_GUEST_MODE)
> > + continue;
>
> This should generally not happen, as vcpu->ready should only be true when a vCPU
> is scheduled out. Although it does look like there's a race in kvm_vcpu_wake_up()
> where vcpu->ready could be left %true, e.g. if the task was delayed or preempted
> after __kvm_vcpu_wake_up(), before the "WRITE_ONCE(vcpu->ready, true)". Not sure
> how best to handle that scenario.
>
> > +
> > if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
> > continue;
> >
> > --
> > 2.39.2
> >
Thank you for reviewing this patch. As pointed out in the discussion,
this addresses a race condition where kvm_vcpu_wake_up() can set
"ready=true" while kvm_sched_in() is concurrently setting the vCPU to
running state. Without this check, we might attempt to yield to a vCPU
that is already running, which is futile.
I'd like to provide additional context on the two new checks in
kvm_vcpu_on_spin() and request your feedback on the implementation.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1bc1da66b4b0..20d56c0479c8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4007,6 +4007,23 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
vcpu = xa_load(&kvm->vcpu_array, idx);
if (!READ_ONCE(vcpu->ready))
continue;
+
+ /*
+ * kvm_vcpu_wake_up() can race with kvm_sched_in() and leave @ready
+ * true for a running vCPU. Filter out vCPUs that are not currently
+ * scheduled out to avoid futile directed-yield attempts.
+ */
+ if (!READ_ONCE(vcpu->scheduled_out))
+ continue;
+
+ /*
+ * Additionally, skip vCPUs that are currently in guest mode.
+ * A vCPU in IN_GUEST_MODE cannot be yielded to until it exits
+ * to host mode.
+ */
+ if (READ_ONCE(vcpu->mode) == IN_GUEST_MODE)
+ continue;
+
if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
continue;
The scheduled_out check addresses the race between kvm_vcpu_wake_up() and
kvm_sched_in(). The IN_GUEST_MODE check ensures we only yield to vCPUs
that can actually be scheduled.
The implementation has been tested with a sysbench mutex
workload (16 threads, 32 vCPUs):
| Metric | Before | After | Improvement |
| :-------------------------- | -------: | ------: | :---------- |
| Directed-yield success rate | 0.0113% | 2.8169% | 249x |
| Mutex latency (sysbench) | 1,658 ms | 162 ms | -90.2% |
| Futile yield attempts | 17,744 | 69 | -99.6% |
I'm happy to make any additional changes based on your suggestions.
Thank you for your time and guidance.
Best regards,
zhanghao
* Re: [PATCH 3/3] KVM: x86: Use dynamic try count based on vCPU count
2026-02-17 16:21 ` Sean Christopherson
@ 2026-03-12 6:24 ` zhanghao
0 siblings, 0 replies; 10+ messages in thread
From: zhanghao @ 2026-03-12 6:24 UTC (permalink / raw)
To: Sean Christopherson; +Cc: pbonzini, kvm, zhanghao
On Tue, Feb 17, 2026, Sean Christopherson wrote:
> On Sun, Feb 15, 2026, 76824143@qq.com wrote:
> > From: zhanghao <zhanghao1@kylinos.cn>
> >
> > Replace the fixed try count (3) with a dynamic calculation based
> > on the number of online vCPUs. This allows larger VMs to try more
> > candidates before giving up, while keeping small VMs efficient.
> >
> > Formula: clamp(ilog2(nr_vcpus + 1), 3, 10)
> > - 4 vCPUs: try = 3
> > - 64 vCPUs: try = 6
> > - 256 vCPUs: try = 8
>
> Why do larger VMs warrant more attempts though? E.g. what are practical downsides
> of trying min(nr_vcpus - 1, 8) times?
Larger VMs need more attempts because with more vCPUs, the probability
of randomly selecting a "good" yield target decreases.
Consider this scenario:
4 vCPUs: 1 lock holder + 3 waiters. Random pick has 33% chance of finding the holder.
64 vCPUs: 1 lock holder + 63 waiters. Random pick has only 1.6% chance of finding the holder.
With 64 vCPUs, you need ~20x more attempts to have the same probability of success as with 4 vCPUs.
Given your feedback, I'm open to changing to the simpler formula:
try = min(nr_vcpus - 1, 8);
The performance difference is minimal in practice, and code clarity is
valuable.
Best regards,
zhanghao
* Re: [PATCH 2/3] KVM: x86: Skip IN_GUEST_MODE vCPUs in kvm_vcpu_on_spin main loop
2026-03-12 5:57 ` zhanghao
@ 2026-03-13 1:02 ` Sean Christopherson
2026-03-19 8:40 ` zhanghao
0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2026-03-13 1:02 UTC (permalink / raw)
To: zhanghao; +Cc: pbonzini, kvm, zhanghao
On Thu, Mar 12, 2026, zhanghao wrote:
> On Tue, Feb 17, 2026, Sean Christopherson wrote:
> > On Sun, Feb 15, 2026, 76824143@qq.com wrote:
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 476ecdb18bdd..663df3a121c8 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -4026,6 +4026,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> > > vcpu = xa_load(&kvm->vcpu_array, idx);
> > > if (!READ_ONCE(vcpu->ready))
> > > continue;
> > > +
> > > + if (READ_ONCE(vcpu->mode) == IN_GUEST_MODE)
> > > + continue;
> >
> > This should generally not happen, as vcpu->ready should only be true when a vCPU
> > is scheduled out. Although it does look like there's a race in kvm_vcpu_wake_up()
> > where vcpu->ready could be left %true, e.g. if the task was delayed or preempted
> > after __kvm_vcpu_wake_up(), before the "WRITE_ONCE(vcpu->ready, true)". Not sure
> > how best to handle that scenario.
> >
> > > +
> > > if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
> > > continue;
> > >
> > > --
> > > 2.39.2
> > >
>
> Thank you for reviewing this patch. As pointed out in the discussion,
> this addresses a race condition where kvm_vcpu_wake_up() can set
> "ready=true" while kvm_sched_in() is concurrently setting the vCPU to
> running state. Without this check, we might attempt to yield to a vCPU
> that is already running, which is futile.
Right, but I want to solve this by eliminating the race, not by papering over
the issue in kvm_vcpu_on_spin(). I don't see a sane way of handling the cross-vCPU
writes without atomics, and I'd prefer to avoid the complexity that comes with
that.
What if instead of having the waker set vcpu->ready, we remove it entirely and
instead do a best-effort detection of the "vCPU was blocking, is now awake, but
probably hasn't been scheduled in yet". We can use a combination of flags to
detect that, and thanks to vcpu->scheduled_out, I'm pretty sure the false positive
rate would be on par with the existing vcpu->ready check (for the preempted case).
The idea is to identify the case where the vCPU is in the blocking sequence, but
not actually blocking, wants to run, and is scheduled out.
Compile tested only (and I don't love the name of the helper).
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0b5d48e75b65..ac849d879b73 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10393,7 +10393,7 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
rcu_read_unlock();
- if (!target || !READ_ONCE(target->ready))
+ if (!target || !kvm_vcpu_is_runnable_and_scheduled_out(target))
goto no_yield;
/* Ignore requests to yield to self */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 581b217abc33..70739179af58 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1746,6 +1746,15 @@ static inline bool kvm_vcpu_is_blocking(struct kvm_vcpu *vcpu)
return rcuwait_active(kvm_arch_vcpu_get_wait(vcpu));
}
+static inline bool kvm_vcpu_is_runnable_and_scheduled_out(struct kvm_vcpu *vcpu)
+{
+ return READ_ONCE(vcpu->preempted) ||
+ (READ_ONCE(vcpu->scheduled_out) &&
+ READ_ONCE(vcpu->wants_to_run) &&
+ READ_ONCE(vcpu->stat.generic.blocking) &&
+ !kvm_vcpu_is_blocking(vcpu));
+}
+
#ifdef __KVM_HAVE_ARCH_INTC_INITIALIZED
/*
* returns true if the virtual interrupt controller is initialized and
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9faf70ccae7a..9f71e32daac5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -455,7 +455,6 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
kvm_vcpu_set_in_spin_loop(vcpu, false);
kvm_vcpu_set_dy_eligible(vcpu, false);
vcpu->preempted = false;
- vcpu->ready = false;
preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
vcpu->last_used_slot = NULL;
@@ -3803,7 +3802,6 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_vcpu_halt);
bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu)
{
if (__kvm_vcpu_wake_up(vcpu)) {
- WRITE_ONCE(vcpu->ready, true);
++vcpu->stat.generic.halt_wakeup;
return true;
}
@@ -4008,7 +4006,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
continue;
vcpu = xa_load(&kvm->vcpu_array, idx);
- if (!READ_ONCE(vcpu->ready))
+ if (!kvm_vcpu_is_runnable_and_scheduled_out(vcpu))
continue;
if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
continue;
@@ -6393,7 +6391,6 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
WRITE_ONCE(vcpu->preempted, false);
- WRITE_ONCE(vcpu->ready, false);
__this_cpu_write(kvm_running_vcpu, vcpu);
kvm_arch_vcpu_load(vcpu, cpu);
@@ -6408,10 +6405,9 @@ static void kvm_sched_out(struct preempt_notifier *pn,
WRITE_ONCE(vcpu->scheduled_out, true);
- if (task_is_runnable(current) && vcpu->wants_to_run) {
+ if (task_is_runnable(current) && vcpu->wants_to_run)
WRITE_ONCE(vcpu->preempted, true);
- WRITE_ONCE(vcpu->ready, true);
- }
+
kvm_arch_vcpu_put(vcpu);
__this_cpu_write(kvm_running_vcpu, NULL);
}
* Re: [PATCH 2/3] KVM: x86: Skip IN_GUEST_MODE vCPUs in kvm_vcpu_on_spin main loop
2026-03-13 1:02 ` Sean Christopherson
@ 2026-03-19 8:40 ` zhanghao
0 siblings, 0 replies; 10+ messages in thread
From: zhanghao @ 2026-03-19 8:40 UTC (permalink / raw)
To: Sean Christopherson; +Cc: pbonzini, kvm, zhanghao
On Thu, Mar 12, 2026, Sean Christopherson wrote:
> On Thu, Mar 12, 2026, zhanghao wrote:
> > On Tue, Feb 17, 2026, Sean Christopherson wrote:
> > > On Sun, Feb 15, 2026, 76824143@qq.com wrote:
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index 476ecdb18bdd..663df3a121c8 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -4026,6 +4026,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> > > > vcpu = xa_load(&kvm->vcpu_array, idx);
> > > > if (!READ_ONCE(vcpu->ready))
> > > > continue;
> > > > +
> > > > + if (READ_ONCE(vcpu->mode) == IN_GUEST_MODE)
> > > > + continue;
> > >
> > > This should generally not happen, as vcpu->ready should only be true when a vCPU
> > > is scheduled out. Although it does look like there's a race in kvm_vcpu_wake_up()
> > > where vcpu->ready could be left %true, e.g. if the task was delayed or preempted
> > > after __kvm_vcpu_wake_up(), before the "WRITE_ONCE(vcpu->ready, true)". Not sure
> > > how best to handle that scenario.
> > >
> > > > +
> > > > if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
> > > > continue;
> > > >
> > > > --
> > > > 2.39.2
> > > >
> >
> > Thank you for reviewing this patch. As pointed out in the discussion,
> > this addresses a race condition where kvm_vcpu_wake_up() can set
> > "ready=true" while kvm_sched_in() is concurrently setting the vCPU to
> > running state. Without this check, we might attempt to yield to a vCPU
> > that is already running, which is futile.
>
> Right, but I want to solve this by eliminating the race, not by papering over
> the issue in kvm_vcpu_on_spin(). I don't see a sane way of handling the cross-vCPU
> writes without atomics, and I'd prefer to avoid the complexity that comes with
> that.
>
> What if instead of having the waker set vcpu->ready, we remove it entirely and
> instead do a best-effort detection of the "vCPU was blocking, is now awake, but
> probably hasn't been scheduled in yet". We can use a combination of flags to
> detect that, and thanks to vcpu->scheduled_out, I'm pretty sure the false positive
> rate would be on par with the existing vcpu->ready check (for the preempted case).
>
> The idea is to identify the case where the vCPU is in the blocking sequence, but
> not actually blocking, wants to run, and is scheduled out.
>
> Compile tested only (and I don't love the name of the helper).
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0b5d48e75b65..ac849d879b73 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10393,7 +10393,7 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
>
> rcu_read_unlock();
>
> - if (!target || !READ_ONCE(target->ready))
> + if (!target || !kvm_vcpu_is_runnable_and_scheduled_out(target))
> goto no_yield;
>
> /* Ignore requests to yield to self */
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 581b217abc33..70739179af58 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1746,6 +1746,15 @@ static inline bool kvm_vcpu_is_blocking(struct kvm_vcpu *vcpu)
> return rcuwait_active(kvm_arch_vcpu_get_wait(vcpu));
> }
>
> +static inline bool kvm_vcpu_is_runnable_and_scheduled_out(struct kvm_vcpu *vcpu)
> +{
> + return READ_ONCE(vcpu->preempted) ||
> + (READ_ONCE(vcpu->scheduled_out) &&
> + READ_ONCE(vcpu->wants_to_run) &&
> + READ_ONCE(vcpu->stat.generic.blocking) &&
> + !kvm_vcpu_is_blocking(vcpu));
> +}
> +
> #ifdef __KVM_HAVE_ARCH_INTC_INITIALIZED
> /*
> * returns true if the virtual interrupt controller is initialized and
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9faf70ccae7a..9f71e32daac5 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -455,7 +455,6 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> kvm_vcpu_set_in_spin_loop(vcpu, false);
> kvm_vcpu_set_dy_eligible(vcpu, false);
> vcpu->preempted = false;
> - vcpu->ready = false;
> preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
> vcpu->last_used_slot = NULL;
>
> @@ -3803,7 +3802,6 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_vcpu_halt);
> bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu)
> {
> if (__kvm_vcpu_wake_up(vcpu)) {
> - WRITE_ONCE(vcpu->ready, true);
> ++vcpu->stat.generic.halt_wakeup;
> return true;
> }
> @@ -4008,7 +4006,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> continue;
>
> vcpu = xa_load(&kvm->vcpu_array, idx);
> - if (!READ_ONCE(vcpu->ready))
> + if (!kvm_vcpu_is_runnable_and_scheduled_out(vcpu))
> continue;
> if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
> continue;
> @@ -6393,7 +6391,6 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
> struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
>
> WRITE_ONCE(vcpu->preempted, false);
> - WRITE_ONCE(vcpu->ready, false);
>
> __this_cpu_write(kvm_running_vcpu, vcpu);
> kvm_arch_vcpu_load(vcpu, cpu);
> @@ -6408,10 +6405,9 @@ static void kvm_sched_out(struct preempt_notifier *pn,
>
> WRITE_ONCE(vcpu->scheduled_out, true);
>
> - if (task_is_runnable(current) && vcpu->wants_to_run) {
> + if (task_is_runnable(current) && vcpu->wants_to_run)
> WRITE_ONCE(vcpu->preempted, true);
> - WRITE_ONCE(vcpu->ready, true);
> - }
> +
> kvm_arch_vcpu_put(vcpu);
> __this_cpu_write(kvm_running_vcpu, NULL);
> }
Thank you for your detailed review and the alternative implementation.
I have tested both approaches and would like to share my findings.
I conducted comprehensive benchmarks using stress-ng memory pressure tests (10 rounds, 60 seconds each)
on a Proxmox VE host with Intel Xeon E5-2680 v4 processors:
| Version | Attempted | Successful | Success Rate |
|---------|-----------|------------|--------------|
| Native | 25,413 | 63 | 0.24% |
| My patch | 15,040 | 70 | 0.46% |
| Your patch | 19,940 | 9 | 0.05% |
I have tested repeatedly over multiple rounds; the number of attempts fluctuates, but
the data is roughly as shown in the table above. Memory bandwidth remains
consistent across all versions (~48-52 GB/s).
Your approach (removing vcpu->ready) successfully eliminates the race condition.
The four READ_ONCE checks in kvm_vcpu_is_runnable_and_scheduled_out() are more
robust, but the conditions seem overly strict, resulting in a low success rate,
and the number of attempts has not decreased.
Thank you for your valuable feedback and the thorough review.
Best regards,
zhanghao