From: Wanpeng Li <kernellwp@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Paolo Bonzini <pbonzini@redhat.com>,
Sean Christopherson <seanjc@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>,
Christian Borntraeger <borntraeger@linux.ibm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Vincent Guittot <vincent.guittot@linaro.org>,
Juri Lelli <juri.lelli@redhat.com>,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
Wanpeng Li <wanpengli@tencent.com>,
Richie Buturla <richie@linux.ibm.com>
Subject: [PATCH v3 06/10] KVM: x86: Add IPI tracking infrastructure for directed yield
Date: Fri, 12 Jun 2026 09:33:51 +0800
Message-ID: <20260612013355.59231-7-kernellwp@gmail.com>
In-Reply-To: <20260612013355.59231-1-kernellwp@gmail.com>

From: Wanpeng Li <wanpengli@tencent.com>

On overcommitted hosts, a vCPU spinning on an IPI response is difficult
to distinguish from a vCPU spinning on a lock. kvm_vcpu_on_spin() can
therefore yield to an unrelated vCPU based only on coarse preemption
state.

Add per-vCPU IPI tracking for directed yield. struct kvm_vcpu_arch now
records the last sender and receiver vCPU indexes, the vector, a pending
flag, and a monotonic timestamp. Add helpers to record a send, query
whether a vCPU is the recent IPI receiver of another vCPU, and clear or
reset the context. Accesses use READ_ONCE() and WRITE_ONCE() because the
state is only a best-effort scheduling hint.

Add module parameters to enable tracking and to control the recency
window. Provide a weak generic kvm_vcpu_is_ipi_receiver() stub so
non-x86 builds keep the existing behavior. The state is reset on vCPU
create and destroy, and cleared on every kvm_vcpu_reset(), RESET and
INIT alike.

This adds only state and helpers; directed-yield candidate selection is
unchanged.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
arch/x86/include/asm/kvm_host.h | 19 ++++++
arch/x86/kvm/lapic.c | 102 ++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 3 +
arch/x86/kvm/x86.h | 8 +++
include/linux/kvm_host.h | 8 +++
virt/kvm/kvm_main.c | 6 ++
6 files changed, 146 insertions(+)
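
Illustration only, not part of the patch: a minimal user-space sketch of
the record/query/clear flow described in the changelog, under a few
stated assumptions. It is single-threaded, so plain loads and stores
stand in for READ_ONCE()/WRITE_ONCE(); the caller passes the timestamp
instead of calling ktime_get_mono_fast_ns(); and the window is a fixed
constant rather than the ipi_window_ns module parameter. The names
ipi_ctx, MODEL_WINDOW_NS, and model_*() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_WINDOW_NS (50ULL * 1000 * 1000)	/* mirrors the 50 ms default */

/* Hypothetical stand-in for vcpu->arch.ipi_context, indexed by vCPU idx. */
struct ipi_ctx {
	int last_ipi_sender;
	int last_ipi_receiver;
	uint8_t vector;
	bool pending_ipi;
	uint64_t ipi_time_ns;
};

/* Record sender -> receiver; the caller supplies the timestamp. */
static void model_track(struct ipi_ctx *ctx, int sender, int receiver,
			uint8_t vector, uint64_t now_ns)
{
	if (sender == receiver)
		return;

	ctx[sender].last_ipi_receiver = receiver;
	ctx[sender].vector = vector;
	ctx[sender].pending_ipi = true;
	ctx[sender].ipi_time_ns = now_ns;

	ctx[receiver].last_ipi_sender = sender;
	ctx[receiver].vector = vector;
}

/* True only if the send is pending, names @receiver, and is still recent. */
static bool model_is_receiver(const struct ipi_ctx *ctx, int sender,
			      int receiver, uint64_t now_ns)
{
	if (!ctx[sender].pending_ipi)
		return false;
	if (ctx[sender].last_ipi_receiver != receiver)
		return false;
	/* Unsigned subtraction: a timestamp ahead of now_ns wraps and fails. */
	return now_ns - ctx[sender].ipi_time_ns <= MODEL_WINDOW_NS;
}

/* Soft clear: drops the relationship but keeps the timestamp. */
static void model_clear(struct ipi_ctx *c)
{
	c->pending_ipi = false;
	c->last_ipi_sender = -1;
	c->last_ipi_receiver = -1;
	c->vector = 0;
}

/* Hard reset: soft clear plus a zeroed timestamp. */
static void model_reset(struct ipi_ctx *c)
{
	model_clear(c);
	c->ipi_time_ns = 0;
}
```

In this model only the sender's slot carries the pending flag and the
timestamp, so a query with the roles swapped returns false, and the
window comparison is inclusive at exactly MODEL_WINDOW_NS.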
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f14009f25a3b..a26623716a53 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1065,6 +1065,25 @@ struct kvm_vcpu_arch {
int pending_external_vector;
int highest_stale_pending_ioapic_eoi;
+ /*
+ * IPI tracking for directed-yield optimization.
+ *
+ * Populated by kvm_track_ipi_communication() when a unicast fixed
+ * IPI is delivered, and queried by kvm_vcpu_is_ipi_receiver() from
+ * kvm_vcpu_on_spin() to prefer the confirmed IPI target before
+ * generic preempted-lock-holder heuristics.
+ *
+ * All accesses are lockless READ_ONCE/WRITE_ONCE; best-effort by
+ * design (see comment on kvm_vcpu_is_good_yield_candidate()).
+ */
+ struct {
+ int last_ipi_sender; /* vCPU idx of last IPI sender */
+ int last_ipi_receiver; /* vCPU idx of last IPI target */
+ u8 vector; /* vector of the pending IPI */
+ bool pending_ipi; /* awaiting IPI response */
+ u64 ipi_time_ns; /* mono timestamp of IPI send */
+ } ipi_context;
+
/* be preempted when it's in kernel-mode(cpl=0) */
bool preempted_in_kernel;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 4078e624ca66..515409e0e22c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -78,6 +78,29 @@ module_param(lapic_timer_advance, bool, 0444);
static bool __read_mostly vector_hashing_enabled = true;
module_param_named(vector_hashing, vector_hashing_enabled, bool, 0444);
+/*
+ * IPI tracking for directed-yield optimization.
+ *
+ * ipi_tracking_enabled - master switch (default on). When off, the
+ * tracking hooks become no-ops and
+ * kvm_vcpu_is_ipi_receiver() always returns
+ * false, falling back to the legacy
+ * preempted-in-kernel heuristic.
+ *
+ * ipi_window_ns - recency window. An IPI older than this is
+ * treated as stale and does not influence
+ * directed-yield selection. Long enough to
+ * cover typical spin-on-IPI-response periods,
+ * short enough to avoid stale state inflating
+ * boost priority on throughput-sensitive
+ * workloads.
+ */
+static bool ipi_tracking_enabled = true;
+module_param(ipi_tracking_enabled, bool, 0644);
+
+static unsigned long ipi_window_ns = 50 * NSEC_PER_MSEC;
+module_param(ipi_window_ns, ulong, 0644);
+
static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data);
static int kvm_lapic_msr_write(struct kvm_lapic *apic, u32 reg, u64 data);
@@ -1144,6 +1167,85 @@ static int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
}
+/*
+ * Record a sender -> receiver IPI relationship for directed-yield use.
+ *
+ * Accessed lockless (READ_ONCE/WRITE_ONCE); this is best-effort, racy
+ * information consumed only as a scheduling hint by
+ * kvm_vcpu_on_spin(), so occasional torn or stale reads are harmless.
+ *
+ * Callers should already have filtered out self-IPIs and non-unicast
+ * or non-fixed-mode deliveries; this function only records the state.
+ */
+void kvm_track_ipi_communication(struct kvm_vcpu *sender,
+ struct kvm_vcpu *receiver, u8 vector)
+{
+ if (!sender || !receiver || sender == receiver)
+ return;
+ if (unlikely(!READ_ONCE(ipi_tracking_enabled)))
+ return;
+
+ WRITE_ONCE(sender->arch.ipi_context.last_ipi_receiver,
+ receiver->vcpu_idx);
+ WRITE_ONCE(sender->arch.ipi_context.vector, vector);
+ WRITE_ONCE(sender->arch.ipi_context.pending_ipi, true);
+ WRITE_ONCE(sender->arch.ipi_context.ipi_time_ns,
+ ktime_get_mono_fast_ns());
+
+ WRITE_ONCE(receiver->arch.ipi_context.last_ipi_sender,
+ sender->vcpu_idx);
+ WRITE_ONCE(receiver->arch.ipi_context.vector, vector);
+}
+
+/*
+ * Return true if @receiver is the confirmed recent IPI target of
+ * @sender, within the configured recency window. Directed yield uses
+ * this as a high-confidence signal that selecting @receiver may
+ * unblock @sender's spin loop.
+ */
+bool kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender,
+ struct kvm_vcpu *receiver)
+{
+ u64 then, now;
+
+ if (unlikely(!READ_ONCE(ipi_tracking_enabled)))
+ return false;
+
+ if (!READ_ONCE(sender->arch.ipi_context.pending_ipi))
+ return false;
+
+ if (READ_ONCE(sender->arch.ipi_context.last_ipi_receiver) !=
+ receiver->vcpu_idx)
+ return false;
+
+ then = READ_ONCE(sender->arch.ipi_context.ipi_time_ns);
+ now = ktime_get_mono_fast_ns();
+ return now - then <= READ_ONCE(ipi_window_ns);
+}
+
+/*
+ * Clear the IPI tracking state of a single vCPU, typically when the
+ * associated interrupt has been acknowledged (EOI) or the vCPU has
+ * been reset/destroyed.
+ *
+ * Leaves the monotonic timestamp untouched to keep staleness checks
+ * on other vCPUs that may reference this one well-defined; use
+ * kvm_vcpu_reset_ipi_context() for a hard reset.
+ */
+void kvm_vcpu_clear_ipi_context(struct kvm_vcpu *vcpu)
+{
+ WRITE_ONCE(vcpu->arch.ipi_context.pending_ipi, false);
+ WRITE_ONCE(vcpu->arch.ipi_context.last_ipi_sender, -1);
+ WRITE_ONCE(vcpu->arch.ipi_context.last_ipi_receiver, -1);
+ WRITE_ONCE(vcpu->arch.ipi_context.vector, 0);
+}
+
+void kvm_vcpu_reset_ipi_context(struct kvm_vcpu *vcpu)
+{
+ kvm_vcpu_clear_ipi_context(vcpu);
+ WRITE_ONCE(vcpu->arch.ipi_context.ipi_time_ns, 0);
+}
+
/* Return true if the interrupt can be handled by using *bitmap as index mask
* for valid destinations in *dst array.
* Return false if kvm_apic_map_get_dest_lapic did nothing useful.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0550359ed798..dcedd09bac10 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12907,6 +12907,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
goto free_guest_fpu;
kvm_xen_init_vcpu(vcpu);
+ kvm_vcpu_reset_ipi_context(vcpu);
vcpu_load(vcpu);
kvm_vcpu_after_set_cpuid(vcpu);
kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.default_tsc_khz);
@@ -12974,6 +12975,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
kvm_mmu_destroy(vcpu);
srcu_read_unlock(&vcpu->kvm->srcu, idx);
free_page((unsigned long)vcpu->arch.pio_data);
+ kvm_vcpu_reset_ipi_context(vcpu);
kvfree(vcpu->arch.cpuid_entries);
}
@@ -13050,6 +13052,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_leave_nested(vcpu);
kvm_lapic_reset(vcpu, init_event);
+ kvm_vcpu_clear_ipi_context(vcpu);
WARN_ON_ONCE(is_guest_mode(vcpu) || is_smm(vcpu));
vcpu->arch.hflags = 0;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 38a905fa86de..eb7f50018f78 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -475,6 +475,14 @@ int handle_ud(struct kvm_vcpu *vcpu);
void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
struct kvm_queued_exception *ex);
+/* IPI tracking helpers for directed-yield optimization (see lapic.c). */
+void kvm_track_ipi_communication(struct kvm_vcpu *sender,
+ struct kvm_vcpu *receiver, u8 vector);
+bool kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender,
+ struct kvm_vcpu *receiver);
+void kvm_vcpu_clear_ipi_context(struct kvm_vcpu *vcpu);
+void kvm_vcpu_reset_ipi_context(struct kvm_vcpu *vcpu);
+
int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4c14aee1fb06..e54e72ae5ebb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1549,6 +1549,14 @@ static inline void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
int kvm_vcpu_yield_to(struct kvm_vcpu *target);
void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool yield_to_kernel_mode);
+/*
+ * IPI-aware directed-yield hook. Architectures that support IPI
+ * tracking (currently x86 via arch/x86/kvm/lapic.c) override this;
+ * the generic __weak stub in virt/kvm/kvm_main.c returns false.
+ */
+bool kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender,
+ struct kvm_vcpu *receiver);
+
void kvm_flush_remote_tlbs(struct kvm *kvm);
void kvm_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages);
void kvm_flush_remote_tlbs_memslot(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 881f92d7a469..2e11c6cfc167 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3957,6 +3957,12 @@ bool __weak kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu)
return false;
}
+bool __weak kvm_vcpu_is_ipi_receiver(struct kvm_vcpu *sender,
+ struct kvm_vcpu *receiver)
+{
+ return false;
+}
+
void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
{
int nr_vcpus, start, i, idx, yielded;
--
2.43.0