* [PATCH 1/7] KVM: Add API documentation for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
2024-10-04 14:08 [PATCH 0/7] KVM: x86: Introduce new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
@ 2024-10-04 14:08 ` Nikolas Wipper
2024-10-10 8:57 ` Vitaly Kuznetsov
2024-10-04 14:08 ` [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state Nikolas Wipper
` (6 subsequent siblings)
7 siblings, 1 reply; 21+ messages in thread
From: Nikolas Wipper @ 2024-10-04 14:08 UTC (permalink / raw)
To: Vitaly Kuznetsov
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Nikolas Wipper, linux-kernel, kvm, x86, linux-doc,
linux-kselftest, Nikolas Wipper
Add API documentation for the new KVM_HYPERV_SET_TLB_FLUSH_INHIBIT ioctl.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
Documentation/virt/kvm/api.rst | 41 ++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index a4b7dc4a9dda..9c11a8af336b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6443,6 +6443,47 @@ the capability to be present.
`flags` must currently be zero.
+4.144 KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
+--------------------------------------
+
+:Capability: KVM_CAP_HYPERV_TLB_FLUSH_INHIBIT
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_hyperv_tlb_flush_inhibit
+:Returns: 0 on success, -EINVAL if `flags` is non-zero
+
+KVM_HYPERV_SET_TLB_FLUSH_INHIBIT allows userspace to prevent Hyper-V hyper-calls
+that remotely flush a vCPU's TLB, i.e. HvFlushVirtualAddressSpace(Ex)/
+HvFlushVirtualAddressList(Ex). When the flag is set, a vCPU attempting to flush
+an inhibited vCPU will be suspended and will only resume once the flag is
+cleared again using this ioctl. During suspension, the vCPU will not finish the
+hyper-call, but may enter the guest to retry it. Because it is caused by a
+hyper-call, the suspension naturally happens on a guest instruction boundary.
+This behaviour and the suspend state itself are specified in Microsoft's
+"Hypervisor Top Level Functional Specification" (TLFS).
+
+::
+
+ /* for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT */
+ struct kvm_hyperv_tlb_flush_inhibit {
+ /* in */
+ __u16 flags;
+ #define KVM_HYPERV_UNINHIBIT_TLB_FLUSH 0
+ #define KVM_HYPERV_INHIBIT_TLB_FLUSH 1
+ __u8 inhibit;
+ __u8 padding[5];
+ };
+
+No flags are currently defined; the `flags` field must be set to zero,
+otherwise the ioctl fails with -EINVAL.
+
+The suspension is transparent to userspace. It won't cause KVM_RUN to return or
+the MP state to be changed. The suspension cannot be manually induced or exited
+apart from changing the TLB flush inhibit flag of a targeted processor.
+
+There is no way for userspace to query the state of the flush inhibit flag.
+Userspace must keep track of the required state itself.
+
5. The kvm_run structure
========================
--
2.40.1
Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
* Re: [PATCH 1/7] KVM: Add API documentation for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
2024-10-04 14:08 ` [PATCH 1/7] KVM: Add API documentation for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
@ 2024-10-10 8:57 ` Vitaly Kuznetsov
0 siblings, 0 replies; 21+ messages in thread
From: Vitaly Kuznetsov @ 2024-10-10 8:57 UTC (permalink / raw)
To: Nikolas Wipper
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Nikolas Wipper, linux-kernel, kvm, x86, linux-doc,
linux-kselftest
Nikolas Wipper <nikwip@amazon.de> writes:
> Add API documentation for the new KVM_HYPERV_SET_TLB_FLUSH_INHIBIT ioctl.
>
> Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
> ---
> Documentation/virt/kvm/api.rst | 41 ++++++++++++++++++++++++++++++++++
> 1 file changed, 41 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index a4b7dc4a9dda..9c11a8af336b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6443,6 +6443,47 @@ the capability to be present.
> `flags` must currently be zero.
>
>
> +4.144 KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
> +--------------------------------------
> +
> +:Capability: KVM_CAP_HYPERV_TLB_FLUSH_INHIBIT
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_hyperv_tlb_flush_inhibit
> +:returnReturns: 0 on success, this ioctl can't fail
> +
> +KVM_HYPERV_SET_TLB_FLUSH_INHIBIT allows userspace to prevent Hyper-V
> hyper-calls
Very minor nitpick: I suggest standardize on "hypercall" spelling
without the dash because:
$ grep -c hypercall Documentation/virt/kvm/api.rst
56
$ grep -c hyper-call Documentation/virt/kvm/api.rst
3
(I see all three 'hypercall', 'hyper-call', 'hyper call' usages in the
wild and I honestly don't think it matters but it would be nice to
adhere to one spelling across the same file / KVM docs).
> +that remotely flush a vCPU's TLB, i.e. HvFlushVirtualAddressSpace(Ex)/
> +HvFlushVirtualAddressList(Ex). When the flag is set, a vCPU attempting to flush
> +an inhibited vCPU will be suspended and will only resume once the flag is
> +cleared again using this ioctl. During suspension, the vCPU will not finish the
> +hyper-call, but may enter the guest to retry it. Because it is caused by a
> +hyper-call, the suspension naturally happens on a guest instruction boundary.
> +This behaviour and the suspend state itself are specified in Microsoft's
> +"Hypervisor Top Level Functional Specification" (TLFS).
> +
> +::
> +
> + /* for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT */
> + struct kvm_hyperv_tlb_flush_inhibit {
> + /* in */
> + __u16 flags;
> + #define KVM_HYPERV_UNINHIBIT_TLB_FLUSH 0
> + #define KVM_HYPERV_INHIBIT_TLB_FLUSH 1
> + __u8 inhibit;
> + __u8 padding[5];
> + };
> +
> +No flags are specified so far, the corresponding field must be set to zero,
> +otherwise the ioctl will fail with exit code -EINVAL.
> +
> +The suspension is transparent to userspace. It won't cause KVM_RUN to return or
> +the MP state to be changed. The suspension cannot be manually induced or exited
> +apart from changing the TLB flush inhibit flag of a targeted processor.
> +
> +There is no way for userspace to query the state of the flush inhibit flag.
> +Userspace must keep track of the required state itself.
> +
> 5. The kvm_run structure
> ========================
--
Vitaly
* [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state
2024-10-04 14:08 [PATCH 0/7] KVM: x86: Introduce new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
2024-10-04 14:08 ` [PATCH 1/7] KVM: Add API documentation for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
@ 2024-10-04 14:08 ` Nikolas Wipper
2024-10-10 8:57 ` Vitaly Kuznetsov
2024-10-04 14:08 ` [PATCH 3/7] KVM: x86: Check vCPUs before enqueuing TLB flushes in kvm_hv_flush_tlb() Nikolas Wipper
` (5 subsequent siblings)
7 siblings, 1 reply; 21+ messages in thread
From: Nikolas Wipper @ 2024-10-04 14:08 UTC (permalink / raw)
To: Vitaly Kuznetsov
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Nikolas Wipper, linux-kernel, kvm, x86, linux-doc,
linux-kselftest, Nikolas Wipper
Introduce a suspension state for Hyper-V enlightened vCPUs. Microsoft's
"Hypervisor Top Level Functional Specification" (TLFS) introduces this
state as a "vCPU that is stopped on a guest instruction boundary, either
explicitly or implicitly due to an intercept". The only documented
explicit suspension is caused in response to a TLB Flush hypercall, which
will use the state switching API in subsequent patches.
Each Hyper-V vCPU now has a 'suspended' boolean field, which is checked
before entering the guest. When set, it forces the vCPU to block. Once in
that state, the vCPU ignores any events. The vCPU is unsuspended by
clearing 'suspended' and issuing a request to force the vCPU out of sleep.
Suspensions are issued as a mechanism to halt execution until state change
is observed on a remote vCPU. Hyper-V vCPUs track this with 'waiting_on',
which holds the 'vcpu_id' of the remote vCPU that forced the vCPU to enter
the suspended state. It's the remote vCPU's responsibility to wake up the
suspended vCPUs when ready. 'waiting_on' ensures the remote vCPU can
selectively unsuspend vCPUs that blocked on its behalf while leaving other
suspended vCPUs undisturbed. One vCPU can only be suspended due to a
single remote vCPU, but different vCPUs can be suspended on behalf of
different or the same remote vCPU(s). The guest is responsible for
avoiding circular dependencies between suspended vCPUs.
Callers of the suspend API are responsible for ensuring that suspend and
unsuspend aren't called in parallel while targeting the same pair of
vCPUs. Otherwise kvm_hv_vcpu_{un}suspend_tlb_flush() ensure 'waiting_on'
and 'suspended' are updated and accessed in the correct order. This, for
example, avoids races where the unsuspended vCPU re-suspends before
kvm_hv_vcpu_unsuspend_tlb_flush() is done updating 'waiting_on'.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/hyperv.c | 30 ++++++++++++++++++++++++++++++
arch/x86/kvm/hyperv.h | 17 +++++++++++++++++
arch/x86/kvm/x86.c | 4 +++-
4 files changed, 53 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 46e0a466d7fb..7571ac578884 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -695,6 +695,9 @@ struct kvm_vcpu_hv {
u64 vm_id;
u32 vp_id;
} nested;
+
+ bool suspended;
+ int waiting_on;
};
struct kvm_hypervisor_cpuid {
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 4f0a94346d00..6e7941ed25ae 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -971,6 +971,7 @@ int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu)
vcpu->arch.hyperv = hv_vcpu;
hv_vcpu->vcpu = vcpu;
+ hv_vcpu->waiting_on = -1;
synic_init(&hv_vcpu->synic);
@@ -2915,3 +2916,32 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
return 0;
}
+
+void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id)
+{
+ /* waiting_on's store should happen before suspended's */
+ WRITE_ONCE(vcpu->arch.hyperv->waiting_on, vcpu_id);
+ WRITE_ONCE(vcpu->arch.hyperv->suspended, true);
+}
+
+void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu)
+{
+ DECLARE_BITMAP(vcpu_mask, KVM_MAX_VCPUS);
+ struct kvm_vcpu_hv *vcpu_hv;
+ struct kvm_vcpu *v;
+ unsigned long i;
+
+ kvm_for_each_vcpu(i, v, vcpu->kvm) {
+ vcpu_hv = to_hv_vcpu(v);
+
+ if (kvm_hv_vcpu_suspended(v) &&
+ READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) {
+ /* waiting_on's store should happen before suspended's */
+ WRITE_ONCE(v->arch.hyperv->waiting_on, -1);
+ WRITE_ONCE(v->arch.hyperv->suspended, false);
+ __set_bit(i, vcpu_mask);
+ }
+ }
+
+ kvm_make_vcpus_request_mask(vcpu->kvm, KVM_REQ_EVENT, vcpu_mask);
+}
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index 913bfc96959c..a55832cea221 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -265,6 +265,15 @@ static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu,
}
int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu);
+
+static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.hyperv_enabled &&
+ READ_ONCE(vcpu->arch.hyperv->suspended);
+}
+
+void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id);
+void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu);
#else /* CONFIG_KVM_HYPERV */
static inline void kvm_hv_setup_tsc_page(struct kvm *kvm,
struct pvclock_vcpu_time_info *hv_clock) {}
@@ -321,6 +330,14 @@ static inline u32 kvm_hv_get_vpindex(struct kvm_vcpu *vcpu)
return vcpu->vcpu_idx;
}
static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu, bool tdp_enabled) {}
+
+static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu)
+{
+ return false;
+}
+
+static inline void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id) {}
+static inline void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu) {}
#endif /* CONFIG_KVM_HYPERV */
#endif /* __ARCH_X86_KVM_HYPERV_H__ */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 15080385b8fe..18d0a300e79a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -135,6 +135,8 @@ static void store_regs(struct kvm_vcpu *vcpu);
static int sync_regs(struct kvm_vcpu *vcpu);
static int kvm_vcpu_do_singlestep(struct kvm_vcpu *vcpu);
+static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu);
+
static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
@@ -11107,7 +11109,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
static bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
{
return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
- !vcpu->arch.apf.halted);
+ !vcpu->arch.apf.halted && !kvm_hv_vcpu_suspended(vcpu));
}
static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
--
2.40.1
* Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state
2024-10-04 14:08 ` [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state Nikolas Wipper
@ 2024-10-10 8:57 ` Vitaly Kuznetsov
2024-10-14 17:50 ` Nikolas Wipper
0 siblings, 1 reply; 21+ messages in thread
From: Vitaly Kuznetsov @ 2024-10-10 8:57 UTC (permalink / raw)
To: Nikolas Wipper
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Nikolas Wipper, linux-kernel, kvm, x86, linux-doc,
linux-kselftest
Nikolas Wipper <nikwip@amazon.de> writes:
> Introduce a suspension state for Hyper-V enlightened vCPUs. Microsoft's
> "Hypervisor Top Level Functional Specification" (TLFS) introduces this
> state as a "vCPU that is stopped on a instruction guest boundary, either
> explicitly or implicitly due to an intercept". The only documented
> explicit suspension is caused in response to a TLB Flush hypercall, which
> will use the state switching API in subsequent patches.
>
> Each Hyper-V vCPU now has a 'suspended' boolean field, which is checked
> before entering the guest. When set, it forces the vCPU to block. Once in
> that state, the vCPU ignores any events. The vCPU is unsuspended by
> clearing 'suspend' and issuing a request to force the vCPU out of sleep.
>
> Suspensions are issued as a mechanism to halt execution until state change
> is observed on a remote vCPU. Hyper-V vCPUs track this with 'waiting_on',
> which holds the 'vcpu_id' of the remote vCPU that forced the vCPU to enter
> the suspended state. It's the remote vCPU's responsibility to wake up the
> suspended vCPUs when ready. 'waiting_on' ensures the remote vCPU can
> selectively unsuspend vCPUs that blocked on its behalf while leaving other
> suspended vCPUs undisturbed. One vCPU can only be suspended due to a
> single remote vCPU, but different vCPUs can be suspended on behalf of
> different or the same remote vCPU(s). The guest is responsible for
> avoiding circular dependencies between suspended vCPUs.
>
> Callers of the suspend API are responsible for ensuring that suspend and
> unsuspend aren't called in parallel while targeting the same pair of
> vCPUs. Otherwise kvm_hv_vcpu_{un}suspend_tlb_flush() ensure 'waiting_on'
> and 'suspended' are updated and accessed in the correct order. This, for
> example, avoids races where the unsuspended vCPU re-suspends before
> kvm_hv_vcpu_unsuspend_tlb_flush() is done updating 'waiting_on'.
>
> Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
> ---
> arch/x86/include/asm/kvm_host.h | 3 +++
> arch/x86/kvm/hyperv.c | 30 ++++++++++++++++++++++++++++++
> arch/x86/kvm/hyperv.h | 17 +++++++++++++++++
> arch/x86/kvm/x86.c | 4 +++-
> 4 files changed, 53 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 46e0a466d7fb..7571ac578884 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -695,6 +695,9 @@ struct kvm_vcpu_hv {
> u64 vm_id;
> u32 vp_id;
> } nested;
> +
> + bool suspended;
> + int waiting_on;
I don't quite understand why we need 'suspended' at all. Isn't it always
suspended when 'waiting_on != -1'? I can see we always update these two
in pair.
Also, I would suggest we use a more descriptive
name. 'waiting_on_vcpu_id', for example.
> };
>
> struct kvm_hypervisor_cpuid {
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 4f0a94346d00..6e7941ed25ae 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -971,6 +971,7 @@ int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu)
>
> vcpu->arch.hyperv = hv_vcpu;
> hv_vcpu->vcpu = vcpu;
> + hv_vcpu->waiting_on = -1;
>
> synic_init(&hv_vcpu->synic);
>
> @@ -2915,3 +2916,32 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>
> return 0;
> }
> +
> +void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id)
Can we make parameter's name 'waiting_on_vcpu_id' as well? Because as-is
I'm getting confused which CPU of these two is actually getting
suspended)
Also, why do we need '_tlb_flush' in the name? The mechanism seems to be
fairly generic, it's just that we use it for TLB flushes.
> +{
> + /* waiting_on's store should happen before suspended's */
> + WRITE_ONCE(vcpu->arch.hyperv->waiting_on, vcpu_id);
> + WRITE_ONCE(vcpu->arch.hyperv->suspended, true);
> +}
> +
> +void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu)
And here someone may expect this means 'unsuspend vcpu' but in reality
this means 'unsuspend all vCPUs which are waiting on 'vcpu'). I guess we
need a rename. How about
void kvm_hv_unsuspend_vcpus(struct kvm_vcpu *waiting_on_vcpu)
?
> +{
> + DECLARE_BITMAP(vcpu_mask, KVM_MAX_VCPUS);
> + struct kvm_vcpu_hv *vcpu_hv;
> + struct kvm_vcpu *v;
> + unsigned long i;
> +
> + kvm_for_each_vcpu(i, v, vcpu->kvm) {
> + vcpu_hv = to_hv_vcpu(v);
> +
> + if (kvm_hv_vcpu_suspended(v) &&
> + READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) {
> + /* waiting_on's store should happen before suspended's */
> + WRITE_ONCE(v->arch.hyperv->waiting_on, -1);
> + WRITE_ONCE(v->arch.hyperv->suspended, false);
> + __set_bit(i, vcpu_mask);
> + }
> + }
> +
> + kvm_make_vcpus_request_mask(vcpu->kvm, KVM_REQ_EVENT, vcpu_mask);
> +}
> diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
> index 913bfc96959c..a55832cea221 100644
> --- a/arch/x86/kvm/hyperv.h
> +++ b/arch/x86/kvm/hyperv.h
> @@ -265,6 +265,15 @@ static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu,
> }
>
> int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu);
> +
> +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu)
> +{
> + return vcpu->arch.hyperv_enabled &&
> + READ_ONCE(vcpu->arch.hyperv->suspended);
I don't think READ_ONCE() means anything here, does it?
> +}
> +
> +void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id);
> +void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu);
> #else /* CONFIG_KVM_HYPERV */
> static inline void kvm_hv_setup_tsc_page(struct kvm *kvm,
> struct pvclock_vcpu_time_info *hv_clock) {}
> @@ -321,6 +330,14 @@ static inline u32 kvm_hv_get_vpindex(struct kvm_vcpu *vcpu)
> return vcpu->vcpu_idx;
> }
> static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu, bool tdp_enabled) {}
> +
> +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu)
> +{
> + return false;
> +}
> +
> +static inline void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id) {}
> +static inline void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu) {}
> #endif /* CONFIG_KVM_HYPERV */
>
> #endif /* __ARCH_X86_KVM_HYPERV_H__ */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 15080385b8fe..18d0a300e79a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -135,6 +135,8 @@ static void store_regs(struct kvm_vcpu *vcpu);
> static int sync_regs(struct kvm_vcpu *vcpu);
> static int kvm_vcpu_do_singlestep(struct kvm_vcpu *vcpu);
>
> +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu);
> +
> static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
> static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
>
> @@ -11107,7 +11109,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> static bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
> {
> return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
> - !vcpu->arch.apf.halted);
> + !vcpu->arch.apf.halted && !kvm_hv_vcpu_suspended(vcpu));
> }
>
> static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
--
Vitaly
* Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state
2024-10-10 8:57 ` Vitaly Kuznetsov
@ 2024-10-14 17:50 ` Nikolas Wipper
2024-10-15 8:18 ` Vitaly Kuznetsov
0 siblings, 1 reply; 21+ messages in thread
From: Nikolas Wipper @ 2024-10-14 17:50 UTC (permalink / raw)
To: Vitaly Kuznetsov, Nikolas Wipper
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, x86, linux-doc, linux-kselftest
On 10.10.24 10:57, Vitaly Kuznetsov wrote:
> Nikolas Wipper <nikwip@amazon.de> writes:
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 46e0a466d7fb..7571ac578884 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -695,6 +695,9 @@ struct kvm_vcpu_hv {
>> u64 vm_id;
>> u32 vp_id;
>> } nested;
>> +
>> + bool suspended;
>> + int waiting_on;
>
> I don't quite understand why we need 'suspended' at all. Isn't it always
> suspended when 'waiting_on != -1'? I can see we always update these two
> in pair.
>
This is mainly for future proofing the implementation. You are right, this
is currently not required, but it's nice to have a single flag, so that
when the suspended state is used in a different context, the whole logic
surrounding it still works.
> Also, I would suggest we use a more descriptive
> name. 'waiting_on_vcpu_id', for example.
>
Sounds good.
>> };
>>
>> struct kvm_hypervisor_cpuid {
>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> index 4f0a94346d00..6e7941ed25ae 100644
>> --- a/arch/x86/kvm/hyperv.c
>> +++ b/arch/x86/kvm/hyperv.c
>> @@ -971,6 +971,7 @@ int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu)
>>
>> vcpu->arch.hyperv = hv_vcpu;
>> hv_vcpu->vcpu = vcpu;
>> + hv_vcpu->waiting_on = -1;
>>
>> synic_init(&hv_vcpu->synic);
>>
>> @@ -2915,3 +2916,32 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>>
>> return 0;
>> }
>> +
>> +void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id)
>
> Can we make parameter's name 'waiting_on_vcpu_id' as well? Because as-is
> I'm getting confused which CPU of these two is actually getting
> suspended)
>
Yup, that would certainly help readability.
> Also, why do we need '_tlb_flush' in the name? The mechanism seems to be
> fairly generic, it's just that we use it for TLB flushes.
>
The 'waiting_on' part is TLB flushing specific.
>> +{
>> + /* waiting_on's store should happen before suspended's */
>> + WRITE_ONCE(vcpu->arch.hyperv->waiting_on, vcpu_id);
>> + WRITE_ONCE(vcpu->arch.hyperv->suspended, true);
>> +}
>> +
>> +void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu)
>
> And here someone may expect this means 'unsuspend vcpu' but in reality
> this means 'unsuspend all vCPUs which are waiting on 'vcpu'). I guess we
> need a rename. How about
>
> void kvm_hv_unsuspend_vcpus(struct kvm_vcpu *waiting_on_vcpu)
>
> ?
>
Also sounds good.
>> +{
>> + DECLARE_BITMAP(vcpu_mask, KVM_MAX_VCPUS);
>> + struct kvm_vcpu_hv *vcpu_hv;
>> + struct kvm_vcpu *v;
>> + unsigned long i;
>> +
>> + kvm_for_each_vcpu(i, v, vcpu->kvm) {
>> + vcpu_hv = to_hv_vcpu(v);
>> +
>> + if (kvm_hv_vcpu_suspended(v) &&
>> + READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) {
>> + /* waiting_on's store should happen before suspended's */
>> + WRITE_ONCE(v->arch.hyperv->waiting_on, -1);
>> + WRITE_ONCE(v->arch.hyperv->suspended, false);
>> + __set_bit(i, vcpu_mask);
>> + }
>> + }
>> +
>> + kvm_make_vcpus_request_mask(vcpu->kvm, KVM_REQ_EVENT, vcpu_mask);
>> +}
>> diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
>> index 913bfc96959c..a55832cea221 100644
>> --- a/arch/x86/kvm/hyperv.h
>> +++ b/arch/x86/kvm/hyperv.h
>> @@ -265,6 +265,15 @@ static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu,
>> }
>>
>> int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu);
>> +
>> +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu)
>> +{
>> + return vcpu->arch.hyperv_enabled &&
>> + READ_ONCE(vcpu->arch.hyperv->suspended);
>
> I don't think READ_ONCE() means anything here, does it?
>
It does prevent compiler optimisations and is actually required[1]. Also
it makes clear that this variable is shared, and may be accessed from
remote CPUs.
[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html#Variable%20Access
Nikolas
* Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state
2024-10-14 17:50 ` Nikolas Wipper
@ 2024-10-15 8:18 ` Vitaly Kuznetsov
2024-10-15 15:58 ` Sean Christopherson
0 siblings, 1 reply; 21+ messages in thread
From: Vitaly Kuznetsov @ 2024-10-15 8:18 UTC (permalink / raw)
To: Nikolas Wipper, Nikolas Wipper
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, x86, linux-doc, linux-kselftest
Nikolas Wipper <nik.wipper@gmx.de> writes:
> On 10.10.24 10:57, Vitaly Kuznetsov wrote:
...
>>> int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu);
>>> +
>>> +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu)
>>> +{
>>> + return vcpu->arch.hyperv_enabled &&
>>> + READ_ONCE(vcpu->arch.hyperv->suspended);
>>
>> I don't think READ_ONCE() means anything here, does it?
>>
>
> It does prevent compiler optimisations and is actually required[1]. Also
> it makes clear that this variable is shared, and may be accessed from
> remote CPUs.
>
> [1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html#Variable%20Access
It certainly does no harm but I think if we follow 'Loads from and
stores to shared (but non-atomic) variables should be protected with the
READ_ONCE(), WRITE_ONCE()' rule literally we will need to sprinkle them
all over KVM/kernel ;-) And personally, this makes reading the code
harder.
To my (very limited) knowledge, we really need READ_ONCE()s when we need
to have some sort of a serialization, e.g. the moment when this read
happens actually makes a difference. If we can e.g. use a local variable
in the beginning of a function and replace all READ_ONCE()s with
reading this local variable -- then we don't need READ_ONCE()s and are
OK with possible compiler optimizations. Similar (reversed) thoughts go
to WRITE_ONCE().
I think it's OK to keep them but it would be nice (not mandatory IMO,
but nice) to have a comment describing which particular synchronization
we are achieving (== the compiler optimization scenario we are protecting
against).
In this particular case, kvm_hv_vcpu_suspended() is inline so I briefly
looked at all kvm_hv_vcpu_suspended() call sites (there are three) in
your series but couldn't think of a place where the READ_ONCE() makes a
real difference. kvm_hv_hypercall_complete() looks pretty safe
anyway. kvm_hv_vcpu_unsuspend_tlb_flush() will be simplified
significantly if we merge 'suspended' with 'waiting_on': instead of
kvm_for_each_vcpu(i, v, vcpu->kvm) {
vcpu_hv = to_hv_vcpu(v);
if (kvm_hv_vcpu_suspended(v) &&
READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) {
...
you will have just
kvm_for_each_vcpu(i, v, vcpu->kvm) {
vcpu_hv = to_hv_vcpu(v);
if (vcpu_hv && vcpu_hv->waiting_on == vcpu->vcpu_id) {
...
(and yes, I also think that READ_ONCE() is superfluous here, as real
(non-speculative) write below can't happen _before_ the check)
The last one, kvm_vcpu_running(), should also be indifferent to
READ_ONCE() in kvm_hv_vcpu_suspended(). I may had missed something, of
course, but I hope you got my line of thought.
--
Vitaly
* Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state
2024-10-15 8:18 ` Vitaly Kuznetsov
@ 2024-10-15 15:58 ` Sean Christopherson
2024-10-15 17:16 ` Nicolas Saenz Julienne
2024-10-15 17:40 ` Nikolas Wipper
0 siblings, 2 replies; 21+ messages in thread
From: Sean Christopherson @ 2024-10-15 15:58 UTC (permalink / raw)
To: Vitaly Kuznetsov
Cc: Nikolas Wipper, Nikolas Wipper, Nicolas Saenz Julienne,
Alexander Graf, James Gowans, nh-open-source, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, x86, linux-doc, linux-kselftest
On Tue, Oct 15, 2024, Vitaly Kuznetsov wrote:
> Nikolas Wipper <nik.wipper@gmx.de> writes:
>
> > On 10.10.24 10:57, Vitaly Kuznetsov wrote:
>
> ...
>
> >>> int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu);
> >>> +
> >>> +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu)
> >>> +{
> >>> + return vcpu->arch.hyperv_enabled &&
> >>> + READ_ONCE(vcpu->arch.hyperv->suspended);
> >>
> >> I don't think READ_ONCE() means anything here, does it?
> >>
> >
> > It does prevent compiler optimisations and is actually required[1]. Also
> > it makes clear that this variable is shared, and may be accessed from
> > remote CPUs.
> >
> > [1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html#Variable%20Access
>
> It certainly does no harm but I think if we follow 'Loads from and
> stores to shared (but non-atomic) variables should be protected with the
> READ_ONCE(), WRITE_ONCE()' rule literally we will need to sprinkle them
> all over KVM/kernel ;-) And personally, this makes reading the code
> harder.
>
> To my (very limited) knowledge, we really need READ_ONCE()s when we need
> to have some sort of a serialization, e.g. the moment when this read
> happens actually makes a difference. If we can e.g. use a local variable
> in the beginning of a function and replace all READ_ONCE()s with
> reading this local variable -- then we don't need READ_ONCE()s and are
> OK with possible compiler optimizations. Similar (reversed) thoughts go
> to WRITE_ONCE().
>
> I think it's OK to keep them but it would be nice (not mandatory IMO,
> but nice) to have a comment describing which particular synchronization
> we are achieving (== the compiler optimization scenario we are protecting
> against).
>
> In this particular case, kvm_hv_vcpu_suspended() is inline so I briefly
> looked at all kvm_hv_vcpu_suspended() call sites (there are three) in
> your series but couldn't think of a place where the READ_ONCE() makes a
> real difference. kvm_hv_hypercall_complete() looks pretty safe
> anyway. kvm_hv_vcpu_unsuspend_tlb_flush() will be simplified
> significantly if we merge 'suspended' with 'waiting_on': instead of
>
> kvm_for_each_vcpu(i, v, vcpu->kvm) {
> vcpu_hv = to_hv_vcpu(v);
>
> if (kvm_hv_vcpu_suspended(v) &&
> READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) {
> ...
>
> you will have just
>
> kvm_for_each_vcpu(i, v, vcpu->kvm) {
> vcpu_hv = to_hv_vcpu(v);
>
> if (vcpu_hv && vcpu_hv->waiting_on == vcpu->vcpu_id) {
> ...
> (and yes, I also think that READ_ONCE() is superfluous here, as real
> (non-speculative) write below can't happen _before_ the check )
>
> The last one, kvm_vcpu_running(), should also be indifferent to
> READ_ONCE() in kvm_hv_vcpu_suspended(). I may have missed something, of
> course, but I hope you got my line of thought.
I don't think you're missing anything. In general, all of this code is more than
a bit heavy-handed and lacks any kind of precision, which makes it *really* hard
to see what actually guarantees a vCPU won't get stuck blocking.
Writers synchronize SRCU and readers are required to acquire SRCU, but there's
no actual data tagged as being protected by SRCU, i.e. tlb_flush_inhibit should
be __rcu.
All of the {READ,WRITE}_ONCE() stuff provides some implicit compiler barriers,
but the actual protection to ensure a vCPU either observes inhibit=false or a wake
event is provided by the smp_wmb() in __kvm_make_request().
And from a performance perspective, synchronizing on kvm->srcu is going to be
susceptible to random slowdowns, because writers will have to wait until all vCPUs
drop SRCU, even if they have nothing to do with PV TLB flushes. E.g. if vCPUs
are faulting in memory from swap, uninhibiting TLB flushes could be stalled
unnecessarily for an extended duration.
Lastly, KVM_REQ_EVENT is a big hammer (triggers a lot of processing) and semantically
misleading (there is no event to process). At a glance, KVM_REQ_UNBLOCK is likely
more appropriate.
Before we spend too much time cleaning things up, I want to first settle on the
overall design, because it's not clear to me that punting HvTranslateVirtualAddress
to userspace is a net positive. We agreed that VTLs should be modeled primarily
in userspace, but that doesn't automatically make punting everything to userspace
the best option, especially given the discussion at KVM Forum with respect to
implementing VTLs, VMPLs, TD partitions, etc.
The cover letters for this series and KVM_TRANSLATE2 simply say they're needed
for HvTranslateVirtualAddress, but neither series nor Nicolas' patch to punt
HVCALL_TRANSLATE_VIRTUAL_ADDRESS[*] justifies the split between userspace and
KVM. And it very much is a split, because there are obviously a lot of details
around TlbFlushInhibit that bleed into KVM.
Side topic, what actually clears HvRegisterInterceptSuspend.TlbFlushInhibit? The
TLFS just says
After the memory intercept routine performs instruction completion, it should
clear the TlbFlushInhibit bit of the HvRegisterInterceptSuspend register.
but I can't find anything that says _how_ it clears TlbFlushInhibit.
[*] https://lore.kernel.org/all/20240609154945.55332-8-nsaenz@amazon.com
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state
2024-10-15 15:58 ` Sean Christopherson
@ 2024-10-15 17:16 ` Nicolas Saenz Julienne
2024-10-15 17:51 ` Sean Christopherson
2024-10-15 17:40 ` Nikolas Wipper
1 sibling, 1 reply; 21+ messages in thread
From: Nicolas Saenz Julienne @ 2024-10-15 17:16 UTC (permalink / raw)
To: Sean Christopherson, Vitaly Kuznetsov
Cc: Nikolas Wipper, Nikolas Wipper, Alexander Graf, James Gowans,
nh-open-source, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, linux-kernel, kvm, x86, linux-doc,
linux-kselftest
Hi Sean,
On Tue Oct 15, 2024 at 3:58 PM UTC, Sean Christopherson wrote:
> Before we spend too much time cleaning things up, I want to first settle on the
> overall design, because it's not clear to me that punting HvTranslateVirtualAddress
> to userspace is a net positive. We agreed that VTLs should be modeled primarily
> in userspace, but that doesn't automatically make punting everything to userspace
> the best option, especially given the discussion at KVM Forum with respect to
> implementing VTLs, VMPLs, TD partitions, etc.
Since you mention it, Paolo said he was going to prep a doc with an
overview of the design we discussed there. Was it published? Did I miss
it?
Thanks,
Nicolas
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state
2024-10-15 17:16 ` Nicolas Saenz Julienne
@ 2024-10-15 17:51 ` Sean Christopherson
0 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2024-10-15 17:51 UTC (permalink / raw)
To: Nicolas Saenz Julienne
Cc: Vitaly Kuznetsov, Nikolas Wipper, Nikolas Wipper, Alexander Graf,
James Gowans, nh-open-source, Paolo Bonzini, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, linux-kernel, kvm, x86,
linux-doc, linux-kselftest
On Tue, Oct 15, 2024, Nicolas Saenz Julienne wrote:
> Hi Sean,
>
> On Tue Oct 15, 2024 at 3:58 PM UTC, Sean Christopherson wrote:
> > Before we spend too much time cleaning things up, I want to first settle on the
> > overall design, because it's not clear to me that punting HvTranslateVirtualAddress
> > to userspace is a net positive. We agreed that VTLs should be modeled primarily
> > in userspace, but that doesn't automatically make punting everything to userspace
> > the best option, especially given the discussion at KVM Forum with respect to
> > implementing VTLs, VMPLs, TD partitions, etc.
>
> Since you mention it, Paolo said he was going to prep a doc with an
> overview of the design we discussed there. Was it published? Did I miss
> it?
Nope, we're all hitting F5 mercilessly :-)
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state
2024-10-15 15:58 ` Sean Christopherson
2024-10-15 17:16 ` Nicolas Saenz Julienne
@ 2024-10-15 17:40 ` Nikolas Wipper
2024-10-15 18:10 ` Sean Christopherson
1 sibling, 1 reply; 21+ messages in thread
From: Nikolas Wipper @ 2024-10-15 17:40 UTC (permalink / raw)
To: Sean Christopherson, Vitaly Kuznetsov
Cc: Nikolas Wipper, Nicolas Saenz Julienne, Alexander Graf,
James Gowans, nh-open-source, Paolo Bonzini, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, linux-kernel, kvm, x86,
linux-doc, linux-kselftest
On 15.10.24 17:58, Sean Christopherson wrote:
> ...
>
> And from a performance perspective, synchronizing on kvm->srcu is going to be
> susceptible to random slowdowns, because writers will have to wait until all vCPUs
> drop SRCU, even if they have nothing to do with PV TLB flushes. E.g. if vCPUs
> are faulting in memory from swap, uninhibiting TLB flushes could be stalled
> unnecessarily for an extended duration.
>
This should be an easy fix, right? Just create a separate SRCU used only for the TLB flushes.
> Lastly, KVM_REQ_EVENT is a big hammer (triggers a lot of processing) and semantically
> misleading (there is no event to process). At a glance, KVM_REQ_UNBLOCK is likely
> more appropriate.
>
> Before we spend too much time cleaning things up, I want to first settle on the
> overall design, because it's not clear to me that punting HvTranslateVirtualAddress
> to userspace is a net positive. We agreed that VTLs should be modeled primarily
> in userspace, but that doesn't automatically make punting everything to userspace
> the best option, especially given the discussion at KVM Forum with respect to
> implementing VTLs, VMPLs, TD partitions, etc.
>
I wasn't at the discussion, so maybe I'm missing something, but the hypercall
still needs VTL awareness. For one, it is primarily executed from VTL0 and
primarily targets VTL1 (primarily here means "that's what I see when I boot
Windows Server 2019"), so it would need to know which vCPU corresponds to the
target VTL (this assumes one vCPU per VTL, as in the QEMU implementation). To make
matters worse, the hypercall can also arbitrarily choose to target a different
VP. This would require a way to map (VP index, VTL) -> (vcpu_id) within KVM.
> The cover letters for this series and KVM_TRANSLATE2 simply say they're needed
> for HvTranslateVirtualAddress, but neither series nor Nicolas' patch to punt
> HVCALL_TRANSLATE_VIRTUAL_ADDRESS[*] justifies the split between userspace and
> KVM. And it very much is a split, because there are obviously a lot of details
> around TlbFlushInhibit that bleed into KVM.
>
> Side topic, what actually clears HvRegisterInterceptSuspend.TlbFlushInhibit? The
> TLFS just says
>
> After the memory intercept routine performs instruction completion, it should
> clear the TlbFlushInhibit bit of the HvRegisterInterceptSuspend register.
>
> but I can't find anything that says _how_ it clears TlbFlushInhibit.
>
The register cannot be accessed using the HvSetVpRegisters hypercall, but the TLFS
talks about it elsewhere. I'm assuming this is a formatting issue (there are a few
elsewhere). In 15.5.1.3 it says
To unlock the TLB, the higher VTL can clear this bit. Also, once a VP returns
to a lower VTL, it releases all TLB locks which it holds at the time.
The QEMU implementation also just uninhibits on intercept exit, and that, at least,
does not crash.
Nikolas
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state
2024-10-15 17:40 ` Nikolas Wipper
@ 2024-10-15 18:10 ` Sean Christopherson
0 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2024-10-15 18:10 UTC (permalink / raw)
To: Nikolas Wipper
Cc: Vitaly Kuznetsov, Nikolas Wipper, Nicolas Saenz Julienne,
Alexander Graf, James Gowans, nh-open-source, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, x86, linux-doc, linux-kselftest
On Tue, Oct 15, 2024, Nikolas Wipper wrote:
> On 15.10.24 17:58, Sean Christopherson wrote:
> > ...
> >
> > And from a performance perspective, synchronizing on kvm->srcu is going to be
> > susceptible to random slowdowns, because writers will have to wait until all vCPUs
> > drop SRCU, even if they have nothing to do with PV TLB flushes. E.g. if vCPUs
> > are faulting in memory from swap, uninhibiting TLB flushes could be stalled
> > unnecessarily for an extended duration.
> >
>
> This should be an easy fix, right? Just create a separate SRCU used only for the TLB flushes.
Yes, this is a very solvable problem. But while SRCU objects aren't expensive,
they aren't entirely free either.
> > Lastly, KVM_REQ_EVENT is a big hammer (triggers a lot of processing) and semantically
> > misleading (there is no event to process). At a glance, KVM_REQ_UNBLOCK is likely
> > more appropriate.
> >
> > Before we spend too much time cleaning things up, I want to first settle on the
> > overall design, because it's not clear to me that punting HvTranslateVirtualAddress
> > to userspace is a net positive. We agreed that VTLs should be modeled primarily
> > in userspace, but that doesn't automatically make punting everything to userspace
> > the best option, especially given the discussion at KVM Forum with respect to
> > implementing VTLs, VMPLs, TD partitions, etc.
> >
>
> I wasn't at the discussion, so maybe I'm missing something, but the hypercall
> still needs VTL awareness.
Yeah, the KVM Forum discussion is relevant, because one of the key takeaways from
that discussion was that KVM will need some amount of VTL awareness.
> For one, it is primarily executed from VTL0 and primarily targets VTL1
> (primarily here means "that's what I see when I boot Windows Server 2019"), so
> it would need to know which vCPU corresponds to the target VTL (this assumes one
> vCPU per VTL, as in the QEMU implementation).
Right, but even without the takeways from KVM Forum, we need to look at big picture
and come up with a design that makes the most sense. E.g. if making KVM aware
of "struct kvm" objects that represent different VTLs in the same VM greatly
simplifies supporting HvTranslateVirtualAddress, then it's likely worth doing.
> To make matters worse, the hypercall can also arbitrarily choose to target a
> different VP. This would require a way to map (VP index, VTL) -> (vcpu_id)
> within KVM.
I don't think so. The TLFS definition for TlbFlushInhibit gives KVM a _lot_ of
wiggle room, e.g. KVM can retry the hypercall as many times as necessary. To
honor TlbFlushInhibit, KVM just needs to ensure that flushes are blocked if any
vCPU at the target VTL is blocking flushes. And to avoid hanging a vCPU, KVM
only needs to ensure a vCPU is awakened if it _might_ be able to make forward
progress.
I.e. I don't think KVM needs to be super precise when waking blocking vCPUs, and
thus there's no need to precisely track who is blocking whom. I think :-)
> > The cover letters for this series and KVM_TRANSLATE2 simply say they're needed
> > for HvTranslateVirtualAddress, but neither series nor Nicolas' patch to punt
> > HVCALL_TRANSLATE_VIRTUAL_ADDRESS[*] justifies the split between userspace and
> > KVM. And it very much is a split, because there are obviously a lot of details
> > around TlbFlushInhibit that bleed into KVM.
> >
> > Side topic, what actually clears HvRegisterInterceptSuspend.TlbFlushInhibit? The
> > TLFS just says
> >
> > After the memory intercept routine performs instruction completion, it should
> > clear the TlbFlushInhibit bit of the HvRegisterInterceptSuspend register.
> >
> > but I can't find anything that says _how_ it clears TlbFlushInhibit.
> >
>
> The register cannot be accessed using the HvSetVpRegisters hypercall, but the TLFS
> talks about it elsewhere. I'm assuming this is a formatting issue (there are a few
> elsewhere). In 15.5.1.3 it says
>
> To unlock the TLB, the higher VTL can clear this bit. Also, once a VP returns
> to a lower VTL, it releases all TLB locks which it holds at the time.
>
> The QEMU implementation also just uninhibits on intercept exit, and that, at least,
> does not crash.
Hmm, it would be nice to bottom out on whether the higher VTL or the VMM/hypervisor
is responsible for clearing TlbFlushInhibit, because that may influence KVM's
design.
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 3/7] KVM: x86: Check vCPUs before enqueuing TLB flushes in kvm_hv_flush_tlb()
2024-10-04 14:08 [PATCH 0/7] KVM: x86: Introduce new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
2024-10-04 14:08 ` [PATCH 1/7] KVM: Add API documentation for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
2024-10-04 14:08 ` [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state Nikolas Wipper
@ 2024-10-04 14:08 ` Nikolas Wipper
2024-10-04 14:08 ` [PATCH 4/7] KVM: Introduce KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
` (4 subsequent siblings)
7 siblings, 0 replies; 21+ messages in thread
From: Nikolas Wipper @ 2024-10-04 14:08 UTC (permalink / raw)
To: Vitaly Kuznetsov
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Nikolas Wipper, linux-kernel, kvm, x86, linux-doc,
linux-kselftest, Nikolas Wipper
In Hyper-V's kvm_hv_flush_tlb(), check targeted vCPUs and store them in a
bitmap before flushing their TLBs. In a subsequent commit, remote TLB
flushes may need to be aborted, so allow checking for that before starting
to enqueue the flushes.
No functional change intended.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
arch/x86/kvm/hyperv.c | 30 +++++++++++++++---------------
1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 6e7941ed25ae..e68fbc0c7fc1 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2134,26 +2134,21 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
* analyze it here, flush TLB regardless of the specified address space.
*/
if (all_cpus && !is_guest_mode(vcpu)) {
+ bitmap_zero(vcpu_mask, KVM_MAX_VCPUS);
+
kvm_for_each_vcpu(i, v, kvm) {
- tlb_flush_fifo = kvm_hv_get_tlb_flush_fifo(v, false);
- hv_tlb_flush_enqueue(v, tlb_flush_fifo,
- tlb_flush_entries, hc->rep_cnt);
+ __set_bit(i, vcpu_mask);
}
-
- kvm_make_all_cpus_request(kvm, KVM_REQ_HV_TLB_FLUSH);
} else if (!is_guest_mode(vcpu)) {
sparse_set_to_vcpu_mask(kvm, sparse_banks, valid_bank_mask, vcpu_mask);
for_each_set_bit(i, vcpu_mask, KVM_MAX_VCPUS) {
v = kvm_get_vcpu(kvm, i);
- if (!v)
+ if (!v) {
+ __clear_bit(i, vcpu_mask);
continue;
- tlb_flush_fifo = kvm_hv_get_tlb_flush_fifo(v, false);
- hv_tlb_flush_enqueue(v, tlb_flush_fifo,
- tlb_flush_entries, hc->rep_cnt);
+ }
}
-
- kvm_make_vcpus_request_mask(kvm, KVM_REQ_HV_TLB_FLUSH, vcpu_mask);
} else {
struct kvm_vcpu_hv *hv_v;
@@ -2181,14 +2176,19 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
continue;
__set_bit(i, vcpu_mask);
- tlb_flush_fifo = kvm_hv_get_tlb_flush_fifo(v, true);
- hv_tlb_flush_enqueue(v, tlb_flush_fifo,
- tlb_flush_entries, hc->rep_cnt);
}
+ }
- kvm_make_vcpus_request_mask(kvm, KVM_REQ_HV_TLB_FLUSH, vcpu_mask);
+ for_each_set_bit(i, vcpu_mask, KVM_MAX_VCPUS) {
+ v = kvm_get_vcpu(kvm, i);
+
+ tlb_flush_fifo = kvm_hv_get_tlb_flush_fifo(v, is_guest_mode(vcpu));
+ hv_tlb_flush_enqueue(v, tlb_flush_fifo,
+ tlb_flush_entries, hc->rep_cnt);
}
+ kvm_make_vcpus_request_mask(kvm, KVM_REQ_HV_TLB_FLUSH, vcpu_mask);
+
ret_success:
/* We always do full TLB flush, set 'Reps completed' = 'Rep Count' */
return (u64)HV_STATUS_SUCCESS |
--
2.40.1
Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 4/7] KVM: Introduce KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
2024-10-04 14:08 [PATCH 0/7] KVM: x86: Introduce new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
` (2 preceding siblings ...)
2024-10-04 14:08 ` [PATCH 3/7] KVM: x86: Check vCPUs before enqueuing TLB flushes in kvm_hv_flush_tlb() Nikolas Wipper
@ 2024-10-04 14:08 ` Nikolas Wipper
2024-10-10 8:57 ` Vitaly Kuznetsov
2024-10-04 14:08 ` [PATCH 5/7] KVM: x86: Implement KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
` (3 subsequent siblings)
7 siblings, 1 reply; 21+ messages in thread
From: Nikolas Wipper @ 2024-10-04 14:08 UTC (permalink / raw)
To: Vitaly Kuznetsov
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Nikolas Wipper, linux-kernel, kvm, x86, linux-doc,
linux-kselftest, Nikolas Wipper
Introduce a new ioctl to control whether remote flushing via Hyper-V
hyper-calls should be allowed on a vCPU. When the tlb_flush_inhibit bit is
set, vCPUs attempting to flush the TLB of the inhibited vCPU will be
suspended until the bit is cleared.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
include/uapi/linux/kvm.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc055145..3bc43fdcab88 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -933,6 +933,7 @@ struct kvm_enable_cap {
#define KVM_CAP_PRE_FAULT_MEMORY 236
#define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
#define KVM_CAP_X86_GUEST_MODE 238
+#define KVM_CAP_HYPERV_TLB_FLUSH_INHIBIT 239
struct kvm_irq_routing_irqchip {
__u32 irqchip;
@@ -1573,4 +1574,18 @@ struct kvm_pre_fault_memory {
__u64 padding[5];
};
+/* Available with KVM_CAP_HYPERV_TLBFLUSH */
+#define KVM_HYPERV_SET_TLB_FLUSH_INHIBIT \
+ _IOW(KVMIO, 0xd6, struct kvm_hyperv_tlb_flush_inhibit)
+
+/* for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT */
+struct kvm_hyperv_tlb_flush_inhibit {
+ /* in */
+ __u16 flags;
+#define KVM_HYPERV_UNINHIBIT_TLB_FLUSH 0
+#define KVM_HYPERV_INHIBIT_TLB_FLUSH 1
+ __u8 inhibit;
+ __u8 reserved[5];
+};
+
#endif /* __LINUX_KVM_H */
--
2.40.1
^ permalink raw reply related [flat|nested] 21+ messages in thread* Re: [PATCH 4/7] KVM: Introduce KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
2024-10-04 14:08 ` [PATCH 4/7] KVM: Introduce KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
@ 2024-10-10 8:57 ` Vitaly Kuznetsov
0 siblings, 0 replies; 21+ messages in thread
From: Vitaly Kuznetsov @ 2024-10-10 8:57 UTC (permalink / raw)
To: Nikolas Wipper
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Nikolas Wipper, linux-kernel, kvm, x86, linux-doc,
linux-kselftest, Nikolas Wipper
Nikolas Wipper <nikwip@amazon.de> writes:
> Introduce a new ioctl to control whether remote flushing via Hyper-V
> hyper-calls should be allowed on a vCPU. When the tlb_flush_inhibit bit is
> set, vCPUs attempting to flush the TLB of the inhibited vCPU will be
> suspended until the bit is cleared.
>
> Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
> ---
> include/uapi/linux/kvm.h | 15 +++++++++++++++
I guess we can merge this patch with documentation (PATCH1) or even
implementation (PATCH5) as I don't see why separation helps here.
> 1 file changed, 15 insertions(+)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 637efc055145..3bc43fdcab88 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -933,6 +933,7 @@ struct kvm_enable_cap {
> #define KVM_CAP_PRE_FAULT_MEMORY 236
> #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
> #define KVM_CAP_X86_GUEST_MODE 238
> +#define KVM_CAP_HYPERV_TLB_FLUSH_INHIBIT 239
>
> struct kvm_irq_routing_irqchip {
> __u32 irqchip;
> @@ -1573,4 +1574,18 @@ struct kvm_pre_fault_memory {
> __u64 padding[5];
> };
>
> +/* Available with KVM_CAP_HYPERV_TLBFLUSH */
> +#define KVM_HYPERV_SET_TLB_FLUSH_INHIBIT \
> + _IOW(KVMIO, 0xd6, struct kvm_hyperv_tlb_flush_inhibit)
> +
> +/* for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT */
> +struct kvm_hyperv_tlb_flush_inhibit {
> + /* in */
> + __u16 flags;
> +#define KVM_HYPERV_UNINHIBIT_TLB_FLUSH 0
> +#define KVM_HYPERV_INHIBIT_TLB_FLUSH 1
> + __u8 inhibit;
> + __u8 reserved[5];
> +};
> +
> #endif /* __LINUX_KVM_H */
--
Vitaly
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 5/7] KVM: x86: Implement KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
2024-10-04 14:08 [PATCH 0/7] KVM: x86: Introduce new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
` (3 preceding siblings ...)
2024-10-04 14:08 ` [PATCH 4/7] KVM: Introduce KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
@ 2024-10-04 14:08 ` Nikolas Wipper
2024-10-10 8:57 ` Vitaly Kuznetsov
2024-10-04 14:08 ` [PATCH 6/7] KVM: x86: Add trace events to track Hyper-V suspensions Nikolas Wipper
` (2 subsequent siblings)
7 siblings, 1 reply; 21+ messages in thread
From: Nikolas Wipper @ 2024-10-04 14:08 UTC (permalink / raw)
To: Vitaly Kuznetsov
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Nikolas Wipper, linux-kernel, kvm, x86, linux-doc,
linux-kselftest, Nikolas Wipper
Implement KVM_HYPERV_SET_TLB_FLUSH_INHIBIT for x86. Apart from setting/
clearing the internal TLB flush inhibit flag this ioctl also wakes up
vCPUs suspended and waiting on this vCPU.
When the flag is set, a vCPU trying to flush the inhibited vCPU's TLB with
a Hyper-V hyper-call suspends until the bit is cleared again. The hyper-
call finishes internally, but the instruction isn't skipped, and execution
doesn't return to the guest. This behaviour and the suspended state itself
are specified in Microsoft's "Hypervisor Top Level Functional
Specification" (TLFS).
To maintain thread safety during suspension and unsuspension, the IOCTL
uses the KVM SRCU. After flipping the flush inhibit flag, it synchronizes
SRCU to ensure that all running TLB flush attempts that saw the old state,
finish before the IOCTL exits. If the flag was changed to inhibit new TLB
flushes, this guarantees that no TLB flushes happen once the ioctl exits.
If the flag was changed to no longer inhibit TLB flushes, the
synchronization ensures that all suspended CPUs finished entering the
suspended state properly, so they can be unsuspended again.
Once TLB flushes are no longer inhibited, all suspended vCPUs are
unsuspended.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/hyperv.c | 22 ++++++++++++++++++++
arch/x86/kvm/x86.c | 37 +++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7571ac578884..ab3a9beb61a2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -698,6 +698,8 @@ struct kvm_vcpu_hv {
bool suspended;
int waiting_on;
+
+ int tlb_flush_inhibit;
};
struct kvm_hypervisor_cpuid {
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index e68fbc0c7fc1..40ea8340838f 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2137,6 +2137,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
bitmap_zero(vcpu_mask, KVM_MAX_VCPUS);
kvm_for_each_vcpu(i, v, kvm) {
+ if (READ_ONCE(v->arch.hyperv->tlb_flush_inhibit))
+ goto ret_suspend;
+
__set_bit(i, vcpu_mask);
}
} else if (!is_guest_mode(vcpu)) {
@@ -2148,6 +2151,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
__clear_bit(i, vcpu_mask);
continue;
}
+
+ if (READ_ONCE(v->arch.hyperv->tlb_flush_inhibit))
+ goto ret_suspend;
}
} else {
struct kvm_vcpu_hv *hv_v;
@@ -2175,6 +2181,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
sparse_banks))
continue;
+ if (READ_ONCE(v->arch.hyperv->tlb_flush_inhibit))
+ goto ret_suspend;
+
__set_bit(i, vcpu_mask);
}
}
@@ -2193,6 +2202,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
/* We always do full TLB flush, set 'Reps completed' = 'Rep Count' */
return (u64)HV_STATUS_SUCCESS |
((u64)hc->rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
+ret_suspend:
+ kvm_hv_vcpu_suspend_tlb_flush(vcpu, v->vcpu_id);
+ return -EBUSY;
}
static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
@@ -2380,6 +2392,13 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result)
u32 tlb_lock_count = 0;
int ret;
+ /*
+ * Reached when the hyper-call resulted in a suspension of the vCPU.
+ * The instruction will be re-tried once the vCPU is unsuspended.
+ */
+ if (kvm_hv_vcpu_suspended(vcpu))
+ return 1;
+
if (hv_result_success(result) && is_guest_mode(vcpu) &&
kvm_hv_is_tlb_flush_hcall(vcpu) &&
kvm_read_guest(vcpu->kvm, to_hv_vcpu(vcpu)->nested.pa_page_gpa,
@@ -2919,6 +2938,9 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id)
{
+ RCU_LOCKDEP_WARN(!srcu_read_lock_held(&vcpu->kvm->srcu),
+ "Suspicious Hyper-V TLB flush inhibit usage\n");
+
/* waiting_on's store should happen before suspended's */
WRITE_ONCE(vcpu->arch.hyperv->waiting_on, vcpu_id);
WRITE_ONCE(vcpu->arch.hyperv->suspended, true);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 18d0a300e79a..1f925e32a927 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4642,6 +4642,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_HYPERV_CPUID:
case KVM_CAP_HYPERV_ENFORCE_CPUID:
case KVM_CAP_SYS_HYPERV_CPUID:
+ case KVM_CAP_HYPERV_TLB_FLUSH_INHIBIT:
#endif
case KVM_CAP_PCI_SEGMENT:
case KVM_CAP_DEBUGREGS:
@@ -5853,6 +5854,31 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
}
}
+static int kvm_vcpu_ioctl_set_tlb_flush_inhibit(struct kvm_vcpu *vcpu,
+ struct kvm_hyperv_tlb_flush_inhibit *set)
+{
+ if (set->inhibit == READ_ONCE(vcpu->arch.hyperv->tlb_flush_inhibit))
+ return 0;
+
+ WRITE_ONCE(vcpu->arch.hyperv->tlb_flush_inhibit, set->inhibit);
+
+ /*
+ * synchronize_srcu() ensures that:
+ * - On inhibit, all concurrent TLB flushes finished before this ioctl
+ * exits.
+ * - On uninhibit, there are no longer vCPUs being suspended due to this
+ * inhibit.
+ * This function can't race with itself, because vCPU IOCTLs are
+ * serialized, so if the inhibit bit is already the desired value this
+ * synchronization has already happened.
+ */
+ synchronize_srcu(&vcpu->kvm->srcu);
+ if (!set->inhibit)
+ kvm_hv_vcpu_unsuspend_tlb_flush(vcpu);
+
+ return 0;
+}
+
long kvm_arch_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -6306,6 +6332,17 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
case KVM_SET_DEVICE_ATTR:
r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
break;
+#ifdef CONFIG_KVM_HYPERV
+ case KVM_HYPERV_SET_TLB_FLUSH_INHIBIT: {
+ struct kvm_hyperv_tlb_flush_inhibit set;
+
+ r = -EFAULT;
+ if (copy_from_user(&set, argp, sizeof(set)))
+ goto out;
+ r = kvm_vcpu_ioctl_set_tlb_flush_inhibit(vcpu, &set);
+ break;
+ }
+#endif
default:
r = -EINVAL;
}
--
2.40.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 5/7] KVM: x86: Implement KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
2024-10-04 14:08 ` [PATCH 5/7] KVM: x86: Implement KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
@ 2024-10-10 8:57 ` Vitaly Kuznetsov
2024-10-14 18:02 ` Nikolas Wipper
0 siblings, 1 reply; 21+ messages in thread
From: Vitaly Kuznetsov @ 2024-10-10 8:57 UTC (permalink / raw)
To: Nikolas Wipper
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Nikolas Wipper, linux-kernel, kvm, x86, linux-doc,
linux-kselftest, Nikolas Wipper
Nikolas Wipper <nikwip@amazon.de> writes:
> Implement KVM_HYPERV_SET_TLB_FLUSH_INHIBIT for x86. Apart from setting/
> clearing the internal TLB flush inhibit flag this ioctl also wakes up
> vCPUs suspended and waiting on this vCPU.
>
> When the flag is set, a vCPU trying to flush the inhibited vCPU's TLB with
> a Hyper-V hyper-call suspends until the bit is cleared again. The hyper-
> call finishes internally, but the instruction isn't skipped, and execution
> doesn't return to the guest. This behaviour and the suspended state itself
> are specified in Microsoft's "Hypervisor Top Level Functional
> Specification" (TLFS).
>
> To maintain thread safety during suspension and unsuspension, the IOCTL
> uses the KVM SRCU. After flipping the flush inhibit flag, it synchronizes
> SRCU to ensure that all running TLB flush attempts that saw the old state,
> finish before the IOCTL exits. If the flag was changed to inhibit new TLB
> flushes, this guarantees that no TLB flushes happen once the ioctl exits.
> If the flag was changed to no longer inhibit TLB flushes, the
> synchronization ensures that all suspended CPUs finished entering the
> suspended state properly, so they can be unsuspended again.
>
> Once TLB flushes are no longer inhibited, all suspended vCPUs are
> unsuspended.
>
> Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
> ---
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/hyperv.c | 22 ++++++++++++++++++++
> arch/x86/kvm/x86.c | 37 +++++++++++++++++++++++++++++++++
> 3 files changed, 61 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7571ac578884..ab3a9beb61a2 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -698,6 +698,8 @@ struct kvm_vcpu_hv {
>
> bool suspended;
> int waiting_on;
> +
> + int tlb_flush_inhibit;
This is basically boolean, right? And we only make it 'int' to be able
to store 'u8' from the ioctl? This doesn't look very clean. Do you
envision anything but '1'/'0' in 'inhibit'? If not, maybe we can just
make it a flag (and e.g. extend 'flags' to be u32/u64)? This way we can
convert 'tlb_flush_inhibit' to a normal bool.
> };
>
> struct kvm_hypervisor_cpuid {
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index e68fbc0c7fc1..40ea8340838f 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -2137,6 +2137,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
> bitmap_zero(vcpu_mask, KVM_MAX_VCPUS);
>
> kvm_for_each_vcpu(i, v, kvm) {
> + if (READ_ONCE(v->arch.hyperv->tlb_flush_inhibit))
> + goto ret_suspend;
> +
> __set_bit(i, vcpu_mask);
> }
> } else if (!is_guest_mode(vcpu)) {
> @@ -2148,6 +2151,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
> __clear_bit(i, vcpu_mask);
> continue;
> }
> +
> + if (READ_ONCE(v->arch.hyperv->tlb_flush_inhibit))
> + goto ret_suspend;
> }
> } else {
> struct kvm_vcpu_hv *hv_v;
> @@ -2175,6 +2181,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
> sparse_banks))
> continue;
>
> + if (READ_ONCE(v->arch.hyperv->tlb_flush_inhibit))
> + goto ret_suspend;
> +
These READ_ONCEs make me think I misunderstand something here, please
bear with me :-).
Like we're trying to protect against 'tlb_flush_inhibit' being read
somewhere in the beginning of the function and want to generate real
memory accesses. But what happens if tlb_flush_inhibit changes right
_after_ we checked it here and _before_ we actually do
kvm_make_vcpus_request_mask()? Wouldn't it be a problem? In case it
would, I think we need to reverse the order: do
kvm_make_vcpus_request_mask() anyway and after it go through vcpu_mask
checking whether any of the affected vCPUs has 'tlb_flush_inhibit' and
if it does, suspend the caller.
> __set_bit(i, vcpu_mask);
> }
> }
> @@ -2193,6 +2202,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
> /* We always do full TLB flush, set 'Reps completed' = 'Rep Count' */
> return (u64)HV_STATUS_SUCCESS |
> ((u64)hc->rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
> +ret_suspend:
> + kvm_hv_vcpu_suspend_tlb_flush(vcpu, v->vcpu_id);
> + return -EBUSY;
> }
>
> static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
> @@ -2380,6 +2392,13 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result)
> u32 tlb_lock_count = 0;
> int ret;
>
> + /*
> + * Reached when the hyper-call resulted in a suspension of the vCPU.
> + * The instruction will be re-tried once the vCPU is unsuspended.
> + */
> + if (kvm_hv_vcpu_suspended(vcpu))
> + return 1;
> +
> if (hv_result_success(result) && is_guest_mode(vcpu) &&
> kvm_hv_is_tlb_flush_hcall(vcpu) &&
> kvm_read_guest(vcpu->kvm, to_hv_vcpu(vcpu)->nested.pa_page_gpa,
> @@ -2919,6 +2938,9 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>
> void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id)
> {
> + RCU_LOCKDEP_WARN(!srcu_read_lock_held(&vcpu->kvm->srcu),
> + "Suspicious Hyper-V TLB flush inhibit usage\n");
> +
> /* waiting_on's store should happen before suspended's */
> WRITE_ONCE(vcpu->arch.hyperv->waiting_on, vcpu_id);
> WRITE_ONCE(vcpu->arch.hyperv->suspended, true);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 18d0a300e79a..1f925e32a927 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4642,6 +4642,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_HYPERV_CPUID:
> case KVM_CAP_HYPERV_ENFORCE_CPUID:
> case KVM_CAP_SYS_HYPERV_CPUID:
> + case KVM_CAP_HYPERV_TLB_FLUSH_INHIBIT:
> #endif
> case KVM_CAP_PCI_SEGMENT:
> case KVM_CAP_DEBUGREGS:
> @@ -5853,6 +5854,31 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> }
> }
>
> +static int kvm_vcpu_ioctl_set_tlb_flush_inhibit(struct kvm_vcpu *vcpu,
> + struct kvm_hyperv_tlb_flush_inhibit *set)
> +{
> + if (set->inhibit == READ_ONCE(vcpu->arch.hyperv->tlb_flush_inhibit))
> + return 0;
> +
> + WRITE_ONCE(vcpu->arch.hyperv->tlb_flush_inhibit, set->inhibit);
As you say before, vCPU ioctls are serialized and noone else sets
tlb_flush_inhibit, do I understand correctly that
READ_ONCE()/WRITE_ONCE() are redundant here?
> +
> + /*
> + * synchronize_srcu() ensures that:
> + * - On inhibit, all concurrent TLB flushes finished before this ioctl
> + * exits.
> + * - On uninhibit, there are no longer vCPUs being suspended due to this
> + * inhibit.
> + * This function can't race with itself, because vCPU IOCTLs are
> + * serialized, so if the inhibit bit is already the desired value this
> + * synchronization has already happened.
> + */
> + synchronize_srcu(&vcpu->kvm->srcu);
> + if (!set->inhibit)
> + kvm_hv_vcpu_unsuspend_tlb_flush(vcpu);
> +
> + return 0;
> +}
> +
> long kvm_arch_vcpu_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> @@ -6306,6 +6332,17 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> case KVM_SET_DEVICE_ATTR:
> r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
> break;
> +#ifdef CONFIG_KVM_HYPERV
> + case KVM_HYPERV_SET_TLB_FLUSH_INHIBIT: {
> + struct kvm_hyperv_tlb_flush_inhibit set;
> +
> + r = -EFAULT;
> + if (copy_from_user(&set, argp, sizeof(set)))
> + goto out;
> + r = kvm_vcpu_ioctl_set_tlb_flush_inhibit(vcpu, &set);
> + break;
> + }
> +#endif
> default:
> r = -EINVAL;
> }
--
Vitaly
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/7] KVM: x86: Implement KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
2024-10-10 8:57 ` Vitaly Kuznetsov
@ 2024-10-14 18:02 ` Nikolas Wipper
0 siblings, 0 replies; 21+ messages in thread
From: Nikolas Wipper @ 2024-10-14 18:02 UTC (permalink / raw)
To: Vitaly Kuznetsov, Nikolas Wipper
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
linux-kernel, kvm, x86, linux-doc, linux-kselftest
On 10.10.24 10:57, Vitaly Kuznetsov wrote:
> Nikolas Wipper <nikwip@amazon.de> writes:
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 7571ac578884..ab3a9beb61a2 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -698,6 +698,8 @@ struct kvm_vcpu_hv {
>>
>> bool suspended;
>> int waiting_on;
>> +
>> + int tlb_flush_inhibit;
>
> This is basically boolean, right? And we only make it 'int' to be able
> to store 'u8' from the ioctl? This doesn't look very clean. Do you
> envision anything but '1'/'0' in 'inhibit'? If not, maybe we can just
> make it a flag (and e.g. extend 'flags' to be u32/u64)? This way we can
> convert 'tlb_flush_inhibit' to a normal bool.
>
Yes, inhibit would always be binary, so incorporating it into the flags
sounds reasonable. Even with the current API, this could just be a bool
(tlb_flush_inhibit = inhibit == 1;)
>> };
>>
>> struct kvm_hypervisor_cpuid {
>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> index e68fbc0c7fc1..40ea8340838f 100644
>> --- a/arch/x86/kvm/hyperv.c
>> +++ b/arch/x86/kvm/hyperv.c
>> @@ -2137,6 +2137,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
>> bitmap_zero(vcpu_mask, KVM_MAX_VCPUS);
>>
>> kvm_for_each_vcpu(i, v, kvm) {
>> + if (READ_ONCE(v->arch.hyperv->tlb_flush_inhibit))
>> + goto ret_suspend;
>> +
>> __set_bit(i, vcpu_mask);
>> }
>> } else if (!is_guest_mode(vcpu)) {
>> @@ -2148,6 +2151,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
>> __clear_bit(i, vcpu_mask);
>> continue;
>> }
>> +
>> + if (READ_ONCE(v->arch.hyperv->tlb_flush_inhibit))
>> + goto ret_suspend;
>> }
>> } else {
>> struct kvm_vcpu_hv *hv_v;
>> @@ -2175,6 +2181,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
>> sparse_banks))
>> continue;
>>
>> + if (READ_ONCE(v->arch.hyperv->tlb_flush_inhibit))
>> + goto ret_suspend;
>> +
>
> These READ_ONCEs make me think I misunderstand something here, please
> bear with me :-).
>
> Like we're trying to protect against 'tlb_flush_inhibit' being read
> somewhere in the beginning of the function and want to generate real
> memory accesses. But what happens if tlb_flush_inhibit changes right
> _after_ we checked it here and _before_ we actually do
> kvm_make_vcpus_request_mask()? Wouldn't it be a problem? In case it
> would, I think we need to reverse the order: do
> kvm_make_vcpus_request_mask() anyway and after it go through vcpu_mask
> checking whether any of the affected vCPUs has 'tlb_flush_inhibit' and
> if it does, suspend the caller.
>
The case you're describing is prevented through SRCU synchronisation in
the ioctl. The hypercall actually holds a read-side critical section for
the whole of its execution, so when tlb_flush_inhibit changes after we
read it, the ioctl waits for the in-flight flushes to complete:
 vCPU 0                   | vCPU 1
--------------------------+------------------------
                          | hypercall enter
                          | srcu_read_lock()
 ioctl enter              |
                          | tlb_flush_inhibit read
 tlb_flush_inhibit write  |
 synchronize_srcu() start |
                          | TLB flush reqs send
                          | srcu_read_unlock()
 synchronize_srcu() end   |
 ioctl exit               |
>> __set_bit(i, vcpu_mask);
>> }
>> }
>> @@ -2193,6 +2202,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
>> /* We always do full TLB flush, set 'Reps completed' = 'Rep Count' */
>> return (u64)HV_STATUS_SUCCESS |
>> ((u64)hc->rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
>> +ret_suspend:
>> + kvm_hv_vcpu_suspend_tlb_flush(vcpu, v->vcpu_id);
>> + return -EBUSY;
>> }
>>
>> static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
>> @@ -2380,6 +2392,13 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result)
>> u32 tlb_lock_count = 0;
>> int ret;
>>
>> + /*
>> + * Reached when the hyper-call resulted in a suspension of the vCPU.
>> + * The instruction will be re-tried once the vCPU is unsuspended.
>> + */
>> + if (kvm_hv_vcpu_suspended(vcpu))
>> + return 1;
>> +
>> if (hv_result_success(result) && is_guest_mode(vcpu) &&
>> kvm_hv_is_tlb_flush_hcall(vcpu) &&
>> kvm_read_guest(vcpu->kvm, to_hv_vcpu(vcpu)->nested.pa_page_gpa,
>> @@ -2919,6 +2938,9 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>>
>> void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id)
>> {
>> + RCU_LOCKDEP_WARN(!srcu_read_lock_held(&vcpu->kvm->srcu),
>> + "Suspicious Hyper-V TLB flush inhibit usage\n");
>> +
>> /* waiting_on's store should happen before suspended's */
>> WRITE_ONCE(vcpu->arch.hyperv->waiting_on, vcpu_id);
>> WRITE_ONCE(vcpu->arch.hyperv->suspended, true);
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 18d0a300e79a..1f925e32a927 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4642,6 +4642,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> case KVM_CAP_HYPERV_CPUID:
>> case KVM_CAP_HYPERV_ENFORCE_CPUID:
>> case KVM_CAP_SYS_HYPERV_CPUID:
>> + case KVM_CAP_HYPERV_TLB_FLUSH_INHIBIT:
>> #endif
>> case KVM_CAP_PCI_SEGMENT:
>> case KVM_CAP_DEBUGREGS:
>> @@ -5853,6 +5854,31 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>> }
>> }
>>
>> +static int kvm_vcpu_ioctl_set_tlb_flush_inhibit(struct kvm_vcpu *vcpu,
>> + struct kvm_hyperv_tlb_flush_inhibit *set)
>> +{
>> + if (set->inhibit == READ_ONCE(vcpu->arch.hyperv->tlb_flush_inhibit))
>> + return 0;
>> +
>> + WRITE_ONCE(vcpu->arch.hyperv->tlb_flush_inhibit, set->inhibit);
>
> As you say before, vCPU ioctls are serialized and noone else sets
> tlb_flush_inhibit, do I understand correctly that
> READ_ONCE()/WRITE_ONCE() are redundant here?
>
As mentioned before, since tlb_flush_inhibit is shared between vCPUs,
these accessors are still needed.
Nikolas
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 6/7] KVM: x86: Add trace events to track Hyper-V suspensions
2024-10-04 14:08 [PATCH 0/7] KVM: x86: Introduce new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
` (4 preceding siblings ...)
2024-10-04 14:08 ` [PATCH 5/7] KVM: x86: Implement KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
@ 2024-10-04 14:08 ` Nikolas Wipper
2024-10-04 14:08 ` [PATCH 7/7] KVM: selftests: Add tests for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
2024-10-14 23:36 ` [PATCH 0/7] KVM: x86: Introduce new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Sean Christopherson
7 siblings, 0 replies; 21+ messages in thread
From: Nikolas Wipper @ 2024-10-04 14:08 UTC (permalink / raw)
To: Vitaly Kuznetsov
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Nikolas Wipper, linux-kernel, kvm, x86, linux-doc,
linux-kselftest, Nikolas Wipper
Add trace events to track suspension and unsuspension of vCPUs caused by
Hyper-V mechanisms.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
arch/x86/kvm/hyperv.c | 4 ++++
arch/x86/kvm/trace.h | 39 +++++++++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 40ea8340838f..55b8f88a91cb 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2944,6 +2944,8 @@ void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id)
/* waiting_on's store should happen before suspended's */
WRITE_ONCE(vcpu->arch.hyperv->waiting_on, vcpu_id);
WRITE_ONCE(vcpu->arch.hyperv->suspended, true);
+
+ trace_kvm_hv_vcpu_suspend_tlb_flush(vcpu->vcpu_id, vcpu_id);
}
void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu)
@@ -2962,6 +2964,8 @@ void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu)
WRITE_ONCE(v->arch.hyperv->waiting_on, -1);
WRITE_ONCE(v->arch.hyperv->suspended, false);
__set_bit(i, vcpu_mask);
+
+ trace_kvm_hv_vcpu_unsuspend_tlb_flush(v->vcpu_id);
}
}
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index d3aeffd6ae75..5564ef52fd9d 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1871,6 +1871,45 @@ TRACE_EVENT(kvm_rmp_fault,
__entry->error_code, __entry->rmp_level, __entry->psmash_ret)
);
+/*
+ * Tracepoint for Hyper-V TlbFlushInhibit suspension
+ */
+TRACE_EVENT(kvm_hv_vcpu_suspend_tlb_flush,
+ TP_PROTO(int vcpu_id, int waiting_on),
+ TP_ARGS(vcpu_id, waiting_on),
+
+ TP_STRUCT__entry(
+ __field(int, vcpu_id)
+ __field(int, waiting_on)
+ ),
+
+ TP_fast_assign(
+ __entry->vcpu_id = vcpu_id;
+ __entry->waiting_on = waiting_on;
+ ),
+
+ TP_printk("suspending vcpu %u waiting on %u",
+ __entry->vcpu_id, __entry->waiting_on)
+);
+
+/*
+ * Tracepoint for Hyper-V TlbFlushInhibit unsuspension
+ */
+TRACE_EVENT(kvm_hv_vcpu_unsuspend_tlb_flush,
+ TP_PROTO(int vcpu_id),
+ TP_ARGS(vcpu_id),
+ TP_STRUCT__entry(
+ __field(int, vcpu_id)
+ ),
+
+ TP_fast_assign(
+ __entry->vcpu_id = vcpu_id;
+ ),
+
+ TP_printk("unsuspending vcpu %u",
+ __entry->vcpu_id)
+);
+
#endif /* _TRACE_KVM_H */
#undef TRACE_INCLUDE_PATH
--
2.40.1
Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 7/7] KVM: selftests: Add tests for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
2024-10-04 14:08 [PATCH 0/7] KVM: x86: Introduce new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
` (5 preceding siblings ...)
2024-10-04 14:08 ` [PATCH 6/7] KVM: x86: Add trace events to track Hyper-V suspensions Nikolas Wipper
@ 2024-10-04 14:08 ` Nikolas Wipper
2024-10-14 23:36 ` [PATCH 0/7] KVM: x86: Introduce new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Sean Christopherson
7 siblings, 0 replies; 21+ messages in thread
From: Nikolas Wipper @ 2024-10-04 14:08 UTC (permalink / raw)
To: Vitaly Kuznetsov
Cc: Nicolas Saenz Julienne, Alexander Graf, James Gowans,
nh-open-source, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Nikolas Wipper, linux-kernel, kvm, x86, linux-doc,
linux-kselftest, Nikolas Wipper
Add a basic test for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT. Since information
about the suspension is not communicated to userspace, the test detects
suspension by checking whether the thread running a vCPU is still
executing.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../kvm/x86_64/hyperv_tlb_flush_inhibit.c | 274 ++++++++++++++++++
2 files changed, 275 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/hyperv_tlb_flush_inhibit.c
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 45cb70c048bb..098c9c3f9ad8 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -79,6 +79,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/hyperv_features
TEST_GEN_PROGS_x86_64 += x86_64/hyperv_ipi
TEST_GEN_PROGS_x86_64 += x86_64/hyperv_svm_test
TEST_GEN_PROGS_x86_64 += x86_64/hyperv_tlb_flush
+TEST_GEN_PROGS_x86_64 += x86_64/hyperv_tlb_flush_inhibit
TEST_GEN_PROGS_x86_64 += x86_64/kvm_clock_test
TEST_GEN_PROGS_x86_64 += x86_64/kvm_pv_test
TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test
diff --git a/tools/testing/selftests/kvm/x86_64/hyperv_tlb_flush_inhibit.c b/tools/testing/selftests/kvm/x86_64/hyperv_tlb_flush_inhibit.c
new file mode 100644
index 000000000000..638ed6ec8ae1
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/hyperv_tlb_flush_inhibit.c
@@ -0,0 +1,274 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test for KVM's emulation of Hyper-V's TlbFlushInhibit bit
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+#include <pthread.h>
+#include <time.h>
+
+#include "apic.h"
+#include "hyperv.h"
+
+struct timespec abstime;
+
+struct test_data {
+ bool entered;
+ bool hypercall_done;
+ vm_paddr_t hcall_gpa;
+};
+
+void guest_main(vm_vaddr_t test_data)
+{
+ struct test_data *data = (struct test_data *)test_data;
+
+ wrmsr(HV_X64_MSR_GUEST_OS_ID, HYPERV_LINUX_OS_ID);
+ wrmsr(HV_X64_MSR_HYPERCALL, data->hcall_gpa);
+
+ WRITE_ONCE(data->entered, true);
+
+ /* Aligned for loading into XMM registers */
+ __aligned(16) u64 processor_mask = BIT(0) | BIT(1) | BIT(2);
+
+ /* Setup fast hyper-call */
+ hyperv_write_xmm_input(&processor_mask, 1);
+ hyperv_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE |
+ HV_HYPERCALL_FAST_BIT,
+ 0x0, HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES);
+ data->hypercall_done = true;
+
+ GUEST_DONE();
+}
+
+struct test_data *test_data_init(struct kvm_vcpu *vcpu)
+{
+ vm_vaddr_t test_data_page;
+
+ test_data_page = vm_vaddr_alloc_page(vcpu->vm);
+
+ vcpu_args_set(vcpu, 1, test_data_page);
+
+ return (struct test_data *)addr_gva2hva(vcpu->vm, test_data_page);
+}
+
+static void *vcpu_thread(void *arg)
+{
+ struct kvm_vcpu *vcpu = (struct kvm_vcpu *)arg;
+ struct ucall uc;
+
+ while (1) {
+ vcpu_run(vcpu);
+
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+
+ switch (get_ucall(vcpu, &uc)) {
+ case UCALL_PRINTF:
+ REPORT_GUEST_PRINTF(uc);
+ break;
+ default:
+ TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_DONE);
+ return NULL;
+ }
+ }
+}
+
+/* Test one vCPU being inhibited while another tries to flush its TLB */
+void test_single(struct kvm_vm *vm, struct kvm_vcpu *inhibitor,
+ struct kvm_vcpu *flusher)
+{
+ struct kvm_hyperv_tlb_flush_inhibit set;
+ struct test_data *data;
+ unsigned int to_sleep;
+ pthread_t thread;
+
+ printf("%s ...\t", __func__);
+
+ vcpu_arch_set_entry_point(flusher, guest_main);
+
+ data = test_data_init(flusher);
+
+ data->entered = false;
+ data->hypercall_done = false;
+ data->hcall_gpa = addr_gva2gpa(vm, vm_vaddr_alloc_pages(vm, 1));
+
+ set.inhibit = true;
+ vcpu_ioctl(inhibitor, KVM_HYPERV_SET_TLB_FLUSH_INHIBIT, &set);
+
+ pthread_create(&thread, NULL, vcpu_thread, flusher);
+
+ /* Waiting on the guest to fully enter */
+ while (READ_ONCE(data->entered) == false)
+ asm volatile ("nop");
+
+ /* Give the guest some time to attempt the hyper-call */
+ to_sleep = 2;
+ while ((to_sleep = sleep(to_sleep)))
+ asm volatile ("nop");
+
+ TEST_ASSERT_EQ(data->hypercall_done, false);
+ TEST_ASSERT(pthread_tryjoin_np(thread, NULL) != 0, "thread finished early");
+
+ set.inhibit = false;
+ vcpu_ioctl(inhibitor, KVM_HYPERV_SET_TLB_FLUSH_INHIBIT, &set);
+
+ clock_gettime(CLOCK_REALTIME, &abstime);
+ abstime.tv_sec += 5;
+ TEST_ASSERT(pthread_timedjoin_np(thread, NULL, &abstime) == 0,
+ "couldn't join thread");
+
+ TEST_ASSERT_EQ(data->hypercall_done, true);
+
+ printf("[ok]\n");
+}
+
+/* Test one vCPU being inhibited while two others try to flush its TLB */
+void test_multi_flusher(struct kvm_vm *vm, struct kvm_vcpu *inhibitor,
+ struct kvm_vcpu *flusher1, struct kvm_vcpu *flusher2)
+{
+ struct kvm_hyperv_tlb_flush_inhibit set;
+ struct test_data *data1, *data2;
+ pthread_t thread1, thread2;
+ unsigned int to_sleep;
+
+ printf("%s ...\t", __func__);
+
+ vcpu_arch_set_entry_point(flusher1, guest_main);
+ vcpu_arch_set_entry_point(flusher2, guest_main);
+
+ data1 = test_data_init(flusher1);
+ data2 = test_data_init(flusher2);
+
+ data1->entered = false;
+ data1->hypercall_done = false;
+ data1->hcall_gpa = addr_gva2gpa(vm, vm_vaddr_alloc_pages(vm, 1));
+ data2->entered = false;
+ data2->hypercall_done = false;
+ data2->hcall_gpa = addr_gva2gpa(vm, vm_vaddr_alloc_pages(vm, 1));
+
+ set.inhibit = true;
+ vcpu_ioctl(inhibitor, KVM_HYPERV_SET_TLB_FLUSH_INHIBIT, &set);
+
+ pthread_create(&thread1, NULL, vcpu_thread, flusher1);
+ pthread_create(&thread2, NULL, vcpu_thread, flusher2);
+
+ /* Waiting on the guests to fully enter */
+ while (READ_ONCE(data1->entered) == false)
+ asm volatile("nop");
+ while (READ_ONCE(data2->entered) == false)
+ asm volatile("nop");
+
+ /* Give the guests some time to attempt the hyper-call */
+ to_sleep = 2;
+ while ((to_sleep = sleep(to_sleep)))
+ asm volatile("nop");
+
+ TEST_ASSERT_EQ(data1->hypercall_done, false);
+ TEST_ASSERT_EQ(data2->hypercall_done, false);
+
+ TEST_ASSERT(pthread_tryjoin_np(thread1, NULL) != 0,
+ "thread 1 finished early");
+ TEST_ASSERT(pthread_tryjoin_np(thread2, NULL) != 0,
+ "thread 2 finished early");
+
+ set.inhibit = false;
+ vcpu_ioctl(inhibitor, KVM_HYPERV_SET_TLB_FLUSH_INHIBIT, &set);
+
+ clock_gettime(CLOCK_REALTIME, &abstime);
+ abstime.tv_sec += 5;
+ TEST_ASSERT(pthread_timedjoin_np(thread1, NULL, &abstime) == 0,
+ "couldn't join thread1");
+
+ clock_gettime(CLOCK_REALTIME, &abstime);
+ abstime.tv_sec += 5;
+ TEST_ASSERT(pthread_timedjoin_np(thread2, NULL, &abstime) == 0,
+ "couldn't join thread2");
+
+ TEST_ASSERT_EQ(data1->hypercall_done, true);
+ TEST_ASSERT_EQ(data2->hypercall_done, true);
+
+ printf("[ok]\n");
+}
+
+/* Test two vCPUs being inhibited while another tries to flush their TLBs */
+void test_multi_inhibitor(struct kvm_vm *vm, struct kvm_vcpu *inhibitor1,
+ struct kvm_vcpu *inhibitor2, struct kvm_vcpu *flusher)
+{
+ struct kvm_hyperv_tlb_flush_inhibit set;
+ struct test_data *data;
+ unsigned int to_sleep;
+ pthread_t thread;
+
+ printf("%s ...\t", __func__);
+
+ vcpu_arch_set_entry_point(flusher, guest_main);
+
+ data = test_data_init(flusher);
+
+ data->entered = false;
+ data->hypercall_done = false;
+ data->hcall_gpa = addr_gva2gpa(vm, vm_vaddr_alloc_pages(vm, 1));
+
+ set.inhibit = true;
+ vcpu_ioctl(inhibitor1, KVM_HYPERV_SET_TLB_FLUSH_INHIBIT, &set);
+ vcpu_ioctl(inhibitor2, KVM_HYPERV_SET_TLB_FLUSH_INHIBIT, &set);
+
+ pthread_create(&thread, NULL, vcpu_thread, flusher);
+
+ /* Waiting on the guest to fully enter */
+ while (READ_ONCE(data->entered) == false)
+ asm volatile ("nop");
+
+ /* Give the guest some time to attempt the hyper-call */
+ to_sleep = 2;
+ while ((to_sleep = sleep(to_sleep)))
+ asm volatile ("nop");
+
+ TEST_ASSERT_EQ(data->hypercall_done, false);
+ TEST_ASSERT(pthread_tryjoin_np(thread, NULL) != 0, "thread finished early");
+
+ set.inhibit = false;
+ vcpu_ioctl(inhibitor1, KVM_HYPERV_SET_TLB_FLUSH_INHIBIT, &set);
+
+ to_sleep = 1;
+ while ((to_sleep = sleep(to_sleep)))
+ asm volatile ("nop");
+
+ TEST_ASSERT_EQ(data->hypercall_done, false);
+ TEST_ASSERT(pthread_tryjoin_np(thread, NULL) != 0, "thread finished early");
+
+ set.inhibit = false;
+ vcpu_ioctl(inhibitor2, KVM_HYPERV_SET_TLB_FLUSH_INHIBIT, &set);
+
+ clock_gettime(CLOCK_REALTIME, &abstime);
+ abstime.tv_sec += 5;
+ TEST_ASSERT(pthread_timedjoin_np(thread, NULL, &abstime) == 0,
+ "couldn't join thread");
+
+ TEST_ASSERT_EQ(data->entered, true);
+ TEST_ASSERT_EQ(data->hypercall_done, true);
+
+ printf("[ok]\n");
+}
+
+int main(int argc, char *argv[])
+{
+ struct kvm_vcpu *vcpu[3];
+ struct kvm_vm *vm;
+
+ TEST_REQUIRE(kvm_has_cap(KVM_CAP_HYPERV_TLBFLUSH));
+ TEST_REQUIRE(kvm_has_cap(KVM_CAP_HYPERV_TLB_FLUSH_INHIBIT));
+
+ vm = vm_create_with_vcpus(3, guest_main, vcpu);
+
+ vcpu_set_hv_cpuid(vcpu[0]);
+ vcpu_set_hv_cpuid(vcpu[1]);
+ vcpu_set_hv_cpuid(vcpu[2]);
+
+ test_single(vm, vcpu[1], vcpu[0]);
+ test_multi_flusher(vm, vcpu[1], vcpu[0], vcpu[2]);
+ test_multi_inhibitor(vm, vcpu[1], vcpu[2], vcpu[0]);
+
+ kvm_vm_free(vm);
+
+ return 0;
+}
--
2.40.1
^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/7] KVM: x86: Introduce new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
2024-10-04 14:08 [PATCH 0/7] KVM: x86: Introduce new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
` (6 preceding siblings ...)
2024-10-04 14:08 ` [PATCH 7/7] KVM: selftests: Add tests for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT Nikolas Wipper
@ 2024-10-14 23:36 ` Sean Christopherson
7 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2024-10-14 23:36 UTC (permalink / raw)
To: Nikolas Wipper
Cc: Vitaly Kuznetsov, Nicolas Saenz Julienne, Alexander Graf,
James Gowans, nh-open-source, Paolo Bonzini, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Nikolas Wipper,
linux-kernel, kvm, x86, linux-doc, linux-kselftest
On Fri, Oct 04, 2024, Nikolas Wipper wrote:
> This series introduces a new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT. It
> allows hypervisors to inhibit remote TLB flushing of a vCPU coming from
> Hyper-V hyper-calls (namely HvFlushVirtualAddressSpace(Ex) and
> HvFlushirtualAddressList(Ex)). It is required to implement the
> HvTranslateVirtualAddress hyper-call as part of the ongoing effort to
> emulate VSM within KVM and QEMU. The hyper-call requires several new KVM
> APIs, one of which is KVM_HYPERV_SET_TLB_FLUSH_INHIBIT.
>
> Once the inhibit flag is set, any processor attempting to flush the TLB on
> the marked vCPU, with a HyperV hyper-call, will be suspended until the
> flag is cleared again. During the suspension the vCPU will not run at all,
> neither receiving events nor running other code. It will wake up from
> suspension once the vCPU it is waiting on clears the inhibit flag. This
> behaviour is specified in Microsoft's "Hypervisor Top Level Functional
> Specification" (TLFS).
>
> The vCPU will block execution during the suspension, making it transparent
> to the hypervisor.
s/hypervisor/VMM. In the world of KVM, the typical terminology is that KVM itself
is the hypervisor, and the userspace side is the VMM. It's not perfect, but it's
good enough and fairly ubiquitous at this point, and thus many readers will be
quite confused as to how a vCPU blocking is transparent to KVM :-)
^ permalink raw reply [flat|nested] 21+ messages in thread