Kernel KVM virtualization development

Kernel KVM virtualization development
 help / color / mirror / Atom feed

* Re: [PATCH v5 0/6] Deliver vGPU display refresh event to userspace
From: kraxel @ 2019-09-05  7:48 UTC (permalink / raw)
  To: Zhang, Tina
  Cc: intel-gvt-dev@lists.freedesktop.org, alex.williamson@redhat.com,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Yuan, Hang,
	Lv, Zhiyuan
In-Reply-To: <237F54289DF84E4997F34151298ABEBC8771E7AE@SHSMSX101.ccr.corp.intel.com>

  Hi,

> Option 2: QEMU provides the emulated display refresh event to the
> vgpus provided by vendor driver. For vgpus, the display refresh event
> can be considered as the vblank event which is leveraged by guest
> window manager to do the plane update or mode-setting.

> People are asking if option 2 could be a better choice.

Certainly worth trying, maybe it even makes sense to implement both and
let qemu pick one, possibly even switch them at runtime.

qemu can change the refresh rate.  vnc and sdl use that to reduce the
refresh rate in case nobody is looking (no vnc client connected, sdl
window minimized).  It surely makes sense to make that visible to the
guest so it can throttle display updates too.  I'm not sure vblank is
the way to go though, guests might run into vblank irq timeouts in case
the refresh rate is very low ...

cheers,
  Gerd

^ permalink raw reply

* [PATCH 1/2] KVM: LAPIC: Micro optimize IPI latency
From: Wanpeng Li @ 2019-09-05  6:26 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: Paolo Bonzini, Radim Krčmář

From: Wanpeng Li <wanpengli@tencent.com>

This patch optimizes the virtual IPI emulation sequence:

write ICR2                     write ICR2
write ICR                      read ICR2
read ICR            ==>        send virtual IPI
read ICR2                      write ICR
send virtual IPI

It can reduce kvm-unit-tests/vmexit.flat IPI testing latency(from sender
send IPI to sender receive the ACK) from 3319 cycles to 3203 cycles on
SKylake server.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kvm/lapic.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 12ade70..34fd299 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1200,10 +1200,8 @@ void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
 }
 EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
 
-static void apic_send_ipi(struct kvm_lapic *apic)
+static void apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
 {
-	u32 icr_low = kvm_lapic_get_reg(apic, APIC_ICR);
-	u32 icr_high = kvm_lapic_get_reg(apic, APIC_ICR2);
 	struct kvm_lapic_irq irq;
 
 	irq.vector = icr_low & APIC_VECTOR_MASK;
@@ -1940,8 +1938,9 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 	}
 	case APIC_ICR:
 		/* No delay here, so we always clear the pending bit */
-		kvm_lapic_set_reg(apic, APIC_ICR, val & ~(1 << 12));
-		apic_send_ipi(apic);
+		val &= ~(1 << 12);
+		apic_send_ipi(apic, val, kvm_lapic_get_reg(apic, APIC_ICR2));
+		kvm_lapic_set_reg(apic, APIC_ICR, val);
 		break;
 
 	case APIC_ICR2:
-- 
2.7.4


^ permalink raw reply related

* [PATCH 2/2] KVM: VMX: Stop the preemption timer during vCPU reset
From: Wanpeng Li @ 2019-09-05  6:26 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: Paolo Bonzini, Radim Krčmář
In-Reply-To: <1567664788-10249-1-git-send-email-wanpengli@tencent.com>

From: Wanpeng Li <wanpengli@tencent.com>

The hrtimer which is used to emulate lapic timer is stopped during 
vcpu reset, preemption timer should do the same.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kvm/vmx/vmx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 570a233..f794929 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4162,6 +4162,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 
 	vcpu->arch.microcode_version = 0x100000000ULL;
 	vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
+	vmx->hv_deadline_tsc = -1;
 	kvm_set_cr8(vcpu, 0);
 
 	if (!init_event) {
-- 
2.7.4


^ permalink raw reply related

* [PATCH v3 4/4] KVM: VMX: Change ple_window type to unsigned int
From: Peter Xu @ 2019-09-05  2:36 UTC (permalink / raw)
  To: kvm; +Cc: Sean Christopherson, Paolo Bonzini, Wanpeng Li, Krish Sadhukhan,
	peterx
In-Reply-To: <20190905023616.29082-1-peterx@redhat.com>

The VMX ple_window is 32 bits wide, so logically it can overflow with
an int.  The module parameter is declared as unsigned int which is
good, however the dynamic variable is not.  Switching all the
ple_window references to use unsigned int.

The tracepoint changes will also affect SVM, but SVM is using an even
smaller width (16 bits) so it's always fine.

Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/trace.h   | 8 ++++----
 arch/x86/kvm/vmx/vmx.c | 4 ++--
 arch/x86/kvm/vmx/vmx.h | 2 +-
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index f1177e03768f..ae924566c401 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -891,13 +891,13 @@ TRACE_EVENT(kvm_pml_full,
 );
 
 TRACE_EVENT(kvm_ple_window_update,
-	TP_PROTO(unsigned int vcpu_id, int new, int old),
+	TP_PROTO(unsigned int vcpu_id, unsigned int new, unsigned int old),
 	TP_ARGS(vcpu_id, new, old),
 
 	TP_STRUCT__entry(
 		__field(        unsigned int,   vcpu_id         )
-		__field(                 int,       new         )
-		__field(                 int,       old         )
+		__field(        unsigned int,       new         )
+		__field(        unsigned int,       old         )
 	),
 
 	TP_fast_assign(
@@ -906,7 +906,7 @@ TRACE_EVENT(kvm_ple_window_update,
 		__entry->old            = old;
 	),
 
-	TP_printk("vcpu %u old %d new %d (%s)",
+	TP_printk("vcpu %u old %u new %u (%s)",
 	          __entry->vcpu_id, __entry->old, __entry->new,
 		  __entry->old < __entry->new ? "growed" : "shrinked")
 );
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 469c4134a4a7..1dbb63ffdd6d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5227,7 +5227,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 static void grow_ple_window(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	int old = vmx->ple_window;
+	unsigned int old = vmx->ple_window;
 
 	vmx->ple_window = __grow_ple_window(old, ple_window,
 					    ple_window_grow,
@@ -5243,7 +5243,7 @@ static void grow_ple_window(struct kvm_vcpu *vcpu)
 static void shrink_ple_window(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	int old = vmx->ple_window;
+	unsigned int old = vmx->ple_window;
 
 	vmx->ple_window = __shrink_ple_window(old, ple_window,
 					      ple_window_shrink,
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 82d0bc3a4d52..64d5a4890aa9 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -253,7 +253,7 @@ struct vcpu_vmx {
 	struct nested_vmx nested;
 
 	/* Dynamic PLE window. */
-	int ple_window;
+	unsigned int ple_window;
 	bool ple_window_dirty;
 
 	bool req_immediate_exit;
-- 
2.21.0


^ permalink raw reply related

* [PATCH v3 3/4] KVM: X86: Tune PLE Window tracepoint
From: Peter Xu @ 2019-09-05  2:36 UTC (permalink / raw)
  To: kvm; +Cc: Sean Christopherson, Paolo Bonzini, Wanpeng Li, Krish Sadhukhan,
	peterx
In-Reply-To: <20190905023616.29082-1-peterx@redhat.com>

The PLE window tracepoint triggers even if the window is not changed,
and the wording can be a bit confusing too.  One example line:

  kvm_ple_window: vcpu 0: ple_window 4096 (shrink 4096)

It easily let people think of "the window now is 4096 which is
shrinked", but the truth is the value actually didn't change (4096).

Let's only dump this message if the value really changed, and we make
the message even simpler like:

  kvm_ple_window: vcpu 4 old 4096 new 8192 (growed)

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/svm.c     | 16 ++++++++--------
 arch/x86/kvm/trace.h   | 21 ++++++---------------
 arch/x86/kvm/vmx/vmx.c | 14 ++++++++------
 arch/x86/kvm/x86.c     |  2 +-
 4 files changed, 23 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d685491fce4d..d5cb6b5a9254 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1269,11 +1269,11 @@ static void grow_ple_window(struct kvm_vcpu *vcpu)
 							pause_filter_count_grow,
 							pause_filter_count_max);
 
-	if (control->pause_filter_count != old)
+	if (control->pause_filter_count != old) {
 		mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
-
-	trace_kvm_ple_window_grow(vcpu->vcpu_id,
-				  control->pause_filter_count, old);
+		trace_kvm_ple_window_update(vcpu->vcpu_id,
+					    control->pause_filter_count, old);
+	}
 }
 
 static void shrink_ple_window(struct kvm_vcpu *vcpu)
@@ -1287,11 +1287,11 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
 						    pause_filter_count,
 						    pause_filter_count_shrink,
 						    pause_filter_count);
-	if (control->pause_filter_count != old)
+	if (control->pause_filter_count != old) {
 		mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
-
-	trace_kvm_ple_window_shrink(vcpu->vcpu_id,
-				    control->pause_filter_count, old);
+		trace_kvm_ple_window_update(vcpu->vcpu_id,
+					    control->pause_filter_count, old);
+	}
 }
 
 static __init int svm_hardware_setup(void)
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 8a7570f8c943..f1177e03768f 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -890,36 +890,27 @@ TRACE_EVENT(kvm_pml_full,
 	TP_printk("vcpu %d: PML full", __entry->vcpu_id)
 );
 
-TRACE_EVENT(kvm_ple_window,
-	TP_PROTO(bool grow, unsigned int vcpu_id, int new, int old),
-	TP_ARGS(grow, vcpu_id, new, old),
+TRACE_EVENT(kvm_ple_window_update,
+	TP_PROTO(unsigned int vcpu_id, int new, int old),
+	TP_ARGS(vcpu_id, new, old),
 
 	TP_STRUCT__entry(
-		__field(                bool,      grow         )
 		__field(        unsigned int,   vcpu_id         )
 		__field(                 int,       new         )
 		__field(                 int,       old         )
 	),
 
 	TP_fast_assign(
-		__entry->grow           = grow;
 		__entry->vcpu_id        = vcpu_id;
 		__entry->new            = new;
 		__entry->old            = old;
 	),
 
-	TP_printk("vcpu %u: ple_window %d (%s %d)",
-	          __entry->vcpu_id,
-	          __entry->new,
-	          __entry->grow ? "grow" : "shrink",
-	          __entry->old)
+	TP_printk("vcpu %u old %d new %d (%s)",
+	          __entry->vcpu_id, __entry->old, __entry->new,
+		  __entry->old < __entry->new ? "growed" : "shrinked")
 );
 
-#define trace_kvm_ple_window_grow(vcpu_id, new, old) \
-	trace_kvm_ple_window(true, vcpu_id, new, old)
-#define trace_kvm_ple_window_shrink(vcpu_id, new, old) \
-	trace_kvm_ple_window(false, vcpu_id, new, old)
-
 TRACE_EVENT(kvm_pvclock_update,
 	TP_PROTO(unsigned int vcpu_id, struct pvclock_vcpu_time_info *pvclock),
 	TP_ARGS(vcpu_id, pvclock),
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 42ed3faa6af8..469c4134a4a7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5233,10 +5233,11 @@ static void grow_ple_window(struct kvm_vcpu *vcpu)
 					    ple_window_grow,
 					    ple_window_max);
 
-	if (vmx->ple_window != old)
+	if (vmx->ple_window != old) {
 		vmx->ple_window_dirty = true;
-
-	trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old);
+		trace_kvm_ple_window_update(vcpu->vcpu_id,
+					    vmx->ple_window, old);
+	}
 }
 
 static void shrink_ple_window(struct kvm_vcpu *vcpu)
@@ -5248,10 +5249,11 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
 					      ple_window_shrink,
 					      ple_window);
 
-	if (vmx->ple_window != old)
+	if (vmx->ple_window != old) {
 		vmx->ple_window_dirty = true;
-
-	trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old);
+		trace_kvm_ple_window_update(vcpu->vcpu_id,
+					    vmx->ple_window, old);
+	}
 }
 
 /*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 93b0bd45ac73..69ad184edc90 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10082,7 +10082,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_skinit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
-EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window_update);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pml_full);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pi_irte_update);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_unaccelerated_access);
-- 
2.21.0


^ permalink raw reply related

* [PATCH v3 2/4] KVM: X86: Remove tailing newline for tracepoints
From: Peter Xu @ 2019-09-05  2:36 UTC (permalink / raw)
  To: kvm; +Cc: Sean Christopherson, Paolo Bonzini, Wanpeng Li, Krish Sadhukhan,
	peterx
In-Reply-To: <20190905023616.29082-1-peterx@redhat.com>

It's done by TP_printk() already.

Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/trace.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 20d6cac9f157..8a7570f8c943 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1323,7 +1323,7 @@ TRACE_EVENT(kvm_avic_incomplete_ipi,
 		__entry->index = index;
 	),
 
-	TP_printk("vcpu=%u, icrh:icrl=%#010x:%08x, id=%u, index=%u\n",
+	TP_printk("vcpu=%u, icrh:icrl=%#010x:%08x, id=%u, index=%u",
 		  __entry->vcpu, __entry->icrh, __entry->icrl,
 		  __entry->id, __entry->index)
 );
@@ -1348,7 +1348,7 @@ TRACE_EVENT(kvm_avic_unaccelerated_access,
 		__entry->vec = vec;
 	),
 
-	TP_printk("vcpu=%u, offset=%#x(%s), %s, %s, vec=%#x\n",
+	TP_printk("vcpu=%u, offset=%#x(%s), %s, %s, vec=%#x",
 		  __entry->vcpu,
 		  __entry->offset,
 		  __print_symbolic(__entry->offset, kvm_trace_symbol_apic),
-- 
2.21.0


^ permalink raw reply related

* [PATCH v3 1/4] KVM: X86: Trace vcpu_id for vmexit
From: Peter Xu @ 2019-09-05  2:36 UTC (permalink / raw)
  To: kvm; +Cc: Sean Christopherson, Paolo Bonzini, Wanpeng Li, Krish Sadhukhan,
	peterx
In-Reply-To: <20190905023616.29082-1-peterx@redhat.com>

Tracing the ID helps to pair vmenters and vmexits for guests with
multiple vCPUs.

Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/trace.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index b5c831e79094..20d6cac9f157 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -232,17 +232,20 @@ TRACE_EVENT(kvm_exit,
 		__field(	u32,	        isa             )
 		__field(	u64,	        info1           )
 		__field(	u64,	        info2           )
+		__field(	unsigned int,	vcpu_id         )
 	),
 
 	TP_fast_assign(
 		__entry->exit_reason	= exit_reason;
 		__entry->guest_rip	= kvm_rip_read(vcpu);
 		__entry->isa            = isa;
+		__entry->vcpu_id        = vcpu->vcpu_id;
 		kvm_x86_ops->get_exit_info(vcpu, &__entry->info1,
 					   &__entry->info2);
 	),
 
-	TP_printk("reason %s rip 0x%lx info %llx %llx",
+	TP_printk("vcpu %u reason %s rip 0x%lx info %llx %llx",
+		  __entry->vcpu_id,
 		 (__entry->isa == KVM_ISA_VMX) ?
 		 __print_symbolic(__entry->exit_reason, VMX_EXIT_REASONS) :
 		 __print_symbolic(__entry->exit_reason, SVM_EXIT_REASONS),
-- 
2.21.0


^ permalink raw reply related

* [PATCH v3 0/4] KVM: X86: Some tracepoint enhancements
From: Peter Xu @ 2019-09-05  2:36 UTC (permalink / raw)
  To: kvm; +Cc: Sean Christopherson, Paolo Bonzini, Wanpeng Li, Krish Sadhukhan,
	peterx

v3:
- use unsigned int for vcpu id [Sean]
- a new patch to fix ple_window type [Sean]

v2:
- fix commit messages, change format of ple window tracepoint [Sean]
- rebase [Wanpeng]

Each small patch explains itself.  I noticed them when I'm tracing
some IRQ paths and I found them helpful at least to me.

Please have a look.  Thanks,

Peter Xu (4):
  KVM: X86: Trace vcpu_id for vmexit
  KVM: X86: Remove tailing newline for tracepoints
  KVM: X86: Tune PLE Window tracepoint
  KVM: VMX: Change ple_window type to unsigned int

 arch/x86/kvm/svm.c     | 16 ++++++++--------
 arch/x86/kvm/trace.h   | 34 ++++++++++++++--------------------
 arch/x86/kvm/vmx/vmx.c | 18 ++++++++++--------
 arch/x86/kvm/vmx/vmx.h |  2 +-
 arch/x86/kvm/x86.c     |  2 +-
 5 files changed, 34 insertions(+), 38 deletions(-)

-- 
2.21.0


^ permalink raw reply

* Re: [PATCH v2 3/3] KVM: X86: Tune PLE Window tracepoint
From: Peter Xu @ 2019-09-05  2:25 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, Paolo Bonzini, Wanpeng Li, Krish Sadhukhan
In-Reply-To: <20190904173254.GJ24079@linux.intel.com>

On Wed, Sep 04, 2019 at 10:32:54AM -0700, Sean Christopherson wrote:
> On Thu, Aug 15, 2019 at 06:34:58PM +0800, Peter Xu wrote:
> > The PLE window tracepoint triggers even if the window is not changed,
> > and the wording can be a bit confusing too.  One example line:
> > 
> >   kvm_ple_window: vcpu 0: ple_window 4096 (shrink 4096)
> > 
> > It easily let people think of "the window now is 4096 which is
> > shrinked", but the truth is the value actually didn't change (4096).
> > 
> > Let's only dump this message if the value really changed, and we make
> > the message even simpler like:
> > 
> >   kvm_ple_window: vcpu 4 old 4096 new 8192 (growed)
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/x86/kvm/svm.c     | 16 ++++++++--------
> >  arch/x86/kvm/trace.h   | 21 ++++++---------------
> >  arch/x86/kvm/vmx/vmx.c | 14 ++++++++------
> >  arch/x86/kvm/x86.c     |  2 +-
> >  4 files changed, 23 insertions(+), 30 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index d685491fce4d..d5cb6b5a9254 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -1269,11 +1269,11 @@ static void grow_ple_window(struct kvm_vcpu *vcpu)
> >  							pause_filter_count_grow,
> >  							pause_filter_count_max);
> >  
> > -	if (control->pause_filter_count != old)
> > +	if (control->pause_filter_count != old) {
> >  		mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
> > -
> > -	trace_kvm_ple_window_grow(vcpu->vcpu_id,
> > -				  control->pause_filter_count, old);
> > +		trace_kvm_ple_window_update(vcpu->vcpu_id,
> > +					    control->pause_filter_count, old);
> > +	}
> >  }
> >  
> >  static void shrink_ple_window(struct kvm_vcpu *vcpu)
> > @@ -1287,11 +1287,11 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
> >  						    pause_filter_count,
> >  						    pause_filter_count_shrink,
> >  						    pause_filter_count);
> > -	if (control->pause_filter_count != old)
> > +	if (control->pause_filter_count != old) {
> >  		mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
> > -
> > -	trace_kvm_ple_window_shrink(vcpu->vcpu_id,
> > -				    control->pause_filter_count, old);
> > +		trace_kvm_ple_window_update(vcpu->vcpu_id,
> > +					    control->pause_filter_count, old);
> > +	}
> >  }
> >  
> >  static __init int svm_hardware_setup(void)
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index 76a39bc25b95..97df9d7cae71 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -890,36 +890,27 @@ TRACE_EVENT(kvm_pml_full,
> >  	TP_printk("vcpu %d: PML full", __entry->vcpu_id)
> >  );
> >  
> > -TRACE_EVENT(kvm_ple_window,
> > -	TP_PROTO(bool grow, unsigned int vcpu_id, int new, int old),
> > -	TP_ARGS(grow, vcpu_id, new, old),
> > +TRACE_EVENT(kvm_ple_window_update,
> > +	TP_PROTO(unsigned int vcpu_id, int new, int old),
> > +	TP_ARGS(vcpu_id, new, old),
> >  
> >  	TP_STRUCT__entry(
> > -		__field(                bool,      grow         )
> >  		__field(        unsigned int,   vcpu_id         )
> >  		__field(                 int,       new         )
> >  		__field(                 int,       old         )
> 
> Not your code, but these should really be 'unsigned int', especially now
> that they are directly compared when printing "growed" versus "shrinked".
> For SVM it doesn't matter since the underlying hardware fields are only
> 16 bits, but on VMX they're 32 bits, e.g. theoretically userspace could
> set ple_window and ple_window_max to a negative value.
> 
> The ple_window variable in struct vcpu_vmx and local snapshots of the
> field should also be updated, but that can be done separately.

Indeed.  Let me add a separated patch.  Thanks,

-- 
Peter Xu

^ permalink raw reply

* Re: [PATCH v2 1/3] KVM: X86: Trace vcpu_id for vmexit
From: Peter Xu @ 2019-09-05  2:15 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, Paolo Bonzini, Wanpeng Li, Krish Sadhukhan
In-Reply-To: <20190904172658.GH24079@linux.intel.com>

On Wed, Sep 04, 2019 at 10:26:58AM -0700, Sean Christopherson wrote:
> On Thu, Aug 15, 2019 at 06:34:56PM +0800, Peter Xu wrote:
> > Tracing the ID helps to pair vmenters and vmexits for guests with
> > multiple vCPUs.
> > 
> > Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/x86/kvm/trace.h | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index b5c831e79094..c682f3f7f998 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -232,17 +232,20 @@ TRACE_EVENT(kvm_exit,
> >  		__field(	u32,	        isa             )
> >  		__field(	u64,	        info1           )
> >  		__field(	u64,	        info2           )
> > +		__field(	int,	        vcpu_id         )
> 
> It doesn't actually affect anything, but vcpu_id is stored and printed as
> an 'unsigned int' everywhere else in the trace code.  Stylistically I like
> that approach even though struct kvm_vcpu holds it as a signed int.

True.  I can switch to unsigned int to get aligned with the rest if
there's other comment to address.  Though from codebase-wise I would
even prefer signed because it gives us a chance to set an invalid vcpu
id (-1) where necessary, or notice something severly wrong when <-1.
After all it should far cover our usage (IIUC max vcpu id should be
512 cross archs).

Thanks,

-- 
Peter Xu

^ permalink raw reply

* [PATCH v3] doc: kvm: Fix return description of KVM_SET_MSRS
From: Xiaoyao Li @ 2019-09-05  0:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Xiaoyao Li, Jonathan Corbet, Sean Christopherson, kvm, linux-doc,
	linux-kernel

Userspace can use ioctl KVM_SET_MSRS to update a set of MSRs of guest.
This ioctl set specified MSRs one by one. If it fails to set an MSR,
e.g., due to setting reserved bits, the MSR is not supported/emulated by
KVM, etc..., it stops processing the MSR list and returns the number of
MSRs have been set successfully.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
v3:
  refine the description based on Sean's comment.  

v2:
  elaborate the changelog and description of ioctl KVM_SET_MSRS based on
  Sean's comments.
---
 Documentation/virt/kvm/api.txt | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 2d067767b617..24541e52e96e 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -586,7 +586,7 @@ Capability: basic
 Architectures: x86
 Type: vcpu ioctl
 Parameters: struct kvm_msrs (in)
-Returns: 0 on success, -1 on error
+Returns: number of msrs successfully set (see below), -1 on error
 
 Writes model-specific registers to the vcpu.  See KVM_GET_MSRS for the
 data structures.
@@ -595,6 +595,11 @@ Application code should set the 'nmsrs' member (which indicates the
 size of the entries array), and the 'index' and 'data' members of each
 array entry.
 
+It tries to set the MSRs in array entries[] one by one. If setting an MSR
+fails, e.g., due to setting reserved bits, the MSR isn't supported/emulated
+by KVM, etc..., it stops processing the MSR list and returns the number of
+MSRs that have been set successfully.
+
 
 4.20 KVM_SET_CPUID
 
-- 
2.19.1


^ permalink raw reply related

* Re: [PATCH v2] doc: kvm: Fix return description of KVM_SET_MSRS
From: Xiaoyao Li @ 2019-09-05  1:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Radim Krčmář, Jonathan Corbet, kvm,
	linux-doc, linux-kernel
In-Reply-To: <20190904174122.GK24079@linux.intel.com>

On 9/5/2019 1:41 AM, Sean Christopherson wrote:
> On Wed, Sep 04, 2019 at 02:01:18PM +0800, Xiaoyao Li wrote:
>> Userspace can use ioctl KVM_SET_MSRS to update a set of MSRs of guest.
>> This ioctl sets specified MSRs one by one. Once it fails to set an MSR
>> due to setting reserved bits, the MSR is not supported/emulated by kvm,
>> or violating other restrictions, it stops further processing and returns
>> the number of MSRs have been set successfully.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>> v2:
>>    elaborate the changelog and description of ioctl KVM_SET_MSRS based on
>>    Sean's comments.
>>
>> ---
>>   Documentation/virt/kvm/api.txt | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
>> index 2d067767b617..4638e893dec0 100644
>> --- a/Documentation/virt/kvm/api.txt
>> +++ b/Documentation/virt/kvm/api.txt
>> @@ -586,7 +586,7 @@ Capability: basic
>>   Architectures: x86
>>   Type: vcpu ioctl
>>   Parameters: struct kvm_msrs (in)
>> -Returns: 0 on success, -1 on error
>> +Returns: number of msrs successfully set (see below), -1 on error
>>   
>>   Writes model-specific registers to the vcpu.  See KVM_GET_MSRS for the
>>   data structures.
>> @@ -595,6 +595,11 @@ Application code should set the 'nmsrs' member (which indicates the
>>   size of the entries array), and the 'index' and 'data' members of each
>>   array entry.
>>   
>> +It tries to set the MSRs in array entries[] one by one. Once failing to
> 
> Probably better to say 'If' as opposed to 'Once', don't want to imply that
> userspace is incompetent :)
> 
>> +set an MSR (due to setting reserved bits, the MSR is not supported/emulated
>> +by kvm, or violating other restrctions),
> 
> Make it clear the list is not exhaustive, e.g.:
> 
> It tries to set the MSRs in array entries[] one by one.  If setting an MSR
> fails, e.g. due to setting reserved bits, the MSR isn't supported/emulated by
> KVM, etc..., it stops processing the MSR list and returns the number of MSRs
> that have been set successfully.
>

Refine it as you commented and send out v3, thanks.

>> it stops setting following MSRs
>> +and returns the number of MSRs have been set successfully.
>> +
>>   
>>   4.20 KVM_SET_CPUID
>>   
>> -- 
>> 2.19.1
>>

^ permalink raw reply

* Re: [kvm-unit-tests PATCH v3 7/8] x86: VMX: Make guest_state_test_main() check state from nested VM
From: Oliver Upton @ 2019-09-05  0:49 UTC (permalink / raw)
  To: Krish Sadhukhan
  Cc: kvm, Paolo Bonzini, Radim Krčmář, Jim Mattson,
	Peter Shier, Sean Christopherson
In-Reply-To: <1fb19467-a743-1886-de52-a63bd19b0b00@oracle.com>

On Wed, Sep 04, 2019 at 05:25:40PM -0700, Krish Sadhukhan wrote:
> 
> 
> On 09/03/2019 02:58 PM, Oliver Upton wrote:
> > The current tests for guest state do not yet check the validity of
> > loaded state from within the nested VM. Introduce the
> > load_state_test_data struct to share data with the nested VM.
> > 
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > ---
> >   x86/vmx_tests.c | 23 ++++++++++++++++++++---
> >   1 file changed, 20 insertions(+), 3 deletions(-)
> > 
> > diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
> > index f035f24a771a..b72a27583793 100644
> > --- a/x86/vmx_tests.c
> > +++ b/x86/vmx_tests.c
> > @@ -5017,13 +5017,28 @@ static void test_entry_msr_load(void)
> >   	test_vmx_valid_controls(false);
> >   }
> > +static struct load_state_test_data {
> > +	u32 msr;
> > +	u64 exp;
> > +	bool enabled;
> > +} load_state_test_data;
> 
> A better name is probably 'loaded_state_test_data'  as you are checking the
> validity of the loaded MSR in the guest.

Other usages of structs for data sharing follow the previous naming
convention, but I slightly missed the mark with that as well. Other
structs seem to use the same prefix that the associated tests have (e.g.
ept_access_test_data corresponds to ept_access_test_*). To best match
that pattern, I should instead name it "vmx_state_area_test_data" (since
its used for both guest/host test data anyway.

That isn't to say there is a better pattern we could follow for naming
this! Which do you think is better?

> > +
> >   static void guest_state_test_main(void)
> >   {
> > +	u64 obs;
> > +	struct load_state_test_data *data = &load_state_test_data;
> > +
> >   	while (1) {
> > -		if (vmx_get_test_stage() != 2)
> > -			vmcall();
> > -		else
> > +		if (vmx_get_test_stage() == 2)
> >   			break;
> > +
> > +		if (data->enabled) {
> > +			obs = rdmsr(obs);
> 
> Although you fixed it in the next patch, why not use  'data->msr' in place
> of 'obs' as the parameter to rdmsr() in this patch only ?

Ugh, I mucked this up when reworking before sending out. 'data->msr'
should have appeared in this patch. I'll fix this.

> > +			report("Guest state is 0x%lx (expected 0x%lx)",
> > +			       data->exp == obs, obs, data->exp);
> > +		}
> > +
> > +		vmcall();
> >   	}
> >   	asm volatile("fnop");
> > @@ -6854,7 +6869,9 @@ static void test_pat(u32 field, const char * field_name, u32 ctrl_field,
> >   	u64 i, val;
> >   	u32 j;
> >   	int error;
> > +	struct load_state_test_data *data = &load_state_test_data;
> > +	data->enabled = false;
> >   	vmcs_clear_bits(ctrl_field, ctrl_bit);
> >   	if (field == GUEST_PAT) {
> >   		vmx_set_test_stage(1);
>

Thanks for the review, Krish. Looks like a typo I didn't rework into
this patch correctly, please let me know what you think on the other
comment.

--
Thanks,
Oliver

^ permalink raw reply

* Re: [kvm-unit-tests PATCH v3 8/8] x86: VMX: Add tests for nested "load IA32_PERF_GLOBAL_CTRL"
From: Oliver Upton @ 2019-09-05  0:35 UTC (permalink / raw)
  To: Krish Sadhukhan
  Cc: kvm, Paolo Bonzini, Radim Krčmář, Jim Mattson,
	Peter Shier, Sean Christopherson
In-Reply-To: <5dc635d7-607a-ef0e-d9c9-8dced7fd89b2@oracle.com>

On Wed, Sep 04, 2019 at 05:13:17PM -0700, Krish Sadhukhan wrote:
> 
> 
> On 09/03/2019 02:58 PM, Oliver Upton wrote:
> > Tests to verify that KVM performs the correct checks on Host/Guest state
> > at VM-entry, as described in SDM 26.3.1.1 "Checks on Guest Control
> > Registers, Debug Registers, and MSRs" and SDM 26.2.2 "Checks on Host
> > Control Registers and MSRs".
> > 
> > Test that KVM does the following:
> > 
> >      If the "load IA32_PERF_GLOBAL_CTRL" VM-entry control is 1, the
> >      reserved bits of the IA32_PERF_GLOBAL_CTRL MSR must be 0 in the
> >      GUEST_IA32_PERF_GLOBAL_CTRL VMCS field. Otherwise, the VM-entry
> >      should fail with an exit reason of "VM-entry failure due to invalid
> >      guest state" (33). On a successful VM-entry, the correct value
> >      should be observed when the nested VM performs an RDMSR on
> >      IA32_PERF_GLOBAL_CTRL.
> > 
> >      If the "load IA32_PERF_GLOBAL_CTRL" VM-exit control is 1, the
> >      reserved bits of the IA32_PERF_GLOBAL_CTRL MSR must be 0 in the
> >      HOST_IA32_PERF_GLOBAL_CTRL VMCS field. Otherwise, the VM-entry
> >      should fail with a VM-instruction error of "VM entry with invalid
> >      host-state field(s)" (8). On a successful VM-exit, the correct value
> >      should be observed when L1 performs an RDMSR on
> >      IA32_PERF_GLOBAL_CTRL.
> > 
> > Suggested-by: Jim Mattson <jmattson@google.com>
> > Co-developed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> > Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > ---
> >   x86/vmx_tests.c | 199 +++++++++++++++++++++++++++++++++++++++++++++++-
> >   1 file changed, 197 insertions(+), 2 deletions(-)
> > 
> > diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
> > index b72a27583793..73c46eba6be9 100644
> > --- a/x86/vmx_tests.c
> > +++ b/x86/vmx_tests.c
> > @@ -5033,7 +5033,7 @@ static void guest_state_test_main(void)
> >   			break;
> >   		if (data->enabled) {
> > -			obs = rdmsr(obs);
> > +			obs = rdmsr(data->msr);
> >   			report("Guest state is 0x%lx (expected 0x%lx)",
> >   			       data->exp == obs, obs, data->exp);
> >   		}
> > @@ -6854,6 +6854,200 @@ static void test_host_efer(void)
> >   	test_efer(HOST_EFER, "HOST_EFER", EXI_CONTROLS, EXI_LOAD_EFER);
> >   }
> > +union cpuid10_eax {
> > +	struct {
> > +		unsigned int version_id:8;
> > +		unsigned int num_counters:8;
> 
> It's better to name it num_gpc or num_counters_gp.

Ack to all style comments that follow. I brought these structs over from
kvm, but I'll align the style with what is suggested for kvm-unit-tests.

> > +		unsigned int bit_width:8;
> > +		unsigned int mask_length:8;
> > +	} split;
> > +	unsigned int full;
> > +};
> > +
> > +union cpuid10_edx {
> > +	struct {
> > +		unsigned int num_counters_fixed:5;
> > +		unsigned int bit_width_fixed:8;
> > +		unsigned int reserved:19;
> > +	} split;
> > +	unsigned int full;
> > +};
> > +
> 
> In kvm-unit-test source,  the naming of variables, functions etc. seems to
> be based on the hex value of the CPUID leaf that they represent. For
> example, cpuid_7_ebx, check_cpuid_80000001_edx etc. Isn't it better to name
> these structures something like cpuidA_eax and cpuidA_edx ?
> 
> > +static bool valid_pgc(u64 val)
> > +{
> > +	struct cpuid id;
> > +	union cpuid10_eax eax;
> > +	union cpuid10_edx edx;
> > +	u64 mask;
> > +
> > +	id = cpuid(0xA);
> > +	eax.full = id.a;
> > +	edx.full = id.d;
> > +	mask = ~(((1ull << eax.split.num_counters) - 1) |
> > +		(((1ull << edx.split.num_counters_fixed) - 1) << 32));
> > +
> > +	return !(val & mask);
> > +}
> > +
> > +static void test_pgc_vmlaunch(u32 xerror, bool xfail, bool host)
> > +{
> > +	u32 inst_err;
> > +	u64 guest_rip, inst_len, obs;
> > +	bool success;
> > +	struct load_state_test_data *data = &load_state_test_data;
> > +
> > +	if (host) {
> > +		success = vmlaunch_succeeds();
> > +		obs = rdmsr(data->msr);
> > +		if (data->enabled && success)
> > +			report("Host state is 0x%lx (expected 0x%lx)",
> > +			       data->exp == obs, obs, data->exp);
> > +	} else {
> > +		if (xfail)
> > +			enter_guest_with_invalid_guest_state();
> > +		else
> > +			enter_guest();
> > +		success = VMX_VMCALL == (vmcs_read(EXI_REASON) & 0xff);
> > +		guest_rip = vmcs_read(GUEST_RIP);
> > +		inst_len = vmcs_read(EXI_INST_LEN);
> > +		if (success)
> > +			vmcs_write(GUEST_RIP, guest_rip + inst_len);
> 
> Is it possible to re-use the existing report_guest_state_test() here and the
> below code ?

Ah, I missed this in my cursory search. Better to reuse what's already
here :)

> > +	}
> > +	if (!success) {
> > +		inst_err = vmcs_read(VMX_INST_ERROR);
> > +		report("vmlaunch failed, VMX Inst Error is %d (expected %d)",
> > +		       xerror == inst_err, inst_err, xerror);
> > +	} else {
> > +		report("vmlaunch succeeded", success != xfail);
> > +	}
> > +}
> > +
> > +/*
> > + * test_load_pgc is a generic function for testing the
> > + * "load IA32_PERF_GLOBAL_CTRL" VM-{entry,exit} control. This test function
> > + * will test the provided ctrl_val disabled and enabled.
> > + *
> > + * @nr - VMCS field number corresponding to the Host/Guest state field
> > + * @name - Name of the above VMCS field for printing in test report
> > + * @ctrl_nr - VMCS field number corresponding to the VM-{entry,exit} control
> > + * @ctrl_val - Bit to set on the ctrl field.
> > + */
> > +static void test_load_pgc(u32 nr, const char *name, u32 ctrl_nr,
> > +			  const char *ctrl_name, u64 ctrl_val)
> > +{
> > +	u64 ctrl_saved = vmcs_read(ctrl_nr);
> > +	u64 pgc_saved = vmcs_read(nr);
> > +	u64 i, val;
> > +	bool host = nr == HOST_PERF_GLOBAL_CTRL;
> > +	struct load_state_test_data *data = &load_state_test_data;
> > +
> > +	data->msr = MSR_CORE_PERF_GLOBAL_CTRL;
> > +	msr_bmp_init();
> > +	if (!host) {
> > +		vmx_set_test_stage(1);
> > +		test_set_guest(guest_state_test_main);
> > +	}
> > +	vmcs_write(ctrl_nr, ctrl_saved & ~ctrl_val);
> > +	data->enabled = false;
> > +	report_prefix_pushf("\"load IA32_PERF_GLOBAL_CTRL\"=0 on %s",
> > +			    ctrl_name);
> > +	for (i = 0; i < 64; i++) {
> > +		val = 1ull << i;
> > +		vmcs_write(nr, val);
> > +		report_prefix_pushf("%s = 0x%lx", name, val);
> > +		/*
> > +		 * If the "load IA32_PERF_GLOBAL_CTRL" bit is 0 then
> > +		 * the {HOST,GUEST}_IA32_PERF_GLOBAL_CTRL field is ignored,
> > +		 * thus setting reserved bits in this field does not cause
> > +		 * vmlaunch to fail.
> > +		 */
> 
> This comment is really not required as it's obvious that there's no effect
> on VM-entry if the control bit is not set.

I'll drop this.

> > +		test_pgc_vmlaunch(0, false, host);
> > +		report_prefix_pop();
> > +	}
> > +	report_prefix_pop();
> > +
> > +	vmcs_write(ctrl_nr, ctrl_saved | ctrl_val);
> > +	data->enabled = true;
> > +	report_prefix_pushf("\"load IA32_PERF_GLOBAL_CTRL\"=1 on %s",
> > +			    ctrl_name);
> > +	for (i = 0; i < 64; i++) {
> > +		val = 1ull << i;
> > +		data->exp = val;
> > +		vmcs_write(nr, val);
> > +		report_prefix_pushf("%s = 0x%lx", name, val);
> > +		if (valid_pgc(val)) {
> > +			test_pgc_vmlaunch(0, false, host);
> > +		} else {
> > +			/*
> > +			 * [SDM 30.4]
> > +			 *
> > +			 * Invalid host state fields result in an VM
> > +			 * instruction error with error number 8
> > +			 * (VMXERR_ENTRY_INVALID_HOST_STATE_FIELD)
> > +			 */
> > +			if (host) {
> > +				test_pgc_vmlaunch(
> > +					VMXERR_ENTRY_INVALID_HOST_STATE_FIELD,
> > +					true, host);
> > +			/*
> > +			 * [SDM 26.1]
> > +			 *
> > +			 * If a VM-Entry fails according to one of
> > +			 * the guest-state checks, the exit reason on the VMCS
> > +			 * will be set to reason number 33 (VMX_FAIL_STATE)
> 
> Again, these comments are probably not necessary.

I wanted to capture the nuance between host/guest, since the control
flow seems unnecessarily complicated on first glance. If they're
distracting, I'll drop these comments as well.

> > +			 */
> > +			} else {
> > +				test_pgc_vmlaunch(
> > +					0,
> > +					true, host);
> > +				TEST_ASSERT_EQ(
> > +					VMX_ENTRY_FAILURE | VMX_FAIL_STATE,
> > +					vmcs_read(EXI_REASON));
> > +			}
> > +		}
> > +		report_prefix_pop();
> > +	}
> > +
> > +	report_prefix_pop();
> > +
> > +	if (nr == GUEST_PERF_GLOBAL_CTRL) {
> > +		/*
> > +		 * Let the guest finish execution
> > +		 */
> > +		vmx_set_test_stage(2);
> > +		vmcs_write(ctrl_nr, ctrl_saved);
> > +		vmcs_write(nr, pgc_saved);
> > +		enter_guest();
> > +	}
> > +
> > +	vmcs_write(ctrl_nr, ctrl_saved);
> > +	vmcs_write(nr, pgc_saved);
> > +}
> > +
> > +static void test_load_host_pgc(void)
> > +{
> > +	if (!(ctrl_exit_rev.clr & EXI_LOAD_PERF)) {
> > +		printf("\"load IA32_PERF_GLOBAL_CTRL\" "
> > +		       "exit control not supported\n");
> > +		return;
> > +	}
> > +
> > +	test_load_pgc(HOST_PERF_GLOBAL_CTRL, "HOST_PERF_GLOBAL_CTRL",
> > +		      EXI_CONTROLS, "EXI_CONTROLS", EXI_LOAD_PERF);
> > +}
> > +
> > +
> > +static void test_load_guest_pgc(void)
> > +{
> > +	if (!(ctrl_enter_rev.clr & ENT_LOAD_PERF)) {
> > +		printf("\"load IA32_PERF_GLOBAL_CTRL\" "
> > +		       "entry control not supported\n");
> > +	}
> > +
> > +	test_load_pgc(GUEST_PERF_GLOBAL_CTRL, "GUEST_PERF_GLOBAL_CTRL",
> > +		      ENT_CONTROLS, "ENT_CONTROLS", ENT_LOAD_PERF);
> > +}
> > +
> >   /*
> >    * PAT values higher than 8 are uninteresting since they're likely lumped
> >    * in with "8". We only test values above 8 one bit at a time,
> > @@ -7147,6 +7341,7 @@ static void vmx_host_state_area_test(void)
> >   	test_sysenter_field(HOST_SYSENTER_EIP, "HOST_SYSENTER_EIP");
> >   	test_host_efer();
> > +	test_load_host_pgc();
> 
> I would rename it to test_load_host_perf_global_ctrl to avoid confusion
> between "performance counter" and "VMCS control field" though the name is
> bit long.

Sure thing.

> >   	test_load_host_pat();
> >   	test_host_segment_regs();
> >   	test_host_desc_tables();
> > @@ -8587,7 +8782,6 @@ static int invalid_msr_entry_failure(struct vmentry_failure *failure)
> >   	return VMX_TEST_VMEXIT;
> >   }
> > -
> >   #define TEST(name) { #name, .v2 = name }
> >   /* name/init/guest_main/exit_handler/syscall_handler/guest_regs */
> > @@ -8637,6 +8831,7 @@ struct vmx_test vmx_tests[] = {
> >   	TEST(vmx_host_state_area_test),
> >   	TEST(vmx_guest_state_area_test),
> >   	TEST(vmentry_movss_shadow_test),
> > +	TEST(test_load_guest_pgc),
> 
> Same comment as above.
> The other comment I have is, why not put this inside of
> 'vmx_guest_state_area_test' because this VMCS field belongs to Guest State
> Area only.

This is actually a point of discussion. Are we going to want every guest
state test to set up its guest each time? As the code stands currently,
we can only set an L2 guest once, or the test will fail out.

This was less invasive, but I think the better solution would be to set
up the guest once from 'vmx_guest_state_area_test', let the test
functions run, then clean it up. This would require also moving the
guest setup tidbits out of test_pat(). Would you agree this is the
better route?

> >   	/* APICv tests */
> >   	TEST(vmx_eoi_bitmap_ioapic_scan_test),
> >   	TEST(vmx_hlt_with_rvi_test),
>

Thanks for the review, Krish. I'll address style fixes where noted, only
pending question is on the guest state test.

--
Thanks,
Oliver

^ permalink raw reply

* Re: [kvm-unit-tests PATCH v3 7/8] x86: VMX: Make guest_state_test_main() check state from nested VM
From: Krish Sadhukhan @ 2019-09-05  0:25 UTC (permalink / raw)
  To: Oliver Upton, kvm, Paolo Bonzini, Radim Krčmář
  Cc: Jim Mattson, Peter Shier, Sean Christopherson
In-Reply-To: <20190903215801.183193-8-oupton@google.com>



On 09/03/2019 02:58 PM, Oliver Upton wrote:
> The current tests for guest state do not yet check the validity of
> loaded state from within the nested VM. Introduce the
> load_state_test_data struct to share data with the nested VM.
>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>   x86/vmx_tests.c | 23 ++++++++++++++++++++---
>   1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
> index f035f24a771a..b72a27583793 100644
> --- a/x86/vmx_tests.c
> +++ b/x86/vmx_tests.c
> @@ -5017,13 +5017,28 @@ static void test_entry_msr_load(void)
>   	test_vmx_valid_controls(false);
>   }
>   
> +static struct load_state_test_data {
> +	u32 msr;
> +	u64 exp;
> +	bool enabled;
> +} load_state_test_data;

A better name is probably 'loaded_state_test_data'  as you are checking 
the validity of the loaded MSR in the guest.

> +
>   static void guest_state_test_main(void)
>   {
> +	u64 obs;
> +	struct load_state_test_data *data = &load_state_test_data;
> +
>   	while (1) {
> -		if (vmx_get_test_stage() != 2)
> -			vmcall();
> -		else
> +		if (vmx_get_test_stage() == 2)
>   			break;
> +
> +		if (data->enabled) {
> +			obs = rdmsr(obs);

Although you fixed it in the next patch, why not use  'data->msr' in 
place of 'obs' as the parameter to rdmsr() in this patch only ?

> +			report("Guest state is 0x%lx (expected 0x%lx)",
> +			       data->exp == obs, obs, data->exp);
> +		}
> +
> +		vmcall();
>   	}
>   
>   	asm volatile("fnop");
> @@ -6854,7 +6869,9 @@ static void test_pat(u32 field, const char * field_name, u32 ctrl_field,
>   	u64 i, val;
>   	u32 j;
>   	int error;
> +	struct load_state_test_data *data = &load_state_test_data;
>   
> +	data->enabled = false;
>   	vmcs_clear_bits(ctrl_field, ctrl_bit);
>   	if (field == GUEST_PAT) {
>   		vmx_set_test_stage(1);


^ permalink raw reply

* Re: [PATCH v2] cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available
From: Wanpeng Li @ 2019-09-05  0:15 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: LKML, kvm, Paolo Bonzini, Radim Krčmář,
	Marcelo Tosatti, Linux PM
In-Reply-To: <a70aeec2-1572-ea09-a0c5-299cd70ddc8a@intel.com>

On Wed, 4 Sep 2019 at 17:48, Rafael J. Wysocki
<rafael.j.wysocki@intel.com> wrote:
>
> On 8/29/2019 10:49 AM, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > The downside of guest side polling is that polling is performed even
> > with other runnable tasks in the host. However, even if poll in kvm
> > can aware whether or not other runnable tasks in the same pCPU, it
> > can still incur extra overhead in over-subscribe scenario. Now we can
> > just enable guest polling when dedicated pCPUs are available.
> >
> > Acked-by: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
>
> As stated before, I'm going to queue up this change for 5.4, with the
> Paolo's ACK.
>
> BTW, in the future please CC power management changes to
> linux-pm@vger.kernel.org for easier handling.

Ok, thanks.

Wanpeng

^ permalink raw reply

* Re: [kvm-unit-tests PATCH v3 8/8] x86: VMX: Add tests for nested "load IA32_PERF_GLOBAL_CTRL"
From: Krish Sadhukhan @ 2019-09-05  0:13 UTC (permalink / raw)
  To: Oliver Upton, kvm, Paolo Bonzini, Radim Krčmář
  Cc: Jim Mattson, Peter Shier, Sean Christopherson
In-Reply-To: <20190903215801.183193-9-oupton@google.com>



On 09/03/2019 02:58 PM, Oliver Upton wrote:
> Tests to verify that KVM performs the correct checks on Host/Guest state
> at VM-entry, as described in SDM 26.3.1.1 "Checks on Guest Control
> Registers, Debug Registers, and MSRs" and SDM 26.2.2 "Checks on Host
> Control Registers and MSRs".
>
> Test that KVM does the following:
>
>      If the "load IA32_PERF_GLOBAL_CTRL" VM-entry control is 1, the
>      reserved bits of the IA32_PERF_GLOBAL_CTRL MSR must be 0 in the
>      GUEST_IA32_PERF_GLOBAL_CTRL VMCS field. Otherwise, the VM-entry
>      should fail with an exit reason of "VM-entry failure due to invalid
>      guest state" (33). On a successful VM-entry, the correct value
>      should be observed when the nested VM performs an RDMSR on
>      IA32_PERF_GLOBAL_CTRL.
>
>      If the "load IA32_PERF_GLOBAL_CTRL" VM-exit control is 1, the
>      reserved bits of the IA32_PERF_GLOBAL_CTRL MSR must be 0 in the
>      HOST_IA32_PERF_GLOBAL_CTRL VMCS field. Otherwise, the VM-entry
>      should fail with a VM-instruction error of "VM entry with invalid
>      host-state field(s)" (8). On a successful VM-exit, the correct value
>      should be observed when L1 performs an RDMSR on
>      IA32_PERF_GLOBAL_CTRL.
>
> Suggested-by: Jim Mattson <jmattson@google.com>
> Co-developed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>   x86/vmx_tests.c | 199 +++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 197 insertions(+), 2 deletions(-)
>
> diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
> index b72a27583793..73c46eba6be9 100644
> --- a/x86/vmx_tests.c
> +++ b/x86/vmx_tests.c
> @@ -5033,7 +5033,7 @@ static void guest_state_test_main(void)
>   			break;
>   
>   		if (data->enabled) {
> -			obs = rdmsr(obs);
> +			obs = rdmsr(data->msr);
>   			report("Guest state is 0x%lx (expected 0x%lx)",
>   			       data->exp == obs, obs, data->exp);
>   		}
> @@ -6854,6 +6854,200 @@ static void test_host_efer(void)
>   	test_efer(HOST_EFER, "HOST_EFER", EXI_CONTROLS, EXI_LOAD_EFER);
>   }
>   
> +union cpuid10_eax {
> +	struct {
> +		unsigned int version_id:8;
> +		unsigned int num_counters:8;

It's better to name it num_gpc or num_counters_gp.

> +		unsigned int bit_width:8;
> +		unsigned int mask_length:8;
> +	} split;
> +	unsigned int full;
> +};
> +
> +union cpuid10_edx {
> +	struct {
> +		unsigned int num_counters_fixed:5;
> +		unsigned int bit_width_fixed:8;
> +		unsigned int reserved:19;
> +	} split;
> +	unsigned int full;
> +};
> +

In kvm-unit-test source,  the naming of variables, functions etc. seems 
to be based on the hex value of the CPUID leaf that they represent. For 
example, cpuid_7_ebx, check_cpuid_80000001_edx etc. Isn't it better to 
name these structures something like cpuidA_eax and cpuidA_edx ?

> +static bool valid_pgc(u64 val)
> +{
> +	struct cpuid id;
> +	union cpuid10_eax eax;
> +	union cpuid10_edx edx;
> +	u64 mask;
> +
> +	id = cpuid(0xA);
> +	eax.full = id.a;
> +	edx.full = id.d;
> +	mask = ~(((1ull << eax.split.num_counters) - 1) |
> +		(((1ull << edx.split.num_counters_fixed) - 1) << 32));
> +
> +	return !(val & mask);
> +}
> +
> +static void test_pgc_vmlaunch(u32 xerror, bool xfail, bool host)
> +{
> +	u32 inst_err;
> +	u64 guest_rip, inst_len, obs;
> +	bool success;
> +	struct load_state_test_data *data = &load_state_test_data;
> +
> +	if (host) {
> +		success = vmlaunch_succeeds();
> +		obs = rdmsr(data->msr);
> +		if (data->enabled && success)
> +			report("Host state is 0x%lx (expected 0x%lx)",
> +			       data->exp == obs, obs, data->exp);
> +	} else {
> +		if (xfail)
> +			enter_guest_with_invalid_guest_state();
> +		else
> +			enter_guest();
> +		success = VMX_VMCALL == (vmcs_read(EXI_REASON) & 0xff);
> +		guest_rip = vmcs_read(GUEST_RIP);
> +		inst_len = vmcs_read(EXI_INST_LEN);
> +		if (success)
> +			vmcs_write(GUEST_RIP, guest_rip + inst_len);

Is it possible to re-use the existing report_guest_state_test() here and 
the below code ?

> +	}
> +	if (!success) {
> +		inst_err = vmcs_read(VMX_INST_ERROR);
> +		report("vmlaunch failed, VMX Inst Error is %d (expected %d)",
> +		       xerror == inst_err, inst_err, xerror);
> +	} else {
> +		report("vmlaunch succeeded", success != xfail);
> +	}
> +}
> +
> +/*
> + * test_load_pgc is a generic function for testing the
> + * "load IA32_PERF_GLOBAL_CTRL" VM-{entry,exit} control. This test function
> + * will test the provided ctrl_val disabled and enabled.
> + *
> + * @nr - VMCS field number corresponding to the Host/Guest state field
> + * @name - Name of the above VMCS field for printing in test report
> + * @ctrl_nr - VMCS field number corresponding to the VM-{entry,exit} control
> + * @ctrl_val - Bit to set on the ctrl field.
> + */
> +static void test_load_pgc(u32 nr, const char *name, u32 ctrl_nr,
> +			  const char *ctrl_name, u64 ctrl_val)
> +{
> +	u64 ctrl_saved = vmcs_read(ctrl_nr);
> +	u64 pgc_saved = vmcs_read(nr);
> +	u64 i, val;
> +	bool host = nr == HOST_PERF_GLOBAL_CTRL;
> +	struct load_state_test_data *data = &load_state_test_data;
> +
> +	data->msr = MSR_CORE_PERF_GLOBAL_CTRL;
> +	msr_bmp_init();
> +	if (!host) {
> +		vmx_set_test_stage(1);
> +		test_set_guest(guest_state_test_main);
> +	}
> +	vmcs_write(ctrl_nr, ctrl_saved & ~ctrl_val);
> +	data->enabled = false;
> +	report_prefix_pushf("\"load IA32_PERF_GLOBAL_CTRL\"=0 on %s",
> +			    ctrl_name);
> +	for (i = 0; i < 64; i++) {
> +		val = 1ull << i;
> +		vmcs_write(nr, val);
> +		report_prefix_pushf("%s = 0x%lx", name, val);
> +		/*
> +		 * If the "load IA32_PERF_GLOBAL_CTRL" bit is 0 then
> +		 * the {HOST,GUEST}_IA32_PERF_GLOBAL_CTRL field is ignored,
> +		 * thus setting reserved bits in this field does not cause
> +		 * vmlaunch to fail.
> +		 */

This comment is really not required as it's obvious that there's no 
effect on VM-entry if the control bit is not set.

> +		test_pgc_vmlaunch(0, false, host);
> +		report_prefix_pop();
> +	}
> +	report_prefix_pop();
> +
> +	vmcs_write(ctrl_nr, ctrl_saved | ctrl_val);
> +	data->enabled = true;
> +	report_prefix_pushf("\"load IA32_PERF_GLOBAL_CTRL\"=1 on %s",
> +			    ctrl_name);
> +	for (i = 0; i < 64; i++) {
> +		val = 1ull << i;
> +		data->exp = val;
> +		vmcs_write(nr, val);
> +		report_prefix_pushf("%s = 0x%lx", name, val);
> +		if (valid_pgc(val)) {
> +			test_pgc_vmlaunch(0, false, host);
> +		} else {
> +			/*
> +			 * [SDM 30.4]
> +			 *
> +			 * Invalid host state fields result in an VM
> +			 * instruction error with error number 8
> +			 * (VMXERR_ENTRY_INVALID_HOST_STATE_FIELD)
> +			 */
> +			if (host) {
> +				test_pgc_vmlaunch(
> +					VMXERR_ENTRY_INVALID_HOST_STATE_FIELD,
> +					true, host);
> +			/*
> +			 * [SDM 26.1]
> +			 *
> +			 * If a VM-Entry fails according to one of
> +			 * the guest-state checks, the exit reason on the VMCS
> +			 * will be set to reason number 33 (VMX_FAIL_STATE)

Again, these comments are probably not necessary.

> +			 */
> +			} else {
> +				test_pgc_vmlaunch(
> +					0,
> +					true, host);
> +				TEST_ASSERT_EQ(
> +					VMX_ENTRY_FAILURE | VMX_FAIL_STATE,
> +					vmcs_read(EXI_REASON));
> +			}
> +		}
> +		report_prefix_pop();
> +	}
> +
> +	report_prefix_pop();
> +
> +	if (nr == GUEST_PERF_GLOBAL_CTRL) {
> +		/*
> +		 * Let the guest finish execution
> +		 */
> +		vmx_set_test_stage(2);
> +		vmcs_write(ctrl_nr, ctrl_saved);
> +		vmcs_write(nr, pgc_saved);
> +		enter_guest();
> +	}
> +
> +	vmcs_write(ctrl_nr, ctrl_saved);
> +	vmcs_write(nr, pgc_saved);
> +}
> +
> +static void test_load_host_pgc(void)
> +{
> +	if (!(ctrl_exit_rev.clr & EXI_LOAD_PERF)) {
> +		printf("\"load IA32_PERF_GLOBAL_CTRL\" "
> +		       "exit control not supported\n");
> +		return;
> +	}
> +
> +	test_load_pgc(HOST_PERF_GLOBAL_CTRL, "HOST_PERF_GLOBAL_CTRL",
> +		      EXI_CONTROLS, "EXI_CONTROLS", EXI_LOAD_PERF);
> +}
> +
> +
> +static void test_load_guest_pgc(void)
> +{
> +	if (!(ctrl_enter_rev.clr & ENT_LOAD_PERF)) {
> +		printf("\"load IA32_PERF_GLOBAL_CTRL\" "
> +		       "entry control not supported\n");
> +	}
> +
> +	test_load_pgc(GUEST_PERF_GLOBAL_CTRL, "GUEST_PERF_GLOBAL_CTRL",
> +		      ENT_CONTROLS, "ENT_CONTROLS", ENT_LOAD_PERF);
> +}
> +
>   /*
>    * PAT values higher than 8 are uninteresting since they're likely lumped
>    * in with "8". We only test values above 8 one bit at a time,
> @@ -7147,6 +7341,7 @@ static void vmx_host_state_area_test(void)
>   	test_sysenter_field(HOST_SYSENTER_EIP, "HOST_SYSENTER_EIP");
>   
>   	test_host_efer();
> +	test_load_host_pgc();

I would rename it to test_load_host_perf_global_ctrl to avoid confusion 
between "performance counter" and "VMCS control field" though the name 
is bit long.

>   	test_load_host_pat();
>   	test_host_segment_regs();
>   	test_host_desc_tables();
> @@ -8587,7 +8782,6 @@ static int invalid_msr_entry_failure(struct vmentry_failure *failure)
>   	return VMX_TEST_VMEXIT;
>   }
>   
> -
>   #define TEST(name) { #name, .v2 = name }
>   
>   /* name/init/guest_main/exit_handler/syscall_handler/guest_regs */
> @@ -8637,6 +8831,7 @@ struct vmx_test vmx_tests[] = {
>   	TEST(vmx_host_state_area_test),
>   	TEST(vmx_guest_state_area_test),
>   	TEST(vmentry_movss_shadow_test),
> +	TEST(test_load_guest_pgc),

Same comment as above.
The other comment I have is, why not put this inside of 
'vmx_guest_state_area_test' because this VMCS field belongs to Guest 
State Area only.

>   	/* APICv tests */
>   	TEST(vmx_eoi_bitmap_ioapic_scan_test),
>   	TEST(vmx_hlt_with_rvi_test),


^ permalink raw reply

* Re: [PATCH v7 2/6] mm: Move set/get_pcppage_migratetype to mmzone.h
From: Dan Williams @ 2019-09-04 21:11 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: nitesh, KVM list, Michael S. Tsirkin, David Hildenbrand,
	Dave Hansen, Linux Kernel Mailing List, Matthew Wilcox,
	Michal Hocko, Linux MM, Andrew Morton, virtio-dev, Oscar Salvador,
	yang.zhang.wz, Pankaj Gupta, Rik van Riel, Konrad Rzeszutek Wilk,
	lcapitulino, Wang, Wei W, Andrea Arcangeli, Paolo Bonzini,
	Alexander Duyck
In-Reply-To: <20190904151036.13848.36062.stgit@localhost.localdomain>

On Wed, Sep 4, 2019 at 8:11 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>
> In order to support page reporting it will be necessary to store and
> retrieve the migratetype of a page. To enable that I am moving the set and
> get operations for pcppage_migratetype into the mm/internal.h header so
> that they can be used outside of the page_alloc.c file.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply

* Re: [PATCH v7 1/6] mm: Adjust shuffle code to allow for future coalescing
From: Dan Williams @ 2019-09-04 21:10 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: nitesh, KVM list, Michael S. Tsirkin, David Hildenbrand,
	Dave Hansen, Linux Kernel Mailing List, Matthew Wilcox,
	Michal Hocko, Linux MM, Andrew Morton, virtio-dev, Oscar Salvador,
	yang.zhang.wz, Pankaj Gupta, Rik van Riel, Konrad Rzeszutek Wilk,
	lcapitulino, Wang, Wei W, Andrea Arcangeli, Paolo Bonzini,
	Alexander Duyck
In-Reply-To: <20190904151030.13848.25822.stgit@localhost.localdomain>

On Wed, Sep 4, 2019 at 8:10 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>
> Move the head/tail adding logic out of the shuffle code and into the
> __free_one_page function since ultimately that is where it is really
> needed anyway. By doing this we should be able to reduce the overhead
> and can consolidate all of the list addition bits in one spot.
>
> While changing out the code I also opted to go for a bit more thread safe
> approach to getting the boolean value. This way we can avoid possible cache
> line bouncing of the batched entropy between CPUs.

The original version of this patch just did the movement, but now the
patch also does the percpu optimization. At this point it warrants
being split into a "move" patch and then "rework". Otherwise the bulk
of the patch is not really well described by the patch title. With the
split there's a commit id for each of the performance improvement
claims.

Other than that the percpu logic changes look good to me.

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH v7 5/6] virtio-balloon: Pull page poisoning config out of free page hinting
From: Alexander Duyck @ 2019-09-04 20:32 UTC (permalink / raw)
  To: Michael S. Tsirkin, Alexander Duyck
  Cc: nitesh, kvm, david, dave.hansen, linux-kernel, willy, mhocko,
	linux-mm, akpm, virtio-dev, osalvador, yang.zhang.wz, pagupta,
	riel, konrad.wilk, lcapitulino, wei.w.wang, aarcange, pbonzini,
	dan.j.williams
In-Reply-To: <20190904152244-mutt-send-email-mst@kernel.org>

On Wed, 2019-09-04 at 15:28 -0400, Michael S. Tsirkin wrote:
> On Wed, Sep 04, 2019 at 08:10:55AM -0700, Alexander Duyck wrote:
> > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > 
> > Currently the page poisoning setting wasn't being enabled unless free page
> > hinting was enabled. However we will need the page poisoning tracking logic
> > as well for unused page reporting. As such pull it out and make it a
> > separate bit of config in the probe function.
> > 
> > In addition we can actually wrap the code in a check for NO_SANITY. If we
> > don't care what is actually in the page we can just default to 0 and leave
> > it there.
> > 
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > ---
> >  drivers/virtio/virtio_balloon.c |   19 +++++++++++++------
> >  mm/page_reporting.c             |    4 ++++
> >  2 files changed, 17 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > index 226fbb995fb0..2c19457ab573 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -842,7 +842,6 @@ static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
> >  static int virtballoon_probe(struct virtio_device *vdev)
> >  {
> >  	struct virtio_balloon *vb;
> > -	__u32 poison_val;
> >  	int err;
> >  
> >  	if (!vdev->config->get) {
> > @@ -909,11 +908,19 @@ static int virtballoon_probe(struct virtio_device *vdev)
> >  						  VIRTIO_BALLOON_CMD_ID_STOP);
> >  		spin_lock_init(&vb->free_page_list_lock);
> >  		INIT_LIST_HEAD(&vb->free_page_list);
> > -		if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
> > -			memset(&poison_val, PAGE_POISON, sizeof(poison_val));
> > -			virtio_cwrite(vb->vdev, struct virtio_balloon_config,
> > -				      poison_val, &poison_val);
> > -		}
> > +	}
> > +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
> > +		__u32 poison_val = 0;
> > +
> > +#if !defined(CONFIG_PAGE_POISONING_NO_SANITY)
> > +		/*
> > +		 * Let hypervisor know that we are expecting a specific
> > +		 * value to be written back in unused pages.
> > +		 */
> > +		memset(&poison_val, PAGE_POISON, sizeof(poison_val));
> > +#endif
> > +		virtio_cwrite(vb->vdev, struct virtio_balloon_config,
> > +			      poison_val, &poison_val);
> >  	}
> >  	/*
> >  	 * We continue to use VIRTIO_BALLOON_F_DEFLATE_ON_OOM to decide if a
> 
> I'm a bit confused by this part. Should we not just clear
> VIRTIO_BALLOON_F_PAGE_POISON completely?
> 
> In my mind the value written should be what guest puts in
> free pages - and possibly what it expects to find there later.

I thought it better to err on the side of more information rather than
less. With this the host knows that page poisoning is enabled, but that 0
is an acceptable value.

> If it doesn't expect anything there then it makes sense
> to clear VIRTIO_BALLOON_F_PAGE_POISON so that host does
> not try to put the poison value there.

That makes sense.

> But I think that it does not make sense to lie to host about the poison
> value - I think that if we do send poison value to
> host it's reasonable for host to expect free pages
> have that value - and even possibly to validate that.
> 
> So I think that the hack belongs in virtballoon_validate,
> near the page_poisoning_enabled check.

Yeah, I will move that for v8. I will just add an IS_ENABLED check to the
!page_poisoning_enabled check in virtballoon_validate and can drop the
#ifdef check.

Thanks.

- Alex


^ permalink raw reply

* Re: [PATCH v7 6/6] virtio-balloon: Add support for providing unused page reports to host
From: Alexander Duyck @ 2019-09-04 20:15 UTC (permalink / raw)
  To: Michael S. Tsirkin, Alexander Duyck
  Cc: nitesh, kvm, david, dave.hansen, linux-kernel, willy, mhocko,
	linux-mm, akpm, virtio-dev, osalvador, yang.zhang.wz, pagupta,
	riel, konrad.wilk, lcapitulino, wei.w.wang, aarcange, pbonzini,
	dan.j.williams
In-Reply-To: <20190904151506-mutt-send-email-mst@kernel.org>

On Wed, 2019-09-04 at 15:17 -0400, Michael S. Tsirkin wrote:
> On Wed, Sep 04, 2019 at 08:11:02AM -0700, Alexander Duyck wrote:
> > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > 
> > Add support for the page reporting feature provided by virtio-balloon.
> > Reporting differs from the regular balloon functionality in that is is
> > much less durable than a standard memory balloon. Instead of creating a
> > list of pages that cannot be accessed the pages are only inaccessible
> > while they are being indicated to the virtio interface. Once the
> > interface has acknowledged them they are placed back into their respective
> > free lists and are once again accessible by the guest system.
> > 
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > ---
> >  drivers/virtio/Kconfig              |    1 +
> >  drivers/virtio/virtio_balloon.c     |   65 +++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/virtio_balloon.h |    1 +
> >  3 files changed, 67 insertions(+)
> > 
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index 078615cf2afc..4b2dd8259ff5 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -58,6 +58,7 @@ config VIRTIO_BALLOON
> >  	tristate "Virtio balloon driver"
> >  	depends on VIRTIO
> >  	select MEMORY_BALLOON
> > +	select PAGE_REPORTING
> >  	---help---
> >  	 This driver supports increasing and decreasing the amount
> >  	 of memory within a KVM guest.
> > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > index 2c19457ab573..0b400bb382c0 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/mount.h>
> >  #include <linux/magic.h>
> >  #include <linux/pseudo_fs.h>
> > +#include <linux/page_reporting.h>
> >  
> >  /*
> >   * Balloon device works in 4K page units.  So each page is pointed to by
> > @@ -37,6 +38,9 @@
> >  #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
> >  	(1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
> >  
> > +/*  limit on the number of pages that can be on the reporting vq */
> > +#define VIRTIO_BALLOON_VRING_HINTS_MAX	16
> > +
> >  #ifdef CONFIG_BALLOON_COMPACTION
> >  static struct vfsmount *balloon_mnt;
> >  #endif
> > @@ -46,6 +50,7 @@ enum virtio_balloon_vq {
> >  	VIRTIO_BALLOON_VQ_DEFLATE,
> >  	VIRTIO_BALLOON_VQ_STATS,
> >  	VIRTIO_BALLOON_VQ_FREE_PAGE,
> > +	VIRTIO_BALLOON_VQ_REPORTING,
> >  	VIRTIO_BALLOON_VQ_MAX
> >  };
> >  
> > @@ -113,6 +118,10 @@ struct virtio_balloon {
> >  
> >  	/* To register a shrinker to shrink memory upon memory pressure */
> >  	struct shrinker shrinker;
> > +
> > +	/* Unused page reporting device */
> > +	struct virtqueue *reporting_vq;
> > +	struct page_reporting_dev_info ph_dev_info;
> >  };
> >  
> >  static struct virtio_device_id id_table[] = {
> > @@ -152,6 +161,32 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> >  
> >  }
> >  
> > +void virtballoon_unused_page_report(struct page_reporting_dev_info *ph_dev_info,
> > +				    unsigned int nents)
> > +{
> > +	struct virtio_balloon *vb =
> > +		container_of(ph_dev_info, struct virtio_balloon, ph_dev_info);
> > +	struct virtqueue *vq = vb->reporting_vq;
> > +	unsigned int unused, err;
> > +
> > +	/* We should always be able to add these buffers to an empty queue. */
> > +	err = virtqueue_add_inbuf(vq, ph_dev_info->sg, nents, vb,
> > +				  GFP_NOWAIT | __GFP_NOWARN);
> > +
> > +	/*
> > +	 * In the extremely unlikely case that something has changed and we
> > +	 * are able to trigger an error we will simply display a warning
> > +	 * and exit without actually processing the pages.
> > +	 */
> > +	if (WARN_ON(err))
> > +		return;
> > +
> > +	virtqueue_kick(vq);
> > +
> > +	/* When host has read buffer, this completes via balloon_ack */
> > +	wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
> > +}
> > +
> 
> So just to make sure I understand, this always passes a single
> buf to the vq and then waits until that completes, correct?

Correct.

> Thus there are never outstanding bufs on the vq and this
> is why we don't need e.g. any cleanup.

Yes. Basically this will wait until the queue has been processed by the
host before we process any additional pages. We can't have the pages in an
unknown state when we put them back so we have to wait here until we know
they have been fully reported to the host.


^ permalink raw reply

* Re: [PATCH v7 5/6] virtio-balloon: Pull page poisoning config out of free page hinting
From: Michael S. Tsirkin @ 2019-09-04 19:28 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: nitesh, kvm, david, dave.hansen, linux-kernel, willy, mhocko,
	linux-mm, akpm, virtio-dev, osalvador, yang.zhang.wz, pagupta,
	riel, konrad.wilk, lcapitulino, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck
In-Reply-To: <20190904151055.13848.27351.stgit@localhost.localdomain>

On Wed, Sep 04, 2019 at 08:10:55AM -0700, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> Currently the page poisoning setting wasn't being enabled unless free page
> hinting was enabled. However we will need the page poisoning tracking logic
> as well for unused page reporting. As such pull it out and make it a
> separate bit of config in the probe function.
> 
> In addition we can actually wrap the code in a check for NO_SANITY. If we
> don't care what is actually in the page we can just default to 0 and leave
> it there.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
>  drivers/virtio/virtio_balloon.c |   19 +++++++++++++------
>  mm/page_reporting.c             |    4 ++++
>  2 files changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 226fbb995fb0..2c19457ab573 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -842,7 +842,6 @@ static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
>  static int virtballoon_probe(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb;
> -	__u32 poison_val;
>  	int err;
>  
>  	if (!vdev->config->get) {
> @@ -909,11 +908,19 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  						  VIRTIO_BALLOON_CMD_ID_STOP);
>  		spin_lock_init(&vb->free_page_list_lock);
>  		INIT_LIST_HEAD(&vb->free_page_list);
> -		if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
> -			memset(&poison_val, PAGE_POISON, sizeof(poison_val));
> -			virtio_cwrite(vb->vdev, struct virtio_balloon_config,
> -				      poison_val, &poison_val);
> -		}
> +	}
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
> +		__u32 poison_val = 0;
> +
> +#if !defined(CONFIG_PAGE_POISONING_NO_SANITY)
> +		/*
> +		 * Let hypervisor know that we are expecting a specific
> +		 * value to be written back in unused pages.
> +		 */
> +		memset(&poison_val, PAGE_POISON, sizeof(poison_val));
> +#endif
> +		virtio_cwrite(vb->vdev, struct virtio_balloon_config,
> +			      poison_val, &poison_val);
>  	}
>  	/*
>  	 * We continue to use VIRTIO_BALLOON_F_DEFLATE_ON_OOM to decide if a

I'm a bit confused by this part. Should we not just clear
VIRTIO_BALLOON_F_PAGE_POISON completely?

In my mind the value written should be what guest puts in
free pages - and possibly what it expects to find there later.

If it doesn't expect anything there then it makes sense
to clear VIRTIO_BALLOON_F_PAGE_POISON so that host does
not try to put the poison value there.
But I think that it does not make sense to lie to host about the poison
value - I think that if we do send poison value to
host it's reasonable for host to expect free pages
have that value - and even possibly to validate that.

So I think that the hack belongs in virtballoon_validate,
near the page_poisoning_enabled check.



> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> index 5006b08d5eec..35c0fe4c4471 100644
> --- a/mm/page_reporting.c
> +++ b/mm/page_reporting.c
> @@ -299,6 +299,10 @@ int page_reporting_startup(struct page_reporting_dev_info *phdev)
>  	struct zone *zone;
>  	int err = 0;
>  
> +	/* No point in enabling this if it cannot handle any pages */
> +	if (!phdev->capacity)
> +		return -EINVAL;
> +
>  	mutex_lock(&page_reporting_mutex);
>  
>  	/* nothing to do if already in use */

^ permalink raw reply

* Re: [PATCH v7 0/6] mm / virtio: Provide support for unused page reporting
From: Michael S. Tsirkin @ 2019-09-04 19:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: nitesh, kvm, david, dave.hansen, linux-kernel, willy, mhocko,
	linux-mm, akpm, virtio-dev, osalvador, yang.zhang.wz, pagupta,
	riel, konrad.wilk, lcapitulino, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck
In-Reply-To: <20190904150920.13848.32271.stgit@localhost.localdomain>

On Wed, Sep 04, 2019 at 08:10:24AM -0700, Alexander Duyck wrote:
> This series provides an asynchronous means of reporting to a hypervisor
> that a guest page is no longer in use and can have the data associated
> with it dropped. To do this I have implemented functionality that allows
> for what I am referring to as unused page reporting
> 
> The functionality for this is fairly simple. When enabled it will allocate
> statistics to track the number of reported pages in a given free area.
> When the number of free pages exceeds this value plus a high water value,
> currently 32, it will begin performing page reporting which consists of
> pulling pages off of free list and placing them into a scatter list. The
> scatterlist is then given to the page reporting device and it will perform
> the required action to make the pages "reported", in the case of
> virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> and as such they are forced out of the guest. After this they are placed
> back on the free list, and an additional bit is added if they are not
> merged indicating that they are a reported buddy page instead of a
> standard buddy page. The cycle then repeats with additional non-reported
> pages being pulled until the free areas all consist of reported pages.
> 
> I am leaving a number of things hard-coded such as limiting the lowest
> order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
> determine what the limit is on how many pages it wants to allocate to
> process the hints. The upper limit for this is based on the size of the
> queue used to store the scattergather list.
> 
> My primary testing has just been to verify the memory is being freed after
> allocation by running memhog 40g on a 40g guest and watching the total
> free memory via /proc/meminfo on the host. With this I have verified most
> of the memory is freed after each iteration. As far as performance I have
> been mainly focusing on the will-it-scale/page_fault1 test running with
> 16 vcpus. I have modified it to use Transparent Huge Pages. With this I
> see almost no difference, -0.08%, with the patches applied and the feature
> disabled. I see a regression of -0.86% with the feature enabled, but the
> madvise disabled in the hypervisor due to a device being assigned. With
> the feature fully enabled I see a regression of -3.27% versus the baseline
> without these patches applied. In my testing I found that most of the
> overhead was due to the page zeroing that comes as a result of the pages
> having to be faulted back into the guest.
> 
> One side effect of these patches is that the guest becomes much more
> resilient in terms of NUMA locality. With the pages being freed and then
> reallocated when used it allows for the pages to be much closer to the
> active thread, and as a result there can be situations where this patch
> set will out-perform the stock kernel when the guest memory is not local
> to the guest vCPUs. To avoid that in my testing I set the affinity of all
> the vCPUs and QEMU instance to the same node.


So the virtio side looks good to me. But we need mm faulks to ack the mm
changes. And maybe merge all of this ...


> Changes from the RFC:
> https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
> Moved aeration requested flag out of aerator and into zone->flags.
> Moved boundary out of free_area and into local variables for aeration.
> Moved aeration cycle out of interrupt and into workqueue.
> Left nr_free as total pages instead of splitting it between raw and aerated.
> Combined size and physical address values in virtio ring into one 64b value.
> 
> Changes from v1:
> https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
> Dropped "waste page treatment" in favor of "page hinting"
> Renamed files and functions from "aeration" to "page_hinting"
> Moved from page->lru list to scatterlist
> Replaced wait on refcnt in shutdown with RCU and cancel_delayed_work_sync
> Virtio now uses scatterlist directly instead of intermediate array
> Moved stats out of free_area, now in separate area and pointed to from zone
> Merged patch 5 into patch 4 to improve review-ability
> Updated various code comments throughout
> 
> Changes from v2:
> https://lore.kernel.org/lkml/20190724165158.6685.87228.stgit@localhost.localdomain/
> Dropped "page hinting" in favor of "page reporting"
> Renamed files from "hinting" to "reporting"
> Replaced "Hinted" page type with "Reported" page flag
> Added support for page poisoning while hinting is active
> Add QEMU patch that implements PAGE_POISON feature
> 
> Changes from v3:
> https://lore.kernel.org/lkml/20190801222158.22190.96964.stgit@localhost.localdomain/
> Added mutex lock around page reporting startup and shutdown
> Fixed reference to "page aeration" in patch 2
> Split page reporting function bit out into separate QEMU patch
> Limited capacity of scatterlist to vq size - 1 instead of vq size
> Added exception handling for case of virtio descriptor allocation failure
> 
> Changes from v4:
> https://lore.kernel.org/lkml/20190807224037.6891.53512.stgit@localhost.localdomain/
> Replaced spin_(un)lock with spin_(un)lock_irq in page_reporting_cycle()
> Dropped if/continue for ternary operator in page_reporting_process()
> Added checks for isolate and cma types to for_each_reporting_migratetype_order
> Added virtio-dev, Michal Hocko, and Oscar Salvador to to:/cc:
> Rebased on latest linux-next and QEMU git trees
> 
> Changes from v5:
> https://lore.kernel.org/lkml/20190812213158.22097.30576.stgit@localhost.localdomain/
> Replaced spin_(un)lock with spin_(un)lock_irq in page_reporting_startup()
> Updated shuffle code to use "shuffle_pick_tail" and updated patch description
> Dropped storage of order and migratettype while page is being reported
> Used get_pfnblock_migratetype to determine migratetype of page
> Renamed put_reported_page to free_reported_page, added order as argument
> Dropped check for CMA type as I believe we should be reporting those
> Added code to allow moving of reported pages into and out of isolation
> Defined page reporting order as minimum of Huge Page size vs MAX_ORDER - 1
> Cleaned up use of static branch usage for page_reporting_notify_enabled
> 
> Changes from v6:
> https://lore.kernel.org/lkml/20190821145806.20926.22448.stgit@localhost.localdomain/
> Rebased on linux-next for 20190903
> Added jump label to __page_reporting_request so we release RCU read lock
> Removed "- 1" from capacity limit based on virtio ring
> Added code to verify capacity is non-zero or return error on startup
> 
> ---
> 
> Alexander Duyck (6):
>       mm: Adjust shuffle code to allow for future coalescing
>       mm: Move set/get_pcppage_migratetype to mmzone.h
>       mm: Use zone and order instead of free area in free_list manipulators
>       mm: Introduce Reported pages
>       virtio-balloon: Pull page poisoning config out of free page hinting
>       virtio-balloon: Add support for providing unused page reports to host
> 
> 
>  drivers/virtio/Kconfig              |    1 
>  drivers/virtio/virtio_balloon.c     |   84 ++++++++-
>  include/linux/mmzone.h              |  124 ++++++++-----
>  include/linux/page-flags.h          |   11 +
>  include/linux/page_reporting.h      |  177 ++++++++++++++++++
>  include/uapi/linux/virtio_balloon.h |    1 
>  mm/Kconfig                          |    5 +
>  mm/Makefile                         |    1 
>  mm/internal.h                       |   18 ++
>  mm/memory_hotplug.c                 |    1 
>  mm/page_alloc.c                     |  216 ++++++++++++++++------
>  mm/page_reporting.c                 |  340 +++++++++++++++++++++++++++++++++++
>  mm/shuffle.c                        |   40 ++--
>  mm/shuffle.h                        |   12 +
>  14 files changed, 900 insertions(+), 131 deletions(-)
>  create mode 100644 include/linux/page_reporting.h
>  create mode 100644 mm/page_reporting.c
> 
> --

^ permalink raw reply

* Re: [PATCH v7 6/6] virtio-balloon: Add support for providing unused page reports to host
From: Michael S. Tsirkin @ 2019-09-04 19:17 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: nitesh, kvm, david, dave.hansen, linux-kernel, willy, mhocko,
	linux-mm, akpm, virtio-dev, osalvador, yang.zhang.wz, pagupta,
	riel, konrad.wilk, lcapitulino, wei.w.wang, aarcange, pbonzini,
	dan.j.williams, alexander.h.duyck
In-Reply-To: <20190904151102.13848.65770.stgit@localhost.localdomain>

On Wed, Sep 04, 2019 at 08:11:02AM -0700, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> Add support for the page reporting feature provided by virtio-balloon.
> Reporting differs from the regular balloon functionality in that is is
> much less durable than a standard memory balloon. Instead of creating a
> list of pages that cannot be accessed the pages are only inaccessible
> while they are being indicated to the virtio interface. Once the
> interface has acknowledged them they are placed back into their respective
> free lists and are once again accessible by the guest system.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
>  drivers/virtio/Kconfig              |    1 +
>  drivers/virtio/virtio_balloon.c     |   65 +++++++++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_balloon.h |    1 +
>  3 files changed, 67 insertions(+)
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 078615cf2afc..4b2dd8259ff5 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -58,6 +58,7 @@ config VIRTIO_BALLOON
>  	tristate "Virtio balloon driver"
>  	depends on VIRTIO
>  	select MEMORY_BALLOON
> +	select PAGE_REPORTING
>  	---help---
>  	 This driver supports increasing and decreasing the amount
>  	 of memory within a KVM guest.
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 2c19457ab573..0b400bb382c0 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -19,6 +19,7 @@
>  #include <linux/mount.h>
>  #include <linux/magic.h>
>  #include <linux/pseudo_fs.h>
> +#include <linux/page_reporting.h>
>  
>  /*
>   * Balloon device works in 4K page units.  So each page is pointed to by
> @@ -37,6 +38,9 @@
>  #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
>  	(1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
>  
> +/*  limit on the number of pages that can be on the reporting vq */
> +#define VIRTIO_BALLOON_VRING_HINTS_MAX	16
> +
>  #ifdef CONFIG_BALLOON_COMPACTION
>  static struct vfsmount *balloon_mnt;
>  #endif
> @@ -46,6 +50,7 @@ enum virtio_balloon_vq {
>  	VIRTIO_BALLOON_VQ_DEFLATE,
>  	VIRTIO_BALLOON_VQ_STATS,
>  	VIRTIO_BALLOON_VQ_FREE_PAGE,
> +	VIRTIO_BALLOON_VQ_REPORTING,
>  	VIRTIO_BALLOON_VQ_MAX
>  };
>  
> @@ -113,6 +118,10 @@ struct virtio_balloon {
>  
>  	/* To register a shrinker to shrink memory upon memory pressure */
>  	struct shrinker shrinker;
> +
> +	/* Unused page reporting device */
> +	struct virtqueue *reporting_vq;
> +	struct page_reporting_dev_info ph_dev_info;
>  };
>  
>  static struct virtio_device_id id_table[] = {
> @@ -152,6 +161,32 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>  
>  }
>  
> +void virtballoon_unused_page_report(struct page_reporting_dev_info *ph_dev_info,
> +				    unsigned int nents)
> +{
> +	struct virtio_balloon *vb =
> +		container_of(ph_dev_info, struct virtio_balloon, ph_dev_info);
> +	struct virtqueue *vq = vb->reporting_vq;
> +	unsigned int unused, err;
> +
> +	/* We should always be able to add these buffers to an empty queue. */
> +	err = virtqueue_add_inbuf(vq, ph_dev_info->sg, nents, vb,
> +				  GFP_NOWAIT | __GFP_NOWARN);
> +
> +	/*
> +	 * In the extremely unlikely case that something has changed and we
> +	 * are able to trigger an error we will simply display a warning
> +	 * and exit without actually processing the pages.
> +	 */
> +	if (WARN_ON(err))
> +		return;
> +
> +	virtqueue_kick(vq);
> +
> +	/* When host has read buffer, this completes via balloon_ack */
> +	wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
> +}
> +

So just to make sure I understand, this always passes a single
buf to the vq and then waits until that completes, correct?
Thus there are never outstanding bufs on the vq and this
is why we don't need e.g. any cleanup.



>  static void set_page_pfns(struct virtio_balloon *vb,
>  			  __virtio32 pfns[], struct page *page)
>  {
> @@ -476,6 +511,7 @@ static int init_vqs(struct virtio_balloon *vb)
>  	names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
>  	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
>  	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> +	names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
>  
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>  		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> @@ -487,11 +523,19 @@ static int init_vqs(struct virtio_balloon *vb)
>  		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>  	}
>  
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> +		names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
> +		callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
> +	}
> +
>  	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
>  					 vqs, callbacks, names, NULL, NULL);
>  	if (err)
>  		return err;
>  
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> +		vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
> +
>  	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>  	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> @@ -931,12 +975,30 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		if (err)
>  			goto out_del_balloon_wq;
>  	}
> +
> +	vb->ph_dev_info.report = virtballoon_unused_page_report;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> +		unsigned int capacity;
> +
> +		capacity = min_t(unsigned int,
> +				 virtqueue_get_vring_size(vb->reporting_vq),
> +				 VIRTIO_BALLOON_VRING_HINTS_MAX);
> +		vb->ph_dev_info.capacity = capacity;
> +
> +		err = page_reporting_startup(&vb->ph_dev_info);
> +		if (err)
> +			goto out_unregister_shrinker;
> +	}
> +
>  	virtio_device_ready(vdev);
>  
>  	if (towards_target(vb))
>  		virtballoon_changed(vdev);
>  	return 0;
>  
> +out_unregister_shrinker:
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> +		virtio_balloon_unregister_shrinker(vb);
>  out_del_balloon_wq:
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
>  		destroy_workqueue(vb->balloon_wq);
> @@ -965,6 +1027,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb = vdev->priv;
>  
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> +		page_reporting_shutdown(&vb->ph_dev_info);
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
>  		virtio_balloon_unregister_shrinker(vb);
>  	spin_lock_irq(&vb->stop_update_lock);
> @@ -1034,6 +1098,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
>  	VIRTIO_BALLOON_F_PAGE_POISON,
> +	VIRTIO_BALLOON_F_REPORTING,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index a1966cd7b677..19974392d324 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -36,6 +36,7 @@
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
>  #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
> +#define VIRTIO_BALLOON_F_REPORTING	5 /* Page reporting virtqueue */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12

^ permalink raw reply

* Re: [Patch] KVM: SVM: Fix svm_xsaves_supported
From: Aaron Lewis @ 2019-09-04 19:03 UTC (permalink / raw)
  To: Vitaly Kuznetsov; +Cc: kvm, Janakarajan.Natarajan, Jim Mattson
In-Reply-To: <87o900j98f.fsf@vitty.brq.redhat.com>

On Wed, Sep 4, 2019 at 9:51 AM Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
> Aaron Lewis <aaronlewis@google.com> writes:
>
> > AMD allows guests to execute XSAVES/XRSTORS if supported by the host.  This is different than Intel as they have an additional control bit that determines if XSAVES/XRSTORS can be used by the guest. Intel also has intercept bits that might prevent the guest from intercepting the instruction as well. AMD has none of that, not even an Intercept mechanism.  AMD simply allows XSAVES/XRSTORS to be executed by the guest if also supported by the host.
> >
>
> WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line)
>
> > Signed-off-by: Aaron Lewis <aaronlewis@google.com>
> > ---
> >  arch/x86/kvm/svm.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index 1f220a85514f..b681a89f4f7e 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -5985,7 +5985,7 @@ static bool svm_mpx_supported(void)
> >
> >  static bool svm_xsaves_supported(void)
> >  {
> > -     return false;
> > +     return boot_cpu_has(X86_FEATURE_XSAVES);
> >  }
> >
> >  static bool svm_umip_emulated(void)
>
> I had a similar patch in my stash when I tried to debug Hyper-V 2016
> not being able to boot on KVM. I may have forgotten some important
> details, but if I'm not mistaken XSAVES comes paired with MSR_IA32_XSS
> and some OSes may actually try to write there, in particular I've
> observed Hyper-V 2016 trying to write '0'. Currently, we only support
> MSR_IA32_XSS in vmx code, this will need to be extended to SVM.
>
> Currently, VMX code only supports writing '0' to MSR_IA32_XSS:
>
>         case MSR_IA32_XSS:
>                 if (!vmx_xsaves_supported() ||
>                     (!msr_info->host_initiated &&
>                      !(guest_cpuid_has(vcpu, X86_FEATURE_XSAVE) &&
>                        guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))))
>                         return 1;
>                 /*
>                  * The only supported bit as of Skylake is bit 8, but
>                  * it is not supported on KVM.
>                  */
>                 if (data != 0)
>                         return 1;
>
>
> we will probably need the same limitation for SVM, however, I'd vote for
> creating separate kvm_x86_ops->set_xss() implementations.
>
> --
> Vitaly

Fixed the unwrapped description in v2.

As for extending VMX behavior to SVM for MSR_IA_32_XSS; I will do this
in a follow up patch.  Thanks for calling this out.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox