Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH] vhost-net: add module alias (v2.1)
From: Stephen Hemminger @ 2012-01-16 15:52 UTC (permalink / raw)
  To: Alan Cox
  Cc: kvm, Michael S. Tsirkin, netdev, kay.sievers, virtualization,
	device, David Miller
In-Reply-To: <20120116122645.2257b40b@bob.linux.org.uk>

On Mon, 16 Jan 2012 12:26:45 +0000
Alan Cox <alan@linux.intel.com> wrote:

> > > ACKs, NACKs?  What is happening here?
> > 
> > I would like an Ack from Alan Cox who switched vhost-net
> > to a dynamic minor in the first place, in commit
> > 79907d89c397b8bc2e05b347ec94e928ea919d33.
> 
> Sorry device@lanana.org isn't yet back from the kernel hack incident.
> 
> I don't read netdev so someone needs to summarise the issue and send me
> a copy of the patch to look at.
> 
> Alan

Subject: vhost-net: add module alias (v2.1)

By adding some module aliases, programs (or users) won't have to explicitly
call modprobe. Vhost-net will always be available if built into the kernel.
It does require assigning a permanent minor number for depmod to work.

Also:
  - use C99 style initialization.
  - add missing entry in documentation for loop-control

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
2.1 - add missing documentation for loop control as well

 Documentation/devices.txt  |    3 +++
 drivers/vhost/net.c        |    8 +++++---
 include/linux/miscdevice.h |    1 +
 3 files changed, 9 insertions(+), 3 deletions(-)

--- a/drivers/vhost/net.c	2012-01-12 14:14:25.681815487 -0800
+++ b/drivers/vhost/net.c	2012-01-12 18:09:56.810680816 -0800
@@ -856,9 +856,9 @@ static const struct file_operations vhos
 };
 
 static struct miscdevice vhost_net_misc = {
-	MISC_DYNAMIC_MINOR,
-	"vhost-net",
-	&vhost_net_fops,
+	.minor = VHOST_NET_MINOR,
+	.name = "vhost-net",
+	.fops = &vhost_net_fops,
 };
 
 static int vhost_net_init(void)
@@ -879,3 +879,5 @@ MODULE_VERSION("0.0.1");
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Michael S. Tsirkin");
 MODULE_DESCRIPTION("Host kernel accelerator for virtio net");
+MODULE_ALIAS_MISCDEV(VHOST_NET_MINOR);
+MODULE_ALIAS("devname:vhost-net");
--- a/include/linux/miscdevice.h	2012-01-12 14:14:25.725815981 -0800
+++ b/include/linux/miscdevice.h	2012-01-12 18:09:56.810680816 -0800
@@ -42,6 +42,7 @@
 #define AUTOFS_MINOR		235
 #define MAPPER_CTRL_MINOR	236
 #define LOOP_CTRL_MINOR		237
+#define VHOST_NET_MINOR		238
 #define MISC_DYNAMIC_MINOR	255
 
 struct device;
--- a/Documentation/devices.txt	2012-01-12 14:14:25.701815712 -0800
+++ b/Documentation/devices.txt	2012-01-12 18:09:56.814680860 -0800
@@ -447,6 +447,9 @@ Your cooperation is appreciated.
 		234 = /dev/btrfs-control	Btrfs control device
 		235 = /dev/autofs	Autofs control device
 		236 = /dev/mapper/control	Device-Mapper control device
+		237 = /dev/loop-control Loopback control device
+		238 = /dev/vhost-net	Host kernel accelerator for virtio net
+
 		240-254			Reserved for local use
 		255			Reserved for MISC_DYNAMIC_MINOR

^ permalink raw reply

* Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
From: Raghavendra K T @ 2012-01-16 18:38 UTC (permalink / raw)
  To: Alexander Graf, Jeremy Fitzhardinge
  Cc: Greg Kroah-Hartman, linux-doc, Peter Zijlstra, Jan Kiszka,
	Srivatsa Vaddagiri, Paul Mackerras, H. Peter Anvin,
	Stefano Stabellini, Xen, Dave Jiang, KVM, Glauber Costa, X86,
	Ingo Molnar, Avi Kivity, Rik van Riel, Konrad Rzeszutek Wilk,
	Sasha Levin, Sedat Dilek, Thomas Gleixner, Virtualization, LKML,
	Dave Hansen, Suzuki Poulose <suzuki>
In-Reply-To: <B6E21B69-17D1-48E0-AFD4-B52075094005@suse.de>

On 01/16/2012 07:53 PM, Alexander Graf wrote:
>
> On 16.01.2012, at 15:20, Srivatsa Vaddagiri wrote:
>
>> * Alexander Graf<agraf@suse.de>  [2012-01-16 04:57:45]:
>>
>>> Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal?
>>
>> You mean, run kernel on bare metal with CONFIG_PARAVIRT_SPINLOCKS
>> enabled and compare how it performs with CONFIG_PARAVIRT_SPINLOCKS disabled for
>> some workload(s)?
>
> Yup
>
>>
>> In some sense, the 1x overcommitcase results posted does measure the overhead
>> of (pv-)spinlocks no? We don't see any overhead in that case for atleast
>> kernbench ..
>>
>>> Result for Non PLE machine :
>>> ============================
>>
>> [snip]
>>
>>> Kernbench:
>>>                BASE                    BASE+patch
>
> What is BASE really? Is BASE already with the PV spinlocks enabled? I'm having a hard time understanding which tree you're working against, since the prerequisites aren't upstream yet.
>
>
> Alex

Sorry for confusion, I think I was little imprecise on the BASE.

The BASE is pre 3.2.0 + Jeremy's following patches:
xadd (https://lkml.org/lkml/2011/10/4/328)
x86/ticketlocklock  (https://lkml.org/lkml/2011/10/12/496).
So this would have ticketlock cleanups from Jeremy and
CONFIG_PARAVIRT_SPINLOCKS=y

BASE+patch = pre 3.2.0 + Jeremy's above patches + above V5 PV spinlock
series and CONFIG_PARAVIRT_SPINLOCKS=y

In both the cases  CONFIG_PARAVIRT_SPINLOCKS=y.

So let,
A. pre-3.2.0 with CONFIG_PARAVIRT_SPINLOCKS = n
B. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = n
C. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = y
D. pre-3.2.0 + Jeremy's above patches + V5 patches with 
CONFIG_PARAVIRT_SPINLOCKS = n
E. pre-3.2.0 + Jeremy's above patches + V5 patches with 
CONFIG_PARAVIRT_SPINLOCKS = y

is it performance of A vs E ? (currently C vs E)

Please let me know the configuration expected for testing.

Jeremy,
I would be happy to test A vs B vs C vs E, for some workload of interest 
if you wish, for your upcoming patches.

Thanks and Regards
Raghu

>
>>>                %improvement
>>>                mean (sd)               mean (sd)
>>> Scenario A:
>>> case 1x:	 164.233 (16.5506)	 163.584 (15.4598	0.39517
>>
>> [snip]
>>
>>> Result for PLE machine:
>>> ======================
>>
>> [snip]
>>> Kernbench:
>>>                BASE                    BASE+patch
>>>                %improvement
>>>                mean (sd)               mean (sd)
>>> Scenario A:
>>> case 1x:	 161.263 (56.518)        159.635 (40.5621)	1.00953
>>
>> - vatsa
>>
>
>

^ permalink raw reply

* Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
From: Alexander Graf @ 2012-01-16 18:42 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Jeremy Fitzhardinge, Greg Kroah-Hartman, linux-doc,
	Peter Zijlstra, Jan Kiszka, Srivatsa Vaddagiri, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
	Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Virtualization, LKML, Dave Hansen
In-Reply-To: <4F146EA5.3010106@linux.vnet.ibm.com>


On 16.01.2012, at 19:38, Raghavendra K T wrote:

> On 01/16/2012 07:53 PM, Alexander Graf wrote:
>> 
>> On 16.01.2012, at 15:20, Srivatsa Vaddagiri wrote:
>> 
>>> * Alexander Graf<agraf@suse.de>  [2012-01-16 04:57:45]:
>>> 
>>>> Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal?
>>> 
>>> You mean, run kernel on bare metal with CONFIG_PARAVIRT_SPINLOCKS
>>> enabled and compare how it performs with CONFIG_PARAVIRT_SPINLOCKS disabled for
>>> some workload(s)?
>> 
>> Yup
>> 
>>> 
>>> In some sense, the 1x overcommitcase results posted does measure the overhead
>>> of (pv-)spinlocks no? We don't see any overhead in that case for atleast
>>> kernbench ..
>>> 
>>>> Result for Non PLE machine :
>>>> ============================
>>> 
>>> [snip]
>>> 
>>>> Kernbench:
>>>>               BASE                    BASE+patch
>> 
>> What is BASE really? Is BASE already with the PV spinlocks enabled? I'm having a hard time understanding which tree you're working against, since the prerequisites aren't upstream yet.
>> 
>> 
>> Alex
> 
> Sorry for confusion, I think I was little imprecise on the BASE.
> 
> The BASE is pre 3.2.0 + Jeremy's following patches:
> xadd (https://lkml.org/lkml/2011/10/4/328)
> x86/ticketlocklock  (https://lkml.org/lkml/2011/10/12/496).
> So this would have ticketlock cleanups from Jeremy and
> CONFIG_PARAVIRT_SPINLOCKS=y
> 
> BASE+patch = pre 3.2.0 + Jeremy's above patches + above V5 PV spinlock
> series and CONFIG_PARAVIRT_SPINLOCKS=y
> 
> In both the cases  CONFIG_PARAVIRT_SPINLOCKS=y.
> 
> So let,
> A. pre-3.2.0 with CONFIG_PARAVIRT_SPINLOCKS = n
> B. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = n
> C. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = y
> D. pre-3.2.0 + Jeremy's above patches + V5 patches with CONFIG_PARAVIRT_SPINLOCKS = n
> E. pre-3.2.0 + Jeremy's above patches + V5 patches with CONFIG_PARAVIRT_SPINLOCKS = y
> 
> is it performance of A vs E ? (currently C vs E)

Since D and E only matter with KVM in use, yes, I'm mostly interested in A, B and C :).


Alex

^ permalink raw reply

* Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
From: Raghavendra K T @ 2012-01-16 18:48 UTC (permalink / raw)
  To: Avi Kivity, Alexander Graf
  Cc: Jeremy Fitzhardinge, Greg Kroah-Hartman, linux-doc,
	Peter Zijlstra, Jan Kiszka, Virtualization, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
	Rob Landley, X86, Ingo Molnar, Rik van Riel,
	Konrad Rzeszutek Wilk, Srivatsa Vaddagiri, Sasha Levin,
	Sedat Dilek, Thomas Gleixner, LKML, Dave Hansen, Suzuki
In-Reply-To: <4F142AF1.2020801@redhat.com>

On 01/16/2012 07:19 PM, Avi Kivity wrote:
> On 01/16/2012 03:43 PM, Raghavendra K T wrote:
>>>> Dbench:
>>>> Throughput is in MB/sec
>>>> NRCLIENTS     BASE                    BASE+patch
>>>> %improvement
>>>>                      mean (sd)               mean (sd)
>>>> 8           1.101190  (0.875082)     1.700395  (0.846809)     54.4143
>>>> 16          1.524312  (0.120354)     1.477553  (0.058166)     -3.06755
>>>> 32            2.143028  (0.157103)     2.090307  (0.136778)
>>>> -2.46012
>>>
>>> So on a very contended system we're actually slower? Is this expected?
>>>
>>>
>>
>>
>> I think, the result is interesting because its PLE machine. I have to
>> experiment more with parameters, SPIN_THRESHOLD, and also may be
>> ple_gap and ple_window.
>
> Perhaps the PLE stuff fights with the PV stuff?
>

I also think so. The slight advantage in PLE, with current patch would
be  that, we are be able to say " This is the next guy who should
probably get his turn".  But If total number of unnecessary "halt
exits" disadvantage dominates above advantage, then we see degradation.

One clarification in above benchmarking is,  Dbench is run
simultaneously on all (8 vcpu) 3 guests. So we already have 1:3 
overcommit when we run 8 clients of dbench. after that it was just 
increasing number of clients.

^ permalink raw reply

* Re: [PATCH] vhost-net: add module alias (v2.1)
From: David Miller @ 2012-01-16 23:06 UTC (permalink / raw)
  To: shemminger; +Cc: kvm, mst, netdev, kay.sievers, virtualization, device, alan
In-Reply-To: <20120116075236.110cc4b4@nehalam.linuxnetplumber.net>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Mon, 16 Jan 2012 07:52:36 -0800

> On Mon, 16 Jan 2012 12:26:45 +0000
> Alan Cox <alan@linux.intel.com> wrote:
> 
>> > > ACKs, NACKs?  What is happening here?
>> > 
>> > I would like an Ack from Alan Cox who switched vhost-net
>> > to a dynamic minor in the first place, in commit
>> > 79907d89c397b8bc2e05b347ec94e928ea919d33.
>> 
>> Sorry device@lanana.org isn't yet back from the kernel hack incident.
>> 
>> I don't read netdev so someone needs to summarise the issue and send me
>> a copy of the patch to look at.
>> 
>> Alan
> 
> Subject: vhost-net: add module alias (v2.1)
> 
> By adding some module aliases, programs (or users) won't have to explicitly
> call modprobe. Vhost-net will always be available if built into the kernel.
> It does require assigning a permanent minor number for depmod to work.
> 
> Also:
>   - use C99 style initialization.
>   - add missing entry in documentation for loop-control
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

I already applied your first patch, so you need to give me something
relative to apply on top of your original one.

And it also shows that you're really not generating these patches
against current 'net', otherwise you'd have noticed your other patch
already there.

^ permalink raw reply

* Re: [PATCH] vhost-net: add module alias (v2.1)
From: Alan Cox @ 2012-01-16 23:34 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: kvm, Michael S. Tsirkin, netdev, kay.sievers, virtualization,
	device, David Miller
In-Reply-To: <20120116075236.110cc4b4@nehalam.linuxnetplumber.net>

> Also:
>   - use C99 style initialization.
>   - add missing entry in documentation for loop-control
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

For the device allocation

Acked-by: Alan Cox <devices@lanana.org>

^ permalink raw reply

* Re: [PATCH RFC V4 4/5] kvm : pv-ticketlocks support for linux guests running on KVM hypervisor
From: Jeremy Fitzhardinge @ 2012-01-16 23:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, linux-doc, Peter Zijlstra, Jan Kiszka,
	Virtualization, Paul Mackerras, H. Peter Anvin,
	Stefano Stabellini, Xen, Dave Jiang, KVM, Glauber Costa, X86,
	Ingo Molnar, Rik van Riel, Konrad Rzeszutek Wilk,
	Greg Kroah-Hartman, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	LKML, Dave Hansen, Suzuki Poulose
In-Reply-To: <4F143874.9010307@redhat.com>

On 01/17/2012 01:47 AM, Avi Kivity wrote:
> On 01/16/2012 04:13 PM, Raghavendra K T wrote:
>>> Please drop all of these and replace with tracepoints in the appropriate
>>> spots.  Everything else (including the historgram) can be reconstructed
>>> the tracepoints in userspace.
>>>
>>
>> I think Jeremy pointed that tracepoints use spinlocks and hence debugfs
>> is the option.. no ?
>>
> Yeah, I think you're right.  What a pity.

Yes, I went through the same thought process.

    J

^ permalink raw reply

* Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
From: Jeremy Fitzhardinge @ 2012-01-16 23:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, linux-doc, Peter Zijlstra, Jan Kiszka,
	Virtualization, Paul Mackerras, H. Peter Anvin,
	Stefano Stabellini, Xen, Dave Jiang, KVM, Glauber Costa, X86,
	Ingo Molnar, Rik van Riel, Konrad Rzeszutek Wilk,
	Greg Kroah-Hartman, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	LKML, Dave Hansen, Suzuki Poulose
In-Reply-To: <4F13E613.7090602@redhat.com>

On 01/16/2012 07:55 PM, Avi Kivity wrote:
> On 01/16/2012 08:40 AM, Jeremy Fitzhardinge wrote:
>>> That means we're spinning for n cycles, then notify the spinlock holder that we'd like to get kicked and go sleeping. While I'm pretty sure that it improves the situation, it doesn't solve all of the issues we have.
>>>
>>> Imagine we have an idle host. All vcpus can freely run and everyone can fetch the lock as fast as on real machines. We don't need to / want to go to sleep here. Locks that take too long are bugs that need to be solved on real hw just as well, so all we do is possibly incur overhead.
>> I'm not quite sure what your concern is.  The lock is under contention, so there's nothing to do except spin; all this patch adds is a variable decrement/test to the spin loop, but that's not going to waste any more CPU than the non-counting case.  And once it falls into the blocking path, its a win because the VCPU isn't burning CPU any more.
> The wakeup path is slower though.  The previous lock holder has to
> hypercall, and the new lock holder has to be scheduled, and transition
> from halted state to running (a vmentry).  So it's only a clear win if
> we can do something with the cpu other than go into the idle loop.

Not burning power is a win too.

Actually what you want is something like "if you preempt a VCPU while
its spinning in a lock, then push it into the slowpath and don't
reschedule it without a kick".  But I think that interface would have a
lot of fiddly corners.

    J

^ permalink raw reply

* Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
From: Jeremy Fitzhardinge @ 2012-01-17  0:30 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Raghavendra K T, linux-doc, Peter Zijlstra, Jan Kiszka,
	Virtualization, Paul Mackerras, H. Peter Anvin,
	Stefano Stabellini, Xen, Dave Jiang, KVM, Glauber Costa, X86,
	Ingo Molnar, Avi Kivity, Rik van Riel, Konrad Rzeszutek Wilk,
	Greg Kroah-Hartman, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	LKML, Dave Hansen
In-Reply-To: <0B6BF918-45C7-45C6-AAA4-FB73EFAB9FDB@suse.de>

On 01/16/2012 09:24 PM, Alexander Graf wrote:
> This is true in case you're spinning. If on overcommit spinlocks would
> instead of spin just yield(), we wouldn't have any vcpu running that's
> just waiting for a late ticket.

Yes, but the reality is that most spinlocks are held for a short period
of time and there's a low likelihood of being preempted while within a
spinlock critical section.  Therefore if someone else tries to get the
spinlock and there's contention, it's always worth spinning for a little
while because the lock will likely become free soon.

At least that's the case if the lock has low contention (shallow queue
depth and not in slow state).  Again, maybe it makes sense to never spin
for deep queues or already slowstate locks.

> We still have an issue finding the point in time when a vcpu could run again, which is what this whole series is about. My point above was that instead of doing a count loop, we could just do the normal spin dance and set the threshold to when we enable the magic to have another spin lock notify us in the CPU. That way we
>
>   * don't change the uncontended case

I don't follow you.  What do you mean by "the normal spin dance"?  What
do you mean by "have another spinlock notify us in the CPU"?  Don't
change which uncontended case?  Do you mean in the locking path?  Or the
unlock path?  Or both?

>   * can set the threshold on the host, which knows how contended the system is

Hm, I'm not convinced that knowing how contended the system is is all
that useful overall.  What's important is how contended a particular
lock is, and what state the current holder is in.  If it's not currently
running, then knowing the overall system contention would give you some
idea about how long you need to wait for it to be rescheduled, but
that's getting pretty indirect.

I think the "slowpath if preempted while spinning" idea I mentioned in
the other mail is probably worth following up, since that give specific
actionable information to the guest from the hypervisor.  But lots of
caveats.

[[
A possible mechanism:

  * register ranges of [er]ips with the hypervisor
  * each range is paired with a "resched handler block"
  * if vcpu is preempted within such a range, make sure it is
    rescheduled in the resched handler block

This is obviously akin to the exception mechanism, but it is partially
implemented by the hypervisor.  It allows the spinlock code to be
unchanged from native, but make use of a resched rather than an explicit
counter to determine when to slowpath the lock.  And it's a nice general
mechanism that could be potentially useful elsewhere.

Unfortunately, it doesn't change the unlock path at all; it still needs
to explicitly test if a VCPU needs to be kicked on unlock.
]]

> And since we control what spin locks look like, we can for example always keep the pointer to it in a specific register so that we can handle pv_lock_ops.lock_spinning() inside there and fetch all the information we need from our pt_regs.

You've left a pile of parts of an idea lying around, but I'm not sure
what shape you intend it to be.

>>> Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal? Last time I checked, enabling all the PV ops did incur significant slowdown which is why I went though the work to split the individual pv ops features up to only enable a few for KVM guests.
>> The whole point of the pv-ticketlock work is to keep the pvops hooks out of the locking fast path, so that the calls are only made on the slow path - that is, when spinning too long on a contended lock, and when releasing a lock that's in a "slow" state.  In the fast path case of no contention, there are no pvops, and the executed code path is almost identical to native.
> You're still changing a tight loop that does nothing (CPU detects it and saves power) into something that performs calculations.

It still has a "pause" instruction in that loop, so that CPU mechanism
will still come into play.  "pause" doesn't directly "save power"; it's
more about making sure that memory dependence cycles are broken and that
two competing threads will make similar progress.  Besides I'm not sure
adding a dec+test to a loop that's already got a memory read and compare
in it is adding much in the way of "calculations".

    J

^ permalink raw reply

* Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Gleb Natapov @ 2012-01-17  9:14 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Jeremy Fitzhardinge, Raghavendra K T, KVM, linux-doc,
	Peter Zijlstra, Jan Kiszka, Virtualization, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang,
	Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Greg Kroah-Hartman, LKML
In-Reply-To: <20120116141117.GB6019@linux.vnet.ibm.com>

On Mon, Jan 16, 2012 at 07:41:17PM +0530, Srivatsa Vaddagiri wrote:
> * Avi Kivity <avi@redhat.com> [2012-01-16 12:14:27]:
> 
> > > One option is to make the kick hypercall available only when
> > > yield_on_hlt=1?
> > 
> > It's not a good idea to tie various options together.  Features should
> > be orthogonal.
> > 
> > Can't we make it work?  Just have different handling for
> > KVM_REQ_PVLOCK_KICK (let 's rename it, and the hypercall, PV_UNHALT,
> > since we can use it for non-locks too).
> 
> The problem case I was thinking of was when guest VCPU would have issued
> HLT with interrupts disabled. I guess one option is to inject an NMI,
> and have the guest kernel NMI handler recognize this and make
> adjustments such that the vcpu avoids going back to HLT instruction.
> 
Just kick vcpu out of a guest mode and adjust rip to point after HLT on
next re-entry. Don't forget to call vmx_clear_hlt().

> Having another hypercall to do yield/sleep (rather than effecting that
> via HLT) seems like an alternate clean solution here ..
> 
> - vatsa

--
			Gleb.

^ permalink raw reply

* Re: [PATCH RFC V4 4/5] kvm : pv-ticketlocks support for linux guests running on KVM hypervisor
From: Marcelo Tosatti @ 2012-01-17 11:02 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Jeremy Fitzhardinge, linux-doc, Peter Zijlstra, Jan Kiszka,
	Virtualization, Paul Mackerras, H. Peter Anvin,
	Stefano Stabellini, Xen, Dave Jiang, KVM, Glauber Costa, X86,
	Ingo Molnar, Avi Kivity, Rik van Riel, Konrad Rzeszutek Wilk,
	Srivatsa Vaddagiri, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Greg Kroah-Hartman, LKML, Dave Hansen
In-Reply-To: <20120114182645.8604.68884.sendpatchset@oc5400248562.ibm.com>

On Sat, Jan 14, 2012 at 11:56:46PM +0530, Raghavendra K T wrote:
> Extends Linux guest running on KVM hypervisor to support pv-ticketlocks. 
> 
> During smp_boot_cpus  paravirtualied KVM guest detects if the hypervisor has
> required feature (KVM_FEATURE_PVLOCK_KICK) to support pv-ticketlocks. If so,
>  support for pv-ticketlocks is registered via pv_lock_ops.
> 
> Use KVM_HC_KICK_CPU hypercall to wakeup waiting/halted vcpu.
> 
> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 7a94987..cf5327c 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -195,10 +195,20 @@ void kvm_async_pf_task_wait(u32 token);
>  void kvm_async_pf_task_wake(u32 token);
>  u32 kvm_read_and_reset_pf_reason(void);
>  extern void kvm_disable_steal_time(void);
> -#else
> -#define kvm_guest_init() do { } while (0)
> +
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +void __init kvm_spinlock_init(void);
> +#else /* CONFIG_PARAVIRT_SPINLOCKS */
> +static void kvm_spinlock_init(void)
> +{
> +}
> +#endif /* CONFIG_PARAVIRT_SPINLOCKS */
> +
> +#else /* CONFIG_KVM_GUEST */
> +#define kvm_guest_init() do {} while (0)
>  #define kvm_async_pf_task_wait(T) do {} while(0)
>  #define kvm_async_pf_task_wake(T) do {} while(0)
> +
>  static inline u32 kvm_read_and_reset_pf_reason(void)
>  {
>  	return 0;
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index a9c2116..ec55a0b 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -33,6 +33,7 @@
>  #include <linux/sched.h>
>  #include <linux/slab.h>
>  #include <linux/kprobes.h>
> +#include <linux/debugfs.h>
>  #include <asm/timer.h>
>  #include <asm/cpu.h>
>  #include <asm/traps.h>
> @@ -545,6 +546,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
>  #endif
>  	kvm_guest_cpu_init();
>  	native_smp_prepare_boot_cpu();
> +	kvm_spinlock_init();
>  }
>  
>  static void __cpuinit kvm_guest_cpu_online(void *dummy)
> @@ -627,3 +629,250 @@ static __init int activate_jump_labels(void)
>  	return 0;
>  }
>  arch_initcall(activate_jump_labels);
> +
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +
> +enum kvm_contention_stat {
> +	TAKEN_SLOW,
> +	TAKEN_SLOW_PICKUP,
> +	RELEASED_SLOW,
> +	RELEASED_SLOW_KICKED,
> +	NR_CONTENTION_STATS
> +};
> +
> +#ifdef CONFIG_KVM_DEBUG_FS
> +
> +static struct kvm_spinlock_stats
> +{
> +	u32 contention_stats[NR_CONTENTION_STATS];
> +
> +#define HISTO_BUCKETS	30
> +	u32 histo_spin_blocked[HISTO_BUCKETS+1];
> +
> +	u64 time_blocked;
> +} spinlock_stats;
> +
> +static u8 zero_stats;
> +
> +static inline void check_zero(void)
> +{
> +	u8 ret;
> +	u8 old = ACCESS_ONCE(zero_stats);
> +	if (unlikely(old)) {
> +		ret = cmpxchg(&zero_stats, old, 0);
> +		/* This ensures only one fellow resets the stat */
> +		if (ret == old)
> +			memset(&spinlock_stats, 0, sizeof(spinlock_stats));
> +	}
> +}
> +
> +static inline void add_stats(enum kvm_contention_stat var, u32 val)
> +{
> +	check_zero();
> +	spinlock_stats.contention_stats[var] += val;
> +}
> +
> +
> +static inline u64 spin_time_start(void)
> +{
> +	return sched_clock();
> +}
> +
> +static void __spin_time_accum(u64 delta, u32 *array)
> +{
> +	unsigned index = ilog2(delta);
> +
> +	check_zero();
> +
> +	if (index < HISTO_BUCKETS)
> +		array[index]++;
> +	else
> +		array[HISTO_BUCKETS]++;
> +}
> +
> +static inline void spin_time_accum_blocked(u64 start)
> +{
> +	u32 delta = sched_clock() - start;
> +
> +	__spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
> +	spinlock_stats.time_blocked += delta;
> +}
> +
> +static struct dentry *d_spin_debug;
> +static struct dentry *d_kvm_debug;
> +
> +struct dentry *kvm_init_debugfs(void)
> +{
> +	d_kvm_debug = debugfs_create_dir("kvm", NULL);
> +	if (!d_kvm_debug)
> +		printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
> +
> +	return d_kvm_debug;
> +}
> +
> +static int __init kvm_spinlock_debugfs(void)
> +{
> +	struct dentry *d_kvm = kvm_init_debugfs();
> +
> +	if (d_kvm == NULL)
> +		return -ENOMEM;
> +
> +	d_spin_debug = debugfs_create_dir("spinlocks", d_kvm);
> +
> +	debugfs_create_u8("zero_stats", 0644, d_spin_debug, &zero_stats);
> +
> +	debugfs_create_u32("taken_slow", 0444, d_spin_debug,
> +		   &spinlock_stats.contention_stats[TAKEN_SLOW]);
> +	debugfs_create_u32("taken_slow_pickup", 0444, d_spin_debug,
> +		   &spinlock_stats.contention_stats[TAKEN_SLOW_PICKUP]);
> +
> +	debugfs_create_u32("released_slow", 0444, d_spin_debug,
> +		   &spinlock_stats.contention_stats[RELEASED_SLOW]);
> +	debugfs_create_u32("released_slow_kicked", 0444, d_spin_debug,
> +		   &spinlock_stats.contention_stats[RELEASED_SLOW_KICKED]);
> +
> +	debugfs_create_u64("time_blocked", 0444, d_spin_debug,
> +			   &spinlock_stats.time_blocked);
> +
> +	debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
> +		     spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
> +
> +	return 0;
> +}
> +fs_initcall(kvm_spinlock_debugfs);
> +#else  /* !CONFIG_KVM_DEBUG_FS */
> +#define TIMEOUT			(1 << 10)
> +static inline void add_stats(enum kvm_contention_stat var, u32 val)
> +{
> +}
> +
> +static inline u64 spin_time_start(void)
> +{
> +	return 0;
> +}
> +
> +static inline void spin_time_accum_blocked(u64 start)
> +{
> +}
> +#endif  /* CONFIG_KVM_DEBUG_FS */
> +
> +struct kvm_lock_waiting {
> +	struct arch_spinlock *lock;
> +	__ticket_t want;
> +};
> +
> +/* cpus 'waiting' on a spinlock to become available */
> +static cpumask_t waiting_cpus;
> +
> +/* Track spinlock on which a cpu is waiting */
> +static DEFINE_PER_CPU(struct kvm_lock_waiting, lock_waiting);
> +
> +static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
> +{
> +	struct kvm_lock_waiting *w = &__get_cpu_var(lock_waiting);
> +	int cpu = smp_processor_id();
> +	u64 start;
> +	unsigned long flags;
> +
> +	start = spin_time_start();
> +
> +	/*
> +	 * Make sure an interrupt handler can't upset things in a
> +	 * partially setup state.
> +	 */
> +	local_irq_save(flags);
> +
> +	/*
> +	 * The ordering protocol on this is that the "lock" pointer
> +	 * may only be set non-NULL if the "want" ticket is correct.
> +	 * If we're updating "want", we must first clear "lock".
> +	 */
> +	w->lock = NULL;
> +	smp_wmb();
> +	w->want = want;
> +	smp_wmb();
> +	w->lock = lock;
> +
> +	add_stats(TAKEN_SLOW, 1);
> +
> +	/*
> +	 * This uses set_bit, which is atomic but we should not rely on its
> +	 * reordering gurantees. So barrier is needed after this call.
> +	 */
> +	cpumask_set_cpu(cpu, &waiting_cpus);
> +
> +	barrier();
> +
> +	/*
> +	 * Mark entry to slowpath before doing the pickup test to make
> +	 * sure we don't deadlock with an unlocker.
> +	 */
> +	__ticket_enter_slowpath(lock);
> +
> +	/*
> +	 * check again make sure it didn't become free while
> +	 * we weren't looking.
> +	 */
> +	if (ACCESS_ONCE(lock->tickets.head) == want) {
> +		add_stats(TAKEN_SLOW_PICKUP, 1);
> +		goto out;
> +	}
> +
> +	/* Allow interrupts while blocked */
> +	local_irq_restore(flags);
> +
> +	/* halt until it's our turn and kicked. */
> +	halt();
> +
> +	local_irq_save(flags);
> +out:
> +	cpumask_clear_cpu(cpu, &waiting_cpus);
> +	w->lock = NULL;
> +	local_irq_restore(flags);
> +	spin_time_accum_blocked(start);
> +}
> +PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
> +
> +/* Kick a cpu by its apicid*/
> +static inline void kvm_kick_cpu(int apicid)
> +{
> +	kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
> +}
> +
> +/* Kick vcpu waiting on @lock->head to reach value @ticket */
> +static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
> +{
> +	int cpu;
> +	int apicid;
> +
> +	add_stats(RELEASED_SLOW, 1);
> +
> +	for_each_cpu(cpu, &waiting_cpus) {
> +		const struct kvm_lock_waiting *w = &per_cpu(lock_waiting, cpu);
> +		if (ACCESS_ONCE(w->lock) == lock &&
> +		    ACCESS_ONCE(w->want) == ticket) {
> +			add_stats(RELEASED_SLOW_KICKED, 1);
> +			apicid = per_cpu(x86_cpu_to_apicid, cpu);
> +			kvm_kick_cpu(apicid);
> +			break;
> +		}
> +	}

What prevents a kick from being lost here, if say, the waiter is at
local_irq_save in kvm_lock_spinning, before the lock/want assignments?

> +
> +/*
> + * Setup pv_lock_ops to exploit KVM_FEATURE_PVLOCK_KICK if present.
> + */
> +void __init kvm_spinlock_init(void)
> +{
> +	if (!kvm_para_available())
> +		return;
> +	/* Does host kernel support KVM_FEATURE_PVLOCK_KICK? */
> +	if (!kvm_para_has_feature(KVM_FEATURE_PVLOCK_KICK))
> +		return;
> +
> +	jump_label_inc(&paravirt_ticketlocks_enabled);
> +
> +	pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
> +	pv_lock_ops.unlock_kick = kvm_unlock_kick;
> +}
> +#endif	/* CONFIG_PARAVIRT_SPINLOCKS */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c7b05fc..4d7a950 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5754,8 +5754,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  
>  	local_irq_disable();
>  
> -	if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
> -	    || need_resched() || signal_pending(current)) {
> +	if (vcpu->mode == EXITING_GUEST_MODE
> +		 || (vcpu->requests & ~(1UL<<KVM_REQ_PVLOCK_KICK))
> +		 || need_resched() || signal_pending(current)) {
>  		vcpu->mode = OUTSIDE_GUEST_MODE;
>  		smp_wmb();
>  		local_irq_enable();
> @@ -6711,6 +6712,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>  		!vcpu->arch.apf.halted)
>  		|| !list_empty_careful(&vcpu->async_pf.done)
>  		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
> +		|| kvm_check_request(KVM_REQ_PVLOCK_KICK, vcpu)

The bit should only be read here (kvm_arch_vcpu_runnable), but cleared
on vcpu entry (along with the other kvm_check_request processing).

Then the first hunk becomes unnecessary.

Please do not mix host/guest patches.

^ permalink raw reply

* Re: [PATCH RFC V4 4/5] kvm : pv-ticketlocks support for linux guests running on KVM hypervisor
From: Srivatsa Vaddagiri @ 2012-01-17 11:33 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Jeremy Fitzhardinge, Raghavendra K T, linux-doc, Peter Zijlstra,
	Jan Kiszka, Virtualization, Paul Mackerras, H. Peter Anvin,
	Stefano Stabellini, Xen, Dave Jiang, KVM, Glauber Costa, X86,
	Ingo Molnar, Avi Kivity, Rik van Riel, Konrad Rzeszutek Wilk,
	Sasha Levin, Sedat Dilek, Thomas Gleixner, Greg Kroah-Hartman,
	LKML
In-Reply-To: <20120117110210.GA17420@amt.cnet>

* Marcelo Tosatti <mtosatti@redhat.com> [2012-01-17 09:02:11]:

> > +/* Kick vcpu waiting on @lock->head to reach value @ticket */
> > +static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
> > +{
> > +	int cpu;
> > +	int apicid;
> > +
> > +	add_stats(RELEASED_SLOW, 1);
> > +
> > +	for_each_cpu(cpu, &waiting_cpus) {
> > +		const struct kvm_lock_waiting *w = &per_cpu(lock_waiting, cpu);
> > +		if (ACCESS_ONCE(w->lock) == lock &&
> > +		    ACCESS_ONCE(w->want) == ticket) {
> > +			add_stats(RELEASED_SLOW_KICKED, 1);
> > +			apicid = per_cpu(x86_cpu_to_apicid, cpu);
> > +			kvm_kick_cpu(apicid);
> > +			break;
> > +		}
> > +	}
> 
> What prevents a kick from being lost here, if say, the waiter is at
> local_irq_save in kvm_lock_spinning, before the lock/want assignments?

The waiter does check for lock becoming available before actually
sleeping:

+	/*
+        * check again make sure it didn't become free while
+        * we weren't looking.
+        */
+	if (ACCESS_ONCE(lock->tickets.head) == want) {
+               add_stats(TAKEN_SLOW_PICKUP, 1);
+               goto out;
+	}

- vatsa

^ permalink raw reply

* Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Srivatsa Vaddagiri @ 2012-01-17 12:26 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Jeremy Fitzhardinge, Raghavendra K T, KVM, linux-doc,
	Peter Zijlstra, Jan Kiszka, Virtualization, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang,
	Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Greg Kroah-Hartman, LKML
In-Reply-To: <20120117091413.GM2167@redhat.com>

* Gleb Natapov <gleb@redhat.com> [2012-01-17 11:14:13]:

> > The problem case I was thinking of was when guest VCPU would have issued
> > HLT with interrupts disabled. I guess one option is to inject an NMI,
> > and have the guest kernel NMI handler recognize this and make
> > adjustments such that the vcpu avoids going back to HLT instruction.
> > 
> Just kick vcpu out of a guest mode and adjust rip to point after HLT on
> next re-entry. Don't forget to call vmx_clear_hlt().

Looks bit hackish to me compared to having another hypercall to yield!

> > Having another hypercall to do yield/sleep (rather than effecting that
> > via HLT) seems like an alternate clean solution here ..

- vatsa

^ permalink raw reply

* Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Gleb Natapov @ 2012-01-17 12:51 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Jeremy Fitzhardinge, Raghavendra K T, KVM, linux-doc,
	Peter Zijlstra, Jan Kiszka, Virtualization, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang,
	Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Greg Kroah-Hartman, LKML
In-Reply-To: <20120117122650.GC30757@linux.vnet.ibm.com>

On Tue, Jan 17, 2012 at 05:56:50PM +0530, Srivatsa Vaddagiri wrote:
> * Gleb Natapov <gleb@redhat.com> [2012-01-17 11:14:13]:
> 
> > > The problem case I was thinking of was when guest VCPU would have issued
> > > HLT with interrupts disabled. I guess one option is to inject an NMI,
> > > and have the guest kernel NMI handler recognize this and make
> > > adjustments such that the vcpu avoids going back to HLT instruction.
> > > 
> > Just kick vcpu out of a guest mode and adjust rip to point after HLT on
> > next re-entry. Don't forget to call vmx_clear_hlt().
> 
> Looks bit hackish to me compared to having another hypercall to yield!
> 
Do not see anything hackish about it. But what you described above (the
part I replied to) is not another hypercall, but yet another NMI source
and special handling in a guest. So what hypercall do you mean?

> > > Having another hypercall to do yield/sleep (rather than effecting that
> > > via HLT) seems like an alternate clean solution here ..
> 
> - vatsa

--
			Gleb.

^ permalink raw reply

* Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Srivatsa Vaddagiri @ 2012-01-17 13:11 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Jeremy Fitzhardinge, Raghavendra K T, KVM, linux-doc,
	Peter Zijlstra, Jan Kiszka, Virtualization, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang,
	Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Greg Kroah-Hartman, LKML
In-Reply-To: <20120117125126.GQ2167@redhat.com>

* Gleb Natapov <gleb@redhat.com> [2012-01-17 14:51:26]:

> On Tue, Jan 17, 2012 at 05:56:50PM +0530, Srivatsa Vaddagiri wrote:
> > * Gleb Natapov <gleb@redhat.com> [2012-01-17 11:14:13]:
> > 
> > > > The problem case I was thinking of was when guest VCPU would have issued
> > > > HLT with interrupts disabled. I guess one option is to inject an NMI,
> > > > and have the guest kernel NMI handler recognize this and make
> > > > adjustments such that the vcpu avoids going back to HLT instruction.
> > > > 
> > > Just kick vcpu out of a guest mode and adjust rip to point after HLT on
> > > next re-entry. Don't forget to call vmx_clear_hlt().
> > 
> > Looks bit hackish to me compared to having another hypercall to yield!
> > 
> Do not see anything hackish about it. But what you described above (the
> part I replied to) is not another hypercall, but yet another NMI source
> and special handling in a guest.

True, which I didn't exactly like and hence was suggesting we use
another hypercall to let spinning vcpu sleep.

> So what hypercall do you mean?

The hypercall is described below:

> > > > Having another hypercall to do yield/sleep (rather than effecting that
> > > > via HLT) seems like an alternate clean solution here ..

and was implemented in an earlier version of this patch (v2) as
KVM_HC_WAIT_FOR_KICK hypercall:

https://lkml.org/lkml/2011/10/23/211

Having the hypercall makes the intent of vcpu (to sleep on a kick) clear to 
hypervisor vs assuming that because of a trapped HLT instruction (which
will anyway won't work when yield_on_hlt=0).

- vatsa

^ permalink raw reply

* Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Raghavendra K T @ 2012-01-17 13:13 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Jeremy Fitzhardinge, KVM, linux-doc, Peter Zijlstra, Jan Kiszka,
	Srivatsa Vaddagiri, Paul Mackerras, H. Peter Anvin,
	Stefano Stabellini, Xen, Dave Jiang, Glauber Costa, X86,
	Ingo Molnar, Avi Kivity, Rik van Riel, Konrad Rzeszutek Wilk,
	Sasha Levin, Sedat Dilek, Thomas Gleixner, Virtualization,
	Greg Kroah-Hartman, LKML, Dave Hansen
In-Reply-To: <20120117125126.GQ2167@redhat.com>

On 01/17/2012 06:21 PM, Gleb Natapov wrote:
> On Tue, Jan 17, 2012 at 05:56:50PM +0530, Srivatsa Vaddagiri wrote:
>> * Gleb Natapov<gleb@redhat.com>  [2012-01-17 11:14:13]:
>>
>>>> The problem case I was thinking of was when guest VCPU would have issued
>>>> HLT with interrupts disabled. I guess one option is to inject an NMI,
>>>> and have the guest kernel NMI handler recognize this and make
>>>> adjustments such that the vcpu avoids going back to HLT instruction.
>>>>
>>> Just kick vcpu out of a guest mode and adjust rip to point after HLT on
>>> next re-entry. Don't forget to call vmx_clear_hlt().
>>
>> Looks bit hackish to me compared to having another hypercall to yield!
>>
> Do not see anything hackish about it. But what you described above (the
> part I replied to) is not another hypercall, but yet another NMI source
> and special handling in a guest. So what hypercall do you mean?
>

Earlier version had a hypercall to sleep instead of current halt()
approach. This was taken out to avoid extra hypercall.

So here is the hypercall hunk referred :

+/*
+ * kvm_pv_wait_for_kick_op : Block until kicked by either a KVM_HC_KICK_CPU
+ * hypercall or a event like interrupt.
+ *
+ * @vcpu : vcpu which is blocking.
+ */
+static void kvm_pv_wait_for_kick_op(struct kvm_vcpu *vcpu)
+{
+       DEFINE_WAIT(wait);
+
+       /*
+        * Blocking on vcpu->wq allows us to wake up sooner if required to
+        * service pending events (like interrupts).
+        *
+        * Also set state to TASK_INTERRUPTIBLE before checking 
vcpu->kicked to
+        * avoid racing with kvm_pv_kick_cpu_op().
+        */
+       prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
+
+       /*
+        * Somebody has already tried kicking us. Acknowledge that
+        * and terminate the wait.
+        */
+       if (vcpu->kicked) {
+               vcpu->kicked = 0;
+               goto end_wait;
+       }
+
+       /* Let's wait for either KVM_HC_KICK_CPU or someother event
+        * to wake us up.
+        */
+
+       srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
+       schedule();
+       vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+end_wait:
+       finish_wait(&vcpu->wq, &wait);
+}

>>>> Having another hypercall to do yield/sleep (rather than effecting that
>>>> via HLT) seems like an alternate clean solution here ..

^ permalink raw reply

* Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Gleb Natapov @ 2012-01-17 13:20 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Jeremy Fitzhardinge, Raghavendra K T, KVM, linux-doc,
	Peter Zijlstra, Jan Kiszka, Virtualization, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang,
	Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Greg Kroah-Hartman, LKML
In-Reply-To: <20120117131103.GD30398@linux.vnet.ibm.com>

On Tue, Jan 17, 2012 at 06:41:03PM +0530, Srivatsa Vaddagiri wrote:
> * Gleb Natapov <gleb@redhat.com> [2012-01-17 14:51:26]:
> 
> > On Tue, Jan 17, 2012 at 05:56:50PM +0530, Srivatsa Vaddagiri wrote:
> > > * Gleb Natapov <gleb@redhat.com> [2012-01-17 11:14:13]:
> > > 
> > > > > The problem case I was thinking of was when guest VCPU would have issued
> > > > > HLT with interrupts disabled. I guess one option is to inject an NMI,
> > > > > and have the guest kernel NMI handler recognize this and make
> > > > > adjustments such that the vcpu avoids going back to HLT instruction.
> > > > > 
> > > > Just kick vcpu out of a guest mode and adjust rip to point after HLT on
> > > > next re-entry. Don't forget to call vmx_clear_hlt().
> > > 
> > > Looks bit hackish to me compared to having another hypercall to yield!
> > > 
> > Do not see anything hackish about it. But what you described above (the
> > part I replied to) is not another hypercall, but yet another NMI source
> > and special handling in a guest.
> 
> True, which I didn't exactly like and hence was suggesting we use
> another hypercall to let spinning vcpu sleep.
> 
Ah, sorry. Missed that.

> > So what hypercall do you mean?
> 
> The hypercall is described below:
> 
> > > > > Having another hypercall to do yield/sleep (rather than effecting that
> > > > > via HLT) seems like an alternate clean solution here ..
> 
> and was implemented in an earlier version of this patch (v2) as
> KVM_HC_WAIT_FOR_KICK hypercall:
> 
> https://lkml.org/lkml/2011/10/23/211
> 
> Having the hypercall makes the intent of vcpu (to sleep on a kick) clear to 
> hypervisor vs assuming that because of a trapped HLT instruction (which
> will anyway won't work when yield_on_hlt=0).
> 
The purpose of yield_on_hlt=0 is to allow VCPU to occupy CPU for the
entire time slice no mater what. I do not think disabling yield on HLT
is even make sense in CPU oversubscribe scenario. Now if you'll call
KVM_HC_WAIT_FOR_KICK instead of HLT you will effectively ignore
yield_on_hlt=0 setting. This is like having PV HLT that does not obey
VMX exit control setting.

--
			Gleb.

^ permalink raw reply

* Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Srivatsa Vaddagiri @ 2012-01-17 14:28 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Jeremy Fitzhardinge, Raghavendra K T, KVM, linux-doc,
	Peter Zijlstra, Jan Kiszka, Virtualization, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang,
	Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Greg Kroah-Hartman, LKML
In-Reply-To: <20120117132051.GR2167@redhat.com>

* Gleb Natapov <gleb@redhat.com> [2012-01-17 15:20:51]:

> > Having the hypercall makes the intent of vcpu (to sleep on a kick) clear to 
> > hypervisor vs assuming that because of a trapped HLT instruction (which
> > will anyway won't work when yield_on_hlt=0).
> > 
> The purpose of yield_on_hlt=0 is to allow VCPU to occupy CPU for the
> entire time slice no mater what. I do not think disabling yield on HLT
> is even make sense in CPU oversubscribe scenario.

Yes, so is there any real use for yield_on_hlt=0? I believe Anthony
initially added it as a way to implement CPU bandwidth capping for VMs,
which would ensure that busy VMs don't eat into cycles meant for a idle
VM. Now that we have proper support in scheduler for CPU bandwidth capping, is 
there any real world use for yield_on_hlt=0? If not, deprecate it?

> Now if you'll call
> KVM_HC_WAIT_FOR_KICK instead of HLT you will effectively ignore
> yield_on_hlt=0 setting.

I guess that depends on what we do in KVM_HC_WAIT_FOR_KICK. If we do
yield_to() rather than sleep, it should minimize how much cycles vcpu gives away
to a competing VM (which seems to be the biggest purpose why one may
want to set yield_on_hlt=0).

> This is like having PV HLT that does not obey
> VMX exit control setting.

- vatsa

^ permalink raw reply

* Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Gleb Natapov @ 2012-01-17 15:32 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Jeremy Fitzhardinge, Raghavendra K T, KVM, linux-doc,
	Peter Zijlstra, Jan Kiszka, Virtualization, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang,
	Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Greg Kroah-Hartman, LKML
In-Reply-To: <20120117142818.GE30398@linux.vnet.ibm.com>

On Tue, Jan 17, 2012 at 07:58:18PM +0530, Srivatsa Vaddagiri wrote:
> * Gleb Natapov <gleb@redhat.com> [2012-01-17 15:20:51]:
> 
> > > Having the hypercall makes the intent of vcpu (to sleep on a kick) clear to 
> > > hypervisor vs assuming that because of a trapped HLT instruction (which
> > > will anyway won't work when yield_on_hlt=0).
> > > 
> > The purpose of yield_on_hlt=0 is to allow VCPU to occupy CPU for the
> > entire time slice no mater what. I do not think disabling yield on HLT
> > is even make sense in CPU oversubscribe scenario.
> 
> Yes, so is there any real use for yield_on_hlt=0? I believe Anthony
> initially added it as a way to implement CPU bandwidth capping for VMs,
> which would ensure that busy VMs don't eat into cycles meant for a idle
> VM. Now that we have proper support in scheduler for CPU bandwidth capping, is 
> there any real world use for yield_on_hlt=0? If not, deprecate it?
> 
I was against adding it in the first place, so if IBM no longer needs it
I am for removing it ASAP.

> > Now if you'll call
> > KVM_HC_WAIT_FOR_KICK instead of HLT you will effectively ignore
> > yield_on_hlt=0 setting.
> 
> I guess that depends on what we do in KVM_HC_WAIT_FOR_KICK. If we do
> yield_to() rather than sleep, it should minimize how much cycles vcpu gives away
> to a competing VM (which seems to be the biggest purpose why one may
> want to set yield_on_hlt=0).
> 
> > This is like having PV HLT that does not obey
> > VMX exit control setting.
> 
> - vatsa

--
			Gleb.

^ permalink raw reply

* Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Marcelo Tosatti @ 2012-01-17 15:53 UTC (permalink / raw)
  To: Gleb Natapov, Anthony Liguori
  Cc: Jeremy Fitzhardinge, Raghavendra K T, KVM, linux-doc,
	Peter Zijlstra, Jan Kiszka, Srivatsa Vaddagiri, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang,
	Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Virtualization, Greg Kroah-Hartman
In-Reply-To: <20120117153233.GA7911@redhat.com>

On Tue, Jan 17, 2012 at 05:32:33PM +0200, Gleb Natapov wrote:
> On Tue, Jan 17, 2012 at 07:58:18PM +0530, Srivatsa Vaddagiri wrote:
> > * Gleb Natapov <gleb@redhat.com> [2012-01-17 15:20:51]:
> > 
> > > > Having the hypercall makes the intent of vcpu (to sleep on a kick) clear to 
> > > > hypervisor vs assuming that because of a trapped HLT instruction (which
> > > > will anyway won't work when yield_on_hlt=0).
> > > > 
> > > The purpose of yield_on_hlt=0 is to allow VCPU to occupy CPU for the
> > > entire time slice no mater what. I do not think disabling yield on HLT
> > > is even make sense in CPU oversubscribe scenario.
> > 
> > Yes, so is there any real use for yield_on_hlt=0? I believe Anthony
> > initially added it as a way to implement CPU bandwidth capping for VMs,
> > which would ensure that busy VMs don't eat into cycles meant for a idle
> > VM. Now that we have proper support in scheduler for CPU bandwidth capping, is 
> > there any real world use for yield_on_hlt=0? If not, deprecate it?
> > 
> I was against adding it in the first place, so if IBM no longer needs it
> I am for removing it ASAP.

+1.

Anthony?

^ permalink raw reply

* Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
From: Raghavendra K T @ 2012-01-17 17:27 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Jeremy Fitzhardinge, Greg Kroah-Hartman, linux-doc,
	Peter Zijlstra, Jan Kiszka, Srivatsa Vaddagiri, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
	Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Virtualization, LKML, Dave Hansen
In-Reply-To: <E9F33AFD-F051-4D68-84FF-D259FD6AD19D@suse.de>

On 01/17/2012 12:12 AM, Alexander Graf wrote:
>
> On 16.01.2012, at 19:38, Raghavendra K T wrote:
>
>> On 01/16/2012 07:53 PM, Alexander Graf wrote:
>>>
>>> On 16.01.2012, at 15:20, Srivatsa Vaddagiri wrote:
>>>
>>>> * Alexander Graf<agraf@suse.de>   [2012-01-16 04:57:45]:
>>>>
>>>>> Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal?
>>>>
>>>> You mean, run kernel on bare metal with CONFIG_PARAVIRT_SPINLOCKS
>>>> enabled and compare how it performs with CONFIG_PARAVIRT_SPINLOCKS disabled for
>>>> some workload(s)?
>>>
>>> Yup
>>>
>>>>
>>>> In some sense, the 1x overcommitcase results posted does measure the overhead
>>>> of (pv-)spinlocks no? We don't see any overhead in that case for atleast
>>>> kernbench ..
>>>>
>>>>> Result for Non PLE machine :
>>>>> ============================
>>>>
>>>> [snip]
>>>>
>>>>> Kernbench:
>>>>>                BASE                    BASE+patch
>>>
>>> What is BASE really? Is BASE already with the PV spinlocks enabled? I'm having a hard time understanding which tree you're working against, since the prerequisites aren't upstream yet.
>>>
>>>
>>> Alex
>>
>> Sorry for confusion, I think I was little imprecise on the BASE.
>>
>> The BASE is pre 3.2.0 + Jeremy's following patches:
>> xadd (https://lkml.org/lkml/2011/10/4/328)
>> x86/ticketlocklock  (https://lkml.org/lkml/2011/10/12/496).
>> So this would have ticketlock cleanups from Jeremy and
>> CONFIG_PARAVIRT_SPINLOCKS=y
>>
>> BASE+patch = pre 3.2.0 + Jeremy's above patches + above V5 PV spinlock
>> series and CONFIG_PARAVIRT_SPINLOCKS=y
>>
>> In both the cases  CONFIG_PARAVIRT_SPINLOCKS=y.
>>
>> So let,
>> A. pre-3.2.0 with CONFIG_PARAVIRT_SPINLOCKS = n
>> B. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = n
>> C. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = y
>> D. pre-3.2.0 + Jeremy's above patches + V5 patches with CONFIG_PARAVIRT_SPINLOCKS = n
>> E. pre-3.2.0 + Jeremy's above patches + V5 patches with CONFIG_PARAVIRT_SPINLOCKS = y
>>
>> is it performance of A vs E ? (currently C vs E)
>
> Since D and E only matter with KVM in use, yes, I'm mostly interested in A, B and C :).
>
>
> Alex
>
>
setup :
Native: IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU with 8 core 
, 64GB RAM, (16 cpu online)

Guest : Single guest with 8 VCPU 4GB Ram.
benchmark : kernbench -f -H -M -o 20

Here is the result :
Native Run
============
case A               case B             %improvement   case C 
  %improvement
56.1917 (2.57125)    56.035 (2.02439)   0.278867       56.27 (2.40401) 
   -0.139344	

Guest Run
============
case A               case B             %improvement   case C 
  %improvement
166.999 (15.7613)    161.876 (14.4874) 	3.06768        161.24 (12.6497) 
  3.44852

We do not see much overhead in native run with CONFIG_PARAVIRT_SPINLOCKS = y

^ permalink raw reply

* Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
From: Alexander Graf @ 2012-01-17 17:39 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Jeremy Fitzhardinge, Greg Kroah-Hartman, linux-doc,
	Peter Zijlstra, Jan Kiszka, Srivatsa Vaddagiri, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
	Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Virtualization, LKML, Dave Hansen
In-Reply-To: <4F15AF9E.9000907@linux.vnet.ibm.com>


On 17.01.2012, at 18:27, Raghavendra K T wrote:

> On 01/17/2012 12:12 AM, Alexander Graf wrote:
>> 
>> On 16.01.2012, at 19:38, Raghavendra K T wrote:
>> 
>>> On 01/16/2012 07:53 PM, Alexander Graf wrote:
>>>> 
>>>> On 16.01.2012, at 15:20, Srivatsa Vaddagiri wrote:
>>>> 
>>>>> * Alexander Graf<agraf@suse.de>   [2012-01-16 04:57:45]:
>>>>> 
>>>>>> Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal?
>>>>> 
>>>>> You mean, run kernel on bare metal with CONFIG_PARAVIRT_SPINLOCKS
>>>>> enabled and compare how it performs with CONFIG_PARAVIRT_SPINLOCKS disabled for
>>>>> some workload(s)?
>>>> 
>>>> Yup
>>>> 
>>>>> 
>>>>> In some sense, the 1x overcommitcase results posted does measure the overhead
>>>>> of (pv-)spinlocks no? We don't see any overhead in that case for atleast
>>>>> kernbench ..
>>>>> 
>>>>>> Result for Non PLE machine :
>>>>>> ============================
>>>>> 
>>>>> [snip]
>>>>> 
>>>>>> Kernbench:
>>>>>>               BASE                    BASE+patch
>>>> 
>>>> What is BASE really? Is BASE already with the PV spinlocks enabled? I'm having a hard time understanding which tree you're working against, since the prerequisites aren't upstream yet.
>>>> 
>>>> 
>>>> Alex
>>> 
>>> Sorry for confusion, I think I was little imprecise on the BASE.
>>> 
>>> The BASE is pre 3.2.0 + Jeremy's following patches:
>>> xadd (https://lkml.org/lkml/2011/10/4/328)
>>> x86/ticketlocklock  (https://lkml.org/lkml/2011/10/12/496).
>>> So this would have ticketlock cleanups from Jeremy and
>>> CONFIG_PARAVIRT_SPINLOCKS=y
>>> 
>>> BASE+patch = pre 3.2.0 + Jeremy's above patches + above V5 PV spinlock
>>> series and CONFIG_PARAVIRT_SPINLOCKS=y
>>> 
>>> In both the cases  CONFIG_PARAVIRT_SPINLOCKS=y.
>>> 
>>> So let,
>>> A. pre-3.2.0 with CONFIG_PARAVIRT_SPINLOCKS = n
>>> B. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = n
>>> C. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = y
>>> D. pre-3.2.0 + Jeremy's above patches + V5 patches with CONFIG_PARAVIRT_SPINLOCKS = n
>>> E. pre-3.2.0 + Jeremy's above patches + V5 patches with CONFIG_PARAVIRT_SPINLOCKS = y
>>> 
>>> is it performance of A vs E ? (currently C vs E)
>> 
>> Since D and E only matter with KVM in use, yes, I'm mostly interested in A, B and C :).
>> 
>> 
>> Alex
>> 
>> 
> setup :
> Native: IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU with 8 core , 64GB RAM, (16 cpu online)
> 
> Guest : Single guest with 8 VCPU 4GB Ram.
> benchmark : kernbench -f -H -M -o 20
> 
> Here is the result :
> Native Run
> ============
> case A               case B             %improvement   case C  %improvement
> 56.1917 (2.57125)    56.035 (2.02439)   0.278867       56.27 (2.40401)   -0.139344	

This looks a lot like statistical derivation. How often did you execute the test case? Did you make sure to have a clean base state every time?

Maybe it'd be a good idea to create a small in-kernel microbenchmark with a couple threads that take spinlocks, then do work for a specified number of cycles, then release them again and start anew. At the end of it, we can check how long the whole thing took for n runs. That would enable us to measure the worst case scenario.

> 
> Guest Run
> ============
> case A               case B             %improvement   case C  %improvement
> 166.999 (15.7613)    161.876 (14.4874) 	3.06768        161.24 (12.6497)  3.44852

Is this the same machine? Why is the guest 3x slower?


Alex

> 
> We do not see much overhead in native run with CONFIG_PARAVIRT_SPINLOCKS = y
> 

^ permalink raw reply

* Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
From: Raghavendra K T @ 2012-01-17 18:36 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Jeremy Fitzhardinge, Greg Kroah-Hartman, linux-doc,
	Peter Zijlstra, Jan Kiszka, Srivatsa Vaddagiri, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
	Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Virtualization, LKML, Dave Hansen
In-Reply-To: <1485A122-9D48-46E3-A01E-E37B5C9EC54A@suse.de>

On 01/17/2012 11:09 PM, Alexander Graf wrote:
>
> On 17.01.2012, at 18:27, Raghavendra K T wrote:
>
>> On 01/17/2012 12:12 AM, Alexander Graf wrote:
>>>
>>> On 16.01.2012, at 19:38, Raghavendra K T wrote:
>>>
>>>> On 01/16/2012 07:53 PM, Alexander Graf wrote:
>>>>>
>>>>> On 16.01.2012, at 15:20, Srivatsa Vaddagiri wrote:
>>>>>
>>>>>> * Alexander Graf<agraf@suse.de>    [2012-01-16 04:57:45]:
>>>>>>
>>>>>>> Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal?
>>>>>>
>>>>>> You mean, run kernel on bare metal with CONFIG_PARAVIRT_SPINLOCKS
>>>>>> enabled and compare how it performs with CONFIG_PARAVIRT_SPINLOCKS disabled for
>>>>>> some workload(s)?
>>>>>
>>>>> Yup
>>>>>
>>>>>>
>>>>>> In some sense, the 1x overcommitcase results posted does measure the overhead
>>>>>> of (pv-)spinlocks no? We don't see any overhead in that case for atleast
>>>>>> kernbench ..
>>>>>>
>>>>>>> Result for Non PLE machine :
>>>>>>> ============================
>>>>>>
>>>>>> [snip]
>>>>>>
>>>>>>> Kernbench:
>>>>>>>                BASE                    BASE+patch
>>>>>
>>>>> What is BASE really? Is BASE already with the PV spinlocks enabled? I'm having a hard time understanding which tree you're working against, since the prerequisites aren't upstream yet.
>>>>>
>>>>>
>>>>> Alex
>>>>
>>>> Sorry for confusion, I think I was little imprecise on the BASE.
>>>>
>>>> The BASE is pre 3.2.0 + Jeremy's following patches:
>>>> xadd (https://lkml.org/lkml/2011/10/4/328)
>>>> x86/ticketlocklock  (https://lkml.org/lkml/2011/10/12/496).
>>>> So this would have ticketlock cleanups from Jeremy and
>>>> CONFIG_PARAVIRT_SPINLOCKS=y
>>>>
>>>> BASE+patch = pre 3.2.0 + Jeremy's above patches + above V5 PV spinlock
>>>> series and CONFIG_PARAVIRT_SPINLOCKS=y
>>>>
>>>> In both the cases  CONFIG_PARAVIRT_SPINLOCKS=y.
>>>>
>>>> So let,
>>>> A. pre-3.2.0 with CONFIG_PARAVIRT_SPINLOCKS = n
>>>> B. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = n
>>>> C. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = y
>>>> D. pre-3.2.0 + Jeremy's above patches + V5 patches with CONFIG_PARAVIRT_SPINLOCKS = n
>>>> E. pre-3.2.0 + Jeremy's above patches + V5 patches with CONFIG_PARAVIRT_SPINLOCKS = y
>>>>
>>>> is it performance of A vs E ? (currently C vs E)
>>>
>>> Since D and E only matter with KVM in use, yes, I'm mostly interested in A, B and C :).
>>>
>>>
>>> Alex
>>>
>>>
>> setup :
>> Native: IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU with 8 core , 64GB RAM, (16 cpu online)
>>
>> Guest : Single guest with 8 VCPU 4GB Ram.
>> benchmark : kernbench -f -H -M -o 20
>>
>> Here is the result :
>> Native Run
>> ============
>> case A               case B             %improvement   case C  %improvement
>> 56.1917 (2.57125)    56.035 (2.02439)   0.278867       56.27 (2.40401)   -0.139344	
>
> This looks a lot like statistical derivation. How often did you execute the test case? Did you make sure to have a clean base state every time?
>
> Maybe it'd be a good idea to create a small in-kernel microbenchmark with a couple threads that take spinlocks, then do work for a specified number of cycles, then release them again and start anew. At the end of it, we can check how long the whole thing took for n runs. That would enable us to measure the worst case scenario.
>

It was a quick test.  two iteration of kernbench (=6runs) and had 
ensured cache is cleared.

echo "1" > /proc/sys/vm/drop_caches
ccache -C. Yes may be I can run test as you mentioned..

>>
>> Guest Run
>> ============
>> case A               case B             %improvement   case C  %improvement
>> 166.999 (15.7613)    161.876 (14.4874) 	3.06768        161.24 (12.6497)  3.44852
>
> Is this the same machine? Why is the guest 3x slower?
Yes non - ple machine but with all 16 cpus online. 3x slower you meant 
case A is slower (pre-3.2.0 with CONFIG_PARAVIRT_SPINLOCKS = n) ?

>
>
> Alex
>
>>
>> We do not see much overhead in native run with CONFIG_PARAVIRT_SPINLOCKS = y
>>
>
>

^ permalink raw reply

* Re: [PATCH RFC V4 4/5] kvm : pv-ticketlocks support for linux guests running on KVM hypervisor
From: Raghavendra K T @ 2012-01-17 18:57 UTC (permalink / raw)
  To: Marcelo Tosatti, Avi Kivity
  Cc: Jeremy Fitzhardinge, Raghavendra, linux-doc, Peter Zijlstra,
	Jan Kiszka, Virtualization, Paul Mackerras, H. Peter Anvin,
	Stefano Stabellini, Xen, Dave Jiang, KVM, Rob Landley, X86,
	Ingo Molnar, Rik van Riel, Konrad Rzeszutek Wilk,
	Srivatsa Vaddagiri, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Greg Kroah-Hartman, LKML
In-Reply-To: <20120117110210.GA17420@amt.cnet>

On 01/17/2012 04:32 PM, Marcelo Tosatti wrote:
> On Sat, Jan 14, 2012 at 11:56:46PM +0530, Raghavendra K T wrote:
>> Extends Linux guest running on KVM hypervisor to support pv-ticketlocks.
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index c7b05fc..4d7a950 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -5754,8 +5754,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>
>>   	local_irq_disable();
>>
>> -	if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
>> -	    || need_resched() || signal_pending(current)) {
>> +	if (vcpu->mode == EXITING_GUEST_MODE
>> +		 || (vcpu->requests&  ~(1UL<<KVM_REQ_PVLOCK_KICK))
>> +		 || need_resched() || signal_pending(current)) {
>>   		vcpu->mode = OUTSIDE_GUEST_MODE;
>>   		smp_wmb();
>>   		local_irq_enable();
>> @@ -6711,6 +6712,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>>   		!vcpu->arch.apf.halted)
>>   		|| !list_empty_careful(&vcpu->async_pf.done)
>>   		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
>> +		|| kvm_check_request(KVM_REQ_PVLOCK_KICK, vcpu)
>
> The bit should only be read here (kvm_arch_vcpu_runnable), but cleared
> on vcpu entry (along with the other kvm_check_request processing).
>
> Then the first hunk becomes unnecessary.

true. [ patch below ]

>
> Please do not mix host/guest patches.

yes, will be taken care in next version..

>
>

I had tried alternative approach earlier, I think that is closer
to your expectation.

where
- flag is read in kvm_arch_vcpu_runnable
- flag cleared in vcpu entry along with others.

But it needs per vcpu flag to remember that pv_unhalted while clearing
the flag in vcpu enter [ patch below ]. Could not find third alternative
though.

Simply clearing the request bit in vcpu entry had made guest hang in 
*rare* scenario.  [as kick will be lost].

[ I had observed guest hang after 4 iteration of kernbench with 1:3 
overcommit. with 2/3 guest running while 1 hogs ]

Avi,
do you think having pv_unhalt flag in below patch cause problem to
live migration still? (vcpu->request bit is retained as is) OR do we 
have to have KVM_GET_MP_STATE changes also with below patch you 
mentioned earlier.

---8<---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c38efd7..1bf8fa8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5684,6 +5717,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
  			r = 1;
  			goto out;
  		}
+		if (kvm_check_request(KVM_REQ_PVKICK, vcpu)) {
+			vcpu->pv_unhalted = 1;
+			r = 1;
+			goto out;
+		}
  		if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
  			record_steal_time(vcpu);
  		if (kvm_check_request(KVM_REQ_NMI, vcpu))
@@ -6683,6 +6720,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
  		!vcpu->arch.apf.halted)
  		|| !list_empty_careful(&vcpu->async_pf.done)
  		|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
+		|| (kvm_test_request(KVM_REQ_PVKICK, vcpu) || vcpu->pv_unhalted)
  		|| atomic_read(&vcpu->arch.nmi_queued) ||
  		(kvm_arch_interrupt_allowed(vcpu) &&
  		 kvm_cpu_has_interrupt(vcpu));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d526231..a48e0f2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -154,6 +155,8 @@ struct kvm_vcpu {
  #endif

  	struct kvm_vcpu_arch arch;
+
+	int pv_unhalted;
  };

  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
@@ -770,5 +773,12 @@ static inline bool kvm_check_request(int req, 
struct kvm_vcpu *vcpu)
  	}
  }

+static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
+{
+	if (test_bit(req, &vcpu->requests))
+		return true;
+	else
+		return false;
+}
  #endif

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d9cfb78..55c44a2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -226,6 +226,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm 
*kvm, unsigned id)
  	vcpu->kvm = kvm;
  	vcpu->vcpu_id = id;
  	vcpu->pid = NULL;
+	vcpu->pv_unhalted = 0;
  	init_waitqueue_head(&vcpu->wq);
  	kvm_async_pf_vcpu_init(vcpu);

@@ -1509,11 +1510,12 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
  void kvm_vcpu_block(struct kvm_vcpu *vcpu)
  {
  	DEFINE_WAIT(wait);
  	for (;;) {
  		prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);

  		if (kvm_arch_vcpu_runnable(vcpu)) {
+			vcpu->pv_unhalted = 0;
  			kvm_make_request(KVM_REQ_UNHALT, vcpu);
  			break;
  		}

^ permalink raw reply related

* Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
From: Dave Hansen @ 2012-01-17 21:57 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Jeremy Fitzhardinge, Greg Kroah-Hartman, linux-doc,
	Peter Zijlstra, Jan Kiszka, Srivatsa Vaddagiri, Paul Mackerras,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
	Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Sasha Levin, Sedat Dilek, Thomas Gleixner,
	Virtualization, LKML, Suzuki Poulose <suzuk>
In-Reply-To: <4F15BFAE.7060500@linux.vnet.ibm.com>

On 01/17/2012 10:36 AM, Raghavendra K T wrote:
> It was a quick test.  two iteration of kernbench (=6runs) and had
> ensured cache is cleared.
> 
> echo "1" > /proc/sys/vm/drop_caches
> ccache -C. Yes may be I can run test as you mentioned..

echo 3 > /proc/sys/vm/drop_caches

is better.  1 will only do page cache, but 3 also does dentries fwiw.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox