* Re: [PULL] virtio and lguest
From: Linus Torvalds @ 2012-01-13 0:29 UTC
To: Rusty Russell
Cc: Stratos Psomadakis, Michael S. Tsirkin,
lkml - Kernel Mailing List, virtualization, Sasha Levin,
Amit Shah, Jacek Galowicz, Christoph Hellwig, Davidlohr Bueso
In-Reply-To: <87lipd4hqx.fsf@rustcorp.com.au>
On Wed, Jan 11, 2012 at 9:22 PM, Rusty Russell <rusty@rustcorp.com.au> wrote:
>
> Amit Shah (12):
> virtio: pci: switch to new PM API
Hmm. Afaik, this is broken, or at least not complete.
Sure, it switches to the new PM API, but it still does the PCI ops itself.
It should not need to - the PCI layer will do the power state and
standard PCI device state saving. And setting the PCI_D3hot state when
shared interrupts can still happen at suspend time is just a bad idea.
So I think you're doing extra work and introducing bugs by doing so -
the default PCI bus operations should already do all you do, just do
it better. And then you can use the SIMPLE_DEV_PM_OPS() to build the
dev_pm_ops structure and get all the normal cases right automatically.
I don't know if there is any particularly good example of this, but
you can see some of the network drivers for examples of this. Notice
how they don't need to worry about PCI power states etc at all, they
just need to worry about the actual chip suspend/resume (and for a
network driver, you'd do the netif_device_detach/netif_device_attach
etc)
Linus
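Below is a minimal user-space model of what Linus suggests, assuming a CONFIG_PM_SLEEP build. SIMPLE_DEV_PM_OPS() fills all six system-sleep hooks of struct dev_pm_ops from just two callbacks. The driver's callbacks only deal with the chip and leave PCI power states to the PCI core. This is not kernel code. struct model_dev_pm_ops, the MODEL_* macros and the my_chip_* callbacks are hypothetical stand-ins, written to mirror the shape of the kernel macros so the mapping can be compiled and checked on its own.

```c
/*
 * User-space model (NOT kernel code) of what SIMPLE_DEV_PM_OPS() produces.
 * struct model_dev_pm_ops is a hypothetical cut-down stand-in for
 * struct dev_pm_ops; only the six system-sleep hooks are modelled.
 */
#include <assert.h>
#include <stddef.h>

struct device; /* opaque, as in the kernel */

struct model_dev_pm_ops {
	int (*suspend)(struct device *dev);
	int (*resume)(struct device *dev);
	int (*freeze)(struct device *dev);
	int (*thaw)(struct device *dev);
	int (*poweroff)(struct device *dev);
	int (*restore)(struct device *dev);
};

/* Same shape as SET_SYSTEM_SLEEP_PM_OPS() with CONFIG_PM_SLEEP enabled. */
#define MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	.suspend = suspend_fn, .resume = resume_fn,          \
	.freeze = suspend_fn, .thaw = resume_fn,             \
	.poweroff = suspend_fn, .restore = resume_fn,

/* Same shape as SIMPLE_DEV_PM_OPS(). */
#define MODEL_SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn)     \
	const struct model_dev_pm_ops name = {                   \
		MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	}

/* Hypothetical driver callbacks: chip-level work only, no PCI state. */
static int chip_calls;

static int my_chip_suspend(struct device *dev)
{
	(void)dev;
	chip_calls++; /* e.g. quiesce the chip; netif_device_detach() for a NIC */
	return 0;
}

static int my_chip_resume(struct device *dev)
{
	(void)dev;
	chip_calls++; /* e.g. reprogram the chip; netif_device_attach() for a NIC */
	return 0;
}

MODEL_SIMPLE_DEV_PM_OPS(my_pm_ops, my_chip_suspend, my_chip_resume);
```

Freeze, thaw, poweroff and restore simply reuse the suspend and resume callbacks. A driver that needs different work on the hibernation (S4) path therefore has to fill in struct dev_pm_ops by hand, which is the point Amit makes in his reply.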
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2012-01-13 2:19 UTC
To: Rusty Russell
Cc: Pawel Moll, Benjamin Herrenschmidt, virtualization,
Christian Borntraeger, Sasha Levin, Anthony Liguori
In-Reply-To: <87wr8x4rye.fsf@rustcorp.com.au>
On Thu, Jan 12, 2012 at 12:12:17PM +1030, Rusty Russell wrote:
> On Thu, 12 Jan 2012 00:02:33 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > Look, we have a race currently. Let us not tie a bug fix to a huge
> > rewrite with unclear performance benefits, please.
>
> In theory, yes. In practice, we bandaid it.
>
> I think in the short term we change ->get to get the entire sequence
> twice, and check it's the same. Theoretically, still racy, but it does
> cut the window. And we haven't seen the bug yet, either.
I thought about this some more. Since we always get
an interrupt on config changes, it seems that a rather
robust method would be to just synchronize against that.
Something like the below (warning - completely untested).
Still need to think about memory barriers, overflow etc.
What do you think?
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 03d1984..b5df385 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -57,6 +57,7 @@ struct virtio_pci_device
unsigned msix_used_vectors;
/* Whether we have vector per vq */
bool per_vq_vectors;
+ atomic_t config_changes;
};
/* Constants for MSI-X */
@@ -125,6 +126,19 @@ static void vp_finalize_features(struct virtio_device *vdev)
iowrite32(vdev->features[0], vp_dev->ioaddr+VIRTIO_PCI_GUEST_FEATURES);
}
+/* wait for pending irq handlers */
+static void vp_synchronize_vectors(struct virtio_device *vdev)
+{
+ struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+ int i;
+
+ if (vp_dev->intx_enabled)
+ synchronize_irq(vp_dev->pci_dev->irq);
+
+ for (i = 0; i < vp_dev->msix_vectors; ++i)
+ synchronize_irq(vp_dev->msix_entries[i].vector);
+}
+
/* virtio config->get() implementation */
static void vp_get(struct virtio_device *vdev, unsigned offset,
void *buf, unsigned len)
@@ -134,9 +148,20 @@ static void vp_get(struct virtio_device *vdev, unsigned offset,
VIRTIO_PCI_CONFIG(vp_dev) + offset;
u8 *ptr = buf;
int i;
-
- for (i = 0; i < len; i++)
- ptr[i] = ioread8(ioaddr + i);
+ int c;
+ do {
+ c = atomic_read(&vp_dev->config_changes);
+ /* Make sure read is done before we get the first config byte */
+ rmb();
+ for (i = 0; i < len; i++)
+ ptr[i] = ioread8(ioaddr + i);
+ /* Synchronize with config interrupt */
+ vp_synchronize_vectors(vdev);
+ /*
+ * For multi-byte fields, we might get a config change interrupt
+ * between byte reads. If this happens, retry the read.
+ */
+ } while (c != atomic_read(&vp_dev->config_changes));
}
/* the config->set() implementation. it's symmetric to the config->get()
@@ -169,19 +194,6 @@ static void vp_set_status(struct virtio_device *vdev, u8 status)
iowrite8(status, vp_dev->ioaddr + VIRTIO_PCI_STATUS);
}
-/* wait for pending irq handlers */
-static void vp_synchronize_vectors(struct virtio_device *vdev)
-{
- struct virtio_pci_device *vp_dev = to_vp_device(vdev);
- int i;
-
- if (vp_dev->intx_enabled)
- synchronize_irq(vp_dev->pci_dev->irq);
-
- for (i = 0; i < vp_dev->msix_vectors; ++i)
- synchronize_irq(vp_dev->msix_entries[i].vector);
-}
-
static void vp_reset(struct virtio_device *vdev)
{
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
@@ -213,6 +225,8 @@ static irqreturn_t vp_config_changed(int irq, void *opaque)
drv = container_of(vp_dev->vdev.dev.driver,
struct virtio_driver, driver);
+ atomic_inc(&vp_dev->config_changes);
+
if (drv && drv->config_changed)
drv->config_changed(&vp_dev->vdev);
return IRQ_HANDLED;
@@ -646,6 +660,7 @@ static int __devinit virtio_pci_probe(struct pci_dev *pci_dev,
vp_dev->vdev.config = &virtio_pci_config_ops;
vp_dev->pci_dev = pci_dev;
INIT_LIST_HEAD(&vp_dev->virtqueues);
+ atomic_set(&vp_dev->config_changes, 0);
spin_lock_init(&vp_dev->lock);
/* Disable MSI/MSIX to bring device to a known good state. */
* Re: [PULL] virtio and lguest
From: Rusty Russell @ 2012-01-13 2:29 UTC
To: Linus Torvalds
Cc: Stratos Psomadakis, Michael S. Tsirkin,
lkml - Kernel Mailing List, virtualization, Sasha Levin,
Amit Shah, Jacek Galowicz, Christoph Hellwig, Davidlohr Bueso
In-Reply-To: <CA+55aFyOy4bWUq6PH3ThM2CXOFwi75FE8HGOJ8DNZjFWw9rq6A@mail.gmail.com>
On Thu, 12 Jan 2012 16:29:14 -0800, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, Jan 11, 2012 at 9:22 PM, Rusty Russell <rusty@rustcorp.com.au> wrote:
> >
> > Amit Shah (12):
> > virtio: pci: switch to new PM API
>
> Hmm. Afaik, this is broken, or at least not complete.
>
> Sure, it switches to the new PM API, but it still does the PCI ops itself.
>
> It should not need to - the PCI layer will do the power state and
> standard PCI device state saving. And setting the PCI_D3hot state when
> shared interrupts can still happen at suspend time is just a bad idea.
>
> So I think you're doing extra work and introducing bugs by doing so -
> the default PCI bus operations should already do all you do, just do
> it better. And then you can use the SIMPLE_DEV_PM_OPS() to build the
> dev_pm_ops structure and get all the normal cases right automatically.
>
> I don't know if there is any particularly good example of this, but
> you can see some of the network drivers for examples of this. Notice
> how they don't need to worry about PCI power states etc at all, they
> just need to worry about the actual chip suspend/resume (and for a
> network driver, you'd do the netif_device_detach/netif_device_attach
> etc)
Ok, I'll confess complete ignorance, and wait for Amit to respond. I
must admit that PM for virtual devices is not a personal priority...
Thanks,
Rusty.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-13 2:32 UTC
To: Michael S. Tsirkin
Cc: Pawel Moll, virtualization, Christian Borntraeger, Sasha Levin,
Anthony Liguori
In-Reply-To: <20120113021930.GA15379@redhat.com>
On Fri, 2012-01-13 at 04:19 +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 12, 2012 at 12:12:17PM +1030, Rusty Russell wrote:
> > On Thu, 12 Jan 2012 00:02:33 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > Look, we have a race currently. Let us not tie a bug fix to a huge
> > > rewrite with unclear performance benefits, please.
> >
> > In theory, yes. In practice, we bandaid it.
> >
> > I think in the short term we change ->get to get the entire sequence
> > twice, and check it's the same. Theoretically, still racy, but it does
> > cut the window. And we haven't seen the bug yet, either.
>
> I thought about this some more. Since we always get
> an interrupt on config changes, it seems that a rather
> robust method would be to just synchronize against that.
> Something like the below (warning - completely untested).
> Still need to think about memory barriers, overflow etc.
> What do you think?
Your interrupt may take an unpredictable amount of time to arrive, I
don't see how you can use that as a guarantee.
Cheers,
Ben.
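Ben's objection can be shown with a small single-threaded model of the counter-and-retry idea from Michael's patch, with the counter re-sampled on every pass of the loop. Everything here is hypothetical. struct model_dev, config_irq(), maybe_update() and model_get16() are stand-ins, and the config-change interrupt is an ordinary function call. synchronize_irq() has no equivalent here because it only waits for handlers that are already running. If the handler bumps the counter before the check, the torn read is caught and retried. If delivery is delayed past the check, the loop sees no change and returns a value that mixes the old low byte with the new high byte.

```c
/*
 * Hypothetical single-threaded model of the counter-and-retry read in the
 * vp_get() patch. Real devices, real interrupts and synchronize_irq() are
 * not modelled; the "interrupt" is a plain call to config_irq().
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_dev {
	uint8_t cfg[2];          /* little-endian 16-bit config field */
	unsigned config_changes; /* bumped by the config interrupt handler */
	bool pending_irq;        /* device changed config, handler not run yet */
	int updates_left;        /* config updates still to inject mid-read */
	bool irq_delayed;        /* true: handler runs only after the read returns */
};

/* Stand-in for vp_config_changed(): count a pending config change. */
static void config_irq(struct model_dev *d)
{
	if (d->pending_irq) {
		d->pending_irq = false;
		d->config_changes++;
	}
}

/* Device side: change the field from 0x00ff to 0x0100 between byte reads. */
static void maybe_update(struct model_dev *d)
{
	if (d->updates_left-- > 0) {
		d->cfg[0] = 0x00;
		d->cfg[1] = 0x01;
		d->pending_irq = true;
		if (!d->irq_delayed)
			config_irq(d); /* handler runs before the reader's check */
	}
}

/* Reader: sample the counter, read byte by byte, retry if it moved. */
static uint16_t model_get16(struct model_dev *d)
{
	uint8_t b[2];
	unsigned c;

	do {
		c = d->config_changes;
		b[0] = d->cfg[0];
		maybe_update(d); /* a config change can land between the bytes */
		b[1] = d->cfg[1];
	} while (c != d->config_changes);
	return (uint16_t)(b[0] | (b[1] << 8));
}
```

The delayed case is Ben's point. The retry loop can only react to interrupts that have already been handled, and nothing bounds how late one may arrive.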
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Rusty Russell @ 2012-01-13 3:22 UTC
To: Michael S. Tsirkin
Cc: Christian Borntraeger, Benjamin Herrenschmidt, Sasha Levin,
Pawel Moll, virtualization
In-Reply-To: <20120112060009.GC10319@redhat.com>
On Thu, 12 Jan 2012 08:00:10 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Jan 12, 2012 at 12:31:09PM +1030, Rusty Russell wrote:
> > If we use a 32-bit counter, we also get this though, right?
> >
> > If counter has changed, it's a config interrupt...
>
> But we need an exit to read the counter. We can put the counter
> in memory but this looks suspiciously like a simplified VQ -
> so why not use a VQ then?
Because now a driver first gets the data from config space. But from
then on, they have to get it from the vq, and ignore the config space.
That's a bit weird.
> > > If we do require config VQ anyway, why not use it to notify
> > > guest of config changes? Guest could pre-post an in buffer
> > > and host uses that.
> >
> > We could, but it's an additional burden on each device. vqs are cheap,
> > but not free. And the config area is so damn convenient...
>
> Not if you start playing with counters, checking it twice,
> reinvent all kind of barriers ...
None of that appears inside the driver, though. And let's be honest,
it's not *that* bad (very approx code):
static u32 vp_get_gen(struct virtio_pci_device *vp_dev)
{
	u32 gen;

	/* An odd generation means the device is mid-update: wait it out. */
	do {
		gen = ioread32(vp_dev->ioaddr + VIRTIO_PCI_CONFIG_GEN);
	} while (unlikely((gen & 1) == 1));
	/* Order the generation read before the config-field reads. */
	virtio_rmb();
	return gen;
}

static bool vp_check_gen(struct virtio_pci_device *vp_dev, u32 gen)
{
	/* Order the config-field reads before re-reading the generation. */
	virtio_rmb();
	return ioread32(vp_dev->ioaddr + VIRTIO_PCI_CONFIG_GEN) == gen;
}

static void vp_get32(struct virtio_device *vdev, unsigned offset, u32 *buf)
{
	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
	u32 gen;

	/* Retry until the generation is unchanged across the read. */
	do {
		gen = vp_get_gen(vp_dev);
		*buf = ioread32(vp_dev->ioaddr + VIRTIO_PCI_CONFIG(vp_dev) + offset);
	} while (unlikely(!vp_check_gen(vp_dev, gen)));
}
...
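The generation-counter reader above can be exercised the same way. This is a hypothetical single-threaded model. struct model_cfg, device_update(), get_gen() and get32() stand in for the device and for vp_get_gen()/vp_check_gen()/vp_get32(). The MMIO accessors and virtio_rmb() barriers are omitted because nothing runs concurrently. The device makes gen odd while it changes the field and even again afterwards, so a reader that sees gen move across its read throws the value away and retries. Since the reader samples the counter from the device itself, no interrupt delivery is involved.

```c
/*
 * Hypothetical single-threaded model of a generation-counter config read.
 * MMIO accessors and virtio_rmb() are omitted: nothing runs concurrently.
 */
#include <assert.h>
#include <stdint.h>

struct model_cfg {
	uint32_t gen;   /* even: config stable, odd: device update in progress */
	uint32_t value; /* one 32-bit config field */
	int inject;     /* device updates still to inject during a read */
	int reads;      /* number of times the field was read */
};

/* Device side: make gen odd, change the field, make gen even again. */
static void device_update(struct model_cfg *c, uint32_t v)
{
	c->gen++;
	c->value = v;
	c->gen++;
}

/* Like vp_get_gen(): wait until no update is in progress. */
static uint32_t get_gen(const struct model_cfg *c)
{
	uint32_t gen;

	do {
		gen = c->gen;
	} while (gen & 1); /* the model never leaves gen odd between calls */
	return gen;
}

/* Like vp_get32(): retry until gen is unchanged across the read. */
static uint32_t get32(struct model_cfg *c)
{
	uint32_t gen, v;

	do {
		gen = get_gen(c);
		v = c->value;
		c->reads++;
		if (c->inject-- > 0)
			device_update(c, 0xdeadbeef); /* lands after v was read */
	} while (c->gen != gen);
	return v;
}
```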
> > It was suggested by others, but I think TCP Acks are the classic one.
> > 12 + 14 + 20 + 40 = 86 bytes with virtio_net_hdr_mrg_rxbuf at the front.
>
> That's only the simplest IPv4, right?
> Anyway, this spans multiple descriptors so this complicates allocation
> significantly.
Yes, I think any general-but-useful inline will need to span multiple
descriptors. That's part of the fun!
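The 86-byte figure quoted above can be checked term by term. The 12 is the size of struct virtio_net_hdr_mrg_rxbuf, which is the 10-byte struct virtio_net_hdr followed by a 16-bit num_buffers. The 14 is an Ethernet header. The 20 is read as an option-less IPv4 header, as Michael's reply does, and the 40 is carried through as given. The structs below are local model_* copies so the sketch compiles without kernel headers. Swapping in the fixed 40-byte IPv6 header shows why Michael calls 86 the simplest case.

```c
/* Local copies of the virtio-net header layouts (no kernel headers needed). */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_virtio_net_hdr {      /* mirrors struct virtio_net_hdr */
	uint8_t flags;
	uint8_t gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
};                                 /* 10 bytes, no padding on common ABIs */

struct model_virtio_net_hdr_mrg_rxbuf { /* mirrors virtio_net_hdr_mrg_rxbuf */
	struct model_virtio_net_hdr hdr;
	uint16_t num_buffers;
};                                 /* 12 bytes */

enum {
	ETH_HDR_LEN = 14,    /* two MAC addresses + EtherType */
	IPV4_HDR_LEN = 20,   /* IPv4 header without options */
	IPV6_HDR_LEN = 40,   /* fixed IPv6 header */
	QUOTED_TCP_LEN = 40, /* as given in the message (TCP with options, presumably) */
};

/* Bytes for a bare TCP ACK with the mergeable-buffers header in front. */
static size_t ack_bytes(size_t ip_hdr_len)
{
	return sizeof(struct model_virtio_net_hdr_mrg_rxbuf) + ETH_HDR_LEN +
	       ip_hdr_len + QUOTED_TCP_LEN;
}
```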
Let's get totally crazy and implement our ring in stripes, like:
00 04 08 12 01 05 09 13 02 06 10 14 03 07 11 15
That way consecutive (32-byte) descriptors don't share a cacheline!
(Not serious... quiet.)
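The striped layout is offered as a joke, but the index arithmetic is easy to make explicit. The sketch assumes 16 descriptors in 4 stripes and 32-byte descriptors, as in the message. It also assumes 64-byte cachelines, which the message does not state. desc_in_slot() reproduces the order 00 04 08 12 01 05 ..., and consecutive_share_line() checks that no two consecutive descriptors share a cacheline, unlike a plain linear ring. All names are hypothetical.

```c
/* Hypothetical index arithmetic for a 16-entry ring split into 4 stripes. */
#include <assert.h>
#include <stdbool.h>

enum {
	RING_SIZE = 16,
	STRIPES = 4,
	PER_STRIPE = RING_SIZE / STRIPES,
	DESC_SIZE = 32, /* bytes per descriptor, as in the message */
	CACHELINE = 64, /* assumed cacheline size */
};

/* Which descriptor lives in physical slot `slot`: 00 04 08 12 01 05 ... */
static unsigned desc_in_slot(unsigned slot)
{
	return (slot % STRIPES) * PER_STRIPE + slot / STRIPES;
}

/* Inverse mapping: physical slot holding descriptor `d`. */
static unsigned slot_of_desc(unsigned d)
{
	return (d % PER_STRIPE) * STRIPES + d / PER_STRIPE;
}

static unsigned line_of_slot(unsigned slot)
{
	return slot * DESC_SIZE / CACHELINE;
}

/* True if any two consecutive descriptors share a cacheline. */
static bool consecutive_share_line(bool striped)
{
	for (unsigned d = 0; d + 1 < RING_SIZE; d++) {
		unsigned a = striped ? slot_of_desc(d) : d;
		unsigned b = striped ? slot_of_desc(d + 1) : d + 1;

		if (line_of_slot(a) == line_of_slot(b))
			return true;
	}
	return false;
}
```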
> > Yes, I'm thinking #define VIRTIO_F_VIRTIO2 (-1). For PCI, this gets
> > mapped into a "are we using the new config layout?". For others, it
> > gets mapped into a transport-specific feature.
> >
> > (I'm sure you get it, but for the others) This is because I want to
> > draw a clear line between all the legacy stuff at the same time, not
> > have to support part of it later because someone might not flip the
> > feature bit.
>
> So my point is, config stuff and ring are completely separate,
> they are different layers.
Absolutely, and we should analyze them separately as well as together.
*But* for maintenance it's far easier if we only have to test new
config+new ring and old config+old ring. They do interact, because
remember, the allocation of the ring changes with new config, too...
Cheers,
Rusty.
* Re: [PATCH] vhost-net: add module alias (v2.1)
From: David Miller @ 2012-01-13 4:07 UTC
To: shemminger; +Cc: kvm, mst, netdev, kay.sievers, virtualization, device
In-Reply-To: <20120111213038.39213819@nehalam.linuxnetplumber.net>
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Wed, 11 Jan 2012 21:30:38 -0800
> Subject: vhost-net: add module alias (v2.1)
>
> By adding some module aliases, programs (or users) won't have to explicitly
> call modprobe. Vhost-net will always be available if built into the kernel.
> It does require assigning a permanent minor number for depmod to work.
>
> Also:
> - use C99 style initialization.
> - add missing entry in documentation for loop-control
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
ACKs, NACKs? What is happening here?
* Re: [PATCH] vhost-net: add module alias (v2.1)
From: Kay Sievers @ 2012-01-13 4:19 UTC
To: David Miller; +Cc: kvm, mst, netdev, virtualization, shemminger, device
In-Reply-To: <20120112.200701.1473475851890804136.davem@davemloft.net>
On Fri, Jan 13, 2012 at 05:07, David Miller <davem@davemloft.net> wrote:
> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Wed, 11 Jan 2012 21:30:38 -0800
>
>> Subject: vhost-net: add module alias (v2.1)
>>
>> By adding some module aliases, programs (or users) won't have to explicitly
>> call modprobe. Vhost-net will always be available if built into the kernel.
>> It does require assigning a permanent minor number for depmod to work.
>>
>> Also:
>> - use C99 style initialization.
>> - add missing entry in documentation for loop-control
>>
>> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>
> ACKs, NACKs? What is happening here?
In general, static minors are acceptable and very useful to make
on-demand loading of kernel modules work. They should be used only
for single-instance devices though, which usually means: One single
static device name associated with a module.
That looks all fine here, and for what it's worth:
Acked-By: Kay Sievers <kay.sievers@vrfy.org>
Kay
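The following user-space sketch shows the kind of aliases the patch description refers to. In the kernel, MODULE_ALIAS_MISCDEV(minor) expands to a "char-major-<MISC_MAJOR>-<minor>" alias. depmod pairs a "devname:" alias with such a fixed device number to write modules.devname, from which the /dev node can be created before the module is loaded. That pairing is why the patch needs a permanent minor. The MODEL_* macros are local stand-ins for __stringify() and MODULE_ALIAS_MISCDEV(). The minor number 238 is assumed here to be the value given to VHOST_NET_MINOR.

```c
/* User-space sketch of how the module alias strings are put together. */
#include <assert.h>
#include <string.h>

/* Two-level stringify, the same trick as the kernel's __stringify(). */
#define MODEL_STRINGIFY_1(x) #x
#define MODEL_STRINGIFY(x) MODEL_STRINGIFY_1(x)

#define MODEL_MISC_MAJOR 10       /* major number of misc character devices */
#define MODEL_VHOST_NET_MINOR 238 /* assumed static minor for /dev/vhost-net */

/* Same shape as MODULE_ALIAS_MISCDEV(minor). */
#define MODEL_MISCDEV_ALIAS(minor) \
	"char-major-" MODEL_STRINGIFY(MODEL_MISC_MAJOR) "-" MODEL_STRINGIFY(minor)

static const char *const vhost_net_aliases[] = {
	MODEL_MISCDEV_ALIAS(MODEL_VHOST_NET_MINOR), /* requested on open of 10:238 */
	"devname:vhost-net",                        /* node name for modules.devname */
};
```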
* Re: [PULL] virtio and lguest
From: Amit Shah @ 2012-01-13 10:48 UTC
To: Linus Torvalds
Cc: Stratos Psomadakis, Michael S. Tsirkin,
lkml - Kernel Mailing List, virtualization, Sasha Levin,
Jacek Galowicz, Christoph Hellwig, Davidlohr Bueso
In-Reply-To: <CA+55aFyOy4bWUq6PH3ThM2CXOFwi75FE8HGOJ8DNZjFWw9rq6A@mail.gmail.com>
Hi,
On (Thu) 12 Jan 2012 [16:29:14], Linus Torvalds wrote:
> On Wed, Jan 11, 2012 at 9:22 PM, Rusty Russell <rusty@rustcorp.com.au> wrote:
> >
> > Amit Shah (12):
> > virtio: pci: switch to new PM API
>
> Hmm. Afaik, this is broken, or at least not complete.
>
> Sure, it switches to the new PM API, but it still does the PCI ops itself.
>
> It should not need to - the PCI layer will do the power state and
> standard PCI device state saving. And setting the PCI_D3hot state when
> shared interrupts can still happen at suspend time is just a bad idea.
The idea behind this patchset is to get S4 working properly. There's
no change to the way S3 was/is being done, and the state-setting is
done only in the S3 PM callbacks.
> So I think you're doing extra work and introducing bugs by doing so -
> the default PCI bus operations should already do all you do, just do
For S4, we need some driver-specific (not just virtio-specific) work
to be done on the freeze/restore callbacks...
> it better. And then you can use the SIMPLE_DEV_PM_OPS() to build the
> dev_pm_ops structure and get all the normal cases right automatically.
... and we also have separate stuff to be done in thaw/restore/freeze
callbacks for different drivers. So using the *_PM_OPS() macros
wouldn't have worked.
> I don't know if there is any particularly good example of this, but
> you can see some of the network drivers for examples of this. Notice
> how they don't need to worry about PCI power states etc at all, they
> just need to worry about the actual chip suspend/resume (and for a
> network driver, you'd do the netif_device_detach/netif_device_attach
> etc)
I think your concern is with the way S3 is being done, and I volunteer
to look at improving the situation there. Might take a while, though.
Amit
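Amit's argument is that the hibernation (S4) hooks need different work from the S3 suspend/resume pair. SIMPLE_DEV_PM_OPS() cannot express this because it reuses the same two callbacks for every hook. Below is a minimal user-space sketch of filling in the operations by hand instead. struct model_dev_pm_ops is a hypothetical cut-down stand-in for struct dev_pm_ops. The vdev_* callbacks only record which step ran rather than doing any real driver work.

```c
/* User-space sketch: filling in the PM hooks by hand so S3 and S4 can differ. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct device; /* opaque in this model */

struct model_dev_pm_ops {               /* hypothetical stand-in for dev_pm_ops */
	int (*suspend)(struct device *dev); /* S3: entering suspend-to-RAM */
	int (*resume)(struct device *dev);  /* S3: waking up */
	int (*freeze)(struct device *dev);  /* S4: before the image is created */
	int (*thaw)(struct device *dev);    /* S4: after image creation, or on failure */
	int (*restore)(struct device *dev); /* S4: after the image has been loaded */
};

static const char *last_step = "";

static int vdev_suspend(struct device *dev)
{
	(void)dev;
	last_step = "suspend";
	return 0;
}

static int vdev_resume(struct device *dev)
{
	(void)dev;
	last_step = "resume";
	return 0;
}

static int vdev_freeze(struct device *dev)
{
	(void)dev;
	last_step = "freeze"; /* driver-specific S4 work would go here */
	return 0;
}

static int vdev_thaw(struct device *dev)
{
	(void)dev;
	last_step = "thaw";
	return 0;
}

static int vdev_restore(struct device *dev)
{
	(void)dev;
	last_step = "restore"; /* driver-specific S4 re-initialisation */
	return 0;
}

static const struct model_dev_pm_ops vdev_pm_ops = {
	.suspend = vdev_suspend,
	.resume = vdev_resume,
	.freeze = vdev_freeze,
	.thaw = vdev_thaw,
	.restore = vdev_restore,
};
```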
* Re: [PATCH] vhost-net: add module alias (v2.1)
From: David Miller @ 2012-01-13 18:12 UTC
To: kay.sievers; +Cc: kvm, mst, netdev, virtualization, shemminger, device
In-Reply-To: <CAPXgP10uf-5wDxGOP8yhfUcn+18Ga+eZp-Psz_Pxe9PoJZagoA@mail.gmail.com>
From: Kay Sievers <kay.sievers@vrfy.org>
Date: Fri, 13 Jan 2012 05:19:05 +0100
> On Fri, Jan 13, 2012 at 05:07, David Miller <davem@davemloft.net> wrote:
>> From: Stephen Hemminger <shemminger@vyatta.com>
>> Date: Wed, 11 Jan 2012 21:30:38 -0800
>>
>>> Subject: vhost-net: add module alias (v2.1)
>>>
>>> By adding some module aliases, programs (or users) won't have to explicitly
>>> call modprobe. Vhost-net will always be available if built into the kernel.
>>> It does require assigning a permanent minor number for depmod to work.
>>>
>>> Also:
>>> - use C99 style initialization.
>>> - add missing entry in documentation for loop-control
>>>
>>> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>>
>> ACKs, NACKs? What is happening here?
>
> In general, static minors are acceptable and very useful to make
> on-demand loading of kernel modules work. They should be used only
> for single-instance devices though, which usually means: One single
> static device name associated with a module.
>
> That looks all fine here, and for what it's worth:
> Acked-By: Kay Sievers <kay.sievers@vrfy.org>
Ok, applied, thanks everyone.
* CFP: ACM HPDC 2012, abstracts due January 16th, 2012
From: Ioan Raicu @ 2012-01-14 14:09 UTC
To: virtualization
**** CALL FOR PAPERS ****
The 21st International ACM Symposium on
High-Performance Parallel and Distributed Computing
(HPDC'12)
Delft University of Technology, Delft, the Netherlands
June 18-22, 2012
http://www.hpdc.org/2012
The ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC)
is the premier annual conference on the design, the implementation, the evaluation, and
the use of parallel and distributed systems for high-end computing. HPDC'12 will take place
in Delft, the Netherlands, a historical, picturesque city that is less than one hour away
from Amsterdam-Schiphol airport. The conference will be held on June 20-22 (Wednesday to
Friday), with affiliated workshops taking place on June 18-19 (Monday and Tuesday).
**** SUBMISSION DEADLINES ****
Abstracts: 16 January 2012
Papers: 23 January 2012 (No extensions!)
**** HPDC'12 GENERAL CHAIR ****
Dick Epema, Delft University of Technology, Delft, the Netherlands
**** HPDC'12 PROGRAM CO-CHAIRS ****
Thilo Kielmann, Vrije Universiteit, Amsterdam, the Netherlands
Matei Ripeanu, The University of British Columbia, Vancouver, Canada
**** HPDC'12 WORKSHOPS CHAIR ****
Alexandru Iosup, Delft University of Technology, Delft, the Netherlands
**** SCOPE AND TOPICS ****
Submissions are welcomed on all forms of high-performance parallel and distributed computing,
including but not limited to clusters, clouds, grids, utility computing, data-intensive
computing, and massively multicore systems. Submissions that explore solutions to estimate
and reduce the energy footprint of such systems are particularly encouraged. All papers
will be evaluated for their originality, potential impact, correctness, quality of
presentation, appropriate presentation of related work, and relevance to the conference,
with a strong preference for rigorous results obtained in operational parallel and
distributed systems.
The topics of interest of the conference include, but are not limited to, the following,
in the context of high-performance parallel and distributed computing:
- Systems, networks, and architectures for high-end computing
- Massively multicore systems
- Virtualization of machines, networks, and storage
- Programming languages and environments
- I/O, storage systems, and data management
- Resource management, energy and cost minimizations
- Performance modeling and analysis
- Fault tolerance, reliability, and availability
- Data-intensive computing
- Applications of parallel and distributed computing
**** PAPER SUBMISSION GUIDELINES ****
Authors are invited to submit technical papers of at most 12 pages in PDF format, including
figures and references. Papers should be formatted in the ACM Proceedings Style and
submitted via the conference web site. No changes to the margins, spacing, or font sizes as
specified by the style file are allowed. Accepted papers will appear in the conference
proceedings, and will be incorporated into the ACM Digital Library. A limited number of
papers will be accepted as posters.
Papers must be self-contained and provide the technical substance required for the program
committee to evaluate their contributions. Submitted papers must be original work that has
not appeared in and is not under consideration for another conference or a journal. See the
ACM Prior Publication Policy for more details.
**** IMPORTANT DATES ****
Abstracts Due: 16 January 2012
Papers Due: 23 January 2012 (No extensions!)
Reviews Released to Authors: 8 March 2012
Author Rebuttals Due: 12 March 2012
Author Notifications: 19 March 2012
Final Papers Due: 16 April 2012
Conference Dates: 18-22 June 2012
--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu@cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
* CFP: IEEE eScience 2012 in Chicago IL USA
From: Ioan Raicu @ 2012-01-14 18:01 UTC
To: virtualization
Call for Papers
8th IEEE International Conference on eScience
October 8-12, 2012
Chicago, IL, USA
Researchers in all disciplines are increasingly adopting digital tools,
techniques and practices, often in communities and projects that span
disciplines, laboratories, organizations, and national boundaries. The
eScience 2012 conference is designed to bring together leading
international and interdisciplinary research communities, developers,
and users of eScience applications and enabling IT technologies. The
conference serves as a forum to present the results of the latest
applications research and product/tool developments and to highlight
related activities from around the world. Also, we are now entering the
second decade of eScience and the 2012 conference gives an opportunity
to take stock of what has been achieved so far and look forward to the
challenges and opportunities the next decade will bring.
A special emphasis of the 2012 conference is on advances in the
application of technology in a particular discipline. Accordingly,
significant advances in applications science and technology will be
considered as important as the development of new technologies
themselves. Further, we welcome contributions in educational activities
under any of these disciplines.
As a result, the conference will be structured around two eScience tracks:
* *eScience Algorithms and Applications*
o eScience application areas, including:
+ Physical sciences
+ Biomedical sciences
+ Social sciences and humanities
o Data-oriented approaches and applications
o Compute-oriented approaches and applications
o Extreme scale approaches and applications
* *Cyberinfrastructure to support eScience*
o Novel hardware
o Novel uses of production infrastructure
o Software and services
o Tools
The conference proceedings will be published by the IEEE Computer
Society Press, USA and will be made available online through the IEEE
Digital Library. Selected papers will be invited to submit extended
versions to a special issue of the Future Generation Computer Systems
(FGCS)
<http://www.journals.elsevier.com/future-generation-computer-systems/>
journal.
SUBMISSION PROCESS
Authors are invited to submit papers with unpublished, original work of
not more than 8 pages of double column text using single spaced 10 point
size on 8.5 x 11 inch pages, as per IEEE 8.5 x 11 manuscript guidelines.
(Up to 2 additional pages may be purchased for US$150/page)
Templates are available from
http://www.ieee.org/conferences_events/conferences/publishing/templates.html.
Authors should submit a PDF file that will print on a PostScript printer
to https://www.easychair.org/conferences/?conf=escience2012
(Note that paper submitters also must submit an abstract in advance of
the paper deadline. This should be done through the same site where
papers are submitted.)
It is a requirement that at least one author of each accepted paper
attend the conference.
ORGANIZATION
General Chair
* *Ian Foster*, University of Chicago & Argonne National Laboratory, USA
Program Co-Chairs
* *Daniel S. Katz*, University of Chicago & Argonne National
Laboratory, USA
* *Heinz Stockinger*, SIB Swiss Institute of Bioinformatics, Switzerland
Program Vice Co-Chairs
* eScience Algorithms and Applications Track
o *David Abramson*, Monash University, Australia
o *Gabrielle Allen*, Louisiana State University, USA
* Cyberinfrastructure to support eScience Track
o *Rosa M. Badia*, Barcelona Supercomputing Center / CSIC, Spain
o *Geoffrey Fox*, Indiana University, USA
Sponsorship Chair
* *Charlie Catlett*, Argonne National Laboratory, USA
Conference Manager and Finance Chair
* *Julie Wulf-Knoerzer*, University of Chicago & Argonne National
Laboratory, USA
Publicity Chairs
* *Kento Aida*, National Institute of Informatics, Japan
* *Ioan Raicu*, Illinois Institute of Technology, USA
* *David Wallom*, Oxford e-Research Centre, UK
Local Organizing Committee
* *Ninfa Mayorga*, University of Chicago, USA
* *Evelyn Rayburn*, University of Chicago, USA
* *Lynn Valentini*, Argonne National Laboratory, USA
Program Committee
* eScience Algorithms and Applications Track
o *Srinivas Aluru*, Iowa State University, USA
o *Ashiq Anjum*, University of Derby, UK
o *David A. Bader*, Georgia Institute of Technology, USA
o *Jon Blower*, University of Reading, UK
o *Paul Bonnington*, Monash University, Australia
o *Simon Cox*, University of Southampton, UK
o *David De Roure*, Oxford e-Research Centre, UK
o *George Djorgovski*, California Institute of Technology, USA
o *Anshu Dubey*, University of Chicago & Argonne National
Laboratory, USA
o *Yuri Estrin*, Monash University, Australia
o *Dan Fay*, Microsoft, USA
o *Jeremy Frey*, University of Southampton, UK
o *Wolfgang Gentzsch*, HPC Consultant, Germany
o *Lutz Gross*, The University of Queensland, Australia
o *Sverker Holmgren*, Uppsala University, Sweden
o *Bill Howe*, University of Washington, USA
o *Marina Jirotka*, University of Oxford, UK
o *Timoleon Kipouros*, University of Cambridge, UK
o *Kerstin Kleese van Dam*, Pacific Northwest National Laboratory, USA
o *Arun S. Konagurthu*, Monash University, Australia
o *Peter Kunszt*, SystemsX.ch, Switzerland
o *Alexey Lastovetsky*, University College Dublin, Ireland
o *Andrew Lewis*, Griffith University, Australia
o *Sergio Maffioletti*, University of Zurich, Switzerland
o *Amitava Majumdar*, San Diego Supercomputer Center, University
of California at San Diego, USA
o *Rui Mao*, Shenzhen University, China
o *Madhav V. Marathe*, Virginia Tech, USA
o *Maryann Martone*, University of California at San Diego, USA
o *Louis Moresi*, Monash University, Australia
o *Riccardo Murri*, University of Zurich, Switzerland
o *Silvia D. Olabarriaga*, Academic Medical Center of the
University of Amsterdam, Netherlands
o *Enrique S. Quintana-Ortí*, Universidad Jaume I, Spain
o *Abani Patra*, University at Buffalo, USA
o *Rob Pennington*, NSF, USA
o *Andrew Perry*, Monash University, Australia
o *Beth Plale*, Indiana University, USA
o *Michael Resch*, University of Stuttgart, Germany
o *Adrian Sandu*, Virginia Tech, USA
o *Mark Savill*, Cranfield University, UK
o *Erik Schnetter*, Perimeter Institute for Theoretical Physics, Canada
o *Edward Seidel*, Louisiana State University, USA
o *Suzanne M. Shontz*, The Pennsylvania State University, USA
o *David Skinner*, Lawrence Berkeley National Laboratory, USA
o *Alan Sussman*, University of Maryland, USA
o *Alex Szalay*, Johns Hopkins University, USA
o *Domenico Talia*, ICAR-CNR & University of Calabria, Italy
o *Jian Tao*, Louisiana State University, USA
o *David Wallom*, Oxford e-Research Centre, UK
o *Shaowen Wang*, University of Illinois at Urbana-Champaign, USA
o *Michael Wilde*, Argonne National Laboratory & University of Chicago, USA
o *Nancy Wilkins-Diehr*, San Diego Supercomputer Center, University of California at San Diego, USA
o *Wu Zhang*, Shanghai University, China
o *Yunquan Zhang*, Chinese Academy of Sciences, China
* Cyberinfrastructure to support eScience Track
o *Deb Agarwal*, Lawrence Berkeley National Laboratory, USA
o *Ilkay Altintas*, San Diego Supercomputer Center, University of California at San Diego, USA
o *Henri Bal*, Vrije Universiteit, Netherlands
o *Roger Barga*, Microsoft, USA
o *Martin Berzins*, University of Utah, USA
o *John Brooke*, University of Manchester, UK
o *Thomas Fahringer*, University of Innsbruck, Austria
o *Gilles Fedak*, INRIA, France
o *José A. B. Fortes*, University of Florida, USA
o *Yolanda Gil*, ISI/USC, USA
o *Madhusudhan Govindaraju*, SUNY Binghamton, USA
o *Thomas Hacker*, Purdue University, USA
o *Ken Hawick*, Massey University, New Zealand
o *Marty Humphrey*, University of Virginia, USA
o *Hai Jin*, Huazhong University of Science and Technology, China
o *Thilo Kielmann*, Vrije Universiteit, Netherlands
o *Scott Klasky*, Oak Ridge National Laboratory, USA
o *Isao Kojima*, AIST, Japan
o *Tevfik Kosar*, University at Buffalo, USA
o *Dieter Kranzlmueller*, LMU & LRZ Munich, Germany
o *Erwin Laure*, KTH, Sweden
o *Jysoo Lee*, KISTI, Korea
o *Li Xiaoming*, Peking University, China
o *Bertram Ludäscher*, University of California, Davis, USA
o *Andrew Lumsdaine*, Indiana University, USA
o *Tanu Malik*, University of Chicago, USA
o *Satoshi Matsuoka*, Tokyo Institute of Technology, Japan
o *Reagan Moore*, University of North Carolina at Chapel Hill, USA
o *Shirley Moore*, University of Kentucky, USA
o *Steven Newhouse*, EGI, Netherlands
o *Dhabaleswar K. (DK) Panda*, The Ohio State University, USA
o *Manish Parashar*, Rutgers University, USA
o *Ron Perrott*, University of Oxford, UK
o *Depei Qian*, Beihang University, China
o *Judy Qiu*, Indiana University, USA
o *Ioan Raicu*, Illinois Institute of Technology, USA
o *Lavanya Ramakrishnan*, Lawrence Berkeley National Laboratory, USA
o *Omer Rana*, Cardiff University, UK
o *Paul Roe*, Queensland University of Technology, Australia
o *Bruno Schulze*, LNCC, Brazil
o *Marc Snir*, Argonne National Laboratory & University of Illinois at Urbana-Champaign, USA
o *Xian-He Sun*, Illinois Institute of Technology, USA
o *Yoshio Tanaka*, AIST, Japan
o *Michela Taufer*, University of Delaware, USA
o *Kerry Taylor*, CSIRO, Australia
o *Douglas Thain*, University of Notre Dame, USA
o *Paul Watson*, Newcastle University, UK
o *Jun Zhao*, University of Oxford, UK
--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu@cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
=================================================================
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
* [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
From: Raghavendra K T @ 2012-01-14 18:25 UTC
To: Jeremy Fitzhardinge, Randy Dunlap, linux-doc, KVM,
Konrad Rzeszutek Wilk, Glauber Costa, Jan Kiszka, Rik van Riel,
Dave Jiang, H. Peter Anvin, Thomas Gleixner, X86, Marcelo Tosatti,
Gleb Natapov, Avi Kivity, Alexander Graf, Stefano Stabellini,
Paul Mackerras, Sedat Dilek, Ingo Molnar, LKML,
Greg Kroah-Hartman, Virtualization, Rob Landley
Cc: Peter Zijlstra, Raghavendra K T, Srivatsa Vaddagiri, Dave Hansen,
Suzuki Poulose, Sasha Levin
The 5-patch series following this email extends the KVM hypervisor, and Linux guests
running on it, to support pv-ticket spinlocks, based on Xen's implementation.
One hypercall is introduced in the KVM hypervisor that allows a vcpu to kick
another vcpu out of halt state.
A waiting vcpu blocks by calling halt() in the lock_spinning slowpath.
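The halt/kick idea can be made concrete with a minimal single-threaded C sketch. It assumes a simplified ticket lock with 16-bit head/tail counters, and all the names used (sim_ticketlock, sim_lock, sim_unlock_kick, NR_SIM_CPUS) are hypothetical. A waiter that misses its turn does not spin. It records the lock and the ticket it wants, as if it were about to halt(). The unlocker then picks the single waiter whose ticket equals the new head. The real code uses arch_spinlock, halt() and the KVM_HC_KICK_CPU hypercall, and it takes this path only when the lock is in the slowpath.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_SIM_CPUS 4

/* Simplified ticket lock: head is the ticket being served,
 * tail is the next ticket to hand out. */
struct sim_ticketlock {
	uint16_t head;
	uint16_t tail;
};

/* What a halted (simulated) CPU is waiting for; lock == NULL means
 * the CPU is not waiting. Plays the role of the per-cpu lock_waiting. */
struct sim_waiting {
	struct sim_ticketlock *lock;
	uint16_t want;
};

static struct sim_waiting waiting[NR_SIM_CPUS];

/* Take a ticket. Returns true if the lock was free; otherwise, instead of
 * spinning, record (lock, ticket) as if the CPU were about to halt(). */
static bool sim_lock(struct sim_ticketlock *lock, int cpu)
{
	uint16_t ticket = lock->tail++;

	if (lock->head == ticket)
		return true;
	waiting[cpu].want = ticket;
	waiting[cpu].lock = lock;
	return false;
}

/* Release the lock and return the CPU that would be kicked: the waiter
 * whose ticket equals the new head. Returns -1 if nobody is waiting. */
static int sim_unlock_kick(struct sim_ticketlock *lock)
{
	uint16_t next = ++lock->head;

	for (int cpu = 0; cpu < NR_SIM_CPUS; cpu++) {
		if (waiting[cpu].lock == lock && waiting[cpu].want == next) {
			waiting[cpu].lock = NULL; /* kicked CPU now owns the lock */
			return cpu;
		}
	}
	return -1;
}
```

This sketch kicks only the vcpu whose ticket is next, not every waiter. That avoids waking vcpus that would immediately have to halt again.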
Changes in V4:
- rebased to 3.2.0 pre.
- use APIC ID for kicking the vcpu and use kvm_apic_match_dest for matching. (Avi)
- fold vcpu->kicked flag into vcpu->requests (KVM_REQ_PVLOCK_KICK) and related
changes for UNHALT path to make pv ticket spinlock migration friendly. (Avi, Marcelo)
- Added Documentation for CPUID, Hypercall (KVM_HC_KICK_CPU)
and capability (KVM_CAP_PVLOCK_KICK) (Avi)
- Remove unneeded kvm_arch_vcpu_ioctl_set_mpstate call. (Marcelo)
- cumulative variable type changed (int ==> u32) in add_stat (Konrad)
- remove unneeded kvm_guest_init for !CONFIG_KVM_GUEST case
Changes in V3:
- rebased to 3.2-rc1
- use halt() instead of wait for kick hypercall.
- modify kick hypercall to wake up halted vcpu.
- hook kvm_spinlock_init to smp_prepare_cpus call (moved the call out of head##.c).
- fix the potential race when zero_stat is read.
- export debugfs_create_u32_array and add documentation to API.
- use static inline and enum instead of ADDSTAT macro.
- add barrier() after setting kick_vcpu.
- empty static inline function for kvm_spinlock_init.
- combine patches one and two to reduce overhead.
- make KVM_DEBUG_FS depend on DEBUG_FS.
- include debugfs header unconditionally.
Changes in V2:
- rebased patches to -rc9
- synchronization related changes based on Jeremy's changes
(Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>) pointed out by
Stephan Diestelhorst <stephan.diestelhorst@amd.com>
- enabling 32 bit guests
- split patches into two more chunks
Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T (5):
Add debugfs support to print u32-arrays in debugfs
Add a hypercall to KVM hypervisor to support pv-ticketlocks
Added configuration support to enable debug information for KVM Guests
pv-ticketlocks support for linux guests running on KVM hypervisor
Add documentation on Hypercalls and features used for PV spinlock
Test setup:
The BASE patch is pre 3.2.0 + Jeremy's following patches.
xadd (https://lkml.org/lkml/2011/10/4/328)
x86/ticketlock (https://lkml.org/lkml/2011/10/12/496).
Kernel for host/guest: 3.2.0 + Jeremy's xadd, pv spinlock patches as BASE
(Note: the locked add change is not taken yet)
Results:
The performance gain comes mainly from reduced busy-wait time.
The results show that the patched kernel performs similarly to BASE
when there is no lock contention, but as contention grows, the
patched kernel outperforms BASE (non-PLE).
On the PLE machine the improvement is smaller, because PLE
complements halt().
3 guests, each with 8 VCPUs and 4GB RAM: one runs kernbench
(kernbench -f -H -M -o 20), the others run cpuhog (a shell script
running an endless while-true loop around a single instruction)
scenario A: unpinned
1x: no hogs
2x: 8 hogs in one guest
3x: 8 hogs in each of two guests
scenario B: unpinned, kernbench run on all the guests, no hogs.
Dbench on PLE machine:
dbench run on all the guests simultaneously with
dbench --warmup=30 -t 120 with NRCLIENTS=(8/16/32).
Result for Non PLE machine :
============================
Machine : IBM xSeries with Intel(R) Xeon(R) X5570 2.93GHz CPU with 8 cores, 64GB RAM
Kernbench:
             BASE                 BASE+patch           %improvement
             mean (sd)            mean (sd)
Scenario A:
case 1x:     164.233 (16.5506)    163.584 (15.4598)    0.39517
case 2x:     897.654 (543.993)    328.63 (103.771)     63.3901
case 3x:     2855.73 (2201.41)    315.029 (111.854)    88.9685
Dbench:
Throughput is in MB/sec
NRCLIENTS    BASE                  BASE+patch            %improvement
             mean (sd)             mean (sd)
8            1.774307 (0.061361)   1.725667 (0.034644)   -2.74135
16           1.445967 (0.044805)   1.463173 (0.094399)   1.18993
32           2.136667 (0.105717)   2.193792 (0.129357)   2.67356
Result for PLE machine:
======================
Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU with 32/64 cores, 8 of them
online, and 4*64GB RAM
Kernbench:
             BASE                 BASE+patch           %improvement
             mean (sd)            mean (sd)
Scenario A:
case 1x:     161.263 (56.518)     159.635 (40.5621)    1.00953
case 2x:     190.748 (61.2745)    190.606 (54.4766)    0.0744438
case 3x:     227.378 (100.215)    225.442 (92.0809)    0.851446
Scenario B:
             446.104 (58.54)      433.12733 (54.476)   2.91
Dbench:
Throughput is in MB/sec
NRCLIENTS    BASE                  BASE+patch            %improvement
             mean (sd)             mean (sd)
8            1.101190 (0.875082)   1.700395 (0.846809)   54.4143
16           1.524312 (0.120354)   1.477553 (0.058166)   -3.06755
32           2.143028 (0.157103)   2.090307 (0.136778)   -2.46012
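The %improvement columns match the calculation below, which is checked here against several rows of both tables. Kernbench reports elapsed time, so a lower value is better. Dbench reports throughput, so a higher value is better. This is a small C sketch, and the helper names are hypothetical.

```c
#include <assert.h>

/* Kernbench reports elapsed time: lower is better. */
static double pct_improvement_time(double base, double patched)
{
	return (base - patched) / base * 100.0;
}

/* Dbench reports throughput in MB/sec: higher is better. */
static double pct_improvement_throughput(double base, double patched)
{
	return (patched - base) / base * 100.0;
}

/* Tolerance compare without pulling in libm. */
static int approx_eq(double a, double b, double tol)
{
	double d = a - b;

	return (d < 0 ? -d : d) <= tol;
}
```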
---
V3 kernel Changes:
https://lkml.org/lkml/2011/11/30/62
V2 kernel changes :
https://lkml.org/lkml/2011/10/23/207
Previous discussions : (posted by Srivatsa V).
https://lkml.org/lkml/2010/7/26/24
https://lkml.org/lkml/2011/1/19/212
Qemu patch for V3:
http://lists.gnu.org/archive/html/qemu-devel/2011-12/msg00397.html
Documentation/virtual/kvm/api.txt | 7 +
Documentation/virtual/kvm/cpuid.txt | 4 +
Documentation/virtual/kvm/hypercalls.txt | 54 +++++++
arch/x86/Kconfig | 9 +
arch/x86/include/asm/kvm_para.h | 16 ++-
arch/x86/kernel/kvm.c | 249 ++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 37 ++++-
arch/x86/xen/debugfs.c | 104 -------------
arch/x86/xen/debugfs.h | 4 -
arch/x86/xen/spinlock.c | 2 +-
fs/debugfs/file.c | 128 +++++++++++++++
include/linux/debugfs.h | 11 ++
include/linux/kvm.h | 1 +
include/linux/kvm_host.h | 1 +
include/linux/kvm_para.h | 1 +
15 files changed, 514 insertions(+), 114 deletions(-)
* [PATCH RFC V4 1/5] debugfs: Add support to print u32 array in debugfs
From: Raghavendra K T @ 2012-01-14 18:25 UTC
To: Jeremy Fitzhardinge, Randy Dunlap, linux-doc, KVM,
Konrad Rzeszutek Wilk, Glauber Costa, Jan Kiszka, Rik van Riel,
Dave Jiang, H. Peter Anvin, Thomas Gleixner, Rob Landley, X86,
Gleb Natapov, Avi Kivity, Alexander Graf, Stefano Stabellini,
Paul Mackerras, Sedat Dilek, Ingo Molnar, LKML,
Greg Kroah-Hartman, Virtualization, Marcelo Tosatti
Cc: Peter Zijlstra, Raghavendra K T, Srivatsa Vaddagiri, Dave Hansen,
Suzuki Poulose, Sasha Levin
In-Reply-To: <20120114182501.8604.68416.sendpatchset@oc5400248562.ibm.com>
Add debugfs support to print u32-arrays in debugfs. Move the code from Xen to debugfs
to make the code common for other users as well.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
diff --git a/arch/x86/xen/debugfs.c b/arch/x86/xen/debugfs.c
index 7c0fedd..c8377fb 100644
--- a/arch/x86/xen/debugfs.c
+++ b/arch/x86/xen/debugfs.c
@@ -19,107 +19,3 @@ struct dentry * __init xen_init_debugfs(void)
return d_xen_debug;
}
-struct array_data
-{
- void *array;
- unsigned elements;
-};
-
-static int u32_array_open(struct inode *inode, struct file *file)
-{
- file->private_data = NULL;
- return nonseekable_open(inode, file);
-}
-
-static size_t format_array(char *buf, size_t bufsize, const char *fmt,
- u32 *array, unsigned array_size)
-{
- size_t ret = 0;
- unsigned i;
-
- for(i = 0; i < array_size; i++) {
- size_t len;
-
- len = snprintf(buf, bufsize, fmt, array[i]);
- len++; /* ' ' or '\n' */
- ret += len;
-
- if (buf) {
- buf += len;
- bufsize -= len;
- buf[-1] = (i == array_size-1) ? '\n' : ' ';
- }
- }
-
- ret++; /* \0 */
- if (buf)
- *buf = '\0';
-
- return ret;
-}
-
-static char *format_array_alloc(const char *fmt, u32 *array, unsigned array_size)
-{
- size_t len = format_array(NULL, 0, fmt, array, array_size);
- char *ret;
-
- ret = kmalloc(len, GFP_KERNEL);
- if (ret == NULL)
- return NULL;
-
- format_array(ret, len, fmt, array, array_size);
- return ret;
-}
-
-static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len,
- loff_t *ppos)
-{
- struct inode *inode = file->f_path.dentry->d_inode;
- struct array_data *data = inode->i_private;
- size_t size;
-
- if (*ppos == 0) {
- if (file->private_data) {
- kfree(file->private_data);
- file->private_data = NULL;
- }
-
- file->private_data = format_array_alloc("%u", data->array, data->elements);
- }
-
- size = 0;
- if (file->private_data)
- size = strlen(file->private_data);
-
- return simple_read_from_buffer(buf, len, ppos, file->private_data, size);
-}
-
-static int xen_array_release(struct inode *inode, struct file *file)
-{
- kfree(file->private_data);
-
- return 0;
-}
-
-static const struct file_operations u32_array_fops = {
- .owner = THIS_MODULE,
- .open = u32_array_open,
- .release= xen_array_release,
- .read = u32_array_read,
- .llseek = no_llseek,
-};
-
-struct dentry *xen_debugfs_create_u32_array(const char *name, mode_t mode,
- struct dentry *parent,
- u32 *array, unsigned elements)
-{
- struct array_data *data = kmalloc(sizeof(*data), GFP_KERNEL);
-
- if (data == NULL)
- return NULL;
-
- data->array = array;
- data->elements = elements;
-
- return debugfs_create_file(name, mode, parent, data, &u32_array_fops);
-}
diff --git a/arch/x86/xen/debugfs.h b/arch/x86/xen/debugfs.h
index e281320..12ebf33 100644
--- a/arch/x86/xen/debugfs.h
+++ b/arch/x86/xen/debugfs.h
@@ -3,8 +3,4 @@
struct dentry * __init xen_init_debugfs(void);
-struct dentry *xen_debugfs_create_u32_array(const char *name, mode_t mode,
- struct dentry *parent,
- u32 *array, unsigned elements);
-
#endif /* _XEN_DEBUGFS_H */
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index fc506e6..14a8961 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -286,7 +286,7 @@ static int __init xen_spinlock_debugfs(void)
debugfs_create_u64("time_blocked", 0444, d_spin_debug,
&spinlock_stats.time_blocked);
- xen_debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
+ debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
return 0;
diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 90f7657..df44ccf 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -18,6 +18,7 @@
#include <linux/pagemap.h>
#include <linux/namei.h>
#include <linux/debugfs.h>
+#include <linux/slab.h>
static ssize_t default_read_file(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
@@ -525,3 +526,130 @@ struct dentry *debugfs_create_blob(const char *name, mode_t mode,
return debugfs_create_file(name, mode, parent, blob, &fops_blob);
}
EXPORT_SYMBOL_GPL(debugfs_create_blob);
+
+struct array_data {
+ void *array;
+ u32 elements;
+};
+
+static int u32_array_open(struct inode *inode, struct file *file)
+{
+ file->private_data = NULL;
+ return nonseekable_open(inode, file);
+}
+
+static size_t format_array(char *buf, size_t bufsize, const char *fmt,
+ u32 *array, u32 array_size)
+{
+ size_t ret = 0;
+ u32 i;
+
+ for (i = 0; i < array_size; i++) {
+ size_t len;
+
+ len = snprintf(buf, bufsize, fmt, array[i]);
+ len++; /* ' ' or '\n' */
+ ret += len;
+
+ if (buf) {
+ buf += len;
+ bufsize -= len;
+ buf[-1] = (i == array_size-1) ? '\n' : ' ';
+ }
+ }
+
+ ret++; /* \0 */
+ if (buf)
+ *buf = '\0';
+
+ return ret;
+}
+
+static char *format_array_alloc(const char *fmt, u32 *array,
+ u32 array_size)
+{
+ size_t len = format_array(NULL, 0, fmt, array, array_size);
+ char *ret;
+
+ ret = kmalloc(len, GFP_KERNEL);
+ if (ret == NULL)
+ return NULL;
+
+ format_array(ret, len, fmt, array, array_size);
+ return ret;
+}
+
+static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len,
+ loff_t *ppos)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ struct array_data *data = inode->i_private;
+ size_t size;
+
+ if (*ppos == 0) {
+ if (file->private_data) {
+ kfree(file->private_data);
+ file->private_data = NULL;
+ }
+
+ file->private_data = format_array_alloc("%u", data->array,
+ data->elements);
+ }
+
+ size = 0;
+ if (file->private_data)
+ size = strlen(file->private_data);
+
+ return simple_read_from_buffer(buf, len, ppos,
+ file->private_data, size);
+}
+
+static int u32_array_release(struct inode *inode, struct file *file)
+{
+ kfree(file->private_data);
+
+ return 0;
+}
+
+static const struct file_operations u32_array_fops = {
+ .owner = THIS_MODULE,
+ .open = u32_array_open,
+ .release = u32_array_release,
+ .read = u32_array_read,
+ .llseek = no_llseek,
+};
+
+/**
+ * debugfs_create_u32_array - create a debugfs file that is used to read u32
+ * array.
+ * @name: a pointer to a string containing the name of the file to create.
+ * @mode: the permission that the file should have.
+ * @parent: a pointer to the parent dentry for this file. This should be a
+ * directory dentry if set. If this parameter is %NULL, then the
+ * file will be created in the root of the debugfs filesystem.
+ * @array: u32 array that provides data.
+ * @elements: total number of elements in the array.
+ *
+ * This function creates a file in debugfs with the given name that exports
+ * @array as data. If the @mode variable is so set it can be read from.
+ * Writing is not supported. Seek within the file is also not supported.
+ * Once array is created its size can not be changed.
+ *
+ * The function returns a pointer to dentry on success. If debugfs is not
+ * enabled in the kernel, the value -%ENODEV will be returned.
+ */
+struct dentry *debugfs_create_u32_array(const char *name, mode_t mode,
+ struct dentry *parent,
+ u32 *array, u32 elements)
+{
+ struct array_data *data = kmalloc(sizeof(*data), GFP_KERNEL);
+
+ if (data == NULL)
+ return NULL;
+
+ data->array = array;
+ data->elements = elements;
+
+ return debugfs_create_file(name, mode, parent, data, &u32_array_fops);
+}
+EXPORT_SYMBOL_GPL(debugfs_create_u32_array);
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index e7d9b20..253e2fb 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -74,6 +74,10 @@ struct dentry *debugfs_create_blob(const char *name, mode_t mode,
struct dentry *parent,
struct debugfs_blob_wrapper *blob);
+struct dentry *debugfs_create_u32_array(const char *name, mode_t mode,
+ struct dentry *parent,
+ u32 *array, u32 elements);
+
bool debugfs_initialized(void);
#else
@@ -193,6 +197,13 @@ static inline bool debugfs_initialized(void)
return false;
}
+static inline struct dentry *debugfs_create_u32_array(const char *name, mode_t mode,
+ struct dentry *parent,
+ u32 *array, u32 elements)
+{
+ return ERR_PTR(-ENODEV);
+}
+
#endif
#endif
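format_array() sizes its output in two passes. The first pass is a measuring pass with buf == NULL, which relies on snprintf() returning the length it would have written. The second pass formats into a buffer of exactly that size. Below is a user-space sketch of the same technique. It assumes unsigned int in place of the kernel's u32 and malloc() in place of kmalloc(). It shows the text a reader of the debugfs file gets: values separated by spaces, with a trailing newline.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* With buf == NULL this only measures; otherwise it fills buf. */
static size_t format_array(char *buf, size_t bufsize, const char *fmt,
			   const unsigned int *array, unsigned int array_size)
{
	size_t ret = 0;

	for (unsigned int i = 0; i < array_size; i++) {
		/* snprintf returns the length it would have written */
		size_t len = (size_t)snprintf(buf, bufsize, fmt, array[i]);

		len++; /* room for the ' ' or '\n' separator */
		ret += len;
		if (buf) {
			buf += len;
			bufsize -= len;
			/* overwrite the NUL snprintf wrote with a separator */
			buf[-1] = (i == array_size - 1) ? '\n' : ' ';
		}
	}
	ret++; /* trailing '\0' */
	if (buf)
		*buf = '\0';
	return ret;
}

/* Measure first, then allocate exactly enough and format for real. */
static char *format_array_alloc(const char *fmt, const unsigned int *array,
				unsigned int array_size)
{
	size_t len = format_array(NULL, 0, fmt, array, array_size);
	char *ret = malloc(len);

	if (ret == NULL)
		return NULL;
	format_array(ret, len, fmt, array, array_size);
	return ret;
}
```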
* [PATCH RFC V4 2/5] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
From: Raghavendra K T @ 2012-01-14 18:25 UTC
To: Jeremy Fitzhardinge, Randy Dunlap, linux-doc, KVM,
Konrad Rzeszutek Wilk, Glauber Costa, Jan Kiszka, Rik van Riel,
Dave Jiang, H. Peter Anvin, Thomas Gleixner, X86, Marcelo Tosatti,
Gleb Natapov, Avi Kivity, Alexander Graf, Stefano Stabellini,
Paul Mackerras, Sedat Dilek, Ingo Molnar, LKML,
Greg Kroah-Hartman, Virtualization, Rob Landley
Cc: Peter Zijlstra, Raghavendra K T, Srivatsa Vaddagiri, Dave Hansen,
Suzuki Poulose, Sasha Levin
In-Reply-To: <20120114182501.8604.68416.sendpatchset@oc5400248562.ibm.com>
Add a hypercall to the KVM hypervisor to support pv-ticketlocks.
KVM_HC_KICK_CPU allows the calling vcpu to kick another vcpu out of halt state.
The presence of this hypercall is indicated to the guest via
KVM_FEATURE_PVLOCK_KICK/KVM_CAP_PVLOCK_KICK.
Qemu needs a corresponding patch to pass the presence of this feature up to
the guest via cpuid. The Qemu patch will be sent separately.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 734c376..7a94987 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -16,12 +16,14 @@
#define KVM_FEATURE_CLOCKSOURCE 0
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
+
/* This indicates that the new set of kvmclock msrs
* are available. The use of 0x11 and 0x12 is deprecated
*/
#define KVM_FEATURE_CLOCKSOURCE2 3
#define KVM_FEATURE_ASYNC_PF 4
#define KVM_FEATURE_STEAL_TIME 5
+#define KVM_FEATURE_PVLOCK_KICK 6
/* The last 8 bits are used to indicate how to interpret the flags field
* in pvclock structure. If no bits are set, all flags are ignored.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4c938da..c7b05fc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2099,6 +2099,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_XSAVE:
case KVM_CAP_ASYNC_PF:
case KVM_CAP_GET_TSC_KHZ:
+ case KVM_CAP_PVLOCK_KICK:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
@@ -2576,7 +2577,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
(1 << KVM_FEATURE_NOP_IO_DELAY) |
(1 << KVM_FEATURE_CLOCKSOURCE2) |
(1 << KVM_FEATURE_ASYNC_PF) |
- (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
+ (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+ (1 << KVM_FEATURE_PVLOCK_KICK);
if (sched_info_on())
entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
@@ -5304,6 +5306,29 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
return 1;
}
+/*
+ * kvm_pv_kick_cpu_op: Kick a vcpu.
+ *
+ * @apicid - apicid of vcpu to be kicked.
+ */
+static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
+{
+ struct kvm_vcpu *vcpu, *target = NULL;
+ int i;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (kvm_apic_present(vcpu) &&
+ kvm_apic_match_dest(vcpu, 0, 0, apicid, 0)) {
+ target = vcpu;
+ break;
+ }
+ }
+ if (target) {
+ kvm_make_request(KVM_REQ_PVLOCK_KICK, target);
+ kvm_vcpu_kick(target);
+ }
+}
+
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
unsigned long nr, a0, a1, a2, a3, ret;
@@ -5340,6 +5365,10 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
case KVM_HC_MMU_OP:
r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2), &ret);
break;
+ case KVM_HC_KICK_CPU:
+ kvm_pv_kick_cpu_op(vcpu->kvm, a0);
+ ret = 0;
+ break;
default:
ret = -KVM_ENOSYS;
break;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 68e67e5..63fb6b0 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -558,6 +558,7 @@ struct kvm_ppc_pvinfo {
#define KVM_CAP_PPC_PAPR 68
#define KVM_CAP_S390_GMAP 71
#define KVM_CAP_TSC_DEADLINE_TIMER 72
+#define KVM_CAP_PVLOCK_KICK 73
#ifdef KVM_CAP_IRQ_ROUTING
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d526231..3b1ae7b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -50,6 +50,7 @@
#define KVM_REQ_APF_HALT 12
#define KVM_REQ_STEAL_UPDATE 13
#define KVM_REQ_NMI 14
+#define KVM_REQ_PVLOCK_KICK 15
#define KVM_USERSPACE_IRQ_SOURCE_ID 0
diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index 47a070b..19f10bd 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -19,6 +19,7 @@
#define KVM_HC_MMU_OP 2
#define KVM_HC_FEATURES 3
#define KVM_HC_PPC_MAP_MAGIC_PAGE 4
+#define KVM_HC_KICK_CPU 5
/*
* hypercalls use architecture specific
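On the host side, the patch advertises KVM_FEATURE_PVLOCK_KICK as bit 6 of the KVM feature word reported through CPUID. A guest tests that bit before it uses the hypercall. The sketch below models both sides and rests on two assumptions. First, the term of the do_cpuid_ent() expression that is not shown in the hunk is taken to be KVM_FEATURE_CLOCKSOURCE. Second, KVM_FEATURE_CLOCKSOURCE_STABLE_BIT = 24 comes from the upstream header, not from this patch. host_kvm_features() and has_feature() are hypothetical helpers. As the patch notes, the guest sees the bit only if Qemu passes it through via cpuid. Userspace itself can query KVM_CAP_PVLOCK_KICK with the KVM_CHECK_EXTENSION ioctl.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers as in the patch's asm/kvm_para.h hunk. */
#define KVM_FEATURE_CLOCKSOURCE		0
#define KVM_FEATURE_NOP_IO_DELAY	1
#define KVM_FEATURE_CLOCKSOURCE2	3
#define KVM_FEATURE_ASYNC_PF		4
#define KVM_FEATURE_STEAL_TIME		5
#define KVM_FEATURE_PVLOCK_KICK		6
/* Assumption: value taken from the upstream header, not from this patch. */
#define KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24

/* Host side: simplified model of the EAX value do_cpuid_ent() reports.
 * sched_info stands in for sched_info_on(). */
static uint32_t host_kvm_features(bool sched_info)
{
	uint32_t eax = (1u << KVM_FEATURE_CLOCKSOURCE) |
		       (1u << KVM_FEATURE_NOP_IO_DELAY) |
		       (1u << KVM_FEATURE_CLOCKSOURCE2) |
		       (1u << KVM_FEATURE_ASYNC_PF) |
		       (1u << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
		       (1u << KVM_FEATURE_PVLOCK_KICK);

	if (sched_info)
		eax |= 1u << KVM_FEATURE_STEAL_TIME;
	return eax;
}

/* Guest side: the kind of bit test kvm_para_has_feature() performs. */
static bool has_feature(uint32_t eax, unsigned int bit)
{
	return (eax >> bit) & 1u;
}
```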
* [PATCH RFC V4 3/5] kvm guest : Added configuration support to enable debug information for KVM Guests
From: Raghavendra K T @ 2012-01-14 18:26 UTC
To: Jeremy Fitzhardinge, Randy Dunlap, linux-doc, KVM,
Konrad Rzeszutek Wilk, Glauber Costa, Jan Kiszka, Rik van Riel,
Dave Jiang, H. Peter Anvin, Thomas Gleixner, Rob Landley, X86,
Gleb Natapov, Avi Kivity, Alexander Graf, Stefano Stabellini,
Paul Mackerras, Sedat Dilek, Ingo Molnar, LKML,
Greg Kroah-Hartman, Virtualization, Marcelo Tosatti
Cc: Peter Zijlstra, Raghavendra K T, Srivatsa Vaddagiri, Dave Hansen,
Suzuki Poulose, Sasha Levin
In-Reply-To: <20120114182501.8604.68416.sendpatchset@oc5400248562.ibm.com>
Added configuration support to enable debug information
for KVM Guests in debugfs
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 72e8b64..344a7db 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -565,6 +565,15 @@ config KVM_GUEST
This option enables various optimizations for running under the KVM
hypervisor.
+config KVM_DEBUG_FS
+ bool "Enable debug information for KVM Guests in debugfs"
+ depends on KVM_GUEST && DEBUG_FS
+ default n
+ ---help---
+ This option enables collection of various statistics for KVM guests.
+ Statistics are displayed in debugfs filesystem. Enabling this option
+ may incur significant overhead.
+
source "arch/x86/lguest/Kconfig"
config PARAVIRT
* [PATCH RFC V4 4/5] kvm : pv-ticketlocks support for linux guests running on KVM hypervisor
From: Raghavendra K T @ 2012-01-14 18:26 UTC
To: Jeremy Fitzhardinge, Randy Dunlap, linux-doc, KVM,
Konrad Rzeszutek Wilk, Glauber Costa, Jan Kiszka, Rik van Riel,
Dave Jiang, H. Peter Anvin, Thomas Gleixner, X86, Marcelo Tosatti,
Gleb Natapov, Avi Kivity, Alexander Graf, Stefano Stabellini,
Paul Mackerras, Sedat Dilek, Ingo Molnar, LKML,
Greg Kroah-Hartman, Virtualization, Rob Landley
Cc: Peter Zijlstra, Raghavendra K T, Srivatsa Vaddagiri, Dave Hansen,
Suzuki Poulose, Sasha Levin
In-Reply-To: <20120114182501.8604.68416.sendpatchset@oc5400248562.ibm.com>
Extends Linux guests running on the KVM hypervisor to support pv-ticketlocks.
During smp_boot_cpus, a paravirtualized KVM guest detects whether the hypervisor
has the required feature (KVM_FEATURE_PVLOCK_KICK) to support pv-ticketlocks. If so,
support for pv-ticketlocks is registered via pv_lock_ops.
The KVM_HC_KICK_CPU hypercall is used to wake up a waiting/halted vcpu.
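The registration step follows a common pattern. The default (native spinning) stays in place unless the feature is detected, and only then are function pointers installed. The sketch below uses stand-in types: sim_pv_lock_ops, sim_spinlock_init and the other sim_* names are hypothetical, not kernel APIs. It mirrors the order of checks in kvm_spinlock_init(): paravirt available, then feature bit present, then enable and install the hooks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for arch_spinlock and pv_lock_ops. */
struct sim_spinlock { unsigned short head, tail; };

struct sim_pv_lock_ops {
	void (*lock_spinning)(struct sim_spinlock *lock, unsigned short want);
	void (*unlock_kick)(struct sim_spinlock *lock, unsigned short ticket);
};

/* Default: no paravirt hooks, so locks keep spinning natively. */
static struct sim_pv_lock_ops sim_ops = { NULL, NULL };
static bool sim_ticketlocks_enabled;

static void sim_kvm_lock_spinning(struct sim_spinlock *lock, unsigned short want)
{
	(void)lock; (void)want; /* real code: record (lock, want), then halt() */
}

static void sim_kvm_unlock_kick(struct sim_spinlock *lock, unsigned short ticket)
{
	(void)lock; (void)ticket; /* real code: find the waiter, KVM_HC_KICK_CPU */
}

/* Same shape as kvm_spinlock_init(): bail out unless running on KVM and
 * the host advertises the kick feature, then install the hooks. */
static void sim_spinlock_init(bool para_available, bool has_pvlock_kick)
{
	if (!para_available)
		return;
	if (!has_pvlock_kick)
		return;
	sim_ticketlocks_enabled = true;
	sim_ops.lock_spinning = sim_kvm_lock_spinning;
	sim_ops.unlock_kick = sim_kvm_unlock_kick;
}
```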
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
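The debugfs statistics in this patch rely on two small techniques. The first is a compare-and-swap on zero_stats, so that only one CPU performs a requested reset. The second is a power-of-two histogram of blocked time indexed by ilog2(). The user-space sketch below uses C11 atomics in place of cmpxchg(). It also uses a hypothetical ilog2_u64() helper that maps a zero delta to bucket 0; that choice is an assumption, since the patch does not special-case zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define HISTO_BUCKETS 30

struct blocked_stats {
	uint32_t histo[HISTO_BUCKETS + 1];
	uint64_t time_blocked;
};

static struct blocked_stats stats;
static _Atomic uint8_t zero_stats; /* set to 1 (via debugfs) to request a reset */

/* Like check_zero(): only the caller whose compare-and-swap succeeds clears. */
static void check_zero(void)
{
	uint8_t old = atomic_load(&zero_stats);

	if (old && atomic_compare_exchange_strong(&zero_stats, &old, 0))
		memset(&stats, 0, sizeof(stats));
}

/* floor(log2(v)); returns 0 for v == 0 (sketch assumption). */
static unsigned int ilog2_u64(uint64_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Like spin_time_accum_blocked(): bucket by power of two, folding
 * anything >= 2^HISTO_BUCKETS into the last bucket. */
static void account_blocked(uint64_t delta)
{
	unsigned int index = ilog2_u64(delta);

	check_zero();
	stats.histo[index < HISTO_BUCKETS ? index : HISTO_BUCKETS]++;
	stats.time_blocked += delta;
}
```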
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 7a94987..cf5327c 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -195,10 +195,20 @@ void kvm_async_pf_task_wait(u32 token);
void kvm_async_pf_task_wake(u32 token);
u32 kvm_read_and_reset_pf_reason(void);
extern void kvm_disable_steal_time(void);
-#else
-#define kvm_guest_init() do { } while (0)
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init kvm_spinlock_init(void);
+#else /* CONFIG_PARAVIRT_SPINLOCKS */
+static inline void kvm_spinlock_init(void)
+{
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#else /* CONFIG_KVM_GUEST */
+#define kvm_guest_init() do {} while (0)
#define kvm_async_pf_task_wait(T) do {} while(0)
#define kvm_async_pf_task_wake(T) do {} while(0)
+
static inline u32 kvm_read_and_reset_pf_reason(void)
{
return 0;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a9c2116..ec55a0b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -33,6 +33,7 @@
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/kprobes.h>
+#include <linux/debugfs.h>
#include <asm/timer.h>
#include <asm/cpu.h>
#include <asm/traps.h>
@@ -545,6 +546,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
#endif
kvm_guest_cpu_init();
native_smp_prepare_boot_cpu();
+ kvm_spinlock_init();
}
static void __cpuinit kvm_guest_cpu_online(void *dummy)
@@ -627,3 +629,250 @@ static __init int activate_jump_labels(void)
return 0;
}
arch_initcall(activate_jump_labels);
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+enum kvm_contention_stat {
+ TAKEN_SLOW,
+ TAKEN_SLOW_PICKUP,
+ RELEASED_SLOW,
+ RELEASED_SLOW_KICKED,
+ NR_CONTENTION_STATS
+};
+
+#ifdef CONFIG_KVM_DEBUG_FS
+
+static struct kvm_spinlock_stats
+{
+ u32 contention_stats[NR_CONTENTION_STATS];
+
+#define HISTO_BUCKETS 30
+ u32 histo_spin_blocked[HISTO_BUCKETS+1];
+
+ u64 time_blocked;
+} spinlock_stats;
+
+static u8 zero_stats;
+
+static inline void check_zero(void)
+{
+ u8 ret;
+ u8 old = ACCESS_ONCE(zero_stats);
+ if (unlikely(old)) {
+ ret = cmpxchg(&zero_stats, old, 0);
+ /* This ensures only one fellow resets the stat */
+ if (ret == old)
+ memset(&spinlock_stats, 0, sizeof(spinlock_stats));
+ }
+}
+
+static inline void add_stats(enum kvm_contention_stat var, u32 val)
+{
+ check_zero();
+ spinlock_stats.contention_stats[var] += val;
+}
+
+
+static inline u64 spin_time_start(void)
+{
+ return sched_clock();
+}
+
+static void __spin_time_accum(u64 delta, u32 *array)
+{
+ unsigned index = ilog2(delta);
+
+ check_zero();
+
+ if (index < HISTO_BUCKETS)
+ array[index]++;
+ else
+ array[HISTO_BUCKETS]++;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+ u32 delta = sched_clock() - start;
+
+ __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
+ spinlock_stats.time_blocked += delta;
+}
+
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+
+struct dentry *kvm_init_debugfs(void)
+{
+ d_kvm_debug = debugfs_create_dir("kvm", NULL);
+ if (!d_kvm_debug)
+ printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
+
+ return d_kvm_debug;
+}
+
+static int __init kvm_spinlock_debugfs(void)
+{
+ struct dentry *d_kvm = kvm_init_debugfs();
+
+ if (d_kvm == NULL)
+ return -ENOMEM;
+
+ d_spin_debug = debugfs_create_dir("spinlocks", d_kvm);
+
+ debugfs_create_u8("zero_stats", 0644, d_spin_debug, &zero_stats);
+
+ debugfs_create_u32("taken_slow", 0444, d_spin_debug,
+ &spinlock_stats.contention_stats[TAKEN_SLOW]);
+ debugfs_create_u32("taken_slow_pickup", 0444, d_spin_debug,
+ &spinlock_stats.contention_stats[TAKEN_SLOW_PICKUP]);
+
+ debugfs_create_u32("released_slow", 0444, d_spin_debug,
+ &spinlock_stats.contention_stats[RELEASED_SLOW]);
+ debugfs_create_u32("released_slow_kicked", 0444, d_spin_debug,
+ &spinlock_stats.contention_stats[RELEASED_SLOW_KICKED]);
+
+ debugfs_create_u64("time_blocked", 0444, d_spin_debug,
+ &spinlock_stats.time_blocked);
+
+ debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
+ spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
+
+ return 0;
+}
+fs_initcall(kvm_spinlock_debugfs);
+#else /* !CONFIG_KVM_DEBUG_FS */
+#define TIMEOUT (1 << 10)
+static inline void add_stats(enum kvm_contention_stat var, u32 val)
+{
+}
+
+static inline u64 spin_time_start(void)
+{
+ return 0;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+}
+#endif /* CONFIG_KVM_DEBUG_FS */
+
+struct kvm_lock_waiting {
+ struct arch_spinlock *lock;
+ __ticket_t want;
+};
+
+/* cpus 'waiting' on a spinlock to become available */
+static cpumask_t waiting_cpus;
+
+/* Track spinlock on which a cpu is waiting */
+static DEFINE_PER_CPU(struct kvm_lock_waiting, lock_waiting);
+
+static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
+{
+ struct kvm_lock_waiting *w = &__get_cpu_var(lock_waiting);
+ int cpu = smp_processor_id();
+ u64 start;
+ unsigned long flags;
+
+ start = spin_time_start();
+
+ /*
+ * Make sure an interrupt handler can't upset things in a
+ * partially setup state.
+ */
+ local_irq_save(flags);
+
+ /*
+ * The ordering protocol on this is that the "lock" pointer
+ * may only be set non-NULL if the "want" ticket is correct.
+ * If we're updating "want", we must first clear "lock".
+ */
+ w->lock = NULL;
+ smp_wmb();
+ w->want = want;
+ smp_wmb();
+ w->lock = lock;
+
+ add_stats(TAKEN_SLOW, 1);
+
+ /*
+ * This uses set_bit(), which is atomic, but we should not rely on its
+ * ordering guarantees, so a barrier is needed after this call.
+ */
+ cpumask_set_cpu(cpu, &waiting_cpus);
+
+ barrier();
+
+ /*
+ * Mark entry to slowpath before doing the pickup test to make
+ * sure we don't deadlock with an unlocker.
+ */
+ __ticket_enter_slowpath(lock);
+
+ /*
+ * Check again to make sure it didn't become free while
+ * we weren't looking.
+ */
+ if (ACCESS_ONCE(lock->tickets.head) == want) {
+ add_stats(TAKEN_SLOW_PICKUP, 1);
+ goto out;
+ }
+
+ /* Allow interrupts while blocked */
+ local_irq_restore(flags);
+
+ /* halt until it's our turn and kicked. */
+ halt();
+
+ local_irq_save(flags);
+out:
+ cpumask_clear_cpu(cpu, &waiting_cpus);
+ w->lock = NULL;
+ local_irq_restore(flags);
+ spin_time_accum_blocked(start);
+}
+PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
+
+/* Kick a cpu by its apicid*/
+static inline void kvm_kick_cpu(int apicid)
+{
+ kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
+}
+
+/* Kick vcpu waiting on @lock->head to reach value @ticket */
+static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
+{
+ int cpu;
+ int apicid;
+
+ add_stats(RELEASED_SLOW, 1);
+
+ for_each_cpu(cpu, &waiting_cpus) {
+ const struct kvm_lock_waiting *w = &per_cpu(lock_waiting, cpu);
+ if (ACCESS_ONCE(w->lock) == lock &&
+ ACCESS_ONCE(w->want) == ticket) {
+ add_stats(RELEASED_SLOW_KICKED, 1);
+ apicid = per_cpu(x86_cpu_to_apicid, cpu);
+ kvm_kick_cpu(apicid);
+ break;
+ }
+ }
+}
+
+/*
+ * Set up pv_lock_ops to exploit KVM_FEATURE_PVLOCK_KICK if present.
+ */
+void __init kvm_spinlock_init(void)
+{
+ if (!kvm_para_available())
+ return;
+ /* Does host kernel support KVM_FEATURE_PVLOCK_KICK? */
+ if (!kvm_para_has_feature(KVM_FEATURE_PVLOCK_KICK))
+ return;
+
+ jump_label_inc(&paravirt_ticketlocks_enabled);
+
+ pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
+ pv_lock_ops.unlock_kick = kvm_unlock_kick;
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c7b05fc..4d7a950 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5754,8 +5754,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
local_irq_disable();
- if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
- || need_resched() || signal_pending(current)) {
+ if (vcpu->mode == EXITING_GUEST_MODE
+ || (vcpu->requests & ~(1UL<<KVM_REQ_PVLOCK_KICK))
+ || need_resched() || signal_pending(current)) {
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
local_irq_enable();
@@ -6711,6 +6712,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
!vcpu->arch.apf.halted)
|| !list_empty_careful(&vcpu->async_pf.done)
|| vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
+ || kvm_check_request(KVM_REQ_PVLOCK_KICK, vcpu)
|| atomic_read(&vcpu->arch.nmi_queued) ||
(kvm_arch_interrupt_allowed(vcpu) &&
kvm_cpu_has_interrupt(vcpu));
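A minimal userspace model of the request-bit handling in the two x86.c hunks above may help when reviewing them. It is a sketch, not KVM code: struct toy_vcpu, make_request(), check_request(), entry_must_bail() and runnable_due_to_kick() are hypothetical names, and only the request bitmap is modeled. It shows that a pending KVM_REQ_PVLOCK_KICK on its own no longer aborts guest entry, while the runnable check consumes the bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Request numbers as defined in the kvm_host.h hunk of this series. */
#define KVM_REQ_NMI         14
#define KVM_REQ_PVLOCK_KICK 15

/* Hypothetical stand-in for struct kvm_vcpu: only the request bitmap. */
struct toy_vcpu {
	unsigned long requests;
};

/* Like kvm_make_request(): set the request bit. */
static void make_request(int req, struct toy_vcpu *v)
{
	assert(req >= 0 && req < (int)(8 * sizeof(v->requests)));
	v->requests |= 1UL << req;
}

/* Like kvm_check_request(): test the bit and clear it if it was set. */
static bool check_request(int req, struct toy_vcpu *v)
{
	if (v->requests & (1UL << req)) {
		v->requests &= ~(1UL << req);
		return true;
	}
	return false;
}

/* Patched vcpu_enter_guest() test: any request other than the PV kick
 * makes entry bail out; a lone PV kick does not. */
static bool entry_must_bail(const struct toy_vcpu *v)
{
	return (v->requests & ~(1UL << KVM_REQ_PVLOCK_KICK)) != 0;
}

/* Term added to kvm_arch_vcpu_runnable(): consumes the PV kick. */
static bool runnable_due_to_kick(struct toy_vcpu *v)
{
	return check_request(KVM_REQ_PVLOCK_KICK, v);
}
```

In this model, as in the hunk, the runnable predicate has a side effect (it clears the request), so a second call returns false.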
* [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Raghavendra K T @ 2012-01-14 18:27 UTC
To: Jeremy Fitzhardinge, Randy Dunlap, linux-doc, KVM,
Konrad Rzeszutek Wilk, Glauber Costa, Jan Kiszka, Rik van Riel,
Dave Jiang, H. Peter Anvin, Thomas Gleixner, Rob Landley, X86,
Gleb Natapov, Avi Kivity, Alexander Graf, Stefano Stabellini,
Paul Mackerras, Sedat Dilek, Ingo Molnar, LKML,
Greg Kroah-Hartman, Virtualization, Marcelo Tosatti
Cc: Peter Zijlstra, Raghavendra K T, Srivatsa Vaddagiri, Dave Hansen,
Suzuki Poulose, Sasha Levin
In-Reply-To: <20120114182501.8604.68416.sendpatchset@oc5400248562.ibm.com>
Add documentation on CPUID, KVM_CAP_PVLOCK_KICK, and the supported hypercalls.
The KVM_HC_KICK_CPU hypercall is added to wake up a halted vcpu in a
guest with paravirtual spinlocks enabled.
KVM_FEATURE_PVLOCK_KICK lets the guest check whether pv spinlocks can
be enabled. Host support is queried via
ioctl(KVM_CHECK_EXTENSION, KVM_CAP_PVLOCK_KICK)
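As a sketch of the host-side query described above, assuming a Linux host where /dev/kvm can be opened: the KVMIO and KVM_CHECK_EXTENSION fallbacks mirror <linux/kvm.h>, while the value 73 for KVM_CAP_PVLOCK_KICK is taken from this RFC and may name a different capability in other kernels, so treat the result as illustrative. The helper name host_has_pvlock_kick() is hypothetical.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Fallbacks mirroring <linux/kvm.h>, so the sketch builds without it. */
#ifndef KVMIO
#define KVMIO 0xAE
#endif
#ifndef KVM_CHECK_EXTENSION
#define KVM_CHECK_EXTENSION _IO(KVMIO, 0x03)
#endif
/* Capability number proposed by this RFC (assumption: other kernels
 * may assign this number to a different capability). */
#ifndef KVM_CAP_PVLOCK_KICK
#define KVM_CAP_PVLOCK_KICK 73
#endif

/* Hypothetical helper: 1 = supported, 0 = not supported,
 * -1 = /dev/kvm could not be opened. */
static int host_has_pvlock_kick(void)
{
	int fd = open("/dev/kvm", O_RDWR);
	int r;

	if (fd < 0)
		return -1;
	/* KVM_CHECK_EXTENSION returns 0 if unsupported, > 0 if supported. */
	r = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_PVLOCK_KICK);
	close(fd);
	return r > 0 ? 1 : 0;
}
```

KVM_CHECK_EXTENSION is issued by the VMM (e.g. Qemu) against /dev/kvm; the guest itself learns about the feature through the KVM_FEATURE_PVLOCK_KICK CPUID bit.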
Minimal documentation and a template are added for hypercalls.
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index e2a4b52..1583bc7 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1109,6 +1109,13 @@ support. Instead it is reported via
if that returns true and you use KVM_CREATE_IRQCHIP, or if you emulate the
feature in userspace, then you can enable the feature for KVM_SET_CPUID2.
+Paravirtualized ticket spinlocks can be enabled in the guest once host
+support has been confirmed via
+
+ ioctl(KVM_CHECK_EXTENSION, KVM_CAP_PVLOCK_KICK)
+
+If this call returns true, the guest can use the feature.
+
4.47 KVM_PPC_GET_PVINFO
Capability: KVM_CAP_PPC_GET_PVINFO
diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
index 8820685..c7fc0da 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -39,6 +39,10 @@ KVM_FEATURE_CLOCKSOURCE2 || 3 || kvmclock available at msrs
KVM_FEATURE_ASYNC_PF || 4 || async pf can be enabled by
|| || writing to msr 0x4b564d02
------------------------------------------------------------------------------
+KVM_FEATURE_PVLOCK_KICK || 6 || guest checks this feature bit
+ || || before enabling paravirtualized
+ || || spinlock support.
+------------------------------------------------------------------------------
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side
|| || per-cpu warps are expected in
|| || kvmclock.
diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt
new file mode 100644
index 0000000..7872da5
--- /dev/null
+++ b/Documentation/virtual/kvm/hypercalls.txt
@@ -0,0 +1,54 @@
+KVM Hypercalls Documentation
+============================
+Documentation template:
+The documentation for each hypercall should include:
+1. Hypercall name, value.
+2. Architecture(s)
+3. Purpose
+
+
+1. KVM_HC_VAPIC_POLL_IRQ
+------------------------
+value: 1
+Architecture: x86
+Purpose:
+
+2. KVM_HC_MMU_OP
+------------------------
+value: 2
+Architecture: x86
+Purpose: Support MMU operations such as writing to a PTE, flushing the TLB,
+and releasing page tables.
+
+3. KVM_HC_FEATURES
+------------------------
+value: 3
+Architecture: PPC
+Purpose:
+
+4. KVM_HC_PPC_MAP_MAGIC_PAGE
+------------------------
+value: 4
+Architecture: PPC
+Purpose: To enable communication between the hypervisor and guest, there is a
+new shared page that contains parts of supervisor visible register state.
+The guest can map this shared page using this hypercall.
+
+5. KVM_HC_KICK_CPU
+------------------------
+value: 5
+Architecture: x86
+Purpose: Hypercall used to wake up a vcpu from HLT state
+
+Usage example: A vcpu of a paravirtualized guest that is busy-waiting in guest
+kernel mode for an event to occur (e.g. a spinlock to become available)
+can execute the HLT instruction once it has busy-waited for more than a
+threshold time interval. Execution of the HLT instruction causes
+the hypervisor to put the vcpu to sleep (unless yield_on_hlt=0) until an
+appropriate event occurs. Another vcpu of the same guest can wake up the sleeping
+vcpu by issuing the KVM_HC_KICK_CPU hypercall, specifying the APIC ID of the vcpu
+to be woken up.
+
+TODO:
+1. more information on input and output needed?
+2. Add more detail to purpose of hypercalls.
* Call for Workshops at IEEE eScience, due January 23, 2012
From: Ioan Raicu @ 2012-01-15 3:57 UTC
To: virtualization
Call for Workshops
8th IEEE International Conference on eScience
October 8-12, 2012
Chicago, IL, USA
The 8th IEEE eScience conference (e-Science 2012), sponsored by the IEEE
Computer Society's Technical Committee for Scalable Computing (TCSC),
will be held in Chicago, Illinois, from 8 to 12 October 2012. The eScience
2012 conference is designed to bring together leading international and
interdisciplinary research communities, developers, and users of
eScience applications and enabling IT technologies.
Multiple e-Science 2012 Workshops will be held on Monday and Tuesday,
8th and 9th October, co-located with the main conference.
Workshops are an important part of the conference, providing an
opportunity for researchers to present their work in a more focused way
than the main conference and to discuss particular topics of interest
to the community. We cordially invite you to submit workshop
proposals on any eScience-related topic to the Workshop Chair.
To help prospective participants understand their purpose and scope,
workshop proposals should include:
* A description of the workshop, its focus, goals, and outcome
* A draft call for papers
* Names and affiliations of the organizers and tentative composition
of the committees
* Expected numbers of submissions and accepted papers
* Prior history of this workshop, if any. Please include: number of
submissions, number of accepted papers, and attendee count.
Workshop organizers are responsible for establishing a program
committee, collecting and evaluating submissions, notifying authors of
acceptance or rejection in due time, ensuring a transparent and fair
selection process, organizing selected papers into sessions, and
assigning session chairs. Proposals will be selected that show a clear
focus and objectives in areas of emerging or developing interest that
are likely to generate significant interest in the community.
Once accepted, the workshop should establish its own paper submission
system. For each paper selected for publication, an author must be
registered for eScience 2012. Each paper must be presented in person by
at least one of the authors. It is expected that the proceedings of the
eScience 2012 workshops will be published by the IEEE Computer Society
Press, USA and will be made available online through the IEEE Digital
Library.
SUBMISSION PROCESS
Workshop proposals should be emailed to escience2012-workshops@fnal.gov
<mailto:escience2012-workshops@fnal.gov?subject=Workshop%20Submission%20for%20the%208th%20IEEE%20International%20Conference%20on%20eScience>
ORGANIZATION
General Chair
* *Ian Foster*, University of Chicago & Argonne National Laboratory, USA
Program Co-Chairs
* *Daniel S. Katz*, University of Chicago & Argonne National
Laboratory, USA
* *Heinz Stockinger*, SIB Swiss Institute of Bioinformatics, Switzerland
Workshops Chair
* *Ruth Pordes*, FNAL, USA
Sponsorship Chair
* *Charlie Catlett*, Argonne National Laboratory, USA
Conference Manager and Finance Chair
* *Julie Wulf-Knoerzer*, University of Chicago & Argonne National
Laboratory, USA
Publicity Chairs
* *Kento Aida*, National Institute of Informatics, Japan
* *Ioan Raicu*, Illinois Institute of Technology, USA
* *David Wallom*, Oxford e-Research Centre, UK
Local Organizing Committee
* *Ninfa Mayorga*, University of Chicago, USA
* *Evelyn Rayburn*, University of Chicago, USA
* *Lynn Valentini*, Argonne National Laboratory, USA
--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu@cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
* Re: [PATCH] vhost-net: add module alias (v2.1)
From: Michael S. Tsirkin @ 2012-01-15 12:42 UTC
To: David Miller
Cc: kvm, netdev, kay.sievers, virtualization, shemminger, device,
Alan Cox
In-Reply-To: <20120112.200701.1473475851890804136.davem@davemloft.net>
On Thu, Jan 12, 2012 at 08:07:01PM -0800, David Miller wrote:
> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Wed, 11 Jan 2012 21:30:38 -0800
>
> > Subject: vhost-net: add module alias (v2.1)
> >
> > By adding some module aliases, programs (or users) won't have to explicitly
> > call modprobe. Vhost-net will always be available if built into the kernel.
> > It does require assigning a permanent minor number for depmod to work.
> >
> > Also:
> > - use C99 style initialization.
> > - add missing entry in documentation for loop-control
> >
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>
> ACKs, NACKs? What is happening here?
I would like an Ack from Alan Cox, who switched vhost-net
to a dynamic minor in the first place, in commit
79907d89c397b8bc2e05b347ec94e928ea919d33.
--
MST
* Re: [PATCH RFC V4 4/5] kvm : pv-ticketlocks support for linux guests running on KVM hypervisor
From: Alexander Graf @ 2012-01-16 3:12 UTC
To: Raghavendra K T
Cc: Jeremy Fitzhardinge, Greg Kroah-Hartman, linux-doc,
Peter Zijlstra, Jan Kiszka, Virtualization, Paul Mackerras,
H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
Konrad Rzeszutek Wilk, Srivatsa Vaddagiri, Sasha Levin,
Sedat Dilek, Thomas Gleixner, LKML, Dave Hansen
In-Reply-To: <20120114182645.8604.68884.sendpatchset@oc5400248562.ibm.com>
On 14.01.2012, at 19:26, Raghavendra K T wrote:
> Extends Linux guest running on KVM hypervisor to support pv-ticketlocks.
>
> During smp_boot_cpus, a paravirtualized KVM guest detects whether the hypervisor has
> the required feature (KVM_FEATURE_PVLOCK_KICK) to support pv-ticketlocks. If so,
> support for pv-ticketlocks is registered via pv_lock_ops.
>
> Use KVM_HC_KICK_CPU hypercall to wakeup waiting/halted vcpu.
>
> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 7a94987..cf5327c 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -195,10 +195,20 @@ void kvm_async_pf_task_wait(u32 token);
> void kvm_async_pf_task_wake(u32 token);
> u32 kvm_read_and_reset_pf_reason(void);
> extern void kvm_disable_steal_time(void);
> -#else
> -#define kvm_guest_init() do { } while (0)
> +
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +void __init kvm_spinlock_init(void);
> +#else /* CONFIG_PARAVIRT_SPINLOCKS */
> +static inline void kvm_spinlock_init(void)
> +{
> +}
> +#endif /* CONFIG_PARAVIRT_SPINLOCKS */
> +
> +#else /* CONFIG_KVM_GUEST */
> +#define kvm_guest_init() do {} while (0)
> #define kvm_async_pf_task_wait(T) do {} while(0)
> #define kvm_async_pf_task_wake(T) do {} while(0)
> +
> static inline u32 kvm_read_and_reset_pf_reason(void)
> {
> return 0;
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index a9c2116..ec55a0b 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -33,6 +33,7 @@
> #include <linux/sched.h>
> #include <linux/slab.h>
> #include <linux/kprobes.h>
> +#include <linux/debugfs.h>
> #include <asm/timer.h>
> #include <asm/cpu.h>
> #include <asm/traps.h>
> @@ -545,6 +546,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
> #endif
> kvm_guest_cpu_init();
> native_smp_prepare_boot_cpu();
> + kvm_spinlock_init();
> }
>
> static void __cpuinit kvm_guest_cpu_online(void *dummy)
> @@ -627,3 +629,250 @@ static __init int activate_jump_labels(void)
> return 0;
> }
> arch_initcall(activate_jump_labels);
> +
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +
> +enum kvm_contention_stat {
> + TAKEN_SLOW,
> + TAKEN_SLOW_PICKUP,
> + RELEASED_SLOW,
> + RELEASED_SLOW_KICKED,
> + NR_CONTENTION_STATS
> +};
> +
> +#ifdef CONFIG_KVM_DEBUG_FS
> +
> +static struct kvm_spinlock_stats
> +{
> + u32 contention_stats[NR_CONTENTION_STATS];
> +
> +#define HISTO_BUCKETS 30
> + u32 histo_spin_blocked[HISTO_BUCKETS+1];
> +
> + u64 time_blocked;
> +} spinlock_stats;
> +
> +static u8 zero_stats;
> +
> +static inline void check_zero(void)
> +{
> + u8 ret;
> + u8 old = ACCESS_ONCE(zero_stats);
> + if (unlikely(old)) {
> + ret = cmpxchg(&zero_stats, old, 0);
> + /* This ensures only one fellow resets the stat */
> + if (ret == old)
> + memset(&spinlock_stats, 0, sizeof(spinlock_stats));
> + }
> +}
> +
> +static inline void add_stats(enum kvm_contention_stat var, u32 val)
> +{
> + check_zero();
> + spinlock_stats.contention_stats[var] += val;
> +}
> +
> +
> +static inline u64 spin_time_start(void)
> +{
> + return sched_clock();
> +}
> +
> +static void __spin_time_accum(u64 delta, u32 *array)
> +{
> + unsigned index = ilog2(delta);
> +
> + check_zero();
> +
> + if (index < HISTO_BUCKETS)
> + array[index]++;
> + else
> + array[HISTO_BUCKETS]++;
> +}
> +
> +static inline void spin_time_accum_blocked(u64 start)
> +{
> + u32 delta = sched_clock() - start;
> +
> + __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
> + spinlock_stats.time_blocked += delta;
> +}
> +
> +static struct dentry *d_spin_debug;
> +static struct dentry *d_kvm_debug;
> +
> +struct dentry *kvm_init_debugfs(void)
> +{
> + d_kvm_debug = debugfs_create_dir("kvm", NULL);
> + if (!d_kvm_debug)
> + printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
> +
> + return d_kvm_debug;
> +}
> +
> +static int __init kvm_spinlock_debugfs(void)
> +{
> + struct dentry *d_kvm = kvm_init_debugfs();
> +
> + if (d_kvm == NULL)
> + return -ENOMEM;
> +
> + d_spin_debug = debugfs_create_dir("spinlocks", d_kvm);
> +
> + debugfs_create_u8("zero_stats", 0644, d_spin_debug, &zero_stats);
> +
> + debugfs_create_u32("taken_slow", 0444, d_spin_debug,
> + &spinlock_stats.contention_stats[TAKEN_SLOW]);
> + debugfs_create_u32("taken_slow_pickup", 0444, d_spin_debug,
> + &spinlock_stats.contention_stats[TAKEN_SLOW_PICKUP]);
> +
> + debugfs_create_u32("released_slow", 0444, d_spin_debug,
> + &spinlock_stats.contention_stats[RELEASED_SLOW]);
> + debugfs_create_u32("released_slow_kicked", 0444, d_spin_debug,
> + &spinlock_stats.contention_stats[RELEASED_SLOW_KICKED]);
> +
> + debugfs_create_u64("time_blocked", 0444, d_spin_debug,
> + &spinlock_stats.time_blocked);
> +
> + debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
> + spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
> +
> + return 0;
> +}
> +fs_initcall(kvm_spinlock_debugfs);
> +#else /* !CONFIG_KVM_DEBUG_FS */
> +#define TIMEOUT (1 << 10)
> +static inline void add_stats(enum kvm_contention_stat var, u32 val)
> +{
> +}
> +
> +static inline u64 spin_time_start(void)
> +{
> + return 0;
> +}
> +
> +static inline void spin_time_accum_blocked(u64 start)
> +{
> +}
> +#endif /* CONFIG_KVM_DEBUG_FS */
> +
> +struct kvm_lock_waiting {
> + struct arch_spinlock *lock;
> + __ticket_t want;
> +};
> +
> +/* cpus 'waiting' on a spinlock to become available */
> +static cpumask_t waiting_cpus;
> +
> +/* Track spinlock on which a cpu is waiting */
> +static DEFINE_PER_CPU(struct kvm_lock_waiting, lock_waiting);
> +
> +static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
> +{
> + struct kvm_lock_waiting *w = &__get_cpu_var(lock_waiting);
> + int cpu = smp_processor_id();
> + u64 start;
> + unsigned long flags;
> +
> + start = spin_time_start();
> +
> + /*
> + * Make sure an interrupt handler can't upset things in a
> + * partially setup state.
> + */
> + local_irq_save(flags);
> +
> + /*
> + * The ordering protocol on this is that the "lock" pointer
> + * may only be set non-NULL if the "want" ticket is correct.
> + * If we're updating "want", we must first clear "lock".
> + */
> + w->lock = NULL;
> + smp_wmb();
> + w->want = want;
> + smp_wmb();
> + w->lock = lock;
> +
> + add_stats(TAKEN_SLOW, 1);
> +
> + /*
> + * This uses set_bit, which is atomic but we should not rely on its
> + * reordering guarantees. So barrier is needed after this call.
> + */
> + cpumask_set_cpu(cpu, &waiting_cpus);
> +
> + barrier();
> +
> + /*
> + * Mark entry to slowpath before doing the pickup test to make
> + * sure we don't deadlock with an unlocker.
> + */
> + __ticket_enter_slowpath(lock);
> +
> + /*
> + * check again make sure it didn't become free while
> + * we weren't looking.
> + */
> + if (ACCESS_ONCE(lock->tickets.head) == want) {
> + add_stats(TAKEN_SLOW_PICKUP, 1);
> + goto out;
> + }
> +
> + /* Allow interrupts while blocked */
> + local_irq_restore(flags);
> +
> + /* halt until it's our turn and kicked. */
> + halt();
> +
> + local_irq_save(flags);
> +out:
> + cpumask_clear_cpu(cpu, &waiting_cpus);
> + w->lock = NULL;
> + local_irq_restore(flags);
> + spin_time_accum_blocked(start);
> +}
> +PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
> +
> +/* Kick a cpu by its apicid*/
> +static inline void kvm_kick_cpu(int apicid)
> +{
> + kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
> +}
> +
> +/* Kick vcpu waiting on @lock->head to reach value @ticket */
> +static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
> +{
> + int cpu;
> + int apicid;
> +
> + add_stats(RELEASED_SLOW, 1);
> +
> + for_each_cpu(cpu, &waiting_cpus) {
> + const struct kvm_lock_waiting *w = &per_cpu(lock_waiting, cpu);
> + if (ACCESS_ONCE(w->lock) == lock &&
> + ACCESS_ONCE(w->want) == ticket) {
> + add_stats(RELEASED_SLOW_KICKED, 1);
> + apicid = per_cpu(x86_cpu_to_apicid, cpu);
> + kvm_kick_cpu(apicid);
> + break;
> + }
> + }
> +}
> +
> +/*
> + * Setup pv_lock_ops to exploit KVM_FEATURE_PVLOCK_KICK if present.
> + */
> +void __init kvm_spinlock_init(void)
> +{
> + if (!kvm_para_available())
> + return;
> + /* Does host kernel support KVM_FEATURE_PVLOCK_KICK? */
> + if (!kvm_para_has_feature(KVM_FEATURE_PVLOCK_KICK))
> + return;
> +
> + jump_label_inc(&paravirt_ticketlocks_enabled);
> +
> + pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
> + pv_lock_ops.unlock_kick = kvm_unlock_kick;
> +}
> +#endif /* CONFIG_PARAVIRT_SPINLOCKS */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c7b05fc..4d7a950 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
This patch is mixing host and guest code. Please split those up.
Alex
> @@ -5754,8 +5754,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>
> local_irq_disable();
>
> - if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
> - || need_resched() || signal_pending(current)) {
> + if (vcpu->mode == EXITING_GUEST_MODE
> + || (vcpu->requests & ~(1UL<<KVM_REQ_PVLOCK_KICK))
> + || need_resched() || signal_pending(current)) {
> vcpu->mode = OUTSIDE_GUEST_MODE;
> smp_wmb();
> local_irq_enable();
> @@ -6711,6 +6712,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
> !vcpu->arch.apf.halted)
> || !list_empty_careful(&vcpu->async_pf.done)
> || vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
> + || kvm_check_request(KVM_REQ_PVLOCK_KICK, vcpu)
> || atomic_read(&vcpu->arch.nmi_queued) ||
> (kvm_arch_interrupt_allowed(vcpu) &&
> kvm_cpu_has_interrupt(vcpu));
>
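For readers of the CONFIG_KVM_DEBUG_FS code quoted above, a standalone sketch of the histogram bucketing in __spin_time_accum(): bucket n counts blocked times in [2^n, 2^(n+1)) sched_clock() units (nanoseconds), and everything from 2^30 up lands in the last bucket. ilog2_u64() is a hypothetical portable stand-in for the kernel's ilog2(). Unlike the patch, the sketch keeps the delta as 64 bits (the quoted spin_time_accum_blocked() narrows it to u32) and maps a zero delta to bucket 0 explicitly.

```c
#include <assert.h>
#include <stdint.h>

#define HISTO_BUCKETS 30

/* floor(log2(x)) for x > 0, standing in for the kernel's ilog2(). */
static unsigned ilog2_u64(uint64_t x)
{
	unsigned r = 0;

	assert(x != 0);
	while (x >>= 1)
		r++;
	return r;
}

/* Bucket one blocked-time sample; the last slot collects overflows. */
static void spin_time_accum(uint64_t delta, uint32_t histo[HISTO_BUCKETS + 1])
{
	unsigned index = delta ? ilog2_u64(delta) : 0;

	if (index < HISTO_BUCKETS)
		histo[index]++;
	else
		histo[HISTO_BUCKETS]++;
}
```

For example, a 1000 ns wait falls into bucket 9, since 512 <= 1000 < 1024.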
* Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Alexander Graf @ 2012-01-16 3:23 UTC
To: Raghavendra K T
Cc: Jeremy Fitzhardinge, linux-doc, Peter Zijlstra, Jan Kiszka,
Virtualization, Paul Mackerras, H. Peter Anvin,
Stefano Stabellini, Xen, Dave Jiang, KVM, Glauber Costa, X86,
Ingo Molnar, Avi Kivity, Rik van Riel, Konrad Rzeszutek Wilk,
Srivatsa Vaddagiri, Sasha Levin, Sedat Dilek, Thomas Gleixner,
Greg Kroah-Hartman, LKML, Dave Hansen
In-Reply-To: <20120114182710.8604.22277.sendpatchset@oc5400248562.ibm.com>
On 14.01.2012, at 19:27, Raghavendra K T wrote:
> Add Documentation on CPUID, KVM_CAP_PVLOCK_KICK, and Hypercalls supported.
>
> KVM_HC_KICK_CPU hypercall added to wakeup halted vcpu in
> paravirtual spinlock enabled guest.
>
> KVM_FEATURE_PVLOCK_KICK enables guest to check whether pv spinlock can
> be enabled in guest. support in host is queried via
> ioctl(KVM_CHECK_EXTENSION, KVM_CAP_PVLOCK_KICK)
>
> A minimal Documentation and template is added for hypercalls.
>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> ---
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index e2a4b52..1583bc7 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1109,6 +1109,13 @@ support. Instead it is reported via
> if that returns true and you use KVM_CREATE_IRQCHIP, or if you emulate the
> feature in userspace, then you can enable the feature for KVM_SET_CPUID2.
>
> +Paravirtualized ticket spinlocks can be enabled in guest by checking whether
> +support exists in host via,
> +
> + ioctl(KVM_CHECK_EXTENSION, KVM_CAP_PVLOCK_KICK)
> +
> +if this call return true, guest can use the feature.
> +
> 4.47 KVM_PPC_GET_PVINFO
>
> Capability: KVM_CAP_PPC_GET_PVINFO
> diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
> index 8820685..c7fc0da 100644
> --- a/Documentation/virtual/kvm/cpuid.txt
> +++ b/Documentation/virtual/kvm/cpuid.txt
> @@ -39,6 +39,10 @@ KVM_FEATURE_CLOCKSOURCE2 || 3 || kvmclock available at msrs
> KVM_FEATURE_ASYNC_PF || 4 || async pf can be enabled by
> || || writing to msr 0x4b564d02
> ------------------------------------------------------------------------------
> +KVM_FEATURE_PVLOCK_KICK || 6 || guest checks this feature bit
> + || || before enabling paravirtualized
> + || || spinlock support.
> +------------------------------------------------------------------------------
> KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side
> || || per-cpu warps are expected in
> || || kvmclock.
> diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt
> new file mode 100644
> index 0000000..7872da5
> --- /dev/null
> +++ b/Documentation/virtual/kvm/hypercalls.txt
> @@ -0,0 +1,54 @@
> +KVM Hypercalls Documentation
> +===========================
> +Template for documentation is
> +The documentation for hypercalls should include
> +1. Hypercall name, value.
> +2. Architecture(s)
> +3. Purpose
> +
> +
> +1. KVM_HC_VAPIC_POLL_IRQ
> +------------------------
> +value: 1
> +Architecture: x86
> +Purpose:
> +
> +2. KVM_HC_MMU_OP
> +------------------------
> +value: 2
> +Architecture: x86
> +Purpose: Support MMU operations such as writing to PTE,
> +flushing TLB, release PT.
This one is deprecated, no? Should probably be mentioned here.
> +
> +3. KVM_HC_FEATURES
> +------------------------
> +value: 3
> +Architecture: PPC
> +Purpose:
Expose hypercall availability to the guest. On x86 you use cpuid to enumerate which hypercalls are available. The natural fit on ppc would be a device-tree-based lookup (which is also what EPAPR dictates), but we also have a second, KVM-specific enumeration mechanism: this hypercall.
> +
> +4. KVM_HC_PPC_MAP_MAGIC_PAGE
> +------------------------
> +value: 4
> +Architecture: PPC
> +Purpose: To enable communication between the hypervisor and guest there is a
> +new
It's not new anymore :)
> shared page that contains parts of supervisor visible register state.
> +The guest can map this shared page using this hypercall.
... to access its supervisor registers through memory.
> +
> +5. KVM_HC_KICK_CPU
> +------------------------
> +value: 5
> +Architecture: x86
> +Purpose: Hypercall used to wakeup a vcpu from HLT state
> +
> +Usage example : A vcpu of a paravirtualized guest that is busywaiting in guest
> +kernel mode for an event to occur (ex: a spinlock to become available)
> +can execute HLT instruction once it has busy-waited for more than a
> +threshold time-interval. Execution of HLT instruction would cause
> +the hypervisor to put the vcpu to sleep (unless yield_on_hlt=0) until occurrence
> +of an appropriate event. Another vcpu of the same guest can wakeup the sleeping
> +vcpu by issuing KVM_HC_KICK_CPU hypercall, specifying APIC ID of the vcpu to be
> +woken up.
The description is way too specific. The hypercall basically gives the guest the ability to yield() its current vcpu to another chosen vcpu. The APIC piece is an implementation detail for x86. On PPC we could just use the PIR register contents (processor identifier).
Maybe I didn't fully understand what this really is about though :)
Alex
* Re: [PATCH RFC V4 2/5] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
From: Alexander Graf @ 2012-01-16 3:24 UTC
To: Raghavendra K T
Cc: Jeremy Fitzhardinge, Greg Kroah-Hartman, linux-doc,
Peter Zijlstra, Jan Kiszka, Virtualization, Paul Mackerras,
H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
Konrad Rzeszutek Wilk, Srivatsa Vaddagiri, Sasha Levin,
Sedat Dilek, Thomas Gleixner, LKML, Dave Hansen
In-Reply-To: <20120114182553.8604.41642.sendpatchset@oc5400248562.ibm.com>
On 14.01.2012, at 19:25, Raghavendra K T wrote:
> Add a hypercall to KVM hypervisor to support pv-ticketlocks
>
> KVM_HC_KICK_CPU allows the calling vcpu to kick another vcpu out of halt state.
>
> The presence of these hypercalls is indicated to guest via
> KVM_FEATURE_PVLOCK_KICK/KVM_CAP_PVLOCK_KICK.
>
> Qemu needs a corresponding patch to pass up the presence of this feature to
> guest via cpuid. Patch to qemu will be sent separately.
>
> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 734c376..7a94987 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -16,12 +16,14 @@
> #define KVM_FEATURE_CLOCKSOURCE 0
> #define KVM_FEATURE_NOP_IO_DELAY 1
> #define KVM_FEATURE_MMU_OP 2
> +
> /* This indicates that the new set of kvmclock msrs
> * are available. The use of 0x11 and 0x12 is deprecated
> */
> #define KVM_FEATURE_CLOCKSOURCE2 3
> #define KVM_FEATURE_ASYNC_PF 4
> #define KVM_FEATURE_STEAL_TIME 5
> +#define KVM_FEATURE_PVLOCK_KICK 6
>
> /* The last 8 bits are used to indicate how to interpret the flags field
> * in pvclock structure. If no bits are set, all flags are ignored.
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4c938da..c7b05fc 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2099,6 +2099,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> case KVM_CAP_XSAVE:
> case KVM_CAP_ASYNC_PF:
> case KVM_CAP_GET_TSC_KHZ:
> + case KVM_CAP_PVLOCK_KICK:
> r = 1;
> break;
> case KVM_CAP_COALESCED_MMIO:
> @@ -2576,7 +2577,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> (1 << KVM_FEATURE_NOP_IO_DELAY) |
> (1 << KVM_FEATURE_CLOCKSOURCE2) |
> (1 << KVM_FEATURE_ASYNC_PF) |
> - (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> + (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> + (1 << KVM_FEATURE_PVLOCK_KICK);
>
> if (sched_info_on())
> entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> @@ -5304,6 +5306,29 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
> return 1;
> }
>
> +/*
> + * kvm_pv_kick_cpu_op: Kick a vcpu.
> + *
> + * @apicid - apicid of vcpu to be kicked.
> + */
> +static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
> +{
> + struct kvm_vcpu *vcpu;
> + int i;
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (!kvm_apic_present(vcpu))
> + continue;
> +
> + /* Kick only on a match; a full pass leaves vcpu at the last one. */
> + if (kvm_apic_match_dest(vcpu, 0, 0, apicid, 0)) {
> + kvm_make_request(KVM_REQ_PVLOCK_KICK, vcpu);
> + kvm_vcpu_kick(vcpu);
> + return;
> + }
> + }
> +}
> +
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> {
> unsigned long nr, a0, a1, a2, a3, ret;
> @@ -5340,6 +5365,10 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> case KVM_HC_MMU_OP:
> r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2), &ret);
> break;
> + case KVM_HC_KICK_CPU:
> + kvm_pv_kick_cpu_op(vcpu->kvm, a0);
> + ret = 0;
> + break;
> default:
> ret = -KVM_ENOSYS;
> break;
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 68e67e5..63fb6b0 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -558,6 +558,7 @@ struct kvm_ppc_pvinfo {
> #define KVM_CAP_PPC_PAPR 68
> #define KVM_CAP_S390_GMAP 71
> #define KVM_CAP_TSC_DEADLINE_TIMER 72
> +#define KVM_CAP_PVLOCK_KICK 73
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d526231..3b1ae7b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -50,6 +50,7 @@
> #define KVM_REQ_APF_HALT 12
> #define KVM_REQ_STEAL_UPDATE 13
> #define KVM_REQ_NMI 14
> +#define KVM_REQ_PVLOCK_KICK 15
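For reference, here is the feature-bit handshake the diff sets up. The host ORs bit KVM_FEATURE_PVLOCK_KICK (6) into EAX of the KVM features CPUID leaf in do_cpuid_ent(). The guest tests that bit before relying on the kick hypercall. The following is a minimal sketch that works on plain integers rather than a real CPUID read. Because the hunk's context is truncated, it includes only the bits visible in the hunk. The function names are hypothetical. Guest kernels normally perform this test through kvm_para_has_feature().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as they appear in the patch's kvm_para.h hunk. */
#define KVM_FEATURE_NOP_IO_DELAY           1
#define KVM_FEATURE_MMU_OP                 2
#define KVM_FEATURE_CLOCKSOURCE2           3
#define KVM_FEATURE_ASYNC_PF               4
#define KVM_FEATURE_STEAL_TIME             5
#define KVM_FEATURE_PVLOCK_KICK            6
#define KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24

/* Host side (hypothetical helper): the EAX bits visible in the hunk.
 * STEAL_TIME is only added when sched_info_on() is true. */
static uint32_t host_kvm_features_eax(bool sched_info_on)
{
	uint32_t eax = (1u << KVM_FEATURE_NOP_IO_DELAY) |
		       (1u << KVM_FEATURE_CLOCKSOURCE2) |
		       (1u << KVM_FEATURE_ASYNC_PF) |
		       (1u << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
		       (1u << KVM_FEATURE_PVLOCK_KICK);

	if (sched_info_on)
		eax |= 1u << KVM_FEATURE_STEAL_TIME;
	return eax;
}

/* Guest side (hypothetical helper): test one feature bit in EAX. */
static bool has_kvm_feature(uint32_t eax, unsigned int feature)
{
	assert(feature < 32);
	return (eax >> feature) & 1u;
}
```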
Everything I see in this patch is pvlock agnostic. It's only a vcpu kick hypercall. So it's probably a good idea to also name it accordingly :).
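Read as a generic vcpu kick, kvm_pv_kick_cpu_op() reduces to "find the vcpu with this APIC ID, set a request bit, wake it". Below is a userspace model of that lookup. The struct is hypothetical, and plain equality stands in for kvm_apic_match_dest(). The model records the match in a separate variable. As we read kvm_for_each_vcpu(), a loop that runs to completion without a break leaves its cursor pointing at the last vcpu, so `if (vcpu)` after the loop may not distinguish "found" from "not found".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kvm_vcpu: only what the lookup needs. */
struct vcpu {
	int apic_id;
	bool apic_present;
	bool kick_pending;   /* models the KVM_REQ_PVLOCK_KICK request bit */
};

/*
 * Mark the vcpu whose APIC ID equals @apicid as kicked.
 * Returns that vcpu, or NULL if none matched. The match is kept in
 * @target, separate from the loop index, so "no match" is unambiguous.
 */
static struct vcpu *kick_vcpu_by_apic_id(struct vcpu *vcpus, size_t n,
					 int apicid)
{
	struct vcpu *target = NULL;

	assert(vcpus != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!vcpus[i].apic_present)
			continue;
		if (vcpus[i].apic_id == apicid) {
			target = &vcpus[i];
			break;
		}
	}
	if (target)
		target->kick_pending = true;  /* then wake it, as kvm_vcpu_kick() would */
	return target;
}
```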
Alex
* Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Srivatsa Vaddagiri @ 2012-01-16 3:51 UTC
To: Alexander Graf
Cc: Jeremy Fitzhardinge, Raghavendra K T, linux-doc, Peter Zijlstra,
Jan Kiszka, Virtualization, Paul Mackerras, H. Peter Anvin,
Stefano Stabellini, Xen, Dave Jiang, KVM, Glauber Costa, X86,
Ingo Molnar, Avi Kivity, Rik van Riel, Konrad Rzeszutek Wilk,
Sasha Levin, Sedat Dilek, Thomas Gleixner, Greg Kroah-Hartman,
LKML
In-Reply-To: <AD31813D-E4D5-43F3-B06A-9EB1B6FC9381@suse.de>
* Alexander Graf <agraf@suse.de> [2012-01-16 04:23:24]:
> > +5. KVM_HC_KICK_CPU
> > +------------------------
> > +value: 5
> > +Architecture: x86
> > +Purpose: Hypercall used to wakeup a vcpu from HLT state
> > +
> > +Usage example : A vcpu of a paravirtualized guest that is busywaiting in guest
> > +kernel mode for an event to occur (ex: a spinlock to become available)
> > +can execute HLT instruction once it has busy-waited for more than a
> > +threshold time-interval. Execution of HLT instruction would cause
> > +the hypervisor to put the vcpu to sleep (unless yield_on_hlt=0) until occurence
> > +of an appropriate event. Another vcpu of the same guest can wakeup the sleeping
> > +vcpu by issuing KVM_HC_KICK_CPU hypercall, specifying APIC ID of the vcpu to be
> > +wokenup.
>
> The description is way too specific. The hypercall basically gives the guest the ability to yield() its current vcpu to another chosen vcpu.
Hmm... the hypercall does not allow a vcpu to yield. It just allows some
target vcpu to be prodded/woken up, after which the calling vcpu continues execution.
Note that the semantics of this hypercall are different from those of the hypercall on which
PPC pv-spinlock (__spin_yield()) currently depends. This is mainly because
of ticketlocks on x86, which do not allow us to easily store the owning cpu
in the lock word itself.
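The ticketlock point can be made concrete. The lock word holds only two counters, so an unlocker knows which ticket is served next but not which cpu holds it. The model below is loosely modeled on the per-cpu "who waits for which ticket" approach of Jeremy's pv-ticketlock series as we understand it. Each waiter publishes (lock, ticket) before halting, and unlock scans that table to choose whom to kick. All names and NR_CPUS_MODEL are hypothetical. The model is single-threaded and ignores the publish/unlock races the real code must handle with barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS_MODEL 4   /* hypothetical */

/* The lock word: two counters and no owner field. */
struct ticketlock {
	atomic_uint head;   /* ticket currently being served */
	atomic_uint tail;   /* next ticket to hand out */
};

/* Per-cpu record a waiter publishes before halting (hypothetical). */
struct lock_waiting {
	struct ticketlock *lock;
	unsigned int want;
};

static struct lock_waiting waiting[NR_CPUS_MODEL];

static unsigned int take_ticket(struct ticketlock *l)
{
	return atomic_fetch_add(&l->tail, 1);
}

/* Slow-path entry: record what we wait for; a guest would then halt. */
static void publish_wait(int cpu, struct ticketlock *l, unsigned int ticket)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	waiting[cpu].lock = l;
	waiting[cpu].want = ticket;
}

/*
 * Unlock: advance head, then look up who holds the next ticket.
 * Returns the cpu to kick, or -1 if nobody is sleeping on that ticket.
 */
static int unlock_and_find_kick(struct ticketlock *l)
{
	unsigned int next = atomic_fetch_add(&l->head, 1) + 1;

	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		if (waiting[cpu].lock == l && waiting[cpu].want == next) {
			waiting[cpu].lock = NULL;  /* consumed by the kick */
			return cpu;  /* a KVM guest would issue KVM_HC_KICK_CPU here */
		}
	}
	return -1;
}
```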
> The APIC piece is an implementation detail for x86. On PPC we could just use the PIR register contents (processor identifier).
- vatsa
* Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
From: Alexander Graf @ 2012-01-16 3:57 UTC
To: Raghavendra K T
Cc: Jeremy Fitzhardinge, Greg Kroah-Hartman, linux-doc,
Peter Zijlstra, Jan Kiszka, Virtualization, Paul Mackerras,
H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
Glauber Costa, X86, Ingo Molnar, Avi Kivity, Rik van Riel,
Konrad Rzeszutek Wilk, Srivatsa Vaddagiri, Sasha Levin,
Sedat Dilek, Thomas Gleixner, LKML, Dave Hansen
In-Reply-To: <20120114182501.8604.68416.sendpatchset@oc5400248562.ibm.com>
On 14.01.2012, at 19:25, Raghavendra K T wrote:
> The 5-patch series to follow this email extends KVM-hypervisor and Linux guest
> running on KVM-hypervisor to support pv-ticket spinlocks, based on Xen's implementation.
>
> One hypercall is introduced in the KVM hypervisor that allows a vcpu to kick
> another vcpu out of halt state.
> The blocking of vcpu is done using halt() in (lock_spinning) slowpath.
Is the code this depends on even upstream? The prerequisite series seems to have been posted by Jeremy, but it doesn't appear to have made it in yet.
Either way, while thinking about this, I stumbled over the following passage of his patch:
> + unsigned count = SPIN_THRESHOLD;
> +
> + do {
> + if (inc.head == inc.tail)
> + goto out;
> + cpu_relax();
> + inc.head = ACCESS_ONCE(lock->tickets.head);
> + } while (--count);
> + __ticket_lock_spinning(lock, inc.tail);
That means we spin for n iterations, then notify the spinlock holder that we'd like to be kicked, and go to sleep. While I'm pretty sure that it improves the situation, it doesn't solve all of the issues we have.
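The fast path Alex is describing can be modeled in isolation. The model spins a bounded number of times waiting for the lock's head to reach our ticket, and otherwise reports that the caller should take the slow path (record the ticket, halt, wait for a kick). The model follows the shape of the quoted loop. The function name is hypothetical, and the slow path is left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Spin at most @threshold times waiting for *head == @ticket.
 * Returns true if the lock was obtained by spinning; false means
 * "give up and enter the slow path" (handled by the caller).
 */
static bool spin_for_ticket(atomic_uint *head, unsigned int ticket,
			    unsigned int threshold)
{
	unsigned int count = threshold;

	assert(threshold > 0);   /* --count below would wrap at 0 */
	do {
		if (atomic_load_explicit(head, memory_order_acquire) == ticket)
			return true;
		/* the quoted loop calls cpu_relax() (PAUSE on x86) here */
	} while (--count);
	return false;
}
```

Whatever the host load, the fixed threshold decides when a waiter gives up. That is the knob the next paragraphs are about.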
Imagine we have an idle host. All vcpus can run freely and everyone can fetch the lock as fast as on real machines. We neither need nor want to go to sleep here. Locks that are held too long are bugs that need to be fixed on real hardware just as well, so all we do is possibly incur overhead.
Imagine we have a contended host. Every vcpu gets at most 10% of a real CPU's runtime. So the chances are only 1 in 10 that the vcpu you depend on (e.g. the lock holder) is actually running when you need it to be. In such a setup, it's probably a good idea to be very pessimistic: try to fetch the lock for 100 cycles and then immediately make room for all the other VMs that have real work going on!
So what I'm getting at is that if we had a hypervisor-settable spin threshold, we could adjust it according to the host's load, making VMs behave differently under different (guest-invisible) circumstances.
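One way to picture a hypervisor-settable threshold is a host-side policy that maps the CPU share a vcpu currently gets to the spin threshold the guest reads. Everything below is hypothetical: the bounds, the linear curve, and the idea that the share is known as a percentage. It only illustrates the idle-host vs contended-host extremes described above.

```c
#include <assert.h>

/* Hypothetical bounds, not values from the patch series. */
#define SPIN_MIN 100u         /* contended: try briefly, then get out of the way */
#define SPIN_MAX (1u << 15)   /* idle host: spin roughly as bare metal would */

/*
 * Hypothetical policy: @share_pct is the share (1..100) of a physical
 * CPU the vcpu currently gets. At or below 10% use SPIN_MIN; scale
 * linearly up to SPIN_MAX at 100%.
 */
static unsigned int spin_threshold_for_share(unsigned int share_pct)
{
	assert(share_pct >= 1 && share_pct <= 100);
	if (share_pct <= 10)
		return SPIN_MIN;
	return SPIN_MIN + (SPIN_MAX - SPIN_MIN) * (share_pct - 10) / 90;
}
```

The linear curve is an arbitrary choice. The point is only that the host, which sees the load, picks the number.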
Speaking of which - don't we have spin lock counters in the CPUs now? I thought we could set intercepts that notify us when the guest issues too many repz nops or whatever the typical spinlock identifier was. Can't we reuse that and just interrupt the guest if we see this with a special KVM interrupt that kicks off the internal spin lock waiting code? That way we don't slow down all those bare metal boxes.
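The CPU feature alluded to here is, on Intel VMX, PAUSE-loop exiting (PLE), where the PAUSE instruction is encoded as "rep; nop". Our reading of the Intel SDM is as follows. A PAUSE counts as part of the same loop if it comes within PLE_Gap cycles of the previous one. The CPU exits to the hypervisor once the loop has lasted longer than PLE_Window cycles. KVM's defaults are ple_gap=128 and ple_window=4096. The following is a simplified model of that rule. The struct and function are hypothetical, and behaviour after an exit is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified PLE model; all times are in TSC cycles. */
struct ple_state {
	uint64_t gap;          /* max distance between PAUSEs in one loop */
	uint64_t window;       /* max time allowed inside one PAUSE loop */
	uint64_t first_pause;  /* start of the current loop */
	uint64_t last_pause;
	bool in_loop;
};

/* Called for each guest PAUSE at time @now; returns true on VM exit. */
static bool ple_on_pause(struct ple_state *s, uint64_t now)
{
	assert(!s->in_loop || now >= s->last_pause);
	if (!s->in_loop || now - s->last_pause > s->gap) {
		s->first_pause = now;   /* PAUSEs too far apart: new loop */
		s->in_loop = true;
	}
	s->last_pause = now;
	return now - s->first_pause > s->window;
}
```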
Speaking of which - have you benchmarked the performance degradation of pv ticket locks on bare metal? Last time I checked, enabling all the PV ops did incur a significant slowdown, which is why I went through the work to split the individual pv ops features up to only enable a few for KVM guests.
>
> Changes in V4:
> - rebased to 3.2.0 pre.
> - use APIC ID for kicking the vcpu and use kvm_apic_match_dest for matching. (Avi)
> - fold vcpu->kicked flag into vcpu->requests (KVM_REQ_PVLOCK_KICK) and related
> changes for UNHALT path to make pv ticket spinlock migration friendly. (Avi, Marcello)
> - Added Documentation for CPUID, Hypercall (KVM_HC_KICK_CPU)
> and capability (KVM_CAP_PVLOCK_KICK) (Avi)
> - Remove unneeded kvm_arch_vcpu_ioctl_set_mpstate call. (Marcello)
> - cumulative variable type changed (int ==> u32) in add_stat (Konrad)
> - remove unneeded kvm_guest_init for !CONFIG_KVM_GUEST case
>
> Changes in V3:
> - rebased to 3.2-rc1
> - use halt() instead of wait for kick hypercall.
> - modify kick hypercall to wake up a halted vcpu.
> - hook kvm_spinlock_init to smp_prepare_cpus call (moved the call out of head##.c).
> - fix the potential race when zero_stat is read.
> - export debugfs_create_32 and add documentation to API.
> - use static inline and enum instead of ADDSTAT macro.
> - add barrier() after setting kick_vcpu.
> - empty static inline function for kvm_spinlock_init.
> - combine patches one and two to reduce overhead.
> - make KVM_DEBUGFS depend on DEBUGFS.
> - include debugfs header unconditionally.
>
> Changes in V2:
> - rebased patches to -rc9
> - synchronization related changes based on Jeremy's changes
> (Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>) pointed by
> Stephan Diestelhorst <stephan.diestelhorst@amd.com>
> - enabling 32 bit guests
> - split patches into two more chunks
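The V4 change that folds the kicked flag into vcpu->requests (KVM_REQ_PVLOCK_KICK) relies on the request-bit idiom. The kicker sets a bit, and the vcpu consumes it with a test-and-clear before re-entering the guest. The model below uses C11 atomics in place of the kernel's set_bit()/test_bit()/clear_bit() on vcpu->requests. The struct and helper names are hypothetical. Note that two kicks before a check coalesce into one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define KVM_REQ_PVLOCK_KICK 15   /* bit number from the patch */

/* Hypothetical stand-in for the vcpu's request word. */
struct vcpu_model {
	atomic_ulong requests;
};

/* Kicker side: raise request @req (models kvm_make_request()). */
static void make_request(int req, struct vcpu_model *v)
{
	atomic_fetch_or(&v->requests, 1UL << req);
}

/* vcpu side: return true if @req was pending, clearing it. */
static bool check_request(int req, struct vcpu_model *v)
{
	unsigned long bit = 1UL << req;

	return atomic_fetch_and(&v->requests, ~bit) & bit;
}
```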
>
> Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T (5):
> Add debugfs support to print u32-arrays in debugfs
> Add a hypercall to KVM hypervisor to support pv-ticketlocks
> Added configuration support to enable debug information for KVM Guests
> pv-ticketlocks support for linux guests running on KVM hypervisor
> Add documentation on Hypercalls and features used for PV spinlock
>
> Test Set up :
> The BASE patch is pre 3.2.0 + Jeremy's following patches.
> xadd (https://lkml.org/lkml/2011/10/4/328)
> x86/ticketlock (https://lkml.org/lkml/2011/10/12/496).
> Kernel for host/guest : 3.2.0 + Jeremy's xadd, pv spinlock patches as BASE
> (Note: the locked add change is not included yet)
>
> Results:
> The performance gain comes mainly from reduced busy-wait time.
> From the results we can see that the patched kernel performs similarly to
> BASE when there is no lock contention. But once we start seeing more
> contention, the patched kernel outperforms BASE (non-PLE).
> On the PLE machine we do not see much performance improvement, because PLE
> complements halt().
>
> 3 guests with 8 VCPUs and 4GB RAM each; 1 used for kernbench
> (kernbench -f -H -M -o 20), the others for cpuhog (a shell script running
> while true with an instruction)
>
> scenario A: unpinned
>
> 1x: no hogs
> 2x: 8 hogs in one guest
> 3x: 8 hogs each in two guests
>
> scenario B: unpinned, run kernbench on all the guests no hogs.
>
> Dbench on PLE machine:
> dbench run on all the guests simultaneously with
> dbench --warmup=30 -t 120 with NRCLIENTS=(8/16/32).
>
> Result for Non PLE machine :
> ============================
> Machine : IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU with 8 cores, 64GB RAM
>                BASE                 BASE+patch           %improvement
>                mean (sd)            mean (sd)
> Scenario A:
> case 1x:       164.233 (16.5506)    163.584 (15.4598)    0.39517
> case 2x:       897.654 (543.993)    328.63 (103.771)     63.3901
> case 3x:       2855.73 (2201.41)    315.029 (111.854)    88.9685
>
> Dbench:
> Throughput is in MB/sec
> NRCLIENTS    BASE                   BASE+patch             %improvement
>              mean (sd)              mean (sd)
> 8            1.774307 (0.061361)    1.725667 (0.034644)    -2.74135
> 16           1.445967 (0.044805)    1.463173 (0.094399)    1.18993
> 32           2.136667 (0.105717)    2.193792 (0.129357)    2.67356
>
> Result for PLE machine:
> ======================
> Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU with 32/64 cores, with 8
> online cores and 4*64GB RAM
>
> Kernbench:
>                BASE                 BASE+patch            %improvement
>                mean (sd)            mean (sd)
> Scenario A:
> case 1x:       161.263 (56.518)     159.635 (40.5621)     1.00953
> case 2x:       190.748 (61.2745)    190.606 (54.4766)     0.0744438
> case 3x:       227.378 (100.215)    225.442 (92.0809)     0.851446
>
> Scenario B:
>                446.104 (58.54)      433.12733 (54.476)    2.91
>
> Dbench:
> Throughput is in MB/sec
> NRCLIENTS    BASE                   BASE+patch             %improvement
>              mean (sd)              mean (sd)
> 8            1.101190 (0.875082)    1.700395 (0.846809)    54.4143
> 16           1.524312 (0.120354)    1.477553 (0.058166)    -3.06755
> 32           2.143028 (0.157103)    2.090307 (0.136778)    -2.46012
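The %improvement columns are consistent with the following convention, checked here against several rows. Kernbench reports elapsed time, where lower is better, so improvement is (BASE - patched) / BASE. Dbench reports throughput, where higher is better, so improvement is (patched - BASE) / BASE. A negative dbench figure therefore means the patched kernel moved less data. The function names are hypothetical.

```c
#include <assert.h>

/* Kernbench: elapsed time, lower is better. */
static double improvement_time(double base, double patched)
{
	assert(base > 0.0);
	return (base - patched) / base * 100.0;
}

/* Dbench: throughput in MB/sec, higher is better. */
static double improvement_throughput(double base, double patched)
{
	assert(base > 0.0);
	return (patched - base) / base * 100.0;
}
```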
So on a very contended system we're actually slower? Is this expected?
Alex
* Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Alexander Graf @ 2012-01-16 4:00 UTC
To: Srivatsa Vaddagiri
Cc: Jeremy Fitzhardinge, Raghavendra K T, linux-doc, Peter Zijlstra,
Jan Kiszka, Virtualization, Paul Mackerras, H. Peter Anvin,
Stefano Stabellini, Xen, Dave Jiang, KVM, Glauber Costa, X86,
Ingo Molnar, Avi Kivity, Rik van Riel, Konrad Rzeszutek Wilk,
Sasha Levin, Sedat Dilek, Thomas Gleixner, Greg Kroah-Hartman,
LKML
In-Reply-To: <20120116035114.GI9129@linux.vnet.ibm.com>
On 16.01.2012, at 04:51, Srivatsa Vaddagiri wrote:
> * Alexander Graf <agraf@suse.de> [2012-01-16 04:23:24]:
>
>>> +5. KVM_HC_KICK_CPU
>>> +------------------------
>>> +value: 5
>>> +Architecture: x86
>>> +Purpose: Hypercall used to wakeup a vcpu from HLT state
>>> +
>>> +Usage example : A vcpu of a paravirtualized guest that is busywaiting in guest
>>> +kernel mode for an event to occur (ex: a spinlock to become available)
>>> +can execute HLT instruction once it has busy-waited for more than a
>>> +threshold time-interval. Execution of HLT instruction would cause
>>> +the hypervisor to put the vcpu to sleep (unless yield_on_hlt=0) until occurence
>>> +of an appropriate event. Another vcpu of the same guest can wakeup the sleeping
>>> +vcpu by issuing KVM_HC_KICK_CPU hypercall, specifying APIC ID of the vcpu to be
>>> +wokenup.
>>
>> The description is way too specific. The hypercall basically gives the guest the ability to yield() its current vcpu to another chosen vcpu.
>
> Hmm... the hypercall does not allow a vcpu to yield. It just allows some
> target vcpu to be prodded/woken up, after which the calling vcpu continues execution.
>
> Note that the semantics of this hypercall are different from those of the hypercall on which
> PPC pv-spinlock (__spin_yield()) currently depends. This is mainly because
> of ticketlocks on x86, which do not allow us to easily store the owning cpu
> in the lock word itself.
Yes, sorry for not being more exact in my wording. It is a directed yield(): not the normal old-style one that just says "I'm done, get some work to someone else", but something more like "I'm done, get some work to this specific guy over there" :).
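Alex's distinction between the two kinds of yield can be shown with a toy FIFO run queue. A plain yield only sends the current vcpu to the back of the queue. A directed yield also pulls a chosen target to the front so that it runs next. The queue structure and names are hypothetical and are not how the Linux scheduler is implemented.

```c
#include <assert.h>
#include <stddef.h>

#define RQ_MAX 8   /* hypothetical run-queue capacity */

/* Toy FIFO run queue of vcpu ids; ids[0] runs next. */
struct runqueue {
	int ids[RQ_MAX];
	size_t len;
};

/* Plain yield: "I'm done, get some work to someone else." */
static void rq_yield(struct runqueue *rq, int current)
{
	assert(rq->len < RQ_MAX);
	rq->ids[rq->len++] = current;
}

/*
 * Directed yield: "get some work to this specific guy over there."
 * Moves @target to the front, then requeues @current at the back.
 * Returns 0 (and changes nothing) if @target is not queued.
 */
static int rq_yield_to(struct runqueue *rq, int current, int target)
{
	size_t i;

	for (i = 0; i < rq->len && rq->ids[i] != target; i++)
		;
	if (i == rq->len)
		return 0;
	for (; i > 0; i--)   /* shift earlier entries back by one */
		rq->ids[i] = rq->ids[i - 1];
	rq->ids[0] = target;
	rq_yield(rq, current);
	return 1;
}
```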
Alex