Linux virtualization list
 help / color / mirror / Atom feed
* Re: [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions
From: Anup Patel @ 2011-11-30  4:42 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: xen-devel, linaro-dev, kvm, Stefano Stabellini, linux-kernel,
	virtualization, android-virt, embeddedxen-devel, linux-arm-kernel
In-Reply-To: <201111292129.20444.arnd@arndb.de>


[-- Attachment #1.1: Type: text/plain, Size: 1712 bytes --]

Hi all,

I wanted to know how Xen-ARM for A15 will address following concerns:

- How will Xen-ARM for A15 support legacy guest environment like ARMv5 or
ARMv6 ?
- What if my Cortex-A15 board does not have a GIC with virtualization
support ?

Best Regards,
Anup Patel

On Wed, Nov 30, 2011 at 2:59 AM, Arnd Bergmann <arnd@arndb.de> wrote:

> On Tuesday 29 November 2011, Stefano Stabellini wrote:
> > Hi all,
> > a few weeks ago I (and a few others) started hacking on a
> > proof-of-concept hypervisor port to Cortex-A15 which uses and requires
> > ARMv7 virtualization extensions. The intention of this work was to find
> > out how to best support ARM v7+ on Xen. See
> >
> http://old-list-archives.xen.org/archives/html/xen-arm/2011-09/msg00013.html
> > for more details.
> >
> > I am pleased to announce that significant progress has been made, and
> > that we now have a nascent Xen port for Cortex-A15. The port is based on
> > xen-unstable (HG CS 8d6edc3d26d2) and written from scratch exploiting
> > the latest virtualization, LPAE, GIC and generic timer support in
> > hardware.
>
> Very nice!
>
> Do you have a pointer to the kernel sources for the Linux guest?
> Since Xen and KVM are both in an early working state right now,
> it would be very nice if we could agree on the guest model to make
> sure that it's always possible to run the same kernel in both
> (and potentially other future) hypervisors without modifications.
>
>        Arnd
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

[-- Attachment #1.2: Type: text/html, Size: 2540 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Rusty Russell @ 2011-11-29 23:40 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Krishna Kumar, kvm, Pawel Moll, Michael S. Tsirkin,
	Alexey Kardashevskiy, Wang Sheng-Hui, lkml - Kernel Mailing List,
	virtualization, Christian Borntraeger, Amit Shah
In-Reply-To: <1322471731.3577.10.camel@lappy>

On Mon, 28 Nov 2011 11:15:31 +0200, Sasha Levin <levinsasha928@gmail.com> wrote:
> On Mon, 2011-11-28 at 11:25 +1030, Rusty Russell wrote:
> > I'd like to see kvmtools remove support for legacy mode altogether,
> > but they probably have existing users.
> 
> While we can't simply remove it right away, instead of mixing our
> implementation for both legacy and new spec in the same code we can
> split the virtio-pci implementation into two:
> 
> 	- virtio/virtio-pci-legacy.c
> 	- virtio/virtio-pci.c
> 
> At that point we can #ifdef the entire virtio-pci-legacy.c for now and
> remove it at the same time legacy virtio-pci is removed from the kernel.

Hmm, that might be neat, but we can't tell the driver core to try
virtio-pci before virtio-pci-legacy, so we need detection code in both
modules (and add a "force" flag to virtio-pci-legacy to tell it to
accept the device even if it's not a legacy-only one).

Then it should work...
Cheers,
Rusty.

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Rusty Russell @ 2011-11-29 23:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar, kvm, Pawel Moll, Wang Sheng-Hui,
	Alexey Kardashevskiy, lkml - Kernel Mailing List, virtualization,
	Christian Borntraeger, Sasha Levin, Amit Shah
In-Reply-To: <20111128084009.GB20084@redhat.com>

On Mon, 28 Nov 2011 10:41:51 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Mon, Nov 28, 2011 at 11:25:43AM +1030, Rusty Russell wrote:
> > > > But I'm *terrified* of making the spec more complex;
> > > 
> > > All you do is move stuff around. Why do you think it simplifies the spec
> > > so much?
> > 
> > No, but it reduces the yuk factor.  Which has been important to adoption.
> 
> Sorry if I'm dense. Could you please clarify: do you think we can live
> with the slightly higher yuk factor assuming the spec moves the
> legacy mode into an appendix as you explain below and driver has a
> single 'legacy' switch?

Yep, it's all a trade-off.  A clean slate is good, but if we can make
our lives in transition less painful, I'm all for it.

> I think I see a way to do that in a relatively painless way.
> Do you prefer seeing driver patches or spec? Or are you not interested
> in reusing the same structure at all?

I think we should look at code at this point; my gut says we're going to
be not-quite-similar-enough-to-be-useful.  At which point, a clean-slate
approach is more appealing.  But the code will show, one way or another.

Thanks,
Rusty.

^ permalink raw reply

* Re: [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions
From: Arnd Bergmann @ 2011-11-29 21:29 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: xen-devel, linaro-dev, kvm, Stefano Stabellini, linux-kernel,
	virtualization, android-virt, embeddedxen-devel
In-Reply-To: <alpine.DEB.2.00.1111281009090.31179@kaball-desktop>

On Tuesday 29 November 2011, Stefano Stabellini wrote:
> Hi all,
> a few weeks ago I (and a few others) started hacking on a
> proof-of-concept hypervisor port to Cortex-A15 which uses and requires
> ARMv7 virtualization extensions. The intention of this work was to find
> out how to best support ARM v7+ on Xen. See
> http://old-list-archives.xen.org/archives/html/xen-arm/2011-09/msg00013.html
> for more details. 
> 
> I am pleased to announce that significant progress has been made, and
> that we now have a nascent Xen port for Cortex-A15. The port is based on
> xen-unstable (HG CS 8d6edc3d26d2) and written from scratch exploiting
> the latest virtualization, LPAE, GIC and generic timer support in
> hardware.

Very nice!

Do you have a pointer to the kernel sources for the Linux guest?
Since Xen and KVM are both in an early working state right now,
it would be very nice if we could agree on the guest model to make
sure that it's always possible to run the same kernel in both
(and potentially other future) hypervisors without modifications.

	Arnd

^ permalink raw reply

* [PATCH] xen-blkfront: Use kcalloc instead of kzalloc to allocate array
From: Thomas Meyer @ 2011-11-29 21:08 UTC (permalink / raw)
  To: xen-devel, virtualization, linux-kernel

The advantage of kcalloc is, that will prevent integer overflows which could
result from the multiplication of number of elements and size and it is also
a bit nicer to read.

The semantic patch that makes this change is available
in https://lkml.org/lkml/2011/11/25/107

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
---

diff -u -p a/drivers/block/cciss_scsi.c b/drivers/block/cciss_scsi.c
--- a/drivers/block/cciss_scsi.c 2011-11-28 19:36:47.343430551 +0100
+++ b/drivers/block/cciss_scsi.c 2011-11-28 19:49:24.922716381 +0100
@@ -534,10 +534,10 @@ adjust_cciss_scsi_table(ctlr_info_t *h,
 	int nadded, nremoved;
 	struct Scsi_Host *sh = NULL;
 
-	added = kzalloc(sizeof(*added) * CCISS_MAX_SCSI_DEVS_PER_HBA,
-			GFP_KERNEL);
-	removed = kzalloc(sizeof(*removed) * CCISS_MAX_SCSI_DEVS_PER_HBA,
+	added = kcalloc(CCISS_MAX_SCSI_DEVS_PER_HBA, sizeof(*added),
 			GFP_KERNEL);
+	removed = kcalloc(CCISS_MAX_SCSI_DEVS_PER_HBA, sizeof(*removed),
+			  GFP_KERNEL);
 
 	if (!added || !removed) {
 		dev_warn(&h->pdev->dev,
@@ -1191,8 +1191,8 @@ cciss_update_non_disk_devices(ctlr_info_
 
 	ld_buff = kzalloc(reportlunsize, GFP_KERNEL);
 	inq_buff = kmalloc(OBDR_TAPE_INQ_SIZE, GFP_KERNEL);
-	currentsd = kzalloc(sizeof(*currentsd) *
-			(CCISS_MAX_SCSI_DEVS_PER_HBA+1), GFP_KERNEL);
+	currentsd = kcalloc(CCISS_MAX_SCSI_DEVS_PER_HBA + 1,
+			    sizeof(*currentsd), GFP_KERNEL);
 	if (ld_buff == NULL || inq_buff == NULL || currentsd == NULL) {
 		printk(KERN_ERR "cciss: out of memory\n");
 		goto out;
diff -u -p a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
--- a/drivers/block/xen-blkfront.c 2011-11-13 11:07:22.680095573 +0100
+++ b/drivers/block/xen-blkfront.c 2011-11-28 19:49:29.109460410 +0100
@@ -156,7 +156,7 @@ static int xlbd_reserve_minors(unsigned
 	if (end > nr_minors) {
 		unsigned long *bitmap, *old;
 
-		bitmap = kzalloc(BITS_TO_LONGS(end) * sizeof(*bitmap),
+		bitmap = kcalloc(BITS_TO_LONGS(end), sizeof(*bitmap),
 				 GFP_KERNEL);
 		if (bitmap == NULL)
 			return -ENOMEM;

^ permalink raw reply

* Re: [PATCH v3 2/3] hvc_init(): Enforce one-time initialization.
From: Miche Baker-Harvey @ 2011-11-29 17:50 UTC (permalink / raw)
  To: Amit Shah
  Cc: Stephen Rothwell, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, linux-kernel, virtualization,
	Anton Blanchard, Mike Waychison, ppc-dev, Greg Kroah-Hartman,
	Eric Northrup
In-Reply-To: <CAB8RdapuPRym2dH_9mgmGdzpCxnP1KpS+hS8JJ-geVkga5fMcw@mail.gmail.com>

Good grief!  Sorry for the spacing mess-up!  Here's a resend with reformatting.

Amit,
We aren't using either QEMU or kvmtool, but we are using KVM.  All
the issues we are seeing happen when we try to establish multiple
virtioconsoles at boot time.  The command line isn't relevant, but I
can tell you the protocol that's passing between the host (kvm) and
the guest (see the end of this message).

We do go through the control_work_handler(), but it's not
providing synchronization.  Here's a trace of the
control_work_handler() and handle_control_message() calls; note that
there are two concurrent calls to control_work_handler().

I decorated control_work_handler() with a "lifetime" marker, and
passed this value to handle_control_message(), so we can see which
control messages are being handled from which instance of
the control_work_handler() thread.

Notice that we enter control_work_handler() a second time before
the handling of the second PORT_ADD message is complete. The
first CONSOLE_PORT message is handled by the second
control_work_handler() call, but the second is handled by the first
control_work_handler() call.

root@myubuntu:~# dmesg | grep MBH
[3371055.808738] control_work_handler #1
[3371055.809372] + #1 handle_control_message PORT_ADD
[3371055.810169] - handle_control_message PORT_ADD
[3371055.810170] + #1 handle_control_message PORT_ADD
[3371055.810244]  control_work_handler #2
[3371055.810245] + #2 handle_control_message CONSOLE_PORT
[3371055.810246]  got hvc_ports_mutex
[3371055.810578] - handle_control_message PORT_ADD
[3371055.810579] + #1 handle_control_message CONSOLE_PORT
[3371055.810580]  trylock of hvc_ports_mutex failed
[3371055.811352]  got hvc_ports_mutex
[3371055.811370] - handle_control_message CONSOLE_PORT
[3371055.816609] - handle_control_message CONSOLE_PORT

So, I'm guessing the bug is that there shouldn't be two instances of
control_work_handler() running simultaneously?

Thanks,

Miche

Protocol:
We set up the virtio console device registers during initialization,
specifying the multiport feature, and some number of
ports, n, where nis greater than 1.

In the guest, virtcons_probe() finds our device, and successfully
sends the VIRTIO_CONSOLE_DEVICE_READY=1 control message.
On the host, we receive the VIRTIO_CONSOLE_DEVICE_READY message,
and send one VIRTIO_CONSOLE_PORT_ADD message via the Receive Control
queuefor each port in the number of ports.  These messages are
not serialized: they are all sent at once.

The VIRTIO_CONSOLE_PORT_ADD messages are received
in handle_control_message() in virtio_console.c, and add_port() is
called for each.  After each port is added, the guest
sendsVIRTIO_CONSOLE_PORT_READY to the host, and in these messages, the
id of the port is included in the message.

On the host, in response to each VIRTIO_CONSOLE_PORT_READY=1
message, we may send a VIRTIO_CONSOLE_CONSOLE_PORT message.
On the guest, in response to each VIRTIO_CONSOLE_CONSOLE_PORT
message, init_port_console() is called on the individual port.  The
same id is used for these messages as was used for the PORT_READY
messages.

After each successful init_port_console(), the guest
sends VIRTIO_CONSOLE_PORT_OPEN back to the host.

^ permalink raw reply

* Re: [PATCH] virtio-mmio: Devices parameter parsing
From: Pawel Moll @ 2011-11-29 17:36 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <87zkfht7dp.fsf@rustcorp.com.au>

On Mon, 2011-11-28 at 00:31 +0000, Rusty Russell wrote:
> Off the top of my head, this makes me think of the way initcalls are
> ordered.  We could put a parameter parsing initcall at the start of each
> initcall level in include/asm-generic/vmlinux.lds.h's INITCALLS macro.
> 
> Then we steal four bits from struct kernel_param's flags to indicate the
> level of the initcall (-1 == existing ones, otherwise N == before level
> N initcalls).

Yes, this was my initial idea as well. The only problem I faced is the
fact that there is no "between levels"... It's easy to add parameters
parsing _at_ any particular level, but hard to do this _after_ level A
and _before_ level B. The initcalls section simply contains all the
calls, ordered by the level - the only "separated" level is the pre-SMP
early one. And order within one level is determined by the link order,
so I can't guarantee parsing the parameters as the first call of a level
(nor as the last call of the previous level).

I've came up with a simple prototype (below, only relevant fragments),
doing exactly what I want at the "late" level (maybe that's all we
actually need? ;-) Of course adding all other levels is possible,
probably some clever "generalization" will be useful. And of course the
level would be simply a number in the flags, but what to do with "_sync"
levels (like "arch_initcall_sync", which is "3s" not just "3").

If it makes any sense I'll carry on, any feedback will be useful.

Cheers!

Paweł

---
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 7317dc2..22e6196 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -443,6 +489,122 @@ static int __devexit virtio_mmio_remove(struct platform_device *pdev)
 
 
 
+/* Devices list parameter */
+
+#if defined(CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES)
+
+static struct device vm_cmdline_parent = {
+	.init_name = "virtio-mmio-cmdline",
+};
+
+static int vm_cmdline_parent_registered;
+static int vm_cmdline_id;
+
+static int vm_cmdline_set(const char *device,
+		const struct kernel_param *kp)
+{
+	int err;
+	struct resource resources[2] = {};
+	char *str;
+	long long int base;
+	int processed, consumed = 0;
+	struct platform_device *pdev;
+
+	resources[0].flags = IORESOURCE_MEM;
+	resources[1].flags = IORESOURCE_IRQ;
+
+	resources[0].end = memparse(device, &str) - 1;
+
+	processed = sscanf(str, "@%lli:%u%n:%d%n",
+			&base, &resources[1].start, &consumed,
+			&vm_cmdline_id, &consumed);
+
+	if (processed < 2 || processed > 3 || str[consumed])
+		return -EINVAL;
+
+	resources[0].start = base;
+	resources[0].end += base;
+	resources[1].end = resources[1].start;
+
+	if (!vm_cmdline_parent_registered) {
+		err = device_register(&vm_cmdline_parent);
+		if (err) {
+			pr_err("Failed to register parent device!\n");
+			return err;
+		}
+		vm_cmdline_parent_registered = 1;
+	}
+
+	pr_info("Registering device virtio-mmio.%d at 0x%llx-0x%llx, IRQ %d.\n",
+		       vm_cmdline_id++,
+		       (unsigned long long)resources[0].start,
+		       (unsigned long long)resources[0].end,
+		       (int)resources[1].start);
+
+	pdev = platform_device_register_resndata(&vm_cmdline_parent,
+			"virtio-mmio", vm_cmdline_id++,
+			resources, ARRAY_SIZE(resources), NULL, 0);
+	if (IS_ERR(pdev))
+		return PTR_ERR(pdev);
+
+	return 0;
+}
+
+static int vm_cmdline_get_device(struct device *dev, void *data)
+{
+	char *buffer = data;
+	unsigned int len = strlen(buffer);
+	struct platform_device *pdev = to_platform_device(dev);
+
+	snprintf(buffer + len, PAGE_SIZE - len, "0x%llx@0x%llx:%llu:%d\n",
+			pdev->resource[0].end - pdev->resource[0].start + 1ULL,
+			(unsigned long long)pdev->resource[0].start,
+			(unsigned long long)pdev->resource[1].start,
+			pdev->id);
+	return 0;
+}
+
+static int vm_cmdline_get(char *buffer, const struct kernel_param *kp)
+{
+	buffer[0] = '\0';
+	device_for_each_child(&vm_cmdline_parent, buffer,
+			vm_cmdline_get_device);
+	return strlen(buffer) + 1;
+}
+
+static struct kernel_param_ops vm_cmdline_param_ops = {
+	.set = vm_cmdline_set,
+	.get = vm_cmdline_get,
+};
+
+module_param_cb_late(device, &vm_cmdline_param_ops, NULL, S_IRUSR);
+
+static int vm_unregister_cmdline_device(struct device *dev,
+		void *data)
+{
+	platform_device_unregister(to_platform_device(dev));
+
+	return 0;
+}
+
+static void vm_unregister_cmdline_devices(void)
+{
+	if (vm_cmdline_parent_registered) {
+		device_for_each_child(&vm_cmdline_parent, NULL,
+				vm_unregister_cmdline_device);
+		device_unregister(&vm_cmdline_parent);
+		vm_cmdline_parent_registered = 0;
+	}
+}
+
+#else
+
+static void virtio_mmio_unregister_cmdline_devices(void)
+{
+}
+
+#endif
+
 /* Platform driver */
 
 static struct of_device_id virtio_mmio_match[] = {
@@ -469,6 +631,7 @@ static int __init virtio_mmio_init(void)
 static void __exit virtio_mmio_exit(void)
 {
 	platform_driver_unregister(&virtio_mmio_driver);
+	vm_unregister_cmdline_devices();
 }
 
 module_init(virtio_mmio_init);
diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h
index 7939f63..fccf0cb 100644
--- a/include/linux/moduleparam.h
+++ b/include/linux/moduleparam.h
@@ -48,7 +48,8 @@ struct kernel_param_ops {
 };
 
 /* Flag bits for kernel_param.flags */
-#define KPARAM_ISBOOL		2
+#define KPARAM_ISBOOL		(1 << 1)
+#define KPARAM_LATE		(1 << 2)
 
 struct kernel_param {
 	const char *name;
@@ -132,7 +133,13 @@ struct kparam_array
  */
 #define module_param_cb(name, ops, arg, perm)				      \
 	__module_param_call(MODULE_PARAM_PREFIX,			      \
-			    name, ops, arg, __same_type((arg), bool *), perm)
+			    name, ops, arg, 				      \
+			    __same_type((arg), bool *), 0, perm)
+
+#define module_param_cb_late(name, ops, arg, perm)			      \
+	__module_param_call(MODULE_PARAM_PREFIX,			      \
+			    name, ops, arg,				      \
+			    __same_type((arg), bool *), 1, perm)
 
 /* On alpha, ia64 and ppc64 relocations to global data cannot go into
    read-only sections (which is part of respective UNIX ABI on these
@@ -146,7 +153,7 @@ struct kparam_array
 
 /* This is the fundamental function for registering boot/module
    parameters. */
-#define __module_param_call(prefix, name, ops, arg, isbool, perm)	\
+#define __module_param_call(prefix, name, ops, arg, isbool, late, perm)	\
 	/* Default value instead of permissions? */			\
 	static int __param_perm_check_##name __attribute__((unused)) =	\
 	BUILD_BUG_ON_ZERO((perm) < 0 || (perm) > 0777 || ((perm) & 2))	\
@@ -155,7 +162,8 @@ struct kparam_array
 	static struct kernel_param __moduleparam_const __param_##name	\
 	__used								\
     __attribute__ ((unused,__section__ ("__param"),aligned(sizeof(void *)))) \
-	= { __param_str_##name, ops, perm, isbool ? KPARAM_ISBOOL : 0,	\
+	= { __param_str_##name, ops, perm, 				\
+	    (isbool ? KPARAM_ISBOOL : 0) | (late ? KPARAM_LATE : 0),	\
 	    { arg } }
 
 /* Obsolete - use module_param_cb() */
@@ -165,6 +173,7 @@ struct kparam_array
 	__module_param_call(MODULE_PARAM_PREFIX,			\
 			    name, &__param_ops_##name, arg,		\
 			    __same_type(arg, bool *),			\
+			    0,					\
 			    (perm) + sizeof(__check_old_set_param(set))*0)
 
 /* We don't get oldget: it's often a new-style param_get_uint, etc. */
@@ -246,7 +255,7 @@ static inline void __kernel_param_unlock(void)
 #define core_param(name, var, type, perm)				\
 	param_check_##type(name, &(var));				\
 	__module_param_call("", name, &param_ops_##type,		\
-			    &var, __same_type(var, bool), perm)
+			    &var, __same_type(var, bool), 0, perm)
 #endif /* !MODULE */
 
 /**
@@ -264,7 +273,7 @@ static inline void __kernel_param_unlock(void)
 		= { len, string };					\
 	__module_param_call(MODULE_PARAM_PREFIX, name,			\
 			    &param_ops_string,				\
-			    .str = &__param_string_##name, 0, perm);	\
+			    .str = &__param_string_##name, 0, 0, perm);	\
 	__MODULE_PARM_TYPE(name, "string")
 
 /**
@@ -292,6 +301,8 @@ extern int parse_args(const char *name,
 		      char *args,
 		      const struct kernel_param *params,
 		      unsigned num,
+		      u16 flags_mask,
+		      u16 flags,
 		      int (*unknown)(char *param, char *val));
 
 /* Called by module remove. */
@@ -402,7 +413,7 @@ extern int param_get_invbool(char *buffer, const struct kernel_param *kp);
 	__module_param_call(MODULE_PARAM_PREFIX, name,			\
 			    &param_array_ops,				\
 			    .arr = &__param_arr_##name,			\
-			    __same_type(array[0], bool), perm);		\
+			    __same_type(array[0], bool), 0, perm);	\
 	__MODULE_PARM_TYPE(name, "array of " #type)
 
 extern struct kernel_param_ops param_array_ops;
diff --git a/init/main.c b/init/main.c
index 217ed23..ce89a53 100644
--- a/init/main.c
+++ b/init/main.c
@@ -407,7 +407,7 @@ static int __init do_early_param(char *param, char *val)
 
 void __init parse_early_options(char *cmdline)
 {
-	parse_args("early options", cmdline, NULL, 0, do_early_param);
+	parse_args("early options", cmdline, NULL, 0, 0, 0, do_early_param);
 }
 
 /* Arch code calls this early on, or if not, just before other parsing. */
@@ -511,7 +511,7 @@ asmlinkage void __init start_kernel(void)
 	parse_early_param();
 	parse_args("Booting kernel", static_command_line, __start___param,
 		   __stop___param - __start___param,
-		   &unknown_bootoption);
+		   KPARAM_LATE, 0, &unknown_bootoption);
 
 	jump_label_init();
 
@@ -737,6 +737,22 @@ static void __init do_basic_setup(void)
 	do_initcalls();
 }
 
+static int __init ignore_unknown_bootoption(char *param, char *val)
+{
+	return 0;
+}
+
+static int __init parse_late_args(void)
+{
+	extern const struct kernel_param __start___param[], __stop___param[];
+
+	strcpy(static_command_line, saved_command_line);
+	parse_args("Late parameters", static_command_line, __start___param,
+		   __stop___param - __start___param,
+		   KPARAM_LATE, KPARAM_LATE, ignore_unknown_bootoption);
+}
+late_initcall(parse_late_args);
+
 static void __init do_pre_smp_initcalls(void)
 {
 	initcall_t *fn;
diff --git a/kernel/module.c b/kernel/module.c
index 178333c..66d3e75 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2893,7 +2893,8 @@ static struct module *load_module(void __user *umod,
 	mutex_unlock(&module_mutex);
 
 	/* Module is ready to execute: parsing args may do that. */
-	err = parse_args(mod->name, mod->args, mod->kp, mod->num_kp, NULL);
+	err = parse_args(mod->name, mod->args, mod->kp, mod->num_kp,
+			 0, 0, NULL);
 	if (err < 0)
 		goto unlink;
 
diff --git a/kernel/params.c b/kernel/params.c
index 65aae11..560345c 100644
--- a/kernel/params.c
+++ b/kernel/params.c
@@ -94,6 +94,8 @@ static int parse_one(char *param,
 		     char *val,
 		     const struct kernel_param *params,
 		     unsigned num_params,
+		     int flags_mask,
+		     int flags,
 		     int (*handle_unknown)(char *param, char *val))
 {
 	unsigned int i;
@@ -102,6 +104,8 @@ static int parse_one(char *param,
 	/* Find parameter */
 	for (i = 0; i < num_params; i++) {
 		if (parameq(param, params[i].name)) {
+			if ((params[i].flags & flags_mask) != flags)
+				return 0;
 			/* No one handled NULL, so do it here. */
 			if (!val && params[i].ops->set != param_set_bool)
 				return -EINVAL;
@@ -180,6 +184,8 @@ int parse_args(const char *name,
 	       char *args,
 	       const struct kernel_param *params,
 	       unsigned num,
+	       u16 flags_mask,
+	       u16 flags,
 	       int (*unknown)(char *param, char *val))
 {
 	char *param, *val;
@@ -195,7 +201,8 @@ int parse_args(const char *name,
 
 		args = next_arg(args, &param, &val);
 		irq_was_disabled = irqs_disabled();
-		ret = parse_one(param, val, params, num, unknown);
+		ret = parse_one(param, val, params, num,
+			        flags_mask, flags, unknown);
 		if (irq_was_disabled && !irqs_disabled()) {
 			printk(KERN_WARNING "parse_args(): option '%s' enabled "
 					"irq's!\n", param);



_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related

* Re: [PATCH v3 2/3] hvc_init(): Enforce one-time initialization.
From: Miche Baker-Harvey @ 2011-11-29 17:04 UTC (permalink / raw)
  To: Amit Shah
  Cc: Stephen Rothwell, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, linux-kernel, virtualization,
	Anton Blanchard, Mike Waychison, ppc-dev, Greg Kroah-Hartman,
	Eric Northrup
In-Reply-To: <20111129142149.GE2822@amit-x200.redhat.com>

Amit,
We aren't using either QEMU or kvmtool, but we are using KVM.  All
theissues we are seeing happen when we try to establish multiple
virtioconsoles at boot time.  The command line isn't relevant, but I
cantell you the protocol that's passing between the host (kvm) and
theguest (see the end of this message).
We do go through the control_work_handler(), but it's not
providingsynchronization.  Here's a trace of the
control_work_handler() andhandle_control_message() calls; note that
there are two concurrentcalls to control_work_handler().
I decorated control_work_handler() with a "lifetime" marker, and
passedthis value to handle_control_message(), so we can see which
control
messages are being handled from which instance of
thecontrol_work_handler() thread.
Notice that we enter control_work_handler() a second time before
thehandling of the second PORT_ADD message is complete. The
firstCONSOLE_PORT message is handled by the second
control_work_handler()call, but the second is handled by the first
control_work_handler() call.

root@myubuntu:~# dmesg | grep MBH[3371055.808738]
control_work_handler #1[3371055.809372] + #1 handle_control_message
PORT_ADD[3371055.810169] - handle_control_message
PORT_ADD[3371055.810170] + #1 handle_control_message PORT_ADD
[3371055.810244]  control_work_handler #2[3371055.810245] + #2
handle_control_message CONSOLE_PORT[3371055.810246]  got
hvc_ports_mutex[3371055.810578] - handle_control_message
PORT_ADD[3371055.810579] + #1 handle_control_message
CONSOLE_PORT[3371055.810580]  trylock of hvc_ports_mutex
failed[3371055.811352]  got hvc_ports_mutex[3371055.811370] -
handle_control_message CONSOLE_PORT[3371055.816609] -
handle_control_message CONSOLE_PORT  So, I'm guessing the bug is that
there shouldn't be two instances ofcontrol_work_handler() running
simultaneously?
Thanks,
Miche

Protocol:We set up the virtio console device registers during
initialization,specifying the multiport feature, and some number of
ports, n, where nis greater than 1.
In the guest, virtcons_probe() finds our device, and successfully
sendsthe VIRTIO_CONSOLE_DEVICE_READY=1 control message.
On the host, we receive the VIRTIO_CONSOLE_DEVICE_READY message,
andsend one VIRTIO_CONSOLE_PORT_ADD message via the Receive Control
queuefor each port in the number of ports.  These messages are
notserialized: they are all sent at once.
The VIRTIO_CONSOLE_PORT_ADD messages are received
inhandle_control_message() in virtio_console.c, and add_port() is
calledfor each.  After each port is added, the guest
sendsVIRTIO_CONSOLE_PORT_READY to the host, and in these messages, the
idof the port is included in the message.
On the host, in response to each VIRTIO_CONSOLE_PORT_READY=1
message,we may send a VIRTIO_CONSOLE_CONSOLE_PORT message.
On the guest, in response to each VIRTIO_CONSOLE_CONSOLE_PORT
message,init_port_console() is called on the individual port.  The
same id isused for these messages as was used for the PORT_READY
messages.
After each successful init_port_console(), the guest
sendsVIRTIO_CONSOLE_PORT_OPEN back to the host.

On Tue, Nov 29, 2011 at 6:21 AM, Amit Shah <amit.shah@redhat.com> wrote:
> Hi,
>
> On (Mon) 28 Nov 2011 [15:40:41], Miche Baker-Harvey wrote:
>> Amit,
>>
>> You said that the work would be serialized "due to port additions
>> being on work items on the same workqueue".  I'm not seeing that.
>
> You leave a lot of questions unanswered.  What's your environment?
> Are you hot-plugging ports?  Are you using qemu?  What's your command
> line?
>
>> I've double checked this by using a mutex_trylock in
>> hvc_console::hvc_alloc(), and here's the relevant output from dmesg:
>>
>> root@myubuntu:~# dmesg | grep MBH
>> [3307216.210274] MBH: got hvc_ports_mutex
>> [3307216.210690] MBH: trylock of hvc_ports_mutex failed
>> [3307216.211143] MBH: got hvc_ports_mutex
>>
>> This is in a system with two virtio console ports, each of which is a
>> console.  I think if the VIRTIO_CONSOLE_CONSOLE_PORT message handling
>> were actually being serialized, the trylock should never fail.
>
> Agreed.
>
>> What's the source of the serialization for the workqueue items?  At
>> first reading it looks like the control_work_handler gets called for
>> each virtio interrupt?
>
> It all depends on how you add ports.  If you're using qemu, they
> happen via the control work handler.
>
>                Amit
>

^ permalink raw reply

* Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
From: Michael S. Tsirkin @ 2011-11-29 15:19 UTC (permalink / raw)
  To: Ohad Ben-Cohen; +Cc: linux-arm-kernel, linux-kernel, kvm, virtualization
In-Reply-To: <CAK=WgbZoGYVyKt=XXvnPVdh0koSBJC0zcgkLaYRkmjwBLOFA-A@mail.gmail.com>

On Tue, Nov 29, 2011 at 03:57:19PM +0200, Ohad Ben-Cohen wrote:
> > Is an extra branch faster or slower than reverting d57ed95?
> 
> Sorry, unfortunately I have no way to measure this, as I don't have
> any virtualization/x86 setup. I'm developing on ARM SoCs, where
> virtualization hardware is coming, but not here yet.

You can try using the micro-benchmark in tools/virtio/.
That does not need virtualization.

-- 
MST

^ permalink raw reply

* Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
From: Michael S. Tsirkin @ 2011-11-29 15:16 UTC (permalink / raw)
  To: Ohad Ben-Cohen; +Cc: linux-arm-kernel, linux-kernel, kvm, virtualization
In-Reply-To: <CAK=WgbZoGYVyKt=XXvnPVdh0koSBJC0zcgkLaYRkmjwBLOFA-A@mail.gmail.com>

On Tue, Nov 29, 2011 at 03:57:19PM +0200, Ohad Ben-Cohen wrote:
> On Tue, Nov 29, 2011 at 3:11 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Nov 29, 2011 at 02:31:26PM +0200, Ohad Ben-Cohen wrote:
> >> Virtio is using memory barriers to control the ordering of
> >> references to the vrings on SMP systems. When the guest is compiled
> >> with SMP support, virtio is only using SMP barriers in order to
> >> avoid incurring the overhead involved with mandatory barriers.
> >>
> >> Lately, though, virtio is being increasingly used with inter-processor
> >> communication scenarios too, which involve running two (separate)
> >> instances of operating systems on two (separate) processors, each of
> >> which might either be UP or SMP.
> >
> > Is that using virtio-mmio?
> 
> No, I'm using this:
> 
> https://lkml.org/lkml/2011/10/25/139

This mentions iommu - is there a need to use dma api to let
the firmware acess the rings? Or does it have access to all
of memory?

> > Sorry, could you pls explain what are 'two external processors'?
> > I think I know that if two x86 CPUs in an SMP system run kernels built
> > in an SMP configuration, smp_*mb barriers are enough.
> 
> Sure:
> 
> My setup is not SMP-based; it's two separate processors running in AMP
> configuration. The processors have completely different architectures,
> are not cache coherent, and only simply share some memory, which is
> used for communications using virtio as the shared memory "wire"
> protocol (i.e. we're not even doing virtualization: we have Linux on
> one processor, and some RTOS on another processor, and they use virtio
> to send and receive buffers).

I'd like to make sure I understand the memory model some more.
Is there cache snooping? If yes access from an external device
typically works mostly in the same way as smp ...


> So it's not SMP effects we're controlling; we're pretty much doing
> MMIO and must use mandatory barriers

So you put virtio rings in MMIO memory?

> (otherwise we see breakage).

Could you please give a couple of examples of breakage?

-- 
MST

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Avi Kivity @ 2011-11-29 14:58 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: markmc, kvm, linux-kernel, virtualization, Sasha Levin
In-Reply-To: <20111129145451.GD30966@redhat.com>

On 11/29/2011 04:54 PM, Michael S. Tsirkin wrote:
> > 
> > Which is actually strange, weren't indirect buffers introduced to make
> > the performance *better*? From what I see it's pretty much the
> > same/worse for virtio-blk.
>
> I know they were introduced to allow adding very large bufs.
> See 9fa29b9df32ba4db055f3977933cd0c1b8fe67cd
> Mark, you wrote the patch, could you tell us which workloads
> benefit the most from indirect bufs?
>

Indirects are really for block devices with many spindles, since there
the limiting factor is the number of requests in flight.  Network
interfaces are limited by bandwidth, it's better to increase the ring
size and use direct buffers there (so the ring size more or less
corresponds to the buffer size).

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Michael S. Tsirkin @ 2011-11-29 14:54 UTC (permalink / raw)
  To: Sasha Levin; +Cc: markmc, linux-kernel, kvm, virtualization
In-Reply-To: <1322576464.7003.6.camel@lappy>

On Tue, Nov 29, 2011 at 04:21:04PM +0200, Sasha Levin wrote:
> > > > Need to verify the effect on block too, and do some more
> > > > benchmarks. In particular we are making the ring
> > > > in effect smaller, how will this affect small packet perf
> > > > with multiple streams?
> > > 
> > > I couldn't get good block benchmarks on my hardware. They were all over
> > > the place even when I was trying to get the baseline. I'm guessing my
> > > disk is about to kick the bucket.
> > 
> > Try using memory as a backing store.
> 
> Here are the results from fio doing random reads:
> 
> With indirect buffers:
> Run status group 0 (all jobs):
>    READ: io=2419.7MB, aggrb=126001KB/s, minb=12887KB/s, maxb=13684KB/s, mint=18461msec, maxt=19664msec
> 
> Disk stats (read/write):
>   vda: ios=612107/0, merge=0/0, ticks=37559/0, in_queue=32723, util=76.70%
> 
> Indirect buffers disabled in the host:
> Run status group 0 (all jobs):
>    READ: io=2419.7MB, aggrb=127106KB/s, minb=12811KB/s, maxb=14557KB/s, mint=17486msec, maxt=19493msec
> 
> Disk stats (read/write):
>   vda: ios=617315/0, merge=1/0, ticks=166751/0, in_queue=162807, util=88.19%

I don't know much about this, only difference I see is that
in_queue is way higher. 

> 
> Which is actually strange, weren't indirect buffers introduced to make
> the performance *better*? From what I see it's pretty much the
> same/worse for virtio-blk.

I know they were introduced to allow adding very large bufs.
See 9fa29b9df32ba4db055f3977933cd0c1b8fe67cd
Mark, you wrote the patch, could you tell us which workloads
benefit the most from indirect bufs?

> Here's my fio test file:
> [random-read]
> rw=randread
> size=256m
> filename=/dev/vda
> ioengine=libaio
> iodepth=8
> direct=1
> invalidate=1
> numjobs=10
> > 
> > > This threshold should be dynamic and be based on the amount of avail
> > > descriptors over time, so for example, if the vring is 90% full over
> > > time the threshold will go up allowing for more indirect buffers.
> > > Currently it's static, but it's a first step to making it dynamic :)
> > > 
> > > I'll do a benchmark with small packets.
> > > 
> > > > A very simple test is to disable indirect buffers altogether.
> > > > qemu-kvm has a flag for this.
> > > > Is this an equivalent test?
> > > > If yes I'll try that.
> > > 
> > > Yes, it should be equivalent to qemu without that flag.
> > > 
> > > > 
> > > > 
> > > > > Cc: Rusty Russell <rusty@rustcorp.com.au>
> > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > Cc: virtualization@lists.linux-foundation.org
> > > > > Cc: kvm@vger.kernel.org
> > > > > Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
> > > > > ---
> > > > >  drivers/virtio/virtio_ring.c |   12 ++++++++++--
> > > > >  1 files changed, 10 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index c7a2c20..5e0ce15 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -82,6 +82,7 @@ struct vring_virtqueue
> > > > >  
> > > > >  	/* Host supports indirect buffers */
> > > > >  	bool indirect;
> > > > 
> > > > We can get rid of bool indirect now, just set indirect_threshold to 0,
> > > > right?
> > > 
> > > Yup.
> > > 
> > > > 
> > > > > +	unsigned int indirect_threshold;
> > > > 
> > > > Please add a comment. It should be something like
> > > > 'Min. number of free space in the ring to trigger direct descriptor use'
> > > 
> > > Will do.
> > > 
> > > > 
> > > > >  
> > > > >  	/* Host publishes avail event idx */
> > > > >  	bool event;
> > > > > @@ -176,8 +177,9 @@ int virtqueue_add_buf_gfp(struct virtqueue *_vq,
> > > > >  	BUG_ON(data == NULL);
> > > > >  
> > > > >  	/* If the host supports indirect descriptor tables, and we have multiple
> > > > > -	 * buffers, then go indirect. FIXME: tune this threshold */
> > > > > -	if (vq->indirect && (out + in) > 1 && vq->num_free) {
> > > > > +	 * buffers, then go indirect. */
> > > > > +	if (vq->indirect && (out + in) > 1 &&
> > > > > +	   (vq->num_free < vq->indirect_threshold)) {
> > > > 
> > > > If num_free is 0, this will allocate the buffer which is
> > > > not a good idea.
> > > > 
> > > > I think there's a regression here: with a small vq, we could
> > > > previously put in an indirect descriptor, with your patch
> > > > add_buf will fail. I think this is a real problem for block
> > > > which was the original reason indirect bufs were introduced.
> > > 
> > > I defined the threshold so at least 16 descriptors will be used as
> > > indirect buffers, so if you have a small vq theres still a solid minimum
> > > of indirect descriptors it could use.
> > 
> > Yes but request size might be > 16.
> > 
> > > > 
> > > > 
> > > > >  		head = vring_add_indirect(vq, sg, out, in, gfp);
> > > > >  		if (likely(head >= 0))
> > > > >  			goto add_head;
> > > > > @@ -485,6 +487,12 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
> > > > >  #endif
> > > > >  
> > > > >  	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
> > > > > +	/*
> > > > > +	 * Use indirect descriptors only when we have less than either 12%
> > > > > +	 * or 16 of the descriptors in the ring available.
> > > > > +	 */
> > > > > +	if (vq->indirect)
> > > > > +		vq->indirect_threshold = max(16U, num >> 3);
> > > > 
> > > > Let's add some defines at top of the file please, maybe even
> > > > a module parameter.
> > > > 
> > > > >  	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
> > > > >  
> > > > >  	/* No callback?  Tell other side not to bother us. */
> > > > > -- 
> > > > > 1.7.8.rc3
> > > 
> > > -- 
> > > 
> > > Sasha.
> 
> -- 
> 
> Sasha.

^ permalink raw reply

* [ANNOUNCE] Xen port to Cortex-A15 / ARMv7 with virt extensions
From: Stefano Stabellini @ 2011-11-29 14:53 UTC (permalink / raw)
  To: xen-devel
  Cc: linaro-dev, kvm, Stefano Stabellini, linux-kernel, virtualization,
	android-virt, embeddedxen-devel, linux-arm-kernel

Hi all,
a few weeks ago I (and a few others) started hacking on a
proof-of-concept hypervisor port to Cortex-A15 which uses and requires
ARMv7 virtualization extensions. The intention of this work was to find
out how to best support ARM v7+ on Xen. See
http://old-list-archives.xen.org/archives/html/xen-arm/2011-09/msg00013.html
for more details. 

I am pleased to announce that significant progress has been made, and
that we now have a nascent Xen port for Cortex-A15. The port is based on
xen-unstable (HG CS 8d6edc3d26d2) and written from scratch exploiting
the latest virtualization, LPAE, GIC and generic timer support in
hardware.

We started the work less than three months ago, but the port is already
capable of booting a Linux 3.0 based virtual machine (dom0) up to a
shell prompt on an ARM Architecture Envelope Model, configured to emulate
an A15-based Versatile Express. In this context, we wanted to thank ARM
for making the model available to us.
Now we are looking forward to porting the tools and running multiple
guests.

The code requires virtualization, LPAE and GIC support and therefore it
won't be able to run on anything older than a Cortex-A15.
On the other hand, thanks to this, it is very small and easy to read,
write and understand.
The implementation does not distinguish between PV and HVM guests: there
is just one type of guests that would be comparable to Linux PV on HVM
in the Xen X86 world, but with no need for Qemu emulation.
The code only requires minimal changes to the Linux kernel: just enough
to support PV drivers.

Even though we are currently targeting Versatile Express and Cortex-A15
we do intend to support other machines and other ARMv7 with
virtualization extensions CPUs.
We are also looking forward to ARMv8 and 64 bits support.

Given that porting Xen to Cortex-A15 could be done with so little code,
we believe that the best course of action is to merge it into
xen-unstable as quickly as possible. There are still few rough edges to
sort out but we should be able to produce a clean and digestible patch
series for submission to xen-devel within the next couple of months. I
hope to see the first patches going to the list as soon as possible.
We would very welcome any contributions, in the form of testing, code
reviews and, of course, patches! 


A git tree is available here:

git://xenbits.xen.org/people/sstabellini/xen-unstable.git arm

the gitweb url is the following:

http://xenbits.xen.org/gitweb/?p=people/sstabellini/xen-unstable.git/.git;a=shortlog;h=refs/heads/arm

And here is the full diff:

http://xenbits.xen.org/people/sstabellini/diff


We want to point out that this effort is in addition to Samsung's
ongoing efforts to upstream Xen ARM to xen-unstable. Samsung's XenARM
port allows virtualization of Xen on ARM CPUs prior to virtualization
extensions and supports traditional PV guests.

I would like to thank Tim Deegan and Ian Campbell: if you spend some
time reading the history, you'll see that this project wouldn't have
been possible in such a short time without great contributions from
them.

Cheers,
Stefano

^ permalink raw reply

* Re: [PATCH v3 2/3] hvc_init(): Enforce one-time initialization.
From: Amit Shah @ 2011-11-29 14:21 UTC (permalink / raw)
  To: Miche Baker-Harvey
  Cc: Stephen Rothwell, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, linux-kernel, virtualization,
	Anton Blanchard, Mike Waychison, ppc-dev, Greg Kroah-Hartman,
	Eric Northrup
In-Reply-To: <CAB8RdaoDxMCgTjuGP70bGrsrtgh=USYtU3Spo5OPtVMYAM5EzQ@mail.gmail.com>

Hi,

On (Mon) 28 Nov 2011 [15:40:41], Miche Baker-Harvey wrote:
> Amit,
> 
> You said that the work would be serialized "due to port additions
> being on work items on the same workqueue".  I'm not seeing that.

You leave a lot of questions unanswered.  What's your environment?
Are you hot-plugging ports?  Are you using qemu?  What's your command
line?

> I've double checked this by using a mutex_trylock in
> hvc_console::hvc_alloc(), and here's the relevant output from dmesg:
> 
> root@myubuntu:~# dmesg | grep MBH
> [3307216.210274] MBH: got hvc_ports_mutex
> [3307216.210690] MBH: trylock of hvc_ports_mutex failed
> [3307216.211143] MBH: got hvc_ports_mutex
> 
> This is in a system with two virtio console ports, each of which is a
> console.  I think if the VIRTIO_CONSOLE_CONSOLE_PORT message handling
> were actually being serialized, the trylock should never fail.

Agreed.

> What's the source of the serialization for the workqueue items?  At
> first reading it looks like the control_work_handler gets called for
> each virtio interrupt?

It all depends on how you add ports.  If you're using qemu, they
happen via the control work handler.

		Amit

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Sasha Levin @ 2011-11-29 14:21 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <20111129135406.GB30966@redhat.com>

On Tue, 2011-11-29 at 15:54 +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 29, 2011 at 03:34:48PM +0200, Sasha Levin wrote:
> > On Tue, 2011-11-29 at 14:56 +0200, Michael S. Tsirkin wrote:
> > > On Tue, Nov 29, 2011 at 11:33:16AM +0200, Sasha Levin wrote:
> > > > Currently if VIRTIO_RING_F_INDIRECT_DESC is enabled we will use indirect
> > > > descriptors even if we have plenty of space in the ring. This means that
> > > > we take a performance hit at all times due to the overhead of creating
> > > > indirect descriptors.
> > > 
> > > Is it the overhead of creating them or just allocating the pages?
> > 
> > My guess here is that it's the allocation since creating them is very
> > similar to creating regular descriptors.
> 
> Well, there is some formatting overhead ...

Very little. The formatting code is very similar to regular buffers.

> 
> > > The logic you propose is basically add direct as long as
> > > the ring is mostly empty. So if the problem is in allocations,
> > > one simple optimization for this one workload is add a small
> > > cache of memory to use for indirect bufs. Of course building
> > > a good API for this is where we got blocked in the past...
> > 
> > I thought the issue of using a single pool was that the sizes of
> > indirect descriptors are dynamic, so you can't use a single kmemcache
> > for all of them unless you're ok with having a bunch of wasted bytes.
> 
> If the pool size is limited, the waste is limited too, so maybe
> we are OK with that...

What would you say are the best numbers for indirect descriptor sizes
and the amount of those in a kmemcache?

> > > 
> > > > With this patch, we will use indirect descriptors only if we have less than
> > > > either 16, or 12% of the total amount of descriptors available.
> > > 
> > > One notes that this to some level conflicts with patches that change
> > > virtio net not to drain the vq before add buf, in that we are
> > > required here to drain the vq to avoid indirect.
> > 
> > You don't have to avoid indirects by all means, if the vq is so full it
> > has to resort to indirect buffers we better let him do that.
> 
> With the limited polling patches, the vq stays full all of
> the time, we only poll enough to create space for the new
> descriptor.
> It's not a must to make them work as they are not upstream,
> but worth considering.
> 
> > > 
> > > Not necessarily a serious problem, but something to keep in mind:
> > > a memory pool would not have this issue.
> > > 
> > > > 
> > > > I did basic performance benchmark on virtio-net with vhost enabled.
> > > > 
> > > > Before:
> > > > 	Recv   Send    Send
> > > > 	Socket Socket  Message  Elapsed
> > > > 	Size   Size    Size     Time     Throughput
> > > > 	bytes  bytes   bytes    secs.    10^6bits/sec
> > > > 
> > > > 	 87380  16384  16384    10.00    4563.92
> > > > 
> > > > After:
> > > > 	Recv   Send    Send
> > > > 	Socket Socket  Message  Elapsed
> > > > 	Size   Size    Size     Time     Throughput
> > > > 	bytes  bytes   bytes    secs.    10^6bits/sec
> > > > 
> > > > 	 87380  16384  16384    10.00    5353.28
> > > 
> > > Is this with the kvm tool? what kind of benchmark is this?
> > 
> > It's using the kvm tool and netperf. It's a simple TCP_STREAM test with
> > vhost enabled and using a regular TAP device to connect between guest
> > and host.
> 
> guest to host?

guest is running as server.

> 
> > > 
> > > Need to verify the effect on block too, and do some more
> > > benchmarks. In particular we are making the ring
> > > in effect smaller, how will this affect small packet perf
> > > with multiple streams?
> > 
> > I couldn't get good block benchmarks on my hardware. They were all over
> > the place even when I was trying to get the baseline. I'm guessing my
> > disk is about to kick the bucket.
> 
> Try using memory as a backing store.

Here are the results from fio doing random reads:

With indirect buffers:
Run status group 0 (all jobs):
   READ: io=2419.7MB, aggrb=126001KB/s, minb=12887KB/s, maxb=13684KB/s, mint=18461msec, maxt=19664msec

Disk stats (read/write):
  vda: ios=612107/0, merge=0/0, ticks=37559/0, in_queue=32723, util=76.70%

Indirect buffers disabled in the host:
Run status group 0 (all jobs):
   READ: io=2419.7MB, aggrb=127106KB/s, minb=12811KB/s, maxb=14557KB/s, mint=17486msec, maxt=19493msec

Disk stats (read/write):
  vda: ios=617315/0, merge=1/0, ticks=166751/0, in_queue=162807, util=88.19%

Which is actually strange, weren't indirect buffers introduced to make
the performance *better*? From what I see it's pretty much the
same/worse for virtio-blk.

Here's my fio test file:
[random-read]
rw=randread
size=256m
filename=/dev/vda
ioengine=libaio
iodepth=8
direct=1
invalidate=1
numjobs=10

> 
> > This threshold should be dynamic and be based on the amount of avail
> > descriptors over time, so for example, if the vring is 90% full over
> > time the threshold will go up allowing for more indirect buffers.
> > Currently it's static, but it's a first step to making it dynamic :)
> > 
> > I'll do a benchmark with small packets.
> > 
> > > A very simple test is to disable indirect buffers altogether.
> > > qemu-kvm has a flag for this.
> > > Is this an equivalent test?
> > > If yes I'll try that.
> > 
> > Yes, it should be equivalent to qemu without that flag.
> > 
> > > 
> > > 
> > > > Cc: Rusty Russell <rusty@rustcorp.com.au>
> > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > Cc: virtualization@lists.linux-foundation.org
> > > > Cc: kvm@vger.kernel.org
> > > > Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
> > > > ---
> > > >  drivers/virtio/virtio_ring.c |   12 ++++++++++--
> > > >  1 files changed, 10 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index c7a2c20..5e0ce15 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -82,6 +82,7 @@ struct vring_virtqueue
> > > >  
> > > >  	/* Host supports indirect buffers */
> > > >  	bool indirect;
> > > 
> > > We can get rid of bool indirect now, just set indirect_threshold to 0,
> > > right?
> > 
> > Yup.
> > 
> > > 
> > > > +	unsigned int indirect_threshold;
> > > 
> > > Please add a comment. It should be something like
> > > 'Min. number of free space in the ring to trigger direct descriptor use'
> > 
> > Will do.
> > 
> > > 
> > > >  
> > > >  	/* Host publishes avail event idx */
> > > >  	bool event;
> > > > @@ -176,8 +177,9 @@ int virtqueue_add_buf_gfp(struct virtqueue *_vq,
> > > >  	BUG_ON(data == NULL);
> > > >  
> > > >  	/* If the host supports indirect descriptor tables, and we have multiple
> > > > -	 * buffers, then go indirect. FIXME: tune this threshold */
> > > > -	if (vq->indirect && (out + in) > 1 && vq->num_free) {
> > > > +	 * buffers, then go indirect. */
> > > > +	if (vq->indirect && (out + in) > 1 &&
> > > > +	   (vq->num_free < vq->indirect_threshold)) {
> > > 
> > > If num_free is 0, this will allocate the buffer which is
> > > not a good idea.
> > > 
> > > I think there's a regression here: with a small vq, we could
> > > previously put in an indirect descriptor, with your patch
> > > add_buf will fail. I think this is a real problem for block
> > > which was the original reason indirect bufs were introduced.
> > 
> > I defined the threshold so at least 16 descriptors will be used as
> > indirect buffers, so if you have a small vq theres still a solid minimum
> > of indirect descriptors it could use.
> 
> Yes but request size might be > 16.
> 
> > > 
> > > 
> > > >  		head = vring_add_indirect(vq, sg, out, in, gfp);
> > > >  		if (likely(head >= 0))
> > > >  			goto add_head;
> > > > @@ -485,6 +487,12 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
> > > >  #endif
> > > >  
> > > >  	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
> > > > +	/*
> > > > +	 * Use indirect descriptors only when we have less than either 12%
> > > > +	 * or 16 of the descriptors in the ring available.
> > > > +	 */
> > > > +	if (vq->indirect)
> > > > +		vq->indirect_threshold = max(16U, num >> 3);
> > > 
> > > Let's add some defines at top of the file please, maybe even
> > > a module parameter.
> > > 
> > > >  	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
> > > >  
> > > >  	/* No callback?  Tell other side not to bother us. */
> > > > -- 
> > > > 1.7.8.rc3
> > 
> > -- 
> > 
> > Sasha.

-- 

Sasha.

^ permalink raw reply

* Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
From: Ohad Ben-Cohen @ 2011-11-29 13:57 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: linux-arm-kernel, linux-kernel, kvm, virtualization
In-Reply-To: <20111129131110.GC19157@redhat.com>

On Tue, Nov 29, 2011 at 3:11 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Nov 29, 2011 at 02:31:26PM +0200, Ohad Ben-Cohen wrote:
>> Virtio is using memory barriers to control the ordering of
>> references to the vrings on SMP systems. When the guest is compiled
>> with SMP support, virtio is only using SMP barriers in order to
>> avoid incurring the overhead involved with mandatory barriers.
>>
>> Lately, though, virtio is being increasingly used with inter-processor
>> communication scenarios too, which involve running two (separate)
>> instances of operating systems on two (separate) processors, each of
>> which might either be UP or SMP.
>
> Is that using virtio-mmio?

No, I'm using this:

https://lkml.org/lkml/2011/10/25/139

> Sorry, could you pls explain what are 'two external processors'?
> I think I know that if two x86 CPUs in an SMP system run kernels built
> in an SMP configuration, smp_*mb barriers are enough.

Sure:

My setup is not SMP-based; it's two separate processors running in AMP
configuration. The processors have completely different architectures,
are not cache coherent, and only simply share some memory, which is
used for communications using virtio as the shared memory "wire"
protocol (i.e. we're not even doing virtualization: we have Linux on
one processor, and some RTOS on another processor, and they use virtio
to send and receive buffers).

So it's not SMP effects we're controlling; we're pretty much doing
MMIO and must use mandatory barriers (otherwise we see breakage).

> Is an extra branch faster or slower than reverting d57ed95?

Sorry, unfortunately I have no way to measure this, as I don't have
any virtualization/x86 setup. I'm developing on ARM SoCs, where
virtualization hardware is coming, but not here yet.

> One wonders how the remote side knows enough to set this flag?

The remote side is a dedicated firmware loaded in order to boot the
other processor, so it really is intimate with the platform and its
requirements (and it publishes the virtio device features for every
supported virtio device).

>> Though I also wonder how big really is the perforamnce gain of d57ed95 ?
>
> Want to check and tell us?

Sorry, as I said, I have no way to do that... Any chance you did some
measurements back then when d57ed95 was introduced ?

I just guessed it did improve performance, otherwise you probably
wouldn't have bothered. But if not, then reverting it is surely
preferable to adding more complexity.

Thanks!
Ohad.

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Michael S. Tsirkin @ 2011-11-29 13:54 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1322573688.4395.11.camel@lappy>

On Tue, Nov 29, 2011 at 03:34:48PM +0200, Sasha Levin wrote:
> On Tue, 2011-11-29 at 14:56 +0200, Michael S. Tsirkin wrote:
> > On Tue, Nov 29, 2011 at 11:33:16AM +0200, Sasha Levin wrote:
> > > Currently if VIRTIO_RING_F_INDIRECT_DESC is enabled we will use indirect
> > > descriptors even if we have plenty of space in the ring. This means that
> > > we take a performance hit at all times due to the overhead of creating
> > > indirect descriptors.
> > 
> > Is it the overhead of creating them or just allocating the pages?
> 
> My guess here is that it's the allocation since creating them is very
> similar to creating regular descriptors.

Well, there is some formatting overhead ...

> > The logic you propose is basically add direct as long as
> > the ring is mostly empty. So if the problem is in allocations,
> > one simple optimization for this one workload is add a small
> > cache of memory to use for indirect bufs. Of course building
> > a good API for this is where we got blocked in the past...
> 
> I thought the issue of using a single pool was that the sizes of
> indirect descriptors are dynamic, so you can't use a single kmemcache
> for all of them unless you're ok with having a bunch of wasted bytes.

If the pool size is limited, the waste is limited too, so maybe
we are OK with that...

> > 
> > > With this patch, we will use indirect descriptors only if we have less than
> > > either 16, or 12% of the total amount of descriptors available.
> > 
> > One notes that this to some level conflicts with patches that change
> > virtio net not to drain the vq before add buf, in that we are
> > required here to drain the vq to avoid indirect.
> 
> You don't have to avoid indirects by all means, if the vq is so full it
> has to resort to indirect buffers we better let him do that.

With the limited polling patches, the vq stays full all of
the time, we only poll enough to create space for the new
descriptor.
It's not a must to make them work as they are not upstream,
but worth considering.

> > 
> > Not necessarily a serious problem, but something to keep in mind:
> > a memory pool would not have this issue.
> > 
> > > 
> > > I did basic performance benchmark on virtio-net with vhost enabled.
> > > 
> > > Before:
> > > 	Recv   Send    Send
> > > 	Socket Socket  Message  Elapsed
> > > 	Size   Size    Size     Time     Throughput
> > > 	bytes  bytes   bytes    secs.    10^6bits/sec
> > > 
> > > 	 87380  16384  16384    10.00    4563.92
> > > 
> > > After:
> > > 	Recv   Send    Send
> > > 	Socket Socket  Message  Elapsed
> > > 	Size   Size    Size     Time     Throughput
> > > 	bytes  bytes   bytes    secs.    10^6bits/sec
> > > 
> > > 	 87380  16384  16384    10.00    5353.28
> > 
> > Is this with the kvm tool? what kind of benchmark is this?
> 
> It's using the kvm tool and netperf. It's a simple TCP_STREAM test with
> vhost enabled and using a regular TAP device to connect between guest
> and host.

guest to host?

> > 
> > Need to verify the effect on block too, and do some more
> > benchmarks. In particular we are making the ring
> > in effect smaller, how will this affect small packet perf
> > with multiple streams?
> 
> I couldn't get good block benchmarks on my hardware. They were all over
> the place even when I was trying to get the baseline. I'm guessing my
> disk is about to kick the bucket.

Try using memory as a backing store.

> This threshold should be dynamic and be based on the amount of avail
> descriptors over time, so for example, if the vring is 90% full over
> time the threshold will go up allowing for more indirect buffers.
> Currently it's static, but it's a first step to making it dynamic :)
> 
> I'll do a benchmark with small packets.
> 
> > A very simple test is to disable indirect buffers altogether.
> > qemu-kvm has a flag for this.
> > Is this an equivalent test?
> > If yes I'll try that.
> 
> Yes, it should be equivalent to qemu without that flag.
> 
> > 
> > 
> > > Cc: Rusty Russell <rusty@rustcorp.com.au>
> > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > Cc: virtualization@lists.linux-foundation.org
> > > Cc: kvm@vger.kernel.org
> > > Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
> > > ---
> > >  drivers/virtio/virtio_ring.c |   12 ++++++++++--
> > >  1 files changed, 10 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index c7a2c20..5e0ce15 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -82,6 +82,7 @@ struct vring_virtqueue
> > >  
> > >  	/* Host supports indirect buffers */
> > >  	bool indirect;
> > 
> > We can get rid of bool indirect now, just set indirect_threshold to 0,
> > right?
> 
> Yup.
> 
> > 
> > > +	unsigned int indirect_threshold;
> > 
> > Please add a comment. It should be something like
> > 'Min. number of free space in the ring to trigger direct descriptor use'
> 
> Will do.
> 
> > 
> > >  
> > >  	/* Host publishes avail event idx */
> > >  	bool event;
> > > @@ -176,8 +177,9 @@ int virtqueue_add_buf_gfp(struct virtqueue *_vq,
> > >  	BUG_ON(data == NULL);
> > >  
> > >  	/* If the host supports indirect descriptor tables, and we have multiple
> > > -	 * buffers, then go indirect. FIXME: tune this threshold */
> > > -	if (vq->indirect && (out + in) > 1 && vq->num_free) {
> > > +	 * buffers, then go indirect. */
> > > +	if (vq->indirect && (out + in) > 1 &&
> > > +	   (vq->num_free < vq->indirect_threshold)) {
> > 
> > If num_free is 0, this will allocate the buffer which is
> > not a good idea.
> > 
> > I think there's a regression here: with a small vq, we could
> > previously put in an indirect descriptor, with your patch
> > add_buf will fail. I think this is a real problem for block
> > which was the original reason indirect bufs were introduced.
> 
> I defined the threshold so at least 16 descriptors will be used as
> indirect buffers, so if you have a small vq theres still a solid minimum
> of indirect descriptors it could use.

Yes but request size might be > 16.

> > 
> > 
> > >  		head = vring_add_indirect(vq, sg, out, in, gfp);
> > >  		if (likely(head >= 0))
> > >  			goto add_head;
> > > @@ -485,6 +487,12 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
> > >  #endif
> > >  
> > >  	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
> > > +	/*
> > > +	 * Use indirect descriptors only when we have less than either 12%
> > > +	 * or 16 of the descriptors in the ring available.
> > > +	 */
> > > +	if (vq->indirect)
> > > +		vq->indirect_threshold = max(16U, num >> 3);
> > 
> > Let's add some defines at top of the file please, maybe even
> > a module parameter.
> > 
> > >  	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
> > >  
> > >  	/* No callback?  Tell other side not to bother us. */
> > > -- 
> > > 1.7.8.rc3
> 
> -- 
> 
> Sasha.

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Sasha Levin @ 2011-11-29 13:34 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <20111129125622.GB19157@redhat.com>

On Tue, 2011-11-29 at 14:56 +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 29, 2011 at 11:33:16AM +0200, Sasha Levin wrote:
> > Currently if VIRTIO_RING_F_INDIRECT_DESC is enabled we will use indirect
> > descriptors even if we have plenty of space in the ring. This means that
> > we take a performance hit at all times due to the overhead of creating
> > indirect descriptors.
> 
> Is it the overhead of creating them or just allocating the pages?

My guess here is that it's the allocation since creating them is very
similar to creating regular descriptors.

> The logic you propose is basically add direct as long as
> the ring is mostly empty. So if the problem is in allocations,
> one simple optimization for this one workload is add a small
> cache of memory to use for indirect bufs. Of course building
> a good API for this is where we got blocked in the past...

I thought the issue of using a single pool was that the sizes of
indirect descriptors are dynamic, so you can't use a single kmemcache
for all of them unless you're ok with having a bunch of wasted bytes.

> 
> > With this patch, we will use indirect descriptors only if we have less than
> > either 16, or 12% of the total amount of descriptors available.
> 
> One notes that this to some level conflicts with patches that change
> virtio net not to drain the vq before add buf, in that we are
> required here to drain the vq to avoid indirect.

You don't have to avoid indirects by all means, if the vq is so full it
has to resort to indirect buffers we better let him do that.

> 
> Not necessarily a serious problem, but something to keep in mind:
> a memory pool would not have this issue.
> 
> > 
> > I did basic performance benchmark on virtio-net with vhost enabled.
> > 
> > Before:
> > 	Recv   Send    Send
> > 	Socket Socket  Message  Elapsed
> > 	Size   Size    Size     Time     Throughput
> > 	bytes  bytes   bytes    secs.    10^6bits/sec
> > 
> > 	 87380  16384  16384    10.00    4563.92
> > 
> > After:
> > 	Recv   Send    Send
> > 	Socket Socket  Message  Elapsed
> > 	Size   Size    Size     Time     Throughput
> > 	bytes  bytes   bytes    secs.    10^6bits/sec
> > 
> > 	 87380  16384  16384    10.00    5353.28
> 
> Is this with the kvm tool? what kind of benchmark is this?

It's using the kvm tool and netperf. It's a simple TCP_STREAM test with
vhost enabled and using a regular TAP device to connect between guest
and host.

> 
> Need to verify the effect on block too, and do some more
> benchmarks. In particular we are making the ring
> in effect smaller, how will this affect small packet perf
> with multiple streams?

I couldn't get good block benchmarks on my hardware. They were all over
the place even when I was trying to get the baseline. I'm guessing my
disk is about to kick the bucket.

This threshold should be dynamic and be based on the amount of avail
descriptors over time, so for example, if the vring is 90% full over
time the threshold will go up allowing for more indirect buffers.
Currently it's static, but it's a first step to making it dynamic :)

I'll do a benchmark with small packets.

> A very simple test is to disable indirect buffers altogether.
> qemu-kvm has a flag for this.
> Is this an equivalent test?
> If yes I'll try that.

Yes, it should be equivalent to qemu without that flag.

> 
> 
> > Cc: Rusty Russell <rusty@rustcorp.com.au>
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: virtualization@lists.linux-foundation.org
> > Cc: kvm@vger.kernel.org
> > Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
> > ---
> >  drivers/virtio/virtio_ring.c |   12 ++++++++++--
> >  1 files changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index c7a2c20..5e0ce15 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -82,6 +82,7 @@ struct vring_virtqueue
> >  
> >  	/* Host supports indirect buffers */
> >  	bool indirect;
> 
> We can get rid of bool indirect now, just set indirect_threshold to 0,
> right?

Yup.

> 
> > +	unsigned int indirect_threshold;
> 
> Please add a comment. It should be something like
> 'Min. number of free space in the ring to trigger direct descriptor use'

Will do.

> 
> >  
> >  	/* Host publishes avail event idx */
> >  	bool event;
> > @@ -176,8 +177,9 @@ int virtqueue_add_buf_gfp(struct virtqueue *_vq,
> >  	BUG_ON(data == NULL);
> >  
> >  	/* If the host supports indirect descriptor tables, and we have multiple
> > -	 * buffers, then go indirect. FIXME: tune this threshold */
> > -	if (vq->indirect && (out + in) > 1 && vq->num_free) {
> > +	 * buffers, then go indirect. */
> > +	if (vq->indirect && (out + in) > 1 &&
> > +	   (vq->num_free < vq->indirect_threshold)) {
> 
> If num_free is 0, this will allocate the buffer which is
> not a good idea.
> 
> I think there's a regression here: with a small vq, we could
> previously put in an indirect descriptor, with your patch
> add_buf will fail. I think this is a real problem for block
> which was the original reason indirect bufs were introduced.

I defined the threshold so at least 16 descriptors will be used as
indirect buffers, so if you have a small vq theres still a solid minimum
of indirect descriptors it could use.

> 
> 
> >  		head = vring_add_indirect(vq, sg, out, in, gfp);
> >  		if (likely(head >= 0))
> >  			goto add_head;
> > @@ -485,6 +487,12 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
> >  #endif
> >  
> >  	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
> > +	/*
> > +	 * Use indirect descriptors only when we have less than either 12%
> > +	 * or 16 of the descriptors in the ring available.
> > +	 */
> > +	if (vq->indirect)
> > +		vq->indirect_threshold = max(16U, num >> 3);
> 
> Let's add some defines at top of the file please, maybe even
> a module parameter.
> 
> >  	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
> >  
> >  	/* No callback?  Tell other side not to bother us. */
> > -- 
> > 1.7.8.rc3

-- 

Sasha.

^ permalink raw reply

* Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
From: Michael S. Tsirkin @ 2011-11-29 13:11 UTC (permalink / raw)
  To: Ohad Ben-Cohen; +Cc: linux-arm-kernel, linux-kernel, kvm, virtualization
In-Reply-To: <1322569886-13055-1-git-send-email-ohad@wizery.com>

On Tue, Nov 29, 2011 at 02:31:26PM +0200, Ohad Ben-Cohen wrote:
> Virtio is using memory barriers to control the ordering of
> references to the vrings on SMP systems. When the guest is compiled
> with SMP support, virtio is only using SMP barriers in order to
> avoid incurring the overhead involved with mandatory barriers.
> 
> Lately, though, virtio is being increasingly used with inter-processor
> communication scenarios too, which involve running two (separate)
> instances of operating systems on two (separate) processors, each of
> which might either be UP or SMP.

Is that using virtio-mmio? If yes, would the extra serialization belongs
at that layer?

> To control the ordering of memory references when the vrings are shared
> between two external processors, we must always use mandatory barriers.

Sorry, could you pls explain what are 'two external processors'?
I think I know that if two x86 CPUs in an SMP system run kernels built
in an SMP configuration, smp_*mb barriers are enough.

Documentation/memory-barriers.txt says:
	Mandatory barriers should not be used to control SMP effects ...
	They may, however, be used to control MMIO effects on accesses through
	relaxed memory I/O windows.

We don't control MMIO/relaxed memory I/O windows here, so what exactly
is the issue?

Could you please give an example of a setup that is currently broken?

> 
> A trivial, albeit sub-optimal, solution would be to simply revert
> commit d57ed95 "virtio: use smp_XX barriers on SMP". Obviously, though,
> that's going to have a negative impact on performance of SMP-based
> virtualization use cases.
> 
> A different approach, as demonstrated by this patch, would pick the type
> of memory barriers, in run time, according to the requirements of the
> virtio device. This way, both SMP virtualization scenarios and inter-
> processor communication use cases would run correctly, without making
> any performance compromises (except for those incurred by an additional
> branch or level of indirection).

Is an extra branch faster or slower than reverting d57ed95?

> 
> This patch introduces VIRTIO_RING_F_REMOTEPROC, a new virtio transport
> feature, which should be used by virtio devices that run on remote
> processors. The CONFIG_SMP variant of virtio_{mb, rmb, wmb} is then changed
> to use SMP barriers only if VIRTIO_RING_F_REMOTEPROC was absent.

One wonders how the remote side knows enough to set this flag?

> 
> Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com>
> ---
> Alternatively, we can also introduce some kind of virtio_mb_ops and set it
> according to the nature of the vdev with handlers that just do the right
> thing, instead of introducting that branch.
> 
> Though I also wonder how big really is the perforamnce gain of d57ed95 ?

Want to check and tell us?

>  drivers/virtio/virtio_ring.c |   78 +++++++++++++++++++++++++++++-------------
>  include/linux/virtio_ring.h  |    6 +++
>  2 files changed, 60 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index c7a2c20..cf66a2d 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -23,24 +23,6 @@
>  #include <linux/slab.h>
>  #include <linux/module.h>
>  
> -/* virtio guest is communicating with a virtual "device" that actually runs on
> - * a host processor.  Memory barriers are used to control SMP effects. */
> -#ifdef CONFIG_SMP
> -/* Where possible, use SMP barriers which are more lightweight than mandatory
> - * barriers, because mandatory barriers control MMIO effects on accesses
> - * through relaxed memory I/O windows (which virtio does not use). */
> -#define virtio_mb() smp_mb()
> -#define virtio_rmb() smp_rmb()
> -#define virtio_wmb() smp_wmb()
> -#else
> -/* We must force memory ordering even if guest is UP since host could be
> - * running on another CPU, but SMP barriers are defined to barrier() in that
> - * configuration. So fall back to mandatory barriers instead. */
> -#define virtio_mb() mb()
> -#define virtio_rmb() rmb()
> -#define virtio_wmb() wmb()
> -#endif
> -
>  #ifdef DEBUG
>  /* For development, we want to crash whenever the ring is screwed. */
>  #define BAD_RING(_vq, fmt, args...)				\
> @@ -86,6 +68,9 @@ struct vring_virtqueue
>  	/* Host publishes avail event idx */
>  	bool event;
>  
> +	/* Host runs on a remote processor */
> +	bool rproc;
> +
>  	/* Number of free buffers */
>  	unsigned int num_free;
>  	/* Head of free buffer list. */
> @@ -108,6 +93,48 @@ struct vring_virtqueue
>  	void *data[];
>  };
>  
> +/*
> + * virtio guest is communicating with a virtual "device" that may either run
> + * on the host processor, or on an external processor. The former requires
> + * memory barriers in order to control SMP effects, but the latter must
> + * use mandatory barriers.
> + */
> +#ifdef CONFIG_SMP
> +/* Where possible, use SMP barriers which are more lightweight than mandatory
> + * barriers, because mandatory barriers control MMIO effects on accesses
> + * through relaxed memory I/O windows. */
> +static inline void virtio_mb(struct vring_virtqueue *vq)
> +{
> +	if (vq->rproc)
> +		mb();
> +	else
> +		smp_mb();
> +}
> +
> +static inline void virtio_rmb(struct vring_virtqueue *vq)
> +{
> +	if (vq->rproc)
> +		rmb();
> +	else
> +		smp_rmb();
> +}
> +
> +static inline void virtio_wmb(struct vring_virtqueue *vq)
> +{
> +	if (vq->rproc)
> +		wmb();
> +	else
> +		smp_wmb();
> +}
> +#else
> +/* We must force memory ordering even if guest is UP since host could be
> + * running on another CPU, but SMP barriers are defined to barrier() in that
> + * configuration. So fall back to mandatory barriers instead. */
> +static inline void virtio_mb(struct vring_virtqueue *vq) { mb(); }
> +static inline void virtio_rmb(struct vring_virtqueue *vq) { rmb(); }
> +static inline void virtio_wmb(struct vring_virtqueue *vq) { wmb(); }
> +#endif
> +
>  #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
>  
>  /* Set up an indirect table of descriptors and add it to the queue. */
> @@ -245,14 +272,14 @@ void virtqueue_kick(struct virtqueue *_vq)
>  	START_USE(vq);
>  	/* Descriptors and available array need to be set before we expose the
>  	 * new available array entries. */
> -	virtio_wmb();
> +	virtio_wmb(vq);
>  
>  	old = vq->vring.avail->idx;
>  	new = vq->vring.avail->idx = old + vq->num_added;
>  	vq->num_added = 0;
>  
>  	/* Need to update avail index before checking if we should notify */
> -	virtio_mb();
> +	virtio_mb(vq);
>  
>  	if (vq->event ?
>  	    vring_need_event(vring_avail_event(&vq->vring), new, old) :
> @@ -314,7 +341,7 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
>  	}
>  
>  	/* Only get used array entries after they have been exposed by host. */
> -	virtio_rmb();
> +	virtio_rmb(vq);
>  
>  	i = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].id;
>  	*len = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].len;
> @@ -337,7 +364,7 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
>  	 * the read in the next get_buf call. */
>  	if (!(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)) {
>  		vring_used_event(&vq->vring) = vq->last_used_idx;
> -		virtio_mb();
> +		virtio_mb(vq);
>  	}
>  
>  	END_USE(vq);
> @@ -366,7 +393,7 @@ bool virtqueue_enable_cb(struct virtqueue *_vq)
>  	 * entry. Always do both to keep code simple. */
>  	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
>  	vring_used_event(&vq->vring) = vq->last_used_idx;
> -	virtio_mb();
> +	virtio_mb(vq);
>  	if (unlikely(more_used(vq))) {
>  		END_USE(vq);
>  		return false;
> @@ -393,7 +420,7 @@ bool virtqueue_enable_cb_delayed(struct virtqueue *_vq)
>  	/* TODO: tune this threshold */
>  	bufs = (u16)(vq->vring.avail->idx - vq->last_used_idx) * 3 / 4;
>  	vring_used_event(&vq->vring) = vq->last_used_idx + bufs;
> -	virtio_mb();
> +	virtio_mb(vq);
>  	if (unlikely((u16)(vq->vring.used->idx - vq->last_used_idx) > bufs)) {
>  		END_USE(vq);
>  		return false;
> @@ -486,6 +513,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
>  
>  	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
>  	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
> +	vq->rproc = virtio_has_feature(vdev, VIRTIO_RING_F_REMOTEPROC);
>  
>  	/* No callback?  Tell other side not to bother us. */
>  	if (!callback)
> @@ -522,6 +550,8 @@ void vring_transport_features(struct virtio_device *vdev)
>  			break;
>  		case VIRTIO_RING_F_EVENT_IDX:
>  			break;
> +		case VIRTIO_RING_F_REMOTEPROC:
> +			break;
>  		default:
>  			/* We don't understand this bit. */
>  			clear_bit(i, vdev->features);
> diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
> index 36be0f6..9839593 100644
> --- a/include/linux/virtio_ring.h
> +++ b/include/linux/virtio_ring.h
> @@ -58,6 +58,12 @@
>   * at the end of the used ring. Guest should ignore the used->flags field. */
>  #define VIRTIO_RING_F_EVENT_IDX		29
>  
> +/*
> + * The device we're talking to resides on a remote processor, so we must always
> + * use mandatory memory barriers.
> + */
> +#define VIRTIO_RING_F_REMOTEPROC	30
> +
>  /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
>  struct vring_desc {
>  	/* Address (guest-physical). */
> -- 
> 1.7.5.4

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Michael S. Tsirkin @ 2011-11-29 12:56 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1322559196-11139-1-git-send-email-levinsasha928@gmail.com>

On Tue, Nov 29, 2011 at 11:33:16AM +0200, Sasha Levin wrote:
> Currently if VIRTIO_RING_F_INDIRECT_DESC is enabled we will use indirect
> descriptors even if we have plenty of space in the ring. This means that
> we take a performance hit at all times due to the overhead of creating
> indirect descriptors.

Is it the overhead of creating them or just allocating the pages?

The logic you propose is basically add direct as long as
the ring is mostly empty. So if the problem is in allocations,
one simple optimization for this one workload is add a small
cache of memory to use for indirect bufs. Of course building
a good API for this is where we got blocked in the past...

> With this patch, we will use indirect descriptors only if we have less than
> either 16, or 12% of the total amount of descriptors available.

One notes that this to some level conflicts with patches that change
virtio net not to drain the vq before add buf, in that we are
required here to drain the vq to avoid indirect.

Not necessarily a serious problem, but something to keep in mind:
a memory pool would not have this issue.

> 
> I did basic performance benchmark on virtio-net with vhost enabled.
> 
> Before:
> 	Recv   Send    Send
> 	Socket Socket  Message  Elapsed
> 	Size   Size    Size     Time     Throughput
> 	bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 	 87380  16384  16384    10.00    4563.92
> 
> After:
> 	Recv   Send    Send
> 	Socket Socket  Message  Elapsed
> 	Size   Size    Size     Time     Throughput
> 	bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 	 87380  16384  16384    10.00    5353.28

Is this with the kvm tool? what kind of benchmark is this?

Need to verify the effect on block too, and do some more
benchmarks. In particular we are making the ring
in effect smaller, how will this affect small packet perf
with multiple streams?


A very simple test is to disable indirect buffers altogether.
qemu-kvm has a flag for this.
Is this an equivalent test?
If yes I'll try that.


> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: virtualization@lists.linux-foundation.org
> Cc: kvm@vger.kernel.org
> Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
> ---
>  drivers/virtio/virtio_ring.c |   12 ++++++++++--
>  1 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index c7a2c20..5e0ce15 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -82,6 +82,7 @@ struct vring_virtqueue
>  
>  	/* Host supports indirect buffers */
>  	bool indirect;

We can get rid of bool indirect now, just set indirect_threshold to 0,
right?

> +	unsigned int indirect_threshold;

Please add a comment. It should be something like
'Min. number of free space in the ring to trigger direct descriptor use'

>  
>  	/* Host publishes avail event idx */
>  	bool event;
> @@ -176,8 +177,9 @@ int virtqueue_add_buf_gfp(struct virtqueue *_vq,
>  	BUG_ON(data == NULL);
>  
>  	/* If the host supports indirect descriptor tables, and we have multiple
> -	 * buffers, then go indirect. FIXME: tune this threshold */
> -	if (vq->indirect && (out + in) > 1 && vq->num_free) {
> +	 * buffers, then go indirect. */
> +	if (vq->indirect && (out + in) > 1 &&
> +	   (vq->num_free < vq->indirect_threshold)) {

If num_free is 0, this will allocate the buffer which is
not a good idea.

I think there's a regression here: with a small vq, we could
previously put in an indirect descriptor, with your patch
add_buf will fail. I think this is a real problem for block
which was the original reason indirect bufs were introduced.


>  		head = vring_add_indirect(vq, sg, out, in, gfp);
>  		if (likely(head >= 0))
>  			goto add_head;
> @@ -485,6 +487,12 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
>  #endif
>  
>  	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
> +	/*
> +	 * Use indirect descriptors only when we have less than either 12%
> +	 * or 16 of the descriptors in the ring available.
> +	 */
> +	if (vq->indirect)
> +		vq->indirect_threshold = max(16U, num >> 3);

Let's add some defines at top of the file please, maybe even
a module parameter.

>  	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
>  
>  	/* No callback?  Tell other side not to bother us. */
> -- 
> 1.7.8.rc3

^ permalink raw reply

* [RFC] virtio: use mandatory barriers for remote processor vdevs
From: Ohad Ben-Cohen @ 2011-11-29 12:31 UTC (permalink / raw)
  To: virtualization, kvm; +Cc: linux-kernel, linux-arm-kernel, Michael S. Tsirkin

Virtio is using memory barriers to control the ordering of
references to the vrings on SMP systems. When the guest is compiled
with SMP support, virtio is only using SMP barriers in order to
avoid incurring the overhead involved with mandatory barriers.

Lately, though, virtio is being increasingly used with inter-processor
communication scenarios too, which involve running two (separate)
instances of operating systems on two (separate) processors, each of
which might either be UP or SMP.

To control the ordering of memory references when the vrings are shared
between two external processors, we must always use mandatory barriers.

A trivial, albeit sub-optimal, solution would be to simply revert
commit d57ed95 "virtio: use smp_XX barriers on SMP". Obviously, though,
that's going to have a negative impact on performance of SMP-based
virtualization use cases.

A different approach, as demonstrated by this patch, would pick the type
of memory barriers, in run time, according to the requirements of the
virtio device. This way, both SMP virtualization scenarios and inter-
processor communication use cases would run correctly, without making
any performance compromises (except for those incurred by an additional
branch or level of indirection).

This patch introduces VIRTIO_RING_F_REMOTEPROC, a new virtio transport
feature, which should be used by virtio devices that run on remote
processors. The CONFIG_SMP variant of virtio_{mb, rmb, wmb} is then changed
to use SMP barriers only if VIRTIO_RING_F_REMOTEPROC was absent.

Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com>
---
Alternatively, we can also introduce some kind of virtio_mb_ops and set it
according to the nature of the vdev with handlers that just do the right
thing, instead of introducting that branch.

Though I also wonder how big really is the perforamnce gain of d57ed95 ?

 drivers/virtio/virtio_ring.c |   78 +++++++++++++++++++++++++++++-------------
 include/linux/virtio_ring.h  |    6 +++
 2 files changed, 60 insertions(+), 24 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index c7a2c20..cf66a2d 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -23,24 +23,6 @@
 #include <linux/slab.h>
 #include <linux/module.h>
 
-/* virtio guest is communicating with a virtual "device" that actually runs on
- * a host processor.  Memory barriers are used to control SMP effects. */
-#ifdef CONFIG_SMP
-/* Where possible, use SMP barriers which are more lightweight than mandatory
- * barriers, because mandatory barriers control MMIO effects on accesses
- * through relaxed memory I/O windows (which virtio does not use). */
-#define virtio_mb() smp_mb()
-#define virtio_rmb() smp_rmb()
-#define virtio_wmb() smp_wmb()
-#else
-/* We must force memory ordering even if guest is UP since host could be
- * running on another CPU, but SMP barriers are defined to barrier() in that
- * configuration. So fall back to mandatory barriers instead. */
-#define virtio_mb() mb()
-#define virtio_rmb() rmb()
-#define virtio_wmb() wmb()
-#endif
-
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
 #define BAD_RING(_vq, fmt, args...)				\
@@ -86,6 +68,9 @@ struct vring_virtqueue
 	/* Host publishes avail event idx */
 	bool event;
 
+	/* Host runs on a remote processor */
+	bool rproc;
+
 	/* Number of free buffers */
 	unsigned int num_free;
 	/* Head of free buffer list. */
@@ -108,6 +93,48 @@ struct vring_virtqueue
 	void *data[];
 };
 
+/*
+ * virtio guest is communicating with a virtual "device" that may either run
+ * on the host processor, or on an external processor. The former requires
+ * memory barriers in order to control SMP effects, but the latter must
+ * use mandatory barriers.
+ */
+#ifdef CONFIG_SMP
+/* Where possible, use SMP barriers which are more lightweight than mandatory
+ * barriers, because mandatory barriers control MMIO effects on accesses
+ * through relaxed memory I/O windows. */
+static inline void virtio_mb(struct vring_virtqueue *vq)
+{
+	if (vq->rproc)
+		mb();
+	else
+		smp_mb();
+}
+
+static inline void virtio_rmb(struct vring_virtqueue *vq)
+{
+	if (vq->rproc)
+		rmb();
+	else
+		smp_rmb();
+}
+
+static inline void virtio_wmb(struct vring_virtqueue *vq)
+{
+	if (vq->rproc)
+		wmb();
+	else
+		smp_wmb();
+}
+#else
+/* We must force memory ordering even if guest is UP since host could be
+ * running on another CPU, but SMP barriers are defined to barrier() in that
+ * configuration. So fall back to mandatory barriers instead. */
+static inline void virtio_mb(struct vring_virtqueue *vq) { mb(); }
+static inline void virtio_rmb(struct vring_virtqueue *vq) { rmb(); }
+static inline void virtio_wmb(struct vring_virtqueue *vq) { wmb(); }
+#endif
+
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
 /* Set up an indirect table of descriptors and add it to the queue. */
@@ -245,14 +272,14 @@ void virtqueue_kick(struct virtqueue *_vq)
 	START_USE(vq);
 	/* Descriptors and available array need to be set before we expose the
 	 * new available array entries. */
-	virtio_wmb();
+	virtio_wmb(vq);
 
 	old = vq->vring.avail->idx;
 	new = vq->vring.avail->idx = old + vq->num_added;
 	vq->num_added = 0;
 
 	/* Need to update avail index before checking if we should notify */
-	virtio_mb();
+	virtio_mb(vq);
 
 	if (vq->event ?
 	    vring_need_event(vring_avail_event(&vq->vring), new, old) :
@@ -314,7 +341,7 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 	}
 
 	/* Only get used array entries after they have been exposed by host. */
-	virtio_rmb();
+	virtio_rmb(vq);
 
 	i = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].id;
 	*len = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].len;
@@ -337,7 +364,7 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 	 * the read in the next get_buf call. */
 	if (!(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)) {
 		vring_used_event(&vq->vring) = vq->last_used_idx;
-		virtio_mb();
+		virtio_mb(vq);
 	}
 
 	END_USE(vq);
@@ -366,7 +393,7 @@ bool virtqueue_enable_cb(struct virtqueue *_vq)
 	 * entry. Always do both to keep code simple. */
 	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
 	vring_used_event(&vq->vring) = vq->last_used_idx;
-	virtio_mb();
+	virtio_mb(vq);
 	if (unlikely(more_used(vq))) {
 		END_USE(vq);
 		return false;
@@ -393,7 +420,7 @@ bool virtqueue_enable_cb_delayed(struct virtqueue *_vq)
 	/* TODO: tune this threshold */
 	bufs = (u16)(vq->vring.avail->idx - vq->last_used_idx) * 3 / 4;
 	vring_used_event(&vq->vring) = vq->last_used_idx + bufs;
-	virtio_mb();
+	virtio_mb(vq);
 	if (unlikely((u16)(vq->vring.used->idx - vq->last_used_idx) > bufs)) {
 		END_USE(vq);
 		return false;
@@ -486,6 +513,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
 	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
+	vq->rproc = virtio_has_feature(vdev, VIRTIO_RING_F_REMOTEPROC);
 
 	/* No callback?  Tell other side not to bother us. */
 	if (!callback)
@@ -522,6 +550,8 @@ void vring_transport_features(struct virtio_device *vdev)
 			break;
 		case VIRTIO_RING_F_EVENT_IDX:
 			break;
+		case VIRTIO_RING_F_REMOTEPROC:
+			break;
 		default:
 			/* We don't understand this bit. */
 			clear_bit(i, vdev->features);
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index 36be0f6..9839593 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -58,6 +58,12 @@
  * at the end of the used ring. Guest should ignore the used->flags field. */
 #define VIRTIO_RING_F_EVENT_IDX		29
 
+/*
+ * The device we're talking to resides on a remote processor, so we must always
+ * use mandatory memory barriers.
+ */
+#define VIRTIO_RING_F_REMOTEPROC	30
+
 /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
 struct vring_desc {
 	/* Address (guest-physical). */
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Sasha Levin @ 2011-11-29  9:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: virtualization, Sasha Levin, kvm, Michael S. Tsirkin

Currently if VIRTIO_RING_F_INDIRECT_DESC is enabled we will use indirect
descriptors even if we have plenty of space in the ring. This means that
we take a performance hit at all times due to the overhead of creating
indirect descriptors.

With this patch, we will use indirect descriptors only if we have less than
either 16, or 12% of the total amount of descriptors available.

I did basic performance benchmark on virtio-net with vhost enabled.

Before:
	Recv   Send    Send
	Socket Socket  Message  Elapsed
	Size   Size    Size     Time     Throughput
	bytes  bytes   bytes    secs.    10^6bits/sec

	 87380  16384  16384    10.00    4563.92

After:
	Recv   Send    Send
	Socket Socket  Message  Elapsed
	Size   Size    Size     Time     Throughput
	bytes  bytes   bytes    secs.    10^6bits/sec

	 87380  16384  16384    10.00    5353.28

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
---
 drivers/virtio/virtio_ring.c |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index c7a2c20..5e0ce15 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -82,6 +82,7 @@ struct vring_virtqueue
 
 	/* Host supports indirect buffers */
 	bool indirect;
+	unsigned int indirect_threshold;
 
 	/* Host publishes avail event idx */
 	bool event;
@@ -176,8 +177,9 @@ int virtqueue_add_buf_gfp(struct virtqueue *_vq,
 	BUG_ON(data == NULL);
 
 	/* If the host supports indirect descriptor tables, and we have multiple
-	 * buffers, then go indirect. FIXME: tune this threshold */
-	if (vq->indirect && (out + in) > 1 && vq->num_free) {
+	 * buffers, then go indirect. */
+	if (vq->indirect && (out + in) > 1 &&
+	   (vq->num_free < vq->indirect_threshold)) {
 		head = vring_add_indirect(vq, sg, out, in, gfp);
 		if (likely(head >= 0))
 			goto add_head;
@@ -485,6 +487,12 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 #endif
 
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
+	/*
+	 * Use indirect descriptors only when we have less than either 12%
+	 * or 16 of the descriptors in the ring available.
+	 */
+	if (vq->indirect)
+		vq->indirect_threshold = max(16U, num >> 3);
 	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
 
 	/* No callback?  Tell other side not to bother us. */
-- 
1.7.8.rc3

^ permalink raw reply related

* Re: [PATCH v3 2/3] hvc_init(): Enforce one-time initialization.
From: Miche Baker-Harvey @ 2011-11-28 23:40 UTC (permalink / raw)
  To: Amit Shah
  Cc: Stephen Rothwell, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, linux-kernel, virtualization,
	Anton Blanchard, Mike Waychison, ppc-dev, Greg Kroah-Hartman,
	Eric Northrup
In-Reply-To: <20111123103852.GG16665@amit-x200.redhat.com>

Amit,

You said that the work would be serialized "due to port additions
being on work items on the same workqueue".  I'm not seeing that.
I've double checked this by using a mutex_trylock in
hvc_console::hvc_alloc(), and here's the relevant output from dmesg:

root@myubuntu:~# dmesg | grep MBH
[3307216.210274] MBH: got hvc_ports_mutex
[3307216.210690] MBH: trylock of hvc_ports_mutex failed
[3307216.211143] MBH: got hvc_ports_mutex

This is in a system with two virtio console ports, each of which is a
console.  I think if the VIRTIO_CONSOLE_CONSOLE_PORT message handling
were actually being serialized, the trylock should never fail.

What's the source of the serialization for the workqueue items?  At
first reading it looks like the control_work_handler gets called for
each virtio interrupt?

Miche


On Wed, Nov 23, 2011 at 2:38 AM, Amit Shah <amit.shah@redhat.com> wrote:
> On (Thu) 17 Nov 2011 [10:57:37], Miche Baker-Harvey wrote:
>> Rusty, Michael, Stephen, et al,
>>
>> Thanks for your comments on these patches.
>>
>> For what I'm trying to do, all three patches are necessary, but maybe
>> I'm going about it the wrong way. Your input would be appreciated.
>> I'm in no way claiming that these patches are "right", just that it's
>> working for me, and that what's in the current pool is not.
>>
>> What I'm trying to do is:
>> On X86,
>> under KVM,
>> start a virtio console device,
>> with multiple ports on the device,
>> at least one of which is also a console (as well as ttyS0).
>>
>> (Eventually, we want to be able to add virtio console ports on the
>> fly, and to have multiple virtio console ports be consoles.)
>
> Are you using kvm-tool to do this?  QEMU can already hot-plug ports
> and virtio-console (virtio-serial) devices.
>
>> When all three of the patches are in place, this works great. (By
>> great, I mean that getty's start up on all of ttyS0, hvc0 and hvc1,
>> and console output goes to ttyS0 and to hvc0.
>> "who" shows three users:  ttyS0, hvc0, and hvc1.
>> "cat /proc/consoles" shows both ttyS0 and hvc0.
>> I can use all three getty's, and console output really does appear on
>> both the consoles.
>>
>> Based on Rusty's comments, I tried removing each of the patches
>> individually. Here's what happens today. I've seen other failure modes
>> depending on what precisely I'm passing the guest.
>> There's three patches:
>> 1/3 "fix locking of vtermno"
>> 2/3 "enforce one-time initialization with hvc_init
>> "3/3 "use separate struct console * for each console"
>>
>> If I remove the "fix locking of vtermno", I only get one virtio
>> console terminal.  "who" shows the ttyS0 and the hvc0, and I can log
>> into the gettys on both. I don't get the second virtio console getty.
>> Interestingly, hvc0 shows up in /proc/consoles twice, and in fact the
>> console output is dumped twice to hvc0 (as you'd expect from looking
>> at printk.c, each line appears twice, followed by the next line.)
>
> I don't really understand why.  "fix locking of vtermno" adds locks in
> init_port_console(), which is called from add_port(), which should be
> serialised due to port additions being on work items on the same
> workqueue.  I don't see a race here.
>
>> If I remove the "enforce one-time initialization with hvc_init" patch,
>> which makes sure only a single thread is doing the hvc_init, and gates
>> anyone from continuing until it has completed, I get different
>> failures, including hangs, and dereferences of NULL pointers.
>>
>> If I remove the "use separate struct console * for each console"patch,
>> what I'm seeing now is that while all of ttyS0, hvc0, and hvc1 are
>> present with gettys running on them, of the three, only ttyS0 is a
>> console.
>
> I don't see any difference in my testing with and without these
> patches.
>
> This is how I tested with qemu:
>
> ./x86_64-softmmu/qemu-system-x86_64 -m 512 -smp 2 -chardev
> socket,path=/tmp/foo,server,nowait,id=foo -chardev
> socket,path=/tmp/bar,server,nowait,id=bar -device virtio-serial
> -device virtconsole,chardev=foo,nr=4 -device
> virtconsole,chardev=bar,nr=3 -net none  -kernel
> /home/amit/src/linux/arch/x86/boot/bzImage -append 'root=/dev/sda1
> console=tty0 console=ttyS0' -initrd /tmp/initramfs.img
> /guests/f14-nolvm.qcow2 -enable-kvm -snapshot
>
> With this setup, with and without patches, I can spawn two consoles
> via:
>
> /sbin/agetty /dev/hvc0 9600 vt100
> /sbin/agetty /dev/hvc1 9600 vt100
>
> (Strange thing is, the second one gives a 'password incorrect' error
> on login attempts, while the first one logs in fine.  I do remember
> testing multiple consoles just fine a year and a half back, so no idea
> why this isn't behaving as expected -- but it mostly looks like a
> userspace issue rather than kernel one.)
>
> As mentioned earlier, I've not looked at the hvc code, but given my
> testing results, I'd like to first understand what you're seeing and
> what your environment is.
>
>                Amit
>

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Stephen Hemminger @ 2011-11-28 17:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: Krishna Kumar2, arnd, Michael S. Tsirkin, netdev, virtualization,
	levinsasha928, davem
In-Reply-To: <4ED30D2B.10907@redhat.com>

On Mon, 28 Nov 2011 12:25:15 +0800
Jason Wang <jasowang@redhat.com> wrote:

> I'm using ixgbe for testing also, for host, its driver seems provide irq 
> affinity hint, so no binding or irqbalance is needed.

The hint is for irqbalance to use. You need to still do manual
affinity or use irqbalance.

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Sasha Levin @ 2011-11-28  9:15 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, kvm, Pawel Moll, Michael S. Tsirkin,
	Alexey Kardashevskiy, Wang Sheng-Hui, lkml - Kernel Mailing List,
	virtualization, Christian Borntraeger, Amit Shah
In-Reply-To: <87vcq5t69c.fsf@rustcorp.com.au>

On Mon, 2011-11-28 at 11:25 +1030, Rusty Russell wrote:
> I'd like to see kvmtools remove support for legacy mode altogether,
> but they probably have existing users.

While we can't simply remove it right away, instead of mixing our
implementation for both legacy and new spec in the same code we can
split the virtio-pci implementation into two:

	- virtio/virtio-pci-legacy.c
	- virtio/virtio-pci.c

At that point we can #ifdef the entire virtio-pci-legacy.c for now and
remove it at the same time legacy virtio-pci is removed from the kernel.

I think this is something very similar to what you want done in the
kernel code, so an added plus is that the usermode code will be
mirroring the kernel code - which is something we try to have in the KVM
tool :)

-- 

Sasha.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox