* Re: [PATCH] vhost-net: add module alias
From: Stephen Hemminger @ 2012-01-11 16:58 UTC (permalink / raw)
To: Michael Tokarev
Cc: netdev, virtualization, Kay Sievers, kvm, Michael S. Tsirkin
In-Reply-To: <4F0D3543.7020401@msgid.tls.msk.ru>
On Wed, 11 Jan 2012 11:07:47 +0400
Michael Tokarev <mjt@tls.msk.ru> wrote:
> On 11.01.2012 08:54, Stephen Hemminger wrote:
> > By adding the a module alias, programs (or users) won't have to explicitly
> > call modprobe. Vhost-net will always be available if built into the kernel.
> > It does require assigning a permanent minor number for depmod to work.
> > Choose one next to TUN since this driver is related to it.
>
> Why do you think a statically-allocated device number will do any good
> at all? Static /dev is gone almost completely, at least on the systems
> where whole virt stuff makes any sense, so you don't have pre-created
> vhost-net device anymore, and hence this allocation makes no sense.
> Just IMHO anyway.
The statically allocated device number is required for the udev/module
autoloading to work. Probably the udev infrastructure needs a consistent
number to hang off of.
It looks like:
* driver adds MODULE_ALIAS() for devname and character device
* depmod scans modules and creates modules.devname (in /lib/modules)
* udev uses modules.devname to autoload the module
$ /sbin/modinfo vhost_net
filename: /lib/modules/3.2.0-net+/kernel/drivers/vhost/vhost_net.ko
alias: devname:vhost-net
alias: char-major-10-201
description: Host kernel accelerator for virtio net
...
See also: https://lkml.org/lkml/2010/5/21/134
^ permalink raw reply
* Re: [PATCH] vhost-net: add module alias
From: Michael Tokarev @ 2012-01-11 17:02 UTC (permalink / raw)
To: Stephen Hemminger
Cc: netdev, virtualization, Kay Sievers, kvm, Michael S. Tsirkin
In-Reply-To: <20120111085857.0753322f@nehalam.linuxnetplumber.net>
On 11.01.2012 20:58, Stephen Hemminger wrote:
> On Wed, 11 Jan 2012 11:07:47 +0400
> Michael Tokarev <mjt@tls.msk.ru> wrote:
>
>> On 11.01.2012 08:54, Stephen Hemminger wrote:
>>> By adding the a module alias, programs (or users) won't have to explicitly
>>> call modprobe. Vhost-net will always be available if built into the kernel.
>>> It does require assigning a permanent minor number for depmod to work.
>>> Choose one next to TUN since this driver is related to it.
>>
>> Why do you think a statically-allocated device number will do any good
>> at all? Static /dev is gone almost completely, at least on the systems
>> where whole virt stuff makes any sense, so you don't have pre-created
>> vhost-net device anymore, and hence this allocation makes no sense.
>> Just IMHO anyway.
[]
> See also: https://lkml.org/lkml/2010/5/21/134
Aha. So udev pre-creates statically-allocated devnodes nowadays:
Udev will pick up the depmod created file on startup and create all the
static device nodes which the kernel modules specify, so that these modules
get automatically loaded when the device node is accessed...
This was the part I missed. Now it all looks logically.
Thanks,
/mjt
^ permalink raw reply
* Re: [PATCH] vhost-net: add module alias
From: Kay Sievers @ 2012-01-11 17:03 UTC (permalink / raw)
To: Stephen Hemminger
Cc: netdev, virtualization, Michael Tokarev, kvm, Michael S. Tsirkin
In-Reply-To: <20120111085857.0753322f@nehalam.linuxnetplumber.net>
On Wed, Jan 11, 2012 at 17:58, Stephen Hemminger <shemminger@vyatta.com> wrote:
> On Wed, 11 Jan 2012 11:07:47 +0400
> Michael Tokarev <mjt@tls.msk.ru> wrote:
>
>> On 11.01.2012 08:54, Stephen Hemminger wrote:
>> > By adding the a module alias, programs (or users) won't have to explicitly
>> > call modprobe. Vhost-net will always be available if built into the kernel.
>> > It does require assigning a permanent minor number for depmod to work.
>> > Choose one next to TUN since this driver is related to it.
>>
>> Why do you think a statically-allocated device number will do any good
>> at all?
It's totally fine to use them for single-instance devices. You are
right, enumerated devices must _never_ use any facility like that.
That would just be broken.
>> Static /dev is gone almost completely, at least on the systems
>> where whole virt stuff makes any sense, so you don't have pre-created
>> vhost-net device anymore, and hence this allocation makes no sense.
>> Just IMHO anyway.
It makes a lot of sense in this case. The kernel module files
advertise the dev_t, it's not stored anywhere else. UDev finds these
static numbers and does inplicit mkdev() for them.
> The statically allocated device number is required for the udev/module
> autoloading to work. Probably the udev infrastructure needs a consistent
> number to hang off of.
It does that properly.
Just check:
$ cat /lib/modules/$(uname -r)/modules.devname
# Device nodes to trigger on-demand module loading.
fuse fuse c10:229
btrfs btrfs-control c10:234
ppp_generic ppp c108:0
tun net/tun c10:200
uinput uinput c10:223
...
Kay
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2012-01-11 17:08 UTC (permalink / raw)
To: Anthony Liguori
Cc: Pawel Moll, Benjamin Herrenschmidt, virtualization,
Christian Borntraeger, Sasha Levin
In-Reply-To: <4F0DB2AB.8030506@codemonkey.ws>
On Wed, Jan 11, 2012 at 10:02:51AM -0600, Anthony Liguori wrote:
> On 01/11/2012 09:45 AM, Michael S. Tsirkin wrote:
> >On Wed, Jan 11, 2012 at 09:28:27AM -0600, Anthony Liguori wrote:
> >>On 01/11/2012 09:21 AM, Michael S. Tsirkin wrote:
> >>>On Wed, Jan 11, 2012 at 09:15:49AM -0600, Anthony Liguori wrote:
> >>>>>This is similar to what we have now. But it's still buggy: e.g. if guest
> >>>>>updates MAC byte by byte, we have no way to know when it's done doing
> >>>>>so.
> >>>>
> >>>>This is no different than a normal network card. You have to use a
> >>>>secondary register to trigger an update.
> >>>>
> >>>>Regards,
> >>>>
> >>>>Anthony Liguori
> >>>
> >>>Possible but doesn't let us layer nicely to allow unchanged drivers
> >>>that work with all transports (new pci, old pci, non pci).
> >>
> >>If we declare config space LE, then we have to touch all drivers.
> >>There's no way around it because the virtio API is byte-based, not
> >>word based.
> >
> >Fine but don't we want to be compatible with old hypervisors?
>
> We can modify the drivers to work either with a virtio1 or virtio2
> transport. If the only difference is that we move to word access
> instead of byte access for the config space, it's a nop because
> drivers don't rely on sub-word access today.
>
> >>This is why I'm suggesting making the virtio API (and then the
> >>virtio-pci ABI) word based. It gives us the flexibility to make
> >>endianness a property of the transport and not a property of the
> >>devices.
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >
> >Some fields are 64 bit, this is still tricky to do atomically.
> >What's the objection to using a config VQ?
>
> Then we move very far away from something that looks like a PCI
> device. The problem we're having here is specifically where we've
> deviated from what a normal PCI device would do. Fixing that by
> deviating even further seems counter intuitive to me.
>
> Regards,
>
> Anthony Liguori
Not sure what you mean. Using VQ is DMA which is pretty common for PCI.
> >
^ permalink raw reply
* [PATCH] vhost-net: add module alias (v2)
From: Stephen Hemminger @ 2012-01-11 17:16 UTC (permalink / raw)
To: Stephen Hemminger, Alan Cox
Cc: netdev, virtualization, Kay Sievers, kvm, Michael S. Tsirkin
In-Reply-To: <20120110205400.6c1cb306@nehalam.linuxnetplumber.net>
By adding the correct module alias, programs won't have to explicitly
call modprobe. Vhost-net will always be available if built into the kernel.
It does require assigning a permanent minor number for depmod to work.
Choose one next to TUN since this driver is related to it.
Also, use C99 style initialization.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
---
v2 - document minor number and make sure to not overlap
Documentation/devices.txt | 2 ++
drivers/vhost/net.c | 8 +++++---
include/linux/miscdevice.h | 1 +
3 files changed, 8 insertions(+), 3 deletions(-)
--- a/drivers/vhost/net.c 2012-01-10 10:56:58.883179194 -0800
+++ b/drivers/vhost/net.c 2012-01-10 19:48:23.650225892 -0800
@@ -856,9 +856,9 @@ static const struct file_operations vhos
};
static struct miscdevice vhost_net_misc = {
- MISC_DYNAMIC_MINOR,
- "vhost-net",
- &vhost_net_fops,
+ .minor = VHOST_NET_MINOR,
+ .name = "vhost-net",
+ .fops = &vhost_net_fops,
};
static int vhost_net_init(void)
@@ -879,3 +879,5 @@ MODULE_VERSION("0.0.1");
MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Michael S. Tsirkin");
MODULE_DESCRIPTION("Host kernel accelerator for virtio net");
+MODULE_ALIAS_MISCDEV(VHOST_NET_MINOR);
+MODULE_ALIAS("devname:vhost-net");
--- a/include/linux/miscdevice.h 2012-01-10 10:56:59.779189436 -0800
+++ b/include/linux/miscdevice.h 2012-01-11 09:13:20.803694316 -0800
@@ -42,6 +42,7 @@
#define AUTOFS_MINOR 235
#define MAPPER_CTRL_MINOR 236
#define LOOP_CTRL_MINOR 237
+#define VHOST_NET_MINOR 238
#define MISC_DYNAMIC_MINOR 255
struct device;
--- a/Documentation/devices.txt 2012-01-10 10:56:53.399116518 -0800
+++ b/Documentation/devices.txt 2012-01-11 09:12:49.251197653 -0800
@@ -447,6 +447,8 @@ Your cooperation is appreciated.
234 = /dev/btrfs-control Btrfs control device
235 = /dev/autofs Autofs control device
236 = /dev/mapper/control Device-Mapper control device
+ 237 = /dev/vhost-net Host kernel accelerator for virtio net
+
240-254 Reserved for local use
255 Reserved for MISC_DYNAMIC_MINOR
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Stefan Hajnoczi @ 2012-01-11 17:21 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Pawel Moll, Benjamin Herrenschmidt, virtualization,
Christian Borntraeger, Sasha Levin
In-Reply-To: <20120111153954.GC20570@redhat.com>
On Wed, Jan 11, 2012 at 3:39 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Jan 11, 2012 at 02:28:48PM +0000, Stefan Hajnoczi wrote:
>> On Wed, Jan 11, 2012 at 9:10 AM, Benjamin Herrenschmidt
>> <benh@kernel.crashing.org> wrote:
>> > On Wed, 2012-01-11 at 08:47 +0000, Stefan Hajnoczi wrote:
>> >>
>> >> This is also an opportunity to stop using CPU physical addresses in
>> >> the ring and instead perform DMA like a normal PCI device (use bus
>> >> addresses).
>> >
>> > Euh why ?
>>
>> Because it's a paravirt hack that ends up hitting corner cases. It's
>> not possible to do virtio-pci passthrough under nested virtualization
>> unless we use an IOMMU. Imagine passing virtio-net from L0 into the
>> L2 guest (i.e. PCI-passthrough). If virtio-pci is really "PCI" this
>> should be possible but it's not when we use physical addresses instead
>> of bus addresses.
>>
>> Stefan
>
> It won't be hard to show siginificant performance regression if
> we do this. Hard to justify for something as niche as nested virt.
For x86 this should be mostly a nop. For ppc and SPARC architectures
maybe you're right. I still think it's a design flaw because if
virtio v2 doesn't use bus addresses then it will simply not be
possible to do passthrough for nested virt and other cases we haven't
hit yet.
Stefan
^ permalink raw reply
* Re: [Xen-devel] [PATCH] xen: remove CONFIG_XEN_DOM0 compile option
From: Konrad Rzeszutek Wilk @ 2012-01-11 17:27 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: Andrew Jones, xen-devel, virtualization, Jeremy Fitzhardinge,
Stefano Stabellini
In-Reply-To: <20120111161911.GB18203@andromeda.dapyr.net>
On Wed, Jan 11, 2012 at 12:19:11PM -0400, Konrad Rzeszutek Wilk wrote:
> > > If the root complaint is that "customers think that anything set in
> > > .config is a supported feature", then the solutions are to support
> > > all
> > > the features in .config, re-educate the customers that they're wrong,
> > > or
> > > maintain a local patch to do this stuff.
> >
> > If only re-educating people was free, like preempting questions is.
> > Local patches are of course always an option, and perhaps in this
> > case it's the best one. However, I think we already made a case for
> > better xen configurability for the driver domains, so I'm not 100%
>
> Could you repost those backend patches please? At this point I am not
> sure which one we have discarded?
hm, I was thinking in terms of the XenBus ones. We had somewhere in this
converstion something about seperating the backend's from depending on
CONFIG_XEN_DOM0 as they can be run in any domain nowadays.
>
> > convinced my initial patch (making dom0 configurable) isn't worthy
> > of upstream. Also, I didn't see any comments on my v2[*] of that
> > patch, which I believe satisfies the menu complexity issue and
> > brings in more configurability. That said, I'm about to reply to
> > that patch myself, since there's an issue with it.
> >
> > Drew
> >
> > [*] http://article.gmane.org/gmane.linux.kernel.virtualization/14635
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xensource.com
> > http://lists.xensource.com/xen-devel
^ permalink raw reply
* Re: [PATCH 1/4] xen kconfig: keep XEN_XENBUS_FRONTEND builtin
From: Konrad Rzeszutek Wilk @ 2012-01-11 17:28 UTC (permalink / raw)
To: Andrew Jones
Cc: konrad, jeremy, xen-devel, virtualization, stefano.stabellini
In-Reply-To: <1326299801-7966-1-git-send-email-drjones@redhat.com>
On Wed, Jan 11, 2012 at 05:36:38PM +0100, Andrew Jones wrote:
> When XEN_XENBUS_FRONTEND gets selected as a module it can lead to
> unbootable configs. If we need it, then we should just build it in.
Hm, don't the frontends by themsevles load this module? So if you
do 'modprobe xen-pcifront' it would load this automatically?
>
> Signed-off-by: Andrew Jones <drjones@redhat.com>
> ---
> drivers/xen/Kconfig | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> index 8795480..1d24061 100644
> --- a/drivers/xen/Kconfig
> +++ b/drivers/xen/Kconfig
> @@ -118,7 +118,7 @@ config XEN_SYS_HYPERVISOR
> but will have no xen contents.
>
> config XEN_XENBUS_FRONTEND
> - tristate
> + bool
>
> config XEN_GNTDEV
> tristate "userspace grant access device driver"
> --
> 1.7.7.5
^ permalink raw reply
* Re: [PATCH 2/4 v2] xen kconfig: relax INPUT_XEN_KBDDEV_FRONTEND deps
From: Konrad Rzeszutek Wilk @ 2012-01-11 17:30 UTC (permalink / raw)
To: Andrew Jones
Cc: konrad, jeremy, xen-devel, virtualization, stefano.stabellini
In-Reply-To: <1326299801-7966-2-git-send-email-drjones@redhat.com>
On Wed, Jan 11, 2012 at 05:36:39PM +0100, Andrew Jones wrote:
> PV-on-HVM guests may want to use the xen keyboard/mouse frontend, but
> they don't use the xen frame buffer frontend. For this case it doesn't
> make much sense for INPUT_XEN_KBDDEV_FRONTEND to depend on
> XEN_FBDEV_FRONTEND. The opposite direction always makes more sense, i.e.
What is the disadvantege of keeping it as is?
> if you're using xenfb, then you'll want xenkbd. Switch the dependencies.
>
You are missing the CC to the proper maintainer.
Also, did you test this with PV and PVonHVM guests?
> Signed-off-by: Andrew Jones <drjones@redhat.com>
> ---
> drivers/input/misc/Kconfig | 2 +-
> drivers/video/Kconfig | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/input/misc/Kconfig b/drivers/input/misc/Kconfig
> index 22d875f..36c15bf 100644
> --- a/drivers/input/misc/Kconfig
> +++ b/drivers/input/misc/Kconfig
> @@ -533,7 +533,7 @@ config INPUT_CMA3000_I2C
>
> config INPUT_XEN_KBDDEV_FRONTEND
> tristate "Xen virtual keyboard and mouse support"
> - depends on XEN_FBDEV_FRONTEND
> + depends on XEN
> default y
> select XEN_XENBUS_FRONTEND
> help
> diff --git a/drivers/video/Kconfig b/drivers/video/Kconfig
> index d83e967..3e38c2f 100644
> --- a/drivers/video/Kconfig
> +++ b/drivers/video/Kconfig
> @@ -2263,7 +2263,7 @@ config FB_VIRTUAL
>
> config XEN_FBDEV_FRONTEND
> tristate "Xen virtual frame buffer support"
> - depends on FB && XEN
> + depends on FB && XEN && INPUT_XEN_KBDDEV_FRONTEND
> select FB_SYS_FILLRECT
> select FB_SYS_COPYAREA
> select FB_SYS_IMAGEBLIT
> --
> 1.7.7.5
^ permalink raw reply
* Re: [PATCH 4/4] xen kconfig: describe xen tmem in the config menu
From: Konrad Rzeszutek Wilk @ 2012-01-11 17:35 UTC (permalink / raw)
To: Andrew Jones
Cc: konrad, jeremy, xen-devel, virtualization, stefano.stabellini
In-Reply-To: <1326299801-7966-4-git-send-email-drjones@redhat.com>
On Wed, Jan 11, 2012 at 05:36:41PM +0100, Andrew Jones wrote:
> Add a description to the config menu for xen tmem.
We will have another config option asked during 'make oldconfig' right?
I thought part of this cleanup patch is to remove some of this
complexity - not make it _more_ complex.
>
> Signed-off-by: Andrew Jones <drjones@redhat.com>
> ---
> drivers/xen/Kconfig | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> index 1d24061..7e8d728 100644
> --- a/drivers/xen/Kconfig
> +++ b/drivers/xen/Kconfig
> @@ -143,7 +143,7 @@ config SWIOTLB_XEN
> select SWIOTLB
>
> config XEN_TMEM
> - bool
> + bool "Xen Transcendent Memory (tmem)"
> default y if (CLEANCACHE || FRONTSWAP)
> help
> Shim to interface in-kernel Transcendent Memory hooks
> --
> 1.7.7.5
^ permalink raw reply
* Re: [PATCH] vhost-net: add module alias (v2)
From: Michael S. Tsirkin @ 2012-01-11 18:09 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, Alan Cox, Kay Sievers, kvm, virtualization
In-Reply-To: <20120111091653.188b24ab@nehalam.linuxnetplumber.net>
On Wed, Jan 11, 2012 at 09:16:53AM -0800, Stephen Hemminger wrote:
> By adding the correct module alias, programs won't have to explicitly
> call modprobe. Vhost-net will always be available if built into the kernel.
> It does require assigning a permanent minor number for depmod to work.
> Choose one next to TUN since this driver is related to it.
>
> Also, use C99 style initialization.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
I don't mind but this needs an Ack from Alan Cox
who made it dynamic in the first place,
see 79907d89c397b8bc2e05b347ec94e928ea919d33.
> ---
> v2 - document minor number and make sure to not overlap
>
> Documentation/devices.txt | 2 ++
> drivers/vhost/net.c | 8 +++++---
> include/linux/miscdevice.h | 1 +
> 3 files changed, 8 insertions(+), 3 deletions(-)
>
> --- a/drivers/vhost/net.c 2012-01-10 10:56:58.883179194 -0800
> +++ b/drivers/vhost/net.c 2012-01-10 19:48:23.650225892 -0800
> @@ -856,9 +856,9 @@ static const struct file_operations vhos
> };
>
> static struct miscdevice vhost_net_misc = {
> - MISC_DYNAMIC_MINOR,
> - "vhost-net",
> - &vhost_net_fops,
> + .minor = VHOST_NET_MINOR,
> + .name = "vhost-net",
> + .fops = &vhost_net_fops,
> };
>
> static int vhost_net_init(void)
> @@ -879,3 +879,5 @@ MODULE_VERSION("0.0.1");
> MODULE_LICENSE("GPL v2");
> MODULE_AUTHOR("Michael S. Tsirkin");
> MODULE_DESCRIPTION("Host kernel accelerator for virtio net");
> +MODULE_ALIAS_MISCDEV(VHOST_NET_MINOR);
> +MODULE_ALIAS("devname:vhost-net");
> --- a/include/linux/miscdevice.h 2012-01-10 10:56:59.779189436 -0800
> +++ b/include/linux/miscdevice.h 2012-01-11 09:13:20.803694316 -0800
> @@ -42,6 +42,7 @@
> #define AUTOFS_MINOR 235
> #define MAPPER_CTRL_MINOR 236
> #define LOOP_CTRL_MINOR 237
> +#define VHOST_NET_MINOR 238
> #define MISC_DYNAMIC_MINOR 255
>
> struct device;
> --- a/Documentation/devices.txt 2012-01-10 10:56:53.399116518 -0800
> +++ b/Documentation/devices.txt 2012-01-11 09:12:49.251197653 -0800
> @@ -447,6 +447,8 @@ Your cooperation is appreciated.
> 234 = /dev/btrfs-control Btrfs control device
> 235 = /dev/autofs Autofs control device
> 236 = /dev/mapper/control Device-Mapper control device
> + 237 = /dev/vhost-net Host kernel accelerator for virtio net
> +
> 240-254 Reserved for local use
> 255 Reserved for MISC_DYNAMIC_MINOR
>
>
^ permalink raw reply
* Re: [PATCH] vhost-net: add module alias
From: Michael S. Tsirkin @ 2012-01-11 18:15 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: kvm, netdev, virtualization, device, alan
In-Reply-To: <20120111085426.1c1ec3bf@nehalam.linuxnetplumber.net>
On Wed, Jan 11, 2012 at 08:54:26AM -0800, Stephen Hemminger wrote:
> On Wed, 11 Jan 2012 15:43:42 +0800
> Amos Kong <kongjianjun@gmail.com> wrote:
>
> > On Wed, Jan 11, 2012 at 12:54 PM, Stephen Hemminger
> > <shemminger@vyatta.com>wrote:
> >
> > > By adding the a module alias, programs (or users) won't have to explicitly
> > > call modprobe. Vhost-net will always be available if built into the kernel.
> > > It does require assigning a permanent minor number for depmod to work.
> > > Choose one next to TUN since this driver is related to it.
> > >
> > > Also, use C99 style initialization.
> > >
> > > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> > >
> > > ---
> > > drivers/vhost/net.c | 8 +++++---
> > > include/linux/miscdevice.h | 1 +
> > > 2 files changed, 6 insertions(+), 3 deletions(-)
> > >
> :
> > /*
> > * These allocations are managed by device@lanana.org. If you use an
> > * entry that is not in assigned your entry may well be moved and
> > * reassigned, or set dynamic if a fixed value is not justified.
> > */
>
> Didn't that mailing address was ever used any more. Like many places
> in kernel, the comment looked like a historical leftover.
This was only added in 2010, see
79907d89c397b8bc2e05b347ec94e928ea919d33.
That said at least lanana.org web site seems to be down.
Alan, any idea?
--
MST
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2012-01-11 18:34 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Pawel Moll, Benjamin Herrenschmidt, virtualization,
Christian Borntraeger, Sasha Levin
In-Reply-To: <CAJSP0QVm2f+A8hFa+tx3LRcXa_oe8mn4=DhrGMnozVB8_K4OxQ@mail.gmail.com>
On Wed, Jan 11, 2012 at 05:21:53PM +0000, Stefan Hajnoczi wrote:
> On Wed, Jan 11, 2012 at 3:39 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Wed, Jan 11, 2012 at 02:28:48PM +0000, Stefan Hajnoczi wrote:
> >> On Wed, Jan 11, 2012 at 9:10 AM, Benjamin Herrenschmidt
> >> <benh@kernel.crashing.org> wrote:
> >> > On Wed, 2012-01-11 at 08:47 +0000, Stefan Hajnoczi wrote:
> >> >>
> >> >> This is also an opportunity to stop using CPU physical addresses in
> >> >> the ring and instead perform DMA like a normal PCI device (use bus
> >> >> addresses).
> >> >
> >> > Euh why ?
> >>
> >> Because it's a paravirt hack that ends up hitting corner cases. It's
> >> not possible to do virtio-pci passthrough under nested virtualization
> >> unless we use an IOMMU. Imagine passing virtio-net from L0 into the
> >> L2 guest (i.e. PCI-passthrough). If virtio-pci is really "PCI" this
> >> should be possible but it's not when we use physical addresses instead
> >> of bus addresses.
> >>
> >> Stefan
> >
> > It won't be hard to show siginificant performance regression if
> > we do this. Hard to justify for something as niche as nested virt.
>
> For x86 this should be mostly a nop.
Maybe it should, but AFAIK it isn't.
> For ppc and SPARC architectures maybe you're right. I still think
> it's a design flaw because if virtio v2 doesn't use bus addresses then
> it will simply not be possible to do passthrough for nested virt and
> other cases we haven't hit yet.
>
> Stefan
virtio-pci does not implement things like SRIOV or FLR so it
won't work anyway.
If we ever fix this, and if we really want to pass through a virtio
device (why?) using an emulated iommu seems silly - we probably want a
PV IOMMU as well.
--
MST
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Anthony Liguori @ 2012-01-11 19:42 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Pawel Moll, Benjamin Herrenschmidt, virtualization,
Christian Borntraeger, Sasha Levin
In-Reply-To: <20120111170859.GA22310@redhat.com>
On 01/11/2012 11:08 AM, Michael S. Tsirkin wrote:
>
> Not sure what you mean. Using VQ is DMA which is pretty common for PCI.
Do you know of a network device that obtains it's mac address via a DMA transaction?
Regards,
Anthony Liguori
>
>>>
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2012-01-11 20:14 UTC (permalink / raw)
To: Anthony Liguori
Cc: Pawel Moll, Benjamin Herrenschmidt, virtualization,
Christian Borntraeger, Sasha Levin
In-Reply-To: <4F0DE62F.6030805@codemonkey.ws>
On Wed, Jan 11, 2012 at 01:42:39PM -0600, Anthony Liguori wrote:
> On 01/11/2012 11:08 AM, Michael S. Tsirkin wrote:
> >
> >Not sure what you mean. Using VQ is DMA which is pretty common for PCI.
>
> Do you know of a network device that obtains it's mac address via a DMA transaction?
Sure.
See mlx4_replace_mac in drivers/net/ethernet/mellanox/mlx4/port.c
> Regards,
>
> Anthony Liguori
>
> >
> >>>
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Anthony Liguori @ 2012-01-11 20:26 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Pawel Moll, Benjamin Herrenschmidt, virtualization,
Christian Borntraeger, Sasha Levin
In-Reply-To: <20120111201436.GA27292@redhat.com>
On 01/11/2012 02:14 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 11, 2012 at 01:42:39PM -0600, Anthony Liguori wrote:
>> On 01/11/2012 11:08 AM, Michael S. Tsirkin wrote:
>>>
>>> Not sure what you mean. Using VQ is DMA which is pretty common for PCI.
>>
>> Do you know of a network device that obtains it's mac address via a DMA transaction?
>
> Sure.
> See mlx4_replace_mac in drivers/net/ethernet/mellanox/mlx4/port.c
I'd say that's a special case but I see what you're getting at here.
So what about keeping the config space read-only and using control queues for
everything else?
Regards,
Anthony Liguori
>
>> Regards,
>>
>> Anthony Liguori
>>
>>>
>>>>>
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-11 20:46 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Pawel Moll, Michael S. Tsirkin, virtualization,
Christian Borntraeger, Sasha Levin
In-Reply-To: <CAJSP0QWMBhsg9ExmTsDG7d32p=7tokCPpm5Z0DMAm2_ADvvb+g@mail.gmail.com>
On Wed, 2012-01-11 at 14:28 +0000, Stefan Hajnoczi wrote:
> On Wed, Jan 11, 2012 at 9:10 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Wed, 2012-01-11 at 08:47 +0000, Stefan Hajnoczi wrote:
> >>
> >> This is also an opportunity to stop using CPU physical addresses in
> >> the ring and instead perform DMA like a normal PCI device (use bus
> >> addresses).
> >
> > Euh why ?
>
> Because it's a paravirt hack that ends up hitting corner cases. It's
> not possible to do virtio-pci passthrough under nested virtualization
> unless we use an IOMMU. Imagine passing virtio-net from L0 into the
> L2 guest (i.e. PCI-passthrough). If virtio-pci is really "PCI" this
> should be possible but it's not when we use physical addresses instead
> of bus addresses.
Is this just an academic exercise or is there any actual value in doing
this ?
Using an iommu is going to slaugher your performance, so at the very
least it should be kept an option.
Yes, it's a paravirt "hack" as you call it but that's what virtio is all
about.... paravirt. If you prefer you can emulate a real HW device :-)
Ben.
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-11 20:50 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Pawel Moll, virtualization, Christian Borntraeger, Sasha Levin,
Anthony Liguori
In-Reply-To: <20120111151230.GA20570@redhat.com>
On Wed, 2012-01-11 at 17:12 +0200, Michael S. Tsirkin wrote:
> This is similar to what we have now. But it's still buggy: e.g. if guest
> updates MAC byte by byte, we have no way to know when it's done doing
> so.
Do like real HW, there's plenty of options:
- (better) Have a command "update MAC" sent to a ring. A command ring
would be generally useful and could replace anything you do via writing
to config space today. The advantage of having a read-only config space
is that you significantly remove the need for synchronization. You could
also have an event ring and avoid the seqlock for reading. It's MUCH
better to have a fine granularity of what actually changed that having a
generic "something is changing in the config space".
With a new ring format allowing direct data in the ring descriptor that
would be trivial.
- If you really don't like a command ring, you can have a command
"register", write the new MAC and send a command to make it "apply", but
that's not fantastic as there's going to be a possible discrepancy
between what's in the config and what's actually used.
- Have a separate "MAC write" register set with the "hot" byte beeing
the low order byte. writing to it updates the MAC.
Etc etc...
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-11 20:51 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Pawel Moll, virtualization, Christian Borntraeger, Sasha Levin,
Anthony Liguori
In-Reply-To: <20120111152129.GB20570@redhat.com>
On Wed, 2012-01-11 at 17:21 +0200, Michael S. Tsirkin wrote:
>
> Possible but doesn't let us layer nicely to allow unchanged drivers
> that work with all transports (new pci, old pci, non pci).
> Something like a command VQ would be a generic transport
> that can be hidden behind config->set(...).
I agree, a command VQ (And possibly a status VQ) would be generally
useful.
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-11 20:56 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Pawel Moll, Michael S. Tsirkin, virtualization,
Christian Borntraeger, Sasha Levin
In-Reply-To: <CAJSP0QVm2f+A8hFa+tx3LRcXa_oe8mn4=DhrGMnozVB8_K4OxQ@mail.gmail.com>
On Wed, 2012-01-11 at 17:21 +0000, Stefan Hajnoczi wrote:
> > It won't be hard to show siginificant performance regression if
> > we do this. Hard to justify for something as niche as nested virt.
>
> For x86 this should be mostly a nop.
No it won't be. Or rather, it will be as long as you map your entire
guests in the iommu, which means your entire guest will have to be
pinned and you lost swap, ksm, yadadada...
Since the only way to have non-pinned guests today on x86 is to use
virtio, you just shot yourself in the foot.
Eventually x86 will have to grow some kind of virtualized iommu or
paravir iommu to overcome that at which point you will pay that price.
> For ppc and SPARC architectures
> maybe you're right. I still think it's a design flaw because if
> virtio v2 doesn't use bus addresses then it will simply not be
> possible to do passthrough for nested virt and other cases we haven't
> hit yet.
Then don't use virtio in those cases, use real emulated HW. Seriously,
is nested virt that interesting ?
At the very least, make that use of iommu a feature or something like
that so it can be negociated down when not doing nesting which is going
to be 99% of your use cases.
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-11 20:59 UTC (permalink / raw)
To: Anthony Liguori
Cc: Pawel Moll, Michael S. Tsirkin, virtualization,
Christian Borntraeger, Sasha Levin
In-Reply-To: <4F0DE62F.6030805@codemonkey.ws>
On Wed, 2012-01-11 at 13:42 -0600, Anthony Liguori wrote:
> On 01/11/2012 11:08 AM, Michael S. Tsirkin wrote:
> >
> > Not sure what you mean. Using VQ is DMA which is pretty common for PCI.
>
> Do you know of a network device that obtains it's mac address via a DMA transaction?
I wouldn't be surprised if we could find one, but even if we don't why
is that an issue ?
Cheers,
Ben.
^ permalink raw reply
* Re: [PATCH] vhost-net: add module alias (v2)
From: Ben Hutchings @ 2012-01-11 21:00 UTC (permalink / raw)
To: Stephen Hemminger
Cc: kvm, Michael S. Tsirkin, netdev, Kay Sievers, virtualization,
Alan Cox
In-Reply-To: <20120111091653.188b24ab@nehalam.linuxnetplumber.net>
On Wed, 2012-01-11 at 09:16 -0800, Stephen Hemminger wrote:
> By adding the correct module alias, programs won't have to explicitly
> call modprobe. Vhost-net will always be available if built into the kernel.
> It does require assigning a permanent minor number for depmod to work.
> Choose one next to TUN since this driver is related to it.
>
> Also, use C99 style initialization.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>
> ---
> v2 - document minor number and make sure to not overlap
[...]
> --- a/include/linux/miscdevice.h 2012-01-10 10:56:59.779189436 -0800
> +++ b/include/linux/miscdevice.h 2012-01-11 09:13:20.803694316 -0800
> @@ -42,6 +42,7 @@
> #define AUTOFS_MINOR 235
> #define MAPPER_CTRL_MINOR 236
> #define LOOP_CTRL_MINOR 237
> +#define VHOST_NET_MINOR 238
> #define MISC_DYNAMIC_MINOR 255
>
> struct device;
> --- a/Documentation/devices.txt 2012-01-10 10:56:53.399116518 -0800
> +++ b/Documentation/devices.txt 2012-01-11 09:12:49.251197653 -0800
> @@ -447,6 +447,8 @@ Your cooperation is appreciated.
> 234 = /dev/btrfs-control Btrfs control device
> 235 = /dev/autofs Autofs control device
> 236 = /dev/mapper/control Device-Mapper control device
> + 237 = /dev/vhost-net Host kernel accelerator for virtio net
[...]
238 != 237. It looks like someone forgot to add loopctrl here.
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-11 21:02 UTC (permalink / raw)
To: Anthony Liguori
Cc: Pawel Moll, Michael S. Tsirkin, virtualization,
Christian Borntraeger, Sasha Levin
In-Reply-To: <4F0DF08F.7010706@codemonkey.ws>
On Wed, 2012-01-11 at 14:26 -0600, Anthony Liguori wrote:
>
> I'd say that's a special case but I see what you're getting at here.
>
> So what about keeping the config space read-only and using control
> queues for
> everything else?
Which is exactly what Rusty and I are proposing :-) I would go further
and eliminate the idea of a seqlock and instead of a status queue with
precise messages indicating what changed.
I would couple that with the new queue format allowing immediate data in
the descriptor to avoid having to use indirect buffers for these, which
means no allocation, no buffer pool etc... which makes everything a lot
easier to deal with as well.
We could probably have a helper library for sending control messages
which could handle waiting for a ring slot to be free (practically
always the case on control queues), writing the message, sending it and
waiting for a status queue confirmation message.
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Benjamin Herrenschmidt @ 2012-01-11 21:13 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Christian Borntraeger, Sasha Levin, Pawel Moll, virtualization
In-Reply-To: <20120111102129.GC20988@redhat.com>
On Wed, 2012-01-11 at 12:21 +0200, Michael S. Tsirkin wrote:
>
> > BenH also convinced me we should finally make the config space LE if
> > we're going to change things. Since PCI is the most common transport,
> > guest-endian confuses people. And it sucks for really weird machines.
>
> Are we going to keep guest endian for e.g. virtio net header?
> If yes the benefit of switching config space is not that big.
> And changes in devices would affect non-PCI transports.
I think the concept of "guest endian" is broken by design. What does
that mean when running for example an ARM or a ppc 440 "guest" which
could be either endian ? Since you can't hard code your guest endian how
do you obtain/negociate it ? Also you now have to deal with dual endian
in the host, makes everything trickier.
Just make everything LE.
> Quite possibly all or some of these things help performance
> but do we have to change the spec before we have experimental
> proof?
Well, I would argue that the network driver world has proven countless
times that those are good ideas :-) But by all mean, let's do a
prototype implementation with virtio-net for example and bench it.
I don't think you need a single ring. For multiqueue net, you definitely
want multiple rings and you do want rings to remain uni-directional.
One other thing that can be useful is to separate the completion ring
from the actual ring of DMA descriptors, making the former completely
read-only by the guest and the later completely read only by the host.
For example take the ehea ethernet rx model. It has 3 rx "rings" per
queue. One contains the completions, it's a toggle-valid model so we
never write back to clear valid, it contains infos from the parser, the
tokenID of the packet and the index as to where in which ring the data
is, wich is either inline in the completion ring (small packet), header
inline & data in a data ring or completely in a data ring. Then you have
two data rings which are simply rings of SG list entries (more or less).
We typically pre-populate the data rings with skb's for 1500 and 9000
bytes packets. Small packets come in immediately in the completion ring,
and large packets via the data ring.
That's just -an- example. There's many other to take inspiration from.
Network folks have beaten to death the problem of ring efficiency vs.
CPU caches.
> > Moreover, I think we should make all these changes at once (at least, in
> > the spec). That makes it a big change, and it'll take longer to
> > develop, but makes it easy in the long run to differentiate legacy and
> > modern virtio.
> >
> > Thoughts?
> > Rusty.
>
> BTW this seems to be the reverse from what you have in Mar 2001,
> see 87mxkjls61.fsf@rustcorp.com.au :)
That was 10 years ago...
> I am much less concerned with what we do for configuration,
> but I do not believe we have learned all performance lessons
> from virtio ring1.
Maybe we have learned some more since then ? :-)
> Is there any reason why we shouldn't be
> able to experiment with inline within virtio1 and see
> whether that gets us anything?
> If we do a bunch of changes to the ring at once, we can't
> figure out what's right, what's wrong, or back out of
> mistakes later.
>
> Since there are non PCI transports that use the ring,
> we really shouldn't make both the configuration and
> the ring changes depend on the same feature bit.
Another advantage of inline data is that it makes things a lot easier
for cases where only small amount of data need to be exchanged, such as
control/status rings, maybe virtio-tty (which I'm working on), etc...
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
From: Michael S. Tsirkin @ 2012-01-11 21:58 UTC (permalink / raw)
To: Anthony Liguori
Cc: Pawel Moll, Benjamin Herrenschmidt, virtualization,
Christian Borntraeger, Sasha Levin
In-Reply-To: <4F0DF08F.7010706@codemonkey.ws>
On Wed, Jan 11, 2012 at 02:26:55PM -0600, Anthony Liguori wrote:
> On 01/11/2012 02:14 PM, Michael S. Tsirkin wrote:
> >On Wed, Jan 11, 2012 at 01:42:39PM -0600, Anthony Liguori wrote:
> >>On 01/11/2012 11:08 AM, Michael S. Tsirkin wrote:
> >>>
> >>>Not sure what you mean. Using VQ is DMA which is pretty common for PCI.
> >>
> >>Do you know of a network device that obtains it's mac address via a DMA transaction?
> >
> >Sure.
> >See mlx4_replace_mac in drivers/net/ethernet/mellanox/mlx4/port.c
>
> I'd say that's a special case but I see what you're getting at here.
>
> So what about keeping the config space read-only and using control
> queues for everything else?
Not just read-only - constant.
> Regards,
>
> Anthony Liguori
>
> >
> >>Regards,
> >>
> >>Anthony Liguori
> >>
> >>>
> >>>>>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox