* Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
From: Jesse Barnes @ 2008-12-17 18:51 UTC (permalink / raw)
To: Rose, Gregory V
Cc: randy.dunlap@oracle.com, grundler@parisc-linux.org,
achiang@hp.com, matthew@wil.cx, Greg KH, rdreier@cisco.com,
Jike Song, linux-kernel@vger.kernel.org, horms@verge.net.au,
kvm@vger.kernel.org, linux-pci@vger.kernel.org, mingo@elte.hu,
virtualization@lists.linux-foundation.org, yinghai@kernel.org,
bjorn.helgaas@hp.com
In-Reply-To: <43F901BD926A4E43B106BF17856F07554B525811@orsmsx508.amr.corp.intel.com>
On Wednesday, December 17, 2008 8:44 am Rose, Gregory V wrote:
> As noted in the attached email to the netdev list, we (e1000_devel) will
> support the API.
Do you think you'll have those changes ready for 2.6.29? Would merging core
SR-IOV support now make that any more likely?
Thanks,
Jesse
^ permalink raw reply
* Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
From: Greg KH @ 2008-12-17 17:51 UTC (permalink / raw)
To: Rose, Gregory V
Cc: randy.dunlap@oracle.com, grundler@parisc-linux.org,
achiang@hp.com, matthew@wil.cx, linux-pci@vger.kernel.org,
rdreier@cisco.com, Jike Song, Jesse Barnes,
linux-kernel@vger.kernel.org, horms@verge.net.au,
kvm@vger.kernel.org, mingo@elte.hu,
virtualization@lists.linux-foundation.org, yinghai@kernel.org,
bjorn.helgaas@hp.com
In-Reply-To: <43F901BD926A4E43B106BF17856F07554B525811@orsmsx508.amr.corp.intel.com>
A: No.
Q: Should I include quotations after my reply?
On Wed, Dec 17, 2008 at 08:44:03AM -0800, Rose, Gregory V wrote:
> As noted in the attached email to the netdev list, we (e1000_devel) will support the API.
Great, will you have patches for the existing e1000 drivers soon to use
it? Or will it be a while before they are available?
As it is, the one posted user of this api is for a driver that has been
rejected, so as there are no users of the api, I feel it should be
deferred until there is a user to make sure it all works and feels
proper.
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
From: Jesse Barnes @ 2008-12-17 17:27 UTC (permalink / raw)
To: Matthew Wilcox
Cc: randy.dunlap, grundler, achiang, linux-pci, rdreier, linux-kernel,
virtualization, horms, kvm, greg, mingo, yinghai, bjorn.helgaas
In-Reply-To: <20081217141542.GB19967@parisc-linux.org>
On Wednesday, December 17, 2008 6:15 am Matthew Wilcox wrote:
> On Tue, Dec 16, 2008 at 03:23:53PM -0800, Jesse Barnes wrote:
> > I applied 1-9 to my linux-next branch; and at least patch #10 needs a
> > respin,
>
> I still object to #2. We should have the flexibility to have 'struct
> resource's that are not in this array in the pci_dev. I would like to
> see the SR-IOV resources _not_ in this array (and indeed, I'd like to
> see PCI bridges keep their producer resources somewhere other than in
> this array). I accept that there are still some problems with this, but
> patch #2 moves us further from being able to achieve this goal, not
> closer.
Yeah, I can see what you mean here... but on the other hand it makes the
existing code a bit clearer (no extra args), and really it doesn't push us
*that* much further from non-pci_dev tied resources. Any patches in that
direction will just get a few lines bigger, that's all.
But I agree that eventually we may want to have non-pci_dev resource lists,
especially if we start adding advanced host bridge drivers or something.
--
Jesse Barnes, Intel Open Source Technology Center
^ permalink raw reply
* RE: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
From: Rose, Gregory V @ 2008-12-17 16:44 UTC (permalink / raw)
To: Greg KH, Jike Song
Cc: randy.dunlap@oracle.com, grundler@parisc-linux.org,
achiang@hp.com, matthew@wil.cx, linux-pci@vger.kernel.org,
rdreier@cisco.com, linux-kernel@vger.kernel.org, Jesse Barnes,
virtualization@lists.linux-foundation.org, horms@verge.net.au,
kvm@vger.kernel.org, mingo@elte.hu, yinghai@kernel.org,
bjorn.helgaas@hp.com
In-Reply-To: <20081217060608.GA12618@kroah.com>
[-- Attachment #1: Type: text/plain, Size: 1875 bytes --]
As noted in the attached email to the netdev list, we (e1000_devel) will support the API.
- Greg Rose
-----Original Message-----
From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf Of Greg KH
Sent: Tuesday, December 16, 2008 10:06 PM
To: Jike Song
Cc: Jesse Barnes; Zhao, Yu; linux-pci@vger.kernel.org; achiang@hp.com; bjorn.helgaas@hp.com; grundler@parisc-linux.org; mingo@elte.hu; matthew@wil.cx; randy.dunlap@oracle.com; rdreier@cisco.com; horms@verge.net.au; yinghai@kernel.org; linux-kernel@vger.kernel.org; kvm@vger.kernel.org; virtualization@lists.linux-foundation.org
Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
On Wed, Dec 17, 2008 at 10:37:54AM +0800, Jike Song wrote:
> Jesse Barnes wrote:
> > Given a respin of 10-13 I think it's reasonable to merge this into 2.6.29, but
> > I'd be much happier about it if we got some driver code along with it, so as
> > not to have an unused interface sitting around for who knows how many
> > releases. Is that reasonable? Do you know if any of the corresponding PF/VF
> > driver bits are ready yet?
>
> Hi Jesse,
>
> Yu Zhao has posted a patch set with subject "SR-IOV driver example"
> at November 26, which illustrated the usage of SR-IOV API in Intel 82576 VF/PF
> drivers;-)
Yes, but that driver was soundly rejected by the network driver
maintainers, so I wouldn't go around showing that as your primary
example of how to use this interface :)
The point is valid, I don't think these apis should go into the tree
without a driver or some other code using them. Otherwise they make no
sense at all to have in-tree.
thanks,
greg k-h
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[-- Attachment #2: Type: message/rfc822, Size: 5600 bytes --]
From: "Kirsher, Jeffrey T" <jeffrey.t.kirsher@intel.com>
To: "Zhao, Yu" <yu.zhao@intel.com>
Cc: "linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>, "greg@kroah.com" <greg@kroah.com>, "matthew@wil.cx" <matthew@wil.cx>, "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, "kvm@vger.kernel.org" <kvm@vger.kernel.org>, "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>, "virtualization@lists.linux-foundation.org" <virtualization@lists.linux-foundation.org>, "netdev@vger.kernel.org" <netdev@vger.kernel.org>, "Nakajima, Jun" <jun.nakajima@intel.com>, "Rose, Gregory V" <gregory.v.rose@intel.com>
Subject: Re: [SR-IOV driver example 0/3 resend] introduction
Date: Tue, 2 Dec 2008 19:12:18 -0800
Message-ID: <9929d2390812021912ua5ddeafo3ae4a5759ffc4c4e@mail.gmail.com>
On Tue, Dec 2, 2008 at 1:27 AM, Yu Zhao <yu.zhao@intel.com> wrote:
> SR-IOV drivers of Intel 82576 NIC are available. There are two parts
> of the drivers: Physical Function driver and Virtual Function driver.
> The PF driver is based on the IGB driver and is used to control PF to
> allocate hardware specific resources and interface with the SR-IOV core.
> The VF driver is a new NIC driver that is same as the traditional PCI
> device driver. It works in both the host and the guest (Xen and KVM)
> environment.
>
> These two drivers are testing versions and they are *only* intended to
> show how to use SR-IOV API.
>
> Intel 82576 NIC specification can be found at:
> http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf
>
> [SR-IOV driver example 0/3 resend] introduction
> [SR-IOV driver example 1/3 resend] PF driver: hardware specific operations
> [SR-IOV driver example 2/3 resend] PF driver: integrate with SR-IOV core
> [SR-IOV driver example 3/3 resend] VF driver: an independent PCI NIC driver
> --
>
First of all, we (e1000-devel) do support the SR-IOV API.
With that said, NAK on the driver changes. We were not involved in
these changes and are currently working on a version of the drivers
that will make them acceptable for kernel inclusion.
--
Cheers,
Jeff
[-- Attachment #3: Type: text/plain, Size: 184 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: Gleb Natapov @ 2008-12-17 14:31 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: netdev, kvm, David Miller, Anthony Liguori, virtualization
In-Reply-To: <20081216212532.GA15360@ioremap.net>
[-- Attachment #1: Type: text/plain, Size: 1040 bytes --]
On Wed, Dec 17, 2008 at 12:25:32AM +0300, Evgeniy Polyakov wrote:
> On Tue, Dec 16, 2008 at 08:57:27AM +0200, Gleb Natapov (gleb@redhat.com) wrote:
> > > Another approach is to implement that virtio backend with netlink based
> > > userspace interface (like using connector or genetlink). This does not
> > > differ too much from what you have with special socket family, but at
> > > least it does not duplicate existing functionality of
> > > userspace-kernelspace communications.
> > >
> > I implemented vmchannel using connector initially (the downside is that
> > message can be dropped). Is this more expectable for upstream? The
> > implementation was 300 lines of code.
>
> Hard to tell, it depends on implementation. But if things are good, I
> have no objections as connector maintainer :)
>
Here it is. Sorry it is not in patch format yet, but it gives a
general idea of how it looks. The problem with connector is that
we need a different IDX for each channel and there is no way
to allocate them dynamically.
--
Gleb.
[-- Attachment #2: vmchannel_connector.c --]
[-- Type: text/x-csrc, Size: 6653 bytes --]
/*
 * Copyright (c) 2008 Red Hat, Inc.
 *
 * Author(s): Gleb Natapov <gleb@redhat.com>
 */

#include <linux/module.h>
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/random.h>
#include <linux/interrupt.h>
#include <linux/connector.h>
#include <linux/virtio.h>
#include <linux/scatterlist.h>
#include <linux/virtio_config.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include "vmchannel_connector.h"

/* Single global device instance; the connector callback has no context. */
static struct vmchannel_dev vmc_dev;

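/*
 * Post one receive buffer to the host. Returns 1 if the buffer was queued,
 * 0 if the queue was full (the buffer is freed in that case).
 */
static int add_recq_buf(struct vmchannel_dev *vmc, struct vmchannel_hdr *hdr)
{
	struct scatterlist sg[2];

	/*
	 * sg_init_table() sets up both entries; use sg_set_buf() afterwards
	 * so the end marker stays on the last entry only.
	 */
	sg_init_table(sg, 2);
	sg_set_buf(&sg[0], hdr, sizeof(struct vmchannel_desc));
	sg_set_buf(&sg[1], hdr->msg.data, MAX_PACKET_LEN);

	if (!vmc->rq->vq_ops->add_buf(vmc->rq, sg, 0, 2, hdr))
		return 1;

	kfree(hdr);
	return 0;
}
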
static int try_fill_recvq(struct vmchannel_dev *vmc)
{
	int num = 0;

	for (;;) {
		struct vmchannel_hdr *hdr;

		hdr = kmalloc(sizeof(*hdr) + MAX_PACKET_LEN, GFP_KERNEL);
		if (unlikely(!hdr))
			break;

		if (!add_recq_buf(vmc, hdr))
			break;

		num++;
	}

	if (num)
		vmc->rq->vq_ops->kick(vmc->rq);

	return num;
}

/* Tasklet: forward every received buffer to userspace via connector. */
static void vmchannel_recv(unsigned long data)
{
	struct vmchannel_dev *vmc = (struct vmchannel_dev *)data;
	struct vmchannel_hdr *hdr;
	unsigned int len;
	int posted = 0;

	while ((hdr = vmc->rq->vq_ops->get_buf(vmc->rq, &len))) {
		hdr->msg.len = le32_to_cpu(hdr->desc.len);
		len -= sizeof(struct vmchannel_desc);
		if (hdr->msg.len == len) {
			hdr->msg.id.idx = VMCHANNEL_CONNECTOR_IDX;
			hdr->msg.id.val = le32_to_cpu(hdr->desc.id);
			hdr->msg.seq = vmc->seq++;
			hdr->msg.ack = random32();

			cn_netlink_send(&hdr->msg, VMCHANNEL_CONNECTOR_IDX,
					GFP_ATOMIC);
		} else {
			dev_printk(KERN_ERR, &vmc->vdev->dev,
				   "wrong length in received descriptor"
				   " (%d instead of %u)\n", hdr->msg.len,
				   len);
		}

		/* Hand the buffer back to the host for reuse. */
		posted += add_recq_buf(vmc, hdr);
	}

	if (posted)
		vmc->rq->vq_ops->kick(vmc->rq);
}

static void recvq_notify(struct virtqueue *recvq)
{
	struct vmchannel_dev *vmc = recvq->vdev->priv;

	tasklet_schedule(&vmc->tasklet);
}

static void cleanup_sendq(struct vmchannel_dev *vmc)
{
	char *buf;
	unsigned int len;

	spin_lock(&vmc->sq_lock);
	while ((buf = vmc->sq->vq_ops->get_buf(vmc->sq, &len)))
		kfree(buf);
	spin_unlock(&vmc->sq_lock);
}

static void sendq_notify(struct virtqueue *sendq)
{
	struct vmchannel_dev *vmc = sendq->vdev->priv;

	cleanup_sendq(vmc);
}

/* Connector callback: copy a userspace message into the send queue. */
static void vmchannel_cn_callback(void *data)
{
	struct vmchannel_desc *desc;
	struct cn_msg *msg = data;
	struct scatterlist sg;
	char *buf;
	int err;
	unsigned long flags;

	desc = kmalloc(msg->len + sizeof(*desc), GFP_KERNEL);
	if (!desc)
		return;

	desc->id = cpu_to_le32(msg->id.val);
	desc->len = cpu_to_le32(msg->len);

	/* The payload follows the descriptor in the same allocation. */
	buf = (char *)(desc + 1);
	memcpy(buf, msg->data, msg->len);

	sg_init_one(&sg, desc, msg->len + sizeof(*desc));

	spin_lock_irqsave(&vmc_dev.sq_lock, flags);
	err = vmc_dev.sq->vq_ops->add_buf(vmc_dev.sq, &sg, 1, 0, desc);
	if (err)
		kfree(desc);
	else
		vmc_dev.sq->vq_ops->kick(vmc_dev.sq);
	spin_unlock_irqrestore(&vmc_dev.sq_lock, flags);
}

static int vmchannel_probe(struct virtio_device *vdev)
{
	struct vmchannel_dev *vmc = &vmc_dev;
	struct cb_id cn_id;
	int r, i;
	__le32 count;
	unsigned offset;

	cn_id.idx = VMCHANNEL_CONNECTOR_IDX;

	vdev->priv = vmc;
	vmc->vdev = vdev;

	/* The first config field is the little-endian channel count. */
	vdev->config->get(vdev, 0, &count, sizeof(count));
	vmc->channel_count = le32_to_cpu(count);
	if (vmc->channel_count == 0) {
		dev_printk(KERN_ERR, &vdev->dev, "No channels present\n");
		return -ENODEV;
	}

	pr_debug("vmchannel: %d channel(s) detected\n", vmc->channel_count);

	vmc->channels = kcalloc(vmc->channel_count,
				sizeof(struct vmchannel_info), GFP_KERNEL);
	if (!vmc->channels)
		return -ENOMEM;

	/* Each channel entry is: le32 id, le32 name length, name bytes. */
	offset = sizeof(count);
	for (i = 0; i < vmc->channel_count; i++) {
		__u32 len;
		__le32 tmp;

		vdev->config->get(vdev, offset, &tmp, 4);
		vmc->channels[i].id = le32_to_cpu(tmp);
		offset += 4;
		vdev->config->get(vdev, offset, &tmp, 4);
		len = le32_to_cpu(tmp);
		if (len > VMCHANNEL_NAME_MAX) {
			dev_printk(KERN_ERR, &vdev->dev,
				   "Wrong device configuration. "
				   "Channel name is too long\n");
			r = -ENODEV;
			goto out;
		}
		/*
		 * Allocate one extra zeroed byte so the name can be printed
		 * with %s even if the device does not NUL-terminate it.
		 */
		vmc->channels[i].name = kzalloc(len + 1, GFP_KERNEL);
		if (!vmc->channels[i].name) {
			r = -ENOMEM;
			goto out;
		}
		offset += 4;
		vdev->config->get(vdev, offset, vmc->channels[i].name, len);
		offset += len;
		pr_debug("vmchannel: found channel '%s' id %d\n",
			 vmc->channels[i].name,
			 vmc->channels[i].id);
	}

	vmc->rq = vdev->config->find_vq(vdev, 0, recvq_notify);
	if (IS_ERR(vmc->rq)) {
		r = PTR_ERR(vmc->rq);
		goto out;
	}

	vmc->sq = vdev->config->find_vq(vdev, 1, sendq_notify);
	if (IS_ERR(vmc->sq)) {
		r = PTR_ERR(vmc->sq);
		goto out;
	}
	spin_lock_init(&vmc->sq_lock);

	for (i = 0; i < vmc->channel_count; i++) {
		cn_id.val = vmc->channels[i].id;
		r = cn_add_callback(&cn_id, "vmchannel", vmchannel_cn_callback);
		if (r)
			goto cn_unreg;
	}

	tasklet_init(&vmc->tasklet, vmchannel_recv, (unsigned long)vmc);

	if (!try_fill_recvq(vmc)) {
		r = -ENOMEM;
		goto kill_task;
	}

	return 0;

kill_task:
	tasklet_kill(&vmc->tasklet);
cn_unreg:
	for (i = 0; i < vmc->channel_count; i++) {
		cn_id.val = vmc->channels[i].id;
		cn_del_callback(&cn_id);
	}
out:
	/* find_vq() may have left an ERR_PTR here; only delete real queues. */
	if (vmc->sq && !IS_ERR(vmc->sq))
		vdev->config->del_vq(vmc->sq);
	if (vmc->rq && !IS_ERR(vmc->rq))
		vdev->config->del_vq(vmc->rq);
	vmc->sq = NULL;
	vmc->rq = NULL;

	for (i = 0; i < vmc->channel_count; i++) {
		if (!vmc->channels[i].name)
			break;
		kfree(vmc->channels[i].name);
	}

	kfree(vmc->channels);
	vmc->channels = NULL;

	return r;
}

static void vmchannel_remove(struct virtio_device *vdev)
{
	struct vmchannel_dev *vmc = vdev->priv;
	struct cb_id cn_id;
	int i;

	/* Stop all the virtqueues. */
	vdev->config->reset(vdev);

	tasklet_kill(&vmc->tasklet);

	cn_id.idx = VMCHANNEL_CONNECTOR_IDX;
	for (i = 0; i < vmc->channel_count; i++) {
		cn_id.val = vmc->channels[i].id;
		cn_del_callback(&cn_id);
	}

	vdev->config->del_vq(vmc->rq);
	vdev->config->del_vq(vmc->sq);

	for (i = 0; i < vmc_dev.channel_count; i++)
		kfree(vmc_dev.channels[i].name);

	kfree(vmc_dev.channels);
}

static struct virtio_device_id id_table[] = {
	{ VIRTIO_ID_VMCHANNEL, VIRTIO_DEV_ANY_ID },
	{ 0 },
};

static struct virtio_driver virtio_vmchannel = {
	.driver.name =	"virtio-vmchannel",
	.driver.owner =	THIS_MODULE,
	.id_table =	id_table,
	.probe =	vmchannel_probe,
	.remove =	__devexit_p(vmchannel_remove),
};

static int __init init(void)
{
	return register_virtio_driver(&virtio_vmchannel);
}

static void __exit fini(void)
{
	unregister_virtio_driver(&virtio_vmchannel);
}

module_init(init);
module_exit(fini);

MODULE_AUTHOR("Gleb Natapov");
MODULE_DEVICE_TABLE(virtio, id_table);
MODULE_DESCRIPTION("Virtio vmchannel driver");
MODULE_LICENSE("GPL");
[-- Attachment #3: vmchannel_connector.h --]
[-- Type: text/x-chdr, Size: 668 bytes --]
/*
 * Copyright (c) 2008 Red Hat, Inc.
 *
 * Author(s): Gleb Natapov <gleb@redhat.com>
 */

#ifndef VMCHANNEL_H
#define VMCHANNEL_H

#define VMCHANNEL_NAME_MAX 80
#define VMCHANNEL_CONNECTOR_IDX 10
#define VIRTIO_ID_VMCHANNEL 6
#define MAX_PACKET_LEN 1024

struct vmchannel_info {
	__u32 id;
	char *name;
};

struct vmchannel_dev {
	struct virtio_device *vdev;
	struct virtqueue *rq;
	struct virtqueue *sq;
	spinlock_t sq_lock;
	struct tasklet_struct tasklet;
	__u16 channel_count;
	struct vmchannel_info *channels;
	__u32 seq;
};

/* On-the-wire descriptor; both fields are little-endian. */
struct vmchannel_desc {
	__le32 id;
	__le32 len;
};

struct vmchannel_hdr {
	struct vmchannel_desc desc;
	struct cn_msg msg;
};

#endif
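
The config-space layout that vmchannel_probe() walks (a little-endian
channel count, then for each channel an id, a name length and the name
bytes) can be modelled in plain userspace C. This is only a sketch of
that layout as read from the code above, not part of the posted driver:
parse_channels(), get_le32() and struct channel are hypothetical names,
and a flat byte buffer stands in for vdev->config->get().

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define VMCHANNEL_NAME_MAX 80	/* same limit as in vmchannel_connector.h */

/* Hypothetical userspace mirror of one parsed channel entry. */
struct channel {
	uint32_t id;
	char name[VMCHANNEL_NAME_MAX + 1];	/* always NUL-terminated */
};

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Parse a config blob laid out as:
 *   le32 count, then count times { le32 id, le32 name_len, name bytes }.
 * Returns the number of channels parsed, or -1 if the blob is malformed
 * or holds more than max entries.
 */
static int parse_channels(const uint8_t *cfg, size_t cfg_len,
			  struct channel *out, size_t max)
{
	size_t off;
	uint32_t count, i;

	if (cfg_len < 4)
		return -1;
	count = get_le32(cfg);
	off = 4;
	if (count > max)
		return -1;

	for (i = 0; i < count; i++) {
		uint32_t len;

		if (cfg_len - off < 8)
			return -1;
		out[i].id = get_le32(cfg + off);
		len = get_le32(cfg + off + 4);
		off += 8;

		/* Reject over-long names and names that run past the blob. */
		if (len > VMCHANNEL_NAME_MAX || cfg_len - off < len)
			return -1;
		memcpy(out[i].name, cfg + off, len);
		out[i].name[len] = '\0';
		off += len;
	}
	return (int)count;
}
```

Bounds are checked before every read here because the blob comes from
the host side; the driver above trusts the length it is given as long
as it does not exceed VMCHANNEL_NAME_MAX.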
^ permalink raw reply
* Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
From: Matthew Wilcox @ 2008-12-17 14:15 UTC (permalink / raw)
To: Jesse Barnes
Cc: randy.dunlap, grundler, achiang, linux-pci, rdreier, linux-kernel,
virtualization, horms, kvm, greg, mingo, yinghai, bjorn.helgaas
In-Reply-To: <200812161523.55238.jbarnes@virtuousgeek.org>
On Tue, Dec 16, 2008 at 03:23:53PM -0800, Jesse Barnes wrote:
> I applied 1-9 to my linux-next branch; and at least patch #10 needs a respin,
I still object to #2. We should have the flexibility to have 'struct
resource's that are not in this array in the pci_dev. I would like to
see the SR-IOV resources _not_ in this array (and indeed, I'd like to
see PCI bridges keep their producer resources somewhere other than in
this array). I accept that there are still some problems with this, but
patch #2 moves us further from being able to achieve this goal, not
closer.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply
* RE: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
From: Fischer, Anna @ 2008-12-17 11:42 UTC (permalink / raw)
To: Jesse Barnes, Yu Zhao
Cc: randy.dunlap@oracle.com, grundler@parisc-linux.org,
Chiang, Alexander, matthew@wil.cx, linux-pci@vger.kernel.org,
rdreier@cisco.com, linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org, horms@verge.net.au,
kvm@vger.kernel.org, greg@kroah.com, mingo@elte.hu,
yinghai@kernel.org, Helgaas, Bjorn
In-Reply-To: <200812161523.55238.jbarnes@virtuousgeek.org>
> From: linux-pci-owner@vger.kernel.org [mailto:linux-pci-
> owner@vger.kernel.org] On Behalf Of Jesse Barnes
> Sent: 16 December 2008 23:24
> To: Yu Zhao
> Cc: linux-pci@vger.kernel.org; Chiang, Alexander; Helgaas, Bjorn;
> grundler@parisc-linux.org; greg@kroah.com; mingo@elte.hu;
> matthew@wil.cx; randy.dunlap@oracle.com; rdreier@cisco.com;
> horms@verge.net.au; yinghai@kernel.org; linux-kernel@vger.kernel.org;
> kvm@vger.kernel.org; virtualization@lists.linux-foundation.org
> Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
>
> On Friday, November 21, 2008 10:36 am Yu Zhao wrote:
> > Greetings,
> >
> > Following patches are intended to support SR-IOV capability in the
> > Linux kernel. With these patches, people can turn a PCI device with
> > the capability into multiple ones from software perspective, which
> > will benefit KVM and achieve other purposes such as QoS, security,
> > and etc.
> >
> > The Physical Function and Virtual Function drivers using the SR-IOV
> > APIs will come soon!
> >
> > Major changes from v6 to v7:
> > 1, remove boot-time resource rebalancing support. (Greg KH)
> > 2, emit uevent upon the PF driver is loaded. (Greg KH)
> > 3, put SR-IOV callback function into the 'pci_driver'. (Matthew
> Wilcox)
> > 4, register SR-IOV service at the PF loading stage.
> > 5, remove unnecessary APIs (pci_iov_enable/disable).
>
> Thanks for your patience with this, Yu, I know it's been a long haul.
> :)
>
> I applied 1-9 to my linux-next branch; and at least patch #10 needs a
> respin,
> so can you re-do 10-13 as a new patch set?
>
> On re-reading the last thread, there was a lot of smoke, but very
> little fire
> afaict. The main questions I saw were:
>
> 1) do we need SR-IOV at all? why not just make each subsystem export
> devices to guests?
> This is a bit of a red herring. Nothing about SR-IOV prevents us
> from
> making subsystems more v12n friendly. And since SR-IOV is a
> hardware
> feature supported by devices these days, we should make Linux
> support it.
>
> 2) should the PF/VF drivers be the same or not?
> Again, the SR-IOV patchset and PCI spec don't dictate this. We're
> free to
> do what we want here.
>
> 3) should VF devices be represented by pci_dev structs?
> Yes. (This is an easy one :)
>
> 4) can VF devices be used on the host?
> Yet again, SR-IOV doesn't dictate this. Developers can make PF/VF
> combo
> drivers or split them, and export the resulting devices however
> they want.
> Some subsystem work may be needed to make this efficient, but SR-
> IOV
> itself is agnostic about it.
>
> So overall I didn't see many objections to the actual code in the last
> post,
> and the issues above certainly don't merit a NAK IMO...
I have two minor comments on this topic.
1) Currently the PF driver is called before the kernel initializes VFs and
their resources, and the current API does not give the PF driver an easy
way to detect whether the allocation of the VFs and their resources
has succeeded. It would be quite useful if the PF driver were notified
once the VFs have been created successfully, as it might have to do
further device-specific work *after* IOV has been enabled.
2) Configuration of SR-IOV: the current API allows VFs to be enabled and
disabled from userspace via sysfs. At the moment I am not quite clear
what exactly is supposed to control these capabilities. This could be
Linux tools or, on a virtualized system, hypervisor control tools.
One thing I am missing, though, is an in-kernel API for this, which I
think might be useful. After all, the PF driver controls the device,
and when a device error occurs (e.g. a hardware failure that only the
PF driver is able to detect, not Linux), the PF driver might have to
de-allocate all resources, shut down VFs and reset the device, or
something like that. In that case the PF driver needs a way to notify
the Linux SR-IOV code and initiate cleanup of the VFs and their
resources. At the moment this would have to go through userspace, I
believe, and I think that is not an optimal solution. Yu, do you have
an opinion on how this could be realized?
Anna
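
The two requests above (a notification to the PF driver once VFs
exist, and an in-kernel way for the PF driver to trigger VF teardown)
could look roughly like the following userspace model. Everything in
it is hypothetical and for illustration only: iov_enable(),
iov_disable(), struct pf_dev and the event codes are not taken from
the posted patch set, which removed pci_iov_enable/disable in v7.

```c
/* Hypothetical event codes; not part of the posted SR-IOV patches. */
enum iov_event {
	IOV_EVENT_VFS_ENABLED,	/* VFs and their resources now exist */
	IOV_EVENT_VFS_DISABLED,	/* VFs have been torn down */
};

struct pf_dev;

/* Hypothetical callback the SR-IOV core would invoke on the PF driver. */
typedef void (*iov_notify_fn)(struct pf_dev *pf, enum iov_event ev,
			      int nr_vfs);

struct pf_dev {
	int nr_vfs;		/* currently enabled VFs */
	int max_vfs;		/* limit reported by the device */
	iov_notify_fn notify;	/* set by the PF driver */
	int post_enable_done;	/* driver state, used only by this model */
};

/*
 * Model of the core enable path: create the VFs first, then tell the PF
 * driver that creation succeeded. On failure the driver is not notified.
 */
static int iov_enable(struct pf_dev *pf, int nr_vfs)
{
	if (nr_vfs <= 0 || nr_vfs > pf->max_vfs)
		return -1;
	pf->nr_vfs = nr_vfs;
	if (pf->notify)
		pf->notify(pf, IOV_EVENT_VFS_ENABLED, nr_vfs);
	return 0;
}

/*
 * Model of an in-kernel disable entry point that a PF driver could call
 * directly, e.g. after detecting a hardware failure, without a round
 * trip through userspace.
 */
static void iov_disable(struct pf_dev *pf)
{
	int old = pf->nr_vfs;

	if (!old)
		return;
	pf->nr_vfs = 0;
	if (pf->notify)
		pf->notify(pf, IOV_EVENT_VFS_DISABLED, old);
}

/* Example PF driver callback: do device-specific work after enable. */
static void example_pf_notify(struct pf_dev *pf, enum iov_event ev,
			      int nr_vfs)
{
	(void)nr_vfs;
	pf->post_enable_done = (ev == IOV_EVENT_VFS_ENABLED);
}
```

The point of the model is only the ordering: the callback fires after
the VFs exist, and the same callback reports teardown regardless of
whether userspace or the PF driver itself asked for it.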
^ permalink raw reply
* Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
From: Greg KH @ 2008-12-17 7:21 UTC (permalink / raw)
To: Zhao, Yu
Cc: randy.dunlap@oracle.com, grundler@parisc-linux.org,
achiang@hp.com, matthew@wil.cx, linux-pci@vger.kernel.org,
rdreier@cisco.com, Jike Song, Jesse Barnes,
linux-kernel@vger.kernel.org, horms@verge.net.au,
kvm@vger.kernel.org, mingo@elte.hu,
virtualization@lists.linux-foundation.org, yinghai@kernel.org,
bjorn.helgaas@hp.com
In-Reply-To: <4948A52B.7040403@intel.com>
On Wed, Dec 17, 2008 at 03:07:23PM +0800, Zhao, Yu wrote:
> Greg KH wrote:
>> On Wed, Dec 17, 2008 at 10:37:54AM +0800, Jike Song wrote:
>>> Jesse Barnes wrote:
>>>> Given a respin of 10-13 I think it's reasonable to merge this into
>>>> 2.6.29, but I'd be much happier about it if we got some driver code
>>>> along with it, so as not to have an unused interface sitting around for
>>>> who knows how many releases. Is that reasonable? Do you know if any of
>>>> the corresponding PF/VF driver bits are ready yet?
>>> Hi Jesse,
>>> Yu Zhao has posted a patch set with subject "SR-IOV driver example" at
>>> November 26, which illustrated the usage of SR-IOV API in Intel 82576
>>> VF/PF
>>> drivers;-)
>> Yes, but that driver was soundly rejected by the network driver
>> maintainers, so I wouldn't go around showing that as your primary
>> example of how to use this interface :)
>> The point is valid, I don't think these apis should go into the tree
>> without a driver or some other code using them. Otherwise they make no
>> sense at all to have in-tree.
>
> I agree the point is valid, but on another hand this is a 'the chicken &
> the egg' problem -- if we don't have the SR-IOV base, people who are
> developing PF drivers can not get their changes in-tree. Maybe they are
> holding the patches and waiting on the infrastructure... :-)
Are they? They can both go in at the same time, like almost every other
api addition to the kernel, right?
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
From: Zhao, Yu @ 2008-12-17 7:07 UTC (permalink / raw)
To: Greg KH
Cc: randy.dunlap@oracle.com, grundler@parisc-linux.org,
achiang@hp.com, matthew@wil.cx, linux-pci@vger.kernel.org,
rdreier@cisco.com, Jike Song, Jesse Barnes,
linux-kernel@vger.kernel.org, horms@verge.net.au,
kvm@vger.kernel.org, mingo@elte.hu,
virtualization@lists.linux-foundation.org, yinghai@kernel.org,
bjorn.helgaas@hp.com
In-Reply-To: <20081217060608.GA12618@kroah.com>
Greg KH wrote:
> On Wed, Dec 17, 2008 at 10:37:54AM +0800, Jike Song wrote:
>> Jesse Barnes wrote:
>>> Given a respin of 10-13 I think it's reasonable to merge this into 2.6.29, but
>>> I'd be much happier about it if we got some driver code along with it, so as
>>> not to have an unused interface sitting around for who knows how many
>>> releases. Is that reasonable? Do you know if any of the corresponding PF/VF
>>> driver bits are ready yet?
>> Hi Jesse,
>>
>> Yu Zhao has posted a patch set with subject "SR-IOV driver example"
>> at November 26, which illustrated the usage of SR-IOV API in Intel 82576 VF/PF
>> drivers;-)
>
> Yes, but that driver was soundly rejected by the network driver
> maintainers, so I wouldn't go around showing that as your primary
> example of how to use this interface :)
>
> The point is valid, I don't think these apis should go into the tree
> without a driver or some other code using them. Otherwise they make no
> sense at all to have in-tree.
I agree the point is valid, but on the other hand this is a 'chicken and
egg' problem -- if we don't have the SR-IOV base, people who are
developing PF drivers cannot get their changes in-tree. Maybe they are
holding the patches and waiting on the infrastructure... :-)
^ permalink raw reply
* Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
From: Greg KH @ 2008-12-17 6:06 UTC (permalink / raw)
To: Jike Song
Cc: randy.dunlap, grundler, achiang, matthew, linux-pci, rdreier,
linux-kernel, Jesse Barnes, virtualization, horms, kvm, mingo,
yinghai, bjorn.helgaas
In-Reply-To: <49486602.5000108@gmail.com>
On Wed, Dec 17, 2008 at 10:37:54AM +0800, Jike Song wrote:
> Jesse Barnes wrote:
> > Given a respin of 10-13 I think it's reasonable to merge this into 2.6.29, but
> > I'd be much happier about it if we got some driver code along with it, so as
> > not to have an unused interface sitting around for who knows how many
> > releases. Is that reasonable? Do you know if any of the corresponding PF/VF
> > driver bits are ready yet?
>
> Hi Jesse,
>
> Yu Zhao has posted a patch set with subject "SR-IOV driver example"
> at November 26, which illustrated the usage of SR-IOV API in Intel 82576 VF/PF
> drivers;-)
Yes, but that driver was soundly rejected by the network driver
maintainers, so I wouldn't go around showing that as your primary
example of how to use this interface :)
The point is valid, I don't think these apis should go into the tree
without a driver or some other code using them. Otherwise they make no
sense at all to have in-tree.
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
From: Jike Song @ 2008-12-17 2:37 UTC (permalink / raw)
To: Jesse Barnes
Cc: randy.dunlap, grundler, achiang, matthew, linux-pci, rdreier,
linux-kernel, virtualization, horms, kvm, greg, mingo, yinghai,
bjorn.helgaas
In-Reply-To: <200812161523.55238.jbarnes@virtuousgeek.org>
Jesse Barnes wrote:
> Given a respin of 10-13 I think it's reasonable to merge this into 2.6.29, but
> I'd be much happier about it if we got some driver code along with it, so as
> not to have an unused interface sitting around for who knows how many
> releases. Is that reasonable? Do you know if any of the corresponding PF/VF
> driver bits are ready yet?
Hi Jesse,
Yu Zhao has posted a patch set with subject "SR-IOV driver example"
at November 26, which illustrated the usage of SR-IOV API in Intel 82576 VF/PF
drivers;-)
--
Thanks,
Jike
^ permalink raw reply
* Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
From: Jesse Barnes @ 2008-12-16 23:23 UTC (permalink / raw)
To: Yu Zhao
Cc: randy.dunlap, grundler, achiang, matthew, linux-pci, rdreier,
linux-kernel, virtualization, horms, kvm, greg, mingo, yinghai,
bjorn.helgaas
In-Reply-To: <20081121183605.GA7810@yzhao12-linux.sh.intel.com>
On Friday, November 21, 2008 10:36 am Yu Zhao wrote:
> Greetings,
>
> Following patches are intended to support SR-IOV capability in the
> Linux kernel. With these patches, people can turn a PCI device with
> the capability into multiple ones from software perspective, which
> will benefit KVM and achieve other purposes such as QoS, security,
> and etc.
>
> The Physical Function and Virtual Function drivers using the SR-IOV
> APIs will come soon!
>
> Major changes from v6 to v7:
> 1, remove boot-time resource rebalancing support. (Greg KH)
> 2, emit uevent upon the PF driver is loaded. (Greg KH)
> 3, put SR-IOV callback function into the 'pci_driver'. (Matthew Wilcox)
> 4, register SR-IOV service at the PF loading stage.
> 5, remove unnecessary APIs (pci_iov_enable/disable).
Thanks for your patience with this, Yu, I know it's been a long haul. :)
I applied 1-9 to my linux-next branch; and at least patch #10 needs a respin,
so can you re-do 10-13 as a new patch set?
On re-reading the last thread, there was a lot of smoke, but very little fire
afaict. The main questions I saw were:
1) do we need SR-IOV at all? why not just make each subsystem export
devices to guests?
This is a bit of a red herring. Nothing about SR-IOV prevents us from
making subsystems more v12n friendly. And since SR-IOV is a hardware
feature supported by devices these days, we should make Linux support it.
2) should the PF/VF drivers be the same or not?
Again, the SR-IOV patchset and PCI spec don't dictate this. We're free to
do what we want here.
3) should VF devices be represented by pci_dev structs?
Yes. (This is an easy one :)
4) can VF devices be used on the host?
Yet again, SR-IOV doesn't dictate this. Developers can make PF/VF combo
drivers or split them, and export the resulting devices however they want.
Some subsystem work may be needed to make this efficient, but SR-IOV
itself is agnostic about it.
So overall I didn't see many objections to the actual code in the last post,
and the issues above certainly don't merit a NAK IMO...
Given a respin of 10-13 I think it's reasonable to merge this into 2.6.29, but
I'd be much happier about it if we got some driver code along with it, so as
not to have an unused interface sitting around for who knows how many
releases. Is that reasonable? Do you know if any of the corresponding PF/VF
driver bits are ready yet?
Thanks,
--
Jesse Barnes, Intel Open Source Technology Center
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: Dor Laor @ 2008-12-16 23:20 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: kvm, netdev, virtualization, Anthony Liguori, David Miller
In-Reply-To: <20081216212532.GA15360@ioremap.net>
Evgeniy Polyakov wrote:
> On Tue, Dec 16, 2008 at 08:57:27AM +0200, Gleb Natapov (gleb@redhat.com) wrote:
>
>>> Another approach is to implement that virtio backend with netlink based
>>> userspace interface (like using connector or genetlink). This does not
>>> differ too much from what you have with special socket family, but at
>>> least it does not duplicate existing functionality of
>>> userspace-kernelspace communications.
>>>
>>>
>> I implemented vmchannel using connector initially (the downside is that
>> messages can be dropped). Is this more acceptable for upstream? The
>> implementation was 300 lines of code.
>>
>
> Hard to tell, it depends on implementation. But if things are good, I
> have no objections as connector maintainer :)
>
> Messages in connector in particular and netlink in general are only
> dropped, when receiving buffer is full (or when there is no memory), you
> can tune buffer size to match virtual queue size or vice versa.
>
>
Gleb was aware of that and it's not a problem since all of the
anticipated usages may drop msgs (guest statistics, cut&paste, mouse
movements, single sign-on commands, etc).
A service that needs reliability could use basic acks.
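
As an illustration of the "basic acks" idea, here is a minimal sketch of a stop-and-wait scheme over a channel that may drop messages. It is not taken from the vmchannel patch: `LossyChannel` and `send_reliably` are hypothetical names, and the channel is an in-memory stand-in that drops messages on a fixed schedule so the behaviour is repeatable.

```python
from collections import deque


class LossyChannel:
    """Hypothetical in-memory channel that drops every `drop_every`-th message."""

    def __init__(self, drop_every):
        self.drop_every = drop_every
        self.sent = 0
        self.queue = deque()

    def send(self, msg):
        self.sent += 1
        if self.sent % self.drop_every == 0:
            return  # message silently dropped, as a full buffer would do
        self.queue.append(msg)

    def recv(self):
        return self.queue.popleft() if self.queue else None


def send_reliably(payloads, to_peer, from_peer, max_tries=10):
    """Stop-and-wait: resend each payload until its sequence number is acked."""
    delivered = []
    expected = 0  # receiver state: next sequence number it will accept
    for seq, payload in enumerate(payloads):
        for _ in range(max_tries):
            to_peer.send((seq, payload))
            # Receiver side: accept new data, re-ack duplicates.
            msg = to_peer.recv()
            if msg is not None:
                rseq, rpayload = msg
                if rseq == expected:
                    delivered.append(rpayload)
                    expected += 1
                from_peer.send(rseq)
            # Sender side: stop retrying once the matching ack arrives.
            if from_peer.recv() == seq:
                break
        else:
            raise RuntimeError("gave up on message %d" % seq)
    return delivered
```

Services that tolerate loss (statistics, mouse movements) would skip this layer entirely; only the ones that need delivery guarantees would pay for the retransmissions.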
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: Evgeniy Polyakov @ 2008-12-16 21:25 UTC (permalink / raw)
To: Gleb Natapov; +Cc: netdev, kvm, David Miller, Anthony Liguori, virtualization
In-Reply-To: <20081216065727.GD13794@redhat.com>
On Tue, Dec 16, 2008 at 08:57:27AM +0200, Gleb Natapov (gleb@redhat.com) wrote:
> > Another approach is to implement that virtio backend with netlink based
> > userspace interface (like using connector or genetlink). This does not
> > differ too much from what you have with special socket family, but at
> > least it does not duplicate existing functionality of
> > userspace-kernelspace communications.
> >
> I implemented vmchannel using connector initially (the downside is that
> messages can be dropped). Is this more acceptable for upstream? The
> implementation was 300 lines of code.
Hard to tell, it depends on implementation. But if things are good, I
have no objections as connector maintainer :)
Messages in connector in particular and netlink in general are only
dropped, when receiving buffer is full (or when there is no memory), you
can tune buffer size to match virtual queue size or vice versa.
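
To make the buffer-sizing point concrete, here is a toy model, not connector or netlink code: a bounded receive buffer that drops only when it is full. `BoundedReceiveBuffer` and `drops_for_burst` are hypothetical names, and the assumption is that the worst case is one full virtual queue of messages arriving before the reader runs.

```python
from collections import deque


class BoundedReceiveBuffer:
    """Toy model of a receive buffer: a message is dropped only when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0

    def deliver(self, msg):
        if len(self.items) >= self.capacity:
            self.dropped += 1  # reader is behind and there is no room left
            return False
        self.items.append(msg)
        return True

    def drain(self):
        drained = list(self.items)
        self.items.clear()
        return drained


def drops_for_burst(buffer_capacity, virtqueue_size):
    """Deliver one full virtual queue of messages before the reader runs."""
    buf = BoundedReceiveBuffer(buffer_capacity)
    for i in range(virtqueue_size):
        buf.deliver(i)
    buf.drain()
    return buf.dropped
```

With a buffer smaller than the queue, the excess is lost; with the two sizes matched, the burst fits and nothing is dropped in this model.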
--
Evgeniy Polyakov
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: Gleb Natapov @ 2008-12-16 6:57 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: netdev, kvm, David Miller, Anthony Liguori, virtualization
In-Reply-To: <20081215234511.GA24579@ioremap.net>
On Tue, Dec 16, 2008 at 02:45:11AM +0300, Evgeniy Polyakov wrote:
> Hi Anthony.
>
> On Mon, Dec 15, 2008 at 05:01:14PM -0600, Anthony Liguori (anthony@codemonkey.ws) wrote:
> > Yes, and I went down the road of using a dedicated network device and
> > using raw ethernet as the protocol. The thing that killed that was the
> > fact that it's not reliable. You need something like TCP to add
> > reliability.
> >
> > But that's a lot of work and a bit backwards. Use an unreliable
> > transport but use TCP on top of it to get reliability. Our link
> > (virtio) is inherently reliable so why not just expose a reliable
> > interface to userspace?
>
> I removed original mail and did not check archive, but doesn't rx/tx
> queues of the virtio device have limited size? I do hope they have,
> which means that either your network drops packets or blocks.
>
It blocks.
> Another approach is to implement that virtio backend with netlink based
> userspace interface (like using connector or genetlink). This does not
> differ too much from what you have with special socket family, but at
> least it does not duplicate existing functionality of
> userspace-kernelspace communications.
>
I implemented vmchannel using connector initially (the downside is that
messages can be dropped). Is this more acceptable for upstream? The
implementation was 300 lines of code.
--
Gleb.
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: Herbert Xu @ 2008-12-16 2:55 UTC (permalink / raw)
To: Anthony Liguori; +Cc: netdev, kvm, davem, virtualization
In-Reply-To: <4946E597.6070308@codemonkey.ws>
Anthony Liguori <anthony@codemonkey.ws> wrote:
>
> If we used TCP, we don't have a useful TCP/IP stack in QEMU, so we'd
> have to inject that traffic into the host Linux instance, and then
> receive the traffic in QEMU. Besides being indirect, it has some nasty
> security implications that I outlined in my response to Jeremy's last note.
When combined with namespaces I don't see why using the kernel TCP
stack would create any security problems that wouldn't otherwise
exist.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: Dor Laor @ 2008-12-16 0:01 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: kvm, netdev, virtualization, Anthony Liguori, David Miller
In-Reply-To: <20081215235253.GB24579@ioremap.net>
Evgeniy Polyakov wrote:
> On Mon, Dec 15, 2008 at 05:08:29PM -0600, Anthony Liguori (anthony@codemonkey.ws) wrote:
>
>> The KVM model is that a guest is a process. Any IO operations originate
>> from the process (QEMU). The advantage to this is that you get very
>> good security because you can use things like SELinux and simply treat
>> the QEMU process as you would the guest. In fact, in general, I think
>> we want to assume that QEMU is guest code from a security perspective.
>>
>> By passing up the network traffic to the host kernel, we now face a
>> problem when we try to get the data back. We could setup a tun device
>> to send traffic to the kernel but then the rest of the system can see
>> that traffic too. If that traffic is sensitive, it's potentially unsafe.
>>
>
> You can even use unix sockets in this case, and each socket will be
> named after its virtio channel. IIRC tun/tap devices can be virtualized
> with recent kernels, which also solves all problems of shared access.
>
> There are plenty of ways to implement this kind of functionality instead
> of developing some new protocol, which is effectively a duplication of
> what already exists in the kernel.
>
>
Well, it is kind of a pv-unix-domain-socket.
I did not understand how a standard unix domain socket in the guest can
reach the host according to your solution.
The initial implementation was some sort of pv-serial. Serial itself is
low performing and there are no naming services whatsoever. Gleb did offer
the netlink option as a beginning, but we thought a new address family
would be more robust (you say too robust).
So by suggesting a new address family, one can think of it as a
pv-unix-domain-socket.
Networking IS used since we think it is a good 'wheel'.
Indeed, David is right that instead of adding a new chunk of code we can
re-use the existing one. But we do have some 'new' (afraid to say
virtualization) problems that might prevent us from using a standard
virtual nic:
- Even if we can teach iptables to ignore this interface, other
  third-party firewalls might not obey: What if the VM is a Checkpoint
  firewall? What if the VM is Windows using a non-MS firewall?
- Who will assign IPs for the vnic? How can I ensure there is no IP
  clash? The standard dhcp for the other standard vnics might not be in
  our control.
So I do understand the idea of using a standard network interface. It's
just not that simple.
So ideas to handle the above are welcome.
Otherwise we might need to go back to the serial/pv-serial approach.
btw: here are the current and upcoming usages of vmchannel:
VMchannel is a host-guest interface and in the future a guest-guest
interface.
Currently/soon it is used for
- guest statistics
- guest info
- guest single sign-on
- guest log-in/log-out
- mouse channel for multiple monitors
- cut&paste (guest-host, sometimes client-host-guest, when a company
  firewall blocks client-guest)
- fencing (potentially)
btw2: without virtualization we wouldn't have new passionate issues to
discuss!
Cheers,
Dor
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: Evgeniy Polyakov @ 2008-12-15 23:52 UTC (permalink / raw)
To: Anthony Liguori; +Cc: netdev, David Miller, kvm, virtualization
In-Reply-To: <4946E36D.8060503@codemonkey.ws>
On Mon, Dec 15, 2008 at 05:08:29PM -0600, Anthony Liguori (anthony@codemonkey.ws) wrote:
> The KVM model is that a guest is a process. Any IO operations originate
> from the process (QEMU). The advantage to this is that you get very
> good security because you can use things like SELinux and simply treat
> the QEMU process as you would the guest. In fact, in general, I think
> we want to assume that QEMU is guest code from a security perspective.
>
> By passing up the network traffic to the host kernel, we now face a
> problem when we try to get the data back. We could setup a tun device
> to send traffic to the kernel but then the rest of the system can see
> that traffic too. If that traffic is sensitive, it's potentially unsafe.
You can even use unix sockets in this case, and each socket will be
named after its virtio channel. IIRC tun/tap devices can be virtualized
with recent kernels, which also solves all problems of shared access.
There are plenty of ways to implement this kind of functionality instead
of developing some new protocol, which is effectively a duplication of
what already exists in the kernel.
--
Evgeniy Polyakov
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: Evgeniy Polyakov @ 2008-12-15 23:45 UTC (permalink / raw)
To: Anthony Liguori; +Cc: netdev, kvm, David Miller, virtualization
In-Reply-To: <4946E1BA.80206@codemonkey.ws>
Hi Anthony.
On Mon, Dec 15, 2008 at 05:01:14PM -0600, Anthony Liguori (anthony@codemonkey.ws) wrote:
> Yes, and I went down the road of using a dedicated network device and
> using raw ethernet as the protocol. The thing that killed that was the
> fact that it's not reliable. You need something like TCP to add
> reliability.
>
> But that's a lot of work and a bit backwards. Use an unreliable
> transport but use TCP on top of it to get reliability. Our link
> (virtio) is inherently reliable so why not just expose a reliable
> interface to userspace?
I removed original mail and did not check archive, but doesn't rx/tx
queues of the virtio device have limited size? I do hope they have,
which means that either your network drops packets or blocks.
Having a dedicated preconfigured network device is essentially the same as
having this special socket option: guests which do not have it (either
network or vchannel socket) will not be able to communicate with the
host, so there is no difference. Except that the usual network will just
work out of the box (and you will especially like it when there is
no need to hack on X to support new network media).
Another approach is to implement that virtio backend with netlink based
userspace interface (like using connector or genetlink). This does not
differ too much from what you have with special socket family, but at
least it does not duplicate existing functionality of
userspace-kernelspace communications.
But IMO having special network device or running your protocol over
existing virtio network device is a cleaner solution both from technical
and convenience points of view.
--
Evgeniy Polyakov
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: Jeremy Fitzhardinge @ 2008-12-15 23:44 UTC (permalink / raw)
To: Anthony Liguori; +Cc: netdev, David Miller, kvm, virtualization
In-Reply-To: <4946E36D.8060503@codemonkey.ws>
Anthony Liguori wrote:
> Jeremy Fitzhardinge wrote:
>> Anthony Liguori wrote:
>>>
>>> That seems unnecessarily complex.
>>>
>>
>> Well, the simplest thing is to let the host TCP stack do TCP. Could
>> you go into more detail about why you'd want to avoid that?
>
> The KVM model is that a guest is a process. Any IO operations
> originate from the process (QEMU). The advantage to this is that you
> get very good security because you can use things like SELinux and
> simply treat the QEMU process as you would the guest. In fact, in
> general, I think we want to assume that QEMU is guest code from a
> security perspective.
>
> By passing up the network traffic to the host kernel, we now face a
> problem when we try to get the data back. We could setup a tun device
> to send traffic to the kernel but then the rest of the system can see
> that traffic too. If that traffic is sensitive, it's potentially unsafe.
Well, one could come up with a mechanism to bind an interface to be only
visible to a particular context/container/something.
> You can use iptables to restrict who can receive traffic and possibly
> use SELinux packet tagging or whatever. This gets extremely complex
> though.
Well, if you can just tag everything based on interface it's relatively
simple.
> It's far easier to avoid the host kernel entirely and implement the
> backends in QEMU. Then any actions the backend takes will be on
> behalf of the guest. You never have to worry about transport data
> leakage.
Well, a stream-like protocol layered over a reliable packet transport
would get you there without the complexity of tcp. Or just do a
usermode tcp; it's not that complex if you really think it simplifies the
other aspects.
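
A rough sketch of what "a stream-like protocol layered over a reliable packet transport" could look like, assuming the transport already delivers bounded-size packets in order and without loss (as the thread says virtio does). `PacketPipe`, `StreamWriter` and `StreamReader` are hypothetical names used only for illustration; nothing here is an existing kernel or QEMU interface.

```python
from collections import deque


class PacketPipe:
    """Stand-in for a reliable, ordered transport with a maximum packet size."""

    def __init__(self, max_packet):
        self.max_packet = max_packet
        self.packets = deque()

    def send_packet(self, data):
        if len(data) > self.max_packet:
            raise ValueError("packet too large for transport")
        self.packets.append(bytes(data))

    def recv_packet(self):
        return self.packets.popleft() if self.packets else None


class StreamWriter:
    """Splits an arbitrary byte string into transport-sized packets."""

    def __init__(self, pipe):
        self.pipe = pipe

    def write(self, data):
        step = self.pipe.max_packet
        for off in range(0, len(data), step):
            self.pipe.send_packet(data[off:off + step])


class StreamReader:
    """Reassembles packets into a byte stream; packet boundaries disappear."""

    def __init__(self, pipe):
        self.pipe = pipe
        self.pending = bytearray()

    def read(self, n):
        while len(self.pending) < n:
            pkt = self.pipe.recv_packet()
            if pkt is None:
                break  # nothing more queued; return what is available
            self.pending += pkt
        out = bytes(self.pending[:n])
        del self.pending[:n]
        return out
```

Because ordering and delivery come from the transport, the stream layer needs no sequence numbers, retransmission or checksums, which is where the saving relative to TCP would come from.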
>
>>> This is why I've been pushing for the backends to be implemented in
>>> QEMU. Then QEMU can marshal the backend-specific state and transfer
>>> it during live migration. For something like copy/paste, this is
>>> obvious (the clipboard state). A general command interface is
>>> probably stateless so it's a nop.
>>>
>>
>> Copy/paste seems like a particularly bogus example. Surely this
>> isn't a sensible way to implement it?
>
> I think it's the most sensible way to implement it. Would you suggest
> something different?
Well, off the top of my head I'm assuming the requirements are:
* the goal is to unify the user's actual desktop session with a
virtual session within a vm
* a given user may have multiple VMs running on their desktop
* a VM may be serving multiple user sessions
* the VMs are not necessarily hosted by the user's desktop machine
* the VMs can migrate at any moment
To me that looks like a daemon running within the context of each of the
user's virtual sessions monitoring clipboard events, talking over a TCP
connection to a corresponding daemon in their desktop session, which is
responsible for reconciling cuts and pastes in all the various sessions.
I guess you'd say that each VM would multiplex all its cut/paste events
via its AF_VMCHANNEL/cut+paste channel to its qemu, which would then
demultiplex them off to the user's real desktops. And that since the VM
itself may have no networking, it needs to be a special magic connection.
And my counter argument to this nicely placed straw man is that the
VM<->qemu connection can still be TCP, even if it's a private network
with no outside access.
>
>>> I'm not a fan of having external backends to QEMU for the very
>>> reasons you outline above. You cannot marshal the state of a
>>> channel we know nothing about. We're really just talking about
>>> extending virtio in a guest down to userspace so that we can
>>> implement paravirtual device drivers in guest userspace. This may
>>> be an X graphics driver, a mouse driver, copy/paste, remote
>>> shutdown, etc.
>>> A socket seems like a natural choice. If that's wrong, then we
>>> can explore other options (like a char device, virtual fs, etc.).
>>
>> I think a socket is a pretty poor choice. It's too low level, and it
>> only really makes sense for streaming data, not for data storage
>> (name/value pairs). It means that everyone ends up making up their
>> own serializations. A filesystem view with notifications seems to be
>> a better match for the use-cases you mention (aside from cut/paste),
>> with a single well-defined way to serialize onto any given channel.
>> Each "file" may well have an application-specific content, but in
>> general that's going to be something pretty simple.
>
> I had suggested a virtual file system at first and was thoroughly
> ridiculed for it :-) There is a 9p virtio transport already so we
> could even just use that.
You mean 9p directly over a virtio ringbuffer rather than via the
network stack? You could do that, but I'd still argue that using the
network stack is a better approach.
> The main issue with a virtual file system is that it does not map well to
> other guests. It's actually easier to implement a socket interface
> for Windows than it is to implement a new file system.
There's no need to put the "filesystem" into the kernel unless something
else in the kernel needs to access it. A usermode implementation
talking over some stream interface would be fine.
> But we could find ways around this with libraries. If we used 9p as a
> transport, we could just provide a char device in Windows that
> received it in userspace.
Or just use a tcp connection, and do it all with no kernel mods.
(Is 9p a good choice? You need to be able to subscribe to events
happening to files, and you'd need some kind of atomicity guarantee. I
dunno, maybe 9p already has this or can be cleanly adapted.)
J
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: Anthony Liguori @ 2008-12-15 23:17 UTC (permalink / raw)
To: David Miller; +Cc: netdev, kvm, virtualization
In-Reply-To: <20081215.151044.262451532.davem@davemloft.net>
David Miller wrote:
> From: Anthony Liguori <anthony@codemonkey.ws>
> Date: Mon, 15 Dec 2008 17:01:14 -0600
>
>
>> No, TCP falls under the not simple category because it requires the
>> backend to have access to a TCP/IP stack.
>>
>
> I'm at a loss for words if you need TCP in the hypervisor, if that's
> what you're implying here.
>
No. KVM is not a traditional "hypervisor". It's more of a userspace
accelerator for emulators.
QEMU, a system emulator, calls in to the Linux kernel whenever it needs
to run guest code. Linux returns to QEMU whenever the guest has done an
MMIO operation or something of that nature. In this way, all of our
device emulation (including paravirtual backends) are implemented in the
host userspace in the QEMU process.
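
The control flow described above can be sketched as a small simulation. In a real setup the emulator issues the KVM_RUN ioctl on a vCPU file descriptor and inspects the exit reason (for example KVM_EXIT_MMIO); the code below does not touch KVM at all. `fake_guest`, `ConsoleDevice` and `run_vcpu` are hypothetical names, and the "guest" is a Python generator whose every yield stands for one return from the kernel to the emulator.

```python
class ConsoleDevice:
    """Hypothetical userspace device model occupying 8 bytes of MMIO space."""

    SIZE = 8

    def __init__(self):
        self.output = bytearray()

    def write(self, offset, data):
        if offset == 0:  # data register
            self.output += data

    def read(self, offset, length):
        return b"\x01"[:length]  # status register: always "ready"


def fake_guest():
    """Each yield models the guest trapping out to the emulator."""
    yield ("mmio_write", 0x1000, b"hi")
    status = yield ("mmio_read", 0x1004, 1)
    yield ("mmio_write", 0x1000, status)
    yield ("halt",)


def run_vcpu(guest, devices):
    """Emulator main loop: run the guest, handle each exit in userspace."""
    reply = None
    exits = 0
    while True:
        event = guest.send(reply)  # models entering the guest until next exit
        reply = None
        exits += 1
        if event[0] == "halt":
            return exits
        kind, addr, arg = event
        for base, dev in devices.items():
            if base <= addr < base + dev.SIZE:
                if kind == "mmio_write":
                    dev.write(addr - base, arg)
                else:
                    reply = dev.read(addr - base, arg)
                break
```

The point of the model is only that every device access ends up in an ordinary userspace function, so a backend placed there runs with the emulator process's own privileges.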
If we used TCP, we don't have a useful TCP/IP stack in QEMU, so we'd
have to inject that traffic into the host Linux instance, and then
receive the traffic in QEMU. Besides being indirect, it has some nasty
security implications that I outlined in my response to Jeremy's last note.
Regards,
Anthony Liguori
> You only need it in the guest and the host, which you already have,
> in the Linux kernel. Just transport that over virtio or whatever
> and be done with it.
>
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: Stephen Hemminger @ 2008-12-15 23:13 UTC (permalink / raw)
To: Anthony Liguori; +Cc: netdev, kvm, David Miller, virtualization
In-Reply-To: <4946E1BA.80206@codemonkey.ws>
On Mon, 15 Dec 2008 17:01:14 -0600
Anthony Liguori <anthony@codemonkey.ws> wrote:
> David Miller wrote:
> > From: Anthony Liguori <anthony@codemonkey.ws>
> > Date: Mon, 15 Dec 2008 14:44:26 -0600
> >
> >
> >> We want this communication mechanism to be simple and reliable as we
> >> want to implement the backend drivers in the host userspace with
> >> minimum mess.
> >>
> >
> > One implication of your statement here is that TCP is unreliable.
> > That's absolutely not true.
> >
>
> No, TCP falls under the not simple category because it requires the
> backend to have access to a TCP/IP stack.
>
> >> Within the guest, we need the interface to be always available and
> >> we need an addressing scheme that is hypervisor specific. Yes, we
> >> can build this all on top of TCP/IP. We could even build it on top
> >> of a serial port. Both have their down-sides wrt reliability and
> >> complexity.
> >>
> >
> > I don't know of any zero-copy through the hypervisor mechanisms for
> > serial ports, but I know we do that with the various virtualization
> > network devices.
> >
>
> Yes, and I went down the road of using a dedicated network device and
> using raw ethernet as the protocol. The thing that killed that was the
> fact that it's not reliable. You need something like TCP to add
> reliability.
>
> But that's a lot of work and a bit backwards. Use an unreliable
> transport but use TCP on top of it to get reliability. Our link
> (virtio) is inherently reliable so why not just expose a reliable
> interface to userspace?
>
> >> Do you have another recommendation?
> >>
> >
> > I don't have to make alternative recommendations until you can
> > show that what we have can't solve the problem acceptably, and
> > TCP emphatically can.
> >
>
> It can solve the problem but I don't think it's the best way to solve
> the problem, mainly because of the complexity it demands of the backend.
"Those who don't understand TCP are doomed to reimplement it, badly."
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: David Miller @ 2008-12-15 23:10 UTC (permalink / raw)
To: anthony; +Cc: netdev, kvm, virtualization
In-Reply-To: <4946E1BA.80206@codemonkey.ws>
From: Anthony Liguori <anthony@codemonkey.ws>
Date: Mon, 15 Dec 2008 17:01:14 -0600
> No, TCP falls under the not simple category because it requires the
> backend to have access to a TCP/IP stack.
I'm at a loss for words if you need TCP in the hypervisor, if that's
what you're implying here.
You only need it in the guest and the host, which you already have,
in the Linux kernel. Just transport that over virtio or whatever
and be done with it.
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: Anthony Liguori @ 2008-12-15 23:08 UTC (permalink / raw)
To: Jeremy Fitzhardinge; +Cc: netdev, David Miller, kvm, virtualization
In-Reply-To: <4946DF97.7070600@goop.org>
Jeremy Fitzhardinge wrote:
> Anthony Liguori wrote:
>>
>> That seems unnecessarily complex.
>>
>
> Well, the simplest thing is to let the host TCP stack do TCP. Could
> you go into more detail about why you'd want to avoid that?
The KVM model is that a guest is a process. Any IO operations originate
from the process (QEMU). The advantage to this is that you get very
good security because you can use things like SELinux and simply treat
the QEMU process as you would the guest. In fact, in general, I think
we want to assume that QEMU is guest code from a security perspective.
By passing up the network traffic to the host kernel, we now face a
problem when we try to get the data back. We could setup a tun device
to send traffic to the kernel but then the rest of the system can see
that traffic too. If that traffic is sensitive, it's potentially unsafe.
You can use iptables to restrict who can receive traffic and possibly
use SELinux packet tagging or whatever. This gets extremely complex though.
It's far easier to avoid the host kernel entirely and implement the
backends in QEMU. Then any actions the backend takes will be on behalf
of the guest. You never have to worry about transport data leakage.
>> This is why I've been pushing for the backends to be implemented in
>> QEMU. Then QEMU can marshal the backend-specific state and transfer
>> it during live migration. For something like copy/paste, this is
>> obvious (the clipboard state). A general command interface is
>> probably stateless so it's a nop.
>>
>
> Copy/paste seems like a particularly bogus example. Surely this isn't
> a sensible way to implement it?
I think it's the most sensible way to implement it. Would you suggest
something different?
>> I'm not a fan of having external backends to QEMU for the very
>> reasons you outline above. You cannot marshal the state of a channel
>> we know nothing about. We're really just talking about extending
>> virtio in a guest down to userspace so that we can implement
>> paravirtual device drivers in guest userspace. This may be an X
>> graphics driver, a mouse driver, copy/paste, remote shutdown, etc.
>> A socket seems like a natural choice. If that's wrong, then we can
>> explore other options (like a char device, virtual fs, etc.).
>
> I think a socket is a pretty poor choice. It's too low level, and it
> only really makes sense for streaming data, not for data storage
> (name/value pairs). It means that everyone ends up making up their
> own serializations. A filesystem view with notifications seems to be
> a better match for the use-cases you mention (aside from cut/paste),
> with a single well-defined way to serialize onto any given channel.
> Each "file" may well have an application-specific content, but in
> general that's going to be something pretty simple.
I had suggested a virtual file system at first and was thoroughly
ridiculed for it :-) There is a 9p virtio transport already so we could
even just use that.
The main issue with a virtual file system is that it does not map well to
other guests. It's actually easier to implement a socket interface for
Windows than it is to implement a new file system.
But we could find ways around this with libraries. If we used 9p as a
transport, we could just provide a char device in Windows that received
it in userspace.
>> This shouldn't be confused with networking though and all the talk
>> of doing silly things like streaming fence traffic through it just
>> encourages the confusion.
>
> I'm not sure what you're referring to here.
I'm just ranting, it's not important.
Regards,
Anthony Liguori
> J
^ permalink raw reply
* Re: [PATCH] AF_VMCHANNEL address family for guest<->host communication.
From: Anthony Liguori @ 2008-12-15 23:01 UTC (permalink / raw)
To: David Miller; +Cc: netdev, kvm, virtualization
In-Reply-To: <20081215.142918.190909950.davem@davemloft.net>
David Miller wrote:
> From: Anthony Liguori <anthony@codemonkey.ws>
> Date: Mon, 15 Dec 2008 14:44:26 -0600
>
>
>> We want this communication mechanism to be simple and reliable as we
>> want to implement the backend drivers in the host userspace with
>> minimum mess.
>>
>
> One implication of your statement here is that TCP is unreliable.
> That's absolutely not true.
>
No, TCP falls under the not simple category because it requires the
backend to have access to a TCP/IP stack.
>> Within the guest, we need the interface to be always available and
>> we need an addressing scheme that is hypervisor specific. Yes, we
>> can build this all on top of TCP/IP. We could even build it on top
>> of a serial port. Both have their down-sides wrt reliability and
>> complexity.
>>
>
> I don't know of any zero-copy through the hypervisor mechanisms for
> serial ports, but I know we do that with the various virtualization
> network devices.
>
Yes, and I went down the road of using a dedicated network device and
using raw ethernet as the protocol. The thing that killed that was the
fact that it's not reliable. You need something like TCP to add
reliability.
But that's a lot of work and a bit backwards. Use an unreliable
transport but use TCP on top of it to get reliability. Our link
(virtio) is inherently reliable so why not just expose a reliable
interface to userspace?
>> Do you have another recommendation?
>>
>
> I don't have to make alternative recommendations until you can
> show that what we have can't solve the problem acceptably, and
> TCP emphatically can.
>
It can solve the problem but I don't think it's the best way to solve
the problem, mainly because of the complexity it demands of the backend.
Regards,
Anthony Liguori
^ permalink raw reply