* Re: Zerocopy VM-to-VM networking using virtio-net [not found] <20150422170138.GA8388@stefanha-thinkpad.redhat.com> @ 2015-04-22 17:46 ` Cornelia Huck [not found] ` <20150422194603.1e650ec7.cornelia.huck@de.ibm.com> 2015-04-24 8:12 ` [virtio-dev] " Luke Gorrie 2 siblings, 0 replies; 25+ messages in thread From: Cornelia Huck @ 2015-04-22 17:46 UTC (permalink / raw) To: Stefan Hajnoczi Cc: virtio-dev, Andrew Jones, Michael S. Tsirkin, Rik van Riel, virtualization, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert On Wed, 22 Apr 2015 18:01:38 +0100 Stefan Hajnoczi <stefanha@redhat.com> wrote: > [It may be necessary to remove virtio-dev@lists.oasis-open.org from CC > if you are a non-TC member.] > > Hi, > Some modern networking applications bypass the kernel network stack so > that rx/tx rings and DMA buffers can be directly mapped. This is > typical in DPDK applications where virtio-net currently is one of > several NIC choices. > > Existing virtio-net implementations are not optimized for VM-to-VM > DPDK-style networking. The following outline describes a zero-copy > virtio-net solution for VM-to-VM networking. > > Thanks to Paolo Bonzini for the Shared Buffers BAR idea. > > Use case > -------- > Two VMs on the same host need to communicate in the most efficient > manner possible (e.g. the sole purpose of the VMs is to do network I/O). > > Applications running inside the VMs implement virtio-net in userspace so > they have full control over rx/tx rings and data buffer placement. Wouldn't that also benefit applications that use a kernel implementation? You still need to get the data to/from kernel space, but you'd get the benefit of being able to get the data to the peer immediately. > > Performance requirements are higher priority than security or isolation. > If this bothers you, stick to classic virtio-net. 
> > virtio-net VM-to-VM extensions > ------------------------------ > A few extensions to virtio-net are necessary to support zero-copy > VM-to-VM communication. The extensions are covered informally > throughout the text, this is not a VIRTIO specification change proposal. > > The VM-to-VM capable virtio-net PCI adapter has an additional MMIO BAR > called the Shared Buffers BAR. The Shared Buffers BAR is a shared > memory region on the host so that the virtio-net devices in VM1 and VM2 > both access the same region of memory. > > The vring is still allocated in guest RAM as usual but data buffers must > be located in the Shared Buffers BAR in order to take advantage of > zero-copy. > > When VM1 places a packet into the tx queue and the buffers are located > in the Shared Buffers BAR, the host finds the VM2's rx queue descriptor > with the same buffer address and completes it without copying any data > buffers. The shared buffers BAR looks PCI-specific, but what about other mechanisms to provide a shared space between two VMs with some kind of lightweight notifications? This should make it possible to implement a similar mode of operation for other transports if it is factored out correctly. (The actual implementation of this shared space is probably the difficult part :) > > Shared buffer allocation > ------------------------ > A simple scheme for two cooperating VMs to manage the Shared Buffers BAR > is as follows:
>
>     VM1       VM2
>          +---+
>      rx->| 1 |<-tx
>          +---+
>      tx->| 2 |<-rx
>          +---+
>     Shared Buffers
>
> This is a trivial example where the Shared Buffers BAR has only two > packet buffers. > > VM1 starts by putting buffer 1 in its rx queue. VM2 starts by putting > buffer 2 in its rx queue. The VMs know which buffers to choose based on > a new uint8_t virtio_net_config.shared_buffers_offset field (0 for VM1 > and 1 for VM2). > > VM1 can transmit to VM2 by filling buffer 2 and placing it on its tx > queue.
VM2 can transmit by filling buffer 1 and placing it on its tx > queue. > > As soon as a buffer is placed on a tx queue, the VM passes ownership of > the buffer to the other VM. In other words, the buffer must not be > touched even after virtio-net tx completion because it now belongs to > the other VM. > > This scheme of bouncing ownership back-and-forth between the two VMs > only works if both VMs transmit an equal number of buffers over time. > In reality the traffic pattern may be unbalanced so VM1 is always > transmitting and VM2 is always receiving. This problem can be overcome > if the VMs cooperate and return buffers if they accumulate too many. > > For example, after VM1 transmits buffer 2 it has run out of tx buffers:
>
>     VM1       VM2
>          +---+
>      rx->| 1 |<-tx
>          +---+
>       X->| 2 |<-rx
>          +---+
>
> VM2 notices that it now holds all buffers. It can donate a buffer back > to VM1 by putting it on the tx queue with the new virtio_net_hdr.flags > VIRTIO_NET_HDR_F_GIFT_BUFFER flag. This flag indicates that this is not > a packet but rather an empty gifted buffer. VM1 checks the flags field > to detect that it has been gifted buffers. > > Also note that zero-copy networking is not mutually exclusive with > classic virtio-net. If the descriptor has buffer addresses outside the > Shared Buffers BAR, then classic non-zero-copy virtio-net behavior > occurs. Is simply writing the values in the header enough to trigger the other side? You don't need some kind of notification? (I'm obviously coming from a non-PCI view, and for my kind-of-nebulous idea I'd need a lightweight interrupt so that the other side knows it should check the header.) > > Host-side implementation > ------------------------ > The host facilitates zero-copy VM-to-VM communication by taking > descriptors off tx queues and filling in rx descriptors of the paired > VM. In the Linux vhost_net implementation this could work as follows: > > 1. VM1 places buffer 2 on the tx queue and kicks the host.
Ownership of > the buffer no longer belongs to VM1. > 2. vhost_net pops the buffer from VM1's tx queue and verifies that the > buffer address is within the Shared Buffers BAR. > 3. vhost_net finds the VM2 rx queue descriptor whose buffer address > matches, completes that descriptor, and kicks VM2. > 4. VM2 pops buffer 2 from the rx queue. It can now reuse this buffer > for transmitting to VM1. > > The vhost_net.ko kernel module needs a new ioctl for pairing vhost_net > instances. This ioctl is used to establish the VM-to-VM connection > between VM1's virtio-net and VM2's virtio-net. > > Discussion > ---------- > The result is that applications in separate VMs can communicate in true > zero-copy fashion. > > I think this approach could be fruitful in bringing virtio-net to > VM-to-VM networking use cases. Unless virtio-net is extended for this > use case, I'm afraid DPDK and OpenDataPlane communities might steer > clear of VIRTIO. > > This is an idea I want to share but I'm not working on a prototype. > Feel free to flesh it out further and try it! Definitely interesting. It seems you get much of the needed infrastructure by simply leveraging what PCI gives you anyway? If we want something like this in other environments (say, via ccw on s390), we'd have to come up with a mechanism that can give us the same (which is probably the hard part). > > Open issues: > * Multiple VMs? > * Multiqueue? > * Choice of shared buffer allocation algorithm? > * etc > > Stefan ^ permalink raw reply [flat|nested] 25+ messages in thread
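The host-side steps quoted in the message above (pop from VM1's tx queue, check the address against the Shared Buffers BAR, complete the matching VM2 rx descriptor) can be sketched as a small Python model. This is only an illustration under assumptions: SharedBar, Descriptor, Queue and forward_tx are hypothetical names and not vhost_net code, kicks are left out, and the "no-rx-buffer" result is an assumed behaviour that the proposal does not specify.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SharedBar:
    """Address range of the Shared Buffers BAR (hypothetical model)."""
    base: int
    size: int

    def contains(self, addr: int, length: int) -> bool:
        return self.base <= addr and addr + length <= self.base + self.size


@dataclass
class Descriptor:
    addr: int
    length: int


@dataclass
class Queue:
    avail: deque = field(default_factory=deque)  # posted by the guest
    used: list = field(default_factory=list)     # completed by the host


def forward_tx(bar: SharedBar, tx: Queue, peer_rx: Queue) -> str:
    """Handle the descriptor at the head of VM1's tx queue (steps 2 and 3)."""
    if not tx.avail:
        return "idle"
    desc = tx.avail[0]
    # Step 2: buffers outside the BAR take the classic copying path,
    # which this sketch does not model.
    if not bar.contains(desc.addr, desc.length):
        return "classic"
    # Step 3: find the peer rx descriptor with the same buffer address.
    match = next((d for d in peer_rx.avail if d.addr == desc.addr), None)
    if match is None:
        return "no-rx-buffer"  # assumption: leave the packet queued
    tx.avail.popleft()
    peer_rx.avail.remove(match)
    # Complete the rx descriptor with the transmitted length; no data is copied.
    peer_rx.used.append(Descriptor(match.addr, desc.length))
    tx.used.append(desc)
    return "zero-copy"
```

The linear search over the peer's rx queue is the simplest way to express "finds the VM2 rx queue descriptor whose buffer address matches"; a real implementation would presumably want an address-indexed lookup.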
[parent not found: <20150422194603.1e650ec7.cornelia.huck@de.ibm.com>]
* Re: Zerocopy VM-to-VM networking using virtio-net [not found] ` <20150422194603.1e650ec7.cornelia.huck@de.ibm.com> @ 2015-04-22 18:00 ` Stefan Hajnoczi 2015-04-23 16:54 ` Cornelia Huck 0 siblings, 1 reply; 25+ messages in thread From: Stefan Hajnoczi @ 2015-04-22 18:00 UTC (permalink / raw) To: Cornelia Huck Cc: virtio-dev, Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert On Wed, Apr 22, 2015 at 6:46 PM, Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > On Wed, 22 Apr 2015 18:01:38 +0100 > Stefan Hajnoczi <stefanha@redhat.com> wrote: > >> [It may be necessary to remove virtio-dev@lists.oasis-open.org from CC >> if you are a non-TC member.] >> >> Hi, >> Some modern networking applications bypass the kernel network stack so >> that rx/tx rings and DMA buffers can be directly mapped. This is >> typical in DPDK applications where virtio-net currently is one of >> several NIC choices. >> >> Existing virtio-net implementations are not optimized for VM-to-VM >> DPDK-style networking. The following outline describes a zero-copy >> virtio-net solution for VM-to-VM networking. >> >> Thanks to Paolo Bonzini for the Shared Buffers BAR idea. >> >> Use case >> -------- >> Two VMs on the same host need to communicate in the most efficient >> manner possible (e.g. the sole purpose of the VMs is to do network I/O). >> >> Applications running inside the VMs implement virtio-net in userspace so >> they have full control over rx/tx rings and data buffer placement. > > Wouldn't that also benefit applications that use a kernel > implementation? You still need to get the data to/from kernel space, > but you'd get the benefit of being able to get the data to the peer > immediately. If the applications are using the sockets API then there is a memory copy involved. But you are right that it bypasses tap/bridge on the host side, so it can still be an advantage. 
>> >> Performance requirements are higher priority than security or isolation. >> If this bothers you, stick to classic virtio-net. >> >> virtio-net VM-to-VM extensions >> ------------------------------ >> A few extensions to virtio-net are necessary to support zero-copy >> VM-to-VM communication. The extensions are covered informally >> throughout the text, this is not a VIRTIO specification change proposal. >> >> The VM-to-VM capable virtio-net PCI adapter has an additional MMIO BAR >> called the Shared Buffers BAR. The Shared Buffers BAR is a shared >> memory region on the host so that the virtio-net devices in VM1 and VM2 >> both access the same region of memory. >> >> The vring is still allocated in guest RAM as usual but data buffers must >> be located in the Shared Buffers BAR in order to take advantage of >> zero-copy. >> >> When VM1 places a packet into the tx queue and the buffers are located >> in the Shared Buffers BAR, the host finds the VM2's rx queue descriptor >> with the same buffer address and completes it without copying any data >> buffers. > > The shared buffers BAR looks PCI-specific, but what about other > mechanisms to provide a shared space between two VMs with some kind of > lightweight notifications? This should make it possible to implement a > similar mode of operation for other transports if it is factored out > correctly. (The actual implementation of this shared space is probably > the difficult part :) It depends on the primitives available. For example, in a virtual DMA page-flipping environment the hypervisor could change page ownership between the two VMs. This does not require shared memory. But there's a cost to virtual memory bookkeeping so it might only be a win for big packets. Does s390 have a mechanism for giving VMs permanent shared or temporary access to memory pages?
>> >> Shared buffer allocation >> ------------------------ >> A simple scheme for two cooperating VMs to manage the Shared Buffers BAR >> is as follows:
>>
>>     VM1       VM2
>>          +---+
>>      rx->| 1 |<-tx
>>          +---+
>>      tx->| 2 |<-rx
>>          +---+
>>     Shared Buffers
>>
>> This is a trivial example where the Shared Buffers BAR has only two >> packet buffers. >> >> VM1 starts by putting buffer 1 in its rx queue. VM2 starts by putting >> buffer 2 in its rx queue. The VMs know which buffers to choose based on >> a new uint8_t virtio_net_config.shared_buffers_offset field (0 for VM1 >> and 1 for VM2). >> >> VM1 can transmit to VM2 by filling buffer 2 and placing it on its tx >> queue. VM2 can transmit by filling buffer 1 and placing it on its tx >> queue. >> >> As soon as a buffer is placed on a tx queue, the VM passes ownership of >> the buffer to the other VM. In other words, the buffer must not be >> touched even after virtio-net tx completion because it now belongs to >> the other VM. >> >> This scheme of bouncing ownership back-and-forth between the two VMs >> only works if both VMs transmit an equal number of buffers over time. >> In reality the traffic pattern may be unbalanced so VM1 is always >> transmitting and VM2 is always receiving. This problem can be overcome >> if the VMs cooperate and return buffers if they accumulate too many. >> >> For example, after VM1 transmits buffer 2 it has run out of tx buffers:
>>
>>     VM1       VM2
>>          +---+
>>      rx->| 1 |<-tx
>>          +---+
>>       X->| 2 |<-rx
>>          +---+
>>
>> VM2 notices that it now holds all buffers. It can donate a buffer back >> to VM1 by putting it on the tx queue with the new virtio_net_hdr.flags >> VIRTIO_NET_HDR_F_GIFT_BUFFER flag. This flag indicates that this is not >> a packet but rather an empty gifted buffer. VM1 checks the flags field >> to detect that it has been gifted buffers. >> >> Also note that zero-copy networking is not mutually exclusive with >> classic virtio-net.
If the descriptor has buffer addresses outside the >> Shared Buffers BAR, then classic non-zero-copy virtio-net behavior >> occurs. > > Is simply writing the values in the header enough to trigger the other > side? You don't need some kind of notification? (I'm obviously coming > from a non-PCI view, and for my kind-of-nebulous idea I'd need a > lightweight interrupt so that the other side knows it should check the > header.) Virtqueue kick is still used for notification. In fact, the virtqueue operation is basically the same, except that data buffers are now located in the Shared Buffers BAR instead. >> Discussion >> ---------- >> The result is that applications in separate VMs can communicate in true >> zero-copy fashion. >> >> I think this approach could be fruitful in bringing virtio-net to >> VM-to-VM networking use cases. Unless virtio-net is extended for this >> use case, I'm afraid DPDK and OpenDataPlane communities might steer >> clear of VIRTIO. >> >> This is an idea I want to share but I'm not working on a prototype. >> Feel free to flesh it out further and try it! > > Definitely interesting. It seems you get much of the needed > infrastructure by simply leveraging what PCI gives you anyway? If we > want something like this in other environments (say, via ccw on s390), we'd > have to come up with a mechanism that can give us the same (which is > probably the hard part). It may not be a win in all environments. It depends on the primitives available for memory access. With PCI devices and a Linux host we can use a shared memory region. If shared memory is not available then maybe there is no performance win to be had. Stefan ^ permalink raw reply [flat|nested] 25+ messages in thread
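The buffer ownership scheme quoted above, including the gifted-buffer case, can be modelled in a few lines of Python. This is a simplified sketch and not part of the proposal: the Peer class and its methods are hypothetical, only transmit ownership is tracked (posting to the rx queue and the virtqueue kick are not modelled), and the rule "gift one buffer when holding all of them" is just one possible reading of "return buffers if they accumulate too many".

```python
class Peer:
    """One VM's view of which shared buffers it may fill and transmit."""

    def __init__(self, name, owned):
        self.name = name
        self.owned = list(owned)  # buffer numbers this VM currently owns
        self.peer = None          # set by connect()
        self.received = []        # payloads delivered to this VM

    def transmit(self, payload):
        """Fill a buffer and place it on tx; ownership moves to the peer."""
        if not self.owned:
            raise BufferError(f"{self.name} has run out of tx buffers")
        buf = self.owned.pop(0)
        self.peer._deliver(buf, payload, gift=False)

    def rebalance(self, total):
        """Donate one empty buffer if this VM holds all of them.

        Stands in for sending with VIRTIO_NET_HDR_F_GIFT_BUFFER set.
        """
        if len(self.owned) == total:
            buf = self.owned.pop(0)
            self.peer._deliver(buf, None, gift=True)

    def _deliver(self, buf, payload, gift):
        self.owned.append(buf)  # the receiver now owns the buffer
        if not gift:            # a gifted buffer carries no packet
            self.received.append(payload)


def connect(a, b):
    a.peer, b.peer = b, a
```

Starting from the two-buffer example (VM1 owns buffer 2, VM2 owns buffer 1), a single transmit from VM1 leaves VM2 holding both buffers, and one rebalance call on VM2 hands buffer 1 back without delivering a packet.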
* Re: Zerocopy VM-to-VM networking using virtio-net 2015-04-22 18:00 ` Stefan Hajnoczi @ 2015-04-23 16:54 ` Cornelia Huck 0 siblings, 0 replies; 25+ messages in thread From: Cornelia Huck @ 2015-04-23 16:54 UTC (permalink / raw) To: Stefan Hajnoczi Cc: virtio-dev, Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert On Wed, 22 Apr 2015 19:00:52 +0100 Stefan Hajnoczi <stefanha@gmail.com> wrote: > On Wed, Apr 22, 2015 at 6:46 PM, Cornelia Huck <cornelia.huck@de.ibm.com> wrote: > > On Wed, 22 Apr 2015 18:01:38 +0100 > > Stefan Hajnoczi <stefanha@redhat.com> wrote: > >> virtio-net VM-to-VM extensions > >> ------------------------------ > >> A few extensions to virtio-net are necessary to support zero-copy > >> VM-to-VM communication. The extensions are covered informally > >> throughout the text, this is not a VIRTIO specification change proposal. > >> > >> The VM-to-VM capable virtio-net PCI adapter has an additional MMIO BAR > >> called the Shared Buffers BAR. The Shared Buffers BAR is a shared > >> memory region on the host so that the virtio-net devices in VM1 and VM2 > >> both access the same region of memory. > >> > >> The vring is still allocated in guest RAM as usual but data buffers must > >> be located in the Shared Buffers BAR in order to take advantage of > >> zero-copy. > >> > >> When VM1 places a packet into the tx queue and the buffers are located > >> in the Shared Buffers BAR, the host finds the VM2's rx queue descriptor > >> with the same buffer address and completes it without copying any data > >> buffers. > > > > The shared buffers BAR looks PCI-specific, but what about other > > mechanisms to provide a shared space between two VMs with some kind of > > lightweight notifications? This should make it possible to implement a > > similar mode of operation for other transports if it is factored out > > correctly. 
(The actual implementation of this shared space is probably > > the difficult part :) > > It depends on the primitives available. For example, in a virtual DMA > page-flipping environment the hypervisor could change page ownership > between the two VMs. This does not require shared memory. But > there's a cost to virtual memory bookkeeping so it might only be a win > for big packets. > > Does s390 have a mechanism for giving VMs permanent shared or > temporary access to memory pages? Under kvm/qemu, currently not. Under z/VM, there's DCSS; while we don't want to copy that interface, we'll probably want to introduce something similar in the future. No design yet, though. > Virtqueue kick is still used for notification. In fact, the virtqueue > operation is basically the same, except that data buffers are now > located in the Shared Buffers BAR instead. You're right, if this is in the virtqueue buffers, this should just work. > > >> Discussion > >> ---------- > >> The result is that applications in separate VMs can communicate in true > >> zero-copy fashion. > >> > >> I think this approach could be fruitful in bringing virtio-net to > >> VM-to-VM networking use cases. Unless virtio-net is extended for this > >> use case, I'm afraid DPDK and OpenDataPlane communities might steer > >> clear of VIRTIO. > >> > >> This is an idea I want to share but I'm not working on a prototype. > >> Feel free to flesh it out further and try it! > > > > Definitely interesting. It seems you get much of the needed > > infrastructure by simply leveraging what PCI gives you anyway? If we > > want something like this in other environments (say, via ccw on s390), we'd > > have to come up with a mechanism that can give us the same (which is > > probably the hard part). > > It may not be a win in all environments. It depends on the primitives > available for memory access. > > With PCI devices and a Linux host we can use a shared memory region.
> If shared memory is not available then maybe there is no performance > win to be had. I think if there's a good split between concept and specific backend, we just can figure out the "shared" part later on. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net [not found] <20150422170138.GA8388@stefanha-thinkpad.redhat.com> 2015-04-22 17:46 ` Zerocopy VM-to-VM networking using virtio-net Cornelia Huck [not found] ` <20150422194603.1e650ec7.cornelia.huck@de.ibm.com> @ 2015-04-24 8:12 ` Luke Gorrie 2015-04-24 8:20 ` Paolo Bonzini 2015-04-24 9:47 ` Stefan Hajnoczi 2 siblings, 2 replies; 25+ messages in thread From: Luke Gorrie @ 2015-04-24 8:12 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, virtualization, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert [-- Attachment #1.1: Type: text/plain, Size: 1403 bytes --] Hi Stefan, Great topic. I am also extremely interested in helping Virtio-net become the standard for the networking industry (the universe of DPDK, etc). On 22 April 2015 at 19:01, Stefan Hajnoczi <stefanha@redhat.com> wrote: > [It may be necessary to remove virtio-dev@lists.oasis-open.org from CC > if you are a non-TC member.] > [Done.] I think this approach could be fruitful in bringing virtio-net to > VM-to-VM networking use cases. Unless virtio-net is extended for this > use case, I'm afraid DPDK and OpenDataPlane communities might steer > clear of VIRTIO. > Questions: - How fast is needed? - How fast is the vhost-user support that shipped in DPDK 2.0? - How fast would the new design likely be? Our recent experience in Snabb Switch land is that networking on x86 is now more of an HPC problem than a system programming problem. The SIMD bandwidth per core keeps increasing, and this erodes the value of traditional (and complex) system programming optimizations. I will be interested to compare notes with others on this, already on Haswell but more so when we have AVX512. Incidentally, we also did a pile of work last year on zero-copy NIC->VM transfers and discovered a lot of interesting problems and edge cases where Virtio-net spec and/or drivers are hard to match up with common NICs.
Happy to explain a bit about our experience if that would be valuable. Cheers, -Luke [-- Attachment #1.2: Type: text/html, Size: 2155 bytes --] [-- Attachment #2: Type: text/plain, Size: 183 bytes --] _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-24 8:12 ` [virtio-dev] " Luke Gorrie @ 2015-04-24 8:20 ` Paolo Bonzini 2015-04-24 9:47 ` Stefan Hajnoczi 1 sibling, 0 replies; 25+ messages in thread From: Paolo Bonzini @ 2015-04-24 8:20 UTC (permalink / raw) To: Luke Gorrie, Stefan Hajnoczi Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, virtualization, virtio-comment, Dr. David Alan Gilbert On 24/04/2015 10:12, Luke Gorrie wrote: > > I think this approach could be fruitful in bringing virtio-net to > VM-to-VM networking use cases. Unless virtio-net is extended for this > use case, I'm afraid DPDK and OpenDataPlane communities might steer > clear of VIRTIO. > > > Questions: > > - How fast is needed? > > - How fast is the vhost-user support that shipped in DPDK 2.0? vhost-user is fast. The problem is not the speed, it's the desire of a more peer-to-peer operation. virtio by design has very distinct roles for driver and device, so for VM2VM communication the virtio design requires two devices in the guest and two drivers, comprising a "switch", in the host. The switch could be using vhost-user indeed, but my understanding is that in some cases this switch component is undesirable. However, my understanding does not include _why_ it is undesirable. This is where we need to gather more information from the DPDK folks. Paolo ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-24 8:12 ` [virtio-dev] " Luke Gorrie 2015-04-24 8:20 ` Paolo Bonzini @ 2015-04-24 9:47 ` Stefan Hajnoczi 2015-04-24 9:50 ` Stefan Hajnoczi ` (2 more replies) 1 sibling, 3 replies; 25+ messages in thread From: Stefan Hajnoczi @ 2015-04-24 9:47 UTC (permalink / raw) To: Luke Gorrie Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert On Fri, Apr 24, 2015 at 9:12 AM, Luke Gorrie <luke@snabb.co> wrote: > - How fast would the new design likely be? This proposal eliminates two things in the path: 1. Compared to vhost_net, it bypasses the host tun driver and network stack, replacing it with direct vhost_net <-> vhost_net data transfer. At this level it's compared to vhost-user, but it's not programmable in userspace! 2. Data copies are eliminated because the Shared Buffers BAR gives both VMs access to the packets. My concern is the overhead of the vhost_net component copying descriptors between NICs. In a 100% shared memory model, each VM only has a receive queue that the other VM places packets into. There are no tx queues. The notification mechanism is an event fd that is ioeventfd for VM1 and irqfd for VM2. In other words, when VM1 kicks the queue, VM2 receives an interrupt (of course polling the receive queue is also possible). It would be interesting to compare the two approaches. > Our recent experience in Snabb Switch land is that networking on x86 is now > more of an HPC problem than a system programming problem. The SIMD bandwidth > per core keeps increasing, and this erodes the value of traditional (and > complex) system programming optimizations. I will be interested to compare > notes with others on this, already on Haswell but more so when we have > AVX512.
> > Incidentally, we also did a pile of work last year on zero-copy NIC->VM > transfers and discovered a lot of interesting problems and edge cases where > Virtio-net spec and/or drivers are hard to match up with common NICs. Happy > to explain a bit about our experience if that would be valuable. That sounds interesting, can you describe the setup? Stefan ^ permalink raw reply [flat|nested] 25+ messages in thread
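The "100% shared memory model" mentioned in the message above (receive queues only, with an event notification from sender to receiver) could look roughly like the following Python sketch. Everything here is an assumption made for illustration: RxRing is a hypothetical name, a Python list stands in for a ring placed in shared memory, and the notify callable stands in for the eventfd (ioeventfd on the sending side, irqfd on the receiving side). Memory barriers and cross-VM visibility, which a real ring would need, are not modelled.

```python
class RxRing:
    """Single-producer, single-consumer receive ring owned by one VM."""

    def __init__(self, slots, notify):
        self.slots = [None] * slots
        self.head = 0         # next slot the remote (sending) VM fills
        self.tail = 0         # next slot the owning (receiving) VM reads
        self.notify = notify  # called once per enqueued packet

    def put(self, packet):
        """Called by the sending VM; there is no separate tx queue."""
        if self.head - self.tail == len(self.slots):
            return False      # ring full, sender must retry or drop
        self.slots[self.head % len(self.slots)] = packet
        self.head += 1
        self.notify()         # the "kick" that becomes the peer's interrupt
        return True

    def poll(self):
        """Called by the owning VM from its interrupt handler or poll loop."""
        packets = []
        while self.tail != self.head:
            packets.append(self.slots[self.tail % len(self.slots)])
            self.tail += 1
        return packets
```

A VM-to-VM link would use two such rings, one per direction. Because poll drains whatever is present, a receiver that prefers polling can simply pass a no-op notify.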
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-24 9:47 ` Stefan Hajnoczi @ 2015-04-24 9:50 ` Stefan Hajnoczi 2015-04-24 12:17 ` Luke Gorrie 2015-04-24 12:34 ` Luke Gorrie 2 siblings, 0 replies; 25+ messages in thread From: Stefan Hajnoczi @ 2015-04-24 9:50 UTC (permalink / raw) To: Luke Gorrie Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert On Fri, Apr 24, 2015 at 10:47 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > At this level it's compared to vhost-user, but it's not programmable > in userspace! s/compared/comparable/ ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-24 9:47 ` Stefan Hajnoczi 2015-04-24 9:50 ` Stefan Hajnoczi @ 2015-04-24 12:17 ` Luke Gorrie 2015-04-24 13:10 ` Luke Gorrie 2015-04-24 13:22 ` Stefan Hajnoczi 2015-04-24 12:34 ` Luke Gorrie 2 siblings, 2 replies; 25+ messages in thread From: Luke Gorrie @ 2015-04-24 12:17 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert [-- Attachment #1.1: Type: text/plain, Size: 1554 bytes --] On 24 April 2015 at 11:47, Stefan Hajnoczi <stefanha@gmail.com> wrote: > My concern is the overhead of the vhost_net component copying > descriptors between NICs. I see. So you would not have to reserve CPU resources for vswitches. Instead you would give all cores to the VMs and they would pay for their own networking. This would be especially appealing in the extreme case where all networking is "Layer 1" connectivity between local virtual machines. This would make VM<->VM links different to VM<->network links. I suppose that when you created VMs you would need to be conscious of whether or not you are placing them on the same host or NUMA node so that you can predict what network performance will be available. For what it is worth, I think this would make life more difficult for network operators hosting DPDK-style network applications ("NFV"). Virtio-net would become a more complex abstraction, the orchestration systems would need to take this into account, and there would be more opportunity for interoperability problems between virtual machines. The simpler alternative that I prefer is to provide network operators with a Virtio-net abstraction that behaves and performs in exactly the same way for all kinds of network traffic -- whether or not the VMs are on the same machine and NUMA node. 
That would be more in line with SR-IOV behavior which seems to me like the other horse in this race. Perhaps my world view here is too narrow though and other technologies like ivshmem are more relevant than I give them credit for? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-24 12:17 ` Luke Gorrie @ 2015-04-24 13:10 ` Luke Gorrie 2015-04-24 13:23 ` Stefan Hajnoczi 2015-04-24 13:22 ` Stefan Hajnoczi 1 sibling, 1 reply; 25+ messages in thread From: Luke Gorrie @ 2015-04-24 13:10 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert [-- Attachment #1.1: Type: text/plain, Size: 363 bytes --] On 24 April 2015 at 14:17, Luke Gorrie <luke@snabb.co> wrote: > For what it is worth, I think > Erm, sorry about ranting with my pre-existing ideas without having examined the proposed specification in detail. I have a long backlog of things that I have been meaning to discuss with the Virtio-net community but have not previously had time to. Humbly! -Luke ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-24 13:10 ` Luke Gorrie @ 2015-04-24 13:23 ` Stefan Hajnoczi 0 siblings, 0 replies; 25+ messages in thread From: Stefan Hajnoczi @ 2015-04-24 13:23 UTC (permalink / raw) To: Luke Gorrie Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, Paolo Bonzini, Dr. David Alan Gilbert On Fri, Apr 24, 2015 at 2:10 PM, Luke Gorrie <luke@snabb.co> wrote: > On 24 April 2015 at 14:17, Luke Gorrie <luke@snabb.co> wrote: >> >> For what it is worth, I think > > > Erm, sorry about ranting with my pre-existing ideas without having examined > the proposed specification in detail. My experience with DPDK and SDN/NFV is limited, so I appreciate your input! Stefan ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-24 12:17 ` Luke Gorrie 2015-04-24 13:10 ` Luke Gorrie @ 2015-04-24 13:22 ` Stefan Hajnoczi 2015-04-26 13:24 ` Luke Gorrie 1 sibling, 1 reply; 25+ messages in thread From: Stefan Hajnoczi @ 2015-04-24 13:22 UTC (permalink / raw) To: Luke Gorrie Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert On Fri, Apr 24, 2015 at 1:17 PM, Luke Gorrie <luke@snabb.co> wrote: > On 24 April 2015 at 11:47, Stefan Hajnoczi <stefanha@gmail.com> wrote: >> >> My concern is the overhead of the vhost_net component copying >> descriptors between NICs. > > > I see. So you would not have to reserve CPU resources for vswitches. Instead > you would give all cores to the VMs and they would pay for their own > networking. This would be especially appealing in the extreme case where all > networking is "Layer 1" connectivity between local virtual machines. > > This would make VM<->VM links different to VM<->network links. I suppose > that when you created VMs you would need to be conscious of whether or not > you are placing them on the same host or NUMA node so that you can predict > what network performance will be available. The motivation for making VM-to-VM fast is that while software switches on the host are efficient today (thanks to vhost-user), there is no efficient solution if the software switch is a VM. Have you had requests to run SnabbSwitch in a VM instead of on the host? For example, if someone wants to deploy it in a cloud environment they will not be allowed to run arbitrary software on the host. Stefan ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-24 13:22 ` Stefan Hajnoczi @ 2015-04-26 13:24 ` Luke Gorrie 2015-04-27 10:17 ` Stefan Hajnoczi 0 siblings, 1 reply; 25+ messages in thread From: Luke Gorrie @ 2015-04-26 13:24 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote: > The motivation for making VM-to-VM fast is that while software > switches on the host are efficient today (thanks to vhost-user), there > is no efficient solution if the software switch is a VM. I see. This sounds like a noble goal indeed. I would love to run the software switch as just another VM in the long term. It would make it much easier for the various software switches to coexist in the world. The main technical risk I see in this proposal is that eliminating the memory copies might not have the desired effect. I might be tempted to keep the copies but prevent the kernel from having to inspect the vrings (more like vhost-user). But that is just a hunch and I suppose the first step would be a prototype to check the performance anyway. For what it is worth here is my view of networking performance on x86 in the Haswell+ era: https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow > Have you had requests to run SnabbSwitch in a VM instead of on the > host? This is not something we have discussed. I can say that I am not satisfied with our installation process on the host. I want this to be trivially easy, but it is not. On the one hand we make some parts easy: we only require one executable file (~1.5MB) and it works on any modern distro and kernel.
On the other hand we require the user to edit grub.conf to reserve cores and keep the IOMMU out of the way, and to manually run a traffic process for each 10G port pinned to a suitable core. That requires a bunch of downstream work. Gory details: https://github.com/SnabbCo/snabb-nfv/wiki/Compute-node-requirements This should be much simpler. I would quite like to be able to wrap this up in a VM or a container. The risk is that then we become dependent on other systems (e.g. OpenStack) pinning cores correctly, etc, and that might be placing unrealistic expectations on the orchestration systems of the present and near future (?). I mean: if we make this somebody else's problem, we had better trust that they will do it right. Cheers, -Luke _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-26 13:24 ` Luke Gorrie @ 2015-04-27 10:17 ` Stefan Hajnoczi 2015-04-27 10:36 ` Michael S. Tsirkin 2015-04-27 12:35 ` Jan Kiszka 0 siblings, 2 replies; 25+ messages in thread From: Stefan Hajnoczi @ 2015-04-27 10:17 UTC (permalink / raw) To: Luke Gorrie Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote: > On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote: >> >> The motivation for making VM-to-VM fast is that while software >> switches on the host are efficient today (thanks to vhost-user), there >> is no efficient solution if the software switch is a VM. > > > I see. This sounds like a noble goal indeed. I would love to run the > software switch as just another VM in the long term. It would make it much > easier for the various software switches to coexist in the world. > > The main technical risk I see in this proposal is that eliminating the > memory copies might not have the desired effect. I might be tempted to keep > the copies but prevent the kernel from having to inspect the vrings (more > like vhost-user). But that is just a hunch and I suppose the first step > would be a prototype to check the performance anyway. > > For what it is worth here is my view of networking performance on x86 in the > Haswell+ era: > https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow Thanks. I've been thinking about how to eliminate the VM <-> host <-> VM switching and instead achieve just VM <-> VM. The holy grail of VM-to-VM networking is an exitless I/O path. In other words, packets can be transferred between VMs without any vmexits (this requires a polling driver). Here is how it works. 
QEMU gets "-device vhost-user" so that a VM can act as the vhost-user server: VM1 (virtio-net guest driver) <-> VM2 (vhost-user device) VM1 has a regular virtio-net PCI device. VM2 has a vhost-user device and plays the host role instead of the normal virtio-net guest driver role. The ugly thing about this is that VM2 needs to map all of VM1's guest RAM so it can access the vrings and packet data. The solution to this is something like the Shared Buffers BAR but this time it contains not just the packet data but also the vring, let's call it the Shared Virtqueues BAR. The Shared Virtqueues BAR eliminates the need for vhost-net on the host because VM1 and VM2 communicate directly using virtqueue notify or polling vring memory. Virtqueue notify works by connecting an eventfd as ioeventfd in VM1 and irqfd in VM2. And VM2 would also have an ioeventfd that is irqfd for VM1 to signal completions. Stefan ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-27 10:17 ` Stefan Hajnoczi @ 2015-04-27 10:36 ` Michael S. Tsirkin 2015-04-27 12:35 ` Jan Kiszka 1 sibling, 0 replies; 25+ messages in thread From: Michael S. Tsirkin @ 2015-04-27 10:36 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Rik van Riel, Andrew Jones, Linux Virtualization, Luke Gorrie, Stefan Hajnoczi, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert On Mon, Apr 27, 2015 at 11:17:44AM +0100, Stefan Hajnoczi wrote: > On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote: > > On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote: > >> > >> The motivation for making VM-to-VM fast is that while software > >> switches on the host are efficient today (thanks to vhost-user), there > >> is no efficient solution if the software switch is a VM. > > > > > > I see. This sounds like a noble goal indeed. I would love to run the > > software switch as just another VM in the long term. It would make it much > > easier for the various software switches to coexist in the world. > > > > The main technical risk I see in this proposal is that eliminating the > > memory copies might not have the desired effect. I might be tempted to keep > > the copies but prevent the kernel from having to inspect the vrings (more > > like vhost-user). But that is just a hunch and I suppose the first step > > would be a prototype to check the performance anyway. > > > > For what it is worth here is my view of networking performance on x86 in the > > Haswell+ era: > > https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow > > Thanks. > > I've been thinking about how to eliminate the VM <-> host <-> VM > switching and instead achieve just VM <-> VM. > > The holy grail of VM-to-VM networking is an exitless I/O path. In > other words, packets can be transferred between VMs without any > vmexits (this requires a polling driver). > > Here is how it works. 
QEMU gets "-device vhost-user" so that a VM can > act as the vhost-user server: > > VM1 (virtio-net guest driver) <-> VM2 (vhost-user device) > > VM1 has a regular virtio-net PCI device. VM2 has a vhost-user device > and plays the host role instead of the normal virtio-net guest driver > role. > > The ugly thing about this is that VM2 needs to map all of VM1's guest > RAM so it can access the vrings and packet data. The solution to this > is something like the Shared Buffers BAR but this time it contains not > just the packet data but also the vring, let's call it the Shared > Virtqueues BAR. > > The Shared Virtqueues BAR eliminates the need for vhost-net on the > host because VM1 and VM2 communicate directly using virtqueue notify > or polling vring memory. Virtqueue notify works by connecting an > eventfd as ioeventfd in VM1 and irqfd in VM2. And VM2 would also have > an ioeventfd that is irqfd for VM1 to signal completions. > > Stefan So this definitely works, it's just another virtio transport. Though this might mean guests need to copy data out to/from this BAR. -- MST ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-27 10:17 ` Stefan Hajnoczi 2015-04-27 10:36 ` Michael S. Tsirkin @ 2015-04-27 12:35 ` Jan Kiszka 2015-04-27 12:55 ` Jan Kiszka ` (2 more replies) 1 sibling, 3 replies; 25+ messages in thread From: Jan Kiszka @ 2015-04-27 12:35 UTC (permalink / raw) To: Stefan Hajnoczi, Luke Gorrie Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert On 2015-04-27 at 12:17, Stefan Hajnoczi wrote: > On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote: >> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>> >>> The motivation for making VM-to-VM fast is that while software >>> switches on the host are efficient today (thanks to vhost-user), there >>> is no efficient solution if the software switch is a VM. >> >> >> I see. This sounds like a noble goal indeed. I would love to run the >> software switch as just another VM in the long term. It would make it much >> easier for the various software switches to coexist in the world. >> >> The main technical risk I see in this proposal is that eliminating the >> memory copies might not have the desired effect. I might be tempted to keep >> the copies but prevent the kernel from having to inspect the vrings (more >> like vhost-user). But that is just a hunch and I suppose the first step >> would be a prototype to check the performance anyway. >> >> For what it is worth here is my view of networking performance on x86 in the >> Haswell+ era: >> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow > > Thanks. > > I've been thinking about how to eliminate the VM <-> host <-> VM > switching and instead achieve just VM <-> VM. > > The holy grail of VM-to-VM networking is an exitless I/O path. In > other words, packets can be transferred between VMs without any > vmexits (this requires a polling driver). > > Here is how it works.
QEMU gets "-device vhost-user" so that a VM can > act as the vhost-user server: > > VM1 (virtio-net guest driver) <-> VM2 (vhost-user device) > > VM1 has a regular virtio-net PCI device. VM2 has a vhost-user device > and plays the host role instead of the normal virtio-net guest driver > role. > > The ugly thing about this is that VM2 needs to map all of VM1's guest > RAM so it can access the vrings and packet data. The solution to this > is something like the Shared Buffers BAR but this time it contains not > just the packet data but also the vring, let's call it the Shared > Virtqueues BAR. > > The Shared Virtqueues BAR eliminates the need for vhost-net on the > host because VM1 and VM2 communicate directly using virtqueue notify > or polling vring memory. Virtqueue notify works by connecting an > eventfd as ioeventfd in VM1 and irqfd in VM2. And VM2 would also have > an ioeventfd that is irqfd for VM1 to signal completions. We had such a discussion before: http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658 Would be great to get this ball rolling again. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-27 12:35 ` Jan Kiszka @ 2015-04-27 12:55 ` Jan Kiszka 2015-04-27 13:01 ` Stefan Hajnoczi 2015-04-27 12:57 ` Stefan Hajnoczi 2015-04-27 13:17 ` Michael S. Tsirkin 2 siblings, 1 reply; 25+ messages in thread From: Jan Kiszka @ 2015-04-27 12:55 UTC (permalink / raw) To: Stefan Hajnoczi, Luke Gorrie Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, Paolo Bonzini, Dr. David Alan Gilbert On 2015-04-27 at 14:35, Jan Kiszka wrote: > On 2015-04-27 at 12:17, Stefan Hajnoczi wrote: >> On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote: >>> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>> >>>> The motivation for making VM-to-VM fast is that while software >>>> switches on the host are efficient today (thanks to vhost-user), there >>>> is no efficient solution if the software switch is a VM. >>> >>> >>> I see. This sounds like a noble goal indeed. I would love to run the >>> software switch as just another VM in the long term. It would make it much >>> easier for the various software switches to coexist in the world. >>> >>> The main technical risk I see in this proposal is that eliminating the >>> memory copies might not have the desired effect. I might be tempted to keep >>> the copies but prevent the kernel from having to inspect the vrings (more >>> like vhost-user). But that is just a hunch and I suppose the first step >>> would be a prototype to check the performance anyway. >>> >>> For what it is worth here is my view of networking performance on x86 in the >>> Haswell+ era: >>> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow >> >> Thanks. >> >> I've been thinking about how to eliminate the VM <-> host <-> VM >> switching and instead achieve just VM <-> VM. >> >> The holy grail of VM-to-VM networking is an exitless I/O path.
In >> other words, packets can be transferred between VMs without any >> vmexits (this requires a polling driver). >> >> Here is how it works. QEMU gets "-device vhost-user" so that a VM can >> act as the vhost-user server: >> >> VM1 (virtio-net guest driver) <-> VM2 (vhost-user device) >> >> VM1 has a regular virtio-net PCI device. VM2 has a vhost-user device >> and plays the host role instead of the normal virtio-net guest driver >> role. >> >> The ugly thing about this is that VM2 needs to map all of VM1's guest >> RAM so it can access the vrings and packet data. The solution to this >> is something like the Shared Buffers BAR but this time it contains not >> just the packet data but also the vring, let's call it the Shared >> Virtqueues BAR. >> >> The Shared Virtqueues BAR eliminates the need for vhost-net on the >> host because VM1 and VM2 communicate directly using virtqueue notify >> or polling vring memory. Virtqueue notify works by connecting an >> eventfd as ioeventfd in VM1 and irqfd in VM2. And VM2 would also have >> an ioeventfd that is irqfd for VM1 to signal completions. > > We had such a discussion before: > http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658 > > Would be great to get this ball rolling again. > > Jan > But one challenge would remain even then (unless both sides only poll): exit-free inter-VM signaling, no? But that's a hardware issue first of all. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-27 12:55 ` Jan Kiszka @ 2015-04-27 13:01 ` Stefan Hajnoczi 2015-04-27 13:08 ` Muli Ben-Yehuda 2015-04-27 14:30 ` Jan Kiszka 0 siblings, 2 replies; 25+ messages in thread From: Stefan Hajnoczi @ 2015-04-27 13:01 UTC (permalink / raw) To: Jan Kiszka Cc: Rik van Riel, Michael S. Tsirkin, Andrew Jones, Linux Virtualization, Luke Gorrie, Stefan Hajnoczi, Paolo Bonzini, Dr. David Alan Gilbert On Mon, Apr 27, 2015 at 1:55 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote: > On 2015-04-27 at 14:35, Jan Kiszka wrote: >> On 2015-04-27 at 12:17, Stefan Hajnoczi wrote: >>> On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote: >>>> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>> >>>>> The motivation for making VM-to-VM fast is that while software >>>>> switches on the host are efficient today (thanks to vhost-user), there >>>>> is no efficient solution if the software switch is a VM. >>>> >>>> >>>> I see. This sounds like a noble goal indeed. I would love to run the >>>> software switch as just another VM in the long term. It would make it much >>>> easier for the various software switches to coexist in the world. >>>> >>>> The main technical risk I see in this proposal is that eliminating the >>>> memory copies might not have the desired effect. I might be tempted to keep >>>> the copies but prevent the kernel from having to inspect the vrings (more >>>> like vhost-user). But that is just a hunch and I suppose the first step >>>> would be a prototype to check the performance anyway. >>>> >>>> For what it is worth here is my view of networking performance on x86 in the >>>> Haswell+ era: >>>> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow >>> >>> Thanks. >>> >>> I've been thinking about how to eliminate the VM <-> host <-> VM >>> switching and instead achieve just VM <-> VM. >>> >>> The holy grail of VM-to-VM networking is an exitless I/O path.
In >>> other words, packets can be transferred between VMs without any >>> vmexits (this requires a polling driver). >>> >>> Here is how it works. QEMU gets "-device vhost-user" so that a VM can >>> act as the vhost-user server: >>> >>> VM1 (virtio-net guest driver) <-> VM2 (vhost-user device) >>> >>> VM1 has a regular virtio-net PCI device. VM2 has a vhost-user device >>> and plays the host role instead of the normal virtio-net guest driver >>> role. >>> >>> The ugly thing about this is that VM2 needs to map all of VM1's guest >>> RAM so it can access the vrings and packet data. The solution to this >>> is something like the Shared Buffers BAR but this time it contains not >>> just the packet data but also the vring, let's call it the Shared >>> Virtqueues BAR. >>> >>> The Shared Virtqueues BAR eliminates the need for vhost-net on the >>> host because VM1 and VM2 communicate directly using virtqueue notify >>> or polling vring memory. Virtqueue notify works by connecting an >>> eventfd as ioeventfd in VM1 and irqfd in VM2. And VM2 would also have >>> an ioeventfd that is irqfd for VM1 to signal completions. >> >> We had such a discussion before: >> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658 >> >> Would be great to get this ball rolling again. >> >> Jan >> > > But one challenge would remain even then (unless both sides only poll): > exit-free inter-VM signaling, no? But that's a hardware issue first of all. To start with ioeventfd<->irqfd can be used. It incurs a light-weight exit in VM1 and interrupt injection in VM2. For networking the cost is mitigated by NAPI drivers which switch between interrupts and polling. During notification-heavy periods the guests would use polling anyway. A hardware solution would be some kind of inter-guest interrupt injection. I don't know VMX well enough to know whether that is possible on Intel CPUs. Stefan ^ permalink raw reply [flat|nested] 25+ messages in thread
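The NAPI-style switch between interrupts and polling mentioned above can be illustrated with a toy model. This is only a sketch of the idea, not the Linux NAPI API: the class NapiLikeQueue, its budget parameter and its counters are hypothetical names introduced here. The first packet raises one notification, the handler masks further notifications and polls in batches, and notifications are re-armed only once a poll finds the queue drained.

```python
from collections import deque


class NapiLikeQueue:
    """Toy receive queue that alternates between notification and polling."""

    def __init__(self, budget=4):
        self.pending = deque()             # packets waiting to be processed
        self.budget = budget               # max packets handled per poll pass
        self.notifications_enabled = True  # True means "interrupt mode"
        self.interrupts = 0                # notifications actually delivered

    def enqueue(self, pkt):
        """Producer side: add a packet and notify only if notifications are armed."""
        self.pending.append(pkt)
        if self.notifications_enabled:
            self.interrupts += 1
            # The handler masks notifications and switches to polling.
            self.notifications_enabled = False

    def poll(self):
        """Consumer side: process up to `budget` packets in one pass."""
        done = []
        while self.pending and len(done) < self.budget:
            done.append(self.pending.popleft())
        if len(done) < self.budget:
            # Queue drained before the budget ran out: back to interrupt mode.
            self.notifications_enabled = True
        return done


if __name__ == "__main__":
    q = NapiLikeQueue(budget=4)
    for i in range(10):
        q.enqueue(i)
    while not q.notifications_enabled:
        print(q.poll())
    print("notifications delivered:", q.interrupts)
```

Under a burst, the number of notifications stays far below the number of packets, which is why the exit cost of ioeventfd<->irqfd matters less for throughput than for latency.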
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-27 13:01 ` Stefan Hajnoczi @ 2015-04-27 13:08 ` Muli Ben-Yehuda 2015-04-27 14:30 ` Jan Kiszka 1 sibling, 0 replies; 25+ messages in thread From: Muli Ben-Yehuda @ 2015-04-27 13:08 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Rik van Riel, Michael S. Tsirkin, Jan Kiszka, Andrew Jones, Linux Virtualization, Luke Gorrie, Stefan Hajnoczi, Paolo Bonzini, Dr. David Alan Gilbert On Mon, Apr 27, 2015 at 02:01:05PM +0100, Stefan Hajnoczi wrote: > A hardware solution would be some kind of inter-guest interrupt > injection. I don't know VMX well enough to know whether that is > possible on Intel CPUs. It is: http://www.mulix.org/pubs/eli/eli.pdf. (And there's hardware coming down the pipe that will make (some) of the nasty tricks we used unnecessary). Cheers, Muli ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-27 13:01 ` Stefan Hajnoczi 2015-04-27 13:08 ` Muli Ben-Yehuda @ 2015-04-27 14:30 ` Jan Kiszka 2015-04-27 14:36 ` Luke Gorrie 2015-04-27 14:40 ` Michael S. Tsirkin 1 sibling, 2 replies; 25+ messages in thread From: Jan Kiszka @ 2015-04-27 14:30 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Rik van Riel, Michael S. Tsirkin, Andrew Jones, Linux Virtualization, Luke Gorrie, Stefan Hajnoczi, Paolo Bonzini, Dr. David Alan Gilbert On 2015-04-27 at 15:01, Stefan Hajnoczi wrote: > On Mon, Apr 27, 2015 at 1:55 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote: >> On 2015-04-27 at 14:35, Jan Kiszka wrote: >>> On 2015-04-27 at 12:17, Stefan Hajnoczi wrote: >>>> On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote: >>>>> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>>> >>>>>> The motivation for making VM-to-VM fast is that while software >>>>>> switches on the host are efficient today (thanks to vhost-user), there >>>>>> is no efficient solution if the software switch is a VM. >>>>> >>>>> >>>>> I see. This sounds like a noble goal indeed. I would love to run the >>>>> software switch as just another VM in the long term. It would make it much >>>>> easier for the various software switches to coexist in the world. >>>>> >>>>> The main technical risk I see in this proposal is that eliminating the >>>>> memory copies might not have the desired effect. I might be tempted to keep >>>>> the copies but prevent the kernel from having to inspect the vrings (more >>>>> like vhost-user). But that is just a hunch and I suppose the first step >>>>> would be a prototype to check the performance anyway. >>>>> >>>>> For what it is worth here is my view of networking performance on x86 in the >>>>> Haswell+ era: >>>>> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow >>>> >>>> Thanks.
>>>> >>>> I've been thinking about how to eliminate the VM <-> host <-> VM >>>> switching and instead achieve just VM <-> VM. >>>> >>>> The holy grail of VM-to-VM networking is an exitless I/O path. In >>>> other words, packets can be transferred between VMs without any >>>> vmexits (this requires a polling driver). >>>> >>>> Here is how it works. QEMU gets "-device vhost-user" so that a VM can >>>> act as the vhost-user server: >>>> >>>> VM1 (virtio-net guest driver) <-> VM2 (vhost-user device) >>>> >>>> VM1 has a regular virtio-net PCI device. VM2 has a vhost-user device >>>> and plays the host role instead of the normal virtio-net guest driver >>>> role. >>>> >>>> The ugly thing about this is that VM2 needs to map all of VM1's guest >>>> RAM so it can access the vrings and packet data. The solution to this >>>> is something like the Shared Buffers BAR but this time it contains not >>>> just the packet data but also the vring, let's call it the Shared >>>> Virtqueues BAR. >>>> >>>> The Shared Virtqueues BAR eliminates the need for vhost-net on the >>>> host because VM1 and VM2 communicate directly using virtqueue notify >>>> or polling vring memory. Virtqueue notify works by connecting an >>>> eventfd as ioeventfd in VM1 and irqfd in VM2. And VM2 would also have >>>> an ioeventfd that is irqfd for VM1 to signal completions. >>> >>> We had such a discussion before: >>> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658 >>> >>> Would be great to get this ball rolling again. >>> >>> Jan >>> >> >> But one challenge would remain even then (unless both sides only poll): >> exit-free inter-VM signaling, no? But that's a hardware issue first of all. > > To start with ioeventfd<->irqfd can be used. It incurs a light-weight > exit in VM1 and interrupt injection in VM2. > > For networking the cost is mitigated by NAPI drivers which switch > between interrupts and polling. During notification-heavy periods the > guests would use polling anyway. 
> > A hardware solution would be some kind of inter-guest interrupt > injection. I don't know VMX well enough to know whether that is > possible on Intel CPUs. Today, we have posted interrupts to avoid the vm-exit on the target CPU, but there is nothing yet (to my best knowledge) to avoid the exit on the sender side (unless we ignore security). That's the same problem with intra-guest IPIs, BTW. For throughput and given NAPI patterns, that's probably not an issue as you noted. It may be for latency, though, when almost every cycle counts. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-27 14:30 ` Jan Kiszka @ 2015-04-27 14:36 ` Luke Gorrie 2015-04-27 14:38 ` Jan Kiszka 2015-04-27 14:40 ` Michael S. Tsirkin 1 sibling, 1 reply; 25+ messages in thread From: Luke Gorrie @ 2015-04-27 14:36 UTC (permalink / raw) To: Jan Kiszka Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, Paolo Bonzini, Dr. David Alan Gilbert On 27 April 2015 at 16:30, Jan Kiszka <jan.kiszka@siemens.com> wrote: > Today, we have posted interrupts to avoid the vm-exit on the target CPU, > but there is nothing yet (to my best knowledge) to avoid the exit on the > sender side (unless we ignore security). That's the same problem with > intra-guest IPIs, BTW. > > For throughput and given NAPI patterns, that's probably not an issue as > you noted. It may be for latency, though, when almost every cycle counts. Poll-mode networking applications (DPDK, Snabb Switch, etc) are typically busy-looping to poll the vring. They may have a very short usleep() between checks to save power but they don't wait on their eventfd. So for those particular applications latency is on the order of tens of microseconds even without guest exits. ^ permalink raw reply [flat|nested] 25+ messages in thread
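The busy-polling pattern described above can be sketched with a toy single-producer, single-consumer ring that lives entirely in one shared buffer, the way both the ring and the packet data would live in a shared BAR. This is not the virtio vring layout: the slot count, slot size, header format and the function names ring_bytes, send, poll_once and poll are all assumptions made for illustration. A real implementation would also need memory barriers between writing a slot and publishing the index, which Python cannot express.

```python
import struct
import time

RING_SLOTS = 4               # hypothetical ring size (power of two)
SLOT_SIZE = 64               # hypothetical fixed slot size in bytes
HDR = struct.Struct("<HH")   # producer index, consumer index (free-running u16)
LEN = struct.Struct("<H")    # per-slot payload length


def ring_bytes():
    """Allocate the buffer that stands in for the shared memory region."""
    return bytearray(HDR.size + RING_SLOTS * SLOT_SIZE)


def _slot(idx):
    """Byte offset of the slot that a free-running index maps to."""
    return HDR.size + (idx % RING_SLOTS) * SLOT_SIZE


def send(mem, payload):
    """Producer: copy payload into the next slot, then publish the index."""
    prod, cons = HDR.unpack_from(mem, 0)
    if (prod - cons) & 0xFFFF == RING_SLOTS:
        return False                      # ring full
    if len(payload) > SLOT_SIZE - LEN.size:
        return False                      # payload does not fit in a slot
    off = _slot(prod)
    LEN.pack_into(mem, off, len(payload))
    mem[off + LEN.size:off + LEN.size + len(payload)] = payload
    struct.pack_into("<H", mem, 0, (prod + 1) & 0xFFFF)
    return True


def poll_once(mem):
    """Consumer: return one packet if available, else None. Never blocks."""
    prod, cons = HDR.unpack_from(mem, 0)
    if prod == cons:
        return None                       # nothing new
    off = _slot(cons)
    (n,) = LEN.unpack_from(mem, off)
    data = bytes(mem[off + LEN.size:off + LEN.size + n])
    struct.pack_into("<H", mem, 2, (cons + 1) & 0xFFFF)
    return data


def poll(mem, spins, pause=0.0):
    """Busy-poll up to `spins` times, optionally sleeping briefly in between."""
    for _ in range(spins):
        pkt = poll_once(mem)
        if pkt is not None:
            return pkt
        if pause:
            time.sleep(pause)             # the "very short usleep()" to save power
    return None


if __name__ == "__main__":
    mem = ring_bytes()
    send(mem, b"ping")
    print(poll(mem, spins=10, pause=0.00001))
```

No notification is involved at any point: the consumer learns about new packets only by re-reading the producer index, so the worst-case latency is bounded by the pause between polls rather than by interrupt delivery.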
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-27 14:36 ` Luke Gorrie @ 2015-04-27 14:38 ` Jan Kiszka 0 siblings, 0 replies; 25+ messages in thread From: Jan Kiszka @ 2015-04-27 14:38 UTC (permalink / raw) To: Luke Gorrie Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, Paolo Bonzini, Dr. David Alan Gilbert On 2015-04-27 at 16:36, Luke Gorrie wrote: > On 27 April 2015 at 16:30, Jan Kiszka <jan.kiszka@siemens.com> wrote: > >> Today, we have posted interrupts to avoid the vm-exit on the target CPU, >> but there is nothing yet (to my best knowledge) to avoid the exit on the >> sender side (unless we ignore security). That's the same problem with >> intra-guest IPIs, BTW. >> >> For throughput and given NAPI patterns, that's probably not an issue as >> you noted. It may be for latency, though, when almost every cycle counts. >> > > Poll-mode networking applications (DPDK, Snabb Switch, etc) are typically > busy-looping to poll the vring. They may have a very short usleep() between > checks to save power but they don't wait on their eventfd. So for those > particular applications latency is on the order of tens of microseconds > even without guest exits. That's one side, don't forget the others (the "normal" guests). Jan -- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-27 14:30 ` Jan Kiszka 2015-04-27 14:36 ` Luke Gorrie @ 2015-04-27 14:40 ` Michael S. Tsirkin 1 sibling, 0 replies; 25+ messages in thread From: Michael S. Tsirkin @ 2015-04-27 14:40 UTC (permalink / raw) To: Jan Kiszka Cc: Rik van Riel, Andrew Jones, Linux Virtualization, Luke Gorrie, Stefan Hajnoczi, Paolo Bonzini, Dr. David Alan Gilbert On Mon, Apr 27, 2015 at 04:30:35PM +0200, Jan Kiszka wrote: > On 2015-04-27 at 15:01, Stefan Hajnoczi wrote: > > On Mon, Apr 27, 2015 at 1:55 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote: > >> On 2015-04-27 at 14:35, Jan Kiszka wrote: > >>> On 2015-04-27 at 12:17, Stefan Hajnoczi wrote: > >>>> On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote: > >>>>> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote: > >>>>>> > >>>>>> The motivation for making VM-to-VM fast is that while software > >>>>>> switches on the host are efficient today (thanks to vhost-user), there > >>>>>> is no efficient solution if the software switch is a VM. > >>>>> > >>>>> > >>>>> I see. This sounds like a noble goal indeed. I would love to run the > >>>>> software switch as just another VM in the long term. It would make it much > >>>>> easier for the various software switches to coexist in the world. > >>>>> > >>>>> The main technical risk I see in this proposal is that eliminating the > >>>>> memory copies might not have the desired effect. I might be tempted to keep > >>>>> the copies but prevent the kernel from having to inspect the vrings (more > >>>>> like vhost-user). But that is just a hunch and I suppose the first step > >>>>> would be a prototype to check the performance anyway. > >>>>> > >>>>> For what it is worth here is my view of networking performance on x86 in the > >>>>> Haswell+ era: > >>>>> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow > >>>> > >>>> Thanks.
> >>>> > >>>> I've been thinking about how to eliminate the VM <-> host <-> VM > >>>> switching and instead achieve just VM <-> VM. > >>>> > >>>> The holy grail of VM-to-VM networking is an exitless I/O path. In > >>>> other words, packets can be transferred between VMs without any > >>>> vmexits (this requires a polling driver). > >>>> > >>>> Here is how it works. QEMU gets "-device vhost-user" so that a VM can > >>>> act as the vhost-user server: > >>>> > >>>> VM1 (virtio-net guest driver) <-> VM2 (vhost-user device) > >>>> > >>>> VM1 has a regular virtio-net PCI device. VM2 has a vhost-user device > >>>> and plays the host role instead of the normal virtio-net guest driver > >>>> role. > >>>> > >>>> The ugly thing about this is that VM2 needs to map all of VM1's guest > >>>> RAM so it can access the vrings and packet data. The solution to this > >>>> is something like the Shared Buffers BAR but this time it contains not > >>>> just the packet data but also the vring, let's call it the Shared > >>>> Virtqueues BAR. > >>>> > >>>> The Shared Virtqueues BAR eliminates the need for vhost-net on the > >>>> host because VM1 and VM2 communicate directly using virtqueue notify > >>>> or polling vring memory. Virtqueue notify works by connecting an > >>>> eventfd as ioeventfd in VM1 and irqfd in VM2. And VM2 would also have > >>>> an ioeventfd that is irqfd for VM1 to signal completions. > >>> > >>> We had such a discussion before: > >>> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658 > >>> > >>> Would be great to get this ball rolling again. > >>> > >>> Jan > >>> > >> > >> But one challenge would remain even then (unless both sides only poll): > >> exit-free inter-VM signaling, no? But that's a hardware issue first of all. > > > > To start with ioeventfd<->irqfd can be used. It incurs a light-weight > > exit in VM1 and interrupt injection in VM2. 
> > > > For networking the cost is mitigated by NAPI drivers which switch > > between interrupts and polling. During notification-heavy periods the > > guests would use polling anyway. > > > > A hardware solution would be some kind of inter-guest interrupt > > injection. I don't know VMX well enough to know whether that is > > possible on Intel CPUs. > > Today, we have posted interrupts to avoid the vm-exit on the target CPU, > but there is nothing yet (to my best knowledge) to avoid the exit on the > sender side (unless we ignore security). That's the same problem with > intra-guest IPIs, BTW. > > For throughput and given NAPI patterns, that's probably not an issue as > you noted. It may be for latency, though, when almost every cycle counts. > > Jan If you are counting cycles you likely can't afford the interrupt latency under linux, so you have to poll memory. > -- > Siemens AG, Corporate Technology, CT RTC ITP SES-DE > Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
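The NAPI-style switch between interrupts and polling mentioned above can be modelled in a few lines. This is only a toy sketch, not kernel code: `napi_poll`, the `deque` standing in for a ring, and the `budget` parameter are illustrative names, and the "keep polling while the budget is exhausted" rule is an assumption based on how NAPI drivers commonly behave.

```python
from collections import deque


def napi_poll(ring, budget):
    """Process at most `budget` packets from `ring`.

    Returns (processed, keep_polling). `ring` is a deque standing in for
    a receive ring; popping an entry stands in for handling one packet.
    """
    processed = 0
    while processed < budget and ring:
        ring.popleft()
        processed += 1
    # Budget used up: more work is probably pending, so stay in polling
    # mode with notifications disabled. Otherwise the ring was drained
    # and notifications (interrupts) can be re-enabled.
    keep_polling = processed == budget
    return processed, keep_polling
```

Under sustained load every call exhausts its budget, so the consumer never goes back to waiting for notifications, which is why the notification cost matters mostly for latency rather than throughput.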
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-27 12:35 ` Jan Kiszka 2015-04-27 12:55 ` Jan Kiszka @ 2015-04-27 12:57 ` Stefan Hajnoczi 2015-04-27 13:17 ` Michael S. Tsirkin 2 siblings, 0 replies; 25+ messages in thread From: Stefan Hajnoczi @ 2015-04-27 12:57 UTC (permalink / raw) To: Jan Kiszka Cc: Rik van Riel, Michael S. Tsirkin, Andrew Jones, Linux Virtualization, Luke Gorrie, Stefan Hajnoczi, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert On Mon, Apr 27, 2015 at 1:35 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote: > Am 2015-04-27 um 12:17 schrieb Stefan Hajnoczi: >> On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote: >>> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>> >>>> The motivation for making VM-to-VM fast is that while software >>>> switches on the host are efficient today (thanks to vhost-user), there >>>> is no efficient solution if the software switch is a VM. >>> >>> >>> I see. This sounds like a noble goal indeed. I would love to run the >>> software switch as just another VM in the long term. It would make it much >>> easier for the various software switches to coexist in the world. >>> >>> The main technical risk I see in this proposal is that eliminating the >>> memory copies might not have the desired effect. I might be tempted to keep >>> the copies but prevent the kernel from having to inspect the vrings (more >>> like vhost-user). But that is just a hunch and I suppose the first step >>> would be a prototype to check the performance anyway. >>> >>> For what it is worth here is my view of networking performance on x86 in the >>> Haswell+ era: >>> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow >> >> Thanks. >> >> I've been thinking about how to eliminate the VM <-> host <-> VM >> switching and instead achieve just VM <-> VM. >> >> The holy grail of VM-to-VM networking is an exitless I/O path. 
In >> other words, packets can be transferred between VMs without any >> vmexits (this requires a polling driver). >> >> Here is how it works. QEMU gets "-device vhost-user" so that a VM can >> act as the vhost-user server: >> >> VM1 (virtio-net guest driver) <-> VM2 (vhost-user device) >> >> VM1 has a regular virtio-net PCI device. VM2 has a vhost-user device >> and plays the host role instead of the normal virtio-net guest driver >> role. >> >> The ugly thing about this is that VM2 needs to map all of VM1's guest >> RAM so it can access the vrings and packet data. The solution to this >> is something like the Shared Buffers BAR but this time it contains not >> just the packet data but also the vring, let's call it the Shared >> Virtqueues BAR. >> >> The Shared Virtqueues BAR eliminates the need for vhost-net on the >> host because VM1 and VM2 communicate directly using virtqueue notify >> or polling vring memory. Virtqueue notify works by connecting an >> eventfd as ioeventfd in VM1 and irqfd in VM2. And VM2 would also have >> an ioeventfd that is irqfd for VM1 to signal completions. > > We had such a discussion before: > http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658 > > Would be great to get this ball rolling again. Thanks for the interesting link. Now that vhost-user exists, a QEMU -device vhost-user feature is a logical step. It would allow any virtio device to be emulated by another VM, not just virtio-net. It seems like a nice model for storage and networking appliance VMs. I don't have time to write the patches in the near future but can participate in code review and discussion. Stefan ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-27 12:35 ` Jan Kiszka 2015-04-27 12:55 ` Jan Kiszka 2015-04-27 12:57 ` Stefan Hajnoczi @ 2015-04-27 13:17 ` Michael S. Tsirkin 2 siblings, 0 replies; 25+ messages in thread From: Michael S. Tsirkin @ 2015-04-27 13:17 UTC (permalink / raw) To: Jan Kiszka Cc: Rik van Riel, Andrew Jones, Linux Virtualization, Luke Gorrie, Stefan Hajnoczi, virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert On Mon, Apr 27, 2015 at 02:35:19PM +0200, Jan Kiszka wrote: > Am 2015-04-27 um 12:17 schrieb Stefan Hajnoczi: > > On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote: > >> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote: > >>> > >>> The motivation for making VM-to-VM fast is that while software > >>> switches on the host are efficient today (thanks to vhost-user), there > >>> is no efficient solution if the software switch is a VM. > >> > >> > >> I see. This sounds like a noble goal indeed. I would love to run the > >> software switch as just another VM in the long term. It would make it much > >> easier for the various software switches to coexist in the world. > >> > >> The main technical risk I see in this proposal is that eliminating the > >> memory copies might not have the desired effect. I might be tempted to keep > >> the copies but prevent the kernel from having to inspect the vrings (more > >> like vhost-user). But that is just a hunch and I suppose the first step > >> would be a prototype to check the performance anyway. > >> > >> For what it is worth here is my view of networking performance on x86 in the > >> Haswell+ era: > >> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow > > > > Thanks. > > > > I've been thinking about how to eliminate the VM <-> host <-> VM > > switching and instead achieve just VM <-> VM. > > > > The holy grail of VM-to-VM networking is an exitless I/O path. 
In > > other words, packets can be transferred between VMs without any > > vmexits (this requires a polling driver). > > > > Here is how it works. QEMU gets "-device vhost-user" so that a VM can > > act as the vhost-user server: > > > > VM1 (virtio-net guest driver) <-> VM2 (vhost-user device) > > > > VM1 has a regular virtio-net PCI device. VM2 has a vhost-user device > > and plays the host role instead of the normal virtio-net guest driver > > role. > > > > The ugly thing about this is that VM2 needs to map all of VM1's guest > > RAM so it can access the vrings and packet data. The solution to this > > is something like the Shared Buffers BAR but this time it contains not > > just the packet data but also the vring, let's call it the Shared > > Virtqueues BAR. > > > > The Shared Virtqueues BAR eliminates the need for vhost-net on the > > host because VM1 and VM2 communicate directly using virtqueue notify > > or polling vring memory. Virtqueue notify works by connecting an > > eventfd as ioeventfd in VM1 and irqfd in VM2. And VM2 would also have > > an ioeventfd that is irqfd for VM1 to signal completions. > > We had such a discussion before: > http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658 > > Would be great to get this ball rolling again. > > Jan I think fundamentally, reducing the stress on the host scheduler can give a bigger gain than zero copy. But if I was to implement this, I wouldn't start with the funky virtio BAR thing. Start by enabling DPDK vhost-port within guest as-is. To this end, we can try implementing virtio-vhost: Assume we want to bridge VMX and VMY using bridge in VMB. - expose all of VMX and VMY memory as device BARs, or as some other region within VMB memory - add interface to send vhost-user messages to VMB (and ack them) the messages include tables that translate from VMX/VMY physical to VMB physical. The simplest guest driver then just copies from VMX TX ring to VMY RX ring, and vice versa. 
This will let you test performance somewhat easily. When used as a linux netdev, we probably will have to do extra data copies, at least initially. The point is that you get full interoperability with existing virtio, and test performance without rewriting everything first. One nice property is that KVM can log accesses for us. By detecting VMB accesses to memory of VMX and forwarding them to QEMU running VMX, we can make migration work out of box. This might also mean vringh code is reusable to make a linux driver for this device - IIRC dirty logging was the biggest hurdle to make vringh work well for vhost. > -- > Siemens AG, Corporate Technology, CT RTC ITP SES-DE > Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
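A rough model of the copying bridge described above, assuming the translation tables arrive as sorted lists of regions and that VMB sees all mapped memory as one flat byte array. `Region`, `translate`, `forward` and the `(addr, len)` descriptor tuples are hypothetical names for illustration; none of them come from vhost-user or virtio.

```python
from bisect import bisect_right
from typing import NamedTuple


class Region(NamedTuple):
    guest_start: int  # start address in VMX/VMY guest-physical space
    size: int         # region length in bytes
    vmb_start: int    # where the same bytes appear in VMB-physical space


def translate(table, guest_addr, length):
    """Map a guest-physical range to a VMB-physical address.

    `table` must be sorted by guest_start. Raises ValueError if the
    range is not fully covered by a single region.
    """
    starts = [r.guest_start for r in table]
    i = bisect_right(starts, guest_addr) - 1
    if i < 0:
        raise ValueError("address below first region")
    region = table[i]
    offset = guest_addr - region.guest_start
    if offset + length > region.size:
        raise ValueError("range not covered by region")
    return region.vmb_start + offset


def forward(vmb_mem, tx_table, rx_table, tx_desc, rx_desc):
    """Copy one packet from a VMX tx buffer into a VMY rx buffer.

    Descriptors are (guest_addr, length) tuples. Returns bytes copied.
    """
    tx_addr, tx_len = tx_desc
    rx_addr, rx_len = rx_desc
    if tx_len > rx_len:
        raise ValueError("rx buffer too small")
    src = translate(tx_table, tx_addr, tx_len)
    dst = translate(rx_table, rx_addr, tx_len)
    vmb_mem[dst:dst + tx_len] = vmb_mem[src:src + tx_len]
    return tx_len
```

A real driver would also walk descriptor chains and update the used rings; the sketch only shows the translate-then-copy step.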
* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net 2015-04-24 9:47 ` Stefan Hajnoczi 2015-04-24 9:50 ` Stefan Hajnoczi 2015-04-24 12:17 ` Luke Gorrie @ 2015-04-24 12:34 ` Luke Gorrie 2 siblings, 0 replies; 25+ messages in thread From: Luke Gorrie @ 2015-04-24 12:34 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, Linux Virtualization, Stefan Hajnoczi, Paolo Bonzini, Dr. David Alan Gilbert On 24 April 2015 at 11:47, Stefan Hajnoczi <stefanha@gmail.com> wrote: > > Incidentally, we also did a pile of work last year on zero-copy NIC->VM > > transfers and discovered a lot of interesting problems and edge cases where > > Virtio-net spec and/or drivers are hard to match up with common NICs. Happy > > to explain a bit about our experience if that would be valuable. > > That sounds interesting, can you describe the setup? > Sure. We implemented a zero-copy receive path that maps guest buffers received from the avail ring directly onto hardware receive buffers on a dedicated hardware receive queue for that VM (VMDq). This means that when the NIC receives a packet it stores it directly into the guest's memory but the vswitch has the opportunity to do as much or as little processing as it wants before making the packet available with a used ring descriptor. This scheme seems quite elegant to me. (I am sure it is not original - this is what the VMDq hardware feature is for, after all.) The devil is in the details though. I suspect it would work well given two extensions to Virtio-net: 1. The 'used' ring allows an offset where the payload starts. 2. The guest always supplies buffers with space for >= 2048 bytes of payload. but without these things it is tricky to satisfy the requirements of real NICs such as the Intel 10G ones. There are conflicting requirements. For example: - NIC requires buffer sizes to be uniform and a multiple of 1024 bytes.
Guest supplies variable-size buffers often of ~1500 bytes. These need to be either rounded down to 1024 bytes (causing excessive segmentation) or rounded up to 2048 bytes (requiring jumbo frames to be globally disabled on the port to avoid potential overruns). - Virtio-net with MRG_RXBUF expects the packet payload to be in a different offset for the first descriptor in a chain (offset 14 after the vnet header) vs. the following descriptors in the chain (offset 0). The NIC always stores packets at the same offset so the vswitch needs to pick one and then correct with memmove() when needed. - If the vswitch wants to shorten the packet payload, e.g. to remove encapsulation, then this will require a memmove() because there is no way to communicate an offset on the used ring. - The NIC has a limit to how many receive descriptors it can chain together. If the guest is supplying small buffers then this limit may be too low for jumbo frames to be received. ... and at a certain point we decided we were better off switching our focus away from clever-but-fragile NIC hacks and towards clever-and-robust SIMD hacks, and that is the path we have been on since a few months ago. _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 25+ messages in thread
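The buffer-size conflict in the first bullet comes down to simple rounding arithmetic. A small sketch, assuming only the 1024-byte granularity stated in the message; the helper names and the 9000-byte jumbo frame size are illustrative choices, not taken from any NIC datasheet.

```python
NIC_BUF_UNIT = 1024  # NIC buffer sizes must be a multiple of this


def round_down(size, unit=NIC_BUF_UNIT):
    """Largest multiple of `unit` that fits inside `size`."""
    return (size // unit) * unit


def round_up(size, unit=NIC_BUF_UNIT):
    """Smallest multiple of `unit` that is at least `size`."""
    return -(-size // unit) * unit


def descriptors_needed(frame_len, buf_size):
    """Number of receive buffers a frame of `frame_len` bytes occupies."""
    return -(-frame_len // buf_size)
```

Rounding a ~1500-byte guest buffer down gives 1024, so a 1500-byte frame already spans two descriptors and a 9000-byte frame spans nine. Rounding up gives 2048, which tells the NIC it may write past the end of what the guest actually supplied unless larger frames are disabled on the port.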
Thread overview: 25+ messages
[not found] <20150422170138.GA8388@stefanha-thinkpad.redhat.com>
2015-04-22 17:46 ` Zerocopy VM-to-VM networking using virtio-net Cornelia Huck
[not found] ` <20150422194603.1e650ec7.cornelia.huck@de.ibm.com>
2015-04-22 18:00 ` Stefan Hajnoczi
2015-04-23 16:54 ` Cornelia Huck
2015-04-24 8:12 ` [virtio-dev] " Luke Gorrie
2015-04-24 8:20 ` Paolo Bonzini
2015-04-24 9:47 ` Stefan Hajnoczi
2015-04-24 9:50 ` Stefan Hajnoczi
2015-04-24 12:17 ` Luke Gorrie
2015-04-24 13:10 ` Luke Gorrie
2015-04-24 13:23 ` Stefan Hajnoczi
2015-04-24 13:22 ` Stefan Hajnoczi
2015-04-26 13:24 ` Luke Gorrie
2015-04-27 10:17 ` Stefan Hajnoczi
2015-04-27 10:36 ` Michael S. Tsirkin
2015-04-27 12:35 ` Jan Kiszka
2015-04-27 12:55 ` Jan Kiszka
2015-04-27 13:01 ` Stefan Hajnoczi
2015-04-27 13:08 ` Muli Ben-Yehuda
2015-04-27 14:30 ` Jan Kiszka
2015-04-27 14:36 ` Luke Gorrie
2015-04-27 14:38 ` Jan Kiszka
2015-04-27 14:40 ` Michael S. Tsirkin
2015-04-27 12:57 ` Stefan Hajnoczi
2015-04-27 13:17 ` Michael S. Tsirkin
2015-04-24 12:34 ` Luke Gorrie