Re: Zerocopy VM-to-VM networking using virtio-net

* Re: Zerocopy VM-to-VM networking using virtio-net
       [not found] <20150422170138.GA8388@stefanha-thinkpad.redhat.com>
@ 2015-04-22 17:46 ` Cornelia Huck
       [not found] ` <20150422194603.1e650ec7.cornelia.huck@de.ibm.com>
  2015-04-24  8:12 ` [virtio-dev] " Luke Gorrie
  2 siblings, 0 replies; 25+ messages in thread
From: Cornelia Huck @ 2015-04-22 17:46 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev, Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	virtualization, virtio-comment, Paolo Bonzini,
	Dr. David Alan Gilbert

On Wed, 22 Apr 2015 18:01:38 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> [It may be necessary to remove virtio-dev@lists.oasis-open.org from CC
> if you are a non-TC member.]
> 
> Hi,
> Some modern networking applications bypass the kernel network stack so
> that rx/tx rings and DMA buffers can be directly mapped.  This is
> typical in DPDK applications where virtio-net currently is one of
> several NIC choices.
> 
> Existing virtio-net implementations are not optimized for VM-to-VM
> DPDK-style networking.  The following outline describes a zero-copy
> virtio-net solution for VM-to-VM networking.
> 
> Thanks to Paolo Bonzini for the Shared Buffers BAR idea.
> 
> Use case
> --------
> Two VMs on the same host need to communicate in the most efficient
> manner possible (e.g. the sole purpose of the VMs is to do network I/O).
> 
> Applications running inside the VMs implement virtio-net in userspace so
> they have full control over rx/tx rings and data buffer placement.

Wouldn't that also benefit applications that use a kernel
implementation? You still need to get the data to/from kernel space,
but you'd get the benefit of being able to get the data to the peer
immediately.

> 
> Performance requirements are higher priority than security or isolation.
> If this bothers you, stick to classic virtio-net.
> 
> virtio-net VM-to-VM extensions
> ------------------------------
> A few extensions to virtio-net are necessary to support zero-copy
> VM-to-VM communication.  The extensions are covered informally
> throughout the text; this is not a VIRTIO specification change proposal.
> 
> The VM-to-VM capable virtio-net PCI adapter has an additional MMIO BAR
> called the Shared Buffers BAR.  The Shared Buffers BAR is a shared
> memory region on the host so that the virtio-net devices in VM1 and VM2
> both access the same region of memory.
> 
> The vring is still allocated in guest RAM as usual but data buffers must
> be located in the Shared Buffers BAR in order to take advantage of
> zero-copy.
> 
> When VM1 places a packet into the tx queue and the buffers are located
> in the Shared Buffers BAR, the host finds VM2's rx queue descriptor
> with the same buffer address and completes it without copying any data
> buffers.

The shared buffers BAR looks PCI-specific, but what about other
mechanisms to provide a shared space between two VMs with some kind of
lightweight notifications? This should make it possible to implement a
similar mode of operation for other transports if it is factored out
correctly. (The actual implementation of this shared space is probably
the difficult part :)

> 
> Shared buffer allocation
> ------------------------
> A simple scheme for two cooperating VMs to manage the Shared Buffers BAR
> is as follows:
> 
>   VM1         VM2
>        +---+
>    rx->| 1 |<-tx
>        +---+
>    tx->| 2 |<-rx
>        +---+
>    Shared Buffers
> 
> This is a trivial example where the Shared Buffers BAR has only two
> packet buffers.
> 
> VM1 starts by putting buffer 1 in its rx queue.  VM2 starts by putting
> buffer 2 in its rx queue.  The VMs know which buffers to choose based on
> a new uint8_t virtio_net_config.shared_buffers_offset field (0 for VM1
> and 1 for VM2).
> 
> VM1 can transmit to VM2 by filling buffer 2 and placing it on its tx
> queue.  VM2 can transmit by filling buffer 1 and placing it on its tx
> queue.
> 
> As soon as a buffer is placed on a tx queue, the VM passes ownership of
> the buffer to the other VM.  In other words, the buffer must not be
> touched even after virtio-net tx completion because it now belongs to
> the other VM.
> 
> This scheme of bouncing ownership back-and-forth between the two VMs
> only works if both VMs transmit an equal number of buffers over time.
> In reality the traffic pattern may be unbalanced so VM1 is always
> transmitting and VM2 is always receiving.  This problem can be overcome
> if the VMs cooperate and return buffers if they accumulate too many.
> 
> For example, after VM1 transmits buffer 2 it has run out of tx buffers:
> 
>   VM1         VM2
>        +---+
>    rx->| 1 |<-tx
>        +---+
>     X->| 2 |<-rx
>        +---+
> 
> VM2 notices that it now holds all buffers.  It can donate a buffer back
> to VM1 by putting it on the tx queue with the new virtio_net_hdr.flags
> VIRTIO_NET_HDR_F_GIFT_BUFFER flag.  This flag indicates that this is not
> a packet but rather an empty gifted buffer.  VM1 checks the flags field
> to detect that it has been gifted buffers.
> 
> Also note that zero-copy networking is not mutually exclusive with
> classic virtio-net.  If the descriptor has buffer addresses outside the
> Shared Buffers BAR, then classic non-zero-copy virtio-net behavior
> occurs.
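
The ownership rule quoted above can be sketched as a small Python
simulation (not driver code).  The Peer class, its method names and the
numeric value of VIRTIO_NET_HDR_F_GIFT_BUFFER are assumptions made for
illustration: the proposal names the flag but assigns it no value, and
posting buffers on the rx queue is left out to keep the sketch short.

```python
from collections import deque

# Hypothetical value: the proposal names this flag but gives it no number.
VIRTIO_NET_HDR_F_GIFT_BUFFER = 0x80


class Peer:
    """One VM's view of the Shared Buffers BAR: the buffers it may touch."""

    def __init__(self, name, owned):
        self.name = name
        self.owned = set(owned)  # indices of buffers this VM currently owns
        self.inbox = deque()     # (buffer, flags, payload) completed on rx

    def transmit(self, other, buf, payload=b"", flags=0):
        # Placing a buffer on the tx queue passes ownership to the peer,
        # so the sender must not touch it again after this call.
        if buf not in self.owned:
            raise ValueError(f"{self.name} does not own buffer {buf}")
        self.owned.remove(buf)
        other.owned.add(buf)
        other.inbox.append((buf, flags, payload))

    def gift(self, other, buf):
        # Hand back an empty buffer so the peer can transmit again.
        self.transmit(other, buf, flags=VIRTIO_NET_HDR_F_GIFT_BUFFER)

    def receive(self):
        # Returns (buffer, payload); payload is None for a gifted buffer.
        buf, flags, payload = self.inbox.popleft()
        is_gift = bool(flags & VIRTIO_NET_HDR_F_GIFT_BUFFER)
        return buf, (None if is_gift else payload)
```

Starting with Peer("VM1", owned=[2]) and Peer("VM2", owned=[1]), one
transmit from VM1 leaves it with no buffers, which is the "X" state in
the second diagram; VM2 calling gift() with buffer 1 lets VM1 send again.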

Is simply writing the values in the header enough to trigger the other
side? You don't need some kind of notification? (I'm obviously coming
from a non-PCI view, and for my kind-of-nebulous idea I'd need a
lightweight interrupt so that the other side knows it should check the
header.)

> 
> Host-side implementation
> ------------------------
> The host facilitates zero-copy VM-to-VM communication by taking
> descriptors off tx queues and filling in rx descriptors of the paired
> VM.  In the Linux vhost_net implementation this could work as follows:
> 
> 1. VM1 places buffer 2 on the tx queue and kicks the host.  Ownership of
>    the buffer no longer belongs to VM1.
> 2. vhost_net pops the buffer from VM1's tx queue and verifies that the
>    buffer address is within the Shared Buffers BAR.
> 3. vhost_net finds the VM2 rx queue descriptor whose buffer address
>    matches, completes that descriptor, and kicks VM2.
> 4. VM2 pops buffer 2 from the rx queue.  It can now reuse this buffer
>    for transmitting to VM1.
> 
> The vhost_net.ko kernel module needs a new ioctl for pairing vhost_net
> instances.  This ioctl is used to establish the VM-to-VM connection
> between VM1's virtio-net and VM2's virtio-net.
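
Steps 2 and 3 above can be sketched as follows.  This is a Python model
of the matching logic only, not vhost_net code.  BAR_BASE, BAR_SIZE,
Queue and forward_tx are hypothetical names; the sketch also assumes
that both guests see the Shared Buffers BAR at the same address, and it
leaves the tx descriptor queued when no rx descriptor matches, a case
the proposal does not specify.

```python
from dataclasses import dataclass, field

# Assumed for the sketch: both guests see the Shared Buffers BAR at the
# same address, so "same buffer address" is a plain equality test.
BAR_BASE = 0x10000000
BAR_SIZE = 2 * 2048  # two packet buffers, as in the diagram


def in_shared_bar(addr, length):
    """True if [addr, addr + length) lies entirely inside the BAR."""
    return BAR_BASE <= addr and addr + length <= BAR_BASE + BAR_SIZE


@dataclass
class Queue:
    avail: list = field(default_factory=list)  # (addr, len) posted by the guest
    used: list = field(default_factory=list)   # (addr, len) completed by the host
    kicks: int = 0                             # notifications sent to the guest


def forward_tx(tx, peer_rx):
    """Steps 2 and 3: move one tx descriptor to the paired rx queue."""
    addr, length = tx.avail[0]  # peek; only dequeued on a successful match
    if not in_shared_bar(addr, length):
        return "classic"  # copying path, not modelled here
    for i, (rx_addr, rx_len) in enumerate(peer_rx.avail):
        if rx_addr == addr and length <= rx_len:
            tx.avail.pop(0)
            del peer_rx.avail[i]
            peer_rx.used.append((addr, length))  # complete without copying
            peer_rx.kicks += 1                   # kick the peer
            tx.used.append((addr, length))       # tx completion for the sender
            return "zero-copy"
    return "no-match"  # unspecified in the proposal; descriptor stays queued
```

The linear search over the peer's rx descriptors is the simplest possible
lookup; a real implementation would presumably index descriptors by
buffer address.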
> 
> Discussion
> ----------
> The result is that applications in separate VMs can communicate in true
> zero-copy fashion.
> 
> I think this approach could be fruitful in bringing virtio-net to
> VM-to-VM networking use cases.  Unless virtio-net is extended for this
> use case, I'm afraid DPDK and OpenDataPlane communities might steer
> clear of VIRTIO.
> 
> This is an idea I want to share but I'm not working on a prototype.
> Feel free to flesh it out further and try it!

Definitely interesting. It seems you get much of the needed
infrastructure by simply leveraging what PCI gives you anyway? If we
want something like this in other environments (say, via ccw on s390),
we'd have to come up with a mechanism that can give us the same (which
is probably the hard part).

> 
> Open issues:
>  * Multiple VMs?
>  * Multiqueue?
>  * Choice of shared buffer allocation algorithm?
>  * etc
> 
> Stefan


* Re: Zerocopy VM-to-VM networking using virtio-net
       [not found] ` <20150422194603.1e650ec7.cornelia.huck@de.ibm.com>
@ 2015-04-22 18:00   ` Stefan Hajnoczi
  2015-04-23 16:54     ` Cornelia Huck
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Hajnoczi @ 2015-04-22 18:00 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: virtio-dev, Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, virtio-comment,
	Paolo Bonzini, Dr. David Alan Gilbert

On Wed, Apr 22, 2015 at 6:46 PM, Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
> On Wed, 22 Apr 2015 18:01:38 +0100
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
>> [It may be necessary to remove virtio-dev@lists.oasis-open.org from CC
>> if you are a non-TC member.]
>>
>> Hi,
>> Some modern networking applications bypass the kernel network stack so
>> that rx/tx rings and DMA buffers can be directly mapped.  This is
>> typical in DPDK applications where virtio-net currently is one of
>> several NIC choices.
>>
>> Existing virtio-net implementations are not optimized for VM-to-VM
>> DPDK-style networking.  The following outline describes a zero-copy
>> virtio-net solution for VM-to-VM networking.
>>
>> Thanks to Paolo Bonzini for the Shared Buffers BAR idea.
>>
>> Use case
>> --------
>> Two VMs on the same host need to communicate in the most efficient
>> manner possible (e.g. the sole purpose of the VMs is to do network I/O).
>>
>> Applications running inside the VMs implement virtio-net in userspace so
>> they have full control over rx/tx rings and data buffer placement.
>
> Wouldn't that also benefit applications that use a kernel
> implementation? You still need to get the data to/from kernel space,
> but you'd get the benefit of being able to get the data to the peer
> immediately.

If the applications are using the sockets API then there is a memory
copy involved.  But you are right that it bypasses tap/bridge on the
host side, so it can still be an advantage.

>>
>> Performance requirements are higher priority than security or isolation.
>> If this bothers you, stick to classic virtio-net.
>>
>> virtio-net VM-to-VM extensions
>> ------------------------------
>> A few extensions to virtio-net are necessary to support zero-copy
>> VM-to-VM communication.  The extensions are covered informally
>> throughout the text; this is not a VIRTIO specification change proposal.
>>
>> The VM-to-VM capable virtio-net PCI adapter has an additional MMIO BAR
>> called the Shared Buffers BAR.  The Shared Buffers BAR is a shared
>> memory region on the host so that the virtio-net devices in VM1 and VM2
>> both access the same region of memory.
>>
>> The vring is still allocated in guest RAM as usual but data buffers must
>> be located in the Shared Buffers BAR in order to take advantage of
>> zero-copy.
>>
>> When VM1 places a packet into the tx queue and the buffers are located
>> in the Shared Buffers BAR, the host finds VM2's rx queue descriptor
>> with the same buffer address and completes it without copying any data
>> buffers.
>
> The shared buffers BAR looks PCI-specific, but what about other
> mechanisms to provide a shared space between two VMs with some kind of
> lightweight notifications? This should make it possible to implement a
> similar mode of operation for other transports if it is factored out
> correctly. (The actual implementation of this shared space is probably
> the difficult part :)

It depends on the primitives available.  For example, in a virtual DMA
page-flipping environment the hypervisor could change page ownership
between the two VMs.  This does not require shared memory.  But
there's a cost to virtual memory bookkeeping so it might only be a win
for big packets.

Does s390 have a mechanism for giving VMs permanent shared or
temporary access to memory pages?

>>
>> Shared buffer allocation
>> ------------------------
>> A simple scheme for two cooperating VMs to manage the Shared Buffers BAR
>> is as follows:
>>
>>   VM1         VM2
>>        +---+
>>    rx->| 1 |<-tx
>>        +---+
>>    tx->| 2 |<-rx
>>        +---+
>>    Shared Buffers
>>
>> This is a trivial example where the Shared Buffers BAR has only two
>> packet buffers.
>>
>> VM1 starts by putting buffer 1 in its rx queue.  VM2 starts by putting
>> buffer 2 in its rx queue.  The VMs know which buffers to choose based on
>> a new uint8_t virtio_net_config.shared_buffers_offset field (0 for VM1
>> and 1 for VM2).
>>
>> VM1 can transmit to VM2 by filling buffer 2 and placing it on its tx
>> queue.  VM2 can transmit by filling buffer 1 and placing it on its tx
>> queue.
>>
>> As soon as a buffer is placed on a tx queue, the VM passes ownership of
>> the buffer to the other VM.  In other words, the buffer must not be
>> touched even after virtio-net tx completion because it now belongs to
>> the other VM.
>>
>> This scheme of bouncing ownership back-and-forth between the two VMs
>> only works if both VMs transmit an equal number of buffers over time.
>> In reality the traffic pattern may be unbalanced so VM1 is always
>> transmitting and VM2 is always receiving.  This problem can be overcome
>> if the VMs cooperate and return buffers if they accumulate too many.
>>
>> For example, after VM1 transmits buffer 2 it has run out of tx buffers:
>>
>>   VM1         VM2
>>        +---+
>>    rx->| 1 |<-tx
>>        +---+
>>     X->| 2 |<-rx
>>        +---+
>>
>> VM2 notices that it now holds all buffers.  It can donate a buffer back
>> to VM1 by putting it on the tx queue with the new virtio_net_hdr.flags
>> VIRTIO_NET_HDR_F_GIFT_BUFFER flag.  This flag indicates that this is not
>> a packet but rather an empty gifted buffer.  VM1 checks the flags field
>> to detect that it has been gifted buffers.
>>
>> Also note that zero-copy networking is not mutually exclusive with
>> classic virtio-net.  If the descriptor has buffer addresses outside the
>> Shared Buffers BAR, then classic non-zero-copy virtio-net behavior
>> occurs.
>
> Is simply writing the values in the header enough to trigger the other
> side? You don't need some kind of notification? (I'm obviously coming
> from a non-PCI view, and for my kind-of-nebulous idea I'd need a
> lightweight interrupt so that the other side knows it should check the
> header.)

Virtqueue kick is still used for notification.  In fact, the virtqueue
operation is basically the same, except that data buffers are now
located in the Shared Buffers BAR instead.

>> Discussion
>> ----------
>> The result is that applications in separate VMs can communicate in true
>> zero-copy fashion.
>>
>> I think this approach could be fruitful in bringing virtio-net to
>> VM-to-VM networking use cases.  Unless virtio-net is extended for this
>> use case, I'm afraid DPDK and OpenDataPlane communities might steer
>> clear of VIRTIO.
>>
>> This is an idea I want to share but I'm not working on a prototype.
>> Feel free to flesh it out further and try it!
>
> Definitely interesting. It seems you get much of the needed
> infrastructure by simply leveraging what PCI gives you anyway? If we
> want something like this in other environments (say, via ccw on s390),
> we'd have to come up with a mechanism that can give us the same (which
> is probably the hard part).

It may not be a win in all environments.  It depends on the primitives
available for memory access.

With PCI devices and a Linux host we can use a shared memory region.
If shared memory is not available then maybe there is no performance
win to be had.

Stefan

* Re: Zerocopy VM-to-VM networking using virtio-net
  2015-04-22 18:00   ` Stefan Hajnoczi
@ 2015-04-23 16:54     ` Cornelia Huck
  0 siblings, 0 replies; 25+ messages in thread
From: Cornelia Huck @ 2015-04-23 16:54 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev, Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, virtio-comment,
	Paolo Bonzini, Dr. David Alan Gilbert

On Wed, 22 Apr 2015 19:00:52 +0100
Stefan Hajnoczi <stefanha@gmail.com> wrote:

> On Wed, Apr 22, 2015 at 6:46 PM, Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
> > On Wed, 22 Apr 2015 18:01:38 +0100
> > Stefan Hajnoczi <stefanha@redhat.com> wrote:

> >> virtio-net VM-to-VM extensions
> >> ------------------------------
> >> A few extensions to virtio-net are necessary to support zero-copy
> >> VM-to-VM communication.  The extensions are covered informally
> >> throughout the text; this is not a VIRTIO specification change proposal.
> >>
> >> The VM-to-VM capable virtio-net PCI adapter has an additional MMIO BAR
> >> called the Shared Buffers BAR.  The Shared Buffers BAR is a shared
> >> memory region on the host so that the virtio-net devices in VM1 and VM2
> >> both access the same region of memory.
> >>
> >> The vring is still allocated in guest RAM as usual but data buffers must
> >> be located in the Shared Buffers BAR in order to take advantage of
> >> zero-copy.
> >>
> >> When VM1 places a packet into the tx queue and the buffers are located
> >> in the Shared Buffers BAR, the host finds VM2's rx queue descriptor
> >> with the same buffer address and completes it without copying any data
> >> buffers.
> >
> > The shared buffers BAR looks PCI-specific, but what about other
> > mechanisms to provide a shared space between two VMs with some kind of
> > lightweight notifications? This should make it possible to implement a
> > similar mode of operation for other transports if it is factored out
> > correctly. (The actual implementation of this shared space is probably
> > the difficult part :)
> 
> It depends on the primitives available.  For example, in a virtual DMA
> page-flipping environment the hypervisor could change page ownership
> between the two VMs.  This does not require shared memory.  But
> there's a cost to virtual memory bookkeeping so it might only be a win
> for big packets.
> 
> Does s390 have a mechanism for giving VMs permanent shared or
> temporary access to memory pages?

Under kvm/qemu, currently not. Under z/VM, there's DCSS; while we don't
want to copy that interface, we'll probably want to introduce something
similar in the future. No design yet, though.

> Virtqueue kick is still used for notification.  In fact, the virtqueue
> operation is basically the same, except that data buffers are now
> located in the Shared Buffers BAR instead.

You're right, if this is in the virtqueue buffers, this should just
work.

> 
> >> Discussion
> >> ----------
> >> The result is that applications in separate VMs can communicate in true
> >> zero-copy fashion.
> >>
> >> I think this approach could be fruitful in bringing virtio-net to
> >> VM-to-VM networking use cases.  Unless virtio-net is extended for this
> >> use case, I'm afraid DPDK and OpenDataPlane communities might steer
> >> clear of VIRTIO.
> >>
> >> This is an idea I want to share but I'm not working on a prototype.
> >> Feel free to flesh it out further and try it!
> >
> > Definitely interesting. It seems you get much of the needed
> > infrastructure by simply leveraging what PCI gives you anyway? If we
> > want something like this in other environments (say, via ccw on s390),
> > we'd have to come up with a mechanism that can give us the same (which
> > is probably the hard part).
> 
> It may not be a win in all environments.  It depends on the primitives
> available for memory access.
> 
> With PCI devices and a Linux host we can use a shared memory region.
> If shared memory is not available then maybe there is no performance
> win to be had.

I think if there's a good split between concept and specific backend,
we can just figure out the "shared" part later on.

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
       [not found] <20150422170138.GA8388@stefanha-thinkpad.redhat.com>
  2015-04-22 17:46 ` Zerocopy VM-to-VM networking using virtio-net Cornelia Huck
       [not found] ` <20150422194603.1e650ec7.cornelia.huck@de.ibm.com>
@ 2015-04-24  8:12 ` Luke Gorrie
  2015-04-24  8:20   ` Paolo Bonzini
  2015-04-24  9:47   ` Stefan Hajnoczi
  2 siblings, 2 replies; 25+ messages in thread
From: Luke Gorrie @ 2015-04-24  8:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, virtualization,
	virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert

Hi Stefan,

Great topic. I am also extremely interested in helping Virtio-net become
the standard for the networking industry (the universe of DPDK, etc).

On 22 April 2015 at 19:01, Stefan Hajnoczi <stefanha@redhat.com> wrote:

> [It may be necessary to remove virtio-dev@lists.oasis-open.org from CC
> if you are a non-TC member.]
>

[Done.]

> I think this approach could be fruitful in bringing virtio-net to
> VM-to-VM networking use cases.  Unless virtio-net is extended for this
> use case, I'm afraid DPDK and OpenDataPlane communities might steer
> clear of VIRTIO.
>

Questions:

- How fast is needed?

- How fast is the vhost-user support that shipped in DPDK 2.0?

- How fast would the new design likely be?

Our recent experience in Snabb Switch land is that networking on x86 is now
more of an HPC problem than a system programming problem. The SIMD bandwidth
per core keeps increasing, and this erodes the value of traditional (and
complex) system programming optimizations. I will be interested to compare
notes with others on this, already on Haswell but more so when we have
AVX512.

Incidentally, we also did a pile of work last year on zero-copy NIC->VM
transfers and discovered a lot of interesting problems and edge cases where
Virtio-net spec and/or drivers are hard to match up with common NICs. Happy
to explain a bit about our experience if that would be valuable.

Cheers,
-Luke

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-24  8:12 ` [virtio-dev] " Luke Gorrie
@ 2015-04-24  8:20   ` Paolo Bonzini
  2015-04-24  9:47   ` Stefan Hajnoczi
  1 sibling, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2015-04-24  8:20 UTC (permalink / raw)
  To: Luke Gorrie, Stefan Hajnoczi
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel, virtualization,
	virtio-comment, Dr. David Alan Gilbert

On 24/04/2015 10:12, Luke Gorrie wrote:
> 
>     I think this approach could be fruitful in bringing virtio-net to
>     VM-to-VM networking use cases.  Unless virtio-net is extended for this
>     use case, I'm afraid DPDK and OpenDataPlane communities might steer
>     clear of VIRTIO.
> 
> 
> Questions:
> 
> - How fast is needed?
> 
> - How fast is the vhost-user support that shipped in DPDK 2.0?

vhost-user is fast.  The problem is not the speed; it's the desire for a
more peer-to-peer mode of operation.

virtio by design has very distinct roles for driver and device, so for
VM2VM communication the virtio design requires two devices in the guest
and two drivers, comprising a "switch", in the host.

The switch could be using vhost-user indeed, but my understanding is
that in some cases this switch component is undesirable.  However, my
understanding does not include _why_ it is undesirable.  This is where
we need to gather more information from the DPDK folks.

Paolo

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-24  8:12 ` [virtio-dev] " Luke Gorrie
  2015-04-24  8:20   ` Paolo Bonzini
@ 2015-04-24  9:47   ` Stefan Hajnoczi
  2015-04-24  9:50     ` Stefan Hajnoczi
                       ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2015-04-24  9:47 UTC (permalink / raw)
  To: Luke Gorrie
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, virtio-comment,
	Paolo Bonzini, Dr. David Alan Gilbert

On Fri, Apr 24, 2015 at 9:12 AM, Luke Gorrie <luke@snabb.co> wrote:
> - How fast would the new design likely be?

This proposal eliminates two things in the path:

1. Compared to vhost_net, it bypasses the host tun driver and network
stack, replacing it with direct vhost_net <-> vhost_net data transfer.
At this level it's compared to vhost-user, but it's not programmable
in userspace!

2. Data copies are eliminated because the Shared Buffers BAR gives
both VMs access to the packets.

My concern is the overhead of the vhost_net component copying
descriptors between NICs.  In a 100% shared memory model, each VM only
has a receive queue that the other VM places packets into.  There are
no tx queues.  The notification mechanism is an event fd that is
ioeventfd for VM1 and irqfd for VM2.  In other words, when VM1 kicks
the queue, VM2 receives an interrupt (of course polling the receive
queue is also possible).
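
A rough sketch of that 100% shared memory model, assuming a simple
single-producer/single-consumer ring.  The RxRing class and its fields
are hypothetical, a Python list stands in for the shared memory region,
and a plain counter stands in for the event fd (ioeventfd on the
sending side, irqfd on the receiving side).

```python
class RxRing:
    """Receive ring owned by one VM and written directly by its peer."""

    def __init__(self, size):
        self.slots = [None] * size  # stands in for the shared memory region
        self.head = 0               # next slot the peer (producer) writes
        self.tail = 0               # next slot the owner (consumer) reads
        self.event_count = 0        # stands in for the eventfd counter

    def put(self, packet):
        # Called by the peer VM.  There is no tx queue: the packet goes
        # straight into the receiver's ring.
        if self.head - self.tail == len(self.slots):
            return False            # ring full, the producer must back off
        self.slots[self.head % len(self.slots)] = packet
        self.head += 1
        self.event_count += 1       # "kick": the owner gets an interrupt
        return True

    def poll(self):
        # Called by the owning VM, from an interrupt handler or a poll loop.
        packets = []
        while self.tail != self.head:
            packets.append(self.slots[self.tail % len(self.slots)])
            self.tail += 1
        self.event_count = 0        # reading an eventfd resets its counter
        return packets
```

Each VM would own one such ring and write into the other's, so no host
component has to move descriptors between queues.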

It would be interesting to compare the two approaches.

> Our recent experience in Snabb Switch land is that networking on x86 is now
> more of an HPC problem than a system programming problem. The SIMD bandwidth
> per core keeps increasing, and this erodes the value of traditional (and
> complex) system programming optimizations. I will be interested to compare
> notes with others on this, already on Haswell but more so when we have
> AVX512.
>
> Incidentally, we also did a pile of work last year on zero-copy NIC->VM
> transfers and discovered a lot of interesting problems and edge cases where
> Virtio-net spec and/or drivers are hard to match up with common NICs. Happy
> to explain a bit about our experience if that would be valuable.

That sounds interesting, can you describe the setup?

Stefan

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-24  9:47   ` Stefan Hajnoczi
@ 2015-04-24  9:50     ` Stefan Hajnoczi
  2015-04-24 12:17     ` Luke Gorrie
  2015-04-24 12:34     ` Luke Gorrie
  2 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2015-04-24  9:50 UTC (permalink / raw)
  To: Luke Gorrie
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, virtio-comment,
	Paolo Bonzini, Dr. David Alan Gilbert

On Fri, Apr 24, 2015 at 10:47 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> At this level it's compared to vhost-user, but it's not programmable
> in userspace!

s/compared/comparable/

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-24  9:47   ` Stefan Hajnoczi
  2015-04-24  9:50     ` Stefan Hajnoczi
@ 2015-04-24 12:17     ` Luke Gorrie
  2015-04-24 13:10       ` Luke Gorrie
  2015-04-24 13:22       ` Stefan Hajnoczi
  2015-04-24 12:34     ` Luke Gorrie
  2 siblings, 2 replies; 25+ messages in thread
From: Luke Gorrie @ 2015-04-24 12:17 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, virtio-comment,
	Paolo Bonzini, Dr. David Alan Gilbert

On 24 April 2015 at 11:47, Stefan Hajnoczi <stefanha@gmail.com> wrote:

> My concern is the overhead of the vhost_net component copying
> descriptors between NICs.

I see. So you would not have to reserve CPU resources for vswitches.
Instead you would give all cores to the VMs and they would pay for their
own networking. This would be especially appealing in the extreme case
where all networking is "Layer 1" connectivity between local virtual
machines.

This would make VM<->VM links different to VM<->network links. I suppose
that when you created VMs you would need to be conscious of whether or not
you are placing them on the same host or NUMA node so that you can predict
what network performance will be available.

For what it is worth, I think this would make life more difficult for
network operators hosting DPDK-style network applications ("NFV").
Virtio-net would become a more complex abstraction, the orchestration
systems would need to take this into account, and there would be more
opportunity for interoperability problems between virtual machines.

The simpler alternative that I prefer is to provide network operators with
a Virtio-net abstraction that behaves and performs in exactly the same way
for all kinds of network traffic -- whether or not the VMs are on the same
machine and NUMA node.

That would be more in line with SR-IOV behavior which seems to me like the
other horse in this race. Perhaps my world view here is too narrow though
and other technologies like ivshmem are more relevant than I give them
credit for?

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-24 12:17     ` Luke Gorrie
@ 2015-04-24 13:10       ` Luke Gorrie
  2015-04-24 13:23         ` Stefan Hajnoczi
  2015-04-24 13:22       ` Stefan Hajnoczi
  1 sibling, 1 reply; 25+ messages in thread
From: Luke Gorrie @ 2015-04-24 13:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, virtio-comment,
	Paolo Bonzini, Dr. David Alan Gilbert


On 24 April 2015 at 14:17, Luke Gorrie <luke@snabb.co> wrote:

> For what it is worth, I think
>

Erm, sorry about ranting with my pre-existing ideas without having examined
the proposed specification in detail.

I have a long backlog of things that I have been meaning to discuss with
the Virtio-net community but have not previously had time to.

Humbly!
-Luke

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-24 13:10       ` Luke Gorrie
@ 2015-04-24 13:23         ` Stefan Hajnoczi
  0 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2015-04-24 13:23 UTC (permalink / raw)
  To: Luke Gorrie
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, Paolo Bonzini,
	Dr. David Alan Gilbert

On Fri, Apr 24, 2015 at 2:10 PM, Luke Gorrie <luke@snabb.co> wrote:
> On 24 April 2015 at 14:17, Luke Gorrie <luke@snabb.co> wrote:
>>
>> For what it is worth, I think
>
>
> Erm, sorry about ranting with my pre-existing ideas without having examined
> the proposed specification in detail.

My experience with DPDK and SDN/NFV is limited, so I appreciate your input!

Stefan

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-24 12:17     ` Luke Gorrie
  2015-04-24 13:10       ` Luke Gorrie
@ 2015-04-24 13:22       ` Stefan Hajnoczi
  2015-04-26 13:24         ` Luke Gorrie
  1 sibling, 1 reply; 25+ messages in thread
From: Stefan Hajnoczi @ 2015-04-24 13:22 UTC (permalink / raw)
  To: Luke Gorrie
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, virtio-comment,
	Paolo Bonzini, Dr. David Alan Gilbert

On Fri, Apr 24, 2015 at 1:17 PM, Luke Gorrie <luke@snabb.co> wrote:
> On 24 April 2015 at 11:47, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>
>> My concern is the overhead of the vhost_net component copying
>> descriptors between NICs.
>
>
> I see. So you would not have to reserve CPU resources for vswitches. Instead
> you would give all cores to the VMs and they would pay for their own
> networking. This would be especially appealing in the extreme case where all
> networking is "Layer 1" connectivity between local virtual machines.
>
> This would make VM<->VM links different to VM<->network links. I suppose
> that when you created VMs you would need to be conscious of whether or not
> you are placing them on the same host or NUMA node so that you can predict
> what network performance will be available.

The motivation for making VM-to-VM fast is that while software
switches on the host are efficient today (thanks to vhost-user), there
is no efficient solution if the software switch is a VM.

Have you had requests to run SnabbSwitch in a VM instead of on the
host?  For example, if someone wants to deploy it in a cloud
environment they will not be allowed to run arbitrary software on the
host.

Stefan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-24 13:22       ` Stefan Hajnoczi
@ 2015-04-26 13:24         ` Luke Gorrie
  2015-04-27 10:17           ` Stefan Hajnoczi
  0 siblings, 1 reply; 25+ messages in thread
From: Luke Gorrie @ 2015-04-26 13:24 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, virtio-comment,
	Paolo Bonzini, Dr. David Alan Gilbert

On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote:

> The motivation for making VM-to-VM fast is that while software
> switches on the host are efficient today (thanks to vhost-user), there
> is no efficient solution if the software switch is a VM.
>

I see. This sounds like a noble goal indeed. I would love to run the
software switch as just another VM in the long term. It would make it much
easier for the various software switches to coexist in the world.

The main technical risk I see in this proposal is that eliminating the
memory copies might not have the desired effect. I might be tempted to keep
the copies but prevent the kernel from having to inspect the vrings (more
like vhost-user). But that is just a hunch and I suppose the first step
would be a prototype to check the performance anyway.

For what it is worth here is my view of networking performance on x86 in
the Haswell+ era:
https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow

> Have you had requests to run SnabbSwitch in a VM instead of on the
> host?

This is not something we have discussed.

I can say that I am not satisfied with our installation process on the
host. I want this to be trivially easy, but it is not.

On the one hand we make some parts easy: we only require one executable
file (~1.5MB) and it works on any modern distro and kernel.

On the other hand we require the user to edit grub.conf to reserve cores
and keep the IOMMU out of the way, and to manually run a traffic process
for each 10G port pinned to a suitable core. That requires a bunch of
downstream work.

Gory details:
https://github.com/SnabbCo/snabb-nfv/wiki/Compute-node-requirements

This should be much simpler. I would quite like to be able to wrap this up
in a VM or a container. The risk is that then we become dependent on other
systems (e.g. OpenStack) pinning cores correctly, etc, and that might be
placing unrealistic expectations on the orchestration systems of the
present and near future (?). I mean: if we make this somebody else's
problem, we had better trust that they will do it right.

Cheers,
-Luke

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-26 13:24         ` Luke Gorrie
@ 2015-04-27 10:17           ` Stefan Hajnoczi
  2015-04-27 10:36             ` Michael S. Tsirkin
  2015-04-27 12:35             ` Jan Kiszka
  0 siblings, 2 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2015-04-27 10:17 UTC (permalink / raw)
  To: Luke Gorrie
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, virtio-comment,
	Paolo Bonzini, Dr. David Alan Gilbert

On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote:
> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>
>> The motivation for making VM-to-VM fast is that while software
>> switches on the host are efficient today (thanks to vhost-user), there
>> is no efficient solution if the software switch is a VM.
>
>
> I see. This sounds like a noble goal indeed. I would love to run the
> software switch as just another VM in the long term. It would make it much
> easier for the various software switches to coexist in the world.
>
> The main technical risk I see in this proposal is that eliminating the
> memory copies might not have the desired effect. I might be tempted to keep
> the copies but prevent the kernel from having to inspect the vrings (more
> like vhost-user). But that is just a hunch and I suppose the first step
> would be a prototype to check the performance anyway.
>
> For what it is worth here is my view of networking performance on x86 in the
> Haswell+ era:
> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow

Thanks.

I've been thinking about how to eliminate the VM <-> host <-> VM
switching and instead achieve just VM <-> VM.

The holy grail of VM-to-VM networking is an exitless I/O path.  In
other words, packets can be transferred between VMs without any
vmexits (this requires a polling driver).

Here is how it works.  QEMU gets "-device vhost-user" so that a VM can
act as the vhost-user server:

VM1 (virtio-net guest driver) <-> VM2 (vhost-user device)

VM1 has a regular virtio-net PCI device.  VM2 has a vhost-user device
and plays the host role instead of the normal virtio-net guest driver
role.

The ugly thing about this is that VM2 needs to map all of VM1's guest
RAM so it can access the vrings and packet data.  The solution to this
is something like the Shared Buffers BAR but this time it contains not
just the packet data but also the vring, let's call it the Shared
Virtqueues BAR.

The Shared Virtqueues BAR eliminates the need for vhost-net on the
host because VM1 and VM2 communicate directly using virtqueue notify
or polling vring memory.  Virtqueue notify works by connecting an
eventfd as ioeventfd in VM1 and irqfd in VM2.  And VM2 would also have
an ioeventfd that is irqfd for VM1 to signal completions.
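
As a rough illustration of the "polling vring memory" half of this idea,
here is a toy Python sketch. It is not the virtio vring layout and not a
proposed interface: the names SharedRing, RING_SIZE, SLOT_SIZE and HDR are
hypothetical, and a bytearray stands in for the Shared Virtqueues BAR that
both VMs would map. It only shows that when the indices and the packet data
sit in one shared region, the receiver can find new packets by reading
memory, with no notification at all.

```python
import struct

RING_SIZE = 8                 # number of slots (assumed for the sketch)
SLOT_SIZE = 64                # bytes per slot: 2-byte length + payload (assumed)
HDR = struct.Struct("=HH")    # avail_idx, used_idx: free-running 16-bit counters


class SharedRing:
    """Toy view of a shared region holding ring indices and packet data.

    Both sides construct a SharedRing over the same buffer. This is a
    simplified single-producer/single-consumer ring, not a real vring.
    """

    def __init__(self, mem):
        self.mem = mem        # stands in for the mapped BAR

    def _indices(self):
        return HDR.unpack_from(self.mem, 0)

    def _slot(self, idx):
        return HDR.size + (idx % RING_SIZE) * SLOT_SIZE

    def send(self, payload):
        """Driver side (VM1): publish one packet; return False if the ring is full."""
        if len(payload) > SLOT_SIZE - 2:
            raise ValueError("payload does not fit in one slot")
        avail, used = self._indices()
        if (avail - used) & 0xFFFF == RING_SIZE:
            return False
        off = self._slot(avail)
        struct.pack_into("=H", self.mem, off, len(payload))
        self.mem[off + 2:off + 2 + len(payload)] = payload
        # Publish the index only after the data is written. Real shared-memory
        # code would need a memory barrier here.
        struct.pack_into("=H", self.mem, 0, (avail + 1) & 0xFFFF)
        return True

    def poll(self):
        """Device side (VM2): return the next packet, or None if nothing is pending."""
        avail, used = self._indices()
        if avail == used:
            return None
        off = self._slot(used)
        (length,) = struct.unpack_from("=H", self.mem, off)
        data = bytes(self.mem[off + 2:off + 2 + length])
        # Advancing used_idx hands the slot back to the sender.
        struct.pack_into("=H", self.mem, 2, (used + 1) & 0xFFFF)
        return data
```

Two SharedRing objects built over the same buffer play the roles of VM1 and
VM2: whatever one side passes to send() comes back, in order, from the other
side's poll(). The eventfd wiring described above would only be needed when
the receiver wants to sleep instead of polling.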

Stefan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-27 10:17           ` Stefan Hajnoczi
@ 2015-04-27 10:36             ` Michael S. Tsirkin
  2015-04-27 12:35             ` Jan Kiszka
  1 sibling, 0 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2015-04-27 10:36 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Rik van Riel, Andrew Jones, Linux Virtualization, Luke Gorrie,
	Stefan Hajnoczi, virtio-comment, Paolo Bonzini,
	Dr. David Alan Gilbert

On Mon, Apr 27, 2015 at 11:17:44AM +0100, Stefan Hajnoczi wrote:
> On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote:
> > On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>
> >> The motivation for making VM-to-VM fast is that while software
> >> switches on the host are efficient today (thanks to vhost-user), there
> >> is no efficient solution if the software switch is a VM.
> >
> >
> > I see. This sounds like a noble goal indeed. I would love to run the
> > software switch as just another VM in the long term. It would make it much
> > easier for the various software switches to coexist in the world.
> >
> > The main technical risk I see in this proposal is that eliminating the
> > memory copies might not have the desired effect. I might be tempted to keep
> > the copies but prevent the kernel from having to inspect the vrings (more
> > like vhost-user). But that is just a hunch and I suppose the first step
> > would be a prototype to check the performance anyway.
> >
> > For what it is worth here is my view of networking performance on x86 in the
> > Haswell+ era:
> > https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow
> 
> Thanks.
> 
> I've been thinking about how to eliminate the VM <-> host <-> VM
> switching and instead achieve just VM <-> VM.
> 
> The holy grail of VM-to-VM networking is an exitless I/O path.  In
> other words, packets can be transferred between VMs without any
> vmexits (this requires a polling driver).
> 
> Here is how it works.  QEMU gets "-device vhost-user" so that a VM can
> act as the vhost-user server:
> 
> VM1 (virtio-net guest driver) <-> VM2 (vhost-user device)
> 
> VM1 has a regular virtio-net PCI device.  VM2 has a vhost-user device
> and plays the host role instead of the normal virtio-net guest driver
> role.
> 
> The ugly thing about this is that VM2 needs to map all of VM1's guest
> RAM so it can access the vrings and packet data.  The solution to this
> is something like the Shared Buffers BAR but this time it contains not
> just the packet data but also the vring, let's call it the Shared
> Virtqueues BAR.
> 
> The Shared Virtqueues BAR eliminates the need for vhost-net on the
> host because VM1 and VM2 communicate directly using virtqueue notify
> or polling vring memory.  Virtqueue notify works by connecting an
> eventfd as ioeventfd in VM1 and irqfd in VM2.  And VM2 would also have
> an ioeventfd that is irqfd for VM1 to signal completions.
> 
> Stefan

So this definitely works; it's just another virtio transport.
It might mean, though, that guests need to copy data to/from
this BAR.

-- 
MST

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-27 10:17           ` Stefan Hajnoczi
  2015-04-27 10:36             ` Michael S. Tsirkin
@ 2015-04-27 12:35             ` Jan Kiszka
  2015-04-27 12:55               ` Jan Kiszka
                                 ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Jan Kiszka @ 2015-04-27 12:35 UTC (permalink / raw)
  To: Stefan Hajnoczi, Luke Gorrie
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, virtio-comment,
	Paolo Bonzini, Dr. David Alan Gilbert

Am 2015-04-27 um 12:17 schrieb Stefan Hajnoczi:
> On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote:
>> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>
>>> The motivation for making VM-to-VM fast is that while software
>>> switches on the host are efficient today (thanks to vhost-user), there
>>> is no efficient solution if the software switch is a VM.
>>
>>
>> I see. This sounds like a noble goal indeed. I would love to run the
>> software switch as just another VM in the long term. It would make it much
>> easier for the various software switches to coexist in the world.
>>
>> The main technical risk I see in this proposal is that eliminating the
>> memory copies might not have the desired effect. I might be tempted to keep
>> the copies but prevent the kernel from having to inspect the vrings (more
>> like vhost-user). But that is just a hunch and I suppose the first step
>> would be a prototype to check the performance anyway.
>>
>> For what it is worth here is my view of networking performance on x86 in the
>> Haswell+ era:
>> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow
> 
> Thanks.
> 
> I've been thinking about how to eliminate the VM <-> host <-> VM
> switching and instead achieve just VM <-> VM.
> 
> The holy grail of VM-to-VM networking is an exitless I/O path.  In
> other words, packets can be transferred between VMs without any
> vmexits (this requires a polling driver).
> 
> Here is how it works.  QEMU gets "-device vhost-user" so that a VM can
> act as the vhost-user server:
> 
> VM1 (virtio-net guest driver) <-> VM2 (vhost-user device)
> 
> VM1 has a regular virtio-net PCI device.  VM2 has a vhost-user device
> and plays the host role instead of the normal virtio-net guest driver
> role.
> 
> The ugly thing about this is that VM2 needs to map all of VM1's guest
> RAM so it can access the vrings and packet data.  The solution to this
> is something like the Shared Buffers BAR but this time it contains not
> just the packet data but also the vring, let's call it the Shared
> Virtqueues BAR.
> 
> The Shared Virtqueues BAR eliminates the need for vhost-net on the
> host because VM1 and VM2 communicate directly using virtqueue notify
> or polling vring memory.  Virtqueue notify works by connecting an
> eventfd as ioeventfd in VM1 and irqfd in VM2.  And VM2 would also have
> an ioeventfd that is irqfd for VM1 to signal completions.

We had such a discussion before:
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658

Would be great to get this ball rolling again.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-27 12:35             ` Jan Kiszka
@ 2015-04-27 12:55               ` Jan Kiszka
  2015-04-27 13:01                 ` Stefan Hajnoczi
  2015-04-27 12:57               ` Stefan Hajnoczi
  2015-04-27 13:17               ` Michael S. Tsirkin
  2 siblings, 1 reply; 25+ messages in thread
From: Jan Kiszka @ 2015-04-27 12:55 UTC (permalink / raw)
  To: Stefan Hajnoczi, Luke Gorrie
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, Paolo Bonzini,
	Dr. David Alan Gilbert

Am 2015-04-27 um 14:35 schrieb Jan Kiszka:
> Am 2015-04-27 um 12:17 schrieb Stefan Hajnoczi:
>> On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote:
>>> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>
>>>> The motivation for making VM-to-VM fast is that while software
>>>> switches on the host are efficient today (thanks to vhost-user), there
>>>> is no efficient solution if the software switch is a VM.
>>>
>>>
>>> I see. This sounds like a noble goal indeed. I would love to run the
>>> software switch as just another VM in the long term. It would make it much
>>> easier for the various software switches to coexist in the world.
>>>
>>> The main technical risk I see in this proposal is that eliminating the
>>> memory copies might not have the desired effect. I might be tempted to keep
>>> the copies but prevent the kernel from having to inspect the vrings (more
>>> like vhost-user). But that is just a hunch and I suppose the first step
>>> would be a prototype to check the performance anyway.
>>>
>>> For what it is worth here is my view of networking performance on x86 in the
>>> Haswell+ era:
>>> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow
>>
>> Thanks.
>>
>> I've been thinking about how to eliminate the VM <-> host <-> VM
>> switching and instead achieve just VM <-> VM.
>>
>> The holy grail of VM-to-VM networking is an exitless I/O path.  In
>> other words, packets can be transferred between VMs without any
>> vmexits (this requires a polling driver).
>>
>> Here is how it works.  QEMU gets "-device vhost-user" so that a VM can
>> act as the vhost-user server:
>>
>> VM1 (virtio-net guest driver) <-> VM2 (vhost-user device)
>>
>> VM1 has a regular virtio-net PCI device.  VM2 has a vhost-user device
>> and plays the host role instead of the normal virtio-net guest driver
>> role.
>>
>> The ugly thing about this is that VM2 needs to map all of VM1's guest
>> RAM so it can access the vrings and packet data.  The solution to this
>> is something like the Shared Buffers BAR but this time it contains not
>> just the packet data but also the vring, let's call it the Shared
>> Virtqueues BAR.
>>
>> The Shared Virtqueues BAR eliminates the need for vhost-net on the
>> host because VM1 and VM2 communicate directly using virtqueue notify
>> or polling vring memory.  Virtqueue notify works by connecting an
>> eventfd as ioeventfd in VM1 and irqfd in VM2.  And VM2 would also have
>> an ioeventfd that is irqfd for VM1 to signal completions.
> 
> We had such a discussion before:
> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658
> 
> Would be great to get this ball rolling again.
> 
> Jan
> 

But one challenge would remain even then (unless both sides only poll):
exit-free inter-VM signaling, no? But that's a hardware issue first of all.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-27 12:55               ` Jan Kiszka
@ 2015-04-27 13:01                 ` Stefan Hajnoczi
  2015-04-27 13:08                   ` Muli Ben-Yehuda
  2015-04-27 14:30                   ` Jan Kiszka
  0 siblings, 2 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2015-04-27 13:01 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Rik van Riel, Michael S. Tsirkin, Andrew Jones,
	Linux Virtualization, Luke Gorrie, Stefan Hajnoczi, Paolo Bonzini,
	Dr. David Alan Gilbert

On Mon, Apr 27, 2015 at 1:55 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> Am 2015-04-27 um 14:35 schrieb Jan Kiszka:
>> Am 2015-04-27 um 12:17 schrieb Stefan Hajnoczi:
>>> On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote:
>>>> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>
>>>>> The motivation for making VM-to-VM fast is that while software
>>>>> switches on the host are efficient today (thanks to vhost-user), there
>>>>> is no efficient solution if the software switch is a VM.
>>>>
>>>>
>>>> I see. This sounds like a noble goal indeed. I would love to run the
>>>> software switch as just another VM in the long term. It would make it much
>>>> easier for the various software switches to coexist in the world.
>>>>
>>>> The main technical risk I see in this proposal is that eliminating the
>>>> memory copies might not have the desired effect. I might be tempted to keep
>>>> the copies but prevent the kernel from having to inspect the vrings (more
>>>> like vhost-user). But that is just a hunch and I suppose the first step
>>>> would be a prototype to check the performance anyway.
>>>>
>>>> For what it is worth here is my view of networking performance on x86 in the
>>>> Haswell+ era:
>>>> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow
>>>
>>> Thanks.
>>>
>>> I've been thinking about how to eliminate the VM <-> host <-> VM
>>> switching and instead achieve just VM <-> VM.
>>>
>>> The holy grail of VM-to-VM networking is an exitless I/O path.  In
>>> other words, packets can be transferred between VMs without any
>>> vmexits (this requires a polling driver).
>>>
>>> Here is how it works.  QEMU gets "-device vhost-user" so that a VM can
>>> act as the vhost-user server:
>>>
>>> VM1 (virtio-net guest driver) <-> VM2 (vhost-user device)
>>>
>>> VM1 has a regular virtio-net PCI device.  VM2 has a vhost-user device
>>> and plays the host role instead of the normal virtio-net guest driver
>>> role.
>>>
>>> The ugly thing about this is that VM2 needs to map all of VM1's guest
>>> RAM so it can access the vrings and packet data.  The solution to this
>>> is something like the Shared Buffers BAR but this time it contains not
>>> just the packet data but also the vring, let's call it the Shared
>>> Virtqueues BAR.
>>>
>>> The Shared Virtqueues BAR eliminates the need for vhost-net on the
>>> host because VM1 and VM2 communicate directly using virtqueue notify
>>> or polling vring memory.  Virtqueue notify works by connecting an
>>> eventfd as ioeventfd in VM1 and irqfd in VM2.  And VM2 would also have
>>> an ioeventfd that is irqfd for VM1 to signal completions.
>>
>> We had such a discussion before:
>> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658
>>
>> Would be great to get this ball rolling again.
>>
>> Jan
>>
>
> But one challenge would remain even then (unless both sides only poll):
> exit-free inter-VM signaling, no? But that's a hardware issue first of all.

To start with ioeventfd<->irqfd can be used.  It incurs a light-weight
exit in VM1 and interrupt injection in VM2.

For networking the cost is mitigated by NAPI drivers which switch
between interrupts and polling.  During notification-heavy periods the
guests would use polling anyway.
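
The interrupt/polling switch can be sketched with a toy Python model. This
is only an illustration of the NAPI-style pattern, not kernel code: the
class NapiLikeQueue and its fields are hypothetical, and the "interrupt" is
just a counter. The receiver masks notifications on the first kick, polls
up to a budget per pass, and unmasks only when a pass comes in under
budget.

```python
from collections import deque


class NapiLikeQueue:
    """Toy model of a receive queue that switches between kicks and polling."""

    def __init__(self, budget=4):
        self.pending = deque()
        self.budget = budget          # max packets handled per poll pass
        self.notify_enabled = True    # receiver currently wants a kick
        self.interrupts = 0           # kicks actually delivered

    def enqueue(self, pkt):
        """Sender side: queue a packet and kick only if the receiver asked for it."""
        self.pending.append(pkt)
        if self.notify_enabled:
            self.interrupts += 1
            # Models the receiver's handler masking further notifications
            # and switching to polling.
            self.notify_enabled = False

    def poll(self):
        """Receiver side: handle up to `budget` packets in one pass."""
        done = []
        while self.pending and len(done) < self.budget:
            done.append(self.pending.popleft())
        if len(done) < self.budget:
            # Under budget means the queue is drained: go back to interrupts.
            self.notify_enabled = True
        return done
```

With a burst of ten packets and a budget of four, this model delivers one
kick for the whole burst and drains it in three poll passes; the next
packet after the queue goes idle costs one more kick.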

A hardware solution would be some kind of inter-guest interrupt
injection.  I don't know VMX well enough to know whether that is
possible on Intel CPUs.

Stefan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-27 13:01                 ` Stefan Hajnoczi
@ 2015-04-27 13:08                   ` Muli Ben-Yehuda
  2015-04-27 14:30                   ` Jan Kiszka
  1 sibling, 0 replies; 25+ messages in thread
From: Muli Ben-Yehuda @ 2015-04-27 13:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Rik van Riel, Michael S. Tsirkin, Jan Kiszka, Andrew Jones,
	Linux Virtualization, Luke Gorrie, Stefan Hajnoczi, Paolo Bonzini,
	Dr. David Alan Gilbert

On Mon, Apr 27, 2015 at 02:01:05PM +0100, Stefan Hajnoczi wrote:

> A hardware solution would be some kind of inter-guest interrupt
> injection.  I don't know VMX well enough to know whether that is
> possible on Intel CPUs.

It is: http://www.mulix.org/pubs/eli/eli.pdf.

(And there's hardware coming down the pipe that will make (some of)
the nasty tricks we used unnecessary).

Cheers,
Muli

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-27 13:01                 ` Stefan Hajnoczi
  2015-04-27 13:08                   ` Muli Ben-Yehuda
@ 2015-04-27 14:30                   ` Jan Kiszka
  2015-04-27 14:36                     ` Luke Gorrie
  2015-04-27 14:40                     ` Michael S. Tsirkin
  1 sibling, 2 replies; 25+ messages in thread
From: Jan Kiszka @ 2015-04-27 14:30 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Rik van Riel, Michael S. Tsirkin, Andrew Jones,
	Linux Virtualization, Luke Gorrie, Stefan Hajnoczi, Paolo Bonzini,
	Dr. David Alan Gilbert

Am 2015-04-27 um 15:01 schrieb Stefan Hajnoczi:
> On Mon, Apr 27, 2015 at 1:55 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> Am 2015-04-27 um 14:35 schrieb Jan Kiszka:
>>> Am 2015-04-27 um 12:17 schrieb Stefan Hajnoczi:
>>>> On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote:
>>>>> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>>
>>>>>> The motivation for making VM-to-VM fast is that while software
>>>>>> switches on the host are efficient today (thanks to vhost-user), there
>>>>>> is no efficient solution if the software switch is a VM.
>>>>>
>>>>>
>>>>> I see. This sounds like a noble goal indeed. I would love to run the
>>>>> software switch as just another VM in the long term. It would make it much
>>>>> easier for the various software switches to coexist in the world.
>>>>>
>>>>> The main technical risk I see in this proposal is that eliminating the
>>>>> memory copies might not have the desired effect. I might be tempted to keep
>>>>> the copies but prevent the kernel from having to inspect the vrings (more
>>>>> like vhost-user). But that is just a hunch and I suppose the first step
>>>>> would be a prototype to check the performance anyway.
>>>>>
>>>>> For what it is worth here is my view of networking performance on x86 in the
>>>>> Haswell+ era:
>>>>> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow
>>>>
>>>> Thanks.
>>>>
>>>> I've been thinking about how to eliminate the VM <-> host <-> VM
>>>> switching and instead achieve just VM <-> VM.
>>>>
>>>> The holy grail of VM-to-VM networking is an exitless I/O path.  In
>>>> other words, packets can be transferred between VMs without any
>>>> vmexits (this requires a polling driver).
>>>>
>>>> Here is how it works.  QEMU gets "-device vhost-user" so that a VM can
>>>> act as the vhost-user server:
>>>>
>>>> VM1 (virtio-net guest driver) <-> VM2 (vhost-user device)
>>>>
>>>> VM1 has a regular virtio-net PCI device.  VM2 has a vhost-user device
>>>> and plays the host role instead of the normal virtio-net guest driver
>>>> role.
>>>>
>>>> The ugly thing about this is that VM2 needs to map all of VM1's guest
>>>> RAM so it can access the vrings and packet data.  The solution to this
>>>> is something like the Shared Buffers BAR but this time it contains not
>>>> just the packet data but also the vring, let's call it the Shared
>>>> Virtqueues BAR.
>>>>
>>>> The Shared Virtqueues BAR eliminates the need for vhost-net on the
>>>> host because VM1 and VM2 communicate directly using virtqueue notify
>>>> or polling vring memory.  Virtqueue notify works by connecting an
>>>> eventfd as ioeventfd in VM1 and irqfd in VM2.  And VM2 would also have
>>>> an ioeventfd that is irqfd for VM1 to signal completions.
>>>
>>> We had such a discussion before:
>>> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658
>>>
>>> Would be great to get this ball rolling again.
>>>
>>> Jan
>>>
>>
>> But one challenge would remain even then (unless both sides only poll):
>> exit-free inter-VM signaling, no? But that's a hardware issue first of all.
> 
> To start with ioeventfd<->irqfd can be used.  It incurs a light-weight
> exit in VM1 and interrupt injection in VM2.
> 
> For networking the cost is mitigated by NAPI drivers which switch
> between interrupts and polling.  During notification-heavy periods the
> guests would use polling anyway.
> 
> A hardware solution would be some kind of inter-guest interrupt
> injection.  I don't know VMX well enough to know whether that is
> possible on Intel CPUs.

Today, we have posted interrupts to avoid the vm-exit on the target CPU,
but there is nothing yet (to my best knowledge) to avoid the exit on the
sender side (unless we ignore security). That's the same problem with
intra-guest IPIs, BTW.

For throughput and given NAPI patterns, that's probably not an issue as
you noted. It may be for latency, though, when almost every cycle counts.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-27 14:30                   ` Jan Kiszka
@ 2015-04-27 14:36                     ` Luke Gorrie
  2015-04-27 14:38                       ` Jan Kiszka
  2015-04-27 14:40                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 25+ messages in thread
From: Luke Gorrie @ 2015-04-27 14:36 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, Paolo Bonzini,
	Dr. David Alan Gilbert


On 27 April 2015 at 16:30, Jan Kiszka <jan.kiszka@siemens.com> wrote:

> Today, we have posted interrupts to avoid the vm-exit on the target CPU,
> but there is nothing yet (to my best knowledge) to avoid the exit on the
> sender side (unless we ignore security). That's the same problem with
> intra-guest IPIs, BTW.
>
> For throughput and given NAPI patterns, that's probably not an issue as
> you noted. It may be for latency, though, when almost every cycle counts.
>

Poll-mode networking applications (DPDK, Snabb Switch, etc) are typically
busy-looping to poll the vring. They may have a very short usleep() between
checks to save power but they don't wait on their eventfd. So for those
particular applications latency is on the order of tens of microseconds
even without guest exits.
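
A minimal Python sketch of that loop shape, for illustration only: the
function poll_until and its parameters are hypothetical, the 50 microsecond
default is an assumed value, and Python's time.sleep() is far coarser than
what a DPDK or Snabb Switch process would see. The point is only that the
sleep interval, not a guest exit, bounds how long a ready packet can wait.

```python
import time


def poll_until(ready, interval_s=50e-6, timeout_s=1.0):
    """Poll `ready()` with a short sleep between checks.

    Returns the number of checks made before `ready()` returned True.
    Raises TimeoutError if that does not happen within `timeout_s`.
    """
    deadline = time.monotonic() + timeout_s
    checks = 0
    while True:
        checks += 1
        if ready():
            return checks
        if time.monotonic() >= deadline:
            raise TimeoutError("nothing arrived before the deadline")
        # Short pause to save power; this is the worst-case added latency.
        time.sleep(interval_s)
```

In a real poll-mode application `ready` would be a check of the vring
indices, and nothing in the loop waits on an eventfd.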

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-27 14:36                     ` Luke Gorrie
@ 2015-04-27 14:38                       ` Jan Kiszka
  0 siblings, 0 replies; 25+ messages in thread
From: Jan Kiszka @ 2015-04-27 14:38 UTC (permalink / raw)
  To: Luke Gorrie
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, Paolo Bonzini,
	Dr. David Alan Gilbert

Am 2015-04-27 um 16:36 schrieb Luke Gorrie:
> On 27 April 2015 at 16:30, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> 
>> Today, we have posted interrupts to avoid the vm-exit on the target CPU,
>> but there is nothing yet (to my best knowledge) to avoid the exit on the
>> sender side (unless we ignore security). That's the same problem with
>> intra-guest IPIs, BTW.
>>
>> For throughput and given NAPI patterns, that's probably not an issue as
>> you noted. It may be for latency, though, when almost every cycle counts.
>>
> 
> Poll-mode networking applications (DPDK, Snabb Switch, etc) are typically
> busy-looping to poll the vring. They may have a very short usleep() between
> checks to save power but they don't wait on their eventfd. So for those
> particular applications latency is on the order of tens of microseconds
> even without guest exits.

That's one side; don't forget the others (the "normal" guests).

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-27 14:30                   ` Jan Kiszka
  2015-04-27 14:36                     ` Luke Gorrie
@ 2015-04-27 14:40                     ` Michael S. Tsirkin
  1 sibling, 0 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2015-04-27 14:40 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Rik van Riel, Andrew Jones, Linux Virtualization, Luke Gorrie,
	Stefan Hajnoczi, Paolo Bonzini, Dr. David Alan Gilbert

On Mon, Apr 27, 2015 at 04:30:35PM +0200, Jan Kiszka wrote:
> Am 2015-04-27 um 15:01 schrieb Stefan Hajnoczi:
> > On Mon, Apr 27, 2015 at 1:55 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> >> Am 2015-04-27 um 14:35 schrieb Jan Kiszka:
> >>> Am 2015-04-27 um 12:17 schrieb Stefan Hajnoczi:
> >>>> On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote:
> >>>>> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>>>
> >>>>>> The motivation for making VM-to-VM fast is that while software
> >>>>>> switches on the host are efficient today (thanks to vhost-user), there
> >>>>>> is no efficient solution if the software switch is a VM.
> >>>>>
> >>>>>
> >>>>> I see. This sounds like a noble goal indeed. I would love to run the
> >>>>> software switch as just another VM in the long term. It would make it much
> >>>>> easier for the various software switches to coexist in the world.
> >>>>>
> >>>>> The main technical risk I see in this proposal is that eliminating the
> >>>>> memory copies might not have the desired effect. I might be tempted to keep
> >>>>> the copies but prevent the kernel from having to inspect the vrings (more
> >>>>> like vhost-user). But that is just a hunch and I suppose the first step
> >>>>> would be a prototype to check the performance anyway.
> >>>>>
> >>>>> For what it is worth here is my view of networking performance on x86 in the
> >>>>> Haswell+ era:
> >>>>> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow
> >>>>
> >>>> Thanks.
> >>>>
> >>>> I've been thinking about how to eliminate the VM <-> host <-> VM
> >>>> switching and instead achieve just VM <-> VM.
> >>>>
> >>>> The holy grail of VM-to-VM networking is an exitless I/O path.  In
> >>>> other words, packets can be transferred between VMs without any
> >>>> vmexits (this requires a polling driver).
> >>>>
> >>>> Here is how it works.  QEMU gets "-device vhost-user" so that a VM can
> >>>> act as the vhost-user server:
> >>>>
> >>>> VM1 (virtio-net guest driver) <-> VM2 (vhost-user device)
> >>>>
> >>>> VM1 has a regular virtio-net PCI device.  VM2 has a vhost-user device
> >>>> and plays the host role instead of the normal virtio-net guest driver
> >>>> role.
> >>>>
> >>>> The ugly thing about this is that VM2 needs to map all of VM1's guest
> >>>> RAM so it can access the vrings and packet data.  The solution to this
> >>>> is something like the Shared Buffers BAR but this time it contains not
> >>>> just the packet data but also the vring, let's call it the Shared
> >>>> Virtqueues BAR.
> >>>>
> >>>> The Shared Virtqueues BAR eliminates the need for vhost-net on the
> >>>> host because VM1 and VM2 communicate directly using virtqueue notify
> >>>> or polling vring memory.  Virtqueue notify works by connecting an
> >>>> eventfd as ioeventfd in VM1 and irqfd in VM2.  And VM2 would also have
> >>>> an ioeventfd that is irqfd for VM1 to signal completions.
> >>>
> >>> We had such a discussion before:
> >>> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658
> >>>
> >>> Would be great to get this ball rolling again.
> >>>
> >>> Jan
> >>>
> >>
> >> But one challenge would remain even then (unless both sides only poll):
> >> exit-free inter-VM signaling, no? But that's a hardware issue first of all.
> > 
> > To start with ioeventfd<->irqfd can be used.  It incurs a light-weight
> > exit in VM1 and interrupt injection in VM2.
> > 
> > For networking the cost is mitigated by NAPI drivers which switch
> > between interrupts and polling.  During notification-heavy periods the
> > guests would use polling anyway.
> > 
> > A hardware solution would be some kind of inter-guest interrupt
> > injection.  I don't know VMX well enough to know whether that is
> > possible on Intel CPUs.
> 
> Today, we have posted interrupts to avoid the vm-exit on the target CPU,
> but there is nothing yet (to my best knowledge) to avoid the exit on the
> sender side (unless we ignore security). That's the same problem with
> intra-guest IPIs, BTW.
> 
> For throughput and given NAPI patterns, that's probably not an issue as
> you noted. It may be for latency, though, when almost every cycle counts.
> 
> Jan

If you are counting cycles, you likely can't afford the
interrupt latency under Linux, so you have to poll
memory.
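
As a rough illustration of what "poll memory" means for a virtqueue, the
sketch below models only the index arithmetic: split-virtqueue ring indices
are 16-bit free-running counters, so the number of new entries is the
wrapped difference between the published index and the last one consumed.
The names `new_entries`, `poll_used`, `read_used_idx` and `max_spins` are
hypothetical, and a real driver would also need memory barriers and would
read the index from shared ring memory rather than through a callback.

```python
IDX_MASK = 0xFFFF  # split-virtqueue indices are 16-bit free-running counters


def new_entries(used_idx: int, last_seen: int) -> int:
    """Number of entries published since last_seen, allowing for wrap-around."""
    return (used_idx - last_seen) & IDX_MASK


def poll_used(read_used_idx, last_seen: int, max_spins: int = 1_000_000):
    """Busy-poll the used index instead of waiting for an interrupt.

    read_used_idx is a hypothetical callable standing in for a load from
    shared ring memory. Returns (new_last_seen, count); count is 0 if the
    index did not move within max_spins iterations.
    """
    for _ in range(max_spins):
        count = new_entries(read_used_idx(), last_seen)
        if count:
            return (last_seen + count) & IDX_MASK, count
    return last_seen, 0
```

A caller would typically loop on `poll_used()` and process `count` entries
each time it returns a non-zero count.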

> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SES-DE
> Corporate Competence Center Embedded Linux


* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-27 12:35             ` Jan Kiszka
  2015-04-27 12:55               ` Jan Kiszka
@ 2015-04-27 12:57               ` Stefan Hajnoczi
  2015-04-27 13:17               ` Michael S. Tsirkin
  2 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2015-04-27 12:57 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Rik van Riel, Michael S. Tsirkin, Andrew Jones,
	Linux Virtualization, Luke Gorrie, Stefan Hajnoczi,
	virtio-comment, Paolo Bonzini, Dr. David Alan Gilbert

On Mon, Apr 27, 2015 at 1:35 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2015-04-27 at 12:17, Stefan Hajnoczi wrote:
>> On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote:
>>> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>
>>>> The motivation for making VM-to-VM fast is that while software
>>>> switches on the host are efficient today (thanks to vhost-user), there
>>>> is no efficient solution if the software switch is a VM.
>>>
>>>
>>> I see. This sounds like a noble goal indeed. I would love to run the
>>> software switch as just another VM in the long term. It would make it much
>>> easier for the various software switches to coexist in the world.
>>>
>>> The main technical risk I see in this proposal is that eliminating the
>>> memory copies might not have the desired effect. I might be tempted to keep
>>> the copies but prevent the kernel from having to inspect the vrings (more
>>> like vhost-user). But that is just a hunch and I suppose the first step
>>> would be a prototype to check the performance anyway.
>>>
>>> For what it is worth here is my view of networking performance on x86 in the
>>> Haswell+ era:
>>> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow
>>
>> Thanks.
>>
>> I've been thinking about how to eliminate the VM <-> host <-> VM
>> switching and instead achieve just VM <-> VM.
>>
>> The holy grail of VM-to-VM networking is an exitless I/O path.  In
>> other words, packets can be transferred between VMs without any
>> vmexits (this requires a polling driver).
>>
>> Here is how it works.  QEMU gets "-device vhost-user" so that a VM can
>> act as the vhost-user server:
>>
>> VM1 (virtio-net guest driver) <-> VM2 (vhost-user device)
>>
>> VM1 has a regular virtio-net PCI device.  VM2 has a vhost-user device
>> and plays the host role instead of the normal virtio-net guest driver
>> role.
>>
>> The ugly thing about this is that VM2 needs to map all of VM1's guest
>> RAM so it can access the vrings and packet data.  The solution to this
>> is something like the Shared Buffers BAR but this time it contains not
>> just the packet data but also the vring, let's call it the Shared
>> Virtqueues BAR.
>>
>> The Shared Virtqueues BAR eliminates the need for vhost-net on the
>> host because VM1 and VM2 communicate directly using virtqueue notify
>> or polling vring memory.  Virtqueue notify works by connecting an
>> eventfd as ioeventfd in VM1 and irqfd in VM2.  And VM2 would also have
>> an ioeventfd that is irqfd for VM1 to signal completions.
>
> We had such a discussion before:
> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658
>
> Would be great to get this ball rolling again.

Thanks for the interesting link.

Now that vhost-user exists, a QEMU -device vhost-user feature is a
logical step.  It would allow any virtio device to be emulated by
another VM, not just virtio-net.  It seems like a nice model for
storage and networking appliance VMs.

I don't have time to write the patches in the near future but can
participate in code review and discussion.

Stefan


* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-27 12:35             ` Jan Kiszka
  2015-04-27 12:55               ` Jan Kiszka
  2015-04-27 12:57               ` Stefan Hajnoczi
@ 2015-04-27 13:17               ` Michael S. Tsirkin
  2 siblings, 0 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2015-04-27 13:17 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Rik van Riel, Andrew Jones, Linux Virtualization, Luke Gorrie,
	Stefan Hajnoczi, virtio-comment, Paolo Bonzini,
	Dr. David Alan Gilbert

On Mon, Apr 27, 2015 at 02:35:19PM +0200, Jan Kiszka wrote:
> On 2015-04-27 at 12:17, Stefan Hajnoczi wrote:
> > On Sun, Apr 26, 2015 at 2:24 PM, Luke Gorrie <luke@snabb.co> wrote:
> >> On 24 April 2015 at 15:22, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>
> >>> The motivation for making VM-to-VM fast is that while software
> >>> switches on the host are efficient today (thanks to vhost-user), there
> >>> is no efficient solution if the software switch is a VM.
> >>
> >>
> >> I see. This sounds like a noble goal indeed. I would love to run the
> >> software switch as just another VM in the long term. It would make it much
> >> easier for the various software switches to coexist in the world.
> >>
> >> The main technical risk I see in this proposal is that eliminating the
> >> memory copies might not have the desired effect. I might be tempted to keep
> >> the copies but prevent the kernel from having to inspect the vrings (more
> >> like vhost-user). But that is just a hunch and I suppose the first step
> >> would be a prototype to check the performance anyway.
> >>
> >> For what it is worth here is my view of networking performance on x86 in the
> >> Haswell+ era:
> >> https://groups.google.com/forum/#!topic/snabb-devel/aez4pEnd4ow
> > 
> > Thanks.
> > 
> > I've been thinking about how to eliminate the VM <-> host <-> VM
> > switching and instead achieve just VM <-> VM.
> > 
> > The holy grail of VM-to-VM networking is an exitless I/O path.  In
> > other words, packets can be transferred between VMs without any
> > vmexits (this requires a polling driver).
> > 
> > Here is how it works.  QEMU gets "-device vhost-user" so that a VM can
> > act as the vhost-user server:
> > 
> > VM1 (virtio-net guest driver) <-> VM2 (vhost-user device)
> > 
> > VM1 has a regular virtio-net PCI device.  VM2 has a vhost-user device
> > and plays the host role instead of the normal virtio-net guest driver
> > role.
> > 
> > The ugly thing about this is that VM2 needs to map all of VM1's guest
> > RAM so it can access the vrings and packet data.  The solution to this
> > is something like the Shared Buffers BAR but this time it contains not
> > just the packet data but also the vring, let's call it the Shared
> > Virtqueues BAR.
> > 
> > The Shared Virtqueues BAR eliminates the need for vhost-net on the
> > host because VM1 and VM2 communicate directly using virtqueue notify
> > or polling vring memory.  Virtqueue notify works by connecting an
> > eventfd as ioeventfd in VM1 and irqfd in VM2.  And VM2 would also have
> > an ioeventfd that is irqfd for VM1 to signal completions.
> 
> We had such a discussion before:
> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123014/focus=279658
> 
> Would be great to get this ball rolling again.
> 
> Jan

I think fundamentally, reducing the stress on the host scheduler
can give a bigger gain than zero copy.

But if I was to implement this, I wouldn't start with the funky virtio
BAR thing.

Start by enabling DPDK vhost-port within guest as-is.
To this end, we can try implementing virtio-vhost:

Assume we want to bridge VMX and VMY using bridge in VMB.

- expose all of VMX and VMY memory as device BARs, or as some other
  region within VMB memory
- add interface to send vhost-user messages to VMB
  (and ack them); the messages include tables that
  translate from VMX/VMY physical to VMB physical.
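
A minimal sketch of the kind of translation table described in the list
above, assuming each entry maps a contiguous range of VMX/VMY guest-physical
addresses to the place where that range shows up in VMB. The `Region` and
`TranslationTable` names and the field layout are hypothetical, not the
vhost-user wire format.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Region(NamedTuple):
    guest_base: int  # start of the range in VMX/VMY physical address space
    size: int        # length of the range in bytes
    vmb_base: int    # where the range is mapped in VMB physical address space


class TranslationTable:
    """Maps VMX/VMY physical addresses to VMB physical addresses."""

    def __init__(self, regions):
        self._regions = sorted(regions, key=lambda r: r.guest_base)
        self._starts = [r.guest_base for r in self._regions]

    def translate(self, guest_addr: int, length: int = 1) -> Optional[int]:
        """Return the VMB address, or None if the access is not fully mapped."""
        i = bisect_right(self._starts, guest_addr) - 1
        if i < 0:
            return None
        region = self._regions[i]
        if guest_addr + length > region.guest_base + region.size:
            return None  # access runs past the end of the region
        return region.vmb_base + (guest_addr - region.guest_base)
```

Rejecting accesses that cross a region boundary keeps the lookup simple; a
fuller version would split such accesses across regions.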

The simplest guest driver then just copies from VMX TX ring to VMY RX
ring, and vice versa.
This will let you test performance somewhat easily.
When used as a Linux netdev, we probably will have to do extra data copies,
at least initially.
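
The copying bridge described above can be modeled in a few lines. This is a
toy sketch, not driver code: the rings are plain `deque` objects, packets
are `bytes`, receive buffers are `bytearray`, and the `forward` name and its
parameters are hypothetical.

```python
from collections import deque


def forward(tx_ring: deque, rx_free: deque, rx_used: deque) -> int:
    """Copy packets from one VM's TX ring into another VM's free RX buffers.

    Each copied packet is reported on rx_used as (buffer, length).
    Returns the number of packets moved.
    """
    moved = 0
    while tx_ring and rx_free:
        pkt = tx_ring[0]
        buf = rx_free[0]
        if len(pkt) > len(buf):
            break  # receive buffer too small; a real driver would chain or drop
        tx_ring.popleft()
        rx_free.popleft()
        buf[:len(pkt)] = pkt  # the one data copy this design accepts
        rx_used.append((buf, len(pkt)))
        moved += 1
    return moved
```

Bridging VMX and VMY would call `forward()` once per direction on each pass
of the polling loop.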


The point is that you get full interoperability with existing
virtio, and test performance without rewriting everything
first.

One nice property is that KVM can log accesses for us.
By detecting VMB accesses to memory of VMX and forwarding
them to QEMU running VMX, we can make migration work
out of the box.

This might also mean vringh code is reusable to make a Linux
driver for this device - IIRC dirty logging was the biggest
hurdle to make vringh work well for vhost.


> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SES-DE
> Corporate Competence Center Embedded Linux


* Re: [virtio-dev] Zerocopy VM-to-VM networking using virtio-net
  2015-04-24  9:47   ` Stefan Hajnoczi
  2015-04-24  9:50     ` Stefan Hajnoczi
  2015-04-24 12:17     ` Luke Gorrie
@ 2015-04-24 12:34     ` Luke Gorrie
  2 siblings, 0 replies; 25+ messages in thread
From: Luke Gorrie @ 2015-04-24 12:34 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Andrew Jones, Michael S. Tsirkin, Rik van Riel,
	Linux Virtualization, Stefan Hajnoczi, Paolo Bonzini,
	Dr. David Alan Gilbert


On 24 April 2015 at 11:47, Stefan Hajnoczi <stefanha@gmail.com> wrote:

> > Incidentally, we also did a pile of work last year on zero-copy NIC->VM
> > transfers and discovered a lot of interesting problems and edge cases
> > where Virtio-net spec and/or drivers are hard to match up with common
> > NICs. Happy to explain a bit about our experience if that would be
> > valuable.
>
> That sounds interesting, can you describe the setup?
>

Sure.

We implemented a zero-copy receive path that maps guest buffers received
from the avail ring directly onto hardware receive buffers on a dedicated
hardware receive queue for that VM (VMDq).

This means that when the NIC receives a packet it stores it directly into
the guest's memory but the vswitch has the opportunity to do as much or as
little processing as it wants before making the packet available with a
used ring descriptor.

This scheme seems quite elegant to me. (I am sure it is not original - this
is what the VMDq hardware feature is for, after all.) The devil is in the
details though.

I suspect it would work well given two extensions to Virtio-net:

1. The 'used' ring to allow an offset where the payload starts.

2. The guest to always supply buffers with space for >= 2048 bytes of
payload.

but without these things it is tricky to satisfy the requirements of real
NICs such as the Intel 10G ones. There are conflicting requirements. For
example:

- NIC requires buffer sizes to be uniform and a multiple of 1024 bytes.
Guest supplies variable-size buffers often of ~1500 bytes. These need to
be either rounded down to 1024 bytes (causing excessive segmentation) or
rounded up to 2048 bytes (requiring jumbo frames to be globally disabled on
the port to avoid potential overruns).

- Virtio-net with MRG_RXBUF expects the packet payload to be in a different
offset for the first descriptor in a chain (offset 14 after the vnet
header) vs following descriptors in the chain (offset 0). The NIC always
stores packets at the same offset so the vswitch needs to pick one and then
correct with memmove() when needed.

- If the vswitch wants to shorten the packet payload, e.g. to remove
encapsulation, then this will require a memmove() because there is no way
to communicate an offset on the used ring.

- The NIC has a limit to how many receive descriptors it can chain
together. If the guest is supplying small buffers then this limit may be
too low for jumbo frames to be received.
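
The arithmetic behind these conflicts can be sketched as follows. The
1024-byte granule comes from the text above; the chain limit of 8 and the
9000-byte jumbo frame used in the usage note are illustrative assumptions,
and all function names are hypothetical.

```python
NIC_GRANULE = 1024  # NIC buffer sizes must be a multiple of this (per the text)


def round_down(size: int, granule: int = NIC_GRANULE) -> int:
    """Largest multiple of granule that is <= size."""
    return (size // granule) * granule


def round_up(size: int, granule: int = NIC_GRANULE) -> int:
    """Smallest multiple of granule that is >= size."""
    return -(-size // granule) * granule


def descriptors_needed(frame_len: int, buf_size: int) -> int:
    """Receive descriptors that a frame of frame_len bytes occupies."""
    return -(-frame_len // buf_size)


def fits_chain(frame_len: int, buf_size: int, max_chain: int) -> bool:
    """True if the frame fits within the NIC's descriptor chain limit."""
    return descriptors_needed(frame_len, buf_size) <= max_chain


def shift_payload(buf: bytearray, src_off: int, dst_off: int, length: int) -> None:
    """memmove()-style in-place move of a payload to the offset the guest expects."""
    buf[dst_off:dst_off + length] = buf[src_off:src_off + length]
```

With these definitions a ~1500-byte buffer rounds down to 1024 bytes, so a
1500-byte frame takes two descriptors, or rounds up to 2048 bytes, so it
takes one. Under the assumed chain limit of 8, an assumed 9000-byte jumbo
frame does not fit in 1024-byte buffers (it needs 9) but does fit in
2048-byte buffers (it needs 5).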

... and at a certain point we decided we were better off switching our
focus away from clever-but-fragile NIC hacks and towards clever-and-robust
SIMD hacks, and that is the path we have been on since a few months ago.


Thread overview: 25+ messages
     [not found] <20150422170138.GA8388@stefanha-thinkpad.redhat.com>
2015-04-22 17:46 ` Zerocopy VM-to-VM networking using virtio-net Cornelia Huck
     [not found] ` <20150422194603.1e650ec7.cornelia.huck@de.ibm.com>
2015-04-22 18:00   ` Stefan Hajnoczi
2015-04-23 16:54     ` Cornelia Huck
2015-04-24  8:12 ` [virtio-dev] " Luke Gorrie
2015-04-24  8:20   ` Paolo Bonzini
2015-04-24  9:47   ` Stefan Hajnoczi
2015-04-24  9:50     ` Stefan Hajnoczi
2015-04-24 12:17     ` Luke Gorrie
2015-04-24 13:10       ` Luke Gorrie
2015-04-24 13:23         ` Stefan Hajnoczi
2015-04-24 13:22       ` Stefan Hajnoczi
2015-04-26 13:24         ` Luke Gorrie
2015-04-27 10:17           ` Stefan Hajnoczi
2015-04-27 10:36             ` Michael S. Tsirkin
2015-04-27 12:35             ` Jan Kiszka
2015-04-27 12:55               ` Jan Kiszka
2015-04-27 13:01                 ` Stefan Hajnoczi
2015-04-27 13:08                   ` Muli Ben-Yehuda
2015-04-27 14:30                   ` Jan Kiszka
2015-04-27 14:36                     ` Luke Gorrie
2015-04-27 14:38                       ` Jan Kiszka
2015-04-27 14:40                     ` Michael S. Tsirkin
2015-04-27 12:57               ` Stefan Hajnoczi
2015-04-27 13:17               ` Michael S. Tsirkin
2015-04-24 12:34     ` Luke Gorrie
