* Guest bridge setup variations
@ 2009-12-08 16:07 Arnd Bergmann
  2009-12-09 19:36 ` [Qemu-devel] " Anthony Liguori
  2009-12-10 12:26 ` Fischer, Anna
  0 siblings, 2 replies; 13+ messages in thread
From: Arnd Bergmann @ 2009-12-08 16:07 UTC (permalink / raw)
  To: virtualization; +Cc: qemu-devel

As promised, here is my small writeup on which setups I feel
are important in the long run for server-type guests. This
does not cover -net user, which is really for desktop kinds
of applications where you do not want to connect into the
guest from another IP address.

I can see four separate setups that we may or may not want to
support, the main difference being how the forwarding between
guests happens:

1. The current setup, with a bridge and tun/tap devices on ports
of the bridge. This is what Gerhard's work on access controls is
focused on and the only option where the hypervisor actually
is in full control of the traffic between guests. CPU utilization should
be highest this way, and network management can be a burden,
because the controls are done through a Linux, libvirt and/or Director
specific interface.

2. Using macvlan as a bridging mechanism, replacing the bridge
and tun/tap entirely. This should offer the best performance on
inter-guest communication, both in terms of throughput and
CPU utilization, but offer no access control for this traffic at all.
Performance of guest-external traffic should be slightly better
than bridge/tap.

3. Doing the bridging in the NIC using macvlan in passthrough
mode. This lowers the CPU utilization further compared to 2,
at the expense of limiting throughput by the performance of
the PCIe interconnect to the adapter. Whether or not this
is a win is workload dependent. Access controls now happen
in the NIC. Currently, this is not supported yet, due to lack of
device drivers, but it will be an important scenario in the future
according to some people.

4. Using macvlan for actual VEPA on the outbound interface.
This is mostly interesting because it makes the network access
controls visible in an external switch that is already managed.
CPU utilization and guest-external throughput should be
identical to 3, but inter-guest latency can only be worse because
all frames go through the external switch.

In cases 2 through 4, we have the choice between macvtap and
the raw packet interface for connecting macvlan to qemu.
Raw sockets are better tested right now, while macvtap has
better permission management (i.e. it does not require
CAP_NET_ADMIN). Neither one is upstream though at the
moment. The raw driver only requires qemu patches, while
macvtap requires both a new kernel driver and a trivial change
in qemu.
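
For illustration, a minimal sketch of the raw packet variant: open a
PF_PACKET socket and bind it to one macvlan device so that the resulting
fd can be handed to qemu. This is not the actual qemu raw backend; the
interface name "macvlan0" is an assumed example, and creating such a
socket requires CAP_NET_RAW.

  /* Sketch only: raw packet socket bound to a macvlan interface. */
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <linux/if_packet.h>
  #include <linux/if_ether.h>
  #include <net/if.h>
  #include <arpa/inet.h>

  int open_raw_fd(const char *ifname)
  {
      int fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
      struct sockaddr_ll sll;

      if (fd < 0) {
          perror("socket");
          return -1;
      }
      memset(&sll, 0, sizeof(sll));
      sll.sll_family = AF_PACKET;
      sll.sll_protocol = htons(ETH_P_ALL);
      sll.sll_ifindex = if_nametoindex(ifname);

      /* Bind to the macvlan device so the fd only carries that
       * interface's frames. */
      if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
          perror("bind");
          close(fd);
          return -1;
      }
      return fd;
  }

  int main(void)
  {
      int fd = open_raw_fd("macvlan0");

      if (fd < 0)
          return 1;
      printf("raw packet fd: %d\n", fd);
      return 0;
  }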

In all four cases, vhost-net could be used to move the workload
from user space into the kernel, which may be an advantage.
The decision for or against vhost-net is entirely independent of
the other decisions.
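
For completeness, a hedged sketch of what claiming a vhost-net device
involves, assuming the /dev/vhost-net character device and the ioctls
from <linux/vhost.h> as they were later merged (vhost-net was not
upstream when this was written):

  /* Sketch only: obtain a vhost-net fd, claim it, query features. */
  #include <stdio.h>
  #include <stdint.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/vhost.h>

  int main(void)
  {
      uint64_t features;
      int vhost_fd = open("/dev/vhost-net", O_RDWR);

      if (vhost_fd < 0) {
          perror("open /dev/vhost-net");
          return 1;
      }
      /* Associate the fd with this process before any other setup. */
      if (ioctl(vhost_fd, VHOST_SET_OWNER) < 0) {
          perror("VHOST_SET_OWNER");
          return 1;
      }
      if (ioctl(vhost_fd, VHOST_GET_FEATURES, &features) < 0) {
          perror("VHOST_GET_FEATURES");
          return 1;
      }
      printf("vhost-net features: 0x%llx\n", (unsigned long long)features);

      /* A tap, macvtap or raw fd would then be attached per virtqueue
       * with VHOST_NET_SET_BACKEND. */
      close(vhost_fd);
      return 0;
  }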

	Arnd


* Re: [Qemu-devel] Guest bridge setup variations
  2009-12-08 16:07 Guest bridge setup variations Arnd Bergmann
@ 2009-12-09 19:36 ` Anthony Liguori
  2009-12-10  9:19   ` Arnd Bergmann
  2009-12-10 12:26 ` Fischer, Anna
  1 sibling, 1 reply; 13+ messages in thread
From: Anthony Liguori @ 2009-12-09 19:36 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: qemu-devel, virtualization

Arnd Bergmann wrote:
> As promised, here is my small writeup on which setups I feel
> are important in the long run for server-type guests. This
> does not cover -net user, which is really for desktop kinds
> of applications where you do not want to connect into the
> guest from another IP address.
>
> I can see four separate setups that we may or may not want to
> support, the main difference being how the forwarding between
> guests happens:
>
> 1. The current setup, with a bridge and tun/tap devices on ports
> of the bridge. This is what Gerhard's work on access controls is
> focused on and the only option where the hypervisor actually
> is in full control of the traffic between guests. CPU utilization should
> be highest this way, and network management can be a burden,
> because the controls are done through a Linux, libvirt and/or Director
> specific interface.
>   

Typical bridging.

> 2. Using macvlan as a bridging mechanism, replacing the bridge
> and tun/tap entirely. This should offer the best performance on
> inter-guest communication, both in terms of throughput and
> CPU utilization, but offer no access control for this traffic at all.
> Performance of guest-external traffic should be slightly better
> than bridge/tap.
>   

Optimization to typical bridge (no traffic control).

> 3. Doing the bridging in the NIC using macvlan in passthrough
> mode. This lowers the CPU utilization further compared to 2,
> at the expense of limiting throughput by the performance of
> the PCIe interconnect to the adapter. Whether or not this
> is a win is workload dependent. Access controls now happen
> in the NIC. Currently, this is not supported yet, due to lack of
> device drivers, but it will be an important scenario in the future
> according to some people.
>   

Optimization to typical bridge (hardware accelerated).

> 4. Using macvlan for actual VEPA on the outbound interface.
> This is mostly interesting because it makes the network access
> controls visible in an external switch that is already managed.
> CPU utilization and guest-external throughput should be
> identical to 3, but inter-guest latency can only be worse because
> all frames go through the external switch.
>   

VEPA.

While we go over all of these things, one thing is becoming clear to me.  
We need to get qemu out of the network configuration business.  There's 
too much going on here.

What I'd like to see is the following interfaces supported:

1) given an fd, make socket calls to send packets.  Could be used with a 
raw socket, a multicast or tcp socket.
2) given an fd, use tap-style read/write calls to send packets*
3) given an fd, treat it as a vhost-style interface

* need to make all tun ioctls optional based on passed in flags

Every backend we have today could be implemented in terms of one of the 
above three.  They really come down to how the fd is created and set up.
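
To make case 2 concrete, here is a minimal sketch (illustrative, not qemu
source) of creating a tap-style fd; the device name "tap0" is only an
example. If a helper performs the TUNSETIFF itself, qemu can consume the
fd with plain read()/write(), which is one way to make the tun ioctls
optional as noted above.

  /* Sketch only: open /dev/net/tun and turn the fd into a tap device. */
  #include <string.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <net/if.h>
  #include <linux/if_tun.h>

  int open_tap_fd(const char *name)
  {
      struct ifreq ifr;
      int fd = open("/dev/net/tun", O_RDWR);

      if (fd < 0)
          return -1;
      memset(&ifr, 0, sizeof(ifr));
      /* Ethernet frames, no extra packet-info header. */
      ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
      strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

      if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
          close(fd);
          return -1;
      }
      /* From here on, read()/write() move whole frames. */
      return fd;
  }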

I believe we should continue supporting the mechanisms we support 
today.  However, for people that invoke qemu directly from the command 
line, I believe we should provide a mechanism like the tap helper that 
can be used to call out to a separate program to create these initial 
file descriptors.  We'll have to think about how we can make this 
integrate well so that the syntax isn't clumsy.
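
The hand-off itself is standard UNIX plumbing: a helper can pass an
already-open descriptor to qemu over a UNIX domain socket using
SCM_RIGHTS. A sketch of the sending side follows; the surrounding
protocol (what qemu would actually expect from a helper) is hypothetical,
not an existing interface.

  /* Sketch only: send one open fd over a connected UNIX domain socket. */
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  int send_fd(int unix_sock, int fd_to_pass)
  {
      char dummy = 'F';
      struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
      union {
          struct cmsghdr align;
          char buf[CMSG_SPACE(sizeof(int))];
      } u;
      struct msghdr msg;
      struct cmsghdr *cmsg;

      memset(&msg, 0, sizeof(msg));
      memset(&u, 0, sizeof(u));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = u.buf;
      msg.msg_controllen = sizeof(u.buf);

      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

      /* The one-byte payload is just a carrier for the control message. */
      return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
  }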

Regards,

Anthony Liguori


* Re: [Qemu-devel] Guest bridge setup variations
  2009-12-09 19:36 ` [Qemu-devel] " Anthony Liguori
@ 2009-12-10  9:19   ` Arnd Bergmann
  2009-12-10 23:53     ` Anthony Liguori
  0 siblings, 1 reply; 13+ messages in thread
From: Arnd Bergmann @ 2009-12-10  9:19 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, virtualization

On Wednesday 09 December 2009, Anthony Liguori wrote:
> While we go over all of these things, one thing is becoming clear to me.  
> We need to get qemu out of the network configuration business.  There's 
> too much going on here.

Agreed.

> What I'd like to see is the following interfaces supported:
> 
> 1) given an fd, make socket calls to send packets.  Could be used with a 
> raw socket, a multicast or tcp socket.
> 2) given an fd, use tap-style read/write calls to send packets*

yes.

> 3) given an fd, treat a vhost-style interface

This could mean two things, not sure which one you mean. Either the
file descriptor could be the vhost file descriptor, or the socket or tap file
descriptor from above, with qemu operating on the vhost interface itself.

Either option has its advantages, but I guess we should only implement
one of the two to keep it simple.

> I believe we should continue supporting the mechanisms we support 
> today.  However, for people that invoke qemu directly from the command 
> line, I believe we should provide a mechanism like the tap helper that 
> can be used to call out to a separate program to create these initial 
> file descriptors.  We'll have to think about how we can make this 
> integrate well so that the syntax isn't clumsy.

Right. I wonder if this helper should integrate with netcf as an abstraction,
or if we should rather do something generic. It may also be a good idea
to let the helper decide which of the three options you listed to use
and pass that back to qemu unless the user overrides it. The decision
probably needs to be host specific, depending e.g. on the availability
and version of tools (brctl, iproute, maybe tunctl, ...), the respective
kernel modules (vhost, macvlan, bridge, tun, ...) and policy (VEPA, vlan,
ebtables). Ideally the approach should be generic enough to work on
other platforms (BSD, Solaris, Windows, ...).

One thing I realized the last time we discussed the helper approach is
that qemu should not need to know or care about the arguments passed
to the helper, otherwise you get all the complexity back in qemu that
you're trying to avoid. Maybe for 0.13 we can convert -net socket and
-net tap to just pass all their options to the helper and move that code
out of qemu, along with introducing the new syntax.

Another unrelated issue that I think needs to be addressed in a
network code cleanup is adding better support for multi-queue
transmit and receive. I've prepared macvtap for that by letting you
open the chardev multiple times to get one queue per guest CPU,
but that needs to be supported by qemu and virtio-net as well
to actually parallelize network operation. Ideally, two guest CPUs
should be able to transmit and receive on separate queues of the
adapter without ever having to access any shared resources.
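
A sketch of that usage pattern, assuming the proposed macvtap driver
exposes its chardev as /dev/tap<ifindex> (the driver was not upstream at
this point), with one open file per queue:

  /* Sketch only: open the macvtap chardev once per guest CPU/queue. */
  #include <stdio.h>
  #include <fcntl.h>
  #include <net/if.h>

  #define MAX_QUEUES 8

  int open_macvtap_queues(const char *ifname, int nqueues, int fds[])
  {
      char path[64];
      unsigned int ifindex = if_nametoindex(ifname);
      int i;

      if (!ifindex || nqueues > MAX_QUEUES)
          return -1;
      snprintf(path, sizeof(path), "/dev/tap%u", ifindex);

      for (i = 0; i < nqueues; i++) {
          fds[i] = open(path, O_RDWR);    /* one open == one queue */
          if (fds[i] < 0)
              return -1;
      }
      return 0;
  }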

	Arnd


* RE: Guest bridge setup variations
  2009-12-08 16:07 Guest bridge setup variations Arnd Bergmann
  2009-12-09 19:36 ` [Qemu-devel] " Anthony Liguori
@ 2009-12-10 12:26 ` Fischer, Anna
  2009-12-10 14:18   ` Arnd Bergmann
  1 sibling, 1 reply; 13+ messages in thread
From: Fischer, Anna @ 2009-12-10 12:26 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org

> Subject: Guest bridge setup variations
> 
> As promised, here is my small writeup on which setups I feel
> are important in the long run for server-type guests. This
> does not cover -net user, which is really for desktop kinds
> of applications where you do not want to connect into the
> guest from another IP address.
> 
> I can see four separate setups that we may or may not want to
> support, the main difference being how the forwarding between
> guests happens:
> 
> 1. The current setup, with a bridge and tun/tap devices on ports
> of the bridge. This is what Gerhard's work on access controls is
> focused on and the only option where the hypervisor actually
> is in full control of the traffic between guests. CPU utilization should
> be highest this way, and network management can be a burden,
> because the controls are done through a Linux, libvirt and/or Director
> specific interface.
> 
> 2. Using macvlan as a bridging mechanism, replacing the bridge
> and tun/tap entirely. This should offer the best performance on
> inter-guest communication, both in terms of throughput and
> CPU utilization, but offer no access control for this traffic at all.
> Performance of guest-external traffic should be slightly better
> than bridge/tap.
> 
> 3. Doing the bridging in the NIC using macvlan in passthrough
> mode. This lowers the CPU utilization further compared to 2,
> at the expense of limiting throughput by the performance of
> the PCIe interconnect to the adapter. Whether or not this
> is a win is workload dependent. Access controls now happen
> in the NIC. Currently, this is not supported yet, due to lack of
> device drivers, but it will be an important scenario in the future
> according to some people.

Can you differentiate this option from typical PCI pass-through mode? It is not clear to me where macvlan sits in a setup where the NIC does bridging.

Typically, in a PCI pass-through configuration, all configuration goes through the physical function device driver (and all data goes directly to the NIC). Are you suggesting to use macvlan as a common configuration layer that then configures the underlying NIC? I could see some benefit in such a model, though I am not certain I understand you correctly.

Thanks,
Anna


* Re: Guest bridge setup variations
  2009-12-10 12:26 ` Fischer, Anna
@ 2009-12-10 14:18   ` Arnd Bergmann
  2009-12-10 19:14     ` Alexander Graf
  0 siblings, 1 reply; 13+ messages in thread
From: Arnd Bergmann @ 2009-12-10 14:18 UTC (permalink / raw)
  To: Fischer, Anna
  Cc: qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org

On Thursday 10 December 2009, Fischer, Anna wrote:
> > 
> > 3. Doing the bridging in the NIC using macvlan in passthrough
> > mode. This lowers the CPU utilization further compared to 2,
> > at the expense of limiting throughput by the performance of
> > the PCIe interconnect to the adapter. Whether or not this
> > is a win is workload dependent. Access controls now happen
> > in the NIC. Currently, this is not supported yet, due to lack of
> > device drivers, but it will be an important scenario in the future
> > according to some people.
> 
> Can you differentiate this option from typical PCI pass-through mode?
> It is not clear to me where macvlan sits in a setup where the NIC does
> bridging.

In this setup (hypothetical so far, the code doesn't exist yet), we use
the configuration logic of macvlan, but not the forwarding. This also
doesn't do PCI pass-through but instead gives all the logical interfaces
to the host, using only the bridging and traffic separation capabilities
of the NIC, but not the PCI-separation.

Intel calls this mode VMDq, as opposed to SR-IOV, which implies
the assignment of the adapter to a guest.

It was confusing of me to call it passthrough above, sorry for that.

> Typically, in a PCI pass-through configuration, all configuration goes
> through the physical function device driver (and all data goes directly
> to the NIC). Are you suggesting to use macvlan as a common
> configuration layer that then configures the underlying NIC?
> I could see some benefit in such a model, though I am not certain I
> understand you correctly.

This is something I also have been thinking about, but it is not what
I was referring to above. I think it would be good to keep the three
cases (macvlan, VMDq, SR-IOV) as similar as possible from the user
perspective, so using macvlan as an infrastructure for all of them
sounds reasonable to me.

The difference between VMDq and SR-IOV in that case would be
that the former uses a virtio-net driver in the guest and a hardware
driver in the host, while the latter uses a hardware driver in the guest
only. The data flow on these two would be identical though, while
in the classic macvlan the data forwarding decisions are made in
the host kernel.

	Arnd


* Re: Guest bridge setup variations
  2009-12-10 14:18   ` Arnd Bergmann
@ 2009-12-10 19:14     ` Alexander Graf
  2009-12-10 20:20       ` [Qemu-devel] " Arnd Bergmann
  0 siblings, 1 reply; 13+ messages in thread
From: Alexander Graf @ 2009-12-10 19:14 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Fischer, Anna, qemu-devel@nongnu.org,
	virtualization@lists.linux-foundation.org


On 10.12.2009, at 15:18, Arnd Bergmann wrote:

> On Thursday 10 December 2009, Fischer, Anna wrote:
>>> 
>>> 3. Doing the bridging in the NIC using macvlan in passthrough
>>> mode. This lowers the CPU utilization further compared to 2,
>>> at the expense of limiting throughput by the performance of
>>> the PCIe interconnect to the adapter. Whether or not this
>>> is a win is workload dependent. Access controls now happen
>>> in the NIC. Currently, this is not supported yet, due to lack of
>>> device drivers, but it will be an important scenario in the future
>>> according to some people.
>> 
>> Can you differentiate this option from typical PCI pass-through mode?
>> It is not clear to me where macvlan sits in a setup where the NIC does
>> bridging.
> 
> In this setup (hypothetical so far, the code doesn't exist yet), we use
> the configuration logic of macvlan, but not the forwarding. This also
> doesn't do PCI pass-through but instead gives all the logical interfaces
> to the host, using only the bridging and traffic separation capabilities
> of the NIC, but not the PCI-separation.
> 
> Intel calls this mode VMDq, as opposed to SR-IOV, which implies
> the assignment of the adapter to a guest.
> 
> It was confusing of me to call it passthrough above, sorry for that.
> 
>> Typically, in a PCI pass-through configuration, all configuration goes
>> through the physical function device driver (and all data goes directly
>> to the NIC). Are you suggesting to use macvlan as a common
>> configuration layer that then configures the underlying NIC?
>> I could see some benefit in such a model, though I am not certain I
>> understand you correctly.
> 
> This is something I also have been thinking about, but it is not what
> I was referring to above. I think it would be good to keep the three
> cases (macvlan, VMDq, SR-IOV) as similar as possible from the user
> perspective, so using macvlan as an infrastructure for all of them
> sounds reasonable to me.

Oh, so you'd basically do -net vt-d,if=eth0 and the rest would automatically work? That's a pretty slick idea!

Alex


* Re: [Qemu-devel] Re: Guest bridge setup variations
  2009-12-10 19:14     ` Alexander Graf
@ 2009-12-10 20:20       ` Arnd Bergmann
  2009-12-10 20:37         ` Alexander Graf
  0 siblings, 1 reply; 13+ messages in thread
From: Arnd Bergmann @ 2009-12-10 20:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Fischer, Anna, virtualization@lists.linux-foundation.org

On Thursday 10 December 2009 19:14:28 Alexander Graf wrote:
> > This is something I also have been thinking about, but it is not what
> > I was referring to above. I think it would be good to keep the three
> > cases (macvlan, VMDq, SR-IOV) as similar as possible from the user
> > perspective, so using macvlan as an infrastructure for all of them
> > sounds reasonable to me.
> 
> Oh, so you'd basically do -net vt-d,if=eth0 and the rest would
> automatically work? That's a pretty slick idea!

I was only referring to how they get set up under the covers, e.g.
creating the virtual device, configuring the MAC address etc, not
the qemu side, but that would probably make sense as well.

Or even better, qemu should probably not even know the difference
between macvlan and VT-d. In both cases, it would open a macvtap
file, but for VT-d adapters, the macvlan infrastructure can
use hardware support, much in the way that VLAN tagging gets
offloaded automatically to the hardware.

	Arnd <><


* Re: [Qemu-devel] Re: Guest bridge setup variations
  2009-12-10 20:20       ` [Qemu-devel] " Arnd Bergmann
@ 2009-12-10 20:37         ` Alexander Graf
  0 siblings, 0 replies; 13+ messages in thread
From: Alexander Graf @ 2009-12-10 20:37 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Fischer, Anna, qemu-devel,
	virtualization@lists.linux-foundation.org


On 10.12.2009, at 21:20, Arnd Bergmann wrote:

> On Thursday 10 December 2009 19:14:28 Alexander Graf wrote:
>>> This is something I also have been thinking about, but it is not what
>>> I was referring to above. I think it would be good to keep the three
>>> cases (macvlan, VMDq, SR-IOV) as similar as possible from the user
>>> perspective, so using macvlan as an infrastructure for all of them
>>> sounds reasonable to me.
>> 
>> Oh, so you'd basically do -net vt-d,if=eth0 and the rest would
>> automatically work? That's a pretty slick idea!
> 
> I was only referring to how they get set up under the covers, e.g.
> creating the virtual device, configuring the MAC address etc, not
> the qemu side, but that would probably make sense as well.
> 
> Or even better, qemu should probably not even know the difference
> between macvlan and VT-d. In both cases, it would open a macvtap
> file, but for VT-d adapters, the macvlan infrastructure can
> use hardware support, much in the way that VLAN tagging gets
> offloaded automatically to the hardware.

Well, VT-d means we use PCI passthrough. But it probably makes sense to have a -net bridge,if=eth0 that automatically uses whatever is around (PCI passthrough, macvtap, Anthony's bridge script, etc.). Of course we should leverage VMDq for macvtap whenever available :-).

Alex


* Re: [Qemu-devel] Guest bridge setup variations
  2009-12-10  9:19   ` Arnd Bergmann
@ 2009-12-10 23:53     ` Anthony Liguori
  2009-12-14 12:09       ` Arnd Bergmann
  0 siblings, 1 reply; 13+ messages in thread
From: Anthony Liguori @ 2009-12-10 23:53 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: qemu-devel, virtualization

Arnd Bergmann wrote:
>> 3) given an fd, treat it as a vhost-style interface
>>     
>
> This could mean two things, not sure which one you mean. Either the
> file descriptor could be the vhost file descriptor, or the socket or tap file
> descriptor from above, with qemu operating on the vhost interface itself.
>
> Either option has its advantages, but I guess we should only implement
> one of the two to keep it simple.
>   

I was thinking the socket/tap descriptor.

>> I believe we should continue supporting the mechanisms we support 
>> today.  However, for people that invoke qemu directly from the command 
>> line, I believe we should provide a mechanism like the tap helper that 
>> can be used to call out to a separate program to create these initial 
>> file descriptors.  We'll have to think about how we can make this 
>> integrate well so that the syntax isn't clumsy.
>>     
>
> Right. I wonder if this helper should integrate with netcf as an abstraction,
> or if we should rather do something generic. It may also be a good idea
> to let the helper decide which of the three options you listed to use
> and pass that back to qemu unless the user overrides it. The decision
> probably needs to be host specific, depending e.g. on the availability
> and version of tools (brctl, iproute, maybe tunctl, ...), the respective
> kernel modules (vhost, macvlan, bridge, tun, ...) and policy (VEPA, vlan,
> ebtables). Ideally the approach should be generic enough to work on
> other platforms (BSD, Solaris, Windows, ...).
>   

For helpers, I think I'd like to stick with what we currently support, 
and then provide a robust way for independent projects to implement their 
own helpers.  For instance, I would love it if 
netcf had a qemu network "plugin" helper.

There's just too much in the networking space all wrapped up in layers 
of policy decisions.  I think it's time to move it out of qemu.

> One thing I realized the last time we discussed the helper approach is
> that qemu should not need to know or care about the arguments passed
> to the helper, otherwise you get all the complexity back in qemu that
> you're trying to avoid. Maybe for 0.13 we can convert -net socket and
> -net tap to just pass all their options to the helper and move that code
> out of qemu, along with introducing the new syntax.
>   

Yes, I was thinking the same thing.  New syntax may need exploring.

> Another unrelated issue that I think needs to be addressed in a
> network code cleanup is adding better support for multi-queue
> transmit and receive. I've prepared macvtap for that by letting you
> open the chardev multiple times to get one queue per guest CPU,
> but that needs to be supported by qemu and virtio-net as well
> to actually parallelize network operation. Ideally, two guest CPUs
> should be able to transmit and receive on separate queues of the
> adapter without ever having to access any shared resources.
>   

Multiqueue adds another dimension, but I think your approach is pretty 
much right on the money.  Have one fd per queue, and we would support a 
mechanism for receiving multiple fds from helpers as appropriate.
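
For illustration, the receiving side could collect all queue fds from a
single SCM_RIGHTS message; the framing below (one message with a one-byte
payload) is a made-up convention, not an agreed qemu/helper interface.

  /* Sketch only: receive up to max_fds descriptors in one message. */
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  int recv_fds(int unix_sock, int fds[], int max_fds)
  {
      char dummy;
      struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
      union {
          struct cmsghdr align;
          char buf[CMSG_SPACE(16 * sizeof(int))];
      } u;
      struct msghdr msg;
      struct cmsghdr *cmsg;
      int nfds;

      memset(&msg, 0, sizeof(msg));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = u.buf;
      msg.msg_controllen = sizeof(u.buf);

      if (recvmsg(unix_sock, &msg, 0) <= 0)
          return -1;
      cmsg = CMSG_FIRSTHDR(&msg);
      if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
          cmsg->cmsg_type != SCM_RIGHTS)
          return -1;

      nfds = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
      if (nfds > max_fds)
          nfds = max_fds;
      memcpy(fds, CMSG_DATA(cmsg), nfds * sizeof(int));
      return nfds;
  }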

Regards,

Anthony Liguori


* Re: [Qemu-devel] Guest bridge setup variations
  2009-12-10 23:53     ` Anthony Liguori
@ 2009-12-14 12:09       ` Arnd Bergmann
  0 siblings, 0 replies; 13+ messages in thread
From: Arnd Bergmann @ 2009-12-14 12:09 UTC (permalink / raw)
  To: virtualization; +Cc: qemu-devel, Anthony Liguori

On Friday 11 December 2009, Anthony Liguori wrote:
> Arnd Bergmann wrote:
> >> 3) given an fd, treat it as a vhost-style interface
> >
> > This could mean two things, not sure which one you mean. Either the
> > file descriptor could be the vhost file descriptor, or the socket or tap file
> > descriptor from above, with qemu operating on the vhost interface itself.
> >
> > Either option has its advantages, but I guess we should only implement
> > one of the two to keep it simple.
> >   
> 
> I was thinking the socket/tap descriptor.

ok.

> > Right. I wonder if this helper should integrate with netcf as an abstraction,
> > or if we should rather do something generic. It may also be a good idea
> > to let the helper decide which of the three options you listed to use
> > and pass that back to qemu unless the user overrides it. The decision
> > probably needs to be host specific, depending e.g. on the availability
> > and version of tools (brctl, iproute, maybe tunctl, ...), the respective
> > kernel modules (vhost, macvlan, bridge, tun, ...) and policy (VEPA, vlan,
> > ebtables). Ideally the approach should be generic enough to work on
> > other platforms (BSD, Solaris, Windows, ...).
> 
> For helpers, I think I'd like to stick with what we currently support, 
> and then allow for a robust way for there to be independent projects 
> that implement their own helpers.   For instance, I would love it if 
> netcf had a qemu network "plugin" helper.

Moving netcf-specific qemu helpers into the netcf project sounds good.
I'm not sure what you mean by 'stick to what we currently support' -
do you mean helpers that ship with qemu itself? That sounds
reasonable, though I'd obviously like to make sure they also work
with macvtap, which is currently not supported unless you pass an
open file descriptor into -net tap,fd.

> There's just too much in the networking space all wrapped up in layers 
> of policy decisions.  I think it's time to move it out of qemu.

Yes.

	Arnd


* RE: Guest bridge setup variations
       [not found] <78C9135A3D2ECE4B8162EBDCE82CAD7705FDEDCF@nekter>
@ 2009-12-16  0:55 ` Leonid Grossman
  2009-12-16 14:15   ` Arnd Bergmann
  0 siblings, 1 reply; 13+ messages in thread
From: Leonid Grossman @ 2009-12-16  0:55 UTC (permalink / raw)
  To: virtualization; +Cc: qemu-devel

> > -----Original Message-----
> > From: virtualization-bounces@lists.linux-foundation.org
> > [mailto:virtualization-bounces@lists.linux-foundation.org] On Behalf Of
> > Arnd Bergmann
> > Sent: Tuesday, December 08, 2009 8:08 AM
> > To: virtualization@lists.linux-foundation.org
> > Cc: qemu-devel@nongnu.org
> > Subject: Guest bridge setup variations
> >
> > As promised, here is my small writeup on which setups I feel
> > are important in the long run for server-type guests. This
> > does not cover -net user, which is really for desktop kinds
> > of applications where you do not want to connect into the
> > guest from another IP address.
> >
> > I can see four separate setups that we may or may not want to
> > support, the main difference being how the forwarding between
> > guests happens:
> >
> > 1. The current setup, with a bridge and tun/tap devices on ports
> > of the bridge. This is what Gerhard's work on access controls is
> > focused on and the only option where the hypervisor actually
> > is in full control of the traffic between guests. CPU utilization should
> > be highest this way, and network management can be a burden,
> > because the controls are done through a Linux, libvirt and/or Director
> > specific interface.
> >
> > 2. Using macvlan as a bridging mechanism, replacing the bridge
> > and tun/tap entirely. This should offer the best performance on
> > inter-guest communication, both in terms of throughput and
> > CPU utilization, but offer no access control for this traffic at all.
> > Performance of guest-external traffic should be slightly better
> > than bridge/tap.
> >
> > 3. Doing the bridging in the NIC using macvlan in passthrough
> > mode. This lowers the CPU utilization further compared to 2,
> > at the expense of limiting throughput by the performance of
> > the PCIe interconnect to the adapter. Whether or not this
> > is a win is workload dependent. 

This is certainly true today for PCIe 1.1 and 2.0 devices, but as NICs
move to PCIe 3.0 (while remaining almost exclusively dual-port 10GbE for
a long while), EVB internal bandwidth will significantly exceed external
bandwidth. So, #3 can become a win for most inter-guest workloads.

> > Access controls now happen
> > in the NIC. Currently, this is not supported yet, due to lack of
> > device drivers, but it will be an important scenario in the future
> > according to some people.

Actually, the x3100 10GbE drivers support this today via a sysfs interface
to the host driver, which can choose to control the VEB tables (and
therefore MAC addresses, VLAN memberships, etc. for all passthru interfaces
behind the VEB). Of course a more generic, vendor-independent interface
will be important in the future.

> >
> > 4. Using macvlan for actual VEPA on the outbound interface.
> > This is mostly interesting because it makes the network access
> > controls visible in an external switch that is already managed.
> > CPU utilization and guest-external throughput should be
> > identical to 3, but inter-guest latency can only be worse because
> > all frames go through the external switch.
> >
> > In cases 2 through 4, we have the choice between macvtap and
> > the raw packet interface for connecting macvlan to qemu.
> > Raw sockets are better tested right now, while macvtap has
> > better permission management (i.e. it does not require
> > CAP_NET_ADMIN). Neither one is upstream though at the
> > moment. The raw driver only requires qemu patches, while
> > macvtap requires both a new kernel driver and a trivial change
> > in qemu.
> >
> > In all four cases, vhost-net could be used to move the workload
> > from user space into the kernel, which may be an advantage.
> > The decision for or against vhost-net is entirely independent of
> > the other decisions.
> >
> > 	Arnd


* Re: Guest bridge setup variations
  2009-12-16  0:55 ` Leonid Grossman
@ 2009-12-16 14:15   ` Arnd Bergmann
  2009-12-17  6:18     ` Leonid Grossman
  0 siblings, 1 reply; 13+ messages in thread
From: Arnd Bergmann @ 2009-12-16 14:15 UTC (permalink / raw)
  To: virtualization; +Cc: Leonid Grossman, qemu-devel

On Wednesday 16 December 2009, Leonid Grossman wrote:
> > > 3. Doing the bridging in the NIC using macvlan in passthrough
> > > mode. This lowers the CPU utilization further compared to 2,
> > > at the expense of limiting throughput by the performance of
> > > the PCIe interconnect to the adapter. Whether or not this
> > > is a win is workload dependent. 
> 
> This is certainly true today for pci-e 1.1 and 2.0 devices, but 
> as NICs move to pci-e 3.0 (while remaining almost exclusively dual port
> 10GbE for a long while), 
> EVB internal bandwidth will significantly exceed external bandwidth.
> So, #3 can become a win for most inter-guest workloads.

Right, it's also hardware dependent, but it usually comes down
to whether it's cheaper to spend CPU cycles or to spend IO bandwidth.

I would be surprised if all future machines with PCIe 3.0 suddenly have
a huge surplus of bandwidth but no CPU to keep up with that.

> > > Access controls now happen
> > > in the NIC. Currently, this is not supported yet, due to lack of
> > > device drivers, but it will be an important scenario in the future
> > > according to some people.
> 
> Actually, x3100 10GbE drivers support this today via sysfs interface to
> the host driver 
> that can choose to control VEB tables (and therefore MAC addresses, vlan
> memberships, etc. for all passthru interfaces behind the VEB).

Ok, I didn't know about that.

> OF course a more generic vendor-independent interface will be important
> in the future.

Right. I hope we can come up with something soon. I'll have a look at
what your driver does and see if that can be abstracted in some way.
I expect that if we can find an interface between the kernel and device
driver for two or three NIC implementations, it will be good enough
to adapt to everyone else as well.

	Arnd 


* RE: Guest bridge setup variations
  2009-12-16 14:15   ` Arnd Bergmann
@ 2009-12-17  6:18     ` Leonid Grossman
  0 siblings, 0 replies; 13+ messages in thread
From: Leonid Grossman @ 2009-12-17  6:18 UTC (permalink / raw)
  To: Arnd Bergmann, virtualization; +Cc: qemu-devel



> -----Original Message-----
> From: Arnd Bergmann [mailto:arnd@arndb.de]
> Sent: Wednesday, December 16, 2009 6:16 AM
> To: virtualization@lists.linux-foundation.org
> Cc: Leonid Grossman; qemu-devel@nongnu.org
> Subject: Re: Guest bridge setup variations
> 
> On Wednesday 16 December 2009, Leonid Grossman wrote:
> > > > 3. Doing the bridging in the NIC using macvlan in passthrough
> > > > mode. This lowers the CPU utilization further compared to 2,
> > > > at the expense of limiting throughput by the performance of
> > > > the PCIe interconnect to the adapter. Whether or not this
> > > > is a win is workload dependent.
> >
> > This is certainly true today for PCIe 1.1 and 2.0 devices, but as NICs
> > move to PCIe 3.0 (while remaining almost exclusively dual-port 10GbE for
> > a long while), EVB internal bandwidth will significantly exceed external
> > bandwidth. So, #3 can become a win for most inter-guest workloads.
> 
> Right, it's also hardware dependent, but it usually comes down
> to whether it's cheaper to spend CPU cycles or to spend IO bandwidth.
> 
> I would be surprised if all future machines with PCIe 3.0 suddenly have
> a huge surplus of bandwidth but no CPU to keep up with that.
> 
> > > > Access controls now happen
> > > > in the NIC. Currently, this is not supported yet, due to lack of
> > > > device drivers, but it will be an important scenario in the future
> > > > according to some people.
> >
> > Actually, the x3100 10GbE drivers support this today via a sysfs
> > interface to the host driver, which can choose to control the VEB tables
> > (and therefore MAC addresses, VLAN memberships, etc. for all passthru
> > interfaces behind the VEB).
> 
> Ok, I didn't know about that.
> 
> > Of course a more generic, vendor-independent interface will be important
> > in the future.
> 
> Right. I hope we can come up with something soon. I'll have a look at
> what your driver does and see if that can be abstracted in some way.

Sounds good. Please let us know whether looking at the code/documentation
will suffice or you need a couple of cards to go along with the code.

> I expect that if we can find an interface between the kernel and device
> driver for two or three NIC implementations, it will be good enough
> to adapt to everyone else as well.

The interface will likely evolve along with EVB standards and other
developments, but an initial implementation can be pretty basic (and
vendor-independent). Early IOV NIC deployments can benefit from an
interface that sets a couple of VF parameters missing from the "legacy"
NIC interface - things like a bandwidth limit and a list of MAC addresses
(since setting a NIC in promisc mode doesn't work well for a VEB, it is
currently forced to learn the addresses it is configured for).
The interface can also include querying IOV NIC capabilities like the
number of VFs, support for VEB and/or VEPA mode, etc., as well as getting
VF stats and MAC/VLAN tables - all in all, it is not a long list.
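
Purely as an illustration of the shape such a per-VF interface could take -
the sysfs layout and attribute names below are hypothetical, not the x3100
interface or any standard:

  /* Sketch only: hypothetical /sys/class/net/<dev>/vf<N>/<attr> knobs. */
  #include <stdio.h>

  static int write_vf_attr(const char *dev, int vf,
                           const char *attr, const char *value)
  {
      char path[256];
      FILE *f;

      snprintf(path, sizeof(path), "/sys/class/net/%s/vf%d/%s",
               dev, vf, attr);
      f = fopen(path, "w");
      if (!f)
          return -1;
      fprintf(f, "%s\n", value);
      return fclose(f);
  }

  int main(void)
  {
      /* Example policy: cap VF 2 of eth0 and whitelist one MAC address. */
      write_vf_attr("eth0", 2, "tx_rate_mbit", "2000");
      write_vf_attr("eth0", 2, "mac_list", "52:54:00:12:34:56");
      return 0;
  }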


> 
> 	Arnd

