From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 15 Jul 2009 21:38:06 +0100
From: Jamie Lokier
Subject: Re: [Qemu-devel] [PATCH] net: add raw backend
Message-ID: <20090715203806.GF3056@shareable.org>
References: <20090701162115.GA4555@shareable.org> <4A4CA747.1050509@Voltaire.com> <20090703023911.GD938@shareable.org> <4A534EC4.5030209@voltaire.com> <20090707145739.GB14392@shareable.org> <4A54B0F1.3070201@voltaire.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4A54B0F1.3070201@voltaire.com>
List-Id: qemu-devel.nongnu.org
To: Or Gerlitz
Cc: Herbert Xu, Jan Kiszka, qemu-devel@nongnu.org

Or Gerlitz wrote:
> Jamie Lokier wrote:
> > The problem is simply that what the guest sends goes out on the
> > network and is not looped back to the host network stack, and vice
> > versa.  So if your host is 192.168.1.1 and is running a DNS server
> > (say), and the guest is 192.168.1.2, when the guest sends queries to
> > 192.168.1.1 the host won't see those queries.  Same if you're running
> > an FTP server on the host and the guest wants to connect to it, etc.
> > It also means multiple guests can't see each other, for the same
> > reason.  So it's much less useful than bridging, where the guests and
> > host can all see each other and connect to each other.
>
> I wasn't sure whether your example refers to the case where networking
> uses a bridge or NAT.  If it's a bridge, then through which bridge
> interface does the packet arrive at the host stack?  Say you have a
> bridge whose attached interfaces are tap1(VM1), tap2(VM2) and
> eth0(NIC); in your example, did you mean that the host IP address is
> assigned to the bridge interface?  Or were you referring to a NAT-based
> scheme?

When using a bridge, you set the IP address on the bridge itself (for
example, br0).  DHCP runs on the bridge itself, and so does the rest of
the Linux host stack, although you can use raw sockets on the other
interfaces.  But reading and controlling the hardware is done on the
interfaces.  So if you have some program like NetworkManager which
checks whether you have a wire plugged into eth0, it has to read eth0
to get the wire status, but it has to run DHCP on br0.  Those programs
don't generally have that option, which makes bridges difficult to use
for VMs in a transparent way.

I wasn't referring to NAT, but you can use NAT with a bridge on Linux;
it's called brouting :-)

> > Unfortunately, bridging is a pain to set up, if your host has any
> > complicated or automatic network configuration already.
>
> As you said, bridging requires more configuration

A bridge is quite simple to configure.
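For the setup you describe it boils down to creating br0, attaching
eth0 and the taps to it, and putting the host's IP configuration on br0
rather than on eth0.  A rough sketch in Python, just shelling out to
brctl and ip (the interface names and the 192.168.1.1/24 address are
only assumptions for illustration, and it needs root):

    import subprocess

    def run(*cmd):
        # Thin wrapper: run one command, raise if it fails.
        subprocess.check_call(cmd)

    run("brctl", "addbr", "br0")               # create the bridge
    for port in ("eth0", "tap1", "tap2"):      # attach the NIC and the taps
        run("brctl", "addif", "br0", port)
    run("ip", "addr", "flush", "dev", "eth0")  # NIC keeps no IP configuration
    # The host's address lives on br0, not on eth0:
    run("ip", "addr", "add", "192.168.1.1/24", "dev", "br0")
    for dev in ("eth0", "tap1", "tap2", "br0"):
        run("ip", "link", "set", dev, "up")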
Unfortunately, because Linux requires all the IP configuration to be on
the bridge device, but network device control on the network device,
bridges don't work well with automatic configuration tools.  If you
could apply host IP configuration to the network device and still have
a bridge, that would be perfect.  You would just create br0, add
tap1(VM1), tap2(VM2) and eth0(NIC), and everything would just work.

> but no less important, the performance (packets per second and CPU
> utilization) one can get with bridge+tap is much lower versus what you
> get with the raw mode approach.

Have you measured it?

> All in all, it's clear that with this approach VM/VM and VM/Host
> communication would have to get switched either at the NIC (e.g.
> SR-IOV-capable NICs supporting a virtual bridge) or at an external
> switch and make a U-turn.

Unfortunately that's usually impossible.  Most switches don't do
U-turns, and a lot of simple networks don't have any switches except a
home router.

> There's a bunch of reasons why people would like to do that, among
> them a performance boost,

No, it makes performance _much_ worse if packets have to leave the
host, do a U-turn and come back on the same link.  Much better to use a
bridge inside the host.  Probably ten times faster, because the host's
internal networking is much faster than a typical gigabit link :-)

> the ability to shape, manage and monitor VM/VM traffic in external
> switches and more.

That could be useful, but I think it's probably quite unusual for
someone to want to shape traffic between a VM and its own host.  Also,
if you want to do that, you can do it inside the host.  Sometimes it
would be useful to send it outside the host and U-turn it, but not very
often; only for diagnostics, I would think.  And even that can be done
with Linux bridges, using VLANs :-)

> > It would be really nice to find a way which has the advantages of
> > both.  Either by adding a different bridging mode to Linux, where
> > host interfaces can be configured for IP and the bridge hangs off
> > the host interface, or by a modified tap interface, or by an
> > alternative pcap/packet-like interface which forwards packets in a
> > similar way to bridging.
>
> It seems that this will not yield the performance improvement we can
> get with going directly to the NIC.

If you don't need any host<->VM networking, maybe a raw packet socket
is faster.  But are you sure it's faster?  I'd want to see measurements
before I believe it.

If you need any host<->VM networking, most of the time the packet
socket isn't an option at all.  Not many switches will 'U-turn' packets
as you suggest.
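To be concrete about the raw mode approach: as I understand it, it is
basically a packet socket bound to the NIC.  A rough Python sketch of
that kind of socket (Linux AF_PACKET only; 'eth0' is just an assumed
interface name, and it needs root):

    import socket

    ETH_P_ALL = 0x0003      # all ethertypes, from linux/if_ether.h

    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW,
                      socket.htons(ETH_P_ALL))
    s.bind(("eth0", 0))     # assumed NIC name

    # Reads return raw ethernet frames seen on eth0; writes inject frames
    # straight onto the wire.  Injected frames are not looped back to the
    # host IP stack, which is why guest<->host traffic would need an
    # external switch willing to do the U-turn.
    frame = s.recv(65535)
    print("%d bytes from the wire" % len(frame))

Whether that really beats bridge+tap on packets per second is exactly
the kind of thing I'd want to see measured.

-- 
Jamie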