From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1My5kz-0006oU-1n
	for qemu-devel@nongnu.org; Wed, 14 Oct 2009 11:27:17 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1My5kt-0006h6-BL
	for qemu-devel@nongnu.org; Wed, 14 Oct 2009 11:27:15 -0400
Received: from [199.232.76.173] (port=43427 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1My5kt-0006gj-5x
	for qemu-devel@nongnu.org; Wed, 14 Oct 2009 11:27:11 -0400
Received: from mx1.redhat.com ([209.132.183.28]:58370)
	by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from <mst@redhat.com>) id 1My5ks-0004DT-Hk
	for qemu-devel@nongnu.org; Wed, 14 Oct 2009 11:27:11 -0400
Date: Wed, 14 Oct 2009 17:24:52 +0200
From: "Michael S. Tsirkin" <mst@redhat.com>
Message-ID: <20091014152451.GA30179@redhat.com>
References: <20091014143415.GA29937@redhat.com>
	<4AD5E449.9070301@codemonkey.ws>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4AD5E449.9070301@codemonkey.ws>
Subject: [Qemu-devel] Re: [PATCH-updated] qemu/net: add raw backend
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Or Gerlitz <ogerlitz@voltaire.com>, Arnd Bergmann <arndbergmann@googlemail.com>, qemu-devel@nongnu.org

On Wed, Oct 14, 2009 at 09:46:33AM -0500, Anthony Liguori wrote:
> Or Gerlitz wrote:
>> Add raw network backend option which uses a packet socket to provide
>> raw networking access. Once the socket is opened it's bound to a
>> provided host interface, such that packets received on the interface
>> are delivered to the VM and packets sent by the VM are sent to the
>> interface.
>>
>> This is functionally similar to the existing pcap network
>> backend, with the same advantages and problems.
>> Differences from pcap:
>> - can get an open socket from the monitor,
>>   which allows running without NET_ADMIN priviledges
>> - support iovec sends with writev, saving one data copy
>> - one less dependency on an external library
>> - we have access to the underlying file descriptor
>>   which makes it possible to connect to vhost net
>> - don't support polling all interfaces, always bind to a specific one
>>   
>
> Networking is probably the area in qemu that users most frequently  
> stumble with.  The most common problems are:
>
> 1) slirp does not behave how they think it should (icmp doesn't work,  
> guest isn't accessable from host)
> 2) it's difficult to figure out which backend behaves the way they want  
> (socket vs. vde vs. tap)
> 3) when they figure out they need tap, tap is difficult to setup 

tap does not need any setup. problem is, bridge needs setup.

> The problem with introducing a raw backend (or a pcap backend)
> is that  
> it makes #2 even worse because now a user has to figure out whether they  
> need raw/pcap or tap.  But most troubling, it introduces another issue:
>
> 4) raw does not behave how they think it should because guest<->host  
> networking does not work bidirectionally

OTOH icmp works fine, so we are not worse off than with slirp.

> So unless there's an extremely compelling reason to have this,

I work with remote machines all the time, having to fiddle with host
bridge/network setup means I always risk losing the only way to admin
the machine.  So it's slirp or raw for me.  If I'm the only one like
this, I can keep maintaining this patch, but I doubt it.

I consider this a compelling reason.

> I'd rather not introduce this complexity.

Does another option really add that much complexity?
We add options all the time ...

>  NB, I see this as a problem with  
> vhost_net too if #4 is also true in that context.

It's up to user whether to connect vhost net to tap or socket.
I haven't posted userspace code to connect vhost to tap yet
but I will RSN.

>> Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> ---
>>  hw/virtio-net.c |    3 +-
>>  net.c           |  192 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  qemu-options.hx |    4 +
>>  3 files changed, 198 insertions(+), 1 deletions(-)
>>
>> diff --git a/hw/virtio-net.c b/hw/virtio-net.c
>> index 1ac05a2..95d9f93 100644
>> --- a/hw/virtio-net.c
>> +++ b/hw/virtio-net.c
>> @@ -545,7 +545,8 @@ static ssize_t virtio_net_receive2(VLANClientState *vc, const uint8_t *buf, size
>>              virtqueue_pop(n->rx_vq, &elem) == 0) {
>>              if (i == 0)
>>                  return -1;
>> -            fprintf(stderr, "virtio-net truncating packet\n");
>> +            fprintf(stderr, "virtio-net truncating packet. offset %zd size %zd\n",
>> +		    offset, size);
>>              exit(1);
>>          }
>>   
>
> This doesn't belong here.
>> diff --git a/net.c b/net.c
>> index d93eaef..1e0e874 100644
>> --- a/net.c
>> +++ b/net.c
>> @@ -93,6 +93,9 @@
>>  #endif
>>  #endif
>>  +#include <netpacket/packet.h>
>> +#include <net/ethernet.h>
>> +
>>   
>
> This is certainly missing guards.
>
>>  #if defined(__OpenBSD__)
>>  #include <util.h>
>>  #endif
>> @@ -1860,6 +1863,158 @@ static TAPState *net_tap_init(VLANState *vlan, const char *model,
>>   #endif /* !_WIN32 */
>>  +typedef struct RAWState {
>> +    VLANClientState *vc;
>> +    int fd;
>> +    uint8_t buf[4096];
>> +    int promisc;
>> +} RAWState;
>> +
>> +static int net_raw_fd_init(Monitor *mon, const char *ifname, int promisc)
>> +{
>> +	int fd, ret;
>> +	struct ifreq req;
>> +	struct sockaddr_ll lladdr;
>> +
>> +	fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
>> +	if (fd < 0)
>> +		fprintf(stderr, "packet socket failed\n");
>>   
>
> CodingStyle
>
> Also, this error checking should use the monitor error reporting  
> framework.  And falling through with fd=-1 certainly means we'll SEGV or  
> worse further down.
>
>> +	memset(&req, 0, sizeof(req));
>> +	strncpy(req.ifr_name, ifname, IFNAMSIZ-1);
>>   
>
> Would be better to just use snprintf
>
>> +	ret = ioctl(fd, SIOCGIFINDEX, &req);
>> +	if (ret < 0)
>> +		fprintf(stderr, "SIOCGIFINDEX failed\n");
>> +
>> +	memset(&lladdr, 0, sizeof(lladdr));
>> +	lladdr.sll_family   = AF_PACKET;
>> +	lladdr.sll_protocol = htons(ETH_P_ALL);
>> +	lladdr.sll_ifindex  = req.ifr_ifindex;
>> +	ret = bind(fd, (const struct sockaddr *)&lladdr, sizeof(lladdr));
>> +	if (ret < 0)
>> +		fprintf(stderr, "bind failed\n");
>>
>>   
>
> Error checking is bad here.
>
>> +	/* set iface to promiscuous mode (packets sent to the VM MAC) */
>> +	if (promisc) {
>> +		ret = ioctl(fd, SIOCGIFFLAGS, &req);
>> +		if (ret < 0)
>> +			perror("SIOCGIFFLAGS failed\n");
>> +		req.ifr_flags |= IFF_PROMISC;
>> +		ret = ioctl(fd, SIOCSIFFLAGS, &req);
>> +		if (ret < 0)
>> +			fprintf(stderr, "SIOCSIFFLAGS to promiscous failed\n");
>> +	}
>>   
>
> I suspect these ioctls are Linux specific.
>
>> +	ret = fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
>> +	if (ret < 0)
>> +		fprintf(stderr, "O_NONBLOCK set failed\n");
>> +
>> +	return fd;
>> +}
>> +
>> +static void raw_cleanup(VLANClientState *vc)
>> +{
>> +	struct ifreq req;
>> +	RAWState *s = vc->opaque;
>> +
>> +	qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
>> +	if (s->promisc) {
>> +		ioctl(s->fd, SIOCGIFFLAGS, &req);
>> +		req.ifr_flags &= ~IFF_PROMISC;
>> +		ioctl(s->fd, SIOCSIFFLAGS, &req);
>> +	}
>> +	close(s->fd);
>> +	qemu_free(s);
>> +}
>> +
>> +static void raw_send(void *opaque);
>> +
>> +static int raw_can_send(void *opaque)
>> +{
>> +	RAWState *s = opaque;
>> +
>> +	return qemu_can_send_packet(s->vc);
>> +}
>> +
>> +static void raw_send_completed(VLANClientState *vc, ssize_t len)
>> +{
>> +	RAWState *s = vc->opaque;
>> +
>> +	qemu_set_fd_handler2(s->fd, raw_can_send, raw_send, NULL, s);
>> +}
>> +
>> +static void raw_send(void *opaque)
>> +{
>> +	RAWState *s = opaque;
>> +	int size;
>> +
>> +	do {
>> +		size = recv(s->fd, s->buf, sizeof(s->buf), MSG_TRUNC);
>> +		if (size <= 0)
>> +			break;
>>   
>
> Need to handle EINTR.
>
>> +		size = qemu_send_packet_async(s->vc, s->buf, size,
>> +						raw_send_completed);
>> +		if (size == 0)
>> +			qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
>> +
>> +	} while (size > 0);
>> +}
>> +
>> +static ssize_t raw_receive_iov(VLANClientState *vc, const struct iovec *iov,
>> +				int iovcnt)
>> +{
>> +	ssize_t len;
>> +	RAWState *s = vc->opaque;
>> +
>> +	do {
>> +		len = writev(s->fd, iov, iovcnt);
>> +	} while (len == -1 && (errno == EINTR || errno == EAGAIN));
>>   
>
> Spinning on EAGAIN is certainly wrong.
>
>> +static int net_raw_init(Monitor *mon, VLANState *vlan, const char *model,
>> +			const char *name, const char *ifname,
>> +			int promisc, int fd)
>> +{
>> +	RAWState *s;
>> +
>> +	s = qemu_mallocz(sizeof(RAWState));
>> +
>> +	if (fd == -1) {
>> +		s->fd = net_raw_fd_init(mon, ifname, promisc);
>> +		s->promisc = promisc;
>> +	} else
>> +		s->fd = fd;
>> +
>> +	fcntl(s->fd, F_SETFL, O_NONBLOCK);
>>   
>
> For net_raw_fd_int, you've already set O_NONBLOCK but you're also  
> removing any other flags that have been set which is probably wrong for  
> a passed in fd.
>
>> +	s->vc = qemu_new_vlan_client(vlan, model, name, NULL, raw_receive,
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index bde3e3f..0d5440f 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -825,6 +825,10 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
>>      "                default of 'sndbuf=1048576' can be disabled using 'sndbuf=0'\n"
>>  #endif
>>  #endif
>> +    "-net raw[,vlan=n][,name=str],ifname=name[,promisc=m]\n"
>> +    "                bound the host network interface to VLAN 'n' in a raw manner:\n"
>> +    "                packets received on the interface are delivered to the vlan and\n"
>> +    "                packets delivered on the vlan are sent to the interface\n"
>>      "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"
>>      "                connect the vlan 'n' to another VLAN using a socket connection\n"
>>      "-net socket[,vlan=n][,name=str][,fd=h][,mcast=maddr:port]\n"
>>   
>
> Needs documentation.
>
> Regards,
>
> Anthony Liguori