From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1My5qm-0000VJ-3V for qemu-devel@nongnu.org; Wed, 14 Oct 2009 11:33:16 -0400 Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1My5qh-0000UY-Em for qemu-devel@nongnu.org; Wed, 14 Oct 2009 11:33:15 -0400 Received: from [199.232.76.173] (port=55754 helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1My5qh-0000UV-89 for qemu-devel@nongnu.org; Wed, 14 Oct 2009 11:33:11 -0400 Received: from mx1.redhat.com ([209.132.183.28]:30271) by monty-python.gnu.org with esmtp (Exim 4.60) (envelope-from ) id 1My5qg-00057F-I0 for qemu-devel@nongnu.org; Wed, 14 Oct 2009 11:33:11 -0400 Date: Wed, 14 Oct 2009 17:33:00 +0200 From: Gleb Natapov Subject: Re: [Qemu-devel] Re: [PATCH-updated] qemu/net: add raw backend Message-ID: <20091014153300.GA30889@redhat.com> References: <20091014143415.GA29937@redhat.com> <4AD5E449.9070301@codemonkey.ws> <20091014152451.GA30179@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20091014152451.GA30179@redhat.com> List-Id: qemu-devel.nongnu.org List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "Michael S. Tsirkin" Cc: Or Gerlitz , Arnd Bergmann , qemu-devel@nongnu.org On Wed, Oct 14, 2009 at 05:24:52PM +0200, Michael S. Tsirkin wrote: > On Wed, Oct 14, 2009 at 09:46:33AM -0500, Anthony Liguori wrote: > > Or Gerlitz wrote: > >> Add raw network backend option which uses a packet socket to provide > >> raw networking access. Once the socket is opened it's bound to a > >> provided host interface, such that packets received on the interface > >> are delivered to the VM and packets sent by the VM are sent to the > >> interface. > >> > >> This is functionally similar to the existing pcap network > >> backend, with the same advantages and problems. > >> Differences from pcap: > >> - can get an open socket from the monitor, > >> which allows running without NET_ADMIN priviledges > >> - support iovec sends with writev, saving one data copy > >> - one less dependency on an external library > >> - we have access to the underlying file descriptor > >> which makes it possible to connect to vhost net > >> - don't support polling all interfaces, always bind to a specific one > >> > > > > Networking is probably the area in qemu that users most frequently > > stumble with. The most common problems are: > > > > 1) slirp does not behave how they think it should (icmp doesn't work, > > guest isn't accessable from host) > > 2) it's difficult to figure out which backend behaves the way they want > > (socket vs. vde vs. tap) > > 3) when they figure out they need tap, tap is difficult to setup > > tap does not need any setup. problem is, bridge needs setup. > tap need IP assigning. > > The problem with introducing a raw backend (or a pcap backend) > > is that > > it makes #2 even worse because now a user has to figure out whether they > > need raw/pcap or tap. But most troubling, it introduces another issue: > > > > 4) raw does not behave how they think it should because guest<->host > > networking does not work bidirectionally > > OTOH icmp works fine, so we are not worse off than with slirp. > > > So unless there's an extremely compelling reason to have this, > > I work with remote machines all the time, having to fiddle with host > bridge/network setup means I always risk losing the only way to admin > the machine. So it's slirp or raw for me. If I'm the only one like > this, I can keep maintaining this patch, but I doubt it. > You are not the only one, but slirp works fine for my purposes. > I consider this a compelling reason. > > > I'd rather not introduce this complexity. > > Does another option really add that much complexity? > We add options all the time ... > > > NB, I see this as a problem with > > vhost_net too if #4 is also true in that context. > > It's up to user whether to connect vhost net to tap or socket. > I haven't posted userspace code to connect vhost to tap yet > but I will RSN. > > >> Signed-off-by: Or Gerlitz > >> Signed-off-by: Michael S. Tsirkin > >> --- > >> hw/virtio-net.c | 3 +- > >> net.c | 192 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ > >> qemu-options.hx | 4 + > >> 3 files changed, 198 insertions(+), 1 deletions(-) > >> > >> diff --git a/hw/virtio-net.c b/hw/virtio-net.c > >> index 1ac05a2..95d9f93 100644 > >> --- a/hw/virtio-net.c > >> +++ b/hw/virtio-net.c > >> @@ -545,7 +545,8 @@ static ssize_t virtio_net_receive2(VLANClientState *vc, const uint8_t *buf, size > >> virtqueue_pop(n->rx_vq, &elem) == 0) { > >> if (i == 0) > >> return -1; > >> - fprintf(stderr, "virtio-net truncating packet\n"); > >> + fprintf(stderr, "virtio-net truncating packet. offset %zd size %zd\n", > >> + offset, size); > >> exit(1); > >> } > >> > > > > This doesn't belong here. > >> diff --git a/net.c b/net.c > >> index d93eaef..1e0e874 100644 > >> --- a/net.c > >> +++ b/net.c > >> @@ -93,6 +93,9 @@ > >> #endif > >> #endif > >> +#include > >> +#include > >> + > >> > > > > This is certainly missing guards. > > > >> #if defined(__OpenBSD__) > >> #include > >> #endif > >> @@ -1860,6 +1863,158 @@ static TAPState *net_tap_init(VLANState *vlan, const char *model, > >> #endif /* !_WIN32 */ > >> +typedef struct RAWState { > >> + VLANClientState *vc; > >> + int fd; > >> + uint8_t buf[4096]; > >> + int promisc; > >> +} RAWState; > >> + > >> +static int net_raw_fd_init(Monitor *mon, const char *ifname, int promisc) > >> +{ > >> + int fd, ret; > >> + struct ifreq req; > >> + struct sockaddr_ll lladdr; > >> + > >> + fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); > >> + if (fd < 0) > >> + fprintf(stderr, "packet socket failed\n"); > >> > > > > CodingStyle > > > > Also, this error checking should use the monitor error reporting > > framework. And falling through with fd=-1 certainly means we'll SEGV or > > worse further down. > > > >> + memset(&req, 0, sizeof(req)); > >> + strncpy(req.ifr_name, ifname, IFNAMSIZ-1); > >> > > > > Would be better to just use snprintf > > > >> + ret = ioctl(fd, SIOCGIFINDEX, &req); > >> + if (ret < 0) > >> + fprintf(stderr, "SIOCGIFINDEX failed\n"); > >> + > >> + memset(&lladdr, 0, sizeof(lladdr)); > >> + lladdr.sll_family = AF_PACKET; > >> + lladdr.sll_protocol = htons(ETH_P_ALL); > >> + lladdr.sll_ifindex = req.ifr_ifindex; > >> + ret = bind(fd, (const struct sockaddr *)&lladdr, sizeof(lladdr)); > >> + if (ret < 0) > >> + fprintf(stderr, "bind failed\n"); > >> > >> > > > > Error checking is bad here. > > > >> + /* set iface to promiscuous mode (packets sent to the VM MAC) */ > >> + if (promisc) { > >> + ret = ioctl(fd, SIOCGIFFLAGS, &req); > >> + if (ret < 0) > >> + perror("SIOCGIFFLAGS failed\n"); > >> + req.ifr_flags |= IFF_PROMISC; > >> + ret = ioctl(fd, SIOCSIFFLAGS, &req); > >> + if (ret < 0) > >> + fprintf(stderr, "SIOCSIFFLAGS to promiscous failed\n"); > >> + } > >> > > > > I suspect these ioctls are Linux specific. > > > >> + ret = fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK); > >> + if (ret < 0) > >> + fprintf(stderr, "O_NONBLOCK set failed\n"); > >> + > >> + return fd; > >> +} > >> + > >> +static void raw_cleanup(VLANClientState *vc) > >> +{ > >> + struct ifreq req; > >> + RAWState *s = vc->opaque; > >> + > >> + qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL); > >> + if (s->promisc) { > >> + ioctl(s->fd, SIOCGIFFLAGS, &req); > >> + req.ifr_flags &= ~IFF_PROMISC; > >> + ioctl(s->fd, SIOCSIFFLAGS, &req); > >> + } > >> + close(s->fd); > >> + qemu_free(s); > >> +} > >> + > >> +static void raw_send(void *opaque); > >> + > >> +static int raw_can_send(void *opaque) > >> +{ > >> + RAWState *s = opaque; > >> + > >> + return qemu_can_send_packet(s->vc); > >> +} > >> + > >> +static void raw_send_completed(VLANClientState *vc, ssize_t len) > >> +{ > >> + RAWState *s = vc->opaque; > >> + > >> + qemu_set_fd_handler2(s->fd, raw_can_send, raw_send, NULL, s); > >> +} > >> + > >> +static void raw_send(void *opaque) > >> +{ > >> + RAWState *s = opaque; > >> + int size; > >> + > >> + do { > >> + size = recv(s->fd, s->buf, sizeof(s->buf), MSG_TRUNC); > >> + if (size <= 0) > >> + break; > >> > > > > Need to handle EINTR. > > > >> + size = qemu_send_packet_async(s->vc, s->buf, size, > >> + raw_send_completed); > >> + if (size == 0) > >> + qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL); > >> + > >> + } while (size > 0); > >> +} > >> + > >> +static ssize_t raw_receive_iov(VLANClientState *vc, const struct iovec *iov, > >> + int iovcnt) > >> +{ > >> + ssize_t len; > >> + RAWState *s = vc->opaque; > >> + > >> + do { > >> + len = writev(s->fd, iov, iovcnt); > >> + } while (len == -1 && (errno == EINTR || errno == EAGAIN)); > >> > > > > Spinning on EAGAIN is certainly wrong. > > > >> +static int net_raw_init(Monitor *mon, VLANState *vlan, const char *model, > >> + const char *name, const char *ifname, > >> + int promisc, int fd) > >> +{ > >> + RAWState *s; > >> + > >> + s = qemu_mallocz(sizeof(RAWState)); > >> + > >> + if (fd == -1) { > >> + s->fd = net_raw_fd_init(mon, ifname, promisc); > >> + s->promisc = promisc; > >> + } else > >> + s->fd = fd; > >> + > >> + fcntl(s->fd, F_SETFL, O_NONBLOCK); > >> > > > > For net_raw_fd_int, you've already set O_NONBLOCK but you're also > > removing any other flags that have been set which is probably wrong for > > a passed in fd. > > > >> + s->vc = qemu_new_vlan_client(vlan, model, name, NULL, raw_receive, > >> diff --git a/qemu-options.hx b/qemu-options.hx > >> index bde3e3f..0d5440f 100644 > >> --- a/qemu-options.hx > >> +++ b/qemu-options.hx > >> @@ -825,6 +825,10 @@ DEF("net", HAS_ARG, QEMU_OPTION_net, > >> " default of 'sndbuf=1048576' can be disabled using 'sndbuf=0'\n" > >> #endif > >> #endif > >> + "-net raw[,vlan=n][,name=str],ifname=name[,promisc=m]\n" > >> + " bound the host network interface to VLAN 'n' in a raw manner:\n" > >> + " packets received on the interface are delivered to the vlan and\n" > >> + " packets delivered on the vlan are sent to the interface\n" > >> "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n" > >> " connect the vlan 'n' to another VLAN using a socket connection\n" > >> "-net socket[,vlan=n][,name=str][,fd=h][,mcast=maddr:port]\n" > >> > > > > Needs documentation. > > > > Regards, > > > > Anthony Liguori > -- Gleb.