From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Kletzander
Subject: Re: [libvirt] opening tap devices that are created in a container
Date: Tue, 17 Jul 2018 13:45:57 +0200
Message-ID: <20180717114557.GD24690@wheatley>
References: <6a8d7673-0ed7-5920-cc3a-d5d68dbc547c@akamai.com>
 <20180708060152.GB20206@wheatley>
 <6f5f40b6-3637-c7a9-44f8-81352ece2bef@akamai.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature"; boundary="+B+y8wtTXqdUj1xM"
Cc: Roman Mohr, fdeutsch@redhat.com, libvir-list@redhat.com,
 netdev@vger.kernel.org, ebiederm@xmission.com, davem@davemloft.net,
 Laine Stump
To: Jason Baron
Return-path:
Received: from mx3-rdu2.redhat.com ([66.187.233.73]:52990 "EHLO mx1.redhat.com"
 rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
 id S1730736AbeGQMSR (ORCPT ); Tue, 17 Jul 2018 08:18:17 -0400
Content-Disposition: inline
In-Reply-To: <6f5f40b6-3637-c7a9-44f8-81352ece2bef@akamai.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

--+B+y8wtTXqdUj1xM
Content-Type: text/plain; charset=iso-8859-1; format=flowed
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

[Not sure who got this message, but it probably didn't get anywhere due to one
mailserver, so resending to make sure]

On Mon, Jul 09, 2018 at 05:00:49PM -0400, Jason Baron wrote:
>On 07/08/2018 02:01 AM, Martin Kletzander wrote:
>> On Thu, Jul 05, 2018 at 06:24:20PM +0200, Roman Mohr wrote:
>>> On Thu, Jul 5, 2018 at 4:20 PM Jason Baron wrote:
>>>
>>>> Hi,
>>>>
>>>> Opening tap devices, such as macvtap, that are created in containers is
>>>> problematic because the interface for opening tap devices is via
>>>> /dev/tapNN and devtmpfs is not typically mounted inside a container as
>>>> it's not namespace aware.
>>>> It is possible to do a mknod() in the container once the tap devices
>>>> are created; however, since the tap devices are created dynamically,
>>>> it's not possible to a priori allow access to certain major/minor
>>>> numbers, since we don't know what these are going to be. In addition,
>>>> it's desirable to not allow the mknod capability in containers. This
>>>> behavior, I think, is somewhat inconsistent with the tuntap driver,
>>>> where one can create tuntap devices inside a container by first
>>>> opening /dev/net/tun and then using them by supplying the tuntap
>>>> device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates the
>>>> network namespace, one is limited to opening network devices that
>>>> belong to your current network namespace.
>>>>
>>>> Here are some options to this issue, that I wanted to get feedback
>>>> about, and just wondering if anybody else has run into this.
>>>>
>>>> 1)
>>>>
>>>> Don't create the tap device, such as macvtap, in the container. Instead,
>>>> create the tap device outside of the container and then move it into the
>>>> desired container network namespace. In addition, do a mknod() for the
>>>> corresponding /dev/tapNN device from outside the container before doing
>>>> chroot().
>>>>
>>>> This solution still doesn't allow tap devices to be created inside the
>>>> container. Thus, in the case of kubevirt, which runs libvirtd inside of
>>>> a container, it would mean changing libvirtd to open existing tap
>>>> devices (as opposed to the current behavior of creating new ones). This
>>>> would not require any kernel changes, but as mentioned seems
>>>> inconsistent with the tuntap interface.
>>>>
>>>
>>> For KubeVirt, apart from how exactly the device ends up in the
>>> container, I would want to pursue a way where all network preparations
>>> which require privileges happen from a privileged process *outside* of
>>> the container. Like CNI solutions do it.
>>> They run outside, have privileges and then create devices in the right
>>> network/mount namespace or move them there. The final goal for KubeVirt
>>> is that our pod with the qemu process is completely unprivileged and
>>> privileged setup happens from outside.
>>>
>>> As a consequence, and depending on which route Dan pursues with the
>>> restructured libvirt, I would assume that either a privileged
>>> libvirtd-part outside of containers creates the devices by entering the
>>> right namespaces, or that libvirt in the container can consume
>>> pre-created tun/tap devices, like qemu.
>>>
>>
>> That would be nice, but as far as I understand there will always be a
>> need for some privileges if you want to use a tap device. It's nice that
>> CNI does that and all the containers can run unprivileged, but that's
>> because they do not open the tap device and they do not do any
>> privileged operations on it. But QEMU needs to. So the only way would be
>> passing an opened fd to the container or opening the tap device there
>> and making the fd usable for one process in the container. Is this
>> already supported for some type of containers in some way?
>>
>> Martin
>
>Hi,
>
>So another option here, call it #3, is to pass open fds via unix sockets.
>If there are privileged operations that QEMU is trying to do with the fd
>though, how will opening it first and then passing it to an unprivileged
>QEMU address that? Is the opener doing those operations first?
>

Sorry for the confusion, but QEMU is not doing any privileged operations. I got
confused by the fact that anyone can open and do R/W on a tap device. But it
looks like that's on purpose. No capabilities are needed for opening
/dev/net/tun and calling ioctl(TUNSETIFF) with an existing name and then doing
R/W operations on it. It just works.
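
For illustration only, here is a minimal Python sketch of the sequence just
described: open /dev/net/tun, then attach to an already existing tap device by
name with ioctl(TUNSETIFF). It is not taken from libvirt or QEMU. The helper
names build_ifreq() and open_existing_tap() are made up for this example, the
TUNSETIFF value is the one used on x86 and ARM (other architectures may encode
it differently), and the 40-byte struct ifreq size assumes a 64-bit Linux ABI.

```python
import fcntl
import os
import struct

# Assumption: ioctl encoding used on x86/ARM, i.e. _IOW('T', 202, int).
TUNSETIFF = 0x400454CA
IFF_TAP = 0x0002      # layer 2 (Ethernet) device
IFF_NO_PI = 0x1000    # no extra packet-information header on read/write
IFNAMSIZ = 16         # includes the terminating NUL


def build_ifreq(name, flags=IFF_TAP | IFF_NO_PI):
    """Pack a struct ifreq: 16-byte name, short flags, zero padding."""
    encoded = name.encode("ascii")
    if not 0 < len(encoded) < IFNAMSIZ:
        raise ValueError("interface name must be 1..15 bytes long")
    # 16 + 2 + 22 = 40 bytes (assumed sizeof(struct ifreq) on 64-bit).
    return struct.pack("16sH22x", encoded, flags)


def open_existing_tap(name, tun_path="/dev/net/tun"):
    """Return an fd attached to the tap device called `name`."""
    fd = os.open(tun_path, os.O_RDWR)
    try:
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name))
    except OSError:
        os.close(fd)
        raise
    return fd
```

With a hypothetical device called "tap0" already present in the current network
namespace, fd = open_existing_tap("tap0") followed by os.read(fd, 2048) would
read one Ethernet frame from it.
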

Correct me if I'm wrong, but to sum it all up, the only things that we need to
figure out (which might possibly be solved by ideas in the other thread) are:

tap:
 - Existence of /dev/net/tun
 - Having permissions to open it (0666 by default, shouldn't be a big deal)
 - Knowing the device name

macvtap:
 - Existence of /dev/tapXX
 - Having permissions to open /dev/tapXX
 - One of the following:
   - Knowing the device name (and being able to translate it using a netlink
     socket)
   - Knowing the device index

The rest should be an implementation detail.

Am I right? Did I miss anything?

--+B+y8wtTXqdUj1xM
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEiXAnXDYdKAaCyvS1CB/CnyQXht0FAltN1vUACgkQCB/CnyQX
ht02aA/+OoHAMALUixJ9YtOQ6iVu6O/toS9n6et3rI093x/drzqKWTagkiNkG1jO
SS1KGHOJYTwOt7QLjRLYOCYqw46vwn89EasafF/RIBDdhTzbvwqy37Hvz4CvDR/x
mv/ruGrMCjfqYQEGVn/3ik6KZpWpUPuCwoN7zjO+fWVRqDW4W/Tmftz53dnewwRX
K0dxxDrKgOzfBz8xi+yK5yx+Bgzb+AuYF85iAMl78MG4Ob75DtbQD1np4Olw6bRi
vLs3Zp0n35wBEnj2gSOMcfMR3nVjKIoHqLc3sdIi6o0bR64eNbjMZY4J9RVYZ5vF
aYlfEk7dEorU7WojKsPkFslSo8qc03xEn+5PhMoFoIpvfyobllhBpIZ2GXRfQm+P
zxgINYp3TuWO7kXsRUSrY/03jhyn6iUbywg1xjmhcfr+3XX5bp5J1olT7OS3znHK
nC14EPqbyi6un8wXsFsuvj+w0N1dRiL+KjDgAk0cJ3POzUFPir0b6Grnc/971tsV
M5QbGKZC5/WyWjzu3kFnM3kLWQubjHYmbohCxa+9xDdnt2mcXAY7UDgn0dd4QvRE
GfTz/362Iv7kFdG2tjTP9Ucw29bk49+At+jrx3eZdZWvfUdroV/yp3JkqLbE7Q4N
pxRBwXUYhRHlkYcmFg1QJFJFEjgA/tBJVyhmY27CeSkMkpwV8+I=
=A+RZ
-----END PGP SIGNATURE-----

--+B+y8wtTXqdUj1xM--