Re: [libvirt] opening tap devices that are created in a container

From: "Daniel P. Berrangé" <berrange@redhat.com>
To: Jason Baron <jbaron@akamai.com>
Cc: fabiand@sni.github.map.fastly.net, libvir-list@redhat.com,
	netdev@vger.kernel.org, Roman Mohr <rmohr@redhat.com>,
	ebiederm@xmission.com, Martin Kletzander <mkletzan@redhat.com>,
	davem@davemloft.net
Subject: Re: [libvirt] opening tap devices that are created in a container
Date: Tue, 10 Jul 2018 09:47:45 +0100	[thread overview]
Message-ID: <20180710084745.GB1612@redhat.com> (raw)
In-Reply-To: <6f5f40b6-3637-c7a9-44f8-81352ece2bef@akamai.com>

On Mon, Jul 09, 2018 at 05:00:49PM -0400, Jason Baron wrote:
> 
> 
> On 07/08/2018 02:01 AM, Martin Kletzander wrote:
> > On Thu, Jul 05, 2018 at 06:24:20PM +0200, Roman Mohr wrote:
> >> On Thu, Jul 5, 2018 at 4:20 PM Jason Baron <jbaron@akamai.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> Opening tap devices, such as macvtap, that are created in containers is
> >>> problematic because the interface for opening tap devices is via
> >>> /dev/tapNN and devtmpfs is not typically mounted inside a container as
> >>> its not namespace aware. It is possible to do a mknod() in the
> >>> container, once the tap devices are created, however, since the tap
> >>> devices are created dynamically its not possible to apriori allow access
> >>> to certain major/minor numbers, since we don't know what these are going
> >>> to be. In addition, its desirable to not allow the mknod capability in
> >>> containers. This behavior, I think is somewhat inconsistent with the
> >>> tuntap driver where one can create tuntap devices inside a container by
> >>> first opening /dev/net/tun and then using them by supplying the tuntap
> >>> device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates the
> >>> network namespace, one is limited to opening network devices that belong
> >>> to your current network namespace.
> >>>
> >>> Here are some options to this issue, that I wanted to get feedback
> >>> about, and just wondering if anybody else has run into this.
> >>>
> >>> 1)
> >>>
> >>> Don't create the tap device, such as macvtap in the container. Instead,
> >>> create the tap device outside of the container and then move it into the
> >>> desired container network namespace. In addition, do a mknod() for the
> >>> corresponding /dev/tapNN device from outside the container before doing
> >>> chroot().
> >>>
> >>> This solution still doesn't allow tap devices to be created inside the
> >>> container. Thus, in the case of kubevirt, which runs libvirtd inside of
> >>> a container, it would mean changing libvirtd to open existing tap
> >>> devices (as opposed to the current behavior of creating new ones). This
> >>> would not require any kernel changes, but as mentioned seems
> >>> inconsistent with the tuntap interface.
> >>>
> >>
> >> For KubeVirt, apart from how exactly the device ends up in the
> >> container, I
> >> would want to pursue a way where all network preparations which require
> >> privileges happens from a privileged process *outside* of the container.
> >> Like CNI solutions do it. They run outside, have privileges and then
> >> create
> >> devices in the right network/mount namespace or move them there. The
> >> final
> >> goal for KubeVirt is that our pod with the qemu process is completely
> >> unprivileged and privileged setup happens from outside.
> >>
> >> As a consequence, and depending on which route Dan pursues with the
> >> restructured libvirt, I would assume that either a privileged
> >> libvirtd-part
> >> outside of containers creates the devices by entering the right
> >> namespaces,
> >> or that libvirt in the container can consume pre-created tun/tap devices,
> >> like qemu.
> >>
> > 
> > That would be nice, but as far as I understand there will always be a
> > need for
> > some privileges if you want to use a tap device.  It's nice that CNI
> > does that
> > and all the containers can run unprivileged, but that's because they do
> > not open
> > the tap device and they do not do any privileged operations on it.  But
> > QEMU
> > needs to.  So the only way would be passing an opened fd to the
> > container or
> > opening the tap device there and making the fd usable for one process in
> > the
> > container.  Is this already supported for some type of containers in
> > some way?
> > 
> > Martin
> 
> Hi,
> 
> So another option here call it #3 is to pass open fds via unix sockets.
> If there are privileged operations that QEMU is trying to do with the fd
> though, how will opening it first and then passing it to an unprivileged
> QEMU address that? Is the opener doing those operations first?

>From libvirt's POV, it would be preferrable to be able to open the
macvtap device by name inside the container, rather than having to
accept a pre-opened FD from the application.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list