From mboxrd@z Thu Jan 1 00:00:00 1970
From: Roman Mohr
Subject: Re: [libvirt] opening tap devices that are created in a container
Date: Tue, 17 Jul 2018 13:58:21 +0200
Message-ID:
References: <6a8d7673-0ed7-5920-cc3a-d5d68dbc547c@akamai.com>
 <20180708060152.GB20206@wheatley>
 <6f5f40b6-3637-c7a9-44f8-81352ece2bef@akamai.com>
 <20180711101005.GA13392@wheatley>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============7813131353077081865=="
Cc: fabiand@sni.github.map.fastly.net, libvir-list@redhat.com,
 netdev@vger.kernel.org, jbaron@akamai.com, ebiederm@xmission.com,
 davem@davemloft.net, laine@laine.org
To: Martin Kletzander
Return-path:
In-Reply-To: <20180711101005.GA13392@wheatley>
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: libvir-list-bounces@redhat.com
Errors-To: libvir-list-bounces@redhat.com
List-Id: netdev.vger.kernel.org

--===============7813131353077081865==
Content-Type: multipart/alternative; boundary="00000000000041c042057130aa40"

--00000000000041c042057130aa40
Content-Type: text/plain; charset="UTF-8"

On Wed, Jul 11, 2018 at 12:10 PM wrote:

> On Mon, Jul 09, 2018 at 05:00:49PM -0400, Jason Baron wrote:
> >
> >
> >On 07/08/2018 02:01 AM, Martin Kletzander wrote:
> >> On Thu, Jul 05, 2018 at 06:24:20PM +0200, Roman Mohr wrote:
> >>> On Thu, Jul 5, 2018 at 4:20 PM Jason Baron wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Opening tap devices, such as macvtap, that are created in containers is
> >>>> problematic because the interface for opening tap devices is via
> >>>> /dev/tapNN and devtmpfs is not typically mounted inside a container as
> >>>> it's not namespace aware. It is possible to do a mknod() in the
> >>>> container, once the tap devices are created, however, since the tap
> >>>> devices are created dynamically it's not possible to a priori allow access
> >>>> to certain major/minor numbers, since we don't know what these are going
> >>>> to be.
> >>>> In addition, it's desirable to not allow the mknod capability in
> >>>> containers. This behavior, I think, is somewhat inconsistent with the
> >>>> tuntap driver where one can create tuntap devices inside a container by
> >>>> first opening /dev/net/tun and then using them by supplying the tuntap
> >>>> device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates the
> >>>> network namespace, one is limited to opening network devices that belong
> >>>> to your current network namespace.
> >>>>
> >>>> Here are some options to this issue, that I wanted to get feedback
> >>>> about, and just wondering if anybody else has run into this.
> >>>>
> >>>> 1)
> >>>>
> >>>> Don't create the tap device, such as macvtap in the container. Instead,
> >>>> create the tap device outside of the container and then move it into the
> >>>> desired container network namespace. In addition, do a mknod() for the
> >>>> corresponding /dev/tapNN device from outside the container before doing
> >>>> chroot().
> >>>>
> >>>> This solution still doesn't allow tap devices to be created inside the
> >>>> container. Thus, in the case of kubevirt, which runs libvirtd inside of
> >>>> a container, it would mean changing libvirtd to open existing tap
> >>>> devices (as opposed to the current behavior of creating new ones). This
> >>>> would not require any kernel changes, but as mentioned seems
> >>>> inconsistent with the tuntap interface.
> >>>>
> >>>
> >>> For KubeVirt, apart from how exactly the device ends up in the
> >>> container, I would want to pursue a way where all network
> >>> preparations which require privileges happen from a privileged
> >>> process *outside* of the container. Like CNI solutions do it. They
> >>> run outside, have privileges and then create devices in the right
> >>> network/mount namespace or move them there.
> >>> The final goal for KubeVirt is that our pod with the qemu process is
> >>> completely unprivileged and privileged setup happens from outside.
> >>>
> >>> As a consequence, and depending on which route Dan pursues with the
> >>> restructured libvirt, I would assume that either a privileged
> >>> libvirtd-part outside of containers creates the devices by entering
> >>> the right namespaces, or that libvirt in the container can consume
> >>> pre-created tun/tap devices, like qemu.
> >>>
> >>
> >> That would be nice, but as far as I understand there will always be a
> >> need for some privileges if you want to use a tap device. It's nice
> >> that CNI does that and all the containers can run unprivileged, but
> >> that's because they do not open the tap device and they do not do any
> >> privileged operations on it. But QEMU needs to. So the only way
> >> would be passing an opened fd to the container or opening the tap
> >> device there and making the fd usable for one process in the
> >> container. Is this already supported for some type of containers in
> >> some way?
> >>
> >> Martin
> >
> >Hi,
> >
> >So another option here call it #3 is to pass open fds via unix sockets.
> >If there are privileged operations that QEMU is trying to do with the fd
> >though, how will opening it first and then passing it to an unprivileged
> >QEMU address that? Is the opener doing those operations first?
> >
>
> Sorry for the confusion, but QEMU is not doing any privileged operations.
> I got confused by the fact that anyone can open and do a R/W on a tap
> device. But it looks like that's on purpose. No capabilities are needed
> for opening /dev/net/tun and calling ioctl(TUNSETIFF) with existing name
> and then doing R/W operations on it. It just works.
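
To make the "it just works" case concrete, here is a minimal sketch in
Python of attaching to an already existing tap device the way the
paragraph above describes. Treat it as an illustration under assumptions,
not as what libvirt or QEMU actually do: the TUNSETIFF, IFF_TAP and
IFF_NO_PI values are the ones from <linux/if_tun.h> on x86 and arm64
(some other architectures encode the ioctl number differently), the
helper names build_ifreq() and open_existing_tap() are made up for this
example, and the flags have to match how the device was created.

```python
import fcntl
import os
import struct

# Values from <linux/if_tun.h>; assumed ABI: x86 / arm64 Linux.
TUNSETIFF = 0x400454CA
IFF_TAP = 0x0002
IFF_NO_PI = 0x1000
IFNAMSIZ = 16


def build_ifreq(name: str, flags: int) -> bytes:
    """Pack the start of struct ifreq: ifr_name[16], then short ifr_flags.

    The result is padded to 40 bytes, the size of struct ifreq on
    64-bit Linux.
    """
    raw = name.encode()
    if not 0 < len(raw) < IFNAMSIZ:
        raise ValueError("interface name must be 1..15 bytes long")
    return struct.pack("16sH22x", raw, flags)


def open_existing_tap(name: str) -> int:
    """Open /dev/net/tun and bind the new fd to the tap device `name`.

    Hypothetical helper: the device must already exist in the caller's
    network namespace. Returns a file descriptor usable for read/write.
    """
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name, IFF_TAP | IFF_NO_PI))
    except OSError:
        # Do not leak the descriptor if the attach is refused.
        os.close(fd)
        raise
    return fd
```

Calling open_existing_tap("tap0") (the name "tap0" is only an example)
would then hand back an fd for reading and writing frames, provided
/dev/net/tun exists in the container and is openable.
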
>
> Correct me if I'm wrong, but to sum it all up, the only things that we
> need to figure out (which might possibly be solved by ideas in the other
> thread) are:
>
> tap:
> - Existence of /dev/net/tun
> - Having permissions to open it (0666 by default, shouldn't be a big deal)
> - Knowing the device name
>
> macvtap:
> - Existence of /dev/tapXX
> - Having permissions to open /dev/tapXX
> - One of the following:
>   - Knowing the device name (and being able to translate it using a
>     netlink socket)
>   - Knowing the device index
>
> The rest should be an implementation detail.
>
> Am I right? Did I miss anything?

At least for the KubeVirt use case, those sound like the things we would
need in order to solve the networking setup the way the Container Network
Interface implementations do in k8s.

Best Regards,
Roman

--00000000000041c042057130aa40
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable


--00000000000041c042057130aa40--

--===============7813131353077081865==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

--===============7813131353077081865==--