From mboxrd@z Thu Jan  1 00:00:00 1970
From: Roman Mohr <rmohr@redhat.com>
Subject: Re: [libvirt] opening tap devices that are created in a container
Date: Tue, 17 Jul 2018 13:58:21 +0200
Message-ID: <CALDPj7v-bmAWXWAVBC5ALtEc0fdDKO9=dnHSOscMPGUL221J1Q@mail.gmail.com>
References: <6a8d7673-0ed7-5920-cc3a-d5d68dbc547c@akamai.com>
	<CALDPj7tWaHLe4kfhyCwPk0zHawOEULYFOQ2sX-Y3wQX7ba+HEw@mail.gmail.com>
	<20180708060152.GB20206@wheatley>
	<6f5f40b6-3637-c7a9-44f8-81352ece2bef@akamai.com>
	<20180711101005.GA13392@wheatley>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============7813131353077081865=="
Cc: fabiand@sni.github.map.fastly.net, libvir-list@redhat.com,
	netdev@vger.kernel.org, jbaron@akamai.com, ebiederm@xmission.com,
	davem@davemloft.net, laine@laine.org
To: Martin Kletzander <mkletzan@redhat.com>
Return-path: <libvir-list-bounces@redhat.com>
In-Reply-To: <20180711101005.GA13392@wheatley>
List-Unsubscribe: <https://www.redhat.com/mailman/options/libvir-list>,
	<mailto:libvir-list-request@redhat.com?subject=unsubscribe>
List-Archive: <https://www.redhat.com/archives/libvir-list>
List-Post: <mailto:libvir-list@redhat.com>
List-Help: <mailto:libvir-list-request@redhat.com?subject=help>
List-Subscribe: <https://www.redhat.com/mailman/listinfo/libvir-list>,
	<mailto:libvir-list-request@redhat.com?subject=subscribe>
Sender: libvir-list-bounces@redhat.com
Errors-To: libvir-list-bounces@redhat.com
List-Id: netdev.vger.kernel.org

--===============7813131353077081865==
Content-Type: multipart/alternative; boundary="00000000000041c042057130aa40"

--00000000000041c042057130aa40
Content-Type: text/plain; charset="UTF-8"

On Wed, Jul 11, 2018 at 12:10 PM <nert@wheatley> wrote:

> On Mon, Jul 09, 2018 at 05:00:49PM -0400, Jason Baron wrote:
> >
> >
> >On 07/08/2018 02:01 AM, Martin Kletzander wrote:
> >> On Thu, Jul 05, 2018 at 06:24:20PM +0200, Roman Mohr wrote:
> >>> On Thu, Jul 5, 2018 at 4:20 PM Jason Baron <jbaron@akamai.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Opening tap devices, such as macvtap, that are created in containers
> >>>> is problematic because the interface for opening tap devices is via
> >>>> /dev/tapNN and devtmpfs is not typically mounted inside a container as
> >>>> it's not namespace aware. It is possible to do a mknod() in the
> >>>> container once the tap devices are created. However, since the tap
> >>>> devices are created dynamically, it's not possible to allow access to
> >>>> certain major/minor numbers a priori, since we don't know what these
> >>>> are going to be. In addition, it's desirable to not allow the mknod
> >>>> capability in containers. This behavior, I think, is somewhat
> >>>> inconsistent with the tuntap driver, where one can create tuntap
> >>>> devices inside a container by first opening /dev/net/tun and then
> >>>> using them by supplying the tuntap device name via the
> >>>> ioctl(TUNSETIFF). And since TUNSETIFF validates the network namespace,
> >>>> one is limited to opening network devices that belong to one's current
> >>>> network namespace.
> >>>>
> >>>> Here are some options to this issue, that I wanted to get feedback
> >>>> about, and just wondering if anybody else has run into this.
> >>>>
> >>>> 1)
> >>>>
> >>>> Don't create the tap device, such as macvtap in the container.
> Instead,
> >>>> create the tap device outside of the container and then move it into
> the
> >>>> desired container network namespace. In addition, do a mknod() for the
> >>>> corresponding /dev/tapNN device from outside the container before
> doing
> >>>> chroot().
> >>>>
> >>>> This solution still doesn't allow tap devices to be created inside the
> >>>> container. Thus, in the case of kubevirt, which runs libvirtd inside
> of
> >>>> a container, it would mean changing libvirtd to open existing tap
> >>>> devices (as opposed to the current behavior of creating new ones).
> This
> >>>> would not require any kernel changes, but as mentioned seems
> >>>> inconsistent with the tuntap interface.
> >>>>
> >>>
> >>> For KubeVirt, apart from how exactly the device ends up in the
> >>> container, I
> >>> would want to pursue a way where all network preparations which require
> >>> privileges happens from a privileged process *outside* of the
> container.
> >>> Like CNI solutions do it. They run outside, have privileges and then
> >>> create
> >>> devices in the right network/mount namespace or move them there. The
> >>> final
> >>> goal for KubeVirt is that our pod with the qemu process is completely
> >>> unprivileged and privileged setup happens from outside.
> >>>
> >>> As a consequence, and depending on which route Dan pursues with the
> >>> restructured libvirt, I would assume that either a privileged
> >>> libvirtd-part
> >>> outside of containers creates the devices by entering the right
> >>> namespaces,
> >>> or that libvirt in the container can consume pre-created tun/tap
> devices,
> >>> like qemu.
> >>>
> >>
> >> That would be nice, but as far as I understand there will always be a
> >> need for
> >> some privileges if you want to use a tap device.  It's nice that CNI
> >> does that
> >> and all the containers can run unprivileged, but that's because they do
> >> not open
> >> the tap device and they do not do any privileged operations on it.  But
> >> QEMU
> >> needs to.  So the only way would be passing an opened fd to the
> >> container or
> >> opening the tap device there and making the fd usable for one process in
> >> the
> >> container.  Is this already supported for some type of containers in
> >> some way?
> >>
> >> Martin
> >
> >Hi,
> >
> >So another option here call it #3 is to pass open fds via unix sockets.
> >If there are privileged operations that QEMU is trying to do with the fd
> >though, how will opening it first and then passing it to an unprivileged
> >QEMU address that? Is the opener doing those operations first?
> >
>
> Sorry for the confusion, but QEMU is not doing any privileged operations.
> I got confused by the fact that anyone can open and do a R/W on a tap
> device.  But it looks like that's on purpose.  No capabilities are needed
> for opening /dev/net/tun and calling ioctl(TUNSETIFF) with an existing
> name and then doing R/W operations on it.  It just works.
>
> Correct me if I'm wrong, but to sum it all up, the only things that we
> need to figure out (which might possibly be solved by ideas in the other
> thread) are:
>
> tap:
> - Existence of /dev/net/tun
> - Having permissions to open it (0666 by default, shouldn't be a big deal)
> - Knowing the device name
>
> macvtap:
> - Existence of /dev/tapXX
> - Having permissions to open /dev/tapXX
> - One of the following:
>   - Knowing the device name (and being able to translate it using a
>     netlink socket)
>   - Knowing the device index
>
> The rest should be an implementation detail.
>
> Am I right?  Did I miss anything?


At least for the KubeVirt use case, those sound like the things we would
need in order to solve the networking setup the same way the Container
Network Interface (CNI) implementations solve it in k8s.
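
To make the two checklists quoted above a bit more concrete, here is a
rough Python sketch of what the unprivileged side might look like. It is
only an illustration under a few assumptions, not what libvirt or QEMU
actually do: the helper names (build_ifreq, open_tap, macvtap_device_path,
open_macvtap) are made up for this mail, the TUNSETIFF value is the one
used on x86/arm-style ioctl encodings, the macvtap node is assumed to be
/dev/tap<ifindex>, and socket.if_nametoindex() stands in for the netlink
lookup mentioned in the list.

```python
import fcntl
import os
import socket
import struct

# Constants from <linux/if_tun.h> and <linux/if.h>.
# Assumption: TUNSETIFF is _IOW('T', 202, int) with the x86/arm encoding.
TUNSETIFF = 0x400454CA
IFF_TAP = 0x0002
IFF_NO_PI = 0x1000
IFNAMSIZ = 16


def build_ifreq(name: str, flags: int) -> bytes:
    """Pack a 40-byte struct ifreq: 16-byte name, short flags, padding."""
    encoded = name.encode()
    if len(encoded) >= IFNAMSIZ:
        raise ValueError("interface name too long")
    return struct.pack("16sH22x", encoded, flags)


def open_tap(name: str) -> int:
    """tap case: needs /dev/net/tun, permission to open it, and the name."""
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        # Attach the fd to the already existing tap device called `name`.
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name, IFF_TAP | IFF_NO_PI))
    except OSError:
        os.close(fd)
        raise
    return fd


def macvtap_device_path(ifindex: int) -> str:
    """macvtap case: the character device is assumed to be /dev/tap<ifindex>."""
    if ifindex <= 0:
        raise ValueError("ifindex must be positive")
    return f"/dev/tap{ifindex}"


def open_macvtap(name: str) -> int:
    """Translate name to index in the current netns, then open the node."""
    ifindex = socket.if_nametoindex(name)
    return os.open(macvtap_device_path(ifindex), os.O_RDWR)
```

Something like open_tap("tap0") or open_macvtap("macvtap0") would then be
called by the process in the container, with the device names being
whatever the privileged side created. The sketch deliberately leaves open
how /dev/net/tun or /dev/tapNN end up inside the container, since that is
exactly the part still under discussion.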

Best Regards,
Roman


--00000000000041c042057130aa40--


--===============7813131353077081865==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline


--===============7813131353077081865==--