Re: [libvirt] opening tap devices that are created in a container

netdev.vger.kernel.org archive mirror

From: Roman Mohr <rmohr@redhat.com>
To: Martin Kletzander <mkletzan@redhat.com>
Cc: fabiand@sni.github.map.fastly.net, libvir-list@redhat.com,
	netdev@vger.kernel.org, jbaron@akamai.com, ebiederm@xmission.com,
	davem@davemloft.net, laine@laine.org
Subject: Re: [libvirt] opening tap devices that are created in a container
Date: Tue, 17 Jul 2018 13:58:21 +0200
Message-ID: <CALDPj7v-bmAWXWAVBC5ALtEc0fdDKO9=dnHSOscMPGUL221J1Q@mail.gmail.com>
In-Reply-To: <20180711101005.GA13392@wheatley>


On Wed, Jul 11, 2018 at 12:10 PM <nert@wheatley> wrote:

> On Mon, Jul 09, 2018 at 05:00:49PM -0400, Jason Baron wrote:
> >
> >
> >On 07/08/2018 02:01 AM, Martin Kletzander wrote:
> >> On Thu, Jul 05, 2018 at 06:24:20PM +0200, Roman Mohr wrote:
> >>> On Thu, Jul 5, 2018 at 4:20 PM Jason Baron <jbaron@akamai.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Opening tap devices, such as macvtap, that are created in containers is
> >>>> problematic because the interface for opening tap devices is via
> >>>> /dev/tapNN and devtmpfs is not typically mounted inside a container as
> >>>> it's not namespace aware. It is possible to do a mknod() in the
> >>>> container once the tap devices are created. However, since the tap
> >>>> devices are created dynamically, it's not possible to allow access a
> >>>> priori to certain major/minor numbers, since we don't know what these
> >>>> are going to be. In addition, it's desirable to not allow the mknod
> >>>> capability in containers. This behavior, I think, is somewhat
> >>>> inconsistent with the tuntap driver, where one can create tuntap
> >>>> devices inside a container by first opening /dev/net/tun and then using
> >>>> them by supplying the tuntap device name via the ioctl(TUNSETIFF). And
> >>>> since TUNSETIFF validates the network namespace, one is limited to
> >>>> opening network devices that belong to your current network namespace.
> >>>>
> >>>> Here are some options to this issue, that I wanted to get feedback
> >>>> about, and just wondering if anybody else has run into this.
> >>>>
> >>>> 1)
> >>>>
> >>>> Don't create the tap device, such as macvtap, in the container.
> >>>> Instead, create the tap device outside of the container and then move
> >>>> it into the desired container network namespace. In addition, do a
> >>>> mknod() for the corresponding /dev/tapNN device from outside the
> >>>> container before doing chroot().
> >>>>
> >>>> This solution still doesn't allow tap devices to be created inside the
> >>>> container. Thus, in the case of kubevirt, which runs libvirtd inside of
> >>>> a container, it would mean changing libvirtd to open existing tap
> >>>> devices (as opposed to the current behavior of creating new ones). This
> >>>> would not require any kernel changes, but as mentioned seems
> >>>> inconsistent with the tuntap interface.
> >>>>
> >>>
> >>> For KubeVirt, apart from how exactly the device ends up in the
> >>> container, I would want to pursue a way where all network preparations
> >>> which require privileges happen from a privileged process *outside* of
> >>> the container, as CNI solutions do. They run outside, have privileges,
> >>> and then create devices in the right network/mount namespace or move
> >>> them there. The final goal for KubeVirt is that our pod with the qemu
> >>> process is completely unprivileged and privileged setup happens from
> >>> outside.
> >>>
> >>> As a consequence, and depending on which route Dan pursues with the
> >>> restructured libvirt, I would assume that either a privileged
> >>> libvirtd-part outside of containers creates the devices by entering the
> >>> right namespaces, or that libvirt in the container can consume
> >>> pre-created tun/tap devices, like qemu.
> >>>
> >>
> >> That would be nice, but as far as I understand there will always be a
> >> need for some privileges if you want to use a tap device.  It's nice
> >> that CNI does that and all the containers can run unprivileged, but
> >> that's because they do not open the tap device and they do not do any
> >> privileged operations on it.  But QEMU needs to.  So the only way would
> >> be passing an opened fd to the container or opening the tap device there
> >> and making the fd usable for one process in the container.  Is this
> >> already supported for some type of containers in some way?
> >>
> >> Martin
> >
> >Hi,
> >
> >So another option here, call it #3, is to pass open fds via unix sockets.
> >If there are privileged operations that QEMU is trying to do with the fd
> >though, how will opening it first and then passing it to an unprivileged
> >QEMU address that? Is the opener doing those operations first?
> >
>
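
A minimal sketch of the fd-passing idea (option #3), in Python for
brevity. It assumes Python 3.9+ for socket.send_fds()/recv_fds(), and it
uses a pipe as a stand-in for the tap fd so that it runs without any
privileges. send_fd(), recv_fd() and demo() are hypothetical names for
illustration, not anything libvirt or QEMU provides.

```python
import os
import socket


def send_fd(sock: socket.socket, fd: int) -> None:
    """Send one open file descriptor over a unix socket (SCM_RIGHTS)."""
    # At least one byte of ordinary data must accompany the ancillary data.
    socket.send_fds(sock, [b"F"], [fd])


def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor sent with send_fd()."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    if not fds:
        raise RuntimeError("no file descriptor received")
    return fds[0]


def demo() -> bytes:
    """Pass the write end of a pipe across a socketpair and use it."""
    opener, consumer = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()  # stand-in for an opened tap fd
    with opener, consumer:
        send_fd(opener, write_end)
        received = recv_fd(consumer)  # a new fd for the same open file
    os.write(received, b"hello")
    os.close(received)
    os.close(write_end)
    data = os.read(read_end, 5)
    os.close(read_end)
    return data
```

In a real setup the opener and the consumer would be separate processes
connected by a unix socket, and the fd sent would be the opened tap
device. That part is an assumption and is not shown here.
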
> Sorry for the confusion, but QEMU is not doing any privileged operations.
> I got confused by the fact that anyone can open and do a R/W on a tap
> device.  But it looks like that's on purpose.  No capabilities are needed
> for opening /dev/net/tun and calling ioctl(TUNSETIFF) with an existing
> name and then doing R/W operations on it.  It just works.
>
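
The tap sequence described above could look roughly like the sketch
below. The constants are copied from <linux/if_tun.h>. The TUNSETIFF
number assumes a common architecture such as x86_64 or aarch64, and the
40-byte buffer assumes sizeof(struct ifreq) on 64-bit Linux. build_ifreq()
and open_existing_tap() are hypothetical helper names.

```python
import fcntl
import os
import struct

# Values from <linux/if_tun.h>; TUNSETIFF is _IOW('T', 202, int).
TUNSETIFF = 0x400454CA
IFF_TAP = 0x0002
IFF_NO_PI = 0x1000
IFNAMSIZ = 16


def build_ifreq(name: str, flags: int) -> bytes:
    """Pack ifr_name[16] + short ifr_flags, padded to 40 bytes."""
    raw = name.encode()
    if len(raw) >= IFNAMSIZ:
        raise ValueError("interface name too long")
    return struct.pack("16sH22x", raw, flags)


def open_existing_tap(name: str) -> int:
    """Attach to an already existing tap device by name and return the fd.

    The name is looked up in the caller's network namespace. If no such
    device exists, the kernel tries to create one, which needs
    CAP_NET_ADMIN and would fail for an unprivileged caller.
    """
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name, IFF_TAP | IFF_NO_PI))
    except OSError:
        os.close(fd)
        raise
    return fd
```

Something like open_existing_tap("tap0") would then return an fd that can
be read from and written to, where "tap0" is only an example name.
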
> Correct me if I'm wrong, but to sum it all up, the only things that we
> need to figure out (which might possibly be solved by ideas in the other
> thread) are:
>
> tap:
> - Existence of /dev/net/tun
> - Having permissions to open it (0666 by default, shouldn't be a big deal)
> - Knowing the device name
>
> macvtap:
> - Existence of /dev/tapXX
> - Having permissions to open /dev/tapXX
> - One of the following:
>   - Knowing the device name (and being able to translate it using a
>     netlink socket)
>   - Knowing the device index
>
> The rest should be an implementation detail.
>
> Am I right?  Did I miss anything?
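
For the macvtap items in the list above, a rough sketch of the
name-to-node step. It uses socket.if_nametoindex() (a wrapper around
if_nametoindex(3)) instead of a hand-written netlink query, which is a
substitution on my part. The lookup happens in the caller's network
namespace. macvtap_device_path() and open_macvtap() are hypothetical
names.

```python
import os
import socket


def macvtap_device_path(ifindex: int) -> str:
    """Return the character device node for a macvtap interface index."""
    if ifindex <= 0:
        raise ValueError("interface index must be positive")
    return f"/dev/tap{ifindex}"


def open_macvtap(name: str) -> int:
    """Resolve a macvtap name to its index and open /dev/tap<index>.

    The node must already exist in the container's /dev and be
    accessible, which is exactly the open question in this thread.
    """
    ifindex = socket.if_nametoindex(name)  # raises OSError if unknown
    return os.open(macvtap_device_path(ifindex), os.O_RDWR)
```

If only the index is known, macvtap_device_path() alone is enough and no
name lookup is needed.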


At least for the KubeVirt use case, those sound like the things we would
need in order to solve the networking setup in a way similar to how
Container Network Interface implementations solve it in k8s.

Best Regards,
Roman

Thread overview: 11+ messages
2018-07-05 14:20 opening tap devices that are created in a container Jason Baron
2018-07-05 16:10 ` Daniel P. Berrangé
2018-07-09 20:56   ` Jason Baron
2018-07-10  8:46     ` Daniel P. Berrangé
2018-07-05 16:24 ` [libvirt] " Roman Mohr
2018-07-08  6:01   ` Martin Kletzander
2018-07-09 21:00     ` Jason Baron
2018-07-10  8:47       ` Daniel P. Berrangé
2018-07-17 11:45       ` Martin Kletzander
     [not found]       ` <20180711101005.GA13392@wheatley>
2018-07-12  3:33         ` Jason Baron
2018-07-17 11:58         ` Roman Mohr [this message]
