From: Vivek Goyal <vgoyal@redhat.com>
To: Christian Brauner <brauner@kernel.org>
Cc: "Amir Goldstein" <amir73il@gmail.com>,
"Miklos Szeredi" <mszeredi@redhat.com>,
linux-unionfs@vger.kernel.org, "Christoph Hellwig" <hch@lst.de>,
"Aleksa Sarai" <cyphar@cyphar.com>,
"Giuseppe Scrivano" <gscrivan@redhat.com>,
"Rodrigo Campos Catelin" <rodrigoca@microsoft.com>,
"Seth Forshee" <sforshee@digitalocean.com>,
"Luca Bocassi" <luca.boccassi@microsoft.com>,
"Lennart Poettering" <mzxreary@0pointer.de>,
"Stéphane Graber" <stgraber@ubuntu.com>
Subject: Re: [PATCH 00/18] overlay: support idmapped layers
Date: Wed, 30 Mar 2022 16:58:03 -0400 [thread overview]
Message-ID: <YkTEWy7byDiPAvzc@redhat.com> (raw)
In-Reply-To: <20220329103526.1207086-1-brauner@kernel.org>
On Tue, Mar 29, 2022 at 12:35:07PM +0200, Christian Brauner wrote:
> From: "Christian Brauner (Microsoft)" <brauner@kernel.org>
>
> Hey,
>
> This adds support for mounting overlay on top of idmapped layers.
>
> I have to start by saying a massive thank you to Amir! He did not just
> answer my constant overlay questions but also provided quite a few
> patches himself in this series in addition to reviews, comments and a
> lot of suggestions. Thank you!
>
> There have been a lot of requests to unblock this. For just a few select
> examples see [3], [4], and [5]. I've worked closely with various
> communities among them containerd, Kubernetes, Podman, LXD, runC, crun,
> and systemd (For the curious please see the various pull-request and
> issues below.) a lot of them already support idmapped mounts since they
> are enabled for btrfs, ext4, and xfs (and f2fs and fat fwiw). In
> additon, a few colleagues at Microsoft and from Red Hat work on a
> Kubernetes Enhancement Proposals (KEP) that also relies on overlayfs
> supporting idmapped layers, see [12].
>
> Overlayfs on top of idmapped layers will be used in various ways:
>
> * Container managers use overlayfs to efficiently share layers between
> containers. However, this only works for privileged containers.
> Layers cannot be shared if both privileged and unprivileged containers
> are used. Layers can also not be shared if non-overlapping idmappings
> are used for unprivileged containers. Layers cannot be shared because
> of the conflicting ownership requirements between the containers.
Hi Christian,
Thank you for this work. This is awesome.
Wanted to test it. I was wondering how to test it. Some simple
instructions will help.
I am assuming I will need to be priviliged to be able to setup idmapped
mounts. And then overlay can be mounted on top of these layers (either
as a priviliged operation outside container or unprivileged overlayfs
inside container). Sorry, just trying to wrap my head around which
operations will require me to be priviliged (root in init_user_ns)
and which operations I can do unprivileged (as normal user).
Thanks
Vivek
>
> Both the KEP proposal (see [13]) and LXD (see [14]) use
> non-overlapping idmappings to increase isolation.
>
> All of these cases can be supported if overlayfs can be mounted on
> idmapped layers. The container runtime will create idmapped mounts for
> the layers supposed to be shared. Each container will be given
> idmapped mounts with the correct idmapping. Either by attaching a
> custom or its own user namespace depending on the use-case. Then an
> overlay mount can be mounted on top of the idmapped layers. The
> underlying idmapped mounts can then be unmounted and the mount table
> will be in a clean state.
> This approach has been tested an verified by Giuseppe and others.
>
> * Because of the inability to share layers preparing a layer causes a
> big runtime overhead as ownership needs to be recursively changed.
> This becomes increasingly costly with the bigger the layers are.
> Especially for a large rootfs it becomes prohibitively expensive.
>
> This recursive ownership change also causes a lot storage overhead.
> Without metacopy it can quickly become unmanagable. With metacopy it
> still wastes a lot of space and inodes.
>
> For both the KEP and container manager being able to use idmapped
> layer with overlayfs on top of it solves all these problems.
>
> * Container managers such as LXD do run full system containers. Such
> systems are managed like virtual machines and users run application
> containers inside. Since LXD uses an idmapped mounts for the rootfs of
> the container with the container's user namespace attached to the
> mount container runtimes cannot user overlayfs inside.
>
> Once overlayfs can be mounted on top of idmapped layers it will be
> possible for container runtimes to use overlayfs inside LXD
> containers.
>
> * The systemd-homed daemon and toolsuite is a new component of systemd
> to manage home areas in a portable way. systemd-homed makes use of
> idmapped mounts for the home areas. If the kernel and used file system
> support it the user's home area will be mounted as an idmapped mount
> when they log into their system.
>
> All files are internally owned by the "nobody" user (i.e. the user
> typically used for indicating "this ownership is not mapped"), and
> dynamically mapped to the {g,u}id used locally on the system via
> idmapped mounts. This makes migrating home areas between different
> systems trivial because recursively chown()ing file system trees is no
> longer necessary. This also means that it is impossible to store files
> as {g,u}id 0 on disk.
>
> Once overlayfs can be mounted on top of idmapped layers it will be
> possible for container runtimes to work better together with
> systemd-homed.
>
> * systemd provides various sandboxing features (see [11]) for services.
> These serve as hardening measures. In this context, idmapped mounts
> can be used to prevent services from creating files on disk as {g,u}id
> 0, making specific files inaccessible, or to prevent access to whole
> directories. Since such services may also make use of overlayfs for
> e.g. the ExtensionImages= option supporting overlayfs on top of
> idmapped layers would be another huge hardening win.
>
> * systemd provides the ability to use system extension images [15] for
> /usr and /opt (with a look to /etc in the future). Such system
> extension images contain files and directories similar in fashion to
> a regular OS system tree. When one or more system extension images are
> activated, their /usr/ and /opt/ hierarchies are combined via
> overlay with the same hierarchies of the host OS, and the host /usr/
> and /opt/ overmounted with it ("merging"). These images are read-only.
>
> This feature is available to unprivileged container and sandboxed
> services as well. Idmapped layers are used here to avoid runtime and
> storage overhead from recursively changing ownership and ultimately an
> overlay mount is supposed to be created on top of it.
>
> Giuseppe provided testing for runC/crun/Podman and he put up a pull
> request to support overlayfs on top of idmapped layers at [16]. That
> already covers a lot of users. Other tools have put up pull requests as
> well and they are linked below.
>
> The patchset has been extensively tested for about 2 weeks with
> xfstests. The tests and results are explained in the following
> paragraphs.
>
> In order to test overlayfs with idmapped mounts a simple patch to
> xfstests has been added which is part of this series. The patchset
> simply allows for each test in the xfstests suite to be run on top of
> idmapped mounts. That is in addition to the generic idmapped mount tests
> that have existed and are already run for a long time.
>
> Since idmapped mounts can be created on top of btrfs, ext4, and xfs and
> these are the most relevant filesystems for users they were taken into
> the test matrix.
>
> Amir ensured that the test matrix also includes metacopy. So all tests
> are run once with metacopy=on and once with metacopy=off.
>
> Additionally, the unionmount tesuite that Amir maintains was run as part
> of xfstests. This brings testing for multi layer overlay and a few
> rarely used overlayfs use-cases.
>
> And last, xfstests were run with and without idmapped mounts.
>
> Since the patch series is based on Linux 5.17 the 5.17 kernel was chosen
> as a baseline kernel. The baseline kernel is pure 5.17 upstream without
> any of the patches in this series. The baseline kernel was used to run
> all xfstests with the same parameters minus the idmapped mount part of
> the test matrix. This ensured that regressions would be immediately
> noticeable.
>
> So the full test matrix is:
>
> ext4 x metacopy=off x -idmapped mounts
> ext4 x metacopy=on x -idmapped mounts
> ext4 x metacopy=off x +idmapped mounts
> ext4 x metacopy=on x +idmapped mounts
>
> xfs x metacopy=off x -idmapped mounts
> xfs x metacopy=on x -idmapped mounts
> xfs x metacopy=off x +idmapped mounts
> xfs x metacopy=on x +idmapped mounts
>
> btrfs x metacopy=off x -idmapped mounts
> btrfs x metacopy=on x -idmapped mounts
> btrfs x metacopy=off x +idmapped mounts
> btrfs x metacopy=on x +idmapped mounts
>
> The test runs were started with:
>
> sudo ./check -overlay
>
> so they encompass all xfstests apart from those that needed to be
> excluded because they triggered known issues (One such example is
> generic/530 which causes an xfs corruption that is currently under
> discussion for a fix upstream.).
>
> A single testrun with over 750 tests takes a bonkers long time given
> that I run them in a beefy VM on top of an ssd and not on bare metal.
>
> The successes and failures are identical for the whole test matrix with
> and without idmapped mounts. The tests and failures are also identical
> compared to the baseline kernel. All in all this should provide a good
> amount of convidence. The full test output can be found in the following
> repo: https://gitlab.com/brauner/fs.idmapped.overlay.xfstests.output
>
> In order to create an idmapped mount the mount_setattr() system call
> (see [17]) can be used together with the MOUNT_ATTR_IDMAP flag to attach
> a userns fd to a detached mount created with open_tree()'s
> OPEN_TREE_CLONE flag. The only effect is that the idmapping of the
> userns becomes the idmapping of the relevant. The links below should
> serve to illustrate how widely they are already used.
>
> For people who would like to test I've created a tag fs.idmapped.overlay.v1
> which can be fetched:
>
> git fetch git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git fs.idmapped.overlay.v1
>
> the same goes for xfstests:
>
> git fetch git://git.kernel.org/pub/scm/linux/kernel/git/brauner/xfstests-dev.git fs.idmapped.overlay.v1
>
> Thanks!
> Christian
>
> [1]: OCI runtime-spec extension for idmapped mounts
> https://github.com/opencontainers/runtime-spec/pull/1143
>
> [2]: Support for idmapped mounts for shared volumes
> https://github.com/opencontainers/runc/pull/3429
>
> [3]: containerd support for idmapped mounts with focus on overlayfs
> https://github.com/containerd/containerd/pull/5890
>
> [4]: runC support for idmapped mounts with focus on overlayfs
> https://github.com/opencontainers/runc/issues/2821
>
> [5]: Podman support for idmapped mounts with focus on overlayfs
> https://github.com/containers/podman/issues/10374
>
> [6]: systemd-nspawn support for idmapped mounts
> https://github.com/systemd/systemd/pull/19438
>
> [7]: systemd-homed support for idmapped mounts
> https://github.com/systemd/systemd/pull/21136
>
> [8]: LXD support for idmapped mounts
> https://github.com/lxc/lxd/pull/8778
>
> [9]: LXC support for idmapped mounts
> https://github.com/lxc/lxc/pull/3709
>
> [10]: crun support for idmapped mounts
> https://github.com/containers/crun/pull/874
>
> [11]: https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Sandboxing
>
> [12]: https://github.com/kubernetes/enhancements/pull/3065
>
> [13]: https://github.com/kubernetes/enhancements/pull/3065/files#diff-a9ca9fbce1538447b92f03125ca2b8474e2d875071f1172d2afa0b1e8cadeabaR118
>
> [14]: https://github.com/lxc/lxd/blob/master/doc/instances.md
> (The relevant entry can be found by looking for the
> "security.idmap.isolated" key.)
>
> [15]: https://www.freedesktop.org/software/systemd/man/systemd-sysext.html
>
> [16]: https://github.com/containers/storage/pull/1180
>
> [17]: https://man7.org/linux/man-pages/man2/mount_setattr.2.html
>
> Amir Goldstein (3):
> ovl: use wrappers to all vfs_*xattr() calls
> ovl: pass layer mnt to ovl_open_realfile()
> ovl: store lower path in ovl_inode
>
> Christian Brauner (15):
> fs: add two trivial lookup helpers
> exportfs: support idmapped mounts
> ovl: pass ofs to creation operations
> ovl: handle idmappings in creation operations
> ovl: pass ofs to setattr operations
> ovl: use ovl_do_notify_change() wrapper
> ovl: use ovl_lookup_upper() wrapper
> ovl: use ovl_path_getxattr() wrapper
> ovl: handle idmappings for layer fileattrs
> ovl: handle idmappings for layer lookup
> ovl: use ovl_copy_{real,upper}attr() wrappers
> ovl: handle idmappings in ovl_permission()
> ovl: handle idmappings in layer open helpers
> ovl: handle idmappings in ovl_xattr_{g,s}et()
> ovl: support idmapped layers
>
> fs/exportfs/expfs.c | 5 +-
> fs/namei.c | 52 +++++++--
> fs/overlayfs/copy_up.c | 97 +++++++++-------
> fs/overlayfs/dir.c | 133 +++++++++++-----------
> fs/overlayfs/export.c | 3 +-
> fs/overlayfs/file.c | 44 ++++----
> fs/overlayfs/inode.c | 63 +++++++----
> fs/overlayfs/namei.c | 49 ++++++---
> fs/overlayfs/overlayfs.h | 231 ++++++++++++++++++++++++++++-----------
> fs/overlayfs/ovl_entry.h | 7 +-
> fs/overlayfs/readdir.c | 48 ++++----
> fs/overlayfs/super.c | 57 +++++-----
> fs/overlayfs/util.c | 142 +++++++++++++++++++-----
> include/linux/namei.h | 2 +
> 14 files changed, 607 insertions(+), 326 deletions(-)
>
>
> base-commit: f443e374ae131c168a065ea1748feac6b2e76613
> --
> 2.32.0
>
next prev parent reply other threads:[~2022-03-30 20:58 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-03-29 10:35 [PATCH 00/18] overlay: support idmapped layers Christian Brauner
2022-03-29 10:35 ` [PATCH 01/18] fs: add two trivial lookup helpers Christian Brauner
2022-03-29 10:35 ` [PATCH 02/18] exportfs: support idmapped mounts Christian Brauner
2022-03-29 10:35 ` [PATCH 03/18] ovl: use wrappers to all vfs_*xattr() calls Christian Brauner
2022-03-29 11:18 ` Miklos Szeredi
2022-03-29 10:35 ` [PATCH 04/18] ovl: pass ofs to creation operations Christian Brauner
2022-03-29 10:35 ` [PATCH 05/18] ovl: handle idmappings in " Christian Brauner
2022-03-29 11:22 ` Miklos Szeredi
2022-03-29 10:35 ` [PATCH 06/18] ovl: pass ofs to setattr operations Christian Brauner
2022-03-29 10:35 ` [PATCH 07/18] ovl: pass layer mnt to ovl_open_realfile() Christian Brauner
2022-03-29 10:35 ` [PATCH 08/18] ovl: use ovl_do_notify_change() wrapper Christian Brauner
2022-03-29 11:56 ` Miklos Szeredi
2022-03-29 12:36 ` Christian Brauner
2022-03-29 10:35 ` [PATCH 09/18] ovl: use ovl_lookup_upper() wrapper Christian Brauner
2022-03-29 10:35 ` [PATCH 10/18] ovl: use ovl_path_getxattr() wrapper Christian Brauner
2022-03-29 10:35 ` [PATCH 11/18] ovl: handle idmappings for layer fileattrs Christian Brauner
2022-03-29 10:35 ` [PATCH 12/18] ovl: handle idmappings for layer lookup Christian Brauner
2022-03-29 10:35 ` [PATCH 13/18] ovl: store lower path in ovl_inode Christian Brauner
2022-03-29 10:35 ` [PATCH 14/18] ovl: use ovl_copy_{real,upper}attr() wrappers Christian Brauner
2022-03-29 10:35 ` [PATCH 15/18] ovl: handle idmappings in ovl_permission() Christian Brauner
2022-03-29 10:35 ` [PATCH 16/18] ovl: handle idmappings in layer open helpers Christian Brauner
2022-03-29 10:35 ` [PATCH 17/18] ovl: handle idmappings in ovl_xattr_{g,s}et() Christian Brauner
2022-03-29 10:35 ` [PATCH 18/18] ovl: support idmapped layers Christian Brauner
2022-03-29 10:35 ` [PATCH] common: allow to run all tests on idmapped mounts Christian Brauner
2022-03-29 12:25 ` [PATCH 00/18] overlay: support idmapped layers Miklos Szeredi
2022-03-29 15:02 ` Amir Goldstein
2022-03-30 20:58 ` Vivek Goyal [this message]
2022-03-31 8:47 ` Christian Brauner
2022-03-31 9:55 ` Giuseppe Scrivano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YkTEWy7byDiPAvzc@redhat.com \
--to=vgoyal@redhat.com \
--cc=amir73il@gmail.com \
--cc=brauner@kernel.org \
--cc=cyphar@cyphar.com \
--cc=gscrivan@redhat.com \
--cc=hch@lst.de \
--cc=linux-unionfs@vger.kernel.org \
--cc=luca.boccassi@microsoft.com \
--cc=mszeredi@redhat.com \
--cc=mzxreary@0pointer.de \
--cc=rodrigoca@microsoft.com \
--cc=sforshee@digitalocean.com \
--cc=stgraber@ubuntu.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.