From: Vivek Goyal <vgoyal@redhat.com>
To: Christian Brauner <brauner@kernel.org>
Cc: Seth Forshee <sforshee@digitalocean.com>,
Amir Goldstein <amir73il@gmail.com>,
Miklos Szeredi <mszeredi@redhat.com>,
Christoph Hellwig <hch@lst.de>, Aleksa Sarai <cyphar@cyphar.com>,
linux-unionfs@vger.kernel.org
Subject: Re: [PATCH] ovl: turn of SB_POSIXACL with idmapped layers temporarily
Date: Wed, 6 Jul 2022 11:06:43 -0400 [thread overview]
Message-ID: <YsWlA3Y6M45aPeMW@redhat.com> (raw)
In-Reply-To: <20220706135611.257213-1-brauner@kernel.org>
On Wed, Jul 06, 2022 at 03:56:11PM +0200, Christian Brauner wrote:
> This cycle we added support for mounting overlayfs on top of idmapped mounts.
> Recently I've started looking into potential corner cases when trying to add
> additional tests and I noticed that reporting for POSIX ACLs is currently wrong
> when using idmapped layers with overlayfs mounted on top of it.
Disabling posix ACL support in overlayfs if any of the lower/upper
layers are idmapped mounts for the time being sounds reasonable
to me. Anyway this is a new piece of functionality. Once posix
acl stuff is fixed, then this restriction can be lifted.
Thanks
Vivek
>
> I have sent out an patch that fixes this and makes POSIX ACLs work correctly
> but the patch is a bit bigger and we're already at -rc5 so I recommend we
> simply don't raise SB_POSIXACL when idmapped layers are used. Then we can fix
> the VFS part described below for the next merge window so we can have good
> exposure in -next.
>
> I'm going to give a rather detailed explanation to both the origin of the
> problem and mention the solution so people know what's going on.
>
> Let's assume the user creates the following directory layout and they have a
> rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you would
> expect files on your host system to be owned. For example, ~/.bashrc for your
> regular user would be owned by 1000:1000 and /root/.bashrc would be owned by
> 0:0. IOW, this is just regular boring filesystem tree on an ext4 or xfs
> filesystem.
>
> The user chooses to set POSIX ACLs using the setfacl binary granting the user
> with uid 4 read, write, and execute permissions for their .bashrc file:
>
> setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
>
> Now they to expose the whole rootfs to a container using an idmapped mount. So
> they first create:
>
> mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
> mkdir -pv /vol/contpool/ctrover/{over,work}
> chown 10000000:10000000 /vol/contpool/ctrover/{over,work}
>
> The user now creates an idmapped mount for the rootfs:
>
> mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
> /var/lib/lxc/c2/rootfs \
> /vol/contpool/lowermap
>
> This for example makes it so that /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
> which is owned by uid and gid 1000 as being owned by uid and gid 10001000 at
> /vol/contpool/lowermap/home/ubuntu/.bashrc.
>
> Assume the user wants to expose these idmapped mounts through an overlayfs
> mount to a container.
>
> mount -t overlay overlay \
> -o lowerdir=/vol/contpool/lowermap, \
> upperdir=/vol/contpool/overmap/over, \
> workdir=/vol/contpool/overmap/work \
> /vol/contpool/merge
>
> The user can do this in two ways:
>
> (1) Mount overlayfs in the initial user namespace and expose it to the
> container.
> (2) Mount overlayfs on top of the idmapped mounts inside of the container's
> user namespace.
>
> Let's assume the user chooses the (1) option and mounts overlayfs on the host
> and then changes into a container which uses the idmapping 0:10000000:65536
> which is the same used for the two idmapped mounts.
>
> Now the user tries to retrieve the POSIX ACLs using the getfacl command
>
> getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc
>
> and to their surprise they see:
>
> # file: vol/contpool/merge/home/ubuntu/.bashrc
> # owner: 1000
> # group: 1000
> user::rw-
> user:4294967295:rwx
> group::r--
> mask::rwx
> other::r--
>
> indicating the the uid wasn't correctly translated according to the idmapped
> mount. The problem is how we currently translate POSIX ACLs. Let's inspect the
> callchain in this example:
>
> idmapped mount /vol/contpool/merge: 0:10000000:65536
> caller's idmapping: 0:10000000:65536
> overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
>
> sys_getxattr()
> -> path_getxattr()
> -> getxattr()
> -> do_getxattr()
> |> vfs_getxattr()
> | -> __vfs_getxattr()
> | -> handler->get == ovl_posix_acl_xattr_get()
> | -> ovl_xattr_get()
> | -> vfs_getxattr()
> | -> __vfs_getxattr()
> | -> handler->get() /* lower filesystem callback */
> |> posix_acl_fix_xattr_to_user()
> {
> 4 = make_kuid(&init_user_ns, 4);
> 4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
> /* FAILURE */
> -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
> }
>
> If the user chooses to use option (2) and mounts overlayfs on top of idmapped
> mounts inside the container things don't look that much better:
>
> idmapped mount /vol/contpool/merge: 0:10000000:65536
> caller's idmapping: 0:10000000:65536
> overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
>
> sys_getxattr()
> -> path_getxattr()
> -> getxattr()
> -> do_getxattr()
> |> vfs_getxattr()
> | -> __vfs_getxattr()
> | -> handler->get == ovl_posix_acl_xattr_get()
> | -> ovl_xattr_get()
> | -> vfs_getxattr()
> | -> __vfs_getxattr()
> | -> handler->get() /* lower filesystem callback */
> |> posix_acl_fix_xattr_to_user()
> {
> 4 = make_kuid(&init_user_ns, 4);
> 4 = mapped_kuid_fs(&init_user_ns, 4);
> /* FAILURE */
> -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
> }
>
> As is easily seen the problem arises because the idmapping of the lower mount
> isn't taken into account as all of this happens in do_gexattr(). But
> do_getxattr() is always called on an overlayfs mount and inode and thus cannot
> possible take the idmapping of the lower layers into account.
>
> This problem is similar for fscaps but there the translation happens as part of
> vfs_getxattr() already. Let's walk through an fscaps overlayfs callchain:
>
> setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
>
> The expected outcome here is that we'll receive the cap_net_raw capability as
> we are able to map the uid associated with the fscap to 0 within our container.
> IOW, we want to see 0 as the result of the idmapping translations.
>
> If the user chooses option (1) we get the following callchain for fscaps:
>
> idmapped mount /vol/contpool/merge: 0:10000000:65536
> caller's idmapping: 0:10000000:65536
> overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
>
> sys_getxattr()
> -> path_getxattr()
> -> getxattr()
> -> do_getxattr()
> -> vfs_getxattr()
> -> xattr_getsecurity()
> -> security_inode_getsecurity() ________________________________
> -> cap_inode_getsecurity() | |
> { V |
> 10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000); |
> 10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); |
> /* Expected result is 0 and thus that we own the fscap. */ |
> 0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); |
> } |
> -> vfs_getxattr_alloc() |
> -> handler->get == ovl_other_xattr_get() |
> -> vfs_getxattr() |
> -> xattr_getsecurity() |
> -> security_inode_getsecurity() |
> -> cap_inode_getsecurity() |
> { |
> 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); |
> 10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
> 10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000); |
> |____________________________________________________________________|
> }
> -> vfs_getxattr_alloc()
> -> handler->get == /* lower filesystem callback */
>
> And if the user chooses option (2) we get:
>
> idmapped mount /vol/contpool/merge: 0:10000000:65536
> caller's idmapping: 0:10000000:65536
> overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
>
> sys_getxattr()
> -> path_getxattr()
> -> getxattr()
> -> do_getxattr()
> -> vfs_getxattr()
> -> xattr_getsecurity()
> -> security_inode_getsecurity() _______________________________
> -> cap_inode_getsecurity() | |
> { V |
> 10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0); |
> 10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); |
> /* Expected result is 0 and thus that we own the fscap. */ |
> 0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); |
> } |
> -> vfs_getxattr_alloc() |
> -> handler->get == ovl_other_xattr_get() |
> |-> vfs_getxattr() |
> -> xattr_getsecurity() |
> -> security_inode_getsecurity() |
> -> cap_inode_getsecurity() |
> { |
> 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); |
> 10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
> 0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
> |____________________________________________________________________|
> }
> -> vfs_getxattr_alloc()
> -> handler->get == /* lower filesystem callback */
>
> We can see how the translation happens correctly in those cases as the
> conversion happens within the vfs_getxattr() helper.
>
> For POSIX ACLs we need to do something similar. However, in contrast to fscaps
> we cannot apply the fix directly to the kernel internal posix acl data
> structure as this would alter the cached values and would also require a rework
> of how we currently deal with POSIX ACLs in general which almost never take the
> filesystem idmapping into account (the noteable exception being FUSE but even
> there the implementation is special) and instead retrieve the raw values based
> on the initial idmapping.
>
> The correct values are then generated right before returning to userspace. The
> fix for this is to move taking the mount's idmapping into account directly in
> vfs_getxattr() instead of having it be part of posix_acl_fix_xattr_to_user().
>
> To this end we simply move the idmapped mount translation into a separate step
> performed in vfs_{g,s}etxattr() instead of in
> posix_acl_fix_xattr_{from,to}_user().
>
> To see how this fixes things let's go back to the original example. Assume the
> user chose option (1) and mounted overlayfs on top of idmapped mounts on the
> host:
>
> idmapped mount /vol/contpool/merge: 0:10000000:65536
> caller's idmapping: 0:10000000:65536
> overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
>
> sys_getxattr()
> -> path_getxattr()
> -> getxattr()
> -> do_getxattr()
> |> vfs_getxattr()
> | |> __vfs_getxattr()
> | | -> handler->get == ovl_posix_acl_xattr_get()
> | | -> ovl_xattr_get()
> | | -> vfs_getxattr()
> | | |> __vfs_getxattr()
> | | | -> handler->get() /* lower filesystem callback */
> | | |> posix_acl_getxattr_idmapped_mnt()
> | | {
> | | 4 = make_kuid(&init_user_ns, 4);
> | | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
> | | 10000004 = from_kuid(&init_user_ns, 10000004);
> | | |_______________________
> | | } |
> | | |
> | |> posix_acl_getxattr_idmapped_mnt() |
> | { |
> | V
> | 10000004 = make_kuid(&init_user_ns, 10000004);
> | 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
> | 10000004 = from_kuid(&init_user_ns, 10000004);
> | } |_________________________________________________
> | |
> | |
> |> posix_acl_fix_xattr_to_user() |
> { V
> 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
> /* SUCCESS */
> 4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
> }
>
> And similarly if the user chooses option (1) and mounted overayfs on top of
> idmapped mounts inside the container:
>
> idmapped mount /vol/contpool/merge: 0:10000000:65536
> caller's idmapping: 0:10000000:65536
> overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
>
> sys_getxattr()
> -> path_getxattr()
> -> getxattr()
> -> do_getxattr()
> |> vfs_getxattr()
> | |> __vfs_getxattr()
> | | -> handler->get == ovl_posix_acl_xattr_get()
> | | -> ovl_xattr_get()
> | | -> vfs_getxattr()
> | | |> __vfs_getxattr()
> | | | -> handler->get() /* lower filesystem callback */
> | | |> posix_acl_getxattr_idmapped_mnt()
> | | {
> | | 4 = make_kuid(&init_user_ns, 4);
> | | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
> | | 10000004 = from_kuid(&init_user_ns, 10000004);
> | | |_______________________
> | | } |
> | | |
> | |> posix_acl_getxattr_idmapped_mnt() |
> | { V
> | 10000004 = make_kuid(&init_user_ns, 10000004);
> | 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
> | 10000004 = from_kuid(0(&init_user_ns, 10000004);
> | |_________________________________________________
> | } |
> | |
> |> posix_acl_fix_xattr_to_user() |
> { V
> 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
> /* SUCCESS */
> 4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
> }
>
> The last remaining problem we need to fix here is ovl_get_acl(). During
> ovl_permission() overlayfs will call:
>
> ovl_permission()
> -> generic_permission()
> -> acl_permission_check()
> -> check_acl()
> -> get_acl()
> -> inode->i_op->get_acl() == ovl_get_acl()
> > get_acl() /* on the underlying filesystem)
> ->inode->i_op->get_acl() == /*lower filesystem callback */
> -> posix_acl_permission()
>
> passing through the get_acl request to the underlying filesystem. This will
> retrieve the acls stored in the lower filesystem without taking the idmapping
> of the underlying mount into account as this would mean altering the cached
> values for the lower filesystem. The simple solution is to have ovl_get_acl()
> simply duplicate the ACLs, update the values according to the idmapped mount
> and return it to acl_permission_check() so it can be used in
> posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be released
> right after.
>
> Link: https://github.com/brauner/mount-idmapped/issues/9
> Cc: Seth Forshee <sforshee@digitalocean.com>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Cc: linux-unionfs@vger.kernel.org
> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
> ---
> Hey Miklos,
>
> I describe in detail how I'm going to fix this with the series I intend
> to get ready for the next merge window in the commit message.
>
> I would just turn off POSIX ACLs until then. Would you be ok with that
> and route this to Linus this or next week?
>
> Thanks!
> Christian
> ---
> Documentation/filesystems/overlayfs.rst | 4 ++++
> fs/overlayfs/super.c | 21 ++++++++++++++++++++-
> 2 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> index 7da6c30ed596..316cfd8b1891 100644
> --- a/Documentation/filesystems/overlayfs.rst
> +++ b/Documentation/filesystems/overlayfs.rst
> @@ -466,6 +466,10 @@ overlay filesystem and the value of st_ino for filesystem objects may not be
> persistent and could change even while the overlay filesystem is mounted, as
> summarized in the `Inode properties`_ table above.
>
> +4) "idmapped mounts"
> +When the upper or lower layers are idmapped mounts overlayfs will be mounted
> +without support for POSIX Access Control Lists (ACLs). This limitation will
> +eventually be lifted.
>
> Changes to underlying filesystems
> ---------------------------------
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index e0a2e0468ee7..d21b61570cd1 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -1960,6 +1960,22 @@ static struct dentry *ovl_get_root(struct super_block *sb,
> return root;
> }
>
> +static bool ovl_has_idmapped_layers(struct ovl_fs *ofs)
> +{
> + const struct vfsmount *mnt = ovl_upper_mnt(ofs);
> + unsigned int i;
> +
> + if (mnt && is_idmapped_mnt(mnt))
> + return true;
> +
> + for (i = 1; i < ofs->numlayer; i++) {
> + if (is_idmapped_mnt(ofs->layers[i].mnt))
> + return true;
> + }
> +
> + return false;
> +}
> +
> static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> {
> struct path upperpath = { };
> @@ -2129,7 +2145,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> sb->s_xattr = ofs->config.userxattr ? ovl_user_xattr_handlers :
> ovl_trusted_xattr_handlers;
> sb->s_fs_info = ofs;
> - sb->s_flags |= SB_POSIXACL;
> + if (ovl_has_idmapped_layers(ofs))
> + pr_warn("POSIX ACLs are not yet supported with idmapped layers, mounting without ACL support.\n");
> + else
> + sb->s_flags |= SB_POSIXACL;
> sb->s_iflags |= SB_I_SKIP_SYNC;
>
> err = -ENOMEM;
>
> base-commit: 88084a3df1672e131ddc1b4e39eeacfd39864acf
> --
> 2.34.1
>
next prev parent reply other threads:[~2022-07-06 15:07 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-07-06 13:56 [PATCH] ovl: turn of SB_POSIXACL with idmapped layers temporarily Christian Brauner
2022-07-06 15:06 ` Vivek Goyal [this message]
2022-07-07 7:58 ` Miklos Szeredi
2022-07-07 10:33 ` Christian Brauner
2022-07-08 13:54 ` Miklos Szeredi
2022-07-08 13:58 ` Christian Brauner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YsWlA3Y6M45aPeMW@redhat.com \
--to=vgoyal@redhat.com \
--cc=amir73il@gmail.com \
--cc=brauner@kernel.org \
--cc=cyphar@cyphar.com \
--cc=hch@lst.de \
--cc=linux-unionfs@vger.kernel.org \
--cc=mszeredi@redhat.com \
--cc=sforshee@digitalocean.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox