public inbox for linux-unionfs@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [RFC PATCH 0/6] shiftfs fixes and enhancements
       [not found] <20181101214856.4563-1-seth.forshee@canonical.com>
@ 2018-11-02  8:59 ` Amir Goldstein
  2018-11-02 12:26   ` Seth Forshee
       [not found] ` <20181101214856.4563-7-seth.forshee@canonical.com>
  1 sibling, 1 reply; 7+ messages in thread
From: Amir Goldstein @ 2018-11-02  8:59 UTC (permalink / raw)
  To: Seth Forshee
  Cc: linux-fsdevel, linux-kernel, Linux Containers, James Bottomley,
	overlayfs, Miklos Szeredi

[cc: linux-unionfs
It should the mailing list for *all* "stacking fs".
We have a lot of common problems I think ;-) ]

On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canonical.com> wrote:
>
> I've done some work to fix and enhance shiftfs for a number of use
> cases, so that we would have an idea what a more full-featured shiftfs
> would look like. I'm intending for these to serve as a point of
> reference for discussing id shifting mounts/filesystems at plumbers in a
> couple of weeks [1].
>
> Note that these are based on 4.18, and I've added a small fix to James'
> most recent patch to fix a build issue there. To work with 4.19 they
> will need a number of updates due to changes in the vfs.
>

Seth,

I like the way you addressed my concerns about nesting and stacking depth.
Will provide specific nits on patch.

In preparation to the Plumbers talk (which I will not be attending), I wanted to
get your opinion on the matters I brought up last time:
https://marc.info/?l=linux-fsdevel&m=153013920904844&w=2

1) Having seen what it takes to catch up with overlayfs w.r.t inotify bugs
and having peeked into 4.19 to see what work you still have lined up for you
to bring shitfs up to speed with vfs, did you have time to look into my proposal
for sharing code with overlayfs in the manner that I have implemented the
snapshotfs POC?
https://github.com/amir73il/linux/commit/25416757f2ca47759f59b115e6461b11898c4f06

Even if you end up not saving a single line of code for shiftfs v1
meaning that all shiftfs_inode_ops are completely separate implementation
from overlayfs inode ops, this may still be beneficial to shitfs in
the long run.
For example, you may, in fact, won't need to change anything to work with v4.19.
shittfs (as an overlayfs alias) would use ovl_file_operations and
shiftfs_inode_ops.

Another example, from the top of my head, see what it took to add NFS export
support to snapshotfs, because of the code reuse with overlayfs:
https://github.com/amir73il/linux/commit/d082eb615133490ec26fa2efaa80ed4723860893
Its practically the exact same implementation shiftfs would need,
so in the far future, shitfs and snapshotfs can share the same
export_operations.

2) Regarding this part:
+               /*
+                * this part is visible unshifted, so make sure no
+                * executables that could be used to give suid
+                * privileges
+                */
+               sb->s_iflags = SB_I_NOEXEC;

Why would you want to make the unshifted fs visible at all?
Is there a requirement for container users to access the unshifted fs
content? Is there a requirement for container admin to mount shitfted fs
NOT from the root of the marked mount?

If those are not required, then I propose NOOP inode operations for
the unshifted fs, specifically, empty readdir, just enough ops to be able
to use the mark mount point as the shitfed mount source - no more.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts
       [not found] ` <20181101214856.4563-7-seth.forshee@canonical.com>
@ 2018-11-02 10:02   ` Amir Goldstein
  2018-11-02 12:44     ` Seth Forshee
  0 siblings, 1 reply; 7+ messages in thread
From: Amir Goldstein @ 2018-11-02 10:02 UTC (permalink / raw)
  To: Seth Forshee
  Cc: linux-fsdevel, linux-kernel, Linux Containers, James Bottomley,
	overlayfs

On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canonical.com> wrote:
>
> shiftfs mounts cannot be nested for two reasons -- global
> CAP_SYS_ADMIN is required to set up a mark mount, and a single
> functional shiftfs mount meets the filesystem stacking depth
> limit.
>
> The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> ids in a mount must be within that mount's s_user_ns, so all that
> is needed is CAP_SYS_ADMIN within that s_user_ns.
>
> The stack depth issue can be worked around with a couple of
> adjustments. First, a mark mount doesn't really need to count
> against the stacking depth as it doesn't contribute to the call
> stack depth during filesystem operations. Therefore the mount
> over the mark mount only needs to count as one more than the
> lower filesystems stack depth.

That's true, but it also highlights the point that the "mark" sb is
completely unneeded and it really is just a nice little hack.
All the information it really stores is a lower mount reference,
a lower root dentry and a declarative statement "I am shiftable!".

Come to think about it, "container shiftable" really isn't that different from
NFS export with "no_root_squash" and auto mounted USB drive.
I mean the shifting itself is different of course, but the
declaration, not so much.
If I am allowing sudoers on another machine to mess with root owned
files visible
on my machine, I am pretty much have the same issues as container admins
accessing root owned files on my init_user_ns filesystem. In all those cases,
I'd better not be executing suid binaries from the untrusted "external" source.

Instead of mounting a dummy filesystem to make the declaration, you could
get the same thing with:
   mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
or whatnot)  constant to uapi and manage to come up good man page description.

Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
avoid the extra bind mount step (for a full filesystem tree export).
Declaring a mounted image MS_EXTERN has merits on its own even without
containers and shitfs, for example with pluggable storage. Other LSMs could make
good use of that declaration.

>
> Second, when the lower mount is shiftfs we can just skip down to
> that mount's lower filesystem and shift ids relative to that.
> There is no reason to shift ids twice, and the lower path has
> already been marked safe for id shifting by a user privileged
> towards all ids in that mount's user ns.
>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> ---
>  fs/shiftfs.c | 68 +++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/fs/shiftfs.c b/fs/shiftfs.c
> index b19af7b2fe75..008ace2842b9 100644
> --- a/fs/shiftfs.c
> +++ b/fs/shiftfs.c
> @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
>         struct shiftfs_data *data = raw_data;
>         char *name = kstrdup(data->path, GFP_KERNEL);
>         int err = -ENOMEM;
> -       struct shiftfs_super_info *ssi = NULL;
> +       struct shiftfs_super_info *ssi = NULL, *mp_ssi;
>         struct path path;
>         struct dentry *dentry;
>
> @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
>         if (err)
>                 goto out;
>
> -       /* to mark a mount point, must be real root */
> -       if (ssi->mark && !capable(CAP_SYS_ADMIN))
> -               goto out;
> -
> -       /* else to mount a mark, must be userns admin */
> +       /* to mount a mark, must be userns admin */
>         if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
>                 goto out;

Isn't this check performed by vfs anyway? i.e. in mount_nodev() -> sget()

>
> @@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
>
>         if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
>                 err = -ENOTDIR;
> -               goto out_put;
> -       }
> -
> -       sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
> -       if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> -               printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
> -               err = -EINVAL;
> -               goto out_put;
> +               goto out_put_path;
>         }
>
>         if (ssi->mark) {
> +               struct super_block *lower_sb = path.mnt->mnt_sb;
> +
> +               /* to mark a mount point, must root wrt lower s_user_ns */
> +               if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN))
> +                       goto out_put_path;
> +
> +
>                 /*
>                  * this part is visible unshifted, so make sure no
>                  * executables that could be used to give suid
>                  * privileges
>                  */
>                 sb->s_iflags = SB_I_NOEXEC;

As commented on cover letter, why allow access to any files besides root at all?
In fact, the only justification for a dummy sb (instead of bind mount with
MS_EXTERN flag) would be in order to override inode operations with noop ops
to prevent access to unshifted files from within container.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 0/6] shiftfs fixes and enhancements
  2018-11-02  8:59 ` [RFC PATCH 0/6] shiftfs fixes and enhancements Amir Goldstein
@ 2018-11-02 12:26   ` Seth Forshee
  0 siblings, 0 replies; 7+ messages in thread
From: Seth Forshee @ 2018-11-02 12:26 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, linux-kernel, Linux Containers, James Bottomley,
	overlayfs, Miklos Szeredi

On Fri, Nov 02, 2018 at 10:59:38AM +0200, Amir Goldstein wrote:
> [cc: linux-unionfs
> It should the mailing list for *all* "stacking fs".
> We have a lot of common problems I think ;-) ]
> 
> On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canonical.com> wrote:
> >
> > I've done some work to fix and enhance shiftfs for a number of use
> > cases, so that we would have an idea what a more full-featured shiftfs
> > would look like. I'm intending for these to serve as a point of
> > reference for discussing id shifting mounts/filesystems at plumbers in a
> > couple of weeks [1].
> >
> > Note that these are based on 4.18, and I've added a small fix to James'
> > most recent patch to fix a build issue there. To work with 4.19 they
> > will need a number of updates due to changes in the vfs.
> >
> 
> Seth,
> 
> I like the way you addressed my concerns about nesting and stacking depth.
> Will provide specific nits on patch.
> 
> In preparation to the Plumbers talk (which I will not be attending), I wanted to
> get your opinion on the matters I brought up last time:
> https://marc.info/?l=linux-fsdevel&m=153013920904844&w=2

I want the session at plumbers to not be a "talk" but more of a
discussion of the sorts of things you raise below. But I'm also happy to
talk about them here.

> 1) Having seen what it takes to catch up with overlayfs w.r.t inotify bugs
> and having peeked into 4.19 to see what work you still have lined up for you
> to bring shitfs up to speed with vfs, did you have time to look into my proposal
> for sharing code with overlayfs in the manner that I have implemented the
> snapshotfs POC?
> https://github.com/amir73il/linux/commit/25416757f2ca47759f59b115e6461b11898c4f06
> 
> Even if you end up not saving a single line of code for shiftfs v1
> meaning that all shiftfs_inode_ops are completely separate implementation
> from overlayfs inode ops, this may still be beneficial to shitfs in
> the long run.
> For example, you may, in fact, won't need to change anything to work with v4.19.
> shittfs (as an overlayfs alias) would use ovl_file_operations and
> shiftfs_inode_ops.

I don't recall seeing the shapshotfs patches before. If id shifting
remains an overlay-style fs and not a feature of the vfs, then I
absolutely think something like this will make life much easier.

> Another example, from the top of my head, see what it took to add NFS export
> support to snapshotfs, because of the code reuse with overlayfs:
> https://github.com/amir73il/linux/commit/d082eb615133490ec26fa2efaa80ed4723860893
> Its practically the exact same implementation shiftfs would need,
> so in the far future, shitfs and snapshotfs can share the same
> export_operations.
> 
> 2) Regarding this part:
> +               /*
> +                * this part is visible unshifted, so make sure no
> +                * executables that could be used to give suid
> +                * privileges
> +                */
> +               sb->s_iflags = SB_I_NOEXEC;
> 
> Why would you want to make the unshifted fs visible at all?
> Is there a requirement for container users to access the unshifted fs
> content? Is there a requirement for container admin to mount shitfted fs
> NOT from the root of the marked mount?
> 
> If those are not required, then I propose NOOP inode operations for
> the unshifted fs, specifically, empty readdir, just enough ops to be able
> to use the mark mount point as the shitfed mount source - no more.

This is part of the original implementation that I didn't touch with
these updates. Imo the mark mount is kind of kludgy, and I'd like to see
it done a different way.

A couple of alternatives have been suggested. One was to use xattrs for
marking, or I did a PoC with an older version of the new mount API
patches where an fsfd was passed to the less privileged context that it
could attach to its mount tree:

https://lkml.kernel.org/r/20180717133847.GB15620@ubuntu-xps13

Either of these can accomplish the same things as the mark mount with
better control over who can create an id-shifted mount of the subtree.

However if the mark mount is kept then no-op inode operations seems
reasonable to me.

Thanks,
Seth

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts
  2018-11-02 10:02   ` [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts Amir Goldstein
@ 2018-11-02 12:44     ` Seth Forshee
  2018-11-02 13:16       ` Amir Goldstein
  0 siblings, 1 reply; 7+ messages in thread
From: Seth Forshee @ 2018-11-02 12:44 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, linux-kernel, Linux Containers, James Bottomley,
	overlayfs

On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canonical.com> wrote:
> >
> > shiftfs mounts cannot be nested for two reasons -- global
> > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > functional shiftfs mount meets the filesystem stacking depth
> > limit.
> >
> > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > ids in a mount must be within that mount's s_user_ns, so all that
> > is needed is CAP_SYS_ADMIN within that s_user_ns.
> >
> > The stack depth issue can be worked around with a couple of
> > adjustments. First, a mark mount doesn't really need to count
> > against the stacking depth as it doesn't contribute to the call
> > stack depth during filesystem operations. Therefore the mount
> > over the mark mount only needs to count as one more than the
> > lower filesystems stack depth.
> 
> That's true, but it also highlights the point that the "mark" sb is
> completely unneeded and it really is just a nice little hack.
> All the information it really stores is a lower mount reference,
> a lower root dentry and a declarative statement "I am shiftable!".

Seems I should have saved some of the things I said in my previous
response for this one. As you no doubt gleaned from that email, I agree
with this.

> Come to think about it, "container shiftable" really isn't that different from
> NFS export with "no_root_squash" and auto mounted USB drive.
> I mean the shifting itself is different of course, but the
> declaration, not so much.
> If I am allowing sudoers on another machine to mess with root owned
> files visible
> on my machine, I am pretty much have the same issues as container admins
> accessing root owned files on my init_user_ns filesystem. In all those cases,
> I'd better not be executing suid binaries from the untrusted "external" source.
> 
> Instead of mounting a dummy filesystem to make the declaration, you could
> get the same thing with:
>    mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
> or whatnot)  constant to uapi and manage to come up good man page description.
> 
> Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
> avoid the extra bind mount step (for a full filesystem tree export).
> Declaring a mounted image MS_EXTERN has merits on its own even without
> containers and shitfs, for example with pluggable storage. Other LSMs could make
> good use of that declaration.

I'm missing how we figure out the target user ns in this scheme. We need
a context with privileges towards the source path's s_user_ns to say
it's okay to shift ids for the files under the source path, and then we
need a target user ns for the id shifts. Currently the target is
current_user_ns when the final shiftfs mount is created.

So, how do we determine the target s_user_ns in your scheme?

> 
> >
> > Second, when the lower mount is shiftfs we can just skip down to
> > that mount's lower filesystem and shift ids relative to that.
> > There is no reason to shift ids twice, and the lower path has
> > already been marked safe for id shifting by a user privileged
> > towards all ids in that mount's user ns.
> >
> > Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> > ---
> >  fs/shiftfs.c | 68 +++++++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 46 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/shiftfs.c b/fs/shiftfs.c
> > index b19af7b2fe75..008ace2842b9 100644
> > --- a/fs/shiftfs.c
> > +++ b/fs/shiftfs.c
> > @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> >         struct shiftfs_data *data = raw_data;
> >         char *name = kstrdup(data->path, GFP_KERNEL);
> >         int err = -ENOMEM;
> > -       struct shiftfs_super_info *ssi = NULL;
> > +       struct shiftfs_super_info *ssi = NULL, *mp_ssi;
> >         struct path path;
> >         struct dentry *dentry;
> >
> > @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> >         if (err)
> >                 goto out;
> >
> > -       /* to mark a mount point, must be real root */
> > -       if (ssi->mark && !capable(CAP_SYS_ADMIN))
> > -               goto out;
> > -
> > -       /* else to mount a mark, must be userns admin */
> > +       /* to mount a mark, must be userns admin */
> >         if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> >                 goto out;
> 
> Isn't this check performed by vfs anyway? i.e. in mount_nodev() -> sget()

Yeah, I noticed that too. I left it in for the moment to emphasize the
change I was making, but it can be removed.

> 
> >
> > @@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> >
> >         if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
> >                 err = -ENOTDIR;
> > -               goto out_put;
> > -       }
> > -
> > -       sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
> > -       if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> > -               printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
> > -               err = -EINVAL;
> > -               goto out_put;
> > +               goto out_put_path;
> >         }
> >
> >         if (ssi->mark) {
> > +               struct super_block *lower_sb = path.mnt->mnt_sb;
> > +
> > +               /* to mark a mount point, must root wrt lower s_user_ns */
> > +               if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN))
> > +                       goto out_put_path;
> > +
> > +
> >                 /*
> >                  * this part is visible unshifted, so make sure no
> >                  * executables that could be used to give suid
> >                  * privileges
> >                  */
> >                 sb->s_iflags = SB_I_NOEXEC;
> 
> As commented on cover letter, why allow access to any files besides root at all?
> In fact, the only justification for a dummy sb (instead of bind mount with
> MS_EXTERN flag) would be in order to override inode operations with noop ops
> to prevent access to unshifted files from within container.

Summarizing my response to the other message, if the mark mount is kept
(and I would prefer that it isn't kept) that seems reasonable to me.

Thanks,
Seth

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts
  2018-11-02 12:44     ` Seth Forshee
@ 2018-11-02 13:16       ` Amir Goldstein
  2018-11-02 13:47         ` Seth Forshee
  2018-11-02 16:57         ` James Bottomley
  0 siblings, 2 replies; 7+ messages in thread
From: Amir Goldstein @ 2018-11-02 13:16 UTC (permalink / raw)
  To: Seth Forshee
  Cc: linux-fsdevel, linux-kernel, Linux Containers, James Bottomley,
	overlayfs

On Fri, Nov 2, 2018 at 2:44 PM Seth Forshee <seth.forshee@canonical.com> wrote:
>
> On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canonical.com> wrote:
> > >
> > > shiftfs mounts cannot be nested for two reasons -- global
> > > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > > functional shiftfs mount meets the filesystem stacking depth
> > > limit.
> > >
> > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > > ids in a mount must be within that mount's s_user_ns, so all that
> > > is needed is CAP_SYS_ADMIN within that s_user_ns.
> > >
> > > The stack depth issue can be worked around with a couple of
> > > adjustments. First, a mark mount doesn't really need to count
> > > against the stacking depth as it doesn't contribute to the call
> > > stack depth during filesystem operations. Therefore the mount
> > > over the mark mount only needs to count as one more than the
> > > lower filesystems stack depth.
> >
> > That's true, but it also highlights the point that the "mark" sb is
> > completely unneeded and it really is just a nice little hack.
> > All the information it really stores is a lower mount reference,
> > a lower root dentry and a declarative statement "I am shiftable!".
>
> Seems I should have saved some of the things I said in my previous
> response for this one. As you no doubt gleaned from that email, I agree
> with this.
>
> > Come to think about it, "container shiftable" really isn't that different from
> > NFS export with "no_root_squash" and auto mounted USB drive.
> > I mean the shifting itself is different of course, but the
> > declaration, not so much.
> > If I am allowing sudoers on another machine to mess with root owned
> > files visible
> > on my machine, I am pretty much have the same issues as container admins
> > accessing root owned files on my init_user_ns filesystem. In all those cases,
> > I'd better not be executing suid binaries from the untrusted "external" source.
> >
> > Instead of mounting a dummy filesystem to make the declaration, you could
> > get the same thing with:
> >    mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> > and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
> > or whatnot)  constant to uapi and manage to come up good man page description.
> >
> > Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
> > avoid the extra bind mount step (for a full filesystem tree export).
> > Declaring a mounted image MS_EXTERN has merits on its own even without
> > containers and shitfs, for example with pluggable storage. Other LSMs could make
> > good use of that declaration.
>
> I'm missing how we figure out the target user ns in this scheme. We need
> a context with privileges towards the source path's s_user_ns to say
> it's okay to shift ids for the files under the source path, and then we
> need a target user ns for the id shifts. Currently the target is
> current_user_ns when the final shiftfs mount is created.
>
> So, how do we determine the target s_user_ns in your scheme?
>

Unless I am completely misunderstanding shiftfs, I think we are saying the
same thing. You said you wish to get rid of the "mark" fs and that you had
a POC of implementing the "mark" using xattr.
I'm just saying another option to implement the mark is using a super block
flag and you get the target s_user_ns from mnt_sb.
I did miss the fact that a mount flag is not enough, so that makes the bind
mount concept fail. Unless, maybe, the mount in the container is a slave
mount and the "shiftable" mark is set on the master.
I know too little about mount ns vs. user ns to tell if any of this makes sense.

Feel free to ignore MS_EXTERN idea.

Hopefully, mount API v2 will provide the proper facility to implement mark.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts
  2018-11-02 13:16       ` Amir Goldstein
@ 2018-11-02 13:47         ` Seth Forshee
  2018-11-02 16:57         ` James Bottomley
  1 sibling, 0 replies; 7+ messages in thread
From: Seth Forshee @ 2018-11-02 13:47 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, linux-kernel, Linux Containers, James Bottomley,
	overlayfs

On Fri, Nov 02, 2018 at 03:16:05PM +0200, Amir Goldstein wrote:
> On Fri, Nov 2, 2018 at 2:44 PM Seth Forshee <seth.forshee@canonical.com> wrote:
> >
> > On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> > > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canonical.com> wrote:
> > > >
> > > > shiftfs mounts cannot be nested for two reasons -- global
> > > > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > > > functional shiftfs mount meets the filesystem stacking depth
> > > > limit.
> > > >
> > > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > > > ids in a mount must be within that mount's s_user_ns, so all that
> > > > is needed is CAP_SYS_ADMIN within that s_user_ns.
> > > >
> > > > The stack depth issue can be worked around with a couple of
> > > > adjustments. First, a mark mount doesn't really need to count
> > > > against the stacking depth as it doesn't contribute to the call
> > > > stack depth during filesystem operations. Therefore the mount
> > > > over the mark mount only needs to count as one more than the
> > > > lower filesystems stack depth.
> > >
> > > That's true, but it also highlights the point that the "mark" sb is
> > > completely unneeded and it really is just a nice little hack.
> > > All the information it really stores is a lower mount reference,
> > > a lower root dentry and a declarative statement "I am shiftable!".
> >
> > Seems I should have saved some of the things I said in my previous
> > response for this one. As you no doubt gleaned from that email, I agree
> > with this.
> >
> > > Come to think about it, "container shiftable" really isn't that different from
> > > NFS export with "no_root_squash" and auto mounted USB drive.
> > > I mean the shifting itself is different of course, but the
> > > declaration, not so much.
> > > If I am allowing sudoers on another machine to mess with root owned
> > > files visible
> > > on my machine, I am pretty much have the same issues as container admins
> > > accessing root owned files on my init_user_ns filesystem. In all those cases,
> > > I'd better not be executing suid binaries from the untrusted "external" source.
> > >
> > > Instead of mounting a dummy filesystem to make the declaration, you could
> > > get the same thing with:
> > >    mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> > > and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
> > > or whatnot)  constant to uapi and manage to come up good man page description.
> > >
> > > Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
> > > avoid the extra bind mount step (for a full filesystem tree export).
> > > Declaring a mounted image MS_EXTERN has merits on its own even without
> > > containers and shitfs, for example with pluggable storage. Other LSMs could make
> > > good use of that declaration.
> >
> > I'm missing how we figure out the target user ns in this scheme. We need
> > a context with privileges towards the source path's s_user_ns to say
> > it's okay to shift ids for the files under the source path, and then we
> > need a target user ns for the id shifts. Currently the target is
> > current_user_ns when the final shiftfs mount is created.
> >
> > So, how do we determine the target s_user_ns in your scheme?
> >
> 
> Unless I am completely misunderstanding shiftfs, I think we are saying the
> same thing. You said you wish to get rid of the "mark" fs and that you had
> a POC of implementing the "mark" using xattr.

The PoC was using filesystem contexts and (an earlier version of) the
new mount API, but yes I do think we're in agreement about the mark fs
being awkward.

> I'm just saying another option to implement the mark is using a super block
> flag and you get the target s_user_ns from mnt_sb.
> I did miss the fact that a mount flag is not enough, so that makes the bind
> mount concept fail. Unless, maybe, the mount in the container is a slave
> mount and the "shiftable" mark is set on the master.
> I know too little about mount ns vs. user ns to tell if any of this makes sense.

We need a source and target user ns for the shifting, currently they are
the lower filesystem's s_user_ns and the s_user_ns of the shiftfs fs
(which is current_user_ns at the time the real shiftfs fs is mounted). I
think doing it this way makes sense.

The problem with the slave mount idea is that s_user_ns for the shiftfs
sb is still not the target ns for the id shifting, which is going to
cause all kinds of problems. Unless all of this is done in the vfs, then
it could really be a bind mount. The shifting could be done based on the
mount namespace's user_ns and all permission checks in the vfs could
take this into account. This approach certainly has benefits and is one
of the things I want to discuss.

> Feel free to ignore MS_EXTERN idea.
> 
> Hopefully, mount API v2 will provide the proper facility to implement mark.

The approach I took was to have the source user_ns context set up an fd
that was passed to the target user_ns context. Then that fd was used to
attach the mount to the mount tree, and when that happened the target
s_user_ns was set. It worked, but there were some awkware bits. I
haven't kept up with the changes since then to know if things are any
better or worse with the current patches.

Thanks,
Seth

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts
  2018-11-02 13:16       ` Amir Goldstein
  2018-11-02 13:47         ` Seth Forshee
@ 2018-11-02 16:57         ` James Bottomley
  1 sibling, 0 replies; 7+ messages in thread
From: James Bottomley @ 2018-11-02 16:57 UTC (permalink / raw)
  To: Amir Goldstein, Seth Forshee
  Cc: linux-fsdevel, linux-kernel, Linux Containers, overlayfs

On Fri, 2018-11-02 at 15:16 +0200, Amir Goldstein wrote:
> On Fri, Nov 2, 2018 at 2:44 PM Seth Forshee <seth.forshee@canonical.c
> om> wrote:
> > 
> > On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> > > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canoni
> > > cal.com> wrote:
> > > > 
> > > > shiftfs mounts cannot be nested for two reasons -- global
> > > > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > > > functional shiftfs mount meets the filesystem stacking depth
> > > > limit.
> > > > 
> > > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > > > ids in a mount must be within that mount's s_user_ns, so all
> > > > that
> > > > is needed is CAP_SYS_ADMIN within that s_user_ns.
> > > > 
> > > > The stack depth issue can be worked around with a couple of
> > > > adjustments. First, a mark mount doesn't really need to count
> > > > against the stacking depth as it doesn't contribute to the call
> > > > stack depth during filesystem operations. Therefore the mount
> > > > over the mark mount only needs to count as one more than the
> > > > lower filesystems stack depth.
> > > 
> > > That's true, but it also highlights the point that the "mark" sb
> > > is
> > > completely unneeded and it really is just a nice little hack.
> > > All the information it really stores is a lower mount reference,
> > > a lower root dentry and a declarative statement "I am
> > > shiftable!".
> > 
> > Seems I should have saved some of the things I said in my previous
> > response for this one. As you no doubt gleaned from that email, I
> > agree
> > with this.
> > 
> > > Come to think about it, "container shiftable" really isn't that
> > > different from
> > > NFS export with "no_root_squash" and auto mounted USB drive.
> > > I mean the shifting itself is different of course, but the
> > > declaration, not so much.
> > > If I am allowing sudoers on another machine to mess with root
> > > owned
> > > files visible
> > > on my machine, I am pretty much have the same issues as container
> > > admins
> > > accessing root owned files on my init_user_ns filesystem. In all
> > > those cases,
> > > I'd better not be executing suid binaries from the untrusted
> > > "external" source.
> > > 
> > > Instead of mounting a dummy filesystem to make the declaration,
> > > you could
> > > get the same thing with:
> > >    mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> > > and all you need to do is add MS_EXTERN (MS_SHIFTABLE
> > > MS_UNTRUSTED
> > > or whatnot)  constant to uapi and manage to come up good man page
> > > description.
> > > 
> > > Then users could actually mount a filesystem in init_user_ns
> > > MS_EXTERN and
> > > avoid the extra bind mount step (for a full filesystem tree
> > > export).
> > > Declaring a mounted image MS_EXTERN has merits on its own even
> > > without
> > > containers and shitfs, for example with pluggable storage. Other
> > > LSMs could make
> > > good use of that declaration.
> > 
> > I'm missing how we figure out the target user ns in this scheme. We
> > need
> > a context with privileges towards the source path's s_user_ns to
> > say
> > it's okay to shift ids for the files under the source path, and
> > then we
> > need a target user ns for the id shifts. Currently the target is
> > current_user_ns when the final shiftfs mount is created.
> > 
> > So, how do we determine the target s_user_ns in your scheme?
> > 
> 
> Unless I am completely misunderstanding shiftfs, I think we are
> saying the same thing. You said you wish to get rid of the "mark" fs
> and that you had a POC of implementing the "mark" using xattr.

I've got one of these too ... it works nicely.

> I'm just saying another option to implement the mark is using a super
> block flag and you get the target s_user_ns from mnt_sb.

This works a lot less well because the entire mount becomes shiftable,
not just a subtree, which is suboptimal for the unprivileged use case. 
The idea for getting around this was the one Ted mentioned of attaching
properties to the vfsmount structure (he'd like to use this for case
insensitive name comparisons in subtrees), but that requires
rethreading quite a few vfs calls to take a struct path instead of a
struct dentry.

James

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-11-02 16:57 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20181101214856.4563-1-seth.forshee@canonical.com>
2018-11-02  8:59 ` [RFC PATCH 0/6] shiftfs fixes and enhancements Amir Goldstein
2018-11-02 12:26   ` Seth Forshee
     [not found] ` <20181101214856.4563-7-seth.forshee@canonical.com>
2018-11-02 10:02   ` [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts Amir Goldstein
2018-11-02 12:44     ` Seth Forshee
2018-11-02 13:16       ` Amir Goldstein
2018-11-02 13:47         ` Seth Forshee
2018-11-02 16:57         ` James Bottomley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox