[PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces

public inbox for linux-mtd@lists.infradead.org
 help / color / mirror / Atom feed

* [PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces
  2016-04-26 19:36 [PATCH v4 00/21] Support fuse mounts in " Seth Forshee
@ 2016-04-26 19:36 ` Seth Forshee
  0 siblings, 0 replies; 6+ messages in thread
From: Seth Forshee @ 2016-04-26 19:36 UTC (permalink / raw)
  To: Eric W. Biederman, Alexander Viro, Greg Kroah-Hartman,
	Jeff Layton, J. Bruce Fields, Tejun Heo, Li Zefan,
	Johannes Weiner
  Cc: Serge Hallyn, Richard Weinberger, Austin S Hemmelgarn,
	Miklos Szeredi, Pavel Tikhomirov, linux-kernel, linux-bcache,
	dm-devel, linux-raid, linux-mtd, linux-fsdevel, fuse-devel,
	linux-security-module, selinux, cgroups, Seth Forshee

Both of these filesystems already have use cases for mounting the
same super block from multiple user namespaces. For sysfs this
happens when using criu for snapshotting a container, where sysfs
is mounted in the containers network ns but the hosts user ns.
The cgroup filesystem shares the same super block for all mounts
of the same hierarchy regardless of the namespace.

As a result, the restriction on mounting a super block from a
single user namespace creates regressions for existing uses of
these filesystems. For these specific filesystems this
restriction isn't really necessary since the backing store is
objects in kernel memory and thus the ids assigned from inodes
is not subject to translation relative to s_user_ns.

Add a new filesystem flag, FS_USERNS_SHARE_SB, which when set
causes sget_userns() to skip the check of s_user_ns. Set this
flag for the sysfs and cgroup filesystems to fix the
regressions.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
 fs/super.c         | 3 ++-
 fs/sysfs/mount.c   | 3 ++-
 include/linux/fs.h | 1 +
 kernel/cgroup.c    | 4 ++--
 4 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 092a7828442e..ead156b44bf8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -472,7 +472,8 @@ retry:
 		hlist_for_each_entry(old, &type->fs_supers, s_instances) {
 			if (!test(old, data))
 				continue;
-			if (user_ns != old->s_user_ns) {
+			if (!(type->fs_flags & FS_USERNS_SHARE_SB) &&
+			    user_ns != old->s_user_ns) {
 				spin_unlock(&sb_lock);
 				if (s) {
 					up_write(&s->s_umount);
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index f3db82071cfb..9555accd4322 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -59,7 +59,8 @@ static struct file_system_type sysfs_fs_type = {
 	.name		= "sysfs",
 	.mount		= sysfs_mount,
 	.kill_sb	= sysfs_kill_sb,
-	.fs_flags	= FS_USERNS_VISIBLE | FS_USERNS_MOUNT,
+	.fs_flags	= FS_USERNS_VISIBLE | FS_USERNS_MOUNT |
+			  FS_USERNS_SHARE_SB,
 };
 
 int __init sysfs_init(void)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index be0f8023e28c..66a639ec1bc4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1988,6 +1988,7 @@ struct file_system_type {
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
 #define FS_USERNS_VISIBLE	32	/* FS must already be visible */
+#define FS_USERNS_SHARE_SB	64	/* Allow sharing sb between userns-es */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 671dc05c0b0f..9c9aa27e531a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2247,14 +2247,14 @@ static struct file_system_type cgroup_fs_type = {
 	.name = "cgroup",
 	.mount = cgroup_mount,
 	.kill_sb = cgroup_kill_sb,
-	.fs_flags = FS_USERNS_MOUNT,
+	.fs_flags = FS_USERNS_MOUNT | FS_USERNS_SHARE_SB,
 };
 
 static struct file_system_type cgroup2_fs_type = {
 	.name = "cgroup2",
 	.mount = cgroup_mount,
 	.kill_sb = cgroup_kill_sb,
-	.fs_flags = FS_USERNS_MOUNT,
+	.fs_flags = FS_USERNS_MOUNT | FS_USERNS_SHARE_SB,
 };
 
 static char *cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces
       [not found] ` <1461699046-30485-4-git-send-email-seth.forshee@canonical.com>
@ 2016-05-17 22:39   ` Eric W. Biederman
  2016-05-17 23:58     ` Seth Forshee
  0 siblings, 1 reply; 6+ messages in thread
From: Eric W. Biederman @ 2016-05-17 22:39 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Alexander Viro, Greg Kroah-Hartman, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Li Zefan, Johannes Weiner, Serge Hallyn,
	Richard Weinberger, Austin S Hemmelgarn, Miklos Szeredi,
	Pavel Tikhomirov, linux-kernel, linux-bcache, dm-devel,
	linux-raid, linux-mtd, linux-fsdevel, fuse-devel,
	linux-security-module, selinux, cgroups

Seth Forshee <seth.forshee@canonical.com> writes:

> Both of these filesystems already have use cases for mounting the
> same super block from multiple user namespaces. For sysfs this
> happens when using criu for snapshotting a container, where sysfs
> is mnounted in the containers network ns but the hosts user ns.
> The cgroup filesystem shares the same super block for all mounts
> of the same hierarchy regardless of the namespace.
>
> As a result, the restriction on mounting a super block from a
> single user namespace creates regressions for existing uses of
> these filesystems. For these specific filesystems this
> restriction isn't really necessary since the backing store is
> objects in kernel memory and thus the ids assigned from inodes
> is not subject to translation relative to s_user_ns.
>
> Add a new filesystem flag, FS_USERNS_SHARE_SB, which when set
> causes sget_userns() to skip the check of s_user_ns. Set this
> flag for the sysfs and cgroup filesystems to fix the
> regressions.

So this one needs to be sget_userns(..., &init_user_ns, ...).
And not a new special case.

Apologies for not catching this earlier.

I am looking at folding all of this into the patch that introduces
sget_userns so that even bisects won't have regresssions.

We loose the ability to call mount -o remount and actually affect
these filesystems (which we don't have without s_user_ns) but we gain a
whole lot of simplicity, and we don't break the xattr and security label
on sysfs code.

Eric

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces
  2016-05-17 22:39   ` [PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces Eric W. Biederman
@ 2016-05-17 23:58     ` Seth Forshee
  2016-05-18 15:45       ` Eric W. Biederman
  0 siblings, 1 reply; 6+ messages in thread
From: Seth Forshee @ 2016-05-17 23:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexander Viro, Greg Kroah-Hartman, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Li Zefan, Johannes Weiner, Serge Hallyn,
	Richard Weinberger, Austin S Hemmelgarn, Miklos Szeredi,
	Pavel Tikhomirov, linux-kernel, linux-bcache, dm-devel,
	linux-raid, linux-mtd, linux-fsdevel, fuse-devel,
	linux-security-module, selinux, cgroups

On Tue, May 17, 2016 at 05:39:33PM -0500, Eric W. Biederman wrote:
> Seth Forshee <seth.forshee@canonical.com> writes:
> 
> > Both of these filesystems already have use cases for mounting the
> > same super block from multiple user namespaces. For sysfs this
> > happens when using criu for snapshotting a container, where sysfs
> > is mnounted in the containers network ns but the hosts user ns.
> > The cgroup filesystem shares the same super block for all mounts
> > of the same hierarchy regardless of the namespace.
> >
> > As a result, the restriction on mounting a super block from a
> > single user namespace creates regressions for existing uses of
> > these filesystems. For these specific filesystems this
> > restriction isn't really necessary since the backing store is
> > objects in kernel memory and thus the ids assigned from inodes
> > is not subject to translation relative to s_user_ns.
> >
> > Add a new filesystem flag, FS_USERNS_SHARE_SB, which when set
> > causes sget_userns() to skip the check of s_user_ns. Set this
> > flag for the sysfs and cgroup filesystems to fix the
> > regressions.
> 
> So this one needs to be sget_userns(..., &init_user_ns, ...).
> And not a new special case.

This is actually what I wanted to do, but based on a previous discussion
where I had suggested doing this (for a different reason) I came away
thinking you did not want it that way. So I'm happy with that change.

But if we do that it violates some of the assumptions of the patch to
rework MNT_NODEV on your testing branch (and also those behind patch 2
in this series). Something will need to be changed there to prevent a
regression in mount behavior when a user ns tries to mount without
MNT_NODEV when the mount inherited from its parent has it set.

> Apologies for not catching this earlier.

Actually this is a more recent patch, so you possibly hadn't seen it
before.

> I am looking at folding all of this into the patch that introduces
> sget_userns so that even bisects won't have regresssions.

That's fine with me.

Thanks,
Seth

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces
  2016-05-17 23:58     ` Seth Forshee
@ 2016-05-18 15:45       ` Eric W. Biederman
  2016-05-18 16:16         ` Seth Forshee
  0 siblings, 1 reply; 6+ messages in thread
From: Eric W. Biederman @ 2016-05-18 15:45 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Alexander Viro, Greg Kroah-Hartman, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Li Zefan, Johannes Weiner, Serge Hallyn,
	Richard Weinberger, Austin S Hemmelgarn, Miklos Szeredi,
	Pavel Tikhomirov, linux-kernel, linux-bcache, dm-devel,
	linux-raid, linux-mtd, linux-fsdevel, fuse-devel,
	linux-security-module, selinux, cgroups

Seth Forshee <seth.forshee@canonical.com> writes:

> On Tue, May 17, 2016 at 05:39:33PM -0500, Eric W. Biederman wrote:
>> Seth Forshee <seth.forshee@canonical.com> writes:
>> 
>> > Both of these filesystems already have use cases for mounting the
>> > same super block from multiple user namespaces. For sysfs this
>> > happens when using criu for snapshotting a container, where sysfs
>> > is mnounted in the containers network ns but the hosts user ns.
>> > The cgroup filesystem shares the same super block for all mounts
>> > of the same hierarchy regardless of the namespace.
>> >
>> > As a result, the restriction on mounting a super block from a
>> > single user namespace creates regressions for existing uses of
>> > these filesystems. For these specific filesystems this
>> > restriction isn't really necessary since the backing store is
>> > objects in kernel memory and thus the ids assigned from inodes
>> > is not subject to translation relative to s_user_ns.
>> >
>> > Add a new filesystem flag, FS_USERNS_SHARE_SB, which when set
>> > causes sget_userns() to skip the check of s_user_ns. Set this
>> > flag for the sysfs and cgroup filesystems to fix the
>> > regressions.
>> 
>> So this one needs to be sget_userns(..., &init_user_ns, ...).
>> And not a new special case.
>
> This is actually what I wanted to do, but based on a previous discussion
> where I had suggested doing this (for a different reason) I came away
> thinking you did not want it that way. So I'm happy with that change.

Yeah.  Somedays it seems like there are a lot of pieces in play here.
The security labels on sysfs seems to be a very compelling case.

> But if we do that it violates some of the assumptions of the patch to
> rework MNT_NODEV on your testing branch (and also those behind patch 2
> in this series). Something will need to be changed there to prevent a
> regression in mount behavior when a user ns tries to mount without
> MNT_NODEV when the mount inherited from its parent has it set.

Thank you for pointing that out.  I will look into that.

I believe I know exactly what you are talking about.  Of the choices I
think it is better to a minor localized change in the fs_fully_visible
logic than it is to cause problems elsewhere.

>> Apologies for not catching this earlier.
>
> Actually this is a more recent patch, so you possibly hadn't seen it
> before.
>
>> I am looking at folding all of this into the patch that introduces
>> sget_userns so that even bisects won't have regresssions.
>
> That's fine with me.

And thank you for keeping everything as separate patches.  That is at
least helping me catch up.  Even if I don't agree that these things
should be separate come merge time.

Eric

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces
  2016-05-18 15:45       ` Eric W. Biederman
@ 2016-05-18 16:16         ` Seth Forshee
  2016-05-18 16:27           ` Eric W. Biederman
  0 siblings, 1 reply; 6+ messages in thread
From: Seth Forshee @ 2016-05-18 16:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexander Viro, Greg Kroah-Hartman, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Li Zefan, Johannes Weiner, Serge Hallyn,
	Richard Weinberger, Austin S Hemmelgarn, Miklos Szeredi,
	Pavel Tikhomirov, linux-kernel, linux-bcache, dm-devel,
	linux-raid, linux-mtd, linux-fsdevel, fuse-devel,
	linux-security-module, selinux, cgroups

On Wed, May 18, 2016 at 10:45:31AM -0500, Eric W. Biederman wrote:
> > But if we do that it violates some of the assumptions of the patch to
> > rework MNT_NODEV on your testing branch (and also those behind patch 2
> > in this series). Something will need to be changed there to prevent a
> > regression in mount behavior when a user ns tries to mount without
> > MNT_NODEV when the mount inherited from its parent has it set.
> 
> Thank you for pointing that out.  I will look into that.
> 
> I believe I know exactly what you are talking about.  Of the choices I
> think it is better to a minor localized change in the fs_fully_visible
> logic than it is to cause problems elsewhere.

Agreed.

> >> Apologies for not catching this earlier.
> >
> > Actually this is a more recent patch, so you possibly hadn't seen it
> > before.
> >
> >> I am looking at folding all of this into the patch that introduces
> >> sget_userns so that even bisects won't have regresssions.
> >
> > That's fine with me.
> 
> And thank you for keeping everything as separate patches.  That is at
> least helping me catch up.  Even if I don't agree that these things
> should be separate come merge time.

Honestly I probably would have squashed some of them into that first
patch myself if you hadn't already applied it to your testing branch, so
that's all just luck.

Keep in mind that I also have that patch for mqueue that isn't in this
series, and I haven't yet checked to see if the 4.7 merges introduce
anything which is going to require updating these patches. I was
planning to wait and send out updates after -rc1, but if you want that
stuff sooner just let me know.

Thanks,
Seth

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces
  2016-05-18 16:16         ` Seth Forshee
@ 2016-05-18 16:27           ` Eric W. Biederman
  0 siblings, 0 replies; 6+ messages in thread
From: Eric W. Biederman @ 2016-05-18 16:27 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Alexander Viro, Greg Kroah-Hartman, Jeff Layton, J. Bruce Fields,
	Tejun Heo, Li Zefan, Johannes Weiner, Serge Hallyn,
	Richard Weinberger, Austin S Hemmelgarn, Miklos Szeredi,
	Pavel Tikhomirov, linux-kernel, linux-bcache, dm-devel,
	linux-raid, linux-mtd, linux-fsdevel, fuse-devel,
	linux-security-module, selinux, cgroups

Seth Forshee <seth.forshee@canonical.com> writes:

> On Wed, May 18, 2016 at 10:45:31AM -0500, Eric W. Biederman wrote:
>> > But if we do that it violates some of the assumptions of the patch to
>> > rework MNT_NODEV on your testing branch (and also those behind patch 2
>> > in this series). Something will need to be changed there to prevent a
>> > regression in mount behavior when a user ns tries to mount without
>> > MNT_NODEV when the mount inherited from its parent has it set.
>> 
>> Thank you for pointing that out.  I will look into that.
>> 
>> I believe I know exactly what you are talking about.  Of the choices I
>> think it is better to a minor localized change in the fs_fully_visible
>> logic than it is to cause problems elsewhere.
>
> Agreed.
>
>> >> Apologies for not catching this earlier.
>> >
>> > Actually this is a more recent patch, so you possibly hadn't seen it
>> > before.
>> >
>> >> I am looking at folding all of this into the patch that introduces
>> >> sget_userns so that even bisects won't have regresssions.
>> >
>> > That's fine with me.
>> 
>> And thank you for keeping everything as separate patches.  That is at
>> least helping me catch up.  Even if I don't agree that these things
>> should be separate come merge time.
>
> Honestly I probably would have squashed some of them into that first
> patch myself if you hadn't already applied it to your testing branch, so
> that's all just luck.
>
> Keep in mind that I also have that patch for mqueue that isn't in this
> series, and I haven't yet checked to see if the 4.7 merges introduce
> anything which is going to require updating these patches. I was
> planning to wait and send out updates after -rc1, but if you want that
> stuff sooner just let me know.

As unfortunately I don't have anything going into -rc1 I am working on
this right now.

Let me finish sorting out the sget_userns and mnt nodev mess and I will
push something out and then we can compare notes.  I think I have mqueue
covered by other changes.  As it is in the set of filesystems that
should just use sget_userns.

I am sorting through the nodev corner of this now.  It should just be a
day or two.

Eric

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-05-18 16:39 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1461699046-30485-1-git-send-email-seth.forshee@canonical.com>
     [not found] ` <1461699046-30485-4-git-send-email-seth.forshee@canonical.com>
2016-05-17 22:39   ` [PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces Eric W. Biederman
2016-05-17 23:58     ` Seth Forshee
2016-05-18 15:45       ` Eric W. Biederman
2016-05-18 16:16         ` Seth Forshee
2016-05-18 16:27           ` Eric W. Biederman
2016-04-26 19:36 [PATCH v4 00/21] Support fuse mounts in " Seth Forshee
2016-04-26 19:36 ` [PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between " Seth Forshee

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox