[RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field

linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
@ 2016-03-21 23:41 Serge E. Hallyn
       [not found] ` <20160321234133.GA22463-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2016-03-21 23:41 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

One practical problem I've found with cgroup namespaces is that there
is no way to disambiguate between a cgroupfs mount which was done in
a cgroup namespace, and a bind mount of a cgroupfs directory.  So
whether I do

unshare --cgroup -- bash -c "mount -t cgroup -o freezer f /mnt; cat /proc/self/mountinfo"

or whether I just

mount --bind /sys/fs/cgroup/freezer/$(awk -F: '/freezer/ { print $3 }' /proc/self/cgroup) /mnt

'mount root' field (field 3) in /proc/self/mountinfo will show the
same thing, the result of awk -F: '/freezer/ { print $3 }' /proc/self/cgroup.

This patch adds a 'nsroot=' field to cgroup mountinfo entries, so that
userspace can distinguish a mount made in a cgroup namespace from a bind
mount from a cgroup subdirectory.

Signed-off-by: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
---
 fs/kernfs/mount.c      |  2 +-
 include/linux/kernfs.h |  3 ++-
 kernel/cgroup.c        | 29 ++++++++++++++++++++++++++++-
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index b67dbcc..58f59fd 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -36,7 +36,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
 	struct kernfs_syscall_ops *scops = root->syscall_ops;
 
 	if (scops && scops->show_options)
-		return scops->show_options(sf, root);
+		return scops->show_options(sf, dentry, root);
 	return 0;
 }
 
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index c06c442..3124b91 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -145,7 +145,8 @@ struct kernfs_node {
  */
 struct kernfs_syscall_ops {
 	int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
-	int (*show_options)(struct seq_file *sf, struct kernfs_root *root);
+	int (*show_options)(struct seq_file *sf, struct dentry *dentry,
+			    struct kernfs_root *root);
 
 	int (*mkdir)(struct kernfs_node *parent, const char *name,
 		     umode_t mode);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 671dc05..806d1e7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1593,7 +1593,32 @@ static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
 	return 0;
 }
 
-static int cgroup_show_options(struct seq_file *seq,
+static void cgroup_show_nsroot(struct seq_file *seq, struct dentry *dentry,
+			       struct kernfs_root *kf_root)
+{
+	struct kernfs_node *d_kn = dentry->d_fsdata;
+	char *nsroot;
+	int len, ret;
+
+	if (!kf_root)
+		return;
+	len = kernfs_path_from_node(d_kn, kf_root->kn, NULL, 0);
+	if (len <= 0)
+		return;
+	nsroot = kzalloc(len + 1, GFP_ATOMIC);
+	if (!nsroot)
+		return;
+	ret = kernfs_path_from_node(d_kn, kf_root->kn, nsroot, len + 1);
+	if (ret <= 0 || ret > len)
+		goto out;
+
+	seq_show_option(seq, "nsroot", nsroot);
+
+out:
+	kfree(nsroot);
+}
+
+static int cgroup_show_options(struct seq_file *seq, struct dentry *dentry,
 			       struct kernfs_root *kf_root)
 {
 	struct cgroup_root *root = cgroup_root_from_kf(kf_root);
@@ -1619,6 +1644,8 @@ static int cgroup_show_options(struct seq_file *seq,
 		seq_puts(seq, ",clone_children");
 	if (strlen(root->name))
 		seq_show_option(seq, "name", root->name);
+	cgroup_show_nsroot(seq, dentry, kf_root);
+
 	return 0;
 }
 
-- 
2.7.3

^ permalink raw reply related	[flat|nested] 22+ messages in thread

[parent not found: <20160321234133.GA22463-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]

* Re: [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found] ` <20160321234133.GA22463-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2016-03-29  1:12   ` Serge E. Hallyn
       [not found]     ` <20160329011203.GA8974-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  2016-03-29 13:58   ` Tycho Andersen
  2016-04-13 17:57   ` [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field Tejun Heo
  2 siblings, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2016-03-29  1:12 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: tj-DgEjT+Ai2ygdnm+yROfE0A, linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

Quoting Serge E. Hallyn (serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org):
> One practical problem I've found with cgroup namespaces is that there
> is no way to disambiguate between a cgroupfs mount which was done in
> a cgroup namespace, and a bind mount of a cgroupfs directory.  So
> whether I do
> 
> unshare --cgroup -- bash -c "mount -t cgroup -o freezer f /mnt; cat /proc/self/mountinfo"
> 
> or whether I just
> 
> mount --bind /sys/fs/cgroup/freezer/$(awk -F: '/freezer/ { print $3 }' /proc/self/cgroup) /mnt
> 
> 'mount root' field (field 3) in /proc/self/mountinfo will show the
> same thing, the result of awk -F: '/freezer/ { print $3 }' /proc/self/cgroup.
> 
> This patch adds a 'nsroot=' field to cgroup mountinfo entries, so that
> userspace can distinguish a mount made in a cgroup namespace from a bind
> mount from a cgroup subdirectory.

Hi Tejun,

no rush on the patch itself, I don't mind if i have to rewrite it from
scratch, but I'd like to get the patch into docker/libcontainer using it,
so can we decide on whether the syntax as shown here is ok?

thanks,
-serge

> Signed-off-by: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/kernfs/mount.c      |  2 +-
>  include/linux/kernfs.h |  3 ++-
>  kernel/cgroup.c        | 29 ++++++++++++++++++++++++++++-
>  3 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index b67dbcc..58f59fd 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -36,7 +36,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
>  	struct kernfs_syscall_ops *scops = root->syscall_ops;
>  
>  	if (scops && scops->show_options)
> -		return scops->show_options(sf, root);
> +		return scops->show_options(sf, dentry, root);
>  	return 0;
>  }
>  
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index c06c442..3124b91 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -145,7 +145,8 @@ struct kernfs_node {
>   */
>  struct kernfs_syscall_ops {
>  	int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
> -	int (*show_options)(struct seq_file *sf, struct kernfs_root *root);
> +	int (*show_options)(struct seq_file *sf, struct dentry *dentry,
> +			    struct kernfs_root *root);
>  
>  	int (*mkdir)(struct kernfs_node *parent, const char *name,
>  		     umode_t mode);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 671dc05..806d1e7 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1593,7 +1593,32 @@ static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
>  	return 0;
>  }
>  
> -static int cgroup_show_options(struct seq_file *seq,
> +static void cgroup_show_nsroot(struct seq_file *seq, struct dentry *dentry,
> +			       struct kernfs_root *kf_root)
> +{
> +	struct kernfs_node *d_kn = dentry->d_fsdata;
> +	char *nsroot;
> +	int len, ret;
> +
> +	if (!kf_root)
> +		return;
> +	len = kernfs_path_from_node(d_kn, kf_root->kn, NULL, 0);
> +	if (len <= 0)
> +		return;
> +	nsroot = kzalloc(len + 1, GFP_ATOMIC);
> +	if (!nsroot)
> +		return;
> +	ret = kernfs_path_from_node(d_kn, kf_root->kn, nsroot, len + 1);
> +	if (ret <= 0 || ret > len)
> +		goto out;
> +
> +	seq_show_option(seq, "nsroot", nsroot);
> +
> +out:
> +	kfree(nsroot);
> +}
> +
> +static int cgroup_show_options(struct seq_file *seq, struct dentry *dentry,
>  			       struct kernfs_root *kf_root)
>  {
>  	struct cgroup_root *root = cgroup_root_from_kf(kf_root);
> @@ -1619,6 +1644,8 @@ static int cgroup_show_options(struct seq_file *seq,
>  		seq_puts(seq, ",clone_children");
>  	if (strlen(root->name))
>  		seq_show_option(seq, "name", root->name);
> +	cgroup_show_nsroot(seq, dentry, kf_root);
> +
>  	return 0;
>  }
>  
> -- 
> 2.7.3
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <20160329011203.GA8974-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]

* Re: [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found]     ` <20160329011203.GA8974-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2016-03-30 18:58       ` Tejun Heo
  0 siblings, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2016-03-30 18:58 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

Hello, Serge.

On Mon, Mar 28, 2016 at 08:12:03PM -0500, Serge E. Hallyn wrote:
> Quoting Serge E. Hallyn (serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org):
> > One practical problem I've found with cgroup namespaces is that there
> > is no way to disambiguate between a cgroupfs mount which was done in
> > a cgroup namespace, and a bind mount of a cgroupfs directory.  So
> > whether I do
> > 
> > unshare --cgroup -- bash -c "mount -t cgroup -o freezer f /mnt; cat /proc/self/mountinfo"
> > 
> > or whether I just
> > 
> > mount --bind /sys/fs/cgroup/freezer/$(awk -F: '/freezer/ { print $3 }' /proc/self/cgroup) /mnt
> > 
> > 'mount root' field (field 3) in /proc/self/mountinfo will show the
> > same thing, the result of awk -F: '/freezer/ { print $3 }' /proc/self/cgroup.
> > 
> > This patch adds a 'nsroot=' field to cgroup mountinfo entries, so that
> > userspace can distinguish a mount made in a cgroup namespace from a bind
> > mount from a cgroup subdirectory.
> 
> no rush on the patch itself, I don't mind if i have to rewrite it from
> scratch, but I'd like to get the patch into docker/libcontainer using it,
> so can we decide on whether the syntax as shown here is ok?

Yeah, I think the syntax is fine.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found] ` <20160321234133.GA22463-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  2016-03-29  1:12   ` Serge E. Hallyn
@ 2016-03-29 13:58   ` Tycho Andersen
  2016-03-29 20:00     ` Serge E. Hallyn
  2016-04-13 17:57   ` [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field Tejun Heo
  2 siblings, 1 reply; 22+ messages in thread
From: Tycho Andersen @ 2016-03-29 13:58 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: tj-DgEjT+Ai2ygdnm+yROfE0A, linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

Hi Serge,

On Mon, Mar 21, 2016 at 06:41:33PM -0500, Serge E. Hallyn wrote:
> One practical problem I've found with cgroup namespaces is that there
> is no way to disambiguate between a cgroupfs mount which was done in
> a cgroup namespace, and a bind mount of a cgroupfs directory.  So
> whether I do
> 
> unshare --cgroup -- bash -c "mount -t cgroup -o freezer f /mnt; cat /proc/self/mountinfo"
> 
> or whether I just
> 
> mount --bind /sys/fs/cgroup/freezer/$(awk -F: '/freezer/ { print $3 }' /proc/self/cgroup) /mnt
> 
> 'mount root' field (field 3) in /proc/self/mountinfo will show the
> same thing, the result of awk -F: '/freezer/ { print $3 }' /proc/self/cgroup.
> 
> This patch adds a 'nsroot=' field to cgroup mountinfo entries, so that
> userspace can distinguish a mount made in a cgroup namespace from a bind
> mount from a cgroup subdirectory.

With this patch, mountinfo shows nsroot= in the mount options, but the
actual mount() call for cgroups doesn't allow nsroot. Would it be
possible to allow passing nsroot= to mount, as long is it does in fact
match the current nsroot?

The motivation for this is that CRIU just copies the mount options and
uses them on restore, so with this patch we have to add a special case
to trim off nsroot= before we restore.

Tycho

> Signed-off-by: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/kernfs/mount.c      |  2 +-
>  include/linux/kernfs.h |  3 ++-
>  kernel/cgroup.c        | 29 ++++++++++++++++++++++++++++-
>  3 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index b67dbcc..58f59fd 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -36,7 +36,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
>  	struct kernfs_syscall_ops *scops = root->syscall_ops;
>  
>  	if (scops && scops->show_options)
> -		return scops->show_options(sf, root);
> +		return scops->show_options(sf, dentry, root);
>  	return 0;
>  }
>  
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index c06c442..3124b91 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -145,7 +145,8 @@ struct kernfs_node {
>   */
>  struct kernfs_syscall_ops {
>  	int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
> -	int (*show_options)(struct seq_file *sf, struct kernfs_root *root);
> +	int (*show_options)(struct seq_file *sf, struct dentry *dentry,
> +			    struct kernfs_root *root);
>  
>  	int (*mkdir)(struct kernfs_node *parent, const char *name,
>  		     umode_t mode);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 671dc05..806d1e7 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1593,7 +1593,32 @@ static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
>  	return 0;
>  }
>  
> -static int cgroup_show_options(struct seq_file *seq,
> +static void cgroup_show_nsroot(struct seq_file *seq, struct dentry *dentry,
> +			       struct kernfs_root *kf_root)
> +{
> +	struct kernfs_node *d_kn = dentry->d_fsdata;
> +	char *nsroot;
> +	int len, ret;
> +
> +	if (!kf_root)
> +		return;
> +	len = kernfs_path_from_node(d_kn, kf_root->kn, NULL, 0);
> +	if (len <= 0)
> +		return;
> +	nsroot = kzalloc(len + 1, GFP_ATOMIC);
> +	if (!nsroot)
> +		return;
> +	ret = kernfs_path_from_node(d_kn, kf_root->kn, nsroot, len + 1);
> +	if (ret <= 0 || ret > len)
> +		goto out;
> +
> +	seq_show_option(seq, "nsroot", nsroot);
> +
> +out:
> +	kfree(nsroot);
> +}
> +
> +static int cgroup_show_options(struct seq_file *seq, struct dentry *dentry,
>  			       struct kernfs_root *kf_root)
>  {
>  	struct cgroup_root *root = cgroup_root_from_kf(kf_root);
> @@ -1619,6 +1644,8 @@ static int cgroup_show_options(struct seq_file *seq,
>  		seq_puts(seq, ",clone_children");
>  	if (strlen(root->name))
>  		seq_show_option(seq, "name", root->name);
> +	cgroup_show_nsroot(seq, dentry, kf_root);
> +
>  	return 0;
>  }
>  
> -- 
> 2.7.3
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
  2016-03-29 13:58   ` Tycho Andersen
@ 2016-03-29 20:00     ` Serge E. Hallyn
       [not found]       ` <20160329200018.GA21908-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2016-03-29 20:00 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Serge E. Hallyn, tj-DgEjT+Ai2ygdnm+yROfE0A,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

Quoting Tycho Andersen (tycho.andersen-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org):
> Hi Serge,
> 
> On Mon, Mar 21, 2016 at 06:41:33PM -0500, Serge E. Hallyn wrote:
> > One practical problem I've found with cgroup namespaces is that there
> > is no way to disambiguate between a cgroupfs mount which was done in
> > a cgroup namespace, and a bind mount of a cgroupfs directory.  So
> > whether I do
> > 
> > unshare --cgroup -- bash -c "mount -t cgroup -o freezer f /mnt; cat /proc/self/mountinfo"
> > 
> > or whether I just
> > 
> > mount --bind /sys/fs/cgroup/freezer/$(awk -F: '/freezer/ { print $3 }' /proc/self/cgroup) /mnt
> > 
> > 'mount root' field (field 3) in /proc/self/mountinfo will show the
> > same thing, the result of awk -F: '/freezer/ { print $3 }' /proc/self/cgroup.
> > 
> > This patch adds a 'nsroot=' field to cgroup mountinfo entries, so that
> > userspace can distinguish a mount made in a cgroup namespace from a bind
> > mount from a cgroup subdirectory.
> 
> With this patch, mountinfo shows nsroot= in the mount options, but the
> actual mount() call for cgroups doesn't allow nsroot. Would it be
> possible to allow passing nsroot= to mount, as long is it does in fact
> match the current nsroot?

Yeah, that should be possible.  I'll try to send a patch for that later
this week.  That's not to say Tejun will be ok with the behavior, but
it seems to make sense to me.

> The motivation for this is that CRIU just copies the mount options and
> uses them on restore, so with this patch we have to add a special case
> to trim off nsroot= before we restore.
> 
> Tycho
> 
> > Signed-off-by: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  fs/kernfs/mount.c      |  2 +-
> >  include/linux/kernfs.h |  3 ++-
> >  kernel/cgroup.c        | 29 ++++++++++++++++++++++++++++-
> >  3 files changed, 31 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > index b67dbcc..58f59fd 100644
> > --- a/fs/kernfs/mount.c
> > +++ b/fs/kernfs/mount.c
> > @@ -36,7 +36,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
> >  	struct kernfs_syscall_ops *scops = root->syscall_ops;
> >  
> >  	if (scops && scops->show_options)
> > -		return scops->show_options(sf, root);
> > +		return scops->show_options(sf, dentry, root);
> >  	return 0;
> >  }
> >  
> > diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> > index c06c442..3124b91 100644
> > --- a/include/linux/kernfs.h
> > +++ b/include/linux/kernfs.h
> > @@ -145,7 +145,8 @@ struct kernfs_node {
> >   */
> >  struct kernfs_syscall_ops {
> >  	int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
> > -	int (*show_options)(struct seq_file *sf, struct kernfs_root *root);
> > +	int (*show_options)(struct seq_file *sf, struct dentry *dentry,
> > +			    struct kernfs_root *root);
> >  
> >  	int (*mkdir)(struct kernfs_node *parent, const char *name,
> >  		     umode_t mode);
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index 671dc05..806d1e7 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -1593,7 +1593,32 @@ static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
> >  	return 0;
> >  }
> >  
> > -static int cgroup_show_options(struct seq_file *seq,
> > +static void cgroup_show_nsroot(struct seq_file *seq, struct dentry *dentry,
> > +			       struct kernfs_root *kf_root)
> > +{
> > +	struct kernfs_node *d_kn = dentry->d_fsdata;
> > +	char *nsroot;
> > +	int len, ret;
> > +
> > +	if (!kf_root)
> > +		return;
> > +	len = kernfs_path_from_node(d_kn, kf_root->kn, NULL, 0);
> > +	if (len <= 0)
> > +		return;
> > +	nsroot = kzalloc(len + 1, GFP_ATOMIC);
> > +	if (!nsroot)
> > +		return;
> > +	ret = kernfs_path_from_node(d_kn, kf_root->kn, nsroot, len + 1);
> > +	if (ret <= 0 || ret > len)
> > +		goto out;
> > +
> > +	seq_show_option(seq, "nsroot", nsroot);
> > +
> > +out:
> > +	kfree(nsroot);
> > +}
> > +
> > +static int cgroup_show_options(struct seq_file *seq, struct dentry *dentry,
> >  			       struct kernfs_root *kf_root)
> >  {
> >  	struct cgroup_root *root = cgroup_root_from_kf(kf_root);
> > @@ -1619,6 +1644,8 @@ static int cgroup_show_options(struct seq_file *seq,
> >  		seq_puts(seq, ",clone_children");
> >  	if (strlen(root->name))
> >  		seq_show_option(seq, "name", root->name);
> > +	cgroup_show_nsroot(seq, dentry, kf_root);
> > +
> >  	return 0;
> >  }
> >  
> > -- 
> > 2.7.3
> > 

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <20160329200018.GA21908-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]

* [PATCH] cgroup mount: ignore nsroot=
       [not found]       ` <20160329200018.GA21908-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2016-03-30 17:21         ` Serge E. Hallyn
       [not found]           ` <20160330172100.GA11373-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2016-03-30 17:21 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Tycho Andersen, tj-DgEjT+Ai2ygdnm+yROfE0A,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

As of the patch "cgroup namespaces: add a 'nsroot=' mountinfo field", 
cgroupfs mountinfo output shows 'nsroot='.  If userspace like criu
copy/pastes mount options from there into a new mount command, we should
ignore it.

Signed-off-by: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
---
 kernel/cgroup.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index ef0c25d..69fb112 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1680,6 +1680,10 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 			opts->none = true;
 			continue;
 		}
+		if (!strncmp(token, "nsroot=", 7)) {
+			/* ignore nsroot= copied from mountinfo */
+			continue;
+		}
 		if (!strcmp(token, "all")) {
 			/* Mutually exclusive option 'all' + subsystem name */
 			if (one_ss)
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

[parent not found: <20160330172100.GA11373-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]

* Re: [PATCH] cgroup mount: ignore nsroot=
       [not found]           ` <20160330172100.GA11373-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2016-03-30 18:09             ` Tycho Andersen
  0 siblings, 0 replies; 22+ messages in thread
From: Tycho Andersen @ 2016-03-30 18:09 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: tj-DgEjT+Ai2ygdnm+yROfE0A, linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

On Wed, Mar 30, 2016 at 12:21:00PM -0500, Serge E. Hallyn wrote:
> As of the patch "cgroup namespaces: add a 'nsroot=' mountinfo field", 
> cgroupfs mountinfo output shows 'nsroot='.  If userspace like criu
> copy/pastes mount options from there into a new mount command, we should
> ignore it.
> 
> Signed-off-by: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

Tested-by: Tycho Andersen <tycho.andersen-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Thanks, Serge.

> ---
>  kernel/cgroup.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index ef0c25d..69fb112 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1680,6 +1680,10 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>  			opts->none = true;
>  			continue;
>  		}
> +		if (!strncmp(token, "nsroot=", 7)) {
> +			/* ignore nsroot= copied from mountinfo */
> +			continue;
> +		}
>  		if (!strcmp(token, "all")) {
>  			/* Mutually exclusive option 'all' + subsystem name */
>  			if (one_ss)
> -- 
> 2.7.0
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found] ` <20160321234133.GA22463-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  2016-03-29  1:12   ` Serge E. Hallyn
  2016-03-29 13:58   ` Tycho Andersen
@ 2016-04-13 17:57   ` Tejun Heo
       [not found]     ` <20160413175736.GC3676-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  2 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2016-04-13 17:57 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

Hello, Serge.

Sorry about the delay.

On Mon, Mar 21, 2016 at 06:41:33PM -0500, Serge E. Hallyn wrote:
>  struct kernfs_syscall_ops {
>  	int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
> -	int (*show_options)(struct seq_file *sf, struct kernfs_root *root);
> +	int (*show_options)(struct seq_file *sf, struct dentry *dentry,
> +			    struct kernfs_root *root);

Wouldn't it make more sense to pass in kernfs_node pointer instead of
dentry pointer?

> +static void cgroup_show_nsroot(struct seq_file *seq, struct dentry *dentry,
> +			       struct kernfs_root *kf_root)
> +{
> +	struct kernfs_node *d_kn = dentry->d_fsdata;
> +	char *nsroot;
> +	int len, ret;
> +
> +	if (!kf_root)
> +		return;
> +	len = kernfs_path_from_node(d_kn, kf_root->kn, NULL, 0);
> +	if (len <= 0)
> +		return;
> +	nsroot = kzalloc(len + 1, GFP_ATOMIC);
> +	if (!nsroot)
> +		return;
> +	ret = kernfs_path_from_node(d_kn, kf_root->kn, nsroot, len + 1);
> +	if (ret <= 0 || ret > len)
> +		goto out;

Hmmm.... does this mean that someone inside cgroup ns would be able to
find out the absolute cgroup path of the ns root from inside?  If so,
wouldn't that be an unnecessary information leak?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <20160413175736.GC3676-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>]

* Re: [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found]     ` <20160413175736.GC3676-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2016-04-13 18:46       ` Serge E. Hallyn
       [not found]         ` <20160413184639.GA29483-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  2016-04-14  4:04       ` [PATCH] " Serge E. Hallyn
  1 sibling, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2016-04-13 18:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge E. Hallyn, linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Hello, Serge.
> 
> Sorry about the delay.
> 
> On Mon, Mar 21, 2016 at 06:41:33PM -0500, Serge E. Hallyn wrote:
> >  struct kernfs_syscall_ops {
> >  	int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
> > -	int (*show_options)(struct seq_file *sf, struct kernfs_root *root);
> > +	int (*show_options)(struct seq_file *sf, struct dentry *dentry,
> > +			    struct kernfs_root *root);
> 
> Wouldn't it make more sense to pass in kernfs_node pointer instead of
> dentry pointer?

Yeah that definately seems better.

> > +static void cgroup_show_nsroot(struct seq_file *seq, struct dentry *dentry,
> > +			       struct kernfs_root *kf_root)
> > +{
> > +	struct kernfs_node *d_kn = dentry->d_fsdata;
> > +	char *nsroot;
> > +	int len, ret;
> > +
> > +	if (!kf_root)
> > +		return;
> > +	len = kernfs_path_from_node(d_kn, kf_root->kn, NULL, 0);
> > +	if (len <= 0)
> > +		return;
> > +	nsroot = kzalloc(len + 1, GFP_ATOMIC);
> > +	if (!nsroot)
> > +		return;
> > +	ret = kernfs_path_from_node(d_kn, kf_root->kn, nsroot, len + 1);
> > +	if (ret <= 0 || ret > len)
> > +		goto out;
> 
> Hmmm.... does this mean that someone inside cgroup ns would be able to
> find out the absolute cgroup path of the ns root from inside?  If so,
> wouldn't that be an unnecessary information leak?

It's not a leak of any information we're trying to hide.  I realize
something like 8 years have passed, but I still basically go by the
ksummit guidance that containers are ok but the kernel's first priority
is to facilitate containers but not trick containers into thinking
they're not containerized.  So long as the container is properly set
up, I don't think there's anything the workload could do with the
nsroot= info other than *know* that it is in a ns cgroup.

If we did change that guidance, there's a slew of proc info that we
could better virtualize :)

thanks,
-serge

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <20160413184639.GA29483-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]

* Re: [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found]         ` <20160413184639.GA29483-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2016-04-13 18:50           ` Tejun Heo
       [not found]             ` <20160413185033.GH3676-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2016-04-13 18:50 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

Hello, Serge.

On Wed, Apr 13, 2016 at 01:46:39PM -0500, Serge E. Hallyn wrote:
> It's not a leak of any information we're trying to hide.  I realize
> something like 8 years have passed, but I still basically go by the
> ksummit guidance that containers are ok but the kernel's first priority
> is to facilitate containers but not trick containers into thinking
> they're not containerized.  So long as the container is properly set
> up, I don't think there's anything the workload could do with the
> nsroot= info other than *know* that it is in a ns cgroup.
> 
> If we did change that guidance, there's a slew of proc info that we
> could better virtualize :)

I see.  I'm just wondering because the information here seems a bit
gratuituous.  Isn't the only thing necessary telling whether the root
is bind mounted or namescoped?  Wouldn't simple "nsroot" work for that
purpose?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <20160413185033.GH3676-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>]

* Re: [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found]             ` <20160413185033.GH3676-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2016-04-13 19:01               ` Serge E. Hallyn
       [not found]                 ` <20160413190152.GA29753-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2016-04-13 19:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge E. Hallyn, linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Hello, Serge.
> 
> On Wed, Apr 13, 2016 at 01:46:39PM -0500, Serge E. Hallyn wrote:
> > It's not a leak of any information we're trying to hide.  I realize
> > something like 8 years have passed, but I still basically go by the
> > ksummit guidance that containers are ok but the kernel's first priority
> > is to facilitate containers but not trick containers into thinking
> > they're not containerized.  So long as the container is properly set
> > up, I don't think there's anything the workload could do with the
> > nsroot= info other than *know* that it is in a ns cgroup.
> > 
> > If we did change that guidance, there's a slew of proc info that we
> > could better virtualize :)
> 
> I see.  I'm just wondering because the information here seems a bit
> gratuituous.  Isn't the only thing necessary telling whether the root
> is bind mounted or namescoped?  Wouldn't simple "nsroot" work for that
> purpose?

I don't think so - we could be in a cgroup namespace but still have
access only to bind-mounted cgroups.  So we need to compare the
superblock dentry root field to the nsroot= value.

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <20160413190152.GA29753-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]

* Re: [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found]                 ` <20160413190152.GA29753-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2016-04-13 19:12                   ` Tejun Heo
  2016-04-13 23:31                   ` Aditya Kali
  1 sibling, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2016-04-13 19:12 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

Hello,

On Wed, Apr 13, 2016 at 02:01:52PM -0500, Serge E. Hallyn wrote:
> I don't think so - we could be in a cgroup namespace but still have
> access only to bind-mounted cgroups.  So we need to compare the
> superblock dentry root field to the nsroot= value.

I see.  No objection then.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found]                 ` <20160413190152.GA29753-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  2016-04-13 19:12                   ` Tejun Heo
@ 2016-04-13 23:31                   ` Aditya Kali
       [not found]                     ` <CAGr1F2HXJ1BdMFY+vF40O_khE+4S7OnbQPv-h1Q_AmGGhL7mzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 22+ messages in thread
From: Aditya Kali @ 2016-04-13 23:31 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Tejun Heo, Linux API, Linux Containers, Eric W. Biederman,
	cgroups mailinglist, lkml

On Wed, Apr 13, 2016 at 12:01 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
>> Hello, Serge.
>>
>> On Wed, Apr 13, 2016 at 01:46:39PM -0500, Serge E. Hallyn wrote:
>> > It's not a leak of any information we're trying to hide.  I realize
>> > something like 8 years have passed, but I still basically go by the
>> > ksummit guidance that containers are ok but the kernel's first priority
>> > is to facilitate containers but not trick containers into thinking
>> > they're not containerized.  So long as the container is properly set
>> > up, I don't think there's anything the workload could do with the
>> > nsroot= info other than *know* that it is in a ns cgroup.
>> >
>> > If we did change that guidance, there's a slew of proc info that we
>> > could better virtualize :)
>>
>> I see.  I'm just wondering because the information here seems a bit
>> gratuituous.  Isn't the only thing necessary telling whether the root
>> is bind mounted or namescoped?  Wouldn't simple "nsroot" work for that
>> purpose?
>
> I don't think so - we could be in a cgroup namespace but still have
> access only to bind-mounted cgroups.  So we need to compare the
> superblock dentry root field to the nsroot= value.

Umm, I don't think this is such a good idea. The main purpose of
cgroup namespace was to prevent this exposure of system cgroup
hierarchy that used to happen because of /proc/self/cgroup. Wouldn't
showing that information in /proc/self/mountinfo defeat the purpose?

> One practical problem I've found with cgroup namespaces is that there
> is no way to disambiguate between a cgroupfs mount which was done in
> a cgroup namespace, and a bind mount of a cgroupfs directory.

Thats actually by design, no? Namespaced apps should not know/care if
they are running inside namespace. If they can find it out today, its
just because of certain side-effects. I fear adding explicit "nsroot"
or something in /proc/self/mountinfo now becomes an API making it hard
to virtualize user-apps again.

-- 
Aditya

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <CAGr1F2HXJ1BdMFY+vF40O_khE+4S7OnbQPv-h1Q_AmGGhL7mzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found]                     ` <CAGr1F2HXJ1BdMFY+vF40O_khE+4S7OnbQPv-h1Q_AmGGhL7mzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-04-13 23:52                       ` Serge E. Hallyn
  0 siblings, 0 replies; 22+ messages in thread
From: Serge E. Hallyn @ 2016-04-13 23:52 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Serge E. Hallyn, Tejun Heo, Linux API, Linux Containers,
	Eric W. Biederman, cgroups mailinglist, lkml

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> On Wed, Apr 13, 2016 at 12:01 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> > Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> >> Hello, Serge.
> >>
> >> On Wed, Apr 13, 2016 at 01:46:39PM -0500, Serge E. Hallyn wrote:
> >> > It's not a leak of any information we're trying to hide.  I realize
> >> > something like 8 years have passed, but I still basically go by the
> >> > ksummit guidance that containers are ok but the kernel's first priority
> >> > is to facilitate containers but not trick containers into thinking
> >> > they're not containerized.  So long as the container is properly set
> >> > up, I don't think there's anything the workload could do with the
> >> > nsroot= info other than *know* that it is in a ns cgroup.
> >> >
> >> > If we did change that guidance, there's a slew of proc info that we
> >> > could better virtualize :)
> >>
> >> I see.  I'm just wondering because the information here seems a bit
> >> gratuituous.  Isn't the only thing necessary telling whether the root
> >> is bind mounted or namescoped?  Wouldn't simple "nsroot" work for that
> >> purpose?
> >
> > I don't think so - we could be in a cgroup namespace but still have
> > access only to bind-mounted cgroups.  So we need to compare the
> > superblock dentry root field to the nsroot= value.
> 
> Umm, I don't think this is such a good idea. The main purpose of
> cgroup namespace was to prevent this exposure of system cgroup
> hierarchy that used to happen because of /proc/self/cgroup. Wouldn't
> showing that information in /proc/self/mountinfo defeat the purpose?

I disagree.  The primary purpose was to simplify init's job and to keep
cgroup mounts in sync with /proc/self/cgroup.  So that userspace doesn't
have to look at /proc/self/cgroup and then try and figure out how that
relates to its actual cgroup mountpoints.  It was not to *hide* the
information.

Field 3 already gives us the path, nsroot just tells us what part of
it we are namespaced under.

> > One practical problem I've found with cgroup namespaces is that there
> > is no way to disambiguate between a cgroupfs mount which was done in
> > a cgroup namespace, and a bind mount of a cgroupfs directory.
> 
> Thats actually by design, no? Namespaced apps should not know/care if
> they are running inside namespace. If they can find it out today, its

No.  If a workload isn't allowed to mount its own cgroups, and can only
see that freezer /lxc/x1 was mounted at /dev/cgroup (poorly done, but we
don't get to pass judgement or choose mountpoints for userspace), and
it sees /lxc/x1 in its freezer entry for /proc/self/cgroup, then it
cannot tell whether it should be using /dev/cgroup/tasks or
/dev/cgroup/lxc/tasks or /dev/cgroup/lxc/x1/tasks.  That's a problem.

> just because of certain side-effects. I fear adding explicit "nsroot"
> or something in /proc/self/mountinfo now becomes an API making it hard
> to virtualize user-apps again.

It doesn't make it hard to virtualize.  The only complication would be
if you wanted to checkpoint/restart and reproduce the exact
/proc/self/mountinfo output.  That's a bogus goal anyway, since the
restart could be in a different cgroup and field 3 would be different.

In contrast, not providing this makes it impossible for software to
deal with both cgroup namespace and any bind-mounted cgroups.  Which
means any new docker (say) which can run in cgroup namespaces will
not be able to run under old (that is, anything currently released
except lxc 2.0) container managers.  We're breaking all container
managers.

Now the other thing we could do would be to tweak field 3 in the
mountinfo output.  That had been my first inclination, but the way
the mountinfo code is currently done makes that ... challenging.

-serge

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found]     ` <20160413175736.GC3676-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  2016-04-13 18:46       ` Serge E. Hallyn
@ 2016-04-14  4:04       ` Serge E. Hallyn
       [not found]         ` <20160414040436.GA3739-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  1 sibling, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2016-04-14  4:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge E. Hallyn, linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

This is so that userspace can distinguish a mount made in a cgroup
namespace from a bind mount from a cgroup subdirectory.

Signed-off-by: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
---
Changelog: 2016-04-13: pass kernfs_node rather than dentry to show_options
---
 fs/kernfs/mount.c      |  2 +-
 include/linux/kernfs.h |  3 ++-
 kernel/cgroup.c        | 28 +++++++++++++++++++++++++++-
 3 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f73541f..58e8a86 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -36,7 +36,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
 	struct kernfs_syscall_ops *scops = root->syscall_ops;
 
 	if (scops && scops->show_options)
-		return scops->show_options(sf, root);
+		return scops->show_options(sf, dentry->d_fsdata, root);
 	return 0;
 }
 
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index c06c442..72b4081 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -145,7 +145,8 @@ struct kernfs_node {
  */
 struct kernfs_syscall_ops {
 	int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
-	int (*show_options)(struct seq_file *sf, struct kernfs_root *root);
+	int (*show_options)(struct seq_file *sf, struct kernfs_node *kn,
+			    struct kernfs_root *root);
 
 	int (*mkdir)(struct kernfs_node *parent, const char *name,
 		     umode_t mode);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 671dc05..4d26d07 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1593,7 +1593,31 @@ static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
 	return 0;
 }
 
-static int cgroup_show_options(struct seq_file *seq,
+static void cgroup_show_nsroot(struct seq_file *seq, struct kernfs_node *knode,
+			       struct kernfs_root *kroot)
+{
+	char *nsroot;
+	int len, ret;
+
+	if (!kroot)
+		return;
+	len = kernfs_path_from_node(knode, kroot->kn, NULL, 0);
+	if (len <= 0)
+		return;
+	nsroot = kzalloc(len + 1, GFP_ATOMIC);
+	if (!nsroot)
+		return;
+	ret = kernfs_path_from_node(knode, kroot->kn, nsroot, len + 1);
+	if (ret <= 0 || ret > len)
+		goto out;
+
+	seq_show_option(seq, "nsroot", nsroot);
+
+out:
+	kfree(nsroot);
+}
+
+static int cgroup_show_options(struct seq_file *seq, struct kernfs_node *kn,
 			       struct kernfs_root *kf_root)
 {
 	struct cgroup_root *root = cgroup_root_from_kf(kf_root);
@@ -1619,6 +1643,8 @@ static int cgroup_show_options(struct seq_file *seq,
 		seq_puts(seq, ",clone_children");
 	if (strlen(root->name))
 		seq_show_option(seq, "name", root->name);
+	cgroup_show_nsroot(seq, kn, kf_root);
+
 	return 0;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

[parent not found: <20160414040436.GA3739-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]

* Re: [PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found]         ` <20160414040436.GA3739-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2016-04-14 14:42           ` Eric W. Biederman
       [not found]             ` <87oa9c6ymf.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2016-04-14 14:42 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Tejun Heo, linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

"Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:

> This is so that userspace can distinguish a mount made in a cgroup
> namespace from a bind mount from a cgroup subdirectory.

To do that do you need to print the path, or is an extra option that
reveals nothing except that it was a cgroup mount sufficient?

Is there any practical difference between a mount in a namespace and a
bind mount?

Given the way the conversation has been going I think it would be good
to see the answers to these questions.  Perhaps I missed it but I
haven't seen the answers to those questions.

Eric


>
> Signed-off-by: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> ---
> Changelog: 2016-04-13: pass kernfs_node rather than dentry to show_options
> ---
>  fs/kernfs/mount.c      |  2 +-
>  include/linux/kernfs.h |  3 ++-
>  kernel/cgroup.c        | 28 +++++++++++++++++++++++++++-
>  3 files changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f73541f..58e8a86 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -36,7 +36,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
>  	struct kernfs_syscall_ops *scops = root->syscall_ops;
>  
>  	if (scops && scops->show_options)
> -		return scops->show_options(sf, root);
> +		return scops->show_options(sf, dentry->d_fsdata, root);
>  	return 0;
>  }
>  
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index c06c442..72b4081 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -145,7 +145,8 @@ struct kernfs_node {
>   */
>  struct kernfs_syscall_ops {
>  	int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
> -	int (*show_options)(struct seq_file *sf, struct kernfs_root *root);
> +	int (*show_options)(struct seq_file *sf, struct kernfs_node *kn,
> +			    struct kernfs_root *root);
>  
>  	int (*mkdir)(struct kernfs_node *parent, const char *name,
>  		     umode_t mode);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 671dc05..4d26d07 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1593,7 +1593,31 @@ static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
>  	return 0;
>  }
>  
> -static int cgroup_show_options(struct seq_file *seq,
> +static void cgroup_show_nsroot(struct seq_file *seq, struct kernfs_node *knode,
> +			       struct kernfs_root *kroot)
> +{
> +	char *nsroot;
> +	int len, ret;
> +
> +	if (!kroot)
> +		return;
> +	len = kernfs_path_from_node(knode, kroot->kn, NULL, 0);
> +	if (len <= 0)
> +		return;
> +	nsroot = kzalloc(len + 1, GFP_ATOMIC);
> +	if (!nsroot)
> +		return;
> +	ret = kernfs_path_from_node(knode, kroot->kn, nsroot, len + 1);
> +	if (ret <= 0 || ret > len)
> +		goto out;
> +
> +	seq_show_option(seq, "nsroot", nsroot);
> +
> +out:
> +	kfree(nsroot);
> +}
> +
> +static int cgroup_show_options(struct seq_file *seq, struct kernfs_node *kn,
>  			       struct kernfs_root *kf_root)
>  {
>  	struct cgroup_root *root = cgroup_root_from_kf(kf_root);
> @@ -1619,6 +1643,8 @@ static int cgroup_show_options(struct seq_file *seq,
>  		seq_puts(seq, ",clone_children");
>  	if (strlen(root->name))
>  		seq_show_option(seq, "name", root->name);
> +	cgroup_show_nsroot(seq, kn, kf_root);
> +
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <87oa9c6ymf.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]

* Re: [PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found]             ` <87oa9c6ymf.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-04-14 15:27               ` Serge E. Hallyn
       [not found]                 ` <20160414152747.GA12700-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  2016-04-15 15:50                 ` Aditya Kali
  0 siblings, 2 replies; 22+ messages in thread
From: Serge E. Hallyn @ 2016-04-14 15:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Tejun Heo, linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> 
> > This is so that userspace can distinguish a mount made in a cgroup
> > namespace from a bind mount from a cgroup subdirectory.
> 
> To do that do you need to print the path, or is an extra option that
> reveals nothing except that it was a cgroup mount sufficient?
> 
> Is there any practical difference between a mount in a namespace and a
> bind mount?
> 
> Given the way the conversation has been going I think it would be good
> to see the answers to these questions.  Perhaps I missed it but I
> haven't seen the answers to those questions.

Yup, I tried to answer those in my last email, let me try again.

Let's say I start a container using cgroup namespaces, /lxc/x1.  It mounts
freezer at /sys/fs/cgroup so it has field three of mountinfo as /lxc/x1,
and /sys/fs/cgroup/ is the path to the container's cgroup (/lxc/x1).  In
that container, I start another container x1, not using cgroup namespaces.
It also wants a cgroup mount, and a common way to handle that (to prevent
container rewriting its limits) is to mount a tmpfs at /sys/fs/cgroup,
create /sysfs/cgroup/lxc/x1, and bind mount /sys/fs/cgroup/lxc/x1 from
the parent container onto /sys/fs/cgroup/lxc/x1 in the child container.
Now for that bind mount, the mountinfo field 3 will show /lxc/x1/lxc/x1,
with mount target /sys/fs/cgroup/lxc/x1, while /proc/self/cgroup for a task
in that container will show '/lxc/x1'.  Unless it has been moved into
/lxc/x1/lxc/x1 in the container (/lxc/x1/lxc/x1/lxc/x1 on the host)...
Every time I've thought "maybe we can just..." I've found a case where it
wouldn't work.

At first in lxc we simply said if /proc/self/ns/cgroup exists assume that
the cgroupfs mounts are not bind mounts.  However, old userspace (and
container drivers) on new kernels is certainly possible, especially an
older distro in a container on a newer distro on the host.  That completely
breaks with this approach.

I also personally think there *is* value in letting a task know its
place on the system, so hiding the full cgroup path is imo not only not
a valid goal, it's counter-productive.  Part of making for better
virtualization is to give userspace all the info it needs about its
current limits.  Consider that with the unified hierarchy, you cannot
have tasks in a cgroup that also has child cgroups - except for the
root.  Cgroup namespaces do not make an exception for this, so knowing
that you are not in the absolute cgroup root actually can prevent you
from trying something that cannot work.  Or, I suppose, at least
understanding why you're unable to do what you're trying to do (namely
your container manager messed up).  I point this out because finding
a way to only show the namespaced root in field 3 of mountinfo would
fix the base problem, but at the cost of hiding useful information
from a container.

> Eric
> 
> 
> >
> > Signed-off-by: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> > ---
> > Changelog: 2016-04-13: pass kernfs_node rather than dentry to show_options
> > ---
> >  fs/kernfs/mount.c      |  2 +-
> >  include/linux/kernfs.h |  3 ++-
> >  kernel/cgroup.c        | 28 +++++++++++++++++++++++++++-
> >  3 files changed, 30 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > index f73541f..58e8a86 100644
> > --- a/fs/kernfs/mount.c
> > +++ b/fs/kernfs/mount.c
> > @@ -36,7 +36,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
> >  	struct kernfs_syscall_ops *scops = root->syscall_ops;
> >  
> >  	if (scops && scops->show_options)
> > -		return scops->show_options(sf, root);
> > +		return scops->show_options(sf, dentry->d_fsdata, root);
> >  	return 0;
> >  }
> >  
> > diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> > index c06c442..72b4081 100644
> > --- a/include/linux/kernfs.h
> > +++ b/include/linux/kernfs.h
> > @@ -145,7 +145,8 @@ struct kernfs_node {
> >   */
> >  struct kernfs_syscall_ops {
> >  	int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
> > -	int (*show_options)(struct seq_file *sf, struct kernfs_root *root);
> > +	int (*show_options)(struct seq_file *sf, struct kernfs_node *kn,
> > +			    struct kernfs_root *root);
> >  
> >  	int (*mkdir)(struct kernfs_node *parent, const char *name,
> >  		     umode_t mode);
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index 671dc05..4d26d07 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -1593,7 +1593,31 @@ static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
> >  	return 0;
> >  }
> >  
> > -static int cgroup_show_options(struct seq_file *seq,
> > +static void cgroup_show_nsroot(struct seq_file *seq, struct kernfs_node *knode,
> > +			       struct kernfs_root *kroot)
> > +{
> > +	char *nsroot;
> > +	int len, ret;
> > +
> > +	if (!kroot)
> > +		return;
> > +	len = kernfs_path_from_node(knode, kroot->kn, NULL, 0);
> > +	if (len <= 0)
> > +		return;
> > +	nsroot = kzalloc(len + 1, GFP_ATOMIC);
> > +	if (!nsroot)
> > +		return;
> > +	ret = kernfs_path_from_node(knode, kroot->kn, nsroot, len + 1);
> > +	if (ret <= 0 || ret > len)
> > +		goto out;
> > +
> > +	seq_show_option(seq, "nsroot", nsroot);
> > +
> > +out:
> > +	kfree(nsroot);
> > +}
> > +
> > +static int cgroup_show_options(struct seq_file *seq, struct kernfs_node *kn,
> >  			       struct kernfs_root *kf_root)
> >  {
> >  	struct cgroup_root *root = cgroup_root_from_kf(kf_root);
> > @@ -1619,6 +1643,8 @@ static int cgroup_show_options(struct seq_file *seq,
> >  		seq_puts(seq, ",clone_children");
> >  	if (strlen(root->name))
> >  		seq_show_option(seq, "name", root->name);
> > +	cgroup_show_nsroot(seq, kn, kf_root);
> > +
> >  	return 0;
> >  }

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <20160414152747.GA12700-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]

* Re: [PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found]                 ` <20160414152747.GA12700-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2016-04-14 16:12                   ` Eric W. Biederman
       [not found]                     ` <877fg06uf9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2016-04-14 16:12 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Tejun Heo, linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

"Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
>> 
>> > This is so that userspace can distinguish a mount made in a cgroup
>> > namespace from a bind mount from a cgroup subdirectory.
>> 
>> To do that do you need to print the path, or is an extra option that
>> reveals nothing except that it was a cgroup mount sufficient?
>> 
>> Is there any practical difference between a mount in a namespace and a
>> bind mount?
>> 
>> Given the way the conversation has been going I think it would be good
>> to see the answers to these questions.  Perhaps I missed it but I
>> haven't seen the answers to those questions.
>
> Yup, I tried to answer those in my last email, let me try again.
>
> Let's say I start a container using cgroup namespaces, /lxc/x1.  It mounts
> freezer at /sys/fs/cgroup so it has field three of mountinfo as /lxc/x1,
> and /sys/fs/cgroup/ is the path to the container's cgroup (/lxc/x1).  In
> that container, I start another container x1, not using cgroup namespaces.
> It also wants a cgroup mount, and a common way to handle that (to prevent
> container rewriting its limits) is to mount a tmpfs at /sys/fs/cgroup,
> create /sysfs/cgroup/lxc/x1, and bind mount /sys/fs/cgroup/lxc/x1 from
> the parent container onto /sys/fs/cgroup/lxc/x1 in the child container.
> Now for that bind mount, the mountinfo field 3 will show /lxc/x1/lxc/x1,
> with mount target /sys/fs/cgroup/lxc/x1, while /proc/self/cgroup for a task
> in that container will show '/lxc/x1'.  Unless it has been moved into
> /lxc/x1/lxc/x1 in the container (/lxc/x1/lxc/x1/lxc/x1 on the host)...
> Every time I've thought "maybe we can just..." I've found a case where it
> wouldn't work.
>
> At first in lxc we simply said if /proc/self/ns/cgroup exists assume that
> the cgroupfs mounts are not bind mounts.  However, old userspace (and
> container drivers) on new kernels is certainly possible, especially an
> older distro in a container on a newer distro on the host.  That completely
> breaks with this approach.
>
> I also personally think there *is* value in letting a task know its
> place on the system, so hiding the full cgroup path is imo not only not
> a valid goal, it's counter-productive.  Part of making for better
> virtualization is to give userspace all the info it needs about its
> current limits.  Consider that with the unified hierarchy, you cannot
> have tasks in a cgroup that also has child cgroups - except for the
> root.  Cgroup namespaces do not make an exception for this, so knowing
> that you are not in the absolute cgroup root actually can prevent you
> from trying something that cannot work.  Or, I suppose, at least
> understanding why you're unable to do what you're trying to do (namely
> your container manager messed up).  I point this out because finding
> a way to only show the namespaced root in field 3 of mountinfo would
> fix the base problem, but at the cost of hiding useful information
> from a container.

It is just the superblock show_path method.  And regardless of the rest
of the usefullness of your mount option implementing show_path appears
to be fundamentally the right thing in this context.  As that field
appears to have the same issue as /proc/self/cgroup.

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <877fg06uf9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>]

* Re: [PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found]                     ` <877fg06uf9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-04-14 16:38                       ` Serge E. Hallyn
  2016-04-14 16:43                         ` Eric W. Biederman
  0 siblings, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2016-04-14 16:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Tejun Heo, linux-api-u79uwXL29TY76Z2rM5mHXA,
	adityakali-hpIqsD4AKlfQT0dZR+AlfA, Linux Containers,
	cgroups-u79uwXL29TY76Z2rM5mHXA, lkml

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> >> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> >> 
> >> > This is so that userspace can distinguish a mount made in a cgroup
> >> > namespace from a bind mount from a cgroup subdirectory.
> >> 
> >> To do that do you need to print the path, or is an extra option that
> >> reveals nothing except that it was a cgroup mount sufficient?
> >> 
> >> Is there any practical difference between a mount in a namespace and a
> >> bind mount?
> >> 
> >> Given the way the conversation has been going I think it would be good
> >> to see the answers to these questions.  Perhaps I missed it but I
> >> haven't seen the answers to those questions.
> >
> > Yup, I tried to answer those in my last email, let me try again.
> >
> > Let's say I start a container using cgroup namespaces, /lxc/x1.  It mounts
> > freezer at /sys/fs/cgroup so it has field three of mountinfo as /lxc/x1,
> > and /sys/fs/cgroup/ is the path to the container's cgroup (/lxc/x1).  In
> > that container, I start another container x1, not using cgroup namespaces.
> > It also wants a cgroup mount, and a common way to handle that (to prevent
> > container rewriting its limits) is to mount a tmpfs at /sys/fs/cgroup,
> > create /sysfs/cgroup/lxc/x1, and bind mount /sys/fs/cgroup/lxc/x1 from
> > the parent container onto /sys/fs/cgroup/lxc/x1 in the child container.
> > Now for that bind mount, the mountinfo field 3 will show /lxc/x1/lxc/x1,
> > with mount target /sys/fs/cgroup/lxc/x1, while /proc/self/cgroup for a task
> > in that container will show '/lxc/x1'.  Unless it has been moved into
> > /lxc/x1/lxc/x1 in the container (/lxc/x1/lxc/x1/lxc/x1 on the host)...
> > Every time I've thought "maybe we can just..." I've found a case where it
> > wouldn't work.
> >
> > At first in lxc we simply said if /proc/self/ns/cgroup exists assume that
> > the cgroupfs mounts are not bind mounts.  However, old userspace (and
> > container drivers) on new kernels is certainly possible, especially an
> > older distro in a container on a newer distro on the host.  That completely
> > breaks with this approach.
> >
> > I also personally think there *is* value in letting a task know its
> > place on the system, so hiding the full cgroup path is imo not only not
> > a valid goal, it's counter-productive.  Part of making for better
> > virtualization is to give userspace all the info it needs about its
> > current limits.  Consider that with the unified hierarchy, you cannot
> > have tasks in a cgroup that also has child cgroups - except for the
> > root.  Cgroup namespaces do not make an exception for this, so knowing
> > that you are not in the absolute cgroup root actually can prevent you
> > from trying something that cannot work.  Or, I suppose, at least
> > understanding why you're unable to do what you're trying to do (namely
> > your container manager messed up).  I point this out because finding
> > a way to only show the namespaced root in field 3 of mountinfo would
> > fix the base problem, but at the cost of hiding useful information
> > from a container.
> 
> It is just the superblock show_path method.  And regardless of the rest
> of the usefullness of your mount option implementing show_path appears

Ugh.  Yeah as I've said implementing that would be the other way to go.
I'm somewhat loath to give up the extra information, but I can work
on that patch later this week.

> to be fundamentally the right thing in this context.  As that field
> appears to have the same issue as /proc/self/cgroup.

Well, /proc/self/cgroup could also have been fixed by adding a
':<nsroot>" field to each line, but it's used differently...

thanks,
-serge

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
  2016-04-14 16:38                       ` Serge E. Hallyn
@ 2016-04-14 16:43                         ` Eric W. Biederman
  0 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2016-04-14 16:43 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Tejun Heo, linux-api, adityakali, Linux Containers, cgroups, lkml

"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>> 
>> > Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> "Serge E. Hallyn" <serge@hallyn.com> writes:
>> >> 
>> >> > This is so that userspace can distinguish a mount made in a cgroup
>> >> > namespace from a bind mount from a cgroup subdirectory.
>> >> 
>> >> To do that do you need to print the path, or is an extra option that
>> >> reveals nothing except that it was a cgroup mount sufficient?
>> >> 
>> >> Is there any practical difference between a mount in a namespace and a
>> >> bind mount?
>> >> 
>> >> Given the way the conversation has been going I think it would be good
>> >> to see the answers to these questions.  Perhaps I missed it but I
>> >> haven't seen the answers to those questions.
>> >
>> > Yup, I tried to answer those in my last email, let me try again.
>> >
>> > Let's say I start a container using cgroup namespaces, /lxc/x1.  It mounts
>> > freezer at /sys/fs/cgroup so it has field three of mountinfo as /lxc/x1,
>> > and /sys/fs/cgroup/ is the path to the container's cgroup (/lxc/x1).  In
>> > that container, I start another container x1, not using cgroup namespaces.
>> > It also wants a cgroup mount, and a common way to handle that (to prevent
>> > container rewriting its limits) is to mount a tmpfs at /sys/fs/cgroup,
>> > create /sysfs/cgroup/lxc/x1, and bind mount /sys/fs/cgroup/lxc/x1 from
>> > the parent container onto /sys/fs/cgroup/lxc/x1 in the child container.
>> > Now for that bind mount, the mountinfo field 3 will show /lxc/x1/lxc/x1,
>> > with mount target /sys/fs/cgroup/lxc/x1, while /proc/self/cgroup for a task
>> > in that container will show '/lxc/x1'.  Unless it has been moved into
>> > /lxc/x1/lxc/x1 in the container (/lxc/x1/lxc/x1/lxc/x1 on the host)...
>> > Every time I've thought "maybe we can just..." I've found a case where it
>> > wouldn't work.
>> >
>> > At first in lxc we simply said if /proc/self/ns/cgroup exists assume that
>> > the cgroupfs mounts are not bind mounts.  However, old userspace (and
>> > container drivers) on new kernels is certainly possible, especially an
>> > older distro in a container on a newer distro on the host.  That completely
>> > breaks with this approach.
>> >
>> > I also personally think there *is* value in letting a task know its
>> > place on the system, so hiding the full cgroup path is imo not only not
>> > a valid goal, it's counter-productive.  Part of making for better
>> > virtualization is to give userspace all the info it needs about its
>> > current limits.  Consider that with the unified hierarchy, you cannot
>> > have tasks in a cgroup that also has child cgroups - except for the
>> > root.  Cgroup namespaces do not make an exception for this, so knowing
>> > that you are not in the absolute cgroup root actually can prevent you
>> > from trying something that cannot work.  Or, I suppose, at least
>> > understanding why you're unable to do what you're trying to do (namely
>> > your container manager messed up).  I point this out because finding
>> > a way to only show the namespaced root in field 3 of mountinfo would
>> > fix the base problem, but at the cost of hiding useful information
>> > from a container.
>> 
>> It is just the superblock show_path method.  And regardless of the rest
>> of the usefullness of your mount option implementing show_path appears
>
> Ugh.  Yeah as I've said implementing that would be the other way to go.
> I'm somewhat loath to give up the extra information, but I can work
> on that patch later this week.

It sounded like you couldn't see how to implement that which is slightly
different.

>> to be fundamentally the right thing in this context.  As that field
>> appears to have the same issue as /proc/self/cgroup.
>
> Well, /proc/self/cgroup could also have been fixed by adding a
> ':<nsroot>" field to each line, but it's used differently...

Usage may be a reasonable justification.  Mostly I am trying to pry
apart the chaos.  Not shoot down one approach or another.

My current perspective is that irrespective of what we do with
a informational mount option (and there seems to be value in that)
having mountinfo show the path relative to root, will allow more
software to just work without modification.

And old software just working when it can seem very valuable.

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
  2016-04-14 15:27               ` Serge E. Hallyn
       [not found]                 ` <20160414152747.GA12700-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2016-04-15 15:50                 ` Aditya Kali
       [not found]                   ` <CAGr1F2EZtts38SPDc9cuH1prc6NfUJiwUQmqyRp-RpNYM5UzxA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 22+ messages in thread
From: Aditya Kali @ 2016-04-15 15:50 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric W. Biederman, Tejun Heo, Linux API, Linux Containers,
	cgroups mailinglist, lkml

On Thu, Apr 14, 2016 at 8:27 AM, Serge E. Hallyn <serge@hallyn.com> wrote:
> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>>
>> > This is so that userspace can distinguish a mount made in a cgroup
>> > namespace from a bind mount from a cgroup subdirectory.
>>
>> To do that do you need to print the path, or is an extra option that
>> reveals nothing except that it was a cgroup mount sufficient?
>>
>> Is there any practical difference between a mount in a namespace and a
>> bind mount?
>>
>> Given the way the conversation has been going I think it would be good
>> to see the answers to these questions.  Perhaps I missed it but I
>> haven't seen the answers to those questions.
>
> Yup, I tried to answer those in my last email, let me try again.
>
> Let's say I start a container using cgroup namespaces, /lxc/x1.  It mounts
> freezer at /sys/fs/cgroup so it has field three of mountinfo as /lxc/x1,
> and /sys/fs/cgroup/ is the path to the container's cgroup (/lxc/x1).  In
> that container, I start another container x1, not using cgroup namespaces.
> It also wants a cgroup mount, and a common way to handle that (to prevent
> container rewriting its limits) is to mount a tmpfs at /sys/fs/cgroup,
> create /sysfs/cgroup/lxc/x1, and bind mount /sys/fs/cgroup/lxc/x1 from
> the parent container onto /sys/fs/cgroup/lxc/x1 in the child container.
> Now for that bind mount, the mountinfo field 3 will show /lxc/x1/lxc/x1,
> with mount target /sys/fs/cgroup/lxc/x1, while /proc/self/cgroup for a task
> in that container will show '/lxc/x1'.  Unless it has been moved into
> /lxc/x1/lxc/x1 in the container (/lxc/x1/lxc/x1/lxc/x1 on the host)...
> Every time I've thought "maybe we can just..." I've found a case where it
> wouldn't work.
>
> At first in lxc we simply said if /proc/self/ns/cgroup exists assume that
> the cgroupfs mounts are not bind mounts.  However, old userspace (and
> container drivers) on new kernels is certainly possible, especially an
> older distro in a container on a newer distro on the host.  That completely
> breaks with this approach.
>

My main concern regarding making this a new kernel API is that its too
generic and exposes information about all system cgroups to every
process on the system, not just the container or the process inside it
that needs it. Not all containers need this information and not all
processes running inside the container needs this. I haven't spent too
much thought into it, but it seems you will still need to update the
container userspace to read this extra mount option. So seems like a
simpler approach where the host "cgroup manager" provides this
information to specific container cgroup manager via other user-space
channels (a config file, command-line args, environment vars, proper
container mounts, etc.) may also work, right?

> I also personally think there *is* value in letting a task know its
> place on the system, so hiding the full cgroup path is imo not only not
> a valid goal, it's counter-productive.  Part of making for better
> virtualization is to give userspace all the info it needs about its
> current limits.  Consider that with the unified hierarchy, you cannot
> have tasks in a cgroup that also has child cgroups - except for the
> root.  Cgroup namespaces do not make an exception for this, so knowing
> that you are not in the absolute cgroup root actually can prevent you
> from trying something that cannot work.  Or, I suppose, at least
> understanding why you're unable to do what you're trying to do (namely
> your container manager messed up).  I point this out because finding
> a way to only show the namespaced root in field 3 of mountinfo would
> fix the base problem, but at the cost of hiding useful information
> from a container.
>
>> Eric
>>
>>
>> >
>> > Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
>> > ---
>> > Changelog: 2016-04-13: pass kernfs_node rather than dentry to show_options
>> > ---
>> >  fs/kernfs/mount.c      |  2 +-
>> >  include/linux/kernfs.h |  3 ++-
>> >  kernel/cgroup.c        | 28 +++++++++++++++++++++++++++-
>> >  3 files changed, 30 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>> > index f73541f..58e8a86 100644
>> > --- a/fs/kernfs/mount.c
>> > +++ b/fs/kernfs/mount.c
>> > @@ -36,7 +36,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
>> >     struct kernfs_syscall_ops *scops = root->syscall_ops;
>> >
>> >     if (scops && scops->show_options)
>> > -           return scops->show_options(sf, root);
>> > +           return scops->show_options(sf, dentry->d_fsdata, root);
>> >     return 0;
>> >  }
>> >
>> > diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
>> > index c06c442..72b4081 100644
>> > --- a/include/linux/kernfs.h
>> > +++ b/include/linux/kernfs.h
>> > @@ -145,7 +145,8 @@ struct kernfs_node {
>> >   */
>> >  struct kernfs_syscall_ops {
>> >     int (*remount_fs)(struct kernfs_root *root, int *flags, char *data);
>> > -   int (*show_options)(struct seq_file *sf, struct kernfs_root *root);
>> > +   int (*show_options)(struct seq_file *sf, struct kernfs_node *kn,
>> > +                       struct kernfs_root *root);
>> >
>> >     int (*mkdir)(struct kernfs_node *parent, const char *name,
>> >                  umode_t mode);
>> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> > index 671dc05..4d26d07 100644
>> > --- a/kernel/cgroup.c
>> > +++ b/kernel/cgroup.c
>> > @@ -1593,7 +1593,31 @@ static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
>> >     return 0;
>> >  }
>> >
>> > -static int cgroup_show_options(struct seq_file *seq,
>> > +static void cgroup_show_nsroot(struct seq_file *seq, struct kernfs_node *knode,
>> > +                          struct kernfs_root *kroot)
>> > +{
>> > +   char *nsroot;
>> > +   int len, ret;
>> > +
>> > +   if (!kroot)
>> > +           return;
>> > +   len = kernfs_path_from_node(knode, kroot->kn, NULL, 0);
>> > +   if (len <= 0)
>> > +           return;
>> > +   nsroot = kzalloc(len + 1, GFP_ATOMIC);
>> > +   if (!nsroot)
>> > +           return;
>> > +   ret = kernfs_path_from_node(knode, kroot->kn, nsroot, len + 1);
>> > +   if (ret <= 0 || ret > len)
>> > +           goto out;
>> > +
>> > +   seq_show_option(seq, "nsroot", nsroot);
>> > +
>> > +out:
>> > +   kfree(nsroot);
>> > +}
>> > +
>> > +static int cgroup_show_options(struct seq_file *seq, struct kernfs_node *kn,
>> >                            struct kernfs_root *kf_root)
>> >  {
>> >     struct cgroup_root *root = cgroup_root_from_kf(kf_root);
>> > @@ -1619,6 +1643,8 @@ static int cgroup_show_options(struct seq_file *seq,
>> >             seq_puts(seq, ",clone_children");
>> >     if (strlen(root->name))
>> >             seq_show_option(seq, "name", root->name);
>> > +   cgroup_show_nsroot(seq, kn, kf_root);
>> > +
>> >     return 0;
>> >  }



-- 
Aditya

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <CAGr1F2EZtts38SPDc9cuH1prc6NfUJiwUQmqyRp-RpNYM5UzxA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
       [not found]                   ` <CAGr1F2EZtts38SPDc9cuH1prc6NfUJiwUQmqyRp-RpNYM5UzxA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-04-15 16:02                     ` Serge E. Hallyn
  0 siblings, 0 replies; 22+ messages in thread
From: Serge E. Hallyn @ 2016-04-15 16:02 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Serge E. Hallyn, Eric W. Biederman, Tejun Heo, Linux API,
	Linux Containers, cgroups mailinglist, lkml

Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> On Thu, Apr 14, 2016 at 8:27 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> >> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> >>
> >> > This is so that userspace can distinguish a mount made in a cgroup
> >> > namespace from a bind mount from a cgroup subdirectory.
> >>
> >> To do that do you need to print the path, or is an extra option that
> >> reveals nothing except that it was a cgroup mount sufficient?
> >>
> >> Is there any practical difference between a mount in a namespace and a
> >> bind mount?
> >>
> >> Given the way the conversation has been going I think it would be good
> >> to see the answers to these questions.  Perhaps I missed it but I
> >> haven't seen the answers to those questions.
> >
> > Yup, I tried to answer those in my last email, let me try again.
> >
> > Let's say I start a container using cgroup namespaces, /lxc/x1.  It mounts
> > freezer at /sys/fs/cgroup so it has field three of mountinfo as /lxc/x1,
> > and /sys/fs/cgroup/ is the path to the container's cgroup (/lxc/x1).  In
> > that container, I start another container x1, not using cgroup namespaces.
> > It also wants a cgroup mount, and a common way to handle that (to prevent
> > container rewriting its limits) is to mount a tmpfs at /sys/fs/cgroup,
> > create /sysfs/cgroup/lxc/x1, and bind mount /sys/fs/cgroup/lxc/x1 from
> > the parent container onto /sys/fs/cgroup/lxc/x1 in the child container.
> > Now for that bind mount, the mountinfo field 3 will show /lxc/x1/lxc/x1,
> > with mount target /sys/fs/cgroup/lxc/x1, while /proc/self/cgroup for a task
> > in that container will show '/lxc/x1'.  Unless it has been moved into
> > /lxc/x1/lxc/x1 in the container (/lxc/x1/lxc/x1/lxc/x1 on the host)...
> > Every time I've thought "maybe we can just..." I've found a case where it
> > wouldn't work.
> >
> > At first in lxc we simply said if /proc/self/ns/cgroup exists assume that
> > the cgroupfs mounts are not bind mounts.  However, old userspace (and
> > container drivers) on new kernels is certainly possible, especially an
> > older distro in a container on a newer distro on the host.  That completely
> > breaks with this approach.
> >
> 
> My main concern regarding making this a new kernel API is that its too
> generic and exposes information about all system cgroups to every
> process on the system, not just the container or the process inside it
> that needs it. Not all containers need this information and not all
> processes running inside the container needs this. I haven't spent too
> much thought into it, but it seems you will still need to update the
> container userspace to read this extra mount option. So seems like a
> simpler approach where the host "cgroup manager" provides this
> information to specific container cgroup manager via other user-space
> channels (a config file, command-line args, environment vars, proper
> container mounts, etc.) may also work, right?

No, because existing legacy userspace would need to be taught about
these new channels.

I'm testing a new patch which simply fixes the root dentry field in
mountinfo, which should also serve to fix this problem without adding
the nsroot= option field.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-04-15 16:02 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-21 23:41 [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field Serge E. Hallyn
     [not found] ` <20160321234133.GA22463-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2016-03-29  1:12   ` Serge E. Hallyn
     [not found]     ` <20160329011203.GA8974-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2016-03-30 18:58       ` Tejun Heo
2016-03-29 13:58   ` Tycho Andersen
2016-03-29 20:00     ` Serge E. Hallyn
     [not found]       ` <20160329200018.GA21908-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2016-03-30 17:21         ` [PATCH] cgroup mount: ignore nsroot= Serge E. Hallyn
     [not found]           ` <20160330172100.GA11373-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2016-03-30 18:09             ` Tycho Andersen
2016-04-13 17:57   ` [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field Tejun Heo
     [not found]     ` <20160413175736.GC3676-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2016-04-13 18:46       ` Serge E. Hallyn
     [not found]         ` <20160413184639.GA29483-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2016-04-13 18:50           ` Tejun Heo
     [not found]             ` <20160413185033.GH3676-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2016-04-13 19:01               ` Serge E. Hallyn
     [not found]                 ` <20160413190152.GA29753-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2016-04-13 19:12                   ` Tejun Heo
2016-04-13 23:31                   ` Aditya Kali
     [not found]                     ` <CAGr1F2HXJ1BdMFY+vF40O_khE+4S7OnbQPv-h1Q_AmGGhL7mzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-04-13 23:52                       ` Serge E. Hallyn
2016-04-14  4:04       ` [PATCH] " Serge E. Hallyn
     [not found]         ` <20160414040436.GA3739-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2016-04-14 14:42           ` Eric W. Biederman
     [not found]             ` <87oa9c6ymf.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-04-14 15:27               ` Serge E. Hallyn
     [not found]                 ` <20160414152747.GA12700-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2016-04-14 16:12                   ` Eric W. Biederman
     [not found]                     ` <877fg06uf9.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-04-14 16:38                       ` Serge E. Hallyn
2016-04-14 16:43                         ` Eric W. Biederman
2016-04-15 15:50                 ` Aditya Kali
     [not found]                   ` <CAGr1F2EZtts38SPDc9cuH1prc6NfUJiwUQmqyRp-RpNYM5UzxA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-04-15 16:02                     ` Serge E. Hallyn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).